Ever wonder what the people on the other end of a Hangouts session are really looking at on their screens? With a little help from machine learning, you might be able to take a peek over their shoulders, based on research published at the CRYPTO 2018 conference in Santa Barbara last week. All you’ll need to do is process the audio picked up by their microphones.
Daniel Genkin of the University of Michigan, Mihir Pattani of the University of Pennsylvania, Roei Schuster of Cornell Tech and Tel Aviv University, and Eran Tromer of Tel Aviv University and Columbia University investigated a potential new avenue of remote surveillance that they have dubbed “Synesthesia”: a side-channel attack that can reveal the contents of a remote screen, providing access to potentially sensitive information based solely on “content-dependent acoustic leakage from LCD screens.”
The research, supported by the Check Point Institute for Information Security at Tel Aviv University (of which Schuster and Tromer are members) and funded in part by the Defense Advanced Research Projects Agency, examined what amounts to an acoustic form of Van Eck phreaking. While Van Eck phreaking uses radio signal emissions that leak from display connectors, the Synesthesia research leverages “coil whine,” the audio emissions from transformers and other electronic components powering a device’s LCD display.
Big audio dynamite
This isn’t the first acoustic side-channel attack ever discovered, by any means. Genkin and Tromer—with another team of researchers, including Adi Shamir, one of the co-inventors of the RSA cryptographic algorithm—previously demonstrated a way to use noise generated by a computer’s power supply and other components to recover RSA encryption keys. And nation-state use of acoustic side-channels has been documented, though not against computer screens. In former MI5 Assistant Director Peter Wright’s book Spycatcher, Wright recounted how British intelligence used a phone tap to record audio from an Egyptian embassy’s cipher machine during the Suez Crisis. And acoustic bugging has also been shown to reveal keystrokes on a physical keyboard.
Anyone who remembers working with cathode ray tube monitors is familiar with the phenomenon of coil whine. Even though LCD screens consume a lot less power than the old CRT beasts, they still generate the same sort of noise, though in a totally different frequency range.
Because of the way computer screens render a display—sending signals to each pixel of each line with varying intensity levels for each sub-pixel—the power sent to each pixel fluctuates as the monitor goes through its refresh scans. Variations in the intensity of each pixel create fluctuations in the sound created by the screen’s power supply, leaking information about the image being refreshed—information that can be processed with machine learning algorithms to extract details about what’s being displayed.
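To make that mechanism concrete, here is a toy simulation (my own illustration, not the researchers’ model) in which the coil-whine amplitude during each refresh simply tracks the mean intensity of the scan line being drawn. The refresh rate, line count, carrier frequency, and sample rate are all assumed values chosen for the sketch:

```python
import numpy as np

# Toy model: during each refresh, the monitor scans rows top to bottom, and
# power draw -- hence coil-whine amplitude -- tracks the average intensity
# of the row currently being drawn. All constants here are assumptions.

REFRESH_HZ = 60          # frames per second (assumed)
ROWS = 1080              # visible lines per frame (assumed; blanking ignored)
SAMPLE_RATE = 192_000    # sample rate of a hypothetical audio capture

def leakage_signal(frame: np.ndarray, carrier_hz: float = 30_000.0) -> np.ndarray:
    """Amplitude-modulate an ultrasonic carrier by per-row mean intensity."""
    row_intensity = frame.mean(axis=1)            # average over each scan line
    samples_per_frame = SAMPLE_RATE // REFRESH_HZ
    # Stretch the per-row intensities across one frame's worth of samples.
    envelope = np.repeat(row_intensity, samples_per_frame // ROWS + 1)[:samples_per_frame]
    t = np.arange(samples_per_frame) / SAMPLE_RATE
    return envelope * np.sin(2 * np.pi * carrier_hz * t)

# A frame that is bright on top and dark on the bottom leaks a louder
# whine during the first half of each refresh scan.
frame = np.zeros((ROWS, 1920))
frame[: ROWS // 2] = 1.0
sig = leakage_signal(frame)
first_half, second_half = np.split(sig, 2)
print(np.abs(first_half).mean() > np.abs(second_half).mean())  # True
```

The point of the sketch is only that the audio envelope becomes a function of what each scan line contains, which is exactly the information a listener can try to recover.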
That audio could be captured and recorded in a number of ways, as the researchers demonstrated: over a device’s embedded microphone or an attached webcam microphone during a Skype, Google Hangouts, or other streaming audio chat; through recordings from a nearby device, such as a Google Home or Amazon Echo; over a nearby smartphone; or with a parabolic microphone from distances of up to 10 meters. Even a reasonably cheap microphone can pick up the sound from a display, despite the noise sitting at the very edge of human hearing. And it turns out that audio can be exploited with a little bit of machine-learning black magic.
The researchers began by attempting to recognize simple, repetitive patterns. “We created a simple program that displays patterns of alternating horizontal black and white stripes of equal thickness (in pixels), which we shall refer to as Zebras,” the researchers recounted in their paper. These “zebras” each had a different period, measured by the distance in pixels between black stripes. As the program ran, the team recorded the sound emitted by a Soyo DYLM2086 monitor. With each different period of stripes, the frequency of the ultrasonic noise shifted in a predictable manner.
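A back-of-the-envelope sketch shows why the tone should shift predictably (the refresh rate and line count here are assumed values, not the paper’s measurements): if the whine’s amplitude tracks per-line intensity, a zebra repeating every `period` lines modulates the whine at the line rate divided by that period.

```python
# Rough zebra arithmetic (assumed numbers, not the paper's measurements):
# the line rate is the refresh rate times the lines per frame, and a zebra
# pattern repeating every `period_px` lines modulates the coil whine at
# line_rate / period_px.

REFRESH_HZ = 60      # assumed refresh rate
TOTAL_LINES = 1080   # assumed lines per frame (blanking ignored)

def zebra_tone_hz(period_px: int) -> float:
    """Predicted modulation frequency for a zebra of the given period."""
    line_rate = REFRESH_HZ * TOTAL_LINES   # lines drawn per second
    return line_rate / period_px

# Halving the stripe period doubles the predicted tone -- the kind of
# predictable frequency shift the researchers observed.
print(zebra_tone_hz(16))  # 4050.0
print(zebra_tone_hz(8))   # 8100.0
```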
The variations in the audio really only provide reliable data about the average intensity of a particular line of pixels, so they can’t directly reveal the contents of a screen. However, by applying supervised machine learning in three different types of attacks, the researchers demonstrated that it was possible to extract a surprising amount of information about what was on the remote screen.
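As a deliberately simplified stand-in for those classifiers (the researchers used neural networks; the tone frequencies, nearest-centroid rule, and synthetic traces below are all invented for illustration), one can treat the magnitude spectrum of a captured trace as a feature vector and label new traces by their nearest class centroid:

```python
import numpy as np

# Tiny illustration of the supervised-learning framing: each "screen
# content" is faked as a tone at a distinctive frequency buried in noise,
# and a nearest-centroid classifier on magnitude spectra tells them apart.

rng = np.random.default_rng(0)
SAMPLE_RATE = 48_000
N = 4096

def synthetic_trace(tone_hz: float) -> np.ndarray:
    """Fake leakage trace: a tone standing in for content-dependent whine."""
    t = np.arange(N) / SAMPLE_RATE
    return np.sin(2 * np.pi * tone_hz * t) + 0.5 * rng.standard_normal(N)

def features(trace: np.ndarray) -> np.ndarray:
    return np.abs(np.fft.rfft(trace))   # magnitude spectrum as feature vector

# Two hypothetical "websites" leaking at different frequencies; the
# training phase averages spectra from 20 labeled traces per class.
classes = {"site_a": 9_000.0, "site_b": 13_000.0}
centroids = {
    label: np.mean([features(synthetic_trace(hz)) for _ in range(20)], axis=0)
    for label, hz in classes.items()
}

def classify(trace: np.ndarray) -> str:
    f = features(trace)
    return min(centroids, key=lambda label: np.linalg.norm(f - centroids[label]))

print(classify(synthetic_trace(9_000.0)))   # site_a
print(classify(synthetic_trace(13_000.0)))  # site_b
```

The real attacks face far messier signals, but the pipeline shape is the same: record labeled traces, extract spectral features, train a model, then label fresh recordings.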
After training, a neural-network-generated classifier was able to reliably identify which of the Alexa top 10 websites was being displayed on a screen based on audio captured over a Google Hangouts call—with 96.5 percent accuracy. In a second experiment, the researchers were able to reliably capture on-screen keyboard strokes on a display in portrait mode (the typical tablet and smartphone configuration) with 96.4 percent accuracy, for transition times of one and three seconds between key “taps.” On a landscape-mode display, accuracy of the classifiers was much lower, with a first-guess success rate of only 40.8 percent. However, the correct typed word was in the top three choices 71.9 percent of the time for landscape mode, meaning that further human analysis could still result in accurate data capture. (The correct typed word was in the top three choices for the portrait mode classifier 99.6 percent of the time.)
In a third experiment, the researchers used supervised machine learning in an attempt to extract text from displayed content based on the audio—a much finer-grained sort of data than detecting which on-screen key had been tapped. In this case, the experiment focused on a test set of 100 English words and used near-ideal display settings for this sort of capture: all the letters were capitalized (in the Fixedsys Excelsior typeface, with characters 175 pixels wide) and rendered black on an otherwise white screen. The results, as the team reported them, were promising:
The per-character validation set accuracy (containing 10% of our 10,000 trace collection) ranges from 88% to 98%, except for the last character where the accuracy was 75%. Out of 100 recordings of test words, for two of them preprocessing returned an error. For 56 of them, the most probable word on the list was the correct one. For 72 of them, the correct word appeared in the list of top-five most probable words.
While these tests were all done with a single monitor type, the researchers demonstrated that a “cross-screen” attack was also possible: by using a remote connection to display known images on the targeted screen and recording the resulting audio, an attacker could calibrate a baseline for that screen.
It’s clear that there are limits to the practicality of acoustic side channels as a means of remote surveillance. But as people move to mobile devices such as smartphones and tablets for more computing tasks—with embedded microphones, limited screen sizes, and a more predictable display environment—the potential for these sorts of attacks could rise. And mitigating the risk would require re-engineering current screen technology. So while it remains a small risk, it’s certainly one that those working with sensitive data will need to keep in mind—especially if they’re spending much time in Google Hangouts with that data on-screen.