A research paper by University of Cambridge researchers Ross Anderson and Nicholas Boucher, titled “Trojan Source: Invisible Vulnerabilities,” describes a class of vulnerabilities that lets attackers hide malicious logic in source code without human reviewers detecting it.
According to the research, the technique makes the logic a compiler processes differ from the logic a reviewer sees, creating both first-party and supply-chain risks. The issue lies in Unicode, a digital text encoding standard that enables computers to exchange information no matter which language is used.
Currently, Unicode defines over 143,000 characters across 154 scripts, along with many non-script character sets such as emoji.
About Trojan Source Attacks
This technique exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed. This creates vulnerabilities that human code reviewers cannot perceive directly.
The paper demonstrates the attack against several programming languages, including C and C++.
“The fact that the Trojan Source vulnerability affects almost all computer languages makes it a rare opportunity for a system-wide and ecologically valid cross-platform and cross-vendor comparison of responses,” the paper states.
For context, compilers translate high-level, human-readable source code into lower-level representations such as assembly language, object code, and machine code, which the machine can ultimately execute.
How Is the Unicode Algorithm Exploited?
The core issue lies in Unicode's Bidirectional (Bidi) algorithm, which supports mixing left-to-right and right-to-left scripts, such as English and Arabic, respectively. It also provides Bidi override characters that allow left-to-right words to be written inside a right-to-left sentence, or vice versa. An override forces the text it encloses to be displayed in the specified direction regardless of its natural direction, so left-to-right text can be made to render right-to-left.
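For reference, the Bidi control characters in question are ordinary Unicode code points. The short Python sketch below lists the embedding, override, and isolate controls defined by the Unicode Bidirectional Algorithm and prints their official names. The `BIDI_CONTROLS` list and the `describe` helper are names chosen here for illustration and are not taken from the paper.

```python
import unicodedata

# Directional formatting characters defined by the Unicode
# Bidirectional Algorithm. Written as escapes so that this file
# itself contains no invisible characters.
BIDI_CONTROLS = [
    "\u202A",  # LRE: left-to-right embedding
    "\u202B",  # RLE: right-to-left embedding
    "\u202C",  # PDF: pop directional formatting
    "\u202D",  # LRO: left-to-right override
    "\u202E",  # RLO: right-to-left override
    "\u2066",  # LRI: left-to-right isolate
    "\u2067",  # RLI: right-to-left isolate
    "\u2068",  # FSI: first strong isolate
    "\u2069",  # PDI: pop directional isolate
]


def describe(ch: str) -> str:
    """Return the code point, official name, and Bidi class of a character."""
    return (
        f"U+{ord(ch):04X} {unicodedata.name(ch)} "
        f"(bidi class {unicodedata.bidirectional(ch)})"
    )


for ch in BIDI_CONTROLS:
    print(describe(ch))
```

None of these characters has a visible glyph, which is why most editors and code-review tools show nothing where they occur.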
A compiler must process source code in its logical (encoded) order, but Bidi override characters injected into strings and comments can yield syntactically valid source code whose displayed character order presents different logic from the one the compiler actually processes.
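To illustrate that such characters do not break syntax, the Python sketch below compiles a one-line program whose comment contains a right-to-left override and then reads the comment back with the standard `tokenize` module. Python is used here only as a convenient example language, and the `comment_tokens` helper and the sample line are hypothetical.

```python
import io
import tokenize


def comment_tokens(source: str) -> list:
    """Compile the source (raising SyntaxError if invalid), then return its comments."""
    compile(source, "<sample>", "exec")
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type == tokenize.COMMENT
    ]


# A comment containing U+202E (right-to-left override), written as an escape.
sample = "x = 1  # note \u202E reversed\n"

for comment in comment_tokens(sample):
    # ascii() makes the otherwise invisible character visible as an escape.
    print(ascii(comment))
```

The program compiles without complaint, and the override only becomes visible once the comment is printed in escaped form.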
The Attack Details
Rather than introducing logical bugs directly, the attack exploits the encoding of source code files to create targeted vulnerabilities. Bidi control characters visually reorder tokens in the source code so that, even when the file is rendered in a perfectly acceptable way, the compiler processes the code differently from what the reviewer sees, changing the program flow. For instance, a comment can be made to appear as code.
In effect, Program A is anagrammed into Program B. If the change in logic is subtle enough to go undetected in subsequent testing, an adversary can introduce targeted vulnerabilities that remain hidden.
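Although such changes are invisible on screen, the control characters are still present in the file, so a simple scan can surface them. The following is a minimal Python sketch of such a check, not a complete defense. The function name `find_bidi_controls` and the sample snippet are assumptions made for illustration.

```python
# Bidi embedding, override, and isolate controls, written as escapes.
BIDI_CONTROLS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(source: str) -> list:
    """Return (line, column, code point) for every Bidi control in the text.

    Line and column numbers are 1-based.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings


# Hypothetical snippet: the comment on line 2 hides an override character.
sample = (
    "is_admin = False\n"
    "check(is_admin)  # \u202E begin admin block\n"
)

for line_no, col, code_point in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {code_point}")
```

A check along these lines could run in a pre-commit hook or continuous integration job, rejecting files that contain Bidi controls unless a project legitimately needs right-to-left text.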
“You can use them in source code that appears innocuous to a human reviewer [that] can actually do something nasty. That’s bad news for projects like Linux and Webkit that accept contributions from random people, subject them to manual review, then incorporate them into critical code. This vulnerability is, as far as I know, the first one to affect almost everything,” wrote Ross Anderson.
Impact on The Supply Chain
These attacks can affect the supply chain because invisible vulnerabilities injected into open-source software are eventually inherited by all of its downstream users. Furthermore, the researchers warned that the impact of Trojan Source attacks could be more severe if an attacker uses homoglyphs, characters that look nearly identical to others, to define look-alike versions of pre-existing functions within an upstream package and then invoke them from a victim program.
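The homoglyph variant can be illustrated with two identifiers that render almost identically but are different strings. The Python sketch below uses a rough heuristic, taking the first word of each character's Unicode name as its script, to flag identifiers that mix scripts. The helper names are hypothetical, and the heuristic is an approximation rather than a full implementation of Unicode's script property.

```python
import unicodedata


def scripts_in(identifier: str) -> set:
    """Approximate the scripts used by the letters of an identifier.

    Heuristic: the first word of a letter's Unicode name is usually its
    script, e.g. "LATIN SMALL LETTER A" or "CYRILLIC CAPITAL LETTER EN".
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(identifier: str) -> bool:
    """Flag identifiers whose letters appear to come from more than one script."""
    return len(scripts_in(identifier)) > 1


genuine = "sayHello"
# U+041D CYRILLIC CAPITAL LETTER EN looks like Latin "H" in most fonts.
spoofed = "say\u041Dello"

for name in (genuine, spoofed):
    print(ascii(name), sorted(scripts_in(name)), looks_mixed_script(name))
```

To a compiler these are two unrelated function names, so a call to the spoofed one can silently reach attacker-controlled code.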
“As powerful supply-chain attacks can be launched easily using these techniques, it is essential for organizations that participate in a software supply chain to implement defenses,” researchers warned.