In the '90s, a lot of active exploration and research got done around self-replicating code with the aim of pushing software infection limits and crafting some proof of concept virus in PC platforms. In that period malware was primitive and simple, but seminal in papers and applied techniques.
Those days, while reversing malware and reading dead listings for some of the most recent sophisticated and aggresive virus and worms, I needed to implement some support code to verify as the infections were happening and how hosts could be uninfected. I used OCaml to implement some antivirus routines and I spent some time observing the impact and consecuences of using functional programming in this arena.
Along this post I am going to write about virus signature matching, bit string abstractions and the blending of both worlds using OCaml. In order to understand the beauty of pattern matching in this domain, this post will comment on antivirus architectures and signature scanners too. A real Win32 virus will be shredded to explore how functional programming could be used to detect/disinfect malware.
Antivirus architecture and security endpoints
I guess you would expect a traditional modular program under the concept of antivirus engine. I mean, a traditional antivirus engine orchestrating the usual modules such as file types/memory scanners, decompression routines, emulation modules, heuristic code, automatic update and so on.
But security companies are marketing several products under the same family/brand now, they are using the same box to include very different security technologies such as cloud support, firewalls, intrusion detection systems, etc. So, in a sense, the local and traditional antivirus becomes part of a security endpoint.
In this modern architecture, signatures are used by the traditional scanning modules (file types and memory scanners mainly) but they may be used by scanners requiring networking packet inspection (firewall, IDS, IPS, etc) too.
On viral signatures
A virus signature should be understood how a reliable way to detect a host infected by concrete malware. It encapsulates the essence of a virus. Ideally, the antivirus tries to detect infected files using a robust signature. Signature detection is complex and challenging but we will keep the focus on the need of gathering a simple signature together with related context information.
A signature starts with a bunch of bytes appearing in concrete positions of a program layout. Those bytes may be hidden. Some virus use encryption, polymorphism or metamorphism to avoid detection although, at the end, malware will require a predictable and well-known structure or mechanism to spread. At the end, the art of crafting signatures is understanding the nature of the virus in order to detect, disable and remove it when possible. The next picture shows the EICAR test string.
Imperative languages, such as Assembly, C or C++, use binary abstractions to handle bytes. The programmer uses pointers, unions, structs and bit operators to get the things done. From an implementation perspective, the scanning code providing context and signatures' data must operate in tandem to prevent false positives. It would be great if we can provide a more encapsulated and declarative way to implement those signatures. So, what about using bit string abstractions and pattern matching to detect/disinfect malware with OCaml?
Bit strings and pattern matching
While a binary is a sequence of bits where the number of bits is evenly divisible by eight, a bit string is a sequence of bits of any length. As you imagine a binary is also a bit string. Using a bit string abstraction we gather some benefits and flexibility such as avoiding explicit bit operators or using arbitrary arithmetic expressions in the size fields. This makes it possible to write high level specifications of operations on bit streams. Have a look in the next code to see one example handling IPv4 packets.
With pattern matching, patterns can depend upon the structure of a value, rather than just the constant value itself. At the same time they can bind variables to parts of a value. Those features are useful to catch the essence and structure of the signature. This structure is used as template along the detection stage while bound variables are used along the disinfection stage.
Understanding a Win32 virus and gathering a smart signature
The virus picked to illustrate how detection and removal routines work with these declarative signatures is Win32.H0rtiga You can read high level technical details here. Bear in mind those technical details won't be enough to understand and gather the signature, and we will need to reverse part of the code to fully understand how it spreads.
From a functional point of view, this virus is a runtime infector. It looks for PE32 Windows executables in order to add a new section with its code inside. Its payload drops a tiny proxy. This way when it receives connections using a concrete binary protocol it is able to redirect them. This payload is used to hide the real origin IP. Targets' logs will record infected IP addresses.
This virus was designed with portability in mind 14 years ago, and it is able to infect 32 bit Microsoft Windows binaries (x86). This virus runs in Microsoft Windows 9x and NT kernels. The relevant part of the disassembled detection routine for this virus follows:
So the virus looks for PE32 binaries with a different value of 0xd00d in the Win32VersionValue field. If it contains 0xd00d in that offset, it skips the infection.
This information is useful to craft a potential vaccine for this virus. It is not possible to build a vaccine for every virus but, in this case, it is just possible storing those two bytes in place.
I guess marking a binary to appear infected could sound weird but it is useful in some scenarios where you need to protect the binary integrity. The virus assumes the binary is infected and it doesn't try to infect it again.
The next code shows part of the infection routine.
This routine confirms the value 0xd00d as the value being checked to detect previous infections. Two new values are relevant here in order to disinfect the virus.
The first value (0x4c9) is the offset for the original entry point. The virus replaces the original entry point to take control.
The second value (0x134a) is the viral lenght. The virus replicates 4938 bytes in length.
As mentioned, the virus adds a new section with each new infection. The next dump shows how infected files contain this new last section. The new last section maps with read/write privileges.
Have in mind the virus doesn't use section values to check infections while some of them, such as the section name or the characteristics, remain constants among generations.
Implementing detection and disinfection routines with OCaml
Implementing a detection routine for this virus requires the following steps:
- checking it's a valid Microsoft Windows PE32 binary
- checking Win32VersionValue contains the value 0xd00d
- checking the last section name starts with '.|Zan' (extra check)
Implementing a disinfection routine requires:
- Restore the old entry point (saved at offset 0x4c9)
- Delete the last section of the binary and fix headers
- Truncate the program (section body) by 0x134a bytes
The proof of concept code implemented in Ocaml contains two functions called 'infected' and 'disinfect'. They are used to detect and remove the virus H0rtiga (Urtica).
You can have a look in antivirus module source here. Source contains a test program to illustrate virus removal in 20 lines.
Under the hood, a declarative signature drives the detection and removal processes. It uses bit string abstractions and pattern matching as supported by OCaml.
The implementation is good enough to support heuristics too. You can think around an entry point with a value out of the code section. The signature would be able to handle it with guards.