In one of my last blog posts, I reviewed and experimented with the Lehigh virus, a virus able to store a copy of itself by overwriting a part of the host program in a way that does not affect the host's subsequent execution.

This post contains some notes, reflections and experiments on this infection technique, known as "cavity infection". The work is based on an evolved version of the Lehigh virus called X-Lehigh.

**Impact and extent of a cavity infection**

If we study the cavity infection logic of the Lehigh virus, we observe that it is remarkably simple. Almost trivial.

Lehigh's infection method consists of storing a copy of the virus at the end of the host and modifying the first jump instruction to point to this copy. The cavity space the virus depends on is large enough to store a copy while leaving considerable space untouched. This would have allowed writing new versions and updates of the virus that could have continued using the same infection technique.

In the case of the Lehigh virus, it is also relevant to mention that the cavity is a sequence of zeros that at runtime will be used as data storage that does not require initialization.

This last characteristic has, to some extent, simplified the most widespread thinking about these cavities.

These cavities are considered simple sequences of unused bytes in programs, usually the result of static stack allocations, program section alignment, or other aspects related to the fragmentation of the program and the peculiarities of the architecture.

This simplicity in the cavity infection logic also brings with it a series of characteristics that, compared to other more elaborate and complex types of infection, allow the virus to be simpler, more compact and harder to detect.

The first of these characteristics is related to the simplicity of the infection logic. Being able to embed the virus in the body of the host avoids having to modify the host's headers so that the operating system still loads it correctly once infected.

This simplicity translates into greater compatibility of the virus across different versions of the operating system, which may differ in their loading headers or in the code of the operating system's executable file loader itself.

The second characteristic is related to the stealth capabilities of the virus itself. In a cavity infection, the size of the virus is not modified. The size of the host file is the same before and after being infected.

This feature that comes standard with this type of infection is relevant if the virus intends to go unnoticed and implement stealth features.

While the implementation of some stealth features is trivial and very light, such as the preservation of the original date and time of the host, hiding an increase in the size of the host file can lead to a performance penalty and increase the size of the virus significantly.

Finally, the third characteristic concerns the size and reliability of the virus itself. By not having to implement a complex infection routine, the total size of the virus is significantly reduced, which also reduces the potential for failures and errors in that routine.

The result is a more compact and reliable virus than others with heavier and more complex infection routines. Think, for example, of appending viruses, companion viruses, and so on.

It can therefore be concluded that a cavity infection allows a more compact, compatible, reliable and simple virus design. These designs also facilitate desirable aspects such as the implementation of lighter stealth capabilities.

**Cavity search and identification**

As we mentioned, the Lehigh virus knows in advance the structure of the host it will infect and therefore knows where the cavity it needs to abuse is located.

All the actions in this case are straightforward, and the algorithm Lehigh employs is probably the simplest possible: locate the last byte of the host file, subtract the length of the virus body, and replicate at that address. The virus therefore does not carry out any kind of check on the cavity space.
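The arithmetic above can be sketched in a few lines. This is an illustrative Python model, not the real 8086 code: the host and virus body are hypothetical byte strings, and the patching of the host's first jump instruction is omitted because it is architecture-specific.

```python
def infect(host: bytes, virus_body: bytes) -> bytes:
    """Lehigh-style cavity infection: locate the last byte of the host,
    subtract the virus length, and overwrite from that offset onward.
    No check is performed on the cavity content, just like the original."""
    cavity_offset = len(host) - len(virus_body)
    infected = bytearray(host)
    infected[cavity_offset:] = virus_body
    return bytes(infected)

host = b"\x90" * 1024 + b"\x00" * 762   # host ending in a zero-filled cavity
infected = infect(host, b"V" * 762)
assert len(infected) == len(host)        # the file size does not change
```

Note that the only work done is a subtraction and a copy, which is why the routine fits in so few instructions in the real virus.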

Now let's think about a more generic cavity concept. A more flexible and dynamic concept, but also more complex. Thus we can understand a cavity as any space in the host whose size is equal to or greater than the size of the virus and whose original content can be restored if necessary, all while maintaining the condition that the size of the host is the same before and after infection.

It is therefore observed that a cavity whose content is a sequence of zeros and is used as uninitialized memory is only one of the possible cavity cases.

In fact, in this case, whether the sequence is zeros, another value or even random values is irrelevant, since this space will be used as uninitialized memory. This leads us to differentiate between a cavity whose content is relevant, and must therefore be restored before the virus returns control to the host, and a cavity whose content is irrelevant.

For the case of cavities where the content is relevant, and it is therefore necessary to restore the original content later, a sequence of identical bytes whose length is equal to or greater than the body of the virus constitutes the best case, regardless of whether the byte is zero or another different value. For all other cases, the virus would need to find a pattern that would allow it to compress the original content.
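For that best case, the "compression" of the original content reduces to remembering a single byte value and a length. A minimal sketch of the idea (the function names are mine, not taken from any real virus):

```python
def save_cavity(host: bytes, offset: int, length: int):
    """A run of identical bytes compresses to (value, length):
    two small constants are enough to rebuild the original content."""
    run = host[offset:offset + length]
    assert len(set(run)) == 1, "not a run of identical bytes"
    return run[0], length

def restore_cavity(buf: bytearray, offset: int, value: int, length: int):
    """Restoring the original content before handing control back
    to the host is a trivial memset-style fill."""
    buf[offset:offset + length] = bytes([value]) * length
```

Any other statistical structure would force the virus to embed a real compression scheme, which is exactly the complexity the best case avoids.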

In this context, a virus that does not know the host's structure in advance will need a search and cavity identification algorithm. This search and identification algorithm should be as simple as possible and would probably try to take advantage of the host's natural cavities.

Performing a series of shifted searches over the host content, using windows no larger than the virus size in bytes, could be a good compromise in terms of time, size and implementation complexity. We have to take into account that the implementation of the algorithm itself should not exceed a few tens of bytes in size.

It can therefore be concluded that a virus cavity search and identification compromise would try to find identical byte sequences in the host that would allow the virus body to be stored and restore the original cavity content with very few instructions.

**The entropy of a cavity**

In the previous point we mentioned that the concept of cavity could be understood as a more flexible and dynamic concept, but where would the limits be?

From a theoretical point of view, a more sophisticated and complex virus could interpret part of the host's code section as a cavity made up of a sequence of random bytes. If this is the case, the virus would require at some point to study the lossless compression rate of the sequence in order to guarantee the necessary space requirements.

The latter allows us to approach the problem from the domain of information theory, whose main objective is to provide a rigorous definition of the notion of information that allows quantifying it as well as finding the fundamental limits of operations such as data compression, storage and communication.

Very briefly, and according to Shannon, a general communication system consists of several parts:

- A source, which generates a message to be received by the destination.
- A transmitter, which transforms the message generated at the source into a signal to be transmitted. In cases where the information is encoded, the encoding process is also implemented by the transmitter.
- A channel, which is any medium that serves to carry the signal from the transmitter to the receiver.
- A receiver, which reconstructs the message from the signal.
- A destination, which is whoever receives the message.

These elements are schematically represented in the following drawing:

In our case, we are interested in reducing this general communication system to a simple information source (X) that can emit 256 different states (one per byte value), in such a way that it is possible to assign each state a probability of occurrence following a probability distribution.

We are also interested in quantifying the amount of uncertainty involved in the value of the source X, that is, the outcome of X understood as a random variable or a random process. This amount of uncertainty, or entropy, is defined as the average amount of information produced by the source.

Therefore the entropy of the source X, since it produces successions of states, is defined as the following average amount of information:

H(X) = − Σᵢ p(xᵢ) · log₂ p(xᵢ)

where p(xᵢ) is the probability of each of the 256 possible states.

It is also relevant to mention at this point that our interest lies in studying the statistical structure of the sequence of bytes that we are choosing as "cavity".

That is, we are interested in knowing the average amount of information and we use the term entropy as defined and interpreted by Shannon in his original work.

From a practical perspective, knowing how to quantify entropy allows us to know how uncertain a hypothetical next state in the sequence would be.

Thus, having calculated the entropy of an arbitrary sequence of bytes in a host, if that entropy is zero we have maximum certainty that all its bytes are equal.

In the same way, if the entropy of the sequence is maximal, we know we are facing a uniform probability distribution over the bytes of the sequence.

The entropy calculation by the virus would allow it to identify sequences of bytes whose value is identical throughout the sequence (entropy = 0). These sequences would be very easy cavities for the virus to exploit, since the algorithm to restore the original content is trivial and its implementation would only require a few instructions.

The more balanced the sequence of bytes is, the greater entropy it will have and, therefore, the more complexity the virus would require to use it.

Byte sequences or cavities with a skewed probability distribution will have lower entropies and are therefore easier for the virus to manipulate.

**Quantifying Shannon's entropy over DOS.X-Lehigh.v0.1.762**

Among the experiments carried out and the results obtained on the Lehigh virus is the development of a virus called X-Lehigh that has integration and support with an automated low-level testing platform.

X-Lehigh is the result of modifying the original virus to solve defects and problems in its design and implementation, always keeping the essence, and guaranteeing its correct operation for nine of the latest official versions of the original operating system (DOS).

Here are some results that use Shannon entropy. Wherever it appears, it is calculated for total or partial sequences of the host file (COMMAND.COM) using the following Python algorithm:

```python
def shannon(self, data):
    '''
    Performs a Shannon entropy analysis on a given block of data.
    '''
    entropy = 0

    if data:
        length = len(data)

        seen = dict(((chr(x), 0) for x in range(0, 256)))
        for byte in data:
            seen[byte] += 1

        for x in range(0, 256):
            p_x = float(seen[chr(x)]) / length
            if p_x > 0:
                entropy -= p_x * math.log(p_x, 2)

    return (entropy / 8)
```

This algorithm is part of the firmware analysis tool called Binwalk that was used to extract the raw datasets.
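As a quick sanity check of the extremes discussed below, here is a standalone Python 3 variant of the same calculation (dropping the `self` parameter and the Python 2 `chr()` bookkeeping, and operating directly on `bytes`). It gives 0 for a run of identical bytes and 1 for a uniform distribution:

```python
import math

def shannon(data: bytes) -> float:
    """Shannon entropy of a byte sequence, normalized to [0, 1]."""
    entropy = 0.0
    if data:
        length = len(data)
        counts = {}
        for byte in data:
            counts[byte] = counts.get(byte, 0) + 1
        for count in counts.values():
            p_x = count / length
            entropy -= p_x * math.log(p_x, 2)
    return entropy / 8  # 8 bits per byte, so the maximum is 1.0

print(shannon(b"\x00" * 762))          # 0.0 -> trivial cavity
print(shannon(bytes(range(256)) * 3))  # ~1.0 -> uniform distribution
```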

In this context, we can proceed to review and comment on some of the graphs that seem appropriate in the analysis of the infection by cavity of the virus.

As expected, the host file size in bytes increases progressively, with the most abrupt size changes in the first versions and subtle or non-existent changes at the minor-version level.

There is also more aggressive growth and development in the first three versions of the file.

The same pattern of development for the host file is observed in the following graph, more oriented to obtain a characteristic growth profile for the virus.

For the first version of the host file, the virus represents approximately 3% of its size. The relative weight of the virus for the latest version under test is 1.39%.

The progressive increase of the host file size dilutes the relative weight of the virus compared to its host.

For this particular case, and since the host is a COM file running on DOS, the size of the host file would be limited to a maximum of 64kB.

Regarding the progression of entropy in the pre-infection and post-infection states, the profile runs parallel as expected, although the increase in entropy of the host once infected is greater for the first versions than in the last ones.

It's also interesting to see how the entropy level decreases slightly for the minor versions after a major release.

Finally, you can see in the graphic above how the infection raises the entropy at the end of the file by overwriting the cavity there, which makes it very easy to distinguish an infected host file from a clean one.

**Wrapping up**

The potential for a virus to make use of host file space with the intention of storing a copy of itself is very powerful. As we have seen, this approach enables a more compact, compatible, reliable and simple virus design.

These designs also lead to stealthier implementations that are more likely to go unnoticed without specific actions aimed at finding them.

By developing and expanding the concept of "cavity", even abusing it at some point, it is easy to end up using it as an abstraction that allows the virus to select an arbitrary sequence of bytes in the host file and handle it as a random variable.

The latter would add complexity to the implementation of the virus, owing to the uncertainty of the content of the byte sequence that would be used as a cavity. In these cases, the virus would need to study the suitability of the sequence before proceeding to use it.

Using Shannon entropy we can study the statistical structure of a sequence of bytes in an infected host file and obtain a quantification of the amount of uncertainty in it.

In the range of possible entropy outputs, there are scenarios composed of small sequences of identical bytes that constitute very simple cases of infection. These cases require some additional instructions versus simpler implementations that assume specific memory addresses and do not consider rebuilding the original content of the sequences before handing over control to the host.

It is evident that more sophisticated viruses could also make use of cavity compression techniques through encoding/decoding routines in a transparent way, although in this specific context and scenario (Lehigh, DOS, etc.) it is not necessary given the opportunity to work with sequences that present a more favorable statistical structure.

I finish this blog post with some graphs that allow us to observe and study the evolution of a laboratory virus (X-Lehigh), and its impact on the host file, throughout nine different versions of DOS before and after a successful infection.

As a curiosity for those who look for similarities or different interpretations of the term entropy in other domains, the following paragraph from Energy and Information, p. 180 (1971), on how Shannon came to adopt the term entropy in his line of work, is interesting:

'... In the case of Shannon's measure the naming was not accidental. In 1961 one of us (Tribus) asked Shannon what he had thought about when he had finally confirmed his famous measure. Shannon replied: "My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one knows what entropy really is, so in a debate you will always have the advantage.'"'