Tuesday, September 22, 2009

Hackers Disassembling 1.1.7.3(Virtual Functions)

Virtual Functions


By definition, a call to a virtual function is resolved at run time. When a virtual function is called, the code that executes must correspond to the dynamic type of the object on which the function is called. The address of a virtual function can't be determined at compile time; it has to be looked up just before the call. Therefore, a virtual function is always called indirectly. (The only exception is a virtual function of a static object.)

While nonvirtual C++ functions are called in exactly the same way as normal C functions, virtual functions are called in a substantially different way. The method of calling isn't standardized; it depends on the implementation of a particular compiler. But the references to all virtual functions are usually placed into a special array — a virtual table (VTBL). The virtual table pointer (VPTR) is placed in each instance of the object that uses at least one virtual function. Nonderived objects, or objects with a single inheritance, have no more than one VPTR, while objects with multiple inheritance can have several VPTRs.
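As a rough illustration of the hidden VPTR described above, the following sketch compares object sizes. The class names (Plain, Poly, A, B, C) are invented for this example, and the size relationships are an assumption that holds on common compilers such as MSVC and GCC; the C++ standard does not guarantee them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classes, used only to compare object sizes.
struct Plain { int value; void f() {} };          // no virtual functions
struct Poly  { int value; virtual void f() {} };  // one virtual function

struct A { virtual void fa() {} };
struct B { virtual void fb() {} };
struct C : A, B {};                               // multiple inheritance

// ASSUMPTION: the compiler implements virtual calls with a per-object VPTR.
// True when Poly is at least one pointer larger than Plain.
bool has_hidden_pointer() {
    return sizeof(Poly) >= sizeof(Plain) + sizeof(void*);
}

// True when C is large enough to hold two VPTRs (one per base class).
bool has_two_hidden_pointers() {
    return sizeof(C) >= 2 * sizeof(void*);
}

// Quick self-check that a caller may run from main().
void example_usage() {
    assert(has_hidden_pointer());
    assert(has_two_hidden_pointers());
}
```

Calling example_usage() from main() runs both checks. On a 32-bit compiler, sizeof(C) is typically 8, which matches the "push 8" that precedes operator new for class C in Listing 45.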

Virtual functions usually are called indirectly through the pointer to the virtual table — for example, CALL [EBX+0x10], where EBX is the register containing the offset of the virtual table in memory, and 0x10 is the offset of the pointer to the virtual function inside the virtual table. The only exception is a virtual function of a static object.
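To make the indirect call more tangible, here is a minimal hand-rolled model of a virtual table written in plain C++. It is only a sketch of the idea: the names Object, VTable, BASE_VTBL, DERIVED_VTBL, call_demo, and call_demo_2 are invented for this example, and the functions return strings instead of printing so that the result is easy to check. A real compiler generates the equivalent structures itself.

```cpp
#include <cassert>
#include <string>

struct Object;

// The "virtual table": an array-like record of function pointers.
struct VTable {
    std::string (*demo)(const Object*);    // slot 0, offset 0
    std::string (*demo_2)(const Object*);  // slot 1, offset 4 on 32-bit x86
};

// The object instance: its first field plays the role of the VPTR.
struct Object {
    const VTable* vptr;
};

std::string base_demo(const Object*)      { return "BASE"; }
std::string base_demo_2(const Object*)    { return "BASE DEMO 2"; }
std::string derived_demo(const Object*)   { return "DERIVED"; }
std::string derived_demo_2(const Object*) { return "DERIVED DEMO 2"; }

const VTable BASE_VTBL    = { base_demo,    base_demo_2 };
const VTable DERIVED_VTBL = { derived_demo, derived_demo_2 };

// Roughly: mov eax, [esi] / mov ecx, esi / call dword ptr [eax]
std::string call_demo(const Object* self) {
    return self->vptr->demo(self);
}

// Roughly: mov edx, [esi] / mov ecx, esi / call dword ptr [edx+4]
std::string call_demo_2(const Object* self) {
    return self->vptr->demo_2(self);
}

// Quick self-check that a caller may run from main().
void example_usage() {
    Object b = { &BASE_VTBL };
    Object d = { &DERIVED_VTBL };
    assert(call_demo(&b) == "BASE");
    assert(call_demo_2(&d) == "DERIVED DEMO 2");
}
```

The caller never names the target function. It only knows the slot, so the function that runs depends entirely on which table the object's vptr field points to.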

The analysis of virtual function calls involves a number of complications, the most unpleasant of which is the need to trace the code backward to find the value of the register used for indirect addressing. We're lucky if the register is initialized with an immediate value, as in MOV EBX, offset VTBL, near the place where it's used. More often, however, the pointer to VTBL is passed to a function as an implicit argument, or, even worse, the same register is used for calling several different virtual functions. Then an uncertainty arises: Exactly which value (or values) does it hold at the given location of the program?

Let's analyze the following example (first recalling that the virtual function of the derived class is invoked for objects of the derived class, even if it is called using a pointer or reference to the base class):

Listing 38: Calling a Virtual Function


#include <stdio.h>

class Base {
public:
virtual void demo(void)
{
printf("BASE\n");
};

virtual void demo_2(void)
{
printf("BASE DEMO 2\n");
};
void demo_3(void)
{
printf("Nonvirtual BASE DEMO 3\n");
};

};
class Derived: public Base{
public:
virtual void demo(void)
{
printf("DERIVED\n");
};

virtual void demo_2(void)
{
printf("DERIVED DEMO 2\n");
};

void demo_3(void)
{
printf("Nonvirtual DERIVED DEMO 3\n");
};

};

int main()
{
Base *p = new Base;
p->demo();
p->demo_2();
p->demo_3();

p = new Derived;
p->demo();
p->demo_2();
p->demo_3();
}




In general, the disassembled code of its compiled version should look like this:

Listing 39: The Disassembled Code for Calling a Virtual Function


main proc near ; CODE XREF: start+AF↓p
push esi
push 4
call ??2@YAPAXI@Z ; operator new(uint)
; EAX is a pointer to the allocated memory block.
; Four bytes of memory are allocated for the instance of a new
; object. The object consists of only one pointer to VTBL.

add esp, 4
test eax, eax
jz short loc_0_401019 ; --> Memory allocation error
; Checking whether memory allocation is successful

mov dword ptr [eax], offset BASE_VTBL
; Here the pointer to the virtual table of the BASE class
; is written in the instance of the object just created.
; We can make sure that this is a virtual table of the BASE class
; by analyzing the table's elements. They point to members
; of the BASE class, and therefore, the table itself is
; a virtual table of the BASE class.

mov esi, eax ; ESI = **BASE_VTBL

; The compiler then copies the pointer to the object instance
; (the pointer to the pointer to BASE_VTBL) into ESI. Why?
; ESI keeps the instance pointer for the calls that follow
; (see the section "Objects, Structures, and Arrays"),
; but all these details are of no use at this point.
; Therefore, we'll simply say that ESI contains the pointer
; to the pointer to the virtual table of the BASE class,
; without going into why this double pointer is necessary.

jmp short loc_0_40101B

loc_0_401019: ; CODE XREF: sub_0_401000+D↑j
xor esi, esi
; This overwrites the pointer to the object instance with NULL.
; (This branch receives control only if there is a failure
; in allocating memory for the object.)
; The null pointer will trigger the structured exception handler
; at the first attempt to call through it.

loc_0_40101B: ; CODE XREF: sub_0_401000+17↑j
mov eax, [esi] ; EAX = *BASE_VTBL == *BASE_DEMO

; Here, the pointer to the virtual table of the BASE class is
; placed in EAX, keeping in mind that the pointer to the virtual
; table also is the pointer to the first element of this table.
; The first element of the virtual table, in turn, contains
; the pointer to the first virtual function
; (in the declaration order) of the class.

mov ecx, esi ; ECX = this

; Now, the pointer to the instance of the object is written into
; ECX, passing an implicit argument - the this pointer -
; to the called function.
; (See the "Function Arguments" section.)

call dword ptr [eax] ; CALL BASE_DEMO

; This is what we came for - the call of
; the virtual function! To understand which function is called,
; we should know the value of the EAX register.
; Scrolling the disassembler window upward, we see that EAX points
; to BASE_VTBL, and the first element of BASE_VTBL (see below)
; points to the BASE_DEMO function.
; Therefore,
; this code calls the BASE_DEMO function, and
; the BASE_DEMO function is a virtual function.

mov edx, [esi] ; EDX = *BASE_DEMO
; The pointer to the first element of the virtual table
; of the BASE class is placed into EDX.

mov ecx, esi ; ECX = this
; The pointer to the object instance is placed into ECX.
; This is an implicit argument of the function - the this
; pointer. (See "The this Pointer" section.)

call dword ptr [edx+4] ; CALL [BASE_VTBL+4] (BASE_DEMO_2)
; Here's one more call of a virtual function! To understand
; which function is called, we should know the contents of the
; EDX register. Scrolling the screen window upward, we see that
; it points to BASE_VTBL; thus, EDX+4 points to the second
; element of the virtual table of the BASE class, which, in turn,
; points to the BASE_DEMO_2 function.

push offset aNonVirtualBase ; "Nonvirtual BASE DEMO 3\n"
call printf
; Here's a call of a nonvirtual function. Pay attention - it's
; implemented in the same way as the call of a regular C function.
; Note that this is an inlined function; that is, it's declared
; directly in the class, and instead of calling it,
; code is inserted.

push 4
call ??2@YAPAXI@Z ; operator new(uint)
; The calls of DERIVED class functions continue.
; In general, we only needed the DERIVED class here
; to show how virtual tables are arranged.

add esp, 8 ; Clearing the stack after printf and new
test eax, eax
jz short loc_0_40104A ; Memory allocation error
mov dword ptr [eax], offset DERIVED_VTBL
mov esi, eax ; ESI == **DERIVED_VTBL
jmp short loc_0_40104C

loc_0_40104A: ; CODE XREF: sub_0_401000+3E↑j
xor esi, esi

loc_0_40104C: ; CODE XREF: sub_0_401000+48↑j

mov eax, [esi] ; EAX = *DERIVED_VTBL
mov ecx, esi ; ECX = this
call dword ptr [eax] ; CALL [DERIVED_VTBL] (DERIVED_DEMO)
mov edx, [esi] ; EDX = *DERIVED_VTBL
mov ecx, esi ; ECX=this
call dword ptr [edx+4] ; CALL [DERIVED_VTBL+4] (DERIVED_DEMO_2)

push offset aNonVirtualBase ; "Nonvirtual BASE DEMO 3\n"
call printf
; Note that the demo_3 function called here is that of the base class,
; not of the derived one!

add esp, 4
pop esi
retn
main endp

BASE_DEMO proc near ; DATA XREF: .rdata:004050B0↓o
push offset aBase ; "BASE\n"
call printf
pop ecx
retn
BASE_DEMO endp

BASE_DEMO_2 proc near ; DATA XREF: .rdata:004050B4↓o
push offset aBaseDemo2 ; "BASE DEMO 2\n"
call printf
pop ecx
retn
BASE_DEMO_2 endp

DERIVED_DEMO proc near ; DATA XREF: .rdata:004050A8↓o
push offset aDerived ; "DERIVED\n"
call printf
pop ecx
retn
DERIVED_DEMO endp

DERIVED_DEMO_2 proc near ; DATA XREF: .rdata:004050AC↓o
push offset aDerivedDemo2 ; "DERIVED DEMO 2\n"
call printf
pop ecx
retn
DERIVED_DEMO_2 endp

DERIVED_VTBL dd offset DERIVED_DEMO ; DATA XREF: sub_0_401000+40↑o
dd offset DERIVED_DEMO_2
BASE_VTBL dd offset BASE_DEMO ; DATA XREF: sub_0_401000+F↑o
dd offset BASE_DEMO_2
; Note that the virtual tables "grow" from the bottom up
; in the order classes were declared in the program, and the elements
; of the virtual tables "grow" from top down in the order virtual
; functions were declared in the class. This is not always the case.
; The order of allocating tables and their elements isn't standardized
; and depends entirely on the developer of the compiler;
; however, in practice, most compilers behave in this manner.
; Virtual functions are allocated close to each other
; in the order that they were declared.
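The behavior seen in the listing can also be checked at the C++ level. The sketch below uses the same class and function names as Listing 38, with one deliberate change that is an assumption of this example: the functions return strings instead of calling printf, so that the results can be asserted.

```cpp
#include <cassert>
#include <string>

class Base {
public:
    virtual ~Base() {}
    virtual std::string demo() { return "BASE"; }
    std::string demo_3()       { return "Nonvirtual BASE DEMO 3"; }
};

class Derived : public Base {
public:
    std::string demo() override { return "DERIVED"; }
    std::string demo_3()        { return "Nonvirtual DERIVED DEMO 3"; }
};

// Quick self-check that a caller may run from main().
void example_usage() {
    Derived d;
    Base* p = &d;

    // Virtual call: selected by the dynamic type of the object (Derived).
    assert(p->demo() == "DERIVED");

    // Nonvirtual call: selected by the static type of the pointer (Base).
    assert(p->demo_3() == "Nonvirtual BASE DEMO 3");

    // Called on the object itself, the static type is Derived.
    assert(d.demo_3() == "Nonvirtual DERIVED DEMO 3");
}
```

This is why the second half of the disassembled main pushes the "Nonvirtual BASE DEMO 3" string even though p points to a Derived object at that moment.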






Figure 11: Implementing the calls of virtual functions

Identifying a pure virtual function

A function that is declared in the base class with the = 0 specifier, and is normally implemented only in derived classes, is called a pure virtual function. A class containing at least one such function is considered an abstract class. The C++ language prohibits the creation of instances of an abstract class. And how could they be created anyway, if at least one of the functions of the class is undefined?

Note: At first glance, the function is simply left undefined, and that's OK. In the virtual table, the slot of a pure virtual function is filled with a pointer to the library function __purecall. What is this function for? At compile time, it's impossible to catch every attempt to call a pure virtual function. If such a call does occur, control is passed to __purecall, which was substituted beforehand. It then complains that calling pure virtual functions is prohibited and terminates the application.


Thus, the presence of the pointer to __purecall in the virtual table indicates that we're dealing with an abstract class. Let's consider the following example:

Listing 40: Calling a Pure Virtual Function


#include <stdio.h>

class Base{
public:
virtual void demo(void)=0;
};

class Derived:public Base {
public:
virtual void demo (void)
{
printf ("DERIVED\n");
};
};

int main()
{
Base *p = new Derived;
p->demo();
}




In general, the result of compiling it should look like this:

Listing 41: The Disassembled Code for Calling a Pure Virtual Function


main proc near ; CODE XREF: start+AF↓p
push 4
call ??2@YAPAXI@Z
add esp, 4
; Memory is allocated for the new instance of the object.

test eax, eax
; This checks whether allocating memory is successful.

jz short loc_0_401017
mov ecx, eax
; ECX = this

call GetDERIVED_VTBL
; The pointer to the virtual table of the DERIVED class
; is placed in the instance of the object.

jmp short loc_0_401019

loc_0_401017: ; CODE XREF: main+C↑j
xor eax, eax
; EAX is set to null.

loc_0_401019: ; CODE XREF: main+15↑j
mov edx, [eax]
; If the allocation failed, EAX is null here, and this read
; through the null pointer raises an exception.

mov ecx, eax
jmp dword ptr [edx]
main endp

GetDERIVED_VTBL proc near ; CODE XREF: main+10↑p
push esi
mov esi, ecx
; The implicit argument this is passed to the function
; through the ECX register.

call SetPointToPure
; The function places the pointer to __purecall
; in the object instance. This function is a stub for the case
; of an unplanned call of a pure virtual function.

mov dword ptr [esi], offset DERIVED_VTBL
; The pointer to the virtual table of the DERIVED class is placed
; in the object instance, overwriting the previous value
; (the pointer to __purecall).

mov eax, esi
pop esi
retn
GetDERIVED_VTBL endp

DERIVED_DEMO proc near ; DATA XREF: .rdata:004050A8↓o
push offset aDerived ; "DERIVED\n"
call printf
pop ecx
retn
DERIVED_DEMO endp

SetPointToPure proc near ; CODE XREF: GetDERIVED_VTBL+3↓p
mov eax, ecx
mov dword ptr [eax], offset PureFunc
; The pointer to the special function __purecall is written
; at the [EAX] address (in the instance of the new object).
; The purpose of this function is to catch attempts of calling
; pure virtual functions in the course of executing the program.
; If such an attempt occurs, __purecall will scold you again,
; saying that you shouldn't call a pure virtual function,
; and will terminate the operation.

retn
SetPointToPure endp

DERIVED_VTBL dd offset DERIVED_DEMO ; DATA XREF: GetDERIVED_VTBL+8↑o
PureFunc dd offset __purecall ; DATA XREF: SetPointToPure+2↑o
; Here is a pointer to the stub-function __purecall.
; Hence, we're dealing with a pure virtual function.
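The two-step initialization in the listing (first the base table with the stub, then the derived table) can be mimicked with a hand-rolled model. Everything in this sketch is hypothetical: the names Object, Slot, pure_stub, construct_base, and construct_derived are invented, and the stub merely returns a message, whereas the real __purecall reports the error and terminates the program.

```cpp
#include <cassert>
#include <string>

struct Object;
typedef std::string (*Slot)(Object*);

// The object instance holds only the pointer to its current table.
struct Object {
    const Slot* vptr;
};

// Stand-in for __purecall (ASSUMPTION: the real one terminates the program).
std::string pure_stub(Object*)    { return "pure virtual function call"; }
std::string derived_demo(Object*) { return "DERIVED"; }

const Slot BASE_VTBL[]    = { pure_stub };     // slot 0 is pure in Base
const Slot DERIVED_VTBL[] = { derived_demo };  // slot 0 is implemented

// Like SetPointToPure: install the table of the abstract base class.
void construct_base(Object* self) {
    self->vptr = BASE_VTBL;
}

// Like GetDERIVED_VTBL: run the base step, then overwrite the pointer.
void construct_derived(Object* self) {
    construct_base(self);
    self->vptr = DERIVED_VTBL;
}

// Indirect call through slot 0, as in "jmp dword ptr [edx]".
std::string call_demo(Object* self) {
    return self->vptr[0](self);
}

// Quick self-check that a caller may run from main().
void example_usage() {
    Object o;
    construct_base(&o);                 // object is only "half built"
    assert(call_demo(&o) == "pure virtual function call");
    construct_derived(&o);              // fully constructed Derived
    assert(call_demo(&o) == "DERIVED");
}
```

The first assertion models the only window in which the stub can be reached: while the object still carries the base class table.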




Sharing a virtual table between several instances of an object

However many instances of an object might exist, all of them use the same virtual table. The virtual table belongs to the class, not to any particular instance of it. Exceptions to this rule are described further on in this section.



Figure 12: Sharing one virtual table among several instances of the object
To confirm this, let's consider the following example:

Listing 42: Sharing One Virtual Table Among Several Instances of the Class


#include <stdio.h>

class Base{
public:
virtual void demo()
{
printf ("Base\n");
}
};

class Derived:public Base{
public:
virtual void demo()
{
printf("Derived\n");
}
};

int main()
{
Base * obj1 = new Derived;
Base * obj2 = new Derived;

obj1->demo();
obj2->demo();
}




Generally, the result of compiling it should look like this:

Listing 43: The Disassembled Code for Sharing One Virtual Table Among Several Instances of the Class


main proc near ; CODE XREF: start+AF↓p
push esi
push edi
push 4
call ??2@YAPAXI@Z ; operator new(uint)
add esp, 4
; Memory is allocated for the first instance of the object.

test eax, eax
jz short loc_0_40101B
mov ecx, eax
; EAX points to the first instance of the object.

call GetDERIVED_VTBL
; EAX contains the pointer to the virtual table of the DERIVED class.

mov edi, eax ; EDI = *DERIVED_VTBL
jmp short loc_0_40101D

loc_0_40101B: ; CODE XREF: main+E↑j
xor edi, edi

loc_0_40101D: ; CODE XREF: main+19↑j
push 4
call ??2@YAPAXI@Z ; operator new(uint)
add esp, 4
; Memory is allocated for the second instance of the object.
test eax, eax
jz short loc_0_401043
mov ecx, eax ; ECX is this

call GetDERIVED_VTBL
; Note that the second object instance
; uses the same virtual table.

DERIVED_VTBL dd offset DERIVED_DEMO ; DATA XREF: GetDERIVED_VTBL+8↑o
BASE_VTBL dd offset BASE_DEMO ; DATA XREF: GetBASE_VTBL+2↑o
; Note that the virtual table is common for all instances of the class.
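The sharing can also be observed without a disassembler by reading the VPTR out of live objects. This sketch relies on a loudly stated assumption: the VPTR is stored in the first pointer-sized word of the object, which is the case for simple classes on MSVC and GCC but is not guaranteed by the C++ standard. The helper read_vptr is invented for this example.

```cpp
#include <cassert>
#include <cstring>

class Base {
public:
    virtual ~Base() {}
    virtual int demo() { return 0; }
};

class Derived : public Base {
public:
    int demo() override { return 1; }
};

// ASSUMPTION: the VPTR occupies the first pointer-sized word of the object.
// Copies that word out of the object and returns it.
const void* read_vptr(const Base& obj) {
    const void* vptr = nullptr;
    std::memcpy(&vptr, static_cast<const void*>(&obj), sizeof vptr);
    return vptr;
}

// Quick self-check that a caller may run from main().
void example_usage() {
    Derived d1, d2;
    Base b;
    // Two instances of one class share one virtual table.
    assert(read_vptr(d1) == read_vptr(d2));
    // Different classes have different virtual tables.
    assert(read_vptr(b) != read_vptr(d1));
}
```

The value returned by read_vptr corresponds to what the listing writes with "mov dword ptr [esi], offset DERIVED_VTBL".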




Copies of virtual tables

Well, it's obvious that a single virtual table is quite enough for a program to work successfully. In practice, however, we often find that the file being investigated is swarming with copies of these virtual tables. What kind of invasion is this, where does it come from, and how can we counteract it?

If a program consists of several files compiled into separate object modules (an approach used in practically all more or less serious projects), the compiler must place its own virtual table in each object module for each class used by that module. After all, how can the compiler know about the existence of other object modules, or about the presence of virtual tables in them? This leads to unnecessary duplication of virtual tables, which consumes memory and complicates the analysis. However, at link time, the linker can detect the copies and delete them, and compilers use various heuristic algorithms to increase the efficiency of the generated code. The following heuristic has gained the greatest popularity: The virtual table is placed in the module that contains the first non-inline, nonvirtual implementation of a function of the class. Each class is usually implemented in one module, and in most cases this heuristic works. Things are worse if a class consists of only virtual or inline functions. In this case, the compiler "breaks down" and starts pushing virtual tables into all modules where this class is used. The last hope for deleting the "garbage" copies is pinned on the linker, but don't think the linker is a panacea. Actually, these problems should worry the developers of compilers and linkers (if they're worried about the amount of memory occupied by the program); for analysis, the superfluous copies are only an annoyance, not a hindrance.

A linked list

In most cases, a virtual table is an ordinary array. Certain compilers, however, implement it as a linked list: each element of the virtual table contains a pointer to the next element, and the elements are not adjacent but scattered over the entire executable file.

In practice, however, linked lists are rarely used, so we won't consider this case here — just know that it sometimes occurs. If you come across linked lists, you can figure it out according to the circumstances.

Calling through a thunk

When looking through a virtual table in the case of multiple inheritance, be ready to encounter a pointer not to a virtual function, but to code that modifies the this pointer so that it points to the instance of the object from which the "replacing function" has been taken. This technique was proposed by C++ language developer Bjarne Stroustrup, who borrowed it from early implementations of Algol-60. In Algol, the code that corrects the this pointer is known as a thunk, and the call itself is known as a call through the thunk. This terminology is applicable in C++ as well.

Although calling through the thunk provides a more compact storage of virtual tables, modifying the pointer results in an excessive overhead when processors with a pipeline architecture are used. (Pentium — the most commonly used processor — has an architecture of just this kind.) Therefore, using thunk calls is justified only in programs that are critical to the size, not to the speed.
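The pointer adjustment that a thunk performs can be glimpsed from C++ itself. The sketch below is an illustration under assumptions: classes A, B, and C mirror a typical multiple-inheritance case, the functions return strings so the results can be checked, and the helper b_offset is invented for this example. The exact offset is implementation-specific; on common compilers the second base sub-object simply lies at a nonzero distance from the start of the object.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

class A {
public:
    virtual ~A() {}
    virtual std::string f() { return "A_F"; }
};

class B {
public:
    virtual ~B() {}
    virtual std::string f() { return "B_F"; }
    virtual std::string g() { return "B_G"; }
};

class C : public A, public B {
public:
    std::string f() override { return "C_F"; }  // overrides A::f and B::f
};

// Byte distance from the start of the C object to its B sub-object.
// Converting C* to B* is where the compiler adjusts the address.
std::ptrdiff_t b_offset(C& c) {
    B* as_b = &c;
    return reinterpret_cast<char*>(as_b) - reinterpret_cast<char*>(&c);
}

// Quick self-check that a caller may run from main().
void example_usage() {
    C c;
    // ASSUMPTION: A is laid out first, so B sits past the start of C.
    assert(b_offset(c) > 0);

    B* pb = &c;
    // The call starts with this == address of the B sub-object, yet
    // C::f runs; something must move this back to the start of C.
    assert(pb->f() == "C_F");
    assert(pb->g() == "B_G");
}
```

On implementations that use thunks, the slot reached by pb->f() holds the address of a small stub that subtracts b_offset from this and then jumps to C::f.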

A complex example that places nonvirtual functions in virtual tables

Until now, we have considered only the simplest examples of using virtual functions. Life, however, is more complex, and you may occasionally be in for a big surprise with these virtual tables. Let's consider a complex case of inheritance that involves a name conflict:

Listing 44: Placing Nonvirtual Functions in Virtual Tables


#include <stdio.h>

class A{
public:
virtual void f() { printf("A_F\n");};
};

class B{
public:
virtual void f() { printf("B_F\n");};
virtual void g() { printf("B_G\n");};
};

class C: public A, public B {
public:
void f() { printf("C_F\n");}
};

int main()
{
A *a = new A;
B *b = new B;
C *c = new C;
a->f();
b->f();
b->g();
c->f();
}




What will the virtual table of class C look like? Well, let's think. Since class C is derived from classes A and B, it inherits the functions of both of them. However, the virtual function f() from class B overloads the virtual function of the same name from class A; therefore, it is not inherited from class A. Furthermore, since the nonvirtual function f() is also present in the derived C class, it overloads the virtual function of the derived class. (If a function is declared virtual in the base class, it will be virtual in the derived class by default.) Thus, a virtual table of class C should contain only one element — the pointer to the virtual function g(), inherited from B. The virtual function f() is called as a regular C function. Is this right? No, it's not!

This is a case in which a function that isn't explicitly declared virtual is called through a pointer, as a virtual function. Moreover, the virtual table of the class will contain three, not two, elements! The third element is a reference to the virtual function f() inherited from B, but this element is immediately replaced by the compiler with a thunk to C::f(). Phew! Tough, huh? Maybe it'll become more understandable after studying the disassembled listing.

Listing 45: The Disassembled Code for Placing Nonvirtual Functions in Virtual Tables


main proc near ; CODE XREF: start+AF↓p
push ebx
push esi
push edi
push 4
call ??2@YAPAXI@Z ; operator new(uint)
add esp, 4
; Memory is allocated for an instance of object A.

test eax, eax
jz short loc_0_40101C
mov ecx, eax ; ECX = this
call Get_A_VTBL ; a[0]=*A_VTBL
; The pointer to the virtual table of the object
; is placed in its instance.

mov ebx, eax ; EBX = *a
jmp short loc_0_40101E

loc_0_40101C: ; CODE XREF: main+F↑j
xor ebx, ebx

loc_0_40101E: ; CODE XREF: main+1A↑j
push 4
call ??2@YAPAXI@Z ; operator new(uint)
add esp, 4
; Memory is allocated for the instance of object B.

test eax, eax
jz short loc_0_401037
mov ecx, eax ; ECX = this
call Get_B_VTBL ; b[0] = *B_VTBL
; The pointer to the virtual table of the object
; is placed in its instance.
mov esi, eax ; ESI = *b
jmp short loc_0_401039

loc_0_401037: ; CODE XREF: main+2A↑j
xor esi, esi

loc_0_401039: ; CODE XREF: main+35↑j
push 8
call ??2@YAPAXI@Z ; operator new(uint)
add esp, 4
; Memory is allocated for the instance of object C.
test eax, eax
jz short loc_0_401052
mov ecx, eax ; ECX = this
call GET_C_VTBLs ; ret: EAX=*c
; The pointer to the virtual table of the object
; is placed in its instance.
; Attention: Look into the GET_C_VTBLs function.

mov edi, eax ; EDI = *c
jmp short loc_0_401054

loc_0_401052: ; CODE XREF: main+45↑j
xor edi, edi

loc_0_401054: ; CODE XREF: main+50↑j
mov eax, [ebx] ; EAX = a[0] = *A_VTBL
mov ecx, ebx ; ECX = *a
call dword ptr [eax] ; CALL [A_VTBL] (A_F)
mov edx, [esi] ; EDX = b[0]
mov ecx, esi ; ECX = *b
call dword ptr [edx] ; CALL [B_VTBL] (B_F)
mov eax, [esi] ; EAX = b[0] = B_VTBL
mov ecx, esi ; ECX = *b
call dword ptr [eax+4] ; CALL [B_VTBL+4] (B_G)
mov edx, [edi] ; EDX = c[0] = C_VTBL
mov ecx, edi ; ECX = *c
call dword ptr [edx] ; CALL [C_VTBL] (C_F)
; Attention: The nonvirtual function is called as a virtual one!
pop edi
pop esi
pop ebx
retn

main endp

GET_C_VTBLs proc near ; CODE XREF: main+49↑p
push esi ; ESI = *b
push edi ; ECX = *c
mov esi, ecx ; ESI = *c
call Get_A_VTBL ; c[0]=*A_VTBL
; The pointer to the virtual table of the A class
; is placed in the instance of object C.

lea edi, [esi+4] ; EDI = *c[4]
mov ecx, edi ; ECX = **_C_F
call Get_B_VTBL ; c[4]=*B_VTBL
; The pointer to the virtual table of class B is added
; in the instance of object C - that is, object C now contains
; two pointers to two virtual tables of the base class.
; Let's see how the compiler will cope with the name conflict.

mov dword ptr [edi], offset C_VTBL_FORM_B ; c[4]=*_C_VTBL
; The pointer to the virtual table of class B is replaced
; with the pointer to the virtual table of class C.
; (See the comments directly in the table.)

mov dword ptr [esi], offset C_VTBL ; c[0]=C_VTBL
; Once more - now the pointer to the virtual table of class A
; is replaced with the pointer to the virtual table of class C.
; What poorly written code!
; It could easily have been cut down at compile time!

mov eax, esi ; EAX = *c
pop edi
pop esi
retn

GET_C_VTBLs endp

Get_A_VTBL proc near ; CODE XREF: main+13↑p GET_C_VTBLs+4↑p
mov eax, ecx
mov dword ptr [eax], offset A_VTBL
; The pointer to the virtual table of class A
; is placed in the instance of the object.

retn

Get_A_VTBL endp

A_F proc near ; DATA XREF: .rdata:004050A8↑o
; This is the virtual function f() of class A.

push offset aA_f ; "A_F\n"
call printf
pop ecx
retn

A_F endp

Get_B_VTBL proc near ; CODE XREF: main+2E↑p GET_C_VTBLs+E↑p
mov eax, ecx
mov dword ptr [eax], offset B_VTBL
; The pointer to the virtual table of class B
; is placed in the instance of the object.

retn
Get_B_VTBL endp

B_F proc near ; DATA XREF: .rdata:004050AC↑o
; This is the virtual function f() of class B.
push offset aB_f ; "B_F\n"
call printf
pop ecx
retn

B_F endp

B_G proc near ; DATA XREF: .rdata:004050B0↑o
; This is the virtual function g() of class B.

push offset aB_g ; "B_G\n"
call printf
pop ecx
retn
B_G endp

C_F proc near ; CODE XREF: _C_F+3↑j
; The nonvirtual function f() of class C looks like and is called
; as a virtual one!

push offset aC_f ; "C_F\n"
call printf
pop ecx
retn
C_F endp

_C_F proc near ; DATA XREF: .rdata:004050B8↑o
sub ecx, 4
jmp C_F
; Look what a strange function this is! This is exactly the
; thunk of which we were speaking a moment ago. First, it's never
; called in this program (it would be called only if f() were
; invoked on the C object through a pointer to its B sub-object).
; Second, it's a thunk to the C_F function.
; Why is ECX decreased? On entry, the this pointer in ECX points
; to the sub-object inherited from class B, which lies four bytes
; into the C object. After the subtraction, it points to the start
; of the whole C object, which is what the f() function reached
; by JMP expects.

_C_F endp

A_VTBL dd offset A_F ; DATA XREF: Get_A_VTBL+2↑o
; This is the virtual table of the A class.

B_VTBL dd offset B_F ; DATA XREF: Get_B_VTBL+2↑o
dd offset B_G
; This is the virtual table of class B, which contains the pointers
; to two virtual functions.

C_VTBL dd offset C_F ; DATA XREF: GET_C_VTBLs+19↑o
; The virtual table of class C contains the pointer
; to the function f() which isn't explicitly declared
; virtual, but is virtual by default.

C_VTBL_FORM_B dd offset _C_F ; DATA XREF: GET_C_VTBLs+13↑o
dd offset B_G
; The virtual table of class C is copied by the compiler from
; class B. It originally consisted of two pointers to the f() and g()
; functions, but the compiler resolved the conflict of names
; at compile time, and replaced the pointer to B::f()
; with the pointer to the adapter for C::f().




Thus, the virtual table of a derived class actually includes the virtual tables of all base classes (at least, of those classes from which it inherits virtual functions). In this case, the virtual table of class C contains the pointer to the C::f() function, which isn't explicitly declared virtual but is virtual by default, together with the virtual table of class B. The problem is how to figure out that the C::f() function isn't explicitly declared virtual, but is virtual by default, and how to find all base classes of class C.

Let's begin with the latter. The virtual table of class C doesn't contain any hint as to its relation to class A, but let's look at the contents of the GET_C_VTBLs function. There is an attempt to embed the pointer to the virtual table of class A in the instance of class C, and, therefore, class C is derived from A. (This is really only an attempt, because the embedded pointer to the virtual table of class A is immediately overwritten by the pointer to the "corrected" virtual table, which contains the addresses of the virtual functions of class C.) Someone might object that this isn't a reliable approach: the compiler might optimize the code by throwing out the call that installs the virtual table of class A, since it's not needed anyway. This is true; it might indeed do so. In practice, however, most compilers don't. If they do, they leave enough redundant information to let us determine the base classes, even in a mode of aggressive optimization. Another question is: Do we really need to determine "parents" from whom not a single function is inherited? (If at least one function is inherited, no complexities arise in the analysis.) In general, it isn't a crucial point for the analysis. Still, the more accurately the original code of the program is reconstructed, the more readable and comprehensible it will be.

Now let's proceed to the function f(), which isn't explicitly declared virtual, but is virtual by default. Let's speculate about what would happen if it actually were explicitly declared virtual. It would override the same function of the base classes, and we would encounter no absurdity in the compiled program (like we did in those thunks). The function isn't declared virtual, although it looks and behaves like a virtual one. Theoretically, a smart compiler could throw out the thunk and the duplicated element of the virtual table of the C class, but such intelligence isn't exhibited in practice. Functions explicitly declared virtual and functions that are virtual by default are absolutely identical; therefore, they can't be distinguished in the disassembled code.
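That a function is "virtual by default" can be confirmed with a short experiment. In this sketch, the class names Root, Mid, and Leaf are invented for illustration, and the functions return strings so the outcome can be asserted. Mid::f and Leaf::f are written without the virtual keyword on purpose.

```cpp
#include <cassert>
#include <string>

class Root {
public:
    virtual ~Root() {}
    virtual std::string f() { return "Root::f"; }
};

class Mid : public Root {
public:
    std::string f() { return "Mid::f"; }   // no 'virtual' keyword here
};

class Leaf : public Mid {
public:
    std::string f() { return "Leaf::f"; }  // nor here
};

// Quick self-check that a caller may run from main().
void example_usage() {
    Leaf leaf;
    Mid*  as_mid  = &leaf;
    Root* as_root = &leaf;

    // If Mid::f were nonvirtual, this call would bind statically to Mid::f.
    // It reaches Leaf::f instead, so f is virtual in Mid without the keyword.
    assert(as_mid->f() == "Leaf::f");
    assert(as_root->f() == "Leaf::f");
}
```

Because the language treats both spellings the same way, the compiler has no reason to emit different code for them, which is consistent with the observation that they can't be told apart in a disassembly.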

Static binding

Is there any difference between an instance of an object created as MyClass zzz and one created as MyClass *zzz = new MyClass? Certainly. In the first case, the compiler can determine the addresses of virtual functions at compile time, whereas in the second case the addresses have to be calculated at run time. One more distinction: Static objects are allocated on the stack or in the data segment, and dynamic ones on the heap. The table of virtual functions is still created by the compiler in both cases. Whenever a function of the class is called (including a nonvirtual one), the this pointer containing the address of the instance of the object is prepared. (As a rule, the pointer is placed in one of the general-purpose registers. See the "Function Arguments" section for more details.)

Thus, if we encounter a function called directly by its offset, but at the same time listed in the virtual table of a class, we can be sure that it's a virtual function of a static instance of an object.

Let's consider the following example:

Listing 46: Calling a Static Virtual Function


#include <stdio.h>

class Base{
public:
virtual void demo (void)
{
printf("BASE DEMO\n");
};

virtual void demo_2(void)
{
printf("BASE DEMO 2\n");
};

void demo_3(void)
{
printf("Nonvirtual BASE DEMO 3\n");
};

};

class Derived: public Base{
public:
virtual void demo(void)
{
printf("DERIVED DEMO\n");
};

virtual void demo_2(void)
{
printf ("DERIVED DEMO 2\n");
};

void demo_3(void)
{
printf("Nonvirtual DERIVED DEMO 3\n");
};

};

int main()
{
Base p;
p.demo();
p.demo_2();
p.demo_3();

Derived d;
d.demo();
d.demo_2();
d.demo_3();
}




Generally, the disassembled listing of the compiled version of this program should look like this:

Listing 47: The Disassembled Code for Calling a Static Virtual Function


main proc near ; CODE XREF: start+AF↓p

var_8 = byte ptr -8 ; derived
var_4 = byte ptr -4 ; base
; The instances of objects are often (but not always) allocated
; on the stack from the bottom up, that is, in the order opposite
; from which you declared them in the program.

push ebp
mov ebp, esp
sub esp, 8

lea ecx, [ebp+var_4] ; base
call GetBASE_VTBL ; p[0]=*BASE_VTBL
; Notice that the instance of the object is located on the stack,
; not on the heap! This, of course, doesn't yet prove the static
; nature of the instance of the object (dynamic objects can be allocated
; on the stack, too), but nevertheless hints at the "statics."

lea ecx, [ebp+var_4] ; base
; The this pointer is prepared
; (in case it will be needed for the function).

call BASE_DEMO
; A direct call of the function! Along with its presence
; in the virtual table, this is the evidence of the static
; character of the declaration of the object instance.

lea ecx, [ebp+var_4] ; base
; A new this pointer is prepared to the base instance.

call BASE_DEMO_2
; A direct call of the function. Is it there in the virtual table?
; Yes, it is! This means that it's a virtual function,
; and the instance of the object is declared static.

lea ecx, [ebp+var_4] ; base
; The this pointer is prepared for the nonvirtual function demo_3.

call BASE_DEMO_3
; This function isn't present in the virtual table
; (see the virtual table), hence, it's not a virtual one.

lea ecx, [ebp+var_8] ; derived
call GetDERIVED_VTBL ; d[0]=*DERIVED_VTBL

lea ecx, [ebp+var_8] ; derived
call DERIVED_DEMO
; same as above...

lea ecx, [ebp+var_8] ; derived
call DERIVED_DEMO_2
; same as above...

lea ecx, [ebp+var_8] ; derived
call BASE_DEMO_3_
; Attention: The this pointer points to the DERIVED object,
; and, despite the BASE_DEMO_3_ label, the function called is
; the demo_3 function of the DERIVED class (see its body below).

mov esp, ebp
pop ebp
retn
main endp

BASE_DEMO proc near ; CODE XREF: main+11↑p
; This is the demo function of the BASE class.

push offset aBase ; "BASE\n"
call printf
pop ecx
retn
BASE_DEMO endp

BASE_DEMO_2 proc near ; CODE XREF: main+19↑p
; This is the demo_2 function of the BASE class.

push offset aBaseDemo2 ; "BASE DEMO 2\n"
call printf
pop ecx
retn
BASE_DEMO_2 endp

BASE_DEMO_3 proc near ; CODE XREF: main+21↑p
; This is the demo_3 function of the BASE class.

push offset aNonVirtualBase ; "Nonvirtual BASE DEMO 3\n"
call printf
pop ecx
retn
BASE_DEMO_3 endp

DERIVED_DEMO proc near ; CODE XREF: main+31↑p
; This is the demo function of the DERIVED class.

push offset aDerived ; "DERIVED\n"
call printf
pop ecx
retn
DERIVED_DEMO endp

DERIVED_DEMO_2 proc near ; CODE XREF: main+39↑p
; This is the demo_2 function of the DERIVED class.

push offset aDerivedDemo2 ; "DERIVED DEMO 2\n"
call printf
pop ecx
retn
DERIVED_DEMO_2 endp

BASE_DEMO_3_ proc near ; CODE XREF: main+41↑p
; This is the demo_3 function of the DERIVED class.
; Attention: The demo_3 function occurs in the program twice.
; The first time, it appeared in the object of the BASE class,
; and the second time, it appeared in the DERIVED object.
; The DERIVED object inherited it from the BASE class,
; and has made a copy of it.
; This is kind of silly, isn't it?
; It'd be better off using the original...
; But you see, this simplifies the analysis
; of the program!

push offset aNonVirtualDeri ; "Nonvirtual DERIVED DEMO 3\n"
call printf
pop ecx
retn
BASE_DEMO_3_ endp

GetBASE_VTBL proc near ; CODE XREF: main+9↑p
; In the instance of the BASE object,
; the offset of its virtual table is written.

mov eax, ecx
mov dword ptr [eax], offset BASE_VTBL
retn
GetBASE_VTBL endp

GetDERIVED_VTBL proc near ; CODE XREF: main+29↑p
; In the instance of the DERIVED object,
; the offset of its virtual table is written.

push esi
mov esi, ecx
call GetBASE_VTBL
; Aha! Our object is derived from BASE.

mov dword ptr [esi], offset DERIVED_VTBL
; The pointer is written to the DERIVED virtual table.

mov eax, esi
pop esi
retn
GetDERIVED_VTBL endp

BASE_VTBL dd offset BASE_DEMO ; DATA XREF: GetBASE_VTBL+2↑o
dd offset BASE_DEMO_2
DERIVED_VTBL dd offset DERIVED_DEMO ; DATA XREF: GetDERIVED_VTBL+8↑o
dd offset DERIVED_DEMO_2
; Note that the virtual table occurs even where it's not needed!
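
For orientation, the kind of source that tends to produce a listing like the one above can be sketched as follows. This is a hypothetical reconstruction, not the original program: the class and member names (Base, Derived, demo, demo_2, demo_3) and the helper functions are assumptions chosen to echo the labels in the listing, and the functions return strings instead of printing so the result is easy to check. Whether a compiler really emits a direct call for a virtual function of a stack object depends on the compiler and its settings.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes; the names only mirror the labels in the listing.
class Base {
public:
    virtual ~Base() {}
    virtual std::string demo()   { return "BASE"; }
    virtual std::string demo_2() { return "BASE DEMO 2"; }
    std::string demo_3()         { return "Nonvirtual BASE DEMO 3"; }
};

class Derived : public Base {
public:
    virtual std::string demo()   { return "DERIVED"; }
    virtual std::string demo_2() { return "DERIVED DEMO 2"; }
};

// The object lives on the stack and its exact type is known at compile
// time, so a compiler is free to call Base::demo directly.
std::string call_on_static_object() {
    Base b;
    return b.demo();
}

// The exact type behind the reference is not known here, so the call is
// normally resolved through the virtual table at run time.
std::string call_through_reference(Base& obj) {
    return obj.demo();
}

// Both call forms must agree on the result; only the generated code differs.
void check_same_result() {
    Base b;
    assert(call_on_static_object() == call_through_reference(b));
}
```

Calling call_through_reference with a Derived object yields "DERIVED", while demo_3 called on the same object still resolves to the Base version because it is not virtual.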




Identifying derived functions

Identifying derived nonvirtual functions is a rather subtle problem. At first you might think that, because they're called like regular C functions, it's impossible to recognize in which class a function was declared. The compiler does destroy this information at compile time, but not all of it. Before each function is called (whether it's a derived one or not), the this pointer must be set up in case the function needs it, and it points to the object on which the function is called. For derived functions, the this pointer holds the address of the derived object, not the base one. That's all! If the same function is called with this pointers to objects of different classes, it's a derived function.
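
The role of the this pointer can be checked at the source level with a small sketch. The classes and helper names below are hypothetical, introduced only for illustration. The inherited function simply reports the this pointer it received, so the same function can be seen receiving pointers to two different objects.

```cpp
#include <cassert>

// Hypothetical classes; the names are not from the original program.
class Base {
public:
    // Returns the this pointer the function was called with.
    const Base* self() const { return this; }
};

class Derived : public Base {
public:
    Derived() : extra(0) {}
    int extra;  // gives Derived some state of its own
};

// True when the inherited function received a pointer into the Derived
// instance (its Base part) rather than to some separate Base object.
bool inherited_call_sees_derived(const Derived& d) {
    return d.self() == static_cast<const Base*>(&d);
}

// True when two calls of the same function received different this
// pointers, which is the pattern an analyst looks for.
bool called_with_different_this(const Base& a, const Derived& b) {
    assert(static_cast<const Base*>(&b) != &a);  // must be distinct objects
    return a.self() != b.self();
}
```

Both helpers return true for a separate Base object and Derived object, which corresponds to the two different values loaded into ECX before calls to one and the same function.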

It's more difficult to figure out from which class the function has been derived. There are no universal solutions to this. Still, if we've singled out an A object that uses the f1(), f2()… functions and a B object that uses the f1(), f3(), f4()… functions, then we can safely assert that the f1() function is derived from class A. However, if the f1() function is only ever called on instances of one of the classes, we won't be able to determine whether it's a derived one or not.
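
The A/B reasoning above can be written down as a small bookkeeping routine. This is only a sketch under stated assumptions: the Call type, shared_functions, and example_calls are hypothetical names, and the object labels are taken to be assigned by the analyst, who has already decided that A and B are instances of different classes.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One observed call: (function label, label of the object that this
// pointed to). Both labels are assigned by the analyst.
typedef std::pair<std::string, std::string> Call;

// Returns the functions that were called with this pointing at more than
// one distinct object; these are the candidates for derived functions.
std::set<std::string> shared_functions(const std::vector<Call>& calls) {
    std::map<std::string, std::set<std::string> > objects_by_function;
    for (const Call& call : calls) {
        assert(!call.first.empty() && !call.second.empty());
        objects_by_function[call.first].insert(call.second);
    }
    std::set<std::string> shared;
    for (const auto& entry : objects_by_function) {
        if (entry.second.size() > 1) shared.insert(entry.first);
    }
    return shared;
}

// The example from the text: A uses f1 and f2; B uses f1, f3, and f4.
std::vector<Call> example_calls() {
    return { Call("f1", "A"), Call("f2", "A"),
             Call("f1", "B"), Call("f3", "B"), Call("f4", "B") };
}
```

For the example calls, only f1 is reported. The routine says nothing about a function that was observed with a single object, which matches the limitation described above.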

Let's consider all this in the following example:

Listing 48: Identifying Derived Functions


#include <stdio.h>

class Base{
public:
void base_demo(void)
{
printf("BASE DEMO\n");
};

void base_demo_2(void)
{
printf("BASE DEMO 2\n");
};
};

class Derived: public Base{
public:
void derived_demo (void)
{
printf("DERIVED DEMO\n");
};

void derived_demo_2(void)
{
printf("DERIVED DEMO 2\n");
};
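};

int main()
{
// Reconstructed from the calls visible in the disassembly (Listing 49).
Base *a = new Base;
a->base_demo();
a->base_demo_2();

Derived *b = new Derived;
b->base_demo();
b->base_demo_2();
b->derived_demo();
b->derived_demo_2();
}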




Generally, the disassembled listing of the compiled version of this program should look like this:

Listing 49: The Disassembled Code for Identifying Derived Functions


main proc near ; CODE XREF: start+AF↓p
push esi
push 1
call ??2@YAPAXI@Z ; operator new(uint)
; A new instance of some object is created.
; We don't yet know of which one. Let's say, it is the a object.

mov esi, eax ; ESI = *a
add esp, 4
mov ecx, esi ; ECX = *a (this)
call BASE_DEMO
; Now we're calling BASE_DEMO, taking into account the fact
; that this points to a.

mov ecx, esi ; ECX = *a (this)
call BASE_DEMO_2
; Now we're calling BASE_DEMO_2, taking into account the fact
; that this points to a.

push 1
call ??2@YAPAXI@Z ; operator new(uint)
; One more instance of some object is created; let's call it b.

mov esi, eax ; ESI = *b
add esp, 4
mov ecx, esi ; ECX = *b (this)
call BASE_DEMO
; Aha! We're calling BASE_DEMO, but now this points to b.
; Hence, BASE_DEMO is related to both a and b.

mov ecx, esi
call BASE_DEMO_2
; Here we're calling BASE_DEMO_2, but now this points to b.
; Hence, BASE_DEMO_2 is related to both a and b.

mov ecx, esi
call DERIVED_DEMO
; Now we're calling DERIVED_DEMO. The this pointer points to b,
; and we can't see any relation between DERIVED_DEMO and a.
; When calling, this has never pointed to a.

mov ecx, esi
call DERIVED_DEMO_2
; the same...

pop esi
retn
main endp




So you see, you can identify nonvirtual derived functions. The only difficulty is telling instances of two different classes apart from two instances of the same class.

We've already discussed identifying derived virtual functions. The object is initialized in two stages: the offset of the virtual table of the base class is written in the object instance, then it's replaced with the offset of the virtual table of the derived class. Even if the compiler optimizes the code, the remaining redundancy will be more than enough to distinguish derived functions from other ones.
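
The two-stage setup has a counterpart that is visible at the source level: a virtual call made inside a constructor reaches the version that belongs to the class whose constructor is currently running. The sketch below relies on that language rule. The classes and the construction_log helper are hypothetical, and the remark about the virtual table pointer describes a typical implementation, not a guarantee.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical classes. Each constructor records which override a
// virtual call reaches at that moment.
class Base {
public:
    explicit Base(std::vector<std::string>& log) {
        // While the Base constructor runs, the object is still a Base,
        // so this call reaches Base::name.
        log.push_back(name());
    }
    virtual ~Base() {}
    virtual std::string name() const { return "Base"; }
};

class Derived : public Base {
public:
    explicit Derived(std::vector<std::string>& log) : Base(log) {
        // By now the object is a Derived (in a typical implementation the
        // virtual table pointer has been overwritten), so the call
        // reaches Derived::name.
        log.push_back(name());
    }
    virtual std::string name() const { return "Derived"; }
};

// Constructs one Derived object and returns what each stage observed.
std::vector<std::string> construction_log() {
    std::vector<std::string> log;
    Derived d(log);
    assert(log.size() == 2);  // one entry per constructor stage
    return log;
}
```

The returned log holds "Base" followed by "Derived", in the same order as the two writes to the object instance seen in GetDERIVED_VTBL.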

Identifying virtual tables

Now, having thoroughly mastered virtual tables and functions, we'll consider a very insidious question: Is every array of pointers to functions a virtual table? Certainly not! Programmers often call functions indirectly through a pointer in practice. An array of pointers to functions… hmm. Well, it's certainly not typical, but it happens, too!

Let's consider the following example. It's somewhat ugly and artificial, but to show a situation where a pointer array is vitally necessary, we'd have to write hundreds of lines of code.

Listing 50: An Imitation of a Virtual Table


#include <stdio.h>

void demo_1(void)
{
printf("Demo 1\n");
}

void demo_2(void)
{
printf("Demo 2\n");
}

void call_demo (void **x)
{
((void (*) (void)) x[0])();
((void (*) (void)) x[1])();
}

int main(void)
{
static void* x[2] =
{ (void*) demo_1,(void*) demo_2};
// Attention: If you initialize the array
// in the course of the program (i.e.,
// x[0] = (void *) demo_1, ...), the compiler will generate
// code that writes the functions' offsets
// at run time, which is absolutely unlike a virtual table!
// In contrast, initializing the array where it's declared
// causes ready-made pointers to be placed in the data segment,
// which resembles a true virtual table.
// (By the way, this also saves CPU clock ticks.)

call_demo(&x[0]);
}




Now, see if you can distinguish a handmade table from a true one.

Listing 51: Distinguishing an Imitation from a True Virtual Table


main proc near ; CODE XREF: start+AF↓p
push offset Like_VTBL
call demo_call
; A pointer to something very similar to a virtual table is passed
; to the function. But having grown wise with experience, we easily
; discover this crude falsification. First, the pointers to VTBL aren't
; passed so simply. (The code used for this isn't that basic.)
; Second, they're passed via the register, not via the stack.
; Third, no existing compiler uses the pointer to a virtual table
; directly, but places it in an object. But here, there's neither
; an object nor a this pointer. Therefore, this isn't a virtual table,
; although to the untrained eye, it looks very similar.
pop ecx
retn

main endp

demo_call proc near ; CODE XREF: sub_0_401030+5↑p

arg_0 = dword ptr 8
; That's it! The argument is a pointer,
; and virtual tables are addressed through the register.

push ebp
mov ebp, esp
push esi
mov esi, [ebp+arg_0]
call dword ptr [esi]
; Here's a two-level function call - through the pointer
; to the array of pointers to the function, which is typical for
; calling virtual functions. But again, the code is too simple -
; calling virtual functions involves a lot of redundancy,
; and in addition, the this pointer is absent.

call dword ptr [esi+4]
; The same thing here. This is too simple
; for calling a virtual function.

pop esi
pop ebp
retn
demo_call endp

Like_VTBL dd offset demo_1 ; DATA XREF: main
dd offset demo_2
; The pointer array externally looks like a virtual table,
; but does not reside where virtual tables usually reside.




Let's recap the main signs of a falsification:

The code is too simple: a minimum number of registers is used, and there is no redundancy. Calling virtual functions is much more intricate.

The pointer to a virtual table is placed in the instance of an object, and is passed via a register, not via the stack. (See "The this Pointer" section.)

There is no this pointer, which is always created before calling a virtual function.

Virtual tables and static variables are located in different places of the data segment, so we can distinguish them at once.

Is it possible to organize a function call by reference so that compiling the program produces code identical to a virtual function call? Theoretically, yes. But in practice, it's hardly possible to do so (especially unintentionally). Because of its high redundancy, the code that calls virtual functions is very specific and can be recognized on sight. It's easy to imitate the general technique of working with virtual tables, but it's impossible to reproduce it exactly without assembly inserts.
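
To make the contrast concrete, here is a minimal sketch that puts a handmade pointer table next to a true virtual call. All names in it (DemoFn, like_vtbl, call_by_table, Demo, call_virtual) are hypothetical, and the functions return strings so the two paths can be compared. The handmade path needs only the table address, while the virtual path always starts from an object, which is where the this pointer and the extra indirection in the compiled code come from.

```cpp
#include <cassert>
#include <string>

// Handmade table: plain functions, no object, and no this pointer.
std::string demo_1() { return "Demo 1"; }
std::string demo_2() { return "Demo 2"; }

typedef std::string (*DemoFn)();
static DemoFn const like_vtbl[2] = { demo_1, demo_2 };

// Calls an entry of the handmade table; only the table address is needed.
std::string call_by_table(const DemoFn* table, int index) {
    assert(table != 0 && index >= 0 && index < 2);
    return table[index]();
}

// True virtual functions: the table pointer is hidden inside each object.
class Demo {
public:
    virtual ~Demo() {}
    virtual std::string first()  { return "Demo 1"; }
    virtual std::string second() { return "Demo 2"; }
};

// Calls a virtual function; the object pointer becomes this, and a
// typical compiler fetches the table pointer from the object first.
std::string call_virtual(Demo* obj, int index) {
    assert(obj != 0 && index >= 0 && index < 2);
    return index == 0 ? obj->first() : obj->second();
}
```

Both paths produce the same strings, so the difference an analyst relies on lies entirely in the generated code, not in the observable behavior.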

Conclusion

In general, working with virtual functions involves a lot of redundancy and many "brakes", and analyzing them is very labor-intensive. We constantly have to keep many pointers in mind and remember where each of them points. Still, code diggers seldom face insoluble problems.
