Sunday, June 13, 2010

64bit Calling Convention

64 bit calling convention - what it means to debugging.

While 32 bit (x86) has multiple calling conventions such as cdecl, stdcall, fastcall, and thiscall, 64 bit (x64) on Windows has only a single calling convention, which has its own unique characteristics. Some important characteristics are:
  • The 64 bit calling convention passes the first 4 parameters in 4 registers (RCX, RDX, R8, R9) and any additional parameters on the stack (similar to the fastcall calling convention). Even if there are fewer than 4 parameters, stack space for 4 parameters is always reserved (this area is called home space or home area). (Note: Fastcall calling conventions pass one or more parameters in registers to make the function call faster. The x86 fastcall calling convention passes the first 2 parameters in the ECX and EDX registers.)
  • The stack is kept 16-byte aligned to aid performance. This means that if there are 5 parameters, 48 bytes are reserved for parameters (5 params x 8 bytes + 8 bytes for alignment).
  • The stack pointer (rsp) typically does not change within a given function. The stack size for a function is pre-calculated, so the stack pointer does not change once the prolog is done.
Understanding the 64 bit calling convention is important for debugging because, depending on whether one has an optimized or non-optimized build, the parameters shown in the call stack can be useless or even misleading. In a non-optimized build (ex: when compiled with the /Od option in C++), the called function, through its prolog code, copies all 4 parameters held in registers (RCX, RDX, R8, R9) to the stack home area. So parameter inspection through the dv and kP debugger commands displays correct parameter values. An optimized build, however, does not save those register parameters to the stack and, even worse, the stack home area is used for other purposes. This behavior of optimized builds can often mislead developers into reading wrong parameter values. So a developer shouldn't trust call stack parameter values (kP) or display variables (dv) results when debugging against a 64 bit optimized build.

Let's look at a small sample.
int Calc(int a, int b, int c, int d, int e)
{                              // <= breakpoint 1
    int result = 0;            // <= breakpoint 2
    for(int i=0; i<10; i++)
    {
       result += a*i + b - c + d * 2 + e;
       printf("%d : %d\n", i, result);
    }
    result += a - b + c -d + e;
    return result;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int s1,s2,s3,s4,s5;
    scanf("%d %d %d %d %d", &s1, &s2, &s3, &s4, &s5);

    int result = Calc(s1,s2,s3,s4,s5); // <= breakpoint 0
    printf("Result = %d", result);

    return 0;
}
I set 3 breakpoints as marked above.
0:000> bl
 0 e 00000001`3f5e10ed     0001 (0001)  0:**** Simple!wmain+0x3d
 1 e 00000001`3f5e1000     0001 (0001)  0:**** Simple!Calc
 2 e 00000001`3f5e1016     0001 (0001)  0:**** Simple!Calc+0x16
Right before the function call at breakpoint 0, we can inspect the assembly code to see how the parameters are passed. Basically, it loads the first 4 parameters (I entered 1,2,3,4,5 for scanf()) into the ECX, EDX, R8D, and R9D registers. (Since the parameters are int32, the ECX register is used instead of RCX.) The 5th parameter is passed on the stack (rsp+20h).
0:000> u .
Simple!wmain+0x3d [c:\temp\simple\simple.cpp @ 18]:
00000001`3fd910ed 8b442434        mov     eax,dword ptr [rsp+34h]
00000001`3fd910f1 89442420        mov     dword ptr [rsp+20h],eax  //5th param: 5
00000001`3fd910f5 448b4c2440      mov     r9d,dword ptr [rsp+40h]  // 4
00000001`3fd910fa 448b442430      mov     r8d,dword ptr [rsp+30h]  // 3
00000001`3fd910ff 8b542438        mov     edx,dword ptr [rsp+38h]  // 2
00000001`3fd91103 8b4c243c        mov     ecx,dword ptr [rsp+3Ch]  //1st param: 1
00000001`3fd91107 e8f4feffff      call    Simple!Calc (00000001`3fd91000)
Now let's continue to breakpoint 1 at the beginning of the Calc() function. This is the point where we can check the prolog assembly code of the function. For a non-optimized build, you can see here that the parameter registers are copied to the stack home area.
0:000> uf .
Simple!Calc [c:\temp\simple\simple.cpp @ 4]:
    4 00000001`3f5d1000 44894c2420      mov     dword ptr [rsp+20h],r9d
    4 00000001`3f5d1005 4489442418      mov     dword ptr [rsp+18h],r8d
    4 00000001`3f5d100a 89542410        mov     dword ptr [rsp+10h],edx
    4 00000001`3f5d100e 894c2408        mov     dword ptr [rsp+8],ecx
Once the function prolog has executed, that is, when we move to breakpoint 2, the stack holds all 5 correct parameters and thus the kP call stack command and the dv command display correct parameter values. Below we can see the 5 parameters at stack addresses 00000000`0026feb0 ~ 00000000`0026fed0. Stack slot 00000000`0026fed8 holds a garbage value; it is there only for 16-byte alignment.
0:000> p
Breakpoint 2 hit
Simple!Calc+0x16:
00000001`3f5d1016 c744242000000000 mov     dword ptr [rsp+20h],0
0:000> dq /c 1 @rsp
00000000`0026fe70  00000000`00000000
00000000`0026fe78  00000000`5fca10b1
00000000`0026fe80  00000000`00000001
00000000`0026fe88  00000000`00000000
00000000`0026fe90  00000000`00000000
00000000`0026fe98  00000001`3f5d11ac
00000000`0026fea0  00000001`3f5d2150
00000000`0026fea8  00000001`3f5d110c //return address
00000000`0026feb0  00000001`00000001 //param 1
00000000`0026feb8  00000000`00000002
00000000`0026fec0  00000000`00000003
00000000`0026fec8  00000000`00000004
00000000`0026fed0  00000000`00000005 //param 5
00000000`0026fed8  00000000`0026fee4 //for alignment
And here is what I got when running the kP and dv commands.
0:000> kP
Child-SP          RetAddr           Call Site
00000000`0026fe70 00000001`3f5d110c Simple!Calc(
   int a = 0n1,
   int b = 0n2,
   int c = 0n3,
   int d = 0n4,
   int e = 0n5)+0x16 [c:\temp\simple\simple.cpp @ 5]
0:000> dv /i /V
prv param  00000000`0026feb0 @rsp+0x0040                     a = 0n1
prv param  00000000`0026feb8 @rsp+0x0048                     b = 0n2
prv param  00000000`0026fec0 @rsp+0x0050                     c = 0n3
prv param  00000000`0026fec8 @rsp+0x0058                     d = 0n4
prv param  00000000`0026fed0 @rsp+0x0060                     e = 0n5
prv local  00000000`0026fe90 @rsp+0x0020                result = 0n0
Now what if we have an optimized build? I recompiled the source code with Maximum Speed optimization (/O2). For the optimized build, the prolog of the Calc() function starts like this.
0:000> uf Simple!Calc
Simple!Calc [c:\temp\simple\simple.cpp @ 4]:
    4 00000001`3ff51000 48895c2408      mov     qword ptr [rsp+8],rbx
    4 00000001`3ff51005 48896c2410      mov     qword ptr [rsp+10h],rbp
    4 00000001`3ff5100a 4889742418      mov     qword ptr [rsp+18h],rsi
    4 00000001`3ff5100f 57              push    rdi
    4 00000001`3ff51010 4154            push    r12
    4 00000001`3ff51012 4155            push    r13
    4 00000001`3ff51014 4156            push    r14
    4 00000001`3ff51016 4157            push    r15
    4 00000001`3ff51018 4883ec20        sub     rsp,20h
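As you can see here, there is no mov instruction that copies the parameters. Instead, the first three home slots ([rsp+8], [rsp+10h], [rsp+18h]) are used to save the non-volatile registers rbx, rbp, and rsi. By the time I reached breakpoint 2, where the prolog code had all been executed, the first 4 parameter values had not been copied to the stack at all and only the registers held them. The kP and dv output therefore shows whatever the home slots currently contain, not the parameters.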
0:000> p
Breakpoint 2 hit
Simple!Calc+0x1c:
00000001`3ff5101c 448b6c2470      mov     r13d,dword ptr [rsp+70h] ss:00000000`0022f8f0=00000005
0:000> kP L1
Child-SP          RetAddr           Call Site
00000000`0022f880 00000001`3ff510e1 Simple!Calc(
   int a = 0n1,
   int b = 0n0,
   int c = 0n0,
   int d = 0n2291968,
   int e = 0n5)+0x1c [c:\temp\simple\simple.cpp @ 5]
0:000> dv /i
prv param                a = 0n1
prv param                b = 0n0
prv param                c = 0n0
prv param                d = 0n2291968
prv param                e = 0n5
0:000> r rcx
rcx=0000000000000001
0:000> r rdx
rdx=0000000000000002
0:000> r r8
r8=0000000000000003
0:000> r r9
r9=0000000000000004
As you might have already noticed, this behavior of optimized builds can cause a lot of headaches in 64 bit debugging. It means that the call stack information for register-passed parameters in a 64 bit optimized build is essentially useless. It is even more painful when we need to analyze a regular dump file or a Watson dump file, which has less debugging information.

So how can we find the correct parameter values? We know from the previous inspection that only the registers hold those 4 parameter values at function entry. From this point, we have to trace back what parameter values were passed from the previous call frame. When the caller calls a function, it loads the 4 parameters into registers. Since we can see this in the assembly code, we can unassemble the caller and track down the parameter values. But what if the caller doesn't pass a constant value as a parameter? Then the investigation is much more tedious, since we have to dig into the history of the registers or the stack area. In unfortunate cases, we might need to inspect many call stack frames and their assembly code to figure out how the parameters were passed all the way down to the current stack frame.

Thursday, June 10, 2010

How To Dump

How to dump user process [101]

There are many ways to dump a user process. Here I introduce some commonly used methods.

A. Using CDB

CDB is a console-based general-purpose debugger, and it is also a good tool for dumping a process. When dumping a process, we normally want to be "non-invasive", which means we don't want to disturb the process and just take a snapshot of it. This can be done by specifying the -pv option. If the process name is unique, you can use the -pn option with the exe file name. But if there are several processes with the same name, we typically look up the PID of the process of interest and use the -p option. The -c option below is the debugger command that CDB is going to run. The .dump command dumps the process to the specified file (/ma writes a full-memory minidump; /u appends the date, time, and PID to the file name), and q then quits the debugger.
C> cdb -pv -pn myApp.exe -c ".dump /ma /u c:\tmp\myApp.dmp;q"
C> cdb -pv -p 500 -c ".dump /ma c:\tmp\myApp.dmp;q"

B. Using ADPLUS

ADPLUS is the tool that Microsoft CSS often uses to take a dump. There are 2 dump modes in this tool: one for hang dumps and the other for crash dumps.

HANG : to capture a hang dump, run ADPLUS with the -hang option after the hang has occurred. It takes a dump and leaves the process intact (a non-invasive dump). You need to specify -p with the PID and -o with the output folder.

C:\Debuggers> adplus -hang -p 433 -o c:\Test (PID=433)

Logs and memory dumps will be placed in c:\Test\20100127_111336_Hang_Mode

CRASH : the other ADPLUS mode is crash mode, which takes a dump when the process crashes. Since we never know when the crash will occur, the ADPLUS command must of course be run before the crash occurs. If you're using a remote connection (mstsc.exe), you should use /console. Crash mode is very handy since adplus waits until the crash occurs.

C:\Debuggers> adplus -crash -pn App.exe -o c:\test

Logs and memory dumps will be placed in c:\test\20100127_111828_Crash_Mode

Note: adplus was originally written in VBScript, but recent versions ship as an exe. By the way, adplus internally uses CDB to capture the dump.

C. Using Task Manager

Since Windows Vista, Task Manager has a new context menu item called "Create Dump File." To create a dump for a specific process, select the process, right-click it, and then choose the "Create Dump File" menu item. Here is an example from Windows 7 Task Manager.


Create Dump File From Task Manager
After the dump is done, a message box shows the dump file location.