Malware Analysis 1 - Creating a PE parser, Shannon Entropy and more (Golang)
Introduction
Hello hackers!
Today we’ll build a CLI tool that extracts and analyzes as much information as possible from PE files, using the github.com/Binject/debug/pe package.
Explanation
Parsing PE files by hand in Golang is quite hard, which is why we’ll use the github.com/Binject/debug/pe package: it provides functions, structs and more for working with PE files without breaking our heads.
But first, let’s discuss what a PE file is and what its parts are. The Portable Executable (PE) format is a file format for executables, object code, DLLs and other files used in 32-bit and 64-bit versions of Windows operating systems. It is a data structure that encapsulates the information the Windows OS loader needs to manage the wrapped executable code. This includes dynamic library references for linking, API export and import tables, resource management data and thread-local storage (TLS) data.
On NT operating systems, the PE format is used for EXE, DLL, SYS (device driver), MUI and other file types. The Unified Extensible Firmware Interface (UEFI) specification states that PE is the standard executable format in EFI environments.
Filename extensions: .acm, .ax, .cpl, .dll, .drv, .efi, .exe, .mui, .ocx, .scr, .sys, .tsp
I won’t explain every part of the PE structure in depth, as that would make for a really long post, but you can get a quick overview from this general scheme
However, I encourage you to check out the posts under “References”; they can be really useful
These are the main goals of our PE parser/analyzer:
- DOS, RICH, NT, File and Optional headers info
- Get all data directories info
- Section headers info like .text and .reloc
- Import table attributes
- Base relocations table
- MD5, Sha-1 and Sha-256 file hashes
- Calculate file Shannon entropy
Code
In this case we’ll create a CLI program with multiple arguments.
We start by downloading the pe package
go get github.com/Binject/debug/pe
Then we import the packages
package main
import (
	"os"
	"io"
	"fmt"
	"log"
	"flag"
	"math"
	"time"
	"strings"
	"io/ioutil"
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"encoding/binary"
	// Used to parse PE files
	"github.com/Binject/debug/pe"
)
Before starting with the PE part we define the flags. There are two: the path to the PE file and an optional “verbose” switch
...
func main(){
var file string
var verbose bool
flag.StringVar(&file, "f", "", "path to PE file to parse and analyze")
flag.BoolVar(&verbose, "v", false, "enable verbose")
flag.Parse()
if file == "" {
Banner()
fmt.Println("Usage: .\\main.exe -f malware.exe")
flag.PrintDefaults()
os.Exit(0)
}
}
Let’s start with the pe package
func main(){
	...
	// Parse PE structure
	pe_file, err := pe.Open(file)
	if err != nil {
		log.Fatal(err)
	}
	// Close file
	defer pe_file.Close()
	if verbose {
		fmt.Println("[*] Determining file type...")
	}
	// Assumption: the pe package keeps these sizes unexported,
	// so we compute them from the header structs ourselves
	sizeofOptionalHeader32 := uint16(binary.Size(pe.OptionalHeader32{}))
	sizeofOptionalHeader64 := uint16(binary.Size(pe.OptionalHeader64{}))
	// Check PE type (PE32 or PE32+)
	var opt_header32 *pe.OptionalHeader32
	var opt_header64 *pe.OptionalHeader64
	var opt32 bool
	if pe_file.FileHeader.SizeOfOptionalHeader == sizeofOptionalHeader32 {
		fmt.Print("[+] PE type: PE32\n\n")
		// We'll use this later
		opt_header32, _ = pe_file.OptionalHeader.(*pe.OptionalHeader32)
		opt32 = true
	} else if pe_file.FileHeader.SizeOfOptionalHeader == sizeofOptionalHeader64 {
		fmt.Print("[+] PE type: PE32+\n\n")
		// We'll use this later
		opt_header64, _ = pe_file.OptionalHeader.(*pe.OptionalHeader64)
		opt32 = false
	} else {
		fmt.Println("[-] Error recognizing PE type!")
		// Exit with a non-zero code because this is a failure
		os.Exit(1)
	}
}
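Comparing header sizes is one way to tell the two formats apart. Another option is the Magic field at the start of the optional header, which the PE format specification defines as 0x10B for PE32 and 0x20B for PE32+. The following is a minimal sketch of that check; the peKind helper and the constant names are made up for this example and are not part of the pe package.

```go
package main

import "fmt"

// Optional header magic values as documented in the PE format specification.
const (
	magicROM      = 0x107 // ROM image
	magicPE32     = 0x10B // 32-bit image
	magicPE32Plus = 0x20B // 64-bit image
)

// peKind (hypothetical helper) maps the Magic field to a readable name.
func peKind(magic uint16) string {
	switch magic {
	case magicPE32:
		return "PE32"
	case magicPE32Plus:
		return "PE32+"
	case magicROM:
		return "ROM image"
	default:
		return "unknown"
	}
}

func main() {
	for _, m := range []uint16{0x10B, 0x20B, 0x1234} {
		fmt.Printf("0x%X -> %s\n", m, peKind(m))
	}
}
```

In the tool, the value passed in would be the Magic field of whichever optional header was parsed.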
Like most antivirus engines and malware analyzers, we also want the file hash in common formats such as MD5, SHA-1 and SHA-256, so let’s do that:
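func main(){
	...
	// file handle
	f, err := os.Open(file)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// md5 hash
	md5_h := md5.New()
	if _, err = io.Copy(md5_h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Println("[*] MD5 hash:", hex.EncodeToString(md5_h.Sum(nil)))
	// sha1 hash (rewind first: the previous copy left the offset at EOF)
	sha1_h := sha1.New()
	if _, err = f.Seek(0, io.SeekStart); err != nil {
		log.Fatal(err)
	}
	if _, err = io.Copy(sha1_h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Println("[*] Sha-1 hash:", hex.EncodeToString(sha1_h.Sum(nil)))
	// sha256 hash (rewind again and copy into the sha256 state)
	sha256_h := sha256.New()
	if _, err = f.Seek(0, io.SeekStart); err != nil {
		log.Fatal(err)
	}
	if _, err = io.Copy(sha256_h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Println("[*] Sha-256 hash:", hex.EncodeToString(sha256_h.Sum(nil)))
	fmt.Println()
}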
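Another simple indicator is the entropy. What is entropy? Here we mean Shannon entropy, which comes from information theory: it measures the amount of randomness in a message or data stream. Packed or encrypted files tend to have high entropy, which is why it is useful to us. This is its mathematical formula, where p(x_i) is the probability of the byte value x_i appearing in the file:
H(X) = -Σ p(x_i) · log2 p(x_i)
And we can calculate the PE entropy like this: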
func main(){
	...
	f2, err := os.Open(file)
	if err != nil {
		log.Fatal(err)
	}
	defer f2.Close()
	contents, err := ioutil.ReadAll(f2)
	if err != nil {
		log.Fatal(err)
	}
	// Count how many times each byte value appears
	freq := make(map[byte]int)
	for _, b := range contents {
		freq[b]++
	}
	totalBytes := len(contents)
	// Sum -p*log2(p) over every byte value that occurs
	entropy := 0.0
	for _, count := range freq {
		p := float64(count) / float64(totalBytes)
		entropy -= p * math.Log2(p)
	}
	fmt.Println("File entropy:", entropy)
}
Before diving into the PE file format itself, I recommend installing CFF Explorer. It’s a really great tool for analyzing PE files, processes and more. Some of its features are:
- Process Viewer
- Drivers Viewer
- Windows Viewer
- PE and Memory Dumper
- Full support for PE32/64
- Special fields description and modification (.NET supported)
- PE Utilities
- PE Rebuilder (with Realigner, IT Binder, Reloc Remover, Strong Name Signature Remover, Image Base Changer)
- View and modification of .NET internal structures
- Resource Editor (full support for Windows Vista icons)
- Quick Disassembler (x86, x64, MSIL)
- Support in the Resource Editor for .NET resources (dumpable as well)
- File Scanner
- Extension support
And much more
As you can see here it has a multi-option menu with a clean view of the information
Once we’ve opened and parsed our file, we go through the different properties we want to analyze. We start with the DOS header (also called the MS-DOS header). Its structure looks like this (code in C++):
typedef struct _IMAGE_DOS_HEADER { // DOS .EXE header
WORD e_magic; // Magic number
WORD e_cblp; // Bytes on last page of file
WORD e_cp; // Pages in file
WORD e_crlc; // Relocations
WORD e_cparhdr; // Size of header in paragraphs
WORD e_minalloc; // Minimum extra paragraphs needed
WORD e_maxalloc; // Maximum extra paragraphs needed
WORD e_ss; // Initial (relative) SS value
WORD e_sp; // Initial SP value
WORD e_csum; // Checksum
WORD e_ip; // Initial IP value
WORD e_cs; // Initial (relative) CS value
WORD e_lfarlc; // File address of relocation table
WORD e_ovno; // Overlay number
WORD e_res[4]; // Reserved words
WORD e_oemid; // OEM identifier (for e_oeminfo)
WORD e_oeminfo; // OEM information; e_oemid specific
WORD e_res2[10]; // Reserved words
LONG e_lfanew; // File address of new exe header
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;
So that can be translated into Golang code like this:
type DosHeader struct {
MZSignature uint16
UsedBytesInTheLastPage uint16
FileSizeInPages uint16
NumberOfRelocationItems uint16
HeaderSizeInParagraphs uint16
MinimumExtraParagraphs uint16
MaximumExtraParagraphs uint16
InitialRelativeSS uint16
InitialSP uint16
CheckSum uint16
InitialIP uint16
InitialRelativeCS uint16
AddressOfRelocationTable uint16
OverlayNumber uint16
Reserved [4]uint16
OEMid uint16
OEMinfo uint16
Reserved2 [10]uint16
AddressOfNewExeHeader uint32
}
The field names in the github.com/Binject/debug/pe package differ from the C++ ones, but the order is the same.
func main(){
...
// DOS Header info
var dos_check string
if pe_file.DosExists == true {
dos_check = "Yes"
} else {
dos_check = "No"
}
fmt.Println("[+] DOS Header:")
fmt.Println(" Is header present?:", dos_check)
fmt.Printf(" Magic: 0x%X\n", pe_file.DosHeader.MZSignature)
fmt.Printf(" New exe header addr: 0x%X\n", pe_file.DosHeader.AddressOfNewExeHeader)
// If the verbose flag is specified,
// more DOS header info is printed
if verbose {
fmt.Printf(" File size in pages: 0x%X\n", pe_file.DosHeader.FileSizeInPages)
fmt.Printf(" Checksum: 0x%X\n", pe_file.DosHeader.CheckSum)
fmt.Printf(" Overlay number: 0x%X\n", pe_file.DosHeader.OverlayNumber)
fmt.Printf(" Relocation table addr: 0x%X\n", pe_file.DosHeader.AddressOfRelocationTable)
}
}
As you can see, we check the verbose flag to print more info if the user wants it
Now let’s look at the DOS stub. The file starts with the MS-DOS header, which is followed by a small 16-bit MS-DOS executable (the stub program). When the PE format was introduced (1993, with Windows NT 3.1), DOS was still very much around, and the risk that a Windows EXE would be run from DOS by mistake was very real. Windows EXEs therefore had to be superficially compatible with the DOS loader, so that in such a scenario the program would do something sensible (i.e. print a message and quit) instead of crashing randomly. That’s why the DOS stub exists.
func main(){
...
// DOS stub bytes
fmt.Println("[+] DOS Stub:")
fmt.Print(" ")
for i, b := range pe_file.DosStub {
// Print it in columns
if (i + 1) % 5 == 0 {
fmt.Printf("0x%X\n ", b)
} else {
fmt.Printf("0x%X, ", b)
}
}
}
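A hex dump of the stub is not very readable, so it can help to also pull out the message it would print. The sketch below is a tiny "strings"-style scan over a byte slice; printableRuns is a name invented for this example, and the stub bytes in main are a hand-written stand-in, not the output of the pe package.

```go
package main

import "fmt"

// printableRuns returns every run of at least minLen printable ASCII bytes.
func printableRuns(data []byte, minLen int) []string {
	var runs []string
	start := -1 // index where the current run began, -1 when outside a run
	for i, b := range data {
		printable := b >= 0x20 && b <= 0x7E
		if printable && start < 0 {
			start = i
		}
		if !printable && start >= 0 {
			if i-start >= minLen {
				runs = append(runs, string(data[start:i]))
			}
			start = -1
		}
	}
	// Flush a run that reaches the end of the data.
	if start >= 0 && len(data)-start >= minLen {
		runs = append(runs, string(data[start:]))
	}
	return runs
}

func main() {
	// Stand-in for a stub: a few non-printable bytes, then the usual
	// message followed by its CR/LF and '$' terminator.
	stub := append([]byte{0x0E, 0x1F, 0xBA, 0x0E, 0x00},
		[]byte("This program cannot be run in DOS mode.\r\r\n$")...)
	for _, s := range printableRuns(stub, 8) {
		fmt.Println(s)
	}
}
```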
Let’s move on to the Rich Header. It’s an undocumented header found in PE files compiled and linked with the Microsoft toolchain, and it contains information about the build environment the PE file was created in.
As with the DOS stub, we’ll print it in columns
func main(){
...
if verbose {
fmt.Println("[*] Parsing Rich header...")
}
if len(pe_file.RichHeader) == 0 {
fmt.Println("[+] Rich Header not found\n")
} else {
fmt.Println("[+] Rich Header:")
// Rich header bytes
fmt.Print(" ")
for i, b := range pe_file.RichHeader {
// Print it in columns
if (i + 1) % 5 == 0 {
fmt.Printf("0x%X\n ", b)
} else {
fmt.Printf("0x%X, ", b)
}
}
}
}
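The raw bytes are not very telling because the Rich header is XOR-encoded. According to public research such as the RichPE project listed in the references, the layout is: an encoded "DanS" marker, three padding dwords, a list of 8-byte entries, the plain-text marker "Rich" and finally the 4-byte XOR key. The sketch below decodes that layout. Treat it as an illustration only: the function and type names are invented here, the input is synthetic, and whether pe_file.RichHeader holds exactly these bytes is an assumption I have not verified.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// richEntry is one decoded (tool, build, use count) record.
type richEntry struct {
	ProductID uint16
	BuildID   uint16
	Count     uint32
}

const danS = 0x536E6144 // "DanS" read as a little-endian uint32

// decodeRich XOR-decodes a raw Rich header and returns its entries.
func decodeRich(raw []byte) ([]richEntry, error) {
	end := bytes.Index(raw, []byte("Rich"))
	if end < 0 || end+8 > len(raw) {
		return nil, errors.New("no Rich marker followed by a key")
	}
	key := binary.LittleEndian.Uint32(raw[end+4:])

	// Walk backwards in dword steps until a value decodes to "DanS".
	start := -1
	for off := end - 4; off >= 0; off -= 4 {
		if binary.LittleEndian.Uint32(raw[off:])^key == danS {
			start = off
			break
		}
	}
	if start < 0 {
		return nil, errors.New("no DanS marker before Rich")
	}

	// "DanS" and three padding dwords come first, then the 8-byte entries.
	var entries []richEntry
	for off := start + 16; off+8 <= end; off += 8 {
		id := binary.LittleEndian.Uint32(raw[off:]) ^ key
		count := binary.LittleEndian.Uint32(raw[off+4:]) ^ key
		entries = append(entries, richEntry{uint16(id >> 16), uint16(id), count})
	}
	return entries, nil
}

// encodeRich builds a synthetic header; it exists only to feed the demo.
func encodeRich(key uint32, entries []richEntry) []byte {
	var buf bytes.Buffer
	put := func(v uint32) { binary.Write(&buf, binary.LittleEndian, v^key) }
	put(danS)
	for i := 0; i < 3; i++ {
		put(0) // padding: zero XOR key
	}
	for _, e := range entries {
		put(uint32(e.ProductID)<<16 | uint32(e.BuildID))
		put(e.Count)
	}
	buf.WriteString("Rich")
	binary.Write(&buf, binary.LittleEndian, key)
	return buf.Bytes()
}

func main() {
	raw := encodeRich(0x11223344, []richEntry{{0x0105, 0x7809, 3}, {0x0101, 0x7809, 12}})
	entries, err := decodeRich(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("product 0x%04X, build %d, used %d times\n", e.ProductID, e.BuildID, e.Count)
	}
}
```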
Let’s continue with the File Header (also called the PE header or COFF header). It is located through the e_lfanew field of the MS-DOS header: e_lfanew points to the 4-byte PE signature, and the File Header follows it directly. Some of its most important fields are Machine, NumberOfSections and NumberOfSymbols.
This is its Golang structure:
type FileHeader struct {
Machine uint16
NumberOfSections uint16
TimeDateStamp uint32
PointerToSymbolTable uint32
NumberOfSymbols uint32
SizeOfOptionalHeader uint16
Characteristics uint16
}
Now we move all this info into code
func main(){
...
if verbose {
fmt.Println("[*] Parsing File header...")
}
// File header part
fmt.Println("[+] File Header:")
fmt.Printf(" Machine: 0x%X\n", pe_file.FileHeader.Machine)
fmt.Printf(" Number of sections: 0x%X\n", pe_file.FileHeader.NumberOfSections)
if verbose { // Print more info if the verbose flag is enabled
fmt.Printf(" Timestamp: 0x%X\n", pe_file.FileHeader.TimeDateStamp)
fmt.Printf(" Symbol table pointer: 0x%X\n", pe_file.FileHeader.PointerToSymbolTable)
fmt.Printf(" Number of symbols: 0x%X\n", pe_file.FileHeader.NumberOfSymbols)
fmt.Printf(" Characteristics: 0x%X\n", pe_file.FileHeader.Characteristics)
}
}
Let’s move on to the Optional Header. Here we use the variables defined earlier because we have to distinguish between OptionalHeader32 and OptionalHeader64. The two are almost identical: the 64-bit version drops BaseOfData and widens ImageBase and the stack and heap size fields to 64 bits. This is the OptionalHeader32 structure in Golang:
type OptionalHeader32 struct {
Magic uint16
MajorLinkerVersion uint8
MinorLinkerVersion uint8
SizeOfCode uint32
SizeOfInitializedData uint32
SizeOfUninitializedData uint32
AddressOfEntryPoint uint32
BaseOfCode uint32
BaseOfData uint32
ImageBase uint32
SectionAlignment uint32
FileAlignment uint32
MajorOperatingSystemVersion uint16
MinorOperatingSystemVersion uint16
MajorImageVersion uint16
MinorImageVersion uint16
MajorSubsystemVersion uint16
MinorSubsystemVersion uint16
Win32VersionValue uint32
SizeOfImage uint32
SizeOfHeaders uint32
CheckSum uint32
Subsystem uint16
DllCharacteristics uint16
SizeOfStackReserve uint32
SizeOfStackCommit uint32
SizeOfHeapReserve uint32
SizeOfHeapCommit uint32
LoaderFlags uint32
NumberOfRvaAndSizes uint32
DataDirectory [16]DataDirectory
}
So let’s print some important info of this header
func main(){
...
if verbose {
fmt.Println("[*] Parsing Optional header...")
}
fmt.Println("[+] Optional Header:")
// Check if optional header is 32 or 64
if opt32 == true {
fmt.Printf(" Magic: 0x%X\n", opt_header32.Magic)
fmt.Printf(" Code size: 0x%X\n", opt_header32.SizeOfCode)
fmt.Printf(" Checksum: 0x%X\n", opt_header32.CheckSum)
if verbose {
fmt.Printf(" Initialized data size: 0x%X\n", opt_header32.SizeOfInitializedData)
fmt.Printf(" Uninitialized data size: 0x%X\n", opt_header32.SizeOfUninitializedData)
fmt.Printf(" Entry point addr: 0x%X\n", opt_header32.AddressOfEntryPoint)
fmt.Printf(" Code base: 0x%X\n", opt_header32.BaseOfCode)
fmt.Printf(" Image base: 0x%X\n", opt_header32.ImageBase)
fmt.Printf(" File alignment: 0x%X\n", opt_header32.FileAlignment)
}
} else {
fmt.Printf(" Magic: 0x%X\n", opt_header64.Magic)
fmt.Printf(" Code size: 0x%X\n", opt_header64.SizeOfCode)
fmt.Printf(" Checksum: 0x%X\n", opt_header64.CheckSum)
if verbose {
fmt.Printf(" Initialized data size: 0x%X\n", opt_header64.SizeOfInitializedData)
fmt.Printf(" Uninitialized data size: 0x%X\n", opt_header64.SizeOfUninitializedData)
fmt.Printf(" Entry point addr: 0x%X\n", opt_header64.AddressOfEntryPoint)
fmt.Printf(" Code base: 0x%X\n", opt_header64.BaseOfCode)
fmt.Printf(" Image base: 0x%X\n", opt_header64.ImageBase)
fmt.Printf(" File alignment: 0x%X\n", opt_header64.FileAlignment)
}
}
}
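The last field of the optional header, DataDirectory, covers the “data directories” goal from our list: 16 slots, each holding an RVA and a size, whose meaning depends on the index. The sketch below names the slots in the order given by the PE format specification and prints the ones that are in use. The dataDirectory type is a local stand-in defined for this example; I am assuming the package’s DataDirectory has the same two fields.

```go
package main

import "fmt"

// dataDirectory mirrors IMAGE_DATA_DIRECTORY: an RVA and a size.
type dataDirectory struct {
	VirtualAddress uint32
	Size           uint32
}

// directoryNames follows the index order of the PE format specification.
var directoryNames = [16]string{
	"Export Table", "Import Table", "Resource Table", "Exception Table",
	"Certificate Table", "Base Relocation Table", "Debug", "Architecture",
	"Global Ptr", "TLS Table", "Load Config Table", "Bound Import",
	"IAT", "Delay Import Descriptor", "CLR Runtime Header", "Reserved",
}

// describeDirectories returns one line per directory that is present.
func describeDirectories(dirs [16]dataDirectory) []string {
	var lines []string
	for i, d := range dirs {
		if d.VirtualAddress == 0 && d.Size == 0 {
			continue // empty slot: the file has no such directory
		}
		lines = append(lines, fmt.Sprintf("%s: RVA 0x%X, size 0x%X",
			directoryNames[i], d.VirtualAddress, d.Size))
	}
	return lines
}

func main() {
	// Made-up values for illustration only.
	var dirs [16]dataDirectory
	dirs[1] = dataDirectory{0x2000, 0x50}  // Import Table
	dirs[5] = dataDirectory{0x5000, 0x1C}  // Base Relocation Table
	dirs[12] = dataDirectory{0x2100, 0x80} // IAT
	for _, line := range describeDirectories(dirs) {
		fmt.Println(line)
	}
}
```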
One of the most important parts of malware analysis is finding out which DLLs and functions the PE file imports, so let’s look at the import tables: the Import Directory Table, the Import Lookup Table (ILT) and the Import Address Table (IAT).
The import address table is the part of a Windows module (executable or dynamic link library) which records the addresses of functions imported from other DLLs. For example, if your program calls GetSystemInfo(), then the executable or DLL will have an entry in its import table that says, “I would like to be able to call the function GetSystemInfo() from kernel32.dll.” When the module is loaded, the system finds that function, obtains its address, and stores it in the Import Address Table. When the module needs to call GetSystemInfo(), it does so by fetching the value from the Import Address Table and calling it
func main(){
	...
	fmt.Print("[+] Import Table:\n\n")
	iat, _, _, err := pe_file.ImportDirectoryTable()
	if err != nil {
		log.Fatal(err)
	}
	symbols, err := pe_file.ImportedSymbols()
	if err != nil {
		log.Fatal(err)
	}
	for _, imp := range iat {
		fmt.Println(" DLL:", imp.DllName)
		fmt.Printf(" ILT RVA: 0x%X\n", imp.OriginalFirstThunk)
		fmt.Printf(" IAT RVA: 0x%X\n", imp.FirstThunk)
		if verbose {
			fmt.Printf(" Name RVA: 0x%X\n", imp.NameRVA)
		}
		fmt.Println(" Entries:")
		for _, s := range symbols {
			// Each symbol has the form "Function:library.dll";
			// skip anything without a colon instead of panicking
			parts := strings.SplitN(s, ":", 2)
			if len(parts) == 2 && parts[1] == imp.DllName {
				fmt.Println(" " + parts[0])
			}
		}
		fmt.Println()
	}
}
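The loop above scans the whole symbol list once per DLL. For large import tables it may be nicer to group the symbols first. The following sketch assumes only the "Function:library.dll" string format used above; groupImports is a hypothetical helper, and the sample symbols are made up. DLL names are lowercased because their case is not consistent across binaries.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupImports turns "Function:library.dll" strings into a map from
// lowercased DLL name to the functions imported from it.
func groupImports(symbols []string) map[string][]string {
	byDLL := make(map[string][]string)
	for _, s := range symbols {
		parts := strings.SplitN(s, ":", 2)
		if len(parts) != 2 {
			continue // not in the expected format
		}
		dll := strings.ToLower(parts[1])
		byDLL[dll] = append(byDLL[dll], parts[0])
	}
	return byDLL
}

func main() {
	symbols := []string{
		"GetSystemInfo:KERNEL32.dll",
		"VirtualAlloc:kernel32.dll",
		"MessageBoxA:USER32.dll",
	}
	grouped := groupImports(symbols)

	// Sort the DLL names so the output order is stable.
	dlls := make([]string, 0, len(grouped))
	for dll := range grouped {
		dlls = append(dlls, dll)
	}
	sort.Strings(dlls)
	for _, dll := range dlls {
		fmt.Println(dll+":", strings.Join(grouped[dll], ", "))
	}
}
```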
And finally let’s parse the Relocation Table. The relocation table is a lookup table that lists all parts of the PE file that need patching when the file is loaded at a non-default base address.
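func main(){
	...
	fmt.Println("[+] Relocation Table:")
	reloc_table := pe_file.BaseRelocationTable
	// The pointer may be nil when the file has no base relocations
	if reloc_table == nil {
		fmt.Println(" No relocation table found")
	} else {
		fmt.Printf(" Number of entries: 0x%X\n", len(*reloc_table))
		fmt.Print(" Entries:\n\n")
		for _, s := range *reloc_table {
			// Short pause so the output can be followed while it scrolls
			time.Sleep(50 * time.Millisecond)
			fmt.Printf(" Virtual Addr: 0x%X\n", s.RelocationBlock.VirtualAddress)
			fmt.Printf(" Size: 0x%X\n", s.RelocationBlock.SizeOfBlock)
			// Guard against blocks without items before indexing
			if len(s.BlockItems) > 0 {
				fmt.Printf(" Type: 0x%X\n", s.BlockItems[0].Type)
			}
			fmt.Println()
		}
	}
}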
That’s the end of the code. We could analyze a few more things, but this is enough for now, so we put all the code together and it should work. The full tool source code is here
Demo
To check that this works, we’ll use the .exe generated in the first post and look at the results.
Compile the code
And then we pass the .exe file to the program via the -f flag
It seems to work: we see its hashes, entropy and more values. Let’s keep looking
Then we can see the Optional Header and all the PE sections
In this picture we see the DLL, its information and the functions imported from kernel32.dll (keep in mind that Golang binaries are a bit different from C++ ones, and C++ is the most common language in malware development, so the tool works better with those)
And finally the relocation table
Extra
As we know, one of the biggest signs that a PE file may be malware is that it imports API calls often abused for code injection, such as NtCreateThread or VirtualAllocEx, so let’s improve the program a little bit.
We start by defining an array which holds some “malicious” API calls
var malicious_calls = []string{"ntcreatethread","createthread","virtualallocex","writeprocessmemory","createremotethread","queueuserapc","rtlmovememory","convertthreadtofiber","setthreadcontext","ntqueryinformationprocess","ntprotectvirtualmemory","ntwritevirtualmemory","ntallocatevirtualmemory","ntcreatethreadex","virtualalloc"}
And the modified part would be something like this:
func main(){
	...
	fmt.Print("[+] Import Table:\n\n")
	iat, _, _, err := pe_file.ImportDirectoryTable()
	if err != nil {
		log.Fatal(err)
	}
	symbols, err := pe_file.ImportedSymbols()
	if err != nil {
		log.Fatal(err)
	}
	for _, imp := range iat {
		fmt.Println(" DLL:", imp.DllName)
		fmt.Printf(" ILT RVA: 0x%X\n", imp.OriginalFirstThunk)
		fmt.Printf(" IAT RVA: 0x%X\n", imp.FirstThunk)
		if verbose {
			fmt.Printf(" Name RVA: 0x%X\n", imp.NameRVA)
		}
		fmt.Println(" Entries:")
		for _, s := range symbols {
			// Each symbol has the form "Function:library.dll"
			parts := strings.SplitN(s, ":", 2)
			if len(parts) != 2 || parts[1] != imp.DllName {
				continue
			}
			// Compare case-insensitively against the list
			m_check := false
			for _, call := range malicious_calls {
				if strings.ToLower(parts[0]) == call {
					fmt.Println(" " + parts[0] + " --> Malicious!")
					m_check = true
					break
				}
			}
			if !m_check {
				fmt.Println(" " + parts[0])
			}
		}
		fmt.Println()
	}
}
And now if we run this new program against C++ malware (the tool works better with it) that uses some of the defined API calls, we should see the program flag them as malicious. This isn’t professional: it’s just a simple implementation that could be greatly improved with many more functions, but I’ll leave that to you.
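One small improvement along those lines is to replace the inner loop over the array with a set lookup. This is a minimal sketch, not part of the tool: suspiciousCalls holds only a handful of the names from the list above, and isSuspicious is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// suspiciousCalls is a set of lowercased API names. Only a few entries from
// the full list are repeated here to keep the example short.
var suspiciousCalls = map[string]struct{}{
	"virtualallocex":     {},
	"writeprocessmemory": {},
	"createremotethread": {},
	"queueuserapc":       {},
	"ntcreatethreadex":   {},
}

// isSuspicious reports whether name is in the set, ignoring case.
func isSuspicious(name string) bool {
	_, ok := suspiciousCalls[strings.ToLower(name)]
	return ok
}

func main() {
	for _, name := range []string{"CreateRemoteThread", "GetSystemInfo", "VirtualAllocEx"} {
		if isSuspicious(name) {
			fmt.Println(name, "--> Malicious!")
		} else {
			fmt.Println(name)
		}
	}
}
```

A hit only means the import deserves a closer look: legitimate programs such as debuggers use these functions too.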
References
https://learn.microsoft.com/en-us/windows/win32/debug/pe-format
https://en.wikipedia.org/wiki/Portable_Executable
https://medium.com/ax1al/a-brief-introduction-to-pe-format-6052914cc8dd
https://tech-zealots.com/malware-analysis/pe-portable-executable-structure-malware-analysis-part-2/
https://malwology.com/2018/10/05/exploring-the-pe-file-format-via-imports/
https://www.ired.team/miscellaneous-reversing-forensics/windows-kernel-internals/pe-file-header-parser-in-c++
https://github.com/RichHeaderResearch/RichPE
Conclusion
We’ve learned about the different parts of the PE file format and how to extract information from headers, sections, data directories and more. If I’ve made a mistake or you want to ask me anything, contact me via Discord; my user is d3ext
This tool isn’t professional, so you should use other tools like CFF Explorer, but I hope this post and tool have helped you understand how PE files work and how you can extract information from them to analyze potential malware.
Source code here