▶️ Intro
In the previous post we started our journey with a very simple scenario, using a nice feature of the Go programming language to measure what percentage of the target program our tests exercise.
This time I am going to experiment with a proof of concept: estimating the test code coverage of an ordinary binary program, without any recompilation.
In this example we will pretend that our task is to write integration tests for the famous gzip program, and we will try to measure the progress we make in the coverage of our tests.
Even pets need coverage! Image credits to: Em Hopper
🧮 How?
The main idea is:
- get in some way the complete list of functions present in the program = N
- record, during test, which of those functions are executed = E
The ratio E/N approximates how much of the program our tests exercise, and points us toward the areas that still need coverage.
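As a minimal sketch of this calculation (not the code from the project), the ratio can be computed with two sets of function names. The helper name `function_coverage` and the sample data are hypothetical and only serve as illustration:

```python
def function_coverage(all_functions, executed_functions):
    """Return (E, N, percentage) for two collections of function names."""
    all_set = set(all_functions)
    # Only count executed functions that belong to the program itself,
    # so library functions (e.g. from glibc) do not inflate the result.
    executed_set = set(executed_functions) & all_set
    n = len(all_set)
    e = len(executed_set)
    percentage = 100.0 * e / n if n else 0.0
    return e, n, percentage


# Hypothetical data, for illustration only
all_funcs = ["main", "zip", "unzip", "treat_file"]
executed = ["main", "treat_file", "strcmp"]  # strcmp is not part of the program
e, n, pct = function_coverage(all_funcs, executed)
print(f"Functions coverage: {e}/{n} {pct:.2f}%")
```

Intersecting with the full list is a design choice: whatever the tracer reports from shared libraries is ignored, so E can never exceed N.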
We don’t want to recompile gzip with coverage instrumentation, but our distro provides the debug information for the program. It usually ships in separate packages, in a repository that is not enabled by default, so first of all let’s enable that repository and install the related packages.
On Tumbleweed:
$ sudo zypper modifyrepo -e repo-debug
$ sudo zypper refresh
$ sudo zypper in gzip-debuginfo gzip-debugsource
👐 Functions all the way down
We can use the gdb debugger to get a list of all the functions in a program:
$ sudo zypper install gdb
$ gdb /usr/bin/gzip
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/gzip...
Reading symbols from /usr/lib/debug/usr/bin/gzip.debug...
(gdb) info functions
All defined functions:
File ../sysdeps/x86_64/start.S:
void _start(void);
File ./lib/stat-time.h:
29: int openat_safer(int, const char *, int, ...);
30: int rpl_printf(const char *, ...);
116: int unzip(int, int);
[... long output omitted ...]
That looks promising!
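To give an idea of how this listing could be turned into the list N, here is a rough sketch that extracts the function names with a regular expression. It is an assumption about how such parsing might be done, not the code from the project; `parse_gdb_functions` is a hypothetical helper, and the sample text is a shortened listing with an illustrative "Non-debugging symbols" entry:

```python
import re

# Matches the first identifier directly followed by "(": the function name
# in declarations such as "116: int unzip(int, int);"
FUNC_NAME = re.compile(r"([A-Za-z_]\w*)\(")


def parse_gdb_functions(text):
    """Collect function names from the output of gdb's 'info functions'."""
    functions = set()
    for line in text.splitlines():
        line = line.strip()
        # Symbols without debug information are listed after this header;
        # this sketch ignores them.
        if line.startswith("Non-debugging symbols:"):
            break
        # Declarations end with ";", headers and "File ..." lines do not.
        if not line.endswith(";"):
            continue
        match = FUNC_NAME.search(line)
        if match:
            functions.add(match.group(1))
    return functions


sample = """All defined functions:
File ../sysdeps/x86_64/start.S:
void _start(void);
File ./lib/stat-time.h:
29: int openat_safer(int, const char *, int, ...);
30: int rpl_printf(const char *, ...);
116: int unzip(int, int);
Non-debugging symbols:
0x0000000000002000  _init
"""
print(sorted(parse_gdb_functions(sample)))
```

A regular expression is good enough for plain C declarations like these; unusual declarations may need more careful handling.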
☝️ Write the first test
As we did last time, for simplicity we are going to use the pytest framework, but any other would work. First, let’s write a smoke test:
# test_gzip.py
import os
import re
from subprocess import run

PROGRAM = '/usr/bin/gzip'

# program should display help
def test_help(capfd):
    process = run([PROGRAM, '-h'])
    stdout, stderr = capfd.readouterr()
    assert process.returncode == 0
    assert "Usage:" in stdout
In this test, we spawn a process that simply executes gzip -h, expecting some specific output.
Let’s run it:
============================= test session starts ==============================
platform linux -- Python 3.13.2, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/andrea/binarycoverage
collected 1 item
test_gzip.py . [100%]
============================== 1 passed in 0.01s ===============================
👣 Trace it
Now we can trace which functions have been exercised by wrapping the test run with the powerful valgrind tool:
$ sudo zypper install valgrind
$ valgrind --tool=callgrind --trace-children=yes pytest
The execution takes a bit longer, and we get some new files containing the tracing data:
$ ls -l callgrind.out.*
-rw-------. 1 andrea andrea 1944681 Mar 30 17:54 callgrind.out.2771
-rw-------. 1 andrea andrea 82977 Mar 30 17:54 callgrind.out.2816
These data files are meant to be processed by callgrind_annotate, which outputs a detailed report of all the functions executed (including those in libraries like glibc).
$ callgrind_annotate callgrind.out.2816
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.2816' (creator: callgrind-3.24.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 52685
Trigger: Program termination
Profiled target: /usr/bin/gzip -h (PID 2816, part 1)
Events recorded: Ir
Events shown: Ir
Event sort order: Ir
Thresholds: 99
Include dirs:
User annotated:
Auto-annotation: on
--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
246,004 (100.0%) PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
41,382 (16.82%) /usr/src/debug/glibc-2.41/elf/dl-lookup.c:do_lookup_x [/usr/lib64/ld-linux-x86-64.so.2]
40,596 (16.50%) /usr/src/debug/glibc-2.41/elf/dl-reloc.c:_dl_relocate_object_no_relro [/usr/lib64/ld-linux-x86-64.so.2]
17,388 ( 7.07%) /usr/src/debug/glibc-2.41/elf/dl-lookup.c:_dl_lookup_symbol_x [/usr/lib64/ld-linux-x86-64.so.2]
13,781 ( 5.60%) /usr/src/debug/glibc-2.41/elf/dl-tunables.c:__GI___tunables_init [/usr/lib64/ld-linux-x86-64.so.2]
13,309 ( 5.41%) /usr/src/debug/glibc-2.41/elf/../sysdeps/generic/dl-new-hash.h:_dl_lookup_symbol_x
11,941 ( 4.85%) /usr/src/debug/glibc-2.41/string/../sysdeps/x86_64/multiarch/../multiarch/strcmp-sse2.S:strcmp [/usr/lib64/ld-linux-x86-64.so.2]
9,951 ( 4.05%) /usr/src/debug/glibc-2.41/elf/dl-lookup.c:check_match [/usr/lib64/ld-linux-x86-64.so.2]
8,321 ( 3.38%) /usr/src/debug/glibc-2.41/elf/do-rel.h:_dl_relocate_object_no_relro
7,033 ( 2.86%) /usr/src/debug/gzip-1.13/lib/vasnprintf.c:vasnprintf [/usr/bin/gzip]
6,968 ( 2.83%) /usr/src/debug/glibc-2.41/elf/../sysdeps/x86_64/dl-machine.h:_dl_relocate_object_no_relro
5,935 ( 2.41%) /usr/src/debug/glibc-2.41/elf/../sysdeps/x86/dl-cacheinfo.h:intel_check_word.constprop.0 [/usr/lib64/ld-linux-x86-64.so.2]
4,811 ( 1.96%) /usr/src/debug/glibc-2.41/elf/../bits/stdlib-bsearch.h:intel_check_word.constprop.0
4,402 ( 1.79%) /usr/src/debug/glibc-2.41/elf/dl-version.c:_dl_check_map_versions [/usr/lib64/ld-linux-x86-64.so.2]
4,356 ( 1.77%) /usr/src/debug/glibc-2.41/elf/dl-tunables.h:__GI___tunables_init
4,348 ( 1.77%) /usr/src/debug/gzip-1.13/lib/printf-parse.c:vasnprintf
2,660 ( 1.08%) /usr/src/debug/glibc-2.41/stdio-common/vfprintf-internal.c:__printf_buffer [/usr/lib64/libc.so.6]
2,064 ( 0.84%) /usr/src/debug/glibc-2.41/stdio-common/Xprintf_buffer_write.c:__printf_buffer_write [/usr/lib64/libc.so.6]
While a bit verbose, it contains all the information we need. It just needs some massaging …
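One possible way to do that massaging is sketched below. It assumes that a function belongs to the program when the object file shown in square brackets is the binary under test, and it skips lines without an object file; the real script may well use a different rule. `executed_functions` is a hypothetical helper name:

```python
import re

LINE = re.compile(
    r"^\s*[\d,]+\s+\(\s*[\d.]+%\)\s+"      # instruction count and percentage
    r"(?P<file>\S+):(?P<func>\S+)"          # source file and function name
    r"(?:\s+\[(?P<object>[^\]]+)\])?\s*$"   # optional [object file]
)


def executed_functions(report, binary):
    """Return the functions in the report whose object file is `binary`."""
    executed = set()
    for line in report.splitlines():
        match = LINE.match(line)
        # Assumption: lines without an [object] part are ignored
        if match and match.group("object") == binary:
            executed.add(match.group("func"))
    return executed


# A few lines taken from the report above
sample = """\
246,004 (100.0%)  PROGRAM TOTALS
41,382 (16.82%)  /usr/src/debug/glibc-2.41/elf/dl-lookup.c:do_lookup_x [/usr/lib64/ld-linux-x86-64.so.2]
 7,033 ( 2.86%)  /usr/src/debug/gzip-1.13/lib/vasnprintf.c:vasnprintf [/usr/bin/gzip]
 4,348 ( 1.77%)  /usr/src/debug/gzip-1.13/lib/printf-parse.c:vasnprintf
"""
print(executed_functions(sample, "/usr/bin/gzip"))
```

Filtering on the object file keeps the dynamic loader and glibc functions out of the executed set, which would otherwise dominate the report.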
🤖 Automate it
To make our life easier, it is better to use some glue scripting to drive the tools, plus some Python code to parse the data and extract the information we need. The complete project is available on my GitHub repository, but here is an excerpt of the coverage.sh script, which runs pytest and outputs the coverage measure:
#!/bin/bash
BINARY=gzip
TEMP_DIR=$(mktemp -d)

valgrind --tool=callgrind --trace-children=yes \
    --callgrind-out-file="$TEMP_DIR/callgrind.%p" pytest 2> /dev/null

# annotate all the files
for f in "$TEMP_DIR"/callgrind.*
do
    base=$(basename "$f")
    # auto annotation with --context=0 can be useful
    # to have precise source code line execution
    callgrind_annotate --auto=yes --context=0 \
        "$f" > "$TEMP_DIR/${base#*.}.log" 2>/dev/null
done

# dump all the functions in the binary
gdb -ex 'set pagination off' -ex 'info functions' -ex quit \
    "$(which "$BINARY")" > "$TEMP_DIR/all_funcs.gdb"

python3 calc_coverage.py --binary "$BINARY" -d "$TEMP_DIR"

# Clean up: Remove the temporary directory and its contents
rm -rf "$TEMP_DIR"

$ ./coverage.sh
============================= test session starts ==============================
platform linux -- Python 3.13.2, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/andrea/binarycoverage
collected 1 item
test_gzip.py . [100%]
============================== 1 passed in 0.54s ===============================
--- Binary coverage report ---
Functions coverage: 9/80 11.25%
As expected, our “smoke” test on gzip runs only 9 of the 80 functions, a low 11% coverage.
🏃➡️ Let’s move forward
Now that the coverage metric is driving us, we can improve our testing. Shall we try the gzip -V option?
# program should display version information
def test_version(capfd):
    process = run([PROGRAM, '-V'])
    stdout, stderr = capfd.readouterr()
    assert process.returncode == 0
    assert "This is free software" in stdout
    # escape the dot so it only matches a literal "." in the version number
    assert re.search(r"gzip \d\.\d\d", stdout)
A simple test to ensure the program outputs a numeric version.
$ ./coverage.sh
============================= test session starts ==============================
collected 2 items
test_gzip.py .. [100%]
============================== 2 passed in 1.17s ===============================
--- Binary coverage report ---
Functions coverage: 10/80 12.50%
A bit better! Let’s add a negative test for good measure:
# program should fail when given a non existing file
def test_compress_non_existent():
    process = run([PROGRAM, 'foobar'])
    assert process.returncode == 1
$ ./coverage.sh
============================= test session starts ==============================
collected 3 items
test_gzip.py ... [100%]
============================== 3 passed in 1.51s ===============================
--- Binary coverage report ---
Functions coverage: 19/80 23.75%
We are on the right track. We almost doubled the coverage, and we still haven’t compressed anything…
🏋️ Do some actual work
Time to write a test that compresses and decompresses a file! We also introduce a helper function in the test, as we will need it more than once:
SAMPLE_FILE = 'sample.txt'

# program should compress and de-compress a file
def test_compress_decompress(capfd):
    create_test_file(SAMPLE_FILE)
    with open(SAMPLE_FILE) as file:
        content = file.readlines()
    process = run([PROGRAM, SAMPLE_FILE])
    assert process.returncode == 0
    compressed_file = SAMPLE_FILE + ".gz"
    # decompress and read back content
    process = run([PROGRAM, '-d', compressed_file])
    assert process.returncode == 0
    with open(SAMPLE_FILE) as file:
        assert file.readlines() == content
    os.remove(SAMPLE_FILE)

# helper function to create a dummy sample file
def create_test_file(file_name):
    sample_text = """This is a dummy sample text file.
It contains some random lines of text.
This is line 3 of the text file.
Here is line 4, just for testing purposes.
Feel free to modify or extend this text.
"""
    # Open the file in write mode ('w') and write the sample text to it
    with open(file_name, 'w') as file:
        file.write(sample_text)

$ ./coverage.sh
============================= test session starts ==============================
collected 4 items
test_gzip.py .... [100%]
============================== 4 passed in 2.30s ===============================
--- Binary coverage report ---
Functions coverage: 52/80 65.00%
That’s big progress! Our tests are getting better. Just one more? Let’s go over to the evil side and give it a damaged file:
# program should give error on a damaged compressed file
def test_decompress_error(capfd):
    wrong_file = 'dummy.txt'
    create_test_file(wrong_file)
    wrong_compressed = wrong_file + '.gz'
    process = run([PROGRAM, wrong_file])
    # now damage the compressed file: note that bytes(0xFF) yields 255 zero
    # bytes (not a single 0xFF byte), written from offset 32 onward
    with open(wrong_compressed, "r+b") as file:
        file.seek(32)
        file.write(bytes(0xFF))
    # decompression should fail
    process = run([PROGRAM, '-d', wrong_compressed])
    stdout, stderr = capfd.readouterr()
    assert process.returncode == 1
    assert 'invalid compressed data' in stderr
    os.remove(wrong_compressed)
$ ./coverage.sh
============================= test session starts ==============================
collected 5 items
test_gzip.py ..... [100%]
============================== 5 passed in 3.02s ===============================
--- Binary coverage report ---
Functions coverage: 54/80 67.50%
That’s a good number! Can you think of some areas for improvement?
👓 We miss something
If you use the -v verbose option, the Python calc_coverage script will output which functions are tested and which aren’t:
Executed functions: atdir_eq,atdir_set,bi_windup,build_tree,compress_block,ct_tally,discard_input_bytes,do_exit,fd_safer,file_read,fill_inbuf,fill_window,finish_out,finish_up_gzip,flush_block,flush_outbuf,flush_window,gen_codes,get_input_size_and_time,get_method,get_suffix,huft_build,huft_free,inflate_codes,inflate_dynamic,init_block,input_eof,last_component,license,longest_match,main,open_and_stat,open_safer,openat_safer,pqdownheap,progerror,read_buffer,remove_output_file,rpl_fclose,rpl_fflush,rpl_fprintf,rpl_printf,rpl_vfprintf,scan_tree,send_bits,send_tree,strlwr,treat_file,unzip,updcrc,vasnprintf,write_buf,xstrdup,zip
Missing functions : _start,abort_gzip_signal,copy,copy_block,direntry_cmp_name,display_ratio,do_list,fillbuf,fprint_off,gzip_error,inflate_fixed,make_table,mbszero,read_byte,read_error,read_pt_len,rpl_fcntl,rsync_roll,treat_stdin,try_help,unlzh,unlzw,unpack,write_error,xalloc_die,xpalloc
This also gives us some hints about which features of the program we aren’t testing. In this example we can cite, among others, the rsync compatibility and the support for .Z files. Of course, some functions (like the signal handling routines) are very difficult to test properly.
🧵 Final words
It’s crucial to remember that the coverage percentage obtained using this method is an approximation. The method only records whether a function was called, not which lines or branches ran inside it. Therefore, a function might be called but not fully tested, leading to potential false positives. Additionally, functions indirectly exercised by other calls might not be explicitly listed, resulting in false negatives. The performance overhead introduced by valgrind also means this technique is more suitable for offline analysis than real-time testing.
On the other hand, this method is simple to implement, requires neither a big effort nor a special setup, and gives you an indication of whether the integration tests you are writing are improving over time. Another good use is detecting when a new version of the program gains features: if your coverage drops after the update, you are probably not testing the new stuff.
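As a rough illustration of that last idea, the function lists of two versions can be compared to see which new functions the tests never reach. This is only a sketch: `new_untested_functions` and the function name `new_feature` are hypothetical.

```python
def new_untested_functions(old_functions, new_functions, executed):
    """Functions added in the new version that no test executed."""
    added = set(new_functions) - set(old_functions)
    return sorted(added - set(executed))


# Hypothetical function lists of two versions of the same program
old_version = {"main", "zip", "unzip"}
new_version = {"main", "zip", "unzip", "new_feature"}
executed = {"main", "zip", "unzip"}

print(new_untested_functions(old_version, new_version, executed))
# Coverage drops from 3/3 to 3/4 although no test has changed
print(f"{len(executed & new_version)}/{len(new_version)}")
```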
Thanks for following me to the end of this long post. Feel free to send comments and feedback. Happy hacking! 👋