How to debug buffer corruptions - My STM32 case study
Memory corruption! It is the biggest nightmare for security engineers and a jackpot for hackers hunting remote code execution, and it all starts with something as simple as a missing boundary check combined with an oversized memory copy length.
In this post, let's dive deep into one such issue I faced while porting the nRF8001 BLE library to an STM32L151 MCU, and along the way pick up some of the techniques and tools used to debug this kind of problem.
Step 1 - Identify what the problem is
Identifying the problem was very simple: my microcontroller screamed and threw a HardFault at me 🥲. I then had to unwind the stack to identify which function was causing it. Here is an article by STM that came in handy for unwinding the stack in case you don't have an IDE. The Fault Analyzer tool is also very handy for finding out what kind of fault occurred, i.e., whether it was a memory access fault, a bus error, a debug event, and so on.
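If you end up unwinding by hand, the core idea is that a Cortex-M core pushes eight registers onto the active stack when it takes an exception. Here is a minimal sketch of decoding that frame. The struct and function names are my own, and it assumes you have already captured the stack pointer (MSP or PSP, depending on bit 2 of the EXC_RETURN value in LR) inside your fault handler.

```c
#include <stdint.h>

/* Registers pushed by the Cortex-M core on exception entry, in stack order. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register at the moment of the fault */
    uint32_t pc;    /* stacked return address; for precise faults this is
                       the faulting instruction */
    uint32_t xpsr;
} fault_frame_t;

/* Hypothetical helper: turn the eight stacked words into named fields.
 * `stacked` is assumed to point at the frame captured in the handler. */
fault_frame_t decode_fault_frame(const uint32_t *stacked)
{
    fault_frame_t frame;

    frame.r0   = stacked[0];
    frame.r1   = stacked[1];
    frame.r2   = stacked[2];
    frame.r3   = stacked[3];
    frame.r12  = stacked[4];
    frame.lr   = stacked[5];
    frame.pc   = stacked[6];
    frame.xpsr = stacked[7];

    return frame;
}
```

Looking up the decoded pc (and lr) in the map file or disassembly is usually enough to tell which function was running when the fault hit.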
Step 2 - Identify how the problem is happening
Initially I couldn't work out how, or which part of the code, was triggering the fault. So I fell back on the old-school method of stepping through the code to see where the hard fault occurred, and found that the code below was causing it.
I then put a breakpoint in this function and ran through it a couple of times without any issues. A few executions later, however, I got something as shown below.
Do you find anything interesting? At first glance everything looked fine to me, but look closely and you can see that all the data elements inside the structure pointer a_pins_local_ptr have vanished all of a sudden. So the code on line 149 is trying to read the digital value of a pin whose number no longer exists or is garbage. When the GPIO driver tries to access this unknown pin, it ends up at an address that isn't in the memory map, and a hard fault is generated immediately because an illegal memory location is being accessed. That's nice: we now know exactly what is causing the HardFault.
Step 3 - More Debugging
We now need to find out which part of our code is corrupting this struct and sending the application for a toss. For exactly this kind of problem we have a tool in our arsenal: enter WATCHPOINTS. They are similar to breakpoints but work a little differently. A breakpoint halts the CPU when the program counter reaches a predefined address, whereas a watchpoint, as the name suggests, halts the CPU when a memory location we have chosen to watch is READ or WRITTEN, which is exactly what we need for our use case.
So if I configure a write watchpoint on the memory location of my struct and let the code run freely, I should find out which function or piece of code is overwriting it, because the watchpoint halts the CPU right when that memory is written. You can do that as shown below.
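Your debugger normally programs the hardware for you, but it helps to see what a write watchpoint boils down to. On an ARMv7-M core such as the Cortex-M3, it is one comparator in the DWT unit. The sketch below is only an illustration under my own assumptions: the struct and helper names are hypothetical, the comparator is passed in as a pointer (on the target the first one sits at 0xE0001020), trace must already be enabled via DEMCR.TRCENA, and the core only halts on a match when a debugger is attached.

```c
#include <stdint.h>

/* One DWT comparator register group: COMPn, MASKn, FUNCTIONn. */
typedef struct {
    volatile uint32_t comp;      /* address to watch */
    volatile uint32_t mask;      /* number of low address bits to ignore */
    volatile uint32_t function;  /* which kind of access raises a match */
} dwt_comparator_t;

/* FUNCTION field value for a data-address write watchpoint on ARMv7-M. */
#define DWT_FUNC_WATCH_WRITE 0x6u

/* Hypothetical helper: arm `cmp` to match writes anywhere in the
 * 2^size_log2-byte aligned region that contains `addr`.
 * size_log2 is assumed to be small (well below 32). */
void dwt_watch_writes(dwt_comparator_t *cmp, uint32_t addr, uint32_t size_log2)
{
    uint32_t align_mask = (1u << size_log2) - 1u;

    cmp->function = 0u;                    /* disable while reprogramming */
    cmp->comp     = addr & ~align_mask;    /* align address to the region */
    cmp->mask     = size_log2;             /* ignore the low address bits */
    cmp->function = DWT_FUNC_WATCH_WRITE;  /* match on writes only */
}
```

For example, watching a 16-byte struct would mean passing its address with size_log2 set to 4. In day-to-day debugging the IDE's watchpoint dialog does the same job with far less ceremony.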
Step 4 - Catching the culprit 🥷
Once I let the code run freely after this, the watchpoint halted execution and, as I was expecting, it stopped in the memcpy function. When I stepped one stack frame back, I could see that the aci_queue_enqueue function was calling it.
I then analysed what this particular memcpy call was doing, quickly noticed that it takes a length variable, decided to check that length and BOOOMMM! It was copying more bytes than it was supposed to.
You might think 255 bytes is not an unusual number to see in most programs, so how did I conclude that it was a corrupted value? Well, it's fairly easy: the length variable is declared as uint8_t, so 255 is the largest value it can ever hold. A length pinned at the type's maximum, copying more data than the destination is supposed to receive, points to a garbage or wrapped-around value rather than a genuine payload size, hence the errors.
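To make the uint8_t behaviour concrete, here is a small sketch of how narrowing to 8 bits mangles a length, together with the kind of boundary check whose absence makes this class of bug possible. Both function names are my own illustrations and are not taken from the nRF8001 library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Converting to uint8_t keeps only the low 8 bits of the value,
 * so -1 becomes 255 and 300 becomes 44. */
uint8_t narrow_length(int value)
{
    return (uint8_t)value;
}

/* Hypothetical guard: copy `len` bytes only if they fit in `dst`. */
bool bounded_copy(void *dst, size_t dst_size, const void *src, size_t len)
{
    if (dst == NULL || src == NULL || len > dst_size) {
        return false;  /* refuse instead of overrunning what follows dst */
    }
    memcpy(dst, src, len);
    return true;
}
```

With a guard like this in front of the copy, a bogus length of 255 headed for a 32-byte slot would be rejected instead of silently trampling whatever lives next to the buffer.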
This is where I did some more digging into the code and realized there's a queue size macro, ACI_QUEUE_SIZE, defined elsewhere that controls how many commands or data responses can be enqueued into a buffer. I then realized that the RAM on my variant of the MCU is very small, so I couldn't afford an ACI_QUEUE_SIZE of 4. I changed it to 2, which worked flawlessly.
All my hard faults were gone and the device started working. Reducing the queue size can come at the cost of lower BLE data throughput; however, higher data rates are not essential for my application.
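A rough way to see why the queue depth matters on a small MCU is to work out how much RAM the queues occupy. The sketch below uses made-up numbers: the 32-byte payload and the packet layout are assumptions for illustration, not the library's actual definitions, and it ignores the head/tail bookkeeping a real queue also needs.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PACKET_BYTES 32u  /* hypothetical payload size */

/* Hypothetical packet layout: one status byte plus the payload. */
typedef struct {
    uint8_t status;
    uint8_t buffer[SKETCH_PACKET_BYTES];
} sketch_packet_t;

/* RAM taken by `queue_count` queues that each hold `depth` packets. */
size_t queue_ram_bytes(size_t queue_count, size_t depth)
{
    return queue_count * depth * sizeof(sketch_packet_t);
}
```

Whatever the real packet size is, the footprint scales linearly with the depth, so going from 4 to 2 halves the RAM the queues claim.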
This concludes my debug story; I hope you enjoyed it and learned something from it. Until the next story, au revoir!