Bringing order to a chaotic bug

I think a good software engineer is a problem solver. To solve a software problem you have to analyse it and find the cause. Many programmers notice the moment the error occurs, take one look at the source code, guess at the cause and start “hacking away at the code” to fix it. They often get lucky, but there are situations where you really, really have to understand what’s going on.

Many years ago I was working on an application that first showed a login window in which the user selected the server to connect to. The list of servers was not loaded from a configuration file or hard coded into the client application. Instead, the application opened a network socket to listen for broadcast messages sent by the servers. Every once in a while, though, the application would start, immediately show an error and die, after which the user had to start the client application again for another attempt. Now this may not seem like a severe problem. It just cost the user a couple of extra seconds to close the error and retry, and because the problem occurred only sporadically, the second try usually worked.

But the issue is that it’s a symptom of a deeper problem, and that makes users nervous. To a customer it looks like the developers are not completely in control of the product. It’s an error that occurs right after the user starts the application. So if you have some bad luck during a demo session, the first thing the potential customer sees is an error message and a crashing application. Not a good way to start a demo. Add a couple of database errors and some other (apparently randomly occurring) quality issues, and some potential customers might just choose an alternative product from a competitor.

In my work I often needed to start the client application to test functionality I had added or changed, so I was confronted with this login error a few times a week, sometimes two or three times a day. You can imagine that after working this way for several months, this started to really bug me ;). So I asked some colleagues about it. It turned out several people had already looked at it, but in several years no one had been able to solve it. In their defence, an error that happens two or three times a day at most is seriously hard to analyse, because the easiest way to find the cause of a bug is to reproduce it and then look under the hood to see what happened in the source code. That usually gives a clue about the cause.

With this particular bug, though, it didn’t help much to look under the hood at the state of the application when it happened. For one, someone analysing it would have to go up several levels of function calls to find out that the application had tried to make a change to the login window while the login window was not yet created. But ok, once I finally found that out, I knew the error was at least legitimate: the computer cannot make a change to a window that does not exist yet. But why did the login window not exist? It should always exist at that point. The randomness of it made no sense. Either the window should never exist at that point (which would mean it was simply not created yet), or it should always exist. Things don’t just happen in random order in the execution of a computer program unless you explicitly program it that way. So this randomness had to have a reason.

Well, this kind of randomness is what we developers call a race condition. Two processes (or threads) “race” to get to some point first. Depending on the situation, such a race condition can give entirely unpredictable behaviour or, as in my case, go wrong only very rarely.

I decided to step through the code where the login window was created and do some digging. Program source code is structured. We group a few lines of code that implement some functionality together in what we call a function, and one function can call another function. This way we can hide some complicated behaviour behind a simple name such as “CreateLoginWindow”. When you pause a running application and look under the hood at its state, you can see which function the computer was executing when it was paused. You can also see which function called that function, and so go up many levels of function calls. So I placed a “breakpoint” on the function that did the actual creation of the login window, and ran the application. A breakpoint is a feature of the debugger (a tool we programmers use to analyse a running application) that automatically pauses the process when it is reached. So my debugger paused the application at the point where it was actually creating the login window. I went to the “stack”, which is like a history of all the function calls that led to the current state, and looked at the code that led to the creation of the login window. And then I saw something like this:

CreateUDPListenSocket();
CreateLoginWindow();

The first line calls a function that creates and sets up a so-called “UDP listen socket”, which is like an “ear” on the network that “listens” for UDP (User Datagram Protocol) messages (let’s say messages that can be sent without making a connection first). This function, “CreateUDPListenSocket”, also tells the computer what to do with such a message when it is received. At first glance there is nothing wrong here, though some of the developers reading this may already have spotted the problem. The first function creates the UDP socket and registers what needs to happen when a message is received, and the second function creates the login window. These two function calls are right next to each other. How can anything else happen in between them? The computer doesn’t just randomly do other stuff in between, right? Well, normally it doesn’t. But here it did.

Let’s dive a little deeper. The first function told the computer what to do when a message from a server is received. So what happens if such a UDP message arrives right after the first function is done, but before the second function has run? If you thought the application would try to add the name and address of the server to the login window, you would be right. But wait, if it does that before the second function has finished creating the login window, doesn’t that cause our problem? Yes, it does indeed!

Ok, but that still doesn’t explain how anything can run in between these two function calls, does it? Well, there are actually several ways in which that can happen. One is something called an “interrupt service routine”. Another is that an application can consist of more than one flow of execution. Such flows that are part of the same application and run in parallel are called “threads”. In this case the application used a library (which is like a set of functions that can be reused by multiple applications) that processed network messages on a different thread.
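To make that ordering concrete, here is a minimal Python sketch of the unlucky case. It is not the original application’s code: the snake_case functions only mimic the two calls above, and the dictionary standing in for the login window, the “server-1” name and the errors list are all made up for illustration. The join() call forces the timing that in the real application happened only by chance.

```python
import threading

login_window = None  # stand-in for the real window; None means "not created yet"
errors = []          # collects what went wrong on the listener thread

def on_server_broadcast(name):
    """Handler for a server broadcast; runs on the listener thread."""
    try:
        login_window["servers"].append(name)
    except TypeError as exc:  # login_window is still None
        errors.append(exc)

def create_udp_listen_socket():
    """Hypothetical stand-in: start a thread that handles one broadcast."""
    listener = threading.Thread(target=on_server_broadcast, args=("server-1",))
    listener.start()
    return listener

def create_login_window():
    global login_window
    login_window = {"servers": []}

# Same order as in the snippet above: socket first, window second.
listener = create_udp_listen_socket()
listener.join()  # force the bad timing: the message is handled before the window exists
create_login_window()

print(f"errors on the listener thread: {len(errors)}")
```

The handler runs while the window is still missing, so the errors list ends up holding the failure, which is the same kind of failure the real application showed at start-up.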

At this point some smart readers might protest and say there are just nanoseconds between the first function and the second so that the chance of that server message being received right in between those two function calls is almost zero. It wouldn’t explain how this error would occur so often.

Yes, it does explain it! But I actually agree with the sceptics. Those who say the time between the first function finishing and the second function starting is extremely short are not wrong. At least, as a rule that time is USUALLY very short, a million times shorter than the blink of an eye. However, there is such a thing in a computer as a “timeslice”. A processor core can really only do one thing at a time, so the computer fakes it. By switching very fast between running processes, it makes them appear to run at the same time. Every process gets a little bit of processing time, called a “timeslice”. At the end of the timeslice, the computer pauses the process, stores the information about its state, loads the state of another process, and runs that other process for a little while. If the timeslice ends right after the first function call but before the second, there might be several milliseconds during which a server message could be received and processed.
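A bit of arithmetic shows how much difference such a pause makes. The Python sketch below uses purely hypothetical numbers (three servers, each broadcasting once per second, a gap of about 100 nanoseconds normally and 10 milliseconds when the thread is paused). None of these were measured on the real system, and the chance_of_hit function is my own illustration.

```python
def chance_of_hit(gap_s, broadcast_interval_s, server_count):
    """Chance that at least one broadcast lands inside a gap of gap_s seconds.

    Assumes, for illustration only, that every server broadcasts once per
    broadcast_interval_s at a moment unrelated to the client's start-up,
    and that the gap is shorter than the interval.
    """
    miss_one = 1.0 - gap_s / broadcast_interval_s
    return 1.0 - miss_one ** server_count

# Hypothetical numbers: 3 servers, each broadcasting once per second.
usual_gap = chance_of_hit(100e-9, 1.0, 3)   # about 100 ns between the two calls
paused_gap = chance_of_hit(10e-3, 1.0, 3)   # about 10 ms if the thread is paused in between

print(f"usual gap:  {usual_gap:.7f}")
print(f"paused gap: {paused_gap:.7f}")
```

With these made-up numbers, a hit is roughly a hundred thousand times more likely when the thread happens to be paused between the two calls. How often such a pause falls exactly there is a separate question, which is why the bug stayed sporadic.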

It’s even possible that creating the login window first triggered some handling of pending background events, before the window itself actually existed. That could force the running thread to pause so that those background events were handled first. And while they were being handled, the UDP message from the server could be received and processed before control returned to the thread that creates the login window.

There are maybe a few other explanations one could think of that would make it more likely for the UDP message from the server to be received and processed in between these two function calls. Either way, I switched the two lines around so that the login window was created before the UDP listen socket. After that, the error was never seen again.
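The swap can be sketched in Python as well. As before, this is an illustration and not the original code: the function names, the dictionary standing in for the window and the “server-1” message are assumptions of mine.

```python
import threading

login_window = None  # stand-in for the real window; None means "not created yet"
errors = []          # collects what went wrong on the listener thread

def on_server_broadcast(name):
    """Handler for a server broadcast; runs on the listener thread."""
    try:
        login_window["servers"].append(name)
    except TypeError as exc:  # would only happen if the window did not exist
        errors.append(exc)

def create_login_window():
    global login_window
    login_window = {"servers": []}

def create_udp_listen_socket():
    """Hypothetical stand-in: start a thread that handles one broadcast."""
    listener = threading.Thread(target=on_server_broadcast, args=("server-1",))
    listener.start()
    return listener

# Swapped order: the window exists before anything can try to change it.
create_login_window()
listener = create_udp_listen_socket()
listener.join()  # however early the message arrives, the window is already there

print(f"servers in the login window: {login_window['servers']}")
print(f"errors on the listener thread: {len(errors)}")
```

No matter when the listener thread gets to run, the handler now always finds a window to add the server to.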

There is a little bit of a lesson in this though. It probably cost the original author of this code less than ten seconds to write those two lines. But writing them in that order cost several people multiple years of dealing with a bug they didn’t know how to solve. And while I don’t expect every developer to produce perfect source code (we are only human after all), there seems to be some carelessness involved here. You see, the function that processes the messages from the servers has an inherent dependency on the login window. Creating the UDP listen socket, and telling the computer what should happen when a server broadcast message is received, BEFORE creating the login window therefore shows a lack of thinking about dependencies. Such dependencies are vital to consider in software development. This bug could easily have been prevented by considering the relationship between the two. That they were originally put in this order suggests the author didn’t understand the relationship or didn’t care about it. Maybe the author didn’t realise the process could be interrupted by a received message. But even then, whether or not it could be interrupted, it’s never a good idea to create A before B when A depends on B. You always need to do things in the right order, because to do otherwise is inviting chaos.
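One way to make such a dependency hard to overlook is to put it in the function signature. The Python sketch below is my own suggestion rather than something the original code did: the LoginWindow class and the window parameter are hypothetical, and the listener thread again just simulates one incoming broadcast.

```python
import threading

class LoginWindow:
    """Hypothetical stand-in for the real login window."""

    def __init__(self):
        self.servers = []

    def add_server(self, name):
        self.servers.append(name)

def create_udp_listen_socket(window):
    """Start listening; the window is a required argument, so it must already exist."""
    def on_server_broadcast(name):
        window.add_server(name)

    # Simulate one incoming broadcast on a separate thread.
    listener = threading.Thread(target=on_server_broadcast, args=("server-1",))
    listener.start()
    return listener

window = LoginWindow()                       # B has to be created first...
listener = create_udp_listen_socket(window)  # ...because A cannot be called without it
listener.join()

print(window.servers)
```

Written this way, putting the two calls in the wrong order is no longer a silent timing bug. The listen socket simply cannot be created without a window to hand over.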

If you’ve got similar issues with your software product and you don’t know how to solve them, contact us, and we’ll see what we can do.