Firmware Auto-Updates for GridSpy

I’d like to dive into some technical detail today about those automatic updates that everyone takes for granted.

If you are a web developer, I wouldn’t blame you for thinking that deploying updated software is (mostly) painless and simple. Once the latest version of our code is live on the server, all our customers are essentially updated. Because a web developer can control the set-up of the web server it is very easy to deploy quickly and predictably.

Even the desktop folks have it pretty easy. There are lots of automated ways to check for new versions of code, and just as many ways to install the latest version onto the client’s PC. If things go wrong, as things often do due to both human and hardware issues, there is a human involved who can interact with the computer to fix things.

Then there are hardware devices, each with its own custom communications system. Although hardware vendors completely control the design of the hardware, there are very few standard tools, and these devices range from hard to impossible to access.

GridSpy microcontroller and surrounding electronics

Since GridSpy is a hardware solution combined with a web service, we live at both extremes. On one side we have a web server that is easy for us to update, so keeping all our clients on the bleeding edge is a snap. On the other side we have our custom hardware devices that are to be installed in homes and businesses around the world. Most of these devices will rarely be accessed by humans; they are out of sight and out of mind in a locked cabinet somewhere.

Now that would be just fine if we didn’t want to continually add new features to our products. For example, we’d like to add info on the GridSpy dashboard showing glitches such as surges and transients in your power supply. To support this new feature we’d need new code in our Nexus to detect, log and upload these transients. It is a rare feature that does not require any changes at all to the firmware, and a lot of our competitive advantage comes from having a completely flexible smart device installed in our customers’ offices. Without a remote update we would either have to ship new Nexuses to all our clients whenever we performed a firmware update, or leave our early adopters behind as our solution evolves. A remote update gives us the agility we need.

There are also selfish reasons to keep all our deployed devices up to date. Each of our devices calls home and interacts with our server code 24/7. Our ongoing communication channel allows us to instantly “Push” instructions to the Nexus. This lets us deploy a wide range of interactive features, from instantly switching loads and opening windows to setting the temperature in your building, all from our web interface. If we didn’t upgrade all our devices then the server would need to work with every version of our code that we ever release. So deploying firmware updates in a timely manner saves us effort, as we only need to worry about support for the most recent few versions.

The microcontroller at the heart of our devices stores its instructions in 64 KB of flash. This web page alone takes up more than double the size of our compiled code. The instructions are like any other file on your computer, just a lump of data that has to be copied into the microcontroller exactly right. There is no operating system on there, so every instruction on that chip is our responsibility. On the lab bench, or in the factory, we use a debugging interface called JTAG to load the code into the chip. But we can’t use JTAG in the field, since we’d need physical access to every Nexus any time we wanted to deploy an update.

Fortunately for us there is a mechanism inside the microcontroller where we can ask it to reprogram one of the flash bytes and store a new value. We have to erase the surrounding ‘page’ of 512 bytes before we write new data. Since the microcontroller reads instructions directly out of flash, we must ensure that we do not erase the code that is currently running. With this facility it is possible to install a new code version into the chip, but first we must handle all the detail of getting the code into the chip and creating the reprogramming process.
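
To make the erase-before-write rule concrete, here is a minimal Python model of it. This is a sketch for illustration only, not firmware code: the `Flash` class, the `update_bytes` helper and the idea that erased bytes read back as 0xFF are all assumptions. Only the 64 KB size and the 512-byte page come from the text above.

```python
PAGE_SIZE = 512          # erase granularity stated in the post
FLASH_SIZE = 64 * 1024   # 64 KB of on-chip flash
ERASED = 0xFF            # assumption: erased bytes read back as 0xFF


class Flash:
    """Toy model of flash with page-erase-before-write semantics."""

    def __init__(self):
        self.mem = bytearray([ERASED] * FLASH_SIZE)

    def erase_page(self, page):
        start = page * PAGE_SIZE
        self.mem[start:start + PAGE_SIZE] = bytes([ERASED] * PAGE_SIZE)

    def write(self, addr, data):
        # Refuse to program bytes that have not been erased first.
        if any(b != ERASED for b in self.mem[addr:addr + len(data)]):
            raise ValueError("target bytes must be erased before writing")
        self.mem[addr:addr + len(data)] = data


def update_bytes(flash, addr, data):
    """Change a few bytes by saving, erasing and rewriting each affected page."""
    first = addr // PAGE_SIZE
    last = (addr + len(data) - 1) // PAGE_SIZE
    for page in range(first, last + 1):
        start = page * PAGE_SIZE
        saved = bytearray(flash.mem[start:start + PAGE_SIZE])
        lo = max(addr, start)
        hi = min(addr + len(data), start + PAGE_SIZE)
        saved[lo - start:hi - start] = data[lo - addr:hi - addr]
        flash.erase_page(page)
        flash.write(start, saved)
```

Even in this toy version, changing four bytes that straddle a page boundary means erasing and rewriting two whole pages, which is why the running code has to live somewhere the update never touches.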

So how do we load new code into a microcontroller? The trick to a self-reprogramming microcontroller is to separate the reprogramming instructions from the actual application, kind of like storing two files side by side. These instructions become two separate programs that share the flash - the main program and a bootloader that faithfully takes care of it. Most of the time the bootloader stays out of the way. Occasionally the main code asks the bootloader to load new code.
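
One way to picture the two programs sharing the flash is as two page-aligned regions, with a guard that stops the bootloader from ever erasing itself. The sketch below is a Python model under stated assumptions: the 8 KB bootloader size, its position at the start of flash, and the names `pages_to_erase` and `check_erase_allowed` are all hypothetical and not taken from the GridSpy firmware.

```python
PAGE_SIZE = 512
FLASH_SIZE = 64 * 1024
BOOTLOADER_SIZE = 8 * 1024   # hypothetical; the post does not give the split

# Page numbers owned by each program (bootloader assumed to sit first).
BOOT_PAGES = range(0, BOOTLOADER_SIZE // PAGE_SIZE)
APP_PAGES = range(BOOTLOADER_SIZE // PAGE_SIZE, FLASH_SIZE // PAGE_SIZE)


def pages_to_erase(image_size):
    """Return the application pages an image of image_size bytes would occupy."""
    needed = -(-image_size // PAGE_SIZE)  # ceiling division
    if needed > len(APP_PAGES):
        raise ValueError("image does not fit in the application region")
    return list(APP_PAGES)[:needed]


def check_erase_allowed(page):
    """Raise if erasing this page would destroy the bootloader itself."""
    if page in BOOT_PAGES:
        raise PermissionError(f"page {page} belongs to the bootloader")
    if page not in APP_PAGES:
        raise ValueError(f"page {page} is outside flash")
```

With these made-up numbers the bootloader owns 16 pages and the main program gets the remaining 112.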

Heaps of flash storage for your data

At GridSpy we ensure that there is plenty of memory space external to the microcontroller to store lots of your power samples and other data. Before the bootloader is invoked we transfer fresh code from the server to the Nexus and save it into this external memory, ready to load. The bootloader finds this code and checks to ensure that it is valid. It then writes a note to itself at the end of the memory used for code so that it remembers the version of both the new and old programs. It then begins copying the code itself. When finished, the bootloader crosses its digital fingers and starts the new code.
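
The post does not say how the bootloader decides that a staged image is valid. A common approach is a small header carrying a length and a checksum, which the sketch below models in Python. The header layout, the `GSFW` marker and the choice of CRC-32 are assumptions made for illustration, not the actual GridSpy format.

```python
import struct
import zlib

# Hypothetical header: 4-byte marker, 16-bit version, 32-bit length, 32-bit CRC.
HEADER = struct.Struct("<4sHII")
MAGIC = b"GSFW"  # made-up marker for this sketch


def pack_image(version, code):
    """Server side: prepend a header describing the compiled code."""
    return HEADER.pack(MAGIC, version, len(code), zlib.crc32(code)) + code


def validate_image(blob):
    """Bootloader side: return (version, code) if blob is intact, else None."""
    if len(blob) < HEADER.size:
        return None
    magic, version, length, crc = HEADER.unpack_from(blob)
    code = blob[HEADER.size:HEADER.size + length]
    if magic != MAGIC or len(code) != length or zlib.crc32(code) != crc:
        return None
    return version, code
```

A truncated download or a single flipped byte makes `validate_image` return `None`, so the bootloader can refuse to start copying and leave the existing program alone.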

It is crucial that no matter what happens the Nexus successfully wakes up and reconnects to our servers. The worst possible situation would be to permanently “brick” a Nexus in the field, especially if that Nexus is hard for our customers to access.

We have several mechanisms to prevent bricking a Nexus

When the Nexus turns on, the bootloader always gets a chance to check the main program before it starts. It quickly checks its own notes to see whether there is a complete copy of the code loaded or whether it was halfway through a reprogram. If the main program is incomplete, the bootloader will continue copying the code from external memory into flash. Once the bootloader is convinced that the code on the micro matches the image sent to the Nexus by our servers, it will start running the new code.
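
The resume-on-boot behaviour can be sketched as a copy loop that is safe to run again from the start. The Python model below assumes the note is a simple record with a `state` field and simulates a power cut with an exception. The names `copy_image`, `boot` and `PowerLoss` are hypothetical.

```python
PAGE_SIZE = 512


class PowerLoss(Exception):
    """Simulated power cut in the middle of a reprogram."""


def copy_image(flash, note, image, fail_after=None):
    """Copy image into flash page by page, recording intent in the note first."""
    note["state"] = "copying"
    pages = [image[i:i + PAGE_SIZE] for i in range(0, len(image), PAGE_SIZE)]
    written = 0
    for index, page in enumerate(pages):
        start = index * PAGE_SIZE
        if bytes(flash[start:start + len(page)]) == page:
            continue  # this page already matches the image, skip it
        if fail_after is not None and written == fail_after:
            raise PowerLoss()
        flash[start:start + PAGE_SIZE] = page.ljust(PAGE_SIZE, b"\xff")
        written += 1
    note["state"] = "complete"


def boot(flash, note, image):
    """Run at every power-up: finish an interrupted copy before starting."""
    if note.get("state") == "copying":
        copy_image(flash, note, image)
    if note.get("state") == "complete":
        return "run main program"
    return "stay in bootloader"
```

Because pages that already match are skipped, an interrupted copy picks up where it left off, and repeating the whole procedure after any number of power cuts ends in the same state.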

But what if the new code is bad? There is always the possibility that we accidentally send out code for updating that is not ready for prime time. It is a human mistake that can happen all too easily in the heat of the moment. Should we get this wrong, hundreds of Nexuses across the world could upgrade to this new version and then fail to call back to the server. The reasons could be subtle and this may only occur on certain devices, say those manufactured at a certain time or with features that were disabled in the factory. We would like our Auto-Update system to detect ‘bad code’ and automatically downgrade to the previous working version.

But how do we know that the code is good? To us, ‘Good code’ is code that can call back to our server, code which we can upgrade later over the wire.

So, after a reprogram the Nexus must call into the server, and once the server is happy that the Nexus is functioning correctly it gives its seal of approval. We can add as many checks on the server side as we like before we ‘okay’ the upgrade. If the Nexus reboots several times without getting a seal of approval, the bootloader downgrades the firmware to the last working version. Combined with a “watchdog” which reboots the Nexus when it crashes, we can be fairly sure that a Nexus with bad firmware will spring back to life some time later.
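
The seal-of-approval logic boils down to a boot counter that only the server can reset. Here is a minimal Python model of that idea. The limit of three unapproved boots, the `Bootloader` class and its method names are assumptions, since the post only says "several times".

```python
MAX_UNAPPROVED_BOOTS = 3  # hypothetical; the post only says "several"


class Bootloader:
    def __init__(self, active, previous):
        self.active = active          # version currently in flash
        self.previous = previous      # last version known to work
        self.approved = True
        self.unapproved_boots = 0

    def install(self, version):
        """Switch to a new version that has not yet been approved."""
        self.previous, self.active = self.active, version
        self.approved = False
        self.unapproved_boots = 0

    def on_boot(self):
        """Called on every power-up or watchdog reset; returns version to run."""
        if not self.approved:
            if self.unapproved_boots >= MAX_UNAPPROVED_BOOTS:
                # Too many tries without a seal of approval: roll back.
                self.active = self.previous
                self.approved = True
                self.unapproved_boots = 0
            else:
                self.unapproved_boots += 1
        return self.active

    def on_server_approval(self):
        """Called when the server confirms the new firmware is healthy."""
        self.approved = True
        self.unapproved_boots = 0
```

For example, installing "1.5" over "1.4" and then rebooting four times without approval runs "1.5" three times and falls back to "1.4" on the fourth boot.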

Finally, the rollout of new firmware is carefully monitored server-side. The first Nexuses to be upgraded are those that are physically easy for us to visit: those at our beta testers’ homes, our own homes and at clients here in Auckland, New Zealand. From there we ensure that we roll out the new version gradually. If too many Nexuses fail to upgrade correctly, we will halt the rollout until we have checked the situation ourselves. This ensures that it will be hard to brick a great number of Nexuses in one go.
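
A gradual rollout like this can be modelled as a list of waves, ordered from easiest to hardest to visit, with a failure threshold that stops everything. The Python sketch below is illustrative only: the `staged_rollout` function, the 10% threshold and the device names are assumptions, and `upgrade` stands in for whatever the server does to update one Nexus and confirm that it called back.

```python
def staged_rollout(waves, upgrade, max_failure_rate=0.1):
    """Upgrade devices wave by wave; halt if a wave fails too often.

    Returns (upgraded_devices, index_of_halted_wave_or_None).
    """
    upgraded = []
    for index, wave in enumerate(waves):
        failures = 0
        for device in wave:
            if upgrade(device):
                upgraded.append(device)
            else:
                failures += 1
        if wave and failures / len(wave) > max_failure_rate:
            return upgraded, index  # stop here until a human has a look
    return upgraded, None


# Usage with made-up device names: two Auckland units fail, so the
# worldwide wave is never attempted.
waves = [["office", "home"], ["akl-1", "akl-2", "akl-3"], ["world-1", "world-2"]]
bad = {"akl-2", "akl-3"}
done, halted_at = staged_rollout(waves, lambda device: device not in bad)
```

The important property is that later waves are never touched once an earlier wave crosses the threshold.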

As I publish this blog entry I am hard at work on our Firmware Auto-Updating system, which as you now know is a subtle and complex beast. It remains one of the few hurdles that we must cross before we can offer our devices for sale across the world or in large volume.

If you would like to learn more you could read how GridSpy has up to the second data, why we manufacture custom sensors or just check out the GridSpy homepage.


Nexus in a photo - 3 sensors plus ethernet.