This is the second part of me putting in ink (well, digital ink) a list of tools/rules/principles I’m utilizing over and over again in my job as a software engineer, designer and architect.
In the previous post, I covered four principles: the “KISS” principle and the importance of keeping things simple; the “Single Responsibility” principle, which goes a long way toward keeping things simple; “Eating your own dog food” as a technique that improves the chances your APIs and infrastructure will be tasty (useful) to your users; and the “Five Whys” technique, which lets us dig into the root causes of problems. You are invited to read more about these in my previous post: Tools for Good Software Design.
Now for the main course:
The “Main Flow” principle is about making your core functionality actually work when you most need it, by tying it into your main service flow. It is best explained with a negative example: say you meticulously run your backup every day, but you run your restore only once a year, usually when disaster hits. What are the chances the restore will succeed? How many times have you heard the phrase “the restore did not work”? More often than not the restore fails, and only then do you realize what went wrong. Unfortunately, that is usually too late, and your data is lost.
Let’s analyze this scenario: backup and restore is really one piece of functionality that has been cut in two: backup runs every day, restore only once in a while. We are routinely exercising only half of the functionality, the backup. This leaves the functionality not fully operational most of the time, and thus very fragile.
This is where the “Main Flow” principle comes into play: if parts of your system are only occasionally active, try to make them active all the time by adding them to the main functionality flow. This way you know every part of your system is actually working, simply because it is in use all the time. If something breaks, you see the impact on all parts of the system immediately, and scenarios like a great backup with no restore become much, much rarer.
So how can you identify deviations from the main flow?
This can be tricky, but generally, if you have two distinct processes originating from the same source, you should sniff around and see whether you can avoid the split in some fashion.
For example: sending an event (over the network) on the success of an action sounds straightforward enough. But if you need a high success rate for these events, you might consider persisting them to disk first. And since you already have them in memory, there is no need to read them back from disk in order to send them, right? Wrong! How would you know your persistence mechanism works? This is almost identical to the backup/restore problem, just at the single-event scale.
One “easy” approach is to write a mechanism that handles failures in sending events, writing and reading events to a file only on failure. Now you should be seeing the warning signs: you have a process that runs only at specific times (when your event bus or network is down), not always. It is definitely not part of your main flow, and the downside, again, is that if this process fails you might not know it until it is very painful, and by then you have already lost a lot of data.
One way of making this process part of your main flow while still keeping high consistency is to always go through the disk, paying for consistency with the overhead of writing and reading from disk on every event. Another option you might consider is using multiple clusters of messaging infrastructure to receive these events and not persisting events locally at all. Looking into these kinds of issues forces you to clarify the SLAs you actually need; it may turn out that event consistency is not as important as the speed of sending them.
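Here is a minimal Ruby sketch of the “always go through the disk” approach. All names here (`EventLog`, `append`, `drain`) are hypothetical, invented for illustration; the point is only that the sender reads events back from the file, so the persistence path is exercised on every single event rather than only during an outage:

```ruby
require "json"
require "tmpdir"

# Every event is appended to a log file first, and the sender only ever
# reads events back from that file. If persistence breaks, the main flow
# breaks loudly and immediately, not silently during a network outage.
class EventLog
  def initialize(path)
    @path = path
  end

  # Main flow, step 1: persist the event. We never send straight from memory.
  def append(event)
    File.open(@path, "a") { |f| f.puts(JSON.generate(event)) }
  end

  # Main flow, step 2: read persisted events back and hand each one to the
  # given block, which stands in for the real network call.
  def drain
    sent = []
    File.foreach(@path) do |line|
      event = JSON.parse(line)
      yield event
      sent << event
    end
    File.truncate(@path, 0) # events delivered, clear the log
    sent
  end
end

path = File.join(Dir.tmpdir, "main_flow_demo_#{Process.pid}.log")
log = EventLog.new(path)
log.append({ "type" => "user_signup", "id" => 1 })
log.append({ "type" => "user_signup", "id" => 2 })
delivered = log.drain { |e| e } # a no-op "sender" for the example
```

Note what this buys you: a disk that cannot be written or read stops the very first event, so you find out on day one, not on the day your event bus goes down.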
Note that putting things on the main flow creates two outcomes:
- Your system is checking itself, and you get fewer surprises from code that runs only once in a while.
- You actually ask yourself what is really part of your system and what is not, which leads to better design and a more stable system.
This is also where the “Main Flow” principle corresponds with the “Single Responsibility” principle. If you find such a side process, it may be that it does not belong in this service at all, and you can take it out altogether.
I’m ending this point about the “Main Flow” principle the way any good writing should end, with a semi-related quote from Yogi Berra:
“When you come to a fork in the road, take it.”
Here is a story told by Gary A. Klein, in his book “Sources of Power”:
The nurse had been watching the newborn baby for a few hours. She suspected something was wrong: the baby’s color was fluctuating between a healthy pink and a duller, more troubling hue. Suddenly the baby turned a deep blue-black, an indication of a serious problem. The team called for an X-ray technician and a doctor, assuming a collapsed lung, a common problem for babies on ventilators. The typical response is piercing the chest and inserting a tube to suck out the air around the collapsed lung, allowing it to re-inflate.

But the nurse suspected otherwise. She thought it might be a heart condition in which air fills the sack surrounding the heart, pressing it inward and preventing it from beating. She had seen this before, but that time the baby died before the problem could even be diagnosed. She tried to stop the frantic preparations to treat the lungs. “It is the heart,” she said. The other medical personnel pointed out that the heart monitor showed the baby’s heart was fine: a steady, normal newborn 130 beats per minute! The nurse didn’t cave. She called for quiet and used a stethoscope to check the baby’s heart, and there was nothing. No heartbeat. She was right; the problem was in the heart. When the chief doctor came in, she put the syringe in his hands and demanded that he perform the procedure for the heart. The X-ray technician, finally getting results from his scans, confirmed the nurse’s diagnosis. The doctor guided the syringe into the baby’s heart and slowly released the pressure around it. The baby’s life was saved, and its color returned to normal.
Later, puzzled by the fact that the heart monitor had shown a normal heartbeat, they realized what had happened: the monitor was sensing the electrical pulses of the heart, not its actual beating. The heart was generating the correct electrical pulses, but was not beating because of the air pressure around it.
Now, this is a story about defying consensus and saving a life. But there is another angle I would like to stress: even a slight understanding of how the heart monitor works is what made reaching for a stethoscope an option for the nurse.
This is the essence of mechanical sympathy. Understanding the tools you use allows you to use them in the most suitable way and to know their limitations. In this case, it resulted in saving a life. In software engineering, it is usually about building a higher-quality product (which might, in some cases, indirectly save lives).
The term “Mechanical Sympathy” was coined by Jackie Stewart, a three-time Formula 1 world champion. He believed the best drivers had enough understanding of how the car worked to work in harmony with it. You don’t have to be able to build a car from scratch. But in order to be a better driver, you should have a decent understanding of how the engine, gears and suspension work, so you can fully utilize the car.
Think about the best software developers you have ever worked with. I bet they were exactly like that: taking nothing for granted, poking into how the libraries, services and infrastructure they use actually work, constantly learning more about their tools.
For us software engineers, “Mechanical Sympathy” is about understanding the underlying tools, environments and runtimes that we use. It is about understanding how your program behaves when it executes. It is not focused on design, architecture, coding paradigms or coding styles, which are important in their own right. It is about what your system is actually doing at runtime: memory allocation, network access, CPU utilization and disk access. It’s about the behavior of your software inside its infrastructure: operating systems, virtualization, containers, compilers, JIT compilers, garbage collection, framework behavior, database behavior and so on.
If you work with a framework, know its principles; understand what actually happens when you call some method or use some functionality. The best way of doing this is to open the framework’s code, or even debug through it. It is a good idea anyway: there is a good chance you will learn very interesting things from the code of a popular framework. More than that, some things will surprise you. Implementation details will make performance implications and side effects clearer, and you will learn to use the framework much more effectively.
For example, take the Ruby on Rails framework: do you know how access to the DB is done? Does it load data eagerly or lazily? When can an N+1 query problem occur, and how can you prevent it? If you do not know the answers to these questions, you will most likely stumble on them while using Rails, probably in production, on a Friday night, at 3:00 AM.
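To make the N+1 problem concrete without pulling in Rails itself, here is a framework-free Ruby sketch. `FakeDB`, its data, and all method names are made up for the example; the fake database simply counts how many queries it receives, which is exactly the number an ORM would send to a real database:

```ruby
# A stand-in for a database that counts queries instead of running SQL.
class FakeDB
  attr_reader :query_count

  def initialize
    # Made-up data: comments keyed by post id.
    @comments = { 1 => ["nice!"], 2 => ["+1"], 3 => ["great post"] }
    @query_count = 0
  end

  def posts
    @query_count += 1     # "SELECT * FROM posts"
    @comments.keys
  end

  def comments_for(post_id)
    @query_count += 1     # "SELECT * FROM comments WHERE post_id = ?"
    @comments[post_id]
  end

  def comments_for_all(post_ids)
    @query_count += 1     # "... WHERE post_id IN (...)", a single query
    @comments.slice(*post_ids)
  end
end

# Lazy (N+1) style: one query for the posts, then one more per post.
lazy_db = FakeDB.new
lazy_db.posts.each { |id| lazy_db.comments_for(id) }

# Eager style (in spirit, what eager loading does): two queries total,
# no matter how many posts there are.
eager_db = FakeDB.new
eager_db.comments_for_all(eager_db.posts)
```

With 3 posts the lazy version issues 4 queries and the eager version 2; with 10,000 posts the gap becomes 10,001 versus 2, which is precisely the kind of runtime behavior mechanical sympathy teaches you to look for.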
Take a hard look at the languages you are using. A language is your main tool for solving any problem. How deeply do you know it? What is it good at? What is it not? Is it compiled, and what are the implications of that? Are you running on a VM like the JVM or Microsoft’s CLR? Do you have garbage collection? How does it work? Can you configure or control it? In a garbage-collected runtime, tuning the GC to your application’s needs can be invaluable. Moreover, understanding how your garbage collector works can change the way you create objects in your system.
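Staying with Ruby, here is a small experiment along those lines: `GC.stat` exposes how many objects the runtime has allocated, so we can measure the cost of an innocent-looking choice. `String#+` builds a brand-new string on every call, while `String#<<` mutates in place, so the first approach allocates far more objects and so creates far more GC pressure:

```ruby
# Count how many objects a block of code allocates, using the runtime's
# own bookkeeping (GC.stat with the :total_allocated_objects key).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

plus_allocs = allocations do
  s = +""                       # unfrozen string literal
  1_000.times { s = s + "x" }   # a brand-new String object per iteration
end

shovel_allocs = allocations do
  s = +""
  1_000.times { s << "x" }      # mutates the same String object in place
end
```

Run it and `plus_allocs` comes out vastly larger than `shovel_allocs`: same observable result, very different runtime behavior. That is mechanical sympathy applied to a single line of code.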
The hardware you are using: are you aware of its power and its limits? Here is a table with some numbers (for a ~3 GHz CPU); this is what the hardware can (or can’t) deliver:
| Operation | Latency | Notes |
|---|---|---|
| CPU access to registers/buffers | < 1 ns | How fast your CPU is when accessing a register |
| CPU access to L1 cache | ~4 CPU cycles, ~1 ns | CPU fetches a 64-byte cache line |
| CPU access to L2 cache | ~12 CPU cycles, ~3 ns | 333 times per microsecond |
| CPU access to L3 cache | ~40 cycles, ~15 ns | 67 times per microsecond |
| CPU access to L3 cache (dirty hit) | ~60 cycles, ~20 ns | 50 times per microsecond |
| QPI (bus) | ~40 ns | 25 times per microsecond |
| Main memory access | ~110 cycles, ~65 ns | 15 times per microsecond |
| Send 1 KB over a 1 Gbps network | ~10,000 ns = 10 µs | ~150x slower than main memory access |
| Read 4 KB randomly from SSD | ~150,000 ns = 150 µs | ~1 GB/sec |
| Round trip within the same datacenter | ~500,000 ns = 500 µs | Half a millisecond, packet sent and received |
| Disk seek | ~10,000,000 ns = 10,000 µs = 10 ms | Magnetic disks, 20x a datacenter round trip |
| Send packet CA -> Netherlands -> CA | ~150,000,000 ns = 150,000 µs = 150 ms | ~1.5 KB over ~9,000 km each way |
These numbers are interesting; see how latency grows as you get further away from the CPU. Access to cached data is in nanoseconds. Accessing main memory is way slower, around a tenth of a microsecond. Then comes the real drop: accessing the network and disk is at least a thousand times slower. And when we go across the WAN (the internet), latencies are in the hundreds of milliseconds.
If you are writing a web application or a website, the I/O numbers, especially over the WAN, should be very interesting to you. In the long-distance case, physics is very strict about its rules: for every 1 km, it takes light ~3 1/3 microseconds to travel. No way around it! In truth, our electronics are slower still, since every hop along the road (router, network switch, gateway, firewall, etc.) takes its toll, so it’s actually worse. To make it concrete: we can never get a packet from California to Europe (~9,000 km) in less than 30 milliseconds, even if we had a clear line of sight and used mirrors and lasers; physics just does not allow it.
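The arithmetic behind that claim fits in a few lines of Ruby. This is a back-of-the-envelope check under the stated assumptions (light in a vacuum, ~9,000 km, no routing overhead), so it is a hard lower bound, not a prediction of real latency:

```ruby
# Light in a vacuum covers ~299,792 km per second, i.e. ~3 1/3 µs per km.
LIGHT_SPEED_KM_PER_S = 299_792.458

# The physical floor on one-way latency for a given distance.
def min_one_way_latency_ms(distance_km)
  distance_km / LIGHT_SPEED_KM_PER_S * 1000.0
end

ca_to_europe_ms = min_one_way_latency_ms(9_000)  # ~30 ms one way, at best
round_trip_ms   = ca_to_europe_ms * 2            # ~60 ms round trip, at best
```

Compare the ~60 ms round-trip floor with the ~150 ms measured in the table above: light in fiber is roughly a third slower than in a vacuum, and every hop adds its toll, which is where the rest goes.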
On top of that, working in the cloud adds an overhead estimated to be as high as 30% to most of the numbers above.
Bottom line: the more you know about the tools and infrastructure you use, their limits and their strengths, the better you can use them. Make sure you learn something new about the tools you use every day.
I have a couple more tools I could talk about, but I think I’ll call this a post :-D. So, to recap: following the “Main Flow” principle means that when you write software, you pay attention to reducing detours and things that do not happen all the time; the more you do that, the more stable your code will be. As for “Mechanical Sympathy”: learning more about the tools you use will make you much better at your job! Learn something new about them every day; this is what real professionals do. And of course, another semi-related quote from Yogi Berra:
“You can observe a lot by just watching.”
I’ll end with an insight. It is not really a tool, more a rule of how things work in life:
“Culture eats strategy for breakfast”
Think about it…