Authors: Toby Weir-Jones, Andres Andreu and Vincent Turmel
Much has been written about the general state of security controls within industrial automation environments. Many of these articles, conference presentations, and marketing pieces say broadly the same thing: “You face an insurmountable problem, and the only practical solution is to invest in my product since it will fill the proverbial gap.”
The issue, however, is not that simple. Network security product vendors obviously have a degree of bias in making such recommendations, but the reality is that most OT networks need a lot of preparation first, before any product is likely to deliver useful results.
The goal of this article is to introduce some improvements which should be made by the network’s owner, or at least by a competent professional services organization working under that owner’s authority. We will present certain investigations which should be completed, and can be implemented without any complex new technology. We will also indicate how those efforts will help the network owner prepare better for the possibility of investment in ICS technologies in the future.
The state of OT network security mirrors typical enterprise networks in the late 1990s and early 2000s.
Connectivity has increased far faster than any other attribute (such as security features, or network segmentation, or automated management) within the network, and the network owners are struggling to know what new risks have been exposed to their production and forecast safety as a result. Technology vendors, whether the OEMs for their OT equipment, or the networking hardware itself, are adding new features which create network and data visibility in unexpected ways. Most deployments do not aim to restrict existing products; instead, they seek to leverage the highest perceived-value features in an effort to drive more efficiency and greater ROI.
We have seen these patterns before. When the first popular firewalls gained significant acceptance, the buyers had all heard the same story: these products would “solve” their security problems, because the only use case presented was one of an unauthorized external user gaining access.
Customers didn’t know enough to anticipate other threats, and didn’t invest the time to research the question independently. So while those firewalls addressed a use case, they didn’t really solve the overall problem. Eventually the conversation matured to include notions of network architecture, and internal security controls, and intrusion detection/prevention systems. But by then the high-profile threats were shifting away from network layers and towards applications and databases.
In the intervening ~20 years, our overall understanding of myriad security threats – at the buyer’s level, not the vendor’s – has certainly advanced, and there is more concurrence on best practices than ever before. But just as that knowledge level has advanced, so have network and application complexity, so the threat landscape remains an “unsolved” problem. We implement layered approaches to try to contain cascading faults, and we invest in mitigation strategies which aim to (1) limit the damage when bad things happen and (2) make it difficult enough to attack that the bad guys look elsewhere.
What, then, are the lessons we should be able to learn faster, and better, instead of waiting another 20 years for IA network security to go through the same evolution?
1. OT network managers, when they are even present, don’t know enough about not only assets and connectivity, but also sources, destinations, and content of activity;
2. OT network security analysis requires different skills from enterprise security analysis, and IT analysts are kept at arm’s length;
3. Service and maintenance requirements, whether under contract or on demand, for OT OEM equipment directly impact available security risk mitigation options;
4. The relative importance of maintaining production, and honoring aspects such as bounded latency constraints, within the OT network directly impacts available incident response strategies; and
5. It is better to add perimeter and transactional controls now which work alongside your existing technology stack than to wait for a chance to do a full re-architecture from scratch.
Let’s look at each of these lessons in order.
1. OT network managers don’t know enough about not only assets and connectivity, but also sources, destinations, and content of activity
This means you should start by exercising the discipline and rigor necessary to build an inventory of what’s on your network, and be at least 90% confident it’s accurate. While building the inventory, using whatever tools your skills and policies permit, also think about documenting the process you’re following, to ensure you’re going to be compatible with existing procedures and able to expand your method when other groups within the business need updated information.
Furthermore, understanding not only what flows are present, but what information they’re carrying and why, is critical. This is harder to do, since you have to get into lower-level network traffic analysis, but you’ll probably discover the message traffic and protocol handshakes / transmissions happening all over your network are somewhat more complex (and potentially fragile) than you realized.
Even so, becoming familiar with Wireshark and knowing which ports and protocols are in use will only help you down the road.
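As a rough illustration of the inventory and flow-mapping exercise described above, here is a minimal Python sketch. It assumes you have already exported passively captured traffic to CSV; the column names (`src`, `dst`, `proto`, `dst_port`), the function name `summarize_flows`, and the sample addresses are all hypothetical and would need to match whatever your capture tool actually produces.

```python
import csv
import io
from collections import Counter


def summarize_flows(csv_text):
    """Return (assets, flows) from a CSV export of passively captured traffic.

    assets: set of every address seen as a source or destination.
    flows:  Counter keyed by (src, dst, proto, dst_port) with packet counts.
    The column names are assumptions, not a standard export format.
    """
    assets = set()
    flows = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["src"], row["dst"], row["proto"], int(row["dst_port"]))
        flows[key] += 1
        assets.update((row["src"], row["dst"]))
    return assets, flows


# Illustrative data only. TCP 502 is the registered Modbus/TCP port and
# TCP 44818 is the registered EtherNet/IP port.
SAMPLE = """src,dst,proto,dst_port
10.0.0.5,10.0.0.20,tcp,502
10.0.0.5,10.0.0.20,tcp,502
10.0.0.7,10.0.0.20,tcp,44818
"""

assets, flows = summarize_flows(SAMPLE)
print(f"{len(assets)} assets seen")
for (src, dst, proto, port), count in flows.most_common():
    print(f"{src} -> {dst} {proto}/{port}: {count} packets")
```

Because the script only reads an offline export, it adds no traffic to the production network. The resulting flow table is a starting point for asking why each conversation exists, not a finished inventory.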
2. OT network security analysis requires different skills from enterprise security analysis
An enterprise security analyst needs to know how to push patches, run scans, enable encryption, and interact across segments, VLANs, and domains in order to get all the information services to talk to each other in as secure a matrix as possible. The OT security analyst, in contrast, needs to know first and foremost how to avoid knocking over the OT network while learning useful things. This means, for most networks:
i) No or limited scans
ii) No active probing for information
iii) No ad hoc patching
iv) No ‘flipping the bit’ to enable encryption
v) No ‘emergency maintenance’ without approval from Ops/Production/Finance
These are some fairly weighty limitations, but they do not preclude solid forensics and analysis. You’ll want to have an ability to enable network taps all over your IA network; you’ll need to be able to do packet captures and recording for offline analysis and testing; and you’ll want to understand which nodes – even if they only ever receive instructions from your HMI or an engineering workstation – still generate ‘write’ instructions as part of the power-up/handshaking/health checking functions.
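To make the point about unexpected ‘write’ instructions concrete, the sketch below assumes your network speaks Modbus (the article does not say which protocols you run, so treat that as an assumption) and that you have decoded a packet capture offline into simple (source address, function code) pairs. The record format and the name `find_writers` are hypothetical; the function codes themselves are the standard Modbus write operations.

```python
from collections import defaultdict

# Modbus function codes that modify device state.
WRITE_FUNCTION_CODES = {
    5: "Write Single Coil",
    6: "Write Single Register",
    15: "Write Multiple Coils",
    16: "Write Multiple Registers",
    22: "Mask Write Register",
    23: "Read/Write Multiple Registers",
}


def find_writers(records):
    """Map each source address to the set of write operations it issued.

    `records` is an iterable of (source_address, function_code) pairs,
    a hypothetical format produced by decoding a capture offline.
    """
    writers = defaultdict(set)
    for src, code in records:
        if code in WRITE_FUNCTION_CODES:
            writers[src].add(WRITE_FUNCTION_CODES[code])
    return dict(writers)


# Illustrative records: the first node writes as expected, but so does
# a second node that was assumed to be read-only.
records = [
    ("10.0.0.5", 3),    # Read Holding Registers
    ("10.0.0.5", 16),   # Write Multiple Registers
    ("10.0.0.31", 4),   # Read Input Registers
    ("10.0.0.31", 6),   # Write Single Register
]

for src, ops in sorted(find_writers(records).items()):
    print(src, sorted(ops))
```

Comparing this list against the nodes you expect to issue writes is one way to surface devices whose power-up or health-check behavior is more active than assumed.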
3. Service and maintenance contracts for OT OEM equipment directly impact available security risk mitigation options
This is enormously important and is probably the main reason you have two problems about which your enterprise colleagues will mock you relentlessly:
i) You have to run ancient versions of Windows or other desktop operating systems, because updating to newer versions would require application updates and production downtime; and
ii) In spite of the old operating systems, limited segmentation, and flat network designs, you are expected to provide remote access into your OEM equipment so your supplier can do maintenance and health checks, or because entire sub-processes of production are performed remotely.
The best thing you can do to plan for this is make sure you have all the real technical information about how your suppliers are dictating these issues. Your service contracts might say fuzzy things such as ‘[Customer] will enable [supplier] to maintain and manage equipment in accordance with warranty requirements’ but you need to go deeper. If you own the equipment, demand better compliance with your own standards. If you lease or rent, do what you can to contain their activity to only their equipment. Talk to the supplier technicians and whoever on your team gets them hooked up, and document each and every tool and network session required for a typical warranty lifecycle.
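One possible way to keep that documentation usable is to record each supplier session as structured data, so that observed traffic can later be compared against it. The following Python sketch is only an illustration: the `VendorSession` fields, the supplier name, the addresses, and the `undocumented` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorSession:
    """One documented supplier tool and the network session it requires."""
    supplier: str
    tool: str
    target: str   # address of the supplier's equipment on your network
    proto: str
    port: int
    purpose: str


# Hypothetical entries gathered from supplier technicians.
DOCUMENTED = [
    VendorSession("ExampleOEM", "remote desktop", "10.0.0.40", "tcp", 3389,
                  "quarterly health check"),
    VendorSession("ExampleOEM", "file transfer", "10.0.0.40", "tcp", 22,
                  "firmware staging"),
]


def undocumented(observed, documented):
    """Return observed (target, proto, port) flows with no documented session."""
    allowed = {(s.target, s.proto, s.port) for s in documented}
    return [flow for flow in observed if flow not in allowed]


# Flows seen during a supplier's remote access window (illustrative).
observed = [
    ("10.0.0.40", "tcp", 3389),
    ("10.0.0.41", "tcp", 445),   # reaches beyond the supplier's own equipment
]

for target, proto, port in undocumented(observed, DOCUMENTED):
    print(f"Undocumented supplier flow: {target} {proto}/{port}")
```

Anything this check flags is a question to raise with the supplier, since it may simply mean the documentation is incomplete rather than that the activity is unauthorized.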
4. The relative importance of maintaining production within the OT network directly impacts available incident response strategies
There is a critical part of the risk-tolerance spectrum which comes into play when something unexpected is happening. You may not yet know if it’s unexpected by everyone, or only some people – perhaps your team doesn’t maintain an internal chat or otherwise communicate as well as it could? – and therefore it’s not immediately obvious if it’s unauthorized or not.
In an enterprise environment, it’s relatively easy to kill the activity first and see who raises a hand to complain, because it’s not likely to impact end-of-quarter inventory levels or financial performance. But with IA networks, that’s exactly what it’s likely to impact.
You therefore need to have very, very clear escalation paths and executive authority baked into your investigation and response protocols. Twenty years ago in enterprise, it was easy to highlight the difference between “board room” and “server room” decision-making, and the IT staff typically worked in isolation from the real senior executives as a result. Unless something was on fire or otherwise catastrophic, it was “just” a technology problem which would get itself sorted out. That’s a dangerous and naïve perspective, as we all know, but it was widespread.
If your team doesn’t have a strong leader who can make those calls with confidence and ownership, it should be an immediate priority to address that gap, because it’s all but guaranteed you’ll need someone tasked with those decisions sooner rather than later.
5. It is better to add perimeter and transactional controls now which work alongside your existing technology stack than to wait for a chance to do a full re-architecture from scratch
IA/OT networks can remain in a basic state of similar configuration for 10-20 years. Margin pressure is such that it’s difficult to obtain major capital budget during such a run to fund a major upgrade; such activities are often catalyzed when overall throughput trends downward for long enough that the ROI is forced as a recovery tactic rather than a proactive investment.
Security investment doesn’t directly change product quality, or increase output, or pay bigger dividends to shareholders. So why bother? Rare one-time recovery expenses are less costly overall, aren’t they?
This is where the most brutal honesty is important. Many IA environments suffer small blips and interruptions relatively frequently, but because the scale of their impact is modest, the effects and cumulative costs don’t factor into large strategic plans. They are a nuisance, part of the basic noise floor of operations, and accepted as a “cost of doing business.”
But measuring those costs, and determining which of them might be avoided with security investments, is a reasonable line of inquiry to explore. If 15% of your ‘blips’ arise when file transfers are performed by your HMIs or engineering workstations, and each archive requires scanning for malware, staging, and scheduling downtime, then optimizing that process could significantly reduce the outages caused by those blips. Similarly, if misconfigured networking equipment or overly permissive whitelists create unexpected control logic changes which have to be cleaned up, then again you might find savings of 10-20% by doing a better job controlling for those risks.
Even if you only focus on the top 5 causes of your blips, you stand to make significant improvements if you can invest in tools which reduce their likelihood to a significant degree. These aren’t major forklift upgrades to the plant, and they don’t require massive sea changes in staffing or operations, but they do require some up-front effort and ongoing discipline. What is a 25% savings in downtime worth to your business? Start with that and see what numbers arise from the analysis.
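The analysis suggested above can start as simple arithmetic. The Python sketch below tallies downtime by cause and estimates what a given reduction in the top causes would be worth. Every figure in it (the incident log, the cost per minute, the 25% reduction) is invented for illustration, and the function names are hypothetical; substitute your own records.

```python
from collections import defaultdict


def downtime_by_cause(incidents):
    """Total downtime minutes per cause, largest first."""
    totals = defaultdict(float)
    for cause, minutes in incidents:
        totals[cause] += minutes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def estimated_savings(incidents, top_n, reduction, cost_per_minute):
    """Value of cutting downtime from the top_n causes by `reduction` (0 to 1)."""
    top = downtime_by_cause(incidents)[:top_n]
    return sum(minutes for _, minutes in top) * reduction * cost_per_minute


# Illustrative incident log: (cause, downtime in minutes).
incidents = [
    ("file transfer", 30),
    ("file transfer", 45),
    ("misconfigured switch", 60),
    ("permissive whitelist", 20),
    ("other", 15),
]

for cause, minutes in downtime_by_cause(incidents):
    print(f"{cause}: {minutes:.0f} min")

# Top two causes total 135 minutes; 25% of that at an assumed 100 per minute.
print(estimated_savings(incidents, top_n=2, reduction=0.25, cost_per_minute=100))
```

With these made-up inputs the top two causes account for 135 of 170 minutes, so a 25% reduction there is worth 3375 in whatever currency your cost-per-minute figure uses. The real value of the exercise is in building the incident log honestly, since that is the number most plants do not currently track.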
We recognize you probably aren’t suffering from major outages regularly. Our message here is that understanding existing activity in more detail, bringing about some rigor to your OT network and security processes, and learning more about how all these devices communicate will equip you – with the assistance of quality OT security tools – to avoid downtime due to that very same activity in the future. The aftermath of a genuine attack could be extensive and cause significant harm to your business.