My day job at Langton Blue is helping clients fix messy, complicated problems in the manner of Winston Wolfe – quickly, efficiently, with a minimum of fuss and bother. When I attend exhibitions, my name badge often reads “Problem Solver” to encourage the good folk I meet to avoid assumptions based upon a job title. Wearing this title demands a broad knowledge of operations, processes, technology and architecture. Keeping current on all of that requires a good deal of personal maintenance, and I was lucky enough to have a month this year when I could pick whatever I wanted to look at in the Langton Blue laboratory and refresh my skills in a very, very hands-on feast. Apologies in advance for the technical content below – this gets messy!
The Interesting Stuff
August was family holiday time, with September dedicated to my much-anticipated lab time, tuning those oft-neglected skills of sleeves-up, hands-on, real technology know-how. Here are the highlights of what I got up to.
SDN Controller – BigSwitch
Deployed a Dell S4810-ON ONIE switch for 10GE storage and application networking across a vSphere and Hyper-V cluster. Although these contain a first generation Broadcom Trident chipset, they are fully supported by BigSwitch. Big Fabric and Big Monitoring Fabric are the products deployed, with the latter looking *very* interesting for out-of-band layer 3+ visibility tools like ExtraHop. Lots more to come with this setup, and we’ll be breaking it in every way we can find. We drink strong coffee, not Kool-Aid, when we discuss SDN and its applications for financial clients.
IAM (Identity and Access Management) – Vasco Identikey
Rebuilt and deployed a Vasco Identikey Server with fault-tolerant replication and 802.1X integration with Cisco core, distribution and edge devices. This was more painful than it needed to be – blog article pending. Next objective: Office 365 integration with 2FA on a selective basis. I’ve only ever experienced pain when deploying device management through Microsoft ADFS, which fuels my scepticism regarding BYOD and its business value without strict technical boundaries during implementation. If Microsoft would stick with consistent support for a SQL backend IdP to ADFS, then deploying customer-specific federated identity models would be much easier. They are surely missing a trick here, as a SQL backend IdP repository was supported in what is referred to as ADFS v2.0 (Windows 2008 R2), then removed for ADFS v3.0 (Windows 2012 R2), only to be returned in Windows 2016 ADFS.
Application server – Oracle WebLogic 12c
The application tier with its transactional layers and persistence models is often where the infrastructure crew start to fade back into their LUNs and database tables. It is here that the Developer Jungle begins to grow. During my month I exercised the Oracle WebLogic 12c deployer – still haven’t got all my Eclipse plugins working, but the familiarity of the deployment model was reassuring. I’ve worked with WebLogic since version 8.1 in 2004 when BEA owned it, and am happy to report not too much is different in 12c with Oracle at the helm.
PKI Consistent deployment – Microsoft CA with PowerShell
Wrote PowerShell scripts to build a complete two-tier Microsoft PKI infrastructure, with an offline root CA (see (2) above). For reasons unclear to me, Microsoft hasn’t really hit their PKI CA capability with the slick wizard treatment seen elsewhere. This is a good thing™. The reliance on the CLI and PowerShell scripting makes operations more predictable and auditable, with clearer Active Directory RBAC controls on who can issue which type of certificates. More to come on this, as the expense and integration complexity of a third-party PKI suite is often unnecessary when considering the training and licensing costs that accompany that approach.
DevOps – bring on the Puppets
The allure of a DevOps-style “mode 2” approach to technology infrastructure is part of what Gartner describes as “bimodal” here. So, I built and destroyed a new Puppet Enterprise environment for CentOS 6/7 and RedHat 6/7. Windows PowerShell DSC remains a beta thing in comparison. I remain undecided on whether I like Linux systemd, with much the same opinion as Linus. Business continuity looks very different in a DevOps world, and lacks coverage in many of the articles expounding the virtues of the discipline.
Database – Oracle 12c RAC with a smidgen of GRID
Oracle Database 12c RAC and Grid Infrastructure build in our vSphere lab – it’s still big, and it’s still more complicated than it needs to be. I’ll leave this one here, as I intend to dig into ASM with a view to explaining how to avoid overlap with LVMs and thin-provisioned storage arrays at a later date. The KISS principle applies here, and works.
DDoS Protection with DOSarrest
Sadly, I didn’t get to this one, and it was going to be one of my favourites. With scary amounts of DDoS bandwidth across the globe, we’ve considered DOSarrest for previous client engagements. Their WAF and reporting capabilities were my primary interest, as anybody with a jumbo bucket of bandwidth can do the DDoS bit. This is still on the cards during the next month or so, depending on my workload.
Boring but Important Stuff
PKI Intermediate Certificate Authority – issue new certificates
Checked and reissued PKI Intermediate CA certificates from the root CA – a little early, but needs to be done in a timely manner. More than one public embarrassment (I randomly pick Apple and Azure) has happened due to expired certificates.
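The "a little early" habit is easy to automate. Here is a minimal sketch in Python, assuming the expiry dates have already been exported from the CA by some other means. The certificate names, dates and the 90-day window are all hypothetical, and this is not the script used in the lab.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(certs, now, window_days=90):
    """Return (name, days_left) pairs for certificates that expire within
    window_days of `now` (already-expired ones included), soonest first."""
    cutoff = now + timedelta(days=window_days)
    due = [
        (name, (not_after - now).days)
        for name, not_after in certs.items()
        if not_after <= cutoff
    ]
    return sorted(due, key=lambda item: item[1])


# Hypothetical inventory: certificate name -> notAfter date (UTC).
EXAMPLE_CERTS = {
    "issuing-ca-01": datetime(2016, 11, 1, tzinfo=timezone.utc),
    "issuing-ca-02": datetime(2018, 6, 1, tzinfo=timezone.utc),
    "root-ca": datetime(2036, 1, 1, tzinfo=timezone.utc),
}

if __name__ == "__main__":
    today = datetime(2016, 9, 30, tzinfo=timezone.utc)
    for name, days_left in expiring_soon(EXAMPLE_CERTS, today):
        print(f"{name}: {days_left} days left")
```

Run on a schedule and fed into whatever alerting you already have, something along these lines turns certificate renewal into a routine task rather than a surprise.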
Validated our internal NTP hierarchy and synchronisation status – sub-2 ms synchronisation across all endpoint and server objects within Langton Blue and supporting cloud infrastructure, nice. Not only that, our primary Stratum 1 servers are a couple of Raspberry Pis with GPS HATs. That gives us bulletproof timekeeping across the entirety of our infrastructure, with Cisco routers, ESXi hosts, Windows and Linux VMs and transactional ESB services all generating SIEM logs with sub-millisecond synchronised time. Wonderful.
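For anyone wondering what "synchronisation status" boils down to, here is a small Python sketch of the standard NTP offset and round-trip delay calculation from the four packet timestamps, plus a tolerance check. The host names, the sample offsets and the 2 ms limit are hypothetical illustrations, not measurements from our lab.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire calculation, all times in seconds.

    t0: client transmit, t1: server receive,
    t2: server transmit, t3: client receive.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # how far the client clock is behind the server
    delay = (t3 - t0) - (t2 - t1)         # round trip minus server processing time
    return offset, delay


def within_tolerance(offsets_ms, limit_ms=2.0):
    """Map each host to True if its absolute offset is under limit_ms."""
    return {host: abs(offset) < limit_ms for host, offset in offsets_ms.items()}


if __name__ == "__main__":
    print(ntp_offset_and_delay(100.0, 100.5, 100.75, 101.0))
    # Hypothetical per-host offsets in milliseconds.
    print(within_tolerance({"esxi-01": 0.4, "asa-01": -1.2, "vm-lin-07": 3.1}))
```

In practice the offsets would come from the NTP daemon's own reporting on each host; the point of the sketch is only that the pass/fail test is simple enough to run everywhere.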
Mail server move – new Cloud please
Moved a Linux sendmail server and a postfix mail server from one cloud provider to another – updated DKIM and SPF DNS records, checked iptables logs and generated reports against other log files for persistent/long-term compromises. This was a one-day exercise, aided and abetted by the documentation we have on the configuration build and operation of our servers. By understanding where the independence from the cloud provider API/Service Level began, it was pretty straightforward to lift and shift our SMTP services, complete with certificates and anti-spam measures. LB uses a structured documentation approach that can adapt and grow to document and describe any client environment. I just introduced it to a 1,600-employee housing trust, after putting the same structured approach into an oil trading company trading many billions of US$ on a daily basis.
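One check worth scripting after a move like this is whether the published SPF record actually covers the new server addresses. The Python sketch below is deliberately simplified: it only understands ip4: and ip6: mechanisms and ignores include:, a, mx and anything carrying a -, ~ or ? qualifier, so it is not a full SPF evaluator. The record and addresses are hypothetical, taken from the documentation address ranges.

```python
import ipaddress


def spf_ip_networks(record):
    """Extract the ip4:/ip6: networks from an SPF TXT record string."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return []
    networks = []
    for term in terms[1:]:
        term = term.lstrip("+")  # "+" is the default (pass) qualifier
        if term.startswith(("ip4:", "ip6:")):
            # strict=False tolerates host bits set in a CIDR value
            networks.append(ipaddress.ip_network(term[4:], strict=False))
    return networks


def spf_covers(record, address):
    """True if `address` falls inside any ip4:/ip6: network in the record."""
    ip = ipaddress.ip_address(address)
    return any(
        ip.version == net.version and ip in net
        for net in spf_ip_networks(record)
    )


# Hypothetical record using documentation address space.
EXAMPLE_SPF = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.25 ip6:2001:db8::/32 -all"

if __name__ == "__main__":
    for addr in ("192.0.2.10", "203.0.113.5", "2001:db8::1"):
        print(addr, spf_covers(EXAMPLE_SPF, addr))
```

Fetching the live TXT record is left out on purpose; paste in the record as published and the addresses the new provider assigned.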
VUM Updates – cleanliness begins at home
I updated our VMware Update Manager repository and all the ESXi hosts in the lab. Hampered somewhat by the size of the Oracle RAC VMs and their raw LUNs (don’t ask), this too was pretty much a single day exercise, with sufficient redundancy in the lab cluster to vacate each node in turn, updating the host profiles as I went.
vCenter was rebuilt to 6.0U2 Build 3634793 (15th March 2016). I do enjoy the vDS export and import capabilities, as this is a real timesaver. The ability to build infrastructure at this velocity (‘scuse the Agile pun), and to do it fairly frequently, is tantamount to validating any DevOps DR approach. Do be careful to make sure you haven’t made inappropriate assumptions regarding lower level services such as DNS and NTP, as this is an easier mistake to make than you would think. Been there, done that, during testing.
PowerCLI update to 6.3R1
Updated PowerCLI to version 6.3 R1 – a few git repo updates needed. Enough said. I like PowerShell and the access it provides to .NET assemblies and the PowerCLI object model for vSphere.
Added a second PSC to VMware vCenter
To provide greater resiliency to vCenter, I added a second Platform Services Controller. vCenter remains a large lump of a thing, and occupies the central ground in any vSphere infrastructure deployment. It remains essential to most operations regarding vSphere clusters. Since vCenter 6.0U1 it is possible to repoint a vCenter to use a second PSC, making the deployment of an expensive and complex vADC (e.g. Citrix NetScaler or F5 BIG-IP) for “load-balancing” difficult to justify.
As shown below, the web-client screen layout isn’t fabulous at first login, even in IE11.
Thankfully, this does sort itself out after a new session is started.
vCenter Distributed Switch alerts
Investigation of vDS MTU and teaming health check errors – I am still not clear why the health check creates, then removes these errors. There is an ageing but really great article by Chris Wahl here on vDS health checks. Just like Host Profiles, they are troublesome to set up with “green green green” status, but worth the effort when you’re whizzing along operationally. We all want to manage by exception, with minimal noise to hide the exceptions when they happen.
Windows Server Update Services
Fiddled with our WSUS update cluster, but stopped short of bringing it under SCOM management. It has always intrigued me that Microsoft officially supports WSUS under SCOM with a shared SQL susdb database – in fact this is recommended for availability reasons. However, it remains difficult to find an official statement on whether a pure WSUS 6.0 server pair sharing the susdb database is a supported configuration.
Firewall Patching and CVE vulnerability checks
Checked the test Cisco ASA cluster for CVE vulnerabilities, and daydreamed while watching the script kiddies hitting the outside interface. Some of the traffic we see hitting the public IPs is hilarious.
The Architect’s View
Phew. What a month. The list above is a rather eclectic mish-mash of technology, from storage to networking, through server and database, to cloud and application tier. My goal in this post is to emphasise the importance of considering and understanding the detail within the entire technology stack, top to bottom. If you aren’t monitoring and measuring it, and getting meaningful reports on it, you can’t manage it. This is a line of frustration with AWS and Azure APIs, for example, where trust is expected to fill the gap. It doesn’t always, so knowing where those gaps are can be helpful.
Getting your products and components to do their work for your business, safely and securely, is the end game. Every component of your technology stack, whether cloud or on-premises, is an opportunity for exploitation. Keep it simple. Keep the attack surface as small as you can while providing the service the business wants. Tidy up after yourself – remove those old products and software builds. Stop paying for support and maintaining them. Stop having to remember to patch them. Procuring too many products, or tangling your selection of products in a complexity far beyond the capability of the operational team, are two common mistakes that deliver security theatre rather than the security you need.