An experienced Infrastructure Engineer to remotely join our global team.
Cognician is looking for an Infrastructure Engineer, starting immediately. This is a remote position in the European timezone. The majority of the team are located in South Africa; however, we have some folks in the Netherlands, the UK, and the USA.
Our founding belief is that people are capable of great things when their behaviour is driven by powerful ideas. We enable clients and partners to deliver meaningful conversations that inspire learners to adopt powerful ideas that enhance their performance. The result is life-changing conversations and life-changing learning.
Cognician is a mature product start-up in the corporate learning space. Feb 2021 marks the beginning of our 11th year, and April 2020 our 9th year with the Clojure stack.
Our team is composed of a variety of purpose-driven, friendly, smart, skilled, and overall “good-weird” people, all focused on solving the big problems of digital coaching in the workplace, each according to our own particular skills and strengths.
We know that we’ve got a lot of work to do here. We’re eager to do it, and enjoy working together as we do. You can get to know a little about each of us at our team page.
Although we have an office in Cape Town, we’re organised as remote-first and asynchronous, as that’s how we started! Our entire organisation is in the cloud.
You can learn quite a bit about our tech stack and culture in a fairly short space of time on episode 27 and episode 48 of the ZADevChat podcast, where our CTO, Robert Stuttaford, appeared as a guest.
Despite the age of these episodes, they're as relevant today as they were in 2016!
- Excellent spoken and written English is critical.
- Proactive. You understand the value of getting in front of potential problems, and work to do so.
- Organised. You care how your work is done just as much as what work you’re doing.
- Meticulous. The work you produce is fit-for-purpose, consistent, and complete.
- Responsible. You own the problem, not just the solution.
- Empathetic. You know that this is all about people, first.
- An over-communicator. You understand that effective work is about shared understanding, and you strive to achieve that.
- You ask lots of questions.
- You prefer a ‘learning mindset’, but you can clearly motivate and defend your ‘fixed mindsets’.
- Calm and focused under pressure.
- Optimise for your intrinsic motivators — mastery, autonomy, and purpose.
- You understand and value the difference between a goal and a strategy.
- Natural teacher and coach.
- More of a ‘mender’ than a ‘maker’.
- Conscious of the passage of time, and what that means for a company and a team :-)
Roles in the work
EC2, S3, CodeDeploy, Terraform, Ansible, .jar, systemd, Networking, Monitoring, Metrics, Continuous Integration, Load testing, Profiling, Error tracking, Alerting, Capacity planning, Self-healing systems, 12 factor, Access control, Automation, Immutable infrastructure, zsh, git, Security, Encryption, Compliance, DevOps, Python and so on.
You find this soup delicious. You know how to cook it (in fact, you’ve cooked it plenty of times), even if you used different ingredients to the ones listed here.
- You know or want to learn Clojure and/or Datomic — and apply that learning from day one.
- You understand how distributed systems are put together, and just how hard that is to do well.
- You understand that all decisions are about making tradeoffs — storage vs compute, simplicity vs flexibility, rapidly (perhaps with some technical debt) vs slowly (perhaps with none), and so on.
- That “mender” thing again — some of our code is 9 years old, and still delivering value! Vintage (‘legacy’ is so enterprisey) code is a way of life here.
Our infrastructure stack
Our entire platform is built with Clojure, ClojureScript, and Datomic. Familiarity with these technologies is beneficial, but not required.
Clojure runs on the JVM, so from the infrastructural perspective, our apps show up as Java applications.
You’ll need to be experienced with AWS, and with the infrastructure-as-code approach, preferably with Terraform and Ansible.
This system is mature, stable, and the platform it supports is ISO 27001 certified!
- Amazon Web Services
- EC2 + VPC + ALB
- Several AMIs built with Packer + Ansible
- Terraform v0.14
- Ansible v2
- Instances run NGINX & JVM
- Public site: WordPress - proxied to WPEngine from within AWS
- AWS CodeDeploy
- ‘Litmus’ — a Clojure web application with a Slack app UI for running test servers in our staging environment. Pick a repo, pick a branch, pick a server size, decide how many hours to run it, decide whether to use the shared staging database or an isolated database just for your server, and GO.
- Threat Stack
We maintain a bash script repo which handles workstation setup:
- Access Control
- AWS (incl. MFA and automated key rotation)
- Cloning source code
- Running Clojure, Datomic, NodeJS (for building ClojureScript) locally
- An optional oh-my-zsh plugin with conveniences and shorthand for accessing servers
Things you’ll do
Simplify, simplify, simplify
We’ve built up a fair amount of technical debt by the straightforward process of solving the problem of the day as simply as possible, many days in a row.
There are many opportunities to revise, refactor, and reorganise.
Respond to production outages relating to infrastructure or our apps (in collaboration with the broader tech team).
Our Infrastructure team’s primary client is the rest of the tech team. When we need assistance with CI, staging data, access control, or anything like it, you’ll help out.
We’re at a place in our growth where performance at scale is more important than ever. We want to level up on our application profiling and monitoring in a serious way — and use newfound knowledge to drive improvements, big and small, on a continuous basis.
- Build metrics dashboards (for the measures we discover actually matter).
- Design and regularly run scenario-based load tests for the engagements we know are coming.
- Work with other engineers to isolate and improve performance hotspots, perhaps by adding new infrastructural elements, like additional queues or caches.
Any tech ops person worth their salt has heard of Netflix’s Simian Army. We’re not quite ready for that, but we could do with some of this approach.
- Audit and standardise alerts — close any gaps, and eliminate false positives.
- Test alerting, failover, auto-scaling, graceful degradation, and maintenance procedures — whether automated or manual.
- Configure a status dashboard.
- Improve per-instance health-checks so that they check on all their dependencies, too.
Security & Compliance
Research and document the overall surface area of all our systems and ensure that we are following best practice for security and encryption — including internal systems, like the use of cloud services for team management.
Also, we are ISO 27001 certified, and Infrastructure has a key role to play in maintaining that.
Reactive, ad-hoc work
Fix things when they break, and work to prevent them from breaking again :-)