Leaving Meta

I recently decided to leave Meta (called Facebook when I joined). Today is my last day. I have finished offboarding and spoken to a lot of colleagues, some of whom have become close friends. As I sit in the MPK17 coffee shop, I figured it’s a good time to reflect on my time at the company while things are still fresh in my mind.

First of all, I am really proud of what my team and I achieved: we built a 0->1 streaming system that is powering use cases like fleet-wide distribution (across Meta’s data centers in all regions), control plane data indexing, etc. Building and productionizing a system at this scale with performance, correctness, and reliability is extremely challenging, and I am glad that I am leaving at a point where the system is 100% rolled out.

Along the journey, there was a lot of learning. It had its ups - solving storage performance problems and CPU bottlenecks that unblock critical use cases is a lot of fun. And it had its downs - layoffs in particular were really hard. At the end, though, I feel like an upleveled engineer, and I feel that I have learned more in the last few years than in the rest of my career combined.

When I joined, I thought I would learn the most technically (I am an engineer after all). And I did learn a lot technically: in essence, I learned much of what it takes to lead a project and a team to build and deploy a critical streaming system at Meta’s scale. But I could never have predicted that the “non-technical” learning would be even greater. My hypothesis is that this was because of the nature of my role - I joined at a more senior level with project and team lead responsibilities, and so naturally had to make my own mistakes and learn from them.

Inspired by Mahesh’s lessons learned blog post, I also wanted to share what I learned from my experience at the company and on the Delos team. Note that these lessons are from my perspective as an engineer working on core systems and infra at a large tech company - it’s very likely they don’t directly apply in other contexts.


Non-technical:

  1. It’s all about people. Do they inspire you? Do you trust them? Do they give you honest feedback? Are you learning from them? Everything else is secondary. If you’re choosing a company or team, over-index heavily on people. The right people can make ordinary tasks enjoyable. The wrong people can make trivial things frustrating.

  2. For any sufficiently complex problem, you can’t achieve greatness without team camaraderie. When things go wrong (and they eventually will), you need a team with strong bonds that can cover for each other and operate as one unit. Do whatever it takes to build that camaraderie.

  3. Process (tasks, standups, scrums, etc.) is just a means to an end. Such process is important, but it has to be lightweight, and the team has to decide what works best for them. It should never become the end goal on which perf reviews or calibrations are based.

  4. Planning is extremely useful, but plans are mostly useless. It’s important that everyone aligns on the overall mission of the team, but “how” to get there, i.e. “the plan”, becomes increasingly likely to change as more people get involved and as the scope of the project grows. Therefore, execution is key - you need to start somewhere, fail fast by discovering unknowns, iterate, and finally converge towards the goal.

  5. It is very important that managers and leads absorb external pressures and provide a healthy and safe environment for the team to execute.

  6. In the long run, perks matter far less than the other factors that motivate people. Their primary purpose is convenience, not motivation.

  7. Talk to your customers frequently. Make sure what you are building is actually useful to them. If possible, sit in the same area as your customers. Relative to other companies, I’ve found Meta does this well.

  8. Building the right team culture takes time, discipline, and active listening. Once you listen to feedback and objectively act on it, people start trusting you more and eventually become champions of that culture.

  9. For critical decisions, don’t rush, and don’t confuse urgency with taking shortcuts. Think long term: if we do X, how will it affect A, B, and C (people) in 0, 6, and 12 months?

  10. Hiring and recruiting are the single most important factor in the success of a team. Hire people who are not only technically sound but also a cultural fit (this is really important).

  11. Making decisions based on metrics is good, but realize that not everything in life is a math equation. Hire smart people and trust them to make the right decisions while holding them accountable. Guide them when needed, but don’t project your biases onto them.

  12. Remember that you can be both smart and kind. Strive to be the engineer that people on the team come to for advice without hesitation because they trust you.

  13. Most people problems can be solved by direct, honest communication. The tricky part is whether you trust the person enough to know they will take the feedback well - that’s why (1) is important. In my experience, the root cause of such issues is usually miscommunication, which can happen for many reasons.


Technical:

I’ll list a few learnings here that I think sound obvious in retrospect but are often overlooked:

  1. When you are building a large-scale system, simplicity and understandability should be explicit design goals. One of the main value propositions of Raft (over Paxos) was understandability. One or two engineers being able to understand and debug issues in your system is not enough - everyone on the team should be able to do oncall with ease.

  2. Observability of a system is often not prioritized as much as it should be. If a production issue is happening, how quickly can any engineer on the team root cause it? It shouldn’t take more than a few minutes in most cases (there are always some hard issues that can take hours, days, or even months!). Along the same lines, observability should find issues in your system before your customers report them (a minimal latency-alerting sketch follows this list).

  3. Define what correctness means for your system and build infrastructure to detect and root cause correctness issues quickly (a minimal checksum-scrubbing sketch also follows this list). By far, the most painful issues in storage systems (after security issues) are caused by correctness bugs.

  4. Networks are hard when it comes to perf issues. As an infra engineer, I regularly dealt with CPU bottlenecks, OOMs, memory pressure, and bad disk/flash drives, but by far the hardest time I had was with the network. This is partly because I was building a streaming system, where throughput is highly affected by factors like client/server buffer sizes, source and destination regions, etc. When unsure, it’s generally a good idea to follow a process of elimination (ruling out other components like CPU, memory, and disk) before calling the network the culprit.

  5. Always do perf tests against production workloads. When building a storage/streaming system, there will be unknown bottlenecks that you’ll only discover once you test against real production traffic.

  6. Fail fast. For critical components, prototype and land early, then iterate. It’s rare to get things right in v1; you need to iterate to v2 and potentially v3. So optimize for reaching v1 early.
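
To make (2) a bit more concrete, here is a minimal sketch of the observability idea: record per-request latencies in a histogram and raise an alert when the tail crosses a threshold, so the team hears about a regression before customers do. The names and the threshold (RequestLatencyHistogram, P99_ALERT_MS) are hypothetical illustrations, not anything from Meta’s internal tooling.

```python
# Minimal, hypothetical sketch: track request latencies and alert on p99.
import bisect
import random

P99_ALERT_MS = 200  # assumed alert threshold for this sketch


class RequestLatencyHistogram:
    def __init__(self):
        self.samples_ms = []

    def record(self, latency_ms: float) -> None:
        # Keep samples sorted so percentile lookups stay simple.
        bisect.insort(self.samples_ms, latency_ms)

    def percentile(self, p: float) -> float:
        if not self.samples_ms:
            return 0.0
        idx = min(len(self.samples_ms) - 1, int(p / 100 * len(self.samples_ms)))
        return self.samples_ms[idx]


def check_alert(hist: RequestLatencyHistogram) -> None:
    p99 = hist.percentile(99)
    if p99 > P99_ALERT_MS:
        print(f"ALERT: p99 latency {p99:.1f}ms exceeds {P99_ALERT_MS}ms")


if __name__ == "__main__":
    hist = RequestLatencyHistogram()
    for _ in range(1000):
        # Simulated request latencies; a real system would instrument its RPCs.
        hist.record(random.gauss(100, 50))
    print(f"p99 = {hist.percentile(99):.1f}ms")
    check_alert(hist)
```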
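
Similarly for (3), a minimal sketch of what “infrastructure to detect correctness issues” can look like at its simplest: checksum every record at write time and run a scrubber that re-verifies payloads and reports mismatches. The in-memory log and helper names are hypothetical, chosen only to illustrate the idea.

```python
# Minimal, hypothetical sketch: checksum records on write, scrub to detect corruption.
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    payload: bytes
    checksum: str  # hex sha256 of payload, computed at write time


def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def append(log: list[Record], key: str, payload: bytes) -> None:
    log.append(Record(key, payload, checksum(payload)))


def scrub(log: list[Record]) -> list[str]:
    """Return keys of records whose payload no longer matches its checksum."""
    return [r.key for r in log if checksum(r.payload) != r.checksum]


if __name__ == "__main__":
    log: list[Record] = []
    append(log, "a", b"hello")
    append(log, "b", b"world")
    # Simulate silent corruption of one record's payload.
    log[1].payload = b"w0rld"
    print("corrupted keys:", scrub(log))  # -> ['b']
```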