In my previous role, I was the in-house architect of a cloud-scale data and analytics platform. It was a rather large platform, at least by Finnish standards. It held more than 1 PB of data, had dozens of integrations, and served 150+ users. About 15 people managed and developed the platform.
After reflecting on that experience, I have decided to share the key lessons I have learned. The platform operated on Azure and Databricks but the things I learned are general and probably apply to every large data platform.
1. Governance, Governance, Governance
Governance is a curious thing. While it’s often discussed, it is an elusive and abstract concept. As an ex-developer, I wasn’t initially keen on governance.
Governance is like a paddle in canoeing. It is possible to start your trip without one, letting the current do the work, and you might not even miss it. When you realize you need it, it is probably too late.
You don’t absolutely need governance at the start, while operating on a small scale, but as your platform grows to include hundreds of data pipelines and thousands of tables for hundreds of users and systems, it becomes essential. Once you reach a certain scale, you cannot operate efficiently without proper governance in place. In the beginning, governance will feel like it hinders your progress, but as the platform grows, it makes you much more efficient.
What is governance of a data platform? There is a multitude of definitions but I think of it as everything that helps you manage the platform at scale. It can include processes, policies, handbooks, or tools.
As the business of the data platform is data, a large part of the governance focuses on that. Besides data, you need to manage other assets, such as data pipelines, Spark notebooks, ML models, data models, reports, and so on.
Here are some questions that governance helps you answer. These questions are written for data but could concern other assets as well:
- What data do you have?
- Who owns it?
- What does it actually contain?
- How do you classify it?
- How do users find the data?
- Who can access the data?
The most important thing about governance is that you need to have enough but not too much of it. Secondly, you should automate as much as reasonably possible. Governance is easily trumped by business requirements, and automating it helps keep it up to date.
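To make the questions above a bit more concrete, here is a minimal sketch of how a catalog entry could capture them and how a simple automated check could report what is missing. The class, the field names, and the example values are all hypothetical; a real platform would typically rely on its own catalog tooling rather than something hand-rolled like this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One record per data asset. All field names are illustrative."""
    name: str                                                # What data do you have?
    owner: str = ""                                          # Who owns it?
    description: str = ""                                    # What does it contain?
    classification: str = ""                                 # How do you classify it?
    tags: List[str] = field(default_factory=list)            # How do users find it?
    allowed_groups: List[str] = field(default_factory=list)  # Who can access it?


def governance_gaps(entry: CatalogEntry) -> List[str]:
    """Return the names of the governance fields that are still empty."""
    required = ["owner", "description", "classification", "tags", "allowed_groups"]
    return [name for name in required if not getattr(entry, name)]


# Hypothetical asset with only some of the questions answered.
entry = CatalogEntry(
    name="sales.orders",
    owner="sales-data-team",
    classification="internal",
)
print(governance_gaps(entry))  # ['description', 'tags', 'allowed_groups']
```

A check like this could run as part of a deployment pipeline, so that gaps surface automatically instead of depending on someone remembering to fill in a handbook.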
2. Embrace Your Platform
Operating in the cloud at scale is expensive, and when your cloud bill has as many digits as there are days in the week, you start to think about your options. One school of thought says that you should use only platform-agnostic features. The rationale is that sticking to the common features makes switching platforms easy - or at least easier - if things go awry. Following this thinking, you would use only the common Spark features and refuse to take advantage of Databricks’ or Fabric’s advanced Spark features.
I understand the idea but it is not a way to live. It is like buying a Ferrari but driving it like a Lada because you might want to trade it in for one someday.
The likelihood of switching platforms is quite low. The migration cost for a large data platform is enormous, so you are probably not going to do it. It is best to embrace your platform and use it to the maximum extent while making sure your data is not locked in.
3. Spearheading Technology is Hard
Using battle-proven technology is a safe bet but sometimes it isn’t possible. Data lakehouse technology provides tremendous benefits but it’s still quite young. If you want to use it, you are forced to be a pioneer in new technology.
When using new tech, you are going to do at least some R&D work for a platform provider. New features require testing and you will be the one testing them. If you choose the path of new technology, partner up wisely. You can learn proven technology from books, but if you choose new tech, you need a partner who is in the know.
Using new technology often means that something does not work as expected or that there are undocumented features. Maybe your cloud database gives a performance boost if the datafiles are at least 5 GB (or some other arbitrary figure) in size. The point is, you need someone with connections to the product team.
One way to check this is to ask whether your tech partner attends the right conferences: local, regional, or the gold standard, global. If yes, that is a good sign. If not, you might want to keep looking. Platform provider MVP or similar status is also a positive sign. The goal is to find partners with strong relationships with the platform provider to make sure you get support when needed.
4. Data Keeps Changing
There are certain aspects of data handling that you need to settle before implementation. For example, data access patterns influence technology choices and implementation, and data modeling needs to be designed based on user requirements. The schema of the data determines the table layout, and the distribution of the data affects the efficiency of Spark jobs.
The challenge is that, in the real world, data is constantly changing.
For example, the original requirement might have been to import data in a daily batch, so you designed a write-once, read-many optimized solution. Then the data became popular, and now you need to bring it in every 10 minutes, creating a challenging write-many, read-many situation.
Perhaps you modeled your data using the (way too popular) data vault 2.0 method. The original data didn’t have any PII, but then someone started to use the description field in the source system to record customers’ email addresses. Now you need to design a delete or scrub process for your data vault by hand, as your data vault automation tool doesn’t support it.
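A change like this is easier to handle if you notice it early. As a rough illustration, the sketch below scans a free-text field for strings that look like email addresses. The pattern is deliberately simple, and the function name, field name, and sample rows are made up for this example; it is a starting point for an alert, not a complete PII detector.

```python
import re

# Deliberately simple pattern: it will miss exotic addresses and may flag
# some strings that are not real addresses. Good enough to raise an alert.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def rows_with_emails(rows, text_field="description"):
    """Return the rows whose free-text field contains an email-like string."""
    return [row for row in rows if EMAIL_PATTERN.search(row.get(text_field) or "")]


# Hypothetical sample of source rows.
sample = [
    {"id": 1, "description": "Delivery delayed due to weather"},
    {"id": 2, "description": "Contact jane.doe@example.com for invoices"},
    {"id": 3, "description": None},
]
flagged = rows_with_emails(sample)
print([row["id"] for row in flagged])  # [2]
```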
One thing that constantly surprises new Spark developers is how much you need to know about the data distribution when using Spark. Say you partition your customer activity data by customer. The worst-case scenario is that 20% of your executors are doing 80% of the work, as customer activities are typically not evenly distributed. The thing is that data distribution can and will change, so even if you balance the distribution correctly when implementing a job, you need to monitor the workloads to identify when the change happens. Failing to do this might lead to your Spark jobs starting to fail with out-of-memory errors.
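One simple way to watch for this is to track how much of the data the largest keys hold. The sketch below is plain Python and assumes you already have row counts per partition key, for example from a `groupBy(...).count()` on the key column. The function names, the 20% and 80% thresholds, and the sample numbers are illustrative assumptions.

```python
def top_share(rows_per_key, top_fraction=0.2):
    """Share of all rows held by the largest `top_fraction` of keys."""
    counts = sorted(rows_per_key.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always look at at least one key, even for very small inputs.
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / total


def is_skewed(rows_per_key, top_fraction=0.2, threshold=0.8):
    """True if the top keys hold at least `threshold` of all rows."""
    return top_share(rows_per_key, top_fraction) >= threshold


# Hypothetical activity counts per customer: one customer dominates.
activity = {"cust_a": 9000, "cust_b": 300, "cust_c": 250, "cust_d": 250, "cust_e": 200}
print(top_share(activity))  # 0.9
print(is_skewed(activity))  # True
```

Running a check like this on a schedule, and alerting when the share crosses the threshold, gives you a warning before the jobs start to fail.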
5. Monitor or You Don’t Know
Is your platform being used? Are your tables, reports, and ML models actually utilized? Processing data is expensive. Is the end product used? How do you know? The only answer is that if you don’t monitor, you don’t know.
We all know the examples of how people say one thing but actually do something else. Based on polls, charities should be drowning in money, and stores shouldn’t be able to stock organic food shelves fast enough.
The same applies in corporate IT. A report and its underlying data might be added to the platform ‘just in case,’ with claims that it’s the department’s most important report. In some cases, access to data or personal clusters can even become status symbols. Only monitoring will reveal the truth.
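As a rough sketch of what such monitoring could look like, the snippet below compares the list of assets on the platform with the last time each one was accessed. It assumes you can get the last access dates from audit or query logs; the function name, the 90-day limit, and the sample data are hypothetical.

```python
from datetime import date, timedelta


def unused_assets(all_assets, last_access, today, max_idle_days=90):
    """Assets never accessed, or not accessed within `max_idle_days`."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        asset
        for asset in all_assets
        if last_access.get(asset) is None or last_access[asset] < cutoff
    )


# Hypothetical assets and their last access dates (None or missing = never).
assets = {"sales.orders", "hr.headcount_report", "ml.churn_model"}
last_access = {
    "sales.orders": date(2024, 5, 30),
    "hr.headcount_report": date(2023, 11, 2),
}
print(unused_assets(assets, last_access, today=date(2024, 6, 1)))
# ['hr.headcount_report', 'ml.churn_model']
```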
Another category of monitoring is cloud resource usage. Developers are usually pressed to deliver the feature to the business, not so much to watch how many resources the feature uses. Monitoring might reveal issues, such as a query running over the entire data lake when it was thought to access only the latest partition. Or a pipeline running every 5 minutes, not once a day as thought. Or a Spark cluster reading the same data eight times because executors are constantly being evicted.
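The pipeline example is one of the easier ones to catch automatically. The sketch below compares the observed gaps between runs with the interval you expected. It assumes you can pull run start times from your orchestrator's logs; the function name and the tolerance value are illustrative.

```python
from datetime import datetime, timedelta


def runs_too_often(run_times, expected_interval, tolerance=0.5):
    """True if the typical gap between runs is far below the expected one."""
    times = sorted(run_times)
    if len(times) < 2:
        return False
    gaps = sorted(later - earlier for earlier, later in zip(times, times[1:]))
    # Use the middle gap so a single manual rerun does not trigger the check.
    typical_gap = gaps[len(gaps) // 2]
    return typical_gap < expected_interval * tolerance


# Hypothetical pipeline that was meant to run daily but runs every 5 minutes.
start = datetime(2024, 6, 1)
every_five_minutes = [start + timedelta(minutes=5 * i) for i in range(12)]
print(runs_too_often(every_five_minutes, expected_interval=timedelta(days=1)))  # True
```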
Monitoring costs is another critical aspect. Monitoring resource usage covers a lot of this but not all of it. We have all heard the stories of a monster-sized VM accidentally being left on, causing an enormous cloud bill.
For monitoring, the best approach is to create a process, provide tools, and distribute responsibility.
Summary
Building a new data and analytics platform using developing technology can be an exciting journey. By focusing on governance, embracing your platform, choosing the right technology, adapting to changing data, and monitoring effectively, you can create a platform that actually works in the real world and provides an immense amount of value.
P.S. Automate the deployments, and manage the platform using infrastructure as code (IaC). Both will save a ton of work in the long run.