
Trends | 7 predictions for data and analytics in 2023
06.02.2023 | 14 min read | Category: Data Market | Tags: #architecture, #data lakehouse, #dataops, #governance
Do you have a full overview of everything needed to become more data-driven? Here are 7 trends you should know about across the categories of architecture, developer teams, data flow, DataOps/MLOps and governance, data science and analytics, delivery model and value creation from data.
1. Architecture
Data lakehouses are taking over!
Data lakes (aka Hadoop technology, aka Big Data) were much talked about for a period. Data lakes allow you to store very large amounts of data affordably, and can give great flexibility to developers and data scientists to create advanced data products. With Spark we also got the ability to scale the processing of large data volumes. However, the data needs to be further processed to provide value, and the user base remained small because data extraction and processing required significant technical expertise.
Data warehouses have been around for many years. The great advantage of data warehouses is that the data is prepared for use through shared data models. A large part of the workload of compiling and cleansing data and standardising business terminology is done in advance, so users have a much shorter path to self-service. On the other hand, it takes a long time to agree on and build all of this, and the flexibility to quickly create something new is perceived as limited.
Some organisations even have multiple data lakes and multiple data warehouses. The data lakehouse model, where we have a data lake with raw data and logical data warehouse layers on top, helps minimise data movement. The Delta Lake file format (and the alternatives Hudi and Iceberg) means we also get logical database operations on data lake files (ACID, allowing us to both modify and delete data - something we need to be able to do to comply with GDPR requirements, for example). Technologies like Databricks and Snowflake, running on cloud infrastructure, make it much easier to get this up and running. Finally, we can use our energy to create value from data, rather than working on provisioning hardware, configuring for performance or carrying out upgrades.
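To make the ACID part concrete, here is a minimal sketch of what a GDPR-style delete and update could look like on a Delta table, using PySpark with the delta-spark package. The table path and column names are invented for the example.

```python
# Minimal sketch: deleting and correcting customer data in a Delta table,
# e.g. to honour a GDPR erasure request. Assumes a Spark session configured
# with the delta-spark package; path and column names are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

customers = DeltaTable.forPath(spark, "/datalake/silver/customers")

# ACID delete: removes the rows and writes a new table version.
customers.delete("customer_id = '12345'")

# ACID update: correcting an attribute in place.
customers.update(
    condition="customer_id = '67890'",
    set={"email": "'redacted@example.com'"},
)

# The transaction log keeps a version history we can inspect afterwards.
customers.history().select("version", "operation", "timestamp").show()
```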

WHAT’S HAPPENING IN 2023?
Data lakehouse architecture is becoming the standard for all larger organisations
Data lakehouses are now being implemented - everywhere.
What this means for you follows two main tracks:
- Don’t have anything yet? If you have clear use cases you want to realise that span both reporting and data science, data lakehouse architecture is probably worth a discussion.
- Lots of technical debt and limited documentation? It takes time to migrate the old, and you often inherit a great deal of the technical debt regardless. We recommend assessing which parts of the solution should be rebuilt, using a gradual approach.
2. Developer teams
More are realising that the data platform needs its own team
Virtually no one establishes new data platforms on their own infrastructure. The 10 percent who don’t choose Google Cloud, AWS or Azure either have large data centres of their own, too much data that needs processing, or strict regulatory constraints that prevent the use of cloud solutions. The rest of us use cloud first.
You thought the world was simpler in the cloud? It is, but the platform in the cloud should ideally solve multiple types of use cases beyond historical reporting. New roles for the data platform are emerging. A data lakehouse in the cloud neither establishes itself, operates itself, nor gains new capabilities by itself. Unlike many other off-the-shelf applications, the bulk of the effort lies in how the data in the platform is ingested, stored and processed.
We need dedicated platform architects and developers. Note that these are not data engineers with a new name - this is an entirely separate role. However, it doesn’t hurt that several members of the data platform team have a background in ELT development.
WHAT’S HAPPENING IN 2023?
More are building data platform teams
- More and more companies understand that developing a data platform is not the same as developing data pipelines, reports and other services on a cloud platform or on their own infrastructure.
- The platform team needs a good understanding of business requirements that must be translated into shared capabilities, DevOps/DataOps/MLOps, infrastructure, integration methods and interfaces, data flow (ELT, Pub/Sub), modelling, monitoring and data testing (data observability), and not least security and access management.
- The super-person who knows all this inside out naturally doesn’t exist, so build complementary expertise in the team. Make good use of the consultancy market to establish the team, the platform and good processes and routines. But remember to build internal competence along the way!
Data competence spreads as a result of economic downturns
- Less funding for start-ups and scale-ups, and cost reductions in larger organisations mean that the mobility of data professionals increases. Companies like Oda, Elkjop, Google and Facebook are cutting their workforce.
- Data professionals are still in demand, and there are still far too few of them. The consequence is that expertise spreads, not least in how to organise data platform teams effectively and how they interact with the rest of the organisation.
3. Data flow
More options for data ingestion and code-first
Although ETL (Extract, Transform, Load) has served well over recent decades, ELT (you guessed it: Extract, Load, Transform) has now taken over. It makes more sense to ingest the data first, and process afterwards so we can cover multiple use cases.
Data ingestion (Extract) can happen in several ways. Not everything needs to go through a batch-based integration tool (e.g. Informatica PowerCenter); data can also be ingested via other methods. Sometimes we need data quickly. Pub/Sub integration can be labour-intensive, but it makes data available for all kinds of uses in near real time, including in environments outside the data platform. The data stream essentially becomes a set of microservices serving data products that different solutions can subscribe to. Kafka is still the leading technology platform, but in Norway relatively few have extensive experience with it. The expertise being developed now is emerging from general development environments, not from the traditional ETL-heavy data warehouse world.
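As a rough sketch of the Pub/Sub pattern, publishing a small event to Kafka with the confluent-kafka Python client might look something like this. The broker address, topic name and payload are invented for illustration.

```python
# Rough sketch of the Pub/Sub pattern with the confluent-kafka client.
# Broker address, topic name and payload are invented for illustration.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    """Log whether the broker accepted the event."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

# Each event is a small, self-describing data product that any
# downstream consumer (inside or outside the data platform) can subscribe to.
event = {"order_id": "A-1001", "status": "shipped", "ts": "2023-02-06T10:15:00Z"}

producer.produce(
    "orders.status-changed",
    key=event["order_id"],
    value=json.dumps(event),
    callback=delivery_report,
)
producer.flush()
```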

Reverse ETL is a fairly fresh concept, relating to a subset of data integration linked to Master Data Management. In short: there are now tools that push changes in, for example, product attributes out to the solutions that need them. Perhaps we’ll hear more about this in the years ahead, or it will simply become a natural capability within established MDM and data integration tools.
Within data transformation there are many visual tools (such as Alteryx, Matillion or Mapping Data Flows in ADF) that allow virtually anyone to transform data without writing a line of code. That’s great, because it supports data democratisation, right? But for more advanced transformation logic, most end up adding custom code as part of the visual flow - and then it suddenly becomes less clear.
At the same time, it’s significantly easier to use SQL (or Python) to write transformations, preferably in tools like dbt, Dataform or Delta Live Tables where code can be reused and there is full CI/CD support. This additionally provides opportunities to build in data observability and much more.
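To make the code-first approach concrete, here is a hedged sketch of a transformation written as a Delta Live Tables pipeline in Python; dbt and Dataform express the same idea in SQL. The source table, columns and the data quality rule are assumptions made up for the example.

```python
# Sketch of a code-first transformation as a Delta Live Tables pipeline.
# Source table, column names and the expectation rule are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Monthly revenue per product, built from raw orders.")
@dlt.expect_or_drop("valid_amount", "amount >= 0")  # simple data quality rule
def revenue_per_month():
    # assumes raw_orders is defined elsewhere in the same pipeline
    orders = dlt.read("raw_orders")
    return (
        orders
        .withColumn("month", F.date_trunc("month", F.col("order_date")))
        .groupBy("month", "product")
        .agg(F.sum("amount").alias("revenue"))
    )
```

Because this is just code in version control, the usual CI/CD machinery - reviews, tests and automated deployment - applies in the same way as for any other software.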
WHAT’S HAPPENING IN 2023?
Yes please, we want more ways to get hold of data
A trend that has become clear to me is that we typically say “yes please” to more ways of getting data into a data platform. We have some architecture patterns we prefer (because they’re cheapest/simplest/best to monitor, etc.), and some we use occasionally. A combination of ELT and Pub/Sub is probably a fairly uncontroversial prediction.
This means at least three things:
- Data engineers must either learn more tools and architecture patterns, or collaborate with different development teams to get the data in.
- Integration architects must master data architecture to a greater extent.
- It becomes important to understand where the data comes from, how it has been processed, and what assumptions can be made when interpreting the data. This means that tools and capabilities such as data catalogues and data lineage become more important.
GUI vs code for transformation: currently a draw, but we’re rooting for code!
What should we choose? “It depends” is the usual answer:
- Smaller organisations where the use cases are simple can happily use tools like Data Factory and Alteryx - representing low-code/no-code.
- Larger organisations with “professional” development environments should use code as much as possible. At least for the official data flows. Power users in the business will largely also be able to use code-based tools - and love it. Data flows that are under testing and exploration, or that are not critical, can be developed in low-code/no-code tools if desired.
AI-based data quality and data transformation increases developer speed
Generative AI will make development of transformation jobs more efficient - and also be used in data profiling and data validation. Now that we can say in plain text, for example, “give me a table showing revenue summed per month, broken down by product” and then get a suggested SQL query, development of transformations will be significantly faster for many. This will be built into most tools where code is written.
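As an illustration (and nothing more), this is roughly how that plain-text request could be sent to a large language model to get a SQL suggestion back, using the openai Python package as it looked at the time. The model choice, prompt wording and table schema are all assumptions, and the suggested query should of course be reviewed before it is run.

```python
# Illustrative sketch: asking a large language model to draft a SQL query
# from a plain-text request. Model name, prompt and schema are assumptions,
# and the suggested query should always be reviewed before it is run.
import openai

openai.api_key = "..."  # placeholder

prompt = (
    "Table orders(order_date DATE, product TEXT, amount NUMERIC).\n"
    "Write a SQL query that shows revenue summed per month, broken down by product."
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=200,
    temperature=0,
)

print(response["choices"][0]["text"])
# Expected shape of the suggestion (roughly):
#   SELECT date_trunc('month', order_date) AS month, product, SUM(amount) AS revenue
#   FROM orders GROUP BY 1, 2 ORDER BY 1, 2;
```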
Universal semantic layers are launched, but will they be adopted?
Universal semantic layers were launched by dbt and Looker in 2022, and it will be exciting to see whether the concept catches on this soon. The purpose is to make defined metrics and KPIs available across the organisation and ensure that everyone uses the same definitions and the same data foundation. Another way to look at it is that datasets and presentation are separated. Is this perhaps an attempt to wrest from Power BI its role as the tool that fixes everything?
4. DataOps, MLOps and Data Governance
Data observability + data catalogue?
There are many Ops concepts. A couple of years ago came MLOps - i.e. operations and further development of machine learning pipelines. And now we’re talking about DataOps. Nobody has quite settled on a definition of the concept, or its practical consequences, beyond the fact that it’s about bringing DevOps and agile principles from software development into the data world.
WHAT’S HAPPENING IN 2023?
There are three areas that stand out as trends for 2023 within DataOps and Data Governance:
Data observability is something data engineers are starting to talk about, but few are truly making it work
- Data observability is about automating the monitoring of data flows, with emphasis on data lineage (what has happened to the data up to a given point) and data quality (various logical checks showing whether the data has changed character beyond what we expect) - see the sketch below for the kind of checks we mean. Both tools and methods are maturing over time, but we’re still at the starting line.
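To give a feel for what such logical checks can look like, here is a small hand-written sketch in pandas; tools in this space automate, schedule and alert on checks like these. The dataset, columns and thresholds are invented.

```python
# Minimal sketch of the kind of logical checks data observability tools
# automate, written by hand with pandas. Dataset, columns and thresholds
# are invented; a real setup would run these continuously and alert on failures.
import pandas as pd

orders = pd.read_parquet("/datalake/silver/orders")  # illustrative path

checks = {
    # Completeness: key columns should never be null.
    "no_null_order_ids": orders["order_id"].notna().all(),
    # Validity: amounts should stay within an expected range.
    "amounts_in_range": orders["amount"].between(0, 1_000_000).all(),
    # Freshness: the newest record should be recent (assumes naive timestamps).
    "data_is_fresh": orders["order_date"].max()
    >= pd.Timestamp.now() - pd.Timedelta(days=1),
    # Volume: row count should not drop off a cliff compared to expectations.
    "row_count_plausible": len(orders) > 10_000,
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    print(f"Data quality checks failed: {failed}")
```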
Lots of data, many users, and many different use cases require a data catalogue
- Data catalogues require a lot from people and processes to agree on definitions, ownership and routines. The practical consequence is that data catalogue initiatives are scaled down to the most central data domains and kept to a minimum. And that’s fine. And perhaps it’s also the case that we’ll have data catalogue capabilities spread across several tools.
- What do we want? Automated generation and sharing of metadata throughout the entire data flow, available to everyone who needs it, with opportunities for social interaction so we can learn from each other how the data should be used.
MLOps is getting more attention
- We still see that implementation of solutions containing AI-based components (i.e. machine learning, neural networks, etc.) has some shortcomings. Unfortunately, most solutions don’t get past the PoC stage. If they do, far too many are not properly implemented. By this we mean, for example, that critical solutions run in the development environment, and don’t follow the organisation’s guidelines for architecture, security, etc. And the models are only sporadically monitored and have unclear processes for error handling and further development.
- It takes effort to scale and integrate models to run in production, and we need to establish monitoring, maintenance and further development of data flows and model flows. Fortunately, there is increasing attention on MLOps, and several organisations are already proficient at it today. This knowledge is being disseminated.
5. Data Science and analytics
Long live the business analysts!
We’re out to solve business problems and use data to support operational processes. My assertion is that most organisations aren’t ready to adopt the most advanced methods yet either. As long as they don’t have control of basic processes related to data governance and to using data to answer business questions like “what happened” and “why did it happen”, scaling up the organisation’s collective data competence will deliver far greater value than optimising narrow use cases.
WHAT’S HAPPENING IN 2023?
Business analysts first
Data science as a term is calming down a bit, and business analysts are taking the spotlight. We can usually develop business analysts ourselves, while productive data scientists are extremely difficult to recruit.
Increased awareness of competence building? Yes please!
Cultivating the collective ability to use data is important, and for many there is still great potential for value creation from data here. Internal competence programmes continue to be carried out to increase skills, but unfortunately most will still be focused on how to use Excel/Power BI/Tableau and other self-service tools, rather than on using data for problem-solving.
6. Delivery model
Decentralised ownership spreads further
If you work in the data world, you’ve probably come across the data mesh concept. The concept was first described in two blog posts by Zhamak Dehghani in 2019: “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” and “Data Mesh Principles and Logical Architecture”.
The core idea is that organisations can become more data-driven by moving from central data warehouses and data lakes to domain-oriented data ownership and architecture, driven by self-service analytics and a set of shared guidelines. The weakness of the way we organise today is that the central data platform becomes a bottleneck. Data Mesh seeks to break free from a centralised architecture by decentralising both ownership of and responsibility for developing data products.
What makes it all a bit confusing is that Data Mesh is not something you can buy. It’s more of a concept for organisational design, consisting of pieces that need to be assembled like a puzzle.

WHAT’S HAPPENING IN 2023?
Decentralised ownership gains pace, but is adapted to reality
Companies like Zalando, Oda and Adevinta are now actively sharing how they are implementing Data Mesh. Very few have a “true” data mesh yet, since most organisations need to significantly scale up their analytical and technical competence to be able to drive domain-based ownership and architecture. And many are beginning to realise that it’s challenging to separate ownership and delivery models for data products alone, when the rest of the digital ecosystem should also be viewed in the same way.
Decentralised ownership of datasets, reports and analytics models is spreading further, but we’re not being fanatical. It takes longer with data flows and data domains (think “Customer” or “Product” as domains). And it’s not certain everyone should decentralise everything - data mesh as a design concept doesn’t suit all types of organisations - especially not the small and medium-sized ones.
It may also be worth debating whether you need to decentralise the architecture (we can achieve a lot with technologies like Snowflake, Databricks and dbt on a central platform) - isn’t it really through the organisation that the value of Data Mesh is realised?
7. From data to value
Everyone realises they need to become data-driven. And AI to the people.
Are Norwegian organisations becoming more data-driven? Perhaps. There are probably several who have conducted surveys on this. Maturation takes time, and doesn’t happen by chance. Fortunately, we have gained highly visible international - and a few Norwegian - examples showing that data is the great differentiator between being an industry leader and being at the back of the pack.
We no longer need to spend much time arguing that data can provide value. Instead, the conversations are primarily about concretising how much value, where we have the greatest potential, and what’s needed to get there. Most have done something, and are in the process of doing more.
WHAT’S HAPPENING IN 2023?
Can we start talking more about the future?
We talk less about advanced analytics, and more about becoming data-driven across all processes. We simply need to make fact-based decisions, look further ahead through prediction, and automate where we have repetitive tasks that follow standard patterns. And speaking of prediction, I would really like more leaders to ask their business analysts whether we can try to look at what might happen in the future (next week, next month and next year).
Process improvement work is becoming more data-driven
Personally, I hope that process improvement, where Lean has been the buzzword for decades, can in time merge with data-driven automation and problem-solving. The large consultancies now have both Lean consultants and data consultants, but the coordination and collaboration has not been impressive so far. 2023 must be the year where someone seizes the opportunity, and achieves the first Norwegian success stories that everyone talks about. The gauntlet is thrown, as they said in the old days.
ChatGPT opens everyone’s eyes to the value of AI
ChatGPT has opened the eyes of most people - and now over 100 million users have tested how so-called generative AI can be used to automate tasks such as writing text and finding facts. Many more will follow throughout the year, as products like Microsoft Teams start building these capabilities in. For example, we now get meeting minutes with action points served right after the meeting. School pupils and students worldwide see the opportunity to make essay writing more efficient - to great delight. At the same time, they will discover that AI has bias, that text algorithms can’t do maths, and that facts aren’t always facts.

