<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://suanto.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://suanto.com/" rel="alternate" type="text/html" /><updated>2026-03-01T14:27:21+00:00</updated><id>https://suanto.com/feed.xml</id><title type="html">The Anttidote</title><author><name>Antti Suanto</name></author><entry><title type="html">Porting Control Charts Library to Python (With a Little help from AI)</title><link href="https://suanto.com/2026/03/01/control-charts-in-python/" rel="alternate" type="text/html" title="Porting Control Charts Library to Python (With a Little help from AI)" /><published>2026-03-01T10:00:00+00:00</published><updated>2026-03-01T10:00:00+00:00</updated><id>https://suanto.com/2026/03/01/control-charts-in-python</id><content type="html" xml:base="https://suanto.com/2026/03/01/control-charts-in-python/"><![CDATA[<h3 id="background">Background</h3>

<p>Some years ago, I needed an implementation of control charts (Statistical Process Control or SPC charts) in TypeScript. I ended up building it myself. It eventually became quite a popular add-on for a BI tool.</p>

<p>Building it took a couple of months, as designing the API, implementing the charts and plotting, and identifying all the needed user settings required a lot of trial and error.</p>

<p>A few weeks ago, I needed control charts again. This time in Python.</p>

<p>There are several open-source control chart implementations available for Python, but none of them had the features I needed. Control charts are surprisingly easy to implement incorrectly; it is not uncommon to find even popular libraries calculating the limits the wrong way.</p>
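<p>As a concrete example of how the limits are often computed incorrectly: for an XmR (individuals and moving range) chart, the limits come from the average moving range, not from the standard deviation of the raw values. A minimal sketch of the standard calculation (illustrative only, not code from any particular library):</p>

```python
def xmr_limits(values):
    """Return (center, lcl, ucl) for an individuals (X) chart.

    The classic Shewhart calculation puts the limits at the mean plus/minus
    2.66 times the average moving range (2.66 = 3 / d2, with d2 = 1.128 for
    subgroups of size 2). A common bug is to use plus/minus 3 standard
    deviations of the raw values instead, which inflates the limits when
    the data contain trends or shifts.
    """
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean, mean - 2.66 * mr_bar, mean + 2.66 * mr_bar

center, lcl, ucl = xmr_limits([10, 12, 11, 13, 12, 11, 10, 12])
```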

<p>So I started to think about how much effort it would take to port my TypeScript library to Python.</p>

<p>No matter how I looked at the problem, the conclusion was the same: too much work. As it was a personal side project which would be done in my free time, I simply didn’t have the time needed.</p>

<p><img src="/assets/2026/02/xmr_typescript.png" alt="XmR chart with the TypeScript library" />
<em>XmR chart plotted with the TypeScript library</em></p>

<h3 id="ai-assisted-coding---part-1">AI-Assisted Coding - Part 1</h3>

<p>By the end of last year, there were more and more stories about AI-assisted coding and how the latest models were game changers.</p>

<p>I had tried these kinds of tools before, but my experience was that they were useful for small tasks and struggled with larger codebases. The TypeScript library has roughly 10,000 lines of code, making it a medium-sized project, so I was a bit skeptical.</p>

<p>Four weeks ago, I decided to give it a proper test. I bought a Cursor Pro subscription and dug in.</p>

<p>The first attempt was straightforward: I asked it to port the code.</p>

<blockquote>
  <p>You are an expert programmer fluent with typescript and python. You port typescript code to python and create simple, pythonic code. You create simple but powerful function signatures.</p>

  <p>in @typescript_lib folder, there is a typescript implementation of statistical process charts (SPC) calculations. Your job is to port it to python. Example interfaces for the calculation function you will find from the @tests/test_imr_chart.py file.</p>

  <p>Use pandas api and its functions, otherwise keep dependencies minimum. If there is an elegant way to implement some required functionality using external library, ask before using it.</p>

  <p>Mandatory requirements: the calculations must be identical, all functionality must be implemented in the new version, all the tests must pass, the new code must be pythonic, i.e. all the typescript idiosyncracies has to be replaced to python ones, the code must elegant and simple</p>

  <p>Ask questions as long as you need and make sure you understand all the detail.</p>
</blockquote>

<p>It chewed on the problem for quite a long time, but in the end it produced a working version of the library in Python. It wasn’t elegant and was definitely not maintainable, but it worked.</p>

<h3 id="using-the-planning-mode">Using The Planning Mode</h3>

<p>In the next attempt, I started by planning the solution together with the AI.</p>

<p>The session started like this (verbatim, errors and all):</p>

<blockquote>
  <p>You are an export [SIC :)] python programmer. You design elegant and pythonese apis.</p>

  <p>Your task is to first design an api for SPC/Control Charts library. Library calculates the values for a separate charting library so that’s easy to plot the control chart visuals.</p>

  <p>The library should not have any external dependencies unless it’s absolutely required or makes perfect sense. The libraries used must the main stream, common, and supported.</p>

  <p>The library needs to have separate functions to create diffent kinds of control chart objects (function will return these, they will contain the observations, labels, groups, ucls, lcls, usls, lsls, etc). The functions should have as similar apis as possible. The api must be as easy and intuitive as possible.</p>

  <p>Here’s a sketch of the input and output data objects. Review and make suggestions how to turn it into native python api and easier to use.</p>

  <p>Ask questions as long as everything is clear. Do not cut corners.</p>

</blockquote>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"""
Data models for control charts.

This module defines the data structures used throughout the pycontrolcharts package.
All models use Python dataclasses for clean, type-safe data structures.
"""

from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional, List, Union

class RunType(IntEnum):
    """
    Types of run test violations.
    
    Based on Western Electric and Nelson run test rules.
</code></pre></div></div>
<p>[…]</p>

<p>It took several hours of back and forth: I provided code examples and answered a multitude of questions, and by the time we were done, we had created almost a megabyte of chat logs and markdown. Before the implementation started, we had a detailed implementation plan.</p>

<p>This made all the difference. After the second round I had a solid library in my hands.</p>

<p>I used the TypeScript library to validate the results of the Python implementation. The charts differed in a couple of tests. Funnily enough, one of the run test implementations in the original code did not detect one of the edge cases. I had no idea about this bug, but the AI fixed it while porting the code. Deliberately or not, I don’t know.</p>

<p>The code required some manual touches. Some parts of the plotting were not good, for example the padding around the chart; no matter how I tried, the AI just didn’t get it right and kept suggesting more and more complex solutions. The complex run tests were implemented as a state machine, and this logic was too complex for the AI.</p>

<h3 id="my-findings">My Findings</h3>

<p>In the end, I was blown away by how amazingly helpful these tools have become.</p>

<p><strong>Productivity</strong></p>
<ul>
  <li><strong>Substantial acceleration.</strong> My rough estimate is that development was 5–10x faster compared to writing everything manually.</li>
  <li><strong>Faster learning.</strong> Discovering “pythonic” approaches was effortless. The feedback loop was immediate.</li>
</ul>

<p><strong>Quality</strong></p>

<ul>
  <li><strong>Code quality is reasonable.</strong> Especially in repetitive or boilerplate code, the output was often cleaner than what I would have written initially.</li>
  <li><strong>Improved documentation.</strong> The AI consistently generated more comprehensive documentation than I typically would for a side project.</li>
</ul>

<p><strong>Limits</strong></p>

<ul>
  <li><strong>AI struggles with difficult code.</strong> The model struggled with complex layouts and the non-trivial state machine. I ended up implementing those parts manually.</li>
  <li><strong>Good architecture and design do not emerge automatically.</strong> Without explicit guidance, the structure degrades quickly.</li>
  <li><strong>Using AI well is a skill.</strong> It resembles mentoring a junior developer. Instructions must be precise. Feedback must be continuous.</li>
</ul>

<p><strong>Other benefits</strong></p>

<ul>
  <li><strong>AI enables planning and revising easily.</strong> Without AI tools, I would have implemented the first version, then realised a better way to do it, but at that point had neither the time nor the energy to change it.</li>
  <li><strong>Building stuff is fun again.</strong> For me, the joy of programming, or should I say building, is back. There is no need to remember all the nitty-gritty details, as AI is pretty good at those.</li>
  <li><strong>Your mileage will vary.</strong> This is not something everyone will enjoy. Personally, I get a kick out of building something new. If you love creating elegant code by hand, you might not enjoy AI-assisted coding.</li>
</ul>

<p>Sometimes AI produced excellent code for a long stretch and then suddenly introduced subtle regressions. Occasionally it modified unrelated parts of the code. It can lull you into complacently running <code>git add .</code> if you are not vigilant.</p>

<p>AI is extremely good at mechanical translation and boilerplate, but still weak at architectural clarity and complex state logic.</p>

<h3 id="what-does-this-mean">What Does This Mean?</h3>

<p>A few observations:</p>

<ul>
  <li>AI-assisted coding can’t be ignored. It increases developer efficiency substantially.</li>
  <li>Tool development is extremely fast at the moment. It’s not easy to stay up-to-date.</li>
  <li>It’s a lever. Used well, it provides good results fast. Used poorly, it creates an unmaintainable codebase even faster.</li>
  <li>It takes skill to use, and learning takes time.</li>
  <li>It’s an excellent tool for prototyping.</li>
  <li>It could be a huge asset for code migrations and give a new life to legacy codebases.</li>
  <li>With a proper feedback loop, it can provide better results than humans. For example, in performance optimization, AI can try out different solutions if it can run the performance comparisons by itself.</li>
</ul>

<h3 id="introducing-pycontrolcharts">Introducing pyControlCharts</h3>

<p><img src="/assets/2026/02/xmr_python.png" alt="XmR chart with pyControlCharts" />
<em>XmR chart plotted with pyControlCharts.</em></p>

<p>You can find the <a href="https://github.com/suanto/pycontrolcharts">pyControlCharts library</a> on GitHub or install it from <a href="https://pypi.org/project/pycontrolcharts/">PyPI</a>. It has all the major control charts along with the most important run tests. Most importantly, it provides the charts in both machine- and human-readable form: the charts are not just for human consumption, so you can easily use them in data pipelines, AI models, and as an AI agent tool.</p>

<p>Control charts are fundamentally about separating signal from noise in variation. In an era where automated systems increasingly make decisions based on data, having that signal available in structured form matters more than ever.</p>]]></content><author><name>Antti Suanto</name></author><category term="build" /><category term="control-charts" /><category term="python" /><category term="ai" /></entry><entry><title type="html">Databricks SQL Pipeline Syntax</title><link href="https://suanto.com/2025/05/25/databricks-sql-pipelines/" rel="alternate" type="text/html" title="Databricks SQL Pipeline Syntax" /><published>2025-05-25T10:00:00+00:00</published><updated>2025-05-25T10:00:00+00:00</updated><id>https://suanto.com/2025/05/25/databricks-sql-pipelines</id><content type="html" xml:base="https://suanto.com/2025/05/25/databricks-sql-pipelines/"><![CDATA[<h1 id="sql-pipeline-syntax-in-databricks">SQL Pipeline Syntax in Databricks</h1>
<p>Earlier this year, Databricks <a href="https://www.databricks.com/blog/sql-gets-easier-announcing-new-pipe-syntax">introduced a new syntax for data querying</a>. It was originally introduced by Google in BigQuery, and now Databricks has adopted it as well. <a href="https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-pipeline.html">SQL Pipe syntax</a> resembles Microsoft’s KQL, but it is much easier to learn because the syntax has not been reinvented. I like KQL, but I don’t like having to look up all the commands in the manual, as almost all the functions are named differently than in SQL.</p>

<p>One big difference between SQL and the pipe syntax is that SQL is declarative and the order of the operations does not resemble the order in which the machine processes the data to return what you requested. SQL has actually received quite a lot of criticism for how unintuitive its order of operations is. SQL Pipe syntax is not a procedural language either, but its order of operations is much more intuitive and easier for beginners to grasp.</p>
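<p>For comparison, here is the order in which a standard SQL query is written next to the order in which it is logically evaluated:</p>

```sql
-- Written order:     Logical evaluation order:
-- 1. SELECT          1. FROM / JOIN
-- 2. FROM / JOIN     2. WHERE
-- 3. WHERE           3. GROUP BY
-- 4. GROUP BY        4. HAVING
-- 5. HAVING          5. SELECT
-- 6. ORDER BY        6. ORDER BY
```

<p>Pipe syntax simply writes the steps in the order they are evaluated.</p>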

<h2 id="how-to-use-it">How to Use it</h2>

<p>Let’s say you are a data geek who wants to stay in Stockholm during the Midsummer festival to see the <a href="https://en.wikipedia.org/wiki/Maypole">Maypole</a> in <a href="https://skansen.se/en/see-and-do/non-bookable-activities/midsummer-at-skansen/">Skansen</a>. You decide to look for accommodation on Airbnb but as a data geek, you want to use SQL and Databricks instead of the Airbnb website. You decide to download scraped data from the Inside Airbnb website and crunch the data in Databricks.</p>

<p>Prepare the data using a notebook. First, download the datasets:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%sh
curl https://data.insideairbnb.com/sweden/stockholms-l%C3%A4n/stockholm/2025-03-23/data/listings.csv.gz --output /tmp/listings.csv.gz
curl https://data.insideairbnb.com/sweden/stockholms-l%C3%A4n/stockholm/2025-03-23/data/reviews.csv.gz --output /tmp/reviews.csv.gz
curl https://data.insideairbnb.com/sweden/stockholms-l%C3%A4n/stockholm/2025-03-23/data/calendar.csv.gz --output /tmp/calendar.csv.gz
</code></pre></div></div>

<p>Then, save it as a table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%python
listings_df = spark.read.csv("file:///tmp/listings.csv.gz", header=True, inferSchema=True, multiLine=True, escape='"')
listings_df.write.mode("overwrite").saveAsTable("listings")

reviews_df = spark.read.csv("file:///tmp/reviews.csv.gz", header=True, inferSchema=True, multiLine=True, escape='"')
reviews_df.write.mode("overwrite").saveAsTable("reviews")

calendar_df = spark.read.csv("file:///tmp/calendar.csv.gz", header=True, inferSchema=True, multiLine=True, escape='"')
calendar_df.write.mode("overwrite").saveAsTable("calendar")
</code></pre></div></div>

<p>You want to see the descriptions of available (or partially available) listed apartments for your dates, hosted by a person who has at least one listing with 50 or more reviews.</p>

<p>One way to do this using traditional SQL is with CTEs (Common Table Expressions). First, you would have to craft a CTE to find the available listings, another CTE to find hosts having a listing with the required number of reviews, and so on, until you have all the steps crafted. Finally, you would have to mesh it all together into the final result. You can do this in a Databricks SQL query:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH listings_with_more_than_50_reviews AS (
    SELECT listing_id, COUNT(*) AS review_count
    FROM reviews AS r 
    GROUP BY listing_id
    HAVING review_count &gt;= 50
), 

hosts_with_listing_that_has_more_than_50_reviews AS (
    SELECT host_id, MAX(r.review_count) AS max_review_count_of_hosts_listing
    FROM listings_with_more_than_50_reviews AS r
        JOIN listings AS l ON r.listing_id = l.id
    GROUP BY host_id
), 

listings_of_our_qualified_hosts AS (
    SELECT id AS listing_id, MAX(h.max_review_count_of_hosts_listing) AS max_review_count_of_hosts_listing
    FROM listings
        JOIN hosts_with_listing_that_has_more_than_50_reviews AS h ON listings.host_id = h.host_id
    GROUP BY id
), 

available_listings AS (
    SELECT listing_id
    FROM calendar
    WHERE 
        (date == '2025-06-20' AND available == 't')
        OR (date == '2025-06-21' AND available == 't')
    GROUP BY listing_id
),

final_result_table AS (
    SELECT l.id AS listing_id, l.host_id, r.max_review_count_of_hosts_listing, l.name, l.description
    FROM listings_of_our_qualified_hosts AS r
        JOIN available_listings AS f ON r.listing_id = f.listing_id
        JOIN listings AS l ON r.listing_id = l.id
    ORDER BY r.max_review_count_of_hosts_listing DESC, r.listing_id
)

SELECT *
FROM final_result_table

</code></pre></div></div>

<p>This is how the query, or the CTE flow, is structured:</p>

<p><img src="/assets/2025/05/sql_pipelines/cte_flow.png" alt="" /></p>

<p>As the image shows, we have two main data flows: one starts from the reviews, with the listings joined into it several times, and the second starts from the calendar and joins into the final results. The two flows are combined in the final_result_table CTE. The long chain is perfectly doable using CTEs, if a bit clumsy, but joining the two flows works perfectly with CTEs.</p>

<p>How would you do it using SQL Pipe Syntax?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM reviews AS r
|&gt; AGGREGATE COUNT(*) AS review_count 
  GROUP BY listing_id
|&gt; WHERE review_count &gt;= 50 
|&gt; AS listings_with_more_than_50_reviews 

|&gt; JOIN listings AS l ON listings_with_more_than_50_reviews.listing_id = l.id
|&gt; AGGREGATE MAX(review_count) AS max_review_count_of_hosts_listing 
  GROUP BY l.host_id
|&gt; AS hosts_with_listing_that_has_more_than_50_reviews

|&gt; JOIN listings AS l2 ON hosts_with_listing_that_has_more_than_50_reviews.host_id = l2.host_id
|&gt; SELECT DISTINCT l2.id,  hosts_with_listing_that_has_more_than_50_reviews.max_review_count_of_hosts_listing
|&gt; AS listings_of_our_qualified_hosts

|&gt; JOIN calendar AS c ON listings_of_our_qualified_hosts.id = c.listing_id
|&gt; WHERE (c.date == '2025-06-20' AND c.available == 't')
        OR (c.date == '2025-06-21' AND c.available == 't')
|&gt; AGGREGATE MAX(max_review_count_of_hosts_listing) AS max_review_count_of_hosts_listing 
  GROUP BY c.listing_id
|&gt; AS available_listings

|&gt; JOIN listings AS l3 on available_listings.listing_id = l3.id 
|&gt; SELECT available_listings.listing_id, l3.host_id, available_listings.max_review_count_of_hosts_listing, l3.name, l3.description
|&gt; ORDER BY available_listings.max_review_count_of_hosts_listing DESC, available_listings.listing_id
|&gt; AS final_result_table

</code></pre></div></div>

<p>Look at the difference right from the beginning:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM reviews AS r
|&gt; AGGREGATE COUNT(*) AS review_count 
  GROUP BY listing_id
|&gt; WHERE review_count &gt;= 50 
</code></pre></div></div>
<p>vs</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT listing_id, COUNT(*) AS review_count
FROM reviews AS r 
GROUP BY listing_id
HAVING review_count &gt;= 50
</code></pre></div></div>

<p>The SQL Pipe flow is much closer to how one naturally thinks, and the long flow is much clearer and more natural in the pipe syntax. We can also use multiple WHERE clauses, unlike in regular SQL. Joining the two data flows works well here, but if that flow were longer, CTEs might have an advantage, as the pipe syntax does not allow using them. At least not yet.</p>

<h2 id="when-and-why-would-you-use-sql-pipe-syntax">When and why would you use SQL Pipe syntax?</h2>

<p>The Pipe syntax is more intuitive and faster to understand when reading code written by other people, especially when the code does not combine multiple data flows into one. I’m sure it will be much faster for beginners to grasp, but it might take some time for experienced SQL users to adapt.</p>

<p>It also brings some small improvements over the SQL syntax, for example multiple WHERE clauses, so there is no need to write WHERE 1=1 to prevent errors while developing the code. DROP is also a big improvement, as it lets you specify which columns to remove instead of cataloging all the ones you need.</p>
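<p>A small sketch of both improvements (the column names are illustrative):</p>

```sql
FROM listings
|> WHERE price IS NOT NULL                  -- multiple WHERE steps are allowed...
|> WHERE room_type = 'Entire home/apt'      -- ...no WHERE 1=1 trick needed
|> DROP description, neighborhood_overview  -- keep every column except these
```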

<p>I’d expect the pipe syntax to be especially beneficial in cases where the data transformations are long, requiring a lot of filtering but the logic required might not be super complex. Also, the transformations written in SQL Pipe syntax might be easier to create and maintain, even by users who don’t have a lot of experience in SQL.</p>]]></content><author><name>Antti Suanto</name></author><category term="databricks" /><category term="sql" /><summary type="html"><![CDATA[Databricks has a new feature called SQL Pipeline Syntax. What is it and when to use it?]]></summary></entry><entry><title type="html">Track Development of Fabric using Fabric</title><link href="https://suanto.com/2025/03/28/fabric-tracking-using-fabric/" rel="alternate" type="text/html" title="Track Development of Fabric using Fabric" /><published>2025-03-28T17:00:00+00:00</published><updated>2025-03-28T17:00:00+00:00</updated><id>https://suanto.com/2025/03/28/fabric-tracking-using-fabric</id><content type="html" xml:base="https://suanto.com/2025/03/28/fabric-tracking-using-fabric/"><![CDATA[<p>If you work as a Data Architect on a platform built on Microsoft Fabric, you probably have a need to stay up-to-date with Fabric’s development roadmap as well it’s list of known issues. Microsoft has not made this easy as they don’t provide any mechanism for change alerts, and they don’t even provide history of developments. But don’t worry, there is a solution!</p>

<p>As we want to track Fabric, why not do the tracking in Fabric? In my <a href="https://github.com/suanto/FabricTracking">Github</a> repository, you can find (almost all of) the required Fabric items and an installation guide covering the tracking, change detection, and alerting. The only items you need to create manually are the Activator and its rules; you will find detailed instructions on how to create them in the repo.</p>

<p><img src="/assets/2025/03/fabric_tracking/github.png" alt="" /></p>

<p>The code itself is fairly simple. The only complicated part is scraping the Known Issues list from the Power BI API, but that part should be quite stable, as the data format is not likely to change.</p>

<p>You can configure the alerting you want by email or Teams. You can also start a data pipeline or trigger a custom action. The alerting is done with Fabric’s Activator, so you have the full power of the Activator in your hands.</p>

<p>As a data architect working with Microsoft Fabric, staying up-to-date with the platform’s development is important. That is why I have an automation script detecting changes on the Fabric <a href="https://learn.microsoft.com/en-us/fabric/release-plan/">roadmap</a>. Today, March 20th, it picked up a number of changes, 303 to be exact. The script counts every change, such as a renaming or a schedule change, as two (a deletion and an addition), but even counting the changes manually, I was able to find 131 changed items on the roadmap. That’s a lot.</p>

<p>After categorizing the changes, here’s the breakdown:</p>

<table>
  <thead>
    <tr>
      <th>Change reason</th>
      <th>Count</th>
      <th>Percentage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Shipped on time</td>
      <td>43</td>
      <td>33%</td>
    </tr>
    <tr>
      <td>Shipped late</td>
      <td>6</td>
      <td>5%</td>
    </tr>
    <tr>
      <td>New</td>
      <td>35</td>
      <td>27%</td>
    </tr>
    <tr>
      <td>Schedule push</td>
      <td>29</td>
      <td>22%</td>
    </tr>
    <tr>
      <td>Deleted*</td>
      <td>12</td>
      <td>9%</td>
    </tr>
    <tr>
      <td>Rename</td>
      <td>3</td>
      <td>2%</td>
    </tr>
    <tr>
      <td>Removed*</td>
      <td>4</td>
      <td>2%</td>
    </tr>
  </tbody>
</table>

<p><em>A ‘removed’ item refers to a roadmap item that is deleted after completion. A ‘deleted’ item refers to one deleted before completion.</em></p>

<p>Lots of completed features, many new additions, but also quite a few schedule changes and dropped items.</p>

<h1 id="translytical-fabric--fabric-cli">Translytical Fabric &amp; Fabric CLI</h1>

<p>There are some interesting items added to the roadmap. The first is <a href="https://learn.microsoft.com/en-us/fabric/release-plan/powerbi#translytical-task-flows">Translytical Task Flows</a>, the ability to update data directly from Power BI, which I <a href="https://suanto.com/2024/11/29/translytical-fabric/">wrote about</a> after Ignite. It should be in public preview in Q2. The second is <a href="https://learn.microsoft.com/en-us/fabric/release-plan/shared-experiences#fabric-cli">Fabric CLI</a>.</p>

<h1 id="microsoft-fabric-cli">Microsoft Fabric CLI</h1>

<blockquote>
  <p>Estimated release timeline: Q1 2025</p>

  <p>Release Type: Public preview</p>

  <p>Fabric CLI is a command-line interface tool designed to interact with Microsoft Fabric. It provides a way for users/developers to manage and automate tasks within the Fabric environment. The [sic] toolwill support various functionalities such as running notebooks, managing pipelines, handling permissions, and more. It is aimed at enhancing the user experience by offering an alternative to graphical interfaces, similar to how Azure CLI operates within the Azure Portal The Fabric CLI offers two primary modes:</p>

  <p>Interactive Mode - this mode allows users to interact with the CLI in real-time, executing commands one at a time and receiving immediate feedback. It is particularly useful for exploratory tasks and learning the CLI commands</p>

  <p>Command Line Mode - also known as scripting or batch mode, this mode enables users to execute multiple commands at once. It is ideal for automation tasks, such as running scripts or integrating with CI/CD pipelines]</p>

  <p>These modes provide flexibility for different use cases, whether you are performing ad-hoc tasks or automating complex workflows.</p>
</blockquote>

<p><em>Source: Microsoft Fabric Roadmap</em></p>

<p>The Fabric documentation does not mention anything about it yet, so this is all we have for now. The description on the roadmap sounds great, as one thing missing from Fabric is the ability to manage resources from the command line. Having a good UI is nice, but managing a large number of resources is impossible without a scripting solution. That is why I’m really excited about this new tool.</p>

<p>It appears that we will soon have an Azure CLI-like tool for Fabric. The CLI is scheduled for release in Q1 2025, and it was added to the roadmap just eleven days before the quarter ends, so it is quite obvious it will be announced at <a href="https://www.fabricconf.com/">FabCon</a> in Las Vegas.</p>

<p>Update 2025-03-21: There is a session about Fabric CLI in FabCon: <a href="https://www.fabricconf.com/#!/session/Microsoft%20Fabric%20-%20The%20Command%20Line%20Way/7316">Microsoft Fabric - The Command Line Way</a></p>

<p>I am really looking forward to it.</p>

<p>Update 2025-03-28: There is also a session about <a href="https://www.fabricconf.com/#!/session/Announcing:%20User%20Data%20Functions%20in%20Fabric%20for%20Developer%20Flexibility%20Now%20in%20Public%20Preview!/7311">User Data Functions</a>.</p>

<blockquote>
  <p>Learn how to implement common data applications, such as data cleaning, data validations, data transformation and CRUD operations into Fabric data sources. […] Participants will also gain insights about the seamless integrations with Fabric Pipelines, Fabric Notebooks, Fabric Warehouses, Fabric SQL Databases and even Power BI reports, to build robust and flexible data applications.</p>
</blockquote>

<p>Sounds a lot like Translytical Fabric.</p>

<p>As an Azure or Databricks admin, you sometimes need to create Azure Databricks workspaces. You don’t want to do it from the portal every time, as that does not scale. The answer is to use ARM templates, or Bicep. You create the resources through the portal, enable workspace logging from Diagnostic Settings, and export the resource’s ARM template. Then you notice that the <a href="https://stackoverflow.com/questions/70200523/diagnostic-setting-not-included-in-azure-portal-arm-template-export">diagnostic settings are not included</a> in the template. What to do?</p>

<h3 id="template-for-diagnostic-settings">Template for Diagnostic Settings</h3>

<p>You can find <a href="https://learn.microsoft.com/en-us/azure/templates/microsoft.insights/diagnosticsettings?pivots=deployment-language-bicep">templates for diagnostic settings</a> in the Azure documentation. Now you have the basic template, but what are the log categories for the Azure Databricks resource? There is a <a href="https://learn.microsoft.com/en-us/azure/databricks/admin/account-settings/audit-logs">reference guide for logging</a>, but the categories (or service names, as they are called on the page) are not all documented, and their names are not the same as in the diagnostic settings. Where can you find the category names?</p>

<h3 id="finding-the-category-info">Finding the category info</h3>

<p>In the Azure portal, add a diagnostic setting for the Azure Databricks Workspace resource and click the ‘JSON view’ link. The panel that opens contains the category settings in JSON, which you can combine with the template.</p>

<p>Here are all the current categories listed as a Bicep array:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var dg_setting_categories = [
  {
    category: 'dbfs'
    enabled: true
  }
  {
    category: 'clusters'
    enabled: true
  }
  {
    category: 'accounts'
    enabled: true
  }
  {
    category: 'jobs'
    enabled: true
  }
  {
    category: 'notebook'
    enabled: true
  }
  {
    category: 'ssh'
    enabled: true
  }
  {
    category: 'workspace'
    enabled: true
  }
  {
    category: 'secrets'
    enabled: true
  }
  {
    category: 'sqlPermissions'
    enabled: true
  }
  {
    category: 'instancePools'
    enabled: true
  }
  {
    category: 'sqlAnalytics'
    enabled: true
  }
  {
    category: 'genie'
    enabled: false
  }
  {
    category: 'globalInitScripts'
    enabled: true
  }
  {
    category: 'iamRole'
    enabled: true
  }
  {
    category: 'mlflowExperiment'
    enabled: false
  }
  {
    category: 'featureStore'
    enabled: false
  }
  {
    category: 'RemoteHistoryService'
    enabled: false
  }
  {
    category: 'mlflowAcledArtifact'
    enabled: false
  }
  {
    category: 'databrickssql'
    enabled: true
  }
  {
    category: 'deltaPipelines'
    enabled: true
  }
  {
    category: 'modelRegistry'
    enabled: false
  }
  {
    category: 'repos'
    enabled: true
  }
  {
    category: 'unityCatalog'
    enabled: true
  }
  {
    category: 'gitCredentials'
    enabled: true
  }
  {
    category: 'webTerminal'
    enabled: true
  }
  {
    category: 'serverlessRealTimeInference'
    enabled: true
  }
  {
    category: 'clusterLibraries'
    enabled: true
  }
  {
    category: 'partnerHub'
    enabled: true
  }
  {
    category: 'clamAVScan'
    enabled: true
  }
  {
    category: 'capsule8Dataplane'
    enabled: true
  }
  {
    category: 'BrickStoreHttpGateway'
    enabled: true
  }
  {
    category: 'Dashboards'
    enabled: true
  }
  {
    category: 'CloudStorageMetadata'
    enabled: true
  }
  {
    category: 'PredictiveOptimization'
    enabled: true
  }
  {
    category: 'DataMonitoring'
    enabled: true
  }
  {
    category: 'Ingestion'
    enabled: true
  }
  {
    category: 'MarketplaceConsumer'
    enabled: true
  }
  {
    category: 'LineageTracking'
    enabled: true
  }
]
</code></pre></div></div>

<p>And here is an example of a diagnostic setting configuration using a <a href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/loops">for-loop</a>. Note that the scope has to be your Databricks workspace resource. This example sends the logs to a storage account.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>resource setting 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'dg-databricks-logs'
  scope: dbw_databricks_workspace
  properties: {
    storageAccountId: st_databricks_logs.id
    logs: [ for cat in dg_setting_categories: {
        category: cat.category
        enabled: cat.enabled
        retentionPolicy: {
          days: 0
          enabled: false
        }
      }
    ]
    metrics: []
  }
}
</code></pre></div></div>

<p>If you want to enable all the logs, you can just use the categoryGroup. Here’s an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
    "category": null,
    "categoryGroup": "allLogs",
    "enabled": true,
    "retentionPolicy": {
        "days": 0,
        "enabled": false
    }
}
</code></pre></div></div>

<h3 id="user-query-logging">User query logging</h3>

<p>One thing to keep in mind is that if you need to log users’ SQL queries and other commands, enabling the SQL categories from diagnostic settings is not enough. You also need to enable <a href="https://docs.databricks.com/en/admin/account-settings/verbose-logs.html">verbose audit logs</a> from the workspace settings.</p>

<p><img src="/assets/2025/02/databricks-logs/databricks-verbose-audit-logs.png" alt="" /></p>

<p>You might want to do this using the <a href="https://docs.databricks.com/api/workspace/workspaceconf/setstatus">Databricks REST API</a> or the Databricks CLI. With these tools, what is the setting name? Fortunately, someone has <a href="https://github.com/fusionet24/DailyDatabricks/blob/main/tips/workspace-conf.md">documented it</a>.</p>

<p>Here is how you can query and enable the verbose logging using the Databricks CLI.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>databricks workspace-conf get-status enableVerboseAuditLogs
<span class="nv">$ </span>databricks workspace-conf set-status <span class="nt">--json</span> <span class="s1">'{ "enableVerboseAuditLogs": "true" }'</span>
</code></pre></div></div>
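<p>For reference, those CLI commands are thin wrappers over the workspace configuration REST API (GET and PATCH on /api/2.0/workspace-conf). Here is a minimal Python sketch of the same calls; the workspace URL and personal access token are placeholders you supply yourself.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: read and toggle verbose audit logs via the workspace-conf API.
import json
import urllib.request

CONF_KEY = "enableVerboseAuditLogs"

def conf_get_url(host):
    # URL for reading the current verbose-audit-log setting.
    return host + "/api/2.0/workspace-conf?keys=" + CONF_KEY

def conf_patch_body(enabled):
    # The API expects booleans as strings ("true"/"false").
    return json.dumps({CONF_KEY: "true" if enabled else "false"}).encode()

def set_verbose_audit_logs(host, token, enabled):
    req = urllib.request.Request(
        host + "/api/2.0/workspace-conf",
        data=conf_patch_body(enabled),
        method="PATCH",
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"})
    urllib.request.urlopen(req)  # raises on a non-2xx response
</code></pre></div></div>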

<p>One more thing to note. If logging user queries is required for compliance, you might need to send the logs to a Log Analytics workspace and create an alert for any changes in the verbose logging.</p>]]></content><author><name>Antti Suanto</name></author><category term="databricks" /><category term="azure" /><category term="technical" /><summary type="html"><![CDATA[How to enable Azure Databricks workspace logging using Bicep?]]></summary></entry><entry><title type="html">Key Lessons From Managing a Cloud-Scale Data Platform in Azure</title><link href="https://suanto.com/2025/01/22/key-lessons-from-cloud-scale-data-platform-in-azure/" rel="alternate" type="text/html" title="Key Lessons From Managing a Cloud-Scale Data Platform in Azure" /><published>2025-01-22T05:00:00+00:00</published><updated>2025-01-22T05:00:00+00:00</updated><id>https://suanto.com/2025/01/22/key-lessons-from-cloud-scale-data-platform-in-azure</id><content type="html" xml:base="https://suanto.com/2025/01/22/key-lessons-from-cloud-scale-data-platform-in-azure/"><![CDATA[<p>In my previous role, I was the in-house architect of a cloud-scale data and analytics platform. It was a rather large platform, at least by Finnish standards. It had 1PB+ data, dozens of integrations, and 150+ users. There were about 15 people in the team managing and developing the platform.</p>

<p>After reflecting on that experience, I have decided to share the key lessons I have learned. The platform operated on Azure and Databricks but the things I learned are general and probably apply to every large data platform.</p>

<h2 id="1-governance-governance-governance">1. Governance, Governance, Governance</h2>

<p>Governance is a curious thing. While it’s often discussed, it is an elusive and abstract concept. As an ex-developer, I wasn’t initially keen on governance.</p>

<p>Governance is like a paddle in canoeing. It is possible to start your trip without one, letting the current do the work, and you might not even miss it. When you realize you need it, it is probably too late.</p>

<p>You don’t absolutely need governance at the start, while operating on a small scale, but as your platform grows to include hundreds of data pipelines and thousands of tables for hundreds of users and systems, it becomes essential. When you reach a certain scale, you cannot operate efficiently if you don’t have proper governance in place. In the beginning, you will feel like governance hinders your progress, but as the platform grows, it starts to make you so much more efficient.</p>

<p>What is governance of a data platform? There is a multitude of definitions, but I think of it as everything that helps you manage the platform at scale. It can include processes, policies, handbooks, or tools.</p>

<p>As the business of the data platform is data, a large part of the governance focuses on that. Besides data, you need to manage other assets, such as data pipelines, Spark notebooks, ML-models, data models, reports, and so on.</p>

<p>Here are some questions that governance helps you answer. These questions are written for data but could concern other assets as well:</p>

<ul>
  <li>What data do you have?</li>
  <li>Who owns it?</li>
  <li>What does it actually contain?</li>
  <li>How do you classify it?</li>
  <li>How do users find the data?</li>
  <li>Who can access the data?</li>
</ul>

<p>The most important thing about governance is to have enough of it, but not too much. Secondly, you should automate as much as reasonably possible. Governance easily gets trumped by business requirements, and automating it helps keep it up to date.</p>

<h2 id="2-embrace-your-platform">2. Embrace Your Platform</h2>

<p>Operating in the cloud at scale is expensive, and when your cloud bill has as many digits as there are days in the week, you start to think about your options. There is a school of thought which says that you should only use platform-agnostic features to make switching providers easier. The rationale is that using only the common features makes switching the platform easy - or at least easier - if things go awry. In this thinking, you would only use the common Spark features and refuse to take advantage of Databricks’ or Fabric’s advanced Spark features.</p>

<p>I understand the idea, but it is not a way to live. It is like buying a Ferrari but driving it like a Lada because you might want to trade it in someday.</p>

<p>The likelihood of switching platforms is quite low. The migration cost for a large data platform is enormous, so you are probably not going to do it. It is best to embrace your platform and use it to the maximum extent, while making sure your data is not locked in.</p>

<h2 id="3-spear-heading-technology-is-hard">3. Spear-heading Technology is Hard</h2>

<p>Using battle-proven technology is a safe bet but sometimes it isn’t possible. Data lakehouse technology provides tremendous benefits but it’s still quite young. If you want to use it, you are forced to be a pioneer in new technology.</p>

<p>When using new tech, you are going to do at least some R&amp;D work for the platform provider. New features require testing, and you will be the one testing them. If you choose the path of new technology, partner up wisely. Proven technology can be learned from books, but if you choose new tech, you need a partner who is in the know.</p>

<p>Using new technology often means that something does not work as expected or that there are undocumented features. Maybe your cloud database gives a performance boost if the datafiles are at least 5 GB (or some other arbitrary figure) in size. The point is, you need someone with connections to the product team.</p>

<p>One way to check this is to ask whether your tech partner attends the right conferences. Local, regional, or the gold standard: global? If yes, that is a good sign. If not, you might want to keep looking. Platform provider MVP or similar status is also a positive sign. The goal is to find partners with strong relationships with the platform provider, to make sure you get support when needed.</p>

<h2 id="4-data-keeps-changing">4. Data Keeps Changing</h2>

<p>There are certain aspects of data handling that you need to fix before implementation. For example, data access patterns influence technology choices and implementation, and data modeling needs to be designed based on user requirements. The schema of the data determines the table layout, and the distribution of the data affects Spark job efficiency.</p>

<p>The challenge is that, in the real world, data is constantly changing.</p>

<p>For example, the original requirement might have been to import data in a daily batch, so you designed a write-once, read-many optimized solution. Then the data became popular, and now you need to bring it in every 10 minutes, creating a challenging write-many, read-many situation.</p>

<p>Perhaps you modeled your data using the (way too popular) Data Vault 2.0 method. The original data didn’t have any <a href="https://www.gdpreu.org/the-regulation/key-concepts/personal-data/">PII</a>, but then someone started to use the description field in the source system to record customers’ email addresses. Now you need to design a delete or scrub process for your data vault by hand, as your data vault automation tool doesn’t support it.</p>

<p>One thing that constantly surprises new Spark developers is how well you need to know the data distribution when using Spark. Say you partition your customer activity data by customer. The worst-case scenario is that 20% of your executors do 80% of the work, as customer activity is typically not evenly distributed. The thing is that data distribution can and will change, so even if you balance the distribution correctly when implementing a job, you need to monitor the workloads to identify when the change happens. Failure to do this might lead to your Spark jobs starting to fail with out-of-memory errors.</p>
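<p>This kind of skew is cheap to check for. Here is a plain-Python sketch of such a check; the customer names and counts are made up, and with Spark you would get the per-key counts from a groupBy-count over the partition key and apply the same logic.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def skew_ratio(rows_per_key):
    # Largest key's row count divided by a perfectly even share.
    # 1.0 means perfectly balanced; large values mean one key dominates.
    counts = list(rows_per_key.values())
    even_share = sum(counts) / len(counts)
    return max(counts) / even_share

# Made-up customer-activity counts: one big customer dominates.
activity = {"acme": 8_000_000, "beta": 120_000, "gamma": 95_000, "delta": 60_000}
if skew_ratio(activity) > 3:  # the threshold is a judgment call
    print("partitioning by customer will overload one executor")
</code></pre></div></div>

<p>Running a check like this periodically, instead of only at implementation time, is what catches the distribution drift before the out-of-memory errors do.</p>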

<h2 id="5-monitor-or-you-dont-know">5. Monitor or You Don’t Know</h2>

<p>Is your platform being used? Are your tables, reports, and ML-models actually utilized? Processing data is expensive. Is the end product used? How do you know? The only answer is that if you don’t monitor, you don’t know.</p>

<p>We all know the examples of how people say one thing but actually do something else. Based on polls, charities should be drowning in money, and stores shouldn’t be able to stock organic food shelves fast enough.</p>

<p>The same applies in corporate IT. A report and its underlying data might be added to the platform ‘just in case,’ with claims that it’s the department’s most important report. In some cases, access to data or personal clusters can even become status symbols. Only monitoring will reveal the truth.</p>

<p>Another category of monitoring is cloud resource usage. Developers are usually pressed mostly on delivering the feature to the business, not so much on how many resources the feature uses. Monitoring might reveal issues such as a query running over the entire data lake when it was thought to access only the latest partition. Or a pipeline running every 5 minutes, not once a day as intended. Or a Spark cluster reading the same data eight times because executors are constantly being evicted.</p>

<p>Monitoring costs is another critical aspect. Monitoring resource usage covers a lot of this but not all of it. We have all heard the stories of a monster-sized VM accidentally being left on, causing an enormous cloud bill.</p>

<p>For monitoring, the best approach is to create a process, provide tools, and distribute responsibility.</p>

<h1 id="summary">Summary</h1>

<p>Building a new data and analytics platform using developing technology can be an exciting journey. By focusing on governance, embracing your platform, choosing the right technology, adapting to changing data, and monitoring effectively, you can create a platform that actually works in the real world and provides an immense amount of value.</p>

<p><em>ps. Automate the deployments, and manage the platform using <a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/considerations/infrastructure-as-code">IaC</a>. They will save a ton of work in the long run.</em></p>]]></content><author><name>Antti Suanto</name></author><category term="databricks" /><category term="azure" /><category term="data-platforms" /><category term="learnings" /><summary type="html"><![CDATA[What did I learn from being an architect on a petabyte-scale data and analytics platform?]]></summary></entry><entry><title type="html">Deep-dive Comparison of Databricks and MS Fabric - Part 2 - High-level Overview and Architecture</title><link href="https://suanto.com/2024/12/27/databricks-vs-microsoft-fabric-part-02/" rel="alternate" type="text/html" title="Deep-dive Comparison of Databricks and MS Fabric - Part 2 - High-level Overview and Architecture" /><published>2024-12-27T09:00:00+00:00</published><updated>2024-12-27T09:00:00+00:00</updated><id>https://suanto.com/2024/12/27/databricks-vs-microsoft-fabric-part-02</id><content type="html" xml:base="https://suanto.com/2024/12/27/databricks-vs-microsoft-fabric-part-02/"><![CDATA[<p>In the <a href="https://suanto.com/2024/12/11/databricks-vs-microsoft-fabric-part-01/">previous part</a> of the series, we saw how the platforms came to exist. In this part, we will see how Microsoft and Databricks describe the platforms themselves, what the platforms contain, and what their architecture looks like. The goal is to understand in what respects the platforms differ and in what respects they are similar. The big question is, of course, when to use Fabric and when Databricks is the better choice.</p>

<p>These platforms, especially Fabric, are developing fast, and the things mentioned in this post are likely to become outdated soon.</p>

<p>Let’s first see how the platforms are described by the companies.</p>

<h1 id="high-level-overview">High-level overview</h1>

<table>
  <thead>
    <tr>
      <th>Databricks</th>
      <th>Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Azure Databricks is a <strong>unified, open analytics platform</strong> for building, deploying, sharing, and maintaining enterprise-grade <strong>data</strong>, analytics, and AI solutions at scale. The Databricks Data Intelligence Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf.<a href="https://learn.microsoft.com/en-us/azure/databricks/introduction/"><br />Source</a></td>
      <td>Microsoft Fabric is an end-to-end <strong>analytics and data platform</strong> designed for enterprises that require a <strong>unified</strong> solution.  It encompasses data movement, processing, ingestion, transformation, real-time event routing, and report building.<a href="https://learn.microsoft.com/en-us/azure/databricks/introduction/"><br />Source</a></td>
    </tr>
  </tbody>
</table>

<p>Not that different, right? They emphasize slightly different aspects, but both claim to be unified data and analytics platforms at scale.</p>

<p>Taking a deeper look at the introduction pages and cutting through the zeitgeist AI speak, some differences arise. Databricks emphasizes programmatic access, commitment to open-source, and customers’ options for components such as storage. Fabric emphasizes deep integration of the platform, and ease-of-use.</p>

<p>These differences become noticeable quite soon after you start using the platforms. Databricks requires more setup than Fabric. Fabric, in turn, does not let you see or change the details of the platform. The administration of the platforms is also quite different, at least at the moment. Databricks is essentially an API-first platform, meaning that you can control everything using APIs and most features using the Admin UI. Fabric is the opposite: everything can be controlled from the Admin UI, but only some things can be controlled using APIs.</p>

<p>It’s interesting to see that the design philosophy differences are in some sense similar to Linux/Unix compared to Windows. Linux being open-source and easy to automate through the command prompt while Windows still requires some point-and-click administration, and has a slick and polished UI making everything easy for the beginner.</p>

<h1 id="features">Features</h1>

<p>What about a feature-level comparison? How do the platforms fulfill the normal data platform requirements? Normally, these kinds of tables are done only by corporate buyers, but they actually do serve a purpose for us here. They help us get a high-level picture of what the platforms provide and also serve as a map for us in later parts of the series, when we will start comparing the features in practice.</p>

<h3 id="data-ingestion-transform-and-streaming">Data Ingestion, Transform, and Streaming</h3>

<p>Data ingestion is the bread-and-butter feature of a data platform: getting data in is fundamental, and data engineers spend a large part of their daily job working with it.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Data Ingestion</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Orchestrating the pipelines</td>
      <td>Workflows / Jobs ¹</td>
      <td>Data Factory</td>
    </tr>
    <tr>
      <td>Ingesting data</td>
      <td>Spark Notebooks</td>
      <td>* Data pipelines<br />* Data flows<br />* Fabric (Spark) Notebooks</td>
    </tr>
    <tr>
      <td>Processing the data</td>
      <td>Spark Notebooks</td>
      <td>* Fabric (Spark) Notebooks (in Lakehouse, KQL)<br />* Stored Procedures (T-SQL in Warehouse)<br />* KQL Querysets (KQL Database)</td>
    </tr>
    <tr>
      <td>Streaming</td>
      <td>* Spark Structured Streaming<br />* Delta Live Tables</td>
      <td>* Spark Structured Streaming<br />* Event Streams</td>
    </tr>
    <tr>
      <td>Database mirroring</td>
      <td>-</td>
      <td>Several²</td>
    </tr>
  </tbody>
</table>

<p>¹ Many times combined with Azure Data Factory or a similar orchestration tool.<br />
² Azure SQL, Azure SQL MI (preview), Cosmos DB (preview), Databricks (preview), Snowflake.</p>

<p>It is interesting to see how Fabric has a multitude of options compared to Databricks. Databricks relies on Spark and notebooks for everything, while Fabric offers different options for ingesting and processing the data. Fabric also has an option to mirror a database directly into Fabric, skipping the data pipeline altogether.</p>

<h3 id="data-storage">Data Storage</h3>

<p>What about storage? Ingesting data is important but where will it land?</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Data Storage</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Storage Options</td>
      <td>Any major cloud vendor storage</td>
      <td>* Fabric’s OneLake<br />* Using Shortcuts, major cloud vendor storages</td>
    </tr>
    <tr>
      <td>Storage Format</td>
      <td>Delta Lake (delta tables or any other format)</td>
      <td>* Lakehouse, Delta Lake (or any other format)<br />* Warehouse, Delta Lake only<br />* KQL (proprietary)</td>
    </tr>
  </tbody>
</table>

<p>Storage is really interesting. At first, it seems that the platforms are the same. They both allow the use of any major cloud provider as the storage. On a closer look, the differences arise.</p>

<p>Databricks is agnostic to storage, and it can combine multiple storage accounts or even multiple cloud vendors by using external locations. Accessing these locations from Databricks is pretty much transparent.</p>

<p>Fabric is different. On the surface, it forces you to use OneLake, their own storage solution. However, OneLake can have shortcuts to OneLake itself or any major cloud vendor storage. Using these shortcuts is completely transparent. The user does not know whether they are using OneLake or some other cloud storage through a shortcut. OneLake, in effect, can be used to integrate several storages into one. It also exposes a REST API, so even third-party tools can enjoy this seamless integration. This means that you can use OneLake to combine several cloud vendors’ storages into a single virtual storage and use it, for example, from Databricks.</p>
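<p>To make that transparency concrete: OneLake speaks the same DFS endpoint dialect as ADLS Gen2, with onelake.dfs.fabric.microsoft.com as the host, so ADLS-compatible tools can address it directly. The sketch below only builds the two common path forms; the workspace, item, and file names are made up.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"

def onelake_abfss_path(workspace, item, path, item_type="Lakehouse"):
    # ABFS URI that Spark jobs (including Databricks) can read directly.
    return f"abfss://{workspace}@{ONELAKE_HOST}/{item}.{item_type}/{path}"

def onelake_https_path(workspace, item, path, item_type="Lakehouse"):
    # DFS endpoint form, usable with ADLS Gen2 SDKs and REST calls.
    return f"https://{ONELAKE_HOST}/{workspace}/{item}.{item_type}/{path}"

# Made-up names:
print(onelake_abfss_path("sales-ws", "gold", "Files/orders.parquet"))
</code></pre></div></div>

<p>The ABFS form is what you would hand to a Spark job, for example on Databricks; the HTTPS form is what ADLS Gen2 SDKs and REST calls use.</p>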

<h3 id="compute">Compute</h3>

<p>Compute is the heart of the data platform. It is used constantly, and that is why it is so important that it fits the purpose, scales, and is easy to use.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Compute</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Compute engines</td>
      <td>Spark (Photon)</td>
      <td>Spark (in Lakehouse)<br />T-SQL (Polaris engine?) (in Warehouse)<br />KQL (Real-time)</td>
    </tr>
    <tr>
      <td>Compute hosting</td>
      <td>Serverless<br />Customer cloud account</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>Transaction support</td>
      <td>Table-level</td>
<td>Table-level (Lakehouse)<br />Database-level (Warehouse)</td>
    </tr>
  </tbody>
</table>

<p>For the compute engine, Databricks provides only one option: their trusted Spark engine, hosted either in the customer’s cloud account or serverless, i.e. hosted in Databricks’ account. At the Databricks Data+AI Summit in June 2024, Databricks’ CEO Ali Ghodsi said they will focus on serverless compute, providing all new features to serverless compute first.</p>

<p>Fabric has several compute engines, but they operate only as serverless compute, i.e. with fewer customization options than Databricks. Fabric offers Spark for the Lakehouse, the <a href="https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf">Polaris engine</a> (or a <a href="https://medium.com/creative-data/data-warehouse-polaris-vs-data-lakehouse-spark-in-microsoft-fabric-37bd65525def">modified</a> version of it) for the Warehouse, and KQL for Real-Time.</p>

<p>They both provide table-level transactions when using Spark and Delta Lake. Fabric’s Warehouse is also capable of database-level transactions.</p>

<h3 id="data-governance">Data Governance</h3>

<p>Data governance is an essential part of a modern data platform. It is an emerging area, which means that the features are very much in development.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Governance</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Access Control</td>
      <td>Unity Catalog (tables / ML-models / …)</td>
      <td>Workspace-level</td>
    </tr>
    <tr>
      <td>Auditing</td>
      <td>Unity Catalog</td>
      <td>MS Purview</td>
    </tr>
    <tr>
      <td>Discovery/Catalog</td>
      <td>Unity Catalog</td>
      <td>MS Purview</td>
    </tr>
    <tr>
      <td>Lineage</td>
      <td>Unity Catalog</td>
      <td>MS Purview</td>
    </tr>
    <tr>
      <td>Monitoring</td>
      <td>Unity Catalog</td>
      <td>MS Purview</td>
    </tr>
  </tbody>
</table>

<p>Databricks has fine-grained and functional access control based on Unity Catalog. It allows managing access at the workspace, catalog, and object levels.</p>

<p>Fabric has a different security model. It relies heavily on the workspace as the basic building block for securing access. Fabric has the <a href="https://learn.microsoft.com/en-us/fabric/release-plan/onelake#onelake-security-model">OneLake security model</a> on the roadmap.</p>

<p>For the other governance features, Databricks relies on Unity Catalog and Fabric on Purview.</p>

<p>Data governance is one of the areas where platform providers are <a href="https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html#gangwar">competing heavily</a>. This should bring quite dramatic improvements in this area over the next couple of years.</p>

<h3 id="bi--reporting">BI / Reporting</h3>

<p>Whether a data and analytics platform should contain an integrated BI or reporting solution can be debated, but it is, of course, nice to have one.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>BI / Reporting</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td>AI/BI ³</td>
      <td>Power BI</td>
    </tr>
  </tbody>
</table>

<p>³ Usually combined with an external tool, such as Power BI</p>

<p>Databricks has its AI/BI solution integrated into the platform, but it has been a pretty feature-light solution. Usually, Power BI or some other BI tool has been used to complement the platform. BI is now one of Databricks’ investment areas, as we heard from CEO Ali Ghodsi at the Data+AI Summit this year.</p>

<p>Fabric integrates Power BI into the platform; in fact, Fabric is essentially Power BI with extended features. Power BI is one of the best BI tools on the market.</p>

<h3 id="automation-ci--cd">Automation (CI &amp; CD)</h3>

<p>In any production platform, it is important to automate setting up the environment and its maintenance. If operating at scale, it is mandatory. What kind of tools do these platforms provide for automation?</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Automation</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Automation</td>
      <td>* Terraform<br />* Databricks CLI (Preview)<br />* APIs</td>
      <td>* Terraform (Preview)<br />* APIs</td>
    </tr>
    <tr>
      <td>Git Integration</td>
      <td>x</td>
      <td>x (preview)</td>
    </tr>
    <tr>
      <td>CI/CD deployment tool</td>
      <td>* Terraform<br />* Databricks Asset Bundles <br /> * Git</td>
      <td>* Terraform (preview) <br /> * Git</td>
    </tr>
    <tr>
      <td>APIs</td>
      <td>x</td>
      <td>x (partial support)</td>
    </tr>
  </tbody>
</table>

<p>Databricks has extensive support for automating the platform. It provides several full-scale solutions to fully automate setting up and maintaining the platform.</p>

<p>Fabric has a number of automation tools available, though many of them are in preview. Some of these choices are really interesting. For example, using <a href="https://blog.fabric.microsoft.com/en-us/blog/announcing-the-new-terraform-provider-for-microsoft-fabric-public-preview">Terraform</a> as the deployment tool seems odd, since it is a third-party tool. One would expect Microsoft to use ARM or Bicep for this.</p>

<h2 id="ai--ml">AI / ML</h2>

<p>AI and ML features are an integral part of a modern analytics platform, and this is the area where platform providers are <a href="https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html#gangwar">investing heavily</a>. A platform may offer models, algorithms, and other functionality, but it should at least support automation and deployment of the models.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>AI/ML</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Tools</td>
      <td>MLflow</td>
      <td>MLflow</td>
    </tr>
    <tr>
      <td>AI</td>
      <td>Mosaic AI</td>
      <td>Prebuilt AI models (preview)</td>
    </tr>
  </tbody>
</table>

<p>Both platforms contain MLflow to manage ML models. Both platforms also contain AI features, though Fabric’s are in preview.</p>

<h2 id="developer-tools">Developer tools</h2>

<p>For the long term success of the platform, it is essential to provide good tools for the developers. What tools do these platforms offer?</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Dev tools</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>VS Code extension</td>
      <td>x</td>
      <td>x (notebooks)</td>
    </tr>
    <tr>
      <td>Other tools</td>
      <td>* PyCharm extension <br />* CLI (preview) <br />* SDK <br /></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>Both provide a VS Code extension, and Databricks has some additional tools. In practice, most developers seem to use the web-based UI and notebooks for code development.</p>

<h3 id="other-features">Other features</h3>

<p>What other features do the platforms provide?</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Azure Databricks</th>
      <th>Microsoft Fabric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Other feature</strong></td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Data sharing/Collaboration</td>
      <td>Delta sharing, Cleanrooms</td>
      <td>Fabric external data sharing (only between Fabric workspaces)</td>
    </tr>
    <tr>
      <td>Operational Databases</td>
      <td>-</td>
      <td>Fabric Database (preview)</td>
    </tr>
  </tbody>
</table>

<p>Databricks has Delta Sharing to enable sharing data between organizations. They also have Data Cleanrooms, which let you process data with another organization without actually giving them access to it.</p>

<p>Fabric has Fabric Databases, introduced at the Ignite conference in autumn 2024. The feature is still in preview, but it is an interesting development for data platforms, as Microsoft is bringing operational databases to an analytical platform for the first time.</p>

<h3 id="summary-of-features">Summary of features</h3>

<p>On paper, Fabric seems to offer more options than Databricks. For example, for the compute engine, Fabric has three options compared to Databricks’ one. Both platforms tick the boxes pretty well, but as always, the devil is in the details. Many of Fabric’s features are still in preview at the time of writing. On the other hand, Fabric has some features which Databricks does not have at all. For example, operational databases were introduced to Fabric at the Ignite conference in fall 2024; Databricks has nothing similar.</p>

<h1 id="architecture">Architecture</h1>

<p>What about the architecture of the platforms? How are they constructed, and how much implementation detail does each provider actually share?</p>

<h2 id="databricks">Databricks</h2>

<p><img src="/assets/2024/12/post-2/architecture-azure.png" alt="Azure Databricks Architecture" /></p>

<p>Databricks’ architecture is pretty simple. In the high-level architecture diagram shown above, there is a control plane, which sits in Databricks’ account, and a compute plane, which sits in your cloud account. This high-level view hides a lot of the complexity behind the boxes, such as Unity Catalog.</p>

<p>Databricks uses Spark as its compute engine. Spark’s internal architecture is shown below: basically, there is just a cluster manager, a driver node, and zero or more worker nodes.</p>

<p><img src="/assets/2024/12/post-2/spark-architecture.png" alt="Spark Architecture" /></p>

<p><a href="https://spark.apache.org/docs/latest/cluster-overview.html">Spark Architecture</a></p>

<p>Databricks is basically managed Spark with some added elements, such as a nice web UI, Unity Catalog, workflows, and Delta Live Tables. Databricks uses Spark for essentially all computing, but note that Databricks’ Spark is not the open-source version; it includes proprietary features. Databricks also has a vectorized compute engine called <a href="https://www.youtube.com/watch?v=pNn5W4ujP3w">Photon</a>, which is essentially Spark with the lower execution layers rewritten in C++; for the user, there is no difference apart from possible speed and cost.</p>

<p>That covers compute, but what about the other components, storage and networking? Databricks uses whatever resources you configure it to use: for storage, typically an object store such as ADLS Gen2 in your cloud account; for networking, the network you manage yourself in your cloud account.</p>

<h2 id="fabric">Fabric</h2>

<p>Fabric provides a wide set of tools for every task. So much so that it can sometimes be difficult to choose which tool to use. For data ingestion, one can use Data Factory, Data Flows, or Fabric (Spark) notebooks. Data Factory and Data Flows are low/no-code solutions, while notebooks are a pro-code solution.</p>

<p>For data storage, Fabric offers Lakehouse and Warehouse. Both use Fabric’s OneLake object storage, but the feature sets they provide are somewhat different. Basically, Lakehouse is Spark-based, meaning one can use PySpark, Scala, R, or Spark SQL notebooks, while Warehouse is a T-SQL-based solution. Fabric’s Spark is the open-source version.</p>

<p>Reporting and BI in Fabric is done using the excellent Power BI, which integrates with Fabric really nicely.</p>

<p>Fabric is basically a black-box SaaS product, and we don’t have much information about how it is built. In fact, the only source I have come across so far is second-hand information from a recent conference described in <a href="https://redmondmag.com/Articles/2024/10/23/Microsoft-Fabric-Deep-Dive.aspx">Redmond Magazine</a>. I hope this will change in the future, as knowing how a platform is built helps you use it effectively.</p>

<h1 id="conclusion">Conclusion</h1>

<p>This article turned out to be much longer than I expected, which is a good indication of how much there is to cover on both platforms. The feature list is long and the platforms are huge.</p>

<p>At a high level, the platforms seem similar. Closer inspection reveals some differences in design philosophy and in the direction in which the platforms are heading.</p>

<p>The competition at the moment is fierce, and the development speed, especially Fabric’s, is exceptional. Both platform providers are investing heavily in development. It will be really interesting to see where these platforms are in three years. Fabric is a bit of an underdog, but it is cleverly extending its platform into areas where Databricks has been under-serving the market.</p>

<p>In the next part, we will look at how to architect a simple data platform implementation.</p>]]></content><author><name>Antti Suanto</name></author><category term="databricks" /><category term="fabric" /><category term="series" /><summary type="html"><![CDATA[High-level overview of Databricks and Microsoft Fabric]]></summary></entry><entry><title type="html">Deep-dive Comparison of Databricks and MS Fabric - Part 1 - The Background</title><link href="https://suanto.com/2024/12/11/databricks-vs-microsoft-fabric-part-01/" rel="alternate" type="text/html" title="Deep-dive Comparison of Databricks and MS Fabric - Part 1 - The Background" /><published>2024-12-11T10:00:00+00:00</published><updated>2024-12-11T10:00:00+00:00</updated><id>https://suanto.com/2024/12/11/databricks-vs-microsoft-fabric-part-01</id><content type="html" xml:base="https://suanto.com/2024/12/11/databricks-vs-microsoft-fabric-part-01/"><![CDATA[<h1 id="why-this-series">Why this series?</h1>

<p>Another blog series on the differences between Databricks and Microsoft Fabric. Why?</p>

<p>Well, I guess the reasons are purely selfish. I used to be an architect working on a data platform that ran mostly on Databricks. Since changing jobs, most of the projects I have been working on are Fabric-based, so I need to learn Fabric well. The best way to learn something is to write about it, and comparing the new to the old is also valuable for learning. I’ll limit the scope of the series to Microsoft Fabric and Azure Databricks, as those are the platforms I am familiar with.</p>

<p>To understand the platforms, we need to understand how they came into existence. As always, nothing happens in a vacuum: both platforms are shaped by the general development of data platforms, cloud technology, and the rise of big data. The history of these platforms is intertwined, as Databricks probably wouldn’t exist without the shortcomings of the data platforms of the early 2000s. And on the other hand, Fabric probably wouldn’t exist without Databricks.</p>

<p>Let’s start studying the platforms by looking at a brief history of Databricks.</p>

<h1 id="databricks">Databricks</h1>

<h2 id="early-internet-and-googles-problem">Early Internet and Google’s problem</h2>

<p><a href="https://web.archive.org/web/20050309204119/http://backrub.tjtech.org:80/1997/">Google</a> and other Internet giants had a problem at the end of the 1990s. The Internet was becoming too large to be captured, and Internet traffic was generating so much data that they had trouble storing it - let alone analyzing or searching it. The traditional way to solve this kind of problem was to use a database, but databases were of no use here. They usually had an SMP (Symmetric Multiprocessing) architecture, which meant that they ran on a single machine; if you needed more power, you had to buy a bigger machine. SMP databases also coupled storage and compute, meaning that when you ran out of storage space, you faced a long downtime migrating the data to a larger machine. A big enough machine cost a metric ton of money, if you could find one at all.</p>

<p>There were also database systems that used an MPP (Massively Parallel Processing) architecture, such as Teradata. MPP means that the system ran not on a single big machine but on a group of machines. These systems were ridiculously expensive for a startup like Google was back then. A good indication was a <a href="https://web.archive.org/web/20030603200059/http://www.teradata.com/main/">poll on Teradata’s homepage in 2003</a>, in which the lowest option for the yearly maintenance cost of a single data mart was US$ 1 million (US$ 1.7 million in 2024 dollars). (By the way, when did we stop having polls on corporate websites?)</p>

<p><a href="https://web.archive.org/web/20050401074347/http://www.google.com/corporate/history.html">Google solved</a> the problem. They <a href="https://web.archive.org/web/20041018140849/https://backrub.tjtech.org/May1998/hardware.htm">used</a> <a href="https://blog.codinghorror.com/google-hardware-circa-1999/">commodity</a> <a href="https://blog.codinghorror.com/building-a-computer-the-google-way">hardware</a>, which stored and analysed the data in unison, with many small machines acting like one big machine. The technology was proprietary, but the research ethos was strong in Google’s early days, so they published two research papers describing their <a href="https://research.google/pubs/the-google-file-system/">distributed file system (2003)</a> and their compute system, <a href="https://research.google.com/archive/mapreduce-osdi04.pdf">MapReduce (2004)</a>.</p>

<p><img src="/assets/2024/12/early-google-servers.jpg" alt="Early google servers" /></p>

<p><em>Early Google servers.</em></p>

<h2 id="hadoop">Hadoop</h2>

<p>It took a couple of years for outsiders to develop software based on these papers, but in April 2006, it was finally ready and released. It was called Hadoop.</p>

<p>Hadoop was groundbreaking. It made it possible to run analyses on cheap commodity hardware clusters. A cluster could have just a couple of machines, or it could scale to more than 4,000 computers. For the first time, this scale of computing was available to universities and startups that could not afford proprietary cluster computing. Hadoop gave answers to problems and questions that couldn’t be answered before.</p>

<p><img src="/assets/2024/12/hadoop-cluster.jpg" alt="Hadoop cluster at Yahoo 2007" /> 
<em><a href="https://web.archive.org/web/20101128212620/http://developer.yahoo.com/blogs/ydn/posts/2007/07/yahoo-hadoop/">Hadoop cluster at Yahoo 2007</a></em></p>

<p>Even though Hadoop was great, it was not perfect. It had to be programmed using a model called MapReduce, which was a <a href="https://web.archive.org/web/20101128212620/http://developer.yahoo.com/blogs/ydn/posts/2007/07/yahoo-hadoop/">beast to program</a>. Hadoop programs usually required many phases or tasks, and each of them required the data to be written to disk, which was a time-consuming operation. Hadoop also consisted of a large number of software packages, each with its own release cycle. You had to have compatible versions of all the components for the system to work correctly, so upgrading a cluster was difficult, as later discussed <a href="https://softwareengineeringdaily.com/2015/08/03/apache-spark-creator-matei-zaharia-interview/">in an episode of the Software Engineering Daily podcast</a>.</p>
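<p>To get a feel for why MapReduce was such a beast, here is a toy word count sketched in plain Python. This is only an illustration of the programming model, not Hadoop’s actual Java API: every job has to be broken into explicit map, shuffle, and reduce phases, and on a real cluster the intermediate data between phases was written to disk.</p>

```python
from collections import defaultdict

# Toy illustration of the MapReduce programming model (not Hadoop's real API).

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group pairs by key. On a real cluster this step moved
    # data between machines and spilled intermediates to disk.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

<p>Even this trivial job needs three separate functions; real Hadoop jobs chained many such stages, each paying the disk-write cost.</p>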

<h2 id="enter-the-spark">Enter the Spark</h2>

<p>Matei Zaharia was born in Romania and later moved to Canada before finally settling in the US. He was considered a wunderkind of data science, especially in distributed systems. He <a href="https://archive.cra.org/Activities/awards/undergrad/2007.zaharia.html">won</a> several prizes and competitions in science and programming before he entered UC Berkeley. As a PhD student in Berkeley’s AMP Lab (Algorithms, Machines, and People) and an intern at Facebook, he saw firsthand the problems users had with Hadoop. He wanted to help them, <a href="https://www.reddit.com/r/IAmA/comments/31bkue/comment/cq0312u/">as he mentioned on Reddit</a>.</p>

<p>Reynold Xin, one of the AMP Lab students and Databricks co-founders, <a href="https://www.youtube.com/watch?v=jp9qFzXMQo4">later said in a conference talk</a> that Zaharia created Spark to help another student participate in the legendary <a href="https://web.archive.org/web/20090628113959/http://www.netflixprize.com/index">Netflix Prize</a> competition. According to Xin, Spark was used in the competition to place <a href="https://web.archive.org/web/20091227111134/http://www.netflixprize.com/leaderboard">second</a>, missing the US$ 1 million prize money by 20 minutes.</p>

<p>Whatever the real origin was, Zaharia’s solution was to create a general-purpose distributed compute engine called <a href="https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf">Spark</a>. Spark was easier to use than Hadoop and significantly faster. Zaharia started working on it in 2009, and it was <a href="https://github.com/apache/spark/commit/df29d0ea4c8b7137fdd1844219c7d">open-sourced</a> in 2010. In just a couple of years, Spark really <a href="https://web.archive.org/web/20140405023412/https://www.wired.com/2013/06/yahoo-amazon-amplab-spark/">took off</a>. Over the years, it became the de facto standard of distributed analytics.</p>
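<p>Part of that ease of use came from replacing the multi-phase MapReduce model with a single chained expression over an in-memory dataset. A real PySpark word count reads roughly as <code>rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(operator.add)</code>; the same pipeline can be mimicked in plain Python to show how compact the style is (a sketch, not actual Spark code):</p>

```python
from collections import Counter
from itertools import chain

# Plain-Python mimic of Spark's chained, in-memory style (a sketch, not PySpark).
# The RDD version reads almost identically:
#   sc.parallelize(lines).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)

lines = ["big data is big", "data is everywhere"]
word_counts = dict(Counter(chain.from_iterable(line.split() for line in lines)))
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

<p>One expression instead of three explicit phases, and no forced disk writes between steps: that difference is a large part of why Spark displaced MapReduce.</p>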

<p>Why did Spark take off like a rocket? If the early adopters of Hadoop were startups, universities, and other shoestring-budget organizations, why did established companies start moving to Spark? Zaharia gives his point of view in his book <a href="https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219">about Spark</a>.</p>

<p>According to him, single-core processor speeds started to plateau around 2005, which pushed processor manufacturers to increase the number of cores instead. This in turn led to the need for parallel programming models, such as Spark. In addition, the cost of collecting data continued to decline.</p>

<blockquote>
  <p>The end result is a world in which collecting data is extremely inexpensive—many organizations today even consider it negligent not to log data of possible relevance to the business—but processing it requires large, parallel computations, often on clusters of machines. Moreover, in this new world, the software developed in the past 50 years cannot automatically scale up, and neither can the traditional programming models for data processing applications, creating the need for new programming models. It is this world that Apache Spark was built for.</p>
</blockquote>

<h2 id="to-the-cloud-databricks">To the Cloud: Databricks</h2>

<p>Spark was a fine piece of software, but it required a lot of manual work to set up and maintain the clusters, which wasn’t exactly easy with on-premise infrastructure. In <a href="https://www.databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html">2013, Databricks</a>, the company, was founded by Zaharia and other members of the AMP Lab research team to provide a cloud-based Spark platform and commercial support for users.</p>

<p><img src="/assets/2024/12/databricks-investment-deck-2013.png" alt="" />
<img src="/assets/2024/12/databricks-investment-deck-2013-2.png" alt="" />
<em><a href="https://fortune.com/2023/03/02/databricks-pitch-decks-ben-horowitz/">Databricks’ first investment deck 2013</a></em></p>

<p>Zaharia and Databricks continued to invent more products, making them available through Databricks and also open-sourcing them. These products included MLlib, <a href="https://www.databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html">Structured Streaming</a>, <a href="https://www.databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html">Delta</a>, and <a href="https://www.databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html">MLflow</a>.</p>

<p><img src="/assets/2024/12/databricks-investment-deck-2017-2.png" alt="" />
<img src="/assets/2024/12/databricks-investment-deck-2017.png" alt="" />
<em>In 2017, only four years after its inception, Databricks believed it would dominate the analytics world. Slide from Databricks’ investment deck in 2017.</em>

<p>Originally, Spark wasn’t a database system. Instead, it operated on bare files in cloud storage. This caused problems when updating files or working with streaming data: the reader of a file didn’t know whether the data in it was still being updated, and you couldn’t enforce the structure of the data. With the introduction of Delta in 2017, Databricks and Spark started to include database-like features, such as transactions. Later, they brought it even closer to a regular data warehouse by introducing the <a href="https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html">Delta Lakehouse</a> and <a href="https://www.databricks.com/blog/2022/04/20/announcing-gated-public-preview-of-unity-catalog-on-aws-and-azure.html">Unity Catalog</a>.</p>

<p>Databricks continued to invent and grow. By 2024, its valuation was <a href="https://www.cnbc.com/2024/11/26/databricks-closes-in-on-multibillion-funding-round-at-55-billion-valuation.html">rumoured</a> to be 55 billion dollars. In just over a decade, it had become the juggernaut it had envisioned.</p>

<p><img src="/assets/2024/12/spark-summit-2013.png" alt="" />
<em><a href="https://web.archive.org/web/20131215014710/http://spark-summit.org/summit-2013/">Spark summit 2013, 500 participants</a></em></p>

<p><img src="/assets/2024/12/databricks-summit-2024.jpg" alt="" />
<em>Databricks Data+AI summit 2024 (renamed from Spark Summit). 17,000 participants.</em></p>

<p>Databricks has a special relationship with Microsoft. In 2017, Databricks and Microsoft <a href="https://www.databricks.com/blog/2017/11/15/introducing-azure-databricks.html">announced</a> a co-operation: Microsoft would provide Databricks as a first-party service in Azure, even called Azure Databricks, and would allow Databricks to <a href="https://www.databricks.com/blog/2017/11/15/a-technical-overview-of-azure-databricks.html">integrate</a> with Azure’s components, such as storage, as if it were Microsoft’s own service. For years, the partnership existed peacefully, until, at the Build conference in 2023, Microsoft announced its new data platform out of the blue. It was called Fabric.</p>

<h1 id="microsoft-fabric">Microsoft Fabric</h1>

<p>Microsoft Fabric is not a single product but a combination of about a dozen Microsoft products, so it doesn’t have as clear-cut a history as Databricks. One thing is sure, though: Fabric wouldn’t exist, at least not in this form, without Satya Nadella.</p>

<h2 id="nadellas-problem">Nadella’s problem</h2>

<p>When starting as the CEO of Microsoft on February 4th, 2014, Satya Nadella faced a challenge. As he describes in his book “<a href="https://www.amazon.com/Hit-Refresh-Rediscover-Microsofts-Everyone-ebook/dp/B01HOT5SQA">Hit Refresh</a>,” he inherited a company with a strong culture of viewing open-source software as its enemy. For Nadella, the need for change was clear. The transition to the cloud made him realize that Microsoft was no longer just a technology provider but also operated the technology in the cloud, and clients wanted to run other products besides just the Microsoft stack.</p>

<p><img src="/assets/2024/12/nadella-gates-ballmer.jpeg" alt="Nadella appointed as CEO" />
<em>Nadella was appointed as Microsoft’s third CEO in 2014.<a href="https://commons.wikimedia.org/w/index.php?curid=91262640"> By Briansmale, CC BY-SA 4.0</a></em></p>

<h2 id="microsoft-embraces-open-source">Microsoft embraces open-source</h2>

<p>Microsoft had been flirting with the cloud since 2005, but soon after Nadella was appointed CEO, the change started picking up pace. Developers were aghast to see Microsoft supporting open-source software in its offerings. The announcement of <a href="https://www.microsoft.com/en-us/sql-server/blog/2015/07/10/announcing-spark-for-azure-hdinsight-public-preview/">Spark for HDInsight</a> in 2015, as well as the <a href="https://blogs.microsoft.com/blog/2015/01/23/microsoft-acquire-revolution-analytics-help-customers-find-big-data-value-advanced-statistical-analysis/">acquisition of Revolution Analytics</a>, meant that the R language was integrated into multiple products. In 2016, <a href="https://www.microsoft.com/en-us/sql-server/blog/2016/06/06/microsoft-announces-major-commitment-to-apache-spark/">Microsoft announced its commitment to Spark</a>.</p>

<p><em>I remember talking to Microsoft engineers during those years, and they seemed just as surprised as we outsiders were. They were also really pleased to see that their company was embracing the best solutions on the market instead of trying to make everything in-house.</em></p>

<h2 id="converging-the-tools">Converging the tools</h2>

<p>Until 2019, Microsoft’s products were separate, and it lacked a proper data platform. But then came <a href="https://www.youtube.com/watch?v=tMYOi5E14eU">Azure Synapse Analytics</a>. Synapse’s selling points were: <em>“End-to-end analytics solution”</em>, <em>“Platform out-of-the-box”</em>, <em>“Integrated components”</em>, <em>“Eliminates the divide between datalake and data warehouse”</em>, and <em>“All capabilities in single place”</em>. As we later saw, this was pretty much the same sales pitch later used for Fabric. So what went wrong the first time?</p>

<p>Initially, there was a lot of interest in the market for Microsoft’s unified data platform, but apparently the development speed and the features of the platform just couldn’t compete head to head with other big platforms, like Databricks and Snowflake.</p>

<p><em>From personal experience, I saw one large-scale data platform project which was using Synapse Analytics. It was abandoned after a couple of years when it became clear that the platform was not mature enough and the development speed was just too slow; the project was later implemented on Databricks. In the last couple of years, Synapse just couldn’t compete with the big data platform providers, and I’ve heard of several data platform projects where it was quickly dismissed after initial research. Microsoft read the market, saw Databricks, Snowflake, and all the other data platforms passing them left, right, and center, and realized they needed to do better.</em></p>

<h2 id="enter-the-fabric">Enter the Fabric</h2>

<p>At the Build conference in 2023, Microsoft announced their new data platform, Fabric. According to Satya Nadella, it was their biggest data product launch since SQL Server. Fabric wasn’t created from scratch; it is a combination of at least three major lines of software: Azure Synapse Analytics, Power BI, and Purview. A big part of Fabric is also the open-sourced innovations made by Databricks, such as Delta Lake and MLflow.</p>

<p>Was Fabric just a rebranding of Synapse Analytics or was it really a new platform?</p>

<p>There are definitely lots of old parts that have been reused, but there is at least one major difference: the wiring of the components to work as a seamless platform. The latest additions, such as Fabric Databases, which bring operational databases to an analytical platform, are also major differentiators. In my view, it is not just a marketing gimmick; Fabric truly is a new data platform.</p>

<p><img src="/assets/2024/12/fabric-evolution.png" alt="" />
<em>Fabric is an amalgamation of a number of products. <a href="https://www.youtube.com/watch?v=JCZnv3RhTJQ">Source</a></em></p>

<h1 id="conclusions">Conclusions</h1>

<p>The modern data platform has the early Internet search providers, Google and Yahoo, to thank. They started the development which led to Hadoop, Spark, and eventually the Lakehouse paradigm. It also required companies willing to publish research papers and release software as open source. One could say that the modern data warehouse is a triumph of openness and sharing.</p>

<p>Being open, of course, was not enough. It also required the failure of traditional data warehousing. As the Internet and the falling cost of data storage drove the amount of data organizations gather ever higher, traditional data warehousing just could not keep up.</p>

<p>Databricks started from a single product and has evolved into a platform, but it is still largely based on that single product: Spark. How well Databricks has ridden the data and AI wave, from research group to startup to juggernaut of a company, is remarkable.</p>

<p>The history of Fabric is just the opposite: it is a combination of numerous Microsoft products into a platform by one of the largest companies in the world. The evolutionary history of Fabric is well described on the <a href="https://www.youtube.com/watch?v=JCZnv3RhTJQ">Insights and Outliers channel on YouTube</a>.</p>

<p>Fabric enters a bit late into the data platform race, so this is Microsoft’s do or die moment in data. How does the platform compare to Databricks? In the next part of this series, we’ll continue with a feature-level comparison of the platforms.</p>]]></content><author><name>Antti Suanto</name></author><category term="databricks" /><category term="fabric" /><category term="series" /><summary type="html"><![CDATA[A short history of Databricks and Microsoft Fabric]]></summary></entry><entry><title type="html">Translytical Fabric (ie. Power BI write back)</title><link href="https://suanto.com/2024/11/29/translytical-fabric/" rel="alternate" type="text/html" title="Translytical Fabric (ie. Power BI write back)" /><published>2024-11-29T07:00:00+00:00</published><updated>2024-11-29T07:00:00+00:00</updated><id>https://suanto.com/2024/11/29/translytical-fabric</id><content type="html" xml:base="https://suanto.com/2024/11/29/translytical-fabric/"><![CDATA[<p>One of the more interesting announcements at Ignite 2024 was the sneak peek of <a href="https://ignite.microsoft.com/en-US/sessions/BRK204?source=/schedule">‘Translytical’</a> (@43:20) features in Power BI. According to Amir Netz, the CTO of Fabric (who, by the way, <a href="https://patents.google.com/patent/US8626725B2/en">invented</a> the <a href="https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&amp;seqNum=3">VertiPaq engine</a> that powers Power BI and other Microsoft products), this is the biggest update to Power BI since its inception. Or even bigger.</p>

<p>A couple of years ago, I started to notice that fulfilling basic reporting needs was no longer enough for businesses. Many needs required two-way communication with the data. Typical cases included annotating an outlier in production data, perhaps shown in control charts; fixing the classifications of an ML model; re-running an ML model; or just updating the data. This was possible before, but it required, for example, embedding a Power Apps application in the report and setting up and configuring an external database.</p>

<p>I guess the need for interacting with data instead of just looking at it was recognized by Microsoft, too. With these ‘Translytical’ features, data can be updated using a Fabric database, <a href="https://sam.support.fabric.microsoft.com/en-us/blog/transform-validate-and-enrich-data-with-python-user-data-functions-in-your-data-pipelines">user data functions</a>, and a Power BI button. No more external resources. Much simpler!</p>

<p align="center">
<img src="/assets/2024/11/translytical-apps.png" alt="drawing" width="300" />
</p>

<p>Updating analytical data can be a massive footgun, and you need to carefully design all the whats, whens, and whos, but the demand for this kind of feature is strong.</p>

<p>I am not yet sure about the term ‘Translytical’, but I see huge potential in this feature. I can see a large part of small-scale business apps being built on Fabric in the future. It is possible that Power BI and Fabric will start expanding into the field once held by Access and Excel.</p>

<p>The feature is in private preview, so we don’t know all the details or possibilities yet, but I am interested to see how this plays out and to what extent the boundary between the transactional and the analytical will be blurred in the future.</p>

<p>Exciting times!</p>

<p>Btw, if you are interested in the amazing story of Amir Netz and VertiPaq, you should listen to <a href="https://open.spotify.com/episode/0uwjlMBAf3FeCaBdYseYNo">this episode of the Insights Tomorrow podcast</a>.</p>

<p>Update 2025-03-28: <a href="https://suanto.com/2025/03/20/fabric-cli/">Translytical tasks are now on Fabric Roadmap</a></p>]]></content><author><name>Antti Suanto</name></author><category term="fabric" /><category term="databases" /><category term="ignite" /><category term="short" /><summary type="html"><![CDATA[The Blurring Boundaries of Applications and Analytics]]></summary></entry><entry><title type="html">The Time I Built an ROV to Solve Missing Person Cases - Part 1</title><link href="https://suanto.com/2024/06/06/the-time-I-built-an-ROV-01/" rel="alternate" type="text/html" title="The Time I Built an ROV to Solve Missing Person Cases - Part 1" /><published>2024-06-06T12:10:47+00:00</published><updated>2024-06-06T12:10:47+00:00</updated><id>https://suanto.com/2024/06/06/the-time-I-built-an-ROV-01</id><content type="html" xml:base="https://suanto.com/2024/06/06/the-time-I-built-an-ROV-01/"><![CDATA[<h1 id="part-1---introduction">Part 1 - Introduction</h1>

<p>I didn’t know it back then, but it all started while I was reading <a href="https://news.ycombinator.com">Hacker News</a> in February 2019 and stumbled upon a story called <a href="https://www.otherhand.org/home-page/search-and-rescue/the-hunt-for-the-death-valley-germans/">“The Hunt for the Death Valley Germans”</a>. The real-life events behind the story are unbelievably tragic, but the way the case was solved was remarkable: the perseverance of a single man led him to solve it and, in doing so, bring much relief to the families of the victims. Reading the story got me thinking about how I could do this kind of thing myself.</p>

<p>I read the story a couple of times and sent it to my brother. I knew he would be interested, as he is into these kinds of things as much as I am. After reading the article, we talked for quite a long time about various missing person cases and how to solve them.</p>

<p>By the autumn of 2020, the story had faded from my mind, until my brother called me with an interesting missing person case. That phone call was the starting point of the most interesting adventure I’ve ever had, and it led to us solving two missing person cold cases, which had been unsolved for 9 and 15 years.</p>

<hr />

<p>This story is written based on my notes, photos, videos, GPS logs, sonar images, and other digital footprints. The facts were checked whenever it was possible. If there are any errors in the story, they are, of course, mine.</p>

<p>Maps would make following the story easier, but I chose not to provide them. For those who are eager and skilled enough, you can deduce the actual places from public sources and my descriptions.</p>]]></content><author><name>Antti Suanto</name></author><category term="missing-persons" /><category term="building" /><category term="ROV" /><category term="the-time-I-built-an-ROV" /><summary type="html"><![CDATA[Introduction]]></summary></entry></feed>