
Fabric Notebooks for Data Transformation and ML

Author
Mirko Peters - M365 Specialist
Published
Sat 09 Aug 2025
Episode Link
https://m365.show/p/fabric-notebooks-for-data-transformation

Ever wrangled data in Power BI and thought, "There has to be an easier way to prep and model this—without a maze of clicks"? Today, we're showing you how Fabric Notebooks let you control every stage, from raw Lakehouse data to a clean dataset ready for ML, all in a familiar Python or R environment. There's one trick in Fabric that most pros overlook—and it can transform your entire analytics workflow. Curious what it is?

Why Fabric Notebooks? Breaking the Click-and-Drag Cycle

If you’ve ever found yourself clicking through one Power BI menu after another, hoping for a miracle cleanup or that one magic filter, you’re not alone. Most teams I know have their routines dialed in: patching together loads of steps in Power Query, ducking into Excel for quick fixes, maybe popping open a notebook when the built-in “transform” options finally tap out. That patchwork gets the job done—until some missing or extra character somewhere throws it all off. Piece by piece, things spiral. The more hands on the pipeline, the more those tweaks, one-offs, and “just this once” workarounds pile up. Suddenly, nobody knows if you’re working with the right file, or if the logic that was so carefully added to your ETL step last month even survived.

Here’s the reality: the more you glue together different tools and manual scripts, the more you’re inviting things to go sideways. Data quality problems start out small—maybe a few nulls in a column, or an Excel formula that got misapplied—but they spread quickly. You chase errors you can’t see. The business logic you worked so hard to build in gets lost between tools. Then someone copies a report or saves a “final” version in a shared folder. Great, until you try to track why one number’s off and realize there’s no audit trail, no history, just a chain of emails and a spreadsheet with “_v2final_REAL” in the name.

Now, let’s make it a bit more concrete. Say you’ve set up a pipeline in Power Query to transform your sales data. Someone on the ops team renames a column, just to be helpful—cleans up the label, nothing major. Overnight, your refresh fails. The dashboard lights up with blanks. You spend your morning tracking through error messages, retracing steps, and realizing one change silently broke the whole chain. It’s one of those moments where you start wondering if there’s a smarter way to do this. This is where Fabric Notebooks start to make sense. They let you replace that chain of hidden steps and scattered scripts with something centralized. Open a Notebook inside the Lakehouse, and suddenly you’re not locked into whatever Power Query exposes, or what some old VBA script still supports. You use real Python or R. Your business logic is now code—executable, testable, transparent. And since Fabric Notebooks can talk directly to Spark, all the heavy lifting happens right where your data lives. No more exporting files, cutting and pasting formulas, or losing context between tools.

Transparency is the secret here. With Power BI dataflows or legacy ETL tools, you get a UI and a list of steps, but it’s not always clear what’s happening or why. Sometimes those steps are black boxes; you see the outcome but tracing the logic can be a headache. Notebooks flip that on its head. Every transformation, every filter, every join is just code—easy to review, debug, and repeat. If you need to fix something or explain it to an auditor, you’re not trying to reverse-engineer a mouse click from six months ago. You’re reading straightforward code that lives alongside your data.

If you want proof, talk to a data team that’s been burned by a lost transformation. I’ve seen teams spend whole days redoing work after Power Query steps vanished into versioning limbo. Once they switched to Fabric Notebooks, restoring a pipeline took minutes. Need to rerun a feature engineering script? Hit run. Want to check the output? It’s right there, alongside your transformations, not somewhere buried in another platform’s log files.
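To make that concrete, here is a minimal sketch of the kind of cleanup that usually hides in query steps, written as reviewable PySpark inside a Fabric Notebook. The table and column names ("sales", "customers", "amount") are hypothetical placeholders, and spark is the session a Fabric notebook already provides.

# A minimal sketch, not a full pipeline: filter, normalize, join, and save,
# all as code you can read, diff, and rerun. Table and column names are
# hypothetical; `spark` is the session a Fabric notebook already provides.
from pyspark.sql import functions as F

sales = spark.read.table("sales")          # raw table in the Lakehouse
customers = spark.read.table("customers")  # reference data to enrich with

cleaned = (
    sales
    .filter(F.col("amount").isNotNull())                # drop rows with missing amounts
    .withColumn("order_date", F.to_date("order_date"))  # normalize the date column
    .join(customers, on="customer_id", how="left")      # enrich with customer attributes
)

cleaned.write.mode("overwrite").saveAsTable("sales_cleaned")  # persist back to the Lakehouse

Every step above is visible and repeatable; if a column gets renamed upstream, the fix is a one-line change and a rerun rather than a hunt through saved query steps.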
It’s not just anecdotal, either. Gartner’s 2024 analytics trends point out that developer-friendly, governed analytics environments are at the top of IT wish lists this year. Teams want to govern workflows, reduce errors, and keep transformations clear—not just for compliance, but for sanity. Notebooks fit that brief. They bring repeatability without sacrificing flexibility. You get what you expect every single time you run your workflow, whether your data has doubled in size or your logic has gotten a bit more intricate.

With Fabric Notebooks, you stop feeling at the mercy of a UI or the latest patch to a plug-in. You write transformations in native code, review the logic, iterate quickly, and keep everything controlled within the Lakehouse environment. Versioning is built in, so teams stop playing “which script is the right one?” There’s no more mystery meat—every step is right there in black and white, accessible to anyone with permissions.

So, what you really get is that rare mix of flexibility and control. You aren’t tied down by a rigid workflow or a limited set of built-in steps. But you’re not just freewheeling either; everything happens in a secure, auditable, repeatable way, right where your business data sits. For anyone ready to ditch the endless cycle of clicks and patches, this is a much-needed reset.

And that’s what’s on offer—but seeing how it all works together in a real end-to-end workflow is what matters next. What does the journey look like when you go from raw Lakehouse data to something ready for analysis or machine learning, all inside the Notebook experience?

From Raw Lakehouse Data to Ready-for-ML: The Real Workflow

You probably know the feeling—you upload a dump of last month’s sales data, some web logs, maybe an extract from customer support, and it all lands in your Lakehouse. Now what? Most folks think you slap a model on top, press run, and call it AI. But the real story is everything that happens in the messy middle. Raw data looks nothing like what your ML algorithm needs, and before you even think about training, someone has to piece it all together. Columns don’t line up. Time zones are inconsistent. Nulls wait to break scripts you haven’t written yet. If you’ve tried to join logs across sources, you know that each system has its own quirks—a date is never just a date, a customer ID might be lowercased in one file and uppercased in another, and outliers seem to multiply as soon as you ask serious questions.

The huge pain here is manual cleanup. Even if you’re good with VLOOKUPs or Power Query, getting several million rows to a usable state isn’t just boring; it opens the door to errors that don’t always announce themselves. A missed join, a misplaced filter, or inconsistent encoding adds hours of debugging later. The more steps you run in different tools, the more you forget which fix you made where. You end up cross-referencing transformations, wondering if you cleaned out those four weird records, or if someone else rebuilt the staging table without telling you.

Fabric Notebooks take that bottleneck and give you something that, for once, scales with your ambition. Because you’re scripting transformations directly in Python or R—right in the context of your Lakehouse—you can chain cleaning, enrichment, and feature engineering work in the way that actually matches your project, not just whatever some library supports out of the box. This isn’t dragging steps into a canvas and hoping the “advanced editor” lets you tweak what matters. You’re designing the logic, handling all the edge cases, and writing code once that you can use again across datasets or even other projects. Every cast, filter, and aggregate stays visible. Typed too fast and swapped a column? Change it and rerun—no need to re-import, re-export, or play the copy-paste game.

Picture what this means for an actual project. Take a retail team that wants to spot which customers are about to churn. They’re not just loading the CRM export and rolling the dice. Inside a Fabric Notebook, they pull in last quarter’s sales, merge those records with support tickets, and tag each touchpoint from the website logs. When they run into missing values in the sales data—maybe several transactions marked incomplete or with suspicious nulls—they clean those up on the fly with a few lines of pandas or PySpark. Outliers that would throw off their predictions get identified, flagged, and handled right inside the workflow. Every part of this is code: repeatable, easy to tweak, and visible to the next analyst or developer who comes along. The team doesn’t have to circle back to a BI developer or search through dozens of saved exports—they see the entire process, from ingestion to the feature matrix, in one place.

Then there’s scale. Most platforms start strong but choke when data grows. Fabric’s native Notebook approach means you’re not running local scripts on a laptop. Instead, each transformation can harness Spark under the hood, so your process that once broke at 100,000 records now sails through 10 million without blinking. This is especially important when your data doesn’t come in neat weekly batches. If the pipeline gets a surge in records overnight, the code doesn’t care—it processes whatever lands in the Lakehouse, and the same cleaning, transforms, and feature engineering logic applies.
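Here is a minimal sketch of that churn-prep flow, assuming a Fabric notebook where the spark session is already available and the Lakehouse holds hypothetical tables named "sales" and "support_tickets"; the column names are placeholders too.

# A minimal sketch, not the team's actual pipeline: drop bad rows, flag
# outliers, build per-customer features, and write the result back to the
# Lakehouse. All table and column names are hypothetical.
from pyspark.sql import functions as F

sales = spark.read.table("sales")
tickets = spark.read.table("support_tickets")

# Clean up missing values before they break anything downstream
sales = sales.dropna(subset=["transaction_amount"])

# Flag simple outliers (three standard deviations from the mean) instead of silently dropping them
stats = sales.select(
    F.mean("transaction_amount").alias("mu"),
    F.stddev("transaction_amount").alias("sigma"),
).first()
sales = sales.withColumn(
    "is_outlier",
    F.abs(F.col("transaction_amount") - F.lit(stats["mu"])) > 3 * F.lit(stats["sigma"]),
)

# Feature engineering: roll both sources up to one row per customer
features = (
    sales.groupBy("customer_id")
         .agg(F.sum("transaction_amount").alias("total_spend"),
              F.count("*").alias("order_count"))
         .join(
             tickets.groupBy("customer_id").agg(F.count("*").alias("ticket_count")),
             on="customer_id", how="left")
         .fillna({"ticket_count": 0})
)

# Persist the feature set so the modeling step (or another team) can pick it up
features.write.mode("overwrite").saveAsTable("churn_features")

From there, the same notebook can move straight into model training against that saved feature table, which is exactly the hand-off described next.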
If you mapped this out, you’d start with a batch of raw tables landing in your Lakehouse. The Notebook sits as the orchestrator, pulling data from source tables, applying your scripted transformations, and immediately saving the outputs back—either as new tables or as feature sets ready for modeling. If you’re picturing this, think of data flowing in, being reshaped and upgraded by your code, and then moving straight into Power BI dashboards or ML pipelines, all without a break in context or a switch to another tool.

Microsoft’s documentation highlights another piece most teams miss: once your Notebook script is ready, you’re not stuck waiting on someone else’s process to finish out the pipeline. Notebooks in Fabric can trigger machine learning model training jobs or write feature sets directly back to your Lakehouse, so there’s no exporting CSVs for some other tool to pick up. This tight coupling means you design, clean, feature-engineer, and prep for modeling all in one place, then kick off the next step at scale.

All of this means your workflow for ML or analytics finally makes sense—start with raw ingestion, transform and enrich inside a governed, scalable Notebook, and push the data out for the next team or model to use. There’s no more losing track between tools or asking, “Where did column X get calculated?” It’s all right where you built it, and it works, no matter how messy the raw data was.

But seeing the flow is one thing. To really design reliable, scalable projects, you need to know exactly how these pieces talk to each other when you put them into production. Let’s break down the connections behind the scenes.

How the Pieces Fit: Lakehouse, Notebooks, and Spark in Action

If you ask most teams what actually happens when they run a data job in Fabric, you’ll get answers that sound confident—until something fails. The Lakehouse, Notebooks, and Spark each get talked up in demos, but in practice, a lot of folks treat these like separate islands. It’s part of why pipelines break, or why a process that ran fine during testing suddenly starts timing out or throwing permission errors the second more people get involved. So let’s strip away the buzzwords and get into what actually happens when you put these pieces to work, side by side.

The Lakehouse is straightforward in concept. It’s where all your raw data lands, gets curated, and, if you’ve done things right, turns into a foundation for every dashboard, report, and ML model you’re thinking of building. You can drop in CSVs from cloud blobs, load up logs, or publish system exports—whatever form your data takes, this is its home. It’s about having your single source of truth in one place, and keeping both your messy ingests and your golden, cleaned datasets under one roof. That’s the theory, anyway.

Now, Notebooks are your playground as a developer or data analyst. If you’re tired of reverse-engineering someone else’s Power Query or unpicking a worksheet that’s seen ten rounds of copy-paste fixes, Notebooks feel like breathing room. Here you write real code in Python or R, use your favorite libraries, work through logic, and build tests—all without leaving the context of your Lakehouse data. It’s not a bolt-on or a disconnected tool. The Notebook is embedded right inside the Fabric ecosystem, so everything you author runs close to where your data sits.

Spark is the heavy lifter, the compute engine working behind the scenes. When you run a Notebook cell that needs to process five million records—maybe it’s a complex join, or a batch transformation—Spark takes over. It distributes the job across its clusters, so your code runs at scale without you writing custom job orchestration or worrying about where your compute lives. This isn’t you spinning up servers, cloning scripts, or knitting together permissions across random VMs. With Fabric, Spark operates right where your curated and raw data is stored.

But here’s what often gets teams: if you treat these three as separate, you hit problems. Teams will load data into the Lakehouse, but then export it just to process it locally, breaking governance and creating disconnected copies. Or they’ll write great transformation logic in a Notebook, but only share the output as a CSV, so nobody else can trace what actually happened between ingest and publish. Sometimes Spark gets sidelined, and workloads start running slow as people forget they’re working with more data than their laptops can handle. The end result is silos, confusion about who owns what, and security risks that show up in unpleasant ways.

What Fabric does—if you set it up right—is keep every connection tight. Your Notebook isn’t running code out in the void; it’s submitting Spark jobs that execute exactly where the data is stored. Nothing leaves the Lakehouse unless you explicitly export it. This means you skip all the extra data movement, avoid random local files, and control access in one place. If your organization is nervous about compliance or data sovereignty, that single point of control is a lot easier to document and manage.

Think about a finance team. They take in millions of daily transactions. Their Notebook is set to trigger every night. Instead of someone exporting yesterday’s CSV, cleaning the data in Excel, uploading it again, and hoping no rows got dropped, the team has a Spark job baked into their Notebook that ingests, joins, and processes ten million transactions in minutes. The results show up as a cleansed table, ready to plug into reporting in Power BI. Nobody outside their team sees the raw dataset. They don’t move files between systems. If there’s an error, the full lineage from Lakehouse to final table is visible…and, crucially, repeatable.
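As a rough illustration of that nightly job, here is a minimal sketch in PySpark; the table names ("raw_transactions", "accounts", "transactions_cleansed") and columns are hypothetical, and spark is the session the Fabric notebook provides.

# A minimal sketch of the nightly pattern: read raw transactions from the
# Lakehouse, join them to a reference table, aggregate, and overwrite the
# curated table that Power BI reports read from. Names are hypothetical.
from pyspark.sql import functions as F

txns = spark.read.table("raw_transactions")
accounts = spark.read.table("accounts")

daily = (
    txns.filter(F.col("status") == "posted")           # keep only settled transactions
        .join(accounts, on="account_id", how="inner")  # enrich with account attributes
        .groupBy("account_id", F.to_date("posted_at").alias("posting_date"))
        .agg(F.sum("amount").alias("daily_total"),
             F.count("*").alias("txn_count"))
)

# Overwrite the cleansed table in place; downstream reports always see the latest run
daily.write.mode("overwrite").saveAsTable("transactions_cleansed")

Schedule the notebook (for example, from a Fabric pipeline) and the same code runs whether last night brought ten thousand rows or ten million.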
Now, just because Notebooks give you flexibility doesn’t mean you’re out on your own. You can bring in almost any Python or R package you need for business logic or advanced analytics. But the code still runs inside the guardrails that Fabric provides. Version histories are kept so accidental changes can be rolled back. Permissions wrap both Notebooks and the data they touch, so you don’t end up with an analyst reading payroll tables they shouldn’t have access to.

To stay sane in a growing project, it pays to group Notebooks by project or business domain—marketing, sales, operations. Modularize your scripts so you’re not copying the same cleaning logic everywhere. And even if you’re just starting with a solo team, get version control in place up front. It’s a lifesaver when something breaks, or when you want to see why a filter got added.

One of the most common gotchas? Permissions. Too often, teams get enthusiastic and focus on transformations, only to realize that anyone with access to the Lakehouse or Notebook can overwrite data, or see more than they should. Double-check who can run, edit, or even just view your Notebooks. Set up access policies at both levels, not just one. A leak or accidental overwrite doesn’t need to happen to make you sweat—it just takes one bad incident to get everyone looking sideways at your setup.

When you actually understand how your Lakehouse, Notebooks, and Spark mesh together, you get stable pipelines. You control the flow from ingest to transformation, through cleaning and enrichment, to analytics or modeling. The pieces work as one—not separate fiefdoms. You also keep your data secure, your logic visible, and your workflows fast and repeatable. But as your project grows and more people pile in, the challenge shifts. Suddenly, collaboration and governance get a lot harder, and that’s where smart teams put most of their attention.

Avoiding Chaos: Collaboration, Governance, and Scaling Up

Think about the first time you spin up a Fabric Notebook for a quick proof of concept. You connect your data, try out some transformations, maybe even train a test model. It feels clean, with just a handful of scripts and one or two people involved. Fast forward a month and your workspace looks nothing like it did on day one. Each team starts a handful of Notebooks, naming conventions fall apart by the third iteration, and suddenly, you’re searching for “final-final-customer-cleaning” instead of anything standardized. Now add in more teams—finance, marketing, operations. Someone requests access for a contractor “just for the quarter,” and that’s when the real surprises begin.

For most organizations, this is where the fabric (no pun intended) starts to fray. Business units all want their own slice of the data pipeline, so they fork Notebooks, tweak scripts, and keep their logic in copies scattered throughout the environment. Side conversations move to Teams or email threads. Suddenly, two people are doing almost the same work in parallel, but with small, critical differences. With no governance, this drift is only spotted when someone runs a report and the numbers don’t add up. Someone will ask why a filter is missing or a metric jumped, but between the duplicated notebooks and conflicting logic, the root cause is buried under layers of undocumented changes.

Auditors and compliance officers, for their part, aren’t just worried about business logic—they want to see who touched what, and when. Without a system of auditing and version management set up from day one, you’re stuck digging through old emails, asking who had the file last. There’s no single source of truth, and any data lineage story you can tell feels like guesswork. More than once, this mess has landed teams in hot water when an audit trail simply didn’t exist—or when a permissions slip let someone view raw PII that should have been locked down.

Here’s where Fabric can actually make a difference, but only if you use what’s built in. On the surface, it’s easy to see Notebooks as just another script editor. Dig a bit deeper, though, and Fabric gives some key tools for staying sane—starting with workspace-level permissions. This isn’t the old model of handing out blanket access or hoping someone remembers to update a spreadsheet. Instead, you define exactly who can run, edit, or even view specific Notebooks and tables. Missteps here are usually unintentional—the difference between read and write can sound like a detail until someone overwrites a production table by accident. If you set the right roles up front, one slip doesn’t take down the whole pipeline.

Audit logs are another underused safety net. Most teams think about logging after a scare, but Fabric keeps a detailed record of changes made inside Notebooks and data movement across the workspace. When a question comes up in an audit (and it will), tracking every modification is no longer a hero’s job; the logs are already waiting. This means fewer late nights retracing steps and explaining how data shifted between versions. The organizations that thrive here are the ones that make reviewing audit logs part of their regular process—not something reserved for emergencies.

Consider a real example: a healthcare organization handling protected health information uses Fabric Notebooks to prepare patient records for analysis. Compliance is non-negotiable. They enforce role-based access from day one.
No Notebook can interact with sensitive fields unless the user has explicit permission—and every step is versioned automatically. When an internal check rolls around, the team doesn’t scramble. They pull logs, trace back exactly when transformations ran, and demonstrate the lineage from original data to cleaned, analysis-ready tables. This is what HIPAA asks for, but the same approach works in any regulated industry.

Documentation is another pain point that everyone means to solve but rarely does until onboarding devolves into legend-telling. If you document your transformations, tag versions at meaningful checkpoints, and make notes about why code changed, your team doesn’t spend days or even weeks guessing at business logic. It’s about treating Notebooks less like scratch paper and more like evolving project assets. The gains pay off every time a new team member joins or someone picks up a pipeline months later.

For teams operating at scale, Git integration becomes more than a nice-to-have. It’s where change tracking and a clear branching strategy save you from accidental overwrites or the “oops” merge that wipes out a week’s work. This structure keeps your master Notebook stable and allows experimentation without risking the trusted production logic. The reality is, even small teams benefit from using Git early in the process, not waiting for chaos to set in.

Of course, not every pitfall is about access or versioning. When folks get comfortable, shortcuts sneak in. Hard-coded credentials show up as quick fixes and linger in code for too long (a minimal sketch of the safer pattern appears at the end of this section). Pull requests, if they happen, don’t always get a code review. Bugs and security holes slip through not because people aren’t skilled, but because process gets traded for speed. A necessary step is to build in code review and credential checks from the first Notebook onward—not as afterthoughts, but as part of everyday work.

According to Forrester’s 2023 findings, robust governance isn’t just a compliance checkbox. It’s the most reliable predictor that a data project will deliver real value, both in agility and audit-readiness. Teams that get structure right from the start find that Fabric Notebooks don’t just scale—they scale without generating chaos.

This setup turns what could be a mess into a platform you actually trust as your organization grows. Pipelines stay tidy, logic stays visible, and security lapses become the exception, not the rule. So, if every team can get a Notebook running, what really sets the pros apart? There’s one habit that makes all the difference as Fabric Notebooks become the backbone of your workflow.
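To make the credentials point concrete, here is a minimal sketch of the safer pattern, assuming an Azure Key Vault the workspace identity can read; the vault URI, secret name, and even the exact helper call are assumptions worth checking against current Fabric documentation.

# A minimal sketch of keeping secrets out of Notebook code: pull the value
# from Azure Key Vault at run time instead of hard-coding it. The vault URI
# and secret name are hypothetical; verify the helper against Fabric docs.
from notebookutils import mssparkutils  # usually pre-loaded in Fabric notebooks

# Instead of: api_key = "sk-live-abc123"   # would sit in version history forever
api_key = mssparkutils.credentials.getSecret(
    "https://contoso-vault.vault.azure.net/",  # hypothetical Key Vault URI
    "external-api-key",                        # hypothetical secret name
)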

Conclusion

Here’s what separates a smooth analytics setup from the usual patchwork: it’s never just about writing good code. It’s how you use Fabric Notebooks to make every part of your workflow visible, consistent, and easy to manage, no matter how much your data grows or how many hands are in the project. If you’ve lost a week to tracking down issues that only existed because tools didn’t connect right, you know the pain. Rethinking your approach now pays off when your next project doubles in size. Let us know your biggest data transformation struggle in the comments, and don’t forget to subscribe.



Get full access to M365 Show - Microsoft 365 Digital Workplace Daily at m365.show/subscribe
