AutoLoader in Azure Databricks

Anupam Mishra
Dec 15, 2023


Databricks is a scalable big data analytics platform designed for data science and data engineering. Built on top of Apache Spark, it provides a fast, general-purpose engine for large-scale data processing, with industry-leading performance and integration with the major cloud platforms: Amazon Web Services, Microsoft Azure, and Google Cloud Platform.

What is Auto Loader in Azure Databricks?

Auto Loader (often written as Autoloader) is a mechanism in Databricks that ingests data from a data lake. Its power is that you never need to set up a trigger for ingesting new data: it automatically pulls new files into your streaming jobs as soon as they land in the source location.

The Autoloader feature in Azure Databricks simplifies the process of loading streaming data from various sources into a Delta Lake table. It automatically detects new files in a specified directory and efficiently loads them into the table, eliminating the need for manual intervention. This enables real-time data ingestion and analysis, making it easier to build data pipelines and extract valuable insights from streaming data.

With Autoloader, you can easily handle large volumes of streaming data without having to write complex code for file discovery and data loading. It also provides automatic schema inference, allowing you to quickly adapt to changing data structures without manual configuration.
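As a sketch of what that schema handling looks like in practice, you can point Auto Loader at a schema location and let it infer the schema and evolve it as new columns appear. The storage and schema paths below are hypothetical placeholders:

df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events/") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("/mnt/raw/events/")

With cloudFiles.schemaLocation set, Auto Loader persists the inferred schema between runs, and addNewColumns lets the stream pick up new fields instead of failing on them.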

Not only does Autoloader simplify the process of loading streaming data, but it also integrates seamlessly with other services in the Azure ecosystem. You can easily ingest data from sources such as Azure Event Hubs and Azure Blob Storage, making it convenient to bring data from various sources into your Delta Lake table. Additionally, Autoloader provides options for data transformation and filtering, allowing you to preprocess your streaming data before loading it into the table. This helps streamline your data workflows and optimize data processing efficiency.
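As one illustration of that preprocessing step (the source path, column names, and table paths here are made up for the example), you can filter and enrich the stream with ordinary DataFrame operations before writing it out:

from pyspark.sql.functions import col, current_timestamp

# Hypothetical source: JSON events landing in a raw container
events = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events/") \
    .load("/mnt/raw/events/")

# Filter and enrich the stream before it lands in the Delta table;
# "status" is an example column, not part of any fixed schema
cleaned = events.filter(col("status") == "active") \
    .withColumn("ingested_at", current_timestamp())

cleaned.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/delta/checkpoints/events/") \
    .start("/mnt/delta/tables/events/")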

By leveraging the power of Delta Lake, the Autoloader feature ensures data reliability and consistency. It uses transactional capabilities to handle updates, deletes, and appends efficiently, providing a unified view of your data. This makes it easier to perform real-time analytics and deliver timely insights to your business.
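One common pattern that builds on those transactional capabilities is to combine Auto Loader with foreachBatch and a Delta MERGE, so each micro-batch is upserted rather than blindly appended. A minimal sketch, assuming a hypothetical "id" key and placeholder paths:

from delta.tables import DeltaTable

orders = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders/") \
    .load("/mnt/raw/orders/")

# Merge each micro-batch into the target table on the example "id" key;
# assumes the Delta table at this path already exists
def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/mnt/delta/tables/orders/")
    target.alias("t").merge(batch_df.alias("s"), "t.id = s.id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

orders.writeStream \
    .foreachBatch(upsert_batch) \
    .option("checkpointLocation", "/mnt/delta/checkpoints/orders/") \
    .start()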

With its simplicity and scalability, the Autoloader feature in Azure Databricks empowers data engineers and data scientists to efficiently process and analyze streaming data, accelerating the development of real-time applications and enabling data-driven decision-making.

Supported File Formats and Cloud Storage Services

Auto Loader can ingest a variety of file formats, including JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE, and can load data files from cloud storage services such as AWS S3, Azure Data Lake Storage Gen2, Azure Blob Storage, ADLS Gen1, Google Cloud Storage, and the Databricks File System (DBFS).
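Switching formats only requires changing the cloudFiles.format value; below is a hypothetical CSV example reading from an ADLS Gen2 container (the storage account, container name, and paths are placeholders):

csv_df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("header", "true") \
    .option("cloudFiles.schemaLocation", "/mnt/schemas/csv_data/") \
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/csv_data/")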

Auto Loader components

  • cloudFiles source: cloudFiles is the Structured Streaming source that Auto Loader provides. It keeps track of which files have already been ingested and incrementally picks up new data files as they arrive. The source directory specified in the Auto Loader configuration points to the cloud storage location where new files are expected to land.
  • Cloud notification services: in file notification mode, Auto Loader relies on cloud-native notification services (for example, Azure Event Grid with a storage queue, or AWS SNS with SQS) to listen for changes in cloud storage. When new files are added to the monitored directory, a notification event triggers Auto Loader to start the ingestion and processing workflow.

Putting the two together: the cloudFiles source points at the directory where incoming data files land; the notification service monitors that directory for changes, such as the arrival of new files; and when new files are detected, Auto Loader is triggered to run the defined processing logic.
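A minimal sketch of enabling this notification-driven mode, assuming the workspace has permission to create the underlying notification resources (the storage account and paths are placeholders):

df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useNotifications", "true") \
    .option("cloudFiles.schemaLocation", "/mnt/schemas/landing/") \
    .load("abfss://landing@mystorageaccount.dfs.core.windows.net/")

With cloudFiles.useNotifications set to "true", Auto Loader subscribes to storage events instead of repeatedly listing the directory, which is what makes it efficient on very large directories.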

Structured Streaming with the cloudFiles Source

Auto Loader comes equipped with a Structured Streaming source called cloudFiles, which automatically processes new files as they arrive in an input directory path on cloud file storage, and can also process any files that already exist in that directory. Auto Loader supports both Python and SQL in Delta Live Tables.

from pyspark.sql.types import StructType, StructField, StringType

# Example schema for the incoming JSON files (the fields are illustrative)
json_schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", StringType(), True),
])

# Read from the cloudFiles source
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(json_schema) \
    .load("s3://mybucket/mydir/")

# Write to Delta Lake
df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/delta/checkpoints/mytable/") \
    .start("/mnt/delta/tables/mytable/")

Scalability and Real-Time Ingestion

One of Auto Loader's most powerful features is its ability to scale to near real-time ingestion of millions of files per hour. This is possible because it efficiently processes new data files as they arrive in cloud storage without any additional setup, and because it supports parallel processing and distributed computing.

Syntax:

df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(json_schema) \
    .load("s3://mybucket/mydir/")
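To keep a sudden burst of arriving files from overwhelming a single micro-batch, Auto Loader also exposes rate-limiting options. A sketch using cloudFiles.maxFilesPerTrigger, with placeholder paths:

df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.maxFilesPerTrigger", "1000") \
    .option("cloudFiles.schemaLocation", "/mnt/schemas/burst/") \
    .load("/mnt/raw/burst/")

Here each micro-batch consumes at most 1,000 new files, so throughput stays predictable even when a large backlog lands at once.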
