MICROSOFT FABRIC DATA ENGINEERING PROJECT

UNITED STATES GEOLOGICAL SURVEY

Context:

In this data engineering project, I built an end-to-end data pipeline to process seismic event data from the United States Geological Survey (USGS) API. The objective was to extract, transform, and analyze this data to generate an interactive Power BI report that visualizes earthquake activity worldwide.

To ensure a structured and efficient data workflow, I implemented the Medallion architecture (Bronze, Silver, Gold). The raw data was ingested and stored in the Bronze layer, preserving its original form for traceability. In the Silver layer, I performed data cleaning, transformation, and validation to ensure consistency and accuracy. Finally, in the Gold layer, I refined the data into a business-ready format, optimized for analytics and reporting.

For data storage and processing, I leveraged Azure Synapse Analytics, while Azure Data Factory automated the orchestration of data ingestion and transformation, ensuring a seamless and scalable workflow. The final dataset was integrated into Power BI, where I developed an interactive dashboard that provides insights into global earthquake trends based on magnitude, location, and frequency.

Solution Architecture

Step 1: Data Collection from the USGS API

The first step in this project was to set up a lakehouse named earthquakes_lakehouse to store seismic data from the USGS API in its raw format. This lakehouse serves as the foundation for processing and structuring the data using the Medallion architecture.

To ingest the data, I created a notebook within the lakehouse and used Python to fetch earthquake event records via the following API request:

URL: https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2014-01-01&endtime=2014-01-02

The retrieved data is then stored as Delta Lake tables, which provide ACID transactions, schema enforcement, and time-travel capabilities, ensuring a reliable and scalable storage format for further processing. This structured approach lays the groundwork for the transformation and analysis phases that follow.
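As an illustration, here is a minimal sketch of that API call; it assumes the requests library is available in the notebook environment and simply counts the events returned in the GeoJSON features array.

```python
import requests

# Minimal sketch: fetch one day of earthquake events from the USGS FDSN event service.
url = (
    "https://earthquake.usgs.gov/fdsnws/event/1/query"
    "?format=geojson&starttime=2014-01-01&endtime=2014-01-02"
)
response = requests.get(url)
response.raise_for_status()

payload = response.json()
# The GeoJSON response wraps the individual earthquake events in the "features" array.
print(f"Retrieved {len(payload['features'])} earthquake events")
```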

Step 2: Bronze Layer – Storing the Data

To automate the data ingestion process, I developed a Python script that retrieves earthquake event data from the USGS API and stores it in the bronze layer. The goal is to dynamically fetch data for a rolling 7-day period, ensuring that the dataset remains up to date.

1. Dynamic Date Handling:

  • The script defines start_date as seven days ago and end_date as yesterday, ensuring a continuous data flow.
  • These dates will later be made fully dynamic and configurable via Azure Data Factory.

2. API Request:

  • The script constructs the API URL with the date parameters and requests the data in GeoJSON format.
  • The JSON response includes multiple fields, but only the features field is retained, as it contains the relevant earthquake event data.

3. Data Storage in Lakehouse:

  • If the API request is successful, the filtered earthquake data (features) is extracted and saved as a JSON file in the Lakehouse storage.
  • The data is stored in the /lakehouse/default/Files/ directory with a timestamped filename (YYYY-MM-DD_earthquake_data.json).
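The ingestion logic can be sketched as follows. This is a minimal, hedged version of the Bronze notebook: the variable names, response handling, and the choice of start_date for the timestamped filename are assumptions, while the /lakehouse/default/Files/ path and the rolling 7-day window follow the description above.

```python
import json
import requests
from datetime import date, timedelta

# Rolling 7-day window: start_date is seven days ago, end_date is yesterday.
# In the Data Factory pipeline these values are later supplied as notebook parameters.
start_date = date.today() - timedelta(days=7)
end_date = date.today() - timedelta(days=1)

# Request the events for the window in GeoJSON format.
url = (
    "https://earthquake.usgs.gov/fdsnws/event/1/query"
    f"?format=geojson&starttime={start_date}&endtime={end_date}"
)
response = requests.get(url)

if response.ok:
    # Keep only the "features" array, which holds the individual earthquake events.
    features = response.json()["features"]

    # Save the raw events to the lakehouse Files area with a timestamped name
    # (assumed here to be keyed on start_date), e.g. 2014-01-01_earthquake_data.json.
    output_path = f"/lakehouse/default/Files/{start_date}_earthquake_data.json"
    with open(output_path, "w") as file:
        json.dump(features, file, indent=2)
    print(f"Saved {len(features)} events to {output_path}")
else:
    print(f"API request failed with status code {response.status_code}")
```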

Now that the earthquake data is successfully stored as a JSON file in the Lakehouse, the next step is to load it into a Spark DataFrame for further processing. This will allow me to perform transformations and apply the Medallion architecture to progressively refine the data.
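A short sketch of that load, assuming the notebook is attached to earthquakes_lakehouse as its default lakehouse (so the built-in spark session and the relative Files/ path are available) and reusing the file naming from the Bronze sketch above:

```python
# Read the raw GeoJSON features back into a Spark DataFrame.
# multiline is required because the file contains a single pretty-printed JSON array.
df = (
    spark.read
    .option("multiline", "true")
    .json(f"Files/{start_date}_earthquake_data.json")
)
df.printSchema()
```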

Step 3: Silver Layer – Data Transformation

Now that the raw earthquake data has been loaded into a Spark DataFrame, the next step is to clean and structure it in the Silver Layer. This step is crucial for preparing the data for further analysis by extracting relevant attributes and applying meaningful transformations.

Implementation Details:

1. Filtering & Extracting Relevant Fields:

  • The raw JSON structure contains nested fields that need to be extracted and renamed for better readability.
  • I focus on key attributes such as location (latitude, longitude, elevation), magnitude, significance, timestamps, and event descriptions.

2. Transforming the DataFrame:

  • The geometry.coordinates field contains longitude, latitude, and elevation as an array, so we extract each value separately.
  • The properties field holds metadata such as title, place description, significance (sig), magnitude (mag), magnitude type (magType), and timestamps.
  • The new DataFrame keeps only these relevant fields, making it easier to analyze and visualize later.

At this stage, I refine the DataFrame further by converting timestamps and storing the cleaned data in a structured format; a consolidated sketch of these Silver-layer transformations follows this list.

3. Converting Milliseconds to Readable Datetime Format

  • The time and updated columns in the raw dataset are stored in milliseconds (Unix epoch time). To make them more readable and usable in analytics, I convert them into standard timestamp format.

4. Storing in the Silver Table

  • Once the data is cleaned, I persist it into a structured table named earthquake_events_silver.
  • I use "append" mode, ensuring that new earthquake data is continuously added without overwriting existing records.

5. Visualizing the Silver Layer Data

  • Now that the cleaned and structured earthquake data is stored in the earthquake_events_silver table, we can visualize it using SQL queries directly in the Lakehouse.
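Below is a consolidated sketch of these Silver-layer steps. The column names are assumptions based on the standard USGS GeoJSON schema (geometry.coordinates, properties.title, properties.place, properties.sig, properties.mag, properties.magType, properties.time, properties.updated); the table name and append mode follow the description above.

```python
from pyspark.sql.functions import col

# Flatten the nested GeoJSON structure and keep only the relevant attributes.
df_silver = df.select(
    col("id"),
    col("geometry.coordinates").getItem(0).alias("longitude"),
    col("geometry.coordinates").getItem(1).alias("latitude"),
    col("geometry.coordinates").getItem(2).alias("elevation"),
    col("properties.title").alias("title"),
    col("properties.place").alias("place_description"),
    col("properties.sig").alias("sig"),
    col("properties.mag").alias("mag"),
    col("properties.magType").alias("magType"),
    col("properties.time").alias("time"),
    col("properties.updated").alias("updated"),
)

# Convert the Unix-epoch millisecond columns into readable timestamps.
df_silver = (
    df_silver
    .withColumn("time", (col("time") / 1000).cast("timestamp"))
    .withColumn("updated", (col("updated") / 1000).cast("timestamp"))
)

# Persist to the Silver table in append mode so new events accumulate over time.
df_silver.write.mode("append").saveAsTable("earthquake_events_silver")

# Quick check of the Silver table with SQL, as described in item 5 above.
spark.sql(
    "SELECT place_description, mag, sig, time "
    "FROM earthquake_events_silver ORDER BY time DESC LIMIT 10"
).show(truncate=False)
```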

Step 4: Gold Layer – Enriching the Data

  1. Since new data is continuously added to the Silver Layer, I only retrieve recent events (those with a time greater than start_date) for processing.
  2. To enrich the DataFrame, I determine the country code based on latitude and longitude using the reverse_geocoder library.
  3. Since Spark DataFrames operate on distributed data, get_country_code needs to be registered as a User Defined Function (UDF) before it can be used in transformations (see the sketch after this step's list).

4. Adding Country Code to the DataFrame

  • Using the previously registered UDF (get_country_code_udf), I enrich the DataFrame with a country_code column, which helps in the geographic analysis of earthquake patterns.

5. Classifying Earthquake Significance

To better analyze earthquakes based on their impact, I classify them into three categories:

  • Low (sig < 100)
  • Moderate (100 ≤ sig < 500)
  • High (sig ≥ 500)

6. Storing the Final DataFrame in the Gold Table

  • The refined DataFrame is now stored in earthquake_events_gold, ensuring that new data is continuously appended.
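The sketch below ties these Gold-layer steps together. It assumes start_date arrives as a 'YYYY-MM-DD' notebook parameter, that the Silver columns are named as in the earlier sketch, and that get_country_code and get_country_code_udf take the form shown here; the reverse_geocoder lookup, significance thresholds, and append mode follow the text above.

```python
import reverse_geocoder as rg
from pyspark.sql.functions import col, when, udf
from pyspark.sql.types import StringType

# Assumed notebook parameter (overridden by the Data Factory pipeline).
start_date = "2014-01-01"

def get_country_code(lat, lon):
    """Look up the ISO country code for a coordinate pair with reverse_geocoder."""
    return rg.search((float(lat), float(lon)))[0]["cc"]

# Register the helper as a UDF so Spark can apply it across the distributed DataFrame.
get_country_code_udf = udf(get_country_code, StringType())

df_gold = (
    spark.read.table("earthquake_events_silver")
    # Only process events newly added since start_date.
    .filter(col("time") > start_date)
    # Enrich with the country code derived from latitude and longitude.
    .withColumn("country_code", get_country_code_udf(col("latitude"), col("longitude")))
    # Classify significance: Low (< 100), Moderate (100-499), High (>= 500).
    .withColumn(
        "sig_class",
        when(col("sig") < 100, "Low")
        .when(col("sig") < 500, "Moderate")
        .otherwise("High"),
    )
)

# Append the enriched events to the Gold table.
df_gold.write.mode("append").saveAsTable("earthquake_events_gold")
```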

Step 5: Orchestrating Data Processing with an Azure Data Factory Pipeline

To ensure daily automated data processing, I use Azure Data Factory (ADF) to create a pipeline that orchestrates the execution of these notebooks (Bronze, Silver, and Gold layers).

1. Creating the "Bronze Notebook" Activity

  • In Data Factory, I created a "Notebook" activity named "Bronze Notebook".
  • I connected it to the notebook that processes raw data into the Bronze layer.
  • I defined dynamic parameters for start_date and end_date, ensuring they update daily (see the parameters sketch at the end of this step):
    • start_date → yesterday's date
    • end_date → today's date

2. Linking the "Bronze Notebook" to the "Silver Notebook"

  • I created another "Notebook" activity named "Silver Notebook".
  • I connected it to the Silver layer notebook.
  • I passed start_date as a parameter.

3. Linking the "Silver Notebook" to the "Gold Notebook"

  • I created a final "Notebook" activity called "Gold Notebook".
  • I connected it to the Gold layer notebook.
  • I passed start_date as a parameter.

With the Azure Data Factory pipeline in place, the entire data engineering workflow is fully automated. Each layer (Bronze, Silver, and Gold) executes sequentially, ensuring that fresh, processed earthquake data is available daily.

Thanks to this setup, the pipeline can be scheduled to run at specific times, allowing for automatic ingestion, transformation, and enrichment of earthquake data without manual intervention.
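As an illustration of how the dates stay dynamic, below is a minimal sketch of the parameters cell that each notebook could expose; the exact cell contents are an assumption. When the pipeline's Notebook activity supplies base parameters with the same names, those values override these defaults; on the pipeline side, expressions such as formatDateTime(addDays(utcNow(), -1), 'yyyy-MM-dd') can produce yesterday's date.

```python
# Parameters cell (tagged as such in the notebook).
# These defaults apply to interactive runs; on scheduled runs the Data Factory
# Notebook activity overrides start_date and end_date with its base parameters.
from datetime import date, timedelta

start_date = str(date.today() - timedelta(days=1))  # yesterday, 'YYYY-MM-DD'
end_date = str(date.today())                        # today, 'YYYY-MM-DD'
```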

Step 6: Semantic Modeling

With the data now available in the Gold Layer, the next step is to create a semantic model to optimize its use in Power BI. This model structures the data, defines relationships between tables, and includes calculated measures, making analysis and visualization more efficient.

Once the semantic model is set up, I can develop an interactive Power BI report based on this enriched data.

Step 7: Creating the Power BI Report

Here is the final Power BI report, providing a comprehensive visualization of global earthquake activity. Users can filter data by date range and significance level (High, Moderate, Low) to analyze trends. The interactive map displays earthquake locations, with bubble sizes representing significance, allowing for quick insights into seismic activity worldwide. Key metrics such as total earthquakes recorded and the maximum significance value ensure a clear and data-driven overview of recent seismic events.

Summary & Automation

This project automates the ingestion, transformation, and visualization of earthquake data using a Lakehouse architecture. By leveraging Azure Data Factory, the pipeline dynamically retrieves daily earthquake data from the USGS API, processes it through the Bronze, Silver, and Gold layers, and enriches it with geolocation attributes. The final dataset is then structured into a semantic model, enabling insightful visualizations in Power BI.

Through automation, the pipeline delivers daily updates without manual intervention, providing a scalable and efficient solution for monitoring global seismic activity. The final Power BI report offers an interactive and user-friendly dashboard, empowering users to analyze earthquake patterns by date, location, and significance level. This project demonstrates the power of data engineering, cloud automation, and business intelligence in transforming raw data into actionable insights.

The diagram below provides a structured overview of the entire pipeline, illustrating how earthquake data flows from raw ingestion in the Bronze Layer, through transformation and enrichment in the Silver and Gold Layers, before being automatically refreshed via Data Factory and visualized in Power BI. This architecture ensures a fully automated, reliable, and scalable solution for earthquake data monitoring.