The Transform Layer in Octopipe
The transform layer is a pivotal component in Octopipe, acting as the bridge between the generated type-safe API and the labeled database schema. This document provides an in-depth look at how the transform layer is generated, approved, and executed, ensuring data consistency and reliability.
Overview
The transform layer performs the following critical tasks:
- Mapping: Aligns the type-safe API schema with the labeled database schema.
- Validation: Allows users to review and approve the transformation logic.
- Execution: Writes the transformation logic to Spark for data processing.
Detailed Workflow
Type-Safe API Schema Generation
- Octopipe first generates a type-safe API for each connector by calling the external API.
- A Large Language Model (LLM) derives the types from the API responses.
- The generated schema ensures that all data is processed with strict type definitions. A sketch of what such a schema might look like follows this list.
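For illustration only, a generated type-safe schema for a sales connector might resemble the Python sketch below; the record name and fields are hypothetical (borrowed from the example scenario later in this document), since the exact format Octopipe emits is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal

# Hypothetical type-safe record derived by the LLM from sampled API responses.
# Field names and types are illustrative, not Octopipe's actual output.
@dataclass(frozen=True)
class SalesRecord:
    order_id: int       # inferred from integer-valued JSON fields
    amount: Decimal     # inferred from monetary strings such as "19.99"
    date: datetime      # inferred from ISO-8601 timestamps
```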
Database Schema Labeling
- The destination schema is pulled from the target system.
- Users are prompted to label the schema iteratively, ensuring that data fields are accurately defined.
- This labeling process is interactive, allowing for real-time adjustments and improvements. A sketch of a labeled schema follows this list.
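One way to picture the result of labeling: a plain structure that records, for each destination column, the type and description the user confirmed. This is a hypothetical sketch, not Octopipe's actual storage format.

```python
# Hypothetical labeled destination schema for a PostgreSQL "sales" table.
# Each column carries the type and description confirmed during labeling.
labeled_schema = {
    "table": "sales",
    "columns": {
        "order_id": {"type": "integer", "label": "unique order identifier"},
        "amount": {"type": "decimal(10,2)", "label": "order total"},
        "date": {"type": "timestamp", "label": "time the order was placed"},
    },
}
```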
Mapping Process
- The transform layer is created by mapping the type-safe API schema to the labeled database schema.
- This process includes aligning field names, data types, and any necessary conversion logic.
- Automated tooling generates an initial mapping, which the user then refines. A hypothetical mapping structure is sketched after this list.
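The mapping itself can be thought of as a list of source-to-target pairs with optional conversions. The structure and conversion names below are assumptions made for illustration.

```python
# Hypothetical mapping from type-safe API fields to labeled database columns.
# "convert" names a conversion applied when the two types do not line up exactly.
field_mapping = [
    {"source": "order_id", "target": "order_id", "convert": None},
    {"source": "amount", "target": "amount", "convert": "cast_decimal"},
    {"source": "date", "target": "date", "convert": "parse_iso8601"},
]
```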
User Approval
- The generated transformation is presented to the user for review.
- Users can adjust mappings, add transformation logic, or provide feedback on the generated code.
- Once approved, the transform layer becomes the definitive guide for how data is processed. A sketch of such a review loop follows this list.
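How the review step is surfaced is not specified here; as a rough illustration, a minimal interactive loop over the hypothetical field_mapping above might look like this.

```python
# Hypothetical approval loop: show each proposed mapping and let the user
# accept it or override the target column before the transform is finalized.
def review_mapping(field_mapping):
    approved = []
    for entry in field_mapping:
        print(f"{entry['source']} -> {entry['target']} (convert: {entry['convert']})")
        if input("Accept? [y/n] ").strip().lower() != "y":
            entry["target"] = input("Correct target column: ").strip()
        approved.append(entry)
    return approved
```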
Integration with Spark
- Execution Environment: After user approval, the transform layer is written to Spark, which executes the transformation logic on large datasets. This ensures high performance during data processing, even at scale. A minimal sketch of the execution step follows this list.
- Fault Tolerance: Spark's robust error-handling mechanisms help catch and recover from processing issues, minimizing downtime and ensuring that pipeline execution is as reliable as possible.
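The generated Spark code itself is not shown in this document, so the PySpark sketch below only illustrates the general shape of the execution step: applying the approved mapping as column selections and casts. The input path is a placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("octopipe-transform").getOrCreate()

# Hypothetical: raw API records landed as JSON; the approved transform
# selects and casts columns to match the labeled destination schema.
raw = spark.read.json("/data/raw/sales")  # placeholder path

transformed = raw.select(
    F.col("order_id").cast("int").alias("order_id"),
    F.col("amount").cast("decimal(10,2)").alias("amount"),
    F.to_timestamp("date").alias("date"),
)
```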
Example Scenario
Consider a pipeline where sales data is extracted from an API and loaded into a PostgreSQL database:
- Step 1: The API returns sales records with fields like `order_id`, `amount`, and `date`.
- Step 2: The destination schema labels these fields, confirming types such as integer, decimal, and timestamp.
- Step 3: The transform layer maps `order_id` from the API to `order_id` in the database, ensuring type consistency.
- Step 4: After user review, the transform logic is deployed to Spark, which processes the data accordingly.
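Continuing the scenario, the final load into PostgreSQL might use Spark's JDBC writer, as in the sketch below; the connection URL, table name, and credentials are placeholders, and the PostgreSQL JDBC driver must be on the Spark classpath.

```python
# Hypothetical load step: append the transformed DataFrame (from the sketch
# above) to the PostgreSQL "sales" table via JDBC.
(transformed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/warehouse")
    .option("dbtable", "sales")
    .option("user", "octopipe_user")
    .option("password", "<password>")
    .mode("append")
    .save())
```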
Error Handling and Debugging
- Automated Checks: Before deployment, automated tests verify that the mapping is consistent with both schemas (see the sketch after this list).
- User Feedback: Users are provided with detailed logs and error messages if discrepancies are found.
- Iterative Improvements: The system supports iterative refinement, allowing the transformation logic to be adjusted based on runtime observations.
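As one concrete illustration of such a check, the function below verifies that every mapping target exists in the labeled destination schema, reusing the hypothetical field_mapping and labeled_schema structures sketched earlier; Octopipe's actual checks are not specified here.

```python
# Hypothetical pre-deployment check: every mapping target must exist in the
# labeled schema, otherwise deployment is blocked with a descriptive error.
def check_mapping(field_mapping, labeled_schema):
    columns = labeled_schema["columns"]
    errors = [
        f"unknown target column: {entry['target']}"
        for entry in field_mapping
        if entry["target"] not in columns
    ]
    if errors:
        raise ValueError("mapping inconsistent with schema: " + "; ".join(errors))

check_mapping(field_mapping, labeled_schema)  # raises on any drift
```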
Benefits of the Transform Layer
- Consistency: Ensures that the data structure remains consistent from source to destination.
- Efficiency: Automates much of the transformation process, reducing manual coding and potential errors.
- Flexibility: Allows for custom adjustments and fine-tuning, accommodating complex data transformation needs.
- Scalability: By leveraging Spark, the transform layer can handle large volumes of data with ease.
Future Enhancements
- Enhanced UI: Plans to introduce a graphical interface for mapping and transformation approval.
- Custom Plugins: Support for third-party plugins to extend transformation capabilities.
- Real-Time Validation: Improvements to validate transformations as data flows through the pipeline, ensuring ongoing consistency.