F2F Interview: Data Engineer Questions asked at Deloitte
- Tell me about yourself
- What kinds of projects have you worked on, and what experience do you have?
- Explain your projects in detail.
- In terms of your current project, please describe the business objective, team structure, technical details and your role.
- Difference between a clustered and a non-clustered index?
- Differences between AWS Glue and EMR? When would you use each?
- What are window functions in SQL? (A sketch follows this list.)
- Explain the difference between DELETE, TRUNCATE, and DROP.
- Explain the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN.
- Write a Python script to connect to a database and fetch data using SQL queries (see the sketch after this list).
- Difference between pandas and PySpark for data manipulation?
- Describe the architecture of a cloud-based data warehouse such as Snowflake or BigQuery.
- Difference between OLAP and OLTP databases?
- How can you optimize your queries in Amazon Redshift?
- When would you create normalized tables and denormalized tables?
- Which relational databases have you worked with? What is your experience with them?
- What major issues have you faced in your experience, and how did you resolve them?
- What is ETL? Explain its phases and the tools you have worked with.
- How do you ensure data quality during ETL processes?
- What is the role of Apache Kafka in data engineering?
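For the window-function question above, a minimal SQL sketch (the employees table and its columns are hypothetical):

```sql
-- Rank each employee's salary within their department and compare it
-- to the department average, without collapsing rows as GROUP BY would.
SELECT
    emp_id,
    dept,
    salary,
    RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank,
    AVG(salary) OVER (PARTITION BY dept) AS dept_avg_salary
FROM employees;
```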
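For the database-connection question, one possible sketch using Python's built-in sqlite3 module (the file name, table, and threshold are placeholders; a real answer might use psycopg2 or pyodbc instead):

```python
import sqlite3

# 'example.db' and the employees table are illustrative placeholders.
conn = sqlite3.connect("example.db")
try:
    cursor = conn.cursor()
    # Parameterized queries avoid SQL injection.
    cursor.execute(
        "SELECT emp_id, name, salary FROM employees WHERE salary > ?",
        (50000,),
    )
    for row in cursor.fetchall():
        print(row)
finally:
    conn.close()  # always release the connection
```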
Data Engineer Questions asked at Apple
- How do you design pipelines to ensure user data privacy and compliance?
- Explain differential privacy and how it could be applied in Apple’s ecosystem (a noise-addition sketch follows this list).
- How do you anonymize event logs for large-scale analytics? (A keyed-hash sketch follows this list.)
- Describe how you’d handle schema evolution without exposing personal data.
- What’s your approach to securing pipelines on-prem and in the cloud?
- Explain how to build idempotent ETL jobs in a privacy-constrained environment.
- How do you handle device telemetry ingestion efficiently?
- Describe the trade-offs between edge aggregation and central aggregation.
- How would you detect data leakage across Apple’s internal datasets?
- Explain how encryption and key rotation are managed in pipelines.
- How do you validate anonymized data for ML training?
- What metrics would you track for privacy compliance at scale?
- Describe your strategy for debugging encrypted datasets.
- How do you implement data minimization principles in engineering design?
- Explain differential privacy noise budgets.
- How would you architect pipelines to support local differential privacy?
- Describe your approach to testing privacy-preserving transformations.
- How do you maintain metadata catalogs without exposing identifiers?
- Explain GDPR/CCPA impacts on Apple-scale data architectures.
- Walk through designing a privacy-safe analytics pipeline for Siri.
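For the differential-privacy questions, a minimal sketch of the Laplace mechanism on a counting query (the epsilon and count values are illustrative; this is a textbook sketch, not how Apple's audited implementations work):

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1: one user joining or leaving
    changes the result by at most 1. Noise scale is sensitivity/epsilon,
    so a smaller epsilon (tighter privacy budget) means more noise.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: report how many users triggered a feature, with epsilon = 0.5.
print(dp_count(true_count=1_000, epsilon=0.5))
```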
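And for the log-anonymization question, one common building block is replacing raw identifiers with keyed hashes; note this is pseudonymization rather than full anonymization, and would be combined with aggregation, generalization, or differential privacy in practice (the key and event are illustrative):

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-store-in-a-secret-manager"  # illustrative

def pseudonymize(user_id: str) -> str:
    """Replace a raw identifier with an HMAC-SHA256 pseudonym.

    A keyed hash (unlike bare SHA-256) resists dictionary attacks on
    low-entropy identifiers; rotating the key re-keys every pseudonym.
    """
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

event = {"user_id": "user-12345", "action": "app_open"}
event["user_id"] = pseudonymize(event["user_id"])
print(event)
```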
Scenario-Based SQL, PySpark and Azure Data Engineer Questions
- Write an SQL query to find the second highest salary from an employee table (see the sketch after this list).
- How do you handle NULL values in SQL joins while ensuring data integrity?
- Write an SQL query to calculate customer churn rate over the last 6 months.
- Design a fact table for an e-commerce platform – what dimensions and measures would you include?
- Explain the difference between star schema and snowflake schema and when to choose each.
- Write a Python script to validate data quality and detect anomalies before loading.
- In PySpark, how would you efficiently join two very large DataFrames to avoid skew?
- Write PySpark code to find the top 3 customers by revenue per region (a sketch follows this list).
- You are processing real-time data from Event Hub into Delta tables. How would you implement this?
- How do you implement schema evolution in Delta Lake without breaking existing jobs?
- How would you implement Slowly Changing Dimensions (SCD Type 2) in a data warehouse? (A two-step sketch follows this list.)
- Late-arriving data is detected in a batch pipeline – how do you ensure correctness of historical reporting?
- How do you design a pipeline that supports both batch and streaming workloads simultaneously?
- What is your approach to building incremental data loads in Azure Data Factory pipelines?
- Your PySpark job is failing due to skewed joins. Walk through your debugging and optimization steps.
- Explain the architecture of Azure Databricks integrated with Delta Lake in a production environment.
- How do you optimize query performance and concurrency in a Synapse dedicated SQL pool?
- Your data lake contains sensitive PII data. What best practices do you follow to secure data and manage secrets?
- How do you establish end-to-end data lineage and governance using Microsoft Purview?
- Design an end-to-end analytics pipeline that ingests from Event Hub, processes data in Databricks, stores it in Synapse, and powers dashboards in Power BI.
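For the second-highest-salary question, two common answers, assuming an employee(salary) table:

```sql
-- Subquery approach: returns NULL if fewer than two distinct salaries.
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);

-- Window-function approach: DENSE_RANK handles ties cleanly.
SELECT DISTINCT salary AS second_highest
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employee
) ranked
WHERE rnk = 2;
```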
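For the top-3-customers-by-revenue question, a PySpark sketch (the orders path and schema are assumed):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top-customers").getOrCreate()

# orders(region, customer_id, amount) is an assumed schema.
orders = spark.read.parquet("/data/orders")

revenue = orders.groupBy("region", "customer_id").agg(
    F.sum("amount").alias("revenue")
)

# Rank customers within each region; dense_rank keeps ties.
w = Window.partitionBy("region").orderBy(F.col("revenue").desc())
top3 = revenue.withColumn("rnk", F.dense_rank().over(w)).filter("rnk <= 3")

top3.show()
```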
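For the SCD Type 2 question, one common two-step pattern, sketched in PostgreSQL-flavored SQL (dim_customer, staging_customer, and their columns are illustrative; Delta Lake and most warehouses can express the same logic with MERGE):

```sql
-- Step 1: expire current versions whose tracked attributes changed.
UPDATE dim_customer d
SET    valid_to = CURRENT_DATE,
       is_current = FALSE
FROM   staging_customer s
WHERE  d.customer_id = s.customer_id
  AND  d.is_current = TRUE
  AND  (d.address <> s.address OR d.segment <> s.segment);

-- Step 2: insert a fresh version for changed and brand-new customers
-- (changed rows were expired in step 1, so they no longer match a
-- current row and fall through the LEFT JOIN too).
INSERT INTO dim_customer (customer_id, address, segment,
                          valid_from, valid_to, is_current)
SELECT s.customer_id, s.address, s.segment,
       CURRENT_DATE, DATE '9999-12-31', TRUE
FROM   staging_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id
      AND d.is_current = TRUE
WHERE  d.customer_id IS NULL;
```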
- Write a SQL query to find duplicate rows (see the sketch after this list).
- Retrieve the third highest salary from the Employee table.
- Get the latest order placed by each customer.
- Calculate the average order value per customer.
- Write a query to identify the best-selling product.
- Find all employees hired on weekdays.
- Find all employees with a salary between 75,000 and 200,000.
- Write a query to get monthly sales revenue and order count.
- Find the moving average of sales over the last 5 days (a window-frame sketch follows this list).
- Write a query to find the first and last order of each customer.
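For the duplicate-rows question, the standard GROUP BY / HAVING pattern (the orders table and its columns are assumed):

```sql
-- Rows whose full business key appears more than once.
SELECT customer_id, order_date, amount, COUNT(*) AS dup_count
FROM orders
GROUP BY customer_id, order_date, amount
HAVING COUNT(*) > 1;
```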
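And for the 5-day moving average, a window frame does the work (daily_sales(sale_date, revenue) is an assumed schema):

```sql
SELECT
    sale_date,
    revenue,
    -- Current row plus the four preceding days.
    AVG(revenue) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) AS moving_avg_5d
FROM daily_sales;
```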
SQL Database Concepts:
- What is the difference between UNION, UNION ALL, and UNION DISTINCT? Which one would you prefer for very large datasets, and why? (A comparison sketch follows this list.)
- What are the different types of joins in SQL? Briefly explain each with an example.
- From a performance perspective, which is more efficient: UNION ALL with DISTINCT, UNION ALL with window functions, or plain UNION?
- Explain the difference between normalization and denormalization. When would you prefer each?
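A quick illustration of the UNION semantics asked about above (table names are illustrative): UNION deduplicates the combined result, which costs a sort or hash over all rows, while UNION ALL simply concatenates; that is why UNION ALL is usually preferred on very large datasets when duplicates are impossible or acceptable. UNION DISTINCT is simply the explicit spelling of UNION.

```sql
-- UNION deduplicates (extra sort/hash step over the combined rows):
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;

-- UNION ALL keeps duplicates (a cheap concatenation, no dedup cost):
SELECT customer_id FROM online_orders
UNION ALL
SELECT customer_id FROM store_orders;
```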
Data Management Concepts:
- How does a data validation framework like Great Expectations fit into ensuring data quality in an OLTP -> OLAP pipeline?
- Schema changes in OLTP often require application code updates. How can the impact of schema changes be minimized?
- In an OLTP system, how can triggers be used to capture every insert, update, or delete into a history table?
- What would be the schema for a Card table and its corresponding CardHistory table? How does each column support change tracking?
- Why do we include valid_from and valid_to columns in history tables when the latest state can be derived from entry order? What additional value do they provide?
- In a history-tracking design, how would a trigger insert a new version of a record and update the valid_to of the previous version? (A trigger sketch follows this list.)
- How would you extend the same history-tracking mechanism to related tables like Account and Limits, which are linked to Card?
- Given CardHistory, AccountHistory, and LimitHistory, how would you build the CardExtract target model in OLAP so that it captures changes from all three entities?
- Can you demonstrate (in SQL or pseudocode) how to join history tables to generate a daily extract of changes across cards, accounts, and limits? (A sketch follows this list.)
- If only the LimitHistory changes (e.g., a spending-limit adjustment), how do you ensure the CardExtract still reflects this change?
- On a given day, how would you detect which records across history tables actually changed, so that only those are extracted?
- If millions of cards exist but only a few thousand change daily, how would you optimize the extract to avoid recomputing all history?
- What is the best way to identify the specific set of impacted cards that should be processed for the extract?
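For the trigger question above, a minimal sketch in PostgreSQL syntax (table and column names follow the Card/CardHistory example; other engines use different trigger syntax):

```sql
-- CardHistory keeps one row per version of a card.
-- Assumed columns: card_id, card_status, credit_limit, valid_from, valid_to.
CREATE OR REPLACE FUNCTION track_card_changes() RETURNS trigger AS $$
BEGIN
    -- Close the previous open version, if any.
    UPDATE CardHistory
    SET    valid_to = now()
    WHERE  card_id = NEW.card_id
      AND  valid_to IS NULL;

    -- Open a new version reflecting the current state.
    INSERT INTO CardHistory (card_id, card_status, credit_limit,
                             valid_from, valid_to)
    VALUES (NEW.card_id, NEW.card_status, NEW.credit_limit, now(), NULL);

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER card_history_trg
AFTER INSERT OR UPDATE ON Card
FOR EACH ROW EXECUTE FUNCTION track_card_changes();
```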
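And one way to sketch the daily CardExtract join (:extract_date is a bind parameter; valid_to IS NULL marks the open version; LimitHistory is assumed to carry card_id and Card an account_id). The changed_cards CTE is also how you restrict processing to only the impacted cards rather than recomputing all history:

```sql
-- Cards whose own row, account, or limits changed on :extract_date.
WITH changed_cards AS (
    SELECT card_id FROM CardHistory WHERE valid_from::date = :extract_date
    UNION
    SELECT c.card_id
    FROM   Card c JOIN AccountHistory ah ON ah.account_id = c.account_id
    WHERE  ah.valid_from::date = :extract_date
    UNION
    SELECT l.card_id FROM LimitHistory l
    WHERE  l.valid_from::date = :extract_date
)
SELECT ch.card_id, ch.card_status, ah.account_status, lh.spending_limit
FROM   changed_cards cc
JOIN   CardHistory    ch ON ch.card_id = cc.card_id AND ch.valid_to IS NULL
JOIN   Card           c  ON c.card_id  = cc.card_id
JOIN   AccountHistory ah ON ah.account_id = c.account_id AND ah.valid_to IS NULL
JOIN   LimitHistory   lh ON lh.card_id = cc.card_id AND lh.valid_to IS NULL;
```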
Azure Data Engineering Interview Questions
- Difference between Delta Lake and a data lake?
- Why is Databricks better than Dataflow?
- Explain the Spark architecture.
- Difference between groupByKey and reduceByKey? (A sketch follows this list.)
- What is Delta Lake? What are its key features, and how do you create Delta tables?
- What is a mount point in Databricks?
- How do you optimize Databricks performance?
- Difference between a data lake and a data warehouse?
- What is fault tolerance, and how is it used in real-time applications?
- How does Auto Loader work?
- Difference between repartition and coalesce?
- Difference between external and managed tables in Synapse?
- Difference between DataFrame SQL and T-SQL?
- Find the top 10 active users in PySpark (a sketch follows this list).
- How does Z-ordering work in Delta tables?
- Difference between narrow and wide transformations?
- What is a surrogate key?
- Difference between a serverless pool and a dedicated SQL pool?
- What is incremental loading? How do you implement it?
- How does Data Loader work?
- What is lazy evaluation in PySpark?
- What is a DAG in Spark?
- Write a query to find the 5th highest salary of an employee.
- What are the different approaches to creating an RDD in PySpark?
- What are the different approaches to creating a DataFrame in PySpark?
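For the groupByKey vs reduceByKey question, a small RDD sketch; reduceByKey combines values map-side before the shuffle, so it moves far less data than groupByKey for aggregations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey: values are pre-aggregated on each partition (map-side
# combine) before the shuffle; preferred for sums, counts, max, etc.
sums = pairs.reduceByKey(lambda x, y: x + y)
print(sums.collect())  # [('a', 4), ('b', 6)] (order may vary)

# groupByKey: every value is shuffled to the reducer, then grouped;
# more network traffic and memory pressure for the same result.
grouped = pairs.groupByKey().mapValues(sum)
print(grouped.collect())  # [('a', 4), ('b', 6)] (order may vary)
```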
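And a sketch for the top-10-active-users task, assuming an events DataFrame with a user_id column (path and schema are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-users").getOrCreate()

# events(user_id, event_time, ...) is an assumed schema.
events = spark.read.parquet("/data/events")

top_users = (
    events.groupBy("user_id")
    .agg(F.count("*").alias("event_count"))  # activity = event count
    .orderBy(F.col("event_count").desc())
    .limit(10)
)
top_users.show()
```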
