Top Data Engineer Interview Questions for Deloitte 2025

F2F (Face-to-Face) Data Engineer Interview Questions Asked at Deloitte

  1. Tell me about yourself.
  2. What kind of projects have you worked on, and what experience do you have?
  3. Explain your projects in detail.
  4. In terms of your current project, please describe the business objective, team structure, technical details, and your role.
  5. Difference between a clustered and a non-clustered index?
  6. Differences between AWS Glue and EMR? When would you use each?
  7. What are window functions in SQL?
  8. Explain the difference between DELETE, TRUNCATE, and DROP.
  9. Explain the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN.
  10. Write a Python script to connect to a database and fetch data using SQL queries. (See the sketch after this list.)
  11. Difference between pandas and PySpark for data manipulation?
  12. Describe the architecture of a cloud-based data warehouse like Snowflake or BigQuery.
  13. Difference between OLAP and OLTP databases?
  14. How can you optimize your queries in Amazon Redshift?
  15. When would you create normalized tables and denormalized tables?
  16. Which relational databases have you worked with? What is your experience with them?
  17. What major issues have you faced in your experience, and how did you resolve them?
  18. What is ETL? Explain its phases and the tools you have worked with.
  19. How do you ensure data quality during ETL processes?
  20. What is the role of Apache Kafka in data engineering?
  21. How do you design pipelines to ensure user data privacy and compliance?
  22. Explain differential privacy and how it could be applied in Apple’s ecosystem.
  23. How do you anonymize event logs for large-scale analytics?
  24. Describe how you’d handle schema evolution without exposing personal data.
  25. What’s your approach to securing pipelines on-prem and in the cloud?
  26. Explain how to build idempotent ETL jobs in a privacy-constrained environment.
  27. How do you handle device telemetry ingestion efficiently?
  28. Describe the trade-offs between edge aggregation and central aggregation.
  29. How would you detect data leakage across Apple’s internal datasets?
  30. Explain how encryption and key rotation are managed in pipelines.
  31. How do you validate anonymized data for ML training?
  32. What metrics would you track for privacy compliance at scale?
  33. Describe your strategy for debugging encrypted datasets.
  34. How do you implement data minimization principles in engineering design?
  35. Explain differential privacy noise budgets.
  36. How would you architect pipelines to support local differential privacy?
  37. Describe your approach to testing privacy-preserving transformations.
  38. How do you maintain metadata catalogs without exposing identifiers?
  39. Explain GDPR/CCPA impacts on Apple-scale data architectures.
  40. Walk through designing a privacy-safe analytics pipeline for Siri.
  41. Write an SQL query to find the second highest salary from an employee table.
  42. How do you handle NULL values in SQL joins while ensuring data integrity?
  43. Write an SQL query to calculate customer churn rate over the last 6 months.
  44. Design a fact table for an e-commerce platform – what dimensions and measures would you include?
  45. Explain the difference between star schema and snowflake schema and when to choose each.
  46. Write a Python script to validate data quality and detect anomalies before loading.
  47. In PySpark, how would you efficiently join two very large DataFrames to avoid skew?
  48. Write PySpark code to find the top 3 customers by revenue per region (see the sketch after this list).
  49. You are processing real-time data from Event Hub into Delta tables. How would you implement this?
  50. How do you implement schema evolution in Delta Lake without breaking existing jobs?
  51. How would you implement Slowly Changing Dimensions (SCD Type 2) in a data warehouse? (See the Delta Lake sketch after this list.)
  52. Late-arriving data is detected in a batch pipeline – how do you ensure correctness of historical reporting?
  53. How do you design a pipeline that supports both batch and streaming workloads simultaneously?
  54. What is your approach to building incremental data loads in Azure Data Factory pipelines?
  55. Your PySpark job is failing due to skewed joins. Walk through your debugging and optimization steps.
  56. Explain the architecture of Azure Databricks integrated with Delta Lake in a production environment.
  57. How do you optimize query performance and concurrency in a Synapse dedicated SQL pool?
  58. Your data lake contains sensitive PII data. What best practices do you follow to secure data and manage secrets?
  59. How do you establish end-to-end data lineage and governance using Microsoft Purview?
  60. Design an end-to-end analytics pipeline that ingests from Event Hub, processes data in Databricks, stores it in Synapse, and powers dashboards in Power BI.
  61. Write an SQL query to find duplicate rows.
  62. Retrieve the third highest salary from the Employee table.
  63. Get the latest order placed by each customer.
  64. Calculate the average order value per customer.
  65. Write a query to identify the best-selling product.
  66. Find all employees hired on weekdays.
  67. Find all employees with a salary between 75,000 and 200,000.
  68. Write a query to get monthly sales revenue and order count.
  69. Find the moving average of sales over the last 5 days.
  70. Write a query to find the first and last order of each customer.
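
A minimal sketch for question 10 (connecting to a database from Python), using the standard-library sqlite3 module; the same DB-API pattern carries over to drivers like psycopg2 (PostgreSQL) or mysql-connector-python by swapping the connect() call. The employees table and its columns are hypothetical placeholders.

```python
import sqlite3

def fetch_high_earners(db_path: str, min_salary: float) -> list[tuple]:
    """Connect, run a parameterized query, and fetch all rows."""
    with sqlite3.connect(db_path) as conn:
        cursor = conn.cursor()
        cursor.execute(
            # `employees` is a hypothetical table used for illustration
            "SELECT emp_id, name, salary FROM employees WHERE salary > ?",
            (min_salary,),  # parameterized to avoid SQL injection
        )
        return cursor.fetchall()

if __name__ == "__main__":
    for row in fetch_high_earners("company.db", 75000):
        print(row)
```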
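
For question 48, one common approach is a dense_rank window partitioned by region; the column names below (region, customer_id, amount) are assumed for illustration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top3-per-region").getOrCreate()

# Toy orders data; in practice this would be read from a table or files.
orders = spark.createDataFrame(
    [("EU", "c1", 120.0), ("EU", "c2", 90.0), ("EU", "c3", 75.0),
     ("EU", "c4", 60.0), ("US", "c5", 200.0), ("US", "c6", 150.0)],
    ["region", "customer_id", "amount"],
)

# Total revenue per customer within each region.
revenue = orders.groupBy("region", "customer_id").agg(
    F.sum("amount").alias("revenue")
)

# Rank customers by revenue inside each region, then keep ranks 1-3.
w = Window.partitionBy("region").orderBy(F.col("revenue").desc())
top3 = revenue.withColumn("rnk", F.dense_rank().over(w)).filter("rnk <= 3")
top3.show()
```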
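
For question 51, a hedged sketch of SCD Type 2 on Delta Lake: close out changed current rows with MERGE, then append the new versions. It assumes the delta-spark package (or a Databricks runtime), an existing Delta-enabled SparkSession named spark, an incoming updates DataFrame, and an illustrative dim_customer(customer_id, address, valid_from, valid_to, is_current) table.

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# `spark` is an assumed Delta-enabled SparkSession; `updates` is an
# assumed DataFrame with one row per changed customer.
dim = DeltaTable.forName(spark, "dim_customer")

# Step 1: close out the current version of every customer whose tracked
# attribute actually changed.
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "valid_to": "current_date()"})
    .execute())

# Step 2: append the new versions as open records. For brevity this assumes
# every row in `updates` carries a real change; production code would first
# anti-join away no-op rows.
(updates
    .withColumn("valid_from", F.current_date())
    .withColumn("valid_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
    .write.format("delta").mode("append").saveAsTable("dim_customer"))
```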

SQL Database Concepts:

  1. What is the difference between UNION, UNION ALL, and UNION DISTINCT? Which one would you prefer for very large datasets, and why? (See the demo after this list.)
  2. What are the different types of joins in SQL? Briefly explain each with an example.
  3. From a performance perspective, which is more efficient: UNION ALL with DISTINCT, UNION ALL with window functions, or plain UNION?
  4. Explain the difference between normalization and denormalization. When would you prefer each?
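
A quick way to see the UNION vs UNION ALL trade-off from question 1 is an in-memory SQLite demo: UNION performs an implicit DISTINCT (an extra sort or hash over the full result set), while UNION ALL simply concatenates, which is why UNION ALL is usually the better choice on very large datasets when duplicates are acceptable or impossible.

```python
import sqlite3

# Tiny demo of UNION vs UNION ALL semantics using in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a(x INTEGER);
    CREATE TABLE b(x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION deduplicates via an implicit, potentially expensive DISTINCT.
print(conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall())
# [(1,), (2,), (3,)]

# UNION ALL simply concatenates, avoiding the deduplication cost entirely.
print(conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x").fetchall())
# [(1,), (2,), (2,), (2,), (3,)]
```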

Data Management Concepts:

  1. How does a data validation framework like Great Expectations fit into ensuring data quality in an OLTP -> OLAP pipeline?
  2. Schema changes in OLTP often require application code updates. How can the impact of schema changes be minimized?
  3. In an OLTP system, how can triggers be used to capture every insert, update, or delete into a history table?
  4. What would be the schema for a Card table and its corresponding CardHistory table? How does each column support change tracking?
  5. Why do we include valid_from and valid_to columns in history tables when the latest state can be derived from entry order? What additional value do they provide?
  6. In a history-tracking design, how would a trigger insert a new version of a record and update the valid_to of the previous version? (See the trigger sketch after this list.)
  7. How would you extend the same history-tracking mechanism to related tables like Account and Limits, which are linked to Card?
  8. Given CardHistory, AccountHistory, and LimitHistory, how would you build the CardExtract target model in OLAP so that it captures changes from all three entities?
  9. Can you demonstrate (in SQL or pseudocode) how to join history tables to generate a daily extract of changes across cards, accounts, and limits?
  10. If only the LimitHistory changes (e.g., a spending limit adjustment), how do you ensure the CardExtract still reflects this change?
  11. On a given day, how would you detect which records across the history tables actually changed, so only those are extracted? (See the change-detection sketch after this list.)
  12. If millions of cards exist but only a few thousand change daily, how would you optimize the extract to avoid recomputing all history?
  13. What is the best way to identify the specific set of impacted cards that should be processed for the extract?
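
A minimal sketch for the trigger questions (3-6), using SQLite so it runs anywhere Python does; a production system would use the equivalent trigger syntax of its own RDBMS. The Card/CardHistory schema is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Card (
        card_id INTEGER PRIMARY KEY,
        status  TEXT NOT NULL
    );
    CREATE TABLE CardHistory (
        card_id    INTEGER,
        status     TEXT,
        valid_from TEXT,   -- when this version became effective
        valid_to   TEXT    -- NULL marks the current version
    );

    -- On every update: close the previous version, then insert the new one.
    CREATE TRIGGER card_history_trg AFTER UPDATE ON Card
    BEGIN
        UPDATE CardHistory
           SET valid_to = datetime('now')
         WHERE card_id = NEW.card_id AND valid_to IS NULL;
        INSERT INTO CardHistory VALUES
            (NEW.card_id, NEW.status, datetime('now'), NULL);
    END;
""")

conn.execute("INSERT INTO Card VALUES (1, 'ACTIVE')")
conn.execute(
    "INSERT INTO CardHistory VALUES (1, 'ACTIVE', datetime('now'), NULL)")
conn.execute("UPDATE Card SET status = 'BLOCKED' WHERE card_id = 1")
# Two versions now exist: the closed ACTIVE row and the open BLOCKED row.
print(conn.execute("SELECT * FROM CardHistory").fetchall())
```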
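
And a sketch for questions 11-13: rather than recomputing all history, scan each history table's valid_from for the target day and union the impacted card IDs. Table and column names are assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CardHistory    (card_id INTEGER, valid_from TEXT);
    CREATE TABLE AccountHistory (card_id INTEGER, valid_from TEXT);
    CREATE TABLE LimitHistory   (card_id INTEGER, valid_from TEXT);
    INSERT INTO CardHistory    VALUES (1, '2025-01-02');
    INSERT INTO AccountHistory VALUES (2, '2025-01-02');
    INSERT INTO LimitHistory   VALUES (3, '2024-12-30');
""")

# UNION deduplicates, yielding the distinct set of impacted cards for the day.
impacted = conn.execute("""
    SELECT card_id FROM CardHistory    WHERE date(valid_from) = :d
    UNION
    SELECT card_id FROM AccountHistory WHERE date(valid_from) = :d
    UNION
    SELECT card_id FROM LimitHistory   WHERE date(valid_from) = :d
""", {"d": "2025-01-02"}).fetchall()

print(impacted)  # only cards 1 and 2 need to flow into CardExtract
```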

Azure Data Engineering Interview Questions

  1. Difference between Delta Lake and a Data Lake?
  2. Why is Databricks better than Dataflow?
  3. Explain the Spark architecture.
  4. Difference between groupByKey and reduceByKey? (See the comparison sketch after this list.)
  5. What is Delta Lake? What are its key features, and how do you create Delta tables?
  6. What is a Mount Point in Databricks?
  7. How do you optimize Databricks performance?
  8. Difference between a Data Lake and a Data Warehouse?
  9. What is fault tolerance, and how is it used in real-time applications?
  10. How does Auto Loader work? (See the streaming sketch after this list.)
  11. Difference between repartition and coalesce?
  12. Difference between an external table and a managed table in Synapse?
  13. Difference between DataFrame SQL and T-SQL?
  14. Find the top 10 active users in PySpark (see the sketch after this list).
  15. How does Z-ordering work in Delta tables?
  16. Difference between narrow and wide transformations?
  17. What is a surrogate key?
  18. Difference between a Serverless SQL Pool and a Dedicated SQL Pool?
  19. What is incremental loading? How do we implement it?
  20. How does Data Loader work?
  21. What is lazy evaluation in PySpark?
  22. What is a DAG in Spark?
  23. Write a query to find the 5th highest salary of an employee.
  24. What are the different approaches to creating an RDD in PySpark?
  25. What are the different approaches to creating a DataFrame in PySpark?
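
A side-by-side sketch for question 4: both produce the same counts, but reduceByKey pre-aggregates map-side before the shuffle, while groupByKey ships every raw value across the network.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gbk-vs-rbk").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# groupByKey: shuffles all values, then sums on the reducer side.
via_gbk = pairs.groupByKey().mapValues(lambda vals: sum(vals))

# reduceByKey: pre-aggregates per partition (a combiner), merges partials.
via_rbk = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(via_gbk.collect()))  # [('a', 3), ('b', 1)]
print(sorted(via_rbk.collect()))  # [('a', 3), ('b', 1)]
```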
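
For question 14, a small DataFrame sketch (the events schema is assumed); it also illustrates lazy evaluation from question 21, since nothing executes until an action like show() is called.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-users").getOrCreate()

# Hypothetical event log; real pipelines would read this from storage.
events = spark.createDataFrame(
    [("u1", "click"), ("u1", "view"), ("u2", "click"), ("u3", "view")],
    ["user_id", "event_type"],
)

# Count events per user and keep the 10 most active.
top_users = (events
             .groupBy("user_id")
             .agg(F.count("*").alias("events"))
             .orderBy(F.col("events").desc())
             .limit(10))

# Only this action triggers execution; Spark plans the aggregation,
# sort, and limit as one optimized job (lazy evaluation).
top_users.show()
```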
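
For question 10, a hedged Auto Loader sketch. The cloudFiles source is Databricks-specific (it is not available in open-source Spark); all paths, options, and the target table name below are placeholders, and an existing SparkSession named spark is assumed.

```python
# `spark` is an assumed Databricks SparkSession; all paths are placeholders.
stream = (spark.readStream
          .format("cloudFiles")                 # Auto Loader source
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
          .load("/mnt/raw/events"))             # incrementally discovers new files

(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/checkpoints/events")
       .trigger(availableNow=True)              # process the backlog, then stop
       .outputMode("append")
       .toTable("bronze_events"))
```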
