See All: Miscellaneous Interviews @ FloCareer
RATE CANDIDATE FOR: - Advanced SQL - Coding - Hadoop - Spark or Pyspark or Python - Unix - Hive 1: You're tasked with optimizing a Hadoop-based data pipeline where large tables are joined using SQL queries in Hive. What advanced SQL strategies would you use to improve join performance and resource utilization in this scenario? Answer: - I would leverage partitioning and bucketing to minimize data scanned during joins - use map-side joins or broadcast joins for smaller tables - optimize query execution with appropriate join order - and consider using vectorized queries for further speedups - and analyze query plans with EXPLAIN - I'd also ensure statistics are up-to-dateCode by candidate:
2: Your team needs to securely transfer large log files between two Unix servers over an unreliable network. Describe your approach, including Unix tools and steps to ensure both data integrity and transfer resilience. Answer: - I would use 'rsync' over SSH for secure, resumable transfers. - To ensure data integrity, I'd use checksums (e.g., md5sum or sha256sum) before and after transfer. - If the network is highly unreliable, I might split large files with 'split', transfer the parts, and reassemble them. - Regular logs and monitoring would verify success. 3: A critical Hadoop job failed due to a sudden spike in input data size, causing cascading failures in downstream processes. How would you approach identifying the root cause and redesigning the workflow to handle unpredictable data volumes in the future? Answer: - First, review job logs and cluster metrics to confirm resource exhaustion or configuration limits. - Identify if data skew, input splits, or mapper/reducer allocation caused the failure. Redesign by: - adding dynamic resource allocation - implementing data sampling, or - breaking large jobs into smaller, fault-tolerant stages with retry mechanisms and - monitoring thresholds 4: Your Hadoop cluster faces frequent NameNode restarts, impacting data availability. Describe your approach to diagnosing the root cause and steps you would take to ensure high availability and minimize future disruptions. Answer: - I'd review NameNode logs: # for errors (e.g., memory, disk, or network issues), # check hardware health, and # verify JVM configurations. - I'd implement NameNode HA using standby nodes and shared storage - test failover procedures, - ensure regular metadata backups, and - monitor cluster health to proactively address issues. 5: A critical application on a Unix server starts exhibiting high CPU usage and becomes unresponsive. Outline your step-by-step approach to diagnose the issue and mitigate its impact without restarting the server. Answer: - I would use tools like top, ps, and vmstat to identify the processes consuming high CPU. - Next, I'd check logs, review recent changes, and analyze process states. - If needed, I'd reduce or limit resources for the offending process, and investigate code or system misconfigurations, aiming for minimal disruption. 6: Using PySpark, write a function to identify the top 3 products by total sales in each region from a DataFrame with columns: 'region', 'product', and 'sales'. Ensure scalability for large datasets. Hint: def top3_products_by_region(df): from pyspark.sql import Window from pyspark.sql.functions import sum as _sum, row_number w = Window.partitionBy('region').orderBy(_sum('sales').desc()) sales_df = df.groupBy('region', 'product').agg(_sum('sales').alias('total_sales')) ranked = sales_df.withColumn('rank', row_number().over(w)) return ranked.filter(ranked.rank <= 3) # This approach uses aggregation and window functions, ensuring scalability by minimizing shuffles and only keeping required records. 7: Your PySpark job needs to process sensitive financial transactions and deliver results within strict SLAs. How would you balance data security, job reliability, and performance in your pipeline design? Explain your approach and trade-offs. Answer: - I would use encryption at rest and in transit for sensitive data - restrict access using fine-grained Spark security features - and mask or tokenize data where feasible. - For reliability, I'd implement checkpointing, retries, and monitoring. - To meet SLAs, I'd optimize resource allocation, leverage partitioning, and cache data where appropriate. - Trade-offs may involve additional compute/storage costs for security and reliability features versus raw performance. 8: In a Hadoop environment, you need to merge multiple large, daily-partitioned Hive tables containing sales data into a single consolidated table, ensuring schema evolution and minimizing data skew. Describe your advanced SQL approach and optimization strategies. Answer: - I would use dynamic partition inserts to write into the consolidated table, # leverage ORC/Parquet formats for better performance, # handle schema evolution with Hive's schema-on-read # and add missing columns using ALTER TABLE. - To minimize data skew, I'd use salting techniques # and distribute by key columns during INSERT operations. 9: You need to migrate an existing Python ETL process to PySpark to handle increasing data volume. What factors would you consider in the migration, and how would you ensure data consistency and reliability during the transition? Answer: - I would analyze data partitioning, serialization, and transformation logic, # refactor code to leverage PySpark's distributed processing, # and design comprehensive validation tests. - To ensure data consistency and reliability: # I'd run both systems in parallel, compare outputs, set up error handling, and monitor performance 10: Your company needs to implement a GDPR-compliant data retention policy in Hive. How would you design a process to identify and purge personal data from large, partitioned Hive tables without affecting business-critical analytics? Answer: - Design a process leveraging partitioning by date/user to enable targeted deletions. - Use dynamic partition pruning and overwrite/drop partitions for data beyond retention limits. - Implement access controls and maintain audit logs for data deletion events. - Validate business reports post-purge to ensure analytics are unaffected. 11: Given a PySpark DataFrame 'visits' with columns 'user_id', 'visit_time', and 'page_url', write a function to identify, for each user, the sequence of pages visited during their longest single continuous session (no gap >30 minutes between consecutive visits). Optimize for large datasets. Answer: To solve this, sort visits by 'user_id' and 'visit_time'. Use window functions to calculate the time difference between consecutive visits for each user. Assign a session ID that increments when the gap exceeds 30 minutes. For each user, group by session ID and count visits. Find the session with the most visits (or longest duration), then return the ordered sequence of 'page_url' for that session. Use partitioning and windowing to ensure scalability for large datasets.
w= window.partitionBy("user_id").orderBy("visit_time")df=(df.withColumn("prev_time",lag("visit_time").over(w)) .withColumn("session_id", sum(when(col("prev_time").isNull() | ((unix_timestamp("visit_time")-unix_timestamp("prev_time"))>1800),1) .otherwise(0) ).over(w.rowsBetween(window.unboundedPreceeding, window.currentRow)) ) sessions=df.groupBy("user_id","session_id").agg( collect.list("page_url").alias("page_sequence"), min("visit_time").alias("session_start"), max("visit_time").alias("session_end") .withColumn("duration", unix_timestamp("session_end")-unix_timestamp("session_start")) ) result=session.withColumn("rn",row_number().over(window.partitionBy("user_id") .orderBy(col("duration").desc()) )) .filter(col("rn")==1) .drop("rn","session_id"))result.show()