

Important Big Data Analytics Interview Questions and Answers

Big Data Hadoop Interview Questions and Answers

1. Mention what are the responsibilities of a data analyst?

The responsibilities of a data analyst include:
  • Provide support for all data analysis and coordinate with customers and staff
  • Resolve business-related issues for clients and perform audits on data
  • Analyze results and interpret data using statistical techniques and provide ongoing reports
  • Prioritize business and information needs and work closely with management
  • Identify new processes or areas for improvement opportunities
  • Analyze, identify and interpret trends or patterns in complex data sets
  • Acquire data from primary or secondary data sources and maintain databases/data systems
  • Filter and “clean” data, and review computer reports
  • Determine performance indicators to locate and correct code problems
  • Secure databases by developing access systems and determining user levels of access

    2. What is required to become a data analyst?

    To become a data analyst, you need:
  • Robust knowledge of reporting packages (Business Objects), programming languages (XML, JavaScript, or ETL frameworks), and databases (SQL, SQLite, etc.)
  • Strong skills with the ability to analyze, organize, collect and disseminate big data with accuracy
  • Technical knowledge of database design, data models, data mining and segmentation techniques
  • Strong knowledge of statistical packages for analyzing large datasets (SAS, Excel, SPSS, etc.)

    3. Mention what are the various steps in an analytics project?

    The various steps in an analytics project include:
  • Problem definition
  • Data exploration
  • Data preparation
  • Modelling
  • Validation of data
  • Implementation and tracking

    4. What is data cleansing?

    Data cleaning, also called data cleansing, deals with identifying and removing errors and inconsistencies from data in order to enhance its quality.

    5. List out some of the best practices for data cleaning?

  • Sort data by different attributes
  • For large datasets, cleanse the data stepwise and improve it with each step until you reach good data quality
  • For large datasets, break them into smaller chunks. Working with less data will increase your iteration speed
  • To handle common cleansing tasks, create a set of utility functions/tools/scripts. This might involve remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don’t match a regex
  • If you have issues with data cleanliness, order them by estimated frequency and attack the most common problems first
  • Analyze the summary statistics for each column (standard deviation, mean, number of missing values, etc.), as in the sketch after this list
  • Keep track of every data cleaning operation, so you can alter or remove operations later if required
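
    A minimal sketch of the summary-statistics and cleanup steps using pandas; the DataFrame, column names and values below are hypothetical examples, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with typical quality problems
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 42, 31],
    "salary": [50000, 62000, 58000, np.nan, 62000],
    "city":   ["Chennai", "chennai", "Mumbai", "Delhi", "Mumbai"],
})

# Summary statistics per column: mean, standard deviation, counts
print(df.describe())
# Number of missing values per column
print(df.isna().sum())

# Simple cleansing utilities: normalise text values, drop exact duplicates
df["city"] = df["city"].str.strip().str.title()
df = df.drop_duplicates()
```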

    6. What is logistic regression?

    Logistic regression is a statistical method for examining a dataset in which one or more independent variables determine a categorical (typically binary) outcome.
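
    A minimal sketch of fitting a logistic regression with scikit-learn; the feature values and labels are made up purely for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) / fail (0)
X = [[1.0], [2.0], [3.0], [4.5], [6.0], [7.5]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of failing/passing for a new observation
print(model.predict_proba([[5.0]]))
```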

    7. List some of the best tools that can be useful for data analysis?

  • Tableau
  • RapidMiner
  • OpenRefine
  • KNIME
  • Google Search Operators
  • Solver
  • NodeXL
  • Import.io
  • Wolfram Alpha
  • Google Fusion Tables

    8. What is the difference between data mining and data profiling?

    The difference between data mining and data profiling is that
    Data profiling: It focuses on the analysis of individual attributes. It gives information on various attributes such as value range, discrete values and their frequency, occurrence of null values, data type, length, etc.
    Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relations held between several attributes, etc.
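
    A rough sketch of attribute-level profiling with pandas; the table, column names and values are hypothetical stand-ins for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for a real table
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age":         [25, 34, np.nan, 41],
    "segment":     ["A", "B", "A", None],
})

# Profile each attribute: data type, null values, distinct values, range/frequencies
for col in df.columns:
    print(col)
    print("  data type:      ", df[col].dtype)
    print("  null values:    ", df[col].isna().sum())
    print("  distinct values:", df[col].nunique())
    if pd.api.types.is_numeric_dtype(df[col]):
        print("  value range:    ", df[col].min(), "-", df[col].max())
    else:
        print("  frequencies:\n", df[col].value_counts())
```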

    9. List out some common problems faced by data analysts?

    Some of the common problems faced by data analysts are:
  • Common misspellings
  • Duplicate entries
  • Missing values
  • Illegal values
  • Varying value representations
  • Identifying overlapping data

    10. Mention the name of the framework developed by Apache for processing large data set for an application in a distributed computing environment?

    Hadoop, with its MapReduce programming model, is the framework developed by Apache for processing large data sets for an application in a distributed computing environment.
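
    A toy, in-memory sketch of the map and reduce phases of a word count. This only illustrates the programming model; a real Hadoop job distributes these steps across a cluster.

```python
from collections import defaultdict

documents = ["big data hadoop", "hadoop map reduce", "big data analytics"]

# Map phase: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'hadoop': 2, ...}
```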

    11. Mention what are the missing patterns that are generally observed?

    The missing patterns that are generally observed are
  • Missing completely at random
  • Missing at random
  • Missing that depends on the missing value itself
  • Missing that depends on unobserved input variable

    12. What is time series analysis?

    Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with the help of various methods such as exponential smoothing, log-linear regression, etc.
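
    A minimal sketch of simple exponential smoothing on a made-up series; the smoothing factor alpha is an arbitrary choice.

```python
def exponential_smoothing(series, alpha=0.3):
    """Each smoothed point blends the new observation with the previous smoothed value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [12, 15, 14, 18, 20, 19, 23]        # hypothetical monthly data
print(exponential_smoothing(sales))
# The last smoothed value can serve as a one-step-ahead forecast
```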

    13. Explain what is correlogram analysis?

    A correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It can also be used to construct a correlogram for distance-based data, when the raw data is expressed as distances rather than values at individual points.
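
    A rough sketch of estimating autocorrelation coefficients at increasing lags, which are the values a correlogram plots; the series here is hypothetical.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of a series at a given lag (lag >= 1)."""
    series = np.asarray(series, dtype=float)
    mean = series.mean()
    num = np.sum((series[:-lag] - mean) * (series[lag:] - mean))
    den = np.sum((series - mean) ** 2)
    return num / den

data = [2, 4, 6, 8, 7, 5, 3, 2, 4, 6, 8, 7]   # hypothetical observations
for lag in range(1, 5):
    print(lag, round(autocorrelation(data, lag), 3))
```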

    14. What is a hash table?

    In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.
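
    A tiny sketch of how a hash function maps a key to an array slot; the table size is arbitrary.

```python
NUM_SLOTS = 11                     # arbitrary table size

def slot_for(key):
    # hash() gives an integer; the modulo maps it into the slot array
    return hash(key) % NUM_SLOTS

slots = [None] * NUM_SLOTS
slots[slot_for("name")] = "Priya"  # store a value under a key
print(slots[slot_for("name")])     # fetch it back via the same index
```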

    15. What are hash table collisions? How is it avoided?

    A hash table collision happens when two different keys hash to the same value. Two items cannot be stored in the same slot of the array.
    There are many techniques to avoid hash table collisions; here we list two:
  • Separate chaining:
    It uses an auxiliary data structure (such as a linked list) per slot to store multiple items that hash to the same slot (see the sketch after this list).
  • Open addressing:
    It probes for other slots using a second function and stores the item in the first empty slot that is found.
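
    A minimal sketch of separate chaining, where each slot holds a list of (key, value) pairs so colliding keys can coexist.

```python
class ChainedHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]   # one chain per slot

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # colliding keys share the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alpha", 1)
table.put("beta", 2)
print(table.get("beta"))
```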

    16. What is imputation? List out different types of imputation techniques?

    During imputation we replace missing data with substituted values. The imputation techniques fall into two groups, single imputation and multiple imputation.

    Single imputation:
  • Hot-deck imputation: A missing value is imputed from a randomly selected similar record with the help of a punch card
  • Cold-deck imputation: It works the same way as hot-deck imputation, but it is more advanced and selects donors from another dataset
  • Mean imputation: It involves replacing a missing value with the mean of that variable for all other cases
  • Regression imputation: It involves replacing a missing value with the predicted value of a variable based on other variables
  • Stochastic regression imputation: It is the same as regression imputation, but it adds the average regression variance to the regression imputation

    Multiple imputation:
  • Unlike single imputation, multiple imputation estimates the values multiple times
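
    A small sketch of mean imputation and regression imputation using pandas and scikit-learn; the column names and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 9],
    "salary":     [30.0, np.nan, 50.0, 60.0, np.nan],
})

# Mean imputation: replace missing salaries with the column mean
mean_imputed = df["salary"].fillna(df["salary"].mean())

# Regression imputation: predict missing salaries from experience
known = df.dropna(subset=["salary"])
model = LinearRegression().fit(known[["experience"]], known["salary"])
missing = df["salary"].isna()
df.loc[missing, "salary"] = model.predict(df.loc[missing, ["experience"]])
print(df)
```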

    17. Explain what is n-gram?

    An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It is a type of probabilistic language model used for predicting the next item in such a sequence, in the form of an (n-1)-order Markov model.
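
    A short sketch that extracts word-level n-grams from a sentence.

```python
def ngrams(text, n):
    """Return all contiguous sequences of n words from the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "big data analytics interview questions"
print(ngrams(sentence, 2))   # bigrams
print(ngrams(sentence, 3))   # trigrams
```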

    18. Explain what is the criteria for a good data model?

    Criteria for a good data model include:
  • It can be easily consumed.
  • Large data changes in a good model should be scalable.
  • It should provide predictable performance.
  • A good model can adapt to changes in requirements.

    19. Which imputation method is more favourable?

    Although single imputation is widely used, it does not reflect the uncertainty created by data missing at random. So, multiple imputation is more favourable than single imputation when data are missing at random.

    20. Explain what is Clustering? What are the properties for clustering algorithms?

    Clustering is a classification method that is applied to data. A clustering algorithm divides a data set into natural groups or clusters (see the sketch after the list of properties below).

    Properties of clustering algorithms are:
  • Hierarchical or flat.
  • Iterative.
  • Hard and soft.
  • Disjunctive.
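
    A minimal sketch of flat, hard, iterative clustering with k-means in scikit-learn; the points and the number of clusters are arbitrary.

```python
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two rough groups
points = [[1, 1], [1.5, 2], [2, 1.5], [8, 8], [8.5, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster assignment for each point
print(kmeans.cluster_centers_)  # the learned cluster centres
```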