Explain data exploration. Data exploration means examining an application's data to understand its structure and content before deeper analysis. The data analyst uses this first pass to identify the business problem in the data and, working from the client's requirements, to find ways to improve the process.

Explain the data preparation process. In data preparation, the analyst receives raw data from a source system or from the client and processes it to identify missing values and variables, so that the data is in a form suitable for modeling.
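
As a minimal sketch of the "find missing values" step, the snippet below counts missing entries per variable. The records, the column names, and the use of `None` as the missing-value marker are all assumptions made for illustration.

```python
# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 29, "income": None, "city": None},
]

def count_missing(rows):
    """Return {column: number of rows where the value is None}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # (value is None) is a bool, which adds as 0 or 1.
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

missing = count_missing(records)
print(missing)
```

A report like this tells the analyst which variables need imputation or removal before modeling starts.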

- Explain data modeling. After data preparation is complete, the model is run repeatedly on the prepared data and refined until it gives the best possible result for the application.

- Explain the validation process. Once the analyst has built a trial model from the trial data, it is validated against the data model provided by the client to ensure that the final model will meet the business requirements.

- Important steps in the data validation process. There are two steps in the data validation process: data screening and data verification. Data screening applies algorithms to the entire dataset to find faults and errors. Data verification then evaluates each flagged value case by case and decides whether to keep it, reject it, or replace it.
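
The two steps can be sketched as a pair of functions. The range check used for screening and the rule used for verification (replace missing values with the median, reject out-of-range values) are assumptions chosen for illustration, not a fixed standard.

```python
from statistics import median

def screen(values, low, high):
    """Screening: return the indexes of values that are missing or out of range."""
    return [i for i, v in enumerate(values)
            if v is None or not (low <= v <= high)]

def verify(values, flagged):
    """Verification: replace missing values, reject (drop) out-of-range ones."""
    flagged = set(flagged)
    valid = [v for i, v in enumerate(values) if i not in flagged]
    fill = median(valid)
    cleaned = []
    for i, v in enumerate(values):
        if i not in flagged:
            cleaned.append(v)      # keep values that passed screening
        elif v is None:
            cleaned.append(fill)   # replace a missing value
        # any other flagged value is rejected, so it is not appended
    return cleaned

ages = [25, 31, None, 240, 47, -3]   # hypothetical ages with faults
flagged = screen(ages, low=0, high=120)
cleaned = verify(ages, flagged)
print(flagged, cleaned)
```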

- Tools that are used for data analysis. Tableau, RapidMiner, NodeXL, Google Fusion Tables, and Google search operators.

- What is a normal distribution? Raw data is often skewed to the left or to the right, or scattered with no clear pattern. In a normal distribution the values are instead spread symmetrically around a central value, so a plot of the data forms a bell-shaped curve: most of the observations cluster near the center, and fewer and fewer appear toward either extreme.

- Mention some properties of the normal distribution. Unimodal; symmetrical; bell-shaped; mean, mode, and median all located at the center; asymptotic (the tails approach the horizontal axis but never touch it).
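
These properties can be checked numerically with `NormalDist` from Python's standard `statistics` module. The mean of 100 and standard deviation of 15 are arbitrary values picked for the sketch.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)  # hypothetical parameters

# Symmetry: the density is the same at equal distances from the center.
left_density = dist.pdf(85)
right_density = dist.pdf(115)

# Mean, median, and mode coincide at the center.
center = (dist.mean, dist.median, dist.mode)

# Share of values within one standard deviation of the mean.
within_one_sd = dist.cdf(115) - dist.cdf(85)

# Asymptotic: the density far from the center is tiny but still positive.
far_tail_density = dist.pdf(100 + 6 * 15)

print(left_density, right_density, center, within_one_sd, far_tail_density)
```

About 68% of the values fall within one standard deviation of the mean, which is the "concentrated in the center" behavior described above.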

- What is linear regression? Linear regression fits a straight line to the relationship between two variables. When a plot of the trial data shows most points lying close to a straight line, with only a few scattered farther away, the linear regression algorithm finds the line that minimizes the error between the predicted and observed values.
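
A minimal sketch of fitting such a line with ordinary least squares is shown below. The function name `fit_line` and the sample points are assumptions for illustration; the data is constructed to lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical points on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```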

- What is logistic regression? Logistic regression models the probability that an observation belongs to a class. Its output is bounded between zero and one because it passes through a sigmoid function. It is used for classification problems, where each observation has to be assigned to a category.
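
The sketch below trains a one-variable logistic regression with plain gradient descent. The data (hours studied versus pass/fail), the learning rate, and the number of epochs are all assumptions for illustration; a real project would normally use a library implementation.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def train(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]   # hypothetical study hours
passed = [0, 0, 0, 0, 1, 1, 1, 1]                  # 1 = passed, 0 = failed

w, b = train(hours, passed)
p_low = sigmoid(w * 1.0 + b)    # predicted pass probability after 1 hour
p_high = sigmoid(w * 4.0 + b)   # predicted pass probability after 4 hours
print(p_low, p_high)
```

An observation is assigned to the positive class when its predicted probability exceeds a threshold, commonly 0.5.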

- What is time-series data in time series analysis? It is a set of observations of the values that a variable takes at different points in time.

- What is cross-sectional data in time series analysis? It is data on one or more variables collected at the same point in time.

- What is pooled data in time series analysis? It is a data set that combines time-series data and cross-sectional data.
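
The three kinds of data can be told apart with a small example. The regions, years, and sales figures below are made up for illustration.

```python
# Hypothetical pooled data: (region, year, sales) observations.
pooled = [
    ("north", 2021, 120), ("south", 2021, 95),
    ("north", 2022, 135), ("south", 2022, 101),
    ("north", 2023, 150), ("south", 2023, 110),
]

# Time series: one variable for one unit, followed over time.
north_series = [(year, sales) for region, year, sales in pooled
                if region == "north"]

# Cross section: every unit observed at a single point in time.
cross_2022 = {region: sales for region, year, sales in pooled
              if year == 2022}

print(north_series)
print(cross_2022)
```

The full `pooled` list is the pooled data set: it holds several units (a cross section) observed over several years (a time series).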

What are different types of hypothesis testing? T-test, chi-square test for independence, analysis of variance (ANOVA), and Welch's t-test.

Explain the chi-square test. The chi-square test is used to determine whether the relationship between categorical variables observed in a sample is statistically significant.
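
As a rough sketch, the chi-square statistic for a contingency table compares each observed count with the count expected if the two variables were independent. The function name and the 2x2 table below are assumptions for illustration.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: rows = two customer groups, columns = bought / did not buy.
observed = [[30, 10],
            [20, 40]]
statistic, dof = chi_square_statistic(observed)
print(statistic, dof)
```

With one degree of freedom the critical value at the 5% level is about 3.84, so a statistic well above that suggests the two variables are related.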

Explain ANOVA. ANOVA, or analysis of variance, is a kind of hypothesis test used to analyze the differences between the means of several groups in a data set.
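
A one-way ANOVA boils down to an F statistic: the variance between group means divided by the variance within groups. The sketch below computes it by hand; the function name and the three small groups are illustrative assumptions.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the values around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]  # hypothetical measurements
f_statistic = one_way_anova_f(groups)
print(f_statistic)
```

A large F value means the group means differ by more than the spread inside the groups would explain.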

What is imputation? It is the process of filling in missing data with substituted values.

- What are the problems caused by missing data? Missing data causes three main problems: (a) it can introduce substantial bias when a data set is tested; (b) the goal of the data analysis becomes harder to achieve; (c) the efficiency of any algorithm is reduced because specific data is unavailable.

- What are different types of imputation techniques? Hot-deck imputation, cold-deck imputation, mean imputation, regression imputation, and stochastic imputation.

- What is mean imputation? It replaces each missing value of a variable with the mean of that variable's observed values.
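
A minimal sketch of mean imputation, assuming missing values are marked with `None`; the function name and the sample values are illustrative.

```python
def mean_impute(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

scores = [10, None, 14, None, 12]  # hypothetical variable with two gaps
imputed = mean_impute(scores)
print(imputed)
```

Because every gap receives the same value, mean imputation keeps the mean of the variable unchanged but shrinks its variance.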

- What is stochastic imputation? It is similar to regression imputation, but it adds a random residual term to each value predicted by the regression, which preserves the natural variability of the data.
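
The idea can be sketched as follows: fit a regression on the complete cases, predict each missing value, then add random noise drawn from the spread of the residuals. The function name, the sample data, the normal noise model, and the fixed seed are assumptions made for this illustration.

```python
import random
from math import sqrt

def stochastic_impute(xs, ys, rng):
    """Fill None entries in ys with a regression prediction plus random noise."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x

    # Residual standard deviation of the fitted line on the complete cases.
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    sigma = sqrt(sum(r * r for r in residuals) / (n - 2))

    return [y if y is not None
            else intercept + slope * x + rng.gauss(0, sigma)
            for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, 9.8, None]   # hypothetical data with two gaps
filled = stochastic_impute(xs, ys, random.Random(42))
print(filled)
```

Dropping the `rng.gauss` term turns this back into plain regression imputation, where every imputed value sits exactly on the fitted line.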

- What is an outlier? A value that lies far away from the concentrated pattern of the other values in a data set.

- What are the different types of outliers? Univariate (extreme in a single variable) and multivariate (unusual as a combination of values across several variables).
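
For the univariate case, one common way to flag outliers is the interquartile range (IQR) rule, sketched below. The 1.5 multiplier is the conventional choice, and the sample readings are made up; this is one detection method among several.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

readings = [12, 13, 12, 14, 13, 15, 14, 13, 98]  # hypothetical readings
outliers = iqr_outliers(readings)
print(outliers)
```

Multivariate outliers need a method that looks at several variables together, so a per-column check like this one can miss them.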

- What is the K-means algorithm? K-means is a well-known partitioning algorithm. Each data point is assigned to one of K groups, namely the cluster whose center (the mean of its points) is nearest, so the data points end up grouped around the cluster centers.
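
A minimal sketch of the assign-then-update loop is shown below. The starting centroids are passed in explicitly so the result is reproducible; real implementations usually pick them at random. The function name and the sample points are assumptions.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Return (labels, centroids) after alternating assignment and update steps."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]   # hypothetical 2-D data
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels, centroids)
```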

- What is MapReduce? It is a framework that splits a data set into subsets, processes each subset on a different node or server, and then combines the partial results.
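
The flow can be imitated on a single machine with a word count, the usual introductory example. This sketch only mirrors the map, shuffle, and reduce phases in plain Python; it does not use Hadoop or any distributed framework, and the text chunks stand in for subsets that would live on separate nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one subset of the data."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all subsets."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical subsets, each of which would be handled by a different node.
chunks = ["big data big insight", "data drives insight", "big wins"]
word_counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(word_counts)
```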

