# Category Archives: Data Science

# Python Code for TTEST and ANOVA

- Posted by lhmay
- on May, 01, 2018
- in Data Science

TTEST: What is a t-score? The t-score is the ratio of the difference between two groups to the difference within the groups. Types of t-tests: There are three main types of t-test: 1. An independent-samples t-test compares the means of two groups. 2. A paired-samples t-test compares means from the same group at different times (say, one year apart). 3. A one-sample t-test tests the mean of […]
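As a rough illustration of the "difference between groups over difference within groups" idea, here is a minimal sketch using only the Python standard library. The function names and sample numbers are hypothetical, and the independent-samples version assumes the Welch form of the standard error (no equal-variance assumption); the post itself may use a different formulation or library.

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(a, b):
    """t-score for two independent groups (Welch form, an assumption here)."""
    # Numerator: difference between the group means.
    # Denominator: standard error built from the spread within each group.
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def paired_t(first, second):
    """t-score for the same subjects measured at two different times."""
    diffs = [later - earlier for earlier, later in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical measurements, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(independent_t(group_a, group_b))
print(paired_t([1, 2, 3], [2, 4, 6]))
```

A larger absolute t-score means the gap between the means is large relative to the noise within the groups.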

# Variance Reduction Techniques to Improve Power of the Test

- Posted by lhmay
- on Apr, 30, 2018
- in Data Science

Improving the Power of Our Experiment: Now that we understand the system we are dealing with, we can ask the question: how can we make our experiments sensitive enough to detect smaller effects? We are left with a few options: increase the effect size, increase the sample size, or decrease the variance. Increasing the effect size may […]
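To make the trade-off between sample size and variance concrete, the sketch below computes the smallest effect a two-group test could detect. It assumes a two-sided z-test approximation with equal group sizes and a common standard deviation; the function name and default values are hypothetical and not taken from the post.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sigma, n_per_group, alpha=0.05, power=0.8):
    """Smallest true difference in means detectable at the given power.

    Assumes a two-sided z-test, equal group sizes, and a shared
    standard deviation `sigma` (all simplifying assumptions).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_group)


baseline = minimum_detectable_effect(sigma=1.0, n_per_group=1000)
more_samples = minimum_detectable_effect(sigma=1.0, n_per_group=4000)
less_variance = minimum_detectable_effect(sigma=0.5, n_per_group=1000)
print(baseline, more_samples, less_variance)
```

Under these assumptions, halving the standard deviation buys the same sensitivity as quadrupling the sample size, which is why variance reduction is attractive when more traffic is not available.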

# Central Limit Theorem, Violations & Remedy

- Posted by lhmay
- on Apr, 30, 2018
- in Data Science

Normal Distribution: About 68% of values drawn from a normal distribution are within one standard deviation σ of the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule. What is the Central Limit Theorem (CLT)? In […]
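The 68-95-99.7 figures can be checked directly from the normal cumulative distribution function. This is a small sketch using the standard library's `statistics.NormalDist`; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist


def share_within(k_sigma):
    """Probability that a normal value lies within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k_sigma) - standard_normal.cdf(-k_sigma)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```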

# Multi-Armed Bandits and Contextual-Bandit

- Posted by lhmay
- on Apr, 29, 2018
- in Data Science

Multi-armed bandits use machine learning algorithms to minimize opportunity cost and regret. They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a “final answer” at the end of an experiment. They’re faster because samples that would have gone to obviously inferior variations can be assigned to […]
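One simple way to see traffic drift towards the winning variation is an epsilon-greedy simulation. The excerpt does not say which bandit algorithm the post uses, so epsilon-greedy is an assumption chosen for brevity; the conversion rates, parameter values, and function name are hypothetical.

```python
import random


def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit and return how often each arm was shown."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)

    def observed_rate(arm):
        # Arms that have never been shown are treated as having rate 0.
        return wins[arm] / pulls[arm] if pulls[arm] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))  # explore: pick any arm
        else:
            arm = max(range(len(true_rates)), key=observed_rate)  # exploit
        pulls[arm] += 1
        wins[arm] += int(rng.random() < true_rates[arm])  # simulated conversion
    return pulls


# Hypothetical conversion rates for two variations.
print(epsilon_greedy([0.1, 0.6]))
```

Most rounds go to the arm with the best observed rate, while a small share (`epsilon`) keeps testing the others in case the early estimates were wrong.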

# Multivariate Tests (Orthogonal Design)

- Posted by lhmay
- on Apr, 29, 2018
- in Data Science

Because resources are limited, it is very important to get the most information from each experiment you do. Well-designed experiments can produce significantly more information and often require fewer runs than haphazard or unplanned experiments. Also, a well-designed experiment will ensure that you can evaluate the effects that you have identified as important. As a […]

# Using Google’s Convolutional Neural Networks (CNN) for Image Recognition

- Posted by lhmay
- on Apr, 26, 2018
- in Data Science

Convolutional neural networks are the state-of-the-art technique for image recognition, that is, identifying objects such as people or cars in pictures. We call this a “deep neural network” because it has more layers than a traditional neural network. How Convolution Works: Instead of feeding entire images into our neural network as one […]

# Sequential Probability Ratio Test

- Posted by lhmay
- on Apr, 25, 2018
- in Data Science

What is a Sequential Probability Ratio Test? A sequential probability ratio test (SPRT) is a hypothesis test for sequential samples. Sequential sampling works in a very non-traditional way; instead of a fixed sample size, you choose one item (or a few) at a time, and then test your hypothesis. You can either: Reject the null hypothesis (H0) in favor of […]
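The one-at-a-time decision loop can be sketched for yes/no (Bernoulli) observations as follows. This assumes Wald's classic approximate stopping thresholds; the function name, return labels, and example rates are hypothetical and may differ from what the full post presents.

```python
from math import log


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Sequential test of H0: rate = p0 against H1: rate = p1 (with p1 > p0).

    Returns a (decision, number_of_observations_used) pair.
    """
    upper = log((1 - beta) / alpha)  # crossing this rejects H0
    lower = log(beta / (1 - alpha))  # crossing this accepts H0
    log_likelihood_ratio = 0.0
    for n, success in enumerate(observations, start=1):
        if success:
            log_likelihood_ratio += log(p1 / p0)
        else:
            log_likelihood_ratio += log((1 - p1) / (1 - p0))
        if log_likelihood_ratio >= upper:
            return "reject H0", n
        if log_likelihood_ratio <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)


# Hypothetical streams: all successes, then all failures.
print(sprt_bernoulli([1] * 10, p0=0.1, p1=0.3))
print(sprt_bernoulli([0] * 10, p0=0.1, p1=0.3))
```

The test stops as soon as the evidence crosses either threshold, so clear-cut data needs far fewer observations than a fixed-sample design would.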

# Sampling and Sampling Bias

- Posted by lhmay
- on Apr, 22, 2018
- in Data Science

In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample, a non-random sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this […]

# Dealing with Imbalanced Data

- Posted by lhmay
- on Apr, 21, 2018
- in Data Science

What is Imbalanced Data? Imbalanced data typically refers to classification problems where the classes are not represented equally. Imbalanced data example: the red points are greatly outnumbered by the blue. In reality, datasets can get far more imbalanced than this. Here are some examples: About 2% of credit card accounts are […]

# How to Handle Missing Value?

- Posted by lhmay
- on Apr, 21, 2018
- in Data Science

Imputation vs Removing Data: Before jumping to the methods of data imputation, we have to understand the reason why data goes missing. Missing at Random (MAR): Missing at random means that the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data […]
