My CFA Journal: Quantitative Methods - Sampling and Estimation

Start - 11:00 am

Using Schweser notes again.

Sampling and Estimation

This should be a pretty straightforward reading.

Simple Random Sampling - selecting a sample so that each item in the population has the same likelihood of being included in the sample

Systematic Sampling - approximately random - select every nth member of a group
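A minimal sketch of the two selection methods, using only the Python standard library. The population of item IDs 1 to 100 and the function names are made up for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k items without replacement; every item is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_sample(population, n, start=0):
    """Take every n-th member of the list, beginning at index `start`."""
    return population[start::n]

# Hypothetical population: item IDs 1..100
population = list(range(1, 101))

srs = simple_random_sample(population, 10, seed=42)
every_tenth = systematic_sample(population, 10)
```

With these inputs, every_tenth holds items 1, 11, 21, ... 91, while srs is 10 distinct items chosen at random.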

Sampling Error - difference between sample (mean, variance, or std dev) and population (mean, var, or std dev). Ex. sampling error for mean = sample mean - population mean

The sample statistic itself is a random variable and thus has its own probability distribution. Sampling distribution is a probability distribution of a statistic from many samples. Sampling distribution is distinct from the population's actual parameters.
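To make the sampling error and the sampling distribution concrete, here is a small simulation sketch. The population (10,000 draws from a normal with mean 50 and sdev 10), the sample size of 25, and the number of repeated samples are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 values with mean near 50 and sdev near 10
population = [rng.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.fmean(population)

# One sample gives one sample mean, and therefore one sampling error
sample = rng.sample(population, 25)
sampling_error = statistics.fmean(sample) - pop_mean

# Many samples give many sample means: their distribution is the
# sampling distribution of the mean
sample_means = [
    statistics.fmean(rng.sample(population, 25)) for _ in range(2_000)
]
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
```

The sample means cluster around the population mean, but with a much smaller spread than the population itself.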

Simple Random Sample versus Stratified

Stratified - uses a classification system to divide the population into strata, and random samples are taken from each stratum. Size of the sample from each stratum is based on the size of the stratum relative to the population. Results are then pooled.

Stratum sample size = (stratum size / total pop. size) * total sample size
Guarantees that you will select a given number from a certain cell
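A sketch of stratified sampling where each stratum contributes in proportion to its share of the population. The three strata (think maturity buckets for bonds), their sizes, and the total sample size of 50 are hypothetical.

```python
import random

def stratified_sample(strata, total_sample_size, seed=None):
    """Sample from each stratum in proportion to its share of the population.

    `strata` maps a stratum name to the list of its members. Rounding means
    the pooled sample can differ slightly from `total_sample_size`.
    """
    rng = random.Random(seed)
    pop_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * total_sample_size / pop_size)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical strata: 500, 300, and 200 members out of 1,000
strata = {
    "short": list(range(0, 500)),
    "medium": list(range(500, 800)),
    "long": list(range(800, 1000)),
}

picked = stratified_sample(strata, 50, seed=1)
sizes = {name: len(items) for name, items in picked.items()}
```

Here the strata make up 50%, 30%, and 20% of the population, so a sample of 50 takes 25, 15, and 10 items from them.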

Time Series vs. Cross Sectional

Time series - observations taken over time at specific and equally spaced intervals
Cross sectional - observations taken at a single point in time
Longitudinal data - multiple characteristics of same entity over time
Panel data - same characteristic for multiple entities over time

Central Limit Theorem

For simple random samples of size n, from a pop with mean u and finite variance, distribution of xbar approaches a normal distribution with:

mean of true u and
variance = population variance/n

Means you can make inferences about the population mean from the sample mean, whatever the population's distribution, so long as n>=30

This is because the sampling distribution will be approximately normal
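A quick simulation sketch of the central limit theorem with a clearly nonnormal population. The exponential population with mean 2 (so variance 4), n = 40, and 5,000 repetitions are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(7)

n = 40    # sample size, above the n >= 30 rule of thumb
mu = 2.0  # exponential population with mean 2, so variance = mu**2 = 4

def draw_sample_mean():
    """Mean of one sample of size n from the (skewed) exponential population."""
    return statistics.fmean([rng.expovariate(1 / mu) for _ in range(n)])

means = [draw_sample_mean() for _ in range(5_000)]

observed_mean = statistics.fmean(means)    # should be close to mu
observed_var = statistics.variance(means)  # should be close to variance / n
theoretical_var = mu**2 / n                # 4 / 40 = 0.1
```

Even though the population is skewed, the sample means centre on mu and their variance is close to the population variance divided by n.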

Standard Error of Sample Mean

Standard error of sample mean = sdev of the distribution of sample means
If standard deviation of population is known:

std error = sdev(population) / sqrt(sample size)
But in practice the population sdev is rarely known, so you use sdev(sample) / sqrt(sample size)

Std error goes down as you increase sample size
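A small sketch of the standard error calculation. The return figures and the assumed population sdev of 0.03 are made up. The helper name standard_error is mine, not from the notes.

```python
import math
import statistics

def standard_error(sample, population_sd=None):
    """Standard error of the sample mean.

    Uses the population sdev when it is supplied; otherwise falls back to
    the sample sdev (n - 1 in the denominator).
    """
    n = len(sample)
    sd = population_sd if population_sd is not None else statistics.stdev(sample)
    return sd / math.sqrt(n)

# Hypothetical sample of 9 periodic returns
returns = [0.02, -0.01, 0.03, 0.01, 0.00, 0.04, -0.02, 0.01, 0.02]

se_known = standard_error(returns, population_sd=0.03)  # 0.03 / sqrt(9) = 0.01
se_estimated = standard_error(returns)                  # uses sample sdev

# Four times as many observations halves the standard error
se_bigger_sample = standard_error(returns * 4, population_sd=0.03)
```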

Point Estimates and Confidence Intervals

Point estimates - single (sample) values used to estimate population parameters. E.g. for mean, it is just average of the observed values.
Confidence interval - range of values within which the actual value of a parameter will lie, with probability 1 - alpha.

Alpha = level of significance, 1-alpha = degree of confidence
Confidence interval = point estimate +/- reliability factor * standard error

Reliability factor depends on desired confidence
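The general form of the interval can be sketched as a tiny function. The point estimate of 80, the reliability factor of 1.96, and the standard error of 2.5 are illustrative numbers only.

```python
def confidence_interval(point_estimate, reliability_factor, standard_error):
    """Return (lower, upper) = point estimate -/+ reliability factor * std error."""
    margin = reliability_factor * standard_error
    return point_estimate - margin, point_estimate + margin

alpha = 0.05                      # level of significance
degree_of_confidence = 1 - alpha  # 0.95

# Hypothetical inputs: sample mean 80, reliability factor 1.96, std error 2.5
low, high = confidence_interval(
    point_estimate=80.0, reliability_factor=1.96, standard_error=2.5
)
```

With these inputs the margin is 1.96 * 2.5 = 4.9, so the interval runs from about 75.1 to 84.9.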

Desirable properties of an estimator

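Unbiased - expected value of the estimator equals the parameter being estimated (e.g. expected value of sample mean equals population mean)
Efficient - an unbiased estimator is also efficient if the variance of its sampling distribution is smaller than that of all other unbiased estimators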
Consistent - accuracy of estimate increases as sample size increases

T-distribution (aka Student T-distribution)

Appropriate for small samples from populations with unknown variance, and normal or approximately normal distributions
Also appropriate when population variance is unknown, and sample is large enough that central limit theorem will assure the sampling distribution is approximately normal
Distribution is defined by a single parameter: degrees of freedom (n-1)

Fatter tails than normal distribution
As degrees of freedom increase, it approaches the normal distribution - a greater percentage of observations falls near the center of the distribution and the tails get thinner
Symmetrical and centered about 0
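To see the fatter tails numerically, this sketch compares the t density with the standard normal density using only the math module. The choice of 5 and 1,000 degrees of freedom and the evaluation point x = 3 are arbitrary.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with `df` degrees of freedom.

    lgamma is used instead of gamma so large df values do not overflow.
    """
    log_coeff = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coeff = math.exp(log_coeff) / math.sqrt(df * math.pi)
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal (z) distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

tail_t5 = t_pdf(3, 5)          # more mass far from the center
tail_z = normal_pdf(3)
center_t5 = t_pdf(0, 5)        # less mass at the center
center_z = normal_pdf(0)
tail_t1000 = t_pdf(3, 1000)    # nearly the same as the normal
```

At x = 3 the t density with 5 degrees of freedom is several times the normal density, while with 1,000 degrees of freedom the two are nearly identical.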

Standard normal distribution is called the z distribution
Harder to reject null hypotheses with a t distribution than with a z distribution

Confidence intervals in various situations

Normal distribution and known variance: xbar +/- z * std error

Remember std error = sdev(population)/sqrt(n)
For 90% confidence z = 1.645
For 95% confidence z = 1.960
For 99% confidence z = 2.575
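The z reliability factors can be recovered from the standard normal distribution with Python's statistics.NormalDist. This is a sketch and the function name is my own. Note that the exact 99% value is about 2.576, which some tables round to 2.575.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """Two-tailed z value for a given degree of confidence (e.g. 0.95)."""
    alpha = 1 - confidence
    # alpha is split evenly between the two tails
    return NormalDist().inv_cdf(1 - alpha / 2)

factors = {c: round(z_reliability_factor(c), 3) for c in (0.90, 0.95, 0.99)}
```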

Interpretation of confidence interval:

Probabilistic - 99% of the confidence intervals will in the long run include the pop mean
Practical - we are 99% confident the population mean is within the CI

Normal distribution with Unknown variance: xbar +/- t * std error

t score must correspond with the degrees of freedom (n-1)
CIs built with t scores will be wider than CIs built with z scores, since t reliability factors are larger
First step is always to find degrees of freedom, then look up the reliability factor (t score)
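A sketch of a 95% t-based interval. The critical values are copied from a standard two-tailed t-table and cover only a few degrees of freedom, so this is an illustration rather than a general tool. The sample of five observations is made up.

```python
import math
import statistics

# Two-tailed 95% t critical values from a standard t-table, keyed by
# degrees of freedom. Deliberately small: only these df are supported.
T_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 24: 2.064, 29: 2.045}

def t_confidence_interval_95(sample):
    """95% CI for the mean when the population variance is unknown."""
    n = len(sample)
    df = n - 1                  # step 1: degrees of freedom
    t_score = T_95[df]          # step 2: look up the reliability factor
    xbar = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return xbar - t_score * std_error, xbar + t_score * std_error

# Hypothetical sample: n = 5, so df = 4 and t = 2.776
sample = [4.0, 6.0, 5.0, 7.0, 3.0]
low, high = t_confidence_interval_95(sample)
```

For this sample the mean is 5 and the standard error is about 0.707, so the interval is roughly 3.04 to 6.96. Using z = 1.96 instead would give a narrower interval.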

Sample is large, from ANY (inc. nonnormal) distribution

Variance known - use z statistic if n>=30
Variance unknown - use t statistic if n>=30 (z is also acceptable, but t is more conservative)
CANNOT create a CI if the population is nonnormal and the sample is less than 30
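The choice of statistic across these situations can be summarised as a small decision function. This is my own sketch of the rules in these notes, not an official table. The function and argument names are made up.

```python
def choose_test_statistic(normal_population, variance_known, n):
    """Return "z", "t", or None (no valid CI) for the given situation."""
    if n < 30 and not normal_population:
        # Small sample from a nonnormal population: no reliability factor applies
        return None
    if variance_known:
        return "z"
    # Variance unknown: t for small normal samples, and also for large
    # samples (z is acceptable there, t is the more conservative choice)
    return "t"

small_normal_known = choose_test_statistic(True, True, 10)
small_normal_unknown = choose_test_statistic(True, False, 10)
large_nonnormal_known = choose_test_statistic(False, True, 50)
large_nonnormal_unknown = choose_test_statistic(False, False, 50)
small_nonnormal = choose_test_statistic(False, False, 10)
```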

Issues with sample selection

Larger sample generally better but not always

Exception - when the larger sample includes observations from a different population
Other exception - cost

Data mining bias - looking through same database until you find something that 'works'

Could just be looking for a chance pattern
Look for evidence that many variables were tested but not reported
Lack of economic theory consistent with results is another red flag
Best way to avoid is to test the rule on a different data set than the one you found it from
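A simulation sketch of why out-of-sample testing helps. Everything here is pure noise by construction: 200 hypothetical "signals" are tried against 60 periods of random returns, the best-looking one is kept, and then it is checked against a fresh data set.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(3)
n_obs, n_signals = 60, 200

def noise():
    return [rng.gauss(0, 1) for _ in range(n_obs)]

returns_in, returns_out = noise(), noise()

# Each hypothetical signal has an in-sample and an out-of-sample series,
# none of which has any real relationship with returns
signals = [(noise(), noise()) for _ in range(n_signals)]

# Data mining: keep whichever signal "works" best on the original data
best = max(signals, key=lambda s: abs(pearson(s[0], returns_in)))
in_sample_corr = pearson(best[0], returns_in)
out_of_sample_corr = pearson(best[1], returns_out)
```

The in-sample correlation of the winning signal looks meaningful only because it was the best of 200 tries. The out-of-sample figure is just one more random draw, so it will typically sit much closer to zero.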

Sample Selection bias - some data is systematically excluded, usually because it is not available

Makes sample nonrandom
Survivorship bias - most common form. Only includes those funds or firms that survived.
Would not be a problem if surviving and nonsurviving fund characteristics were the same - but this is not the case

Returns are overestimated as a result
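A toy simulation of the overstatement. The 1,000 funds, their return distribution (mean 5%, sdev 15%), and the rule that funds returning -10% or worse shut down are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(11)

# Hypothetical one-period returns for 1,000 funds
fund_returns = [rng.gauss(0.05, 0.15) for _ in range(1_000)]

# Assumed rule: funds returning -10% or worse close and drop out of the database
survivors = [r for r in fund_returns if r > -0.10]

all_mean = statistics.fmean(fund_returns)    # what investors actually earned
survivor_mean = statistics.fmean(survivors)  # what the database shows
overstatement = survivor_mean - all_mean
```

Because only the worst performers are removed, the survivor average is always above the average for all funds.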

Lookahead bias - study tests a relationship using sample data that was not available at the test date. Can alleviate this bias by using an estimate rather than the actual data.
Time period bias - time period is either too short or too long.

Too short - result may be specific to that period's phenomena
Too long - things may have changed in the environment so no longer valid

End of reading

12:15 pm

1.25 hours


Saturday, September 15, 2012
