Using Schweser notes again.
Sampling and Estimation
This should be a pretty straightforward reading.
Simple Random Sampling - selecting a sample so that each item in the population has the same likelihood of being included in the sample
Systematic Sampling - approximately random - select every nth member of a group
Sampling Error - difference between sample (mean, variance, or std dev) and population (mean, var, or std dev). Ex. sampling error for mean = sample mean - population mean
The sample statistic itself is a random variable and thus has its own probability distribution. Sampling distribution is a probability distribution of a statistic from many samples. Sampling distribution is distinct from the population's actual parameters.
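To make the definitions above concrete, here is a minimal sketch using only the Python standard library. The population is made-up data, and the function names (`simple_random_sample`, `systematic_sample`) are my own labels, not anything from the notes.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up annual returns (in percent).
population = [random.gauss(8.0, 15.0) for _ in range(1000)]

def simple_random_sample(items, n):
    """Every item has the same chance of being picked."""
    return random.sample(items, n)

def systematic_sample(items, n):
    """Pick every k-th item after a random start, where k = len(items) // n."""
    k = len(items) // n
    start = random.randrange(k)
    return items[start::k][:n]

srs = simple_random_sample(population, 50)
sys_sample = systematic_sample(population, 50)

# Sampling error for the mean = sample mean - population mean.
srs_error = mean(srs) - mean(population)
sys_error = mean(sys_sample) - mean(population)
print(f"simple random sampling error: {srs_error:.2f}")
print(f"systematic sampling error:    {sys_error:.2f}")
```

Rerunning with a different seed gives different sampling errors, which is the point of the last definition: the sample statistic is itself a random variable.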
Simple Random Sample versus Stratified
- Stratified - uses a classification system to divide the population into strata, and a random sample is taken from each stratum. The size of the sample from each stratum is based on the size of the stratum relative to the population. Results are then pooled.
- Stratum sample size = (stratum size / total pop. size) * total sample size
- Guarantees that you will select a given number from each stratum (cell)
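The proportional allocation described above can be sketched as follows. The function name `proportional_allocation` and the maturity buckets are hypothetical, chosen only for illustration.

```python
def proportional_allocation(stratum_sizes, total_sample_size):
    """Split a sample across strata in proportion to each stratum's size."""
    population_size = sum(stratum_sizes.values())
    return {
        name: round(total_sample_size * size / population_size)
        for name, size in stratum_sizes.items()
    }

# Hypothetical bond universe classified by maturity bucket.
strata = {"short": 500, "medium": 300, "long": 200}
allocation = proportional_allocation(strata, 50)
print(allocation)  # {'short': 25, 'medium': 15, 'long': 10}
```

With less convenient numbers the rounded counts may not add up exactly to the total sample size, so a real allocation would need a rule for handing out the remainder.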
Time Series vs. Cross Sectional
- Time series - observations taken over time at specific and equally spaced intervals
- Cross sectional - observations taken at a single point in time
- Longitudinal data - multiple characteristics of same entity over time
- Panel data - same characteristic for multiple entities over time
Central Limit Theorem
- For simple random samples of size n, from a pop with mean u and finite variance, the distribution of xbar approaches a normal distribution as n gets large, with:
- mean of true u and
- variance = population variance/n
- Means you can make inferences about the population mean so long as n>=30, whatever the shape of the population's distribution
- This is because the sampling distribution will be approximately normal
- Standard error of sample mean = sdev of the distribution of sample means
- If standard deviation of population is known:
- std error = sdev(population) / sqrt(sample size)
- But in practice the population sdev is usually unknown, so you just use sdev(sample) / sqrt(sample size)
- Std error goes down as you increase sample size
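A quick simulation can check the standard error formula above. This is only a sketch under my own assumptions: the population is made-up exponential data (deliberately non-normal), n = 36, and 2,000 repeated samples.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately non-normal population: exponential with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]
sigma = pstdev(population)

n = 36
theoretical_se = sigma / sqrt(n)  # sdev(population) / sqrt(sample size)

# Draw many samples of size n and record each sample mean.
sample_means = [mean(random.sample(population, n)) for _ in range(2000)]
empirical_se = pstdev(sample_means)  # sdev of the distribution of sample means

print(f"sdev(population) / sqrt(n): {theoretical_se:.4f}")
print(f"sdev of sample means:       {empirical_se:.4f}")
```

The two printed numbers should land close to each other, and the average of the sample means should sit near the population mean, even though the population itself is skewed.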
Point Estimates and Confidence Intervals
- Point estimates - single (sample) values used to estimate population parameters. E.g. for mean, it is just average of the observed values.
- Confidence interval - range of values within which the actual value of a parameter will lie, with probability 1 - alpha.
- Alpha = level of significance, 1-alpha = degree of confidence
- Confidence interval = point estimate +/- reliability factor * standard error
- Reliability factor depends on desired confidence
Desirable properties of an estimator
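- Unbiased - expected value of the estimator equals the parameter it estimates (e.g. expected value of sample mean equals population mean)
- Efficient - an unbiased estimator is also efficient if the variance of its sampling distribution is smaller than that of all other unbiased estimators
- Consistent - accuracy of estimate increases (standard error falls) as sample size increases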
T-distribution (aka Student T-distribution)
- Appropriate for small samples from populations with unknown variance, and normal or approximately normal distributions
- Also appropriate when population variance is unknown, and sample is large enough that central limit theorem will assure the sampling distribution is approximately normal
- Distribution is defined by a single parameter: degrees of freedom (n-1)
- Fatter tails than normal distribution
- As degrees of freedom increases, it approaches normal - greater percentage of observations near the center of the distribution
- Symmetrical and centered about 0
- Standard normal distribution is called the z distribution
- Harder to reject null hypotheses with a t distribution than with a z distribution
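To see the fatter tails in numbers, the sketch below compares a few two-tailed 5% t critical values with the z value. The t values are hard-coded from a standard t table (the standard library has no t distribution), and the chosen degrees of freedom are just examples.

```python
from statistics import NormalDist

# Two-tailed 5% critical values copied from a standard t table (df -> t).
t_critical_95 = {5: 2.571, 10: 2.228, 30: 2.042, 120: 1.980}

# z critical value for the same 5% two-tailed test, about 1.960.
z_critical_95 = NormalDist().inv_cdf(0.975)

for df, t in t_critical_95.items():
    print(f"df={df:>3}: t={t:.3f}, excess over z={t - z_critical_95:.3f}")
```

Every t value is above the z value, which is why it is harder to reject the null with t, and the gap shrinks as degrees of freedom increase.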
Confidence intervals in various situations
- Normal distribution and known variance: xbar +/- z * std error
- Remember std error = sdev(population)/sqrt(n)
- For 90% confidence z = 1.645
- For 95% confidence z = 1.960
- For 99% confidence z = 2.575
- Interpretation of confidence interval:
- Probabilistic - 99% of the confidence intervals will in the long run include the pop mean
- Practical - we are 99% confident the population mean is within the CI
- Normal distribution with Unknown variance: xbar +/- t * std error
- t score must correspond with the degrees of freedom (n-1)
- CIs built with a t score will be wider than CIs built with a z score, all else equal
- First step is always to find degrees of freedom, then find the reliability factor (t score)
- Sample is large, from ANY (inc. nonnormal) distribution
- Variance known - use z statistic if n>=30
- Variance unknown - use t statistic if n>=30 (z is also acceptable, but t is more conservative)
- CANNOT create a CI if the distribution is nonnormal and the sample is less than 30
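The known-variance case above (xbar +/- z * std error) can be sketched with the standard library. The function name `z_confidence_interval` and the inputs (sample mean 80, population sdev 15, n = 36) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence):
    """Two-sided CI for the mean when the population sdev is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # reliability factor
    std_error = sigma / sqrt(n)
    return xbar - z * std_error, xbar + z * std_error

# Hypothetical inputs: sample mean 80, population sdev 15, n = 36.
for conf in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(80, 15, 36, conf)
    print(f"{conf:.0%} CI: {low:.2f} to {high:.2f}")
```

The computed 99% reliability factor comes out at about 2.576, which the notes round to 2.575. For the unknown-variance case you would swap z for the t score at n-1 degrees of freedom and use sdev(sample) in the standard error.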
Issues with sample selection
- Larger sample generally better but not always
- Exception - when the larger sample pulls in observations from outside the population you are studying
- Other exception - cost
- Data mining bias - looking through same database until you find something that 'works'
- Could just be looking for a chance pattern
- Look for evidence that many variables were tested but not reported
- Lack of economic theory consistent with results is another red flag
- Best way to avoid is to test the rule on a different data set than the one it was found in (out-of-sample test)
- Sample Selection bias - some data is systematically excluded, usually because it is not available
- Makes sample nonrandom
- Survivorship bias - most common form of sample selection bias. Sample only includes those funds (or firms) that survived.
- Would not be a problem if surviving and nonsurviving fund characteristics were the same - but this is not the case
- Returns are overestimated as a result
- Lookahead bias - study tests a relationship using sample data that was not available at the test date. Can alleviate this bias by using an estimate rather than the actual data.
- Time period bias - time period is either too short or too long.
- Too short - result may be specific to that period's phenomena
- Too long - things may have changed in the environment so no longer valid
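As a rough illustration of the survivorship bias point above, the sketch below drops poor performers from a made-up fund universe and compares average returns. The return distribution and the -5% closure cutoff are assumptions for illustration only, not real data.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical universe: 1,000 made-up fund returns (in percent).
all_funds = [random.gauss(5.0, 10.0) for _ in range(1000)]

# Assumption: funds returning below -5% close and drop out of the database.
survivors = [r for r in all_funds if r >= -5.0]

print(f"mean return, all funds:      {mean(all_funds):.2f}")
print(f"mean return, survivors only: {mean(survivors):.2f}")
```

The survivors-only average is higher than the full-universe average, which is the overestimate the notes describe.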
End of reading
12:15 pm
1.25 hours