Quant Methods Reading 8 = statistics headache
hi all, i'm not a finance grad, having a hard time with all the various statistics terms in the chapter. i'm not sure how thoroughly they can be tested given the LOS seems quite detailed.
For example, in the multiple regressions sub section, these terms alone are causing confusion:
- r2 and adjusted r2
- heteroskedasticity
- multicollinearity etc
i'm starting to panic given that i know i don't fully grasp this tiny section alone, not to mention more from other chapters in the syllabus. any help and advice? rather anxious and feeling worse than level 1 days...
Comments
I'll attempt to explain/clarify some of the terms the best I can:
R2 vs adjusted R2
R2 (R-squared) is a statistical measure of how close the data are to the fitted regression line: it is the proportion of the variation in the dependent variable that is explained by the independent variable(s), on a 0 - 100% scale. All else equal, the higher that number, the better your model fits the data.
- Note: the general form of a simple linear regression is Y = a + bX + ε
- Note: the general form of a multiple regression model is Y = a + bX1 + cX2 + dX3 + eX4 + .... + ε, where X1, X2, X3, ... are different independent variables
However, there is a flaw in R2: it never decreases when you add another independent variable to the model, regardless of that variable's quality. That makes it less reliable as a measure of goodness of fit in a multiple regression model than in a simple linear regression model with one independent variable.
Adjusted R2 (adjusted R-squared) is a modified version of R2 that takes into account the number of independent variables relative to the number of observations: adjusted R2 = 1 - [(n - 1) / (n - k - 1)] × (1 - R2), where n is the number of observations and k is the number of independent variables. Adjusted R2 penalises the addition of useless variables: a new variable that doesn't add enough explanatory power makes adjusted R2 fall.
- So use adjusted R2 when comparing multiple regression models; plain R2 is fine for a one-variable linear regression model.
- Adjusted R2 will always be less than or equal to R2.
- Adjusted R2 only has an advantage over R2 when you are using a multiple regression.
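To put numbers on the penalty, here is a minimal Python sketch using the standard formula adjusted R2 = 1 - [(n - 1) / (n - k - 1)] × (1 - R2), where n is the number of observations and k the number of independent variables. The function name and the sample figures (30 observations, R2 of 0.600 with 3 variables and 0.605 with 4) are made up for illustration; they are not from the curriculum.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than independent variables + 1")
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)


# Hypothetical example: adding a 4th variable nudges R2 up from 0.600 to 0.605.
before = adjusted_r2(0.600, n=30, k=3)
after = adjusted_r2(0.605, n=30, k=4)

print(f"3 variables: R2 = 0.600, adjusted R2 = {before:.4f}")
print(f"4 variables: R2 = 0.605, adjusted R2 = {after:.4f}")
```

With these made-up inputs, R2 rises slightly but adjusted R2 falls (from about 0.554 to about 0.542), which is the signal that the extra variable isn't pulling its weight.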
Heteroskedasticity
A mouthful to pronounce, this one, but let's first take a step back to the assumptions of a classical normal multiple linear regression model:
1. A linear relation exists between the dependent variable and the independent variables.
2. The independent variables are not random. Also, no exact linear relation exists between two or more of the independent variables.
3. The expected value of the error term, conditioned on the independent variables, is 0.
4. The variance of the error term (ε) is the same for all observations.
5. The error term is uncorrelated across observations.
6. The error term is normally distributed.
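The constant-variance assumption (the variance of the error term is the same for all observations) is the one heteroskedasticity is about, and it is easier to see with simulated data. Below is a rough Python sketch, not an official test: it fits a one-variable regression to two made-up data sets and compares the spread of the residuals for high-X and low-X observations. The true line Y = 2 + 3X, the sample size, the error spreads and the helper names are all my own assumptions for illustration.

```python
import random
from statistics import mean, pvariance


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of a one-variable least-squares line."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def residual_variance_ratio(xs, ys):
    """Residual variance of the high-X half divided by that of the low-X half."""
    intercept, slope = fit_simple_ols(xs, ys)
    # Sort observations by X, then split the residuals into two halves.
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = len(residuals) // 2
    return pvariance(residuals[half:]) / pvariance(residuals[:half])


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(1, 10) for _ in range(400)]

# Same true line in both cases; only the behaviour of the error term differs.
ys_constant = [2 + 3 * x + rng.gauss(0, 2.0) for x in xs]     # same error variance everywhere
ys_growing = [2 + 3 * x + rng.gauss(0, 0.5 * x) for x in xs]  # error spread grows with X

ratio_constant = residual_variance_ratio(xs, ys_constant)
ratio_growing = residual_variance_ratio(xs, ys_growing)

print(f"constant error variance: high/low residual variance = {ratio_constant:.2f}")
print(f"growing error variance:  high/low residual variance = {ratio_growing:.2f}")
```

A ratio near 1 is what constant error variance looks like; a ratio well above 1 means the residuals fan out as X increases, which is the classic heteroskedasticity picture.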
Heteroskedasticity breaks Assumption 4 above: it occurs when the variance of the error term differs across observations. Why do we care? Because the computed values for standard errors and test statistics will be incorrect unless they are adjusted for heteroskedasticity.
Multicollinearity
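Multicollinearity relates to Assumption 2 above in a classical normal multiple linear regression model. Strictly speaking, the assumption is broken only when there is an exact linear relation between independent variables, but in practice the problem arises whenever two or more independent variables are highly correlated with each other. The regression then cannot reliably separate their individual effects: the coefficient estimates become imprecise and their standard errors are inflated, so statistical inferences about individual variables may not be reliable.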
With multicollinearity, the regression coefficients may not be individually statistically significant even when the overall regression is significant as judged by the F-statistic. Multicollinearity could be caused by incorrect usage of dummy variables, including a variable in the regression that is actually a combination of two other variables, or using 2 nearly identical variables.
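As a rough illustration of the "two nearly identical variables" case, here is a small Python sketch on simulated data. It computes the correlation between two independent variables and the variance inflation factor (VIF). VIF is not mentioned above; it is a common diagnostic I'm bringing in for illustration, and the shortcut VIF = 1 / (1 - r^2) used here only holds when the model has exactly two independent variables. The data, seed and function names are all hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Sample correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))


def vif_two_regressors(xs, ys):
    """VIF when the model has exactly two independent variables: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1 / (1 - r ** 2)


rng = random.Random(1)  # fixed seed so the run is repeatable
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2_near_copy = [x + rng.gauss(0, 0.05) for x in x1]   # almost the same variable as x1
x2_unrelated = [rng.gauss(0, 1) for _ in range(200)]  # independent of x1

vif_near_copy = vif_two_regressors(x1, x2_near_copy)
vif_unrelated = vif_two_regressors(x1, x2_unrelated)

print(f"near copy: r = {pearson_r(x1, x2_near_copy):.4f}, VIF = {vif_near_copy:.1f}")
print(f"unrelated: r = {pearson_r(x1, x2_unrelated):.4f}, VIF = {vif_unrelated:.2f}")
```

The VIF says how many times larger the variance of a coefficient estimate is compared with the case where the independent variables are uncorrelated. A value close to 1 is harmless; a very large value is why individual coefficients can look insignificant even when the overall regression is significant.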
Hope this helps as a start. In general, there will always be things we don't know, but it gets better in the last 2 months as you focus, do practice papers, revise and repeat. The learning accelerates in the final stretch as you keep reinforcing and testing your knowledge. In terms of feeling overwhelmed, one tip that could work for you (and all candidates) is to break things up into tiny sections and tiny steps, so that progress seems faster (more encouraging) and you keep going one step at a time.
Best of luck!
in particular, i found your explanation style of framing each type of "statistics problem/errors" as a violation of the standard assumptions of the multiple linear reg models helpful, as somehow it makes things clearer (not sure why!).
so i now get that serial correlation is a violation of assumption 5, i.e. it occurs when the error terms are correlated across time. model misspecification seems to be a broader, catchall term that includes violation of assumption 3 (and potentially 1?).
thanks a lot once again sophie, quant methods isn't my best chapter but you made things easier...