As mentioned on the home page of the statistics group, discussion of statistical tools is the group's central activity. Statistical tools are the interface between data and the conceptual picture of a context area (which may take the form of a hypothesis or a mathematical relationship). In this sense, statistical tools extend statistical methods across different context areas (such as economics and sociology), different software platforms (such as SPSS and Stata), and different setups (classical and Bayesian). The group will therefore try to look at statistical methods on a larger canvas so that they are more useful in current practice. The topics proposed as statistical tools are listed in the taxonomy of topics. How a topic may be composed (its subsections) can be seen in the taxonomy of the book.
Correlation and Regression are among the terms most commonly used by applied statisticians. Their fame is due to their many flavors, from line fitting to statistical modelling. They can be used as mathematical tools as well as statistical tools. They come with different sets of assumptions, from weaker to stronger. Stronger assumptions yield more powerful results, but they are harder to justify in the field. It is therefore important to judge which version of the tool suits the situation at hand.
The relation between two or more characteristics is one of the fundamental questions in the development of human thought. Such relationships have been studied under different paradigms, such as causal systems, control systems, and knowledge systems. These paradigms rest on different beliefs: that something is the effect of some causal factors (event-based causal system), that something may be controlled by some control factors (control system), or that something may be explained by some explanatory factors (knowledge system). For example, one may want to know:
What is the relationship between education and income? For each year of education, how much does income increase (on average)?
What will be the rate of return on investment? For each dollar invested, how much will sales increase?
For a political candidate, how many votes will he get for each unit of money he spends on advertising?
With what confidence can height data be used to decide on shoe size? In other words, how much of the variation in shoe size is explained by height?
With how much confidence can we predict the weather from the height of a barometer?
On the basis of data, do we have sufficient reason to consider parental education level a cause of a child's maximum level of education?
Is the Marginal Propensity to Consume (MPC) less than 1, as assumed by Keynes?
Statistics offers many tools for answering such questions from relationships between characteristics (as in the examples above) that are available as data. Regression is one of them. It works within the boundary set by its assumptions and is based on the concept of a dependent characteristic (effect) and independent characteristics (causes). Although the main answers from regression are a measure of the degree of closeness between cause and effect and the change in the effect per unit change in the cause, it may also be used for prediction, for validating causal factors, and for substituting a set of other information for information that is costly or unavailable. In fact, the dependent and independent characteristics go by different names in different frameworks (not only 'cause' and 'effect'); see Gujarati (2004), p. 50. Although the basic statistical tool is the same, different frameworks can yield different types of answers (Schield 1995).
Correlation is a close associate of Regression and is even more widely used. Regression is a technique, while Correlation is a measurement: it measures the degree of relationship between two variables (and may be generalized). Generally speaking, correlation is a common noun synonymous with 'association'. In this non-technical sense, correlation is necessary for causality. In statistics, however, Correlation is a proper noun: the Pearson linear product-moment Correlation. In this technical sense, Correlation is not necessary for causality. The two concepts are so intermingled that it is difficult to understand one without the other.
Data Types, Scatter Plot, Straight Line in Cartesian Plane, Normal Distribution, Expectation, Median, Variance, Parameters, Causality
Historically, Correlation was not interpreted by its inventor, Sir Francis Galton (not Karl Pearson, as many people assume), in the same way as we interpret it today (as a measure of the linear relation between two characteristics).
Like his cousin Charles Darwin, Galton was fascinated by genetics and heredity, and this led him to invent the modern notions of Correlation and Regression. He was trying to measure the impact of the parent generation on the child generation for various characteristics. He approached the problem by examining self-fertilized sweet peas (self-fertilization minimizes the impact of multiple parental sources). He plotted the size of the parent peas on the X-axis and that of the offspring peas on the Y-axis, and found that extremely large or small mother peas produced less extreme daughter peas. In other words, the average size of the offspring of mothers of a given size tended to move, or "regress", toward the average size in the population as a whole.
He tried to obtain the regression coefficient by fitting a line through the median size of the offspring peas for each given size of mother pea.
Although he used a free-hand line-fitting technique, the important concept that emerged from this work was the interrelation between the variability of the characteristics (the sizes of mother and child peas) and the dependency between them (the slope of the line, that is, the change in child pea size with a change in mother pea size). He found that if the degree of association between two variables (his hereditary constant, today's Correlation) was held constant, then the slope of the regression line could be described if the variability of the two measures was known. At that time Galton believed he had estimated a single heredity constant that was generalizable to many or most inherited characteristics (see …). In his opinion, although there is a single heredity constant, the slope differs for different properties of the pea (such as size and color) because the variability in mother and daughter peas differs.
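In modern notation, the relation Galton observed between slope, association, and variability is usually written as below. The symbols are assumptions introduced here for illustration and are not Galton's own: b is the slope of the regression of offspring size (y) on mother size (x), r is the correlation, and s_x and s_y are the standard deviations of the two measures.

```latex
b_{y \cdot x} = r \, \frac{s_y}{s_x}
```

For instance, with hypothetical values r = 0.5, s_y = 2 and s_x = 4, the slope would be 0.5 × 2 / 4 = 0.25. With r held fixed, a change in either spread changes the slope, which is consistent with Galton's view that one heredity constant could produce different slopes for different properties.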
In 1896, Pearson published his first rigorous treatment of correlation and regression in the Philosophical Transactions of the Royal Society of London. Pearson credited Bravais (1846) with ascertaining the initial mathematical formulae for correlation. Pearson noted that Bravais happened upon the product-moment (that is, the "moment" or mean of a set of products) method for calculating the correlation coefficient but failed to prove that this provided the best fit to the data. Using an advanced statistical proof (involving a Taylor expansion), Pearson demonstrated that optimum values of both the regression slope and the correlation coefficient could be calculated from the product-moment, Σxy/n, where x and y are deviations of observed values from their respective means and n is the number of pairs.
Galton realized soon after he had collected and analyzed his sweet pea data that the generations prior to the immediate parents could also influence individual characteristics (Pearson 1930). He even noticed that certain characteristics occasionally skipped one or more generations; a man may appear more similar to his grandfather than to his father in certain respects. In an 1898 paper in the journal Nature (cited in Pearson (1930)), Galton published a clever diagram that partitioned a unit square into successively smaller squares, where each square represented the ever diminishing influence of previous generations of ancestors on the present individual. Galton's conceptualization of the multiple influences of progenitors on characteristics of the present-day individual was entirely parallel to the modern conception of multiple regression.
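The product-moment calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not Pearson's own procedure: the function names and the sample numbers are hypothetical, and the sketch assumes the usual modern definitions, in which the correlation is the product-moment divided by the product of the two standard deviations and the slope is the product-moment divided by the variance of x.

```python
import math

def product_moment(xs, ys):
    """Mean of the products of deviations from the means, i.e. sum(xy)/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation_and_slope(xs, ys):
    """Return (r, slope of y on x), both derived from product-moments."""
    pm_xy = product_moment(xs, ys)
    var_x = product_moment(xs, xs)   # variance of x (divisor n)
    var_y = product_moment(ys, ys)   # variance of y (divisor n)
    r = pm_xy / math.sqrt(var_x * var_y)
    slope = pm_xy / var_x
    return r, slope

# Hypothetical illustrative sizes (NOT Galton's measurements).
mother = [15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0]
daughter = [15.4, 15.7, 16.0, 16.3, 16.0, 17.3, 17.0]

r, slope = correlation_and_slope(mother, daughter)
print(f"correlation = {r:.3f}, slope = {slope:.3f}")
```

Both quantities share the same numerator, Σxy/n, which is why Pearson could obtain the slope and the correlation from a single product-moment.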
Bravais, A. (1846), "Analyse Mathematique sur les Probabilites des Erreurs de Situation d'un Point," Memoires par divers Savans, 9, 255-332.
Pearson, K. (1930), The Life, Letters and Labors of Francis Galton, Cambridge University Press.
The simplest form of regression can be viewed as the relation between two characteristics of the same entity (such as a person or a household), each of which can be represented as continuous data (a variable; see Data Type). Examples are the relation between a person's earnings and schooling, or between an individual's score and hours of labour. The general technique for studying such a relationship is to create a graph on the XY plane. The data shown on the Y-axis is called the dependent variable (DV, or endogenous variable), and its counterpart on the X-axis is called the independent variable (IV, or exogenous variable). Figure 1 shows a scatter plot of monthly income in rupees (DV) against level of education (IV). Many straight lines (such as HH, FF, and LL) can be drawn free-hand to summarize the relationship between the characteristics shown on Y and X. The best fit for the scatter plot may be obtained by choosing the line (such as FF) that minimizes the sum (e1 + e2 + … + en) of the distances from the points to the proposed line.
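The idea of choosing the line with the smallest total distance can be sketched as follows. The sketch makes one assumption that the text above does not state: it uses ordinary least squares, the most common way to make the idea precise, in which each distance e is measured vertically and squared before summing. The function names and the education and income figures are hypothetical and are not the data behind Figure 1.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: years of education (IV) and monthly income in rupees (DV).
education = [5, 8, 10, 12, 15, 17]
income = [4000, 5500, 7000, 8200, 11000, 12500]

a, b = fit_line(education, income)
best = sum_squared_residuals(education, income, a, b)

# A "free-hand" alternative: the same slope, shifted up by 500 rupees.
other = sum_squared_residuals(education, income, a + 500, b)

print(f"fitted line: income = {a:.1f} + {b:.1f} * education")
print(f"squared-distance total: fitted = {best:.1f}, shifted = {other:.1f}")
```

Any other straight line, such as the shifted one here, gives a larger total of squared distances than the fitted line, which is the sense in which the fitted line is "best" under this assumption.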