Principal components analysis and Factorial analysis to measure latent variables in a quantitative research: A mathematical theoretical approach

This paper aims to show how factor analysis and principal components analysis can be used to measure latent variables concisely and reliably, as an aid to building new concepts and theories.

The Bulletin of Society for Mathematical Services and Standards, Online: 2013-09-02, ISSN: 2277-8020, Vol. 7, pp. 3-12. doi:10.18052/www.scipress.com/BSMaSS.7.3. © 2013 SciPress Ltd., Switzerland. SciPress applies the CC-BY 4.0 license to works it publishes: https://creativecommons.org/licenses/by/4.0/


PRELIMINARY NOTES AND NOTATION
In the words of Wulder [5]: "Multivariate statistics provide the ability to analyze a complex set of data. Principal components analysis (PCA) and factor analysis (FA) are statistical techniques applied to a single set of variables to discover which sets of variables in the set form coherent subsets that are relatively independent of one another. Variables that are correlated with one another which are also largely independent of other subsets of variables are combined into factors. Factors which are generated are thought to be representative of the underlying processes that have created the correlations among variables". Factor analysis is a multivariate statistical technique that uncovers a structure of latent variables, known as factors, in a data matrix; it is therefore considered a data-reduction technique: if its hypotheses are met, the information contained in the matrix can be expressed, without much loss, in a lower number of dimensions represented by these factors [4]. Principal component analysis and factor analysis can be exploratory in nature; FA is used as a tool to reduce a large set of variables to a smaller, more meaningful set. As both FA and PCA are sensitive to the magnitude of correlations, robust checks must be made to ensure the quality of the analysis. Correlation coefficients tend to be less reliable when estimated from small samples; as a minimum, there should be at least five cases for each observed variable. Missing data must be dealt with to preserve the best possible relationships between variables. Filling in missing data through regression techniques is likely to overfit the data, inflate correlations unrealistically, and, as a result, manufacture factors. Normality yields an enhanced solution, but some inference may still be derived from non-normal data. Multivariate normality also implies that the relationships between variables are linear.
Univariate and multivariate outliers need to be screened out, since they heavily influence the calculation of correlation coefficients, which in turn strongly influences the calculation of factors. In PCA multicollinearity is not a problem, as matrix inversion is not required; for most forms of FA, however, singularity and multicollinearity are a problem. If the determinant of R and the eigenvalues associated with some factors approach zero, multicollinearity or singularity may be present, and deletion of the singular or multicollinear variables is required.

The theorems and their implications are as follows. Given $X_1, X_2, \ldots, X_p$ observed random variables, defined on the same population and sharing $m$ ($m < p$) common causes, we seek $m + p$ new variables: the common factors $(Z_1, Z_2, \ldots, Z_m)$ and the unique factors $(\xi_1, \xi_2, \ldots, \xi_p)$, in order to determine their contribution to the original variables $(X_1, X_2, \ldots, X_{p-1}, X_p)$. The model is then defined by the following equations, according to Carrasco-Arroyo [1] and García-Santillán, Venegas-Martínez and Escalera-Chávez [3]:

$$x_i = a_{i1} z_1 + a_{i2} z_2 + \cdots + a_{im} z_m + \xi_i, \qquad i = 1, \ldots, p \tag{1}$$

The model equations can be expressed in matrix form as:

$$\begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{p1} & \cdots & a_{pm} \end{pmatrix} \begin{pmatrix} z_1 \\ \vdots \\ z_m \end{pmatrix} + \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_p \end{pmatrix} \tag{2}$$

Therefore, the resulting model can be expressed in condensed form as $X = AZ + \xi$. We assume $m < p$ because we want to explain the variables through a small number of new random variables, and we assume that all of the $(m + p)$ factors are uncorrelated, that is, the variability explained by one factor has no relation with the other factors. Each observed variable of the model is a linear combination of the common factors with different weights $a_{ia}$; these weights are called saturations (loadings), but one part of $x_i$ is not explained by the common factors.
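As an illustration, the factor model $X = AZ + \xi$ can be simulated numerically. The following is a minimal sketch; the loading matrix A and the sizes p, m, n are hypothetical choices, and the unique variances are set so that each observed variable has unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)

p, m, n = 6, 2, 1000  # observed variables, common factors, cases

# Hypothetical saturation (loading) matrix A: rows = variables, cols = factors
A = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.9, 0.0],
              [0.1, 0.8],
              [0.2, 0.7],
              [0.0, 0.9]])

Z = rng.standard_normal((n, m))            # standardized, uncorrelated common factors
psi = 1.0 - (A ** 2).sum(axis=1)           # unique variances so that Var(x_i) = 1
E = rng.standard_normal((n, p)) * np.sqrt(psi)  # unique factors xi_i

X = Z @ A.T + E                            # observation-wise form of X = AZ + xi

# The sample correlation matrix should approximate AA' + diag(psi)
print(np.round(np.corrcoef(X, rowvar=False), 2))
```

With a large sample the empirical correlations reproduce the structure implied by the loadings, which is the point of the model.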
As we know, purely intuitive approaches can be inconsistent when obtaining solutions, so hypotheses are required; hence, in the factor model we use the following assumptions:

H1: The factors are standardized random variables and are uncorrelated:

$$E(z_a) = 0, \quad \operatorname{Var}(z_a) = 1, \quad \operatorname{Cov}(z_a, z_b) = 0 \ (a \neq b), \quad \operatorname{Cov}(z_a, \xi_i) = 0$$

Further, since the primary goal of the factors is to study and simplify the correlations between variables, measured through the correlation matrix, we also assume:

H2: The original variables are standardized, by transforming each variable as $x_i^* = (x_i - \bar{x}_i)/s_i$.

Therefore, considering the variance property, we have:

$$\operatorname{Var}(x_i) = \sum_{a=1}^{m} a_{ia}^2 + \psi_i = 1$$

THEOREM: SATURATIONS, COMMUNALITIES AND UNIQUENESS
We call the coefficient $a_{ia}$ the saturation of the variable $x_i$ in the factor $z_a$. In order to describe the relationship between the variables and the common factors it is necessary to determine the coefficients of $A$ (assuming hypotheses H1 and H2). With $V$ the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues of the correlation matrix, we obtain:

$$A = V \Lambda^{1/2}$$

The above implies that $a_{ia}$ coincides with the correlation coefficient between the variable $x_i$ and the factor $z_a$. In the case of non-standardized variables, $A$ is obtained from the covariance matrix $S$, and the correlation between $x_i$ and $z_a$ is the ratio $a_{ia}/\sqrt{s_{ii}}$. The variance explained by the factor $z_a$ is the sum of squares of the saturations in the $a$-th column of $A$:

$$\sum_{i=1}^{p} a_{ia}^2 = \lambda_a$$

We call communalities the quantities in the next theorem:

$$h_i^2 = \sum_{a=1}^{m} a_{ia}^2$$

The communality shows the percentage of the variance of each variable $i$ that is explained by the $m$ factors.
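One standard way to obtain the saturations from the eigenvectors V and eigenvalues Λ of the correlation matrix, and the communalities from them, can be sketched with NumPy; the correlation matrix R below is a hypothetical example:

```python
import numpy as np

# Hypothetical correlation matrix R (an assumption for illustration)
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

eigvals, V = np.linalg.eigh(R)           # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder to descending
eigvals, V = eigvals[order], V[:, order]

A = V * np.sqrt(eigvals)                 # A = V Lambda^{1/2}: saturations a_ia

m = 1                                    # retain one common factor
communalities = (A[:, :m] ** 2).sum(axis=1)   # h_i^2 = sum over retained factors of a_ia^2

print(np.round(A[:, 0], 3))              # correlations between the variables and factor 1
print(np.round(communalities, 3))
```

With all p factors retained, AA' reproduces R exactly; with m < p factors the communalities report how much of each unit variance the retained factors explain.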

Each coefficient $\psi_i$ is called the specificity (uniqueness) of the variable. In the matrix model $X = AZ + \xi$, with $\xi$ the matrix of unique factors and $Z$ the matrix of common factors, the specificity will be lower the greater the variation explained by each of the $m$ common factors. If we work with standardized variables, then, by the variance property:

$$\operatorname{Var}(x_i) = h_i^2 + \psi_i = 1$$

Recall that the variance of any variable is the sum of its communality and its uniqueness; thus, for the number of factors retained, a part of the variability of the original variables remains unexplained and corresponds to a residual (the unique factor).

Reduced correlation matrix
Based on the correlation between the standardized variables $x_i$ and $x_j$, we now have:

$$r_{ij} = E(x_i x_j)$$

Also, we know that

$$x_i = \sum_{a=1}^{m} a_{ia} z_a + \xi_i \tag{15}$$

From the hypotheses with which we started, developing the product and using the linearity of the expectation, and considering that the factors are uncorrelated (the starting hypotheses), we now have:

$$r_{ij} = \sum_{a=1}^{m} a_{ia} a_{ja} \qquad (i \neq j)$$

The variance of the $i$-th variable is given by:

$$\operatorname{Var}(x_i) = \sum_{a=1}^{m} a_{ia}^2 + \psi_i$$

Taking the starting hypotheses again, we can verify that the variance is divided into two parts: the communality and the uniqueness, the latter being the residual variance not explained by the model. In matrix form:

$$R = AA' + \xi^2, \qquad R^* = R - \xi^2$$

where $R^*$ is the reproduced (reduced) correlation matrix obtained from the matrix $R$. The fundamental identity is equivalent to the expression $R^* = AA'$; therefore the sample correlation matrix is an estimator of the matrix $AA'$. Meanwhile, the saturation coefficients $a_{ia}$ of the variables in the factors must verify this condition, which certainly is not enough to determine them. To estimate the product $AA'$, we diagonalize the reduced correlation matrix: a solution of the equation $R - \xi^2 = R^* = AA'$ is the matrix $A$ whose columns are the standardized eigenvectors of $R^*$. From this reduced matrix, through diagonalization as a mathematical instrument, we obtain the factor axes from the eigenvectors and eigenvalues.
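A numerical sketch of this identity: retaining m factors of a hypothetical correlation matrix R, the reproduced matrix AA' approximates R, and the diagonal residuals are the uniquenesses (R and m are illustrative choices):

```python
import numpy as np

# Hypothetical correlation matrix (an assumption for illustration)
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

eigvals, V = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]        # descending eigenvalue order
eigvals, V = eigvals[order], V[:, order]

m = 1
A = V[:, :m] * np.sqrt(eigvals[:m])      # loadings on the m retained factors

R_star = A @ A.T                         # reproduced correlation matrix, R* = AA'
uniqueness = 1.0 - np.diag(R_star)       # residual variance not explained by the model

print(np.round(R_star, 3))
print(np.round(uniqueness, 3))
```

The diagonal of R* holds the communalities, so communality plus uniqueness recovers each unit variance, as the text states.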

Factorial analysis viability
To validate the appropriateness of the factor model it is necessary to compute the sample correlation matrix R from the data obtained. Prior hypothesis tests should also be performed to determine the relevance of the factor model, that is, whether it is appropriate to analyze the data with this model [2,3]. One contrast to be performed is the Bartlett test of sphericity. It seeks to determine whether or not there is a relationship structure among the original variables. The correlation matrix R indicates the relationship between each pair of variables ($r_{ij}$), and its diagonal is composed of ones. Hence, if there were no relationship between the variables, all correlation coefficients between each pair of variables would be zero; the population correlation matrix would then coincide with the identity matrix and its determinant would be equal to 1.
If the data are a random sample from a multivariate normal distribution, then, under the null hypothesis, the determinant of the matrix is 1, and the test statistic is:

$$\chi^2 = -\left[n - 1 - \frac{2p + 5}{6}\right] \ln |R|$$

Under the null hypothesis, this statistic is asymptotically distributed as a $\chi^2$ with $p(p-1)/2$ degrees of freedom. Hence, if the null hypothesis is accepted, it would not be advisable to perform factor analysis.
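The Bartlett statistic is straightforward to compute directly; the following sketch uses a hypothetical correlation matrix and sample size:

```python
import math
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's statistic for H0: the population correlation matrix is the identity."""
    p = R.shape[0]
    stat = -(n - 1 - (2 * p + 5) / 6) * math.log(np.linalg.det(R))
    df = p * (p - 1) // 2                # p(p-1)/2 degrees of freedom
    return stat, df

# Hypothetical correlation matrix and sample size (assumptions for illustration)
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

stat, df = bartlett_sphericity(R, n=100)
print(stat, df)
# For df = 3 the 5% chi-square critical value is about 7.81,
# so here H0 is rejected and factor analysis is advisable.
```

A statistic above the chi-square critical value rejects sphericity, which is the precondition for proceeding with the factor model.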

Another index is the Kaiser-Meyer-Olkin contrast, which compares the correlation coefficients with the partial correlation coefficients. This measure of sampling adequacy (KMO) can be calculated for the whole set of variables or for each variable (MSA):

$$KMO = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} r_{ij(p)}^2}$$

where $r_{ij(p)}$ is the partial correlation coefficient between the variables $X_i$ and $X_j$ over all the cases.

The statistical procedure used to measure the data is an exploratory factor analysis model, following the procedure proposed by García-Santillán et al. [2,3]. In order to measure the data collected from students and test the hypothesis ($H_i$) about the set of variables that form the construct for understanding students' perception of statistics, the equation can be expressed as follows:

$$x = Af + u \iff X = FA' + U \tag{25}$$

with variance

$$\operatorname{Var}(x_i) = h_i^2 + \psi_i$$

This equation corresponds to the communality and the specificity of the variable $X_i$. Thus the variance of each variable can be divided into two parts: (a) the communality $h_i^2$, representing the variance explained by the common factors, and (b) the specificity $\psi_i$, representing the specific variance of each variable. With the transformation of the determinant of the correlation matrix we obtain Bartlett's test of sphericity, given by:

$$\chi^2 = -\left[n - 1 - \frac{2p + 5}{6}\right] \ln |R|$$

For the principal components, the variance of the component $Y_i$ is

$$\operatorname{Var}(Y_i) = a_i' S a_i, \qquad S = X'X$$

To obtain the first component we seek to maximize $\operatorname{Var}(y_1) = a_1' S a_1$, and for the process to be well defined we must require the normalization $a_1' a_1 = 1$.
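The KMO index can be sketched directly from a correlation matrix; one common formulation obtains the partial correlations from the inverse of R (the anti-image approach), and the matrix R here is a hypothetical example:

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin sampling adequacy from a correlation matrix R.

    Partial correlations are taken from the inverse of R, a standard
    way to compute them (anti-image correlations).
    """
    S = np.linalg.inv(R)
    d = np.sqrt(np.diag(S))
    Q = -S / np.outer(d, d)              # partial correlations r_ij(p)
    np.fill_diagonal(Q, 0.0)
    R_off = R - np.eye(R.shape[0])       # zero out the diagonal of R
    r2, q2 = (R_off ** 2).sum(), (Q ** 2).sum()
    return r2 / (r2 + q2)

# Hypothetical correlation matrix (an assumption for illustration)
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

print(round(kmo(R), 3))   # values above ~0.5 are usually considered acceptable
```

When the partial correlations are small relative to the raw correlations, KMO approaches 1 and the data are well suited to factor analysis.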

CONCLUSION
This demonstration reveals the importance and convenience of having prior knowledge of the factors being measured when choosing how to measure the latent variables upon which new concepts are based or established theories are supported.