Detecting Essential Assumption Violations in Linear Regression


good. A Cronbach's Alpha ≥ 0.6 is considered acceptable. When the scale's Cronbach's Alpha is small (< 0.6), the scale is corrected by removing, one at a time, the variables with low Variable-Total (item-total) correlations, provided that removing each such variable increases Cronbach's Alpha.
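The thesis performs these computations in SPSS. Purely as an illustration, Cronbach's Alpha and the corrected Variable-Total correlations can be sketched in Python with NumPy (the function names here are my own, not from any cited source):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's Alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_total_correlations(items: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation: each item vs. the sum of the others."""
    k = items.shape[1]
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(k)])
```

An item whose corrected item-total correlation falls below 0.3 would be a candidate for removal under the rule described above.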

2.3.4.2. Exploratory factor analysis (EFA)

After determining the reliability of the scale, exploratory factor analysis is performed to correct the components (factors) of the scale (if any).

According to Hoang Trong & Chu Nguyen Mong Ngoc (2008), for factor analysis to be performed, the variables must be related to each other and thus closely correlated with one or more common factors. Therefore, to determine whether factor analysis can be performed on the set of variables of the scale, it is necessary to test the null hypothesis (H0) that the variables are uncorrelated with each other. The KMO measure and Bartlett's test are used for this purpose. According to Nguyen Dinh Tho (2011), the set of variables can be used in factor analysis when KMO ≥ 0.5 (data suitable for factor analysis) and the Sig. of Bartlett's test < 0.05 (observed variables are correlated with each other in the population).
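SPSS reports KMO and Bartlett's test directly. As a hedged sketch of what those numbers are, both can be computed from the correlation matrix; the implementations below follow the standard textbook formulas (function names are my own):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X: np.ndarray):
    """Bartlett's test of sphericity. H0: the variables are uncorrelated
    (the population correlation matrix is the identity)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)

def kmo(X: np.ndarray) -> float:
    """Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    # partial correlations obtained from the inverse correlation matrix
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    np.fill_diagonal(R, 0)
    np.fill_diagonal(partial, 0)
    return (R ** 2).sum() / ((R ** 2).sum() + (partial ** 2).sum())
```

Data suitable for EFA should give Bartlett Sig. < 0.05 and KMO ≥ 0.5, matching the thresholds cited above.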

After determining the set of variables of the scale that can be included in factor analysis, factors can be extracted based on a predetermined number of factors or on the amount of variation explained by the extracted factors (eigenvalue). According to Nguyen Dinh Tho (2011), extraction stops at the last factor whose eigenvalue ≥ 1, with a total variance extracted ≥ 50%.
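The eigenvalue ≥ 1 stopping rule and the total variance extracted both come from the eigenvalues of the correlation matrix. A minimal sketch of that criterion (the function name is my own, and this ignores rotation and loadings):

```python
import numpy as np

def factors_to_retain(X: np.ndarray):
    """Kaiser criterion: keep components of the correlation matrix whose
    eigenvalue >= 1, and report the share of variance they extract."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending eigenvalues
    n_keep = int((eigvals >= 1).sum())
    variance_extracted = eigvals[:n_keep].sum() / eigvals.sum()
    return n_keep, variance_extracted
```

A result with variance_extracted ≥ 0.5 satisfies the 50% threshold cited above.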


The process of factor analysis and extraction creates a number of new factors, each of which depends on a number of initial independent variables (observed variables) in the form:

F_i = W_i1 X_1 + W_i2 X_2 + … + W_ik X_k


In which: F_i is the estimated value of the i-th factor; W_ij (j = 1..k) is the factor loading (weight) of observed variable X_j (j = 1..k) in F_i.


When performing EFA, the factor loadings W_ij are calculated; however, according to Nguyen Dinh Tho (2011), rather than using the W_ij-weighted scores, the best approach is to use the sum or average of a factor's observed variables as its score for subsequent analyses (such as regression).
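That recommendation, computing a factor score as the mean of the observed variables loading on the factor, is a one-line operation. A hedged sketch (the function name is my own):

```python
import numpy as np

def factor_score_mean(X: np.ndarray, member_idx: list) -> np.ndarray:
    """Factor score as the mean of the observed variables (columns of X,
    selected by member_idx) that load on the factor."""
    return X[:, member_idx].mean(axis=1)
```

For example, if variables 0 and 2 of a response matrix belong to one factor, `factor_score_mean(X, [0, 2])` gives each respondent's score on that factor.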

According to Hoang Trong & Chu Nguyen Mong Ngoc (2008), factor extraction without rotation rarely produces factors that can be easily interpreted, because each factor correlates with many variables. By rotating the factors, we want each factor to have non-zero (i.e., significant) loadings on only a few variables.

There are many factor extraction and rotation methods, but Principal Component Analysis (PCA) with Varimax rotation is used in this thesis because it extracts the most variance from the measurement variables with the smallest number of components, which serves the subsequent prediction goal (Nguyen Dinh Tho, 2011).
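The thesis runs Varimax rotation in SPSS. For illustration only, the classic Kaiser varimax criterion can be implemented directly; this is the generic textbook algorithm (SVD-based fixed-point iteration), not SPSS's internals, and the function name is my own:

```python
import numpy as np

def varimax(loadings: np.ndarray, gamma: float = 1.0,
            max_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Orthogonal varimax rotation of a (p_variables, k_factors) loading
    matrix, driving each variable's loadings toward a single factor."""
    p, k = loadings.shape
    R = np.eye(k)        # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))))
        R = u @ vt
        d_new = s.sum()
        if d != 0 and d_new / d < 1 + tol:   # converged: criterion stopped improving
            break
        d = d_new
    return loadings @ R
```

Because the rotation is orthogonal, it changes which factor each variable loads on but preserves each variable's communality (the row sums of squared loadings).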

The EFA analysis results may produce a scale model with new components and some variables may be removed from the model. The new scale is tested for reliability before being included in the regression.

2.3.4.3. Regression

From the model after exploratory factor analysis, regression was performed to determine the influence of the components (factors) on the scale.

1. Conditions for applying regression

The first condition for regression to be applicable is that the variables are closely correlated but not overlapping. Since the variables in this model are all quantitative variables, the Pearson correlation coefficient - denoted as r - is used (Hoang Trong & Chu Nguyen Mong Ngoc, 2008). Thus, the r of the pairs of variables must be large (closely correlated) but must be < 1 (not overlapping).

The multiple linear regression model is built on four assumptions:

The dependent variable has a linear relationship with the independent variables.

The variance of the error is constant.

There is no autocorrelation between the residuals.

The residuals are normally distributed.

The regression model has the form:

Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_k X_k + ε

Where Y is the dependent variable, X_i are the independent variables, β_0 is the regression constant, β_i are the regression coefficients (i = 1..k, where k is the number of independent variables), and ε is the regression error.

Estimated regression model:

Ŷ = b_0 + b_1 X_1 + b_2 X_2 + … + b_k X_k

The regression coefficients b_i (i = 0..k) are estimated using the ordinary least squares (OLS) method.
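The OLS estimation that SPSS performs can be sketched in a few lines of Python (the function name is my own); the intercept b_0 is handled by prepending a column of ones:

```python
import numpy as np

def ols_coefficients(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """OLS estimates b = (X'X)^(-1) X'y. A column of ones is prepended,
    so b[0] is the intercept b_0 and b[1:] are b_1..b_k."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return b
```

On noiseless data generated as y = 2 + 3x, the estimates recover the true constants exactly.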

There are many regression methods. According to Nguyen Dinh Tho (2011), entering all independent variables into the regression equation simultaneously (the Enter method in SPSS) is the method used to confirm the relationships between variables, that is, to test hypotheses. Since SERVPERF is an already-tested scale, this thesis uses the Enter method.

2. Model fit testing

Regression models whose coefficients are estimated by OLS may not be valid when extrapolated to the population, so the model must be tested based on the coefficient of determination R². R² reflects the proportion of the variation in the dependent variable Y explained by the independent variables X_i (the remainder, 1 − R², is due to measurement error and to variables absent from the model). The larger R² is, the better the model fits the data.

To test model fit, the F test is used with the null hypothesis (H0) that the coefficient of determination R² = 0, meaning the model is not suitable. Reject H0 if the F test has Sig. (p-value) < the significance level α (Nguyen Dinh Tho, 2011).


Because a multiple regression model contains several independent variables, when comparing two regression models the adjusted R² (R² adjusted for degrees of freedom) is used instead of R².
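R², adjusted R², and the F test above are all computed from the residual and total sums of squares. A hedged sketch of those formulas (function name mine), assuming X has shape (n, k):

```python
import numpy as np
from scipy import stats

def model_fit_test(X: np.ndarray, y: np.ndarray):
    """R^2, adjusted R^2, and the F test of H0: R^2 = 0,
    for an OLS model with intercept."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra predictors
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)
    return r2, adj_r2, f, p
```

Reject H0 (the model is not suitable) when p < α; adjusted R² is always below R² and is the value to compare across models with different numbers of predictors.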

It is also necessary to consider multicollinearity between the independent variables, which can make R² high without the model actually explaining the variation in the data. According to Nguyen Dinh Tho (2011), the index commonly used to check for multicollinearity is the Variance Inflation Factor (VIF). If an independent variable has VIF > 10, it has almost no value in explaining the variation of the dependent variable (the variable is multicollinear, i.e., it depends heavily on other independent variables). In practice, however, if an independent variable has VIF > 2, the correlation coefficients (Pearson, partial) should be examined carefully when interpreting that variable in the model.
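VIF has a simple definition: regress each predictor on all the others and set VIF_j = 1 / (1 − R_j²). A minimal sketch (function name mine):

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on the remaining columns (with intercept)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2 = 1 - (resid ** 2).sum() / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1 / (1 - r2)
    return out
```

Independent predictors give VIF near 1; a predictor that is nearly a linear combination of the others gives a VIF far above the 10 threshold cited above.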

Even when the model as a whole fits the population, individual slopes of the regression line may not be meaningful, so the next step is to test the significance of the regression coefficients, that is, to test the linear relationship of each independent variable with the dependent variable, with the hypothesis H0: β_i = 0 (i = 0..k, where k is the number of independent variables). The t test is used for this. Reject H0 if the t test has Sig. (p-value) < the significance level α.
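The coefficient t tests that SPSS reports can be sketched as follows (function name mine); the standard errors come from the diagonal of σ̂²(X'X)⁻¹:

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X: np.ndarray, y: np.ndarray):
    """t test of H0: beta_i = 0 for each OLS coefficient
    (index 0 is the intercept)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    sigma2 = (resid ** 2).sum() / (n - k - 1)          # error variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    t = b / se
    p = 2 * stats.t.sf(np.abs(t), n - k - 1)           # two-sided p-values
    return b, t, p
```

A coefficient with p < α is retained as a significant predictor; the others are interpreted with caution or dropped.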

3. Detect violations of necessary assumptions in linear regression

1. Linear relationship between the dependent variable and the independent variables: the basic assumption of linear regression is that the dependent variable has a linear relationship with the independent variables. According to Hoang Trong & Chu Nguyen Mong Ngoc (2008), linearity can be checked through the scatter plot of the (standardized) residuals u_i (with u_i = Y − Ŷ). If the residuals are scattered randomly around the horizontal line through 0 rather than following a pattern, the linearity assumption is satisfied.

2. Constant variance: according to Hoang Trong & Chu Nguyen Mong Ngoc (2008), non-constant variance (heteroscedasticity) makes the regression coefficients estimated by the least squares method (OLS) unbiased but inefficient, and makes the variance estimates biased, so that the tests become invalid. One simple approach is the Spearman rank correlation test of the hypothesis H0 that the population rank correlation coefficient between the absolute residuals and the independent variable = 0 (i.e., the variance is constant) (Hoang Trong & Chu Nguyen Mong Ngoc, 2008). If the Sig. of the test < the significance level α, there is sufficient basis to reject H0, meaning the variance is changing; conversely, if Sig. > α, the variance is constant.
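A hedged sketch of that Spearman check (the function name and the choice of which predictor column to test are mine): fit the OLS model, then rank-correlate the absolute residuals with a predictor.

```python
import numpy as np
from scipy import stats

def spearman_hetero_test(X: np.ndarray, y: np.ndarray, j: int = 0):
    """Heteroscedasticity check: Spearman rank correlation between the
    absolute OLS residuals and predictor X[:, j].
    H0: rho = 0 (constant variance)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    rho, p = stats.spearmanr(np.abs(resid), X[:, j])
    return rho, p
```

A small p-value rejects H0 and signals that the error variance changes with that predictor.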

3. Autocorrelation: autocorrelation is correlation between the residuals u_i; it likewise makes the regression coefficients estimated by the least squares method (OLS) unbiased but inefficient and the variance estimates biased, so that the t and F tests become invalid. To test for autocorrelation, the Durbin-Watson statistic (d) is used with the hypothesis H0 that the population correlation coefficient of the residuals = 0 (i.e., there is no autocorrelation).

According to Gujarati (1995), the rule of thumb is:

Table 2-16: Durbin-Watson test rule of thumb

H0                                          d                       Decision
No positive autocorrelation                 0 < d < dL              Reject
No positive autocorrelation                 dL ≤ d ≤ dU             Inconclusive
No negative autocorrelation                 4 - dL < d < 4          Reject
No negative autocorrelation                 4 - dU ≤ d ≤ 4 - dL     Inconclusive
No autocorrelation (positive or negative)   dU < d < 4 - dU         Do not reject


Source: Gujarati (1995)


When d lies in the inconclusive regions ([dL, dU] and [4 - dU, 4 - dL]), the following modified d test procedure can be used: at a significance level of 2α, consult the Durbin-Watson table to determine dL and dU:

If d < dU, there is statistically significant positive autocorrelation;

If 4 - d < dU, there is statistically significant negative autocorrelation;

Thus, if d < dU or 4 - d < dU, there is autocorrelation at significance level 2α.
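The Durbin-Watson statistic itself is just a ratio of sums of squares of the residual series. A minimal sketch (function name mine):

```python
import numpy as np

def durbin_watson(resid: np.ndarray) -> float:
    """Durbin-Watson statistic d = sum (e_t - e_{t-1})^2 / sum e_t^2.
    d near 2 suggests no autocorrelation; near 0 positive, near 4 negative."""
    return (np.diff(resid) ** 2).sum() / (resid ** 2).sum()
```

The computed d is then compared against the tabulated dL and dU as in Table 2-16.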

4. Residuals have a normal distribution: violations of this assumption can arise from using an incorrect model, changing variance, etc. Also according to Hoang Trong & Chu Nguyen Mong Ngoc (2008), whether the residuals are normally distributed can be checked using a histogram and a P-P plot.
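The histogram and P-P plot are graphical checks done in SPSS. As a purely numerical complement (not part of the cited procedure), a Shapiro-Wilk test of the same hypothesis can be sketched:

```python
import numpy as np
from scipy import stats

def residual_normality(resid: np.ndarray):
    """Shapiro-Wilk test as a numerical complement to the histogram / P-P plot.
    H0: the residuals are drawn from a normal distribution."""
    w, p = stats.shapiro(resid)
    return w, p
```

A W statistic near 1 with a large p-value is consistent with normal residuals; a small p-value flags a departure from normality.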

In this thesis, the regression model aims to determine the relationship between the components and satisfaction, not to make predictions; therefore, according to Nguyen Dinh Tho (2011), the model uses standardized regression coefficients. When standardized regression coefficients are used, the regression equation has no intercept constant b_0.
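Standardized coefficients are simply the OLS coefficients after z-scoring every variable, which is why the intercept vanishes. A hedged sketch (function name mine):

```python
import numpy as np

def standardized_betas(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Standardized regression coefficients: OLS on z-scored X and y.
    The intercept of the standardized model is exactly 0, so it is dropped."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    b, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return b
```

With a single predictor, the standardized coefficient equals the Pearson correlation between that predictor and y, which makes the scale-free interpretation evident.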

2.3.4.4. Identify strengths and weaknesses

The regression equation shows the factors and the influence of each factor on satisfaction, but it does not show which factors are strong and which are weak. To compare the level of customer evaluation of each factor, it is necessary to compare the overall average of the factors. The comparison can be done by estimating the overall average of each factor and comparing those averages.

2.3.4.5. Identify assessment differences between groups

Based on demographic variables (Gender, Training system, Occupation, Year of study, Part-time work, Seniority, etc.), there may be differences in assessing training quality. Identifying differences is to support the analysis of factors affecting training quality.

Analysis of variance (ANOVA) compares the population mean of a quantitative variable across groups defined by one or more qualitative variables. The condition for performing ANOVA is that the variances of the compared groups be homogeneous. According to Hoang Trong & Chu Nguyen Mong Ngoc (2008), Levene's test is used to check this condition, with the hypothesis H0 that the variances of the groups are homogeneous. If the Sig. of Levene's test > the significance level α, there is not enough basis to reject H0: the group variances are equal and ANOVA can be used.

The ANOVA results are shown through the F test with the hypothesis H0 that the population means of the groups are equal. If the F test has Sig. < α, there is sufficient basis to reject H0, meaning that the population means of the groups differ. In that case, in-depth analysis is needed to determine which groups differ and by how much.
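The Levene-then-ANOVA sequence described above can be sketched with SciPy's implementations of both tests (the wrapper function name is mine):

```python
from scipy import stats

def anova_with_levene(*groups, alpha: float = 0.05):
    """Levene's test for homogeneity of variances, then one-way ANOVA.
    The ANOVA F test is only interpreted when levene_p > alpha."""
    _, levene_p = stats.levene(*groups)
    _, anova_p = stats.f_oneway(*groups)
    return levene_p, anova_p
```

If levene_p > α, the variances are treated as homogeneous and anova_p < α indicates that at least one group mean differs, triggering post-hoc analysis.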

According to Hoang Trong & Chu Nguyen Mong Ngoc (2008), there are two methods of in-depth ANOVA analysis: a priori contrasts (Contrast) and post-hoc analysis (Post hoc); the post-hoc approach is closer to how real research proceeds, so this thesis uses post-hoc analysis. When using post-hoc analysis, according to Hoang Trong & Chu Nguyen Mong Ngoc (2008), the Bonferroni procedure is one of the simplest testing procedures and is often used for in-depth analysis of differences in population means. In case Levene's test has Sig. ≤ α, meaning the conditions for analysis of variance are not satisfied, Tamhane's T2 statistic is used instead of Bonferroni.
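SPSS provides Bonferroni post-hoc tests directly. As an illustration of the underlying idea, a sketch of pairwise t tests with the Bonferroni correction (multiplying each p-value by the number of comparisons; function name mine):

```python
from itertools import combinations
from scipy import stats

def bonferroni_posthoc(groups: dict) -> dict:
    """Pairwise t tests with Bonferroni correction: each raw p-value is
    multiplied by the number of comparisons (and capped at 1)."""
    names = list(groups)
    m = len(names) * (len(names) - 1) // 2   # number of pairwise comparisons
    results = {}
    for a, b in combinations(names, 2):
        _, p = stats.ttest_ind(groups[a], groups[b])
        results[(a, b)] = min(p * m, 1.0)
    return results
```

Pairs whose corrected p-value is below α are the groups driving the overall ANOVA difference. (Tamhane's T2, used when variances are unequal, is not sketched here.)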

2.3.5. Analysis of learner evaluation

2.3.5.1. Satisfaction scale analysis

Assessing the reliability of the scale shows that it has good reliability: all variables in the scale have a Variable-Total correlation ≥ 0.3, and the scale's Cronbach's Alpha = 0.815 ≥ 0.7.

The results of the exploratory factor analysis of the Satisfaction scale showed that there was only one factor with all three initial observed variables (Table PL-B-1, page 91).

2.3.5.2. Evaluation of the reliability of components

The reliability of the five components (Trust, Tangibility, Empathy, Responsiveness, Competence) is evaluated through Cronbach's Alpha analysis in SPSS.

In the Trust component, variable TC6 has a Variable-Total correlation = 0.284 < 0.3, so this variable is eliminated; the remaining five variables are TC1, TC2, TC3, TC4, and TC5. Re-evaluating reliability after eliminating TC6, all variables in the scale have a Variable-Total correlation ≥ 0.3 and the scale has Cronbach's Alpha = 0.801 ≥ 0.8, showing that the scale has good reliability.

The variables in the Tangible, Empathy, and Competence components all have a Variable-Total correlation of ≥ 0.3 and Cronbach's Alpha of the scale of ≥ 0.7, indicating that these components have good reliability.

In the Responsiveness component, variable DU3 has a Variable-Total correlation = 0.205 < 0.3, so this variable is eliminated. After eliminating DU3 and re-evaluating reliability, all variables in the scale have a Variable-Total correlation ≥ 0.3 and the scale's Cronbach's Alpha = 0.718 ≥ 0.7, showing that the scale is usable.

2.3.5.3. Exploratory factor analysis of learner assessment scale

Performing the KMO measure and Bartlett's test shows that the set of variables can be factor analyzed: KMO = 0.901 > 0.5 and Bartlett's test has Sig. = 0.000 < 0.05 (Table PL-B-2, page 91).

Factor analysis is conducted using Principal Component Analysis combined with Varimax rotation. At eigenvalue = 1.138, four factors are extracted with a total variance extracted of 58.685% (Table PL-B-3, page 91). After factor rotation, variable HH5 has a factor loading < 0.5 and is eliminated (Table PL-B-4, page 92). The learner assessment scale now comprises four new factors with a total of 20 independent variables, as follows:

1. Factor 1 (named Interest - QT) includes 6 variables:

DU2: Staff respond quickly to student requests
DC2: Lecturers often show interest in your studies
DC1: The school is very concerned about your living and studying conditions
DC4: Staff are very sympathetic and considerate to students
DU1: Staff are always willing to help students
DC3: Lecturers always give you advice like an older brother or sister.
