Fit Adequacy of Dichotomous Logit Response Models of the Regressor Bernoulli and Binomial Probability Distributions

Logit models belong to the class of probability models that determine discrete probabilities over a limited number of possible outcomes. They are often called ‘Quantal Variables’ or ‘Stimulus and Response Models’ in the biological literature. The conventional $R^2$ measure of goodness-of-fit is problematic in logit models, which has led to the proposal of several alternative goodness-of-fit measures. Researchers in this area have, however, identified a base rate problem in using these alternative measures. This research extends earlier work in this area. Specifically, it investigates the goodness-of-fit performance of eight statistics using the Bernoulli and Binomial distributions as explanatory variables under various scenarios, and draws conclusions on the “best” fit. The data for the study were generated through simulation and analysed using multiple correlation analysis. The findings show that for the Bernoulli distribution, the goodness-of-fit statistics to use are $R^2_o$, $R^2_C$, $R^2_M$ and one index of predictive efficiency; for the Binomial distribution, they are $R^2_N$ and one index of predictive efficiency. $R^2_o$ stood out as the “best” goodness-of-fit statistic.


Introduction
Logit models belong to the class of probability models that determine discrete probabilities over a limited number of possible outcomes [1]. Like the OLS regression model, the logit model admits all sorts of extensions and quite sophisticated variants. Logit models are often called 'Limited Dependent Variable Models' or 'Qualitative or Categorical Response Models' in Econometrics; 'Quantal Variables' or 'Stimulus and Response Models' in the biological literature; and 'Discrete Choice' models in Psychology and Economics [1]. When the response takes one of only two possible values representing success and failure, or more generally the presence or absence of an attribute of interest, we say that it is binary or dichotomous; if the response variable can have more than two outcomes, we say that it is polytomous. In the polytomous case, if the response variable is ordered, it gives rise to ordered logit or probit (ologit or oprobit) models; otherwise, it gives rise to multinomial logit or multinomial probit models [2]. Originally, only researchers from medical disciplines (especially epidemiology) used this form of regression. More recently, however, logistic regression has been adopted by those who conduct empirical investigations in a wide array of disciplines. Its popularity continues to grow at such a rate that it may soon overtake multiple regression and become the most frequently used regression tool of all [3]. For instance, the logit model has been used extensively in analyzing growth phenomena, such as population, GNP and money supply [2]. It is now commonly used in many disciplines, for example in health-sciences research, particularly in medical sciences and engineering settings. It is also becoming increasingly popular in the behavioural and social sciences as it is an important endpoint in quality control and quality testing [4].
Historically, the roots of the logit model reach back to the 19th century, when the function was invented to describe population growth and given its name by the Belgian mathematician Verhulst. The rediscovery of the growth function is due to Pearl and Reed, the survival of the term logistic to Yule, and the introduction of the function in bio-assay (and hence in statistics in general) to Berkson [5].
This study considers the dichotomous logit model since, by far, the 'most widely used discrete choice model is logit' [6].
The logit model as given in [4] is:

$$P = \Pr(Y = 1 \mid x_j) = \frac{\exp(\beta_0 + \beta_1 x_{j1} + \cdots + \beta_k x_{jk})}{1 + \exp(\beta_0 + \beta_1 x_{j1} + \cdots + \beta_k x_{jk})} \qquad (1)$$

where $x_j = (x_{j1}, \ldots, x_{jk})$ denotes the $j$th setting of values of the $k$ explanatory variables and $\beta_0, \beta_1, \ldots, \beta_k$ are the model parameters. In Eq. 1, the basic random variable $Y$ is dichotomous response data taking the value 1 with the success probability $P$ and the value 0 with the failure probability $(1 - P)$. $Y$ is therefore assumed to follow a Bernoulli($P$) distribution. One of the uses of regression analysis is forecasting. In this case, we are interested in how well the regression model predicts movements in the dependent variable. A summary measure that tells us how well the sample regression line fits the data is the $R^2$ statistic. It is a measure of goodness-of-fit which gives the proportion or percentage of the total variation in $Y$ (the regressand) explained by the regression model. It is often measured as the squared correlation between the observed values of $Y$ and the predictions produced by the estimated regression equation [7]. [7] emphasizes that $R^2$ is a measure of linear association between $X$ (the independent variable) and $Y$ (the dependent variable). It usually lies in the interval [0, 1].
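As a minimal sketch (in Python with NumPy, which is not the software used in this study), the logistic form of the success probability and the squared-correlation reading of $R^2$ might be computed as follows. The function names and the example coefficients are illustrative assumptions, not part of the original work.

```python
import numpy as np


def logit_probability(x, beta):
    """Success probability P(Y = 1 | x) under the standard logistic form.

    x    : array of shape (n_obs, k) holding the explanatory variables
           (a 1-D input is treated as a single observation with k values)
    beta : sequence of length k + 1; beta[0] is the intercept
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    eta = beta[0] + x @ np.asarray(beta[1:], dtype=float)  # linear predictor
    return 1.0 / (1.0 + np.exp(-eta))


def squared_correlation_r2(y, y_hat):
    """R^2 read as the squared correlation between observed and predicted values."""
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


if __name__ == "__main__":
    # Hypothetical coefficients: intercept -1.0, slopes 0.5 and 0.25.
    probs = logit_probability([[0.0, 0.0], [1.0, 2.0]], [-1.0, 0.5, 0.25])
    print(probs)
```

With all coefficients equal to zero the function returns a probability of 0.5 for every observation, which is a quick sanity check on the logistic form.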
In ordinary least squares (OLS) regression, there appears to be a general consensus on the use of the coefficient of determination or explained variance, $R^2$, as an indicator of model fit for a quantitative dependent variable. In logistic regression analysis, by contrast, there is as yet no consensus on how we should calculate corresponding measures of the strength of association between the dependent variable and the total set of predictors. One reason is that, as described by Efron, cited in [8], there is only one reasonable residual variation criterion for quantitative dependent variables in OLS, the familiar error sum of squares, $\sum (y - \hat{y})^2$, but there are several possible residual variation criteria (entropy, squared error, qualitative difference) for binary dependent variables. Another hindrance to consensus is the existence of numerous mathematical equivalents to $R^2$ in OLS, which are not necessarily mathematically (same formula) or conceptually (same meaning in the context of the model) equivalent to $R^2$ in logistic regression. This is probably why [2] says that in binary regression models, goodness-of-fit measures are of secondary importance. What matters are the expected signs of the regression coefficients and their statistical and/or practical importance. Also, John Aldrich and Forrest Nelson in [2] contend that "use of coefficient of determination as a summary statistic should be avoided in models of qualitative dependent variable".
Since the conventional $R^2$ measure of goodness-of-fit is problematic in limited variable models where the predicted values are probabilities and the actual values are either 0 or 1, it is desirable to generalize the definition of $R^2$ to more general models for which the concept of residual variance cannot be easily defined and maximum likelihood is the criterion of fit. This has led to the proposal of several alternative goodness-of-fit measures.
There are two groups of goodness-of-fit statistics used to evaluate the association between the independent variables and the dependent variable in logistic regression analysis. One group is to

Bulletin of Mathematical Sciences and Applications Vol. 16
compare predicted and observed discrete values of the dependent variable, using a prediction table, and to classify as errors the qualitative differences between the observed and predicted outcomes for each case (the person, place, or thing on which measurement is being performed) or for each observation (the measurement that is taken) when there is more than one observation per case. Such measures, says [8], are called indexes of predictive efficiency, and they use an absolute, qualitative "right or wrong" standard for assessing accuracy of prediction. These indexes are defined in Eq. 2 to Eq. 4, where $n$ is the sample size. Another group is the use of $R^2$ analogs (popularly called 'pseudo-$R^2$s') that compare the discrete observed values (typically zero and one for a dichotomous dependent variable) with the continuous predicted values (probabilities) that result from applying the logistic regression equation. [8] also identifies five of these measures as follows:
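For orientation, the five $R^2$ analogs are commonly written in the forms below. This is a sketch in a notation assumed here rather than taken from the text: $L_0$ and $L_M$ are the likelihoods of the intercept-only and fitted models, $G_M = -2(\ln L_0 - \ln L_M)$ is the model chi-square, $\hat{y}_i$ is the predicted probability and $\bar{y}$ is the sample mean of $y$. The pairing of each form with the equation numbers (5) to (9) is likewise an assumption.

```latex
\begin{align}
R^2_{o}  &= 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{5}\\
R^2_{MF} &= 1 - \frac{\ln L_M}{\ln L_0} \tag{6}\\
R^2_{M}  &= 1 - \left(\frac{L_0}{L_M}\right)^{2/n} \tag{7}\\
R^2_{N}  &= \frac{1 - (L_0/L_M)^{2/n}}{1 - L_0^{2/n}} \tag{8}\\
R^2_{C}  &= \frac{G_M}{G_M + n} \tag{9}
\end{align}
```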

where Eq. 5 to Eq. 9 are, respectively, the Ordinary Least Squares $R^2$, the Log Likelihood Ratio $R^2$, the Geometric Mean Squared Improvement Per Observation $R^2$, the Adjusted Geometric Mean Squared Improvement $R^2$, and the Contingency Coefficient $R^2$; and $n$ is the total sample size, that is, the number of cases (people, cities, widgets) assuming a single observation per case, or the number of observations when there are multiple observations per case (as, for example, in discrete event history models with pooled time series and cross-sectional data). Researchers have identified a base rate problem in using these $R^2$ values as goodness-of-fit measures [9], [8], [4]. For instance, [8] posits in his study that 'in general, most of the coefficients of determination are so highly correlated with the base rate that the base rate itself could practically be used as a coefficient of determination'. So, those statistics that are base-rate invariant are seen to be the "best" measures of goodness-of-fit. [8] thus recommends future research involving a more detailed mathematical analysis to determine the nature of the relationship of the various $R^2$ analogs and indexes of predictive efficiency with the base rate. In addition, he adds, simulation research like the work of [9] on the indices of predictive efficiency would be useful in studying $R^2$ analogs to examine in more detail their relationship to the base rate, their behavior in the presence of unreliability of measurement, and the extent to which the limited result presented in his study may be peculiar to the data he used, or generalizable to a broader range of data. [9] define base rate to be 'the relative frequency of occurrence (i.e., ratio of success to failures) of the event being studied in the population of interest'. They opine that the base rate problem stems from the fact that statistical prediction models often are not valid when applied to populations with a different base rate than the population for which the prediction model was constructed.
The reason for this, they continued, is that the base rate for a dichotomous variable becomes the marginal distribution of the outcome expectancy table, and the computation of the predictive accuracy index is based on this expectancy table. They give the example that if 40% of students successfully complete a developmental mathematics program, then a 40% base rate will be used in the logistic regression prediction model. But if that same model is then used on a group of developmental mathematics students representing a population with a true base rate of 20%, then many errors in classification will occur, since the prediction model will be invalid for the latter group of students. Kvalseth, cited in [8], stated that one of the eight properties $R^2$ should have is that its values for different models fitted to the same dataset are directly comparable. Menard explained this to mean that the coefficient of determination should be comparable across not only different predictors but also different dependent variables and different subsets of the dataset. An anonymous reviewer cited in [8], however, cautions that comparisons of performance across different datasets make sense only if the explanatory variables have a meaningfully defined distribution. In response to this remark, [4] carried out research on the assessment of the 'adequacy-of-fit of logit models' under the dichotomous response classified by its probability levels using Exponential, Bernoulli and Multinomial distributed explanatory variables. They used a single sample size of 200 and a single replication size of 1000 in their study. Their study does not give the "best" goodness-of-fit statistic among the studied statistics, but they recommend that 'further studies for more details in the exponential explanatory distribution together with increased sample sizes' be carried out.

This research is, therefore, an extension of the work of [4] and further investigates the goodness-of-fit performances of the eight statistics of equations (2) to (9) using the Bernoulli and Binomial distributions as explanatory variables, by investigating the effect of base rates on the performances of these statistics under variable conditions of sample sizes, replications and parameter values for stability of results. The study also draws conclusions on the "best" fit.

Literature Review
[9] used Monte Carlo simulation methods to investigate how the three indices of predictive efficiency in equations (2), (3) and (4) respond to base rate changes; one of the indices was found to be almost the most sensitive to base rate changes. They recommended that future research should investigate these same indices for base rate influence within other designs of logistic regression models, for it is possible that there are other predictor variable combinations which alter the patterns of the index means. They insisted that the search for a base rate invariant predictive efficiency index must continue: for as long as model conditions dictate the appropriateness of an index for assessing efficiency, ambiguity will continue to exist regarding which index to use and when to use it.
Again, in his study of 'Coefficients of Determination for Multiple Logistic Regression Analysis', [8] found that $R^2_{MF}$ seems preferable to the other $R^2$ analogs in two respects. Firstly, $R^2_{MF}$ has the most intuitively reasonable interpretation as a proportional reduction in error measure, parallel to $R^2_o$. Secondly, $R^2_{MF}$ stands out for its independence from the base rate, relative to the other $R^2$ analogs. He also reports that, in his study, with a manipulated base rate but no constraints on the selection ratio, one index of predictive efficiency also stands out for its relative independence from the base rate. This is at odds with the finding of [9], mentioned in the preceding paragraph, on which index of predictive efficiency is best when both the base rate and the selection ratio are fixed. [8] defines the selection ratio as the proportion of cases for which $\hat{y} = 1$; it is of concern only when some fixed proportion of cases is in some sense "selected" as successes or, equivalently, when the threshold or "cut point" for classification is fixed at some value other than the usual expected value of $1/k$, where $k$ is the number of categories in the dependent variable. [8] thus recommends that future research undertake a more detailed mathematical analysis to determine the nature of the relationship of the various $R^2$ analogs and indexes of predictive efficiency with the base rate. In addition, he goes on to say that simulation research like the work of [9] on the indices of predictive efficiency would be useful in studying $R^2$ analogs to examine in more detail their relationship to the base rate, their behavior in the presence of unreliability of measurement, and the extent to which the limited result presented in his study may be peculiar to the data he used, or generalizable to a broader range of data.
In line with the two works cited above, [4] carried out research on the assessment of the 'adequacy-of-fit of logit models' under the dichotomous response classified by its probability levels using Exponential, Bernoulli and Multinomial distributed explanatory variables. Their findings include the coefficients of correlation between the $R^2$ analogs and the base rate levels for exponentially distributed $X$s. They further show that for the Bernoulli and Multinomial independent variable models, the correlation coefficients are approximately the same for the $R^2$ analogs and 'most values' tend to be more independent from base rates than those of the exponential distribution. Furthermore, the results of the correlation coefficients between the indexes of predictive efficiency and base rate levels show that two of the three indexes perform better than the third. This agrees partly with the work of [9], who posit that one particular index is the most base rate invariant if the model comprises two dichotomous variables, and partly with the work of [8]. Finally, [4] recommend the use of four of the studied statistics as goodness-of-fit measures for logit models with dichotomous response and exponential explanatory variables. They note, however, that when $P$ is close to 0.5, the percentage correct is low and the range is high for the exponential response model. They thus recommend that 'further studies for more details in the exponential explanatory distribution together with increased sample sizes' be carried out. [10,11,12], [13], [14] and [15] all investigated the properties of various pseudo-$R^2$s using Monte Carlo experiments. These studies conducted simulations to determine how closely most of the pseudo-$R^2$s correspond to the OLS $R^2$ on the underlying latent variable model. This paper does not use this approach but rather the approach specified in section 1.

Methodology
This paper is a comparative study of the performances of some goodness-of-fit statistics under various scenarios. The Monte Carlo simulation (experimentation) method is used to generate the data. The simulation study is undertaken using sample sizes of 100, 200 and 500 and replication sizes of 1000, 2000 and 5000 for stability of results. The base rates are 0.20, 0.35, 0.50 and 0.65. The analyses are carried out using the multiple correlation method. The correlation coefficients indicate which of the goodness-of-fit measures are correlated with the base rate under the sample size, replication and parameter value criteria. Any goodness-of-fit measure that is not correlated under these criteria is seen, in this sense, as a good measure of goodness-of-fit. For ease of interpretation, the results of the analyses are ranked. Ranks are used to find the "best" statistic and the most variable of the studied probability distributions. "Best" in this study indicates the goodness-of-fit statistic that is most invariant to the base rate: the statistic with the least total rank under the studied scenarios is considered the "best" statistic. The coefficient of variation is used to find the variability of the distributions. The probability distribution that has the highest total variability rank over all the studied statistics is seen as the most variable probability distribution.
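The ranking and variability steps described above can be sketched as follows. This is an illustration in Python with NumPy rather than the STATA program used in the study, and it rests on stated assumptions: ranks are assigned within each scenario to the absolute correlation with the base rate (rank 1 for the smallest), ties are broken by order of appearance, the coefficient of variation uses the sample standard deviation, and the statistic names and numbers in the example are hypothetical.

```python
import numpy as np


def coefficient_of_variation(values):
    """C.V. = sample standard deviation divided by the mean (assumed definition)."""
    values = np.asarray(values, dtype=float)
    return float(values.std(ddof=1) / values.mean())


def total_ranks(abs_corr):
    """Sum, over scenarios, of each statistic's within-scenario rank.

    abs_corr maps a statistic name to a list of |correlation with base rate|,
    one entry per scenario. A smaller total rank means the statistic is
    less tied to the base rate across the scenarios.
    """
    names = list(abs_corr)
    table = np.array([abs_corr[name] for name in names], dtype=float)
    # Double argsort along the statistic axis turns values into ranks 1..m.
    ranks = table.argsort(axis=0).argsort(axis=0) + 1
    return dict(zip(names, ranks.sum(axis=1).tolist()))


if __name__ == "__main__":
    # Hypothetical |correlations| for three statistics under two scenarios.
    example = {"stat_A": [0.1, 0.2], "stat_B": [0.5, 0.1], "stat_C": [0.3, 0.9]}
    print(total_ranks(example))
    print(coefficient_of_variation([1.0, 2.0, 3.0]))
```

Under this convention the statistic with the smallest total would be labelled "best"; a different tie-breaking rule or ranking direction would change the totals.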
All the analyses are carried out using STATA 12.0. The hypothesis of significant correlation coefficients was tested at the 5% level of significance, and the correlation coefficients that are significant at this level are marked with an asterisk (*). The Monte Carlo simulation program was also written with STATA 12.0 and is attached to this paper as the Appendix.

The Binomial Distribution
The Binomial Distribution was simulated using a fixed Binomial sample size of $n = 15$ and all the parameter values specified in section 3.1.1. A total of 108 simulations was also carried out. Generally, some of the above values are taken from the work of [4] cited above. Their study used a sample size of 200 and 1,000 replicated data sets under each of four levels of the probability of $Y = 1$: 0.05, 0.20, 0.35 and 0.50. The computer program written by the author for this work (see the Appendix) could not run using a base-rate level of 0.05 as used by [4]. This study therefore uses base rate levels of 0.20, 0.35, 0.50 and 0.65, following the range of values used in their work. Those authors did not use a base-rate level of 0.65; it is used here to see what happens to the models if there are more ones than zeros in the data. Their study was also carried out under the fixed conditions of a sample size of 200 and a replication size of 1000, while this study uses the variable sample sizes of 100, 200 and 500 as well as the variable replication sizes of 1000, 2000 and 5000, as mentioned before, to test for stability of results and in compliance with the recommendations of [4]. The parameter values which [4] used in their work for the Bernoulli distribution are also used in this work. Those values are extended to the Binomial distribution, with the addition of a sample size $n$ of 15 for the Binomial distribution. In all, the use of some values from the work of [4] is for comparison purposes.
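A simulation of this kind can be sketched as follows, in Python with NumPy rather than the STATA program of the Appendix. Everything specific in the sketch is an assumption made for illustration: a single Binomial(15, 0.30) regressor, a slope of 0.2, an intercept calibrated by bisection so that the expected proportion of ones equals the target base rate, a Newton-Raphson fit, only the log likelihood ratio $R^2$ as the fit statistic, and far fewer replications than the 1000 to 5000 used in the study.

```python
import math
import numpy as np


def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def intercept_for_base_rate(base_rate, slope, n_trials, p_x):
    """Bisection for b0 such that E[sigmoid(b0 + slope * X)] = base_rate,
    with X ~ Binomial(n_trials, p_x). Assumed calibration rule."""
    k = np.arange(n_trials + 1)
    pmf = np.array([math.comb(n_trials, int(i)) * p_x ** int(i)
                    * (1.0 - p_x) ** (n_trials - int(i)) for i in k])
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if float(np.sum(pmf * sigmoid(mid + slope * k))) < base_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fitted_probabilities(x, y, n_iter=25):
    """Maximum-likelihood logit fit of y on one regressor by Newton-Raphson."""
    X = np.column_stack([np.ones(len(x)), x - x.mean()])  # centring aids convergence
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None]) + 1e-10 * np.eye(2)  # tiny ridge for stability
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return sigmoid(X @ beta)


def log_likelihood_ratio_r2(x, y):
    """1 - ln(L_M) / ln(L_0) for the fitted and intercept-only models."""
    p = np.clip(fitted_probabilities(x, y), 1e-12, 1.0 - 1e-12)
    ll_model = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    p0 = np.clip(y.mean(), 1e-12, 1.0 - 1e-12)
    ll_null = len(y) * (p0 * np.log(p0) + (1.0 - p0) * np.log(1.0 - p0))
    return float(1.0 - ll_model / ll_null)


def simulate(base_rates=(0.20, 0.35, 0.50, 0.65), n=200, reps=200,
             slope=0.2, n_trials=15, p_x=0.30, seed=0):
    """Mean fit statistic per base rate, and its correlation with the base rate."""
    rng = np.random.default_rng(seed)
    means = {}
    for b in base_rates:
        b0 = intercept_for_base_rate(b, slope, n_trials, p_x)
        values = []
        for _ in range(reps):
            x = rng.binomial(n_trials, p_x, size=n).astype(float)
            y = (rng.random(n) < sigmoid(b0 + slope * x)).astype(float)
            values.append(log_likelihood_ratio_r2(x, y))
        means[b] = float(np.mean(values))
    corr = float(np.corrcoef(list(means.keys()), list(means.values()))[0, 1])
    return means, corr


if __name__ == "__main__":
    means, corr = simulate()
    print(means, corr)
```

A correlation near zero between the mean statistic and the base rate would point to base-rate invariance in the sense used in this paper; the other seven statistics could be added to the same loop in the same way.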

C.V. = Coefficient of Variation
In Tables 1 and 2, it is observed that amongst the indices of predictive efficiency, one index has the highest mean value but the least C.V. value, while among the pseudo-$R^2$ statistics, $R^2_N$ has the highest mean value but the second best C.V. value for the Binomial Distribution and the third best for the Bernoulli Distribution.

It therefore indicates that $R^2_N$ is the best pseudo-$R^2$ statistic to use at the sample size of 500 and all replication and parameter values. One index of predictive efficiency is the best at the sample size of 500, a replication value of 5,000 and parameter values of 0.10 and 0.30. Table 5 indicates that $R^2_o$, with a total rank of 32, stands out as the "best" statistic. It is followed by $R^2_C$ and $R^2_M$, in that order.

Conclusion
The findings of this paper clearly show that: For the Bernoulli Distribution,