# The impact of individual characteristics on the length of life in India – Oaxaca-Blinder decomposition, Logit model

**Abstract**

This study estimates the impact of social status, education and average living standards on the length of life by analyzing India’s mortality statistics in 2009 for two states, Uttarakhand and Bihar. Using several estimation methods, namely OLS, GLM and Logit regressions as well as the Oaxaca-Blinder decomposition, we find that education, electricity and access to a toilet significantly raise the length of life. We also find that members of the scheduled tribes live shorter lives, and that this difference cannot be explained by differences between the average values of the two groups’ characteristics.

## 1 Introduction

Dating back to the 18th century B.C., the Epic of Gilgamesh tells the story of Gilgamesh, the king of Uruk, who, after the death of his close friend, Enkidu, undertakes a long and dangerous journey to discover the secret of eternal life. He eventually understands that “Life, which you look for, you will never find. For when the gods created man, they let death be his share, and life withheld in their own hands”^{1}.

The search for the “fountain of youth”, the secret of longevity, has continued throughout history, and the desire to prolong our days is just as present today as it was 3800 years ago. Today we believe that improvements in health and nutrition, a healthy lifestyle and regular exercise can lengthen our lives, whereas diseases, malnutrition and stress can shorten our life span.

Life expectancy can be thought of as a reflection of the quality of life in a country: where the quality of life is higher, individuals can expect to live longer, fuller lives. It is a statistical measure of expected longevity, calculated from a fictitious group of people who go through their life under the conditions of their birth year. There are two weaknesses of life expectancy as a statistical indicator. Firstly, it is calculated from a fictitious group of people, and secondly, it is highly sensitive to the year in which it is calculated. We might consider the average length of life as a more realistic approach, which is defined as the average years lived by members of an observed cohort from their birth until their death. Naturally, as a consequence of improving conditions worldwide, the average calculated today from historic information may not apply to cohorts being born today.

Life expectancy has increased enormously, from 31 to 71.5 years over the last century **[Wikipedia]**. It varies widely between countries and is affected by economic and social conditions. This leaves us wondering which factors determine life expectancy at a national level. Several studies have shown that the social and economic conditions of a country, diseases and the healthcare system all affect life expectancy in each individual country **[T.Smith, 1993]**.

Richer countries have a higher life expectancy than poorer ones; however, above a certain level of income this correlation can no longer be observed **[Marmot, 2001]**. We expect that population characteristics, such as population composition, might be a source of variation between the life expectancies of different countries, as the urban population has better access to health care but is also exposed to a higher level of pollution and overcrowding. The epidemic effects of diseases such as AIDS on life expectancy are evident, while access to information and technology allows people to be aware of these diseases, of health discoveries and of violence in their surroundings. Education affects life expectancy positively, as educated people are in general more apt to make informed life decisions. Finally, the environment, access to drinking water and, in general, the sanitary conditions of a country strongly influence the life expectancy of its population.

The second question we might ask is which indicators cause life expectancy to vary within a country. What are the key individual elements that determine how long we live?

The genetic background explains a great deal of our health, but not everything. Many of the indicators mentioned in the previous paragraph, such as the individual level of education or whether we live in a rural or urban area, are relevant here as well. Moreover, our habits and behavior, such as smoking, drinking and physical activity, largely influence how long we might live. What we do for a living also counts. The “status syndrome” was described by Marmot **[Marmot, 2003]**, who observed a strong correlation between lower grade levels and an increased risk of mortality. Poorer people live shorter lives on average. According to the University of Boston, School of Public Health, the richest 1% of Americans live an average of 10 to 15 years longer than the poorest 1% **[Dickman et al., 2017]**. Poverty has always been linked to poorer health outcomes, as people in low income groups can afford less or no health care. Moreover, many unhealthy conditions and habits, such as smoking and obesity, are more prevalent among the poorest.

In this study, we analyze key individual factors influencing the years lived by people deceased in 2009 in India. We ask, given the health spending, sanitary conditions and population characteristics of each state, which individuals lived longer. What effects did education, information, the environment, employment, and smoking and drinking habits have on the years the individuals actually lived? Finally, discrimination against tribes and castes is also examined. In this way, we too follow in the footsteps of Gilgamesh.

## 2 Data

### 2.1 General data description

For this study, we use the Annual Health Survey: Mortality Schedule, conducted in 2010-11 and updated in 2012-2013 by the Department of Health and Family Welfare, Government of India. This dataset contains district level mortality statistics of the Empowered Action Group States^{2} and Assam. These nine states account for about 48% of the total population, 59% of births and 70% of infant deaths in India.

The data contain information on deaths that occurred during the reference period 2007-2012, including the gender, age, level of education, profession and general quality of life of the deceased, and the source of medical attention received prior to death. There is a total of 770 thousand observations and more than 120 variables, so data selection and cleaning are necessary. We have chosen 2009 as the year of study to avoid possible autocorrelation of errors.

^{2} In India, the eight socioeconomically backward states of Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Orissa, Rajasthan, Uttaranchal and Uttar Pradesh, referred to as the Empowered Action Group (EAG) states, lag behind in the demographic transition and have the highest infant mortality rates in the country.

Figure 1 shows the GDP per capita **[Central Statistics Office]** and the average years lived by individuals deceased in 2009 in each of the nine states. We observe a positive correlation between income per head and years lived in the sample. To model the difference between the states as clearly as possible, this study focuses on two states: Uttarakhand (50657 Indian rupees) and Bihar (13728 Indian rupees), which have the highest and lowest GDP per capita in 2009 in our database, respectively. We do not think that restricting the sample in this way will result in selection bias, as the cleaned database still contains 26 476 observations.

**Figure 1.**

The variables used to explain the length of life in the dataset include gender, caste, urban or rural residence, whether the deceased was a smoker, the level of literacy and education, professional occupation, the health care provider and whether he/she had electricity and a private toilet. These variables were chosen by the Stepwise method presented in part 3.1.

The explained variable, the years lived by the individuals of the sample, merits an explanation. The survey contains the age of individuals who passed away in 2009; its mean can therefore be thought of as the average length of life of people who died in this year. We are aware that this approach is not identical to the general method of calculating the average length of life (explained in section 1); however, we argue that it can still be used as a good approximation of the average length of life in the chosen states, as the dataset contains observations on a large number of individuals and as we assume that the causes of death of individuals can be treated as independent of one another.

**Figure 2. Figure 3.**

The distribution of years lived in the two states is shown in Figures 2 and 3. As we see, the average years lived is around 55.5 and 51 in Uttarakhand and Bihar respectively. We observe a high rate of child mortality in both states but, surprisingly, it is higher in Uttarakhand than in Bihar, even though the latter is poorer. The only plausible explanation that we can think of is that not all child deaths are reported, so that this data is not reliable. As the data set contains observations from a survey answered voluntarily, we fear that this difference might distort our analysis. Furthermore, the variables that we investigate in this study, such as smoking or education, do not affect child mortality directly, so we decided to exclude observations of individuals who died before the age of 15. The distribution of years lived after this reduction of the data is shown in Figures 4 and 5. The average years lived for the cleaned and selected data is 62 and 61 in Uttarakhand and Bihar respectively.

**Figure 4. Figure 5.**

In order to bring this distribution closer to normal, we take the logarithm of years lived by individuals. The distribution of logarithmic years lived is shown in Figures 6 and 7. As a consequence, we will use the logarithm of years lived as the explained variable. Furthermore, we also create a dummy, More than median, that indicates whether the individual lived more or less than the median in his state of residence.

**Figure 6. Figure 7.**
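The construction of the two outcome variables can be sketched as follows. This is a minimal illustration rather than the code used for the study: the records, the field names and the helper `add_outcomes` are hypothetical, and the age cut-off of 15 follows the exclusion rule described above.

```python
import math
from statistics import median

# Hypothetical records: (state, age at death in years); not taken from the survey.
records = [
    ("Uttarakhand", 72), ("Uttarakhand", 58), ("Uttarakhand", 65),
    ("Bihar", 80), ("Bihar", 47), ("Bihar", 61), ("Bihar", 3),
]

def add_outcomes(rows, min_age=15):
    """Return dicts with log years lived and a per-state 'more than median' dummy."""
    # Drop deaths below the age cut-off, as in the data cleaning step
    kept = [(state, age) for state, age in rows if age >= min_age]
    # Median age at death within each state
    state_median = {
        s: median(age for st, age in kept if st == s)
        for s in {st for st, _ in kept}
    }
    return [
        {
            "state": state,
            "years_lived": age,
            "log_years": math.log(age),
            "more_than_median": int(age > state_median[state]),
        }
        for state, age in kept
    ]

rows = add_outcomes(records)
```

Computing the median separately for each state mirrors the definition of the dummy, which compares each individual with the median of his own state of residence.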

Besides, the original data set included both categorical and dummy variables. Dummy variables were used as they were, while categorical variables required additional treatment. The reason is that numerically coded categorical variables cannot be interpreted meaningfully in regression models. For example, the initial variable “Occupation Status” consisted of 14 categories, from cultivators to beggars, indicated by a number from 1 to 14. Introducing this variable in the regression without changing its values and structure assumes that level “5” of occupational status has a fivefold effect in comparison to level “1”. This is clearly wrong, so an efficient use of these variables required the creation of dummies. We created them for the most influential occupations only, as we chose not to include a dummy for every occupational status.

In the case of education, smoking and alcohol consumption, we used regrouping. Instead of creating a dummy variable for every level of education, we simply created two dummies, literacy and high level of education.
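The recoding of a numerically coded category into dummies can be illustrated with a short sketch. The code numbers below are assumptions made purely for illustration (the survey's code book is not reproduced in this paper), and `occupation_dummies` is a hypothetical helper.

```python
# Hypothetical mapping from numeric occupation codes to the four dummies;
# the survey's real code numbers may differ.
OCCUPATION_GROUPS = {
    2: "laborer",    # assumed code for agricultural laborers
    5: "salaried",   # assumed code for regular salaried employees
    7: "self",       # assumed code for the self-employed
    9: "domestic",   # assumed code for domestic workers
}
DUMMIES = ("domestic", "laborer", "salaried", "self")

def occupation_dummies(code):
    """Map one numeric occupation code to 0/1 indicators.

    Codes outside OCCUPATION_GROUPS fall into the reference category,
    for which all dummies are zero.
    """
    group = OCCUPATION_GROUPS.get(code)
    return {name: int(name == group) for name in DUMMIES}

salaried_row = occupation_dummies(5)
reference_row = occupation_dummies(14)
```

With this encoding each occupation gets its own coefficient, instead of a single slope that would force code 5 to have five times the effect of code 1.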

### 2.2 Explanatory variables used: a theoretical explanation

• **Gender**: On the one hand, we include this dummy variable because India has an imbalanced sex ratio and female disadvantage is a major concern, as a highly skewed child sex ratio is distorting the demographic profile of several states^{5}. On the other hand, we observe that females live longer on average, so we expect a positive coefficient in a regression on the cleaned database.

• **Rural**: As mentioned in the introduction, people living in urban areas have better access to health care but they also experience an increased level of pollution and overcrowding.

• **Smoker**: We include smoking because the harmful health effects of cigarettes are well documented. According to the University of California, every cigarette smoked reduces life by 11 minutes [University of California].

• **Literacy, Educ sup**: As mentioned in the introduction, more educated people live longer on average, as they are more apt to make better life decisions. Knowledge about various types of diseases and the ways to prevent them depends on education. Attitudes towards sanitation, hygienic conditions and the use of clean drinking water depend on the degree of awareness, which, again, depends on people’s general knowledge. Moreover, although education does not directly affect health status, it does create the potential mechanism through which health status can be improved. A study conducted on Swedish men born between 1945 and 1955 reveals that an additional year of schooling reduces the risk of bad health by 18.5%^{6}. We use two dummies, literacy and educ sup, which indicate whether the individual is literate and whether he/she pursued higher education, respectively.

• **Scheduled tribe:** One in six Indians belongs to the scheduled castes, the “untouchables” or “dalits”^{7}, a group that is socially segregated and economically disadvantaged by its lower status in the traditional Hindu caste hierarchy. Upper-caste Hindus traditionally treated untouchables as agents of pollution. These communities have historically been denied access to education and to public places such as temples and drinking water wells. Occupationally, most scheduled tribes were landless laborers engaged in what were traditionally considered to be ritually polluting occupations, such as human waste collection and cremation ceremonies.

The reason we find it important to include scheduled tribe among the other variables is that, despite the fact that India’s constitution^{8} outlawed caste-based discrimination, it is still largely present in Indian society. According to a 2014 report by the IndiaGovern Research institute, Dalit children constitute nearly half of primary school drop-outs. 88% of state schools discriminate against “untouchable” students, and 79% require them to sit in the back of the classroom, to sit separately at lunch from children of upper castes and to eat from specially marked plates^{9}. 93% of dalit families still lived below the poverty line in 2012 according to a survey conducted by Mangalore University. Therefore, this study attempts to test for differences between the effects of education and of living standards for the upper castes and the scheduled tribes.

Figure 8 shows the average years lived by individuals deceased in 2009 by social class, for all observations in the dataset (including child mortality). The value of 46.8 years lived on average by members of the scheduled tribes deceased in 2009 is shockingly low: it corresponds to the life expectancy in Europe in 1913^{10}.

**Figure 8.**

We see that there is a large difference in the average years lived between the social castes in both states, but that it is even larger in Uttarakhand. The initial hypothesis is that, as Uttarakhand is richer than Bihar, its citizens live longer on average because they benefit from state investments such as better infrastructure, more efficient waste management or access to drinking water. However, it seems that untouchables do not benefit from these advantages in the same way as individuals from the upper castes. This prompts us to investigate whether the difference between the average years lived by members of the scheduled tribes and by individuals belonging to the upper castes is explained by differences between the individuals or not.

• **Toilet:** Studies have also revealed that the survival rates of children go up with the rise in accessibility of piped water and toilet facilities^{11}. We therefore assume that the length of life of the chosen cohort (individuals deceased above age 15 in 2009, in Uttarakhand or in Bihar) is also affected by access to toilets.

• **Electricity:** We use electricity as a dummy for better living conditions.

• **Domestic, Laborer, Salaried, Self:** As already mentioned in the introduction, occupational status can affect our health through the status syndrome, that is, the stress we experience during our professional life or as a result of subordination. We therefore include these 4 dummies to control for possible differences in health effects related to occupations. Laborer covers agricultural laborers, self the self-employed or unpaid family laborers, salaried the regular salaried employees and domestic the domestic workers. If all occupational dummies are zero, the individual did not work because he/she was too old, too sick or disabled, or was a beggar or prostitute. We therefore expect all occupational dummies to have a positive effect on the years lived, and we are more interested in the differences between these effects than in their signs.

• **Treatment before the death:** The dataset also contains data on the treatment that the deceased received just before his/her death. Unfortunately, this variable therefore does not capture the treatment received over the total lifespan of each individual.

In India, the health sector is largely shaped by the country's federal structure: the states are responsible for organizing and delivering health services to their residents, while the central government is only responsible for medical education, prevention of food adulteration, quality control in drug manufacturing, national disease control and family planning programs.

Total health expenditures in India for 2013–2014 were 4.02% of GDP, out of which government expenditures amounted to only 1.15% of GDP. Household out-of-pocket health spending was 69.1% of total health expenditures, making this a major component of the financing system **[Ministry of Health, 2016]**.

We therefore use two dummies. The first, no treatment, indicates that the individual received no treatment before his/her death. These people either had no access to health care (because of poverty or their place of residence) or had no prolonged health problems. The second dummy, private treatment, shows whether the individual had access to private, out-of-pocket treatment before his/her death. If both dummies take the value of zero, the individual received state treatment.

## 3 Econometric methods

### 3.1 OLS method

The baseline regression for our study is the linear regression model, which assumes a linear relationship between the explained and explanatory variables, error terms that are normally distributed with an expected value of zero and a constant variance, and independent observations.

The first step of the analysis is the appropriate selection of variables based on the Stepwise method **[StepWise, NCSS]**. The results are shown in Appendix C. This procedure allows us to choose the variables that contribute the most to an increased R^{2} by successively adding and removing variables according to their significance. If a variable does not improve the fit of the OLS model, it is removed. Following the Stepwise procedure, we retain 15 variables.

Finally, our regression model takes the form:
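In generic notation, and assuming that the 15 selected dummies are simply indexed k = 1, …, 15 (the symbols are illustrative and the individual variables are not spelled out here), a log-linear model of this kind can be written as:

$$\log(Y_i) = \beta_0 + \sum_{k=1}^{15} \beta_k X_{ki} + \varepsilon_i$$

where $Y_i$ denotes the years lived by individual $i$, $X_{ki}$ the value of the k-th selected dummy and $\varepsilon_i$ the error term.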

### 3.2 Oaxaca-Blinder Decomposition

The Oaxaca-Blinder decomposition^{12} explains the difference in the means of a dependent variable between two mutually exclusive groups by decomposing the gap into a part that is due to differences in the mean values of the explanatory variables between the groups (explained difference) and a part that is due to group differences in the effects of the explanatory variables (unexplained difference). Our motivation for using the Oaxaca-Blinder method is to see whether education, occupation and the quality of life have the same effects for members of the scheduled tribes and of the upper castes.

The approach decomposes the mean outcome difference with respect to a vector of reference coefficients b. In the case of health inequalities in India, the coefficient vector is interpreted as non-discriminatory, in other words, as the set of regression coefficients that would emerge in a world with no discrimination against the Untouchables.
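In its standard twofold form, the decomposition can be sketched as follows. The notation is an assumption chosen to match the reference vector b: A denotes the upper castes, B the scheduled tribes, $\bar{X}$ the group means of the explanatory variables (including a constant) and $\hat{\beta}$ the group-specific coefficient estimates.

$$\bar{Y}_A - \bar{Y}_B = \underbrace{(\bar{X}_A - \bar{X}_B)'\, b}_{\text{explained}} + \underbrace{\bar{X}_A' (\hat{\beta}_A - b) + \bar{X}_B' (b - \hat{\beta}_B)}_{\text{unexplained}}$$

Expanding the right-hand side, the terms in b cancel and leave $\bar{X}_A' \hat{\beta}_A - \bar{X}_B' \hat{\beta}_B$, which is the difference in mean outcomes.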

As the above equation shows, the twofold decomposition divides the difference in mean outcomes into a portion that is explained by cross-group differences in the explanatory variables, and a part that remains unexplained by these differences.

It is important to keep in mind that the unexplained portion of the mean outcome gap can be interpreted as discrimination, but may also result from the influence of unobserved variables. Since we have many unobserved variables, such as the income of the deceased, we are aware that any gaps we find may result from differences in these missing variables.
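A small numerical sketch may help to fix ideas. It computes a twofold decomposition from group means and coefficients, using group A's coefficients as the reference vector by default, which is an assumption of this sketch and not necessarily the choice made by the SAS macro. All numbers are hypothetical, and `twofold_decomposition` is an illustrative helper.

```python
def twofold_decomposition(mean_a, mean_b, coef_a, coef_b, coef_ref=None):
    """Twofold Oaxaca-Blinder decomposition from group means and coefficients.

    Each argument is a list whose first entry belongs to the intercept
    (mean 1.0). coef_ref is the reference ("non-discriminatory") vector;
    defaulting to group A's coefficients is an assumption of this sketch.
    """
    if coef_ref is None:
        coef_ref = coef_a
    # Difference in predicted mean outcomes between the two groups
    gap = sum(x * b for x, b in zip(mean_a, coef_a)) - sum(
        x * b for x, b in zip(mean_b, coef_b)
    )
    # Part due to different average characteristics
    explained = sum((xa - xb) * b for xa, xb, b in zip(mean_a, mean_b, coef_ref))
    # Part due to different coefficients (including the intercept)
    unexplained = sum(
        xa * (ba - b) + xb * (b - bb)
        for xa, xb, ba, bb, b in zip(mean_a, mean_b, coef_a, coef_b, coef_ref)
    )
    return gap, explained, unexplained

# Hypothetical inputs: an intercept plus one literacy dummy
mean_a, coef_a = [1.0, 0.62], [60.0, 1.0]
mean_b, coef_b = [1.0, 0.51], [57.0, 3.5]
gap, explained, unexplained = twofold_decomposition(mean_a, mean_b, coef_a, coef_b)
```

By construction the explained and unexplained parts add up to the gap, whichever reference vector is chosen; only the split between the two parts depends on that choice.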

### 3.3 Generalised Linear Model

The Generalised Linear Model (GLM) is a flexible generalization of ordinary linear regression that allows the explained variable to be related to a linear model of the explanatory variables via a link function. The generalised linear model with Yi as the response variable is therefore defined by three components:

- a conditional distribution of the response variable $Y_i$ given the values of the explanatory variables in the sample, with $E(Y_i \mid X) = \mu_i$, where the distribution of $Y_i$ is a member of the exponential family, such as the Gaussian (normal), binomial, Poisson or Gamma distributions;
- a linear predictor, that is, a linear function of the regressors, such that:

$$\eta_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_p X_{pi}$$

- and a smooth and invertible linearizing link function $g(\cdot)$, which transforms the expectation of the response variable, $\mu_i \equiv E(Y_i \mid X)$, to the linear predictor:

$$g(\mu_i) = \eta_i$$

As the link function is invertible, we can also write:

$$\mu_i = g^{-1}(\eta_i)$$

Thus, the GLM may be thought of as a linear model for a transformation of the expected response or as a nonlinear regression model for the response. We will use two forms of the generalized linear model: a GLM with a normal distribution and a log link function, and the logit model. GLMs are fit to data by the method of maximum likelihood, providing estimates of the regression coefficients.

### 3.3.1 Normal generalized linear model with a log link function

Our first GLM assumes a normal distribution, ε ∼ N(0, σ^{2}), but with a log link function, so that log(E(Y_{i}|X)) = β_{0} + β_{1} X_{1i} + … + β_{p} X_{pi}. We chose a normal distribution with a logarithmic link function because we observe only positive values and because we wanted to allow for a non-linear relationship between E(Y) and the linear predictor η = Xβ.

In our case, therefore, the exact effect on the response variables might be translated as:
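With a log link, a standard way to express this effect, written here in illustrative notation, is that switching a dummy $X_k$ from zero to one multiplies the expected response by $e^{\beta_k}$, all other variables being held constant:

$$\frac{E(Y_i \mid X_{ki} = 1)}{E(Y_i \mid X_{ki} = 0)} = e^{\beta_k}$$

which corresponds to a change of $100\,(e^{\beta_k} - 1)$ percent in the expected years lived.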

### 3.3.2 Logit model

Unlike linear regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is the probability that the given input point belongs to a certain class. In our model, we attempt to model whether an individual lives more or less than the median years lived by residents of his state.

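We consider the case where the response $y_i$ is binary, taking only the values one and zero. We view $y_i$ as a realization of a random variable $Y_i$ that takes the values one and zero with probabilities $\pi_i$ and $1 - \pi_i$, respectively.

We assume that $Y_i$ has a Bernoulli distribution, that is, a binomial distribution with a single trial, $Y_i \sim B(1, \pi_i)$, and therefore we have:

$$E(Y_i) = \mu_i = \pi_i$$

and

$$\mathrm{Var}(Y_i) = \pi_i (1 - \pi_i)$$

We suppose further that the logit of the underlying probability $\pi_i$ is a linear function of the predictors, $\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i' \beta$. Exponentiating this equation, we find that the odds for the i-th observation are given by:

$$\frac{\pi_i}{1 - \pi_i} = \exp(x_i' \beta)$$

or, equivalently:

$$\pi_i = \frac{\exp(x_i' \beta)}{1 + \exp(x_i' \beta)}$$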

Since the logit model is a GLM with a binomial distribution and a logit link, we can use maximum likelihood to estimate it.
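The link between the linear predictor, the probability and the odds can be sketched in a few lines. The coefficients below are hypothetical, and the helper names are chosen for illustration only.

```python
import math

def inverse_logit(eta):
    """Probability implied by a linear predictor eta under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds p / (1 - p) of an event with probability p."""
    return p / (1.0 - p)

def odds_ratio(beta):
    """Odds ratio associated with a one-unit change in a regressor."""
    return math.exp(beta)

# Hypothetical coefficients: an intercept and one dummy (e.g. higher education)
beta_0, beta_1 = -0.2, 0.45
p_without = inverse_logit(beta_0)          # dummy equal to zero
p_with = inverse_logit(beta_0 + beta_1)    # dummy equal to one
ratio = odds(p_with) / odds(p_without)     # equals exp(beta_1)
```

The ratio of the two odds does not depend on the intercept, which is why the exponentiated coefficient of a dummy can be reported as a single odds ratio.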

## 4 Results

### 4.1 Results of the Ordinary Least Squares Regression

Appendix D contains the results of both OLS regressions, for Uttarakhand and Bihar. In both cases, we used a randomly selected 70% of the total observations to estimate our model (training dataset), and we used the remaining 30% of the observations to test the model (validation dataset).
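A random 70/30 split of this kind can be sketched as follows. The seed, the helper `train_validation_split` and the stand-in data are assumptions for illustration; the original analysis was carried out in SAS.

```python
import random

def train_validation_split(rows, train_share=0.7, seed=0):
    """Shuffle a copy of rows and split it into training and validation parts."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible
    random.Random(seed).shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

observations = list(range(100))  # stand-in for the survey records
train, validation = train_validation_split(observations)
```

Every observation ends up in exactly one of the two parts, so the validation set is never used for estimation.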

Although the OLS results are globally significant for both states (the F-values are large), the R^{2} values are especially low. R^{2} indicates how well the model fits the data. We attribute the low values to the sheer volume of observations (6276 and 9369 observations for Uttarakhand and Bihar respectively), to the fact that our model misses important variables (such as the income of the deceased) and to the heterogeneity of the individuals. We therefore recognise that this model has significant limitations; however, we use it as a first step towards modelling the social effects on health. As the R^{2} is very low, we do not expect the model to predict the validation dataset well. We find that the model predicts 5% of the observations in Bihar within a 2 year interval, and 7% in Uttarakhand.
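The share of validation observations predicted within a given interval can be computed along the following lines. Reading "within a 2 year interval" as an absolute error of at most two years is an assumption, as is the simple back-transformation of log predictions with the exponential function; the data are hypothetical.

```python
import math

def share_within(actual_years, predicted_log_years, tolerance=2.0):
    """Share of observations whose predicted years lived fall within
    `tolerance` years of the actual value.

    Predictions are on the log scale and are converted back with exp();
    interpreting the interval as +/- `tolerance` years is an assumption.
    """
    hits = sum(
        abs(math.exp(pred) - actual) <= tolerance
        for actual, pred in zip(actual_years, predicted_log_years)
    )
    return hits / len(actual_years)

# Hypothetical validation data
actual = [70, 45, 62, 81]
predicted = [math.log(61.0), math.log(60.0), math.log(61.5), math.log(63.0)]
rate = share_within(actual, predicted)
```

In this toy example the predictions cluster around 61 years, so only the individual who actually died at 62 is predicted within the tolerance, which is the pattern one would expect from a model with a very low R^{2}.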

As we regressed the log of the years lived on the explanatory variables, the coefficients can be interpreted approximately as the percentage change in the years lived when the variable changes by one unit. As we have only dummies in our model, the coefficient shows the approximate percentage change in the years lived when the dummy takes the value of one.
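The quality of this approximation depends on the size of the coefficient. The sketch below compares it with the exact percentage change for two coefficient values, 0.034 and 0.18, which are assumed here to correspond to effects of the size reported in this section.

```python
import math

def exact_percent_change(coefficient):
    """Exact percentage change in the outcome when a dummy switches from
    0 to 1 in a regression with a log-transformed dependent variable."""
    return 100.0 * (math.exp(coefficient) - 1.0)

# Assumed coefficient values, chosen to match effects of about 3.4% and 18%
for b in (0.034, 0.18):
    approx = 100.0 * b
    print(f"coefficient {b}: approx {approx:.1f}%, exact {exact_percent_change(b):.2f}%")
```

For small coefficients the two readings are almost identical, while for a coefficient of 0.18 the exact change is close to 19.7% rather than 18%.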

where *** – significant at 1%, ** – significant at 5%, *-significant at 10%.

We observe that being female increases the years lived in both states. This is expected, as on average females live longer than males. In Bihar, however, the coefficient is much lower in magnitude and also insignificant.

Interestingly, literacy seems to increase the length of life by more in Uttarakhand than in Bihar, by 3.4% and 2.7% respectively. This is not what we expected, as literacy should increase the length of life more in a poorer state, where the average years lived is lower. We therefore test whether the coefficients are significantly different by conducting a z-test, and we find that, at 5%, they are significantly different. Higher education follows exactly the same pattern: its coefficient is higher in Uttarakhand and significantly different from the coefficient of the regression conducted in Bihar. An individual who attended higher education lives 8.1% longer in Uttarakhand but only 7.5% longer in Bihar.
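A common form of such a test for two coefficients estimated on independent samples divides their difference by the square root of the sum of their squared standard errors. The sketch below assumes this form; the standard errors are hypothetical, and the coefficient values 0.034 and 0.027 are assumed from the reported percentages.

```python
import math

def coefficient_z_test(b1, se1, b2, se2):
    """z statistic and two-sided p-value for H0: b1 == b2, assuming the two
    estimates come from independent samples (here, two different states)."""
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Assumed literacy coefficients; the standard errors are hypothetical
z, p = coefficient_z_test(0.034, 0.002, 0.027, 0.002)
```

With these illustrative standard errors the difference would be significant at 5%; with larger standard errors the same difference in coefficients would not be.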

The variable “rural” has a negative effect in Bihar and a positive one in Uttarakhand. However, in the case of Bihar, it is not statistically significant. Toilet again has a higher effect in Uttarakhand than in Bihar, a result that, again, we did not expect. The difference is statistically significant at 5%. Finally, in the case of Bihar, all forms of employment have an insignificant coefficient except self-employed, which increases the length of life by 4.6%, while in the case of Uttarakhand it increases life by only 3.35%. “Salaried” and “domestic” are both significant at 5%, both having a positive effect on the years lived, as expected.

The coefficients on private treatment and on no treatment before death are positive for both states; however, their magnitudes are very different. In Bihar, “no treatment” increases life by 18% and private treatment by 11%, while in Uttarakhand these values are 6.5% and 5.9%. The coefficients are significantly different at 5%. As already mentioned in section 2.2, if both treatment types take the value of zero, the patient was treated in a public or governmental facility. We assume that “no treatment” includes people who were not sick enough to visit a medical facility before their death, so that they lived longer because they were in better health and not as a consequence of receiving less treatment, while private treatment is better than public treatment, hence the positive coefficients. Other variables, such as smoking or electricity, are insignificant and therefore not discussed here.

Finally, we see that belonging to the scheduled tribe, holding other variables constant, reduces years lived by 5% in Uttarakhand and by 2% in Bihar. Of course, we realize that this might be correlated with the effects of income, for which we simply cannot control here. We therefore conduct an Oaxaca-Blinder decomposition to see the possible differences between the effects of education, employment and living standards for members of the scheduled tribes and members of the higher castes.

### 4.2 Results of the Oaxaca-Blinder decomposition

As shown in section 2.2, the difference between the average years lived by members of the scheduled tribes and of the upper castes is larger in Uttarakhand than in Bihar; we therefore chose to conduct the Oaxaca-Blinder decomposition for the former state. Figure TT shows the estimates of the coefficients of the linear regression model for the two groups, the gap between them and the explained and unexplained percentages of this gap. Furthermore, we chose to regress years lived and not the logarithm of years lived, as the SAS macro we used was unable to handle logarithmic values efficiently.

“Coef A” shows the coefficients for members of the upper castes, while “Coef B” shows those for the scheduled tribes. The “Gap”, the difference between the two estimated coefficients, is broken down into two parts, the explained (“Exp.”) and the unexplained (“Unexp.”) difference between them. We see for instance that the effect of literacy is stronger for the untouchables than for others, since being literate adds more than 3.5 years to the years lived in the case of the Untouchables and only a year to the length of life of others. The mean values, “Mean A” and “Mean B”, show the percentage of the upper castes and of the scheduled tribes that are literate, 62% and 51% respectively. We therefore conclude that members of the scheduled tribes live under worse conditions, so that it is even more important for them to be literate.

We observe the same pattern for the variable “Electricity”, as its effect is almost three times as large for the Dalits as for others. We also see that only 2% of this difference is explained by the model. The total gap between the two groups is 1.566, meaning that members of the scheduled tribes live on average less than members of the upper castes. Lastly, 88% of this difference is explained by differences in the variables. Finally, we realize that this model does not capture well the difference in the length of life of the two groups, possibly as a result of two drawbacks. Firstly, the macro estimates the differences by a linear regression model. As already mentioned in part 4.1, the linear model is not a good approach here, as the goodness of fit is very low and as we might have endogenous variables. Secondly, the gap shown in the Oaxaca-Blinder decomposition might be caused by differences between the two groups that are not controlled for by the model.

### 4.3 Results of GLM

### 4.3.1 GLM with normal distribution and log link function

The results for the GLM models are shown in table vv for the two states. We see that almost all coefficients of the regression conducted on the data of Bihar are insignificant at 1%, except educ sup, toilet, self employed and the two forms of treatment, all of which have a positive effect on the years lived. The Uttarakhand regression seems to capture the relationship between the explained and explanatory variables better, as literacy, female, rural, toilet and the two types of treatment are all significant and positive; moreover, belonging to the scheduled tribe gives a negative and significant coefficient.

The goodness of fit is still very low for both states; appendix RR shows the deviance and the Pearson chi-squared statistic. The deviance is a measure of goodness of fit of a generalized linear model, with higher numbers indicating worse fits. The Pearson chi-squared statistic establishes whether an observed frequency distribution differs from a theoretical distribution. The GLM conducted on the data of Uttarakhand therefore gives us a better fit (the Pearson chi-squared statistic is 0.14); however, just as in the OLS regression, we realise that this model gives a poor fit for our data, which we explain by the high volume of observations and the heterogeneity of individuals.

### 4.3.2 Results of Logit

The last model conducted is a logit regression, in which we regress the dummy “more than median” on all of the explanatory variables selected beforehand. The coefficient estimates are shown in figure RR. Appendix HHJ shows the exact output of SAS. Just as for the GLM model, Bihar does not show many significant coefficients, and therefore the interpretation will focus on the regression conducted on Uttarakhand. We see that the joint insignificance of all coefficients is rejected at 1% by the Wald test for both states. The goodness of fit is reported in the form of Akaike’s information criterion and the Schwarz criterion; however, both criteria allow us to compare different models rather than evaluate the goodness of fit of an individual model. For this reason, we will only use the percentage of well predicted values of the validation data set as an indicator of the goodness of fit. To decide from which threshold upwards we consider the predicted value to be one, we show the ROC curve in figure RT and GYU for both states. In a ROC curve the true positive rate (Sensitivity) is plotted against the false positive rate (100 - Specificity) for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between the two diagnostic groups (individual lived more than the median / less than the median). Our ROC curves show that our models are relatively poor: the area under the curve is only 61 to 62%. Finally, we chose the threshold of 0.5.
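
The construction of the ROC curve and of the share of well predicted values can be sketched as follows. This is a minimal illustration on made-up labels and predicted probabilities, not the SAS procedure used here; the function names `roc_points`, `auc` and `accuracy` are hypothetical.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs, one per cut-off."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def accuracy(y_true, scores, threshold=0.5):
    """Share of observations classified correctly at a given threshold."""
    hits = sum(1 for y, s in zip(y_true, scores) if (s >= threshold) == (y == 1))
    return hits / len(y_true)

# Made-up validation data: 1 = lived longer than the median
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]

area = auc(roc_points(y_true, scores))
share_correct = accuracy(y_true, scores, threshold=0.5)
print(area, share_correct)
```

An area of 0.5 corresponds to a model that ranks individuals no better than chance, which is why values of 0.61 to 0.62 indicate weak discrimination.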

Now we interpret the coefficients of the regression in the case of Uttarakhand. We see that female, literacy, higher education, the access to toilets and private or no treatment all have positive and significant coefficients. To understand how much these increase the probability that the individual lives longer than the median, we use the odds ratios. The odds are the probability that an individual lives more than the median given a certain characteristic (for instance that he pursued higher education) divided by the probability that he does not live more than the median (given that he has higher education); the odds ratio compares these odds with and without the characteristic. The odds ratio of higher education is for instance 1.567. If we take the reference odds to be one (an even chance of living more than the median), then

$$\frac{p(\text{more than median} = 1 \mid \text{educ sup} = 1)}{1 - p(\text{more than median} = 1 \mid \text{educ sup} = 1)} = 1.567$$

so that p(more than median = 1|educ sup = 1) = 0.6. (All other variables can be interpreted similarly.)
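
The step from the value 1.567 to a probability of about 0.6 can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the baseline probability of 0.5 passed to `apply_odds_ratio` is an assumption (an even chance of exceeding the median for the reference group), not a figure reported in the SAS output.

```python
def odds_to_probability(odds):
    """Invert odds = p / (1 - p)."""
    return odds / (1.0 + odds)

def apply_odds_ratio(baseline_probability, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    return odds_to_probability(odds_ratio * baseline_odds)

p_direct = odds_to_probability(1.567)
p_from_baseline = apply_odds_ratio(0.5, 1.567)  # assumed baseline of 0.5
print(round(p_direct, 2), round(p_from_baseline, 2))
```

With any other baseline probability the same odds ratio leads to a different probability, so the figure of 0.6 holds only under the even-odds reference.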

Finally, we see that belonging to the scheduled tribe decreases the probability of living more than the median. The model has its limitations, once again. The percentage of well predicted values in the validation dataset is only 53%. Furthermore, we see that smoking has a positive and significant coefficient despite the harmful effects of cigarettes.

## 5 Conclusion

In this study, we used four different estimation methods to evaluate social effects on the length of life in India. We primarily focused on the differences between two states, a richer and a poorer one, and between two groups of people, members of the scheduled tribes and those belonging to upper castes.

We generally observe that education, literacy, and the access to electricity and toilets increase the length of life, while belonging to the scheduled tribe reduces it. We also find that even after controlling for the differences between the untouchables and others, differences persist between these castes.

Finally, in all of our models, indicators of the goodness of fit were weak, and the coefficients obtained on the training dataset did not explain the validation data set well. We therefore conclude that this model is insufficient to assess the social effects on the length of life. Firstly, we lack important variables such as individual income, and their omission biases our regressions.

If income is correlated with the length of life (as we assume it is) and also with other explanatory variables (for instance with education, as more educated people generally have higher incomes), this omitted variable makes all correlated variables endogenous. Endogeneity not only undermines the unbiasedness of the OLS estimator, it also makes it inconsistent. Therefore the OLS model, the Oaxaca-Blinder decomposition and the GLM models are all subject to possible endogeneity bias.
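
The mechanism can be illustrated with a small simulation in which an unobserved variable affects the outcome and is correlated with an included regressor. All numbers below are invented for the illustration (they are not estimates from the Indian data), and the helper `slope` is a hypothetical one-regressor OLS function.

```python
import random
from statistics import mean

def slope(x, y):
    """Slope of a one-regressor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

random.seed(1)
n = 20000
beta_educ = 1.0     # assumed true effect of education on years lived
gamma_income = 2.0  # assumed true effect of the omitted variable
delta = 0.5         # assumed dependence of the omitted variable on education

educ = [random.gauss(8, 2) for _ in range(n)]
income = [delta * e + random.gauss(0, 1) for e in educ]
life = [50 + beta_educ * e + gamma_income * w + random.gauss(0, 3)
        for e, w in zip(educ, income)]

# Regression that leaves the second variable out
slope_short = slope(educ, life)
# Textbook omitted-variable formula: beta + gamma * delta
expected_short = beta_educ + gamma_income * delta
print(round(slope_short, 2), expected_short)
```

The estimated slope stays close to the biased value however large the sample is, which is what inconsistency means in this setting.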

The second reason for the poor fit of the model is the heterogeneity of individuals. Some individuals live unexpectedly long because they inherited a strong genetic background, which is unrelated to their level of education or social status, whereas some people are simply not that lucky; individual differences therefore play a huge role in determining the length of life [**Sengupta, 2016**]. We therefore think that a panel regression with individual fixed effects could give us a more reliable approach, and that more information is needed in order to obtain a better understanding of the effects of individual characteristics on the length of life.

## References

- [1] Central Statistics Office, Directorate of Economics & Statistics of respective State Governments, and for All-India.
- [2] Dickman, S., Himmelstein, D., Woolhandler, S. (2017) Inequality and the health-care system in the USA. Lancet London, England.
- [3] Marmot, M. and Wilkinson, R.G. (2001) Psychosocial and material pathways in the relation between income and health: a response to Lynch et al. International Centre for Health and Society.
- [4] Marmot et al. (2003) Whitehall studies. Department of Medical Statistics & Epidemiology.
- [5] Ministry of Health and Family Welfare (2016) National Health Accounts Estimates for India (2013–14). Central Bureau of Health Intelligence, National Health Profile, 2016; accessed Oct. 13, 2016.
- [6] NCSS Statistical Software, StepWise modeling. Available from: https://www.ncss.com/
- [7] Philippe De Peretti (2008-2009) Econométrie Appliquée sous SAS.
- [8] Smith, T. (1993) A statistical analysis of life expectancy across countries using multiple regression. University of Pennsylvania.
- [9] Sengupta, K. (2016) Determinants of Health Status in India. Indian Institute of Management Nongthymmai, Shillong, India.
- [10] Subramanian, S.V., Ackerson, L., Malavika, A., Sivaramakrishnan, S. (2008) Health Inequalities in India: The Axes of Stratification. Brown University.
- [11] Turner, H. (2008) Introduction to Generalized Linear Models. ESRC National Centre for Research Methods, UK and Department of Statistics University of Warwick, UK.
- [12] University of California (April 2010) Berkeley Wellness Letter.
- [13] Wikipedia, Life Expectancy. Available from: https://en.wikipedia.org/wiki/Life_expectancy

## Appendices

**Model fit statistics**

| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 8632.001 | 8373.300 |
| SC | 8638.745 | 8481.212 |
| -2 Log L | 8630.001 | 8341.300 |
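
The three criteria in this table are linked by the usual definitions AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), where k is the number of estimated parameters and n the number of observations. The sketch below uses these definitions to back out k and ln(n) from the reported figures as a consistency check. The implied values are inferences from the table, not numbers stated in the SAS output, and the variable names are hypothetical.

```python
import math

fit = {
    "intercept_only": {"aic": 8632.001, "sc": 8638.745, "neg2logl": 8630.001},
    "with_covariates": {"aic": 8373.300, "sc": 8481.212, "neg2logl": 8341.300},
}

def implied_params(row):
    """Number of parameters k implied by AIC = -2 Log L + 2k."""
    return round((row["aic"] - row["neg2logl"]) / 2)

k0 = implied_params(fit["intercept_only"])
k1 = implied_params(fit["with_covariates"])

# ln(n) implied by SC = -2 Log L + k ln(n) in the intercept-only model
log_n = (fit["intercept_only"]["sc"] - fit["intercept_only"]["neg2logl"]) / k0
implied_n = math.exp(log_n)

# Rebuild SC of the full model from the implied k and ln(n)
sc1_rebuilt = fit["with_covariates"]["neg2logl"] + k1 * log_n

# Likelihood-ratio statistic comparing the two models
lr_stat = fit["intercept_only"]["neg2logl"] - fit["with_covariates"]["neg2logl"]

print(k0, k1, round(implied_n), round(sc1_rebuilt, 3), round(lr_stat, 3))
```

The rebuilt SC agrees with the reported value up to the rounding of the printed figures, which suggests the table is internally consistent.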