# Direct Effect

## Explanation

In a conceptual model, the concepts are normally placed in rectangles. We have two concepts: X, the independent variable (IV), and Y, the dependent variable (DV).
The single-headed arrow indicates that you assume a causal relation from X (on the left) to Y (on the right). Thus: if X increases, Y will increase as a result; the more X, the more Y.
The shape of the relation remains implicit in your conceptual model but researchers commonly assume a linear relation between the concepts.

It is not always necessary to label the paths but for this tutorial it will turn out to be handy. Normally, when there is no sign (or label) it is assumed that the path has a positive valence. It is, however, good practice to include the valence of the paths in your conceptual models, i.e. replace a with + or -.
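As a minimal sketch of what a positive path a implies, the simulation below generates data in which X affects Y linearly and then recovers the size and sign of the path with lm(). The sample size, the path value of 0.5, and all object names are assumptions made for this illustration; they are not part of the NELLS example used later.

```r
# Simulated illustration (all values are assumptions for this sketch)
set.seed(123)
n <- 1000
a <- 0.5                      # assumed positive path from X to Y
X <- rnorm(n)                 # independent variable
Y <- a * X + rnorm(n)         # dependent variable: linear in X plus noise

fit_sim <- lm(Y ~ X)
a_hat <- unname(coef(fit_sim)["X"])
a_hat                         # estimated path, should be close to the assumed 0.5
sign(a_hat)                   # valence of the estimated path
```

With a negative value for a, the same code would return a negative estimate, which corresponds to a path labelled with a minus sign.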

Please note, I use the label concept and not variable. A variable is something that is part of your dataset, measured by, for example, a survey item. A concept is a theoretical construct, and the concept as intended may have relations - according to your theory - with other concepts. This concept is measured by one or more variables, the ‘concept as measured’.
It is generally a good idea to use concepts that are (very) close to your actual measurements. Thus, although you may use the concept social cohesion in your conceptual model, this concept is overly broad and there is a fierce debate on how it should be defined. If you have measured social cohesion with for example a single item on generalized trust (“Generally speaking would you say that most people can be trusted or that you can’t be too careful in dealing with people?”) why not use the concept generalized trust in your conceptual model?

## Abstract hypothesis/hypotheses

Hypo: The more X, the more Y

## Real life example

### Continuous DV

X is occupational success.
Y is health

Hypo 1a: Occupational success will lead to better health.

Note that we use the same concepts as in the previous example (an association between occupational success and health). The decision to now formulate a causal path should be informed by theory.

### Dichotomous DV

X is occupational success.
Y is healthy (YES versus NO)

Hypo 1b: Occupational success increases the probability of being healthy.

Note that you will probably use a logistic regression model to test this hypothesis. Thus, where your conceptual model assumes a linear relation between your IV and your DV, your formal model assumes an S-shaped relation: the logistic model is linear in the logit, not in the probability (see footnote 1).
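The difference between the two scales can be made concrete with a few numbers. The sketch below uses made-up coefficients (b0 and b1 are assumptions, not estimates from any dataset) and shows that equal steps in X give equal steps in the logit but unequal steps in the probability.

```r
# Illustrative coefficients (assumed values, not estimates from the NELLS data)
b0 <- -2
b1 <- 1
x <- 0:4

logit <- b0 + b1 * x          # linear predictor (log-odds)
p <- plogis(logit)            # inverse logit: exp(logit) / (1 + exp(logit))

diff(logit)                   # constant steps on the logit scale
round(diff(p), 3)             # unequal steps on the probability scale
```

The steps in probability are largest around p = 0.5 and shrink towards 0 and 1, which is what produces the S-shape.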

## Structural equations

• Y=X

or, following the syntax of the R package lavaan:

• Y~X

The single ~ indicates a direct effect (regression path).
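Written out in full, and assuming a linear relation, the shorthand above is usually read as an ordinary regression equation. The symbols for the intercept, slope and error term below are notation introduced here for illustration; they do not appear in the shorthand itself.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
```

Here $\beta_1$ is the direct effect of X on Y, $\beta_0$ the intercept, and $\varepsilon_i$ the part of Y not explained by X.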

## Formal test of hypotheses

rm(list = ls())  #empty environment
require(haven)
nells <- read_dta("../static/NELLS panel nl v1_2.dta")  #change directory name to your working directory

Operationalize concepts.

# We will use the data of wave 2.
nellsw2 <- nells[nells$w2cpanel == 1, ]

# As an indicator of occupational success we will use income in wave 2.
table(nellsw2$w2fa61, useNA = "always")
attributes(nellsw2$w2fa61)

# recode (I will start newly created variables with cm from conceptual models)
nellsw2$cm_income <- nellsw2$w2fa61
nellsw2$cm_income[nellsw2$cm_income == 1] <- 100
nellsw2$cm_income[nellsw2$cm_income == 2] <- 225
nellsw2$cm_income[nellsw2$cm_income == 3] <- 400
nellsw2$cm_income[nellsw2$cm_income == 4] <- 750
nellsw2$cm_income[nellsw2$cm_income == 5] <- 1250
nellsw2$cm_income[nellsw2$cm_income == 6] <- 1750
nellsw2$cm_income[nellsw2$cm_income == 7] <- 2250
nellsw2$cm_income[nellsw2$cm_income == 8] <- 2750
nellsw2$cm_income[nellsw2$cm_income == 9] <- 3250
nellsw2$cm_income[nellsw2$cm_income == 10] <- 3750
nellsw2$cm_income[nellsw2$cm_income == 11] <- 4250
nellsw2$cm_income[nellsw2$cm_income == 12] <- 4750
nellsw2$cm_income[nellsw2$cm_income == 13] <- 5250
nellsw2$cm_income[nellsw2$cm_income == 14] <- 5750
nellsw2$cm_income[nellsw2$cm_income == 15] <- 6500
nellsw2$cm_income[nellsw2$cm_income == 16] <- 7500
nellsw2$cm_income[nellsw2$cm_income == 17] <- NA

# let us scale the variable a bit and translate into income per 1000euro
nellsw2$cm_income <- nellsw2$cm_income/1000

# from household income to personal income
attributes(nellsw2$w2fa62)
table(nellsw2$w2fa62, useNA = "always")
nellsw2$cm_income_per <- nellsw2$w2fa62
nellsw2$cm_income_per[nellsw2$cm_income_per == 1] <- 0
nellsw2$cm_income_per[nellsw2$cm_income_per == 2] <- 10
nellsw2$cm_income_per[nellsw2$cm_income_per == 3] <- 20
nellsw2$cm_income_per[nellsw2$cm_income_per == 4] <- 30
nellsw2$cm_income_per[nellsw2$cm_income_per == 5] <- 40
nellsw2$cm_income_per[nellsw2$cm_income_per == 6] <- 50
nellsw2$cm_income_per[nellsw2$cm_income_per == 7] <- 60
nellsw2$cm_income_per[nellsw2$cm_income_per == 8] <- 70
nellsw2$cm_income_per[nellsw2$cm_income_per == 9] <- 80
nellsw2$cm_income_per[nellsw2$cm_income_per == 10] <- 90
nellsw2$cm_income_per[nellsw2$cm_income_per == 11] <- 100
nellsw2$cm_income_per[nellsw2$cm_income_per == 12] <- NA
nellsw2$cm_income_ind <- nellsw2$cm_income * nellsw2$cm_income_per/100

# as an indicator of health we will use subjective well being from 5 (excellent) to 1 (bad) thus we
# have to reverse code original variable
attributes(nellsw2$w2scf1)
table(nellsw2$w2scf1, useNA = "always")
nellsw2$cm_health <- 6 - nellsw2$w2scf1
##
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 <NA>
##   55   78  103  204  338  326  282  272  276  205  133   62   48   22   22   29  374    0
## $label
## [1] " wat is het netto inkomen per maand van u en uw partner samen?/van u?/ "
##
## $format.stata
## [1] "%8.0g"
##
## $labels
##  Minder dan €150 per maand      €150 - €299 per maand      €300 - €499 per maand
##                          1                          2                          3
##      €500 - €999 per maand  €1.000 - €1.499 per maand  €1.500 - €1.999 per maand
##                          4                          5                          6
##  €2.000 - €2.499 per maand  €2.500 - €2.999 per maand  €3.000 - €3.499 per maand
##                          7                          8                          9
##  €3.500 - €3.999 per maand  €4.000 - €4.499 per maand  €4.500 - €4.999 per maand
##                         10                         11                         12
##  €5.000 - €5.499 per maand  €5.500 - €5.999 per maand  €6.000 - €6.999 per maand
##                         13                         14                         15
##   €7.000 of meer per maand weet niet, wil niet zeggen
##                         16                         17
##
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"
##
## $label
## [1] " hoe groot is uw bijdrage in dit inkomen ongeveer? kunt u een percentage noemen "
##
## $format.stata
## [1] "%8.0g"
##
## $labels
## vrijwel geen bijdrage          ongeveer 10%          ongeveer 20%          ongeveer 30%
##                     1                     2                     3                     4
##          ongeveer 40%          ongeveer 50%          ongeveer 60%          ongeveer 70%
##                     5                     6                     7                     8
##          ongeveer 80%          ongeveer 90%         ongeveer 100%             weet niet
##                     9                    10                    11                    12
##
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"
##
##
##    1    2    3    4    5    6    7    8    9   10   11   12 <NA>
##  253   48   89  259  233  242  183  229  114   63  887  229    0
## $label
## [1] " wat vindt u, over het algemeen genomen, van uw gezondheid? "
##
## $format.stata
## [1] "%8.0g"
##
## $labels
## uitstekend  zeer goed       goed      matig     slecht
##          1          2          3          4          5
##
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"
##
##
##    1    2    3    4    5 <NA>
##  438  853 1211  247   48   32
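A more compact way to express recodes like the ones above is a lookup vector indexed by the original category code. The sketch below uses a small made-up vector rather than the NELLS data, and the names codes and pct_lookup are hypothetical. Because every value is translated in a single step, a newly assigned value cannot be picked up again by a later recode, which can happen in a chain of in-place assignments (for instance when category 2 is first set to 10 and all values equal to 10 are later set to 90).

```r
# Hypothetical category codes 1-12, as in the contribution item (12 = "don't know")
codes <- c(1, 2, 6, 10, 11, 12, NA)

# Position k of the lookup holds the percentage for category k;
# the 12th position is NA so that "don't know" becomes missing
pct_lookup <- c(seq(0, 100, by = 10), NA)

# Translate all codes in one step; missing codes stay missing
pct <- pct_lookup[codes]
pct
```

If the variable is a labelled vector read with haven, it may need to be converted with as.numeric() before it is used as an index; that step is an assumption about the data and is not shown here.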

And test the direct effect. Naturally, there are many ways to test for a direct effect in R, but in this tutorial I will try to do everything in the package lavaan as well.

But first plot the association and add the regression line:

# I randomly select 200 respondents otherwise the plot will be too crowded
selection <- sample(1:length(nellsw2$cm_income_ind), 200, replace = FALSE)
# the variables are plotted on their original scale (they are not standardized here)
plot(nellsw2$cm_income_ind[selection], nellsw2$cm_health[selection], xlab = "income", ylab = "health", main = "Effect of income on health")
abline(lm(nellsw2$cm_health ~ nellsw2$cm_income_ind), lwd = 4, col = "red")

I hope you observe that the regression line does not fit the data very well.

And now, estimate the direct effect via lm():

summary(lm(nellsw2$cm_health ~ nellsw2$cm_income_ind))
##
## Call:
## lm(formula = nellsw2$cm_health ~ nellsw2$cm_income_ind)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5178 -0.5178 -0.3913  0.5382  1.6087
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            3.39132    0.03276 103.516  < 2e-16 ***
## nellsw2$cm_income_ind  0.07230    0.01860   3.886 0.000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9132 on 2353 degrees of freedom
##   (474 observations deleted due to missingness)
## Multiple R-squared:  0.006377,   Adjusted R-squared:  0.005955
## F-statistic:  15.1 on 1 and 2353 DF,  p-value: 0.0001047

And with lavaan.

require(lavaan)
# observed
var(cbind(nellsw2$cm_income_ind, nellsw2$cm_health), na.rm = TRUE)
cor(cbind(nellsw2$cm_income_ind, nellsw2$cm_health), use = "pairwise.complete.obs", method = "pearson")
##            [,1]       [,2]
## [1,] 1.02349869 0.07399422
## [2,] 0.07399422 0.83887136
##           [,1]      [,2]
## [1,] 1.0000000 0.0798558
## [2,] 0.0798558 1.0000000
model <- '
cm_health ~ cm_income_ind
cm_health ~ 1
'
fit <- cfa(model, data = nellsw2)  #I use cfa instead of lavaan. The only advantage is that I don't have to tell lavaan that I also need the error variances.
summary(fit, standardized = TRUE)
inspect(fit, "r2")  #to obtain r-squared
# parameterEstimates(fit)
## lavaan 0.6-7 ended normally after 16 iterations
##
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                          3
##
##                                                   Used       Total
##   Number of observations                          2355        2829
##
## Model Test User Model:
##
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
##
## Parameter Estimates:
##
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   cm_income_ind ~
##     cm_health         0.088    0.023    3.888    0.000    0.088    0.080
##
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .cm_income_ind     1.133    0.082   13.822    0.000    1.133    1.120
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .cm_income_ind     1.017    0.030   34.315    0.000    1.017    0.994
##
## cm_income_ind
##         0.006

Let us briefly discuss the results:

• The direct effect is 0.072. A causal interpretation would be: if your income increases by 1000 euros, your health will improve by 0.072 (on a scale from 1 to 5).
• You will also observe that the standardized regression coefficient is 0.080, which is exactly the same as the correlation between our two concepts estimated previously. Thus the correlation model and the direct effect model are equivalent, and we should be very cautious in giving our regression coefficient a causal interpretation.
• I hope you also observe that the explained variance is very low (and thus that the error variance of our health variable is almost identical to the observed variance). Perhaps you should conclude that, despite the highly significant effect, the impact of income on health (or the linear relation between them) is not substantial, and may even be negligible?
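The equivalence between the standardized direct effect and the correlation holds for any regression with a single predictor. The sketch below checks it on simulated data; the sample size, the effect of 0.1, and all object names are assumptions chosen for this illustration.

```r
# Simulated data with a weak assumed effect of x on y
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)

# Standardize both variables (mean 0, sd 1)
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))

r <- cor(x, y)                                    # Pearson correlation
b_std <- unname(coef(lm(zy ~ zx))["zx"])          # standardized slope, y on x
b_std_rev <- unname(coef(lm(zx ~ zy))["zy"])      # standardized slope, x on y
r2 <- summary(lm(y ~ x))$r.squared                # explained variance

c(correlation = r, y_on_x = b_std, x_on_y = b_std_rev, r_squared = r2)
```

The standardized slope is the same in both directions and equals the correlation, and the explained variance equals the squared correlation. The data alone therefore cannot tell you which of the two variables is the cause.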

## Take Home Messages

• A significant direct effect does not mean it is meaningful.
• A direct effect cannot always be given a causal interpretation.
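The first message can be illustrated with a small simulation. In the sketch below a very small true effect (an assumed value of 0.03) is combined with a very large sample; the effect is then statistically significant even though it explains almost none of the variance. All numbers and object names are assumptions for this illustration.

```r
# Large sample, very small assumed true effect
set.seed(2024)
n <- 100000
x <- rnorm(n)
y <- 0.03 * x + rnorm(n)

fit_big <- summary(lm(y ~ x))
p_value <- fit_big$coefficients["x", "Pr(>|t|)"]
r2_big <- fit_big$r.squared

p_value < 0.001               # is the effect statistically significant?
r2_big                        # share of variance in y explained by x
```

Significance mainly reflects how precisely the effect is estimated, which improves with sample size; whether the effect is meaningful has to be judged from its size.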

1. $p = P(Y=1) = \frac{\exp(\beta_k x_k)}{1+\exp(\beta_k x_k)}$