Setup

Load packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.1

Load data

load("brfss2013.RData")

Part 1: Data

According to the CDC’s FAQ web page on the subject[1], the Behavioral Risk Factor Surveillance System (BRFSS) data are collected by telephone interviews conducted by professional phone interviewers with a very large (n > 400,000) number of randomly selected US adults. Most of the data are categorical (rather than continuous) and pertain to health-related behaviors. Inasmuch as the calls are random and go to a large number of people, the data should be well-generalizable to the target population (US residents), but as variables are not randomly assigned, it is not possible to make causal conclusions about relations among the data.

Presumably only those who own and are able to answer a telephone respond to the survey, which could introduce bias, particularly among variables relating to income or disabilities; and since participation in the survey is voluntary, one can conjecture that certain personality types might tend to refuse, possibly affecting, for example, variables pertaining to mental health.

More significantly, many of the questions are of a personal nature and involve “hot-button” moral or emotional issues for many people (subjects like HIV, immunization, weight and diet) and there may be tendencies for respondants to under- or over-report on some of these issues, even when told their responses are anonymous.

As a final note, the survey broadly has three parts, one set of manditory and one set of optional questions; a third set are calculated after the fact. On the whole, the manditory question were answered by most respondants, but very large numbers of the optional questions are mostly not present in the data. In cases where calculated variables are calculated based on manditory variables, they are themselves complete, whereas when they are calculated based on optional variables, they are therefore themselves mostly incomplete. Since it is difficult to compare variables that are mostly complete with those that are mostly missing, these analyses use only manditory variables and variables calculated based on them.

Footnotes:

[1] http://www.cdc.gov/brfss/about/brfss_faq.htm


Part 2: Research questions

Research quesion 1: Children and Lack of Sleep

Many parents whine that having children causes them to lose sleep. Is there any association between having children in the house and less sleep per night? Although the causal connection is beyond the scope of these data, as a parent of a two-year-old I would like to know whether there is really an association between the two things.

Research quesion 2: Days of the Week and Reported General Health

Do Mondays really suck? Could we get some insight into this by asking whether people tend to report on their general health more or less positively depending on what day of the week it is? If people are entirely objective and answered independently of how they currently feel, we would expect to see no difference in how they report on their general health no matter how they happened to feel at the time. Again, causality cannot be established with an observational dataset, but we can at least see if there is any association between the variables.

Research quesion 3: BMI and State of Residence

As a resident of Colorado, I have many times had to listen to my fellow Coloradans claim that Colorado is the “healthiest state in the nation,” and further, that the altitude is the cause. Well, we won’t be able to establish causality here, but let’s see if we can see any association between elevation and health.


Part 3: Exploratory data analysis

Research quesion 1: Children and Lack of Sleep

Do parents sleep less? Let’s create a plot of the mean number of hours slept (variable sleptim1) ordered by number of children (variable children).

Before beginning, it is worth noting that while number of children is usually not considered a major moral issue, and is very difficult to be mistaken about, the number of hours that a person self-reports to sleep per night is possibly very biased. One suspects under-reporting in a culture which values productivity more highly than rest, and further, small sleep disturbances may feel subjectively more extended than they really are, again possibly leading to under-reporting of sleep hours. Given that many people already believe that having children cuts into sleep time, those who have children may again for that reason under-report their sleep time. But continuing –

First let’s examine our variables.

range(brfss2013$sleptim1, na.rm = TRUE)
## [1]   0 450

Quickly looking at the sleptim1 variable’s range shows that there are values as large as 450 – implying that someone slept that many hours in a day. This is obviously bad data, so we will have to filter it (and any NAs) out.

Let’s create a working version of the datset called dta which will contain only observations which include both variables (no NAs in those variables) and in which sleep time is less than or equal to 24 (the maximum number of hours a person can sleep in a day) – but see next item:

To simplify the analysis, and based on a brief discussion online [1] we will also remove all values for sleep time that are greater than 10, as 10 hours per day is given as a value that is regarded as medically pathological. This will help rule out values which are due to clinical factors other than number of children.

dta <- brfss2013 %>% filter(!is.na(sleptim1) & sleptim1 <= 10 & !is.na(children))

We can define another working value, means as the mean sleep time for each level of the children variable, then plot that.

means <- dta %>% group_by(children) %>% summarize(mean = mean(sleptim1))
plot(x = means$children, y = means$mean, type = "b", xlab = "Number of children in household", ylab = "Mean hours sleep per night")

It is worth noting that the number of respondents with large numbers of children is very small. Let us find the standard deviation for the number of children, then add three times that to the mean number of children and then round up to a whole number of children. We do this to get an idea of what number might contain outliers or otherwise oddball values:

ceiling(mean(dta$children + 3 * sd(dta$children)))
## [1] 4

We conclude that even just 4 children is more than three standard deviations above the mean in this data set. This seems a surprisingly low number, but we must consider that many of the respondants were probably young or at any rate single, which would explain why about 73% of observations are zero for number of children, as determined here:

table(dta$children)[1] / sum(table(dta$children))
##         0 
## 0.7318254

Further, the number of children represents only those actually living with the respondant, so children living with other parents or away at college, or over the age of 18 would not count for this variable. This is exactly what we want, in fact, because we would assume that children not currently living at home would not affect the respondants’ sleep time.

But if we leave out zero values, and find the mean and standard deviation of only those who do currently live with children, we still find

hasKids <- dta %>% filter(children >= 1)
ceiling(mean(hasKids$children) + 3 * sd(hasKids$children))
## [1] 6

that only zero through six children are within three standard deviations of the mean. If we regenerate our plot using only those first seven levels, we get a clearer trend:

plot(x = head(means$children, 7), y = head(means$mean, 7), type = "b", xlab = "Number of children in household", ylab = "Mean hours sleep per night")

It seems evident that sleeping less is indeed associated with having children in the house, and at least for the first several, the more kids, the less sleep. Who knew, right? Notably, the biggest drop in mean sleep is as one might expect, between zero and one child. Again, this is only an association, and given the observational and possibly biased nature of the data, not causality and possibly not even association with confidence can be established.

With that, I think I need to get some sleep.

Footnotes:

[1] http://sleep.emory.edu/sleep_disorders/idiopathic_hypersomnia/diagnosis.html

Research quesion 2: Days of the Week and Reported General Health

Do respondants answer subjective questions differently depending on how they are currently feeling? Since that’s a pretty broad question, let’s just see if we can find any average difference between how they claim their health is in general, depending on what day of the week it is. Of course, we can’t establish causalty with this observational data, but we might be able to see some interesting trend or relation.

First, let’s create a working dataset called dta again, this time only including observations for which X_rfhlth (General Health) is not NA. We also will filter out any survey responses that were given in the year 2014. (According to the dataset codebook, a small number were. We have to calculate our day of the week later based on the day of the month, and making sure our data only contains responses from the year 2013 will make this much simpler, wile still retaining most of the data.)

dta <- brfss2013 %>%
  filter(!is.na(X_rfhlth) & iyear == 2013)

Moving on, we now need to create a computed variable containing the day of the week on which the survey was administered. For this we will need to look up the day of the week on which the year 2013 began. According to timeanddate.com, 2013 began on a Tuesday. We will begin the week on Monday, so Tuesday will be coded 2. But we need to know what value to add to our result, so that number will be 1. (1 + 1 = 2, Tuesday.)

firstDayOffset <- 1

For the final bit of preparation, we will create a new computed variable containing the numeric code for the day of the week on which the survey was administered (Monday = 1, Tuesday = 2, etc.). We do this by adding up the number of days in all the months preceeding the current month (or zero if it is January), then adding the day of the current month. This gives the day of the year. We compute the modulus 7 for this and then add one (since the result is zero-based, but we want it to be one-based).

The syntax below is absolutely ridiculous, but I was not able to find a way to get R to lookup a value from a vector of values based on the content of a frame or vector variable within mutate, as in mutate(sum(monthsLengthVector[1:imonth]) … etc., or the like. So we do it the hard way.

dta <- dta %>% 
  mutate(imonthDays = ifelse(imonth == "January", 0,
                       ifelse(imonth == "February", 31,
                        ifelse(imonth == "March", 59,
                         ifelse(imonth == "April", 90,
                          ifelse(imonth == "May", 120,
                           ifelse(imonth == "June", 151,
                            ifelse(imonth == "July", 181,
                             ifelse(imonth == "August", 212,
                              ifelse(imonth == "September", 243,
                               ifelse(imonth == "October", 273,
                                ifelse(imonth == "November", 304, 334)))))))))))) %>%
  mutate(weekday = ((imonthDays + as.numeric(iday) + firstDayOffset - 1) %% 7) + 1)

Now we can plot the proportion of respondants who reported good or bad health against the day of the week.

ggplot(data = dta, aes(x = weekday, fill = X_rfhlth)) + geom_bar(position = 'fill') + 
  labs(x = "Day of the Week (1 = Monday)", y = "Percent",
       title = "Reported General Health by Day of the Week")

Although it appears that the proportion of respondants reporting negatively does increase slightly during the course of the working week, with then a slight drop again during the weekend, the effect is very slight – the association is not clear if it exists at all. Mean values for each day vary within just about 2.3%:

means <- dta %>% group_by(weekday) %>% 
  mutate(isHealthy = ifelse(X_rfhlth == "Good or Better Health", TRUE, FALSE)) %>%
  summarize(mean = mean(isHealthy))
max(means$mean) - min(means$mean)
## [1] 0.02329834

In any case, it does not appear that Monday is the worst day of the week. Friday has the lowest percentage of negative responses. Perhaps people are more tired on Friday and that is associated with negative responses, but that is only speculation. I know I’m tired, and it’s only Tuesday.

Research quesion 3: BMI and State of Residence

If we want to find an association between altitude and health we will need to narrow the focus somewhat. Let’s use a state’s residents’ Body Mass Index (BMI) as a stand-in for general health.
The BRFSS calculated variable X_bmi5 is the respondant’s BMI with the decimal point (to two places) omitted. We will always therefore divde by 100 when using this value. First, let’s look at the mean and median values and a histogram of that variable to decide which of mean or median to use:

mean(brfss2013$X_bmi5 / 100, na.rm = TRUE)
## [1] 27.82448
median(brfss2013$X_bmi5 / 100, na.rm = TRUE)
## [1] 26.63
ggplot(brfss2013, aes(x = X_bmi5 / 100)) + geom_histogram(bins = 100)
## Warning: Removed 26727 rows containing non-finite values (stat_bin).

We see that the mean is greater than the median, and the histogram confirms that the distribution is right skewed. There may be a harder limit on how low the BMI can go than how high. There also may be outliers to the high (right) end with medical pathologies. For these reasons, we will use the median of BMI for the further analysis.

Next we create a table of all states with their median BMI. The resulting table will be in alphabetical order by state. For this table we filter out any NAs in the BMI variable, and we also filter out any respondants who did not claim to be a resident of the state that was called for the interview:

states <- brfss2013 %>% 
  filter(!is.na(X_bmi5) & stateres == "Yes") %>% 
  group_by(X_state) %>%
  summarise(median = median(X_bmi5 / 100))

We will confine ourselves to just the 50 actual states. Let’s get rid of Guam, Puerto Rico and the District of Columbia:

states <- states[-c(9,52,53),]

Next we will have to look up the mean elevations (in meters) of the 50 states[1] and put them in alphabetical order into a 50-item vector. Then we add it to our states table:

elevations <- c(150, 580, 1300, 200, 880, 2100, 150, 20, 30, 180, 920, 1500, 180, 210, 340, 610, 230, 30, 180, 110, 150, 280, 366, 90, 240, 1000, 790, 1700, 300, 80, 1700, 300, 210, 580, 260, 400, 1000, 340, 110, 110, 670, 280, 520, 1860, 300, 290, 520, 460, 320, 2044)
states$elevations <- elevations

The date in the states variable now looks like this:

head(states)
## Source: local data frame [6 x 3]
## 
##      X_state median elevations
##       <fctr>  <dbl>      <dbl>
## 1    Alabama 27.500        150
## 2     Alaska 27.280        580
## 3    Arizona 26.160       1300
## 4   Arkansas 27.385        200
## 5 California 26.220        880
## 6   Colorado 25.800       2100

Before we produce our scatter plot, let’s take a look at the distribition of state elevations:

ggplot(states, aes(x = elevations)) + geom_histogram(bins = 15)

The result is unsurprisingly strongly right skewed. This suggests that we should expect our scatter plot to have far more data points on the left side than on the right.

plot(states$elevations, states$median, main = "States", xlab = "Mean Elevation", ylab = "Median BMI")

We do see more points on the left. But there are enough on the right to suggest a negative association between a state’s mean elevation and the median BMI of its residents – that is, to some extent, the higher the mean elevation, the lower the median BMI.

An interesting outlier is Hawaii, with mean elevation of 920 meters, and a median BMI of 25.1 – the lowest in the dataset. It is the dot near the very bottom of the plot. It is interesting to note that most of the population of Hawaii probably lives quite close to the ocean and therefore at lower elevation than the statewide mean. That would suggest that Hawaii is even more of an unusual case than is implied by the plot.

So, leaving causality aside, maybe Coloradans are on to something.

Footnotes:

[1] https://simple.wikipedia.org/wiki/List_of_U.S._states_by_elevation