How to deal with Missing Values in R Programming?
Scenario
One day I was training a group of participants on statistical techniques using R. In the class I was covering descriptive statistics and wanted to demonstrate how to calculate the mean using the survey data from the MASS package.
Quick View of Data
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
kable(head(survey))
| Sex | Wr.Hnd | NW.Hnd | W.Hnd | Fold | Pulse | Clap | Exer | Smoke | Height | M.I | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 18.5 | 18.0 | Right | R on L | 92 | Left | Some | Never | 173.00 | Metric | 18.250 |
| Male | 19.5 | 20.5 | Left | R on L | 104 | Left | None | Regul | 177.80 | Imperial | 17.583 |
| Male | 18.0 | 13.3 | Right | L on R | 87 | Neither | None | Occas | NA | NA | 16.917 |
| Male | 18.8 | 18.9 | Right | R on L | NA | Neither | None | Never | 160.00 | Metric | 20.333 |
| Male | 20.0 | 20.0 | Right | Neither | 35 | Right | Some | Never | 165.00 | Metric | 23.667 |
| Female | 18.0 | 17.7 | Right | L on R | 64 | Right | Some | Never | 172.72 | Imperial | 21.000 |
Find the Mean
The variable Wr.Hnd records the span (distance from tip of thumb to tip of little finger of the spread hand) of the writing hand, in centimetres. It is a continuous variable, so the mean is an appropriate measure of central tendency. So I tried the following command:
mean(survey$Wr.Hnd)
## [1] NA
I was surprised to see NA, because the variable is continuous, as you can see in the quick view of the data above.
I reloaded the MASS package and repeated the same steps, thinking that something had gone wrong while loading the package. But I got the same result again. Now I was feeling embarrassed.
Then I started looking at the individual values of that variable, and suddenly I found that observation no. 43 has the value NA. Because of that single missing value, the calculated mean was also NA.
kable(survey[c(40:45),])
|  | Sex | Wr.Hnd | NW.Hnd | W.Hnd | Fold | Pulse | Clap | Exer | Smoke | Height | M.I | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40 | Male | 19.0 | 19.0 | Right | R on L | NA | Neither | Freq | Occas | 171.00 | Metric | 19.917 |
| 41 | Female | 17.5 | 16.0 | Right | L on R | NA | Right | Some | Never | 169.00 | Metric | 17.500 |
| 42 | Female | 17.8 | 18.0 | Right | R on L | 72 | Right | Some | Never | 154.94 | Imperial | 17.083 |
| 43 | Male | NA | NA | Right | R on L | 60 | NA | Some | Never | 172.00 | Metric | 28.583 |
| 44 | Female | 20.1 | 20.2 | Right | L on R | 80 | Right | Some | Never | 176.50 | Imperial | 17.500 |
| 45 | Female | 13.0 | 13.0 | NA | L on R | 70 | Left | Freq | Never | 180.34 | Imperial | 17.417 |
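Scanning rows by eye works for a small data set, but R can also report where the missing values are. The following is a minimal sketch, assuming the MASS package is installed; the names `toy`, `na_rows` and `na_counts` are only illustrative and the values in `toy` are made up. It first shows that a single NA is enough to make the mean NA, then locates the missing values in the survey data.

```r
library(MASS)  # provides the survey data frame

# NA propagates through arithmetic, so one missing value makes the mean NA
toy <- c(18.5, NA, 20.0)   # made-up values, for illustration only
mean(toy)

# Row positions of the missing values in Wr.Hnd
na_rows <- which(is.na(survey$Wr.Hnd))
na_rows

# Number of missing values in every column of the data frame
na_counts <- colSums(is.na(survey))
na_counts
```

Running `colSums(is.na(...))` before any summary is a quick way to see which variables will need special handling.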
Now I knew which observation had embarrassed me. But how should I deal with it? I could remove this observation, but it still holds data for the other variables, so removing it would mean a loss of information. The better way is to skip this observation only while calculating the mean of the Wr.Hnd variable. So I added the na.rm argument to the command:
mean(survey$Wr.Hnd, na.rm = TRUE)
## [1] 18.66907
Woohoo! Now it gives the result by skipping only the NA values, without losing the other information.
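To see the "loss of information" point in numbers, here is a small sketch (again assuming MASS is installed; the variable names are just for illustration) that compares skipping NAs for one calculation with dropping every incomplete row using `na.omit()`.

```r
library(MASS)  # provides the survey data frame

# Option 1: skip NAs only for this calculation; the data frame is untouched
mean_skip <- mean(survey$Wr.Hnd, na.rm = TRUE)

# Option 2: drop every row that has an NA in any column, then take the mean
survey_complete <- na.omit(survey)
mean_complete <- mean(survey_complete$Wr.Hnd)

# Number of observations each approach actually uses
n_used_skip <- sum(!is.na(survey$Wr.Hnd))
n_used_complete <- nrow(survey_complete)

c(na.rm = n_used_skip, na.omit = n_used_complete)
c(na.rm = mean_skip, na.omit = mean_complete)
```

Because `na.omit()` on the whole data frame also discards rows that are missing only in other columns (Pulse or Height, for example), it uses fewer observations for Wr.Hnd than `na.rm = TRUE` does.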
