LOCF and NOCB methods for missing data: how to plot the data? - r

I'm working on the following dataset and its missing data:
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
I would like to fill in the missing data via the Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB) methods, produce a graphical representation plotting the individual profiles over age by sex while highlighting the imputed values, and compute the means and standard errors at each age by sex. Could you suggest how to set the arguments of the plot() function properly?
Does anyone have a clue about this?
Below is some code drawn from another dataset as an example, in case it turns out to be useful.
library(mice)  # for mdc(), which provides the missing-data colour palette used below
par(mfrow = c(1, 1))
Oz <- airquality$Ozone
# Last Observation Carried Forward: replace each NA with the most recent non-NA value
locf <- function(x) {
  a <- x[1]
  for (i in 2:length(x)) {
    if (is.na(x[i])) x[i] <- a
    else a <- x[i]
  }
  return(x)
}
Ozi <- locf(Oz)
colvec <- ifelse(is.na(Oz), mdc(2), mdc(1))  # mdc(2) marks originally missing positions, mdc(1) observed ones
### Figure
plot(Ozi[1:80], col = colvec, type = "l", xlab = "Day number", ylab = "Ozone (ppb)")
points(Ozi[1:80], col = colvec, pch = 20, cex = 1)

Next Observation Carried Backward / Last Observation Carried Forward is probably a very bad choice for your data.
These algorithms are usually used for time series data, where carrying the last observation forward can be a good idea. E.g. if you think of 10-minute temperature measurements, the current outdoor temperature will quite likely be quite similar to the temperature 10 minutes ago.
For cross-sectional data (it seems you are looking at persons), the previous person is usually no more similar to the current person than any other random person.
Take a look at the mice R package for your cross-sectional dataset.
It offers way better algorithms for your case than LOCF/NOCB.
Here is an overview of the functions it offers: https://amices.org/mice/reference/index.html
It also includes different plots to assess the imputations.
Usually when using mice you create multiple possible imputations (it is worth reading about the technique of multiple imputation), but you can also produce just one imputed dataset with the package.
The following functions are available for visualizing your imputations (a short example follows the list):
bwplot() (Box-and-whisker plot of observed and imputed data)
densityplot() (Density plot of observed and imputed data)
stripplot() (Stripplot of observed and imputed data)
xyplot() (Scatterplot of observed and imputed data)
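For instance, a minimal sketch with the nhanes data that ships with mice (the dataset, seed and settings here are illustrative, not taken from the question):
library(mice)
# multiple imputation of the built-in nhanes data (5 imputations, default method pmm)
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
# diagnostic plots comparing observed and imputed values
densityplot(imp)
stripplot(imp, pch = 20)
# extract a single completed dataset if only one imputation is needed
completed <- complete(imp, 1)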
Hope this helps a little bit. My advice would be to take a look at this package and then start a new approach with your new knowledge.

Related

Plotting missing data

I'm trying to plot the following dataset, imputed with the LOCF method, according to this procedure:
> dati
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
library(dplyr)
library(zoo)
dati_locf <- dati %>%
  mutate(across(everything(), na.locf)) %>%
  mutate(across(everything(), na.locf, fromLast = TRUE))
apply(dati_locf[which(dati_locf$sex == "F"), 1:4], 1, function(x) lines(x, col = "green"))
However, when I run the last line to plot the dataset, it returns both of these warning and error messages:
Warning in xy.coords(x, y) : a NA has been produced by coercion
Error in plot.xy(xy.coords(x, y), type = type, ...) :
plot.new has not been called yet
Called from: plot.xy(xy.coords(x, y), type = type, ...)
Can you explain why this happens and how I could fix it?
If you just want to plot the LOCF imputation for one variable, to see how good the imputation looks for that variable, you can use the following:
library(imputeTS)
# Example 1: Visualize imputation by LOCF
imp_locf <- na_locf(tsAirgap)
ggplot_na_imputations(tsAirgap, imp_locf)
tsAirgap is a time series example that comes with the imputeTS package. You would have to replace it with the time series / variable you want to plot. Imputed values are shown in red. As you can see, for this series last observation carried forward would be kind of OK, but there are algorithms that come with the imputeTS package that give a better result (e.g. na_kalman or na_seadec). Here is also an example of next observation carried backward, since you also used NOCB.
library(imputeTS)
# Example 2: Visualize imputation by NOCB
imp_nocb <- na_locf(tsAirgap, option = "nocb")
ggplot_na_imputations(tsAirgap, imp_nocb)
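To try this on one column of the dati tibble from your question, a rough sketch could be (assuming d12 is the variable of interest; the imputeTS functions also accept plain numeric vectors):
library(imputeTS)
x <- dati$d12                  # one variable from the question's data, containing NAs
imp <- na_locf(x)              # LOCF imputation of the missing values
ggplot_na_imputations(x, imp)  # imputed points are highlighted in the plot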
There are several problems here:
apply will convert its first argument to a matrix, and since the second column is character it gives a character matrix. Clearly one can't plot that with lines.
presumably we want to plot columns 3:6, not 1:4
na.locf will produce runs of identical values wherever there is an NA, but what we really want is to connect the non-NA points. Use na.approx instead.
lines can only be used after plot, but there is no plot command. Use matplot instead.
Making these changes we have the following.
library(zoo)
# see Note below for dati in reproducible form
matplot(na.approx(dati[3:6]), type = "l", ylab = "")
legend("topright", names(dati)[3:6], col = 1:4, lty = 1:4)
(figure: matplot of the interpolated d8, d10, d12 and d14 series)
We could alternatively use ggplot2 graphics. First convert to zoo and then use na.approx and autoplot. Omit facet = NULL if you want separate panels.
library(ggplot2)
autoplot(na.approx(zoo(dati[3:6])), facet = NULL)
Note
We provide dati in reproducible form below. Note that the sex column only contains NA and F, so in the absence of direction read.table would take those to be the logical values NA and FALSE. Instead we specify that the sex column is character in the read.table call.
Lines <- "
id sex d8 d10 d12 d14
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5"
dati <- read.table(text = Lines, colClasses = list(sex = "character"))
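If the goal is the plot asked for in the original question (one profile per subject across ages 8, 10, 12 and 14, coloured by sex, with imputed values highlighted), here is a rough sketch building on the data above; the LOCF-then-NOCB fill and the colour choices are illustrative, not part of the answer above.
library(dplyr)
library(zoo)
# fill each measurement column by LOCF, then NOCB for any remaining leading NAs
dati_locf <- dati %>%
  mutate(across(d8:d14, ~ na.locf(na.locf(.x, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE)))
ages <- c(8, 10, 12, 14)
vals <- as.matrix(dati_locf[, c("d8", "d10", "d12", "d14")])
miss <- is.na(dati[, c("d8", "d10", "d12", "d14")])   # cells that were imputed
matplot(ages, t(vals), type = "l", lty = 1,
        col = ifelse(dati_locf$sex %in% "F", "green", "blue"),  # illustrative colours
        xlab = "Age", ylab = "Measurement")
idx <- which(miss, arr.ind = TRUE)                    # row/column positions of imputed cells
points(ages[idx[, "col"]], vals[idx], col = "red", pch = 19)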

How to fill a new data frame based on a value and the result of a calculation that uses this very value?

For graphical purposes, I want to create a new data frame with two columns.
The first column is the dose of the treatment received (i; 10 grammes up to 200 grammes).
The second column must be filled with the result of a calculation corresponding to the dose received, i.e. the percentage of patients developing the disease at the corresponding dose, which is given by the formula below.
The dose is extracted from a much larger dataset (data_fcpa) of more than 1,000 rows (patients).
percent_i <- round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > i] > 1))[2] * 100, 1)
I know how to create a new data frame (df) with the doses I want to explore:
df <- data.frame(dose = seq(10, 200, by = 10))
names(df) <- c("cpa_dose")
> df
cpa_dose
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90
10 100
11 110
12 120
13 130
14 140
15 150
16 160
17 170
18 180
19 190
20 200
For example, for a dose of 10 grammes the result is:
> round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > 10] > 1))[2] * 100, 1)
TRUE
11.7
I suspect that a loop is needed to produce an output like the little example provided below, but I have no idea how to do it.
cpa_dose percentage
1 10 11.7
2 20
3 30
4 40
Any suggestions are welcome.
Thank you in advance for your help.
It seems that you are describing a situation where you want to show predicted effects from a statistical model. In that case, ggeffects is your best friend.
library(tidyverse)
library(ggeffects)
lm(mpg ~ hp, mtcars) %>%
  ggpredict() %>%
  as_tibble()
By the way, in order to answer your question you need to provide some data and show what you have tried.
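That said, for the literal task of filling the percentage column for each dose, a minimal base-R sketch could look like the following (it assumes data_fcpa with the columns cyproterone_dose and n_chir_act exists as described in the question; pct_above_dose is just a hypothetical helper name):
# hypothetical helper wrapping the formula from the question
pct_above_dose <- function(i) {
  round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > i] > 1))[2] * 100, 1)
}
df <- data.frame(cpa_dose = seq(10, 200, by = 10))
df$percentage <- sapply(df$cpa_dose, pct_above_dose)
head(df)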

Find max value of a dataset using tangent of the curve in R

I'm taking measures for an experiment where I'm interested in the value of V2 when the R1 measure is highest. I was just looking for the highest R1 value and then checking its V2 value, but I've been told it's more accurate to take the V2 value where the slope of the tangent to the curve is 0 (the derivative of the curve is 0). I've found several easy ways to do this in R when working with explicit functions, but since this is a discrete dataset, I have no clue how to do it. Any help? Thanks in advance!
Data frame looking like this:
> pruebaR
# A tibble: 1,001 x 2
V2 R1
<dbl> <dbl>
1 -0.100 1672.
2 -0.0991 1668.
3 -0.0982 1665.
4 -0.0973 1663.
5 -0.0964 1662.
6 -0.0955 1661.
7 -0.0946 1660.
8 -0.0937 1659.
9 -0.0928 1659.
10 -0.0919 1659.
# ... with 991 more rows
Plot looking like this:
I cannot attach pics yet!
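One rough way to approach this with a discrete dataset, sketched under the assumption that pruebaR looks as above and that some smoothing is acceptable (raw differences of noisy measurements are unreliable), is to smooth R1 as a function of V2, take the numerical derivative, and read off V2 where that derivative crosses zero:
# smooth R1 as a function of V2, then look for a zero of the numerical derivative
fit <- loess(R1 ~ V2, data = pruebaR, span = 0.1)
sm <- predict(fit)                     # smoothed R1 at the observed V2 values
slope <- diff(sm) / diff(pruebaR$V2)   # numerical derivative dR1/dV2
i <- which(diff(sign(slope)) != 0)[1]  # first index where the slope changes sign
pruebaR$V2[i + 1]                      # V2 at that stationary point
pruebaR$V2[which.max(sm)]              # for comparison: V2 at the smoothed maximum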

Looping through columns of csv-file in R [duplicate]

This question already has answers here:
How to apply a shapiro test by groups in R?
(3 answers)
Closed 9 years ago.
This probably is an easy question, but I'm just starting learning how to use R.
I have a csv file with columns containing numbers. For every column of numbers I want R to conduct a Shapiro-Wilk test of normality, so I want to loop through the columns from left to right and run shapiro.test(file$column1), shapiro.test(file$column2), etc.
All columns have a name as their header, and they don't contain the same number of rows.
How do I go about this? Many thanks in advance!
Try
apply(file, 2, shapiro.test)
and take a look at ?apply
Another way is using sapply
sapply(file, shapiro.test, simplify=FALSE)
also take a look at ?sapply
An example using the airquality dataset:
> data(airquality)
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
# Applying shapiro.test function
> Test <- apply(airquality, 2, shapiro.test)
# Showing results in a nice format
> sapply(Test, function(x) unlist(x[c( "statistic", "p.value")]))
Ozone Solar.R Wind Temp Month Day
statistic.W 8.786661e-01 9.418347e-01 0.9857501 0.976173252 8.880451e-01 9.531254e-01
p.value 2.789638e-08 9.493099e-06 0.1178033 0.009320041 2.258290e-09 5.047775e-05
> sapply(Test, function(x) c(x["statistic"], x["p.value"])) # same results as above
Ozone Solar.R Wind Temp Month Day
statistic 0.8786661 0.9418347 0.9857501 0.9761733 0.8880451 0.9531254
p.value 2.789638e-08 9.493099e-06 0.1178033 0.009320041 2.25829e-09 5.047775e-05

Using tapply on two columns instead of one

I would like to calculate the Gini coefficient of several plots in R using the gini() function from the reldist package.
I have a data frame from which I need to use two columns as input to the gini function.
> head(merged[,c(1,17,29)])
idp c13 w
1 19 126 14.14
2 19 146 14.14
3 19 76 39.29
4 19 74 39.29
5 19 86 39.29
6 19 93 39.29
The gini function uses its first argument for the values (c13 here) and its second for the weights (w here) corresponding to each element of c13.
So I need to use the column c13 and w like this:
gini(merged$c13,merged$w)
[1] 0.2959369
The thing is, I want to do this for each plot (idp). I have 4,000 different values of idp, with dozens of values of the other two columns for each.
I thought I could do this using the function tapply(), but I can't put two columns into tapply.
tapply(list(merged$c13,merged$w), merged$idp, gini)
As you know this does not work.
So what I would love to get as a result is a data frame like this:
idp Gini
1 19 0.12
2 21 0.45
3 35 0.65
4 65 0.23
Do you have any idea how to do this? Maybe the plyr package?
Thank you for your help!
You can use the function ddply() from the plyr package to calculate the coefficient for each level (some idp values in the example data frame were changed to 21).
library(plyr)
library(reldist)
ddply(merged, .(idp), summarize, Gini = gini(c13, w))
idp Gini
1 19 0.15307402
2 21 0.05006588
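For reference, a roughly equivalent dplyr version (assuming the same merged data frame) would be:
library(dplyr)
library(reldist)
merged %>%
  group_by(idp) %>%
  summarise(Gini = gini(c13, w))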
