LOCF imputation and how to fill missing entries in R

I'm working on the following dataset and I'm trying to fill the missing entries of the VISUAL52 variable, imputing data with the LOCF (Last Observation Carried Forward) method.
library(readr)
library(mice)
library(finalfit)
library(Hmisc)
library(lattice)
library(VIM)
library(rms)
library(zoo)
> hw3
# A tibble: 240 x 11
treat LINE0 LOST4 LOST12 LOST24 LOST52 VISUAL0 VISUAL4 VISUAL12 VISUAL24 VISUAL52
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 12 1 3 NA NA 59 55 45 NA NA
2 2 13 -1 0 0 2 65 70 65 65 55
3 1 8 0 1 6 NA 40 40 37 17 NA
4 1 13 0 0 0 0 67 64 64 64 68
5 2 14 NA NA NA NA 70 NA NA NA NA
6 2 12 2 2 2 4 59 53 52 53 42
7 1 13 0 -2 -1 0 64 68 74 72 65
8 1 8 1 0 1 1 39 37 43 37 37
9 2 12 1 2 1 1 59 58 49 54 58
10 1 10 0 -4 -4 NA 49 51 71 71 NA
# ... with 230 more rows
I don't know whether I've done it well or not, but I've tried to describe the sample size, mean and standard error of the VISUAL52 variable per treatment in this way (just let me know whether it would have been better to use a different function).
numSummary(hw3[,"VISUAL52", drop=FALSE], groups=hw3$treat,
statistics=c("mean", "se(mean)", "quantiles"),
quantiles=c(0,.25,.5,.75,1))
binnedCounts(hw3[hw3$treat == '1', "VISUAL52", drop=FALSE])
# treat = 1
binnedCounts(hw3[hw3$treat == '2', "VISUAL52", drop=FALSE])
# treat = 2
However, as for the imputation part, I've run the nafill() function from the data.table package, but I get back the error you can see after running the complete() function.
library(data.table)
imp_locf <- nafill(hw3$VISUAL52, "locf", nan=NA)
data_imputed <- complete(imp_locf)
Error in UseMethod("complete_") :
no applicable method for 'complete_' applied to an object of class "c('double', 'numeric')"
I'm wondering why the function returns this error, and whether someone knows alternative methods to impute data with the LOCF method and fill the missing data in the dataset.

If you want to apply LOCF on your dataset, you can use the imputeTS package.
library(imputeTS)
hw3 <- na_locf(hw3)
hw3
or if you just want to use LOCF for the VISUAL52 variable:
library(imputeTS)
hw3$VISUAL52 <- na_locf(hw3$VISUAL52)
hw3
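As for the original error: complete() is a mice function that expects a mids imputation object, while nafill() already returns the filled vector, so there is nothing left to "complete". A sketch of the data.table route (and the equivalent with zoo, which you already load):
library(data.table)
hw3$VISUAL52 <- nafill(hw3$VISUAL52, type = "locf")
# or with zoo:
# hw3$VISUAL52 <- zoo::na.locf(hw3$VISUAL52, na.rm = FALSE)
One caveat for all of these column-wise approaches: filling down the VISUAL52 column carries a value forward from the previous row, i.e. from a different patient. If the intent is the usual clinical-trial LOCF, carrying each patient's last observed visit (VISUAL24, VISUAL12, ...) forward within their own row, you would need to apply LOCF across the VISUAL columns row-wise instead.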
Also keep in mind that other algorithms might be even better suited for your data. imputeTS offers multiple functions especially for time series imputation (see the imputeTS documentation for more algorithms). The mice package you already seem to use has additional algorithms for cross-sectional data.

Related

survMisc::comp gives an incorrect p-value due to an incorrect covariance matrix calculation

I'm performing survival analysis on a somewhat unusual dataset, which in raw form looks like this:
> print(df)
# A tibble: 407 × 4
stress len time status
<dbl> <dbl> <dbl> <dbl>
1 0 5.32 1 1
2 0 5.97 1 1
3 0 4.08 1 1
4 0 4.57 1 1
5 0 6.11 1 1
6 0 7.74 1 1
7 0 5.55 1 1
8 0 5.86 1 1
9 0 6.01 1 1
10 0 4.86 1 1
# … with 397 more rows
It is a seed germination time dataset, composed of 407 individual seeds, with the variable time indicating the day on which a seed germinated (day 1 to 7), status indicating the event status (1 - germinated, 0 - censored) and stress indicating whether or not a stress treatment was applied to the seed (1 - applied, 0 - not applied); the variable len isn't used.
It is treated as right-censored data and does not satisfy the PH assumption. When passed to survMisc::ten() it looks like this:
> df_ten <- ten(Surv(time,status)~stress,data=df)
> as.data.frame(df_ten)
t e n cg ncg
1 1 52 407 1 200
2 2 44 335 1 148
3 3 51 236 1 104
4 4 20 127 1 53
5 5 15 69 1 32
6 6 2 42 1 17
7 7 6 29 1 15
8 1 20 407 2 207
9 2 55 335 2 187
10 3 58 236 2 132
11 4 37 127 2 74
12 5 12 69 2 37
13 6 11 42 2 25
14 7 2 29 2 14
I wanted to compare the survival curves using the log-rank test and its different weighted modifications. A simple log-rank test can be done with survival::survdiff(), but survMisc::comp() additionally calculates its weighted modifications. However, I found out that survdiff() and comp() give different p-values for the log-rank test.
Example:
> survdiff(Surv(time,status)~stress,df)
Call:
survdiff(formula = Surv(time, status) ~ stress, data = df)
N Observed Expected (O-E)^2/E (O-E)^2/V
stress=0 200 190 173 1.70 4.72
stress=1 207 195 212 1.38 4.72
Chisq= 4.7 on 1 degrees of freedom, p= 0.03
survdiff gives p=0.03
> comp(df_ten)
# newer R versions have some trouble displaying comp() results, so I include them like this
> as.data.frame(attr(df_ten,'lrt'))
W Q Var Z pNorm chiSq df pChisq
1 1 -17.1389764 7.788363e+01 -1.94205613 0.052130305 3.771582025 1 0.052130305
2 n -7159.0000000 6.365017e+06 -2.83760918 0.004545280 8.052025835 1 0.004545280
3 sqrtN -352.4452268 2.030262e+04 -2.47352082 0.013378901 6.118305256 1 0.013378901
4 S1 -14.2339087 2.040569e+01 -3.15100126 0.001627118 9.928808966 1 0.001627118
5 S2 -14.1996105 2.028548e+01 -3.15270847 0.001617633 9.939570710 1 0.001617633
6 FH_p=1_q=1 -0.1198792 2.298284e+00 -0.07907548 0.936972583 0.006252932 1 0.936972583
But comp gives p=0.052
The survdiff() result is actually the correct one and comp() is incorrect. I was able to figure out that this happens due to an incorrect covariance matrix calculation by survMisc::COV(), which is used within comp().
Again, example:
> survdiff(Surv(time,status)~stress,df)$var
[,1] [,2]
[1,] 62.17916 -62.17916
[2,] -62.17916 62.17916
This is the correct value of Var(Oi - Ei) = 62.2; I was able to obtain the same result by calculating it manually using the formula from the Kleinbaum and Klein book.
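For reference, here is a minimal sketch of that manual calculation, typing in the per-time-point totals from the asWide() table shown further below and applying the standard hypergeometric variance formula (my own illustration, not survMisc code):
# total at risk (n), total events (e) and group-1 at-risk (n1) at each time point
tab <- data.frame(
  n  = c(407, 335, 236, 127, 69, 42, 29),
  e  = c( 72,  99, 109,  57, 27, 13,  8),
  n1 = c(200, 148, 104,  53, 32, 17, 15)
)
# Var(O1 - E1) = sum over t of n1 * (n - n1) * e * (n - e) / (n^2 * (n - 1))
with(tab, sum(n1 * (n - n1) * e * (n - e) / (n^2 * (n - 1))))
#> [1] 62.17916   # matches survdiff()$var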
> COV(df_ten)[COV(df_ten)>0]
1 2 3 4 5 6 7
16.128233 20.824661 20.724324 10.556427 5.463805 2.473937 1.712247
But COV() gives an incorrect one; summed up it comes to 77.83, which is then used by comp().
I can still obtain the correct matrix from COV() by giving it a different ten format, but then it can't be used in comp().
> df_wide <- asWide(df_ten)
> as.data.frame(df_wide)
t n e n_1 e_1 n_2 e_2
1 1 407 72 200 52 207 20
2 2 335 99 148 44 187 55
3 3 236 109 104 51 132 58
4 4 127 57 53 20 74 37
5 5 69 27 32 15 37 12
6 6 42 13 17 2 25 11
7 7 29 8 15 6 14 2
> rowSums(df_wide[,COV(x=e, n=n, ncg=matrix(data=c(n_1, n_2), ncol=2))],dims=2)
1 2
1 62.17916 -62.17916
2 -62.17916 62.17916
I assume this has something to do with the properties of my dataset, specifically with the number of events per time point or with the relatively small number of time points, and with the formula used internally by COV() to calculate the value, which is:
...
res2 <- x[, (ncg / n) * (1 - (ncg / n)) * ((n - e) / (n - 1)) * e, by=list(t, cg)]
res2 <- data.table::setkey(res2[, sum(V1), by=t], t)
...
full function code here
I have tried this with different survival datasets in R, and it doesn't seem to affect larger ones, but in a relatively small dataset like coin::ocarcinoma you can also see a difference in the covariance and the corresponding log-rank test value calculated by the survdiff/comp functions.
So is there any straightforward way to fix this issue with comp()? For example by restructuring my dataset somehow, or perhaps by tweaking the formula used in COV()?

Use a dynamically created variable to select a column in mutate

I am trying to use the value of vector_of_names[position] in the code below to dynamically select a column from data to use as the value of "age" in mutate.
vector_of_names <- c("one","two","three")
id <- c(1,2,3,4,5,6)
position <- c(1,1,2,2,1,1)
one <- c(32,34,56,77,87,98)
two <- c(45,67,87,NA,33,56)
three <- c(NA,NA,NA,NA,NA,60)
data <- data.frame(id,position,one,two,three)
attempt <- data %>%
mutate(age=vector_of_names[position])
I see a similar question here, but the various answers fail because I am selecting the column from the vector of names via a variable within the data ("position"), which is never recognised; I suspect it is looking outside of the data.
I am taking this approach as the number of columns ("one", "two" and "three") is not known beforehand but the vector of their names is, so they need to be selected dynamically.
You could do:
data %>%
rowwise() %>%
mutate(age = c_across(all_of(vector_of_names))[position])
id position one two three age
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 32 45 NA 32
2 2 1 34 67 NA 34
3 3 2 56 87 NA 87
4 4 2 77 NA NA NA
5 5 1 87 33 NA 87
6 6 1 98 56 60 98
If you want to be more explicit about what values should be returned:
named_vector_of_names <- setNames(seq_along(vector_of_names), vector_of_names)
data %>%
rowwise() %>%
mutate(age = get(names(named_vector_of_names)[match(position, named_vector_of_names)]))
Base R vectorized option using matrix subsetting.
data$age <- data[vector_of_names][cbind(1:nrow(data), data$position)]
data
# id position one two three age
#1 1 1 32 45 NA 32
#2 2 1 34 67 NA 34
#3 3 2 56 87 NA 87
#4 4 2 77 NA NA NA
#5 5 1 87 33 NA 87
#6 6 1 98 56 60 98
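For what it's worth, the reason the original attempt fails is that vector_of_names[position] just looks up the names themselves, so age ends up holding the strings rather than the values of the corresponding columns:
library(dplyr)
data %>% mutate(age = vector_of_names[position])
# age is "one" "one" "two" "two" "one" "one" -- the column names, not their contents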

Why is my R code for filtering data producing different results with "fread()" and "ffdf()"?

I have a huge file with 7 million records and 160 variables. I came to know that fread() and read.csv.ffdf() are two ways to handle such big data. But when I try to use dplyr to filter these two datasets, I get different results. Below is a small subset of my data:
sample_data
AGE AGE_NEONATE AMONTH AWEEKEND
2 18 5 0
3 32 11 0
4 67 7 0
5 37 6 1
6 57 5 0
7 50 6 0
8 59 12 0
9 44 9 0
10 40 9 0
11 27 3 0
12 59 8 0
13 44 7 0
14 81 10 0
15 59 6 1
16 32 10 0
17 90 12 1
18 69 7 0
19 62 11 1
20 85 6 1
21 43 10 0
Code 1:
sample_data <- fread("/user/sample_data.csv", stringsAsFactors = T)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result 1:
AGE AGE_NEONATE AMONTH AWEEKEND
1 67 NA 7 0
2 81 NA 10 0
3 90 NA 12 1
4 69 NA 7 0
5 85 NA 6 1
Code 2:
sample_data <- read.csv.ffdf(file="C:/Users/sample_data.csv", header=F ,fill=T)
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
sample_data<-tbl_ffdf(sample_data)
sample_data<-header.true(sample_data)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result 2:
AGE AGE_NEONATE AMONTH AWEEKEND
1 81 10 0
2 90 12 1
3 85 6 1
I know that my 1st code is correct and gives me the correct results. What am I doing wrong in the 2nd code?
I haven't really tried running your code, but from what I can see, I suspect the following:
In your 2nd code version, you are reading the headers as part of the data. This leads to all the columns being imported as character rather than numeric.
In addition, most likely you have default.stringsAsFactors() returning TRUE, meaning that the imported character columns are treated as factors.
Now I guess that your between() is being applied to factor levels between 65 and 95, rather than to the actual numbers. Since you probably don't have data for every year of age, 67 and 69 are likely mapped to factor levels below 65 (i.e. as.numeric(AGE) returns the factor levels the numbers map to, not the numbers as you see them when printing).
Try to use stringsAsFactors = FALSE or convert explicitly to character after reading.
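A minimal sketch of that factor pitfall (hypothetical values, not your actual file):
AGE <- factor(c("18", "32", "67", "81", "90"))
as.numeric(AGE)                # 1 2 3 4 5 -- the factor level indices
as.numeric(as.character(AGE))  # 18 32 67 81 90 -- the actual ages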

Conditional filtering of data.frame with preceding and trailing NA observations

I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(tree=c(rep("A", 20)), doy=c(seq(75, 94)), count=c(NA,NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,NA,2,NA,NA,NA,NA,NA), pred=c(0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,6,9,12,20,44))
The count column represents when observations were made, and predictions are modelled over a complete set of days, in effect interpolating the data to a daily level (from observations made every 5 days).
I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this problem by combining filter with first and last, by thinking about a conditional mutate that creates a column flagging (with 1 or 0, probably using lag) whether there is an observation in the previous doy and then filtering on that column, and even by creating a second data.frame containing the proper doy range that could be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger, with multiple trees over multiple years (each tree/year having a different period of observation depending on the elevation of the sites, etc.). I am currently implementing the dplyr package across my code, so an answer within that framework would be great, but I would be happy with any solution at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
group_by(tree) %>%
filter(seq_along(count) >= min(which(!is.na(count))) &
seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is based on groups
ind <- with(myData, ave(!is.na(count), tree,
FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo
library(zoo)
do.call(rbind,by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
setDT(myData)[,.SD[do.call(`:`,as.list(range(.I[!is.na(count)])))] , tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3
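For completeness, since the question mentions a dplyr workflow, the same cumsum idea as in the ave() solution can be written as follows (a sketch, assuming dplyr is loaded):
library(dplyr)
myData %>%
  group_by(tree) %>%
  filter(cumsum(!is.na(count)) > 0 &
         rev(cumsum(rev(!is.na(count)))) > 0) %>%
  ungroup()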

What is the difference between with and within in R?

I always use "with" instead of "within" within the context of my research, but I originally thought they were the same. Just now I mistype "with" for "within" and the results returned are quite different. I am wondering why?
I am using the baseball data in the plyr package, so I first load the library by
require(plyr)
Then, I want to select all rows with the id "ansonca01". At first, as I said, I used "within", and ran the function as follows:
within(baseball, baseball[id=="ansonca01", ])
I got very strange results which basically include everything:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
113 yorkto01 1871 1 TRO 29 145 36 37 5 7 2 23 2 2 9 1 NA NA NA NA NA
.........
Then I use "with" instead of "within",
with(baseball, baseball[id=="ansonca01",])
and got the results that I expected
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
121 ansonca01 1872 1 PH1 46 217 60 90 10 7 0 50 6 6 16 3 NA NA NA NA NA
276 ansonca01 1873 1 PH1 52 254 53 101 9 2 0 36 0 2 5 1 NA NA NA NA NA
398 ansonca01 1874 1 PH1 55 259 51 87 8 3 0 37 6 0 4 1 NA NA NA NA NA
525 ansonca01 1875 1 PH1 69 326 84 106 15 3 0 58 11 6 4 2 NA NA NA NA NA
I checked the documentation of with and within by typing help(with) in the R environment, and got the following:
with is a generic function that evaluates expr in a local environment constructed from data. The environment has the caller's environment as its parent. This is useful for simplifying calls to modeling functions. (Note: if data is already an environment then this is used with its existing parent.)
Note that assignments within expr take place in the constructed environment and not in the user's workspace.
within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to data (this may fail in the data frame case if objects are created which cannot be stored in a data frame), and returns it. within can be used as an alternative to transform.
From this explanation of the differences, I don't get why I obtained different results from such a simple operation. Does anyone have any ideas?
I find simple examples often work to highlight the difference. Something like:
df <- data.frame(a=1:5,b=2:6)
df
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
with(df, {c <- a + b; df;} )
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
within(df, {c <- a + b; df;} )
# equivalent to: within(df, c <- a + b)
# i've just made the return of df explicit
# for comparison's sake
a b c
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
5 5 6 11
The documentation is quite clear about the semantics and return values (and nicely matches the everyday meanings of the words “with” and “within”):
Value:
For ‘with’, the value of the evaluated ‘expr’. For ‘within’, the
modified object.
Since your code doesn't modify anything inside baseball, the unmodified baseball is returned. with, on the other hand, doesn't return the object; it returns the value of expr.
Here’s an example where the expression modifies the object:
> head(within(cars, speed[dist < 20] <- 1))
speed dist
1 1 2
2 1 10
3 1 4
4 7 22
5 1 16
6 1 10
As above, with returns the value of the last evaluated expression. It is handy for one-liners such as:
with(cars, summary(lm (speed ~ dist)))
but it is not well suited when you want to keep the results of several expressions, since only the last value is returned.
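A small sketch of that point: with() will happily evaluate several expressions, but assignments made inside it are discarded and only the last value comes back (assuming no object called z already exists in your workspace):
res <- with(cars, {
  z <- dist - mean(dist)   # created only in the temporary environment
  summary(z)
})
res           # the summary of z
exists("z")   # FALSE -- z was never created in the workspace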
I often find within useful for manipulating a data.frame or list (or data.table) as I find the syntax easy to read.
I feel that the documentation could be improved by adding examples of use in this regard, e.g.:
df1 <- data.frame(a=1:3,
b=4:6,
c=letters[1:3])
## library("data.table")
## df1 <- as.data.table(df1)
df1 <- within(df1, {
a <- 10:12
b[1:2] <- letters[25:26]
c <- a
})
df1
giving
a b c
1: 10 y 10
2: 11 z 11
3: 12 6 12
and
df1 <- as.list(df1)
df1 <- within(df1, {
a <- 20:23
b[1:2] <- letters[25:26]
c <- paste0(a, b)
})
df1
giving
$a
[1] 20 21 22 23
$b
[1] "y" "z" "6"
$c
[1] "20y" "21z" "226" "23y"
Note also that methods("within") gives only these object types, being:
within.data.frame
within.list
(and within.data.table if the package is loaded).
Other packages may define additional methods.
Perhaps unexpectedly for some, with and within are generally not appropriate choices when manipulating variables within defined environments...
To address the comment: there is no within.environment method. Using with requires the function you're calling to be available within the environment, which somewhat defeats the purpose for me, e.g.
df1 <- as.environment(df1)
## with(df1, ls()) ## Error
assign("ls", ls, envir=df1)
with(df1, ls())
