flatten record based list/object into dataframe - r

Edit: this question is outdated. The jsonlite package flattens automatically.
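For reference, something like the following should do the flattening directly with jsonlite (a minimal sketch; fromJSON() with flatten = TRUE collapses nested objects such as "l" into columns like l.t, l.la, and the exact column names depend on the data):
library(jsonlite)
raw <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", flatten = TRUE)
mydf <- raw$data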
I am dealing with online data streams that use record-based encoding, usually in JSON. The structure of the objects (i.e. the names in the JSON) is known from the API documentation; however, values are mostly optional and not present in every record. Lists can contain new lists, and the structure is sometimes quite deep. Here is a fairly simple example of some GPS data: http://pastebin.com/raw.php?i=yz6z9t25. Note that in the lower rows, the "l" object is missing due to no GPS signal.
I am looking for an elegant way to flatten these objects into a dataframe. I am currently using something like this:
library(RJSONIO)
library(plyr)
obj <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", simplifyWithNames=FALSE, simplify=FALSE)
flatdata <- lapply(obj$data, as.data.frame);
mydf <- rbind.fill(flatdata)
This does the job; however, it is slow and a bit error-prone. A problem with this approach is that I am not using my knowledge about the structure (object names) of the data; instead it is inferred from the data. This leads to problems when a certain property happens to be absent in every record. In that case, it will not appear in the dataframe at all, instead of appearing as a column of NA values. This can lead to issues downstream. For example, I need to process the location timestamp:
mydf$l.t <- structure(mydf$l.t/1000, class="POSIXct")
However, this will result in an error for any dataset in which the l$t object is absent everywhere. Furthermore, both as.data.frame and rbind.fill make things quite slow. The example dataset is a relatively small one. Any suggestions for a better implementation? A robust solution would always yield a dataframe with the same columns in the same order, where only the number of rows varies.
Edit: below a dataset with more meta data. It is larger in size and nested more deeply:
obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)

Here's a solution that lets you take advantage of your prior knowledge of data field names and classes. Also, by avoiding repeated calls to as.data.frame and the single call to plyr's rbind.fill() (both time-intensive) it runs about 60 times faster on your example data.
cols <- c("id", "ls", "ts", "l.lo","l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")
numcols <- c("l.lo", "l.t", "l.ac", "l.la")
## Flatten each top-level list element, converting it to a character vector.
x <- lapply(obj$data, unlist)
## Extract fields that might be present in each record (returning NA if absent).
y <- sapply(x, function(X) X[cols])
## Convert to a data.frame with columns of desired classes.
z <- as.data.frame(t(y), stringsAsFactors=FALSE)
z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
Edit: To confirm that my approach gives results identical to those in the original question, I ran the following test. (Notice that in both cases I set stringsAsFactors=FALSE to avoid meaningless differences in orderings of the factor levels.)
flatdata <- lapply(obj$data, as.data.frame, stringsAsFactors=FALSE)
mydf <- rbind.fill(flatdata)
identical(z, mydf)
# [1] TRUE
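With the columns guaranteed to exist (possibly as all-NA), the timestamp conversion from the question can then be applied without any special-casing (a small follow-up sketch):
z$l.t <- structure(z$l.t/1000, class="POSIXct")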
Further Edit:
Just for the record, here's an alternate version of the above that in addition automatically:
finds names of all data fields
determines their class/type
coerces the columns of the final data.frame to the correct class
dat <- obj$data
## Find the names and classes of all fields
fields <- unlist(lapply(dat, function(X) rapply(X, class, how="unlist")))
fields <- fields[unique(names(fields))]
cols <- names(fields)
## Flatten each top-level list element, converting it to a character vector.
x <- lapply(dat, unlist)
## Extract fields that might be present in each record (returning NA if absent).
y <- sapply(x, function(X) X[cols])
## Convert to a data.frame with columns of desired classes.
z <- as.data.frame(t(y), stringsAsFactors=FALSE)
## Coerce columns of z (all currently character) back to their original type
z[] <- lapply(seq_along(fields), function(i) as(z[[cols[i]]], fields[i]))
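A quick way to check that the automatically detected classes came through as intended (just a sanity check, not part of the answer itself):
str(z)
sapply(z, class)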

Here's an attempt that tries to make no assumptions about the types of the data. It's a bit slower than @JoshOBrien's, but faster than the OP's original solution.
Joshua <- function(x) {
  un <- lapply(x, unlist, recursive=FALSE)
  ns <- unique(unlist(lapply(un, names)))
  un <- lapply(un, function(x) {
    y <- as.list(x)[ns]
    names(y) <- ns
    lapply(y, function(z) if(is.null(z)) NA else z)})
  s <- lapply(ns, function(x) sapply(un, "[[", x))
  names(s) <- ns
  data.frame(s, stringsAsFactors=FALSE)
}
Josh <- function(x) {
  cols <- c("id", "ls", "ts", "l.lo", "l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")
  numcols <- c("l.lo", "l.t", "l.ac", "l.la")
  ## Flatten each top-level list element, converting it to a character vector.
  x <- lapply(x, unlist)
  ## Extract fields that might be present in each record (returning NA if absent).
  y <- sapply(x, function(X) X[cols])
  ## Convert to a data.frame with columns of desired classes.
  z <- as.data.frame(t(y))
  z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
  z
}
Jeroen <- function(x) {
  flatdata <- lapply(x, as.data.frame)
  rbind.fill(flatdata)
}
library(rbenchmark)
benchmark(Josh=Josh(obj$data), Joshua=Joshua(obj$data),
          Jeroen=Jeroen(obj$data), replications=5, order="relative")
#     test replications elapsed  relative user.self sys.self user.child sys.child
# 1   Josh            5    0.24  1.000000      0.24        0         NA        NA
# 2 Joshua            5    0.31  1.291667      0.32        0         NA        NA
# 3 Jeroen            5   12.97 54.041667     12.87        0         NA        NA

For clarity, I am adding a combination of Josh's and Joshua's solutions, which is the best I have come up with so far.
flatlist <- function(mylist){
  lapply(rapply(mylist, enquote, how="unlist"), eval)
}
records2df <- function(recordlist, columns) {
  if(length(recordlist)==0 && !missing(columns)){
    return(as.data.frame(matrix(ncol=length(columns), nrow=0, dimnames=list(NULL, columns))))
  }
  un <- lapply(recordlist, flatlist)
  if(!missing(columns)){
    ns <- columns
  } else {
    ns <- unique(unlist(lapply(un, names)))
  }
  un <- lapply(un, function(x) {
    y <- as.list(x)[ns]
    names(y) <- ns
    lapply(y, function(z) if(is.null(z)) NA else z)})
  s <- lapply(ns, function(x) sapply(un, "[[", x))
  names(s) <- ns
  data.frame(s, stringsAsFactors=FALSE)
}
The function is reasonably fast. I still think it should be possible to speed this up further, though:
obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)
flatdata <- records2df(obj$data)
It also allows you to 'force' certain columns, although it doesn't result in too much of a speedup:
flatdata <- records2df(obj$data, columns=c("m", "doesnotexist"))
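As a side note, the columns argument also makes the zero-record case well-defined (a small usage sketch; the column names here are just illustrative):
## an empty record list with forced columns yields a 0-row data frame with those columns
records2df(list(), columns=c("id", "ls", "ts", "m"))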

Related

probe global variables to call inside function

I want to pass variables within the .GlobalEnv when inside a function. Basically, I want to concatenate a number of data frames into a matrix.
Here is some dummy code:
Alpha <- data.frame(lon=124.9167,lat=1.53333)
Alpha_2 <- data.frame(lon=3.13333, lat=42.48333)
Alpha_3 <- data.frame(lon=-91.50667, lat=27.78333)
myfunc <- function(x){
  vars <- ls(.GlobalEnv, pattern=x)
  mat <- as.matrix(rbind(vars[1], vars[2], vars[3]))
  return(mat)
}
When calling myfunc('Alpha') I would like the same thing to be returned as when you run;
as.matrix(rbind(Alpha, Alpha_2, Alpha_3))
lon lat
1 124.91670 1.53333
2 3.13333 42.48333
3 -91.50667 27.78333
Any pointers would be appreciated, thanks!
You can use get (or mget for several names at once) to retrieve variables by name, and then use do.call with rbind to bind them together.
myfunc <- function(x){
  vars <- ls(.GlobalEnv, pattern=x)
  df <- do.call(rbind, mget(vars, .GlobalEnv)) # courtesy @Roland
  return(df)
}
myfunc("Alpha")
# lon lat
# 1 124.91670 1.53333
# 2 3.13333 42.48333
# 3 -91.50667 27.78333
Note, in practice, you probably want to check that the variables that match the pattern actually are what you think they are, but this gives you the rough tools you want.
Old version (2nd line of func):
df <- do.call(rbind, lapply(vars, get, envir=.GlobalEnv))
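For the sanity check mentioned above, one hedged possibility is to filter the matched names before binding (a hypothetical variant, not from the original answer; it assumes the expected columns are lon and lat):
myfunc_checked <- function(x){
  vars <- ls(.GlobalEnv, pattern=x)
  ## keep only objects that really are data frames with the expected columns
  vars <- Filter(function(v){
    obj <- get(v, envir=.GlobalEnv)
    is.data.frame(obj) && all(c("lon", "lat") %in% names(obj))
  }, vars)
  do.call(rbind, mget(vars, envir=.GlobalEnv))
}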

Is it possible to modify list elements?

I have a list of records:
z <- list(list(a=1),list(a=4),list(a=2))
and I try to add fields to each of them.
Alas, neither
lapply(z,function(l) l$b <- 1+l$a)
nor
for(l in z) l$b <- 1+l$a
modifies z.
In this simple case I can, of course, do
z <- lapply(z,function(l) c(list(b= 1+l$a),l))
but this quickly gets out of hand when the lists have more nesting:
z <- list(list(a=list(b=1)),list(a=list(b=4)),list(a=list(b=2)))
How do I turn it into
list(list(a=list(b=1,c=2)),list(a=list(b=4,c=5)),list(a=list(b=2,c=3)))
without repeating the definition of the whole structure?
Each element of z has many fields, not just a; and z[[10]]$a has many subfields, not just b.
Your first code example doesn't modify the list because you need to return the list in your call to lapply:
z <- list(list(a=1),list(a=4),list(a=2))
expected <- list(list(a=1, b=2), list(a=4, b=5), list(a=2, b=3))
outcome <- lapply(z,function(l) {l$b <- 1+l$a ; l})
all.equal(expected, outcome)
# [1] TRUE
In the doubly nested example, you could use lapply within lapply, again making sure to return the list in the inner lapply:
z <- list(list(a=list(b=1)),list(a=list(b=4)),list(a=list(b=2)))
expected <- list(list(a=list(b=1, c=2)), list(a=list(b=4, c=5)), list(a=list(b=2, c=3)))
obtained <- lapply(z, function(l1) { lapply(l1, function(l2) {l2$c = l2$b+1 ; l2 } )})
all.equal(expected, obtained)
# [1] TRUE
Another, somewhat convoluted, option:
z <- list(list(a=1),list(a=4),list(a=2))
res <- list(list(a=list(b=1,c=2)),list(a=list(b=4,c=5)),list(a=list(b=2,c=3)))
res1 <- rapply(z, function(x) list(b = x, c = x + 1), how = "replace")
all.equal(res, res1)
# [1] TRUE
I only say convoluted because rapply can be tricky to use at times (for me at least).
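For comparison, a hedged alternative for the flat case uses base R's modifyList() inside lapply (not mentioned in the answers above):
z <- list(list(a=1), list(a=4), list(a=2))
z2 <- lapply(z, function(l) modifyList(l, list(b = 1 + l$a)))
all.equal(z2, list(list(a=1, b=2), list(a=4, b=5), list(a=2, b=3)))
# [1] TRUE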

How to append rows to an R data frame

I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a row to the end of the data frame. I started with the following code.
for (i in 1:10) {
  df$x = rbind(df$x, i)
  df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to avoid the additional line of code that would be required to update the indices on every iteration, and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
  df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
  for(i in 1:n){
    df$x[i] <- i
    df$y[i] <- toString(i)
  }
  df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
  x <- numeric(n)
  y <- character(n)
  for (i in 1:n) {
    x[i] <- i
    y[i] <- i
  }
  data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
#      expr         min          lq      median         uq         max neval
#  f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176     5
#  f3(1000)  149.417636  150.529011  150.827393  151.02230  160.637845     5
#  f4(1000)    7.872647    7.892395    7.901151    7.95077    8.049581     5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
  df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code, there is one other problem:
You should use stringsAsFactors = FALSE if you don't want the character values converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
  df <- data.frame(x = numeric(), y = character())
  for(i in 1:n){
    df <- rbind(df, data.frame(x = i, y = toString(i)))
  }
  df
}
# use list
f2 <- function(n){
  df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
  for(i in 1:n){
    df[i,] <- list(i, toString(i))
  }
  df
}
# pre-allocate space
f3 <- function(n){
  df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
  for(i in 1:n){
    df$x[i] <- i
    df$y[i] <- toString(i)
  }
  df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It can well be a few rows, or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all the related answers on SO, I have come up with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
data.frame's own ability to grow, using a custom environment (which has reference semantics) to store the data.frame so it will not be copied on return.
Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straightforward:
create.1 <- function(elems)
{
  return(as.data.table(elems))
}
append.1 <- function(dt, elems)
{
  return(rbindlist(list(dt, elems), use.names = TRUE))
}
access.1 <- function(dt)
{
  return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2 <- function(elems)
{
  return(as.data.table(elems))
}
append.2 <- function(dt, elems)
{
  n <- attr(dt, 'rowcount')
  if (is.null(n))
    n <- nrow(dt)
  if (n == nrow(dt))
  {
    tmp <- elems[1]
    tmp[[1]] <- rep(NA, n)
    dt <- rbindlist(list(dt, tmp), fill = TRUE, use.names = TRUE)
    setattr(dt, 'rowcount', n)
  }
  pos <- as.integer(match(names(elems), colnames(dt)))
  for (j in seq_along(pos))
  {
    set(dt, i = as.integer(n + 1), pos[[j]], elems[[j]])
  }
  setattr(dt, 'rowcount', n + 1)
  return(dt)
}
access.2 <- function(elems)
{
  n <- attr(elems, 'rowcount')
  return(as.data.table(elems[1:n, ]))
}
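A minimal usage sketch of this variant (assuming data.table is attached; the helper names are the ones defined above):
library(data.table)
dt <- create.2(list(a=1, b=2))       # start the container with the first row
dt <- append.2(dt, list(a=3, b=4))   # append another row; the table doubles as needed
access.2(dt)                         # returns the accumulated rows as a data.table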
SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.
This is basically a copy-and-paste of Karsten W.'s answer on a similar thread.
create.3 <- function(elems)
{
  con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
  RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
  return(con)
}
append.3 <- function(con, elems)
{
  RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append = TRUE)
  return(con)
}
access.3 <- function(con)
{
  return(RSQLite::dbReadTable(con, "t", row.names = NULL))
}
data.frame's own row-appending + custom environment.
create.4 <- function(elems)
{
  env <- new.env()
  env$dt <- as.data.frame(elems)
  return(env)
}
append.4 <- function(env, elems)
{
  env$dt[nrow(env$dt) + 1, ] <- elems
  return(env)
}
access.4 <- function(env)
{
  return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all via indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)
test <- function(id, n=1000)
{
  n <- n - 1
  el <- list(a=1, b=2, c=3, d=4)
  o <- do.call(paste0('create.', id), list(el))
  s <- paste0('append.', id)
  for (i in 1:n)
  {
    o <- do.call(s, list(o, el))
  }
  return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n=10 insertions.
I also added a 'placebo' function (with suffix 0) that doesn't perform anything - just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
For 1E5 rows (measurements done on an Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):
nr  function      time
 4  data.frame  228.251
 3  sqlite      133.716
 2  data.table    3.059
 1  rbindlist   169.998
 0  placebo       0.202
It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append a rather small number of rows (n<=100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the answers are missing a solution with the newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers - all of which are quite interesting.
The biggest advantage of purrr and tidyr is better readability, IMHO.
purrr replaces lapply with the more flexible map() family,
tibble (from the same tidyverse family) offers the super-intuitive method add_row() - it just does what it says :)
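Assuming the empty data frame from the question and the relevant packages are attached, the setup would look roughly like this (a small sketch, not part of the original answer):
library(dplyr)
library(purrr)
library(tibble)   # provides add_row()
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)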
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second, right after data.table (if you ignore the placebo), in the benchmark by @Adam Ryczkowski:
nr  function      time
 4  data.frame  228.251
 3  sqlite      133.716
 2  data.table    3.059
 1  rbindlist   169.998
 0  placebo       0.202
A more generic solution might be the following.
extendDf <- function(df, n) {
  withFactors <- sum(sapply(df, function(X) is.factor(X))) > 0
  nr <- nrow(df)
  colNames <- names(df)
  for (c in 1:length(colNames)) {
    if (is.factor(df[,c])) {
      col <- vector(mode='character', length = nr+n)
      col[1:nr] <- as.character(df[,c])
      col[(nr+1):(n+nr)] <- rep(col[1], n) # to avoid extra levels
      col <- as.factor(col)
    } else {
      col <- vector(mode=mode(df[1,c]), length = nr+n)
      class(col) <- class(df[1,c])
      col[1:nr] <- df[,c]
    }
    if (c==1) {
      newDf <- data.frame(col, stringsAsFactors=withFactors)
    } else {
      newDf[,c] <- col
    }
  }
  names(newDf) <- colNames
  newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Let's take a vector 'point' which holds the numbers 1 to 5:
point = c(1,2,3,4,5)
If we want to append the number 6 to it, the commands below may come in handy:
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The command append takes three arguments:
the vector/column to be modified.
value to be included in the modified vector.
a subscript, after which the values are to be appended.
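For instance, to insert the value in the middle of the vector rather than at the end (a small illustration):
point <- c(1, 2, 3, 4, 5)
append(point, 6, after = 2)
# [1] 1 2 6 3 4 5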
My solution is almost the same as the original answer, but it didn't work for me. So I gave names to the columns, and now it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))

mapply for row cor.test function

I am trying to use cor.test over the rows in 2 matrices, namely cer and par.
cerParCorTest <- mapply(function(x,y) cor.test(x,y), cer, par)
mapply, however, works on columns.
This issue has been discussed in Efficient apply or mapply for multiple matrix arguments by row. I tried the split solution from there (as below)
cer <- split(cer, row(cer))
par <- split(par, row(par))
and it results in the error (plus it is slow)
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
I also tried t(par) and t(cer) to get it running over the rows, but it results in the error
Error in cor.test.default(x, y) : not enough finite observations
The matrices are shown below (for cer; par has the same layout):
                   V1698       V1699       V1700      V1701
YAL002W(cer)  0.01860500  0.01947700  0.02043300  0.0214740
YAL003W(cer)  0.07001600  0.06943900  0.06891200  0.0684330
YAL005C(cer)  0.02298100  0.02391900  0.02485800  0.0257970
YAL007C(cer) -0.00026047 -0.00026009 -0.00026023 -0.0002607
YAL008W(cer)  0.00196200  0.00177360  0.00159490  0.0014258
My question is why transposing the matrix does not work and what is a short solution that will allow running over rows with mapply for cor.test().
I apologise for the long post and thanks in advance for any help.
I don't know what the dimensions of your matrices are, but this works fine for me:
N <- 3751 * 1900
cer.m <- matrix(1:N, ncol=1900)
par.m <- matrix(1:N + rnorm(N), ncol=1900)
ll <- mapply(cor.test,
             split(par.m, row(par.m)),
             split(cer.m, row(cer.m)),
             SIMPLIFY=FALSE)
This will give you a list of 3751 elements (the correlation test for each row).
EDIT: without split, you give the index of the row; this should be fast:
ll <- mapply(function(x,y) cor.test(cer.m[x,], par.m[y,]),
             1:nrow(cer.m),
             1:nrow(cer.m),
             SIMPLIFY=FALSE)
EDIT2: to get the estimate value, for example:
sapply(ll,'[[','estimate')
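Other components of the returned htest objects can be extracted the same way, e.g. the p-values:
sapply(ll, '[[', 'p.value')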
You could always just program things in a for loop; it seems reasonably fast at these dimensions:
x1 <- matrix(rnorm(10000000), nrow = 2000)
x2 <- matrix(rnorm(10000000), nrow = 2000)
out <- vector("list", nrow(x1))
system.time(
for (j in seq_along(out)) {
out[[j]] <- cor.test(x1[j, ], x2[j, ])
}
)
user system elapsed
1.35 0.00 1.36
EDIT: If you only want the estimate, I wouldn't store the results in a list, but a simple vector:
out2 <- vector("numeric", nrow(x1))
for (j in seq_along(out)) {
out2[j] <- cor.test(x1[j, ], x2[j, ])$estimate
}
head(out2)
If you want to store all the results and simply extract the estimate from each, then this should do the trick:
out3 <- as.numeric(sapply(out, "[", "estimate"))
# Confirm they are the same
all.equal(out2, out3)
# [1] TRUE
The tradeoff is that the first method stores all the data in a list, which may be useful for further processing, vs. a simpler method that only grabs what you initially want.

return identical DF or vector instead of NULL

I have data.frames which are NULL in my results, but I don't want them to be NULL; I want them to stay the same as at the beginning (unchanged). I'm working on a list of files, and the aim of my code is to fill all the NAs with data from my other data.frames (according to the best correlation coefficient). Here's a small example:
Imagine these are my 3 input data frames (10 rows each):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,NA,NA,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
The aim here is, for example: if there are NAs in ST1, ST1 must be filled with data from whichever file (ST2 or ST3 in this example) is best correlated with ST1.
As ST3 has no data here, I cannot compute any correlation coefficient. So the NAs in ST3 cannot be filled, and ST3 also cannot be used to fill another file. In other words, ST3 is of no use here. Nevertheless, I want to keep ST3 unchanged throughout my code.
So the problem in my code comes from data.frames with no data, i.e. with only NAs.
For the moment my code gives this for "refill" (at the end of my code, with the NAs in my data.frames filled):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- NULL
But actually, I want for results in "refill" this:
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
So for data.frames with only NAs, I don't want them to be NULL in "refill"; I want them to be identical to the input. I need this so that the data.frames have the same dimensions in the input and the output.
If they come out as NULL (as they do for the moment, though I don't understand why, and I want to change this), the data.frame will have 0 rows instead of 10 rows like the other data.frames.
So I think there's something wrong in my code in function "process.all" or "na.fill" or maybe "lst".
Here's my code; it is a reproducible example so you can see my error (you'll see in head(refill) that ST2 comes out as NULL).
Sorry if it is a bit long, but my error depends on the other functions used earlier. I hope you understand my problem and what I'm trying to do. Thanks for your help!
(For information, in the functions "process.all" and "na.fill": x is the data.frame I want to fill, and y is the file that will be used to fill x, i.e. the file best correlated with x.)
# my data for example
DF1 <- data.frame(x1=c(NA,NA,rnorm(3:20)),x2=c(31:50))
write.table(DF1,"ST001_2008.csv",sep=";")
DF2 <- data.frame(x1=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,rnorm(1:10)),x2=c(1:20))
write.table(DF2,"ST002_2008.csv",sep=";")
DF3 <- data.frame(x1=rnorm(81:100),x2=NA)
write.table(DF3,"ST003_2008.csv",sep=";")
DF4 <- data.frame(x1=c(21:40),x2=rnorm(1:20))
write.table(DF4,"ST004_2008.csv",sep=";")
# Correlation table
corhiver2008capt1 <- read.table(text="           ST001 ST002      ST003      ST004
ST001  1.0000000    NA -0.4350665  0.3393549
ST002         NA    NA         NA         NA
ST003 -0.4350665    NA  1.0000000 -0.4992513
ST004  0.3393549    NA -0.4992513  1.0000000", header=T)
lst <- lapply(list.files(pattern="\\_2008.csv$"), read.table,sep=";", header=TRUE, stringsAsFactors=FALSE)
Stations <-c("ST001","ST002","ST003","ST004")
names(lst) <- Stations
# searching the highest correlation for each data.Frame
get.max.cor <- function(station, mat){
  mat[row(mat) == col(mat)] <- -Inf
  m <- max(mat[station, ], na.rm=TRUE)
  if (is.finite(m)) {return(which(mat[station, ] == m))}
  else {return(NA)}
}
# fill the data.frame with the data.frame which has the highest correlation coefficient
na.fill <- function(x, y){
  if(all(!is.finite(y[1:10,1]))) return(y)
  i <- is.na(x[1:10,1])
  xx <- y[1:10,1]
  new <- data.frame(xx=xx)
  x[1:10,1][i] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[i]
  x
}
process.all <- function(df.list, mat){
  f <- function(station)
    na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
  g <- function(station){
    x <- df.list[[station]]
    if(any(!is.finite(x[1:10,1]))){
      mat[row(mat) == col(mat)] <- -Inf
      nas <- which(is.na(x[1:10,1]))
      ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
      for(y in ord){
        if(all(!is.na(df.list[[y]][1:10,1][nas]))){
          xx <- df.list[[y]][1:10,1]
          new <- data.frame(xx=xx)
          x[1:10,1][nas] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[nas]
          break
        }
      }
    }
    x
  }
  n <- length(df.list)
  nms <- names(df.list)
  max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
  df.list <- lapply(seq.int(n), f)
  df.list <- lapply(seq.int(n), g)
  names(df.list) <- nms
  df.list
}
refill <- process.all(lst, corhiver2008capt1)
refill <- as.data.frame(refill) ########## HERE IS THE PROBLEM ######
refill
How about
if(sum(!is.na(ST3)) == 0) {
  # skip whatever you normally would do and go to the next vector
}
This assumes, of course, that you don't have any problems with, say, a vector of 1999 NAs and one numerical value.
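Applied to the na.fill() function from the question, the guard could look roughly like this (a sketch only, assuming as in the original code that the first 10 rows of column 1 hold the values of interest; note the second early return is changed to return x, so an all-NA frame passes through unchanged rather than being replaced):
na.fill <- function(x, y){
  ## if x contains no data at all, return it unchanged instead of dropping it
  if(sum(!is.na(x[1:10,1])) == 0) return(x)
  ## if y has nothing usable to fill from, also return x unchanged (the original returned y here)
  if(all(!is.finite(y[1:10,1]))) return(x)
  i <- is.na(x[1:10,1])
  xx <- y[1:10,1]
  new <- data.frame(xx=xx)
  x[1:10,1][i] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[i]
  x
}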
