Data.table: rbind a list of data tables with unequal columns [duplicate]

I have a list of data tables with unequal numbers of columns: some have 35 columns and others have 36.
I have this line of code, but it generates an error:
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify this so that it works properly?

Here's a minimal example of what you are trying to do.
No need to use any other package to do this. Just set fill=TRUE in rbindlist.
You can do this:
library(data.table)
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2 = c(3,4,5))
# fill = TRUE pads columns missing from a table with NA
df3 <- rbindlist(list(df1, df2), fill = TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
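Applied to the shape described in the question (a nested list that is flattened first), the same idea could look like the sketch below. The contents of full_data.lst here are a hypothetical stand-in, and the idcol name "source" is an assumption added only to show which list element each row came from.

```r
library(data.table)

# Hypothetical stand-in for the asker's nested list of data.tables
full_data.lst <- list(
  list(data.table(m1 = 1:2, m2 = 3:4), data.table(m1 = 5L)),
  list(data.table(m1 = 6:7))
)

# Flatten one level so that every element is a data.table
lst <- unlist(full_data.lst, recursive = FALSE)

# Bind by column name; columns missing from a table are filled with NA
model_dat <- rbindlist(lst, fill = TRUE, idcol = "source")
print(model_dat)
```

Calling rbindlist directly, rather than through do.call("rbind", ...), is what makes the fill argument available.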

If I understood your question correctly, I can see only two options for appending your data tables.
Option A: Drop the extra variable from the datasets that have it.
table$column_Name <- NULL
Option B: Create the variable, filled with missing values, in the datasets that lack it (apply this to each incomplete table, not to the list itself).
table$column_Name <- NA
And then call rbind.
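Option B can be applied to a whole list at once. The following is a minimal base R sketch, assuming the tables are plain data frames; the names tables, all_cols and padded are made up for illustration.

```r
# Hypothetical input: two data frames, the second one lacks column b
tables <- list(
  data.frame(a = 1:2, b = 3:4),
  data.frame(a = 5:6)
)

# Union of all column names across the list
all_cols <- unique(unlist(lapply(tables, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(tables, function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
})

combined <- do.call(rbind, padded)
print(combined)
```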

Try to use rbind.fill from package plyr:
Input data, 3 dataframes with different number of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3


Appending csvs with different column quantities and spellings

Nothing too complicated: I would like to use rbindlist on a large number of csvs where the column names change a little over time (minor spelling changes) but the column order stays the same. At some point, two additional columns were added to the csvs, which I don't really need.
library(data.table)
csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)
When I run
csv_append <- rbindlist(l, fill=TRUE) #which also forces use.names=TRUE
it gives me a data.table with 7 columns
apple orange dragonfruit appole orangina dragonificfruit pear
1: 1 2 13 NA NA NA NA
2: 2 3 14 NA NA NA NA
3: 3 4 15 NA NA NA NA
4: NA NA NA 7 6 2 1
5: NA NA NA 8 7 3 2
6: NA NA NA 9 8 4 3
as opposed to what I want, which is:
V1 V2 V3 V4
1: 1 2 13 NA
2: 2 3 14 NA
3: 3 4 15 NA
4: 7 6 2 1
5: 8 7 3 2
6: 9 8 4 3
which I can use, even though I have to go through the extra step later of renaming the columns back to standard variable names.
If I instead try the default fill=FALSE and use.names=FALSE, it throws an error:
Error in rbindlist(l) :
Item 2 has 4 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
Is there a simple way to manage this, either by forcing fill=TRUE and use.names=FALSE somehow or by omitting the additional columns in the csvs that have them by specifying a vector of columns to append?
If we only need the first 3 columns, then drop the rest and bind as usual:
rbindlist(lapply(l, function(i) i[, 1:3]))
# apple orange dragonfruit
# 1: 1 2 13
# 2: 2 3 14
# 3: 3 4 15
# 4: 7 6 2
# 5: 8 7 3
# 6: 9 8 4
Another option, from the comments: read the files directly with fread, keeping only the first 3 columns via select, then bind:
rbindlist(lapply(filenames, fread, select = c(1:3)))
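If the extra columns should be kept rather than dropped, one way to get positional binding is to rename the leading columns to standard names before binding with fill = TRUE. This is only a sketch: std_names is a hypothetical vector of the standard names, and copy() is used so the original tables are not renamed by reference.

```r
library(data.table)

csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)

# Hypothetical standard names for the columns that share a position
std_names <- c("apple", "orange", "dragonfruit")

renamed <- lapply(l, function(d) {
  d <- copy(d)                                            # leave the input untouched
  setnames(d, old = seq_along(std_names), new = std_names) # rename by position
  d
})

csv_append <- rbindlist(renamed, fill = TRUE)
print(csv_append)
```

Because the first three columns now share names, fill = TRUE only has to pad the extra pear column.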
Here is an option that matches names using phonetic from stringdist. Extract the column names from the list of data.tables ('nmlist') and unlist them; group the names by their phonetic code and take the first name in each group; relist the result to the same structure as 'nmlist'; use Map with setnames to rename the columns of each data.table; then apply rbindlist. Note that setnames renames by reference, so the tables in 'l' are modified.
library(stringdist)
library(data.table)
nmlist <- lapply(l, names)
nm1 <- unlist(nmlist)
rbindlist(Map(setnames, l, relist(ave(nm1, phonetic(nm1),
FUN = function(x) x[1]), skeleton = nmlist)), fill = TRUE)
Output:
# apple orange dragonfruit pear
#1: 1 2 13 NA
#2: 2 3 14 NA
#3: 3 4 15 NA
#4: 7 6 2 1
#5: 8 7 3 2
#6: 9 8 4 3

R - counting with NA in dataframe [duplicate]

Let's say I have this data frame in R:
df <- read.table(text="
id a b c
1 42 3 2 NA
2 42 NA 6 NA
3 42 1 NA 7", header=TRUE)
I'd like to sum columns a, b and c into a new column, so the result should look like this:
id a b c d
1 42 3 2 NA 5
2 42 NA 6 NA 6
3 42 1 NA 7 8
My code below doesn't work because of the NA values. Please note that I have to choose which columns to sum, since my real data frame has some columns that I don't want to include.
df %>%
mutate(d = a + b + c)
You can use rowSums for this; it has an na.rm parameter to drop NA values.
df %>% mutate(d=rowSums(tibble(a,b,c), na.rm=TRUE))
Or, without dplyr, using just base R:
df$d <- rowSums(subset(df, select=c(a,b,c)), na.rm=TRUE)
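One behaviour worth knowing about: with na.rm=TRUE, rowSums returns 0 for a row in which every selected column is NA. The sketch below adds a fourth, all-NA row to the example (an assumption, not part of the question) and resets such rows to NA; the names cols and all_na are made up.

```r
# The question's data plus one hypothetical row where a, b and c are all NA
df <- data.frame(
  id = 42,
  a  = c(3, NA, 1, NA),
  b  = c(2, 6, NA, NA),
  c  = c(NA, NA, 7, NA)
)

cols <- c("a", "b", "c")

# Sum only the chosen columns, ignoring NA
df$d <- rowSums(df[cols], na.rm = TRUE)

# Rows with no non-NA value would otherwise get a sum of 0
all_na <- rowSums(!is.na(df[cols])) == 0
df$d[all_na] <- NA
print(df)
```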

Full outer join of multiple dataframes stored as elements of a list using data.table

I'm trying to do a full outer join of multiple dataframes stored as elements of a list using data.table. I have successfully done this using the merge_recurse() function of the reshape package, but it is very slow with larger datasets, and I'd like to speed up the merge by using data.table. I'm not sure of the best way for data.table to handle a list of multiple dataframes. I'm also not sure whether I've written the Reduce() call correctly on unique keys to do a full outer join on multiple dataframes.
Here's a small example:
#Libraries
library("reshape")
library("data.table")
#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))
#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))
#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
Here's my expected output:
> listMerged
x y a b c
1 1 1 1 0 0
2 1 2 2 5 10
3 1 3 3 10 20
4 2 1 4 15 30
5 2 2 5 20 40
6 2 3 6 25 50
7 3 1 7 30 60
8 3 2 8 35 70
9 3 3 9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80
Here are my resources:
Suggestion to use Reduce() function on data.table (see last comment of answer)
Suggestion to use "unique keys" to do full outer join in data.table
This worked for me:
library("reshape")
library("data.table")
##
filelist <- list(
data.frame(
x=c(1,1,1,2,2,2,3,3,3),
y=c(1,2,3,1,2,3,1,2,3),
a=1:9),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,1),
b=seq(from=0, by=5, length.out=9)),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,2),
c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList,function(x){
setkeyv(x,cols=c("x","y"))
})
##
> Reduce(function(x,y){
merge(x,y,all=TRUE,allow.cartesian=TRUE)
},dtList)
x y a b c
1: 1 1 1 0 0
2: 1 2 2 5 10
3: 1 3 3 10 20
4: 2 1 4 15 30
5: 2 2 5 20 40
6: 2 3 6 25 50
7: 3 1 7 30 60
8: 3 2 8 35 70
9: 3 3 9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80
Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in
> dt
filelist
1: <data.frame>
2: <data.frame>
3: <data.frame>
which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above. Also, did this work for you?
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist)
Error in eval(expr, envir, enclos) : could not find function "J"
which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by @Arun in this answer.
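A variant of the same approach that names the join columns explicitly instead of relying on keys is sketched below. The three small tables are hypothetical stand-ins for filelist, and as.data.table is swapped in for copy plus setDT so the conversion does not depend on modifying list elements in place.

```r
library(data.table)

# Hypothetical small inputs sharing the join columns x and y
filelist <- list(
  data.frame(x = c(1, 2), y = c(1, 1), a = 1:2),
  data.frame(x = c(1, 3), y = c(1, 1), b = c(10, 30)),
  data.frame(x = c(2, 3), y = c(1, 1), c = c(200, 300))
)

# Convert each element to a data.table (this creates copies)
dtList <- lapply(filelist, as.data.table)

# Fold the list with a full outer join on x and y
merged <- Reduce(
  function(left, right) merge(left, right, by = c("x", "y"), all = TRUE),
  dtList
)
print(merged)
```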

ifelse rows the same in R [duplicate]

I have a dataframe that looks like this:
> df<-data.frame(A=c(NA,1,2,3,4),B=c(NA,5,NA,3,4),C=c(NA,NA,NA,NA,4))
> df
A B C
1 NA NA NA
2 1 5 NA
3 2 NA NA
4 3 3 NA
5 4 4 4
I am trying to create a "D" column based on the row values in df, where D gets an NA if the values in the row are different (i.e. row 2) or all NAs (i.e. row 1), and the value in the row if the values in that row are the same, excluding NAs (i.e. rows 3, 4, 5). This would produce a vector and dataframe that looks like this:
> df$D<-c(NA,NA,2,3,4)
> df
A B C D
1 NA NA NA NA
2 1 5 NA NA
3 2 NA NA 2
4 3 3 NA 3
5 4 4 4 4
Thank you in advance for your suggestions.
You can use apply() to do the calculation for each row, combined with !is.na(), unique() and length(). !is.na() selects the values that are not NA, unique() reduces them to the distinct values, and length() counts those. If the count is 1, use the first non-NA value; otherwise return NA.
df$D<-apply(df,1,function(x)
ifelse(length(unique(x[!is.na(x)]))==1,x[!is.na(x)][1],NA))
Here is one possible approach:
FUN <- function(x) {
no.na <- x[!is.na(x)]
len <- length(no.na)
if (len == 0) return(NA)
if (len == 1) return(no.na)
runs <- rle(no.na)[[2]]
if(length(runs) > 1) return(NA)
runs
}
df$D <- apply(df, 1, FUN)
## > df
## A B C D
## 1 NA NA NA NA
## 2 1 5 NA NA
## 3 2 NA NA 2
## 4 3 3 NA 3
## 5 4 4 4 4
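Both answers apply the function to every column of df, so re-running them after D exists would feed D back into the comparison. A sketch that restricts the check to named columns follows; common_value is a hypothetical helper name, and the column selection is an assumption about which columns should be compared.

```r
df <- data.frame(A = c(NA, 1, 2, 3, 4),
                 B = c(NA, 5, NA, 3, 4),
                 C = c(NA, NA, NA, NA, 4))

# Return the single distinct non-NA value of a row, or NA if there is
# no such value or more than one
common_value <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 1) vals else NA_real_
}

# Only columns A, B and C take part, so the call can be repeated safely
df$D <- apply(df[c("A", "B", "C")], 1, common_value)
print(df)
```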

Turn different sized rows into columns

I am reading in a data file with many different rows, all of which can have different lengths like so:
dataFile <- read.table("file.txt", as.is=TRUE);
The rows can be as follows:
1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2
I need the rows to be transformed into columns. I'll be then using the columns for a violin plot like so:
names(dataCol)[1] <- "x";
jpeg("violinplot.jpg", width = 1000, height = 1000);
do.call(vioplot,c(dataCol,))
dev.off()
I'm assuming there will be an empty string/placeholder for any column with fewer entries than the column with the maximum number of entries. How can it be done?
Use the fill = TRUE argument in read.table. Then to change rows to columns, use t to transpose. Using your data this would look like...
df <- read.table( text = "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2
" , header = FALSE , fill = TRUE )
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1 1 5 2 6 2 1 NA NA NA
#2 2 6 24 NA NA NA NA NA NA
#3 2 6 1 5 2 7 982 24 6
#4 25 2 NA NA NA NA NA NA NA
t(df)
# [,1] [,2] [,3] [,4]
#V1 1 2 2 25
#V2 5 6 6 2
#V3 2 24 1 NA
#V4 6 NA 5 NA
#V5 2 NA 2 NA
#V6 1 NA 7 NA
#V7 NA NA 982 NA
#V8 NA NA 24 NA
#V9 NA NA 6 NA
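For plotting, it may be more convenient to keep each original row as its own vector without the NA padding. The sketch below is one way to do that under stated assumptions: it uses count.fields to size the table from the longest line, since read.table otherwise infers the number of columns from only the first few lines, and the names txt, n_col and rows are made up.

```r
txt <- "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2"

# Number of fields on the longest line
con <- textConnection(txt)
n_col <- max(count.fields(con))
close(con)

# Supplying col.names fixes the width regardless of where the longest line is
df <- read.table(text = txt, header = FALSE, fill = TRUE,
                 col.names = paste0("V", seq_len(n_col)))

# One vector per original row, with the NA padding removed
rows <- lapply(seq_len(nrow(df)), function(i) {
  v <- unlist(df[i, ], use.names = FALSE)
  v[!is.na(v)]
})
print(lengths(rows))
```

A list like rows is the kind of argument that do.call expects when each element should become a separate argument of the plotting function.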
EDIT: apparently read.table has a fill=TRUE option, which is WAYYYY easier than my answer.
I've never used vioplot before, and that seems like a weird way to make a function call (instead of something like vioplot(dataCol)), but I have worked with ragged arrays before, so I'll try that.
Have you read the data in yet? That tends to be the hardest part. The code below reads the above data from a file called temp.txt into a matrix called out2
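file <- "temp.txt"
# Read the whole file as one string, then split it into lines and fields
dat <- readChar(file, file.info(file)$size)
split1 <- strsplit(dat, "\n")
split2 <- strsplit(split1[[1]], " ")
# Length of the longest row
n <- max(lengths(split2))
out <- matrix(nrow = n, ncol = length(split2))
tFun <- function(i) {
  vect <- as.numeric(split2[[i]])
  length(vect) <- n   # pad with NA up to n entries
  out[, i] <- vect    # changes a local copy only; the padded vector is returned
}
# sapply binds the returned vectors together as the columns of out2
out2 <- sapply(seq_along(split2), tFun)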
I'll try and explain what I've done: the first step is to read in every character via readChar. You then split the lines, then the elements within each line to get the list split2, where each element of the list is a row of the input file.
From there you create a blank matrix that would be the right size for your data, then iterate through the list and assign each element to a column.
It's not pretty, but it works!
