Is there an R function to combine many data frames? - r

I have a sample of two datasets (c and d). I combined them using the combine command from gdata:
c <- data.frame(x=c("a","b"),y=c("c","d"))
d <- data.frame(x=c("f","g"),y=c("h","e"))
library(gdata)
combine(c,d)
x y source
1 a c c
2 b d c
3 f h d
4 g e d
Now suppose I have 100 data frames like c, d, e, f, ... and so on (all with the same columns). Is there a quick way to combine all of them? Otherwise I need to call the command below:
combine(c,d,e,f........)
df <- read.csv(file.choose())
combine(df)
The above is very time-consuming. Is there an alternative way to combine all the data frames easily?

You can list all the files you want to read in a directory like this:
listoffiles <- list.files(pattern = "\\.csv$") # regex: file names ending in ".csv"
Then loop over the files and assign each one to a variable whose name starts with df_.
for (i in seq_along(listoffiles)) {
  # read.csv2 expects semicolon-separated files; use read.csv for comma-separated ones
  assign(paste0("df_", i), read.csv2(listoffiles[i]))
}
Then collect all objects in your global environment whose names match the search pattern "df_"; mget returns them as a list of data.frames.
dflist <- mget(ls(.GlobalEnv, pattern = "^df_"), envir = .GlobalEnv)
Then use rbindlist from data.table to combine your data.frames.
> data.table::rbindlist(dflist)
x y
1: a c
2: b d
3: f h
4: g e
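As a variation on the steps above, the files can be read straight into a list, which avoids assign and mget altogether. This is only a minimal base R sketch: the temporary directory and the two small CSV files are made up for the demo, and it assumes comma-separated files that all share the same columns.

```r
# Hypothetical setup: two small comma-separated files in a temporary directory
tmp <- file.path(tempdir(), "combine_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(x = c("a", "b"), y = c("c", "d")),
          file.path(tmp, "c.csv"), row.names = FALSE)
write.csv(data.frame(x = c("f", "g"), y = c("h", "e")),
          file.path(tmp, "d.csv"), row.names = FALSE)

# List the files, read each one into a list element, then bind the rows
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
dfs <- lapply(files, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, dfs)
print(combined)
```

The list dfs can also be passed to data.table::rbindlist or dplyr::bind_rows instead of do.call(rbind, ...).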

If I understand the question correctly, the OP has the names of the data frames in a character vector, while the data frames themselves are separate objects in the global environment. In that case I would suggest the following.
Let this be the data and the character vector:
c <- data.frame(x=c("a","b"),y=c("c","d"), stringsAsFactors = FALSE)
d <- data.frame(x=c("f","g"),y=c("h","e"), stringsAsFactors = FALSE)
e <- data.frame(x=c("x","y"),y=c("o","p"), stringsAsFactors = FALSE)
df_names <- c("c", "d","e")
Then dplyr::bind_rows with c(mget(...)) should do the job.
library(dplyr)
bind_rows(c(mget(df_names)), .id = "source")
  source x y
1 c a c
2 c b d
3 d f h
4 d g e
5 e x o
6 e y p
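If the names are not already in a character vector, one option is to collect every data frame found in an environment. The sketch below is base R only; collect_data_frames and demo_env are hypothetical names used for illustration, and the demo uses its own environment so it does not depend on what happens to be in the global one.

```r
# Hypothetical helper: return all data frames found in an environment as a named list
collect_data_frames <- function(env = globalenv()) {
  objs <- mget(ls(env), envir = env)
  Filter(is.data.frame, objs)
}

# Demo environment with two data frames and one unrelated object
demo_env <- new.env()
demo_env$c <- data.frame(x = c("a", "b"), y = c("c", "d"), stringsAsFactors = FALSE)
demo_env$d <- data.frame(x = c("f", "g"), y = c("h", "e"), stringsAsFactors = FALSE)
demo_env$not_a_df <- 1:3

dfs <- collect_data_frames(demo_env)
combined <- do.call(rbind, dfs)

# Record which object each row came from, similar to .id = "source" in bind_rows
combined$source <- rep(names(dfs), times = vapply(dfs, nrow, integer(1)))
print(combined)
```

Calling collect_data_frames() with no argument looks in the global environment, which picks up every data frame there, so it is worth checking names(dfs) before binding.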

Related

Convert a list of vectors of different lengths to a data frame

Couldn't find any solution to this question online, but apologies if I missed it.
I have a list of several vectors (all character in this example), of different lengths:
ll <- list(f1 = c("a","b","c"),f2 = c("d","e"),f3 = "f")
I want to convert it into a data.frame that covers all combinations of the list's elements. So the resulting data.frame will be:
data.frame(f1 = rep(ll$f1, 2), f2 = rep(ll$f2, each = 3), f3 = rep(ll$f3, 6))
Is there any function that achieves that?
expand.grid should work in this case -
expand.grid(ll)
# f1 f2 f3
#1 a d f
#2 b d f
#3 c d f
#4 a e f
#5 b e f
#6 c e f
Another similar alternative would be purrr::cross_df (deprecated since purrr 1.0.0 in favour of tidyr::expand_grid).
purrr::cross_df(ll)
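One detail worth knowing about expand.grid: it converts character vectors to factors by default. A small sketch, assuming you want plain character columns, that also checks the row count against the product of the element lengths:

```r
ll <- list(f1 = c("a", "b", "c"), f2 = c("d", "e"), f3 = "f")

# stringsAsFactors = FALSE keeps the columns as character instead of factor
grid <- expand.grid(ll, stringsAsFactors = FALSE)

# One row per combination: 3 * 2 * 1
n_expected <- prod(lengths(ll))
print(grid)
```

The first element varies fastest, so f1 cycles through a, b, c while f2 changes every three rows.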

Extracting one row from multiple data tables and combining into a new data table while keeping column names

Three text files are in the same directory ("data001.txt", "data002.txt", "data003.txt"). I wrote a loop to read each data file and generate three data tables:
for(i in files) {
x <- read.delim(i, header = F, sep = "\t", na = "*")
setnames(x, 2, i)
assign(i,x)
}
So let's say each individual table looks something like this:
var1 var2 var3
row1 2 1 3
I've used rbind to combine all of the tables...
combined <- do.call(rbind, mget(ls(pattern="^data")))
and get something like this:
var1 var2 var3
row1 2 1 3
var1 var2 var3
row1 3 2 4
var1 var2 var3
row1 1 3 5
leaving me with superfluous column names. At the moment I can get around this by just deleting that specific row containing the column names, but it's a bit clunky.
colnames(combined) = combined[1, ] # make the first row the column names
combined <- combined[-1, ] # delete the now-unnecessary first row
toKeep <- seq(1, nrow(combined), 2) # rows to keep: every odd row, which skips the repeated header rows
combined <- combined[toKeep, ] # keep only the data rows
This does give me what I want...
var1 var2 var3
row1 2 1 3
row1 3 2 4
row1 1 3 5
But I feel like a better way would simply be to extract the values of "row1" as a vector or as a list or whatever, and combine them all together into one data table. I feel like there is a quick and easy way to do this but I haven't been able to find anything yet. I've had a look here and here and here.
One possibility is to take the second row (that I want), and convert it into a matrix (then transpose it to make it a row instead of column!?) and rbind:
data001.txt <- as.matrix(data001.txt[2,])
data001.txt <- t(data001.txt)
combined <- rbind(data001.txt, data002.txt)
This gives me more or less what I want except without the column name headers (e.g. var1, var2, var3).
v1 v2 v3
2 1 3
3 2 4
Any ideas? Would this second method work well if there is some way to add the column names? I feel like it's less clunky than the first method. Thanks for any input :)
edit - solved in answer below.
Figured it out. It requires converting to a data matrix and using setnames from the data.table package. Say you have a range of text data files like the one that follows, and you want to extract just the seventh column (the one with the numbers, not letters) and combine them in their own data table, including the row names:
chemical1 a b c d e 1 g h i j k l m
chemical2 a b c d e 2 g h i j k l m
chemical3 a b c d e 3 g h i j k l m
chemical4 a b c d e 4 g h i j k l m
chemical5 a b c d e 5 g h i j k l m
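Read each file with row.names = 1 and header = F, so the first column becomes the row names and the seventh column of the file becomes column six of the data frame: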
library(data.table) # provides setnames()
setwd("directory")
files <- list.files(pattern = "data") # take all files with 'data' in their name
for(i in files) {
x <- read.delim(i, row.names = 1, header = F, sep = "\t", na.strings = "*")
setnames(x, 6, i) # column six holds the data you want; use the data file name as its column name
x <- as.matrix(x[6]) # keep only the sixth column with the numeric data (drop everything else)
x <- t(x) # transpose (if you want..)
assign(i,x)
}
combined <- do.call(rbind, mget(ls(pattern="^data"))) # combine the data matrices into one table
write.table(combined, file="filename.csv", sep=",", row.names=T, col.names = NA)
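The same result can be had without assign and mget by collecting one named vector per file in a list. This is a rough base R sketch, not the original poster's code: the temporary directory and the generated files are invented for the demo, and it assumes every file has the same row names in the same order.

```r
# Hypothetical setup: two tab-separated files shaped like the example above
tmp <- file.path(tempdir(), "extract_demo")
dir.create(tmp, showWarnings = FALSE)
for (k in 1:2) {
  lines <- sprintf("chemical%d\ta\tb\tc\td\te\t%d\tg", 1:3, (1:3) * k)
  writeLines(lines, file.path(tmp, sprintf("data00%d.txt", k)))
}

files <- list.files(tmp, pattern = "^data.*\\.txt$", full.names = TRUE)

# For each file, keep column six (the numbers) as a vector named by the row names
rows <- lapply(files, function(f) {
  x <- read.delim(f, header = FALSE, row.names = 1, sep = "\t", na.strings = "*")
  setNames(x[[6]], rownames(x))
})

# One row per file, one column per chemical
combined <- do.call(rbind, rows)
rownames(combined) <- basename(files)
print(combined)
```

Because each piece is a plain named vector, rbind takes the column names from the first one and there are no repeated header rows to clean up afterwards.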

Read a column of data in R?

I am trying to read just one column of data in R. I know that the shortcut (assuming d1 is a data frame) is something like d1[[3]] to read the third column. However, I'm curious what this would look like if you used a read function instead. How would you get a vector rather than a one-column data frame?
Here's an example of reading just one column from a .csv file
dat <- data.frame(a = letters[1:3], b = LETTERS[1:3], c = 1:3, d = 3:1)
dat
a b c d
1 a A 1 3
2 b B 2 2
3 c C 3 1
# write dat to a csv file
write.csv(dat,file="mydata.csv")
# scan the first row only from the file
firstrow <- scan("mydata.csv", sep=",", what=character(0), nlines=1)
# which position has the desired column (header = b in this case)
col.pos <- match("b", firstrow)
# number of columns in data
nc <- length(firstrow)
# default of NA for desired column b; NULL for others
colClasses <- replace(rep("NULL", nc), col.pos, NA)
# read just column b
cols.b <- read.csv("mydata.csv", colClasses = colClasses)
cols.b
b
1 A
2 B
3 C
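The above reads in a data frame. If you want a vector instead, extract the single column (it prints as a factor in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE; newer versions return a character vector):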
cols.b <- read.csv("mydata.csv", colClasses = colClasses)[, 1]
cols.b
[1] A B C
Levels: A B C
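The steps above can be wrapped in a small helper. read_one_column is a hypothetical name, and the sketch assumes a comma-separated file with a header row; the demo file is written without row names to keep the header simple.

```r
# Hypothetical helper: read a single named column from a CSV file as a vector
read_one_column <- function(path, column) {
  # Read only the header line to find the position of the column
  header <- scan(path, what = character(), sep = ",", nlines = 1, quiet = TRUE)
  pos <- match(column, header)
  if (is.na(pos)) stop("column not found: ", column)

  # "NULL" skips a column; NA lets read.csv pick the type for the wanted one
  classes <- rep("NULL", length(header))
  classes[pos] <- NA
  read.csv(path, colClasses = classes, stringsAsFactors = FALSE)[[1]]
}

# Demo with a temporary file
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = letters[1:3], b = LETTERS[1:3], c = 1:3),
          path, row.names = FALSE)
b <- read_one_column(path, "b")
print(b)
```

Using [[1]] rather than [, 1] makes it explicit that a vector is wanted.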

Loop with column binding

I am a self-taught useR so please bear with me.
I have something similar to the following dataset:
individual value
a 0.917741317
a 0.689673689
a 0.846208486
b 0.439198006
b 0.366260159
b 0.689985484
c 0.703381117
c 0.29467743
c 0.252435687
d 0.298108973
d 0.42951805
d 0.011187204
e 0.078516181
e 0.498118235
e 0.003877632
I would like to create a matrix with the values for a in column 1, the values for b in column 2, etc. [I also add a 1 at the bottom of every column for later algebra operations]
I have tried so far:
for (i in unique(df$individual)) {
values <- subset(df$value, df$individual == i)
m <- cbind(c(values[1:3],1))
}
I get a (4,1) matrix with the values of the last individual. What is missing to make it accumulate on each pass of the loop, so that I get as many columns as individuals?
This operation is called "reshaping". There is a base function, but I find it easier with the reshape2 package:
DF <- read.table(text="individual value
a 0.917741317
a 0.689673689
a 0.846208486
b 0.439198006
b 0.366260159
b 0.689985484
c 0.703381117
c 0.29467743
c 0.252435687
d 0.298108973
d 0.42951805
d 0.011187204
e 0.078516181
e 0.498118235
e 0.003877632", header=TRUE)
DF$id <- 1:3
library(reshape2)
DF2 <- dcast(DF, id ~ individual, value.var = "value")
DF2[,-1]
# a b c d e
#1 0.9177413 0.4391980 0.7033811 0.2981090 0.078516181
#2 0.6896737 0.3662602 0.2946774 0.4295180 0.498118235
#3 0.8462085 0.6899855 0.2524357 0.0111872 0.003877632
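The loop in the question overwrites m on every pass, which is why only the last individual survives. Below is a minimal base R sketch of two ways to keep every column and append the row of ones. The shortened df is made-up demo data, and the code assumes every individual has the same number of values.

```r
# Shortened, made-up version of the data in the question
df <- data.frame(
  individual = rep(c("a", "b"), each = 3),
  value = c(0.9, 0.6, 0.8, 0.4, 0.3, 0.7),
  stringsAsFactors = FALSE
)

# Option 1: fix the loop by binding each new column onto the existing matrix
m <- NULL
for (i in unique(df$individual)) {
  values <- df$value[df$individual == i]
  m <- cbind(m, c(values, 1))  # append this individual's column, with the trailing 1
}
colnames(m) <- unique(df$individual)

# Option 2: no loop; split by individual, bind the columns, then add a row of ones
m2 <- rbind(do.call(cbind, split(df$value, df$individual)), 1)
print(m2)
```

Growing a matrix inside a loop is fine for a handful of columns; for many individuals the split version avoids the repeated copying.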

rbindlist for factors with missing levels

I have several data.tables that I would like to rbindlist. The tables contain factors with (possibly missing) levels. Then rbindlist(...) behaves differently from do.call(rbind, ...):
dt1 <- data.table(x=factor(c("a", "b"), levels=letters))
rbindlist(list(dt1, dt1))[,x]
## [1] a b a b
## Levels: a b
do.call(rbind, list(dt1, dt1))[,x]
## [1] a b a b
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
If I want to keep the levels, do I have to resort to rbind, or is there a data.table way?
I guess rbindlist is faster because it doesn't do the checking of do.call(rbind.data.frame,...)
Why not set the levels after binding?
Dt <- rbindlist(list(dt1, dt1))
setattr(Dt$x,"levels",letters) ## set attribute without a copy
From ?setattr:
setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.
Thanks for pointing out this problem. As of version 1.8.11 it has been fixed:
dt1 <- data.table(x=factor(c("a", "b"), levels=letters))
rbindlist(list(dt1, dt1))[,x]
#[1] a b a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
