Looping over csv files - r

So, I created a list a of csv files:
tbl = list.files(pattern="*.csv")
Then I separated them into two different lists:
tbl1 <- tbl[c(1,3:7,10:12,14:18,20)]
tbl2 <- tbl[c(2,19,8:9,13)]
Then loaded them:
list_of_data1 = lapply(tbl1, read.csv)
list_of_data2 = lapply(tbl2, read.csv)
And now I want to create a master file. I just want to select some data from each of csv file and store it in one table. To do that I created such loop:
gdata1 = lapply(list_of_data1,function(x) x[3:nrow(x),10:13])
for( i in 1:length(list_of_data1)){
rownames(gdata1[[i]]) = list_of_data1[[i]][3:nrow(list_of_data1[[i]]),1]
}
tmp = lapply(gdata1,function(x) matrix(as.numeric(x),ncol=4))
final.table1=c()
for(i in 1:length(gnames)){
print(i)
tmp=gnames[i]
f1 = function(x) {x[tmp,]}
tmp2 = lapply(gdata1,f1)
tmp3 = c()
for(j in 1:length(tmp2)){
tmp3=rbind(tmp3,tmp2[[j]])
}
tmp4 = as.vector(t(tmp3))
final.table1 = rbind(final.table1,tmp4)
}
rownames(final.table1) = gnames
I created two different lists of data because in first one list_of_data1 there are four interesting columns for me (10:13) and in the other one list_of_data2 there are only 3 columns (10:12). I want to put all of the data in one table. Is there any way to do it in one loop ?
I have an idea how to solve that problem. I may create a new loop for list_of_data2and after that bind both of them using cbind. I want to do it in more elegant way so that's why I came here!

I would suggest looking into do.call , you can rbind your first list of tables and then rbind your second list of tables and then cbind as you stated. Below a trivial use of do.call
#creating a list of tables that we are interested in appending
#together in one master dataframe
ts<-lapply(c(1,2,3),function(x) data.frame(c1=rep(c("a","b"),2),c2=(1:4)*x,c3=rnorm(4)))
#you could of course subset ts to the set of columns
#you find of interest ts[,colsOfInterest]
master<-do.call(rbind,ts)
After seeing your complication of various row/columns of interest in each file, I think you could do something like this. Seems a bit hackerish but could get the job done. I assume you merge the files based on a column named id, you could of course generalize this to multiple columns etc
#creating a series of data frames for which we only want a subset of row/cols
> df1<-data.frame(id=1:10,val1=rnorm(10),val2=rnorm(10))
> df2<-data.frame(id=5:10,val3=rnorm(6))
> df3<-data.frame(id=1:3,val4=rnorm(3), val5=rnorm(3), val6=rnorm(3))
#specifying which rows/cols we are interested in
#i assume you have some way of doing this programmatically or you defined elsewhere
> colsofinterest<-list(df1=c("id","val1"),df2=c("id","val3"),df3=c("id","val5","val6"))
> rowsofinterest<-list(df1=1:5,df2=5:8,df3=2:3)
#create a list of data frames where each has only the row/cols combination we want
> ts<-lapply(c("df1","df2","df3"),
function(x) get(x)[rowsofinterest[[x]],colsofinterest[[x]]])
> ts
[[1]]
id val1
1 1 0.24083489
2 2 -0.50140019
3 3 -0.24509033
4 4 1.41865350
5 5 -0.08123618
[[2]]
id val3
5 9 -0.1862852
6 10 0.5117775
NA NA NA
NA.1 NA NA
[[3]]
id val5 val6
2 2 0.2056010 -0.6788145
3 3 0.2057397 0.8416528
#now merge these based on a key column "id", and we want to keep all.
> final<-Reduce(function(x,y) merge(x,y,by="id",all=T), ts)
> head(final)
id val1 val3 val5 val6
1 1 0.24083489 NA NA NA
2 2 -0.50140019 NA 0.2056010 -0.6788145
3 3 -0.24509033 NA 0.2057397 0.8416528
4 4 1.41865350 NA NA NA
5 5 -0.08123618 NA NA NA
6 9 NA -0.1862852 NA NA
Is this what you are thinking about or did I misinterpret?

not ldplyr() functions in the same way as do.call() in JPC's answer.... I just happen to use plyr more, if you are looking at manipulating r datastructures in a vectorised way then lots of useful stuff in there.
library(plyr)
d1 <- ldplyr(list_of_data1, rbind)
d2 <- ldplyr(list_of_data2, rbind)
select cols of d1 and d2
d1 <- d1[,c(10:13)]
d2 <- d2[,c(10:12)]
final.df <- cbind(d1,d2)

Related

Rbind data frames with names in a list [duplicate]

I have an issue that I thought easy to solve, but I did not manage to find a solution.
I have a large number of data frames that I want to bind by rows. To avoid listing the names of all data frames, I used "paste0" to quickly create a vector of names of the data frames. The problem is that I do not manage to make the rbind function identify the data frames from this vector of name.
More explicitely:
df1 <- data.frame(x1 = sample(1:5,5), x2 = sample(1:5,5))
df2 <- data.frame(x1 = sample(1:5,5), x2 = sample(1:5,5))
idvec <- noquote(c(paste0("df",c(1,2))))
> [1] df1 df2
What I would like to get:
dftot <- rbind(df1,df2)
x1 x2
1 4 1
2 5 2
3 1 3
4 3 4
5 2 5
6 5 3
7 1 4
8 2 2
9 3 5
10 4 1
dftot <- rbind(idvec)
> [,1] [,2]
> idvec "df1" "df2"
If there are multiple objects in the global environment with the pattern df followed by digits, one option is using ls to find all those objects with the pattern argument. Wrapping it with mget gets the values in the list, which we can rbind with do.call.
v1 <- ls(pattern='^df\\d+')
`row.names<-`(do.call(rbind,mget(v1)), NULL)
If we know the objects, another option is paste to create a vector of object names and then do as before.
v1 <- paste0('df', 1:2)
`row.names<-`(do.call(rbind,mget(v1)), NULL)
This should give the result:
dfcount <- 2
dftot <- df1 #initialise
for(n in 2:dfcount){dftot <- rbind(dftot, eval(as.name(paste0("df", as.character(n)))))}
eval(as.name(variable_name)) reads the data frames from strings matching their names.

How to conditionally combine data.frame object in the list in more elegant way?

I have data.frame in the list, and I intend to merge specific data.frame objects conditionally where merge second, third data.frame objects without duplication, then merge it with first data.frame objects. However, I used rbind function to do this task, but my approach is not elegant. Can anyone help me out the improve the solution ? How can I achieve more compatible solution that can be used in dynamic functional programming ? How can I get desired output ? Any idea ?
reproducible example:
dfList <- list(
DF.1 = data.frame(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2)),
DF.2 = data.frame(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4)),
DF.3 = data.frame(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4))
)
dummy way to do it:
rbind(dfList[[1L]], unique(rbind(dfList[[2L]], dfList[[3L]])))
Apparently, my attempt is not elegant to apply in functional programming. How can make this happen elegantly ?
desired output :
red blue green
1 1 NA 1
2 2 1 1
3 3 2 2
11 2 1 1
21 3 2 2
31 NA 3 4
6 NA NA 3
How can I improve my solution more elegantly and efficiently ? Thanks in advance
The best (easiest and fastest way) to do this is data.table::rbindlist.
It would work like this:
library(data.table)
dfList <- list(
DF.1 = data.table(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2)),
DF.2 = data.table(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4)),
DF.3 = data.table(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4))
)
# part 1: list element 1
dt_1 <- dfList[[1]]
# part 2: all other list elements (in your case 2 and 3)
dt_2 <- unique(rbindlist(dfList[-1]))
# use rbindlist to bind the rows together
dt_all <- rbindlist(list(dt_1, dt_2))
Comment.
My solution is pretty close to your proposed solution. I think the "ugliness" about this way is that it is an edge case to merge datasets and deattach the first element (and treat it in a different way). The best solution would probably be to step back and think about the underlying idea and solve it using an additional variable in the datasets (i.e., for df1 and then for df2_3), which I would consider the R-way.
Something along this thought would look like this:
myList2 <- list(
DF.1 = data.table(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2), var = "df1"),
DF.2 = data.table(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4), var = "other"),
DF.3 = data.table(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4), var = "other")
)
dt <- rbindlist(myList2)
unique(dt)
# red blue green var
# 1: 1 NA 1 df1
# 2: 2 1 1 df1
# 3: 3 2 2 df1
# 4: 2 1 1 other
# 5: 3 2 2 other
# 6: NA 3 4 other
# 7: NA NA 3 other
A way of rbinding a list of data.frames with only base R is do.call(list, rbind) (see this question that also presents some alternatives).
If you then desire only unique rows you can follow-up with a unique
unique(do.call(dfList, rbind))

Using a data frame value as a list name

Let's say that I have a list which contains fourteen data frames. Each data frame contains a final column which contains a value for the city. So it would like
[[1]]
Row.Labels brz zone
1 3/31/09 NA SNE
2 4/30/09 NA SNE
3 5/31/09 NA SNE
[[2]]
Row.Labels brz zone
1 3/31/09 NA FED
2 4/30/09 NA FED
3 5/31/09 NA FED
...
What I want to do is name each data frame within the list with the value from the zone column. I figured a quick for loop would do the trick but I can't seem to find a solution to this problem.
dataset <- do.call("list", lapply(file_list, FUN = function(files){
read.csv(files, header=TRUE, stringsAsFactors=FALSE)
}))
# doesn't work
for( j in 1:length(dataset) ) {
names(dataset[j]) <- unique(dataset[[j]][,"zone"])
}
So the desired result of to name the first list element as SNE, the second list element as FED, and so forth. But I don't want to do it manually.

Create Data Frame and Populate It R

How do I create a fixed size data frame of size [40 2], declare the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't
want a row of headers.
(Someone please give me some pointers. I haven't program in R for a while and my R skills are terrible to
begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n with each element being a string of random characters of length k. rstr(...) does not guarantee that all strings are unique, but the probability of duplication is O(n/26^k).
Create the data.frame and define it's columns with the values
The reciclying rule, repeats the strings to match the 40 rows defined by the second column
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
I found this way of creating dataframes in R extremely productive and easy,
Create a raw array of values , then convert into matrix of required dimenions and finally name the columns and rows
dataframe.values = c(value1, value2,.......)
dataframe = matrix(dataframe.values,nrow=number of rows ,byrow = T)
colnames(dataframe) = c("column1","column2",........)
row.names(dataframe) = c("row1", "row2",............)
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5

Elements within lists.

I'm relatively new in R (~3 months), and so I'm just getting the hang of all the different data types. While lists are a super useful way of holding dissimilar data all in one place, they are also extremely inflexible for function calls, and riddle me with angst.
For the work I'm doing, I often uses lists because I need to hold a bunch of vectors of different lengths. For example, I'm tracking performance statistics of about 10,000 different vehicles, and there are certain vehicles which are so similar they can essentially be treated as the same vehicles for certain analyses.
So let's say we have this list of vehicle ID's:
List <- list(a=1, b=c(2,3,4), c=5)
For simplicity's sake.
I want to do two things:
Tell me which element of a list a particular vehicle is in. So when I tell R I'm working with vehicle 2, it should tell me b or [2]. I feel like it should be something simple like how you can do
match(3,b)
> 2
Convert it into a data frame or something similar so that it can be saved as a CSV. Unused rows could be blank or NA. What I've had to do so far is:
for(i in length(List)) {
length(List[[i]]) <- max(as.numeric(as.matrix(summary(List)[,1])))
}
DF <- as.data.frame(List)
Which seems dumb.
For your first question:
which(sapply(List, `%in%`, x = 3))
# b
# 2
For your second question, you could use a function like this one:
list.to.df <- function(arg.list) {
max.len <- max(sapply(arg.list, length))
arg.list <- lapply(arg.list, `length<-`, max.len)
as.data.frame(arg.list)
}
list.to.df(List)
# a b c
# 1 1 2 5
# 2 NA 3 NA
# 3 NA 4 NA
Both of those tasks (and many others) would become much easier if you were to "flatten" your data into a data.frame. Here's one way to do that:
fun <- function(X)
data.frame(element = X, vehicle = List[[X]], stringsAsFactors = FALSE)
df <- do.call(rbind, lapply(names(List), fun))
# element vehicle
# 1 a 1
# 2 b 2
# 3 b 3
# 4 b 4
# 5 c 5
With a data.frame in hand, here's how you could perform your two tasks:
## Task #1
with(df, element[match(3, vehicle)])
# [1] "b"
## Task #2
write.csv(df, file = "outfile.csv")

Resources