combine list of data frames in list in specific manner - r

I have a list whose elements are themselves lists of data frames.
The outer list elements represent years, and each inner list holds the monthly data for that year.
I want to build a final list with one element per month, in which each month's columns are cbind-ed together with the matching columns from the other years.
Alldata <- list()
Alldata[[1]] <- list(data.frame(Jan_2015_A=c(1,2), Jan_2015_B=c(3,4)), data.frame(Feb_2015_C=c(5,6), Feb_2015_D=c(7,8)))
Alldata[[2]] <- list(data.frame(Jan_2016_A=c(1,2), Jan_2016_B=c(3,4)), data.frame(Feb_2016_C=c(5,6), Feb_2016_D=c(7,8)))
The expected output is the finalList printed at the end of this question.
I have done this with for loops using the code below, but it is complex and I find it hard to follow myself. I am hoping for simpler, tidier code for this operation, ideally using an existing R function.
First I regroup the data so that each list item holds one month's data frames for all years:
x2 <- list()
for(l1 in 1: length(Alldata[[1]])){
temp <- list()
for(l2 in 1: length(Alldata)){
temp <- append(temp, list(Alldata[[l2]][[l1]]))
}
x2 <- append(x2, list(temp))
}
# Then create the final list, with successive years' data for each month as list items. This is mainly used for tracking data across years, for example: what was the count for "A" in Jan_2015 and in Jan_2016?
finalList <- list()
for(l3 in 1: length(x2)){
temp <- x2[[l3]]
td2 <- as.data.frame(matrix("", nrow = nrow(temp[[1]])))
rownames(td2)[rownames(temp[[1]])!=""] <- rownames(temp[[1]])[rownames(temp[[1]])!=""]
for(l4 in 1:ncol(temp[[1]])){
for(l5 in 1: length(temp)){
# lapply(l4, function(x) do.call(cbind,
td2 <- cbind(td2, temp[[l5]][, l4, drop=F])
}
}
finalList <- append(finalList, list(td2))
}
> finalList
[[1]]
V1 Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
1 1 1 3 3
2 2 2 4 4
[[2]]
V1 Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
1 5 5 7 7
2 6 6 8 8

You could do the following. lapply iterates over the outer list, and do.call cbinds the data frames in each inner list.
lapply(Alldata, do.call, what = 'cbind')
[[1]]
Jan_2015_A Jan_2015_B Feb_2015_C Feb_2015_D
1 1 3 5 7
2 2 4 6 8
[[2]]
Jan_2016_A Jan_2016_B Feb_2016_C Feb_2016_D
1 1 3 5 7
2 2 4 6 8
You can also use dplyr to get the same results.
library(dplyr)
lapply(Alldata, bind_cols)
Here is a third option proposed by J.R.
lapply(Alldata, Reduce, f = cbind)
EDIT
After clarification from OP, the above solution has been modified (see below) to produce the newly specified output. The solution above has been left there since it is a building block for the solution below.
pattern.vec <- c("Jan", "Feb")
### For a given month/pattern, returns the matching data frames
### from every year, bound together column-wise.
mon_data <- function(mo) {
return(bind_cols(sapply(Alldata, function(x) { x[grep(pattern = mo, x)]})))
}
### Loop through months/patterns.
finalList <- lapply(pattern.vec, mon_data)
finalList
## [[1]]
## Jan_2015_A Jan_2015_B Jan_2016_A Jan_2016_B
## 1 1 3 1 3
## 2 2 4 2 4
##
## [[2]]
## Feb_2015_C Feb_2015_D Feb_2016_C Feb_2016_D
## 1 5 7 5 7
## 2 6 8 6 8
## Ordering the columns as specified in the original question.
## sorting is by the last character in the column name (A or B)
## and then the year.
lapply(finalList, function(x) x[ order(gsub('[^_]+_([^_]+)_(.*)', '\\2_\\1', colnames(x))) ])
## [[1]]
## Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
## 1 1 1 3 3
## 2 2 2 4 4
##
## [[2]]
## Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
## 1 5 5 7 7
## 2 6 6 8 8
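If you would rather not depend on pattern matching against the column names, here is a base R sketch that works purely by position. It assumes every year holds the same months in the same order and that each month has the same columns in the same order across years; the helper names n_months, n_vars, n_years and idx are just for illustration.

```r
Alldata <- list(
  list(data.frame(Jan_2015_A = c(1, 2), Jan_2015_B = c(3, 4)),
       data.frame(Feb_2015_C = c(5, 6), Feb_2015_D = c(7, 8))),
  list(data.frame(Jan_2016_A = c(1, 2), Jan_2016_B = c(3, 4)),
       data.frame(Feb_2016_C = c(5, 6), Feb_2016_D = c(7, 8)))
)

n_months <- length(Alldata[[1]])
n_years  <- length(Alldata)

finalList <- lapply(seq_len(n_months), function(m) {
  # Take month m from every year and bind the years side by side.
  wide <- do.call(cbind, lapply(Alldata, `[[`, m))
  # Columns are now grouped by year; interleave them so that each
  # variable's years sit next to each other (A 2015, A 2016, B 2015, ...).
  n_vars <- ncol(Alldata[[1]][[m]])
  idx <- as.vector(t(matrix(seq_len(n_vars * n_years), nrow = n_vars)))
  wide[, idx, drop = FALSE]
})

names(finalList[[1]])
```

The reordering step only relies on column positions, so it keeps working if the month or variable labels change, but it will silently misalign columns if one year has them in a different order.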

Related

Call a specific column from every dataframe from list of dataframes

I would like to extract a specific column from every data frame in a list of data frames. Any ideas? This is my code:
# Create dissimilarity matrix
df.diss<-dist(t(df[,6:11]))
mds.table<-list() # empty list of values
for(i in 1:6){ # For Loop to iterate over a command in a function
a<-mds(df.diss,ndim=i, type="ratio", verbose=TRUE,itmax=1000)
mds.table[[i]]<-a # Store output in empty list
}
Now here is where I'm having trouble. After storing the values, I'm unable to call a specific column from every dataframe from the list.
# This function should call every $stress column from each data frame.
lapply(mds.table, function(x){
mds.table[[x]]$stress
})
Thanks again!
You are very close. Inside the function, x is already the list element itself, not an index, so use x directly:
set.seed(1)
l_df <- lapply(1:5, function(x){
data.frame(a = sample(1:5,5), b = sample(1:5,5))
})
lapply(l_df, function(x){
x[['a']]
})
[[1]]
[1] 2 5 4 3 1
[[2]]
[1] 2 1 3 4 5
[[3]]
[1] 5 1 2 4 3
[[4]]
[1] 3 5 2 1 4
[[5]]
[1] 5 3 4 2 1
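Applied to the question's own case, the same idea might look like the sketch below. Since the real mds.table is not available here, a small stand-in list with a stress element in each item is used; the stand-in values are made up for illustration.

```r
# Stand-in for the asker's mds.table: each element has a `stress` entry.
mds.table <- lapply(1:3, function(i) list(ndim = i, stress = 1 / i))

# One result per list element, kept as a list.
stress_list <- lapply(mds.table, function(x) x$stress)

# The same values simplified to a plain numeric vector.
stress_vec <- sapply(mds.table, `[[`, "stress")
```

sapply is convenient here because each stress value is a single number, so the result collapses to one vector that is easy to plot or compare.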

Add new column to data.frame through loop in R

I have n data frames and would like to add a column to each of them.
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
test=ls()
for (j in test){
j = cbind(get(j),IssueType=j)
}
Problem that i'm running into is
j = cbind(get(j),IssueType=j)
because it assigns all the data to j instead of a, b.
As commented, it's mostly better to keep related data in a list structure. If you already have the data.frames in your global environment and you want to get them into a list, you can use:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
This approach (taken from another answer) makes sure that you only get data.frame objects from your global environment.
You will notice that you now already have a named list:
> dflist
# $a
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
#
# $b
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
So you can easily select the data you want by typing for example
dflist[["a"]]
If you still want to create extra columns, you could do it like this:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Now, each data.frame in dflist has a new column called IssueType:
> dflist
# $a
# X1.4 X5.8 IssueType
# 1 1 5 a
# 2 2 6 a
# 3 3 7 a
# 4 4 8 a
#
# $b
# X1.4 X5.8 IssueType
# 1 1 5 b
# 2 2 6 b
# 3 3 7 b
# 4 4 8 b
In the future, you can create the data inside a list from the beginning, i.e.
dflist <- list(
a = data.frame(1:4,5:8),
b = data.frame(1:4, 5:8)
)
To create a list of your data.frames do this:
a <- data.frame(1:4,5:8); b <- data.frame(1:4, 5:8); test <- list(a,b)
This allows you to use the lapply function to do whatever you like with each of the data frames, e.g.:
out <- lapply(test, function(x) cbind(x, IssueType = "some label"))
For most data.frame operations I recommend using the packages dplyr and tidyr.
Here is the answer that solved the issue, with help from @docendo discimus.
Create the data frames
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
Group the data frames into a list
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
Add the extra column
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Unpack the list back into separate data frames in the global environment
list2env(dflist ,.GlobalEnv)
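A sketch of the same round trip done in a dedicated environment instead of .GlobalEnv, so that the filter cannot pick up unrelated objects. The environment env and the column names x and y are assumptions made for this example.

```r
# A scratch environment standing in for the global environment.
env <- new.env()
env$a <- data.frame(x = 1:4, y = 5:8)
env$b <- data.frame(x = 1:4, y = 5:8)
env$not_a_df <- 1:3   # ignored by the filter below

# Collect only the data frames into a named list.
dflist <- Filter(is.data.frame, as.list(env))

# Add a column holding each data frame's own name.
dflist <- Map(function(df, nm) { df$IssueType <- nm; df }, dflist, names(dflist))

# Write the modified data frames back under their original names.
list2env(dflist, envir = env)

env$a
```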

Split dataframe into list's based on id's

Please note, I'm not a programmer by trade. I'm a literature student, so please bear with me.
I would like to improve my existing working procedure. The split function is certainly one option, but I'm not sure how to use it here.
Basically, I'm trying to subdivide an existing data frame into a list of subsamples so that a run of rows with the same id is never split across two list elements.
Here is working example together with sample data:
df <- data.frame(id=c(rep(1,3),rep(2,2),rep(3,3),rep(4,2),5,6,7,8,9,rep(10,5)),r1=rep(1,40),r2=rep(2,40))
x <- transform(df, rec=ave(df$id,df$id, FUN=seq_along))
x$cum <- cumsum(x$rec)
x$dif <- diff(c(0,x$cum),1)
x$lab <- ifelse(x$dif!=1,0,1)
x$seq <- seq_along(x$id)
x$subs <- x$lab*x$seq
seqrow <- seq(1,nrow(x),3) # how many rows approx. per part
rw <- x$subs[x$subs %in% seqrow]
start_rw <- c(1,rw[2:length(rw)])
end_rw <- c(start_rw[2:length(start_rw)]-1,nrow(x))
df.lst <- list()
for(i in 1:length(start_rw)){
df.lst[[i]] <- x[(start_rw[i]:end_rw[i]), ]
}
Within each list element, the rows should also be sorted by id in increasing order.
Reading through your code, I would summarize your procedure as:
1. Compute seqrow, the row numbers at which you would be willing to split the data frame.
2. Split df only at the positions in seqrow where df$id is new (hasn't appeared above); this list of positions is called start_rw in your code.
You can use duplicated to determine if df$id has appeared above or not, which enables you to grab start_rw more easily:
seqrow <- seq(1,nrow(df),3)
(start_rw <- intersect(which(!duplicated(df$id)), seqrow))
# [1] 1 4 13 16
All that remains is to split df at these positions. You can use diff to compute the number of elements in each grouping:
(groups <- rep(seq(start_rw), times=diff(c(start_rw, nrow(df)+1))))
# [1] 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
df.lst2 <- split(df, groups)
This matches the output of your code:
all.equal(unname(df.lst2), lapply(df.lst, function(x) x[,1:3]))
# [1] TRUE
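The two steps can be wrapped into a small helper. This is only a sketch: the function name split_keep_ids and its arguments are made up here, and it assumes that rows with the same id are contiguous.

```r
# Split `df` roughly every `chunk` rows, but only at rows where a new id starts.
split_keep_ids <- function(df, id, chunk = 3) {
  candidates <- seq(1, nrow(df), by = chunk)            # acceptable split rows
  starts <- intersect(which(!duplicated(df[[id]])), candidates)
  # Each group runs from one start row up to just before the next one.
  groups <- rep(seq_along(starts), times = diff(c(starts, nrow(df) + 1)))
  split(df, groups)
}

demo <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6), r1 = 1)
parts <- split_keep_ids(demo, "id", chunk = 3)
sapply(parts, nrow)
```

Row 1 is always both a candidate and a first occurrence, so the first group always starts at the top of the data frame.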

Using a list to store results of a double loop (for-loop) in R

I want to make calculations for elements of individual rows using a for-loop.
I have two data.frames
df: contains data of all trading-days stocks
events: contains data of only event days of stocks
Even though there might be a much easier approach for this specific example, I’d like to know how to do such a task with a loop in a loop (for-loops).
First, my data.frames:
comp1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date1 <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
ret <- c(1.2,2.2,-0.5,0.98,0.73,-1.3,-0.02,0.3,1.1,2.0,1.9,-0.98,1.45,1.71,0.03)
df <- data.frame(comp1,date1,ret)
comp2 <- c(1,1,2,2,2,3,3)
date2 <- c(2,4,1,2,5,4,5)
q <- paste("")
events <- data.frame(comp2,date2,q)
df
# comp1 date1 ret
# 1 1 1 1.20
# 2 1 2 2.20
# 3 1 3 -0.50
# 4 1 4 0.98
# 5 1 5 0.73
# 6 2 1 -1.30
# 7 2 2 -0.02
# 8 2 3 0.30
# 9 2 4 1.10
# 10 2 5 2.00
# 11 3 1 1.90
# 12 3 2 -0.98
# 13 3 3 1.45
# 14 3 4 1.71
# 15 3 5 0.03
events
# comp2 date2 q
# 1 1 2
# 2 1 4
# 3 2 1
# 4 2 2
# 5 2 5
# 6 3 4
# 7 3 5
I want to make calculations of df$ret. As an example let's just take 2 * df$ret. The results for each event-day should be stored in mylist. The final output should be the data.frame "events" with a column "q" where I want the results of the calculation to be stored.
# important objects:
companies <- as.vector(unique(df$comp1)) # all the companies (here: 1, 2, 3)
days <- as.vector(unique(df$date1)) # all the trading-days (here: 1, 2, 3, 4, 5)
mylist <- vector('list', length(companies)) # a list where the results should be stored for each company
I came up with some piece of code which doesn't work. But I still think it should look something like this:
for(i in 1:nrow(events)) {
events_k <- events[which(comp1==companies[i]),] # data of all event days of company i
df_k <- df[which(comp2==companies[i]),] # data of all trading days of company i
for(j in 1:nrow(df_k)) {
events_k[j, "q"] <- df_k[which(days==events_k[j,"date2"]), "ret"] * 2
}
mylist[i] <- events_k
}
I don't understand how to set up the loop inside the other loop and how to store the results in mylist. Any help appreciated!!
Thank you!
Don't feel bad. All of your problems are common R gotchas. First, try creating events like this instead:
events <- data.frame(comp2,date2,q,stringsAsFactors=FALSE)
In R versions before 4.0.0, your column q is converted to a factor implicitly, which gets in the way when the numeric results are assigned to it later.
Next, let's consider the fixed loop
for(i in 1:nrow(events)) {
events_k <- events[which(comp1==companies[i]),] # data of all event days of company i
df_k <- df[which(comp2==companies[i]),] # data of all trading days of company i
for(j in 1:nrow(df_k)) {
events_k[j, "q"] <-
if (0 == length(tmp <- df_k[which(days==events_k[j,"date2"]), "ret"] * 2)) NA
else tmp
}
mylist[[i]] <- events_k
}
Your first problem was on the last line, where you used [ instead of [[ (in R, the former always works with a sub-list, whereas the latter accesses the element stored in the list itself).
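A small illustration of the difference, using throwaway objects (res and small_df are names invented for this example):

```r
res <- vector("list", 2)
small_df <- data.frame(x = 1:2, y = 3:4)

# `[[<-` stores the whole data frame as a single list element.
res[[1]] <- small_df

# `[<-` treats the data frame as a list of columns and tries to spread it
# over the selected slots; with one slot, only the first column fits
# (R also emits a warning, suppressed here).
suppressWarnings(res[2] <- small_df)

is.data.frame(res[[1]])
is.data.frame(res[[2]])
```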
Your second problem is that sometimes which(days==events_k[j,"date2"]) is numeric(0) (i.e., there is no matching event date). The code will then work, but you'll still have a lot of dataframes with NAs. To remove those, you could do something like:
mylist <- Filter(function(df) nrow(df) > 0,
lapply(mylist, function(df) df[apply(df, 1, function(row) !all(is.na(row))), ]))
which will filter out list elements with empty dataframes, and rows in dataframes with all NA.
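For comparison, here is a sketch of the double loop that iterates over companies rather than over event rows and matches each event day against that company's trading days. It is one possible reading of what the question intends, not the only way to write it; q is initialised as a numeric NA column here, which is an assumption that differs from the original empty string.

```r
comp1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date1 <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
ret   <- c(1.2,2.2,-0.5,0.98,0.73,-1.3,-0.02,0.3,1.1,2.0,1.9,-0.98,1.45,1.71,0.03)
df    <- data.frame(comp1, date1, ret)

comp2  <- c(1,1,2,2,2,3,3)
date2  <- c(2,4,1,2,5,4,5)
events <- data.frame(comp2, date2, q = NA_real_)

companies <- unique(df$comp1)
mylist <- vector("list", length(companies))

for (i in seq_along(companies)) {
  events_k <- events[events$comp2 == companies[i], ]  # event days of company i
  df_k     <- df[df$comp1 == companies[i], ]          # trading days of company i
  for (j in seq_len(nrow(events_k))) {
    # Return of company i on the j-th event day (empty if there is no match).
    hit <- df_k$ret[df_k$date1 == events_k$date2[j]]
    events_k$q[j] <- if (length(hit) == 0) NA_real_ else 2 * hit[1]
  }
  mylist[[i]] <- events_k
}

# Stack the per-company results back into one data frame.
result <- do.call(rbind, mylist)
result
```

Because the outer loop runs over companies, mylist ends up with exactly one data frame per company, and no filtering of empty elements is needed afterwards.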

Read multiple files into separate data frames and process every dataframe

For all the files in one directory, I want to read each file into a data frame and then process it, for example by calculating cor across columns:
files<-list.files(path=".")
names <- substr(files,18,20)
for(i in c(1:length(names))){
name <- names[i]
assign (name, read.table(files[i]))
sapply(3:ncol(name), function(y) cor(name[, 2], name[, y], ))
}
But 'name' is a string in the last statement of the code. How can I process the data frame that 'name' refers to?
This is exactly what R's lists are for. Also, calling sapply to get all of the correlations is unnecessary: cor returns the full correlation matrix, so you can just subset it.
R> files <- list.files(pattern = "tsv")
R> dat <- lapply(files, read.table)
R> dat
[[1]]
a b c
1 2.802164 4.835557 6
2 1.680186 4.974198 3
3 3.002777 4.670041 6
4 2.182691 5.137982 11
5 4.206979 5.170269 5
6 1.307195 4.753041 9
7 2.919497 4.657171 7
8 2.938614 5.305558 9
9 2.575200 4.893604 2
10 1.548161 4.871108 4
[[2]]
a b c
1 -1.8483890 2 6
2 -2.9035164 0 7
3 -0.6490283 1 6
4 -2.8842633 3 2
5 -1.8803775 0 12
6 -3.0267870 1 9
7 0.5287124 0 7
8 -3.7220733 0 2
9 -2.0663912 2 9
10 -1.6232248 1 6
You can then lapply over this list again to process or do it as a one liner.
R> dat <- lapply(files, function(x) cor(read.table(x))[1,-1] )
R> dat
[[1]]
b c
0.27236143 -0.04973541
[[2]]
b c
-0.1440812 0.2771511
The way to do this is to put all the files you wish to read in in one folder, and then work with lists:
your.dir <- "" # adjust
files <- list.files(your.dir)
your.dfs <- lapply(file.path(your.dir, files), read.table)
your.dfs is now a list holding all your data frames. You can apply a function to every data frame using lapply, or you can access individual data frames with the usual subsetting syntax, for example your.dfs[[1]] to access the first data frame.
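A self-contained sketch of the whole workflow, using two small files written to a temporary directory so it can be run anywhere. The file names, the tab separator and the column names a, b, c are assumptions for the demo only.

```r
# Create a scratch directory with two small tab-separated files.
demo_dir <- file.path(tempdir(), "cor_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(42)
for (nm in c("first", "second")) {
  write.table(data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10)),
              file.path(demo_dir, paste0(nm, ".tsv")),
              sep = "\t", row.names = FALSE)
}

# Read every file into a list of data frames, named after the files.
files <- list.files(demo_dir, pattern = "\\.tsv$", full.names = TRUE)
dat <- lapply(files, read.table, header = TRUE, sep = "\t")
names(dat) <- tools::file_path_sans_ext(basename(files))

# Correlation of the first column with every other column, per file.
cors <- lapply(dat, function(d) cor(d)[1, -1])
cors
```

Naming the list after the files keeps the link between each result and its source, which is what the original assign() approach was trying to achieve.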
