Say I have a column with product IDs and a list of data frames with characteristics for each of them:
The bundle data frame:
bundle
1 284993459
2 1048768805
3 511310430
4 1034630958
5 1235581326
The d2 list:
[[1]]
id value
1 35 0.2
2 1462 0.2
3 1109 0.2
4 220 0.2
5 211 0.1
[[2]]
list()
[[3]]
id value
1 394 0.5
2 1462 0.5
[[4]]
id value
1 926 0.3
2 1462 0.3
3 381 0.3
4 930 0.2
[[5]]
id value
1 926 0.5
2 1462 0.5
I need to create columns with all characteristic IDs and their values for each product.
bundle = data.frame(bundle = c(284993459,1048768805,511310430,1034630958,1235581326))
d2<- list(data.frame(id = c(35,1462,1109,220,211), value = c(0.2, 0.2, 0.2,0.2,0.1)),
data.frame(id = NULL, value = NULL),
data.frame(id = c(394,1462), value = c(0.5,0.5)),
data.frame(id = c(926,1462,381,930), value = c(0.3,0.3,0.3,0.2)),
data.frame(id = c(926,1462), value = c(0.5,0.5)))
The desired result would look like this (truncated):
  bundle      35 1462 1109 220 211 394
1  284993459 0.2  0.2  0.2 0.2 0.1 0.0
2 1048768805 0.0  0.0  0.0 0.0 0.0 0.0
3  511310430 0.0  0.5  0.0 0.0 0.0 0.5
Can't figure out how to do this. I had an idea to unlist this list of data frames, but no good came of it, since I have more than 8000 product IDs:
for (i in seq(d2))
assign(paste0("df", i), d2[[i]])
If I take a different approach, then I have to join transposed characteristics data frames so the values are filled row by row.
Here's a tidyverse solution. First we add a bundle column to all data.frames and stitch them together using purrr::map2_dfr, then use tidyr::spread to reshape to wide format.
library(tidyverse)
res <- map2_dfr(bundle$bundle, d2, ~ mutate(.y, bundle = .x)) %>%
  spread(id, value)
res[is.na(res)] <- 0
# bundle 35 211 220 381 394 926 930 1109 1462
# 1 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
# 2 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
# 3 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
# 4 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
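On tidyr >= 1.0.0, pivot_wider() can replace spread() and fill the gaps in one step, skipping the manual NA replacement. A sketch under that assumption (passing values_fill a plain scalar needs tidyr >= 1.1.0):
library(tidyverse)
res <- map2_dfr(bundle$bundle, d2, ~ mutate(.y, bundle = .x)) %>%
  pivot_wider(names_from = id, values_from = value, values_fill = 0)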
You can first add the bundle to each data.frame within the list, then pivot it using reshape2::dcast or data.table::dcast before updating NAs to 0:
ans <- data.table::dcast(
do.call(rbind, Map(function(nm, DF) within(DF, bundle <- nm), bundle$bundle, d2)),
bundle ~ id)
ans[is.na(ans)] <- 0
ans
# bundle 35 211 220 381 394 926 930 1109 1462
#1 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
#2 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
#3 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
#4 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Edit: adding more explanation after OP's comment.
1) function(nm, DF) within(DF, bundle <- nm) takes the input data.frame DF and adds a new column called bundle with values equal to nm.
2) Map applies a function to the corresponding elements of the given vectors (see ?Map). That means Map applies the above function to each bundle value and the matching data.frame in d2, adding the bundle value as a column to each data.frame.
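A minimal illustration of that pairing, using hypothetical toy data rather than the OP's:
Map(function(nm, DF) within(DF, bundle <- nm),
    c(10, 20),
    list(data.frame(id = 1), data.frame(id = 2)))
# returns a list of two data.frames, each gaining a bundle column (10 and 20)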
Another approach could be:
library(data.table)
library(tidyverse)
df <- rbindlist(
  lapply(lapply(d2, function(x) if (nrow(x) == 0) data.frame(id = NA, value = NA) else x), # replace an empty data.frame in the list with a one-row NA data.frame
         function(y) y %>% spread(id, value)), # convert each data.frame to wide format
  fill = TRUE) %>% # rbind all data.frames into a single data.frame
  select(-`<NA>`) %>% # drop the NA column created by the placeholder rows
  cbind.data.frame(bundle = bundle$bundle)
Output is:
35 211 220 1109 1462 394 381 926 930 bundle
1: 0.2 0.1 0.2 0.2 0.2 NA NA NA NA 284993459
2: NA NA NA NA NA NA NA NA NA 1048768805
3: NA NA NA NA 0.5 0.5 NA NA NA 511310430
4: NA NA NA NA 0.3 NA 0.3 0.3 0.2 1034630958
5: NA NA NA NA 0.5 NA NA 0.5 NA 1235581326
Sample data:
bundle <- data.frame(bundle = c(284993459,1048768805,511310430,1034630958,1235581326))
d2 <- list(data.frame(id = c(35,1462,1109,220,211), value = c(0.2, 0.2, 0.2,0.2,0.1)),
data.frame(id = NULL, value = NULL),
data.frame(id = c(394,1462), value = c(0.5,0.5)),
data.frame(id = c(926,1462,381,930), value = c(0.3,0.3,0.3,0.2)),
data.frame(id = c(926,1462), value = c(0.5,0.5)))
There are two possible approaches which differ only in the sequence of operations:
Reshape all dataframes in the list individually from long to wide format and rbind() matching columns.
rbind() all dataframes in long form and reshape to wide format afterwards.
Both approaches require including bundle somehow.
For the sake of completeness, here are different implementations of the second approach using data.table.
library(data.table)
library(magrittr)
d2 %>%
# bind row-wise into large data.table, create id column
rbindlist(idcol = "bid") %>%
# right join to append bundle column
setDT(bundle)[, bid := .I][., on = "bid"] %>%
# reshape from long to wide format
dcast(., bundle ~ id, fill = 0)
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
4: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Here, piping is used just to visualize the sequence of function calls. With data.table's chaining the statement becomes more concise:
library(data.table) # library(magrittr) not required
setDT(bundle)[, bid := .I][
rbindlist(d2, idcol = "bid"), on = "bid"][, dcast(.SD, bundle ~ id, fill = 0)]
or
library(data.table) # library(magrittr) not required
dcast(setDT(bundle)[, bid := .I][
rbindlist(d2, idcol = "bid"), on = "bid"], bundle ~ id, fill = 0)
Another variant is to rename the list elements before calling rbindlist() which will take the names for creating the idcol:
library(data.table)
library(magrittr)
d2 %>%
# rename list elements
setNames(bundle$bundle) %>%
# bind row-wise into large data.table, create id column from element names
rbindlist(idcol = "bundle") %>%
# convert bundle from character to factor to maintain original order
.[, bundle := forcats::fct_inorder(bundle)] %>%
# reshape from long to wide format
dcast(., bundle ~ id, fill = 0)
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
4: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Note that the variants presented so far have skipped the empty dataframe which belongs to bundle 1048768805 (likewise the answers by Moody_Mudskipper and chinsoon12).
In order to keep the empty dataframe in the final result, the order of the join has to be changed so that all rows of bundle will be kept:
library(data.table)
dcast(
rbindlist(d2, idcol = "bid")[setDT(bundle)[, bid := .I], on = "bid"],
bundle ~ id, fill = 0
)[, "NA" := NULL][]
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
4: 1048768805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Or, if the exact order of bundle is to be maintained:
library(data.table)
dcast(
rbindlist(d2, idcol = "bid")[setDT(bundle)[, bid := .I], on = "bid"],
bid + bundle ~ id, fill = 0
)[, c("bid", "NA") := NULL][]
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 1048768805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
4: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
5: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Related
I have 31 possible scenarios for the value of the mean, and I would like to generate samples for each scenario, but I do not get as many as I would like.
These are the scenarios:
> scenari<-s*shift;scenari
Var1 Var2 Var3 Var4 Var5
2 1.5 0.0 0.0 0.0 0.0
3 0.0 1.5 0.0 0.0 0.0
4 1.5 1.5 0.0 0.0 0.0
5 0.0 0.0 1.5 0.0 0.0
6 1.5 0.0 1.5 0.0 0.0
7 0.0 1.5 1.5 0.0 0.0
8 1.5 1.5 1.5 0.0 0.0
9 0.0 0.0 0.0 1.5 0.0
10 1.5 0.0 0.0 1.5 0.0
11 0.0 1.5 0.0 1.5 0.0
12 1.5 1.5 0.0 1.5 0.0
13 0.0 0.0 1.5 1.5 0.0
14 1.5 0.0 1.5 1.5 0.0
15 0.0 1.5 1.5 1.5 0.0
16 1.5 1.5 1.5 1.5 0.0
17 0.0 0.0 0.0 0.0 1.5
18 1.5 0.0 0.0 0.0 1.5
19 0.0 1.5 0.0 0.0 1.5
20 1.5 1.5 0.0 0.0 1.5
21 0.0 0.0 1.5 0.0 1.5
22 1.5 0.0 1.5 0.0 1.5
23 0.0 1.5 1.5 0.0 1.5
24 1.5 1.5 1.5 0.0 1.5
25 0.0 0.0 0.0 1.5 1.5
26 1.5 0.0 0.0 1.5 1.5
27 0.0 1.5 0.0 1.5 1.5
28 1.5 1.5 0.0 1.5 1.5
29 0.0 0.0 1.5 1.5 1.5
30 1.5 0.0 1.5 1.5 1.5
31 0.0 1.5 1.5 1.5 1.5
32 1.5 1.5 1.5 1.5 1.5
and this is the function:
genereting_fuction<-function(n){
for (i in 1:length(scenari)){
X1=rnorm(n)+scenari[i,1]
X4=rnorm(n)+scenari[i,4]
X2=X1*p12+std_e2*rnorm(n)+scenari[i,2]
X3=X1*p13+X4*p43+std_e3*rnorm(n)+scenari[i,3]
X5=X2*p25+X3*p35+std_e5*rnorm(n)+scenari[i,5]
sample=cbind(X1,X2,X3,X4,X5)
return(sample)
}
}
genereting_fuction(10)
I should get 31 samples of size 10 x 5, but I get only one sample.
Your return statement is inside the for loop, so the function exits during the first iteration and you get back only the sample for the first scenario.
Try this:
genereting_fuction<-function(n){
sample <- list()
for (i in 1:nrow(scenari)){
X1=rnorm(n)+scenari[i,1]
X4=rnorm(n)+scenari[i,4]
X2=X1*p12+std_e2*rnorm(n)+scenari[i,2]
X3=X1*p13+X4*p43+std_e3*rnorm(n)+scenari[i,3]
X5=X2*p25+X3*p35+std_e5*rnorm(n)+scenari[i,5]
sample[[i]]=cbind(X1,X2,X3,X4,X5)
}
sample
}
The output will be a list and its ith element will be a sample corresponding to the ith scenario.
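Assuming scenari and the coefficients (p12, p13, p43, p25, p35, std_e2, std_e3, std_e5) are defined as in the question, usage looks like this:
samples <- genereting_fuction(10)
length(samples)    # 31, one sample per scenario
dim(samples[[1]])  # 10  5, i.e. a 10 x 5 sample for scenario 1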
I have a problem with the way aggregate() deals with NAs when computing sums.
I would like the sums per area.code from following table
test <- read.table(text = "
area.code A B C D
1 0 NA 0.00 NA NA
2 1 0.0 3.10 9.6 0.0
3 1 0.0 3.20 6.0 0.0
4 2 0.0 6.10 5.0 0.0
5 2 0.0 6.50 8.0 0.0
6 2 0.0 6.90 4.0 3.1
7 3 0.0 6.70 3.0 3.2
8 3 0.0 6.80 3.1 6.1
9 3 0.0 0.35 3.2 6.5
10 3 0.0 0.67 6.1 6.9
11 4 0.0 0.25 6.5 6.7
12 5 0.0 0.68 6.9 6.8
13 6 0.0 0.95 6.7 0.0
14 7 1.2 NA 6.8 0.0
")
So, seems pretty easy:
aggregate(.~area.code, test, sum)
area.code A B C D
1 1 0 6.30 15.6 0.0
2 2 0 19.50 17.0 3.1
3 3 0 14.52 15.4 22.7
4 4 0 0.25 6.5 6.7
5 5 0 0.68 6.9 6.8
6 6 0 0.95 6.7 0.0
Apparently not so simple, because area code 7 is completely omitted from the aggregate() result.
I would, however, like the NAs to be either completely ignored or treated as zero values. Which na.action option gives me that?
Replacing all NAs with 0 is an option if I just want the sum... but the mean is really problematic then (since it can't differentiate between 0 and NA anymore).
If you are willing to consider an external package (data.table):
library(data.table)
setDT(test)
test[, lapply(.SD, sum), area.code]
area.code A B C D
1: 0 NA 0.00 NA NA
2: 1 0.0 6.30 15.6 0.0
3: 2 0.0 19.50 17.0 3.1
4: 3 0.0 14.52 15.4 22.7
5: 4 0.0 0.25 6.5 6.7
6: 5 0.0 0.68 6.9 6.8
7: 6 0.0 0.95 6.7 0.0
8: 7 1.2 NA 6.8 0.0
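If you would rather treat the NAs as zeros, pass na.rm = TRUE through lapply(); all-NA groups such as area code 0 then come out as 0 rather than NA:
test[, lapply(.SD, sum, na.rm = TRUE), area.code]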
One option is to create a function that returns NA when all the values are NA and otherwise the sum. Along with that, use the na.action argument in aggregate, because aggregate by default removes a row if it contains at least one NA.
f1 <- function(x) if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
aggregate(.~area.code, test, f1, na.action = na.pass)
# area.code A B C D
#1 0 NA 0.00 NA NA
#2 1 0.0 6.30 15.6 0.0
#3 2 0.0 19.50 17.0 3.1
#4 3 0.0 14.52 15.4 22.7
#5 4 0.0 0.25 6.5 6.7
#6 5 0.0 0.68 6.9 6.8
#7 6 0.0 0.95 6.7 0.0
#8 7 1.2 NA 6.8 0.0
When there are only NA elements and we use sum with na.rm = TRUE, it returns 0:
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
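The same pattern carries over to other summaries, e.g. a mean that keeps all-NA groups as NA instead of the NaN that mean(x, na.rm = TRUE) would produce (a sketch):
f2 <- function(x) if (all(is.na(x))) NA else mean(x, na.rm = TRUE)
aggregate(. ~ area.code, test, f2, na.action = na.pass)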
Another solution is to use dplyr:
library(dplyr)
test %>%
group_by(area.code) %>%
summarise_all(sum, na.rm = TRUE)
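Note that sum with na.rm = TRUE turns the all-NA cells of area code 0 into 0 here; to keep them as NA, the f1 helper from above drops straight in. On dplyr >= 1.0.0 the across() idiom supersedes summarise_all() (a sketch):
library(dplyr)
test %>%
  group_by(area.code) %>%
  summarise(across(everything(), f1))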
I have searched all the lapply questions and solutions, and none of those solutions seems to address and/or work for the following...
I have a list "temp" that contains the names of 100 data frames: "sim_rep1.dat" through "sim_rep100.dat".
Each data frame has 2000 observations and the same 11 variables: ARAND and w1-w10, all of which are numeric.
For all 100 data frames, I am trying to create a new variable called "ps_true" that incorporates certain of the "w" variables, each with a unique coefficient.
The only use of lapply that is working for me is the following:
lapply(mget(paste0("sim_rep", 1:100,".dat")), transform,
ps_true = (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1)
When I run the code above, R loops through all 100 data frames and shows newly calculated values for ps_true in the console. Unfortunately, the new column is not getting added to the data frames.
When I try to create a function, the wheels come completely off.
I have tried different variations of the following:
lapply(temp, function(x){
ps_true = (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1
cbind(x, ps_true)
return(x)
})
The function shown above gives Error in FUN(X[[i]], ...) : object 'w1' not found.
Referencing x$w1 instead gives Error in x$w1 : $ operator is invalid for atomic vectors.
Referencing x[[w1]] instead gives Error in FUN(X[[i]], ...) : object 'w1' not found.
Referencing x[["w1"]] instead gives Error in x[["w1"]] : subscript out of bounds.
I am hoping there is something obvious that I am missing. I'd appreciate your insights and suggestions to solve this frustrating problem.
In response to Uwe's addendum:
The code I had used to read all the files was the following:
temp = list.files(pattern='*.dat')
for (i in 1:length(temp)) {
assign(temp[i], read.csv(temp[i], header=F,sep="",
col.names = c("ARAND", "w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10")))
}
According to the OP, there are 100 data.frames with identical column names. The OP wants to create a new column in all of the data.frames using exactly the same formula.
This indicates a fundamental flaw in the design of the data structure. I guess no database admin would create 100 identical tables where only the data content differs. Instead, they would create one table with an additional column identifying the origin of each row. Then all subsequent operations would be applied to one table instead of being repeated for each of many.
In R, the data.table package has the convenient rbindlist() function which can be used for this purpose:
library(data.table) # CRAN version 1.10.4 used
# get list of data.frames from the given names and
# combine the rows of all data sets into one large data.table
DT <- rbindlist(mget(temp), idcol = "origin")
# now create new column for all rows across all data sets
DT[, ps_true := (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1]
DT
origin ARAND w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ps_true
1: sim_rep1.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep1.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep1.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep1.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep1.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
199996: sim_rep100.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
199997: sim_rep100.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
199998: sim_rep100.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
199999: sim_rep100.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
200000: sim_rep100.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
DT now consists of 200 K rows. Performance is no reason to worry, as data.table was built to deal with large (even larger) data efficiently.
The origin of each row can be identified in case the data of the individual data sets need to be treated separately. E.g.,
DT[origin == "sim_rep47.dat"]
origin ARAND w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ps_true
1: sim_rep47.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep47.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep47.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep47.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep47.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
1996: sim_rep47.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
1997: sim_rep47.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
1998: sim_rep47.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
1999: sim_rep47.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
2000: sim_rep47.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
extracts all rows belonging to data set sim_rep47.dat.
Data
For test and demonstration, I've created 100 sample data.frames using the code below:
# create vector of file names
temp <- paste0("sim_rep", 1:100, ".dat")
# create one sample data.frame
nr <- 2000L
nc <- 11L
set.seed(123L)
foo <- as.data.frame(matrix(round(rnorm(nr * nc), 1), nrow = nr))
names(foo) <- c("ARAND", paste0("w", 1:10))
str(foo)
# create 100 individually named data.frames by "copying" foo
for (t in temp) assign(t, foo)
# print warning message on using assign
fortunes::fortune(236)
# verify objects have been created
ls()
Addendum: Reading all files at once
The OP has named the single data.frames sim_rep1.dat, sim_rep2.dat, etc. which resemble typical file names. Just in case the OP indeed has 100 files on disk I would like to suggest a way to read all files at once. Let's suppose all files are stored in one directory.
# path to data directory
data_dir <- file.path("path", "to", "data", "directory")
# create vector of file paths
files <- dir(data_dir, pattern = "sim_rep\\d+\\.dat", full.names = TRUE)
# read all files and create one large data.table
# NB: it might be necessary to add parameters to fread()
# or to use another file reader depending on the file type
DT <- rbindlist(lapply(files, fread), idcol = "origin")
# rename origin to contain the file names without path
DT[, origin := factor(origin, labels = basename(files))]
DT
origin ARAND w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ps_true
1: sim_rep1.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep1.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep1.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep1.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep1.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
199996: sim_rep99.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
199997: sim_rep99.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
199998: sim_rep99.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
199999: sim_rep99.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
200000: sim_rep99.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
All data sets are now stored in one large data.table DT consisting of 200 K rows. However, the order of the data sets differs because files is sorted alphabetically, i.e.,
head(files)
[1] "./data/sim_rep1.dat" "./data/sim_rep10.dat" "./data/sim_rep100.dat"
[4] "./data/sim_rep11.dat" "./data/sim_rep12.dat" "./data/sim_rep13.dat"
You probably just need single brackets:
test <- data.frame(w1 = c(1, 2, 3), w2 = c(2, 3, 4))
temp <- list(test, test, test)
temp2 <- lapply(temp, function(x) cbind(x, setNames(x['w1'] + x['w2'], 'ps_true')))
temp2
[[1]]
w1 w2 ps_true
1 1 2 3
2 2 3 5
3 3 4 7
[[2]]
w1 w2 ps_true
1 1 2 3
2 2 3 5
3 3 4 7
[[3]]
w1 w2 ps_true
1 1 2 3
2 2 3 5
3 3 4 7
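Alternatively, building on the lapply(mget(...)) call that already worked for the OP: the transformed copies just need to be written back to the global environment, e.g. with list2env(). A sketch, assuming the sim_rep objects exist there:
fixed <- lapply(mget(paste0("sim_rep", 1:100, ".dat")), transform,
                ps_true = (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
                                     0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1)
# overwrite the original objects with the versions containing ps_true
list2env(fixed, envir = .GlobalEnv)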
I am trying to get all points on a 2D plane in the range (0..10, 0..10) with a step of 0.5. I would like to store these values in a dataframe like this:
x y
1 1 1.5
2 0 0.5
3 4 2.0
I am considering using a loop to start from 0.0 for the x column and fill the y column such that I get something like this:
x y
1 0 0
2 0 0.5
3 0 1
and so on up to 10, then increment x by 0.5 and repeat. Is there a more efficient way of doing this in R?
Is this what you want?
expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
x y
1 0.0 0.0
2 0.5 0.0
3 1.0 0.0
4 1.5 0.0
5 2.0 0.0
6 2.5 0.0
7 3.0 0.0
8 3.5 0.0
9 4.0 0.0
10 4.5 0.0
11 5.0 0.0
12 5.5 0.0
13 6.0 0.0
14 6.5 0.0
15 7.0 0.0
16 7.5 0.0
17 8.0 0.0
18 8.5 0.0
19 9.0 0.0
20 9.5 0.0
21 10.0 0.0
22 0.0 0.5
23 0.5 0.5
24 1.0 0.5
25 1.5 0.5
26 2.0 0.5
27 2.5 0.5
28 3.0 0.5
29 3.5 0.5
30 4.0 0.5
...
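A quick sanity check on the size, since each axis has 21 values:
grid <- expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
nrow(grid)  # 441 = 21 * 21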
I have a dataframe df
col1 col2 col3 col4 col5
row1 0.0 0.0 0.0 0.0 0.0
row2 0.0 0.4 0.4 0.0 0.0
row3 0.5 1.2 0.4 0.3 0.8
row4 3.3 1.4 1.4 1.0 6.3
row5 0.0 0.2 0.0 0.0 0.0
row6 0.8 0.0 0.0 0.0 0.2
and a dataframe mapping
rowname mapped_name
row1 a
row2 a
row3 a
row5 b
row6 c
and I want to get
col1 col2 col3 col4 col5 mapped_name
row1 0.0 0.0 0.0 0.0 0.0 a
row2 0.0 0.4 0.4 0.0 0.0 a
row3 0.5 1.2 0.4 0.3 0.8 a
row4 3.3 1.4 1.4 1.0 6.3 NA
row5 0.0 0.2 0.0 0.0 0.0 b
row6 0.8 0.0 0.0 0.0 0.2 c
Because they have different lengths, when I do df$mapped_name <- df[mapping$rowname == rownames(df),]$mapped_name I get
character(0)
Warning message:
In mapping$rowname == rownames(df) :
longer object length is not a multiple of shorter object length
We can match the row names of 'df' with the 'rowname' column of the 'mapping' dataset, then use that numeric index to get the corresponding 'mapped_name':
df$mapped_name <- mapping$mapped_name[match(row.names(df), mapping$rowname)]
df
# col1 col2 col3 col4 col5 mapped_name
#row1 0.0 0.0 0.0 0.0 0.0 a
#row2 0.0 0.4 0.4 0.0 0.0 a
#row3 0.5 1.2 0.4 0.3 0.8 a
#row4 3.3 1.4 1.4 1.0 6.3 <NA>
#row5 0.0 0.2 0.0 0.0 0.0 b
#row6 0.8 0.0 0.0 0.0 0.2 c
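For reference, match() returns the position of each row name in mapping$rowname, with NA where there is no match, which is exactly how row4 ends up as <NA>:
match(row.names(df), mapping$rowname)
# [1]  1  2  3 NA  4  5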