Change multiple dataframes in a loop - r

I have, for example, this three datasets (in my case, they are many more and with a lot of variables):
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame2 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
on each data frame I want to add a variable resulting from a transformation of an existing variable on that data frame. I would to do this by a loop. For example:
datasets <- c("data_frame1","data_frame2","data_frame3")
vars <- c("a","b","c")
for (i in datasets){
for (j in vars){
# here I need a code that create a new variable with transformed values
# I thought this would work, but it didn't...
get(i)$new_var <- log(get(i)[,j])
}
}
Do you have some valid suggestions about that?
Moreover, it would be great for me if it were possible also to assign the new column names (in this case new_var) by a character string, so I could create the new variables by another for loop nested in the other two.
Hope I've not been too tangled in explain my problem.
Thanks in advance.

You can put your dataframes in a list and use lapply to process them one by one. So no need to use a loop in this case.
For example you can do this :
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1,data_frame2,data_frame3)
lapply(ll,function(df){
df$log_a <- log(df$a) ## new column with the log a
df$tans_col <- df$a+df$b+df$c ## new column with sums of some columns or any other
## transformation
### .....
df
})
the dataframe1 becomes :
[[1]]
a b c log_a tans_col
1 1 3 4 0.0000000 8
2 5 6 4 1.6094379 15
3 3 1 1 1.0986123 5
4 3 5 9 1.0986123 17
5 2 5 2 0.6931472 9

I had the same need and wanted to change also the columns in my actual list of dataframes.
I found a great method here (the purrr::map2 method in the question works for dataframes with different columns), followed by
list2env(list_of_dataframes ,.GlobalEnv)

Related

Search and term variables by character string

I have a maybe simple problem, but I can't solve it.
I have two list's. List A is empty and a list B has several named columns. Now, I want to select a colum of B by a variable and put it in list A. Somehow like shown in the example:
A<-list()
B<-list()
VAR<-"a"
B$a<-c(1:10)
B$b<-c(10:20)
B$c<-c(20:30)
#This of course dosn't work...
A$VAR<-B$VAR
You can extract list entry with B[[VAR]] and append new entry to a list using get (A[[get("VAR")]] <- newEntry):
A[[get("VAR")]] <- B[[VAR]]
## A list
# $a
# [1] 1 2 3 4 5 6 7 8 9 10

Calling & creating new columns based on string

I have searched quite a bit and not found a question that addresses this issue--but if this has been answered, forgive me, I am still quite green when it comes to coding in general. I have a data frame with a large number of variables that I would like to combine & create new variables from based on names I've put in a 2nd data frame in a loop. The data frame formulas should create & call columns from the main data frame data
USDb = c(1,2,3)
USDc = c(4,5,6)
EURb = c(7,8,9)
EURc = c(10,11,12)
data = data.frame(USDb, USDc, EURb, EURc)
Now I'd like to create a new column data$USDa as defined by
data$USDa = data$USDb - data$USDc
and so on for EUR and other variables. This is easy enough to do manually, but I'd like to create a loop that pulls the names from formulas, something like this:
a = c("USDa", "EURa")
b = c("USDb", "EURb")
c = c("USDc", "EURc")
formulas = data.frame(a,b,c)
for (i in 1:length(formulas[,a])){
data$formulas[i,a] = data$formulas[i,b] - data$formulas[i,c]
}
Obviously data$formulas[i,a] this returns NULL, so I tried data$paste0(formulas[i,a]) and that returns Error: attempt to apply non-function
How can I get these strings to be recognized as variables in this way? Thanks.
There are simpler ways to do this, but I'll stick to most of your code as a means of explanation. Your code should work so long as you edit your for loop to the following:
for (i in 1:length(formulas[,"a"])){
data[formulas[i,"a"]] = data[formulas[i,"b"]] - data[formulas[i,"c"]]
}
formulas[,a] won't work because you have a variable defined as a already that is not appropriate inside an index. Use formulas[, "a"] instead if you want all rows from column "a" in data.frame formulas.
data$formulas is literally searching for the column called "formulas" in the data.frame data. Instead you want to write data[formulas](of course, knowing that you need to index formulas in order to make it a proper string)
logic : iterate through each of the formulae, using a apply which is a for loop internally, and do calculation based on the formula
x = apply(formulas, 1, function(x) data[[x[3]]] - data[[x[2]]])
colnames(x) = formulas$a
x
# USDa EURa
#[1,] 3 3
#[2,] 3 3
#[3,] 3 3
cbind(data, x)
# USDb USDc EURb EURc USDa EURa
#1 1 4 7 10 3 3
#2 2 5 8 11 3 3
#3 3 6 9 12 3 3
Another option is split with sapply
sapply(setNames(split.default(as.matrix(formulas[-1]),
row(formulas[-1])), formulas$a), function(x) Reduce(`-`, data[rev(x)]))
# USDa EURa
#[1,] 3 3
#[2,] 3 3
#[3,] 3 3

Assigning value to an R object without using its name with get()

I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
colnames(get(element))[2] <- paste(strsplit(colnames(get(element)) [1], '`_`')[[1]][1],
colnames(get(element))[2], sep='`_`')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.
This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# suggested by #roland roll them up in a list:
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
names(myDfList[[dfName]])[2] <- paste0(gsub("^(.*_)", "\\1",
names(myDfList[[dfName]])[1]),
names(myDfList[[dfName]])[2])
}

trouble setting up iteration on multiple data.frames in r

I am having a recurring issue of performing specific tasks on multiple data.frames. Here is my working example data.frame, which was imported from text files.
cellID X Y Area AVGFP DeviationGFP AvgRFP DeviationsRFP Slice GUI.ID
1 1 18.20775 26.309859 568 5.389085 7.803248 12.13028 5.569880 0 1
2 2 39.78755 9.505495 546 5.260073 6.638375 17.44505 17.220153 0 1
3 3 30.50000 28.250000 4 6.000000 4.000000 8.50000 1.914854 0 1
4 4 38.20233 132.338521 257 3.206226 5.124264 14.04669 4.318130 0 1
5 5 43.22467 35.092511 454 6.744493 9.028574 11.49119 5.186897 0 1
6 6 57.06534 130.355114 352 3.781250 5.713022 20.96591 14.303546 0 1
7 7 86.81765 15.123529 1020 6.043137 8.022179 16.36471 19.194279 0 1
8 8 75.81932 132.146417 321 3.666667 5.852172 99.47040 55.234726 0 1
9 9 110.54277 36.339233 678 4.159292 6.689660 12.65782 4.264624 0 1
10 10 127.83480 11.384886 569 4.637961 6.992881 11.39192 4.287963 0 1
As previous questions I have posted, there are 40 of these data.frames named slice1...slice40.
What I want to do is add a new column to each of these data.frames that contains the product of AVGFP and Area. I can perform this on one data.frame easily by using
stats[[1]]$totalGFP <- stats[[1]]$AVGFP * stats[[1]]$Area
I am stuck trying to apply this command to every data.frame in stats
I appreciate any and all help. To help moving forward when you post a solution can you please describe the details of the commands used to help me follow along, thank you!
Like this:
stats <- lapply(stats, transform, totalGFP = AVGFP * Area)
I'll do my best to explain but please refer to ?lapply and ?transform for the full docs.
transform is a function to add columns to a data.frame, according to formulas of the type totalGFP = AVGFP * Area passed as arguments. For example, to add the totalGFP column to your first data.frame, you could run transform(stats[[1]], totalGFP = AVGFP * Area).
lapply applies a function (here transform) to each element of a list or a vector (here stats), and returns a list. If the function to be applied requires more arguments, they can be passed at the end of the lapply call, here totalGFP = AVGFP * Area. So here lapply is an elegant way of running transform on each element of stats.
Given that you wrote "please describe the details of the commands", try this simple example:
# create two small data frames
df1 <- data.frame(AVGFP = 1:3, Area = 4:6)
df2 <- data.frame(AVGFP = 7:9, Area = 1:3)
# create a list with named objects: the two data frames.
# ?list: "The arguments to list [...] of the form [...] tag = value
ll <- list(df1 = df1, df2 = df2)
str(ll)
# apply a function on each element in the list
# each element is a single data frame
# Use an 'anonymous function', function(x), where 'x' corresponds to each single data frame
# The function does this:
# (1) calculate the new variable 'total', and (2) add it to the data frame
ll2 <- lapply(X = ll, FUN = function(x){
total <- x$AVGFP * x$Area
x <- data.frame(x, total)
})
# check ll2
str(ll2)

Specifying column names from a list in the data.frame command

I have a list called cols with column names in it:
cols <- c('Column1','Column2','Column3')
I'd like to reproduce this command, but with a call to the list:
data.frame(Column1=rnorm(10))
Here's what happens when I try it:
> data.frame(cols[1]=rnorm(10))
Error: unexpected '=' in "data.frame(I(cols[1])="
The same thing happens if I wrap cols[1] in I() or eval().
How can I feed that item from the vector into the data.frame() command?
Update:
For some background, I have defined a function calc.means() that takes a data frame and a list of variables and performs a large and complicated ddply operation, summarizing at the level specified by the variables.
What I'm trying to do with the data.frame() command is walk back up the aggregation levels to the very top, re-running calc.means() at each step and using rbind() to glue the results onto one another. I need to add dummy columns with 'All' values in order to get the rbind to work properly.
I'm rolling cast-like margin functionality into ddply, basically, and I'd like to not retype the column names for each run. Here's the full code:
cols <- c('Col1','Col2','Col3')
rbind ( calc.means(dat,cols),
data.frame(cols[1]='All', calc.means(dat, cols[2:3])),
data.frame(cols[1]='All', cols[2]='All', calc.means(dat, cols[3]))
)
Use can use structure:
cols <- c("a","b")
foo <- structure(list(c(1, 2 ), c(3, 3)), .Names = cols, row.names = c(NA, -2L), class = "data.frame")
I don't get why you are doing this though!
I'm not sure how to do it directly, but you could simply skip the step of assigning names in the data.frame() command. Assuming you store the result of data.frame() in a variable named foo, you can simply do:
names(foo) <- cols
after the data frame is created
There is one trick. You could mess with lists:
cols_dummy <- setNames(rep(list("All"), 3), cols)
Then if you use call to list with one paren then you should get what you want
data.frame(cols_dummy[1], calc.means(dat, cols[2:3]))
You could use it on-the-fly as setNames(list("All"), cols[1]) but I think it's less elegant.
Example:
some_names <- list(name_A="Dummy 1", name_B="Dummy 2") # equivalent of cols_dummy from above
data.frame(var1=rnorm(3), some_names[1])
# var1 name_A
# 1 -1.940169 Dummy 1
# 2 -0.787107 Dummy 1
# 3 -0.235160 Dummy 1
I believe the assign() function is your answer:
cols <- c('Col1','Col2','Col3')
data.frame(assign(cols[1], rnorm(10)))
Returns:
assign.cols.1...rnorm.10..
1 -0.02056822
2 -0.03675639
3 1.06249599
4 0.41763399
5 0.38873118
6 1.01779018
7 1.01379963
8 1.86119518
9 0.35760039
10 1.14742560
With the lapply() or sapply() function, you should be able to loop the cbind() process. Something like:
operation <- sapply(cols, function(x) data.frame(assign(x, rnorm(10))))
final <- data.frame(lapply(operation, cbind))
Returns:
Col1.assign.x..rnorm.10.. Col2.assign.x..rnorm.10.. Col3.assign.x..rnorm.10..
1 0.001962187 -0.3561499 -0.22783816
2 -0.706804781 -0.4452781 -1.09950505
3 -0.604417525 -0.8425018 -0.73287079
4 -1.287038060 0.2545236 -1.18795684
5 0.232084366 -1.0831463 0.40799046
6 -0.148594144 0.4963714 -1.34938144
7 0.442054119 0.2856748 0.05933736
8 0.984615916 -0.0795147 -1.91165189
9 1.222310749 -0.1743313 0.18256877
10 -0.231885977 -0.2273724 -0.43247570
Then, to clean up the column names:
colnames(final) <- cols
Returns:
Col1 Col2 Col3
1 0.19473248 0.2864232 0.93115072
2 -1.08473526 -1.5653469 0.09967827
3 -1.90968422 -0.9678024 -1.02167873
4 -1.11962371 0.4549290 0.76692067
5 -2.13776949 3.0360777 -1.48515698
6 0.64240694 1.3441656 0.47676056
7 -0.53590163 1.2696336 -1.19845723
8 0.09158526 -1.0966833 0.91856639
9 -0.05018762 1.0472368 0.15475583
10 0.27152070 -0.2148181 -1.00551111
Cheers,
Adam

Resources