I have around 10 DFs and would like to perform the following calculations on all of them, with the output as 10 new DFs.
I have been able to get this to work for 1 DF, but rather than copying the code and changing the names 10 times, I wanted to see if there is a way to automate this. Ideally, I end up with 1 DF and 10 different columns, but I am happy with anything.
The calculations I am trying to do are:
temp <- merge (x=DF1, y=temp1, by = c("name"), all.x= TRUE)
asset_column <-grep("^Assets_", names(DF1))
return_column <-grep("^Return_", names(DF1))
OutputDF <-
stack(colSums(t(t(temp[asset_column])/colSums(temp[asset_column],
na.rm=TRUE)) * US_only[return_column],na.rm =TRUE))
OutputDF['values'] = OutputDF['values']/100
If the same calculations are repeated for every data frame, put the data frames in a list (lst1 here), loop over the list with lapply, and run the same code inside an anonymous function, where the argument x stands for the current data frame (DF1 in the original code):
out <- lapply(lst1, function(x) {
  # left-join the current data frame with temp1 on "name"
  temp <- merge(x, y = temp1, by = "name", all.x = TRUE)
  asset_column <- grep("^Assets_", names(x))
  return_column <- grep("^Return_", names(x))
  # divide each asset column by its column total, multiply by the returns, then sum per column
  OutputDF <- stack(colSums(
    t(t(temp[asset_column]) / colSums(temp[asset_column], na.rm = TRUE)) *
      US_only[return_column],
    na.rm = TRUE))
  OutputDF["values"] <- OutputDF["values"] / 100
  OutputDF
})
Here, the output is also a list of data.frames, which can be kept in the list as it is or extracted individually with [[.
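As a rough sketch of what to do with that list afterwards: the toy `out` below stands in for the real result (its values and the names `DF1`, `DF2` are made up for illustration). It shows extracting one element with `[[`, stacking everything into one long data frame, and building the "1 DF with one column per input" that the question mentions. The wide version assumes every element has the same rows in the same order.

```r
# Toy stand-in for the list returned by lapply (hypothetical values)
out <- list(
  DF1 = data.frame(values = c(0.01, 0.02), ind = c("Assets_A", "Assets_B")),
  DF2 = data.frame(values = c(0.03, 0.04), ind = c("Assets_A", "Assets_B"))
)

# Extract a single result by name
first <- out[["DF1"]]

# Long format: one data frame, with a column recording which input each row came from
long <- do.call(rbind, Map(function(d, nm) cbind(source = nm, d), out, names(out)))

# Wide format: one 'values' column per input (assumes identical row order in every element)
wide <- data.frame(ind = out[[1]]$ind, sapply(out, function(d) d$values))
```

Naming the list elements up front is what makes the `source` column and the wide column names traceable back to the original data frames.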
I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I suspect the errors are in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, and collect those vectors in a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
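To make the scoping point concrete, here is a small base R sketch (the name `ids_flat` is introduced here for illustration). Assigning to `ids_to_remove` inside the function only creates a local copy, so the global vector stays empty; returning the ids and calling `unlist` gives the flat vector the question asked for. Note that `seq(11:20)` evaluates to `1:10`, so both example data frames have ids 1 to 10. The question's `lapply` attempt also returned `NULL` for each element because the value of a `for` loop is `NULL` and the function never returned `df`.

```r
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])  # seq(11:20) in the question is also 1:10
mylist <- list(df1, df2)

ids_to_remove <- c()
invisible(lapply(mylist, function(df) {
  # this assignment creates a variable local to the function; the global is untouched
  ids_to_remove <- append(ids_to_remove, df$id[df$id > 8])
}))
length(ids_to_remove)  # still 0

# Return the ids instead, then flatten the per-data-frame results into one vector
ids_flat <- unlist(lapply(mylist, function(df) df$id[df$id > 8]))

# Remove every id in ids_flat from every data frame, returning the filtered data frame
filtered <- lapply(mylist, function(df) df[!df$id %in% ids_flat, ])
```

Using `%in%` removes all matching ids in one vectorised step, which avoids the inner `for` loop and should scale better to 70 data frames.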
I have a code that makes some changes in a dataframe.
value <- iris[1:120,]
cngfunc <- function(day,howmany,howmuch){
shuffled= day[sample(1:nrow(day)), ]
n = as.integer((howmany/100)*nrow(day)) #select percentage of data to be changed
extracted <- shuffled[1:n, ]
extracted$changed <- extracted[,1]*((howmuch/100)+1) #how much the data changes
extracted}
cngfunc(value,10,20)
Now I want to loop through the values of howmany and howmuch.
For example, howmuch <- c(10,20,30,40,50) and howmany <- c(10,20,30,40,50)
So the first results would be for cngfunc(value,10,10), cngfunc(value,10,20), cngfunc(value,10,30)... and then cngfunc(value,20,10), cngfunc(value,20,20), and so on, such that I'll have 25 different data frames.
Is there a way to do that?
You can do it with expand.grid to get all of the combinations, and then map2 to create a list of dataframes:
library(tidyverse)
combos <- expand.grid(c(10,20,30,40,50), c(10,20,30,40,50))
result <- map2(combos$Var1, combos$Var2, function(x, y) cngfunc(value, x, y)) %>%
setNames(tidyr::unite(combos, Var, Var1:Var2, sep = "-")$Var)
Not sure where you are getting 120 dataframes from, as 5 * 5 = 25. This should be the general idea though.
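The same idea can be sketched in base R without any packages, using `expand.grid` together with `Map`. The column names `howmany` and `howmuch` given to `expand.grid`, and the `set.seed` call, are additions for this illustration; `cngfunc` is the function from the question with only cosmetic changes.

```r
set.seed(1)  # only so the shuffle is reproducible in this sketch
value <- iris[1:120, ]

cngfunc <- function(day, howmany, howmuch) {
  shuffled <- day[sample(nrow(day)), ]
  n <- as.integer((howmany / 100) * nrow(day))  # percentage of rows to change
  extracted <- shuffled[seq_len(n), ]
  extracted$changed <- extracted[, 1] * ((howmuch / 100) + 1)  # size of the change
  extracted
}

# All 5 x 5 = 25 combinations of the two parameters
combos <- expand.grid(howmany = c(10, 20, 30, 40, 50),
                      howmuch = c(10, 20, 30, 40, 50))

# Call cngfunc once per row of combos; the result is a list of 25 data frames
result <- Map(function(hm, hc) cngfunc(value, hm, hc),
              combos$howmany, combos$howmuch)
names(result) <- paste(combos$howmany, combos$howmuch, sep = "-")
```

An individual result can then be pulled out by name, for example `result[["10-20"]]` for `howmany = 10` and `howmuch = 20`.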
I have the following 3 data frames, each of which has columns with names. I want to combine them and retain the column names. When I use the patch I found for combining dataframes, it drops that name on any dataframes that don't have at least 2 columns. How can I retain the names?
x<-data.frame(mean(1:10))
names(x)[names(x) == 'mean.1.10.'] <- 'var.name'
y<-data.frame(1:4)
names(y)[names(y) == 'X1.4'] <- 'var.name2'
z<-data.frame(matrix(1:10,5,2))
names(z)[names(z) == 'X1'] <- 'var.name3'
names(z)[names(z) == 'X2'] <- 'var.name4'
list_datf <- list(x, y, z)
n_r <- seq_len(max(sapply(list_datf, nrow)))
NEW <- do.call(cbind, lapply(list_datf, `[`, n_r, ))
You need to include drop = FALSE in the indexing step so that the things you're binding together retain all of their dimensions. I couldn't figure out a way to do this by passing drop = FALSE as an extra argument to [, so I resorted to using an anonymous function instead.
NEW <- do.call(cbind, lapply(list_datf, function(x) x[n_r, , drop = FALSE]))
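A minimal illustration of what drop = FALSE changes, using a single-column data frame like `y` from the question (the names `dropped` and `kept` are just for this sketch):

```r
y <- data.frame(var.name2 = 1:4)

# Default behaviour: selecting rows of a one-column data frame returns a bare vector,
# so the column name is lost
dropped <- y[1:2, ]

# With drop = FALSE the result stays a data frame and keeps its column name
kept <- y[1:2, , drop = FALSE]

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE
names(kept)             # "var.name2"
```

This is why only the data frames with fewer than two columns lost their names: with two or more columns, row selection already returns a data frame.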
Alternatively, you could convert your components to tibbles, which (unlike data frames) never drop "unneeded" dimensions:
NEW <- do.call(cbind, lapply(list_datf, function(x) tibble::as_tibble(x)[n_r, ]))
If you want to go full tidyverse:
library(dplyr)
list_datf %>% purrr::map(~ tibble::as_tibble(.)[n_r, ]) %>% bind_cols()
I have a dataframe called covars with three ethnicities. How do I apply function Get_STATs so I can get the output for each ethnicity?
Right now, I am running it like this:
tt <- covars[covars$ETHNICITY == "HISPANIC",]
Get_STATs(tt)
tt <- covars[covars$ETHNICITY == "WHITE",]
Get_STATs(tt)
tt <- covars[covars$ETHNICITY == "ASIAN",]
Get_STATs(tt)
I tried to run it like this
aggregate(covars, by = list(covars$ETHNICITY), FUN = Get_STATs)
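which generates the error: $ operator is invalid for atomic vectors
aggregate applies the function to each column separately, so Get_STATs receives a vector rather than a data frame. To pass the whole subset for each group, we can use by instead: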
do.call(rbind, by(covars, covars$ETHNICITY, FUN = Get_STATs))
Or split into a list and loop over the list and apply the function
do.call(rbind, lapply(split(covars, covars$ETHNICITY), Get_STATs))
If we need the ETHNICITY names as well
lst1 <- split(covars, covars$ETHNICITY)
do.call(rbind, Map(cbind, ETHNICITY = names(lst1), lapply(lst1, Get_STATs)))
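Since neither `covars` nor `Get_STATs` is shown in the question, the sketch below uses made-up stand-ins for both (a hypothetical `AGE` column and a `Get_STATs` that returns a one-row data frame) just to show the split/lapply pattern end to end.

```r
# Hypothetical data; the real covars is not shown in the question
covars <- data.frame(
  ETHNICITY = c("HISPANIC", "WHITE", "ASIAN", "WHITE", "ASIAN", "HISPANIC"),
  AGE = c(30, 40, 50, 60, 70, 50)
)

# Hypothetical stand-in for Get_STATs: takes a data frame, returns a one-row summary
Get_STATs <- function(d) data.frame(n = nrow(d), mean_age = mean(d$AGE))

# split() gives one data frame per ethnicity; lapply() runs the function on each
per_group <- lapply(split(covars, covars$ETHNICITY), Get_STATs)

# Stack the one-row summaries; the group names become the row names
res <- do.call(rbind, per_group)
```

Because the stand-in uses `d$AGE`, it needs a whole data frame as input, which is the same reason the `aggregate` call in the question fails with the `$` error.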
Depending on the Get_STATs function, you can use dplyr:
tt <-
covars %>%
group_by(ETHNICITY) %>%
Get_STATs()
I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways but for the moment this is more intuitive for me as it is assuring that each computation is associated with a particular name. There can be several columns and rows and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace back results to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!
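You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE (the default before R 4.0.0), the first iteration of the loop converts the string variables into factors whose levels are the values of the first row. Values added in later iterations are not among those levels, so they become NA.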
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
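If you prefer to keep the per-name structure of the loop, one possible variant is to build a one-row data frame per element and bind them at the end. This is a sketch rather than the only way to do it; it also avoids a side effect of `c(name, m, s)`, which coerces the numbers to character because a vector can hold only one type.

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# One single-row data frame per list element, so each column keeps its own type
rows <- lapply(names(x), function(name) {
  tmp <- x[[name]]
  data.frame(name = name, m = mean(tmp), s = sum(tmp), stringsAsFactors = FALSE)
})

# Bind all rows in one step instead of growing df inside the loop
df <- do.call(rbind, rows)
```

The `name` column stays character and `m` and `s` stay numeric, so the result can be joined or filtered by name directly.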