Split many dataframes by a column, and save as different dataframes - r

I have many dataframes. I would like to split them based on the values in a column (a factor). Then I would like to store the results of the split in separate data frames that have specific names.
For the sake of a minimal reproducible example, consider some generated data:
for (i in 1:10) {
assign(paste("df_",i,sep = ""), data.frame(x = rep(1,12), y = c(rep("a",4),rep("b",4),rep("c",4))))
}
Here we have 10 dfs, df_1, df_2, ... up to df_10. (The real data is similar to the generated data, but in the real data column z is different for each df.)
Now, I want to split the dfs by 'y' (column 2).
For 1 df, I can do the following:
splitdf <- split(df_1,df_1$y)
namessplit <- c("a","b","c")
for (i in 1:length(splitdf)) {
assign(paste("df_1_",namessplit[[i]],sep = ""),splitdf[[i]])
}
While this works for 1 df, how can I do it for all the dfs?
Big thanks in advance!

It is not recommended to create multiple objects in the global env, but if we want to know how to create the objects from a nested list: loop over the outer list sequence, then over the inner list sequence, paste the corresponding names together, and assign the extracted inner list element to that name.
lst1 <- lapply(mget(ls(pattern = "^df_\\d+$")), \(x) split(x, x$y))
for(i in seq_along(lst1)) {
for(j in seq_along(lst1[[i]])) {
assign(paste0(names(lst1)[i], "_", names(lst1[[i]][j])), lst1[[i]][[j]])
}
}
-checking for objects created in the global env
> ls(pattern = "^df_\\d+_[a-z]+$")
[1] "df_1_a" "df_1_b" "df_1_c" "df_10_a" "df_10_b" "df_10_c" "df_2_a" "df_2_b" "df_2_c" "df_3_a" "df_3_b" "df_3_c" "df_4_a"
[14] "df_4_b" "df_4_c" "df_5_a" "df_5_b" "df_5_c" "df_6_a" "df_6_b" "df_6_c" "df_7_a" "df_7_b" "df_7_c" "df_8_a" "df_8_b"
[27] "df_8_c" "df_9_a" "df_9_b" "df_9_c"
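If the split pieces do not have to live in the global env, they can stay in one flat named list instead. The following is only a sketch under assumptions: dfs and pieces are made-up names, and the three small data frames stand in for df_1, df_2, and so on.

```r
# Hypothetical stand-in data mirroring the question's generated frames
dfs <- setNames(
  lapply(1:3, function(i) {
    data.frame(x = rep(i, 6), y = rep(c("a", "b", "c"), each = 2))
  }),
  paste0("df_", 1:3)
)

# Split every data frame by y, then flatten one level so that each
# element is a single data frame named like "df_1.a"
pieces <- unlist(lapply(dfs, function(d) split(d, d$y)), recursive = FALSE)

# Replace the first "." with "_" to get names such as "df_1_a"
names(pieces) <- sub(".", "_", names(pieces), fixed = TRUE)
names(pieces)
```

Individual pieces are then available as pieces[["df_2_b"]], and the whole set can still be pushed into an environment with list2env() if separate objects are really needed.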

Related

R function used to rename columns of a data frames

I have a data frame, say acs10. I need to relabel the columns. To do so, I created another data frame, named labelName, with two columns: the first column contains the old column names, and the second column contains the names I want to use, like the table below:
column_1     column_2
oldLabel1    newLabel1
oldLabel2    newLabel2
Then, I wrote a for loop to change the column names:
for (i in seq_len(nrow(labelName))){
names(acs10)[names(acs10) == labelName[i,1]] <- labelName[i,2]}
This works.
However, when I tried to put the for loop into a function, because I need to rename columns for other data frames as well, the function failed. The function I wrote is shown below:
renameDF <- function(dataF,varName){
for (i in seq_len(nrow(varName))){
names(dataF)[names(dataF) == varName[i,1]] <- varName[i,2]
print(varName[i,1])
print(varName[i,2])
print(names(dataF))
}
}
renameDF(acs10, labelName)
where dataF is the data frame whose names I need to change, and varName is another data frame where old variable names and new variable names are paired. I used print(names(dataF)) to debug, and the printout suggests that the function works. However, calling the function does not actually change the column names. I suspect it has something to do with scope, but I want to know how to make it work.
In your function you need to return the changed dataframe.
renameDF <- function(dataF,varName){
for (i in seq_len(nrow(varName))){
names(dataF)[names(dataF) == varName[i,1]] <- varName[i,2]
}
return(dataF)
}
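The original function only renamed a local copy of the data frame, so the caller's acs10 was never touched. A minimal sketch, using small made-up versions of acs10 and labelName, showing that the returned value has to be assigned back:

```r
# Made-up stand-ins for the question's objects
acs10 <- data.frame(oldLabel1 = 1:2, oldLabel2 = 3:4)
labelName <- data.frame(column_1 = c("oldLabel1", "oldLabel2"),
                        column_2 = c("newLabel1", "newLabel2"),
                        stringsAsFactors = FALSE)

renameDF <- function(dataF, varName) {
  for (i in seq_len(nrow(varName))) {
    names(dataF)[names(dataF) == varName[i, 1]] <- varName[i, 2]
  }
  dataF
}

renameDF(acs10, labelName)           # returns a renamed copy; acs10 is unchanged
unchanged <- names(acs10)            # still the old labels
acs10 <- renameDF(acs10, labelName)  # assign the result back to keep the change
names(acs10)
```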
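You can also simplify this and avoid the for loop by using match (note that this assumes every column name appears in varName; any name without a match becomes NA):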
renameDF <- function(dataF,varName){
names(dataF) <- varName[[2]][match(names(dataF), varName[[1]])]
return(dataF)
}
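One caveat with a match-based rename is that a column whose name is missing from the lookup table ends up with NA as its name. A sketch of a variant that renames only the matched columns and leaves the rest alone; rename_keep_unmatched and the sample data are made up for illustration:

```r
rename_keep_unmatched <- function(dataF, varName) {
  idx <- match(names(dataF), varName[[1]])  # position of each name in the lookup
  hit <- !is.na(idx)                        # TRUE where a new label exists
  names(dataF)[hit] <- varName[[2]][idx[hit]]
  dataF
}

# Made-up example: "other" has no entry in the lookup table
acs10 <- data.frame(oldLabel1 = 1:2, oldLabel2 = 3:4, other = 5:6)
labelName <- data.frame(column_1 = c("oldLabel1", "oldLabel2"),
                        column_2 = c("newLabel1", "newLabel2"),
                        stringsAsFactors = FALSE)

acs10 <- rename_keep_unmatched(acs10, labelName)
names(acs10)
```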
This should do the whole thing in one line.
colnames(acs10)[colnames(acs10) %in% labelName$column_1] <- labelName$column_2[match(colnames(acs10)[colnames(acs10) %in% labelName$column_1], labelName$column_1)]
This also works when a column name isn't in the data dictionary, but it's a bit more convoluted:
library(tibble)
df <- tribble(~column_1,~column_2,
"oldLabel1", "newLabel1",
"oldLabel2", "newLabel2")
d <- tibble(oldLabel1 = NA, oldLabel2 = NA, oldLabel3 = NA)
fun <- function(dat, dict) {
names(dat) <- sapply(names(dat), function(x) ifelse(x %in% dict$column_1, dict[dict$column_1 == x,]$column_2, x))
dat
}
fun(d, df)
You can create a function containing just one line of code. Note that pmatch allows partial matches, and names without a match become NA.
renameDF <- function(df, varName){
setNames(df,varName[[2]][pmatch(names(df),varName[[1]])])
}

Changing column names of many dataframes in a loop

I have three dataframes: EC_Data, ED_Data, and ST_Data.
All of them have the same column names. More specifically, the columns after the 4th column
are named by year, from 2006 to 2015.
So I create a new list that has all three dataframes:
Alldata = list(EC_Data, ED_Data, ST_Data)
So I tried to rename all the columns in a for loop like below...
for(x in seq_along(Alldata))
{
for(j in seq_along(Alldata[[x]]))
{
if(j>4)
{
names(colnames(Alldata[[x]][j])) <- paste("X", substr(colnames(Alldata[[x]][j]), start = 1, stop = 5),sep="")
print(colnames(Alldata[[x]][j]))
}
}
}
But nothing happens...
I cannot understand why, because when I try to call the names of every list, for example with
view(colnames(Alldata[[2]]))
the names seem to be exactly what I want to see.
Can someone help me to understand the reason that this loop doesn't work and what can I use instead of this?
Thank you
If we want to rename all the columns, use lapply to loop over the list, build the new names with paste0 and the substr of the existing column names, and assign them with setNames:
Alldata <- lapply(Alldata, function(x)
setNames(x, paste0("X", substr(colnames(x), 1, 5))))
Or using a for loop
for(i in seq_along(Alldata)) {
Alldata[[i]] <- setNames(Alldata[[i]],
paste0("X", substr(colnames(Alldata[[i]]), 1, 5)))
}
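The question only asks for the columns after the 4th to be renamed. A minimal sketch of that variant, assuming the year columns are exactly the ones in positions 5 and up; make_df and its column names are hypothetical stand-ins for the real data:

```r
# Hypothetical frames: four id columns followed by year-named columns
make_df <- function() {
  data.frame(a = 1, b = 2, c = 3, d = 4, `2006` = 5, `2007` = 6,
             check.names = FALSE)
}
Alldata <- list(EC_Data = make_df(), ED_Data = make_df(), ST_Data = make_df())

Alldata <- lapply(Alldata, function(x) {
  yr <- seq_along(x) > 4                                   # columns after the 4th
  names(x)[yr] <- paste0("X", substr(names(x)[yr], 1, 5))  # prefix those with "X"
  x
})
names(Alldata[[1]])
```

The original loop failed because names(colnames(...)) <- sets names on a temporary character vector rather than on the data frame stored in the list.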

How to iterate with the get() function in R

I have a few data frames that have the names df_JANUARY 2020, df_FEBRUARY 2020 etc. (I know spaces are bad practice in variable names, but it has to do with a SQL query.) I would like to build a function to iterate through the months of these data frames. The purpose of this is to have the function (not written below) clean each df the same way.
date <- c("JANUARY 2020", "FEBRUARY 2020")
x <- function(date) {
y <- get(paste0("df_", date))
}
for(i in seq_along(date)) {
z <- date[i]
assign(paste0("dfclean_", date[i]), x(z))
}
The problem is that when I use the get() function it pushes the whole list through rather than one element at a time. Is there a way to avoid this problem with this methodology, or is there a better way to approach this problem? Any help is extremely appreciated.
If the objects are matrices, we can convert them to data.frame first and then use $, since matrix columns are extracted with [ rather than $
x <- function(daten) {
y <- as.data.frame(get(paste0("df_", daten)))
y[grep("Enterprise", y$AcctType), ]
}
for(i in seq_along(date)) {
z <- date[i]
assign(paste0("dfclean_", date[i]), x(z))
}
We can also use mget
lst1 <- mget(paste0("df_", date))
lst1 <- lapply(lst1, function(x) subset(as.data.frame(x),
grepl("Enterprise",AcctType)))
names(lst1) <- sub("_", "clean_", names(lst1))
list2env(lst1, .GlobalEnv)
I know you didn't ask for this, but how about just renaming all of the dataframes with _ instead of a space?
The first line assigns all of the objects in the global environment with df in the name to be elements of a list named mydfs.
The second line replaces space with _ in the names.
The third line assigns all of the list elements into the global environment.
mydfs <- mget(ls(pattern = "df"), globalenv())
names(mydfs) <- gsub(" ","_",names(mydfs))
list2env(mydfs, envir = globalenv())
Or, option two, you could just use lapply on mydfs.
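A minimal sketch of the lapply route, assuming the cleaning step is the "Enterprise" filter used in the answer above. The two sample frames, clean_one, and cleaned are hypothetical stand-ins, and envir = environment() is used only so that the lookup happens where the frames were created.

```r
# Hypothetical stand-ins for the query results; backticks allow spaces in names
`df_JANUARY 2020`  <- data.frame(AcctType = c("Enterprise", "Retail"), n = 1:2)
`df_FEBRUARY 2020` <- data.frame(AcctType = c("Retail", "Enterprise Plus"), n = 3:4)
date <- c("JANUARY 2020", "FEBRUARY 2020")

# Placeholder cleaning step: keep rows whose AcctType mentions "Enterprise"
clean_one <- function(d) d[grepl("Enterprise", d$AcctType), , drop = FALSE]

# Collect the frames by name, one per month, and clean each in turn
cleaned <- lapply(mget(paste0("df_", date), envir = environment()), clean_one)

# Rename to dfclean_<month> and swap spaces for underscores
names(cleaned) <- gsub(" ", "_", sub("^df_", "dfclean_", names(cleaned)))
names(cleaned)
```

Keeping the results in the cleaned list avoids a second round of get() and assign() calls when the frames are needed later.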

Searching Hierarchical List for a Name (a Named Matrix) and Placing Output in an Array or List (r)

I have a complex hierarchical set of lists, within which are stored multiple matrices. I would like to store all those matrices in either an array or a list.
I say array or list because I'm not sure, but presumably storing the 'path' to the matrix in an array will be faster than duplicating all the matrices into a new list.
Here is how to create the list hierarchy:
Kings = c('Alfred the Great', 'Edgar the Peaceful', 'Edmund Ironside', 'Harold Godwinson')
DataSets <- c('KingDF', 'KingDFMtx', 'KingMtx')
KingList <- lapply(Kings, function(K) {
ret <- rep(tibble(setNames(vector("list", length(DataSets)),
DataSets)),
length(Kings))
setNames(ret,
paste0(K, " vs ", Kings))
})
names(KingList) <- Kings
str(KingList)
So this will give you a list of Kings, with a list inside each of those lists comparing the kings, and inside those, a list of various data formats.
So for instance I have a list 'path' that looks like this:
KingList[['Alfred the Great']][['Alfred the Great vs Edgar the Peaceful']][['KingMtx']]
and another that looks like this:
KingList[['Edmund Ironside']][['Edmund Ironside vs Harold Godwinson']][['KingMtx']]
And I want an array or list which collects all the 'KingMtx' matrices, with the intent to use this to create one large unified matrix which includes all the data.
However, the search.list function returns a list of every single data point within a matrix named 'KingMtx', thus returning a jumble of hundreds of integers in a rather unhelpful list.
Assuming your list looks like this:
KingList <- lapply(Kings, function(K) {
  vs.list <- lapply(paste0(K, " vs ", Kings), function(x) {
    ds.list <- lapply(DataSets, function(y) matrix(1:6, nrow = 2))
    setNames(ds.list, DataSets)
  })
  setNames(vs.list, paste0(K, " vs ", Kings))
})
names(KingList) <- Kings
str(KingList)
You can get a list of all the matrices like this:
unlisted <- unlist(unlist(KingList, recursive = FALSE), recursive = FALSE)
To get only matrices KingMtx do:
KingMtx <- unlisted[grep('\\.KingMtx$', names(unlisted), value = TRUE)]
names(KingMtx) <- sub('\\.KingMtx$', '', names(KingMtx))
And to get this back into 1 data.frame:
KingDF <- as.data.frame(do.call(rbind, lapply(names(KingMtx), function(name){
d <- as.data.frame(KingMtx[[name]])
n.split <- strsplit(name, '\\.')[[1]]
d$King <- n.split[1]
d$opponent <- strsplit(n.split[2], ' vs ')[[1]][2]
d
})))
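If only the KingMtx entries are wanted, they can also be picked out by position in the hierarchy rather than by matching on the flattened names. This is a sketch under assumptions: it rebuilds a two-king version of the example list with placeholder matrices, and big is a made-up name for the combined matrix.

```r
Kings <- c("Alfred the Great", "Edgar the Peaceful")
DataSets <- c("KingDF", "KingDFMtx", "KingMtx")

# Same shape as the example list: king -> comparison -> data set
KingList <- setNames(lapply(Kings, function(K) {
  setNames(lapply(Kings, function(o) {
    setNames(lapply(DataSets, function(y) matrix(1:6, nrow = 2)), DataSets)
  }), paste0(K, " vs ", Kings))
}), Kings)

# Take the "KingMtx" element of every comparison, then flatten one level
KingMtx <- unlist(lapply(KingList, function(k) lapply(k, `[[`, "KingMtx")),
                  recursive = FALSE)

# Stack all matrices into one large matrix
big <- do.call(rbind, KingMtx)
dim(big)
```

The names of KingMtx keep the path in the form "King.King vs Opponent", so the origin of each matrix can still be recovered after flattening.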

r: lapply but with dynamic naming

Let's say I have 5 datasets in a list (each named df_1, df_2, and so on), each with a variable called cons. I'd like to execute a function over cons in each dataset in the list, and create a new variable whose name has the suffix of the corresponding dataset.
So in the end df_1 will have a variable called something like cons_1 and df_2 will have a variable called cons_2. The problem I run into is the variable looping and trying to create dynamic names.
Any suggestions?
This is actually pretty straightforward:
df_names <- paste("df", 1:5, sep = "_")
cons_names <- paste("cons", 1:5, sep = "_")
for (i in 1:5) {
# get the df from the current env by name
df_i <- get(df_names[i])
# do whatever you need to do and assign the result
df_i[[cons_names[i]]] <- some_operation(df_i)
# write the modified copy back under the original name
assign(df_names[i], df_i)
}
But it would make more sense to keep your data frames in a list to avoid using get, which can be sketchy:
for (i in 1:5) {
df_list[[i]][[cons_names[i]]] <- some_operation(df_list[[i]])
}
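When the list is named, the suffix can be taken from each element's name instead of from a loop counter. A minimal sketch, assuming a named list of data frames that each have a cons column; df_list is filled with made-up data and some_operation is a hypothetical placeholder that doubles its input:

```r
# Made-up data: three frames named df_1, df_2, df_3, each with a cons column
df_list <- setNames(
  lapply(1:3, function(i) data.frame(cons = c(i, i + 1))),
  paste0("df_", 1:3)
)

some_operation <- function(v) v * 2  # hypothetical placeholder

# Map passes each data frame together with its name
df_list <- Map(function(d, nm) {
  suffix <- sub("^df_", "", nm)                         # "df_2" -> "2"
  d[[paste0("cons_", suffix)]] <- some_operation(d$cons)
  d
}, df_list, names(df_list))

names(df_list$df_2)
```

Map keeps the names of the first list, so the result can be indexed as df_list$df_2 just like the input.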
Using the purrr package, this would be an alternative solution:
library(purrr)
lst <- list(mtcars_1 = mtcars,
mtcars_2 = mtcars,
mtcars_3 = mtcars,
mtcars_4 = mtcars,
mtcars_5 = mtcars)
map(seq_along(lst), function(x) {
lst[[x]][paste0("mpg_", x)] <- some_operation(lst[[x]]['mpg']); lst[[x]]
})
Subset each data frame from the list, create the new mpg variable with the index of the current data frame, and perform whatever operation you want on the mpg variable. The result is a list with all the previous data frames, each with its new variable.
Since this new list doesn't have the data frame names, you can always just add them with setNames(newlist, names(lst))
