Using a loop to create multiple dataframes from a single dataset - r

Quick question for you. I have the following:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
What I am looking to do is use a for loop to create multiple data frames from rows based on the contents of column b (i.e. one data frame for the "a" rows, one for the "d" rows, and so on).
At the same time, I would also like to name each data frame after the corresponding value in column b (the data frame created from the "a" rows will be named "a").
I tried making it work based on the answers provided in "Using a loop to create multiple data frames in R", but I had no luck.
If it helps, I have variables created with levels() and nlevels() to use in the loop to keep it scalable based on how my data changes. Any help would be much appreciated.
Thanks!

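This should do:
library(dplyr)
# Compare as character so factor levels do not interfere
df$b <- as.character(df$b)
col.filters <- unique(df$b)
# One filtered data frame per distinct value of b
df_list <- lapply(col.filters, function(x) {
  filter(df, b == x)
})
names(df_list) <- col.filters
# Copy each list element into the global environment under its name.
# Note: this overwrites existing objects with the same names (here the vectors a and c).
list2env(df_list, .GlobalEnv)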
Naturally, you don't need dplyr to do this. You can just use base syntax:
df$b <- as.character(df$b)
col.filters <- unique(df$b)
# Logical row indexing does the same job as filter()
df_list <- lapply(col.filters, function(x) {
  df[df$b == x, ]
})
names(df_list) <- col.filters
list2env(df_list, .GlobalEnv)
But I find dplyr much more intuitive.
Cheers
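As a hedged aside, base R's split() builds the same kind of named list in a single call, without an explicit loop. The sketch below reuses the question's data; the names df_list and pieces are illustrative and not from the original answers.

```r
# Data from the question, built in one call
df <- data.frame(
  a = c(1, 5, 2, 3, 4, 5, 3, 2, 1, 3),
  b = c("a", "a", "f", "d", "f", "c", "a", "r", "a", "c"),
  c = c(.2, .6, .4, .545, .98, .312, .112, .4, .9, .5)
)

# split() returns one data frame per distinct value of b, named by that value
df_list <- split(df, df$b)

names(df_list)        # the distinct values of b
nrow(df_list[["a"]])  # number of rows where b == "a"

# Optional: copy the pieces into an environment as separate objects.
# A dedicated environment avoids overwriting existing globals.
pieces <- list2env(df_list, envir = new.env())
```

Keeping the data frames in a list (or a dedicated environment) is usually easier to scale than creating many global variables, since the list can be processed again with lapply.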

Related

rownames on multiple dataframe with for loop in R

I have several data frames. I want the first column of each to become its row names.
I can do it for one data frame this way:
# Rename the row according the value in the 1st column
row.names(df1) <- df1[,1]
# Remove the 1st column
df1 <- df1[,-1]
But I want to do that on several data frames. I tried several strategies, including assign and get, but with no success. Here are the two main ways I've tried:
# Getting a list of all my dataframes
my_df <- list.files(path="data")
# 1st strategy, adapting what works for 1 dataframe
for (i in 1:length(files_names)) {
rownames(get(my_df[i])) <- get(my_df[[i]])[,1] # The problem seems to be in this line
my_df[i] <- my_df[i][,-1]
}
# The error is: could not find function "get<-"
# 2nd strategy using assign()
for (i in 1:length(my_df)) {
assign(rownames(get(my_df[[i]])), get(my_df[[i]])[,1]) # The problem seems to be in this line
my_df[i] <- my_df[i][,-1]
}
# The error is : Error in assign(rownames(my_df[i]), get(my_df[[i]])[, 1]) : first argument incorrect
I really don't see what I missed. When I type get(my_df[i]) and get(my_df[[i]])[,1], it works alone in the console...
Thank you very much to those who can help me :)
You can wrap the code you already have in a function, then read each file and pass the resulting data frame to that function.
change_rownames <- function(df1) {
row.names(df1) <- df1[,1]
df1 <- df1[,-1]
df1
}
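# full.names = TRUE returns paths that read.csv() can open from any working directory
my_df <- list.files(path="data", full.names = TRUE)
list_data <- lapply(my_df, function(x) change_rownames(read.csv(x)))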
We can use a looping function like lapply or purrr::map to go through all the files, then use tibble::column_to_rownames, which simplifies the procedure a lot. No need for an explicit for loop.
library(purrr)
library(dplyr)   # for the %>% pipe
library(tibble)  # column_to_rownames() lives in tibble
map(my_df, ~ .x %>% read.csv() %>% column_to_rownames(var = names(.)[1]))
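On why the question's first strategy fails: an assignment of the form rownames(get(x)) <- value makes R look for a replacement function named `get<-`, which does not exist. For data frames that already live in the workspace, a minimal sketch is to fetch a copy by name, modify it, and assign it back. The objects df1, df2, and df_names below are made up for illustration.

```r
# Two hypothetical data frames whose first column holds the desired row names
df1 <- data.frame(id = c("x", "y"), value = c(1, 2))
df2 <- data.frame(id = c("p", "q"), value = c(3, 4))
df_names <- c("df1", "df2")

for (nm in df_names) {
  tmp <- get(nm)                  # fetch a copy by name
  rownames(tmp) <- tmp[, 1]       # replacement function applied to a plain variable
  tmp <- tmp[, -1, drop = FALSE]  # drop = FALSE keeps a data frame when one column remains
  assign(nm, tmp)                 # write the modified copy back under the same name
}

rownames(df1)
names(df2)
```

The drop = FALSE argument matters here: without it, removing the first of two columns would return a bare vector and the row names would be lost.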

How to unpack particular list elements into dataframes in R?

I have been researching this question on SO and found only solutions for merging list elements into one large data frame. However, I am struggling with unpacking only those elements that meet a certain condition.
df1 <- iris %>% filter(Sepal.Length > 2.5)
df2 <- mtcars %>% filter(qsec > 16)
not_neccessary <- head(diamonds, 10)
not_neccessary2 <- head(beaver1, 12)
data_lists <- list("#123 DATA" = df1, "CON" = not_neccessary2, "#432 DATA" = df2, "COM" = not_neccessary)
My goal is to convert only those list elements that contain "DATA" in their name. I was thinking about writing a loop function within a lapply:
a <- lapply(data_lists, function(x){if (x == "#+[1-9]+_+DATA"){new_df <- as.data.frame(x)}})
It does not work. I also tried a for loop:
for (i in list){
if (i == "#+[1-9]+_+DATA"){
df <- i
}
}
It does not work either.
Is there an effective function that will unpack my list into separate data frames based on a condition? My R skills are very bad, especially in writing functions, although I am not really new to this language. Sorry about that.
Use grepl/grep to find the list elements that have 'DATA' in their name and subset the list.
result <- data_lists[grepl('DATA', names(data_lists))]
#With `grep`
#result <- data_lists[grep('DATA', names(data_lists))]
Using %like% from data.table
library(data.table)
result <- data_lists[names(data_lists) %like% 'DATA']
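If the selected elements should then become separate objects, one possible sketch is to filter on the names with a stricter pattern, convert the names into valid R identifiers, and unpack with list2env(). The pattern and the df_ naming scheme below are assumptions for illustration (the question's names use a space before DATA, not an underscore), and small built-in datasets stand in for the question's data.

```r
# Small stand-ins for the question's list (built-in datasets only)
data_lists <- list(
  "#123 DATA" = head(iris),
  "CON"       = head(beaver1),
  "#432 DATA" = head(mtcars)
)

# Keep elements whose name is "#", digits, a space, then "DATA"
keep <- grepl("^#[0-9]+ DATA$", names(data_lists))
result <- data_lists[keep]

# Build syntactically valid object names, e.g. "#123 DATA" becomes "df_123_DATA"
names(result) <- paste0("df", gsub("[^A-Za-z0-9]+", "_", names(result)))

# Create one object per kept element in a dedicated environment
unpacked <- list2env(result, envir = new.env())
ls(unpacked)
```

Passing .GlobalEnv instead of new.env() would create the objects in the workspace, at the cost of possibly overwriting existing ones.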

Subset the remaining of a dataframe using another subset

I have a sample dataset. I've created a subset of the original data frame using some condition. Now I need to extract the remaining contents of the original sample data frame, except the subset created. How can I do this?
data("mtcars")
fulldf <- mtcars
subdf <- subset.data.frame(fulldf, subset = fulldf$disp < 100)
restdf <- subset.data.frame(fulldf, subset = <fulldf without subdf>)
There are a lot of questions on subsetting data frames in R, but I couldn't find one that satisfied my requirement.
Also the final solution need not necessarily be using subset.data.frame. Any method/package will do.
In base R, it is better to store the logical condition in an object and then negate it with ! to get the remaining rows
i1 <- fulldf$disp < 100
subdf <- subset.data.frame(fulldf, subset = i1)
restdf <- subset.data.frame(fulldf, subset = !i1)
Another option is to create a list of two data frames with split
lst1 <- split(fulldf, i1)
If 'subdf' is created with multiple conditions (the question is not clear on this), one option is to add a sequence column to the data and then subset with %in%
fulldf$ind <- seq_len(nrow(fulldf))
then after the 'subdf' step
restdf <- subset(fulldf, !ind %in% subdf$ind)
and remove the 'ind' column from both data frames
restdf$ind <- NULL
subdf$ind <- NULL
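To make the split() option concrete, here is a small sketch on mtcars. It also shows a row-name based alternative for the case where only subdf is available; that alternative assumes the row names are unique, which holds for mtcars. The name restdf2 is illustrative.

```r
fulldf <- mtcars
i1 <- fulldf$disp < 100

# split() names the two pieces after the logical values "FALSE" and "TRUE"
lst1 <- split(fulldf, i1)
subdf  <- lst1[["TRUE"]]
restdf <- lst1[["FALSE"]]

# Alternative when only subdf is available: match on row names
restdf2 <- fulldf[!rownames(fulldf) %in% rownames(subdf), ]

# Sanity checks: the two pieces partition the original rows
nrow(subdf) + nrow(restdf) == nrow(fulldf)
identical(rownames(restdf), rownames(restdf2))
```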

Subsetting efficiently on multiple columns and rows

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.
You do not need which with %in%; it already returns a logical vector that can index rows directly. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
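To extend this to many variables without writing one condition per column, one possible sketch is to keep the column names in a vector, test each column with lapply, and combine the results with Reduce. The sample data and the names vars and checks are assumptions for illustration.

```r
# Hypothetical data: G1 and G9 are the filter columns, other is carried along
df <- data.frame(
  G1    = c(1, 2, 4, 3, 5),
  G9    = c(3, 7, 1, 2, 1),
  other = letters[1:5]
)

vars <- c("G1", "G9")  # columns to check; edit this vector to add more
vals <- c(1, 2, 3)     # allowed values

# One logical vector per column, then combine them with an element-wise AND
checks  <- lapply(df[vars], function(col) col %in% vals)
keepies <- Reduce(`&`, checks)

newdf <- df[keepies, ]
newdf
```

Changing the variables or the allowed values now only requires editing vars or vals.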
Use data.table
First, melt your data into long format
library(data.table)
DT <- melt(as.data.table(df))  # melt() for data.table needs a data.table input
Then split into a list, with one element per original column
DTLists <- split(DT, by = "variable")
Now you can operate on each list element using lapply
DTresult <- lapply(DTLists, function(x) {
...
})

R change column names over multiple data

I'm trying to change column names over multiple data sets. I have tried writing the following function to do this:
# simplified test data #
df1<-as.data.frame(c("M","F"))
colnames(df1)<-"M1"
# my function #
rename_cols<-function(df){
colnames(df)[names(df) == "M1"] <- "sex"
}
rename_cols(df1)
However when testing this function on df1, the column is always called "M1" instead of "sex". How can I correct this?
SOLUTION - THANKS TO DAVID ARENBERG
rename_cols<-function(df){
colnames(df)[names(df) == "M1"] <- "sex"
df
}
df1<-rename_cols(df1)
Here is another solution, which gets around the fact that a function works on a local copy of its argument:
df <- as.data.frame(c("M","F"))
colnames(df) <- "M1"
rename_cols <- function(df) {
colnames(df)[names(df) == "M1"] <<- "sex"
}
> rename_cols(df) # this will operate directly on the 'df' object
> df
sex
1 M
2 F
Using the global assignment operator <<- makes the name change to the data frame df "stick". Granted, this solution is not ideal: <<- always modifies the object named df in the enclosing environment, whatever data frame is actually passed in, so the function could do something unwanted. But I feel this is in the spirit of what you were trying to do originally.
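Since the goal is to rename columns across multiple data sets, a hedged sketch of the return-value version applied to a list with lapply may be useful. The list datasets and its contents are hypothetical.

```r
# Returns a modified copy; the caller decides where to store it
rename_cols <- function(df) {
  colnames(df)[names(df) == "M1"] <- "sex"
  df
}

# Hypothetical data sets sharing the column to rename
datasets <- list(
  df1 = data.frame(M1 = c("M", "F")),
  df2 = data.frame(M1 = c("F", "F"), age = c(30, 41))
)

# Apply the function to every element; the names of the list are preserved
datasets <- lapply(datasets, rename_cols)
sapply(datasets, function(d) names(d)[1])
```

Because the function has no side effects, it behaves the same for any data frame passed to it, regardless of what that data frame is called.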

Resources