I have a data frame dt that stores a list of dataset names. I need to use them to create new datasets by selecting some variables, then use the datasets I just created as the source for further datasets, repeating the same process.
The first and second rows refer to data that is already available.
Then the available data is used to create a new dataset.
Then the dataset just created is used to create another new dataset.
The final output should be a list of datasets.
I would appreciate any help or suggestions.
dt <- data.frame(name = c("mtcars","iris", "mtcars_new","mtcars_new_1"),
data_source = c("mtcars","iris", "mtcars","mtcars_new"),
variable = c("","","mpg,cyl,am,hp","mpg,cyl"), stringsAsFactors = FALSE)
> dt
          name data_source      variable
1       mtcars      mtcars
2         iris        iris
3   mtcars_new      mtcars mpg,cyl,am,hp
4 mtcars_new_1  mtcars_new       mpg,cyl
dt_list <- list(mtcars, iris)
names(dt_list ) <- c("mtcars","iris")
# The final list of datasets
final_dt <- list(mtcars, iris, mtcars_new, mtcars_new_1)
So far, with a loop like the one below, I only get the mtcars_new dataset. I don't know how to add it back to the list and keep looping to get mtcars_new_1 and so on. I have many datasets, and I don't know in advance how many levels of nesting I need to loop through.
mtcars_new <- data.frame()
for(i in 1:nrow(dt)){
if(dt$data_source[[i]] %in% names(dt_list) && !dt$name[[i]] %in% names(dt_list)){
check <- eval(parse(text = dt$data_source[[i]]))
var <- c(unlist(strsplit(dt$variable[[i]],",")))
mtcars_new <- check[, colnames(check) %in% var]
}
}
This will produce the desired output shown. Since the fourth iteration uses the data created in the third iteration, you need a way to append the result of each iteration to a growing list of available data sets. Then, within each iteration, find the right starting data set in that list.
dt <- data.frame(name = c("mtcars","iris", "mtcars_new","mtcars_new_1"),
data_source = c("mtcars","iris", "mtcars","mtcars_new"),
variable = c("","","mpg,cyl,am,hp","mpg,cyl"), stringsAsFactors = FALSE)
input_data_sets <- list(mtcars, iris)
names(input_data_sets) <- c("mtcars","iris")
final_data_sets <- list()
for(i in seq_len(nrow(dt))) {
  available_data_sets <- c(input_data_sets, final_data_sets) #grows a list of all available data sets
  temp <- available_data_sets[[dt$data_source[i]]] #finds the right list member by name (first match)
  var <- unlist(strsplit(dt$variable[i], ","))
  if (length(var) > 0) temp <- temp[, var, drop = FALSE] #keep only the desired variables; an empty entry keeps every column
  final_data_sets[[dt$name[i]]] <- temp #add under the name provided; it becomes part of the available list in the next iteration
}
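If the rows of dt are not guaranteed to come in dependency order, or you do not know how deep the nesting goes, one option is to keep making passes until nothing new can be built. This is only a sketch of that idea: build_data_sets is a made-up helper name, the rows of dt are reversed here on purpose, and it assumes an empty variable entry means "keep every column".

```r
# dt from the question, deliberately reversed so sources come after their users
dt <- data.frame(name = c("mtcars_new_1", "mtcars_new", "iris", "mtcars"),
                 data_source = c("mtcars_new", "mtcars", "iris", "mtcars"),
                 variable = c("mpg,cyl", "mpg,cyl,am,hp", "", ""),
                 stringsAsFactors = FALSE)

# build_data_sets() is a hypothetical helper, not part of any package
build_data_sets <- function(dt, available) {
  pending <- which(!dt$name %in% names(available))  # rows still to be built
  repeat {
    # rows whose source already exists can be built in this pass
    ready <- pending[dt$data_source[pending] %in% names(available)]
    if (length(ready) == 0) break
    for (i in ready) {
      src <- available[[dt$data_source[i]]]
      vars <- strsplit(dt$variable[i], ",", fixed = TRUE)[[1]]
      # assumption: an empty variable entry keeps every column
      if (length(vars) > 0) src <- src[, vars, drop = FALSE]
      available[[dt$name[i]]] <- src
    }
    pending <- setdiff(pending, ready)
  }
  if (length(pending) > 0) {
    warning("no source found for: ", paste(dt$name[pending], collapse = ", "))
  }
  available
}

final_dt <- build_data_sets(dt, list(mtcars = mtcars, iris = iris))
names(final_dt)
```

The loop stops as soon as a pass builds nothing, so it never needs to know the nesting depth up front, and rows whose source never appears are reported with a warning instead of looping forever.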
I have created a function that filters a dataframe based on the unique values of 2 different columns. I'm trying to loop through the unique values of cell type (3) as well as the unique values of another column called Cell_line (2). I've created two lists to hold this information and am using a nested loop to step through each. The new dataframe only contains the last iteration of each list (cell_lines and types) and omits the other outputs. How can I obtain those as well? Either a list of dataframes or a single dataframe with all the information bound together would work.
### My function takes a few arguments and gives a new dataframe
myfunction <- function(data_frame,type,Cell) {
prism <- data_frame %>%
ungroup() %>%
filter(.,TYPE == type & Cell_Line == Cell) %>%
pivot_wider(., id_cols = c("Treatment_rep","value","lipid"),
names_from = Treatment_rep, values_from = value)
prism$Cell_line <- Cell
prism
}
### I'm attempting to feed these into my function iteratively
cell_lines <- unique(df2$Cell_Line) ## list of 3 types
types <- unique(df2$TYPE) ### list of 2 types
### nested loop
for (i in 1:length(types)) {
for(j in 1: length(cell_lines)) {
newdf <- myfunction(data_frame = df2, type = types[i], Cell = cell_lines[j])
}
}
You can do this
dflist <- list()
for (i in 1:length(types)){
for(j in 1: length(cell_lines)){
newdf <- myfunction(data_frame = df2, type = types[i], Cell = cell_lines[j])
dflist[[ length(dflist)+1 ]] <- newdf
}
}
And if afterwards you want to bind them all together:
df_total <- do.call(rbind, dflist)
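The same collect-then-bind pattern can also be written without nested loops by enumerating the combinations first. A rough sketch in base R, assuming a toy df2 and a simplified stand-in for myfunction (both are made up here, since the real data is not shown in the question):

```r
# Toy stand-in for df2; the real columns and values are not shown in the question
df2 <- data.frame(TYPE = rep(c("A", "B"), each = 3),
                  Cell_Line = rep(c("x", "y", "z"), times = 2),
                  value = 1:6,
                  stringsAsFactors = FALSE)

# filter_one() is a hypothetical, simplified stand-in for myfunction()
filter_one <- function(data_frame, type, Cell) {
  data_frame[data_frame$TYPE == type & data_frame$Cell_Line == Cell, , drop = FALSE]
}

# One row per (type, Cell) combination
combos <- expand.grid(type = unique(df2$TYPE),
                      Cell = unique(df2$Cell_Line),
                      stringsAsFactors = FALSE)

# One data frame per combination, collected in a list
dflist <- lapply(seq_len(nrow(combos)), function(k) {
  filter_one(df2, type = combos$type[k], Cell = combos$Cell[k])
})

# Bind everything into a single data frame
df_total <- do.call(rbind, dflist)
```

Because lapply returns a list, nothing is overwritten between combinations, which is the problem the original nested loop ran into.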
I've been struggling with column selection with lists in R. I've loaded a bunch of csv's (all with different column names and different number of columns) with the goal of extracting all the columns that have the same name (just phone_number, subregion, and phonetype) and putting them together into a single data frame.
I can get the columns I want out of one list element with this:
var<-data[[1]] %>% select("phone_number","Subregion", "PhoneType")
But I cannot select the columns from all the elements in the list this way, just one at a time.
I then tried a for loop that looks like this:
new.function <- function(a) {
for(i in 1:a) {
tst<-datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
}
print(tst)
}
But when I try:
new.function(5)
I'll only get the columns from the 5th element.
I know this might seem like a noob question for most, but I am struggling to learn lists and loops and R. I'm sure I'm missing something very easy to make this work. Thank you for your help.
Another way you could do this is to make a function that extracts your columns and apply it to all data.frames in your list with lapply:
library(dplyr)
extractColumns = function(x){
select(x,"phone_number","Subregion", "PhoneType")
#or x[,c("phone_number","Subregion","PhoneType")]
}
final_df = lapply(data,extractColumns) %>% bind_rows()
The way you have your loop set up currently is only saving the last iteration of the loop because tst is not set up to store more than a single value and is overwritten with each step of the loop.
You can establish tst as a list first with:
tst <- list()
Then in your code, be explicit that each step is saved as a separate element in the list by adding brackets and an index to tst. Here is a full example written the way you were doing it.
library(dplyr) #provides %>% and select()
#Example data.frame that could be in datas
df_1 <- data.frame("not_selected" = rep(0, 5),
"phone_number" = rep("1-800", 5),
"Subregion" = rep("earth", 5),
"PhoneType" = rep("flip", 5))
# Another bare data.frame that could be in datas
df_2 <- data.frame("also_not_selected" = rep(0, 5),
"phone_number" = rep("8675309", 5),
"Subregion" = rep("mars", 5),
"PhoneType" = rep("razr", 5))
# Datas is a list of data.frames, we want to pull only specific columns from all of them
datas <- list(df_1, df_2)
#Function for looping through the first 'a' elements
new.function <- function(a) {
  #create list to store new data.frames in once columns are selected
  tst <- list()
  for(i in seq_len(a)) {
    tst[[i]] <- datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
  }
  tst #return the list so the caller can keep it instead of only printing it
}
#Proof of concept for 2 elements
result <- new.function(2)
result
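Once every element has been reduced to the same three columns, the pieces can be stacked into the single data frame the question asks for. A minimal base R sketch, assuming two made-up data frames shaped like the ones in the example (wanted, selected and combined are illustrative names only):

```r
# Toy data frames shaped like the ones in the example (values are made up)
datas <- list(
  data.frame(not_selected = 0,
             phone_number = rep("1-800", 5),
             Subregion = "earth",
             PhoneType = "flip",
             stringsAsFactors = FALSE),
  data.frame(also_not_selected = 0,
             phone_number = rep("8675309", 5),
             Subregion = "mars",
             PhoneType = "razr",
             stringsAsFactors = FALSE)
)

wanted <- c("phone_number", "Subregion", "PhoneType")

# Keep only the shared columns in every element of the list
selected <- lapply(datas, function(d) d[, wanted, drop = FALSE])

# Stack the pieces row-wise into one data frame
combined <- do.call(rbind, selected)
```

This only works when every element actually contains all three columns; a data frame missing one of them would make the subsetting step fail.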
I am new to R and trying to do things the "R" way, which means no for loops. I would like to loop through a list of dataframes, loop through each row in the dataframe, and extract data based on criteria and store in a master dataframe.
Some of the issues I am having are with accessing the "global" dataframe. I am unsure of the best approach (global variable, pass by reference).
I have created an abstract example to try to show what needs to be done:
rm(list=ls())## CLEAR WORKSPACE
assign("last.warning", NULL, envir = baseenv())## CLEAR WARNINGS
# Generate a descriptive name with name and size
generateDescriptiveName <- function(animal.row, animalList.vector){
name <- animal.row["animal"]
size <- animal.row["size"]
# if in list of interest prepare name for master dataframe
if (any(grepl(name, animalList.vector))){
return (paste0(name, "Sz-", size))
}
}
# Animals of interest
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
jungleAnimals <- c("ants", "parrot", "cheetah")
jungleSizes <- c(0.1, 1, 50)
jungle.df <- data.frame(jungleAnimals, jungleSizes)
fieldAnimals <- c("elephant", "lion", "hyena")
fieldSizes <- c(1000, 100, 80)
field.df <- data.frame(fieldAnimals, fieldSizes)
forestAnimals <- c("squirrel", "deer", "lizard")
forestSizes <- c(1, 40, 0.2)
forest.df <- data.frame(forestAnimals, forestSizes)
ecosystems.list <- list(jungle.df, field.df, forest.df)
# Final master list
descriptiveAnimal.df <- data.frame(name = character(), descriptive.name = character())
# apply to all dataframes in list
lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
# apply to each row in dataframe
output <- apply(ecosystem.df, 1, function(row){generateDescriptiveName(row, animalList.vector)})
if(!is.null(output)){
# Add generated names to unique master list (no duplicates)
}
})
The end result would be:
  name       descriptive.name
1 "parrot"   "parrot Sz-1"
2 "cheetah"  "cheetah Sz-50"
3 "elephant" "elephant Sz-1000"
4 "deer"     "deer Sz-40"
5 "lizard"   "lizard Sz-0.2"
I did not use your function generateDescriptiveName() because I think it is a bit too laborious. I also do not see a reason to use apply() within lapply(). Here is my attempt to generate the desired output. It is not perfect but I hope it helps.
df_list <- lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
temp <- ecosystem.df[ecosystem.df$animal %in% animalList.vector, ]
if(nrow(temp) > 0){
data.frame(name = temp$animal, descriptive.name = paste0(temp$animal, " Sz-", temp$size))
}
})
do.call("rbind",df_list)
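Another way to look at it is to give every data frame the same column names, stack them once, and filter afterwards. A self-contained sketch under that assumption, using the sample data from the question (all_animals and keep are names made up for this example):

```r
# Animals of interest and sample data, as in the question
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
ecosystems.list <- list(
  data.frame(jungleAnimals = c("ants", "parrot", "cheetah"),
             jungleSizes = c(0.1, 1, 50), stringsAsFactors = FALSE),
  data.frame(fieldAnimals = c("elephant", "lion", "hyena"),
             fieldSizes = c(1000, 100, 80), stringsAsFactors = FALSE),
  data.frame(forestAnimals = c("squirrel", "deer", "lizard"),
             forestSizes = c(1, 40, 0.2), stringsAsFactors = FALSE)
)

# Rename the columns of every data frame, then stack them row-wise
all_animals <- do.call(rbind, lapply(ecosystems.list, setNames, c("animal", "size")))

# Keep only the animals of interest
keep <- all_animals[all_animals$animal %in% animalList.vector, ]

descriptiveAnimal.df <- data.frame(
  name = keep$animal,
  descriptive.name = paste0(keep$animal, " Sz-", keep$size),
  stringsAsFactors = FALSE
)
descriptiveAnimal.df
```

No global data frame has to be modified from inside the anonymous function: each step returns a value and the final data frame is built once at the end.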
I'm struggling with the following issue: I have many data frames with different names (For instance, Beverage, Construction, Electronic etc., dim. 540x1000). I need to clean each of them, calculate and save as zoo object and R data file. Cleaning is the same for all of them - deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[,colSums(is.na(Beverages))<nrow(Beverages)] #removing empty columns
Beverages_OK <- Beverages %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
Beverages_OK[, 1] <- NULL #dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save (Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tried to use the 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
temp <- X
temp <- temp [,colSums(is.na(temp))< nrow(temp)] #removing empty columns
temp <- temp %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
temp[, 1] <- NULL
temp <- cbind(data[1], temp)
X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to give it the same name as X has.
save (X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is there any way to do such a job? Is there an R package that facilitates it?
Thanks a lot.
If you are sure that you have only the needed data frames in the environment, this should get you started:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
tmp <- get(x)
})
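To take that a step further, the cleaning can be wrapped in a function and applied to a named list, so each result can be saved under its original name. This is a sketch under several assumptions: the two data frames are made up, clean_one is a hypothetical helper, the date-column and zoo steps from the question are left out, and saveRDS is used instead of save so that each file holds exactly one object.

```r
# Two small made-up data frames standing in for Beverages, Construction, ...
Beverages <- data.frame(id = 1:3, X.ERROR.1 = c("e", "e", "e"),
                        sales = c(10, 20, 30), empty = NA)
Construction <- data.frame(id = 1:3, X.ERROR.2 = c("e", "e", "e"),
                           permits = c(5, 6, 7), empty = NA)

# clean_one() is a hypothetical helper covering two of the cleaning steps
clean_one <- function(df) {
  df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]  # remove empty columns
  df[, !startsWith(names(df), "X.ERROR"), drop = FALSE]    # drop X.ERROR* columns
}

# Collect the data frames by name; mget() returns a named list
df_names <- c("Beverages", "Construction")
frames <- mget(df_names, envir = environment())
cleaned <- lapply(frames, clean_one)

# Save each cleaned data frame under its own name (tempdir() is just for the demo)
out_dir <- tempdir()
for (nm in names(cleaned)) {
  saveRDS(cleaned[[nm]], file = file.path(out_dir, paste0(nm, ".rds")))
}
```

Listing the names explicitly avoids picking up unrelated objects, which is the risk with a bare ls().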
I declare an empty data frame as this:
df <- data.frame()
Then I go through processing some files, and as I process them I need to build up my df data frame by adding columns to it.
For example, I process some file and build a data frame called new_df; I now need to add this new_df to my df.
I've tried this:
latest_df <- cbind(latest_df, new_df)
I get this error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 0, 1
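Just put the data into the index after the last column. The data frame needs at least one row first, because assigning a length-1 value as a column of a 0-row data frame fails:
new_df = data.frame('a'=NA)
new_df[,ncol(new_df)+1] = NA
So if you knew you had 3 columns and 3 rows, then:
new_df[,4] = c('a','b','c')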
Example:
new_df = data.frame('a'=NA)
for(i in 1:10){
new_df[,ncol(new_df)+1] = NA
}
new_df
EDIT:
ProcessExample <- function(){
return(c(5)) #just returns 5 as fake data everytime
}
new_df = data.frame(matrix(nrow=1))
for(i in 1:10){
new_df[,ncol(new_df)+1] = ProcessExample()
}
latest_df <- new_df[,-1]
Or just add rows and transpose the data set
new_df = data.frame()
for(i in 1:10){
new_df[i,1] = ProcessExample()
}
latest_df <- t(new_df)
If you simply want an empty data frame of the proper size before you enter the loop, and assuming "df" and "new_df" have the same number of rows x, try
df <- data.frame(matrix(nrow=x))
for (i in 1:n){
  df[[i]] <- new_col # new_col stands for some vector of length x
}
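A common alternative to growing a data frame inside the loop is to collect the pieces in a list and bind them once at the end, which sidesteps the "differing number of rows: 0, 1" error entirely. A minimal sketch, assuming a made-up process_file function in place of the real file processing:

```r
# process_file() is a hypothetical stand-in for whatever builds new_df
process_file <- function(i) {
  data.frame(value = i * 10)
}

n <- 3
pieces <- vector("list", n)  # pre-allocated list, one slot per file
for (i in seq_len(n)) {
  new_df <- process_file(i)
  names(new_df) <- paste0("file_", i)  # give each column a distinct name
  pieces[[i]] <- new_df
}

# Bind all columns once, after the loop
latest_df <- do.call(cbind, pieces)
latest_df
```

This requires every piece to have the same number of rows, which is the same assumption cbind makes in the original attempt.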