How to write to csv each split chunk? - r

I have list data for which I used split:
x <- split(A, f = A$Col_1)
It works beautifully. But now I need to write each chunk of the split to an individual .csv. There are 2100 chunks of 140 rows each. Let's call them "1:2100". I would like to create something that wrote "1" to "~/full_path_name/A1.csv" then go to "2" and write to "~/full_path_name/A2.csv", then "3" to "~/full_path_name/A3.csv", etc.
I included "~/full_path_name/" because down the road this path name will change for other data using the same code, and for my own understanding I need to see it in the code. I don't know how to write a small sample of what I am asking for for someone to correct because I don't know how to write it at all.
Can someone make a suggestion on how to do this? Thank you.
I have only been coding for a month and am entirely self-taught. I do not have a background in other programming languages. I have no one to ask for help but here. I struggle with the terminology, so please understand if I am not asking in the proper way and I will try to correct it if need be.
EDIT, AFTER DOING SOME FURTHER RESEARCH --
This is what I have found elsewhere on SO from #RichPaloo, and my adaptations below that:
#example data.frame
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
#split into a list by the y column
l <- split(df, df$y)
#the names of the list are the unique values of the y column
nam <- names(l)
#iterate over the list and the vector of list names and write csvs
for(i in 1:length(l)) {
write_csv(l[[i]], paste0(nam[i], ".csv"))
}
This is my version:
bcc4.5_WINTER <- split(bcc4.5_FinalWinterRO, f = bcc4.5_FinalWinterRO$HUC8)
nam <- names(bcc4.5_WINTER)
for(i in 1:length(bcc4.5_WINTER)) {
write_csv(bcc4.5_WINTER[[i]], paste0(“~/Rprojects/BCC_CSM1_1_RCP_45/Winter/”, nam[i], “.csv”))
}
I appear to have a problem with the folder within my home folder "/BCC_CSM1_1_RCP_45/Winter/” It says "unexpected token" at both ends, but not at the "~Rprojects". Can I not send something to a folder within my home folder?
It also shows redlines under the quotes around ".csv" near the end. I don't know what to make of this because it's exactly what the person used successfully, apparently, in another post. Thank you.

So, the code example above (#Paul) worked except the df[l] was not being iterated, so I removed the _i from each l instance. The final problem I had (in comments above) was because the path name was not complete.
I used fwrite() rather than write.csv because it gave me better feedback as I struggled with mistakes. This gave me what I needed:
library(data.table) # provides fwrite()
#split the data frame into chunks by the values in a column, in this case column "BBB"
df <- split(old_df, f = old_df$BBB)
#write those chunks to individual .csv files with the name being the name of each chunk
save_fun <- function(df, name_i) {
  fwrite(df, file = paste0("~/Desktop/projects_folder/", name_i, ".csv"))
}
#save the file on your computer
mapply(FUN = save_fun, df, name_i = names(df), SIMPLIFY = FALSE)
Much thanks to Paul.

Investigating the potential typo problem
Please see the two lines below:
write.csv(l[[1]], file = paste0("./a_folder/", names(l)[1], ".csv"))
write.csv(l[[1]], file = paste0(“./a_folder/”, names(l)[1], “.csv”))
Line 1 will save the file. Note that "./a_folder/" and ".csv" are seen as text.
In line 2, “./a_folder/” and “.csv” are not recognized as text because they are wrapped in curly (typographic) quotes instead of straight quotes ("). Line 2 produces an error: unexpected input in " write.csv(l[[1]], file = paste0(“"
RStudio colors your code to help you with this problem.
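If the code was copied from a web page or a word processor, the curly quotes can be swapped for straight ones before running it. Here is a minimal sketch in base R; the variable names (pasted, fixed, has_curly) are made up for the illustration:

```r
# A line of code as it might arrive after copy-pasting, with curly quotes
# (\u201C and \u201D are the left and right curly double quotes)
pasted <- "paste0(\u201C./a_folder/\u201D, nam, \u201C.csv\u201D)"

# Replace both kinds of curly double quote with a straight double quote
fixed <- gsub("[\u201C\u201D]", "\"", pasted)

# TRUE if any curly double quotes are still present
has_curly <- grepl("[\u201C\u201D]", fixed)

cat(fixed, "\n")
```

In practice it is usually quicker to retype the quotes in the editor, but a check like this can help when a long script was pasted in one go.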
Thoughts about not using a for loop.
I think a better way to go (especially when you have a large dataset) is to use lapply or mapply. What these functions do is take each "chunk" of a list and apply a function to it.
lapply loses the name of each chunk while processing it, which can be annoying when you want to use the name of the chunk to name the file on your computer. mapply() comes in handy to deal with this situation.
Here is an example using the provided example.
# example data.frame
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
# split df
l <- split(df, df$y)
# save each "chunk" of l as a .csv file on a hard drive
# 1st, create a function that takes a "chunk" of your list and its name as inputs
save_fun <- function(l_i, name_i) {
print(l_i) # print the output in console
write.csv(l_i, file = paste0("./a_folder/", name_i, ".csv")) # save the file on your computer
}
# 2nd, use mapply (and not lapply) to use the previous function on each pair chunk/name
mapply(FUN = save_fun, l_i = l, name_i = names(l), SIMPLIFY = FALSE) # see ?mapply for how to use mapply()
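The same idea can be written with the output folder kept in a single variable, which may help when the path changes later. This is only a sketch: out_dir and paths are illustrative names, tempdir() stands in for a real folder such as "~/full_path_name", and Map() is used as a close relative of mapply(..., SIMPLIFY = FALSE).

```r
# example data.frame, split by the y column
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
l <- split(df, df$y)

# output folder kept in one variable so it is easy to change later
# (tempdir() is only a stand-in for a real path)
out_dir <- file.path(tempdir(), "a_folder")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# build one file path per chunk from the chunk names
paths <- file.path(out_dir, paste0(names(l), ".csv"))

# Map() pairs each chunk with its path; invisible() hides the returned list
invisible(Map(function(chunk, path) write.csv(chunk, path, row.names = FALSE),
              l, paths))

# check what was written
list.files(out_dir)
```

Creating the folder first with dir.create() avoids the "cannot open file" error that appears when the path is incomplete or the folder does not exist.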

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list=x, file=paste0(x, ".Rda")))
Note that you need to generate the varying file names by providing x as a variable and not as a character (as part of the file name).
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
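A small round trip may make the save/load pairing easier to follow. The sketch below uses made-up data frames and a temporary folder; src_env and loaded are illustrative names, and loading into a separate environment is just one way to see what came back without overwriting anything.

```r
# made-up data frames standing in for the real ones
colors <- data.frame(id = 1:2, name = c("red", "blue"))
sets <- data.frame(set = c("a", "b"))

# the names of the objects to save, as character strings
DFs <- c("colors", "sets")

out_dir <- tempdir()        # stand-in for the real folder
src_env <- environment()    # where the data frames live

# save each object under its own name: colors.Rda, sets.Rda
for (nm in DFs) {
  save(list = nm, file = file.path(out_dir, paste0(nm, ".Rda")), envir = src_env)
}

# load the files back into a fresh environment
loaded <- new.env()
for (nm in DFs) {
  load(file.path(out_dir, paste0(nm, ".Rda")), envir = loaded)
}

ls(loaded)
```

Because save(list = nm) receives the object's name as a string, each object comes back under its original name.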
To save you can do this:
DFs <- list(colors, sets, inventory)
names(DFs) = c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
dx = paste(names(DFs)[[x]], "Rda", sep = ".")
dfx = DFs[[x]]
save(dfx, file = dx)
}
To specify a path, include it when constructing the dx object, in the same way the arq object is built in the reading code.
To read:
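DFs <- c("colors", "sets", "inventory")
# or take the names from the folder, dropping the .csv extension
# DFs <- sub("\\.csv$", "", dir("~/Documents/ME teaching/R notes/datasets/", pattern = "\\.csv$"))
# empty list to hold one data frame per file
dat <- vector("list", length(DFs))
names(dat) <- DFs
for(x in seq_along(DFs)){
  arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
  dat[[x]] = read.csv(arq)
}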
It will read as a list, so you can access using [[]] indexation.
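For reference, the reading step can also be written without an explicit loop. This is a sketch under assumptions: the two CSV files are created in a temporary folder purely so the example runs on its own, and data_dir, csv_files and dat are illustrative names.

```r
# set up: write two small CSV files to a temporary folder
# (a stand-in for the real datasets folder)
data_dir <- file.path(tempdir(), "datasets")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2), file.path(data_dir, "colors.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3), file.path(data_dir, "sets.csv"), row.names = FALSE)

# find the CSV files, keeping the full path to each one
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# read every file; the result is a list of data frames
dat <- lapply(csv_files, read.csv)

# name each element after its file, without folder or extension
names(dat) <- tools::file_path_sans_ext(basename(csv_files))

# access one data frame with [[ ]]
dat[["sets"]]
```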

RSME on dataframe of multiple files in R

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This calls all the files into a dataframe
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
dat <- read_excel(z, skip = 0, )
return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
Because splitByHUCs is a list, we can loop over it with sapply/lapply as in the OP's code. The names$ prefix is incorrect, though: inside the anonymous function, the argument x stands for each element of the list in turn (i.e. a data.frame). Therefore, instead of names$, use x$
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
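To see this on something small, here is a self-contained sketch with made-up numbers. The rmse() defined here is only a stand-in for Metrics::rmse() so that the example has no dependencies, and the HUC column name is simplified.

```r
# stand-in for Metrics::rmse(), defined here so the sketch runs on its own
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# made-up data with two HUC groups
bigdata <- data.frame(
  HUC = c("1010002", "1010002", "1010003", "1010003"),
  Actual = c(1, 2, 3, 4),
  Predicted = c(1, 4, 3, 4)
)
splitByHUCs <- split(bigdata, f = bigdata$HUC)

# x is one data frame per iteration, so its columns are reached through x$
rMSEs <- sapply(splitByHUCs, function(x) rmse(x$Actual, x$Predicted))

# turn the named vector into a two-column table before writing it out
result <- data.frame(HUC = names(rMSEs), RMSE = unname(rMSEs))
result
```

The names of the list (one per HUC) carry through sapply(), so each RMSE stays labelled with the group it came from.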

R - combining lines from multiple CSV into a data frame

I have a folder with hundreds of CSV files each containing data for a particular postal code.
Each CSV files contains two columns and thousands of rows. Descriptors are in Column A, values are in Column B.
I need to extract two pieces of information from each file and create a new table or dataframe using the values in [Column A, Row 2] (which is the postal code) and [Column B, Row 1585] (which is the median income).
The end result should be a table/dataframe with two columns: one for postal code, the other for median income.
Any help or advice would be appreciated.
Disclaimer: this question is pretty vague. Next time, be sure to add a reproducible example that we can run on our machines. It will help you, the people answering your questions, and future users.
You might try something like:
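files = list.files("~/Directory", full.names = TRUE)
my_df = data.frame(A = character(), B = numeric())
for(i in seq_along(files)){
  # row 2 of the file holds the postal code in column A
  row2 = read.csv(files[i], header = FALSE, skip = 1, nrows = 1)
  # row 1585 of the file holds the median income in column B
  row1585 = read.csv(files[i], header = FALSE, skip = 1584, nrows = 1)
  my_df = rbind(my_df, data.frame(A = as.character(row2[[1]]), B = row1585[[2]]))
}
my_df = my_df[,c("A","B")]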
# Note on interpreting indexing syntax:
Read this as "my_df is now (=) my_df such that ([) the columns (,)
are only A and B (c("A", "B")) "
You can use list.files function to get directories for all your files and then use read.csv and rbind in for loop to create one data.frame.
Something like this:
direct<-list.files("directory_to_your_files", full.names = TRUE)
df<-NULL
for(i in seq_along(direct)){
  df<-rbind(df,read.csv(direct[i]))
}
So here is the code which does what I want it to do. If there are more elegant solutions, please feel free to point them out.
# set the working directory to where the data files are stored
setwd("/foo")
# count the files
files = list.files("/foo")
#create an empty dataframe and name the columns
dataMatrix=data.frame(matrix(c(rep(NA,times=2*length(files))),nrow=length(files)))
colnames(dataMatrix)=c("Postal Code", "Median Income")
# create a for loop to get the information in R2/C1 and R1585/C2 of each data file
# Data in R2/C1 is a string, but is interpreted as a number unless specifically declared a string
for(i in 1:length(files)) {
getData = read.csv(files[i],header=F)
dataMatrix[i,1]=toString(getData[2,1])
dataMatrix[i,2]=(getData[1585,2])
}
Thank you to all those who helped me figure this out, especially Nancy.
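One possible variation on the accepted code is to wrap the per-file work in a function and stack the results. The sketch below is an assumption-laden illustration: it builds two tiny files in a temporary folder, and the cells of interest sit in row 2 and row 4 instead of row 2 and row 1585. The names get_cells, code_row and income_row are made up.

```r
# set up: two tiny files standing in for the real ones
# (here the values of interest are in row 2, column 1 and row 4, column 2)
data_dir <- file.path(tempdir(), "postal")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("desc,value", "A1A,0", "x,1", "Median income,50000"),
           file.path(data_dir, "f1.csv"))
writeLines(c("desc,value", "B2B,0", "x,1", "Median income,62000"),
           file.path(data_dir, "f2.csv"))

files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# read one file and return a one-row data frame with the two cells
get_cells <- function(path, code_row = 2, income_row = 4) {
  d <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
  data.frame(postal_code = as.character(d[code_row, 1]),
             median_income = as.numeric(d[income_row, 2]))
}

# one row per file, stacked into a single data frame
result <- do.call(rbind, lapply(files, get_cells))
result
```

For the real files, income_row would be set to 1585. Using full.names = TRUE means setwd() is not needed.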

How do I save individual species data downloaded via rgbif?

I have a list of species and I want to download occurrence data from them using rgbif. I'm trying out the code with just two species with the assumption that when I get it to work for two getting it to work for the actual (and much longer) list won't be a problem. Here's the code I'm using:
#Start
library(rgbif)
splist <- c('Acer platanoides','Acer pseudoplatanus')
keys <- sapply(splist, function(x) name_suggest(x)$key[1], USE.NAMES=FALSE)
OS1=occ_search(taxonKey=keys, fields=c('name','key','decimalLatitude','decimalLongitude','country','basisOfRecord','coordinateAccuracy','elevation','elevationAccuracy','year','month','day'), minimal=FALSE,limit=10, return='data')
OS1
#End
This bit works almost perfectly. I get data for both species divided by species. One species is missing some columns, but I'm assuming for now that's an issue with the data, not the code. The next line I tried -
write.csv(OS1, "os1.csv")
works fine when saving a single species but not for more than one. Can someone please help? How do I save data for each species as separate files, bearing in mind I also want the method to work for data for more than 2 species?
Thanks!
The result is a list, which means you can use R's functions to climb each list element and save it. The following code extracts species names (you might have this laying around somewhere already) and uses mapply to pair species data and file name and use this to save a .txt file.
filenames <- paste(sapply(sapply(OS1, FUN = "[[", "name", simplify = FALSE), unique), ".txt", sep = "")
mapply(OS1, filenames, FUN = function(x, y) write.table(x, file = y, row.names = FALSE))
This is akin to a for loop solution, but some might argue a more concise one.
for (i in 1:length(filenames)) {
write.table(OS1[[i]], file = filenames[i], row.names = FALSE)
}
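Species names contain spaces, which some people prefer to avoid in file names. The following sketch shows one way to clean them up before writing. It does not call rgbif: OS1 here is a made-up stand-in with the same shape (a list of data frames with a name column), and out_dir is a temporary folder.

```r
# made-up stand-in for the list returned by occ_search(): one data frame per species
OS1 <- list(
  data.frame(name = "Acer platanoides", year = c(2001, 2005)),
  data.frame(name = "Acer pseudoplatanus", year = 2010)
)

out_dir <- file.path(tempdir(), "species")
dir.create(out_dir, showWarnings = FALSE)

# take the species name from each element and replace spaces with underscores
species <- vapply(OS1, function(x) as.character(unique(x$name))[1], character(1))
filenames <- file.path(out_dir, paste0(gsub(" ", "_", species), ".csv"))

# seq_along() also behaves correctly if the list happens to be empty
for (i in seq_along(OS1)) {
  write.csv(OS1[[i]], file = filenames[i], row.names = FALSE)
}

basename(filenames)
```

Because the file names are built from the data itself, the same code should work unchanged for a longer species list.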

R loop opening files

I am trying to run a simple 5 lines command but over 9000 different files. I wrote the following for loop
setwd("/Users/morandin/Desktop/Test")
output_file<- ("output_file.txt")
files <- list.files("/Users/morandin/Desktop/Test")
for(i in files) {
chem.w.in <- scan(i, sep=",")
pruned.tree<-drop.tip(mytree,which(chem.w.in %in% NA))
plot(pruned.tree)
pruned.tree.ja.chem.w.in <- phylo4d(pruned.tree, c(na.omit(chem.w.in)))
plot(pruned.tree.ja.chem.w.in)
out <- abouheif.moran(pruned.tree.ja.chem.w.in)
print(out)
}
Hey I am editing my question: the above code does the for loop perfectly now (thanks for all your help). I am still having an issue with the output.
I can redirect the entire output using R through bash commands but I would need the name of the processed file. My output looks like this:
class: krandtest
Monte-Carlo tests
Call: as.krandtest(sim = matrix(res$result, ncol = nvar, byrow = TRUE),
obs = res$obs, alter = alter, names = test.names)
Number of tests: 1
Adjustment method for multiple comparisons: none
Permutation number: 999
Test Obs Std.Obs Alter Pvalue
1 dt 0.1458514 0.7976225 greater 0.2
other elements: adj.method call
Is there a way to print Pvalue results and name of the file (element i)??
Thanks
Since Paul Hiemstra's answer answered #1, here's an answer to #2, assuming that by "answers" you mean "the printed output of abouheif.moran(pruned.tree.ja.chem.w.in)".
Use cat() with the argument append = TRUE. For example:
output_file = "my_output_file.txt"
for(i in files) {
# do stuff
# make plots
out <- abouheif.moran(pruned.tree.ja.chem.w.in)
out <- sprintf("-------\n %s:\n-------\n%s\n\n", i, paste(capture.output(print(out)), collapse = "\n")) # capture the printed output as text
cat(out, file = output_file, append = TRUE)
}
This will produce a file called my_output_file.txt that looks like:
-------
file_1:
-------
output_goes_here
-------
file_2:
-------
output_goes_here
Obviously the formatting is entirely up to you; I just wanted to demonstrate what could be done here.
An alternative solution would be to sink() the entire script, but I'd rather be explicit about it. A middle road might be to sink() just a small piece of the code, but except in extreme cases it's a matter of preference or style.
I suspect what is going wrong here is that list.files() by default returns only the names of the files, not the entire path to each file. Setting full.names to TRUE will fix this issue. Note that you will not have to add .txt to the file name, as list.files() already returns the full path to an existing file.
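Putting the two answers together, a loop that reads each file by its full path and appends one line per file to an output file could look roughly like this. It is a sketch only: the input files are created in a temporary folder, and mean() is a placeholder for the abouheif.moran() call, which needs the tree and packages from the question.

```r
# set up: two small input files in a temporary folder
# (stand-ins for the 9000 real files)
in_dir <- file.path(tempdir(), "Test")
dir.create(in_dir, showWarnings = FALSE)
writeLines("1,2,NA,4", file.path(in_dir, "file_1.txt"))
writeLines("5,NA,7", file.path(in_dir, "file_2.txt"))

# full.names = TRUE returns paths that can be opened from any working directory
files <- list.files(in_dir, full.names = TRUE)

output_file <- file.path(tempdir(), "my_output_file.txt")
if (file.exists(output_file)) file.remove(output_file)  # start from an empty file

for (i in files) {
  chem.w.in <- scan(i, sep = ",", quiet = TRUE)
  # placeholder statistic; the real code would call abouheif.moran() here
  out <- mean(chem.w.in, na.rm = TRUE)
  # one line per file: file name, a tab, then the value
  cat(sprintf("%s\t%s\n", basename(i), format(out)),
      file = output_file, append = TRUE)
}

readLines(output_file)
```

Writing the file name next to each value keeps the output readable even when the loop runs over thousands of files.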
