How do I save individual species data downloaded via rgbif? - r

I have a list of species and I want to download occurrence data for them using rgbif. I'm trying the code with just two species, on the assumption that once it works for two, getting it to work for the actual (much longer) list won't be a problem. Here's the code I'm using:
#Start
library(rgbif)
splist <- c('Acer platanoides','Acer pseudoplatanus')
keys <- sapply(splist, function(x) name_suggest(x)$key[1], USE.NAMES=FALSE)
OS1=occ_search(taxonKey=keys, fields=c('name','key','decimalLatitude','decimalLongitude','country','basisOfRecord','coordinateAccuracy','elevation','elevationAccuracy','year','month','day'), minimal=FALSE,limit=10, return='data')
OS1
#End
This bit works almost perfectly. I get data for both species divided by species. One species is missing some columns, but I'm assuming for now that's an issue with the data, not the code. The next line I tried -
write.csv(OS1, "os1.csv")
works fine when saving a single species but not for more than one. Can someone please help? How do I save data for each species as separate files, bearing in mind I also want the method to work for data for more than 2 species?
Thanks!

The result is a list, which means you can use R's functions to walk over each list element and save it. The following code extracts the species names (you might already have these lying around somewhere) and uses mapply to pair each species' data with a file name and write it to a .txt file.
filenames <- paste(sapply(sapply(OS1, FUN = "[[", "name", simplify = FALSE), unique), ".txt", sep = "")
mapply(OS1, filenames, FUN = function(x, y) write.table(x, file = y, row.names = FALSE))
This is equivalent to the for loop below, though some might argue the mapply version is more concise.
for (i in 1:length(filenames)) {
  write.table(OS1[[i]], file = filenames[i], row.names = FALSE)
}
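If you prefer .csv output, as in the question's write.csv call, the same idea works with a loop over the list elements. A minimal sketch, assuming OS1 is the named list of data frames returned above and that each element has a name column:
# write one .csv per species, named after the species in each chunk of the list
for (i in seq_along(OS1)) {
  sp_name <- unique(OS1[[i]]$name)[1]  # species name taken from the data itself
  write.csv(OS1[[i]], file = paste0(sp_name, ".csv"), row.names = FALSE)
}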

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, my question is: how do I tell R to use the element currently going through the loop inside an argument, when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list = x, file = paste0(x, ".Rda")))
Note that the varying file names are generated by using x as a variable inside paste0(), not by embedding the literal character "x" in the file name.
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
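For comparison, here is a sketch of the original loop rewritten the same way, assuming the three data frames already exist in the workspace; save()'s list argument takes object names as character strings, and paste0() builds the file name from the loop variable rather than the literal "x":
DFs <- c("colors", "sets", "inventory")
for (x in DFs) {
  # 'list = x' tells save() to look up the object named by the string in x;
  # paste0() varies the file name with the loop variable
  save(list = x, file = paste0(x, ".Rda"))
}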
To save you can do this:
DFs <- list(colors, sets, inventory)
names(DFs) <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
  dx <- paste(names(DFs)[[x]], "Rda", sep = ".")
  dfx <- DFs[[x]]
  save(dfx, file = dx)
}
To specify a path, just include it when building the dx object, as in the reading example below.
To read:
DFs <- c("colors", "sets", "inventory")
# or derive the names from the files in the folder
# DFs <- sub("\\.csv$", "", dir("~/Documents/ME teaching/R notes/datasets/"))
dat <- vector("list", length(DFs))
names(dat) <- DFs
for(x in 1:length(DFs)){
  arq <- paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
  dat[[x]] <- read.csv(arq)
}
The result, dat, is a list, so you can access each data frame with [[ ]] indexing.
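A compact alternative, sketched under the same folder and file-name assumptions, reads everything into a named list in one step:
paths <- paste0("~/Documents/ME teaching/R notes/datasets/", DFs, ".csv")
dat <- setNames(lapply(paths, read.csv), DFs)  # named list: dat[["colors"]], dat[["sets"]], ...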

RMSE on dataframe of multiple files in R

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This collects the names of the files to read
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file (read_excel() is from the readxl package)
read_data <- function(z){
  dat <- read_excel(z, skip = 0)
  return(dat)
}
#This combines them into one table (rbindlist() is from data.table) and splits it by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = TRUE)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1, sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over it with sapply/lapply as in the OP's code, but names$ is incorrect: the anonymous function's argument is x, which stands for each element of the list (i.e. a data.frame). Therefore, instead of names$, use x$:
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
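Because sapply over a named list returns a named vector, the HUC names carry through to the result. A small sketch (assuming the Metrics package mentioned in the question) of writing it out with the names as their own column:
library(Metrics)
rMSEs <- sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
# one row per HUC, with the split name kept as its own column
write.csv(data.frame(HUC = names(rMSEs), RMSE = rMSEs),
          file = "~/Desktop/testRMSEs.csv", row.names = FALSE)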

How to write to csv each split chunk?

I have list data for which I used split:
x <- split(A, f = A$Col_1)
It works beautifully. But now I need to write each chunk of the split to an individual .csv. There are 2100 chunks of 140 rows each; call them "1:2100". I would like something that writes chunk "1" to "~/full_path_name/A1.csv", then goes to "2" and writes it to "~/full_path_name/A2.csv", then "3" to "~/full_path_name/A3.csv", and so on.
I included "~/full_path_name/" because down the road this path will change for other data using the same code, and for my own understanding I need to see it in the code. I don't know how to write even a small sample of what I'm asking for, because I don't know how to write it at all.
Can someone make a suggestion on how to do this? Thank you.
I have only been coding for a month and am entirely self-taught. I do not have a background in other programming languages. I have no one to ask for help but here. I struggle with the terminology, so please understand if I am not asking in the proper way, and I will try to correct it if need be.
EDIT, AFTER DOING SOME FURTHER RESEARCH --
This is what I have found elsewhere on SO from @RichPaloo, with my adaptations below it:
#example data.frame
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
#split into a list by the y column
l <- split(df, df$y)
#the names of the list are the unique values of the y column
nam <- names(l)
#iterate over the list and the vector of list names and write csvs
for(i in 1:length(l)) {
  write_csv(l[[i]], paste0(nam[i], ".csv"))  # write_csv() is from the readr package
}
This is my version:
bcc4.5_WINTER <- split(bcc4.5_FinalWinterRO, f = bcc4.5_FinalWinterRO$HUC8)
nam <- names(bcc4.5_WINTER)
for(i in 1:length(bcc4.5_WINTER)) {
write_csv(bcc4.5_WINTER[[i]], paste0(“~/Rprojects/BCC_CSM1_1_RCP_45/Winter/”, nam[i], “.csv”))
}
I appear to have a problem with the folder within my home folder, "/BCC_CSM1_1_RCP_45/Winter/". It says "unexpected token" at both ends, but not at the "~/Rprojects" part. Can I not send something to a folder within my home folder?
It also shows red underlines beneath the quotes around ".csv" near the end. I don't know what to make of this, because it's exactly what the person apparently used successfully in another post. Thank you.
So, the code example above (from @Paul) worked, except the df[l] was not being iterated, so I removed the _i from each l instance. The final problem I had (in the comments above) was because the path name was not complete.
I used fwrite() rather than write.csv() because it gave me better feedback as I struggled with mistakes. This gave me what I needed:
#split file into chunks by names within a row, in this case row "BBB"
df <- split(old_df, f = old_df$BBB)
#write those chunks to individual .csv files with the name being the name of each chunk
save_fun <- function(df, name_i) {
  fwrite(df, file = paste0("~/Desktop/projects_folder/", name_i, ".csv"))  # fwrite() is from data.table
}
#save the file on your computer
mapply(FUN = save_fun, df, name_i = names(df), SIMPLIFY = FALSE)
Much thanks to Paul.
Investigating the potential typo problem
Please see the two lines below:
write.csv(l[[1]], file = paste0("./a_folder/", names(l)[1], ".csv"))
write.csv(l[[1]], file = paste0(“./a_folder/”, names(l)[1], “.csv”))
The first line will save the file: "./a_folder/" and ".csv" use straight quotes, so R sees them as text.
In the second line, “./a_folder/” and “.csv” use curly (smart) quotes, which R does not recognize as text delimiters, so it produces an error: unexpected input in " write.csv(l[[1]], file = paste0(“
RStudio's syntax highlighting colors strings, which helps you spot this problem.
Thoughts about not using a for loop.
I think a better way to go (especially when you have a large dataset) is to use lapply or mapply. What these functions do is take each "chunk" of a list and apply a function to it.
lapply, however, loses the name of each chunk while processing it, which is annoying when you want to use that name to name the file on your computer. mapply() comes in handy for this situation.
Here is an example using the example data provided above.
# example data.frame
df <- data.frame(x = 1:4, y = c("a", "a", "b", "b"))
# split df
l <- split(df, df$y)
# save each "chunk" of l as a .csv file on a hard drive
# 1st, create a function that takes a "chunk" of your list and its name as inputs
save_fun <- function(l_i, name_i) {
  print(l_i) # print the output in console
  write.csv(l_i, file = paste0("./a_folder/", name_i, ".csv")) # save the file on your computer
}
# 2nd, use mapply (rather than lapply) to apply the previous function to each chunk/name pair
mapply(FUN = save_fun, l_i = l, name_i = names(l), SIMPLIFY = FALSE) # see ?mapply for how to use mapply()
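An equivalent sketch that sticks with lapply by iterating over the chunk names instead of the chunks themselves (same assumed l and ./a_folder/ as above):
# iterate over the names so each file can still be named after its chunk
invisible(lapply(names(l), function(name_i) {
  write.csv(l[[name_i]], file = paste0("./a_folder/", name_i, ".csv"), row.names = FALSE)
}))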

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names and use that one file to run read.capthist, then discretize, secr.fit, and derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and have used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
  capt <- lst[i]
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
  save(fit, file = "C:/temp/fit.Rdata")
  D.fit <- derived(fit)
  save(D.fit, file = "C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running this separately (this code works for non-simulation runs of a couple data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
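A minimal sketch of one way around this, assuming read.capthist() accepts a single capture-file path as its first argument (as the error message suggests) and that simtraps is already defined: loop over the file names directly rather than over the still-empty lst, and tag each saved object with its input file so iterations don't overwrite each other:
library(secr)

files <- list.files(pattern = "^female")  # list.files() uses a regex; "female*" would also match "femal"

for (i in seq_along(files)) {
  capt <- files[i]  # a single file name, so read.capthist gets one 'captfile'
  femsimCH <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX",
                  method = "BFGS", trace = FALSE, CL = TRUE)
  D.fit <- derived(fit)
  # include the input file name in the output paths so each run keeps its own results
  tag <- tools::file_path_sans_ext(basename(capt))
  save(fit, file = paste0("C:/temp/fit_", tag, ".Rdata"))
  save(D.fit, file = paste0("C:/temp/D.fit_", tag, ".Rdata"))
}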

R Programming: Difficulty removing NAs from frame when using lapply

Full disclosure: I am taking a Data Science course on Coursera. For this particular question, we need to calculate the mean of some pollutant data that is being read in from multiple files.
The main function I need help with also references a couple of other functions that I wrote in the same script. For brevity, I'm just going to list them and their purpose:
boundIDs: I use this to bound the input so that out-of-range inputs aren't accepted (the valid range is 1:332, so if someone enters 1:400 this trims the range to 1:332).
pollutantToCode: converts the pollutant string entered to that pollutant's column number in the data file.
fullFilePath: creates the file name and appends it to the full file path. So if someone asks for the file for ID 1 in directory "curse/your/sudden/but/inevitable/betrayal/", the function returns "curse/your/sudden/but/inevitable/betrayal/001.csv" to be added to the file list vector.
After all that, the main function I'm working with is:
pollutantmean <- function(directory = "", pollutant, id = 1:332){
  id <- boundIDs(id)
  pollutant <- pollutantToCode(pollutant)
  numberOfIds <- length(id)
  fileList <- character(numberOfIds)
  for (i in 1:numberOfIds){
    if (id[i] > 332){
      next
    }
    fileList[i] <- fullFilePath(directory, id[i])
  }
  data <- lapply(fileList, read.csv)
  print(data[[1]][[pollutant]])
}
Right now, I'm intentionally printing only the first frame of data to see what my output looks like. To remove the NAs I've tried using:
data <- lapply(fileList, read.csv)
data <- data[!is.na(data)]
But the NAs remained, so then I tried computing the mean directly and using the na.rm parameter:
print(mean(data[[1]][[pollutant]], na.rm = TRUE))
But the mean was still "NA". Then I tried na.omit:
data <- lapply(fileList, na.omit(read.csv))
...and unfortunately the problem persisted.
Can someone please help? :-/
(PS: Right now I'm just focusing on the first frame of whatever is read in, i.e. data[[1]], since I figure if I can't get it for the first frame there's no point in iterating over the rest.)
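For reference, is.na() applied to a list of data frames only tests whether each list element itself is NA, and na.omit(read.csv) calls na.omit on the read.csv function object instead of passing lapply a function. A minimal sketch of per-file NA handling, assuming fileList holds valid paths and pollutant is the numeric column index returned by the question's pollutantToCode helper:
# read each file; na.omit() here drops any row containing an NA in any column
data <- lapply(fileList, function(f) na.omit(read.csv(f)))

# alternatively, keep the NAs, pool the pollutant column across all files,
# and let na.rm = TRUE drop them only when averaging
values <- unlist(lapply(data, function(df) df[[pollutant]]))
mean(values, na.rm = TRUE)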
