I'm trying to obtain GPS coordinate information for each species in a given data frame of species names using a package-specific function (Red::records) which pulls coordinate information from a database containing information about species distributions.
My For-loop is constructed below, where iterations is the nrow(names) and the function records returns lat/long coordinates:
for(i in 1:iterations){
gbif[i,1] <- names[i,] ## grab names
try(temp1 <- records(names[i,]))
try(temp1$scientificName <- names[i,])
try(temp2 <- merge(gbif, temp1, by.x="V1", by.y="scientificName"))
datalist[[i]] <- temp2
}
After executing this loop, I am able to obtain data for species; however, it is not appropriately merged with the namelist. For example, calling records("Agyneta flibuscrocus") correctly returns 5 unique lat/long coordinates while calling records("Agyneta mongolica") produces an error with 0 records found (this is valid for each species when checked online).
After this loop, I bind all of the obtained records into a single data frame using:
dat = do.call(rbind, datalist) ## merge all occurrence data from GBIF into
one data frame
dat <- unique(dat)
When I go to verify this data frame, I get the following sample data:
Agyneta flibuscrocus -115.58400 49.72
Agyneta flibuscrocus -117.58400 51.299
...
Agyneta mongolica -115.58400 49.72
Agyneta mongolica -117.58400 51.299
These erroneous replications are also repeated throughout the rest of the 200 names. As a side note, I wrapped everything in try statements because the code will not execute if it runs into a record that produces 0 results from the database.
I feel like I am overlooking something very obvious here?
Reproducible Data & Code:
install.packages("red")
library(red)
names = data.frame("Acantheis variatus", "Agyneta flibuscrocus", "Agyneta
mongolica", "Alpaida alticeps", "Alpaide venilliae", "Amaurobius
transversus", "Apochinomma nitidum")
iterations = nrow(names)
datalist = list()
temp1 <- data.frame() ## temporary data frame for joining occurrence data
from GBIF
for(i in 1:iterations){
gbif <- names[i,] ## grab name
try(temp1 <- records(gbif))
try(temp1$V1 <- gbif)
datalist[[i]] <- temp1
}
dat = do.call(rbind, datalist)
I adapted some parts of your script and now it seems to work properly (with your example data the function only successfully retrieves data for one species, the one that got replicated in your code, but that's not a coding issue).
The main reason for the erroneous duplications was the variable temp1 being reused. try(temp1 <- records(gbif)) failed but try(temp1$V1 <- gbif) did not, since both temp1 and gbif were (erroneously) defined. Make sure that variables defined in an iteration of a loop don't get carried over to the next iteration.
iterations = nrow(myNames)
datalist = list()
for(i in 1:iterations){
gbif <- myNames[i,] ## grab name
try_result <- try(records(gbif))
if(class(try_result) != "try-error"){
temp1 <- try_result
temp1$V1 <- gbif
datalist[[i]] <- temp1
rm(temp1)
}else{
datalist[[i]] <- NA
}
rm(try_result)
}
dat <- do.call(rbind, datalist[!is.na(datalist)])
Related
Summary: Despite a complicated lead-up, the solution was very simple: In order to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data in order to invert from x obs of y variables to y obs of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a csv file. Then I ZIP several files and read to R using read.csv. However, the data frame behaves strangely. Commands "length" and "dim" disagree and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
# Unzip the folders
unZIP <- function(folder){
orig.directory <- getwd()
setwd(folder)
zipped.folders <- list.files(pattern = ".*zip")
for (i in zipped.folders){
unzip(i)}
setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
# Load the data into a list of lists
pullData <- function(folder){
orig.directory <- getwd()
setwd(folder)
#zipped.folders <- list.files(pattern = ".*zip")
#unzipped.folders <- list.files(folder)[!(list.files(folder) %in% zipped.folders)]
unzipped.folders <- list.dirs(folder)[-1] # Removing itself as the first directory.
oData <- vector(mode = "list", length = length(unzipped.folders))
names(oData) <- str_remove(unzipped.folders, paste(folder,"/",sep=""))
for (i in unzipped.folders) {
filenames <- list.files(i, pattern = "*.csv")
#setwd(paste(folder, i, sep="/"))
setwd(i)
files <- lapply(filenames, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE") #Note unusual encoding
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- vector(mode="list", length = length(files))
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- files
}
setwd(orig.directory)
return(oData)
}
theData <- pullData(folder) #Load the data into a list of lists
# Process the data into frames
bigFrame <- function(bigList) {
#where bigList is theData is the result of pullData
#initialize the holding list of frames per set
preList <- vector(mode="list", length = length(bigList))
names(preList) <- names(bigList)
# process the data
for (i in 1:length(bigList)){
step1 <- lapply(bigList[[i]], t) # transpose each data
step2 <- do.call(rbind, step1) # roll it up into it's own matrix #original error that wasn't reproduced: It showed length(step2) = 24048 when i = 1 and dim(step2) = 48 501. Any comments on why?
firstRow <- step2[1,] #holding onto the first row to become the names
step3 <- as.data.frame(step2) # turn it into a frame
step4 <- step3[grepl("µA", rownames(step3)),] # Get rid of all those excess name rows
rownames(step4) <- 1:(nrow(step4)) # change the row names to rowID's
colnames(step4) <- firstRow # change the column names to the first row steps
step4$ID <- rep(names(bigList[i]),nrow(step4)) # Add an I.D. column
step4$Class[grepl("pos",tolower(step4$ID))] <- "Yes" # Add "Yes" class
step4$Class[grepl("neg",tolower(step4$ID))] <- "No" # Add "No" class
preList[[i]] <- step4
}
# bigFrame <- do.call(rbind, preList) #Failed due to different number of measurements (rows that become columns) across all the data sets
# return(bigFrame)
return(preList) # Works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]],frameList[[2]])
# Odd behaviors
dim(monterey) #48 503
length(monterey) #503 #This is not reproducing my original error of length = 24048
rowOne <- monterey[1,1:(ncol(monterey)-2)]
plot(rowOne) #Error in plot.new() : figure margins too large
#describe the data
quantile(rowOne, seq(0, 1, length.out = 11) )
quantile(rowOne, seq(0, 1, length.out = 11) ) %>% plot #produces undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461,501,replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) #Successful scatterplot. (With my non-random data, I want a line where the numbers in colnames are along the x-axis)
quantile(doppelganger, seq(0, 1, length.out = 11) ) #the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11) ) %>% plot # a simple line of dots as desired
# investigating structure
str(rowOne) # results in a dataframe of 1 observation of 501 variables. This is a correct interpretation.
str(as.data.frame(doppelganger)) # results in 501 observations of 1 variable. This is not a correct interpretation but creates the plot that I want.
How do I convert the rowOne to plot like doppelganger?
It looks like one of my errors is not reproducing, where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function is producing a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, is I would like to classify the data with adaboost. My concern is that if "plot" behaves strangely then the classifier won't work.)
Any tips or suggestions or explanations or advice would be greatly appreciated.
Edit: Investigating the structure with ("str") of the two examples explains the difference between plots. I guess my modified question is, how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?
I am answering my own question.
I am leaving behind the part about the discrepancy between "length" and "dim" since I can't provide a reproducible example. However, I'm happy to leave up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.
I am currently working on an imputation project where I need to evaluate my methods of imputation. I have my incomplete dataframe with NAs from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the frame containing the complete cases. the data frame with the generated NAs get stored in the object "result" as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put my loop which generates the NAs in another loop which contains the replicate() command and counts from 1:100 and saves these 100 replicated data frames but it didn't work at all.
result = data.frame(res0=rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
dat = comp_cas[,i]
missRate = Z32_miss_item$miss_per_item[i]
cat (i, " ", paste0(dat, collapse=",") ," ", missRate, "!\n")
df <- data.frame("res"= GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
colnames(df) = gsub("res", paste0("Var", i), colnames(df))
result = cbind(result, df)
}
result = result[,-1]
I expect that every data frame of the 100 runs get saved in a separate .rda file in my project folder.
also, is imputation and the evaluation of fitness of the latter beginner stuff in r or at what level of proficiency am I if you take a look at the code that I posted?
It is difficult to guess what exactly you are doing without some dummy data. But it is fine to have loops within loops and to save data.frames. Firstly, I would avoid the replicate function here as it has a strange syntax and just stick with plain loops. Secondly, you must make sure that the loops have different indexes (i.e. for(i ... should be surrounded by, say, for(j ... since functions can loop outside their scope in R. Finally, use saveRDS rather than save, as you can then have each object (data.frame) saved in separate .rds files. The save function is designed for saving your whole workspace so that you can pick up where you left off.
fun <- function(i){
df <- data.frame(x=rnorm(5))
names(df) <- paste0("x",i)
df
}
for(j in 1:100){
res <- data.frame(id=1:5)
for(i in 1:10){
res <- cbind(res, fun(i))
}
saveRDS(res, sprintf("replication_%s.rds",j))
}
I know there are a lot of posts on how to save data out of loops to data frames, but i've been having some trouble making it work for me. Currently i am only able to get my data using print, but would like for it to instead be put into a data frame. I can't predict how many lines of data or responses per line (although I just need a single true/false) it will give.
Suggestions on how to get the P loop to output data to a dataframe?
max <- max(x$a)
for (n in 1:max) {
print(n)
#right now i'm just printing the iteration and data to console
result <- x[x$a==n,"b"]
test <- unique(as.numeric(unlist(result)))
#Below is the loop i'd like to save the data from
for (P in test)
print({
ar <- x[x$b==P & x$a!=n,"a"]
ar1 <- sapply(unique(as.numeric(unlist(ar))),
function(f)
x[x$a==f & x$b!=P,"b"])
af <- sapply(ar1, function(f) any(match(f,result)))
})
}
Thanks!
Initiate an empty data frame:
results <- data.frame(it=numeric(), P=numeric(), value=logical())
And then instead of printing, just add this inside your loop:
results[nrow(results)+1,] <- list( [your 3 values separated by ","] )
I won't pretend that this code is even remotely optimal, but here is the problem I have. I have a list of files with multiple columns read in with sapply(), such that if I call file.list[[1]] I get a summary of that data.frame, and summary(file.list) is a list of files.
I am fitting curves to the data using the mgcv package as follows:
gam_data <- function(curves)
{
out <- gam(curves[, 15] ~ s(curves[, 23]))
pd <- plot(out)
return(pd)
}
out <- lapply(file.list, gam_data)
split_curves <- function(splitting)
{
pd_2 <- c(splitting[[1]]$fit)
pd_3 <- c(splitting[[1]]$x)
pd_4 <- c(splitting[[1]]$se)
curveg <- cbind(pd_2, pd_3, pd_4)
colnames(curveg) <- c("fitted", "sphro", "se")
return(curveg)
}
out2 <- lapply(out, split_curves)
Where the first block is performing gam and the second is extracting the fit of the curve. However, after all of that the original information in file.list such as replicate, genotype, etc. is lost, and the data.frames are not the same length anymore. This is probably a trivial question, but how does one retain that information through processing? I'm applying this to hundreds of data frames so I cannot just manually recreate the columns.
I am experimenting with different regression models. My end goal is to have a nice easy to read dataframe with 3 columns:
model_results <- data.frame(name = character(),
rmse = numeric(),
r2 = numeric())
Then after running each model, add the corrosponding output to the dataframe and then, at the end, review and make some decisions on which model to use.
I tried this:
mod.spend_transactions.results <- list("mod.spend_transactions",
rsme(residuals(mod.spend_transactions)),
summary(mod.spend_transactions)$r.squared)
I tried using a list because I know vectors can only store one datatype (right?).
Output:
rbind(model_results, mod.spend_transactions.results)
X.mod.spend_transactions. X12.6029444519635 X0.912505643567096
1 mod.spend_transactions 12.60294 0.9125056
Close but not what I wnated since the df names have been changed and I did not expect that.
So I tried vectors, which works but seems "clunky" in that I'm sure I could do this with writing less code:
vect_modname <- vector()
vect_rsme <- vector()
vect_r2 <- vector()
Then after running a model
vect_modname <- c(vect_modname, "mod.spend_transactions")
vect_rsme <- c(vect_rsme, rsme(residuals(mod.spend_transactions)))
vect_r2 <- c(vect_r2, summary(mod.spend_transactions)$r.squared)
Then at the end of running all the models I'm testing out
data.frame(vect_modname, vect_rsme, vect_r2)
Again, the vector method does work. But is there a "better", more elegant way of doing this?