Reading ZIP file of machine-written data won't "plot" in RStudio - r

Summary: Despite a complicated lead-up, the solution was very simple: to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data, inverting it from x obs. of y variables to y obs. of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a csv file. Then I ZIP several files and read them into R using read.csv. However, the data frame behaves strangely: "length" and "dim" disagree, and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
# Unzip the folders
unZIP <- function(folder){
  orig.directory <- getwd()
  setwd(folder)
  zipped.folders <- list.files(pattern = ".*zip")
  for (i in zipped.folders){
    unzip(i)
  }
  setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
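(For reference, a minimal sketch of the same step without changing the working directory, using unzip()'s exdir argument; unZIP2 is a hypothetical name:)

# A sketch that avoids setwd(): extract each archive in place via exdir.
unZIP2 <- function(folder) {
  zipped <- list.files(folder, pattern = "\\.zip$", full.names = TRUE)
  for (z in zipped) unzip(z, exdir = folder)
}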
# Load the data into a list of lists
library(stringr) # for str_remove(), used below
pullData <- function(folder){
  orig.directory <- getwd()
  setwd(folder)
  #zipped.folders <- list.files(pattern = ".*zip")
  #unzipped.folders <- list.files(folder)[!(list.files(folder) %in% zipped.folders)]
  unzipped.folders <- list.dirs(folder)[-1] # removing itself as the first directory
  oData <- vector(mode = "list", length = length(unzipped.folders))
  names(oData) <- str_remove(unzipped.folders, paste(folder, "/", sep = ""))
  for (i in unzipped.folders) {
    filenames <- list.files(i, pattern = "*.csv")
    #setwd(paste(folder, i, sep="/"))
    setwd(i)
    files <- lapply(filenames, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE") # note unusual encoding
    oData[[str_remove(i, paste(folder, "/", sep = ""))]] <- files
  }
  setwd(orig.directory)
  return(oData)
}
theData <- pullData(folder) #Load the data into a list of lists
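(Similarly, a hedged sketch of the loading step without setwd(), reading by full path instead; it assumes the same one-folder-per-ZIP layout, and pullData2 is a hypothetical name:)

# A sketch of pullData() without setwd(), using full file paths.
pullData2 <- function(folder) {
  dirs <- list.dirs(folder, recursive = FALSE)   # immediate subdirectories only
  oData <- lapply(dirs, function(d) {
    csvs <- list.files(d, pattern = "\\.csv$", full.names = TRUE)
    lapply(csvs, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE")
  })
  names(oData) <- basename(dirs)
  oData
}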
# Process the data into frames
bigFrame <- function(bigList) {
  # where bigList is theData, the result of pullData
  # initialize the holding list of frames per set
  preList <- vector(mode = "list", length = length(bigList))
  names(preList) <- names(bigList)
  # process the data
  for (i in 1:length(bigList)){
    step1 <- lapply(bigList[[i]], t) # transpose each data frame
    step2 <- do.call(rbind, step1)   # roll it up into its own matrix
    # Original error that wasn't reproduced: it showed length(step2) = 24048
    # when i = 1 and dim(step2) = 48 501. Any comments on why?
    firstRow <- step2[1,] # holding onto the first row to become the names
    step3 <- as.data.frame(step2) # turn it into a frame
    step4 <- step3[grepl("µA", rownames(step3)),] # get rid of all those excess name rows
    rownames(step4) <- 1:(nrow(step4)) # change the row names to row IDs
    colnames(step4) <- firstRow # change the column names to the first-row steps
    step4$ID <- rep(names(bigList[i]), nrow(step4)) # add an I.D. column
    step4$Class[grepl("pos", tolower(step4$ID))] <- "Yes" # add "Yes" class
    step4$Class[grepl("neg", tolower(step4$ID))] <- "No"  # add "No" class
    preList[[i]] <- step4
  }
  # bigFrame <- do.call(rbind, preList) # failed due to different numbers of measurements
  #                                     # (rows that become columns) across the data sets
  # return(bigFrame)
  return(preList) # Works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]], frameList[[2]])
# Odd behaviors
library(magrittr) # for %>%, used below
dim(monterey)    # 48 503
length(monterey) # 503; this is not reproducing my original error of length = 24048
rowOne <- monterey[1, 1:(ncol(monterey)-2)]
plot(rowOne) # Error in plot.new() : figure margins too large
# describe the data
quantile(rowOne, seq(0, 1, length.out = 11))
quantile(rowOne, seq(0, 1, length.out = 11)) %>% plot # produces an undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461, 501, replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) # Successful scatterplot. (With my non-random data, I want a line
                   # where the numbers in colnames are along the x-axis.)
quantile(doppelganger, seq(0, 1, length.out = 11)) # the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11)) %>% plot # a simple line of dots, as desired
# investigating structure
str(rowOne) # a data frame of 1 observation of 501 variables. This is a correct interpretation.
str(as.data.frame(doppelganger)) # 501 observations of 1 variable. Not a correct
                                 # interpretation, but it creates the plot that I want.
How do I convert rowOne to plot like doppelganger?
It looks like one of my errors is not reproducing here, the one where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function produces a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, I would like to classify the data with AdaBoost. My concern is that if "plot" behaves strangely, then the classifier won't work either.)
Any tips, suggestions, explanations, or advice would be greatly appreciated.
Edit: Investigating the structure ("str") of the two examples explains the difference between the plots. I guess my modified question is: how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?

I am answering my own question.
I am setting aside the part about the discrepancy between "length" and "dim", since I can't provide a reproducible example. However, I'm happy to leave it up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.
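For the broader goal of plotting each row as a line, one hedged option is matplot(), which draws one line per column of its y argument, so a single transpose covers every row at once. A minimal sketch, assuming (as above) that the last two columns of monterey are ID and Class and that the column names of the rest are the numeric x-axis steps:

# A sketch: plot every measurement row of `monterey` as a line.
measurements <- monterey[, 1:(ncol(monterey) - 2)] # drop the ID and Class columns
m <- apply(measurements, 2, as.numeric)            # coerce, in case rbind left character data
xvals <- as.numeric(colnames(measurements))        # recover x positions from the column names
matplot(xvals, t(m), type = "l", lty = 1,
        xlab = "step", ylab = "measurement")       # one line per row of data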

Related

Looping 'raster' package functions over each multi-polygon feature in R

I am trying to loop several functions from the 'raster' package, namely crop(), mask(), reclassify() and unstack()/as.list(). I have ten raster layers that share the same extent and data type; they correspond to land cover at 10 time points. I want to create an individual list variable for each output of the crop() -> mask() -> reclassify() -> as.list() chain. I was able to pipe the process for one polygon feature, but I need to be able to loop it over each of the 10 polygon features stored in the multipolygon shapefile, so that I can save each output list according to a specified naming convention.
Thank you and please advise. I share my code below.
EDIT: I am wondering whether a for-loop is the right way to go about this, or whether an lapply approach would be better.
# Load libraries
library(raster)   # for raster processing
library(rgdal)    # for raster/vector processing
library(sf)       # for shapefile processing
library(magrittr) # for %>%, used below
# Stack 10 rasters together
raster.stack = stack(
  raster("path/raster1.tif"),
  raster("path/raster2.tif"),
  raster("path/raster3.tif"),
  raster("path/raster4.tif"),
  raster("path/raster5.tif"),
  raster("path/raster6.tif"),
  raster("path/raster7.tif"),
  raster("path/raster8.tif"),
  raster("path/raster9.tif"),
  raster("path/raster10.tif")
)
# Prepare reclassification codes from 9-class raster to 3-class raster
reclasscodes = c(
  0, 0, # no data
  1, 1,
  2, 1,
  3, 1,
  4, 1,
  5, 2,
  6, 2,
  7, 3,
  8, 3,
  9, 3
)
# Convert reclass codes list into n x 2 matrix
reclassmatrix = matrix(reclasscodes, ncol=2, byrow = T)
# Load multipolygon vector Shapefile
multipolygon = shapefile("path/multipolygon.shp") # Shapefile is made of n polygons
# Example subset Shapefile to polygon_1 using attribute "ID"
polygon_1 = subset(multipolygon,ID=="D-4")
# Create output for polygon_1
list_polygon_1 =
  raster.stack %>%
  crop(y = polygon_1) %>%             # crop to bounds
  mask(mask = polygon_1) %>%          # mask to polygon cutline
  reclassify(rcl = reclassmatrix) %>% # reclassify to 3-class
  as.list() # same effect as unstack(): the raster brick is converted to a list of raster layers
# I use %>% because I do not want to save any of the intermediate outputs.
# The resulting output is a list variable for polygon_1 named 'list_polygon_1', which is exactly what I want.
# Worked perfectly.
# How do I repeat this process for polygon_1 through polygon n?
# My attempt
for (i in 1:nrow(multipolygon)) {
  raster.stack %>%
    crop(y = multipolygon[i,]) %>%
    mask(mask = multipolygon[i,]) %>%
    reclassify(rcl = reclassmatrix) %>%
    as.list() %>% # up to here these are the same steps as before for polygon_1
    # now I want to save each list output as a separate variable according to i,
    # e.g. list_polygon_2, list_polygon_3, etc.
    assign(paste(multipolygon$ID, i, sep = '_')) # assign a naming convention for each output variable
}
# Does not work. Even without the last line of code ("assign(paste(...))") there is
# no output variable from the as.list() line.
Here is a minimal self-contained reproducible example.
Example data
library(raster)
s <- stack(system.file("external/rlogo.grd", package="raster"))
xy1 <- xy2 <- xy3 <- matrix(c(10,17, 6,10,71,60,62,71), ncol=2)
xy2[,1] <- xy2[,1] + 30
xy3[,2] <- xy3[,2] - 30
p <- spPolygons(xy1, xy2, xy3)
#plot(s, 1)
#lines(p)
What you are after
rm = matrix(c(0,100,0, 100,150,2, 150,255,3), ncol=3, byrow=TRUE)
out <- list()
for (i in 1:length(p)) {
  x <- crop(s, p[i,])
  x <- mask(x, p[i,])
  out[[i]] <- reclassify(x, rm)
}
What you are saying about unstack does not make sense (and unlist does not work). I would advise against it, but you could do
out2 <- lapply(out, unstack)
I am not sure what you are really after. If you want the cell values, you can make it much simpler (no need for a loop) and do
r <- reclassify(s, rm)
e <- extract(r, p)
To your question about lapply vs a loop: in terms of performance, that rarely matters. lapply can be concise, but in cases like this writing a loop is better, as it is easier to read and write, especially if you do not use %>%.
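For completeness, a minimal sketch of the lapply equivalent, collecting the results in one named list rather than assign()ing numbered variables (the naming convention shown is hypothetical):

# A sketch of the lapply version of the same loop.
out <- lapply(seq_len(length(p)), function(i) {
  x <- crop(s, p[i,])
  x <- mask(x, p[i,])
  reclassify(x, rm)
})
names(out) <- paste0("polygon_", seq_along(out)) # hypothetical naming convention

A named list keeps all the outputs addressable (out[["polygon_1"]]) without cluttering the global environment.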

Creating a pipeline in R that serially processes multiple csv files

My pipeline reads a csv into a dataframe, assigns rownames, removes a column, performs a PCA, plots the PCA, and extracts the meaningful variables from the PCA, which are also plotted.
Here is my current code, which only goes as far as the first plot:
library(ggplot2)
library(ggrepel)
tsv  = read.csv('matrix.tsv', sep='\t')
bell = read.csv('bell.tsv', sep='\t')
tail = read.csv('tail.tsv', sep='\t')
dfList = list(tail, tsv, bell)
# process the data frames
dfList = lapply(dfList, function(dum){
  rownames(dum) = dum[,1]
  dum[,1] = NULL
  dum$X = NULL
  dum = dum[, -grep('un', colnames(dum))]
  dum # return the processed data frame
})
# create PCAs of the data frames
pcaList = lapply(dfList, function(pca){
  prin_comp = prcomp(pca, scale. = T)
})
# plot the top 2 principal components of each PCA
plotList = lapply(pcaList, function(prin_comp){
  t = qplot(x=prin_comp$rotation[,1], y=prin_comp$rotation[,2]) +
    geom_text_repel(aes(label=row.names(prin_comp$rotation)))
})
# this plots the 3 plots, one for each PCA, but they are un-named
plotList
The problem is that the plots don't have meaningful names/titles. I don't know how to keep that information present as it is passed from function to function.
I know there must be a more elegant way of doing this, and I have spent a day reading similar and not-so-similar questions about processing multiple csv files, but either they weren't applicable or didn't work for my case.
And as the title of this question implies, I would prefer to do this one csv at a time, not all 3 at once, as the csv files in question are very large (over 5GB each), so keeping every dataframe and PCA in memory at the same time is impossible.
You just need to keep the string you want to use as the title somewhere and add ggtitle(YOUR_TITLE) to your plot, but that is not so easy with your current code. Instead of performing each step of the analysis for every csv before going on to the next step, why don't you just perform all steps for one csv at a time?
Your code could look like:
library(ggplot2)
library(ggrepel)
csvs <- c("matrix.tsv", "bell.tsv", "tail.tsv")
for (i in csvs) {
  # read file
  df <- read.csv(i, sep='\t')
  # process file
  rownames(df) <- df[,1]
  df[,1] <- NULL
  df$X <- NULL
  df <- df[, -grep('un', colnames(df))]
  # create pca
  pca <- prcomp(df, scale. = T)
  # plot pca
  pcaPlot <- qplot(x=pca$rotation[,1], y=pca$rotation[,2]) +
    geom_text_repel(aes(label=row.names(pca$rotation))) +
    ggtitle(i)
  print(pcaPlot)
  # extract and plot meaningful variables
  # ...
}
Basically, I took everything you were doing in lapply calls and put it inside a for loop; this approach also does the processing for one csv at a time.
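If you would rather keep the lapply style, a minimal sketch is to iterate over the file names instead of pre-read data frames, so the title travels with each file (same processing steps as above):

# A sketch: lapply over file names so each plot can be titled.
library(ggplot2)
library(ggrepel)
csvs <- c("matrix.tsv", "bell.tsv", "tail.tsv")
plotList <- lapply(csvs, function(f) {
  df <- read.csv(f, sep = '\t')
  rownames(df) <- df[, 1]  # same processing as in the loop above
  df[, 1] <- NULL
  df$X <- NULL
  df <- df[, -grep('un', colnames(df))]
  pca <- prcomp(df, scale. = TRUE)
  qplot(x = pca$rotation[, 1], y = pca$rotation[, 2]) +
    geom_text_repel(aes(label = row.names(pca$rotation))) +
    ggtitle(f)             # the file name becomes the plot title
})

This keeps the lapply idiom while still making the source file name available to ggtitle().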

R: For Loop Copying Error

I'm trying to obtain GPS coordinate information for each species in a given data frame of species names, using a package-specific function (red::records) which pulls coordinate information from a database containing information about species distributions.
My for-loop is constructed below, where iterations is nrow(names) and the function records returns lat/long coordinates:
for(i in 1:iterations){
  gbif[i,1] <- names[i,] ## grab names
  try(temp1 <- records(names[i,]))
  try(temp1$scientificName <- names[i,])
  try(temp2 <- merge(gbif, temp1, by.x="V1", by.y="scientificName"))
  datalist[[i]] <- temp2
}
After executing this loop, I am able to obtain data for species; however, the data are not appropriately merged with the name list. For example, calling records("Agyneta flibuscrocus") correctly returns 5 unique lat/long coordinates, while calling records("Agyneta mongolica") produces an error with 0 records found (this holds for each species when checked online).
After this loop, I bind all of the obtained records into a single data frame using:
dat = do.call(rbind, datalist) ## merge all occurrence data from GBIF into one data frame
dat <- unique(dat)
When I go to verify this data frame, I get the following sample data:
Agyneta flibuscrocus -115.58400 49.72
Agyneta flibuscrocus -117.58400 51.299
...
Agyneta mongolica -115.58400 49.72
Agyneta mongolica -117.58400 51.299
These erroneous replications are also repeated throughout the rest of the 200 names. As a side note, I wrapped everything in try statements because the code will not execute if it runs into a record that produces 0 results from the database.
I feel like I am overlooking something very obvious here.
Reproducible Data & Code:
install.packages("red")
library(red)
names = data.frame("Acantheis variatus", "Agyneta flibuscrocus", "Agyneta
mongolica", "Alpaida alticeps", "Alpaide venilliae", "Amaurobius
transversus", "Apochinomma nitidum")
iterations = nrow(names)
datalist = list()
temp1 <- data.frame() ## temporary data frame for joining occurrence data
from GBIF
for(i in 1:iterations){
gbif <- names[i,] ## grab name
try(temp1 <- records(gbif))
try(temp1$V1 <- gbif)
datalist[[i]] <- temp1
}
dat = do.call(rbind, datalist)
I adapted some parts of your script and now it seems to work properly. (With your example data the function only successfully retrieves data for one species, the one that got replicated in your code, but that's not a coding issue.)
The main reason for the erroneous duplications was the variable temp1 being reused: try(temp1 <- records(gbif)) failed, but try(temp1$V1 <- gbif) did not, since temp1 still held the previous species' records and gbif had already been updated. Make sure that variables defined in one iteration of a loop don't get carried over to the next iteration.
myNames <- names # renamed to avoid masking base::names
iterations = nrow(myNames)
datalist = list()
for(i in 1:iterations){
  gbif <- myNames[i,] ## grab name
  try_result <- try(records(gbif))
  if(!inherits(try_result, "try-error")){ # the idiomatic check for a try() failure
    temp1 <- try_result
    temp1$V1 <- gbif
    datalist[[i]] <- temp1
    rm(temp1)
  }else{
    datalist[[i]] <- NA
  }
  rm(try_result)
}
dat <- do.call(rbind, datalist[!is.na(datalist)])
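As a hedged alternative under the same assumptions, tryCatch() makes the control flow explicit and avoids inspecting the class of a try() result:

# A sketch with tryCatch(): failed lookups return NULL, and do.call(rbind, ...)
# silently drops NULL list elements.
datalist <- lapply(seq_len(nrow(myNames)), function(i) {
  gbif <- myNames[i,]
  tryCatch({
    temp1 <- records(gbif)
    temp1$V1 <- gbif
    temp1
  }, error = function(e) NULL)
})
dat <- do.call(rbind, datalist)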

write a for loop to automatically create subsets of datasets in r

Please help me, as I am new to R and also to programming.
I am trying to write a loop that reads the data in 1000-row chunks and creates a data set in R for each chunk read.
Here is my attempt:
for(i in 0:nl){
  df[i] = fread('RM.csv', skip = 1000*i, nrows = 1000,
                col.names = colnames(read.csv('RM.csv', nrow = 1, header = T)))
}
where nl is an integer equal to the length of the data in 'RM.csv'.
What I am trying to do is create a function that skips 1000 rows, reads the next 1000 rows, and terminates once it reaches nl, the length of the original data.
It is not mandatory to use only this approach.
You can try reading the entire file into a single data frame and then subsetting off the rows you don't want:
df <- read.csv('RM.csv', header = TRUE)
y <- seq_len(nrow(df))                         # one index per row, so the sequence always matches the file
seq.keep <- y[floor((y - 1) / 1000) %% 2 == 0] # keep alternating 1000-row blocks (1-indexed)
df.keep <- df[seq.keep, ]
You can inspect that the sequence generated is:
rows 1-1000
rows 2001-3000
rows 4001-5000
etc.
Because the sequence is built from nrow(df), it is always exactly as large as the actual data frame.
If you need to continue with your current approach, then try reading in only every other block of 1000 lines, e.g.
library(data.table) # for fread
sq <- seq(from = 0, to = nl, by = 2)
cols <- colnames(read.csv('RM.csv', nrow = 1, header = TRUE)) # renamed from `names` to avoid masking base::names
for (i in sq) {
  df_i <- fread('RM.csv', skip = 1000*i, nrows = 1000, col.names = cols)
  # process this chunk and move on
}

R ncdf package - put.var.ncdf requiring incorrect number of dimensions

I am organizing weather data into netCDF files in R. Everything goes fine until I try to populate the netCDF variables with data, because the package asks me to specify only one dimension for two-dimensional variables.
library(ncdf)
These are the dimension tags for the variables. Each variable uses the Threshold dimension and one of the other two dimensions.
th <- dim.def.ncdf("Threshold", "level", c(5,6,7,8,9,10,50,75,100))
rt <- dim.def.ncdf("RainMinimum", "cm", c(5, 10, 25))
wt <- dim.def.ncdf("WindMinimum", "m/s", c(18, 30, 50))
The variables are created in a loop, and there are a lot of them, so for the sake of easy understanding, in my example I'll only populate the list of variables with one variable.
vars <- list()
v1 <- var.def.ncdf("ARMM_rain", "percent", list(th, rt), -1, prec="double")
vars[[length(vars)+1]] <- v1
ncdata <- create.ncdf("composite.nc", vars)
I use another loop to extract data from different data files into a 9x3 data frame named subframe, while iterating through the variables of the netCDF file with varindex. For the sake of reproducibility, I'll give a quick initialization of these values.
varindex <- 1
subframe <- data.frame(matrix(nrow=9, ncol=3, rep(.01, 27)))
The desired outcome from there is to populate each ncdf variable with the contents of subframe. The code to do so is:
for(x in 1:9) {
  for(y in 1:3) {
    value <- ifelse(is.na(subframe[x,y]), -1, subframe[x,y])
    put.var.ncdf(ncdata, varindex, value, start=c(x,y), count=1)
  }
}
The error message is:
Error in put.var.ncdf(ncdata, varindex, value, start = c(x, y), count = 1) :
  'start' should specify 1 dims but actually specifies 2
tl;dr: I have defined two-dimensional variables using ncdf in R and am trying to write data to them, but I get an error message because R believes they are one-dimensional variables instead.
Does anyone know how to fix this error?
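No confirmed fix is recorded here, but the error text suggests that put.var.ncdf is resolving varindex to a one-dimensional variable; in the old ncdf package, dimension variables live in the same file, so a bare numeric id may not point at the variable you defined. A hedged sketch worth trying, addressing the variable by object and giving start/count one entry per dimension:

# A sketch, not a confirmed fix: pass the variable object, not a numeric
# index, and make count match the number of dimensions.
for (x in 1:9) {
  for (y in 1:3) {
    value <- ifelse(is.na(subframe[x,y]), -1, subframe[x,y])
    put.var.ncdf(ncdata, vars[[varindex]], value, start = c(x, y), count = c(1, 1))
  }
}
# Or write the whole 9 x 3 block in a single call:
# put.var.ncdf(ncdata, vars[[varindex]], as.matrix(subframe), start = c(1, 1), count = c(9, 3))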
