Select columns when reading in files with st_read - r

I am trying to read 39 json files into a common sf dataset in R.
Here is the method I've been trying:
path <- "~/directory"
file.names <- as.list(dir(path, pattern='.json', full.names=T))
geodata <- do.call(rbind, lapply(file.names, st_read))
The problem is in the last line: rbind fails because the files have different numbers of columns. However, they all share the three columns I care about: MOVEMENT_ID, DISPLAY_NAME and geometry. How can I select only these three columns when running st_read?
I've tried running geodata <- do.call(rbind, lapply(file.names, st_read, select=c('MOVEMENT_ID', 'DISPLAY_NAME', 'geometry'))) but, in this case, st_read does not seem to recognise the geometry column (error: 'no simple features geometry column present').
I've also tried to use fread in place of st_read but this doesn't work as fread is not adapted to spatial data.

Run lapply over a function that calls st_read and then does what you need, something like:
read_my_json = function(f){
  s = st_read(f)
  return(s[, c("MOVEMENT_ID", "DISPLAY_NAME")])
}
(I'm pretty sure you don't have to select the geometry as well; you get that for free when subsetting the columns of an sf spatial object.)
then do.call(rbind, lapply(file.names, read_my_json)) should work.
No extra packages need to be included, and this has the big advantage that you can test the function on a single item before throwing a thousand at it.
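For example, a quick sanity check on one file before binding all 39 (a small sketch, assuming the file.names list from the question):
# Inspect the helper's output on a single file first
one <- read_my_json(file.names[[1]])
str(one)
# Then bind everything
geodata <- do.call(rbind, lapply(file.names, read_my_json))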

Related

Read in a list of shapefiles and row bind them in R (preferably using tidy syntax and sf)

I have a directory with a bunch of shapefiles for 50 cities (and will accumulate more). They are divided into three groups: cities' political boundaries (CityA_CD.shp, CityB_CD.shp, etc.), neighborhoods (CityA_Neighborhoods.shp, CityB_Neighborhoods.shp, etc.), and Census blocks (CityA_blocks.shp, CityB_blocks.shp, etc.). They use common file-naming syntaxes, have the same set of attribute variables, and are all in the same CRS. (I transformed all of them as such using QGIS.) I need to read each group of files (political boundaries, neighborhoods, blocks) into a list of sf objects and then bind the rows to create one large sf object for each group. However, I am running into consistent problems developing this workflow in R.
library(tidyverse)
library(sf)
library(mapedit)
# This first line succeeds in creating a character string of the files that match the regex pattern.
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# This second line creates a list object from the files.
shapefile_list <- lapply(filenames, st_read)
# This third line (adopted from https://github.com/r-spatial/sf/issues/798) fails as follows.
districts <- mapedit:::combine_list_of_sf(shapefile_list)
Error: Column `District_I` can't be converted from character to numeric
# This fourth line fails in an apparently different way (also adopted from https://github.com/r-spatial/sf/issues/798).
districts <- do.call(what = sf:::rbind.sf, args = shapefile_list)
Error in CPL_get_z_range(obj, 2) : z error - expecting three columns;
The first error appears to indicate that one of my shapefiles has an incorrect variable class for the common variable District_I, but R provides no information to clue me in to which file is causing the error.
The second error seems to be looking for a z coordinate but is only finding x and y in the geometry attribute.
I have four questions on this front:
How can I have R identify which list item it is attempting to read and bind when an error halts the process?
How can I force R to ignore the incompatibility issue and coerce the variable class to character so that I can deal with the variable inconsistency (if that's what it is) in R?
How can I drop a variable that is causing an error entirely from the sf objects as they are read (i.e. omit District_I from all the read_sf calls in the process)?
More generally, what is going on and how can I solve the second error?
Thanks all as always for your help.
P.S.: I know this post isn't "reproducible" in the desired way, but I'm not sure how to make it so besides copying the contents of all my shapefiles. If I'm mistaken on this point, I'd gladly accept any wisdom on this front.
UPDATE:
I've run
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
shapefile_list <- lapply(filenames, st_read)
districts <- mapedit:::combine_list_of_sf(shapefile_list)
successfully on a subset of three of the shapefiles. So I've confirmed that there is a class conflict in the District_I column of one of the files causing the hold-up when running the code on the full batch. But again, I need the error to identify the file name causing the issue so I can fix it in the file, OR I need the code to coerce District_I to character in all files (which is the class I want that variable in anyway).
A note, particularly regarding Pablo's recommendation:
districts <- do.call(what = dplyr::rbind_all, shapefile_list)
results in an error
Error in (function (x, id = NULL) : unused argument
followed by a long string of digits and coordinates. So,
mapedit:::combine_list_of_sf(shapefile_list)
is definitely the mechanism to read from the list and merge the files, but I still need a way to diagnose the source of the column incompatibility error across shapefiles.
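One way to do that diagnosis (a minimal sketch, assuming filenames and shapefile_list from above are still in scope): since each sf object is a data frame, you can report the class of District_I per file and spot the odd one out by name.
# Report the class of District_I in each shapefile
classes <- sapply(shapefile_list, function(x) class(x$District_I)[1])
data.frame(file = basename(filenames), District_I_class = classes)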
So after much fretting and some great guidance from Pablo (and his link to https://community.rstudio.com/t/simplest-way-to-modify-the-same-column-in-multiple-dataframes-in-a-list/13076), the following works:
library(tidyverse)
library(sf)
# Reads in all shapefiles from Directory that include the string "_CDs".
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# Applies the function st_read from the sf package to each file saved as a character string to transform the file list to a list object.
shapefile_list <- lapply(filenames, st_read)
# Creates a function that transforms a problem variable to class character for all shapefile reads.
my_func <- function(data, my_col){
  my_col <- enexpr(my_col)
  data %>%
    mutate(!!my_col := as.character(!!my_col))
}
# Applies the new function to our list of shapefiles and specifies "District_I" as our problem variable.
districts <- map_dfr(shapefile_list, ~my_func(.x, District_I))
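The same coercion also works without tidy evaluation, if you prefer base R (a sketch, assuming District_I exists in every file):
# Coerce District_I to character in each list element, then bind rows
shapefile_list <- lapply(shapefile_list, function(x) {
  x$District_I <- as.character(x$District_I)
  x
})
districts <- do.call(rbind, shapefile_list)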

How to convert list of strings to list of objects or list of dataframes in R

I have written a program in R that takes all of the .csv files in a folder and imports them as data frames with the naming convention "main1," "main2," "main3" and so on for each data frame. The number of files in the folder may vary, so I was hoping the convention would make it easier to join the files later by pasting the names together from the file count. I successfully coded a way to find the folder and identify all of the files, as well as the total number of files.
agencyloc <- dirname(file.choose())
setwd(agencyloc)
listagencyfiles <- list.files(pattern = "*.csv")
numagencies <- 1:length(listagencyfiles)
I then created the individual dataframes without issue. I am not including this code because it is long and does not relate to my problem. The problem is that when I try to rbind these dataframes into one large dataframe, it says "Input to rbindlist must be a list of data.tables." Since there will be varying numbers of files, I can't just hard-code this; it has to be something similar to the following. I tried the code below, but it creates a list of strings and not a list of objects:
allfiles <- paste0("main", 1:length(numagencies))
However, this outputs a list of strings that can't be used to bind the files. Is there a way to change the data type from character strings to objects so that this will work when executed:
finaltable <- rbindlist(allfiles)
What I am looking for would almost be the opposite of as.character(objectname), if that makes any sense. I need to go from character to object instead of object to character.
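That "opposite" exists in base R as mget(), which looks up objects by name and returns them in a list (a sketch, assuming main1, main2, ... exist in the calling environment and numagencies is defined as above):
library(data.table)
# mget() retrieves the objects named "main1", "main2", ... as a list
allfiles <- mget(paste0("main", numagencies))
finaltable <- rbindlist(allfiles)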

Reading specific column of multiple files in R

I have used the following code to read multiple .csv files in R:
Assembly<-t(read.table("E:\\test\\exp1.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Assembly","f"))[1:4416,"Assembly",drop=FALSE])
Top1<-t(read.table("E:\\test\\exp2.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top1","f"))[1:4416,"Top1",drop=FALSE])
Top3<-t(read.table("E:\\test\\exp3.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top3","f"))[1:4416,"Top3",drop=FALSE])
Top11<-t(read.table("E:\\test\\exp4.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top11","f"))[1:4416,"Top11",drop=FALSE])
Assembly1<-t(read.table("E:\\test\\exp5.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Assembly1","f"))[1:4416,"Assembly1",drop=FALSE])
Area<-t(read.table("E:\\test\\exp6.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Area","f"))[1:4416,"Area",drop=FALSE])
data<-rbind(Assembly,Top1,Top3,Top11,Assembly1,Area)
So the entire data is in the folder "test" in E drive. Is there a simpler way in R to read multiple .csv data with a couple of lines of code or some sort of function call to substitute what has been made above?
(Untested code; no working example available.) Try: use the list.files function to generate the correct names, then pass colClasses as an argument to read.csv to throw away the first 4 columns (and since that vector is recycled, you will also throw away the 6th column):
lapply(list.files("E:\\test\\", pattern="^exp[1-6]", full.names=TRUE),
       read.csv, sep="|", header=FALSE,   # the files are pipe-delimited with no header
       colClasses=c(rep("NULL", 4), "numeric"), nrows=4416)
If you want this to be returned as a dataframe, then wrap data.frame around it.
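For instance (a sketch; the column names follow the exp1 through exp6 assignments in the question, and list.files sorts exp1.csv through exp6.csv in that order):
cols <- lapply(list.files("E:\\test\\", pattern="^exp[1-6]", full.names=TRUE),
               read.csv, sep="|", header=FALSE,
               colClasses=c(rep("NULL", 4), "numeric"), nrows=4416)
data <- data.frame(cols)
names(data) <- c("Assembly", "Top1", "Top3", "Top11", "Assembly1", "Area")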

Using a for loop to write in multiple .grd files

I am working with very large data layers for an SDM class, and because of this I ended up breaking some of my layers into a bunch of blocks to avoid memory constraints. These blocks were written out as .grd files, and now I need to read them back into R and merge them together. I am extremely new to R and programming in general, so any help would be appreciated. What I have been trying so far looks like this:
merge.coarse <- raster("coarseBlock1.grd")
for (i in 2:nBlocks) {
  merge.coarse <- merge(merge.coarse, raster(paste("coarseBlock", i, ".grd", sep="")))
}
where my files are named coarseBlock1.grd, coarseBlock2.grd, and so on, sequentially numbered from 1 to nBlocks (259).
Any feed back would be greatly appreciated.
Using for loops is generally slow in R. Also, using functions like merge and rbind in a for loop eats up a lot of memory because of the way R passes values to these functions.
A more efficient way to do this task is to call lapply (see this tutorial on apply functions for details) to load the files into R. This will result in a list, which can then be collapsed using raster's merge function (plain rbind does not work on Raster objects):
rasters <- lapply(list.files(GRDFolder, pattern="\\.grd$", full.names=TRUE), FUN = raster)
merge.coarse <- do.call(merge, rasters)
I'm not too familiar with .grd files, but this overall process should at least get you going in the right direction. Assuming all your .grd files (1 through 259) are stored in the same folder (which I will refer to as GRDFolder), then you can try this:
merge.coarse <- raster("coarseBlock1.grd")
for (filename in list.files(GRDFolder, pattern="\\.grd$", full.names=TRUE))
{
  temp <- raster(filename)
  merge.coarse <- merge(merge.coarse, temp)
}

How can I add blank top rows to an xlsx file using the xlsx package?

I would like to add two blank rows to the top of an xlsx file that I create.
So far I have tried:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
write.xlsx(matrix(rnorm(25),5),fn1)
wb <- loadWorkbook(fn1)
rows <- getRows(getSheets(wb)[[1]])
for (i in 1:length(rows))
  rows[[i]]$setRowNum(as.integer(i + 1))
saveWorkbook(wb,fn2)
But test2.xlsx is empty!
I'm a bit confused by what you're trying to do with the for loop, but:
You could read the data back in as a data frame, create a dummy object with the same number of columns, then use rbind() to join the dummy and the data to create fn2:
fn1 <- 'test1.xlsx'
df <- read.xlsx(fn1, 1)   # loadWorkbook() returns a java reference, not a data frame
dummy <- df[c(1, 2), ]
dummy[] <- NA   # set all values of dummy to whatever you want, e.g. NA or 0
fn2 <- 'test2.xlsx'
write.xlsx(rbind(dummy, df), fn2, row.names=FALSE)
Hope that helps
So the xlsx package actually interfaces with a java library in order to pull in and modify the workbooks. The read.xlsx and write.xlsx functions are convenience wrappers to read and write data frames within R without having to manually write code to parse the individual cells and rows yourself using the java objects. The loadWorkbook and getRows functions give you access to the actual java objects, which you can then use to modify things such as cell styles.
However, if all you want to do is add a blank row to your spreadsheet before you output it, the easiest way is to add a blank row to your data frame before you export it (as Chris mentioned). You can accomplish this with the following variant on your code:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
# I have added row.names=FALSE because the read.xlsx function
# does not have a parameter to include row names
write.xlsx(matrix(rnorm(25),5),fn1,row.names=FALSE)
# If you read your data back in using the read.xlsx function
# you will have a data frame rather than a series of java object references
wb <- read.xlsx(fn1,1)
# Now that we have a data frame we can add a blank row at the top
wb <- rbind.data.frame(rep("", 5), wb)
# Then we write to the file using the same function as before
write.xlsx(wb,fn2,row.names=FALSE)
If you desire to use the advanced functionality in the java libraries, some of it is not currently implemented in the R package, and thus you will have to call it directly using .jcall. If you decide to pursue this line of action, I definitely recommend using .jmethods on the objects produced (i.e. .jmethods(rows[[1]])), which will list the functions available on the object (at the cell level these are quite extensive).
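For example (a sketch reusing the objects from the question's code; .jmethods comes from the rJava package):
library(xlsx)
library(rJava)   # for .jmethods, in case it is not already attached
wb <- loadWorkbook('test1.xlsx')
rows <- getRows(getSheets(wb)[[1]])
.jmethods(rows[[1]])   # list the java methods callable on a row object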
