Dealing with large shapefiles on R - r

Good afternoon:
I have been studying R for a while, but I am now working with large shapefiles, each bigger than 600 MB. My computer has 200 GB of free disk space and 12 GB of RAM. Does anybody know how to deal with files of this size? I really appreciate your kind help.

With the latest version of 64-bit R, and the latest version of rgdal just try reading it in:
library(rgdal)
shpdata <- readOGR("/path/to/shpfolder/", "shpfilename")
Where "shpfilename" is the filename without the extension.
If that fails update your question with details of what you did, what you saw, details of the file sizes - each of the "shpfilename.*" files, details of your R version, operating system and rgdal version.

Ok so the question is more about a strategy for dealing with large files, not "how does one read a shapefile in R."
This post shows how one might use the divide-apply-recombine approach as a solution by subsetting shapefiles.
Working from the current answer, assume you have a SpatialPolygonsDataFrame called shpdata. shpdata will have a data slot (accessed via @data) with some sort of identifier for each polygon (for Tiger shapefiles it's usually something like 'GEOID'). You can then loop over these identifiers in groups and subset/process/export the shpdata for each small batch of polygons. I suggest either writing intermediate files as .csv or inserting them into a database like sqlite.
Some sample code
library(rgdal)
shpdata <- readOGR("/path/to/shpfolder/", "shpfilename")
# assuming the geo id var is 'geo_id'
lapply(unique(shpdata@data$geo_id), function(id_var){
shp_sub = subset(shpdata, geo_id == id_var)
### do something to the shapefile subset here ###
### output results here ###
### clean up memory !!! ###
rm(shp_sub)
gc()
})

Related

Create data tables using SPSS in R

Using expss package I am creating cross tabs by reading SPSS files in R. This actually works perfectly but the process takes lots of time to load. I have a folder which contains various SPSS files(usually 3 files only) and through R script I am fetching the last modified file among the three.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
#get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
#use file.info to get the index of the file most recently modified
pass<-all_sav[with(file.info(all_sav), which.max(mtime))]
mydata = read_spss(pass,reencode = TRUE) # read SPSS file mydata
w <- data.frame(mydata)
args <- commandArgs(TRUE)
Everything is perfect and works absolutely fine but it generally takes too much time to load large files(112MB,48MB for e.g) which isn't good.
Is there a way to make it more time-efficient, so that creating the table takes less time? The dropdowns are created using PHP.
I have searched for this and found another library called 'haven' but I am not sure whether that can give me significance as well. Can anyone help me with this? I would really appreciate that. Thanks in advance.
As written in the expss vignette (https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html), you can use haven together with expss in the following way:
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)

How to batch read 2.8 GB gzipped (40 GB TSVs) files into R?

I have a directory with 31 gzipped TSVs (2.8 GB compressed / 40 GB uncompressed). I would like to conditionally import all matching rows based on the value of 1 column, and combine into one data frame.
I've read through several answers here, but none seem to work—I suspect that they are not meant to handle that much data.
In short, how can I:
Read 3 GB of gzipped files
Import only rows whose column matches a certain value
Combine matching rows into one data frame.
The data is tidy, with only 4 columns of interest: date, ip, type (str), category (str).
The first thing I tried using read_tsv_chunked():
library(purrr)
library(IPtoCountry)
library(lubridate)
library(scales)
library(plotly)
library(tidyquant)
library(tidyverse)
library(R.utils)
library(data.table)
#Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
str_c(dir(import_path))
#Define a function to filter data as it comes in.
call_back <- function(x, pos){
unique(dplyr::filter(x, .data[["type"]] == "purchase"))
}
raw_data <- files %>%
map(~ read_tsv_chunked(., DataFrameCallback$new(call_back),
chunk_size = 5000)) %>%
reduce(rbind) %>%
as_tibble() # %>%
This first approach worked with 9 GB of uncompressed data, but not with 40 GB.
The second approach using fread() (same loaded packages):
#Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
str_c(dir(import_path))
bind_rows(map(str_c("gunzip -c ", files), fread))
That looked like it started working, but then locked up. I couldn't figure out how to pass the select = c(colnames) argument to fread() inside the map()/str_c() calls, let alone the filter criteria for the one column.
This is more of a strategy answer.
R loads all data into memory for processing, so you'll run into issues with the amount of data you're looking at.
What I suggest you do, which is what I do, is to use Apache Spark for the data processing, and use the R package sparklyr to interface to it. You can then load your data into Spark, process it there, then retrieve the summarised set of data back into R for further visualisation and analysis.
You can install Spark locally in your R Studio instance and do a lot there. If you need further computing capacity have a look at a hosted option such as AWS.
Have a read of this https://spark.rstudio.com/
One technical point: there is a sparklyr function, spark_read_csv, which will read delimited text files directly into the Spark instance (set delimiter = "\t" for TSVs). It's very useful.
From there you can use dplyr to manipulate your data. Good luck!
First, if the base read.table is used, you don't need to gunzip anything, as it uses Zlib to read these directly. read.table also works much faster if the colClasses parameter is specified.
Y'might need to write some custom R code to produce a melted data frame directly from each of the 31 TSVs, and then accumulate them by rbinding.
Still it will help to have a machine with lots of fast virtual memory. I often work with datasets on this order, and I sometimes find an Ubuntu system wanting on memory, even if it has 32 cores. I have an alternative system where I have convinced the OS that an SSD is more of its memory, giving me an effective 64 GB RAM. I find this very useful for some of these problems. It's Windows, so I need to set memory.limit(size=...) appropriately.
Note that once a TSV is read using read.table, it's pretty compressed, approaching what gzip delivers. You may not need a big system if you do it this way.
If it turns out to take a long time (I doubt it), be sure to checkpoint and save.image at spots in between.
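The read.table suggestion above can be sketched roughly as follows. This is only a minimal illustration, not a tuned solution: the demo folder, file names, and sample rows are made up so that the example runs on its own, and the helper read_purchases() is a hypothetical name. It assumes every file has the four columns from the question and that only rows with type == "purchase" are wanted.

```r
# Demo setup (hypothetical data): write two small gzipped TSVs to a temp folder.
import_path <- file.path(tempdir(), "import_demo")
dir.create(import_path, showWarnings = FALSE)
demo <- data.frame(
  date     = c("2020-01-01", "2020-01-02", "2020-01-03"),
  ip       = c("10.0.0.1", "10.0.0.2", "10.0.0.3"),
  type     = c("purchase", "view", "purchase"),
  category = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
for (i in 1:2) {
  con <- gzfile(file.path(import_path, sprintf("part%d.tsv.gz", i)), "w")
  write.table(demo, con, sep = "\t", quote = FALSE, row.names = FALSE)
  close(con)
}

# Read one gzipped file (no manual gunzip needed) and keep only matching rows.
# Declaring colClasses up front spares read.delim from guessing column types.
read_purchases <- function(path) {
  chunk <- read.delim(path,
                      colClasses = rep("character", 4),
                      quote = "", stringsAsFactors = FALSE)
  chunk[chunk$type == "purchase", , drop = FALSE]
}

# Filter file by file, so only the matching rows of each file are kept around.
files <- list.files(import_path, pattern = "\\.tsv\\.gz$", full.names = TRUE)
purchases <- do.call(rbind, lapply(files, read_purchases))
```

Because each file is filtered before the results are combined, peak memory is driven by the largest single file plus the accumulated matches, rather than by all 40 GB at once.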

How can I download GADM data in R?

library(raster)
france<-getData('GADM', country='FRA', level=1)
However, the command is leading me to this error.
trying URL 'http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm1.rds'
Error in utils::download.file(url = aurl, destfile = fn, method = "auto", :
cannot open URL 'http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm1.rds'
First, download the country data you want from the GADM database, and save it to your local directory. Be sure that you have chosen the R (SpatialPolygonsDataFrame) format. There are six levels available for France (from level 0 to level 5). You can choose what you need.
Second, read the .rds file downloaded from GADM with readRDS() function and transform it into a data.frame with ggplot2::fortify().
library(ggplot2)
library(sp)
# assumed that you downloaded into a such path: '~/Downloads/FRA_adm1.rds':
path <- file.path(Sys.getenv("HOME"), "Downloads", "FRA_adm1.rds")
# FR map (Level 1) from GADM version 2.8
frRDS <- readRDS(path)
# Region names 1 in data frame
frRDS_df <- ggplot2::fortify(frRDS, region = "NAME_1")
head(frRDS_df)
I am going to improve upon the previous answer to the OP's question.
To answer the OP's question directly and correctly, there is nothing wrong with the OP's code. The issue was likely a temporary internet connection issue because the OP's code works and retrieves the gadm.org data without issue. Note, the getData() function retrieves the gadm.org website's geodata that is stored and retrieved from the http://biogeo.ucdavis.edu/ website.
The raster package provides the getData() function which is very useful for automatically retrieving the geodata from the internet. This function can also be used to retrieve geodata that is kept locally on a PC.
In years past, the way to use geodata was to first download a file from the gadm.org website, and then to move that file from the download folder and save the file in a folder on the pc. These files then needed to be unpackaged/unzipped before the geodata was available to be used by R.
Using the getData() makes life simpler because this method directly retrieves the desired geodata and then makes the geodata available to use with R.
The gadm.org website clearly states:
"Downloading by country is the recommended approach"
Even though downloading the large world geodata file directly from the website can be done, it is unnecessary and resource intensive. Unless there is some specific reason for doing so, there is absolutely no need to download and keep the large worldwide geodata database on the PC.
One last thing about the getData() function: it currently generates a warning when it is used. The warning reads:
Warning message in getData("GADM", country = "USA", level = 1):
"getData will be removed in a future version of raster.
Please use the geodata package instead"

Can I cache data loading in R?

I'm working on a R script which has to load data (obviously). The data loading takes a lot of effort (500MB) and I wonder if I can avoid having to go through the loading step every time I rerun the script, which I do a lot during the development.
I appreciate that I could do the whole thing in the interactive R session, but developing multi-line functions is just so much less convenient on the R prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script, such that on subsequent executions, d is already available? Is there something like a cache=T statement as in R markdown code chunks?
Sort of. There are a few answers:
Use a faster csv reader: fread() in the data.table package is beloved by many. Your time may come down to a second or two.
Similarly, read once as csv and then write in compact binary form via saveRDS() so that next time you can do readRDS(), which will be faster as you do not have to load and parse the data again.
Don't read the data but memory-map it via package mmap. That is more involved but likely very fast. Databases use such a technique internally.
Load on demand; the SOAR package, for example, is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that, as clearly reproducible scripts which make the loading explicit are preferable in our view, but R can help via the load() / save() mechanism (which loads several objects at once, whereas saveRDS() / readRDS() work on a single object).
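The saveRDS() / readRDS() option from the list above might look something like this. It is a minimal sketch under stated assumptions: the file names and the helper load_cached() are hypothetical, and a tiny CSV written to a temp folder stands in for the real 500 MB file so the example is self-contained.

```r
# Demo setup (hypothetical file names): a small CSV standing in for large.csv.
csv_file   <- file.path(tempdir(), "large.csv")
cache_file <- file.path(tempdir(), "large.rds")
write.csv(data.frame(x = 1:5, y = letters[1:5]), csv_file, row.names = FALSE)
if (file.exists(cache_file)) file.remove(cache_file)  # start the demo from a clean state

# Return the cached object if the .rds file exists and is at least as new
# as the CSV; otherwise parse the CSV once and write the cache.
load_cached <- function(csv_file, cache_file) {
  if (file.exists(cache_file) &&
      file.mtime(cache_file) >= file.mtime(csv_file)) {
    return(readRDS(cache_file))
  }
  d <- read.csv(csv_file, header = TRUE)
  saveRDS(d, cache_file)
  d
}

d  <- load_cached(csv_file, cache_file)  # first run: parses the CSV, writes the cache
d2 <- load_cached(csv_file, cache_file)  # later runs: reads the cache instead
```

Comparing modification times is one simple way to invalidate the cache when the CSV changes; deleting the .rds file by hand forces a fresh read as well.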
Another option is the R.cache package:
library(R.cache)
library(WDI)  # provides WDI(), used below to fetch the data
start_year <- 2000
end_year <- 2013
brics_countries <- c("BR","RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
"GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")
key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
brics_data <- WDI(country=brics_countries, indicator=indics,
start=start_year, end=end_year, extra=FALSE, cache=NULL)
saveCache(brics_data, key=key, comment="brics_data")
}
I use exists to check if the object is present and load conditionally, i.e.:
if (!exists("d"))  # exists() expects the object name as a string
{
d <- read.csv("large.csv", header=T)
# Any further processing on loading
}
# The rest of the script
If you want to load/process the file again, just use rm(d) before sourcing. Just be careful that you do not use object names that are already used elsewhere, otherwise it will pick that up and not load.
I wrote up some of the common ways of caching in R in "Caching in R" and published it to R-Bloggers. For your purpose, I would recommend just using saveRDS(), or qsave() from the 'qs' (quick serialization) package. My package, 'mustashe', uses 'qs' for reading and writing files, so you could just use mustashe::stash(), too.

Read SPSS file into R

I am trying to learn R and want to bring in an SPSS file, which I can open in SPSS.
I have tried using read.spss from foreign and spss.get from Hmisc. Both error messages are the same.
Here is my code:
## install.packages("Hmisc")
library(foreign)
## change the working directory
getwd()
setwd('C:/Documents and Settings/BTIBERT/Desktop/')
## load in the file
## ?read.spss
asq <- read.spss('ASQ2010.sav', to.data.frame=T)
And the resulting error:
Error in read.spss("ASQ2010.sav", to.data.frame = T) : error
reading system-file header In addition: Warning message: In
read.spss("ASQ2010.sav", to.data.frame = T) : ASQ2010.sav: position
0: character `\000' (
Also, I tried saving out the SPSS file as a SPSS 7 .sav file (was previously using SPSS 18).
Warning messages: 1: In read.spss("ASQ2010_test.sav", to.data.frame =
T) : ASQ2010_test.sav: Unrecognized record type 7, subtype 14
encountered in system file 2: In read.spss("ASQ2010_test.sav",
to.data.frame = T) : ASQ2010_test.sav: Unrecognized record type 7,
subtype 18 encountered in system file
I had a similar issue and solved it following a hint in read.spss help.
Using package memisc instead, you can import a portable SPSS file like this:
data <- as.data.set(spss.portable.file("filename.por"))
Similarly, for .sav files:
data <- as.data.set(spss.system.file('filename.sav'))
although in this case I seem to miss some string values, while the portable import works seamlessly. The help page for spss.portable.file claims:
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. They are also adapted to load efficiently large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions, provided by this package.
read.spss seems to be a little outdated, so I used a package called memisc.
To get this to work, do this:
install.packages("memisc")
library(memisc)
data <- as.data.set(spss.system.file('yourfile.sav'))
You may also try this:
setwd("C:/Users/rest of your path")
library(haven)
data <- read_sav("data.sav")
and if you want to read all files from one folder:
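# pattern is a regular expression, so match the extension explicitly
temp <- list.files(pattern = "\\.sav$")
# lapply returns a list with one data frame per file
read.all <- lapply(temp, read_sav)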
I know this post is old, but I also had problems loading a Qualtrics SPSS file into R. R's read.spss code came from PSPP a long time ago, and hasn't been updated in a while. (And Hmisc's code uses read.spss(), too, so no luck there.)
The good news is that PSPP 0.6.1 should read the files fine, as long as you specify a "String Width" of "Short - 255 (SPSS 12.0 and earlier)" on the "Download Data" page in Qualtrics. Read it into PSPP, save a new copy, and you should be in business. Awkward, but free.
You can read an SPSS file from R using the solutions above or the one you are currently using. Just make sure that the command is given a file it can read properly. I had the same error, and the problem was that the file could not be accessed. Make sure the file path is correct, the file is accessible, and it is in the correct format.
library(foreign)
asq <- read.spss('ASQ2010.sav', to.data.frame=TRUE)
As far as the warning message is concerned, it does not affect the data. Record type 7 is used by newer SPSS versions to store features in a way that still lets older SPSS software read the file. I have used this numerous times and no data was lost.
You can also read about this at http://r.789695.n4.nabble.com/read-spss-warning-message-Unrecognized-record-type-7-subtype-18-encountered-in-system-file-td3000775.html#a3007945
It looks like the R read.spss implementation is incomplete or broken. R2.10.1 does better than R2.8.1, however. It appears that R gets upset about custom attributes in a sav file even with 2.10.1 (The latest I have). R also may not understand the character encoding field in the file, and in particular it probably does not work with SPSS Unicode files.
You might try opening the file in SPSS, deleting any custom attributes, and resaving the file.
You can see whether there are custom attributes with the SPSS command
display attributes.
If so, delete them (see VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE commands), and try again.
HTH,
Jon Peck
If you have access to SPSS, save the file as .csv, then import it with read.csv or read.table. I can't recall any problem with importing .sav files; so far it has worked like a charm with both read.spss and spss.get. I reckon that spss.get will not give different results, since it depends on foreign::read.spss.
Can you provide some info on SPSS/R/Hmisc/foreign version?
Another solution not mentioned here is to read SPSS data in R via ODBC. You need:
IBM SPSS Statistics Data File Driver. Standalone driver is enough.
Import SPSS data using RODBC package in R.
See the example here. However I have to admit that, there could be problems with very big data files.
For me it works well using memisc!
install.packages("memisc")
library(memisc)
Daten.Februar <-as.data.set(spss.system.file("NPS_Februar_15_Daten.sav"))
names(Daten.Februar)
I agree with @SDahm that the haven package would be the way to go. I myself have struggled a bit with string values when starting to use it, so I thought I'd share my approach on that here, too.
The "semantics" vignette has some useful information on this topic.
library(tidyverse)
library(haven)
# Some interesting information in here
vignette('semantics')
# Get data from spss file
df <- read_sav(path_to_file)
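# get value labels: turn labelled columns into factors
df <- map_df(.x = df, .f = function(x) {
if (is.labelled(x)) as_factor(x)
else x})
# use variable labels as column names (keep the original name if a column has no label)
colnames(df) <- imap_chr(df, function(x, nm) {
lab <- attr(x, 'label')
if (is.null(lab)) nm else lab})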
There is no problem with the packages you are using. The only requirement for reading an SPSS file is to save it in PORTABLE format. SPSS files have a *.sav extension; you need to convert yours to a portable file, which uses the *.por extension.
There is more info in http://www.statmethods.net/input/importingdata.html
In my case this warning was accompanied by the appearance of a new variable before the first column of my data (with values -100, 2, 2, 2, ...), a shift in the correspondence between labels and values, and the deletion of the last variable. A solution that worked was to create (using SPSS) a new dummy variable in the last column of the file, fill it with random values, and execute the following code:
(filename is the path to the .sav file; in my case the original SPSS file had 62 columns, thus 63 with the additional dummy variable)
library(memisc)
data <- as.data.set(spss.system.file(filename))
copyofdata = data
for(i in 2:63){
names(data)[i] <- names(copyofdata)[i-1]
}
data[[1]] <- NULL
newcopyofdata = data
for(i in 2:62){
labels(data[[i]]) <- labels(newcopyofdata[[i-1]])
}
labels(data[[1]]) <- NULL
Hope the above code will help someone else.
Turn Unicode mode off in SPSS.
Open SPSS without any data open and run the code below in your syntax editor:
SET UNICODE OFF.
Open the data set and resave it to remove the Unicode encoding.
read.spss('yourdata.sav', to.data.frame=T) then works correctly.
I just came across an SPSS file that I couldn't open using haven, foreign, or memisc, but readspss::read.por did the trick for me:
download.file("http://www.tcd.ie/Political_Science/elections/IMSgeneral92.zip",
"IMSgeneral92.zip")
unzip("IMSgeneral92.zip", exdir = "IMSgeneral92")
# rio, haven, foreign, memisc pkgs don't work on this file! But readspss does:
if(!require(readspss)) remotes::install_git("https://github.com/JanMarvin/readspss.git")
ims92 <- readspss::read.por("IMSgeneral92/IMS_Nov7 92.por", convert.factors = FALSE)
Nice! Thanks, @JanMarvin!
1) I've found the program Stat/Transfer useful for importing SPSS and Stata files into R. It resolves the issue you mention by converting the SPSS file to an R dataset. It is also very useful for subsetting very large datasets into smaller portions consumable by R. Not free, but a very useful tool for working with datasets from different programs, especially if you don't have access to them.
2) The memisc package also has an SPSS import function worth trying.
