Any help would be greatly appreciated, as this has been a problem I've been trying to solve for quite some time.
I have 13,877 days' worth of climate data from PRISM: daily precipitation from 01/01/1981 to 12/31/2018. The data I need is "climate data: precipitation" with "temporal period: daily data." From there, the data has to be downloaded manually, year by year, as *.bil files. The data is floating point, with each .bil file being a 621 x 1405 raster.
I would like to convert and consolidate all these raster files into a single *.csv file using R (or Python). This post seemed close, but I could not get the suggested code to work properly.
With such a large amount of data, I am looking for an efficient solution if possible. My most recent attempt was as follows:
#After setting the working directory to the folder with all .bil files
library(prism)
library(raster)
library(sp)
library(dplyr)  # for the pipe
library(tidyr)  # for gather()
type = "ppt"
ls_prism_data(name = TRUE) #lists PRISM files
j = 1
k = 18
final_df = NULL
filepre = paste("PRISM_", type, "_stable_4kmD2_", sep = "")
for (i in 1:771) {
  if (i == 771) {
    k = 13877 # last batch has only 17 files: 770*18 + 17 = 13877
  }
  RS1 <- prism_stack(ls_prism_data()[j:k, 1])
  j = j + 18
  k = k + 18
  pr_df <- data.frame(rasterToPoints(RS1))
  long_pr_df <- pr_df %>% gather(Date, ppt, -x, -y)  # value column named for the variable
  long_pr_df$Date <- gsub(filepre, "", long_pr_df$Date)
  long_pr_df$Date <- gsub("_bil", "", long_pr_df$Date)
  final_df <- rbind(final_df, long_pr_df)
}
My thinking was to convert the data in "batches." Unfortunately, this results in a fatal error that causes R to abort.
Ultimately, I would like to get the information into a single file, organized by year and by point, with a column denoting precipitation, so that I can repeat the process for other climate variables.
I don't need a correction to the above code per se, I'm just looking for a means to an end. Thank you in advance for the help!
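Edit: to show the direction I am leaning, here is a rough sketch of writing each batch straight to one CSV with data.table::fwrite(append = TRUE) instead of growing final_df in memory. It is untested at full scale; the output file name and batch size are placeholders, and ls_prism_data() assumes the prism download folder is set as above.
library(prism)
library(raster)
library(tidyr)
library(data.table)
files   <- ls_prism_data()[, 1]                          # all 13,877 PRISM file names
batches <- split(files, ceiling(seq_along(files) / 18))  # ~18 days per batch
out_csv <- "prism_ppt_daily.csv"                         # placeholder output name
for (b in seq_along(batches)) {
  RS1     <- prism_stack(batches[[b]])
  long_df <- gather(data.frame(rasterToPoints(RS1)), Date, ppt, -x, -y)
  # append each processed batch to one CSV on disk instead of rbind-ing in memory
  fwrite(long_df, out_csv, append = (b > 1), col.names = (b == 1))
}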
Related
I have a problem with my R code. I would like to run about 100 regressions and perform this process with a loop. I have tried to program the loop myself using help from YouTube and the like, but I am getting nowhere. Therefore, I would like to ask you if you can help me.
Specifically, it's about the following:
I have a dataset of the 100 companies in the Nasdaq100 and I would like to regress the sales per share against the stock price performance on a quarterly basis. Another complication is that the data set contains all 100 companies, so for each company a subset for the respective ticker symbol has to be created before R can run the corresponding regression.
Here is an excerpt from the code:
Nasdaq_100 = read_xlsx("Nasdaq_100_Sales_Data.xlsx")
#Correlation between quarterly close price and Sales of AMD
AMD <- subset(Nasdaq_100, TickerSymbol == "AMD")
AMD_regression <- lm(Sales ~ Stockprice_quarterly, data = AMD)
summary(AMD_regression)
Can you help me to program this loop for regression analysis?
Thanks in advance for any help!
To convert this to a for loop, first get a list of the .xlsx files in your working directory:
library(readxl)
library(data.table)
myfiles <- list.files(pattern = "\\.xlsx$")
Then loop through each file, saving the results, with minor modifications to your existing code:
for (file in myfiles) {
  Nasdaq_100 <- readxl::read_excel(file)  # fread() cannot read .xlsx files, so use readxl
  AMD <- subset(Nasdaq_100, TickerSymbol == "AMD")
  AMD_regression <- lm(Sales ~ Stockprice_quarterly, data = AMD)
  print(summary(AMD_regression))
  # write the coefficient table; an lm object itself cannot be written as a flat file
  data.table::fwrite(as.data.frame(summary(AMD_regression)$coefficients),
                     file = paste0("output_", file), quote = FALSE, sep = "\t", row.names = TRUE)
}
Copy-paste this in R and let me know if it works.
library(readxl)
file <- choose.files()  # file picker (Windows)
lmset <- NULL
for (i in seq_len(100)) {  # assumes one sheet per company
  data <- read_excel(file, sheet = i)
  fit <- lm(Sales ~ Stockprice_quarterly, data = data)  # use the columns of the current sheet
  lmset <- rbind(lmset, coef(fit))
}
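If all 100 companies sit in a single sheet, as in the question's excerpt, another option is to loop over the ticker symbols themselves rather than over files or sheets. This is only a sketch, assuming the columns are named TickerSymbol, Sales and Stockprice_quarterly as in the question:
library(readxl)
Nasdaq_100 <- read_xlsx("Nasdaq_100_Sales_Data.xlsx")
tickers <- unique(Nasdaq_100$TickerSymbol)
# one regression per company, collected in a named list
results <- list()
for (tk in tickers) {
  company <- subset(Nasdaq_100, TickerSymbol == tk)
  results[[tk]] <- lm(Sales ~ Stockprice_quarterly, data = company)
}
summary(results[["AMD"]])  # inspect a single fit
slopes <- sapply(results, function(m) coef(m)[["Stockprice_quarterly"]])  # all slopes at once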
I have several rasters, 343 to be more exact, from Cropscape. I need to get the locations (centroids) and area measurements of pixels that represent potatoes and tomatoes based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTiffs on a Google Drive and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will
Subset the rasters to potatoes and tomatoes
Change the raster subsets to polygons, one for potatoes and one for tomatoes
Create centroids from each polygon
Create a SpatialPointsDataFrame based on the centroids
Extract the area measurement for each area of interest with SpatialPointsDataFrame
Write the raster subsets and each polygon to a file.
Code:
library(raster)
library(rgdal)
library(rgeos)
dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]
# Toy raster list for use with code
# I use `list.files()` with the directories that hold
# the rasters and use list that is generated from
# that to read in the files to raster. My list is called
# "tiflist". Not used in the code, but mentioned later.
rmk1 <- function(x, ...) {
r1 = raster(ncol = 1000, nrow = 1000)
r1[] = sample(1:60, 1000000, replace = T)
proj4string(r1) = CRS(ae_pr)
return(r1)
}
rlis <- lapply(1:5, rmk1)
#Pixel values needed
ptto <- c(43, 54)
# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
# Next line is not necessary with the raster list created above.
# temras = raster(x)
now = format(Sys.time(), "%b%d%H%M%S")
nwnm = paste0(names(x), now)
rasmatx = match(x = x, table = ptto)
writeRaster(rasmatx, file.path( dat_dir2, paste0(nwnm,"ras")), format = "GTiff")
tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4}, dissolve = T)
tempol2 = disaggregate(tempol)
# for potatoes
tempol2p = tempol2[tempol2$layer == '1',]
if (nrow(tempol2p) > 0) {
temcenp = gCentroid(tempol2p, byid = T)
temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp) , temcenp))
temcenpdf$pot_p = extract(rasmatx, temcenpdf)
temcenpdf$areap_m = gArea(tempol2p, byid = T)
# writeOGR(temcenpdf, dsn=file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
}
# for tomatoes
tempol2t = tempol2[tempol2$layer == '2',]
if (nrow(tempol2t) > 0) {
temcent = gCentroid(tempol2t, byid = T)
temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent) , temcent))
temcentdf$tom_t = extract(rasmatx, temcentdf)
temcentdf$areat_m = gArea(tempol2t, byid = T)
writeOGR(temcentdf, dsn=file.path(dat_dir2), paste0(nwnm,"t"), driver = "ESRI Shapefile")
}
}
lapply(rlis, pottom)
I know I should provide some toy data, and I created some, but I don't know whether it exactly recreates my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem: a lot of memory is used when this code runs. Working through tiflist only gets through the first 4 files before the RAM on my laptop (16 GB) is completely consumed. I'm pretty sure it's the connections to the Google Drive, since the cache for the drive stream is at least 8 GB. I guess each raster is staying open after being connected to in the Google Drive? I don't know how to confirm that.
I think I need to get the function to clear out all of the objects that are created, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that there was no output at all from the function after 10 minutes, and by then I usually have the first 3 rasters processed.
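For reference, the cleanup I tried at the end of pottom() looked roughly like this (a sketch only; object names as in the function above, and I realise a function's local objects should be released when it returns anyway, which is partly why I am unsure this is the right fix):
# drop the per-raster objects and ask R to give memory back to the OS
rm(rasmatx, tempol, tempol2)
gc()
Perhaps raster::removeTmpFiles(h = 0) would also be worth adding to delete the temporary files the raster package writes during a session, though I have not tried that.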
27/Oct EDIT after comments from RobertHijmans. It seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the way it works for me locally to work with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.
I feel silly answering my own question, but here it is: the rasterToPolygons function is notoriously slow. I was unaware of this issue. In one of my attempts I waited 30 minutes before killing the process with no result. It works under the conditions I require for rasters of Alabama and Arkansas, for example, but not California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test is ongoing at 12 minutes, so I don't know if it works for an object as large as California. I don't want to copy and paste someone else's code in an answer to my own question.
One of the comments suggested using profvis, but I couldn't really make sense of the output, and it hung along with the process too.
I'm trying to extract a subset of depth data from GEBCO's global ocean bathymetry dataset, which is a 10.9 GB NetCDF-4 (.nc) file (direct link).
I open a connection to the file, which doesn't load it into memory:
library(ncdf4)
GEBCO <- nc_open(filename = "GEBCO_2019.nc", verbose = T)
Find the lat & lon indices corresponding to my subset area:
LonIdx <- which(GEBCO$dim$lon$vals < -80 & GEBCO$dim$lon$vals > -81.7) #n=408 long
LatIdx <- which(GEBCO$dim$lat$vals < 26 & GEBCO$dim$lat$vals > 25) #n=240; 240*408=97920
Then get Z data for those extents:
z <- ncvar_get(GEBCO, GEBCO$var$elevation)[LonIdx, LatIdx]
Resulting in:
Error: cannot allocate vector of size 27.8 Gb
However, it does this regardless of the size of the subset, even down to a 14 x 14 matrix. I presume, therefore, that ncvar_get() is pulling in the whole variable before the indices are applied, even though I was under the impression that the entire point of NetCDF files is that you can extract subsets by index without loading the whole thing into memory.
FWIW I'm on a 32 GB Linux machine, so it should work anyway? [Edit: and the file is 10.9 GB in the first place, so one would think a subset would be smaller.]
Any ideas/intel/insights gratefully received. Thanks in advance.
Edit: other times it crashes RStudio rather than giving the error: R Session Aborted, fatal error, session terminated.
OK, solved. It turns out the approach I found online, subsetting with [LonIdx, LatIdx], indexes the object only after the whole variable has been read into memory. I still don't know why that failed, given that the file size is a third of my memory and even the expanded array should fit, but either way it is the wrong way to go.
Assuming one's rows and columns are contiguous (they should be in NetCDF), the solution is to use the start and count arguments of ncvar_get():
z <- ncvar_get(nc = GEBCO,
varid = GEBCO$var$elevation,
start = c(LonIdx[1],
LatIdx[1]),
count = c(length(LonIdx),
length(LatIdx)),
verbose = T)
To convert to long format:
lon <- GEBCO$dim$lon$vals[LonIdx]
lat <- GEBCO$dim$lat$vals[LatIdx]
rownames(z) <- as.character(lon)
colnames(z) <- as.character(lat)
library(tidyr)
library(magrittr)
ztbl <- as_tibble(z, rownames = "lon")
ztbl %<>% pivot_longer(-lon, names_to = "lat", values_to = "depth")
I'm having a memory problem, with R giving the "cannot allocate vector of size XX Gb" error message. I have a bunch of daily files (12784 days) in NetCDF format giving sea surface temperature on a 1305 x 378 (longitude-latitude) grid. That gives 493290 points each day, decreasing to about 245000 when removing NAs (over land points).
My final objective is to build a time series for any of the 245000 points from the daily files and find the temporal trend for each point. My idea was to build a big data frame with a point per row and a day per column (245000 x 12784) so I could apply the trend calculation to any point. But then, building such a data frame, the memory problem appeared, as expected.
First I tried a script I had previously used to read data and extract a three-column (lon-lat-sst) data frame by reading each nc file and then melting the data. This led to excessive computing time even when tried for a small set of days, and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem, but the csv output files were too big and the process was very time consuming.
Another strategy I've tried so far without success has been to sequentially read all the nc files, extract all the daily values for each point, and find the trend. Then I would only need to save a single 245000-point data frame. But I think this would be time consuming and not the proper R way.
I have been reading about the bigmemory and ff packages to try to declare a big.matrix or a 3D array (1305 x 378 x 12784), but have had no success so far.
What would be the appropriate strategy to face the problem?
Extract single point time series to calculate individual trends and populate a smaller dataframe
Subset daily files in slices to avoid the memory problem but end with a lot of dataframes/files
Try to solve the memory problem with bigmemory or ff packages
Thanks in advance for your help
EDIT 1
Add code to fill the matrix
library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)
# paths
ruta_datos<-"/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball<-"/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)
sst_data_full <- function(inputfile) {
sstFile <- nc_open(inputfile)
sst_read <- list()
sst_read$lon <- ncvar_get(sstFile, "lon")
sst_read$lats <- ncvar_get(sstFile, "lat")
sst_read$sst <- ncvar_get(sstFile, "analysed_sst")
nc_close(sstFile)
sst_read
}
melt_sst <- function(L) {
dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
sst_read <- melt(L$sst, value.name = "sst")
}
# One month list file: This ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")
sst.out=data.frame()
for (i in 1:length(files) ) {
sst<-sst_data_full(paste0(ruta_datos,files[i],sep=""))
msst <- melt_sst(sst)
msst<-subset(msst, !is.na(msst$sst))
if ( i == 1 ) {
sst.out<-msst
} else {
sst.out<-cbind(sst.out,msst$sst)
}
}
EDIT 2
Code used on a previous (smaller) data frame to calculate the temporal trend. The original data was a matrix of time series, each column being one series.
library(forecast)
data <- read.csv(....)
output <- numeric(length(data) - 1)  # one trend coefficient per series
for (i in 2:length(data)) {
  valor <- data[, i]
  datos2 <- as.data.frame(cbind(data$fecha, valor))
  datos.ts <- ts(datos2$valor, frequency = 365)
  datos.stl <- stl(datos.ts, s.window = 365)
  datos.tslm <- tslm(datos.ts ~ trend)
  summary(datos.tslm)
  output[i - 1] <- datos.tslm$coefficients[2]
}
fecha is the name of the date variable.
EDIT 3
Working code from F. Privé's answer:
library(bigstatsr)
tmp <- sst_data_full(paste0(ruta_datos, files[1]))
mat <- FBM(length(tmp$sst), length(files), backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL")
for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(paste0(ruta_datos, files[i]))$sst
}
With this code a big matrix was created
dim(mat)
[1] 493290 12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290
So, to read your data into a Filebacked Big Matrix (FBM), you can do:
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201*",
full.names = TRUE)
tmp <- sst_data_full(files[1])
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(files[i])$sst
}
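From there, the per-point trend can be computed without ever loading the whole matrix, by working on blocks of rows of the FBM. The following is only a sketch (it is not part of the answer above); it assumes the columns of mat are the days in chronological order and computes, for each point, the ordinary least-squares slope of SST against time:
library(bigstatsr)
t_c   <- seq_len(ncol(mat)) - (ncol(mat) + 1) / 2  # centred time index (sums to zero)
denom <- sum(t_c^2)
# because t_c sums to zero, the OLS slope of row i reduces to sum(x_i * t_c) / denom,
# so one matrix product per block of rows is enough
slopes <- big_apply(mat,
                    a.FUN = function(X, ind, t_c, denom) as.vector(X[ind, ] %*% t_c) / denom,
                    ind = rows_along(mat), a.combine = "c", block.size = 10e3,
                    t_c = t_c, denom = denom)
# land points whose series are all missing simply give NA slopes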
I need to read hundreds of .bil files (reproducible example):
d19810101 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(1,10),value=c(11:20))
d19810102 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(2,10),value=c(12:21))
d19820101 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(1,10),value=c(13:22))
d19820102 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(2,10),value=c(14:23))
The code I wrote for testing on a small number of files works OK, but when I tried to run it on all of the files it went very slowly; please let me know if there is any way I can improve it. What I need to do is simply get the average of 33 years of daily data. Here is the code for testing on a small number of files:
years <- c(1981:1982)
days <- substr(as.numeric(format(seq(as.Date("1981/1/1"), as.Date("1981/1/2"), "day"), '%Y%m%d')),5,8)
X_Y <- NULL
for (j in days) {
for (i in years) {
XYi <- read.table(paste(i,substr(j,1,2),substr(j,3,4),".csv",sep=''),header=T,sep=",",stringsAsFactors=F)
X_Y <- rbind(X_Y, XYi)
cat(paste("Data in ", i, j, " are processing now.", sep=""), "\n")
}
library(plyr)
X_Y1 <- ddply(X_Y, .(ID, month, day), summarize, mean(value, na.rm=T))
cat(paste("Data in ", i, j, " are processing now.", sep=""), "\n")
}
EDIT:
Thank you for all your help! I tried putting the files in a list to read, but since they are .bil files whose raster characteristics need to be read, I got an error; that's why I read them one by one. Sorry for not making that clear earlier.
Read.files <- function(file.names, sep=",") {
library(raster)
ldply(file.names, function(fn) data.frame(Filename=fn, layer <- raster(fn, sep=",")))
}
data1 <- Read.files(paste("filenames here",days,".bil",sep=''), sep=",")
"Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class 'structure("RasterLayer", package = "raster")' into a data.frame.
EDIT 2:
The structure of my data is actually the same as the example data, except that my data is gridded and needs to be extracted (using the raster function instead of read.csv) and then put into a data frame, so I need to do the following steps:
for (i in days)
{
layer <- raster(paste("filename here",i,".bil",sep=''))
projection <- projection(layer)
cellsize <- res(layer)[1]
...
s <- resample(layer,r, method='ngb')
XY <- data.frame(rasterToPoints(s))
names(XY) <- c('Long','Lat','Data')
}
It's hard to tell exactly how you are managing file IO, but I think an easier way to achieve this would be to read the files in, put them into one data.frame (e.g. using rbind()), and then get the summary statistics you need via tapply():
data <- do.call(rbind, mget(ls(pattern = "^d[0-9]+$"))) # combine the d19810101, d19810102, ... data frames
with(data, tapply(value, list(month, day), mean)) # get mean for each month and day combination
This assumes you have already read in all of the files, to objects named as in your example.
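If the files really do have to stay as rasters, the averaging can also be done on the rasters themselves before any data frame is built, which avoids the repeated rbind entirely. This is only a sketch of that alternative (it is not used anywhere above); it assumes every .bil file is on the same grid and that the only digits in each file name are the date as YYYYMMDD:
library(raster)
files <- list.files(pattern = "\\.bil$", full.names = TRUE)
mmdd  <- substr(gsub("\\D", "", basename(files)), 5, 8)  # calendar day of each file
s   <- stack(files)
# one mean layer per calendar day, averaged across the 33 years
avg <- stackApply(s, indices = as.integer(factor(mmdd)), fun = mean, na.rm = TRUE)
names(avg) <- paste0("day_", levels(factor(mmdd)))
# long Long/Lat/value table per calendar day, if a data frame is still needed
XY <- data.frame(rasterToPoints(avg))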