I have over 111,000 longitude and latitude points with a depth associated with each coordinate. The data is in the format Longitude, Latitude, Depth. When I load the data into R and convert it to class bathy using as.bathy, R seems to hang. When I check the format using is.bathy, R returns FALSE. Can 'marmap' handle such large datasets?
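Here is roughly what I am doing (the file name, separator, and column order below are just an example; marmap expects the columns ordered longitude, latitude, depth):

library(marmap)
xyz <- read.table("depths.txt", sep = ",", header = TRUE)   # hypothetical file
colnames(xyz) <- c("lon", "lat", "depth")
bat <- as.bathy(xyz)   # coerce the three-column xyz data to class bathy
is.bathy(bat)          # should return TRUE once the conversion succeeds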
There could be several causes of this behavior; to help you diagnose the problem, could you send me:
- your session info (use sessionInfo()) and tell me what kind of machine you have (e.g. how much RAM)?
- your code and the error message that you received?
- your data, or a subset of it, so I can try to re-create the problem here, on my machine?
cheers, eric
I am an absolute beginner in PostgreSQL and PostGIS (databases in general) but have a fairly good working knowledge of R. I have two multi-polygon data sets of vulnerable areas of India from two different sources - one is around 12 GB and in .gdb format (let's call it mygdb) and the other is a shapefile of around 2 GB (let's call it myshp). I want to compare the two sets of vulnerability maps and generate some state-wise measures of fit using the intersection (I), difference (D), and union (U) between the maps.
I would like to make use of PostGIS functionality (via R), as neither R (crashes!) nor QGIS (too slow) is efficient for this. To start with, I have uploaded both data sets into my PostGIS database; I used ogr2ogr in R to upload mygdb. But I am kind of stuck at this point. My idea is to split both polygon files by state and then apply other functions to get I, U and D. From my search, I think I can use sf functions like st_split, st_intersection, st_difference, and st_union. However, even after splitting, I imagine the file sizes will still be too large for R to process, so my questions are:
Is my approach the best way forward?
How can I use sf::st_ functions (e.g. st_split, st_intersection) without importing the data from the database into R?
There are some useful answers to previous relevant questions, like this one for example. But I find it hard to put the steps together from different links and any help with a dummy example would be great. Many thanks in advance.
Maybe you could try loading it as a stars proxy. A proxy doesn't load the file into memory; operations are applied lazily to the data on disk.
https://r-spatial.github.io/stars/articles/stars2.html
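A minimal sketch of the proxy idea (the file name is a placeholder, and note that stars proxies apply to raster-like data sources):

library(stars)
r <- read_stars("big_raster.tif", proxy = TRUE)   # nothing is read into RAM yet
# operations on r are recorded and only evaluated, chunk by chunk, when the
# result is actually needed (e.g. when plotting, writing, or cropping)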
Not an answer to the question sensu stricto, but in response to a request in the comments, here is an example of a PostgreSQL/PostGIS query using ST_Intersection, based on OSM data imported into a PostgreSQL database with osm2pgsql:
WITH
  highway AS (
    SELECT osm_id, way FROM planet_osm_line WHERE osm_id = 332054927),
  dln AS (
    SELECT osm_id, way FROM planet_osm_polygon
    WHERE "boundary" = 'administrative' AND "admin_level" = '4' AND "ref" = 'DS')
SELECT ST_Intersection(dln.way, highway.way) FROM highway, dln;
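To run such a query from R without pulling the full tables into memory, something along these lines should work (connection details are placeholders); the geometry processing stays in PostGIS and only the result is returned:

library(DBI)
library(RPostgres)
library(sf)
con <- dbConnect(RPostgres::Postgres(), dbname = "osm", host = "localhost",
                 user = "me", password = "secret")          # hypothetical credentials
q <- "WITH highway AS (
        SELECT osm_id, way FROM planet_osm_line WHERE osm_id = 332054927),
      dln AS (
        SELECT osm_id, way FROM planet_osm_polygon
        WHERE boundary = 'administrative' AND admin_level = '4' AND ref = 'DS')
      SELECT ST_Intersection(dln.way, highway.way) AS geom FROM highway, dln"
res <- st_read(con, query = q)   # only the intersection geometry comes back to R
dbDisconnect(con)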
I've looked through many pages of how to do this and they essentially all have the same R code suggestions, which I've followed. Here's the R code I'm using for the specific weather station I'm looking for:
library(rnoaa)
options(noaakey="MyKeyHere")
ncdc(datasetid='GHCND', stationid='GHCND:USW00014739', datatypeid='dly-tmax-normal', startdate='2017-05-15', enddate='2018-01-04')
The error message I get when I run this is:
Warning message:
Sorry, no data found
I've gone directly to the NOAA site (https://www.ncdc.noaa.gov/cdo-web/search) and manually pulled the dataset out there (using the "daily summaries" dataset, which is the same as GHCND in the API). There is in fact data there for my entire date range.
What am I missing?
The documentation says:
Note that NOAA NCDC API calls can take a long time depending on the call. The NOAA API doesn't perform well with very long timespans, and will time out and make you angry - beware.
Have you tried a smaller timespan?
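For example, something like this (an untested sketch), splitting the original range into two shorter windows and combining the $data slots of the results:

library(rnoaa)
options(noaakey = "MyKeyHere")
d1 <- ncdc(datasetid = 'GHCND', stationid = 'GHCND:USW00014739',
           datatypeid = 'dly-tmax-normal',
           startdate = '2017-05-15', enddate = '2017-08-31', limit = 1000)
d2 <- ncdc(datasetid = 'GHCND', stationid = 'GHCND:USW00014739',
           datatypeid = 'dly-tmax-normal',
           startdate = '2017-09-01', enddate = '2018-01-04', limit = 1000)
combined <- rbind(d1$data, d2$data)   # each ncdc() result stores its records in $data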
I got "Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server" for some operations in RStudio (Watson Studio) when I tried to do data manipulation on Spark data frames.
Background:
The data is stored on IBM Cloud Object Storage (COS). The full data set will be several 10 GB files, but currently I'm testing on only the first subset (10 GB).
The intended workflow is: in RStudio (Watson Studio), connect to Spark (free plan) using sparklyr, read the file as a Spark data frame through sparklyr::spark_read_csv(), then apply feature transformations to it (e.g., split one column into two, compute the difference between two columns, remove unwanted columns, filter out unwanted rows, etc.). After the preprocessing, write the cleaned data back to COS through sparklyr::spark_write_csv().
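Roughly, the code looks like this (column names, the COS paths, and the connection settings are placeholders; the actual Watson Studio connection differs):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")                     # placeholder connection
df <- spark_read_csv(sc, name = "raw",
                     path = "cos://bucket/part1.csv",     # hypothetical COS path
                     memory = FALSE)
clean <- df %>%
  mutate(diff = col_a - col_b) %>%                        # hypothetical columns
  filter(!is.na(diff)) %>%
  select(-unwanted_col)
spark_write_csv(clean, path = "cos://bucket/part1_clean.csv")
spark_disconnect(sc)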
To work with Spark I added two Spark services to the project (it seems any Spark service under the account can be used by RStudio; RStudio isn't limited to one project?). I may need to use R notebooks for data exploration (to show the plots in a nice way), so I created the project for that purpose. In previous testing I found that R notebooks and RStudio cannot use the same Spark service at the same time, so I created two Spark services: the first for R notebooks (let's call it spark-1) and the second for RStudio (call it spark-2).
As I personally prefer sparklyr (pre-installed only in RStudio) over SparkR (pre-installed only in R notebooks), for almost the whole week I was developing and testing code in RStudio using spark-2.
I'm not very familiar with Spark, and currently it behaves in ways I don't really understand. It would be very helpful if anyone could give suggestions on any of these issues:
1) failure to load data (occasionally)
It worked quite stably until yesterday, when I started to encounter issues loading data using exactly the same code. The error message doesn't say anything useful, but R fails to fetch the data (Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server). What I have observed several times is that after I get this error, if I run the data-import code again (just one line of code), the data loads successfully.
Q1 screenshot
2) failure to apply a (possibly) large number of transformations (always, regardless of data size)
To check whether the data is transformed correctly, I print out the first several rows of the variables of interest after each transformation step (most of the steps are independent, i.e., their order doesn't matter). I have read a little about how sparklyr translates operations: basically, sparklyr doesn't actually apply the transformations to the data until you preview or print some of the result. After a certain number of transformations, if I run a few more and then print the first several rows, I get an error (the same unhelpful error as in Q1). I'm sure the code is right, because if I run those additional steps right after loading the data, I can print and preview the first several rows.
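For illustration, roughly what I mean (df and the column names are hypothetical); one thing I have been trying is materializing intermediate results so later previews don't re-run the whole accumulated plan:

step1 <- df %>% mutate(x2 = x * 2)             # nothing runs yet; sparklyr only builds a SQL plan
head(step1)                                    # previewing/printing triggers execution of the plan
step1_cached <- compute(step1, name = "step1") # dplyr::compute() materializes the result in Spark,
                                               # so later steps start from the cached table instead
                                               # of replaying every earlier transformation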
3) failure to collect data (always for the first subset)
By collecting data I mean pulling the data frame down to the local machine, here RStudio in Watson Studio. After applying the same set of transformations, I'm able to collect the cleaned version of a sample data set (originally 1,000 rows x 158 cols, about 1,000 rows x 90 cols after preprocessing), but it fails on the first 10 GB subset file (originally 25,000,000 rows x 158 cols, at most 50,000 rows x 90 cols after preprocessing). The space it takes up should not exceed 200 MB in my estimation, which means it should fit into either Spark RAM (1210 MB) or RStudio RAM. But it just fails (again with that unhelpful error).
4) failure to save out data (always, regardless of data size)
The same error occurs every time I try to write the data back to COS. I suppose this has something to do with the transformations; maybe something goes wrong when Spark receives too many transformation requests?
5) failure to initialize Spark (some kind of pattern found)
Starting this afternoon, I cannot initialize spark-2, which I had been using for about a week; I get the same unhelpful error message. However, I am able to connect to spark-1.
I checked the spark instance information on IBM Cloud:
(screenshots of the spark-2 and spark-1 instance pages)
It's weird that spark-2 shows 67 active tasks, given that my previous operations returned error messages. Also, I'm not sure why the "input" figure is so large for both Spark instances.
Does anyone know what happened and why it happened?
Thank you!
I'm having trouble extracting point data from a large shapefile (916.2 MB, 4,618,197 elements - from here: https://earthdata.nasa.gov/data/near-real-time-data/firms/active-fire-data) in R. I'm using readShapeSpatial in maptools to read in the shapefile, which takes a while but eventually works:
worldmap <- readShapeSpatial("shp_file_name")
I then have a data.frame of coordinates that I want to extract data for. However, R is really struggling with this and either loses the connection or freezes, even with just one set of coordinates!
pt <-data.frame(lat=-64,long=-13.5)
pt<-SpatialPoints(pt)
e<-over(pt,worldmap)
Could anyone advise me on a more efficient way of doing this?
Or is it the case that I need to run this script on something more powerful (I'm currently using a Mac mini with a 2.3 GHz processor)?
Many thanks!
By 'point data' do you mean the longitude and latitude coordinates? If that's the case, you can obtain the data underlying the shapefile with:
worldmap@data
You can view this in the same way you would any other data frame, for example:
View(worldmap@data)
You can also access columns in this data frame in the same way you normally would, except you don't need the @data, e.g.:
worldmap$LATITUDE
Finally, it is recommended to use readOGR from the rgdal package rather than maptools::readShapeSpatial as the former reads in the CRS/projection information.
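For example (the dsn path and layer name are assumptions); note that over() expects the point's coordinates ordered longitude then latitude and in the same CRS as the polygons:

library(rgdal)
library(sp)
worldmap <- readOGR(dsn = ".", layer = "shp_file_name")   # reads geometry plus CRS
pt <- SpatialPoints(data.frame(long = -13.5, lat = -64),
                    proj4string = CRS(proj4string(worldmap)))
e <- over(pt, worldmap)   # attributes of the polygon containing the point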
I'm new to R, and I have to write commands to read a file containing real values and then compute and plot a histogram of their distribution, using 100 subintervals.
I've been having some problems using the hist() function.
This is what I do to read the data:
values = read.table("filepath.txt");
filepath.txt contains real values (2509.92, 615.41, 417.031, ... , 0.0516073, 0.023377, 0.00681471).
Then I tried to follow these instructions (http://msenux.redwoods.edu/math/R/hist.php), but it did not work: after applying as.numeric(), R seems to treat the data as integers and all the values come out as 1.0.
What should I do?
Thanks a lot!
If your "filepath.txt" is exactly as you show, it is a comma-separated file, and you need to specify such appropriately in your read.table call. That may be all you need to do.
The info on your referenced page has nothing to do with reading or converting data, so I'm not sure why you are asking about histogram generation when you know your source data is bad.
However, I'm not sure, because your question is a little imprecise: there's no such thing as "the system." If you can provide the exact R code you are using to read the data file, and clarify whether "all values are set to 1.0" refers to the contents of your variable values or to the output of hist, we can guide you further.
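For example, assuming the file really is a single comma-separated list of real numbers:

values <- scan("filepath.txt", sep = ",")   # returns a plain numeric vector
hist(values, breaks = 100)                   # roughly 100 bins, chosen by pretty()
# for exactly 100 subintervals, pass the breakpoints explicitly:
hist(values, breaks = seq(min(values), max(values), length.out = 101))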