I have about 500,000 occurrence points in R for a migratory bird species throughout the US.
I am attempting to overlay a grid on these points and then count the number of occurrences in each grid cell. Once the counts have been tallied, I want to reference them to a grid cell ID.
In R, I've used the over() function to just get the points within the range map, which is a shapefile.
#Read in occurrence data
data=read.csv("data.csv", header=TRUE)
coordinates(data)=c("LONGITUDE","LATITUDE")
#Get shapefile of the species' range map
range=readOGR(".",layer="data")
proj4string(data)=proj4string(range)
#Get points within the range map
inside.range=!is.na(over(data,as(range,"SpatialPolygons")))
The above worked exactly as I hoped, but does not address my current problem: how to deal with points stored as a SpatialPointsDataFrame and a grid that is a raster. Would you recommend polygonizing the raster grid and using the same method I indicated above? Or would another process be more efficient?
First of all, your R code doesn't work as written. I would suggest copy-pasting it into a clean session, and if it errors out for you as well, correcting syntax errors or including add-on libraries until it runs.
That said, I assume that you are supposed to end up with a data.frame of two-dimensional numeric coordinates. For the purposes of binning and counting them, any such data will do, so I took the liberty of simulating such a dataset. Please correct me if this doesn't capture a relevant aspect of your data.
## Skip this line if you are the OP, and substitute the real data instead.
data <- data.frame(LATITUDE = runif(100, 1, 100), LONGITUDE = runif(100, 1, 100))
## Add the latitude and longitude bins between which each observation is located.
## You can substitute any number of breaks you want, or a vector of fixed cutpoints.
## LATgrid and LONgrid are going to be factors, with ugly level names.
data$LATgrid <- cut(data$LATITUDE, breaks = 10, include.lowest = TRUE)
data$LONgrid <- cut(data$LONGITUDE, breaks = 10, include.lowest = TRUE)
## Create a single factor that gives the lat,long bin of each observation.
data$IDgrid <- with(data, interaction(LATgrid, LONgrid))
## Now, create another factor based on the above one, with shorter IDs and no empty levels.
data$IDNgrid <- factor(data$IDgrid)
levels(data$IDNgrid) <- seq_along(levels(data$IDNgrid))
## If you want the total grid-cell count repeated for each observation falling into that grid cell, do this:
data$count <- ave(data$LATITUDE, data$IDNgrid, FUN = length)
## You could also have used data$LONGITUDE; it doesn't matter in this case.
## If you want just a table of counts at each grid cell, do this:
aggregate(data$LATITUDE, data[, c('LATgrid', 'LONgrid', 'IDNgrid')], FUN = length)
## The LATgrid and LONgrid vectors are included so there is some sort of descriptive
## reference accompanying the anonymous numbers in IDNgrid, but only IDNgrid is actually necessary.
## If you want a really minimalist table, you could do this:
table(data$IDNgrid)
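Since the grid in your current problem is a raster rather than polygons, you may not need to polygonize it at all. A hedged sketch of the raster-based route, going back to the SpatialPointsDataFrame data from your question and assuming the raster package, a grid already loaded as a RasterLayer called r, and a matching CRS:
library(raster)
## cellFromXY() returns the ID of the grid cell each occurrence point falls in
cellID <- cellFromXY(r, data)
counts <- as.data.frame(table(cellID))            # one row per occupied cell: cell ID and count
names(counts) <- c("cellID", "n.occurrences")
## or, if you would rather get the counts back as a raster layer:
count.raster <- rasterize(data, r, fun = "count")
Either way you end up with counts keyed to a grid-cell ID, which is what you were after.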
I'm building a transition matrix of land use change (states) over the years.
I'm therefore comparing shapefiles year after year and building a data frame with:
Landuse year1 - Landuse year2 - ... - ID - centroid
using the following function:
full_join(landuse1, landuse2, by = "centroid")
where centroid is the actual centroid of each polygon; a centroid is basically a vector of two numeric values.
However, the centroid can shift slightly from year to year (because the polygon itself changes a little bit), which leads to incomplete joins, because full_join() requires the centroids to match exactly.
I'd like to include a "more or less" tolerance, so that any centroid close enough to the one from the year before can be joined to the data frame for that particular polygon.
But I'm not sure how?
Thank you in advance.
The general term for what you are trying to do is fuzzy matching. I'm not sure exactly how it would work for the coordinates of a centroid, but my idea would be to calculate the distance between the coordinates, set a margin of error (say 0.5%), and declare a match whenever two centroids deviate from each other by less than that. Basically, loop through your list of locations and give each match a unique ID, which you can then use for the join.
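A rough sketch of that idea in R (everything here is an assumption: landuse1 and landuse2 are taken to be plain data frames with one row per polygon and hypothetical numeric centroid columns cx and cy, and the tolerance is made up; adapt it to your data and CRS units):
## For each year-2 polygon, find the nearest year-1 centroid and accept it only
## if it lies within the tolerance; NA means "no match close enough".
tol <- 10   # maximum allowed centroid shift, in map units (made-up value)
match_id <- vapply(seq_len(nrow(landuse2)), function(i) {
  d <- sqrt((landuse1$cx - landuse2$cx[i])^2 + (landuse1$cy - landuse2$cy[i])^2)
  j <- which.min(d)
  if (d[j] <= tol) j else NA_integer_
}, integer(1))
landuse1$match_id <- seq_len(nrow(landuse1))
landuse2$match_id <- match_id
library(dplyr)
joined <- full_join(landuse1, landuse2, by = "match_id")
If the polygons live in an sf object instead, sf::st_is_within_distance() (or a spatial join with a distance predicate) would get you the same thing with less hand-rolled code.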
I have a data frame (pLog) containing the number of reads per nucleotide for a ChIP-seq experiment done on an E. coli genome (4.6 Mb). I want to plot the chromosomal position on the X axis and the number of reads on the Y axis. To make it easier, I binned the data into windows of 100 bp, which gives a data frame of 46,259 rows and 2 columns. One column is named "position" and holds a chromosomal position (1, 101, 201, ...), and the other is named "values" and contains the number of reads found in that bin, e.g. (210, 511, 315, ...). I have been using ggplot for all my analysis and I would like to use it for this plot, if possible.
I am trying to get the graph to look something like this:
but I haven't been able to plot it.
This is what my data looks like:
I tried
ggplot(pLog, aes(position)) +
  geom_histogram(binwidth = 50)
ggsave("file.jpg")
And this is what it looks like :(
Many thanks!
You can't use geom_histogram() here, since your data are already binned counts; try geom_line() instead:
library(ggplot2)
pLog <- data.frame(position = seq(1, 100000, by = 100),
                   value = rnbinom(1000, mu = 100, size = 20))
ggplot(pLog, aes(x = position, y = value)) +
  geom_line(alpha = 0.7, col = "steelblue")
Most likely you'll need to play around a bit to get exactly the visualization you want.
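If you'd rather have bars than a line (closer to the histogram look you were after), geom_col() draws one bar per pre-binned window; a small variation on the same sketch:
ggplot(pLog, aes(x = position, y = value)) +
  geom_col(width = 100, fill = "steelblue")   # width = 100 matches the 100 bp bins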
I have my polygons stored in a SpatialPolygonsDataFrame and my coordinates in a data frame.
The output I want is just an additional column in my data frame that tags the OBJECTID (the id of the polygon from the shapefile) that each coordinate falls into.
My problem is kind of the same as this one,
but its output is a little bit different. It's also quite slow: tagging just 4 coordinates took more than 5 minutes, and I'm going to be tagging 16k coordinates, so is there a way to do it faster?
The current methods I know about wouldn't do exactly that (i.e., produce one polygon id per coordinate), because they're generalized for the case where one point is contained in multiple (overlapping) polygons.
See sp::over(), which used to be called overlay().
Example (here sr is a set of polygons and meuse is the example point dataset shipped with sp):
over(sr, geometry(meuse), returnList = TRUE)
over(sr, meuse, returnList = TRUE)
Possible duplicates (it's hard to tell without seeing your example data):
Extracting points with polygon in R
Intersecting Points and Polygons in R
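For the specific output described in the question (a single OBJECTID column added to the coordinate data frame), a hedged sketch along the following lines should be quick even for 16k points; coords.df, polys and the LONGITUDE/LATITUDE column names are assumptions, so rename to match your objects:
library(sp)
## Turn the plain data frame of coordinates into SpatialPoints in the polygons' CRS
pts <- SpatialPoints(coords.df[, c("LONGITUDE", "LATITUDE")],
                     proj4string = CRS(proj4string(polys)))
## over() returns one row of polygon attributes per point (NA where a point is in no polygon)
hits <- over(pts, polys)
coords.df$OBJECTID <- hits$OBJECTID
If polygons overlap, only one match per point is reported this way; use returnList = TRUE to see all of them.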
I have some classified raster layers, i.e. categorical land cover maps. All the layers have exactly the same categories (let's say "water", "Trees", "Urban", "bare soil"), but they are from different time points (e.g. 2005 and 2015).
I load them into memory using the raster() function like this:
comp <- raster("C:/workingDirectory4R/rasterproject/2005marsh3.rst")
ref <- raster("C:/workingDirectory4R/rasterproject/2013marsh3.rst")
"comp" is the comparison map at time t+1 and "ref" is the reference map from time t. Then I used the crosstab function to generate the confusion table. This table can be used to explore the changes in categories through the time interval.
contingency.Matrix <- crosstab(comp, ref)
The result is in matrix format, with the "comp" categories in the columns and the "ref" categories in the rows, and the column and row names labeled with the numbers 1 to 4.
Now I have two questions, and I'd really appreciate any help on how to solve them.
1- I want to assign the category names to the columns and rows of the matrix, to facilitate its interpretation.
2- Now let's say I have three raster layers, for 2005, 2010 and 2015. This means I would have two confusion tables, one for 2005-2010 and another one for 2010-2015. What's the best procedure to automate this process with minimal interaction from the user?
I thought of asking the user to load the raster layers and having the code save them in a list, and then asking the user for a vector of years. But the problem is: how can I make sure that the order of the raster layers and the years is the same? And is there a more elegant way to do this?
Thanks
I found a partial answer to my first question. If the categorical map was created in the TerrSet (IDRISI) software with the ".rst" extension, then I can extract the category names like this:
comp <- raster("C:/rasterproject/2005subset.rst")
attributes <- data.frame(comp#data#attributes)
categories <- as.character(attributes[,8])
and I get a vector with the names of the categories. However, if the raster layers were created in a different format, the code won't work as-is. For instance, if the raster was created in ENVI, then the third line of the code should be changed to:
categories <- as.character(attributes[,2])
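Building on that, here is a hedged sketch for the second question. The file names, the four category labels and the assumption that every category occurs in every layer (so each crosstab really is 4 x 4) are all mine; the idea is simply to keep the layers in a list named by year, with the year read from the file name itself so the order cannot get mixed up, and then loop over consecutive pairs.
library(raster)
## Hypothetical file names; the year is taken from the name itself
files <- c("2005marsh3.rst", "2010marsh3.rst", "2015marsh3.rst")
years <- as.numeric(substr(basename(files), 1, 4))
layers <- lapply(files[order(years)], raster)
names(layers) <- sort(years)
categories <- c("water", "Trees", "Urban", "bare soil")   # or extracted as shown above
tables <- list()
for (i in seq_len(length(layers) - 1)) {
  m <- crosstab(layers[[i + 1]], layers[[i]])   # comp = t+1, ref = t, as in the question
  ## If crosstab() gives you the matrix/table format described above, label both dimensions
  dimnames(m) <- list(categories, categories)
  tables[[paste(names(layers)[i], names(layers)[i + 1], sep = "-")]] <- m
}
Afterwards tables[["2005-2010"]] and tables[["2010-2015"]] hold the two labeled confusion tables.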
I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be held in a single vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: the values are just integers, and many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To spread the 500k samples across the distribution, I will first create a probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, I convert these probabilities to positions in the original sequence.
position.vec <- prob.vec * 11034432564
The reason I create the position vector is so that I can pick the data point at that specific position after I order the population data.
Now I count the occurrences of each integer value in the population and create a data frame with the integer values and their counts. I also create the interval for each of these values (a small sketch of this step follows the table):
integer.values       counts    lw.interval    up.interval
             0  300,000,034              0    300,000,034
             1  169,345,364    300,000,034    469,345,398
             2  450,555,321    469,345,399    919,900,719
...
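For what it's worth, the two interval columns can be derived directly from a cumulative sum of the counts; a small sketch, assuming the per-value counts have already been aggregated into a data frame tab sorted by integer.values:
tab$up.interval <- cumsum(as.numeric(tab$counts))                # as.numeric() avoids integer overflow past 2^31
tab$lw.interval <- tab$up.interval - as.numeric(tab$counts) + 1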
Now, using the position vector, I identify which interval each position value falls into and, based on that, get the integer value of that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference: Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process does take a fair amount of time, as the position vector has to be checked against all possible intervals in the data frame. For that reason I have parallelized it using RHIPE.
I understand that I am able to do this only because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly to reduce 11 billion values to 500k.
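If the aggregated count table (tab above) fits comfortably in memory, which it should if the distinct integer values really top out in the low hundreds of thousands, the whole interval lookup can be collapsed into a couple of vectorized calls instead of a scan over every interval. A hedged sketch under that assumption:
cum <- cumsum(as.numeric(tab$counts))                  # upper interval bounds; as.numeric() avoids overflow
N   <- cum[length(cum)]                                # total population size, ~11 billion (fine as a double)
pos <- round(seq(1, N, length.out = 500000))           # evenly spaced positions in the ordered population
idx <- findInterval(pos, cum, left.open = TRUE) + 1    # index of the interval each position falls into
sample.vec <- tab$integer.values[idx]                  # the 500k sampled values
An alternative that also preserves the distribution, without any ordering at all, is a weighted random draw: sample(tab$integer.values, 500000, replace = TRUE, prob = tab$counts).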