R - Efficiently create dataframe from large raster excluding NA values - r

Apologies for cross-posting something similar on the GIS Stack Exchange.
I am looking for a more efficient way to create a frequency table based on a large raster in R.
Currently, I have a few dozen rasters, each with ~150 million cells, and I need to create a frequency table for each. These rasters are derived from masking a base raster with a few hundred small sampling locations*. The rasters I am creating the tables from therefore contain ~99% NA values.
My current working approach is this:
sampling_site_raster <- raster("FILE")
base_raster <- raster("FILE")
sample_raster <- mask(base_raster, sampling_site_raster)
DF <- as.data.frame(freq(sample_raster, useNA='no', progress='text'))
### run time for the freq() process ###
user system elapsed
162.60 4.85 168.40
This uses the freq() function from the raster package in R. The useNA='no' argument drops the NA values.
My questions are:
1) is there a more efficient way to create a frequency table from a large raster that is 99% NA values?
or
2) is there a more efficient way to derive the values from the base raster than by using mask()? (Using the Mask GP function in ArcGIS is very fast, but the result still has the NA values and it is an extra step.)
*additional info: The sample areas represented by sampling_site_raster are irregular shapes of various sizes spread randomly across the study area. In the sampling_site_raster the sampling sites are encoded as 1 and non-sampling areas as NA.
Thank you!

If you mask the raster by raster, you will always get another huge raster. I don't think this is a way to make things faster.
What I would do is to try to mask by polygon layer using extract:
res <- extract(raster, polygons)
Then you will have all the cell values for each polygon and can run freq on them.
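As a minimal sketch of the "drop the NAs first, then count" idea, the snippet below tabulates a plain numeric vector. The vector `vals` is a hypothetical stand-in for the cell values of the masked raster (for example as returned by raster::getValues(), assuming they fit in memory); it is not the asker's data.

```r
# Hypothetical cell values standing in for getValues(sample_raster);
# most cells are NA, as in the masked raster
vals <- c(NA, 3, NA, NA, 5, 3, NA, 3, NA, NA)

# Drop the NA cells first, then tabulate only what is left
kept <- vals[!is.na(vals)]
DF <- as.data.frame(table(value = kept), responseName = "count",
                    stringsAsFactors = FALSE)

# table() stores the values as labels, so convert them back to numbers
DF$value <- as.numeric(DF$value)
DF
```

Whether this beats freq() on a 150-million-cell raster would have to be timed; the point is only that the table is built from the roughly 1% of cells that carry data.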

Related

aggregate statistics on each cell of grid and plot heatmap in R

Below is the cycle hire data of London. Each point represents one cycle hire point.
I have created a grid using st_make_grid(). Now I wish to:
1. plot a heatmap of the number of cycle hire points in each cell of the grid
2. plot a heatmap of the total nbikes in each cell of the grid
(nbikes is the number of bikes currently parked)
library(spData)
library(sf)
# cycle hire data of London
# Each observation represents a cycle hire point in London.
hire_sf <- spData::cycle_hire
head(hire_sf)
# create grid
grid_area <- st_make_grid(hire_sf)
# 1. plot heatmap of number of cycle hire point in each cell
# 2. plot heatmap of total nbikes in each cell
# (nbikes - The number of bikes currently parked)
This is indeed a duplicate; but we may as well offer a possible solution. So consider the following code:
It is built on sf::st_join() which spatially joins two sf objects (in this case grid and points) while preserving the data attributes.
Note that the join is by default a left join (in SQL speak), so all rows (grid cells) of the first object are kept. There will be NAs for cells with no hires, and duplicate rows for cells with multiple points (so be sure to assign each cell a unique ID in advance, to make aggregation easier).
The type of the first object in join drives the resulting geometry type, so be sure to start with grid if you want to end up with polygon type result / starting with points you would get point result.
Once the points are assigned to grid cells it is an exercise in aggregation - I suggest via {dplyr} techniques, but base R would do as well.
For the final heatmap you will likely want ggplot for polished results, but base plot will do for a proof of concept.
library(spData)
library(sf)
library(dplyr)
# cycle hire data of London
# Each observation represents a cycle hire point in London.
hire_sf <- spData::cycle_hire
head(hire_sf)
# create grid
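grid_area <- st_make_grid(hire_sf) %>%
  st_as_sf() %>%
  mutate(grid_id = 1:n())
# join data to grid; make sure to join points to grid
# the first object drives the output geometry
result <- grid_area %>%
  st_join(hire_sf) %>%
  group_by(grid_id) %>%
  # empty cells keep one all-NA row from the left join, so count
  # non-NA values instead of rows and ignore NAs when summing
  summarise(point_count = sum(!is.na(nbikes)),
            total_bikes = sum(nbikes, na.rm = TRUE))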
# draw heatmap
plot(result["point_count"])
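The effect of the left join on the aggregation can be seen without any spatial packages. The sketch below uses base R only; the `cells` and `hires` data frames are made-up stand-ins for the grid and the points after they have been matched to cells.

```r
# Hypothetical grid cells and hire points; cell 2 contains no points
cells <- data.frame(grid_id = 1:3)
hires <- data.frame(grid_id = c(1, 1, 3), nbikes = c(5, 7, 2))

# Left join: every cell is kept, and the empty cell gets a single NA row
joined <- merge(cells, hires, by = "grid_id", all.x = TRUE)

# Counting rows reports 1 for the empty cell; counting non-NA values reports 0
row_count   <- tapply(joined$nbikes, joined$grid_id, length)
point_count <- tapply(!is.na(joined$nbikes), joined$grid_id, sum)
total_bikes <- tapply(joined$nbikes, joined$grid_id, sum, na.rm = TRUE)

data.frame(grid_id = names(point_count), row_count, point_count, total_bikes)
```

In a heatmap the difference matters mostly for empty cells, which would otherwise be drawn as if they held one point.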

How to plot data from Excel using the R corrplot function?

I am trying to learn R, and use the corrplot library to draw Y:City and X: Population graph. I wrote the below code:
When you look at the picture above, there are 2 columns City and population. When I run the code I get this error message:
Error in cor(Illere_Gore_Nufus) : 'x' must be numeric.
My excel data:
In general, a correlation plot (scatter plot) can be drawn only when you have two continuous variables. Correlation is a value that tells you how strongly two continuous variables are linearly related. It always falls between -1 and 1: a value of -1 indicates a perfect negative linear relationship, a value of 1 indicates a perfect positive linear relationship, and values near either extreme indicate a strong relationship. A value of 0 means there is no linear relationship between the two variables, although there could still be a curvilinear relationship between them.
For example
Area of the land Vs Price of the land
Here is the Data
The correlation value for this data is 0.896, which means that there is a strong linear correlation between Area of the land and Price of the land (Obviously!).
Scatter plot in R would look like this
Scatter plot
The R code would be
area<-c(650,785,880,990,1100,1250,1350,1800,2200,2800)
price<-c(250,275,280,290,350,340,400,335,420,460)
cor(area,price)
plot(area,price)
In Excel, for the same example, you can select the two columns, go to Insert > Scatter plot (under charts section)
Scatter plot
In your case, the information can be plotted as a bar graph with city on the y axis and population on the x axis, or vice versa!
Hope I have answered your query!
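A bar graph of that kind could be sketched in base R as follows. The city names and population figures are placeholders, since the question's actual data is not shown.

```r
# Hypothetical city/population values; the question's real data is not shown
cities <- data.frame(City = c("City A", "City B", "City C"),
                     Population = c(1500000, 820000, 430000))

# Horizontal bars put the city on the y axis and the population on the x axis
mids <- barplot(cities$Population, names.arg = cities$City,
                horiz = TRUE, las = 1, xlab = "Population")
```

Dropping horiz = TRUE swaps the axes.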
Some assumptions
You are asking how to do this in Excel, but your question is tagged R and Power BI (also RStudio, but that has been edited away), so I'm going to show you how to do this with R and Power BI. I'm also going to show you why you got that error message, and also why you would get an error message either way because your dataset is just not sufficient to make a correlation plot.
My answer
I'm assuming you would like to make a correlation plot of the population between the cities in your table. For that, you'd need more information than only one year for each city. I would check your data sources and see if you could come up with population numbers for, let's say, the last 10 years. Lacking the exact numbers for the cities in your table, I'm going to use some semi-made-up numbers for the population of the 10 most populous countries (following your data structure):
Country 2017 2016 2015 2014 2013
China 1415045928 1412626453 1414944844 1411445597 1409517397
India 1354051854 1340371473 1339431384 1343418009 1339180127
United States 326766748 324472802 325279622 324521777 324459463
Indonesia 266794980 266244787 266591965 265394107 263991379
Brazil 210867954 210335253 209297939 209860881 209288278
Pakistan 200813818 199761249 200253292 197655630 197015955
Nigeria 195875237 192568158 195757661 191728478 190886311
Bangladesh 166368149 165630262 165936711 166124290 164669751
Russia 143964709 143658415 143146914 143341653 142989754
Mexico 137590740 137486490 136768870 137177870 136590740
Writing and debugging R code in Power BI is a real pain, so I would recommend installing RStudio, writing your little R snippets there, and then pasting them into Power BI.
The reason for your error message is that the function cor() only takes numerical data as arguments. In your code sample, the city names are passed as arguments. There are more potential traps in your code sample: you have to make sure that your dataset is numeric, and you have to make sure that your dataset has a shape that cor() will accept.
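The error and the two requirements can be reproduced on a small scale. This sketch uses three rows and three years copied from the table above; the object names `pop`, `num` and `by_country` are only illustrative.

```r
# A few rows of the population table, with the country column kept as text
pop <- data.frame(
  Country = c("China", "India", "Brazil"),
  `2017` = c(1415045928, 1354051854, 210867954),
  `2016` = c(1412626453, 1340371473, 210335253),
  `2015` = c(1414944844, 1339431384, 209297939),
  check.names = FALSE
)

# cor() on the whole table fails because Country is not numeric
msg <- tryCatch(cor(pop), error = function(e) conditionMessage(e))

# Keep the names as row labels and only the numeric columns as data
rownames(pop) <- pop$Country
num <- pop[sapply(pop, is.numeric)]

# Transpose so that each country becomes a column, then correlate
by_country <- cor(t(num))
by_country
```

Here `msg` holds the text of the error raised by cor(), and `by_country` is a 3 x 3 matrix with one row and one column per country.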
Below is an R script that will do just that. Copy the data above, and store it in a file called data.xlsx on your C drive.
The Code
library(corrplot)
library(readxl)
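# Read data; read_excel() returns a tibble, so convert it to a plain
# data frame before setting row names
setwd("C:/")
data <- as.data.frame(read_excel("data.xlsx"))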
# Set Country names as row index
rownames(data) <- data$Country
# Remove Country from dataframe
data$Country <- NULL
# Transpose data into a readable format for cor()
data <- data.frame(t(data))
# Plot data
corrplot(cor(data))
The plot
Power BI
In Power BI, you need to import the data before you use it in an R visual:
Copy this:
Country,2017,2016,2015,2014,2013
China,1415045928,1412626453,1414944844,1411445597,1409517397
India,1354051854,1340371473,1339431384,1343418009,1339180127
United States,326766748,324472802,325279622,324521777,324459463
Indonesia,266794980,266244787,266591965,265394107,263991379
Brazil,210867954,210335253,209297939,209860881,209288278
Pakistan,200813818,199761249,200253292,197655630,197015955
Nigeria,195875237,192568158,195757661,191728478,190886311
Bangladesh,166368149,165630262,165936711,166124290,164669751
Russia,143964709,143658415,143146914,143341653,142989754
Mexico,137590740,137486490,136768870,137177870,136590740
Save it as countries.csv in a folder of your choosing, and pick it up in Power BI using
Get Data | Text/CSV, click Edit in the dialog box, and in the Power Query Editor, click Use First Row as headers so that you have this table in your Power Query Editor:
Click Close & Apply and make sure that you've got the data available under VISUALIZATIONS | FIELDS:
Click R under VISUALIZATIONS:
Select all columns under FIELDS | countries so that you get this setup:
Take parts of your R snippet that we prepared above
library(corrplot)
# Set Country names as row index
data <- dataset
rownames(data) <- data$Country
# Remove Country from dataframe
data$Country <- NULL
# Transpose data into a readable format for cor()
data <- data.frame(t(data))
# Plot data
corrplot(cor(data))
And paste it into the Power BI R script Editor:
Click Run R Script:
And you're gonna get this:
That's it!
If you change the procedure to import the data from an Excel file instead of a text file (using Get Data | Excel), you've successfully combined the powers of Excel, Power BI and R to produce a correlation plot!
I hope this is what you were looking for!

Dividing Individual Spatial Polygons Equally in R

I have a shapefile of polygons that are the townships in the state of Iowa. I'd like to divide each element (i.e. each township) into 9 equal parts (i.e. a 3 x 3 grid for each township). I've figured out how to do this, but am having trouble forming a new dataframe out of the new polygons. My code is below. The data can be downloaded here: https://ufile.io/wi6tt
library(sf)
library(tidyverse)
setwd("~/Desktop")
iowa<-st_read( dsn="Townships/iowa", layer="PLSS_Township_Boundaries", stringsAsFactors = F) # import data
## Make division
r<-NULL
for (row in 1:nrow(iowa)) {
  r[[row]]<-st_make_grid(iowa[row,],n=c(3,3))
}
# Combine together
region<-NULL
for (row in 1:nrow(iowa)) {
  region<-rbind(region,r[[row]])
}
region<-st_sfc(region,crs=4326) #convert to sfc
reg_id<-data.frame(reg_id=1:length(region)) #make ID for dataframe
# Make SF
region_df<-st_sf(reg_id,region)
The last line gives the following error:
Error in `[[<-.data.frame`(`*tmp*`, all_sfc_names[i], value = list(list( : replacement has 1644 rows, data has 14796
1644 is the number of rows in the initial Iowa dataframe.
Clearly the number of rows does not match the number of elements.
This might be a general R thing, rather than a spatial one, but I figured I'd post the whole thing in case someone had an idea on how to do the entirety of this a little cleaner.
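One possible way around the error, offered as a sketch rather than a tested fix for the Iowa data: rbind() on sfc objects builds a matrix, whereas c() concatenates them into one longer sfc. The two square "townships" below are hypothetical stand-ins for the shapefile, and the names `sq`, `towns` and `grids` are illustrative.

```r
library(sf)

# Two hypothetical square "townships" standing in for the Iowa shapefile
sq <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}
towns <- st_sf(town_id = 1:2,
               geometry = st_sfc(sq(0, 0), sq(2, 0), crs = 4326))

# One 3 x 3 grid per township, kept as a list of sfc objects
grids <- lapply(seq_len(nrow(towns)),
                function(i) st_make_grid(towns[i, ], n = c(3, 3)))

# c() concatenates sfc objects; rbind() would build a matrix instead
region <- do.call(c, grids)

# One row per grid cell, with the township each cell came from
region_df <- st_sf(reg_id = seq_along(region),
                   town_id = rep(towns$town_id, sapply(grids, length)),
                   geometry = region)
nrow(region_df)
```

Keeping `town_id` alongside `reg_id` makes it possible to trace every cell back to its township.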

spatial join on two simple features {sf} with over 1 mil. entries as fast as possible

I hope this is not too trivial but I really can't find an answer and I'm too new to the topic to come up with alternatives myself. So here is the Problem:
I have two shapefiles x and y that represent different processing levels of a Sentinel2 satellite image.
x contains about 1,300,000 polygons/segments completely covering the image extent, without any further vital information.
y contains about 500 polygons representing the cloud-free area of the image (also covering most of the image except for a few "cloud-holes") as well as information about the image used in 4 columns (Sensor, Time...)
I'm trying to add the image information to x in places where x is covered by y. Pretty simple? I just can't find a way to make it happen without taking days.
I read x in as a simple feature {sf}, as reading it with shapefile / readOGR takes ages.
I tried different things with y
When I try merge(x,y), I can only pass one sf object, as merge doesn't support two sf objects.
Merging x (as sf) and y (as shp) gives me the error "cannot allocate vector of size 13.0 Gb".
So I tried sf::st_join(x,y), which accepts two sf objects, but it still hasn't finished after 28 hours.
sf::st_intersect(x,y) took about 9 minutes for a 10,000-segment subset, so that might not be much faster for the whole dataset.
Could subsetting x into a few smaller pieces solve the whole thing, or is there another simple solution? Could I do something with my workspace to make the merge work, or is there simply no shortcut to joining that many polygons?
Thanks a lot in advance and I hope my description isn't too fuzzy!
my tiny work station:
win 7 64 bit
8 GB RAM
Intel i7-4790 @ 3.6 GHz
I often face this kind of problem and, as @manotheshark2 affirms, I prefer to work in a loop, subsetting my vector layer. Here is my advice:
Load your data
library(raster)
library(rgdal)
x <- readOGR('C:/', 'sentinelCovers')
y <- readOGR('C:/', 'cloudHoles')
Assign a y ID to identify which x polygons intersect which y polygons, and create the answer column in the x table
x$xyID <- NA # Answer col
y$yID <- 1:nrow(y@data) # ID col
Run a loop subsetting x
for (posX in 1:nrow(x@data)){
  pol.x <- x[posX, ]
  intX <- raster::intersect(pol.x, y)
  # x$xyID[posX] <- intX@data$yID ## Run this if there's unique y polygons
  # x$xyID[posX] <- paste0(intX@data$yID, collapse = ',') ## Run this if there's multiple y polygons
}
You can check whether it is better to run the loop on the x or the y layer
x$xyID <- NA # Answer col
x$xID <- 1:nrow(x@data) # ID Col
for (posY in 1:nrow(y@data)){
  pol.y <- y[posY, ]
  # return NULL when the intersection fails so the iteration can be skipped
  intY <- tryCatch(raster::intersect(pol.y, x), error = function(e) NULL)
  if (is.null(intY)) next
  x$xyID[x@data$xID %in% intY@data$xID] <- pol.y$yID
}
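A different route from the loops above, staying in sf: sf::st_intersects() (with a trailing s) returns only an index of which y polygons each x polygon touches, without building intersection geometries, and that index can be used to copy attributes across. The tiny grid and the `sensor` value below are hypothetical stand-ins for the Sentinel-2 layers, and whether this is fast enough for 1.3 million segments would need to be measured.

```r
library(sf)

# Hypothetical stand-ins: x is a 4 x 4 grid of segments, y is one polygon
# carrying the image metadata and covering only the left part of x
bb <- st_bbox(c(xmin = 0, ymin = 0, xmax = 4, ymax = 4))
x <- st_sf(seg_id = 1:16,
           geometry = st_make_grid(st_as_sfc(bb), n = c(4, 4)))
y <- st_sf(sensor = "S2A",
           geometry = st_as_sfc(st_bbox(c(xmin = 0, ymin = 0,
                                          xmax = 1.5, ymax = 4))))

# Sparse list: for each row of x, the indices of the y polygons it intersects
hits <- st_intersects(x, y)

# Take the first matching y polygon per segment (NA when there is none)
first_hit <- vapply(hits,
                    function(i) if (length(i)) i[1] else NA_integer_,
                    integer(1))

# Copy the attribute across by index; unmatched segments stay NA
x$sensor <- y$sensor[first_hit]
```

Taking only the first hit is a simplification; a segment that straddles two y polygons would need a rule for choosing between them.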

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store the two columns in two separate vectors, area and energy respectively.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do the following:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
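The column-access options can be checked on the first three rows of the file from the question. The data is read from an inline string here (instead of foo.txt) only so that the sketch is self-contained.

```r
# First rows of foo.txt, read from an inline string instead of a file
txt <- "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10"
tbl <- read.csv(text = txt, header = TRUE)

# Columns are addressed by their header names, not by "first"/"second"
area   <- tbl$area
energy <- tbl[["energy"]]

# A name that does not exist returns NULL without an error,
# which is why tbl$first appeared not to work
is.null(tbl$first)
```

Both `area` and `energy` end up as plain numeric vectors of length 3.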

Resources