Count unique values in Raster Data in R

I have these raster datasets, which look like this:
   1  2  3  4  5
1 NA NA NA 10 NA
2  7  3  7 10 10
3 NA  3  7  3  3
4  9  9 NA  3  7
5  3 NA  7 NA NA
I created that table via:
MyRaster1 <- raster("MyRaster_EUNIS1.tif")
head(MyRaster1)
Using unique(MyRaster1) I get 3 7 9 10.
What I need are the counts of these unique values in the raster dataset.
I have tried quite a few approaches. One almost works, but it is a lot of trouble, and I can't get a loop to work for all the raster datasets I have.
Classes1 <- as.factor(unique(values(MyRaster1)))[!is.na(unique(values(MyRaster1)))]
val1 <- unique(MyRaster1)
Tab1 <- matrix(nrow = length(values(MyRaster1)), ncol = length(val1))
colnames(Tab1) <- levels(unique(Classes1))
Tab1 <- Tab1[!is.na(Tab1[, 1]), ]
colSums(Tab1)
It seems to work properly until I try to delete the NA values: when I use colSums before that, I get NA as the result for each column; after I delete the NA values, I get 0.
This is my first time using R, so I'm a real novice. I've researched quite a lot, but since I hardly understand the language at all, this is the furthest I have gotten.
Thank you for your help.
Edit:
table(MyRaster1)
gives me this:
Error in unique.default(x, nmax = nmax) : unique() applies only to vectors
The best result would be:
 3  7  9 10
 6  5  2  3
But I'd also be ok with a different format which I could use in Excel.

Use raster::freq()
Here's an example for the first two rows of your data:
r <- raster(matrix(c(NA, NA, NA, 10, NA, 7, 3, 7, 10, 10), nrow = 2, ncol = 5, byrow = TRUE))
> freq(r)
     value count
[1,]     3     1
[2,]     7     2
[3,]    10     3
[4,]    NA     4
Note that freq() rounds values by default unless explicitly told not to (see its digits argument):
https://www.rdocumentation.org/packages/raster/versions/3.0-7/topics/freq
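Since you also want a loop over all of your raster datasets, here is a minimal sketch (the file names are placeholders for your own) that builds one frequency table per file and writes each to a CSV you can open in Excel:
library(raster)
# hypothetical file names -- substitute your own
files <- c("MyRaster_EUNIS1.tif", "MyRaster_EUNIS2.tif")
for (f in files) {
  counts <- freq(raster(f), useNA = "no")  # value/count table, NAs excluded
  write.csv(counts, paste0(f, "_freq.csv"), row.names = FALSE)
}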

Related

Convert entire data frame into one long column (vector)

I want to turn the entire content of a numeric (incl. NAs) data frame into one column. What would be the smartest way of achieving the following?
> df <- data.frame(C1 = c(1, NA, 3), C2 = c(4, 5, NA), C3 = c(NA, 8, 9))
> df
  C1 C2 C3
1  1  4 NA
2 NA  5  8
3  3 NA  9
> x <- mysterious_operation(df)
> x
[1]  1 NA  3  4  5 NA NA  8  9
I want to calculate the mean of this vector, so ideally I'd want to remove the NAs within the mysterious_operation; the data frame I'm working on is very large, so that will probably be a good idea.
Here are a couple of ways with purrr:
# using invoke, a wrapper around do.call
purrr::invoke(c, df, use.names = FALSE)
# similar to unlist, reduce list of lists to a single vector
purrr::flatten_dbl(df)
Both return:
[1] 1 NA 3 4 5 NA NA 8 9
The mysterious operation you are looking for is called unlist:
> df <- data.frame(C1 = c(1, NA, 3), C2 = c(4, 5, NA), C3 = c(NA, 8, 9))
> unlist(df, use.names = FALSE)
[1] 1 NA 3 4 5 NA NA 8 9
We can use unlist and create a single column data.frame
df1 <- data.frame(col = unlist(df))
Just for fun. Of course unlist is the most appropriate function.
One alternative:
stack(df)[, 1]
Another alternative:
do.call(c, df)
do.call(c, c(df, use.names = FALSE))  # unnamed version
Maybe they are more mysterious.
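To get the mean the question is ultimately after, dropping the NAs at that point is a one-liner (shown with unlist, but any of the variants above works the same way):
mean(unlist(df, use.names = FALSE), na.rm = TRUE)
# [1] 5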

Difference between ntile and cut and then quantile() function in R

I found two threads on this topic for calculating deciles in R. However, the two methods, dplyr::ntile() and quantile(), yield different output. In fact, dplyr::ntile() seems to fail to output proper deciles.
Method 1: Using ntile()
From the thread R: splitting dataset into quartiles/deciles. What is the right method?, we could use ntile().
Here's my code:
vector <- c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344,
            0.00948031338483081, 0.000549450549450549, 0.085972850678733,
            0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915,
            0.0604885614579294, 0.0352030947775629, 0.00935626135385923,
            0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538,
            0.0224952741020794, NA, NA, 0.000984952654008562)
ntile(vector, 10)
The output is:
 5  5  2  3  1  7  1 NA  8  2  7  6  3  8  4 NA  6  4 NA NA  1
If we analyze this, we see that there is no 10th decile!
Method 2: Using quantile()
Now, let's use the method from the thread How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame.
Here's my code:
as.numeric(cut(vector,
               breaks = quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE),
               include.lowest = TRUE))
The output is:
 7  6  2  4  1  9  2 NA 10  3  9  7  4 10  5 NA  8  5 NA NA  1
As we can see, the outputs are completely different. What am I missing here? I'd appreciate any thoughts.
Is this a bug in the ntile() function?
In dplyr::ntile, NA is always ranked last (highest), which is why you don't see the 10th decile in this case. If you want the deciles not to consider NAs, you can define a function like the one below:
ntile_na <- function(x, n) {
  notna <- !is.na(x)
  out <- rep(NA_real_, length(x))
  out[notna] <- ntile(x[notna], n)
  out
}
ntile_na(vector, 10)
# [1] 6 6 2 4 1 9 2 NA 9 3 8 7 3 10 5 NA 8 5 NA NA 1
Also, quantile() has nine ways of computing quantiles; you are using the default, type 7 (check ?stats::quantile for the different types and the discussion about them).
If you try
as.numeric(cut(vector,
breaks = quantile(vector,
probs = seq(0, 1, length = 11),
na.rm = TRUE,
type = 2),
include.lowest = TRUE))
# [1] 6 6 2 4 1 9 2 NA 9 3 8 7 3 10 5 NA 8 5 NA NA 1
you have the same result as the one using ntile.
In summary: it is not a bug, it is just the different ways they are implemented.
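A quick way to see where the two defaults diverge is to compare the break points themselves; type 7 interpolates between order statistics, while type 2 averages at the discontinuities of the empirical distribution:
quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE, type = 7)
quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE, type = 2)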

[R]: applying a function to columns based on conditional row position

I am attempting to find the number of observations by column in a data frame that meet a certain condition after the max for that column has been encountered.
Here is a highly simplified example:
fake.dat <- data.frame(samp1 = c(5,6,7,5,4,5,10,5,6,7),
                       samp2 = c(2,3,4,6,7,9,2,3,7,8),
                       samp3 = c(2,3,4,11,7,9,2,3,7,8),
                       samp4 = c(5,6,7,5,4,12,10,5,6,7))
   samp1 samp2 samp3 samp4
1      5     2     2     5
2      6     3     3     6
3      7     4     4     7
4      5     6    11     5
5      4     7     7     4
6      5     9     9    12
7     10     2     2    10
8      5     3     3     5
9      6     7     7     6
10     7     8     8     7
So, let's say I'm trying to find the number of observations per column that are greater than 5 after excluding all the observations in a column up to and including the row where the maximum for the column occurs.
Expected outcome:
samp1 samp2 samp3 samp4
    2     2     4     3
I am able to get the answer I want by using nested for loops to exclude the observations I don't want.
max.row <- apply(fake.dat, 2, which.max)  # row index of each column's max
newfake.dat <- data.frame()
for (j in 1:length(fake.dat)) {
  for (i in 1:nrow(fake.dat)) {
    ifelse(i > max.row[j], newfake.dat[i, j] <- fake.dat[i, j], NA)
  }
}
print(newfake.dat)
This creates a new data frame on which I can run an easy apply function.
colcount <- apply(newfake.dat, 2, function(x) sum(x > 5, na.rm = TRUE))
   V1 V2 V3 V4
1  NA NA NA NA
2  NA NA NA NA
3  NA NA NA NA
4  NA NA NA NA
5  NA NA  7 NA
6  NA NA  9 NA
7  NA  2  2 10
8   5  3  3  5
9   6  7  7  6
10  7  8  8  7
V1 V2 V3 V4
 2  2  4  3
Which is all well and good for this tiny example dataset, but is prohibitively slow on anything approaching the size of my real datasets. Which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but same number of rows) and it ran for at least 5 hours (I left it going when I left work for the day). Also, I don't really need the new dataframe for anything other than to be able to run the apply function.
Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.
maxrow <- apply(fake.dat, 2, which.max)
print(maxrow)
seq.att <- apply(fake.dat, 2, function(x) {
  sum(x[which(seq(1, nrow(fake.dat)) == maxrow):nrow(fake.dat)] > 5, na.rm = TRUE)
})
Which kicks up four instances of this warning message:
1: In seq(1, nrow(fake.dat)) == (maxrow) :
longer object length is not a multiple of shorter object length
If I ignore the warning message and get the output anyway it doesn't give me the answer I expected:
samp1 samp2 samp3 samp4
    2     3     3     3
I also tried using a while loop, which kept cycling, so I stopped it (I misplaced the code I tried for this).
So far the most promising result has come from the nested for loops, but I know it's terribly inefficient and I'm hoping that there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!
Here is a way in dplyr to replicate the same process that you showed with base R:
library(dplyr)
fake.dat %>%
  summarise_each(funs(sum(.[(which.max(.) + 1):n()] > 5,
                          na.rm = TRUE)))
#  samp1 samp2 samp3 samp4
#1     2     2     4     3
If you need it as two steps:
datNA <- fake.dat %>%
  mutate_each(funs(replace(., seq_len(which.max(.)), NA)))
datNA %>%
  summarise_each(funs(sum(. > 5, na.rm = TRUE)))
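Note that summarise_each()/mutate_each() have since been deprecated in dplyr. Assuming a current dplyr (>= 1.0), an equivalent using across() would be:
fake.dat %>%
  summarise(across(everything(),
                   ~ sum(.x[(which.max(.x) + 1):length(.x)] > 5, na.rm = TRUE)))
#  samp1 samp2 samp3 samp4
#1     2     2     4     3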
Here's one approach using data.table:
library(data.table)
##
data <- data.frame(
samp1=c(5,6,7,5,4,5,10,5,6,7),
samp2=c(2,3,4,6,7,9,2,3,7,8),
samp3=c(2,3,4,11,7,9,2,3,7,8),
samp4=c(5,6,7,5,4,12,10,5,6,7))
##
Dt <- data.table(data)
##
R> Dt[, lapply(.SD, function(x) {
     y <- x[(which.max(x) + 1):.N]
     length(y[y > 5])
   })]
   samp1 samp2 samp3 samp4
1:     2     2     4     3
A one-liner in base R:
vapply(fake.dat, function(x) sum(x[(which.max(x) + 1):length(x)] > 5), 1L)
#samp1 samp2 samp3 samp4
#    2     2     4     3
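One caveat that applies to all of these approaches (a hypothetical edge case, not raised in the thread): if a column's maximum sits in the last row, (which.max(x) + 1):length(x) counts downwards and indexes past the end of the vector. A guarded sketch:
count_after_max <- function(x, threshold = 5) {
  i <- which.max(x)
  if (i == length(x)) return(0L)  # nothing after the max
  sum(x[(i + 1):length(x)] > threshold, na.rm = TRUE)
}
vapply(fake.dat, count_after_max, integer(1))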

Storing an output in the same data.frame when the output has a different row size

Sometimes I want to perform a function (e.g. a difference calculation) on a dataset and store the results directly in the data frame:
df <- data.frame(a$C, diff(a$C))
But I cannot do that because the number of rows is different.
Is there some syntax that will allow me to do that, perhaps having NA where the function (diff()) gives no result?
There isn't a general solution to this without making vast assumptions about the whole panoply of functions one may wish to use.
For the example you show, we can easily work out that the first value from diff() would be an NA if it returned it:
set.seed(5)
d <- rpois(10, 5)
> d
 [1] 3 6 8 4 2 6 5 7 9 2
> diff(d)
[1]  3  2 -4 -2  4 -1  2  2 -7
So if you are using diff() then you can always just do:
> dd <- data.frame(d, Diff = c(NA, diff(d)))
> dd
    d Diff
1   3   NA
2   6    3
3   8    2
4   4   -4
5   2   -2
6   6    4
7   5   -1
8   7    2
9   9    2
10  2   -7
But now consider what you would do with any other function that you might wish to use that doesn't always return NA in the correct place.
For this example, we can use the zoo package, whose diff() method has an na.pad argument:
require(zoo)
d2 <- as.zoo(d)
ddd <- data.frame(d, Diff = diff(d2, na.pad = TRUE))
> ddd
    d Diff
1   3   NA
2   6    3
3   8    2
4   4   -4
5   2   -2
6   6    4
7   5   -1
8   7    2
9   9    2
10  2   -7
If you are using a modelling function with a formula interface (e.g. lm()) and that function has an na.action argument, then you can set na.action = na.exclude in the function call; extractor functions such as fitted() and resid() will then pad their output with NA in the correct places, so that it has the same length as the data passed to the modelling function.
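A minimal sketch of that behaviour (the toy data here is assumed, not from the question):
dat <- data.frame(x = 1:10, y = c(2, 4, NA, 8, 10, 12, NA, 16, 18, 20))
fit <- lm(y ~ x, data = dat, na.action = na.exclude)
length(resid(fit))  # 10: residuals are padded with NA at rows 3 and 7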
If you have other, more specific cases you want to explore, please edit your Question. In specific cases there will usually be a simple answer; in the general case the answer is no, it is not possible to do what you ask.
The standard method is, as you say, to create a vector that is extended at one end or the other with an NA:
dfrm$diffvec <- c(NA, diff(firstvec))

Re-inserting NAs into a vector

I have a vector of values which include NAs. The values need to be processed by an external program that can't handle NAs, so they are stripped out, the vector is written to a file, processed, then read back in, resulting in a vector of the length of the number of non-NAs. For example, if the input is 7 3 4 NA 5 4 6 NA 1 NA, the output would be just the 7 non-NA values. What I need to do is re-insert the NAs in position.
So, given two vectors X and Y:
> X
 [1]  64   1   9 100  16  NA  25  NA   4  49  36  NA  81
> Y
 [1]  8  1  3 10  4  5  2  7  6  9
produce:
 8  1  3 10  4 NA  5 NA  2  7  6 NA  9
(You may notice that X is Y^2; that's just for the example.)
I could knock out a function to do this but I wonder if there's any nice tricksy ways of doing it... split, list, length... hmmm...
na.omit() keeps an attribute recording the locations of the NAs in the original series, so you can use that to know where to put the missing values back:
Y <- sqrt(na.omit(X))
Z <- rep(NA, length(Y) + length(attr(Y, "na.action")))
Z[-attr(Y, "na.action")] <- Y
# > Z
#  [1]  8  1  3 10  4 NA  5 NA  2  7  6 NA  9
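To see what that attribute actually holds (the positions of the removed NAs):
attr(na.omit(X), "na.action")
# [1]  6  8 12
# attr(,"class")
# [1] "omit"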
Answering my own question is probably very bad form, but I think this is about the neatest:
rena <- function(X, Z) {
  Y <- rep(NA, length(X))
  Y[!is.na(X)] <- Z
  Y
}
You can also try replace():
replace(X, !is.na(X), Y)
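With the X and Y from the question, this returns the desired result directly:
replace(X, !is.na(X), Y)
#  [1]  8  1  3 10  4 NA  5 NA  2  7  6 NA  9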
Another variant on the same theme:
rena <- function(X, Z) {
  X[which(!is.na(X))] <- Z
  X
}
R automatically fills the rest with NA.
Edit: Corrected by Marek.
