Difference between ntile() and cut()/quantile() for computing deciles in R

I found two threads on this topic for calculating deciles in R. However, the two methods, dplyr::ntile() and quantile(), yield different output; in fact, dplyr::ntile() does not seem to produce proper deciles.
Method 1: Using ntile()
The thread "R: splitting dataset into quartiles/deciles. What is the right method?" suggests using ntile().
Here's my code:
vector<-c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344,
0.00948031338483081, 0.000549450549450549, 0.085972850678733,
0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915,
0.0604885614579294, 0.0352030947775629, 0.00935626135385923,
0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538,
0.0224952741020794, NA, NA, 0.000984952654008562)
ntile(vector,10)
The output is:
5 5 2 3 1 7 1 NA 8 2 7 6 3 8 4 NA 6 4 NA NA 1
If we analyze this, we see that there is no 10th decile!
Method 2: Using quantile()
Now, let's use the method from the thread "How to quickly form groups (quartiles, deciles, etc.) by ordering column(s) in a data frame".
Here's my code:
as.numeric(cut(vector,
               breaks = quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE),
               include.lowest = TRUE))
The output is:
7 6 2 4 1 9 2 NA 10 3 9 7 4 10 5 NA 8 5 NA NA 1
As we can see, the outputs are completely different. What am I missing here? I'd appreciate any thoughts.
Is this a bug in the ntile() function?

In dplyr::ntile(), NA is always ranked last (highest), and that is why you don't see the 10th decile in this case. If you want the deciles to ignore NAs, you can define a wrapper function like the following:
ntile_na <- function(x, n) {
  notna <- !is.na(x)
  out <- rep(NA_real_, length(x))
  out[notna] <- ntile(x[notna], n)
  return(out)
}
ntile_na(vector, 10)
# [1] 6 6 2 4 1 9 2 NA 9 3 8 7 3 10 5 NA 8 5 NA NA 1
Also, quantile() has nine different ways of computing quantiles; you are using the default, type 7 (see ?stats::quantile for the different types and a discussion of them).
If you try
as.numeric(cut(vector,
               breaks = quantile(vector,
                                 probs = seq(0, 1, length = 11),
                                 na.rm = TRUE,
                                 type = 2),
               include.lowest = TRUE))
# [1] 6 6 2 4 1 9 2 NA 9 3 8 7 3 10 5 NA 8 5 NA NA 1
you get the same result as with ntile().
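To see the difference directly, it can help to compare the two sets of break points side by side. A minimal sketch, using the vector defined above (the exact values depend on the data):
probs <- seq(0, 1, length = 11)
cbind(type7 = quantile(vector, probs, na.rm = TRUE, type = 7),   # the default
      type2 = quantile(vector, probs, na.rm = TRUE, type = 2))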
In summary: it is not a bug; the two functions simply implement quantiles in different ways.

Related

Why does TTR::SMA return NA for the first element of a series when n = 1?

This is what I am looking at:
library(TTR)
test <- c(1:10)
test <- SMA(test, n=1)
test
[1] NA 2 3 4 5 6 7 8 9 10
The reason I am asking is that I have a script that lets you define n:
library(TTR)
test <- c(1:10)
Index_Transformation <- 1 #1 means no transformation to the series
test <- SMA(test, n = Index_Transformation)
test
[1] NA 2 3 4 5 6 7 8 9 10
Is there any way I can have the SMA function return the first element of the series when n = 1 instead of NA?
Thanks a lot for your help
You can use rollmean() from the zoo package instead:
library(zoo)
rollmean(test, 1)
#[1] 1 2 3 4 5 6 7 8 9 10
Just out of curiosity I looked at the SMA() function; it calls runMean() internally. So if you do
runMean(test, 1)
# [1] NA 2 3 4 5 6 7 8 9 10
it still gives the same output, with the leading NA.
Further, runMean() calls runSum() like this:
runSum(x, n)/n
So if you now do
runSum(test, 1)
#[1] NA 2 3 4 5 6 7 8 9 10
the NA is still there. runSum() is a fairly large function, and that is where the original NA is generated.
So if you still need to use the SMA() function, you can add an if check:
if (Index_Transformation > 1)   # or: if (Index_Transformation != 1)
  test <- SMA(test, n = Index_Transformation)
That way test only changes when Index_Transformation is greater than 1 and stays as it is when it equals 1.
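Putting that together, here is a minimal sketch of such a guard (the wrapper name sma_safe is just illustrative, not part of TTR):
library(TTR)

# Hypothetical wrapper (not part of TTR): fall back to the unchanged series
# when n = 1, since a 1-period moving average is just the series itself.
sma_safe <- function(x, n) {
  if (n > 1) SMA(x, n = n) else x
}

test <- c(1:10)
Index_Transformation <- 1          # 1 means no transformation to the series
sma_safe(test, Index_Transformation)
# [1]  1  2  3  4  5  6  7  8  9 10
sma_safe(test, 2)                  # behaves exactly like SMA(test, n = 2)
# [1]  NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5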

Count unique values in Raster Data in R

I have these raster datasets, which look like this:
1 2 3 4 5
1 NA NA NA 10 NA
2 7 3 7 10 10
3 NA 3 7 3 3
4 9 9 NA 3 7
5 3 NA 7 NA NA
via
MyRaster1 <- raster("MyRaster_EUNIS1.tif")
head(MyRaster1)
I created that table.
Using unique(MyRaster1) I get 3 7 9 10.
What I need are the counts of these unique values in the raster dataset.
I have tried quite a few approaches; one works, but it is a lot of trouble and I can't get a loop to work for all the raster datasets I have.
Classes1 <- as.factor(unique(values(MyRaster1)))[!is.na(unique(values(MyRaster1)))]
val1 <- unique(MyRaster1)
Tab1 <- matrix(nrow = length(values(MyRaster1)), ncol = length(val1))
colnames(Tab1) <- levels(unique(Classes1))
Tab1 <- Tab1[!is.na(Tab1[,1]),]
colSums(Tab1)
It seems to work properly, until I try to delete the NA values. When I use colSums before that, I get NA as result for each column, after I delete the NA values, I get 0.
This is my first time using R, so I'm a real novice. I've researched quite a lot, but since I hardly understand the language at all, this is the furthest I have gotten.
Thank you for your help.
Edit:
table(MyRaster1)
gives me this: Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
The best result would be:
3 7 9 10
6 5 2 3
But I'd also be ok with a different format which I could use in Excel.
Use raster::freq()
Here's an example for the first two rows of your data:
r <- raster(matrix(c(NA, NA, NA, 10, NA, 7, 3, 7, 10, 10), nrow = 2, ncol = 5, byrow = TRUE))
> freq(r)
value count
[1,] 3 1
[2,] 7 2
[3,] 10 3
[4,] NA 4
Note that the freq function rounds unless explicitly told not to:
https://www.rdocumentation.org/packages/raster/versions/3.0-7/topics/freq
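If you prefer to stay closer to base R, one hedged alternative is to pull the cell values out of the raster with values() and tabulate them with table(); that is also why the earlier table(MyRaster1) call failed, since table() needs a plain vector. A sketch rebuilding the full 5 x 5 example from the question:
library(raster)

# Rebuild the full 5 x 5 example from the question
m <- matrix(c(NA, NA, NA, 10, NA,
               7,  3,  7, 10, 10,
              NA,  3,  7,  3,  3,
               9,  9, NA,  3,  7,
               3, NA,  7, NA, NA),
            nrow = 5, byrow = TRUE)
r5 <- raster(m)

# values() returns a plain numeric vector, which table() can count
# (NAs are dropped by default)
table(values(r5))
#  3  7  9 10
#  6  5  2  3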

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
My goal is to find the smallest value in each ID subset and place that number in the first row of the ID group, leaving the other rows blank, like this:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value defined by the values of ID. For the example it returns a vector of five 2s followed by four 9s. Since every value except the first in each subset should be NA, the function "is.na<-" replaces all values at the positions flagged by the logical index c(FALSE, !diff(ID)). This index is TRUE wherever the ID is identical to the one in the preceding row.
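To make those intermediate pieces visible, here is a quick sketch, assuming A holds the example data frame from the question:
ave(A$Value, A$ID, FUN = min)
# [1] 2 2 2 2 2 9 9 9 9            group minimum, repeated for every row
c(FALSE, !diff(A$ID))
# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
# TRUE wherever the ID repeats the previous row; "is.na<-" then sets those
# positions to NA, leaving the minimum only in the first row of each group.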
You're almost there. We just need to make a custom function instead of seq_along and to split Value by ID (not ID by ID).
first_min <- function(x) {
  nas <- rep(NA, length(x))
  nas[which.min(x)] <- min(x, na.rm = TRUE)
  nas
}
This function makes a vector of NAs and places the group's minimum at the position of its first occurrence (which.min); in this data the minimum happens to sit in the first row of each ID group, which is exactly what is wanted.
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
You can achieve this with a tapply one-liner
df$Start <- as.vector(unlist(tapply(df$Value, df$ID,
                                    FUN = function(x) c(min(x), rep("", length(x) - 1)))))
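One side note: rep("", ...) coerces the new Start column to character. A hedged variant that keeps it numeric pads with NA instead:
# Same tapply idea, but padding with NA_real_ keeps Start numeric
# (like the original, this assumes the rows are already sorted by ID)
df$Start <- unlist(tapply(df$Value, df$ID,
                          FUN = function(x) c(min(x), rep(NA_real_, length(x) - 1))))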
I keep going back to this question and the above answers helped me greatly.
There is also a basic solution for beginners (note that it copies the first Value of each ID group, which equals the group minimum for this example because Value is sorted within each ID):
A$Start <- NA
A[!duplicated(A$ID), ]$Start <- A[!duplicated(A$ID), ]$Value
Thanks.

Storing an output in the same data.frame when the output has a different number of rows

Sometimes I want to apply a function (e.g. a difference calculation) to a dataset and store the results directly in the data frame:
df <- data.frame(a$C, diff(a$C))
But I cannot do that because the number of rows is different.
Is there some syntax that will allow me to do that, perhaps filling with NA where the function (diff()) gives no result?
There isn't a general solution to this without making vast assumptions about the whole panoply of functions one may wish to use.
For the example you show, we can easily work out that the first value from diff() would be NA if it returned one:
set.seed(5)
d <- rpois(10, 5)
> d
[1] 3 6 8 4 2 6 5 7 9 2
> diff(d)
[1] 3 2 -4 -2 4 -1 2 2 -7
So if you are using diff() then you can always just do:
> dd <- data.frame(d, Diff = c(NA, diff(d)))
> dd
d Diff
1 3 NA
2 6 3
3 8 2
4 4 -4
5 2 -2
6 6 4
7 5 -1
8 7 2
9 9 2
10 2 -7
But now consider some other function you might wish to use, where it may not be so obvious where the missing values should go.
For this example we can use the zoo package, whose diff() method has an na.pad argument:
require(zoo)
d2 <- as.zoo(d)
ddd <- data.frame(d, Diff = diff(d2, na.pad = TRUE))
> ddd
d Diff
1 3 NA
2 6 3
3 8 2
4 4 -4
5 2 -2
6 6 4
7 5 -1
8 7 2
9 9 2
10 2 -7
If you are using a modelling function with a formula interface (e.g. lm()) and that function has an na.action argument, then you can set na.action = na.exclude in the call, and extractor functions such as fitted() and resid() will add NAs back into their output in the correct places, so that the output has the same length as the data passed to the modelling function.
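For instance, a small sketch with lm() (the data here are made up purely for illustration):
dat <- data.frame(x = 1:10,
                  y = c(2.1, 4.3, NA, 8.2, 9.9, NA, 14.1, 16.2, 18.0, 20.3))
fit <- lm(y ~ x, data = dat, na.action = na.exclude)
fitted(fit)   # length 10, with NA in positions 3 and 6
resid(fit)    # likewise padded back to the original length
cbind(dat, fitted = fitted(fit), residual = resid(fit))   # rows line up again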
If you have other, more specific cases you want to explore, please edit your question. In specific cases there will usually be a simple answer; in the general case the answer is no, it is not possible to do what you ask.
The standard method is to create, as you say, a vector that is padded at one end or the other with an NA:
dfrm$diffvec <- c(NA, diff(firstvec) )

Re-inserting NAs into a vector

I have a vector of values that includes NAs. The values need to be processed by an external program that can't handle NAs, so the NAs are stripped out, the vector is written to a file, processed, and read back in, giving a vector whose length equals the number of non-NA values. For example, if the input is 7 3 4 NA 5 4 6 NA 1 NA, the output would have just 7 values. What I need to do is re-insert the NAs in their original positions.
So, given two vectors X and Y:
> X
[1] 64 1 9 100 16 NA 25 NA 4 49 36 NA 81
> Y
[1] 8 1 3 10 4 5 2 7 6 9
produce:
8 1 3 10 4 NA 5 NA 2 7 6 NA 9
(you may notice that X is Y^2; that's just for the example).
I could knock out a function to do this but I wonder if there's any nice tricksy ways of doing it... split, list, length... hmmm...
na.omit() keeps an attribute recording the locations of the NAs in the original series, so you can use that to know where to put the missing values back:
Y <- sqrt(na.omit(X))
Z <- rep(NA,length(Y)+length(attr(Y,"na.action")))
Z[-attr(Y,"na.action")] <- Y
#> Z
# [1] 8 1 3 10 4 NA 5 NA 2 7 6 NA 9
Answering my own question is probably very bad form, but I think this is probably about the neatest:
rena <- function(X, Z) {
  Y <- rep(NA, length(X))
  Y[!is.na(X)] <- Z
  Y
}
You can also try replace():
replace(X, !is.na(X), Y)
Another variant on the same theme
rena <- function(X, Z) {
  X[which(!is.na(X))] <- Z
  X
}
Since only the non-NA positions of X are overwritten, the NAs stay in place.
Edit: Corrected by Marek.
