Non-overlapping deciles in R

I need to attach the quartile number to a table, but it has to avoid overlapping values. An example would be:
table<-data.frame(num1=c(0,1,1,2,2,3,3,3,3,6), num2=seq(20,30,len=10))
if I use ntile from dplyr:
library(dplyr)
table$cuartil<-ntile(table$num1, 4)
I get an overlapping result:
> table
   num1     num2 cuartil
1     0 20.00000       1
2     1 21.11111       1
3     1 22.22222       1
4     2 23.33333       2
5     2 24.44444       2
6     3 25.55556       3
7     3 26.66667       3
8     3 27.77778       3
9     3 28.88889       4
10    6 30.00000       4
Instead of that I would need:
   num1     num2 cuartil
1     0       20       1
2     1 21.11111       1
3     1 22.22222       1
4     2 23.33333       2
5     2 24.44444       2
6     3 25.55556       3
7     3 26.66667       3
8     3 27.77778       3
9     3 28.88889       3
10    6       30       4
So the value 3 falls in only one group, not in two (3 and 4).
Is there another function that computes quantile groups but, unlike ntile, does not split tied values across groups?

You could just write your own function using quantile():
quartile <- function(x) {
  qrt <- quantile(x)   # 0%, 25%, 50%, 75%, 100%
  y <- x
  # assign from the top group down; each later assignment
  # overwrites the lower values with the correct smaller group
  y[x <= qrt[5]] <- 4
  y[x <= qrt[4]] <- 3
  y[x <= qrt[3]] <- 2
  y[x <= qrt[2]] <- 1
  return(y)
}
table$cuartil <- quartile(table$num1)

Using Sam's Function, a generic one might be:
quantG <- function(x, m) {
  qrt <- quantile(x, probs = seq(0, 1, 1/m))
  y <- x
  # walk the break points from the top group down, as in quartile()
  for (i in 1:m) {
    y[x <= qrt[m + 2 - i]] <- m + 1 - i
  }
  return(y)
}
And:
table$cuartil <- quantG (table$num1,4)
gets the same result.
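For reference, the same non-overlapping grouping can be written with base R's cut() and quantile(). This is a sketch added here, not from the answers above; the helper name quantile_group is made up, and unique() guards against duplicated breaks when ties are heavy:
quantile_group <- function(x, m) {
  # collapse duplicated quantiles so cut() gets strictly increasing breaks
  brks <- unique(quantile(x, probs = seq(0, 1, 1/m)))
  cut(x, breaks = brks, include.lowest = TRUE, labels = FALSE)
}
table$cuartil <- quantile_group(table$num1, 4)
# 1 1 1 2 2 3 3 3 3 4
Note that if several quantiles coincide, fewer than m groups come back; that is the price of keeping tied values in a single group.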


How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that will identify rows using the variable x, and give rows missing x a unique id. In other words, I should have the same id for rows one and two, rows three and four, while rows 5-8 should have their own unique ids. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
  set.seed(x)
  res <- character(n)
  for(i in seq(n)){
    res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
  }
  res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
  mutate(my_id = id_function(1234, nrow(.)),
         my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
                                  TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS. What am I doing wrong? Is there an easier approach to this problem?
I guess rowwise() would do the trick:
data %>%
  rowwise() %>%
  mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
The purrr map functions can be used with non-vectorized functions. The following gives a similar result; map2 supplies the two arguments expected by your id_function.
library(tidyverse)
data %>%
  mutate(my_id = map2(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
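One detail worth noting (an aside, not from the answer above): map2() returns a list-column, whereas map2_chr() yields a plain character vector, and rows with missing x can still fall back to a precomputed id via coalesce(). A sketch reusing id_function from above:
library(dplyr)
library(purrr)

data %>%
  mutate(
    fallback = id_function(1234, n()),   # one unique id per row
    my_id = coalesce(
      # seed from x where available; propagate NA otherwise
      map2_chr(x, 1, ~ if (is.na(.x)) NA_character_ else id_function(.x, .y)),
      fallback                           # fill NA rows from the fallback ids
    )
  ) %>%
  select(-fallback)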

How to shift data in only one column up and down in R?

I have a data frame that looks as follows:
ID  Count
1   3
2   5
3   2
4   0
5   1
And I am trying to shift ONLY the values in the "Count" column down one so that it looks as follows:
ID  Count
1   NA
2   3
3   5
4   2
5   0
I will also need to eventually shift the same data up one:
ID  Count
1   5
2   2
3   0
4   1
5   NA
I've tried the following code:
shift <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}
df$Count <- shift(df$Count, 1)
But it ended up duplicating the titles and shifting the data down, as follows:
ID  Count
ID  Count
1   3
2   5
3   2
4   0
Is there an easy way for me to accomplish this? Thank you!!
# set as data.table
library(data.table)
setDT(df)
# shift down by one (lag); use type = "lead" to shift up
df[, Count := shift(Count, 1)]
df$Count <- c(NA, df$Count[1:(nrow(df) - 1)])
1) dplyr Using the DF shown reproducibly in the Note at the end, use lag and lead from dplyr:
library(dplyr)
DF %>% mutate(CountLag = lag(Count), CountLead = lead(Count))
## ID Count CountLag CountLead
## 1 1 3 NA 5
## 2 2 5 3 2
## 3 3 2 5 0
## 4 4 0 2 1
## 5 5 1 0 NA
2) zoo This creates a zoo object using zoo's vectorized lag. Optionally use fortify.zoo(z) or as.ts(z) to convert it back to a data frame or ts object.
Note that dplyr clobbers lag with its own lag so we used stats::lag to ensure it does not interfere. The stats:: can optionally be omitted if dplyr is not loaded.
library(zoo)
z <- stats::lag(read.zoo(DF), seq(-1, 1)); z
Index lag-1 lag0 lag1
1 1 NA 3 5
2 2 3 5 2
3 3 5 2 0
4 4 2 0 1
5 5 0 1 NA
3) collapse flag from the collapse package is also vectorized over its second argument.
library(collapse)
with(DF, data.frame(ID, Count = flag(Count, seq(-1, 1))))
## ID Count.F1 Count... Count.L1
## 1 1 5 3 NA
## 2 2 2 5 3
## 3 3 0 2 5
## 4 4 1 0 2
## 5 5 NA 1 0
Note
DF <- data.frame(ID = 1:5, Count = c(3, 5, 2, 0, 1))
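For completeness, a small base-R helper covering both directions (a sketch added here, not part of the answers above; the name shift_vec is made up):
shift_vec <- function(x, n = 1) {
  # positive n shifts values down (lag); negative n shifts them up (lead);
  # vacated slots are filled with NA
  if (n == 0) return(x)
  if (n > 0) c(rep(NA, n), head(x, -n)) else c(tail(x, n), rep(NA, -n))
}

shift_vec(DF$Count, 1)    # NA  3  5  2  0
shift_vec(DF$Count, -1)   #  5  2  0  1 NA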

Dataframe create new column based on other columns

I have a dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c(1,20,3,4,50))
df
a b
1 1 1
2 2 20
3 3 3
4 4 4
5 5 50
and I want to create a new column based on existing columns. Something like this:
if (df[['a']] == df[['b']]) {
  df[['c']] <- df[['a']] + df[['b']]
} else {
  df[['c']] <- df[['b']] - df[['a']]
}
The problem is that the if condition is checked only for the first row. If I wrap the if statement in a function and use apply() (or mapply(), ...), the result is the same.
In Python/pandas I can use this:
df['c'] = df[['a', 'b']].apply(lambda x: x['a'] + x['b'] if (x['a'] == x['b']) \
else x['b'] - x['a'], axis=1)
I want something similar in R. So the result should look like this:
a b c
1 1 1 2
2 2 20 18
3 3 3 6
4 4 4 8
5 5 50 45
One option is ifelse, which is a vectorized version of if/else. If we are doing this for each row, the if/else as shown in the OP's pandas post can be done in either a for loop or lapply/sapply, but that would be inefficient in R.
df <- transform(df, c= ifelse(a==b, a+b, b-a))
df
# a b c
#1 1 1 2
#2 2 20 18
#3 3 3 6
#4 4 4 8
#5 5 50 45
This can be otherwise written as
df$c <- with(df, ifelse(a==b, a+b, b-a))
to create the 'c' column in the original dataset
As the OP wants a similar option in R using if/else:
df$c <- apply(df, 1, FUN = function(x) if(x[1]==x[2]) x[1]+x[2] else x[2]-x[1])
Here is a slightly more confusing algebraic method:
df$c <- with(df, b + ((-1)^((a==b)+1) * a))
df
a b c
1 1 1 2
2 2 20 18
3 3 3 6
4 4 4 8
5 5 50 45
The idea is that the "minus" operator is turned on or off based on the test a==b.
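To see the switch concretely (a quick check added here, not in the original answer):
# exponent is 2 (giving +1) where a == b, and 1 (giving -1) otherwise
with(df, (-1)^((a == b) + 1))
# [1]  1 -1  1  1 -1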
If you want an apply method, another way would be to create a function and apply it with mapply:
fun1 <- function(x, y) if (x == y) {x + y} else {y - x}
df$c <- mapply(fun1, df$a, df$b)
df
# a b c
#1 1 1 2
#2 2 20 18
#3 3 3 6
#4 4 4 8
#5 5 50 45
Using dplyr package:
library(dplyr)
df <- df %>%
mutate(c = if_else(a == b, a + b, b - a))
df
# a b c
# 1 1 1 2
# 2 2 20 18
# 3 3 3 6
# 4 4 4 8
# 5 5 50 45
A solution with apply
myFunction <- function(x){
  a <- x[1]
  b <- x[2]
  # further values ignored (if there are more than 2 columns)
  value <- if(a==b) a + b else b - a
  # or more complicated stuff
  return(value)
}
df$c <- apply(df, 1, myFunction)

How can I subset a dataframe according to group membership?

I want to write a function so that a (potentially large) dataframe can be subsetted according to group membership, where a 'group' is a unique combination of a set of column values.
For example, I would like to subset the following data frame according to unique combination of the first two columns (Loc1 and Loc2).
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF=data.frame(Loc1,Loc2,Dat1,Dat2,Dat3)
Loc1 Loc2 Dat1 Dat2 Dat3
1 A a 1 1 2
2 A a 1 2 2
3 A b 1 1 4
4 A b 1 2 4
5 B a 1 1 6
6 B a 1 2 5
7 B b 1 2 3
I want to return (i) the number of groups (i.e. 4), (ii) the number in each group (i.e. c(2,2,2,1)), and (iii) to relabel the rows so that I can further analyse the data frame according to group membership (e.g. for ANOVA and MANOVA), i.e.
Group<-as.factor(c(1,1,2,2,3,3,4))
Data <- cbind(Group,DF[,-1:-2])
Group Dat1 Dat2 Dat3
1 1 1 1 2
2 1 1 2 2
3 2 1 1 4
4 2 1 2 4
5 3 1 1 6
6 3 1 2 5
7 4 1 2 3
So far all I have managed is to get the number of groups, and I'm suspicious that there's a better way to do even this:
nrow(unique(DF[,1:2]))
I was hoping to avoid for-loops as I am concerned about the function being slow.
I have tried converting to a data matrix so that I could concatenate the row values but I couldn't get that to work either.
Many thanks
You could try:
Create a Group column using the unique combinations of the levels of Loc1 and Loc2.
indx <- paste(DF[,1], DF[,2])
DF$Group <- as.numeric(factor(indx, unique(indx))) #query No (iii)
DF1 <- DF[-(1:2)][,c(4,1:3)]
# Group Dat1 Dat2 Dat3
#1 1 1 1 2
#2 1 1 2 2
#3 2 1 1 4
#4 2 1 2 4
#5 3 1 1 6
#6 3 1 2 5
#7 4 1 2 3
table(DF$Group) #(No. ii)
#1 2 3 4
#2 2 2 1
length(unique(DF$Group)) #(i)
#[1] 4
Then, if you need to subset the dataset by group, you can split it on Group to get a list with 4 elements:
split(DF1, DF1$Group)
Update
If you have multiple columns, you could still try:
ColstoGroup <- 1:2
indx <- apply(DF[,ColstoGroup], 1, paste, collapse="")
as.numeric(factor(indx, unique(indx)))
#[1] 1 1 2 2 3 3 4
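A base-R shortcut for the multi-column case (an aside, not from the original answer) is interaction():
# lex.order = TRUE orders levels as A.a, A.b, B.a, B.b;
# drop = TRUE discards combinations that never occur.
# Note the ids follow level order, not order of first appearance.
as.integer(interaction(DF$Loc1, DF$Loc2, drop = TRUE, lex.order = TRUE))
# [1] 1 1 2 2 3 3 4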
You could also wrap this in a function:
fun1 <- function(dat, GroupCols){
  FactGroup <- dat[, GroupCols]
  if(length(GroupCols)==1){
    dat$Group <- as.numeric(factor(FactGroup, levels=unique(FactGroup)))
  } else {
    indx <- apply(FactGroup, 1, paste, collapse="")
    dat$Group <- as.numeric(factor(indx, unique(indx)))
  }
  dat
}
fun1(DF, "Loc1")
fun1(DF, c("Loc1", "Loc2"))
This gets all three of your queries.
Begin with a table of the first two columns and then work with that data.
> (tab <- table(DF$Loc1, DF$Loc2))
#
# a b
# A 2 2
# B 2 1
#
> (ct <- c(tab)) ## (ii)
# [1] 2 2 2 1
> sum(tab > 0) ## (i) number of non-empty groups
# [1] 4
> cbind(Group = rep(seq_along(ct), ct), DF[-c(1,2)]) ## (iii)
# Group Dat1 Dat2 Dat3
# 1 1 1 1 2
# 2 1 1 2 2
# 3 2 1 1 4
# 4 2 1 2 4
# 5 3 1 1 6
# 6 3 1 2 5
# 7 4 1 2 3
Borrowing a bit from this answer and using some dplyr idioms:
library(dplyr)
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF <- data.frame(Loc1, Loc2, Dat1, Dat2, Dat3)
emitID <- local({
  idCounter <- -1L
  function(){
    idCounter <<- idCounter + 1L
  }
})
DF %>% group_by(Loc1, Loc2) %>% mutate(Group=emitID())
## Loc1 Loc2 Dat1 Dat2 Dat3 Group
## 1 A a 1 1 2 0
## 2 A a 1 2 2 0
## 3 A b 1 1 4 1
## 4 A b 1 2 4 1
## 5 B a 1 1 6 2
## 6 B a 1 2 5 2
## 7 B b 1 2 3 3
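With current dplyr (>= 1.0) the counter closure is unnecessary; cur_group_id() produces a group index directly (a modern aside, not part of the original answer):
library(dplyr)
DF %>%
  group_by(Loc1, Loc2) %>%
  mutate(Group = cur_group_id()) %>%  # 1-based id per unique (Loc1, Loc2)
  ungroup()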

multiplication in matrix R

Here is what I want to do (in Excel it is clear and easy, but in R I have a problem):
Column A: 1 2 3 4 5
Column B: 0 9 2 1 7
This is my actual "algorithm" for Column C:
(first value) = mean(Column A) = 3
(second value) = ((mean(Column A) * 4) + 0) / 5 = 2.4
(third value) = ((second value * 4) + 9) / 5 = 3.72
etc.
So we have:
#   A B C
# 1 1 0 3
# 2 2 9 2.4
# 3 3 2 3.72
# 4 4 1 3.37
# 5 5 7 2.90
This is my actual code with your suggestion:
a <- c(1:5)
b <- c(0,9,0,1,7,0)
matrix <- data.frame(A=a,B=b)
matrix <- c(mean(matrix$A), (cumsum(matrix$B) + (mean(matrix$A)*4))/5)
This gives: 2.4 4.2 4.2 4.4 5.8 (WRONG!!)
Of course R gives me the error "replacement has 6 rows, data has 5", but that isn't relevant... I only want to know: how should I do it?
You could use ?cumsum:
a <- 1:5
b <- c(0, 9, 2, 1, 7)
mean(a) + cumsum(b)
# [1] 3 12 14 15 22
UPDATE:
It seems you want to run a (weighted) moving average. Maybe you should have a look at the TTR package.
Please find an easy approach below:
wma <- function(b, startValue, a=4/5) {
  m <- double(length(b)+1)
  m[1] <- startValue
  for (i in seq_along(b)) {
    m[i+1] <- a * m[i] + (1-a) * b[i]
  }
  return(m)
}
wma(b, mean(a))
# [1] 3.00000 2.40000 3.72000 3.37600 2.90080 3.72064
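The same recursion can also be expressed without an explicit loop via stats::filter() in recursive mode (an equivalent sketch, assuming the weight a = 4/5 and start value mean(a) as above; a_wt avoids clashing with the vector a):
a_wt  <- 4/5
start <- mean(a)   # 3
# recursive filter computes y[i] = (1 - a_wt)*b[i] + a_wt*y[i-1]
ema <- stats::filter((1 - a_wt) * b, filter = a_wt,
                     method = "recursive", init = start)
c(start, ema)
# [1] 3.00000 2.40000 3.72000 3.37600 2.90080 3.72064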
This puts the cumsum approach into a data frame:
mydf<-data.frame(A=1:5, B=c(0,9,2,1,7))
mydf$C<-cumsum(mydf$B)+mean(mydf$A)
mydf
# A B C
# 1 1 0 3
# 2 2 9 12
# 3 3 2 14
# 4 4 1 15
# 5 5 7 22
Hope it helps.
