Numbering of groups in dplyr?

I have a question about numbering the groups in a data.frame.
I found only one similar approach, dplyr-how-to-number-label-data-table-by-group-number-from-group-by, but it didn't work for me and I don't know why.
library(dplyr)

S <- rep(letters[1:12], each = 6)
R <- sort(replicate(9, sample(5000:6000, 4)))
df <- data.frame(R, S)

# Attempt: a closure that increments a counter on each call
get_next_integer <- function(){
  i <- 0
  function(S){ i <<- i + 1 }
}
get_integer <- get_next_integer()
result <- df %>% group_by(S) %>% mutate(label = get_integer())
result
Source: local data frame [72 x 3]
Groups: S [12]
R S label
(int) (fctr) (dbl)
1 5058 a 1
2 5121 a 1
3 5129 a 1
4 5143 a 1
5 5202 a 1
6 5213 a 1
7 5239 b 1
8 5245 b 1
9 5269 b 1
10 5324 b 1
.. ... ... ...
I'm looking for an elegant dplyr solution that numbers each letter group from 1 to 12.

Using as.numeric will do the trick.
S <- rep(letters[1:12],each=6)
R = sort(replicate(9, sample(5000:6000,4)))
df <- data.frame(R,S)
result <- df %>% mutate(label = as.numeric(S)) %>% group_by(S)
result
Source: local data frame [72 x 3]
Groups: S
R S label
1 5018 a 1
2 5042 a 1
3 5055 a 1
4 5066 a 1
5 5081 a 1
6 5133 a 1
7 5149 b 2
8 5191 b 2
9 5197 b 2
10 5248 b 2
.. ... . ...

No need to use dplyr at all.
S <- rep(letters[1:12],each=6)
R = sort(replicate(9, sample(5000:6000,4)))
df <- data.frame(R,S)
df$label <- as.numeric(factor(df$S))
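For completeness, newer dplyr (1.0 and later) exposes the group index directly through cur_group_id(), which does what the closure attempt above was aiming for. A minimal sketch:
library(dplyr)
df %>%
  group_by(S) %>%
  mutate(label = cur_group_id()) %>%
  ungroup()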

Related

Infill missing variables of a df from a list

I have a list of categorical variables that are missing from my data frame. I would like to add all combinations of their levels to the data frame using complete. I can do this for a single variable using mutate.
Simplified example:
library(tidyverse)
df <- tibble(a1 = 1:6,
             b1 = rep(c(1, 2), 3),
             c1 = rep(1:3, 2))
missing_cols <- list(d1 = 7:8,
                     e1 = 12:14)
# Use the first level of d1 for mutate, then complete with all its levels
df %>%
  mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
  complete(nesting(a1, b1, c1), d1 = missing_cols[[1]])
Desired output
df %>%
  mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
  mutate(!!names(missing_cols)[2] := missing_cols[[2]][1]) %>%
  complete(nesting(a1, b1, c1), d1 = missing_cols[[1]], e1 = missing_cols[[2]])
This will get the correct output for d1. How can I do this for all variables in my list?
We can use crossing() with purrr::cross_df():
library(tidyr)
library(purrr)  # cross_df() lives in purrr
crossing(df, cross_df(missing_cols))
# a1 b1 c1 d1 e1
# <int> <dbl> <int> <int> <int>
# 1 1 1 1 7 12
# 2 1 1 1 7 13
# 3 1 1 1 7 14
# 4 1 1 1 8 12
# 5 1 1 1 8 13
# 6 1 1 1 8 14
# 7 2 2 2 7 12
# 8 2 2 2 7 13
# 9 2 2 2 7 14
#10 2 2 2 8 12
# … with 26 more rows
cross_df() creates all possible combinations of missing_cols, while crossing() takes that output and creates all possible combinations with df.
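To see the intermediate step, cross_df() on its own expands missing_cols into one row per combination (2 values of d1 times 3 values of e1 = 6 rows):
cross_df(missing_cols)
# a 6 x 2 tibble with columns d1 and e1, one row per combination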
Using expand.grid
library(tidyr)
crossing(df, expand.grid(missing_cols))
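If you would rather stay in base R end to end, a cross join with merge() is a rough equivalent (a sketch; by = NULL makes merge() return the Cartesian product):
grid <- expand.grid(missing_cols)
out <- merge(df, grid, by = NULL)  # 6 rows of df x 6 combinations = 36 rows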

Dense Rank by Multiple Columns in R

How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
There must be a more general solution, maybe using functions like dense_rank() or row_number(), but I'm struggling to find it.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=T)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can run df$r <- frank(df, x, y, ties.method = 'dense') to add it as a new column.
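For comparison, 'min' skips rank numbers after ties while 'dense' does not; a quick check on this data:
frank(df, x, y, ties.method = 'min')    # 1 2 3 4 5 5 5 8 9 10
frank(df, x, y, ties.method = 'dense')  # 1 2 3 4 5 5 5 6 7 8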
tidyr/dplyr
Another option (though clunkier) is to use tidyr::unite to collapse your columns into one and then apply dplyr::dense_rank to it. Note that the united column is character, so the ranking is lexical; that is fine for this data but can misorder values like 10 and 2.
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8

R: Producing frequency table by selecting certain rows

I have a minimal example of a data set D that looks something like:
score person freq
10 1 3
10 2 5
10 3 4
8 1 3
7 2 2
6 4 1
Now, I want to be able to plot frequency of score=10 against person.
However, if I do:
#My bad, turns out the next line only works for matrices anyway:
#D = D[which(D[,1] == 10)]
D = subset(D, score == 10)
then I get:
score person freq
10 1 3
10 2 5
10 3 4
However, this is what I would like to get:
score person freq
10 1 3
10 2 5
10 3 4
10 4 0
Is there any quick and painless way for me to do this in R?
Here's a base R approach:
subset(as.data.frame(xtabs(freq ~ score + person, D)), score == 10)
# score person Freq
#4 10 1 3
#8 10 2 5
#12 10 3 4
#16 10 4 0
You can use complete() from the tidyr package to create the missing rows and then you can simply subset:
library(tidyr)
D2 <- complete(D, score, person, fill = list(freq = 0))
D2[D2$score == 10, ]
## Source: local data frame [4 x 3]
##
## score person freq
## (int) (int) (dbl)
## 1 10 1 3
## 2 10 2 5
## 3 10 3 4
## 4 10 4 0
complete() takes as its first argument the data frame it should work on, followed by the names of the columns to complete. The fill argument is a list giving, for each remaining column (here only freq), the value to fill in.
As suggested by docendo-discimus, this can be further simplified by also using the dplyr package:
library(tidyr)
library(dplyr)
complete(D, score, person, fill = list(freq = 0)) %>% filter(score == 10)
Here is a dplyr approach:
D %>% mutate(freq = ifelse(score == 10, freq, 0),
score = 10) %>%
group_by(score, person) %>%
summarise(freq = max(freq))
Source: local data frame [4 x 3]
Groups: score [?]
score person freq
(dbl) (int) (dbl)
1 10 1 3
2 10 2 5
3 10 3 4
4 10 4 0
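Since the end goal was to plot the frequency of score == 10 against person, here is a minimal ggplot2 sketch built on the completed data D2 from the complete() answer above (column names as in the question):
library(ggplot2)
ggplot(D2[D2$score == 10, ], aes(x = factor(person), y = freq)) +
  geom_col() +
  labs(x = "person", y = "frequency of score == 10")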

Divide one column of data frame by condition from another column

I have a data frame with 2 columns like this:
cond val
1 5
2 18
2 18
2 18
3 30
3 30
I want to change values in val in this way:
cond val
1 5 # 5 = 5/1 (only "1" in cond column)
2 6 # 6 = 18/3 (there are three "2" in cond column)
2 6
2 6
3 15 # 15 = 30/2
3 15
How to achieve this?
A base R solution:
# method 1:
mydf$val <- ave(mydf$val, mydf$cond, FUN = function(x) x / length(x))
# method 2:
mydf <- transform(mydf, val = ave(val, cond, FUN = function(x) x / length(x)))
which gives:
cond val
1 1 5
2 2 6
3 2 6
4 2 6
5 3 15
6 3 15
Here's the dplyr way:
library(dplyr)
df %>%
group_by(cond) %>%
mutate(val = val / n())
Which gives:
#Source: local data frame [6 x 2]
#Groups: cond [3]
#
# cond val
# (int) (dbl)
#1 1 5
#2 2 6
#3 2 6
#4 2 6
#5 3 15
#6 3 15
The idea is to divide val by the number of observations in the current group (cond) using n().
This seems like an appropriate situation for data.table:
library(data.table)
(dt <- data.table(df)[,val := val / .N, by = cond][])
# cond val
# 1: 1 5
# 2: 2 6
# 3: 2 6
# 4: 2 6
# 5: 3 15
# 6: 3 15
df <- read.table(
text = "cond val
1 5
2 18
2 18
2 18
3 30
3 30",
header = TRUE,
colClasses = "numeric"
)
In base R
df$result = df$val / ave(df$cond, df$cond, FUN = length)
ave() splits the cond column by its unique values and takes the length of each subvector, i.e., the denominator you asked for.
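For the example data (cond = 1, 2, 2, 2, 3, 3), the denominator vector produced by ave() is:
ave(df$cond, df$cond, FUN = length)
# [1] 1 3 3 3 2 2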
Here is a base R answer using rle() that works when equal cond values appear in contiguous runs:
# get length of repeats
temp <- rle(df$cond)
temp <- data.frame(cond=temp$values, lengths=temp$lengths)
# merge onto data.frame
df <- merge(df, temp, by="cond")
df$valNew <- df$val / df$lengths

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
grp = sort(rep(1:5,4)),
var1 = rep(c('A','B'),10)
)
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 that counts the number of distinct var1 values seen so far, i.e. up to that point in time, within each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function, say cum_n_distinct, for this and use it as:
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer.
The logic is to set the first occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum:
df = df %>%
arrange(time) %>%
group_by(grp,var1) %>%
mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
group_by(grp) %>%
mutate(var2 = cumsum(var_temp)) %>%
select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
Assuming the data is already ordered by time, first define a cumulative distinct-count function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
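A quick sanity check of dist_cum on its own:
dist_cum(c('A', 'B', 'A', 'B'))
# [1] 1 2 2 2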
Update
With your new dataset, here is an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
# Given a vector x, returns a corresponding vector y
# where the ith element of y gives the number of unique
# elements observed up to and including index i
# if na.include = TRUE (default) NA is counted as an
# additional unique element, otherwise it's essentially ignored
temp <- data.table(x, idx = seq_along(x))
firsts <- temp[temp[, .I[1L], by = x]$V1]
if(na.include == FALSE) firsts <- firsts[!is.na(x)]
y <- rep(0, times = length(x))
y[firsts$idx] <- 1
y <- cumsum(y)
return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
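If you prefer to skip the helper function, a one-liner with cumsum() over !duplicated() gives the same result, since !duplicated(var1) is TRUE exactly at the first occurrence of each value within a group (a sketch, assuming the data is arranged by time):
library(dplyr)
d_out <- df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(!duplicated(var1))) %>%
  ungroup()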
