R Count Unique By Group in dplyr

HAVE = data.frame("TRIMESTER" = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4),
"STUDENT" = c(1,2,3,3,4,2,5,6,7,1,2,2,2,2,2,1,2,3,4,5))
HAVE$WANT1 = c(4,4,4,4,4,5,5,5,5,5,1,1,1,1,5,5,5,5,5,5)
HAVE$WANT2 = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1)
I have HAVE and wish to append a column that counts the unique values of STUDENT within every TRIMESTER (shown as WANT1). I also wish to create WANT2, which is the number of times STUDENT == 5 appears in each TRIMESTER: student 5 appears zero times in TRIMESTER == 1, so the value for all TRIMESTER == 1 rows is 0, but student 5 appears once in TRIMESTER == 4, so the value there is 1.

After grouping by 'TRIMESTER', get the count of distinct elements of 'STUDENT' with n_distinct and the count of rows where STUDENT == 5 with sum on a logical expression:
library(dplyr)
HAVE %>%
  group_by(TRIMESTER) %>%
  mutate(WANT1new = n_distinct(STUDENT),
         WANT2NEW = sum(STUDENT == 5)) %>%
  ungroup()
Output:
# A tibble: 20 × 6
   TRIMESTER STUDENT WANT1 WANT2 WANT1new WANT2NEW
       <dbl>   <dbl> <dbl> <dbl>    <int>    <int>
 1         1       1     4     0        4        0
 2         1       2     4     0        4        0
 3         1       3     4     0        4        0
 4         1       3     4     0        4        0
 5         1       4     4     0        4        0
 6         2       2     5     1        5        1
 7         2       5     5     1        5        1
 8         2       6     5     1        5        1
 9         2       7     5     1        5        1
10         2       1     5     1        5        1
11         3       2     1     0        1        0
12         3       2     1     0        1        0
13         3       2     1     0        1        0
14         3       2     1     0        1        0
15         4       2     5     1        5        1
16         4       1     5     1        5        1
17         4       2     5     1        5        1
18         4       3     5     1        5        1
19         4       4     5     1        5        1
20         4       5     5     1        5        1

The code below should produce the desired result.
library(dplyr)
HAVE %>%
  group_by(TRIMESTER) %>%
  mutate(WANT1 = length(unique(STUDENT)),
         WANT2 = as.numeric(any(5 == STUDENT)))
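If a one-row-per-TRIMESTER summary is wanted instead of appended columns, a summarise-based sketch of the same logic would look like this (the column names N_STUDENTS and N_STUDENT5 are just illustrative):
library(dplyr)
# One row per TRIMESTER: number of distinct students and count of rows with STUDENT == 5
HAVE %>%
  group_by(TRIMESTER) %>%
  summarise(N_STUDENTS = n_distinct(STUDENT),  # corresponds to WANT1
            N_STUDENT5 = sum(STUDENT == 5))    # corresponds to WANT2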

Related

Create a column with an ID starting at 1 that increments when the value in another column changes in R

I have a data frame like so:
ID <- c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B')
val1 <- c(0,1,2,3,4,5,6,7,8,9,10,11,0,1,2,3)
val2 <- c(0,1,2,3,4,5,0,1,0,1,2,0,1,0,1,2)
df <- data.frame(ID, val1, val2)
Output:
ID val1 val2
1 A 0 0
2 A 1 1
3 A 2 2
4 A 3 3
5 A 4 4
6 A 5 5
7 A 6 0
8 A 7 1
9 A 8 0
10 A 9 1
11 A 10 2
12 B 11 0
13 B 0 1
14 B 1 0
15 B 2 1
16 B 3 2
I am trying to create a third column (val3) which acts like an index. When val1 = 0 and val2 = 0 it should be 1 (this is also grouped by ID). It should stay at 1 and then increment by 1 each time val2 = 0 again, as in the desired output below:
ID val1 val2 val3
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
How can this be achieved? I tried:
df <- df %>%
  group_by(ID, val2) %>%
  mutate(val3 = row_number())
And:
df$val3 <- cumsum(c(1,diff(df$val2)==0))
But neither provides the desired outcome.
Inside cumsum, use the logical comparison val2 == 0:
df %>%
  group_by(ID) %>%
  mutate(val3 = cumsum(val2 == 0))
# A tibble: 16 × 4
# Groups: ID [2]
ID val1 val2 val3
<chr> <dbl> <dbl> <int>
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
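For reference, the same cumulative count can be written in base R with ave (a sketch, assuming df as defined above):
# Base R: within each ID, count how many val2 == 0 rows have been seen so far
df$val3 <- with(df, ave(as.numeric(val2 == 0), ID, FUN = cumsum))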

Row sequence by group using two columns

Suppose I have the following df:
data <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
                   Value = c(1,1,0,1,0,1,1,1,0,0,1,0,0,0),
                   Result = c(1,1,2,3,4,5,5,1,2,2,3,1,1,1))
How can I obtain column Result from the first two columns?
I have tried different approaches using rle, seq, cumsum and cur_group_id, but can't get the Result column easily.
library(data.table)
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(Result2 = rleid(Value))
This gives us:
ID Value Result Result2
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
Does this work?
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(r = rep(seq_along(rle(ID * Value)$values), rle(ID * Value)$lengths))
# A tibble: 14 x 4
# Groups: ID [3]
ID Value Result r
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
We could use rle with ave in base R:
data$Result2 <- with(data, ave(Value, ID, FUN = function(x)
  inverse.rle(within.list(rle(x), values <- seq_along(values)))))
data$Result2
#[1] 1 1 2 3 4 5 5 1 2 2 3 1 1 1
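Another option that stays within dplyr is to count value changes directly with lag (a sketch; Result3 is just an illustrative column name):
library(dplyr)
# Increment whenever Value differs from the previous row, within each ID
data %>%
  group_by(ID) %>%
  mutate(Result3 = cumsum(Value != lag(Value, default = first(Value))) + 1) %>%
  ungroup()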

Is there a way in R to fill in missing groups that have no observations?

Say I have something like:
df <- data.frame(group = c(1,1,1,2,2,2,3,3,3,4,4,1,1,1),
                 group2 = c(1,2,3,1,2,3,1,2,3,1,3,1,2,3))
group group2
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 3
12 1 1
13 1 2
14 1 3
My goal is to count the number of unique instances for each combination of group and group2, like so:
df1 <- df %>%
  group_by(group, group2) %>%
  mutate(want = n()) %>%
  distinct(group, group2, .keep_all = TRUE)
group group2 want
<dbl> <dbl> <int>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 3 1
However, notice that the combination group == 4, group2 == 2 was not in my dataset to begin with. Is there some sort of autofill function that can fill these non-observations with a zero, to easily get the result below?
group group2 want
<dbl> <dbl> <int>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 2 0
12 4 3 1
After getting the count, we can expand with complete to fill the missing combinations with 0:
library(dplyr)
library(tidyr)
df %>%
  count(group, group2) %>%
  complete(group, group2, fill = list(n = 0))
# A tibble: 12 x 3
# group group2 n
# <dbl> <dbl> <dbl>
# 1 1 1 2
# 2 1 2 2
# 3 1 3 2
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 1
# 8 3 2 1
# 9 3 3 1
#10 4 1 1
#11 4 2 0
#12 4 3 1
Or, instead of mutate followed by distinct after the group_by, use summarise directly:
df %>%
  group_by(group, group2) %>%
  summarise(n = n()) %>%
  ungroup %>%
  complete(group, group2, fill = list(n = 0))
Here is a data.table approach to this problem:
library(data.table)
setDT(df)[CJ(group, group2, unique = TRUE),
          c(.SD, .(want = .N)), .EACHI,
          on = c("group", "group2")]
# group group2 want
# 1 1 2
# 1 2 2
# 1 3 2
# 2 1 1
# 2 2 1
# 2 3 1
# 3 1 1
# 3 2 1
# 3 3 1
# 4 1 1
# 4 2 0
# 4 3 1
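A further sketch, not taken from the answers above, builds the full grid of group/group2 combinations with tidyr::crossing and joins the observed counts onto it:
library(dplyr)
library(tidyr)
# Full grid of combinations, with unobserved ones filled with 0
df %>%
  count(group, group2) %>%
  right_join(crossing(group = unique(df$group),
                      group2 = unique(df$group2)),
             by = c("group", "group2")) %>%
  mutate(n = replace_na(n, 0)) %>%
  arrange(group, group2)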

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing my first bigger problem.
Let's take the following panel dataset as an example:
N <- 5
T <- 3
time <- rep(1:T, times = N)
id <- rep(1:N, each = T)
dummy <- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df <- as.data.frame(cbind(id, time, dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable to take the value 1 for all rows of a cross-section after the first 1 for that cross-section appears. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax:
df %>%
  group_by(id) %>%
  mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages, you could do:
transform(df, dummy = ave(dummy, id, FUN = cummax))
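dplyr also provides cumany, so an equivalent sketch is to flag every row at or after the first 1 within each id:
library(dplyr)
# TRUE from the first dummy == 1 onwards within each id, coerced back to 0/1
df %>%
  group_by(id) %>%
  mutate(dummy = as.integer(cumany(dummy == 1))) %>%
  ungroup()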

Adding a sequence of numbers to the data

Hi, I have a data frame like this:
df <- data.frame(x = rep(rep(seq(0, 3), each = 2), 2), gr = gl(2, 8))
x gr
1 0 1
2 0 1
3 1 1
4 1 1
5 2 1
6 2 1
7 3 1
8 3 1
9 0 2
10 0 2
11 1 2
12 1 2
13 2 2
14 2 2
15 3 2
16 3 2
I want to add a new column with a numbering sequence that starts over whenever the x value == 0.
I tried:
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(numbering = seq(2, 8, 2))
Error in mutate_impl(.data, dots) :
Column `numbering` must be length 8 (the group size) or one, not 4
?
Just as a side note, mutate(numbering = rep(seq(2, 8, 2), each = 2)) would work for this minimal example, but for the general case it is better to look at where the x value changes from 0!
The expected output:
x gr numbering
1 0 1 2
2 0 1 2
3 1 1 4
4 1 1 4
5 2 1 6
6 2 1 6
7 3 1 8
8 3 1 8
9 0 2 2
10 0 2 2
11 1 2 4
12 1 2 4
13 2 2 6
14 2 2 6
15 3 2 8
16 3 2 8
Do you mean something like this?
library(tidyverse);
df %>%
  group_by(gr) %>%
  mutate(numbering = cumsum(c(1, diff(x) != 0)))
## A tibble: 16 x 3
## Groups: gr [2]
# x gr numbering
# <int> <fct> <dbl>
# 1 0 1 1.
# 2 0 1 1.
# 3 1 1 2.
# 4 1 1 2.
# 5 2 1 3.
# 6 2 1 3.
# 7 3 1 4.
# 8 3 1 4.
# 9 0 2 1.
#10 0 2 1.
#11 1 2 2.
#12 1 2 2.
#13 2 2 3.
#14 2 2 3.
#15 3 2 4.
#16 3 2 4.
Or if you must have a numbering sequence 2, 4, 6, ... instead of 1, 2, 3, ..., you can do:
df %>%
  group_by(gr) %>%
  mutate(numbering = 2 * cumsum(c(1, diff(x) != 0)))
## A tibble: 16 x 3
## Groups: gr [2]
# x gr numbering
# <int> <fct> <dbl>
# 1 0 1 2.
# 2 0 1 2.
# 3 1 1 4.
# 4 1 1 4.
# 5 2 1 6.
# 6 2 1 6.
# 7 3 1 8.
# 8 3 1 8.
# 9 0 2 2.
#10 0 2 2.
#11 1 2 4.
#12 1 2 4.
#13 2 2 6.
#14 2 2 6.
#15 3 2 8.
#16 3 2 8.
Here is an option using match to get the index and then pass on the seq values to fill:
df %>%
  group_by(gr) %>%
  mutate(numbering = seq(2, length.out = n() / 2, by = 2)[match(x, unique(x))])
# A tibble: 16 x 3
# Groups: gr [2]
# x gr numbering
# <int> <fct> <dbl>
# 1 0 1 2
# 2 0 1 2
# 3 1 1 4
# 4 1 1 4
# 5 2 1 6
# 6 2 1 6
# 7 3 1 8
# 8 3 1 8
# 9 0 2 2
#10 0 2 2
#11 1 2 4
#12 1 2 4
#13 2 2 6
#14 2 2 6
#15 3 2 8
#16 3 2 8
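For completeness, a base R sketch of the same idea, counting changes in x within each gr and doubling the counter:
# Base R: restart at 2 for each gr and step by 2 whenever x changes
df$numbering <- with(df, ave(x, gr, FUN = function(v)
  2 * cumsum(c(1, diff(v) != 0))))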
