I have data that look like this:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
I'd like to create an order variable for the unique dates within each Subject and Site, i.e.:
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
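For what it's worth, the reordering in the question comes from unlist(tapply(...)) returning the groups in the order of the INDEX factor levels rather than in row order. Below is a minimal base R sketch that writes each group's result back to its original row positions with split() and unsplit(), so the order of the grouping columns no longer matters. The helper name first_seen_rank is made up for illustration.

```r
# Same data as in the question
val <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L),
  Site    = c(2L, 2L, 2L, 1L, 1L, 1L),
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02",
              "2020-01-02", "2020-01-03", "2020-01-03")
)

# Hypothetical helper: rank each value by the order in which it first appears
first_seen_rank <- function(x) match(x, unique(x))

# Two grouping factors that differ only in the order of the columns
g1 <- interaction(val$Subject, val$Site, drop = TRUE)
g2 <- interaction(val$Site, val$Subject, drop = TRUE)

# split() breaks Date into groups; unsplit() puts the per-group results
# back into the original row positions
want1 <- unsplit(lapply(split(val$Date, g1), first_seen_rank), g1)
want2 <- unsplit(lapply(split(val$Date, g2), first_seen_rank), g2)

want1
want2
```

Both calls give the same vector because unsplit() restores row order. ave() does the same split-and-restore internally, which is why it is not affected either.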
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
Hello, I need to count the occurrences of every number in each column.
Example data-frame:
A B C
2 1 2
2 1 1
1 1 3
3 3 3
3 2 2
2 1 2
I want my output to look like this
how_much A B C
1 1 4 1
2 3 1 3
3 2 1 2
In tidyverse you could do:
library(tidyverse)
gather(df1) %>%
group_by(key,value) %>%
count() %>%
pivot_wider(id_cols = value, names_from = key, values_from = n, values_fill = 0)
value A B C
<int> <int> <int> <int>
1 1 1 4 1
2 2 3 1 3
3 3 2 1 2
We can use table
table(unlist(df1), names(df1)[c(col(df1))])
-output
A B C
1 1 4 1
2 3 1 3
3 2 1 2
Or loop over the columns with sapply, and apply table
sapply(df1, table)
A B C
1 1 4 1
2 3 1 3
3 2 1 2
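One caveat with sapply(df1, table): it only simplifies to a matrix when every column contains the same set of values. A small sketch that forces a shared set of levels first, so a value missing from a column is counted as 0. The name lev is just illustrative.

```r
# Same data as in the question
df1 <- data.frame(A = c(2L, 2L, 1L, 3L, 3L, 2L),
                  B = c(1L, 1L, 1L, 3L, 2L, 1L),
                  C = c(2L, 1L, 3L, 3L, 2L, 2L))

# All values that occur anywhere in the data frame
lev <- sort(unique(unlist(df1)))

# Tabulate each column against the shared levels, so every column
# yields a count for every value (0 if the value is absent)
counts <- sapply(df1, function(x) table(factor(x, levels = lev)))
counts
```

The result is a matrix with one row per value and one column per variable.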
data
df1 <- structure(list(A = c(2L, 2L, 1L, 3L, 3L, 2L), B = c(1L, 1L, 1L,
3L, 2L, 1L), C = c(2L, 1L, 3L, 3L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-6L))
For a more flexible solution that works for whatever set of numbers occurs in the data, we can use functions from the purrr package.
library(dplyr)
library(purrr)
df1 %>%
map(~ unique(.x) %>% sort()) %>% reduce(~ union(..1, ..2)) %>%
bind_cols(map_dfr(., ~ map_dfc(df1, function(a) sum(a == .x)))) %>%
rename(what = ...1)
# A tibble: 3 x 4
what A B C
<int> <int> <int> <int>
1 1 1 4 1
2 2 3 1 3
3 3 2 1 2
A slightly verbose answer, but it will work on all data types.
set.seed(1234)
df1 <- data.frame(A = sample(letters[1:3], 8, T),
B = sample(letters[1:3], 8, T),
C = sample(letters[1:3], 8, T))
df1
#> A B C
#> 1 b c b
#> 2 b b a
#> 3 a b c
#> 4 c b c
#> 5 a c c
#> 6 a b a
#> 7 b b b
#> 8 b b a
library(tidyverse)
unique(unlist(apply(df1, 1, unique))) %>% as.data.frame() %>% setNames('how_much') %>%
bind_cols(map_df(unique(unlist(apply(df1, 1, unique))), ~map_int(df1, \(x) sum(x %in% .x) ) ))
#> how_much A B C
#> 1 b 4 6 2
#> 2 c 1 2 3
#> 3 a 3 0 3
Created on 2021-06-23 by the reprex package (v2.0.0)
I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector that indicates true/false if person ID 250 has year 1 and term 3, AND later has year 2 term 3. So a person that only has year 1 term 3 or year 1 term 5 will return false. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried this code but all it returned was a blank df:
df2 <- df1 %>%
group_by(Person.Id) %>%
filter((year==1 & term==3) & (year==2 & term==3))
Are you looking for something like this?
require(dplyr)
df %>%
group_by(Person.Id) %>%
mutate(count=sum((year==1 & term==3) | (year==2 & term==3))) %>%
mutate(count2 = count == 2)
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE
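Note that count == 2 would also be TRUE for a person with two Year 1 / Term 3 rows and no Year 2 / Term 3 row, and it does not look at which row comes first. Here is a rough base R sketch that checks both conditions per person and requires the first to occur before the second. It assumes that row order within a person is chronological, which the question does not actually state. The helper name has_both is made up.

```r
Data <- data.frame(
  Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L, 700L),
  Year      = c(1L, 1L, 2L, 1L, 2L, 1L, 2L),
  Term      = c(3L, 1L, 3L, 3L, 1L, 5L, 3L)
)

first  <- Data$Year == 1 & Data$Term == 3
second <- Data$Year == 2 & Data$Term == 3

# Hypothetical helper: given the row numbers of one person, is there a
# Year 1 / Term 3 row that comes before a Year 2 / Term 3 row?
# (assumes rows are in chronological order within a person)
has_both <- function(rows) {
  a <- which(first[rows])
  b <- which(second[rows])
  length(a) > 0 && length(b) > 0 && min(a) < max(b)
}

# One TRUE/FALSE per person, then expanded back to one value per row
per_person <- vapply(split(seq_len(nrow(Data)), Data$Person.Id),
                     has_both, logical(1))
Data$Flag <- unname(per_person[as.character(Data$Person.Id)])
Data
```

The same any()-style test should carry over to group_by(Person.Id) %>% mutate(...) if dplyr is preferred.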
Maybe this can help:
library(dplyr)
#Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
#Flags
cond1 <- Data$Year==1 & Data$Term==3
cond2 <- Data$Year==2 & Data$Term==3
#Replace
Data$Flag1 <- 0
Data$Flag1[cond1]<-1
Data$Flag2 <- 0
Data$Flag2[cond2]<-1
#Filter
Data %>% group_by(Person.Id) %>% filter(Flag1==1 | Flag2==1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1
I have the following table:
id_question id_event num_events
2015012713 49508 1
2015012711 49708 1
2015011523 41808 3
2015011523 44008 3
2015011523 44108 3
2015011522 41508 3
2015011522 43608 3
2015011522 43708 3
2015011521 39708 1
2015011519 44208 1
The third column gives the count of events by question. I want to create a variable that indexes the events within each question, but only where there are multiple events per question. It would look something like this:
id_question id_event num_events index_event
2015012713 49508 1
2015012711 49708 1
2015011523 41808 3 1
2015011523 44008 3 2
2015011523 44108 3 3
2015011522 41508 3 1
2015011522 43608 3 2
2015011522 43708 3 3
2015011521 39708 1
2015011519 44208 1
How can I do that?
We can use dplyr (from the tidyverse) to create 'index_event' after grouping by 'id_question'. If the number of rows in a group is greater than 1 (n() > 1), take the row sequence (row_number()); every other row falls through to the default of case_when, which is NA.
library(dplyr)
df1 %>%
group_by(id_question) %>%
mutate(index_event = case_when(n() >1 ~ row_number()))
# A tibble: 10 x 4
# Groups: id_question [6]
# id_question id_event num_events index_event
# <int> <int> <int> <int>
# 1 2015012713 49508 1 NA
# 2 2015012711 49708 1 NA
# 3 2015011523 41808 3 1
# 4 2015011523 44008 3 2
# 5 2015011523 44108 3 3
# 6 2015011522 41508 3 1
# 7 2015011522 43608 3 2
# 8 2015011522 43708 3 3
# 9 2015011521 39708 1 NA
#10 2015011519 44208 1 NA
Or with data.table, we use rowid on 'id_question' and change the elements that are 1 in 'num_events' to NA with NA^ (making use of NA^0, NA^1)
library(data.table)
setDT(df1)[, index_event := rowid(id_question) * NA^(num_events == 1)]
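Or, in base R, build the within-group sequence from the frequency table of 'id_question' and set elements to NA as in the previous case. Note that table() orders the groups by sorted value, so sequence(table(id_question)) lines up with the rows only when each group's rows are contiguous and the group sizes follow that sorted order, as they happen to here.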
df1$index_event <- with(df1, sequence(table(id_question)) * NA^(num_events == 1))
df1$index_event
#[1] NA NA 1 2 3 1 2 3 NA NA
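To illustrate the ordering point, here is a small sketch on made-up ids (not the question's data) where sequence(table(...)) no longer matches the rows, next to an ave()-based index that does not depend on how the rows are sorted.

```r
# Hypothetical ids, deliberately not in sorted order
id <- c(5L, 5L, 3L, 7L, 7L, 7L)

# table() counts in sorted order (3, 5, 7), so the runs are emitted as
# 1 | 1 2 | 1 2 3 and no longer line up with the rows
naive <- sequence(as.vector(table(id)))

# ave() computes the index inside each group and writes it back
# to the original row positions
robust <- ave(seq_along(id), id, FUN = seq_along)

# Blank out groups that have a single row
n_in_group <- ave(seq_along(id), id, FUN = length)
index_event <- replace(robust, n_in_group == 1, NA)

naive
robust
index_event
```

Here naive puts the 2 in the third row while robust puts it in the second, and index_event is NA only for the single-row id 3.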
data
df1 <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L,
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L,
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L,
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
Return NA where num_events is 1, and otherwise create a row index within each id_question.
This can be done in base R:
df$index_event <- with(df, ave(num_events == 1, id_question,
FUN = function(x) replace(seq_along(x), x, NA)))
df
# id_question id_event num_events index_event
#1 2015012713 49508 1 NA
#2 2015012711 49708 1 NA
#3 2015011523 41808 3 1
#4 2015011523 44008 3 2
#5 2015011523 44108 3 3
#6 2015011522 41508 3 1
#7 2015011522 43608 3 2
#8 2015011522 43708 3 3
#9 2015011521 39708 1 NA
#10 2015011519 44208 1 NA
dplyr :
library(dplyr)
df %>%
group_by(id_question) %>%
mutate(index_event = if_else(num_events == 1, NA_integer_, row_number()))
Or data.table :
library(data.table)
setDT(df)
df[,index_event := ifelse(num_events == 1, NA_integer_, seq_len(.N)), id_question]
data
df <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L,
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L,
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L,
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)),class = "data.frame",row.names = c(NA, -10L))
I have a list of events by ID and would like to group them in two week periods. The two weeks should start whenever the first event occurs for each ID. The grouped event data should look something like the following,
ID Date Group
<dbl> <date> <dbl>
1 2018-01-01 1
1 2018-01-02 1
1 2018-01-02 1
1 2018-02-01 2
1 2018-03-01 3
2 2018-01-01 4
2 2018-04-01 5
dat = structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), Date = structure(c(17532,
17533, 17533, 17563, 17591, 17532, 17622), class = "Date"), Group = c(1L,
1L, 1L, 2L, 3L, 4L, 5L)), .Names = c("ID", "Date", "Group"), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I was originally thinking of lagging by ID and filtering for events that happen within a two week period, but there may be many events that correspond to a single two week period.
You can use cut and seq to round to the nearest two week cutoff, then group_indices to make an increasing index:
dat %>%
group_by(ID) %>%
mutate(g = cut(Date, seq(first(Date), max(Date) + 14, by="2 weeks")) %>% as.character) %>%
ungroup %>%
mutate(g = group_indices(., ID, g))
# A tibble: 7 x 4
ID Date Group g
<int> <date> <int> <int>
1 1 2018-01-01 1 1
2 1 2018-01-02 1 1
3 1 2018-01-02 1 1
4 1 2018-02-01 2 2
5 1 2018-03-01 3 3
6 2 2018-01-01 4 4
7 2 2018-04-01 5 5
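Alternatively, take the difference between adjacent 'Date's with difftime in units of weeks, flag differences greater than 2, and take the cumulative sum. Note that this starts a new group at every gap longer than two weeks (in either direction, so a change of ID with a backwards jump in date also counts) rather than using fixed two-week windows from each ID's first event; on this data the two give the same groups.
library(dplyr)
dat %>%
mutate(GroupNew = cumsum(abs(difftime(Date, lag(Date,
default = first(Date)), units = "weeks")) > 2) + 1)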
# A tibble: 7 x 4
# ID Date Group GroupNew
# <int> <date> <int> <dbl>
#1 1 2018-01-01 1 1
#2 1 2018-01-02 1 1
#3 1 2018-01-02 1 1
#4 1 2018-02-01 2 2
#5 1 2018-03-01 3 3
#6 2 2018-01-01 4 4
#7 2 2018-04-01 5 5
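If fixed two-week windows counted from each ID's first event are what is wanted, the arithmetic can also be done directly. This is only a sketch in base R, assuming the rows are already sorted by ID and then Date; the column names days_in, window and Group2 are made up.

```r
dat <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-02",
                   "2018-02-01", "2018-03-01", "2018-01-01", "2018-04-01"))
)

# Days elapsed since the first event of the same ID
days_in <- ave(as.integer(dat$Date), dat$ID, FUN = function(d) d - min(d))

# 0 for days 0-13, 1 for days 14-27, and so on
window <- days_in %/% 14L

# Number the (ID, window) pairs in order of first appearance
key <- paste(dat$ID, window)
dat$Group2 <- match(key, unique(key))
dat
```

Empty windows are skipped in the numbering, so Group2 increases by one each time a new (ID, window) pair shows up.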
I have a data frame like this:
ID Cont
1 a
1 a
1 b
2 a
2 c
2 d
I need to report the frequency of "Cont" by ID. The output should be
ID Cont Freq
1 a 2
1 b 1
2 a 1
2 c 1
2 d 1
Using dplyr, you can group_by both ID and Cont and summarise using n() to get Freq:
library(dplyr)
res <- df %>% group_by(ID,Cont) %>% summarise(Freq=n())
##Source: local data frame [5 x 3]
##Groups: ID [?]
##
## ID Cont Freq
## <int> <fctr> <int>
##1 1 a 2
##2 1 b 1
##3 2 a 1
##4 2 c 1
##5 2 d 1
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Cont = structure(c(1L,
1L, 2L, 1L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("ID",
"Cont"), class = "data.frame", row.names = c(NA, -6L))
## ID Cont
##1 1 a
##2 1 a
##3 1 b
##4 2 a
##5 2 c
##6 2 d
library(data.table)
setDT(df)[, .(Freq = .N), by = .(ID, Cont)]
# ID Cont Freq
# 1: 1 a 2
# 2: 1 b 1
# 3: 2 a 1
# 4: 2 c 1
# 5: 2 d 1
With base R:
df1 <- subset(as.data.frame(table(df)), Freq != 0)
If you want to order by ID, add this line (note the comma, which selects rows rather than columns):
df1[order(df1$ID), ]
ID Cont Freq
1 1 a 2
3 1 b 1
2 2 a 1
6 2 c 1
8 2 d 1
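Another base R possibility, sketched here only as an alternative, is aggregate(), which returns just the combinations that actually occur, so no Freq != 0 filter is needed. The helper column Freq = 1L is added purely so there is something to sum.

```r
df <- data.frame(ID   = c(1L, 1L, 1L, 2L, 2L, 2L),
                 Cont = c("a", "a", "b", "a", "c", "d"))

# Give every row a weight of 1 and sum the weights per ID/Cont pair
res <- aggregate(Freq ~ ID + Cont, data = transform(df, Freq = 1L), FUN = sum)

# Sort by ID, then Cont, to match the requested layout
res <- res[order(res$ID, res$Cont), ]
res
```

After sorting, the rows come out in the same order as the requested output.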