I would like to track min and max occurrences of IDs across two columns. This should be done in a rolling fashion from the beginning of the data, so we can track how many times each ID has appeared overall as of each date. It doesn't matter which column an ID appears in.
The result should be as follows. In row 1, neither B nor C has occurred yet, so min_appearances is 0, but max_appearances is 1 because A and D were present. By row 5, A and D have each been present 3 times, but B and C only 2. I'm not concerned with which ID is present, only with what the min and max counts are. Also, the real data is more complicated, so the pairs are not static: A could face C, and so on.
# A tibble: 8 x 5
date id1 id2 min_appearances max_appearances
<date> <chr> <chr> <dbl> <dbl>
1 2020-01-01 A D 0 1
2 2020-01-02 B C 1 1
3 2020-01-03 C B 1 2
4 2020-01-04 D A 2 2
5 2020-01-05 A D 2 3
6 2020-01-06 B C 3 3
7 2020-01-07 C B 3 4
8 2020-01-08 D A 4 4
DATA:
library(dplyr)
date <- seq(as.Date("2020/1/1"), by = "day", length.out = 8)
id1 <- rep(c("A", "B", "C", "D"), 2)
id2 <- rep(c("D", "C", "B", "A"), 2)
dt <- tibble(date = date,
             id1 = id1,
             id2 = id2)
Here's a way to do it using functions from the tidyverse. First, pivot_longer() to make the data easier to handle. Then compute the cumulative count of appearances for every unique id. Compute the row-wise min and max over the count columns. Finally, keep the last min and max values for each pair, and pivot back to wide.
library(tidyverse)
dt %>%
  pivot_longer(cols = -date, values_to = "id") %>%
  # one cumulative appearance counter per unique id: count_A, count_B, ...
  mutate(map_dfc(unique(id), ~ tibble("count_{.x}" := cumsum(id == .x)))) %>%
  # row-wise min and max over all the count columns
  mutate(min_appearances = do.call(pmin, select(., starts_with("count"))),
         max_appearances = do.call(pmax, select(., starts_with("count")))) %>%
  group_by(date) %>%
  # keep the counts as of the last id seen on each date
  mutate(across(min_appearances:max_appearances, last),
         n = row_number()) %>%
  pivot_wider(id_cols = c(date, min_appearances, max_appearances),
              names_from = n, values_from = id, names_prefix = "id") %>%
  relocate(order(colnames(.)))
date id1 id2 max_appearances min_appearances
<date> <chr> <chr> <int> <int>
1 2020-01-01 A D 1 0
2 2020-01-02 B C 1 1
3 2020-01-03 C B 2 1
4 2020-01-04 D A 2 2
5 2020-01-05 A D 3 2
6 2020-01-06 B C 3 3
7 2020-01-07 C B 4 3
8 2020-01-08 D A 4 4
I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
              "A", "a;b;c",
              "B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it across rows. This code gives the expected result, but it is clumsy:
df %>%
  separate(data,
           into = paste0("dat_", 1:5),
           sep = ";",
           fill = "right") %>%
  pivot_longer(starts_with("dat_"),
               names_to = "data_number",
               names_pattern = "dat_(\\d+)") %>%
  filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way
library(dplyr)
library(tidyr)
library(data.table)
df %>%
  separate_rows(data) %>%
  mutate(data_number = rowid(id), .before = 2)
Output:
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
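If you'd rather stay entirely within the tidyverse, a grouped row_number() gives the same per-id counter as data.table::rowid — a sketch, equivalent for this data:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(data) %>%
  group_by(id) %>%
  # number the rows within each id, placing the counter as the second column
  mutate(data_number = row_number(), .before = 2) %>%
  ungroup()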
library(dplyr)
library(tidyr)
df %>%
  separate_rows(data)
Output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest -
library(tidyverse)
df %>%
  mutate(data = str_split(data, ';'),
         data_number = map(data, seq_along)) %>%
  unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
I have a data frame:
df = tibble(a=c(7,6,10,12,12), b=c(3,5,8,8,7), c=c(4,4,12,15,20), week=c(1,2,3,4,5))
# A tibble: 5 x 4
a b c week
<dbl> <dbl> <dbl> <dbl>
1 7 3 4 1
2 6 5 4 2
3 10 8 12 3
4 12 8 15 4
5 12 7 20 5
and I want, for each of the columns a, b and c, the week in which the observation first equals or exceeds 10.
I.e. for column a it would be week 3, for column b it would be NA, and for column c it would be week 3 as well.
A desired outcome could look like this:
tibble(abc = c("a", "b", "c"), value = c(10, NA, 12), week = c(3, NA, 3))
# A tibble: 3 x 3
abc value week
<chr> <dbl> <dbl>
1 a 10 3
2 b NA NA
3 c 12 3
One way would be to get the data in long format and, for each column name, select the first value that is greater than or equal to 10. We fill in the missing combinations with complete().
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -week, names_to = 'abc') %>%
  group_by(abc) %>%
  slice(which(value >= 10)[1]) %>%
  ungroup() %>%
  complete(abc = names(df)[-4])
# A tibble: 3 x 3
# abc week value
# <chr> <dbl> <dbl>
#1 a 3 10
#2 b NA NA
#3 c 3 12
Another way is to first calculate what we want and then transform the dataset into long format.
df %>%
  summarise(across(a:c, list(week = ~ week[which(. >= 10)[1]],
                             value = ~ .[. >= 10][1]))) %>%
  pivot_longer(cols = everything(),
               names_to = c('abc', '.value'),
               names_sep = "_")
How can I use R to create a rank column? Below is an example.
This is what I have:
Date       group
12/5/2020  A
12/5/2020  A
11/7/2020  A
11/7/2020  A
11/9/2020  B
11/9/2020  B
10/8/2020  B
This is what I want:
Date       group  rank
12/5/2020  A      2
12/5/2020  A      2
11/7/2020  A      1
11/7/2020  A      1
11/9/2020  B      2
11/9/2020  B      2
10/8/2020  B      1
tidyverse
(I'm using dplyr here since I think it is easy to see the steps being done.)
A first approach might be to capitalize on R's factor function, which assigns an integer to each distinct value, so that operations on this factor are faster (when compared with strings). That is, it takes a (possibly looooong) vector of strings and converts it into a just-as-long vector of integers (much smaller and faster) plus a very short vector of strings, where the integers are indices into the small vector of strings. This small vector is called the factor's "levels".
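For instance, a quick illustration of that encoding:
f <- factor(c("12/5/2020", "11/7/2020", "11/7/2020"))
as.integer(f)   # integer codes, one per element
# [1] 2 1 1
levels(f)       # the short vector the codes index into
# [1] "11/7/2020" "12/5/2020"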
library(dplyr)
group_by(dat, group) %>%
  mutate(rank = as.integer(factor(Date))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <chr> <chr> <int>
# 1 12/5/2020 A 2
# 2 12/5/2020 A 2
# 3 11/7/2020 A 1
# 4 11/7/2020 A 1
# 5 11/9/2020 B 2
# 6 11/9/2020 B 2
# 7 10/8/2020 B 1
This "sorta" works, but there are two problems:
This is reliant on the lexicographic sorting of the Date column, for which this data sample is acceptable, but this will fail. A better way is to convert to something more appropriately sortable, such as a Date object.
Failing sorts:
sort(c("12/9/2020", "11/9/2020", "2/9/2020"))
# [1] "11/9/2020" "12/9/2020" "2/9/2020"
dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(group) %>%
  mutate(rank = as.integer(factor(Date))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
and
There really are better functions for ranking, such as dplyr::dense_rank (which @akrun put in an answer first ... I was building to it, honestly):
dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(group) %>%
  mutate(rank = dense_rank(Date)) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
We can use dense_rank after converting 'Date' to Date class:
library(dplyr)
library(lubridate)
df1 %>%
  group_by(group) %>%
  mutate(rank = dense_rank(mdy(Date)))
# A tibble: 7 x 3
# Groups: group [2]
# Date group rank
# <chr> <chr> <int>
#1 12/5/2020 A 2
#2 12/5/2020 A 2
#3 11/7/2020 A 1
#4 11/7/2020 A 1
#5 11/9/2020 B 2
#6 11/9/2020 B 2
#7 10/8/2020 B 1
data
df1 <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-7L))
Convert the Date column to an actual date object, arrange the data by Date, and use match with unique to get the rank column.
library(dplyr)
df %>%
  mutate(Date = lubridate::mdy(Date)) %>%
  arrange(group, Date) %>%
  group_by(group) %>%
  mutate(rank = match(Date, unique(Date)))
# Date group rank
# <date> <chr> <int>
#1 2020-11-07 A 1
#2 2020-11-07 A 1
#3 2020-12-05 A 2
#4 2020-12-05 A 2
#5 2020-10-08 B 1
#6 2020-11-09 B 2
#7 2020-11-09 B 2
data
df <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA, -7L))
I have the following data frame:
library(dplyr)
library(tibble)
df <- tibble(
source = c("a", "b", "c", "d", "e"),
score = c(10, 5, NA, 3, NA ) )
df
It looks like this:
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10    # current max value
2 b 5
3 c NA
4 d 3
5 e NA
What I want to do is replace the NAs in the score column with values counting up from the existing max: the n-th NA from the top becomes max + n.
Resulting in this (hand-coded):
source score
a 10
b 5
c 11 # obtained from 10 + 1
d 3
e 12 # obtained from 10 + 2
How can I achieve that?
Another option, using base R's transform():
transform(df, score = pmin(max(score, na.rm = TRUE) +
                             cumsum(is.na(score)), score, na.rm = TRUE))
# source score
#1 a 10
#2 b 5
#3 c 11
#4 d 3
#5 e 12
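This works because pmin(..., na.rm = TRUE) takes the elementwise minimum while skipping NA candidates, and an existing score can never exceed the running max-plus-counter. A minimal illustration:
pmin(c(11, 11), c(NA, 3), na.rm = TRUE)  # NA is ignored, so the other candidate wins
# [1] 11  3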
If you want to do this in dplyr:
library(dplyr)
df %>%
  mutate(score = pmin(max(score, na.rm = TRUE) +
                        cumsum(is.na(score)), score, na.rm = TRUE))
A base R solution:
# seq() over the NA positions yields 1, 2, ... (one per NA), offset by the current max
df$score[is.na(df$score)] <- seq(which(is.na(df$score))) + max(df$score, na.rm = TRUE)
such that
> df
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
Here is a dplyr approach,
df %>%
  mutate(score = replace(score,
                         is.na(score),
                         (max(score, na.rm = TRUE) + cumsum(is.na(score)))[is.na(score)]))
which gives,
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
With dplyr:
library(dplyr)
df %>%
  mutate_at("score", ~ ifelse(is.na(.), max(., na.rm = TRUE) + cumsum(is.na(.)), .))
Result:
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
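Note that mutate_at() is superseded in current dplyr; a plain mutate() expressing the same idea (a sketch):
df %>%
  mutate(score = ifelse(is.na(score), max(score, na.rm = TRUE) + cumsum(is.na(score)), score))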
A dplyr solution.
df %>%
  mutate(na_count = cumsum(is.na(score)),
         score = ifelse(is.na(score), max(score, na.rm = TRUE) + na_count, score)) %>%
  select(-na_count)
# A tibble: 5 x 2
# source score
# <chr> <dbl>
#1 a 10
#2 b 5
#3 c 11
#4 d 3
#5 e 12
Another one, quite similar to ThomasIsCoding's solution:
df$score[is.na(df$score)] <- max(df$score, na.rm = TRUE) + (1:sum(is.na(df$score)))
df
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
Not quite as elegant as the base R solutions, but still possible:
library(data.table)
setDT(df)
max.score = df[, max(score, na.rm = TRUE)]
df[is.na(score), score := (1:.N) + max.score]
Or in one line but a bit slower:
df[is.na(score), score := (1:.N) + df[, max(score, na.rm = TRUE)]]
df
source score
1: a 10
2: b 5
3: c 11
4: d 3
5: e 12
I have data:
rowID  incidentID  participant.type
1      1           A
2      1           B
3      2           A
4      3           A
5      3           B
6      3           C
7      4           B
8      4           C
And I would like to end up with:
rowID  incident  participant.type  participant.type.1  participant.type.2
1      1         A                 B
2      2         A
3      3         A                 B                   C
4      4         B                 C
I tried spread() but can't achieve one line per incident; I don't think I have a way of creating a key-value pair, so I wonder if there is some other method for doing this.
Before using spread(), you need to create a proper key argument.
df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  mutate(id = 1:n()) %>%
  spread(id, participant.type)
# incidentID `1` `2` `3`
# <int> <fct> <fct> <fct>
# 1 1 A B NA
# 2 2 A NA NA
# 3 3 A B C
# 4 4 B C NA
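Note that spread() is retired in current tidyr; a pivot_wider() version of the same idea, using the same df (a sketch):
df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  # per-incident counter that becomes the new column names
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = id, values_from = participant.type)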
Since your grouping is based on the row order within the incidentID column, the following simple solution will also work.
It just filters the data frame and merges the pieces in the end.
It is probably not the best solution in terms of efficient use of computing power, but it is easy to understand.
library(tidyverse)
df <-
tribble(
~rowID, ~incidentID, ~participant.type,
1, 1, "A",
2, 1, "B",
3, 2, "A",
4, 3, "A",
5, 3, "B",
6, 3, "C",
7, 4, "B",
8, 4, "C")
df_1 <- df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  filter(row_number() == 1)
df_2 <- df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  filter(row_number() == 2) %>%
  rename(participant.type.1 = participant.type)
df_3 <- df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  filter(row_number() == 3) %>%
  rename(participant.type.2 = participant.type)
full_join(df_1, full_join(df_2, df_3))
Result:
Joining, by = "incidentID"
Joining, by = "incidentID"
# A tibble: 4 x 4
# Groups: incidentID [?]
incidentID participant.type participant.type.1 participant.type.2
<dbl> <chr> <chr> <chr>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
Here's my solution:
df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  nest() %>%
  mutate(data = map_chr(data, ~ str_c(.x$participant.type, collapse = '_'))) %>%
  separate(data, paste0('participant.type.', 0:2)) %>%
  mutate_at(2:4, ~ replace_na(.x, ''))
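For reference, this should give one row per incident (separate() will warn about the short pieces it pads with NA before replace_na() turns them into empty strings):
#  incidentID participant.type.0 participant.type.1 participant.type.2
#1          1 A                  B                  ""
#2          2 A                  ""                 ""
#3          3 A                  B                  C
#4          4 B                  C                  ""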
We can use reshape2::dcast for this
reshape2::dcast(df, insidentID ~ participant.type)
# insidentID A B C
# 1 1 <NA> B <NA>
# 2 8 <NA> B <NA>
# 3 12 <NA> <NA> C
# 4 16 A <NA> <NA>
# 5 24 <NA> B <NA>
# 6 27 <NA> B C
# 7 29 <NA> <NA> C
with the data
set.seed(123)
df <- data.frame(insidentID = sample(0:30, 8L, replace = TRUE),
                 participant.type = sample(LETTERS[1:3], 8L, replace = TRUE),
                 stringsAsFactors = FALSE)
df
# insidentID participant.type
# 1 8 B
# 2 24 B
# 3 12 C
# 4 27 B
# 5 29 C
# 6 1 B
# 7 16 A
# 8 27 C
The 'related question' link provided by @markus shows a variety of other solutions, including what appears to be the most concise in tidyverse format:
df %>%
  group_by(incidentID) %>%
  mutate(rn = paste0("newcolumn", row_number())) %>%
  spread(rn, participant.type)
gives:
incidentID newcolumn1 newcolumn2 newcolumn3
<int> <fct> <fct> <fct>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA