I have data like this:
df<-data.frame(one=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7),
test=c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0))
I want to count the consecutive 1s in 'test' within each value of 'one'. Importantly, they have to be consecutive, and where a group contains several runs I want the longest one. So I'd want:
dfwant<-data.frame(one=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7),
test=c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0),
want=c(2, 2, 1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 0, 0))
I got pretty close with rle but was never able to make the new want column.
An attempt in base R using ave, grouping by the one column and a cumulative sum of values that are not equal to 1 in the test column:
ave(df$test, list(df$one, cumsum(df$test != 1)), FUN=function(x) if(any(x==1)) sum(x) else x )
# [1] 2 2 1 1 1 2 2 3 3 3 1 1 1 0 0
A shortening of this logic, with a hat-tip to @RonakShah, is:
ave(df$test == 1, df$one, cumsum(df$test != 1), FUN = sum)
One option is rleid from data.table: group by the run-length ID of 'one' and 'test' and take the sum of 'test' as 'want', then group by 'one' and set 'want' to the max of 'want' within each group.
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(one, test))%>%
mutate(want = sum(test)) %>%
group_by(one) %>%
mutate(want = max(want)) %>%
dplyr::select(-grp)
# A tibble: 15 x 3
# Groups: one [7]
# one test want
# <dbl> <dbl> <dbl>
# 1 1 1 2
# 2 1 1 2
# 3 2 1 1
# 4 2 0 1
# 5 2 1 1
# 6 3 1 2
# 7 3 1 2
# 8 4 1 3
# 9 4 1 3
#10 4 1 3
#11 5 0 1
#12 5 1 1
#13 6 1 1
#14 7 0 0
#15 7 0 0
Or using data.table
# index the run counts back onto the rows so both vectors have the same length
setDT(df)[, want := max(tabulate(rleid(test))[rleid(test)] * test), .(one)]
You can use rle to obtain the lengths of different runs with 1 and then take the maximum of those lengths
library(dplyr)
df %>%
group_by(one) %>%
mutate(want = with(rle(test == 1), max(0, lengths[values], na.rm = TRUE)))
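For anyone who prefers to stay in base R, the same rle idea can probably be combined with ave. This is only a sketch: `longest_run` is a helper name I made up for illustration, and it assumes `test` only ever holds 0 and 1.

```r
df <- data.frame(one  = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7),
                 test = c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0))

# longest_run is a hypothetical helper: length of the longest run of 1s in x,
# or 0 when x contains no 1 at all
longest_run <- function(x) with(rle(x == 1), max(0, lengths[values]))

# ave() applies the helper per value of 'one' and recycles the
# single result across every row of that group
df$want <- ave(df$test, df$one, FUN = longest_run)
df$want
```

The `max(0, ...)` guard matters for groups such as `one == 7`, where there is no run of 1s and `lengths[values]` is empty.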
Related
I have the following dataframe:
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
var1 = c(0, 2, 3, 4, 2, 5, 6, 10, 11, 0, 1, 2, 1, 5, 7, 10))
Within each ID, I want to keep only the rows up to and including the first value that reaches 5, then move on to the next ID and do the same, so that the final result would look like this:
ID var1
1 0
1 2
1 3
1 4
1 2
1 5
2 0
2 1
2 2
2 1
2 5
I would like to try something with dplyr as it is what I am most familiar with.
You could use which.max() to find the first occurrence of var1 >= 5, and then extract those rows whose row numbers are before it.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(row_number() <= which.max(var1 >= 5)) %>%
ungroup()
or
df1 %>%
group_by(ID) %>%
slice(1:which.max(var1 >= 5)) %>%
ungroup()
# # A tibble: 11 × 2
# ID var1
# <dbl> <dbl>
# 1 1 0
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 2
# 6 1 5
# 7 2 0
# 8 2 1
# 9 2 2
# 10 2 1
# 11 2 5
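One thing to watch with which.max(): if a group never reaches 5, `var1 >= 5` is all FALSE and which.max() returns 1, so only the first row of that group is kept. If you would rather keep the whole group in that case, a variant along these lines might work. It is a sketch only; the extra ID 3 is made-up data added purely to show the edge case.

```r
library(dplyr)

df1 <- data.frame(ID   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
                  var1 = c(0, 2, 3, 4, 2, 5, 6, 10, 11, 0, 1, 2, 1, 5, 7, 10))

# hypothetical extra group that never reaches 5 (not part of the question)
df2 <- rbind(df1, data.frame(ID = 3, var1 = c(1, 2)))

result <- df2 %>%
  group_by(ID) %>%
  # cumsum() counts how many values >= 5 have been seen so far;
  # lag() shifts that by one row so the first hit itself is still kept
  filter(lag(cumsum(var1 >= 5), default = 0L) == 0) %>%
  ungroup()

result
```

For IDs 1 and 2 this keeps the same 11 rows as above, and both rows of the made-up ID 3 survive as well.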
I have a dataset of questionnaires filled by patients.
I want to identify them using diagnostic criteria; the criterion I'm struggling with requires at least 3 answers of >= 3 (questions are Likert questions from 1 up to 5).
A MWE of the dataset I'm working on is presented below
data <- structure(list(q1 = c(1, 2, 3, 1, 1, 1, 1, 3, 1, 1), q2 = c(1,
1, 3, 1, 1, 1, 1, 3, 1, 1), q3 = c(1, 1, 1, 1, 3, 3, 1, 1,
1, 1), q4 = c(1, 2, 2, 1, 1, 3, 1, 3, 1, 1), q5 = c(1, 1,
3, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I've figured out how to identify observations that have at least 1 value >= 3 using the code below (I do not use all_vars, as my dataset is larger than the MWE):
data.match <- data %>%
filter_at(vars(q1, q2, q3, q4, q5), any_vars(. %in% c(3:5)))
data$diagnostic <- ifelse(data$id %in% data.match$id,1,0)
I then back-identified patients using the second line.
The thing is I've not been able to replicate such a strategy to identify patients meeting a determined number of pre-specified values across columns.
In this specific example, I'd like to identify patients 3 and 8. I've tried using rowSums but it seems to me that the number of possible combinations is too high.
Using dplyr, you could use rowwise with c_across :
library(dplyr)
result <- data %>%
rowwise() %>%
mutate(diagnostic = as.integer(sum(c_across(starts_with('q')) >= 3) >= 3))
result
# q1 q2 q3 q4 q5 diagnostic
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1 1 0
# 2 2 1 1 2 1 0
# 3 3 3 1 2 3 1
# 4 1 1 1 1 1 0
# 5 1 1 3 1 1 0
# 6 1 1 3 3 1 0
# 7 1 1 1 1 1 0
# 8 3 3 1 3 1 1
# 9 1 1 1 1 1 0
#10 1 1 1 1 1 0
Perhaps, we can use rowSums
data$diagnostic <- +(rowSums(data >= 3) >= 3)
data$diagnostic
#[1] 0 0 1 0 0 0 0 1 0 0
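Since the real dataset is larger than the MWE, it is probably safer to restrict rowSums to the question columns, so that an id column (or a previously added diagnostic column) is not counted by accident. A sketch, assuming the items are named q1 to q5 as in the MWE; `q_cols` is just an illustrative name:

```r
data <- data.frame(q1 = c(1, 2, 3, 1, 1, 1, 1, 3, 1, 1),
                   q2 = c(1, 1, 3, 1, 1, 1, 1, 3, 1, 1),
                   q3 = c(1, 1, 1, 1, 3, 3, 1, 1, 1, 1),
                   q4 = c(1, 2, 2, 1, 1, 3, 1, 3, 1, 1),
                   q5 = c(1, 1, 3, 1, 1, 1, 1, 1, 1, 1))

# q_cols is an assumed naming scheme; adjust it to the real column names
q_cols <- paste0("q", 1:5)

# count the answers >= 3 per patient, then flag patients with at least 3 of them
n_high <- rowSums(data[q_cols] >= 3)
data$diagnostic <- as.integer(n_high >= 3)
data$diagnostic
```

In the MWE only patients 3 and 8 have three answers of 3 or more, so they are the only ones flagged.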
I have the following data.frame:
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
And I want to add a new column grp which, for each date, ranks the IDs. Ties should have the same value, but there should be no omitted values. That is, if there are two values which are equally minimum, they should both get rank 1, and the next lowest values should get rank 2.
The expected result would therefore look like this. Note that, as mentioned, the groups are for each date, so the operation must be grouped by date.
data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1),
grp = c(2, 2, 1, 2, 1, 2, 3, 1, 2, 2, 1, 1))
I'm sure there's a trivial way to do this but I haven't found it: none of the options for ties.method behave in this way (data.table::frank also doesn't help, since it only adds a dense rank).
I thought of doing a normal rank and then using data.table::rleid, but that doesn't work if there are duplicate values separated by other values during the same day.
I also thought of grouping by date and id and then using a group-ID, but the lowest values each day must start at rank 1, so that won't work either.
The only functional solution I've found is to create another table with the unique ids per day and then join that table to this one:
suppressPackageStartupMessages(library(dplyr))
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
uniques <- df %>%
group_by(
date
) %>%
distinct(
id
) %>%
mutate(
grp = rank(id)
)
df <- df %>% left_join(
uniques
) %>% print()
#> Joining, by = c("date", "id")
#> date id grp
#> 1 1 4 2
#> 2 1 4 2
#> 3 1 2 1
#> 4 1 4 2
#> 5 2 1 1
#> 6 2 2 2
#> 7 2 3 3
#> 8 2 1 1
#> 9 3 2 2
#> 10 3 2 2
#> 11 3 1 1
#> 12 3 1 1
Created on 2020-05-08 by the reprex package (v0.3.0)
However, this seems quite inelegant and convoluted for what seems like a simple operation, so I'd rather see if other solutions are available.
Curious to see data.table solutions if available, but unfortunately the solution must be in dplyr.
We can use dense_rank
library(dplyr)
df %>%
group_by(date) %>%
mutate(grp = dense_rank(id))
# A tibble: 12 x 3
# Groups: date [3]
# date id grp
# <dbl> <dbl> <int>
# 1 1 4 2
# 2 1 4 2
# 3 1 2 1
# 4 1 4 2
# 5 2 1 1
# 6 2 2 2
# 7 2 3 3
# 8 2 1 1
# 9 3 2 2
#10 3 2 2
#11 3 1 1
#12 3 1 1
Or with frank
library(data.table)
setDT(df)[, grp := frank(id, ties.method = 'dense'), date]
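If neither package is available, a dense rank can also be sketched in base R by matching each id against the sorted unique ids of its date. The helper name `dense` below is only for illustration:

```r
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id   = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))

# dense is a hypothetical helper: the position of each value among the
# sorted distinct values, so ties share a rank and no rank is skipped
dense <- function(x) match(x, sort(unique(x)))

# ave() applies the helper separately within each date
df$grp <- ave(df$id, df$date, FUN = dense)
df$grp
```

This is essentially the "unique ids per day" idea from the question, without the intermediate table and join.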
I'm trying to generate random numbers by group, repeated multiple times.
For example,
> set.seed(1002)
> df<-data.frame(ID=LETTERS[seq(1:5)],num=sample(c(2,3,4), size=5, replace=TRUE))
> df
ID num
1 A 3
2 B 4
3 C 3
4 D 2
5 E 3
Within each ID, I want to generate a random ordering of the numbers 1 to num, drawn without replacement and repeated (for example) 4 times.
If ID is A, it will randomly select numbers among 1:3, 4 times. So, this will be
sample(c(1,2,3,1,2,3,1,2,3),replace=FALSE)
or
rep(sample(c(1:4), replace=FALSE),times=4)
If the result is 3 2 1 2 1 3 2 3 3 1 1 2, then the data will be
ID num
1 A 3
2 A 2
3 A 2
4 A 1
5 A 1
6 A 3
7 A 2
8 A 1
9 A 3
I tried several things, like
df%>%group_by(ID)%>%mutate(random=sample(rep(1:num,times=4),replace=FALSE))
It failed. The warning appeared with In 1:num
I also tried this.
ddply(df,.(ID),function(x) sample(rep(1:num,times=4),replace=FALSE))
The error appeared again, with NA/NaN.
I would really appreciate it if you could let me know how to solve this problem.
We can create a list-column and then unnest it to have separate rows.
n <- 4
library(dplyr)
df %>%
group_by(ID) %>%
mutate(num = list(sample(rep(seq_len(num), n)))) %>%
tidyr::unnest(num)
# ID num
# <fct> <int>
# 1 A 2
# 2 A 2
# 3 A 2
# 4 A 3
# 5 A 3
# 6 A 1
# 7 A 3
# 8 A 1
# 9 A 1
#10 A 3
# … with 50 more rows
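A base R sketch of the same idea, in case it helps to see the steps separately. The num values are typed in by hand from the question's printout rather than regenerated from its seed, and `draws` and `out` are just illustrative names:

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable

df <- data.frame(ID = LETTERS[1:5], num = c(3, 4, 3, 2, 3))
n <- 4

# for each group: repeat 1..num n times, then shuffle the whole vector,
# so every number appears exactly n times (sampling without replacement)
draws <- lapply(df$num, function(k) sample(rep(seq_len(k), times = n)))

# expand ID to match the length of each group's draw
out <- data.frame(ID = rep(df$ID, lengths(draws)), random = unlist(draws))
head(out)
```

The original attempts most likely failed because `1:num` was evaluated on the whole num column at once; looping over num one value at a time avoids that.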
I'm not quite clear on your expected output.
The following samples num elements from 1:num with replacement, and stores samples in a list column sample.
library(tidyverse)
set.seed(2018)
df %>% mutate(sample = map(num, ~sample(1:.x, replace = T)))
# ID num sample
#1 A 2 1, 1
#2 B 4 3, 4, 1, 2
#3 C 2 1, 1
#4 D 4 3, 3, 4, 4
#5 E 2 2, 2
Or if you want to repeat sampling num elements (with replacement) 4 times, you can do
set.seed(2018)
df %>%
mutate(sample = map(num, ~as.numeric(replicate(4, sample(1:.x, replace = T)))))
#ID num sample
#1 A 2 1, 1, 1, 2, 1, 2, 1, 1
#2 B 4 3, 3, 4, 4, 4, 4, 4, 2, 3, 4, 3, 3, 2, 1, 1, 2
#3 C 2 1, 1, 1, 1, 1, 1, 1, 2
#4 D 4 2, 3, 2, 1, 3, 4, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1
#5 E 2 2, 1, 2, 2, 1, 1, 1, 2
If column a is equal to 1, I would like to start a cumulative sum. I would like to stop when 2 of the previous 6 rows are equal to 0.
df <- dplyr::tibble(a = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1),
sum = c(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 0, 1, 2, 3))
sum is my desired output
Ideally using tidyverse
One approach could be to flag the rows where at least two 0's occur within the current row and the 6 rows before it, then use cumsum to create groups wherever that flag changes, and finally take the cumsum of a within each group.
library(dplyr)
library(purrr)
df %>%
mutate(sum1 = map_dbl(seq_along(a), ~sum(a[. : max(.-6, 1)] == 0) >= 2)) %>%
group_by(group = cumsum(sum1 != lag(sum1, default = first(sum1)))) %>%
mutate(ans = cumsum(a)) %>%
ungroup %>%
select(-sum1, -group)
# A tibble: 14 x 2
# a ans
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 6
# 7 1 7
# 8 1 8
# 9 0 8
#10 1 9
#11 0 0
#12 1 1
#13 1 2
#14 1 3
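The per-row window count can probably also be done without map_dbl, by differencing a running count of zeros. The sketch below uses base R only and assumes the same window as the code above (the current row plus the 6 rows before it); all variable names are illustrative.

```r
a <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1)

# running count of zeros up to and including each row
zeros <- cumsum(a == 0)

# the same count 7 rows earlier (0 for the first 7 rows), so the
# difference is the number of zeros in the current 7-row window
zeros_before <- c(rep(0, 7), zeros)[seq_along(zeros)]
flag <- (zeros - zeros_before) >= 2

# start a new group every time the flag flips, then cumsum within groups
group <- cumsum(c(FALSE, diff(flag) != 0))
ans <- ave(a, group, FUN = cumsum)
ans
```

On the example vector this gives the same values as the desired sum column.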