Select rows up to a certain value in R

I have the following dataframe:
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
                  var1 = c(0, 2, 3, 4, 2, 5, 6, 10, 11, 0, 1, 2, 1, 5, 7, 10))
I want to select only the rows containing values up to 5; once 5 is reached, it should move on to the next ID and again select only the values up to 5 for that group, so that the final result looks like this:
ID var1
1 0
1 2
1 3
1 4
1 2
1 5
2 0
2 1
2 2
2 1
2 5
I would like to try something with dplyr as it is what I am most familiar with.

You could use which.max() to find the first occurrence of var1 >= 5 within each group, and then keep the rows up to and including it.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(row_number() <= which.max(var1 >= 5)) %>%
  ungroup()
or
df1 %>%
  group_by(ID) %>%
  slice(1:which.max(var1 >= 5)) %>%
  ungroup()
# # A tibble: 11 × 2
# ID var1
# <dbl> <dbl>
# 1 1 0
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 2
# 6 1 5
# 7 2 0
# 8 2 1
# 9 2 2
# 10 2 1
# 11 2 5
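One caveat worth flagging (my addition, not part of the original answer): if a group contains no value of 5 or more, which.max() of an all-FALSE vector returns 1, so both pipelines above would silently keep only that group's first row. A filter on the lagged cumulative count avoids this and keeps such groups whole; a minimal sketch under that assumption:
df1 %>%
  group_by(ID) %>%
  # keep rows until a value >= 5 has already been seen in the group
  filter(lag(cumsum(var1 >= 5), default = 0) < 1) %>%
  ungroup()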

Related

Apply same recoding rules to multiple data frames

I have 5 data frames. I want to recode all variables ending with "_comfort", "_agree", and "effective" using the same rules for each data frame. As is, the values in each column are 1:5, and I want to recode 5's to 1, 4's to 2, 2's to 4, and 1's to 5 (3's stay the same).
I do not want the end result to be one merged dataset, but instead to apply the same recoding rules across all 5 independent data frames. For simplicity's sake, let's just assume I have 2 data frames:
df1 <- data.frame(a_comfort = c(1, 2, 3, 4, 5),
                  b_comfort = c(1, 2, 3, 4, 5),
                  c_effective = c(1, 2, 3, 4, 5))
df2 <- data.frame(a_comfort = c(1, 2, 3, 4, 5),
                  b_comfort = c(1, 2, 3, 4, 5),
                  c_effective = c(1, 2, 3, 4, 5))
What I want is:
df1 <- data.frame(a_comfort = c(5, 4, 3, 2, 1),
                  b_comfort = c(5, 4, 3, 2, 1),
                  c_effective = c(5, 4, 3, 2, 1))
df2 <- data.frame(a_comfort = c(5, 4, 3, 2, 1),
                  b_comfort = c(5, 4, 3, 2, 1),
                  c_effective = c(5, 4, 3, 2, 1))
Conventionally, I would use dplyr's mutate_at and ends_with to achieve my goal, but I have not been successful with this method across multiple data frames. I am thinking a combination of the purrr and dplyr packages will work, but I haven't nailed down the exact technique.
Thanks in advance for any help!
You can use get() and assign() in a loop:
library(dplyr)
for (df_name in c("df1", "df2")) {
  df <- mutate(
    get(df_name),
    across(
      ends_with(c("_comfort", "_agree", "_effective")),
      \(x) 6 - x
    )
  )
  assign(df_name, df)
}
Result:
#> df1
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
#> df2
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
Note, however, that it's often better practice to keep multiple related data frames in a list rather than loose in the global environment. In that case, you can use purrr::map() (or base::lapply()):
library(dplyr)
library(purrr)
dfs <- list(df1, df2)
dfs <- map(
  dfs,
  \(df) mutate(
    df,
    across(
      ends_with(c("_comfort", "_agree", "_effective")),
      \(x) 6 - x
    )
  )
)
Result:
#> dfs
[[1]]
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
[[2]]
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
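Since base::lapply() was mentioned as an alternative above, here is a rough base-R equivalent of the map() call (a sketch; the regular expression mirrors the ends_with() suffixes):
dfs <- lapply(dfs, function(df) {
  # locate the columns whose names end in the target suffixes
  cols <- grepl("(_comfort|_agree|_effective)$", names(df))
  # reverse the 1:5 scale
  df[cols] <- 6 - df[cols]
  df
})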
You can use ls(pattern = 'df\\d+') to find all objects whose names match a certain pattern, store them in a list with mget(), and pass that to purrr::map or lapply for recoding.
library(dplyr)
df.lst <- purrr::map(
  mget(ls(pattern = 'df\\d+')),
  ~ .x %>% mutate(6 - across(ends_with(c("_comfort", "_agree", "effective"))))
)
# $df1
# a_comfort b_comfort c_effective
# 1 5 5 5
# 2 4 4 4
# 3 3 3 3
# 4 2 2 2
# 5 1 1 1
#
# $df2
# a_comfort b_comfort c_effective
# 1 5 5 5
# 2 4 4 4
# 3 3 3 3
# 4 2 2 2
# 5 1 1 1
You can then overwrite those data frames in your workspace from the list with list2env():
list2env(df.lst, .GlobalEnv)
Please try the code below, where I convert the columns to factors and then recode them.
data
a_comfort b_comfort c_effective
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
code
library(tidyverse)
df1 %>%
  mutate(across(c(ends_with('comfort'), ends_with('effective')),
                ~ factor(.x, levels = c('1', '2', '3', '4', '5'), labels = c('5', '4', '3', '2', '1'))))
output
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
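One caveat (my note, not from the original answer): factor() changes the columns from numeric to factor, so any downstream arithmetic would need a conversion back. A sketch of the same recoding that returns numeric columns:
df1 %>%
  mutate(across(c(ends_with('comfort'), ends_with('effective')),
                # recode via factor labels, then restore the numeric type
                ~ as.numeric(as.character(factor(.x, levels = 1:5, labels = 5:1)))))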

Is there a way to identify rows that match a condition several times across several columns in R?

I have a dataset of questionnaires filled by patients.
I want to identify them using diagnostic criteria; the criterion I'm struggling with requires at least 3 answers of >= 3 (the questions are Likert items scored from 1 to 5).
A MWE of the dataset I'm working on is presented below
data <- structure(list(q1 = c(1, 2, 3, 1, 1, 1, 1, 3, 1, 1),
                       q2 = c(1, 1, 3, 1, 1, 1, 1, 3, 1, 1),
                       q3 = c(1, 1, 1, 1, 3, 3, 1, 1, 1, 1),
                       q4 = c(1, 2, 2, 1, 1, 3, 1, 3, 1, 1),
                       q5 = c(1, 1, 3, 1, 1, 1, 1, 1, 1, 1)),
                  row.names = c(NA, -10L),
                  class = c("tbl_df", "tbl", "data.frame"))
I've figured out how to identify observations that match at least 1 value >= 3 using the following (I do not use all_vars because my actual dataset is larger than the MWE):
data.match <- data %>%
  filter_at(vars(q1, q2, q3, q4, q5), any_vars(. %in% c(3:5)))
data$diagnostic <- ifelse(data$id %in% data.match$id, 1, 0)
I then back-identified patients using the second line.
The thing is, I've not been able to replicate this strategy to identify patients meeting a specified number of pre-specified values across columns.
In this example, I'd like to identify patients 3 and 8. I've tried using rowSums, but it seemed to me that the number of possible combinations was too high.
Using dplyr, you could use rowwise() with c_across():
library(dplyr)
result <- data %>%
  rowwise() %>%
  mutate(diagnostic = as.integer(sum(c_across(starts_with('q')) >= 3) >= 3))
result
# q1 q2 q3 q4 q5 diagnostic
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1 1 0
# 2 2 1 1 2 1 0
# 3 3 3 1 2 3 1
# 4 1 1 1 1 1 0
# 5 1 1 3 1 1 0
# 6 1 1 3 3 1 0
# 7 1 1 1 1 1 0
# 8 3 3 1 3 1 1
# 9 1 1 1 1 1 0
#10 1 1 1 1 1 0
Perhaps we can use rowSums(). Since the criterion is at least 3 answers of >= 3, compare the row counts with >= 3:
data$diagnostic <- +(rowSums(data >= 3) >= 3)
data$diagnostic
#[1] 0 0 1 0 0 0 0 1 0 0
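If the real dataset has non-question columns (the question mentions an id, for instance), rowSums() over the whole data frame would count them too. A safer sketch, assuming the question columns are the ones named q1 to q5:
# restrict the row sums to the question columns only
data$diagnostic <- +(rowSums(data[startsWith(names(data), "q")] >= 3) >= 3)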

Grouped non-dense rank without omitted values

I have the following data.frame:
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
And I want to add a new column grp which, for each date, ranks the IDs. Ties should have the same value, but there should be no omitted values. That is, if there are two values which are equally minimum, they should both get rank 1, and the next lowest values should get rank 2.
The expected result would therefore look like this. Note that, as mentioned, the groups are for each date, so the operation must be grouped by date.
data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
           id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1),
           grp = c(2, 2, 1, 2, 1, 2, 3, 1, 2, 2, 1, 1))
I'm sure there's a trivial way to do this, but I haven't found it: none of the options for ties.method behave in this way (data.table::frank also doesn't help, since it only adds a dense rank).
I thought of doing a normal rank and then using data.table::rleid, but that doesn't work if there are duplicate values separated by other values during the same day.
I also thought of grouping by date and id and then using a group-ID, but the lowest values each day must start at rank 1, so that won't work either.
The only functional solution I've found is to create another table with the unique ids per day and then join that table to this one:
suppressPackageStartupMessages(library(dplyr))
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
uniques <- df %>%
  group_by(date) %>%
  distinct(id) %>%
  mutate(grp = rank(id))

df <- df %>%
  left_join(uniques) %>%
  print()
#> Joining, by = c("date", "id")
#> date id grp
#> 1 1 4 2
#> 2 1 4 2
#> 3 1 2 1
#> 4 1 4 2
#> 5 2 1 1
#> 6 2 2 2
#> 7 2 3 3
#> 8 2 1 1
#> 9 3 2 2
#> 10 3 2 2
#> 11 3 1 1
#> 12 3 1 1
Created on 2020-05-08 by the reprex package (v0.3.0)
However, this seems quite inelegant and convoluted for what seems like a simple operation, so I'd rather see if other solutions are available.
Curious to see data.table solutions if available, but unfortunately the solution must be in dplyr.
We can use dense_rank(), which does exactly this: ties share a rank and no ranks are omitted.
library(dplyr)
df %>%
  group_by(date) %>%
  mutate(grp = dense_rank(id))
# A tibble: 12 x 3
# Groups: date [3]
# date id grp
# <dbl> <dbl> <int>
# 1 1 4 2
# 2 1 4 2
# 3 1 2 1
# 4 1 4 2
# 5 2 1 1
# 6 2 2 2
# 7 2 3 3
# 8 2 1 1
# 9 3 2 2
#10 3 2 2
#11 3 1 1
#12 3 1 1
Or with data.table::frank():
library(data.table)
setDT(df)[, grp := frank(id, ties.method = 'dense'), date]
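For comparison, a base-R equivalent (my sketch, not from the original answers) uses ave() with match() against the sorted unique ids, which yields the same dense ranking per date:
# assumes df is still the original data.frame (i.e. before setDT above)
df$grp <- ave(df$id, df$date, FUN = function(x) match(x, sort(unique(x))))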

Generate random sequential numbers by group multiple times

I'm trying to generate random number by group with multiple times.
For example,
> set.seed(1002)
> df <- data.frame(ID = LETTERS[seq(1:5)], num = sample(c(2, 3, 4), size = 5, replace = TRUE))
> df
ID num
1 A 3
2 B 4
3 C 3
4 D 2
5 E 3
For each ID, I want to generate sequential random numbers without replacement, (for example) 4 times.
If the ID is A, it will randomly select numbers among 1:3, 4 times. So this would be
sample(c(1, 2, 3, 1, 2, 3, 1, 2, 3), replace = FALSE)
or
rep(sample(c(1:3), replace = FALSE), times = 4)
If the result is 3 2 1 2 1 3 2 3 3 1 1 2, then the data will be
ID num
1 A 3
2 A 2
3 A 2
4 A 1
5 A 1
6 A 3
7 A 2
8 A 1
9 A 3
I tried several things, like
df %>% group_by(ID) %>% mutate(random = sample(rep(1:num, times = 4), replace = FALSE))
It failed, with a warning mentioning In 1:num.
I also tried this:
ddply(df, .(ID), function(x) sample(rep(1:num, times = 4), replace = FALSE))
The error appeared again, with NA/NaN.
I would really appreciate if you let me know how to solve this problem.
We can create a list-column and then unnest it to get separate rows. (The mutate() attempt failed because mutate() expects the result to match the group size, here a single row; wrapping the sample in list() sidesteps that, and unnest() then expands it into rows.)
n <- 4
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(num = list(sample(rep(seq_len(num), n)))) %>%
  tidyr::unnest(num)
# ID num
# <fct> <int>
# 1 A 2
# 2 A 2
# 3 A 2
# 4 A 3
# 5 A 3
# 6 A 1
# 7 A 3
# 8 A 1
# 9 A 1
#10 A 3
# … with 50 more rows
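Why the sample() call works here: called without a size argument, sample() simply permutes its input, i.e. it draws every element without replacement, so each value of seq_len(num) appears exactly n times in the shuffled result. A quick illustration:
table(sample(rep(seq_len(3), 4)))
#>
#> 1 2 3
#> 4 4 4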
I'm not quite clear on your expected output.
The following samples num elements from 1:num with replacement and stores the samples in a list column, sample.
library(tidyverse)
set.seed(2018)
df %>% mutate(sample = map(num, ~sample(1:.x, replace = T)))
# ID num sample
#1 A 2 1, 1
#2 B 4 3, 4, 1, 2
#3 C 2 1, 1
#4 D 4 3, 3, 4, 4
#5 E 2 2, 2
Or if you want to repeat sampling num elements (with replacement) 4 times, you can do
set.seed(2018)
df %>%
  mutate(sample = map(num, ~ as.numeric(replicate(4, sample(1:.x, replace = T)))))
#ID num sample
#1 A 2 1, 1, 1, 2, 1, 2, 1, 1
#2 B 4 3, 3, 4, 4, 4, 4, 4, 2, 3, 4, 3, 3, 2, 1, 1, 2
#3 C 2 1, 1, 1, 1, 1, 1, 1, 2
#4 D 4 2, 3, 2, 1, 3, 4, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1
#5 E 2 2, 1, 2, 2, 1, 1, 1, 2

Create a cumulative count until 2 of the previous 6 rows meet a condition

If column a is equal to 1, I would like to start a cumulative sum. I would like to stop when 2 of the previous 6 rows are equal to 0.
dplyr::tibble(a = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1),
              sum = c(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 0, 1, 2, 3))
sum is my desired output.
Ideally this would use the tidyverse.
One approach could be to flag the rows where at least two 0's occur among the current and the previous six values, then use cumsum() on changes of that flag to create groups, and finally take the cumulative sum of a within each group.
library(dplyr)
library(purrr)
df %>%
  # flag rows where at least two 0's occur among the current value and the six before it
  mutate(sum1 = map_dbl(seq_along(a), ~ sum(a[.:max(. - 6, 1)] == 0) >= 2)) %>%
  # start a new group whenever the flag changes
  group_by(group = cumsum(sum1 != lag(sum1, default = first(sum1)))) %>%
  # the cumulative sum of a restarts within each group
  mutate(ans = cumsum(a)) %>%
  ungroup() %>%
  select(-sum1, -group)
# A tibble: 14 x 2
# a ans
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 6
# 7 1 7
# 8 1 8
# 9 0 8
#10 1 9
#11 0 0
#12 1 1
#13 1 2
#14 1 3
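As a quick sanity check (my addition, assuming the pipeline above is assigned to result), the computed ans column reproduces the desired sum column from the question:
all(result$ans == c(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 0, 1, 2, 3))
#> [1] TRUE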
