How do you apply if/else statements across user IDs in R?

I am trying to create a dummy variable that flags the user id of people who attended a specific event. Each user id has multiple rows and I would like this dummy variable to apply to every row of the flagged user id. For example, using the data set below, I would like to flag the user IDs of everyone who attended "event b" (using a "1" for attended event b and "0" for did not attend event b). The tricky part is that I want the 1 to appear in every row that matches the user IDs of the people who attended "event b".
I want to use this dummy variable to eventually subset the data so that I can assess the event attending patterns of the users who attended a particular event.
df <- data.frame(id = c(100,100,100,101,101,102,102,103,103,103,103),
                 event = c("a","b","c","b","d","a","c","a","c","d","e"))

Consider ifelse and ave, iterating across the unique values (or levels) of event:
for(ev in unique(df$event)) {  # or: for(ev in levels(df$event))
  df[[paste0("event_", ev, "_flag")]] <- with(df, ave(ifelse(event == ev, 1, 0), id, FUN = max))
}
df
# id event event_a_flag event_b_flag event_c_flag event_d_flag event_e_flag
# 1 100 a 1 1 1 0 0
# 2 100 b 1 1 1 0 0
# 3 100 c 1 1 1 0 0
# 4 101 b 0 1 0 1 0
# 5 101 d 0 1 0 1 0
# 6 102 a 1 0 1 0 0
# 7 102 c 1 0 1 0 0
# 8 103 a 1 0 1 1 1
# 9 103 c 1 0 1 1 1
# 10 103 d 1 0 1 1 1
# 11 103 e 1 0 1 1 1
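Since the stated end goal is subsetting, the flag created above can feed straight into subset. A minimal sketch, rebuilding the example data and assuming only the event "b" flag is needed:

```r
# Example data from the question
df <- data.frame(id = c(100, 100, 100, 101, 101, 102, 102, 103, 103, 103, 103),
                 event = c("a", "b", "c", "b", "d", "a", "c", "a", "c", "d", "e"))

# Per-id flag: 1 on every row of a user who attended event "b"
df$event_b_flag <- with(df, ave(ifelse(event == "b", 1, 0), id, FUN = max))

# Keep all rows of users who attended event "b"
subset(df, event_b_flag == 1)
```

This returns the 5 rows belonging to ids 100 and 101, the two users who attended event "b".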

As I understand it, you want to one-hot encode.
You can use the dummyVars function from the caret package, then collapse the duplicate rows per id with base R's aggregate.
library(caret)
df<-data.frame(id=c(100,100,100,101,101,102,102,103,103,103,103),
event=c("a","b","c","b","d","a","c","a","c","d","e"))
dmy <- dummyVars(" ~ .", data = df)
trsf <- data.frame(predict(dmy, newdata = df))
aggregate(.~id, trsf, FUN=sum)
id event.a event.b event.c event.d event.e
1 100 1 1 1 0 0
2 101 0 1 0 1 0
3 102 1 0 1 0 0
4 103 1 0 1 1 1

Perhaps I'm using a too-simple approach. Using dplyr and tidyr:
df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = "event", values_from = "value", values_fill = 0)
returns
# A tibble: 4 x 6
id a b c d e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 1 1 1 0 0
2 101 0 1 0 1 0
3 102 1 0 1 0 0
4 103 1 0 1 1 1
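If the flags are wanted on every original row rather than one row per id, the wide table can be joined back onto the data. A sketch of that idea (the 0/1 column names come from the event values):

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c(100, 100, 100, 101, 101, 102, 102, 103, 103, 103, 103),
                 event = c("a", "b", "c", "b", "d", "a", "c", "a", "c", "d", "e"))

# One row per id, one 0/1 column per event
wide <- df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = "event", values_from = "value", values_fill = 0)

# Attach the per-id flags to every original row
df %>% left_join(wide, by = "id")
```

Every one of the 11 original rows then carries the full set of attendance flags for its id.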

Related

Is there a R function for preparing datasets for survival analysis like stset in Stata?

The dataset looks like this:
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
As you see, when id = 1 the rows are exactly the input that coxph in the survival package expects. However, when id = 2, failure occurs at the beginning and at the end, but disappears in the middle.
Is there a general function to extract the data for id = 2 and get a result like that for id = 1?
I think when id = 2, the result should look like this:
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
A bit hacky, but should get the job done.
Data:
# Load data
library(tidyverse)
df <- read_table("
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
")
Data wrangling:
# Check for sub-groups within IDs and remove all but the last one
df <- df %>%
  # Group by ID
  group_by(id) %>%
  mutate(
    # Check if a new sub-group is starting (after a failure)
    new_group = case_when(
      # First row is always group 0
      row_number() == 1 ~ 0,
      # If previous row was a failure, then a new sub-group starts here
      lag(failure) == 1 ~ 1,
      # Otherwise not
      TRUE ~ 0
    ),
    # Assign sub-group number by calculating cumulative sums
    group = cumsum(new_group)
  ) %>%
  # Keep only last sub-group for each ID
  filter(group == max(group)) %>%
  ungroup() %>%
  # Remove working columns
  select(-new_group, -group)
Result:
> df
# A tibble: 6 × 5
id start end failure x1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0
2 1 1 3 0 0
3 1 3 6 1 0
4 2 3 4 0 1
5 2 4 6 0 1
6 2 6 7 1 1
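The same sub-group logic can also be sketched in base R, without the tidyverse, using ave and cumsum (assuming the same df as above):

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 2),
                 start = c(0, 1, 3, 0, 1, 3, 4, 6),
                 end = c(1, 3, 6, 1, 3, 4, 6, 7),
                 failure = c(0, 0, 1, 1, 1, 0, 0, 1),
                 x1 = c(0, 0, 0, 1, 1, 1, 1, 1))

# Within each id, start a new sub-group after every failure row
grp <- with(df, ave(failure, id, FUN = function(f) cumsum(c(0, head(f, -1)))))

# Keep only the last sub-group per id
df[grp == ave(grp, df$id, FUN = max), ]
```

For id = 2 this keeps only the rows starting at 3, 4 and 6, matching the tidyverse result.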

Is there R code that can select based on 3 columns and return a value based on the options?

I want to create a new variable based on roof, wall and floor: if any of the three is 1, the new variable is assigned 1, and 0 otherwise.
roof<-c(1,1,1,0,1,0)
wall<-c(0,1,1,0,0,0)
floor<-c(1,1,1,0,1,0)
data<-data.frame(roof,wall,floor)
data
data$code<-c(1,1,1,0,1,0)
You can use pmap:
library(tidyverse)
roof<-c(1,1,1,0,1,0)
wall<-c(0,1,1,0,0,0)
floor<-c(1,1,1,0,1,0)
data<-data.frame(roof,wall,floor)
#
data %>%
  mutate(code_want = pmap_int(data %>%
                                select(roof:floor) %>%
                                mutate_all(as.logical), any))
# roof wall floor code code_want
#1 1 0 1 1 1
#2 1 1 1 1 1
#3 1 1 1 1 1
#4 0 0 0 0 0
#5 1 0 1 1 1
#6 0 0 0 0 0
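For a plain any-of-these-columns flag, a fully vectorized base R alternative avoids pmap entirely. A minimal sketch:

```r
roof <- c(1, 1, 1, 0, 1, 0)
wall <- c(0, 1, 1, 0, 0, 0)
floor <- c(1, 1, 1, 0, 1, 0)
data <- data.frame(roof, wall, floor)

# 1 if any of the three indicator columns is 1, 0 otherwise
data$code <- as.integer(rowSums(data[c("roof", "wall", "floor")]) > 0)
data$code
# [1] 1 1 1 0 1 0
```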

How to apply a function within a group that depends on which row within that group holds a value?

I have a dataset that looks like the following, where each ID has 3 levels, and where one of those levels has a value (and all other levels within that ID are 0):
ID level value
1 1 0
1 2 0
1 3 1
2 1 0
2 2 1
2 3 0
I need to return a similar dataframe, with an additional column which specifies which row within the ID has the value 1. In this case:
ID level value which
1 1 0 3
1 2 0 0
1 3 1 0
2 1 0 2
2 2 1 0
2 3 0 0
I feel like I should be able to create this somehow by group_by(ID) and then a mutate based on a case_when that refers to the rows relative to the group (i.e. if it is the 1st, 2nd, or 3rd row), but I can't crack how that should work.
Any suggestions are much appreciated!
You can use which, or better which.max, which is guaranteed to return only one value.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(which = which.max(value) * +(row_number() == 1))
# ID level value which
# <int> <int> <int> <int>
#1 1 1 0 3
#2 1 2 0 0
#3 1 3 1 0
#4 2 1 0 2
#5 2 2 1 0
#6 2 3 0 0
+(row_number() == 1) ensures that the value of which is assigned only to the first row in each group; all other rows get 0.
We can use base R
df1$Which <- with(df1, tapply(as.logical(value), ID,
                              FUN = which)[ID] * !duplicated(ID))
-output
df1
# ID level value Which
#1 1 1 0 3
#2 1 2 0 0
#3 1 3 1 0
#4 2 1 0 2
#5 2 2 1 0
#6 2 3 0 0
Or another option with ave
df1$Which <- with(df1, ave(as.logical(value), ID, FUN = which) * !duplicated(ID))

ERROR when using count in R which was working before

I used count to count identical rows and get their frequency, and it was working very well about two hours ago; now it gives me an error that I do not understand. I want to add up the concentration of identical rows every time they occur. Here are my toy data and my function.
df=data.frame(ID=seq(1:6),A=rep(0,6),B=c(rep(0,5),1),C=c(rep(1,5),0),D=rep(1,6),E=c(rep(0,3),rep(1,2),0),concentration=c(0.002,0.004,0.001,0.0075,0.00398,0.006))
df
ID A B C D E concentration
1 1 0 0 1 1 0 0.00200
2 2 0 0 1 1 0 0.00400
3 3 0 0 1 1 0 0.00100
4 4 0 0 1 1 1 0.00750
5 5 0 0 1 1 1 0.00398
6 6 0 1 0 1 0 0.00600
freq.concentration = function(df, Vars){
  df = data.frame(df)
  Vars = as.character(Vars)
  compte = count(df, Vars)
  frequence.C = (compte$freq)/nrow(df)
  output = cbind(compte, frequence.C)
  return(output)
}
freq.concentration(df,colnames(df[2:6]))
# and here is the error that I get when I run the function, which was working perfectly a while ago:
# Error: Must group by variables found in `.data`.
# * Column `Vars` is not found.
# Run `rlang::last_error()` to see where the error occurred.
PS: I do not know if this is related, but I got this problem after I opened an Rmd script and copy-pasted all my functions into it; all of a sudden my function stopped working.
I really appreciate your help in advance. Thank you.
Here is the output that i had when it was working properly :
output
ID A B C D E concentration.C.1 concentration.C.2
1 1 0 0 1 1 0 3 0.007
2 4 0 0 1 1 1 2 0.01148
3 6 0 1 0 1 0 1 0.00600
The first 3 rows are identical, so we sum their concentrations and get 0.007; rows 4 and 5 are the same, so we add their concentrations and get 0.01148; the last row is unique, so its concentration stays the same.
We can convert the column names to symbols and evaluate (!!!) them in count to get the frequency count based on those columns, then compute 'frequence.C' as the proportion of 'n' over the sum of that count.
library(dplyr)
freq.concentration <- function(df, Vars){
  df %>%
    count(!!! rlang::syms(Vars)) %>%
    mutate(frequence.C = n/sum(n))
}
-testing
freq.concentration(df,colnames(df)[2:6])
# A B C D E n frequence.C
#1 0 0 1 1 0 3 0.5000000
#2 0 0 1 1 1 2 0.3333333
#3 0 1 0 1 0 1 0.1666667
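With recent dplyr versions (1.1+), the same thing can be written without rlang, using pick with all_of; a sketch under that version assumption:

```r
library(dplyr)

df <- data.frame(ID = 1:6,
                 A = rep(0, 6), B = c(rep(0, 5), 1),
                 C = c(rep(1, 5), 0), D = rep(1, 6),
                 E = c(rep(0, 3), 1, 1, 0),
                 concentration = c(0.002, 0.004, 0.001, 0.0075, 0.00398, 0.006))

# Count rows that are identical across the selected columns,
# then express each count as a proportion of the total
freq.concentration <- function(df, Vars){
  df %>%
    count(pick(all_of(Vars))) %>%
    mutate(frequence.C = n / sum(n))
}

freq.concentration(df, colnames(df)[2:6])
```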
If we need the sum of 'concentration', we could use a group_by operation instead of count
freq.concentration <- function(df, Vars){
  df %>%
    group_by(across(all_of(Vars))) %>%
    summarise(n = n(), frequency.C = sum(concentration), .groups = 'drop')
}
-testing
freq.concentration(df,colnames(df)[2:6])
# A tibble: 3 x 7
# A B C D E n frequency.C
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0 0 1 1 0 3 0.007
#2 0 0 1 1 1 2 0.0115
#3 0 1 0 1 0 1 0.006

creating an 'ever event' variable from an 'incident event' variable

In R, in a repeated measures dataset, how can I create a variable that is the same for each measurement on an individual based upon an incident variable? For instance if I have:
id incident_MI
1 0
1 0
1 1
2 0
2 0
2 0
3 0
3 0
3 0
3 1
And I want to use the incident_MI to create an ever_MI variable like this:
id incident_MI Ever_MI
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
2 0 0
3 0 1
3 0 1
3 0 1
3 1 1
Any ideas on how I might code that in R?
We can check for any 1's in 'incident_MI' after grouping by 'id' and convert the result to integer with as.integer to create 'Ever_MI':
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(Ever_MI = as.integer(any(incident_MI == 1)))
# A tibble: 10 x 3
# Groups: id [3]
# id incident_MI Ever_MI
# <int> <int> <int>
# 1 1 0 1
# 2 1 0 1
# 3 1 1 1
# 4 2 0 0
# 5 2 0 0
# 6 2 0 0
# 7 3 0 1
# 8 3 0 1
# 9 3 0 1
#10 3 1 1
Or as #lmo commented, the data.table option would be
library(data.table)
setDT(df1)[, Ever_MI := any(incident_MI), by=.(id)][]
Or using base R
df1$Ever_MI <- with(df1, ave(incident_MI, id, FUN = any))
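Equivalently, max can replace any, which keeps the column numeric without any logical coercion. A small self-contained sketch using the example data from the question:

```r
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                  incident_MI = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1))

# max within id is 1 exactly when any incident occurred for that id
df1$Ever_MI <- with(df1, ave(incident_MI, id, FUN = max))
df1$Ever_MI
# [1] 1 1 1 0 0 0 1 1 1 1
```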
