I have a data.frame with repeated measurements of boars.
Now I need to extract a data.frame that only includes measurements where the column "reject" says "blood", plus the measurement before and the 3 measurements after each occurrence of "blood".
I know that I would have to group by the boar ID and sort by the time of measurement, but I have no idea how to get to the measurements before and after the occurrence.
dat2 <- dat1 %>%
  group_by(ID) %>%
  filter(any(reject == "blood")) %>%
  ungroup()
I created a data.frame with all the repeated measurements of the boars that at some point had the reject reason "blood". But I don't know how to filter it down to only the 5 measurements (the "blood" row itself, one before, three after).
I can also use a running measurement ID instead of the date, in case that makes it easier to work with.
I tried using the lag function in dplyr, but I think I am using order_by incorrectly.
dat2 %>%
  group_by(ID) %>%
  lag(reject == "blood", n = 1, order_by = measurement_ID)
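One way to get the window around each "blood" row is to work with row positions inside each sorted group: find the indices where reject == "blood" and keep everything from one position before to three positions after. A minimal sketch with invented data (the column names ID, measurement_ID and reject are taken from the snippets above; the values are made up):

```r
library(dplyr)

# invented example data: boar1 has one "blood" rejection, boar2 has none
dat1 <- tibble(
  ID = rep(c("boar1", "boar2"), each = 8),
  measurement_ID = rep(1:8, times = 2),
  reject = c("ok", "ok", "ok", "blood", "ok", "ok", "ok", "ok",
             rep("ok", 8))
)

dat2 <- dat1 %>%
  group_by(ID) %>%
  arrange(measurement_ID, .by_group = TRUE) %>%
  filter({
    hits <- which(reject == "blood")
    keep <- rep(FALSE, n())
    # mark one row before and three rows after every hit, clipped to the group
    for (i in hits) keep[max(1, i - 1):min(n(), i + 3)] <- TRUE
    keep
  }) %>%
  ungroup()

dat2  # boar1's measurements 3 to 7; boar2 drops out entirely
```

Overlapping windows (two "blood" rows close together) are handled naturally because the loop just marks positions, and a hit near the start or end of a group is clipped rather than producing out-of-range indices.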
I have a dataset of three columns and roughly 300000 rows:
Person ID | Likelihood of Risk | Year the survey was taken
Each Person has taken part multiple times and I only want the most recent likelihood of Risk.
I wanted to figure that out by grouping the Person ID and then finding the max year.
That did not work out: I still ended up with multiple identical person IDs.
To continue working I need one specific value of Likelihood of Risk for each ID.
Riskytest <- Risk_Adult %>% group_by(pid, A_risk) %>% summarize(max = max(syear))
Riskytest <- Risk_Adult %>%
  group_by(pid) %>%
  slice_max(syear) %>%
  ungroup()
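Note that slice_max(syear) keeps all rows tied for the maximum year, so a pid that answered twice in its latest year would still appear twice; with_ties = FALSE forces exactly one row per group. A small invented example:

```r
library(dplyr)

# invented mini version of Risk_Adult; pid 2 has two rows in its latest year
Risk_Adult <- tibble(
  pid    = c(1, 1, 2, 2, 2),
  A_risk = c(3, 5, 2, 4, 6),
  syear  = c(2018, 2020, 2019, 2021, 2021)
)

Riskytest <- Risk_Adult %>%
  group_by(pid) %>%
  slice_max(syear, with_ties = FALSE) %>%  # exactly one row per pid
  ungroup()

Riskytest  # two rows: pid 1 (year 2020) and pid 2 (year 2021)
```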
I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
The first day on which the maximum value of G was reached in a group, coded as zero.
How many days each measurement (in this same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is then coded -2, and a measurement 5 days after the maximum is coded as 5.
Here is an example of what I'm aiming for:
I highlighted the days on which the maximum value of G was reached in the different groups. The column 'New' is what I'm trying to get.
I've been trying with dplyr and I managed to get for each group the maximum with group_by, arrange(desc), slice. I then recoded those maxima into zero and joined this dataframe with my original dataframe. However, I cannot manage to do the 'sequence' of days leading up to/ days from the maximum.
EDIT: sorry I didn't include a reprex. I used this code so far:
To find the maximum value: First order by date
data <- data[with(data, order(G, Date)),]
Find maximum and join with original data:
data2 <- data %>%
  dplyr::group_by(Group) %>%
  arrange(desc(G), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
data2$New <- data2$G
data2 <- data2 %>%
  dplyr::select(c("ID", "New", "Date"))
data3 <- full_join(data, data2, by = c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in column New as NA but not yet the number of days leading up to this, and the number of days since. I have no idea how to get to this.
It would help if you would be able to provide the data using dput() in your question, as opposed to using an image.
It looked like you wanted to group_by(Group) in your example to compute number of days before and after the maximum date in a Group. However, you have ID of 3 and Group of A that suggests otherwise, and maybe could be clarified.
Here is one approach using the tidyverse that I hope will be helpful. After grouping and arranging by Date, you can look at the difference between each Date and the Date on which G is at its maximum (the first maximum detected in date order).
Also note, as.numeric is included to provide a number, as the result for New is a difftime (e.g., "7 days").
library(tidyverse)

data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
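Run on a small invented data frame in the question's shape (one group, four measurement days, the maximum of G first reached on the second day), the mutate gives:

```r
library(dplyr)

# invented data with the question's columns: Group, ID, Date, G
data <- tibble(
  Group = "A",
  ID = 1,
  Date = as.Date("2021-01-01") + c(0, 2, 5, 9),
  G = c(10, 50, 50, 20)
)

out <- data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)])) %>%
  ungroup()

out$New  # -2 0 3 7: the first day the maximum is reached is coded 0
```

which.max returns the position of the first maximum, so a later tie (day 6 here) does not shift the zero point.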
This is an example of what I'm working with. hh_p_id is an individual and tdtrpnum identifies each trip they made per day. I want to find the average number of trips taken per day per person. To start, I want to separate out (maybe filter?) the highest value per individual. How do I do that?
If my_data is your data.frame then:
my_data <- my_data %>%
  group_by(hh_p_id) %>%
  summarise(avg_per_day = n() / length(unique(date)))
will give you the average number of trips per day for each hh_p_id.
Here is a solution using dplyr if you want to calculate min, max and average number of trips per person ...
library(magrittr)

# some sample data
data <- dplyr::tibble(
  ID = sample(1:10, size = 1000, replace = TRUE),
  DATE = sample(1:20, size = 1000, replace = TRUE) %>%
    as.Date(origin = "2020-01-01")
) %>%
  dplyr::group_by(ID, DATE) %>%
  dplyr::summarise(CNT = dplyr::n())

# solution to your problem
data %>%
  dplyr::group_by(ID) %>%
  dplyr::summarise(AVG = sum(CNT) / dplyr::n(),
                   MAX = max(CNT),
                   MIN = min(CNT))
I have a large df with around 40,000,000 rows, covering a time period of 2 years and more than 400k unique users.
The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user at several points in time.
Each row is therefore a unique combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable (df$category_var) and the time variable (df$time_var), I now want to calculate 3 new variables per user_id over a rolling time window covering the previous 30 days.
So in each row, the new variable should be calculated over the values of the input variables from the previous 30 days.
I do not observe all users over the same time period (some enter later, some leave earlier), and the distances between measurement times are not equal, so I cannot calculate the variables simply by number of rows.
So far I have only managed to calculate my new variables per user_id over the whole observation period; I could not manage to calculate them over a rolling 30-day window per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable. But since I have so far mainly worked with dplyr, my attempts at calculating these variables on a rolling time window at a group_by(user_id) level have taken more than a week without results. I would be very grateful for your support!
My df basically looks like :
user_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
time_var <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", ...)
v1 <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, ...)
v2 <- c(1, 1, 0, 1, 0, 1, 1, 0, ...)
My first needed new variable (new_x1) is basically a cumulative sum based on a condition on the dummy variable v1. What I have achieved so far:
df <- df %>% group_by(user_id) %>% mutate(new_x1 = cumsum(v1 == 1))
What I need: the variable should only count over the previous 30 days per user.
Needed new variable (new_x2): basically a cumulative count of v1 where v2 has a (so far) unique value. So count for each new value of v2, given v1 == 1.
What I achieved so far:
df <- df %>%
  group_by(user_id, category_var) %>%
  mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))
I also need this based on the previous 30 days and not the whole observation period per user.
My third variable of interest (new_x3): the time between two observations, given a certain condition (v1 == 1).
# Interevent time
df2 <- df %>%
  group_by(user_id) %>%
  filter(v1 == 1) %>%
  mutate(time_between_events = time_var - lag(time_var))
I would also need this over the previous 30 days.
Thank you so much!
Edit after John Springs' post:
My potential solution would then be
setDT(df)[, `:=`(new_x1= cumsum(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]),
new_x2= cumsum(!duplicated(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]))),
by = eval(c("user_id", "time"))]
I am really not familiar with data.table and not sure whether I can nest my cumsum conditions in data.table like that.
Any suggestions?
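One dplyr-only way to express "over the previous 30 days" is to compute each row's value from an explicit time window rather than with a running cumsum. The sketch below (invented mini data; time_var in plain days, as in the example above) does this with sapply inside mutate. It is O(n^2) per user, so for 40M rows you would want to move the same window logic to data.table (a non-equi self-join) or the slider package (slide_index_sum and friends), but it shows the windowing itself:

```r
library(dplyr)

# invented mini data: user 1's third row is more than 30 days after the others
df <- tibble(
  user_id  = c(1, 1, 1, 2, 2),
  time_var = c(1, 20, 60, 5, 10),
  v1 = c(1, 1, 1, 1, 1),
  v2 = c("a", "a", "b", "x", "x")
)

df2 <- df %>%
  group_by(user_id) %>%
  arrange(time_var, .by_group = TRUE) %>%
  mutate(
    # new_x1: count of v1 == 1 within [t - 30, t] for each row's time t
    new_x1 = sapply(seq_along(time_var), function(i) {
      w <- time_var >= time_var[i] - 30 & time_var <= time_var[i]
      sum(v1[w] == 1)
    }),
    # new_x2: count of so-far-unique v2 values with v1 == 1, same window
    new_x2 = sapply(seq_along(time_var), function(i) {
      w <- time_var >= time_var[i] - 30 & time_var <= time_var[i]
      sum(!duplicated(v2[w]) & v1[w] == 1)
    })
  ) %>%
  ungroup()

df2$new_x1  # 1 2 1 1 2: user 1's row at day 60 no longer sees days 1 and 20
```

new_x3 follows the same pattern: inside the window, take the lag of time_var among the rows with v1 == 1. With an actual POSIXct time column, replace time_var[i] - 30 with time_var[i] - 30 * 24 * 3600, since POSIXct arithmetic is in seconds.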
There is an 8-week internet experiment. Data is gathered on each participant, who can start the experiment at any date. The idea is to calculate the exercises done by each participant in the first week, in the second week, and so on. So the result should be a participants x 8 matrix/data frame.
Each participant can start at any date, but the experiment is closed after 8 weeks.
Each participant can do as many exercises as he/she wants.
Here is an example:
df <- data.frame(
  fac = c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c","d","d","d","d","d","d"),
  date = c("2017-01-01","2017-01-05","2017-01-13","2017-01-25","2017-02-10","2017-01-06","2017-01-16","2017-01-28","2017-02-02","2017-02-07","2017-01-11","2017-01-19","2017-01-24","2017-01-31","2017-02-09","2017-01-12","2017-01-24","2017-01-29","2017-02-04","2017-02-19","2017-03-08"),
  sessions = c(1,2,3,6,5,1,3,2,3,3,1,5,3,2,4,1,3,5,2,6,6)
)
My idea is to:
add a "0" column (df$count <- 0)
split the data frame by factor [split(df, df$fac)]
take each date value, subtract the first date entry, add 1, divide by 7 and round up [ceiling((date - date[1] + 1) / 7)]. This gives exactly the number of the week in which the participant did the exercises.
with tidyr: reorganize the whole data frame so that the values within every week are summed together (participants x 8 data frame)
But I have no idea how to correctly implement step 3 and combine it with step 4.
Thanks a lot!
Something like:
library(dplyr)
df %>%
  group_by(fac) %>%
  mutate(time = ceiling((as.numeric(as.Date(date) - as.Date(date[1])) + 1) / 7)) %>%
  group_by(fac, time) %>%
  summarize(total_sessions = sum(sessions))
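To arrive at the participants x 8 layout the question asks for, the weekly totals can then be spread into one column per week with tidyr::pivot_wider. A sketch on a subset of the example data, using the question's own week rule ceiling((days since first entry + 1) / 7):

```r
library(dplyr)
library(tidyr)

# subset of the question's example data
df <- data.frame(
  fac = c("a", "a", "a", "b", "b"),
  date = c("2017-01-01", "2017-01-05", "2017-01-13", "2017-01-06", "2017-01-16"),
  sessions = c(1, 2, 3, 1, 3)
)

weekly <- df %>%
  group_by(fac) %>%
  # week 1 covers days 1-7 after each participant's first entry
  mutate(week = ceiling((as.numeric(as.Date(date) - as.Date(date[1])) + 1) / 7)) %>%
  group_by(fac, week) %>%
  summarise(total = sum(sessions), .groups = "drop") %>%
  pivot_wider(names_from = week, values_from = total,
              names_prefix = "week_", values_fill = 0)

weekly  # one row per participant, one column per observed week
```

values_fill = 0 makes weeks without any sessions show 0 rather than NA. Weeks in which no participant at all was active never appear as columns, so if all 8 are needed you can complete(week = 1:8) per participant before pivoting.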