R: replace identical rows with average

I have data which looks like this:
patient day response
Bob "08/08/2011" 5
However, sometimes we have several responses for the same day from the same patient. I want to replace all such rows with a single row, keeping the patient and day values they share, where the response is the average of their responses.
So if we also had
patient day response
Bob "08/08/2011" 6
then we'd remove both these rows and replace them with
patient day response
Bob "08/08/2011" 5.5
How do I write R code to do this for a data frame that spans tens of thousands of rows?
EDIT: I might need the code to generalize to several covariables. For example, apart from day, we might have "location", in which case we'd only want to average the rows which correspond to the same patient on the same day at the same location.

Required output can be obtained with:
aggregate(a$response, by = list(patient = a$patient, day = a$day), FUN = mean)
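The formula interface is equivalent and extends naturally to the extra covariables from the edit; the location column below is the hypothetical one mentioned there:
aggregate(response ~ patient + day, data = a, FUN = mean)
# with the hypothetical extra covariable from the edit:
aggregate(response ~ patient + day + location, data = a, FUN = mean)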

You can do this with the dplyr package pretty easily:
library(dplyr)
df %>%
  group_by(patient, day) %>%
  summarize(response_avg = mean(response))
This groups by whatever variables you list in group_by, so you can add more. I named the new variable response_avg, but you can rename it to whatever you like.
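For the edit's case, assuming a column named location (hypothetical, from the question's edit), you just extend the grouping:
df %>%
  group_by(patient, day, location) %>%  # location is the hypothetical extra covariable
  summarize(response_avg = mean(response))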

Just to add a data.table solution, in case any reader is a data.table user.
library(data.table)
setDT(df)
df[, response := mean(response, na.rm = TRUE), by = .(patient, day)]
df <- unique(df) # to remove the now-duplicated rows
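An equivalent one-step aggregation avoids the update-then-deduplicate step, and is safe even if other columns differ across the duplicated rows:
df2 <- df[, .(response = mean(response, na.rm = TRUE)), by = .(patient, day)]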

Related

How can I group a dataframe's observations 3 by 3?

I am struggling with a dataframe of exchange-rate observations taken 3 times a day for approximately 30 days, which means the dataframe currently holds 90 observations. For the purpose of my research I need to reduce this to 1 observation per day (30 observations), possibly by taking the mean of every 3 observations. In short, I need code that takes the observations 3 by 3 and outputs one observation for every 3. I have tried several different approaches, but my attempts have all failed. I was wondering whether someone has had to do something similar and managed it.
Thanks!
Use group_by and summarise like this:
library(tidyverse)
df <- tibble(
  day = rep(1:30, each = 3),
  rate = rnorm(90)
)
df %>%
  group_by(day) %>%
  summarise(mrate = mean(rate))
P.S.
Attach your data. It is easier to help with a specific example.
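If there is no day column to group on, here is a sketch that forms groups purely by row position, so every 3 consecutive rows become one group:
df %>%
  group_by(grp = (row_number() - 1) %/% 3) %>%  # 0,0,0,1,1,1,... over consecutive triples
  summarise(mrate = mean(rate))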

grouping by ID, shift all rows up one and leave NA for last row

I have a dataset in long format with four time points per subject. Grouping by subject number, I'm trying to shift all rows up by 1 and leave the last observation for each subject as NA.
I tried this, but it shifted it down by 1, instead of up by 1.
data_long_new <- data_long_new[, variable_lag:=c(NA, variable[-.N]), by=subject_id]
Any assistance would be appreciated.
Since you're using data.table:
data_long[, next.variable:=shift(variable, type="lead"), by= subject_id]
Also, your code is almost correct; flipping it around gives the lead instead of the lag:
data_long_new[, variable_lead:=c(variable[-1],NA), by=subject_id]
This did it.
data_long_new <- data_long %>%
  group_by(subject_id) %>%
  mutate(next.variable = lead(variable, order_by = subject_id))

How to mutate variables on a rolling time window by groups with unequal time distances?

I have a large df with around 40,000,000 rows, covering a time period of 2 years in total and more than 400k unique users.
The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user over several points in time.
Each row is therefore a unique combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable(df$category_var) and the time variable (df$time_var) I now want to calculate 3 new variables on a user_id level on a rolling time window over the previous 30 days.
So in each row, the new variable should be calculated over the values of the previous 30 days of the input variables.
I do not observe all users over the same time period, some enter later and some leave earlier, and the distances between times are not equal either, so I cannot calculate the variables just by number of rows.
So far I have only managed to calculate my new variables per user_id over the whole observation period, but I could not manage to calculate the variables over the previous 30 days as a rolling window per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr, my attempt at calculating these variables on the rolling time window at a group_by user_id level has taken more than a week without any results. I would be so grateful for your support!
My df basically looks like :
user_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
time_var <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", ...)
v1 <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, ...)
v2 <- c(1, 1, 0, 1, 0, 1, 1, 0, ...)
My first needed new variable (new_x1) is basically a cumulative sum based on a condition in dummy variable v1. What I achieved so far:
df <- df %>% group_by(user_id) %>% mutate(new_x1=cumsum(v1==1))
What I need: the variable should only count over the previous 30 days per user.
Needed new variable (new_x2): basically a cumulative count of v1 where v2 has a (so far) unique value. So for each new value in v2, given v1 == 1, count.
What I achieved so far:
df <- df %>%
  group_by(user_id, category_var) %>%
  mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))
I also need this based on the previous 30 days and not the whole observation period per user.
My third variable of interest (new_x3):
The time between two observations given a certain condition (v1==1)
# Interevent time
df2 <- df %>% group_by(user_id) %>% filter(v1 == 1) %>% mutate(time_between_events = time - lag(time))
I would also need this over the previous 30 days.
Thank you so much!
Edit after John Springs Post:
My potential solution would then be:
setDT(df)[, `:=`(new_x1= cumsum(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]),
new_x2= cumsum(!duplicated(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]))),
by = eval(c("user_id", "time"))]
I'm really not familiar with data.table and not sure if I can nest my cumsum conditions in data.table like that.
Any suggestions?
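For what it's worth, here is a minimal sketch of the rolling 30-day window using a data.table non-equi self-join, assuming time_var is numeric in days (with POSIXct you would subtract 30*24*60*60 seconds instead). Only new_x1 is shown; new_x2 and new_x3 would reuse the same join pattern:
library(data.table)
setDT(df)
# window bounds per row: each row looks back 30 days within the same user
win <- df[, .(user_id, lower = time_var - 30, upper = time_var)]
# non-equi self-join: for every original row, collect that user's rows whose
# time_var falls in [lower, upper], then aggregate once per row (.EACHI)
res <- df[win,
          on = .(user_id, time_var >= lower, time_var <= upper),
          .(new_x1 = sum(v1 == 1)), by = .EACHI]
df[, new_x1 := res$new_x1]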

R dataframe split by factor then apply and tidyr

There is an 8-week internet experiment. Data is gathered on each participant, who can start the experiment on any date. The idea is to calculate the exercises done by each participant in the first week, in the second week and so on. So the result should be a participants-by-8 matrix/data frame.
each participant can start on any date, but the experiment is closed after 8 weeks
each participant can do as many exercises as he/she wants.
here an example
df <- data.frame(
  fac = c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c","d","d","d","d","d","d"),
  date = c("2017-01-01","2017-01-05","2017-01-13","2017-01-25","2017-02-10","2017-01-06","2017-01-16","2017-01-28","2017-02-02","2017-02-07","2017-01-11","2017-01-19","2017-01-24","2017-01-31","2017-02-09","2017-01-12","2017-01-24","2017-01-29","2017-02-04","2017-02-19","2017-03-08"),
  sessions = c(1,2,3,6,5,1,3,2,3,3,1,5,3,2,4,1,3,5,2,6,6)
)
My idea is to:
add an "0" column (df$count<-0)
split the data frame by factors [split(df, df$fac)] 3
take the date value-subtract the date value that is the first entry, add 1, divide by 7 and then round up. [roundup((date2 -date$1$+1)/7)]. This gives me exactly the number of week in which the participant did the exercises.
with tidyr: reorganize the whole data frame so that values in every week are summed together (participant times 8 data frame)
But I have no idea how to correctly implement the step 3 and to combine with step 4
Thanks a lot!
Something like:
library(dplyr)
df <- df %>%
  group_by(fac) %>%
  mutate(week = ceiling((as.numeric(as.Date(date) - as.Date(date[1])) + 1) / 7)) %>%
  group_by(fac, week) %>%
  summarize(total_sessions = sum(sessions))
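To get the participants-by-8 layout from step 4, here is a sketch with tidyr, assuming the summarized df from above:
library(tidyr)
df_wide <- df %>%
  pivot_wider(names_from = week, values_from = total_sessions,
              names_prefix = "week_", values_fill = 0)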

R: dataframe Selecting the maximum row per ID based on first timestamp

I have a data frame which contains records that have time stamps.
The toy example below contains one ID with 2 SMSs attached to it, based on two different time stamps. In reality there would be thousands of IDs, each with around 80-100 SMS types and dates.
toydf <- data.frame(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"))
I want to be able to create a new data frame that only contains the record of the SMS type for the first SMS.Date, or even the last.
I have had a look at using duplicated. I have also thought about sorting the date column in descending order per ID and adding a new column which puts a 1 next to the first instance of the ID and a zero if the current ID equals the previous ID. I suspect this will get heavy if the number of records increases dramatically.
Does anyone know a more elegant way of doing this, maybe using data.table?
Thanks for your time
Try
library(dplyr)
toydf %>%
  group_by(ID) %>%
  arrange(desc(as.POSIXct(SMS.Date, format = '%d/%m/%Y %H:%M'))) %>%
  slice(1L)
Or using data.table
library(data.table)
toydf$SMS.Date <- as.POSIXct(toydf$SMS.Date, format = '%d/%m/%Y %H:%M')
setkey(setDT(toydf), ID, SMS.Date)[, .SD[.N], ID] # .SD[.N] keeps the last row per ID; use .SD[1L] for the first
