Merge rows with the same ID but with overlapping variables - r

I have data in R with over 6,000 observations and 96 variables.
The data relates to groups of individuals and their activities. If a group returned, its Group ID was recorded again and a new observation was made. I need to merge the rows by ID so that the number of individuals takes the highest value recorded, while the activities etc. are a combination of both observations.
The data contains the number of individuals, activities, impacts, time of arrival, and so on. The issue is that some observations were split across two lines, so activities for the same group may have been recorded on another line. The Group ID is the same for both observations, but one may have the number of individuals plus some activity or impact records, while the second may be incomplete, containing only the Group ID and further impacts (additional to those in the first record). The number of individuals in a group never changes, so I need some way to combine the rows such that activities are additive, the number of visitors takes the highest value, time of arrival takes the earliest value, and time of departure takes the later of the two observations.
Does anyone know how to merge observations based on Group ID while varying the merging rule by variable?

I'm not sure if this is exactly what you want, but to combine rows of a data frame based on multiple conditions you can use the dplyr package and its summarise() function. I generated some data to try this in R directly; you would have to modify the code to fit your needs.
library(dplyr)

# generate example data: two observations per ID
ID <- rep(1:20, 2)
visitors <- sample(1:50, 40, replace = TRUE)
impact <- sample(rep(c("a", "b", "c", "d", "e"), 8))
arrival <- sample(rep(8:15, 5))
departure <- sample(rep(16:23, 5))
df <- data.frame(ID, visitors, impact, arrival, departure)
df$impact <- as.character(df$impact)

# summarise rows with identical ID: highest visitor count, earliest
# arrival, latest departure, and all impacts concatenated
df_summary <- df %>%
  group_by(ID) %>%
  summarise(visitors = max(visitors),
            arrival = min(arrival),
            departure = max(departure),
            impact = paste0(impact, collapse = ", "))
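Since your real data has many activity and impact columns, dplyr::across() lets you apply the concatenation to all of them at once rather than spelling out 90-odd variables. A sketch, assuming (hypothetically) that those columns share a prefix such as "activity_":
df_summary <- df %>%
  group_by(ID) %>%
  summarise(visitors = max(visitors),
            arrival = min(arrival),
            departure = max(departure),
            # "activity_" is a made-up prefix; adjust to your column names
            across(starts_with("activity_"), ~ paste0(.x, collapse = ", ")))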
Hope this helps!

Related

R - grouping by ID, finding the max value in column two (delete all other) and keep value of third column as well

I have a dataset of three columns and roughly 300,000 rows: Person ID, Likelihood of Risk, and Year the survey was taken.
Each person has taken part multiple times and I only want the most recent Likelihood of Risk.
I wanted to get that by grouping on Person ID and then finding the max year, but that did not work out: I still ended up with multiple rows per Person ID.
To continue working I need one specific value of Likelihood of Risk for each ID.
# first attempt: grouping by pid AND A_risk keeps one row per
# (pid, A_risk) pair, so IDs still appear multiple times
Riskytest <- Risk_Adult %>%
  group_by(pid, A_risk) %>%
  summarize(max = max(syear))

# working solution: keep, per pid, the row(s) with the latest survey year
Riskytest <- Risk_Adult %>%
  group_by(pid) %>%
  slice_max(syear) %>%
  ungroup()
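One caveat, in case it matters for your data: slice_max() keeps ties by default, so a person with two surveys in their latest year would still appear twice. If you want exactly one row per pid regardless:
Riskytest <- Risk_Adult %>%
  group_by(pid) %>%
  slice_max(syear, n = 1, with_ties = FALSE) %>%
  ungroup()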

How do I assign grouped values (per-subject) from one df to another df that's grouped by trial (e.g. repeated rows for each subject)

I am using R.
I have two dfs, A and B.
A is grouped by trial, so contains numerous observations for each subject (e.g. reaction times per trial).
B is grouped by subject, so contains just one observation per subject (e.g. self-reported individual difference measures).
I want to transfer the B values so they repeat per participant across trials in A. There are numerous variables I wish to transfer from B to A, so I'm looking for an elegant solution.
What you want is to use dplyr::left_join to do this elegantly.
library(dplyr)
C <- A %>%
  left_join(B, by = "subject_id")
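A minimal sketch with made-up data, assuming the key column is called subject_id in both data frames:
library(dplyr)

A <- data.frame(subject_id = c(1, 1, 2, 2),
                rt = c(350, 420, 510, 480))  # one row per trial
B <- data.frame(subject_id = c(1, 2),
                anxiety = c(12, 7))          # one row per subject

C <- A %>% left_join(B, by = "subject_id")
# every subject-level column of B now repeats across that subject's
# trials in C; subjects missing from B would get NA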

How to mutate variables on a rolling time window by groups with unequal time distances?

I have a large df with around 40,000,000 rows, covering a time period of 2 years in total and more than 400k unique users.
The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user over several points in time.
Each row is therefore a unique combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable (df$category_var) and the time variable (df$time_var) I now want to calculate 3 new variables at the user_id level over a rolling time window of the previous 30 days.
So in each row, the new variables should be calculated over the values of the previous 30 days of the input variables.
I do not observe all users over the same time period, some enter later and some leave earlier, and the distances between time points are unequal as well, so I cannot calculate the variables just by number of rows.
So far I have only managed to calculate my new variables per user_id over the whole observation period, but I couldn't manage to calculate them over the previous 30 days as a rolling window per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr, my attempt at calculating these variables on a rolling time window at a grouped-by-user_id level has taken more than a week without any results. I would be so grateful for your support!
My df basically looks like :
user_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
time_var <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", ...)
v1 <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, ...)
v2 <- c(1, 1, 0, 1, 0, 1, 1, 0, ...)
My first needed new variable (new_x1) is basically a cumulative sum based on a condition on the dummy variable v1. What I have achieved so far:
df <- df %>%
  group_by(user_id) %>%
  mutate(new_x1 = cumsum(v1 == 1))
What I need: the variable should only count over the previous 30 days per user.
Needed new variable (new_x2): basically a cumulative count of v1 whenever v2 has a (so far) unique value, i.e. count each new value of v2 given v1 == 1.
What I achieved so far:
df <- df %>%
  group_by(user_id, category_var) %>%
  mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))
I also need this based on the previous 30 days rather than the whole observation period per user.
My third variable of interest (new_x3): the time between two observations, given a certain condition (v1 == 1).
# interevent time
df2 <- df %>%
  group_by(user_id) %>%
  filter(v1 == 1) %>%
  mutate(time_between_events = time_var - lag(time_var))
I would also need this over the previous 30 days.
Thank you so much!
Edit after John Springs' post:
My potential solution would then be:
setDT(df)[, `:=`(new_x1 = cumsum(df$v1 == 1[df$user_id == user_id][between(df$time[df$user_id == user_id], time - 30, time, incbounds = TRUE)]),
                 new_x2 = cumsum(!duplicated(df$v1 == 1[df$user_id == user_id][between(df$time[df$user_id == user_id], time - 30, time, incbounds = TRUE)]))),
          by = eval(c("user_id", "time"))]
I'm really not familiar with data.table and not sure whether I can nest my cumsum conditions in one data.table call like that.
Any suggestions?
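Not a tested answer for data of your size, but one possible direction: data.table's non-equi joins can aggregate, for each row, all of the same user's rows inside the trailing 30-day window, which avoids nesting conditions inside cumsum(). A sketch on toy data, assuming time_var is in days (with POSIXct you would subtract 30*86400 seconds, or lubridate::days(30)):
library(data.table)

dt <- data.table(user_id  = c(1, 1, 1, 1, 1),
                 time_var = c(1, 2, 3, 4, 5),
                 v1 = c(0, 1, 0, 0, 1),
                 v2 = c(1, 1, 0, 1, 0))

# each row defines its own lookup window [time_var - 30, time_var]
windows <- dt[, .(user_id, w_start = time_var - 30, w_end = time_var)]

# non-equi self-join: for every window, aggregate the matching rows of
# the same user (by = .EACHI returns one result row per window)
rolled <- dt[windows,
             on = .(user_id, time_var >= w_start, time_var <= w_end),
             .(new_x1 = sum(v1 == 1),
               new_x2 = sum(!duplicated(v2) & v1 == 1)),
             by = .EACHI]

dt[, `:=`(new_x1 = rolled$new_x1, new_x2 = rolled$new_x2)]
Note that new_x2 here recomputes "first occurrence of v2" within each window rather than since the start of observation, which may or may not match your intended definition; the interevent time (new_x3) could reuse the same join with a different aggregate in place of the sums.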

Obtain average for unique identifiers in R

I have a dataset containing rows of unique identifiers. Each unique identifier occupies several rows because each person (identifier) has several ratings. For example, unique identifier 1 may have a rating for Goal A, Goal B and Goal C, each represented in a separate row.
What would be the best way to find the average for each unique identifier (i.e. for manager 1 (unique identifier 1), what is their average score across Goal A, Goal B and Goal C)?
In Excel, I'd do this by sorting the data, extracting the unique identifiers, copying and pasting those values to the bottom of the dataset, and finding the average using a series of conditional statements. I'm sure there must be a way to do this in R. Would appreciate any help/insight.
I started with this code, but am not sure if this is what I need. I'm filtering by departments (FSO), then asking it to give me a list of unique IDs, and then computing the average for each manager.
df %>% filter(newdept=='FSO') %>%
distinct(ID) %>%
summarize(compmean = mean(CompRating2, na.rm=TRUE))
A base R solution would be to use aggregate:
# example data: random ids and scores
dat <- data.frame(id = sample(LETTERS, 50, replace = TRUE),
                  score = sample(1:5, 50, replace = TRUE),
                  stringsAsFactors = FALSE)

# mean score per id
aggregate(score ~ id, data = dat, mean)
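Since you already started with dplyr, the equivalent there is a grouped summary: group_by() rather than distinct(), which was discarding your rating column.
library(dplyr)

dat %>%
  group_by(id) %>%
  summarize(compmean = mean(score, na.rm = TRUE))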

Add column from another data.frame based on multiple criteria

I have 2 data frames:
cars = data.frame(car_id = c(1, 2, 2, 3, 4, 5, 5),
                  max_speed = c(150, 180, 185, 200, 210, 230, 235),
                  since = c('2000-01-01', '2000-01-01', '2007-10-01',
                            '2000-01-01', '2000-01-01', '2000-01-01',
                            '2009-11-18'))
voyages = data.frame(voy_id = c(1234, 1235, 1236, 1237, 1238),
                     car_id = c(1, 2, 3, 4, 5),
                     date = c('2000-01-01', '2002-02-02', '2003-03-03',
                              '2004-04-04', '2010-05-05'))
If you look closely you can see that cars occasionally has multiple entries for a car_id, because the manufacturer decided to increase the max speed of that make. Each entry has a date, since, indicating the date from which that max speed applies.
My goal: I want to add the max_speed variable to the voyages data frame based on the values found in cars. I can't just join the two data frames by car_id, because I also have to compare date in voyages with since in cars to determine the proper max_speed.
Question: What is the elegant way to do this without loops?
One approach:
Merge the two datasets, keeping the duplicated observations in cars.
Drop any observations where since is later than date. Then order the dataset so the most recent since dates come first, and drop duplicated observations of voy_id: this ensures that where a car has two since dates, you only keep the most recent one that precedes the voyage date.
z <- merge(cars, voyages, by = "car_id")              # all speed entries per car
z <- z[as.Date(z$since) <= as.Date(z$date), ]         # keep speeds already in force
z <- z[order(as.Date(z$since), decreasing = TRUE), ]  # most recent 'since' first
z <- z[!duplicated(z$voy_id), ]                       # one row per voyage
Also curious to see if someone comes up with a more elegant, parsimonious approach.
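One more parsimonious candidate (a sketch using the example data above): a data.table rolling join, which matches each voyage to the most recent since at or before its date in a single step.
library(data.table)

setDT(cars)
setDT(voyages)
cars[, since := as.Date(since)]
voyages[, date := as.Date(date)]

# roll = TRUE carries the last 'since' (and its max_speed)
# forward to each voyage date
result <- cars[voyages, on = .(car_id, since = date), roll = TRUE]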
