Is anyone able to assist me with a problem using R to create a sum of previous events in a specific time period, please? I apologise if I do not follow protocol; this is my first question here.
I have a df of IDs and event dates. In the genuine df the events are date times but to keep things simpler I have used dates here.
I am trying to create a variable which is a tally of the number of previous events within the last 2 years for that ID number (or 730 days, as I'm not too concerned about leap years, but that would be nice to factor in).
There is an additional issue in that some IDs will have more than one event on the same date (and time in the genuine df). I do not want to remove these, as in the real df there is an additional variable which is not necessarily the same. However, in the sum I want to count events happening on the same date as one event, i.e. if an ID had only 2 events but they were on the same day the result would be 0, or there may be 3 rows of dates within the previous 2 years for an ID, but as two are the same date the result is 2. I have created an outcome vector to give an example of what I am after; ID 7 has examples of this.
If there were 3 previous events all on the same day the result sum would be 1 and any subsequent events in 2 years
ID <- c(10,1,11,2,2,13,4,5,6,6,13,7,7,7,8,8,9,9,9,10,1,11,2,11,12,9,13,14,7,15,7)
event.date<-c('2018-09-09','2016-06-02','2018-08-20', '2018-11-03', '2018-07-10', '2017-03-08', '2018-06-16', '2017-05-20', '2016-04-02', '2016-07-27', '2018-07-15', '2018-06-15', '2018-06-15', '2018-01-16', '2017-10-07', '2016-08-17','2018-08-01','2017-01-22','2016-08-05', '2018-08-13', '2016-11-28', '2018-11-24','2016-06-01', '2018-03-26', '2017-02-04', '2017-12-01', '2016-05-16', '2017-11-25', '2018-04-01', '2017-09-21', '2018-04-01')
library(dplyr)
df <- data.frame(ID, event.date)
df <- df %>%
  arrange(ID, event.date)
The resulting column should look something like this.
event.count <- c(0,1,0,0,1,0,0,0,1,0,1,1,2,2,0,1,0,1,2,3,0,1,0,1,2,0,0,1,1,0,0)
df$event.count<-event.count
I have tried various if/else constructs and lag() but cannot get what I am after.
Thank you.
Here is a solution with data.table.
To subtract 2 years from the event.date, you can use lubridate and subtract years(2).
After grouping by both ID and event.date, you can subset all of that ID's dates that fall between 2 years earlier and the event date (incbounds = FALSE makes between() exclude both the lower and upper bounds).
Using uniqueN will prevent duplicate dates from being counted multiple times.
library(data.table)
library(lubridate)
df$event.date <- as.Date(df$event.date)
setDT(df)[, new.event.count := uniqueN(df$event.date[df$ID == ID][
between(df$event.date[df$ID == ID],
event.date - years(2),
event.date,
incbounds = FALSE)]),
by = c("ID", "event.date")][]
Output
ID event.date event.count new.event.count
1: 1 2016-06-02 0 0
2: 1 2016-11-28 1 1
3: 2 2016-06-01 0 0
4: 2 2018-07-10 0 0
5: 2 2018-11-03 1 1
6: 4 2018-06-16 0 0
7: 5 2017-05-20 0 0
8: 6 2016-04-02 0 0
9: 6 2016-07-27 1 1
10: 7 2018-01-16 0 0
11: 7 2018-04-01 1 1
12: 7 2018-04-01 1 1
13: 7 2018-06-15 2 2
14: 7 2018-06-15 2 2
15: 8 2016-08-17 0 0
16: 8 2017-10-07 1 1
17: 9 2016-08-05 0 0
18: 9 2017-01-22 1 1
19: 9 2017-12-01 2 2
20: 9 2018-08-01 3 3
21: 10 2018-08-13 0 0
22: 10 2018-09-09 1 1
23: 11 2018-03-26 0 0
24: 11 2018-08-20 1 1
25: 11 2018-11-24 2 2
26: 12 2017-02-04 0 0
27: 13 2016-05-16 0 0
28: 13 2017-03-08 1 1
29: 13 2018-07-15 1 1
30: 14 2017-11-25 0 0
31: 15 2017-09-21 0 0
ID event.date event.count new.event.count
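For readers who want to check the logic without data.table or lubridate, the same rule (distinct earlier dates strictly inside the two-year window) can be sketched in base R for a single ID. The helper name count_prior is hypothetical, and computing the window start with seq() on dates is an assumption about how "two years" should be measured.

```r
# Hypothetical helper: for each date, count the distinct earlier dates
# that fall strictly within the preceding two calendar years.
count_prior <- function(dates) {
  dates <- as.Date(dates)
  vapply(seq_along(dates), function(i) {
    # window start: the same calendar day two years earlier
    lower <- seq(dates[i], by = "-2 years", length.out = 2)[2]
    in_window <- dates > lower & dates < dates[i]
    # unique() makes repeated dates count once
    length(unique(dates[in_window]))
  }, integer(1))
}

# The five event dates of ID 7 from the example data
id7 <- c("2018-01-16", "2018-04-01", "2018-04-01", "2018-06-15", "2018-06-15")
id7_counts <- count_prior(id7)
print(id7_counts)
```

For ID 7 this should reproduce the 0, 1, 1, 2, 2 shown in the output above; applying it per ID (for example with ave() or a grouped operation) is left out of the sketch.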
I have two datatables dt_main and dt_unit.
set.seed(1)
dt_main<-data.table(ID=sample(1:20,size=10),Group=sample(1:3,size=10,replace=TRUE),Unit=0)
dt_unit<-data.table(Group=sample(1:3,size=10,replace=TRUE),Unit_id=sample(1000:3000,size=10,replace=TRUE))
dt_main looks like this:
> dt_main
ID Group Unit
1: 4 1 0
2: 7 1 0
3: 1 1 0
4: 2 2 0
5: 13 2 0
6: 19 2 0
7: 11 2 0
8: 17 3 0
9: 14 1 0
10: 3 3 0
dt_unit looks like this:
> dt_unit
Group Unit_id
1: 1 2624
2: 1 2963
3: 1 1974
4: 1 1800
5: 2 1851
6: 1 1930
7: 1 1325
8: 2 1329
9: 2 1553
10: 2 2445
I would like to fill in the Unit column in dt_main by sampling one Unit_id from the rows of dt_unit with the same Group.
For example, for the first row in dt_main (so Group=1), the code should look in dt_unit, find the rows where Group is 1 (see below), select one Unit_id, and insert it in Unit.
> dt_unit[Group==1]
Group Unit_id
1: 1 2624
2: 1 2963
3: 1 1974
4: 1 1800
5: 1 1930
6: 1 1325
I tried something like this which assigned the same number to each row:
dt_main[,Unit:=sample(dt_unit[Group==Group]$Unit_id,size=1)]
I also attempted sapply but no good.
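Here is a base R solution where we match the Groups and sample one value each time:
dt_main$Unit <- sapply(dt_main$Group, function(i) {
  # candidate Unit_ids belonging to this row's Group
  v1 <- dt_unit$Unit_id[dt_unit$Group %in% i]
  # index with sample.int() so a lone candidate n is not expanded to 1:n by sample()
  if (length(v1) > 0) v1[sample.int(length(v1), 1)] else NA
})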
# ID Group Unit
# 1: 4 1 1930
# 2: 7 1 1325
# 3: 1 1 1325
# 4: 2 2 1329
# 5: 13 2 2445
# 6: 19 2 2445
# 7: 11 2 1851
# 8: 17 3 NA
# 9: 14 1 1930
#10: 3 3 NA
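One thing to watch when sampling per group: if a group has exactly one candidate, sample(x, 1) on that single number n draws from 1:n rather than returning n. None of the groups in the example data has exactly one Unit_id, so the output above is unaffected. The small sketch below, using an invented single-candidate case, shows the difference and a way around it.

```r
set.seed(42)

# Hypothetical case: a group with exactly one candidate unit
only_unit <- 1851

# sample() treats a single number n as the range 1:n
picks <- replicate(20, sample(only_unit, 1))

# Sampling an index instead always returns the candidate itself
safe <- replicate(20, only_unit[sample.int(length(only_unit), 1)])

print(range(picks))
print(unique(safe))
```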
You can join dt_main and dt_unit by Group and select a random row for each ID.
Using dplyr, you can do this with:
library(dplyr)
left_join(dt_main, dt_unit, by = 'Group') %>% group_by(ID) %>% sample_n(1)
# ID Group Unit_id
# <int> <int> <int>
# 1 1 1 1800
# 2 2 2 2445
# 3 3 3 NA
# 4 4 1 2963
# 5 7 1 1800
# 6 11 2 1851
# 7 13 2 1553
# 8 14 1 1325
# 9 17 3 NA
#10 19 2 2445
Note that I removed the Unit column when creating dt_main.
Here is another answer, using mapply, which I used for a case with multiple conditions. In this case the lookup checks that the Group columns match AND compares a new column (Size) between the two tables (as written below, rows of dt_unit are kept when their Size is at least the Size in dt_main). As the OP, I had to add another condition to the original problem, so I am adding this solution to help future users.
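my_fun <- function(var1, var2) {
  # candidate rows: same Group, and Size at least var2
  d <- dt_unit[(Group %in% var1) & (Size >= var2)]
  if (nrow(d) >= 2) {
    sample(x = d$Unit_id, size = 1)
  } else {
    # zero or one candidate: return it unchanged, since sample() on a single number n draws from 1:n
    d$Unit_id
  }
}
vars1 <- dt_main$Group
vars2 <- dt_main$Size
dt_main$Unit <- mapply(my_fun, vars1, vars2)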
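The original post does not show the Size columns, so the following is only a sketch on invented toy tables (all Size values are hypothetical). It also returns NA when no row of dt_unit qualifies; otherwise the function would return a zero-length vector for that row and mapply would give back a list instead of a plain vector.

```r
library(data.table)

# Hypothetical toy tables; the Size values are invented for illustration
dt_main <- data.table(ID = 1:3, Group = c(1, 2, 3), Size = c(10, 5, 1))
dt_unit <- data.table(Group   = c(1, 1, 2, 2),
                      Size    = c(12, 8, 6, 7),
                      Unit_id = c(2624, 2963, 1851, 1329))

my_fun <- function(var1, var2) {
  # keep units in the same Group whose Size is at least var2
  d <- dt_unit[(Group %in% var1) & (Size >= var2)]
  if (nrow(d) >= 2) {
    sample(d$Unit_id, 1)
  } else if (nrow(d) == 1) {
    d$Unit_id
  } else {
    NA_real_   # no qualifying unit
  }
}

set.seed(1)
dt_main$Unit <- mapply(my_fun, dt_main$Group, dt_main$Size)
print(dt_main)
```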
I have the following data
Exp = my data frame
dt<-data.table(Game=c(rep(1,9),rep(2,3)),
Round=rep(1:3,4),
Participant=rep(1:4,each=3),
Left_Choice=c(1,0,0,1,1,0,0,0,1,1,1,1),
Total_Points=c(5,15,12,16,83,7,4,8,23,6,9,14))
> dt
Game Round Participant Left_Choice Total_Points
1: 1 1 1 1 5
2: 1 2 1 0 15
3: 1 3 1 0 12
4: 1 1 2 1 16
5: 1 2 2 1 83
6: 1 3 2 0 7
7: 1 1 3 0 4
8: 1 2 3 0 8
9: 1 3 3 1 23
10: 2 1 4 1 6
11: 2 2 4 1 9
12: 2 3 4 1 14
Now, I need to do the following:
First of all, for each of the participants in each of the Games I need to calculate the mean "Left Choice rate".
After that I want to break the results into 5 groups (left choice <20%, left choice between 20% and 40%, etc.).
For each group (in each of the games), I want to calculate the mean of the Total_Points in the last round (round 3 in this simple example; ONLY the value from round 3). So, for example, for Participant 1 in game 1 the total points in round 3 are 12, and for Participant 4 in game 2 they are 14.
So in the first stage I think I should calculate the following:
Game Participant Percent_left Total_Points (in last round)
1 1 33% 12
1 2 66% 7
1 3 33% 23
2 4 100% 14
And the final result should look like this:
Game Left_Choice Total_Points (average)
1 <35% 17.5 = (12+23)/2
1 35%-70% 7
1 >70% NA
2 <35% NA
2 35%-70% NA
2 >70% 14
Please help! :)
Working in data.table
1: simple group mean with by
dt[,pct_left:=mean(Left_Choice),by=.(Game,Participant)]
2: use cut; not totally clear, but I think you want include.lowest=T.
dt[,pct_grp:=cut(pct_left,breaks=seq(0,1,by=.2),include.lowest=T)]
3: slightly more complicated group mean with by
dt[Round==max(Round),end_mean:=mean(Total_Points),by=.(pct_grp,Game)]
(If you just want the reduced table, use .(end_mean=mean(Total_Points)) instead.)
You didn't make it clear whether there is a global maximum number of rounds (i.e. whether all games end in the same number of rounds); this was assumed above. You'll have to be clearer about this for me to provide an exact alternative, but I suggest starting by defining it round by round:
dt[,end_mean:=mean(Total_Points),by=.(pct_grp,Game,Round)]
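Putting the three steps together on the sample data gives a runnable sketch. It assumes the "last round" should be taken per Game (an assumption, since the question leaves this open) and returns the reduced table rather than adding a column; the names last_rows and res are just illustrative.

```r
library(data.table)

dt <- data.table(Game = c(rep(1, 9), rep(2, 3)),
                 Round = rep(1:3, 4),
                 Participant = rep(1:4, each = 3),
                 Left_Choice = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1),
                 Total_Points = c(5, 15, 12, 16, 83, 7, 4, 8, 23, 6, 9, 14))

# 1: left-choice rate per participant within each game
dt[, pct_left := mean(Left_Choice), by = .(Game, Participant)]

# 2: five equal-width bins of the rate
dt[, pct_grp := cut(pct_left, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)]

# 3: keep each game's final round, then average Total_Points per game and bin
last_rows <- dt[, .SD[Round == max(Round)], by = Game]
res <- last_rows[, .(end_mean = mean(Total_Points)), by = .(Game, pct_grp)]
print(res)
```

On this data the three non-empty groups have means 17.5, 7 and 14, matching the hand calculation in the question; empty bins simply do not appear in the reduced table.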
I am trying to number in sequence locations gathered within a certain time period (those with time since previous location >60 seconds). I've eliminated columns irrelevant to this question, so example data looks like:
TimeSincePrev
1
1
1
1
511
1
2
286
1
My desired output looks like this: (sorry for the underscores, but I couldn't otherwise figure out how to get it to include my spaces to make the columns obvious...)
TimeSincePrev ___ NoInSeries
1 ________________ 1
1 ________________ 2
1 ________________ 3
1 ________________ 4
511 ______________ 1
1 ________________ 2
2 ________________ 3
286 ______________ 1
1 ________________ 2
...and so on for another 3500 lines
I have tried a couple of ways to approach this unsuccessfully:
First, I tried an ifelse, where I would make NoInSeries 1 if the TimeSincePrev was more than a minute, or else the previous row's value + 1. (In this case, I first inserted a line number column to help me reference the previous row, but I suspect there is an easier way to do this?)
df$NoInSeries <- ifelse((df$TimeSincePrev > 60), 1, ((df[((df$LineNo)-1),"NoInSeries"])+1))
I don't get any errors, but it only gives me the 1s where I want to restart sequences but does not fill in any of the other values:
TimeSincePrev ___ NoInSeries
1 ________________ NA
1 ________________ NA
1 ________________ NA
1 ________________ NA
511 ______________ 1
1 ________________ NA
2 ________________ NA
286 ______________ 1
1 ________________ NA
I assume this has something to do with trying to reference back to itself?
My other approach was to try to get it to do sequences of numbers (max 15), restarting every time there is a change in the TimeSincePrev value:
df$NoInSeries <- ave(df$TimeSincePrev, df$TimeSincePrev, FUN=function(y) 1:15)
I still get no errors but exactly the same output as before, with NAs in place and no other numbers filled in.
Thanks for any help!
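Use ave after creating a grouping variable that increments at each series change (a comparison plus cumsum). seq_along is used rather than seq so that a series with a single row is numbered 1 instead of being expanded by seq():
dt$NoInSeries <-
  ave(dt$TimeSincePrev,
      cumsum(dt$TimeSincePrev > 60),
      FUN = seq_along)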
The result is:
dt
# TimeSincePrev NoInSeries
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 511 1
# 6 1 2
# 7 2 3
# 8 286 1
# 9 1 2
Step-by-step explanation:
## detect time change > 60 seconds
## group value by the time change
(gg <- cumsum(dt$TimeSincePrev >60))
[1] 0 0 0 0 1 1 1 2 2
## get the sequence by group
ave(dt$TimeSincePrev, gg, FUN=seq_along)
[1] 1 2 3 4 1 2 3 1 2
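One edge case worth checking is two restarts in a row, which leaves a series with a single row. The sketch below uses an invented vector (not the poster's data) to show the grouping and numbering in that situation; seq_along() is used because seq(511) would return 1:511 rather than 1.

```r
# Hypothetical input: the series that starts at 511 contains only one row
times <- c(1, 1, 511, 286, 1)

# a new group starts at every gap longer than 60 seconds
grp <- cumsum(times > 60)

# number the rows within each group
no_in_series <- ave(times, grp, FUN = seq_along)

print(grp)
print(no_in_series)
```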
Using data.table
library(data.table)
setDT(dt)[,NoInSeries:=seq_len(.N), by=cumsum(TimeSincePrev >60)]
dt
# TimeSincePrev NoInSeries
#1: 1 1
#2: 1 2
#3: 1 3
#4: 1 4
#5: 511 1
#6: 1 2
#7: 2 3
#8: 286 1
#9: 1 2
Or
indx <- c(which(dt$TimeSincePrev >60)-1, nrow(dt))
sequence(c(indx[1], diff(indx)))
#[1] 1 2 3 4 1 2 3 1 2
data
dt <- data.frame(TimeSincePrev=c(1,1,1,1,511, 1,2, 286,1))
I have a data frame that I'm working with that contains experimental data. For the purposes of this post we can limit the discussion to 5 columns: ExperimentID, ROI, isContrast, isTreated, and Value. ROI is a text-based factor that indicates where a region of interest is drawn, e.g. 'ROI_1', 'ROI_2', etc. isTreated and isContrast are binary fields indicating whether or not some treatment was applied. I want to make a scatter plot comparing the values of, e.g., 'ROI_1' vs. 'ROI_2', which means I need the data paired in such a way that when I plot it the first X value is from Experiment_1 and ROI_1, the first Y value is from Experiment_1 and ROI_2, the next X value is from Experiment_2 and ROI_1, the next Y value is from Experiment_2 and ROI_2, etc. I only want to make this comparison for common values of isContrast and isTreated (i.e. 1 plot for each combination of these variables, so 4 plots altogether).
Subsetting doesn't solve my problem because data from different experiments/ROIs was sometimes entered out of numerical order.
The following code produces a mock data set to demonstrate the problem
expID = c('Bob','Bob','Bob','Bob','Lisa','Lisa','Lisa','Lisa','Alice','Alice','Alice','Alice','Joe','Joe','Joe','Joe','Bob','Bob','Alice','Alice','Lisa','Lisa')
treated = c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0)
contrast = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
val = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,6,7,8,9,10,11)
roi = c(rep('A',16),'B','B','B','B','B','B')
myFrame = data.frame(ExperimentID=expID,isTreated = treated, isContrast= contrast,Value = val, ROI=roi)
ExperimentID isTreated isContrast Value ROI
1 Bob 0 0 1 A
2 Bob 0 1 2 A
3 Bob 1 0 3 A
4 Bob 1 1 4 A
5 Lisa 0 0 1 A
6 Lisa 0 1 2 A
7 Lisa 1 0 3 A
8 Lisa 1 1 4 A
9 Alice 0 0 1 A
10 Alice 0 1 2 A
11 Alice 1 0 3 A
12 Alice 1 1 4 A
13 Joe 0 0 1 A
14 Joe 0 1 2 A
15 Joe 1 0 3 A
16 Joe 1 1 4 A
17 Bob 0 0 6 B
18 Bob 0 1 7 B
19 Alice 0 0 8 B
20 Alice 0 1 9 B
21 Lisa 0 0 10 B
22 Lisa 0 1 11 B
Now let's say I want to scatter plot values for A vs. B. That is to say, I want to plot x vs. y where {(x,y)} = {(Bob's Value from ROI A, Bob's Value from ROI B), (Alice's Value from ROI A, Alice's Value from ROI B), ...} etc., and these all must have the same values for isTreated and isContrast for the comparison to make sense. Now, if I just go and subset I'll get something like:
> x= myFrame$Value[(myFrame$ROI == 'A') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> x
[1] 1 1 1 1
> y= myFrame$Value[(myFrame$ROI == 'B') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> y
[1] 6 8 10
Now, as you can see, the values in x correspond to the first rows of Bob, Lisa, Alice and Joe, respectively, but the values of y correspond to Bob, Alice and Lisa, respectively, and there is no value for Joe.
So say I ignored the value for Joe because that data is missing for B and just decided to plot the first 3 values of x vs. the first 3 values of y. The data are still out of order because x = (Bob, Lisa, Alice) but y = (Bob, Alice, Lisa) in terms of where the values are coming from. So I would like to know how to make vectors such that the order is correct and the plot makes sense.
Similar to @Matthew, with ggplot:
The idea is to reshape your data so that the values from ROI=A and ROI=B are in different columns. This can be done (with your sample data) as follows:
library(reshape2)
zz <- dcast(myFrame,
value.var="Value",
formula=ExperimentID+isTreated+isContrast~ROI)
zz
ExperimentID isTreated isContrast A B
1 Alice 0 0 1 8
2 Alice 0 1 2 9
3 Alice 1 0 3 NA
4 Alice 1 1 4 NA
5 Bob 0 0 1 6
6 Bob 0 1 2 7
7 Bob 1 0 3 NA
8 Bob 1 1 4 NA
9 Joe 0 0 1 NA
10 Joe 0 1 2 NA
11 Joe 1 0 3 NA
12 Joe 1 1 4 NA
13 Lisa 0 0 1 10
14 Lisa 0 1 2 11
15 Lisa 1 0 3 NA
16 Lisa 1 1 4 NA
Notice that your sample data is rather sparse (lots of NAs).
To plot:
library(ggplot2)
ggplot(zz,aes(x=A,y=B,color=factor(isTreated))) +
geom_point(size=4)+facet_wrap(~isContrast)
This produces a scatter plot of B against A, faceted by isContrast and colored by isTreated.
The reason there are no blue points is that, in your sample data, there are no occurrences of isTreated=1 and ROI=B.
Something like this, perhaps:
myFrameReshaped <- reshape(myFrame, timevar='ROI', direction='wide', idvar=c('ExperimentID','isTreated','isContrast'))
plot(Value.B ~ Value.A, data=myFrameReshaped)
To condition by the isTreated and isContrast variables, lattice comes in handy:
library(lattice)
xyplot(Value.B~Value.A | isTreated + isContrast, data=myFrameReshaped)
Values that are not present for one of the conditions give NA, and are not plotted.
head(myFrameReshaped)
## ExperimentID isTreated isContrast Value.A Value.B
## 1 Bob 0 0 1 6
## 2 Bob 0 1 2 7
## 3 Bob 1 0 3 NA
## 4 Bob 1 1 4 NA
## 5 Lisa 0 0 1 10
## 6 Lisa 0 1 2 11
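If only the complete pairs are wanted (rows where both ROI A and ROI B have a value), a base R merge() on the key columns is another way to get correctly matched vectors. This is a different technique from the reshaping shown above, sketched here on the mock data; the names a, b and paired are just illustrative.

```r
# Mock data from the question
expID <- c('Bob','Bob','Bob','Bob','Lisa','Lisa','Lisa','Lisa',
           'Alice','Alice','Alice','Alice','Joe','Joe','Joe','Joe',
           'Bob','Bob','Alice','Alice','Lisa','Lisa')
treated  <- c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0)
contrast <- c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
val <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,6,7,8,9,10,11)
roi <- c(rep('A', 16), rep('B', 6))
myFrame <- data.frame(ExperimentID = expID, isTreated = treated,
                      isContrast = contrast, Value = val, ROI = roi)

# Split by ROI, then join on the columns that identify a measurement
a <- subset(myFrame, ROI == "A", select = -ROI)
b <- subset(myFrame, ROI == "B", select = -ROI)
paired <- merge(a, b,
                by = c("ExperimentID", "isTreated", "isContrast"),
                suffixes = c(".A", ".B"))
print(paired)
```

Because merge() does an inner join by default, Joe and the treated rows (which have no ROI B value) are dropped, and paired$Value.A and paired$Value.B line up row by row, so plot(Value.B ~ Value.A, data = paired) plots matched pairs.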