Conditional count in R data.table with two grouping variables

I have a data.table in which I have records belonging to multiple groupings. I want to count the number of records that fall into the same group for two variables, where the grouping variables may include some NAs.
Example data below:
library(data.table)
mydt <- data.table(id = c(1,2,3,4,5,6),
travel = c("no travel", "morocco", "algeria",
"morocco", "morocco", NA),
cluster = c(1,1,1,2,2,2))
> mydt
id travel cluster
1: 1 no travel 1
2: 2 morocco 1
3: 3 algeria 1
4: 4 morocco 2
5: 5 morocco 2
6: 6 <NA> 2
In the above example I want to calculate how many people travelled to each destination by cluster.
Initially I was doing this using the .N notation, as below:
mydt[, ndest1 := as.double(.N), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1
1: 1 no travel 1 1
2: 2 morocco 1 1
3: 3 algeria 1 1
4: 4 morocco 2 2
5: 5 morocco 2 2
6: 6 <NA> 2 1
However, NA is counted as a value. This doesn't work for my purposes: I later want to identify which destination within each cluster the most people travelled to (morocco in cluster 2 above) using max(...), and if a given cluster contains many NAs, NA will incorrectly be flagged as the most popular destination.
I then tried using sum() instead, as this is intuitive and also allows me to exclude NAs:
mydt[, ndest2 := sum(!is.na(travel)), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1 ndest2
1: 1 no travel 1 1 1
2: 2 morocco 1 1 1
3: 3 algeria 1 1 1
4: 4 morocco 2 2 1
5: 5 morocco 2 2 1
6: 6 <NA> 2 1 0
This gives incorrect results. After a bit of further testing, it seems to be because I used the same variable in the logical expression inside sum(...) as one of the grouping variables in by.
When I use a different variable I get the desired result except that I am not able to exclude NAs this way:
mydt[, ndest3 := sum(!is.na(id)), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1 ndest2 ndest3
1: 1 no travel 1 1 1 1
2: 2 morocco 1 1 1 1
3: 3 algeria 1 1 1 1
4: 4 morocco 2 2 1 2
5: 5 morocco 2 2 1 2
6: 6 <NA> 2 1 0 1
This leads me to two questions:
In a data.table conditional count, how do I exclude NAs?
Why can't the same variable be used in the sum() logical expression and as a grouping variable in by?
Any insights would be much appreciated.

You can exclude the NAs in i:
mydt[!is.na(travel), ndest1 := .N, by = .(travel, cluster)][]
# id travel cluster ndest1
#1: 1 no travel 1 1
#2: 2 morocco 1 1
#3: 3 algeria 1 1
#4: 4 morocco 2 2
#5: 5 morocco 2 2
#6: 6 <NA> 2 NA
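As to the second question: inside j, data.table exposes each by= column as a length-1 vector holding that group's value, not as .N repeated values. So within any (cluster, travel) group, sum(!is.na(travel)) can only ever be 1 (a non-NA group) or 0 (the NA group), whereas id is not a grouping column and keeps the full group length. A minimal check on the example data (the n_*/len_* names are just illustrative):
mydt[, .(n_rows = .N, len_travel = length(travel), len_id = length(id)),
     by = c("cluster", "travel")]
# len_travel is 1 in every group, while len_id equals n_rows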

Related

Generating recursive ID by multi-variate group using data.table in R

I've found several options on how to generate IDs by groups using the data.table package in R, but none of them fit my problem exactly. Hopefully someone can help.
In my problem, I have 160 markets that fall within 21 regions in a country. These markets are numbered 1:160 and there may be multiple observations documented within each market. I would like to restructure my market ID variable so that it represents unique markets within each region, and starts counting over again with each new region.
Here's some code to represent my problem:
require(data.table)
dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
market = c(1,1,2,2,3,3,4,4,5,6,7,7))
> dt
region market
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 3
6: 2 3
7: 2 4
8: 2 4
9: 3 5
10: 3 6
11: 3 7
12: 3 7
Currently, my data is set up to represent the result of
dt[, market_new := .GRP, by = .(region, market)]
But what I'd like get is
region market market_new
1: 1 1 1
2: 1 1 1
3: 1 2 2
4: 1 2 2
5: 2 3 1
6: 2 3 1
7: 2 4 2
8: 2 4 2
9: 3 5 1
10: 3 6 2
11: 3 7 3
12: 3 7 3
This seems to return what you want:
dt[, market_new:=as.numeric(factor(market)), by=region]
Here we split the data up by region, then give a unique ID to each market within each region via factor(), extracting the underlying integer codes with as.numeric().
From 1.9.5+, you can use frank() (or frankv()) with ties.method = "dense" as follows:
dt[, market_new := frankv(market, ties.method = "dense"), by = region]
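If you'd rather not go through factor levels or ranks, a base-R idiom that also works here is match() against unique(), which numbers markets in order of first appearance within each region (a sketch; equivalent to the above for this data, where markets happen to appear in sorted order):
dt[, market_new := match(market, unique(market)), by = region]
# match() gives each market its position among the region's unique markets,
# so the numbering restarts at 1 in every region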

Mean of Groups of means in R

I have the following data (in a data.table called dt):
dt<-data.table(Game=c(rep(1,9),rep(2,3)),
Round=rep(1:3,4),
Participant=rep(1:4,each=3),
Left_Choice=c(1,0,0,1,1,0,0,0,1,1,1,1),
Total_Points=c(5,15,12,16,83,7,4,8,23,6,9,14))
> dt
Game Round Participant Left_Choice Total_Points
1: 1 1 1 1 5
2: 1 2 1 0 15
3: 1 3 1 0 12
4: 1 1 2 1 16
5: 1 2 2 1 83
6: 1 3 2 0 7
7: 1 1 3 0 4
8: 1 2 3 0 8
9: 1 3 3 1 23
10: 2 1 4 1 6
11: 2 2 4 1 9
12: 2 3 4 1 14
Now, I need to do the following:
First, for each participant in each game, I need to calculate the mean "left choice" rate.
After that I want to break the results into 5 groups (left choice < 20%, left choice between 20% and 40%, etc.).
For each group (in each of the games), I want to calculate the mean of Total_Points in the last round (round 3 in this simple example), using only the round-3 value. For example, for participant 1 in game 1 the round-3 Total_Points value is 12, and for participant 4 in game 2 it is 14.
So in the first stage I think I should calculate the following:
Game Participant Percent_left Total_Points (in last round)
1 1 33% 12
1 2 66% 7
1 3 33% 23
2 4 100% 14
And the final result should look like this:
Game Left_Choice Total_Points (average)
1 <35% 17.5 = (12+23)/2
1 35%-70% 7
1 >70% NA
2 <35% NA
2 35%-70% NA
2 >70% 14
Please help! :)
Working in data.table
1: simple group mean with by
dt[,pct_left:=mean(Left_Choice),by=.(Game,Participant)]
2: use cut(); it's not totally clear, but I think you want include.lowest = TRUE.
dt[,pct_grp:=cut(pct_left,breaks=seq(0,1,by=.2),include.lowest=TRUE)]
3: slightly more complicated group mean with by
dt[Round==max(Round),end_mean:=mean(Total_Points),by=.(pct_grp,Game)]
(if you just want the reduced table, use .(end_mean = mean(Total_Points)) instead).
You didn't make it clear whether there is a global maximum number of rounds (i.e. whether all games end in the same number of rounds); this was assumed above. You'll have to be more clear about this in order to provide an exact alternative, but I suggest starting with just defining it round by round:
dt[,end_mean:=mean(Total_Points),by=.(pct_grp,Game,Round)]
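Chaining the three steps produces the reduced table in one pass (a sketch under the same assumption that all games end at the same round; res is just an illustrative name):
res <- dt[, pct_left := mean(Left_Choice), by = .(Game, Participant)
  ][, pct_grp := cut(pct_left, breaks = seq(0, 1, by = .2), include.lowest = TRUE)
  ][Round == max(Round), .(end_mean = mean(Total_Points)), by = .(Game, pct_grp)]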

R replace value with the value shown by an index

I have a table called "merged" like:
Nationality CustomerID_count ClusterId
1 argentina 1 1
2 ARGENTINA 26 1
3 ARGENTINO 1 1
4 argentona 1 1
5 boliviana 14 2
6 paragauy 1 3
7 paraguay 1 3
8 PARAGUAY 1 3
I need to create a new Nationality column by looking up, within each cluster, the Nationality with the max value of CustomerID_count.
I built a second table with the following code:
merged1<-data.table(merged)
merged2<-merged1[, which.max(CustomerID), by = ClusterId]
And I got:
ClusterId V1
1: 1 2
2: 2 1
3: 3 1
After that I did a merge:
tot<-merge(x=merged, y=merged2, by= "ClusterId", all.x=TRUE)
And I got the following table:
ClusterId Nationality CustomerID V1
1 1 argentina 1 2
2 1 ARGENTINA 26 2
3 1 ARGENTINO 1 2
4 1 argentona 1 2
5 2 boliviana 14 1
6 3 paragauy 1 1
7 3 paraguay 1 1
8 3 PARAGUAY 1 1
But I didn't know how to finish. I tried this:
tot[,5]=tot[V1,5]
I want each row to get the Nationality found in the row number shown in column V1, but this didn't work.
How can I do the last part? And is there a better way to solve this overall?
Thanks!
Note that you may have more than one CustomerID_count matching the maximum value (e.g. all versions of "paraguay" have CustomerID_count == 1, which is the max for that cluster).
It's very easy using the plyr package:
library(plyr)
ddply(merged, .(ClusterId), mutate, Nationality2 = Nationality[CustomerID_count == max(CustomerID_count)])
This could be a good use-case for dplyr:
library(dplyr)
merged <- merged %>%
group_by(ClusterId) %>%
mutate(newNat=Nationality[CustomerID_count == max(CustomerID_count)]) %>%
ungroup
print(merged)
## Source: local data frame [8 x 4]
##
## Nationality CustomerID_count ClusterId newNat
## 1 argentina 1 1 ARGENTINA
## 2 ARGENTINA 26 1 ARGENTINA
## 3 ARGENTINO 1 1 ARGENTINA
## 4 argentona 1 1 ARGENTINA
## 5 boliviana 14 2 boliviana
## 6 paragauy 1 3 paragauy
## 7 paraguay 1 3 paraguay
## 8 PARAGUAY 1 3 PARAGUAY
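Since the question started from a data.table (merged1), the same thing can be done by reference (a sketch, assuming the count column is named CustomerID_count as in the first table; which.max() returns the first maximum, so ties are broken by row order instead of producing several matches as above):
merged1[, newNat := Nationality[which.max(CustomerID_count)], by = ClusterId]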

Select row prior to first occurrence of an event by group

I have a series of observations that describe if and when an animal is spotted in a specific area. The following sample table identifies when a certain animal is seen (status == 1) or not (status == 0) by day.
id date status
1 1 2014-06-20 1
2 1 2014-06-21 1
3 1 2014-06-22 1
4 1 2014-06-23 1
5 1 2014-06-24 0
6 2 2014-06-20 1
7 2 2014-06-21 1
8 2 2014-06-22 0
9 2 2014-06-23 1
10 2 2014-06-24 1
11 3 2014-06-20 1
12 3 2014-06-21 1
13 3 2014-06-22 0
14 3 2014-06-23 1
15 3 2014-06-24 0
16 4 2014-06-20 1
17 4 2014-06-21 0
18 4 2014-06-22 0
19 4 2014-06-23 0
20 4 2014-06-24 1
Using the data.table package, I can identify the first day an animal is no longer seen in the area:
library(data.table)
dt <- as.data.table(df)
dt[status == 0, .SD[1], by = id]
id date status
1: 1 2014-06-24 0
2: 2 2014-06-22 0
3: 3 2014-06-22 0
4: 4 2014-06-21 0
While the above table is useful, I would like to know how to adapt this to find the date prior to the first occurrence of each animal's absence. In other words, I want to know the last day each animal was in the area before temporarily leaving.
My actual data set bins these presence/absence observations into different time lengths depending on the situation (e.g. presence/absence by 3-hour intervals, 6-hour, etc). Therefore, it would be easier to access the previous row rather than subtract the time interval from each value since it always changes. My desired output would be the following:
id date status
1: 1 2014-06-23 1
2: 2 2014-06-21 1
3: 3 2014-06-21 1
4: 4 2014-06-20 1
Please feel free to use base code or other packages (i.e. dplyr) to answer this question, I am always up for something new. Thank you for your time!
Try the following:
dt[dt[status == 0, .I[1] - 1, by = id]$V1]
# id date status
#1: 1 2014-06-23 1
#2: 2 2014-06-21 1
#3: 3 2014-06-21 1
#4: 4 2014-06-20 1
Incidentally, this method (using .I instead of .SD) will also be much faster. See this post for more on that.
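One caveat (not triggered by this data, but worth knowing): .I[1] - 1 is an offset into the whole table, so if an animal's very first observation already has status == 0, it silently picks up the previous animal's last row (or row 0 for the first animal). A sketch of a variant that computes the offset inside each group instead:
dt[na.omit(dt[, .I[which(status == 0)[1] - 1], by = id]$V1)]
# which() is evaluated within the group, so an absence on a group's first
# row gives integer(0) and that animal is simply dropped; na.omit() drops
# animals that are never absent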
Here is a method via dplyr :
df %>%
group_by(id) %>%
mutate(status_change = status - lead(status)) %>%
filter(status_change == 1)
id date status status_change
1 1 2014-06-23 1 1
2 2 2014-06-21 1 1
3 3 2014-06-21 1 1
4 3 2014-06-23 1 1
5 4 2014-06-20 1 1
This takes advantage of status being a numeric variable. lead() accesses the next value; the change is 1 when an animal disappears.
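Note that this returns every day that precedes a departure (id 3 appears twice), not only the day before the first absence as in the desired output. To keep just the first per animal, take the first matching row in each group (a sketch):
df %>%
  group_by(id) %>%
  mutate(status_change = status - lead(status)) %>%
  filter(status_change == 1) %>%
  slice(1) %>%
  select(-status_change)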

R data.table: Count events since last occurance (multiple, inclusive/exclusive)

[Updated: tried to clarify and simplify; corrected sample code and data.]
I've a set of measurements that are taken over a period of days. The range of numbers that can be captured in any measurement is 1-25 (in real life, given the test set, the range could be as high as 100 or as low as 20).
I'd like a way to tally how many events have passed since a specific number last occurred, regardless of which measurement column it appeared in. The count should reset after each match, as shown below.
V1, V2, ..., Vn are the captured values.
Match1, Match2, ..., Matchn are the counts since the corresponding value was last encountered.
Note: the Matchn count is incremented regardless of which Vx column n appears in.
Any help is much appreciated.
This is somewhat related to my earlier post here.
Sample input
library(data.table)
t <- data.table(
Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5", "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
V1 = c(4, 2, 3, 1,7,22,35,3,29,36),
V2 = c(2, 5, 12, 4,8,2,38,50,4,1)
)
Code for creating the sample output:
t$match1 <- c(1,2,3,4,1,2,3,4,5,1)
t$match2 <- c(1,1,2,3,4,5,1,2,3,4)
t$match3 <- c(1,2,3,1,2,3,4,5,1,2)
> t
Date V1 V2 match1 match2 match3
1: 2013-05-01 4 2 1 1 1
2: 2013-05-02 2 5 2 1 2
3: 2013-05-03 3 12 3 2 3
4: 2013-05-04 1 4 4 3 1
5: 2013-05-05 7 8 1 4 2
6: 2013-05-06 22 2 2 5 3
7: 2013-05-07 35 38 3 1 4
8: 2013-05-08 3 50 4 2 5
9: 2013-05-09 29 4 5 3 1
10: 2013-05-10 36 1 1 4 2
I think the OP has a few typos in it; as far as I understand, you want this:
t <- data.table(
Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5", "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
V1 = c(4, 2, 3, 1,7,22,35,52,29,36),
V2 = c(2, 5, 2, 4,8,47,38,50,4,1)
)
t[, inclusive.match.1 := 1:.N, by = cumsum(V1 == 1 | V2 == 1)]
t[, exclusive.match.1 := 1:.N, by = rev(cumsum(rev(V1 == 1 | V2 == 1)))]
t
# Date V1 V2 inclusive.match.1 exclusive.match.1
# 1: 2013-05-01 4 2 1 1
# 2: 2013-05-02 2 5 2 2
# 3: 2013-05-03 3 2 3 3
# 4: 2013-05-04 1 4 1 4
# 5: 2013-05-05 7 8 2 1
# 6: 2013-05-06 22 47 3 2
# 7: 2013-05-07 35 38 4 3
# 8: 2013-05-08 52 50 5 4
# 9: 2013-05-09 29 4 6 5
#10: 2013-05-10 36 1 1 6
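The cumsum() trick generalizes to any target value and any set of measurement columns: every match increments the cumulative sum, starting a new group in which 1:.N restarts the counter. A hedged sketch (count_since is an illustrative helper, not a data.table function), reproducing the inclusive variant:
count_since <- function(DT, n, cols = c("V1", "V2")) {
  # TRUE on any row where the value n appears in one of the measurement columns
  hit <- Reduce(`|`, lapply(cols, function(cl) DT[[cl]] == n))
  # each match bumps cumsum(hit), so the row counter restarts at the match
  DT[, (paste0("match.", n)) := seq_len(.N), by = cumsum(hit)][]
}
count_since(t, 1)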
