Generating recursive ID by multi-variate group using data.table in R - r

I've found several options on how to generate IDs by groups using the data.table package in R, but none of them fit my problem exactly. Hopefully someone can help.
In my problem, I have 160 markets that fall within 21 regions in a country. These markets are numbered 1:160 and there may be multiple observations documented within each market. I would like to restructure my market ID variable so that it represents unique markets within each region, and starts counting over again with each new region.
Here's some code to represent my problem:
require(data.table)
dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
market = c(1,1,2,2,3,3,4,4,5,6,7,7))
> dt
region market
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 3
6: 2 3
7: 2 4
8: 2 4
9: 3 5
10: 3 6
11: 3 7
12: 3 7
Currently, my data is set up to represent the result of
dt[, market_new := .GRP, by = .(region, market)]
But what I'd like to get is
region market market_new
1: 1 1 1
2: 1 1 1
3: 1 2 2
4: 1 2 2
5: 2 3 1
6: 2 3 1
7: 2 4 2
8: 2 4 2
9: 3 5 1
10: 3 6 2
11: 3 7 3
12: 3 7 3

This seems to return what you want
dt[, market_new:=as.numeric(factor(market)), by=region]
Here we split the data by region, give each market within a region a unique ID via the factor() function, and extract the underlying numeric index.

From data.table 1.9.5+, you can use frank() (or frankv()) with ties.method = "dense" as follows:
dt[, market_new := frankv(market, ties.method = "dense"), by = region]
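As a rough side-by-side check, the sketch below applies both suggested approaches to the example data, plus a third variant based on match() that is not from the answers above. The column names by_factor, by_frank and by_match are made up for illustration. Note that factor() and frank() number markets by sorted value within a region, while match() numbers them by order of first appearance; the two only coincide when markets are already sorted, as they are here.

```r
library(data.table)

dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
                 market = c(1,1,2,2,3,3,4,4,5,6,7,7))

# Integer codes of a per-region factor (sorted by market value)
dt[, by_factor := as.integer(factor(market)), by = region]

# Dense rank within each region (also sorted by market value)
dt[, by_frank := frank(market, ties.method = "dense"), by = region]

# Position of first appearance within each region (illustrative alternative)
dt[, by_match := match(market, unique(market)), by = region]

print(dt)
```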

Related

Conditional count in r data.table with two grouping variables

I have a data.table in which I have records belonging to multiple groupings. I want to count the number of records that fall into the same group for two variables, where the grouping variables may include some NAs.
Example data below:
library(data.table)
mydt <- data.table(id = c(1,2,3,4,5,6),
travel = c("no travel", "morocco", "algeria",
"morocco", "morocco", NA),
cluster = c(1,1,1,2,2,2))
> mydt
id travel cluster
1: 1 no travel 1
2: 2 morocco 1
3: 3 algeria 1
4: 4 morocco 2
5: 5 morocco 2
6: 6 <NA> 2
In the above example I want to calculate how many people travelled to each destination by cluster.
Initially I was doing this using the .N notation, as below:
mydt[, ndest1 := as.double(.N), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1
1: 1 no travel 1 1
2: 2 morocco 1 1
3: 3 algeria 1 1
4: 4 morocco 2 2
5: 5 morocco 2 2
6: 6 <NA> 2 1
However, NAs are counted as a value. This doesn't work well for my purposes: I later want to identify the destination within each cluster that the most people travelled to (morocco in cluster 2 above) using max(...), and if a cluster contains a lot of NAs, 'NA' will incorrectly be flagged as the most popular destination.
I then tried using sum() instead, as this is intuitive and also allows me to exclude NAs:
mydt[, ndest2 := sum(!is.na(travel)), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1 ndest2
1: 1 no travel 1 1 1
2: 2 morocco 1 1 1
3: 3 algeria 1 1 1
4: 4 morocco 2 2 1
5: 5 morocco 2 2 1
6: 6 <NA> 2 1 0
This gives incorrect results - after a bit of further testing, it seems to be because I have used the same variable for the logic statement within sum(...) as one of the grouping variables in the by statement.
When I use a different variable I get the desired result except that I am not able to exclude NAs this way:
mydt[, ndest3 := sum(!is.na(id)), by = c("cluster", "travel")]
> mydt
id travel cluster ndest1 ndest2 ndest3
1: 1 no travel 1 1 1 1
2: 2 morocco 1 1 1 1
3: 3 algeria 1 1 1 1
4: 4 morocco 2 2 1 2
5: 5 morocco 2 2 1 2
6: 6 <NA> 2 1 0 1
This leads me to two questions:
In a data.table conditional count, how do I exclude NAs?
Why can't the same variable be used in the sum logic statement and as a grouping variable after by?
Any insights would be much appreciated.
You can exclude NAs in i:
mydt[!is.na(travel), ndest1 := .N, by = .(travel, cluster)][]
# id travel cluster ndest1
#1: 1 no travel 1 1
#2: 2 morocco 1 1
#3: 3 algeria 1 1
#4: 4 morocco 2 2
#5: 5 morocco 2 2
#6: 6 <NA> 2 NA
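On the second question, a likely explanation is that data.table exposes each grouping column inside j as a single value per group, so sum(!is.na(travel)) can only ever be 0 or 1. The sketch below checks this on the example data; the column names len, n and ndest are made up for illustration.

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   travel = c("no travel", "morocco", "algeria",
                              "morocco", "morocco", NA),
                   cluster = c(1, 1, 1, 2, 2, 2))

# Inside j, the grouping column `travel` has length 1, whatever the group size
lens <- mydt[, .(len = length(travel), n = .N), by = .(cluster, travel)]
print(lens)

# Because `travel` is a scalar per group, it can be tested directly with if ()
mydt[, ndest := if (is.na(travel)) NA_integer_ else .N,
     by = .(cluster, travel)]
print(mydt)
```

This gives the same counts as filtering in i, with NA left in place for the rows whose destination is missing.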

Data.table selecting columns by name, e.g. using grepl

Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum the columns starting with an 'x', and sum the columns starting with a 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns with "x" or "y" by their column names, e.g. using grepl().
Please could you advise me how to do so? I think I need to use with=FALSE, but cannot get it to work in combination with by=desc?
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by @Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is another option: grep("^x", ..., value = TRUE) returns the names of the columns starting with x, mget fetches those columns, and unlist flattens them so that the sum can be calculated:
dt[, .(Sumx = sum(unlist(mget(grep("^x", names(dt), value = TRUE)))),
       Sumy = sum(unlist(mget(grep("^y", names(dt), value = TRUE))))), by = desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
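Another possible route, sketched here as an assumption rather than taken from the answers, is to collect the matching column names first, sum every selected column per group through .SDcols, and then add up the per-column sums with rowSums(). The names xcols, ycols, sums and out are made up for illustration.

```r
library(data.table)

dt <- data.table(x1 = 1:10, x2 = 1:10, y1 = 10:1, y2 = 10:1,
                 desc = c("a","a","a","b","b","b","b","b","c","c"))

# Column names selected by pattern
xcols <- grep("^x", names(dt), value = TRUE)
ycols <- grep("^y", names(dt), value = TRUE)

# One sum per selected column and group
sums <- dt[, lapply(.SD, sum), by = desc, .SDcols = c(xcols, ycols)]

# Combine the per-column sums into one total per prefix
out <- data.table(desc = sums$desc,
                  Sumx = rowSums(sums[, xcols, with = FALSE]),
                  Sumy = rowSums(sums[, ycols, with = FALSE]))
print(out)
```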

Create Time Based User Sessions in R

I have a dataset that consists of three columns (user, action and time) and is a log of user actions. The data looks like this:
user action time
1: 618663 34 1407160424
2: 617608 33 1407160425
3: 89514 34 1407160425
4: 71160 33 1407160425
5: 443464 32 1407160426
---
996: 146038 8 1407161349
997: 528997 9 1407161350
998: 804302 8 1407161351
999: 308922 8 1407161351
1000: 803763 8 1407161352
I want to separate sessions for each user based on action times. Actions done within a certain period (for example, one hour) are assumed to belong to one session.
The simple solution is to use a for loop and compare action times for each user, but that's not efficient and my data is very large.
Is there any method that I can use to overcome this problem?
I can group by user, but separating each user's actions into different sessions is somewhat difficult for me :-)
Try
library(data.table)
dt <- rbind(
data.table(user=1, action=1:10, time=c(1,5,10,11,15,20,22:25)),
data.table(user=2, action=1:5, time=c(1,3,10,11,12))
)
dt[, session := cumsum(c(TRUE, !(diff(time) <= 2))), by = user][]
# user action time session
# 1: 1 1 1 1
# 2: 1 2 5 2
# 3: 1 3 10 3
# 4: 1 4 11 3
# 5: 1 5 15 4
# 6: 1 6 20 5
# 7: 1 7 22 5
# 8: 1 8 23 5
# 9: 1 9 24 5
# 10: 1 10 25 5
# 11: 2 1 1 1
# 12: 2 2 3 1
# 13: 2 3 10 2
# 14: 2 4 11 2
# 15: 2 5 12 2
I used a difference of <=2 to collect sessions: a new session starts whenever the gap since the user's previous action is greater than 2.
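The same idea can be adapted to the one-hour rule from the question. The sketch below assumes that time holds Unix timestamps in seconds, so one hour is 3600, and that a session ends after an inactivity gap of more than one hour (not after a fixed one-hour window from the session start). The names log_dt and gap and the toy values are made up for illustration.

```r
library(data.table)

# Toy log: times are assumed to be seconds
log_dt <- data.table(user = c(1, 1, 1, 2, 2),
                     time = c(0, 1800, 9000, 100, 4000))

# diff() compares consecutive rows, so sort by user and time first
setorder(log_dt, user, time)

gap <- 3600  # assumed threshold: one hour in seconds

# Start a new session whenever the gap to the previous action exceeds `gap`
log_dt[, session := cumsum(c(TRUE, diff(time) > gap)), by = user]
print(log_dt)
```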

Splitting a data frame to create new columns

I have a data frame with columns for "Count","Transect Number","Data", and "Year". My goal is to split up the data frame by Transect, then again by Year, and create a new data frame with a column for "Transect", and then the appropriate data per Year in the following columns.
To build a dummy data frame:
Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year2<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year3<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016)
DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")
All<-rbind(DF1,DF2,DF3)
Once I have the data frame, my thought was to split up the data by transect since this will be a permanent aspect of my ongoing data set.
#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]
But I'm a little less clear on the next step. I need to pull out data from the "Data" column organized by year. Something along the lines of further breaking down the data like so:
Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]
or even using split
ByYear1<-split(Trans1,Trans1$Year)
But I would prefer to avoid writing out the code as above as I hope to add new data every year as this data set progresses. And I'd like the code to be able to accommodate new "Year" data as it is added, as opposed to writing out new lines of code every year.
Once I have the data set up like so, I'd like to create a second data frame with columns for each year. One problem is that each year contains a different number of rows, which has been an issue for me. But my final result would have columns:
"Transect", "Data 2014", "Data 2015", "Data 2016"
Since each year can have a different number of rows within a transect, I'd like to leave NAs at the end of each Transect section when the number of rows per individual transect differs between years.
It sounds like you are basically trying to convert your data into a semi-wide format, with columns for years, rather than keeping it in the "long" format.
If this is the case, you're better off adding a secondary index column that shows the repeated combination of "Transect" and "Year".
This can easily be done with getanID from my "splitstackshape" package. "splitstackshape" also loads "data.table", from which you could then use dcast.data.table to get a wide format.
library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")),
Transect + .id ~ Year, value.var = "Data")
# Transect .id 2014 2015 2016
# 1: 1 1 1 2 3
# 2: 1 2 2 1 4
# 3: 1 3 3 2 5
# 4: 1 4 1 2 4
# 5: 1 5 2 4 5
# 6: 1 6 3 3 6
# 7: 1 7 1 4 4
# 8: 1 8 2 5 4
# 9: 1 9 3 4 3
# 10: 1 10 NA NA 4
# 11: 2 1 2 3 4
# 12: 2 2 1 4 3
# 13: 2 3 2 3 2
# 14: 2 4 2 2 3
# 15: 2 5 1 3 2
# 16: 2 6 4 4 1
# 17: 2 7 4 3 4
# 18: 2 8 5 3 3
# 19: 2 9 4 2 2
# 20: 2 10 NA NA 3
# 21: 3 1 3 2 3
# 22: 3 2 4 1 3
# 23: 3 3 3 2 2
# 24: 3 4 3 3 5
# 25: 3 5 2 2 4
# 26: 3 6 1 3 3
# 27: 3 7 3 3 2
# 28: 3 8 3 4 4
# 29: 3 9 3 5 NA
# Transect .id 2014 2015 2016
Then, if you really want to split on the "Transect" column you can go ahead and use split, but since you now have a "data.table" it would be better to stick with that and take advantage of its many convenient features, including those related to subsetting and aggregation.
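The secondary index can also be built with data.table alone. The sketch below is one way to do it under that assumption, using a small made-up table (toy) in place of All and a made-up index column name (idx): number the rows within each Transect and Year combination, then cast to wide format.

```r
library(data.table)

# Small stand-in for the `All` data frame
toy <- data.table(Transect = c(1, 1, 1, 2, 2),
                  Year     = c(2014, 2014, 2015, 2014, 2015),
                  Data     = c(5, 6, 7, 8, 9))

# Running index within each Transect/Year combination
toy[, idx := seq_len(.N), by = .(Transect, Year)]

# One column per year; combinations with no observation are filled with NA
wide <- dcast(toy, Transect + idx ~ Year, value.var = "Data")
print(wide)
```

Because the year columns come from the data, adding a new year to the long table adds a new column without any change to the code.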
I think you are forcing your data into a format it does not have naturally. There are a lot of processing advantages to leaving it in "long" format. Have a look at this article if you have not seen it yet; it is a classic.
http://www.jstatsoft.org/v21/i12

update data.table subset with function

I have a data.table
dt2 <- data.table(urn=1:10,freq=0, freqband="")
dt2$freqband = NA
dt2$freq <- 1:7 #does give a warning message
## urn freq freqband
## 1: 1 1 NA
## 2: 2 2 NA
## 3: 3 3 NA
## 4: 4 4 NA
## 5: 5 5 NA
## 6: 6 6 NA
## 7: 7 7 NA
## 8: 8 1 NA
## 9: 9 2 NA
##10: 10 3 NA
I also have a function that I want to use to group my freq column
fn_GetFrequency <- function(numgifts) {
if (numgifts <5) return("<5")
if (numgifts >=5) return("5+")
return("ERROR")
}
I want to set the freqband column based on this function. In some cases it will be all records, in some cases it will be a subset. My current approach is (for a subset):
dt2[dt2$urn < 9, freqband := fn_GetFrequency(freq)]
using this approach I get the warning:
Warning message:
In if (numgifts < 5) return("<5") :
the condition has length > 1 and only the first element will be used
Then it sets all the records to have a value of "<5" rather than the correct value. I'm figuring that I need to use some sort of lapply/sapply/etc function; however, I still haven't been able to quite grasp how they work in order to use them to solve my problem.
Any help would be greatly appreciated.
EDIT: How might you do this if you use a function that requires 2 parameters?
UPDATED: to include the output of dt2 after my attempted update
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 <5
6: 6 6 <5
7: 7 7 <5
8: 8 1 <5
9: 9 2 NA
10: 10 3 NA
UPDATE: I tried this code too, and it worked to deliver the desired output, and it allows me to have a function I can call in other places of code too.
dt2[dt2$urn < 9, freqband := sapply(freq, fn_GetFrequency)]
> fn_GetFrequency <- function(numgifts) {
+ ifelse (numgifts <5, "<5", "5+")
+ }
> dt2[dt2$urn < 9, freqband := fn_GetFrequency(freq)]
> dt2
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 5+
6: 6 6 5+
7: 7 7 5+
8: 8 1 <5
9: 9 2 NA
10: 10 3 NA
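For the EDIT about a function that takes two parameters, one possible approach is mapply(), which calls a scalar function once per row with one element from each argument. The sketch below is an assumption-laden illustration: the cutoff column and the band() function are made up, and the function deliberately uses a scalar if/else to show that mapply() handles the element-wise calls.

```r
library(data.table)

# `cutoff` is a hypothetical second input column
dt2 <- data.table(urn = 1:10, freq = c(1:7, 1:3), cutoff = 5,
                  freqband = NA_character_)

# Scalar function of two arguments (illustrative)
band <- function(numgifts, cutoff) {
  if (numgifts < cutoff) paste0("<", cutoff) else paste0(cutoff, "+")
}

# mapply() pairs up freq[i] and cutoff[i] for the selected rows
dt2[urn < 9, freqband := mapply(band, freq, cutoff)]
print(dt2)
```

When the function can be written with vectorised operations such as ifelse(), calling it directly on the columns is usually simpler and faster than mapply().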
For multiple bands (which I'm sure has been asked before) you should use the findInterval function. I'm doing it the data.table way rather than the data frame way:
dt2[ urn==8, freq := -1 ] # and something to test the <0 condition
dt2[ urn <= 8, freqband := c("ERROR", "<5", "5+")[
findInterval(freq,c(-Inf, 0, 5 ,Inf))] ]
dt2
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 5+
6: 6 6 5+
7: 7 7 5+
8: 8 -1 ERROR
9: 9 2 NA
10: 10 3 NA
