What is the "data table" way of doing this join/merge? - r

I have a "dictionary" table like this:
dict <- data.table(
Nickname = c("Abby", "Ben", "Chris", "Dan", "Ed"),
Name = c("Abigail", "Benjamin", "Christopher", "Daniel", "Edward")
)
dict
# Nickname Name
# 1: Abby Abigail
# 2: Ben Benjamin
# 3: Chris Christopher
# 4: Dan Daniel
# 5: Ed Edward
And a "data" table like this:
dat <- data.table(
Friend1 = c("Abby", "Ben", "Ben", "Chris"),
Friend2 = c("Ben", "Ed", NA, "Ed"),
Friend3 = c("Ed", NA, NA, "Dan"),
Friend4 = c("Dan", NA, NA, NA)
)
dat
# Friend1 Friend2 Friend3 Friend4
# 1: Abby Ben Ed Dan
# 2: Ben Ed NA NA
# 3: Ben NA NA NA
# 4: Chris Ed Dan NA
I would like to produce a data.table that looks like this
result <- data.table(
Friend1.Nickname = c("Abby", "Ben", "Ben", "Chris"),
Friend1.Name = c("Abigail", "Benjamin", "Benjamin", "Christopher"),
Friend2.Nickname = c("Ben", "Ed", NA, "Ed"),
Friend2.Name = c("Benjamin", "Edward", NA, "Edward"),
Friend3.Nickname = c("Ed", NA, NA, "Dan"),
Friend3.Name = c("Edward", NA, NA, "Daniel"),
Friend4.Nickname = c("Dan", NA, NA, NA),
Friend4.Name = c("Daniel", NA, NA, NA)
)
result
# sorry, word wrapping makes this too annoying to copy
And this is the solution I had in mind:
friend_vars <- paste0("Friend", 1:4)
friend_nicks <- paste0(friend_vars, ".Nickname")
friend_names <- paste0(friend_vars, ".Name")
setnames(dat, friend_vars, friend_nicks)
for (i in 1:4) {
  dat[, friend_names[i] := dict$Name[match(dat[[friend_nicks[i]]], dict$Nickname)], with = FALSE]
}
Is there a more "data.table-esque" way to do this? I'm sure it's reasonably efficient, but it's ugly to read, and apart from data.table's in-place assignment I don't feel like I'm taking good advantage of what the package has to offer.
I'm also not a very strong SQL user, and I'm not too comfortable with join terminology. I have a feeling that Data.table - left outer join on multiple tables could be useful here but I'm not sure how to apply it to my situation.

Using data.table 1.9.5:
for (nm in names(dat)) {
  # build a named join spec, e.g. c(Friend1 = "Nickname"): match dat's column nm to dict's Nickname
  on <- setattr("Nickname", "names", nm)
  dat[dict, paste0(nm, ".Name") := i.Name, on = on]
}
We can join using on= instead of setting keys, and then use setcolorder() to reorder the columns.
I avoid reshaping data unless absolutely necessary. This is where updating while joining comes in really handy. And now with the on= argument, I couldn't resist posting an answer :-).
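For example, to interleave each nickname column with its new .Name column (a sketch, assuming the original Friend1..Friend4 column names):
nms <- paste0("Friend", 1:4)
setcolorder(dat, as.vector(rbind(nms, paste0(nms, ".Name"))))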

I didn't come up w/ a solution that matches exactly your result, but you might be able to work w/ something like this:
dat[, id := .I]
dat.m <- melt(dat, id.vars='id', variable.name='Friend', value.name='Nickname')
setkey(dict, Nickname)
dat.m[, Name := dict[Nickname, Name]]
> dat.m
id Friend Nickname Name
1: 1 Friend1 Abby Abigail
2: 2 Friend1 Ben Benjamin
3: 3 Friend1 Ben Benjamin
4: 4 Friend1 Chris Christopher
5: 1 Friend2 Ben Benjamin
6: 2 Friend2 Ed Edward
7: 3 Friend2 NA NA
8: 4 Friend2 Ed Edward
9: 1 Friend3 Ed Edward
10: 2 Friend3 NA NA
11: 3 Friend3 NA NA
12: 4 Friend3 Dan Daniel
13: 1 Friend4 Dan Daniel
14: 2 Friend4 NA NA
15: 3 Friend4 NA NA
16: 4 Friend4 NA NA
The variable id was just a placeholder so I could melt the DT.
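To reshape that back to a wide layout close to result, a dcast step could follow (a sketch, not part of the answer above; it assumes data.table >= 1.9.6, where dcast accepts multiple value.var columns, and the columns come out named like Nickname_Friend1 rather than Friend1.Nickname):
dcast(dat.m, id ~ Friend, value.var = c("Nickname", "Name"))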

setkey(dict, Nickname)
dat[, paste(names(dat), "Name", sep = ".") := lapply(.SD, function(x) dict[J(x)]$Name)]
setcolorder(dat, c(1, 5, 2, 6, 3, 7, 4, 8))
dat
# Friend1 Friend1.Name Friend2 Friend2.Name Friend3 Friend3.Name Friend4 Friend4.Name
# 1: Abby Abigail Ben Benjamin Ed Edward Dan Daniel
# 2: Ben Benjamin Ed Edward NA NA NA NA
# 3: Ben Benjamin NA NA NA NA NA NA
# 4: Chris Christopher Ed Edward Dan Daniel NA NA
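To match the column names in the desired result exactly, the remaining step (not part of the answer above) would be renaming the original columns, e.g.:
setnames(dat, paste0("Friend", 1:4), paste0("Friend", 1:4, ".Nickname"))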

In base R, super ugly:
cbind(dat, lapply(dat, function(x){dict$Name[match(x, dict$Nickname)]}))
Friend1 Friend2 Friend3 Friend4 V2 NA NA NA
1: Abby Ben Ed Dan Abigail Benjamin Edward Daniel
2: Ben Ed NA NA Benjamin Edward NA NA
3: Ben NA NA NA Benjamin NA NA NA
4: Chris Ed Dan NA Christopher Edward Daniel NA
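To give the new columns sensible names instead of the garbled ones above, the lapply() result could be named before binding (a sketch):
cbind(dat, setNames(lapply(dat, function(x) dict$Name[match(x, dict$Nickname)]),
                    paste0(names(dat), ".Name")))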

Related

Filter data.frame by date range in R

I have a DF like this:
Date <- c("10/17/17","11/11/17","11/23/17","11/25/17","12/3/17","12/10/17","12/16/17")
Ben <- c("1294",NA,"8959","2345",NA,"0303",NA)
James <- c(NA,"4523","3246",NA,"2394","8877","1427")
Alex <- c("3754","1122","5582",NA,"0094",NA,NA)
df1 <- data.frame(Date,Ben,James,Alex)
#df1
Date Ben James Alex
10/17/17 1294 NA 3754
11/11/17 NA 4523 1122
11/23/17 8959 3246 5582
11/25/17 2345 NA NA
12/3/17 NA 2394 0094
12/10/17 0303 8877 NA
12/16/17 NA 1427 NA
As you can see, the DF is sorted by date. I'm trying to put values that are within 2 weeks of the latest date for each column into a new DF, like this:
#df2
Ben James Alex
0303 1427 0094
NA 8877 5582
NA 2394 NA
Ben only has one listed value because there is only one non-NA value within 2 weeks of 12/10/17, the latest date that has a non-NA value in Ben's column. James's latest non-NA date is 12/16/17; he has three values that fall within two weeks of that date: 1427, 8877 and 2394. Alex's latest date is 12/3/17; he has two values within two weeks of that date: 0094 and 5582. The number of rows in the new data.frame should equal the length of the longest column, and columns with fewer entries within their respective two-week ranges should be padded with NA, like Ben's column.
I'm currently using the following code, which simply keeps the last 3 non-NA values in each column:
df2 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 3))
Using base R to subset:
df1$Date <- as.Date(df1$Date, "%m/%d/%y") # Date must be of class Date for the arithmetic below
lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 14)]) -> result
max(lengths(result)) -> len
do.call(cbind.data.frame, lapply(result, `length<-`, len))
Ben James Alex
1 <NA> 2394 5582
2 0303 8877 <NA>
3 <NA> 1427 0094
I just realized those are coded as characters according to the data you gave
To have it exactly as given in the expected results, we would have:
do.call(cbind.data.frame,lapply(result,function(x) `length<-`(rev(x),len)))
Ben James Alex
1 0303 1427 0094
2 <NA> 8877 <NA>
3 <NA> 2394 5582
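If numeric columns are preferred over character ones, a conversion step could follow (a sketch; note that the coercion drops the leading zero in values like "0303"):
res <- do.call(cbind.data.frame, lapply(result, function(x) `length<-`(rev(x), len)))
res[] <- lapply(res, function(x) as.numeric(as.character(x)))
res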
If I understood correctly what you are looking for, the following code should help.
I have loaded your dataset (via the dput function):
dataset <- structure(list(Date = structure(c(17456, 17481, 17493, 17495,
17499, 17510, 17516), class = "Date"), Ben = c(1294L, NA, 8959L,
2345L, NA, 303L, NA), James = c(NA, 4523L, 3246L, NA, NA, 8877L,
1427L), Alex = c(3754L, 1122L, 5582L, NA, 94L, NA, NA)), .Names = c("Date",
"Ben", "James", "Alex"), row.names = c(NA, -7L), class = "data.frame")
Then load the following packages:
library(lubridate)
library(tidyverse)
Set last_date (the Date column in the dput above is already of class Date, so it does not need converting):
last_date <- mdy("12/16/17")
Now, let's select only rows you want:
dataset_filtered <- dataset %>%
filter(Date<=last_date & Date>=(last_date-days(14)))
You'll have:
Date Ben James Alex
1 2017-12-10 303 8877 NA
2 2017-12-16 NA 1427 NA
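The filter above uses a single global window; to get the per-column windows the question asks for (each person's own latest non-NA date), a base-R variation along these lines might work (a sketch using the dput dataset above, where Date is already of class Date):
cols <- c("Ben", "James", "Alex")
vals <- lapply(setNames(cols, cols), function(cl) {
  latest <- max(dataset$Date[!is.na(dataset[[cl]])])            # latest date with a value for this person
  keep <- dataset$Date >= latest - 14 & dataset$Date <= latest  # two-week window ending at that date
  rev(dataset[[cl]][keep & !is.na(dataset[[cl]])])              # most recent value first
})
len <- max(lengths(vals))
as.data.frame(lapply(vals, `length<-`, len))                    # pad shorter columns with NA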
Please use the dput function next time; it's not always Christmas ;-)

Convert AsIs to numeric separated by comma in data frame

I have such data frame:
structure(list(P1 = c("Mark", "Katrin", "Kate", "Hank", "Tom",
"Marcus"), P2 = c("Tim", "Greg", "Seba", "Teqa", "Justine", "Monica"
), clique = structure(list(`930` = integer(0), `2090` = integer(0),
`3120` = c(2L, 3L, 231L), `3663` = integer(0), `3704` = integer(0),
`4156` = c(19L, 27L)), .Names = c("930", "2090", "3120",
"3663", "3704", "4156"), class = "AsIs")), .Names = c("P1", "P2",
"clique"), row.names = c(930L, 2090L, 3120L, 3663L, 3704L, 4156L
), class = "data.frame")
And I have a problem with the last column, called clique. I would like to convert this column either to comma-separated numeric values in a single column, or (the better option) to transform integer(0) into NA and put the numbers into separate columns, keeping one number in each column.
I will accept both solutions.
example data:
P1 P2 clique
Mark Tim integer(0)
Katrin Greg integer(0)
Kate Seba c(2, 3, 231)
Hank Teqa integer(0)
Tom Justine integer(0)
Marcus Monica c(19, 27)
> class(data$clique)
[1] "AsIs"
Desired output:
P1 P2 clique
Mark Tim NA
Katrin Greg NA
Kate Seba 2,3,231
Hank Teqa NA
Tom Justine NA
Marcus Monica 19,27
or
P1 P2 clique New_column1 New_column2
Mark Tim
Katrin Greg
Kate Seba 2 3 231
Hank Teqa
Tom Justine
Marcus Monica 19 27
You can try listCol_w from my "splitstackshape" package:
library(splitstackshape)
listCol_w(mydf, "clique")[, lapply(.SD, as.numeric), by = .(P1, P2)]
## P1 P2 clique_fl_1 clique_fl_2 clique_fl_3
## 1: Mark Tim NA NA NA
## 2: Katrin Greg NA NA NA
## 3: Kate Seba 2 3 231
## 4: Hank Teqa NA NA NA
## 5: Tom Justine NA NA NA
## 6: Marcus Monica 19 27 NA
I recommend this because you mentioned you wanted the numeric values. You won't be able to store a value like "2,3,231" as a numeric value.
If you still want to try the approach of collapsing the values and then splitting them, you can try:
mydf$clique <- vapply(mydf$clique, function(x) paste(x, collapse = ","), character(1L))
The str output would show that you now have a single character string per row instead of a list of integer vectors. You can then use cSplit on that to get the wide form.
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ P1 : chr "Mark" "Katrin" "Kate" "Hank" ...
$ P2 : chr "Tim" "Greg" "Seba" "Teqa" ...
$ clique: chr "" "" "2,3,231" "" ...
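From there, a possible follow-up (a sketch, not part of the answer above) is to split the collapsed strings into wide columns with cSplit, converting the empty strings to NA first:
library(splitstackshape)
mydf$clique[mydf$clique == ""] <- NA
cSplit(mydf, "clique", sep = ",")   # yields columns like clique_1, clique_2, clique_3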

Extracting event types from last 21 day window

My dataframe looks like this. The two rightmost columns are my desired columns.
**Name ActivityType ActivityDate Email(last 21 days) Webinar(last 21)**
John Email 1/1/2014 NA NA
John Webinar 1/5/2014 NA NA
John Sale 1/20/2014 Yes Yes
John Webinar 3/25/2014 NA NA
John Sale 4/1/2014 No Yes
John Sale 7/1/2014 No No
Tom Email 1/1/2015 NA NA
Tom Webinar 1/5/2015 NA NA
Tom Sale 1/20/2015 Yes Yes
Tom Webinar 3/25/2015 NA NA
Tom Sale 4/1/2015 No Yes
Tom Sale 7/1/2015 No No
I am just trying to create a yes/no variable that denotes whether there was an email or a webinar in the last 21 days for each "Sale" transaction. I was thinking (mock code) along the lines of using dplyr this way:
custlife %>%
group_by(Name) %>%
mutate(Email(last21days)=lag(ifelse(ActivityType = "Email" & ActivityDate of email within (activity date of sale - 21),Yes,No)).
I am not sure of the way to implement this. Kindly help. Your help is sincerely appreciated!
Here's a possible data.table solution. Here I'm creating 2 temporary data sets, one for Sale and one for the rest of the activity types, then joining between them with a rolling window of 21 while using by = .EACHI in order to check the conditions within each join. Finally, I'm joining the result back to the original data set.
Convert the date column to Date class and key the data by Name and Date (for the final/rolling join)
library(data.table)
setkey(setDT(df)[, ActivityDate := as.IDate(ActivityDate, "%m/%d/%Y")], Name, ActivityDate)
Create 2 temporary data sets per each activity
Saletemp <- df[ActivityType == "Sale", .(Name, ActivityDate)]
Elsetemp <- df[ActivityType != "Sale", .(Name, ActivityDate, ActivityType)]
Join by a rolling window of 21 to the sales temporary data set while checking conditions
# as.logical(which(...)) gives TRUE when there is at least one match in the window,
# and a zero-length result (which leaves NA) when there is none
Saletemp[Elsetemp, `:=`(Email21 = as.logical(which(i.ActivityType == "Email")),
                        Webinar21 = as.logical(which(i.ActivityType == "Webinar"))),
         roll = -21, by = .EACHI]
Join everything back
df[Saletemp, `:=`(Email21 = i.Email21, Webinar21 = i.Webinar21)]
df
# Name ActivityType ActivityDate Email21 Webinar21
# 1: John Email 2014-01-01 NA NA
# 2: John Webinar 2014-01-05 NA NA
# 3: John Sale 2014-01-20 TRUE TRUE
# 4: John Webinar 2014-03-25 NA NA
# 5: John Sale 2014-04-01 NA TRUE
# 6: John Sale 2014-07-01 NA NA
# 7: Tom Email 2015-01-01 NA NA
# 8: Tom Webinar 2015-01-05 NA NA
# 9: Tom Sale 2015-01-20 TRUE TRUE
# 10: Tom Webinar 2015-03-25 NA NA
# 11: Tom Sale 2015-04-01 NA TRUE
# 12: Tom Sale 2015-07-01 NA NA
Here is another option with base R:
df is first split by Name; then, within each subset, for each Sale it checks whether there is an Email (or Webinar) within the 21 days leading up to the Sale. Finally, the list is unsplit according to Name.
You just have to replace FALSE with "No" and TRUE with "Yes" afterwards.
df_split <- split(df, df$Name)
df_split <- lapply(df_split, function(tab) {
  i_s <- which(tab[, 2] == "Sale")
  tab$Email21[i_s] <- sapply(tab[i_s, 3], function(d_s) {
    d <- tab[tab$ActivityType == "Email", 3]
    any(d >= d_s - 21 & d <= d_s)      # an Email in the 21 days up to the sale
  })
  tab$Webinar21[i_s] <- sapply(tab[i_s, 3], function(d_s) {
    d <- tab[tab$ActivityType == "Webinar", 3]
    any(d >= d_s - 21 & d <= d_s)      # a Webinar in the 21 days up to the sale
  })
  tab
})
df_res <- unsplit(df_split, df$Name)
df_res
# Name ActivityType ActivityDate Email21 Webinar21
#1 John Email 2014-01-01 NA NA
#2 John Webinar 2014-01-05 NA NA
#3 John Sale 2014-01-20 TRUE TRUE
#4 John Webinar 2014-03-25 NA NA
#5 John Sale 2014-04-01 FALSE TRUE
#6 John Sale 2014-07-01 FALSE FALSE
#7 Tom Email 2015-01-01 NA NA
#8 Tom Webinar 2015-01-05 NA NA
#9 Tom Sale 2015-01-20 TRUE TRUE
#10 Tom Webinar 2015-03-25 NA NA
#11 Tom Sale 2015-04-01 FALSE TRUE
#12 Tom Sale 2015-07-01 FALSE FALSE
data
df <- structure(list(Name = c("John", "John", "John", "John", "John",
"John", "Tom", "Tom", "Tom", "Tom", "Tom", "Tom"), ActivityType = c("Email",
"Webinar", "Sale", "Webinar", "Sale", "Sale", "Email", "Webinar",
"Sale", "Webinar", "Sale", "Sale"), ActivityDate = structure(c(16071,
16075, 16090, 16154, 16161, 16252, 16436, 16440, 16455, 16519,
16526, 16617), class = "Date")), .Names = c("Name", "ActivityType",
"ActivityDate"), row.names = c(NA, -12L), index = structure(integer(0), ActivityType = c(1L,
7L, 3L, 5L, 6L, 9L, 11L, 12L, 2L, 4L, 8L, 10L)), class = "data.frame")
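For completeness, a runnable version of roughly what the OP's mock dplyr code was aiming at might look like this (a sketch using the df object from the data block above; it produces TRUE/FALSE/NA rather than Yes/No):
library(dplyr)
df %>%
  group_by(Name) %>%
  mutate(
    Email21 = if_else(ActivityType == "Sale",
                      sapply(ActivityDate, function(d)
                        any(ActivityType == "Email" &
                            ActivityDate >= d - 21 & ActivityDate <= d)),
                      NA),
    Webinar21 = if_else(ActivityType == "Sale",
                        sapply(ActivityDate, function(d)
                          any(ActivityType == "Webinar" &
                              ActivityDate >= d - 21 & ActivityDate <= d)),
                        NA)
  ) %>%
  ungroup()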

Find most common occurrence among a set of variables by key in data.table in R

I am trying to find the most common occurrence among a set of variables by a key in a data.table in R. Here is a small example of what I'm trying to do:
library(data.table)
mydata <- data.table(mergedName=c("JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE","JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE"),
job=c("teacher","teacher","teacher","teacher","teacher","teacher","police","police","police","police","police","police"),
from=c("NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG"),
misspelled_NYT=c("John Doe", NA, NA, "Mary White", NA, NA,"John_Doe", NA, NA, "Mary*White", NA, NA),
misspelled_USAT=c(NA, "JohnDOE", NA, NA, "Mary White", NA, NA, "John Doe", NA, NA, "Mary White", NA),
misspelled_BG=c(NA, NA, "John Doe", NA, NA, "Mary-White", NA, NA, "John Doe", NA, NA, "Mary White"))
setkeyv(mydata, cols=c("mergedName","job"))
Here's the data.table object:
> mydata
mergedName job from misspelled_NYT misspelled_USAT misspelled_BG
1: JOHNDOE teacher NYT John Doe NA NA
2: JOHNDOE teacher USAT NA JohnDOE NA
3: JOHNDOE teacher BG NA NA John Doe
4: MARYWHITE teacher NYT Mary White NA NA
5: MARYWHITE teacher USAT NA Mary White NA
6: MARYWHITE teacher BG NA NA Mary-White
7: JOHNDOE police NYT John_Doe NA NA
8: JOHNDOE police USAT NA John Doe NA
9: JOHNDOE police BG NA NA John Doe
10: MARYWHITE police NYT Mary*White NA NA
11: MARYWHITE police USAT NA Mary White NA
12: MARYWHITE police BG NA NA Mary White
Here's the output I'm expecting (the most common name spelling across each of the three sources for each keyed combination of mergedName & job):
mergedName job actualSpelling
1: JOHNDOE teacher John Doe
2: JOHNDOE teacher John Doe
3: JOHNDOE teacher John Doe
4: JOHNDOE police John Doe
5: JOHNDOE police John Doe
6: JOHNDOE police John Doe
7: MARYWHITE teacher Mary White
8: MARYWHITE teacher Mary White
9: MARYWHITE teacher Mary White
10: MARYWHITE police Mary White
11: MARYWHITE police Mary White
12: MARYWHITE police Mary White
I have been able to do this with data frames in wide form. Here is a small example of the code for doing it in wide form. NOTE: for some reason this seemingly works only on larger data frames; it does not work on the example below, even though the code is the same. The table() output applied across the rows of this data frame is different from what I'd expect:
mydataWide <- data.frame(mergedName=c("JOHNDOE","MARYWHITE","JOHNDOE","MARYWHITE"),
job=c("teacher","police","teacher","police"),
misspelled_NYT=c("John Doe", "Mary White", "John_Doe", "Mary*White"),
misspelled_USAT=c("JohnDOE", "Mary White", "John Doe", "Mary White"),
misspelled_BG=c("John Doe", "Mary-White", "John Doe", "Mary White"),
stringsAsFactors=FALSE)
nametable <- apply(mydataWide[,paste("misspelled", c("NYT","USAT","BG"), sep="_")], 1, function(x) sort(table(x), decreasing=TRUE))
mydataWide$actualSpelling <- names(sapply(nametable,`[`, 1) )
You could first melt mydata to long form, drop the NA rows (na.rm = TRUE), and then find the most frequent actualSpelling within each mergedName/job group using table and which.max, taking the name of the maximum as the spelling.
library(data.table)
melt(mydata, id.vars=c('mergedName', 'job'), measure.vars=4:6,
na.rm=TRUE, value.name='actualSpelling')[,
actualSpelling:= names(which.max(table(actualSpelling))),
by=list(mergedName, job)][order(mergedName), -3]
# mergedName job actualSpelling
#1: JOHNDOE police John Doe
#2: JOHNDOE teacher John Doe
#3: JOHNDOE police John Doe
#4: JOHNDOE teacher John Doe
#5: JOHNDOE police John Doe
#6: JOHNDOE teacher John Doe
#7: MARYWHITE police Mary White
#8: MARYWHITE teacher Mary White
#9: MARYWHITE police Mary White
#10: MARYWHITE teacher Mary White
#11: MARYWHITE police Mary White
#12: MARYWHITE teacher Mary White
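If a single row per mergedName/job combination is enough, the same idea can be written as a grouped aggregation instead of an update by reference (a sketch):
melt(mydata, id.vars = c("mergedName", "job"), measure.vars = 4:6,
     na.rm = TRUE, value.name = "spelling")[,
  .(actualSpelling = names(which.max(table(spelling)))),
  by = .(mergedName, job)]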

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
mutate(inc = c(F, diff(level) > 0)) %>%
summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try a data.table version:
library(data.table)
setDT(dat)[order(person), diff := c(NA, diff(level)), by = person][
  diff > 0, tail(.SD, 1), by = person][, -c(3, 4), with = FALSE]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If the NA rows (people with no increase) also need to be included:
dd <- setDT(dat)[order(person), diff := c(NA, diff(level)), by = person][
  diff > 0, tail(.SD, 1), by = person][, -c(3, 4), with = FALSE]
dd2 <- data.frame(person = unique(dat[!(person %in% dd$person)]$person), date = NA)
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
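A more compact variant of the same idea, keeping the no-increase case as NA in a single grouped pass (a sketch; it assumes dat is already sorted by person and date, as in the question):
library(data.table)
setDT(dat)[, .(date = {
  up <- c(FALSE, diff(level) > 0)                      # TRUE where level increased vs. the previous row
  if (any(up)) last(date[up]) else date[NA_integer_]   # NA of the right class when there is no increase
}), by = person]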
A base-R version, using a data frame df with columns Person, Date and Level (Person stored as a factor):
sapply(levels(df$Person), function(p) {
  s <- df[df$Person == p, ]
  i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
  ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
  data.frame(Date = sapply(levels(df$Person), function(p) {
    s <- df[df$Person == p, ]
    i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
    ifelse(is.na(i), NA, as.character(s$Date[i]))
  }))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
