Count number of rows meeting criteria in another table - R Programming

I have two tables, one with property listings and another with contacts made for a property (i.e. if someone is interested in the property, they will "contact" the owner).
Sample "listings" table below:
listings <- data.frame(id = c("6174", "2175", "9176", "4176", "9177"), city = c("A", "B", "B", "B" ,"A"), listing_date = c("01/03/2015", "14/03/2015", "30/03/2015", "07/04/2015", "18/04/2015"))
listings$listing_date <- as.Date(listings$listing_date, "%d/%m/%Y")
listings
# id city listing_date
#1 6174 A 01/03/2015
#2 2175 B 14/03/2015
#3 9176 B 30/03/2015
#4 4176 B 07/04/2015
#5 9177 A 18/04/2015
Sample "contacts" table below:
contacts <- data.frame(id = c("6174", "6174", "6174", "6174", "2175", "2175", "2175", "9176", "9176", "4176", "4176", "9177"), contact_date = c("13/03/2015", "14/04/2015", "27/03/2015", "13/04/2015", "15/03/2015", "16/03/2015", "17/03/2015", "30/03/2015", "01/06/2015", "08/05/2015", "09/05/2015", "23/04/2015"))
contacts$contact_date <- as.Date(contacts$contact_date, "%d/%m/%Y")
contacts
# id contact_date
#1 6174 2015-03-13
#2 6174 2015-04-14
#3 6174 2015-03-27
#4 6174 2015-04-13
#5 2175 2015-03-15
#6 2175 2015-03-16
#7 2175 2015-03-17
#8 9176 2015-03-30
#9 9176 2015-06-01
#10 4176 2015-05-08
#11 4176 2015-05-09
#12 9177 2015-04-23
Problem
1. I need to count the number of contacts made for a property within 'x' days of listing. The output should be a new column added to "listings" with # contacts:
Sample ('x' = 30 days)
listings
# id city listing_date ngs
#1 6174 A 2015-03-01 2
#2 2175 B 2015-03-14 3
#3 9176 B 2015-03-30 1
#4 4176 B 2015-04-07 0
#5 9177 A 2015-04-18 1
I have done this with a for loop; it is horribly slow on live data:
n <- nrow(listings)
mat <- vector("integer", n)
for (i in 1:n) {
  mat[i] <- nrow(contacts[contacts$id == listings[i, "id"] &
                            as.numeric(contacts$contact_date - listings[i, "listing_date"]) <= 30, ])
}
listings$ngs <- mat
2. I need to prepare a histogram of # contacts vs. days, with 'x' as a variable, through the manipulate function. I can't figure out a way to do all this inside manipulate.
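For the second part, a minimal sketch with the manipulate package could look like the following (an assumption on my part: manipulate only runs inside RStudio). It reuses the vectorized match()/tapply() counting idea from the answers below, so the slider stays responsive:
library(manipulate)  # RStudio-only interactive controls

# Count contacts per listing made within `x` days of listing (vectorized)
count_contacts <- function(x) {
  idx  <- match(contacts$id, listings$id)
  days <- as.numeric(contacts$contact_date - listings$listing_date[idx])
  tapply(days <= x, contacts$id, sum)
}

# Histogram of contact counts, with the day window `x` on a slider
manipulate(
  hist(count_contacts(x),
       main = paste("Contacts within", x, "days of listing"),
       xlab = "# contacts"),
  x = slider(1, 90, initial = 30)
)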

Here's a possible solution using data.table rolling joins
library(data.table)
# key `listings` by the proper columns in order to perform the binary join
setkey(setDT(listings), id, listing_date)
# Perform a binary rolling join, extracting the matched indices and counting them
indx <- data.table(listings[contacts, roll = 30, which = TRUE])[, .N, by = V1]
# Join back to `listings` by row index, assigning the counts by reference
listings[indx$V1, ngs := indx$N]
# id city listing_date ngs
# 1: 2175 B 2015-03-14 3
# 2: 4176 B 2015-04-07 NA
# 3: 6174 A 2015-03-01 2
# 4: 9176 B 2015-03-30 1
# 5: 9177 A 2015-04-18 1
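Note that listing 4176 gets NA rather than 0, because none of its contacts fell within the window. If you prefer zeros, a small follow-up by reference (my addition, not part of the original answer):
# replace the NA counts with zero, by reference
listings[is.na(ngs), ngs := 0L]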

I'm not sure if your actual id values are factors, but I'll start by making them numeric. Using them as factors will cause you problems:
listings$id <- as.numeric(as.character(listings$id))
contacts$id <- as.numeric(as.character(contacts$id))
Then, the strategy is to calculate the "days since listing" value for each contact and add this to your contacts data.frame. Then, aggregate this new data.frame (in your example, sum of contacts within 30 days), and then merge the resulting count back into your original data.
contacts$ngs <- contacts$contact_date - listings$listing_date[match(contacts$id, listings$id)]
a <- aggregate(ngs ~ id, data = contacts, FUN = function(x) sum(x <= 30))
merge(listings, a)
# id city listing_date ngs
# 1 2175 B 2015-03-14 3
# 2 4176 B 2015-04-07 0
# 3 6174 A 2015-03-01 2
# 4 9176 B 2015-03-30 1
# 5 9177 A 2015-04-18 1
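One hedge on this approach: merge() defaults to an inner join, so a listing whose id never appears in contacts at all would be dropped. Keeping every listing could look like this (my addition; `res` is a name I'm introducing):
# keep all listings, then turn the NA counts into zeros
res <- merge(listings, a, all.x = TRUE)
res$ngs[is.na(res$ngs)] <- 0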

Or:
indx <- match(contacts$id, listings$id)
days_since <- contacts$contact_date - listings$listing_date[indx]
n <- with(contacts[days_since <= 30, ], tapply(id, id, length))
n[is.na(n)] <- 0
listings$n <- n[match(listings$id, names(n))]
It's similar to Thomas' answer but utilizes tapply and match instead of aggregate and merge.

You could use the dplyr package. First merge the data:
all.data <- merge(contacts,listings,by = "id")
Set a target number of days:
number.of.days <- 30
Then gather the data by ID (group_by), exclude the results that are not within the time frame (filter) and count the number of occurrences/rows (summarise).
result <- all.data %>%
  group_by(id) %>%
  filter(contact_date < listing_date + number.of.days) %>%
  summarise(count.of.contacts = length(id))
I think there are a number of ways this could be solved, but I have found dplyr to be very helpful in a lot of circumstances.
EDIT:
Sorry, I should have thought about that a little more. Does this work:
result <- all.data %>%
  group_by(id, city, listing_date) %>%
  summarise(ngs = length(id[which(contact_date < listing_date + number.of.days)]))
I don't think zero results can be passed sensibly through the filter stage (understandably, since the goal there is usually the opposite), so grouping by all the listing columns and counting inside summarise keeps the zero counts. I'm not sure what impact the which() call has on processing time; it is likely slower than filter, but it may not matter. One caveat: a listing with no contacts at all would already have dropped out of the inner merge above, so merge(..., all.x = TRUE) would be needed to keep it.

Using dplyr for your first problem:
left_join(contacts, listings, by = c("id" = "id")) %>%
  filter(abs(listing_date - contact_date) < 30) %>%
  group_by(id) %>%
  summarise(cnt = n()) %>%
  right_join(listings)
And the output is:
id cnt city listing_date
1 6174 2 A 2015-03-01
2 2175 3 B 2015-03-14
3 9176 1 B 2015-03-30
4 4176 NA B 2015-04-07
5 9177 1 A 2015-04-18
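If you would rather see 0 than NA for the unmatched listing, a small follow-up (my sketch; it assumes the pipeline result was stored in a variable, here called `res`):
# base R: replace NA counts with zero
res$cnt[is.na(res$cnt)] <- 0
# or inside the pipe, with dplyr:
# ... %>% mutate(cnt = coalesce(cnt, 0L))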
I'm not sure I understand your second question to answer it.

Related

Converting an abbreviation into a full word

I am trying to avoid writing a long nested ifelse statement in Excel.
I am working on two datasets, one where I have abbreviations and county names.
Abbre
COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the CountyCode (abbreviation) into the full spelled-out text and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something and any insight would be greatly appreciated.
We can use a join
library(dplyr)
library(tidyr)
df2 <- df2 %>%
  left_join(Abbre %>%
              separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
            by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
                  col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match
df2$COUNTY_NAME <- with(
  df1,
  COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21
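In all three outputs, WM and SJ come back as NA simply because those codes are absent from the lookup table. A quick sanity check (my addition) makes that visible:
# which county codes have no entry in the lookup table?
setdiff(df2$CountyCode, df1$Abbre)
#> [1] "WM" "SJ"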

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1 %>%
  group_by(id) %>%
  arrange(id, test_date) %>%
  filter(n() >= 2) # to only get info on students who have taken the exam more than once,
                   # and then merge it back in with the original data set using a join
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried mutate in order to find cases where the date differences are either positive or negative, by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this mixes up students with different ids. I tried split, but mutate won't work on a list of data frames, so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
  melt(id.vars = "id", measure.vars = c("test_date", "result_date"),
       value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous row
data1mc <- data1m %>%
  arrange(id, event_date) %>%
  group_by(id) %>%
  mutate(multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
  filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
  left_join(data1mc %>% select(id, event_date, multi_test),
            by = c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
  group_by(id) %>%
  summarise(retest = test_date[length(test_date)] < result_date[1])
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did it by taking advantage of the structure of your data (sorted by id) and the lag function, which can refer to the previous record while dealing with the current one.
### Ensure data are sorted by ID ###
data1 <- arrange(data1, id)
### Create flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id, 1, 0)
### I chose to do this on all data; you could filter on the repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where the repeater flag is 1 and there is a negative time lag ###
data1[data1$repeater == 1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0, ]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9),
                    test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24",
                                  "2012-07-22","2013-09-16","2012-06-21","2013-10-18",
                                  "2013-04-21","2012-02-16","2012-03-15"),
                    result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25",
                                    "2012-09-01","2013-10-20","2012-07-01","2013-10-31",
                                    "2013-05-17","2012-03-17","2012-04-20"))
data1$re_test <- unlist(lapply(split(data1, data1$id), function(x)
  ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving the NAs, but if you really want all the others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.

How to calculate the number of groups using R?

It could be a very easy question: I have a data.table with more than 1000 rows, two of whose columns can be set as the key. I want to calculate the number of groups in this dataset.
For example, the sample data is (ID and Act are the key):
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is code to generate the test data:
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                 ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                               "2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01", "2015-03-01"),
                 Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
                 Volume = c(21,22,12,11,5,7,4,2,6))
In this case, we can see that there are a total of 4 groups.
I tried to set the key for this table first:
setkey(dd, ID, Act)
Then I thought the count function could be used to count the groups.
Is it right to use count, or is there a simpler method?
Thanks a lot !
nrow(dd[, .(cnt = .N), by = c("ID", "Act")])
# or using base R
{t <- table(interaction(dd$ID, dd$Act)); length(t[t > 0])}
# or for the counts:
dd[, .(cnt = .N), by = c("ID", "Act")]
ID Act cnt
1: 1 EUR 2
2: 1 MAD 2
3: 2 EUR 2
4: 3 EUR 3
The fastest way should be uniqueN.
library(data.table)
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                 ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                               "2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01", "2015-03-01"),
                 Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
                 Volume = c(21,22,12,11,5,7,4,2,6))
uniqueN(dd, by = c("ID", "Act"))
#[1] 4
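For comparison, a dplyr equivalent of the same group count (my addition, not from the original answers):
library(dplyr)
# count distinct (ID, Act) combinations
dd %>% distinct(ID, Act) %>% nrow()
#> [1] 4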

Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify, each have an ID, a date, and the score on a test. In each data frame a person (ID) may have taken the test on multiple dates. Between the two data frames, some people are listed in df1 but not in df2, and vice versa, but some are listed in both, and their dates can overlap in different ways.
I want to combine all the data into one frame, but the tricky part is: if an ID has scores from df1 and df2 within 7 days of each other (I can do this with a subtracted-dates column), I want to combine those into one row.
In essence, for every ID there will be one row with both scores if the tests were taken within 7 days of each other; if not, there will be two separate rows, one with the score from df1 and one from df2, along with all the other scores that are not listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using RDBMS sometimes...) so here is a hacky workaround. It assumes that all the joins will be at most one to one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1,Date2=NA, Score2=NA)
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)) {
  # see if there are any rows of test1 you could join test2 to
  join_rows <- which(df3[, "ID"] == df2[j, "ID"] & abs(df3[, "Date1"] - df2[j, "Date2"]) < 7)
  # if so, join it to the first one (see discussion)
  if (length(join_rows) > 0) {
    df3[min(join_rows), "Date2"] <- df2[j, "Date2"]
    df3[min(join_rows), "Score2"] <- df2[j, "Score2"]
  } else {
    # if not, add a new row of just the test2
    df3 <- rbind.fill(df3, df2[j, ])
  }
}
df3 <- df3[order(df3$ID, df3$Date1, df3$Date2), ]
row.names(df3) <- NULL # I hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
Use an outer join with an absolute value limit on the date difference. (A outer join B keeps all rows of A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true Dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d"), and similarly for df2$Date2.
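One caveat worth adding (my note; presumably why the other answer calls this the "bogus outer join answer"): sqldf uses SQLite by default, and SQLite before version 3.39 supports neither FULL OUTER JOIN nor a bare `outer join`, which is a syntax error. A LEFT JOIN variant at least runs, though it preserves only df1's unmatched rows, not df2's:
library(sqldf)
# LEFT JOIN keeps every row of df1; df2-only rows would still need to be appended
sqldf("select a.*, b.*
       from df1 a
       left join df2 b
         on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7")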

R finding date intervals by ID

I have the following table, which comprises these key columns: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG format, in that you get multiple line items for one Customer ID.
I can get the first and last dates using R DateDiff, but after converting the file to WIDE format using plyr I still end up with the same problem of multiple orders per customer, just fewer rows and more columns.
Is there an R function that extends DateDiff to work out the time interval between purchases by Customer ID? That is, the time between orders 1 and 2, orders 2 and 3, and so on, assuming those orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do since you don't give the expected result, but I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
          diff = c(0, diff(sort(as.Date(Order.Date, '%d/%m/%y'))))), CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID = as.factor(c(rep("A", 3), rep("B", 4))),
                 OrderDate = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03",
                                       "2013-06-01", "2013-06-02", "2013-06-03",
                                       "2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs, function(x){
  tmp <- diff(x$OrderDate)
  tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
+   mutate(SinceLast = (interval(ymd(lag(OrderDate)), ymd(OrderDate))) / 86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
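A base R equivalent of that per-customer gap, for comparison (my sketch, using the same df as above):
# days since the previous order within each customer; NA for the first order
df$SinceLast <- ave(as.numeric(df$OrderDate), df$customerID,
                    FUN = function(d) c(NA, diff(d)))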
