How to calculate the number of groups using R?

This could be a very easy question. I have a data.table with more than 1000 rows, two columns of which could be set as the key. I want to calculate the number of groups in this dataset.
For example, the simple data is (ID and Act are the key):
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is code to generate the test data:
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01","2015-02-01", "2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
Volume=c(21,22,12,11,5,7,4,2,6))
In this case, we can see that there are a total of 4 groups.
I first tried to set the key for this table,
setkey(dd, ID, Act)
Then I thought a count function could be used to count the groups.
Is it right to use a count function, or is there a simpler method?
Thanks a lot!

nrow(dd[, .(cnt= sum(.N)), by= c("ID", "Act")])
# or using base R
{t <- table(interaction(dd$ID, dd$Act)); length(t[t>0])}
# or for the counts:
dd[, .(cnt= sum(.N)), by= c("ID", "Act")]
ID Act cnt
1: 1 EUR 2
2: 1 MAD 2
3: 2 EUR 2
4: 3 EUR 3

The fastest way should be uniqueN.
library(data.table)
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01","2015-02-01", "2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
Volume=c(21,22,12,11,5,7,4,2,6))
uniqueN(dd, by = c("ID", "Act"))
#[1] 4
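For a plain data.frame without data.table, the same count can be sketched in base R by dropping duplicate key pairs. This is only a minimal sketch; the names df and n_groups are chosen here for illustration and are not part of the answers above.

```r
# Same test data as above, as a plain data.frame (no data.table needed)
df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Act    = c("EUR", "EUR", "MAD", "MAD", "EUR", "EUR", "EUR", "EUR", "EUR"),
  Volume = c(21, 22, 12, 11, 5, 7, 4, 2, 6)
)

# Keep one row per distinct (ID, Act) pair and count the remaining rows
n_groups <- nrow(unique(df[, c("ID", "Act")]))
n_groups
```

On large tables uniqueN is likely to be faster, since unique() on a data.frame has to build the deduplicated rows first.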

Related

How to evaluate zodiac sign based on date of birth in R?

So I have a date of birth vector in a data.frame. I want to determine, based on this date, the zodiac sign of each respondent.
I've seen this solution:
Checking if Date is Between two Dates in R
But, this approach would mean that I have to create 12 vectors times 2 for each zodiac sign (starting date and finishing date), to check if my date of birth falls between the two. Is there a more efficient way to do this?
So this is my data.frame:
data.frame(respondent = c(1,2,3,4,5), date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by="months") )
respondent date_of_birth
1 1 2011-12-30
2 2 2012-01-30
3 3 2012-03-01
4 4 2012-03-30
5 5 2012-04-30
and I want to get this:
respondent date_of_birth zodiac
1 1 2011-12-30 Capricorn
2 2 2012-01-30 Aquarius
3 3 2012-03-01 Pisces
4 4 2012-03-30 Aries
5 5 2012-04-30 Taurus
I think the *apply functions are made for just this kind of work. You could try to use lapply on your first data frame (more precisely, on its date_of_birth column) together with a data frame that indexes the zodiac signs by date, to produce a vector zodiac whose length equals the number of rows of your data frame.
That would work, and with a fully populated zodiac database it should be pretty easy. What I mean by this is that you need a database that holds the start and end dates for each year, because otherwise it is difficult to compare dates across New Year. Also, please check that the conditions are correct; I don't know anything about zodiac signs.
library(fuzzyjoin)
birth.days <- data.frame(respondent = c(1,2,3,4,5), date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by="months") )
zodiacs <- data.frame(Zodiac = c("Capricorn")
, Start.Date = as.Date("2011-12-22")
, End.Date = as.Date("2012-01-20"))
fuzzy_left_join(birth.days, zodiacs,
by = c("date_of_birth" = "Start.Date", "date_of_birth" = "End.Date"),
match_fun = list(`>=`, `<`))
respondent date_of_birth Zodiac Start.Date End.Date
1 1 2011-12-30 Capricorn 2011-12-22 2012-01-20
2 2 2012-01-30 <NA> <NA> <NA>
3 3 2012-03-01 <NA> <NA> <NA>
4 4 2012-03-30 <NA> <NA> <NA>
5 5 2012-04-30 <NA> <NA> <NA>
Just as an example on how to populate a database with the dates:
Capricorn <- data.frame( Start.Date = seq.Date(from= as.Date("1900-12-22"), to = as.Date("2100-01-01"), by = "year")
, End.Date = seq.Date(from= as.Date("1901-01-20"), to = as.Date("2100-01-20"), by = "year")
, Zodiac = rep("Capricorn", 200 )
)
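If a per-year database feels heavy, one possible alternative is to ignore the year altogether: turn each date into a month-day number (for example 1230 for 30 December) and look it up in a vector of sign start days with findInterval. The sketch below is an assumption-laden illustration, not part of the answers above: the function name zodiac_sign is made up, and the start days are one common convention that differs by a day between sources.

```r
zodiac_sign <- function(dates) {
  # Month-day as an integer, e.g. 2011-12-30 -> 1230, 2012-01-30 -> 130
  md <- as.integer(format(dates, "%m%d"))

  # Assumed start day of each sign (month * 100 + day); check against your source
  starts <- c(120, 219, 321, 420, 521, 621, 723, 823, 923, 1023, 1122, 1222)
  signs  <- c("Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer",
              "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn")

  idx <- findInterval(md, starts)
  # Dates before 20 January fall before the first start day: still Capricorn
  idx[which(idx == 0L)] <- 12L
  signs[idx]
}

birth.days <- data.frame(
  respondent    = 1:5,
  date_of_birth = as.Date(c("2011-12-30", "2012-01-30", "2012-03-01",
                            "2012-03-30", "2012-04-30"))
)
birth.days$zodiac <- zodiac_sign(birth.days$date_of_birth)
birth.days
```

Because only month and day are used, the wrap-around at New Year is handled by the single idx == 0 case rather than by one row per year.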

Finding newest data older than a specific date in R

I have two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. These individuals all have an identification number (a variable we can call "individual") that occurs in both of the data.frames, and each frame has information on when the data was collected, stored in a column that we can call "some.date".
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way to get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have code that solves the problem, but it is too slow for the data that I have in mind (which has well over 20 000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ] # We only look at the individual in question.
subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]# Here we get all the rows that have data that are measured BEFORE timepoint.
row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"]) # This can be done, since we have already made sure that fromdatum < timepoint.
ifelse(length(row.index)!= 0, as.integer(rownames(subsetted.dataset[row.index,])), NA) # Then we output the row that had that information.
}
output <- matrix(ncol=2, nrow=0)
for(i in 1:nrow(dataset.new)){
output <- rbind(output, cbind(dataset.new[, "individual"][i], find.previous(dataset.old, dataset.new[, "individual"][i], dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried using my Google skills as well as reading other posts here on Stack Overflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.merge
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.merge
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.merge
individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
8 7 2016-03-03 NA NA
Edit:
I'm partial to data.table. The code would look something like:
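library(data.table)
setDT(dataset.old); setDT(dataset.new) # convert both data.frames by reference
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
# left join that keeps every individual in dataset.new
dataset.merge <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE)
# keep only strictly older observations (and individuals with no history at all)
dataset.merge <- dataset.merge[is.na(old.date) | new.date > old.date]
# newest remaining observation per individual
dataset.merge[, .SD[is.na(old.date) | old.date == max(old.date)], by = individual]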
We can skip the NA search by finding the minimum square root. The square root of a negative difference is NaN (with a warning), and which.min() ignores it for us:
library(dplyr)
dataset.old$rn <- 1:nrow(dataset.old)
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by="individual", all.x=TRUE)
mrg %>% group_by(individual) %>%
summarise(row.nr=rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA
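The merge idea can also be sketched in base R only. The version below is a hedged illustration rather than any of the answerers' code: the helper names older and sort_key are invented here, and rows that are not strictly older are pushed to the end of each individual's block instead of being dropped, so an individual whose history is entirely newer still gets an NA row.

```r
dataset.new <- data.frame(
  individual = c(1, 2, 7),
  some.date  = as.Date(c("2016-05-01", "2016-01-28", "2016-03-03"))
)
dataset.old <- data.frame(
  individual = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  some.date  = as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02",
                         "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30",
                         "2016-01-27", "2016-03-01", "2011-01-16"))
)
dataset.old$row.nr <- seq_len(nrow(dataset.old))

# Left join: every individual in dataset.new is kept, matched or not
m <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE,
           suffixes = c(".new", ".old"))

# TRUE only for history rows that are strictly older than the new observation
older <- !is.na(m$some.date.old) & m$some.date.old < m$some.date.new
m$row.nr[!older] <- NA

# Sort newest-first within each individual; non-qualifying rows sort last
sort_key <- ifelse(older, as.numeric(m$some.date.old), -Inf)
m <- m[order(m$individual, -sort_key), ]

# The first row per individual is now the answer (or NA if nothing qualifies)
result <- m[!duplicated(m$individual), c("individual", "row.nr")]
rownames(result) <- NULL
result
```

The whole computation is one merge and one sort, which should scale much better than subsetting dataset.old once per individual.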

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. For each product, I want a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15, and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15, and in this period a quantity of 1 is sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During this period no quantities are sold, so the vector must have a value of 0 for it. In the last period, finally, 2 products are sold. So basically the problem is that information about the periods where no products are sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be automated in some way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date time columns
dt$Date = dmy(dt$Date)
dt %>%
group_by(Product) %>%
do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>% # create all combinations between product and days
mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>% # create distance of each day with min date
ungroup() %>%
left_join(dt, by=c("Product"="Product","days"="Date")) %>% # join info to get quantities for each day
mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # replace NAs with 0s
id = floor(dist/5 + 1)) %>% # create the period id (5-day periods: days 0-4 -> id 1, days 5-9 -> id 2, ...)
group_by(Product, id) %>%
summarise(Sum = sum(Quantity),
min_date = min(days),
max_date = max(days)) %>%
ungroup
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
The first row of the output tells you that for product A, in the first period (id = 1), a total quantity of 4 was sold, and that the period runs from 3/12 to 7/12.
I would suggest {dplyr}'s summarise(), mutate() and group_by() functions. group_by() groups your data by the desired variables (in your case, product and delivery term), mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look:
convert date into proper format:
library(dplyr)
df=as_tibble(df) # tbl_df() is deprecated in current dplyr
df$Date=as.Date(df$Date,format="%d/%m/%y")
calculating delivery terms
df=group_by(df,Product) %>% arrange(Date)
df=mutate(df,term=1+unclass((Date-min(Date)))%/%4)
group by product and terms and calculate sum of quantity:
df=group_by(df,Product,term)
summarise(df,sum=sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")
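To get the zeros for empty delivery terms without expanding to one row per day, one option is to compute a term index per sale and fill a zero vector of the right length. The following is a minimal sketch under stated assumptions: term_totals and sales are hypothetical names, the term length is 5 days to match the 03/12 to 07/12 periods in the question's example, and the sample dates are chosen to reproduce the question's description of product A rather than copied from the table.

```r
term_totals <- function(quantity, date, term_days = 5) {
  # Term index of each sale: days 0-4 after the first sale -> 1, days 5-9 -> 2, ...
  term <- as.integer(difftime(date, min(date), units = "days")) %/% term_days + 1L

  # Start with zeros for every term, then fill in the terms that have sales
  totals <- numeric(max(term))
  sums <- tapply(quantity, term, sum)
  totals[as.integer(names(sums))] <- as.numeric(sums)
  totals
}

# Illustrative data (assumed, following the question's description of product A)
sales <- data.frame(
  Product  = c("A", "A", "A", "A", "A", "A", "C", "C", "C"),
  Quantity = c(1, 2, 1, 1, 1, 1, 1, 1, 1),
  Date     = as.Date(c("03/12/15", "04/12/15", "05/12/15", "08/12/15",
                       "18/12/15", "19/12/15", "01/06/15", "02/06/15",
                       "12/06/15"), format = "%d/%m/%y")
)

# One vector per product, including zeros for terms without sales
by_product <- lapply(split(sales, sales$Product),
                     function(d) term_totals(d$Quantity, d$Date))
by_product
```

The result is a named list, so by_product$A is the vector for product A; changing term_days switches to a different term length.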

Count number of rows meeting criteria in another table - R Programming

I have two tables, one with property listings and another one with contacts made for a property (i.e. if someone is interested in the property they will "contact" the owner).
Sample "listings" table below:
listings <- data.frame(id = c("6174", "2175", "9176", "4176", "9177"), city = c("A", "B", "B", "B" ,"A"), listing_date = c("01/03/2015", "14/03/2015", "30/03/2015", "07/04/2015", "18/04/2015"))
listings$listing_date <- as.Date(listings$listing_date, "%d/%m/%Y")
listings
# id city listing_date
#1 6174 A 2015-03-01
#2 2175 B 2015-03-14
#3 9176 B 2015-03-30
#4 4176 B 2015-04-07
#5 9177 A 2015-04-18
Sample "contacts" table below:
contacts <- data.frame (id = c ("6174", "6174", "6174", "6174", "2175", "2175", "2175", "9176", "9176", "4176", "4176", "9177"), contact_date = c("13/03/2015","14/04/2015", "27/03/2015", "13/04/2015", "15/03/2015", "16/03/2015", "17/03/2015", "30/03/2015", "01/06/2015", "08/05/2015", "09/05/2015", "23/04/2015" ))
contacts$contact_date <- as.Date(contacts$contact_date, "%d/%m/%Y")
contacts
# id contact_date
#1 6174 2015-03-13
#2 6174 2015-04-14
#3 6174 2015-03-27
#4 6174 2015-04-13
#5 2175 2015-03-15
#6 2175 2015-03-16
#7 2175 2015-03-17
#8 9176 2015-03-30
#9 9176 2015-06-01
#10 4176 2015-05-08
#11 4176 2015-05-09
#12 9177 2015-04-23
Problem
1. I need to count the number of contacts made for a property within 'x' days of listing. The output should be a new column added to "listings" with # contacts:
Sample ('x' = 30 days)
listings
# id city listing_date ngs
#1 6174 A 2015-03-01 2
#2 2175 B 2015-03-14 3
#3 9176 B 2015-03-30 1
#4 4176 B 2015-04-07 0
#5 9177 A 2015-04-18 1
I have done this with a for loop, but it is horribly slow on live data:
n <- nrow(listings)
mat <- vector ("integer", n)
for (i in 1:n) {
mat[i] <- nrow (contacts[contacts$id==listings[i,"id"] & as.numeric (contacts$contact_date - listings[i,"listing_date"]) <=30,])
}
listings$ngs <- mat
2. I need to prepare a histogram of # contacts vs days with 'x' as a variable, through the manipulate function. I can't figure out a way to do all this inside the manipulate function.
Here's a possible solution using data.table rolling joins
library(data.table)
# key `listings` by the proper columns in order to perform the binary join
setkey(setDT(listings), id, listing_date)
# Perform a binary rolling join while extracting the matched indices and counting them
indx <- data.table(listings[contacts, roll = 30, which = TRUE])[, .N, by = V1]
# Joining back to `listings` by proper rows while assigning the counts by reference
listings[indx$V1, ngs := indx$N]
# id city listing_date ngs
# 1: 2175 B 2015-03-14 3
# 2: 4176 B 2015-04-07 NA
# 3: 6174 A 2015-03-01 2
# 4: 9176 B 2015-03-30 1
# 5: 9177 A 2015-04-18 1
I'm not sure if your actual id values are factor, but I'll start by making those numeric. Using them as factors will cause you problems:
listings$id <- as.numeric(as.character(listings$id))
contacts$id <- as.numeric(as.character(contacts$id))
Then, the strategy is to calculate the "days since listing" value for each contact and add this to your contacts data.frame. Then, aggregate this new data.frame (in your example, sum of contacts within 30 days), and then merge the resulting count back into your original data.
contacts$ngs <- contacts$contact_date - listings$listing_date[match(contacts$id, listings$id)]
a <- aggregate(ngs ~ id, data = contacts, FUN = function(x) sum(x <= 30))
merge(listings, a)
# id city listing_date ngs
# 1 2175 B 2015-03-14 3
# 2 4176 B 2015-04-07 0
# 3 6174 A 2015-03-01 2
# 4 9176 B 2015-03-30 1
# 5 9177 A 2015-04-18 1
Or:
indx <- match(contacts$id, listings$id)
days_since <- contacts$contact_date - listings$listing_date[indx]
n <- with(contacts[days_since <= 30, ], tapply(id, id, length))
listings$n <- as.vector(n[match(as.character(listings$id), names(n))])
listings$n[is.na(listings$n)] <- 0 # listings with no contacts in the window get 0
It's similar to Thomas' answer but utilizes tapply and match instead of aggregate and merge.
You could use the dplyr package. First merge the data:
library(dplyr)
all.data <- merge(contacts,listings,by = "id")
Set a target number of days:
number.of.days <- 30
Then gather the data by ID (group_by), exclude the results that are not within the time frame (filter) and count the number of occurrences/rows (summarise).
result <- all.data %>% group_by(id) %>% filter(contact_date <= listing_date + number.of.days) %>% summarise(count.of.contacts = length(id))
I think there are a number of ways this could potentially be solved, but I have found dplyr to be very helpful in a lot of circumstances.
EDIT:
Sorry, I should have thought about that a little more. Does this work?
result <- all.data %>% group_by(id,city,listing_date) %>% summarise(ngs = length(id[which(contact_date < listing_date + number.of.days)]))
I don't think zero results can be passed sensibly through the filter stage (understandably, the goal is usually the opposite). I'm not too sure what sort of impact the 'which' component will have on processing time, likely to be slower than using the 'filter' function but might not matter.
Using dplyr for your first problem:
left_join(contacts, listings, by = c("id" = "id")) %>%
filter(abs(listing_date - contact_date) < 30) %>%
group_by(id) %>% summarise(cnt = n()) %>%
right_join(listings)
And the output is:
id cnt city listing_date
1 6174 2 A 2015-03-01
2 2175 3 B 2015-03-14
3 9176 1 B 2015-03-30
4 4176 NA B 2015-04-07
5 9177 1 A 2015-04-18
I'm not sure I understand your second question to answer it.
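For the second part of the question, where 'x' changes interactively, it may help to compute the days since listing once and keep the per-'x' work down to a cheap count. The sketch below is only one possible arrangement under stated assumptions: count_within is a hypothetical helper name, and contacts made before the listing date are excluded, which the original loop does not do.

```r
listings <- data.frame(
  id   = c("6174", "2175", "9176", "4176", "9177"),
  city = c("A", "B", "B", "B", "A"),
  listing_date = as.Date(c("01/03/2015", "14/03/2015", "30/03/2015",
                           "07/04/2015", "18/04/2015"), "%d/%m/%Y"),
  stringsAsFactors = FALSE
)
contacts <- data.frame(
  id = c("6174", "6174", "6174", "6174", "2175", "2175", "2175",
         "9176", "9176", "4176", "4176", "9177"),
  contact_date = as.Date(c("13/03/2015", "14/04/2015", "27/03/2015", "13/04/2015",
                           "15/03/2015", "16/03/2015", "17/03/2015", "30/03/2015",
                           "01/06/2015", "08/05/2015", "09/05/2015", "23/04/2015"),
                         "%d/%m/%Y"),
  stringsAsFactors = FALSE
)

# Computed once: days between each contact and the listing date of its property
days_since <- as.numeric(
  contacts$contact_date - listings$listing_date[match(contacts$id, listings$id)]
)

# Contacts per listing within x days, in the row order of `listings`
count_within <- function(x) {
  hit <- days_since >= 0 & days_since <= x
  as.vector(table(factor(contacts$id[hit], levels = listings$id)))
}

listings$ngs <- count_within(30)
listings
```

Inside a manipulate() call, something like hist(count_within(x)) with a slider for x would then only redo the counting step; that usage is an untested suggestion.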

R finding date intervals by ID

Having the following table which comprises some key columns which are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG Format, in that you will get multi line items for the 1 Customer ID.
I can get the first date last date using R DateDiff but converting the file to WIDE format using Plyr, still end up with the same problem of getting multiple orders by customer, just less rows and more columns.
Is there an R function that extends R DateDiff to work out how to get the time interval between purchases by Customer ID? That is, time between order 1 and 2, order 2 and 3, and so on assuming these orders exists.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result. But I guess you want the intervals between 2 orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
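The same "days since the previous order" column can also be sketched in base R with ave(), which applies a function within each customer and returns a vector aligned with the original rows. This is a minimal illustration that reuses the example df from the split answer above; sorting first is an assumption about what the asker wants when orders arrive out of order.

```r
df <- data.frame(
  customerID = c(rep("A", 3), rep("B", 4)),
  OrderDate  = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03", "2013-06-01",
                         "2013-06-02", "2013-06-03", "2013-07-01"))
)

# Make sure orders are in date order within each customer
df <- df[order(df$customerID, df$OrderDate), ]

# Days since the previous order; NA for each customer's first order
df$SinceLast <- ave(as.numeric(df$OrderDate), df$customerID,
                    FUN = function(d) c(NA, diff(d)))
df
```

Since as.numeric() on a Date gives a day count, the differences come out directly in days.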
