I'm working with a data table in R containing quarterly information on the products being sold in grocery stores across the United States. In particular, there is a column for the date, a column for the store, and a column for the product. For example, here is a (very small) subset of the data:
Date StoreID ProductID
2000-03-31 10001 20001
2000-03-31 10001 20002
2000-03-31 10002 20001
2000-06-30 10001 20001
For each product in each store, I want to find out for how many consecutive quarters the product has been sold in that store up until that date. For example, if we restrict to just looking at staplers that sold in a specific store, we would have:
Date StoreID ProductID
2000-03-31 10001 20001
2000-06-30 10001 20001
2000-09-30 10001 20001
2000-12-31 10001 20001
2001-06-30 10001 20001
2001-09-30 10001 20001
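2001-12-31 10001 20001
2002-03-31 10001 20001
2002-06-30 10001 20001
2002-09-30 10001 20001
2002-12-31 10001 20001
2004-03-31 10001 20001
2004-06-30 10001 20001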
Assuming that's all the data for that combination of StoreID and ProductID, I want to assign a new variable as:
Date StoreID ProductID V
2000-03-31 10001 20001 1
2000-06-30 10001 20001 2
2000-09-30 10001 20001 3
2000-12-31 10001 20001 4
2001-06-30 10001 20001 1
2001-09-30 10001 20001 2
2001-12-31 10001 20001 3
2002-03-31 10001 20001 4
2002-06-30 10001 20001 5
2002-09-30 10001 20001 6
2002-12-31 10001 20001 7
2004-03-31 10001 20001 1
2004-06-30 10001 20001 2
Note that we roll over after Q4 2000 because the product was not sold during Q1 of 2001. Additionally, we roll over after Q4 2002 because the product was not sold during Q1 2003. The next time the product was sold was Q1 2004, which was assigned a 1.
The issue I am having is that my actual dataset is quite large (on the order of 10 million rows), so this needs to be done efficiently. The only techniques I've been able to come up with are dreadfully inefficient. Any advice would be greatly appreciated.
You can use a custom function that calculates the difference between consecutive quarters.
# Load data.table
library(data.table)
# Convert the data to a data.table by reference
setDT(data)
# Make sure Date is a Date object so that difftime() works
data[, Date := as.Date(Date)]
# Key (and sort) by store, product and date; the function below assumes
# that dates are in ascending order within each group
setkey(data, StoreID, ProductID, Date)
consecutiveQuarters <- function(date, timeGap = 14) {
  # Calculate the difference between successive dates in weeks and
  # start a new run whenever the gap is greater than timeGap
  # (14 weeks, slightly more than one quarter)
  shifts <- cumsum(c(FALSE, abs(difftime(date[-length(date)], date[-1], units = "weeks")) > timeGap))
  # Number the rows within each run from 1 to the length of the run
  ave(shifts, shifts, FUN = seq_along)
}
# Calculate consecutive quarters by StoreID and ProductID
data[, V := consecutiveQuarters(Date), .(StoreID, ProductID)]
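As a rough alternative sketch (not the method above), the same counter can be built from an integer quarter index instead of date differences. The table name sales and the helper columns qidx and run are assumptions made up for this illustration; year() and quarter() are the data.table helpers.

```r
library(data.table)

# Toy data: one store/product pair with one missing quarter (Q1 2001)
sales <- data.table(
  Date = as.Date(c("2000-03-31", "2000-06-30", "2000-09-30", "2000-12-31",
                   "2001-06-30", "2001-09-30", "2001-12-31")),
  StoreID = 10001L,
  ProductID = 20001L
)

# Sort by store, product and date so consecutive rows are consecutive sales
setkey(sales, StoreID, ProductID, Date)

# Integer index that increases by exactly 1 from one quarter to the next
sales[, qidx := year(Date) * 4L + quarter(Date)]

# A new run starts at the first row of a group or wherever the index jumps
sales[, run := cumsum(c(TRUE, diff(qidx) != 1L)), by = .(StoreID, ProductID)]

# Count rows within each run
sales[, V := seq_len(.N), by = .(StoreID, ProductID, run)]

print(sales[, .(Date, StoreID, ProductID, V)])
```

Working on an integer index avoids choosing a week threshold, at the cost of two grouped passes over the data.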
Create a variable which is 1 if the product is sold in a quarter and 0 if not. Order the variable so it starts at the present and goes backward in time.
Compare the cumulative sum of such a variable to a sequence of identical length. When sales drop to zero the cumulative sum will no longer equal the sequence. Sum the number of times the cumulative sum equals the sequence and this will tell the number of consecutive quarters sales were positive.
data <- data.frame(
  quarter = c(1, 2, 3, 4, 1, 2, 3, 4),
  store = as.factor(c(1, 1, 1, 1, 1, 1, 1, 1)),
  product = as.factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  numsold = c(5, 6, 0, 1, 7, 3, 2, 14)
)
sortedData <- data[order(-data$quarter),]
storeValues <- c("1")
productValues <- c("1","2")
dataConsec <- data.frame(store = NULL, product = NULL, ConsecutiveSales = NULL)
for (storeValue in storeValues) {
  for (productValue in productValues) {
    prodSoldinQuarter <-
      as.numeric(sortedData[sortedData$store == storeValue &
                            sortedData$product == productValue,]$numsold > 0)
    dataConsec <- rbind(dataConsec,
                        data.frame(
                          store = storeValue,
                          product = productValue,
                          ConsecutiveSales =
                            sum(as.numeric(cumsum(prodSoldinQuarter) ==
                                           seq(1, length(prodSoldinQuarter))
                            ))
                        ))
  }
}
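The cumulative-sum comparison described in this answer can be seen on a single vector. This is only a small sketch; the function name current_streak is made up for the illustration, and the input is assumed to be a 0/1 vector ordered from the most recent quarter backward.

```r
# sold: 1 if the product sold in that quarter, 0 otherwise,
# ordered from the most recent quarter backward in time
current_streak <- function(sold) {
  # cumsum(sold) keeps pace with 1, 2, 3, ... only until the first 0,
  # so counting the matches gives the length of the current streak
  sum(cumsum(sold) == seq_along(sold))
}

current_streak(c(1, 1, 0, 1, 1))  # sold in the two most recent quarters
current_streak(c(0, 1, 1))        # not sold in the most recent quarter
current_streak(c(1, 1, 1))        # sold in every quarter
```

Note that this gives one number per store and product (the streak ending at the latest quarter), not a running counter for every row.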
As I understand your question, you really need your V column to be the quarter of the year, not the sum of products in each quarter. You can use something like this.
# to_quarters returns the quarter of the year (1 to 4) for a date
# given as a "YYYY-MM-DD" string, based on the two month characters
to_quarters <- function(date_string) {
  month <- as.numeric(substr(date_string, 6, 7))
  as.integer((month - 1) / 3) + 1
}
# with tidyverse library
library(tidyverse)
# your data as a tibble (as.tibble() is deprecated in favour of as_tibble())
data_set_tibble <- as_tibble(YOUR_DATA)
# here you create your table
data_set_tibble %>% mutate(V = to_quarters(Date) %>% as.integer())
# alternative with the data.table library
library(data.table)
# your data as data.table format of data frame
data_set <- as.data.table(YOUR_DATA)
# here you create your table
data_set[,.(Date, StoreID, ProductID, V = to_quarters(Date))]
Performance is the same for tidyverse and data.table; in my case, 5,000,000 rows took 12 seconds.
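If the Date column is stored as a Date object rather than a string, the quarter can also be read off with base R alone. A minimal sketch, with the variable names chosen only for this illustration:

```r
dates <- as.Date(c("2000-03-31", "2000-06-30", "2000-09-30", "2000-12-31"))

# Integer quarter (1 to 4) from the month number
quarter_num <- (as.integer(format(dates, "%m")) - 1L) %/% 3L + 1L

# Base R also offers quarters(), which returns labels such as "Q1"
quarter_lab <- quarters(dates)

print(data.frame(dates, quarter_num, quarter_lab))
```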
Related
I need to find the average prices for all the different weeks. I need to make a ggplot to show how the price is during the year.
When you find the mean, how do the empty cells affect the mean?
I have tried several things, including using the melt() function so I only have 3 variables. The variables are factors, which I want to find the mean of.
Company variable value
ns Price week 24 1749
ns Price week 24
ns Price week 24 1599
ns Price week 24
ns Price week 24
ns Price week 24 359
ns Price week 24 460
I have more than 300K observations and would love to have a small data.frame containing only the Company and the mean price for each week. Right now I have all observations for each week, and I need the means for the ggplot.
When I use following code
dat %in% mutate(means=mean(value), na.rm=TRUE)
I got a warning message saying the argument is not numeric or logical: returning NA.
I am looking forward to getting your help!
Clean code from PavoDive's comment
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better
dt[ , mean(value, na.rm = TRUE), by = .(Price, Week)]
Original:
This works using data.table. The first part filters out rows that don't have a number in value (the condition evaluates to NA for missing values, and data.table drops those rows). Next, we say that we want the average of the value column. Finally, by defines how to group the rows.
Code:
dt[value >0 | value<1, .(MeanValues = mean(`value`)), by = c("Price", "Week")][]
Input:
dt <- data.table(`Price` = c("A","B","B","A","A","B","B","A"),
`Week`= c(1,2,1,1,2,2,1,2),
`value` = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
Price Week MeanValues
1: A 1 3.0
2: B 2 26.5
3: B 1 1.5
4: A 2 1.0
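The warning in the question ("argument is not numeric or logical") is consistent with value being stored as a factor, as the question mentions. A small base R sketch of converting it first and then averaging per group; the toy data copies the rows shown in the question, the empty cells are assumed to be empty strings, and the column name value_num is made up here.

```r
# Toy version of the melted data, with value stored as a factor
dat <- data.frame(
  Company = "ns",
  variable = "Price week 24",
  value = factor(c("1749", "", "1599", "", "", "359", "460"))
)

# Go through character first; as.numeric() on a factor would return
# the internal level codes instead of the prices. "" becomes NA.
dat$value_num <- as.numeric(as.character(dat$value))

# The formula interface of aggregate() drops rows with NA by default,
# so empty cells do not enter the mean
means <- aggregate(value_num ~ Company + variable, data = dat, FUN = mean)
print(means)
```

The resulting small data.frame has one row per Company and week, which is the shape ggplot needs for a line or bar chart of mean prices.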
I want to determine the difference of each row and have that total difference rbinded at the end. Below is a sample dataset:
DATE <- as.Date(c('2016-11-28','2016-11-29'))
TYPE <- c('A', 'B')
Revenue <- c(2000, 1000)
Sales <- c(1000, 4000)
Price <- c(5.123, 10.234)
Material <- c(10000, 7342)
df<-data.frame(DATE, TYPE, Revenue, Sales, Price, Material)
df
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
How do I calculate the difference of each of the columns to produce this total:
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
3 DIFFERENCE -1000 3000 5.111 -2658
I can easily do it by columns but am having trouble doing it by row.
Any help would be great thanks!
As 'DATE' is Date class, we may need to change it to character before proceeding with rbinding with string "DIFFERENCE". Other than that, subset the numeric columns of 'df', loop it with lapply, get the difference, concatenate with the 'DATE' and 'TYPE', and rbind with original dataset.
df$DATE <- as.character(df$DATE)
rbind(df, c(DATE = "DIFFERENCE", TYPE= NA, lapply(df[-(1:2)], diff)))
# DATE TYPE Revenue Sales Price Material
#1 2016-11-28 A 2000 1000 5.123 10000
#2 2016-11-29 B 1000 4000 10.234 7342
#3 DIFFERENCE <NA> -1000 3000 5.111 -2658
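A variant of the same idea that picks the numeric columns with is.numeric instead of by position might look like the sketch below. The names num_cols, diff_row and result are made up for the illustration, and DATE is created as character from the start so that the "DIFFERENCE" label can be stored in it.

```r
df <- data.frame(
  DATE = c("2016-11-28", "2016-11-29"),  # character, not Date
  TYPE = c("A", "B"),
  Revenue = c(2000, 1000),
  Sales = c(1000, 4000),
  Price = c(5.123, 10.234),
  Material = c(10000, 7342),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric columns
num_cols <- sapply(df, is.numeric)

# Start from a copy of one row and overwrite every field
diff_row <- df[1, ]
diff_row$DATE <- "DIFFERENCE"
diff_row$TYPE <- NA_character_
diff_row[num_cols] <- lapply(df[num_cols], diff)  # second row minus first

result <- rbind(df, diff_row)
print(result)
```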
Data:
DB <- data.frame(orderID = c(1,2,3,1,1,3,2,4,5,5),
orderDate = c("1.1.12","1.1.12","1.1.12","1.1.12","1.1.12", "1.1.12","1.1.12","2.1.12","2.1.12","2.1.12"),
itemID = c(2,3,2,5,12,4,2,3,1,5),
customerID = c(1, 2, 3, 1, 1, 3, 2, 2, 1, 1),
itemPrice = c(9.99, 14.99, 9.99, 19.99, 29.99, 4.99, 9.99, 14.99, 49.99, 19.99))
Expected outcome:
NumberofOrdersOfSpecificUser = c(2, 2, 1, 2, 2, 1, 2, 2, 2, 2)
AverageValuePerOrder = c(64.975, 19.985, 14.98, 64.975, 64.975, 14.98, 19.985, 19.985, 64.975, 64.975)
For Understanding:
The orderID is continuous. Products ordered by the same customer (ID) on the same day get the same orderID. When the same customer orders products on another day, it gets a new orderID.
Hi,
I want 2 things:
1. count the number of orders per user
2. Calculate the average value per order per user
How can we do this?
I tried it already with this:
DB$NumberofOrdersOfSpecificUser <- with(DB,ave(as.numeric(mydata$orderDate), customerID, FUN=function(x) length(unique(x))))
DB$NumberofOrdersOfSpecificUser <- as.integer(DB$NumberofOrdersOfSpecificUser)
DB$orderDate <- as.factor(DB$orderDate)
Thanks a lot for your support!
There are, of course, many ways to do this. In your preferred outcome, there is redundant data. When you work with R, there is really no need to force the summary data back on to the individual records; instead, you create a new object and continue the work with that object.
# sort() keeps the customer and order IDs in the same order that tapply() uses
my.summaries <- data.frame(customerID = sort(unique(DB$customerID)),
  NumberofOrdersOfSpecificUser = sapply(sort(unique(DB$customerID)),
    function(customer) { length(unique(DB$orderDate[which(DB$customerID == customer)])) } ),
  AverageValuePerOrder = tapply(tapply(DB$itemPrice, DB$orderID, sum),
    DB$customerID[match(sort(unique(DB$orderID)), DB$orderID)], mean)
)
my.summaries
customerID NumberofOrdersOfSpecificUser AverageValuePerOrder
1 1 2 64.975
2 2 2 19.985
3 3 1 14.980
Should you really need to force the summary data back on to the individual records, use merge()
merge(DB, my.summaries)
customerID orderID orderDate itemID itemPrice NumberofOrdersOfSpecificUser AverageValuePerOrder
1 1 1 1.1.12 2 9.99 2 64.975
2 1 5 2.1.12 5 19.99 2 64.975
3 1 1 1.1.12 5 19.99 2 64.975
4 1 1 1.1.12 12 29.99 2 64.975
5 1 5 2.1.12 1 49.99 2 64.975
6 2 2 1.1.12 2 9.99 2 19.985
7 2 2 1.1.12 3 14.99 2 19.985
8 2 4 2.1.12 3 14.99 2 19.985
9 3 3 1.1.12 2 9.99 1 14.980
10 3 3 1.1.12 4 4.99 1 14.980
EDIT: since the original poster added a requirement that the solution should be fast, here is a fast solution, using data.table
library(data.table)
dt <- data.table(DB)
orders.per.customer <- dt[, sum(itemPrice), by="orderID,customerID"]
my.summaries <- merge(orders.per.customer[, length(orderID), by=customerID],
orders.per.customer[, mean(V1), by=customerID],
by = "customerID")
colnames(my.summaries) <- c("customerID",
"NumberofOrdersOfSpecificUser", "AverageValuePerOrder")
dt <- merge(dt, my.summaries, by = "customerID")
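If the summary columns are wanted directly on the item-level rows, a shorter data.table sketch adds them by reference with := and uniqueN(). It treats the average value per order as a customer's total spend divided by their number of distinct orders, which is the same quantity as the mean of the per-order totals. The table name orders is an assumption for this illustration.

```r
library(data.table)

orders <- data.table(
  orderID    = c(1, 2, 3, 1, 1, 3, 2, 4, 5, 5),
  customerID = c(1, 2, 3, 1, 1, 3, 2, 2, 1, 1),
  itemPrice  = c(9.99, 14.99, 9.99, 19.99, 29.99, 4.99, 9.99, 14.99, 49.99, 19.99)
)

# Number of distinct orders per customer, repeated on every row
orders[, NumberofOrdersOfSpecificUser := uniqueN(orderID), by = customerID]

# Total spend divided by number of distinct orders
orders[, AverageValuePerOrder := sum(itemPrice) / uniqueN(orderID), by = customerID]

print(orders)
```

Because := modifies the table in place, the original row order is kept and no merge is needed.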
I am trying to group my data by Year and CountyID then use splinefun (cubic spline interpolation) on the subset data. I am open to ideas, however the splinefun is a must and cannot be changed.
Here is the code I am trying to use:
age <- seq(from = 0, by = 5, length.out = 18)
TOT_POP <- df %.%
group_by(unique(df$Year), unique(df$CountyID) %.%
splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
Here is a sample of my data Year = 2010 : 2013, Agegrp = 1 : 17 and CountyIDs are equal to all counties in the US.
CountyID Year Agegrp TOT_POP
1001 2010 1 3586
1001 2010 2 3952
1001 2010 3 4282
1001 2010 4 4136
1001 2010 5 3154
What I am doing is taking the Agegrp 1 : 17 and splitting the grouping into individual years 0 - 84. Right now each group is a representation of 5 years. The splinefun allows me to do this while providing a level of mathematical rigour to the process, i.e., splinefun allows me to provide a population total for each year of age, in each individual county in the US.
Lastly, the splinefun code by itself does work but within the group_by function it does not, it produces:
Error: wrong result size(4), expected 68 or 1.
The splinefun code the way I am using it works like this
TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)),
method = "hyman")
TOT_POP = pmax(0, diff(TOT_POP(c(0:85))))
Which was tested on one CountyID during one Year. I need to iterate this process over "x" amount of years and roughly 3200 counties.
# Reproducible data set
set.seed(22)
df = data.frame( CountyID = rep(1001:1005,each = 100),
Year = rep(2001:2010, each = 10),
Agegrp = sample(1:17, 500, replace=TRUE),
TOT_POP = rnorm(500, 10000, 2000))
# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"
# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))
# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for (i in seq_along(split.dfs)) {
  new.split.dfs[[i]] = data.frame(CountyID = split.dfs[[i]]$CountyID[1],
                                  Year = split.dfs[[i]]$Year[1],
                                  age = 0:85,
                                  TOT_POP = spline.funs[[i]](0:85))
}
# Does this do what you want? If so, then it will be
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age TOT_POP
# 1 1001 2001 0 909033.4
# 2 1001 2001 1 833999.8
# 3 1001 2001 2 763181.8
# 4 1001 2001 3 696460.2
# 5 1001 2001 4 633716.0
# 6 1001 2001 5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age TOT_POP
# 81 1002 2001 80 10201.693
# 82 1002 2001 81 9529.030
# 83 1002 2001 82 8768.306
# 84 1002 2001 83 7916.070
# 85 1002 2001 84 6968.874
# 86 1002 2001 85 5923.268
First, I believe I was using the wrong wording in what I was trying to achieve, my apologies; group_by actually wasn't going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:
# colwise(), ddply() and the helper true() come from the plyr package
library(plyr)
interpolate <- function(x, ageVector){
  result <- splinefun(ageVector,
                      c(0, cumsum(x)), method = "hyman")
  diff(result(c(0:85)))
}
mainFunc <- function(df){
  age <- seq(from = 0, by = 5, length.out = 18)
  # Columns to interpolate: everything except the grouping and age columns
  colNames <- setdiff(colnames(df),
                      c("Year","CountyID","Agegrp"))
  # drop = FALSE keeps a data.frame even when only one column remains
  colWiseSpline <- colwise(interpolate, .cols = true,
                           age)(df[ , colNames, drop = FALSE])
  cbind(data.frame(
          Year = df$Year[1],
          County = df$CountyID[1],
          Agegrp = 0:84
        ),
        colWiseSpline
  )
}
CompleteMainRaw <- ddply(.data = df,
                         .variables = .(CountyID, Year),
                         .fun = mainFunc)
The code now takes each county by year and runs splinefun on that subset of the population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 five-year age groups into 85 single years of age (0 to 84) while factoring it out appropriately, which is what splinefun does.
Thanks!
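For readers who want to try the cumulative-spline idea without plyr, here is a small base R sketch on made-up numbers. The function name interpolate_ages and the toy populations are assumptions for the illustration; only splinefun with method = "hyman" is taken from the thread.

```r
# Split 17 five-year population totals into 85 single-year values
interpolate_ages <- function(pop5) {
  age <- seq(from = 0, by = 5, length.out = 18)   # knots at 0, 5, ..., 85
  # Monotone spline through the cumulative population at each knot
  f <- splinefun(age, c(0, cumsum(pop5)), method = "hyman")
  # Single-year counts are the differences of the cumulative curve
  pmax(0, diff(f(0:85)))
}

# Toy data: two counties, one year, 17 age groups each (invented values)
df <- data.frame(
  CountyID = rep(c(1001, 1002), each = 17),
  Year = 2010,
  Agegrp = rep(1:17, times = 2),
  TOT_POP = c(seq(3000, 4600, by = 100), seq(2000, 3600, by = 100))
)

# One piece per CountyID-Year combination, then stack the pieces
pieces <- lapply(split(df, list(df$CountyID, df$Year), drop = TRUE), function(g) {
  g <- g[order(g$Agegrp), ]
  data.frame(CountyID = g$CountyID[1], Year = g$Year[1],
             Age = 0:84, TOT_POP = interpolate_ages(g$TOT_POP))
})
result <- do.call(rbind, pieces)
head(result)
```

Because the spline passes through the cumulative totals at every knot, the single-year values within each five-year group add up to the original group total.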
I have read many of the threads and do not think my question has been asked before. I have a data.frame in R related to advertisements shown to customers, as shown below. I have many customers and 8 different products, so this is just a sample.
mydf <- data.frame(Cust = c(1, 1), age = c(24, 24),
                   state = c("NJ", "NJ"), Product = c(1, 1), cost = c(400, 410),
                   Time = c(35, 23), Purchased = c("N", "Y"))
mydf
# Cust age state Product cost Time Purchased
# 1 1 24 NJ 1 400 35 N
# 2 1 24 NJ 1 410 23 Y
And I want to transform it to look as such ...
Cust | age | state | Product | cost.1 | time.1 | purch.1 | cost.2 | time.2 | purch.2
1 | 24 | NJ | 1 | 400 | 35 | N | 410 | 23 | Y
How can I do this? There are a few static variables for each customer such as age, state and a few others... and then there are the details associated with each offer that was presented to a given customer, the product # in the offer, the cost, the time, and if they purchased it... I want to get all of this onto 1 line for each customer to perform analysis.
It is worth noting that the number of products maxes out at 7, but for some customers it ranges from 1 to 7.
I have no sample code to really show. I have tried using the aggregate function, but I do not want to aggregate, or do any SUMs. I just want to do some joins. Research suggests the cbind, and tapply functions may be useful.
Thank you for your help. I am very new to R.
You are essentially asking to do a "long" to "wide" reshape of your data.
It looks to me like you're using "Cust", "age", "state", and "Product" as your ID variables. You don't have an actual "time" variable though ("time" as in the sequential count of records by the IDs mentioned above). However, such a variable is easy to create:
mydf$timevar <- with(mydf,
ave(rep(1, nrow(mydf)),
Cust, age, state, Product, FUN = seq_along))
mydf
# Cust age state Product cost Time Purchased timevar
# 1 1 24 NJ 1 400 35 N 1
# 2 1 24 NJ 1 410 23 Y 2
From there, this is pretty straightforward with the reshape function in base R.
reshape(mydf, direction = "wide",
idvar=c("Cust", "age", "state", "Product"),
timevar = "timevar")
# Cust age state Product cost.1 Time.1 Purchased.1 cost.2 Time.2 Purchased.2
# 1 1 24 NJ 1 400 35 N 410 23 Y
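With many customers, the same long-to-wide reshape can also be sketched with data.table, where rowid() builds the sequence counter and dcast() accepts several value columns at once. The names offers, offer and wide are made up for this illustration, and the resulting columns are separated by an underscore (cost_1, cost_2, ...) rather than a dot.

```r
library(data.table)

offers <- data.table(
  Cust = c(1, 1), age = c(24, 24), state = c("NJ", "NJ"),
  Product = c(1, 1), cost = c(400, 410),
  Time = c(35, 23), Purchased = c("N", "Y")
)

# Sequential count of offers within each combination of ID variables
offers[, offer := rowid(Cust, age, state, Product)]

# One row per ID combination, one set of columns per offer
wide <- dcast(offers, Cust + age + state + Product ~ offer,
              value.var = c("cost", "Time", "Purchased"))
print(wide)
```

Customers with fewer offers than the maximum simply get NA in the columns for the offers they never received.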