Find number of customers added each month - R

customer_id  transaction_id  month  year
          1               3      7  2014
          1               4      7  2014
          2               5      7  2014
          2               6      8  2014
          1               7      8  2014
          3               8      9  2015
          1               9      9  2015
          4              10      9  2015
          5              11      9  2015
          2              12      9  2015
I am familiar with R basics. Any help will be appreciated.
The expected output should look like the following:
month  year  number_unique_customers_added
    7  2014  2
    8  2014  0
    9  2015  3
In month 7 of 2014, only customer_ids 1 and 2 are present, so the number of customers added is two. In month 8 of 2014, no new customer_ids are added, so there should be zero customers added in this period. Finally, in month 9 of 2015, customer_ids 3, 4 and 5 are the new ones added, so the number of customers added in this period is 3.
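For reference, the posted data can be recreated as follows; the object names dt and df are assumptions that simply match the names used in the answers below:

library(data.table)

# Recreate the transactions table posted in the question
dt <- fread('
customer_id transaction_id month year
1 3 7 2014
1 4 7 2014
2 5 7 2014
2 6 8 2014
1 7 8 2014
3 8 9 2015
1 9 9 2015
4 10 9 2015
5 11 9 2015
2 12 9 2015
')

df <- as.data.frame(dt)  # plain data.frame copy for the dplyr answer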

Using data.table:
require(data.table)
dt[, .SD[1,], by = customer_id][, uniqueN(customer_id), by = .(year, month)]
Explanation: We first remove all subsequent transactions of each customer (we're interested in the first one, when she is a "new customer"), and then count unique customers by each combination of year and month.
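Note that this drops year/month combinations in which no new customers appear, whereas the expected output shows 0 for month 8 of 2014. If those rows are needed, one option (a sketch, my addition rather than part of the original answer) is to join the counts back onto all observed periods:

library(data.table)

# New customers per period, then a right join onto every (year, month) seen in the data
new_per_period <- dt[, .SD[1], by = customer_id][
  , .(number_unique_customers_added = uniqueN(customer_id)), by = .(year, month)]
all_periods <- unique(dt[, .(year, month)])
new_per_period[all_periods, on = .(year, month)][
  is.na(number_unique_customers_added), number_unique_customers_added := 0][]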

Using dplyr, we can first create a column that indicates whether a customer is a duplicate, and then group_by month and year to count the new customers in each group.
library(dplyr)
df %>%
  mutate(unique_customers = !duplicated(customer_id)) %>%
  group_by(month, year) %>%
  summarise(unique_customers = sum(unique_customers))
#   month  year unique_customers
#   <int> <int>            <int>
# 1     7  2014                2
# 2     8  2014                0
# 3     9  2015                3

Updating table with custom numbers

Below is my dataset, which contains four columns: id, year, quarter, and price.
df <- data.frame(id = c(1, 2, 1, 2),
                 year = c(2010, 2010, 2011, 2011),
                 quarter = c("2010-q1", "2010-q2", "2011-q1", "2011-q2"),
                 price = c(10, 50, 10, 50))
Now I want to expand this dataset to cover 2012 and 2013. First, I want to copy the rows for 2010 and 2011 and append them below, and then replace the years with 2012 and 2013 and the quarters with 2012-q1, 2012-q2, 2013-q1 and 2013-q2.
Can anybody help me solve this and produce the table shown below?
library(dplyr)
df %>%
  mutate(year = year + 2, quarter = paste0(year, "-q", id)) %>%
  bind_rows(df, .)
  id year quarter price
1  1 2010 2010-q1    10
2  2 2010 2010-q2    50
3  1 2011 2011-q1    10
4  2 2011 2011-q2    50
5  1 2012 2012-q1    10
6  2 2012 2012-q2    50
7  1 2013 2013-q1    10
8  2 2013 2013-q2    50
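If more year offsets are needed later, the same idea generalizes. Here is a sketch (the helper name add_years and the use of purrr::map_dfr are my own additions, not part of the original answer) that appends one shifted copy of the 2010-2011 rows per offset:

library(dplyr)
library(purrr)

# Hypothetical helper: shift the base years by `offset` and rebuild the quarter label
add_years <- function(base, offset) {
  base %>%
    mutate(year = year + offset,
           quarter = paste0(year, "-q", id))
}

# Original rows plus copies shifted by 2 and 4 years (2012-2013 and 2014-2015)
expanded <- bind_rows(df, map_dfr(c(2, 4), ~ add_years(df, .x)))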

Is there a way I can get the maximum value for each group after a double group_by in R?

I am trying to extract the team with the maximum number of wins each year in women's college basketball. I am currently stuck with the number of wins for each team in each year, and I want only the team with the maximum number of wins in each year.
winsbyyear <- WomenCBnewdf %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome))
The output currently looks like this, but I expect to see each year only once, with the team that has the maximum number of wins in the subsequent columns:
   Year  Team              totalwinsyr
   <fct> <chr>                   <dbl>
 1 2014  AbileneChristian           10
 2 2014  AirForce                    0
 3 2014  Akron                      18
 4 2014  Alabama                    10
 5 2014  AlabamaAM                   3
 6 2014  AlabamaHuntsville           0
 7 2014  AlabamaMobile               0
 8 2014  AlabamaSt                  15
 9 2014  AlaskaAnchorage             1
10 2014  AlbanyNY                   16
How to select the rows with maximum values in each group with dplyr?
I have already looked at that question, but I could not find any resources to help with a group_by() on multiple variables.
Summarise the total wins per Year and Team as you already do, then regroup by Year and keep the rows with the maximum:
winsbyyear <- WomenCBnewdf %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome)) %>%
  group_by(Year) %>%
  filter(totalwinsyr == max(totalwinsyr))
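With dplyr 1.0.0 or later, slice_max() is a slightly more direct way to express the last step (a sketch under that assumption; like the filter above, it keeps ties by default):

library(dplyr)

winsbyyear <- WomenCBnewdf %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome), .groups = "drop") %>%
  group_by(Year) %>%
  slice_max(totalwinsyr, n = 1)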

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year  Country  VarA  VarB
2015  USA         1     3
2016  USA         2     2
2014  Canada      0    10
2015  Canada      6     5
2016  Canada      7     8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year  VarA.Canada  VarA.USA  VarB.Canada  VarB.USA
2014            0        NA           10        NA
2015            6         1            5         3
2016            7         2            8         2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header = TRUE, text = 'Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars = c('Year', 'Country'))
molten[, variable := paste(variable, Country, sep = '.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using data.table's dcast you can cast multiple value.var columns at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA", "VarB"), sep = ".")
#   Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014           0       NA          10       NA
#2: 2015           6        1           5        3
#3: 2016           7        2           8        2
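For comparison, a tidyr equivalent (a sketch assuming tidyr >= 1.0.0; my addition, not part of the original answer) does the same reshape in one pivot_wider() call:

library(tidyr)

pivot_wider(data,
            names_from = Country,
            values_from = c(VarA, VarB),
            names_sep = ".")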

Conditional cumulative subtraction

This is what my data.table looks like:
library(data.table)
dt <- fread('
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3
')
Balance is my desired column. I am trying to find the cumulative subtractions by taking the first value of Total, which is 10 (it should also be the first value of the Balance field), and then cumulatively subtracting the values in Shares. So the second value is 10 - 1 = 9, the third value is 9 - 2 = 7, and so on. There is one condition: if the Year is 2014, subtract the Shares value after dividing it by 2, so the fourth value is 7 - (2/2) = 6 and the fifth value is 6 - 3 = 3. I want to end the calculation at the last row.
My attempt is:
dt[, Balance:= ifelse( Year == 2014, cumsum(Total[1]-Shares/2), cumsum(Total[1] - Shares))]
Here is one method.
dt[, Balance2 := Total[1] - cumsum(shift(Shares * (1 - 0.5 * (Year == 2015)), fill = 0))]
shift is used to create a lag variable, and its first element is filled with 0 using fill = 0. The other elements are calculated as Shares * (1 - 0.5 * (Year == 2015)), which returns Shares except when Year == 2015, in which case Shares * 0.5 is returned. The halving is applied to the 2015 row rather than the 2014 row because, after the shift, that row's Shares becomes the amount subtracted in the 2014 row, which is where the question's condition applies.
which returns
dt
   Year Total Shares Balance Balance2
1: 2017    10      1      10       10
2: 2016    12      2       9        9
3: 2015    10      2       7        7
4: 2014    10      3       6        6
5: 2013    10     NA       3        3
FWIW, I wanted to provide a functional alternative that allows more flexible calculations in the cumulative differences, indexing, etc. I have also read in the data with read.table.
dt <- read.table(header=TRUE, text='
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3
')
makeNewBalance <- function(dt) {
  output <- NULL
  for (i in 1:nrow(dt)) {
    if (i == 1) {
      output[i] <- dt$Total[i]
    } else {
      output[i] <- output[i-1] - as.integer(ifelse(dt$Year[i] == 2014,
                                                   dt$Shares[i-1] / 2,
                                                   dt$Shares[i-1]))
    }
  }
  return(output)
}
dt$NewBalance <- makeNewBalance(dt)
which also returns
> dt
  Year Total Shares Balance NewBalance
1 2017    10      1      10         10
2 2016    12      2       9          9
3 2015    10      2       7          7
4 2014    10      3       6          6
5 2013    10     NA       3          3
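For completeness, the same vectorized logic can also be written with dplyr (a sketch, not part of either original answer; it assumes the data is in a data frame named dt with the columns shown above):

library(dplyr)

dt %>%
  mutate(Balance3 = first(Total) -
           cumsum(lag(Shares * ifelse(Year == 2015, 0.5, 1), default = 0)))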

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff  Result  Date  Days
    1      50  2007     4
    1      75  2006     5
    1      60  2007     3
    2      20  2009     3
    2      11  2009     2
And I want to get to this:
Staff  Result  Date  Days
    1      55  2007     7
    1      75  2006     5
    2      15  2009     5
I want the Staff ID and Date to be unique in each row, but I want to sum 'Days' and take the mean of 'Result'.
I can't work out how to do this in R. I'm sure I need lots of aggregations, but I keep getting results different from what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1, 1, 1, 2, 2),
                 Result = c(50, 75, 60, 20, 11),
                 Date = c(2007, 2006, 2007, 2009, 2009),
                 Days = c(4, 5, 3, 3, 2))
df %>%
  group_by(Staff, Date) %>%
  summarise(Result = floor(mean(Result)),
            Days = sum(Days)) %>%
  data.frame
  Staff Date Result Days
1     1 2006     75    5
2     1 2007     55    7
3     2 2009     15    5
You can aggregate on two variables by using a formula and then merge the two aggregates
merge(aggregate(Result ~ Staff + Date, data = df, mean),
      aggregate(Days ~ Staff + Date, data = df, sum))
  Staff Date Result Days
1     1 2006   75.0    5
2     1 2007   55.0    7
3     2 2009   15.5    5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
#   Staff Date Result Days
#1:     1 2007     55    7
#2:     1 2006     75    5
#3:     2 2009     15    5
