I have the following data and am looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat | Qty | Final Col  |
|:----:|:----:|:-------:|:---:|:----------:|
| 2017 |  1   | Edible  | 69  | 69/(69+12) |
| 2017 |  2   | Edible  | 12  | 12/(69+12) |
| 2017 |  1   | Flowers | 88  | 88/(88+47) |
| 2017 |  2   | Flowers | 47  | 47/(88+47) |
| 2018 |  1   | Edible  | 90  | 90/(90+35) |
| 2018 |  2   | Edible  | 35  | 35/(90+35) |
| 2018 |  1   | Flowers | 78  | 78/(78+85) |
| 2018 |  2   | Flowers | 85  | 85/(78+85) |
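For reference, a minimal construction of this sample data (the data frame name df1 matches the first answer below; values are taken from the table above):

# hypothetical reconstruction of the example data shown above
df1 <- data.frame(
  Year    = c(2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018),
  Week    = c(1, 2, 1, 2, 1, 2, 1, 2),
  MainCat = c("Edible", "Edible", "Flowers", "Flowers", "Edible", "Edible", "Flowers", "Flowers"),
  Qty     = c(69, 12, 88, 47, 90, 35, 78, 85)
)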
It can be done with a group_by operation, i.e. grouped by 'Year' and 'MainCat', divide 'Qty' by the sum of 'Qty' to create the 'Final' column:
library(dplyr)
df1 <- df1 %>%
  group_by(Year, MainCat) %>%
  mutate(Final = Qty / sum(Qty))
You can use prop.table:
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R:
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
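As a quick sanity check (a sketch, assuming the Final column created above), the proportions within each Year/MainCat group should sum to 1:

df %>%
  group_by(Year, MainCat) %>%
  summarise(total = sum(Final))   # each group's total should be 1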
I'm just starting to learn R and transition a project from Jupyter Notebook to an R Markdown document. I have a data set that looks like this:
| DATE       | ROUTE | STOP_NAME | BOARDING |
|------------|-------|-----------|----------|
| 2020-03-09 | 1     | STOP A    | 2        |
| 2020-03-09 | 1     | STOP B    | 3        |
| 2020-03-09 | 2     | STOP C    | 1        |
There are 20,xxx records over several days and 16 routes. I am trying to group by DATE and ROUTE and sum the BOARDING column. I was able to do this in Python using:
df.groupby(['DATE','ROUTE'],as_index = False)['BOARDING'].sum().pivot('DATE','ROUTE').fillna(0)
I've been able to create a table in R close to what I want using:
library(plyr)  # ddply() comes from the plyr package

groupcol1 <- c("DATE", "ROUTE")
datacol1 <- "BOARDING"
route_totals_table <- ddply(df, groupcol1, function(x) colSums(x[datacol1]))
But this gives me a table with one row for each date/route combination. I want a table like this:
| DATE       | Route 1 | Route 2 | Route 3 |
|------------|---------|---------|---------|
| 2020-03-09 | 25      | 45      | 10      |
| 2020-03-10 | 36      | 69      | 22      |
| 2020-03-11 | 95      | 100     | 29      |
I would suggest using the tidyverse package to do this work, and the spread or pivot_wider functions from the tidyr package. Suppose your data is in a data.frame called "dat":
library(tidyverse)
# using spread
dat %>%
  mutate(ROUTE = paste0("Route ", ROUTE)) %>%
  group_by(DATE, ROUTE) %>%
  summarise(BOARDING = sum(BOARDING)) %>%
  spread(ROUTE, BOARDING)
# using pivot_wider
dat %>%
  mutate(ROUTE = paste0("Route ", ROUTE)) %>%
  group_by(DATE, ROUTE) %>%
  summarise(BOARDING = sum(BOARDING)) %>%
  pivot_wider(names_from = ROUTE, values_from = BOARDING)
Both return:
DATE `Route 1` `Route 2`
<chr> <int> <int>
1 "2020-03-09" 5 1
This is a follow-up to the question Remove the first row from each group if the second row meets a condition.
Below is a sample dataset:
library(dplyr)

df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                          "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount = c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"),
                 stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date       | Buyer | diffs | Amount |
|----|:----------:|------:|-------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain those records where, for each Buyer and id, the sum of Amount across consecutive rows is > 5000 and the difference between those consecutive rows is <= 5 days. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on 6/15/2018 and 6/20/2018, within a gap of 5 days, and since the sum of these two amounts > 5000, the output should include these records. The same Buyer 'Sandy' with id '4' has three more transactions of 4256, 65 and 100 on 8/17/2018, 8/20/2018 and 8/23/2018, each within a gap of 3 days, but the output should not include them, as the sum of these amounts < 5000.
The final output would look like:
| id | Date      | Buyer | diffs | Amount |
|----|:---------:|------:|-------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                          "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount = c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"),
                 stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date <- as.Date(df$Date, '%m/%d/%Y')
df$Amount <- as.numeric(df$Amount)
Now I group the dataset by id, arrange it by Date, and create a rank within each id (so, for example, Sandy will have ranks 1 through 5 for the 5 different days on which she shopped). Then I define a new variable called ConsecutiveSum, which adds each row's Amount to the previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to 0 if the previous row's Amount doesn't exist. The last step just enforces your conditions:
df %>%
  group_by(id) %>%
  arrange(Date) %>%
  mutate(rank = dense_rank(Date)) %>%
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  filter((diffs <= 5 & ConsecutiveSum >= 5000) | (ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000))
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
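If you only want the original columns in the result, append a select() to drop the helper columns; a sketch, assuming the pipeline result above is stored in res:

res %>%
  select(id, Date, Buyer, diffs, Amount)   # drop the rank and ConsecutiveSum helpers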
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id) and sum Amount within each id/new_id combination. Then we can filter on the criterion that the summed Amount > 5000, and use a join or semi_join to keep only the original rows belonging to the qualifying groups.
ids is a dataset that computes the total Amount for each id/new_id pair and keeps the pairs whose total (dollar) > 5000. This gives you the id and new_id combinations that meet your criteria.
library(tidyverse)

df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                          "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount = c(959, 1158, 596, 922, 922, 1849, 4193, 4256, 65, 100, 313, 99),
                 stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
df1 <- df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         tf1 = (id != lag(id, default = "0")),   # "0" (character) so the default matches id's type; TRUE on each group's first row
         tf2 = (is.na(diffs) | diffs > 5))       # TRUE when the gap exceeds 5 days (or there is no previous row)
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)      # start a new group whenever either flag is TRUE
df1
   id    Date       Buyer Amount diffs tf1   tf2   new_id
   <chr> <date>     <chr>  <dbl> <dbl> <lgl> <lgl>  <int>
 1 9     2018-11-29 John     959    NA TRUE  TRUE       1
 2 9     2018-11-29 John    1158     0 FALSE FALSE      1
 3 9     2018-11-29 John     596     0 FALSE FALSE      1
 4 5     2019-02-13 Maria    922    NA TRUE  TRUE       2
 5 5     2019-02-13 Maria    922     0 FALSE FALSE      2
 6 4     2018-06-15 Sandy   1849    NA TRUE  TRUE       3
 7 4     2018-06-20 Sandy   4193     5 FALSE FALSE      3
 8 4     2018-08-17 Sandy   4256    58 FALSE TRUE       4
 9 4     2018-08-20 Sandy     65     3 FALSE FALSE      4
10 4     2018-08-23 Sandy    100     3 FALSE FALSE      4
11 20    2018-12-25 Paul     313    NA TRUE  TRUE       5
12 20    2018-12-25 Paul      99     0 FALSE FALSE      5
ids <- df1 %>%
  group_by(id, new_id) %>%
  summarise(dollar = sum(Amount)) %>%
  ungroup() %>%
  filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)
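semi_join will message that it is joining by id and new_id; you can spell that out explicitly, and optionally drop the helper columns afterwards, for example:

df1 %>%
  semi_join(ids, by = c("id", "new_id")) %>%   # keep only rows in qualifying id/new_id groups
  select(id, Date, Buyer, diffs, Amount)       # optional: drop the tf1/tf2/new_id helpers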
I'm having a tough time wrapping my head around this or finding a guideline online.
I have membership data. I want to be able to see how many months members last before dropping their membership. I can see which month they joined, and I can see how long they've been active by looking at their transaction no (it increases by 1 each month). So if I track transaction no's for each joining month, I can get a waterfall of how many people joined that month and what the drop-off was.
The kicker is that sometimes there are multiple transactions within a month by the same member, but I would like to count that member only once.
Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2
Aggregating for distinct members with months as columns, the result would look something like this:
Transaction# | Jan | Feb | March
1 | 3 | 1 | 2
2 | 2 | 1 | 1
3 | 1 | 1 | 0
4 | 1 | 0 | 0
Any tips or pointers in the correct direction would be very helpful. Should I be using reshape2 or a similar package? Hopefully I did not butcher the explanation or the formatting, please feel free to ask any questions.
Thank you!
Below is a reproducible example that uses the tidyverse functions dplyr::n_distinct and tidyr::spread.
I have first represented your data as a tibble (a plain data frame would work equally well).
Next we group by Transactionno and JoinedMonth before counting distinct Names. To get the table format you request, we use tidyr::spread. If you want the resulting columns in month order, make sure JoinedMonth is a factor whose levels are in that order (see the note after the output below).
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
x <- tribble(
~Name , ~JoinedMonth, ~Transactionno,
"Adam" , "Jan" , 1,
"Adam" , "Jan" , 2,
"Adam" , "Jan" , 2,
"Ben" , "Jan" , 1,
"Ben" , "Jan" , 2,
"Ben" , "Jan" , 3,
"Ben" , "Jan" , 4,
"Cathy", "Jan" , 1,
"Donna", "Feb" , 1,
"Donna", "Feb" , 2,
"Donna", "Feb" , 3,
"Evan" , "Mar" , 1,
"Evan" , "Mar" , 1,
"Frank" , "Mar" , 1,
"Frank" , "Mar" , 2
)
x %>%
  group_by(Transactionno, JoinedMonth) %>%
  summarise(ct = n_distinct(Name)) %>%
  tidyr::spread(JoinedMonth, ct, fill = 0)
#> # A tibble: 4 x 4
#> # Groups: Transactionno [4]
#> Transactionno Feb Jan Mar
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1. 1. 3. 2.
#> 2 2. 1. 2. 1.
#> 3 3. 1. 1. 0.
#> 4 4. 0. 1. 0.
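As noted above, the columns come out in alphabetical order (Feb, Jan, Mar). To get calendar order, convert JoinedMonth to a factor before spreading; a sketch:

x %>%
  mutate(JoinedMonth = factor(JoinedMonth, levels = month.abb)) %>%  # Jan, Feb, Mar, ...
  group_by(Transactionno, JoinedMonth) %>%
  summarise(ct = n_distinct(Name)) %>%
  tidyr::spread(JoinedMonth, ct, fill = 0)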
1) xtabs This one-liner uses base R and the input DF shown reproducibly in the Note below. Note that we assume that Joined.Month is a factor with levels Jan, Feb, Mar to ensure that the output is sorted in that order (rather than alphabetically).
xtabs(~ Transaction.no + Joined.Month, unique(DF))
giving:
              Joined.Month
Transaction.no Jan Feb Mar
             1   3   1   2
             2   2   1   1
             3   1   1   0
             4   1   0   0
2) table Another base R approach.
with(unique(DF), table(Transaction.no, Joined.Month))
giving:
              Joined.Month
Transaction.no Jan Feb Mar
             1   3   1   2
             2   2   1   1
             3   1   1   0
             4   1   0   0
2a) This would also work and is shorter but not quite as clear:
table(unique(DF)[3:2])
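Here [3:2] selects the Transaction.no and Joined.Month columns in that order; an equivalent, more explicit spelling uses the column names:

table(unique(DF)[c("Transaction.no", "Joined.Month")])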
3) tapply This also uses only base R:
u <- unique(DF)
tapply(u[[1]], u[3:2], length, default = 0)
giving:
              Joined.Month
Transaction.no Jan Feb Mar
             1   3   1   2
             2   2   1   1
             3   1   1   0
             4   1   0   0
Note
DF in reproducible form is assumed to be:
Lines <- "Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2"
DF <- read.table(text = Lines, header = TRUE, sep = "|",
strip.white = TRUE, as.is = TRUE)
DF$Joined.Month <- factor(DF$Joined.Month, lev = month.abb[1:3])