I am very new to R and sqldf and can't seem to solve a basic problem. I have a file with transactions where each row represents a product purchased.
The file looks like this:
customer_id,order_number,order_date, amount, product_name
1, 202, 21/04/2015, 58, "xlfd"
1, 275, 16/08/2015, 74, "ghb"
1, 275, 16/08/2015, 36, "fjk"
2, 987, 12/03/2015, 27, "xlgm"
3, 376, 16/05/2015, 98, "fgt"
3, 368, 30/07/2015, 46, "ade"
I need to find the maximum amount spent in a single transaction (same order_number) by each customer_id. For example, for customer_id 1 it would be 74 + 36 = 110.
If sqldf is not a strict requirement, and taking your input as dft, you can try:
require(dplyr)
require(magrittr)
dft %>%
group_by(customer_id, order_number) %>%
summarise(amt = sum(amount)) %>%
group_by(customer_id) %>%
summarise(max_amt = max(amt))
which gives:
Source: local data frame [3 x 2]
Groups: customer_id [3]
customer_id max_amt
<int> <int>
1 1 110
2 2 27
3 3 98
Assuming the data frame is named orders, the following gives the total amount per order:
sqldf("select customer_id, order_number, sum(amount)
from orders
group by customer_id, order_number")
Update: using a nested query, the following gives the desired output:
sqldf("select customer_id, max(total)
from (select customer_id, order_number, sum(amount) as total
from orders
group by customer_id, order_number)
group by customer_id")
Output:
customer_id max(total)
1 1 110
2 2 27
3 3 98
We can also use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), get the sum of 'amount' grouped by 'customer_id' and 'order_number', then group a second time by 'customer_id' and take the max of 'Sumamount'.
library(data.table)
setDT(df1)[, .(Sumamount = sum(amount)) , .(customer_id, order_number)
][,.(MaxAmount = max(Sumamount)) , customer_id]
# customer_id MaxAmount
#1: 1 110
#2: 2 27
#3: 3 98
Or, more compactly: after grouping by 'customer_id', split 'amount' by 'order_number', sum each piece of the resulting list, and take the max of those sums as 'MaxAmount'.
setDT(df1)[, .(MaxAmount = max(unlist(lapply(split(amount,
order_number), sum)))), customer_id]
# customer_id MaxAmount
#1: 1 110
#2: 2 27
#3: 3 98
Or using base R
aggregate(amount~customer_id, aggregate(amount~customer_id+order_number,
df1, sum), FUN = max)
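The nested aggregate() call can be easier to follow when split into two steps. A minimal sketch, assuming the sample transactions are retyped by hand as a data frame (the names orders, per_order and per_customer are illustrative, not from the question):

```r
# Sample transactions from the question, typed in by hand (dates and names omitted)
orders <- data.frame(
  customer_id  = c(1, 1, 1, 2, 3, 3),
  order_number = c(202, 275, 275, 987, 376, 368),
  amount       = c(58, 74, 36, 27, 98, 46)
)

# Step 1: total amount per order
per_order <- aggregate(amount ~ customer_id + order_number, data = orders, FUN = sum)

# Step 2: largest order total per customer
per_customer <- aggregate(amount ~ customer_id, data = per_order, FUN = max)
per_customer
```

The inner result has one row per order, so the outer call only has to pick the largest row for each customer.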
Related
I have the following dataframe:
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)
I would like to count the distinct number of user_id for each column, excluding the NA values. The output I am hoping for is:
user_id event_id plan_id treatment_id system
1 4 4 3 4 2
I tried to leverage mutate_all, but that was unsuccessful because my data frame is too large. In other functions, I've used the following two lines of code to get the nonnull count and the count distinct for each column:
colSums(!is.empty(df[,]))
apply(df[,], 2, function(x) length(unique(x)))
Optimally, I would like to combine the two with an ifelse to minimize the mutations, as this will ultimately be thrown into a function to be applied with a number of other summary statistics to a list of data frames.
I have tried a brute-force method, where I make the values 1 if not null and 0 otherwise, and then copy the id to that column if it is 1. I can then just use the count distinct line from above to get my output. However, I get the wrong values when copying the id into the other columns, and the number of adjustments is suboptimal. See code:
binary <- cbind(df$user_id, !is.empty(df[,2:length(df)]))
copied <- binary %>% replace(. > 0, binary[.,1])
I'd greatly appreciate your help.
1: Base
sapply(df, function(x){
length(unique(df$user_id[!is.na(x)]))
})
# user_id event_id plan_id treatment_id system
# 4 4 3 3 2
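To see what the sapply() call computes, it can help to trace a single column by hand. A sketch using plan_id from the question's data (the intermediate names keep and users are illustrative):

```r
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)

# Rows where plan_id is present
keep <- !is.na(plan_id)

# Users that appear on those rows, with duplicates dropped
users <- unique(user_id[keep])
users          # 97 96 94
length(users)  # 3
```

sapply() simply repeats this for every column of df, always indexing user_id with that column's non-NA mask.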
2: Base
aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
# ind user_id
#1 user_id 4
#2 event_id 4
#3 plan_id 3
#4 treatment_id 3
#5 system 2
3: tidyverse
library(dplyr)
library(tidyr)
df %>%
mutate(key = user_id) %>%
pivot_longer(!key) %>%
filter(!is.na(value)) %>%
group_by(name) %>%
summarise(value = n_distinct(key)) %>%
pivot_wider()
## A tibble: 1 x 5
# event_id plan_id system treatment_id user_id
# <int> <int> <int> <int> <int>
#1 4 3 2 3 4
Thanks @dcarlson, I had misunderstood the question:
apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})
A data.table option with uniqueN
> setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
user_id event_id plan_id treatment_id system
1: 4 4 3 3 2
Using dplyr you can use summarise with across :
library(dplyr)
df %>% summarise(across(everything(), ~n_distinct(user_id[!is.na(.x)])))
# user_id event_id plan_id treatment_id system
#1 4 4 3 3 2
I have a simple dataframe in R
df1 <- data.frame(
questionID = c(1,1,3,4,5,5),
userID = c(101, 101, 102, 101, 102,101),
Value=c(10,20,30,40,50,10))
The basic idea is to have a column that indicates the sum of value for a user on questions they asked before (lower number questions).
I tried using this function (after trying the pipe of sum which just gave errors about non-numeric that everybody seems to face)
f2 <- function(x){
Value_out <- filter(df1,questionID<x['questionID'] & userID == x['userID'] ) %>%
select(Value) %>%
summarize_if(is.numeric, sum, na.rm=TRUE)
}
out=mutate(df1,Expert=apply(df1, 1,f2))
While this works if you print it out, the Expert column is saved as a list of data frames. All I want is an int or numeric holding the sum of Value. Is there any way to do this? By the way, yes, I've looked all over for ways to do this, with most answers just summarizing the column in a manner that won't work for me.
I think I would avoid writing my own function altogether and use data.table on this one. You can do what you want in just a couple of lines, and I wouldn't be surprised if there were a way to golf this down to fewer lines.
Given your same data, we create a data.table object:
library(data.table)
dt <- data.table(
questionID = c(1,1,3,4,5,5),
userID = c(101, 101, 102, 101, 102,101),
Value=c(10,20,30,40,50,10))
Next, we shift our values by 1 (lag) within each userID:
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
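Finally, we take the cumulative sum of those lagged values by userID. Because a user can have several rows for the same questionID, we then replace Expert with its minimum within each userID x questionID group, so that values from the same question are not counted. For a user's first question that minimum is 0, because we used fill = 0 in shift above. This relies on the rows being ordered by questionID within each user, as they are here: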
dt[,
Expert := cumsum(lastVal),
by = .(userID)][,
Expert := min(Expert),
by = .(userID, questionID)]
So, putting that all together, we have:
library(data.table)
dt <- data.table(
questionID = c(1,1,3,4,5,5),
userID = c(101, 101, 102, 101, 102,101),
Value=c(10,20,30,40,50,10))
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
dt[,
Expert := cumsum(lastVal),
by = .(userID)][,
Expert := min(Expert),
by = .(userID, questionID)]
dt
questionID userID Value lastVal Expert
1: 1 101 10 0 0
2: 1 101 20 10 0
3: 3 102 30 0 0
4: 4 101 40 20 30
5: 5 102 50 30 30
6: 5 101 10 40 70
Using dplyr and purrr::map_dbl one approach would be to group_by userID and sum Value for each questionID which is less than current value.
library(dplyr)
df1 %>%
group_by(userID) %>%
mutate(Expert = purrr::map_dbl(questionID, ~sum(Value[questionID < .x])))
# questionID userID Value Expert
# <dbl> <dbl> <dbl> <dbl>
#1 1 101 10 0
#2 1 101 20 0
#3 3 102 30 0
#4 4 101 40 30
#5 5 102 50 30
#6 5 101 10 70
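For comparison, the requirement can also be written straight from its definition in base R: for each row, sum Value over the rows with the same userID and a smaller questionID. This is only a sketch; it compares every row with every other row, so it suits small data, but it returns a plain numeric column rather than a list of data frames:

```r
df1 <- data.frame(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID     = c(101, 101, 102, 101, 102, 101),
  Value      = c(10, 20, 30, 40, 50, 10)
)

# For row i, add up Value on the same user's earlier (lower-numbered) questions
df1$Expert <- vapply(seq_len(nrow(df1)), function(i) {
  earlier <- df1$userID == df1$userID[i] & df1$questionID < df1$questionID[i]
  sum(df1$Value[earlier])
}, numeric(1))

df1$Expert  # 0 0 0 30 30 70
```

Unlike the lag-and-cumsum approach, this does not depend on the order of the rows.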
I have the following data set and I would like to identify the product with the highest amount per customer_ID and convert it into a new column. I also want to keep only one record per ID.
Data to generate the data set:
x <- data.frame(customer_id=c(1,1,1,2,2,2),
product=c("a","b","c","a","b","c"),
amount=c(50,125,100,75,110,150))
Actual data set looks like this:
customer_id product amount
1 a 50
1 b 125
1 c 100
2 a 75
2 b 110
2 c 150
The desired output should look like this:
customer_ID product_b product_c
1 125 0
2 0 150
We can do this with tidyverse. After grouping by 'customer_id', slice the row that has the maximum 'amount', paste with prefix ('product_') to 'product' column (if needed) and spread to wide format
library(dplyr)
library(tidyr)
x %>%
group_by(customer_id) %>%
slice(which.max(amount)) %>%
mutate(product = paste0("product_", product)) %>%
spread(product, amount, fill = 0)
# customer_id product_b product_c
#* <dbl> <dbl> <dbl>
#1 1 125 0
#2 2 0 150
Another option is to arrange the dataset by 'customer_id' and by 'amount' in descending order, get the distinct rows based on 'customer_id', and spread to 'wide' format using 'product' as the key
arrange(x, customer_id, desc(amount)) %>%
distinct(customer_id, .keep_all = TRUE) %>%
spread(product, amount, fill = 0)
Using reshape2 package,
library(reshape2)
x1 <- x[!!with(x, ave(amount, customer_id, FUN = function(i) i == max(i))),]
dcast(x1, customer_id ~ product, value.var = 'amount', fill = 0)
# customer_id b c
#1 1 125 0
#2 2 0 150
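The same two steps (keep each customer's top row, then go wide) can be sketched without extra packages. This is only an illustration, assuming base R's ave() and xtabs() are acceptable substitutes; the names top and wide are made up here, and the customer ids end up as row names rather than as a column:

```r
x <- data.frame(customer_id = c(1, 1, 1, 2, 2, 2),
                product = c("a", "b", "c", "a", "b", "c"),
                amount = c(50, 125, 100, 75, 110, 150),
                stringsAsFactors = FALSE)

# Keep the row(s) whose amount equals the maximum within each customer
top <- x[x$amount == ave(x$amount, x$customer_id, FUN = max), ]

# Cross-tabulate: one row per customer, one column per winning product, 0 elsewhere
wide <- as.data.frame.matrix(xtabs(amount ~ customer_id + product, data = top))
wide
```

Setting stringsAsFactors = FALSE keeps product as character, so the unused product "a" does not show up as an all-zero column.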
Here is data similar to what I am using:
df <- data.frame(Name=c("Joy","Jane","Jane","Joy"),Grade=c(40,20,63,110))
Name Grade
1 Joy 40
2 Jane 20
3 Jane 63
4 Joy 110
Agg <- ddply(df, .(Name), summarize,Grade= max(Grade))
Name Grade
1 Jane 63
2 Joy 110
As the grade cannot be greater than 100, I need 40 as the value for Joy and not 110. Basically, I want to exclude all values greater than 100 while summarizing. I can create a new data frame by excluding those values and then apply the ddply function, but I would like to know if I can do it on my original data frame. Thanks in advance.
Using ddply, we can use the logical condition to subset the values of 'Grade'
library(plyr)
ddply(df, .(Name), summarise, Grade = max(Grade[Grade <=100]))
# Name Grade
#1 Jane 63
#2 Joy 40
Or with dplyr, we filter the "Grade" that are less than or equal to 100, then grouped by "Name", get the max of "Grade"
library(dplyr)
df %>%
filter(Grade <= 100) %>%
group_by(Name) %>%
summarise(Grade = max(Grade))
# Name Grade
# <fctr> <dbl>
#1 Jane 63
#2 Joy 40
Or instead of the filter, we can create the logical condition in summarise
df %>%
group_by(Name) %>%
summarise(Grade = max(Grade[Grade <=100]))
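One edge case with putting the condition inside max(): if every value in a group exceeds 100, max() is called on an empty vector and returns -Inf with a warning. A sketch of one possible guard in base R, where safe_max is a hypothetical helper and the extra "Sam" row is made up to trigger the case:

```r
df <- data.frame(Name  = c("Joy", "Jane", "Jane", "Joy", "Sam"),
                 Grade = c(40, 20, 63, 110, 120))

# Hypothetical helper: max of the values at or below cap, NA if none qualify
safe_max <- function(v, cap = 100) {
  ok <- v[v <= cap]
  if (length(ok) == 0) NA_real_ else max(ok)
}

# One result per Name; Sam has no valid grade and gets NA instead of -Inf
res <- sapply(split(df$Grade, df$Name), safe_max)
res
```

The filter-first variants avoid the warning in a different way: a group with no valid rows simply drops out of the result.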
Or with data.table, convert the 'data.frame' to 'data.table' (setDT(df)), create the logical condition (Grade <= 100) in 'i', grouped by "Name", get the max of "Grade".
library(data.table)
setDT(df)[Grade <= 100, .(Grade = max(Grade)), by = Name]
# Name Grade
#1: Joy 40
#2: Jane 63
Or using sqldf
library(sqldf)
sqldf("select Name,
max(Grade) as Grade
from df
where Grade <= 100
group by Name")
# Name Grade
#1 Jane 63
#2 Joy 40
In base R, another variant of aggregate would be
aggregate(Grade ~ Name, df, subset = Grade <= 100, max)
# Name Grade
#1 Jane 63
#2 Joy 40
You can also use base R aggregate for the same
aggregate(Grade ~ Name, df[df$Grade <= 100, ], max)
# Name Grade
#1 Jane 63
#2 Joy 40
I have a dataframe with some numbers(score) and repeating ID. I want to get the maximum value for each of the ID.
I used this function
top = aggregate(df$score, list(df$ID),max)
This returned me a top dataframe with maximum values corresponding to each ID.
But it so happens that for one of the IDs, we have two EQUAL max values, and this function drops the second one.
Is there any way to retain BOTH max values?
For Example:
df
ID score
1 12
1 15
1 1
1 15
2 23
2 12
2 13
The above function gives me this:
top
ID Score
1 15
2 23
I need this:
top
ID Score
1 15
1 15
2 23
I recommend data.table as Chris mentioned (good for speed, but steeper learning curve).
Or if you don't want data.table you could use plyr:
library(plyr)
ddply(df, .(ID), subset, score==max(score))
# same as ddply(df, .(ID), function (x) subset(x, score==max(score)))
You can convert to a data.table:
library(data.table)
DT <- as.data.table(df)
DT[, .SD[score == max(score)], by=ID]
Here is a dplyr solution.
library(dplyr)
df %>%
group_by(ID) %>%
filter(score == max(score))
Otherwise, to build on what you have done, we can use a sneaky property of merge on your "top" dataframe, see the following example:
df1 <- data.frame(ID = c(1,1,5,2), score = c(5,5,2,6))
top_df <- data.frame(ID = c(1,2), score = c(5,6))
merge(df1, top_df)
which gives:
ID score
1 1 5
2 1 5
3 2 6
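The property in question is that merge() joins by default on every column name the two data frames share, here both ID and score, so only rows that match a group maximum survive. A sketch applying this to the question's data, assuming top is built with the formula interface so that its columns keep the names ID and score (the call in the question would name them Group.1 and x instead):

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 score = c(12, 15, 1, 15, 23, 12, 13))

# One row per ID holding that ID's maximum score
top <- aggregate(score ~ ID, data = df, FUN = max)

# Default merge: inner join on all shared columns (ID and score)
both <- merge(df, top)
both
```

Both rows with score 15 for ID 1 are kept, because each of them matches the single (1, 15) row in top.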
Staying with a data.frame:
df[unlist(by(df, df$ID, FUN=function(D) rownames(D)[D$score == max(D$score)] )),]
# ID score
#2 1 15
#4 1 15
#5 2 23
This works because by splits df into a list of data.frames on the basis of df$ID, but retains the original rownames of df ( see by(df, df$ID, I) ). Therefore, returning the rownames of each D subset corresponding to a max score value in each group can still be used to subset the original df.
A simple base R solution:
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
score = c(12, 15, 1, 15, 23, 12, 13))
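Several options (all of them match on score alone, so they assume that one group's maximum never occurs as a non-maximal score in another group):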
df[df$score %in% tapply(df$score, df$ID, max), ]
df[df$score %in% aggregate(score ~ ID, data = df, max)$score, ]
df[df$score %in% aggregate(df$score, list(df$ID), max)$x, ]
Output:
ID score
2 1 15
4 1 15
5 2 23
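Because %in% compares each score against the pooled set of group maxima, a score that is the maximum of one ID but not of its own ID is kept as well. A sketch of the difference, using a made-up variation of the data (df2, where one score of ID 2 is changed to 15) and ave() as one way to compare each score with its own group's maximum:

```r
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                  score = c(12, 15, 1, 15, 23, 15, 13))  # row 6 changed to 15

# Pooled comparison: also keeps row 6, since 15 is the maximum of ID 1
pooled <- df2[df2$score %in% tapply(df2$score, df2$ID, max), ]

# Per-group comparison: ave() repeats each group's maximum along its rows
by_group <- df2[df2$score == ave(df2$score, df2$ID, FUN = max), ]

rownames(pooled)    # "2" "4" "5" "6"
rownames(by_group)  # "2" "4" "5"
```

On the original data both versions return the same three rows.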
Using sqldf:
library(sqldf)
sqldf('SELECT df.ID, score FROM df
JOIN (SELECT ID, MAX(score) AS score FROM df GROUP BY ID)
USING (ID, score)')
Output:
ID score
2 1 15
4 1 15
5 2 23