`dplyr` count function for unique items in field - r

I have searched for this on here a few times, so apologies if this is a duplicate.
I am working with dplyr for the first time, and I am having trouble producing the result I want. If I were writing SQL, the query would look like:
select count(customer_id), sum(sales), sum(sales) / count(customer_id), *
from data_table
group by salesperson_id
In words, I want to:
group the data by salesperson
add up the total sales
count the number of unique customers
find the average sales per customer for each sales person.
I don't want to strip away "irrelevant" fields at this point, because they will become relevant in later steps.
I am getting stuck, specifically because the only counting function dplyr seems to provide, n(), doesn't take any arguments. What aggregate function should I use to count distinct items in a field?

Responding to the question: What aggregate function should I use to count distinct items in a field?
n_distinct()
See the dplyr documentation.
A broader example, though a reprex in the original question would help:
library(dplyr)

data_table %>%
  group_by(salesperson_id) %>%
  mutate(
    customers = n_distinct(customer_id),          # unique customers per salesperson
    total_sales = sum(sales),                     # named total_sales so the original sales column is not overwritten
    sales_per_customer = total_sales / customers
  ) %>%
  ungroup()
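If collapsing to one row per salesperson is fine, the same computation can use summarise() instead of mutate(); note that summarise() keeps only the grouping and summary columns:

library(dplyr)

data_table %>%
  group_by(salesperson_id) %>%
  summarise(
    customers = n_distinct(customer_id),
    total_sales = sum(sales),
    sales_per_customer = total_sales / customers
  )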

Related

Updating all columns of a table dependent on a single column

I have a table that consists only of columns of type Date. The data is about the shopping behaviour of customers on a website. Each column holds the first time a given event is triggered by a customer (NULL if the event never occurred). One of the columns is the purchase event.
I want to update the table so that, for each row, every cell whose event did not happen within the 7 days before the purchase is replaced with NULL. I'm looking for some guidance on coding this in a single statement. I've tried utilizing mutate_all() to no avail.
Assuming your data is called df, date columns start with date_ and the purchase date is purchase_date, perhaps something like this?
library(dplyr)

mutate(
  df,
  across(
    starts_with("date_"),
    ## keep only events that happened within the 7 days up to the purchase;
    ## everything else becomes NA. if_else() preserves the Date class,
    ## which base ifelse() would drop:
    ~ if_else(between(as.numeric(purchase_date - .), 0, 7), ., as.Date(NA))
  )
)
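For reference, a toy data frame (made-up column names matching the date_ pattern) to test the call against:

df <- data.frame(
  purchase_date = as.Date(c("2023-01-10", "2023-01-20")),
  date_signup   = as.Date(c("2023-01-01", "2023-01-18")),  # 9 days and 2 days before purchase
  date_cart     = as.Date(c("2023-01-09", "2023-01-25"))   # 1 day before and 5 days after purchase
)
## after the mutate() above, row 1 keeps date_cart but loses date_signup,
## and row 2 keeps date_signup but loses date_cart.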

select lowest values which sum up to 10% of total

I'm new to this place and I'm not super experienced with R, but I need it at work and I really hope you can support me.
So I have a huge data set, but I will explain the issue using a small sample.
I have already grouped my data set to achieve the layout I want.
So basically I have multiple EXCPosOutlet and EXCPPMonth names, and I need to remove the lowest values per EXCPosOutlet per EXCMonth that sum up to 10% of the total for that individual group.
So let's say that the total AveragePrice for a sample outlet for month 612 is $1000; I need to remove all rows with the lowest values of AveragePrice which sum up to $100.
If removing is messy, even creating an extra column (with mutate and ifelse, for example) that would just tell me whether a row falls under my criteria would be totally enough.
I have tried the ntile and quantile functions, but I'm not getting what I need.
Thank you so much in advance.
Let me know if I should provide more details.
One possibility is to use the dplyr package and, for legibility, the pipe operator %>%. There are other ways to the same result, but you might want to give this a try:
library(dplyr)

## generate example data:
data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
) %>%
  ## sort the data frame by outlet and (increasing) price:
  arrange(EXCPosOutlet, AveragePrice) %>%
  ## group by outlet:
  group_by(EXCPosOutlet) %>%
  ## calculate the cumulative price within each outlet:
  mutate(cumAveragePrice = cumsum(AveragePrice)) %>%
  ## select the rows to remove: the cheapest rows whose running total
  ## stays within 10% of the outlet's total (the $100 of your example):
  filter(cumAveragePrice <= 0.10 * sum(AveragePrice)) %>%
  ungroup()
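Since a flag column would also do, here is a sketch that marks the rows instead of dropping them (the column name in_bottom_10pct is made up):

library(dplyr)

data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
) %>%
  arrange(EXCPosOutlet, AveragePrice) %>%
  group_by(EXCPosOutlet) %>%
  mutate(
    cumAveragePrice = cumsum(AveragePrice),
    ## TRUE for the cheapest rows summing to at most 10% of the outlet's total:
    in_bottom_10pct = cumAveragePrice <= 0.10 * sum(AveragePrice)
  ) %>%
  ungroup()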

Count and summarise ID and purchase Date while creating a third column that reflects the number of purchases per day and customer

Good afternoon dear Community,
I am quite new to the R language, so forgive me if I am not too precise or specific with my description of the problem yet.
I have a data frame which contains two columns, the first being the ID and the second being the Date of purchase. However, some IDs appear more than once on a given Date, and I would like to summarise by ID and Date while a third column (amount of purchases) reflects the quantity of purchases.
[sample data: ID and Purchase Date columns]
Many thanks in advance.
There is an R package called dplyr that makes this kind of aggregation very easy. In your case you could summarise the data using a few lines of code.
library(dplyr)
results <- df %>%
  group_by(ID, Date) %>%
  summarise(numPurchases = n(),
            ## assumes your data also has a Quantity column; drop this
            ## line if ID and Date are really the only columns:
            totalPurchases = sum(Quantity))
df would be your input data. Your results will have the ID and Date columns, as well as a new column that counts the number of sales per ID per Date (numPurchases) and a new column that shows the total quantity of purchases per ID per date (totalPurchases). Hope that helps.
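For example, with a small made-up data frame:

library(dplyr)

df <- data.frame(
  ID   = c(1, 1, 2, 1),
  Date = as.Date(c("2021-05-01", "2021-05-01", "2021-05-01", "2021-05-02"))
)

df %>%
  group_by(ID, Date) %>%
  summarise(numPurchases = n(), .groups = "drop")
## ID 1 on 2021-05-01 gives numPurchases = 2; the other two ID/Date
## pairs give numPurchases = 1.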

Summarizing Data across age groups in R

I have data for customer purchases across different products. I calculated amount_spent by multiplying the item numbers by the respective price.
I used the cut function to segregate people into different age bins. Now how can I find the aggregate amount spent by different age groups, i.e. the contribution of each age group in terms of dollars spent?
Please let me know if you need any more info.
I am really sorry that I can't paste the data here due to remote desktop constraints. I am actually concerned with the result I got after the summarise step.
library(dplyr)

## summarise_each() and funs() are deprecated; sum the column directly:
customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent))
Though I am not sure if you want the contribution to the whole pie or just the sum in each age group.
If your data is of class data.table you could go with
library(data.table)
customer_transaction[, .(amount_spent = sum(amount_spent)), by = age_gr]
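If the contribution to the whole pie is what's needed, a sketch that adds a proportion column to the dplyr summary (the column name share is made up):

library(dplyr)

customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent)) %>%
  ## divide each group's total by the grand total:
  mutate(share = amount_spent / sum(amount_spent))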

Create variable based on counts of groups and sub groups in data table

I have many student records. I need to create two new variables. One should display the count of Unitcode (i.e. enrolments) for each student_ID for each Year.
The other should display the count of fails (i.e. Grade == 'Fail') for each student_ID for each Year. See the example records for three students below:
library(data.table)

student_ID <- c(rep("1001", 8), rep("1002", 3), rep("1005", 11))
Year <- c(rep(2011, 4), rep(2012, 4), 2011, 2012, 2013, rep(2011, 4), rep(2012, 3), rep(2013, 4))
Grade <- c(rep("Fail", 2), rep("Pass", 3), rep("Fail", 3), rep("Pass", 7), rep("Fail", 2), rep("Pass", 5))
Unitcode <- 1201:1222
record <- data.table(student_ID, Year, Grade, Unitcode)
If someone could assist with creating these count variables, that would be greatly appreciated.
A similar option using dplyr would be
library(dplyr)
record %>%
  group_by(student_ID, Year) %>%
  summarise(unitcodes = n(), fails = sum(Grade == 'Fail'))
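Since record is already a data.table, the same counts can be appended as new columns to every record (rather than collapsed to one row per group) with := ; a sketch, with the column names enrolments and fails being my own:

library(data.table)

record[, `:=`(
  enrolments = .N,                     # rows (unit codes) per student per year
  fails      = sum(Grade == "Fail")    # failed units per student per year
), by = .(student_ID, Year)]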
