I am trying to compute total sales, and the end result should include both product id and product type, but I keep getting only one column (product type, with its sales).
How can I get the product id in my end result?
products %>%
  filter(str_detect(product_type, regex("pizza", ignore_case = TRUE))) %>%
  inner_join(get_transactions(), by = "product_id") %>%
  group_by(product_type) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))
Group by
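Try adding the second criterion to the grouping as well, i.e. group_by(product_id, product_type). summarize() keeps only the grouping columns and the new summary columns, so maybe that will solve the issue.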
It is difficult to understand what you get as a result of your code. Could you add a picture?
Maybe the problem is that the filter restricts the result to only one output?
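Here is a minimal sketch of the grouping suggestion, assuming product_id simply needs to be part of the grouping. The toy `products` and `transactions` tables are made up for illustration; `transactions` stands in for whatever get_transactions() returns.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the real tables.
products <- tibble(
  product_id   = c("p1", "p2", "p3"),
  product_type = c("FROZEN PIZZA", "Pizza Sauce", "SOFT DRINKS")
)
transactions <- tibble(
  product_id  = c("p1", "p1", "p2", "p3"),
  sales_value = c(5, 7, 3, 2)
)

pizza_sales <- products %>%
  filter(str_detect(product_type, regex("pizza", ignore_case = TRUE))) %>%
  inner_join(transactions, by = "product_id") %>%
  # Grouping by both columns keeps both of them in the summarised output.
  group_by(product_id, product_type) %>%
  summarize(total_sales = sum(sales_value), .groups = "drop") %>%
  arrange(desc(total_sales))
```

The result has three columns (product_id, product_type, total_sales) with one row per product.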
Related
I have a line of code that calculates the maximum value for a number of products
data2019 %>%
group_by(PRODUCT) %>%
summarise(max_amt = max(AMOUNT))
I want to then count the number of rows where AMOUNT == max_amt for that particular product, but if I try to wrap it in a count or sum function it gives me the max value for the whole set, and the total number of rows for each product, which isn't very helpful, especially as the values vary considerably. How can I get it to produce the answer for each specific product?
You can do a count on condition by writing your summarize like sum(CONDITION). Like so:
data2019 %>%
group_by(PRODUCT) %>%
summarize(max_count = sum(AMOUNT == max(AMOUNT)))
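To see why this works per product, here is a small sketch on made-up data (only the column names PRODUCT and AMOUNT come from the question). It also keeps the maximum itself next to the count.

```r
library(dplyr)

# Hypothetical toy data; PRODUCT and AMOUNT mirror the question's column names.
data2019 <- tibble(
  PRODUCT = c("A", "A", "A", "B", "B"),
  AMOUNT  = c(10, 10, 4, 7, 2)
)

max_counts <- data2019 %>%
  group_by(PRODUCT) %>%
  summarize(
    max_amt   = max(AMOUNT),
    # Inside a grouped summarize(), max_amt is the per-product maximum, so the
    # comparison is TRUE once for every row of that product that reaches it.
    max_count = sum(AMOUNT == max_amt)
  )
```

Summing a logical vector counts its TRUE values, which is why sum(CONDITION) acts as a conditional count.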
I have a calculation that I have to perform for 23 people (they have a varying number of rows allocated to each person, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 1) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
Subj2 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 2) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber) {
  Subj <- rtTrialsYA_s1 %>%
    select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
    filter(Subject == subjectNumber) %>%
    summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
    mutate(testDuration = testEnd - testStart) %>%
    mutate(timeBin = testDuration / 5)
  return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output looks like.
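In the meantime, here is one possible sketch of looping over the subjects, assuming the data looks roughly like the made-up table below. The helper name summarise_subject and the toy values are hypothetical; only the column names come from the question.

```r
library(dplyr)

# Hypothetical toy data using the column names from the question.
rtTrialsYA_s1 <- tibble(
  Subject             = c(1, 1, 2, 2, 2),
  RetRating.OnsetTime = c(100, 300, 50, 250, 450),
  RetRating.RT        = c(80, 120, 90, 60, 100),
  RetRating.RTTime    = c(180, 420, 140, 310, 550)
)

summarise_subject <- function(subjectNumber) {
  rtTrialsYA_s1 %>%
    filter(Subject == subjectNumber) %>%
    summarise(
      testStart = min(RetRating.OnsetTime),
      testEnd   = max(RetRating.RTTime)
    ) %>%
    mutate(
      Subject      = subjectNumber,  # keep track of whose row this is
      testDuration = testEnd - testStart,
      timeBin      = testDuration / 5
    )
}

# Apply the function to every subject and stack the one-row results.
all_subjects <- bind_rows(
  lapply(unique(rtTrialsYA_s1$Subject), summarise_subject)
)
```

This gives one row per subject in a single data frame, so nothing has to be filtered by hand.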
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The code above takes all columns except Subject and pivots them into two columns: one containing the former column name and one containing the former value.
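To make the reshaped result concrete, here is a small sketch on a made-up per-subject summary. The values are invented, and the output column names "measure" and "value" are my own choice, set through the names_to and values_to arguments.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-subject summary with the same columns as the summarised df.
df <- tibble(
  Subject      = c(1, 2),
  testStart    = c(100, 50),
  testEnd      = c(420, 550),
  testDuration = c(320, 500),
  timeBin      = c(64, 100)
)

# One row per Subject/measure pair instead of one column per measure.
df_long <- df %>%
  pivot_longer(-Subject, names_to = "measure", values_to = "value")
```

With two subjects and four calculated columns, df_long has eight rows and three columns.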
Here I have the code with a for loop:
for (i in 1:length(mc_1$code)) {
  cmc1 = mc_1$code[i]
  cmc2 = mc_1[mc_1$code == cmc1, ]
  cmc3 = cmc2[order(cmc2[, 2], cmc2[, 3]), ]
  mc_1[mc_1$code == cmc1, ]$region = last(cmc3$region)
}
For each value in the variable "code", mc_1 has a different number of rows. mc_1 also has year and month columns (columns 2 and 3) and another column, say, region. "region" can differ even for the same "code" at a different month and year.
For each "code", I want to select only the most recent region by month and year (that's why I use "order") and assign that region to all the rows for that code.
This for loop works. But for efficiency and code length, how can I rewrite it better using something like data.table or dplyr?
You can try this using the dplyr package and the fact that n() returns the number of rows in each group, so region[n()] is the last region within each code once the rows are sorted by year and month:
mc_1 %>%
  group_by(code) %>%
  arrange(year, month) %>%
  mutate(region = region[n()])
Hope it helps!
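If you would rather not reorder the rows, a possible variant is to pick the most recent row directly with which.max(). This is only a sketch on made-up data, and it assumes year and month are numeric.

```r
library(dplyr)

# Hypothetical toy data with the columns described in the question.
mc_1 <- tibble(
  code   = c("a", "a", "a", "b", "b"),
  year   = c(2019, 2020, 2020, 2018, 2021),
  month  = c(12, 3, 1, 6, 2),
  region = c("north", "south", "east", "west", "north")
)

mc_1_latest <- mc_1 %>%
  group_by(code) %>%
  # year * 12 + month increases with calendar time, so which.max() finds the
  # most recent row of each code without sorting the data first.
  mutate(region = region[which.max(year * 12 + month)]) %>%
  ungroup()
```

If two rows of a code share the same latest year and month, which.max() takes the first of them.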
There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable describing the number of websites whose click tracking = T in each month / the number of all websites in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then do a group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column - TRUE/FALSE) in mutate
library(dplyr)
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(click_tracking))
If the column is a factor, convert it to logical with as.logical
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(as.logical(click_tracking)))
If it is to create a summarised output
df %>%
group_by(Date) %>%
summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need to do a comparison on a logical column. With aggregate, the data also has to be passed in:
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(data=df, click_tracking ~ Date, mean)
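For comparison, here is a rough dplyr sketch that returns the count, the total and the proportion together. The data is made up, and it assumes click_tracking is logical and that each website appears once per month, so that n() counts websites.

```r
library(dplyr)

# Hypothetical toy data with the three columns from the question.
df <- tibble(
  website        = c("a.com", "b.com", "c.com", "a.com", "b.com"),
  Date           = c("2020 01", "2020 01", "2020 01", "2020 02", "2020 02"),
  click_tracking = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

monthly <- df %>%
  group_by(Date) %>%
  summarise(
    n_tracking    = sum(click_tracking),   # websites with click tracking
    n_websites    = n(),                   # all websites in that month
    prop_tracking = mean(click_tracking)   # same as n_tracking / n_websites
  )
```

If a website can appear several times in a month, n_distinct(website) would be the safer denominator.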
Being more or less a beginner in R, I have a quick question. Indeed, I would like to attach a series of elements (country number) to different categories (n°id). The idea is as follows: as soon as a country number belongs 3 times in a row to a certain id number, it is attached to this id number. Here is a simplified example below:
[Two example tables, "Starting database" and "Desired outcome", were shown here but are not preserved.]
I think I can do this using the R program, although I couldn't find similar questions on the different forums.
Thank you very much for your help,
Gauthier
Assuming an n-n relationship between country number and id (i.e. each country can have 0-n IDs and each ID can be tied to 0-n countries), here is one solution. Note that it counts how often each pair occurs in total, not how often it occurs in a row:
library(dplyr)
dataframe %>%
  mutate(Count = 1) %>%
  # Backticks refer to the columns; quoted strings would group by constants.
  group_by(`Country number`, `n°id`) %>%
  summarise(Count = sum(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(Count >= 3) %>%
  select(-Count)
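If "3 times in a row" is meant literally, a run-length approach may be closer to what you want. This is only a sketch: the original tables are not available, so the short column names country and id, the row order and the helper run_length are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data; rows are assumed to be in the relevant order already.
dataframe <- tibble(
  country = c(1, 1, 1, 1, 2, 2, 2, 2),
  id      = c(10, 10, 10, 20, 30, 40, 30, 30)
)

# For each element of x, the length of the run of equal values it belongs to.
run_length <- function(x) {
  r <- rle(x)
  rep(r$lengths, r$lengths)
}

attached <- dataframe %>%
  group_by(country) %>%
  mutate(run = run_length(id)) %>%
  ungroup() %>%
  # Keep only country/id pairs that occur at least 3 times consecutively.
  filter(run >= 3) %>%
  distinct(country, id)
```

In this toy data country 2 has id 30 three times in total but never three times consecutively, so only country 1 with id 10 is kept.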