Sum is not computed over groups (always gives the absolute total) [duplicate]

This question already has an answer here:
Summarising giving same value for each group (1 answer)
Closed last year.
I'm creating some summary tables and I'm having a hard time with simple sums...
While the count of records is correct, the variables with sums always compute the same value for all groups.
This is the code:
SummarybyCallContext <- PSTNRecords %>%
  group_by(PSTNRecords$destinationContext) %>%
  summarize(
    Calls = n(),
    Minutes = sum(PSTNRecords$durationMinutes),
    Charges = sum(PSTNRecords$charge),
    Fees = sum(PSTNRecords$connectionCharge)
  )
SummarybyCallContext
And this is the result:
Minutes and Charges should be different for each group (Fees is always zero, but I need to display it anyway in the table).
Setting na.rm to TRUE or FALSE doesn't seem to change the result.
What am I doing wrong?
Thanks in advance!
~Alienvolm

(Almost) Never use PSTNRecords$ within dplyr verbs in a pipeline that starts from PSTNRecords. Why? With the $-indexing, every reference points to the original data, before any grouping, filtering, adding/changing of columns, or rearranging is done. Without the $-referencing, dplyr uses the columns as they appear at that point in the pipeline.
SummarybyCallContext <- PSTNRecords %>%
  group_by(destinationContext) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )
There are exceptions to this, but they are rare and, for the vast majority of new dplyr users, generally better handled via other mechanisms.
Demonstration:
dat <- data.frame(x = 1:5)
dat %>%
  filter(dat$x > 2) %>%      # this still works okay, since `dat` and the "data now" are the same
  summarize(x2 = dat$x[1])   # however, `dat` has 5 rows while the data in the pipe now has 3
#   x2
# 1  1
dat %>%
  filter(x > 2) %>%
  summarize(x2 = x[1])
#   x2
# 1  3
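As one example of such an "other mechanism": if you genuinely need a grand total across all groups inside a grouped summary, compute it in an ungrouped step rather than reaching back with $. A minimal sketch, assuming the same PSTNRecords columns used above:
library(dplyr)
# A sketch, assuming PSTNRecords has destinationContext and durationMinutes.
# Per-group minutes plus each group's share of the overall total,
# without referencing PSTNRecords$ inside the pipeline:
SummarybyCallContext <- PSTNRecords %>%
  group_by(destinationContext) %>%
  summarize(Minutes = sum(durationMinutes), .groups = "drop") %>%
  mutate(ShareOfTotal = Minutes / sum(Minutes))  # sum() here sees the summarized column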

Related

Reordering columns in a table according to a previously confirmed vector

I am trying to create a replicable piece of data extraction and I'm having an issue with organising the columns.
I have a large database of aggregates, and there are various checks performed against them.
There are around 500 aggregates, each with its own code, and a series of checks which boil down to A + B - C = 0.
I have a number of vectors of the form aggregategroup1 <- c(aggregate1, aggregate2, aggregate3), repeated nearly 100 times. The way the vectors are set out means that the above check would be applied as "aggregate1 + aggregate2 - aggregate3".
The problem is that the aggregate codes are just numbers/letters, so when I extract the data the columns don't come out in the order I want, namely the order they appear in the vector. It might come out like "Date, Aggregate 2, Aggregate 3, Aggregate 1", and then I can't apply the check consistently.
The code I am using so far is the following. SQLconn is a connection from R to the SQL database where the data is housed.
dplyr::tbl(SQLconn, "AGGREGATE_RESULT_LATEST") %>%
  dplyr::select(one_of("AGGREGATE_NAME", "REPORTING_DATE", "ADJUSTED_POSITION")) %>%
  dplyr::filter(
    AGGREGATE_NAME %in% aggregate1,
    REPORTING_DATE %in% databaseConnectR::as_oracle_date(range2)
  ) %>%
  collect() %>%
  pivot_wider(names_from = "AGGREGATE_NAME", values_from = "ADJUSTED_POSITION") %>%
  mutate("Check" = (aggregate1check[2] + aggregate1check[4] - aggregate1check[3]))
The table extracted up until the collect() line:

Aggregate Name   Date   Position
Code 1           2021   123
Code 2           2021   123
Code 3           2021   123
And then after the pivot/mutate lines:

Date   Code1   Code3   Code 2
2021   123     123     123
I just want to be able to reorder the columns according to the order they appear in the original vector, which would be code1, code2, code3, so that the mutate command can work consistently and I don't need to check that the columns have come out correctly.
They are not coming out in the "wrong" order consistently either, so I can't just adjust for that.
I tried adding the following subset as a pipeline step, but that didn't work. I also tried the same thing with the vector "aggregate1" turned into a list, but that didn't work either.
%>%
  pivot_wider(names_from = "AGGREGATE_NAME", values_from = "ADJUSTED_POSITION") %>%
  subset("REPORTING_DATE", aggregate1) %>%
  mutate("Check" = (GT_32D_check[2] + GT_32D_check[4] - GT_32D_check[3]))
I have replaced the subset line with a relocate function.
I created a new vector which is the same as "aggregate1", but with "REPORTING_DATE" included at the start. I was then able to reorder the columns using the following lines:
pivot_wider(names_from = "AGGREGATE_NAME", values_from = "ADJUSTED_POSITION") %>%
  relocate(aggregate1list) %>%
  mutate("Check" = (aggregate1[2] + aggregate1[3]) - aggregate1[4])
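For anyone hitting the same problem: dplyr::relocate() accepts tidyselect helpers, so a character vector of column names can be passed through any_of(), which also tolerates names that are missing from the data. A minimal sketch with made-up column names and values (wide, desired_order, and the data are illustrative, not from the question):
library(dplyr)

# Illustrative data: columns come out of pivot_wider() in an arbitrary order
wide <- data.frame(code3 = 3, REPORTING_DATE = "2021-12-31", code1 = 1, code2 = 2)

desired_order <- c("REPORTING_DATE", "code1", "code2", "code3")

wide %>%
  relocate(any_of(desired_order))  # reorders columns to match the vector
#   REPORTING_DATE code1 code2 code3
# 1     2021-12-31     1     2     3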

I need to use a loop in R but don't know where to start

I have a calculation that I have to perform for 23 people (they each have a varying number of rows, so it's difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction times in more detail.
I could do this by hand, but it would take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process, even just a little. I have tried to understand the examples, but I'm afraid I don't have the skill. By hand I would do it like below, where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 1) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
Subj2 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 2) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
  Subj <- rtTrialsYA_s1 %>%
    select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
    filter(Subject == subjectNumber) %>%
    summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
    mutate(testDuration = testEnd - testStart) %>%
    mutate(timeBin = testDuration / 5)
  return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
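That said, one common pattern is to apply the function over every subject and stack the results. A minimal sketch, assuming the rtTrialsYA_s1 data frame from the question and myfunction() as defined above:
library(dplyr)

# A sketch, assuming rtTrialsYA_s1 and myfunction() exist as above.
subject_ids  <- sort(unique(rtTrialsYA_s1$Subject))
all_subjects <- bind_rows(lapply(subject_ids, myfunction))  # one summary row per subject
all_subjects$Subject <- subject_ids  # myfunction() drops Subject, so add it back for labelling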
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
  group_by(Subject) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime),
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5,
    .groups = "drop"
  )
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
Update:
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
  pivot_longer(-Subject)
The above takes all columns except Subject and pivots them into two columns: one containing the former column name and one containing the former value.
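A minimal sketch of that reshaping, using made-up numbers purely for illustration (not results from the question's data):
library(dplyr)
library(tidyr)

# Made-up summary table in the same shape as df above (values are illustrative).
df <- tibble(Subject      = 1:2,
             testStart    = c(0, 10),
             testEnd      = c(100, 60),
             testDuration = c(100, 50),
             timeBin      = c(20, 10))

df %>%
  pivot_longer(-Subject)
# -> columns Subject, name, value: one row per subject per measure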

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based only on the car characteristics cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81, together with the share s, a small double variable; a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" link to post on community.rstudio.
Applying the principle of select-split-combine, the closest I've come to getting the correct answer is the 15 groups (a 15 x 3 table of cat, yr, s):
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))
# in my actual data, this is what fills the group share to get what I want,
# but this isn't the desired pipeline-based answer
blp$group_share = 0  # initializing the group_share, the 50th col
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])) {  # if(sameCat & sameYr){blpGS=dfGS}
      blp[i, 50] = df[j, 3]
    }
  }
}
This is great, but I know this can be done in one fell swoop... Hopefully, the idea is clear from what I've described above. A simple fix may be a loop and set by conditions on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so, any insight along that line to get the pipelining answer would be wonderful.
Example for the site: This example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))
names(blp)<-c("cat","yr","s")
head(blp)
#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))
#works thanks to akrun: applying the code I provided for what leads to the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))
# manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share = 0
for (i in 1:45) {
  if ((blp[i, 1] == df[j, 1]) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))) {  # if(sameCat & sameYr){blpGS=dfGS}
    blp[i, 4] = df[j, 3]
  }
}
If I understood your problem correctly, this should help!
The only difference here is that instead of using summarize, which results only in the grouping columns and the summarized one, you use mutate to keep the original columns and add an aggregate column to them.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
blp <-
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>%  # create the data frame
  group_by(cat, yr) %>%                                 # group by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%        # sum of share per category/year
  ungroup()
Expected output
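To see the difference in shape, here is a sketch using the same toy blp data: summarise collapses to one row per group, while the grouped mutate keeps every original row and repeats the group total within each group.
# With the 45-row toy data above (3 cats x 5 yrs = 15 groups):
blp %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s), .groups = "drop")  # 15 rows: one per cat/yr group

blp %>%
  group_by(cat, yr) %>%
  mutate(group_share = sum(s)) %>%                   # still 45 rows: group total repeated on each row
  ungroup()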

Call variable that has been grouped by

Some sample data:
df <- data.frame(lang = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))
I am getting an error when I try to call a variable that I recently grouped by:
df2 <- df %>%
  Total = count(lang) %>%  # count is shorthand for tally + group_by()
  filter(answer == '2') %>%
  mutate(prop = NROW(answer)/NROW(Total))
Error in group_vars(x) : object 'lang' not found
I would like a new column on my dataframe that says the proportion of the answer '2' to total observations in each level of lang. So how many times does '2' occur in 'A' in proportion to the total number of observations in 'A'?
Here's a solution that does what you want:
df %>%
  group_by(lang) %>%
  summarize(
    prop = length(lang[answer == 2]) / n()
  )
Here, we group by the variable or variables that define the groups you want proportions for, and then use summarize to take the length of one of the column vectors where answer equals 2 and divide that by the number of rows in the group. If, for whatever reason, you want the prop column AND the answer column, just change summarize to mutate.
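A sketch of that mutate variant, using the same toy df (sum(answer == 2) is an equivalent way to count the matching rows):
library(dplyr)

df %>%
  group_by(lang) %>%
  mutate(prop = sum(answer == 2) / n()) %>%  # keeps every original row and column
  ungroup()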
The reason you were getting the error about not finding lang is because count needs to be used as a function like mutate, i.e.
df %>%
  count(lang, name = "Total")
You could achieve the same thing adapting your code, but you should use add_count (so your answer column is preserved) or mutate(Total = n()). However, group_by was designed to address problems such as this and is definitely worth spending some time to learn about.
df %>%
  add_count(lang, name = "Total") %>%
  filter(answer == 2) %>%
  add_count(lang, name = "Twos") %>%
  distinct(lang, .keep_all = TRUE) %>%
  mutate(prop = Twos/Total) %>%
  select(lang, prop)
Alternate solution with data.table
I personally prefer data.table to data frames everywhere. Here is the implementation with that method, although admittedly it looks a bit more cryptic than the dplyr solution. (The syntax to accomplish something like this may be more involved, but getting used to it gives you a whole bag of tricks, and with simple queries the syntax actually looks better.)
You end up trying to use "lang" like it's a variable, when it's the name of a column.
To get the values requested (0.3333 for each):
library(data.table)
df <- data.table(df)
df[, nrow(.SD[answer == 2])/nrow(.SD), by = "lang"]
#    lang        V1
# 1:    A 0.3333333
# 2:    B 0.3333333
# 3:    C 0.3333333
(the special variable .SD allows you to manipulate every subset of the data, split by by)
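For what it's worth, the same result can be written more compactly in data.table by taking the mean of a logical vector. A sketch, equivalent under the same toy data:
library(data.table)

dt <- data.table(lang   = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

dt[, .(prop = mean(answer == 2)), by = lang]  # proportion of rows where answer is "2", per lang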

Group and summarize with iterative filter using dplyr

Upfront apology if this has been asked, I have been searching all day and have not found an answer I can apply to my problem.
I am trying to solve this issue using dplyr (and co.) because my previous method (for loops) was too inefficient. I have a dataset of event times, at sites, that are in groups. I want to summarize the number (and proportion) of events that occur in a moving window along a sequence.
# Example data
set.seed(1)
sites = rep(letters[1:10],10)
groups = c('red','blue','green','yellow')
times = round(runif(length(sites),1,100))
timePeriod = seq(1,100)
# Example dataframe
df = data.frame(site = sites,
                group = rep(groups, length(sites)/length(groups)),
                time = times)
This is my attempt to summarize the number of sites from each group that contain a time (event) within a given moving window of time.
The goal is to move through each element of the vector timePeriod and summarize how many events in each group occurred at timePeriod[i] +/- half-window. Ultimately storing them in, e.g., a dataframe with a column for each group, and a row for each time step, is ideal.
df %>%
  filter(time > timePeriod[i] - 25 & time < timePeriod[i] + 25) %>%
  group_by(group) %>%
  summarise(count = n())
How can I do this without looping through my sequence of time and storing the summary table for each group individually? Thanks!
Combining lapply and dplyr, you can do the following, which is close to what you had worked out so far.
lapply(timePeriod, function(i){
  df %>%
    filter(time > (i - 25) & time < (i + 25)) %>%
    group_by(group) %>%
    summarise(count = n()) %>%
    mutate(step = i)
}) %>%
  bind_rows()
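Since the question asks for a column per group and a row per time step, the stacked result can be reshaped with tidyr. A sketch, assuming the bind_rows() output above is stored in a variable called result (a name introduced here for illustration):
library(tidyr)

result %>%
  pivot_wider(names_from = group, values_from = count, values_fill = 0)
# -> one row per step, one column per group (0 where a group had no events in that window)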
