I have a data frame that stores the amount someone spends per transaction for this month. I'm trying to create a loop that checks for repeated user IDs, then sums and stores the amount they spent in total in the first record that they appear. It should set the amount they spent in any other occurrences to 0.
I keep getting "Error: No loop for break/next, jumping to top level" when I stop it from running:
# Number of trips
numTrips <- NROW(tripData)
# For each trip in data
for (i in 1:numTrips) {
  # For each trip after i
  for (j in (i + 1):numTrips) {
    # If the user IDs match, sum prices
    if (tripData[i, ]$user_id == tripData[j, ]$user_id) {
      tripData[i, ]$original_price <- tripData[i, ]$original_price + tripData[j, ]$original_price
      tripData[j, ]$original_price <- 0
    }
  }
}
Can someone please help?
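As an aside on the loop itself: once i reaches numTrips, (i + 1):numTrips counts downwards (e.g. 11:10), so j runs past the last row and the comparison fails on a missing value. A minimal sketch of a guard, assuming the same tripData frame as above:
# stop the outer loop one row early so i + 1 never exceeds numTrips
for (i in seq_len(numTrips - 1)) {
  for (j in (i + 1):numTrips) {
    if (tripData[i, ]$user_id == tripData[j, ]$user_id) {
      tripData[i, ]$original_price <- tripData[i, ]$original_price + tripData[j, ]$original_price
      tripData[j, ]$original_price <- 0
    }
  }
}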
I'll go with @MrFlick's comment and give you a sample:
library(dplyr) # also provides tibble()

set.seed(42)
dat <- tibble(
  id = rep(1:3, each = 3),
  when = sort(Sys.Date() - sample(10, size = 9)),
  amt = sample(1e4, size = 9))
dat
# # A tibble: 9 x 3
# id when amt
# <int> <date> <int>
# 1 1 2020-06-19 356
# 2 1 2020-06-20 7700
# 3 1 2020-06-21 3954
# 4 2 2020-06-22 9091
# 5 2 2020-06-23 5403
# 6 2 2020-06-24 932
# 7 3 2020-06-25 9189
# 8 3 2020-06-27 5637
# 9 3 2020-06-28 4002
It sounds like you want to sum the amounts for each id, but preserve the individual rows with the rest of the amounts zeroed out.
dat %>%
  group_by(id) %>%
  mutate(amt2 = c(sum(amt), rep(0, n() - 1)))
# # A tibble: 9 x 4
# # Groups: id [3]
# id when amt amt2
# <int> <date> <int> <dbl>
# 1 1 2020-06-19 356 12010
# 2 1 2020-06-20 7700 0
# 3 1 2020-06-21 3954 0
# 4 2 2020-06-22 9091 15426
# 5 2 2020-06-23 5403 0
# 6 2 2020-06-24 932 0
# 7 3 2020-06-25 9189 18828
# 8 3 2020-06-27 5637 0
# 9 3 2020-06-28 4002 0
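For comparison, a base R sketch of the same zero-out step using ave(), which applies a function per group and returns a vector aligned with the original rows (assuming the dat above):
# first row of each id gets the group total, the remaining rows get 0
dat$amt2 <- ave(dat$amt, dat$id, FUN = function(x) c(sum(x), rep(0, length(x) - 1)))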
If instead you just want the summaries, you can use this:
dat %>%
  group_by(id) %>%
  summarize(amt = sum(amt))
# # A tibble: 3 x 2
# id amt
# <int> <int>
# 1 1 12010
# 2 2 15426
# 3 3 18828
or if you want to preserve the date range, then
dat %>%
  group_by(id) %>%
  summarize(whenfrom = min(when), whento = max(when), amt = sum(amt))
# # A tibble: 3 x 4
# id whenfrom whento amt
# <int> <date> <date> <int>
# 1 1 2020-06-19 2020-06-21 12010
# 2 2 2020-06-22 2020-06-24 15426
# 3 3 2020-06-25 2020-06-28 18828
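The date-range summary can also be sketched in base R with split() and lapply(), in case dplyr is unavailable (same dat assumed):
# one row per id: date range plus total amount
do.call(rbind, lapply(split(dat, dat$id), function(d)
  data.frame(id = d$id[1],
             whenfrom = min(d$when),
             whento = max(d$when),
             amt = sum(d$amt))))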
I was testing an example from RStudio about rendering SQL code with dbplyr:
library(dplyr)
library(nycflights13)

ranked <- flights %>%
  group_by(year, month, day) %>%
  select(dep_delay) %>%
  mutate(rank = rank(desc(dep_delay)))
dbplyr::sql_render(ranked)
But when run, it returns the following error message:
Error in UseMethod("sql_render") :
no applicable method for 'sql_render' applied to an object of class "c('grouped_df', 'tbl_df', 'tbl', 'data.frame')"
Can someone explain why?
When you work on a "normal" data.frame, dplyr verbs return a data.frame, in which case sql_render is inappropriate (and will be very confused). If we run just your initial code, we can see that SQL has nothing to do with it:
library(dplyr)
library(nycflights13)

ranked <- flights %>%
  group_by(year, month, day) %>%
  select(dep_delay) %>%
  mutate(rank = rank(desc(dep_delay)))
ranked
# # A tibble: 336,776 x 5
# # Groups: year, month, day [365]
# year month day dep_delay rank
# <int> <int> <int> <dbl> <dbl>
# 1 2013 1 1 2 313
# 2 2013 1 1 4 276
# 3 2013 1 1 2 313
# 4 2013 1 1 -1 440
# 5 2013 1 1 -6 742
# 6 2013 1 1 -4 633
# 7 2013 1 1 -5 691
# 8 2013 1 1 -3 570
# 9 2013 1 1 -3 570
# 10 2013 1 1 -2 502.
# # ... with 336,766 more rows
But dbplyr won't be able to do anything with that:
library(dbplyr)
sql_render(ranked)
# Error in UseMethod("sql_render") :
# no applicable method for 'sql_render' applied to an object of class "c('grouped_df', 'tbl_df', 'tbl', 'data.frame')"
If, however, we have that same flights data in a database, then we can do what you are expecting, with some minor changes.
# pgcon <- DBI::dbConnect(odbc::odbc(), ...) # to my local postgres instance
copy_to(pgcon, flights, name = "flights_table") # go get some coffee
flights_db <- tbl(pgcon, "flights_table")
ranked_db <- flights_db %>%
  group_by(year, month, day) %>%
  select(dep_delay) %>%
  mutate(rank = rank(desc(dep_delay)))
# Adding missing grouping variables: `year`, `month`, `day`
We can see some initial data, showing the top 10 rows of what the query will eventually return:
ranked_db
# # Source: lazy query [?? x 5]
# # Database: postgres [postgres@localhost:/]
# # Groups: year, month, day
# year month day dep_delay rank
# <int> <int> <int> <dbl> <int64>
# 1 2013 1 1 NA 1
# 2 2013 1 1 NA 1
# 3 2013 1 1 NA 1
# 4 2013 1 1 NA 1
# 5 2013 1 1 853 5
# 6 2013 1 1 379 6
# 7 2013 1 1 290 7
# 8 2013 1 1 285 8
# 9 2013 1 1 260 9
# 10 2013 1 1 255 10
# # ... with more rows
and we can see what the real SQL query looks like:
sql_render(ranked_db)
# <SQL> SELECT "year", "month", "day", "dep_delay", RANK() OVER (PARTITION BY "year", "month", "day" ORDER BY "dep_delay" DESC) AS "rank"
# FROM "flights_table"
Realize that, due to the way dbplyr operates, we don't know how many rows will be returned until we collect the result:
nrow(ranked_db)
# [1] NA
res <- collect(ranked_db)
nrow(res)
# [1] 336776
res
# # A tibble: 336,776 x 5 # <--- no longer 'Source: lazy query [?? x 5]'
# # Groups: year, month, day [365]
# year month day dep_delay rank
# <int> <int> <int> <dbl> <int64>
# 1 2013 1 1 NA 1
# 2 2013 1 1 NA 1
# 3 2013 1 1 NA 1
# 4 2013 1 1 NA 1
# 5 2013 1 1 853 5
# 6 2013 1 1 379 6
# 7 2013 1 1 290 7
# 8 2013 1 1 285 8
# 9 2013 1 1 260 9
# 10 2013 1 1 255 10
# # ... with 336,766 more rows
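If you want to experiment without standing up Postgres, a sketch using an in-memory SQLite table instead (this assumes dbplyr >= 1.4 and the RSQLite package, with an SQLite build that supports window functions so rank() translates):
library(dplyr)
library(dbplyr)
library(nycflights13)

flights_db <- tbl_memdb(flights) # copies flights into an in-memory SQLite table

ranked_db <- flights_db %>%
  group_by(year, month, day) %>%
  select(dep_delay) %>%
  mutate(rank = rank(desc(dep_delay)))

sql_render(ranked_db) # now renders SQL instead of erroring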
Check the documentation of the SqlRender package; it lets you render parameterized SQL code. Maybe the chunk of code below helps you:
library(dplyr)
library(SqlRender)
library(nycflights13)

ranked <- flights %>%
  group_by(year, month, day) %>%
  select(dep_delay) %>%
  mutate(rank = rank(desc(dep_delay))) %>%
  ungroup()

sql <- "SELECT * FROM @x WHERE month = @a;"
render(sql, x = ranked, a = 2)
I am looking for a concise way to filter a data.frame for all rows smaller than a value x with all following values also smaller than x. I found a way, but it is somewhat verbose. I tried to do it with dplyr::cumall and cumany, but was not able to figure it out.
Here is a small reprex including my actual approach. Ideally I would only have one filter line or mutate + filter, but with the current approach it takes two rounds of mutate/filter.
library(dplyr)

# Original data
tbl <- tibble(value = c(100, 100, 100, 10, 10, 5, 10, 10, 5, 5, 5, 1, 1, 1, 1))

# desired output:
# keep only rows where value is 5 or smaller and ...
# no value after that is larger than 5
tbl %>%
  mutate(id = row_number()) %>%
  filter(value <= 5) %>%
  mutate(id2 = lead(id, default = max(id) + 1) - id) %>%
  filter(id2 == 1)
#> # A tibble: 7 x 3
#> value id id2
#> <dbl> <int> <dbl>
#> 1 5 9 1
#> 2 5 10 1
#> 3 5 11 1
#> 4 1 12 1
#> 5 1 13 1
#> 6 1 14 1
#> 7 1 15 1
Created on 2020-04-20 by the reprex package (v0.3.0)
You could combine cummin with a reversed cummax:
tbl %>% filter(rev(cummax(rev(value))) <= 5 & cummin(value) <= 5)
# A tibble: 7 x 1
value
<dbl>
1 5
2 5
3 5
4 1
5 1
6 1
7 1
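Since the question mentions dplyr::cumall: the same reversal idea expresses "this value and every later one is at most 5" directly (a sketch on the tbl above):
library(dplyr)

# cumall() stays TRUE only until the first FALSE; reversing makes it scan from the end
tbl %>% filter(rev(cumall(rev(value <= 5))))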
A base R option is to use subset + rle
tblout <- subset(tbl,
                 with(rle(value <= 5 & c(0, diff(value)) <= 0),
                      rep(lengths > 1 & values, lengths)))
such that
> tblout
# A tibble: 7 x 1
value
<dbl>
1 5
2 5
3 5
4 1
5 1
6 1
7 1
I have a dataframe that looks like this:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Type = c("A", "A", "B", "B", "C", "C"),
                 `2019` = c(1, 2, 3, 4, 5, 6),
                 `2020` = c(2, 3, 4, 5, 6, 7),
                 `2021` = c(3, 4, 5, 6, 7, 8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Create a new data.frame for every row in df
2. Names the new dataframe with a combination of "ID" and "Type" (A_1, A_2, ... , C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
There are a few things that complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about lists, and that it may help to work with a list in a for loop and then transform the list back into a data frame. Sorry for my limited approach, I hope anyone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)

df %>%
  gather(Year, Values, 3:5) %>%
  mutate(Year = str_sub(Year, 2)) %>%
  select(ID, Year, Values) %>%
  group_split(ID) # or split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 Type = c("A", "A", "B", "B", "C", "C"),
                 `2019` = c(1, 2, 3, 4, 5, 6),
                 `2020` = c(2, 3, 4, 5, 6, 7),
                 `2021` = c(3, 4, 5, 6, 7, 8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)

names(df) <- str_replace_all(names(df), "X", "") # remove X's from year names

df %>%
  gather(Year, Values, 3:5) %>%
  select(ID, Year, Values) %>%
  group_split(ID)
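If you also want the list elements named with the Type and ID combination from the question (A_1 through C_6), one sketch is to build the long frame once and hand it to base split(), which names the pieces after the split factor (assuming the df above):
library(dplyr)
library(tidyr)
library(stringr)

long <- df %>%
  gather(Year, Values, 3:5) %>%
  mutate(Year = str_sub(Year, 2)) # drop the leading "X"

# split into a named list; element names come from paste(Type, ID)
dfs <- split(long[c("Year", "Values")], paste(long$Type, long$ID, sep = "_"))
dfs$A_1 # the Year/Values frame for Type A, ID 1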
I am working with gait-cycle data. I have 8 events marked for each ID and gait trial. The values "LFCH" and "RFCH" occur twice in each trial, as these represent the beginning and the end of the gait cycles from the left and right leg.
Sample Data Frame:
df <- data.frame(ID = rep(1:5, each = 16),
                 Gait_nr = rep(1:2, each = 8, times = 5),
                 Frame = rep(c(1, 5, 7, 9, 10, 15, 22, 25), times = 10),
                 Marks = rep(c("LFCH", "LHL", "RFCH", "LTO", "RHL", "LFCH", "RTO", "RFCH"), times = 10))
head(df, 8)
ID Gait_nr Frame Marks
1 1 1 1 LFCH
2 1 1 5 LHL
3 1 1 7 RFCH
4 1 1 9 LTO
5 1 1 10 RHL
6 1 1 15 LFCH
7 1 1 22 RTO
8 1 1 25 RFCH
I would like to create something like
Total_gait_left = Frame[The last time Marks == "LFCH"] - Frame[The first time Marks == "LFCH"]
My current code solves the problem, but depends on the position of the Frame values rather than actual values in Marks. Any individual not following the normal gait pattern will have wrong values produced by the code.
library(tidyverse)

l <- df %>%
  group_by(ID, Gait_nr) %>%
  filter(grepl("L.+", Marks)) %>%
  summarize(Total_gait = Frame[4] - Frame[1],
            Side = "left")
r <- df %>%
  group_by(ID, Gait_nr) %>%
  filter(grepl("R.+", Marks)) %>%
  summarize(Total_gait = Frame[4] - Frame[1],
            Side = "right")
val <- union(l, r, by = c("ID", "Gait_nr", "Side")) %>% arrange(ID, Gait_nr, Side)
Can you help me make my code more stable by helping me change e.g. Frame[4] to something like Frame[Marks=="LFCH" the last time ]?
If both LFCH and RFCH happen exactly twice per group, you can subset inside summarise and use diff:
df %>%
  group_by(ID, Gait_nr) %>%
  summarise(
    left = diff(Frame[Marks == 'LFCH']),
    right = diff(Frame[Marks == 'RFCH'])
  )
# A tibble: 10 x 4
# Groups: ID [?]
# ID Gait_nr left right
# <int> <int> <dbl> <dbl>
# 1 1 1 14 18
# 2 1 2 14 18
# 3 2 1 14 18
# 4 2 2 14 18
# 5 3 1 14 18
# 6 3 2 14 18
# 7 4 1 14 18
# 8 4 2 14 18
# 9 5 1 14 18
#10 5 2 14 18
We can use first and last from the dplyr package.
library(dplyr)

df2 <- df %>%
  filter(Marks %in% "LFCH") %>%
  group_by(ID, Gait_nr) %>%
  summarise(Total_gait = last(Frame) - first(Frame)) %>%
  ungroup()
df2
# # A tibble: 10 x 3
# ID Gait_nr Total_gait
# <int> <int> <dbl>
# 1 1 1 14
# 2 1 2 14
# 3 2 1 14
# 4 2 2 14
# 5 3 1 14
# 6 3 2 14
# 7 4 1 14
# 8 4 2 14
# 9 5 1 14
# 10 5 2 14
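If you want the output shaped like your original union(l, r) (one row per ID, gait number, and side), a sketch that derives Side from the mark itself, assuming the FCH marks always open and close a cycle:
library(dplyr)

df %>%
  filter(Marks %in% c("LFCH", "RFCH")) %>%
  mutate(Side = ifelse(Marks == "LFCH", "left", "right")) %>%
  group_by(ID, Gait_nr, Side) %>%
  summarise(Total_gait = last(Frame) - first(Frame)) %>%
  ungroup() %>%
  arrange(ID, Gait_nr, Side)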
I'd like to calculate relative changes of measured variables in a data.frame by group with dplyr.
The changes are with respect to a first baseline value at time==0.
I can easily do this in the following example:
library(dplyr)

# with this easy example it works
df.easy <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                      time = c(0, 1, 2, 0, 1, 2),
                      meas = c(5, 6, 9, 4, 5, 6))

df.easy %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(meas.relative = meas / meas[time == 0])
# Source: local data frame [6 x 4]
# Groups: id [2]
#
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1.00
# 2 1 1 6 1.20
# 3 1 2 9 1.80
# 4 2 0 4 1.00
# 5 2 1 5 1.25
# 6 2 2 6 1.50
However, when there are IDs with no measurement at time==0, this doesn't work.
A similar question is this, but I'd like to get an NA as a result instead of simply taking the first occurence as baseline.
# how to output NA in case there are IDs with no measurement at time==0?
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3),
                 time = c(0, 1, 2, 0, 1, 2, 1, 2),
                 meas = c(5, 6, 9, 4, 5, 6, 5, 6))

# same approach now gives an error:
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = meas / meas[time == 0])
# Error in mutate_impl(.data, dots) :
#   incompatible size (0), expecting 2 (the group size) or 1
Let's try to return NA in case no measurement at time==0 was taken, using ifelse:
df %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(meas.relative = ifelse(any(time == 0), meas / meas[time == 0], NA))
# Source: local data frame [8 x 4]
# Groups: id [3]
#
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1
# 2 1 1 6 1
# 3 1 2 9 1
# 4 2 0 4 1
# 5 2 1 5 1
# 6 2 2 6 1
# 7 3 1 5 NA
# 8 3 2 6 NA
Wait, why is the relative measurement above always 1?
identical(
  df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time == 0), meas, NA)),
  df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time == 0), meas[time == 0], NA))
)
# TRUE
It seems that the ifelse prevents meas from picking the current line and instead always selects from the subset where time==0.
How can I calculate relative changes when there are IDs with no baseline measurement?
Your issue was in the ifelse(). According to the ifelse documentation, it returns "A vector of the same length...as test". Since any(time==0) is of length 1 for each group (TRUE or FALSE), only the first element of meas / meas[time==0] was being selected. This was then recycled to fill each group.
To fix this all I did was rep the any() to be the length of the group. I believe this should work:
df %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(meas.relative = ifelse(rep(any(time == 0), times = n()), meas / meas[time == 0], NA))
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1.00
# 2 1 1 6 1.20
# 3 1 2 9 1.80
# 4 2 0 4 1.00
# 5 2 1 5 1.25
# 6 2 2 6 1.50
# 7 3 1 5 NA
# 8 3 2 6 NA
To see how this was working incorrectly in your case, try:
ifelse(TRUE,c(1,2,3),NA)
#[1] 1
Edit: A data.table solution with the same concept:
library(data.table)

as.data.table(df)[, meas.rel := ifelse(rep(any(time == 0), .N), meas / meas[time == 0], NA_real_),
                  by = id]
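An alternative that sidesteps the ifelse() recycling issue entirely: match() returns NA when a group has no time == 0 row, and subsetting with an NA index yields NA, so the division propagates NA automatically (a sketch on the same df):
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(meas.relative = meas / meas[match(0, time)]) # NA index -> NA baseline -> NA result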