I've got a dplyr query which runs on a large data frame and it's painfully slow. Reprex:
Start with a dataframe df which has duplicate rows (because it was formed by a left_join() call). If I see a duplicate index value then the name, year and city values will be duplicated too.
df <- data.frame(index = c(1, 1, 1, 2, 2, 3),
name = c("Andy", "Andy", "Andy", "Bob", "Bob", "Charles"),
year = c(1970, 1970, 1970, 1971, 1971, 1972),
city = c("Amsterdam", "Amsterdam", "Amsterdam", "Barcelona", "Barcelona", "Copenhagen"),
amount = c(123, 234, 345, 456, 567, 678))
I want to know the sum of the amount field for each value of index. However I want to retain name, year and city.
output_i_want <- data.frame(index = c(1, 2, 3),
name = c("Andy", "Bob", "Charles"),
year = c(1970, 1971, 1972),
city = c("Amsterdam", "Barcelona", "Copenhagen"),
total_amount = c(702, 1023, 678))
It's easy enough to do it like this:
df |>
group_by(index) |>
summarise(name = first(name),
year = first(year),
city = first(city),
total_amount = sum(amount)) |>
ungroup()
...but in my real world case (where first() appears about 20 times and sum() appears 8 times) it's horribly slow.
If I instead do:
df |>
group_by(index) |>
summarise(total_amount = sum(amount)) |>
ungroup()
then it runs fast, but I then lose name, year and city - and I'm not sure how best to get them back. Do I need a different type of join afterwards, or some other technique?
Thank you.
1) Since the columns involved with first in the question are constant within index try grouping by all 4 to eliminate the need for first. The code below assumes all non-grouping columns are to be summed but you could specify something like where(is.numeric) if all numeric non-grouping columns are to be summed or amount1:amount3 if the columns are side by side or starts_with("amount") if they all start with amount.
library(dplyr) # version 1.1.0 or later
df %>%
summarize(across(everything(), sum, .names = "total_{.col}"), .by = index:city)
giving:
index name year city total_amount
1 1 Andy 1970 Amsterdam 702
2 2 Bob 1971 Barcelona 1023
3 3 Charles 1972 Copenhagen 678
2) A base solution using the same idea is the following. Change the column numbers as needed. Omit the names() statement if having a total_ prefix is not important.
ag <- aggregate(df[5], df[-5], sum)
names(ag)[5] <- paste("total", names(ag)[5], sep = "_")
ag
giving:
index name year city total_amount
1 1 Andy 1970 Amsterdam 702
2 2 Bob 1971 Barcelona 1023
3 3 Charles 1972 Copenhagen 678
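The join-afterwards idea raised in the question can also be sketched directly. This is only an illustration, assuming (as the question states) that name, year and city are constant within each index; the intermediate names totals and keys are made up for this example.

```r
library(dplyr)

df <- data.frame(index = c(1, 1, 1, 2, 2, 3),
                 name = c("Andy", "Andy", "Andy", "Bob", "Bob", "Charles"),
                 year = c(1970, 1970, 1970, 1971, 1971, 1972),
                 city = c("Amsterdam", "Amsterdam", "Amsterdam", "Barcelona", "Barcelona", "Copenhagen"),
                 amount = c(123, 234, 345, 456, 567, 678))

# Fast part: sum only, one row per index
totals <- df |>
  group_by(index) |>
  summarise(total_amount = sum(amount)) |>
  ungroup()

# One row per index holding the columns that are constant within index
keys <- df |>
  distinct(index, name, year, city)

# Put the two back together
result <- left_join(keys, totals, by = "index")
result
```

Whether this is faster than grouping by all four columns depends on the data, so it is worth timing both on the real data frame.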
Consider a data frame that has 3 columns: A - a name; B - the yearly food intake (one name can eat different foods); C - the year in which the person stops eating that food
Such as:
A B C
Peter 400 2035
Peter 500 2050
Peter 350 2024
John 700 2050
I need to create a time series that sums all the food intake for each person, from today (2022) to 2050. John's case is easy: 700 * (2050-2022). But for Peter, I need to add some restrictions: sum the 3 lines until 2024, after which one of them drops to zero, while the time series keeps summing the other two lines, until eventually there is only one line left to sum.
So year 2022 would be (400+500+350), the same for years 2023 to 2024. Then would be (400+500), until 2035, etc.
This allows me to have a time-series, per person, which contains the yearly intake of food, taking into consideration that the yearly food intake will decrease throughout the years.
Are you after the total intake over the period? Then this will calculate it:
library(tidyverse)
data <- tribble(~"A", ~"B", ~"C",
"Peter", 400, 2035,
"Peter", 500, 2050,
"Peter", 350, 2024,
"John", 700, 2050)
data %>%
mutate(line_total = B*(C - 2022)) %>% # 2022 being the start year
group_by(A) %>%
summarise(person_total = sum(line_total))
If you actually want a time-series, with a column for each row and the total for the row at the end, then try this:
years <- 2022:max(data$C)
mat <- matrix(nrow = nrow(data), ncol = length(years))
colnames(mat) <- c(years)
timeseries <- cbind(data, mat) %>%
as_tibble() %>%
pivot_longer(-c(A, B, C)) %>%
mutate(value = ifelse(as.integer(name) <= C, B, 0)) %>% # column names are character, so convert before comparing with C
pivot_wider() %>%
select(-c(B, C)) %>%
mutate(rowsum = rowSums(across(where(is.numeric))))
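If a long table with one row per person and year is more convenient than the wide layout, the following is a possible sketch. It assumes that each line counts for every year from 2022 up to and including its end year C; the column names year and intake are invented for this example.

```r
library(dplyr)
library(tidyr)

data <- data.frame(A = c("Peter", "Peter", "Peter", "John"),
                   B = c(400, 500, 350, 700),
                   C = c(2035, 2050, 2024, 2050))

yearly <- data %>%
  # one list element per line: every year in which that food is still eaten
  mutate(year = lapply(C, function(end) 2022:end)) %>%
  # expand to one row per line and year
  unnest(year) %>%
  # add up whatever lines are still active for each person and year
  group_by(A, year) %>%
  summarise(intake = sum(B), .groups = "drop")

yearly
```

For Peter this gives 400 + 500 + 350 for 2022 to 2024, 400 + 500 from 2025 to 2035, and 500 afterwards.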
I am trying to merge two datasets. The survey dataset consists of biodiversity surveys from different regions conducted every 1-5 years in a certain month (the month is constant within, but not between, regions). The temperature dataset consists of daily temperature readings in each survey region.
For multiple surveys that have different start months and temporal extents, I want to pair each survey*year combination with the twelve months of temperature data preceding it. In other words, I want to pair a May 1983 survey with the 12 months (or 365 days -- I don't care which) of daily temperature records preceding it, ending April 30, 1983. Meanwhile, another survey elsewhere conducted in August 1983 needs to be paired with the 365 days of temperature data ending July 31, 1983.
There are (at least) two ways to do this -- one would be joining the survey data to the (longer) temperature data and then somehow subsetting or identifying which dates fall in the 12 months preceding the survey-date. Another is to start with the survey data and try to pair the temperature data to each row with a matrix-column -- I tried doing this with time-series tools from tsibble and tsModel but couldn't get it to "lag" the right values when grouped by region.
I was able to create an identifier to join the datasets such that each date in the temperature data is matched with the subsequent survey in time. However, not all of those are within 365 days (e.g., in the dataset created below, the date 1983-06-03 is matched with the ref_year aleutian_islands-5-1986 because the survey only happens every 3-5 years).
Here are some examples of the behavior I want for a single region (from the example dataset below), although I'm open to solutions that achieve the same thing but don't look exactly like this:
For this row, the value in the new column that I want to generate (ref_match) should be NA; the date is more than 365 days before ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 1982-06-09 1982 6 6-1982 aleutian_islands-5-1983 0
For this row, ref_match should be aleutian_islands-5-2014 since the date is within 12 months of ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 2013-07-22 2013 7 7-2013 aleutian_islands-5-2014 0.998
The following script will generate a dataset temp_dat with columns like those in the snippets above from which I hope to generate the ref_match column.
# load packages
library(tidyverse)
library(lubridate)
set.seed(10)
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>%
mutate(month_year = paste0(startmonth,"-",year)) %>%
select(region, month_year) %>%
distinct() %>%
mutate(region_month_year = paste0(region,"-",month_year))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <- expand.grid(month=seq(1, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(region_month_year = paste0(region,"-",month,"-",year)) %>% # create unique identifier
mutate(ref_year = ifelse(region_month_year %in% surv_dat$region_month_year, region_month_year, NA),
month_year = paste0(month,"-",year)) %>%
select(region, month_year, ref_year) %>%
distinct() %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
month_year = paste0(month,"-",year)) %>%
left_join(surv_dat_exploded, by=c('region','month_year')) %>%
filter(!is.na(ref_year)) # get rid of dates that are after any ref_year
Sounds like you want a non-equi join. This is easily done with data.table and is very fast. Here's an example that lightly modifies your MWE:
library(data.table)
# make survey dfs
ai_dat = data.table(year = c(1983, 1986, 1991, 1994, 1997),
region = "aleutian_islands", "startmonth" = 5)
ebs_dat = data.table(year = seq(1983, 1999, 1),
region = "eastern_bering_sea", "startmonth" = 5)
# bind together and create date (and cutoffdate) vars
surv_dat = rbind(ai_dat, ebs_dat)
surv_dat[, startdate := as.IDate(paste(year, startmonth, '01', sep = '-'))
][, cutoffdate := startdate - 365L]
# make temperature df
temp_dat = CJ(date=seq(as.IDate("1982-01-01"), as.IDate("1999-12-31"), "days"),
region=c('aleutian_islands','eastern_bering_sea'))
# add temperature var
temp_dat$temp = rnorm(nrow(temp_dat))
# create duplicate date variable (will make post-join processing easier)
temp_dat[, matchdate := date]
# Optional: Set keys for better join performance
setkey(surv_dat, region, startdate)
setkey(temp_dat, region, matchdate)
# Where the magic happens: Non-equi join
surv_dat = temp_dat[surv_dat, on = .(region == region,
matchdate <= startdate,
matchdate >= cutoffdate)]
# Optional: get rid of unneeded columns
surv_dat[, c('matchdate', 'matchdate.1') := NULL][]
#> date region temp year startmonth
#> 1: 1982-05-01 aleutian_islands 0.3680810 1983 5
#> 2: 1982-05-02 aleutian_islands 0.8349334 1983 5
#> 3: 1982-05-03 aleutian_islands -1.3622227 1983 5
#> 4: 1982-05-04 aleutian_islands 1.4327587 1983 5
#> 5: 1982-05-05 aleutian_islands 0.5068226 1983 5
#> ---
#> 8048: 1999-04-27 eastern_bering_sea -1.2924594 1999 5
#> 8049: 1999-04-28 eastern_bering_sea 0.7519078 1999 5
#> 8050: 1999-04-29 eastern_bering_sea -1.0185174 1999 5
#> 8051: 1999-04-30 eastern_bering_sea -1.4322252 1999 5
#> 8052: 1999-05-01 eastern_bering_sea -1.0412836 1999 5
Created on 2021-05-20 by the reprex package (v2.0.0)
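For readers who want to stay in dplyr, a similar non-equi join can be sketched with join_by(), which as far as I know needs dplyr 1.1.0 or later. The small surveys and temps data frames below are made up for illustration, and the window here is the 365 days strictly before the survey start date.

```r
library(dplyr)

set.seed(10)

# Hypothetical survey table: one row per survey, with the start of its window
surveys <- data.frame(region = "aleutian_islands",
                      startdate = as.Date(c("1983-05-01", "1986-05-01")))
surveys$cutoffdate <- surveys$startdate - 365

# Hypothetical daily temperature table
temps <- data.frame(region = "aleutian_islands",
                    date = seq(as.Date("1982-01-01"), as.Date("1986-12-31"), by = "day"))
temps$temperature <- rnorm(nrow(temps), 10, 5)

# Keep only temperature rows that fall inside some survey's window
matched <- inner_join(temps, surveys,
                      by = join_by(region, date >= cutoffdate, date < startdate))

head(matched)
```

Temperature rows that fall outside every window are dropped by inner_join(); a left_join() would keep them with NA in the survey columns.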
Try this solution.
I basically used your reference column to generate a ref_date and estimate the difference in days between the observation and reference. Then, I used a simple ifelse to test if the dates fall within the 365 days range and then copy them to the temp_valid column.
# load packages
library(tidyverse)
library(lubridate)
set.seed(10)
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <-
rbind(ai_dat, ebs_dat) %>%
mutate(year_month = paste0(year,"-",startmonth),
region_year_month = paste0(region,"-",year,"-",startmonth))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <-
expand.grid(month=seq(01, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(year_month = paste0(year,"-",month)) %>%
mutate(region_year_month = paste0(region,"-",year,"-",month)) %>%
mutate(ref_year = ifelse(region_year_month %in% surv_dat$region_year_month, region_year_month,NA)) %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
year_month = paste0(year,"-",month))
final_df <-
left_join(temp_dat, surv_dat_exploded, by=c('region','year_month')) %>%
#split ref_column in ref_year and ref_region
separate(ref_year, c("ref_region","ref_year"), "-", extra="merge") %>%
#convert ref_year into a date on the first day of the survey month
#(%m is the month; %M would be minutes, and as.Date needs a day to parse reliably)
mutate(ref_year = as.Date(paste0(ref_year, "-01"), format = "%Y-%m-%d")) %>%
#difference between observed and the reference
mutate(diff_days = date - ref_year) %>%
# ifelse statement for capturing values of interest
mutate(temp_valid = ifelse(between(diff_days, -365, 0),temperature,NA))
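The date arithmetic behind this approach can be checked on a couple of rows in isolation. This is a minimal sketch in base R; the obs data frame is hypothetical, using two dates mentioned in the question and an invented temperature for the first row.

```r
# Two observation dates and the survey month (year-month) each one was matched to
obs <- data.frame(
  date = as.Date(c("1983-06-03", "2013-07-22")),
  ref_year = c("1986-5", "2014-5"),
  temperature = c(9.1, 0.998)
)

# Append a day so the string is a complete date; %m is the month
obs$ref_date <- as.Date(paste0(obs$ref_year, "-01"), format = "%Y-%m-%d")

# Negative values mean the observation is before the survey
obs$diff_days <- as.numeric(obs$date - obs$ref_date)

# Keep the temperature only when it falls in the 365 days before the survey
obs$temp_valid <- ifelse(obs$diff_days >= -365 & obs$diff_days <= 0,
                         obs$temperature, NA)
obs
```

The first row is almost three years before its survey and becomes NA; the second is 283 days before and is kept.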
I am trying to make my code display the mean attendance of a specified country in a new column, but the code below does not give me the table I want. Can anyone explain how to display only the specified country name and the mean attendance in the new column, and explain what I am doing wrong? Thank you
EDIT: sorry I'm obviously new at this,
my code is
WorldCupMatches %>%
select(Home.Team.Name, Attendance) %>%
group_by(Home.Team.Name == "USA") %>%
mutate(AVG_Attendance = mean(Attendance, na.rm = T))
To explain more: WorldCupMatches is a data frame with columns named "Home.Team.Name" and "Attendance". I am trying to add a column with mutate, and I want the new column to show the mean attendance for a country. The country I am looking for in this particular situation is USA. I also want the output to display only the columns "Home.Team.Name" (with USA as the home team), "Attendance", and the mutated column holding the mean attendance.
Thank you all for the help i got a lot of great answers!
First group by Home.Team.Name
Then you get the mean of each country in the table with summarise
If you just want USA then add filter(Home.Team.Name == "USA") at the end
WorldCupMatches %>%
select(Home.Team.Name, Attendance) %>%
group_by(Home.Team.Name) %>%
summarise(AVG_Attendance = mean(Attendance, na.rm = T)) %>%
filter(Home.Team.Name == "USA")
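If the goal is to keep every USA match as its own row with the mean repeated alongside it, as the question describes, filtering first and then using mutate is one option. The sketch below uses a small made-up WorldCupMatches data frame, since the real one is not shown.

```r
library(dplyr)

# Hypothetical stand-in for the real data
WorldCupMatches <- data.frame(
  Home.Team.Name = c("USA", "France", "USA", "Brazil"),
  Attendance = c(100, 200, 300, NA)
)

usa <- WorldCupMatches %>%
  select(Home.Team.Name, Attendance) %>%
  filter(Home.Team.Name == "USA") %>%                       # keep only USA home matches
  mutate(AVG_Attendance = mean(Attendance, na.rm = TRUE))   # same mean on every row

usa
```

In the original attempt, group_by(Home.Team.Name == "USA") creates a TRUE/FALSE grouping column, so every non-USA row is kept and receives the mean of all non-USA matches.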
If you want to have averages by group, just group_by and summarise:
library(dplyr)
df %>%
group_by(Hometeam) %>%
summarise(mean = mean(Attendance))
# A tibble: 3 x 2
Hometeam mean
* <chr> <dbl>
1 France 555
2 UK 500.
3 USA 373
If you're just interested in a specific group you can filter that group:
df %>%
filter(Hometeam=="USA") %>%
summarise(mean = mean(Attendance))
mean
1 373
Data:
df <- data.frame(
Hometeam = c("USA", "UK", "USA", "France", "UK", "USA"),
Attendance = c(120, 333, 222, 555, 666, 777)
)
My Data frame looks like this
Now, I want to add a new column which assigns one (!) specific value to each country. That means, there is only one value for Australia, one for Canada etc. for every year.
It should look like this:
Year Country R Ineq Adv NEW_COL
2018 Australia R1 Ineq1 1 x_Australia
2019 Australia R2 Ineq2 1 x_Australia
1972 Canada R1 Ineq1 1 x_Canada
...
Is there a smart way to do this?
Appreciate any help!
You can use merge.
x = data.frame(country = c("AUS","CAN","AUS","USA"),
val1 = c(1:4))
y = data.frame(country = c("AUS","CAN","USA"),
val2 = c("a","b","c"))
merge(x,y)
country val1 val2
1 AUS 1 a
2 AUS 3 a
3 CAN 2 b
4 USA 4 c
You just manually create the (probably significantly smaller!) reference table that then gets duplicated in the original table in the merge. As you can see, my 3 row table (with a,b,c) is correctly duplicated up to the original (4 row) table such that every AUS gets "a".
You may use mutate and case_when from the package dplyr:
library(dplyr)
data <- data.frame(country = rep(c("AUS", "CAN"), each = 2))
data <- mutate(data,
newcol = case_when(
country == "CAN" ~ 1,
country == "AUS" ~ 2))
print(data)
You can use mutate and group_indices:
library(dplyr)
Sample data:
sample.df <- data.frame(Year = sample(1971:2019, 10, replace = T),
Country = sample(c("AUS", "Can", "UK", "US"), 10, replace = T))
Create new variable called ID, and assign unique ID to each Country group:
sample.df <- sample.df %>%
mutate(ID = group_indices(., Country))
If you want it to appear as x_Country, you can use paste (as commented):
sample.df <- sample.df %>%
mutate(ID = paste(group_indices(., Country), Country, sep = "_"))
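As far as I know, calling group_indices() with grouping columns has been deprecated since dplyr 1.0.0 in favour of cur_group_id(). A sketch of the same idea with the newer function follows; the four-row sample.df is made up so the result is predictable, and NEW_COL simply mirrors the column name from the question.

```r
library(dplyr)

sample.df <- data.frame(Year = c(2018, 2019, 1972, 1973),
                        Country = c("AUS", "AUS", "Can", "Can"))

sample.df <- sample.df %>%
  group_by(Country) %>%
  mutate(ID = paste(cur_group_id(), Country, sep = "_")) %>%  # numeric group id plus country
  ungroup() %>%
  mutate(NEW_COL = paste0("x_", Country))                     # literal "x_" prefix, no grouping needed

sample.df
```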
I am looking for some help here. I have two data frames, df1 and df2. I want to add an extra column to df1, based on the sum of specific rows in df2.
Df1 contains station names. Df2 contain locations, years and observations in degrees. I want a sum of degrees for each station. These degrees should be sums of specific locations for each year.
Think of it, as “each station should get its sum of degrees, based on the given locations, for each year”. I hope to only code the station name and locations, the years in desired_output should include all years given in df2.
Failed example and desired output. I prefer to work in the tidyverse environment.
All the best
df1 <- data.frame(station = c("station_A", "station_B"))
df2 <- data.frame(location= c("south", "north", "north", "east", "west"), year = c(2000, 2000, 2001, 2001, 2001), degrees = c(5,3,9,5,2))
degrees_for_each_station <-
df1%>%
mutate (degrees = case_when(
station == "station_A" ~ if_else(df2$location %in% c("north","south"),
sum(df2$degrees),
NA),
station == "station_B" ~ if_else(df2$location %in% c("north","east", "west"),
sum(df2$degrees),
NA)))
desired_output <- data.frame(station = c("station_A", "station_A","station_B", "station_B"),
year = c(2000, 2001, 2000, 2001),
degrees = c(8,9,3,16))
One way would be:
library(tidyverse)
df1 %>%
left_join(
df2 %>%
mutate(
location = case_when(
location == 'south' ~ 'station_A',
location %in% c('east', 'west') ~ 'station_B',
location == 'north' ~ 'station_A, station_B'
)
) %>%
separate_rows(location, sep = ', ') %>%
group_by(location, year) %>%
summarise(degrees = sum(degrees)),
by = c('station' = 'location')
)
Output:
station year degrees
1 station_A 2000 8
2 station_A 2001 9
3 station_B 2000 3
4 station_B 2001 16
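An alternative sketch keeps the station-to-location mapping in its own small lookup table instead of inside case_when. The station_locations table is an assumption built from the mapping in the question; base merge() is used for the join only because it pairs every matching row without a many-to-many warning.

```r
library(dplyr)

df2 <- data.frame(location = c("south", "north", "north", "east", "west"),
                  year = c(2000, 2000, 2001, 2001, 2001),
                  degrees = c(5, 3, 9, 5, 2))

# Hypothetical lookup: which locations count towards which station
station_locations <- data.frame(
  station  = c("station_A", "station_A", "station_B", "station_B", "station_B"),
  location = c("north", "south", "north", "east", "west")
)

result <- merge(station_locations, df2, by = "location") %>%  # one row per station/location/year
  group_by(station, year) %>%
  summarise(degrees = sum(degrees), .groups = "drop")

result
```

Adding a station or changing its locations then only means editing rows of the lookup table.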