How to conditionally mutate a new column when data is in long format and the condition depends on a grouping combination

I have data in long format, and I'm trying to test each row against the mean of a certain grouping combination, in order to generate a new column with the conclusion from that test.
Example
In this toy example, I have data about 20 cars. Each car is from one of three possible makers. We have mpg data for each car, measured 8 times: under each combination of road type (city or highway), time of day (morning or evening), and season (winter or spring).
library(tidyr)
set.seed(2021)
df_id_and_makers <-
  data.frame(id = 1:20,
             maker = sample(c("toyota", "audi", "bmw"), size = 20, replace = TRUE))

df <- tidyr::expand_grid(df_id_and_makers,
                         road_type = c("city", "highway"),
                         time_of_day = c("morning", "evening"),
                         season = c("winter", "spring"))

df$mpg_val <- sample(15:40, size = nrow(df), replace = TRUE)
df
#> # A tibble: 160 x 6
#>       id maker road_type time_of_day season mpg_val
#>    <int> <chr> <chr>     <chr>       <chr>    <int>
#>  1     1 bmw   city      morning     winter      28
#>  2     1 bmw   city      morning     spring      22
#>  3     1 bmw   city      evening     winter      40
#>  4     1 bmw   city      evening     spring      18
#>  5     1 bmw   highway   morning     winter      19
#>  6     1 bmw   highway   morning     spring      36
#>  7     1 bmw   highway   evening     winter      30
#>  8     1 bmw   highway   evening     spring      16
#>  9     2 audi  city      morning     winter      33
#> 10     2 audi  city      morning     spring      18
#> # ... with 150 more rows
Created on 2021-07-07 by the reprex package (v2.0.0)
I want to analyze this data to test my hypothesis that mpg in the city is larger than mpg on the highway. To this end, I want to create a new column that tests whether each value of mpg_val where road_type is "city" is larger than the mean of mpg_val across rows where road_type is "highway". Furthermore, I want to compare only among cars of the same maker.
So, for example, id = 1 is a bmw, and therefore the new column should test each value of mpg_val in rows where road_type == "city" (i.e., rows 1-4, but not rows 5-8), and check whether mpg_val is larger than mean(mpg_val) over rows where road_type == "highway" and maker == "bmw".
Expected output
Here's the manual and dumb way of doing this. I'll show only how I do this for maker = bmw for the sake of demonstration.
library(dplyr)
# step 1 -- calculate the mean of `mpg_val` for `road_type = "highway"`, across bmw only
mean_bmw_highway_mpg <-
  df %>%
  filter(maker == "bmw",
         road_type == "highway") %>%
  pull(mpg_val) %>%
  mean()
mean_bmw_highway_mpg
## [1] 26.22222
# step 2 -- compare `mpg_val` in each row where `maker = "bmw"` and `road_type = "city"` against `mean_bmw_highway_mpg`
result_bmw_only <-
  df %>%
  mutate(is_mpg_city_larger_than_mpg_highway = case_when(maker != "bmw" ~ "not_relevant",
                                                         road_type != "city" ~ "not_relevant",
                                                         mpg_val > mean_bmw_highway_mpg ~ "yes",
                                                         TRUE ~ "no"))
result_bmw_only
## # A tibble: 160 x 7
##       id maker road_type time_of_day season mpg_val is_mpg_city_larger_than_mpg_highway
##    <int> <chr> <chr>     <chr>       <chr>    <int> <chr>
##  1     1 bmw   city      morning     winter      28 yes          ## because 28 > 26.222
##  2     1 bmw   city      morning     spring      22 no           ## because 22 < 26.222
##  3     1 bmw   city      evening     winter      40 yes
##  4     1 bmw   city      evening     spring      18 no
##  5     1 bmw   highway   morning     winter      19 not_relevant
##  6     1 bmw   highway   morning     spring      36 not_relevant
##  7     1 bmw   highway   evening     winter      30 not_relevant
##  8     1 bmw   highway   evening     spring      16 not_relevant
##  9     2 audi  city      morning     winter      33 not_relevant
## 10     2 audi  city      morning     spring      18 not_relevant
## # ... with 150 more rows
How could I achieve the same result as result_bmw_only (but applied to the entire df) in a more elegant way? Hopefully with a dplyr approach, because that is what I'm used to, but otherwise any method will do.
Thanks!
EDIT 1
One solution I could think of involves purrr, but I can't get this done yet.
library(purrr)
solution_purrr <-
  df %>%
  group_by(maker) %>%
  nest(data = -maker) %>%
  mutate(tbl_with_desired_new_col = map(.x = data,
                                        .f = ~ .x %>%
                                          mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(road_type != "city" ~ "not_relevant",
                                                                                          mpg_val > mean(mpg_val) ~ "yes",
                                                                                          TRUE ~ "no"))))
solution_purrr comes close to the desired output, but not quite. This is because the second condition in case_when() (i.e., mpg_val > mean(mpg_val) ~ "yes") is not what I want. I want to compare mpg_val to mean(mpg_val) where that mean is computed based only on rows with road_type == "highway", but here mean(mpg_val) is computed across all rows.
EDIT 2
Based on @Till's answer below, I'd like to clarify that I'm looking for a solution that avoids a separate calculation of the mean we want to test against. What I did above with mean_bmw_highway_mpg is the undesired way of working towards the output; I showed it only to demonstrate the kind of mean I need to calculate.

What you tried is already close. Take a look at the documentation of dplyr::group_by(); it is designed for exactly these kinds of operations.
Below is how you can expand your BMW-only solution to the full dataset using group_by().
library(tidyverse)
mean_highway_mpg_df <-
  df %>%
  filter(road_type == "highway") %>%
  group_by(maker) %>%
  summarise(mean_highway_mpg = mean(mpg_val))

result_df <-
  df %>%
  filter(road_type == "city") %>%
  group_by(maker) %>%
  left_join(mean_highway_mpg_df) %>%
  mutate(mpg_city_higher_highway = mpg_val > mean_highway_mpg)
#> Joining, by = "maker"

result_df %>%
  select(-(time_of_day:season))
#> # A tibble: 80 x 6
#> # Groups:   maker [3]
#>       id maker road_type mpg_val mean_highway_mpg mpg_city_higher_highway
#>    <int> <chr> <chr>       <int>            <dbl> <lgl>
#>  1     1 bmw   city           28             26.2 TRUE
#>  2     1 bmw   city           22             26.2 FALSE
#>  3     1 bmw   city           40             26.2 TRUE
#>  4     1 bmw   city           18             26.2 FALSE
#>  5     2 audi  city           33             28.1 TRUE
#>  6     2 audi  city           18             28.1 FALSE
#>  7     2 audi  city           35             28.1 TRUE
#>  8     2 audi  city           36             28.1 TRUE
#>  9     3 audi  city           25             28.1 FALSE
#> 10     3 audi  city           32             28.1 TRUE
#> # … with 70 more rows
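Note that result_df contains only the city rows. If you prefer the full 160-row data frame with a "not_relevant" marker, as in your expected output, a sketch of the same idea (joining the per-maker highway means onto the unfiltered data) would be:
result_all_rows <-
  df %>%
  left_join(mean_highway_mpg_df, by = "maker") %>%
  mutate(is_mpg_city_larger_than_mpg_highway = case_when(
    road_type != "city" ~ "not_relevant",
    mpg_val > mean_highway_mpg ~ "yes",
    TRUE ~ "no"
  ))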

I think I got this. The following solution is based on both my EDIT 1 above and @MrFlick's comment here.
First, we define a helper function:
is_x_larger_than_mean_y <- function(x, y) {
  x > mean(y)
}
Then, we run:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  group_by(maker) %>%
  nest(data = -maker) %>%
  mutate(tbl_with_desired_new_col = map(.x = data,
                                        .f = ~ .x %>%
                                          mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(road_type != "city" ~ "not_relevant",
                                                                                          is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes",
                                                                                          TRUE ~ "no")))) %>%
  select(-data) %>%
  unnest(cols = tbl_with_desired_new_col)
This way, the case_when() line is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes" ensures that, within each maker's nested data, the mean of mpg_val is computed only from rows in which road_type == "highway".
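For what it's worth, the same logic can also be written without nesting at all, via a grouped mutate; a sketch of an equivalent approach (as far as I can tell it produces the same column):
library(dplyr)

df %>%
  group_by(maker) %>%
  mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(
    road_type != "city" ~ "not_relevant",
    mpg_val > mean(mpg_val[road_type == "highway"]) ~ "yes",
    TRUE ~ "no"
  )) %>%
  ungroup()
Within each maker group, mpg_val[road_type == "highway"] subsets just that group's highway rows, so the mean is exactly the per-maker highway mean.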

Related

dplyr arrange is not working while order is fine

I am trying to obtain the largest 10 investors in a country, but I get a confusing result using arrange() in dplyr versus order() in base R.
head(fdi_partner)
gives the following result
# A tibble: 6 x 3
  `Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
  <chr>               <chr>                <chr>
1 TOTAL               1818                 38854.3
2 Singapore           231                  11358.66
3 Korea Rep.of        377                  7679.9
4 Japan               204                  4325.79
5 Netherlands         24                   4209.64
6 China, PR           216                  3001.79
and
fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange("Number of projects") %>%
  head()
gives almost the same result
# A tibble: 6 x 3
  `Main counterparts` `Number of projects` `Registered capital`
  <chr>                              <dbl>                <dbl>
1 TOTAL                               1818               38854.
2 Singapore                            231               11359.
3 Korea Rep.of                         377                7680.
4 Japan                                204                4326.
5 Netherlands                           24                4210.
6 China, PR                            216                3002.
while the following code works fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric)

head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n = 11)
which gives
# A tibble: 11 x 3
   `Main counterparts`   `Number of projects` `Registered capital`
   <chr>                                <dbl>                <dbl>
 1 TOTAL                                 1818               38854.
 2 Korea Rep.of                           377                7680.
 3 Singapore                              231               11359.
 4 China, PR                              216                3002.
 5 Japan                                  204                4326.
 6 Hong Kong SAR (China)                  132                2365.
 7 United States                           83                 783.
 8 Taiwan                                  66                1464.
 9 United Kingdom                          50                 331.
10 F.R Germany                             37                 131.
11 Thailand                                36                 370.
Can anybody help explain what's going wrong?
dplyr (and tidyverse packages more generally) expect unquoted variable names. If you pass a quoted string, arrange() treats it as a constant character value, so the rows are not reordered at all. If your variable name has a space in it, wrap it in backticks instead:
library(dplyr)

test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#>   My variable var2
#> 1           3    1
#> 2           1    1
#> 3           2    1

# Your code (doesn't work)
test %>%
  arrange("My variable")
#>   My variable var2
#> 1           3    1
#> 2           1    1
#> 3           2    1

# Solution
test %>%
  arrange(`My variable`)
#>   My variable var2
#> 1           1    1
#> 2           2    1
#> 3           3    1
Created on 2023-01-05 with reprex v2.0.2
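Applied to the original data, and since the goal is the largest 10 investors, you will also want desc(); a sketch reusing the question's own pipeline:
library(dplyr)

fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange(desc(`Number of projects`)) %>%
  head(10)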

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band the income falls inside. Where more than one observation (income) falls within a range, I would like to take the lower income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)

income_range <- data.frame(id = "France"
                           ,inc_lower = c(10, 21, 31, 41, 51, 61, 71)
                           ,inc_high = c(20, 30, 40, 50, 60, 70, 80)
                           ,perct = c(1, 2, 3, 4, 5, 6, 7))

data_occ <- data.frame(id = rep(c("France", "Belgium"), each = 50)
                       ,income = sample(10:80, 50)
                       ,occ = rep(c("manager", "clerk", "manual", "skilled", "office"), each = 20))

setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range[data_occ,
                    on = .(id, inc_lower <= income, inc_high >= income),
                    .(id, income, inc_lower, inc_high, perct, occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using the fuzzyjoin package, which works well with dplyr verbs:
library('fuzzyjoin')
library(dplyr)

# join data frames on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
  fuzzy_left_join(data_occ,
                  by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  rename(id = id.x) %>%
  select(-id.y)

# sort by income, and keep only the first row of every unique perct
result <- joined %>%
  arrange(income) %>%
  group_by(perct) %>%
  slice(1)
And the (intermediate) results:
> head(joined)
      id inc_lower inc_high perct income     occ
1 France        10       20     1     10 manager
2 France        10       20     1     19 manager
3 France        10       20     1     14 manager
4 France        10       20     1     11 manager
5 France        10       20     1     17 manager
6 France        10       20     1     12 manager
> result
# A tibble: 7 x 6
# Groups:   perct [7]
  id     inc_lower inc_high perct income occ
  <chr>      <dbl>    <dbl> <dbl>  <int> <chr>
1 France        10       20     1     10 manager
2 France        21       30     2     21 manual
3 France        31       40     3     31 manual
4 France        41       50     4     43 manager
5 France        51       60     5     51 clerk
6 France        61       70     6     61 manager
7 France        71       80     7     71 manager
I've added the intermediate dataframe joined for ease of understanding. You can omit it and just chain the two command chains together with %>%.
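If you are on dplyr >= 1.1.0, the same range join can also be written natively with join_by(), without fuzzyjoin; a sketch, not from the original answer:
library(dplyr) # join_by() requires dplyr >= 1.1.0

result <- income_range %>%
  left_join(data_occ, by = join_by(id, inc_lower <= income, inc_high >= income)) %>%
  arrange(income) %>%
  group_by(perct) %>%
  slice(1) %>%
  ungroup()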
Here is one data.table approach:
# temporarily copy income into two join columns, so each income
# can be compared against the band bounds in the non-equi join
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]

# sort by income first, so that mult = "first" picks the lowest income per band
result = data_occ[order(income)
                  ][income_range,
                    on = .(id, inc_lower >= inc_lower, inc_high <= inc_high),
                    mult = "first"]

# remove the helper columns again
data_occ[, (cols) := NULL]
#        id income     occ inc_lower inc_high perct
# 1: France     10   clerk        10       20     1
# 2: France     21 manager        21       30     2
# 3: France     31   clerk        31       40     3
# 4: France     41   clerk        41       50     4
# 5: France     51   clerk        51       60     5
# 6: France     62 manager        61       70     6
# 7: France     71 manager        71       80     7

How to create a rate in R

I want to change my data so that it gives me the rate of pedestrians relative to each state's population. I am using a linear model, and my summary values look like this:
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.087061   0.029876    2.914  0.00438 **
intersection 0.009192   0.003086    2.978  0.00362 **
Here, my beta estimate for intersection is 0.009192, which is hard to interpret on its own: for a state with a small population, this value may mean something very different than for a populous one. Below is a condensed version of my data without all the columns I use, but here is the link to the csv in case someone wants to download it from there.
> head(c)
# A tibble: 6 x 15
  STATE STATENAME  PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME     CITY   DAY MONTH  YEAR LATITUDE LONGITUD
  <dbl> <chr>     <dbl>   <dbl>   <dbl>      <dbl>  <dbl> <chr>         <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1     1 Alabama       0       3       3          0     81 LEE (81)       2340     7     2  2019     32.7    -85.3
2     1 Alabama       0       2       2          0     55 ETOWAH (55)    1280    23     1  2019     34.0    -86.1
3     1 Alabama       0       4       4          0     29 CLEBURNE (29)     0    22     1  2019     33.7    -85.4
4     1 Alabama       1       1       1          1     55 ETOWAH (55)    2562    22     1  2019     34.0    -86.1
5     1 Alabama       0       1       1          0      3 BALDWIN (3)       0    18     1  2019     30.7    -87.8
6     1 Alabama       0       2       2          0     85 LOWNDES (85)      0     7     1  2019     32.2    -86.4
# … with 1 more variable: FATALS <dbl>
Here is the code that runs through my process. I don't see how I can change it so that each value is a rate (values like PEDS or TYP_INT).
#Libraries
rm(list=ls()) # this is to clear anything in memory
library(leaflet)
library(tidyverse)
library(ggmap)
library(leaflet.extras)
library(htmltools)
library(ggplot2)
library(maps)
library(mapproj)
library(mapdata)
library(zoo)
library(tsibble)
setwd("~/Desktop/Statistics790/DataSets/FARS2019NationalCSV")
df <- read.csv("accident.csv")
state <- unique(df$STATE)
for(i in state){
  df1 <- df %>%
    filter(STATE == i) %>%
    dplyr::select(c(STATE, PEDS, DAY, MONTH, YEAR, TYP_INT)) %>%
    mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"), "%Y-%m-%d"))) %>% # create a date
    group_by(date) %>% # group by date (within the current state)
    # summarise_at(.vars = vars(PEDS), sum)
    summarise(pedday = sum(PEDS), intersection = mean(TYP_INT))

  #ts1 <- ts(df, start = c(2019, 1, 1), frequency = 365)
  setwd("~/Desktop/Statistics790/States_ts/figures")
  plots <- df1 %>%
    ggplot() +
    geom_line(aes(x = date, y = pedday)) + ylim(0, 13) +
    theme_bw()
  ggsave(paste0("state_", i, ".png"), width = 8, height = 6)

  ts1 <- ts(df1, start = c(2019, 1, 1), frequency = 365)
  setwd("~/Desktop/Statistics790/States_ts")
  ts1 %>% write.csv(paste0("state_", i, ".csv"), row.names = F)
  #Plots
}
#date1<- as.character(df$date)
#df1<- df%>% filter(STATE=="1")
#ts2<-xts(df,order.by = as.Date(df$date,"%Y-%m-%d"))
setwd("~/Desktop/Statistics790/States_ts")
cat("\f")
#df <- read.csv(paste0("state_1.csv"))
#print("------Linear Model------")
#summary(lm(pedday~weather,data=df))
for(i in state){
  print(paste0("-------------------------Analysis for State: ", i, " -------------------------------"))
  df <- read.csv(paste0("state_", i, ".csv"))
  print("------Linear Model------")
  print(summary(lm(pedday ~ intersection, data = df)))
}
Collating my answers from the comments: you need to get state population data from an outside source such as the US Census https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage_1574439295, read it in, join it to your dataset, and then calculate rate as pedestrians per population, scaled for ease of reading on the graph. You can make your code faster by taking some of your calculations out of the loop. The code below assumes the census data is called 'census.csv' and has columns 'Geographic Area' for state and 'X2019' for the most recent population data available.
pop <- read.csv('census.csv')

df <- read.csv('accidents.csv') %>%
  left_join(pop, by = c('STATENAME' = 'Geographic Area')) %>%
  mutate(rate = (PEDS / X2019) * <scale>) %>%
  mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"), "%Y-%m-%d")))
The left_join will match state names and give each row a population value depending on its state, regardless of how many rows there are.
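For instance, a common scaling choice (an assumption here, not specified in the answer) is pedestrians per 100,000 residents:
# hypothetical choice of scale: pedestrians per 100,000 residents
df <- df %>%
  mutate(rate = (PEDS / X2019) * 100000)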

aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I processed the dataset. Can we find the day with the fewest deaths in the Asian region? The important thing here is the sum of deaths over all countries in the Asia region; accordingly, we sort by that sum and find the day. As output:
date        region  death
2020/02/17  asia    6300  (sum over the Asia region)
The data in the output I created are examples; the numbers are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  arrange(d)
# A tibble: 107 x 3
   day        deaths     d
   <date>      <int> <int>
 1 2020-01-23     18     1 # <- this day saw only 1 death in the whole of Asia
 2 2020-01-29    133     2
 3 2020-02-21   2249     3
 4 2020-02-12   1118     5
 5 2020-01-24     26     8
 6 2020-02-23   2465    10
 7 2020-01-26     56    14
 8 2020-01-25     42    16
 9 2020-01-22     17    17
10 2020-01-27     82    26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
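If only that single day is wanted, and assuming dplyr >= 1.0.0 is available, slice_min() can replace the arrange() step; a sketch:
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  slice_min(d, n = 1) # keep the row with the smallest daily increment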
Using the dplyr package for data manipulation:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>%
  group_by(region, day) %>%                      # group by region and day
  summarise(death = sum(death)) %>%              # sum within the groups
  filter(region == "Asia", death == min(death))  # keep only the minimum for Asia
Then you have:
> df_sum
# A tibble: 1 x 3
# Groups:   region [1]
  region day        death
  <fct>  <fct>      <int>
1 Asia   2020/01/22    17

Extend data frame column with inflation in R

I'm trying to extend some code to be able to:
1) read in a vector of prices,
2) left join that vector of prices to a data frame of years (or years and months), and
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate.
Consider an example like this one:
prices <- data.frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute force way as:
i_rate <- 0.02

for(i in c(1:nrow(prices))){
  if(is.na(prices$wti[i]))
    prices$wti[i] <- prices$wti[i - 1] * (1 + i_rate)
  if(is.na(prices$brent[i]))
    prices$brent[i] <- prices$brent[i - 1] * (1 + i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by @camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs, because it uses the "original" ith element of a vector instead of the "revised" ith element. We first have to create a version of lag that does use the revised values, by writing a new function:
impute_inflation <- function(x, rate) {
  output <- x
  # pre-allocate an empty vector to fill in within the loop;
  # this is faster for vectors with many elements
  y <- rep(NA, length = length(x))
  for (i in seq_len(length(output))) {
    if (i == 1) {
      y[i] <- output[i] # avoid an error from attempting to use the 0th element
    } else {
      y[i] <- output[i - 1]
    }
    if (is.na(output[i])) {
      output[i] <- y[i] * (1 + rate)
    } else {
      output[i]
    }
  }
  output
}
Then it's a cinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)

mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
  year    wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
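An alternative that also handles consecutive NAs is purrr::accumulate(), which carries each newly imputed value forward; a sketch assuming dplyr >= 1.0.0 (for across()) and purrr are available, with impute_step as a hypothetical helper not from the answer above:
library(dplyr)
library(purrr)

# hypothetical helper: keep the current value, or grow the previous one by the rate
impute_step <- function(prev, curr, rate = 0.02) {
  if (is.na(curr)) prev * (1 + rate) else curr
}

prices %>%
  mutate(across(c(wti, brent), ~ accumulate(.x, impute_step)))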
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)

inflation_factor <- 1.02

prices <- data_frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA)) %>%
  mutate_at(vars(wti, brent), as.numeric)

prices %>%
  mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#>    year   wti brent prev_wti
#>   <int> <dbl> <dbl>    <dbl>
#> 1  2018    75    80       NA
#> 2  2019    80    85       75
#> 3  2020    90    94       80
#> 4  2021    NA    93       90
#> 5  2022    NA    NA       NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
  mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
         brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#>    year   wti brent
#>   <int> <dbl> <dbl>
#> 1  2018  75    80
#> 2  2019  80    85
#> 3  2020  90    94
#> 4  2021  91.8  93
#> 5  2022  NA    94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
  tidyr::gather(key = key, value = value, wti, brent) %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#>    year brent   wti
#>   <int> <dbl> <dbl>
#> 1  2018  80    75
#> 2  2019  85    80
#> 3  2020  94    90
#> 4  2021  93    91.8
#> 5  2022  94.9  NA
Created on 2018-07-12 by the reprex package (v0.2.0).
