Running a function in dplyr gives wrong output - r

My sample data consists of daily rainfall and temperature from day 1 to 365 for the years 1981 and 1982:
set.seed(0)
df <- data.frame(year = rep(1981:1982, each = 365),
                 doy = rep(1:365, times = 2),
                 rainfall = sample(0:30, 730, replace = TRUE),
                 tmax = sample(25:35, 730, replace = TRUE))
Each year I have two days of the year called ref.doy and for each ref.doy, I have corresponding doy.first, doy.second.
my.df <- data.frame(year = c(1981, 1981, 1982, 1982),
                    ref.doy = c(250, 260, 230, 240),
                    doy.first = c(280, 300, 290, 310),
                    doy.second = c(310, 330, 340, 350))
What I want to do is, for each year, take each ref.doy and its corresponding doy.first and doy.second, and calculate the total rainfall and mean temperature over ref.doy:doy.first and over doy.first:doy.second. I wrote a function to do this:
my.func <- function(x) {
  dat <- x %>%
    dplyr::summarise(tot.rain.val1 = sum(rainfall[doy >= ref.doy & doy <= doy.first]),
                     tot.rain.val2 = sum(rainfall[doy >= doy.first & doy <= doy.second]),
                     mean.tmax.val1 = mean(tmax[doy >= ref.doy & doy <= doy.first]),
                     mean.tmax.val2 = sum(tmax[doy >= doy.first & doy <= doy.second]))
  return(dat)
}
The approach I took was to first join the two data frames and then run my function:
df <- df %>% left_join(my.df)
results <- df %>%
  dplyr::group_by(year, ref.doy) %>%
  dplyr::summarise(results = paste(my.func(.), collapse = ","))
However, the results look a bit funny and the format is not correct. I need the results
in the following format
year ref.doy tot.rain.val1 tot.rain.val2 mean.tmax.val1 mean.tmax.val2
1981 250
1981 260
1982 230
1982 240

Your function already returns a data frame in the format you want, so you don't need paste(); save those outputs in a list-column and then unnest.
library(tidyverse)
df <- df %>% left_join(my.df)
df %>%
  group_by(year, ref.doy) %>%
  summarise(results = list(my.func(.))) %>%
  unnest() %>%
  ungroup() %>%
  select(-year, -ref.doy)
# # A tibble: 16 x 6
# year1 ref.doy1 tot.rain.val1 tot.rain.val2 mean.tmax.val1 mean.tmax.val2
# <dbl> <dbl> <int> <int> <dbl> <int>
# 1 1981 250 396 365 29.6 939
# 2 1981 260 429 489 29.8 926
# 3 1982 230 994 805 29.3 1515
# 4 1982 240 1140 653 29.7 1224
# 5 1981 250 396 365 29.6 939
# 6 1981 260 429 489 29.8 926
# 7 1982 230 994 805 29.3 1515
# 8 1982 240 1140 653 29.7 1224
# 9 1981 250 396 365 29.6 939
#10 1981 260 429 489 29.8 926
#11 1982 230 994 805 29.3 1515
#12 1982 240 1140 653 29.7 1224
#13 1981 250 396 365 29.6 939
#14 1981 260 429 489 29.8 926
#15 1982 230 994 805 29.3 1515
#16 1982 240 1140 653 29.7 1224
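Note why the output above repeats each row four times: the magrittr dot in my.func(.) refers to the whole grouped data frame, not the current group, so every group stores the same four-row summary. One hedged way around this, assuming a dplyr version with group_modify() (0.8.1 or later), is to let dplyr hand each group's own rows to the function:

```r
library(tidyverse)

# Sketch: group_modify() passes only the current group's rows (.x) to
# my.func(), so the summary is computed once per group and no rows repeat.
df %>%
  group_by(year, ref.doy) %>%
  group_modify(~ my.func(.x)) %>%
  ungroup()
```

This should give one row per (year, ref.doy) pair with the four summary columns.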

What about something like this, if you want it in a function:
library(dplyr)
fun <- function(x, y) {
  df1 <- x %>%
    left_join(y) %>%
    group_by(year, ref.doy) %>%
    summarise(tot.rain.val1 = sum(rainfall[doy >= ref.doy & doy <= doy.first]),
              tot.rain.val2 = sum(rainfall[doy >= doy.first & doy <= doy.second]),
              mean.tmax.val1 = mean(tmax[doy >= ref.doy & doy <= doy.first]),
              mean.tmax.val2 = sum(tmax[doy >= doy.first & doy <= doy.second]))
  print(df1)
}
fun(df,my.df)
Joining, by = "year"
# A tibble: 4 x 6
# Groups: year [?]
year ref.doy tot.rain.val1 tot.rain.val2 mean.tmax.val1 mean.tmax.val2
<dbl> <dbl> <int> <int> <dbl> <int>
1 1981 250 396 365 29.6 939
2 1981 260 429 489 29.8 926
3 1982 230 994 805 29.3 1515
4 1982 240 1140 653 29.7 1224

Create a new variable in data frame that contains the sum of the values of all other groups

I have data similar to this
example_data <- data.frame(
  company = c(rep("A", 6),
              rep("B", 6),
              rep("C", 6)),
  year = rep(c(rep(2019, 3), rep(2020, 3)), 3),
  country = rep(c("Australia", "Tanzania", "Nepal"), 3),
  sales = sample(1000:2000, 18),
  employees = sample(100:200, 18),
  profit = sample(500:1000, 18)
)
which when printed out looks like this:
> example_data
company year country sales employees profit
1 A 2019 Australia 1815 138 986
2 A 2019 Tanzania 1183 126 907
3 A 2019 Nepal 1159 155 939
4 A 2020 Australia 1873 183 866
5 A 2020 Tanzania 1858 198 579
6 A 2020 Nepal 1841 184 601
7 B 2019 Australia 1989 160 595
8 B 2019 Tanzania 1162 151 520
9 B 2019 Nepal 1470 187 670
10 B 2020 Australia 1013 128 945
11 B 2020 Tanzania 1718 123 886
12 B 2020 Nepal 1135 149 778
13 C 2019 Australia 1846 188 755
14 C 2019 Tanzania 1445 194 916
15 C 2019 Nepal 1029 145 903
16 C 2020 Australia 1737 161 578
17 C 2020 Tanzania 1489 141 859
18 C 2020 Nepal 1350 167 536
The unit of observation for the three variables of interest sales, employees, profit is a unique combination of company, year, and country.
What I need is a column in the data frame for every one of these three variables, named other_sales, other_employees, and other_profit. (In my actual data, I have not three but closer to 40 such variables of interest.) Each should be the sum of that variable over the other companies in the same year and country. So, for instance, example_data$other_sales[1] should be the sum of the two values 1989 and 1846, which are the sales for company B and the sales for company C in that year and country, respectively.
I am familiar with dplyr::group_by() and dplyr::mutate(), but I struggle to come up with a way to solve this problem. What I would like to do is something like this:
library(dplyr)
example_data %>%
  group_by(company, year, country) %>%
  mutate(other_sales = sum(
    example_data %>% filter(company != "this") %>% .$sales)
  )
# "this" should be the value of 'company' in the current group
Obviously, this code doesn't work. Even if it did, it would not accomplish the goal of creating these other_* variables automatically for every specified column in the data frame. I've thought about creating a complicated for loop, but I figured before I go down that most likely wrong route, it's better to ask here. Finally, while it would be possible to construct a solution based purely on column indices (i.e., for example_data[1,7] calculate the sum of [7,4] and [13,4]), this would not work in my real data because the number of observations per company can differ.
EDIT: small correction in the code
--- SOLUTION ---
Based on the comment under this question, I was able to figure out a solution that solves both issues in the question:
example_data %>%
  group_by(year, country) %>%
  mutate(across(sales:profit, .names = "other_{.col}", function(x) sum(x) - x))
I think this will solve your problem
example_data %>%
group_by(country,year) %>%
mutate(other_sales = sum(sales)- sales)
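The trick: within each (country, year) group, sum(sales) includes the current company's own value, so subtracting it leaves exactly the other companies' total. A tiny base-R illustration of the arithmetic:

```r
sales <- c(A = 10, B = 20, C = 30)  # one group's sales, one value per company
sum(sales) - sales                  # each company's "other companies" total
#  A  B  C
# 50 40 30
```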
To generalise it for all variables, i.e. sales, profit and employees:
(arrange is not necessary, but helps when checking.)
library(tidyverse)
set.seed(123)
example_data <- data.frame(
  company = c(rep("A", 6),
              rep("B", 6),
              rep("C", 6)),
  year = rep(c(rep(2019, 3), rep(2020, 3)), 3),
  country = rep(c("Australia", "Tanzania", "Nepal"), 3),
  sales = sample(1000:2000, 18),
  employees = sample(100:200, 18),
  profit = sample(500:1000, 18)
)
example_data |>
  arrange(country, year, company) |> # Optional
  group_by(country, year) |>
  mutate(across(sales:profit, ~ sum(.) - ., .names = "other_{.col}"))
#> # A tibble: 18 × 9
#> # Groups: country, year [6]
#> company year country sales employees profit other_sales other_em…¹ other…²
#> <chr> <dbl> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2019 Australia 1414 190 989 3190 302 1515
#> 2 B 2019 Australia 1817 125 522 2787 367 1982
#> 3 C 2019 Australia 1373 177 993 3231 315 1511
#> 4 A 2020 Australia 1525 108 892 2830 372 1524
#> 5 B 2020 Australia 1228 197 808 3127 283 1608
#> 6 C 2020 Australia 1602 175 716 2753 305 1700
#> 7 A 2019 Nepal 1178 191 762 2899 283 1608
#> 8 B 2019 Nepal 1298 141 943 2779 333 1427
#> 9 C 2019 Nepal 1601 142 665 2476 332 1705
#> 10 A 2020 Nepal 1937 171 829 2721 266 1967
#> 11 B 2020 Nepal 1013 135 991 3645 302 1805
#> 12 C 2020 Nepal 1708 131 976 2950 306 1820
#> 13 A 2019 Tanzania 1462 156 608 2781 286 1633
#> 14 B 2019 Tanzania 1117 106 910 3126 336 1331
#> 15 C 2019 Tanzania 1664 180 723 2579 262 1518
#> 16 A 2020 Tanzania 1194 192 924 3010 296 1423
#> 17 B 2020 Tanzania 1243 182 634 2961 306 1713
#> 18 C 2020 Tanzania 1767 114 789 2437 374 1558
#> # … with abbreviated variable names ¹​other_employees, ²​other_profit
Created on 2022-12-08 with reprex v2.0.2

Error finding object when removing outliers in pipe in R

Okay. I have looked everywhere: read documentation, watched videos, asked people for help, and so on, and I can't seem to get this figured out. I need to remove the outliers in one variable of a data set using object assignment and the quantile method, but I have to do it in the pipe. When I run the code, the object cannot be found. Here is the code:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(lm.beta))
Q1 <- flights %>%
dep_delay_upper <- quantile(dep_delay$y, 0.997, na.rm = TRUE) %>%
dep_delay_lower <- quantile(dep_delay$y, 0.003, na.rm = TRUE) %>%
dep_delay_out <- which(dep_delay$y > dep_delay_upper | dep_delay$y < dep_delay_lower) %>%
dep_delay_noout <- dep_delay[-dep_delay_out,]
With magrittr's pipe, you can reuse the piped object with a dot (.), like so.
The first way gets only the values of dep_delay:
flights$dep_delay %>%
  .[which(. < quantile(., 0.997, na.rm = TRUE) & . > quantile(., 0.003, na.rm = TRUE))]
And the second way filters the entire flights dataframe:
flights %>%
  .[which(.$dep_delay < quantile(.$dep_delay, 0.997, na.rm = TRUE) &
          .$dep_delay > quantile(.$dep_delay, 0.003, na.rm = TRUE)), ]
# # A tibble: 326,164 × 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_…¹ arr_d…² carrier flight tailnum origin dest air_t…³ dista…⁴ hour minute time_hour
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00
# 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00
# # … with 326,154 more rows, and abbreviated variable names ¹​sched_arr_time, ²​arr_delay, ³​air_time, ⁴​distance
# # ℹ Use `print(n = ...)` to see more rows
Or alternatively with dplyr:
flights %>%
  filter(dep_delay < quantile(dep_delay, 0.997, na.rm = TRUE) &
         dep_delay > quantile(dep_delay, 0.003, na.rm = TRUE))
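A variant sketch, if you prefer dplyr's between(): note that between() is inclusive at both ends, while the comparisons above are strict, so rows sitting exactly on a quantile boundary are treated differently.

```r
library(dplyr)
library(nycflights13)

# Inclusive version of the same quantile trim: rows exactly at the
# 0.3% or 99.7% quantile are kept here but dropped by the strict version.
flights %>%
  filter(between(dep_delay,
                 quantile(dep_delay, 0.003, na.rm = TRUE),
                 quantile(dep_delay, 0.997, na.rm = TRUE)))
```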

Extract min and max information by sequential similar parts of data frame in R

I have a data frame that corresponds to the path taken by a river, describing elevation and distance. I need to evaluate each different ground path traveled by the river and extract this information.
Example:
df = data.frame(Soil = c("Forest", "Forest",
                         "Grass", "Grass", "Grass",
                         "Scrub", "Scrub", "Scrub", "Scrub",
                         "Grass", "Grass", "Grass", "Grass",
                         "Forest", "Forest", "Forest", "Forest", "Forest", "Forest"),
                Distance = c(1, 5,
                             10, 15, 56,
                             59, 67, 89, 99,
                             102, 105, 130, 139,
                             143, 145, 167, 189, 190, 230),
                Elevation = c(1500, 1499,
                              1470, 1467, 1456,
                              1450, 1445, 1440, 1435,
                              1430, 1420, 1412, 1400,
                              1390, 1387, 1384, 1380, 1376, 1370))
Soil Distance Elevation
1 Forest 1 1500
2 Forest 5 1499
3 Grass 10 1470
4 Grass 15 1467
5 Grass 56 1456
6 Scrub 59 1450
7 Scrub 67 1445
8 Scrub 89 1440
9 Scrub 99 1435
10 Grass 102 1430
11 Grass 105 1420
12 Grass 130 1412
13 Grass 139 1400
14 Forest 143 1390
15 Forest 145 1387
16 Forest 167 1384
17 Forest 189 1380
18 Forest 190 1376
19 Forest 230 1370
But I need something like this:
Soil Distance.Min Distance.Max Elevation.Min Elevation.Max
1 Forest 1 5 1499 1500
2 Grass 10 56 1456 1470
3 Scrub 59 99 1435 1450
4 Grass 102 139 1400 1430
5 Forest 143 230 1370 1390
I tried to use group_by() and which.min(Soil), but that takes into account the whole df, not each path.
We need a run-length encoding to track consecutive Soil.
Using this function (fashioned to mimic data.table::rleid):
myrleid <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}
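To see what it produces on its own, here is a small self-contained check (the function is restated so the snippet runs standalone; the sample vector is just illustrative):

```r
myrleid <- function(x) {
  r <- rle(x)                                   # lengths of consecutive runs of equal values
  rep(seq_along(r$lengths), times = r$lengths)  # repeat each run's index over its length
}

# Each new run of equal values gets the next integer id:
myrleid(c("Forest", "Forest", "Grass", "Grass", "Scrub", "Grass"))
# [1] 1 1 2 2 3 4
```

Note that the trailing "Grass" gets a new id (4), which is exactly what distinguishes this from a plain group_by(Soil).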
We can do
df %>%
  group_by(grp = myrleid(Soil)) %>%
  summarize(Soil = Soil[1],
            across(c(Distance, Elevation), list(min = min, max = max))) %>%
  select(-grp)
# # A tibble: 5 x 5
# Soil Distance_min Distance_max Elevation_min Elevation_max
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Forest 1 5 1499 1500
# 2 Grass 10 56 1456 1470
# 3 Scrub 59 99 1435 1450
# 4 Grass 102 139 1400 1430
# 5 Forest 143 230 1370 1390
You can try this:
df = df %>% mutate(id = data.table::rleid(Soil))
inner_join(
  distinct(df %>% select(Soil, id)),
  df %>%
    group_by(id) %>%
    summarize(across(Distance:Elevation, .fns = list("min" = min, "max" = max))),
  by = "id"
) %>% select(!id)
Output:
Soil Distance_min Distance_max Elevation_min Elevation_max
1 Forest 1 5 1499 1500
2 Grass 10 56 1456 1470
3 Scrub 59 99 1435 1450
4 Grass 102 139 1400 1430
5 Forest 143 230 1370 1390
Or, even more concise, thanks to r2evans.
df %>%
  group_by(id = data.table::rleid(Soil)) %>%
  summarize(Soil = first(Soil),
            across(Distance:Elevation, .fns = list("min" = min, "max" = max))) %>%
  select(!id)

How to limit the number of factors in facet_wrap() to plot for only top N factors?

I am working with Covid data at the district level and am trying to plot case time series for districts using facet_wrap().
Some states have lots of districts, and they won't all fit, so I want to limit the facet_wrap() to the top N districts.
I have tried reordering the facets with fct_reorder(District), but that only reorders them and still produces a plot for every district.
Is there a way to get the top N levels from fct_reorder(District) and plot only those, or is there an option to control the number of facets in facet_wrap()?
df:
library(tidyverse)
library(lubridate)
file_url <- "https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_districts.csv"
df_districts <- read.csv(url(file_url))
df_districts <- df_districts %>%
  mutate(Date = ymd(Date))
I have tried:
df_districts %>%
  filter(State == "Rajasthan",
         Date != max(Date),
         !is.na(Daily_confirmed)) %>%
  # group_by(District, Date) %>%
  # slice_max(order_by = Daily_confirmed, n = 20) %>%
  mutate(District = fct_reorder(District, Daily_confirmed,
                                .fun = max, .desc = TRUE)) %>%
  ggplot(aes(x = Date, y = Daily_confirmed)) +
  geom_line(size = 1) +
  facet_wrap(~ District)
One way is to summarize the table by whatever precedence you need (e.g., sum(Daily_confirmed)), then take the top "n" by that sorted variable.
df_districts %>%
  group_by(District) %>%
  summarize(daily = sum(Daily_confirmed)) %>%
  slice_max(daily, n = 10)
# # A tibble: 10 x 2
# District daily
# <chr> <int>
# 1 Jaipur 99843
# 2 Jodhpur 72443
# 3 Kota 39442
# 4 Alwar 34650
# 5 Udaipur 31297
# 6 Bikaner 26144
# 7 Ajmer 25866
# 8 Bhilwara 19922
# 9 Pali 16589
# 10 Sikar 15031
Using this data, we can left_join the original data back in on District (removing daily first, if desired), and your subset will have just those districts.
out <- df_districts %>%
  group_by(District) %>%
  summarize(daily = sum(Daily_confirmed)) %>%
  slice_max(daily, n = 10) %>%
  select(-daily) %>%
  left_join(df_districts, by = "District")
out
# # A tibble: 3,660 x 11
# District Date State Confirmed Recovered Deceased Other Tested Daily_confirmed Daily_Recovered Daily_Deceased
# <chr> <date> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 Jaipur 2021-04-27 Rajasthan 100651 68325 659 0 968783 3289 989 21
# 2 Jaipur 2021-04-26 Rajasthan 97362 67336 638 0 968783 2878 961 11
# 3 Jaipur 2021-04-25 Rajasthan 94484 66375 627 0 968783 3145 648 13
# 4 Jaipur 2021-04-24 Rajasthan 91339 65727 614 0 968783 3260 657 9
# 5 Jaipur 2021-04-23 Rajasthan 88079 65070 605 0 968783 3036 767 13
# 6 Jaipur 2021-04-22 Rajasthan 85043 64303 592 0 968783 2317 658 11
# 7 Jaipur 2021-04-21 Rajasthan 82726 63645 581 0 968783 3101 551 5
# 8 Jaipur 2021-04-20 Rajasthan 79625 63094 576 0 968783 1875 413 9
# 9 Jaipur 2021-04-19 Rajasthan 77750 62681 567 0 968783 2011 528 11
# 10 Jaipur 2021-04-18 Rajasthan 75739 62153 556 0 968783 1963 258 4
# # ... with 3,650 more rows
For comparison,
table(df_districts$District)
# Ajmer Alwar Banswara Baran Barmer Bharatpur Bhilwara Bikaner
# 366 366 366 362 366 366 366 366
# BSF Camp Bundi Chittorgarh Churu Dausa Dholpur Dungarpur Evacuees
# 356 335 366 366 366 366 366 366
# Ganganagar Hanumangarh Italians Jaipur Jaisalmer Jalore Jhalawar Jhunjhunu
# 342 366 366 366 366 356 366 366
# Jodhpur Karauli Kota Nagaur Other State Pali Pratapgarh Rajsamand
# 366 366 366 366 366 366 366 366
# Sawai Madhopur Sikar Sirohi Tonk Udaipur
# 366 366 355 366 366
table(out$District)
# Ajmer Alwar Bhilwara Bikaner Jaipur Jodhpur Kota Pali Sikar Udaipur
# 366 366 366 366 366 366 366 366 366 366
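An equivalent sketch, if you would rather keep df_districts' original columns and row order, is semi_join(), which filters by membership without merging any columns back in (top10 here is just a helper name):

```r
library(dplyr)

# Rank districts by total cases as above, then keep only the matching
# rows of the original data; no columns are added or reordered.
top10 <- df_districts %>%
  group_by(District) %>%
  summarize(daily = sum(Daily_confirmed)) %>%
  slice_max(daily, n = 10)

df_districts %>% semi_join(top10, by = "District")
```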

Adding data to data table dependent on prior data

My first post, so hopefully I am doing it right.
I have a table as below:
Year Day Amount
1990 1 200
1990 363 2058
1993 1 10
1993 71 564
1993 360 931
I would like to add rows of data to this table such that there is a row entry for all numbers between the maximum 'Day' of each 'Year' in the table and 364, and the corresponding value in 'Amount' would be the maximum 'Amount' for each Year. The resulting data should be:
Year Day Amount
1990 1 200
1990 363 2058
1993 1 10
1993 71 564
1993 360 931
1990 364 2058
1993 361 931
1993 362 931
1993 363 931
1993 364 931
Any ideas?
Taking advantage of how data.table[i, j, by] lets us evaluate expressions in j for each group of by:
library(data.table)
DT <- data.table(
  Year = c(1990, 1990, 1993, 1993, 1993),
  Day = c(1, 363, 1, 71, 360),
  Amount = c(200, 2058, 10, 564, 931)
)
DT[
  order(Day),
  {
    extended_days <- seq(max(Day) + 1, 364)
    extended_amounts <- rep(max(Amount), length(extended_days))
    list(
      Day = c(Day, extended_days),
      Amount = c(Amount, extended_amounts)
    )
  },
  keyby = Year
]
# Year Day Amount
# 1: 1990 1 200
# 2: 1990 363 2058
# 3: 1990 364 2058
# 4: 1993 1 10
# 5: 1993 71 564
# 6: 1993 360 931
# 7: 1993 361 931
# 8: 1993 362 931
# 9: 1993 363 931
# 10: 1993 364 931
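For comparison, a hedged tidyverse sketch of the same idea, assuming tidyr's complete() and fill() applied per group: fill() carries the last observed Amount forward into the new rows, which matches max(Amount) here because Amount never decreases within a year.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Year = c(1990, 1990, 1993, 1993, 1993),
                 Day = c(1, 363, 1, 71, 360),
                 Amount = c(200, 2058, 10, 564, 931))

df %>%
  group_by(Year) %>%
  complete(Day = seq(max(Day), 364)) %>%  # add the missing days up to 364
  fill(Amount, .direction = "down") %>%   # propagate the last Amount into them
  ungroup()
```

Row order differs from the data.table output (new rows are interleaved by Day within each Year), but the contents should be the same.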
