Sum of column based on a condition in R

I would like to print out the total amount for each date, so that my new data frame will have date and total amount columns.
My data frame looks like this:
permitnum   amount
6/1/2022    na
ascas       30.00
olic        40.41
6/2/2022    na
avrey       17.32
fev         32.18
grey        12.20
Any advice on how to go about this will be appreciated.

Here is another tidyverse option: convert permitnum to a date (and reformat it), fill that date down so it can be used for grouping, then get the sum for each date.
library(tidyverse)

df %>%
  mutate(permitnum = format(as.Date(permitnum, "%m/%d/%Y"), "%m/%d/%Y")) %>%
  fill(permitnum, .direction = "down") %>%
  group_by(permitnum) %>%
  summarise(total_amount = sum(as.numeric(amount), na.rm = TRUE))
Output
  permitnum  total_amount
  <chr>             <dbl>
1 06/01/2022         70.4
2 06/02/2022         61.7
Data
df <- structure(list(permitnum = c("6/1/2022", "ascas", "olic", "6/2/2022",
"avrey", "fev", "grey"), amount = c("na", "30.00", "40.41", "na",
"17.32", "32.18", "12.20")), class = "data.frame", row.names = c(NA,
-7L))

Here is another option. Split the data into groups at each date row (identified by a permitnum containing a digit), then summarise the total amount for each group and combine it with that group's date.
library(tidyverse)
dat <- read_table("permitnum amount
6/1/2022 na
ascas 30.00
olic 40.41
6/2/2022 na
avrey 17.32
fev 32.18
grey 12.20")
dat |>
  group_split(id = cumsum(grepl("\\d", permitnum))) |>
  map_dfr(\(x){
    date <- x$permitnum[[1]]
    x |>
      slice(-1) |>
      summarise(date = date,
                total_amount = sum(as.numeric(amount)))
  })
#> # A tibble: 2 x 2
#> date total_amount
#> <chr> <dbl>
#> 1 6/1/2022 70.4
#> 2 6/2/2022 61.7

Related

R: Comparing Subgroups From Different Datasets

I am working with the R programming language.
I have the following dataset that contains the heights and weights of people from Canada. Using the value of weight (kg), I split the data into ntile bins and calculated the average height (cm) within each ntile bin:
library(dplyr)
library(gtools)

set.seed(123)
canada = data.frame(height = rnorm(10000, 150, 10), weight = rnorm(10000, 90, 10))

Part_1 = canada %>%
  mutate(quants = quantcut(weight, 100),
         rank = as.numeric(quants)) %>%
  group_by(quants) %>%
  mutate(min = min(weight), max = max(weight), count = n(), avg_height = mean(height))

Part_1 = Part_1 %>% distinct(rank, .keep_all = TRUE)
> Part_1
# A tibble: 100 x 8
# Groups: quants [100]
height weight quants rank min max count avg_height
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <int> <dbl>
1 144. 114. (110.2,113.9] 99 110. 114. 100 150.
2 148. 88.3 (88.12,88.38] 44 88.1 88.4 100 149.
3 166. 99.3 (99.1,99.52] 83 99.1 99.5 100 152.
4 151. 84.3 (84.14,84.44] 29 84.1 84.4 100 150.
For example, I see that there are 100 people in the weight range of 110.2 - 113.9 kg, and the average height of these people is 150 cm.
Now, suppose I have a similar dataset for people from the USA:
set.seed(124)
usa = data.frame(height = rnorm(10000,150,10), weight = rnorm(10000,90, 10))
My Question: Based on the weight ranges I calculated using the Canada dataset, I want to find out how many people from the USA fall within these Canadian ranges and what the average height of the Americans within these Canadian ranges is.
For example:
In the Canada dataset, I saw that there are 100 people in the weight range of 110.2 - 113.9 kg, and the average height of these people is 150 cm.
How many Americans are in the weight range of 110.2 - 113.9 kg, and what is the average height of these Americans?
I know that I can do this manually for each rank:
americans_in_canadian_rank99 = usa %>%
  filter(weight > 110.2 & weight < 113.9) %>%
  summarize(count = n(), avg_height = mean(height))

americans_in_canadian_rank44 = usa %>%
  filter(weight > 88.1 & weight < 88.4) %>%
  summarize(count = n(), avg_height = mean(height))
In the end, I would be looking for a desired output like this:
# number of rows should be = number of unique ranks
canadian_rank min_weight max_weight canadian_count canadian_avg_height american_count american_avg_height
1 99 110.2 113.9 100 150 116 150
2 44 88.1 88.4 100 149 154 150
Can someone please help me figure out a better way to do this?
Thanks!
Note: updated based on the desired output format, combining the two data sets.
This can be done in a straightforward manner using the non-equi join functionality of data.table.
library(data.table)
library(gtools)

set.seed(123)
canada = data.table(height = rnorm(10000, 150, 10), weight = rnorm(10000, 90, 10))
set.seed(124)
usa = data.table(height = rnorm(10000, 150, 10), weight = rnorm(10000, 90, 10))

## You can also use data.table to generate your Part_1 summary table
Part_1 <- canada[, .(min = min(weight),
                     max = max(weight),
                     count = .N,
                     avg_height = mean(height)), keyby = .(quants = quantcut(weight, 100))]
Part_1[, rank := as.numeric(quants)]

## Join using a non-equi join to combine the data sets
usa[Part_1, on = .(weight >= min,
                   weight < max)
    ## On the join result, compute the same summary stats by quants & rank
    ][, .(usa_count = .N,
          usa_avg_height = mean(height)), keyby = .(rank,
                                                    quants,
                                                    ## whenever we do a non-equi join, the foreign key values (in this case min/max)
                                                    ## overwrite the local keys. Since we used weight twice, the Canadian min/max
                                                    ## will show up in the join result table as weight and weight.1
                                                    min_weight = weight,
                                                    max_weight = weight.1,
                                                    ## to keep both sets of results distinct, we can rename columns in our "by" statement
                                                    canadian_count = count,
                                                    canadian_avg_height = avg_height)]
Gives results as follows:
rank quants min_weight max_weight canadian_count canadian_avg_height usa_count usa_avg_height
1: 1 [55.11,66.71] 55.11266 66.69011 100 149.2101 114 149.8116
2: 2 (66.71,69.48] 66.70575 69.46055 100 149.0639 119 148.6486
3: 3 (69.48,71.15] 69.48011 71.13895 100 150.5331 94 148.4336
4: 4 (71.15,72.44] 71.14747 72.43042 100 150.4779 104 149.8926
Also, another option would be to assign result columns for the usa table directly back to your Part_1 summary table in place.
## This is a two-part nested join
Part_1[
  ## Start by creating a result that matches Part_1 ranks to all usa data
  Part_1[usa, on = .(min <= weight,
                     max > weight)
         ## Compute aggregated results on the join table result
         ][, .(usa_count = .N,
               usa_avg_height = mean(height)), by = .(rank)],
  ## Finally, assign results back to the Part_1 summary table joined by rank
  c("usa_count",
    "usa_avg_height") := .(usa_count,
                           usa_avg_height), on = .(rank)]
Gives the following:
quants min max count avg_height rank usa_count usa_avg_height
1: [55.11,66.71] 55.11266 66.69011 100 149.2101 1 114 149.8116
2: (66.71,69.48] 66.70575 69.46055 100 149.0639 2 119 148.6486
3: (69.48,71.15] 69.48011 71.13895 100 150.5331 3 94 148.4336
4: (71.15,72.44] 71.14747 72.43042 100 150.4779 4 104 149.8926
With data.table you can do this:
library(data.table)
library(stringr)
library(dplyr)  # for the %>% pipe and select(), used below

dt1 <- as.data.table(usa)
dt1 <- dt1[, c("min", "max") := weight]

dt2 <- as.data.table(Part_1 %>% select("quants", "rank"))
dt2 <- cbind(dt2[, .(rank)],
             setDT(tstrsplit(str_sub(dt2$quants, 2, -2), ",", fixed = TRUE, names = c("min", "max"))))
dt2 <- dt2[, lapply(.SD, as.numeric)]
setkey(dt2, min, max)

dt1 <- dt1[, rank := dt2$rank[foverlaps(dt1, dt2, by.x = c("min", "max"), by.y = c("min", "max"), which = TRUE)$yid]] %>%
  select(-c("min", "max"))
EDIT
Totally missed the last part. But if you wish to do that, it should be relatively straightforward from the last point (you could use dplyr for that if you wish):
dt3 <- rbind(canada %>%
               mutate(quants = quantcut(weight, 100),
                      rank = as.numeric(quants),
                      country = "Canada") %>%
               as.data.table(),
             copy(dt1)[, country := "USA"], fill = TRUE)

dt3 <- dt3[, .(count = .N, avg_height = mean(height)), by = c("rank", "country")] %>%
  dcast(rank ~ country, value.var = c("count", "avg_height")) %>%
  merge(dt2 %>% rename("min_weight" = "min", "max_weight" = "max"), by = c("rank"), all.x = TRUE)
EDIT 2
Alternatively, you could do something similar using the cut function, without needing to learn any data.table:
rank_breaks <- Part_1 %>%
  mutate(breaks = sub(",.*", "", str_sub(quants, 2)) %>% as.numeric()) %>%
  arrange(rank) %>%
  pull(breaks)

# Here I change the minimum of group 1 and the maximum of group 100 to -Inf and Inf respectively.
# If you do not wish to do so, you can disregard this and run `rank_breaks <- c(rank_breaks, max(canada$weight))` instead
rank_breaks[1] <- -Inf
rank_breaks <- c(rank_breaks, Inf)

usa <- usa %>%
  mutate(rank = cut(weight, breaks = rank_breaks, labels = c(1:100)))
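From here, a hedged follow-up sketch (assuming the rank column created above and the Part_1 table from the question, whose column names I reuse) of how you might summarise the US data per Canadian rank and join it back to the Canadian summary:
library(dplyr)

usa %>%
  mutate(rank = as.numeric(as.character(rank))) %>%   # cut() returned a factor
  group_by(rank) %>%
  summarize(american_count = n(), american_avg_height = mean(height)) %>%
  left_join(Part_1 %>%
              ungroup() %>%
              select(rank, min_weight = min, max_weight = max,
                     canadian_count = count, canadian_avg_height = avg_height),
            by = "rank")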
You can use fuzzyjoin for this.
library(fuzzyjoin)

# take percentile ranges and join US data
us_by_canadian_quantiles <- Part_1 |>
  ungroup() |>
  distinct(rank, min, max, height_avg_can = avg_height) |>
  fuzzy_full_join(usa, by = c(min = "weight", max = "weight"), match_fun = c(`<`, `>=`))

# get count and height average per bin
us_by_canadian_quantiles |>
  group_by(rank) |>
  summarize(n_us = n(),
            height_avg_us = mean(height),
            height_avg_can = first(height_avg_can))
#> # A tibble: 101 × 4
#> rank n_us height_avg_us height_avg_can
#> <dbl> <int> <dbl> <dbl>
#> 1 1 114 150. 149.
#> 2 2 119 149. 149.
#> 3 3 94 148. 151.
#> 4 4 104 150. 150.
#> 5 5 115 152. 150.
#> 6 6 88 150. 149.
#> 7 7 86 150. 150.
#> 8 8 86 150. 151.
#> 9 9 102 151. 151.
#> 10 10 81 152. 150.
#> # … with 91 more rows
Note that there are a number of cases in the US frame which fall outside of the Canadian percentile ranges. They are grouped together here with rank being NA, but you could also add ranks 0 and 101 if you wanted to distinguish them.
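If you want to make that explicit, here is a hedged sketch (assuming the join result and the Part_1 columns shown above) that relabels the unmatched US observations as rank 0 when they fall below all Canadian bins and rank 101 when they fall above them:
library(dplyr)

us_by_canadian_quantiles <- us_by_canadian_quantiles |>
  mutate(rank = case_when(
    !is.na(rank)              ~ rank,
    weight <= min(Part_1$min) ~ 0,    # below the lowest Canadian bin
    weight >  max(Part_1$max) ~ 101   # above the highest Canadian bin
  ))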
I should note that fuzzyjoin tends to be much slower than data.table. But since you have already gotten a data.table solution, this might be more to your liking.

pivot_longer and pivot_wider syntax

I want to ask for ideas on creating a pivot_longer call for my data.
I've already tried researching on the internet, but I can't seem to find any examples similar to my data, which has a Metric column that is also separated across three different month columns.
My desired final output is to have seven columns (region, month, and the five metrics).
How do I formulate the pivot_longer and pivot_wider syntax to clean my data so that I can visualize it?
The tricky part isn't pivot_longer. You first have to clean your Excel spreadsheet, i.e. get rid of empty rows and merge the two header rows containing the names of the variables and the dates.
One approach to achieve your desired result may look like so:
library(readxl)
library(tidyr)
library(janitor)
library(dplyr)

x <- read_excel("data/Employment.xlsx", skip = 3, col_names = FALSE) %>%
  # Get rid of empty rows and cols
  janitor::remove_empty()

# Make column names
col_names <- data.frame(t(x[1:2, ])) %>%
  fill(1) %>%
  unite(name, 1:2, na.rm = TRUE) %>%
  pull(name)

x <- x[-c(1:2), ]
names(x) <- col_names

# Convert to long and values to numerics
x %>%
  pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_") %>%
  separate(months, into = c("month", "year")) %>%
  mutate(across(!c(Region, month, year), as.numeric))
#> # A tibble: 6 × 8
#> Region month year `Total Population … `Labor Force Part… `Employment Rat…
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Philippin… April 2020f 73722. 55.7 82.4
#> 2 Philippin… Janu… 2021p 74733. 60.5 91.3
#> 3 Philippin… April 2021p 74971. 63.2 91.3
#> 4 National … April 2020f 9944. 54.2 87.7
#> 5 National … Janu… 2021p 10051. 57.2 91.2
#> 6 National … April 2021p 10084. 60.1 85.6
#> # … with 2 more variables: Unemployment Rate <dbl>, Underemployment Rate <dbl>
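Note that because names_to uses the ".value" sentinel, the five metrics are already spread back out into their own columns during the reshape, so a separate pivot_wider() step should not be needed here.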

Reshaping multiple long columns into wide column format in R

My sample dataset has multiple columns that I want to convert into wide format. I have tried using the dcast function, but I get an error. Below is my sample dataset:
df2 = data.frame(
  emp_id = c(rep(1, 2), rep(2, 4), rep(3, 3)),
  Name = c(rep("John", 2), rep("Kellie", 4), rep("Steve", 3)),
  Year = c("2018", "2019", "2018", "2018", "2019", "2019", "2018", "2019", "2019"),
  Type = c(rep("Salaried", 2), rep("Hourly", 2), rep("Salaried", 2), "Hourly", rep("Salaried", 2)),
  Dept = c("Sales", "IT", "Sales", "Sales", rep("IT", 3), rep("Sales", 2)),
  Salary = c(100, 1000, 95, 95, 1500, 1500, 90, 1200, 1200)
)
I'm expecting my output to look like:
One option is the function pivot_wider() from the tidyr package:
df.wide <- tidyr::pivot_wider(df2,
                              names_from = c("Type", "Dept", "Year"),
                              values_from = "Salary",
                              values_fn = mean)
This should get you the desired result.
What do you think about this output? It is not the expected output, but I find it easier to interpret the data:
library(dplyr)

df2 %>%
  group_by(Name, Year, Type, Dept) %>%
  summarise(mean = mean(Salary))
Output:
Name Year Type Dept mean
<chr> <chr> <chr> <chr> <dbl>
1 John 2018 Salaried Sales 100
2 John 2019 Salaried IT 1000
3 Kellie 2018 Hourly Sales 95
4 Kellie 2019 Salaried IT 1500
5 Steve 2018 Hourly IT 90
6 Steve 2019 Salaried Sales 1200
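If the wide layout is still wanted afterwards, here is a hedged sketch (reusing the column names above) of feeding this summarised table into pivot_wider():
library(dplyr)
library(tidyr)

df2 %>%
  group_by(Name, Year, Type, Dept) %>%
  summarise(mean_salary = mean(Salary), .groups = "drop") %>%
  pivot_wider(names_from = c(Type, Dept, Year), values_from = mean_salary)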

Is there any way to join two data frames by date ranges?

I have two data frames. The first dataset is the record of forecasted demand over the following 27 days for each item of the company, shown below:
library(tidyverse)
library(lubridate)

daily_forecast <- data.frame(
  item = c("A", "B", "A", "B"),
  date_fcsted = c("2020-8-1", "2020-8-1", "2020-8-15", "2020-8-15"),
  fcsted_qty = c(100, 200, 200, 100)
) %>%
  mutate(date_fcsted = ymd(date_fcsted)) %>%
  mutate(extended_date = date_fcsted + days(27))
and the other dataset is the actual daily demand for each item:
actual_orders <- data.frame(
  order_date = rep(seq(ymd("2020-8-3"), ymd("2020-9-15"), by = "1 week"), 2),
  item = rep(c("A", "B"), 7),
  order_qty = round(rnorm(n = 14, mean = 50, sd = 10), 0)
)
What I am trying to accomplish is to get the actual total demand for each item between date_fcsted and extended_date in the first dataset, and then have the two joined so I can calculate the forecast accuracy.
Solutions with tidyverse would be highly appreciated.
You can try the following:
library(dplyr)

daily_forecast %>%
  left_join(actual_orders, by = 'item') %>%
  filter(order_date >= date_fcsted & order_date <= extended_date) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(value = sum(order_qty))
# item date_fcsted extended_date fcsted_qty value
# <chr> <date> <date> <dbl> <dbl>
#1 A 2020-08-01 2020-08-28 100 179
#2 A 2020-08-15 2020-09-11 200 148
#3 B 2020-08-01 2020-08-28 200 190
#4 B 2020-08-15 2020-09-11 100 197
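Since the question also asks about forecast accuracy, here is a hedged follow-up sketch (not part of the answer above, and assuming a simple actual-to-forecast ratio is an acceptable measure):
res <- daily_forecast %>%
  left_join(actual_orders, by = "item") %>%
  filter(order_date >= date_fcsted & order_date <= extended_date) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_qty = sum(order_qty), .groups = "drop")

# e.g. a simple ratio-based accuracy measure
res %>% mutate(accuracy = actual_qty / fcsted_qty)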
You could also try fuzzy_left_join from the fuzzyjoin package, as suggested by @Gregor Thomas. I added a row number column to make sure you have unique rows independent of item and date ranges (but this may not be needed).
library(fuzzyjoin)
library(dplyr)

daily_forecast %>%
  mutate(rn = row_number()) %>%
  fuzzy_left_join(actual_orders,
                  by = c("item" = "item",
                         "date_fcsted" = "order_date",
                         "extended_date" = "order_date"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty))
Output
rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
<int> <chr> <date> <date> <dbl> <dbl>
1 1 A 2020-08-01 2020-08-28 100 221
2 2 B 2020-08-01 2020-08-28 200 219
3 3 A 2020-08-15 2020-09-11 200 212
4 4 B 2020-08-15 2020-09-11 100 216

filtering data based on rank and conditions

I have some data which looks similar to the following:
# A tibble: 2,717 x 6
# Groups: date [60]
symbol date monthly.returns score totals score_rank
<chr> <date> <dbl> <dbl> <dbl> <int>
1 GIS 2010-01-29 0.0128 0.436 119. 2
2 GIS 2010-02-26 0.00982 0.205 120. 1
3 GIS 2010-03-31 -0.0169 0.549 51.1 3
4 GIS 2010-04-30 0.0123 0.860 28.0 4
5 GIS 2010-05-28 0.000984 0.888 91.6 4
6 GIS 2010-06-30 -0.00267 0.828 15.5 4
7 GIS 2010-07-30 -0.0297 0.482 81.7 2
8 GIS 2010-08-31 0.0573 0.408 57.2 3
9 GIS 2010-09-30 0.0105 0.887 93.3 4
10 GIS 2010-10-29 0.0357 0.111 96.6 1
# ... with 2,707 more rows
I have a score_rank column; whenever the totals column is > 100, I want to filter the data in the following way:
1) When the score_rank = 1, take the top 5% of observations based on the score column
2) When the score_rank = 2 or 3, take a random sample of 5% of the observations
3) When the score_rank = 4, take the bottom 5% of observations based on the score column.
Data:
tickers <- c("GIS", "KR", "MKC", "SJM", "EL", "HRL", "HSY", "K",
             "KMB", "MDLZ", "MNST", "PEP", "PG", "PM", "SYY", "TAP", "TSN", "WBA", "WMT",
             "MMM", "ABMD", "ACN", "AMD", "AES", "AON", "ANTM", "APA", "CSCO", "CMS", "KO", "GRMN", "GPS",
             "JEC", "SJM", "JPM", "JNPR", "KSU", "KEYS", "KIM", "NBL", "NEM", "NWL", "NFLX", "NEE", "NOC", "TMO", "TXN", "TWTR")

library(tidyquant)

data <- tq_get(tickers,
               get = "stock.prices",        # Collect the stock price data from 2010 - 2015
               from = "2010-01-01",
               to = "2015-01-01") %>%
  group_by(symbol) %>%
  tq_transmute(select = adjusted,           # Convert the data from daily prices to monthly returns
               mutate_fun = periodReturn,
               period = "monthly",
               type = "arithmetic")

data$score <- runif(nrow(data), min = 0, max = 1)
data$totals <- runif(nrow(data), min = 10, max = 150)

data <- data %>%
  group_by(date) %>%
  mutate(score_rank = ntile(score, 4))
Edit: Added code.
Here is one option to do the filtering. Create a list of functions (fs), one for each 'score_rank' condition, and use map2 to loop over that list together with the corresponding list of 'score_rank' vectors. For each pair, filter 'data' to rows where 'totals' is greater than 100 and 'score_rank' is %in% the vector supplied by map2, then apply the function to the 'score' column to select the sample of rows. Finally, bind the subset to the rows of 'data' where 'totals' is less than or equal to 100.
library(purrr)
library(dplyr)

fs <- list(as_mapper(~ . >= quantile(., prob = 0.95)),
           as_mapper(~ row_number() %in% sample(row_number(), round(0.05 * n()))),
           as_mapper(~ . <= quantile(., prob = 0.05)))

map2_df(list(1, c(2, 3), 4), fs, ~
          data %>%
            filter(totals > 100, score_rank %in% .x) %>%
            filter(.y(score))
) %>%
  bind_rows(data %>% filter(totals <= 100))
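As a hedged alternative sketch (not from the answer above), the same selection can be written with dplyr's slice helpers; since data is still grouped by date, each 5% cut is taken within a date, mirroring the grouped quantiles used above:
library(dplyr)

high <- data %>% filter(totals > 100)

picked <- bind_rows(
  high %>% filter(score_rank == 1) %>% slice_max(score, prop = 0.05),      # top 5% by score
  high %>% filter(score_rank %in% c(2, 3)) %>% slice_sample(prop = 0.05),  # random 5%
  high %>% filter(score_rank == 4) %>% slice_min(score, prop = 0.05)       # bottom 5% by score
)

result <- bind_rows(picked, data %>% filter(totals <= 100))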
