How best to parse fields in R?

Below is the sample data, as it comes from the Current Population Survey. There are 115 columns in the original; below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that the data be made longer and parsed a bit.
For some context, the first character is the race: a = all, b = black, w = white, and h = hispanic. The second character is the gender: x = all, m = male, and f = female. The third field, which does not appear in all columns, is the age: 2024 for ages 20-24, 3039 for ages 30-39, and so on. Each name ends in one of the terms laborforce, unemp, or unemprate.
stfips <- c(32,32,32,32,32,32,32,32)
areatype <- c(01,01,01,01,01,01,01,01)
periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
period <- c(01,02,03,04,05,06,07,08)
xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
balaborforce <- c(5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19)
ba2024unemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)
test2 <- data.frame(stfips, areatype, periodyear, period, xalaborforce, xaunemp, xaunemprate, walaborforce, waunemp, waunemprate, balaborforce, ba2024laborforce, ba2024unemp, ba2024unemprate)
Desired result
stfips areatype periodyear period race gender age laborforce unemp unemprate
32 01 2021 01 x a all 1210.9 55.7 2.3
32 01 2021 02 x a all 1215.3 55.2 2.5
.....(the other six rows for race = x and gender = a)
32 01 2021 01 w a all 1000.0 50.2 3.4
32 01 2021 02 w a all 999.2 49.5 3.6
....(the other six rows for race = w and gender = a)
32 01 2021 01 b a 2024 1.2 .2 2.1

Edit -- added handling for columns with an age component. Mostly there, but it would be nice to have a concise way to add the hyphen that turns 2024 into 20-24 (see the sketch after the output below)...
library(dplyr)
library(tidyr)
library(stringr)
library(readr)  # for parse_number()

test2 %>%
  pivot_longer(xalaborforce:ba2024unemprate) %>%
  separate(name, c("race", "gender", "stat"), sep = c(1, 2)) %>%
  mutate(age = coalesce(parse_number(stat) %>% as.character(), "all"),
         stat = str_remove_all(stat, "[0-9]")) %>%
  pivot_wider(names_from = stat, values_from = value)
# A tibble: 32 × 10
stfips areatype periodyear period race gender age laborforce unemp unemprate
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 32 1 2021 1 x a all 1211. 55.7 2.3
2 32 1 2021 1 w a all 1000 50.2 3.4
3 32 1 2021 1 b a all 5.5 NA NA
4 32 1 2021 1 b a 2024 1.2 0.2 2.1
5 32 1 2021 2 x a all 1215. 55.2 2.5
6 32 1 2021 2 w a all 999. 49.5 3.6
7 32 1 2021 2 b a all 5.7 NA NA
8 32 1 2021 2 b a 2024 1.4 0.3 2.2
9 32 1 2021 3 x a all 1201. 65.2 2.7
10 32 1 2021 3 w a all 1000. 51.6 3.8
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
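For the remaining wish (turning 2024 into 20-24), one concise option is a regex substitution tacked onto the end of the pipeline above; a sketch, assuming the age codes are always two two-digit bounds:
... %>%
  mutate(age = sub("^(\\d{2})(\\d{2})$", "\\1-\\2", age))
# "2024" becomes "20-24"; "all" has no match and passes through unchanged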

How best to calculate this share of a total

Below is the sample data. The goal is first to create a column that contains the total employment for that quarter, and second to create a new column that shows each area's share of that total. Finally, the last item (and the one vexing me) is to calculate whether the total with suppress = 0 represents over 50% of the total. I can do this easily in Excel, but I am trying to do it in R so that it is something I can replicate year after year.
The desired result is below.
area <- c("001","005","007","009","011","013","015","017","019","021","023","027","033","001","005","007","009","011","013","015","017","019","021","023","027","033")
year <- c("2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021")
qtr <- c("01","01","01","01","01","01","01","01","01","01","01","01","01","02","02","02","02","02","02","02","02","02","02","02","02","02")
employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)
testitem <- data.frame(year,qtr, area, employment, suppress)
For the first quarter of 2021, the total is 173. The suppress = 1 rows account for only 24 of 173, hence the TRUE in the 50 percent column. If those suppressed values summed to 173/2 or more, it would say FALSE. For the second quarter, the suppress = 1 rows account for 310 of the 537 total and so are over 50% of it.
For the total column, I am showing the computation or ingredients. Ideally, it would show a value such as .0116 in place of 2/173.
year qtr area employment suppress total 50percent
2021 01 001 2 0 =2/173 TRUE
2021 01 005 4 0 =4/173 TRUE
.....
2021 02 001 3 0 =3/537 FALSE
2021 02 005 5 0 =5/537 FALSE
For example:
library(dplyr)
testitem %>%
  group_by(year, qtr) %>%
  mutate(
    total = employment / sum(employment),
    over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
  )
Gives:
# A tibble: 26 × 7
# Groups: year, qtr [2]
year qtr area employment suppress total over_half
<chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 2021 01 001 2 0 0.0116 TRUE
2 2021 01 005 4 0 0.0231 TRUE
3 2021 01 007 6 0 0.0347 TRUE
4 2021 01 009 8 1 0.0462 TRUE
5 2021 01 011 11 0 0.0636 TRUE
6 2021 01 013 10 0 0.0578 TRUE
7 2021 01 015 12 0 0.0694 TRUE
8 2021 01 017 14 0 0.0809 TRUE
9 2021 01 019 16 1 0.0925 TRUE
10 2021 01 021 18 0 0.104 TRUE
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
I think you'll want to use group_by() and mutate() here.
library(dplyr)
testitem |>
  ## grouping by year and quarter;
  ## sums will be calculated over areas
  group_by(year, qtr) |>
  ## this could be more terse, but it gets the job done
  mutate(total_sum = sum(employment),
         ## this uses the total_sum column that was just created
         total_prop = employment / total_sum,
         ## leveraging the 0/1 coding of suppress
         suppress_sum = sum(suppress * employment),
         suppress_prop = suppress_sum / total_sum,
         fifty = (1 - suppress_prop) > 0.5)
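If a one-row-per-quarter check is all that's needed, the same logic collapses naturally into summarize(); a minimal sketch, assuming the testitem data above:
library(dplyr)
testitem %>%
  group_by(year, qtr) %>%
  summarize(
    total = sum(employment),
    unsuppressed_share = sum(employment[suppress == 0]) / total,
    over_half = unsuppressed_share > 0.5,  # TRUE for 2021 Q1, FALSE for Q2
    .groups = "drop"
  )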

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set of the number of flu cases seen each week across several years.
What I want to do is create a new column that tabulates the average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new column to give the average count over all years with Week.Number == 1 & Flu.Year < 2017. Normally, I would use the case_when() function to tabulate something like this conditionally. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
  Flu.Year == 2016 ~ mean(chcc$count[chcc$Flu.Year == 2016]),
  Flu.Year == 2017 ~ mean(chcc$count[chcc$Flu.Year == 2017]),
  Flu.Year == 2018 ~ mean(chcc$count[chcc$Flu.Year == 2018]),
  Flu.Year == 2019 ~ mean(chcc$count[chcc$Flu.Year == 2019])
))
However, there are four years of data times 52 weeks, which is a lot of conditions to spell out. Is there a way to code this elegantly in dplyr? The problem I keep running into is that I want to pull values from the count column based on the Week.Number and Flu.Year values of other rows, conditioned on the current row's Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information/detail I can provide.
Thanks,
Steven
dat <- tibble(
  Flu.Year = rep(2016:2019, each = 52),
  Week.Number = rep(1:52, 4),
  count = sample(1000, size = 52*4, replace = TRUE)
)
It's bad form and, in some cases, an error to use $-indexing within dplyr verbs.
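A minimal illustration, assuming the dat defined above: inside a grouped mutate(), $-indexing reaches back to the whole original column and silently ignores the grouping.
library(dplyr)
dat %>%
  group_by(Flu.Year) %>%
  mutate(
    bad  = mean(dat$count),  # grand mean of the full column; groups ignored
    good = mean(count)       # per-group mean, as intended
  )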
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
  Flu.Year = sample(2016:2020, size = 100, replace = TRUE),
  count = sample(1000, size = 100, replace = TRUE)
)
dat %>%
  group_by(Flu.Year) %>%
  mutate(average = mean(count)) %>%
  # just to show a quick summary
  slice(1:3) %>%
  ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back into the original data.
dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count)) %>%
  full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after discussion in chat:
tibble(Flu.Year = rep(2016:2018, each = 3), Week.Number = rep(1:3, 3), count = 1:9) %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
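As an aside, cumsum(count) / seq_along(count) is just a running mean, so on a dplyr version that provides cummean() the same column can be written more concisely (equivalent output, to the best of my knowledge):
tibble(Flu.Year = rep(2016:2018, each = 3), Week.Number = rep(1:3, 3), count = 1:9) %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cummean(count)))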
We can use aggregate from base R:
aggregate(count ~ Flu.Year, dat, FUN = mean)

Arrange every 2 columns of a data frame one below the other in R

Hi, I have a df as below, which shows dates and their respective values:
date 1_val date 2_val . . . . date n_val
2014 23 2014 33 . . . . 2014 34
2015 22 2016 12 . . . . 2016 99
I have tried hard-coding to arrange the columns one below the other.
For columns 1 & 2:
a <- 1
b <- 2
names_2 <- df[, c(a, b)]
colnames(names_2)[1] <- "Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all), ]
names_2 <- melt(names_2, id = colnames(names_2)[1])  # melt() from reshape2 or data.table
samp_out <- names_2
For columns 3 & 4:
a <- 3
b <- 4
names_2 <- df[, c(a, b)]
colnames(names_2)[1] <- "Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all), ]
names_2 <- melt(names_2, id = colnames(names_2)[1])
samp_out1 <- names_2
...and so on up to n, then bind them all:
df1 = rbind(samp_out, samp_out1, ...... samp_out_n)
output
date variable value
2014 1_val 23
2015 1_val 22
2014 2_val 33
2016 2_val 12
.
.
2014 n_val 34
2016 n_val 99
Thanks in advance
The function melt in the package data.table does that:
melt(df, id = "Date", measure = patterns("_val"))
You can specify the variable to pivot on (Date in this case) and a pattern in the names of the variables whose values you want to keep. You can also supply a vector with all the variable names instead.
> DT <- data.table(Date = c(2014,2013), `1_val` = c(33, 32), Date = c(2014, 2013), `2_val` = c(65, 34))
> DT
Date 1_val Date 2_val
1: 2014 33 2014 65
2: 2013 32 2013 34
> melt(DT, id = "Date", measure = patterns("_val"))
Date variable value
1: 2014 1_val 33
2: 2013 1_val 32
3: 2014 2_val 65
4: 2013 2_val 34
You can use stack from base R,
setNames(data.frame(stack(df[c(TRUE, FALSE)])[1],
                    stack(df[c(FALSE, TRUE)])),
         c('date', 'value', 'variable'))
# date value variable
#1 2014 33 1_val
#2 2013 32 1_val
#3 2014 65 2_val
#4 2013 34 2_val
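For completeness, on tidyr >= 1.0 the same reshape can be done with pivot_longer() and its ".value" sentinel, assuming the duplicated column names are first made unique so each date/value pair shares an index (a sketch, not from the original answers):
library(tidyr)
df <- data.frame(c(2014, 2013), c(33, 32), c(2014, 2013), c(65, 34))
names(df) <- c("date_1", "val_1", "date_2", "val_2")  # disambiguate the pairs
pivot_longer(df, everything(), names_to = c(".value", "variable"), names_sep = "_")
# # A tibble: 4 x 3
#   variable  date   val
# 1 1         2014    33
# 2 2         2014    65
# 3 1         2013    32
# 4 2         2013    34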
Define the untidy rectangle
library(magrittr)
csv <- "date,1_val,date,2_val,date,3_val
2014,23,2014,33,2014,34
2015,22,2016,12,2016,99"
Read it into a data frame, then transform it into a long/EAV (entity-attribute-value) rectangle.
ds_eav <- csv %>%
  readr::read_csv() %>%
  tibble::rownames_to_column(var = "height") %>%
  tidyr::gather(key = key, value = value, -height)
Identify which rows are dates and which are values, then shift the dates' index up by 1.
ds_eav <- ds_eav %>%
  dplyr::mutate(
    index_val = sub("^(\\d+)_val$", "\\1", key),
    index_date = sub("^date_(\\d+)$", "\\1", key),
    index_date = dplyr::if_else(key == "date", "0", index_date),
    key = dplyr::if_else(grepl("^date(_\\d+)*", key), "date", "value"),
    index = dplyr::if_else(key == "date", index_date, index_val),
    index = as.integer(index),
    index = index + dplyr::if_else(key == "date", 1L, 0L)
  ) %>%
  dplyr::select(key, index, value, height)
output:
# A tibble: 12 x 4
key index value height
<chr> <int> <int> <int>
1 date 1 2014 1
2 date 1 2015 2
3 value 1 23 1
4 value 1 22 2
5 date 2 2014 1
6 date 2 2016 2
7 value 2 33 1
8 value 2 12 2
9 date 3 2014 1
10 date 3 2016 2
11 value 3 34 1
12 value 3 99 2
Following the advice of @jarko-dubbeldam, use spread/gather on the last step too:
ds_eav %>%
  tidyr::spread(key = key, value = value)
output:
# A tibble: 6 x 4
index height date value
* <int> <int> <int> <int>
1 1 1 2014 23
2 1 2 2015 22
3 2 1 2014 33
4 2 2 2016 12
5 3 1 2014 34
6 3 2 2016 99
You can use paste0(index, "_val") to get your exact output. But I'd prefer to keep the indices as integers, so you can do math on them if necessary (e.g., max()).
edit 1: incorporate the advice & corrections of @jarko-dubbeldam and @hnskd.
edit 2: use rownames_to_column() in case the input isn't a balanced rectangle (e.g., one column doesn't have all the rows).

Similarity score between vectors and creating column vectors based on a function

I have a sample for which I want to create an aggregate measure based on similarity scores of each person's movie interests. For example, consider the following data.
person <- c( 'John', 'John', 'Vikram', 'Kris', 'Kris', 'Lara', 'Mohi', 'Mohi', 'Mohi')
year<- c(2010, 2011,2010,2010, 2011, 2010, 2010, 2011, 2012)
sciencefiction <- c( 4, 5, 0, 44,32, 5, 32, 43,33)
romantic <- c( 19, 28, 56, 7, 4, 33, 2,1,2)
comedy<- c(22,34, 22,34,44, 54, 54,32,44)
timespent<- c(30,40, 100,33, 22, 80, 96, 22,34)
df<- data.frame(person, year, sciencefiction, romantic, comedy, timespent)
I want a variable called similarity score, which is given by the sum of person i's similarity to each person j, multiplied by the time spent by j, summed over all combinations for one year. For example, for person John for year 2010 it would be:
score[John, 2010] = 0.8*100 + 0.6*33 + 0.98*80 + 0.73*96 = 248.28
The 0.8 is the cosine similarity (calculated as a.b / (|a||b|)) between John and Vikram, from the angle between the two vectors formed by sciencefiction + romantic + comedy (v[i] = 4i + 19j + 22k and v[j] = 0i + 7j + 34k), and 100 is the time spent by Vikram watching movies in 2010. In a similar way the comparisons are made and aggregated for John. Is there a way to do this operation in R to create a column called score with the above procedure? Thanks
I'll step through this solution. Skip down to the bottom for the overall result.
Up front: because 2012 only has one person (Mohi), there is no output. You can easily capture this either by not filtering out self-comparisons (which should score 0) or re-merging in missing person/year rows.
Update 2: your df$person needs to be character, so either create your data with
df <- data.frame(..., stringsAsFactors = FALSE)
or modify it in-place with
df$person <- as.character(df$person)
Dependencies
I'm using dplyr here primarily because I think it clearly communicates what is going on. There is nothing in the code that could not be replaced with base functions (or even data.table).
library(dplyr)
One could use tidyr::crossing instead of expand.grid and purrr::pmap instead of mapply. They have strengths but are mostly drop-in replacements, so I leave it up to the reader.
A simple geometric angle-calculation function, for simplicity/reference
angle <- function(a, b, zero = NaN) {
  num <- (a %*% b)                           # dot product
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))   # product of the two norms
  if (denom == 0) zero else (num / denom)
}
Update: if either of the vectors is all-0, then R calculates 0/0 as NaN. Depending on your use, it may make sense to change this to 0 or NA.
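As a quick sanity check, John's and Vikram's 2010 genre vectors from df reproduce the first angle in the table further below:
angle(c(4, 19, 22), c(0, 56, 22))  # John vs Vikram, 2010
#           [,1]
# [1,] 0.8768294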
Identify unique combinations (not permutations)
df %>%
  distinct(year, person) %>%
  group_by(year) %>%
  do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
  ungroup() %>%
  filter(person != person2) %>%
  mutate(
    p1 = pmin(person, person2),
    p2 = pmax(person, person2)
  ) %>%
  distinct() %>%
  select(-person, -person2)
# # A tibble: 13 × 3
# year p1 p2
# <dbl> <chr> <chr>
# 1 2010 John Vikram
# 2 2010 John Kris
# 3 2010 John Lara
# 4 2010 John Mohi
# 5 2010 Kris Vikram
# 6 2010 Lara Vikram
# 7 2010 Mohi Vikram
# 8 2010 Kris Lara
# 9 2010 Kris Mohi
# 10 2010 Lara Mohi
# 11 2011 John Kris
# 12 2011 John Mohi
# 13 2011 Kris Mohi
If you ran up through (but stopped at) the expand.grid, you'd end up with redundant pairs, e.g. "John, Vikram" and "Vikram, John". Because I infer you are interested in pairwise combinations rather than permutations, the rest of that code block removes redundant rows.
Bring in each person's data
(continuing in the pipe with the previous data)
... %>%
  left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
  left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2"))
# # A tibble: 13 × 11
# year p1 p2 sciencefiction1 romantic1 comedy1 timespent1 sciencefiction2 romantic2 comedy2 timespent2
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 4 19 22 30 0 56 22 100
# 2 2010 John Kris 4 19 22 30 44 7 34 33
# 3 2010 John Lara 4 19 22 30 5 33 54 80
# 4 2010 John Mohi 4 19 22 30 32 2 54 96
# 5 2010 Kris Vikram 44 7 34 33 0 56 22 100
# 6 2010 Lara Vikram 5 33 54 80 0 56 22 100
# 7 2010 Mohi Vikram 32 2 54 96 0 56 22 100
# 8 2010 Kris Lara 44 7 34 33 5 33 54 80
# 9 2010 Kris Mohi 44 7 34 33 32 2 54 96
# 10 2010 Lara Mohi 5 33 54 80 32 2 54 96
# 11 2011 John Kris 5 28 34 40 32 4 44 22
# 12 2011 John Mohi 5 28 34 40 43 1 32 22
# 13 2011 Kris Mohi 32 4 44 22 43 1 32 22
Calculate the angle per pair
... %>%
  mutate(
    angle = mapply(function(a, b, c, d, e, f) angle(c(a, b, c), c(d, e, f), zero = NA),
                   sciencefiction1, romantic1, comedy1,
                   sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
  ) %>%
  select(year, p1, p2, starts_with("timespent"), angle)
# A tibble: 13 × 6
# year p1 p2 timespent1 timespent2 angle
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 30 100 0.8768294
# 2 2010 John Kris 30 33 0.6427461
# 3 2010 John Lara 30 80 0.9851037
# 4 2010 John Mohi 30 96 0.7347653
# 5 2010 Kris Vikram 33 100 0.3380778
# 6 2010 Lara Vikram 80 100 0.7948679
# 7 2010 Mohi Vikram 96 100 0.3440492
# 8 2010 Kris Lara 33 80 0.6428056
# 9 2010 Kris Mohi 33 96 0.9256539
# 10 2010 Lara Mohi 80 96 0.7881070
# 11 2011 John Kris 40 22 0.7311130
# 12 2011 John Mohi 40 22 0.5600843
# 13 2011 Kris Mohi 22 22 0.9533073
Finally, the score
... %>%
  group_by(year, person = p1) %>%
  summarize(
    score = angle %*% timespent2
  ) %>%
  ungroup()
# # A tibble: 6 × 3
# year person score
# <dbl> <chr> <dbl>
# 1 2010 John 258.23933
# 2 2010 Kris 174.09501
# 3 2010 Lara 155.14507
# 4 2010 Mohi 34.40492
# 5 2011 John 28.40634
# 6 2011 Kris 20.97276
I'm guessing the difference between my 258.24 and your 248.28 is due to the second vector (Vikram's values).
TL;DR
All at once:
df %>%
  distinct(year, person) %>%
  group_by(year) %>%
  do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
  ungroup() %>%
  filter(person != person2) %>%
  mutate(
    p1 = pmin(person, person2),
    p2 = pmax(person, person2)
  ) %>%
  select(-person, -person2) %>%
  distinct() %>%
  # pairwise lookups
  left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
  left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2")) %>%
  # calculate angles
  mutate(
    angle = mapply(function(a, b, c, d, e, f) angle(c(a, b, c), c(d, e, f)),
                   sciencefiction1, romantic1, comedy1,
                   sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
  ) %>%
  # calculate scores
  group_by(year, person = p1) %>%
  summarize(
    score = angle %*% timespent2
  ) %>%
  ungroup()

Percentile for multiple groups of values in R

I'm using R to do my data analysis, and I'm looking for code to achieve the output mentioned below. I need a single piece of code that does this, as I have over 500 groups and 24 months in my actual data; the sample below has only 2 groups and 2 months.
This is a sample of my data.
Date Group Value
1-Jan-16 A 10
2-Jan-16 A 12
3-Jan-16 A 17
4-Jan-16 A 20
5-Jan-16 A 12
5-Jan-16 B 56
1-Jan-16 B 78
15-Jan-16 B 97
20-Jan-16 B 77
21-Jan-16 B 86
2-Feb-16 A 91
2-Feb-16 A 44
3-Feb-16 A 93
4-Feb-16 A 87
5-Feb-16 A 52
5-Feb-16 B 68
1-Feb-16 B 45
15-Feb-16 B 100
20-Feb-16 B 81
21-Feb-16 B 74
And this is the output I'm looking for.
Month Year Group Minimum Value 5th Percentile 10th Percentile 50th Percentile 90th Percentile Max Value
Jan 2016 A
Jan 2016 B
Feb 2016 A
Feb 2016 B
Considering dft as your input, you can try:
library(dplyr)
library(lubridate)  # for month() and year()

dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
  mutate(mon = month(Date),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  mutate(minimum = min(Value),
         maximum = max(Value),
         q95 = quantile(Value, 0.95)) %>%
  select(minimum, maximum, q95) %>%
  unique()
which gives:
mon yr Group minimum maximum q95
<int> <int> <chr> <int> <int> <dbl>
1 1 2016 A 10 20 19.4
2 1 2016 B 56 97 94.8
3 2 2016 A 44 93 92.6
4 2 2016 B 45 100 96.2
and you can add more variables as per your need. A fuller sketch matching the desired output follows.
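To produce the full set of columns from the desired output in one pass, here is a sketch using summarise() with one quantile() call per percentile (same dft assumption as above):
library(dplyr)
library(lubridate)
dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
  group_by(Month = month(Date, label = TRUE), Year = year(Date), Group) %>%
  summarise(`Minimum Value`   = min(Value),
            `5th Percentile`  = quantile(Value, 0.05),
            `10th Percentile` = quantile(Value, 0.10),
            `50th Percentile` = quantile(Value, 0.50),
            `90th Percentile` = quantile(Value, 0.90),
            `Max Value`       = max(Value),
            .groups = "drop")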
