I'm working on a dataset with a six-digit grouping system. The first two digits denote the top-level group, the next two denote the sub-group, and the last two denote the specific type within the sub-group. I want to group the data by the top level of the hierarchy (the first two digits only) and count unique names in each group.
An example for the GroupID 010203:
01 denotes BMW
02 denotes 3-series
03 denotes 320i (the exact model)
All I care about in this example is how many of each brand there are.
Toy dataset and wanted output:
df <- data.table(Quarter = c('Q4', 'Q4', 'Q4', 'Q4', 'Q3'),
                 GroupID = c('010203', '150503', '010101', '150609', '010000'),  # character, so the leading zeros survive
                 Name = c('AAAA', 'AAAA', 'BBBB', 'BBBB', 'CCCC'))
Output:
Quarter Group Counts
Q3 01 1
Q4 01 2
Q4 15 2
Using data.table we could do:
library(data.table)
df[, Group := substr(GroupID, 1, 2)][
  , Counts := .N, by = list(Group, Quarter)][
  , head(.SD, 1), by = .(Quarter, Group, Counts)][
  , .(Quarter, Group, Counts)]
Returns:
Quarter Group Counts
1: Q4 01 2
2: Q4 15 2
3: Q3 01 1
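Note that the question asks for a count of unique names; .N counts rows, which matches here only because no Name repeats within a Quarter/Group combination. A sketch counting distinct names directly with data.table's uniqueN():
library(data.table)
# Count distinct names per Quarter and top-level group; this differs from .N
# whenever the same Name occurs more than once within a group
df[, Group := substr(GroupID, 1, 2)][
  , .(Counts = uniqueN(Name)), by = .(Quarter, Group)]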
With dplyr and stringr we could do something like:
library(dplyr)
library(stringr)
df %>%
mutate(Group = str_sub(GroupID, 1, 2)) %>%
group_by(Group, Quarter) %>%
summarise(Counts = n()) %>%
ungroup()
Returns:
# A tibble: 3 x 3
Group Quarter Counts
<chr> <chr> <int>
1 01 Q3 1
2 01 Q4 2
3 15 Q4 2
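Similarly with dplyr, n_distinct() would count each name once per group if names could repeat (a sketch on the same data):
df %>%
  mutate(Group = str_sub(GroupID, 1, 2)) %>%
  group_by(Group, Quarter) %>%
  summarise(Counts = n_distinct(Name), .groups = "drop")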
Since you are already using data.table, you can do:
df[, Group := substr(GroupID, 1, 2)]
df <- df[, Counts := .N, by = .(Group, Quarter)][, .(Group, Quarter, Counts)]
df <- unique(df)
print(df)
Group Quarter Counts
1: 01 Q4 2
2: 15 Q4 2
3: 01 Q3 1
Here's my simple solution with plyr and base R; it is lightning fast.
library(plyr)
df$breakid <- substr(df$GroupID, start = 1, stop = 2)
d <- plyr::count(df, c("Quarter", "breakid"))
Result
Quarter breakid freq
Q3 01 1
Q4 01 2
Q4 15 2
Alternatively, using tapply (and data.table indexing):
df$Brand <- substr(df$GroupID, 1, 2)
tapply(df$Brand, df[, .(Quarter, Brand)], length)
(This works if you don't mind the output being a matrix with Quarter and Brand as its dimensions; see the sketch below for a long-form conversion.)
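If a long data frame is preferred over the matrix, the tapply result can be reshaped with base R's as.data.frame() on a table, dropping the empty combinations (a sketch):
m <- tapply(df$Brand, df[, .(Quarter, Brand)], length)
# Convert the Quarter x Brand count matrix to long form and drop empty cells
subset(as.data.frame(as.table(m), responseName = "Counts"), !is.na(Counts))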
Related
I have multiple tables, all with the same variable names, that I want to join by an ID, but each table represents a different year. If I use an inner_join, it correctly keeps only the IDs present in every table, but it creates new variables for the observations (i.e. X becomes X.x and X.y in the same row). I could use rbind, but that keeps all the rows when I only want those that appear in every table.
library(dplyr)
df1 <- data.frame(x1 = 1:3,
x2 = c(12,14,11),
year = 2020)
df2 <- data.frame(x1 = 2:4,
x2 = c(15,17,13),
year = 2021)
dfall <- inner_join(df1,df2,by="x1")
This results in:
x1 x2.x year.x x2.y year.y
2 14 2020 15 2021
3 11 2020 17 2021
But I want this:
x1 x2 year
2 14 2020
2 15 2021
3 11 2020
3 17 2021
Is there a join where I can do this?
dplyr::bind_rows and then filter would work:
bind_rows(df1, df2) %>%
filter(x1 %in% intersect(df1$x1, df2$x1))
You can pipe the output to arrange(x1) to sort the output if needed.
Output
x1 x2 year
1 2 14 2020
2 3 11 2020
3 2 15 2021
4 3 17 2021
library(tidyr) # pivot_longer
inner_join(df1,df2,by="x1") %>%
pivot_longer(-x1, names_pattern="(.*)\\.(.*)",
names_to=c(".value", "val")) %>%
select(-val)
# # A tibble: 4 x 3
# x1 x2 year
# <int> <dbl> <dbl>
# 1 2 14 2020
# 2 2 15 2021
# 3 3 11 2020
# 4 3 17 2021
Try this. It's an inner join of your two approaches so far.
dfall <- inner_join(rbind(df1, df2), inner_join(df1, df2, by = "x1") %>% select(x1), by = "x1")
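The same idea can also be written as a semi-join of the stacked tables against the IDs that survive the inner join (a sketch):
library(dplyr)
# Keep only stacked rows whose x1 occurs in both tables
bind_rows(df1, df2) %>%
  semi_join(inner_join(df1, df2, by = "x1"), by = "x1")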
Here's another option. It creates a column n equal to the number of times each x1 appears, then keeps only those rows whose x1 appears as many times as there are distinct values of year. You could change n == length(unique(year)) to n >= 2 if you wanted records that appear in more than one year/table, as opposed to those appearing in every year/table. This approach is nice because it scales easily to a large number of input tables (sketched below).
dfall <- rbind(df1, df2) %>%
add_count(x1) %>%
filter(n==length(unique(year))) %>%
select(-n)
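A sketch of that scaling, assuming the tables are collected in a list and each table carries a distinct year as in the example:
library(dplyr)
tables <- list(df1, df2)  # add further yearly tables here
bind_rows(tables) %>%
  add_count(x1) %>%
  filter(n == length(unique(year))) %>%
  select(-n)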
I have a data frame something like below (no seed is set, so the sampled numbers will differ from run to run):
amount <- sample(10000:2000, 20)
year<- sample(2015:2017, 20, replace = TRUE)
company<- sample(LETTERS[1:3],20, replace = TRUE)
df<-data.frame(company, year, amount)
Then I want to group by company and year so I have:
df %>%
group_by(company, year) %>%
summarise(
total= sum(amount)
)
company year total
<fct> <int> <int>
1 A 2015 1094
2 A 2016 3308
3 A 2017 4785
4 B 2015 1190
5 B 2016 6583
6 B 2017 1964
7 C 2015 4974
8 C 2016 1986
9 C 2017 3465
Now, I want to divide the last row in each group by the first row. In other words, for each company I want to divide the total for the last year by the total for the first year.
Thanks.
You could use last and first to access those elements of total, respectively:
library(dplyr)
df %>%
group_by(company, year) %>%
summarise(total= sum(amount)) %>%
summarise(final = last(total)/first(total))
# company final
# <fct> <dbl>
#1 A 2.26
#2 B 1.92
#3 C 0.565
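Note that first() and last() depend on row order. summarise() returns its result sorted by the grouping variables, so total is already ordered by year within each company; if you prefer to make that explicit, an arrange() step can be inserted (a sketch):
df %>%
  group_by(company, year) %>%
  summarise(total = sum(amount)) %>%
  arrange(year, .by_group = TRUE) %>%  # guarantee year order within company
  summarise(final = last(total) / first(total))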
In base R, we can use aggregate
aggregate(amount~company, aggregate(amount~company+year, df, sum),
function(x) x[length(x)]/x[1])
# company amount
#1 A 2.262524
#2 B 1.919138
#3 C 0.565281
With data.table, we can do:
library(data.table)
# keyby sorts by company and year, so first()/last() pick the earliest and
# latest year within each company
setDT(df)[, .(total = sum(amount)), keyby = .(company, year)][,
  .(final = last(total)/first(total)), .(company)]
I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
I have multiple observations for each id and time identifier, e.g. I have 3 different Alpha 1970 values. I would like to retain only one observation per id/year, specifically the last one that appears for each id/year.
The final dataset should look something like this:
final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"),
"Year" = c(1970,1971,1980,1981,1982),
"Val" = c(-2,5,5,3,5))
Does anyone know how I can approach the problem?
Thanks a lot in advance for your help.
If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.
How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <-
df %>%
group_by(id, Year) %>%
slice(n()) %>%
ungroup()
final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
Created on 2019-09-29 by the reprex package (v0.3.0)
slice(n()) translates to: within each id-Year group, take only the row whose row number equals the size of the group, i.e. the last row under the current ordering.
You could also use filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE); but distinct() keeps the first distinct row, so you'd need to reverse the row ordering first, as sketched below.
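A minimal sketch of that reversal: slice(n():1) flips the row order, so the last occurrence becomes the first one that distinct() keeps, and arrange() restores a tidy ordering afterwards.
df %>%
  slice(n():1) %>%
  distinct(id, Year, .keep_all = TRUE) %>%
  arrange(id, Year)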
An option with base R
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row instead:
aggregate(Val ~ ., df, head, 1)
I have a set of 85 possible combinations from two variables, one with five values (years) and one with 17 values (locations). I make a dataframe that has the years in the first column and the locations in the second column. For each combination of year and location I want to calculate the weighted mean value and then add it to the third column, according to the year and location values.
My code is as follows:
for (i in unique(data1$year)) {
  for (j in unique(data1$location)) {
    data2 <- crossing(data1$year, data1$location)
    dataname <- subset(data1, year %in% i & location %in% j)
    result <- weighted.mean(dataname$length, dataname$raising_factor, na.rm = TRUE)
  }
}
The result I get puts the last calculated mean in the third column of every row.
How can I get it to fill in according to the matching year and location combination?
Thanks.
A base R option would be by
by(df[c('x', 'y')], df[c('group', 'year')],
function(x) weighted.mean(x[,1], x[,2]))
Based on @LAP's example.
As @A.Suleiman suggested, we can use dplyr::group_by.
Example data:
df <- data.frame(group = rep(letters[1:5], each = 4),
year = rep(2001:2002, 10),
x = 1:20,
y = rep(c(0.3, 1, 1/0.3, 0.4), each = 5))
library(dplyr)
df %>%
group_by(group, year) %>%
summarise(test = weighted.mean(x, y))
# A tibble: 10 x 3
# Groups: group [?]
group year test
<fctr> <int> <dbl>
1 a 2001 2.000000
2 a 2002 3.000000
3 b 2001 6.538462
4 b 2002 7.000000
5 c 2001 10.538462
6 c 2002 11.538462
7 d 2001 14.000000
8 d 2002 14.214286
9 e 2001 18.000000
10 e 2002 19.000000
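For completeness, a data.table sketch of the same grouped weighted mean (same example data; keyby sorts the result like the dplyr output above):
library(data.table)
setDT(df)[, .(test = weighted.mean(x, y)), keyby = .(group, year)]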
Background
I've a quarterly data set where certain quarters and the corresponding values are missing. The characteristics of the data set are:
Each group should have the same number of quarters, but in practice some quarters are missing
The values for the missing quarters are unknown
This is to be resolved by carrying the last available value forward, for instance via the na.locf function
Example data
# Packages
Vectorize(require)(package = c("tidyverse", "zoo", "magrittr"),
character.only = TRUE)
# Seed
set.seed(123)
# Dummy data
dta <- data.frame(group = rep(LETTERS[1:5], 10)) %>%
group_by(group) %>%
mutate(qrtr = seq(
from = as.Date("01/01/2012", "%d/%m/%Y"),
to = as.Date("31/5/2014", "%d/%m/%Y"),
by = "quarter"
)) %>%
ungroup() %>%
mutate(qrtr = as.yearqtr(qrtr)) %>%
arrange(group, qrtr) %>%
mutate(value = sample(1:10, 50, replace = TRUE))
# Remove random rows
dta[sample(1:dim(dta)[1], 10), c(2, 3)] <- NA
dta %<>% na.omit()
Preview
# A tibble: 40 x 3
group qrtr value
<chr> <S3: yearqtr> <int>
1 A 2012 Q1 3
2 A 2012 Q2 8
3 A 2012 Q4 9
4 A 2013 Q1 10
5 A 2013 Q3 6
6 A 2013 Q4 9
7 A 2014 Q1 6
8 B 2012 Q1 10
9 B 2012 Q2 5
10 B 2012 Q3 7
# ... with 30 more rows
Problem
Add rows within each group where quarters are missing. The full set of quarters is derived from the sequence min(qrtr) to max(qrtr); in the context of the existing code:
seq(from = as.Date("01/01/2012", "%d/%m/%Y"),
to = as.Date("31/5/2014", "%d/%m/%Y"),
by = "quarter")
The last non-missing value should be carried forward to fill each missing quarter.
Desired results:
>> dta
# A tibble: 50 x 3
group qrtr value
<chr> <S3: yearqtr> <int>
1 A 2012 Q1 3
2 A 2012 Q2 8
3 A 2012 Q3 8
4 A 2012 Q4 9
5 A 2013 Q1 10
6 A 2013 Q2 10
7 A 2013 Q3 6
8 A 2013 Q4 9
9 A 2014 Q1 6
10 A 2014 Q2 6
# ... with 40 more rows
Proposed approaches
One approach would rely on tidyr's expand to convert implicitly missing values into explicitly missing rows. So far this creates the missing quarters, but there is no clear way of creating the observations for the value column where a given quarter is missing.
dta %>%
# Append missing quarters
expand(group, qrtr) %>%
left_join(data.frame(qrtr = as.yearqtr(
seq(
from = as.Date("01/01/2012", "%d/%m/%Y"),
to = as.Date("31/5/2014", "%d/%m/%Y"),
by = "quarter"
)
)), by = "qrtr") %>%
# TODO
# mutate(value = na.locf(value)) %>%
arrange(group, qrtr) -> dta_fixed
You seem to be interested in padr:
library(padr)
library(zoo)
# Convert to POSIXct (via as.Date), since pad() expects a date-time column
dta$qrtr <- as.POSIXct(as.Date(dta$qrtr))
dta %>%
pad(group="group") %>%
arrange(group, qrtr) %>%
mutate(qrtr = as.yearqtr(qrtr)) %>%
na.locf()
The output is:
# A tibble: 49 x 3
group qrtr value
<chr> <chr> <chr>
1 A 2012 Q1 3
2 A 2012 Q2 8
3 A 2012 Q3 8
4 A 2012 Q4 9
5 A 2013 Q1 10
6 A 2013 Q2 10
7 A 2013 Q3 6
8 A 2013 Q4 9
9 A 2014 Q1 6
10 B 2012 Q1 10
# ... with 39 more rows
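Note that na.locf() applied to the whole data frame coerces every column to character, which is why the preview above shows <chr> types. A sketch that instead fills only value within each group and so preserves the column types (it assumes the same POSIXct conversion as above; zoo's na.locf0() keeps the vector length by leaving any leading NAs in place):
library(dplyr)
dta %>%
  pad(group = "group") %>%
  arrange(group, qrtr) %>%
  group_by(group) %>%
  mutate(value = na.locf0(value)) %>%  # fill within each group only
  ungroup() %>%
  mutate(qrtr = as.yearqtr(qrtr))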
Use read.zoo to create a multivariate time series z with one column per group; merge that with a zero-width series of quarters, run na.locf, and then convert back to long form.
We can omit:
the line with the merge if no quarter is missing from every group; this is the case for the example data in the question, i.e. there we could omit the merge (although leaving it in would not cause a problem)
the last line (the one with fortify.zoo) if we can work with the 10 x 5 multivariate time series z directly, which may actually be more convenient; e.g. library(ggplot2); autoplot(z, facet = NULL) + scale_x_yearqtr(), or the same without the facet argument, will plot it using ggplot2 graphics in 1 or 5 panels
This does not use any packages that the question is not already using anyway, and it works directly with the index in the original "yearqtr" class without conversion.
library(zoo)
z <- read.zoo(dta, index = "qrtr", split = "group")
z <- merge(z, zoo(, seq(start(z), end(z), 1/4)))
z <- na.locf(z)
fortify.zoo(z, melt = TRUE)
This could alternatively be expressed as the following pipeline:
library(dplyr) # or library(magrittr)
library(zoo)
dta %>%
read.zoo(index = "qrtr", split = "group") %>%
merge(zoo(, seq(start(.), end(.), 1/4))) %>%
na.locf %>%
fortify.zoo(melt = TRUE)
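The TODO left in the question's own expand() attempt can also be closed with tidyr alone: complete() materialises the missing quarters within each group and fill() carries the last value forward. A sketch, assuming dta as constructed above and the fixed quarter sequence from the question (any group whose very first quarter is missing would keep an NA there):
library(dplyr)
library(tidyr)
library(zoo)
full_q <- as.yearqtr(seq(
  from = as.Date("01/01/2012", "%d/%m/%Y"),
  to = as.Date("31/5/2014", "%d/%m/%Y"),
  by = "quarter"
))
dta %>%
  group_by(group) %>%
  complete(qrtr = full_q) %>%           # add the missing quarters per group
  fill(value, .direction = "down") %>%  # carry the last value forward
  ungroup() %>%
  arrange(group, qrtr)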