I have a data frame something like the one below:
amount <- sample(10000:2000, 20)
year<- sample(2015:2017, 20, replace = TRUE)
company<- sample(LETTERS[1:3],20, replace = TRUE)
df<-data.frame(company, year, amount)
Then I want to group by company and year so I have:
df %>%
group_by(company, year) %>%
summarise(
total= sum(amount)
)
company year total
<fct> <int> <int>
1 A 2015 1094
2 A 2016 3308
3 A 2017 4785
4 B 2015 1190
5 B 2016 6583
6 B 2017 1964
7 C 2015 4974
8 C 2016 1986
9 C 2017 3465
Now I want to divide the last row in each group by the first row. In other words, for each company I want to divide the total for the last year by the total for the first year.
Thanks.
You could use last and first to access the last and first elements of total, respectively:
library(dplyr)
df %>%
group_by(company, year) %>%
summarise(total= sum(amount)) %>%
summarise(final = last(total)/first(total))
# company final
# <fct> <dbl>
#1 A 2.26
#2 B 1.92
#3 C 0.565
In base R, we can use aggregate
aggregate(amount~company, aggregate(amount~company+year, df, sum),
function(x) x[length(x)]/x[1])
# company amount
#1 A 2.262524
#2 B 1.919138
#3 C 0.565281
With data.table, we can do
library(data.table)
setDT(df)[ , .(total = sum(amount)), .(company, year)][,
.(final = last(total)/first(total)), .(company)]
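The answers above rely on the rows of each company already being in year order when the first and last totals are picked. A minimal base R sketch that makes the ordering explicit is shown here; the small data frame uses hypothetical values chosen for easy checking, not the sampled data from the question.

```r
# Small fixed data set (hypothetical values, deliberately not sorted by year)
df <- data.frame(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2017, 2015, 2016, 2016, 2015, 2017),
  amount  = c(300, 100, 200, 50, 40, 20),
  stringsAsFactors = FALSE
)

# Total amount per company and year
totals <- aggregate(amount ~ company + year, data = df, FUN = sum)

# Sort so that "first" and "last" really mean earliest and latest year
totals <- totals[order(totals$company, totals$year), ]

# Ratio of the latest year's total to the earliest year's total, per company
ratio <- sapply(split(totals$amount, totals$company),
                function(x) x[length(x)] / x[1])
ratio
```

Sorting before taking the ratio means the result does not depend on the row order that the grouping step happens to return.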
I was working on the following problem. I've got monthly data from a survey, let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So, the first row was reported in March, but there's no way to know whether it reports March or February values, and it may also be only an approximation of the real value. I also have a table with the actual values for each ID, let's call it df2:
df2 = tibble( ID = c('1', '2') %>% rep(4) %>% sort,
real_value = c(1200,1230,11000,10,25000,3100,100,31030),
month = c(1,2,3,4,2,3,4,5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, I only care about the anchor month OR the month before the anchor month for each ID; second, I want to match each reported value to the closest real value (which sounds like a fuzzy join). My first step was to filter the second table so that it only contains the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month- 1)))
df2 = df2 %>%
inner_join(filter_aux , by=c('ID', 'month' = 'anchor_month')) %>% distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Next I tried a difference_inner_join by ID and reported_value = real_value, (df1 %>% difference_inner_join(df2, by= c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>% left_join(df2 , by='ID') %>%
mutate(dif = abs(real_value - reported_value)) %>%
group_by(ID) %>% filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
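For readers who want both steps (the month filter and the closest-value match) in one place, the following is a rough base R sketch. It rebuilds df1 and df2 from the question as plain data frames; the helper names cand, prev_month and closest are made up for illustration.

```r
df1 <- data.frame(ID = c("1", "2"),
                  reported_value = c(1200, 31000),
                  anchor_month = c(3, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(c("1", "2"), each = 4),
                  real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
                  month = c(1, 2, 3, 4, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)

# Pair every reported row with all real rows of the same ID
cand <- merge(df1, df2, by = "ID")

# Month before the anchor, wrapping January back to December
prev_month <- ifelse(cand$anchor_month == 1, 12, cand$anchor_month - 1)

# Keep only the anchor month or the month before it
cand <- cand[cand$month == cand$anchor_month | cand$month == prev_month, ]

# Within each ID, keep the row whose real value is closest to the reported one
cand$dif <- abs(cand$real_value - cand$reported_value)
closest <- cand[cand$dif == ave(cand$dif, cand$ID, FUN = min), ]
closest
```

If two candidate rows of the same ID are equally close, this sketch keeps both of them, so a tie-breaking rule would still have to be chosen.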
I want to generate a new column to show the Period data using IDs.
My data are similar to this data
df1<-read.table(text="ID Location Day Period Group
241 A am M1 A
231 D am N1 A
241 N pm M2 A
234 K pm N2 B
231 G pm N2 B
300 K am M2 A",header=TRUE)
and the expected data are:
df1<-read.table(text="ID Location Day Period Group Match
241 A am M1 A M2
231 D am N1 A N2
234 K pm N2 B NA
300 K am M2 A NA",header=TRUE)
If an ID is duplicated, only its first row is kept and the Period of the duplicate row goes into the Match column. I want a blank instead of NA.
Try this
library(dplyr)
df1 %>%
filter(!duplicated(ID)) %>%
left_join(
df1 %>%
filter(duplicated(ID)) %>%
select(ID, Period), by = "ID") %>%
rename(Period = Period.x, Match = Period.y)
or using group_split
library(dplyr)
library(purrr)
df1 %>%
mutate(is_duplicated = duplicated(ID)) %>%
group_split(is_duplicated, keep = FALSE) %>%
reduce(left_join, by = "ID", suffix = c("", "_match")) %>%
select(names(df1), Match = Period_match)
After grouping by 'ID', we can get the first element of 'Group', 'Day', 'Location', and return the second element of 'Match'
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Group = first(Group), Day = first(Day),
Location = first(Location),
Match = Period[2], Period = first(Period))
# A tibble: 4 x 6
# ID Group Day Location Match Period
# <int> <fct> <fct> <fct> <fct> <fct>
#1 231 A am D N2 N1
#2 234 B pm K <NA> N2
#3 241 A am A M2 M1
#4 300 A am K <NA> M2
Or another option is to mutate the columns with the first value after grouping by 'ID' and then do the summarise
df1 %>%
group_by(ID) %>%
mutate_at(vars(Group, Day, Location), first) %>%
group_by(Group, Day, Location , .add= TRUE) %>%
summarise(Match = Period[2], Period = first(Period))
# A tibble: 4 x 6
# Groups: ID, Group, Day [4]
# ID Group Day Location Match Period
# <int> <fct> <fct> <fct> <fct> <fct>
#1 231 A am D N2 N1
#2 234 B pm K <NA> N2
#3 241 A am A M2 M1
#4 300 A am K <NA> M2
In dplyr 1.0.0 and later, this can be made more compact with across
df1 %>%
group_by(ID) %>%
summarise(across(c(Group, Day, Location), first),
Match = Period[2], Period = first(Period))
# A tibble: 4 x 6
# ID Group Day Location Match Period
# <int> <fct> <fct> <fct> <fct> <fct>
#1 231 A am D N2 N1
#2 234 B pm K <NA> N2
#3 241 A am A M2 M1
#4 300 A am K <NA> M2
Here, we assume that there would be no more than 2 rows per unique 'ID'
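The question also asks for a blank instead of NA in Match, which the outputs above do not show. One possible base R sketch, under the same assumption of at most two rows per ID, is given here; first_rows and dups are illustrative names only.

```r
df1 <- read.table(text = "ID Location Day Period Group
241 A am M1 A
231 D am N1 A
241 N pm M2 A
234 K pm N2 B
231 G pm N2 B
300 K am M2 A", header = TRUE, stringsAsFactors = FALSE)

# First occurrence of each ID, in the original row order
first_rows <- df1[!duplicated(df1$ID), ]

# Rows that repeat an ID already seen; their Period becomes the Match value
dups <- df1[duplicated(df1$ID), ]
first_rows$Match <- dups$Period[match(first_rows$ID, dups$ID)]

# Replace missing matches with an empty string instead of NA
first_rows$Match[is.na(first_rows$Match)] <- ""
first_rows
```

Reading the columns as character (stringsAsFactors = FALSE) matters here, because assigning "" into a factor column would fail unless "" were already one of its levels.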
I want to summarize relocations (between cities), based on a unique ID number. A sample dataframe, with two unique IDs:
year ID city adress
1 2013 1 B adress_1
2 2014 1 B adress_1
3 2015 1 A adress_2
4 2016 1 A adress_2
5 2013 2 B adress_3
6 2014 2 B adress_3
7 2015 2 C adress_4
8 2016 2 C adress_4
I have provided a sample code below. The summaries are correct, except for one thing. If, for example, a relocation is found between city B and city A, I want an output of relocation found from city B to city A (and number of times 1 = seen once in the dataframe). However, because of the properties of the summary function (and the tendency to store output in alphabetic order), I get the following output
tmp <- df %>% group_by(ID, city, adress) %>% summarize(numberofyears = n())
tmp <- tmp %>%
group_by(ID) %>%
#filter(n() >1) %>%
mutate(from = city[1], from_adres = adress[1], from_years = numberofyears[1], to = city[2],
to_adres = adress[2], to_years = numberofyears[2]) %>%
distinct(ID, .keep_all = TRUE) %>% select(-c(2:3))
# A tibble: 2 x 8
# Groups: ID [2]
ID numberofyears from from_adres from_years to to_adres to_years
<dbl> <int> <fct> <fct> <int> <fct> <fct> <int>
1 1 2 A adress_2 2 B adress_1 2
2 2 2 B adress_3 2 C adress_4 2
This is wrong, because we know that adress_1 precedes adress_2. When summarizing a relocation from city B to city C, I get the right result.
It is a very small detail, but an important one as I tried to demonstrate. Any suggestions would be very much appreciated!
Similar to @jyjek's answer, but this allows for the possibility of more than one move per ID.
library(tidyverse)
df <- data.frame(year = rep(2013:2016, 2),
ID = rep(1:2, each = 4),
city = c("B", "B", "A", "A", "B", "B", "C", "C"),
address = rep(1:4, each = 2),
stringsAsFactors = FALSE)
df %>%
group_by(ID, city, address) %>%
#note the first and last year at the address
summarise(startyear = min(year),
endyear = max(year)) %>%
#sort by ID and year
arrange(ID, startyear) %>%
group_by(ID) %>%
#grab the next address for each ID
mutate(to = lead(city),
to_address = lead(address),
to_years = lead(endyear) - lead(startyear) + 1,
from_years = endyear - startyear + 1) %>%
#exclude the last row of each ID, since there's no next address being moved to
filter(!is.na(to)) %>%
select(ID, from = city, from_address = address, from_years, to, to_address, to_years)
Like this?
library(tidyverse)
df<-read.table(text=" year ID city adress
1 2013 1 B adress_1
2 2014 1 B adress_1
3 2015 1 A adress_2
4 2016 1 A adress_2
5 2013 2 B adress_3
6 2014 2 B adress_3
7 2015 2 C adress_4
8 2016 2 C adress_4",header=T)
df%>%
group_by(ID, city, adress)%>%
summarize(numberofyears = n())%>%
mutate(id=parse_number(adress))%>%
group_by(ID,id)%>%
arrange(id)%>%
ungroup()%>%
select(-id)%>%
group_by(ID)%>%
mutate(from=first(city), from_adres = first(adress),
from_years = first(numberofyears),to=last(city),
to_adres = last(adress),to_years=last(numberofyears))%>%
distinct(ID, .keep_all = TRUE)%>%
select(-c(2:3))
# A tibble: 2 x 8
# Groups: ID [2]
ID numberofyears from from_adres from_years to to_adres to_years
<int> <int> <fct> <fct> <int> <fct> <fct> <int>
1 1 2 B adress_1 2 A adress_2 2
2 2 2 B adress_3 2 C adress_4 2
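The core idea in both answers is to order the rows chronologically rather than alphabetically before deciding which city is "from" and which is "to". A stripped-down base R sketch of that idea follows, using only the year, ID and city columns from the sample data; prev_city, moved and moves are names invented for this example.

```r
df <- data.frame(year = rep(2013:2016, 2),
                 ID   = rep(1:2, each = 4),
                 city = c("B", "B", "A", "A", "B", "B", "C", "C"),
                 stringsAsFactors = FALSE)

# Sort chronologically within each ID so "previous" means the year before
df <- df[order(df$ID, df$year), ]

# City in the previous row of the same ID (NA for each ID's first row)
prev_city <- ave(df$city, df$ID, FUN = function(x) c(NA, head(x, -1)))

# A move is a row whose city differs from the previous row's city
moved <- !is.na(prev_city) & prev_city != df$city
moves <- data.frame(ID = df$ID[moved], year = df$year[moved],
                    from = prev_city[moved], to = df$city[moved],
                    stringsAsFactors = FALSE)
moves
```

Because each row is compared with the one before it, an ID with several moves would produce one row per move.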
I'd like to create a new data frame where the columns are subsets of the same variable that are split by a different variable. For example, I'd like to make a new subset of variable ('b') where the columns are split by a subset of a different variable ('year')
set.seed(88)
df <- data.frame(year = rep(1996:1998,3), a = runif(9), b = runif(9), e = runif(9))
df
year a b e
1 1996 0.41050128 0.97679183 0.7477684
2 1997 0.10273570 0.54925568 0.7627982
3 1998 0.74104481 0.74416429 0.2114261
4 1996 0.48007870 0.55296210 0.7377032
5 1997 0.99051343 0.18097104 0.8404930
6 1998 0.99954223 0.02063662 0.9153588
7 1996 0.03247379 0.33055434 0.9182541
8 1997 0.76020784 0.10246882 0.7055694
9 1998 0.67713100 0.59292207 0.4093590
Desired output for variable 'b' for years 1996 and 1998, is:
V1 V2
1 0.9767918 0.74416429
2 0.5529621 0.02063662
3 0.3305543 0.59292207
I could probably find a way to do this with a loop, but I am wondering if there is a dplyr method (or any simple method) to accomplish this.
We subset dataset based on 1996, 1998 in 'year', select the 'year', 'b' columns and unstack to get the expected output
unstack(subset(df, year %in% c(1996, 1998), select = c('year', 'b')), b ~ year)
# X1996 X1998
#1 0.9767918 0.74416429
#2 0.5529621 0.02063662
#3 0.3305543 0.59292207
Or using tidyverse, we select the columns of interest, filter the rows based on the 'year' column, create a sequence column by 'year', spread to 'wide' format and select out the unwanted columns
library(tidyverse)
df %>%
select(year, b) %>%
filter(year %in% c(1996, 1998)) %>%
group_by(year = factor(year, levels = unique(year), labels = c('V1', 'V2'))) %>%
mutate(n = row_number()) %>%
spread(year, b) %>%
select(-n)
# A tibble: 3 x 2
# V1 V2
# <dbl> <dbl>
#1 0.977 0.744
#2 0.553 0.0206
#3 0.331 0.593
As there are only two 'year's, we can also use summarise
df %>%
summarise(V1 = list(b[year == 1996]), V2 = list(b[year == 1998])) %>%
unnest
Another option with dplyr, mixing in some base R, resulting in a slightly shorter solution than @akrun's code:
bind_cols(split(df$b, df$year)) %>% select(-'1997')
# A tibble: 3 x 2
`1996` `1998`
<dbl> <dbl>
1 0.977 0.744
2 0.553 0.0206
3 0.331 0.593
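The same reshaping can also be sketched in base R alone. The example below uses hypothetical round values for b instead of the runif output from the question, and it assumes every selected year has the same number of rows; keep, sub and wide are illustrative names.

```r
# Hypothetical values for b, chosen so the result is easy to check by eye
df <- data.frame(year = rep(1996:1998, 3),
                 b    = c(0.9, 0.5, 0.7, 0.6, 0.2, 0.1, 0.3, 0.4, 0.8))

# Years that should become columns
keep <- c(1996, 1998)
sub <- df[df$year %in% keep, ]

# One list element per year, then bind the elements side by side as columns
wide <- as.data.frame(split(sub$b, sub$year))

# Rename the columns to V1, V2, ... as in the desired output
names(wide) <- paste0("V", seq_along(wide))
wide
```

If the years had different numbers of rows, as.data.frame would stop with an error, so the shorter columns would need padding first.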
I'm working on a dataset with a six-digit grouping system. The first two digits denote the top-level group, the next two denote the sub-group, and the last two denote the specific type within the sub-group. I want to group the data at the top level of the hierarchy (first two digits only) and count the unique names in each group.
An example for the GroupID 010203:
01 denotes BMW
02 denotes 3-series
03 denotes 320i (the exact model)
All I care about in this example is how many of each brand there is.
Toy dataset and wanted output:
library(data.table)
# GroupID is stored as character so that leading zeros are preserved
df <- data.table(Quarter = c('Q4', 'Q4', 'Q4', 'Q4', 'Q3'),
GroupID = c('010203', '150503', '010101', '150609', '010000'),
Name = c('AAAA', 'AAAA', 'BBBB', 'BBBB', 'CCCC'))
Output:
Quarter Group Counts
Q3 01 1
Q4 01 2
Q4 15 2
Using data.table we could do:
library(data.table)
df[, Group := substr(GroupID, 1, 2)][
, Counts := .N, by = list(Group, Quarter)][
, head(.SD, 1), by = .(Quarter, Group, Counts)][
, .(Quarter, Group, Counts)]
Returns:
Quarter Group Counts
1: Q4 01 2
2: Q4 15 2
3: Q3 01 1
With dplyr and stringr we could do something like:
library(dplyr)
library(stringr)
df %>%
mutate(Group = str_sub(GroupID, 1, 2)) %>%
group_by(Group, Quarter) %>%
summarise(Counts = n()) %>%
ungroup()
Returns:
# A tibble: 3 x 3
Group Quarter Counts
<chr> <fct> <int>
1 01 Q3 1
2 01 Q4 2
3 15 Q4 2
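Since you are already using data.table, you can do the following. Padding GroupID back to six digits first keeps the leading zero even if the IDs were stored as numbers:
df[, Group := substr(sprintf("%06d", as.integer(GroupID)), 1, 2)]
df <- df[,Counts := .N, .(Group,Quarter)][,.(Group, Quarter, Counts)]
df <- unique(df)
print(df)
Group Quarter Counts
1: 01 Q4 2
2: 15 Q4 2
3: 01 Q3 1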
Here's my simple solution with plyr and base R; it is lightning fast.
library(plyr)
df$breakid <- substr(df$GroupID, start = 1, stop = 2)
d <- plyr::count(df, c("Quarter", "breakid"))
Result
Quarter breakid freq
Q3 01 1
Q4 01 2
Q4 15 2
Alternatively, using tapply (and data.table indexing):
df$Brand <- substr(df$GroupID, 1, 2)
tapply(df$Brand, df[, .(Quarter, Brand)], length)
(If you don't care about the output being a matrix).
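The answers above count rows per group, which matches the number of unique names only because no name repeats within a group in the toy data. A hedged base R sketch that counts distinct names explicitly is shown here; it assumes GroupID is stored as character so the leading zero survives, and counts is an illustrative name.

```r
df <- data.frame(Quarter = c("Q4", "Q4", "Q4", "Q4", "Q3"),
                 GroupID = c("010203", "150503", "010101", "150609", "010000"),
                 Name    = c("AAAA", "AAAA", "BBBB", "BBBB", "CCCC"),
                 stringsAsFactors = FALSE)

# Top-level group is the first two characters of the six-digit ID
df$Group <- substr(df$GroupID, 1, 2)

# Number of distinct names per quarter and top-level group
counts <- aggregate(Name ~ Quarter + Group, data = df,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "Counts"
counts
```

With data that does contain repeated names inside a group, this would give smaller numbers than the row counts used above.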