dplyr: broadcasting a single value per group in mutate

I am trying to do something very similar to Scale relative to a value in each group (via dplyr), though that solution seems to crash R for me. I would like to replicate a single value for each group and add a new column with this value repeated. As an example I have
library(dplyr)
data = expand.grid(
  category = LETTERS[1:2],
  year = 2000:2003)
data$value = runif(nrow(data))
data
category year value
1 A 2000 0.6278798
2 B 2000 0.6112281
3 A 2001 0.2170495
4 B 2001 0.6454874
5 A 2002 0.9234604
6 B 2002 0.9311204
7 A 2003 0.5387899
8 B 2003 0.5573527
And I would like a dataframe like
data
category year value value2
1 A 2000 0.6278798 0.6278798
2 B 2000 0.6112281 0.6112281
3 A 2001 0.2170495 0.6278798
4 B 2001 0.6454874 0.6112281
5 A 2002 0.9234604 0.6278798
6 B 2002 0.9311204 0.6112281
7 A 2003 0.5387899 0.6278798
8 B 2003 0.5573527 0.6112281
i.e. the value2 for each category is that category's value from year 2000. I was trying to think of a general solution extensible to an arbitrary filtering criterion, i.e. something like
data %>% group_by(category) %>% mutate(value = filter(data, year==2002))
however this does not work because the assigned value has the wrong length.

Do this:
data %>% group_by(category) %>%
  mutate(value2 = value[year == 2000])
You could also do it this way:
data %>% group_by(category) %>%
  arrange(year) %>%
  mutate(value2 = value[1])
or
data %>% group_by(category) %>%
  arrange(year) %>%
  mutate(value2 = first(value))
or
data %>% group_by(category) %>%
  mutate(value2 = nth(value, n = 1, order_by = year))
or probably several other ways.
Your attempt with mutate(value = filter(data, year==2002)) doesn't make sense for two reasons.
When you explicitly pass in data again, it's not part of the chain that got grouped earlier, so it doesn't know about the grouping.
All dplyr verbs, including filter, take a data frame as their first argument and return a data frame. When you write value = filter(...), you're trying to assign a whole data frame to the single column value.
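For the general "given filtering criterion" case the question asks about, the base year can be parameterised. A minimal sketch (base_year is a hypothetical variable, and it assumes the condition matches exactly one row per group):
library(dplyr)

base_year <- 2002  # hypothetical parameter; any condition matching one row per group works

data %>%
  group_by(category) %>%
  mutate(value2 = value[year == base_year]) %>%
  ungroup()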

Related

Delete duplicates with multiple grouping conditions

I want to delete duplicates with multiple grouping conditions, but I always get far fewer results than expected.
The dataframe compares two companies per year, like this:
year c1 c2
2000 a  b
2000 a  c
2000 a  d
2001 a  b
2001 b  d
2001 a  c
For every c1, I want to look at c2 and delete rows whose (c1, c2) pair already appeared in the previous year.
I found a similar problem but with just one c. Here are some of my tries so far:
library(purrr)  # map_dfr() comes from purrr

df <- df %>%
  group_by(c1, c2) %>%
  mutate(dup = n() > 1) %>%
  group_split() %>%
  map_dfr(~ if (unique(.x$dup) & (.x$year[2] - .x$year[1]) == 1) {
    .x %>% slice_head(n = 1)
  } else {
    .x
  }) %>%
  select(-dup) %>%
  arrange(year)
df <- sqldf("select a.*
             from df a
             left join df b
               on b.c1 = a.c1 and b.c2 = a.c2 and b.year = a.year - 1
             where b.year is null")
The desired output for the example would be:
year c1 c2
2000 a  b
2000 a  c
2000 a  d
2001 b  d
Assuming you want to check for duplicates in the previous year only; here it is demonstrated on a modified sample:
library(tidyverse)
df <- read.table(header = T, text = 'year c1 c2
2000 a b
2000 a c
2000 a d
2001 a b
2001 b d
2001 a c
2002 a d')
df %>%
  filter(map2_lgl(df$year, paste(df$c1, df$c2),
                  ~ !paste(.x - 1, .y) %in% paste(df$year, df$c1, df$c2)))
#> year c1 c2
#> 1 2000 a b
#> 2 2000 a c
#> 3 2000 a d
#> 4 2001 b d
#> 5 2002 a d
Created on 2021-07-08 by the reprex package (v2.0.0)
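The idea here is to build a composite key with paste(year, c1, c2) and keep a row only if no row exists with the same (c1, c2) key in year - 1.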
Some of the other solutions won't work because I think they ignore the fact that you will probably have many years and want to eliminate duplicates only from the prior year.
Here is something fairly simple. You could do this in some map function, but sometimes a simple loop does just fine. For each year of data, use anti_join() to return only those rows from the current year which are not in the prior year, then restack the data.
df_split <- df %>%
  group_split(year)

for (this_year in 2:length(df_split)) {
  df_split[[this_year]] <- df_split[[this_year]] %>%
    anti_join(df_split[[this_year - 1]], by = c("c1", "c2"))
}

bind_rows(df_split)
# # A tibble: 4 x 3
# year c1 c2
# <int> <chr> <chr>
# 1 2000 a b
# 2 2000 a c
# 3 2000 a d
# 4 2001 b d
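One design note: each year is anti-joined against the already-filtered version of the prior year, so a pair removed in year t will not also suppress its reappearance in year t + 1. If you want comparisons against the original data instead, anti-join against an untouched copy of the split.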
Edit
Another approach is to add a dummy column for the prior year and just use an anti_join() with that. This is probably what I would do.
df %>%
  mutate(prior_year = year - 1) %>%
  anti_join(df, by = c(prior_year = "year", "c1", "c2")) %>%
  select(-prior_year)
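The named element in by = c(prior_year = "year", "c1", "c2") tells anti_join() to match prior_year in the left table against year in the right table, so a row is dropped exactly when the same (c1, c2) pair occurs one year earlier.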
You can also use the following solution.
library(dplyr)
library(purrr)
df %>%
  filter(pmap_int(list(df$c1, df$c2, df$year), ~ df %>%
    filter(year %in% c(..3, ..3 - 1)) %>%
    rowwise() %>%
    mutate(output = all(c(..1, ..2) %in% c_across(c1:c2))) %>%
    pull(output) %>%
    sum) < 2)
# AnilGoyal's modified data set
year c1 c2
1 2000 a b
2 2000 a c
3 2000 a d
4 2001 b d
5 2002 a d
This will only keep the data you want. Here, data is your data frame:
data[!duplicated(data[,2:3]),]
I think this is pretty simple with base duplicated(), using the fromLast option to keep the last rather than the first entry. (It does assume the data are ordered by year.)
dat[!duplicated(dat[2:3], fromLast=TRUE), ] # negate logical vector in i-position
year c1 c2
3 2000 a d
4 2001 a b
5 2001 b d
6 2001 a c
I do get a different result than you said was expected, so maybe I misunderstood the specifications?
Assuming that you indeed wanted to keep the last year, as stated in the question (but contrary to your example table), you could simply use slice:
library(dplyr)
df = data.frame(year = c("2000","2000","2000","2001","2001","2001"),
                c1 = c("a","a","a","a","b","a"),
                c2 = c("b","c","d","b","d","c"))
df %>% group_by(c1, c2) %>%
  slice_tail() %>% arrange(year, c1, c2)
Use slice_head(), if you wanted the first year.
Here is the documentation: slice

Create column with a certain week value by group

I would like to create a column, by group, with a certain week's value from another column.
In this example New_column is created with the Number from the 2nd week for each group.
Group Week Number New_column
A 1 19 8
A 2 8 8
A 3 21 8
A 4 5 8
B 1 4 12
B 2 12 12
B 3 18 12
B 4 15 12
C 1 9 4
C 2 4 4
C 3 10 4
C 4 2 4
I've used this method, which works, but I feel is a really messy way to do it:
library(dplyr)
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = ifelse(Week == 2, Number, NA))
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = sum(New_column, na.rm = TRUE))
Several solutions are possible, depending on what you need specifically. With your specific sample data, however, all of them give the same result.
1) This identifies the week by its value in column Week, even if the dataframe is not sorted:
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[Week == 2])
However, if the weeks do not start from 1, this solution will still look only for the row where Week == 2 (see the defensive sketch after solution 3).
2) If df is already sorted by Week inside each group, you could use
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[2])
This solution does not take the Number from the row where Week == 2, but rather from the second row within each group, regardless of its actual Week value.
3) If df is not sorted by week, you could do it with
df %>%
  group_by(Group) %>%
  arrange(Week, .by_group = TRUE) %>%
  mutate(New_column = Number[2])
It uses the same rationale as solution 2), after sorting by Week within each group.
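If some group might lack a Week == 2 row entirely (the caveat under solution 1), a defensive variant is possible. A sketch using match(), which returns NA rather than producing a length-zero subset:
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[match(2, Week)])  # NA when no row has Week == 2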

Expanding a dataframe from years to months

I have the data frame with a column for years. See below:
D <- as.data.frame(cbind(c(1998,1998,1999,1999,2000,2001,2001), c(1,2,2,5,1,3,4), c(1,5,9,2,NA,7,8)))
colnames(D) <- c('year','var1','var2')
D$start <- D$year*100+1
D$end <- D$year*100+12
print(D)
year var1 var2 start end
1 1998 1 1 199801 199812
2 1998 2 5 199801 199812
3 1999 2 9 199901 199912
4 1999 5 2 199901 199912
5 2000 1 NA 200001 200012
6 2001 3 7 200101 200112
7 2001 4 8 200101 200112
I want to copy each row 12 times, one for each month between the start and end columns. I made the start and end columns January and December in this example, but in theory they could be different. Obviously I am really dealing with an incredibly large dataset, so I was wondering how I could do it in one or two lines (preferably using dplyr, since that is the coding language I am most used to).
If you want all months for each row, I would do this as a join:
months = expand.grid(year = unique(D$year), month = 1:12)
left_join(D, months, by = "year")
If you want most months for most years, you could filter out the ones you don't want in a next step.
If you really want to use the start and end columns you've created, I would do it like this:
D %>%
  mutate(month = Map(seq, start, end)) %>%
  tidyr::unnest(cols = month)
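Note that start and end are yyyymm integers here, so Map(seq, start, end) just enumerates twelve consecutive integers within one year; a range crossing a year boundary (say 199811 to 199902) would generate invalid month codes, so this trick only works for within-year ranges.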
We can use expand from tidyr:
expand(D, year = unique(year), month = 1:12) %>%
  left_join(D, by = 'year')
This also works (it needs tibble for rowid_to_column() and tidyr for gather(), complete(), full_seq(), and fill()):
D %>%
  rowid_to_column() %>%
  gather(key = key, value = date, start, end) %>%
  select(-key) %>%
  group_by(rowid) %>%
  complete(date = full_seq(date, 1)) %>%
  fill(everything(), -rowid, .direction = "downup") %>%
  ungroup() %>%
  arrange(rowid)
If you want to keep the start and end columns add the following before ungroup():
mutate(start = min(date), end = max(date))
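Unlike the fixed month = 1:12 joins above, this approach respects each row's own start and end values, which matters because the question notes they could in theory differ from January and December.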

Scale relative to a value in each group (via dplyr)

I have a set of time series, and I want to scale each of them relative to their value in a specific interval. That way, each series will be at 1.0 at that time and change proportionally.
I can't figure out how to do that with dplyr.
Here's a working example using a for loop:
library(dplyr)
data = expand.grid(
  category = LETTERS[1:3],
  year = 2000:2005)
data$value = runif(nrow(data))
# the base time point in the series (see the edit below)
baseYear = 2002
# for each category, divide all the values by the category's value in the base year
for (category in as.character(levels(factor(data$category)))) {
  data[data$category == category, ]$value =
    data[data$category == category, ]$value /
    data[data$category == category & data$year == baseYear, ]$value[[1]]
}
Edit: Modified the question such that the base time point is not indexable. Sometimes the "time" column is actually a factor, which isn't necessarily ordinal.
This solution is very similar to @thelatemail's, but I think it's sufficiently different to merit its own answer because it chooses the index based on a condition:
data %>%
  group_by(category) %>%
  mutate(value = value/value[year == baseYear])
# category year value
#... ... ... ...
#7 A 2002 1.00000000
#8 B 2002 1.00000000
#9 C 2002 1.00000000
#10 A 2003 0.86462789
#11 B 2003 1.07217943
#12 C 2003 0.82209897
(Data output has been truncated. To replicate these results, set.seed(123) when creating data.)
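One caveat: value[year == baseYear] is a length-zero vector for any group that lacks the base year, and mutate() will then fail with a size error, so make sure every group contains baseYear.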
Use first() from dplyr, making sure to supply order_by:
data %>%
  group_by(category) %>%
  mutate(value = value / first(value, order_by = year))
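Using first(value, order_by = year) orders within the call itself, so no separate arrange() step is needed and the row order of the data is left untouched.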
Something like this:
data %>%
  group_by(category) %>%
  mutate(value = value/value[1]) %>%
  arrange(category, year)
Result:
# category year value
#1 A 2000 1.0000000
#2 A 2001 0.2882984
#3 A 2002 1.5224308
#4 A 2003 0.8369343
#5 A 2004 2.0868684
#6 A 2005 0.2196814
#7 B 2000 1.0000000
#8 B 2001 0.5952027
