Given a dataframe, how can I collapse several rows together and rename them as "other"? The function forcats::fct_lump() seems relevant in essence, but it operates on data with repetition, whereas my case is different.
Example
Some reproducible toy data about states in the USA.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(janitor, warn.conflicts = FALSE)
set.seed(2021)
create_weights <- function(x) ceiling(exp(-x/3)*1000)
df_uncounted <-
state.x77 %>%
as_tibble(rownames = "state") %>%
mutate(freq = sample(create_weights(1:n()), size = n())) %>%
uncount(freq)
df_counted <-
df_uncounted %>%
summarise(janitor::tabyl(state)) %>%
arrange(desc(percent))
df_counted
#> # A tibble: 50 x 3
#> state n percent
#> <chr> <dbl> <dbl>
#> 1 Oklahoma 717 0.280
#> 2 New York 514 0.201
#> 3 Indiana 368 0.144
#> 4 Maine 264 0.103
#> 5 Florida 189 0.0737
#> 6 Colorado 136 0.0531
#> 7 Alabama 97 0.0378
#> 8 Oregon 70 0.0273
#> 9 Minnesota 50 0.0195
#> 10 Tennessee 36 0.0140
#> # ... with 40 more rows
Created on 2021-08-11 by the reprex package (v2.0.0)
For the sake of this question, the dataset df_counted is given.
When I examine such data, I typically want to keep only rows that represent the larger chunks, and collapse the rest into "other". In this example, I may desire to collapse the data according to two scenarios.
Scenario A
Rows 1:4 are the ones I want to keep as-is, whereas rows 5:50 I'd collapse into "other".
Desired output for scenario A:
# A tibble: 5 x 3
state n percent
<chr> <dbl> <dbl>
1 Oklahoma 717 0.280
2 New York 514 0.201
3 Indiana 368 0.144
4 Maine 264 0.103
5 other 700 0.273
Scenario B
Any row that has a value lower than 0.01 in the percent column should be grouped as "other".
Desired output for scenario B
state n percent
<chr> <dbl> <dbl>
1 Oklahoma 717 0.280
2 New York 514 0.201
3 Indiana 368 0.144
4 Maine 264 0.103
5 Florida 189 0.0737
6 Colorado 136 0.0531
7 Alabama 97 0.0378
8 Oregon 70 0.0273
9 Minnesota 50 0.0195
10 Tennessee 36 0.0140
11 Montana 26 0.0101
12 other 96 0.0375
I suppose this is a pretty common procedure, but I didn't find a direct function that does so. My attempts to achieve the desired outputs included some very cumbersome code. way too complex for such a simple purpose.
Anyone knows of a straightforward way to achieve those desired outputs? Thanks!
It's just a case of creating appropriate groups and then summarising:
library(dplyr)
df_counted %>%
group_by(state = ifelse(row_number() < 5, state, "other")) %>%
summarise(across(everything(), sum)) %>%
arrange(state == "other", -n)
# A tibble: 5 x 3
state n percent
<chr> <dbl> <dbl>
1 Oklahoma 717 0.280
2 New York 514 0.201
3 Indiana 368 0.144
4 Maine 264 0.103
5 other 700 0.273
df_counted %>%
group_by(state = ifelse(percent >= .01, state, "other")) %>%
summarise(across(everything(), sum)) %>%
arrange(state == "other", -n)
# A tibble: 12 x 3
state n percent
<chr> <dbl> <dbl>
1 Oklahoma 717 0.280
2 New York 514 0.201
3 Indiana 368 0.144
4 Maine 264 0.103
5 Florida 189 0.0737
6 Colorado 136 0.0531
7 Alabama 97 0.0378
8 Oregon 70 0.0273
9 Minnesota 50 0.0195
10 Tennessee 36 0.0140
11 Montana 26 0.0101
12 other 96 0.0375
In case you have curiosity of knowing how to do it with forcats:
# scenario A
df_counted %>%
group_by(state = fct_lump_n(state, 4, w = percent)) %>%
summarise(across(.fn = sum)) %>%
arrange(state == "Other", -n)
# scenario B
df_counted %>%
group_by(state = fct_lump_min(state, .01, w = percent)) %>%
summarise(across(.fn = sum)) %>%
arrange(state == "Other",-n)
Base R options using aggregate
aggregate(
. ~ state,
transform(
df_counted,
state = replace(state, seq_along(state) >= 5, "other")
),
sum
)
which gives
state n percent
1 Indiana 368 0.1435817
2 Maine 264 0.1030043
3 New York 514 0.2005462
4 Oklahoma 717 0.2797503
5 other 700 0.2731174
aggregate(
. ~ state,
transform(
df_counted,
state = replace(state, percent < 0.01, "other")
),
sum
)
gives
state n percent
1 Alabama 97 0.03784627
2 Colorado 136 0.05306282
3 Florida 189 0.07374171
4 Indiana 368 0.14358174
5 Maine 264 0.10300429
6 Minnesota 50 0.01950839
7 Montana 26 0.01014436
8 New York 514 0.20054623
9 Oklahoma 717 0.27975029
10 Oregon 70 0.02731174
11 other 96 0.03745611
12 Tennessee 36 0.01404604
Related
Below is the sample data and code. I have two issues. First, I need the indtotal column to be the sum by the twodigit code and have it stay constant as shown below. The reasons is so that I can do a simple calculation of one column divided by the other to arrive at the smbshare number. When I try the following,
second <- first %>%
group_by(twodigit,smb) %>%
summarize(indtotal = sum(employment))
it breaks it down by twodigit and smb.
Second issue is having it produce an 0 if the value does not exist. Best example is twodigit code of 51 and smb = 4. When there are not 4 distinct smb values for a given two digit, I am looking for it to produce a 0.
Note: smb is short for small business
naicstest <- c (512131,512141,521921,522654,512131,536978,541214,531214,621112,541213,551212,574121,569887,541211,523141,551122,512312,521114,522112)
employment <- c(11,130,315,17,190,21,22,231,15,121,19,21,350,110,515,165,12,110,111)
smb <- c(1,2,3,1,3,1,1,3,1,2,1,1,4,2,4,3,1,2,2)
first <- data.frame(naicstest,employment,smb)
first<-first %>% mutate(twodigit = substr(naicstest,1,2))
second <- first %>% group_by(twodigit) %>% summarize(indtotal = sum(employment))
Desired result is below
twodigit indtotal smb smbtotal smbshare
51 343 1 23 (11+12) 23/343
51 343 2 130 130/343
51 343 3 190 190/343
51 343 4 0 0/343
52 1068 1 17 23/1068
52 1068 2 221 (110+111) 221/1068
52 1068 3 315 315/1068
52 1068 4 515 515/1068
This gives you all the columns you need, but in a slightly different order. You could use select or relocate to get them in the order you want I suppose:
first %>%
group_by(twodigit, smb) %>%
summarize(smbtotal = sum(employment)) %>%
ungroup() %>%
complete(twodigit, smb, fill = list('smbtotal' = 0)) %>%
group_by(twodigit) %>%
mutate(
indtotal = sum(smbtotal),
smbshare = smbtotal / indtotal
)
`summarise()` has grouped output by 'twodigit'. You can override using the `.groups` argument.
# A tibble: 32 × 5
# Groups: twodigit [8]
twodigit smb smbtotal indtotal smbshare
<chr> <dbl> <dbl> <dbl> <dbl>
1 51 1 23 343 0.0671
2 51 2 130 343 0.379
3 51 3 190 343 0.554
4 51 4 0 343 0
5 52 1 17 1068 0.0159
6 52 2 221 1068 0.207
7 52 3 315 1068 0.295
8 52 4 515 1068 0.482
9 53 1 21 252 0.0833
10 53 2 0 252 0
# … with 22 more rows
I have a data frame with five columns:
year<- c(2000,2000,2000,2001,2001,2001,2002,2002,2002)
k<- c(12.5,11.5,10.5,-8.5,-9.5,-10.5,13.9,14.9,15.9)
pop<- c(143,147,154,445,429,430,178,181,211)
pop_obs<- c(150,150,150,440,440,440,185,185,185)
df<- data_frame(year,k,pop,pop_obs)
df<-
year k pop pop_obs
<dbl> <dbl> <dbl> <dbl>
1 2000 12.5 143 150
2 2000 11.5 147 150
3 2000 10.5 154 150
4 2001 -8.5 445 440
5 2001 -9.5 429 440
6 2001 -10.5 430 440
7 2002 13.9 178 185
8 2002 14.9 181 185
9 2002 15.9 211 185
what I want is, based on each year and each k which value of pop has minimum difference of pop_obs. finally, I want to keep result as a data frame based on each year and each k.
my expected output would be like this:
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2003 14.9
You could try with dplyr
df<- data.frame(year,k,pop,pop_obs)
library(dplyr)
df %>%
mutate(diff = abs(pop_obs - pop)) %>%
group_by(year) %>%
filter(diff == min(diff)) %>%
select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Try tidyverse way
library(tidyverse)
data_you_want = df %>%
group_by(year, k)%>%
mutate(dif=pop-pop_obs)%>%
ungroup() %>%
arrange(desc(dif)) %>%
select(year, k)
Using base R
subset(df, as.logical(ave(abs(pop_obs - pop), year,
FUN = function(x) x == min(x))), select = c('year', 'k'))
# A tibble: 3 × 2
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
Typically I use dplyr::distinct() to remove duplicated rows from the data. This function selects one copy of the duplicated rows and keeps it.
However, sometimes I wish to remove all copies if suspect the row is not valid.
Example
Let's say that I survey people and ask them about height, weight, and country they're from.
library(dplyr)
library(tibble)
set.seed(2021)
df_1 <- data.frame(id = 1:10,
height = sample(c(150:210), size = 10),
weight = sample(c(80: 200), size = 10))
df_2 <- df_1
df_final <- rbind(df_1, df_2)
df_final <- dplyr::arrange(df_final, id)
df_final <-
df_final %>%
add_column("country" = c("uk", "uk",
"france", "usa",
"germany", "germany",
"denmark", "norway",
"india", "india",
"chine", "china",
"mozambique", "argentina",
"morroco", "morroco",
"sweden", "japan",
"italy", "italy"))
df_final
#> id height weight country
#> 1 1 156 189 uk
#> 2 1 156 189 uk
#> 3 2 187 148 france
#> 4 2 187 148 usa
#> 5 3 195 190 germany
#> 6 3 195 190 germany
#> 7 4 207 182 denmark
#> 8 4 207 182 norway
#> 9 5 188 184 india
#> 10 5 188 184 india
#> 11 6 161 102 chine
#> 12 6 161 102 china
#> 13 7 201 155 mozambique
#> 14 7 201 155 argentina
#> 15 8 155 130 morroco
#> 16 8 155 130 morroco
#> 17 9 209 139 sweden
#> 18 9 209 139 japan
#> 19 10 202 97 italy
#> 20 10 202 97 italy
Created on 2021-07-19 by the reprex package (v2.0.0)
In df_final, each id means one person. In this example data we have duplicates for all 10 people. Everyone took the survey twice. However, if we look closely we see that some people reported they're from a different country. For example, id == 2 reported both usa in one case and france in another. In my data cleaning I wish to remove those people.
My primary goal is to remove duplicates. My secondary goal is to filter out those people who answered a different country.
If I simply go with dplyr::distinct(), I remain with all 10 ids.
df_final %>%
distinct(id, .keep_all = TRUE)
#> id height weight country
#> 1 1 156 189 uk
#> 2 2 187 148 france
#> 3 3 195 190 germany
#> 4 4 207 182 denmark
#> 5 5 188 184 india
#> 6 6 161 102 chine
#> 7 7 201 155 mozambique
#> 8 8 155 130 morroco
#> 9 9 209 139 sweden
#> 10 10 202 97 italy
What should I do in order to run distinct() but only on those who have the same value for country in all duplicated copies (per id)?
Thanks
Here is one option...
df_final %>%
group_by(id) %>%
filter(length(unique(country)) == 1) %>%
distinct()
# A tibble: 5 x 4
# Groups: id [5]
id height weight country
<int> <int> <int> <chr>
1 1 177 83 uk
2 3 191 151 germany
3 5 186 175 india
4 8 164 178 morroco
5 10 201 141 italy
We may also do
library(dplyr)
df_final %>%
distinct(id, country, .keep_all = TRUE) %>%
filter(id %in% names(which(table(id) == 1)))
I can pivot the data in wider format if the values to be pivoted consist in more than one column.
us_rent_income %>%
pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe)
)
# A tibble: 52 x 6
GEOID NAME income_estimate rent_estimate income_moe rent_moe
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 747 136 3
2 02 Alaska 32940 1200 508 13
3 04 Arizona 27517 972 148 4
4 05 Arkansas 23789 709 165 5
5 06 California 29454 1358 109 3
6 08 Colorado 32401 1125 109 5
7 09 Connecticut 35326 1123 195 5
8 10 Delaware 31560 1076 247 10
9 11 District of Columbia 43198 1424 681 17
10 12 Florida 25952 1077 70 3
# ... with 42 more rows
In this code output, I want the order of columns to be income_estimate, income_moe, rent_estimate and rent_moe. Setting names_sort = T isn't helping. Changing the order in names_glue doesn't help either. I know I can reorder columns by select and through other functions, but I just want to know that is there any argument in pivot_wider to do so?
EDIT the issue seems already in development; it has been discussed here and here at least.
With the advent of tidyr 1.2.0, it is now super easy with the use of argument names_vary
library(tidyr)
us_rent_income %>%
pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe),
names_vary = 'slowest'
)
#> # A tibble: 52 x 6
#> GEOID NAME income_estimate income_moe rent_estimate rent_moe
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 136 747 3
#> 2 02 Alaska 32940 508 1200 13
#> 3 04 Arizona 27517 148 972 4
#> 4 05 Arkansas 23789 165 709 5
#> 5 06 California 29454 109 1358 3
#> 6 08 Colorado 32401 109 1125 5
#> 7 09 Connecticut 35326 195 1123 5
#> 8 10 Delaware 31560 247 1076 10
#> 9 11 District of Columbia 43198 681 1424 17
#> 10 12 Florida 25952 70 1077 3
#> # ... with 42 more rows
The explanation of names_vary given at package help page is -
names_vary
When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?
"fastest" varies names_from values fastest, resulting in a column naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2. This is the default.
"slowest" varies names_from values slowest, resulting in a column naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2.
For fine-grained control, you can use pivot_wider_spec(), which lets you define the specification for the resulting data frame:
library(tidyverse)
spec <- tibble(
.name = c("income_estimate", "income_moe", "rent_estimate", "rent_moe"),
.value = c("estimate", "moe", "estimate", "moe"),
variable = c("income", "income", "rent", "rent")
)
us_rent_income %>% pivot_wider_spec(spec)
Output:
# A tibble: 52 x 6
GEOID NAME income_estimate income_moe rent_estimate rent_moe
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 136 747 3
2 02 Alaska 32940 508 1200 13
3 04 Arizona 27517 148 972 4
4 05 Arkansas 23789 165 709 5
5 06 California 29454 109 1358 3
6 08 Colorado 32401 109 1125 5
7 09 Connecticut 35326 195 1123 5
8 10 Delaware 31560 247 1076 10
9 11 District of Columbia 43198 681 1424 17
10 12 Florida 25952 70 1077 3
# … with 42 more rows
And with a few pre-processing steps, you can avoid having to manually enter all the values in spec:
field <- us_rent_income %>% distinct(variable) %>% pull()
sub_field <- colnames(us_rent_income)[4:5]
pivot_names <- map(field, ~paste(., sub_field, sep = "_")) %>% unlist()
pivot_vals <- rep(sub_field, 2)
pivot_vars <- map(field, rep, 2) %>% unlist()
spec <- tibble(.name = pivot_names, .value = pivot_vals, variable = pivot_vars)
us_rent_income %>% pivot_wider_spec(spec)
After the pivoting, we could do a select by ordering the substring of column names
library(dplyr)
library(tidyr)
library(stringr0
us_rent_income %>%
pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe)
) %>%
select(GEOID, NAME, order(str_remove(names(.)[-(1:2)], "_.*")) + 2)
-output
# A tibble: 52 x 6
# GEOID NAME income_estimate income_moe rent_estimate rent_moe
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 01 Alabama 24476 136 747 3
# 2 02 Alaska 32940 508 1200 13
# 3 04 Arizona 27517 148 972 4
# 4 05 Arkansas 23789 165 709 5
# 5 06 California 29454 109 1358 3
# 6 08 Colorado 32401 109 1125 5
# 7 09 Connecticut 35326 195 1123 5
# 8 10 Delaware 31560 247 1076 10
# 9 11 District of Columbia 43198 681 1424 17
#10 12 Florida 25952 70 1077 3
# … with 42 more rows
The ordering is based on the names_from column and so the names_sort have no impact on the column names from values_from i.e. in the OP's solution, it wouldn't change if we change the order in names_glue. In the data, the 'variable' column unique value appearance is in income, followed by rent. So, it does that order, when the default names_sort = FALSE. If it is changed to TRUE, it does alphabetic order, which is again i followed by r.
It can be checked if we first reshape to 'long', unite the columns and then do the pivot_wider
us_rent_income %>%
pivot_longer(cols = c(estimate, moe)) %>%
unite(variable, variable, name) %>%
pivot_wider(names_from = variable, values_from = value)
-output
# A tibble: 52 x 6
# GEOID NAME income_estimate income_moe rent_estimate rent_moe
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 01 Alabama 24476 136 747 3
# 2 02 Alaska 32940 508 1200 13
# 3 04 Arizona 27517 148 972 4
# 4 05 Arkansas 23789 165 709 5
# 5 06 California 29454 109 1358 3
# 6 08 Colorado 32401 109 1125 5
# 7 09 Connecticut 35326 195 1123 5
# 8 10 Delaware 31560 247 1076 10
# 9 11 District of Columbia 43198 681 1424 17
#10 12 Florida 25952 70 1077 3
# … with 42 more rows
Now, we check by changing into custom order with factor and specify names_sort = TRUE, it will go in the order we wanted
us_rent_income %>%
pivot_longer(cols = c(estimate, moe)) %>%
unite(variable, variable, name) %>%
mutate(variable = factor(variable,
levels = c('income_estimate', 'rent_moe', 'rent_estimate', 'income_moe'))) %>%
pivot_wider(names_from = variable, values_from = value, names_sort = TRUE)
# A tibble: 52 x 6
# GEOID NAME income_estimate rent_moe rent_estimate income_moe
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 01 Alabama 24476 3 747 136
# 2 02 Alaska 32940 13 1200 508
# 3 04 Arizona 27517 4 972 148
# 4 05 Arkansas 23789 5 709 165
# 5 06 California 29454 3 1358 109
# 6 08 Colorado 32401 5 1125 109
# 7 09 Connecticut 35326 5 1123 195
# 8 10 Delaware 31560 10 1076 247
# 9 11 District of Columbia 43198 17 1424 681
#10 12 Florida 25952 3 1077 70
# … with 42 more rows
I often need to rescale time series relative to their value at a certain baseline time (usually as a percent of the baseline). Here's an example.
> library(dplyr)
> library(magrittr)
> library(tibble)
> library(tidyr)
# [messages from package imports snipped]
> set.seed(42)
> mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
> usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
> table <- rbind(mexico, usa)
> table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 11.4 19.9
2 2001 Mexico 10.4 22.5
3 2002 Mexico 12.4 21.9
4 2003 Mexico 13.6 25.0
5 2004 Mexico 14.4 23.9
6 2000 USA 31.3 40.6
7 2001 USA 33.3 40.7
8 2002 USA 30.6 39.3
9 2003 USA 32.7 40.6
10 2004 USA 33.9 45.3
I want to scale A and B to express each value as a percent of the country-specific 2001 value (i.e., the A and B entries in rows 2 and 7 should be 100). My way of doing this is somewhat roundabout and awkward: extract the baseline values into a separate table, merge them back into a separate column in the main table, and then compute scaled values, with annoying intermediate gathering and spreading to avoid specifying the column names of each time series (real data sets can have far more than two value columns). Is there a better way to do this, ideally with a single short pipeline?
> long_table <- table %>% gather(variable, value, -Year, -Country)
> long_table
# A tibble: 20 x 4
Year Country variable value
<int> <chr> <chr> <dbl>
1 2000 Mexico A 11.4
2 2001 Mexico A 10.4
#[remaining tibble printout snipped]
> baseline_table <- long_table %>%
filter(Year == 2001) %>%
select(-Year) %>%
rename(baseline=value)
> baseline_table
# A tibble: 4 x 3
Country variable baseline
<chr> <chr> <dbl>
1 Mexico A 10.4
2 USA A 33.3
3 Mexico B 22.5
4 USA B 40.7
> normalized_table <- long_table %>%
inner_join(baseline_table) %>%
mutate(value=100*value/baseline) %>%
select(-baseline) %>%
spread(variable, value) %>%
arrange(Country, Year)
Joining, by = c("Country", "variable")
> normalized_table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2001 Mexico 100. 100
3 2002 Mexico 118. 97.3
4 2003 Mexico 131. 111.
5 2004 Mexico 138. 106.
6 2000 USA 94.0 99.8
7 2001 USA 100 100
8 2002 USA 92.0 96.6
9 2003 USA 98.3 99.6
10 2004 USA 102. 111.
My second attempt was to use transform, but this failed because transform doesn't seem to recognize dplyr groups, and it would be suboptimal even if it worked because it requires me to know that 2001 is the second year in the time series.
> table %>%
arrange(Country, Year) %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
transform(norm=value*100/value[2])
Year Country variable value norm
1 2000 Mexico A 11.37096 108.9663
2 2001 Mexico A 10.43530 100.0000
3 2002 Mexico A 12.36313 118.4741
4 2003 Mexico A 13.63286 130.6418
5 2004 Mexico A 14.40427 138.0340
6 2000 USA A 31.30487 299.9901
7 2001 USA A 33.28665 318.9811
8 2002 USA A 30.61114 293.3422
9 2003 USA A 32.72121 313.5627
10 2004 USA A 33.86668 324.5395
11 2000 Mexico B 19.89388 190.6402
12 2001 Mexico B 22.51152 215.7247
13 2002 Mexico B 21.90534 209.9157
14 2003 Mexico B 25.01842 239.7480
15 2004 Mexico B 23.93729 229.3876
16 2000 USA B 40.63595 389.4085
17 2001 USA B 40.71575 390.1732
18 2002 USA B 39.34354 377.0235
19 2003 USA B 40.55953 388.6762
20 2004 USA B 45.32011 434.2961
It would be nice for this to be more scalable, but here's a simple solution. You can refer to A[Year == 2001] inside mutate, much as you might do table$A[table$Year == 2001] in base R. This lets you scale against your baseline of 2001 or whatever other year you might need.
Edit: I was missing a group_by to ensure that values are only being scaled against other values in their own group. The "sanity check" (that I clearly didn't do) is that values for Mexico in 2001 should have a scaled value of 1, and same for USA and any other countries.
library(tidyverse)
set.seed(42)
mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
table <- rbind(mexico, usa)
table %>%
group_by(Country) %>%
mutate(A_base2001 = A / A[Year == 2001], B_base2001 = B / B[Year == 2001])
#> # A tibble: 10 x 6
#> # Groups: Country [2]
#> Year Country A B A_base2001 B_base2001
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 Mexico 11.4 19.9 1.09 0.884
#> 2 2001 Mexico 10.4 22.5 1 1
#> 3 2002 Mexico 12.4 21.9 1.18 0.973
#> 4 2003 Mexico 13.6 25.0 1.31 1.11
#> 5 2004 Mexico 14.4 23.9 1.38 1.06
#> 6 2000 USA 31.3 40.6 0.940 0.998
#> 7 2001 USA 33.3 40.7 1 1
#> 8 2002 USA 30.6 39.3 0.920 0.966
#> 9 2003 USA 32.7 40.6 0.983 0.996
#> 10 2004 USA 33.9 45.3 1.02 1.11
Created on 2018-05-23 by the reprex package (v0.2.0).
Inspired by Camille's answer, I found one simple approach that that scales well:
table %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(value=100*value/value[Year == 2001]) %>%
spread(variable, value)
# A tibble: 10 x 4
# Groups: Country [2]
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2000 USA 94.0 99.8
3 2001 Mexico 100. 100
4 2001 USA 100 100
5 2002 Mexico 118. 97.3
6 2002 USA 92.0 96.6
7 2003 Mexico 131. 111.
8 2003 USA 98.3 99.6
9 2004 Mexico 138. 106.
10 2004 USA 102. 111.
Preserving the the original values alongside the scaled ones takes more work. Here are two approaches. One of them uses an extra gather call to produce two variable-name columns (one indicating the series name, the other marking original or scaled), then unifying them into one column and reformatting.
table %>%
gather(variable, original, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(scaled=100*original/original[Year == 2001]) %>%
gather(scaled, value, -Year, -Country, -variable) %>%
unite(variable_scaled, variable, scaled, sep='_') %>%
mutate(variable_scaled=gsub("_original", "", variable_scaled)) %>%
spread(variable_scaled, value)
# A tibble: 10 x 6
# Groups: Country [2]
Year Country A A_scaled B B_scaled
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 11.4 109. 19.9 88.4
2 2000 USA 31.3 94.0 40.6 99.8
3 2001 Mexico 10.4 100. 22.5 100
4 2001 USA 33.3 100 40.7 100
5 2002 Mexico 12.4 118. 21.9 97.3
6 2002 USA 30.6 92.0 39.3 96.6
7 2003 Mexico 13.6 131. 25.0 111.
8 2003 USA 32.7 98.3 40.6 99.6
9 2004 Mexico 14.4 138. 23.9 106.
10 2004 USA 33.9 102. 45.3 111.
A second equivalent approach creates a new table with the columns scaled "in place" and then merges it back into with the original one.
table %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(value=100*value/value[Year == 2001]) %>%
ungroup() %>%
mutate(variable=paste(variable, 'scaled', sep='_')) %>%
spread(variable, value) %>%
inner_join(table)
Joining, by = c("Year", "Country")
# A tibble: 10 x 6
Year Country A_scaled B_scaled A B
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 109. 88.4 11.4 19.9
2 2000 USA 94.0 99.8 31.3 40.6
3 2001 Mexico 100. 100 10.4 22.5
4 2001 USA 100 100 33.3 40.7
5 2002 Mexico 118. 97.3 12.4 21.9
6 2002 USA 92.0 96.6 30.6 39.3
7 2003 Mexico 131. 111. 13.6 25.0
8 2003 USA 98.3 99.6 32.7 40.6
9 2004 Mexico 138. 106. 14.4 23.9
10 2004 USA 102. 111. 33.9 45.3
It's possible to replace the final inner_join with arrange(County, Year) %>% select(-Country, -Year) %>% bind_cols(table), which may perform better for some data sets, though it orders the columns suboptimally.