Reshaping a long data frame to wide using R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
I have the following data set containing the quantity of phytosanitary products purchased per zip code in France between 2015 and 2019, along with each product's classification (other, toxic, mineral, organic).
The data frame looks like this, so for each combination of zip_code, year, and classification you can see the quantity that was purchased:
zip_code  year  classification  total_quantity
01000     2015  other               44.305436
01000     2015  toxic              212.783330
01000     2015  mineral            value
01000     2015  organic            value
01000     2016  other              value
01000     2016  toxic              value
01000     2016  mineral            value
It follows the same pattern all the way down to:
zip_code  year  classification  total_quantity
01000     2019  organic            value
01090     2015  other              value
but I would like something with only one entry per zip code, like this (of course going all the way to 2019 and not stopping at 2016 like I did in my example):
zip_code  other_total-quantity-2015  Toxic_total-quantity-2015  Mineral_total-quantity-2015  organic_total-quantity-2015  other_total-quantity-2016  Toxic_total-quantity-2016
01000     value                      value                      value                        value                        value                      value
01090     value                      value                      value                        value                        value                      value
I tried to do this using the reshape function, but the closest I got to what I want is a table where each zip_code is repeated four times (once per classification).
Thank you

The following uses pivot_wider from package tidyr to do the reshape. I'm aware that's a personal preference, but maybe it's helpful nonetheless.
library(tidyr)
library(dplyr)
## or install and load these and related packages
## in bulk through the `tidyverse` package
df %>%
  pivot_wider(
    id_cols = zip_code,
    names_from = c(year, classification),
    values_from = total_quantity,
    names_prefix = 'total-quantity' ## redundant, actually
  )
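Since the question mentions getting close with reshape, here is a minimal base-R sketch of the same pivot, assuming df holds the four long-format columns shown above (the key helper column is introduced here purely for illustration):
## reshape() accepts a single timevar, so first combine classification
## and year into one key that matches the requested column names.
df$key <- paste0(df$classification, "_total-quantity-", df$year)
wide <- reshape(df[, c("zip_code", "key", "total_quantity")],
                idvar = "zip_code", timevar = "key", direction = "wide")
## reshape() prefixes every new column with "total_quantity."; strip it:
names(wide) <- sub("^total_quantity\\.", "", names(wide))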

I created a sample dataset that looks like this:
# A tibble: 40 × 4
   zip_code  year classification total_quantity
      <dbl> <dbl> <chr>                   <dbl>
 1     1000  2015 other                  61.1
 2     1000  2015 toxic                  32.8
 3     1000  2015 mineral                11.4
 4     1000  2015 organic                38.9
 5     1000  2016 other                  18.8
 6     1000  2016 toxic                  65.0
 7     1000  2016 mineral                 0.382
 8     1000  2016 organic                18.8
 9     1000  2017 other                  96.0
10     1000  2017 toxic                  60.4
# … with 30 more rows
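For reference, one way such a sample could be built; the runif values are arbitrary placeholders and only meant to mimic the shape of the real data:
library(dplyr)
library(tidyr)
set.seed(1)
df <- expand_grid(
  zip_code = c(1000, 1090),  # leading zeros are lost when stored as numbers
  year = 2015:2019,
  classification = c("other", "toxic", "mineral", "organic")
) %>%
  mutate(total_quantity = runif(n(), 0, 100))  # placeholder quantities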
If you run the following code you will get your requested table:
library(tidyr)
library(dplyr)
df %>%
  pivot_wider(
    id_cols = zip_code,
    names_from = c(year, classification),
    values_from = total_quantity,
    names_glue = "{classification}_total-quantity-{year}"
  )
Output:
# A tibble: 2 × 21
zip_code `other_total-quantity…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1000 61.1 32.8 11.4 38.9 18.8 65.0 0.382 18.8 96.0 60.4 80.4 81.2 47.4 87.4 52.2 9.65 19.7 11.3 45.7 12.8
2 1090 75.2 40.1 47.9 10.3 86.2 97.9 11.2 93.3 55.0 88.5 63.5 46.6 5.30 13.1 20.4 83.9 58.6 61.3 6.56 46.7
As you can see, the year and classification are added to the column names using names_glue in the pivot_wider function.
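For completeness, a data.table sketch of the same pivot, assuming the same df; dcast pastes the combined name parts together with sep, which reproduces the requested column names:
library(data.table)
dcast(as.data.table(df), zip_code ~ classification + year,
      value.var = "total_quantity", sep = "_total-quantity-")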

Related

How to Subset dataframe in R by entity of multiple entities over time

I have an NCAA recruiting dataset that has 200+ teams from 1980-2022 with many different ranking metrics. I am trying to subset my data into individual tibbles/data frames, one per team over time. I thought of using a loop but can't wrap my head around how to code it.
Second, I am trying to look for extreme changes from one season to the next, to determine whether a certain event caused an increase/decrease in the rank of recruits each team had. Someone correct me if my logic is wrong. Could I use z-scores to determine this? And if so, how would I do it?
to subset, I have tried this:
postCovid <- teamRecruitingScore %>% filter(yr>2018) %>% filter(topAvg != 0) %>% group_by(yr, topAvg)
and it returns this:
yr team tfsRk tfsSc…¹ rawRk rawAvg topRk topAvg total…² five four three
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019 Akron 112 130. 120 79.2 125 79.2 23 0 0 9
2 2019 Alabama 1 317. 1 94.4 1 95.4 27 3 23 1
3 2019 Appalach… 99 139. 88 81.5 92 81.5 18 0 0 16
4 2019 Arizona 54 183. 58 84.8 58 84.8 20 0 1 19
What I am trying to get is each team, then list all the years.
Any ideas?
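One possible approach, sketched under the assumption that the columns are named team, yr, and topAvg as in the printout: split() gives a named list with one data frame per team, and z-scores of the year-over-year change can flag extreme swings.
library(dplyr)
## a named list of data frames, one per team
team_list <- split(teamRecruitingScore, teamRecruitingScore$team)

## z-scores of the season-to-season change in topAvg, per team;
## unusually large |z| values mark extreme year-over-year swings
changes <- teamRecruitingScore %>%
  arrange(team, yr) %>%
  group_by(team) %>%
  mutate(delta = topAvg - lag(topAvg),
         z = as.numeric(scale(delta))) %>%
  ungroup()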

finding minimum for a column based on another column and keep result as a data frame

I have a data frame with five columns:
year    <- c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002)
k       <- c(12.5, 11.5, 10.5, -8.5, -9.5, -10.5, 13.9, 14.9, 15.9)
pop     <- c(143, 147, 154, 445, 429, 430, 178, 181, 211)
pop_obs <- c(150, 150, 150, 440, 440, 440, 185, 185, 185)
df <- data_frame(year, k, pop, pop_obs) # data_frame() is deprecated; tibble() works the same here
df
# A tibble: 9 × 4
   year     k   pop pop_obs
  <dbl> <dbl> <dbl>   <dbl>
1  2000  12.5   143     150
2  2000  11.5   147     150
3  2000  10.5   154     150
4  2001  -8.5   445     440
5  2001  -9.5   429     440
6  2001 -10.5   430     440
7  2002  13.9   178     185
8  2002  14.9   181     185
9  2002  15.9   211     185
What I want is, for each year, the value of k whose pop has the minimum absolute difference from pop_obs. Finally, I want to keep the result as a data frame with one row per year.
My expected output would be like this:
# A tibble: 3 × 2
   year     k
  <dbl> <dbl>
1  2000  11.5
2  2001  -8.5
3  2002  14.9
You could try with dplyr
df <- data.frame(year, k, pop, pop_obs)

library(dplyr)
df %>%
  mutate(diff = abs(pop_obs - pop)) %>%
  group_by(year) %>%
  filter(diff == min(diff)) %>%
  select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Try a tidyverse way:
library(tidyverse)
data_you_want <- df %>%
  mutate(dif = abs(pop - pop_obs)) %>%
  group_by(year) %>%
  slice_min(dif, n = 1) %>% # keep the row with the smallest difference per year
  ungroup() %>%
  select(year, k)
Using base R
subset(df, as.logical(ave(abs(pop_obs - pop), year,
FUN = function(x) x == min(x))), select = c('year', 'k'))
# A tibble: 3 × 2
   year     k
  <dbl> <dbl>
1  2000  11.5
2  2001  -8.5
3  2002  14.9
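And a data.table sketch of the same idea, assuming df as built above; .SD[which.min(...)] keeps the closest row per year:
library(data.table)
setDT(df)[, .SD[which.min(abs(pop_obs - pop))], by = year][, .(year, k)]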

Pivot Longer with Modification of Columns

I have data that is in the following format:
(data <- tribble(
  ~Date,        ~ENRSxOPEN, ~ENRSxCLOSE, ~INFTxOPEN, ~INFTxCLOSE,
  "1989-09-11", 82.97,      82.10,       72.88,      72.56,
  "1989-09-12", 83.84,      83.96,       73.52,      72.51,
  "1989-09-13", 83.16,      83.88,       72.91,      72.12))
# A tibble: 3 x 5
Date ENRSxOPEN ENRSxCLOSE INFTxOPEN INFTxCLOSE
<chr> <dbl> <dbl> <dbl> <dbl>
1 1989-09-11 83.0 82.1 72.9 72.6
2 1989-09-12 83.8 84.0 73.5 72.5
3 1989-09-13 83.2 83.9 72.9 72.1
For analysis, I want to pivot this tibble longer to the following format:
tribble(
  ~Ticker, ~Date,        ~OPEN, ~CLOSE,
  "ENRS",  "1989-09-11", 82.97, 82.10,
  "ENRS",  "1989-09-12", 83.84, 83.96,
  "ENRS",  "1989-09-13", 83.16, 83.88,
  "INFT",  "1989-09-11", 72.88, 72.56,
  "INFT",  "1989-09-12", 73.52, 72.51,
  "INFT",  "1989-09-13", 72.91, 72.12)
I.e., I want to separate the Open/Close prices from the ticker, and put the latter as an entirely new column in the beginning.
I've tried to use the function pivot_longer:
pivot_longer(data, cols = ENRSxOPEN:INFTxCLOSE)
While this goes in the direction of what I want to achieve, it does not separate the prices and keep them in one row per ticker.
Is there a way to add additional arguments to pivot_longer() to achieve that?
pivot_longer(data, -Date, names_to = c('Ticker', '.value'), names_sep = 'x')
# A tibble: 6 x 4
  Date       Ticker  OPEN CLOSE
  <chr>      <chr>  <dbl> <dbl>
1 1989-09-11 ENRS    83.0  82.1
2 1989-09-11 INFT    72.9  72.6
3 1989-09-12 ENRS    83.8  84.0
4 1989-09-12 INFT    73.5  72.5
5 1989-09-13 ENRS    83.2  83.9
6 1989-09-13 INFT    72.9  72.1
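The special .value sentinel in names_to tells pivot_longer that the second part of each split name (OPEN or CLOSE) should stay a column rather than become a value. To go the other way, a sketch of the inverse with pivot_wider, where long_data stands for the 6 x 4 tibble shown above:
pivot_wider(long_data, names_from = Ticker,
            values_from = c(OPEN, CLOSE),
            names_glue = "{Ticker}x{.value}")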

pivot_longer, transform first column rownames, to column names

Using this data:
https://www.health.govt.nz/system/files/documents/publications/suicide-2015-tables.xlsx
The first column has rows:
"Number of suicides
Male
Female
Total
Age-standardised rate (deaths per 100,000)
Male
Female
Total"
However, I need these to be the column headers instead. Is this done via pivot_longer()?
Thanks
This is a pretty ugly data file, but hopefully this should get to what you want:
library(dplyr)
library(tidyr)
# After downloading the file in your project folder
dat <- readxl::read_excel("suicide-2015-tables.xlsx", skip = 2)
dat %>%
  select(variables = ...1, `2006`:`2015`) %>% # Remove unneeded/blank columns
  mutate(headers = if_else(is.na(`2006`), variables, NA_character_)) %>% # Create a headers variable
  fill(headers, .direction = "down") %>% # Fill the headers down
  pivot_longer(`2006`:`2015`, names_to = "year", values_to = "counts") %>% # Reshape data from wide to long
  drop_na() %>%
  unite("headers_vars", headers, variables, sep = " - ") %>% # Combine the headers and the subgroup breakdown
  pivot_wider(names_from = headers_vars, values_from = counts) # Reshape back from long to wide
# A tibble: 10 x 15
year `Number of suic~ `Number of suic~ `Number of suic~ `Age-standardis~ `Age-standardis~ `Age-standardis~ `Age-specific r~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2006 388 138 526 18.6 6.25 12.2 19.7
2 2007 371 116 487 17.4 5.01 11.0 15.2
3 2008 381 139 520 17.6 6.24 11.8 19.5
4 2009 393 117 510 17.9 5.03 11.3 18.1
5 2010 386 149 535 17.3 6.63 11.8 17.7
6 2011 377 116 493 17.0 5.06 10.9 20.1
7 2012 404 146 550 18.1 6.39 12.1 23.0
8 2013 367 147 514 16.0 6.41 11.0 17.8
9 2014 380 130 510 16.5 5.4 10.8 14.1
10 2015 384 143 527 16.4 6.1 11.1 16.9
# ... with 7 more variables: `Age-specific rates by life-stage age group (deaths per 100,000) - 25–44` <dbl>, `Age-specific rates by
# life-stage age group (deaths per 100,000) - 45–64` <dbl>, `Age-specific rates by life-stage age group (deaths per 100,000) - 65+` <dbl>,
# `Age-standardised suicide rates for Māori (deaths per 100,000) - Male` <dbl>, `Age-standardised suicide rates for Māori (deaths per
# 100,000) - Female` <dbl>, `Age-standardised suicide rates for non-Māori (deaths per 100,000) - Male` <dbl>, `Age-standardised suicide
# rates for non-Māori (deaths per 100,000) - Female` <dbl>
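The core trick above, isolated in a toy sketch: rows that are really section headers have NA in the data columns, so flag them and fill() down (the toy names here are made up purely for illustration):
library(dplyr)
library(tidyr)
toy <- tibble(label = c("Counts", "Male", "Female"),
              v = c(NA, 10, 12))
toy %>%
  mutate(header = if_else(is.na(v), label, NA_character_)) %>%
  fill(header, .direction = "down") %>% # propagate each header downward
  filter(!is.na(v))                     # drop the header-only rows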

Rescale data frame columns as percentages of baseline entry with dplyr

I often need to rescale time series relative to their value at a certain baseline time (usually as a percent of the baseline). Here's an example.
> library(dplyr)
> library(magrittr)
> library(tibble)
> library(tidyr)
# [messages from package imports snipped]
> set.seed(42)
> mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
> usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
> table <- rbind(mexico, usa)
> table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 11.4 19.9
2 2001 Mexico 10.4 22.5
3 2002 Mexico 12.4 21.9
4 2003 Mexico 13.6 25.0
5 2004 Mexico 14.4 23.9
6 2000 USA 31.3 40.6
7 2001 USA 33.3 40.7
8 2002 USA 30.6 39.3
9 2003 USA 32.7 40.6
10 2004 USA 33.9 45.3
I want to scale A and B to express each value as a percent of the country-specific 2001 value (i.e., the A and B entries in rows 2 and 7 should be 100). My way of doing this is somewhat roundabout and awkward: extract the baseline values into a separate table, merge them back into a separate column in the main table, and then compute scaled values, with annoying intermediate gathering and spreading to avoid specifying the column names of each time series (real data sets can have far more than two value columns). Is there a better way to do this, ideally with a single short pipeline?
> long_table <- table %>% gather(variable, value, -Year, -Country)
> long_table
# A tibble: 20 x 4
Year Country variable value
<int> <chr> <chr> <dbl>
1 2000 Mexico A 11.4
2 2001 Mexico A 10.4
#[remaining tibble printout snipped]
> baseline_table <- long_table %>%
filter(Year == 2001) %>%
select(-Year) %>%
rename(baseline=value)
> baseline_table
# A tibble: 4 x 3
Country variable baseline
<chr> <chr> <dbl>
1 Mexico A 10.4
2 USA A 33.3
3 Mexico B 22.5
4 USA B 40.7
> normalized_table <- long_table %>%
inner_join(baseline_table) %>%
mutate(value=100*value/baseline) %>%
select(-baseline) %>%
spread(variable, value) %>%
arrange(Country, Year)
Joining, by = c("Country", "variable")
> normalized_table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2001 Mexico 100. 100
3 2002 Mexico 118. 97.3
4 2003 Mexico 131. 111.
5 2004 Mexico 138. 106.
6 2000 USA 94.0 99.8
7 2001 USA 100 100
8 2002 USA 92.0 96.6
9 2003 USA 98.3 99.6
10 2004 USA 102. 111.
My second attempt was to use transform, but this failed because transform doesn't seem to recognize dplyr groups, and it would be suboptimal even if it worked because it requires me to know that 2001 is the second year in the time series.
> table %>%
arrange(Country, Year) %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
transform(norm=value*100/value[2])
Year Country variable value norm
1 2000 Mexico A 11.37096 108.9663
2 2001 Mexico A 10.43530 100.0000
3 2002 Mexico A 12.36313 118.4741
4 2003 Mexico A 13.63286 130.6418
5 2004 Mexico A 14.40427 138.0340
6 2000 USA A 31.30487 299.9901
7 2001 USA A 33.28665 318.9811
8 2002 USA A 30.61114 293.3422
9 2003 USA A 32.72121 313.5627
10 2004 USA A 33.86668 324.5395
11 2000 Mexico B 19.89388 190.6402
12 2001 Mexico B 22.51152 215.7247
13 2002 Mexico B 21.90534 209.9157
14 2003 Mexico B 25.01842 239.7480
15 2004 Mexico B 23.93729 229.3876
16 2000 USA B 40.63595 389.4085
17 2001 USA B 40.71575 390.1732
18 2002 USA B 39.34354 377.0235
19 2003 USA B 40.55953 388.6762
20 2004 USA B 45.32011 434.2961
It would be nice for this to be more scalable, but here's a simple solution. You can refer to A[Year == 2001] inside mutate, much as you might do table$A[table$Year == 2001] in base R. This lets you scale against your baseline of 2001 or whatever other year you might need.
Edit: I was missing a group_by to ensure that values are only being scaled against other values in their own group. The "sanity check" (that I clearly didn't do) is that values for Mexico in 2001 should have a scaled value of 1, and same for USA and any other countries.
library(tidyverse)
set.seed(42)
mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
table <- rbind(mexico, usa)
table %>%
group_by(Country) %>%
mutate(A_base2001 = A / A[Year == 2001], B_base2001 = B / B[Year == 2001])
#> # A tibble: 10 x 6
#> # Groups: Country [2]
#> Year Country A B A_base2001 B_base2001
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 Mexico 11.4 19.9 1.09 0.884
#> 2 2001 Mexico 10.4 22.5 1 1
#> 3 2002 Mexico 12.4 21.9 1.18 0.973
#> 4 2003 Mexico 13.6 25.0 1.31 1.11
#> 5 2004 Mexico 14.4 23.9 1.38 1.06
#> 6 2000 USA 31.3 40.6 0.940 0.998
#> 7 2001 USA 33.3 40.7 1 1
#> 8 2002 USA 30.6 39.3 0.920 0.966
#> 9 2003 USA 32.7 40.6 0.983 0.996
#> 10 2004 USA 33.9 45.3 1.02 1.11
Created on 2018-05-23 by the reprex package (v0.2.0).
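A note for later readers: with across() from dplyr 1.0 (which postdates this answer), the same grouped scaling extends to any number of value columns without gathering; a sketch:
table %>%
  group_by(Country) %>%
  mutate(across(c(A, B), ~ 100 * .x / .x[Year == 2001],
                .names = "{.col}_scaled")) %>%
  ungroup()
A tidyselect range such as A:B in place of c(A, B) covers wider tables.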
Inspired by Camille's answer, I found one simple approach that scales well:
table %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(value=100*value/value[Year == 2001]) %>%
spread(variable, value)
# A tibble: 10 x 4
# Groups:   Country [2]
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2000 USA 94.0 99.8
3 2001 Mexico 100. 100
4 2001 USA 100 100
5 2002 Mexico 118. 97.3
6 2002 USA 92.0 96.6
7 2003 Mexico 131. 111.
8 2003 USA 98.3 99.6
9 2004 Mexico 138. 106.
10 2004 USA 102. 111.
Preserving the original values alongside the scaled ones takes more work. Here are two approaches. One of them uses an extra gather call to produce two variable-name columns (one indicating the series name, the other marking original or scaled), then unifies them into one column and reformats.
table %>%
gather(variable, original, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(scaled=100*original/original[Year == 2001]) %>%
gather(scaled, value, -Year, -Country, -variable) %>%
unite(variable_scaled, variable, scaled, sep='_') %>%
mutate(variable_scaled=gsub("_original", "", variable_scaled)) %>%
spread(variable_scaled, value)
# A tibble: 10 x 6
# Groups:   Country [2]
Year Country A A_scaled B B_scaled
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 11.4 109. 19.9 88.4
2 2000 USA 31.3 94.0 40.6 99.8
3 2001 Mexico 10.4 100. 22.5 100
4 2001 USA 33.3 100 40.7 100
5 2002 Mexico 12.4 118. 21.9 97.3
6 2002 USA 30.6 92.0 39.3 96.6
7 2003 Mexico 13.6 131. 25.0 111.
8 2003 USA 32.7 98.3 40.6 99.6
9 2004 Mexico 14.4 138. 23.9 106.
10 2004 USA 33.9 102. 45.3 111.
A second, equivalent approach creates a new table with the columns scaled "in place" and then merges it back into the original one.
table %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(value=100*value/value[Year == 2001]) %>%
ungroup() %>%
mutate(variable=paste(variable, 'scaled', sep='_')) %>%
spread(variable, value) %>%
inner_join(table)
Joining, by = c("Year", "Country")
# A tibble: 10 x 6
Year Country A_scaled B_scaled A B
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 109. 88.4 11.4 19.9
2 2000 USA 94.0 99.8 31.3 40.6
3 2001 Mexico 100. 100 10.4 22.5
4 2001 USA 100 100 33.3 40.7
5 2002 Mexico 118. 97.3 12.4 21.9
6 2002 USA 92.0 96.6 30.6 39.3
7 2003 Mexico 131. 111. 13.6 25.0
8 2003 USA 98.3 99.6 32.7 40.6
9 2004 Mexico 138. 106. 14.4 23.9
10 2004 USA 102. 111. 33.9 45.3
It's possible to replace the final inner_join with arrange(Country, Year) %>% select(-Country, -Year) %>% bind_cols(table), which may perform better for some data sets, though it orders the columns suboptimally.
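Spelled out, that bind_cols variant would look like this sketch (same pipeline as before, with the join replaced):
table %>%
  gather(variable, value, -Year, -Country) %>%
  group_by(Country, variable) %>%
  mutate(value = 100 * value / value[Year == 2001]) %>%
  ungroup() %>%
  mutate(variable = paste(variable, 'scaled', sep = '_')) %>%
  spread(variable, value) %>%
  arrange(Country, Year) %>%
  select(-Country, -Year) %>% # keep only the scaled columns
  bind_cols(table)            # relies on matching row order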
