Reshape R dataframe from long format to a wide format

I have a table I want to reshape/pivot. Agency No will have duplicates because the data spans several years, but the rows are currently grouped by Agency No, Fiscal Year, and Type. The table is shown below along with the desired output.
Agency No   Fiscal Year  Type     Total Gross Weight
W1000FP     2018         Dry      1000
W1004CSFP   2018         Dry      2000
W1000FP     2018         Produce  500
W1004CSFP   2018         Produce  1000
W1004DR     2018         Produce  1000
W1004DR     2018         Dry      1000
W1005DR     2019         Dry      2000
W1000FP     2019         Dry      1000
W1005DR     2019         Produce  1000
W1000FP     2019         Produce  1000
Desired Output:
Agency No   Fiscal Year  Produce Weight  Dry Weight
W1000FP     2018         500             1000
W1004CSFP   2018         1000            2000
W1004DR     2018         1000            1000
W1005DR     2019         1000            2000
W1000FP     2019         1000            1000
Here is the script that I ran, but it did not give the desired output:
reshape(df, idvar = "Agency No", timevar = "Type", direction = "wide")

We could use pivot_wider
library(tidyr)
pivot_wider(df1, names_from = Type, values_from = `Total Gross Weight`)
-output
# A tibble: 5 × 4
`Agency No` `Fiscal Year` Dry Produce
<chr> <int> <int> <int>
1 W1000FP 2018 1000 500
2 W1004CSFP 2018 2000 1000
3 W1004DR 2018 1000 1000
4 W1005DR 2019 2000 1000
5 W1000FP 2019 1000 1000
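The desired output also names the columns "Produce Weight" and "Dry Weight"; recent tidyr versions can build those names directly via the names_glue argument. A small sketch of the same call (the column naming pattern is an assumption based on the desired output):
library(tidyr)
# names_glue builds the new column names from the Type values, e.g. "Dry Weight"
pivot_wider(df1, names_from = Type, values_from = `Total Gross Weight`,
            names_glue = "{Type} Weight")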
With reshape, specify 'Fiscal Year' as part of the idvar as well:
reshape(df1, idvar = c("Agency No", "Fiscal Year"),
timevar = "Type", direction = "wide")
Agency No Fiscal Year Total Gross Weight.Dry Total Gross Weight.Produce
1 W1000FP 2018 1000 500
2 W1004CSFP 2018 2000 1000
5 W1004DR 2018 1000 1000
7 W1005DR 2019 2000 1000
8 W1000FP 2019 1000 1000
data
df1 <- structure(list(`Agency No` = c("W1000FP", "W1004CSFP", "W1000FP",
"W1004CSFP", "W1004DR", "W1004DR", "W1005DR", "W1000FP", "W1005DR",
"W1000FP"), `Fiscal Year` = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2019L, 2019L, 2019L, 2019L), Type = c("Dry", "Dry", "Produce",
"Produce", "Produce", "Dry", "Dry", "Dry", "Produce", "Produce"
), `Total Gross Weight` = c(1000L, 2000L, 500L, 1000L, 1000L,
1000L, 2000L, 1000L, 1000L, 1000L)), class = "data.frame", row.names = c(NA,
-10L))

Related

R - ddply(): Using min value of one column to find the corresponding value in different column [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 1 year ago.
I want to get a summary of min(cost) per country over the years, along with the specific airport. The dataset looks like this (around 1000 rows with multiple airports per country):
airport country cost year
ORD US 500 2010
SFO US 800 2010
LHR UK 250 2010
CDG FR 300 2010
FRA GR 200 2010
ORD US 650 2011
SFO US 500 2011
LHR UK 850 2011
CDG FR 350 2011
FRA GR 150 2011
ORD US 250 2012
SFO US 650 2012
LHR UK 350 2012
CDG FR 450 2012
FRA GR 100 2012
The code below gets me summary of min(cost) per country
ddply(df,c('country'), summarize, LowestCost = min(cost))
When I try to display min(cost) of the country along with the specific airport, I just get one airport listed
ddply(df,c('country'), summarize, LowestCost = min(cost), AirportName = df[which.min(df[,3]),1])
The output should look like below
country LowestCost AirportName
US 250 ORD
UK 250 LHR
FR 300 CDG
GR 100 FRA
But instead it looks like this
country LowestCost AirportName
US 250 ORD
UK 250 ORD
FR 300 ORD
GR 100 ORD
Any help is appreciated
We may use slice_min from dplyr
library(dplyr)
df %>%
  select(-year) %>%
  group_by(country) %>%
  slice_min(cost, n = 1) %>%
  ungroup() %>%
  rename(LowestCost = cost)
-output
# A tibble: 4 x 3
airport country LowestCost
<chr> <chr> <int>
1 CDG FR 300
2 FRA GR 100
3 LHR UK 250
4 ORD US 250
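Note that slice_min keeps all tied rows by default; if only one airport per country is wanted even when costs tie, the with_ties argument can be set. A sketch of the same pipe with that change:
library(dplyr)
df %>%
  select(-year) %>%
  group_by(country) %>%
  slice_min(cost, n = 1, with_ties = FALSE) %>%  # keep a single row per group even on ties
  ungroup() %>%
  rename(LowestCost = cost)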
In the plyr code, which.min is applied to the whole column instead of the grouped column. We just need to refer to the column by name:
plyr::ddply(df, c("country"), plyr::summarise,
LowestCost = min(cost), AirportName = airport[which.min(cost)])
country LowestCost AirportName
1 FR 300 CDG
2 GR 100 FRA
3 UK 250 LHR
4 US 250 ORD
data
df <- structure(list(airport = c("ORD", "SFO", "LHR", "CDG", "FRA",
"ORD", "SFO", "LHR", "CDG", "FRA", "ORD", "SFO", "LHR", "CDG",
"FRA"), country = c("US", "US", "UK", "FR", "GR", "US", "US",
"UK", "FR", "GR", "US", "US", "UK", "FR", "GR"), cost = c(500L,
800L, 250L, 300L, 200L, 650L, 500L, 850L, 350L, 150L, 250L, 650L,
350L, 450L, 100L), year = c(2010L, 2010L, 2010L, 2010L, 2010L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2012L, 2012L, 2012L,
2012L)), class = "data.frame", row.names = c(NA, -15L))

How might I arrange data by common year row into separate columns in R?

I am relatively new to R but have worked with dplyr for data transformation.
I have a data frame with rows for year and a number.
row year int
1 2020 100
2 2020 150
3 2020 300
4 2020 750
5 2020 555
6 2019 179
7 2019 233
8 2019 399
9 2019 400
10 2019 543
How might I group these rows by common year, in row order, but organized into columns? Such as:
year col1 col2 col3 col4 col5
2020 100 150 300 750 555
2019 179 233 399 400 543
This should be simple, but I can't seem to figure out how to do it with dplyr or base R. Thank you.
We can create a sequence column by 'year' and then pivot to 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  dplyr::select(-row) %>%
  group_by(year) %>%
  mutate(new = str_c('col', row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = new, values_from = int)
# A tibble: 2 x 6
# year col1 col2 col3 col4 col5
# <int> <int> <int> <int> <int> <int>
#1 2020 100 150 300 750 555
#2 2019 179 233 399 400 543
Or with data.table, rowid creates the sequence, and this can be passed into the formula interface of dcast:
library(data.table)
dcast(setDT(df1), year ~ paste0('col', rowid(year)), value.var = 'int')
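A base R sketch of the same idea is also possible, numbering the rows within each year with ave() and then spreading with reshape() (the resulting value columns come out as int.col1, int.col2, ... under the default naming):
# base R sketch: ave() numbers the rows within each year, reshape() spreads them wide
df2 <- df1[-1]                                          # drop the 'row' column
df2$new <- paste0("col", ave(df2$year, df2$year, FUN = seq_along))
reshape(df2, idvar = "year", timevar = "new", direction = "wide")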
data
df1 <- structure(list(row = 1:10, year = c(2020L, 2020L, 2020L, 2020L,
2020L, 2019L, 2019L, 2019L, 2019L, 2019L), int = c(100L, 150L,
300L, 750L, 555L, 179L, 233L, 399L, 400L, 543L)),
class = "data.frame", row.names = c(NA,
-10L))

Calculating percent change over time with a longitudinal dataset

I'm trying to calculate the year-to-year change in some data I have. It is in panel/longitudinal form.
The data is in a dataframe that looks like this:
Year ZipCode Value
2011 11411 5
2012 11411 10
2013 11411 20
2011 11345 6
2012 11345 7
2013 11345 10
I would like to get a dataframe that comes out in the form like this
Year Difference ZipCode % Change
2011-2012 11411 100%
2012-2013 11411 100%
2011-2012 11345 16%
2012-2013 11345 42%
One way using dplyr is to calculate Change from the current and previous Value and paste the adjacent Years together for each ZipCode.
library(dplyr)
df1 %>%
  group_by(ZipCode) %>%
  mutate(Change = (Value - lag(Value))/lag(Value) * 100,
         Year_Diff = paste(lag(Year), Year, sep = "-")) %>%
  slice(-1) %>%
  select(Year_Diff, ZipCode, Change)
# Year_Diff ZipCode Change
# <chr> <int> <dbl>
#1 2011-2012 11345 16.7
#2 2012-2013 11345 42.9
#3 2011-2012 11411 100
#4 2012-2013 11411 100
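If the % Change column should literally read "100%", "16%", and so on, the numeric Change can be turned into a labelled string at the end. A sketch of the same pipe with that final step (the question's example appears to truncate the decimals, so floor() is used here; that choice is an assumption):
library(dplyr)
df1 %>%
  group_by(ZipCode) %>%
  mutate(Change = (Value - lag(Value))/lag(Value) * 100,
         Year_Diff = paste(lag(Year), Year, sep = "-")) %>%
  slice(-1) %>%
  ungroup() %>%
  transmute(Year_Diff, ZipCode, Change = paste0(floor(Change), "%"))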
Using data.table, we group by 'ZipCode', take the diff of 'Value', divide by 'Value' with its last element dropped, and paste the adjacent 'Year' values together:
library(data.table)
setDT(df1)[, .(Change = 100 * diff(Value)/Value[-.N],
               Year_Diff = paste(Year[-.N], Year[-1], sep = "-")), .(ZipCode)]
# ZipCode Change Year_Diff
#1: 11411 100.00000 2011-2012
#2: 11411 100.00000 2012-2013
#3: 11345 16.66667 2011-2012
#4: 11345 42.85714 2012-2013
data
df1 <- structure(list(Year = c(2011L, 2012L, 2013L, 2011L, 2012L, 2013L
), ZipCode = c(11411L, 11411L, 11411L, 11345L, 11345L, 11345L
), Value = c(5L, 10L, 20L, 6L, 7L, 10L)), class = "data.frame",
row.names = c(NA,
-6L))

Deflate numbers by price index (CPI), when there are varying observation numbers for each year

I have some Revenue data in a format like that shown below. So the years are not sequential and they can also repeat (because of a different firm).
Firm Year Revenue
1 A 2018 100
2 B 2017 90
3 B 2018 80
4 C 2016 20
And I want to adjust the Revenue for inflation, by dividing through by the appropriate CPI for each year. The CPI data looks like this:
Year CPI
1 2016 98
2 2017 100
3 2018 101
I have a solution that works, but is this the best way to do it? Is it clumsy to mutate an entire calculated column in there?
revenue <- data.frame(stringsAsFactors=FALSE,
Firm = c("A", "B", "B", "C"),
Year = c(2018L, 2017L, 2018L, 2016L),
Revenue = c(100L, 90L, 80L, 20L)
)
cpi <- data.frame(
Year = c(2016L, 2017L, 2018L),
CPI = c(98L, 100L, 101L)
)
library(dplyr)
df <- left_join(revenue, cpi, by = 'Year')
mutate(df, real_revenue = (Revenue*100)/CPI)
The output is correct, shown below. But is this the best way to do it?
Firm Year Revenue CPI real_revenue
1 A 2018 100 101 99.00990
2 B 2017 90 100 90.00000
3 B 2018 80 101 79.20792
4 C 2016 20 98 20.40816
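For what it's worth, the join-and-divide can also be written as a single pipe so the intermediate data frame isn't kept, or in base R with match(); both are just sketches of the same calculation on the revenue and cpi frames defined above:
library(dplyr)
revenue %>%
  left_join(cpi, by = "Year") %>%
  mutate(real_revenue = Revenue * 100 / CPI)

# base R equivalent, without keeping a CPI column at all
revenue$real_revenue <- revenue$Revenue * 100 / cpi$CPI[match(revenue$Year, cpi$Year)]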

get the mean of a variable subset of data in R [duplicate]

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
Closed 6 years ago.
Imagine I have the following data:
Year Month State ppo
2011 Jan CA 220
2011 Feb CA 250
2012 Jan CA 230
2011 Jan WA 200
2011 Feb WA 210
I need to calculate the mean for each state for the year, so the output would look something like this:
Year Month State ppo annualAvg
2011 Jan CA 220 235
2011 Feb CA 250 235
2012 Jan CA 230 230
2011 Jan WA 200 205
2011 Feb WA 210 205
where the annual average is the mean of any entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they are variable is throwing me off.
Looking around, it seems like maybe ddply is what I want to be using for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations of it that I won't bother to post them all here). Any idea how I am actually supposed to be doing this?
Thanks for the help!
Try this:
library(data.table)
setDT(df)
df[ , annualAvg := mean(ppo) , by =.(Year, State) ]
Base R: df$annualAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)
Using dplyr with group_by %>% mutate to add a column:
library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))
#Source: local data frame [5 x 5]
#Groups: Year, State [3]
# Year Month State ppo annualAvg
# (int) (fctr) (fctr) (int) (dbl)
#1 2011 Jan CA 220 235
#2 2011 Feb CA 250 235
#3 2012 Jan CA 230 230
#4 2011 Jan WA 200 205
#5 2011 Feb WA 210 205
Using data.table:
library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]
df
# Year Month State ppo annualAvg
#1: 2011 Jan CA 220 235
#2: 2011 Feb CA 250 235
#3: 2012 Jan CA 230 230
#4: 2011 Jan WA 200 205
#5: 2011 Feb WA 210 205
Data:
df <- structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"),
    State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA",
    "WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L,
    210L)), class = "data.frame", row.names = c(NA, -5L))
