I want to reshape a longitudinal data set.
How can I create a new column from the number embedded in the column names (e.g. gdpPercap_1952, gdpPercap_1957, etc.)?
I am trying to split each of those column names into its letter part (gdpPercap) and its number part (the year).
I then want to put the number part into a new column called "year".
Could you tell me how to do that?
Or is there a more suitable approach?
continent country gdpPercap_1952
1 Africa Algeria 2449.0082
2 Africa Angola 3520.6103
3 Africa Benin 1062.7522
4 Africa Botswana 851.2411
5 Africa Burkina Faso 543.2552
6 Africa Burundi 339.2965
gdpPercap_1957 gdpPercap_1962
1 3013.9760 2550.8169
2 3827.9405 4269.2767
3 959.6011 949.4991
4 918.2325 983.6540
5 617.1835 722.5120
6 379.5646 355.2032
gdpPercap_1967 gdpPercap_1972
1 3246.9918 4182.6638
2 5522.7764 5473.2880
3 1035.8314 1085.7969
This can be done with pivot_longer
library(tidyr)
pivot_longer(df1, cols = starts_with('gdp'),
names_to = c(".value", "year"), names_sep = "_")
Or instead of names_sep, we could use names_pattern:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(continent, country),
names_to = c(".value", "year"),
names_pattern = "(.*)_(\\d+)"
) %>%
data.frame()
continent country year gdpPercap
1 Africa Algeria 1952 2449.0082
2 Africa Algeria 1957 3013.9760
3 Africa Algeria 1962 2550.8169
4 Africa Angola 1952 3520.6103
5 Africa Angola 1957 3827.9405
6 Africa Angola 1962 4269.2767
7 Africa Benin 1952 1062.7522
8 Africa Benin 1957 959.6011
9 Africa Benin 1962 949.4991
10 Africa Botswana 1952 851.2411
11 Africa Botswana 1957 918.2325
12 Africa Botswana 1962 983.6540
13 Africa Burkina Faso 1952 543.2552
14 Africa Burkina Faso 1957 617.1835
15 Africa Burkina Faso 1962 722.5120
16 Africa Burundi 1952 339.2965
17 Africa Burundi 1957 379.5646
18 Africa Burundi 1962 355.2032
data:
df <- structure(list(continent = c("Africa", "Africa", "Africa", "Africa",
"Africa", "Africa"), country = c("Algeria", "Angola", "Benin",
"Botswana", "Burkina Faso", "Burundi"), gdpPercap_1952 = c(2449.0082,
3520.6103, 1062.7522, 851.2411, 543.2552, 339.2965), gdpPercap_1957 = c(3013.976,
3827.9405, 959.6011, 918.2325, 617.1835, 379.5646), gdpPercap_1962 = c(2550.8169,
4269.2767, 949.4991, 983.654, 722.512, 355.2032)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
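Both calls above return year as a character column, because it is built from pieces of the column names. A minimal sketch of one way to get an integer year directly, assuming a tidyr version that supports the names_transform argument; the two-country data frame wide is a hypothetical subset of the data above:

```r
library(tidyr)

# Hypothetical two-country subset of the wide data from the question.
wide <- data.frame(
  continent = c("Africa", "Africa"),
  country = c("Algeria", "Angola"),
  gdpPercap_1952 = c(2449.0082, 3520.6103),
  gdpPercap_1957 = c(3013.9760, 3827.9405)
)

long <- pivot_longer(
  wide,
  cols = starts_with("gdp"),
  # ".value" keeps the part before "_" as the value column name (gdpPercap);
  # the part after "_" goes into the new "year" column.
  names_to = c(".value", "year"),
  names_sep = "_",
  # Convert the year piece of each column name from character to integer.
  names_transform = list(year = as.integer)
)

long
```

With an integer year, later steps such as filtering by range or plotting against time do not need a separate conversion.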
I've searched extensively online but have not found an answer that fits this specific case.
I'm working in R and want to restructure a panel data set from long to wide format, but only partly: each indicator named in the rows of one column should become its own column, while the other columns stay as they are.
Consider this original format:
SERIES ECONOMY YEAR Value
246 CPI Panama 1960 0.05
247 CPI Peru 1960 0.05
248 CPI XXXXXX 1960 0.05
249 CPI Panama 1961 0.06
250 CPI Peru 1961 0.06
251 CPI XXXXXX 1961 0.06
252 % Gross savings Panama 1960 5
253 % Gross savings Peru 1960 6
254 % Gross savings XXXXXX 1960 7
255 % Gross savings Panama 1961 20
256 % Gross savings Peru 1961 21
257 % Gross savings XXXXXX 1961 22
(And so on for different countries, different indicators in the "SERIES" column, during 1960-2020 for each country and indicator.)
I'm looking to keep "ECONOMY" as its own column specifying the country as originally seen, keep the year as a column as well, but move each separate indicator under SERIES (e.g. CPI / % Gross savings) into their own columns like this:
ECONOMY YEAR CPI %_GROSS_SAVINGS
1 Panama 1960 0.05 5
2 Peru 1960 0.05 6
3 XXXXXX 1960 0.05 7
4 Panama 1961 0.06 20
5 Peru 1961 0.06 21
6 XXXXXX 1961 0.06 22
Any ideas? Grateful for answers.
Not sure if I follow - this seems to me like a typical pivot_wider use:
library(tidyr)
dat |> pivot_wider(names_from = "SERIES",
values_from = "Value")
#> # A tibble: 6 x 4
#> ECONOMY YEAR CPI `% Gross savings`
#> <chr> <dbl> <dbl> <dbl>
#> 1 Panama 1960 0.05 5
#> 2 Peru 1960 0.05 6
#> 3 XXXXXX 1960 0.05 7
#> 4 Panama 1961 0.06 20
#> 5 Peru 1961 0.06 21
#> 6 XXXXXX 1961 0.06 22
Created on 2022-04-08 by the reprex package (v2.0.0)
Reproducible data:
dat <- structure(list(SERIES = c("CPI", "CPI", "CPI", "CPI", "CPI",
"CPI", "% Gross savings", "% Gross savings", "% Gross savings",
"% Gross savings", "% Gross savings", "% Gross savings"), ECONOMY = c("Panama",
"Peru", "XXXXXX", "Panama", "Peru", "XXXXXX", "Panama", "Peru",
"XXXXXX", "Panama", "Peru", "XXXXXX"), YEAR = c(1960, 1960, 1960,
1961, 1961, 1961, 1960, 1960, 1960, 1961, 1961, 1961), Value = c(0.05,
0.05, 0.05, 0.06, 0.06, 0.06, 5, 6, 7, 20, 21, 22)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
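The question asks for a column named %_GROSS_SAVINGS, whereas pivot_wider keeps the original label `% Gross savings`. A minimal sketch of one way to rename the new columns afterwards, assuming dplyr 1.0.0 or later for rename_with; the four-row data frame dat_small is a hypothetical subset of the data above:

```r
library(dplyr)
library(tidyr)

# Hypothetical four-row subset of the long data.
dat_small <- data.frame(
  SERIES = c("CPI", "CPI", "% Gross savings", "% Gross savings"),
  ECONOMY = c("Panama", "Peru", "Panama", "Peru"),
  YEAR = c(1960, 1960, 1960, 1960),
  Value = c(0.05, 0.05, 5, 6)
)

wide <- dat_small %>%
  pivot_wider(names_from = SERIES, values_from = Value) %>%
  # Upper-case the indicator columns and replace spaces with underscores,
  # leaving the identifier columns ECONOMY and YEAR untouched.
  rename_with(~ toupper(gsub(" ", "_", .x, fixed = TRUE)),
              .cols = !c(ECONOMY, YEAR))

names(wide)
```

A name that starts with % is not syntactic in R, so it still has to be wrapped in backticks when referenced, e.g. wide$`%_GROSS_SAVINGS`.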
reshape2
reshape2::dcast(ECONOMY + YEAR ~ SERIES, data = zz)
# Using Value as value column: use value.var to override.
# ECONOMY YEAR %_Gross_savings CPI
# 1 Panama 1960 5 0.05
# 2 Panama 1961 20 0.06
# 3 Peru 1960 6 0.05
# 4 Peru 1961 21 0.06
# 5 XXXXXX 1960 7 0.05
# 6 XXXXXX 1961 22 0.06
Data
zz <- structure(list(SERIES = c("CPI", "CPI", "CPI", "CPI", "CPI", "CPI", "%_Gross_savings", "%_Gross_savings", "%_Gross_savings", "%_Gross_savings", "%_Gross_savings", "%_Gross_savings"), ECONOMY = c("Panama", "Peru", "XXXXXX", "Panama", "Peru", "XXXXXX", "Panama", "Peru", "XXXXXX", "Panama", "Peru", "XXXXXX"), YEAR = c(1960L, 1960L, 1960L, 1961L, 1961L, 1961L, 1960L, 1960L, 1960L, 1961L, 1961L, 1961L), Value = c(0.05, 0.05, 0.05, 0.06, 0.06, 0.06, 5, 6, 7, 20, 21, 22)), class = "data.frame", row.names = c("246", "247", "248", "249", "250", "251", "252", "253", "254", "255", "256", "257"))
I have the following data frame but in a bigger scale of course:
country year strain  num_cases
mex     1996 sp_m014 412
mex     1996 sp_f014 214
mex     1998 sp_m014 150
mex     1998 sp_f014 200
usa     1996 sp_m014 200
usa     1996 sp_f014 180
usa     1997 sp_m014 190
usa     1997 sp_f014 150
I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:
country year strain num_cases
mex     1996 sp     626
mex     1998 sp     350
usa     1996 sp     380
usa     1997 sp     340
In my real data frame I have a lot more age ranges, here I only show the 014 for males and females. But I want to summarize them that way for every age range and gender.
Thanks!
Group by 'country' and 'year', then in summarise set 'strain' to 'sp' and take the sum of 'num_cases':
library(dplyr)
df1 %>%
group_by(country, year) %>%
summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')
-output
# A tibble: 4 x 4
# country year strain num_cases
#* <chr> <int> <chr> <int>
#1 mex 1996 sp 626
#2 mex 1998 sp 350
#3 usa 1996 sp 380
#4 usa 1997 sp 340
data
df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa",
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L,
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014",
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L,
214L, 150L, 200L, 200L, 180L, 190L, 150L)),
class = "data.frame", row.names = c(NA,
-8L))
Here's an approach with tidyr::extract:
library(tidyr);library(dplyr)
df1 %>%
extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
group_by(country,year,strain) %>%
summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups: country, year [4]
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
Now that you have the strains fully parsed, you can easily group by sex or age. Thanks to @akrun for the data.
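As a sketch of that last point, the parsed age column can be added to the grouping so that males and females are summed within each age range. The sp_m1524 and sp_f1524 rows below are hypothetical and only illustrate a second age range:

```r
library(dplyr)
library(tidyr)

# One country-year with two age ranges; the 1524 rows are made up.
cases <- data.frame(
  country = "mex",
  year = 1996L,
  strain = c("sp_m014", "sp_f014", "sp_m1524", "sp_f1524"),
  num_cases = c(412L, 214L, 100L, 50L)
)

by_age <- cases %>%
  # Split e.g. "sp_m014" into strain = "sp", sex = "m", age = "014".
  tidyr::extract(strain, into = c("strain", "sex", "age"),
                 regex = "(\\w+)_([mf])(.*)") %>%
  # Sum over sex within each country, year, strain and age range.
  group_by(country, year, strain, age) %>%
  summarise(num_cases = sum(num_cases), .groups = "drop")

by_age
```

Swapping age for sex in group_by would give totals per sex instead.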
Update:
To group by the age range instead, you can pull the number out of strain with parse_number from the readr package:
df1 %>%
mutate(age_range=readr::parse_number(strain)) %>%
group_by(country, year, age_range) %>%
summarise(num_cases=sum(num_cases))
Output:
country year age_range num_cases
<chr> <int> <dbl> <int>
1 mex 1996 14 626
2 mex 1998 14 350
3 usa 1996 14 380
4 usa 1997 14 340
First answer:
Thanks to akrun for the data:
library(tidyverse)
df1 %>%
# reduce strain to its two-letter prefix before grouping on it
mutate(strain=str_extract(strain, "^.{2}")) %>%
group_by(country, year, strain) %>%
summarise(num_cases=sum(num_cases))
Output:
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
Dear Community,
I am working with R and looking for trends in time series data of bilateral exports over a duration of 20 years. As the data is fluctuating a lot between the years (and in addition is not 100% reliable), I would prefer to use four-years-average data (instead of looking at every single year separately) in order to analyze how the main export partners have changed over time.
I have the following dataset, called GrossExp3, covering the bilateral exports (in 1000 USD) of 15 reporter countries to all available partner countries for all years from 1998 to 2019.
It covers the following four variables:
Year, ReporterName (= exporter) , PartnerName (= export destination), 'TradeValue in 1000 USD' ( = export value to the destination)
The PartnerName column also includes an entry, called “All”, which is the total sum of all exports for each year by reporter.
Here is the summary of my data
> summary(GrossExp3)
Year ReporterName PartnerName TradeValue in 1000 USD
Min. :1998 Length:35961 Length:35961 Min. : 0
1st Qu.:2004 Class :character Class :character 1st Qu.: 39
Median :2009 Mode :character Mode :character Median : 597
Mean :2009 Mean : 134370
3rd Qu.:2014 3rd Qu.: 10090
Max. :2018 Max. :47471515
My goal is to return a table that shows, for each exporter, each export destination's share of that exporter's total exports (in percent) over a given period. Instead of every single year, I want the average data for the following periods: 2000-2003, 2004-2007, 2008-2011, 2012-2015, 2016-2019.
What I tried
My current code (created with the support of this amazing community) is shown below. At the moment it shows the data for each year separately, but I need the period averages described above instead.
# load packages
library(data.table)
library(plyr) # load plyr before dplyr so dplyr's mutate/arrange/summarise are not masked
library(dplyr)
library(tidyr)
library(stringr)
library(visdat)
# set working directory
setwd("C:/R/R_09.2020/Other Indicators/Bilateral Trade Shift of Partners")
# load data
# create a file path SITC 3
path1 <- file.path("SITC Rev 3_Data from 1998.csv")
# load csv data table, call "SITC3"
SITC3 <- fread(path1, drop = c(1,9,11,13))
# prepare data (SITC3) for analysis
# Filter for GROSS EXPORTS SITC3 (Gross exports = Exports that include intermediate products)
GrossExp3 <- SITC3 %>%
filter(TradeFlowName == "Gross Exp.", PartnerISO3 != "All", Year != 2019) %>% # filter for gross exports, remove "All", remove 2019
select(Year, ReporterName, PartnerName, `TradeValue in 1000 USD`) %>%
arrange(ReporterName, desc(Year))
# compare with old subset
summary(GrossExp3)
summary(SITC3)
# calculate percentage of total
GrossExp3Main <- GrossExp3 %>%
group_by(Year, ReporterName) %>%
add_tally(wt = `TradeValue in 1000 USD`, name = "TotalValue") %>%
mutate(Percentage = 100 * (`TradeValue in 1000 USD` / TotalValue)) %>%
arrange(ReporterName, desc(Year), desc(Percentage))
head(GrossExp3Main, n = 20)
# print tables in separate sheets to get an overview about hierarchy of export partners and development over time
SpreadExpMain <- GrossExp3Main %>%
select(Year, ReporterName, PartnerName, Percentage) %>%
spread(key = Year, value = Percentage) %>%
arrange(ReporterName, desc(`2018`))
View(SpreadExpMain) # shows whole table
Here is the head of my data
> head(GrossExp3Main, n = 20)
# A tibble: 20 x 6
# Groups: Year, ReporterName [7]
Year ReporterName PartnerName `TradeValue in 100~ TotalValue Percentage
<int> <chr> <chr> <dbl> <dbl> <dbl>
1 2018 Angola China 24517058. 42096736. 58.2
2 2018 Angola India 3768940. 42096736. 8.95
3 2017 Angola China 19487067. 34904881. 55.8
4 2017 Angola India 2890061. 34904881. 8.28
5 2016 Angola China 13923092. 28057500. 49.6
6 2016 Angola India 1948845. 28057500. 6.95
7 2016 Angola United States 1525650. 28057500. 5.44
8 2015 Angola China 14320566. 33924937. 42.2
9 2015 Angola India 2676340. 33924937. 7.89
10 2015 Angola Spain 2245976. 33924937. 6.62
11 2014 Angola China 27527111. 58672369. 46.9
12 2014 Angola India 4507416. 58672369. 7.68
13 2014 Angola Spain 3726455. 58672369. 6.35
14 2013 Angola China 31947235. 67712527. 47.2
15 2013 Angola India 6764233. 67712527. 9.99
16 2013 Angola United States 5018391. 67712527. 7.41
17 2013 Angola Other Asia, ~ 4007020. 67712527. 5.92
18 2012 Angola China 33710030. 70863076. 47.6
19 2012 Angola India 6932061. 70863076. 9.78
20 2012 Angola United States 6594526. 70863076. 9.31
I am not sure whether the results I have obtained so far are correct.
In addition, I have the following questions:
Do you have any recommendations on how to print nice-looking tables with R?
How can I round the percentage data to one decimal place?
As I have been stuck on these issues all week, I would be very grateful for any recommendations on how to solve them!
Wishing you a nice weekend and all the best,
Melike
** EDIT**
here is some sample data
dput(head(GrossExp3Main, n = 20))
structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L), ReporterName = c("Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola"), PartnerName = c("China",
"India", "United States", "Spain", "South Africa", "Portugal",
"United Arab Emirates", "France", "Thailand", "Canada", "Indonesia",
"Singapore", "Italy", "Israel", "United Kingdom", "Unspecified",
"Namibia", "Uruguay", "Congo, Rep.", "Japan"), `TradeValue in 1000 USD` = c(24517058.342,
3768940.47, 1470132.736, 1250554.873, 1161852.097, 1074137.369,
884725.078, 734551.345, 649626.328, 647164.297, 575477.283, 513982.584,
468914.918, 452453.482, 425616.975, 423008.886, 327921.516, 320586.229,
299119.102, 264671.779), TotalValue = c(42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31), Percentage = c(58.2398078593471,
8.9530467213552, 3.49227247731025, 2.97066942147468, 2.75995765667944,
2.55159298119945, 2.10164767046284, 1.74491281127062, 1.54317504144777,
1.53732653342598, 1.3670353890672, 1.22095589599877, 1.11389850877492,
1.07479467925527, 1.01104506502775, 1.00484959899258, 0.778971352043039,
0.761546516668669, 0.710551762961598, 0.62872279943737)), row.names = c(NA,
-20L), groups = structure(list(Year = 2018L, ReporterName = "Angola",
.rows = structure(list(1:20), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1L, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
To do what you want, you need an additional variable that groups the years together. I used cut to create it.
library(dplyr)
# Define the breaks and labels for the year groups.
# Each break is the first year of a group. With right = FALSE, cut() makes the
# intervals closed on the left, e.g. [2000, 2004), so 2004 starts the next group.
year_group_break <- c(2000, 2004, 2008, 2012, 2016, 2020)
year_group_labels <- c("2000-2003", "2004-2007", "2008-2011", "2012-2015", "2016-2019")
data %>%
# create the year group variable
mutate(year_group = cut(Year, breaks = year_group_break,
labels = year_group_labels,
include.lowest = TRUE, right = FALSE)) %>%
# calculate the total value for each Reporter + Partner in each year group
group_by(year_group, ReporterName, PartnerName) %>%
summarize(`TradeValue in 1000 USD` = sum(`TradeValue in 1000 USD`),
.groups = "drop") %>%
# calculate the percentage value for Partner of each Reporter/Year group
group_by(year_group, ReporterName) %>%
mutate(Percentage = `TradeValue in 1000 USD` / sum(`TradeValue in 1000 USD`)) %>%
ungroup()
Sample output
year_group ReporterName PartnerName `TradeValue in 1000 USD` Percentage
<fct> <chr> <chr> <dbl> <dbl>
1 2016-2019 Angola Canada 647164. 0.0161
2 2016-2019 Angola China 24517058. 0.609
3 2016-2019 Angola Congo, Rep. 299119. 0.00744
4 2016-2019 Angola France 734551. 0.0183
5 2016-2019 Angola India 3768940. 0.0937
6 2016-2019 Angola Indonesia 575477. 0.0143
7 2016-2019 Angola Israel 452453. 0.0112
8 2016-2019 Angola Italy 468915. 0.0117
9 2016-2019 Angola Japan 264672. 0.00658
10 2016-2019 Angola Namibia 327922. 0.00815
11 2016-2019 Angola Portugal 1074137. 0.0267
12 2016-2019 Angola Singapore 513983. 0.0128
13 2016-2019 Angola South Africa 1161852. 0.0289
14 2016-2019 Angola Spain 1250555. 0.0311
15 2016-2019 Angola Thailand 649626. 0.0161
16 2016-2019 Angola United Arab Emirates 884725. 0.0220
17 2016-2019 Angola United Kingdom 425617. 0.0106
18 2016-2019 Angola United States 1470133. 0.0365
19 2016-2019 Angola Unspecified 423009. 0.0105
20 2016-2019 Angola Uruguay 320586. 0.00797
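To see how these breaks assign individual years, here is a small base R sketch using the same breaks and labels. The vector years is made up for illustration. Note that years before the first break (1998 and 1999 in the question's data) fall outside every interval and get NA, so they would need to be filtered out or given their own group. The last two lines show one way to express a share as a percentage rounded to one decimal place.

```r
year_breaks <- c(2000, 2004, 2008, 2012, 2016, 2020)
year_labels <- c("2000-2003", "2004-2007", "2008-2011", "2012-2015", "2016-2019")

# Illustrative years, including one before the first break.
years <- c(1998, 2000, 2003, 2004, 2015, 2016, 2019)

# right = FALSE gives left-closed intervals: [2000, 2004), [2004, 2008), ...
period <- cut(years, breaks = year_breaks, labels = year_labels, right = FALSE)
data.frame(years, period)

# Shares such as 0.609 can be shown as percentages with one decimal place.
shares <- c(0.609, 0.0937)
pct <- round(100 * shares, 1)
pct
```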
I have two datasets that look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 NA
South Asia 2009 4.5 NA
South Asia 2011 11 0
South Asia 2014 16.7 NA
Africa 2008 0.4 NA
Africa 2013 3.5 0
Africa 2017 9.7 NA
Strategy
Region StrategyYear
South Asia 2011
Africa 2013
Japan 2007
SE Asia 2009
There are multiple regions, and the review years are irregular and differ between regions. I have added a column 'Index' to the dataframe 'Sales' that is zero in the row whose review year equals the region's strategy year from the second dataframe. I now want to replace each NA with the number of rows between that row and the zero row (negative before it, positive after it), grouped by 'Region'.
I can do this with a for loop, but that is tedious, so I am checking whether there is a cleaner way. The final output should look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 -2
South Asia 2009 4.5 -1
South Asia 2011 11 0
South Asia 2014 16.7 1
Africa 2008 0.4 -1
Africa 2013 3.5 0
Africa 2017 9.7 1
Join the two datasets by Region, then, within each Region, compute Index as the row number minus the position of the row whose ReviewYear matches the StrategyYear.
library(dplyr)
left_join(Sales, Strategy, by = 'Region') %>%
arrange(Region, ReviewYear) %>%
group_by(Region) %>%
mutate(Index = row_number() - match(first(StrategyYear), ReviewYear))
# Region ReviewYear Sales Index StrategyYear
# <chr> <int> <dbl> <int> <int>
#1 Africa 2008 0.4 -1 2013
#2 Africa 2013 3.5 0 2013
#3 Africa 2017 9.7 1 2013
#4 SouthAsia 2006 1.5 -2 2011
#5 SouthAsia 2009 4.5 -1 2011
#6 SouthAsia 2011 11 0 2011
#7 SouthAsia 2014 16.7 1 2011
data
Sales <- structure(list(Region = c("SouthAsia", "SouthAsia", "SouthAsia",
"SouthAsia", "Africa", "Africa", "Africa"), ReviewYear = c(2006L,
2009L, 2011L, 2014L, 2008L, 2013L, 2017L), Sales = c(1.5, 4.5,
11, 16.7, 0.4, 3.5, 9.7), Index = c(NA, NA, 0L, NA, NA, 0L, NA
)), class = "data.frame", row.names = c(NA, -7L))
Strategy <- structure(list(Region = c("SouthAsia", "Africa", "Japan", "SEAsia"
), StrategyYear = c(2011L, 2013L, 2007L, 2009L)), class = "data.frame",
row.names = c(NA, -4L))
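An alternative sketch that skips the join, assuming the Index column already holds 0 in the strategy-year row as described in the question. The data frame sales below is a trimmed copy of the data above. Taking the first element of which() means a region without a zero row gets NA for every row instead of raising an error.

```r
library(dplyr)

# Trimmed copy of the Sales data: Index is 0 in the strategy-year row, NA elsewhere.
sales <- data.frame(
  Region = c("SouthAsia", "SouthAsia", "SouthAsia", "SouthAsia",
             "Africa", "Africa", "Africa"),
  ReviewYear = c(2006L, 2009L, 2011L, 2014L, 2008L, 2013L, 2017L),
  Index = c(NA, NA, 0L, NA, NA, 0L, NA)
)

indexed <- sales %>%
  group_by(Region) %>%
  # Sort by review year within each region so row positions follow time order.
  arrange(ReviewYear, .by_group = TRUE) %>%
  # Row position minus the position of the zero row gives the signed offset.
  mutate(Index = row_number() - which(Index == 0)[1]) %>%
  ungroup()

indexed
```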