To be honest, I am completely stuck, and I'm not quite sure how to phrase the title either.
I have two datasets; let's say they look something like this:
Dataset1 (i.e. GDP-related):
Year Country
2000 Austria
2001 Austria
2000 Belgium
2001 Belgium
Dataset2 (tax-related):
Year Austria Belgium
2000 55 48
2001 51 45
So what I would like is to write some sort of function/loop that essentially says:
if a value of the Country variable in Dataset1 is also a column name in Dataset2, use those observations.
Then, conditional on the year and country, I want to create a new variable in Dataset1 called Tax that holds the country's tax rate from Dataset2.
So for instance, Austria (an observation in Dataset1) is also the name of a variable in Dataset2, so I want to take its tax rate from Dataset2 and apply 55 for year 2000 and 51 for 2001 in Dataset1. And this should go on for all countries and years.
The result should thus look like:
Dataset1 (i.e. GDP-related):
Year Country Tax
2000 Austria 55
2001 Austria 51
2000 Belgium 48
2001 Belgium 45
My dataset is quite big, so I would much prefer some sort of algorithm for this.
Thanks!
Assuming the first dataset has more columns, reshape the second dataset to long format with pivot_longer, then join it to the first one (left_join), matching on 'Year' and 'Country':
library(dplyr)
library(tidyr)
df2 %>%
  pivot_longer(cols = -Year, names_to = 'Country', values_to = 'Tax') %>%
  left_join(df1, ., by = c('Year', 'Country'))
-output
Year Country Tax
1 2000 Austria 55
2 2001 Austria 51
3 2000 Belgium 48
4 2001 Belgium 45
data
df1 <- structure(list(Year = c(2000L, 2001L, 2000L, 2001L), Country = c("Austria",
"Austria", "Belgium", "Belgium")), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(Year = 2000:2001, Austria = c(55L, 51L), Belgium = c(48L,
45L)), class = "data.frame", row.names = c(NA, -2L))
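The question describes the task as a lookup: for each row of the first dataset, find the year and the country column in the second one. A minimal base R sketch of that idea follows, as an alternative to the join above rather than a replacement for it. It assumes every column of df2 other than Year is a country; tax_mat, row_idx and col_idx are names made up for this illustration.

```r
# Same example data as above.
df1 <- data.frame(Year = c(2000L, 2001L, 2000L, 2001L),
                  Country = c("Austria", "Austria", "Belgium", "Belgium"))
df2 <- data.frame(Year = 2000:2001, Austria = c(55L, 51L), Belgium = c(48L, 45L))

# Tax rates as a matrix: one row per year, one column per country.
# Assumption: every column of df2 except Year is a country.
tax_mat <- as.matrix(df2[-1])

# For every row of df1, find the matching year (row) and country (column).
row_idx <- match(df1$Year, df2$Year)
col_idx <- match(df1$Country, colnames(tax_mat))

# Index the matrix with (row, column) pairs; unmatched pairs become NA.
df1$Tax <- tax_mat[cbind(row_idx, col_idx)]
df1
```

Because nothing is reshaped or joined, this stays cheap on large data, but it only handles one value column at a time.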
This should also work:
library(dplyr)
library(tidyr)
df2 %>%
# pivot_longer(-Year) %>% first solution
pivot_longer(cols = -Year, names_to = 'Country', values_to = 'Tax') %>% # taken from #akrun
arrange(Country)
Year Country Tax
<int> <chr> <int>
1 2000 Austria 55
2 2001 Austria 51
3 2000 Belgium 48
4 2001 Belgium 45
I have a dataset that looks like this:
year china India United state ....
2020 30 40 50
2021 20 30 60
2022 34 20 40
....
I have 10 columns and more than 50 rows in this data frame. I have to plot them in one graph to show the movement of the different countries.
So I think a line graph would be good for the purpose, but I don't know how I should do the visualisation.
I think I should change the data frame format and then start the visualisation. How should I do it?
Pivot (reshape from wide to long) then plot with groups.
dat <- structure(list(year = 2020:2022, China = c(30L, 20L, 34L), India = c(40L, 30L, 20L), UnitedStates = c(50L, 60L, 40L)), class = "data.frame", row.names = c(NA, -3L))
datlong <- reshape2::melt(dat, "year", variable.name = "country", value.name = "value")
datlong
# year country value
# 1 2020 China 30
# 2 2021 China 20
# 3 2022 China 34
# 4 2020 India 40
# 5 2021 India 30
# 6 2022 India 20
# 7 2020 UnitedStates 50
# 8 2021 UnitedStates 60
# 9 2022 UnitedStates 40
### or using tidyr::
tidyr::pivot_longer(dat, -year, names_to = "country", values_to = "value")
Once reshaped, just group= (and optionally color=) lines:
library(ggplot2)
ggplot(datlong, aes(year, value, color = country)) +
geom_line(aes(group = country))
If you have many more years, the decimal years on the axis will likely smooth out. Alternatively, you can control it by converting year to a Date class and forcing the display with scale_x_date.
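A minimal sketch of the Date conversion mentioned above, rebuilding the long data so the block runs on its own. Mapping each year to 1 January and the new column name date are assumptions made for the illustration.

```r
library(ggplot2)

# Long-format data as produced above (rebuilt here so the block is self-contained).
datlong <- data.frame(
  year    = rep(2020:2022, times = 3),
  country = rep(c("China", "India", "UnitedStates"), each = 3),
  value   = c(30, 20, 34, 40, 30, 20, 50, 60, 40)
)

# Assumption: treat each year as 1 January of that year.
datlong$date <- as.Date(paste0(datlong$year, "-01-01"))

p <- ggplot(datlong, aes(date, value, color = country)) +
  geom_line(aes(group = country)) +
  # One axis break per year, labelled with the four-digit year only.
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
p
```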
I have the following data frame but in a bigger scale of course:
country year strain num_cases
mex 1996 sp_m014 412
mex 1996 sp_f014 214
mex 1998 sp_m014 150
mex 1998 sp_f014 200
usa 1996 sp_m014 200
usa 1996 sp_f014 180
usa 1997 sp_m014 190
usa 1997 sp_f014 150
I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:
country year strain num_cases
mex 1996 sp 626
mex 1998 sp 350
usa 1996 sp 380
usa 1997 sp 340
In my real data frame I have a lot more age ranges, here I only show the 014 for males and females. But I want to summarize them that way for every age range and gender.
Thanks!
Group by 'country' and 'year', then summarise: set 'strain' to 'sp' and take the sum of 'num_cases'.
library(dplyr)
df1 %>%
group_by(country, year) %>%
summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')
-output
# A tibble: 4 x 4
# country year strain num_cases
#* <chr> <int> <chr> <int>
#1 mex 1996 sp 626
#2 mex 1998 sp 350
#3 usa 1996 sp 380
#4 usa 1997 sp 340
data
df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa",
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L,
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014",
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L,
214L, 150L, 200L, 200L, 180L, 190L, 150L)),
class = "data.frame", row.names = c(NA,
-8L))
Here's an approach with tidyr::extract:
library(tidyr);library(dplyr)
df1 %>%
extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
group_by(country,year,strain) %>%
summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups: country, year [4]
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
Now that you have the strains fully parsed, you can easily group by sex or age. Thanks to @akrun for the data.
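As a sketch of that last point, the parsed age column can be added to the grouping so that males and females are summed within each age range. The data is the sample from the question; by_age is a name chosen for this illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  country   = rep(c("mex", "usa"), each = 4),
  year      = c(1996L, 1996L, 1998L, 1998L, 1996L, 1996L, 1997L, 1997L),
  strain    = rep(c("sp_m014", "sp_f014"), times = 4),
  num_cases = c(412L, 214L, 150L, 200L, 200L, 180L, 190L, 150L)
)

by_age <- df1 %>%
  # Split e.g. "sp_m014" into strain = "sp", sex = "m", age = "014".
  tidyr::extract(strain, into = c("strain", "sex", "age"), "(\\w+)_([mf])(.*)") %>%
  # Keep age in the grouping so each age range gets its own total across sexes.
  group_by(country, year, strain, age) %>%
  summarise(num_cases = sum(num_cases), .groups = "drop")
by_age
```

With more age ranges in the real data, each (country, year, strain, age) combination would get its own row.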
Update:
To group by the age range, you can use parse_number (from readr, which library(tidyverse) attaches):
df1 %>%
mutate(age_range=parse_number(strain)) %>%
group_by(country, year, age_range) %>%
summarise(num_cases=sum(num_cases))
Output:
country year age_range num_cases
<chr> <int> <dbl> <int>
1 mex 1996 14 626
2 mex 1998 14 350
3 usa 1996 14 380
4 usa 1997 14 340
First answer:
Thanks to akrun for the data:
library(tidyverse)
df1 %>%
  # shorten strain before grouping: a grouping column cannot be modified inside mutate()
  mutate(strain = str_extract(strain, "^.{2}")) %>%
  group_by(country, year, strain) %>%
  summarise(num_cases = sum(num_cases))
Output:
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
I have some data, and I am trying to use the pivot_longer function from tidyr in R to get the output shown below. But I am not able to do it, I am getting Data
I have data in this format. ( with many other column names )
Country State Year 1 Population 1 Year 2 Population2
U.S.A IL 2009 20000 2010 30000
U.S.A VA 2009 30000 2010 40000
I want to get data in this format.
Country State Year Population
U.S.A IL 2009 20000
U.S.A IL 2010 30000
U.S.A VA 2009 30000
U.S.A VA 2010 40000
I am able to do it for only one column, but I am not able to include other columns such as Population.
My code is below.
file1<-file %>%
pivot_longer(
cols = contains("Year"),
names_sep = "_",
names_to = c(".value", "repeat"),
)
I was able to make it work with the tidyverse.
library(tidyverse)
library(readxl) # read_excel() lives in readxl, which library(tidyverse) does not attach
file<-read_excel("peps300.xlsx")
names(file)<-str_replace_all(names(file), c("Year " = "Year_" , "Num " = "Num_", "DRate " = "DRate_" , "PRate " = "PRate_", "Denom " = "Denom_"))
file<-file %>%
pivot_longer(
cols = c(contains("Year"),contains("Num"),contains("DRate"),contains("PRate"),contains("Denom")),
names_sep = "_",
names_to = c(".value", "repeat")
)
One option is to select the columns that start with "Population" or "Year":
library(dplyr)
library(tidyr) # pivot_longer() comes from tidyr
df1 %>%
pivot_longer(cols = c(starts_with("Population"), starts_with("Year")),
names_to = c(".value", "group"), names_pattern = "(.*)_(.*)")
# A tibble: 4 x 5
# Country State group Population Year
# <chr> <chr> <chr> <int> <int>
#1 U.S.A IL 1 20000 2009
#2 U.S.A IL 2 30000 2010
#3 U.S.A VA 1 30000 2009
#4 U.S.A VA 2 40000 2010
data
df1 <- structure(list(Country = c("U.S.A", "U.S.A"), State = c("IL",
"VA"), Year_1 = c(2009L, 2009L), Population_1 = c(20000L, 30000L
), Year_2 = c(2010L, 2010L), Population_2 = c(30000L, 40000L)),
class = "data.frame", row.names = c(NA,
-2L))
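For comparison, here is a minimal base R sketch of the same wide-to-long reshape with stats::reshape. It is a different technique from pivot_longer, shown only as an alternative; the variable name long and the explicit column lists are choices made for this illustration.

```r
df1 <- data.frame(
  Country = c("U.S.A", "U.S.A"), State = c("IL", "VA"),
  Year_1 = c(2009L, 2009L), Population_1 = c(20000L, 30000L),
  Year_2 = c(2010L, 2010L), Population_2 = c(30000L, 40000L)
)

long <- reshape(
  df1,
  direction = "long",
  # One vector of wide columns per output column, in matching order.
  varying   = list(c("Year_1", "Year_2"), c("Population_1", "Population_2")),
  v.names   = c("Year", "Population"),
  timevar   = "group",   # plays the role of the "group" column above
  times     = 1:2,
  idvar     = c("Country", "State")
)

# reshape() stacks all group-1 rows before group-2 rows; reorder by state.
long <- long[order(long$State, long$group), ]
rownames(long) <- NULL
long
```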
df %>%
pivot_longer(
-c(Country,State),
names_to = c(".value","group"),
names_pattern = "(.+)_(.+)"
)
# A tibble: 4 x 5
Country State group Year Population
<chr> <chr> <chr> <chr> <chr>
1 U.S.A IL 1 2009 20000
2 U.S.A IL 2 2010 30000
3 U.S.A VA 1 2009 30000
4 U.S.A VA 2 2010 40000
You can then drop the group column if you don't need it.
To do this, you will need to clean your column names first. Make sure they all follow the same pattern, with the words joined by a single separator (a single space or a single underscore) that your names_pattern matches.
df <- structure(list(Country = c("U.S.A", "U.S.A"), State = c("IL",
"VA"), Year_1 = c("2009", "2009"), Population_1 = c("20000",
"30000"), Year_2 = c("2010", "2010"), Population_2 = c("30000",
"40000")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L))
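The column-name cleanup mentioned in this answer could look roughly like the sketch below. The raw headers are hypothetical, modelled on the inconsistent "Year 1" / "Population2" headers in the question, and raw_names and clean_names are names invented for the example.

```r
# Hypothetical raw headers, modelled on the question's "Year 1 ... Population2".
raw_names <- c("Country", "State", "Year 1", "Population 1", "Year 2", "Population2")

# Letters, then an optional space or underscore, then a trailing number
# are rewritten as "<word>_<number>"; names without a trailing number are untouched.
clean_names <- sub("^([A-Za-z]+)[ _]?([0-9]+)$", "\\1_\\2", raw_names)
clean_names
```

After assigning the result back with names(df) <- clean_names, a pattern such as "(.+)_(.+)" can split every measurement column the same way.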
I am trying to melt/stack/gather multiple specific columns of a dataframe into 2 columns, retaining all the others.
I have tried many, many answers on stackoverflow without success (some below). I basically have a situation similar to this post here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
only many more columns to retain and combine. It is important to mention my year columns are factors and I have many, many more columns than the sample listed below so I want to call column names not positions.
>df
ID Code Country year.x value.x year.y value.y year.x.x value.x.x
1 A USA 2000 34.33422 2001 35.35241 2002 42.30042
1 A Spain 2000 34.71842 2001 39.82727 2002 43.22209
3 B USA 2000 35.98180 2001 37.70768 2002 44.40232
3 B Peru 2000 33.00000 2001 37.66468 2002 41.30232
4 C Argentina 2000 37.78005 2001 39.25627 2002 45.72927
4 C Peru 2000 40.52575 2001 40.55918 2002 46.62914
I tried using the pivot_longer in tidyr based on the post above which seemed very similar, which resulted in various errors depending on what I did:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = ".")
I also played with melt in reshape2 in various ways which either melted only the values columns or only the years columns. Such as:
new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")
I also tried dplyr gather based on other posts but I find it extremely difficult to understand the help page and posts.
To be clear what I am looking to achieve:
ID Code Country year value
1 A USA 2000 34.33422
1 A Spain 2000 34.71842
3 B USA 2000 35.98180
3 B Peru 2000 33.00000
4 C Argentina 2000 37.78005
4 C Peru 2000 40.52575
1 A USA 2001 35.35241
1 A Spain 2001 39.82727
3 B USA 2001 37.70768
3 B Peru 2001 37.66468
4 C Argentina 2001 39.25627
4 C Peru 2001 40.55918
1 A USA 2002 42.30042
etc.
I really appreciate the help here.
We can specify the names_pattern
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or use names_sep with an escaped ., because according to ?pivot_longer:
names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
This implies that a string separator is treated as a regex by default, and in a regex . matches any character rather than the literal dot. To match a literal dot, either escape it (\\.) or place it inside square brackets ([.]):
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "\\.")
# A tibble: 18 x 6
# ID Code Country group year value
# <int> <chr> <chr> <chr> <int> <dbl>
# 1 1 A USA x 2000 34.3
# 2 1 A USA y 2001 35.4
# 3 1 A USA z 2002 42.3
# 4 1 A Spain x 2000 34.7
# 5 1 A Spain y 2001 39.8
# 6 1 A Spain z 2002 43.2
# 7 3 B USA x 2000 36.0
# 8 3 B USA y 2001 37.7
# 9 3 B USA z 2002 44.4
#10 3 B Peru x 2000 33
#11 3 B Peru y 2001 37.7
#12 3 B Peru z 2002 41.3
#13 4 C Argentina x 2000 37.8
#14 4 C Argentina y 2001 39.3
#15 4 C Argentina z 2002 45.7
#16 4 C Peru x 2000 40.5
#17 4 C Peru y 2001 40.6
#18 4 C Peru z 2002 46.6
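The difference between a regex dot and a literal dot can be checked in isolation with base R's strsplit, which also treats its split argument as a regular expression by default. This is only an illustration of the regex behaviour, not of tidyr internals; the variable names are made up.

```r
nm <- "value.x"

parts_escaped <- strsplit(nm, "\\.")[[1]]              # escaped dot
parts_class   <- strsplit(nm, "[.]")[[1]]              # dot inside a character class
parts_fixed   <- strsplit(nm, ".", fixed = TRUE)[[1]]  # no regex at all

# An unescaped "." matches every character, so only empty strings remain.
parts_any <- strsplit(nm, ".")[[1]]

parts_escaped
parts_any
```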
Update
For the updated dataset
library(stringr)
df2 %>%
rename_at(vars(matches("year|value")), ~
str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or without the rename, use regex lookaround
df2 %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "(?<=year|value)\\.")
data
df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L,
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927,
46.62914)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L,
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232,
45.72927, 46.62914)), class = "data.frame", row.names = c(NA,
-6L))
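To see what the lookbehind separator does to names such as year.x.x, the sketch below applies the same pattern with base R's regexpr and perl = TRUE. It only demonstrates the split on the column names (an assumption about how to illustrate it, not how tidyr implements it); nms, pos, value_part and group_part are illustrative names.

```r
nms <- c("year.x", "value.y", "year.x.x", "value.x.x")

# Position of the dot that directly follows "year" or "value".
pos <- as.integer(regexpr("(?<=year|value)\\.", nms, perl = TRUE))

value_part <- substr(nms, 1, pos - 1)           # goes to ".value"
group_part <- substr(nms, pos + 1, nchar(nms))  # goes to "group"

data.frame(name = nms, value_part, group_part)
```

Only the first dot is used as the separator, so the later dots in x.x stay inside the group label.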
I have two data sets (one for each country) that look like this:
dfGermany
Country Sales Year Code
Germany 2000 2000 221
Germany 1500 2001 150
Germany 2150 2002 270
dfJapan
Country Sales Year Code
Japan 500 2000 221
Japan 750 2001 221
Japan 800 2001 270
Japan 1000 2002 270
Code here is the "name" of the product. What I want to do is take half of the Japanese sales and add it to the df for Germany if the code and the year match.
So for instance, half of the sales values for products 221 and 270 in dfJapan (250 € and 500 €) should be added to dfGermany for years 2000 and 2002. But nothing should happen to the values for 2001, since the code does not match for that year.
I tried merge, but that function did not work since the data sets are of different sizes and I also want to match on both year and code.
We can do a join on 'Year' and 'Code', and then update the 'Sales' column of 'dfGermany' by reference:
library(data.table)
setDT(dfGermany)[dfJapan, Sales := Sales + i.Sales/2, on = .(Year, Code)]
dfGermany
# Country Sales Year Code
#1: Germany 2250 2000 221
#2: Germany 1500 2001 150
#3: Germany 2650 2002 270
data
dfGermany <- structure(list(Country = c("Germany", "Germany", "Germany"),
Sales = c(2000, 1500, 2150), Year = 2000:2002, Code = c(221L,
150L, 270L)), row.names = c(NA, -3L), class = "data.frame")
dfJapan <- structure(list(Country = c("Japan", "Japan", "Japan", "Japan"
), Sales = c(500L, 750L, 800L, 1000L), Year = c(2000L, 2001L,
2001L, 2002L), Code = c(221L, 221L, 270L, 270L)),
class = "data.frame", row.names = c(NA, -4L))
Using dplyr and the data provided by @akrun:
library(dplyr)
dfGermany %>%
left_join(dfJapan %>%
select(Year, Code, sales_japan = Sales),
by = c('Year', 'Code')) %>%
mutate(Sales = Sales + coalesce(sales_japan / 2, 0)) %>%
select(-sales_japan)
> dfGermany
Country Sales Year Code
1 Germany 2250 2000 221
2 Germany 1500 2001 150
3 Germany 2650 2002 270
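Since the question mentions that merge did not work, here is a minimal base R sketch showing one way it can, assuming a left join (all.x = TRUE) on both keys. The names m and Sales_japan are introduced for this illustration only.

```r
dfGermany <- data.frame(Country = "Germany", Sales = c(2000, 1500, 2150),
                        Year = 2000:2002, Code = c(221L, 150L, 270L))
dfJapan <- data.frame(Country = "Japan", Sales = c(500L, 750L, 800L, 1000L),
                      Year = c(2000L, 2001L, 2001L, 2002L),
                      Code = c(221L, 221L, 270L, 270L))

# Left join on both keys: every German row is kept, and Japanese sales are
# attached only where Year and Code both match (otherwise NA).
m <- merge(dfGermany, dfJapan[c("Year", "Code", "Sales")],
           by = c("Year", "Code"), all.x = TRUE, suffixes = c("", "_japan"))

# Add half of the Japanese sales where there was a match, nothing otherwise.
m$Sales <- m$Sales + ifelse(is.na(m$Sales_japan), 0, m$Sales_japan / 2)
m$Sales_japan <- NULL
m
```

The different row counts are not a problem for merge; what matters is that each (Year, Code) pair occurs at most once in dfJapan, as it does in this sample.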