I have a large dataset in long format, with multiple variables 'stacked', in a structure similar to
set.seed(42)
dat_0=data.frame(
c(rep('AFG',4),rep('UK',4)),
c(rep('GDP',2),rep('pop',2)),
rep(c('1990','1991'),4),
runif(8))
colnames(dat_0)<-c('country','variable','year','val')
which produces the following
country variable year val
1 AFG GDP 1990 0.0856120649
2 AFG GDP 1991 0.3052183695
3 AFG pop 1990 0.6674265147
4 AFG pop 1991 0.0002388966
5 UK GDP 1990 0.2085699569
6 UK GDP 1991 0.9330341273
7 UK pop 1990 0.9256447486
8 UK pop 1991 0.7340943010
I want to have each variable (GDP, pop) in its own column:
country year GDP pop
1 AFG 1990 0.0856120649 0.6674265147
2 AFG 1991 0.3052183695 0.0002388966
3 UK 1990 0.2085699569 0.9256447486
4 UK 1991 0.9330341273 0.7340943010
I am really sorry if this is a duplicate, but after going through earlier posts I have still not managed to re-structure my data.
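For reference, a minimal sketch of the long-to-wide step, assuming the tidyr package is available. The data frame is rebuilt here with named columns so the block runs on its own; dat_wide is a name introduced only for this illustration, and the random values will not match the printout above.

```r
library(tidyr)

set.seed(42)
# Same structure as dat_0 in the question, built with named columns
dat_0 <- data.frame(
  country  = rep(c("AFG", "UK"), each = 4),
  variable = rep(rep(c("GDP", "pop"), each = 2), times = 2),
  year     = rep(c("1990", "1991"), times = 4),
  val      = runif(8)
)

# One column per level of `variable`, filled from `val`
dat_wide <- pivot_wider(dat_0, names_from = variable, values_from = val)
dat_wide
```

Each country/year pair ends up on one row, with GDP and pop as separate columns.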
I am using the dataset who (available in the tidyr package), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 columns hold the year, country name and similar identifiers.
I want to calculate the sum of new cases per country per year, but I just can't make it work.
I was expecting something like
group_by(who, country) %>% summarise(count = rowsum(.[5:60]))
would work, but it doesn't.
Can anyone help me understand why it doesn't work, and what to do instead?
You're missing a first step, which is to gather the data into a 'tidy' format. Try this:
who%>%
gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
filter(!is.na(cases))%>%
group_by(country,year)%>%
summarise(sum(cases))
Which gives output:
# A tibble: 3,484 × 3
# Groups: country [219]
country year `sum(cases)`
<chr> <int> <int>
1 Afghanistan 1997 128
2 Afghanistan 1998 1778
3 Afghanistan 1999 745
4 Afghanistan 2000 2666
5 Afghanistan 2001 4639
library(tidyverse)
(long_who <- who |> pivot_longer(cols = -(1:4)))
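long_who |> filter(startsWith(name,"new")) |> # keep only the "new..." case counts, in case other columns (e.g. a population column) are present
group_by(country, year) |> # the question asks for one total per country per year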
summarise(sum_of_new_ = sum(value,na.rm=TRUE))
A base R approach:
data.frame(who[,c("country", "year")],
cnt = rowSums(who[5:60], na.rm = TRUE))
#> country year cnt
#> 1 Afghanistan 1980 0
#> 2 Afghanistan 1981 0
#> 3 Afghanistan 1982 0
#> 4 Afghanistan 1983 0
#> 5 Afghanistan 1984 0
#> 6 Afghanistan 1985 0
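As a side note on why the attempt in the question fails: base R has two similarly named functions. rowSums() adds across the columns of each row, while rowsum() (lowercase s) adds rows together within groups and needs a group argument. A small sketch on a toy matrix (the matrix m and the group labels are made up for illustration):

```r
# Toy 3 x 2 matrix: column a = 1, 2, 3 and column b = 4, 5, 6
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# rowSums(): one total per row (1+4, 2+5, 3+6)
row_totals <- rowSums(m)

# rowsum(): column totals within each group of rows; `group` is required
grp_totals <- rowsum(m, group = c("x", "x", "y"))

row_totals
grp_totals
```

So rowsum(.[5:60]) without a group cannot work, and inside a pipe the dot refers to the whole data frame rather than to the current group.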
You could also do without the long format by using rowSums and across:
library(dplyr)
who |>
group_by(country, year) |>
summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
ungroup()
Alternatives to across(5:58):
across(starts_with("new"))
across(-(1:4))
Output:
# A tibble: 20 × 3
# Groups: country [1]
country year count
<chr> <int> <dbl>
1 Afghanistan 1980 0
2 Afghanistan 1981 0
3 Afghanistan 1982 0
4 Afghanistan 1983 0
5 Afghanistan 1984 0
6 Afghanistan 1985 0
7 Afghanistan 1986 0
8 Afghanistan 1987 0
9 Afghanistan 1988 0
10 Afghanistan 1989 0
11 Afghanistan 1990 0
12 Afghanistan 1991 0
13 Afghanistan 1992 0
14 Afghanistan 1993 0
15 Afghanistan 1994 0
16 Afghanistan 1995 0
17 Afghanistan 1996 0
18 Afghanistan 1997 128
19 Afghanistan 1998 1778
20 Afghanistan 1999 745
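To see the row-wise pattern in isolation, here is a small sketch on made-up data (toy, new_m and new_f are hypothetical names, not part of who). It selects columns by name prefix rather than by position, which avoids having to count columns:

```r
library(dplyr)

# Hypothetical toy data: two count columns prefixed "new", with one NA
toy <- tibble::tibble(
  country = c("A", "A", "B"),
  year    = c(2000L, 2001L, 2000L),
  new_m   = c(1, NA, 3),
  new_f   = c(2, 5, 4)
)

# Sum the "new" columns within each row, treating NA as 0
toy_counts <- toy |>
  mutate(count = rowSums(across(starts_with("new")), na.rm = TRUE)) |>
  select(country, year, count)

toy_counts
```

Since there is already one row per country and year, no grouping is needed for a per-row total.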
I have the following data set with country, year, and GDP:
What I have
Country Year GDP
Afghanistan 1950 $123
Afghanistan 1951 $123
Afghanistan 2019 $123
Australia 1945 $123
Australia 2021 $123
What I need is to create or delete rows so that each country has one row per year from 1948 to 2021. For example, for Afghanistan I need to create the missing rows (such as 1948, 1949, and 2021) with a null GDP, and for Australia I need to delete the 1945 row and create everything in between.
This isn't my exact data; I have 200+ countries, each with different years. Is there an easy way to do this?
What I need
Country Year GDP
Afghanistan 1948 NA
... ... ...
Afghanistan 2021 NA
Australia 1948 $123
... ... ...
Australia 2021 $123
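We can use complete to create the missing combinations and specify the GDP as 0. Note that the fill list has to be passed by name, and that this adds rows but does not drop years outside 1948:2021.
library(dplyr)
library(tidyr)
# use fill = list(GDP = "0") instead if GDP is stored as character (e.g. "$123")
complete(df1, Country, Year = 1948:2021, fill = list(GDP = 0)) %>%
arrange(Country)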
We can use complete, then filter and finally replace_na.
library(dplyr)
library(tidyr) # complete() and replace_na() come from tidyr
df <-read.table(header=TRUE, text="Country Year GDP
Afghanistan 1950 $123
Afghanistan 1951 $123
Afghanistan 2019 $123
Australia 1945 $123
Australia 2021 $123")
df <- df %>%
complete(Year = 1948:2021, Country) %>%
filter(between(Year, 1948, 2021)) %>%
replace_na(list(GDP = "0")) %>% # GDP is character here, so the replacement is a string
arrange(Country)
head(df)
tail(df)
> print(head(df))
# A tibble: 6 x 3
Year Country GDP
<int> <chr> <chr>
1 1948 Afghanistan 0
2 1949 Afghanistan 0
3 1950 Afghanistan $123
4 1951 Afghanistan $123
5 1952 Afghanistan 0
6 1953 Afghanistan 0
> print(tail(df))
# A tibble: 6 x 3
Year Country GDP
<int> <chr> <chr>
1 2016 Australia 0
2 2017 Australia 0
3 2018 Australia 0
4 2019 Australia 0
5 2020 Australia 0
6 2021 Australia $123
Created on 2021-09-26 by the reprex package (v2.0.1)
library(tidyr)
library(dplyr)
df <-
tibble::tribble(
~Country, ~Year, ~GDP,
"Afghanistan", 1950L, "$123",
"Afghanistan", 1951L, "$123",
"Afghanistan", 2019L, "$123",
"Australia", 1945L, "$123",
"Australia", 2021L, "$123"
)
df %>%
filter(Year >= 1948 & Year <= 2021) %>%
complete(Year = 1948:2021,Country) %>%
arrange(Country)
# A tibble: 148 x 3
Year Country GDP
<int> <chr> <chr>
1 1948 Afghanistan NA
2 1949 Afghanistan NA
3 1950 Afghanistan $123
4 1951 Afghanistan $123
5 1952 Afghanistan NA
6 1953 Afghanistan NA
7 1954 Afghanistan NA
8 1955 Afghanistan NA
9 1956 Afghanistan NA
10 1957 Afghanistan NA
# ... with 138 more rows
Here is a solution with complete and coalesce
library(dplyr)
library(tidyr)
df %>%
complete(Year = 1948:2021, Country) %>%
arrange(Country, Year) %>%
mutate(GDP = coalesce(GDP, "0"))
# A tibble: 149 x 3
Year Country GDP
<int> <chr> <chr>
1 1948 Afghanistan 0
2 1949 Afghanistan 0
3 1950 Afghanistan $123
4 1951 Afghanistan $123
5 1952 Afghanistan 0
6 1953 Afghanistan 0
7 1954 Afghanistan 0
8 1955 Afghanistan 0
9 1956 Afghanistan 0
10 1957 Afghanistan 0
# … with 139 more rows
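Since GDP is stored as text such as "$123", it may be easier to convert it to a number first and keep the missing years as NA, which is what the question asks for. A minimal sketch, assuming every GDP value is a dollar sign followed by digits (out is a name introduced here for illustration):

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~Country,      ~Year, ~GDP,
  "Afghanistan", 1950L, "$123",
  "Afghanistan", 1951L, "$123",
  "Afghanistan", 2019L, "$123",
  "Australia",   1945L, "$123",
  "Australia",   2021L, "$123"
)

out <- df %>%
  # strip the literal "$" and convert to numeric (assumes no other symbols)
  mutate(GDP = as.numeric(sub("$", "", GDP, fixed = TRUE))) %>%
  # drop years outside the target window before completing
  filter(between(Year, 1948L, 2021L)) %>%
  # add one row per Country/Year combination; new rows get GDP = NA
  complete(Country, Year = 1948:2021) %>%
  arrange(Country, Year)

out
```

Filtering before complete matters: complete only adds rows, so the 1945 row would otherwise survive.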
I'm currently facing the following issue and would highly appreciate any help. My data frame looks like this
country_birth year migrants live_in gender
Albania 2000 1 Australia male
Germany 2000 2 Australia female
Albania 2008 3 Australia male
Albania 2000 6 Australia female
Germany 2004 2 Australia female
UK 2004 2 Germany female
US 2004 5 UK male
Now I would like to get the sum of migrants (both genders) for the same country of birth and the same live_in country for a matching year. The new data frame should look something like this
country_birth year total_migrants live_in
Albania 2000 7 Australia
... ... ... ...
Many thanks in advance!
You can try aggregate + subset like below
> aggregate(migrants ~ ., subset(df, select = -gender), sum)
country_birth year live_in migrants
1 Albania 2000 Australia 7
2 Germany 2000 Australia 2
3 Germany 2004 Australia 2
4 Albania 2008 Australia 3
5 UK 2004 Germany 2
6 US 2004 UK 5
where
subset drops the gender column
aggregate sums migrants, grouped by all the remaining columns.
library(tidyverse)
data %>%
count(country_birth, year, live_in, wt = migrants, name = "total_migrants")
# # A tibble: 6 x 4
# country_birth year live_in total_migrants
# <chr> <dbl> <chr> <dbl>
# 1 Albania 2000 Australia 7
# 2 Albania 2008 Australia 3
# 3 Germany 2000 Australia 2
# 4 Germany 2004 Australia 2
# 5 UK 2004 Germany 2
# 6 US 2004 UK 5
Here is the {dplyr} approach:
data %>%
group_by(country_birth, year, live_in) %>%
summarise(total_migrants = sum(migrants)) # sum the existing migrants column
You can learn more about grouped summaries by reading the dplyr documentation or at R for Data Science.
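For a version that runs on its own, here is a sketch that rebuilds the data frame from the question and applies the grouped sum. The names migr and totals are introduced only for this example.

```r
library(dplyr)

# Data from the question, typed in by hand
migr <- data.frame(
  country_birth = c("Albania", "Germany", "Albania", "Albania", "Germany", "UK", "US"),
  year          = c(2000, 2000, 2008, 2000, 2004, 2004, 2004),
  migrants      = c(1, 2, 3, 6, 2, 2, 5),
  live_in       = c("Australia", "Australia", "Australia", "Australia",
                    "Australia", "Germany", "UK"),
  gender        = c("male", "female", "male", "female", "female", "female", "male")
)

# Sum migrants over gender within each birth country / year / destination
totals <- migr %>%
  group_by(country_birth, year, live_in) %>%
  summarise(total_migrants = sum(migrants), .groups = "drop")

totals
```

Leaving gender out of group_by is what collapses the male and female rows into one total.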
I currently have a data frame that looks like this.
country2<-c("Afghanistan","Afghanistan","Afghanistan")
continent2<-c("Asia","Asia","Asia")
series<-c('lifeexp','pop','gdp')
y1901<-c('1','3','100')
y1902<-c('2','4','101')
y1903<-c('2','4','101')
y1904<-c('2','4','101')
y1905<-c('2','4','101')
y1906<-c('2','4','101')
y1907<-c('2','4','101')
df<-data.frame(country2,continent2,series,y1901,y1902,y1903,y1904,y1905,y1906,y1907)
country2 continent2 series y1901 y1902 y1903 y1904 y1905 y1906 y1907
1 Afghanistan Asia lifeexp 1 2 2 2 2 2 2
2 Afghanistan Asia pop 3 4 4 4 4 4 4
3 Afghanistan Asia gdp 100 101 101 101 101 101 101
How can I reshape this data so that it will look like this?
country<-c("Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan")
continent<-c("Asia","Asia","Asia","Asia","Asia","Asia","Asia")
year<-c("1901","1902","1903","1904","1905","1906","1907")
lifeexp<-c("1","2","2","2","2","2","2")
pop<-c('3','4','4','4','4','4','4')
gdp<-c('100','101','101','101','101','101','101')
df<-data.frame(country,continent,year,lifeexp,pop,gdp)
country continent year lifeexp pop gdp
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
I have tried using dcast from the reshape2 package to reshape the data, but I can only enter 1 column for value.var.
dcast(df,country+region~series,value.var ='y1901',fun.aggregate = sum)
I also tried using ftable and xtabs but I'm still not sure how to enter more than 1 column for the value. The code below gives an error.
ftable(xtabs(c(y2000,y2001)~country+region+series,df))
Thanks
A data.table approach using melt and dcast could be
library(data.table)
setDT(df)
dcast(melt(df,measure = patterns("^y\\d+")),country2 + continent2 + variable~series)
# country2 continent2 variable gdp lifeexp pop
#1: Afghanistan Asia y1901 100 1 3
#2: Afghanistan Asia y1902 101 2 4
#3: Afghanistan Asia y1903 101 2 4
#4: Afghanistan Asia y1904 101 2 4
#5: Afghanistan Asia y1905 101 2 4
#6: Afghanistan Asia y1906 101 2 4
#7: Afghanistan Asia y1907 101 2 4
I know that you are looking for a solution with ftable or dcast but just for your knowledge, you can achieve it using tidyr:
library(tidyverse)
df %>%
pivot_longer(., cols = starts_with("y190"), names_to = "year", values_to = "Value") %>%
pivot_wider(., names_from = "series", values_from = "Value") %>%
mutate(year = gsub("y","", year)) %>%
rename(country = country2, continent = continent2)
# A tibble: 7 x 6
country continent year lifeexp pop gdp
<fct> <fct> <chr> <fct> <fct> <fct>
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
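The printout above still carries the values as factors or strings. A variant sketch that converts types during the reshape, assuming a tidyr version that supports names_transform and values_transform (1.1.0 or later); the data frame is shortened to two year columns to keep the example small:

```r
library(tidyr)

# Shortened version of the question's data (two year columns only)
df <- data.frame(
  country2   = "Afghanistan",
  continent2 = "Asia",
  series     = c("lifeexp", "pop", "gdp"),
  y1901      = c("1", "3", "100"),
  y1902      = c("2", "4", "101"),
  stringsAsFactors = FALSE
)

wide <- df |>
  # stack the year columns, strip the "y" prefix, and convert types on the way
  pivot_longer(matches("^y\\d{4}$"), names_to = "year", names_prefix = "y",
               names_transform = list(year = as.integer),
               values_transform = list(value = as.numeric)) |>
  # spread the series back out into one column each
  pivot_wider(names_from = series, values_from = value)

wide
```

stringsAsFactors = FALSE is set explicitly because as.numeric on a factor would return the level codes rather than the values.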
I'm trying to merge two datasets by year and country. The first dataset (df = GNIPC) represents gross national income per capita for every country from 1980-2008.
Country Year GNIpc
(chr) (dbl) (dbl)
1 Afghanistan 1990 NA
2 Afghanistan 1991 NA
3 Afghanistan 1992 2010
4 Afghanistan 1993 NA
5 Afghanistan 1994 12550
6 Afghanistan 1995 NA
The second dataset (df = sanctions) represents the imposition of economic sanctions from 1946 to present day.
country imposition sanctiontype sanctions_period
(chr) (dbl) (chr) (chr)
1 Afghanistan 1 1 6 8 1997-2001
2 Afghanistan 1 7 1979-1979
3 Afghanistan 1 4 7 1995-2002
4 Albania 1 2 8 2005-2005
5 Albania 1 7 2005-2006
6 Albania 1 8 2004-2005
I would like to merge the two datasets so that for every GNI year I can see whether or not sanctions were in place in the country. For GNI years that fall outside the sanctions_period the value would be 0, and for those inside it the value would be 1. This is what I want it to look like:
Country Year GNIpc Imposition sanctiontype
(chr) (dbl) (dbl) (dbl) (chr)
1 Afghanistan 1990 NA 0 NA
2 Afghanistan 1991 NA 0 NA
3 Afghanistan 1992 2010 0 NA
4 Afghanistan 1993 NA 0 NA
5 Afghanistan 1994 12550 0 NA
6 Afghanistan 1995 NA 1 4 7
Some example data:
df1 <- data.frame(country = c('Afghanistan', 'Turkey'),
imposition = c(1, 0),
sanctiontype = c('1 6 8', '4'),
sanctions_period = c('1997-2001', '2012-ongoing')
)
country imposition sanctiontype sanctions_period
1 Afghanistan 1 1 6 8 1997-2001
2 Turkey 0 4 2012-ongoing
The "sanctions_period" column can be transformed with dplyr and tidyr:
library(tidyr)
library(dplyr)
df.new <- separate(df1, sanctions_period, c('start', 'end'), remove = FALSE) %>%
mutate(end = ifelse(end == 'ongoing', '2016', end)) %>%
mutate(start = as.numeric(start), end = as.numeric(end)) %>%
group_by(country, sanctions_period) %>%
do(data.frame(country = .$country, imposition = .$imposition, sanctiontype = .$sanctiontype, year = .$start:.$end))
sanctions_period country imposition sanctiontype year
<fctr> <fctr> <dbl> <fctr> <int>
1 1997-2001 Afghanistan 1 1 6 8 1997
2 1997-2001 Afghanistan 1 1 6 8 1998
3 1997-2001 Afghanistan 1 1 6 8 1999
4 1997-2001 Afghanistan 1 1 6 8 2000
5 1997-2001 Afghanistan 1 1 6 8 2001
6 2012-ongoing Turkey 0 4 2012
7 2012-ongoing Turkey 0 4 2013
8 2012-ongoing Turkey 0 4 2014
9 2012-ongoing Turkey 0 4 2015
10 2012-ongoing Turkey 0 4 2016
From there, it should be easy to merge with your first data frame. Note that your first data frame capitalizes Country and Year, while the second doesn't. Use all.x = TRUE to keep the years without sanctions (they get NA, which you can then set to 0).
df.merged <- merge(df.first, df.new, by.x = c('Country', 'Year'), by.y = c('country', 'year'), all.x = TRUE)
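Using dplyr, joining against the expanded data frame with one row per sanction year (called df.new in the answer above; the raw sanctions table has no year column to join on):
left_join(GNIPC, df.new, by=c("Country"="country", "Year"="year")) %>%
select(Country, Year, GNIpc, imposition, sanctiontype)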
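Putting the steps together, here is a self-contained sketch on toy data. The data frames gni, sanc, sanc_years and merged are hypothetical stand-ins, the GNIpc values are placeholders, and "ongoing" is mapped to 2016 as in the answer above. It expands each sanctions period into one row per year, joins, and sets imposition to 0 where no sanction matched.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the two data sets (values are placeholders)
gni <- data.frame(Country = "Afghanistan", Year = 1994:1998,
                  GNIpc = c(12550, NA, NA, NA, NA))
sanc <- data.frame(country = "Afghanistan", imposition = 1,
                   sanctiontype = "1 6 8", sanctions_period = "1997-2001")

# One row per country and sanctioned year
sanc_years <- sanc %>%
  separate(sanctions_period, c("start", "end"), sep = "-", remove = FALSE) %>%
  mutate(start = as.integer(start),
         end   = as.integer(ifelse(end == "ongoing", "2016", end)),
         year  = Map(seq, start, end)) %>%   # list column of year sequences
  unnest(year) %>%
  select(country, year, imposition, sanctiontype)

# Keep every GNI year; years without a sanction get imposition = 0
merged <- gni %>%
  left_join(sanc_years, by = c("Country" = "country", "Year" = "year")) %>%
  mutate(imposition = coalesce(imposition, 0))

merged
```

If a country has overlapping sanction periods, the join returns one row per matching period, so the result may need to be collapsed per country and year afterwards.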