I currently have a data frame that looks like this.
country2<-c("Afghanistan","Afghanistan","Afghanistan")
continent2<-c("Asia","Asia","Asia")
series<-c('lifeexp','pop','gdp')
y1901<-c('1','3','100')
y1902<-c('2','4','101')
y1903<-c('2','4','101')
y1904<-c('2','4','101')
y1905<-c('2','4','101')
y1906<-c('2','4','101')
y1907<-c('2','4','101')
df<-data.frame(country2,continent2,series,y1901,y1902,y1903,y1904,y1905,y1906,y1907)
country2 continent2 series y1901 y1902 y1903 y1904 y1905 y1906 y1907
1 Afghanistan Asia lifeexp 1 2 2 2 2 2 2
2 Afghanistan Asia pop 3 4 4 4 4 4 4
3 Afghanistan Asia gdp 100 101 101 101 101 101 101
How can I reshape this data so that it will look like this?
country<-c("Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan")
continent<-c("Asia","Asia","Asia","Asia","Asia","Asia","Asia")
year<-c("1901","1902","1903","1904","1905","1906","1907")
lifeexp<-c("1","2","2","2","2","2","2")
pop<-c('3','4','4','4','4','4','4')
gdp<-c('100','101','101','101','101','101','101')
df<-data.frame(country,continent,year,lifeexp,pop,gdp)
country continent year lifeexp pop gdp
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
I have tried using dcast from the reshape2 package to reshape the data, but I can only enter one column for value.var.
dcast(df,country+region~series,value.var ='y1901',fun.aggregate = sum)
I also tried using ftable and xtabs but I'm still not sure how to enter more than 1 column for the value. The code below gives an error.
ftable(xtabs(c(y2000,y2001)~country+region+series,df))
Thanks
A data.table approach using melt and dcast could be
library(data.table)
setDT(df)
dcast(melt(df,measure = patterns("^y\\d+")),country2 + continent2 + variable~series)
# country2 continent2 variable gdp lifeexp pop
#1: Afghanistan Asia y1901 100 1 3
#2: Afghanistan Asia y1902 101 2 4
#3: Afghanistan Asia y1903 101 2 4
#4: Afghanistan Asia y1904 101 2 4
#5: Afghanistan Asia y1905 101 2 4
#6: Afghanistan Asia y1906 101 2 4
#7: Afghanistan Asia y1907 101 2 4
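Since the question started from reshape2, here is a minimal sketch of the same two-step idea with reshape2 itself: melt the year columns into one value column first, so that dcast only needs a single value.var. It assumes reshape2 is installed and uses a shortened copy of the question's data with numeric values; the names long and wide are just illustrative.

```r
library(reshape2)

# Shortened copy of the question's data (three year columns only)
df <- data.frame(
  country2   = rep("Afghanistan", 3),
  continent2 = rep("Asia", 3),
  series     = c("lifeexp", "pop", "gdp"),
  y1901      = c(1, 3, 100),
  y1902      = c(2, 4, 101),
  y1903      = c(2, 4, 101)
)

# Step 1: stack all year columns into (year, value) pairs
long <- melt(df,
             id.vars       = c("country2", "continent2", "series"),
             variable.name = "year",
             value.name    = "value")

# Step 2: spread `series` into one column per indicator;
# there is now exactly one value column to hand to value.var
wide <- dcast(long, country2 + continent2 + year ~ series, value.var = "value")

# Optional: turn "y1901" into the integer 1901
wide$year <- as.integer(sub("^y", "", as.character(wide$year)))
wide
```

No fun.aggregate is needed here because each country/year/series combination occurs only once.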
I know that you are looking for a solution with ftable or dcast but just for your knowledge, you can achieve it using tidyr:
library(tidyverse)
df %>%
pivot_longer(., cols = starts_with("y190"), names_to = "year", values_to = "Value") %>%
pivot_wider(., names_from = "series", values_from = "Value") %>%
mutate(year = gsub("y","", year)) %>%
rename(country = country2, continent = continent2)
# A tibble: 7 x 6
country continent year lifeexp pop gdp
<fct> <fct> <chr> <fct> <fct> <fct>
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
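As a possible variation on the tidyr answer, the "y" prefix can be stripped and the year converted while pivoting, which avoids the separate gsub step. This is only a sketch on a shortened copy of the data; it assumes tidyr 1.1.0 or later (for names_transform) and that every year column looks like y followed by four digits.

```r
library(tidyr)

# Shortened copy of the question's data, with numeric values
df <- data.frame(
  country2   = rep("Afghanistan", 3),
  continent2 = rep("Asia", 3),
  series     = c("lifeexp", "pop", "gdp"),
  y1901      = c(1, 3, 100),
  y1902      = c(2, 4, 101)
)

# Long format: one row per country, series and year
long <- pivot_longer(
  df,
  cols            = matches("^y[0-9]{4}$"),    # only the year columns
  names_to        = "year",
  names_prefix    = "y",                       # drop the leading "y"
  names_transform = list(year = as.integer),   # "1901" -> 1901L
  values_to       = "value"
)

# Wide again, but now with one column per series
wide <- pivot_wider(long, names_from = series, values_from = value)
wide
```

Storing the values as numbers rather than strings also keeps lifeexp, pop and gdp from ending up as factor or character columns.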
I am using the dataset who (available in the tidyr package), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 columns hold the year, country name and similar identifiers.
I want to calculate the sum of new cases per country per year, but I just can't make it work.
I was expecting something like
group_by(who, country) %>% summarise(count = rowsum(.[5:60]))
would work, but it doesn't.
Can anyone help me understand why it doesn't work, and what to do instead?
You're missing a first step, which is to gather the data into a 'tidy' format. Try this:
who%>%
gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
filter(!is.na(cases))%>%
group_by(country,year)%>%
summarise(sum(cases))
Which gives output:
# A tibble: 3,484 × 3
# Groups: country [219]
country year `sum(cases)`
<chr> <int> <int>
1 Afghanistan 1997 128
2 Afghanistan 1998 1778
3 Afghanistan 1999 745
4 Afghanistan 2000 2666
5 Afghanistan 2001 4639
library(tidyverse)
(long_who <- who |> pivot_longer(cols = -(1:4)))
long_who |> filter(startsWith(name,"new")) |> # don't want things like "Population"
group_by(country, year) |> # one total per country per year, as asked
summarise(sum_of_new_ = sum(value,na.rm=TRUE))
A base R approach
data.frame(who[,c("country", "year")],
cnt = rowSums(who[5:60], na.rm = TRUE))
#> country year cnt
#> 1 Afghanistan 1980 0
#> 2 Afghanistan 1981 0
#> 3 Afghanistan 1982 0
#> 4 Afghanistan 1983 0
#> 5 Afghanistan 1984 0
#> 6 Afghanistan 1985 0
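On the "why doesn't it work" part of the question: one likely source of confusion is that base R has both rowSums() and rowsum(), and they do different things. The sketch below uses a made-up toy data frame (the column names new_x and new_y are hypothetical stand-ins for the 56 count columns) to show the difference.

```r
# Toy stand-in for `who`: two id columns and two count columns
toy <- data.frame(
  country = c("A", "A", "B"),
  year    = c(2000, 2001, 2000),
  new_x   = c(1, NA, 3),
  new_y   = c(10, 20, NA)
)
counts <- toy[, c("new_x", "new_y")]

# rowSums(): adds across the columns, giving one total per row.
# With one row per country and year, this is the total that was asked for.
per_row <- rowSums(counts, na.rm = TRUE)

# rowsum(): adds rows that share a group label, keeping the columns separate.
per_group <- rowsum(counts, group = toy$country, na.rm = TRUE)

toy$total <- per_row
toy[, c("country", "year", "total")]
per_group
```

So rowsum() collapses rows by group but never adds new_x to new_y, whereas rowSums() adds across columns and needs no grouping at all when each row is already one country-year.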
You could also do without the long format by using rowSums and across:
library(dplyr)
who |>
group_by(country, year) |>
summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
ungroup()
Alternatives to across(5:58):
across(starts_with("new"))
across(-(1:4))
Output:
# A tibble: 20 × 3
# Groups: country [1]
country year count
<chr> <int> <dbl>
1 Afghanistan 1980 0
2 Afghanistan 1981 0
3 Afghanistan 1982 0
4 Afghanistan 1983 0
5 Afghanistan 1984 0
6 Afghanistan 1985 0
7 Afghanistan 1986 0
8 Afghanistan 1987 0
9 Afghanistan 1988 0
10 Afghanistan 1989 0
11 Afghanistan 1990 0
12 Afghanistan 1991 0
13 Afghanistan 1992 0
14 Afghanistan 1993 0
15 Afghanistan 1994 0
16 Afghanistan 1995 0
17 Afghanistan 1996 0
18 Afghanistan 1997 128
19 Afghanistan 1998 1778
20 Afghanistan 1999 745
I'm currently facing the following issue and would highly appreciate any help. My data frame looks like this
country_birth year migrants live_in gender
Albania 2000 1 Australia male
Germany 2000 2 Australia female
Albania 2008 3 Australia male
Albania 2000 6 Australia female
Germany 2004 2 Australia female
UK 2004 2 Germany female
US 2004 5 UK male
Now I would like to get the sum of migrants (both genders) for the same country of birth and the same live_in country in a matching year. The new data frame should look something like this
country_birth year total_migrants live_in
Albania 2000 7 Australia
... ... ... ...
Many thanks in advance!
You can try aggregate + subset like below
> aggregate(migrants ~ ., subset(df, select = -gender), sum)
country_birth year live_in migrants
1 Albania 2000 Australia 7
2 Germany 2000 Australia 2
3 Germany 2004 Australia 2
4 Albania 2008 Australia 3
5 UK 2004 Germany 2
6 US 2004 UK 5
where
subset drops the gender column
aggregate sums migrants, grouped by all remaining columns.
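If the "." shorthand feels too implicit, the grouping columns can also be spelled out in the formula. A minimal base R sketch, rebuilding the question's data by hand (the object names df and totals are only for illustration):

```r
# The question's data, typed in by hand
df <- data.frame(
  country_birth = c("Albania", "Germany", "Albania", "Albania",
                    "Germany", "UK", "US"),
  year          = c(2000, 2000, 2008, 2000, 2004, 2004, 2004),
  migrants      = c(1, 2, 3, 6, 2, 2, 5),
  live_in       = c("Australia", "Australia", "Australia", "Australia",
                    "Australia", "Germany", "UK"),
  gender        = c("male", "female", "male", "female",
                    "female", "female", "male")
)

# Sum migrants within each country_birth / year / live_in combination;
# gender is simply left out of the formula, so both genders are pooled
totals <- aggregate(migrants ~ country_birth + year + live_in,
                    data = df, FUN = sum)
names(totals)[names(totals) == "migrants"] <- "total_migrants"
totals
```

Naming the columns explicitly means the result does not change if more columns are added to the data frame later.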
library(tidyverse)
data %>%
count(country_birth, year, live_in, wt = migrants, name = "total_migrants")
# # A tibble: 6 x 4
# country_birth year live_in total_migrants
# <chr> <dbl> <chr> <dbl>
# 1 Albania 2000 Australia 7
# 2 Albania 2008 Australia 3
# 3 Germany 2000 Australia 2
# 4 Germany 2004 Australia 2
# 5 UK 2004 Germany 2
# 6 US 2004 UK 5
Here is the {dplyr} approach:
data %>%
group_by(country_birth, year, live_in) %>%
summarise(total_migrants = sum(migrants))
You can learn more about grouped summaries by reading the dplyr documentation or at R for Data Science.
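For completeness, a self-contained sketch of the grouped summary on the question's data. The .groups = "drop" argument is an optional extra (available since dplyr 1.0.0) that returns an ungrouped tibble and silences the regrouping message; the object name migration is just illustrative.

```r
library(dplyr)

# The question's data, typed in by hand
migration <- tibble(
  country_birth = c("Albania", "Germany", "Albania", "Albania",
                    "Germany", "UK", "US"),
  year          = c(2000, 2000, 2008, 2000, 2004, 2004, 2004),
  migrants      = c(1, 2, 3, 6, 2, 2, 5),
  live_in       = c("Australia", "Australia", "Australia", "Australia",
                    "Australia", "Germany", "UK"),
  gender        = c("male", "female", "male", "female",
                    "female", "female", "male")
)

totals <- migration %>%
  group_by(country_birth, year, live_in) %>%
  summarise(total_migrants = sum(migrants), .groups = "drop")
totals
```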
I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate total alive male and female population for every year in a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
A person is alive in year x if death_year > x & birth_year <= x
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>% mutate(year = map2(1950,1980, seq)) %>% unnest(year) %>%
mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1, TRUE ~ 0)) %>%
group_by(year, gender) %>% summarise(alive = sum(isalive)) %>%
pivot_wider(names_from = gender, values_from = alive) %>% print( n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector will get you your count of alive or dead because TRUE is 1 and FALSE is 0.
number_alive <- function(range, df){
sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}
output <- data.frame('year' = 1950:1980,
'female' = number_alive(1950:1980, df[df$gender == 'female', ]),
'male' = number_alive(1950:1980, df[df$gender == 'male', ]))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
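A possible loop-free variant of the same idea, sketched with outer(): build a year-by-person logical matrix once and sum its rows per gender. It uses the question's definition (birth_year <= x and death_year > x) and the question's nine rows; the name alive is only illustrative.

```r
df <- read.table(text = "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female", header = TRUE)

years <- 1950:1980

# alive[i, j] is TRUE when person j is alive in years[i]
alive <- outer(years, df$birth_year, ">=") & outer(years, df$death_year, "<")

# Count TRUEs per year, separately for each gender's columns
output <- data.frame(
  year   = years,
  male   = rowSums(alive[, df$gender == "male",   drop = FALSE]),
  female = rowSums(alive[, df$gender == "female", drop = FALSE])
)
head(output)
```

The matrix has one cell per year and person, so this trades memory for speed and is probably only sensible while years times people stays manageable.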
This approach uses an ifelse to determine if alive (1) or dead (0).
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year < death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 2 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)
Within RStudio, I have this code:
library(ggplot2)
library(dplyr)
The data is gapminder_data.csv:
str(gapminder_data.csv)
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
When I execute the following code:
gapminder_data.csv %>%
group_by(country) %>%
summarize(min(gdpPercap), max(gdpPercap))
it works:
# A tibble: 142 x 3
country `min(gdpPercap)` `max(gdpPercap)`
<fct> <dbl> <dbl>
1 Afghanistan 635. 978.
2 Albania 1601. 5937.
3 Algeria 2449. 6223.
4 Angola 2277. 5523.
5 Argentina 5911. 12779.
6 Australia 10040. 34435.
7 Austria 6137. 36126.
8 Bahrain 9867. 29796.
9 Bangladesh 630. 1391.
10 Belgium 8343. 33693.
But the corresponding years for these values are missing:
min(gdpPercap) max(gdpPercap)
How can I solve it?
Thanks for your help.
Does this give you what you need?
mins <- gapminder_data.csv %>%
arrange(gdpPercap) %>%
group_by(country) %>%
slice(1) %>%
ungroup()
maxs <- gapminder_data.csv %>%
arrange(desc(gdpPercap)) %>%
group_by(country) %>%
slice(1) %>%
ungroup()
left_join(
select(mins, country, minyear=year, mingdp=gdpPercap),
select(maxs, country, maxyear=year, maxgdp=gdpPercap),
by = "country")
# # A tibble: 142 x 5
# country minyear mingdp maxyear maxgdp
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1997 635. 1982 978.
# 2 Albania 1952 1601. 2007 5937.
# 3 Algeria 1952 2449. 2007 6223.
# 4 Angola 1997 2277. 1967 5523.
# 5 Argentina 1952 5911. 2007 12779.
# 6 Australia 1952 10040. 2007 34435.
# 7 Austria 1952 6137. 2007 36126.
# 8 Bahrain 1952 9867. 2007 29796.
# 9 Bangladesh 1972 630. 2007 1391.
# 10 Belgium 1952 8343. 2007 33693.
# # ... with 132 more rows
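A shorter route to the same table, sketched here on a small toy tibble, is to index year by which.min() and which.max() inside a single summarise call. The Afghanistan figures are the rounded values from the str() output in the question; the Albania rows are toy values added only so there is more than one group.

```r
library(dplyr)

gdp <- tibble(
  country   = c(rep("Afghanistan", 4), rep("Albania", 3)),
  year      = c(1952, 1957, 1962, 1967, 1952, 1957, 1962),
  gdpPercap = c(779, 821, 853, 836, 1601, 1942, 2313)
)

extremes <- gdp %>%
  group_by(country) %>%
  summarise(
    min_gdp  = min(gdpPercap),
    min_year = year[which.min(gdpPercap)],  # year of the first minimum
    max_gdp  = max(gdpPercap),
    max_year = year[which.max(gdpPercap)],  # year of the first maximum
    .groups  = "drop"
  )
extremes
```

which.min() and which.max() return only the first matching position, so if a country reaches its extreme in several years, only the earliest row in the data is reported.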
We can do this pretty easily with a pivot. Since you didn't post a structure we could copy and paste (always helpful!), I've made a small sample tibble, but it should work on your larger set. After grouping by country, add a column that marks the max and min rows. Rows that are neither get NA in that column, so drop those with drop_na (from tidyr), and finally spread the values to make a wide tibble with the max and min for each country. Generally, it's best to work with tidy (long-form) tibbles in R, which is what the data is before the pivot, but you can easily get back there with pivot_longer if need be.
tibble(
country = rep("Afghanistan",4),
year = seq(from = 1952, to = 1955),
gdpPercap = c(779, 821, 853, 836)
) %>%
group_by(country) %>%
mutate(
type = case_when(
gdpPercap == max(gdpPercap) ~ "max",
gdpPercap == min(gdpPercap) ~ "min"
)
) %>%
drop_na() %>%
pivot_wider(
id_cols = country,
names_from = type,
values_from = c(year, gdpPercap)
)
which produces:
# A tibble: 1 x 5
# Groups: country [1]
country year_min year_max gdpPercap_min gdpPercap_max
<chr> <int> <int> <dbl> <dbl>
1 Afghanistan 1952 1954 779 853
I'm trying to merge two datasets by year and country. The first dataset (df = GNIPC) represents gross national income per capita for every country from 1980 to 2008.
Country Year GNIpc
(chr) (dbl) (dbl)
1 Afghanistan 1990 NA
2 Afghanistan 1991 NA
3 Afghanistan 1992 2010
4 Afghanistan 1993 NA
5 Afghanistan 1994 12550
6 Afghanistan 1995 NA
The second dataset (df = sanctions) represents the imposition of economic sanctions from 1946 to present day.
country imposition sanctiontype sanctions_period
(chr) (dbl) (chr) (chr)
1 Afghanistan 1 1 6 8 1997-2001
2 Afghanistan 1 7 1979-1979
3 Afghanistan 1 4 7 1995-2002
4 Albania 1 2 8 2005-2005
5 Albania 1 7 2005-2006
6 Albania 1 8 2004-2005
I would like to merge the two datasets so that for every GNI year I can see whether sanctions were present in the country or not. For the GNI years that are not in the sanctions_period the value would be 0, and for those that are it would be 1. This is what I want it to look like:
Country Year GNIpc Imposition sanctiontype
(chr) (dbl) (dbl) (dbl) (chr)
1 Afghanistan 1990 NA 0 NA
2 Afghanistan 1991 NA 0 NA
3 Afghanistan 1992 2010 0 NA
4 Afghanistan 1993 NA 0 NA
5 Afghanistan 1994 12550 0 NA
6 Afghanistan 1995 NA 1 4 7
Some example data:
df1 <- data.frame(country = c('Afghanistan', 'Turkey'),
imposition = c(1, 0),
sanctiontype = c('1 6 8', '4'),
sanctions_period = c('1997-2001', '2012-ongoing')
)
country imposition sanctiontype sanctions_period
1 Afghanistan 1 1 6 8 1997-2001
2 Turkey 0 4 2012-ongoing
The "sanctions_period" column can be transformed with dplyr and tidyr:
library(tidyr)
library(dplyr)
df.new <- separate(df1, sanctions_period, c('start', 'end'), remove = F) %>%
mutate(end = ifelse(end == 'ongoing', '2016', end)) %>%
mutate(start = as.numeric(start), end = as.numeric(end)) %>%
group_by(country, sanctions_period) %>%
do(data.frame(country = .$country, imposition = .$imposition, sanctiontype = .$sanctiontype, year = .$start:.$end))
sanctions_period country imposition sanctiontype year
<fctr> <fctr> <dbl> <fctr> <int>
1 1997-2001 Afghanistan 1 1 6 8 1997
2 1997-2001 Afghanistan 1 1 6 8 1998
3 1997-2001 Afghanistan 1 1 6 8 1999
4 1997-2001 Afghanistan 1 1 6 8 2000
5 1997-2001 Afghanistan 1 1 6 8 2001
6 2012-ongoing Turkey 0 4 2012
7 2012-ongoing Turkey 0 4 2013
8 2012-ongoing Turkey 0 4 2014
9 2012-ongoing Turkey 0 4 2015
10 2012-ongoing Turkey 0 4 2016
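The same expansion can be sketched in base R without do(), which dplyr has since superseded. Treating "ongoing" as 2016 follows the answer above and is an assumption, not something the data states; the names sanctions and expanded are only illustrative.

```r
sanctions <- data.frame(
  country          = c("Afghanistan", "Turkey"),
  imposition       = c(1, 0),
  sanctiontype     = c("1 6 8", "4"),
  sanctions_period = c("1997-2001", "2012-ongoing")
)

# Split "1997-2001" into its start and end parts
parts   <- strsplit(sanctions$sanctions_period, "-", fixed = TRUE)
start   <- as.integer(vapply(parts, `[`, character(1), 1))
end_raw <- vapply(parts, `[`, character(1), 2)

# Assumption carried over from the answer: "ongoing" means up to 2016
end <- as.integer(ifelse(end_raw == "ongoing", "2016", end_raw))

# One vector of years per sanction, then repeat each row once per year
years <- Map(seq, start, end)
n     <- lengths(years)
expanded <- data.frame(
  country      = rep(sanctions$country, n),
  imposition   = rep(sanctions$imposition, n),
  sanctiontype = rep(sanctions$sanctiontype, n),
  year         = unlist(years)
)
expanded
```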
From there, it should be easy to merge with your first data frame. Note that your first data frame capitalizes Country and Year, while the second doesn't. Passing all.x = TRUE keeps the years that have no sanctions, which merge would otherwise drop.
df.merged <- merge(df.first, df.new, by.x = c('Country', 'Year'), by.y = c('country', 'year'), all.x = TRUE)
Using dplyr, once the sanction periods have been expanded to one row per year (df.new above):
left_join(GNIPC, df.new, by = c("Country" = "country", "Year" = "year")) %>%
select(Country, Year, GNIpc, imposition, sanctiontype)
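Either join leaves NA in imposition for years without sanctions, whereas the question asks for 0 there. A minimal base R sketch of that last step, using the six GNI rows from the question and only the 1995-2002 sanction (type "4 7"), already expanded to one row per year; the object names gni and sanction_years are hypothetical.

```r
# GNI rows as shown in the question
gni <- data.frame(
  Country = "Afghanistan",
  Year    = 1990:1995,
  GNIpc   = c(NA, NA, 2010, NA, 12550, NA)
)

# One sanction from the question, expanded to one row per year
sanction_years <- data.frame(
  country      = "Afghanistan",
  year         = 1995:2002,
  imposition   = 1,
  sanctiontype = "4 7"
)

# Keep every GNI row (all.x = TRUE); unmatched years get NA
merged <- merge(gni, sanction_years,
                by.x = c("Country", "Year"),
                by.y = c("country", "year"),
                all.x = TRUE)

# Years with no sanction in force: imposition should be 0, not NA
merged$imposition[is.na(merged$imposition)] <- 0
merged
```

If several sanction periods overlap in the same year, the join will return one row per sanction for that country-year, so the sanction types may need collapsing first.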