Date missing in table after summarizing different column - r

Within RStudio, I have this code:
library(ggplot2)
library(dplyr)
The data is gapminder_data.csv:
str(gapminder_data.csv)
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
When I execute the following code:
gapminder_data.csv %>%
group_by(country) %>%
summarize(min(gdpPercap), max(gdpPercap))
it works:
# A tibble: 142 x 3
country `min(gdpPercap)` `max(gdpPercap)`
<fct> <dbl> <dbl>
1 Afghanistan 635. 978.
2 Albania 1601. 5937.
3 Algeria 2449. 6223.
4 Angola 2277. 5523.
5 Argentina 5911. 12779.
6 Australia 10040. 34435.
7 Austria 6137. 36126.
8 Bahrain 9867. 29796.
9 Bangladesh 630. 1391.
10 Belgium 8343. 33693.
But the years corresponding to the min(gdpPercap) and max(gdpPercap) values are missing.
How can I add them?
Thanks for your help.

Does this give you what you need?
mins <- gapminder_data.csv %>%
arrange(gdpPercap) %>%
group_by(country) %>%
slice(1) %>%
ungroup()
maxs <- gapminder_data.csv %>%
arrange(desc(gdpPercap)) %>%
group_by(country) %>%
slice(1) %>%
ungroup()
left_join(
select(mins, country, minyear=year, mingdp=gdpPercap),
select(maxs, country, maxyear=year, maxgdp=gdpPercap),
by = "country")
# # A tibble: 142 x 5
# country minyear mingdp maxyear maxgdp
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1997 635. 1982 978.
# 2 Albania 1952 1601. 2007 5937.
# 3 Algeria 1952 2449. 2007 6223.
# 4 Angola 1997 2277. 1967 5523.
# 5 Argentina 1952 5911. 2007 12779.
# 6 Australia 1952 10040. 2007 34435.
# 7 Austria 1952 6137. 2007 36126.
# 8 Bahrain 1952 9867. 2007 29796.
# 9 Bangladesh 1972 630. 2007 1391.
# 10 Belgium 1952 8343. 2007 33693.
# # ... with 132 more rows
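A more compact variant of the same idea, as a minimal sketch: inside summarise(), which.min() and which.max() give the position of the extreme value, which can be used to index year. The toy tibble below is an assumption for illustration (the Afghanistan values are the first four from the str() output in the question; country "B" is made up). If a country has ties, only the first matching year is returned.

```r
library(dplyr)

# Toy data: Afghanistan values are taken from the question's str() output;
# country "B" is made up for illustration.
gdp <- tibble(
  country   = rep(c("Afghanistan", "B"), each = 4),
  year      = rep(c(1952, 1957, 1962, 1967), times = 2),
  gdpPercap = c(779, 821, 853, 836, 500, 450, 470, 620)
)

extremes <- gdp %>%
  group_by(country) %>%
  summarise(
    min_gdp  = min(gdpPercap),
    min_year = year[which.min(gdpPercap)],  # year of the first minimum
    max_gdp  = max(gdpPercap),
    max_year = year[which.max(gdpPercap)]   # year of the first maximum
  )

print(extremes)
```

This keeps everything in one pipeline and avoids the join, at the cost of silently picking the first year when the minimum or maximum occurs more than once.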

We can do this fairly easily with a pivot. Since you didn't post a structure we could copy and paste (always helpful!), I've made a small sample tibble, but the same code should work on your full data set. After grouping by country, add a column that labels the max and min rows. Rows that are neither get NA from case_when, so drop_na() removes them; pivot_wider then spreads the values into a wide tibble with the max and min for each country. It is generally best to work with tidy (long-form) tibbles in R, which is what you have before the pivot, but you can get back to that form with pivot_longer if need be.
library(tidyverse)
tibble(
country = rep("Afghanistan",4),
year = seq(from = 1952, to = 1955),
gdpPercap = c(779, 821, 853, 836)
) %>%
group_by(country) %>%
mutate(
type = case_when(
gdpPercap == max(gdpPercap) ~ "max",
gdpPercap == min(gdpPercap) ~ "min"
)
) %>%
drop_na() %>%
pivot_wider(
id_cols = country,
names_from = type,
values_from = c(year, gdpPercap)
)
which produces:
# A tibble: 1 x 5
# Groups: country [1]
country year_min year_max gdpPercap_min gdpPercap_max
<chr> <int> <int> <dbl> <dbl>
1 Afghanistan 1952 1954 779 853
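For the way back to long form mentioned above, here is a minimal sketch. It assumes the wide column names follow the `<variable>_<type>` pattern produced by pivot_wider in this answer; the data frame `wide` is just the one-row result retyped by hand. The special ".value" entry in names_to tells pivot_longer to turn the first part of each name into its own column.

```r
library(tidyr)

# The wide result from above, retyped by hand for illustration.
wide <- data.frame(
  country       = "Afghanistan",
  year_min      = 1952L,
  year_max      = 1954L,
  gdpPercap_min = 779,
  gdpPercap_max = 853
)

# "year_min" is split at "_" into a column name ("year") and a type ("min").
long <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = c(".value", "type"),
  names_sep = "_"
)

print(long)
```

The result has one row per country and type, with year and gdpPercap as ordinary columns again.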

Related

Calculating rowsums of grouped data

I am using the dataset who (available in the tidyr package), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 columns hold the year, country name and similar identifiers.
I want to calculate the sum of new cases per country per year, but I just can't make it work.
I was expecting that something like
group_by(who, country) %>% summarise(count = rowsum(.[5:60]))
would work, but it doesn't.
Can anyone help me understand why it doesn't work, and what to do instead?
You're missing a first step, which is to gather the data into a 'tidy' format. Try this:
who%>%
gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
filter(!is.na(cases))%>%
group_by(country,year)%>%
summarise(sum(cases))
Which gives output:
# A tibble: 3,484 × 3
# Groups: country [219]
country year `sum(cases)`
<chr> <int> <int>
1 Afghanistan 1997 128
2 Afghanistan 1998 1778
3 Afghanistan 1999 745
4 Afghanistan 2000 2666
5 Afghanistan 2001 4639
library(tidyverse)
(long_who <- who |> pivot_longer(cols = -(1:4)))
long_who |> filter(startsWith(name,"new")) |> # dont want things like "Population"
group_by(country) |>
summarise(sum_of_new_ = sum(value,na.rm=TRUE))
A base R approach:
data.frame(who[,c("country", "year")],
cnt = rowSums(who[5:60], na.rm = TRUE))
#> country year cnt
#> 1 Afghanistan 1980 0
#> 2 Afghanistan 1981 0
#> 3 Afghanistan 1982 0
#> 4 Afghanistan 1983 0
#> 5 Afghanistan 1984 0
#> 6 Afghanistan 1985 0
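On the "why" part of the question, one likely source of confusion is that base R has two similarly named functions: rowSums() adds across the columns of each row, while rowsum() adds rows together within groups and needs a group argument. A minimal sketch on a made-up miniature of the who layout (the data frame `mini` and its column names are assumptions for illustration):

```r
# Hypothetical miniature of the `who` layout: two id columns, two count columns.
mini <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1999, 2000, 1999),
  new_x   = c(1, 2, 3),
  new_y   = c(10, NA, 30)
)
counts <- mini[, c("new_x", "new_y")]

# rowSums(): one total per row, i.e. per country-year here.
per_row <- rowSums(counts, na.rm = TRUE)

# rowsum(): sums rows within each group and keeps the columns separate.
per_group <- rowsum(counts, group = mini$country, na.rm = TRUE)

print(per_row)
print(per_group)
```

Since the data already has one row per country and year, rowSums() is the function that matches the question.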
You could also do without the long format by using rowSums and across:
library(dplyr)
who |>
group_by(country, year) |>
summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
ungroup()
Alternatives to across(5:58):
across(starts_with("new"))
across(-(1:4))
Output:
# A tibble: 20 × 3
# Groups: country [1]
country year count
<chr> <int> <dbl>
1 Afghanistan 1980 0
2 Afghanistan 1981 0
3 Afghanistan 1982 0
4 Afghanistan 1983 0
5 Afghanistan 1984 0
6 Afghanistan 1985 0
7 Afghanistan 1986 0
8 Afghanistan 1987 0
9 Afghanistan 1988 0
10 Afghanistan 1989 0
11 Afghanistan 1990 0
12 Afghanistan 1991 0
13 Afghanistan 1992 0
14 Afghanistan 1993 0
15 Afghanistan 1994 0
16 Afghanistan 1995 0
17 Afghanistan 1996 0
18 Afghanistan 1997 128
19 Afghanistan 1998 1778
20 Afghanistan 1999 745

Changing Duplicate Values Within Subjects: R

My data looks like this:
Country GDP Year
A       10  1972
A       15  1973
A       20  1973
A       18  1975
B       25  1950
B       30  1951
B       35  1951
B       36  1953
I have many observations like the ones above. Some years are duplicated within a country, and I want to fix them by changing the year of the first duplicated row only (moving it to the following year). I want my data to look like this:
Country GDP Year
A       10  1972
A       20  1973
A       15  1974
A       18  1975
B       25  1950
B       35  1951
B       30  1952
B       36  1953
Thank you for your time!
Here is one possible option with tidyverse:
library(tidyverse)
df %>%
group_by(Country, Year) %>%
mutate(dup = case_when(n() == 1 ~ FALSE,
min(GDP) == GDP ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(Year = ifelse(dup == TRUE, Year + 1, Year)) %>%
arrange(Country, Year) %>%
ungroup %>%
select(-dup)
Output
Country GDP Year
<chr> <int> <dbl>
1 A 10 1972
2 A 20 1973
3 A 15 1974
4 A 18 1975
5 B 25 1950
6 B 35 1951
7 B 30 1952
8 B 36 1953
How about this ?
library(dplyr)
df %>%
arrange(Country, Year) %>%
group_by(Country) %>%
mutate(Year = min(Year) + row_number() - 1) %>%
ungroup
# Country GDP Year
# <chr> <int> <dbl>
#1 A 10 1972
#2 A 15 1973
#3 A 20 1974
#4 A 18 1975
#5 B 25 1950
#6 B 30 1951
#7 B 35 1952
#8 B 36 1953
This increments every Year by 1 starting from minimum value in each Country.
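One caveat, shown as a small sketch: renumbering from the minimum year only reproduces the original years when they are consecutive within each country. The tibble below is hypothetical (country A from the question, but with a gap before 1976). The second pipeline is an alternative that shifts only the first row of each duplicated Country/Year pair; it still assumes the following year is free.

```r
library(dplyr)

# Hypothetical data: like country A in the question, but with a gap before 1976.
df <- tibble(
  Country = "A",
  GDP     = c(10, 15, 20, 18),
  Year    = c(1972, 1973, 1973, 1976)
)

# Renumbering from the minimum year also rewrites 1976 to 1975.
seq_fix <- df %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  mutate(Year = min(Year) + row_number() - 1) %>%
  ungroup()

# Shift only the first row of each duplicated Country/Year pair.
dup_fix <- df %>%
  group_by(Country, Year) %>%
  mutate(shift = n() > 1 & row_number() == 1) %>%
  ungroup() %>%
  mutate(Year = if_else(shift, Year + 1, Year)) %>%
  arrange(Country, Year) %>%
  select(-shift)

print(seq_fix)
print(dup_fix)
```

On data without gaps, such as the sample in the question, both approaches produce the same set of years.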

Restructuring a time series dataframe in R

I am trying to download information from the JHU on pandemic infections. I am interested in getting the number of daily reported cases per country.
To start with, from the original database, I can find a df with this structure:
`Country/Region` Lat Long `1/22/20` `1/23/20` `1/24/20`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 33.9 67.7 0 0 0
2 Albania 41.2 20.2 0 0 0
3 Algeria 28.0 1.66 0 0 0
4 Andorra 42.5 1.52 0 0 0
5 Angola -11.2 17.9 0 0 0
6 Antigua and Bar~ 17.1 -61.8 0 0 0
# ... with 592 more variables: 1/25/20 <dbl>, 1/26/20 <dbl>, 1/27/20 <dbl>,...
But I would like to get something similar to this:
head(example)
Country/Region Date Cases
1 Afghanistan 2020-01-22 2
2 Afghanistan 2020-01-23 3
3 Afghanistan 2020-01-24 4
.
.
.
100 Albania 2020-01-22 0
101 Albania 2020-01-23 1
102 Albania 2020-01-24 0
#and so on with the rest of the countries
Any idea on how to do so in RStudio?
[Update]
I tried this code, suggested by @akrun:
library(dplyr)
library(tidyr)
library(lubridate)
result <- originaldf %>%
select(-c(Lat, Long)) %>%
pivot_longer(cols = -`Country/Region`, names_to = 'Date',
values_to = 'Cases') %>%
group_by(`Country/Region`, Date = mdy(Date)) %>%
summarise(Cases = sum(Cases, na.rm = TRUE), .groups = 'drop')
The result is as follows:
`Country/Region` Date Cases
1 Afghanistan 2020-03-15 1 #this row is fictitious
2 Afghanistan 2020-03-16 18
3 Afghanistan 2020-03-17 20
4 Afghanistan 2020-03-18 24
5 Afghanistan 2020-03-19 25
6 Afghanistan 2020-03-20 29
However, the original dataset is cumulative, so this result accumulates the cases day by day. The result I am after should contain the daily counts, like this:
`Country/Region` Date Cases
1 Afghanistan 2020-03-15 1
2 Afghanistan 2020-03-16 17
3 Afghanistan 2020-03-17 2
4 Afghanistan 2020-03-18 4
5 Afghanistan 2020-03-19 1
6 Afghanistan 2020-03-20 4
We could reshape to 'long' format with pivot_longer, sum the cumulative cases by country and date, and then take the day-to-day differences within each country
library(dplyr)
library(tidyr)
library(lubridate)
out <- df1 %>%
select(-c(Lat, Long, `Province/State`)) %>%
pivot_longer(cols = -c(`Country/Region`), names_to = 'Date',
values_to = 'Cases') %>%
mutate(Date = mdy(Date)) %>%
# sum provinces first, so that diff() runs on one total per country and date
group_by(`Country/Region`, Date) %>%
summarise(Cases = sum(Cases, na.rm = TRUE), .groups = 'drop_last') %>%
# cumulative totals to daily counts; the first day is set to 0
mutate(Count = c(0, diff(Cases))) %>%
ungroup() %>%
select(-Cases)
Output:
> tail(out)
# A tibble: 6 x 3
`Country/Region` Date Count
<chr> <date> <dbl>
1 Zimbabwe 2021-09-02 158
2 Zimbabwe 2021-09-03 213
3 Zimbabwe 2021-09-04 94
4 Zimbabwe 2021-09-05 125
5 Zimbabwe 2021-09-06 121
6 Zimbabwe 2021-09-07 125
> head(out)
# A tibble: 6 x 3
`Country/Region` Date Count
<chr> <date> <dbl>
1 Afghanistan 2020-01-22 0
2 Afghanistan 2020-01-23 0
3 Afghanistan 2020-01-24 0
4 Afghanistan 2020-01-25 0
5 Afghanistan 2020-01-26 0
6 Afghanistan 2020-01-27 0
Data:
df1 <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", check.names = FALSE)
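One detail worth checking on a small example: the JHU file has several `Province/State` rows for some countries, so the cumulative counts need to be summed per country and date before differencing. A minimal sketch with made-up numbers (the tibble `wide` is hypothetical, and base as.Date() stands in for lubridate's mdy()). Unlike c(0, diff(...)), lag() with default = 0 counts the first day's cumulative total as new cases; which convention is right depends on the data.

```r
library(dplyr)
library(tidyr)

# Hypothetical cumulative counts: country X has two province rows, Y has one.
wide <- tibble(
  `Country/Region` = c("X", "X", "Y"),
  `1/22/20` = c(1, 0, 2),
  `1/23/20` = c(3, 1, 2),
  `1/24/20` = c(6, 1, 5)
)

daily <- wide %>%
  pivot_longer(-`Country/Region`, names_to = "Date", values_to = "Cases") %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y")) %>%
  # one cumulative total per country and date
  group_by(`Country/Region`, Date) %>%
  summarise(Cases = sum(Cases), .groups = "drop_last") %>%
  # still grouped by country: difference to the previous day
  mutate(Count = Cases - lag(Cases, default = 0)) %>%
  ungroup()

print(daily)
```

Country X has cumulative totals 1, 4, 7, which become daily counts 1, 3, 3.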
Use tidyr package
library(tidyr)
df %>%
tidyr::gather(key = "Date", value = "Cases", `1/22/20`:`1/24/20`)

Reshape dataframe in R using dcast or ftable [duplicate]

I currently have a data frame that looks like this.
country2<-c("Afghanistan","Afghanistan","Afghanistan")
continent2<-c("Asia","Asia","Asia")
series<-c('lifeexp','pop','gdp')
y1901<-c('1','3','100')
y1902<-c('2','4','101')
y1903<-c('2','4','101')
y1904<-c('2','4','101')
y1905<-c('2','4','101')
y1906<-c('2','4','101')
y1907<-c('2','4','101')
df<-data.frame(country2,continent2,series,y1901,y1902,y1903,y1904,y1905,y1906,y1907)
country2 continent2 series y1901 y1902 y1903 y1904 y1905 y1906 y1907
1 Afghanistan Asia lifeexp 1 2 2 2 2 2 2
2 Afghanistan Asia pop 3 4 4 4 4 4 4
3 Afghanistan Asia gdp 100 101 101 101 101 101 101
How can I reshape this data so that it will look like this?
country<-c("Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan")
continent<-c("Asia","Asia","Asia","Asia","Asia","Asia","Asia")
year<-c("1901","1902","1903","1904","1905","1906","1907")
lifeexp<-c("1","2","2","2","2","2","2")
pop<-c('3','4','4','4','4','4','4')
gdp<-c('100','101','101','101','101','101','101')
df<-data.frame(country,continent,year,lifeexp,pop,gdp)
country continent year lifeexp pop gdp
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
I have tried using dcast from the reshape2 package to reshape the data, but I can only enter one column for value.var.
dcast(df,country+region~series,value.var ='y1901',fun.aggregate = sum)
I also tried using ftable and xtabs but I'm still not sure how to enter more than 1 column for the value. The code below gives an error.
ftable(xtabs(c(y2000,y2001)~country+region+series,df))
Thanks
A data.table approach using melt and dcast could be
library(data.table)
setDT(df)
dcast(melt(df,measure = patterns("^y\\d+")),country2 + continent2 + variable~series)
# country2 continent2 variable gdp lifeexp pop
#1: Afghanistan Asia y1901 100 1 3
#2: Afghanistan Asia y1902 101 2 4
#3: Afghanistan Asia y1903 101 2 4
#4: Afghanistan Asia y1904 101 2 4
#5: Afghanistan Asia y1905 101 2 4
#6: Afghanistan Asia y1906 101 2 4
#7: Afghanistan Asia y1907 101 2 4
I know that you are looking for a solution with ftable or dcast but just for your knowledge, you can achieve it using tidyr:
library(tidyverse)
df %>%
pivot_longer(., cols = starts_with("y190"), names_to = "year", values_to = "Value") %>%
pivot_wider(., names_from = "series", values_from = "Value") %>%
mutate(year = gsub("y","", year)) %>%
rename(country = country2, continent = continent2)
# A tibble: 7 x 6
country continent year lifeexp pop gdp
<fct> <fct> <chr> <fct> <fct> <fct>
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
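A possible refinement, as a sketch: the sample columns are stored as text, so it may help to convert them while reshaping. This assumes a tidyr version that supports names_prefix and names_transform in pivot_longer (1.1.0 or later, as far as I know) and uses a shortened copy of the sample data with only two year columns.

```r
library(tidyr)
library(dplyr)

# Shortened copy of the sample data from the question (two year columns only).
df <- data.frame(
  country2   = "Afghanistan",
  continent2 = "Asia",
  series     = c("lifeexp", "pop", "gdp"),
  y1901      = c("1", "3", "100"),
  y1902      = c("2", "4", "101")
)

tidy <- df %>%
  # strip the "y" prefix and store the year as an integer
  pivot_longer(starts_with("y"), names_to = "year",
               names_prefix = "y",
               names_transform = list(year = as.integer)) %>%
  # one column per series
  pivot_wider(names_from = series, values_from = value) %>%
  # the values were stored as text; make them numeric
  mutate(across(c(lifeexp, pop, gdp), as.numeric))

print(tidy)
```

The result has one row per country and year, with lifeexp, pop and gdp as numeric columns.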

Merging datasets based on more than 1 column in both datasets

I'm trying to merge two datasets by year and country. The first dataset (df = GNIPC) contains gross national income per capita for every country from 1980 to 2008.
Country Year GNIpc
(chr) (dbl) (dbl)
1 Afghanistan 1990 NA
2 Afghanistan 1991 NA
3 Afghanistan 1992 2010
4 Afghanistan 1993 NA
5 Afghanistan 1994 12550
6 Afghanistan 1995 NA
The second dataset (df = sanctions) represents the imposition of economic sanctions from 1946 to present day.
country imposition sanctiontype sanctions_period
(chr) (dbl) (chr) (chr)
1 Afghanistan 1 1 6 8 1997-2001
2 Afghanistan 1 7 1979-1979
3 Afghanistan 1 4 7 1995-2002
4 Albania 1 2 8 2005-2005
5 Albania 1 7 2005-2006
6 Albania 1 8 2004-2005
I would like to merge the two datasets so that for every GNI year I know whether sanctions were in place in that country. GNI years that fall outside every sanctions_period should get the value 0, and those that fall inside one should get 1. This is what I want it to look like:
Country Year GNIpc Imposition sanctiontype
(chr) (dbl) (dbl) (dbl) (chr)
1 Afghanistan 1990 NA 0 NA
2 Afghanistan 1991 NA 0 NA
3 Afghanistan 1992 2010 0 NA
4 Afghanistan 1993 NA 0 NA
5 Afghanistan 1994 12550 0 NA
6 Afghanistan 1995 NA 1 4 7
Some example data:
df1 <- data.frame(country = c('Afghanistan', 'Turkey'),
imposition = c(1, 0),
sanctiontype = c('1 6 8', '4'),
sanctions_period = c('1997-2001', '2012-ongoing')
)
country imposition sanctiontype sanctions_period
1 Afghanistan 1 1 6 8 1997-2001
2 Turkey 0 4 2012-ongoing
The "sanctions_period" column can be transformed with dplyr and tidyr:
library(tidyr)
library(dplyr)
df.new <- separate(df1, sanctions_period, c('start', 'end'), remove = F) %>%
mutate(end = ifelse(end == 'ongoing', '2016', end)) %>%
mutate(start = as.numeric(start), end = as.numeric(end)) %>%
group_by(country, sanctions_period) %>%
do(data.frame(country = .$country, imposition = .$imposition, sanctiontype = .$sanctiontype, year = .$start:.$end))
sanctions_period country imposition sanctiontype year
<fctr> <fctr> <dbl> <fctr> <int>
1 1997-2001 Afghanistan 1 1 6 8 1997
2 1997-2001 Afghanistan 1 1 6 8 1998
3 1997-2001 Afghanistan 1 1 6 8 1999
4 1997-2001 Afghanistan 1 1 6 8 2000
5 1997-2001 Afghanistan 1 1 6 8 2001
6 2012-ongoing Turkey 0 4 2012
7 2012-ongoing Turkey 0 4 2013
8 2012-ongoing Turkey 0 4 2014
9 2012-ongoing Turkey 0 4 2015
10 2012-ongoing Turkey 0 4 2016
From there, it should be easy to merge with your first data frame. Note that your first data frame capitalizes Country and Year, while the second doesn't.
df.merged <- merge(df.first, df.new, by.x = c('Country', 'Year'), by.y = c('country', 'year'))
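Using dplyr (joining against the expanded df.new from above, since the raw sanctions data has no year column to join on):
left_join(GNIPC, df.new, by = c("Country" = "country", "Year" = "year")) %>%
select(Country, Year, GNIpc, imposition, sanctiontype)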
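Note that merge() defaults to an inner join, which drops the GNI years without sanctions, whereas the question wants those rows kept with a 0. A minimal base R sketch, assuming the sanctions data has already been expanded to one row per year; the data frames `gni` and `sanction_years` are illustrative, with values copied from the question:

```r
# GNI rows from the question.
gni <- data.frame(
  Country = "Afghanistan",
  Year    = 1990:1995,
  GNIpc   = c(NA, NA, 2010, NA, 12550, NA)
)

# The 1995-2002 sanctions period from the question, one row per year.
sanction_years <- data.frame(
  country      = "Afghanistan",
  year         = 1995:2002,
  imposition   = 1,
  sanctiontype = "4 7"
)

# all.x = TRUE keeps every GNI row, matched or not (a left join).
merged <- merge(gni, sanction_years,
                by.x = c("Country", "Year"),
                by.y = c("country", "year"),
                all.x = TRUE)

# Years without a matching sanctions row get imposition = 0.
merged$imposition[is.na(merged$imposition)] <- 0

print(merged)
```

The same fill step applies after dplyr's left_join(), which keeps unmatched rows by default. If several sanctions periods overlap in the same year, the join returns one row per period for that year.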

Resources