Sum up observations from data frame in R (multiple conditions) [duplicate] - r

I'm currently facing the following issue and would highly appreciate any help. My data frame looks like this
country_birth year migrants live_in gender
Albania 2000 1 Australia male
Germany 2000 2 Australia female
Albania 2008 3 Australia male
Albania 2000 6 Australia female
Germany 2004 2 Australia female
UK 2004 2 Germany female
US 2004 5 UK male
Now I would like to get the sum of migrants (both genders) for the same country of birth and the same live_in country in a matching year. The new data frame should look something like this:
country_birth year total_migrants live_in
Albania 2000 7 Australia
... ... ... ...
Many thanks in advance!

You can try aggregate + subset like below
> aggregate(migrants ~ ., subset(df, select = -gender), sum)
country_birth year live_in migrants
1 Albania 2000 Australia 7
2 Germany 2000 Australia 2
3 Germany 2004 Australia 2
4 Albania 2008 Australia 3
5 UK 2004 Germany 2
6 US 2004 UK 5
where
subset drops the gender column
aggregate sums migrants within each combination of the remaining columns (country_birth, year and live_in).
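For a version that can be pasted and run, the sketch below rebuilds the question's table by hand and names the grouping columns explicitly instead of using the dot shorthand. The object name totals is my own choice, and the rename at the end is only there to match the requested total_migrants column.

```r
# Rebuild the example data from the question
df <- data.frame(
  country_birth = c("Albania", "Germany", "Albania", "Albania", "Germany", "UK", "US"),
  year          = c(2000, 2000, 2008, 2000, 2004, 2004, 2004),
  migrants      = c(1, 2, 3, 6, 2, 2, 5),
  live_in       = c("Australia", "Australia", "Australia", "Australia",
                    "Australia", "Germany", "UK"),
  gender        = c("male", "female", "male", "female", "female", "female", "male")
)

# Sum migrants over gender by grouping on the three remaining columns
totals <- aggregate(migrants ~ country_birth + year + live_in, data = df, FUN = sum)

# Rename the summed column to match the requested output
names(totals)[names(totals) == "migrants"] <- "total_migrants"
totals
```

Listing the grouping columns by name is a little more verbose than migrants ~ ., but it keeps working if further columns are added to the data frame later.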

library(tidyverse)
data %>%
count(country_birth, year, live_in, wt = migrants, name = "total_migrants")
# # A tibble: 6 x 4
# country_birth year live_in total_migrants
# <chr> <dbl> <chr> <dbl>
# 1 Albania 2000 Australia 7
# 2 Albania 2008 Australia 3
# 3 Germany 2000 Australia 2
# 4 Germany 2004 Australia 2
# 5 UK 2004 Germany 2
# 6 US 2004 UK 5

Here is the {dplyr} approach:
library(dplyr)
data %>%
group_by(country_birth, year, live_in) %>%
summarise(total_migrants = sum(migrants))
You can learn more about grouped summaries by reading the dplyr documentation or at R for Data Science.

Related

Reshape dataframe in R using dcast or ftable [duplicate]

I currently have a data frame that looks like this.
country2<-c("Afghanistan","Afghanistan","Afghanistan")
continent2<-c("Asia","Asia","Asia")
series<-c('lifeexp','pop','gdp')
y1901<-c('1','3','100')
y1902<-c('2','4','101')
y1903<-c('2','4','101')
y1904<-c('2','4','101')
y1905<-c('2','4','101')
y1906<-c('2','4','101')
y1907<-c('2','4','101')
df<-data.frame(country2,continent2,series,y1901,y1902,y1903,y1904,y1905,y1906,y1907)
country2 continent2 series y1901 y1902 y1903 y1904 y1905 y1906 y1907
1 Afghanistan Asia lifeexp 1 2 2 2 2 2 2
2 Afghanistan Asia pop 3 4 4 4 4 4 4
3 Afghanistan Asia gdp 100 101 101 101 101 101 101
How can I reshape this data so that it will look like this?
country<-c("Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan","Afghanistan")
continent<-c("Asia","Asia","Asia","Asia","Asia","Asia","Asia")
year<-c("1901","1902","1903","1904","1905","1906","1907")
lifeexp<-c("1","2","2","2","2","2","2")
pop<-c('3','4','4','4','4','4','4')
gdp<-c('100','101','101','101','101','101','101')
df<-data.frame(country,continent,year,lifeexp,pop,gdp)
country continent year lifeexp pop gdp
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101
I have tried using dcast from the reshape2 package to reshape the data, but I can only enter one column for value.var.
dcast(df,country+region~series,value.var ='y1901',fun.aggregate = sum)
I also tried using ftable and xtabs but I'm still not sure how to enter more than 1 column for the value. The code below gives an error.
ftable(xtabs(c(y2000,y2001)~country+region+series,df))
Thanks
A data.table approach using melt and dcast could be
library(data.table)
setDT(df)
dcast(melt(df,measure = patterns("^y\\d+")),country2 + continent2 + variable~series)
# country2 continent2 variable gdp lifeexp pop
#1: Afghanistan Asia y1901 100 1 3
#2: Afghanistan Asia y1902 101 2 4
#3: Afghanistan Asia y1903 101 2 4
#4: Afghanistan Asia y1904 101 2 4
#5: Afghanistan Asia y1905 101 2 4
#6: Afghanistan Asia y1906 101 2 4
#7: Afghanistan Asia y1907 101 2 4
I know that you are looking for a solution with ftable or dcast, but for reference, you can also achieve this with tidyr:
library(tidyverse)
df %>%
pivot_longer(., cols = starts_with("y190"), names_to = "year", values_to = "Value") %>%
pivot_wider(., names_from = "series", values_from = "Value") %>%
mutate(year = gsub("y","", year)) %>%
rename(country = country2, continent = continent2)
# A tibble: 7 x 6
country continent year lifeexp pop gdp
<fct> <fct> <chr> <fct> <fct> <fct>
1 Afghanistan Asia 1901 1 3 100
2 Afghanistan Asia 1902 2 4 101
3 Afghanistan Asia 1903 2 4 101
4 Afghanistan Asia 1904 2 4 101
5 Afghanistan Asia 1905 2 4 101
6 Afghanistan Asia 1906 2 4 101
7 Afghanistan Asia 1907 2 4 101

reshape panel data from long to long, but expanding variables

I have a large dataset in long format, with multiple variables 'stacked', in a structure similar to
set.seed(42)
dat_0=data.frame(
c(rep('AFG',4),rep('UK',4)),
c(rep('GDP',2),rep('pop',2)),
rep(c('1990','1991'),4),
runif(8))
colnames(dat_0)<-c('country','variable','year','val')
which produces the following
country variable year val
1 AFG GDP 1990 0.0856120649
2 AFG GDP 1991 0.3052183695
3 AFG pop 1990 0.6674265147
4 AFG pop 1991 0.0002388966
5 UK GDP 1990 0.2085699569
6 UK GDP 1991 0.9330341273
7 UK pop 1990 0.9256447486
8 UK pop 1991 0.7340943010
I want to have each variable (GDP, pop) in one column
country year GDP pop
1 AFG 1990 0.0856120649 0.6674265147
2 AFG 1991 0.3052183695 0.0002388966
3 UK 1990 0.2085699569 0.9256447486
4 UK 1991 0.9330341273 0.7340943010
I am really sorry if this is a duplicate, but after going through earlier posts I have still not managed to re-structure my data.
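One possible way to get there in base R is reshape() with direction = "wide". This is only a sketch: the object name wide is an assumption, the data frame is rebuilt with named columns for readability, and the values depend on the random draw, so they need not match the printout above.

```r
# Rebuild the example data (same structure as dat_0 in the question)
set.seed(42)
dat_0 <- data.frame(
  country  = c(rep("AFG", 4), rep("UK", 4)),
  variable = rep(c("GDP", "GDP", "pop", "pop"), 2),
  year     = rep(c("1990", "1991"), 4),
  val      = runif(8)
)

# One row per country/year; one column per level of 'variable'
wide <- reshape(dat_0,
                idvar     = c("country", "year"),
                timevar   = "variable",
                direction = "wide")

# reshape() names the new columns val.GDP and val.pop; strip the prefix
names(wide) <- sub("^val\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

If the tidyverse is available, tidyr::pivot_wider(dat_0, names_from = variable, values_from = val) should give the same layout.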

Merging datasets based on more than 1 column in both datasets

I'm trying to merge two datasets by year and country. The first dataset (df = GNIPC) represents gross national income per capita for every country from 1980 to 2008.
Country Year GNIpc
(chr) (dbl) (dbl)
1 Afghanistan 1990 NA
2 Afghanistan 1991 NA
3 Afghanistan 1992 2010
4 Afghanistan 1993 NA
5 Afghanistan 1994 12550
6 Afghanistan 1995 NA
The second dataset (df = sanctions) represents the imposition of economic sanctions from 1946 to present day.
country imposition sanctiontype sanctions_period
(chr) (dbl) (chr) (chr)
1 Afghanistan 1 1 6 8 1997-2001
2 Afghanistan 1 7 1979-1979
3 Afghanistan 1 4 7 1995-2002
4 Albania 1 2 8 2005-2005
5 Albania 1 7 2005-2006
6 Albania 1 8 2004-2005
I would like to merge the two datasets so that for every GNI year I know whether sanctions were present in the country or not. For the GNI years that are not in the sanctions_period the value would be 0, and for those that are it would be 1. This is what I want it to look like:
Country Year GNIpc Imposition sanctiontype
(chr) (dbl) (dbl) (dbl) (chr)
1 Afghanistan 1990 NA 0 NA
2 Afghanistan 1991 NA 0 NA
3 Afghanistan 1992 2010 0 NA
4 Afghanistan 1993 NA 0 NA
5 Afghanistan 1994 12550 0 NA
6 Afghanistan 1995 NA 1 4 7
Some example data:
df1 <- data.frame(country = c('Afghanistan', 'Turkey'),
imposition = c(1, 0),
sanctiontype = c('1 6 8', '4'),
sanctions_period = c('1997-2001', '2012-ongoing')
)
country imposition sanctiontype sanctions_period
1 Afghanistan 1 1 6 8 1997-2001
2 Turkey 0 4 2012-ongoing
The "sanctions_period" column can be transformed with dplyr and tidyr:
library(tidyr)
library(dplyr)
df.new <- separate(df1, sanctions_period, c('start', 'end'), remove = F) %>%
mutate(end = ifelse(end == 'ongoing', '2016', end)) %>%
mutate(start = as.numeric(start), end = as.numeric(end)) %>%
group_by(country, sanctions_period) %>%
do(data.frame(country = .$country, imposition = .$imposition, sanctiontype = .$sanctiontype, year = .$start:.$end))
sanctions_period country imposition sanctiontype year
<fctr> <fctr> <dbl> <fctr> <int>
1 1997-2001 Afghanistan 1 1 6 8 1997
2 1997-2001 Afghanistan 1 1 6 8 1998
3 1997-2001 Afghanistan 1 1 6 8 1999
4 1997-2001 Afghanistan 1 1 6 8 2000
5 1997-2001 Afghanistan 1 1 6 8 2001
6 2012-ongoing Turkey 0 4 2012
7 2012-ongoing Turkey 0 4 2013
8 2012-ongoing Turkey 0 4 2014
9 2012-ongoing Turkey 0 4 2015
10 2012-ongoing Turkey 0 4 2016
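From there, it should be easy to merge with your first data frame (GNIPC in the question, called df.first here). Note that your first data frame capitalizes Country and Year, while the second doesn't. Use all.x = TRUE to keep the GNI years that have no sanctions; their imposition will be NA, which you can then set to 0.
df.merged <- merge(df.first, df.new, by.x = c('Country', 'Year'), by.y = c('country', 'year'), all.x = TRUE)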
Using dplyr (joining to the expanded df.new from above, which has one row per year):
left_join(GNIPC, df.new, by = c("Country" = "country", "Year" = "year")) %>%
select(Country, Year, GNIpc, imposition, sanctiontype)
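The whole pipeline (expand each sanctions period to one row per year, left-merge onto the GNI data, fill the gaps with 0) can also be sketched in base R. Everything below is illustrative: the object names gni, sanctions, expanded and merged are my own, the GNIpc values are made up, and "ongoing" is mapped to 2016 as in the answer above.

```r
# Hypothetical stand-ins for the two data frames in the question
gni <- data.frame(Country = "Afghanistan",
                  Year    = 1995:1998,
                  GNIpc   = c(NA, 100, NA, 120))   # made-up values
sanctions <- data.frame(country          = "Afghanistan",
                        imposition       = 1,
                        sanctiontype     = "1 6 8",
                        sanctions_period = "1997-2001")

# Split "start-end" and treat "ongoing" as 2016 (an assumption)
bounds  <- strsplit(sanctions$sanctions_period, "-")
start   <- as.integer(sapply(bounds, `[`, 1))
end_raw <- sapply(bounds, `[`, 2)
end     <- as.integer(ifelse(end_raw == "ongoing", "2016", end_raw))

# Repeat each sanctions row once per year it covers
n_years  <- end - start + 1L
expanded <- sanctions[rep(seq_len(nrow(sanctions)), n_years),
                      c("country", "imposition", "sanctiontype")]
expanded$year <- unlist(Map(seq, start, end))

# Left merge keeps every GNI year; unmatched years get NA, then 0
merged <- merge(gni, expanded,
                by.x = c("Country", "Year"),
                by.y = c("country", "year"),
                all.x = TRUE)
merged$imposition[is.na(merged$imposition)] <- 0
merged
```

If a country has overlapping sanction periods, as Afghanistan does in the question's data, the merge will return several rows for the same year, so you may want to collapse those first.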

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); the original data frame has many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e., I would like to add rows with value = NA for the missing years for each country and each disease.
For example, data on HIV in Italy is missing between 2002 and 2004, so I want to add these rows with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year, then merges them with the original data to bring in the values. Combinations that were not in the original data get NA in the remaining columns (indx and value).
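As a quick check of that approach, the sketch below runs it on the reproducible example and then restores indx for the added rows from a country-to-indx lookup. The names full and key are my own, and the lookup step is an addition that assumes each country has exactly one indx.

```r
# Reproducible example from the question
indx    <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7), rep("Spain", times = 6))
year    <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death   <- c(rep("hiv", times = 3), rep("cancer", times = 6), rep("hiv", times = 3),
             rep("cancer", times = 4), rep("hiv", times = 6))
value   <- 1:22
dfl     <- data.frame(indx, country, year, death, value)

# All country x death x year combinations, left-merged with the data
full <- merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)),
              dfl, all.x = TRUE)

# The added rows have NA in indx as well; restore it from a lookup table
key <- unique(dfl[c("country", "indx")])
full$indx <- key$indx[match(full$country, key$country)]

nrow(full)              # 3 countries x 2 diseases x 6 years
sum(is.na(full$value))  # number of rows that were added
```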
The tidyr package has a dedicated function that does this in a single command:
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete creates all combinations of the variables you pass it, but if two columns identify the same thing (here indx and country map one-to-one), passing both will over-expand, and passing only one leaves NAs in the other. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or temporarily merge the two columns into one:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...

List the names of the NAs in a column

I have a dataframe called df and I have 10 variables inside this df.
df contains a list of countries with their GDP, unemployment level, whether they have been colonised (TRUE/FALSE), etc.
For each of the variables gdp, unemp level and colonised I know there are a number of NAs.
Is there a command that lists the names of the countries that have NAs? E.g., suppose the UK has NA for gdp but has values for unemp and colonised, and France has gdp and unemp but NA for colonised.
Is there a command that will return a list containing the UK and France because they have NAs?
My data:
destination origin sum gdp.diff unemployment.diff
1 Albania Azerbaijan 2 27 8.467610
2 Albania Congo 1 -21 NA
3 Albania Dem. Rep. of the Congo 1 -80 13.437610
4 Albania Eritrea 21 -66 NA
5 Albania Iran (Islamic Rep. of) 279 5 2.997610
6 Albania Mali 1 -68 6.137609
So I need Albania to appear in the list because it has an NA for unemployment.diff.
Using complete.cases:
#dummy data
df <- data.frame(country = letters[1:3],
gdp = c(1,NA,2),
unemployment = c(1,2,3),
colonised = c(T,F,NA))
df
# country gdp unemployment colonised
# 1 a 1 1 TRUE
# 2 b NA 2 FALSE
# 3 c 2 3 NA
df[ !complete.cases(df), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
# 3 c 2 3 NA
# check for NAs on one column
df[ is.na(df$gdp), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
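If only the names are needed rather than the full rows, the same logical index can be applied to the country column. The sketch below reuses the dummy data; countries_with_na and na_by_country are names I made up, and the second step (listing which columns are NA for each country) goes slightly beyond what was asked.

```r
# Dummy data, as above
df <- data.frame(country      = c("a", "b", "c"),
                 gdp          = c(1, NA, 2),
                 unemployment = c(1, 2, 3),
                 colonised    = c(TRUE, FALSE, NA))

# Names of countries with an NA in any column
countries_with_na <- unique(df$country[!complete.cases(df)])
countries_with_na

# For each country, the columns that contain an NA (countries without NAs dropped)
na_by_country <- Filter(length,
                        lapply(split(df[-1], df$country),
                               function(rows) names(rows)[colSums(is.na(rows)) > 0]))
na_by_country
```

For the data in the question, the country column would be destination, so the first step would read unique(df$destination[!complete.cases(df)]).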
