List the names of the NAs in a column - r

I have a dataframe called df and I have 10 variables inside this df.
df contains a list of countries with their gdp, unemployment level, whether they have been colonised (TRUE/FALSE), etc.
For each variable gdp, unemp level and colonised I know there's a number of NAs.
Is there a command that lists the names of the countries that have NAs? For example, the UK has NA for gdp but has values for unemp and colonised, and France has gdp and unemp but NA for colonised.
I would like a command that returns the UK and France, because each of them has at least one NA.
My data:
destination origin sum gdp.diff unemployment.diff
1 Albania Azerbaijan 2 27 8.467610
2 Albania Congo 1 -21 NA
3 Albania Dem. Rep. of the Congo 1 -80 13.437610
4 Albania Eritrea 21 -66 NA
5 Albania Iran (Islamic Rep. of) 279 5 2.997610
6 Albania Mali 1 -68 6.137609
So I need Albania to appear in the list because it has an NA for unemployment.diff

Using complete.cases:
#dummy data
df <- data.frame(country = letters[1:3],
gdp = c(1,NA,2),
unemployment = c(1,2,3),
colonised = c(T,F,NA))
df
# country gdp unemployment colonised
# 1 a 1 1 TRUE
# 2 b NA 2 FALSE
# 3 c 2 3 NA
df[ !complete.cases(df), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
# 3 c 2 3 NA
# check for NAs on one column
df[ is.na(df$gdp), ]
# country gdp unemployment colonised
# 2 b NA 2 FALSE
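To get only the country names rather than the full rows, the same logical index can be applied to the country column. A minimal sketch reusing the dummy df from above; the names na_countries, value_cols and na_by_column are illustrative and not part of the original answer:

```r
# dummy data, as above
df <- data.frame(country = letters[1:3],
                 gdp = c(1, NA, 2),
                 unemployment = c(1, 2, 3),
                 colonised = c(TRUE, FALSE, NA),
                 stringsAsFactors = FALSE)

# countries with an NA in at least one column
na_countries <- unique(df$country[!complete.cases(df)])
na_countries
# [1] "b" "c"

# countries with an NA, listed separately for each value column
value_cols <- setdiff(names(df), "country")
na_by_column <- lapply(df[value_cols],
                       function(x) unique(df$country[is.na(x)]))
```

unique() matters for data like the question's, where the same country (e.g. Albania) appears on many rows.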

Related

Why are some of the rows becoming blank when I proportionately divide in the respective columns in R

I'm doing data wrangling for Heat-Map visualization in R.
The data consist of following columns in AQi_cate_gro variable:
Country
Air Category
AQI_Cate_Counts
Country AQI.Category AQI_Cate_Counts
<chr> <chr> <int>
1 Afghanistan Good 1
2 Afghanistan Moderate 30
3 Afghanistan Unhealthy 5
4 Afghanistan Unhealthy for Sensitive Groups 13
5 Africa Good 4
6 Africa Moderate 26
7 Albania Good 2
8 Albania Moderate 28
9 Albania Unhealthy for Sensitive Groups 2
10 Algeria Good 4
When I compute proportions from AQI_Cate_Counts into the Division_counts column,
some rows come out as NA even though they have a count to divide.
Country AQI.Category AQI_Cate_Counts Division_counts
<chr> <chr> <int> <dbl>
1 Brazil Good 1125 NA
2 Brazil Moderate 348 NA
3 Brazil Unhealthy 44 NA
4 Brazil Unhealthy for Sensitive Groups 39 NA
5 Brazil Very Unhealthy 6 NA
6 Brazil Hazardous NA NA
7 Russsia Good 1025 0.826
8 Russsia Moderate 203 0.164
9 Russsia Unhealthy 1 0.0008
10 Russsia Unhealthy for Sensitive Groups 10 0.0081
11 Russsia Very Unhealthy 1 0.0008
12 Russsia Hazardous 1 0.0008
13 USA Good 1001 NA
14 USA Moderate 1715 NA
15 USA Unhealthy 18 NA
Here is the code behind this result:
# group by Country
AQi_cate_gro <- AQI_propeties %>% group_by(Country) %>%
#counts "AQI.category"
count(AQI.Category) %>% rename("AQI_Cate_Counts" = n) # rename function to rename
#sorting
arrange_aqi_2 <- AQi_cate_gro %>%
pivot_wider(id_cols = Country,
names_from = AQI.Category,
values_from = AQI_Cate_Counts) %>%
arrange(-`Good`,-`Moderate`,-`Unhealthy`,-`Unhealthy for Sensitive Groups`,
-`Very Unhealthy`,-`Hazardous`) %>%
pivot_longer(cols = -Country,
names_to = "AQI.Category",
values_to = "AQI_Cate_Counts") %>%
# proportional division of AQI_Cate_Counts
mutate("Division_counts" = round(prop.table(AQI_Cate_Counts),4))
I first grouped by the Country column and then counted each AQI category so that I could sort in descending order, but I'm stuck on the division step described above.
I want a proper proportion of AQI_Cate_Counts in the Division_counts column, so that I can visualize this data set in the graphs I want.
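One likely cause, assuming the grouping by Country survives the pivot steps (the Russsia shares sum to 1 within that country, which suggests it does): prop.table(x) is x / sum(x), and sum() returns NA as soon as one value in the group is NA. pivot_wider fills country/category combinations that never occur (such as Brazil / Hazardous) with NA, so every share for that country becomes NA, while a country with all categories present is unaffected. A minimal base R sketch with made-up counts, showing the effect and one possible fix using na.rm = TRUE:

```r
# made-up counts; the NA stands for a category that never occurred in country A
counts <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  AQI.Category = rep(c("Good", "Moderate", "Hazardous"), times = 2),
  AQI_Cate_Counts = c(6, 4, NA, 5, 3, 2),
  stringsAsFactors = FALSE
)

# per-country share as in the question: one NA turns the whole group into NA
share_naive <- ave(counts$AQI_Cate_Counts, counts$Country,
                   FUN = function(x) x / sum(x))

# ignoring NA in the denominator keeps the shares of the observed categories
share_fixed <- ave(counts$AQI_Cate_Counts, counts$Country,
                   FUN = function(x) round(x / sum(x, na.rm = TRUE), 4))
counts$Division_counts <- share_fixed
```

In the dplyr pipeline the equivalent would be AQI_Cate_Counts / sum(AQI_Cate_Counts, na.rm = TRUE) inside mutate(). If a missing category should count as zero rather than stay NA, passing values_fill = 0 to pivot_wider is another option.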

Sum up observations from data frame in R (multiple conditions) [duplicate]

I'm currently facing the following issue and would highly appreciate any help. My data frame looks like this
country_birth year migrants live_in gender
Albania 2000 1 Australia male
Germany 2000 2 Australia female
Albania 2008 3 Australia male
Albania 2000 6 Australia female
Germany 2004 2 Australia female
UK 2004 2 Germany female
US 2004 5 UK male
Now I would like to get the sum of migrants (both gender) for the same country of birth and the same live_in country for a matching year. A new dataframe should look something like this
country_birth year total_migrants live_in
Albania 2000 7 Australia
... ... ... ...
Many thanks in advance!
You can try aggregate + subset like below
> aggregate(migrants ~ ., subset(df, select = -gender), sum)
country_birth year live_in migrants
1 Albania 2000 Australia 7
2 Germany 2000 Australia 2
3 Germany 2004 Australia 2
4 Albania 2008 Australia 3
5 UK 2004 Germany 2
6 US 2004 UK 5
where
subset omits the column gender
aggregate helps you aggregate migrants, grouped by all other columns.
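The same aggregation can also be written with the grouping columns named explicitly instead of using `.`, which avoids the subset step. A small sketch that rebuilds the question's data so it runs on its own; the object name totals is illustrative:

```r
# data from the question
df <- data.frame(
  country_birth = c("Albania", "Germany", "Albania", "Albania",
                    "Germany", "UK", "US"),
  year = c(2000, 2000, 2008, 2000, 2004, 2004, 2004),
  migrants = c(1, 2, 3, 6, 2, 2, 5),
  live_in = c("Australia", "Australia", "Australia", "Australia",
              "Australia", "Germany", "UK"),
  gender = c("male", "female", "male", "female", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# sum migrants over gender for each birth country / year / residence country
totals <- aggregate(migrants ~ country_birth + year + live_in,
                    data = df, FUN = sum)

# rename the summed column to match the desired output
names(totals)[names(totals) == "migrants"] <- "total_migrants"
```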
library(tidyverse)
data %>%
count(country_birth, year, live_in, wt = migrants, name = "total_migrants")
# # A tibble: 6 x 4
# country_birth year live_in total_migrants
# <chr> <dbl> <chr> <dbl>
# 1 Albania 2000 Australia 7
# 2 Albania 2008 Australia 3
# 3 Germany 2000 Australia 2
# 4 Germany 2004 Australia 2
# 5 UK 2004 Germany 2
# 6 US 2004 UK 5
Here is the {dplyr} approach:
data %>%
group_by(country_birth, year, live_in) %>%
summarise(total_migrants = sum(migrants))
You can learn more about grouped summaries by reading the dplyr documentation or at R for Data Science.

Rearranging data frame in R for panel analysis

I have problems with rearranging my data frame so that it is suitable for panel analysis.
The raw data looks like this (there are all countries and 50 years, that's just head):
head(suicide_data_panel)
country variable 1970 1971
Afghanistan suicide NA NA
Afghanistan unempl NA NA
Afghanistan hci NA NA
Afghanistan gini NA NA
Afghanistan inflation NA NA
Afghanistan cpi NA NA
I would like it to be:
country year suicide unempl
Afghanistan 1970 NA NA
Afghanistan 1971 NA NA
Afghanistan 1972 NA NA
Afghanistan 1973 NA NA
Afghanistan 1974 NA NA
Afghanistan 1975 NA NA
So that I can run panel regression. I've tried to use dcast but I don't know how to make it account for different years:
suicide <- dcast(suicide_data_panel, country~variable, sum)
This command will result in taking the last year only into account:
head(suicide)
country account alcohol
1 Afghanistan -18.874843 NA
2 Albania -6.689212 NA
3 Algeria NA NA
4 American Samoa NA NA
5 Andorra NA NA
6 Angola 7.000035 NA
It sorts variables alphabetically. Please help.
You could try to use the tidyverse package:
library(tidyverse)
suicide_data_panel %>%
gather(year, dummy, -country, -variable) %>%
spread(variable, dummy)
You can do this in two steps: first, use the melt function with "country" and "variable" as ID variables, so the year columns are stacked into one column; second, use the dcast function to spread "variable" into individual columns.
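The melt-then-dcast steps could look roughly like this. It is a sketch under assumptions: it uses the reshape2 package, and panel_wide is a small made-up stand-in for suicide_data_panel with only two countries, two variables and two years:

```r
library(reshape2)

# made-up stand-in for suicide_data_panel (year columns named "1970", "1971")
panel_wide <- data.frame(
  country = rep(c("DE", "AT"), each = 2),
  variable = rep(c("suicide", "unempl"), times = 2),
  `1970` = c(1, 2, 3, 4),
  `1971` = c(5, 6, 7, 8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# step 1: stack the year columns; the old column names go into "year"
long <- melt(panel_wide, id.vars = c("country", "variable"),
             variable.name = "year", value.name = "value")

# step 2: one row per country and year, one column per variable
wide <- dcast(long, country + year ~ variable, value.var = "value")

# melt stores the year as a factor; convert it back to a number
wide$year <- as.integer(as.character(wide$year))
```

Putting year on the left-hand side of the dcast formula is what keeps each year as its own row, which is the piece missing from the country ~ variable attempt in the question.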
Following a reshape approach.
tm <- by(dat, dat$variable, reshape, varying=3:4, idvar="country",
direction="long", sep="", timevar="year", drop=2)
res <- cbind(el(tm)[1:2], data.frame(mapply(`[[`, tm, 3)))
res
# country year hci suicide unempl
# DE.1970 DE 1970 1.51152200 1.3709584 0.6328626
# AT.1970 AT 1970 -0.09465904 -0.5646982 0.4042683
# CH.1970 CH 1970 2.01842371 0.3631284 -0.1061245
# DE.1971 DE 1971 0.63595040 -0.0627141 -1.3888607
# AT.1971 AT 1971 -0.28425292 1.3048697 -0.2787888
# CH.1971 CH 1971 -2.65645542 2.2866454 -0.1333213
Data
set.seed(42)
dat <- cbind(expand.grid(country=c("DE", "AT", "CH"),
variable=c("suicide", "unempl", "hci"),
stringsAsFactors=F), x1970=rnorm(9), x1971=rnorm(9))

Merging datasets based on more than 1 column in both datasets

I'm trying to merge two datasets, by year and country. The first data set (df = GNIPC) represents gross national income per capita for every country from 1980-2008.
Country Year GNIpc
(chr) (dbl) (dbl)
1 Afghanistan 1990 NA
2 Afghanistan 1991 NA
3 Afghanistan 1992 2010
4 Afghanistan 1993 NA
5 Afghanistan 1994 12550
6 Afghanistan 1995 NA
The second dataset (df = sanctions) represents the imposition of economic sanctions from 1946 to present day.
country imposition sanctiontype sanctions_period
(chr) (dbl) (chr) (chr)
1 Afghanistan 1 1 6 8 1997-2001
2 Afghanistan 1 7 1979-1979
3 Afghanistan 1 4 7 1995-2002
4 Albania 1 2 8 2005-2005
5 Albania 1 7 2005-2006
6 Albania 1 8 2004-2005
I would like to merge the two datasets so that for every GNI year I can see whether sanctions were present in the country or not. For the GNI years that are not in the sanctions_period the value would be 0, and for those that are it would be 1. This is what I want it to look like:
Country Year GNIpc Imposition sanctiontype
(chr) (dbl) (dbl) (dbl) (chr)
1 Afghanistan 1990 NA 0 NA
2 Afghanistan 1991 NA 0 NA
3 Afghanistan 1992 2010 0 NA
4 Afghanistan 1993 NA 0 NA
5 Afghanistan 1994 12550 0 NA
6 Afghanistan 1995 NA 1 4 7
Some example data:
df1 <- data.frame(country = c('Afghanistan', 'Turkey'),
imposition = c(1, 0),
sanctiontype = c('1 6 8', '4'),
sanctions_period = c('1997-2001', '2012-ongoing')
)
country imposition sanctiontype sanctions_period
1 Afghanistan 1 1 6 8 1997-2001
2 Turkey 0 4 2012-ongoing
The "sanctions_period" column can be transformed with dplyr and tidyr:
library(tidyr)
library(dplyr)
df.new <- separate(df1, sanctions_period, c('start', 'end'), remove = F) %>%
mutate(end = ifelse(end == 'ongoing', '2016', end)) %>%
mutate(start = as.numeric(start), end = as.numeric(end)) %>%
group_by(country, sanctions_period) %>%
do(data.frame(country = .$country, imposition = .$imposition, sanctiontype = .$sanctiontype, year = .$start:.$end))
sanctions_period country imposition sanctiontype year
<fctr> <fctr> <dbl> <fctr> <int>
1 1997-2001 Afghanistan 1 1 6 8 1997
2 1997-2001 Afghanistan 1 1 6 8 1998
3 1997-2001 Afghanistan 1 1 6 8 1999
4 1997-2001 Afghanistan 1 1 6 8 2000
5 1997-2001 Afghanistan 1 1 6 8 2001
6 2012-ongoing Turkey 0 4 2012
7 2012-ongoing Turkey 0 4 2013
8 2012-ongoing Turkey 0 4 2014
9 2012-ongoing Turkey 0 4 2015
10 2012-ongoing Turkey 0 4 2016
From there, it should be easy to merge with your first data frame. Note that your first data frame capitalizes Country and Year, while the second doesn't.
df.merged <- merge(df.first, df.new, by.x = c('Country', 'Year'), by.y = c('country', 'year'))
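Note that merge() keeps only matching rows by default, so GNI years without sanctions would be dropped. To keep every GNI row and set imposition to 0 where no sanction applies, as the question asks, a left merge with all.x = TRUE is one option. A minimal sketch with made-up data; gni and sanc are hypothetical stand-ins for the first data frame and the expanded df.new:

```r
# made-up stand-in for the GNI data
gni <- data.frame(Country = "Afghanistan", Year = 1995:1998,
                  GNIpc = c(NA, 100, 110, 120),
                  stringsAsFactors = FALSE)

# made-up stand-in for the expanded sanctions data (one row per year)
sanc <- data.frame(country = "Afghanistan", year = 1997:2001,
                   imposition = 1, sanctiontype = "1 6 8",
                   stringsAsFactors = FALSE)

# left merge: all.x = TRUE keeps GNI years that have no sanction row
merged <- merge(gni, sanc,
                by.x = c("Country", "Year"),
                by.y = c("country", "year"),
                all.x = TRUE)

# years without a matching sanction get imposition = 0
merged$imposition[is.na(merged$imposition)] <- 0
```

If several sanctions overlap in the same country and year, this produces one row per sanction for that year, so the sanctions may need to be collapsed first.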
Using dplyr:
# join on the expanded sanctions data (one row per year), here df.new from above
left_join(GNIPC, df.new, by=c("Country"="country", "Year"="year")) %>%
select(Country, Year, GNIpc, imposition, sanctiontype)

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); in the original data frame there are many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e. I would like to add lines with value = NA at the missing years for each country for each disease.
For example, the data for HIV in Italy is missing between 2002 and 2004, so I want to add these lines with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year and then merges it to the original data, to add the values and where combinations were not in the original data, it adds NAs.
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete helps create all combinations of the variables you pass it, but if you have two columns that are identical, it will over-expand or leave NAs where you don't want. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or just merge the two columns into one temporarily:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
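Another way to keep indx and country tied together, as an alternative to the unite/separate round trip, is tidyr's nesting() helper inside complete(), which only uses the indx/country pairs that actually occur in the data. A sketch using the reproducible example from the question; the object name full is illustrative:

```r
library(tidyr)

# reproducible example from the question
indx <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7),
             rep("Spain", times = 6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times = 3), rep("cancer", times = 6),
           rep("hiv", times = 3), rep("cancer", times = 4),
           rep("hiv", times = 6))
value <- 1:22
dfl <- data.frame(indx, country, year, death, value)

# expand death x year within each observed (indx, country) pair
full <- complete(dfl, nesting(indx, country), death, year)

nrow(full)
# 3 countries x 2 causes x 6 years = 36 rows
```

Unlike the unite/separate version, indx keeps its original numeric type here.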
