NAs in a data frame split by country in R [duplicate] - r

I would like to impute the NAs in a data frame with the mean of the observed data for each country. In other words, each NA should be replaced using only the values from the same country. For instance:
Date Country Battles Riots
March 2018 Afghanistan 380 NA
March 2018 Yemen 88 5
March 2018 Mali 45 NA
April 2018 Afghanistan 350 NA
April 2018 Yemen NA 66
April 2018 Mali 67 NA
May 2018 Afghanistan NA 7
May 2018 Yemen NA NA
May 2018 Mali NA 6
I have used the following code, but it calculates each column's mean across all countries instead of within each country.
for(i in 6:ncol(my_data)) {
my_data[ , i][is.na(my_data[ , i])] <- mean(my_data[ , i], na.rm = TRUE)
}
Many thanks in advance.

You could use:
library(dplyr)
library(tidyr)
df %>%
group_by(Country) %>%
mutate(across(c(Battles, Riots), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
ungroup()
which returns
Date Country Battles Riots
<chr> <chr> <dbl> <dbl>
1 March 2018 Afghanistan 380 7
2 March 2018 Yemen 88 5
3 March 2018 Mali 45 6
4 April 2018 Afghanistan 350 7
5 April 2018 Yemen 88 66
6 April 2018 Mali 67 6
7 May 2018 Afghanistan 365 7
8 May 2018 Yemen 88 35.5
9 May 2018 Mali 56 6
Data
structure(list(Date = c("March 2018", "March 2018", "March 2018",
"April 2018", "April 2018", "April 2018", "May 2018", "May 2018",
"May 2018"), Country = c("Afghanistan", "Yemen", "Mali", "Afghanistan",
"Yemen", "Mali", "Afghanistan", "Yemen", "Mali"), Battles = c(380,
88, 45, 350, NA, 67, NA, NA, NA), Riots = c(NA, 5, NA, NA, 66,
NA, 7, NA, 6)), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))

A data.table option (borrowing df from @Martin Gal)
library(data.table)
setDT(df)[
,
c("Battles", "Riots") := lapply(
.(Battles, Riots),
function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
),
Country
][]
gives
Date Country Battles Riots
1: March 2018 Afghanistan 380 7.0
2: March 2018 Yemen 88 5.0
3: March 2018 Mali 45 6.0
4: April 2018 Afghanistan 350 7.0
5: April 2018 Yemen 88 66.0
6: April 2018 Mali 67 6.0
7: May 2018 Afghanistan 365 7.0
8: May 2018 Yemen 88 35.5
9: May 2018 Mali 56 6.0
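For comparison, the same per-country imputation can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. The helper name impute_mean is hypothetical and the data frame is rebuilt from the question's sample, so this is an illustration under those assumptions rather than a drop-in replacement for the answers above.

```r
# Sample data rebuilt from the question
df <- data.frame(
  Date    = rep(c("March 2018", "April 2018", "May 2018"), each = 3),
  Country = rep(c("Afghanistan", "Yemen", "Mali"), times = 3),
  Battles = c(380, 88, 45, 350, NA, 67, NA, NA, NA),
  Riots   = c(NA, 5, NA, NA, 66, NA, 7, NA, 6)
)

# Hypothetical helper: replace NAs in x with the mean of the observed values
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() runs impute_mean separately for each country
for (col in c("Battles", "Riots")) {
  df[[col]] <- ave(df[[col]], df$Country, FUN = impute_mean)
}
```

Note that a country with no observed values at all in a column would get NaN, because the mean of an empty set is NaN.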

Related

Fill in Multiple NAs with Lagged Values R

I am trying to fill the NA values in this data frame with the most recent non-NA value in the cost column. I want to group by city - so all NAs for Omaha should be 44.50, and the NAs for Lincoln should be 62.50. Here is the code I have been using - it replaces the first NA (April) for each group with the correct value, but does not fill past that.
df <- df %>%
group_by(city) %>%
mutate(cost = ifelse(is.na(cost), lag(cost, na.rm=TRUE), cost))
Data before running code:
year month city cost
2021 January Omaha 45.50
2021 February Omaha 46.75
2021 March Omaha 44.50
2021 April Omaha NA
2021 May Omaha NA
2021 June Omaha NA
2021 January Lincoln 55.25
2021 February Lincoln 53.80
2021 March Lincoln 62.50
2021 April Lincoln NA
2021 May Lincoln NA
2021 June Lincoln NA
Use:
library(tidyverse)
df %>%
group_by(city) %>%
fill(cost)
# A tibble: 12 x 4
# Groups: city [2]
year month city cost
<int> <chr> <chr> <dbl>
1 2021 January Omaha 45.5
2 2021 February Omaha 46.8
3 2021 March Omaha 44.5
4 2021 April Omaha 44.5
5 2021 May Omaha 44.5
6 2021 June Omaha 44.5
7 2021 January Lincoln 55.2
8 2021 February Lincoln 53.8
9 2021 March Lincoln 62.5
10 2021 April Lincoln 62.5
11 2021 May Lincoln 62.5
12 2021 June Lincoln 62.5
With your code, you would want to use last rather than lag (though fill is the much better option here). We also need to wrap cost in na.omit, because the last value in each group is itself NA.
library(tidyverse)
df %>%
group_by(city) %>%
mutate(cost = ifelse(is.na(cost), last(na.omit(cost)), cost))
Output
year month city cost
<int> <chr> <chr> <dbl>
1 2021 January Omaha 45.5
2 2021 February Omaha 46.8
3 2021 March Omaha 44.5
4 2021 April Omaha 44.5
5 2021 May Omaha 44.5
6 2021 June Omaha 44.5
7 2021 January Lincoln 55.2
8 2021 February Lincoln 53.8
9 2021 March Lincoln 62.5
10 2021 April Lincoln 62.5
11 2021 May Lincoln 62.5
12 2021 June Lincoln 62.5
Data
df <- structure(list(year = c(2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L), month = c("January",
"February", "March", "April", "May", "June", "January", "February",
"March", "April", "May", "June"), city = c("Omaha", "Omaha",
"Omaha", "Omaha", "Omaha", "Omaha", "Lincoln", "Lincoln", "Lincoln",
"Lincoln", "Lincoln", "Lincoln"), cost = c(45.5, 46.75, 44.5,
NA, NA, NA, 55.25, 53.8, 62.5, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-12L))
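The two approaches above agree on this data but are not equivalent in general: fill() carries the most recent non-NA value forward, whereas last(na.omit(cost)) replaces every NA in the group with the group's final observed value, including NAs that come before it. A minimal base R sketch of the carry-forward behaviour, with a hypothetical helper named locf, might look like this:

```r
# Reduced version of the sample data (city and cost only)
df <- data.frame(
  city = rep(c("Omaha", "Lincoln"), each = 6),
  cost = c(45.5, 46.75, 44.5, NA, NA, NA, 55.25, 53.8, 62.5, NA, NA, NA)
)

# Hypothetical helper: last observation carried forward
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values have been seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # idx == 0 maps to NA, so leading NAs stay NA
}

# Apply the helper within each city
df$cost <- ave(df$cost, df$city, FUN = locf)
```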

dplyr sample_n by one variable through another one

I have a data frame with a "grouping" variable season and another variable year which is repeated for each month.
df <- data.frame(month = as.character(sapply(month.name,function(x)rep(x,4))),
season = c(rep("winter",8),rep("spring",12),rep("summer",12),rep("autumn",12),rep("winter",4)),
year = rep(2021:2024,12))
I would like to use dplyr::sample_n or something similar to choose 2 months in the data frame for each season and keep the same months for all the years, for example:
month season year
1 January winter 2021
2 January winter 2022
3 January winter 2023
4 January winter 2024
5 February winter 2021
6 February winter 2022
7 February winter 2023
8 February winter 2024
9 March spring 2021
10 March spring 2022
11 March spring 2023
12 March spring 2024
13 May spring 2021
14 May spring 2022
15 May spring 2023
16 May spring 2024
17 June summer 2021
18 June summer 2022
19 June summer 2023
20 June summer 2024
21 July summer 2021
22 July summer 2022
23 July summer 2023
24 July summer 2024
25 October autumn 2021
26 October autumn 2022
27 October autumn 2023
28 October autumn 2024
29 November autumn 2021
30 November autumn 2022
31 November autumn 2023
32 November autumn 2024
I cannot use df %>% group_by(season,year) %>% sample_n(2) because it chooses different months for each year.
Thanks!
We can randomly sample 2 values from month and filter them by group.
library(dplyr)
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),2))
# month season year
# <chr> <chr> <int>
# 1 January winter 2021
# 2 January winter 2022
# 3 January winter 2023
# 4 January winter 2024
# 5 February winter 2021
# 6 February winter 2022
# 7 February winter 2023
# 8 February winter 2024
# 9 March spring 2021
#10 March spring 2022
# … with 22 more rows
If some groups have fewer than 2 unique values, we can sample the minimum of 2 and the number of unique values in the group.
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),min(2, n_distinct(month))))
Using the same logic with base R, we can use ave
df[as.logical(with(df, ave(month, season,
FUN = function(x) x %in% sample(unique(x),2)))), ]
An option using slice
library(dplyr)
df %>%
group_by(season) %>%
slice(which(!is.na(match(month, sample(unique(month), 2)))))
# A tibble: 32 x 3
# Groups: season [4]
# month season year
# <fct> <fct> <int>
# 1 October autumn 2021
# 2 October autumn 2022
# 3 October autumn 2023
# 4 October autumn 2024
# 5 November autumn 2021
# 6 November autumn 2022
# 7 November autumn 2023
# 8 November autumn 2024
# 9 April spring 2021
#10 April spring 2022
# … with 22 more rows
Or using base R
by(df, df$season, FUN = function(x) subset(x, month %in% sample(unique(month), 2 )))
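All of the variants above rely on the same idea: draw the months once per season, then keep every row whose month was drawn, which is why the same months appear for all years. A base R sketch that makes the two steps explicit follows; the seed value and the names picked and res are arbitrary choices for illustration, and it assumes each month belongs to exactly one season, as in the question's data.

```r
set.seed(42)  # arbitrary seed, only so the draw is reproducible

df <- data.frame(
  month  = rep(month.name, each = 4),
  season = c(rep("winter", 8), rep("spring", 12), rep("summer", 12),
             rep("autumn", 12), rep("winter", 4)),
  year   = rep(2021:2024, 12)
)

# Step 1: draw two distinct months per season
picked <- unlist(lapply(split(df$month, df$season),
                        function(m) sample(unique(m), 2)))

# Step 2: keep all rows (all years) for the drawn months
res <- df[df$month %in% picked, ]
```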

How to add missing months to a data frame?

I have a dataset with three observations: January, February, and March. I would like to add the remaining months as observations of zero to the same datatable, but I'm having trouble appending these.
Here's my current code:
library(dplyr)
Period <- c("January 2015", "February 2015", "March 2015",
"January 2016", "February 2016", "March 2016",
"January 2017", "February 2017", "March 2017",
"January 2018", "February 2018", "March 2018")
Month <- c("January", "February", "March",
"January", "February", "March",
"January", "February", "March",
"January", "February", "March")
Dollars <- c(936, 753, 731,
667, 643, 588,
948, 894, 997,
774,745, 684)
dat <- data.frame(Period = Period, Month = Month, Dollars = Dollars)
dat2 <- dat %>%
dplyr::select(Month, Dollars) %>%
dplyr::group_by(Month) %>%
dplyr::summarise(AvgDollars = mean(Dollars))
Any ideas for populating April through December in the dataset are greatly appreciated. Thanks in advance!
Here's a way to do it in one step using complete() from tidyr, which is loaded with the tidyverse:
library(tidyverse)
dat2 <- data.frame(Period = Period, Month = Month, Dollars = Dollars) %>%
# make a "year" variable
mutate(Year = word(Period, 2,2)) %>%
# remove period variable (we'll add it in later)
select(-Period) %>%
# month.name is a base variable listing all months (thanks @Gregor).
# nesting by "Year" lets complete know you only want the years listed in your dataset.
complete(Month = month.name, nesting(Year), fill = list(Dollars = 0)) %>%
# Arrange by Year and month
arrange(Year, Month) %>%
#remake the "period" variable
mutate(Period = paste(Month, Year)) %>%
group_by(Month) %>%
summarise(AvgDollars = mean(Dollars))
Here is a two-step solution:
library(dplyr)
Sys.setlocale("LC_TIME", "English") # Windows locale name; other systems may need e.g. "en_US.UTF-8"
# first, define a data frame with each month from January 2015 to December 2018
dat2 <- data.frame(Period = format(seq(as.Date("2015/1/1"),
as.Date("2018/12/1"), by = "month"),
format = "%B %Y"))
# derive Month from the new Period column by dropping the trailing year;
# a bare Period inside data.frame() would pick up the 12-element vector from the question
dat2$Month <- sub(" [0-9]{4}$", "", dat2$Period)
# then, merge dat into dat2 (left_join keeps the row order of dat2)
dat2 %>%
left_join(select(dat, Period, Dollars), by = "Period")
Period Month Dollars
1 January 2015 January 936
2 February 2015 February 753
3 March 2015 March 731
4 April 2015 April NA
5 May 2015 May NA
6 June 2015 June NA
7 July 2015 July NA
8 August 2015 August NA
9 September 2015 September NA
10 October 2015 October NA
11 November 2015 November NA
12 December 2015 December NA
13 January 2016 January 667
14 February 2016 February 643
15 March 2016 March 588
16 April 2016 April NA
17 May 2016 May NA
18 June 2016 June NA
19 July 2016 July NA
20 August 2016 August NA
21 September 2016 September NA
22 October 2016 October NA
23 November 2016 November NA
24 December 2016 December NA
25 January 2017 January 948
26 February 2017 February 894
27 March 2017 March 997
28 April 2017 April NA
29 May 2017 May NA
30 June 2017 June NA
31 July 2017 July NA
32 August 2017 August NA
33 September 2017 September NA
34 October 2017 October NA
35 November 2017 November NA
36 December 2017 December NA
37 January 2018 January 774
38 February 2018 February 745
39 March 2018 March 684
40 April 2018 April NA
41 May 2018 May NA
42 June 2018 June NA
43 July 2018 July NA
44 August 2018 August NA
45 September 2018 September NA
46 October 2018 October NA
47 November 2018 November NA
48 December 2018 December NA
Maybe there's a more graceful solution with dplyr, but here is a quick solution without much typing:
dat <- rbind(data.frame(Period = Period, Month = Month, Dollars = Dollars),
data.frame(Period = c(sapply(2015:2018, function(x) format(ISOdate(x,4:12,1),"%B %Y"))),
Month = c(sapply(2015:2018, function(x) format(ISOdate(x,4:12,1),"%B"))),
Dollars = 0))
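A locale-independent base R sketch of the same idea is to build every Year x Month combination with expand.grid(), merge the observed data onto it, and set the unmatched months to zero. The Year column and the names grid, full and avg are assumptions introduced for this illustration; the question stores the year inside Period instead.

```r
dat <- data.frame(
  Year    = rep(2015:2018, each = 3),
  Month   = rep(c("January", "February", "March"), times = 4),
  Dollars = c(936, 753, 731, 667, 643, 588, 948, 894, 997, 774, 745, 684)
)

# Every Year x Month combination (month.name is built into R)
grid <- expand.grid(Month = month.name, Year = unique(dat$Year),
                    stringsAsFactors = FALSE)

# Keep all grid rows; months without data get NA, then 0
full <- merge(grid, dat, by = c("Year", "Month"), all.x = TRUE)
full$Dollars[is.na(full$Dollars)] <- 0

# Order by calendar month rather than alphabetically
full$Month <- factor(full$Month, levels = month.name)
full <- full[order(full$Year, full$Month), ]

# Average per month across years, as in the question's dat2
avg <- aggregate(Dollars ~ Month, data = full, FUN = mean)
```

Because the added months are zeros rather than NAs, they enter the averages as real observations, so April through December average to 0.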

sorting of month in matrix in R

I have a matrix in this format:
year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274
I need to sort the months in calendar order, i.e. Jan, Feb, Mar, and so on. When I sort, they end up in alphabetical order instead. I used this:
mat <- mat[order(mat[,1], decreasing = TRUE), ]
and it looks like this :
row.names April August December February January July June March May November October September
1 2015 59535 0 0 24258 22785 0 31356 40274 84211 0 0 0
2 2014 466 10982 35881 17 0 2981 1279 289 879 8911 8565 4000
Can we sort months in calendar order in R?
Suppose DF is the data frame from which you derived your matrix. We provide such a data frame in reproducible form at the end. Ensure that month and year are factors with appropriate levels. Note that month.name is a builtin variable in R that is used here to ensure that the month levels are appropriately sorted and we have assumed year is a numeric column. Then use levelplot like this:
DF2 <- transform(DF,
month = factor(as.character(month), levels = month.name),
year = factor(year)
)
library(lattice)
levelplot(Freq ~ year * month, DF2)
Note: Here is DF in reproducible form:
Lines <- " year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274 "
DF <- read.table(text = Lines, header = TRUE)
Assuming you want to sort by time (a dummy day 1 is added so that the strings can be parsed as dates):
time <- as.Date(paste(1, mat$month, mat$year), format = "%d %B %Y")
mat <- mat[order(time), ] # order() gives the row permutation that sorts the dates
Or, if you don't care about the year:
time <- as.Date(paste(1, mat$month, 2000), format = "%d %B %Y")
mat <- mat[order(time), ]
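Parsing month names with "%B" depends on the session locale. A sketch that avoids date parsing altogether is to look up each month's position in the built-in month.name vector with match(). The small data frame below is a subset of the question's data, and the names sorted and wide are made up for this illustration.

```r
mat <- data.frame(
  year  = rep(c(2014, 2015), times = 3),
  month = rep(c("April", "August", "January"), each = 2),
  Freq  = c(466, 59535, 10982, 0, 0, 22785)
)

# Long format: match() gives the calendar position (January = 1, ...)
sorted <- mat[order(mat$year, match(mat$month, month.name)), ]

# Wide format (years in rows, months in columns): reorder the columns
wide <- xtabs(Freq ~ year + month, data = mat)
wide <- wide[, month.name[month.name %in% colnames(wide)]]
```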

Creating a column with differences based on another column

I have a data frame that looks like this (simplified from 699 treaties):
TRT <- data.frame(T.ID=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8),
Treaty=c("hungary slovenia 1994", "hungary slovenia 1994",
"nicaragua taiwan 2006", "nicaragua taiwan 2006",
"ukraine uzbekistan 1994", "ukraine uzbekistan 1994",
"brazil uruguay 1986", "brazil uruguay 1986",
"albania macedonia 2002", "albania macedonia 2002",
"albania moldova 2003", "albania moldova 2003",
"albania romania 2003", "albania romania 2003",
"Treaty of Izmir 1977","Treaty of Izmir 1977",
"Treaty of Izmir 1977"),
sc.y=c("HUN1994", "SLV1994", "NIC2006", "TAW2006", "UKR1994",
"UZB1994", "BRA1986", "URU1986", "ALB2002", "MAC2002",
"ALB2003", "MLD2003", "ALB2003", "RUM2003", "IRN1977",
"TUR1977", "PAK1977"),
prom.demo=c(1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0),
polity=c(10,10,8,10,7,-9,7,9,7,9,7,8,7,8,-10,-7,9))
In the end, I want to have a data frame that lists each treaty only once, its value of the “prom.demo”-column and one column that contains the difference of the maximum and minimum among the “polity”-values of the contracting parties of each treaty (most treaties have only two contracting parties, but some have up to 51).
Is there any R command that spares me 699 calculations?
Using dplyr, it's a join on scode and year, followed by grouping by Treaty and then working out the difference between the max and min polity:
library(dplyr)
left_join(treaties, Polity, c("scode","year")) %>%
group_by(Treaty) %>%
summarise(PolityDiff = max(polity, na.rm = TRUE) - min(polity, na.rm = TRUE))
Source: local data frame [8 x 2]
Treaty PolityDiff
1 albania macedonia 2002 2
2 albania moldova 2003 1
3 albania romania 2003 1
4 brazil uruguay 1986 2
5 hungary slovenia 1994 0
6 nicaragua taiwan 2006 2
7 Treaty of Izmir 1977 NA
8 ukraine uzbekistan 1994 16
The NA's are where you don't have any matching scode/year (The Treaty of Izmir is IRN/TUR/PAK in 1977, and none of those are in the Polity data).
Note that if you want NA whenever any one of the participating countries is missing from the Polity data, use:
left_join(treaties, Polity, c("scode","year")) %>%
group_by(Treaty) %>%
summarise(PolityDiff = max(polity) - min(polity))
which gives:
Treaty PolityDiff
1 albania macedonia 2002 2
2 albania moldova 2003 1
3 albania romania 2003 1
4 brazil uruguay 1986 2
5 hungary slovenia 1994 NA
6 nicaragua taiwan 2006 2
7 Treaty of Izmir 1977 NA
8 ukraine uzbekistan 1994 16
because Slovenia is coded as SLV in Polity but as SLO in the treaties data (possibly a mistake). Either way, there is no SLO/1994 in Polity, so that treaty returns NA in this variant. It returns zero in my first example because the NA gets dropped, and the polity difference is then the difference between one number and itself.
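If the polity scores are already in the treaty table, as in the TRT sample from the question, no join is needed. The following base R sketch uses three treaties copied from that sample and computes one row per treaty with aggregate(); the output column name PolityDiff follows the answer above, and res is a name chosen for this illustration. It assumes prom.demo is constant within a treaty, which holds in the sample.

```r
TRT <- data.frame(
  Treaty = c("hungary slovenia 1994", "hungary slovenia 1994",
             "ukraine uzbekistan 1994", "ukraine uzbekistan 1994",
             "Treaty of Izmir 1977", "Treaty of Izmir 1977",
             "Treaty of Izmir 1977"),
  prom.demo = c(1, 1, 0, 0, 0, 0, 0),
  polity    = c(10, 10, 7, -9, -10, -7, 9)
)

# One row per treaty: keep prom.demo and compute the polity range
res <- aggregate(polity ~ Treaty + prom.demo, data = TRT,
                 FUN = function(p) max(p) - min(p))
names(res)[names(res) == "polity"] <- "PolityDiff"
```

With the formula interface, aggregate() drops rows that contain NA by default, so a treaty with a missing polity score is computed from its remaining parties.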
