Replace missing values with mean for subsets of dataframe - r

I have a data frame titled final_project_data with the following structure. It includes 17 columns with data corresponding to county/state and year. For example, Baldwin County in Alabama in 2006 had a population of 69162, an unemployment rate of 4.2%, and so on.
    ID County         State   Population  Year Ump.Rate Fertility
 <dbl> <chr>          <chr>        <dbl> <dbl>    <dbl>     <dbl>
  1003 Baldwin County Alabama      69162  2006      4.2        88
  1015 Calhoun County Alabama     112903  2006      2.4        NA
  1043 Baldwin County Alabama         NA  2007      1.9        71
  1049 Calhoun County Alabama      68014  2007       NA        90
  1050 CountyY        Alaska        2757  2006      3.9        NA
  1070 CountyZ        Alaska       11000  2006      7.8        95
  1081 CountyY        Alaska          NA  2007      6.5        70
  1082 CountyZ        Alaska       67514  2007      4.5        60
There are a number of columns with missing values in them, which I am trying to replace with the mean for the given State and Year. I am running into issues trying to loop over each column with missing values and then each subset of years and rows to fill in the missing values with the mean. The code I have thus far is below:
#get list of unique states
states <- unique(final_project_data$State)
#get list of columns with NA in them - we will use this to impute missing values
list_na <- colnames(final_project_data)[ apply(final_project_data, 2, anyNA) ]
list_na
#create a place to hold the missing values
average_missing <- c()
#Loop through each state to impute the missing values with the mean
for(i in 1:length(states)){
average_missing <- apply(final_project_data[which(final_project_data$State == states[i]),colnames(final_project_data) %in% list_na], 2, mean, na.rm = TRUE)
}
average_missing
However, when I run the above bit of code, I only get one set of values for each of the columns with missing values, not for a different value for every state. I am also not sure how to extend this to include years. Any help or advice would be appreciated!

In a for loop:
dt <- data.frame(
  ID = c(1003, 1015, 1043, 1049, 1050, 1070, 1081, 1082, NA, NA),
  State = c(rep("Alabama", 4), rep("Alaska", 4), "Alabama", "Alaska"),
  Population = c(sample(10000:100000, 8, replace = TRUE), NA, NA),
  Year = c(2006, 2006, 2007, 2007, 2006, 2006, 2007, 2007, 2007, 2006),
  Unemployment = c(sample(1:5, 8, replace = TRUE), NA, NA)
)
# index through each row in the data frame
for (i in 1:nrow(dt)) {
  # if the Population value is NA
  if (is.na(dt$Population[i])) {
    # impute the mean of Population over all rows with the same State and Year
    dt$Population[i] <- mean(dt$Population[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])],
                             na.rm = TRUE)
  }
  # repeat for the Unemployment variable
  if (is.na(dt$Unemployment[i])) {
    dt$Unemployment[i] <- mean(dt$Unemployment[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])],
                               na.rm = TRUE)
  }
}

Here's a dplyr version without a loop. Just add all the columns you want to transform inside vars():
your_data %>%
  group_by(State, Year) %>%
  mutate_at(vars(Population, Ump.Rate, Fertility),
            ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
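As a side note, on dplyr 1.0 or later the same idea can be written with across() instead of the superseded mutate_at(); a minimal sketch, assuming the same grouping and column names as above:
library(dplyr)
your_data %>%
  group_by(State, Year) %>%
  mutate(across(c(Population, Ump.Rate, Fertility),
                # replace each NA with the State/Year group mean
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()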

Related

Creating a new variable from the ranges of another column in which the ranges change - R

I am a beginner in R so sorry if it is a very simple question. I looked but I could not find the same problem.
I want to create a new variable from the ranges of another column in R but the ranges are not the same for each row.
To be more specific, my data covers the years 1960-2000 and I have value ranges for occupations. For 1960 to 1980 a teacher is 1 and a lawyer is 2, etc. For 1980-1990 a teacher is in the value range 1-29 and a lawyer is 50-89, etc. Then finally for 1990-2000, the value range for a teacher is 40-65 and for a lawyer it is 1-39.
I don't even know how to begin with it (teacher and lawyer are not the only occupations; there are 10 different occupations with overlapping value ranges for different years, which makes it very confusing for me).
I would appreciate your help. Thank you very much.
Here are a couple of approaches to get you started.
First, say you have a data frame with year and occupation_code:
df1 <- data.frame(
  year = c(1965, 1985, 1995),
  occupation_code = c(1, 2, 3)
)
  year occupation_code
1 1965               1
2 1985               2
3 1995               3
Then, create a second data frame which will clearly indicate the year ranges and occupation code ranges with each occupation. You can include all of your occupations here.
df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)
  year_start year_end occupation_code_start occupation_code_end occupation
1       1960     1980                     1                   1    teacher
2       1960     1980                     2                   2     lawyer
3       1980     1990                     1                  29    teacher
4       1980     1990                    50                  89     lawyer
5       1990     2000                    40                  65    teacher
6       1990     2000                     1                  39     lawyer
Then, you can merge the two together.
One approach is with data.table package.
library(data.table)
setDT(df1)
setDT(df2)
df2[df1,
    on = .(year_start <= year,
           year_end >= year,
           occupation_code_start <= occupation_code,
           occupation_code_end >= occupation_code),
    .(year, occupation = occupation)]
This will give you:
year occupation
1: 1965 teacher
2: 1985 teacher
3: 1995 lawyer
Another approach is with fuzzyjoin and tidyverse:
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
                by = c("year" = "year_start",
                       "year" = "year_end",
                       "occupation_code" = "occupation_code_start",
                       "occupation_code" = "occupation_code_end"),
                match_fun = list(`>=`, `<=`, `>=`, `<=`)) %>%
  select(year, occupation)
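For completeness, on dplyr 1.1 or later the same non-equi join can be written directly with join_by(); a rough sketch, assuming the df1 and df2 defined above:
library(dplyr)
left_join(df1, df2,
          by = join_by(between(year, year_start, year_end),
                       between(occupation_code,
                               occupation_code_start, occupation_code_end))) %>%
  # keep only the original year plus the matched occupation
  select(year, occupation)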

Aggregating and counting elements in the variables of a dataset

I might not have searched for the right terms; sorry if this has been asked before.
I have a dataset with multiple columns:
helena <-
Year US$ Euros Country Regions
2001 12 13 US America
2000 13 15 UK Europe
2003 14 19 China Asia
I want to group the dataset so that, for each region, I have the yearly totals of the earnings, plus a column showing how many countries have reported their data per region each year.
helena <-
Year US$ Euros Regions Number of countries per region per Year
2000 150 135 America 2
2001 135 151 Europe 15
2002 142 1900 Asia 18
Yet, I have tried
count(helena, c("Regions", "Year"))
but it does not work properly since it includes only the columns indicated.
Here is the data.table way; I have added a row for Canada in year 2000 to test the code:
library(data.table)
df <- data.frame(Year = c(2000, 2001, 2003, 2000),
                 US = c(13, 12, 14, 13),
                 Euros = c(15, 13, 19, 15),
                 Country = c('US', 'UK', 'China', 'Canada'),
                 Regions = c('America', 'Europe', 'Asia', 'America'))
df <- data.table(df)
df[,
   .(sum_US = sum(US),
     sum_Euros = sum(Euros),
     number_of_countries = uniqueN(Country)),
   .(Regions, Year)]
Regions Year sum_US sum_Euros number_of_countries
1: America 2000 26 30 2
2: Europe 2001 12 13 1
3: Asia 2003 14 19 1
With dplyr:
library(dplyr)
your_data %>%
  group_by(Regions, Year) %>%
  summarize(
    US = sum(US),
    Euros = sum(Euros),
    N_countries = n_distinct(Country)
  )
Using dplyr:
library(dplyr)
df %>%
  group_by(Regions, Year) %>%
  summarise(Earnings_US = sum(`US$`),
            Earnings_Euros = sum(Euros),
            N_Countries = length(Country))
This aggregates the data set by region and year, summing the earnings columns and taking the length of the country column (assuming countries are unique within each region and year).
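If you prefer base R, roughly the same result can be sketched with aggregate() plus merge(), assuming the df built in the data.table answer above:
# sums of the earnings columns by region and year
sums <- aggregate(cbind(US, Euros) ~ Regions + Year, data = df, FUN = sum)
# number of distinct countries reporting per region and year
counts <- aggregate(Country ~ Regions + Year, data = df,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "N_countries"
merge(sums, counts, by = c("Regions", "Year"))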
Using the tidyverse and building on the example:
library(tidyverse)
df <- tibble(Year = c(2000, 2001, 2003, 2000),
             US = c(13, 12, 14, 13),
             Euros = c(15, 13, 19, 15),
             Country = c('US', 'UK', 'China', 'Canada'),
             Regions = c('America', 'Europe', 'Asia', 'America'))
df %>%
  group_by(Regions, Year) %>%
  summarise(US = sum(US),
            Euros = sum(Euros),
            Countries = n_distinct(Country))
updated to reflect the data in the original question

How do I avoid a slow loop with large data set?

Consider this data set:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
As you can see, half of the rows have duplicate information. For a small data set like this it is really easy to remove duplicates: I could use the coalesce function (dplyr package), get rid of the "action" column and then erase all the irrelevant rows, though there are many other ways. The final result should look like this:
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
The problem is that my real data set is MUCH bigger (102,000 x 270) and there are many more variables. The real data is also more irregular and has more missing values. The coalesce function seems very slow. The best loop I could make so far still takes up to 5-10 minutes to run.
Is there a simple way of doing this which would be faster? I have the feeling that there must be some function in R for that kind of operation, but I couldn't find any.
I think you need dcast. The version in the data.table library calls itself "fast", and in my experience, it is speedy on large datasets.
First, let's create one column which is either the signature_date or ratification_date, depending on the action
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now, let's cast it so that the action are the columns and the value is the date
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
So wide looks like this
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
The OP has said that his production data has 100k rows x 270 columns and that speed is a concern, so I suggest using data.table.
I'm aware that Harland has also proposed using data.table and dcast(), but the solution below is a different approach. It brings the rows into the correct order and copies the ratification_date to the signature row. After some clean-up we get the desired result.
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
               by = c("Agreement_number", "country")][
                 # keep only signature rows, remove action column
                 action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
Data
The OP has mentioned that his production data has 270 columns. To simulate this I've added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed() is used for repeatable results when sampling.
Agreement_number country action signature_date ratification_date dummy1 dummy2
1 1 Canada signature 2000 NA 2 D
2 1 Canada ratification NA 2001 2 D
3 1 USA signature 2000 NA 3 A
4 1 USA ratification NA 2002 3 A
5 2 Canada signature 2001 NA 1 B
6 2 Canada ratification NA 2001 1 B
7 2 USA signature 2002 NA 4 C
8 2 USA ratification NA 2002 4 C
Addendum: dcast() with additional columns
Harland has suggested to use data.table and dcast(). Besides several other flaws in his answer, it doesn't handle the additional columns the OP has mentioned.
The dcast() approach below will return also the additional columns:
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification
1: 1 Canada 2 D 2000 2001
2: 1 USA 3 A 2000 2002
3: 2 Canada 1 B 2001 2001
4: 2 USA 4 C 2002 2002
Note that this approach will change the order of columns.
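If the original column order matters, it can be restored by reference afterwards; a small follow-up sketch, assuming the result object above:
# put the id columns first, then the cast date columns, then the dummies
setcolorder(result, c("Agreement_number", "country",
                      "signature", "ratification", "dummy1", "dummy2"))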
Here is another data.table solution using uwe-block's data.frame. It is similar to uwe-block's method, but uses max to collapse the data.
# covert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
The lapply runs through variables not included in the by statement and calculates the maximum after removing NA values. The next link in the chain drops the unneeded action variable and the final (unnecessary) link prints the output.
This returns
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
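The same max-collapse idea can also be sketched with dplyr (1.0 or later), assuming the DATA with dummy columns from the Data section above and converting any factors to characters first:
library(dplyr)
DATA %>%
  # max() is not defined for unordered factors, so convert them first
  mutate(across(where(is.factor), as.character)) %>%
  select(-action) %>%
  group_by(Agreement_number, country) %>%
  # collapse each group by taking the maximum of every remaining column
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)), .groups = "drop")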

R - group_by utilizing splinefun

I am trying to group my data by Year and CountyID and then use splinefun (cubic spline interpolation) on each subset of the data. I am open to ideas; however, splinefun is a must and cannot be changed.
Here is the code I am trying to use:
age <- seq(from = 0, by = 5, length.out = 18)
TOT_POP <- df %.%
group_by(unique(df$Year), unique(df$CountyID) %.%
splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
Here is a sample of my data: Year = 2010:2013, Agegrp = 1:17, and CountyID covers all counties in the US.
CountyID Year Agegrp TOT_POP
1001 2010 1 3586
1001 2010 2 3952
1001 2010 3 4282
1001 2010 4 4136
1001 2010 5 3154
What I am doing is taking Agegrp 1:17 and splitting the grouping into individual years of age, 0-84. Right now each group represents 5 years. splinefun allows me to do this while providing a level of mathematical rigour to the process, i.e., it allows me to provide a population total for each single year of age, in each individual county in the US.
Lastly, the splinefun code by itself does work, but within the group_by function it does not; it produces:
Error: wrong result size(4), expected 68 or 1.
The splinefun code the way I am using it works like this
TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)),
method = "hyman")
TOT_POP = pmax(0, diff(TOT_POP(c(0:85))))
Which was tested on one CountyID during one Year. I need to iterate this process over "x" amount of years and roughly 3200 counties.
# Reproducible data set
set.seed(22)
df = data.frame( CountyID = rep(1001:1005,each = 100),
Year = rep(2001:2010, each = 10),
Agegrp = sample(1:17, 500, replace=TRUE),
TOT_POP = rnorm(500, 10000, 2000))
# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"
# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))
# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for (i in 1:length(split.dfs)) {
  new.split.dfs[[i]] = data.frame(CountyID = split.dfs[[i]]$CountyID[1],
                                  Year = split.dfs[[i]]$Year[1],
                                  age = 0:85,
                                  TOT_POP = spline.funs[[i]](0:85))
}
# Does this do what you want? If so, then it will be
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age TOT_POP
# 1 1001 2001 0 909033.4
# 2 1001 2001 1 833999.8
# 3 1001 2001 2 763181.8
# 4 1001 2001 3 696460.2
# 5 1001 2001 4 633716.0
# 6 1001 2001 5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age TOT_POP
# 81 1002 2001 80 10201.693
# 82 1002 2001 81 9529.030
# 83 1002 2001 82 8768.306
# 84 1002 2001 83 7916.070
# 85 1002 2001 84 6968.874
# 86 1002 2001 85 5923.268
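If it helps, the per-county/per-year pieces can then be stitched back into a single data frame; a small follow-up, assuming new.split.dfs was built as above:
# combine the list of interpolated data frames into one
interpolated <- do.call(rbind, new.split.dfs)
head(interpolated)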
First, I believe I was using the wrong wording in what I was trying to achieve, my apologies; group_by actually wasn't going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:
library(plyr)  # for colwise() and ddply()

interpolate <- function(x, ageVector){
  result <- splinefun(ageVector,
                      c(0, cumsum(x)), method = "hyman")
  diff(result(c(0:85)))
}

mainFunc <- function(df){
  age <- seq(from = 0, by = 5, length.out = 18)
  colNames <- setdiff(colnames(df),
                      c("Year", "CountyID", "AgeGrp"))
  colWiseSpline <- colwise(interpolate, .cols = true,
                           age)(df[ , colNames])
  cbind(data.frame(
          Year = df$Year[1],
          County = df$CountyID[1],
          Agegrp = 0:84
        ),
        colWiseSpline
  )
}

CompleteMainRaw <- ddply(.data = df,
                         .variables = .(CountyID, Year),
                         .fun = mainFunc)
The code now takes each county by year and runs splinefun on that subset of the population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 age groups into 85 single years of age while apportioning the totals appropriately, which is what splinefun does.
Thanks!

Sum duplicates then remove all but first occurrence

I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent (there are year and month variables), and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.
I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns; I need to retain these for all the first duplicate id values. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
year = rep(c("2006", "2007"), each = 5),
month = rep(c("December", "January"), each = 5),
x = round(rnorm(10, 10, 3), 2))
# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
year = c(rep("2006", 4), rep("2007", 3)),
month = c(rep("December", 4), rep("January", 3)),
x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))
# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
.(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes,
all.x = TRUE), FUN = sum)
I got different sums, but that was because I forgot the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
id year month x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December 7.16
6 8503 2007 January 13.23
7 2961 2007 January 7.38
9 8564 2007 January 7.21
(To be safer it would be better to work on a copy. And you might need to add an ordering step.)
You could do this with data.table (quicker and more memory-efficient than plyr), with a bit of self-joining fun using mult = 'first'. Keying by id, year and month will sort by id, then year, then month.
library(data.table)
DT <- data.table(dat1, key = c('id','year','month'))
# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[,list(x=sum(x)),by=id],mult='first'][,x:=NULL],'x.1','x')
Or a simpler approach:
DT = as.data.table(dat1)
DT[,x:=sum(x),by=id][!duplicated(id)]
id year month x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December 5.97
5: 8503 2007 January 12.48
6: 2961 2007 January 10.22
7: 8564 2007 January 11.41
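For comparison, a dplyr sketch of the same idea; it assumes, as in the example, that dat1 is already in chronological order so that first() picks the earliest year and month (note the result comes back sorted by id rather than in first-occurrence order):
library(dplyr)
dat1 %>%
  group_by(id) %>%
  summarise(year = first(year),   # keep the chronologically first year
            month = first(month), # and month for each id
            x = sum(x))           # sum x across the duplicates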
