Conditional imputation of one variable using Dplyr - r

I have a dataset (main dataset) which looks like this:
id cleaning_fee boro zipcode price
1 NA Manhattan 10014 100
2 70 Manhattan 10013 125
3 NA Brooklyn 11201 97
4 25 Manhattan 10012 110
5 30 Staten Island 10305 60
Grouping by borough and zipcode, I get this (using na.rm = TRUE):
borough zipcode avgCleaningFee
Brooklyn 11217 88.32000
Brooklyn 11231 89.05085
Brooklyn 11234 42.50000
Manhattan 10003 97.03738
Manhattan 10011 109.97647
What I want to do is impute the NAs in the 'cleaning_fee' variable in my main dataset by either:
(a) imputing the grouped mean (as shown above in table 2 where I group on 2 conditions)
or
(b) use KNN regression on variables such as zipcode, boro and the price to impute the cleaning fee variable. (PS I understand how KNN regression works but I haven't used it, would be great if you can explain the code in 1 line or so)
Would be great if anyone can help me out with this. Thanks!!

We can use the first method (column names as in the question's main dataset):
library(dplyr)
df1 %>%
  group_by(boro, zipcode) %>%
  mutate(cleaning_fee = replace(cleaning_fee,
                                is.na(cleaning_fee),
                                mean(cleaning_fee, na.rm = TRUE)))
Or with na.aggregate from zoo
library(zoo)
df1 %>%
  group_by(boro, zipcode) %>%
  mutate(cleaning_fee = na.aggregate(cleaning_fee))
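One caveat with grouped means: if every cleaning_fee in a boro/zipcode group is NA, the group mean is NaN and nothing gets imputed. Below is a minimal sketch on the five sample rows from the question that falls back to the borough mean in that case. The fallback step and the helper columns fee_zip and fee_boro are my own assumptions, not part of the answer above.

```r
library(dplyr)

# The five sample rows from the question
df1 <- data.frame(
  id = 1:5,
  cleaning_fee = c(NA, 70, NA, 25, 30),
  boro = c("Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Staten Island"),
  zipcode = c(10014, 10013, 11201, 10012, 10305),
  price = c(100, 125, 97, 110, 60)
)

imputed <- df1 %>%
  group_by(boro, zipcode) %>%
  mutate(fee_zip = mean(cleaning_fee, na.rm = TRUE)) %>%   # NaN if the whole group is NA
  group_by(boro) %>%
  mutate(fee_boro = mean(cleaning_fee, na.rm = TRUE)) %>%  # hypothetical fallback level
  ungroup() %>%
  mutate(cleaning_fee = ifelse(is.na(cleaning_fee),
                               ifelse(is.na(fee_zip), fee_boro, fee_zip),
                               cleaning_fee)) %>%
  select(-fee_zip, -fee_boro)
```

Here id 1 is the only row in zipcode 10014, so it picks up the Manhattan mean instead. id 3 is the only Brooklyn row, so it still cannot be imputed from this sample.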

Related

Summarise rows in dataframe by two columns

I have this data frame called World, which shows the following:
City Year Income Tourist
London 2008 50 100
NY 2009 75 250
Paris 2010 45 340
Dubai 2008 32 240
London 2011 50 140
Abu Dhabi 2009 60 120
Paris 2009 70 140
NY 2007 50 150
Tokyo 2008 45 150
Dubai 2010 40 480
#With 207 more rows
I want to summarise the rows so that each city shows its total income and tourists across all years. So I want code that combines the years for each City and sums them, so that every city has just one row.
Something like this:
City Income Tourist
London 1051 5040
NY 1547 5432
Paris 2600 4321
Dubai 3222 5312
Abu Dhabi 3100 7654
Tokyo 2404 4321
#With 40 more rows
From the research I've done, n_distinct and group_by should be used.
Base R solution:
You can use the sapply() function to iterate over cities:
the first argument is a vector of unique cities;
the function we pass selects all the rows (years) of each city and returns the "Income" and "Tourist" columns;
colSums() sums the column values;
t() transposes the output.
t( sapply( unique( World$City ),function(CITY) colSums(World[World$City==CITY,c("Income","Tourist")] ) ) )
Solution with R's data.table package:
Make sure your object is of type data.table.
in the j part of the bracket (the do part):
you can provide names to the wanted columns ("Income="),
and specify the wanted output ("sum(Income)").
To group the cities, add a by argument to the data.table object.
World[,.(Income=sum(Income),Tourist=sum(Tourist)),by=City]
Yes, you can use the group_by and summarise functions.
library(dplyr)
World %>% group_by(City) %>% summarise(across(c(Income, Tourist), sum))
You can also add Year to the group_by call, which gives one row per city and year instead of one row per city.
World %>% group_by(City, Year) %>% summarise(across(c(Income, Tourist), sum))
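As a quick check, here is a sketch that runs the group_by/summarise approach on just the ten rows shown in the question. The totals will not match the question's desired output, which was computed over the full data set; the name totals is mine.

```r
library(dplyr)

# The ten rows shown in the question
World <- data.frame(
  City = c("London", "NY", "Paris", "Dubai", "London",
           "Abu Dhabi", "Paris", "NY", "Tokyo", "Dubai"),
  Year = c(2008, 2009, 2010, 2008, 2011, 2009, 2009, 2007, 2008, 2010),
  Income = c(50, 75, 45, 32, 50, 60, 70, 50, 45, 40),
  Tourist = c(100, 250, 340, 240, 140, 120, 140, 150, 150, 480)
)

# One row per city, with Income and Tourist summed over all years
totals <- World %>%
  group_by(City) %>%
  summarise(across(c(Income, Tourist), sum))
```

On these rows London comes out with Income 100 and Tourist 240, since its 2008 and 2011 rows are added together.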

Calculating conditional summaries of grouped data in dplyr

I have a dataset of population mortality data segregated by year, decile (ranked) of deprivation, gender, cause of death and age. Age data is broken down into categories including 0-1, 1-4, 5-9, 10-14 etc.
I am trying to coerce my dataset such that the mortality data for 0-1 and 1-4 is merged together to create age categories 0-4, 5-9, 10-14 and so on up to 90. My data is in long format.
Using dplyr I am trying to use if_else and summarise() to aggregate mortality data for 0-1 and 1-4 together, however any iteration of code I apply is merely producing the same dataset I originally had, i.e. the code is not merging my data together.
head(death_popn_long) #cause_death variable content removed for brevity
Year deprivation_decile Sex cause_death ageband deaths popn
1 2017 1 Male NA 0 0 2106
2 2017 1 Male NA 0 0 2106
3 2017 1 Male NA 0 0 2106
4 2017 1 Male NA 0 0 2106
5 2017 1 Male NA 0 0 2106
6 2017 1 Male NA 0 0 2106
#Attempt to merge ageband 0-1 & 1-4 by summarising combined death counts
test <- death_popn_long %>%
group_by(Year, deprivation_decile, Sex, cause_death, ageband) %>%
summarise(deaths = if_else(ageband %in% c("0", "1"), sum(deaths),
deaths))
I would like the deaths variable to be the combined (i.e. sum of both 0-1 and 1-4) death count for these age bands; however, the above and any alternative code I attempt merely recreates the dataset I already had.
You don't want to use ageband in your group_by statement if you intend on manipulating its groups. You'll need to create your new version of ageband and then group by that:
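test <- death_popn_long %>%
  # if ageband is a character column, use "1" instead of 1: if_else() needs both branches to have the same type
  mutate(new_ageband = if_else(ageband %in% c("0", "1"), 1, ageband)) %>%
  group_by(Year, deprivation_decile, Sex, cause_death, new_ageband) %>%
  summarise(deaths = sum(deaths))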
If you'd like a marginally shorter version you can define new_ageband in the group_by clause instead of using a mutate verb beforehand. I just did that to be explicit.
Also, for future SO questions - it's very helpful to provide data in your question (using something like dput). :)
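For reference, the shorter version mentioned above (defining new_ageband inside group_by) could look roughly like this. The toy data frame is hypothetical, ageband is assumed to be a character column, and the "0-4" label is my own choice.

```r
library(dplyr)

# Hypothetical toy data; ageband is assumed to be character
toy <- data.frame(
  Year = 2017,
  Sex = c("Male", "Male", "Male", "Female", "Female"),
  ageband = c("0", "1", "5", "0", "1"),
  deaths = c(2, 3, 1, 4, 0)
)

# group_by() can create the grouping column on the fly
merged <- toy %>%
  group_by(Year, Sex,
           new_ageband = if_else(ageband %in% c("0", "1"), "0-4", ageband)) %>%
  summarise(deaths = sum(deaths), .groups = "drop")
```

The "0" and "1" rows collapse into a single "0-4" row per Year and Sex, while other age bands are left alone.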

replace missing value by grouping with mean

I have a table with countries and GDP, with some missing values. I want to replace them with a mean, but not the mean of the whole column, just the mean of the same group.
I have 27 countries and 11 years. like
countries year GDP
1 2001 125
1 2002 ...
1 2003 525
2 2001 222
2 2002 ...
So I would like to get the mean of the first country over all years and use it to replace that country's missing GDP values.
I know how to replace using the whole column:
data$gdp[which(is.na(data$gdp))]<- mean(data$gdp, na.rm=TRUE)
but this calculates the mean over the whole column. I don't want to take a subset for each country and calculate separately; I was wondering if I could do it in one go.
One option is using na.aggregate (from zoo; by default it takes the mean and replaces the NA elements) grouped by 'countries':
library(dplyr)
library(zoo)
df1 %>%
group_by(countries) %>%
mutate(GDP = na.aggregate(GDP))
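If you would rather not load any packages, the same idea can be sketched in base R with ave(). The sample rows are taken from the question, with the "..." entries assumed to be NA; fill_mean is a hypothetical helper name.

```r
# Sample rows from the question; "..." is assumed to mean NA
df1 <- data.frame(
  countries = c(1, 1, 1, 2, 2),
  year = c(2001, 2002, 2003, 2001, 2002),
  GDP = c(125, NA, 525, 222, NA)
)

# Replace the NAs in a vector with the mean of its non-missing values
fill_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies fill_mean within each country and keeps the original row order
df1$GDP <- ave(df1$GDP, df1$countries, FUN = fill_mean)
```

Country 1's missing value becomes 325 (the mean of 125 and 525), and country 2's becomes 222.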

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest (most recent) year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need: it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I would suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
           Year=c(2016,2015,2014,2016,2006,2006),
           Balance=c(100,150,65,75,150,10)) %>% # create the data frame
  as_tibble() %>% # convert it to a tibble (tbl_df() is deprecated)
  group_by(Name, Year) %>% # group it by Name and Year
  summarise(maxBalance=max(Balance)) %>% # maximum balance for each Name and Year
  group_by(Name) %>% # regroup the result by Name only
  top_n(1, Year) # keep the most recent year for each Name
Here is another solution without the data.table package.
First sort the data frame by descending Year and Balance,
df <- df[order(-df$Year, -df$Balance),]
then select the first row in each group with the same name:
df[!duplicated(df$Name),]
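As for where the original call goes wrong: with cbind(Year, Balance) ~ Name, aggregate applies max to each column independently, so Year and Balance can come from different rows. A rough sketch of one way to make aggregate work is to first keep only each person's latest-year rows; the names latest and res are mine.

```r
df <- data.frame(
  Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10)
)

# Keep only the rows from each person's most recent year
latest <- df[df$Year == ave(df$Year, df$Name, FUN = max), ]

# Then take the maximum balance among those rows
res <- aggregate(Balance ~ Name + Year, data = latest, FUN = max)
```

This gives 100 for John, 75 for Stacy and 150 for Kat.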

Find all largest values in a range, across different objects in data frame

I wonder if there is a simpler way than writing if...else... for the following case. I have a data frame and I only want the rows where the number in column "percentage" is >= 95. Moreover, for one object, if there are multiple rows fitting this criterion, I only want the largest one(s). If there is more than one largest, I would like to keep all of them.
For example:
object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75
Then I'd like the result to show:
object city street percentage
A NY Sun 100
A NY Waterfall 100
B CA Washington 98
I am able to do it using if else statement, but I feel there should be some smarter ways to say: 1. >=95 2. if more than one, choose the largest 3. if more than one largest, choose them all.
You can do this by creating a variable that indicates the rows that have the maximum percentage for each of the objects. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are greater than 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works because the combined & condition evaluates to a logical vector, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
FUN=function(i) i==max(i)) , ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98
You could also do this in data.table using the same approach as @user20650
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr
dat %>%
group_by(object) %>%
filter(percentage==max(percentage) & percentage >=95)
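A slightly different dplyr sketch, assuming dplyr 1.0.0 or later: filter on the threshold first, then let slice_max() pick the largest rows per object. It keeps ties by default (with_ties = TRUE), which is what the question asks for. The data frame is rebuilt from the question, with the city "NA" kept as a string.

```r
library(dplyr)

dat <- data.frame(
  object = c("A", "A", "A", "B", "B", "C"),
  city = c("NY", "NY", "NY", "CA", "WA", "NA"),
  street = c("Sun", "Malino", "Waterfall", "Washington", "Lieber", "Moon"),
  percentage = c(100, 97, 100, 98, 95, 75)
)

res <- dat %>%
  filter(percentage >= 95) %>%       # 1. threshold
  group_by(object) %>%
  slice_max(percentage, n = 1) %>%   # 2. and 3. largest per object, ties kept
  ungroup()
```

Object C drops out entirely because none of its rows reach 95.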
The following works:
ddf2 = ddf[ddf$percentage>=95,]
ddf3 = ddf2[-c(1:nrow(ddf2)),]
for(oo in unique(ddf2$object)){
tempdf = ddf2[ddf2$object == oo, ]
maxval = max(tempdf$percentage)
tempdf = tempdf[tempdf$percentage==maxval,]
for(i in 1:nrow(tempdf)) ddf3[nrow(ddf3)+1,] = tempdf[i,]
}
ddf3
object city street percentage
1 A NY Sun 100
3 A NY Waterfall 100
4 B CA Washington 98
