Replace missing values with the group mean in R

I have a table of countries and GDP with some missing values. I want to replace each missing value with a mean, but not the mean of the whole column, only the mean of the group it belongs to.
I have 27 countries and 11 years, like:
countries year GDP
1 2001 125
1 2002 ...
1 2003 525
2 2001 222
2 2002 ...
So I would like to take the mean of the first country over all years and use it to replace that country's missing GDP values.
I know how to replace using the whole column:
data$gdp[which(is.na(data$gdp))]<- mean(data$gdp, na.rm=TRUE)
but this uses the mean of the whole column. I don't want to take a subset for each country and calculate it separately; I was wondering whether I could do it in one go.

One option is to use na.aggregate from zoo (by default it replaces the NA elements with the mean), grouped by 'countries':
library(dplyr)
library(zoo)
df1 %>%
group_by(countries) %>%
mutate(GDP = na.aggregate(GDP))
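If adding packages is not an option, the same group-wise replacement can be sketched in base R with ave(). This is only an illustration: the toy values below are made up, and the column names follow the question's own snippet (data, gdp, countries).

```r
# Toy data in the shape of the question (values are made up for illustration)
data <- data.frame(
  countries = c(1, 1, 1, 2, 2, 2),
  year = rep(2001:2003, 2),
  gdp = c(125, NA, 525, 222, NA, 300)
)

# ave() applies FUN within each country and returns a vector
# aligned with the original rows
data$gdp <- ave(data$gdp, data$countries, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)  # fill NAs with that country's mean
  x
})
```

Each NA is replaced by the mean of the non-missing values of its own country, and non-missing values are left unchanged.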

Related

In r, how do I add rows together to get totals for a specific set of variables [duplicate]

My goal is to have a list of how much FDI China sent to each country per year. At the moment I have a list of individual projects that looks like this
Year Country Amount
2001 Angola 6000000
2001 Angola 8000000
2001 Angola 5.0E7
I want to sum it so it looks like this.
Year Country Amount
2001 Angola 6.4E7
How do I merge the rows and add the totals to get nice country-year data? I can't find an R command that does this precise thing.
library(tidyverse)
I copied the data table and read your dataframe into R using:
df <- clipr::read_clip_tbl(clipr::read_clip())
I like using dplyr to solve this question:
df2 <- as.data.frame(df %>% group_by(Country,Year) %>% summarize(Amount=sum(Amount)))
# A tibble: 1 x 3
# Groups: Country [1]
Country Year Amount
<chr> <int> <dbl>
1 Angola 2001 64000000
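For comparison, a base R sketch of the same country-year totals using aggregate() with a formula. The data frame below simply retypes the three example rows from the question, so it does not depend on the clipboard.

```r
# The three example projects from the question, typed in by hand
df <- data.frame(
  Year = c(2001, 2001, 2001),
  Country = c("Angola", "Angola", "Angola"),
  Amount = c(6000000, 8000000, 5.0e7)
)

# Sum Amount within every Country/Year combination
totals <- aggregate(Amount ~ Country + Year, data = df, FUN = sum)
```

The result has one row per country and year, with Amount holding the summed value.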

Growth Rates in Unbalanced Panel Data

I am trying to compute growth rates for some variables in an unbalanced panel data set, but I'm still getting results for years in which the lag does not exist.
I've been trying to compute the growth rates with dplyr, as shown here:
total_firmas_growth <- total_firmas %>%
group_by(firma) %>%
arrange(anio, .by_group = T) %>% mutate(
ing_real_growth = (((ingresos_real_2/Lag(ingresos_real_2))-1)*100)
)
For instance, if a firm has a value for "ingresos_real_2" in 2008 and its next value is in 2012, the code calculates a growth rate instead of returning NA for the missing year (i.e. 2011 is missing, so the 2012 growth rate cannot be calculated), as you can see for "firma" 115 (id) in the example right below:
total_firmas_growth <-
" firma anio ingresos_real_2 ing_real_growth
1 110 2005 14000 NA
2 110 2006 15000 7.14
3 110 2007 13000 -13.3
4 115 2008 15000 NA
5 115 2012 13000 NA
6 115 2013 14000 7.69
I will really appreciate your help.
The easiest way to get your original table into a format where the missing years show up as NA rows is to create a tibble with an all-by-all of the grouping column and your years. expand creates an all-by-all tibble of the variables you are interested in, and {.} takes in whatever was piped more robustly than . (by creating a copy, I believe). Since any mathematical operation that includes an NA results in an NA, this should get you what you're after if you run your group_by, arrange, mutate code after it.
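library(dplyr)
library(tidyr)
# right_join keeps every firma/anio combination produced by expand(),
# so combinations absent from total_firmas appear as rows of NA
total_firmas %>%
right_join(
expand({.}, firma, anio),
by = c("firma","anio")
)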
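A different approach that adds no rows is to compare each year with the previous year available for the same firm and keep the growth rate only when the gap is exactly one. The sketch below assumes dplyr's lag() and if_else(), not the Lag() used in the question, and the toy data retypes the question's example. The helper column prev_is_consecutive is a hypothetical name introduced for illustration.

```r
library(dplyr)

# Example rows retyped from the question (firma = firm id, anio = year)
total_firmas <- data.frame(
  firma = c(110, 110, 110, 115, 115, 115),
  anio = c(2005, 2006, 2007, 2008, 2012, 2013),
  ingresos_real_2 = c(14000, 15000, 13000, 15000, 13000, 14000)
)

total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  mutate(
    # TRUE only when the previous row of this firm is the preceding year
    prev_is_consecutive = anio - lag(anio) == 1,
    # growth in percent, NA when the previous year is missing
    ing_real_growth = if_else(
      prev_is_consecutive,
      (ingresos_real_2 / lag(ingresos_real_2) - 1) * 100,
      NA_real_
    )
  ) %>%
  ungroup()
```

With this data, the 2012 row of firm 115 gets NA because its previous observation is from 2008, while the 2013 row still gets a growth rate.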

R - replace zero values by average of non-zero ones for fixed categories

I am given a dataset of the following form
year<-rep(c(1990:1999),each=10)
age<-rep(50:59, 10)
cat1<-rep(c("A","B","C","D","E"),each=100)
value<-rnorm(10*10*5)
value[c(3,51,100,340,441)]<-0
df<-data.frame(year,age,cat1,value)
year age cat1 value
1 1990 50 A -0.7941799
2 1990 51 A 0.1592270
3 1990 52 A 0.0000000
4 1990 53 A 1.9222384
5 1990 54 A 0.3922259
6 1990 55 A -1.2671957
I would now like to replace any zeroes in the "value" column with the average, taken over the column "cat1", of the non-zero entries of "value" for the corresponding year and age. For example, for year 1990 and age 52 the entry for cat1=A is zero; it should be replaced by the average of the non-zero entries of the remaining categories for this specific year and age.
As we have
df[df$year==1990 & df$age==52,]
year age cat1 value
3 1990 52 A 0.0000000
103 1990 52 B -1.1325446
203 1990 52 C -1.6136773
303 1990 52 D 0.5724360
403 1990 52 E 0.2795241
we would replace the entry 0 by
sum(df[df$year==1990 & df$age==52,4])/4
[1] -0.4735654
Is there a nice and clean way to do this in general?
library(data.table)
setDT(df)[value == 0, value := NA]
df[, value := replace(value, is.na(value), mean(value, na.rm = TRUE)) , by = .(year, age)]
Maybe 99.9% of operations on tables can be decomposed into a few basic, fast, and optimized steps: split, combine (for numeric data: sum, multiplication, etc.), filter, sort, and join.
Here, left_join from dplyr is the way to go.
Just create another data frame with the zeroes filtered out and value aggregated with the proper grouping. Then replace the zeroes with the values from the newly joined column.
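The join-based steps described above could look roughly like this. It is a sketch rather than the answerer's own code: the seed, the helper column mean_nonzero, and the names group_means and df_filled are assumptions added for illustration, and the data is rebuilt as in the question.

```r
library(dplyr)

set.seed(1)  # hypothetical seed so the toy data is reproducible
df <- data.frame(
  year = rep(1990:1999, each = 10),
  age = rep(50:59, 10),
  cat1 = rep(c("A", "B", "C", "D", "E"), each = 100),
  value = rnorm(500)
)
df$value[c(3, 51, 100, 340, 441)] <- 0

# Mean of the non-zero values for every year/age combination
group_means <- df %>%
  filter(value != 0) %>%
  group_by(year, age) %>%
  summarise(mean_nonzero = mean(value), .groups = "drop")

# Join the means back and overwrite only the zero entries
df_filled <- df %>%
  left_join(group_means, by = c("year", "age")) %>%
  mutate(value = if_else(value == 0, mean_nonzero, value)) %>%
  select(-mean_nonzero)
```

left_join keeps the row order of df, so row 3 (year 1990, age 52, cat1 A) ends up holding the mean of the four non-zero values in rows 103, 203, 303 and 403.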

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest (most recent) year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need: it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% # create the data frame
as_tibble() %>% # convert it to a tibble (tbl_df() is deprecated)
group_by(Name, Year) %>% # group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulting data frame by Name
top_n(1,Year) # keep the row with the latest Year for each Name
Here is another solution without the data.table package.
first sort the data frame,
df <- df[order(-df$Year, -df$Balance),]
then select the first one in each group with the same name
df[!duplicated(df$Name),]

How do I reorder a factor

I want to reorder a factor based on one of its rows. For example I want to reorder the "country" factor based on the value corresponding to the 2014 entries below. UK would be ranked first and USA second.
dat <- data.frame(
country=c("USA","USA","UK","UK"),
year=c(2014,2013,2014,2013),
value=c(2,NA,1,NA)
)
country year value
1 USA 2014 2
2 USA 2013 NA
3 UK 2014 1
4 UK 2013 NA
I don't quite understand how factors are reordered. It seems the reorder command replaces an entire column in a data.frame, but I would think that I should only need to specify a new order for the factor labels. "level" seems to do the opposite, giving labels to the ordering.
Maybe this:
factor(dat$country, levels=with(dat[dat$year==2014,], country[order(value)] ))
#[1] USA USA UK UK
#Levels: UK USA
country <- factor(c("USA","USA","UK","UK"), levels = c("UK","USA"))
sort(country)

Resources