replace NA in a dplyr chain - r

Question has been edited from the original.
After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA NA
The following does not work as I expected:
library(dplyr)
library(Lahman)
df <- Batting[ c("yearID", "teamID", "G_batting") ]
df <- group_by(df, teamID )
df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)
Source: local data frame [20 x 3]
Groups: yearID, teamID
yearID teamID G_batting
1 2004 SFN 11.00000
2 2006 CHN 43.00000
3 2007 CHA 2.00000
4 2008 BOS 5.00000
5 2009 SEA 3.00000
6 2010 SEA 4.00000
7 2012 NYA **49.07894**
> mean(Batting$G_batting, na.rm = TRUE)
[1] **49.07894**
In fact, it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R does not work either: it also imputes the overall mean rather than the group mean, and it converts the data to a regular data frame. Is there a better way to do this?
df %.%
group_by( yearID ) %.%
transform(G_batting = ifelse(is.na(G_batting),
mean(G_batting, na.rm = TRUE),
G_batting)
)
Edit: Replacing transform with mutate gives the following error
Error in mutate_impl(.data, named_dots(...), environment()) :
INTEGER() can only be applied to a 'integer', not a 'double'
Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.
df %.%
group_by( teamID ) %.%
mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA 47
> mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47
Edit: Following up on @Romain's comment, I installed dplyr from GitHub:
> head(df,10)
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA NA
8 1954 ML1 122
9 1955 ML1 153
10 1956 ML1 153
> df %.%
+ group_by(teamID) %.%
+ mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 0
2 2006 CHN 0
3 2007 CHA 0
4 2008 BOS 0
5 2009 SEA 0
6 2010 SEA 1074266112
7 2012 NYA 90693125
8 1954 ML1 122
9 1955 ML1 153
10 1956 ML1 153
.. ... ... ...
So I didn't get the error (good) but I got a (seemingly) strange result.

The main issue you're having is that mean returns a double while the G_batting column is an integer. Wrapping the mean in as.integer works (at the cost of truncating the mean to a whole number), or you could convert the entire column to numeric first.
That said, here are a couple of data.table alternatives - I didn't check which one is faster.
library(data.table)
# using ifelse
dt = data.table(a = rep(1:2, 5), b = c(1,2,NA,NA,3,4,5,6,7,8))  # a: group id, alternating 1, 2
dt[, b := ifelse(is.na(b), mean(b, na.rm = TRUE), b), by = a]
# using a temporary column
dt = data.table(a = rep(1:2, 5), b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b.mean := mean(b, na.rm = TRUE), by = a][is.na(b), b := b.mean][, b.mean := NULL]
And this is what I'd want to do ideally (there is an FR about this):
# again, atm this is pure fantasy and will not work
dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]
The dplyr version of the ifelse is (as in OP):
dt %>% group_by(a) %>% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))
I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).
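For readers on a current dplyr release, here is a minimal sketch of the same idea: convert the integer column to double first, so the group mean can be stored without truncation. The data frame `toy` and its values are made up for illustration, and `%>%` stands in for the old `%.%` operator.

```r
library(dplyr)

# Toy data (hypothetical values): integer column with one NA per team
toy <- data.frame(
  teamID    = c("NYA", "NYA", "NYA", "SEA", "SEA", "SEA"),
  G_batting = c(10L, 20L, NA, 3L, NA, 5L)
)

imputed <- toy %>%
  group_by(teamID) %>%
  mutate(
    G_batting = as.numeric(G_batting),                 # integer -> double
    G_batting = ifelse(is.na(G_batting),
                       mean(G_batting, na.rm = TRUE),  # mean within the current group
                       G_batting)
  ) %>%
  ungroup()
```

Because mutate does not reorder rows, the imputed values stay in their original positions.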

Related

How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
  map(list(df_2012, df_2013, df_2014), function(x)
    x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
  # full_join keeps ids that appear in any year; left_join has no `all` argument
  reduce(function(x, y)
    full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
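When there are many yearly data frames, listing them by hand gets tedious. A minimal base R sketch, assuming the objects follow the `df_<year>` naming from the question and already exist in the workspace; `years`, `dfs` and `pick_cols` are names introduced here for illustration, and the example data is a shortened version of the question's:

```r
# Example data in the question's shape (two years only, to keep it short)
id <- c("France", "Belgium", "Spain")
df_2012 <- data.frame(id, varA_2012 = c(1, 2, 3), varB_2012 = c(7, 2, 9),
                      varC_2012 = c(1, 56, 0))
df_2013 <- data.frame(id, varA_2013 = c(34, 3, 56), varB_2013 = c(2, 53, 5),
                      varC_2013 = c(24, 3, 45))

years <- 2012:2013

# Fetch the data frames by name instead of typing each one
dfs <- mget(paste0("df_", years))

# Keep id plus the varA_* and varB_* columns of one data frame
pick_cols <- function(d) d[, grepl("^(id|var[AB]_)", names(d)), drop = FALSE]

# Full join on id across all years
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
                   lapply(dfs, pick_cols))
```

Extending `years` is then the only change needed when more yearly data frames are imported.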

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band the income falls inside. Where more than one observation (income) falls within a range, I would like to take the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for ease of understanding. You can omit it and just chain the two command chains together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7
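With dplyr 1.1.0 or later, a range join like this can also be written with `join_by()`, without fuzzyjoin. A minimal sketch, assuming that dplyr version and using a small hand-made `occ_small` table (hypothetical values) in place of the random `data_occ`:

```r
library(dplyr)

income_range <- data.frame(
  id        = "France",
  inc_lower = c(10, 21, 31),
  inc_high  = c(20, 30, 40),
  perct     = c(1, 2, 3)
)

# Hypothetical stand-in for data_occ, small enough to check by eye
occ_small <- data.frame(
  id     = c("France", "France", "France", "France", "Belgium"),
  income = c(15, 12, 25, 38, 11),
  occ    = c("manager", "clerk", "manual", "skilled", "office")
)

result <- income_range %>%
  # match rows where id is equal and income lies inside [inc_lower, inc_high]
  left_join(occ_small,
            by = join_by(id, inc_lower <= income, inc_high >= income)) %>%
  group_by(perct) %>%
  # keep the lowest income per band
  slice_min(income, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Because it is a left join from income_range, a band with no matching income would still appear once, with NA in the income and occ columns.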

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
1. each firm_id should be handled separately: group_by(firm_id)
2. where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
3. otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
ungroup() %>%
group_by(firm_id) %>%
arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged w/ in asc(fyear) order; not required in this specific example as df is already in correct order
mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock)))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
Can anyone help me with this problem so that the results eventually look as shown below? I am more than happy to answer additional questions if required. Thank you very much in advance to everyone who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
Arrange the data by fyear so that the minimum year is always the first row within each firm. You can then use accumulate to carry the previous row's xrd_stock forward.
library(dplyr)
df %>%
arrange(firm_id, fyear) %>%
group_by(firm_id) %>%
mutate(xrd_stock = purrr::accumulate(xrd[-1], ~.y + (1-depr) * .x,
.init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
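The same recurrence can be checked in base R with `Reduce(accumulate = TRUE)`. A minimal sketch for a single firm (the Cosco rows), with `stock_step` as a helper name introduced here:

```r
growth <- 0.08
depr   <- 0.15
xrd    <- c(49, 93, 121, 84, 37)  # Cosco, fyear 2000-2004

# One step of the recurrence: current xrd plus the depreciated previous stock
stock_step <- function(prev_stock, xrd_now) xrd_now + (1 - depr) * prev_stock

# Seed with xrd / (growth + depr) for the first year, then fold over the rest
xrd_stock <- Reduce(stock_step, xrd[-1], accumulate = TRUE,
                    xrd[1] / (growth + depr))

round(xrd_stock)
# 213 274 354 385 364
```

The rounded values agree with the Excel-calculated targets in the question.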

How can you create custom headers using Table function in R?

I have a data frame for each team that looks like nebraska below. However, some of these poor teams don't have a single win, so their $Outcome column has nothing but L in them.
> nebraska
Teams Away/Home Score Outcome
1 Arkansas State Away 36
2 Nebraska Home 43 W
3 Nebraska Away 35 L
4 Oregon Home 42
5 Northern Illinois Away 21
6 Nebraska Home 17 L
7 Rutgers Away 17
8 Nebraska Home 27 W
9 Nebraska Away 28 W
10 Illinois Home 6
11 Wisconsin Away 38
12 Nebraska Home 17 L
13 Ohio State Away 56
14 Nebraska Home 14 L
When I run table(nebraska$Outcome), it gives me my expected outcome:
table(nebraska$Outcome)
L W
7 4 3
However, for the teams that don't have a single win (like Baylor), or only have wins, it only gives me something like:
table(baylor$Outcome)
L
7 7
I'd like to specify custom headers for the table function so that I get something like this output:
table(baylor$Outcome)
L W
7 7 0
I've tried passing the argument dnn to the table function call, but it throws an error with the following code:
> table(baylor$Outcome,dnn = c("W","L",""))
Error in names(dn) <- dnn :
'names' attribute [3] must be the same length as the vector [1]
Can someone tell me how I can tabulate these wins and losses correctly?
Try this:
with(rle(sort(nebraska$Outcome)),
data.frame(W = max(0, lengths[values == "W"]),
L = max(0, lengths[values == "L"])))
# W L
#1 3 4
I don't think this has to be that complicated. Just make baylor$Outcome a factor and then table. E.g.:
# example data
baylor <- data.frame(Outcome = c("L","L","L"))
Then it is just:
baylor$Outcome <- factor(baylor$Outcome, levels=c("","L","W"))
table(baylor$Outcome)
# L W
#0 3 0
Following a tidy workflow, I offer...
library(dplyr)
library(tidyr)
df <- nebraska %>%
group_by(Teams, Outcome) %>%
summarise(n = n()) %>%
spread(Outcome, n) %>%
select(-c(`<NA>`))
# # A tibble: 8 x 3
# # Groups: Teams [8]
# Teams L W
# * <chr> <int> <int>
# 1 Arkansas State NA NA
# 2 Illinois NA NA
# 3 Nebraska 4 3
# 4 Northern Illinois NA NA
# 5 Ohio State NA NA
# 6 Oregon NA NA
# 7 Rutgers NA NA
# 8 Wisconsin NA NA
...and I couldn't help but pretty it up with knitr::kable and kableExtra:
library(knitr)
library(kableExtra)
df %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

How can I aggregate data.table in quarterly frequency?

My data is available at monthly frequency and I'm trying to aggregate it to quarterly frequency. I'm working with data.table, a package I don't understand very well, to be honest.
X.DATA_BASE NOME_INSTITUICAO SALDO.x SALDO.y
1: 199407 ASB S/A - CFI 1694581 1124580
2: 199407 BANCO ARAUCARIA S.A. 40079517 6314782
3: 199407 BANCO ATLANTIS S.A. 200463907 9356445
4: 199407 BANCO BANKPAR 1078342 5770046
5: 199407 BANCO BBI 97812975 31112289
Each date is given by X.DATA_BASE, where 199407 means July 1994. There are several institutions, each with SALDO.x and SALDO.y values. I want to sum SALDO.x and SALDO.y for each institution in each quarter. One problem is that some institutions enter and leave the data over time. In the end, I want mydata to have the same columns but at quarterly frequency.
How could I do that?
Here's an example of how to group and sum by quarter (with thanks to @eddi for his suggested improvement). First let's create some fake data:
library(data.table)
set.seed(1485)
dat = data.table(date=rep(c(199401:199412,199501:199512),2),
firm=rep(c("A","B"), each=24),
value1=rnorm(48,1000,10),
value2=rnorm(48,2000,100))
dat
date firm value1 value2
1: 199401 A 1009.8620 2054.251
2: 199402 A 1009.7180 2124.202
3: 199403 A 1014.3421 1919.251
...
46: 199510 B 992.9961 2079.517
47: 199511 B 997.9147 1968.676
48: 199512 B 1002.5993 2006.231
Now, summarize by firm, year, and quarter. To do this, we create year and quarter grouping variables from date (we use integer division (%/%) to create the years and mod (%%) plus integer division to create the quarters), and calculate the sum of value1 and value2 for each sub-group. This all assumes date is numeric. If you have it stored as character or factor, convert to numeric first:
dat.summary = dat[ , list(valueByQuarter = sum(value1) + sum(value2)),
by=list(firm,
year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1)]
dat.summary
firm year quarter valueByQuarter
1: A 1994 1 9131.626
2: A 1994 2 8953.116
3: A 1994 3 8981.407
4: A 1994 4 9175.959
5: A 1995 1 9003.225
6: A 1995 2 8962.690
7: A 1995 3 8809.256
8: A 1995 4 8885.264
9: B 1994 1 9000.791
10: B 1994 2 8936.356
11: B 1994 3 8905.789
12: B 1994 4 8951.369
13: B 1995 1 8922.716
14: B 1995 2 9097.134
15: B 1995 3 8724.188
16: B 1995 4 9047.934
For dplyr fans, here's a dplyr approach:
library(dplyr)
dat %>%
group_by(firm, year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1) %>%
summarise(valueByQuarter = sum(value1 + value2))
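The question asks for the same columns at quarterly frequency, so SALDO.x and SALDO.y may need to be summed separately rather than combined into one value. A minimal data.table sketch under that assumption, on made-up data that borrows the question's column names:

```r
library(data.table)

# Hypothetical monthly data; X.DATA_BASE is numeric YYYYMM
mydata <- data.table(
  X.DATA_BASE      = rep(c(199407, 199408, 199409, 199410), 2),
  NOME_INSTITUICAO = rep(c("A", "B"), each = 4),
  SALDO.x          = 1:8,
  SALDO.y          = 11:18
)

# Sum each SALDO column on its own within institution, year and quarter
quarterly <- mydata[, lapply(.SD, sum),
                    by = .(NOME_INSTITUICAO,
                           year    = X.DATA_BASE %/% 100,
                           quarter = (X.DATA_BASE %% 100 - 1) %/% 3 + 1),
                    .SDcols = c("SALDO.x", "SALDO.y")]
```

Institutions that enter or leave over time need no special handling here: a quarter is summed over whichever monthly rows exist for that institution.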
