Mutate an R tibble by looking up a value in another tibble - r

While wrangling data in R, I would like to mutate a tibble so that the numerical values in the new column are looked up in a different tibble.
Given a dataset of catheter removals:
# A tibble: 51 x 2
ExplYear RemovalReason
<dbl> <chr>
1 2018 Infection
2 2018 Dysfunction
3 2018 Infection
# ... etc.
where each row corresponds to a single catheter removal, I would like to add a column Implants that holds the total number of _im_plantations in the year that the catheter was removed (_ex_planted).
The implantation numbers are in a tibble impl_per_year:
# A tibble: 13 x 2
ImplYear n
<dbl> <int>
1 2006 14
2 2007 46
3 2008 64
# ... etc.
I have tried to mutate the first tibble with map and a helper function:
lookup = function(year) { impl_per_year[impl_per_year$ImplYear == year,]$n }
explants %>% mutate(Implants = map(ExplYear, lookup))
But this places lots of empty integer vectors into the Implants column:
# A tibble: 51 x 3
ExplYear RemovalReason Implants
<dbl> <chr> <list>
1 18 Infection <int [0]>
2 18 Dysfunction <int [0]>
3 18 Infection <int [0]>
# ... etc.
What is the mistake?

You should be able to simply join the two tables by year. If we call your first tibble ExplTibble and your second ImplTibble, using dplyr:
ExplTibble %>% left_join(ImplTibble, by = c("ExplYear" = "ImplYear"))
This should add a new column n containing the number of implants in each year.

library(tidyverse)
I altered your data so that my illustration wouldn't have a NULL output.
df <- tribble(
~ExplYear, ~RemovalReason,
2018, "Infection",
2017, "Dysfunction",
2016, "Infection")
impl_per_year <- tribble(
~ImplYear, ~n,
2017, 14,
2016, 46,
2016, 64
)
left_join is the function you're looking for. It's part of the dplyr::join family of functions that do this.
It's good to have the same names for "joining" variables, but in your case you need the by = c( ... ) option to let left_join know what you are joining by.
left_join(df, impl_per_year, by = c("ExplYear" = "ImplYear"))
# A tibble: 4 x 3
ExplYear RemovalReason n
<dbl> <chr> <dbl>
1 2018 Infection NA
2 2017 Dysfunction 14
3 2016 Infection 46
4 2016 Infection 64
Depending on what you want, consider right_join, inner_join, etc. until you get the output you are looking for. For example:
inner_join(df, impl_per_year, by = c("ExplYear" = "ImplYear"))
# A tibble: 3 x 3
ExplYear RemovalReason n
<dbl> <chr> <dbl>
1 2017 Dysfunction 14
2 2016 Infection 46
3 2016 Infection 64
... which gives only successful matches from both tibbles.
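As for the original map() attempt: the printed result in the question shows ExplYear values of 18 rather than 2018. If the real column holds two-digit years (an assumption, since the full data is not shown), no row of impl_per_year matches and the subsetting returns a zero-length vector, which is what <int [0]> indicates. A minimal base R sketch, with made-up implant counts, shows this and a match()-based lookup that returns NA instead of an empty vector:

```r
# Hypothetical lookup table; the counts are illustrative only.
impl_per_year <- data.frame(ImplYear = c(2016, 2017, 2018),
                            n = c(46L, 14L, 51L))

lookup <- function(year) {
  impl_per_year[impl_per_year$ImplYear == year, ]$n
}

length(lookup(18))    # zero-length result: no row has ImplYear == 18
lookup(2018)          # a single count once the year matches

# match() gives one position per input year, or NA when nothing matches,
# so the new column is a plain vector rather than a list.
explants <- data.frame(ExplYear = c(2018, 2017, 2005),
                       RemovalReason = c("Infection", "Dysfunction", "Infection"))
explants$Implants <- impl_per_year$n[match(explants$ExplYear, impl_per_year$ImplYear)]
explants
```

Note that match() only returns the first matching position, so this sketch assumes each year appears once in the lookup table; with duplicated years a join is the safer choice.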

Related

Finding the mean of two columns with two different classes/labels

Right now I'm trying to create a data frame that contains the mean of two columns for two separate labels/categories.
But I don't know how to calculate the mean for two columns; my code just returns the same mean for both winner and opponent/loser.
Currently, I'm using the tidyverse library.
Here is the original data frame:
winner_hand winner_ht winner_ioc winner_age opponent_hand opponent_ht opponent_ioc opponent_age result name
<chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <fct> <chr>
R 178 JPN 29.00479 R NA RUS 22.88569 winner Kei Nishikori
R NA RUS 22.88569 R 188 FRA 33.70568 winner Daniil Medvedev
R 178 JPN 29.00479 R 188 FRA 31.88227 winner Kei Nishikori
R 188 FRA 33.70568 R NA AUS 19.86858 winner Jo Wilfried Tsonga
R NA RUS 22.88569 R 196 CAN 28.01095 winner Daniil Medvedev
R 188 FRA 31.88227 R NA JPN 26.40383 winner Jeremy Chardy
My code:
age_summary <- game_data %>%
group_by(result) %>%
summarize(mean_age = mean(winner_age))
age_summary
Resulting Data frame:
result mean_age
<fct> <dbl>
winner 27.68495
loser 27.68495
If you want summaries from two columns, you need expressions for each column in the call to summarize().
Example with fake data, since your excerpt only has one value for the 'result' column:
library(tidyverse)
dat <- read_csv(
"result, winner_age, opponent_age
A, 5, 10
A, 6, 11
B, 12, 2
B, 13, 1")
dat %>%
group_by(result) %>%
# note: two expressions here:
summarise(mean_winner_age = mean(winner_age),
mean_opponent_age = mean(opponent_age))
output:
# A tibble: 2 x 3
result mean_winner_age mean_opponent_age
<chr> <dbl> <dbl>
1 A 5.5 10.5
2 B 12.5 1.5
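With more than a couple of columns, the same per-column means can be sketched with across(), assuming dplyr 1.0 or later. The data is the same fake data as above, and na.rm = TRUE is an assumption about how missing ages should be handled:

```r
library(dplyr)

# Same fake data as in the answer above.
dat <- tibble(
  result       = c("A", "A", "B", "B"),
  winner_age   = c(5, 6, 12, 13),
  opponent_age = c(10, 11, 2, 1)
)

# One mean per listed column and per group; .names builds the output names.
age_summary <- dat %>%
  group_by(result) %>%
  summarise(across(c(winner_age, opponent_age),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))
age_summary
```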

Extend data frame column with inflation in R

I'm trying to extend some code to be able to:
1) read in a vector of prices
2) left join that vector of prices to a data frame of years (or years and months)
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate. Consider an example like this one:
prices <- data.frame(year=2018:2022,
wti=c(75,80,90,NA,NA),
brent=c(80,85,94,93,NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute force way as:
i_rate<-0.02
for(i in c(1:nrow(prices))){
if(is.na(prices$wti[i]))
prices$wti[i]<-prices$wti[i-1]*(1+i_rate)
if(is.na(prices$brent[i]))
prices$brent[i]<-prices$brent[i-1]*(1+i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by @camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs, because it uses the "original" value of the previous element instead of the "revised" one. We first have to create a version of lag that does this, by writing a new function:
impute_inflation <- function(x, rate) {
output <- x
y <- rep(NA, length = length(x)) #Creating an empty vector to fill in with the loop. This makes R faster to run for vectors with a large number of elements.
for (i in seq_len(length(output))) {
if (i == 1) {
y[i] <- output[i] #To avoid an error attempting to use the 0th element.
} else {
y[i] <- output[i - 1]
}
if (is.na(output[i])) {
output[i] <- y[i] * (1 + rate)
} else {
output[i]
}
}
output
}
Then it's a cinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
year wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)
inflation_factor <- 1.02
prices <- data_frame(year=2018:2022,
wti=c(75,80,90,NA,NA),
brent=c(80,85,94,93,NA)) %>%
mutate_at(vars(wti, brent), as.numeric)
prices %>%
mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#> year wti brent prev_wti
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 75 80 NA
#> 2 2019 80 85 75
#> 3 2020 90 94 80
#> 4 2021 NA 93 90
#> 5 2022 NA NA NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#> year wti brent
#> <int> <dbl> <dbl>
#> 1 2018 75 80
#> 2 2019 80 85
#> 3 2020 90 94
#> 4 2021 91.8 93
#> 5 2022 NA 94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
tidyr::gather(key = key, value = value, wti, brent) %>%
group_by(key) %>%
mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#> year brent wti
#> <int> <dbl> <dbl>
#> 1 2018 80 75
#> 2 2019 85 80
#> 3 2020 94 90
#> 4 2021 93 91.8
#> 5 2022 94.9 NA
Created on 2018-07-12 by the reprex package (v0.2.0).
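The lag-based versions leave the second of two consecutive NAs unfilled. One loop-free way around that, sketched here in base R, is to find the last observed value for each position and compound the rate over the number of steps since then. The function name fill_with_inflation is hypothetical, and leading NAs (with no earlier price) are left as NA by assumption:

```r
fill_with_inflation <- function(x, rate) {
  observed <- !is.na(x)
  # Index of the most recent observed value at each position (0 if none yet).
  last_idx <- cummax(seq_along(x) * observed)
  # Number of periods elapsed since that observation.
  gap <- seq_along(x) - last_idx
  fillable <- !observed & last_idx > 0
  out <- x
  out[fillable] <- x[last_idx[fillable]] * (1 + rate)^gap[fillable]
  out
}

prices <- data.frame(year = 2018:2022,
                     wti = c(75, 80, 90, NA, NA),
                     brent = c(80, 85, 94, 93, NA))

# Apply the same fill to every price column.
prices[c("wti", "brent")] <- lapply(prices[c("wti", "brent")],
                                    fill_with_inflation, rate = 0.02)
prices
```

For this example the result agrees with the loop in the question: wti becomes 91.8 and 93.636 for 2021 and 2022, and brent becomes 94.86 for 2022.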

Dropping the rows by checking whether it has multiple values in R

I have a data frame in this form;
Year Department Jan Feb ................... Dec
2017 TF 15.15 225.51 .............. 5562.1
2015 CIF ...................................
2013 TTR ....................................
2011 COR ....................
. .............................
. ......................
As a summary, I want to create an algorithm but first I have to make this filtering:
If a department does not have values for all of the years 2013, 2014, 2015 and 2016, then I want to exclude that department from my data set.
In other words, after reading each department's data, I want to keep only the departments that have values in the month columns for all four years.
I tried exists and is.na, but the multiple filtering always fails. Another handicap is that filter seems to work for only a single condition, but here I need four: values for all four years must exist before I can use them in the next step.
Thank you.
I can't find a clear duplicate to this question. Seems like a quick fix with group_by:
library(dplyr)
df <- data_frame(Year = c(2013:2016, 2015, 2016),
Department = c(rep('TF', 4), 'CIF', 'TTR'))
df
#> # A tibble: 6 x 2
#> Year Department
#> <dbl> <chr>
#> 1 2013 TF
#> 2 2014 TF
#> 3 2015 TF
#> 4 2016 TF
#> 5 2015 CIF
#> 6 2016 TTR
df %>%
group_by(Department) %>%
mutate(x = Year %in% c(2013:2016),
y = sum(x)) %>%
ungroup() %>%
filter(y == 4)
#> # A tibble: 4 x 4
#> Year Department x y
#> <dbl> <chr> <lgl> <int>
#> 1 2013 TF TRUE 4
#> 2 2014 TF TRUE 4
#> 3 2015 TF TRUE 4
#> 4 2016 TF TRUE 4
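A variant of the same idea, sketched under the assumption that a department may list a year more than once, tests directly that every required year is present instead of counting matches. The names required_years and complete_departments are illustrative:

```r
library(dplyr)

df <- tibble(Year = c(2013:2016, 2015, 2016),
             Department = c(rep("TF", 4), "CIF", "TTR"))

required_years <- 2013:2016

# Keep a department only if each required year occurs at least once in it.
complete_departments <- df %>%
  group_by(Department) %>%
  filter(all(required_years %in% Year)) %>%
  ungroup()
complete_departments
```

Because the condition is evaluated once per group, no helper columns are added and a repeated year cannot make an incomplete department reach a count of four.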
A solution using R base:
df = read.table(text = "Year,Department
2016,TF
2017,TF
2013,CIF
2014,CIF
2015,CIF
2016,CIF
2013,TTR", header = TRUE, sep = ",", stringsAsFactors = FALSE)
# rows that fall within the required years
n = subset(df, Year %in% c(2013,2014,2015,2016))
df[df$Department %in% subset(aggregate(n, by=list(n$Department), FUN=length), Department==4)[,1], ]
Output:
Year Department
3 2013 CIF
4 2014 CIF
5 2015 CIF
6 2016 CIF

Using custom order to arrange rows after previous sorting with arrange

I know this has already been asked, but I think my issue is a bit different (nevermind if it is in Portuguese).
I have this dataset:
df <- cbind(c(rep(2012,6),rep(2016,6)),
rep(c('Emp.total',
'Fisicas.total',
'Outros,total',
'Politicos.total',
'Receitas.total',
'Proprio.total'),2),
runif(12,0,1))
colnames(df) <- c('Year','Variable','Value')
I want to order the rows to group first everything that has the same year. Afterwards, I want the Variable column to be ordered like this:
Receitas.total
Fisicas.total
Emp.total
Politicos.total
Proprio.total
Outros.total
I know I could use arrange() from dplyr to sort by the year. However, I do not know how to combine this with any routine using factor and order without messing up the previous ordering by year.
Any help? Thank you
We create a custom order by converting 'Variable' into a factor with levels specified in the desired order:
library(dplyr)
df %>%
arrange(Year, factor(Variable, levels = c('Receitas.total',
'Fisicas.total', 'Emp.total', 'Politicos.total',
'Proprio.total', 'Outros.total')))
# A tibble: 12 x 3
# Year Variable Value
# <dbl> <chr> <dbl>
# 1 2012 Receitas.total 0.6626196
# 2 2012 Fisicas.total 0.2248911
# 3 2012 Emp.total 0.2925740
# 4 2012 Politicos.total 0.5188971
# 5 2012 Proprio.total 0.9204438
# 6 2012 Outros,total 0.7042230
# 7 2016 Receitas.total 0.6048889
# 8 2016 Fisicas.total 0.7638205
# 9 2016 Emp.total 0.2797356
#10 2016 Politicos.total 0.2547251
#11 2016 Proprio.total 0.3707349
#12 2016 Outros,total 0.8016306
data
set.seed(24)
df <- data_frame(Year =c(rep(2012,6),rep(2016,6)),
Variable = rep(c('Emp.total',
'Fisicas.total',
'Outros,total',
'Politicos.total',
'Receitas.total',
'Proprio.total'),2),
Value = runif(12,0,1))
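One detail worth checking in this data: the value is written 'Outros,total' (with a comma), while the level list contains 'Outros.total'. Values that are not among the levels become NA in the factor, and arrange() places NA last, so the row ends up in the intended position here only because it is also the last entry of the custom order. A small sketch with a three-row illustrative tibble shows the effect:

```r
library(dplyr)

custom_order <- c("Receitas.total", "Fisicas.total", "Emp.total",
                  "Politicos.total", "Proprio.total", "Outros.total")

# Illustrative subset of the data, keeping the comma spelling.
df <- tibble(Year = c(2012, 2012, 2012),
             Variable = c("Outros,total", "Emp.total", "Receitas.total"))

# "Outros,total" is not among the levels, so it maps to NA.
unmatched <- is.na(factor(df$Variable, levels = custom_order))
unmatched

# arrange() sorts NA keys to the end.
sorted <- df %>% arrange(Year, factor(Variable, levels = custom_order))
sorted$Variable
```

If the unmatched value belonged anywhere but last, its spelling would have to agree with the level list for the ordering to be correct.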

Convert data.frame wide to long while concatenating date formats

In R (or another language), I want to transform the upper data frame below into the lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do multiple operations in the same question: combining date columns, melting your data, renaming columns, and sorting.
This gives close to your expected output (unite() joins with an underscore by default; pass sep = "-" to get 2016-07):
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
mutate(expense=-expense) %>% melt(value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::data_frame(
year =2016,
month = c("07", "08"),
income = c(50,30),
expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.
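In tidyr 1.0 and later, gather() is superseded by pivot_longer(). A sketch of the same reshaping with that function, assuming the same df as above (built here with tibble() rather than data_frame()), might look like this:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)

long <- df %>%
  mutate(month = paste0(year, "-", month),   # combine year and month
         expense = -expense) %>%             # expenses count as negative
  select(-year) %>%
  pivot_longer(c(income, expense),
               names_to = "direction",
               values_to = "income_expense") %>%
  select(-direction)
long
```

pivot_longer() emits the reshaped rows one original row at a time, so income and expense for the same month already sit next to each other and no arrange() step is needed for this input.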
