I have a data frame with a year column that contains some NAs. I would like to fill each missing cell with the previous year + 1 (unless the missing cell is at the beginning of the series). Here's a reproducible example:
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, NA, NA, 2000, 2001, 2002))
I tried to follow this post
df <- df %>%
complete(y = seq(min(y), max(y), by = "year"))
but I can't figure out how to make it work. Any ideas?
Edit: expected output:
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, 2001, 2002, 2000, 2001, 2002))
Note: I would prefer a dplyr solution.
Note 2 (October 23rd 2019): The three answers so far are good but quite complicated. I'm really surprised that it is not possible to do that simply (for example, having the possibility to add a lag in the fill function would be really useful I think).
This solution is a bit verbose but completely vectorized in dplyr. I doubled your df into a new df2 to test it across a couple of gapped runs.
library(tidyr)
library(dplyr)
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, NA, NA, 2000, 2001, 2002))
df2 <- bind_rows(df, df)
Basically you need to create groups across the blocks with NA. Then you can calculate a within-group cumsum and use fill to drag down the prior value. It is annoying because of all the lines.
df2 %>%
group_by(grp = cumsum(!is.na(y) & lag(is.na(y), default = FALSE))) %>%
mutate(add_year = cumsum(is.na(y))) %>%
fill(y) %>%
mutate(y = y + add_year) %>%
ungroup() %>%
select(-grp, -add_year)
In base R you can use ave() in combination with cumsum() to split the column into runs that each start at a non-NA value, and then apply seq() within each run, as you have already tried.
df$y <- ave(df$y, cumsum(!is.na(df$y)), FUN=function(x)
seq(x[1], length.out = length(x)))
identical(df, dfExpected)
#[1] TRUE
df$y
#[1] 2000 2001 2002 2000 2001 2002
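Since the question asks for dplyr, the same run-based idea as the ave() call above can be sketched with group_by(). This is only an illustration, not part of the original answer; the helper column name run is made up here.

```r
library(dplyr)

df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, NA, NA, 2000, 2001, 2002))

filled <- df %>%
  # a new run starts at every non-NA value (run is a hypothetical helper column)
  group_by(run = cumsum(!is.na(y))) %>%
  # first value of the run plus the 0-based position within the run
  mutate(y = first(y) + row_number() - 1) %>%
  ungroup() %>%
  select(-run)

filled$y
#[1] 2000 2001 2002 2000 2001 2002
```

A run that begins with NA (a series starting with missing values) keeps its NAs, because first(y) is NA for that run.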
If the series starts with NA and you want it to start at 2000 in that case, you can use replace():
df2$y <- ave(df2$y, cumsum(!is.na(df2$y)), FUN = function(x)
seq(replace(x[1], is.na(x[1]), 2000), length.out = length(x)))
identical(df2, dfExpected)
#[1] TRUE
Data:
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, NA, NA, 2000, 2001, 2002))
dfExpected <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, 2001, 2002, 2000, 2001, 2002))
df2 <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(NA, NA, NA, 2000, 2001, 2002))
This uses the dplyr functions case_when() and lag() combined with a while loop in a custom function.
The output is as expected; try it out.
library(dplyr)
lag_years <- function(df){
while (anyNA(df$y))
{
df <- df %>%
  mutate(y = case_when(is.na(y) & !is.na(lag(y)) ~ lag(y) + 1, TRUE ~ y))
}
return(df)
}
lag_years(df) %>%
head()
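One caveat with the loop above: if the first value of y is NA, lag(y) never supplies a value for it, so anyNA() stays TRUE and the loop does not terminate. A minimal sketch of a guarded variant, which stops as soon as a pass fills nothing, could look like this. The name lag_years_safe is hypothetical and not from the answer.

```r
library(dplyr)

# Hypothetical guarded variant: stop when a pass no longer reduces the NA count
lag_years_safe <- function(df) {
  repeat {
    n_missing <- sum(is.na(df$y))
    df <- df %>%
      mutate(y = case_when(is.na(y) & !is.na(lag(y)) ~ lag(y) + 1, TRUE ~ y))
    # nothing was filled in this pass (or nothing was missing): we are done
    if (sum(is.na(df$y)) == n_missing) break
  }
  df
}

df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, NA, NA, 2000, 2001, 2002))
df_leading_na <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                            y = c(NA, NA, NA, 2000, 2001, 2002))

res <- lag_years_safe(df)                   # all gaps filled
res_leading <- lag_years_safe(df_leading_na) # leading NAs are left as NA
```

Each pass fills only the first NA after a known value, so a gap of k missing values needs k passes.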
I want to transform a df from a "counting" approach (number of cases) to an "individual observations" approach.
Example:
df <- dplyr::tibble(
city = c("a", "a", "b", "b", "c", "c"),
sex = c(1,0,1,0,1,0),
age = c(1,2,1,2,1,2),
cases = c(2, 3, 1, 1, 1, 1))
Expected result
df <- dplyr::tibble(
city = c("a","a","a","a","a", "b", "b", "c", "c"),
sex = c(1,1,0,0,0,1,0,1,0),
age = c(1,1,2,2,2,1,2,1,2))
uncount() from tidyr can do that for you.
df |> tidyr::uncount(cases)
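If tidyr is not available, roughly the same expansion can be sketched in base R by repeating row indices. This is an alternative illustration rather than what the answer proposes; the name expanded is made up here.

```r
df <- data.frame(
  city = c("a", "a", "b", "b", "c", "c"),
  sex = c(1, 0, 1, 0, 1, 0),
  age = c(1, 2, 1, 2, 1, 2),
  cases = c(2, 3, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# repeat each row index cases times, then drop the cases column
expanded <- df[rep(seq_len(nrow(df)), times = df$cases), c("city", "sex", "age")]
rownames(expanded) <- NULL

nrow(expanded)
#[1] 9
```

Like uncount(), this drops the count column and leaves one row per individual observation.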
I want to get the proportion of each value within its group using dplyr. The desired result appears in desired$prop
Thanks in advance :))
data <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa","uk",
"spain","usa","uk","spain","usa","uk","spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)
desired <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa",
"uk","spain","usa","uk","spain","usa","uk",
"spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25),
prop = c(0.285714286,0.181818182,0.142857143,0.357142857,
0.272727273,0.5,0.357142857,0.545454545,
0.357142857)
)
#MrFlick is right. And also faster than I am.
library(dplyr)
df <- data %>%
group_by(country) %>%
mutate(prop = value/sum(value))
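Note that the desired proportions are computed within country rather than within team: the first row is 40 / (40 + 50 + 50) = 0.2857, which is why the code groups by country. As a sketch, the result can be checked against that hand-computed value and against a base R equivalent built with ave(). The names with_prop and base_prop are made up for this illustration.

```r
library(dplyr)

data <- data.frame(
  team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  country = c("usa", "uk", "spain", "usa", "uk", "spain", "usa", "uk", "spain"),
  value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)

with_prop <- data %>%
  group_by(country) %>%
  mutate(prop = value / sum(value)) %>%
  ungroup()

# base R equivalent: divide each value by the sum over its country
base_prop <- data$value / ave(data$value, data$country, FUN = sum)

all.equal(with_prop$prop, base_prop)
#[1] TRUE
```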
I would like to make a bar chart that plots each bar as a proportion of the total number of groups rather than the usual percentage. For a var to "count", it only needs to occur once in a group. For example, in this df, where id is the grouping variable
df <-
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"))
The bars would be:
a = 2/3 # since a occurs in 2 out of 3 groups
b = 1/3
c = 1/3
d = 1/3
e = 1/3
If I understand you correctly, a one-liner would suffice:
ggplot(distinct(df)) + geom_bar(aes(vars, after_stat(count) / n_distinct(df$id)))
Working answer:
library(dplyr)
library(ggplot2)
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
vars = c("a", "a", "b", "c", "d", "e", "a", "a", "a")) %>%
group_by(id) %>%
distinct(vars) %>%
ungroup() %>%
add_count(vars) %>%
mutate(prop = n / n_distinct(id)) %>%
distinct(vars, .keep_all = TRUE) %>%
ggplot(aes(vars, prop)) +
geom_col()
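The proportions themselves can be sketched a little more compactly with distinct() and count(), here using the data from the question including its NA. Dropping the NA row is an assumption about what is wanted (no bar for missing values), and props is a made-up name.

```r
library(dplyr)

df <- tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
             vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"))

props <- df %>%
  distinct(id, vars) %>%     # each var counts at most once per id
  filter(!is.na(vars)) %>%   # assumption: NA should not get its own bar
  count(vars) %>%            # number of ids in which each var occurs
  mutate(prop = n / n_distinct(df$id))
```

Here a occurs in 2 of the 3 ids, so its prop is 2/3, and every other var gets 1/3. The result can be passed to ggplot(aes(vars, prop)) + geom_col() as in the answer.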
I have a data frame from which I'm plotting some trend lines. However, I want to exclude data where there aren't complete records (i.e. if the dose of drug C is NA in 2002, then I don't want C included on the plot at all). How do I achieve this in R?
Reproducible Example
df <- data.frame(year=c(2001, 2002, 2003, 2004, 2001, 2002, 2003, 2004, 2001, 2002, 2003, 2004),
dose=c(500, 600, 750, 550, 300, 330, 350, 390, 100, NA, 250, 125),
drug=c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"))
ggplot(df) + geom_line(aes(x = year, y = dose, color=drug))
The tidyverse approach:
library(tidyverse)
ggplot(df %>% group_by(drug) %>% filter(!any(is.na(dose)))) +
geom_line(aes(x = year, y = dose, color=drug))
Because of group_by(drug), filter() now operates per drug: !any(is.na(dose)) keeps a drug only if none of its dose values is NA.
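To see which drugs survive the grouped filter before plotting, a small sketch like the following can help. It uses anyNA(dose), which is equivalent to any(is.na(dose)); the name complete_drugs is made up for this illustration.

```r
library(dplyr)

df <- data.frame(
  year = rep(c(2001, 2002, 2003, 2004), times = 3),
  dose = c(500, 600, 750, 550, 300, 330, 350, 390, 100, NA, 250, 125),
  drug = rep(c("A", "B", "C"), each = 4),
  stringsAsFactors = FALSE
)

complete_drugs <- df %>%
  group_by(drug) %>%
  filter(!anyNA(dose)) %>%   # drop every row of a drug that has any NA dose
  ungroup()

unique(complete_drugs$drug)
#[1] "A" "B"
```

All four rows of drug C are removed, not only the 2002 row with the missing dose.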
I have a data.table, data. I want to group the rows by group_label and subtract the value of one group from the values of the other groups.
In other words, I want to subtract the value of NYC in group B from every NYC value, in any group.
Likewise, I want to subtract the value of LA in group B from every LA value, in any group, so that my output looks like result. How can I do that?
data = data.table(city = c("NYC", "NYC", "NYC", "LA", "LA", "LA"),
group_label = c("A", "A", "B", "B", "A", "C"),
time_period = c(1980, 1990, 2000, 1982, 2007, 2010),
value = c(2, 20, 13, 24, 4, 6)
)
result = data.table(city = c("NYC", "NYC", "NYC", "LA", "LA", "LA"),
group_label = c("A", "A", "B", "B", "A", "C"),
value = c(2, 20, 13, 24, 4, 6),
time_period = c(1980, 1990, 2000, 1982, 2007, 2010),
diff = c(-11, 7, 0, 0, -20, -18)
)
An option would be
data[, diff := value - value[group_label == "B"], city]
Or with dplyr
library(dplyr)
data %>%
group_by(city) %>%
mutate(diff = value - value[group_label == "B"])
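Both versions rely on each city having exactly one row in group B; otherwise value[group_label == "B"] does not give a single reference value per city. As a hedged sketch, match() can be used instead: it returns the position of the first "B" row, or NA when a city has none, so diff becomes NA for that city. This variant is an illustration, not part of the answer, and out is a made-up name.

```r
library(dplyr)

data <- data.frame(
  city = c("NYC", "NYC", "NYC", "LA", "LA", "LA"),
  group_label = c("A", "A", "B", "B", "A", "C"),
  time_period = c(1980, 1990, 2000, 1982, 2007, 2010),
  value = c(2, 20, 13, 24, 4, 6),
  stringsAsFactors = FALSE
)

out <- data %>%
  group_by(city) %>%
  # position of the first "B" row within the city, NA if there is none
  mutate(diff = value - value[match("B", group_label)]) %>%
  ungroup()

out$diff
#[1] -11   7   0   0 -20 -18
```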