I have a data frame in R with some NAs in its variables. I would like to fill each NA with the immediately preceding value. Below is an example:
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 NA NA
5 2 8
6 1 5
7 NA NA
8 NA NA
9 9 1
10 3 2
In this case, the 4th value of the variable x would be filled with a 5 and so on.
Thank you!
We could use fill from tidyr package:
library(tidyr)
library(dplyr)
dados %>%
fill(c(x,y), .direction = "down")
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 1 5
9 9 1
10 3 2
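We can use coalesce with lag. Note that this fills only an NA whose immediately preceding value is not NA, so the second of two consecutive NAs stays NA (row 8 below):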
library(dplyr)
dados %>%
mutate(across(x:y, ~ coalesce(., lag(.))))
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 NA NA
9 9 1
10 3 2
library(dplyr)
dados %>%
mutate(x = case_when(is.na(x) ~ lag(x),
TRUE ~ x),
y = case_when(is.na(y) ~ lag(y),
TRUE ~ y))
The following only works if the first value in a column is not NA; for the sake of clear and simple code, I leave that case as an exercise. For one column, it can be solved like this:
library(tibble)
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
#where are the NA?
pos <- dados$x |>
is.na() |>
which()
# replace
while(any(is.na(dados$x)))
dados$x[pos] <- dados$x[pos-1]
dados
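The leading-NA caveat mentioned above can be avoided without a loop. The following is a minimal base R sketch, not part of the original answer; the helper name locf is hypothetical. It indexes into the non-missing values with a running count, so a leading NA simply stays NA.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only).
locf <- function(v) {
  idx <- cumsum(!is.na(v))       # how many non-NA values seen so far
  c(NA, v[!is.na(v)])[idx + 1]   # idx == 0 (leading NA) maps to the NA slot
}

x <- c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3)
locf(x)
# 2 3 5 5 2 1 1 1 9 3

locf(c(NA, 1, NA))
# NA 1 1
```

It could be applied to every column with dados[] <- lapply(dados, locf).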
I have a data frame that looks like:
df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
y = c(NA, 2, NA, NA, NA, 3, NA, NA, NA, 1, NA, NA))
I want it to look like this:
data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
y = c(0, 2, 2, 0, 0, 3, 3, 3, 0, 1, 0, 0))
#> x y
#> 1 1 0
#> 2 2 2
#> 3 3 2
#> 4 4 0
#> 5 5 0
#> 6 6 3
#> 7 7 3
#> 8 8 3
#> 9 9 0
#> 10 10 1
#> 11 11 0
#> 12 12 0
I have solved with a while-loop, but was looking for a more R-like solution.
This is the loop solution:
df[is.na(df)] <- 0 # replace all NA with 0
i = 1
while (i < nrow(df)){
if (df$y[i] < 2){ # do nothing if y = 1
i = i+1
} else {
df$y[(i+1):(i+df$y[i]-1)] <- df$y[i]
i = i+df$y[i]
}
}
Bonus question: could it be done within a pipe and for multiple columns (e.g. a column z = c(1, NA, NA, NA, 4, NA, NA, NA, NA, 2, NA, NA))?
You can create an empty vector with numeric, get the value with complete.cases and rep, and get the indices with complete.cases and sequence:
fill_n_repeat <- function(x){
value = x[complete.cases(x)]
idx = which(complete.cases(x))
v = numeric(length(x))
v[sequence(value, idx)] <- rep(value, value)
v
}
library(dplyr)
df %>%
mutate(across(y:z, fill_n_repeat))
x y z
1 1 0 1
2 2 2 0
3 3 2 0
4 4 0 0
5 5 0 4
6 6 3 4
7 7 3 4
8 8 3 4
9 9 0 0
10 10 1 2
11 11 0 2
12 12 0 0
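To make the indexing step in fill_n_repeat easier to follow, here is a small trace of the intermediate vectors for column y. It is only an illustration; the names targets, fills and out are made up for this sketch, and the from argument of sequence requires R 4.0.0 or later.

```r
y <- c(NA, 2, NA, NA, NA, 3, NA, NA, NA, 1, NA, NA)

value <- y[!is.na(y)]             # 2 3 1    : the run lengths (and fill values)
idx   <- which(!is.na(y))         # 2 6 10   : where each run starts

targets <- sequence(value, idx)   # 2 3 6 7 8 10 : positions to overwrite
fills   <- rep(value, value)      # 2 2 3 3 3 1  : value for each position

out <- numeric(length(y))         # start from all zeros
out[targets] <- fills
out
# 0 2 2 0 0 3 3 3 0 1 0 0
```

One thing to keep in mind: if a run were long enough to reach past the end of the vector, the assignment would extend out, so real data may need a bounds check.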
Group the rows so that each non-NA starts a new group and then for each such group if the first element is NA then output 0's and otherwise output the first element that many times followed by 0's. This uses base R only but if you prefer dplyr replace transform with mutate and all else stays the same.
f <- function(x) if (is.na(x[1])) 0 else ifelse(seq_along(x) > x[1], 0, x[1])
transform(df, y = ave(y, cumsum(!is.na(y)), FUN = f))
giving:
x y
1 1 0
2 2 2
3 3 2
4 4 0
5 5 0
6 6 3
7 7 3
8 8 3
9 9 0
10 10 1
11 11 0
12 12 0
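The grouping is the key step here, so it may help to look at it on its own. This sketch uses the same f and the same y values as above; the variable g is introduced only for illustration.

```r
y <- c(NA, 2, NA, NA, NA, 3, NA, NA, NA, 1, NA, NA)

# Each non-NA value starts a new group; leading NAs form group 0.
g <- cumsum(!is.na(y))
g
# 0 1 1 1 1 2 2 2 2 3 3 3

f <- function(x) if (is.na(x[1])) 0 else ifelse(seq_along(x) > x[1], 0, x[1])

# ave() applies f within each group and puts the results back in place.
res <- ave(y, g, FUN = f)
res
# 0 2 2 0 0 3 3 3 0 1 0 0
```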
If there are several columns, let ix hold the numbers or names of the columns to process, then run the same f as above over each of them:
ix <- "y"
f <- function(x) if (is.na(x[1])) 0 else ifelse(seq_along(x) > x[1], 0, x[1])
f2 <- function(i) ave(df[[i]], cumsum(!is.na(df[[i]])), FUN = f)
replace(df, ix, lapply(ix, f2))
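Alternatively, try the code below, which needs no custom function:
library(dplyr)
library(tidyr)
df2 <- df %>% mutate(z=y) %>% fill(z) %>% group_by(y,z) %>%
mutate(row=row_number()+1, y=ifelse(z>=row,z,y)) %>% ungroup() %>%
select(-z,-row) %>%
mutate(y = coalesce(y, 0)) # remaining NAs become 0, as in the desired output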
I have a data frame like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp) within the values of maxCumSum and minCumSum. The respective table dt2 should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water storage which has a certain maximum capacity and whose level cannot sink below zero.
The normal cumsum obviously does not do the trick, since it has no upper or lower limit. Does anyone have a suggestion how to achieve this? In the real data frame there are of course more than 2 groups and far more than 5 time steps.
Many thanks!
You can write a function that adds each value to the running sum and clamps the result between minCumSum and maxCumSum, then apply it cumulatively with Reduce, like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)
df %>%
group_by(grp) %>%
mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)
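For anyone wondering why the [-1] is needed, here is a small base R trace of the same idea, offered as a sketch rather than a replacement for the answer above. The name clamp_add is hypothetical, the limits 8 and 0 are hard-coded from the question, and ave stands in for group_by/mutate.

```r
df <- data.frame(grp = rep(c("a", "b"), each = 5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))

# Add the change to the current level, then clamp to the range [0, 8].
clamp_add <- function(level, change) max(min(level + change, 8), 0)

# With accumulate = TRUE, Reduce returns the initial value first,
# followed by one running level per input element.
trace <- Reduce(clamp_add, c(-1, 5, 9, -15, 6), 0, accumulate = TRUE)
trace
# 0 0 5 8 0 6   (the first 0 is the initial value, hence the [-1])

# Same computation per group, base R only.
df$CumSum <- ave(df$value, df$grp,
                 FUN = function(v) Reduce(clamp_add, v, 0, accumulate = TRUE)[-1])
df$CumSum
# 0 5 8 0 6 5 6 8 0 8
```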
I receive a data frame whose number of V columns grows regularly (currently up to V49). This example uses only V1 to V7. There are many NAs, plus three other columns (ID, REP and all) that are used in a later step.
ID <- c("A", "B", "B3", "E4", "JE5", "L6")
V1 <- c(3, 5, 1, 3, 7, 1)
V2 <- c(6, 4, 2, 7, 6, 2)
V3 <- c(6, 5, 2, 7, 6, 3)
V4 <- c(6, 7, 1, 7, 6, 3)
V5 <- c(NA, NA, 2, 7, NA, 3)
V6 <- c(NA, NA, 2, 7, NA, 3)
V7 <- c(NA, NA, NA, 7, NA, 3)
REP <- c(4, 4, 6, 7, 4, 7)
all <- c(6, 5, 2, 7, 6, 3)
variation <- c(0, 0, 0, 0, 0, 0)
df <- data.frame(ID, V1, V2, V3, V4, V5, V6, V7, REP, all, variation)
I want this result: set variation to 2 when V(i+1) - Vi is < 0 or > 1,
for i from V1 to Vmax. Vmax differs by individual: V4 for A and B, V6 for B3, and so on.
# ID V1 V2 V3 V4 V5 V6 V7 REP all variation
# 1 A 3 6 6 6 NA NA NA 4 6 0
# 2 B 5 4 5 7 NA NA NA 4 5 0
# 3 B3 1 2 2 1 2 2 NA 6 2 0
# 4 E4 3 7 7 7 7 7 7 7 7 0
# 5 JE5 7 6 6 6 NA NA NA 4 6 0
# 6 L6 1 2 3 3 3 3 3 7 3 0
I tried a loop, but it takes a long time, so I tried case_when(). It works, but every week I need to add a new line, because I want to detect a variation between each variable Vi and Vi+1:
!between(Vi+1 - Vi, 0, 1) ~ 2,....
df <- df %>%
mutate(variation = case_when(
!between(V2 - V1, 0, 1) ~ 2,
!between(V3 - V2, 0, 1) ~ 2,
!between(V4 - V3, 0, 1) ~ 2,
!between(V5 - V4, 0, 1) ~ 2,
!between(V6 - V5, 0, 1) ~ 2,
!between(V7 - V6, 0, 1) ~ 2,
TRUE ~ as.numeric(variation)))
Is there a way to increment case_when automatically, or another function such as apply or map, to avoid writing out all the variables by hand?
Thank you in advance.
I would pivot the data into a longer format, then just compare all the values to their lag in a single statement.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(matches("V[0-9]+")) %>%
group_by(ID) %>%
summarize(variation = if (any(!between(value - lag(value), 0, 1), na.rm = TRUE)) 2 else unique(variation)) %>%
right_join(select(df, -variation), by = "ID")
#> # A tibble: 6 × 11
#> ID variation V1 V2 V3 V4 V5 V6 V7 REP all
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2 3 6 6 6 NA NA NA 4 6
#> 2 B 2 5 4 5 7 NA NA NA 4 5
#> 3 B3 2 1 2 2 1 2 2 NA 6 2
#> 4 E4 2 3 7 7 7 7 7 7 7 7
#> 5 JE5 2 7 6 6 6 NA NA NA 4 6
#> 6 L6 0 1 2 3 3 3 3 3 7 3
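The same comparison can also be sketched without reshaping, by subtracting adjacent columns of a matrix. This is an alternative technique, not the answer's method, and it assumes the V columns already appear in increasing order in the data frame. The names v_cols, m, step and bad are made up for the sketch.

```r
ID <- c("A", "B", "B3", "E4", "JE5", "L6")
df <- data.frame(
  ID,
  V1 = c(3, 5, 1, 3, 7, 1), V2 = c(6, 4, 2, 7, 6, 2),
  V3 = c(6, 5, 2, 7, 6, 3), V4 = c(6, 7, 1, 7, 6, 3),
  V5 = c(NA, NA, 2, 7, NA, 3), V6 = c(NA, NA, 2, 7, NA, 3),
  V7 = c(NA, NA, NA, 7, NA, 3),
  variation = 0
)

# Pick up every V column by name, however many there are this week.
v_cols <- grep("^V[0-9]+$", names(df), value = TRUE)
m <- as.matrix(df[v_cols])

# step[, i] holds V(i+1) - Vi; differences involving an NA are NA.
step <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]

# Flag rows with at least one step outside [0, 1], ignoring NA steps.
bad <- rowSums(step < 0 | step > 1, na.rm = TRUE) > 0
df$variation[bad] <- 2
df$variation
# 2 2 2 2 2 0
```

Because the columns are selected by pattern, nothing needs to be edited when V8, V9 and later columns arrive.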
Assume we have an email dataset with a sender and a recipient in every row. We want to find the next occurrence in the dataset for which the sender and the recipient are interchanged. So if sender==x & recipient==y, we are looking for the next row that has sender==y & recipient==x. Subsequently, we want to calculate the difference between counts for those observations. See the column diff_count for the desired output.
# creating the data.frame
id = 1:10
sender = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
df <- data.frame(id, sender, recipient, count)
# output should look like this
df$diff_count <- c(3, 13, 2, NA, 19, 17, NA, 10, NA, NA)
If there are no more observations that satisfy the requirement, then we simply fill in NA. Solution should be relatively easy with tidyverse, but I seem not to be able to do it.
Another dplyr way, without a custom function but with several self joins (data here is the question's data frame, before the diff_count column is added):
library(dplyr)
data %>%
left_join(data,
by = c("sender" = "recipient", "recipient" = "sender"),
suffix = c("", ".y")) %>%
filter(id < id.y) %>%
group_by(id) %>%
slice_min(id.y) %>%
ungroup() %>%
mutate(diff_count = count.y - count) %>%
right_join(data) %>%
select(-matches("\\.(y|x)")) %>%
arrange(id)
returns
Joining, by = c("id", "sender", "recipient", "count")
# A tibble: 10 x 5
id sender recipient count diff_count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 1 3
2 2 2 1 4 13
3 3 3 2 5 2
4 4 2 3 7 NA
5 5 3 1 12 19
6 6 1 2 17 17
7 7 2 3 24 NA
8 8 1 3 31 10
9 9 2 1 34 NA
10 10 3 1 41 NA
There should be easier ways, but below is one way using a custom function in tidyverse style:
library(dplyr)
calc_diff <- function(df, send, recp, cnt) {
df %>%
slice_tail(n = nrow(df) - cur_group_rows()) %>%
filter(sender == send, recipient == recp) %>%
slice_head(n = 1) %>%
pull(count) %>%
{ifelse(length(.) == 0, NA, .)} %>%
`-`(., cnt)
}
df %>%
rowwise(id) %>%
mutate(diff_count = calc_diff(df,
send = recipient,
recp = sender,
cnt = count))
#> # A tibble: 10 x 5
#> # Rowwise: id
#> id sender recipient count diff_count
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1 3
#> 2 2 2 1 4 13
#> 3 3 3 2 5 2
#> 4 4 2 3 7 NA
#> 5 5 3 1 12 19
#> 6 6 1 2 17 17
#> 7 7 2 3 24 NA
#> 8 8 1 3 31 10
#> 9 9 2 1 34 NA
#> 10 10 3 1 41 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
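For comparison, the lookup itself can be sketched in base R. This is not one of the answers above, and the function name next_reverse_diff is hypothetical. It scans, for each row, the later rows with sender and recipient swapped and takes the first one.

```r
sender    <- c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient <- c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count     <- c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)

# Hypothetical helper: difference in count to the next row with
# sender and recipient interchanged, or NA if there is none.
next_reverse_diff <- function(sender, recipient, count) {
  n <- length(sender)
  vapply(seq_len(n), function(i) {
    later <- which(sender == recipient[i] &
                   recipient == sender[i] &
                   seq_len(n) > i)
    if (length(later) == 0) NA_real_ else count[later[1]] - count[i]
  }, numeric(1))
}

res <- next_reverse_diff(sender, recipient, count)
res
# 3 13 2 NA 19 17 NA 10 NA NA
```

This compares every row with every other row, so it is quadratic in the number of rows; for a large dataset a join-based approach is likely the better fit.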
I am interested in reversing the values for a column that has NA values in a tidy way.
The rev call won't do the trick here:
library(tidyverse)
tibble(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) %>%
mutate(y = rev(x))
As it completely reverses the values (NAs included).
I essentially want a tidy mutate command (no splitting / joining) that reverses the values for the Which column so that E has value 1 (max becomes min) B has value 16 (min becomes max), etc - and NA values remain NA (F, G, I & J).
Edit:
Several answers do not achieve intended outcome. The question is aimed at effectively having a reverse (rev) work while keeping NAs in position.
@Moody_Mudskipper has a solution to the case where there are no repeats, but it fails when there are repeats, e.g.:
rev_na <- function(x) setNames(sort(x), sort(x, TRUE))[as.character(x)]
Works here:
tibble(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) %>%
mutate(y = rev_na(x))
Fails here:
tibble(
Which = LETTERS[1:7],
x = c(3,1,9,9,9, 9, 10)
) %>% mutate(y = rev_na(x), z = rev(x))
If you can tolerate a little hack:
tibble(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) %>%
mutate(y = setNames(sort(x), sort(x, TRUE))[as.character(x)])
#> # A tibble: 11 x 3
#> Which x y
#> <chr> <dbl> <dbl>
#> 1 A 3 4
#> 2 B 1 16
#> 3 C 4 3
#> 4 D 2 10
#> 5 E 16 1
#> 6 F NA NA
#> 7 G NA NA
#> 8 H 4 3
#> 9 I NA NA
#> 10 J NA NA
#> 11 K 10 2
Created on 2021-05-11 by the reprex package (v0.3.0)
This will do
data.frame(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) -> df
df %>% group_by(d = is.na(x)) %>%
arrange(x) %>%
mutate(y = ifelse(!d, rev(x), x)) %>%
ungroup %>% select(-d)
# A tibble: 11 x 3
Which x y
<chr> <dbl> <dbl>
1 B 1 16
2 D 2 10
3 A 3 4
4 C 4 4
5 H 4 3
6 K 10 2
7 E 16 1
8 F NA NA
9 G NA NA
10 I NA NA
11 J NA NA
If Which was already sorted, you can simply arrange the result back by it; otherwise, create a row_number() column at the start of the pipeline and arrange by that at the end.
df %>%
group_by(d = is.na(x)) %>%
arrange(x) %>%
mutate(y = ifelse(!d, rev(x), x)) %>%
ungroup %>% select(-d) %>%
arrange(Which)
# A tibble: 11 x 3
Which x y
<chr> <dbl> <dbl>
1 A 3 4
2 B 1 16
3 C 4 4
4 D 2 10
5 E 16 1
6 F NA NA
7 G NA NA
8 H 4 3
9 I NA NA
10 J NA NA
11 K 10 2
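The sort-then-reverse idea can also be written as a plain vector function, so that it works inside a single mutate without reordering the rows. The sketch below is an assumption-laden illustration, not a tested general solution: the name rev_by_rank is hypothetical, and ties are broken by order of appearance (ties.method = "first"), which is one possible choice among several.

```r
# Hypothetical helper: the smallest value gets the largest, the second
# smallest gets the second largest, and so on; NAs stay where they are.
rev_by_rank <- function(x) {
  ok <- !is.na(x)
  r <- rank(x[ok], ties.method = "first")      # 1 = smallest; ties by position
  x[ok] <- sort(x[ok], decreasing = TRUE)[r]   # rank k takes the k-th largest
  x
}

x <- c(c(3, 1, 4, 2, 16), NA, NA, 4, rep(NA, 2), 10)
rev_by_rank(x)
# 4 16 4 10 1 NA NA 3 NA NA 2
```

Inside a pipeline this would be used as mutate(y = rev_by_rank(x)). With repeated values, the two 4s here receive 4 and 3 respectively, matching the arranged output above.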