Automatically extending case_when, or replacing loops - r

I receive a data frame whose number of V columns grows regularly (currently up to V49). This example uses only V1 to V7. The data contain many NAs and three other columns, ID, REP and all, which are used in a later step.
ID <- c("A", "B", "B3", "E4", "JE5", "L6")
V1 <- c(3, 5, 1, 3, 7, 1)
V2 <- c(6, 4, 2, 7, 6, 2)
V3 <- c(6, 5, 2, 7, 6, 3)
V4 <- c(6, 7, 1, 7, 6, 3)
V5 <- c(NA, NA, 2, 7, NA, 3)
V6 <- c(NA, NA, 2, 7, NA, 3)
V7 <- c(NA, NA, NA, 7, NA, 3)
REP <- c(4, 4, 6, 7, 4, 7)
all <- c(6, 5, 2, 7, 6, 3)
variation <- c(0, 0, 0, 0, 0, 0)
df <- data.frame(ID, V1, V2, V3, V4, V5, V6, V7, REP, all, variation)
I want this result: set variation to 2 when V(i+1) - Vi is < 0 or > 1,
for i from V1 to Vmax. Vmax differs for each individual: V4 for A and B, V6 for B3, and so on.
# ID V1 V2 V3 V4 V5 V6 V7 REP all variation
# 1 A 3 6 6 6 NA NA NA 4 6 0
# 2 B 5 4 5 7 NA NA NA 4 5 0
# 3 B3 1 2 2 1 2 2 NA 6 2 0
# 4 E4 3 7 7 7 7 7 7 7 7 0
# 5 JE5 7 6 6 6 NA NA NA 4 6 0
# 6 L6 1 2 3 3 3 3 3 7 3 0
I tried a loop, but it takes a long time, so I switched to case_when(). It works, but every week I need to add a new line, because I want to detect a variation between each pair of consecutive variables Vi and V(i+1):
!between(V(i+1) - Vi, 0, 1) ~ 2, ...
library(dplyr)
df <- df %>%
  mutate(variation = case_when(
    !between(V2 - V1, 0, 1) ~ 2,
    !between(V3 - V2, 0, 1) ~ 2,
    !between(V4 - V3, 0, 1) ~ 2,
    !between(V5 - V4, 0, 1) ~ 2,
    !between(V6 - V5, 0, 1) ~ 2,
    !between(V7 - V6, 0, 1) ~ 2,
    TRUE ~ as.numeric(variation)))
Is there a way to generate these case_when conditions automatically, or another function such as apply or map, so that I do not have to write every variable by hand?
Thank you in advance.

I would pivot the data into a longer format, then just compare all the values to their lag in a single statement.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(matches("V[0-9]+")) %>%
  group_by(ID) %>%
  summarize(variation = if (any(!between(value - lag(value), 0, 1), na.rm = TRUE)) 2 else unique(variation)) %>%
  right_join(select(df, -variation), by = "ID")
#> # A tibble: 6 × 11
#> ID variation V1 V2 V3 V4 V5 V6 V7 REP all
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2 3 6 6 6 NA NA NA 4 6
#> 2 B 2 5 4 5 7 NA NA NA 4 5
#> 3 B3 2 1 2 2 1 2 2 NA 6 2
#> 4 E4 2 3 7 7 7 7 7 7 7 7
#> 5 JE5 2 7 6 6 6 NA NA NA 4 6
#> 6 L6 0 1 2 3 3 3 3 3 7 3
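If reshaping is not wanted, the same comparison can probably be done on a matrix of the V columns. This is only a minimal base R sketch, not the method from the answer above. It assumes that the V columns appear in increasing order (V1, V2, ...) and that only they match the pattern ^V[0-9]+$; the names v, d and flag are illustrative.

```r
# Example data from the question
df <- data.frame(
  ID  = c("A", "B", "B3", "E4", "JE5", "L6"),
  V1  = c(3, 5, 1, 3, 7, 1),
  V2  = c(6, 4, 2, 7, 6, 2),
  V3  = c(6, 5, 2, 7, 6, 3),
  V4  = c(6, 7, 1, 7, 6, 3),
  V5  = c(NA, NA, 2, 7, NA, 3),
  V6  = c(NA, NA, 2, 7, NA, 3),
  V7  = c(NA, NA, NA, 7, NA, 3),
  REP = c(4, 4, 6, 7, 4, 7),
  all = c(6, 5, 2, 7, 6, 3),
  variation = rep(0, 6)
)

# Select every V column by name, however many there are this week
# (assumption: columns are already ordered V1, V2, ...)
v <- as.matrix(df[grep("^V[0-9]+$", names(df))])

# Differences between consecutive columns: V2 - V1, V3 - V2, ...
d <- v[, -1, drop = FALSE] - v[, -ncol(v), drop = FALSE]

# A row is flagged when any difference is < 0 or > 1; NAs are ignored
flag <- rowSums(d < 0 | d > 1, na.rm = TRUE) > 0
df$variation <- ifelse(flag, 2, df$variation)
df$variation
```

Because the columns are picked up by pattern, nothing has to be edited when V8, V9 and so on arrive.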

Related

Fill NAs in column with either 0 or value n above repeated n times

I have a data frame that looks like:
df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
y = c(NA, 2, NA, NA, NA, 3, NA, NA, NA, 1, NA, NA))
I want it to look like this:
data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
y = c(0, 2, 2, 0, 0, 3, 3, 3, 0, 1, 0, 0))
#> x y
#> 1 1 0
#> 2 2 2
#> 3 3 2
#> 4 4 0
#> 5 5 0
#> 6 6 3
#> 7 7 3
#> 8 8 3
#> 9 9 0
#> 10 10 1
#> 11 11 0
#> 12 12 0
I solved it with a while loop, but I am looking for a more idiomatic R solution.
This is the loop solution:
df[is.na(df)] <- 0 # replace all NA with 0
i <- 1
while (i < nrow(df)) {
  if (df$y[i] < 2) { # nothing to repeat if y is 0 or 1
    i <- i + 1
  } else {
    # repeat the value y[i] so that it appears y[i] times in total
    df$y[(i + 1):(i + df$y[i] - 1)] <- df$y[i]
    i <- i + df$y[i]
  }
}
Bonus question: could it be done within a pipe and for multiple columns (e.g. a column z = c(1, NA, NA, NA, 4, NA, NA, NA, NA, 2, NA, NA))?
You can create an empty vector with numeric, get the value with complete.cases and rep, and get the indices with complete.cases and sequence:
fill_n_repeat <- function(x){
value = x[complete.cases(x)]
idx = which(complete.cases(x))
v = numeric(length(x))
v[sequence(value, idx)] <- rep(value, value)
v
}
library(dplyr)
df %>%
mutate(across(y:z, fill_n_repeat))
x y z
1 1 0 1
2 2 2 0
3 3 2 0
4 4 0 0
5 5 0 4
6 6 3 4
7 7 3 4
8 8 3 4
9 9 0 0
10 10 1 2
11 11 0 2
12 12 0 0
Group the rows so that each non-NA starts a new group and then for each such group if the first element is NA then output 0's and otherwise output the first element that many times followed by 0's. This uses base R only but if you prefer dplyr replace transform with mutate and all else stays the same.
f <- function(x) if (is.na(x[1])) 0 else ifelse(seq_along(x) > x[1], 0, x[1])
transform(df, y = ave(y, cumsum(!is.na(y)), FUN = f))
giving
x y
1 1 0
2 2 2
3 3 2
4 4 0
5 5 0
6 6 3
7 7 3
8 8 3
9 9 0
10 10 1
11 11 0
12 12 0
If there are several columns, let ix hold the names (or numbers) of the columns to be processed and run the same f as above over each of them.
ix <- "y"
f <- function(x) if (is.na(x[1])) 0 else ifelse(seq_along(x) > x[1], 0, x[1])
f2 <- function(i) ave(df[[i]], cumsum(!is.na(df[[i]])), FUN = f)
replace(df, ix, lapply(ix, f2))
Alternatively, try the code below, which does not need a custom function:
df2 <- df %>% mutate(z=y) %>% fill(z) %>% group_by(y,z) %>%
mutate(row=row_number()+1, y=ifelse(z>=row,z,y)) %>% ungroup() %>%
select(-z,-row)

Is it possible to get the indices of the unselected rows of a data frame in R?

I want the indices of the unselected rows when using sample() in R. Consider the following case.
df <- data.frame(id = c(1,1,2,2,3,3),
v1 = c(2,2,9,4,7,1),
v2 = c(3,5,8,5,8,5))
ss <- ceiling(0.5*nrow(df)) #size
set.seed(123)
rid <- sample(seq_len(nrow(df)),size=ss,replace=F)
Now rows 3, 6 and 2 are randomly selected. Is there a way to get the indices of the unselected rows (1, 4, 5)?
Thanks!
You can use df[-rid,]:
df <- data.frame(
id = c(1, 1, 2, 2, 3, 3),
v1 = c(2, 2, 9, 4, 7, 1),
v2 = c(3, 5, 8, 5, 8, 5)
)
ss <- ceiling(0.5 * nrow(df)) # size
set.seed(123)
rid <- sample(seq_len(nrow(df)), size = ss, replace = F)
rid
#> [1] 3 6 2
df
#> id v1 v2
#> 1 1 2 3
#> 2 1 2 5
#> 3 2 9 8
#> 4 2 4 5
#> 5 3 7 8
#> 6 3 1 5
df[rid,]
#> id v1 v2
#> 3 2 9 8
#> 6 3 1 5
#> 2 1 2 5
df[-rid, ]
#> id v1 v2
#> 1 1 2 3
#> 4 2 4 5
#> 5 3 7 8
rownames(df[-rid, ])
#> [1] "1" "4" "5"
Created on 2021-11-05 by the reprex package (v2.0.1)
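If only the indices are needed, rather than the rows themselves, they can also be computed directly. The following is a small sketch using setdiff(); the name unselected is just illustrative. Unlike rownames(), it returns integer positions and does not rely on the data frame having default row names.

```r
df <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  v1 = c(2, 2, 9, 4, 7, 1),
  v2 = c(3, 5, 8, 5, 8, 5)
)
ss <- ceiling(0.5 * nrow(df)) # sample size
set.seed(123)
rid <- sample(seq_len(nrow(df)), size = ss, replace = FALSE)

# All row positions that were not drawn by sample()
unselected <- setdiff(seq_len(nrow(df)), rid)

# Equivalent with negative indexing on the index vector itself
unselected2 <- seq_len(nrow(df))[-rid]
```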

Having trouble constructing a 2*2*2 contingency table

The file "Aspirin" contains a 2 × 2 × 2 contingency table with columns defined as follows.
Column 1: V1=Observation number. [Observations 1 to 8.]
Column 2: V2=Count. [Nonnegative integer count for each cell in the Table.]
Column 3: V3=Case/Control Factor. [Factor Level 1 (Controls) and Level 2 (Cases).]
Column 4: V4=Ulcer Type Factor. [Factor Level 1 (Gastric) and Level 2 (Duodenal).]
Column 5: V5=Aspirin Use Factor. [Factor Level 1 (Non-User) and Level 2 (User).]
> aspirin
V1 V2 V3 V4 V5
1 1 62 1 1 1
2 2 39 2 1 1
3 3 53 1 2 1
4 4 49 2 2 1
5 5 6 1 1 2
6 6 25 2 1 2
7 7 8 1 2 2
8 8 8 2 2 2
I want to construct a 2x2x2 contingency table like the image above in R, so I typed the following code:
case_control=factor(aspirin$V3)
ulcer=factor(aspirin$V4)
use=factor(aspirin$V5)
table(case_control,ulcer,use)
But I get something like this:
, , use = 1
ulcer
case_control 1 2
1 1 1
2 1 1
, , use = 2
ulcer
case_control 1 2
1 1 1
2 1 1
I want a contingency table with counts, so the result above is obviously not what I want. Is there a way to fix this?
In your case, just use
ftable(case_control,ulcer,use)
which returns a "flat" table
use 1 2
case_control ulcer
1 1 1 1
2 1 1
2 1 1 1
2 1 1
The main problem here is that you are discarding your count column. So, as an alternative, here is what is in my opinion a better approach:
You could use xtabs() together with ftable() (here used in a dplyr pipe):
library(dplyr)
df %>%
transmute(ID = V1,
Count = V2,
Case_Control = factor(V3,
labels = c("Control", "Case")),
Ulcer_Type = factor(V4,
labels = c("Gastric", "Duodenal")),
Aspirin_Use = factor(V5,
labels = c("Non-User", "User"))) %>%
xtabs(Count ~ Ulcer_Type + Case_Control + Aspirin_Use, data = .) %>%
ftable()
This returns
Aspirin_Use Non-User User
Ulcer_Type Case_Control
Gastric Control 62 6
Case 39 25
Duodenal Control 53 8
Case 49 8
Data
df <- structure(list(V1 = c(1, 2, 3, 4, 5, 6, 7, 8), V2 = c(62, 39,
53, 49, 6, 25, 8, 8), V3 = c(1, 2, 1, 2, 1, 2, 1, 2), V4 = c(1,
1, 2, 2, 1, 1, 2, 2), V5 = c(1, 1, 1, 1, 2, 2, 2, 2)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
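For readers who prefer not to load dplyr, the same idea (weighting each cell by the count column) can be sketched in base R. This assumes the data frame is called aspirin as in the question; the factor column names case_control, ulcer and use are taken from the question's code, and tab is an illustrative name.

```r
# Data as printed in the question
aspirin <- data.frame(
  V1 = 1:8,
  V2 = c(62, 39, 53, 49, 6, 25, 8, 8),
  V3 = c(1, 2, 1, 2, 1, 2, 1, 2),
  V4 = c(1, 1, 2, 2, 1, 1, 2, 2),
  V5 = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Label the factor levels as described in the column definitions
aspirin$case_control <- factor(aspirin$V3, labels = c("Control", "Case"))
aspirin$ulcer        <- factor(aspirin$V4, labels = c("Gastric", "Duodenal"))
aspirin$use          <- factor(aspirin$V5, labels = c("Non-User", "User"))

# The left-hand side of the formula supplies the counts, so each cell
# holds V2 instead of the number of rows (which is always 1 here)
tab <- xtabs(V2 ~ case_control + ulcer + use, data = aspirin)
tab          # 2 x 2 x 2 array
ftable(tab)  # flat display of the same table
```

The key difference from table(case_control, ulcer, use) is the count column on the left of the formula; table() only counts rows.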

Completing the NAs of a Tibble in R

I have a dataset in R with some NAs in its variables. I would like to fill each NA with the immediately preceding value. Below is an example:
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 NA NA
5 2 8
6 1 5
7 NA NA
8 NA NA
9 9 1
10 3 2
In this case, the 4th value of the variable x would be filled with a 5 and so on.
Thank you!
We could use fill from the tidyr package:
library(tidyr)
library(dplyr)
dados %>%
fill(c(x,y), .direction = "down")
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 1 5
9 9 1
10 3 2
We can use coalesce with lag, although this fills only the first NA of each run (row 8 stays NA below):
library(dplyr)
dados %>%
mutate(across(x:y, ~ coalesce(., lag(.))))
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 NA NA
9 9 1
10 3 2
library(dplyr)
dados %>%
mutate(x = case_when(is.na(x) ~ lag(x),
TRUE ~ x),
y = case_when(is.na(y) ~ lag(y),
TRUE ~ y))
The following only works if the first value in the column is not NA; handling that case is left as an exercise to keep the code clear and simple. For one column, we can solve it like this:
library(tibble)
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
#where are the NA?
pos <- dados$x |>
is.na() |>
which()
# replace
while(any(is.na(dados$x)))
dados$x[pos] <- dados$x[pos-1]
dados
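A loop-free base R variant is also possible. The sketch below is one way to do it, under the assumption that leading NAs should simply stay NA; fill_down is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: carry the last non-NA value forward
fill_down <- function(x) {
  # idx[i] = number of non-NA values seen up to position i (0 before the first)
  idx <- cumsum(!is.na(x))
  # Prepend NA so that idx == 0 (leading NAs) maps to NA
  c(NA, x[!is.na(x)])[idx + 1]
}

dados <- data.frame(
  x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
  y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)

# Apply the helper to every column
dados[] <- lapply(dados, fill_down)
dados
```

Unlike the coalesce and case_when versions above, this fills runs of several consecutive NAs in one pass.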

dplyr how to lag by group

I have a data frame of orders and receivables with lead times.
Can I use dplyr to fill in the receive column according to each group's lead time?
df <- data.frame(team = c("a","a","a","a", "a", "b", "b", "b", "b", "b"),
order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2))
>df
team order lead_time
a 2 3
a 4 3
a 3 3
a 5 3
a 6 3
b 7 2
b 8 2
b 5 2
b 4 2
b 5 2
And adding a receive column like so:
dfb <- data.frame(team = c("a","a","a","a", "a", "b", "b", "b", "b", "b"),
order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2),
receive = c(0, 0, 0, 2, 4, 0, 0, 7, 8, 5))
>dfb
team order lead_time receive
a 2 3 0
a 4 3 0
a 3 3 0
a 5 3 2
a 6 3 4
b 7 2 0
b 8 2 0
b 5 2 7
b 4 2 8
b 5 2 5
I was thinking along these lines, but I ran into an error:
dfc <- df %>%
group_by(team) %>%
mutate(receive = if_else( row_number() < lead_time, 0, lag(order, n = lead_time)))
Error in mutate_impl(.data, dots) :
could not convert second argument to an integer. type=SYMSXP, length = 1
Thanks for the help!
This looks like a bug. The lag function may be unintentionally masked between the dplyr and stats packages, so try this workaround, which calls dplyr::lag explicitly and passes a single value for n:
df %>%
group_by(team) %>%
# explicitly specify the source of the lag function here
mutate(receive = dplyr::lag(order, n=unique(lead_time), default=0))
#Source: local data frame [10 x 4]
#Groups: team [2]
# team order lead_time receive
# <fctr> <dbl> <dbl> <dbl>
#1 a 2 3 0
#2 a 4 3 0
#3 a 3 3 0
#4 a 5 3 2
#5 a 6 3 4
#6 b 7 2 0
#7 b 8 2 0
#8 b 5 2 7
#9 b 4 2 8
#10 b 5 2 5
We can also use shift from data.table
library(data.table)
setDT(df)[, receive := shift(order, n = lead_time[1], fill=0), by = team]
df
# team order lead_time receive
# 1: a 2 3 0
# 2: a 4 3 0
# 3: a 3 3 0
# 4: a 5 3 2
# 5: a 6 3 4
# 6: b 7 2 0
# 7: b 8 2 0
# 8: b 5 2 7
# 9: b 4 2 8
#10: b 5 2 5
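For completeness, the same per-group shift can be sketched in base R with split() and unsplit(). This is only an illustration: lag_by is a hypothetical helper, and it assumes, like the answers above, that lead_time is constant within each team.

```r
df <- data.frame(
  team = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
  lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2)
)

# Hypothetical helper: shift x down by n positions, padding the top with fill
lag_by <- function(x, n, fill = 0) {
  k <- min(n, length(x))
  c(rep(fill, k), x[seq_len(length(x) - k)])
}

# Split by team, lag each piece by that team's lead time, and reassemble
# the results in the original row order
pieces <- split(df, df$team)
shifted <- lapply(pieces, function(d) lag_by(d$order, d$lead_time[1]))
df$receive <- unsplit(shifted, df$team)
df
```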

Resources