Rolling values by group - r

I would like to do some calculations using frollapply() or rollapplyr() with a conditional factor.
I have the following data
df <- tibble(w = c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4"),
x = 1:12, y = x * 2) %>%
as.data.table()
Using data.table I generate the following result.
df[, sumx := frollapply(x, 3, FUN = sum)]
w   x   y   sumx
NA  1   2   NA
NA  2   4   NA
c1  3   6   6
NA  4   8   9
NA  5   10  12
c2  6   12  15
NA  7   14  18
NA  8   16  21
c3  9   18  24
NA  10  20  27
NA  11  22  30
c4  12  24  33
I like this result, but I would like to do something more complicated.
First: I would like to make this output cleaner, like this:
w   x   y   sumx
NA  1   2   NA
NA  2   4   NA
c1  3   6   6
NA  4   8   NA
NA  5   10  NA
c2  6   12  15
NA  7   14  NA
NA  8   16  NA
c3  9   18  24
NA  10  20  NA
NA  11  22  NA
c4  12  24  33
Second: I would like to create another variable, for example "sumx2", where the value on the "c1" row is the sum (note: not necessarily the sum; it could be the mean or the count of a specific value) of all 4, 5, or n values of variable "x" above (note: if there are fewer than 4, 5, or n values above, the absent values should be treated as NA). The rows for "c2" and "c3" follow the same idea. The expected output would therefore be:
w   x   y   sumx  sumx2
NA  1   2   NA    NA
NA  2   4   NA    NA
c1  3   6   6     6
NA  4   8   NA    NA
NA  5   10  NA    NA
c2  6   12  15    18
NA  7   14  NA    NA
NA  8   16  NA    NA
c3  9   18  24    30
NA  10  20  NA    NA
NA  11  22  NA    NA
c4  12  24  33    42
Your help is appreciated!

If I understood everything correctly:
library(tibble)
df <- tibble(w = c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4"),
x = 1:12, y = x * 2)
library(data.table)
setDT(df)
nm_cols <- c("sumX", "sumx2")
df[, (nm_cols) := list(
ifelse(is.na(w), NA, zoo::rollapplyr(x, width = 3, FUN = sum, partial = TRUE)),
ifelse(is.na(w), NA, zoo::rollapplyr(x, width = 4, FUN = sum, partial = TRUE))
)]
df
#> w x y sumX sumx2
#> 1: <NA> 1 2 NA NA
#> 2: <NA> 2 4 NA NA
#> 3: c1 3 6 6 6
#> 4: <NA> 4 8 NA NA
#> 5: <NA> 5 10 NA NA
#> 6: c2 6 12 15 18
#> 7: <NA> 7 14 NA NA
#> 8: <NA> 8 16 NA NA
#> 9: c3 9 18 24 30
#> 10: <NA> 10 20 NA NA
#> 11: <NA> 11 22 NA NA
#> 12: c4 12 24 33 42
Created on 2021-03-21 by the reprex package (v1.0.0)

Check this
library(data.table)
dt <- data.table(w = c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4"),
x = 1:12)
dt[,id:=rleidv(x)]
#dt[,sumx := ifelse(is.na(w),NA,frollapply(x,3,sum))]
dt[,sumx := fcase(!is.na(w),frollapply(x,3,sum))]
dt[,sumx2 := fcase(!is.na(w) & id == 3, frollapply(x, n = 3, sum),
!is.na(w) & id >= 4, frollapply(x, n = 4, sum))
]
dt[,id:=NULL]
Result:
dt
w x sumx sumx2
1: <NA> 1 NA NA
2: <NA> 2 NA NA
3: c1 3 6 6
4: <NA> 4 NA NA
5: <NA> 5 NA NA
6: c2 6 15 18
7: <NA> 7 NA NA
8: <NA> 8 NA NA
9: c3 9 24 30
10: <NA> 10 NA NA
11: <NA> 11 NA NA
12: c4 12 33 42
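Both answers hard-code the window widths. As a minimal sketch (not taken from either answer), the same idea can be written as a small base R helper that accepts any width n and any summary function. The name roll_at_flags and its arguments are hypothetical, and partial = TRUE is assumed here so that short windows are summarised over whatever values exist, as in the rollapplyr() answer.

```r
# Hypothetical helper: for each flagged row, summarise the current value and
# the (n - 1) values above it; rows that are not flagged stay NA.
roll_at_flags <- function(x, flag, n, FUN = sum, partial = TRUE) {
  out <- rep(NA_real_, length(x))
  for (i in which(flag)) {
    start <- i - n + 1
    if (start < 1) {
      if (!partial) next  # not enough history: leave the result as NA
      start <- 1          # partial window: use the values that exist
    }
    out[i] <- FUN(x[start:i])
  }
  out
}

w <- c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4")
x <- 1:12

sumx  <- roll_at_flags(x, !is.na(w), n = 3)
sumx2 <- roll_at_flags(x, !is.na(w), n = 4)
```

Passing FUN = mean, or a function such as function(v) sum(v == 3), covers the "mean or count" variants mentioned in the question; setting partial = FALSE leaves NA wherever fewer than n values are available.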

Related

Complete a dataframe in R By ID upto selected values of the dataframe only

I have created the following dataframe in R
library(tidyr)
library(dplyr)
DF11<- data.frame("ID"= c("A", "A", "A", "B", "B", "B", "B", "B"))
DF11$X_F<-c(5, 7,9,6,7,8,9,10)
DF11$X_A<-c(7, 8,9,3,6,7,9,10)
The dataframe looks as follows
ID X_F X_A
A 5 7
A 7 8
A 9 9
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
ID is the grouping variable. I would like to use dplyr to create the following dataframe.
ID X_F X_A
A 0 NA
A 1 NA
A 2 NA
A 3 NA
A 4 NA
A 5 7
A 7 8
A 9 9
A 10 NA
A 11 NA
A 12 NA
B 0 NA
B 1 NA
B 2 NA
B 3 NA
B 4 NA
B 5 NA
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
B 11 NA
B 12 NA
B 13 NA
The resultant dataframe should take DF11 and then group the X_F column using ID column. Next it should complete X_F group-wise from 0 to the minimum value of X_F by group, and then from the maximum value of X_F to maximum value X_F +3.
I tried the following code and was able to solve it partially.
DF112<-DF11%>%group_by(ID)%>%complete(X_F=seq(0, max(X_F)+3, by =1))
ID X_F X_A
A 0 NA
A 1 NA
A 2 NA
A 3 NA
A 4 NA
A 5 7
A 6 NA
A 7 8
A 8 NA
A 9 9
A 10 NA
A 11 NA
A 12 NA
B 0 NA
B 1 NA
B 2 NA
B 3 NA
B 4 NA
B 5 NA
B 6 3
B 7 6
B 8 7
B 9 9
B 10 10
B 11 NA
B 12 NA
B 13 NA
How do I get the desired output mentioned above? I would appreciate any guidance.
It would work to pass two vectors into your complete function call, one to do the lower values and one to do the upper:
library(tidyr)
library(dplyr)
DF11 <- data.frame("ID" = c("A", "A", "A", "B", "B", "B", "B", "B"))
DF11$X_F <- c(5, 7, 9, 6, 7, 8, 9, 10)
DF11$X_A <- c(7, 8, 9, 3, 6, 7, 9, 10)
DF11 %>%
group_by(ID) %>%
complete(X_F = c(seq(0, min(X_F) - 1 , by = 1), seq(max(X_F) + 1, max(X_F) + 3, by = 1))) |>
arrange(ID, X_F)
# A tibble: 25 × 3
# Groups: ID [2]
ID X_F X_A
<chr> <dbl> <dbl>
1 A 0 NA
2 A 1 NA
3 A 2 NA
4 A 3 NA
5 A 4 NA
6 A 5 7
7 A 7 8
8 A 9 9
9 A 10 NA
10 A 11 NA
11 A 12 NA
12 B 0 NA
13 B 1 NA
14 B 2 NA
15 B 3 NA
16 B 4 NA
17 B 5 NA
18 B 6 3
19 B 7 6
20 B 8 7
21 B 9 9
22 B 10 10
23 B 11 NA
24 B 12 NA
25 B 13 NA
Created on 2022-11-01 with reprex v2.0.2
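For comparison, here is a rough base R sketch of the same padding logic, without tidyr. The helper name pad_group and its extra argument are hypothetical. It uses seq_len() for the lower range, on the assumption that a group's minimum could be 0; in that case seq(0, min(X_F) - 1, by = 1) would fail because the sequence would have to run backwards.

```r
DF11 <- data.frame(ID  = c("A", "A", "A", "B", "B", "B", "B", "B"),
                   X_F = c(5, 7, 9, 6, 7, 8, 9, 10),
                   X_A = c(7, 8, 9, 3, 6, 7, 9, 10))

# Hypothetical helper: add rows 0 .. min - 1 and max + 1 .. max + extra
# to one group, leaving the gaps between observed values untouched.
pad_group <- function(g, extra = 3) {
  lower <- seq_len(min(g$X_F)) - 1      # 0 .. min - 1 (empty when min is 0)
  upper <- max(g$X_F) + seq_len(extra)  # max + 1 .. max + extra
  new_x <- c(lower, upper)
  pad <- data.frame(ID  = rep(g$ID[1], length(new_x)),
                    X_F = new_x,
                    X_A = rep(NA_real_, length(new_x)))
  out <- rbind(g, pad)
  out[order(out$X_F), ]
}

result <- do.call(rbind, lapply(split(DF11, DF11$ID), pad_group))
rownames(result) <- NULL
```

With the example data this produces 25 rows, and the gaps inside each group (for instance X_F = 6 and 8 in group A) are not filled.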

Removing strings from data.table based on partial match with regex in R

I am looking to remove strings across my data.table based on a partial match:
$$ER
Since these strings differ across the entire table, and my table is reasonably large, efficiency and speed are preferred. I have tried data.table's %like%, but this is far too slow. gsub should do fine, but I have trouble referencing the "$$" in "$$ER".
df <- structure(list(Country = c("NL", "NL", "NL", "NL", "DE", "DE",
"DE", "GB", "GB"), Value1 = c("$$ER: Data not found", NA, NA,
NA, "$$ERROR: NOT AVAILABLE", NA, NA, "3", "4"), Value2 = c("$$ER: Data not found",
NA, NA, NA, "$$ERROR: NOT AVAILABLE", NA, NA, "3", "4"), Value3 = c(10,
15, 12, 9, 8, 20, 23, 3, 4)), class = "data.frame", row.names = c(NA,
-9L))
  Country                 Value1                 Value2 Value3
1      NL   $$ER: Data not found   $$ER: Data not found     10
2      NL                   <NA>                   <NA>     15
3      NL                   <NA>                   <NA>     12
4      NL                   <NA>                   <NA>      9
5      DE $$ERROR: NOT AVAILABLE $$ERROR: NOT AVAILABLE      8
6      DE                   <NA>                   <NA>     20
7      DE                   <NA>                   <NA>     23
8      GB                      3                      3      3
9      GB                      4                      4      4
Desired output:
  Country Value1 Value2 Value3
1      NL     NA     NA     10
2      NL     NA     NA     15
3      NL     NA     NA     12
4      NL     NA     NA      9
5      DE     NA     NA      8
6      DE     NA     NA     20
7      DE     NA     NA     23
8      GB      3      3      3
9      GB      4      4      4
An alternative would be to use grepl:
df[apply(df, 2, function(i) grepl('$$ER', i, fixed = TRUE))] <- NA
which would yield the following:
# Country Value1 Value2 Value3
# 1 NL <NA> <NA> 10
# 2 NL <NA> <NA> 15
# 3 NL <NA> <NA> 12
# 4 NL <NA> <NA> 9
# 5 DE <NA> <NA> 8
# 6 DE <NA> <NA> 20
# 7 DE <NA> <NA> 23
# 8 GB 3 3 3
# 9 GB 4 4 4
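The difficulty with "$$" mentioned in the question comes from `$` being an anchor (end of string) in a regular expression, so the pattern has to be either taken literally or escaped. A small sketch on a made-up vector x (not the asker's table) shows the options:

```r
x <- c("$$ER: Data not found", "3", NA, "$$ERROR: NOT AVAILABLE")

# Literal matching: the pattern is not interpreted as a regular expression.
hit_fixed <- grepl("$$ER", x, fixed = TRUE)

# Regex matching: each `$` must be escaped to be read as a literal character.
hit_regex <- grepl("\\$\\$ER", x)

# Prefix test: no regex involved; NA input gives NA rather than FALSE.
hit_prefix <- startsWith(x, "$$ER")

# Blank out the matching entries.
cleaned <- x
cleaned[hit_fixed] <- NA
```

Both grepl() variants return FALSE for the NA entry, while startsWith() returns NA there, which matters if the result is used directly as a logical index.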
You can use startsWith in sapply testing for $$ER.
D[2:3][sapply(D[2:3], startsWith, "$$ER")] <- NA
D
# Country Value1 Value2 Value3
#1 NL <NA> <NA> 10
#2 NL <NA> <NA> 15
#3 NL <NA> <NA> 12
#4 NL <NA> <NA> 9
#5 DE <NA> <NA> 8
#6 DE <NA> <NA> 20
#7 DE <NA> <NA> 23
#8 GB 3 3 3
#9 GB 4 4 4
But maybe you want to use as.numeric:
D[2:3] <- sapply(D[2:3], as.numeric)
D
# Country Value1 Value2 Value3
#1 NL NA NA 10
#2 NL NA NA 15
#3 NL NA NA 12
#4 NL NA NA 9
#5 DE NA NA 8
#6 DE NA NA 20
#7 DE NA NA 23
#8 GB 3 3 3
#9 GB 4 4 4
Using data.table -
library(data.table)
setDT(df)[, (2:3) := lapply(.SD, function(x)
as.numeric(replace(x, grepl('$$ER', x, fixed = TRUE), NA))), .SDcols = 2:3]
df
# Country Value1 Value2 Value3
#1: NL NA NA 10
#2: NL NA NA 15
#3: NL NA NA 12
#4: NL NA NA 9
#5: DE NA NA 8
#6: DE NA NA 20
#7: DE NA NA 23
#8: GB 3 3 3
#9: GB 4 4 4

Assign ID to column with NA's

This must be easy but my brain is blocked!
I have this dataframe:
col1
<chr>
1 A
2 B
3 NA
4 C
5 D
6 NA
7 NA
8 E
9 NA
10 F
df <- structure(list(col1 = c("A", "B", NA, "C", "D", NA, NA, "E",
NA, "F")), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
I want to add a column with uniqueID only for values that are not NA with tidyverse.
Expected output:
col1 uniqueID
<chr> <dbl>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
I have tried: n(), row_number(), cur_group_id ....
We could do this easily in data.table. Specify the condition in i (i.e. the non-NA elements in 'col1') and create the column 'uniqueID' as a sequence over those rows by assignment (:=)
library(data.table)
setDT(df)[!is.na(col1), uniqueID := seq_len(.N)]
-output
df
col1 uniqueID
1: A 1
2: B 2
3: <NA> NA
4: C 3
5: D 4
6: <NA> NA
7: <NA> NA
8: E 5
9: <NA> NA
10: F 6
In dplyr, we can use replace
library(dplyr)
df %>%
mutate(uniqueID = replace(col1, !is.na(col1),
seq_len(sum(!is.na(col1)))))
-output
# A tibble: 10 x 2
col1 uniqueID
<chr> <chr>
1 A 1
2 B 2
3 <NA> <NA>
4 C 3
5 D 4
6 <NA> <NA>
7 <NA> <NA>
8 E 5
9 <NA> <NA>
10 F 6
Another approach:
library(dplyr)
df %>%
mutate(UniqueID = cumsum(!is.na(col1)),
UniqueID = if_else(is.na(col1), NA_integer_, UniqueID))
# A tibble: 10 x 2
col1 UniqueID
<chr> <int>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
A base R option using match + na.omit + unique
transform(
df,
uniqueID = match(col1, na.omit(unique(col1)))
)
gives
col1 uniqueID
1 A 1
2 B 2
3 <NA> NA
4 C 3
5 D 4
6 <NA> NA
7 <NA> NA
8 E 5
9 <NA> NA
10 F 6
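One difference between the approaches is worth keeping in mind. The match() version numbers distinct values, while the cumsum() and seq_len() versions number non-NA rows. The two agree on the example data because col1 has no repeated values. A small sketch on a made-up vector with a repeat (an assumption, not the asker's data) shows where they diverge:

```r
col1 <- c("A", "B", NA, "A", "C")

# One ID per distinct value: the second "A" reuses ID 1.
id_by_value <- match(col1, na.omit(unique(col1)))

# One ID per non-NA row: the second "A" gets its own ID.
id_by_row <- ifelse(is.na(col1), NA_integer_, cumsum(!is.na(col1)))

id_by_value  # 1 2 NA 1 3
id_by_row    # 1 2 NA 3 4
```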
A weird tidyverse solution:
library(dplyr)
df %>%
mutate(id = ifelse(is.na(col1), 0, 1),
id = cumsum(id == 1),
id = ifelse(is.na(col1), NA, id))
# A tibble: 10 x 2
col1 id
<chr> <int>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6

Update a variable if dplyr filter conditions are met

The command df %>% filter(is.na(df)[,2:4]) subsets the data into a new df containing the rows that have NAs in columns 2, 3 and 4. What I want is not a new, subsetted df but rather to assign, for example, "1" to a new variable called "Exclude" in the actual df.
This example with mutate was not exactly what I was looking for, but close:
Use dplyr's filter and mutate to generate a new variable
I would also need the same to happen with other filter conditions.
For example, I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
A B C D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
A B C D Exclude
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Any good ideas on how the filter subset could be used to update the data frame easily? The hard workaround would be to generate this subset, create the new variable for it, and then join it back, but that is not tidy code.
We can do this with base R using vectorized rowSums
df$Exclude <- NA^!rowSums(is.na(df[-1]))
-output
df
# A B C D Exclude
#1 1 11 21 31 NA
#2 2 12 22 32 NA
#3 3 NA NA NA 1
#4 4 14 24 34 NA
#5 5 NA NA NA 1
#6 6 16 26 36 NA
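The one-liner is compact but dense, so here is the same computation unpacked step by step. The intermediate names (n_missing, no_missing, and so on) are only for illustration. It relies on two facts about R arithmetic: NA^0 is 1, and NA^1 is NA.

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[c(3, 5), 2:4] <- NA

n_missing  <- rowSums(is.na(df[-1]))  # NAs per row in B, C, D: 0 0 3 0 3 0
no_missing <- !n_missing              # TRUE where the count is 0
exclude    <- NA^no_missing           # NA^TRUE is NA, NA^FALSE is 1

# A more explicit equivalent of the same rule.
exclude_explicit <- ifelse(n_missing > 0, 1, NA)

# Variant: flag a row only when ALL of B, C, D are missing.
exclude_all <- ifelse(n_missing == ncol(df) - 1, 1, NA)

df$Exclude <- exclude
```

On this data the "any missing" and "all missing" rules give the same result, because rows 3 and 5 are missing in all three columns; they would differ for a row with only one or two NAs.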
Does this work:
library(dplyr)
df %>% rowwise() %>%
mutate(Exclude = +any(is.na(c_across(everything()))), Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
A B C D Exclude
<int> <int> <int> <int> <int>
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
# A B C D Exclude
# 1 1 11 21 31 NA
# 2 2 12 22 32 NA
# 3 3 NA NA NA 1
# 4 4 14 24 34 NA
# 5 5 NA NA NA 1
# 6 6 16 26 36 NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution (note that this gives 0, rather than NA, for rows without missing values):
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, assign 1 or NA in ifelse.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
rowwise() %>%
mutate(Exclude = ifelse(
is.na(sum(c_across(where(is.numeric)))), 1, NA
))
#> # A tibble: 6 x 5
#> # Rowwise:
#> A B C D Exclude
#> <int> <int> <int> <int> <dbl>
#> 1 1 11 21 31 NA
#> 2 2 12 22 32 NA
#> 3 3 NA NA NA 1
#> 4 4 14 24 34 NA
#> 5 5 NA NA NA 1
#> 6 6 16 26 36 NA

Using conditions in dplyr::mutate

I am working with a large data frame. I'm trying to create a new vector based on the conditions that exist in two current vectors.
Given the size of the dataset (and its general awesomeness) I'm trying to find a solution using dplyr, which has led me to mutate. I feel like I'm not far off, but I'm just not able to get a solution to stick.
My data frame resembles:
ID X Y
1 1 10 12
2 2 10 NA
3 3 11 NA
4 4 10 12
5 5 11 NA
6 6 NA NA
7 7 NA NA
8 8 11 NA
9 9 10 12
10 10 11 NA
To recreate it:
ID <- c(1:10)
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)
I'm looking to create a new vector 'Z' from the existing data. If Y > X, then I want it to return the value from Y. If Y is NA, then I'd like it to return the X value. If both are NA, then it should return NA.
My attempt thus far, using the code below, has let me create a new vector meeting the first condition, but not the second.
newData <- data %>%
mutate(Z =
ifelse(Y > X, Y,
ifelse(is.na(Y), X, NA)))
> newData
ID X Y Z
1 1 10 12 12
2 2 10 NA NA
3 3 11 NA NA
4 4 10 12 12
5 5 11 NA NA
6 6 NA NA NA
7 7 NA NA NA
8 8 11 NA NA
9 9 10 12 12
10 10 11 NA NA
I feel like I'm missing something mindblowingly simple. Can someone point me in the right direction?
pmax(, na.rm=TRUE) is what you are looking for
library(dplyr)
data <- data.frame(ID = 1:10,
X = c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11),
Y = c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA))
data %>% mutate(Z = pmax(X, Y, na.rm=TRUE))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
The ifelse code can be
data %>%
mutate(Z= ifelse(Y>X & !is.na(Y), Y, X))
# ID X Y Z
#1 1 10 12 12
#2 2 10 NA 10
#3 3 11 NA 11
#4 4 10 12 12
#5 5 11 NA 11
#6 6 NA NA NA
#7 7 NA NA NA
#8 8 11 NA 11
#9 9 10 12 12
#10 10 11 NA 11
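It may help to see why the nested ifelse() in the question returned NA for the rows where Y is missing. When Y is NA, the comparison Y > X is itself NA, and ifelse() returns NA for an NA test without ever evaluating the inner branch. A base R sketch (the names bad and good are illustrative only) that tests for NA first:

```r
X <- c(10, 10, 11, 10, 11, NA, NA, 11, 10, 11)
Y <- c(12, NA, NA, 12, NA, NA, NA, NA, 12, NA)

# Original attempt: Y > X is NA whenever Y is NA, so those rows come back NA.
bad <- ifelse(Y > X, Y, ifelse(is.na(Y), X, NA))

# Checking is.na(Y) first lets the X fallback actually be reached.
good <- ifelse(is.na(Y), X, ifelse(Y > X, Y, X))

# Same result as the pmax() answer on this data.
via_pmax <- pmax(X, Y, na.rm = TRUE)
```

Here good and via_pmax agree on every row; rows where both X and Y are NA remain NA in both.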
