Use of na.locf function (zoo package) with .SD in data.table

I am trying to fill all NAs with the most recent non-NA value, except the leading NAs (the first two in columns 1 and 4, and the first three in columns 2 and 3). Here is my data and code:
hh<-structure(list(ka = c(NA, NA, 2, NA, NA, 3, NA, NA, NA, NA),
kb = c(NA, NA, NA, 2, NA, NA, 3, NA, NA, NA), gc = c(NA,
NA, NA, 3, NA, NA, 6, NA, NA, NA), hc = c(NA, NA, 8, NA,
NA, NA, 4, NA, NA, NA)), .Names = c("ka", "kb", "gc", "hc"
), row.names = c(NA, -10L), class = "data.frame")
library(zoo) #na.locf
library(data.table)
setDT(hh)[, `:=`(ka = c(NA, NA, na.locf(ka)),
                 kb = c(NA, NA, NA, na.locf(kb)),
                 gc = c(NA, NA, NA, na.locf(gc)),
                 hc = c(NA, NA, na.locf(hc)))][]
ka kb gc hc
1: NA NA NA NA
2: NA NA NA NA
3: 2 NA NA 8
4: 2 2 3 8
5: 2 2 3 8
6: 3 2 3 8
7: 3 3 6 4
8: 3 3 6 4
9: 3 3 6 4
10: 3 3 6 4
However, I am looking for use of lapply with .SD as I have more than two columns for each type. Is this possible?

Try
setDT(hh)[, lapply(.SD, function(x) na.locf(x, na.rm=FALSE))]
Or use set
for(j in seq_along(hh)){
set(hh, i=NULL, j=j, value= na.locf(hh[[j]], na.rm=FALSE))
}
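To make clear what na.locf(x, na.rm = FALSE) computes for each column, here is a minimal base R sketch. The helper name locf is hypothetical (it is not part of zoo), and the data is a shortened version of hh from the question.

```r
# Hypothetical helper: last observation carried forward, keeping leading NAs.
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1L]   # slot 1 is the NA used before any value appears
}

# Shortened version of the question's data.
hh <- data.frame(ka = c(NA, NA, 2, NA, NA, 3),
                 kb = c(NA, NA, NA, 2, NA, NA))

# Apply the helper to every column, keeping the data.frame shape.
hh[] <- lapply(hh, locf)
print(hh)
```

Because the leading NAs are never dropped, each column keeps its original length, which is what lets the result be assigned back column by column.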

You can use setnafill, available from data.table >= 1.12.3:
setnafill(hh, type = "locf")
hh
# ka kb gc hc
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 2 NA NA 8
# 4 2 2 3 8
# 5 2 2 3 8
# 6 3 2 3 8
# 7 3 3 6 4
# 8 3 3 6 4
# 9 3 3 6 4
# 10 3 3 6 4
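If only some columns should be filled, the following sketch may help. It assumes a data.table version that provides nafill and setnafill; the toy table dt and the selection fill_cols are made up for illustration.

```r
library(data.table)

# Toy data: two numeric columns to fill and an id column to leave untouched.
dt <- data.table(id = 1:5,
                 ka = c(NA, 2, NA, 3, NA),
                 kb = c(NA, NA, 4, NA, NA))

fill_cols <- c("ka", "kb")  # hypothetical selection of columns

# Option 1: lapply over .SD, restricted with .SDcols, updating by reference.
dt1 <- copy(dt)
dt1[, (fill_cols) := lapply(.SD, nafill, type = "locf"), .SDcols = fill_cols]

# Option 2: setnafill with its cols argument.
dt2 <- copy(dt)
setnafill(dt2, type = "locf", cols = fill_cols)

print(dt1)
```

Both variants leave leading NAs in place, since there is no earlier value to carry forward.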

You don't need lapply. This is sufficient:
DT <- as.data.table(hh)
DT[, na.locf(.SD, na.rm = FALSE)]
giving:
ka kb gc hc
1: NA NA NA NA
2: NA NA NA NA
3: 2 NA NA 8
4: 2 2 3 8
5: 2 2 3 8
6: 3 2 3 8
7: 3 3 6 4
8: 3 3 6 4
9: 3 3 6 4
10: 3 3 6 4
This will also work:
DT[, lapply(.SD, na.locf0)]

Related

How best to create a new column for each two-column comparison using purrr?

Say I have the following dataframe:
ABC1_old <- c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA)
ABC2_old <- c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA)
ABC1_adj <- c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4)
ABC2_adj <- c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2)
df <- data.frame(ABC1_old, ABC2_old, ABC1_adj, ABC2_adj)
I want to create a column that compares each ABCn_old with its corresponding ABCn_adj. (So ABC1_old would be compared against ABC1_adj, etc.) The resulting column would be called ABCn_new. The rule is: if ABCn_old is NA, use the corresponding value in ABCn_adj; otherwise use ABCn_old's value. The new columns would look like this:
df$ABC1_new <- c(1, 5, 3, 4, 3, 5, 5, 1, 2, 4)
df$ABC2_new <- c(4, 2, 1, 1, 5, 3, 2, 1, 4, 2)
I know a simple mutate could work here, but I would like to use some kind of tidyverse looping via purrr if possible since the dataset is much larger in reality. Any ideas for the best way to achieve this?
library(tidyverse) # purrr::map_dfc, stringr::str_remove, dplyr::coalesce
map_dfc(split.default(df, str_remove(names(df), "_.*")), ~coalesce(!!!.x))
# A tibble: 10 x 2
ABC1 ABC2
<dbl> <dbl>
1 1 4
2 5 2
3 3 1
4 4 1
5 3 5
6 5 3
7 5 2
8 1 1
9 2 4
10 4 2
Putting it together:
df %>%
split.default(str_replace(names(.), "_.*", "_new")) %>%
map_dfc(~coalesce(!!!.x))%>%
cbind(df, .)
ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
1 1 4 NA NA 1 4
2 5 2 NA NA 5 2
3 3 1 NA NA 3 1
4 4 1 NA NA 4 1
5 3 5 NA NA 3 5
6 NA NA 5 3 5 3
7 NA NA 5 2 5 2
8 NA NA 1 1 1 1
9 NA NA 2 4 2 4
10 NA NA 4 2 4 2
Using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c(".value", 'grp'),
names_sep = '_', values_drop_na = TRUE) %>%
select(-grp, -rn) %>%
rename_all(~ str_c(., '_new')) %>% bind_cols(df, .)
# ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#1 1 4 NA NA 1 4
#2 5 2 NA NA 5 2
#3 3 1 NA NA 3 1
#4 4 1 NA NA 4 1
#5 3 5 NA NA 3 5
#6 NA NA 5 3 5 3
#7 NA NA 5 2 5 2
#8 NA NA 1 1 1 1
#9 NA NA 2 4 2 4
#10 NA NA 4 2 4 2
Or using dplyr
df %>%
mutate(across(ends_with('old'),
~ coalesce(., get(str_replace(cur_column(),
'old', 'adj'))), .names = '{.col}_new'))
I have a package on GitHub to solve this and similar problems. In this case we could use dplyover::across2 to apply one (or more) functions to two sets of columns, which can be selected with tidyselect. In the .names argument we can specify "{pre}" to refer to the common prefix of both sets of columns.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
mutate(across2(ends_with("_old"),
ends_with("_adj"),
~ coalesce(.x, .y),
.names = "{pre}_new"))
#> ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#> 1 1 4 NA NA 1 4
#> 2 5 2 NA NA 5 2
#> 3 3 1 NA NA 3 1
#> 4 4 1 NA NA 4 1
#> 5 3 5 NA NA 3 5
#> 6 NA NA 5 3 5 3
#> 7 NA NA 5 2 5 2
#> 8 NA NA 1 1 1 1
#> 9 NA NA 2 4 2 4
#> 10 NA NA 4 2 4 2
Created on 2021-05-16 by the reprex package (v0.3.0)
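For comparison, the same pairing rule can be sketched in base R without any packages. This assumes every ABCn_old column has a matching ABCn_adj column with the same prefix.

```r
df <- data.frame(
  ABC1_old = c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA),
  ABC2_old = c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA),
  ABC1_adj = c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4),
  ABC2_adj = c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2)
)

# Find every *_old column, derive the partner and target names from it,
# and take the old value unless it is NA.
for (old in grep("_old$", names(df), value = TRUE)) {
  adj <- sub("_old$", "_adj", old)
  new <- sub("_old$", "_new", old)
  df[[new]] <- ifelse(is.na(df[[old]]), df[[adj]], df[[old]])
}

print(df)
```

The loop only touches column names, so it scales to any number of prefix pairs.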

Rolling values by group

I would like to do some calculations using frollapply() or rollapplyr() with a conditional factor.
I have the following data
df <- tibble(w = c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4"),
x = 1:12, y = x * 2) %>%
as.data.table()
Using data.table I generate the following result.
df[, sumx := frollapply(x, 3, FUN = sum)]
       w  x  y sumx
 1: <NA>  1  2   NA
 2: <NA>  2  4   NA
 3:   c1  3  6    6
 4: <NA>  4  8    9
 5: <NA>  5 10   12
 6:   c2  6 12   15
 7: <NA>  7 14   18
 8: <NA>  8 16   21
 9:   c3  9 18   24
10: <NA> 10 20   27
11: <NA> 11 22   30
12:   c4 12 24   33
I like this result, but I would like to do something more complicated.
First, I would like a cleaner output in which sumx is kept only on the rows where w is not NA, like this:
       w  x  y sumx
 1: <NA>  1  2   NA
 2: <NA>  2  4   NA
 3:   c1  3  6    6
 4: <NA>  4  8   NA
 5: <NA>  5 10   NA
 6:   c2  6 12   15
 7: <NA>  7 14   NA
 8: <NA>  8 16   NA
 9:   c3  9 18   24
10: <NA> 10 20   NA
11: <NA> 11 22   NA
12:   c4 12 24   33
Second, I would like to create another variable, for example "sumx2", whose value on the "c1" row is the sum of the last 4, 5, or n values of "x" (it need not be a sum; it could be the mean or the count of a specific value). If fewer than n values are available, the absent values should be treated as NA. The "c2", "c3" and "c4" rows follow the same idea. The expected output would be:
       w  x  y sumx sumx2
 1: <NA>  1  2   NA    NA
 2: <NA>  2  4   NA    NA
 3:   c1  3  6    6     6
 4: <NA>  4  8   NA    NA
 5: <NA>  5 10   NA    NA
 6:   c2  6 12   15    18
 7: <NA>  7 14   NA    NA
 8: <NA>  8 16   NA    NA
 9:   c3  9 18   24    30
10: <NA> 10 20   NA    NA
11: <NA> 11 22   NA    NA
12:   c4 12 24   33    42
Your help is appreciated!
If I understood everything correctly:
library(tibble)
df <- tibble(w = c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4"),
x = 1:12, y = x * 2)
library(data.table)
setDT(df)
nm_cols <- c("sumX", "sumx2")
df[, (nm_cols) := list(
ifelse(is.na(w), NA, zoo::rollapplyr(x, width = 3, FUN = function(x) sum(x), partial = T)),
ifelse(is.na(w), NA, zoo::rollapplyr(x, width = 4, FUN = function(x) sum(x), partial = T))
)]
df
#> w x y sumX sumx2
#> 1: <NA> 1 2 NA NA
#> 2: <NA> 2 4 NA NA
#> 3: c1 3 6 6 6
#> 4: <NA> 4 8 NA NA
#> 5: <NA> 5 10 NA NA
#> 6: c2 6 12 15 18
#> 7: <NA> 7 14 NA NA
#> 8: <NA> 8 16 NA NA
#> 9: c3 9 18 24 30
#> 10: <NA> 10 20 NA NA
#> 11: <NA> 11 22 NA NA
#> 12: c4 12 24 33 42
Created on 2021-03-21 by the reprex package (v1.0.0)
Check this
library(data.table)
dt <- data.table(w = c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4"),
x = 1:12)
dt[,id:=rleidv(x)]
#dt[,sumx := ifelse(is.na(w),NA,frollapply(x,3,sum))]
dt[,sumx := fcase(!is.na(w),frollapply(x,3,sum))]
dt[,sumx2 := fcase(!is.na(w) & id == 3, frollapply(x, n = 3, sum),
!is.na(w) & id >= 4, frollapply(x, n = 4, sum))
]
dt[,id:=NULL]
Result:
dt
w x sumx sumx2
1: <NA> 1 NA NA
2: <NA> 2 NA NA
3: c1 3 6 6
4: <NA> 4 NA NA
5: <NA> 5 NA NA
6: c2 6 15 18
7: <NA> 7 NA NA
8: <NA> 8 NA NA
9: c3 9 24 30
10: <NA> 10 NA NA
11: <NA> 11 NA NA
12: c4 12 33 42
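The window width and the summary function can also be made parameters. The following is a base R sketch under stated assumptions: trailing_apply is a hypothetical helper, and it uses a shorter (partial) window when fewer than n values are available, which is what the expected sumx2 value of 6 on the "c1" row implies.

```r
# Hypothetical helper: apply `fun` to the last `n` values ending at each
# position, using a shorter (partial) window near the start of the vector.
trailing_apply <- function(x, n, fun = sum) {
  vapply(seq_along(x), function(i) {
    as.numeric(fun(x[max(1L, i - n + 1L):i]))
  }, numeric(1))
}

w <- c(NA, NA, "c1", NA, NA, "c2", NA, NA, "c3", NA, NA, "c4")
x <- 1:12

# Keep the rolling value only on rows where w is present.
sumx  <- ifelse(is.na(w), NA_real_, trailing_apply(x, 3))
sumx2 <- ifelse(is.na(w), NA_real_, trailing_apply(x, 4))

print(data.frame(w, x, sumx, sumx2))
```

Passing fun = mean, or a function that counts a specific value, changes the summary without touching the masking step.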

Add column to data frame with sequence depending on other column

I have two columns of data like this:
I want to add a column or modify the second column resulting in a sequence of integers starting with 1, wherever the 1 already appears. Result changes to:
I can do this with a loop, but what is the "right" R way of doing it?
Here's my loop:
for(i in 1:length(df2$col2)) {
df2$col3[i] <- ifelse(df2$col2[i] == 1, 1, df2$col3[i - 1] + 1)
if(is.na(df2$col2[i])) df2$col3[i] <- df2$col3[i - 1] + 1
}
Here is a sample data set with 20 rows:
478.69, 320.45, 503.7, 609.3, 478.19, 419.633683050051, 552.939975773916,
785.119385505095, 18.2542654918507, 98.6469651805237, 132.587260054424,
697.119552921504, 512.560374778695, 916.425200179219, 14.3385051051155
), col2 = c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
I don't know if this is the way to do it, but it's one way:
df$col3 <- unlist(sapply(diff(c(which(!is.na(df$col2)), nrow(df) + 1)), seq))
df
#> col1 col2 col3
#> 1 478.69000 1 1
#> 2 320.45000 NA 2
#> 3 503.70000 1 1
#> 4 609.30000 NA 2
#> 5 478.19000 NA 3
#> 6 478.69000 1 1
#> 7 320.45000 NA 2
#> 8 503.70000 1 1
#> 9 609.30000 NA 2
#> 10 478.19000 NA 3
#> 11 419.63368 NA 4
#> 12 552.93998 NA 5
#> 13 785.11939 1 1
#> 14 18.25427 NA 2
#> 15 98.64697 NA 3
#> 16 132.58726 NA 4
#> 17 697.11955 NA 5
#> 18 512.56037 NA 6
#> 19 916.42520 NA 7
#> 20 14.33851 NA 8
Note that the first 5 values of col1 were missing from your dput, so I added the second 5 numbers twice - they're not relevant to the question anyway.
Data
df <- structure(list(col1 = c(478.69, 320.45, 503.7, 609.3, 478.19,
478.69, 320.45, 503.7, 609.3, 478.19, 419.633683050051, 552.939975773916,
785.119385505095, 18.2542654918507, 98.6469651805237, 132.587260054424,
697.119552921504, 512.560374778695, 916.425200179219, 14.3385051051155
), col2 = c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
df
#> col1 col2
#> 1 478.69000 1
#> 2 320.45000 NA
#> 3 503.70000 1
#> 4 609.30000 NA
#> 5 478.19000 NA
#> 6 478.69000 1
#> 7 320.45000 NA
#> 8 503.70000 1
#> 9 609.30000 NA
#> 10 478.19000 NA
#> 11 419.63368 NA
#> 12 552.93998 NA
#> 13 785.11939 1
#> 14 18.25427 NA
#> 15 98.64697 NA
#> 16 132.58726 NA
#> 17 697.11955 NA
#> 18 512.56037 NA
#> 19 916.42520 NA
#> 20 14.33851 NA
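Another common base R idiom for this kind of restart counter combines cumsum and ave. This is a sketch that assumes col2 is non-NA exactly where a new run should start; the variable names grp and col3 are chosen for illustration.

```r
col2 <- c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
          NA, NA, NA, NA, NA, NA)

# Each non-NA value opens a new group; cumsum turns that into a group id.
grp <- cumsum(!is.na(col2))

# Within each group, number the rows 1, 2, 3, ...
col3 <- ave(seq_along(col2), grp, FUN = seq_along)

print(col3)
```

If col2 began with NA, those leading rows would form a group of their own (id 0) and would also be numbered from 1.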

Removing rows with all NA's, from data.frames within a list

Example Data:
df1 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(NA, NA, NA), c(4,6,7), c(4, 8, NA)))
df2 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)))
dfList <- list(df1,df2)
colnames <- c("A","B","C")
dfList[[1]]
V1 V2 V3
1 1 2 3
2 1 NA 4
3 NA NA NA
4 4 6 7
5 4 8 NA
dfList[[2]]
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 NA NA NA
5 4 8 NA
How do I remove the rows that are empty/have ALL values NA, within each of the data.frames in the list?
Desired outcome:
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
You can use lapply to iterate over the list and rowSums to drop rows with all NA values.
lapply(dfList, function(x) x[rowSums(!is.na(x)) != 0, ])
#[[1]]
# V1 V2 V3
#1 1 2 3
#2 1 NA 4
#4 4 6 7
#5 4 8 NA
#[[2]]
# V1 V2 V3
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#5 4 8 NA
Use tidyverse together with janitor, whose remove_empty drops all-NA rows:
library(tidyverse)
library(janitor)
map(dfList, remove_empty, which = c("rows"))
[[1]]
V1 V2 V3
1 1 2 3
2 1 NA 4
4 4 6 7
5 4 8 NA
[[2]]
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
5 4 8 NA
Here is another solution with all()
lapply(dfList, function(d) d[!apply(is.na(d), 1, all),])
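The answers above keep the original row names (1, 2, 4, 5), whereas the desired outcome in the question shows rows renumbered 1 to 4. A small base R sketch that also resets the row names; the name cleaned is made up for illustration.

```r
df1 <- as.data.frame(rbind(c(1, 2, 3), c(1, NA, 4), c(NA, NA, NA),
                           c(4, 6, 7), c(4, 8, NA)))
df2 <- as.data.frame(rbind(c(1, 2, 3), c(1, NA, 4), c(4, 6, 7),
                           c(NA, NA, NA), c(4, 8, NA)))
dfList <- list(df1, df2)

cleaned <- lapply(dfList, function(d) {
  # Keep rows that have at least one non-NA value.
  out <- d[rowSums(!is.na(d)) > 0, , drop = FALSE]
  rownames(out) <- NULL  # renumber rows 1..n
  out
})

print(cleaned)
```

Using drop = FALSE keeps the result a data.frame even when a list element has a single column.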

Replace NAs in a window around a value

My zoo data looks like below. This data is part of a larger zoo (time series) data set.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NA NA NA NA NA 1 NA NA NA NA NA 3 NA NA NA
library(zoo)
x <- zoo(c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA))
I want to replace NAs in a window around each non-NA value with the non-NA value. For example, a window of 5 around the non-NA would look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NA NA NA 1 1 1 1 1 NA 3 3 3 3 3 NA
I can do what I want with a long and messy set of ifelse statements.
Is there a better way? I looked at zoo's NA fill set of functions but did not see anything for a window.
I guess rolling apply will do the job?
> rollapply(x, 5, function(x){mean(x[!is.na(x)])}, fill=NA)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
NA NA NaN 1 1 1 1 1 NaN 3 3 3 3 3 NA NA
We could also use filter, where v1 is assumed to be the plain numeric vector of values (for example, coredata(x)):
v2 <- stats::filter(replace(v1,is.na(v1),0), rep(1,5))
is.na(v2) <- !v2
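A more explicit alternative is to loop over the non-NA positions and write each value into its window. This is a base R sketch on a plain numeric vector; fill_window is a hypothetical helper, the window is assumed to be centred with an odd width, and where two windows overlap the later value overwrites the earlier one.

```r
# Hypothetical helper: spread each non-NA value over a centred window.
fill_window <- function(v, width = 5) {
  half <- (width - 1) %/% 2
  out <- rep(NA_real_, length(v))
  for (i in which(!is.na(v))) {
    # Clip the window to the bounds of the vector.
    idx <- max(1, i - half):min(length(v), i + half)
    out[idx] <- v[i]
  }
  out
}

v <- c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA)
res <- fill_window(v, width = 5)
print(res)
```

For a zoo series, the same helper could be applied to coredata(x) and the result written back with coredata(x) <- res.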
