R: Shift multiple columns by different numbers of rows

I have a list of tibbles like the following:
list(
  A = structure(list(
        ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
        g1 = c(0, 1, 2, NA, NA, NA, NA, NA, NA),
        g2 = c(NA, NA, NA, 3, 4, 5, NA, NA, NA),
        g3 = c(NA, NA, NA, NA, NA, NA, 6, 7, 8)),
        row.names = c(NA, -9L),
        class = c("tbl_df", "tbl", "data.frame")),
  B = structure(list(
        ID = c(1, 2, 1, 2, 1, 2),
        g1 = c(10, 11, NA, NA, NA, NA),
        g2 = c(NA, NA, 12, 13, NA, NA),
        g3 = c(NA, NA, NA, NA, 14, 15)),
        row.names = c(NA, -6L),
        class = c("tbl_df", "tbl", "data.frame"))
)
Each element looks like this:
   ID    g1    g2    g3
<dbl> <dbl> <dbl> <dbl>
    1     0    NA    NA
    2     1    NA    NA
    3     2    NA    NA
    1    NA     3    NA
    2    NA     4    NA
    3    NA     5    NA
    1    NA    NA     6
    2    NA    NA     7
    3    NA    NA     8
The g* columns are created dynamically, during previous mutates, and their number can vary, but it will be the same across all list elements.
Every g* column has only certain non-NA elements (as many as the unique IDs).
I would like to shift the g* columns so that their non-NA elements move to the top rows.
I can do it for a single column by
num.shifts <- rle(is.na(myList[[1]]$g1))$lengths[1]
shift(myList[[1]]$g2, -num.shifts)
but how can I do it for all the g* columns, for all list elements, when I don't know in advance the number of g* columns?
Ideally, I would like a tidyverse solution, but that's not a requirement...
Thanks!

We can loop over the list with map and use mutate_at to go over the columns that match 'g' followed by digits, ordering each column by its non-NA elements.
library(dplyr)
library(tidyr)
library(purrr)   # map() comes from purrr

# lst1 is the list of tibbles from the question (called myList there)
map(lst1, ~
  .x %>%
    mutate_at(vars(matches('^g\\d+')), ~ .[order(is.na(.))]))
In base R, we can do
lapply(lst1, function(x) {
  i1 <- grepl("^g\\d+$", names(x))
  x[i1] <- lapply(x[i1], function(y) y[order(is.na(y))])
  x
})
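In newer dplyr (>= 1.0), where mutate_at is superseded, the same ordering trick can be written with across(); a minimal sketch, assuming R >= 4.1 for the \(x) lambda shorthand and using the answer's lst1 for the list of tibbles:
library(dplyr)
library(purrr)

# Reorder every g<digit> column so its non-NA values come first, in each tibble
map(lst1, \(d) d %>%
      mutate(across(matches('^g\\d+'), \(v) v[order(is.na(v))])))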

Related

How to pivot longer with the grouping notation at the center of a dataframe?

I have a dataframe with the following column headers:
df <- data.frame(
ABC1_1_1DEF = c(1, 2, 3),
ABC1_2_1DEF = c(NA, 1, 2),
ABC1_3_1DEF = c(1, 1, NA),
ABC1_1_2DEF = c(3, NA, NA),
ABC1_2_2DEF = c(2, NA, NA),
ABC1_3_2DEF = c(NA, 1, 1)
)
I want to pivot the dataframe longer such that the middle number of each column is the group that contains the new columns:
df2 <- data.frame(
ABC1_1 = c(1, 2, 3, 3, NA, NA),
ABC1_2 = c(3, NA, NA, 2, NA, NA),
ABC1_3 = c(2, NA, NA, NA, 1, 1)
)
What's the best way to achieve this using R, ideally with dplyr?
To combine all the ABC1_1, ABC1_2 and ABC1_3 columns you can use -
tidyr::pivot_longer(df, cols = everything(),
names_to = '.value',
names_pattern = '([A-Z]+\\d+_\\d+)')
# ABC1_1 ABC1_2 ABC1_3
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 2 NA
#3 2 1 1
#4 NA NA 1
#5 3 2 NA
#6 NA NA 1
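If you also want to keep track of which trailing block (the 1DEF / 2DEF suffix) each row came from, the pattern can capture a second group into its own column; a sketch, where the "rep" column name is my own choice and not part of the original answer:
library(tidyr)

pivot_longer(df, cols = everything(),
             names_to = c(".value", "rep"),
             names_pattern = '([A-Z]+\\d+_\\d+)_(\\d+)[A-Z]+')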

How to replace NA separately with a linear model in R

I've looked up some web pages (but their results don't meet my needs):
NA replacing with blanks
Replacing "NA" (NA string) with NA inplace data.table
replace <NA> with NA.
I want to write a function that could do this:
Say there is a vector a.
a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
First, find the values before and after each run of single or consecutive NAs.
In this case these are 137862, NA, NA, NA, 178337 and 178337, NA, NA, NA, NA, NA, 295530.
Second, calculate the slope in every part then replace the NA.
# 137862, NA, NA, NA, 178337
slope_1 = (178337 - 137862)/4
137862 + slope_1*1 # 1st NA replace with 147980.8
137862 + slope_1*2 # 2nd NA replace with 158099.5
137862 + slope_1*3 # 3rd NA replace with 168218.2
# 178337, NA, NA, NA, NA, NA, 295530
slope_2 = (295530 - 178337)/6
178337 + slope_2*1 # 4th NA replace with 197869.2
178337 + slope_2*2 # 5th NA replace with 217401.3
178337 + slope_2*3 # 6th NA replace with 236933.5
178337 + slope_2*4 # 7th NA replace with 256465.7
178337 + slope_2*5 # 8th NA replace with 275997.8
Finally, the expected vector should be this:
a_without_NA = c(100000, 137862, 147980.8, 158099.5, 168218.2, 178337, 197869.2, 217401.3,
236933.5, 256465.7, 275997.8, 295530)
If single or consecutive NAs are at the beginning, they should be kept.
# NA at beginning
b = c(NA, NA, 1, 3, NA, 5, 7)
# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 3rd NA replace with 4
b_without_NA = c(NA, NA, 1, 3, 4, 5, 7)
# NA at end
c = c(1, 3, NA, 5, 7, NA, NA)
# 3, NA, 5
slope_1 = (5-3)/2
3 + slope_1*1 # 1st NA replace with 4
c_without_NA = c(1, 3, 4, 5, 7, NA, NA)
Note: in my real situation, the vector is strictly increasing (vector[n + 1] > vector[n]).
I know the principle, but I don't know how to write a self-defined function to implement it.
Any help would be highly appreciated!
zoo's na.approx can help:
a = c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
zoo::na.approx(a, na.rm = FALSE)
# [1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
# [9] 236933.5 256465.7 275997.8 295530.0
b = c(NA, NA, 1, 3, NA, 5, 7)
zoo::na.approx(b, na.rm = FALSE)
#[1] NA NA 1 3 4 5 7
c = c(1, 3, NA, 5, 7, NA, NA)
zoo::na.approx(c, na.rm = FALSE)
#[1] 1 3 4 5 7 NA NA
Here is a base R option using approx
> approx(seq_along(a)[!is.na(a)], a[!is.na(a)], seq_along(a))$y
[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3
[9] 236933.5 256465.7 275997.8 295530.0
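Since approx defaults to rule = 1, positions outside the range of the non-NA points stay NA, so the leading/trailing NA cases from the question are preserved automatically; a quick check with the question's b vector:
b <- c(NA, NA, 1, 3, NA, 5, 7)
approx(seq_along(b)[!is.na(b)], b[!is.na(b)], seq_along(b))$y
#[1] NA NA  1  3  4  5  7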
Here is one approach with data.table. Get the run-length id (rleid) of the NA runs in 'a' ('grp'), create two temporary columns 'a1' and 'a2' as the lag and lead of 'a', then, grouped by 'grp', compute 'tmp' from the slope calculation, and finally fcoalesce the original 'a' with 'tmp'.
library(data.table)
data.table(a)[, grp := rleid(is.na(a))][
  , c('a1', 'a2') := .(shift(a), shift(a, type = 'lead'))][
  , tmp := first(a1) + seq_len(.N) * ((last(a2) - first(a1)) / (.N + 1)),
  by = grp][, fcoalesce(a, tmp)]
#[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0
#[7] 197869.2 217401.3 236933.5 256465.7 275997.8 295530.0
For this purpose I defined a custom function:
my_replace_na <- function(x) {
  non <- which(!is.na(x))            # indices of the non-NA values
  for (i in 1:(length(non) - 1)) {
    if (non[i + 1] - non[i] > 1) {   # a gap of NAs between two known values
      b <- non[i]                    # index of the value before the gap
      c <- non[i + 1]                # index of the value after the gap
      for (j in 1:(c - b - 1)) {     # fill the gap with linear steps
        x[b + j] <- x[b] + ((x[c] - x[b]) / (c - b)) * j
      }
    }
  }
  x
}
a <- c(100000, 137862, NA, NA, NA, 178337, NA, NA, NA, NA, NA, 295530)
my_replace_na(a)
[1] 100000.0 137862.0 147980.8 158099.5 168218.2 178337.0 197869.2 217401.3 236933.5 256465.7
[11] 275997.8 295530.0
# NA at beginning
d <- c(NA, NA, 1, 3, NA, 5, 7)
my_replace_na(d)
[1] NA NA 1 3 4 5 7
# NA at end
e <- c(1, 3, NA, 5, 7, NA, NA)
my_replace_na(e)
[1] 1 3 4 5 7 NA NA
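Note that my_replace_na assumes at least two non-NA values; with zero or one non-NA value the 1:(length(non)-1) sequence counts downwards and the function errors. A small guard, my addition rather than part of the original answer, avoids that:
my_replace_na_safe <- function(x) {
  non <- which(!is.na(x))
  if (length(non) < 2) return(x)  # nothing to interpolate between
  my_replace_na(x)
}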

How can I compare multiple rows in R?

I would like to compare multiple values by USER.
Based on USER "A": if the values (A, B, C, D, and E) are the same as those of USER "B", then "B" should be written in a newly created variable EQUAL.
Here is my data and the desired value (provided as images in the original post).
I am very new to R; I tried to look at the compare function but got a little overwhelmed. I would very much appreciate any help.
Here's an abridged version of the data you provided:
library(tidyverse)
df <- data.frame(
id = c(1001, 1002, 1003, 1001, 1002, 1003),
user = c('a', 'a', 'a', 'b', 'b', 'b'),
point_a = c(1, 1, NA, 1, 1, NA),
point_b = c(NA, NA, 2, NA, NA, NA),
point_c = c(3, 2, 3, 3, 2, 3),
point_d = c(2, 1, NA, 2, 1, NA),
point_e = c(4, NA, 1, 4, NA, NA)
)
df
id user point_a point_b point_c point_d point_e
1 1001 a 1 NA 3 2 4
2 1002 a 1 NA 2 1 NA
3 1003 a NA 2 3 NA 1
4 1001 b 1 NA 3 2 4
5 1002 b 1 NA 2 1 NA
6 1003 b NA NA 3 NA NA
If you inner_join the data on the columns you want to match, and then filter for rows where user.x is less than user.y (i.e. comes first alphabetically, which drops duplicates and rows matching themselves), you are left with the matches you're looking for:
df %>%
inner_join(df, by = c('point_a', 'point_b', 'point_c', 'point_d', 'point_e')) %>%
filter(user.x < user.y) %>%
rename(user = user.x,
equal = user.y)
id.x user point_a point_b point_c point_d point_e id.y equal
1 1001 a 1 NA 3 2 4 1001 b
2 1002 a 1 NA 2 1 NA 1002 b
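One detail worth noting: the point columns contain NAs that must count as equal for the join to find these matches. For local data frames, dplyr's joins treat NA as matching NA by default (na_matches = "na"); spelling it out makes the intent explicit, as a minor variation on the answer above:
df %>%
  inner_join(df,
             by = c('point_a', 'point_b', 'point_c', 'point_d', 'point_e'),
             na_matches = "na") %>%   # NA matches NA (the default for data frames)
  filter(user.x < user.y) %>%
  rename(user = user.x, equal = user.y)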
We may split the data along users, feed the result to mapply, and calculate the rowSums of TRUEs after comparison with `==`. From the resulting matrix we take which.max per row, which lets us index into the users (excluding "A"). The result then just needs to be subset to user "A".
transform(dat, EQUAL =
  split(dat, dat$user) |>
    (\(.) mapply(\(x, y) rowSums(x == y, na.rm = TRUE),
                 unname(.['A']),
                 .[c('B', 'C')]))() |>
    (\(.) sort(unique(dat$user))[-1][apply(., 1, which.max)])()
) |>
  (\(.) .[.$user == 'A', ])()
# id user point_a point_b point_c point_d point_e EQUAL
# 1 1001 A 1 NA 3 2 4 B
# 2 1002 A 1 NA 2 1 NA B
# 3 1003 A NA 2 3 NA 1 C
Note: R version 4.1.2 (2021-11-01)
Data:
dat <- structure(list(id = c(1001L, 1002L, 1003L, 1001L, 1002L, 1003L,
1001L, 1002L, 1003L), user = c("A", "A", "A", "B", "B", "B",
"C", "C", "C"), point_a = c(1, 1, NA, 1, 1, NA, 4, 1, NA), point_b = c(NA,
NA, 2, NA, NA, NA, 3, NA, 2), point_c = c(3, 2, 3, 3, 2, 3, 3,
2, 3), point_d = c(2, 1, NA, 2, 1, NA, 2, 1, NA), point_e = c(4,
NA, 1, 4, NA, NA, 4, NA, 1)), class = "data.frame", row.names = c(NA,
-9L))

Remove duplicate rows in nested list data frame

I have a data frame with a nested list:
df <- structure(list(zerobonds = c(1, 1, NA), nominal = c(20, 20, NA
), calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
call_strike = list(list(c(NA, 90), 110), list(list(NA, 90),
110), NA), puts = list(NA, NA, list(c(NA, 1), -1)), put_strike = list(
NA, NA, list(c(NA, 110), 90))), row.names = c(NA, -3L
), class = "data.frame")
df
## zerobonds nominal calls call_strike puts put_strike
## 1 1 20 NA, -1, 1 NA, 90, 110 NA NA
## 2 1 20 NA, -1, 1 NA, 90, 110 NA NA
## 3 NA NA NA NA NA, 1, -1 NA, 110, 90
My question: You see that the first and second rows are duplicated. I want to remove all duplicate rows in such data frames, and I am looking for a general method.
What I tried: duplicated doesn't seem to work, I guess because of the special structure of a data frame with nested lists inside.
You may need to flatten the nested lists of each column and then apply unique, e.g.,
> unique({df[] <- Map(function(x) Map(unlist, x), df); df})
zerobonds nominal calls call_strike puts put_strike
1 1 20 NA, -1, 1 NA, 90, 110 NA NA
3 NA NA NA NA NA, 1, -1 NA, 110, 90

Concatenating two columns

I have two data frames. They are
x <- data.frame(sulfur = c(NA, 5, 7, NA, NA), nitrate = c(NA, NA, NA, 3, 7))
y <- data.frame(sulfur = c(NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 6, 7))
I want a new data frame which should be like
z <- data.frame(sulfur = c(NA, 5, 7, NA, NA, NA, 3, 7, 9, NA), nitrate = c(NA, NA, NA, 3, 7, NA, NA, NA, 6, 7))
I am trying to join the two and make a single data frame. How do I do it?
Try this:
df<-data.frame(Sulfur=c(NA,5,7,NA,NA), Nitrate = c(NA,NA,NA,3,7))
df2<-data.frame(Sulfur=c(NA,3,7,9,NA), Nitrate = c(NA,NA,NA,6,7))
df3<-(rbind(df,df2))
> df3
Sulfur Nitrate
1 NA NA
2 5 NA
3 7 NA
4 NA 3
5 NA 7
6 NA NA
7 3 NA
8 7 NA
9 9 6
10 NA 7
> lst <- list(df, df2)
> class(lst)
[1] "list"
> dplyr::rbind_all(lst)   # rbind_all is deprecated; bind_rows() is its replacement
> do.call(rbind, lst)
You can use the above since it applies to multiple (more than two) data frames.
Other options include, placing the datasets in a list and then use rbindlist from data.table
library(data.table)
rbindlist(list(x,y))
or we can use bind_rows from dplyr.
library(dplyr)
bind_rows(x,y)
NOTE: The above two functions can be applied to more than 2 datasets.
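For instance, bind_rows also accepts a (named) list and can record which data frame each row came from via .id; a small sketch, where the "source" column name is my own choice:
library(dplyr)
bind_rows(list(first = x, second = y), .id = "source")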
