Remove duplicate rows in a nested list data frame in R

I have a data frame with a nested list:
df <- structure(list(
  zerobonds = c(1, 1, NA),
  nominal = c(20, 20, NA),
  calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
  call_strike = list(list(c(NA, 90), 110), list(list(NA, 90), 110), NA),
  puts = list(NA, NA, list(c(NA, 1), -1)),
  put_strike = list(NA, NA, list(c(NA, 110), 90))
), row.names = c(NA, -3L), class = "data.frame")
df
##   zerobonds nominal     calls call_strike      puts  put_strike
## 1         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 2         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 3        NA      NA        NA          NA NA, 1, -1 NA, 110, 90
My question: as you can see, the first and second rows are duplicates. I want to remove all duplicate rows from such data frames, and I am looking for a general method.
What I tried: duplicated doesn't seem to work, presumably because of the nested list structure inside the data frame.

You may need to flatten the nested lists of each column and then apply unique, e.g.,
> unique({df[] <- Map(function(x) Map(unlist, x), df); df})
  zerobonds nominal     calls call_strike      puts  put_strike
1         1      20 NA, -1, 1 NA, 90, 110        NA          NA
3        NA      NA        NA          NA NA, 1, -1 NA, 110, 90
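For readability, the one-liner can be unrolled into its two steps; this is an equivalent rewrite of the code above, not a different method:
# Flatten every cell of every column into a plain vector, so that
# identical rows compare equal element by element.
df[] <- Map(function(col) Map(unlist, col), df)
unique(df)  # duplicated()/unique() now behave as usual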

Replace NAs between two values without a loop

I have the following data frame:
data <- structure(list(Date = structure(c(-17897, -17896, -17895, -17894,
-17893, -17892, -17891, -17890, -17889, -17888, -17887, -17887,
-17886, -17885, -17884, -17883, -17882, -17881, -17880, -17879,
-17878, -17877, -17876, -17875, -17874, -17873, -17872, -17871,
-17870, -17869, -17868, -17867, -17866, -17865, -17864), class = "Date"),
duration = c(NA, NA, NA, 5, NA, NA, NA, 5, NA, NA, 1, 1,
NA, NA, 3, NA, 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 4, NA, NA, 4, NA, NA), name = c(NA, NA, NA, "Date_beg",
NA, NA, NA, "Date_end", NA, NA, "Date_beg", "Date_end", NA,
NA, "Date_beg", NA, "Date_end", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "Date_beg", NA, NA, "Date_end", NA, NA
)), row.names = c(NA, -35L), class = c("tbl_df", "tbl", "data.frame"
))
It looks like this:
   Date       duration name
   <date>        <dbl> <chr>
 1 1921-01-01       NA NA
 2 1921-01-02       NA NA
 3 1921-01-03       NA NA
 4 1921-01-04        5 Date_beg
 5 1921-01-05       NA NA
 6 1921-01-06       NA NA
 7 1921-01-07       NA NA
 8 1921-01-08        5 Date_end
 9 1921-01-09       NA NA
10 1921-01-10       NA NA
...
I want to replace the NA values in column name that are between rows with Date_beg and Date_end with the word "event".
I have tried this:
data %<>% mutate(name = ifelse((lag(name) == 'Date_beg' | lag(name) == 'event') & is.na(name), 'event', name))
But only the first row after Date_beg changes (ifelse is vectorised, so lag(name) == 'event' never sees the values written in the same pass). It is quite easy with a for loop, but I wanted to use a more R-like method.
There is probably a better way using data.table::nafill, but as you're using tidyverse functions, I would create an extra events column with tidyr::fill and then pull it through to name wherever name is NA:
library(dplyr)
library(tidyr)

data %>%
  mutate(
    events = ifelse(fill(data, name)$name == "Date_beg", "event", NA),
    name = coalesce(name, events)
  ) %>%
  select(-events)
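To see what drives the ifelse() above, it helps to inspect the intermediate fill() result: fill() carries the last non-NA name downward, so every row inside an open interval reads "Date_beg", while rows at or after a close read "Date_end". On the first eight rows:
fill(data, name)$name[1:8]
## [1] NA         NA         NA         "Date_beg" "Date_beg" "Date_beg"
## [7] "Date_beg" "Date_end"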
You can do it by looking at the indices where there have been more "Date_beg" than "Date_end" occurrences (default = 0 in lag() keeps the first comparison FALSE rather than NA):
library(dplyr)
data$name[lag(cumsum(data$name == "Date_beg" & !is.na(data$name)), default = 0) -
            cumsum(data$name == "Date_end" & !is.na(data$name)) > 0] <- "event"
print(data, n = 20)
# # A tibble: 35 x 3
#    Date       duration name
#    <date>        <dbl> <chr>
#  1 1921-01-01       NA NA
#  2 1921-01-02       NA NA
#  3 1921-01-03       NA NA
#  4 1921-01-04        5 Date_beg
#  5 1921-01-05       NA event
#  6 1921-01-06       NA event
#  7 1921-01-07       NA event
#  8 1921-01-08        5 Date_end
#  9 1921-01-09       NA NA
# 10 1921-01-10       NA NA
# 11 1921-01-11        1 Date_beg
# 12 1921-01-11        1 Date_end
# 13 1921-01-12       NA NA
# 14 1921-01-13       NA NA
# 15 1921-01-14        3 Date_beg
# 16 1921-01-15       NA event
# 17 1921-01-16        3 Date_end
# 18 1921-01-17       NA NA
# 19 1921-01-18       NA NA
# 20 1921-01-19       NA NA
# # ... with 15 more rows
Lagging the first cumulative sum by one row is required so that you don't overwrite the "Date_beg" at the start of each run.
Another dplyr approach uses the cumsum function.
If the row in the name column is NA, it adds 0 to the running sum, otherwise 1. The values at and after each Date_beg will therefore always be odd (0 + 1) and the values at and after each Date_end always even (0 + 1 + 1). Then replace values that are odd in the ref column AND NA in the name column with "event".
library(dplyr)

data %>%
  mutate(ref = cumsum(ifelse(is.na(name), 0, 1)),
         name = ifelse(ref %% 2 == 1 & is.na(name), "event", name)) %>%
  select(-ref)
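A quick look at the running counter over the first 12 rows makes the odd/even logic concrete (cumsum(!is.na(name)) is equivalent to the ifelse() above, since TRUE counts as 1):
# ref is odd between Date_beg and Date_end, even outside those intervals
cumsum(!is.na(data$name))[1:12]
## [1] 0 0 0 1 1 1 1 2 2 2 3 4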

Remove NAs from nested list data frame

The following really seems to be a tough nut to crack:
I have a data frame with a nested list:
df <- structure(list(
  zerobonds = c(1, 1, NA),
  nominal = c(20, 20, NA),
  calls = list(list(c(NA, -1), 1), list(list(NA, -1), 1), NA),
  call_strike = list(list(c(NA, 90), 110), list(list(NA, 90), 110), NA),
  puts = list(NA, NA, list(c(NA, 1), -1)),
  put_strike = list(NA, NA, list(c(NA, 110), 90))
), row.names = c(NA, -3L), class = "data.frame")
df
##   zerobonds nominal     calls call_strike      puts  put_strike
## 1         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 2         1      20 NA, -1, 1 NA, 90, 110        NA          NA
## 3        NA      NA        NA          NA NA, 1, -1 NA, 110, 90
I want to print the structure without any NAs (dots instead of the blanks are ok too):
  zerobonds nominal calls call_strike  puts put_strike
1         1      20 -1, 1     90, 110
2         1      20 -1, 1     90, 110
3                                     1, -1    110, 90
I have tried all kinds of things. The best approach so far seems to be something like rapply(df, na.omit, how = "replace"), where I can't even suppress the warnings (suppressWarnings doesn't seem to work here!). print(df, na.print = "") doesn't help either.
I am really exhausted now, nothing seems to work... data frames in the form of nested lists don't seem to be a good idea after all... could anybody help?
You can try the code below
df[] <- rapply(Map(as.list, df), na.omit, how = "replace")
which gives
> df
  zerobonds nominal calls call_strike  puts put_strike
1         1      20 -1, 1     90, 110
2         1      20 -1, 1     90, 110
3                                     1, -1    110, 90
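The Map(as.list, df) step matters here: it converts the atomic columns into plain lists of cells, so rapply() applies na.omit() cell by cell instead of shortening a whole column, which is presumably what made the original rapply(df, ...) attempt warn. A quick look at one converted column:
str(Map(as.list, df)$zerobonds)
## List of 3
##  $ : num 1
##  $ : num 1
##  $ : logi NA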
You can create your own recursive function and apply it to each list column:
rm_nested_na <- function(x) {
  if (is.atomic(x)) {
    na.omit(x)
  } else {
    lapply(x, rm_nested_na)
  }
}

res <- df
listcol <- sapply(res, is.list)
res[listcol] <- lapply(res[listcol], rm_nested_na)
res
This is clearly inefficient if the nesting is deep.

R: Shift multiple columns by different number of rows

I have a list of tibbles like the following:
myList <- list(
  A = structure(list(
        ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
        g1 = c(0, 1, 2, NA, NA, NA, NA, NA, NA),
        g2 = c(NA, NA, NA, 3, 4, 5, NA, NA, NA),
        g3 = c(NA, NA, NA, NA, NA, NA, 6, 7, 8)),
      row.names = c(NA, -9L),
      class = c("tbl_df", "tbl", "data.frame")),
  B = structure(list(
        ID = c(1, 2, 1, 2, 1, 2),
        g1 = c(10, 11, NA, NA, NA, NA),
        g2 = c(NA, NA, 12, 13, NA, NA),
        g3 = c(NA, NA, NA, NA, 14, 15)),
      row.names = c(NA, -6L),
      class = c("tbl_df", "tbl", "data.frame"))
)
Each element looks like this:
     ID    g1    g2    g3
  <dbl> <dbl> <dbl> <dbl>
      1     0    NA    NA
      2     1    NA    NA
      3     2    NA    NA
      1    NA     3    NA
      2    NA     4    NA
      3    NA     5    NA
      1    NA    NA     6
      2    NA    NA     7
      3    NA    NA     8
The g* columns are created dynamically, during previous mutates, and their number can vary, but it will be the same across all list elements.
Every g* column has only certain non-NA elements (as many as the unique IDs).
I would like to shift the g* columns so that the non-NA elements move to the top rows.
I can do it for a single column with data.table's shift():
library(data.table)
num.shifts <- rle(is.na(myList[[1]]$g1))$lengths[1]  # length of g1's leading non-NA run
shift(myList[[1]]$g2, -num.shifts)                   # negative n shifts upwards
but how can I do it for all the g* columns, in all list elements, when I don't know the number of g* columns in advance?
Ideally I would like a tidyverse solution, but that's not a requirement...
Thanks!
We can loop over the list with map and use mutate_at on the columns whose names match 'g' followed by digits, ordering each column so that the non-NA elements come first:
library(dplyr)
library(purrr)

map(myList, ~
  .x %>%
    mutate_at(vars(matches('^g\\d+')), ~ .[order(is.na(.))]))
In base R, we can do
lapply(myList, function(x) {
  i1 <- grepl("^g\\d+$", names(x))
  x[i1] <- lapply(x[i1], function(y) y[order(is.na(y))])
  x
})
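As an aside, mutate_at() is superseded in dplyr >= 1.0.0; here is a sketch of the same reordering with across(), under the same assumption that the relevant columns match ^g\\d+:
library(dplyr)
library(purrr)

# For each tibble, reorder every g* column so non-NA values come first.
map(myList, function(d) {
  d %>% mutate(across(matches("^g\\d+"), function(v) v[order(is.na(v))]))
})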

If column A has a value, summarise the values in column B until the next value in column A appears

I have performed an experiment in which people move cubes around until they have made a figure they like. When they like a figure, they save it and create a new one. The script tracked the time and number of moves between all figure saves.
I now have a column (A) with the number of moves between each save and a column (B) with the time between each move until the figure is saved. Column A is thus filled with NAs and then a number (which signifies a figure save), and column B has time in seconds in all rows (except the first row), signifying all the moves made.
Excerpt of data:
  A           B  C
 NA   1.6667798
 NA   3.3326443
 NA   3.5506110
 NA  11.4995562
 NA   1.4334849
 NA   4.9502637
 NA   2.1161980
 NA   4.7833326
 NA   2.8500842
 NA   4.0331373
 NA   4.3498785
 12   5.0910905  Sum
 NA   4.2424078
 NA   1.7332665
 NA   1.5341006
  3   4.8923275  Sum
 NA   4.1064621
 NA   3.3498289
 NA   1.6002373
  3   6.0122170  Sum
I have tried several loop options, but I cannot seem to make any of them work properly.
I made this loop, but it does not do the correct calculation in column C.
data$C <- rep(NA, nrow(data))
for (i in unique(data$id)) {
  C <- which(data$id == i & data$type == "moveblock")
  for (e in 1:length(C)) {
    if (e == 1) {
      data$C[C[e]] = C[e] - which(data$id == i)[1]
    } else if (e > 1) {
      data$C[C[e]] = C[e] + C[e+1] + 1
    }
  }
  d_times <- which(data$id == i)
  for (t in 2:length(d_times)) {
    data$B[d_times[t]] <- data$time[d_times[t]] - data$time[d_times[t-1]]
  }
}
I want a new column (C) which has the sum of all rows of column B until a figure has been saved, i.e. until a number appears in column A. In other words, I want to calculate the total time the subject spent on all the moves before saving the figure.
Hope anyone can figure this out!
We can create groups based on the occurrence of non-NA values and take the sum within each group:
library(dplyr)

df %>%
  group_by(group = lag(cumsum(!is.na(A)), default = 0)) %>%
  summarise(sum = sum(B, na.rm = TRUE))
#   group   sum
#   <dbl> <dbl>
# 1     0  49.7
# 2     1  12.4
# 3     2  15.1
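If, as the question asks, the sum should land in a new column C on the save rows rather than in a summary table, the same grouping works with mutate() instead of summarise(); a sketch using the df from the data section below:
library(dplyr)

df %>%
  group_by(group = lag(cumsum(!is.na(A)), default = 0)) %>%
  mutate(C = ifelse(!is.na(A), sum(B, na.rm = TRUE), NA)) %>%  # group total only at save rows
  ungroup() %>%
  select(-group)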
In base R, we can use aggregate to do the same
aggregate(B ~ c(0, head(cumsum(!is.na(A)), -1)), df, sum, na.rm = TRUE)
data
df <- structure(list(A = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 12L, NA, NA, NA, 3L, NA, NA, NA, 3L), B = c(1.6667798, 3.3326443,
3.550611, 11.4995562, 1.4334849, 4.9502637, 2.116198, 4.7833326,
2.8500842, 4.0331373, 4.3498785, 5.0910905, 4.2424078, 1.7332665,
1.5341006, 4.8923275, 4.1064621, 3.3498289, 1.6002373, 6.012217
)), class = "data.frame", row.names = c(NA, -20L))
You could build the periods (i.e. row index sequences) and sum the values of column B accordingly. First create a vector saved that marks where a subject has saved, then build the list of row sequences with apply(). Finally, sapply() loops over the sequences in the periods list.
saved <- which(!is.na(dat$A))
periods <- apply(cbind(c(1, (saved + 1)[-length(saved)]), saved), 1,
                 function(x) seq(x[1], x[2]))  # drop the last start so starts pair with ends
dat$C[saved] <- sapply(periods, function(x) sum(dat$B[x]))
Result
dat$C
#  [1]       NA       NA       NA       NA       NA       NA       NA       NA       NA
# [10]       NA       NA 49.65706       NA       NA       NA 12.40210       NA       NA
# [19]       NA 15.06875
Data
dat <- structure(list(A = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 12L, NA, NA, NA, 3L, NA, NA, NA, 3L), B = c(1.6667798, 3.3326443,
3.550611, 11.4995562, 1.4334849, 4.9502637, 2.116198, 4.7833326,
2.8500842, 4.0331373, 4.3498785, 5.0910905, 4.2424078, 1.7332665,
1.5341006, 4.8923275, 4.1064621, 3.3498289, 1.6002373, 6.012217
), C = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -20L), class = "data.frame")

Function not working on larger dataset

I'm trying to find a total count of a certain value in a large dataset. Specifically, I want to create a new variable called "diabetes" coded 0/1 for no/yes. Here is an example:
Test <- data.frame(
  s_1_1 = c(1223, NA, 1223, NA, NA),
  s_1_2 = c(NA, 1223, NA, NA, NA),
  s_1_2 = c(NA, NA, NA, NA, NA))  # duplicate name: data.frame() renames it s_1_2.1

Disease0 <- paste("s_1_", 1:2, sep = "")
Test$Tp2Diabetes_0_0 <- apply(Test, 1, function(Db)
  as.integer(any(Db[Disease0] == 1223, na.rm = TRUE)))
When I run this code on my small example it works fine and gives me the result I want:
diabetes = 1, 1, 1, 0, 0
The issue is that I am running this on a dataset of over 500k rows and it does not produce the desired results. For example, it shows that only 200 people out of the 500k have diabetes, while the overall data showcase indicates it should be closer to 3,000. I don't understand what is going on here or what I am doing wrong.
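One likely culprit (an assumption, since the full dataset isn't shown): apply() coerces its input to a matrix, so if the large data frame contains any character column, every value becomes a format()-padded string and the comparison with 1223 quietly fails for most rows. A minimal sketch of the effect with a hypothetical mixed-type frame:
# One character column is enough to turn the whole matrix into
# right-justified strings, padded to each column's common width.
m <- as.matrix(data.frame(code = c(5, 12345), label = c("a", "b")))
m[1, "code"]         # "    5"
m[1, "code"] == 5    # FALSE, even though the underlying value is 5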
You should go for something simpler like this:
Test <- data.frame(
  s_1_1 = c(1223, NA, 1223, NA, NA),
  s_1_2 = c(NA, 1223, NA, NA, NA),
  s_1_2 = c(NA, NA, NA, NA, NA))

Test$Tp2Diabetes_0_0 <- rowSums(Test == 1223, na.rm = TRUE) > 0
  s_1_1 s_1_2 s_1_2.1 Tp2Diabetes_0_0
1  1223    NA      NA            TRUE
2    NA  1223      NA            TRUE
3  1223    NA      NA            TRUE
4    NA    NA      NA           FALSE
5    NA    NA      NA           FALSE
Or if you need only the first two columns as indicators:
Test$Tp2Diabetes_0_0 <- rowSums(Test[, 1:2] == 1223, na.rm = TRUE) > 0
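Since the question asks for a 0/1 coding rather than TRUE/FALSE, wrap the logical flag in as.integer():
# 1 = code 1223 present in the indicator columns, 0 = absent
Test$Tp2Diabetes_0_0 <- as.integer(rowSums(Test[, 1:2] == 1223, na.rm = TRUE) > 0)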
