How to select rows from lists in R?

df1 = data.frame(
  y1 = c(1, -2, 3),
  y2 = c(4, 5, 6),
  out = c("*", "*", "")
)
Create another data frame:
df2 = data.frame(
  y1 = c(7, -3, 9),
  y2 = c(1, 4, 6),
  out = c("", "*", "")
)
Create a list of the data frames using list():
lis = list(df1, df2)
I want to compare row by row (the first row from index [[1]] with the first row from index [[2]], and so on):
if both rows have * or both are empty under out (i.e. they match), then select the row with the highest absolute value in y1 (and record the list index under ind);
otherwise, take the row with * under out (and record the list index under ind).
example of desired output:
y1   y2   out   ind
 1    4   *     1
-3    4   *     2
 9    6         2
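To make the rule concrete, here is a minimal base R sketch of the row-by-row selection (an illustration only, assuming both data frames in lis have the same number of rows; the answers below are the idiomatic solutions):
pick_row <- function(i) {
  r1 <- lis[[1]][i, ]
  r2 <- lis[[2]][i, ]
  if (r1$out == r2$out) {
    # both "*" or both "": take the row with the larger |y1|
    idx <- which.max(c(abs(r1$y1), abs(r2$y1)))
  } else {
    # exactly one row has "*": take that one
    idx <- which(c(r1$out, r2$out) == "*")
  }
  cbind(lis[[idx]][i, ], ind = idx)
}
do.call(rbind, lapply(seq_len(nrow(df1)), pick_row))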

An option is to bind the datasets with bind_rows, specifying .id to create an identifier for the list sequence. Then, grouped by the rowid of the 'ind' column, arrange the rows within each group so that * values in out come first, ordering by the absolute value of 'y1' in descending order within that. Slice the row with the maximum absolute 'y1' if the number of distinct elements in 'out' is 1; otherwise return the first row (which will have * in out if there is one).
library(dplyr)
library(data.table)
bind_rows(lis, .id = 'ind') %>%
  group_by(rn = rowid(ind)) %>%
  arrange(out != "*", desc(abs(y1)), .by_group = TRUE) %>%
  slice(if (n_distinct(out) == 1) which.max(abs(y1)) else 1) %>%
  ungroup() %>%
  select(y1, y2, out, ind)
Output:
# A tibble: 3 × 4
y1 y2 out ind
<dbl> <dbl> <chr> <chr>
1 1 4 "*" 1
2 -3 4 "*" 2
3 9 6 "" 2
Or, without arrange: after grouping, create a position index ('i1') inside slice for the rows where out is *. If all values of out in the group are the same, take the position of the maximum absolute 'y1'; otherwise subset 'y1' by 'i1' and return the element of 'i1' where the absolute value is largest.
bind_rows(lis, .id = 'ind') %>%
  group_by(rn = rowid(ind)) %>%
  slice({
    i1 <- which(out == "*")
    if (n_distinct(out) == 1) which.max(abs(y1)) else i1[which.max(abs(y1[i1]))]
  }) %>%
  ungroup() %>%
  select(y1, y2, out, ind)
Output:
# A tibble: 3 × 4
y1 y2 out ind
<dbl> <dbl> <chr> <chr>
1 1 4 "*" 1
2 -3 4 "*" 2
3 9 6 "" 2
Or use a similar ordering approach with data.table (note that order() here takes -abs(y1) for descending order; desc() is not a data.table function):
library(data.table)
dt1 <- rbindlist(lis, idcol = "ind")[, rn := rowid(ind)]
dt2 <- dt1[order(rn, out != "*", -abs(y1))]
# .I returns the row number of the selected row within each 'rn' group
dt2[dt2[, .I[if (uniqueN(out) == 1) which.max(abs(y1)) else 1L],
        .(rn)]$V1][, rn := NULL][]
ind y1 y2 out
1: 1 1 4 *
2: 2 -3 4 *
3: 2 9 6


Find unique entries in otherwise identical rows

I am currently trying to find a way to find unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview of all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
  # with duplicates removed
  distinct() %>%
  # filtered for rows with duplicated values in the ID column
  janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).
What I want is an overview of which column entries are not duplicated for each ID:
df %>%
distinct() %>%
janitor::get_dupes(ID) %>%
# Here I want a way to find columns with unidentical entries:
find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit simpler than yours, I think:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
d %>%
  group_by(ID) %>%
  # Check for each ID which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y)) %>%
  do.call(what = cbind) %>%   # Bind results for all IDs
  apply(1, any) %>%           # TRUE if there are differences anywhere in the row
  c(TRUE, .) %>%              # Keep the ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit: a more robust comparison using identical(), which avoids the coercion issues that can arise because apply() converts each row to a character vector:
d %>%
  group_by(ID) %>%
  # Check for each ID which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
  do.call(what = cbind) %>%   # Bind results for all IDs
  apply(1, any) %>%           # TRUE if there are differences anywhere in the row
  c(TRUE, .) %>%              # Keep the ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it took more steps than I would've thought necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
# Make a vector of the unique duplicated ID values
l <- distinct(df, ID) %>% unlist()
# lapply over each ID
df <- lapply(
  l,
  function(x) {
    # Filter rows for the duplicated ID
    dplyr::filter(df, ID == x) %>%
      # Transpose the dataframe (converts it into a matrix)
      t() %>%
      # Convert back to data frame
      as.data.frame() %>%
      # Keep the rows (former columns) where not all values equal the first (V1)
      dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
      # Transpose back
      t() %>%
      # Convert back to data frame
      as.data.frame()
  }
) %>%
  # Bind the dataframes in the list together
  bind_rows() %>%
  # Finally move the columns back into ascending order
  relocate(x2, .before = x3)
# Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = rep(letters[1:3], each = 2),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2)
)
df %>%
  distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f

Replacing value depending on paired column

I have a dataframe with two columns per sample (n > 1000 samples):
df <- data.frame(
  "sample1.a" = 1:5, "sample1.b" = 2,
  "sample2.a" = 2:6, "sample2.b" = c(1, 3, 3, 3, 3),
  "sample3.a" = 3:7, "sample3.b" = 2)
If there is a zero in column .b, the corresponding value in column .a should be set to NA.
I thought to write a function over the column names (without suffix) to select each pair of columns and conditionally exchange the values. Is there a simpler approach based on the tidyverse?
We can split the data.frame into a list of data.frames and do the replacement in base R
df1 <- do.call(cbind, lapply(split.default(df,
    sub("\\..*", "", names(df))), function(x) {
  x[, 1][x[2] == 0] <- NA
  x
}))
Or another option is Map
acols <- endsWith(names(df), "a")
bcols <- endsWith(names(df), "b")
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])
Or, if the 'a' and 'b' columns alternate, use a logical index for recycling: build a logical matrix from the 'b' columns and assign NA to the corresponding values in the 'a' columns.
df[c(TRUE, FALSE)][df[c(FALSE, TRUE)] == 0] <- NA
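To see which columns each recycled logical index picks (a quick check, not part of the original answer):
names(df[c(TRUE, FALSE)])   # the .a columns
#> [1] "sample1.a" "sample2.a" "sample3.a"
names(df[c(FALSE, TRUE)])   # the .b columns
#> [1] "sample1.b" "sample2.b" "sample3.b"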
Or an option with tidyverse: reshape into 'long' format (pivot_longer), set 'a' to NA wherever the corresponding 'b' is 0, and reshape back to 'wide' format with pivot_wider.
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_sep = "\\.",
               names_to = c('group', '.value')) %>%
  mutate(a = replace(a, b == 0, NA)) %>%
  pivot_wider(names_from = group, values_from = c(a, b)) %>%
  select(-rn)
# A tibble: 5 x 6
#  a_sample1 a_sample2 a_sample3 b_sample1 b_sample2 b_sample3
#      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#1         1         2         3         2         1         2
#2         2         3         4         2         3         2
#3         3         4         5         2         3         2
#4         4         5         6         2         3         2
#5         5         6         7         2         3         2
(The example data contains no zeros in the .b columns, so the .a values pass through unchanged.)

Copy preceding row if condition is met

My data
set.seed(123)
df <- data.frame(loc = rep(1:5, each = 5), value = sample(0:4, 25, replace = T))
a <- c("x","y","z","k")
df$id <- ifelse(df$value == 0, "no.data", sample(a,1))
head(df)
loc value id
1 1 1 z
2 1 3 z
3 1 2 z
4 1 4 z
5 1 4 z
6 2 0 no.data
For the rows where I have no data, the id and value columns contain no.data and 0. For all such rows (id == "no.data" and value == 0), I want to copy the value and id from the preceding row.
loc value id
1 1 1 z
2 1 3 z
3 1 2 z
4 1 4 z
5 1 4 z
6 2 4 z
Something like:
df %>% group_by(loc) %>% mutate(value = ifelse(value == 0, copy the value from preceding row), id = ifelse(id== "no.data", copy the id from preceding row ))
We could replace the 0s (and the matching "no.data" ids) with NA and then do a fill:
library(tidyverse)
library(naniar)
df %>%
  replace_with_na(replace = list(value = 0, id = "no.data")) %>%
  fill(value, id)
Unless you have a very big dataset, a simple loop should do:
for (r in 2:nrow(df)) {
  if (with(df[r, ], id == "no.data" && value == 0)) {
    df[r, c("id", "value")] <- df[r - 1L, c("id", "value")]
  }
}

Removing groups from dataframe if variable has repeated values

I would like to ask if there is a way of removing a group from a dataframe using dplyr (or any other way, for that matter). Let's say I have a dataframe in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above that would remove group 2, because its values are a, a, b, but not group 3, where they are a, c, a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
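The rule can also be checked per group with rle(), which collapses consecutive runs (a quick illustration, using the df as constructed in the answers below):
# TRUE for groups containing at least one run longer than 1,
# i.e. two consecutive identical values in Variable.2
tapply(df$Variable.2, df$Variable.1,
       function(v) any(rle(as.character(v))$lengths > 1))
#>     1     2     3
#> FALSE  TRUE FALSE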
To test for consecutive identical values, you can compare each value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing by comparing to the next value, using lead; the result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
  group_by(variable1) %>%
  mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
  filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
Prepare the data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
Write functions to test whether consecutive repetitions are present:
any.consecutive.p <- function(v) {
  if (length(v) < 2) return(FALSE)  # guard: a single value cannot repeat
  for (i in 1:(length(v) - 1)) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
  any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds a consecutive repetition in a vector (v).
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
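For example:
any.consecutive.p(c("a", "a", "b"))  # TRUE: "a" repeats consecutively
any.consecutive.p(c("a", "c", "a"))  # FALSE: no consecutive repetition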
Split the data frame by the values of Variable.1:
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test each one for consecutive duplicates in the Variable.2 column. If found, don't collect it. Bind the collected data frames by rows:
Reduce(rbind, lapply(df.l, function(df)
  if (!any.consecutive.in.col.p(df, "Variable.2")) df))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
Say you want to remove all groups of df, grouped by a, where column b has consecutive repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
  group_by(a) %>%
  filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
A benchmark shows that data.table is faster:
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
  df %>%
    group_by(a) %>%
    filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
  df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
library(microbenchmark)
microbenchmark(use_dplyr(), use_DT())

add rows to data.frame that are calculation from existing rows

Given
x = data.frame(value = 1:3, col2 = c('min', 'min', 'max'), id = c(1, 2, 1))
Is there any way to group by id, average the min and max values, and append the result as a new row?
value col2 id
1 1 min 1
2 2 min 2
3 3 max 1
4 2 avg 1 ## row I'd like to add, where "avg" identifies the average of min and max
If you only have those three columns, and each group only has the two values min and max, then:
library(dplyr)
x %>%
  group_by(id) %>%
  summarise(value = mean(value), col2 = 'avg') %>%
  bind_rows(x) %>%
  arrange(id)
which gives,
# A tibble: 5 x 3
     id value col2
  <dbl> <dbl> <chr>
1     1     2 avg
2     1     1 min
3     1     3 max
4     2     2 avg
5     2     2 min
I think you are looking for this:
library(dplyr)
x = data.frame(value = 1:3, col2 = c('min', 'min', 'max'), id = c(1, 2, 1))
x %>% group_by(id) %>% summarize(avg = mean(value))
This will give you the values you want, but won't append them to your dataframe.
With so little context, I'm not sure you should do so, but you can rbind it if you really want to; a sketch follows.
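A minimal sketch of that rbind, assuming you keep the 'avg' label from the desired output (not part of the original answer):
avgs <- x %>%
  group_by(id) %>%
  summarize(value = mean(value), col2 = "avg", .groups = "drop")
# rbind.data.frame matches columns by name, so the different column
# order of x and avgs is not a problem
rbind(x, as.data.frame(avgs))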
