Remove rows if the same combination of values exists in different columns - R

I have 410 DNA sequences that I have compared against each other to get their similarity.
Now, to trim the database, I need to get rid of the rows that have the same pair of values across the two sequence columns, because every comparison appears twice.
To make myself clear, I have something like
tribble(
  ~seq01, ~seq02, ~similarity,
  "a", "b", 100.000,
  "b", "a", 100.000,
  "c", "d", 99.000,
  "d", "c", 99.000,
)
Comparing a-b and b-a is the same thing, so I'd want to get rid of the duplicated value.
What I want to end up with is
tribble(
  ~seq01, ~seq02, ~similarity,
  "a", "b", 100.000,
  "c", "d", 99.000
)
I am not sure how to proceed; all the ways I thought of are kind of hacky. I checked other answers, but they don't really satisfy me.
Any input would be greatly appreciated (but tidy inputs are even more appreciated!)

We can use pmin and pmax to sort the two values within each row, and then use distinct to keep the unique rows.
library(dplyr)
df %>%
  mutate(col1 = pmin(seq01, seq02),
         col2 = pmax(seq01, seq02), .before = 1) %>%
  distinct(col1, col2, similarity)
#   col1  col2  similarity
#   <chr> <chr>      <dbl>
# 1 a     b            100
# 2 c     d             99
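If you would rather keep all the original columns and just drop the helpers afterwards, a variant of the same idea (a sketch, not from the original answer) uses the .keep_all argument of distinct:

library(dplyr)
df %>%
  mutate(col1 = pmin(seq01, seq02),
         col2 = pmax(seq01, seq02)) %>%
  distinct(col1, col2, .keep_all = TRUE) %>%  # keep the first row per unordered pair
  select(-col1, -col2)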

Another, base R, approach:
df$add1 <- apply(df[, 1:2], 1, min)  # find rowwise minimum values
df$add2 <- apply(df[, 1:2], 1, max)  # find rowwise maximum values
df <- df[!duplicated(df[, 4:5]), ]   # remove rows with identical values in new col's
df[, 4:5] <- NULL                    # remove auxiliary col's
Result:
df
# A tibble: 2 x 3
  seq01 seq02 similarity
  <chr> <chr>      <dbl>
1 a     b            100
2 c     d             99
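Starting from the original df, the same idea can also be written more compactly by sorting each row's pair and feeding the result to duplicated (a sketch that assumes the first two columns hold the sequence names):

# sort each seq01/seq02 pair rowwise, then drop rows whose sorted pair already appeared
df[!duplicated(t(apply(df[, 1:2], 1, sort))), ]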

Compare values between rows for specific columns

I have a rather curious question, but I hope I can find an answer. Unfortunately, the search function on Stack Overflow didn't help me with this one.
I have the following dataset structure:
my.df <- data.frame(prs_id = c(1234, 1255, 1556, 3173),
                    vrs_id = c(3145, 3145, 3333, 3333),
                    V1_2017 = c(12, 14, 12, 35),
                    V2_2017 = c("A", "B", "C", "D"),
                    V1_2018 = c(13, 16, 13, 34),
                    V2_2018 = c("A", "B", "C", "D"),
                    V1_2019 = c(15, 17, 17, 45),
                    V2_2019 = c("A", "B", "C", "D"),
                    V1_2020 = c(17, 17, 22, 45),
                    V2_2020 = c("A", "B", "C", "D"))
As you might see, I filtered duplicates from a larger dataset (duplicates in "vrs_id"). The duplicates are not supposed to be there, and the dataset is currently in wide format. I need a way to decide which "vrs_id" to keep and which to drop. Therefore the function must compare the corresponding values of V1_2020 down to V1_2017, according to the "vrs_id" they belong to. V2 is just to visualize that there are more variables (actually 13) between the V1 variables.
E.g. "vrs_id" == 3333 requires checking which V1_2020 (45 or 22) is larger. If neither is larger (see 17 vs 17 for "vrs_id" == 3145), the function should move on to the next variable, V1_2019, and do the same. If, at the end, there is no difference between the duplicates, the first one (according to the row number in the dataframe) should be chosen.
The subset only contains the duplicates and the corresponding originals, so a potential function need not be capable of comparing even more values across rows. I tried to include pmax, but when grouping the dataframe by vrs_id, it automatically chose vrs_id as the maximum in the row. Excluding vrs_id from the subset, in turn, gave an error in grouping because the grouping variable was missing.
Is there anybody with an idea of how to compute these comparisons?
Any help would be appreciated!
Edit:
The expected output should look like this:
new.df <- data.frame(
  prs_id = c(1255, 3173),
  vrs_id = c(3145, 3333),
  V1_2017 = c(14, 35),
  V2_2017 = c("B", "D"),
  V1_2018 = c(16, 34),
  V2_2018 = c("B", "D"),
  V1_2019 = c(17, 45),
  V2_2019 = c("B", "D"),
  V1_2020 = c(17, 45),
  V2_2020 = c("B", "D"))
Rows 2 and 4 (prs_id 1255 and 3173) should be kept: for row 2, V1_2019 shows 17 > 15 (even though V1_2020 is tied, 17 = 17), and for row 4, V1_2020 shows 45 > 22, so row 3 is discarded.
This may be an option, using tidyverse packages.
Note this only works where all values for one prs_id are greater than or equal to the corresponding values in the duplicate prs_ids.
library(dplyr)
library(tidyr)
library(stringr)

df1 <- my.df %>%
  pivot_longer(starts_with("V1")) %>%
  group_by(vrs_id, name) %>%
  mutate(max_val = if_else(value == max(value), 0, 1)) %>%
  ungroup() %>%
  group_by(prs_id) %>%
  mutate(prs_discard = sum(max_val)) %>%
  filter(prs_discard == 0) %>%
  select(-c(max_val, prs_discard)) %>%
  pivot_wider(names_from = name, values_from = value)
df1
#> # A tibble: 2 x 10
#> # Groups:   prs_id [2]
#>   prs_id vrs_id V2_2017 V2_2018 V2_2019 V2_2020 V1_2017 V1_2018 V1_2019 V1_2020
#>    <dbl>  <dbl> <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
#> 1   1255   3145 B       B       B       B            14      16      17      17
#> 2   3173   3333 D       D       D       D            35      34      45      45
Created on 2021-11-23 by the reprex package (v2.0.1)
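The approach above discards any prs_id that is not the maximum in every year. For the sequential tie-breaking the question describes (compare V1_2020 first, fall back to earlier years on ties, keep the first row if everything is tied), a sketch along these lines should also work; it hard-codes the year columns for clarity, and relies on dplyr's arrange using a stable sort so that fully tied duplicates keep their original order:

library(dplyr)

my.df %>%
  group_by(vrs_id) %>%
  # sort each duplicate group so the lexicographically largest
  # V1_2020, V1_2019, V1_2018, V1_2017 combination comes first
  arrange(desc(V1_2020), desc(V1_2019), desc(V1_2018), desc(V1_2017),
          .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()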

In R, subset a dataframe on rows whose ID appears more than once [duplicate]

Background
I have a dataframe d with ~10,000 rows and n columns, one of which is an ID variable. Most IDs appear once, but some appear more than once.
Problem
I'd like a new dataframe d_sub which only contains IDs that appear more than once in d.
What I've tried
I've tried something like this:
d_sub <- subset(d, duplicated(d$ID))
But that only gets me one entry each for IDs b and d, and I want all of their respective rows.
Any thoughts?
We may need to combine duplicated with its fromLast = TRUE counterpart using |, as duplicated by itself returns FALSE for the first occurrence of each 'ID':
d_sub <- subset(d, duplicated(ID)|duplicated(ID, fromLast = TRUE))
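An equivalent base R idiom (a sketch, not part of the original answer) counts group sizes with ave and keeps rows whose ID occurs more than once:

# ave() returns, for each row, the size of that row's ID group
d_sub <- d[ave(seq_along(d$ID), d$ID, FUN = length) > 1, ]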
We could use add_count, then filter on n:
library(dplyr)
df %>%
  add_count(ID) %>%
  filter(n != 1) %>%
  select(-n)
Example:
library(dplyr)

df <- tribble(
  ~ID, ~gender, ~zip,
  "a", "f", 1,
  "b", "f", NA,
  "b", "m", 2,
  "c", "f", 3,
  "d", "f", NA,
  "d", "m", 4)

df %>%
  add_count(ID) %>%
  filter(n != 1) %>%
  select(-n)
Output:
  ID    gender   zip
  <chr> <chr>  <dbl>
1 b     f         NA
2 b     m          2
3 d     f         NA
4 d     m          4

Compare each row in two dataframes in R

I have 2 data frames with account numbers and amounts, plus some other irrelevant columns. I would like the output to show a Y or N depending on whether they match.
I need to compare the account number in row 1 of dataframe A to the account number in row 1 of dataframe B, and put a Y in a column if they match or an N if they don't. I've managed to get code that checks whether there is a match anywhere in the dataframe, but I need to check each row individually.
E.g.
df1

| account.num | x1 | x2 | x3 |
|-------------|----|----|----|
| 100         | a  | b  | c  |
| 101         | a  | b  | c  |
| 102         | a  | b  | c  |
| 103         | a  | b  | c  |

df2

| account.num | x1 | x2 | x3 |
|-------------|----|----|----|
| 100         | a  | b  | c  |
| 102         | a  | b  | c  |
| 101         | a  | b  | c  |
| 103         | a  | b  | c  |

output

| account.num | x1 | x2 | x3 | match |
|-------------|----|----|----|-------|
| 100         | a  | b  | c  | Y     |
| 101         | a  | b  | c  | N     |
| 102         | a  | b  | c  | N     |
| 103         | a  | b  | c  | Y     |
So, row 1 matches as both rows have the same account number, but row 2 doesn't because they differ. The other data in the dataframes doesn't matter, just that column. Can I do this without merging the data frames?
You can use == to check whether account.num is equal row by row, and use the resulting logical vector to index into c("N", "Y"):
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
#   account.num x1 x2 x3 match
# 1         100  a  b  c     Y
# 2         101  a  b  c     N
# 3         102  a  b  c     N
# 4         103  a  b  c     Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
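One caveat, as an aside not covered in the original answer: if account.num can contain NA, the == comparison yields NA and the lookup produces NA instead of "N". A guarded variant:

# treat any comparison involving NA as a non-match
same <- !is.na(df1$account.num) & !is.na(df2$account.num) &
  df1$account.num == df2$account.num
df1$match <- c("N", "Y")[1 + same]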
If you want a base R solution, here is a quick sketch. Assuming both dataframes have the same number of rows, it should work with your data.
# example dataframes
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"))
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"))

res <- c()  # initialise empty result vector
for (rownum in seq_len(nrow(a))) {
  # iterate over all row numbers and compare the full rows
  res[rownum] <- all(a[rownum, ] == b[rownum, ])
}
res  # result vector
# [1] FALSE  TRUE FALSE
# you can put it in frame a like this; example colname is "equalB"
a$equalB <- res
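The loop can also be replaced by a vectorized one-liner (a sketch under the same equal-dimensions assumption):

# TRUE where every column of b agrees with the corresponding row of a
a$equalB <- rowSums(a[names(b)] == b) == ncol(b)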
If you want a tidyverse solution, you can use left_join.
The principle here is to try to match the data from df2 to the data from df1. Where it matches, TRUE is added to a match column; the code then replaces the NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)

df1 <-
  tribble(~account_num, ~x1, ~x2, ~x3,
          100, "a", "b", "c",
          101, "a", "b", "c",
          102, "a", "b", "c",
          103, "a", "b", "c") %>%
  rowid_to_column()  # because the position in the df is important information,
                     # I need to hardcode it in the df

df2 <-
  tribble(~account_num, ~x1, ~x2, ~x3,
          100, "a", "b", "c",
          102, "a", "b", "c",
          101, "a", "b", "c",
          103, "a", "b", "c") %>%
  rowid_to_column()

# take df1
df1 %>%
  # try to match df1 with a version of df2 that has a new column where match = TRUE,
  # joining by rowid, account_num, x1, x2, and x3
  left_join(df2 %>%
              tibble::add_column(match = TRUE),
            by = c("rowid", "account_num", "x1", "x2", "x3")) %>%
  # replace the NA in match with FALSE
  replace_na(list(match = FALSE))
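If the question's Y/N labels are needed instead of TRUE/FALSE, appending mutate(match = if_else(match, "Y", "N")) to the pipeline converts them.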

Looking for a function to impute missing values according to the ratio of other values in R (probability-based)

I have a data frame with over 9000 data points, and 3 columns each have nearly 1000 missing values. I am trying to write a function to impute them according to the proportion of the other values in the column; I am not looking for the most common method (mode). After finding the ~1000 missing values in each column, I want to distribute values from the other rows in that particular column according to their proportions. Suppose a column has a 50:50 ratio of "a" to "b" and 4 missing values: I would fill those missing values according to the ratio, so 2 a's and 2 b's.
I don't completely understand the question but here are some things to try.
You can tabulate the values of b, including NAs
library(tidyverse)

a <- c(1:12)
b <- c("a", "a", "b", "c",
       "a", "c", "b", NA,
       "b", "c", "a", "a")
df <- tibble(a = a, b = b)

df %>%
  group_by(b) %>%
  summarise(n())
Or, using table:
table(b, useNA = 'always')
#    a    b    c <NA>
#    5    3    3    1
To replace the missing values with the most common non-missing value:
tab <- table(b)
replacement <- names(which.max(tab))
df %>%
  mutate(b = if_else(is.na(b), replacement, b))
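For the proportion-based filling the question actually asks for, here is a minimal sketch (an assumption about the intended behaviour: draw replacement values with probabilities equal to the observed ratios):

tab <- table(df$b)   # counts of the non-missing values
miss <- is.na(df$b)
set.seed(1)          # make the random draw reproducible
df$b[miss] <- sample(names(tab), size = sum(miss),
                     replace = TRUE, prob = tab / sum(tab))

For an exact split rather than a random draw (e.g. exactly 2 a's and 2 b's for a 50:50 ratio and 4 NAs), one could instead build the replacement vector with rep() on round(prop.table(tab) * sum(miss)) and shuffle it.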

How to associate a list of character vectors with your data frame in R

The shape of my data is fairly simple:
set.seed(1337)
id <- c(1:4)
values <- runif(n = 4, min = 0, max = 1)
df <- data.frame(id, values)
df
  id     values
1  1 0.57632155
2  2 0.56474213
3  3 0.07399023
4  4 0.45386562
What isn't simple: I have a list of character vectors that matches up to each row, where each list item can be empty (NA) or can contain up to 5 separate tags, like...
tags <- list(
  c("A"),
  NA,
  c("A", "B", "C"),
  c("B", "C")
)
I will be asked various questions using the tags as classifiers, for instance, "What is the average value of all rows with a B tag?" or "How many rows contain both tag A and tag C?"
What way would you choose to store the tags so that I can do this? My real-life data file is quite large, which makes experimenting with unlist or other commands difficult.
Here are a couple of options to get the expected output. Create 'tags' as a list column in the dataset and unnest (as already suggested in the comments), then summarise the number of 'A' or 'C' tags by taking the sum of a logical vector. Similarly, take the mean of 'values' where 'tag' is 'B'.
library(tidyverse)

df %>%
  mutate(tag = tags) %>%
  unnest(tag) %>%
  summarise(nAC = sum(tag %in% c("A", "C")),
            meanB = mean(values[tag == "B"], na.rm = TRUE))
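With the df and tags from the question, this should return nAC = 4 (two A tags plus two C tags; the NA tag does not match %in%) and a meanB of roughly 0.264, the mean of the values in rows 3 and 4.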
That is not very hard: you just need to assign your list to your df as a new column named tags, then unnest. I have listed solutions for your example questions below.
library(tidyr)
library(dplyr)

df$tags <- list(
  c("A"),
  NA,
  c("A", "B", "C"),
  c("B", "C")
)
Newdf <- df %>% tidyr::unnest(tags)
Q1.
Newdf %>%
  group_by(tags) %>%
  summarise(Mean = mean(values)) %>%
  filter(tags == 'B')
  tags  Mean
  <chr> <dbl>
1 B     0.263927925960161
Q2.
Newdf %>%
  group_by(id) %>%
  dplyr::summarise(Count = any(tags == 'A') & any(tags == 'C'))
# A tibble: 4 x 2
     id Count
  <int> <lgl>
1     1 FALSE
2     2 NA
3     3 TRUE
4     4 FALSE
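Note the NA for id 2: its only tag is NA, and any(NA == 'A') is NA. If FALSE is preferred there, a small variant (an assumption about the desired behaviour) uses %in%, which treats NA as a non-match:

Newdf %>%
  group_by(id) %>%
  summarise(Count = any(tags %in% "A") & any(tags %in% "C"))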
