Compare each row in two dataframes in R - r

I have 2 data frames with account numbers and amounts plus some other irrelevant columns. I would like to compare the output with a Y or N if they match or not.
I need to compare the account number in row 1 in dataframe A to the account number in row 1 in dataframe B and if they match put a Y in a column or an N if they don't. I've managed to get the code to check if there is a match in the entire dataframe but I need to check each row individually.
E.g.
df1
|account.num|x1|x2|x3|
|100|a|b|c|
|101|a|b|c|
|102|a|b|c|
|103|a|b|c|
df2
|account.num|x1|x2|x3|
|100|a|b|c|
|102|a|b|c|
|101|a|b|c|
|103|a|b|c|
output
|account.num|x1|x2|x3|match|
|100|a|b|c|Y|
|101|a|b|c|N|
|102|a|b|c|N|
|103|a|b|c|Y|
So, row 1 matches because both frames have the same account number there, but row 2 doesn't because the account numbers differ. The other data in the dataframe doesn't matter, only that column. Can I do this without merging the data frames? (I did have tables, but they won't work. I don't know why. So sorry if that's hard to follow).

You can use == to test row by row whether account.num is equal, and use the resulting logical vector to index into c("N", "Y"):
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
# account.num x1 x2 x3 match
#1 100 a b c Y
#2 101 a b c N
#3 102 a b c N
#4 103 a b c Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
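The indexing works because TRUE and FALSE coerce to 1 and 0, so adding 1 selects position 2 ("Y") or position 1 ("N"). A possibly more readable spelling of the same idea uses ifelse(). The sketch below reuses the sample data from this answer; the row-count check is an added assumption, since a positional comparison only makes sense when both frames have the same number of rows.

```r
df1 <- data.frame(account.num = 100:103, x1 = "a", x2 = "b", x3 = "c")
df2 <- data.frame(account.num = c(100, 102, 101, 103), x1 = "a", x2 = "b", x3 = "c")

# Assumption: rows are compared by position, so the frames must be equally long
stopifnot(nrow(df1) == nrow(df2))

# "Y" where the account numbers in the same row agree, "N" otherwise
df1$match <- ifelse(df1$account.num == df2$account.num, "Y", "N")
df1$match
# [1] "Y" "N" "N" "Y"
```

Note that a missing account number in either frame would give NA rather than "N" with this approach.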

If you want a base R solution, here is a quick sketch. Assuming both dataframes have the same number of rows, it should work with your data.
# example dataframes
a <- data.frame(A=c(1,2,3), B=c("one","two","three"))
b <- data.frame(A=c(3,2,1), B=c("three","two","one"))
res <- logical(nrow(a)) # preallocate the result vector
for (rownum in seq_len(nrow(a))) {
  # compare every column of row `rownum` in a with the same row in b
  res[rownum] <- all(a[rownum,]==b[rownum,])
}
res # result vector
# [1] FALSE TRUE FALSE
# you can put it in frame a like this. example colname is "equalB"
a$equalB <- res
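The same row-by-row test can also be written without a loop. This is a sketch under the same assumptions as above (equal dimensions, matching column order, no missing values): comparing two data frames with == gives a logical matrix, and a row matches when all of its cells are TRUE.

```r
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"))
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"))

# a == b has one logical cell per (row, column) pair;
# a row is a full match when its count of TRUE cells equals the column count
res <- unname(rowSums(a == b) == ncol(a))
res
# [1] FALSE  TRUE FALSE
```

Like the loop, this compares every column. To compare a single column only, restrict both frames to that column first.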

If you want a tidyverse solution, you can use left_join.
The principle here is to try to match the data from df2 to the data from df1. Where a row matches, the join adds TRUE in a match column. The code then replaces the remaining NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)
df1 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
101, "a", "b", "c",
102, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column() # row position matters for the comparison,
# so store it explicitly in a `rowid` column
df2 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
102, "a", "b", "c",
101, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column()
# take df1
df1 %>%
# try to match df1 with version of df2 with a new column where `match` = TRUE
# according to `rowid`, `account_num`, `x1`, `x2`, and `x3`
left_join(df2 %>%
tibble::add_column(match = TRUE),
by = c("rowid", "account_num", "x1", "x2", "x3")
) %>%
# replace the NA in `match` with FALSE in the df
replace_na(list(match = FALSE))

Related

In R, subset a dataframe on rows whose ID appears more than once [duplicate]

Background
I have a dataframe d with ~10,000 rows and n columns, one of which is an ID variable. Most ID's appear once, but some appear more than once. Say that it looks like this:
Problem
I'd like a new dataframe d_sub which only contains ID's that appear more than once in d. I'd like to have something that looks like this:
What I've tried
I've tried something like this:
d_sub <- subset(d, duplicated(d$ID))
But that only gets me one entry for ID's b and d, and I want each of their respective rows:
Any thoughts?
duplicated() on its own returns FALSE for the first occurrence of each ID, so combine it with a second call using fromLast = TRUE, joined with |:
d_sub <- subset(d, duplicated(ID)|duplicated(ID, fromLast = TRUE))
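To see why both calls are needed, here is a small sketch on made-up data (the IDs and the gender column are illustrative, not the asker's real data). The forward scan flags every occurrence after the first, the backward scan flags every occurrence before the last, and their union flags all rows whose ID occurs more than once.

```r
d <- data.frame(ID     = c("a", "b", "b", "c", "d", "d"),
                gender = c("f", "f", "m", "f", "f", "m"))

fwd <- duplicated(d$ID)                   # FALSE FALSE  TRUE FALSE FALSE  TRUE
bwd <- duplicated(d$ID, fromLast = TRUE)  # FALSE  TRUE FALSE FALSE  TRUE FALSE

# keep a row if it is flagged by either scan
d_sub <- d[fwd | bwd, ]
d_sub$ID
# [1] "b" "b" "d" "d"
```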
We could use add_count, then filter on n:
library(dplyr)
df %>%
add_count(ID) %>%
filter(n!=1) %>%
select(-n)
Example:
library(dplyr)
df <- tribble(
~ID, ~gender, ~zip,
"a", "f", 1,
"b", "f", NA,
"b", "m", 2,
"c", "f", 3,
"d", "f", NA,
"d", "m", 4)
df %>%
add_count(ID) %>%
filter(n!=1) %>%
select(-n)
Output:
ID gender zip
<chr> <chr> <dbl>
1 b f NA
2 b m 2
3 d f NA
4 d m 4

remove rows if values exists with the same combination in different columns

I have 410 DNA sequences that I have compared against each other to get their similarity.
Now, to trim the database, I need to get rid of rows that contain the same pair of values in the two sequence columns in swapped order, because every comparison appears twice.
To make myself clear, I have something like
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"b", "a", 100.000,
"c", "d", 99.000,
"d", "c", 99.000,
)
Comparing a-b and b-a is the same thing, so I want to get rid of the duplicated comparison.
What I want to end up with is
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"c", "d", 99.000
)
I am not sure how to proceed; all the ways I thought of are kind of hacky. I checked other answers, but they don't really satisfy me.
Any input would be greatly appreciated (but tidy inputs are even more appreciated!)
We can use pmin and pmax to sort the values and then use distinct to select unique rows.
library(dplyr)
df %>%
mutate(col1 = pmin(seq01, seq02),
col2 = pmax(seq01, seq02), .before = 1) %>%
distinct(col1, col2, similarity)
# col1 col2 similarity
# <chr> <chr> <dbl>
#1 a b 100
#2 c d 99
Another, base R, approach:
df$add1 <- apply(df[,1:2], 1, min) # find rowwise minimum values
df$add2 <- apply(df[,1:2], 1, max) # find rowwise maximum values
df <- df[!duplicated(df[,4:5]),] # remove rows with identical values in new col's
df[,4:5] <- NULL # remove auxiliary col's
Result:
df
# A tibble: 2 x 3
seq01 seq02 similarity
<chr> <chr> <dbl>
1 a b 100
2 c d 99
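The two ideas can also be combined in base R without apply() or helper columns. This is a sketch that assumes, as in the question, that a swapped pair always describes the same comparison: pmin() and pmax() work element-wise on character vectors, so they give an order-independent key for each row.

```r
df <- data.frame(seq01      = c("a", "b", "c", "d"),
                 seq02      = c("b", "a", "d", "c"),
                 similarity = c(100, 100, 99, 99))

# order-independent key: the alphabetically smaller and larger member of each pair
lo <- pmin(df$seq01, df$seq02)
hi <- pmax(df$seq01, df$seq02)

# keep the first row seen for each unordered pair
dedup <- df[!duplicated(data.frame(lo, hi)), ]
dedup
```

Unlike the distinct() version, this keeps the original seq01 and seq02 values of the first row seen for each pair instead of the sorted key columns.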

Add a row to a dataframe that repeats a row and replaces 2 entries

I want to add rows to a dataframe (or tibble) as part of a data entry project. I need to:
1. Find one row that holds a specific value in one column (obsid).
2. Duplicate that row, but replace the value in column "word".
3. Append the new row to the dataframe.
I want to write a function that makes this easy. When I run the function, it doesn't add the new row to df: I can print the result inside the function, but the original dataframe is unchanged.
If I run the same steps outside a function, it works.
Why won't the function add the row?
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
print(rowtoadd)
print(filter(df, df$obsid== id))}
addrow("a", "xxx")
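Assigning to df inside a function creates a local copy; it does not change the df in your global environment. Return the modified dataframe from the function (for example with return()) and assign the result back, e.g. df <- addrow("a", "xxx").
Change your function to: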
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
return(df)
}
> addrow("a", "xxx")
# A tibble: 5 x 4
obsid b word main
<chr> <chr> <chr> <dbl>
1 a a what 1
2 b a is 1
3 c b the 1
4 d b answer 1
5 a a xxx 0
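A variant that may be easier to reuse passes the dataframe in as an argument instead of reading df from the global environment. This is only a sketch: the name add_row_copy and its data parameter are hypothetical, chosen so as not to clash with tibble::add_row. The important part is the last line, where the returned copy is assigned back to df.

```r
library(dplyr)

df <- tibble(obsid = c("a", "b", "c", "d"),
             b     = c("a", "a", "b", "b"),
             word  = c("what", "is", "the", "answer"),
             main  = 1)

# hypothetical helper: returns `data` plus a copy of the main row for `id`,
# with `word` replaced by `newword` and `main` set to 0
add_row_copy <- function(data, id, newword) {
  rowtoadd <- data %>%
    filter(obsid == id, main == 1) %>%
    mutate(word = newword, main = 0)
  bind_rows(data, rowtoadd)
}

# the function returns a new object, so the result must be assigned back
df <- add_row_copy(df, "a", "xxx")
```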

Looking for a function to impute missing values according to the ratio of other values in R (looking for probability-based)

I have a data frame with over 9000 data points, and 3 columns each have nearly 1000 missing values. I am trying to write a function that imputes them according to the proportions of the other values in the column. I am not looking for the most-common-value method (mode). Having found about 1000 missing values in each column, I tried to fill them with values from the other rows of that column in proportion to how often each value occurs. For example, if a column has "a" and "b" in a 50:50 ratio and 4 missing values, I would fill those missing values according to the ratio, so 2 "a" and 2 "b".
I don't completely understand the question but here are some things to try.
You can tabulate the values of b, including NAs
library(tidyverse)
a = c(1:12)
b = c("a", "a", "b", "c",
"a", "c", "b", NA,
"b", "c", "a", "a")
df = tibble(a = a, b = b)
df %>%
group_by(b) %>%
summarise(n())
Or, using table
table(b, useNA = 'always')
#a b c <NA>
#5 3 3 1
To replace the missing values with the most common non-missing value (the mode):
tab <- table(b)
replacement <- names(which.max(tab))
df %>%
mutate(b = if_else(is.na(b), replacement, b))
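For the probability-based fill the question asks about, one possible sketch is to draw the replacements at random, weighted by the observed frequencies. The function name impute_by_proportion is hypothetical, and set.seed() is there only to make the illustration reproducible. Random draws match the observed ratio on average, not exactly, so 4 missing values in a 50:50 column will not always come out as exactly 2 and 2.

```r
set.seed(42)  # only for a reproducible illustration

b <- c("a", "a", "b", "c",
       "a", "c", "b", NA,
       "b", "c", "a", "a")

# hypothetical helper: fill NAs by sampling the observed values,
# weighted by how often each value occurs among the non-missing entries
impute_by_proportion <- function(x) {
  missing <- is.na(x)
  tab <- table(x[!missing])
  if (any(missing) && length(tab) > 0) {
    x[missing] <- sample(names(tab), size = sum(missing),
                         replace = TRUE, prob = as.numeric(tab) / sum(tab))
  }
  x
}

b_filled <- impute_by_proportion(b)
```

To apply it to several columns of a data frame, something like df[cols] <- lapply(df[cols], impute_by_proportion) should work for character columns, where cols is a vector of column names.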

Merge two data sets by character/factor values; keep smaller data set

I have a data set A with a column of character values (factors), and each value appears multiple times. I also have a cleaned duplicate of that set (A') with fewer variables and observations. What I am trying to do now is merge them in a way that keeps only the rows (records) of the smaller set A'.
I already tried right-joining them, but I ran into problems, which I think are because I'm operating on character values.
Info<-c("x","x","x", "y","y","y","z","z","z")
More_info<-c("A", "A","A", "B", "B", "B", "C", "C","C")
Group_A<-cbind(Info, More_info)
vec1<-c("A","B","C")
vec2<-c("one","two","three")
Group_B<-cbind(vec1, vec2)
names(Group_B)<-c("More_Info", "Extra_Info")
x<-right_join(Group_A, Group_B, by= "More_Info")
What I get is:
Error in UseMethod("right_join") :
no applicable method for 'right_join' applied to an object of class "c('matrix', 'character')"
What I need:
-x-
Info More_Info
A X
B Y
C Z
You can use merge.
Info <- c("x","x","x", "y","y","y","z","z","z")
More_info <- c("A", "A","A", "B", "B", "B", "C", "C","C")
Group_A <- cbind(Info, More_info)
vec1 <- c("A","B","C")
vec2 <- c("one","two","three")
Group_B <- cbind(vec1, vec2)
# Use colnames to change column names, not names
# Also the 'i' in 'More_Info' should be lower case
colnames(Group_B) <- c("More_info", "Extra_Info")
# Take the unique values of A
merge(unique(Group_A), Group_B, by = "More_info", all.x = FALSE, all.y = TRUE)
#> More_info Info Extra_Info
#> 1 A x one
#> 2 B y two
#> 3 C z three
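The error in the question comes from cbind() returning a character matrix, for which right_join() has no method; character values as such are not the problem. If you prefer to stay with dplyr, a sketch along these lines should work once both objects are built as data frames with a shared key column name (More_info here, matching the fix above):

```r
library(dplyr)

# build data frames directly instead of matrices via cbind()
Group_A <- data.frame(Info      = c("x", "x", "x", "y", "y", "y", "z", "z", "z"),
                      More_info = c("A", "A", "A", "B", "B", "B", "C", "C", "C"))
Group_B <- data.frame(More_info  = c("A", "B", "C"),
                      Extra_Info = c("one", "two", "three"))

# drop the repeated rows of Group_A, then keep every row of Group_B
x <- right_join(distinct(Group_A), Group_B, by = "More_info")
x
```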
