R Compare duplicate values for each row in two data sets - r

I want to check, row by row, whether the values in two data frames are the same.
The duplicated and all_equal functions are not suitable in this case.
Reproducible Sample Data
df1 <- data.frame(a=c(1,2,3),b=c(4,5,6))
df2 <- data.frame(a=c(1,2,4),b=c(4,5,6))
> df1
a b
1 1 4
2 2 5
3 3 6
> df2
a b
1 1 4
2 2 5
3 4 6
Expected output
final <- data.frame(a=c(1,2,4),b=c(4,5,6),c=c('T','T','F'))
# Column c is the result I need: whether the values in each row are the same.
>final
a b c
1 1 4 T
2 2 5 T
3 4 6 F
I tried the method below, but it is complicated.
#1. making idx of df1, df2
#2. and full_join
#3. and left_join df1
#4. and left_join df2
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df3<-full_join(df1,df2,by=c('a','b'))
df3<-left_join(df3,df1,by=c('a','b'))
df3<-left_join(df3,df2,by=c('a','b')) #This may or may not work..
I think there must be a better way. Any help is appreciated!

We could use
df2$c <- Reduce(`&`, Map(`==`, df1, df2))
-output
> df2
a b c
1 1 4 TRUE
2 2 5 TRUE
3 4 6 FALSE
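To see what the one-liner above does, the two steps can be run separately. This is only an illustrative sketch on the sample data from the question; the intermediate names per_column and same_row are made up for the example.

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- data.frame(a = c(1, 2, 4), b = c(4, 5, 6))

# Map() walks over the columns of both data frames in parallel and
# returns one logical vector per column pair.
per_column <- Map(`==`, df1, df2)
per_column$a  # TRUE TRUE FALSE
per_column$b  # TRUE TRUE TRUE

# Reduce() folds those vectors together with `&`, so a row is TRUE
# only when every column matched.
same_row <- Reduce(`&`, per_column)
same_row      # TRUE TRUE FALSE
```

Because the comparison is done column by column, it assumes both data frames have the same columns in the same order and the same number of rows.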

You can get column 'c' by:
c <- df1$a == df2$a & df1$b == df2$b
gives TRUE TRUE FALSE. It looks like you want to then bind this to df2, so
cbind.data.frame(df2, c)

You may use rowSums -
final <- df2
final$c <- rowSums(df1 != df2) == 0
final
# a b c
#1 1 4 TRUE
#2 2 5 TRUE
#3 4 6 FALSE

In case the positions of the rows in each data.frame do not matter you can use merge.
within(merge(df2, within(df1, c <- "T"), all.x=TRUE), c[is.na(c)] <- "F")
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
or using duplicated.
df2$c <- c("F", "T")[1+tail(duplicated(rbind(df1, df2)), nrow(df2))]
df2
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
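One thing to keep in mind with duplicated(rbind(df1, df2)) is that a row of df2 that repeats an earlier row of df2 is also flagged, even if it never occurs in df1. A possible alternative, sketched here under the assumption that the separator "\r" does not appear in the data, is to build one string key per row and test membership with %in%. The names key1 and key2 are illustrative only.

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- data.frame(a = c(1, 2, 4), b = c(4, 5, 6))

# Build one string key per row; "\r" is an arbitrary separator that is
# assumed not to occur in the data.
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# TRUE when the df2 row occurs anywhere in df1, regardless of position.
df2$c <- key2 %in% key1
df2
```

On the sample data this marks the first two rows of df2 as TRUE and the third as FALSE.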

Related

R Compare duplicate values for each row in two data sets + in any order

I want to check whether the set of values in each row of one data frame also occurs in the other, in any order.
The duplicated and all_equal functions are not suitable in this case.
Reproducible Sample Data
a=c(1,1)
b=c(2,2)
c=c(3,3)
d=c(4,5)
df1<-rbind(a,b,c)
df1<-df1 %>% as.data.frame()
df2<-rbind(a,d,b)
df2<-df2 %>% as.data.frame()
> df1
V1 V2
a 1 1
b 2 2
c 3 3
> df2
V1 V2
a 1 1
d 4 5
b 2 2
Expected output
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df1
df2
df3<-full_join(df1,df2,by=c('V1','V2'))
df3
df3$need <- ifelse(is.na(df3$idx2), 'only_df1',
ifelse(is.na(df3$idx1), 'only_df2',
'duplicated'))
> df3
V1 V2 idx1 idx2 need
1 1 1 1 1 duplicated
2 2 2 2 3 duplicated
3 3 3 3 NA only_df1
4 4 5 NA 2 only_df2
I tried the approach above, but it is complicated.
I think there must be a better way. Any help is appreciated!
Since you are already using dplyr, you may use case_when, which is easier to read and write, especially when you have a lot of conditions.
library(dplyr)
full_join(df1,df2,by=c('V1','V2')) %>%
mutate(need = case_when(is.na(idx2) ~ 'only_df1',
is.na(idx1) ~ 'only_df2',
TRUE ~ 'duplicated'))
# V1 V2 idx1 idx2 need
#1 1 1 1 1 duplicated
#2 2 2 2 3 duplicated
#3 3 3 3 NA only_df1
#4 4 5 NA 2 only_df2
As already mentioned in the comments, your way looks ok. In case you want to see how it could be done in base:
a <- c(1,1)
b <- c(2,2)
c <- c(3,3) #Better to avoid names of existing functions such as c
d <- c(4,5)
df1 <- as.data.frame(rbind(a,b,c))
df2 <- as.data.frame(rbind(a,d,b))
df1$idx1 <- seq_len(nrow(df1)) #seq_len will also work in case nrow=0
df2$idx2 <- seq_len(nrow(df2))
df3 <- merge(df1, df2, all=TRUE)
df3$need <- ifelse(is.na(df3$idx2), "only_df1",
ifelse(is.na(df3$idx1), "only_df2",
"duplicated"))
df3
# V1 V2 idx1 idx2 need
#1 1 1 1 1 duplicated
#2 2 2 2 3 duplicated
#3 3 3 3 NA only_df1
#4 4 5 NA 2 only_df2
We can use
library(arsenal)
summary(comparedf(df1, df2))

Remove rows based on decimal places in column

I have a data frame that looks something like this:
A B C D
1 1.0 2 4
2 3.1 2 3
3 4.01 3 3
4 5.00 4 5
5 2.003 3 9
I want to delete rows where column B has numeric values other than 0 after the decimal. In the example above, this will leave me with rows 1 and 4. How do I go about this?
Try subset with any of the conditions below. Each one keeps only the rows where B is a whole number:
subset(df, B == floor(B))
subset(df, B == ceiling(B))
subset(df, B == round(B))
subset(df, B == trunc(B))
We can also use
subset(df, B == as.integer(B))
Assuming your dataframe is called df:
df[as.integer(df$B) == as.numeric(df$B),]
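The exact comparisons above work when B holds values that were typed in or read from a file. If B is the result of arithmetic, a value that should be whole can be off by a tiny rounding error, so a tolerance may be safer. The following is a minimal sketch using the data from the question; the tolerance tol is an assumed value, not something from the original answers.

```r
# Data from the question
df <- data.frame(A = 1:5,
                 B = c(1.0, 3.1, 4.01, 5.00, 2.003),
                 C = c(2, 2, 3, 4, 3),
                 D = c(4, 3, 3, 5, 9))

# Assumed tolerance: differences smaller than this count as "whole".
tol <- 1e-8

# Keep rows whose B is within tol of the nearest whole number.
whole <- abs(df$B - round(df$B)) < tol
result <- df[whole, ]
result
```

On this data the sketch keeps rows 1 and 4, the same rows the question asks for.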
A regex solution in base R:
# !grepl() also works when no row matches; -which() would then drop every row
df[!grepl("\\.[0-9]*[1-9]$", df$B),]
A B C
2 2 1 2
5 5 357 3
A regex solution in dplyr:
library(dplyr)
df %>%
filter(!grepl("\\.[0-9]*[1-9]$", B))
Data:
df <- data.frame(
A = 1:5,
B = c(1.2, 1.0, 1.00004, 2.806, 357.0),
C = c(2,2,3,4,3)
)

Merge the rows with least abundance

I would like to merge the rows whose share of the total is below a specific value. For example:
ID A B C
Apple 1 1 1
Banana 2 2 2
Cherry 3 3 3
Dates 4 4 4
For Apple, the total amount in A, B and C is 3, which is 10% (3/30*100%=10%) in total.
I would like to merge the rows with amount lower than 20% in total into a "Others" row, like:
ID A B C
Cherry 3 3 3
Dates 4 4 4
Others 3 3 3
How can I write a function to achieve this?
Any suggestion or idea is appreciated
I'd do it like this:
## Your original data
df <- read.table(text="ID A B C
Apple 1 1 1
Banana 2 2 2
Cherry 3 3 3
Dates 4 4 4" ,stringsAsFactors = FALSE)
names(df) <- df[1,] ## adding column names
df <- df[-1,] ## removing the header row
df[,-1] <- lapply(df[,-1], as.numeric) ## converting to numeric
rownames(df) <- df[,1] ## adding rownames
df <- df[,-1] ## removing the header column
df$tots <- apply(df, 1, sum)
df$proportion <- df$tots/sum(df$tots)
df <- rbind(df[which(df$proportion >= 0.2), ],
Others=apply(df[which(df$proportion < 0.2), ], 2, sum))
df <- subset(df, select = -c(tots, proportion))
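The result:
> df
A B C
Banana 2 2 2
Cherry 3 3 3
Dates 4 4 4
Others 1 1 1
Note that Banana, at exactly 20%, is kept here because the code uses >= 0.2. To merge it into "Others" as in the expected output, use > 0.2 for the rows to keep and <= 0.2 for the rows to merge.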
One option is to build a logical index: divide the rowSums of the numeric columns by the grand total and check whether the share is less than or equal to 0.2. Then set 'ID' to "Others" for those rows (assuming the 'ID' column is of class character) and aggregate the columns by 'ID' to get the sums.
i1 <- rowSums(df1[-1])/sum(as.matrix(df1[-1])) <= 0.2
df1$ID[i1] <- "Others"
aggregate(.~ ID, df1, sum)
# ID A B C
#1 Cherry 3 3 3
#2 Dates 4 4 4
#3 Others 3 3 3
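Since the question asks for a function, the aggregate approach above could be wrapped as follows. This is a sketch only: the function name merge_rare and its arguments id and threshold are hypothetical names chosen for the example, and it assumes every column other than the ID column is numeric.

```r
# Hypothetical helper: merge rows whose share of the grand total is at or
# below `threshold` into a single "Others" row.
merge_rare <- function(df, id = "ID", threshold = 0.2) {
  value_cols <- setdiff(names(df), id)            # all non-ID columns
  share <- rowSums(df[value_cols]) / sum(as.matrix(df[value_cols]))
  df[[id]] <- as.character(df[[id]])              # guard against factor IDs
  df[[id]][share <= threshold] <- "Others"
  aggregate(df[value_cols], by = df[id], FUN = sum)
}

df1 <- data.frame(ID = c("Apple", "Banana", "Cherry", "Dates"),
                  A = 1:4, B = 1:4, C = 1:4)
res <- merge_rare(df1)
res
```

With the default threshold of 0.2, Apple (10%) and Banana (20%) are merged, which gives the Cherry, Dates and Others rows from the question. Passing a different threshold changes the cut-off without touching the function body.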

Replace values from dataframe where vector values match indexes in another dataframe

Perhaps the question and answer are already posted, but I can't find it. Besides, is there any optimal approach to this problem?
Because this is just an example of some rows, but I'll apply it to a data frame of about 1 million rows.
I'm kind of new to R.
I have two data frames
DF1:
a b
1 1 0
2 2 0
3 2 0
4 3 0
5 5 0
and
DF2
l
1 A
2 B
3 C
4 D
5 E
What I am trying to do is match the values in DF1$a with the row indexes of DF2 and assign the matching DF2 values to DF1$b, so that my result looks like this:
DF1:
a b
1 1 A
2 2 B
3 2 B
4 3 C
5 5 E
I've coded a for loop to do this, but it seems that I'm missing something
for(i in 1:length(df1$a)){
df1$b[i] <- df2$l[df1$a[i]]
}
Which throws the following result:
DF1:
a b
1 1 1
2 2 2
3 2 2
4 3 3
5 5 5
Thanks in advance :)
We can use merge to join the two data frames, matching the row ID of DF2 against column a of DF1.
# Create example data frame
DF1 <- data.frame(a = c(1, 2, 2, 3, 5))
DF2 <- data.frame(l = c("A", "B", "C", "D", "E"),
stringsAsFactors = FALSE)
# Create a column called a in DF2 that holds the row id
DF2$a <- row.names(DF2)
# Merge DF1 and DF2 by a
DF3 <- merge(DF1, DF2, by = "a", all.x = TRUE)
# Change the name of column l to be b
names(DF3) <- c("a", "b")
DF3
# a b
# 1 1 A
# 2 2 B
# 3 2 B
# 4 3 C
# 5 5 E
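Because DF1$a already contains the row numbers of DF2, plain vector indexing may be enough, and it avoids both the loop and the merge. The loop in the question most likely returned numbers because DF2$l was a factor (the default for strings in data.frame before R 4.0), so its integer codes were stored. The sketch below assumes l is a character column; if it is a factor, wrap it in as.character() first.

```r
DF1 <- data.frame(a = c(1, 2, 2, 3, 5))
DF2 <- data.frame(l = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# Use the values of a as positions into DF2$l; this is vectorised,
# so no explicit loop is needed.
DF1$b <- DF2$l[DF1$a]
DF1
```

Indexing keeps the original row order of DF1, whereas merge sorts the result by the key column by default.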

How to keep rows with the same values in two variables in r?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
df[!duplicated(df),]
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' on column a does not have a unique value in b, I want to drop both observations to get a new data.frame as this:
a b
4 B 4
7 C 5
I don't mind having repeated values across b, as long as every level of a maps to a single value in b.
Is there a way to do this? Thanks!
This one maybe?
ag <- aggregate(b~a, df, unique)
ag[lengths(ag$b)==1,]
# a b
#2 B 4
#3 C 5
Maybe something like this:
> ind <- apply(sapply(with(df, split(b,a)), diff), 2, function(x) all(x==0) )
> out <- df[!duplicated(df),]
> out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
Here is another option with data.table
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
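The sapply/diff answer above relies on every group having the same number of rows, because only then does sapply return a matrix that apply can process. A base R sketch that does not depend on group sizes is to count the distinct values of b per level of a with ave. The name n_distinct_b is illustrative only.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 3))
b <- c(1, 1, 2, 4, 4, 4, 5, 5, 5)
df <- data.frame(a, b)

# For every row, the number of distinct b values within its level of a.
n_distinct_b <- ave(df$b, df$a, FUN = function(x) length(unique(x)))

# Keep levels with exactly one distinct b, then drop the repeated rows.
out <- unique(df[n_distinct_b == 1, ])
out
```

On the sample data this keeps one row each for B and C and drops every row of A.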
