Check if one row is equal to any other row in R

I have a dataset with one ID column, 12 information columns (strings) and n rows. It looks like this:
ID Col1 Col2 Col3 Col4 Col5 ...
01 a b c d a
02 a a a a a
03 b b b b b
...
I need to go row by row and check if that row (considering all of its columns) is equal to any other row in the dataset. My output needs to be two new columns: one indicating whether that particular row is equal to any other row, and a second indicating which row it is equal to (when the first column is TRUE).
I appreciate any suggestions.

Using DF from the Note at the end, sort it and create a column dup indicating whether a prior duplicate row exists. Then set wx to the row number, in the original data frame, of that duplicate. Finally, re-sort back to the original order.
We have assumed that duplicate means that the columns other than the ID are the same, but that is readily changed if need be. We have also assumed that only the second and subsequent rows among duplicates should be marked; the first is not marked because, up to that point, it has no duplicate.
The question does not address the situation of more than two identical rows, but if that situation exists then each duplicate will point to the nearest prior row of which it is a duplicate.
o <- do.call("order", DF[-1])
DFo <- DF[o, ]
DFo$wx <- DFo$dup <- duplicated(DFo[-1]) # compare every column except ID, as assumed above
DFo$wx[DFo$dup] <- as.numeric(rownames(DFo))[which(DFo$dup) - 1]
DFo[order(o), ] # back to original order
giving:
ID Col1 Col2 Col3 Col4 Col5 dup wx
1 1 a b c d a FALSE 0
2 2 a a a a a FALSE 0
3 3 b b b b b FALSE 0
4 1 a b c d a TRUE 1
Note
Lines <- "ID Col1 Col2 Col3 Col4 Col5
01 a b c d a
02 a a a a a
03 b b b b b"
DF <- read.table(text = Lines, header = TRUE)
DF <- DF[c(1:3, 1), ]
rownames(DF) <- NULL
giving:
> DF
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 1 a b c d a
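To see the "nearest prior row" behaviour with more than two identical rows, here is a minimal sketch on a made-up four-row data frame. DF3 and its values are assumptions for illustration only, not data from the question. It calls duplicated() on the non-ID columns so that ID is ignored.

```r
# Hypothetical data: rows 1, 3 and 4 are identical apart from nothing at all,
# row 2 is different.
DF3 <- data.frame(ID   = c(1, 2, 1, 1),
                  Col1 = c("a", "b", "a", "a"),
                  Col2 = c("x", "y", "x", "x"))

o   <- do.call("order", DF3[-1])      # sort order on the non-ID columns
DFo <- DF3[o, ]                       # identical rows are now adjacent
DFo$wx <- DFo$dup <- duplicated(DFo[-1])
# for each duplicate, take the original row number of the row just before it
DFo$wx[DFo$dup] <- as.numeric(rownames(DFo))[which(DFo$dup) - 1]
res <- DFo[order(o), ]                # back to original order

res$dup  # FALSE FALSE  TRUE  TRUE
res$wx   # 0 0 1 3
```

Row 3 points to row 1 and row 4 points to row 3, because order() keeps tied rows in their original relative order.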

With a df like below:
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 3 b b b b b
You could try grouping by all columns and checking whether any count > 1 as well as pasting together row numbers (1:nrow(df)):
df <- transform(
df,
dupe = ave(ID, mget(names(df)), FUN = length) > 1,
dupeRows = ave(1:nrow(df), mget(names(df)), FUN = toString)
)
Because this also lists each row's own number, even when there are no duplicates, you can strip it out with:
df$dupeRows <- with(df,
Map(function(x, y)
toString(x[x != y]),
strsplit(as.character(dupeRows), split = ', '),
1:nrow(df)))
Output:
ID Col1 Col2 Col3 Col4 Col5 dupe dupeRows
1 1 a b c d a FALSE
2 2 a a a a a FALSE
3 3 b b b b b TRUE 4
4 3 b b b b b TRUE 3
Data
df <- structure(list(ID = c(1L, 2L, 3L, 3L), Col1 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col2 = structure(c(2L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col3 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "c"), class = "factor"), Col4 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "d"), class = "factor"), Col5 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor")), row.names = c(NA,
-4L), class = "data.frame")
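As a cross-check, the same two columns can be sketched without ave(). This is only an alternative under assumptions: the three-column df below is a cut-down stand-in for the data above, and key and groups are helper names introduced here. duplicated() is run forwards and backwards so that every member of a duplicate set is flagged, not only the later ones.

```r
# Cut-down stand-in for the data above (assumption for illustration)
df <- data.frame(ID   = c(1L, 2L, 3L, 3L),
                 Col1 = c("a", "a", "b", "b"),
                 Col2 = c("b", "a", "b", "b"))

# TRUE for every row that has at least one identical partner
dupe <- duplicated(df) | duplicated(df, fromLast = TRUE)

# one string key per row, then the row numbers sharing each key
key    <- do.call(paste, c(df, sep = "\r"))
groups <- split(seq_len(nrow(df)), key)

# partner rows = rows with the same key, minus the row itself
dupeRows <- vapply(seq_len(nrow(df)),
                   function(i) toString(setdiff(groups[[key[i]]], i)),
                   character(1))

dupe      # FALSE FALSE  TRUE  TRUE
dupeRows  # ""  ""  "4"  "3"
```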

A dplyr solution
library(dplyr)
df %>%
mutate(row_num = 1:n(), is_dup = duplicated(df)) %>%
group_by(across(-c(row_num, is_dup))) %>%
mutate(
has_copies = n() > 1L,
which_row = if_else(is_dup, first(row_num), NA_integer_),
row_num = NULL, is_dup = NULL
)
Output
# A tibble: 5 x 8
# Groups: ID, Col1, Col2, Col3, Col4, Col5 [3]
ID Col1 Col2 Col3 Col4 Col5 has_copies which_row
<chr> <fct> <fct> <fct> <fct> <fct> <lgl> <int>
1 1 a b c d a FALSE NA
2 2 a a a a a FALSE NA
3 3 b b b b b TRUE NA
4 3 b b b b b TRUE 3
5 3 b b b b b TRUE 3
For each row that has more than one copy, has_copies is TRUE.
For a set of rows that are the same, I consider the first one as the original and all other rows as duplicates. Accordingly, which_row gives you the index of the original for each duplicate found. In other words, if a row has no duplicate or is itself the original, which_row is NA.
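The "first row is the original" rule can also be sketched in base R with match(), which returns the position of the first occurrence of each value. This is an illustration under assumptions, not the dplyr code above: the small df and the helper name key are made up here.

```r
# Hypothetical five-row example: rows 3, 4 and 5 are identical
df <- data.frame(ID   = c("1", "2", "3", "3", "3"),
                 Col1 = c("a", "a", "b", "b", "b"),
                 Col2 = c("b", "a", "b", "b", "b"))

key        <- do.call(paste, c(df, sep = "\r"))  # one string per row
first_idx  <- match(key, key)                    # index of first identical row
is_dup     <- duplicated(key)                    # TRUE for 2nd, 3rd, ... copies
has_copies <- key %in% key[is_dup]               # TRUE for originals and copies
which_row  <- ifelse(is_dup, first_idx, NA_integer_)

has_copies  # FALSE FALSE  TRUE  TRUE  TRUE
which_row   # NA NA NA  3  3
```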

Related

Change subsequent row values to previous maximum value up to that point if subsequent values are lower than the previous one. Ignore NA's

For example it looks like this now:
Sample Col1 Col2 Col3 Col4 Col5
A 1 NA 2 1 3
B 1 2 NA 1 5
C 0 1 5 NA 3
I want it to look like this:
Sample Col1 Col2 Col3 Col4 Col5
A 1 NA 2 2 3
B 1 2 NA 2 5
C 0 1 5 NA 5
df1 <- df
df1[is.na(df1)] <- -Inf
df1[-1] <- matrixStats::rowCummaxs(as.matrix(df1[-1]))* NA^is.na(df[-1])
df1
Sample Col1 Col2 Col3 Col4 Col5
1 A 1 NA 2 2 3
2 B 1 2 NA 2 5
3 C 0 1 5 NA 5
or even:
df1 <- df
df1[is.na(df1)] <- -Inf
df1[-1] <- matrixStats::rowCummaxs(as.matrix(df1[-1]))
is.na(df1) <- is.na(df)
df1
Sample Col1 Col2 Col3 Col4 Col5
1 A 1 NA 2 2 3
2 B 1 2 NA 2 5
3 C 0 1 5 NA 5
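To make the -Inf and NA^is.na() tricks easier to follow, here is a single-row walk-through using base cummax() instead of matrixStats. The vector x is row A of the example; the names filled, cm and mask are introduced here for illustration.

```r
x <- c(1, NA, 2, 1, 3)          # row A of the example

filled <- x
filled[is.na(filled)] <- -Inf   # -Inf can never raise a running maximum
cm <- cummax(filled)            # 1 1 2 2 3

# NA^TRUE is NA and NA^FALSE is 1, so this is a multiplicative NA mask
mask <- NA^is.na(x)             # 1 NA 1 1 1

res <- cm * mask                # puts NA back where it originally was
res                             # 1 NA 2 2 3
```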
We may use cummax from base R. Loop over the rows of the numeric columns (df[-1]) with apply (MARGIN = 1), replace the non-NA elements of each row with their cumulative maximum, and assign the result back:
df[-1] <- t(apply(df[-1], 1, FUN = function(x) {
i1 <- !is.na(x)
x[i1] <- cummax(x[i1])
x}))
-output
> df
Sample Col1 Col2 Col3 Col4 Col5
1 A 1 NA 2 2 3
2 B 1 2 NA 2 5
3 C 0 1 5 NA 5
data
df <- structure(list(Sample = c("A", "B", "C"), Col1 = c(1L, 1L, 0L
), Col2 = c(NA, 2L, 1L), Col3 = c(2L, NA, 5L), Col4 = c(1L, 1L,
NA), Col5 = c(3L, 5L, 3L)), class = "data.frame", row.names = c(NA,
-3L))

How to isolate values in a dataframe based on a vector and multiply it by another column in the same dataframe using R?

I have a dataframe with multiple columns
col1|col2|col3|colA|colB|colC|Percent
1 1 1 2 2 2 50
Earlier I subset the columns and created a vector
ColAlphabet<-c("colA","colB","colC")
What I want to do is take the columns named in ColAlphabet and multiply them by Percent (as a percentage), so that in the end I have
col1|col2|col3|colA|colB|colC|Percent
1 1 1 1 1 1 50
We can use mutate with across. Specify the columns of interest wrapped in all_of and multiply them by Percent / 100:
library(dplyr)
df2 <- df1 %>%
mutate(across(all_of(ColAlphabet), ~ .* Percent/100))
-output
df2
# col1 col2 col3 colA colB colC Percent
#1 1 1 1 1 1 1 50
data
df1 <- structure(list(col1 = 1L, col2 = 1L, col3 = 1L, colA = 2L, colB = 2L,
colC = 2L, Percent = 50L), class = "data.frame", row.names = c(NA,
-1L))
You can subset the columns, multiply them by Percent / 100, and assign the result back to the same columns.
ColAlphabet<-c("colA","colB","colC")
df[ColAlphabet] <- df[ColAlphabet] * df$Percent/100
df
# col1 col2 col3 colA colB colC Percent
#1 1 1 1 1 1 1 50
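With more than one row, the same expression still works row by row, because a vector of length nrow(df) is recycled down each column. A small sketch on made-up two-row data (the values are assumptions, not from the question):

```r
# Hypothetical data with a different Percent in each row
df <- data.frame(col1    = c(1, 1),
                 colA    = c(2, 4),
                 colB    = c(2, 10),
                 Percent = c(50, 10))
ColAlphabet <- c("colA", "colB")

# each row of the selected columns is scaled by that row's Percent
df[ColAlphabet] <- df[ColAlphabet] * df$Percent / 100

df$colA  # 1.0 0.4
df$colB  # 1 1
```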
We can also use apply():
#Vector
ColAlphabet<-c("colA","colB","colC")
#Code
df[,ColAlphabet] <- apply(df[,ColAlphabet],2,function(x) x*df$Percent/100)
Output:
df
col1 col2 col3 colA colB colC Percent
1 1 1 1 1 1 1 50
Some data used:
#Data
df <- structure(list(col1 = 1L, col2 = 1L, col3 = 1L, colA = 2L, colB = 2L,
colC = 2L, Percent = 50L), class = "data.frame", row.names = c(NA,
-1L))
If you want to multiply directly, selecting the columns by a name pattern:
> df <- data.frame(col1 = 1, col2 = 1, col3 = 1, colA = 2, colB = 2, colC = 2, Percent = 50)
> df
col1 col2 col3 colA colB colC Percent
1 1 1 1 2 2 2 50
> df[grep('^c.*[A-Z]$', names(df))] <- df[grep('^c.*[A-Z]$', names(df))] * df$Percent/100
> df
col1 col2 col3 colA colB colC Percent
1 1 1 1 1 1 1 50

check if numbers in a column are ascending by a certain value (R dataframe)

I have a column of numbers (index) in a dataframe like the one below. I am attempting to check whether these numbers ascend in steps of 1 within each group. For example, groups B and C do not ascend by 1. While I can check by sight, my dataframe is thousands of rows long, so I'd prefer to automate this. Does anyone have advice? Thank you!
group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2
...
I think this works. diff calculates the difference between each pair of consecutive numbers, and then we can use all to check whether every difference is 1. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
group_by(group) %>%
summarize(Result = all(diff(index) == 1)) %>%
ungroup()
dat2
# # A tibble: 3 x 2
# group Result
# <chr> <lgl>
# 1 A TRUE
# 2 B FALSE
# 3 C FALSE
DATA
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2",
header = TRUE, stringsAsFactors = FALSE)
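To show what all(diff(index) == 1) does on each group, here is a small sketch with the check pulled out into a helper. The name steps_by_one is introduced here for illustration. Note the edge case: a group with a single row has no differences, and all() of an empty logical vector is TRUE.

```r
# TRUE when every consecutive difference is exactly 1
steps_by_one <- function(v) all(diff(v) == 1)

a <- steps_by_one(c(0, 1, 2, 3, 4))  # group A: TRUE
b <- steps_by_one(c(0, 1, 2, 2))     # group B: FALSE (repeated 2)
c_res <- steps_by_one(c(0, 3, 1, 2)) # group C: FALSE (out of order)
single <- steps_by_one(7)            # one value: TRUE, vacuously

c(a, b, c_res, single)               # TRUE FALSE FALSE TRUE
```

If a one-row group should not count as ascending, add a length check to the helper.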
Maybe aggregate could help
> aggregate(.~group,df1,function(v) all(diff(v)==1))
group index
1 A TRUE
2 B FALSE
3 C FALSE
We can group by group, take the difference between the current and previous value (using shift), and check whether all the differences are equal to 1.
library(data.table)
setDT(df1)[, .(Result = all((index - shift(index))[-1] == 1)), group]
# group Result
#1: A TRUE
#2: B FALSE
#3: C FALSE
data
df1 <- structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), index = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 2L, 0L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-13L))

Add a new column in R with presence/absence info

Hello, I have a dataframe such as:
Col1 Col2
A 23
B NA
C 21
D 2
E NA
F 9
and I would like to add a new Col3 with presence/absence info (1/0)
If the number in Col2 is >= 1, I put 1.
If it is NA, I put 0.
and get :
Col1 Col2 Col3
A 23 1
B NA 0
C 21 1
D 2 1
E NA 0
F 9 1
You could assign Col3 as 1 if Col2 is greater than or equal to 1 and is not NA.
df$Col3 <- +(df$Col2 >= 1 & !is.na(df$Col2))
df
# Col1 Col2 Col3
#1 A 23 1
#2 B NA 0
#3 C 21 1
#4 D 2 1
#5 E NA 0
#6 F 9 1
+ at the beginning converts logical values TRUE/FALSE to integer values 1/0.
data
df <- structure(list(Col1 = structure(1:6, .Label = c("A", "B", "C",
"D", "E", "F"), class = "factor"), Col2 = c(23L, NA, 21L, 2L,
NA, 9L)), class = "data.frame", row.names = c(NA, -6L))
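Step by step, the one-liner behaves as sketched below. The intermediate names cmp and flag are introduced here for illustration. The comparison alone leaves NA for missing values, and combining it with !is.na() turns those into FALSE because NA & FALSE is FALSE.

```r
Col2 <- c(23L, NA, 21L, 2L, NA, 9L)

cmp  <- Col2 >= 1           # TRUE NA TRUE TRUE NA TRUE
flag <- cmp & !is.na(Col2)  # NA & FALSE is FALSE, so no NA remains
Col3 <- +flag               # unary plus turns logical into integer

Col3                        # 1 0 1 1 0 1
```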
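Another tidy way might be the following. Note that it assigns 1 to every non-NA value, so it matches the >= 1 rule only when Col2 has no values below 1.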
library(dplyr)
mutate(df,
Col3 = ifelse(Col2 %in% NA,0,1)
)
We can use dplyr
library(dplyr)
df %>%
mutate(Col3 = as.integer(Col2 >=1 & !is.na(Col2)))
# Col1 Col2 Col3
#1 A 23 1
#2 B NA 0
#3 C 21 1
#4 D 2 1
#5 E NA 0
#6 F 9 1

How to sum df when it contains characters?

I am trying to prep my data and I am stuck on one issue. Let's say I have the following data frame:
df1
Name C1 Val1
A a x1
A a x2
A b x3
A c x4
B d x5
B d x6
...
and I want to narrow down the df to
df2
Name C1 Val
A a,b,c x1+x2+x3+x4
B d x5+x6
...
where a, b, c and d are character values and x1 to x6 are numeric values.
I have been trying sapply, rowsum and
df2<- aggregate(df1, list(df1[,1]), FUN= summary)
but it just can't put the character values in a list for each Name.
Can someone help me get df2?
m <- function(x) if (is.numeric(x <- type.convert(x, as.is = TRUE))) sum(x) else toString(unique(x))
aggregate(.~Name,df1,m)
Name C1 Val1
1 A a, b, c 10
2 B d 11
where
df1
Name C1 Val1
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B d 5
6 B d 6
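To see what the helper does on each kind of column, here is an expanded copy under a different name. collapse_or_sum is a name introduced here for illustration, and as.is = TRUE is passed so that text stays character instead of becoming a factor. aggregate() calls the function once per column within each Name group.

```r
# Sum the values if they can be read as numbers,
# otherwise join the distinct values with ", "
collapse_or_sum <- function(x) {
  x <- type.convert(x, as.is = TRUE)
  if (is.numeric(x)) sum(x) else toString(unique(x))
}

num_res <- collapse_or_sum(c("1", "2", "3", "4"))  # 10
chr_res <- collapse_or_sum(c("a", "a", "b", "c"))  # "a, b, c"
```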
This is your df; I have filled Val1 with the numbers 1 to 6.
df <-
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), C1 = structure(c(1L, 1L, 2L, 3L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Val1 = 1:6), row.names = c(NA,
-6L), class = "data.frame")
We just use summarise:
library(dplyr)
df %>%
group_by(Name) %>%
summarise(C1=paste(unique(C1),collapse=","),Val1=sum(Val1))
# A tibble: 2 x 3
Name C1 Val1
<fct> <chr> <int>
1 A a,b,c 10
2 B d 11
Quick and easy dplyr solution:
library(dplyr)
library(stringr)
df1 %>%
mutate(Val1_num = as.numeric(str_extract(Val1, "\\d+"))) %>%
group_by(Name) %>%
summarise(C1 = paste(unique(C1), collapse = ","),
Val1 = paste(unique(Val1), collapse = "+"),
Val1_num = sum(Val1_num))
#> # A tibble: 2 x 4
#> Name C1 Val1 Val1_num
#> <chr> <chr> <chr> <dbl>
#> 1 A a,b,c x1+x2+x3+x4 10
#> 2 B d x5+x6 11
Or in base:
df2 <- aggregate(df1, list(df1[,1]), FUN = function(x) {
if (all(grepl("\\d", x))) {
sum(as.numeric(gsub("[^[:digit:]]", "", x)))
} else {
paste(unique(x), collapse = ",")
}
})
df2
#> Group.1 Name C1 Val1
#> 1 A A a,b,c 10
#> 2 B B d 11
data
df1 <- read.csv(text = "
Name,C1,Val1
A,a,x1
A,a,x2
A,b,x3
A,c,x4
B,d,x5
B,d,x6", stringsAsFactors = FALSE)

Resources