Find unique rows in a data frame in R [duplicate] - r

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 6 years ago.
I'd like to create a new data frame column that helps me quickly identify duplicate rows, based on the value of the first column of each row (index). Assuming that my data frame (df) has almost 18,000 rows (observations) and the new column is called "unique", I have tried the following, rather unsuccessfully...
df$unique = ifelse(df[row.names(df):1]==df[row.names(df)-1:1], "YES", "NO")
The rationale behind the code is that comparing each cell with the one in the row directly above it, in the same column, should identify unique entries whenever the two values do not match.
My dataframe
index num1 num2
1 12 12
1 12 12
2 14 14
2 14 14
2 14 14
3 18 18
4 19 19

You can use the duplicated function. Be aware that duplicated marks only the second and later occurrences of a repeated row, not the first one, hence we need to call it twice, searching from the beginning and from the end.
# Toy data, where the first two rows are identical, the third row is unique
df <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
# Find unique rows
df$unique <- !(duplicated(df) | duplicated(df, fromLast = TRUE))
Output:
> df
a b unique
1 1 1 FALSE
2 1 1 FALSE
3 1 2 TRUE
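Applied to the asker's own data, where uniqueness is judged on the index column alone, a minimal sketch (reconstructing df from the table above) might look like:

```r
# Reconstruct the asker's data frame from the table above
df <- data.frame(index = c(1, 1, 2, 2, 2, 3, 4),
                 num1  = c(12, 12, 14, 14, 14, 18, 19),
                 num2  = c(12, 12, 14, 14, 14, 18, 19))

# A row is "unique" only if its index value occurs exactly once,
# so flag duplicates in both directions on that single column
df$unique <- ifelse(!(duplicated(df$index) | duplicated(df$index, fromLast = TRUE)),
                    "YES", "NO")
df$unique
# [1] "NO"  "NO"  "NO"  "NO"  "NO"  "YES" "YES"
```

Only rows with index 3 and 4 are flagged "YES", since those index values appear exactly once.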


How to duplicate a row based on the presence of multiple values in a column in R [duplicate]

This question already has answers here:
Unlist data frame column preserving information from other column
(3 answers)
Closed 2 years ago.
I have a dataframe with phonetic transcriptions of words in a column called trans, and a column pos_num which records the position of the phoneme t in the transcription strings.
df <- data.frame(
  trans = c("ðət", "əˈpærəntli", "ˈkɒntrækt", "təˈwɔːdz", "pəˈteɪtəʊz"),
  stringsAsFactors = FALSE
)
df$pos_num <- sapply(strsplit(df$trans, ""), function(x) which(grepl("t", x)))
df
trans pos_num
1 ðət 3
2 əˈpærəntli 8
3 ˈkɒntrækt 5, 9
4 təˈwɔːdz 1
5 pəˈteɪtəʊz 4, 7
In some transcriptions, t occurs more than once, resulting in multiple values in pos_num. Where this is the case I would like to duplicate the entire row, with the original row containing one value and the duplicated row containing the other value. The desired output would be:
df
trans pos_num
1 ðət 3
2 əˈpærəntli 8
3 ˈkɒntrækt 5
4 ˈkɒntrækt 9
5 təˈwɔːdz 1
6 pəˈteɪtəʊz 4
7 pəˈteɪtəʊz 7
How can this be achieved? (There seem to be a few posts on that question for other programming languages but not R.)
library(data.table)
setDT(df)
df[, .(pos_num = unlist(pos_num)), by = trans]
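If you prefer to stay in base R, the same row duplication can be sketched with rep() and lengths(), assuming pos_num is the list column built with sapply as in the question:

```r
# Rebuild the question's data: pos_num is a list column because some
# transcriptions contain t more than once
df <- data.frame(
  trans = c("ðət", "əˈpærəntli", "ˈkɒntrækt", "təˈwɔːdz", "pəˈteɪtəʊz"),
  stringsAsFactors = FALSE
)
df$pos_num <- sapply(strsplit(df$trans, ""), function(x) which(grepl("t", x)))

# Repeat each transcription once per matched position, then flatten the list
out <- data.frame(
  trans   = rep(df$trans, lengths(df$pos_num)),
  pos_num = unlist(df$pos_num)
)
out
```

This yields seven rows, with ˈkɒntrækt and pəˈteɪtəʊz each appearing twice, matching the desired output.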

Is there a good way to compare 2 data tables, comparing the data in row i of one to the data in row i+1 of the second data table? [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 2 years ago.
I have tried various functions, including compare and all.equal, but I am having difficulty finding a test that tells me whether two rows are the same.
For context, I have a data.frame which in some cases contains a duplicate result. I have tried copying the data.frame so I can compare it with itself. I would like to remove the duplicates.
One approach I considered was to subtract row A of dataframe 1 from row B of dataframe 2: if the difference is zero, I planned to remove one of them.
Is there an approach I can use to do this without copying my data?
Any help would be great, I'm new to R coding.
Suppose I had a data.frame named data:
data
Col1 Col2
A 1 3
B 2 7
C 2 7
D 2 8
E 4 9
F 5 12
I can use the duplicated function to identify duplicated rows and not select them:
data[!duplicated(data),]
Col1 Col2
A 1 3
B 2 7
D 2 8
E 4 9
F 5 12
I can also perform the same action on a single column:
data[!duplicated(data$Col1),]
Col1 Col2
A 1 3
B 2 7
E 4 9
F 5 12
Sample Data
data <- data.frame(Col1 = c(1,2,2,2,4,5), Col2 = c(3,7,7,8,9,12))
rownames(data) <- LETTERS[1:6]
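If the dplyr package is available, distinct() gives the same results without manual subsetting; a sketch on the same sample data:

```r
library(dplyr)

data <- data.frame(Col1 = c(1, 2, 2, 2, 4, 5), Col2 = c(3, 7, 7, 8, 9, 12))

# Deduplicate on all columns (keeps the first occurrence of each pair)
distinct(data)

# Deduplicate on Col1 only, keeping the remaining columns
distinct(data, Col1, .keep_all = TRUE)
```

This avoids copying the data frame to compare it against itself, which is what the asker wanted to sidestep.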

How to sum a specific column of replicate rows in dataframe? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
How to group by two columns in R
(4 answers)
Closed 3 years ago.
I have a data frame which contains a lot of replicated rows. I would like to sum up the last column of the replicated rows and remove the replications at the same time. Could anyone tell me how to do that?
The example is here:
name <- c("a","b","c","a","c")
position <- c(192,7,6,192,99)
score <- c(1,2,3,2,5)
df <- data.frame(name,position,score)
> df
name position score
1 a 192 1
2 b 7 2
3 c 6 3
4 a 192 2
5 c 99 5
#I would like to sum the score together if the first two columns are the
#same. The ideal result is like this way
name position score
1 a 192 3
2 b 7 2
3 c 6 3
4 c 99 5
Sincerely thank you for the help.
Try this:
library(dplyr)
df %>%
  group_by(name, position) %>%
  summarise(score = sum(score, na.rm = TRUE))
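The same grouping-and-summing can be done in base R with aggregate(), avoiding the dplyr dependency:

```r
# The question's sample data
name <- c("a", "b", "c", "a", "c")
position <- c(192, 7, 6, 192, 99)
score <- c(1, 2, 3, 2, 5)
df <- data.frame(name, position, score)

# Sum score within each name/position combination;
# replicated rows collapse into one row per group
aggregate(score ~ name + position, data = df, FUN = sum)
```

Note that aggregate() orders the result by the grouping variables, so the rows may come back in a different order than in the ideal output shown above.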

How to split a text into a vector, where each entry corresponds to an index value assigned to each unique word? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 4 years ago.
Let's say I have a document with some text, like this, from SO:
doc <- 'Questions with similar titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'
I can then make a dataframe where every word has a row in a df:
library(stringi)
dfall <- data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc))))
We'll add a second column with its unique id. To get the IDs, remove duplicates:
library(dplyr)
uniquedf <- distinct(data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc)))))
I'm struggling with how to match rows between the two dataframes so as to pull the row index from uniquedf into a new id column of dfall:
alldf <- dfall %>% mutate(id = which(uniquedf$words == words))
A dplyr method like this doesn't work.
Is there a more efficient way to do this?
To give an even simpler example to show the expected output, I'd like a dataframe that looks like this:
words id
1 to 1
2 row 2
3 zip 3
4 zip 3
Where my starting word vector is: doc <- c('to', 'row', 'zip', 'zip') or doc <- c('to row zip zip'). The id column adds a unique id for each unique word.
A cheap way using sapply. Note that the data below repeats the word "with", so the example actually contains a duplicate word:
doc <- 'Questions with with titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'
alldf <- cbind(dfall, sapply(1:nrow(dfall), function(x) which(uniquedf$words == dfall$words[x])))
colnames(alldf) <- c("words", "id")
> alldf
words id
1 questions 1
2 with 2
3 with 2
4 titles 3
5 have 4
6 frequently 5
7 been 6
8 downvoted 7
9 and 8
10 or 9
11 closed 10
12 consider 11
13 using 12
14 a 13
15 title 14
16 that 15
17 more 16
18 accurately 17
19 describes 18
20 your 19
21 question 20
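A vectorised alternative to the sapply loop is match(), which returns, for each word, the position of its first occurrence among the unique words. Shown here on the simpler four-word vector from the question:

```r
words <- c("to", "row", "zip", "zip")

# match() gives the index of each word within unique(words),
# which is exactly the consecutive id wanted here
ids <- match(words, unique(words))
ids
# [1] 1 2 3 3
```

Since unique() preserves first-appearance order, the ids come out in order of first occurrence, the same numbering the sapply approach produces.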

Removing duplicates based on two columns in R [duplicate]

This question already has answers here:
Remove duplicate column pairs, sort rows based on 2 columns [duplicate]
(3 answers)
Closed 7 years ago.
Suppose my data is as follows,
X Y
26 14
26 14
26 15
26 15
27 15
27 15
28 16
28 16
I want to remove the duplicated rows. I am able to remove duplicates based on one column with
dat[c(TRUE, diff(dat$X) != 0), ] or dat[c(TRUE, diff(dat$Y) != 0), ]
But I want to remove a row only when both columns repeat the previous row's values. I can't use unique here, because the same pair can legitimately occur again later in the data; I only want to compare each row with the one directly before it.
My sample output is,
x y
26 14
26 15
27 15
28 16
How can we do this in R?
Thanks
Ijaz
Using data.table v1.9.5 or later:
require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]
rleidv() is best understood with examples:
rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4
A unique index is generated for each consecutive run of values.
And the same can be accomplished on a list, data.frame, or data.table, restricted to a specific set of columns. For example:
df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3
The rest should be fairly obvious. We just check for duplicated() values, and return the non-duplicated ones.
Using dplyr:
library(dplyr)
z %>% filter(X != lag(X) | Y != lag(Y) | row_number() == 1)
We need to include row_number() == 1, otherwise we lose the first row: lag() returns NA there, so neither comparison evaluates to TRUE and filter() drops it.
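Without extra packages, the asker's own diff() idea also extends to both columns; a base-R sketch on the sample data:

```r
# The question's sample data
dat <- data.frame(X = c(26, 26, 26, 26, 27, 27, 28, 28),
                  Y = c(14, 14, 15, 15, 15, 15, 16, 16))

# Keep a row when either X or Y differs from the previous row;
# the leading TRUE always keeps the first row
dat[c(TRUE, diff(dat$X) != 0 | diff(dat$Y) != 0), ]
```

This only compares consecutive rows, so identical pairs that reappear later in the data are kept, as the asker required.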
