R - Delete rows based on duplicate and values in another column [duplicate] - r

This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 6 years ago.
I have a data.frame in R that looks like the following:
> inputtable <- data.frame(TN = c("T","N","T","N","N","T","T","N"),
+ Value = c(1,1,2,2,2,3,3,5))
> inputtable
TN Value
1 T 1
2 N 1
3 T 2
4 N 2
5 N 2
6 T 3
7 T 3
8 N 5
I want to remove rows that share a duplicated value in the Value column, but ONLY if one of those rows has "T" and another has "N" in the TN column.
I played around with duplicated, but this doesn't work the way I've coded it:
TNoverlaps.duprem <- TNoverlaps[ !(duplicated(TNoverlaps$Barcode) & ("T" %in% TNoverlaps$TN & "N" %in% TNoverlaps$TN)), ]
and
TNoverlaps.duprem <- TNoverlaps[ duplicated(TNoverlaps$Barcode) & !duplicated(TNoverlaps$Barcode, TNoverlaps$TN), ]
If there are more than two rows, as in rows 3-5 above, I want to remove all of those, because at least one is "T" and one is "N" in the TN column.
Here's the output I want:
> outputtable
TN Value
6 T 3
7 T 3
8 N 5
I found plenty of questions about duplicated rows, and removing rows based on multiple columns. But I didn't see one that did something like this.

You could try:
library(dplyr)
inputtable %>% group_by(Value) %>% filter(!(n_distinct(TN) >= 2))
Source: local data frame [3 x 2]
Groups: Value [2]
TN Value
(fctr) (dbl)
1 T 3
2 T 3
3 N 5
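If you prefer to stay in base R, the same filter can be sketched with `ave`, counting the distinct TN values in each Value group (variable names follow the question):

```r
inputtable <- data.frame(TN = c("T", "N", "T", "N", "N", "T", "T", "N"),
                         Value = c(1, 1, 2, 2, 2, 3, 3, 5))

# For each row, count how many distinct TN values its Value group contains
n_tn <- ave(seq_along(inputtable$TN), inputtable$Value,
            FUN = function(i) length(unique(inputtable$TN[i])))

# Keep only the groups where TN is not mixed (a single distinct value)
inputtable[n_tn == 1, ]
```

This keeps rows 6-8, matching the desired output above.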

Modify DataFrame, remove double Data with for each, R [duplicate]

This question already has answers here:
Select the first and last row by group in a data frame
(6 answers)
Closed 2 years ago.
I'm trying to modify a data frame because it includes duplicate values.
Data Frame:
Id Name Account
1 X 1
1 Y 2
1 Z 3
2 J 1
2 T 4
3 O 2
So when there are multiple rows with the same Id, I just want to keep the last row.
The desired output would be:
Id Name Account
1 Z 3
2 T 4
3 O 2
This is my current code:
for (i in 1:(nrow(mylist) - 1)) {
  if (mylist$Id[i] == mylist$Id[i + 1]) {
    mylist <- mylist[-i, ]
  }
}
I have problems when a row is removed, because all subsequent rows get a lower index and the loop skips rows in the next iteration.
You can do this easily with the dplyr package:
library(dplyr)
mylist %>%
  group_by(Id) %>%
  slice(n()) %>%
  ungroup()
First you group_by the Id column. Afterwards you select only the last entry (slice(n())) of each group.
One option in base R (assuming the rows are already sorted by Id) is
mylist[cumsum(sapply(split(mylist, mylist$Id), nrow)), ]
Id Name Account
3 1 Z 3
5 2 T 4
6 3 O 2
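Another base R option, which does not require the rows to be sorted by Id, is `duplicated` with `fromLast = TRUE`: a row is kept only if its Id does not occur again later in the frame.

```r
mylist <- data.frame(Id = c(1, 1, 1, 2, 2, 3),
                     Name = c("X", "Y", "Z", "J", "T", "O"),
                     Account = c(1, 2, 3, 1, 4, 2))

# Keep each row whose Id does not reappear further down
mylist[!duplicated(mylist$Id, fromLast = TRUE), ]
```

This returns the Z, T, and O rows, matching the desired output.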

How to sum a specific column of replicate rows in dataframe? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
How to group by two columns in R
(4 answers)
Closed 3 years ago.
I have a data frame which contains many replicated rows. I would like to sum up the last column of the replicated rows and remove the duplicates at the same time. Could anyone tell me how to do that?
The example is here:
name <- c("a","b","c","a","c")
position <- c(192,7,6,192,99)
score <- c(1,2,3,2,5)
df <- data.frame(name,position,score)
> df
name position score
1 a 192 1
2 b 7 2
3 c 6 3
4 a 192 2
5 c 99 5
#I would like to sum the scores together if the first two columns are the
#same. The ideal result looks like this:
name position score
1 a 192 3
2 b 7 2
3 c 6 3
4 c 99 5
Sincerely thank you for the help.
Try this:
library(dplyr)
df %>%
  group_by(name, position) %>%
  summarise(score = sum(score, na.rm = TRUE))
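The same aggregation can also be done in base R with `aggregate`, using the data from the question:

```r
name <- c("a", "b", "c", "a", "c")
position <- c(192, 7, 6, 192, 99)
score <- c(1, 2, 3, 2, 5)
df <- data.frame(name, position, score)

# Sum score within each (name, position) combination;
# duplicated combinations collapse into one row
aggregate(score ~ name + position, data = df, FUN = sum)
```

The result has one row per unique name/position pair, with the duplicated "a"/192 rows summed to 3 (the row order may differ from the dplyr output).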

How to use R to return the max value of one column and the contents of the corresponding row [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I have a data frame like this but much longer:
A B
1 0
3 9
7 3
6 2
1 4
2 1
I want to get the maximum value of column A and the value in column B that corresponds with it, regardless of whether it is also the maximum value. So for this data set I would like to get 7 and 3. But if I use:
Max <- apply(df, 2, max)
I get 7 and 9.
Thanks for your help!
You want the row at which A has its maximum: df[which.max(df$A), ]
We can use dplyr
library(dplyr)
df %>%
  slice(which.max(A))
# A tibble: 1 x 2
# A B
# <int> <int>
#1 7 3
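One detail worth knowing: `which.max` returns only the first row that attains the maximum. If several rows tie for the maximum of A and you want all of them, a base R comparison works (illustrated here on a hypothetical variant of the question's data with a tie added):

```r
# Same frame as the question, except the last row also has A = 7
df <- data.frame(A = c(1, 3, 7, 6, 1, 7),
                 B = c(0, 9, 3, 2, 4, 1))

# All rows where A reaches its maximum, including ties
df[df$A == max(df$A), ]
```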

Set a value if there is an increase of more than 1 in a column between rows [duplicate]

This question already has answers here:
Group rows in data frame based on time difference between consecutive rows
(2 answers)
Closed 6 years ago.
I have a dataframe as below:
RecordID <- c("a","b","c","d","e","f","g")
row.number <- c(1,2,10,11,12,45,46)
df <- data.frame(RecordID, row.number)
df$frame.change =1
I want the value in frame.change to increase by 1 from the previous row if there is an increase in row.number of more than 1 from the previous row. I am trying the following code but it doesn't work:
for (i in 2:nrow(df)) {
  df$frame.change[i] <- if ((df$frame.change[i] - df$frame.change[i - 1]) < 1) {
    df$frame.change[i - 1]
  } else {
    df$frame.change[i - 1] + 1
  }
  cat("-")
}
It doesn't need to be done within a for loop and I presume lapply will be the solution but I can't seem to get this to work.
Help appreciated
A combination of cumsum and diff is all you need:
df$frame.change <- cumsum(c(1, diff(df$row.number) > 1))
which gives:
> df
RecordID row.number frame.change
1 a 1 1
2 b 2 1
3 c 10 2
4 d 11 2
5 e 12 2
6 f 45 3
7 g 46 3
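To see why the one-liner works, it helps to inspect the intermediate values for the question's row.number vector:

```r
row.number <- c(1, 2, 10, 11, 12, 45, 46)

diff(row.number)                     # gaps between consecutive rows: 1 8 1 1 33 1
diff(row.number) > 1                 # TRUE wherever the gap exceeds 1
cumsum(c(1, diff(row.number) > 1))   # running count of gap events: 1 1 2 2 2 3 3
```

Each `TRUE` marks the start of a new frame, and `cumsum` turns those markers into the running group number.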

Find unique rows in a data frame in R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 6 years ago.
I'd like to create a new data frame column that helps me quickly identify duplicate rows based on the value of the first column per row (index). Assuming that my dataframe (df) has almost 18000 rows-observations and the new column is called "unique" I have tried the following rather unsuccessfully...
df$unique = ifelse(df[row.names(df):1]==df[row.names(df)-1:1], "YES", "NO")
The rationale behind the code is that comparing each cell with the one in the row above, in the same column, should flag an entry as unique whenever the two values do not match.
My dataframe
index num1 num2
1 12 12
1 12 12
2 14 14
2 14 14
2 14 14
3 18 18
4 19 19
You can use the duplicated function. Be aware that the first occurrence of a non-unique row is not flagged as a duplicate, hence we need duplicated twice, searching from the beginning and from the end.
# Toy data, where the first two rows are identical, the third row is unique
df <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
# Find unique rows
df$unique <- !(duplicated(df) | duplicated(df, fromLast = TRUE))
Output:
> df
a b unique
1 1 1 FALSE
2 1 1 FALSE
3 1 2 TRUE
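The answer above flags rows that are duplicated across all columns. If, as the question suggests, uniqueness should be judged on the first column (index) alone and reported as "YES"/"NO", the same two-sided duplicated trick applies to just that column:

```r
# The question's data frame
df <- data.frame(index = c(1, 1, 2, 2, 2, 3, 4),
                 num1 = c(12, 12, 14, 14, 14, 18, 19),
                 num2 = c(12, 12, 14, 14, 14, 18, 19))

# A row's index is unique when it occurs exactly once in the column
df$unique <- ifelse(duplicated(df$index) | duplicated(df$index, fromLast = TRUE),
                    "NO", "YES")
```

With this data only the index values 3 and 4 appear once, so the last two rows get "YES" and the rest "NO".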
