I've got a lovely dataframe, my very first, and I'm starting to get the hang of R. One thing I haven't been able to find is a test for duplicate values. I have one column that I'm pretty sure contains all unique values, but I don't know that for certain.
Is there a way I can ask? For simplicity, let's pretend this is my data:
  var1 var2 var3
1    1    A    1
2    2    B    3
3    3    C   NA
4    4    D   NA
5    5    E    4
and I want to know whether var1 ever repeats.
Check out the duplicated function:
duplicated(dat$var1)  # logical vector: TRUE where a value of var1 repeats an earlier one
Documentation is at ?duplicated.
You should also look at the unique function.
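To answer the original question directly ("does var1 ever repeat?"), here is a minimal sketch in base R, assuming the data frame is called dat as above:
any(duplicated(dat$var1))  # TRUE if var1 contains at least one repeated value
anyDuplicated(dat$var1)    # index of the first duplicate, or 0 if there is none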
Remove duplicates based on columns:
my_data[!duplicated(my_data$Col_id), ]  # ! negates the logical vector, so only first occurrences are kept
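If a "duplicate" is defined by a combination of columns rather than a single one, duplicated() also accepts a data frame, so you can pass it a subset of columns. A sketch with hypothetical column names Col_id and Col_date:
my_data[!duplicated(my_data[, c("Col_id", "Col_date")]), ]  # unique on the column pair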
Related
I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I want to do (remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. A comment said it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets = paste("a", 1:200, sep = "")       # 200 probeset IDs: a1 ... a200
Genes = sample(letters, 200, replace = TRUE)  # random gene labels, with repeats
Value = rnorm(200)
X = data.frame(Probesets, Genes, Value)
X = X[order(X$Value, decreasing = TRUE), ]    # sort rows by Value, highest first
Y = X[which(!duplicated(X$Genes)), ]          # <- the line I do not understand
Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:
Y=X[which(!duplicated(X$Genes)),]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)) you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y=X[!duplicated(X),]
To see how it works, consider this example:
df <- data.frame(
a = c(1,1,2,3),
b = c(1,1,3,4)
)
df
  a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
  a b
1 1 1
3 2 3
4 3 4
Your code keeps, for each gene, the record with the maximum Value: because X is sorted by Value in decreasing order first, the first occurrence of each gene is the one with the highest Value, and !duplicated(X$Genes) keeps exactly those first occurrences. If the original code ordered by MAD instead of Value, the same mechanism would keep the row with the largest MAD per gene.
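A quick way to verify this, continuing with the X and Y defined above, is to compare the values Y kept against the per-gene maxima:
max_per_gene <- tapply(X$Value, X$Genes, max)         # highest Value for each gene
all.equal(sort(Y$Value), sort(unname(max_per_gene)))  # should print TRUE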
I need to update all the values of a column, using another df as a reference.
The two data frames have the same structure:
cod name dom_by
  1    A      3
  2    B      4
  3    C      1
  4    D      2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that replacement has 92 rows, data has 2.
(df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still cannot solve it, even after some searching.
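A common base-R idiom for this kind of lookup is match(), which sidesteps the recycling mismatch in the line above. A minimal sketch, assuming df1 is the 92-row reference table and df2 is the table being updated:
# For each row of df2, find the row of df1 whose cod equals df2$dom_by,
# then take that row's name; dom_by values with no match become NA
df2$name <- df1$name[match(df2$dom_by, df1$cod)]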
I have data about baseball results from 2016.
Now, I want to remove the rows for the games that ended in a tie.
That is, I want to remove the rows that have the same value in $team1_score and $team2_score.
Which function can I use in R to do this?
I just tried to use the following code, but it didn't work well.
Baseball2 <- Baseball[!duplicated(Baseball$team1_score)]
Please help me...!!
Here's a simple way to remove the rows with a tied score:
(dat <- data.frame(Team1_Score = c(1, 2, 3), Team2_Score = c(2, 3, 3)))
  Team1_Score Team2_Score
1           1           2
2           2           3
3           3           3
Use a logical test to find which rows have a tied score:
tie <- dat$Team1_Score == dat$Team2_Score
tie
[1] FALSE FALSE TRUE
Use this result to select the rows that are not ties:
dat[!tie, ]
  Team1_Score Team2_Score
1           1           2
2           2           3
I understand you do not want to remove duplicates, but rather to subset the data frame, discarding tied matches.
A very simple option using data.table:
library(data.table)
Baseball2 <- data.table(Baseball)
Baseball2 <- Baseball2[team1_score != team2_score]  # a bare condition in i selects rows
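As a self-contained check of the same idea on toy scores (column names assumed to match the question's):
library(data.table)
dt <- data.table(team1_score = c(1, 2, 3), team2_score = c(2, 3, 3))
dt[team1_score != team2_score]  # keeps only the non-tied games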
I have the following string of characters:
pig<-c("A","B","C","D","AB","ABC","AB","AA","CD","CA",NA)
I am trying to get R to tell me how many of each letter there are in total, and how many NAs there are. Thus, in this case I would like the result to look like this:
print(cow)
A B C D NA
7 4 4 2 1
I have tried table in combination with strsplit but cannot figure out exactly how to do it. Any thoughts? Thanks!
You would need to use NULL (or the empty character "") for the split value in strsplit(), then unlist it. Then, in table() you'll want to use the useNA argument to include any NA values. Here we'll use "ifany", so that if there are any NA values they will be shown in the table and if there are not, NA will not be shown in the result at all.
table(unlist(strsplit(pig, NULL)), useNA = "ifany")
#
# A B C D <NA>
# 7 4 4 2 1
This issue seems to have come up before, but after checking I couldn't find a solution. I load a table from a file, and it can happen (I don't know how) that some entire lines are empty. So when I build the data frame I get:
#   id c1 c2
# 1  a  1  2
# 2  b  2  4
# 3    NA NA
# 4  d  6  1
# 5  e  7  5
# 6    NA NA
if I do
apply(df, 1, function(x) all(is.na(x)))
I get all FALSE, because in the empty rows the first column contains an empty string rather than NA (the real table is much bigger, with mixed character and numeric columns), so I can't filter those lines out. na.omit and complete.cases don't solve it either.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv:
For instance, if the blank fields contain nothing or a single space, you could use
df <- read.csv(<your other logic here>, na.strings = c("NA", "", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, the blank fields come in as NA, and the apply logic above will then identify the offending rows.
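Putting both pieces together, a minimal sketch (the file name mytable.csv is hypothetical):
df <- read.csv("mytable.csv", na.strings = c("NA", "", " "))  # blank fields become NA
empty <- apply(df, 1, function(x) all(is.na(x)))              # TRUE for rows that are entirely NA
df <- df[!empty, ]                                            # drop the empty rows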