Removing/collapsing duplicate rows in R

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I want to do (remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. A comment said it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets=paste("a",1:200,sep="")    # probe IDs a1..a200
Genes=sample(letters,200,replace=T)  # each probe assigned to a gene (genes repeat)
Value=rnorm(200)                     # one value per probe
X=data.frame(Probesets,Genes,Value)
X=X[order(X$Value,decreasing=T),]    # sort rows by Value, highest first
Y=X[which(!duplicated(X$Genes)),]    # keep only the first row seen for each gene

Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:
Y=X[which(!duplicated(X$Genes)),]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)) you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y=X[!duplicated(X),]
To see how it works consider this example:
df <- data.frame(
  a = c(1,1,2,3),
  b = c(1,1,3,4)
)
df
  a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
  a b
1 1 1
3 2 3
4 3 4

Your code keeps, for each gene, the record with the maximum Value: X is sorted by Value in decreasing order first, and duplicated() marks every occurrence of a gene after the first, so the one row kept per gene is its highest-Value one. Note that nothing in this snippet computes a MAD; that criterion would only come in if you sorted by it instead of by Value.
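For completeness, since the question mentions MAD: in the linked Bioconductor thread the collapsing criterion is variability across samples, not a single Value column. A minimal sketch of that idea (the expression matrix expr and its dimensions are invented here for illustration) ranks probes by their MAD across samples and then deduplicates exactly as above:
set.seed(1)
Probesets <- paste("a", 1:200, sep = "")
Genes <- sample(letters, 200, replace = TRUE)
expr <- matrix(rnorm(200 * 10), nrow = 200,   # hypothetical: 200 probes x 10 samples
               dimnames = list(Probesets, paste0("s", 1:10)))
X <- data.frame(Probesets, Genes,
                MAD = apply(expr, 1, mad))    # per-probe MAD across samples
X <- X[order(X$MAD, decreasing = TRUE), ]     # most variable probes first
Y <- X[!duplicated(X$Genes), ]                # keep the most variable probe per gene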

Related

R - Update value of a column based on condition

I need to update all the values of a column, using another df as a reference.
The two data frames have equal structures:
cod name dom_by
1 A 3
2 B 4
3 C 1
4 D 2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that replacement has 92 rows, data has 2.
(df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still cannot solve it, even after some searching.
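No answer was captured here, but the error message itself points at the problem: df2$dom_by == df1$cod compares vectors of different lengths (2 vs. 92), so the shorter one is recycled. A common base-R fix (my sketch, with small toy data standing in for the real 92-row df1) is match(), which returns exactly one df1 row index per element of df2$dom_by:
df1 <- data.frame(cod = 1:4, name = c("A","B","C","D"), dom_by = c(3,4,1,2))
df2 <- data.frame(cod = 5:6, name = c("E","F"), dom_by = c(2,4))
# For each df2$dom_by, find the df1 row whose cod matches and take its name;
# the replacement vector has exactly nrow(df2) elements, so nothing recycles
df2$name <- df1$name[match(df2$dom_by, df1$cod)]
df2
#   cod name dom_by
# 1   5    B      2
# 2   6    D      4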

R - counting adjacent duplicate items

New to R and would like to do the following operation:
I have a set of numbers, e.g. (1,1,0,1,1,1,0,0,1), and need to count adjacent duplicates as they occur. The result I am looking for is:
2,1,3,2,1
as in 2 ones, 1 zero, 3 ones, etc.
Thanks.
We can use rle (run-length encoding), which collapses the vector into runs of repeated values, and take the lengths of those runs:
rle(v1)$lengths
#[1] 2 1 3 2 1
data
v1 <- c(1,1,0,1,1,1,0,0,1)
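For context, rle() returns the run lengths together with the run values, so you can also recover which number each count belongs to:
r <- rle(v1)
r$lengths
#[1] 2 1 3 2 1   (how many adjacent repeats)
r$values
#[1] 1 0 1 0 1   (the value each run repeats)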

replace specific number in a vector from a list in r

I've been working on this problem and can't seem to figure out the proper solution. Ultimately I'm going to use dplyr in order to group by and apply a function to a column. I turned the column into a vector. Here is a snippet:
vec1 <- append(append(append(rep(1,3),rep(2,6)), rep(3,5)),rep(4,2))
If a number repeats more than 3 times, I want to change the following number to 1. So in the vector above, the number 2 occurs 6 times and the number 3 occurs 5 times; that means I want to replace the 3s and the 4s with 1. Ultimately, in this snippet the answer I'm looking for is:
c(1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1)
What I have below worked for cases when only one number was repeated more than 3 times, but not multiple. In addition, if I'm doing this inefficiently I'd like to learn how to better script it.
stack <- table(vec1)
stack1 <- list(as.numeric(rownames(data.frame(stack[stack>3]))) + 1)
replace(vec1,vec1 == stack1,1)
Thanks in advance for any help.
Try modifying the run-length encoding in place: flag every run that follows a run longer than 3, set its value to 1, and rebuild the vector with inverse.rle:
inverse.rle(within.list(rle(vec1),
    values[c(FALSE, (lengths > 3)[-length(lengths)])] <- 1))
#[1] 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1
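The same idea written out step by step (my expansion of the one-liner above, not a different algorithm): take the run-length encoding, shift the lengths > 3 flags one run to the right so they mark the following run, overwrite those values with 1, and rebuild the vector:
r <- rle(vec1)                                     # lengths: 3 6 5 2, values: 1 2 3 4
follows_long <- c(FALSE, head(r$lengths > 3, -1))  # TRUE for runs after a run longer than 3
r$values[follows_long] <- 1                        # the runs of 3 and 4 become runs of 1
inverse.rle(r)
#[1] 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1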

Find out if column in R table includes duplicate values?

I've got a lovely dataframe, my very first, and I'm starting to get the hang of R. One thing I haven't been able to find is a test for duplicate values. I have one column that I'm pretty sure is all unique values, but I don't know that.
Is there a way I can ask? For simplicity, let's pretend this is my data:
var1 var2 var3
1 1 A 1
2 2 B 3
3 3 C NA
4 4 D NA
5 5 E 4
and I want to know whether var1 ever repeats.
Check out the duplicated function:
duplicated(dat$var1) # the rows of dat var1 duplicated
See ?duplicated for the documentation.
You should also look at the unique function.
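Since you only want a yes/no answer for whether var1 ever repeats, you can also collapse the logical vector (standard base R; dat below is a toy version of your data):
dat <- data.frame(var1 = c(1, 2, 3, 4, 5))
any(duplicated(dat$var1))  # FALSE when all values are unique
anyDuplicated(dat$var1)    # 0 when unique, otherwise the index of the first duplicate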
Remove duplicates based on a single column:
my_data[!duplicated(my_data$Col_id), ] # ! is logical negation: keeps the first row for each Col_id

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows which feature the Participant to a survey and 158 columns, each which represents the question number. The answers for each are 1-5. The raw data includes the number "99" to indicate that a question was not answered. I need to exclude any questions where a Participant did not answer without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
Which works fine when I am working with sets where all my answers are contained in one column. Then this would just delete the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data such that my large data set would end up having 'blanks' where the "99" occurred, so that these 99s would not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
1 2 4 2
2 3 1 3
3 4 4 2 5
4 1 3 2
5 1 3 4 2
Is this possible to do in R? I've tried to filter it before submitting to R, but it won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better code or package to use)
Any assistance would be greatly appreciated!
You could replace the 99s with NA and then calculate the column means omitting NAs:
df <- replicate(20, sample(c(1,2,3,99), 4))  # toy data: 4 respondents x 20 questions
colMeans(df)                 # wrong: the 99s inflate every mean
dfc <- df
dfc[dfc == 99] <- NA         # recode 99 as missing
colMeans(dfc, na.rm = TRUE)  # means over the answered values only
You can also declare which values count as NA when you read in your data. For your particular case:
mydata <- read.table('dat_base', na.strings = "99")
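Either way, once the 99s are NAs the downstream statistics just need to be told to skip them; base functions such as t.test() already drop NAs from each sample. A quick sketch, reusing the toy dfc matrix from above:
colMeans(dfc, na.rm = TRUE)  # per-question means over answered items only
t.test(dfc[, 1], dfc[, 2])   # t.test removes NAs from each sample automatically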
