I am new to R and would like to do the following:
I have a set of numbers, e.g. (1,1,0,1,1,1,0,0,1), and need to count adjacent duplicates as they occur. The result I am looking for is:
2,1,3,2,1
as in 2 ones, 1 zero, 3 ones, etc.
Thanks.
We can use rle, which computes the lengths of runs of equal consecutive values:
rle(v1)$lengths
#[1] 2 1 3 2 1
data
v1 <- c(1,1,0,1,1,1,0,0,1)
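If it helps to see which value each count belongs to, here is a minimal sketch that pairs the two components returned by rle. The names runs and run_table are only illustrative and not part of the answer above.

```r
v1 <- c(1, 1, 0, 1, 1, 1, 0, 0, 1)
runs <- rle(v1)

# One row per run: the repeated value and how many times it repeats in a row
run_table <- data.frame(value = runs$values, count = runs$lengths)
run_table

# inverse.rle() rebuilds the original vector from the runs
round_trip <- inverse.rle(runs)
identical(round_trip, v1)
```

Here run_table$count holds the same numbers as rle(v1)$lengths, and run_table$value shows that the runs alternate between 1 and 0.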
I am trying to find the total of rows that have a column value of 3 or 4. That being said, the first row has only one value of 3 so if I create a new column
currentdx_count1$TotalDiagnoses
That new column called TotalDiagnoses should only have a value of 1 under it for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
This doesn't give me what I need as expected because it literally sums up the whole row. That being said, is there an existing function that does what I want to do or will I have to make one? Could I specify more in rowSums for it to work as I need it to?
Thanks for any and all help.
Edit: I'm trying to adapt a method I use earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
Gets me closer but it is finding the total count for each column separately which is not what I need. I want a single column to encompass counts for each SID.
You can compare the data frame against the values 3 and 4 and then use rowSums to count the matches in each row:
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
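Since the real currentdx_count1 is not shown, here is a small sketch of the same idea on made-up data. The data frame dx, its column names and its values are all hypothetical. It uses %in% instead of two == comparisons, which is handy when the list of codes grows; note that %in% treats NA as "no match", whereas == would return NA.

```r
# Hypothetical toy data: an ID column followed by diagnosis code columns
dx <- data.frame(
  SID = c("a", "b", "c"),
  Dx1Current = c(3, 1, 4),
  Dx2Current = c(0, 2, 3),
  Dx3Current = c(1, 4, 4)
)

codes <- as.matrix(dx[-1])

# %in% returns a plain vector, so restore the matrix shape before rowSums()
hits <- codes %in% c(3, 4)
dim(hits) <- dim(codes)

# Count, per row, how many columns hold a 3 or a 4
dx$TotalDiagnoses <- rowSums(hits)
dx$TotalDiagnoses
```

Row "a" has a single 3, row "b" a single 4, and row "c" has three matching values, so the counts are 1, 1 and 3.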
I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). Seems to work great for what I hope to do (which is remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. It was commented it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets=paste("a",1:200,sep="")
Genes=sample(letters,200,replace=TRUE)
Value=rnorm(200)
X=data.frame(Probesets,Genes,Value)
X=X[order(X$Value,decreasing=TRUE),]
Y=X[which(!duplicated(X$Genes)),]
Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:
Y=X[which(!duplicated(X$Genes)),]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)), you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y=X[!duplicated(X),]
To see how it works consider this example:
df <- data.frame(
a = c(1,1,2,3),
b = c(1,1,3,4)
)
df
  a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
  a b
1 1 1
3 2 3
4 3 4
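Your code keeps, for each gene, the record with the maximum Value: X is first sorted by Value in decreasing order, and !duplicated(X$Genes) then keeps only the first row it meets for each gene. Nothing in the code shown computes a MAD; the rows are simply ranked by whatever is stored in Value.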
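To check that sorting followed by !duplicated really keeps the largest Value per gene, here is a sketch that compares the result with a per-gene maximum computed independently via tapply. The seed and the names max_per_gene and same_as_max are my own additions for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

X <- data.frame(
  Probesets = paste0("a", 1:200),
  Genes = sample(letters, 200, replace = TRUE),
  Value = rnorm(200)
)

# Sort so that the largest Value of each gene comes first
X <- X[order(X$Value, decreasing = TRUE), ]

# duplicated() flags every occurrence after the first,
# so negating it keeps the top row of each gene
Y <- X[!duplicated(X$Genes), ]

# Independent check: per-gene maximum computed with tapply()
max_per_gene <- tapply(X$Value, X$Genes, max)
same_as_max <- all.equal(Y$Value, as.vector(max_per_gene[as.character(Y$Genes)]))
same_as_max
```

If Value held a per-probeset MAD instead of random numbers, the same two lines would keep the probeset with the highest MAD for each gene.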
Assuming my data frame has one column, I wish to add another column to indicate whether my ith element is unique within the first i elements. The result I want is:
c1 c2
1 1
2 1
3 1
2 0
1 0
For example, 1 is unique in {1}, 2 is unique in {1,2}, 3 is unique in {1,2,3}, 2 is not unique in {1,2,3,2}, 1 is not unique in {1,2,3,2,1}.
Here is my code, but it runs extremely slowly given that I have nearly 1 million rows.
for(i in 1:nrow(df)){
  k <- sum(df$C1[1:i]==df$C1[i])
  if(k>1){df[i,"C2"]=0}
  else{df[i,"C2"]=1}
}
Is there a quicker way of achieving this?
The following works:
x$c2 = as.numeric(! duplicated(x$c1))
Or, if you prefer more explicit code (I do, but it’s slower in this case):
x$c2 = ifelse(duplicated(x$c1), 0, 1)
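As a quick sanity check, here is the first variant applied to the five values from the question. This is only a sketch on the example data; as.integer is used instead of as.numeric purely to get whole-number 0/1 flags.

```r
x <- data.frame(c1 = c(1, 2, 3, 2, 1))

# duplicated() is TRUE for an element that already appeared earlier in the vector
seen_before <- duplicated(x$c1)

# Negate and convert to 0/1: 1 marks the first occurrence of a value
x$c2 <- as.integer(!seen_before)
x
```

The resulting c2 column is 1, 1, 1, 0, 0, which matches the table in the question. duplicated makes a single pass over the vector, whereas the loop rescans the first i elements on every iteration.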
I've been working on this problem and can't seem to figure out the proper solution. Ultimately I'm going to use dplyr in order to group by and apply a function to a column. I turned the column into a vector. Here is a snippet:
vec1 <- append(append(append(rep(1,3),rep(2,6)), rep(3,5)),rep(4,2))
If a number repeats more than 3 times, I want to change the following number to 1. So in the vector above, the number 2 occurs 6 times and the number 3 occurs 5 times. That means I want to replace the numbers 3 and 4 with 1. Ultimately in this snippet, the answer I'm looking for is:
c(1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1)
What I have below worked for cases when only one number was repeated more than 3 times, but not multiple. In addition, if I'm doing this inefficiently I'd like to learn how to better script it.
stack <- table(vec1)
stack1 <- list(as.numeric(rownames(data.frame(stack[stack>3]))) + 1)
replace(vec1,vec1 == stack1,1)
Thanks in advance for any help.
Try
inverse.rle(within.list(rle(vec1),
values[c(FALSE,(lengths >3)[-length(lengths)])] <- 1))
#[1] 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1
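The one-liner above is compact, so here is the same logic unrolled into separate steps. The intermediate names r, long_run and follows_long are mine. Note that it works on runs of consecutive equal values, which matches the example because equal numbers are adjacent there.

```r
vec1 <- c(rep(1, 3), rep(2, 6), rep(3, 5), rep(4, 2))

r <- rle(vec1)

# Which runs are longer than 3 elements?
long_run <- r$lengths > 3

# Shift the flags one run to the right:
# a run is replaced when the run just before it was long
follows_long <- c(FALSE, long_run[-length(long_run)])

# Overwrite the value of those runs with 1, then expand back to a vector
r$values[follows_long] <- 1
result <- inverse.rle(r)
result
```

The runs have lengths 3, 6, 5 and 2, so the third and fourth runs (the 3s and the 4s) each follow a long run and are set to 1.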
I'm using a for-loop to perform operations on specific subsets of my data. At the end of each iteration of the for loop, I have all the values that I need to fill a row of my dataframe.
So far I tried
df=NULL
for(...){
  # stuff to calculate
  newline=c(allthethingscalculated)
  df=rbind(df,newline)
}
This results in the contents of the data frame not being accessible using '$', because the rows are then atomic vectors.
I also tried to append the values I get at the end of each iteration to already existing vectors and, when the for loop ends, to create a data frame from these vectors using the line below. But appending the values to the respective vectors didn't work; the values weren't added.
x<-data.frame(a,b,c,d,...)
Any ideas on this?
Since my for loop iterates over IDs in my data, I realized I could do something like this:
uids=unique(data$id)
filler=c(1:length(uids))
df=data.frame(uids,filler,filler,filler,filler,filler,filler,filler,filler,filler)
for(i in uids){
...
df[i,]<-newline
}
I used filler to create a dataframe with the correct number of columns and rows so I don't get an error like 'replacement has length of 9, replacement has length of 1'
Is there a better way to do this? With this approach, the filler values remain in the respective rows and I would still need to remove them.
This should work. Can you show us your data?
R) x=data.frame(a=rep(1,3),b=rep(2,3),c=rep(3,3))
R) d=c(4,4,4)
R) rbind(x,d)
  a b c
1 1 2 3
2 1 2 3
3 1 2 3
4 4 4 4
R) cbind(x,d)
  a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
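For the loop-over-IDs case, one common pattern is to store a one-row data frame per iteration in a preallocated list and bind everything once at the end. Here is a minimal sketch, assuming made-up input: the data frame data, its columns and the computed summaries n and mean_value are hypothetical stand-ins for whatever is calculated per ID.

```r
# Hypothetical input: an id column and one measurement per observation
data <- data.frame(id = c(1, 1, 2, 2, 3), value = c(10, 20, 30, 50, 70))

uids <- unique(data$id)

# Preallocate a list with one slot per id
rows <- vector("list", length(uids))

for (i in seq_along(uids)) {
  subset_i <- data[data$id == uids[i], ]
  # Each iteration stores a one-row data frame with named columns
  rows[[i]] <- data.frame(
    id = uids[i],
    n = nrow(subset_i),
    mean_value = mean(subset_i$value)
  )
}

# Bind all rows once, after the loop
df <- do.call(rbind, rows)
df
```

Because every element of rows is a data frame with column names, the result is a regular data frame, so df$mean_value works and no filler values need to be removed afterwards.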