I am trying to get the following individual strings
O U N P D E A Z B X Q R M V H
to look like
"O" & "U" & etc....
This sorta does it... but has the & symbol inside the quotes like "O&"
paste("",varmap[[temp$SplitVar[2]]],"&",sep="")
the varmap part is the O U N P D E A Z B X Q R M V H
the code I was getting it from that someone had written before has it as
for (k in x:0)
{
paste("",varmap[[temp$SplitVar[2]]],"&",sep="")
}
but I'm not really sure what the k in x:0 does, and it gives an error saying that the vector is too long.
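One way to get that shape, as a minimal sketch: wrap each element in quotes first, then let collapse insert the & between elements instead of pasting it onto each one. The vector v below is a stand-in for varmap[[temp$SplitVar[2]]], which is assumed here to be a character vector.

```r
# Stand-in for varmap[[temp$SplitVar[2]]] (assumed to be a character vector)
v <- c("O", "U", "N", "P")

# Quote each element, then join the quoted pieces with " & "
quoted <- paste0('"', v, '"', collapse = " & ")
cat(quoted, "\n")
# "O" & "U" & "N" & "P"
```

Because collapse only places the separator between elements, no loop is needed and no trailing & is left over.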
Let's say I have a string:
s <- 'hello world zzz'
I want to shift the alphabetical characters up by one.
So:
a becomes b
b becomes c
c becomes d
d becomes e
and so on...
w becomes x
x becomes y
y becomes z
And:
z becomes a
The other condition is that if there is a character that isn't in the alphabet (in this case the space), it should be kept as it is, so the space remains a space.
Would all this be possible?
My desired output here would be:
ifmmp xpsme aaa
I have tried:
new <- c()
for (i in s)
{
new <- c(new, 'abcdefghijklmnopqrstuvwxyz'[which('abcdefghijklmnopqrstuvwxyz' == i) + 1])
}
print(new)
But it doesn't work... It outputs nothing.
Any ways of doing this?
chartr("abcdefghijklmnopqrstuvwxyz", "bcdefghijklmnopqrstuvwxyza", 'hello world zzz')
# [1] "ifmmp xpsme aaa"
(A function I've never had cause to use ...)
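The same call can be written without typing the alphabet out twice. A small sketch using the built-in letters constant; the helper name shift_letters is made up for illustration, and like the call above it only handles lowercase letters:

```r
# Hypothetical helper: shift lowercase letters by one, wrapping z to a.
shift_letters <- function(s) {
  old <- paste(letters, collapse = "")                     # "abc...z"
  new <- paste(c(letters[-1], letters[1]), collapse = "")  # "bcd...za"
  chartr(old, new, s)  # characters not found in `old` are left untouched
}

shift_letters("hello world zzz")
# [1] "ifmmp xpsme aaa"
```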
I want to create random sequences for the variables a, b, c, d, e and f with the length of 6000 under specific conditions.
I want to randomly draw from a discrete uniform distribution between 10 and 40 for every sequence, but under the following condition:
a = f < (a+b)/2 < e < c < b < d
Does anyone know how I would code that?
The conditions are somewhat ad-hoc. A hit and miss approach which draws random vectors until the conditions are satisfied could work (though it might not be optimal). Something like:
randvect <- function(){
  v <- sample(10:40,5)
  while(any(c(v[1] >= v[2],
              mean(v[1:2]) >= v[5],
              v[5] >= v[3],
              v[3] >= v[2],
              v[2] >= v[4]))){
    v <- sample(10:40,5)
  }
  v
}
For example,
> randvect()
[1] 16 26 25 36 23
(I don't bother with f since it is the same as a).
To get 6000:
vects <- replicate(6000,randvect())
With all the misses in the hit and miss, that takes about 30 seconds to evaluate on my machine.
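Most of that time goes into redrawing one vector at a time. A possible speed-up, sketched under the assumption that drawing candidates with replacement is acceptable (valid vectors never contain ties, so every valid vector is still equally likely): draw a whole batch, keep the rows that pass, and repeat until there are enough. The function name randvects_batch and the batch size are illustrative choices, not part of the answer above.

```r
# Hypothetical batched version of the hit-and-miss approach.
randvects_batch <- function(n, batch = 50000) {
  cols <- c("a", "b", "c", "d", "e")
  out <- matrix(integer(0), nrow = 0, ncol = 5, dimnames = list(NULL, cols))
  while (nrow(out) < n) {
    # One candidate vector per row
    m <- matrix(sample(10:40, 5 * batch, replace = TRUE), ncol = 5,
                dimnames = list(NULL, cols))
    # a < (a + b)/2 < e < c < b < d, checked for all rows at once
    keep <- m[, "a"] < m[, "b"] &
      (m[, "a"] + m[, "b"]) / 2 < m[, "e"] &
      m[, "e"] < m[, "c"] &
      m[, "c"] < m[, "b"] &
      m[, "b"] < m[, "d"]
    out <- rbind(out, m[keep, , drop = FALSE])
  }
  out[seq_len(n), , drop = FALSE]
}

vects <- randvects_batch(6000)
```

Note that this returns one vector per row (6000 x 5), whereas replicate() returns one per column.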
This question isn’t really well defined, as there are different implementations that result in different distributions. For instance, take the condition b < d. You could draw b and d independently and throw the pair out whenever the condition fails. That is the most natural interpretation, but the most computationally expensive. You can improve on it by randomly taking b and d, and then switching them if b > d. I think this logic can be extended to e, c, b, d: randomly choose four distinct numbers between 10 and 40, then assign e to be the smallest, c the second smallest, etc. I think this will produce the same distribution as the “throw out” method, but I’m not sure. So to get e, c, b, and d:
numbers = sort(sample(10:40, 4))  # without replacement, so e < c < b < d holds strictly
e = numbers[1]
c = numbers[2]
b = numbers[3]
d = numbers[4]
I'm still thinking about what to do with a, however.
John Coleman's answer will get there, and may be a better way to sample randomly, but it could take a long time depending on how large the allowable space is.
Another option is to figure out the allowable space and sample starting with a.
a has to be between 10 and 34 (to leave room for e, c, b, and d)
the average of a and b has to be < (b - 2) and < 37. This means b has to be at least 5 more than a, and at most 39
a + 4 < b < min((37 * 2) - a, 40)
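These bounds can be checked by brute force. The sketch below is not part of the original answer: it enumerates every (a, b, e) combination and keeps those that still leave room for c = e + 1 below b and d = b + 1 at or below 40. The names grid, ok, and feasible are illustrative.

```r
# All candidate (a, b, e) triples in the allowed range
grid <- expand.grid(a = 10:40, b = 10:40, e = 10:40)

# Feasible if a < (a + b)/2 < e, and c = e + 1 fits below b, and d = b + 1 <= 40
ok <- with(grid, a < b & (a + b) / 2 < e & e + 1 < b & b + 1 <= 40)
feasible <- unique(grid[ok, c("a", "b")])

range(feasible$a)                             # allowable values of a
b_min <- tapply(feasible$b, feasible$a, min)  # smallest usable b for each a
b_max <- tapply(feasible$b, feasible$a, max)  # largest usable b for each a
```

Comparing b_min with a + 5 and b_max with 39 confirms the limits stated above.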
The rest are a bit more straightforward. These can be wrapped into a function.
I'm going to use data.table more for looking at the results at the end. Also I'm using the function resample described in help(sample) to handle cases where there is only a single value to sample.
library(data.table)
resample <- function(x, ...) x[sample.int(length(x), ...)]
funky <- function() {
  a <- resample(10:34, 1)
  f <- a
  # b <= 73 - a keeps (a + b) / 2 below 37; for a in 10:34 the cap of 39 always binds
  b <- resample((a + 5):min(((37 * 2) - a - 1), 39), 1)
  e <- resample(ceiling((a+b)/2 + 0.1):min(38, b - 2), 1)
  c <- resample((e + 1):(b - 1), 1)
  d <- resample((b + 1):40, 1)
  c(a, b, c, d, e, f)
}
A few issues were found by trial and error. In e, the 0.1 is added before ceiling() so that an integer average X is pushed up to X + 1 (e must be strictly greater than the average), while an average of X.5 still rounds up to X + 1.
dat <- data.table(t(replicate(10000, funky())))
setnames(dat, c("a", "b", "c", "d", "e", "f"))
The following will return all rows that fail the tests in the original question. A few iterations with 10k samples and it doesn't look like anything is failing.
dat[!(a == f &
f < ((a + b) / 2) &
((a + b) / 2) < e &
e < c &
c < b &
b < d)]
I have a data frame with sequences as columns and amino acid sites as rows. I would like to compare the difference between these sequences at each site.
seq1 seq2 seq3 seq4 seq5 seq6 seq7 seq8
1 K E K K A A A A
2 V D A A T A A A
3 W W W W W W W W
4 R R R R R R S R
5 F S F F F Y F F
6 P P P P P P P P
7 N N N C N N N N
8 V I D D Q Q Q Q
9 Q Q Q Q Q Q Q Q
10 E E G G L I S F
11 L L Q L L L L L
12 N N Y Y V V S S
13 N N N N Q Q P P
14 L L L L L L L L
15 T T T T T T T I
Ideally, I would like to be able to have an additional column in my data frame that shows me the sites that are the same in all sequences and those that are the same only between seq1-4 or seq 5-8.
I am not sure what the best way to do this is, and any help is greatly appreciated.
Also, is there a way to add another column that shows the types of amino acids observed at each site?
Thanks in advance!
First I get a vector that flags the rows where all columns are the same:
allsame <- apply(df,1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
Next I get a vector that flags the rows where either of the column sets is the same:
startfour <- apply(df[,1:4],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
lastfour <- apply(df[,5:8],1,function(x){
val <- ifelse(length(unique(x)) == 1,1,0)
})
gen <- startfour + lastfour
eithersame <- ifelse(gen == 0,0,1)
Finally, you can create a column vector as required and join it to the data frame using the above two arrays:
# pre-allocate one (empty) label per row
output <- character(length(allsame))
for (i in seq_along(allsame)) {
  if (allsame[i] == 1) {
    output[i] <- "all same"
  } else if (eithersame[i] == 1) {
    output[i] <- "either same"
  } else {
    output[i] <- "none same"
  }
}
df <- cbind(df,output)
Here is a quick and dirty way to create the flags that you mentioned. Assuming the dataframe is called amino:
amino$first_flag<-with(amino,ifelse(seq1==seq2 & seq2==seq3 & seq3 == seq4,"same","diff"))
amino$second_flag<-with(amino,ifelse(seq5==seq6 & seq6==seq7 & seq7 == seq8,"same","diff"))
amino$total_flag<-with(amino,ifelse(first_flag=="same" & second_flag=="same" & seq1==seq5,"same","diff"))
Hopefully that works.
edit: and for your last question, I'm not sure what you mean but if you just want the letters that appear in each row then something like this could work:
for(i in 1:nrow(amino)) amino$types[i]<-paste(unique(unlist(amino[i,1:8])),collapse=",")
It will give you a column containing a comma separated list of the letters that appeared in each row.
edit2: If you have significantly more than 8 sequences, then a modified form of Ganesh's solution might work better (his output code isn't actually necessary):
amino$first_flag <- apply(amino[,1:4],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$second_flag <- apply(amino[,5:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$total_flag <- apply(amino[,1:8],1,function(x){
ifelse(length(unique(x)) == 1,"same","diff")
})
amino$types <- apply(amino[,1:8],1,function(x) paste(unique(x),collapse=","))
And for your new question-
amino$one_diff <- apply(amino[,1:8],1,function(x){
ifelse(7 %in% as.data.frame(table(x))[,2,drop=TRUE],"1 diff",NA)
})
This uses the table() function which normally gives you a count based on a vector or a column like table(amino$seq1). Using apply, we instead stick a row of the 8 sequences into it, it returns the counts, then we use as.data.frame and the brackets [] to get rid of some extra table() output that we don't need. The "7 %in%" part means if there are 7 of the same letters then there must be 1 different one. Anything else (i.e., all 8 same or more than 1 difference) will get NA.
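Putting the pieces together, here is a minimal sketch that produces a single status column instead of three flags. It uses four rows from the question's table (sites 1, 3, 13 and 10); the helper all_equal and the label strings are illustrative names, not taken from the answers above.

```r
# Sites 1, 3, 13 and 10 from the question, one string per site
rows <- c("KEKKAAAA", "WWWWWWWW", "NNNNQQPP", "EEGGLISF")
amino <- as.data.frame(do.call(rbind, strsplit(rows, "")),
                       stringsAsFactors = FALSE)
names(amino) <- paste0("seq", 1:8)

# TRUE for each row of `block` whose entries are all identical
all_equal <- function(block) {
  apply(block, 1, function(x) length(unique(x)) == 1)
}

first4 <- all_equal(amino[, 1:4])
last4  <- all_equal(amino[, 5:8])
all8   <- all_equal(amino[, 1:8])

amino$status <- ifelse(all8, "all same",
                ifelse(first4, "seq1-4 same",
                ifelse(last4, "seq5-8 same", "none same")))

# Distinct amino acids observed at each site
amino$types <- apply(amino[, 1:8], 1,
                     function(x) paste(unique(x), collapse = ","))
amino[, c("status", "types")]
```

A site where both groups are internally identical but differ from each other (say AAAABBBB) would be labelled "seq1-4 same" here; add another branch if that case needs its own label.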
I am looking for an idiomatic way to join a column, say named 'x', which exists in every data.frame element of a list. I came up with a solution with two steps by using lapply and Reduce. The second attempt trying to use only Reduce failed. Can I actually use only Reduce with one anonymous function to do this?
#data
xs <- replicate(5, data.frame(x=sample(letters, 10, T), y =runif(10)), simplify = FALSE)
# This works, but may be still unnecessarily long
otmap = lapply(xs, function(df) df$x)
jotm = Reduce(c, otmap)
# This does not count as another solution:
jotm = Reduce(c, lapply(xs, function(df) df$x))
# Try to use only Reduce function. This produces an error
jotr =Reduce(function(a,b){c(a$x,b$x)}, xs)
# Error in a$x : $ operator is invalid for atomic vectors
We can unlist after extracting the 'x' column
unlist(lapply(xs, `[[`, 'x'))
#[1] b y y i z o q w p d f f z b h m c u f s j e i v y b w j n q e w i r h p z q f x a b v z e x l c q f
#Levels: b d i o p q w y z c f h m s u e j n v r x a l
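As for the Reduce-only attempt: it fails because, after the first step, a is no longer a data frame but the vector accumulated so far, so a$x is invalid. One way around this, as a sketch: give Reduce an empty starting accumulator, so that the first argument is always a vector and only the second is a data frame. The data below are built with stringsAsFactors = FALSE, an assumption made here to keep x as plain character.

```r
set.seed(1)
xs <- replicate(5,
                data.frame(x = sample(letters, 10, TRUE), y = runif(10),
                           stringsAsFactors = FALSE),
                simplify = FALSE)

# acc is the vector built so far; df is the next data frame in the list
jotr <- Reduce(function(acc, df) c(acc, df$x), xs, character(0))

# Compare with the lapply/unlist version
identical(jotr, unlist(lapply(xs, `[[`, "x")))
```

The third argument of Reduce is the initial value; without it, the first list element itself is used as the starting accumulator, which is what triggers the error.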
I have an ff object. One of the columns, which is a string variable, has white spaces, and I want to remove these.
I have tried the following:
1). newcol <- gsub("[[:space:]]", "", mydata$mystr)
2). newcol<- as.ffdf(gsub("[[:space:]]", "", mydata$mystr))
I also tried to use the as.character command, such that I said the following before applying the gsub command:
mydata$mystr <- as.character(ff(c(mydata$mystr)))
However, none of these options works. Any suggestions/help would be greatly appreciated.
EDIT: SOLUTION GIVEN BY AKRUN BELOW
Maybe you can try with ffbase
library(ffbase)
library(ff)
head(ffd$y[])
#[1] p l k a i v
#20 Levels: a c c e f h i j k k l l n
#n o ... v
ffd$y <- with(ffd, gsub('[[:space:]]', '', y))
head(ffd$y[])
#[1] p l k a i v
#Levels: a c e f h i j k l n o p q t v
data
set.seed(24)
d <- data.frame(x=1:26, y=sample(c(letters, paste(' ', letters, ' ')),
26, replace=TRUE), z=Sys.time()+1:26)
ffd <- as.ffdf(d)