R. huge vector of 2 character strings

In order to label thousands of random points I need a huge vector of labels. For logistical reasons I would like all strings to have length 2. What I have so far is this:
sl = paste(letters[1],letters,":0",sep="")
for (i in 2:26){
ll = paste(letters[i],letters,":0",sep="")
sl = c(sl,ll)
}
SL = paste(LETTERS[1],LETTERS,":0",sep="")
for (i in 2:26){
ll = paste(LETTERS[i],LETTERS,":0",sep="")
SL = c(SL,ll)
}
S1 = paste(LETTERS[1],0:9,":0",sep="")
for (i in 2:26){
ll = paste(LETTERS[i],1:10,":0",sep="")
SL = c(SL,ll)
}
s1 = paste(letters[1],0:9,":0",sep="")
for (i in 2:26){
ll = paste(letters[i],1:10,":0",sep="")
SL = c(SL,ll)
}
sl=c(sl,SL,S1,s1)
This vector has only 1872 strings. With that in mind, my questions are:
Do you know a more elegant way to build something like this? I am building a package and I find these lines not elegant at all.
Do you know how I can easily increase the length of the vector with more ordinary strings of length 2?
Any help is appreciated.

Limiting yourself to two-character strings and including every ordered pair (repeats allowed) of c(letters, LETTERS, 0:9) gives you a maximum of 62^2 = 3844 possibilities. That full vector, with the ":0" suffix from your code appended, can be generated via
paste0(
as.vector(
outer(c(letters, LETTERS, 0:9),
c(letters, LETTERS, 0:9),
paste0)
),
":0"
)
If you need more labels than that, you will need to either include more characters to select from, or increase the length of the string.
However, I think such a labeling scheme may not be as useful as you hope. Labeling points like this on a plot runs the risk of making the plot unreadable. Are you sure this is the approach you need?
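The count quoted above can be checked directly, and the same idea extends to a third character if 3844 labels are not enough. This is only a sketch: the names chars, labels2, grid3 and labels3 are placeholders of mine rather than part of the answer, and the ":0" suffix is left off for brevity.

```r
# Alphabet used for the labels: 26 lower-case, 26 upper-case, 10 digits
chars <- c(letters, LETTERS, 0:9)

# All ordered pairs of characters (repeats allowed), as in the outer() call
labels2 <- as.vector(outer(chars, chars, paste0))
length(labels2)         # number of two-character labels (62^2)
anyDuplicated(labels2)  # 0 means every label is unique

# Assumption: if two characters are not enough, allow a third one.
# expand.grid() builds every combination; do.call(paste0, ...) glues the columns.
grid3 <- expand.grid(chars, chars, chars, stringsAsFactors = FALSE)
labels3 <- do.call(paste0, grid3)
length(labels3)         # number of three-character labels (62^3)
```

Three characters already give well over two hundred thousand labels, which may be a simpler route than hunting for extra symbols.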

Related

subtract strings with specific width in for recycle

I am trying to use a for loop to extract consecutive substrings from a FASTA sequence.
Here is an example (the real sequence is of course more than 10 thousand characters long):
eg <- ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG
And here is my code
forsubseq <- function(dna){
sta <- for (i in 1:floor(width(dna)/100)) {
seqGC <- Biostrings::subseq(dna, start = 100*i - 99, width = 100) %>%
Biostrings::letterFrequency(letters = "GC", as.prob = TRUE)
}
return(sta)
}
forsubseq(eg)
However, nothing happened when I ran it, which really confused me. What I want is to analyze the GC content of each 100 bp window.
Could anyone kindly offer advice? Thanks.

I don't have the Biostrings library installed (it is distributed through Bioconductor rather than CRAN), but one simplified approach is to split eg at every nth character and then use lapply to analyze the pieces. In this example I counted the occurrences of the substring "GC" with stringr::str_count; you can swap in Biostrings::letterFrequency instead:
eg <- "ACGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
n <- 10 # you would change to 100
blocks <- seq(1, nchar(eg), n) # prep to separate every n base pairs
splits <- substring(eg, blocks, blocks + n - 1) # separate every n base pairs
lapply(splits,
function(x) stringr::str_count(x, "GC")) # replace with Biostrings::letterFrequency
The output is a list with the number of "GC" occurrences in each block of n characters (here, 10). If you want an integer vector instead, wrap the call in unlist(): unlist(lapply(...)).
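If the goal is GC content in the sense of the fraction of bases that are G or C (which is how I read the question), counting the two-letter substring "GC" is not quite the same thing. A base R sketch of the per-block fraction follows. The names starts, blocks and gc_fraction are mine, and the block size of 10 is only for the short demo string.

```r
eg <- "ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
n  <- 10  # block width; the question uses 100

# Start position of every block, and the blocks themselves
starts <- seq(1, nchar(eg), by = n)
blocks <- substring(eg, starts, pmin(starts + n - 1, nchar(eg)))

# Fraction of bases in each block that are G or C
gc_fraction <- vapply(blocks, function(b) {
  bases <- strsplit(b, "")[[1]]
  mean(bases %in% c("G", "C"))
}, numeric(1), USE.NAMES = FALSE)

gc_fraction
```

The last block may be shorter than n; its fraction is taken over the bases it actually contains.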

How to concatenate NOT as character in R?

I want to build the expression iris$Sepal.Length from strings, so that I can use it in a function to get the Sepal Length column from the iris data frame. But when I use paste, as in paste("iris$", colnames(iris[3])), the result is a character string (with quotes), such as "iris$Sepal.Length". I need the result not to be a character string. I have tried noquote(), as.data.frame(), etc., but nothing works.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)

The crux of your problem isn't making that object stop being a string; it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). A small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # shows the string itself, not the column values
data.frame(as.list(eval(parse(text = y)))) # shows the values of iris$Sepal.Length
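As an aside, when the column name is already available as a string, indexing with [[ ]] reaches the same column without building and evaluating code. The sketch below is my own illustration (col_name and x are placeholder names), not part of the answer.

```r
# Column name held as an ordinary character string
col_name <- "Sepal.Length"

# [[ ]] accepts the string directly and returns the column as a vector
x <- iris[[col_name]]

# Same values as the eval(parse()) route and as iris$Sepal.Length
identical(x, eval(parse(text = paste0("iris$", col_name))))
identical(x, iris$Sepal.Length)
```

This form also works inside a loop over names(iris), which is roughly what the function in the question is trying to do.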
That said, I wanted to point out some issues with your function:
The input variable y does nothing (it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over every i in iris but only uses the loop to keep a count. Instead, you could write for (count in seq_along(iris)), which creates the count variable and increments it for you.
It's often better to avoid explicit for loops in R where you can; the apply family exists for running a function on (e.g.) every column of a data frame. As a very simple version of this, apply(iris, 2, table) applies the table function along margin 2 (the columns) of iris and, in this case, returns the results in a list. The idea is to write your function so it does what you want to a single vector, then pass each vector through it with something from the apply() family. For instance:
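cleantable <- function(x) {
  # integer-aligned breaks covering the whole range, so no value falls outside a bin
  myspan <- seq(floor(min(x)), ceiling(max(x))) # if unspecified, by = 1
  # right = FALSE gives [a, b) bins; include.lowest = TRUE keeps the maximum in the last bin
  myfreq <- cut(x, breaks = myspan, right = FALSE, include.lowest = TRUE)
  table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric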
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
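A variation on the same idea, sketched under my own naming (bin_counts, numeric_cols and counts are hypothetical): pick the numeric columns with Filter() instead of hard-coding 1:4, and use lapply() so the data frame is not first converted to a matrix.

```r
# Count values per unit-width bin for a single numeric vector
bin_counts <- function(x) {
  breaks <- seq(floor(min(x)), ceiling(max(x)))
  table(cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE))
}

# Keep only the numeric columns, whatever their positions
numeric_cols <- Filter(is.numeric, iris)

# One table per column, returned as a named list
counts <- lapply(numeric_cols, bin_counts)
counts$Sepal.Length
```

Each table should account for every row of iris, which is a cheap sanity check that no value fell outside the breaks.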

How to fix the error "there are more elements are supplied than there are to replace" in a for loop in R?

Can someone help me with this? I got the cut_interval code to work for a single test column, but can't seem to get it to work in a for loop to have it run on all of the columns.
#Bin worker data into three groups (low/medium/high %methylation) for the cpg cg10757709
#This code works
cg10757709_interval <- cut_interval(cpgs$cg10757709, n=3, labels = c("low","med","high"))
View(cg10757709_interval)
#Write a loop so that data for each of the significant cpgs will be binned into low, medium, and high groups
#This code gives an error (that there are more elements are supplied than there are to replace)
cpgs_interval <- matrix(ncol = length(cpgs), nrow = 29)
for (i in seq_along(cpgs)) {
cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
View(cpgs_interval)
The error says "Error in cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n = 3, labels = c("low",  : more elements supplied than there are to replace". Should I not be using a matrix for cpgs_interval? Or is something else the problem? I'm rather new to writing for loops. Thanks.

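In your example, cpgs_interval is a matrix, and cpgs_interval[[i]] refers to a single cell of it, so assigning a whole column's worth of values to that cell fails. If you want to put the variable into the ith column of the matrix, you could do: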
for (i in seq_along(cpgs)) {
cpgs_interval[,i] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
That said, you might be better off making cpgs_interval a data frame; then each column stays a factor instead of being coerced to fit the matrix.
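The data frame route could look roughly like this. To keep the sketch free of extra packages it uses base R's cut() with breaks = 3 as a stand-in for ggplot2::cut_interval(), and the cpgs data frame here is made-up toy data with invented column names, not the poster's data.

```r
set.seed(1)
# Hypothetical stand-in for the real methylation data: 29 rows, 2 CpG columns
cpgs <- data.frame(cg_a = runif(29), cg_b = runif(29))

# Bin every column into three equal-width groups; lapply() keeps each result a factor
cpgs_interval <- as.data.frame(
  lapply(cpgs, cut, breaks = 3, labels = c("low", "med", "high"))
)

str(cpgs_interval)
```

Because the result is built column by column, there is no need to preallocate a matrix or to know the number of rows in advance.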

Poisson Process algorithm in R (renewal processes perspective)

I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson process from the renewal-process perspective. I've done my best in R, but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays, and only the last one is kept in tarr afterwards.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.

Adding on to the previous comment, a few things happen in the MATLAB script that do not happen in the R version:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]; as I understand it, this adds another row to your matrix and fills it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need a different approach in R, since MATLAB and R handle this kind of thing differently.
One thing that stands out is that you seem to treat R's tarr[i] and MATLAB's tarr(i, :) as equivalent, but they are very different. I think what you want is all the columns of row i, which in R is tarr[i, ].
The use of min also differs: R's min() returns the minimum of the whole matrix (a single number), whereas MATLAB's min() returns the minimum of each column. For this in R you can use Rfast::colMins from the Rfast package.
I am not very familiar with stairs, but something like ggplot2::qplot(..., geom = "step") may work.
I have tried to create something that works in R, but I am not sure what the required output is. Hopefully some of the basics help you get it done on your side. Below is a quick attempt:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I sorted min_for_each_col so that the plot is an actual stair plot and not some random stepwise plot. I think there is a problem, though: in the MATLAB code, 0:size(tarr2, 1)-1 gives the number of rows less 1, and I can't figure out why, if we take colMins (and there are 40 columns), we would get around 20 steps. But I might be completely misunderstanding! Also, I changed T to T0, since in R T is predefined as TRUE and it is best not to overwrite it.
Hope this helps!

I downloaded GNU Octave today to actually run the MATLAB code. After watching it run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweaks that seem to be closer to the original:
qplot(seq_along(min_for_each_col), c(1:length(min_for_each_col)), geom="step", ylab="", xlab="")
#or with ggplot2
# data frame holding the minima and their index (no pipe operator needed)
df1 <- data.frame(min_for_each_col = min_for_each_col,
                  index = seq_along(min_for_each_col))
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
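For comparison, the same translation can be sketched in base R only, with apply() for the row minima and plot(type = "s") in place of stairs. The seed, the axis labels and the choice to drop Rfast and ggplot2 are my own; the loop mirrors the MATLAB code line by line.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
nproc  <- 40
T0     <- 3
lambda <- 4

# Row 1 holds the starting times (all zero); each new row adds one waiting time
tarr <- matrix(0, nrow = 1, ncol = nproc)
i <- 1
while (min(tarr[i, ]) <= T0) {
  tarr <- rbind(tarr, tarr[i, ] - log(runif(nproc)) / lambda)
  i <- i + 1
}

# MATLAB's min(tarr') is the minimum of each row of tarr
X <- apply(tarr, 1, min)

# Equivalent of stairs(X, 0:size(tarr, 1) - 1)
plot(X, seq_along(X) - 1, type = "s", xlab = "time", ylab = "count")
```

Growing tarr with rbind() inside the loop is fine at this size; for much longer runs a preallocated matrix would be the usual alternative.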

I'm not too familiar with renewal processes or MATLAB, so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates an empty vector with mode 'list'.
The seventh line starts a while loop. I suspect this line is supposed to say "while the minimum value of tarr is less than or equal to T ...",
or "while i is less than or equal to T ...".
What it actually does is take the minimum of a single logical value (tarr[i] <= T). This can still work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), the same code can run differently each time it is executed. This might explain why the code "prints an aleatory number of arrays".
The eighth line overwrites tarr with the result of the computation on the right. It takes the single value tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda), which gives 40 different values. Only the forty values from the last pass through the loop remain stored in tarr.
If you want to keep all forty values from every pass through the loop, I'd suggest storing them in, say, a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for(i in 1:nrow(yourMatrix)){
# computations that produce the new row go here
yourMatrix[i,] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector on each pass, like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop.
Then, after the loop, tarr2 is assigned 1/tarr. Again, this will be the 40 values from the last pass through the loop (line 8).
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution; if you're looking for a Poisson distribution, see: Poisson
Hope this helped! Let me know if there's more I can do to help.
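On the runif point, it may help to see what the MATLAB expression -log(rand)/lambda produces. By the inverse-transform method it yields exponential waiting times with rate lambda, which are the gaps between events of a Poisson process. The sketch below compares it with R's rexp(); the seed, the sample size and the variable names are my own choices.

```r
set.seed(1)
lambda <- 4
n <- 1e5

# Inverse-transform draw, as in the MATLAB code: -log(U) / lambda with U ~ Uniform(0, 1)
gaps_inverse <- -log(runif(n)) / lambda

# Direct draw from the exponential distribution with the same rate
gaps_rexp <- rexp(n, rate = lambda)

# Both sample means should sit close to the theoretical mean 1 / lambda
c(inverse = mean(gaps_inverse), rexp = mean(gaps_rexp), theory = 1 / lambda)
```

So using runif here does not by itself conflict with simulating a Poisson process; replacing the expression with rexp(nproc, rate = lambda) would be an equivalent way to write the same step.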

Project Euler #22, off by 158,055

I'm currently working through Project Euler problem 22 which has the following challenge:
Using names.txt (right click and 'Save Link/Target As...'), a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
The file can be downloaded using the above link. I've written the below code to solve the problem:
rm(list=ls())
library(splitstackshape)
#read in data from http://projecteuler.net/problem=22
names=sort(t(read.table("names.txt",sep=",")))
#letters to numbers conversion vectors
from=LETTERS[seq(1,26)]
to=as.character(seq(1,26))
#function to replace all letters with corresponding numbers
gsub2 = function(pattern, replacement, x, ...){
for(i in 1:length(pattern))
x = gsub(pattern[i],paste(replacement[i]," ",sep=""), x, ...)
x
}
#create df, run function, create row number var for later calculation
df=data.frame(names=names)
df$name.num = gsub2(from,to,df$names)
df$rownum=seq(1,nrow(df))
#split letter values, add across rows, multiply by row number to get name score and sum
df=concat.split(df,"name.num"," ")
df$name.sum=rowSums(df[,4:15],na.rm=TRUE)
df$name.score=df$name.sum*df$rownum
print(sum(df$name.score,na.rm=TRUE))
My result appears to be off by 158,055 (I get 871040227 where it should be 871198282). I've spot-checked parts of it, and it appears that the list of names is sorted correctly and that the name scores are computed correctly (for instance, I also get COLIN=49714). I've also read other threads troubleshooting this problem on SO, but they're mostly in Python and the problems seem to be different from mine. My suspicion is that either the names.txt file is somehow not being read in correctly or that the method I'm using to split df$name.num (concat.split from the splitstackshape package) is incorrect, though it seems to be working correctly.
Any ideas?
Also, any suggestions on how to improve/simplify my code are more than welcome!

I used to have fun doing the Euler problems in R. Here's my solution to 22.
namesscore<-function(name) {
score<-0;
for(s in 1:nchar(name)) {
score<-score + which(substr(name,s,s)==LETTERS[1:26])
}
score
}
names<-scan("prob022.txt", "character", sep=",", quote="\"", na.strings="")
name.pos <- rank(names)
name.val <- sapply(names,namesscore)
sum(name.pos*name.val)
# [1] 871198282
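There is a name "NA" in the list, which may cause you problems: by default read.table() reads the string NA as a missing value, and sort() then drops it. That is why the scan() call above sets na.strings = "".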

As pointed out by @MrFlick, there's an 'NA' in the names list, so you need to handle it.
x = sort(scan('http://projecteuler.net/project/names.txt', what = '', sep =',', na.strings = ""))
s = sapply(x, function(w){
match(w, x) * sum(match(strsplit(w, '')[[1]], LETTERS))
})
print(sum(s))
# 871198282
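The scoring logic can be tried without the download on a tiny made-up list that includes the troublesome name. Everything here (the four names, raw, nm, letter_value, total) is a toy illustration of mine, not the Project Euler data.

```r
# Hypothetical four-name input in the same quoted, comma-separated format
raw <- '"MARY","NA","COLIN","ANN"'

# na.strings = "" keeps the name "NA" as text instead of a missing value
nm <- sort(scan(text = raw, what = "", sep = ",", na.strings = "", quiet = TRUE))

# Alphabetical value of one name: A = 1, B = 2, ..., Z = 26
letter_value <- function(w) sum(match(strsplit(w, "")[[1]], LETTERS))

vals  <- vapply(nm, letter_value, numeric(1))
total <- sum(seq_along(nm) * vals)  # position in sorted list times value
total
```

Here the sorted order is ANN, COLIN, MARY, NA, so the total is 1*29 + 2*53 + 3*57 + 4*15 = 366. Dropping "NA" would change the total, which is the same kind of shortfall seen in the question.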
