In R, I want to find out the effect of character string length on the computation time of a certain operation. For this, I need random character strings of different lengths. All I can think of right now is:
cases1 <- letters[sample(15)]
cases2 <- paste(letters[sample(15)], letters[sample(15)], sep="")
cases3 <- paste(letters[sample(15)], letters[sample(15)], letters[sample(15)], sep="")
How do I automate that? I don't want to keep copy-pasting...
Or does anyone have a better idea?
Try
n <- 3
do.call(paste0, as.data.frame(replicate(n, letters[sample(15)])))
If you want, say, lengths 1:3:
n1 <- 1:3
lapply(n1, function(.n) do.call(paste0,
       as.data.frame(replicate(.n, letters[sample(15)]))))
Or, as @Berry showed in the comments:
apply(replicate(3, letters[sample(15)]), MARGIN=1, paste, collapse="")
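Building on that apply idea, if you also want strings whose characters can repeat and that draw from the whole alphabet, a small helper along the lines of the sketch below may be convenient (rand_strings is just a made-up name):
rand_strings <- function(m, len) {
  # m x len matrix of random letters (sampled with replacement), one row per string
  chars <- replicate(len, sample(letters, m, replace = TRUE))
  apply(chars, MARGIN = 1, paste, collapse = "")
}
rand_strings(15, 3)  # 15 random strings, 3 characters each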
Related
I am trying to run a for loop inside a function to extract multiple substrings in order from a FASTA sequence.
Here is an example (of course the real one is more than 10 thousand bases):
eg <- "ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
And here is my code
forsubseq <- function(dna){
  sta <- for (i in 1:floor(width(dna)/100)) {
    seqGC <- Biostrings::subseq(dna, start = 100*i - 99, width = 100) %>%
      Biostrings::letterFrequency(letters = "GC", as.prob = TRUE)
  }
  return(sta)
}
forsubseq(eg)
However, nothing happened after running it. It really confused me... What I want to obtain is the GC content of each 100-bp window...
Could anyone kindly offer advice? Thanks.
The library Biostrings is not available for the most recent version of R, but one simplified approach would be to split eg at every n-th character and then use lapply to analyze the pieces. In this example I counted the number of "GC" pairs using str_count, since I don't have the Biostrings library, but you can change it to the Biostrings::letterFrequency function:
eg <- "ACGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
n <- 10 # you would change to 100
blocks <- seq(1, nchar(eg), n) # prep to separate every n base pairs
splits <- substring(eg, blocks, blocks + n - 1) # separate every n base pairs
lapply(splits,
       function(x) stringr::str_count(x, "GC")) # replace with Biostrings::letterFrequency
The output is a list counting the number of "GC" pairs in each block of n characters (here, 10). If you want a vector of integers instead, simply wrap the call in unlist(lapply(...)).
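As an aside on why the original function returned nothing: a for loop in R returns NULL, so sta <- for (...) assigns NULL and the seqGC value computed inside is discarded. If Biostrings is installed, a sketch of a fixed version could look like this (gc_per_block is a hypothetical name, and the input is assumed to be a DNAString):
library(Biostrings)

gc_per_block <- function(dna, n = 100) {
  # start position of each full n-bp block; a trailing partial block is dropped
  starts <- seq(1, length(dna) - n + 1, by = n)
  vapply(starts, function(s)
    unname(letterFrequency(subseq(dna, start = s, width = n),
                           letters = "GC", as.prob = TRUE)),
    numeric(1))
}

gc_per_block(DNAString("ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"), n = 10)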
I am trying to create a matrix of coordinates (indexes) from which I randomly pick one using the sample function. I then use that pick to select a cell in another matrix. What is the best way to do this? The trouble is how to store these integers in the matrix so that they are easy to separate. Right now I store them as strings with a comma, which I then split. Someone suggested I use a pair, or a string, but I cannot seem to get these to work with a matrix. Thanks!
EDIT: What I currently have looks like this (changed a little to make sense out of context):
probs <- matrix(c(0,   0,   0.6, 0,   0,
                  0,   0.7, 1,   0.7, 0,
                  0.6, 1,   0,   1,   0.6,
                  0,   0.7, 1,   0.7, 0,
                  0,   0,   0.6, 0,   0), 5, 5)
cordsMat <- matrix("",5,5)
for (x in 1:5){
  for (y in 1:5){
    cordsMat[x,y] <- paste(x, y, sep=",")
  }
}
cords <- sample(cordsMat, 1, prob = probs)
cordsVec <- unlist(strsplit(cords,split = ","))
cordX <- as.numeric(cordsVec[1])
cordY <- as.numeric(cordsVec[2])
otherMat[cordX,cordY]
It sort of works, but I would also be interested in a better way, as this will get repeated a lot.
If you want to set the probabilities, that can easily be done by providing them to sample:
# creating the matrix
matrix(sample(rep(1:6, 15:20), 25), 5) -> other.mat
# set the probs vec
probs <- c(0,0,0.6,0,0,
0,0.7,1,0.7,0,
0.6,1,0,1,0.6,
0,0.7,1,0.7,0,
0,0,0.6,0,0)
# the coordinates matrix
mat <- as.matrix(expand.grid(1:nrow(other.mat),1:ncol(other.mat)))
# sampling a row index randomly
sample(nrow(mat), 1, prob=probs) -> rand
# getting the value
other.mat[mat[rand,1], mat[rand,2]]
#[1] 6
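A side note on the last step: base R also lets a two-column matrix of (row, col) pairs index another matrix directly, so the coordinates never need to be split by hand:
# same lookup as above; a 1 x 2 matrix of (row, col) selects that element
other.mat[mat[rand, , drop = FALSE]]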
I am trying to create 1000 variables, which I want to name with the index number, but I don't know how to create these new variables.
for(i in 1:1000) {
  Ui <- rnorm(200, 0, 1)
}
This is a common sort of thing that people want to do, especially when they are coming from other programming languages. However, there are better ways to accomplish the same thing, and you should not follow recommendations to use assign; that is bad advice that you will likely regret later on.
The way we do this sort of thing in R is to use lists, specifically named lists:
x <- replicate(1000, rnorm(200, 0, 1), simplify = FALSE)
x <- setNames(x,paste0("A",seq_along(x)))
Now x is a named list of length 1000, each element is a vector of length 200 from a normal(0,1) distribution.
You can refer to each one via x[[1]] or x[["A1"]] as needed. Additionally, since they are all in the same object, you can operate on them easily as a group using tools like lapply.
Pretty much any time you find yourself wanting to create a sequence of objects with similar names, that should be a signal to you that you should be using a list instead.
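For example, you can summarise all 1000 vectors in one call:
# mean of every simulated vector; sapply simplifies the result to a
# named numeric vector with names A1 ... A1000
means <- sapply(x, mean)
head(means)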
There is no point in cluttering your environment with so many variables; try storing them in a named list instead:
l1 <- setNames(lapply(1:5, function(x) rnorm(5)), paste0("A", 1:5))
l1
#$A1
#[1] 0.4951453 -1.4278665 0.5680115 0.3537730 -0.7757363
#$A2
#[1] -0.11096037 0.05958700 0.02578168 1.00591996 0.54852030
#$A3
#[1] 0.1058318 0.6988443 -0.8213525 -0.1072289 0.8757669
#$A4
#[1] -0.6629634 0.8321713 -0.3073465 -0.2645550 -1.0064132
#$A5
#[1] 2.2191246 0.2054360 -0.1768357 1.6875302 -1.1495807
Now you can access individual list element as
l1[["A1"]]
#[1] 0.4951453 -1.4278665 0.5680115 0.3537730 -0.7757363
Alternatively, another method is to generate all the numbers together and then split them into a list:
groups = 5
each = 5
setNames(split(rnorm(groups * each), rep(seq_len(groups), each = each)),
         paste0("A", seq_len(groups)))
I agree with the others that this is not a good idea. Anyway, to answer your question, this is how you would do it:
k <- 1000 # number of variables
n <- 200 # sample size of each variable
for(i in 1:k){
  assign(paste0("variable", i), rnorm(n, 0, 1))
}
variable1
#[1] -0.012947062  0.728284959 -1.627796366  0.003471491 ...
However, personally I would prefer another solution. Both answers so far suggest using lists. I find lists quite cumbersome, especially if you are new to R. So I would suggest creating a matrix where every column contains one variable.
# creates a matrix
m <- matrix(rep(NA, n*k), ncol = k)
# generates rnorm() in each column
for(i in 1:k){
  m[ , i] <- rnorm(n, 0, 1)
}
# now you can name the columns
colnames(m) <- paste0("variable", 1:k)
m
#       variable1   variable2 ...
#[1,]  0.30950749 -2.07388046
#[2,] -1.13232330 -0.55511476
#...
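As an aside, since rnorm is vectorized, the loop is not strictly needed; the whole matrix can be filled and named in one call:
# draw all n*k values at once and name the columns in the same step
m <- matrix(rnorm(n * k), nrow = n, ncol = k,
            dimnames = list(NULL, paste0("variable", 1:k)))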
I have a perhaps basic question, and I have searched the web. I have a problem reading files, though I managed to get my files read by following @Konrad's suggestions, which I appreciate: How to get R to read in files from multiple subdirectories under one large directory?
It is a similar problem; however, I have not resolved it.
My problem:
I have a large number of files with the same name ("tempo.out") in different folders. Each tempo.out has the same format, 1048 lines and 5 columns, with the headers:
id X Y time temp
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
                        recursive=T,
                        pattern="tempo.out",
                        full.names=T)
readDatFile <- function(f) {
  dat.fl <- read.table(f)
}
data.filesf <- sapply(dat.files, readDatFile)
# I might not have the right syntax in subs5:
subs5 <- sapply(data.filesf,`[`,5)
matr5 <- do.call(rbind, subs5)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- rowQuantiles(matr5, probs=probs)
print(q)
I want to extract the fifth column (temp) of each of those thousands of files and make calculations such as quantiles.
I first tried to read all subfiles in "ress". That gave no error, but my main problem is that data.filesf is not a matrix but a list, and the 5th column is not what I expected. Then the following:
matr5 <- do.call(rbind, subs5)
is also not giving the required values/results.
What could be the best way to get columns into what will become a huge matrix?
Try
lapply(data.filesf, `[`, , 5)
Hope this will help
Consider extending your defined function, readDatFile, to extract the fifth column, temp, and assign directly to a matrix with sapply or vapply (since you know the needed structure ahead of time: a numeric vector whose length equals the number of rows, 1048). Then run the needed rowQuantiles:
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
recursive=T,
pattern="tempo.out",
full.names=T)
readDatFile <- function(f) read.table(f, header=TRUE)$temp # OR USE read.csv(f)[[5]]
matr5 <- sapply(dat.files, readDatFile, USE.NAMES=FALSE)
# matr5 <- vapply(dat.files, readDatFile, numeric(1048), USE.NAMES=FALSE)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- matrixStats::rowQuantiles(matr5, probs=probs)  # rowQuantiles() is from the matrixStats package
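If matrixStats is not an option, base R's apply computes the same quantiles, just transposed (one column per input row), hence the t():
# base-R fallback for matrixStats::rowQuantiles
q <- t(apply(matr5, 1, quantile, probs = probs))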
I am working on a problem in which I have two data frames, data and abbreviations, and I would like to replace all the abbreviations present in data with their respective full forms. Until now I was using for loops in the following manner:
abb <- c()
for(i in 1:length(data$text)){
  for(j in 1:length(Abbreviation$abb)){
    abb <- paste("(\\b", Abbreviation$abb[j], "\\b)", sep="")
    data$text[i] <- gsub(abb, Abbreviation$Fullform[j], tolower(data$text[i]))
  }
}
The abbreviation data frame looks like the one below and can be generated using the following code:
Abbreviation <- c(c("hru", "how are you"),
                  c("asap", "as soon as possible"),
                  c("bf", "boyfriend"),
                  c("ur", "your"),
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=T), row.names=NULL)
names(Abbreviation) <- c("abb","Fullform")
And data is merely a data frame with one column holding a text string in each row, which can also be generated using the following code:
data <- data.frame(unlist(c("its good to see you, hru doing?",
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But the data has now grown to almost 50000 rows, and I am having difficulty processing it because the two nested for loops make the process very slow. Can you suggest some better alternatives to the for loops, with an example of how to use them in this situation? If this problem can be solved faster via vectorization, please suggest how to do that as well.
Thanks for the help!
This should be faster, and without side effects.
mapply(function(x, y){
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, Abbreviation$abb, Abbreviation$Fullform)
gsub is vectorized over the text argument, so you can give it a whole character vector in which matches are sought; here I give it data$text.
I use mapply to avoid the side effects of for.
First of all, there is clearly no need to compile the regular expressions anew with each iteration of the loop. Also, there is no need to actually loop over data$text: in R, you can very often use a vector where a single value would do, and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf( "(\\b%s\\b)", Abbreviation$abb )
for( j in 1:length( Abbreviation$abb ) ) {
  data$text <- gsub( Abbreviation$regex[j],
                     Abbreviation$Fullform[j], data$text,
                     ignore.case= T )
}
The above code works with the example data.
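If adding a dependency is acceptable, the stringr package offers an even more compact route: str_replace_all accepts a named character vector of pattern = replacement pairs and applies all of them in one vectorized call. A sketch under that assumption:
library(stringr)

# named vector: names are regex patterns, values are the replacements
replacements <- setNames(as.character(Abbreviation$Fullform),
                         paste0("\\b", Abbreviation$abb, "\\b"))
data$text <- str_replace_all(tolower(data$text), replacements)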