The following SQLite database is a tiny replica of a huge database that I'm working on.
library(RSQLite)
library(inborutils)
library(tibble)
library(dplyr)
library(dbplyr)
col1 <- c(1:20)
col2 <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
"L", "M", "N", "O", "P", "Q", "R", "S", "T")
col3 <- c(21:40)
database <- dbConnect(SQLite(), dbname = "testDB.sqlite")
table1 <- tibble(col1, col2, col3)
dbWriteTable(database, "testDBtable", table1)
bd <- tbl(database, "testDBtable")
I want to extract a column and factor its values. I'm facing a problem with the extraction process, either because I'm missing something or because I haven't understood the process as a whole.
The following code works to extract one column, but it is very slow (when I use it on my real database, not on this tiny replica):
>pull(bd, col2)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
"Q" "R" "S"
[20] "T"
whereas this code returns NULL:
>bd$col2
NULL
Any idea why this returns NULL?
I want to use the code as follows:
bd$col2 <- ordered(bd$col2, levels = lvl.100260, labels = lbl.100260)
because this version is awfully slow:
bd %>%
  pull(col2) %>%
  ordered(levels = lvl.100260, labels = lbl.100260)
This matters especially because the same code must be run a huge number of times.
You use a package, inborutils, that is not on CRAN. I ran your code without it, so the results below might not match yours.
The problem is that bd does not have an element called col2:
> names(bd)
[1] "src" "ops"
It has class
[1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy"
[5] "tbl"
so it's not based on a data frame; it needs to query the database to extract data. I think it's possible to override the $ operator, but I suspect that if you did, you'd find bd$col2 just as slow as pull(bd, col2).
For the more general question of how to speed it up, I don't think there are any easy answers. Probably you want to work with real dataframes (or even better, matrices) for speed, but it sounds as though you'll run into memory limitations if you try to convert the whole database at once. The general advice would be to profile your code to find the bottlenecks, and think about how to improve them.
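If the bottleneck is pulling a whole lazy tbl through dbplyr, one thing worth trying (a sketch against the tiny replica above; LETTERS is only a stand-in for your lvl.100260/lbl.100260 vectors) is to fetch just the one column with a plain SQL query and build the ordered factor once on the client side:

```r
library(RSQLite)

# rebuild the tiny replica in an in-memory database
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "testDBtable",
             data.frame(col1 = 1:20, col2 = LETTERS[1:20], col3 = 21:40))

# fetch only the one column with a plain SQL query ...
col2 <- dbGetQuery(con, "SELECT col2 FROM testDBtable")$col2

# ... and build the ordered factor once, on the client side
col2_f <- ordered(col2, levels = LETTERS[1:20])  # stand-in for lvl.100260/lbl.100260

dbDisconnect(con)
```

This avoids the overhead of the lazy-tbl machinery for a simple single-column fetch, though the factor construction itself still happens in R.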
Related
I am trying to write a function that counts the characters between "a" "t" "g" and "t" "a" "g", "t" "g" "a", or "t" "a" "a" inside a vector. But my code gets stuck in the while loop. An example would be x = "a" "a" "a" "t" "a" "t" "g" "t" "c" "g" "t" "t" "t" "t" "a" "g". In this example the code should count 6 characters between "a" "t" "g" and "t" "a" "g". Any help will be appreciated :).
orfs <- function(x, p) {
  count <- 0
  cntorfs <- 0
  n <- length(x)
  v <- n - 2
  for (i in 1:v) {
    if (x[i] == "a" && x[i + 1] == "t" && x[i + 2] == "g") {
      k <- i + 3
      w <- x[k]
      y <- x[k + 1]
      z <- x[k + 2]
      while (((w != "t") && (y != "a") && (z != "g")) ||
             ((w != "t") && (y != "a") && (z != "a")) ||
             ((w != "t") && (y != "g") && (z != "a")) ||
             (i + 2 > v)) {
        count <- count + 1
        k <- k + 1
        w <- x[k]
        y <- x[k + 1]
        z <- x[k + 2]
      }
    }
    if (count > p) {
      cntorfs <- cntorfs + 1
    }
    if (count != 0) {
      count <- 0
    }
  }
  cat("orf:", cntorfs)
}
This is a very inefficient and un-R-like way to count the number of characters between two patterns.
Here is an alternative using gsub that should get you started and can be extended to account for the other stop codons:
x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
nchar(gsub("[actg]*atg([actg]*)tag[actg]*", "\\1", paste0(x, collapse = "")))
#[1] 6
A more robust and general approach can be found here, making use of Biostrings::matchPattern. I would strongly advise against reinventing the wheel here, and instead recommend using some of the standard Bioconductor packages that were developed for exactly this kind of task.
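To sketch how the gsub idea extends to all three stop codons (my own extension of the pattern above, not part of the original answer: lazy quantifiers make the match start at the first "atg" and stop at the first stop codon):

```r
x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
s <- paste0(x, collapse = "")

# lazy quantifiers (*?): begin at the first "atg", capture until the first stop codon
nchar(sub("[actg]*?atg([actg]*?)(tag|tga|taa)[actg]*", "\\1", s))
#[1] 6
```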
This question already has an answer here:
Collapse vector to string of characters with respective numbers of consequtive occurences
(1 answer)
Closed 4 years ago.
I have a large sequence of strings containing only the following characters
"M", "D", "A"
such as:
"M" "M" "A" "A" "D" "D" "M" "D" "A"
and I would like to compress it to:
M2A2D2M1D1A1
in R. Googling has led me to this (a java solution) but before implementing it, it would be interesting to check if I can find something ready online. Thanks!
R function rle() is your friend.
testVector <- sample(c("M", "D", "A"), 20, replace = TRUE)
res <- rle(testVector)
compressedString <- paste(res$values, res$lengths, collapse = "", sep = "")
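Applied to the exact vector from the question, this reproduces the requested string:

```r
v <- c("M", "M", "A", "A", "D", "D", "M", "D", "A")
res <- rle(v)

# rle() returns the run values and their lengths; interleave them into one string
paste(res$values, res$lengths, collapse = "", sep = "")
#[1] "M2A2D2M1D1A1"
```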
Assume I have a list with 5 elements:
list <- list("A", "B", "C", "D", c("E", "F"))
I am trying to turn this into a simple character vector using purrr, combining list elements that contain two strings into one, separated by a delimiter such as '-'. The output should look like this:
chr [1:5] "A" "B" "C" "D" "E-F"
I've tried a ton of approaches including paste, paste0, and str_c. Where I am getting hung up is that map seems to apply the function to each individual string of a list element rather than to the group of strings in an element (when there is more than one). The closest I've gotten is:
list2 <- unlist(map(list, str_flatten))
str(list2)
This returns:
chr [1:5] "A" "B" "C" "D" "EF"
where I need a hyphen between E and F:
chr [1:5] "A" "B" "C" "D" "E-F"
When I try to pass an extra argument to str_flatten(), such as str_flatten(list, collapse = "-"), it doesn't work. The big problem is I can't figure out what argument to pass to str_flatten to group two strings of a given element of a list.
You almost had it. Try
library(purrr)
library(stringr)
unlist(map(lst, str_flatten, collapse = "-"))
#[1] "A" "B" "C" "D" "E-F"
You could also use map_chr
map_chr(lst, str_flatten, collapse = "-")
Without additional packages and with thanks to #G.Grothendieck you could do
sapply(lst, paste, collapse = "-")
data
lst <- list("A", "B", "C", "D", c("E", "F"))
We can also use map_chr and paste.
library(purrr)
lst <- list("A", "B", "C", "D", c("E", "F"))
map_chr(lst, ~paste(.x, collapse = "-"))
# [1] "A" "B" "C" "D" "E-F"
I have the following list
list <- c("AB", "G", "H")
Now I have certain letters that should be replaced, e.g. B and H.
So what I have now is:
replace_letter <- c("B", "H")
for (letter in replace_letter) {
  for (i in list) {
    print(i)
    print(letter)
    if (grepl(letter, i)) {
      new_value <- gsub(letter, "XXX", i)
      print("yes")
    } else {
      print("no")
    }
  }
}
However, the XXX in my code should be replaced by certain lookup values: B should become B+ and H should become H**. So I need some kind of dictionary to replace the XXX with something specific.
Does anybody have suggestion how I can include this in the code above?
Data and dictionary
dictionary <- data.frame(From = LETTERS,
                         To = LETTERS[c(2:length(LETTERS), 1)],
                         stringsAsFactors = FALSE)
set.seed(1234)
data <- LETTERS[sample(length(LETTERS), 10, replace = T)]
Here is the replace-function
replace <- function(input, dictionary) {
  dictionary[which(input == dictionary$From), ]$To
}
Apply it to data:
sapply(data, replace, dictionary = dictionary)
# C Q P Q W Q A G R N
# "D" "R" "Q" "R" "X" "R" "B" "H" "S" "O"
You just have to adjust your dictionary according to your needs.
I use the function plyr::mapvalues to do this. The function takes three arguments, the strings to do the replacement on, and two vectors from and to that define the replacement.
e.g.
plyr::mapvalues(letters[1:3], c("b", "c"), c("x", "y"))
# [1] "a" "x" "y"
I switched to the newer dplyr library, so I'll add another answer here:
In an interactive session I would enter the replacements in dplyr::recode directly:
dplyr::recode(letters[1:3], "b"="x", "c"="y")
# [1] "a" "x" "y"
Using a pre-defined dictionary, you'll have to use the splice operator (UQS(), written !!! in current rlang) to unquote the dictionary due to the tidy-eval semantics of dplyr:
dict <- c("b"="x", "c"="y")
dict
# b c
# "x" "y"
dplyr::recode(letters[1:3], UQS(dict))
# [1] "a" "x" "y"
I have some data:
> (dput(head(data$Date,10)))
c("18.12.2003", "06.04.2005", "06.04.2005", "07.04.2005", "27.05.2005",
"16.06.2009", "16.06.2009", "21.12.2009", "22.12.2009", "09.06.2011"
)
[1] "18.12.2003" "06.04.2005" "06.04.2005" "07.04.2005" "27.05.2005"
[6] "16.06.2009" "16.06.2009" "21.12.2009" "22.12.2009" "09.06.2011"
> (dput(head(data$Art,10)))
c("V", "K", "K", "K", "Zuteilung", "V", "K", "K", "K", "V")
[1] "V" "K" "K" "K" "Zuteilung" "V"
[7] "K" "K" "K" "V"
As you can see to every date there is a string value.
I can count all the occurrences of K with:
> (length(grep("K", data$Art)))
I want to plot the frequencies of K grouped by date.
The following plots all dates, but it does not filter for the K strings:
hist(as.Date(data$Date, '%d.%m.%Y'), breaks = "days", freq = TRUE)
I really appreciate your answers!
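A possible approach (a sketch, assuming the goal is a histogram of only those dates whose Art value is "K"): subset the date vector with a logical comparison before handing it to hist, which accepts Date vectors when breaks = "days" is given:

```r
Date <- c("18.12.2003", "06.04.2005", "06.04.2005", "07.04.2005", "27.05.2005",
          "16.06.2009", "16.06.2009", "21.12.2009", "22.12.2009", "09.06.2011")
Art <- c("V", "K", "K", "K", "Zuteilung", "V", "K", "K", "K", "V")
data <- data.frame(Date, Art)

# keep only the rows where Art is exactly "K", then plot their dates
k_dates <- as.Date(data$Date[data$Art == "K"], format = "%d.%m.%Y")
hist(k_dates, breaks = "days", freq = TRUE)
```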