I have some data:
> (dput(head(data$Date,10)))
c("18.12.2003", "06.04.2005", "06.04.2005", "07.04.2005", "27.05.2005",
"16.06.2009", "16.06.2009", "21.12.2009", "22.12.2009", "09.06.2011"
)
[1] "18.12.2003" "06.04.2005" "06.04.2005" "07.04.2005" "27.05.2005"
[6] "16.06.2009" "16.06.2009" "21.12.2009" "22.12.2009" "09.06.2011"
> (dput(head(data$Art,10)))
c("V", "K", "K", "K", "Zuteilung", "V", "K", "K", "K", "V")
[1] "V" "K" "K" "K" "Zuteilung" "V"
[7] "K" "K" "K" "V"
As you can see, every date has an associated string value.
I can count all the "K" values with:
> length(grep("K", data$Art))
I want to plot the frequency of "K" for each date.
With the following I can plot all dates; however, it does not restrict the counts to the "K" strings:
hist(as.Date(data$Date, '%d.%m.%Y'), breaks="days", freq=TRUE)
I really appreciate your answers!
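One way to get there (a minimal base-R sketch using the sample data above, and assuming the columns are named Date and Art as shown) is to filter to the "K" rows first and only then tabulate by parsed date:

```r
# Rebuild the sample data from the question
data <- data.frame(
  Date = c("18.12.2003", "06.04.2005", "06.04.2005", "07.04.2005", "27.05.2005",
           "16.06.2009", "16.06.2009", "21.12.2009", "22.12.2009", "09.06.2011"),
  Art  = c("V", "K", "K", "K", "Zuteilung", "V", "K", "K", "K", "V")
)

# Keep only the "K" rows, parse the dates, and count occurrences per date
k_dates <- as.Date(data$Date[data$Art == "K"], format = "%d.%m.%Y")
counts  <- table(k_dates)

barplot(counts, las = 2, main = "Frequency of K per date")
```

table gives one count per distinct date, so barplot draws exactly the per-date frequencies of "K"; hist with breaks="days" would instead bin every row regardless of Art.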
I'm trying to make the complementary sequence of a DNA chain stored in a vector.
It's supposed to change "A" for "T" and "C" for "G", and vice versa. I need this to happen to the first vector and print the complementary sequence correctly. This is what I tried, but I got stuck:
pilot_sequence <- c("C","G","A","T","C","C","T","A","T")
complement_sequence_display <- function(pilot_sequence){
  complement_chain_Incom <- gsub("A", "T", pilot_sequence)
  complement_chain <- paste(complement_chain_Incom, collapse = "")
  cat("Complement sequence: ", complement_chain, "\n")
}
complement_chain_Incom <- gsub("A","T", pilot_sequence)
complement_chain <- paste(complement_chain_Incom, collapse= "")
complement_sequence_display(pilot_sequence)
I got CGTTCCTTT as the answer; only the second and the penultimate T are correct. How do I fix the rest of the letters?
The pilot_sequence vector is of character type, and the function displays no execution errors.
This is an ideal use case for the chartr function:
chartr("ATGC","TACG",pilot_sequence)
output:
[1] "G" "C" "T" "A" "G" "G" "A" "T" "A"
You can do this with purrr::map_chr and dplyr::case_when:
pilot_sequence |> purrr::map_chr(~dplyr::case_when(
  .x == "T" ~ "A",
  .x == "G" ~ "C",
  .x == "A" ~ "T",
  .x == "C" ~ "G"
))
#> [1] "G" "C" "T" "A" "G" "G" "A" "T" "A"
You can use recode from dplyr:
library(dplyr)
recode(pilot_sequence, "C" = "G", "G" = "C", "A" = "T", "T" = "A")
Or in base R, create a named vector, use match to find each value's position in the named vector, and then call names to get the names:
pilot_sequence <- c("C","G","A","T","C","C","T","A","T")
values = c("G" = "C", "C" = "G", "A" = "T", "T" = "A")
names(values[match(pilot_sequence, values)])
"G" "C" "T" "A" "G" "G" "A" "T" "A"
The following SQLite database is a tiny replica of a huge database that I'm working on:
library(RSQLite)
library(inborutils)
library(tibble)
library(dplyr)
library(dbplyr)
col1 <- c(1:20)
col2 <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
"L", "M", "N", "O", "P", "Q", "R", "S", "T")
col3 <- c(21:40)
database <- dbConnect(SQLite(), dbname = "testDB.sqlit")
table1 <- tibble(col1, col2, col3)
dbWriteTable(database, "testDBtable", table1)
bd <- tbl(database, "testDBtable")
I want to extract a column and factor the values. I'm facing a problem with the extraction process, either because I'm missing something or because I haven't understood the process as a whole.
The following code works to extract one column, but it is very slow (when I use it on my real database, not on this tiny replica):
>pull(bd, col2)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
"Q" "R" "S"
[20] "T"
whereas this code returns NULL:
>bd$col2
NULL
Any idea why this returns NULL?
I want to use the code as follows:
bd$col2 <- ordered(bd$col2, levels=lvl.100260, labels=lbl.100260)
as this code is awfully slow:
bd %>%
pull(col2) %>%
ordered(
.,
levels = lvl.100260,
labels = lbl.100260
)
This matters especially because this same code must be run a huge number of times.
You use the package inborutils, which is not on CRAN. I ran your code without it, so the results below might not match yours.
The problem is that bd does not have an element called col2:
> names(bd)
[1] "src" "ops"
It has class
[1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy"
[5] "tbl"
so it's not based on a dataframe, it needs to go to the database to extract data. I think it's possible to override the $ operator, but I suspect if you did, you'd find bd$col2 just as slow as pull(bd, col2).
For the more general question of how to speed it up, I don't think there are any easy answers. Probably you want to work with real dataframes (or even better, matrices) for speed, but it sounds as though you'll run into memory limitations if you try to convert the whole database at once. The general advice would be to profile your code to find the bottlenecks, and think about how to improve them.
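One option worth profiling (a sketch, not a guaranteed fix: it rebuilds the question's toy database in memory, and LETTERS[1:20] stands in for the asker's own lvl.100260/lbl.100260 objects) is to fetch the column in a single SQL round trip with DBI::dbGetQuery, then build the ordered factor locally:

```r
library(DBI)
library(RSQLite)

# Recreate the toy database from the question, in memory
database <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(database, "testDBtable",
             data.frame(col1 = 1:20, col2 = LETTERS[1:20], col3 = 21:40))

# One round trip to the database instead of a lazy tbl + pull()
col2_vals <- dbGetQuery(database, "SELECT col2 FROM testDBtable")$col2

# Factor locally; LETTERS[1:20] is a placeholder for lvl.100260
col2_fct <- ordered(col2_vals, levels = LETTERS[1:20])

dbDisconnect(database)
```

Since the expensive part is usually the repeated translation of lazy tbl operations into SQL, collapsing the extraction into one query per column may help; only profiling on the real database can confirm it.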
I am trying to make a function that counts the characters between "a" "t" "g" and "t" "a" "g", "t" "g" "a", or "t" "a" "a" inside a vector, but my code gets stuck in the while loop. An example would be x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g"). In this example the code should count 6 characters between "a" "t" "g" and "t" "a" "g". Any help will be appreciated :)
orfs <- function(x, p){
  count <- 0
  cntorfs <- 0
  n <- length(x)
  v <- n - 2
  for (i in 1:v){
    if (x[i] == "a" && x[i+1] == "t" && x[i+2] == "g"){
      k <- i + 3
      w <- x[k]
      y <- x[k+1]
      z <- x[k+2]
      while (((w != "t") && (y != "a") && (z != "g")) ||
             ((w != "t") && (y != "a") && (z != "a")) ||
             ((w != "t") && (y != "g") && (z != "a")) ||
             (i + 2 > v)){
        count <- count + 1
        k <- k + 1
        w <- x[k]
        y <- x[k+1]
        z <- x[k+2]
      }
    }
    if (count > p){
      cntorfs <- cntorfs + 1
    }
    if (count != 0){
      count <- 0
    }
  }
  cat("orf:", cntorfs)
}
This is a very inefficient and un-R-like way to count the number of characters between two patterns.
Here is an alternative using gsub that should get you started and can be extended to account for the other stop codons:
x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
nchar(gsub("[actg]*atg([actg]*)tag[actg]*", "\\1", paste0(x, collapse = "")))
#[1] 6
A more robust and general approach can be found here, making use of Biostrings::matchPattern. I would strongly advise against reinventing the wheel here, and instead recommend using some of the standard Bioconductor packages that were developed for exactly these kinds of tasks.
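To cover all three stop codons in one pattern, the gsub idea above can be extended with an alternation and a lazy quantifier (a base-R sketch; perl = TRUE is required for the non-greedy match):

```r
x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
seq_str <- paste0(x, collapse = "")

# Capture the shortest stretch between "atg" and the first stop codon
m <- regexec("atg([acgt]*?)(?:tag|taa|tga)", seq_str, perl = TRUE)
between <- regmatches(seq_str, m)[[1]][2]
nchar(between)
#> [1] 6
```

The lazy `*?` stops at the first stop codon after "atg", so overlapping or later stop codons do not inflate the count.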
I have the following character fields I am trying to intersect. These should be equal.
> char.y[[892]]
[1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
> char.x[[892]]
[1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
> intersect(char.x[[892]], char.y[[892]])
[1] "E" "d" "w" "a" "r" "s" " " "L" "i" "f" "e" "c" "n"
expected result:
"E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
Using intersect will return the common elements, but will not duplicate them. For example, "s" is in there three times, but will appear in the intersect only once.
If you want to see the same layout, with non-intersecting values removed, you can use the following:
a <- c("E", "d", "w", "a", "r", "d", "s", " ", "L", "i", "f", "e", "s", "c", "i", "e", "n", "c", "e", "s")
b <- c("E", "d", "w", "a", "r", "d", "s", " ", "L", "i", "f", "e", "s", "c", "i", "e", "n", "c", "e", "s")
a[a %in% intersect(a, b)]
# [1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
This would entirely depend on the vectors you are comparing (and their order), but would this be sufficient?
b <- a <- c('E', 'd', 'w', 'a', 'r', 'd', 's', '', 'L', 'i', 'f', 'e', 's', 'c', 'i', 'e', 'n', 'c', 'e')
c <- letters[sample(1:26,100, rep=T)]
a[is.element(a,b)]
# [1] "E" "d" "w" "a" "r" "d" "s" "" "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
a[is.element(a,c)]
# [1] "d" "w" "a" "r" "d" "s" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
I had the exact same problem and didn't find a solution, so I created my own little function intersectdup:
intersectdup <- function(vektor1, vektor2) {
  result <- c()
  for (i in seq_along(vektor2)) {
    if (is.element(vektor2[i], vektor1)) {
      result <- c(result, vektor2[i])
      foundAt <- match(vektor2[i], vektor1)
      vektor1 <- vektor1[-foundAt]  # drop the matched element so it isn't reused
    }
  }
  return(result)
}
Picking up on Clemens' answer, here is a simple function in a C-like style:
intersectMe <- function(x, y, duplicates = TRUE) {
  xyi <- intersect(x, y)
  if (!duplicates) return(xyi)
  res <- c()
  for (xy in xyi) {
    ny.xy <- sum(y == xy)  # occurrences in y
    nx.xy <- sum(x == xy)  # occurrences in x
    res <- c(res, rep(xy, min(ny.xy, nx.xy)))
  }
  res
}
The vecsets package also helps (using the example created by Eric):
vecsets::vintersect(a, b)
[1] "E" "d" "d" "w" "a" "r" "s" "s" "s" " " "L" "i" "i" "f" "e" "e" "e" "c" "c" "n"
This question already has answers here: Dictionary style replace multiple items (11 answers). Closed 5 years ago.
Here's my data:
dataset <- c("h", "H", "homme", "masculin", "f", "femme", "épouse")
How can I replace the text values of the vector like this:
"femme" -> "f"
"épouse" ->"f"
"Homme"-> "h"
"masculin" -> "h"
What I tried for "femme" -> "f":
test_out <- sapply(dataset, switch,
"f"="femme")
test_out
Expected result:
"h" "h" "h" "h" "f" "f" "f"
Try gsub with regular expressions:
dataset = gsub("^((?!h).*)$", "f", gsub("^((h|H|m).*)$", "h", dataset), perl=TRUE)
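A dictionary-style alternative (a sketch, not from the original answers) is a named lookup vector, where the names are the old values and the elements are the replacements:

```r
dataset <- c("h", "H", "homme", "masculin", "f", "femme", "épouse")

# names = values to replace, elements = replacements
dict <- c(h = "h", H = "h", homme = "h", masculin = "h",
          f = "f", femme = "f", "épouse" = "f")

unname(dict[dataset])
#> [1] "h" "h" "h" "h" "f" "f" "f"
```

Unlike the nested gsub calls, this fails loudly (returns NA) for any value without an entry in dict, which makes unmapped categories easy to spot.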