How to get an Element from a vector without using numbers or indices? - r

Let's say I have these two vectors in my R workspace with the following content:
[1] "Atom.Type" and "Molar.Mass"
> Atom.Type
[1] "Oxygen" "Lithium" "Nitrogen" "Hydrogen"
> Molar.Mass
[1] 16 6.9 14 1
I now want to assign the Molar.Mass belonging to "Lithium" (i.e. 6.9) to a new variable called mass.
The problem is: I have to do that without using any numbers or indices.
Does anyone have a suggestion for this problem?

This should work:
mass <- Molar.Mass[Atom.Type == "Lithium"]
This assumes the two vectors have the same length and that their elements are in matching order, so that each mass sits at the same position as its atom type.
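A minimal sketch of the answer above, using the vectors from the question. The named-vector variant (mass.by.type and mass.named are names introduced here for illustration) is one possible alternative; it still assumes both vectors are in matching order at the moment the names are attached.

```r
Atom.Type  <- c("Oxygen", "Lithium", "Nitrogen", "Hydrogen")
Molar.Mass <- c(16, 6.9, 14, 1)

# Logical indexing: keep the masses whose atom type equals "Lithium"
mass <- Molar.Mass[Atom.Type == "Lithium"]

# Alternative: attach the atom types as names, then look up by name
mass.by.type <- setNames(Molar.Mass, Atom.Type)
mass.named   <- mass.by.type[["Lithium"]]
```

Both mass and mass.named hold 6.9. The logical-indexing form returns every match, so it yields more than one value if an atom type appears twice.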

Related

Why does this xlim error occur in circlize initialization?

I want to initialize a new chord diagram with circlize, but I'm getting an error that doesn't seem to make any sense given the data I'm feeding into it:
Error: Since `xlim` is a matrix, it should have same number of rows as the length of the level of `sectors` and number of columns of 2.
I understand the requirement, but when I try to produce different plots, it fails for some but not others. Here's the relevant code snippet with some output for debugging
dev.new()
circos.clear()
circos.par(cell.padding=c(0,0,0,0), track.margin=c(0,0.01), gap.degree=1)
xlim = cbind(0, regionTotal)
print(class(region))
print(length(region))
print(class(xlim))
print(dim(xlim))
circos.initialize(factors=region, xlim=xlim)
The output for a plot that works fine:
[1] "character"
[1] 24
[1] "matrix" "array"
[1] 24 2
And for one that returns the error:
[1] "character"
[1] 50
[1] "matrix" "array"
[1] 50 2
Error: Since `xlim` is a matrix, it should have same number of rows as the length of the level of `sectors` and number of columns of 2.
I am aware of these questions:
this one led me to check the class
and this one led me to check my circlize version (0.4.11)
What am I missing??? Thanks for any help you can provide.
After a lot of hair pulling, I figured out the problem: there was a repeated value in my region variable (the factors or sectors entry in circos.initialize), so the effective number of sectors was lower than the dimension of the variable. Hopefully nobody else is dumb enough to make this mistake, but just in case they are, now they can have an additional thing to check if they come across this error.
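A quick way to check for this before calling circos.initialize is to compare the length of the sector vector with its number of distinct values. The sketch below uses made-up data (region and regionTotal here are hypothetical stand-ins for the variables in the question), and the aggregation step is only one possible fix, assuming that repeated sectors are meant to be merged.

```r
# Hypothetical data with one repeated sector name
region      <- c("A", "B", "C", "B")
regionTotal <- c(10, 20, 30, 5)

n.rows    <- length(region)          # rows that cbind(0, regionTotal) would have
n.sectors <- length(unique(region))  # distinct sectors
dupes     <- unique(region[duplicated(region)])  # which names are repeated

# One possible fix (an assumption about intent): sum the totals per sector,
# so that xlim has exactly one row per distinct sector
totals <- tapply(regionTotal, region, sum)
xlim   <- cbind(0, totals)
```

If n.rows and n.sectors differ, the xlim matrix built from the raw vectors has more rows than there are sectors, which matches the situation described in the error message.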

Correcting mis-typed data in R

I want to correct wrongly entered data in R. For example if I have a vector
V=c('PO','PO','P0')
I want R to recognize that the 0 in the last entry should be an O and to change it. Is there any way to do that? I have been trying to use correctTypos in the deducorrect package. However, I am having some problems with the editset: I cannot seem to specify that all the entries have to be letters. Any help is greatly appreciated.
Another example would be
V2=c('PL','P1','PL','XX')
That 1 should be an L.
The Jaro-Winkler distance was developed to find issues with data entry. But on entries only 2 characters long that is going to be difficult, as a single error tends to score higher than you want it to. You could combine this with other distance measures available in the stringdist package, but in this case that might be too complicated.
Given your examples you might want to use the base function chartr and set up a replacement of numbers to letters.
chartr("01","OL", V2)
[1] "PL" "PL" "PL" "XX"
chartr("01","OL", V)
[1] "PO" "PO" "PO"
This will always replace a 1 with an L and a 0 (zero) with an O. You can add 5 for S, and so on. But if other combinations occur, it might get complicated.
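As a small sketch of that extension, the call below adds the 5-to-S mapping and then flags any entries that still contain something other than capital letters. The vector V3 and the letters-only check are assumptions made for illustration; they are not part of the original question.

```r
# Hypothetical input with digits typed in place of letters
V3 <- c("P0", "5T", "P1", "XX")

# Character-by-character translation: 0 -> O, 1 -> L, 5 -> S
fixed <- chartr("015", "OLS", V3)

# Entries that still contain anything other than capital letters
suspect <- fixed[!grepl("^[A-Z]+$", fixed)]
```

Note that chartr translates every occurrence, so this is only appropriate when digits can never be legitimate in the data.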
Also note that the next iteration of the deducorrect package is the deductive package.

Check if vector of strings contains words created from two others words

I have a very long vector of strings (peptides).
head(unique(pseq_list))
#[1] "GPPNHHMGPMSER" "SLSGQCHHHGENLR" "HSSGQDKPHETYR"
#"DHDKPHQQSDK" "AHMESDK" "HISESHEK"
I want to check whether this vector contains peptides created from two other peptides. For example, if it contains "AHMESDK", "AHME" and "SDK", I want to know that. I tried the grepl function, but my vector is probably too long(?). Also, how can I save such results?
If it is too difficult to verify that "AHMESDK" = "AHME" + "SDK", it would be nice to know at least whether the vector has peptides which contain others (for example "HISESHEK" and "SES").
Context provided by @quant in the comments:
As a note for everyone without a biological background.
Peptides are macromolecules. Our body composes them by "gluing" different amino acids together. The sequence of amino acids glued together is called the primary structure of a peptide, and in bioinformatics it is often represented with the one-letter code (see rpeptide.com).
So AHMESDK simply means a peptide composed of alanine, histidine, and so on.
Data:
pseq<-c("GPPNHHMGPMSER", "SLSGQCHHHGENLR", "HSSGQDKPHETYR", "DHDKPHQQSDK", "AHMESDK", "AHME", "SES", "HISESHEK")
Two approaches:
Approach 1:
peplist<-sapply(pseq,grep, pseq, value=TRUE)
Result:
$GPPNHHMGPMSER
[1] "GPPNHHMGPMSER"
$SLSGQCHHHGENLR
[1] "SLSGQCHHHGENLR"
$HSSGQDKPHETYR
[1] "HSSGQDKPHETYR"
$DHDKPHQQSDK
[1] "DHDKPHQQSDK"
$AHMESDK
[1] "AHMESDK"
$AHME
[1] "AHMESDK" "AHME"
$SES
[1] "SES" "HISESHEK"
$HISESHEK
[1] "HISESHEK"
This gives you a list where, for every element, you get the elements it occurs in. We can then keep only those peptides that appear within other peptides:
peplist[sapply(peplist,length)>1]
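A variant of Approach 1, sketched on a shortened version of the data: it uses fixed = TRUE so that each peptide is matched as a literal string rather than a regular expression, and drops the trivial self-match. The names hits and contained are introduced here for illustration.

```r
pseq <- c("AHMESDK", "AHME", "SES", "HISESHEK")

# For each peptide, find the *other* peptides that contain it
hits <- lapply(pseq, function(p) {
  setdiff(grep(p, pseq, fixed = TRUE, value = TRUE), p)
})
names(hits) <- pseq

# Keep only peptides that occur inside at least one other peptide
contained <- hits[lengths(hits) > 0]
```

Here contained has the entries AHME (found in "AHMESDK") and SES (found in "HISESHEK"). One way to save such a result is base R's saveRDS(contained, file), with a file path of your choosing.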
Approach 2:
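library(magrittr)  # provides the %>% pipe
# All ordered pairs of peptides, pasted together into one string each
pepcombs <- expand.grid(pseq, pseq, stringsAsFactors = FALSE) %>%
  apply(1, paste0, collapse = "")
pseq[pseq %in% pepcombs]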
This gives you the peptides that can be constructed by concatenating two peptides from the vector. With the sample data above the result is empty, because "SDK" itself is not an element of pseq.
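To see Approach 2 produce a hit, the sketch below adds "SDK" to a shortened version of the data and uses base R's outer instead of expand.grid, which also makes it possible to recover which two parts form the composite. The names combos, composite and parts are introduced here for illustration.

```r
pseq <- c("AHMESDK", "AHME", "SDK", "SES", "HISESHEK")

# Matrix of all pairwise concatenations: combos[i, j] is pseq[i] followed by pseq[j]
combos <- outer(pseq, pseq, paste0)
dimnames(combos) <- list(pseq, pseq)

# Peptides that equal the concatenation of two peptides in the vector
composite <- pseq[pseq %in% combos]

# Row and column positions of the two parts of "AHMESDK"
parts <- which(combos == "AHMESDK", arr.ind = TRUE)
```

The matrix has length(pseq)^2 entries, so for a very long vector this may need too much memory and the work would have to be done in chunks.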

Storing a value in a nested list with an unknown depth in R

I am trying to optimize code which is very computationally intensive, because it deals with subsets of an 80-element set.
A crucial step that I want to accelerate is finding out whether the current subset in my loop has already been treated. For the moment, I check whether this subset is contained in the already treated subsets of the same size k (cardinality). It would be much faster to progressively store treated subsets in a nested list and check membership there (O(1) instead of a search in O(80 choose k)).
I had no problem coding a function to check if the current subset is in my nested list of treated subset: access(treated, subset=c(2,5,3)) returns TRUE iff treated[[2]][[5]][[3]]==TRUE
However, I have no idea how to store (inside my loop) my current subset in the list of treated. I would like something like this to be possible: treated[h] <- TRUE where h is my current subset (in the above example: h=c(2,5,3))
The main problem that I am facing is that the number of "[[..]]" varies inside my loop. Do I have any option other than padding h so that it has a length of 80 and writing a sequence of 80 "[[..]]", like: treated[[h[1]]][[h[2]]]...[[h[80]]] <- TRUE ?
If h is a vector of values then
"[["(treated, h)
recursively subsets the list items.
For example, I created a (not so highly) nested list:
> a
[[1]]
[[1]][[1]]
[1] 2
[[1]][[2]]
[[1]][[2]][[1]]
[1] 3
[[2]]
[1] 1
The following command correctly applies item subsetting recursively to the list:
> "[["(a, c(1,2,1))
[1] 3
The length of the recursively subsetting vector can vary without fixing the number of [[..]]'s. For example, subsetting two levels of depth with the same syntax:
> "[["(a, c(1,2))
[[1]]
[1] 3
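The same recursive indexing can be written with ordinary double brackets, and it also works on the left-hand side of an assignment. The sketch below rebuilds the example list; note the assumption that the path given by h already exists in the list, since recursive assignment does not create missing intermediate levels.

```r
# Same structure as the example list above
a <- list(list(2, list(3)), 1)
h <- c(1, 2, 1)

# Recursive read: equivalent to a[[1]][[2]][[1]]
before <- a[[h]]

# Recursive write along the same path (the path must already exist)
a[[h]] <- TRUE
after <- a[[h]]
```

Because h is an ordinary vector, its length can change from one loop iteration to the next without changing the code.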

R - cannot select the desired element in a vector [duplicate]

I have a vector of data called empl which I extracted from a NetLogo model using RNetLogo and whose entries look like
[[1403]]
[1] 99
[[1404]]
[1] 97
[[1405]]
[1] 95
[[1406]]
[1] 95
[[1407]]
[1] 95
[[1408]]
[1] 97
I would like to perform simple operations on the last numbers of the vector's entries (the 95, 97, ...).
Now if I write something like
empl[731] + empl[890]
I get
Error in empl[i] + empl[j] : non-numeric argument to binary operator
If I understand correctly, this is due to the fact that empl[i] does not pick the last number in the corresponding entry but rather the whole entry, for instance
[[1408]]
[1] 97
But I haven't been able to figure out how to get the last number only. I tried
empl[1,i]
and
empl[i,1]
but got
Error in empl[1, i] : incorrect number of dimensions
Any help on how to select the last number only would be much appreciated. If someone can help me understand the structure of the vector empl, that would be even better.
Your empl object is not a vector. It is a list. A list object is formed by elements which can be arbitrary R objects. When you print a list and see:
[[1403]]
[1] 99
it means that the 1403rd element of this list is a vector with just one value (99). You select an element of a list through the double square bracket ([[) operator. So, if you try:
empl[[731]] + empl[[890]]
you won't receive any error. I suggest reading the R Language Definition, in particular sections 2.1 (which describes object types) and 3.4 (which discusses indexing).
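The difference between the two operators can be sketched on a small stand-in list (the three values below are made up and only mimic the structure of empl):

```r
# A short list shaped like empl: each element is a length-one numeric vector
empl <- list(99, 97, 95)

single <- empl[1]    # single bracket: a list of length 1
double <- empl[[1]]  # double bracket: the numeric value inside

# Arithmetic works on the extracted values
total <- empl[[1]] + empl[[2]]

# If every element holds one number, unlist() gives a plain numeric vector
values <- unlist(empl)
```

Here single is still a list, which is why empl[731] + empl[890] fails, while double is the number 99 and total is 196.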
It seems like empl is a list
You could do
Reduce(`+`,tail(empl,2))
to get the sum of last 2 elements
If you need to sum some specific elements for example 731, 752, 834
Reduce(`+`,empl[c(731, 752, 834)])
#[1] 812
Or
sum(unlist(tail(empl,2)), na.rm=TRUE)
data
set.seed(42)
empl <- replicate(1000,list(sample(1:950,1,replace=TRUE)))
