Why should I use the sum() function here and not the length() function?
sum(age<=20)/length(age)
Age is a vector with numeric values representing the age.
Why should I not use:
length(age<=20)/length(age)
These functions should return the same value, shouldn't they?
No, they shouldn't. The result of:
age<=20
is a logical (boolean) vector with the same length as the variable 'age'.
Therefore,
sum(age<=20)
counts how many values are lower than or equal to 20, while
length(age<=20)
will return the length of 'age', because it counts every element of the logical vector, whether TRUE or FALSE.
BTW,
sum(age<=20)/length(age)
can be more simply obtained via:
mean(age<=20)
I'm not sure what language you're referring to, but, in general principles:
age<=20 is a condition, returning yes/no, and presumably 1/0, on each item (or record)
sum(age<=20) will sum up the Ones. It gives you the number of items where the condition is satisfied.
On the other hand, length returns a number: the count of your array/vector/recordset, whatever the values are. In a vectorised language such as R, age<=20 is itself a vector of booleans with one entry per item of age.
So if you ask length(age<=20), you get the same number as length(age), and the ratio is always 1. Only in a language where age<=20 collapsed to a single scalar would the length be 1.
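For anyone who wants to see the difference in numbers, here is a minimal sketch in Python (the language used in the k-mer question further down). The sample ages and the variable names are made up for illustration; the point is only that summing booleans counts the matches, while taking the length counts everything.

```python
# Hypothetical sample data; not taken from the original question.
age = [15, 22, 20, 31, 18]

# Element-wise comparison: one True/False per element of age.
is_young = [a <= 20 for a in age]

count_young = sum(is_young)   # True counts as 1, False as 0
count_all = len(is_young)     # always equals len(age)

share_sum = count_young / len(age)   # proportion of ages <= 20
share_len = count_all / len(age)     # always 1.0, whatever the data
```

With these values, three of the five ages satisfy the condition, so the sum-based ratio is 0.6 while the length-based ratio stays at 1.0.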
I am currently having an issue. Basically, I have 2 similar functions in terms of concept but the results do not align. These are the codes I learned from Bioinformatics I on Coursera.
The first code is simply creating a dictionary of occurrences of each k-mer pattern from a text (which is a long stretch of nucleotides). In this case, k is 5.
def FrequencyMap(text,k):
    freq ={}
    for i in range (0, len(text)-k+1):
        freq[text[i:i+k]]=0
        for j in range (0, len(text)-k+1):
            if text[j:j+k] == text[i:i+k]:
                freq[text[i:i+k]] +=1
    return freq, max(freq)
The text and the result dictionary are kinda long, but the main point is when I call max(freq), it returns the key 'TTTTC', which has a value of 1.
Meanwhile, I wrote another code that is simply based on the previous code to generate the 5-mer patterns that have the max values (number of occurrences in the text).
def FrequentWords(text, k):
    # FrequencyMap returns a (dictionary, key) tuple; keep only the dictionary
    a, _ = FrequencyMap(text, k)
    m = max(a.values())
    words = []
    for i in a:
        if a[i]==m:
            words.append(i)
    return words,m
And this code returns 'ACCTA', which has the value of 99, meaning it appears 99 times in the text. This makes total sense.
I used the same text and k (k=5) for both codes. I ran the codes on Jupyter Notebook. Why does the first one not return 'ACCTA'?
Thank you so much,
Here is the text, if anyone wants to try:
"ACCATCCCTAGGGCATACCTAAGTCTACCTAAAAGGCTACCTAATACCATACCTAATTACCTAACTACCTAAAATAAGTCTACCTAATACCTAATACCTAAAGTTACCTAACGTACCTAATACCTAATACCTAACCACTACCTAATCCGATTTACCTAACAACCGATCGAGTACCTAATCGATACCTAAATAACGGACAATATACCTAATTACCTAATACCTAATACCTAAGTGTACCTAAGACGTCTACCTAATTGTACCTAACTACCTAATTACCTAAGATTAATACCTAATACCTAATTTACCTAATACCTAACGTGGACTACCTAATACCTAACTTTTCCCCTACCTAATACCTAACTGTACCTAAATACCTAATACCTAAGCTACCTAAAGAACAACATTGTACGTGCGCCGTACCTAAATACCTAACAACTACCTAACTGATACCTAATAGTGATTACCTAACGCTTCTACCTAACTACCTAAGTACCTAACGCTACCTAACTACCTAATGTCCACAAAATACCTAATACCTAATAGCTACCTAATTGTGTACCTAAGTACCTAACCTACCTAATAATACCTAAAAATACCTAAGTACCTAACGTACCTAAATTTTACCTAATCTACCTAACGTACCTAATACCTAATTATACCTAATTACCTAATGGTTACCTAAGTTACCTAATATGCCACTACCTAACCTTACCTAAGACCTACCTAATAGGTACCTAACTGGGTACCTAAGGCAGTTTACCTAATTCAGGGCTACCTAATGTACCTAATACCTAAGTACCTAATACCTAATCCCATACCTAATATTTACCTAAGGGCACCGGTACCTAATACCTAATACCTAATACCTAAACCTTCGTACCTAAATACCTAATCTACCTAATGTACCTAAGGTACCTAATACCTAAGTCACTACCTAATACCTAATACCTAATGGGAGGAGCTTACCTAAGGTTACCTAATTACCTAAATACCTAATCGTTACCTAA"
Why does the first one not return 'ACCTA'?
Because max(freq) returns the maximum key of the dictionary, not the key with the maximum value. In this case the keys are strings (the k-mers), and strings are compared lexicographically. Hence the maximum one is the last string when they are sorted alphabetically, which is 'TTTTC' here regardless of its count.
If you want the first function to return the k-mer that occurs most often, you should change max(freq) to max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]. Here, max compares the (kmer, count) pairs (that's the key_value_pair parameter of the lambda expression) by their count, and the trailing [0] selects the k-mer from the winning pair. If several k-mers share the highest count, only the first one encountered is returned.
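To make the difference concrete, here is a small sketch on a toy sequence. The function name frequency_map and the test string are my own, not from the course; it also counts in a single pass instead of the nested loops, which should give the same dictionary.

```python
def frequency_map(text, k):
    """Count every k-mer of text in one pass (hypothetical helper)."""
    freq = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        freq[kmer] = freq.get(kmer, 0) + 1
    return freq


freq = frequency_map("ACGACGT", 3)

# max over a dict iterates over its keys, so this compares the strings.
largest_key = max(freq)

# Passing key=freq.get makes max compare the counts instead.
most_frequent = max(freq, key=freq.get)
```

Here "ACG" occurs twice and is returned by the second call, while the plain max(freq) returns "GAC" simply because it sorts last.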
The problem
I would like to find a length of a list.
The expected output
I would like to find the length based on a condition.
Example
Suppose that I have a list of 4 elements as follows:
myve <- list(1,2,3,0)
Here I have 4 elements, one of which is zero. How can I find the length while excluding the zero values? Then, if the length is > 1, I would like to subtract one. That is:
If the length is 4, then I would like to have 4-1=3. So the output should be 3.
Note
Please note that I am working on a problem where the number of zero values may change from one case to another. For example, the first list may have only one zero value, while the second list may have 2 or 3 zero values.
The values are always positive or zero.
You just need to apply the condition to each element. This produces a vector of booleans, which you then sum to get the number of TRUE elements (i.e. the elements that satisfy your condition).
In your case:
sum(myve != 0)
In a more complex case, where the condition is expressed by a function f that returns TRUE or FALSE for a single element:
sum(sapply(myve, f))
Use sapply to flag the elements that are different from zero, and sum to count them:
sum(sapply(myve, function(x) x!=0))
I have a vector that contains fractional numbers:
a<-c(0.5,0.5,0.3,0.5,0.2)
I would like to determine the most frequent (i.e. majority) number in the vector and return that number.
table(a) doesn't work because it will return the whole table. I want it to return only 0.5.
In case of ties I would like to choose randomly.
I have a function that does this for integers:
function(x){
  a<-tabulate(x,nbins=max(x)); b<-which(a==max(a))
  if (length(b)>1) {sample(b,1)} else {b}
}
However, this won't work for fractions.
Can someone help?
You can use
names(which.max(table(a)))
If you want the numeric one as in your case, then coerce it to numeric
as.numeric(names(which.max(table(a))))
To randomize the tie case, you can shuffle the table first, so that which.max picks a random one among the tied values:
as.numeric(names(which.max(sample(table(a))))) #note this works only if length(unique(a)) > 1
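If it helps to see the tie-breaking spelled out step by step, here is a rough equivalent in Python rather than R. The function name most_frequent and its rng parameter are my own choices for illustration; it counts the values, collects everything that reaches the top count, and picks one of those at random.

```python
import random
from collections import Counter


def most_frequent(values, rng=random):
    """Return the most common value; ties are broken at random.

    Hypothetical helper. Raises ValueError on an empty input.
    """
    counts = Counter(values)
    top = max(counts.values())
    candidates = [v for v, c in counts.items() if c == top]
    return rng.choice(candidates)


a = [0.5, 0.5, 0.3, 0.5, 0.2]
winner = most_frequent(a)
```

Unlike the shuffled-table one-liner, this sketch also works when the input contains only one distinct value, because the candidate list then has a single element.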
I have a data frame ma.
It has a factor called type.
type has the following levels: I210, I210plus, I210plusc, KV2c, KV2cplus
I'd like to put some of these levels in a vector, say, selected_types:
selected_types<-c("I210plusc","KV2c")
Then I'd like to use this command to subset the data frame ma:
ma1<-subset(ma, type==selected_types)
such that ma1 would be a subset of ma consisting of only the observations of type I210plusc or KV2c.
However, when I do this, the number of observations in the resulting data frame ma1 is less than the sum of the occurrences of the two types in selected_types in the original ma.
Any ideas on what I'm doing incorrectly?
Thank you
I originally had this in a comment, but it's a bit lengthy, plus I wanted to add to it. Here are some details on what's happening:
What you're doing with == is recycling your length-two vector, so that every odd row is compared to "I210plusc" and every even row to "KV2c". Your final result is therefore the data frame of odd rows that are "I210plusc" and even rows that are "KV2c"; rows of a selected type that sit in the "wrong" position are dropped.
An alternate solution that might make the issue clear is as follows:
subset(ma, type == selected_types[[1]] | type == selected_types[[2]])
Or, more gracefully:
subset(ma, type %in% selected_types)
The %in% operator returns a logical vector of the same length as type, with TRUE for every position in type that "is in" selected_types (hence the name of the operator).
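The two behaviours can be imitated in a few lines of Python, as a rough sketch only: the types column below is invented, and the modulo indexing is my emulation of R's recycling, not anything R exposes. Note that Python index 0 corresponds to R's row 1.

```python
# Hypothetical column values; not the asker's data.
types = ["I210plusc", "KV2c", "KV2c", "I210plusc", "I210"]
selected_types = ["I210plusc", "KV2c"]

# Emulates recycling with ==: position i is compared only with
# selected_types[i % 2], i.e. alternating between the two values.
recycled = [
    t == selected_types[i % len(selected_types)]
    for i, t in enumerate(types)
]

# Emulates %in%: position i is kept if its type appears anywhere
# in selected_types.
member = [t in selected_types for t in types]

kept_by_recycling = sum(recycled)
kept_by_membership = sum(member)
```

In this toy column four rows have a selected type, but the recycled comparison keeps only the two that happen to line up with the alternating pattern, which is the shortfall described in the question.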
Hello, I am new to R and I can't find a way to do exactly what I want. I have a vector of x numbers, and what I want to do is sort it in increasing order and then make subtractions like this (let's say the vector has 100 numbers, for example):
[x(100)-x(99)]+[x(99)-x(98)]+[x(98)-x(97)]+[x(97)-x(96)]+...[x(2)-x(1)]
and then divide all that sum by the number of elements the vector has, in this case 100.
The only thing that I am able to do at the moment is order the vector with:
sort(nameOfTheVector)
Sorry for my bad English.
diff returns suitably lagged and iterated differences. In your case you want the default single lag. sum will return the sum of any arguments passed to it, so....
sum(diff(sort(nameOfTheVector))) / length(nameOfTheVector)
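For comparison, the same computation can be sketched in Python. The name mean_gap is made up for this example; it sorts, takes the differences between neighbours, sums them, and divides by the number of elements, as the question asks.

```python
def mean_gap(values):
    """Sum of consecutive differences of the sorted values,
    divided by the number of values (hypothetical helper).

    Raises ZeroDivisionError on an empty input.
    """
    ordered = sorted(values)
    # Pair each element with its successor: (x1, x2), (x2, x3), ...
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(ordered)


result = mean_gap([7, 1, 4, 10])
```

Because the intermediate terms cancel, the sum of the differences is just the largest value minus the smallest, so here the result is (10 - 1) / 4.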