Subset of ESet /dividing ESet - r

Is it possible to subset an ExpressionSet like this:
SUB=ESet[,ESet@phenoData@data$x==c(0,1)]
x contains values from 0 to 9, and I just want the entries where x=0 or x=1.

Try the following:
SUB=ESet[, ESet$x %in% c(0,1)]
At first glance, the difference between == and %in% seems only subtle.
x <- 0:9
x[x==c(0, 1)]
[1] 0 1
x[x %in% c(0, 1)]
[1] 0 1
But %in% will never return NA, and this could be useful, or even essential, depending on what you want to do. In the following constructed example, == returns NA (with a warning, because the longer object length is not a multiple of the shorter one), whilst %in% returns the expected result:
x <- c(NA, 0:9)
x[x==c(0, 1)]
[1] NA
x[x %in% c(0, 1)]
[1] 0 1
But the difference is much deeper than this. From the help page ?"==" it is apparent that when making binary comparisons between vectors of unequal length, the elements of the shorter vector are recycled as necessary.
Try for example the following:
x <- 0:9
x[x==c(1, 2)]
integer(0)
This results in an empty vector. If you recycle the vector c(1, 2), it quickly becomes apparent why:
x:       0 1 2 3 4 5 6 7 8 9
c(1, 2): 1 2 1 2 1 2 1 2 1 2
'==':    F F F F F F F F F F
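To see the two behaviours side by side, here is a minimal sketch on a plain integer vector. The vector x only stands in for ESet$x (that substitution is my assumption, so the example runs without Bioconductor), and the names eq and member are just for illustration.

```r
# Plain integer vector standing in for ESet$x (hypothetical data)
x <- 0:9

# == compares element by element, recycling c(1, 2) as 1 2 1 2 ...
eq <- x == c(1, 2)

# %in% asks, for each element of x, whether it occurs anywhere in c(1, 2)
member <- x %in% c(1, 2)

x[eq]      # integer(0)
x[member]  # 1 2

# Equivalent explicit form: two scalar comparisons combined with |
x[x == 1 | x == 2]  # 1 2
```

If you prefer to stay with ==, the last line is the safe way to write it, since each comparison is against a single value and nothing is recycled position by position.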


Writing a function in R

I am doing an exercise to practice writing functions.
I'm trying to figure out the general code before writing the function that reproduces the output from the table function. So far, I have the following:
set.seed(111)
vec <- as.integer(runif(10, 5, 20))
x <- sort(unique(vec))
for (i in x) {
c <- length(x[i] == vec[i])
print(c)
}
But this gives me the following output:
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
I don't think I'm subsetting correctly in my loop. I've been watching videos, but I'm not quite sure where I'm going wrong. Would appreciate any insight!
Thanks!
We can sum the logical vector and concatenate the result to count:
count <- c()
for(number in x) count <- c(count, sum(vec == number))
count
#[1] 3 1 4 1 5 4 3 2 7
In the OP's for loop, the loop variable takes the values of 'x' rather than positions along 'x'.
If we do
for(number in x) count <- c(count, length(vec[vec == number]))
it should work as well
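As a small illustration of why the original loop printed 1 every time: comparing two single elements gives a single logical value, and length() of that is always 1, whatever the comparison says. The toy vector below is made up for the example and is not the OP's data.

```r
vec <- c(5L, 7L, 5L, 9L)   # hypothetical toy data

# One element compared with one element: a single TRUE/FALSE, so length 1
length(vec[1] == vec[2])   # 1

# Comparing the whole vector against one value gives one logical per element
cmp <- vec == 5L
length(cmp)  # 4, the length of vec, not the count
sum(cmp)     # 2, the number of TRUEs, i.e. how often 5 occurs
```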
You can try sapply + setNames to achieve the same result as table, i.e.,
count <- sapply(x, function(k) setNames(sum(k==vec),k))
or
count <- sapply(x, function(k) setNames(length(na.omit(match(vec,k))),k))
such that
> count
1 2 3 4 5 6 7 8 9
3 1 4 1 5 4 3 2 7
Here is a solution without using unique and with one pass through the vector (if only R was fast with for loops!):
count = list()
for (i in vec) {
val = as.character(i)
if (is.null(count[[val]]))
count[[val]] = 1
else
count[[val]] = count[[val]] + 1
}
unlist(count)
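Since the exercise was to end up with a function, here is one way the pieces above could be wrapped up. This is only a sketch: the name count_values is made up, and ignoring NA via na.rm = TRUE is my own choice rather than something the question specified.

```r
# Hypothetical helper: counts how often each distinct value occurs in vec
count_values <- function(vec) {
  vals <- sort(unique(vec))
  # For each distinct value, sum the logical vector of matches
  counts <- vapply(vals, function(v) sum(vec == v, na.rm = TRUE), integer(1))
  setNames(counts, vals)
}

vec <- c(5L, 7L, 5L, 9L, 7L, 5L)   # toy data, not the OP's vector
count_values(vec)
# values 5, 7, 9 occur 3, 2 and 1 times
```

vapply is used instead of sapply so the result is always an integer vector, even when vec is empty.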

Find all subsequences with specific length in sequence of numbers in R

I want to find all subsequences within a sequence with (minimum) length of n. Let's assume I have this sequence
sequence <- c(1,2,3,2,5,3,2,6,7,9)
and I want to find the increasing subsequences with minimum length of 3. The output should be a data frame with start and end position for each subsequence found.
df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")
Can somebody give a hint how to solve my problem?
Thanks in advance!
One way using only base R
n <- 3
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
This splits the indices of sequence into groups of consecutive increasing values; for each group whose length is greater than or equal to n, it returns the start and end index of that group.
Let's break this down step by step.
Using diff we can find the difference between consecutive elements
diff(sequence)
#[1] 1 1 -1 3 -2 -1 4 1 2
We check which of these differences are not increases
diff(sequence) < 1
#[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
and, after prepending a 0 so that the first element starts a group, take the cumulative sum to create groups
cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
Based on these groups, we split the index from 1:length(sequence)
split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3
#$`2`
#[1] 4 5
#$`3`
#[1] 6
#$`4`
#[1] 7 8 9 10
Using sapply we loop over this list and return the start and end index of each group if the length of the group is >= n (3 in this case)
sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))
#$`1`
#start end
# 1 3
#$`2`
# NULL
#$`3`
#NULL
#$`4`
#start end
# 7 10
Finally, rbind all of them together using do.call. NULL elements are automatically ignored.
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
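The same steps can also be packaged as a function. The sketch below is a variation, not the code above: the name find_increasing_runs is made up, and it uses lengths() and vapply instead of sapply plus do.call(rbind, ...), which sidesteps the case where sapply simplifies its result to a matrix because every group happens to be long enough.

```r
# Hypothetical helper: start/end positions of increasing runs of length >= n
find_increasing_runs <- function(s, n = 3) {
  # A new group starts wherever the value does not increase
  groups <- split(seq_along(s), cumsum(c(0, diff(s)) < 1))
  keep <- groups[lengths(groups) >= n]
  data.frame(
    start = vapply(keep, function(g) g[1], integer(1), USE.NAMES = FALSE),
    end   = vapply(keep, function(g) g[length(g)], integer(1), USE.NAMES = FALSE)
  )
}

sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9)
find_increasing_runs(sequence, n = 3)
# two rows: start 1, end 3 and start 7, end 10
```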
Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).
Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):
# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)
# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1
Now that we have the indices, we need to get the start and end positions for repeats >= 3
# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1
# Examples
rle(indices)$lengths
[1] 3 1 1 2 4
rle(indices)$values
[1] 1 0 1 0 1
# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1] TRUE FALSE FALSE FALSE TRUE
# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1] 3 4 5 7 11
# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]
end <- rle_positions[reps]
data.frame(start, end)
start end
1 1 3
2 7 11
Or, concisely:
n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
end = rle_positions[reps])
start end
1 1 3
2 7 11
EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. Updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100)); I also needed to add one more line under n <- 3. This should work as intended now.
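For comparison, here is a sketch of the same rle idea applied to diff(sequence) directly, without the padding element at the front. The function name increasing_runs_rle is made up, and this is a variation on the answer rather than the answer's own code. A run of k increasing steps covers k + 1 elements, which is where the n - 1 comes from.

```r
# Hypothetical helper: rle on the "step is an increase" vector
increasing_runs_rle <- function(s, n = 3) {
  r <- rle(diff(s) >= 1)           # runs of increasing / non-increasing steps
  ends <- cumsum(r$lengths) + 1    # index of the last element touched by each run
  starts <- ends - r$lengths       # index of the first element of each run
  keep <- r$values & r$lengths >= n - 1
  data.frame(start = starts[keep], end = ends[keep])
}

sequence <- c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9, 12)
increasing_runs_rle(sequence, n = 3)
# two rows: start 1, end 3 and start 7, end 11

# A short increase at the very start is not reported for n = 3
increasing_runs_rle(c(1, 2, 1, 5, 6), n = 3)
# one row: start 3, end 5
```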

convert 1 character string into numeric values

I would like to convert a character sequence into a numeric sequence.
My variable is called labCancer and is made like this:
labCancer
[1] M M M M M M M M M M M M M M M M M M M B B B M M M M M M M M M M M M M M M B
I would like to have:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0
I tried using
labCancer_2 <- labCancer
for (i in 1:569) {
if (labCancer[i] == "M") {
labCancer_2[i] <- 1
} else {
labCancer_2[i] <- 2
} }
but it doesn't work.
Andrea
The only reason I can think of that would cause that loop to not work is failure to initialize labCancer_2. So you would want to do this prior to starting your loop:
labCancer_2 <- numeric(length(labCancer))
If you want to assign to an object element by element in a loop, you need to initialize that object first, or it needs to otherwise exist in some manner.
However, there is a better way to do this that would not require initialization and would be the way many would argue you should do it in R
labCancer_2 <- ifelse(labCancer == "M", 1, 0)
This takes advantage of R's vectorization.
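One more possibility worth checking, and this is an assumption on my part: labCancer prints without quotes, which suggests it may be a factor rather than a character vector. If so, labCancer_2 <- labCancer copies the factor, and assigning 1 into a factor whose levels are B and M produces NA with a warning. The sketch below uses a short made-up factor to show that, along with a vectorised conversion that works for factors and character vectors alike.

```r
# Hypothetical stand-in for the real data, assumed to be a factor
labCancer <- factor(c("M", "M", "B", "M", "B"))

# Assigning a value that is not a level gives NA (R also warns about it)
f <- labCancer
suppressWarnings(f[1] <- 1)
is.na(f[1])   # TRUE

# Vectorised conversion: TRUE/FALSE becomes 1/0
labCancer_2 <- as.integer(labCancer == "M")
labCancer_2   # 1 1 0 1 0
```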
Depending on what you are using the data for, as long as you only have two values, you can do this:
labCancer_2 <- ifelse(labCancer == "M", 1, 0)
If you have multiple values or you want to keep the letters around for reference or graphing, you can make the vector a factor:
labCancer_2 <- factor(labCancer, levels = c("B", "M"))
However, the factor begins with 1, so your vector would be
2 2 2 2 ... 1 1 1 ...
rather than
1 1 1 1 ... 0 0 0...
One solution would be to convert your vector to a factor, and then to an integer. Each unique value of your original vector then gets its own integer code:
> x <- c("m", "b", "m", "b")
> x
[1] "m" "b" "m" "b"
> as.factor(x)
[1] m b m b
Levels: b m
> as.integer(as.factor(x))
[1] 2 1 2 1
> c(0, 1)[as.numeric(as.factor(x))]
[1] 1 0 1 0
Using the trick in the last line one can easily change the numbers to match 0 and 1.

Modifying dplyr::lag function

I am trying to use the lag function from the dplyr package. However, when I give a lag > 0 I want the missing values to be replaced by the first value in x. How can we achieve this?
library(dplyr)
x<-c(1,2,3,4)
z<-lag(x,2)
z
## [1] NA NA 1 2
Since you are using the lag function from dplyr, there is an argument default. So you can specify that you want x[1] to be the default.
lag(x, 2, default=x[1])
Here's a modified function mylag:
mylag <- function(x, k = 1, ...)
replace(lag(x, k, ...), seq(k), x[1])
x <- 1:4
mylag(x, k = 2)
# [1] 1 1 1 2
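If you want the same behaviour without loading dplyr, a base R sketch could look like the following. The name lag_fill is made up; it simply puts k copies of the first value in front and drops the last k elements.

```r
# Hypothetical base R helper: lag by k, filling the gap with the first value
lag_fill <- function(x, k = 1) {
  k <- min(k, length(x))                       # never shift by more than the length
  c(rep(x[1], k), x[seq_len(length(x) - k)])   # k copies of x[1], then the rest
}

x <- c(1, 2, 3, 4)
lag_fill(x, 2)
# 1 1 1 2
```

seq_len is used instead of head(x, -k) because head(x, -0) returns an empty vector, which would break the k = 0 case.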
May I suggest adapting the function so that it works both ways: for lag and lead (positive AND negative lags).
shift = function(x, lag, fill=FALSE) {
require(dplyr)
switch(sign(lag)/2+1.5,
lead( x, n=abs(lag), default=switch(fill+1, NA, tail(x, 1)) ),
lag( x, n=abs(lag), default=switch(fill+1, NA, head(x, 1)) )
)
}
It has a "fill" argument that automatically fills with the first or last value depending on the sign of the lag.
> shift(1:10, -1)
#### [1] 2 3 4 5 6 7 8 9 10 NA
> shift(1:10, +1, fill=TRUE)
#### [1] 1 1 2 3 4 5 6 7 8 9

propagating data within a vector

I'm learning R and I'm curious... I need a function that does this:
> fillInTheBlanks(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
> fillInTheBlanks(c(1, 2, 3, 4))
[1] 1 2 3 4
and I produced this one... but I suspect there's a more R way to do this.
fillInTheBlanks <- function(v) {
## replace each NA with the latest preceding available value
orig <- v
result <- v
for(i in 1:length(v)) {
value <- v[i]
if (!is.na(value))
result[i:length(v)] <- value
}
return(result)
}
Package zoo has a function na.locf():
R> library("zoo")
R> na.locf(c(1, 2, 3, 4))
[1] 1 2 3 4
R> na.locf(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
na.locf: Last Observation Carried Forward;
Generic function for replacing each ‘NA’ with the most recent non-‘NA’ prior to it.
See the source code of the function na.locf.default, it doesn't need a for-loop.
I'm doing some minimal copy&paste from the zoo library (thanks again rcs for pointing me at it) and this is what I really needed:
fillInTheBlanks <- function(S) {
## NA in S are replaced with observed values
## accepts a vector possibly holding NA values and returns a vector
## where all observed values are carried forward and the first is
## also carried backward. cfr na.locf from zoo library.
L <- !is.na(S)
c(S[L][1], S[L])[cumsum(L)+1]
}
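To make the indexing trick in that function easier to follow, here are the intermediate values spelled out. Variable names other than S and L (observed, idx, S2, L2) are just for illustration.

```r
S <- c(1, NA, NA, 2, 3, NA, 4)
L <- !is.na(S)        # TRUE where a value was observed
observed <- S[L]      # 1 2 3 4
idx <- cumsum(L)      # 1 1 1 2 3 3 4: how many observed values seen so far
observed[idx]         # 1 1 1 2 3 3 4

# With a leading NA, cumsum(L) starts at 0, and index 0 would drop the element.
# Prepending the first observed value and adding 1 to the index fixes that:
S2 <- c(NA, 5, NA)
L2 <- !is.na(S2)
c(S2[L2][1], S2[L2])[cumsum(L2) + 1]   # 5 5 5
```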
Just for fun (since it's slower than fillInTheBlanks), here's a version of na.locf relying on rle function:
my.na.locf <- function(v,fromLast=F){
if(fromLast){
return(rev(my.na.locf(rev(v))))
}
nas <- is.na(v)
e <- rle(nas)
v[nas] <- rep.int(c(NA,v[head(cumsum(e$lengths),-1)]),e$lengths)[nas]
return(v)
}
e.g.
v1 <- c(3,NA,NA,NA,1,2,NA,NA,5)
v2 <- c(NA,NA,NA,1,7,NA,NA,5,NA)
my.na.locf(v1)
#[1] 3 3 3 3 1 2 2 2 5
my.na.locf(v2)
#[1] NA NA NA 1 7 7 7 5 5
my.na.locf(v1,fromLast=T)
#[1] 3 1 1 1 1 2 5 5 5
my.na.locf(v2,fromLast=T)
#[1] 1 1 1 1 7 5 5 5 NA
Another simple answer. This one takes care of the first value being NA: there is nothing before it to carry forward, so my loop starts from index 2.
my_vec <- c(1, NA, NA, 2, 3, NA, 4)
fill.it <- function(vector){
new_vec <- vector
for (i in 2:length(new_vec)){
if(is.na(new_vec[i])) {
new_vec[i] <- new_vec[i-1]
} else {
next
}
}
return(new_vec)
}
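A slightly more defensive sketch of the same loop, in case the input can have length 0 or 1. The name fill_forward is made up. With 2:length(v) the loop would run over 2, 1 for a one-element vector; seq_along(v)[-1] is simply empty in that case.

```r
# Hypothetical variant of the loop above that also handles short vectors
fill_forward <- function(v) {
  for (i in seq_along(v)[-1]) {        # positions 2, 3, ...; empty if length < 2
    if (is.na(v[i])) v[i] <- v[i - 1]  # carry the previous value forward
  }
  v
}

fill_forward(c(1, NA, NA, 2, 3, NA, 4))
# 1 1 1 2 3 3 4
fill_forward(c(NA, 2, NA))
# NA 2 2 (a leading NA stays NA)
```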
Several R packages include a na.locf function that does exactly that (imputeTS, zoo, spacetime, ...).
Here is an example with imputeTS (recent versions of the package name the function na_locf; older versions used na.locf):
library("imputeTS")
x <- c(1, NA, NA, 2, 3, NA, 4)
na_locf(x)
The imputeTS package (and zoo as well) also provides more advanced methods for replacing missing values.
