How does the order function break ties in R?

I've read most of the similar questions here, but I'm still having a hard time understanding how the additional arguments passed to the order function break ties.
The example in the R documentation shows that
order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1), z <- c(2,1:9))
returns
[1] 6 5 2 1 7 4 10 8 3 9
However, what does it mean for y to 'break ties' in x, and for z to 'break ties' in y? The x vector is:
[1] 1 1 3 2 1 1 2 3 4 3
and the y vector is:
[1] 9 9 8 7 6 5 4 3 2 1
Also, if I eliminate z from the first function,
order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1))
it returns :
[1] 6 5 1 2 7 4 10 8 3 9
so I'm unclear how the numbers in the y vector are relevant to ordering the four 1s, the two 2s, and the three 3s in x. I would very much appreciate the help. Thanks!

Let's take a look at
idx <- order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1), z <- c(2,1:9))
idx;
#[1] 6 5 2 1 7 4 10 8 3 9
First thing to note is that
x[idx]
# [1] 1 1 1 1 2 2 3 3 3 4
So idx orders entries in x from smallest to largest values.
Values in y and z affect how order treats ties in x.
Take entries x[5] = 1 and x[6] = 1. Since there is a tie here, order looks up the entries at the corresponding positions in y, i.e. y[5] = 6 and y[6] = 5. Since y[6] < y[5], index 6 is placed before index 5 in the result.
If there is a tie in y as well, order looks up the entries in the next vector, z. This happens for x[1] = 1 and x[2] = 1, where both y[1] = 9 and y[2] = 9. Here z breaks the tie: z[2] = 1 < z[1] = 2, so index 2 is placed before index 1. Without z, that tie stays unresolved and order leaves the tied entries in their original order, which is why the two-argument call returns 1 before 2.
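One way to see the tie-breaking at a glance is to lay the three keys side by side in sorted order. This is only an illustrative sketch; the names `sorted` and `pos` are not part of the original example.

```r
x <- c(1, 1, 3:1, 1:4, 3)
y <- c(9, 9:1)
z <- c(2, 1:9)

idx <- order(x, y, z)

# Each row shows the original position and the three sort keys.
# Within equal x, rows are ascending in y; within equal x and y, ascending in z.
sorted <- data.frame(pos = idx, x = x[idx], y = y[idx], z = z[idx])
sorted
```

In the printed table, the rows with x equal to 1 appear with y values 5, 6, 9, 9, and the two rows with y equal to 9 are separated by z (1 before 2).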


How to filter variables by fold change difference in R

I'm trying to filter a very heterogeneous dataset.
I have numerous variables, each with several replicates. I have a factor with two levels (let's say X and Y), and I would like to keep only the variables whose means differ by a fold change of at least 2 between the levels (X/Y >= 2 OR Y/X >= 2).
How can I achieve that in R? I can think of some ways, but they seem like too much of a hassle; I'm sure there is a better way. I will later run multivariate tests on the filtered variables.
This would be an example dataset:
d <- read.table(text = "a b c d factor replicate
1 2 2 3 X 1
3 2 4 4 X 2
2 3 1 2 X 3
1 2 3 2 X 4
5 2 6 4 Y 1
7 4 5 5 Y 2
8 5 7 4 Y 3
6 4 3 3 Y 4", header = TRUE)
From this example, only variables a and c should be kept.
Using colMeans:
#subset
x <- d[ d$factor == "X", 1:4 ]
y <- d[ d$factor == "Y", 1:4 ]
# fold change of the group means, in either direction; get index
which(colMeans(x)/colMeans(y) >= 2 | colMeans(y)/colMeans(x) >= 2)
# a c
# 1 3
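To inspect the fold changes themselves and carry the filtered columns forward, a sketch along these lines may help. It assumes that fold change means the ratio of the two group means; `vars`, `mean_x`, `mean_y`, `fold`, `keep`, and `d_filtered` are illustrative names.

```r
d <- read.table(text = "a b c d factor replicate
1 2 2 3 X 1
3 2 4 4 X 2
2 3 1 2 X 3
1 2 3 2 X 4
5 2 6 4 Y 1
7 4 5 5 Y 2
8 5 7 4 Y 3
6 4 3 3 Y 4", header = TRUE)

vars <- c("a", "b", "c", "d")  # measurement columns (assumed)
mean_x <- colMeans(d[d$factor == "X", vars])
mean_y <- colMeans(d[d$factor == "Y", vars])

# fold change in whichever direction is larger
fold <- pmax(mean_x / mean_y, mean_y / mean_x)
keep <- vars[fold >= 2]

# keep the passing variables plus the design columns
d_filtered <- d[, c(keep, "factor", "replicate")]
```

Taking `pmax` of both ratios avoids writing the two-sided condition twice, and `d_filtered` can be passed straight to later tests.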

R - Shift specified columns using minimum value into positive values

I'm looking for an easy way to shift specified columns of my dataframe by each column's minimum value, so that all of their values become non-negative.
This feels like a common thing, but I haven't been able to find any good answers yet...maybe I'm missing something obvious.
Let's say I've got two columns (in reality I have close to 100) with positive and negative numbers.
w <- c(9, 9, 9, 9)
x <- c(-2, 0, 1, 3)
y <- c(-1, 1, 3, 4)
z <- as.data.frame(cbind(w, x, y))
w x y
1 9 -2 -1
2 9 0 1
3 9 1 3
4 9 3 4
I want z to look like this after a transformation for only x and y columns [,2:3]
w x y
1 9 0 0
2 9 2 2
3 9 3 4
4 9 5 5
Does that make sense?
library(dplyr)
dplyr::mutate(z, across(c(x, y), ~ . + abs(min(.))))
w x y
1 9 0 0
2 9 2 2
3 9 3 4
4 9 5 5
You can also select by column position rather than column name by changing c(x, y) to 2:3, or to c(2:3, 5) for non-sequential column positions.
Depends exactly what you mean and what you want to happen if there aren't negative values. No matter the values, this will anchor the minimum at 0, but you should be able to adapt it if you want something slightly different.
z[2:3] = lapply(z[2:3], function(col) col - min(col))
z
#   w x y
# 1 9 0 0
# 2 9 2 2
# 3 9 3 4
# 4 9 5 5
As a side note, as.data.frame(cbind(w, x, y)) is risky: if you have a mix of numeric and character values, cbind() will convert everything to character. It's shorter and safer to write data.frame(w, x, y).
Do you want this?
z[2:3] <- lapply(z[2:3], function(columnValues) columnValues + abs(min(columnValues)))
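If the goal is to shift only columns that actually contain negative values and leave the rest alone, a base R sketch could look like the following. The `cols` vector is an assumed way of naming the target columns, and the "only shift when the minimum is negative" rule is one possible reading of the question.

```r
z <- data.frame(w = c(9, 9, 9, 9),
                x = c(-2, 0, 1, 3),
                y = c(-1, 1, 3, 4))

cols <- c("x", "y")  # columns to shift (assumed)

# min(0, min(col)) is the column minimum when it is negative and 0 otherwise,
# so columns that are already non-negative are left unchanged
z[cols] <- lapply(z[cols], function(col) col - min(0, min(col)))
z
```

With close to 100 columns, `cols` could also be a vector of positions such as 2:3.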

Find unique set of strings in vector where vector elements can be multiple strings

I have a series of batch records that are labeled sequentially. Sometimes batches overlap.
x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
x
1 1
2 1
3 1/2
4 2
5 3
6 4
7 5/4
8 5
I want to find the set of batches that are not overlapping and label those periods. Batch "1/2" includes both "1" and "2" so it is not unique. When batch = "3" that is not contained in any previous batches, so it starts a new period. I'm having difficulty dealing with the combined batches, otherwise this would be straightforward. The result of this would be:
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
My experience is in more functional programming paradigms, so I know the way I did this is very un-R. I'm looking for the way to do this in R that is clean and simple. Any help is appreciated.
Here's my un-R code that works, but is super clunky and not extensible.
x <- c("1","1","1/2","2","3","4","5/4","5")
p <- 1 #period number
temp <- NULL #temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0 #length to repeat period
for (i in 1:length(x)){
  #check for "/", split and add to temp
  if (grepl("/", x[i])){
    z <- strsplit(x[i], "/") #split character
    z <- unlist(z) #convert to vector
    temp <- c(temp, z, x[i]) #add to temp vector for comparison
  }
  #check if x in temp
  if(x[i] %in% temp){
    temp <- append(temp, x[i]) #add to search vector
    rl <- rl + 1 #increase length
  } else {
    period <- append(period, rep(p, rl)) #add to period vector
    p <- p + 1 #increase period count
    temp <- NULL #reset
    rl <- 1 #reset
  }
}
#add last batch
rl <- length(x) - length(period)
period <- append(period, rep(p,rl))
df <- data.frame(x,period)
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
R has functional paradigm influences, so you can solve this with Map and Reduce. Note that this solution follows your approach in unioning seen values. A simpler approach is possible if you assume batch numbers are consecutive, as they are in your example.
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
r<-Reduce(union,s,init=list(),acc=TRUE)
p<-cumsum(Map(function(x,y) length(intersect(x,y))==0,s,r[-length(r)]))
data.frame(x,period=p)
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
What this does is first calculate a cumulative union of seen values. Then, it maps across this to determine the places where none of the current values have been seen before. (Alternatively, this second step could be included within the reduce, but this would be wordier without support for destructuring.) The cumulative sum provides the "period" numbers based on the number of times the intersections have come up empty.
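The same idea can be spelled out step by step, which may make the intermediate values easier to inspect. This is a sketch of the cumulative-union approach rather than a replacement for the code above; `seen`, `seen_before`, and `is_new` are illustrative names.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# cumulative union of batch labels up to and including each row
seen <- Reduce(union, s, accumulate = TRUE)

# labels seen strictly before each row (nothing before the first row)
seen_before <- c(list(character(0)), seen[-length(seen)])

# a row starts a new period when none of its labels has been seen before
is_new <- mapply(function(cur, prev) !any(cur %in% prev), s, seen_before)

period <- cumsum(is_new)
data.frame(x, period)
```

Printing `seen` or `is_new` shows exactly where each new period begins.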
If you do make the assumption that the batch numbers are consecutive then you can do the following instead
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
n<-mapply(function(x) range(as.numeric(x)),s)
p<-cumsum(c(1,n[1,-1]>n[2,-ncol(n)]))
data.frame(x,period=p)
This gives the same result (not repeated here).
A little bit shorter:
x <- c("1","1","1/2","2","3","4","5/4","5")
x <- data.frame(x = x, period = -1, stringsAsFactors = FALSE)
period <- 0
prevBatch <- -1
for (i in seq_len(nrow(x)))
{
  # compare batch numbers as numbers; comparing the strings would misorder e.g. "10" and "9"
  spl <- as.numeric(unlist(strsplit(x$x[i], "/")))
  currentBatch <- min(spl)
  if (currentBatch < prevBatch) { stop("Error in sequence") }
  if (currentBatch > prevBatch)
    period <- period + 1
  x$period[i] <- period
  prevBatch <- max(spl)
}
x
Here's a twist on the original that uses tidyr to split the data into two columns so it's easier to use:
# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)
library(tidyr)
# separate x into two columns, with second NA if only one number
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE)
Now df looks like:
> df
x x1 x2
1 1 1 NA
2 1 1 NA
3 1/2 1 2
4 2 2 NA
5 3 3 NA
6 4 4 NA
7 5/4 5 4
8 5 5 NA
Now the loop can be a lot simpler:
period <- 1
for(i in 1:nrow(df)){
  period <- c(period,
              # test if either x1 or x2 of row i are in any x1 or x2 above it
              ifelse(any(df[i, 2:3] %in% unlist(df[1:(i-1),2:3])),
                     period[i], # if so, repeat the terminal value
                     period[i] + 1)) # else append the terminal value + 1
}
# rebuild df with x and period, which loses its extra initializing value here
df <- data.frame(x = df$x, period = period[2:length(period)])
The resulting df:
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3

cumsum the opposite of diff in r

I have a question and I'm not sure if I'm being totally stupid here or if this is a genuine problem, or if I've misunderstood what these functions do.
Is cumsum the opposite of diff? I thought it was. However, using this example:
dd <- c(17.32571,17.02498,16.71613,16.40615,
16.10242,15.78516,15.47813,15.19073,
14.95551,14.77397)
par(mfrow = c(1,2))
plot(dd)
plot(cumsum(diff(dd)))
> dd
[1] 17.32571 17.02498 16.71613 16.40615 16.10242 15.78516 15.47813 15.19073 14.95551
[10] 14.77397
> cumsum(diff(dd))
[1] -0.30073 -0.60958 -0.91956 -1.22329 -1.54055 -1.84758 -2.13498 -2.37020 -2.55174
These aren't the same. Where have I gone wrong?
AHHH! Fridays.
Obviously
The functions are quite different: diff(x) returns a vector of length (length(x)-1) containing the differences x[i+1] - x[i] between consecutive elements of a vector x, while cumsum(x) returns a vector of the same length as x containing the cumulative (running) sums of the elements of x.
Example:
x <- c(1:10)
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1 1 1 1 1
v <- cumsum(x)
> v
#[1] 1 3 6 10 15 21 28 36 45 55
The function cumsum() is the cumulative sum, so each entry v[i] of the vector it returns is the sum of all elements from x[1] to x[i]. In contrast, diff(x) only takes the difference between one element x[i] and the next, x[i+1].
The combination of cumsum and diff leads to different results, depending on the order in which the functions are executed:
> cumsum(diff(x))
# 1 2 3 4 5 6 7 8 9
Here the result is the cumulative sum of a sequence of nine "1". Note that if this result is compared with the original vector x, the last entry 10 is missing.
On the other hand, by calculating
> diff(cumsum(x))
# 2 3 4 5 6 7 8 9 10
one obtains a vector that is again similar to the original vector x, but now the first entry 1 is missing.
Neither case restores the original vector, so it cannot be stated that cumsum() is the opposite or inverse function of diff().
You forgot to account for the impact of the first element:
# all.equal() compares with a numeric tolerance, which is safer than == for doubles
all.equal(dd, c(dd[1], dd[1] + cumsum(diff(dd))))
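The point about the first element can be checked directly. The sketch below puts the first value back before accumulating and compares the result with the original vector; `restored` is an illustrative name.

```r
dd <- c(17.32571, 17.02498, 16.71613, 16.40615,
        16.10242, 15.78516, 15.47813, 15.19073,
        14.95551, 14.77397)

# diff() drops the starting level, so supply it again before cumsum()
restored <- cumsum(c(dd[1], diff(dd)))

# compare with a tolerance rather than ==, since these are doubles
isTRUE(all.equal(restored, dd))
```

So cumsum() undoes diff() only once the first element is provided as the starting value.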
@RHertel answered it well, stating that diff() returns a vector of length length(x)-1.
Therefore, another simple workaround is to prepend 0 to the original vector so that diff() also computes the difference between x[1] and 0.
> x <- 5:10
> x
#[1] 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1
> diff(c(0,x))
#[1] 5 1 1 1 1 1
This way, diff() combined with c() serves as the inverse of cumsum(), and vice versa:
> cumsum(diff(c(0,x)))
#[1] 5 6 7 8 9 10
> diff(c(0,cumsum(x)))
#[1] 5 6 7 8 9 10
If you know the values of "lag" and "differences", diffinv() inverts diff() directly:
x<-5:10
y<-diff(x,lag=1,differences=1)
z<-diffinv(y,lag=1,differences = 1,xi=5) #xi is first value.
k<-as.data.frame(cbind(x,z))
k
x z
1 5 5
2 6 6
3 7 7
4 8 8
5 9 9
6 10 10

Store every value in a sequence except some values

If I do the following to a string of letters:
x <- 'broke'
y <- nchar(x)
z <- sequence(y)
How do I store every value of z that isn't the first, last, or middle value of the sequence?
In this example, where z is 1 2 3 4 5, the desired output would be 2 4.
In the case of 1 2 3 4, nothing would be stored; in the case of, say, 1 2 3 4 5 6, the values 2 and 5 would be stored, and so on.
if (length(z) %% 2) {
  z[-c(1, ceiling(length(z)/2), length(z))]
} else {
  z[-c(1, c(1,0) + floor(length(z)/2), length(z))]
}
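The same logic can be wrapped in a small function so it is easy to try on different lengths. This is a sketch; `inner_values` is a hypothetical helper name, and the empty-input guard is an added assumption.

```r
# Hypothetical helper: drop the first, last, and middle value(s) of a vector
inner_values <- function(z) {
  n <- length(z)
  if (n == 0) return(z)
  # odd n gives one middle index; even n gives the two central indices
  mid <- unique(c(ceiling(n / 2), floor(n / 2) + 1))
  z[-c(1, mid, n)]
}

inner_values(sequence(nchar("broke")))  # 2 4
inner_values(1:4)                       # integer(0)
inner_values(1:6)                       # 2 5
```

Computing both candidate middle indices and de-duplicating them removes the need for a separate odd/even branch.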
