Consecutive values below threshold value - r

I have a vector of approximately 1000 years of PDSI values that range from 6 to -6. I'd like a quick way to find where there are four or more consecutive years with values <= -2. Furthermore, I need the result to be kept in a vector of the same length as the original 1000 so that I can plot them together. The end product could even be a logical vector. Here's an example of what I have and what I'd like.
Original <- c(1,6,5,-2,-6,-4,-2,0,1,-2,-3,0)
New <- c(0,0,0,1,1,1,1,0,0,0,0,0) # expected output

You can try the following
x <- c(rep(-2, 5), rep(0, 3), rep(-2, 4), rep(0,6), rep(5, 2), rep(-2, 6), rep(-6, 2))
(r <- rle(x))
## Run Length Encoding
## lengths: int [1:7] 5 3 4 6 2 6 2
## values : num [1:7] -2 0 -2 0 5 -2 -6
(r$lengths[r$lengths > 3 & r$values== -2]) # length of each sequence
## [1] 5 4 6
To get a vector that keeps only the runs of -2 (every other value is set to 0), you can try
r$values[r$values != -2] <- 0
rep(r$values, r$lengths)
## [1] -2 -2 -2 -2 -2 0 0 0 -2 -2 -2 -2 0 0 0 0 0 0 0 0 -2 -2 -2 -2 -2 -2 0 0
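The example above matches runs of exactly -2. For the question as asked (four or more consecutive values <= -2), one possible sketch is to run rle on the logical vector instead of on the raw values. The names below and min_run are illustrative assumptions, not part of the original answer.

```r
Original <- c(1, 6, 5, -2, -6, -4, -2, 0, 1, -2, -3, 0)
min_run <- 4                      # assumed minimum run length from the question

below <- Original <= -2           # TRUE where the threshold is met
r <- rle(below)                   # runs of TRUE / FALSE
# keep only TRUE runs that are long enough; everything else becomes FALSE
r$values <- r$values & r$lengths >= min_run
New <- as.integer(inverse.rle(r)) # expand back to the original length
New
## [1] 0 0 0 1 1 1 1 0 0 0 0 0
```

Because inverse.rle rebuilds a vector of the same length as the input, New can be plotted directly against Original.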

Related

in R: how to take value from i+1th row of 1 dataframe and subtract from every row in i+1th column of 2nd dataframe

Note that the actual dataset has 1000s of columns and 100s of rows, so I am looking for a way that does not require me to manually name either columns or rows.
With a dataset that has similar structure as follows:
subvalues <- c(1:10)
df <- data.frame(x = rpois(40,2), y = rpois(40,2), z = rpois(40,2), q = rpois(40,2), t = rpois(40,2))
call the rows of subvalues SVa, SVb, SVc...
call the rows of the dataframe's columns Xa, Xb, Xc... Ya, Yb, Yc... etc.
What I am trying to build is the following: a function that first takes the first cell of subvalues (SVa) and subtracts it from every row in column x (Xa, Xb, Xc, etc.), then takes the 2nd cell of subvalues (SVb) and subtracts it from every row in column y (Ya, Yb, Yc, etc.), and so on.
What I have so far is:
res <- numeric(length = length(x))
for (i in seq_along(x)) {
res[i] <- xpos - [**SVi+1**]
}
res
I need to figure out the 'SVi+1' loop and how to properly do the loop-within a loop.
Any help is much appreciated
The example dataset you provide won't work, because the length of subvalues must equal the number of columns in df.
After some modifications, here is an example. You don't need to extract the value from subvalues separately, as it's just a subtraction.
Note that I've saved df in tmp, so that this data.frame can be modified without losing your initial data. Also, if the entire data.frame is numeric, consider using a matrix, which can save you time.
subvalues <- c(1:5) # Note here the length 5 for the 5 columns of df.
df <- data.frame(x = rpois(40,2), y = rpois(40,2), z = rpois(40,2), q = rpois(40,2), t = rpois(40,2))
tmp <- df
for(i in seq_along(subvalues)){
# print(subvalues[i])
tmp[,i] <- tmp[,i] - subvalues[i]
}
tmp[,i] returns column i of the data.frame as a vector, so you can subtract a value from that vector and save the result back in its initial place.
Maybe you can try replicate to create a matrix of same dimensions as df, and do subtraction afterwards, i.e.,
dfout <- df - t(replicate(nrow(df),subvalues))
such that
> dfout
x y z q t
1 0 1 -1 2 -4
2 0 0 0 -2 -1
3 1 1 -2 -2 -3
4 3 0 -2 -3 -2
5 0 0 0 -1 -1
6 3 1 -2 -2 -3
7 3 -2 0 -2 -5
8 1 0 -3 -3 -4
9 1 1 -2 -3 -2
10 -1 1 -2 -2 -4
11 0 0 -2 -2 -3
12 0 2 -3 -4 -2
13 2 0 -1 -4 -2
14 0 -1 1 -2 -4
15 2 -2 0 0 -4
16 1 -2 0 -2 -1
17 2 -1 -1 -2 -3
18 5 0 -1 -2 -2
19 0 0 0 2 -3
20 2 0 -1 -2 -1
21 3 2 -1 -1 -4
22 0 -1 -2 -2 -4
23 1 0 -2 -3 -1
24 -1 -1 3 -3 -3
25 0 0 -1 -1 -1
26 0 -1 -2 -2 -4
27 -1 0 -3 -3 -2
28 0 1 -1 -1 -2
29 3 -2 1 -4 -1
30 0 2 -1 0 -3
31 1 -1 2 -2 -2
32 1 1 0 -2 -4
33 1 -1 -2 -3 -5
34 0 -1 -1 -2 -1
35 2 0 -2 -2 -4
36 1 2 -3 -3 -3
37 2 2 0 -2 -5
38 -1 -1 -3 -4 -2
39 2 1 -1 -3 -4
40 1 3 -1 -3 -2
DATA
set.seed(1)
subvalues <- c(1:5) # Note here the length 5 for the 5 columns of df.
df <- data.frame(x = rpois(40,2), y = rpois(40,2), z = rpois(40,2), q = rpois(40,2), t = rpois(40,2))
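As a further alternative to the loop and the replicate approach, base R's sweep subtracts one value per column in a single call. This is only a sketch under the same assumption as above (length of subvalues equals the number of columns); out_sweep and the loop-based check are added here for illustration.

```r
set.seed(1)
subvalues <- 1:5
df <- data.frame(x = rpois(40, 2), y = rpois(40, 2), z = rpois(40, 2),
                 q = rpois(40, 2), t = rpois(40, 2))

# MARGIN = 2 means "one statistic per column"
out_sweep <- sweep(as.matrix(df), 2, subvalues, FUN = "-")

# reference result from the explicit loop, for comparison
tmp <- df
for (i in seq_along(subvalues)) {
  tmp[, i] <- tmp[, i] - subvalues[i]
}
all(out_sweep == as.matrix(tmp))
```

No column or row names are needed, so the same call should scale to wide data as long as subvalues has one entry per column.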

How to get all combinations of values rowwise in a dataframe

I have a contingency table (ct) like this:
ct <- read.table(header = TRUE, check.names = FALSE, text = "
1 2 3 4 5 6
1 0 0 1 0 2 0
2 0 0 2 0 0 0
70 0 0 3 0 0 0
76 15 13 19 2 9 8
85 0 0 2 0 0 0
109 0 0 0 0 1 2
479 0 0 0 0 2 0
491 2 0 0 0 0 0
1127 0 1 0 1 6 0
1131 0 1 1 1 2 0
1206 1 3 1 0 0 1
1208 1 0 1 0 0 1
1210 0 1 0 0 0 1
1225 2 0 1 0 0 0
1232 0 0 0 0 1 1
1242 0 0 0 1 0 1
1243 1 0 0 0 1 1
1251 0 0 2 0 1 2
1267 0 2 1 0 0 0
4415 0 2 0 0 0 0
4431 0 0 0 2 0 0
4808 0 0 0 0 2 0
4823 0 2 0 0 0 0")
Where rows represent cluster, columns represent hospitals and numbers in the table the count of isolates.
For example: Cluster 1 has 3 isolates, 1 in hospital 3 and 2 in hospital 5.
I now want to check, if clusters and hospitals are dependent on each other or not. For that, I would like to create 1000 randomly distributed tables, where all isolates in one cluster have the chance to fall into every hospital.
For example: The 3 Isolates in cluster 1 might then be distributed over 3 hospitals, so that I get the values : 0 1 1 1 0 0 .
Combinations can occur multiple times.
I tried this:
replicates <- 1000
permutations <- lapply(seq(replicates), function(i, ct){
list <- lapply(apply(ct,1,list),unlist)
list <- lapply(list, function(x)as.numeric(x))
z <- as.data.frame(do.call(rbind, lapply(list, function(x) sample(x))))
}, ct = ct)
But by that only the values in the dataframe are shuffled to another position in the row.
Can someone help me with that?
I concur with Maurits Evers' answer: at full rank you get binomial combinations per line, so n variables mean 2^n combinations, and if you add m-1 columns this yields 2^(n+m) possibilities.
Here's an alternative using partitions::compositions.
library(partitions)
# smaller toy data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# calculate row sums
rs <- rowSums(d)
# for each unique row sum, partition the value with order m = number of columns
# this avoids repeating calculation of partitions on duplicate row sums
l <- lapply(unique(rs), compositions, m = ncol(d))
# name list elements with row sums
names(l) <- unique(rs)
# set number of samples
n <- 4
# to reproduce sample in this example
set.seed(1)
# loop over rows in data frame
lapply(1:nrow(d), function(i){
# index list of partitions using row sums
m <- l[[as.character(rs[i])]]
# number of columns to sample from
nc <- ncol(m)
# select columns from matrix using a sample of n column indexes
m[ , sample(nc, n, replace = TRUE)]
})
The result is a list where each element is a matrix for each row of the original data. Each matrix column is one (sampled) partition.
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 1 0
# [2,] 1 2 0 0
# [3,] 0 0 1 2
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 2
# [2,] 3 1 0 0
# [3,] 0 3 4 2
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 1
# [2,] 0 0 1 1
# [3,] 1 0 0 0
I tried to partition the largest row sum in your example data (66), and it runs pretty quickly. Thus, if your row sums are not very much larger and the number of columns is small (like here), the code above may be a viable option.
system.time(p <- compositions(66, 6))
# user system elapsed
# 1.53 0.16 1.68
str(p)
# 'partition' int [1:6, 1:13019909] 66 0 0 0 0 0 65 1 0 0 ...
Note that it 'explodes' rapidly if the number of columns increases:
system.time(p <- compositions(66, 7))
# user system elapsed
# 14.11 1.61 15.72
Sorry @Henrik for the late response. Your code worked out quite well for me! However, with the help of a colleague of mine, I figured out this code (I'll just show it using your sample data):
#data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
#Number of replicates I want
replicates <- 1000
#Number of columns in the table
k<- 3
l <- NULL
#unlist the dataframe
list <- lapply(apply(d,1,list),unlist)
#Calculate replicates of the dataframe, where numbers are permuted within rows
permutations <- lapply(seq(replicates), function(j){
l_sampled <- lapply(list, function(x){
pos.random <- sample(k, sum(x), replace = T)
x.random <- rep(0,k)
for (i in 1:k){
x.random[i] <- sum(pos.random==i)
}
l = rbind(l, data.frame(x.random))
})
df <- data.frame(matrix(unlist(l_sampled), nrow=length(l_sampled), byrow=T))
})
#Example for results:
> permutations[[8]]
X1 X2 X3
1 2 0 0
2 1 2 1
3 1 0 1
> permutations[[10]]
X1 X2 X3
1 0 1 1
2 2 0 2
3 0 2 0
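Sampling column positions with replacement and then counting them, as in the code above, amounts to one multinomial draw per row with equal probabilities. A shorter sketch of the same idea with stats::rmultinom follows; the equal-probability assumption and the name rs are illustrative choices, not part of the original post.

```r
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
rs <- rowSums(d)        # number of isolates per cluster (row)
k <- ncol(d)            # number of hospitals (columns)
replicates <- 1000

set.seed(1)
permutations <- lapply(seq_len(replicates), function(j) {
  # one multinomial draw per row; each column is equally likely
  t(vapply(rs,
           function(s) rmultinom(1, size = s, prob = rep(1 / k, k))[, 1],
           numeric(k)))
})
permutations[[1]]
```

Each element of permutations is a matrix with the same dimensions and the same row sums as d, so it can be compared with the observed table directly.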

finding interval indices of values in a vector, when intervals may be overlapped

I want to find the indices of values in a vector belonging to intervals which are defined by a vector of ending values and 1)"look-back" value interval and 2) previous N values.
Suppose I have
x <- c(1,3,4,5,7,8,9,10,13,14,15,16,17,18) #the vector of interest
v_end <- c(5, 7, 15) #the end values
l<-3 #look-back value interval
N<-3 #number of value to look back
What I want is the second and third columns of the following output.
x i n
[1,] 1 0 1
[2,] 3 1 1
[3,] 4 1 1
[4,] 5 1 1
[5,] 7 1 1
[6,] 8 0 0
[7,] 9 0 0
[8,] 10 0 1
[9,] 13 1 1
[10,] 14 1 1
[11,] 15 1 1
[12,] 16 0 0
[13,] 17 0 0
[14,] 18 0 0
Notice that v_end and l result in three intervals [2,5],[4,7],[12,15]. [2,5] and [4,7] have overlaps, essentially, it is [2,7].
And, v_end and N result in three intervals [1,5], [3,7], [10,15]. Again there are overlaps.
The task is similar to what the function findInterval{base} does, but cannot be solved by it directly.
Having ordered "v_end" and "x" (for the "N" case), the intervals for the "l" case are:
ints = cbind(start = v_end - l, end = v_end)
ints
# start end
#[1,] 2 5
#[2,] 4 7
#[3,] 12 15
Their overlaps could be grouped with:
overlap_groups = cumsum(c(TRUE, ints[-nrow(ints), "end"] < ints[-1, "start"]))
which can be used to reduce the intervals that are overlapping:
group_end = cumsum(rle(overlap_groups)$lengths)
group_start = c(1L, group_end [-length(group_end )] + 1L)
ints2 = cbind(start = ints[group_start, "start"], end = ints[group_end, "end"])
ints2
# start end
#[1,] 2 7
#[2,] 12 15
Then using findInterval:
istart = findInterval(x, ints2[, "start"])
iend = findInterval(x, ints2[, "end"], left.open = TRUE)
i = as.integer((istart - iend) == 1L)
i
# [1] 0 1 1 1 1 0 0 0 1 1 1 0 0 0
For the case of "N", starting with:
ints = cbind(start = x[match(v_end, x) - N], end = v_end)
ints
# start end
#[1,] 1 5
#[2,] 3 7
#[3,] 10 15
and following the above steps, we get:
#.....
n = as.integer((istart - iend) == 1L)
n
# [1] 1 1 1 1 1 0 0 1 1 1 1 0 0 0
Generally, a convenient tool for such operations is the "IRanges" package which, here, makes the approach straightforward:
library(IRanges)
xrng = IRanges(x, x)
i = as.integer(overlapsAny(xrng, reduce(IRanges(v_end - l, v_end), min.gapwidth = 0)))
i
# [1] 0 1 1 1 1 0 0 0 1 1 1 0 0 0
n = as.integer(overlapsAny(xrng, reduce(IRanges(x[match(v_end, x) - N], v_end), min.gapwidth = 0)))
n
# [1] 1 1 1 1 1 0 0 1 1 1 1 0 0 0
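For small inputs, the results can also be cross-checked without merging intervals at all, by testing each value against every interval. The helper in_any below is a hypothetical name introduced for this sketch, and it assumes, as the code above does, that every element of v_end occurs in x with at least N values before it.

```r
x <- c(1, 3, 4, 5, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18)
v_end <- c(5, 7, 15)
l <- 3
N <- 3

# 1 if a value lies in at least one closed interval [start, end], else 0
in_any <- function(x, start, end) {
  as.integer(vapply(x, function(xi) any(xi >= start & xi <= end), logical(1)))
}

i <- in_any(x, v_end - l, v_end)                  # look back by value
n <- in_any(x, x[match(v_end, x) - N], v_end)     # look back by N elements
cbind(x, i, n)
```

This does length(x) times length(v_end) comparisons, so the findInterval or IRanges approaches are likely the better choice for long vectors.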

R ddply with multiple variables

Here is a simple data frame for my real data set:
df <- data.frame(ID=rep(101:102,each=9),phase=rep(1:3,6),variable=rep(LETTERS[1:3],each=3,times=2),mm1=c(1:18),mm2=c(19:36),mm3=c(37:54))
I would like to first group by ID and variable, then for values(mm1, mm2, mm3), phase 3 is subtracted from all phases(phase1 to phase3), which would make mm(1-3) in phase 1 all -2, in phase 2 all -1, and phase 3 all 0.
R throws an error of "Error in Ops.data.frame(x, x[3, ]) : - only defined for equally-sized data frames"
as I tried:
df1 <- ddply(df, .(ID, variable), function(x) (x - x[3,]))
Any advice would be greatly appreciated.
The output should look like this:
ID phase variable mm1 mm2 mm3
101 1 A -2 -2 -2
101 2 A -1 -1 -1
101 3 A 0 0 0
101 1 B -2 -2 -2
101 2 B -1 -1 -1
101 3 B 0 0 0
101 1 C -2 -2 -2
101 2 C -1 -1 -1
101 3 C 0 0 0
102 1 A -2 -2 -2
102 2 A -1 -1 -1
102 3 A 0 0 0
102 1 B -2 -2 -2
102 2 B -1 -1 -1
102 3 B 0 0 0
102 1 C -2 -2 -2
102 2 C -1 -1 -1
102 3 C 0 0 0
Okay, took me a little bit to figure out what you want, but here is a solution:
library(plyr) # provides ddply
cols.to.sub <- paste0("mm", 1:3)
df1 <- ddply(
df, .(ID, variable),
function(x) {
x[cols.to.sub] <- t(t(as.matrix(x[cols.to.sub])) - unlist(x[x$phase == 3, cols.to.sub]))
x
} )
This produces (first 6 rows):
ID phase variable mm1 mm2 mm3
1 101 1 A -2 -2 -2
2 101 2 A -1 -1 -1
3 101 3 A 0 0 0
4 101 1 B -2 -2 -2
5 101 2 B -1 -1 -1
6 101 3 B 0 0 0
Generally speaking the best way to debug this type of issue is to put a browser() statement inside the function you are passing to ddply, so you can examine the objects at your leisure. Doing so would have revealed that:
The data frame passed to your function includes the ID columns, as well as the phase columns, so your mm columns are not the first three (hence the need to define cols.to.sub)
Even if you address that, you can't operate on data frames that have unequal dimensions, so what I do here is convert to matrix, and then take advantage of vector recycling to subtract the one row from the rest of the matrix. I need to t (transpose) because vector recycling is column-wise.
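The same per-group subtraction can be sketched in base R with split and sweep, in case plyr is not available. The names cols, pieces and ref are assumptions made for this example, and the row order of the result follows split rather than the original data.

```r
df <- data.frame(ID = rep(101:102, each = 9), phase = rep(1:3, 6),
                 variable = rep(LETTERS[1:3], each = 3, times = 2),
                 mm1 = 1:18, mm2 = 19:36, mm3 = 37:54)
cols <- paste0("mm", 1:3)

pieces <- lapply(split(df, list(df$ID, df$variable)), function(g) {
  ref <- unlist(g[g$phase == 3, cols])             # the phase-3 row of this group
  # subtract one reference value per column from every row of the group
  g[cols] <- sweep(as.matrix(g[cols]), 2, ref, FUN = "-")
  g
})
df1 <- do.call(rbind, pieces)
head(df1)
```

sweep with MARGIN = 2 does the column-wise recycling explicitly, which avoids the double transpose.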

How to assign class with highest frequency to each row of a data.frame in R?

i have the following table:
mymatrix <- matrix(c(34,11,65,32,12,9,32,90,21,51,45,23), ncol=3)
colnames(mymatrix) <- c("pos", "neg", "neutr") # class
rownames(mymatrix) <- c("1 -1 0", "-1 -1 0", "0 -1 1", "0 0 1") # patterns
mytable <- as.table(mymatrix)
mytable
# pos neg neutr
# 1 -1 0 34 12 21
# -1 -1 0 11 9 51
# 0 -1 1 65 32 45
# 0 0 1 32 90 23
now i have new data with three columns. each row contains one of the patterns "1 -1 0", "-1 -1 0", "0 -1 1" and "0 0 1". so for example, my new data looks like this:
one <- c( 1, 1, 0, -1, 0, 1, 1)
two <- c( -1, -1, -1, -1, 0, -1, -1)
three <- c(0, 0, 1, 0, 1, 0, 0)
mydf <- data.frame(one, two, three)
mydf
# one two three
# 1 1 -1 0
# 2 1 -1 0
# 3 0 -1 1
# 4 -1 -1 0
# 5 0 0 1
# 6 1 -1 0
# 7 1 -1 0
now i want to get a fourth column in mydf that assigns the class (pos, neg, neutr) to each row in mydf. the class with the highest frequency should be assigned.
it should look like this:
# one two three four
# 1 1 -1 0 pos # (because for this pattern (1 -1 0), "pos" gets highest frequency in mytable.)
# 2 1 -1 0 pos
# 3 0 -1 1 pos
# 4 -1 -1 0 neutr
# 5 0 0 1 neg
# 6 1 -1 0 pos
# 7 1 -1 0 pos
how can i do that?
thank you!
In the first step you could learn the mapping from triple to label, and then you could look up the mapped value for each row of mydf:
maxes = apply(mytable, 1, function(x) colnames(mytable)[which.max(x)])
mydf$four = maxes[match(paste(mydf$one, mydf$two, mydf$three), rownames(mytable))]
mydf
# one two three four
# 1 1 -1 0 pos
# 2 1 -1 0 pos
# 3 0 -1 1 pos
# 4 -1 -1 0 neutr
# 5 0 0 1 neg
# 6 1 -1 0 pos
# 7 1 -1 0 pos
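A variant of the same lookup that does not name the columns of mydf individually is sketched below. It uses max.col to find the winning class per pattern and a named vector as the lookup table; best and key are illustrative names, and ties are broken in favour of the first column here, which is an assumption.

```r
mymatrix <- matrix(c(34, 11, 65, 32, 12, 9, 32, 90, 21, 51, 45, 23), ncol = 3,
                   dimnames = list(c("1 -1 0", "-1 -1 0", "0 -1 1", "0 0 1"),
                                   c("pos", "neg", "neutr")))
mydf <- data.frame(one   = c(1, 1, 0, -1, 0, 1, 1),
                   two   = c(-1, -1, -1, -1, 0, -1, -1),
                   three = c(0, 0, 1, 0, 1, 0, 0))

# winning class for each pattern, named by the pattern string
best <- colnames(mymatrix)[max.col(mymatrix, ties.method = "first")]
names(best) <- rownames(mymatrix)

key <- do.call(paste, mydf)       # builds "1 -1 0" etc. from all columns
mydf$four <- unname(best[key])    # patterns missing from the table give NA
mydf
```

Rows whose pattern does not occur in the table end up as NA rather than causing an error, which makes unseen patterns easy to spot.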
