I am trying to remove consecutive rows in a dataframe if all the values in those rows are less than 1 and the run is longer than, e.g., 4 rows.
Let's say we have a column [0.1, 0, 5, 4, 0.2, 0.1, 0, 0, 0, 4, 9, 10]. Then I would like to remove only the middle part [0.2, 0.1, 0, 0, 0] and be left with [0.1, 0, 5, 4, 4, 9, 10]. I can easily do this with a for loop, but I am dealing with over 3 million data points and it takes far too long. Therefore I am looking for a vectorized solution in R. Does anyone know what function I can use?
Thanks in advance!
You can run a correlation over your dataset. Mark every row whose elements are all less than 1 with a 1 and every other row with a 0, then correlate that indicator vector with a kernel of ones of length 4: wherever the result equals 4, all four rows under the window are such rows. After that, it is a matter of expanding each hit back to the full window. Here is a complete example with a NumPy array (which you can extract from a pandas DataFrame with df.to_numpy()):
import numpy as np
"""
Notation: row whose elements are all < 1, will be called "target row"
Task: Remove every target row in a cluster of 4 consecutive target rows
Input: 11 x 5 dataset with target rows [0, 1, 2, 3, 4, 7]
Output: pruned dataset with rows [5, 6, 7, 8, 9, 10]
(Note that target row 7 must be kept because it's separated from the others)
"""
# Input
n, m = 11, 5
ar = np.random.rand(n, m)
ar[[5, 6, 8, 9, 10]] += 1.
min_rows = 4
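# Find all target rows (every element < 1); comparing the row sum with the
# number of columns would also flag rows such as [0, 0, 0, 0, 4.5]
sums = (ar < 1).all(axis=1).astype(np.float32)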
print(f" Sums: {sums}")
# Find centers of clusters with 4 consecutive target rows
kernel = np.ones((min_rows,))
output = np.correlate(sums, kernel, mode="same")
print(f" Output: {output}")
mask = output == min_rows
print(f" Mask: {mask.astype(np.float32)}")
# Find all elements in the clusters
mask_ids = np.nonzero(mask)[0]
center = min_rows // 2
rng = np.arange(-center, center + (min_rows % 2 != 0), dtype=np.int32)
ids = (rng + mask_ids.reshape(-1, 1)).ravel()
mask[ids] = True
print(f"New Mask: {mask.astype(np.float32)}")
# mask the dataset
ar = ar[~mask]
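For the single-column example from the question, the same pruning can also be sketched with run lengths instead of a correlation. This is only a plain-Python illustration, not the method used above: the function name drop_long_runs and its parameters are made up here, and min_run=4 treats a run of 4 or more as removable (like the NumPy example), so raise it to 5 if "exceeds 4" is meant strictly.

```python
from itertools import groupby


def drop_long_runs(values, threshold=1, min_run=4):
    """Drop every run of at least `min_run` consecutive values below `threshold`."""
    kept = []
    # groupby splits the sequence into maximal runs of "small" / "not small" values.
    for is_small, run in groupby(values, key=lambda v: v < threshold):
        run = list(run)
        # Short runs of small values, and all runs of large values, are kept.
        if not (is_small and len(run) >= min_run):
            kept.extend(run)
    return kept


column = [0.1, 0, 5, 4, 0.2, 0.1, 0, 0, 0, 4, 9, 10]
print(drop_long_runs(column))  # [0.1, 0, 5, 4, 4, 9, 10]
```

The leading [0.1, 0] survives because that run has only two elements.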
As a result of seeing THIS EXAMPLE, I was wondering how I could create one set of 15 shuffled orderings of 1 through 4 in R?
On THIS Website, you can get 1 Set of 15 shuffled Numbers
Ranging: From 1 to 4
As an example, on my run I got:
Set #1:
3, 2, 2, 1, 1, 1, 3, 2, 2, 3, 2, 1, 3, 4, 1
Is there a way I can replicate the above in R?
If I understood your question correctly, the first solution that comes to mind is the following: very basic, but it does the job.
size <- 40
vec <- sample(1:4, size = size, replace = TRUE)
while(length(unique(vec)) < 4){
  vec <- sample(1:4, size = size, replace = TRUE)
}
vec
The while loop will not run for long, because with 40 draws it is very unlikely that one of the digits is missing from the random vector vec.
Of course you can change size and the code will still work, unless size is less than 4; in that case all four digits can never appear and the loop will run forever.
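The redraw-until-complete idea can be sketched in Python as well, with an explicit guard against the never-ending case. The function name sample_with_all_values and its arguments are hypothetical, chosen only for this illustration.

```python
import random


def sample_with_all_values(size, values=(1, 2, 3, 4), rng=random):
    """Draw `size` values with replacement, redrawing until every value appears."""
    if size < len(values):
        # With fewer draws than distinct values the loop could never finish.
        raise ValueError("size must be at least the number of distinct values")
    while True:
        draw = [rng.choice(values) for _ in range(size)]
        if set(draw) == set(values):
            return draw


print(sample_with_all_values(15))
```

With 15 draws a redraw is needed only occasionally, so the loop still ends quickly.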
I have a vector created by simulating a continuous-time Markov chain. The vector represents a path the chain may follow. Simulating 20 steps we could have:
Xt <- c(5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0 ,0)
Further, the chain can move down one state at a time or jump from any state (5, 4, 3, 2, 1) to 0. So another simulation could be:
Xt <- c(5, 5, 5, 5, 5, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
I want to count the number of times the simulated chain jumps to another state (i.e. whenever the value in the vector changes) within a given interval. For example:
For the first 10 elements of the first vector, the number of jumps is 2 (from 5 to 4 and from 4 to 3). For the last 10 elements of the second vector, the number of jumps is 0 (they are all 0).
So I would like to count the number of jumps (the number of times the value changes). I tried using toString(Xt) and then matching some regex, but nothing worked. Any ideas?
You can use diff for this, which computes the differences between adjacent elements of a vector. Count the differences that are not zero to get the number of times the value changes.
First 10:
sum(diff(Xt[1:10])!=0)
[1] 2
Last 10:
# tail(Xt, 10) takes exactly the last 10 elements; (length(Xt)-10):length(Xt) would take 11
sum(diff(tail(Xt, 10)) != 0)
[1] 0
It seems that simply counting the number of times the difference is not zero delivers the desired result:
Xt <- c(5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0 ,0)
sum(diff(Xt) != 0)
If the goal is to write a function that takes a vector and a starting position, it could be done like this:
jump_in_next_10 <- function(x, start){
  sum(diff(x[start:(start + 9)]) != 0)
}
jump_in_next_10(Xt, 3)
#[1] 2
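For anyone comparing with Python, the diff idea comes down to counting adjacent pairs that differ. A small sketch: count_jumps and its arguments are names chosen here, and positions are 0-based with an exclusive end, unlike the 1-based R indexing above.

```python
def count_jumps(path, start=0, stop=None):
    """Count positions in path[start:stop] whose value differs from the previous one."""
    window = path[start:stop]
    # zip pairs each element with its successor, like diff() on adjacent elements.
    return sum(a != b for a, b in zip(window, window[1:]))


xt = [5, 5, 5, 5, 5, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(count_jumps(xt, 0, 10))   # 2 (5 -> 4 and 4 -> 0)
print(count_jumps(xt, 10, 20))  # 0 (all zeros)
```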
So, I have a vector full of 1s and 0s. I need to plot a graph that starts at (0, 0) and rises by 1 for every 1 in the vector and dips by 1 for every 0 in the vector. For example if my vector is [ 1, 1, 1, 0, 1, 0, 1, 1 ] I should get something that looks like
I thought about creating another vector that would hold the sum of the first i elements of the original vector at index i (from the example: [ 1, 2, 3, 3, 4, 4, 5, 6 ]) but that would not account for the dips at 0s. Also, I cannot use loops to solve this.
I would convert the zeros to -1, add a zero at the very beginning to make sure it starts from [0,0] and then plot the cumulative sum:
#starting vec
myvec <- c(1, 1, 1, 0, 1, 0, 1, 1)
#convert 0 to -1
myvec[myvec == 0] <- -1
#add a zero at the beginning to make sure it starts from [0,0]
myvec <- c(0, myvec)
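#plot cumulative sum against x = 0, 1, 2, ... so the line really starts at (0, 0);
#'l' is the documented type code for lines
plot(seq_along(myvec) - 1, cumsum(myvec), type = 'l')
#points(seq_along(myvec) - 1, cumsum(myvec)) - if you also want the points on top of the line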
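The heights that end up on the y-axis can be checked without plotting. The following is a sketch of the same 0 to -1 conversion and cumulative sum in plain Python; the variable names are chosen here for illustration.

```python
from itertools import accumulate

steps = [1, 1, 1, 0, 1, 0, 1, 1]
# Map 1 -> +1 and 0 -> -1, then prepend the starting height 0.
moves = [1 if s == 1 else -1 for s in steps]
heights = [0] + list(accumulate(moves))
print(heights)  # [0, 1, 2, 3, 2, 3, 2, 3, 4]
```

The value at position i is the height after i steps, so position 0 corresponds to the point (0, 0).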
I want to randomly pick a number from a vector with 8 elements that sums to 35. If the number is 0, pick another number. If the number is greater than 0, decrease it by 1. Do this in a loop until the sum of the vector is 20. How can I do this in R?
For example: vec<-c(2,3,6,0,8,5,6,5)
Pick a number from this vector at random and decrease it by 1, repeating until the sum of the elements becomes 20.
I'm really not sure this is what you want, but based on what I understand of your question, here is my solution. You'll get most of the concepts and key functions from my script. Use it and help() to understand them and optimize it.
vec <- c(2, 3, 6, 0, 8, 5, 6, 5)
while(sum(vec) > 20) {
  # pick a random position; positions holding 0 are simply skipped
  i <- sample(length(vec), 1)
  if(vec[i] > 0) vec[i] <- vec[i] - 1
}
vec
Try this:
vec <- c(2, 3, 6, 0, 8, 5, 6, 5)
#just setting the seed for reproducibility
set.seed(19)
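#nbins keeps the result at length(vec) even if the last positions are never drawn
tabulate(sample(rep(seq_along(vec), vec), 20), nbins = length(vec))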
#[1] 0 2 4 0 4 5 3 2
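The one-liner above expands the vector into 35 individual units, keeps 20 of them without replacement and counts what is left at each position. A Python sketch of the same steps follows; shrink_to_total is a hypothetical name, and note that this makes every unit equally likely to be removed, which is not quite the same as picking a position uniformly in a loop.

```python
import random
from collections import Counter


def shrink_to_total(counts, target, rng=random):
    """Keep `target` units drawn without replacement from sum(counts) units."""
    # One entry per unit, labelled with its position: like rep(seq_along(vec), vec).
    units = [i for i, c in enumerate(counts) for _ in range(c)]
    kept = Counter(rng.sample(units, target))
    # Count the kept units per position: like tabulate(..., nbins = length(vec)).
    return [kept.get(i, 0) for i in range(len(counts))]


vec = [2, 3, 6, 0, 8, 5, 6, 5]
result = shrink_to_total(vec, 20, random.Random(19))
print(result, sum(result))
```

The result always sums to 20 and no entry can exceed its original value, because a position only has as many units as it started with.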
I have the following matrix
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <-data.frame(m)
I am trying to create a random number in each row where the upper limit is based on a variable value (in this case 1*Y based on each row's value for Z)
I currently have:
samp<-function(x){
  sample(0:1,1,replace = TRUE)}
m$randoms <- apply(m,1,samp)
which works well for applying the sample function independently to each row, but I always get an error when I try to change the range passed to sample. I thought I could do something like this:
samp<-function(x){
  sample(0:m$Z,1,replace = TRUE)}
m$randoms <- apply(m,1,samp)
but I guess that was wishful thinking.
Ultimately I want the result:
Y Z randoms
2 5 4
4 7 7
3 9 3
5 3 1
1 7 6
Any ideas?
The following will sample from 0 to x$Y for each row, and store the result in randoms:
x$randoms <- sapply(x$Y + 1, sample, 1) - 1
Explanation:
The sapply takes each value in x$Y separately (let's call this y), and calls sample(y + 1, 1) on it.
Note that (e.g.) sample(y+1, 1) will sample 1 random integer from the range 1:(y+1). Since you want a number from 0 to y rather than 1 to y + 1, we subtract 1 at the end.
Also, just pointing out - no need for replace=T here because you are only sampling one value anyway, so it doesn't matter whether it gets replaced or not.
Based on @mathematical.coffee's suggestion and my edited example, this is the final result:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow = 5, ncol = 2)
colnames(m) <- c("Y", "Z")
m <- data.frame(m)
# draw one integer from 0 to Z for each row
m$randoms <- sapply(m$Z + 1, sample, 1) - 1
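For comparison, drawing one integer per row with a row-specific upper limit looks like this in Python. This is only a sketch: z_values is the Z column from the example above, and random.randint includes both endpoints, so no +1/-1 shift is needed.

```python
import random

z_values = [5, 7, 9, 3, 7]
# One draw per row, each from 0..z inclusive.
randoms = [random.randint(0, z) for z in z_values]
print(randoms)
```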