I have a vector:
x<-c(1,2,3,3,2,2)
Now I want to order this vector by number of occurrences. I know I can count the occurrences with table:
x.order <- table(x)[rev(order(table(x)))]
This gives me:
x
2 3 1
3 2 1
Now I know I first have to select the values of x which are 2, then the values which are 3, and then the values where x is 1. How can I perform this last step?
The final output has to look like:
2,2,2,3,3,1
Or is there a better way to order the vector by number of occurrences?
x <- c(1,2,3,3,2,2)
x.order <- sort(table(x), decreasing = TRUE)  # counts, most frequent value first
rep(as.numeric(names(x.order)), times = x.order)
#[1] 2 2 2 3 3 1
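If x isn't numeric, the as.numeric step would mangle the values. A variation (my suggestion, not part of the answer above) that preserves the original type indexes x by each element's count:
# order the original elements by how often each value occurs
x[order(-table(x)[as.character(x)])]
# [1] 2 2 2 3 3 1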
Related
I have a long vector of patient statuses in R that is chronologically sorted, together with a vector of associated patient IDs. This vector is an element of a data frame. I would like to label consecutive rows of data for which the patient status is the same. If the status changes and then reverts to its original value, that counts as three separate events. This is different from most situations I have searched, where duplicated or match would suffice.
An example would be along the lines of:
s <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
and the desired output would be
flag <- c(1,1,1,2,2,2,3,1,2,3,3,4,4)
or
flag <- c(1,1,1,2,2,2,3,4,5,6,6,7,7)
One inelegant approach would be to generate the sequence:
unlist(tapply(s, id, function(x) cumsum(c(T, x[-1] != rev(rev(x)[-1])))))
Is there a better way?
I think you could use rleid from data.table for this:
library(data.table)
rleid(s,id)
Output:
1 1 1 2 2 2 3 4 5 6 6 7 7
Or for the first sequence:
data.table(s,id)[,rleid(s),id]$V1
Output:
1 1 1 2 2 2 3 1 2 3 3 4 4
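If data.table isn't an option, a base R sketch of what rleid(s, id) computes (assuming, as here, two parallel vectors; a new run starts wherever either one changes):
cumsum(c(TRUE, s[-1] != s[-length(s)] | id[-1] != id[-length(id)]))
# [1] 1 1 1 2 2 2 3 4 5 6 6 7 7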
Run Length Encoding - rle()
tapply(s, id, function(x) {
  v <- rle(x)$lengths   # run lengths within each id
  rep(seq_along(v), v)  # number the runs 1, 2, ... and expand
})
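This returns a list of per-id flags; to flatten it into a single vector matching the first desired output, wrap the call in unlist:
unlist(tapply(s, id, function(x) {
  v <- rle(x)$lengths
  rep(seq_along(v), v)
}), use.names = FALSE)
# [1] 1 1 1 2 2 2 3 1 2 3 3 4 4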
I have to create a single long sequence out of a large number (> 10,000) of sequences of different lengths. I only know the lengths of these sequences, given as a vector:
length_v <- c(2,3,4,4,2,6,11,75, ...)
Each sequence starts at 1 and moves forward in steps of 1. In the final (combined) sequence, the individual sequences have to appear one after the other; they can't be jumbled up.
A small demonstrating example is below:
Say I have 4 sequences of lengths 2, 3, 4, and 6 respectively.
s1 <- seq(1, 2) # 1,2
s2 <- seq(1, 3) # 1,2,3
s3 <- seq(1, 4) # 1,2,3,4
s4 <- seq(1, 6) # 1,2,3,4,5,6
Final sequence will be
final <- c(s1,s2,s3,s4) # the order has to be this only. No compromise here.
Doing this by hand for > 10,000 sequences would be very inefficient. Is there a simpler way of doing this?
We can use sequence:
sequence(length_v)
#[1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5 6
data
length_v <- c(2,3,4,6)
Example:
unlist(sapply(c(2,3,4,6), seq, from=1))
So for your data it will be:
unlist(sapply(length_v, seq, from=1))
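One caveat (my addition): if any length were 0, seq(to = 0, from = 1) counts downwards to c(1, 0) instead of returning an empty vector, and sapply simplifies its result to a matrix when all lengths happen to be equal. Using lapply with seq_len avoids both edge cases:
unlist(lapply(length_v, seq_len))
#[1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5 6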
I have a Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row, i.e. no more than 2 negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various things like dplyr, sample, order, and sort, and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and draw a random permutation of it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to 2.
# some vector of valences
val <- rep(1:3, each = 50)

pseudoRandomize <- function(x, n){
  # take an initial sample
  out <- sample(x)
  # check if the sample is "bad" (containing sequences longer than n)
  bad.seq <- any(rle(out)$lengths > n)
  # length of the whole sample
  l0 <- length(out)
  while (bad.seq) {
    # get lengths of all subsequences
    l1 <- rle(out)$lengths
    # find the bad ones
    ind <- l1 > n
    # take the last value of each bad sequence, and...
    for (i in cumsum(l1)[ind]) {
      # take it out of the original sample
      tmp <- out[-i]
      # pick a new position at random
      pos <- sample(2:(l0-2), 1)
      # put the value back into the sample at the new position
      out <- c(tmp[1:(pos-1)], out[i], tmp[pos:(l0-1)])
    }
    # check if bad sequences (still) exist
    # if TRUE, then 'while' continues; if FALSE, then it doesn't
    bad.seq <- any(rle(out)$lengths > n)
  }
  # return the reordered sequence
  out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the Valence Categories appropriately. I've called the valence category treat.
# Generate example data
s1 <- data.frame(id = 1:10, treat = NA)

# Set the first two rows
s1[1, "treat"] <- sample(1:3, 1)
s1[2, "treat"] <- sample(1:3, 1)

# Loop through the remainder of the rows
for (i in 3:nrow(s1)) {
  s1[i, "treat"] <- sample(1:3, 1)
  # Check if the treat value is equal to the previous two values
  if (s1[i, "treat"] == s1[i-1, "treat"] & s1[i-1, "treat"] == s1[i-2, "treat"]) {
    # If so, draw one of the values not equal to that value
    a <- 1:3
    remove <- s1[i, "treat"]
    a <- a[!a == remove]
    s1[i, "treat"] <- sample(a, 1)
  }
}
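A quick sanity check (my addition, not part of the answer) that the generated column satisfies the constraint:
# no run of identical treat values longer than 2
max(rle(s1$treat)$lengths) <= 2
# [1] TRUE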
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
I am currently working through an intro class and I was having some difficulty with this particular problem:
Create a function that takes in a vector of numbers V.Size and a single number N as inputs, and outputs a list object of size N where each list member is a vector that contains elements of V.Size, such that the largest value in V.Size is in the vector of the first list item, the second largest value in V.Size is in the vector of the second list item, etc. The (N+1) ordered value of V.Size should be in the first vector of the list, the (N+2) ordered value of V.Size should be in the second vector of the list, and so on.
Now, this is what I have done thus far while trying to put together example code:
V.Size <- c(5,4,2,3,1)
n <- 5
Function <- c(V.Size, n)
Function
[1] 5 4 2 3 1 5
sort(Function, decreasing=TRUE)
[1] 5 5 4 3 2 1
The issue I am having is with (N+1), (N+2) and its ordering.
The first step to addressing this would be to create a vector of the list position for each element in sorted V.Size. This is basically the vector (1, 2, ..., N, 1, 2, ..., N, ...), with total length equal to length(V.Size). You can get that with:
V.Size <- c(5,4,2,3,1)
n <- 2
rep(1:n, length.out=length(V.Size))
# [1] 1 2 1 2 1
Now you can use the split function to create a list based on these assignments:
split(sort(V.Size, decreasing=TRUE), rep(1:n, length.out=length(V.Size)))
# $`1`
# [1] 5 3 1
#
# $`2`
# [1] 4 2
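Wrapping this into the function the assignment asks for (the name makeRankedList is my placeholder, not from the course material):
makeRankedList <- function(V.Size, N) {
  split(sort(V.Size, decreasing = TRUE),
        rep(1:N, length.out = length(V.Size)))
}
makeRankedList(c(5,4,2,3,1), 2)
# $`1`
# [1] 5 3 1
#
# $`2`
# [1] 4 2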
I have a large data frame consisting of two columns. I want to calculate the average of the second-column values for each subset of the first column, where the subsets are determined by a specified granularity. For example, for the following data frame df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 per subset. The results should go in two new columns.
       A  B   newA      newB   (expected results)
 0.22096  1      0  1.142857
 0.33489  1      1  2.000000
 0.33655  1      2  4.000000
 0.43953  1
 0.64933  2
 0.86668  1
 0.96932  1
 1.09342  2
 1.58314  2
 1.88481  2
 2.07654  4
 2.34652  3
 2.79777  5
This is a simple example; I'm not sure how to loop over the whole data frame and perform the calculation, i.e. the average of df$B. I tried the code below to create the subsets, but couldn't figure out how to append the results and build the final result:
increment <- 1
mx <- max(df$A)
i <- 0
newDF <- data.frame()
while (i < mx) {
  tmp <- subset(df, A > i & A < (i + increment))
  i <- i + increment
}
Not sure about the logic. But I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval for the subset selection (In your example a simple ceiling for each A value should be sufficient, too. But if your increment is different from 1 you need findInterval.) and tapply to calculate the mean:
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
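To produce the two new columns (newA and newB) from the question, the tapply result can be reshaped into a small data frame, for example:
means <- tapply(df$B, df$subset, mean)
newDF <- data.frame(newA = as.numeric(names(means)),
                    newB = as.vector(means))
newDF
#   newA     newB
# 1    0 1.142857
# 2    1 2.000000
# 3    2 4.000000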