I have a long vector in R in which successive values often repeat. For example:
x = c(rep(0.2, 1500), rep(0.1, 10007), 0.7, 0.9, rep(0.1, 9784))
I am trying to write a function that takes this vector as input and returns either of the two following strings:
s = "R 0.2 1500 R 0.1 10007 R 0.7 1 R 0.9 1 R 0.1 9784"
or, preferably,
s = "R 0.2 1500 R 0.1 10007 0.7 0.9 R 0.1 9784"
where R 0.7 1 R 0.9 1 has become 0.7 0.9.
For intuition, R stands for repeat or rep. The string therefore closely resembles the way I constructed the vector x in the first place.
I tried looping through each value, but that was too slow for my needs. Can you help me find a fast solution?
#Data
x = c(rep(0.2, 1500), rep(0.1, 10007), 0.7, 0.9, rep(0.1, 9784))
#Run rle once and paste values and lengths together
r = rle(x)
y = paste("R", r$values, r$lengths)
#There may be an easier way to do this using regex
#But here is one solution using strsplit
#Remove 1 and R
y = sapply(strsplit(y," "), function(a)
if (gsub(" ","",a[3]) == "1"){
a = a[2]
} else {
a = a
}
)
#Collapse everything together
paste(unlist(y), collapse = " ")
#[1] "R 0.2 1500 R 0.1 10007 0.7 0.9 R 0.1 9784"
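The same idea can be wrapped in a function that skips the strsplit step by deciding per run whether to emit the "R value length" form or the bare value. This is only a sketch; compress_runs is a hypothetical name, and it assumes runs of length 1 should always be written as the bare value.

```r
# compress_runs is a hypothetical helper name, not part of the answer above
compress_runs <- function(x) {
  r <- rle(x)
  # Runs of length 1 become the bare value, longer runs become "R value length"
  tokens <- ifelse(r$lengths == 1,
                   as.character(r$values),
                   paste("R", r$values, r$lengths))
  paste(tokens, collapse = " ")
}

x <- c(rep(0.2, 1500), rep(0.1, 10007), 0.7, 0.9, rep(0.1, 9784))
s <- compress_runs(x)
print(s)
```

Because rle() and ifelse() are both vectorised, no explicit loop over the elements of x is needed.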
I would like to implement random numbers for the time values equal to 0 (time == 0) and keep other time values as given.
set.seed(123)
df$time.new <- ifelse(df$time == 0, sample(0.2:0.8, replace=F), df$time)
With this code, only 0.2 is ever generated.
I will fill the blanks in the comment that answered the question. This is your code:
set.seed(123)
df$time.new <- ifelse(df$time == 0, sample(0.2:0.8, replace=F), df$time)
The key to understanding why you always get 0.2 is to run:
0.2:0.8
This just yields [1] 0.2, and that is why you always get 0.2. The seq() function lets you build sequences with more elements by specifying a smaller increment:
> seq(0.2, 0.8, by = 0.1)
[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8
The increment for a:b is always one unit. Let's check a toy example:
> a <- 1; b <- 7
> a:b
[1] 1 2 3 4 5 6 7
If we do this with a <- 0.2 and b <- 0.8, the next value would be 1.2, which is already past 0.8, so the resulting vector consists of just the value 0.2. sample() therefore has only that one value to draw from.
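Putting this together, one way to replace only the zeros is to draw exactly as many values as there are zeros and assign them by logical index. This is a sketch under assumptions: the question does not show df, so the data frame below is made up, and sampling with replacement from the 0.1 grid is one reading of what was wanted.

```r
set.seed(123)

# Hypothetical data frame; the original question does not show df
df <- data.frame(time = c(0, 1.5, 0, 2.0, 0, 0.7))

zero <- df$time == 0          # rows whose time should be replaced
df$time.new <- df$time        # start from the original values
# Draw one value per zero from 0.2, 0.3, ..., 0.8
df$time.new[zero] <- sample(seq(0.2, 0.8, by = 0.1), sum(zero), replace = TRUE)

print(df)
```

If any value in the interval is acceptable rather than only multiples of 0.1, runif(sum(zero), 0.2, 0.8) could be used in place of the sample() call.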
I'm trying to round numbers up to the nearest 0.5. My data set is as follows:
Wgt
0.160
0.522
0.174
0.765
1.246
2.893
The result I want after rounding is:
Round Wgt
0.5
1.0
0.5
1.0
1.5
3
Basically, I want the ceiling of each number to the nearest multiple of 0.5. Please help me with it.
Simply do this:
x <- c(0.16, 0.522, 0.174, 0.765, 1.246, 2.893)
y <- x * 2
z <- ceiling(y)
z / 2
This yields
0.5 1.0 0.5 1.0 1.5 3.0
You can also try the round_any function from the plyr library, which does exactly this.
> library(plyr)
> x <- c(0.16, 0.522, 0.174, 0.765, 1.246, 2.893)
> round_any(x, 0.5, f=ceiling)
[1] 0.5 1.0 0.5 1.0 1.5 3.0
Couldn't find any explicit answer to this baked into R, but here's a quickie. I made a function called half_ceil that performed the behavior you wanted on one value then used sapply to apply it to a vector:
half_ceil = function(x){
whole = ceiling(x)
if(x >= whole - .5){
return(whole)
}
return(whole - .5)
}
sapply(Wgt, half_ceil)
As written, this rounds values that already sit exactly on a .5 boundary (such as 1.5) up to the next integer. To leave those values unchanged, change the >= in the condition to a strict >.
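For comparison, a vectorised variant avoids sapply altogether and leaves values that are already multiples of the step untouched. This is a minimal sketch; ceiling_to and its step argument are hypothetical names, and for steps that are not exactly representable in binary (0.1, say) floating-point error could push a value to the next multiple.

```r
# ceiling_to is a hypothetical helper name used only for this sketch
ceiling_to <- function(x, step = 0.5) {
  # Scale so that multiples of step become integers, round up, scale back
  ceiling(x / step) * step
}

Wgt <- c(0.160, 0.522, 0.174, 0.765, 1.246, 2.893)
rounded <- ceiling_to(Wgt)
on_grid <- ceiling_to(c(0.5, 1.0, 1.5))  # already multiples of 0.5

print(rounded)
print(on_grid)
```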
# library (energy)
RR=100
n=10
a=2
b=4
miu1=2
miu2=4
m22=(b^2)*(1-(rho^2))
# This is the point where am having problem
# I want the programme to retain the results average0.1, average0.05 and
# average0.01 for every 'rho' from the rho_list used for the simulation
# but I am stuck because I don't know how to get the result
rho_list=c(0,0.3,0.6)
for (rho in rho_list){
energy=rep(NA,RR)
for (i in 1:RR){
z1=rnorm(n,0,1)
z2=rnorm(n,0,1)
x1=miu1+a*z1
x2=miu2+(rho*b*z1)+(sqrt(m22)*z2)
X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
energy[i]=mvnorm.etest(X)$p.value
}
average0.1=sum(energy<=0.1)/RR
average0.05=sum(energy<=0.05)/RR
average0.01=sum(energy<=0.01)/RR
}
I want the program to retain the results average0.1, average0.05 and average0.01 for every rho in the rho_list used for the simulation, but I am stuck because I don't know how to collect the results.
Your example is not reproducible, so I'm giving you some simulated data to demonstrate how to output the result.
rho_list=c(0,0.3,0.6)
result <- sapply(rho_list, FUN = function(rho, ...) {
average0.1 = runif(1)
average0.05 = runif(1)
average0.01 = runif(1)
c(rho = rho, a0.1 = average0.1, a0.05 = average0.05, a0.01 = average0.01)
}, RR = RR, n = n, a = a, b = b, miu1 = miu1, miu2 = miu2, m22 = m22, simplify = FALSE)
do.call("rbind", result)
     rho      a0.1      a0.05      a0.01
[1,] 0.0 0.0136175 0.08581583 0.07171591
[2,] 0.3 0.8334469 0.42103038 0.07857328
[3,] 0.6 0.8231120 0.40647485 0.65408540
One option would be to store the results in a list for each value of rho and then bind them into a single data frame. Here's an example. Note that since rho isn't defined in the set-up code, I've substituted the definition of m22 for m22 in the loop. Also, I've set RR=10 to save time in running the code.
library(energy)
RR=10
n=10
a=2
b=4
miu1=2
miu2=4
rho_list=c(0, 0.3, 0.6)
energy_threshold = c(0.1, 0.05, 0.01) # Store energy thresholds in a vector
# Create a list of data frames. Each data frame contains the result for each
# of the three energy thresholds for one value of rho.
results = lapply(rho_list, function(rho) {
energy=rep(NA,RR)
for (i in 1:RR) {
z1=rnorm(n,0,1)
z2=rnorm(n,0,1)
x1=miu1+a*z1
x2=miu2+(rho*b*z1)+(sqrt((b^2)*(1-(rho^2)))*z2)
X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
energy[i]=mvnorm.etest(X)$p.value
}
data.frame(rho, energy_threshold, result=sapply(energy_threshold, function(y) sum(energy <= y)/RR))
})
# Bind the three data frames into a single data frame
results = do.call(rbind, results)
And here's the output:
results
  rho energy_threshold result
1 0.0             0.10    0.1
2 0.0             0.05    0.0
3 0.0             0.01    0.0
4 0.3             0.10    0.2
5 0.3             0.05    0.1
6 0.3             0.01    0.0
7 0.6             0.10    0.0
8 0.6             0.05    0.0
9 0.6             0.01    0.0
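If one row per rho is easier to read, a long data frame like this can be reshaped to wide form with base R's reshape(). A minimal sketch, assuming the three columns rho, energy_threshold and result; the values are copied from the printed output above rather than re-simulated, so the energy package is not needed to run it.

```r
# Values copied from the printed output above (not re-simulated)
results <- data.frame(
  rho = rep(c(0, 0.3, 0.6), each = 3),
  energy_threshold = rep(c(0.1, 0.05, 0.01), times = 3),
  result = c(0.1, 0, 0, 0.2, 0.1, 0, 0, 0, 0)
)

# One row per rho, one result column per threshold
wide <- reshape(results, idvar = "rho", timevar = "energy_threshold",
                direction = "wide")
print(wide)
```

Each threshold then appears as its own column next to rho, which is close to the average0.1, average0.05 and average0.01 layout the question asks for.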
I stored the values from the loop in numeric vectors and then used cbind() to combine the results. Here is the entire code:
library(energy)
RR=10
n=10
a=2
b=4
miu1=2
miu2=4
# Empty numeric vectors that collect one rejection rate per value of rho
average0.1 <- numeric(0)
average0.05 <- numeric(0)
average0.01 <- numeric(0)
rho_list=c(0,0.3,0.6)
for (rho in unique(rho_list)){
# m22 depends on rho, so it has to be computed inside the loop
m22=(b^2)*(1-(rho^2))
energy=rep(NA,RR)
for (i in 1:RR){
z1=rnorm(n,0,1)
z2=rnorm(n,0,1)
x1=miu1+a*z1
x2=miu2+(rho*b*z1)+(sqrt(m22)*z2)
X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
energy[i]=mvnorm.etest(X)$p.value
}
# Append the share of p-values at or below each threshold
average0.1=c(average0.1, sum(energy<=0.1)/RR)
average0.05=c(average0.05, sum(energy<=0.05)/RR)
average0.01=c(average0.01, sum(energy<=0.01)/RR)
}
# Combine the per-rho results into one matrix, one row per rho
results=cbind(rho=unique(rho_list), average0.1, average0.05, average0.01)
I'm basically looking for a way to do a variation of this Ruby script in R.
I have an arbitrary list of numbers (steps of a moderator for a regression plot in this case) which have unequal distances from each other, and I'd like to round values which are within a range around these numbers to the nearest number in the list.
The ranges don't overlap.
arbitrary.numbers <- c(4,10,15) / 10
numbers <- c(16:1 / 10, 0.39, 1.45)
range <- 0.1
Expected output:
numbers
## 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.39 1.45
round_to_nearest_neighbour_in_range(numbers,arbitrary.numbers,range)
## 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
I've got a little helper function that might do for my specific problem, but it's not very flexible and it contains a loop. I can post it here, but I think a real solution would look completely different.
The different answers, timed for speed (on a million numbers):
> numbers = rep(numbers,length.out = 1000000)
> system.time({ mvg.round(numbers,arbitrary.numbers,range) })[3]
elapsed
0.067
> system.time({ rinker.loop.round(numbers,arbitrary.numbers,range) })[3]
elapsed
0.289
> system.time({ rinker.round(numbers,arbitrary.numbers,range) })[3]
elapsed
1.403
> system.time({ nograpes.round(numbers,arbitrary.numbers,range) })[3]
elapsed
1.971
> system.time({ january.round(numbers,arbitrary.numbers,range) })[3]
elapsed
16.12
> system.time({ shariff.round(numbers,arbitrary.numbers,range) })[3]
elapsed
15.833
> system.time({ mplourde.round(numbers,arbitrary.numbers,range) })[3]
elapsed
9.613
> system.time({ kohske.round(numbers,arbitrary.numbers,range) })[3]
elapsed
26.274
MvG's function is the fastest, about 4 times faster than Tyler Rinker's second function (0.067 s versus 0.289 s).
A vectorized solution, without any apply family functions or loops:
The key is findInterval, which finds the "space" in arbitrary.numbers where each element in numbers is "between". So, findInterval(6,c(2,4,7,8)) returns 2, because 6 is between the 2nd and 3rd index of c(2,4,7,8).
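A quick check of how findInterval behaves, including at the edges, may help before reading the code. The inputs other than 6 are made up for illustration: a value below the first element gives 0, and a value at or above the last element gives the length of the vector.

```r
breaks <- c(2, 4, 7, 8)

# 1 lies below every break, 2 sits exactly on the first break,
# 6 lies between the 2nd and 3rd break, 9 lies above every break
idx <- findInterval(c(1, 2, 6, 9), breaks)
print(idx)
```

The 0 for values below the first break is why the code in this answer has to guard against an index of 0.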
# arbitrary.numbers is assumed to be sorted.
# find the index of the number just below each number, and just above.
# So for 6 in c(2,4,7,8) we would find 2 and 3.
low<-findInterval(numbers,arbitrary.numbers) # find index of number just below
high<-low+1 # find the corresponding index just above.
# Find the actual absolute difference between the arbitrary number above and below.
# So for 6 in c(2,4,7,8) we would find 2 and 1.
# (The absolute differences to 4 and 7).
low.diff<-numbers-arbitrary.numbers[ifelse(low==0,NA,low)]
high.diff<-arbitrary.numbers[ifelse(high==0,NA,high)]-numbers
# Find the minimum difference.
# In the example we would find that 6 is closest to 7,
# because the difference is 1.
mins<-pmin(low.diff,high.diff,na.rm=T)
# For each number, pick the arbitrary number with the minimum difference.
# So for 6 pick out 7.
pick<-ifelse(!is.na(low.diff) & mins==low.diff,low,high)
# Compare the actual minimum difference to the range.
ifelse(mins<=range+.Machine$double.eps,arbitrary.numbers[pick],numbers)
# [1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
Yet another solution using findInterval:
arbitrary.numbers<-sort(arbitrary.numbers) # need them sorted
range <- range*1.000001 # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers - range) # index of nearest
nearest <- c(-Inf, arbitrary.numbers)[nearest + 1] # value of nearest
diff <- numbers - nearest # compute errors
snap <- diff <= range # only snap near numbers
numbers[snap] <- nearest[snap] # snap values to nearest
print(numbers)
The nearest in the above code is not really mathematically the nearest number. Instead, it is the largest arbitrary number such that nearest[i] - range <= numbers[i], or equivalently nearest[i] <= numbers[i] + range. So in one go we find the largest arbitrary number which is either in the snapping range for a given input number, or still too small for that. For this reason, we only need to check one way for snap. No absolute value is required, and even the squaring from a previous revision of this post was unnecessary.
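For reuse, the snippet above can be wrapped in a function and checked against the expected output from the question. This is a minimal sketch; snap_to_targets and its argument names are hypothetical, and the tolerance is called tol here to avoid shadowing base R's range(). Like the snippet, it assumes the snapping ranges do not overlap.

```r
# snap_to_targets is a hypothetical wrapper around the snippet above
snap_to_targets <- function(numbers, targets, tol) {
  targets <- sort(targets)                     # findInterval needs sorted breaks
  tol <- tol * 1.000001                        # avoid rounding issues
  idx <- findInterval(numbers, targets - tol)  # largest target with target - tol <= number
  nearest <- c(-Inf, targets)[idx + 1]         # -Inf marks "no candidate"
  snap <- numbers - nearest <= tol             # one-sided check is enough
  numbers[snap] <- nearest[snap]
  numbers
}

arbitrary.numbers <- c(4, 10, 15) / 10
numbers <- c(16:1 / 10, 0.39, 1.45)
out <- snap_to_targets(numbers, arbitrary.numbers, 0.1)
print(out)
```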
Thanks to Interval search on a data frame for the pointer at findInterval, as I found it there before recognizing it in the answer by nograpes.
If, in contrast to your original question, you had overlapping ranges, you could write things like this:
arbitrary.numbers<-sort(arbitrary.numbers) # need them sorted
range <- range*1.000001 # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers) + 1 # index of interval
hi <- c(arbitrary.numbers, Inf)[nearest] # next larger
nearest <- c(-Inf, arbitrary.numbers)[nearest] # next smaller
takehi <- (hi - numbers) < (numbers - nearest) # larger better than smaller
nearest[takehi] <- hi[takehi] # now nearest is really nearest
snap <- abs(nearest - numbers) <= range # only snap near numbers
numbers[snap] <- nearest[snap] # snap values to nearest
print(numbers)
In this code, nearest really ends up being the nearest number. This is achieved by considering both endpoints of every interval. In spirit, this is very much like the version by nograpes, but it avoids using ifelse and NA, which should benefit performance as it reduces the number of branching instructions.
Is this what you want?
> idx <- abs(outer(arbitrary.numbers, numbers, `-`)) <= (range+.Machine$double.eps)
> rounded <- arbitrary.numbers[apply(rbind(idx, colSums(idx) == 0), 2, which)]
> ifelse(is.na(rounded), numbers, rounded)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
Please note that due to rounding errors (most likely), I use range = 0.1000001 to achieve the expected effect.
range <- range + 0.0000001
blah <- rbind( numbers, sapply( numbers, function( x ) abs( x - arbitrary.numbers ) ) )
ff <- function( y ) { if( min( y[-1] ) <= range + 0.000001 ) arbitrary.numbers[ which.min( y[ -1 ] ) ] else y[1] }
apply( blah, 2, ff )
This is still shorter:
sapply(numbers, function(x) ifelse(min(abs(arbitrary.numbers - x)) >
range + .Machine$double.eps, x, arbitrary.numbers[which.min
(abs(arbitrary.numbers - x))] ))
Thanks @MvG
Another option:
arb.round <- function(numbers, arbitrary.numbers, range) {
arrnd <- function(x, ns, r){
# use the argument r rather than reaching for range in the enclosing scope
ifelse(abs(x - ns) <= r +.00000001, ns, x)
}
lapply(1:length(arbitrary.numbers), function(i){
numbers <<- arrnd(numbers, arbitrary.numbers[i], range)
}
)
numbers
}
arb.round(numbers, arbitrary.numbers, range)
Yields:
> arb.round(numbers, arbitrary.numbers, range)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
EDIT: I removed the return call at the end of the function as it's not necessary and can burn time.
EDIT: I think a loop will be even faster here:
loop.round <- function(numbers, arbitrary.numbers, range) {
arrnd <- function(x, ns, r){
# use the argument r rather than reaching for range in the enclosing scope
ifelse(abs(x - ns) <= r +.00000001, ns, x)
}
for(i in seq_along(arbitrary.numbers)){
numbers <- arrnd(numbers, arbitrary.numbers[i], range)
}
numbers
}
I have the following data.frame
df<-data.frame(x=c("A","B","C"),colour=c(0.3,0.6,0.9))
x colour
1 A 0.3
2 B 0.6
3 C 0.9
I want to replace the numbers in df[,"colour"] with colours such that:
< 0.4 = colour 1
0.4 - 0.7 = colour 2
>= 0.7 = colour 3
I have tried the following replacement scheme, but the earlier colour assignments are overwritten by the later ones. Any advice?
library(RColorBrewer)
g<-brewer.pal(3,"Greens")
col1<-df[,"colour"] < 0.4
df[col1,"colour"]<-g[1]
col2<-df[,"colour"] < 0.7
df[col2,"colour"]<-g[2]
col3<-df[,"colour"] >= 0.7
df[col3,"colour"]<-g[3]
Thanks for your advice.
THE SOLUTION
Thanks seancarmody (and spaceman for useful comments)
v<-c(0.45,0.65,0.75,0.85,0.95)
breaks<-c(0.4,0.5,0.6,0.7,0.8,0.9)
#Create a colour for each section
cols<-brewer.pal(length(breaks)+1,"Greens")
#Replace the values in v with colours using the breaks
v <- as.character(cut(v, c(-Inf, breaks, Inf),labels=cols))
> v
[1] "#C7E9C0" "#74C476" "#41AB5D" "#238B45" "#005A32"
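Your approach will work if you evaluate all the conditions before assigning anything, and apply the broader condition first:
col1<-df[,"colour"] < 0.4
col2<-df[,"colour"] < 0.7
col3<-df[,"colour"] >= 0.7
df[col2,"colour"]<-g[2]
df[col1,"colour"]<-g[1]
df[col3,"colour"]<-g[3]
Since everything less than 0.4 is also less than 0.7, your approach was overwriting col1. The conditions also have to be evaluated while the column is still numeric, because the first assignment turns it into character strings and later comparisons would then compare text. Since you've edited your original question, the above is out of date. I'd just use the more general approach here: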
breaks <- c(0.4, 0.7) # you can add more cut points here
cols <- brewer.pal(length(breaks) + 1, "Greens")
df$colour <- as.character(cut(df$colour, c(-Inf, breaks, Inf), labels=cols))
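One detail worth checking is what happens to a value that sits exactly on a break. By default cut() builds intervals that are closed on the right, so 0.4 would land in the first bin; passing right = FALSE closes them on the left instead, which matches the ">= 0.7" condition in the question. The sketch below uses a made-up three-colour palette in place of brewer.pal so that it runs without RColorBrewer.

```r
df <- data.frame(x = c("A", "B", "C"), colour = c(0.3, 0.6, 0.9))

breaks <- c(0.4, 0.7)
# Placeholder palette (an assumption); swap in brewer.pal(3, "Greens") if available
cols <- c("lightgreen", "green", "darkgreen")

# right = FALSE gives bins [-Inf, 0.4), [0.4, 0.7), [0.7, Inf)
bins <- cut(df$colour, c(-Inf, breaks, Inf), labels = cols, right = FALSE)
df$colour <- as.character(bins)

# A value exactly on a break, with and without right = FALSE
on_break_left  <- as.character(cut(0.4, c(-Inf, breaks, Inf), labels = cols, right = FALSE))
on_break_right <- as.character(cut(0.4, c(-Inf, breaks, Inf), labels = cols))

print(df)
```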