Under what circumstances does R recycle?

I have two variables, x (with 5 values) and y (with 11 values). When I run the expression,
> v <- 2*x +y +1
R responds:
Error at 2* x+y: Longer object length is not a multiple of shorter object length.
I checked: 1*x gives me the 5 values of x, but y has 11 values. So R says it can't add 11 values to 5 values? This raises the general question: under what circumstances does recycling work?

Recycling works in your example:
> x <- seq(5)
> y <- seq(11)
> x+y
[1] 2 4 6 8 10 7 9 11 13 15 12
Warning message:
In x + y : longer object length is not a multiple of shorter object length
> v <- 2*x +y +1
Warning message:
In 2 * x + y :
longer object length is not a multiple of shorter object length
> v
[1] 4 7 10 13 16 9 12 15 18 21 14
The "error" that you reported is in fact a "warning": R is notifying you that the longer length (11) is not a multiple of the shorter length (5), but it recycles anyway. You may have options(warn=2) set, which converts warnings into errors.
In general, avoid relying on recycling. If you get into the habit of ignoring the warnings, some day it will bite you and your code will fail in a way that is very hard to diagnose.
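As a minimal sketch of the rule behind that warning (the variable names a, b, and warned are only for illustration): recycling always happens, and it is silent only when the longer length is a whole multiple of the shorter one.

```r
# Lengths 6 and 2: 6 is a multiple of 2, so 1:2 is recycled silently.
a <- 1:6 + 1:2
a
# [1] 2 4 4 6 6 8

# Lengths 5 and 2: R still recycles, but signals a warning.
b <- suppressWarnings(1:5 + 1:2)
b
# [1] 2 4 4 6 6

# Detect the warning programmatically instead of printing it.
warned <- tryCatch({ 1:5 + 1:2; FALSE }, warning = function(w) TRUE)
warned
# [1] TRUE
```

A length-1 vector is the most common case of silent recycling, which is why 2*x and +1 in the original expression never warn.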

It is safer not to work this way. Use vectors of the same length:
x_samelen = c(1,2,3)
y_samelen = c(10,20,30)
x_samelen*y_samelen
[1] 10 40 90
If vectors are of the same length, the result is well defined and understood. You can do "recycling", but it really is not advisable to do so.
I wrote a short script that makes your two vectors the same length by padding the shorter vector with repeats of its own values. This lets you execute your code without warnings:
x_orig <- c(1,2,3,4,5,6,7,8,9,10,11)
y_orig <- c(21,22,23,24,25)
if (length(x_orig) > length(y_orig)) {
  x <- x_orig
  # repeat y_orig until it is as long as x_orig
  y <- rep(y_orig, length.out = length(x_orig))
  cat("padding y\n")
} else {
  # repeat x_orig until it is as long as y_orig
  x <- rep(x_orig, length.out = length(y_orig))
  y <- y_orig
  cat("padding x\n")
}
The results are:
x_orig
[1] 1 2 3 4 5 6 7 8 9 10 11
y_orig
[1] 21 22 23 24 25
x
[1] 1 2 3 4 5 6 7 8 9 10 11
y
[1] 21 22 23 24 25 21 22 23 24 25 21
If you reverse x_orig and y_orig:
x_orig
[1] 21 22 23 24 25
y_orig
[1] 1 2 3 4 5 6 7 8 9 10 11
x
[1] 21 22 23 24 25 21 22 23 24 25 21
y
[1] 1 2 3 4 5 6 7 8 9 10 11
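The same padding can probably be written more compactly with base R's rep_len(), which repeats a vector up to a given length. This is only a sketch of an alternative, not the script above; the name n is introduced here for illustration.

```r
x_orig <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y_orig <- c(21, 22, 23, 24, 25)

# Target length: the longer of the two vectors.
n <- max(length(x_orig), length(y_orig))

# rep_len() leaves the longer vector unchanged and pads the shorter one.
x <- rep_len(x_orig, n)
y <- rep_len(y_orig, n)

# Same lengths now, so no recycling warning is raised.
v <- 2 * x + y + 1
```

Because both vectors go through rep_len(), no if/else on the lengths is needed.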

Related

R ranges: 1:0 - illogical behavior

I have an array X of length N, and I'd like to compute sum(X[(i+1):N]) - sum(X[1:(i-1)]). This works fine if my index, i, is within 2..(N-1). If it's equal to 1, the second term returns the first element of the array rather than excluding it. If it's equal to N, the first term returns the last element of the array rather than excluding it. seq_len is the only function I'm aware of that does the job, but only for the second term (it indexes 1:n). What I need is a range function that returns NULL (rather than throwing an exception like seq) when its second argument is below its first. The sum function will do the rest. Is anyone aware of one, or do I have to write one myself?
I suggest an alternate path for generating indexing sequences: seq_len, which reacts intuitively in the extremes.
Bottom Line Up Front: use sum(X[-seq_len(i)]) - sum(X[seq_len(i-1)]) instead.
First, some sample data:
X <- 1:10
N <- length(X)
Your approach, at the two extremes:
i <- 1
X[(i+1):N]
# [1] 2 3 4 5 6 7 8 9 10
X[1:(i-1)] # oops
# [1] 1
That should return "nothing", I believe. (More to the point, sum(...) should return 0. For the record, sum(integer(0)) is 0.)
i <- 10
X[(i+1):N] # oops
# [1] NA 10
X[1:(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
There's your other error, where you'd expect "nothing" in the first subset.
Instead, I suggest you use seq_len:
i <- 1
X[-seq_len(i)]
# [1] 2 3 4 5 6 7 8 9 10
X[seq_len(i-1)]
# integer(0)
i <- 10
X[-seq_len(i)]
# integer(0)
X[seq_len(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
Both seem fine, and something in the middle makes sense.
i <- 5
X[-seq_len(i)]
# [1] 6 7 8 9 10
X[seq_len(i-1)]
# [1] 1 2 3 4
In this contrived example, what we're looking for at any value of i:
1: sum(2:10) - 0 = 54 - 0 = 54
2: sum(3:10) - sum(1:1) = 52 - 1 = 51
3: sum(4:10) - sum(1:2) = 49 - 3 = 46
...
10: 0 - sum(1:9) = 0 - 45 = -45
And we now get that:
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i-1)])
sapply(c(1,2,3,10), func, X)
# [1] 54 51 46 -45
Edit:
李哲源's answer got me thinking that you don't need to re-sum the numbers before and after every time. Just do it once and re-use the result. This method could easily be faster if your vector is large.
Xb <- c(0, cumsum(X)[-N])
Xb
# [1] 0 1 3 6 10 15 21 28 36 45
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xa
# [1] 54 52 49 45 40 34 27 19 10 0
sapply(c(1,2,3,10), function(i) Xa[i] - Xb[i])
# [1] 54 51 46 -45
So this suggests that your summed value at any value of i is
Xs <- Xa - Xb
Xs
# [1] 54 51 46 39 30 19 6 -9 -26 -45
where you can find the specific value with Xs[i]. No repeated summing required.
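As a quick check, the sketch below recomputes both versions on the sample data and compares them for every i. The name direct is introduced here for illustration.

```r
X <- 1:10
N <- length(X)

# Direct version: re-sum both sides for each i.
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i - 1)])
direct <- sapply(seq_len(N), func, X)

# Cumulative version: sums before and after each position, computed once.
Xb <- c(0, cumsum(X)[-N])
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xs <- Xa - Xb

all(Xs == direct)
# [1] TRUE
```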

How to use apply function instead of for loop if you have multiple if conditions to be executed

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Output to be:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be less than FinalTime).
Another DF "temp" from "s.d" has to be formed having data only within the above time interval, and then the most recent values of "F","D","E" have to be taken and attached to the 'ith' row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
here 'j' represents value for each row:
if ((temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1)
{
temp$Flag[j] <- 1
} else{
temp$Flag[j] <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using "for loop" but it obviously takes 2 to 3 days as there are a lot of rows.
(Also if I have to add new columns to the resultant DF if multiple conditions get satisfied on each row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over row indices
handle in the function call your if branches
return either your dataframe or NULL
combine everything with rbind
by replacing lapply with mclapply from the 'parallel' package, your code gets executed in parallel.
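resultList <- lapply(seq_len(nrow(t.d)), function(i) {
  # do stuff for row i here, building a data frame `df`
  if (condition) {
    return(df)
  } else {
    return(NULL)
  }
})
# NULL entries are dropped when the pieces are bound together
resultDF <- do.call(rbind, resultList)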
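To make the pattern concrete, here is a small runnable sketch on made-up data. The data frame toy and the "keep even values of A" condition are hypothetical stand-ins for the real time-interval logic, which is not reproduced here.

```r
# Toy stand-in for t.d; the real data and condition will differ.
toy <- data.frame(ID = 1:5, A = c(6, 7, 8, 9, 10))

resultList <- lapply(seq_len(nrow(toy)), function(i) {
  row <- toy[i, ]
  if (row$A %% 2 == 0) {
    row$Flag <- 1   # attach whatever columns the branch computes
    row
  } else {
    NULL            # rows that fail the condition contribute nothing
  }
})

# rbind skips the NULL entries and stacks the remaining one-row data frames.
resultDF <- do.call(rbind, resultList)
resultDF$ID
# [1] 1 3 5
```

With millions of rows, calling rbind on that many one-row pieces can itself be slow, so vectorising the condition over whole columns is usually worth trying first.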

How can I tell a for loop in R to regenerate a sample if the sample contains a certain pair of species?

I am creating 1000 random communities (vectors) from a species pool of 128, with certain operations applied to each community and the result stored in a new vector. For simplicity, I have been practicing with 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that, if one of the pairs appears in a random community, I need that community to be thrown out and a new one generated. I have been able to code it so that a community (vector) containing such a pair is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, but again that leaves me short of communities.
C<-c(1:20)
D<-numeric(10)
X<- numeric(5)
for(i in 1:10){
X<-sample(C, size=5, replace = FALSE)
if("10" %in% X & "11" %in% X) X=NA else X=X
if("1" %in% X & "2" %in% X) X=NA else X=X
print(X)
D[i]<-sum(X)
}
print(D)
This is what my result looks like.
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
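One possible way to get the regeneration described in the question is to wrap the sampling in a repeat loop that only exits once the sample contains no forbidden pair. This is a sketch under the question's own setup (pool of 20, pairs 10/11 and 1/2). The names pool, forbidden, and has_forbidden are introduced here for illustration.

```r
set.seed(1)
pool <- 1:20

# Pairs that must not appear together in one community.
forbidden <- list(c(10, 11), c(1, 2))

# TRUE if any forbidden pair is fully contained in x.
has_forbidden <- function(x) {
  any(vapply(forbidden, function(p) all(p %in% x), logical(1)))
}

D <- numeric(10)
for (i in 1:10) {
  repeat {
    X <- sample(pool, size = 5, replace = FALSE)
    if (!has_forbidden(X)) break   # keep drawing until the sample is valid
  }
  D[i] <- sum(X)
}
```

Every iteration of the for loop now yields a valid community, so D has no NA entries and none of the 10 slots is skipped.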

ggplot2 is plotting a line strangely

I am trying to plot the time series x_t = A + (-1)^t B.
To do this I am using the following code. The problem is that the ggplot output is wrong.
require (ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scale = "free")
Looking at the head of dat, everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only have values 10.5533 and 13.4467. But it also takes different values. What is wrong in my code?
Thanks in advance for any help
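You really should be more careful before asserting that something is "wrong". The way you are creating dat, val has 80 elements while t and w have only 40, so data.frame() recycles t and w and every value of t appears twice within each w. The rows are not ordered by dat$t, so head(...) does not display the extra values: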
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.
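Row numbers such as 21 and 61 sharing the same t and w are what data.frame() produces when it recycles shorter columns to match a longer one. A minimal sketch with stand-in values (t_col, val, and d are illustrative names, not the question's data):

```r
t_col <- rep(1:20, 2)   # 40 values, like t in the question
val   <- seq_len(80)    # 80 values, stand-in for the real series

# 80 is a multiple of 40, so t_col is recycled without any warning.
d <- data.frame(t = t_col, val = val)

nrow(d)
# [1] 80

# Each value of t now appears 4 times instead of the intended 2.
unique(as.vector(table(d$t)))
# [1] 4
```

Checking length() of each column before building the data frame would surface the mismatch.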

Merge values of a factor column

Column data$form contains 170 unique values (numbers from 1 to ~800).
I would like to merge some of the values (e.g. in steps of 10).
I need to do this in order to use:
colors = rainbow(length(unique(data$form)))
In a plot and provide a better visual result.
Thank you in advance for your help.
You can use %/% to group them, mean to combine the values in each group, and a normalize step to rescale the result.
# if you want specifically 20 groups:
groups <- sort(form) %/% (800/20)
x <- c(by(sort(form), groups, mean))
x <- normalize(x, TRUE) * 19 + 1
0 1 2 3 4
1.000000 1.971781 2.957476 4.103704 4.948560
5 6 7 8 9
5.950617 7.175309 7.996914 8.953086 9.952263
10 11 12 13 14
10.800705 11.901235 12.888889 13.772291 14.888889
15 16 17 18 19
15.927984 16.864198 17.918519 18.860082 20.000000
You could also use cut. If you use the argument labels=FALSE, you get an integer value:
form <- runif(170, min=1,max=800)
> cut(form, breaks=20)
[1] (518,558] (280,320] (240,280] (121,160] (757,797]
[6] (160,200] (320,359] (598,638] (80.8,121] (359,399]
[7] (121,160] (200,240] ...
20 Levels: (1.18,41] (41,80.8] (80.8,121] (121,160] (160,200] (200,240] (240,280] (280,320] (320,359] (359,399] (399,439] ... (757,797]
> cut(form, breaks=20, labels=FALSE)
[1] 14 8 7 4 20 5 9 16 3 10 4 6 5 18 18 6 2 12
[19] 2 19 13 11 13 11 14 12 17 5 ...
On a side note, I would encourage you to reconsider plotting with rainbow colours, as they distort how the data is read; cf. Rainbow Color Map (Still) Considered Harmful.
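Putting the cut approach together with a colour lookup might look like the sketch below. It assumes simulated data as in the answer and uses hcl.colors() from grDevices (available in R 3.6.0 and later) as one non-rainbow option; rainbow(20) could be substituted. The names bins, palette20, and cols are illustrative.

```r
set.seed(1)
form <- runif(170, min = 1, max = 800)

# Integer bin index (1..20) for every value of form.
bins <- cut(form, breaks = 20, labels = FALSE)

# One colour per bin, then one colour per observation via indexing.
palette20 <- hcl.colors(20)
cols <- palette20[bins]

length(cols)
# [1] 170
```

cols can then be passed as the col argument of a base plot, so that values in the same bin share a colour.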
