For loop in R with increments - r

I am trying to write a for loop which will increment its value by 2. The equivalent code in C is
for (i=0; i<=78; i=i+2)
How do I achieve the same in R?

See ?seq for more info:
for(i in seq(from=1, to=78, by=2)){
    # stuff, such as
    print(i)
}
or
for(i in seq(1, 78, 2))
p.s. Pardon my C ignorance. There, I just outed myself.
However, this is a way to do what you want in R (please see updated code)
EDIT
After learning a bit of how C works, it looks like the example posted in the question iterates over the following sequence: 0 2 4 6 8 ... 74 76 78.
To replicate that exactly in R, start at 0 instead of at 1, as above.
seq(from=0, to=78, by=2)
[1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44
[24] 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78
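As a hedged sketch of how the C loop maps onto R (the names `visited` and `visited_while` are made up for illustration), the first loop below walks the sequence 0, 2, ..., 78 with `seq()`, and the second is a more literal `while` translation of the C code:

```r
# Illustrative only: record every value the loop visits.
visited <- numeric(0)
for (i in seq(from = 0, to = 78, by = 2)) {
  visited <- c(visited, i)
}

# A literal translation of for (i=0; i<=78; i=i+2) using while.
visited_while <- numeric(0)
i <- 0
while (i <= 78) {
  visited_while <- c(visited_while, i)
  i <- i + 2
}

length(visited)                # 40 values: 0, 2, ..., 78
all(visited == visited_while)  # TRUE
```

The `seq()` form is usually preferred in R because the whole index vector is built once up front.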

You can do it as follows. Replace length(v1) with whatever upper bound you want to iterate to, and replace the 2 with the increment you want:
for(i in seq(1,length(v1),2))
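One caveat, offered as a sketch rather than as part of the answer above: `seq(1, length(v1), 2)` raises an error when `v1` is empty, because the sequence would have to run from 1 down to 0 with a positive step. The hypothetical helper `odd_positions()` below builds the same indices but returns an empty vector in that case, so the loop body is simply skipped.

```r
# Hypothetical helper (not from the original answer): indices 1, 3, 5, ...
# up to length(x), or an empty vector when x is empty.
odd_positions <- function(x) {
  seq_len(ceiling(length(x) / 2)) * 2 - 1
}

v1 <- c("a", "b", "c", "d", "e")   # made-up example data
for (i in odd_positions(v1)) {
  print(v1[i])                     # prints "a", "c", "e"
}

# With an empty input the loop runs zero times instead of failing.
for (i in odd_positions(character(0))) {
  print("never reached")
}
```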

Related

How can I create unique random numbers in R?

I hope to generate random numbers between 1:100 and then test their divisibility by 3. I have created a loop.
v <- c(0)
for(i in 1:100){
    r <- floor(runif(1, min=1, max=100))
    if(r %% 3 == 0){
        v <- append(v,r)
    }
}
print(v)
However, the numbers keep repeating, as you can see in the following output. Is there any way to generate only unique multiples of 3 between 1 and 100? I am aware there's a way to use the seq function to generate the same numbers, but I still want to know how to obtain unique random numbers.
Output:
[1] 0 18 87 30 45 90 12 72 75 60 27 84 90 27 42 54 63 15 63 30 72 69 57 30 3 6 15 30 3
[30] 60 72 6 6 18 75 96 84 78 24
sample(1:33)*3 is all the multiples of 3 in your range in a random order.
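A minimal sketch of the same idea, assuming you may want only some of the multiples rather than all 33. `sample()` draws without replacement by default, which is what guarantees uniqueness. The variable names are illustrative.

```r
set.seed(42)  # only to make the example reproducible

multiples <- seq(3, 99, by = 3)   # every multiple of 3 between 1 and 100

shuffled <- sample(multiples)     # all 33 multiples in random order
picked   <- sample(multiples, 10) # 10 distinct multiples

anyDuplicated(picked)             # 0, i.e. no repeats
```

Note also that `floor(runif(1, min=1, max=100))` in the question can only return 1 to 99, never 100.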

Calculate number of values in vector that exceed values in column of data.frame

I have a long list of numbers, e.g.
set.seed(123)
y<-round(runif(100, 0, 200))
And I would like to store in column y the number of values that exceed each value in column x of a data frame:
df <- data.frame(x=seq(0,200,20))
I can compute the numbers manually, like this:
length(which(y>=20)) #93 values exceed 20
length(which(y>=40)) #81 values exceed 40
etc. I know I can use a for-loop with all values of x, but is there a more elegant way?
I tried this:
df$y <- length(which(y>=df$x))
But this gives a warning and does not give me the desired output.
The data frame should look like this:
df
x y
1 0 100
2 20 93
3 40 81
4 60 70
5 80 61
6 100 47
7 120 40
8 140 29
9 160 19
10 180 8
11 200 0
You can compare each value of df$x against all values of y using sapply
sapply(df$x, function(a) sum(y>a))
#[1] 99 93 81 70 61 47 40 29 18 6 0
#Looking at your output, maybe you want
sapply(df$x, function(a) sum(y>=a))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Here's another approach using outer that allows for element wise comparison of two vectors
rowSums(outer(df$x,y, "<="))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Yet one more (from alexis_laz's comment)
length(y) - findInterval(df$x, sort(y), left.open = TRUE)
# [1] 100 93 81 70 61 47 40 29 19 8 0
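For reference, a sketch of why the attempt in the question fails, plus a check that the approaches above agree. `y >= df$x` compares a vector of length 100 with one of length 11, so R recycles the shorter one element by element (and warns, because 100 is not a multiple of 11) instead of comparing every x against every y. The `by_*` names are made up for illustration, and `vapply` is used here as a type-checked variant of `sapply`.

```r
set.seed(123)
y  <- round(runif(100, 0, 200))
df <- data.frame(x = seq(0, 200, 20))

# One count per value of x: how many y are >= that x.
by_vapply   <- vapply(df$x, function(a) sum(y >= a), integer(1))
by_outer    <- rowSums(outer(df$x, y, "<="))
by_interval <- length(y) - findInterval(df$x, sort(y), left.open = TRUE)

# All three approaches give the same counts.
all(by_vapply == by_outer)
all(by_vapply == by_interval)

df$y <- by_vapply
```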

Mean and SD in R

Maybe this is a very easy question. This is my data.frame:
> read.table("text.txt")
V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860
It means that I have 22516 sequences with length 26, 17129 sequences with length 28, etc. I would like to know the mean sequence length and its standard deviation. I know how to do it by creating a list with 26 repeated 22516 times, and so on, and then computing the mean and SD. However, I think there is an easier method. Any ideas?
Thanks.
For mean: (V1 %*% V2)/sum(V2)
For SD: sqrt(((V1-(V1 %*% V2)/sum(V2))**2 %*% V2)/sum(V2))
I do not find mean(rep(V1,V2)) # 61.902 and sd(rep(V1,V2)) # 14.23891 that complex, but alternatively you might try:
weighted.mean(V1,V2) # 61.902
# recipe from http://www.ltcconline.net/greenl/courses/201/descstat/meansdgrouped.htm
sqrt((sum((V1^2)*V2)-(sum(V1*V2)^2)/sum(V2))/(sum(V2)-1)) # 14.23891
Step1: Set up data:
dat.df <- read.table(text="id V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860",header=T)
Step2: Convert to data.table (only for simplicity and laziness in typing)
library(data.table)
dat <- data.table(dat.df)
Step3: Set up new columns with products, and use them to find mean
dat[,pr:=V1*V2]
dat[,v1sq:=as.numeric(V1*V1*V2)]
dat.Mean <- sum(dat$pr)/sum(dat$V2)
dat.SD <- sqrt( (sum(dat$v1sq)/sum(dat$V2)) - dat.Mean^2)
Hope this helps!!
MEAN = sum(V1*V2)/sum(V2)
SD = sqrt(sum(V1*V1*V2)/sum(V2) - MEAN^2)
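To make the formulas concrete, here is a small sketch on made-up data (`len` and `cnt` are illustrative stand-ins for V1 and V2). It also shows that the answers above compute two slightly different quantities: dividing by the total count gives the population SD, while `sd(rep(V1, V2))` and the grouped recipe divide by the total count minus 1. With counts as large as those in the question the two are nearly identical.

```r
len <- c(2, 4, 6)   # made-up sequence lengths
cnt <- c(1, 2, 1)   # made-up counts per length

n <- sum(cnt)
m <- sum(len * cnt) / n                               # weighted mean

sd_pop  <- sqrt(sum(len^2 * cnt) / n - m^2)           # divides by n
sd_samp <- sqrt(sum(cnt * (len - m)^2) / (n - 1))     # divides by n - 1

# Cross-check against the expanded vector 2, 4, 4, 6.
expanded <- rep(len, cnt)
all.equal(m, mean(expanded))         # TRUE
all.equal(sd_samp, sd(expanded))     # TRUE
all.equal(m, weighted.mean(len, cnt))  # TRUE
```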

Overlay two differently formatted qplots in ggplot2

I have two scatterplots, based on different but related data, created using qplot() from ggplot2. (Learning ggplot hasn't been a priority because qplot has been sufficient for my needs up to now). What I want to do is superimpose/overlay the two charts so that the x,y data for each is plotted in the same plot space. The complication is that I want each plot to retain its formatting/aesthetics.
The data in question are row and column scores from correspondence analysis - corresp() from MASS - so the number of data rows (i.e. samples or taxa) differs between the two datasets. I can plot the two score sets together easily, either by combining the two datasets or, even easier, by just using the biplot() function.
However, I have been using qplot to get the plots looking exactly as I need them; with samples plotted as colour-coded symbols and taxa as labels:
PlotSample <- qplot(DataCorresp$rscore[,1], DataCorresp$rscore[,2],
colour=factor(DataAll$ColourCode)) +
scale_colour_manual(values = c("black","darkgoldenrod2",
"deepskyblue2","deeppink2"))
and
PlotTaxa <- qplot(DataCorresp$cscore[,1], DataCorresp$cscore[,2],
    label=colnames(DataCorresp), size=10, geom="text")
Can anyone suggest a way by which either
the two plots (PlotSample and PlotTaxa) can be superimposed atop of each other,
the two datasets (DataCorresp$rscore and DataCorresp$cscore) can be plotted together but formatted in their different ways, or
another function (e.g. biplot()) that could be used to achieve my aim.
Example of workflow using an extremely simplified and made-up dataset:
> require(MASS)
> require(ggplot2)
> alldata<-read.csv("Fake data.csv",header=T,row.name=1)
> selectdata<-alldata[,2:10]
> alldata
Period Species.1 Species.2 Species.3 Species.4 Species.5 Species.6
Sample-1 Early 50 87 97 12 60 49
Sample-2 Early 41 90 36 52 36 27
Sample-3 Early 87 56 82 45 56 13
Sample-4 Early 37 47 78 29 53 34
Sample-5 Early 58 70 34 35 8 21
Sample-6 Early 94 82 48 16 27 26
Sample-7 Early 91 69 50 57 24 13
Sample-8 Early 63 38 86 20 28 11
Sample-9 Middle 4 19 55 99 86 38
Sample-10 Middle 29 25 10 93 37 54
Sample-11 Middle 48 12 59 73 39 92
Sample-12 Middle 31 6 34 81 39 54
Sample-13 Middle 29 40 26 52 34 84
Sample-14 Middle 1 46 15 97 67 41
Sample-15 Late 43 47 30 18 60 23
Sample-16 Late 45 10 49 2 2 45
Sample-17 Late 14 8 51 36 58 51
Sample-18 Late 41 51 32 47 23 43
Sample-19 Late 43 17 6 54 4 12
Sample-20 Late 20 25 1 29 35 2
Species.7 Species.8 Species.9
Sample-1 41 39 57
Sample-2 59 4 45
Sample-3 10 56 5
Sample-4 59 30 39
Sample-5 9 29 57
Sample-6 29 24 35
Sample-7 22 4 42
Sample-8 31 19 40
Sample-9 17 7 57
Sample-10 6 9 29
Sample-11 34 20 0
Sample-12 56 41 59
Sample-13 6 31 13
Sample-14 25 12 28
Sample-15 60 75 84
Sample-16 32 69 34
Sample-17 48 53 56
Sample-18 80 86 46
Sample-19 50 70 82
Sample-20 57 84 70
> selectca<-corresp(selectdata,nf=5)
> biplot(selectca,cex=c(0.6,0.6))
> PlotSample <- qplot(selectca$rscore[,1], selectca$rscore[,2], colour=factor(alldata$Period) )
> PlotTaxa<-qplot(selectca$cscore[,1], selectca$cscore[,2], label=colnames(selectdata), size=10, geom="text")
The biplot will produce this plot: /r/10wk1a8/5
The PlotSample appears as such: /r/i29cba/5
The PlotTaxa appears as such: /r/245bl9d/5
EDIT: I don't have enough rep to post pictures, and tinypic links are not accepted (despite https://meta.stackexchange.com/questions/60563/how-to-upload-images-on-stack-overflow). If you add tinypic's URL to the start of the codes above, you'll get there.
Essentially I want to create the biplot plot but with samples colour coded as they are in PlotSample.
Have a look at Gavin Simpson's ggvegan package!
require(vegan)
require(ggvegan)
# some data
data(dune)
# CA
mod <- cca(dune)
# plot
autoplot(mod, geom = 'text')
For finer control (or if you want to stick with corresp()), you may also want to take a look at the code of the two functions involved: fortify.cca (which wraps the data in the cca object into a format usable by ggplot) and autoplot.cca, which creates the plot.
If you want to do it from scratch, you'll have to wrap both sets of scores (sites and species) into one data.frame (see how fortify.cca does this and extract the relevant values from the corresp() object) and use this to build the plot.
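A rough sketch of the from-scratch route, under several assumptions: the count matrix is randomly generated stand-in data (not the asker's file), the column names `dim1`, `dim2`, `type` and `period` are invented for this example, and the colours are borrowed from the question. It puts row and column scores from corresp() into one data.frame and gives each kind of score its own layer, so samples appear as coloured points and taxa as text labels.

```r
library(MASS)      # corresp()
library(ggplot2)

set.seed(1)
# Stand-in data: 20 samples x 9 species of made-up counts.
counts <- matrix(rpois(20 * 9, lambda = 30) + 1, nrow = 20,
                 dimnames = list(paste0("Sample-", 1:20),
                                 paste0("Species.", 1:9)))
period <- rep(c("Early", "Middle", "Late"), times = c(8, 6, 6))

ca <- corresp(counts, nf = 2)

# One data.frame holding both sets of scores, tagged by type.
scores <- rbind(
  data.frame(dim1 = ca$rscore[, 1], dim2 = ca$rscore[, 2],
             label = rownames(counts), type = "sample", period = period),
  data.frame(dim1 = ca$cscore[, 1], dim2 = ca$cscore[, 2],
             label = colnames(counts), type = "taxon", period = NA)
)

# Each layer gets its own subset and its own aesthetics.
p <- ggplot(scores, aes(x = dim1, y = dim2)) +
  geom_point(data = subset(scores, type == "sample"), aes(colour = period)) +
  geom_text(data = subset(scores, type == "taxon"), aes(label = label), size = 3) +
  scale_colour_manual(values = c(Early = "black", Middle = "darkgoldenrod2",
                                 Late = "deepskyblue2"))
print(p)
```

Because every layer in ggplot can take its own `data` argument, the two score sets do not need the same number of rows.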

In R: Indexing vectors by boolean comparison of a value in range: index==c(min : max)

In R, let's say we have a vector
area = c(rep(c(26:30), 5), rep(c(500:504), 5), rep(c(550:554), 5), rep(c(76:80), 5)) and another vector yield = c(1:100).
Now, say I want to index like so:
> yield[area==27]
[1] 2 7 12 17 22
> yield[area==501]
[1] 27 32 37 42 47
No problem, right? But weird things start happening when I try to index it by using c(A, B). (and even weirder when I try c(min:max) ...)
> yield[area==c(27,501)]
[1] 7 17 32 42
What I'm expecting is of course the instances that are present in both of the other examples, not just some weird combination of them. This works when I can use the pipe OR operator:
> yield[area==27 | area==501]
[1] 2 7 12 17 22 27 32 37 42 47
But what if I'm working with a range? Say I want to index it by the range c(27:503). In my real example there are a lot more data points and ranges, so it makes more sense there. Please don't suggest I do it by hand, which would essentially mean:
yield[area==27 | area==28 | area==29 | ... | area==303 | ... | area==502 | area==503]
There must be a better way...
You want to use %in%. Also notice that c(27:503) and 27:503 yield the same object.
> yield[area %in% 27:503]
[1] 2 3 4 5 7 8 9 10 12 13 14 15 17
[14] 18 19 20 22 23 24 25 26 27 28 29 31 32
[27] 33 34 36 37 38 39 41 42 43 44 46 47 48
[40] 49 76 77 78 79 80 81 82 83 84 85 86 87
[53] 88 89 90 91 92 93 94 95 96 97 98 99 100
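As a sketch of what goes wrong with `==` in the question: R recycles the shorter vector, so `area == c(27, 501)` compares the odd positions of area with 27 and the even positions with 501. That is why only 7, 17, 32 and 42 come back. The name `recycled` is made up for illustration.

```r
area  <- c(rep(26:30, 5), rep(500:504, 5), rep(550:554, 5), rep(76:80, 5))
yield <- 1:100

# What == effectively compares against: 27, 501, 27, 501, ...
recycled <- rep(c(27, 501), length.out = length(area))
identical(yield[area == c(27, 501)], yield[area == recycled])  # TRUE

# %in% tests membership instead, so position no longer matters.
yield[area %in% c(27, 501)]   # 2 7 12 17 22 27 32 37 42 47

# For a contiguous numeric range, a pair of comparisons also works
# and does not require enumerating every value.
identical(yield[area %in% 27:503], yield[area >= 27 & area <= 503])  # TRUE
```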
Why not use subset?
subset(yield, area > 26 & area < 504) ## for indexes
subset(area, area > 26 & area < 504) ## for values

Resources