My initial goal was to set ylim for data plotted by barplot. When I started to dig deeper, I found several things that I do not understand. Let me explain my research:
I have 1D vector:
> str(vectorName)
num [1:999] 1 1 1 1 1 1 1 1 1 1 ...
> dim(vectorName)
NULL
> length(vectorName)
[1] 999
If I want to count the occurrences of each element of this vector, I do:
> vectorNameTable = table(vectorName)
> vectorNameTable
vectorName
0 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4
563 72 35 22 14 21 14 10 5 3 7 3 6 5 3 1 3
0.425 0.45 0.475 0.5 0.525 0.55 0.575 0.6 0.625 0.65 0.675 0.7 0.725 0.75 0.775 0.8 0.825
1 3 3 5 7 11 3 4 3 11 5 9 5 7 8 5 3
0.85 0.875 0.9 0.925 0.975 1
3 4 2 1 1 108
This is how I display the data in a more elegant way (in RStudio):
> View(vectorNameTable)
Which gives me output like this:
vectorName Freq
1 0 563
2 0.025 72
3 0.05 35
4 0.075 22
5 0.1 14
6 0.125 21
7 0.15 14
8 0.175 10
9 0.2 5
10 0.225 3
11 0.25 7
12 0.275 3
13 0.3 6
14 0.325 5
15 0.35 3
16 0.375 1
17 0.4 3
18 0.425 1
19 0.45 3
20 0.475 3
21 0.5 5
22 0.525 7
23 0.55 11
24 0.575 3
25 0.6 4
26 0.625 3
27 0.65 11
28 0.675 5
29 0.7 9
30 0.725 5
31 0.75 7
32 0.775 8
33 0.8 5
34 0.825 3
35 0.85 3
36 0.875 4
37 0.9 2
38 0.925 1
39 0.975 1
40 1 108
If I want to plot this data I do:
> barplot(vectorNameTable)
Which gives me this plot:
As you can see, 0 occurs more times than the y-axis range allows. So what I want is to set the size of the y-axis using:
barplot(vectorNameTable, ylim=c(0, MAX_VALUE_IN_FREQ_COLUMN))
The problem is that I cannot find the largest value in the Freq column. To be more precise, I cannot even access the Freq column. I've tried:
> vectorNameTable[,1]
Error in vectorNameTable[, 1] : incorrect number of dimensions
and several other attempts, but it seems that the only thing I am able to obtain is a whole row:
> vectorNameTable[1]
0
563
> vectorNameTable[2]
0.025
72
Or even the Freq value in a given row:
> vectorNameTable[[1]]
[1] 563
> vectorNameTable[[2]]
[1] 72
One workaround that does work is converting the data to a matrix:
vectorNameDF = data.frame(vectorNameTable)  # columns: vectorName (factor), Freq
val = vectorNameDF[[1]]                     # the values, as a factor
frq = vectorNameDF[[2]]                     # the counts
val = as.numeric(levels(val))               # convert the factor levels back to numbers
vectorNameMTX = matrix(c(val, frq), nrow=length(val))
Then I can do something like this:
barplot(vectorNameTable, ylim=c(0,max(vectorNameMTX[,2])+50))
Which will return:
But as you can see, this is extreme overkill. Another mysterious thing that I've found is that plotting the graph this way (same as barplot(vectorNameMTX, beside=FALSE)):
> barplot(vectorNameMTX)
Will return this:
This command, barplot(vectorNameMTX, beside=TRUE), will return this:
Why is this happening? I mean, what is this "line" on the left? And where is the x-axis? If I do View(vectorNameMTX), it returns a table very similar to View(vectorNameTable). The documentation for barplot says (only the important parts):
Bar Plots
Description
Creates a bar plot with vertical or horizontal bars.
Usage
barplot(height, ...)
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
I'm passing a matrix, but it does not work as expected:
> class(vectorNameMTX)
[1] "matrix"
On the other hand, this one is not mentioned as a supported type, but it works:
> class(vectorNameTable)
[1] "table"
Why can't I access the columns of vectorNameTable? Why does passing the table object work while passing a matrix does not? What am I missing here, and what is the best way to achieve my goal?
Thank you
A table of a 1D vector is itself one-dimensional, so there are no columns. You can do something like
> a <- rbinom(1000, 25, 0.5)
> tb <- table(a)
> tb
a
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
8 20 31 71 96 155 141 146 136 94 46 33 15 7 1
> dim(tb)
[1] 15 # 1 dimension of 15
> tb[which.max(tb)]
11
155
So you can feed this max value to barplot.
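Applied to the original question, a minimal sketch (max() works directly on the table, so no matrix detour is needed):
barplot(vectorNameTable, ylim = c(0, max(vectorNameTable)))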
I'm attempting to run a t.test in R on some of the sample data you get by loading the datasets package.
Using the InsectSprays dataset, I am trying to compare spray A to spray C.
My question is, what would be the t.test line of code to compare the two?
The data is originally formatted as a column Count with the numerical data and a column Spray which denotes which spray it is, like:
Count: 10 7 8 9... and Spray: A A B B...
Edit: I have already calculated a lot of information and formatted it as:
spray mean sd stderr var
1 A 14.50 4.72 0.39 22.27
2 B 15.33 4.27 0.36 18.24
3 C 2.08 1.98 0.16 3.90
4 D 4.92 2.50 0.21 6.27
5 E 3.50 1.73 0.14 3.00
6 F 16.67 6.21 0.52 38.61
Edit 2: I have tried to run something like:
t.test(insect.mn[insect.mn$spray=="A",]$mn, insect.mn[insect.mn$spray=="C",]$mn)
Error in t.test.default(insect.mn[insect.mn$spray == "A", ]$mn, insect.mn[insect.mn$spray == :
not enough 'x' observations
As far as I can tell, t.test is looking for the actual data sets, not the two means (from my basic understanding of statistics, you can't run a t.test on two means).
These are the original data. It should be fairly easy to see the next step since you almost got it right with your posted effort:
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1
> table(InsectSprays[,2])
A B C D E F
12 12 12 12 12 12
> InsectSprays[InsectSprays$spray=="A",'count']
[1] 10 7 20 14 14 12 10 23 17 20 14 13
> InsectSprays[InsectSprays$spray=="C",'count']
[1] 0 1 7 2 3 1 2 1 3 0 1 4
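Spelled out, the next step is just to feed those two vectors to t.test (which performs Welch's two-sample t-test by default):
t.test(InsectSprays[InsectSprays$spray=="A",'count'],
       InsectSprays[InsectSprays$spray=="C",'count'])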
I have a data frame in R of the following form:
BC solvopt istrng tSolv EPB
1 10 1 0 0.10 -78.1450
2 10 1 1 0.15 -78.7174
3 10 1 10 0.14 -78.7175
4 10 1 100 0.12 -78.7184
5 10 1 1000 0.09 -78.7232
6 10 1 2 0.15 -78.7175
7 10 1 20 0.14 -78.7176
8 10 1 200 0.12 -78.7192
30 10 2 0 0.10 -78.1450
31 10 2 1 0.11 -78.7174
32 10 2 10 0.11 -78.7175
33 10 2 100 0.10 -78.7184
34 10 2 1000 0.13 -78.7232
35 10 2 2 0.11 -78.7174
36 10 2 20 0.10 -78.7176
37 10 2 200 0.10 -78.7192
59 10 3 0 0.16 -78.1450
60 10 3 1 0.23 -78.7174
61 10 3 10 0.21 -78.7175
62 10 3 100 0.19 -78.7184
63 10 3 1000 0.17 -78.7232
64 10 3 2 0.22 -78.7175
65 10 3 20 0.21 -78.7176
66 10 3 200 0.18 -78.7192
88 10 4 0 0.44 -78.1450
89 10 4 1 14.48 -78.7162
90 10 4 10 12.27 -78.7175
91 10 4 100 1.23 -78.7184
92 10 4 1000 0.44 -78.7232
93 10 4 2 14.52 -78.7172
94 10 4 20 6.16 -78.7176
95 10 4 200 0.62 -78.7192
I want to add a column to this frame which shows the relative error in the EPB for each value of BC and istrng relative to solvopt=3.
For example, to compute the relative difference in EPB at each row I would subtract the EPB value of the corresponding row with the same value of BC and istrng but with solvopt=3.
Is there an easy way to do this, short of splitting this into multiple data frames (one per solvopt) and then merging them back together?
The end goal is to generate plots of relative error vs istrng for each value of BC using qplot.
If you merge the subset where solvopt==3 against the main data on both BC and istrng, and take the difference, you should get the result you want, e.g.:
newdat <- merge(dat,dat[dat$solvopt==3,c("BC","istrng","EPB")], by=c("BC","istrng"))
newdat$diff <- with(newdat, EPB.x - EPB.y)
...or do it all in one fell swoop using match and interaction:
dat$diff <- dat$EPB - dat[dat$solvopt==3,"EPB"][match(
with(dat, interaction(BC,istrng) ),
with(dat[dat$solvopt==3,], interaction(BC,istrng) )
)]
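Either way, since the question asked for relative error, you could also divide by the reference value; a one-line variant (relerr is a hypothetical column name):
newdat$relerr <- with(newdat, (EPB.x - EPB.y) / abs(EPB.y))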
A similar option with data.table
library(data.table)
res <- setkey(setDT(dat), BC,istrng)[dat[solvopt==3, c(1,3,5),
with=FALSE]][, diff:= EPB- i.EPB][]
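For the stated end goal (error vs istrng for each BC), a sketch with qplot, assuming newdat from the merge approach above:
library(ggplot2)
qplot(istrng, diff, data = newdat, colour = factor(solvopt), geom = "line") +
  facet_wrap(~ BC)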
I have one variable A
0
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
which is an input into the following function
NoBeta <- function(A)
{
  return(1 - (1 - B * (1 - 4000)) / exp(0.007 * A))
}
The variable B is the result of this function. How do I feed the result back into the function to calculate my next result?
Here is B
0
0.07
0.10
0.13
0.16
0.19
0.22
0.24
0.27
0.30
0.32
0.34
0.37
0.39
0.41
0.43
0.45
0.47
So the function needs to return the values of B while also using B. E.g. if we use A = 10 as input, then the input for B is 0; when the input for A is 15, the input for B is the result of the previous calculation, 0.07.
B is calculated with the following formula in Excel
=1-(1-B1*(1-4000))/EXP(0.007*$A2)
How do I implement this formula in R?
If I understand your question correctly you wish to reference a previous row in a calculation for the current row.
You can adapt a function that was provided in another SO question here.
rowShift <- function(x, shiftLen = 1L) {
  # build shifted indices; e.g. shiftLen = -1 refers to the previous row
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA  # rows shifted before the start have no value
  return(x[r])
}
test <- data.frame(x = c(1:10), y = c(2:11))
test$z <- rowShift(test$x, -1) + rowShift(test$y, -1)
> test
x y z
1 1 2 NA
2 2 3 3
3 3 4 5
4 4 5 7
5 5 6 9
6 6 7 11
7 7 8 13
8 8 9 15
9 9 10 17
10 10 11 19
Then what you want to achieve becomes
test$z2 <- 1- (1-rowShift(test$x, -1)*(1-4000))/exp(0.007*rowShift(test$y, -1))
> head(test)
x y z z2
1 1 2 NA NA
2 2 3 3 -3943.390
3 3 4 5 -7831.772
4 4 5 7 -11665.716
5 5 6 9 -15445.790
6 6 7 11 -19172.560
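Note that rowShift only works when the referenced column already exists in full. Since B here feeds back into its own calculation, the recurrence has to be accumulated row by row; a sketch with Reduce, assuming B starts at 0 in the first row and applying the Excel formula as written:
A <- c(0, seq(10, 90, by = 5))  # the A values from the question
B <- Reduce(function(prev, a) 1 - (1 - prev * (1 - 4000)) / exp(0.007 * a),
            A[-1], init = 0, accumulate = TRUE)
With accumulate = TRUE, Reduce keeps every intermediate result, so B gets one entry per value of A.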
I'd like to do a cut with a guaranteed number of levels returned. So I'd like to take any vector of cumulative percentages and get a cut into deciles. I've tried using cut and it works well in most situations, but in cases where there are large gaps in the percentages it fails to return the desired number of unique cuts, which is 10. Any ideas on how to ensure that the number of cuts is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, sized in such a way that I am assured of getting 10. cut et al. give 9 bins with this example; I want 10. So I'm looking for an algorithm that would recognize that the gap between 58.03 and 72.46 is large. Instead of assigning the cases 58.03, 72.46, 73.4 to bins 6, 8, 8, it would assign them to bins 6, 7, 8.
Quantile breakpoints give bins with approximately equal numbers of observations; note that the 0% quantile and include.lowest=TRUE are needed so that the minimum is not dropped as NA:
xx <- cut(x, breaks=quantile(x, (0:10)/10, na.rm=TRUE), include.lowest=TRUE)
table(xx)
#------------------------
xx
[0.04,0.256] (0.256,0.58] (0.58,0.718] (0.718,6.76]  (6.76,20.5]
           4            4            4            4            4
 (20.5,35.7]  (35.7,49.7]  (49.7,75.1]  (75.1,85.5]   (85.5,100]
           3            4            4            4            4
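If you want the decile number (1-10) rather than interval labels, the same breaks work with labels=FALSE; a small follow-up using the same x:
dec <- cut(x, breaks = quantile(x, (0:10)/10, na.rm = TRUE),
           include.lowest = TRUE, labels = FALSE)
table(dec)  # each of the 10 bins gets roughly length(x)/10 observations
This assumes the quantiles are distinct; with heavy ties, duplicate breaks would still have to be handled separately.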
numBins = 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins + 1), include.lowest = TRUE)
Output:
...
...
...
10 Levels: [0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 bins that are approximately equally spaced (include.lowest = TRUE keeps min(x) itself from becoming NA). Note that by changing the numBins variable, you may obtain any number of approximately equally spaced bins.
Not sure I understand what you need, but if you drop labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].
What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?