My initial goal was to set ylim for data plotted by barplot. When I started to dig deeper, I found several things that I do not understand. Let me explain my research:
I have 1D vector:
> str(vectorName)
num [1:999] 1 1 1 1 1 1 1 1 1 1 ...
> dim(vectorName)
NULL
> length(vectorName)
[1] 999
If I want to count the occurrences of the particular elements of this vector, I do:
> vectorNameTable = table(vectorName)
> vectorNameTable
vectorName
0 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4
563 72 35 22 14 21 14 10 5 3 7 3 6 5 3 1 3
0.425 0.45 0.475 0.5 0.525 0.55 0.575 0.6 0.625 0.65 0.675 0.7 0.725 0.75 0.775 0.8 0.825
1 3 3 5 7 11 3 4 3 11 5 9 5 7 8 5 3
0.85 0.875 0.9 0.925 0.975 1
3 4 2 1 1 108
This is how I display those data in a more elegant way (in RStudio):
> View(vectorNameTable)
Which gives me output like this:
vectorName Freq
1 0 563
2 0.025 72
3 0.05 35
4 0.075 22
5 0.1 14
6 0.125 21
7 0.15 14
8 0.175 10
9 0.2 5
10 0.225 3
11 0.25 7
12 0.275 3
13 0.3 6
14 0.325 5
15 0.35 3
16 0.375 1
17 0.4 3
18 0.425 1
19 0.45 3
20 0.475 3
21 0.5 5
22 0.525 7
23 0.55 11
24 0.575 3
25 0.6 4
26 0.625 3
27 0.65 11
28 0.675 5
29 0.7 9
30 0.725 5
31 0.75 7
32 0.775 8
33 0.8 5
34 0.825 3
35 0.85 3
36 0.875 4
37 0.9 2
38 0.925 1
39 0.975 1
40 1 108
If I want to plot this data, I do:
> barplot(vectorNameTable)
Which gives me this plot:
As you can see, 0 occurs more times (563) than the y-axis extends. So what I want is to set the range of the y-axis using:
barplot(vectorNameTable, ylim=c(0,MAX_VALUE_IN_FREQ_COLUMN))
The problem is that I cannot find the largest value in the Freq column. To be more precise, I cannot even access the Freq column. I've tried:
> vectorNameTable[,1]
Error in vectorNameTable[, 1] : incorrect number of dimensions
and several other attempts, but it seems that the only thing I am able to obtain is the whole row:
> vectorNameTable[1]
0
563
> vectorNameTable[2]
0.025
72
Or the Freq value in a given row:
> vectorNameTable[[1]]
[1] 563
> vectorNameTable[[2]]
[1] 72
One possible workaround that works is converting the data to a matrix:
vectorNameDF = data.frame(vectorNameTable)  # two columns: vectorName (a factor) and Freq
val = vectorNameDF[[1]]                     # the values, stored as a factor
frq = vectorNameDF[[2]]                     # the counts
val = as.numeric(levels(val))               # convert the factor levels back to numbers
vectorNameMTX = matrix(c(val, frq), nrow=length(val))
Then I can do something like this:
barplot(vectorNameTable, ylim=c(0,max(vectorNameMTX[,2])+50))
Which will return:
But as you can see, this is extreme overkill. Another mysterious thing I've found is that plotting the graph this way (same as barplot(vectorNameMTX, beside=FALSE)):
> barplot(vectorNameMTX)
Will return this:
This command > barplot(vectorNameMTX, beside=TRUE) will return this:
Why is this happening? I mean, what is this "line" on the left? And where is the x-axis? If I do View(vectorNameMTX), it shows a table very similar to View(vectorNameTable). The documentation for barplot says (only the important things):
Bar Plots
Description
Creates a bar plot with vertical or horizontal bars.
Usage
barplot(height, ...)
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
I'm passing a matrix, but it does not work as expected:
> class(vectorNameMTX)
[1] "matrix"
On the other hand, this class is not mentioned as a supported type, but it works:
> class(vectorNameTable)
[1] "table"
Why can't I access the columns of vectorNameTable? Why does passing the table object work while passing a matrix does not? What am I missing here, and what is the best way to achieve my goal?
Thank you
A table of a 1-D vector is itself one-dimensional, so there are no columns. You can do something like:
> a <- rbinom(1000, 25, 0.5)
> tb <- table(a)
> tb
a
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
8 20 31 71 96 155 141 146 136 94 46 33 15 7 1
> dim(tb)
[1] 15    # one dimension, of length 15
> tb[which.max(tb)]
11
155
So you can feed this max value to barplot.
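For your original data, then, a minimal sketch (using the vectorNameTable from your question) would be:

barplot(vectorNameTable, ylim = c(0, max(vectorNameTable)))  # a table is a 1-D array of counts, so max() applies directly
vals <- as.numeric(names(vectorNameTable))                   # the bar labels live in names(); convert them if you need numbers

As for the matrix mystery: exactly as the documentation you quoted says, each column of a matrix becomes one stacked bar, so your 40x2 vectorNameMTX is drawn as just two bars. The thin "line" on the left is the stacked column of values (which sum to very little), and the tall bar is the stacked Freq column.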
I have one variable A:
0, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90
which is an input into the following function
NoBeta <- function(A)
{
  return(1 - (1 - B * (1 - 4000)) / exp(0.007 * A))
}
The variable B is the result of this function. How do I feed the result back into the function to calculate my next result?
Here is B:
0, 0.07, 0.10, 0.13, 0.16, 0.19, 0.22, 0.24, 0.27, 0.30, 0.32, 0.34, 0.37, 0.39, 0.41, 0.43, 0.45, 0.47
So the function needs to return the values of B while also using B. For example, if we use A = 10 as input, then the input for B is 0; when the input for A is 15, the input for B is the result of the previous calculation, 0.07.
B is calculated with the following formula in Excel
=1-(1-B1*(1-4000))/EXP(0.007*$A2)
How do I implement this formula in R?
If I understand your question correctly, you wish to reference a previous row in a calculation for the current row. You can adapt a function that was provided in another SO question here:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
test <- data.frame(x = c(1:10), y = c(2:11))
test$z <- rowShift(test$x, -1) + rowShift(test$y, -1)
> test
x y z
1 1 2 NA
2 2 3 3
3 3 4 5
4 4 5 7
5 5 6 9
6 6 7 11
7 7 8 13
8 8 9 15
9 9 10 17
10 10 11 19
Then what you want to achieve becomes
test$z2 <- 1- (1-rowShift(test$x, -1)*(1-4000))/exp(0.007*rowShift(test$y, -1))
> head(test)
x y z z2
1 1 2 NA NA
2 2 3 3 -3943.390
3 3 4 5 -7831.772
4 4 5 7 -11665.716
5 5 6 9 -15445.790
6 6 7 11 -19172.560
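Note, though, that rowShift only looks up values that already sit in another column. If B truly has to be built from its own previous value, as in your Excel fill-down, a plain loop does the sequential accumulation. A minimal sketch, assuming the recurrence is exactly the Excel formula above and that B starts at 0:

A <- c(0, 10, 15, 20, 25)   # a hypothetical subset of your A column
B <- numeric(length(A))
B[1] <- 0                   # assumed starting value, as in the question
for (i in 2:length(A)) {
  # each B uses the previous B, just like Excel's fill-down of B1
  B[i] <- 1 - (1 - B[i - 1] * (1 - 4000)) / exp(0.007 * A[i])
}
B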
I have a data set generated as follows (the 50*20/200 example below implies N = 200 records):
N <- 200
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
The data looks like this:
I would like to generate a stratified sample of myData with a given sample size, i.e., 50. The resulting sample should follow the proportional allocation of the original data set in terms of "group". For instance, assume myData has 20 records belonging to group 4; then the resulting data set should have 50*20/200 = 5 records belonging to group 4. How can I do that in R?
You can use my stratified function, specifying a value < 1 as your proportion, like this:
## Sample data. Seed for reproducibility
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
## Taking the sample
out <- stratified(myData, "group", .3)
out
# a b group
# 17 17 -0.02 2
# 8 8 0.74 3
# 25 25 0.62 3
# 49 49 -0.11 3
# 4 4 1.60 3
# 26 26 -0.06 4
# 27 27 -0.16 4
# 7 7 0.49 4
# 12 12 0.39 4
# 40 40 0.76 4
# 32 32 -0.10 4
# 9 9 0.58 5
# 42 42 -0.25 5
# 43 43 0.70 5
# 37 37 -0.39 5
# 11 11 1.51 6
Compare the counts in each group of the final sample with what we would have expected:
round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1
You can also easily take a fixed number of samples per group, like this:
stratified(myData, "group", 2)
# a b group
# 34 34 -0.05 2
# 17 17 -0.02 2
# 49 49 -0.11 3
# 22 22 0.78 3
# 12 12 0.39 4
# 7 7 0.49 4
# 18 18 0.94 5
# 33 33 0.39 5
# 45 45 -0.69 6
# 11 11 1.51 6
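If you would rather avoid an external function, here is a minimal base-R sketch of the same proportional allocation (the target of 50 rows and the group column are taken from the question; because of rounding, the total can be off by a row or two):

set.seed(1)
target <- round(table(myData$group) * 50 / nrow(myData))   # rows to draw per group
out2 <- do.call(rbind, lapply(split(myData, myData$group), function(d) {
  g <- as.character(d$group[1])                            # this group's label
  d[sample(nrow(d), min(nrow(d), target[g])), , drop = FALSE]
}))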
I have a data set:
Security  %market value  return  Quintile*
       1           0.07     100          3
       2           0.10      88          2
       3           0.08      78          1
       4           0.12      59          1
       5           0.20     106          4
       6           0.04      94          3
       7           0.05     111          5
       8           0.10      83          2
       9           0.06      97          3
      10           0.03      90          3
      11           0.15     119          5
The actual data set has more than 5,000 rows. I would like to use R to create 5 quintiles, where each quintile is supposed to hold 20% of the total market value. In addition, the securities have to be ranked by the magnitude of return: the 1st quintile should contain the securities with the lowest return values that together make up 20% of market value, and the 5th quintile the securities with the highest. I would like to create the column "Quintile"; different quintiles can contain different numbers of securities, but the total %market value should be the same.
I have tried several methods, and I am very new to R, so please kindly provide me some help. Thank you very much in advance!
Samuel
You can order your data and then use findInterval. Six break points define the five quintile intervals, and a small delta is added so the boundaries behave as right-closed (a cumulative value of exactly 0.2 stays in the lower quintile):
raw_data <- raw_data[order(raw_data$return),]
raw_data$Q2 <- findInterval(cumsum(raw_data$marketvalue), seq(0, 1, length.out = 6) + 0.000001, rightmost.closed = TRUE)
raw_data
# Security marketvalue return Quintile Q2
#4 4 0.12 59 1 1
#3 3 0.08 78 1 1
#8 8 0.10 83 2 2
#2 2 0.10 88 2 2
#10 10 0.03 90 3 3
#6 6 0.04 94 3 3
#9 9 0.06 97 3 3
#1 1 0.07 100 3 3
#5 5 0.20 106 4 4
#7 7 0.05 111 5 5
#11 11 0.15 119 5 5
The following works with your data.
First, sort by increasing return:
dat <- dat[order(dat$return), ]
Then, compute the cumulative market share and cut every 0.2:
dat$Quintile <- ceiling(cumsum(dat$market) / 0.2)
Finally, sort things back by Security:
dat <- dat[order(dat$Security), ]
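With either approach, you can sanity-check the allocation by summing the market value inside each quintile (a quick check, assuming the dat column names used in this answer):

tapply(dat$market, dat$Quintile, sum)   # each quintile should hold about 0.20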
I'd like to do a cut with a guaranteed number of levels returned, so I'd like to take any vector of cumulative percentages and cut it into deciles. I've tried using cut, and it works well in most situations, but when the values cluster, it fails to return the desired number of unique cuts, which is 10. Any ideas on how to ensure that the number of cuts is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))   # cut2 is from the Hmisc package
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, sized in such a way that I am assured of getting 10; cut et al. give 9 bins with this example, and I want 10. So I'm looking for an algorithm that would recognize that the gap between 58.03 and 72.46 is large: instead of assigning the cases 58.03, 72.46, 73.4 to bins 6, 8, 8 it would assign them to bins 6, 7, 8.
Quantile breaks guarantee that every bin is occupied. Note the (0:10)/10 (eleven break points for ten intervals) and include.lowest = TRUE, which keeps the values below the 10% quantile from becoming NA:
xx <- cut(x, breaks=quantile(x, (0:10)/10, na.rm=TRUE), include.lowest=TRUE)
table(xx)
#------------------------
xx
[0.04,0.256] (0.256,0.58] (0.58,0.718] (0.718,6.76]
4 4 4 4
(6.76,20.5] (20.5,35.7] (35.7,49.7] (49.7,75.1]
4 3 4 4
(75.1,85.5] (85.5,100]
4 4
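If, as in your labels=FALSE attempts, you want a bin index per element rather than an interval label, the same breaks can be wrapped in as.integer. A small sketch using the x from the question (note that with heavily tied data the quantiles themselves can collide, in which case you would need to de-duplicate the breaks first):

deciles <- as.integer(cut(x, breaks = quantile(x, (0:10)/10, na.rm = TRUE),
                          include.lowest = TRUE))
table(deciles)   # for this x, every index 1..10 is occupied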
numBins = 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins + 1), include.lowest = TRUE)
Output:
...
...
...
10 Levels: [0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 equally spaced bins (include.lowest = TRUE keeps min(x) itself from becoming NA). By changing the numBins variable, you may obtain any number of equally spaced bins; note, though, that equal spacing does not guarantee that every bin is occupied.
Not sure I understand what you need, but if you drop the labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].
What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?