My initial goal was to set ylim for data plotted by barplot. When I started to dig deeper, I found several things that I do not understand. Let me explain my research:
I have 1D vector:
> str(vectorName)
num [1:999] 1 1 1 1 1 1 1 1 1 1 ...
> dim(vectorName)
NULL
> length(vectorName)
[1] 999
If I want to count the occurrences of each element of this vector, I do:
> vectorNameTable = table(vectorName)
> vectorNameTable
vectorName
0 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4
563 72 35 22 14 21 14 10 5 3 7 3 6 5 3 1 3
0.425 0.45 0.475 0.5 0.525 0.55 0.575 0.6 0.625 0.65 0.675 0.7 0.725 0.75 0.775 0.8 0.825
1 3 3 5 7 11 3 4 3 11 5 9 5 7 8 5 3
0.85 0.875 0.9 0.925 0.975 1
3 4 2 1 1 108
This is how I display the data in a more elegant way (in RStudio):
> View(vectorNameTable)
Which gives me output like this:
vectorName Freq
1 0 563
2 0.025 72
3 0.05 35
4 0.075 22
5 0.1 14
6 0.125 21
7 0.15 14
8 0.175 10
9 0.2 5
10 0.225 3
11 0.25 7
12 0.275 3
13 0.3 6
14 0.325 5
15 0.35 3
16 0.375 1
17 0.4 3
18 0.425 1
19 0.45 3
20 0.475 3
21 0.5 5
22 0.525 7
23 0.55 11
24 0.575 3
25 0.6 4
26 0.625 3
27 0.65 11
28 0.675 5
29 0.7 9
30 0.725 5
31 0.75 7
32 0.775 8
33 0.8 5
34 0.825 3
35 0.85 3
36 0.875 4
37 0.9 2
38 0.925 1
39 0.975 1
40 1 108
If I want to plot this data I do:
> barplot(vectorNameTable)
Which gives me this plot:
As you can see, 0 occurs more times than the y-axis range allows. So what I want is to set the size of the y-axis using:
barplot(vectorNameTable, ylim=c(0, MAX_VALUE_IN_FREQ_COLUMN))
The problem is that I cannot find the largest value in the Freq column. To be more precise, I cannot even access the Freq column. I've tried:
> vectorNameTable[,1]
Error in vectorNameTable[, 1] : incorrect number of dimensions
and several other attempts, but it seems that the only thing I am able to obtain is a whole row:
> vectorNameTable[1]
0
563
> vectorNameTable[2]
0.025
72
Or even the Freq value in a given row:
> vectorNameTable[[1]]
[1] 563
> vectorNameTable[[2]]
[1] 72
One workaround that does work is converting the data to a matrix:
vectorNameDF = data.frame(vectorNameTable)  # columns: vectorName (factor), Freq
val = vectorNameDF[[1]]                     # the values, as a factor
frq = vectorNameDF[[2]]                     # the counts
val = as.numeric(levels(val))               # convert the factor levels back to numbers
vectorNameMTX = matrix(c(val, frq), nrow=length(val))
Then I can do something like this:
barplot(vectorNameTable, ylim=c(0,max(vectorNameMTX[,2])+50))
Which will return:
But as you can see, this is extreme overkill. Another mysterious thing that I've found is that plotting the graph this way (same as barplot(vectorNameMTX, beside=FALSE)):
> barplot(vectorNameMTX)
Will return this:
This command, barplot(vectorNameMTX, beside=TRUE), will return this:
Why is this happening? I mean, what is this "line" on the left? And where is the x-axis? If I do View(vectorNameMTX), it returns a table very similar to View(vectorNameTable). The documentation for barplot says (only the important parts):
Bar Plots
Description
Creates a bar plot with vertical or horizontal bars.
Usage
barplot(height, ...)
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
I'm passing a matrix, but it does not work as expected:
> class(vectorNameMTX)
[1] "matrix"
On the other hand, this one is not mentioned as a supported type, but it works:
> class(vectorNameTable)
[1] "table"
Why can't I access the columns of vectorNameTable? Why does passing the table object work while passing a matrix does not? What am I missing here, and what is the best way to achieve my goal?
Thank you
A table of a 1D vector is itself one-dimensional, so there are no columns. You can do something like
> a <- rbinom(1000, 25, 0.5)
> tb <- table(a)
> tb
a
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
8 20 31 71 96 155 141 146 136 94 46 33 15 7 1
> dim(tb)
[1] 15 # 1 dimension of 15
> tb[which.max(tb)]
11
155
So you can feed this max value to barplot.
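Applied to the original question, a minimal sketch (max() works directly on the table, so no matrix detour is needed):
barplot(vectorNameTable, ylim = c(0, max(vectorNameTable)))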
I'm attempting to run a t.test in R on some of the sample data you get by loading the datasets package.
Using the InsectSprays dataset, I am trying to compare spray A to spray C.
My question is, what would be the t.test line of code to compare the two?
The data is originally formatted as a column Count with the numerical data and a column Spray which denotes which spray it is, like:
Count: 10 7 8 9... and Spray: A A B B...
Edit: I have already calculated a lot of information and formatted it as:
spray mean sd stderr var
1 A 14.50 4.72 0.39 22.27
2 B 15.33 4.27 0.36 18.24
3 C 2.08 1.98 0.16 3.90
4 D 4.92 2.50 0.21 6.27
5 E 3.50 1.73 0.14 3.00
6 F 16.67 6.21 0.52 38.61
Edit 2: I have tried to run something like:
t.test(insect.mn[insect.mn$spray=="A",]$mn, insect.mn[insect.mn$spray=="C",]$mn)
Error in t.test.default(insect.mn[insect.mn$spray == "A", ]$mn, insect.mn[insect.mn$spray == :
not enough 'x' observations
As far as I can tell, t.test is looking for the actual data sets, not the two means (from my basic understanding of statistics, you can't run a t.test on two means).
These are the original data. It should be fairly easy to see the next step since you almost got it right with your posted effort:
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1
> table(InsectSprays[,2])
A B C D E F
12 12 12 12 12 12
> InsectSprays[InsectSprays$spray=="A",'count']
[1] 10 7 20 14 14 12 10 23 17 20 14 13
> InsectSprays[InsectSprays$spray=="C",'count']
[1] 0 1 7 2 3 1 2 1 3 0 1 4
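Spelled out, the next step is just to feed those two vectors to t.test (which performs Welch's two-sample t-test by default):
t.test(InsectSprays[InsectSprays$spray=="A",'count'],
       InsectSprays[InsectSprays$spray=="C",'count'])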
I have a data frame in R of the following form:
BC solvopt istrng tSolv EPB
1 10 1 0 0.10 -78.1450
2 10 1 1 0.15 -78.7174
3 10 1 10 0.14 -78.7175
4 10 1 100 0.12 -78.7184
5 10 1 1000 0.09 -78.7232
6 10 1 2 0.15 -78.7175
7 10 1 20 0.14 -78.7176
8 10 1 200 0.12 -78.7192
30 10 2 0 0.10 -78.1450
31 10 2 1 0.11 -78.7174
32 10 2 10 0.11 -78.7175
33 10 2 100 0.10 -78.7184
34 10 2 1000 0.13 -78.7232
35 10 2 2 0.11 -78.7174
36 10 2 20 0.10 -78.7176
37 10 2 200 0.10 -78.7192
59 10 3 0 0.16 -78.1450
60 10 3 1 0.23 -78.7174
61 10 3 10 0.21 -78.7175
62 10 3 100 0.19 -78.7184
63 10 3 1000 0.17 -78.7232
64 10 3 2 0.22 -78.7175
65 10 3 20 0.21 -78.7176
66 10 3 200 0.18 -78.7192
88 10 4 0 0.44 -78.1450
89 10 4 1 14.48 -78.7162
90 10 4 10 12.27 -78.7175
91 10 4 100 1.23 -78.7184
92 10 4 1000 0.44 -78.7232
93 10 4 2 14.52 -78.7172
94 10 4 20 6.16 -78.7176
95 10 4 200 0.62 -78.7192
I want to add a column to this frame which shows the relative error in the EPB for each value of BC and istrng relative to solvopt=3.
For example, to compute the relative difference in EPB at each row I would subtract the EPB value of the corresponding row with the same value of BC and istrng but with solvopt=3.
Is there an easy way to do this, short of splitting this into multiple data frames (one per solvopt) and then merging them back together?
The end goal is to generate plots of relative error vs istrng for each value of BC using qplot.
If you merge the subset where solvopt==3 against the main data on both BC and istrng, and take the difference, you should get the result you want, e.g.:
newdat <- merge(dat,dat[dat$solvopt==3,c("BC","istrng","EPB")], by=c("BC","istrng"))
newdat$diff <- with(newdat, EPB.x - EPB.y)
...or do it all in one fell swoop using match and interaction:
dat$diff <- dat$EPB - dat[dat$solvopt==3,"EPB"][match(
with(dat, interaction(BC,istrng) ),
with(dat[dat$solvopt==3,], interaction(BC,istrng) )
)]
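Either way, since the question asked for relative error, you could also divide by the reference value; a one-line variant (relerr is a hypothetical column name):
newdat$relerr <- with(newdat, (EPB.x - EPB.y) / abs(EPB.y))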
A similar option with data.table
library(data.table)
res <- setkey(setDT(dat), BC,istrng)[dat[solvopt==3, c(1,3,5),
with=FALSE]][, diff:= EPB- i.EPB][]
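For the stated end goal (error vs istrng for each BC), a sketch with qplot, assuming newdat from the merge approach above:
library(ggplot2)
qplot(istrng, diff, data = newdat, colour = factor(solvopt), geom = "line") +
  facet_wrap(~ BC)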
I have one variable A
0
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
which is an input into the following function
NoBeta <- function(A)
{
  return(1 - (1 - B * (1 - 4000)) / exp(0.007 * A))
}
The variable B is the result of this function. How do I feed the result back into the function to calculate my next result?
Here is B
0
0.07
0.10
0.13
0.16
0.19
0.22
0.24
0.27
0.30
0.32
0.34
0.37
0.39
0.41
0.43
0.45
0.47
So the function needs to return the values of B while also using B. E.g. if we use A = 10 as input, then the input for B is 0; when the input for A is 15, the input for B is the result of the previous calculation, 0.07.
B is calculated with the following formula in Excel
=1-(1-B1*(1-4000))/EXP(0.007*$A2)
How do I implement this formula in R?
If I understand your question correctly you wish to reference a previous row in a calculation for the current row.
You can adapt a function that was provided in another SO question here.
rowShift <- function(x, shiftLen = 1L) {
  # build shifted indices; e.g. shiftLen = -1 refers to the previous row
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA  # rows shifted before the start have no value
  return(x[r])
}
test <- data.frame(x = c(1:10), y = c(2:11))
test$z <- rowShift(test$x, -1) + rowShift(test$y, -1)
> test
x y z
1 1 2 NA
2 2 3 3
3 3 4 5
4 4 5 7
5 5 6 9
6 6 7 11
7 7 8 13
8 8 9 15
9 9 10 17
10 10 11 19
Then what you want to achieve becomes
test$z2 <- 1- (1-rowShift(test$x, -1)*(1-4000))/exp(0.007*rowShift(test$y, -1))
> head(test)
x y z z2
1 1 2 NA NA
2 2 3 3 -3943.390
3 3 4 5 -7831.772
4 4 5 7 -11665.716
5 5 6 9 -15445.790
6 6 7 11 -19172.560
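Note that rowShift only works when the referenced column already exists in full. Since B here feeds back into its own calculation, the recurrence has to be accumulated row by row; a sketch with Reduce, assuming B starts at 0 in the first row and applying the Excel formula as written:
A <- c(0, seq(10, 90, by = 5))  # the A values from the question
B <- Reduce(function(prev, a) 1 - (1 - prev * (1 - 4000)) / exp(0.007 * a),
            A[-1], init = 0, accumulate = TRUE)
With accumulate = TRUE, Reduce keeps every intermediate result, so B gets one entry per value of A.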
I'd like to do a cut with a guaranteed number of levels returned. So I'd like to take any vector of cumulative percentages and get a cut into deciles. I've tried using cut and it works well in most situations, but in cases where there are large gaps in the percentages it fails to return the desired number of unique cuts, which is 10. Any ideas on how to ensure that the number of cuts is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, sized in such a way that I am assured of getting 10. cut et al. give 9 bins with this example; I want 10. So I'm looking for an algorithm that would recognize that the gap between 58.03 and 72.46 is large. Instead of assigning the cases 58.03, 72.46, 73.4 to bins 6, 8, 8, it would assign them to bins 6, 7, 8.
Quantile breakpoints give bins with approximately equal numbers of observations; note that the 0% quantile and include.lowest=TRUE are needed so that the minimum is not dropped as NA:
xx <- cut(x, breaks=quantile(x, (0:10)/10, na.rm=TRUE), include.lowest=TRUE)
table(xx)
#------------------------
xx
[0.04,0.256] (0.256,0.58] (0.58,0.718] (0.718,6.76]  (6.76,20.5]
           4            4            4            4            4
 (20.5,35.7]  (35.7,49.7]  (49.7,75.1]  (75.1,85.5]   (85.5,100]
           3            4            4            4            4
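If you want the decile number (1-10) rather than interval labels, the same breaks work with labels=FALSE; a small follow-up using the same x:
dec <- cut(x, breaks = quantile(x, (0:10)/10, na.rm = TRUE),
           include.lowest = TRUE, labels = FALSE)
table(dec)  # each of the 10 bins gets roughly length(x)/10 observations
This assumes the quantiles are distinct; with heavy ties, duplicate breaks would still have to be handled separately.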
numBins = 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins + 1), include.lowest = TRUE)
Output:
...
...
...
10 Levels: [0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 bins that are approximately equally spaced (include.lowest = TRUE keeps min(x) itself from becoming NA). Note that by changing the numBins variable, you may obtain any number of approximately equally spaced bins.
Not sure I understand what you need, but if you drop labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].
What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?