Say I have the following data frame:
df <- data.frame(store = LETTERS[1:8],
sales = c( 9, 128, 54, 66, 23, 132, 89, 70),
successRate = c(.80, .25, .54, .92, .85, .35, .54, .46))
I want to rank the stores according to successRate, with ties going to the store with more sales, so first I do this (just to make visualization easier):
df <- df[order(-df$successRate, -df$sales), ]
In order to actually create a ranking variable, I do the following:
df$rank <- ave(df$successRate, FUN = function(x) rank(-x, ties.method='first'))
So df looks like this:
  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85    2
1     A     9        0.80    3
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8
The problem is I don't want small stores to be part of the ranking. Specifically, I want stores with less than 50 sales not to be ranked. So this is how I define df$rank instead:
df$rank <- ifelse(df$sales < 50, NA,
ave(df$successRate, FUN = function(x) rank(-x, ties.method='first')))
The problem is that even though this correctly removes stores E and A, it doesn't reassign the rankings they were occupying. df looks like this now:
  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8
I've experimented with conditions inside and outside ave(), but I can't get R to do what I want! How can I get it to rank the stores like this?
  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    2
3     C    54        0.54    3
8     H    70        0.46    4
6     F   132        0.35    5
2     B   128        0.25    6
Super easy to do with data.table:
library(data.table)
dt = data.table(df)
# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]
dt[sales >= 50, rank := seq_len(.N)]  # 1..N within the qualifying rows, in the sorted order
dt
# store sales successRate rank
#1: D 66 0.92 1
#2: E 23 0.85 NA
#3: A 9 0.80 NA
#4: G 89 0.54 2
#5: C 54 0.54 3
#6: H 70 0.46 4
#7: F 132 0.35 5
#8: B 128 0.25 6
If you must do it with a plain data.frame, then after applying your preferred order, run:
df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))
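Equivalently, staying in base R, you can rank only the qualifying rows; the double order() call below turns an ordering into ranks and uses sales as the tie-breaker, so it does not depend on the data.frame already being sorted (a small sketch; `big` is just a helper name I made up):
big <- df$sales >= 50
df$rank <- NA
# order(order(...)) converts an ordering permutation into ranks;
# -sales breaks successRate ties in favour of the larger store
df$rank[big] <- order(order(-df$successRate[big], -df$sales[big]))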
I have a column, say the first column below, 'rawdata'. I need to calculate the rank, percentile, and quintile of each value from the rawdata column, in the format below:
RawData  Quintile   Rank  Rank Percentile
   1.20         1     87                3
   0.58         2    897               30
   0.16         5  2,564               84
   1.04         1    145                5
     NA        NA     NA               NA
   0.32         4  1,966               64
   0.18         5  2,471               81
   0.22         4  2,374               78
   0.89         1    241                9
   0.46         3  1,362               45
RawData <- c(1.20, 0.16, 0.58, 1.04)
In general, you can combine the outputs of individual descriptive-statistic calculations into one object using cbind:
df <- cbind(
  RawData,
  rank = rank(RawData),
  # quantile(RawData) returns five quartile cut points, which cannot sit
  # alongside the four data values, so compute a per-observation percentile
  percentile = 100 * rank(RawData) / length(RawData)
)
However, in the data you shared, there are more values of ranks than there are entries in the data set. Are you asking how you would calculate these specific values of rank, quantile, etc. given these particular raw values?
Perhaps something like this (it does not reproduce your figures, but presumably your table comes from a larger data set)...
df <- data.frame(RawData = c(1.2, 0.58, 0.16, 1.04, NA, 1966, 2471, 2374, 241, 1362))
df$Quintile <- cut(df$RawData, quantile(df$RawData, seq(0, 1, 0.2), na.rm = TRUE), labels = 1:5, include.lowest = TRUE)
df$Rank <- rank(df$RawData, na.last = "keep")
df$Percentile <- 100 * df$Rank / max(df$Rank, na.rm = TRUE)
df
   RawData Quintile Rank Percentile
1     1.20        2    4   44.44444
2     0.58        1    2   22.22222
3     0.16        1    1   11.11111
4     1.04        2    3   33.33333
5       NA     <NA>   NA         NA
6  1966.00        4    7   77.77778
7  2471.00        5    9  100.00000
8  2374.00        5    8   88.88889
9   241.00        3    5   55.55556
10 1362.00        4    6   66.66667
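For what it's worth (my addition, not part of the original answer), the percentile column can also be computed from the built-in empirical CDF, which skips NAs when the step function is built and returns NA for NA inputs; with no ties it matches Rank / max(Rank) exactly:
# stats::ecdf builds a step function from the non-NA values
df$Percentile2 <- 100 * ecdf(df$RawData)(df$RawData)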
I have one variable, A:
0
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
which is an input to the following function:
NoBeta <- function(A)
{
  # R's exponential is exp(), not EXP(); the original was also missing a closing ")"
  return(1 - (1 - B * (1 - 4000)) / exp(0.007 * A))
}
The variable B is the result of this function. How do I feed the result back into the function to calculate my next result? Here is B:
0
0.07
0.10
0.13
0.16
0.19
0.22
0.24
0.27
0.30
0.32
0.34
0.37
0.39
0.41
0.43
0.45
0.47
So the function needs to return the values of B while also using B: e.g., if we use A = 10 as input, then the input for B is 0; when the input for A is 15, the input for B is the result of the previous calculation, 0.07.
B is calculated with the following formula in Excel
=1-(1-B1*(1-4000))/EXP(0.007*$A2)
How do I implement this formula in R?
If I understand your question correctly, you wish to reference a previous row in a calculation for the current row.
You can adapt a function that was provided in another SO question here.
rowShift <- function(x, shiftLen = 1L) {
  # shift the vector x by shiftLen positions (negative values look back),
  # padding with NA where the shifted index falls outside the vector
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}
test <- data.frame(x = c(1:10), y = c(2:11))
test$z <- rowShift(test$x, -1) + rowShift(test$y, -1)
> test
    x  y  z
1   1  2 NA
2   2  3  3
3   3  4  5
4   4  5  7
5   5  6  9
6   6  7 11
7   7  8 13
8   8  9 15
9   9 10 17
10 10 11 19
Then what you want to achieve becomes
test$z2 <- 1 - (1 - rowShift(test$x, -1) * (1 - 4000)) / exp(0.007 * rowShift(test$y, -1))
> head(test)
  x y  z         z2
1 1 2 NA         NA
2 2 3  3  -3943.390
3 3 4  5  -7831.772
4 4 5  7 -11665.716
5 5 6  9 -15445.790
6 6 7 11 -19172.560
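Note that rowShift can only look up columns that already exist, whereas the question feeds each computed B back into the next step. That recurrence can be expressed directly with base R's Reduce(..., accumulate = TRUE); a sketch assuming the Excel formula exactly as posted (it will only reproduce the B column if that formula is really the one Excel used):
A <- c(0, 10, 15, 20, 25, 30)   # first few values of A
# start from B = 0 and feed each result into the next step,
# mirroring =1-(1-B1*(1-4000))/EXP(0.007*$A2)
B <- Reduce(
  function(prevB, a) 1 - (1 - prevB * (1 - 4000)) / exp(0.007 * a),
  A[-1],
  init = 0,
  accumulate = TRUE
)
B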
I have a data set generated as follows:
N <- 200   # total number of records (implied by the 50 * 20 / 200 example below)
myData <- data.frame(a = 1:N, b = round(rnorm(N), 2), group = round(rnorm(N, 4), 0))
I would like to generate a stratified sample of myData with a given sample size, i.e., 50. The resulting sample should follow the proportional allocation of the original data set in terms of "group": for instance, if myData has 20 records belonging to group 4, then the resulting data set should have 50 * 20 / 200 = 5 records belonging to group 4. How can I do that in R?
You can use my stratified function (now part of the splitstackshape package), specifying a value < 1 as your proportion, like this:
## Sample data. Seed for reproducibility
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
## Taking the sample
out <- stratified(myData, "group", .3)
out
# a b group
# 17 17 -0.02 2
# 8 8 0.74 3
# 25 25 0.62 3
# 49 49 -0.11 3
# 4 4 1.60 3
# 26 26 -0.06 4
# 27 27 -0.16 4
# 7 7 0.49 4
# 12 12 0.39 4
# 40 40 0.76 4
# 32 32 -0.10 4
# 9 9 0.58 5
# 42 42 -0.25 5
# 43 43 0.70 5
# 37 37 -0.39 5
# 11 11 1.51 6
Compare the counts in the resulting sample with what we would have expected:
round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1
You can also easily take a fixed number of samples per group, like this:
stratified(myData, "group", 2)
# a b group
# 34 34 -0.05 2
# 17 17 -0.02 2
# 49 49 -0.11 3
# 22 22 0.78 3
# 12 12 0.39 4
# 7 7 0.49 4
# 18 18 0.94 5
# 33 33 0.39 5
# 45 45 -0.69 6
# 11 11 1.51 6
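If you would rather not depend on an external function, here is a minimal base-R sketch of the same proportional allocation (my illustration, not from the original answer; because each group's share is rounded, group totals can differ slightly from stratified's):
set.seed(1)
prop <- 0.3
# split the row indices by group and sample round(prop * size) from each
rows <- unlist(lapply(
  split(seq_len(nrow(myData)), myData$group),
  function(i) i[sample.int(length(i), round(prop * length(i)))]
))
myData[rows, ]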
I have a data set:
Security  %market value  return  Quintile (desired)
       1           0.07     100                   3
       2           0.10      88                   2
       3           0.08      78                   1
       4           0.12      59                   1
       5           0.20     106                   4
       6           0.04      94                   3
       7           0.05     111                   5
       8           0.10      83                   2
       9           0.06      97                   3
      10           0.03      90                   3
      11           0.15     119                   5
The actual data set has more than 5,000 rows. I would like to use R to create 5 quintiles, each of which is supposed to hold 20% of the market value, ranked by magnitude of return: the 1st quintile should contain the 20% of market value with the lowest returns, and the 5th quintile the 20% with the highest returns. I would like to create the column "Quintile"; different quintiles can contain different numbers of securities, but the total %market value in each should be the same.
I have tried several methods, and I am very new to R, so please kindly provide me some help. Thank you very much in advance!
Samuel
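For reference, the example table can be rebuilt as a data.frame like this (my reconstruction; the %market value column is renamed marketvalue so that it is a valid R name, matching the first answer below — the second answer refers to the same data as dat, with the market-value column called market):
raw_data <- data.frame(
  Security    = 1:11,
  marketvalue = c(0.07, 0.10, 0.08, 0.12, 0.20, 0.04, 0.05, 0.10, 0.06, 0.03, 0.15),
  return      = c(100, 88, 78, 59, 106, 94, 111, 83, 97, 90, 119),
  Quintile    = c(3, 2, 1, 1, 4, 3, 5, 2, 3, 3, 5)   # the desired result
)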
You can order your data and then use findInterval (adding a small delta so that the right-hand boundary of each interval is effectively closed):
raw_data <- raw_data[order(raw_data$return), ]
# breaks every 20% of cumulative market value (six points: 0, 0.2, ..., 1)
raw_data$Q2 <- findInterval(cumsum(raw_data$marketvalue),
                            seq(0, 1, 0.2) + 0.000001, rightmost.closed = TRUE)
raw_data
# Security marketvalue return Quintile Q2
#4 4 0.12 59 1 1
#3 3 0.08 78 1 1
#8 8 0.10 83 2 2
#2 2 0.10 88 2 2
#10 10 0.03 90 3 3
#6 6 0.04 94 3 3
#9 9 0.06 97 3 3
#1 1 0.07 100 3 3
#5 5 0.20 106 4 4
#7 7 0.05 111 5 5
#11 11 0.15 119 5 5
The following works with your data.
First, sort by increasing return:
dat <- dat[order(dat$return), ]
Then, compute the cumulative market share and cut every 0.2:
dat$Quintile <- ceiling(cumsum(dat$market) / 0.2)
Finally, sort things back by Security:
dat <- dat[order(dat$Security), ]
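One caveat worth noting (my addition, not part of either answer): cumulative sums of decimals can land a hair above a 0.2 boundary through floating-point error, which would bump a security into the next quintile. Rounding the cumulative sum first keeps the cut-offs stable:
# round to 8 decimals before dividing, so e.g. 0.4000000000000001 stays in quintile 2
dat$Quintile <- ceiling(round(cumsum(dat$market), 8) / 0.2)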
R Version 2.11.1 32-bit on Windows 7
I have two data sets, data_A and data_B:
data_A
USER_A USER_B ACTION
1 11 0.3
1 13 0.25
1 16 0.63
1 17 0.26
2 11 0.14
2 14 0.28
data_B
USER_A USER_B ACTION
1 13 0.17
1 14 0.27
2 11 0.25
Now I want to add the ACTION of data_B to that of data_A wherever their USER_A and USER_B are equal. For the example above, the result would be:
data_A
USER_A USER_B ACTION
1 11 0.3
1 13 0.25+0.17
1 16 0.63
1 17 0.26
2 11 0.14+0.25
2 14 0.28
So how could I achieve it?
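For reference, the two data sets can be reconstructed as data.frames like this:
data_A <- data.frame(
  USER_A = c(1, 1, 1, 1, 2, 2),
  USER_B = c(11, 13, 16, 17, 11, 14),
  ACTION = c(0.30, 0.25, 0.63, 0.26, 0.14, 0.28)
)
data_B <- data.frame(
  USER_A = c(1, 1, 2),
  USER_B = c(13, 14, 11),
  ACTION = c(0.17, 0.27, 0.25)
)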
You can use ddply in package plyr and combine it with merge:
library(plyr)
ddply(merge(data_A, data_B, by = c("USER_A", "USER_B"), all.x = TRUE),
      .(USER_A, USER_B), summarise,
      ACTION = sum(ACTION.x, ACTION.y, na.rm = TRUE))
Notice that merge is called with by set to the two key columns and with all.x = TRUE: this keeps all of the rows of the first data.frame passed to merge, i.e. data_A, and brings in data_B's ACTION as a separate column, NA where there is no match. (Leaving by at its default would merge on all common columns, including ACTION, so the two ACTION values would never be paired up.) The merged intermediate looks like this:
  USER_A USER_B ACTION.x ACTION.y
1      1     11     0.30       NA
2      1     13     0.25     0.17
3      1     16     0.63       NA
4      1     17     0.26       NA
5      2     11     0.14     0.25
6      2     14     0.28       NA
This sort of thing is quite easy to do with a database-like operation. Here I use package sqldf to do a left (outer) join and then summarise the resulting object:
require(sqldf)
tmp <- sqldf("select * from data_A left join data_B using (USER_A, USER_B)")
This results in:
> tmp
USER_A USER_B ACTION ACTION
1 1 11 0.30 NA
2 1 13 0.25 0.17
3 1 16 0.63 NA
4 1 17 0.26 NA
5 2 11 0.14 0.25
6 2 14 0.28 NA
Now we just need to sum the two ACTION columns:
data_C <- transform(data_A, ACTION = rowSums(tmp[, 3:4], na.rm = TRUE))
Which gives the desired result:
> data_C
USER_A USER_B ACTION
1 1 11 0.30
2 1 13 0.42
3 1 16 0.63
4 1 17 0.26
5 2 11 0.39
6 2 14 0.28
This can be done using the standard R function merge:
> merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
USER_A USER_B ACTION.x ACTION.y
1 1 11 0.30 NA
2 1 13 0.25 0.17
3 1 16 0.63 NA
4 1 17 0.26 NA
5 2 11 0.14 0.25
6 2 14 0.28 NA
So we can replace the sqldf() call above with:
tmp <- merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
whilst the second line using transform() remains the same.
We can use {powerjoin}:
library(powerjoin)
power_left_join(
data_A, data_B, by = c("USER_A", "USER_B"),
conflict = ~ .x + ifelse(is.na(.y), 0, .y)
)
#> USER_A USER_B ACTION
#> 1 1 11 0.30
#> 2 1 13 0.42
#> 3 1 16 0.63
#> 4 1 17 0.26
#> 5 2 11 0.39
#> 6 2 14 0.28
In case of conflict, the function fed to the conflict argument will be used
on pairs of conflicting columns.
We can also use sum(..., na.rm = TRUE) row-wise for the same effect, via powerjoin's rw (rowwise) formula notation:
power_left_join(data_A, data_B, by = c("USER_A", "USER_B"),
                conflict = rw ~ sum(.x, .y, na.rm = TRUE))