I am learning to use R for an econometrics project at the university, so forgive my n00bness
Basically, I am given two matrices: a "stock prices" matrix (rows = days, columns = each firm's stock price) and a "market capitalisation" matrix (rows = days, columns = each firm's market cap). For every day of observation I have to gather, in a third matrix, the prices of the shares that belong to the first quintile of the market-capitalisation distribution, and then put the mean of these "small caps" into a fourth vector.
The professor I am working for suggested that I use the quantile function, so my question is: how do I tell whether stock "i" belongs to the first or the last quintile?
thanks for the forthcoming help!
for (i in 1:ndays) {
  quantile(marketcap[i, 2:nfirms], na.rm = TRUE)
  for (j in 1:nfirms) {
    if (marketcap[i, j])  # BELONGS TO THE FIRST QUINTILE OF THE MARKET CAPS
      thirdmatrix <- prices[i, j]
  }
  fourthvector[i] <- mean(thirdmatrix[i, ])
}
Here's a way to find out which quintile a value belongs to. Note that I used quintiles with "open" ends, i.e., each value belongs to exactly one quintile.
a <- 2:9 # reference vector
b <- 1:10 # test vector
quint <- quantile(a, seq(0, 1, 0.2)) # find quintiles
# 0% 20% 40% 60% 80% 100%
# 2.0 3.4 4.8 6.2 7.6 9.0
# to which quintile belong the values in 'b'?
findInterval(b, quint, all.inside = TRUE)
# [1] 1 1 1 2 3 3 4 5 5 5
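Applied to your matrices, here is a rough sketch of the per-day loop, going straight to the daily mean (your fourth vector); it assumes prices and marketcap are day-by-firm matrices as in your question, ndays is the number of days, and the helper name quintile.of is just illustrative:
quintile.of <- function(caps) {
  # quintile (1-5) that each firm's market cap falls into on a given day
  findInterval(caps, quantile(caps, seq(0, 1, 0.2), na.rm = TRUE), all.inside = TRUE)
}
smallcap.mean <- numeric(ndays)  # the "fourth vector" of daily means
for (i in 1:ndays) {
  q <- quintile.of(marketcap[i, ])                           # quintile of each firm on day i
  smallcap.mean[i] <- mean(prices[i, q == 1], na.rm = TRUE)  # mean price of first-quintile firms
}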
Assume the following dataframe:
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
DF
#Application Rating
#1 A 0
#2 A 0.6
#3 B 0.6
#4 B 2.0
#5 B 2.0
#6 C 3.8
#7 C 3.8
#8 D 3.9
I want to create an empty results table to be populated through a loop:
1st column - to show the rating being counted (e.g. 0.6)
2nd column - to show the number of times that rating occurs in DF
3rd column - to list total number of ratings in DF (i.e. 8)
4th column - to calculate the proportion of the applications with that rating relative to the overall total
#create empty results table
results_rating_bins <- as.data.frame(matrix(nrow = 1, ncol = 4))
#initiate row count
rownr = 1
#Loop:
for (rating in seq(from = 0, to = 4.0, by = 0.1)) {
  this_rating <- subset(DF, DF$Rating == rating)
  results_rating_bins[rownr, 1] = rating
  results_rating_bins[rownr, 2] = nrow(this_rating)
  results_rating_bins[rownr, 3] = nrow(DF)
  results_rating_bins[rownr, 4] = nrow(this_rating) / nrow(DF)
  rownr <- rownr + 1
}
The final result is what I expect, except for rating 2.0 where the count is 0 even though it should be 2.
This illustrates at small scale what I see at larger scale with a 30k line dataset. I have a list of apps with ratings going from 0 to 4.9, so the range in my loop would be set to 0 to 4.9 instead of 0 to 4.0 as in my example. However, when I run the loop on the large dataset I end up with a number of instances where the rating count is 0 even though it shouldn't be. What's even more odd is that, by playing around with the ranges, the ratings where the anomaly (i.e. count = 0) happens vary completely randomly.
Any idea what may explain this type of behaviour?
Typically I answer the questions as asked, trying to work through the logic a question poster is already using. However, in this case, it is so much easier to use dplyr to aggregate into the new table that I am breaking with tradition.
require(dplyr)
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
df2 <- DF %>%
  group_by(Application, Rating) %>%
  summarize(ratio = n() / nrow(DF))
The first part is the same as yours, but with the library call added.
Where it starts with df2, you are setting the df2 data frame equal to a grouped version of your initial data frame, based on the combinations of Application and Rating. In the summarize statement, for each possible combination we count the occurrences with n() and divide by the total number of rows in the original data frame, nrow(DF). This creates the third column of your new table: the proportion of the total that each pair represents.
It looks like this; you could add a column with the total number of rows with another statement if you need it, but it is not necessary for this calculation.
Application Rating ratio
1 A 0 0.125
2 A 0.6 0.125
3 B 0.6 0.125
4 B 2.0 0.250
5 C 3.8 0.250
6 D 3.9 0.125
This will absolutely catch every combination of Application and Rating and calculate the ratio relative to the whole data frame.
EDIT: If you do not care about the Application letter, you can simply remove it from the group_by function and still get what you want.
And add
%>%
  mutate(rows = nrow(DF))
if you want the total number of rows in the frame on each row.
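For reference, a minimal sketch of a single pipeline that produces all four requested quantities at once (the rating, its count, the total number of ratings, and the proportion); the column names here are just illustrative:
library(dplyr)
results <- DF %>%
  group_by(Rating) %>%            # one row per rating actually present in DF
  summarize(count = n()) %>%      # how many times that rating occurs
  mutate(total = nrow(DF),        # total number of ratings in DF
         ratio = count / total)   # proportion relative to the overall total
results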
I have numbers from 1 to 6000 and I want them to be separated in the manner listed below.
1-10 as "Range1"
10-20 as "Range2"
20-30 as ""Range3"
.
.
.
5990-6000 as "Range600".
I want to bin the numbers into ranges with an equal interval of 10, and then I want to find which range occurs most frequently.
How can I solve this in R?
You should use the cut function; then table will count each category, and sort will order the counts by prevalence.
x <- 1:6000
x2 <- cut(x, breaks = seq(0, 6000, by = 10), labels = paste0('Range', 1:600))
sort(table(x2), decreasing = TRUE)
There is a maths trick for your question. If you want categories of length 10, round(x/10) will create a category in which 0-5 becomes 0, 6 to 14 becomes 1, 15 to 24 becomes 2, etc. If you want to create categories 1-10, 11-20, etc., you can use round((x+4.1)/10).
(In R, round follows the "round half to even" rule, which is why round(0.5) = 0 but round(1.5) = 2; that's why I use 4.1 rather than 5, so the values never land exactly on .5.)
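A quick sanity check of the category boundaries (purely illustrative):
x <- c(1, 10, 11, 20, 21)
round((x + 4.1) / 10)
# [1] 1 1 2 2 3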
Not the most elegant code but maybe the easiest to understand, here is an example:
# Create randomly 50 numbers between 1 and 60
x = sample(1:60, 50)
# Regroup in a data.frame and add a column 'count' containing the value one for each row
df <- data.frame(x, count=1)
df
# create a new column with the category
df$cat <- round((df$x+4.1)/10)
# If you want it as text:
df$cat2 <- paste("Range",round((df$x+4.1)/10), sep="")
str(df)
# Calculate the number of values in each category
freq <- aggregate(count~cat2, data=df, FUN=sum)
# Get the maximum number of values in the most frequent category(ies)
max(freq$count)
# Get the category(ies) name(s)
freq[freq$count == max(freq$count), "cat2"]
The data I have represents sales and their distance (Dist) to a given store, One and Two in this example. What I would like to do is define each store's catchment area based on sales density. A catchment area is defined as the radius that contains 50% of sales. Starting with the orders that have the smallest distance (Dist) to a store, I would like to calculate the radius that contains 50% of that store's sales.
I have the following df that I've calculated in a previous model.
df <- data.frame(ID = c(1,2,3,4,5,6,7,8),
Store = c('One','One','One','One','Two','Two','Two','Two'),
Dist = c(1,5,7,23,1,9,9,23),
Sales = c(10,8,4,1,11,9,4,2))
Now I want to find the minimum distance Dist that gives the closest figure to 50% of Sales. So my output looks as follows:
Output <- data.frame(Store = c('One','Two'),
Dist = c(5,9),
Sales = c(18,20))
I have a lot of observations in my actual df and it's unlikely that I will be able to solve for exactly 50%, so I need to round to the nearest observation.
Any suggestions how to do this?
NOTE: I apologise in advance for the poor title; I tried to think of a better way to formulate the problem, and suggestions are welcome...
Here is one approach with data.table:
library(data.table)
setDT(df)
df[order(Store, Dist),
.(Dist, Sales = cumsum(Sales), Pct = cumsum(Sales) / sum(Sales)),
by = "Store"][Pct >= 0.5, .SD[1,], by = "Store"]
# Store Dist Sales Pct
# 1: One 5 18 0.7826087
# 2: Two 9 20 0.7692308
setDT(df) converts df into a data.table
The .(...) expression selects Dist, and calculates the cumulative sales and respective cumulative percentage of sales, by Store
Pct >= 0.5 subsets this to only cases where cumulative sales exceeds the threshold, and .SD[1,] takes only the top row (i.e., the smallest value of Dist), by Store
I think it would be easier if you rearrange your data into a certain format. My logic is to first take the cumulative sum by group, then merge the group totals into the data, and finally calculate the percentage. Once you have the data in this form, you can subset it any way you want to get the first observation from each group.
df$cums=unlist(lapply(split(df$Sales, df$Store), cumsum), use.names = F)
zz=aggregate(df$Sales, by = list(df$Store), sum)
names(zz)=c('Store', 'TotSale')
df = merge(df, zz)
df$perc=df$cums/df$TotSale
Subsetting the data:
merge(aggregate(perc ~ Store,data=subset(df,perc>=0.5), min),df)
Store perc ID Dist Sales cums TotSale
1 One 0.7826087 2 5 8 18 23
2 Two 0.7692308 6 9 9 20 26
I'm looking for a way to split a data frame into groups of equal size (essentially the same number of rows in each group) whose means are nearly equal.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request. However, that only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details; below I've included a real dataset. We are a transportation company, and this data set has Driver ID, Miles, and Gallons. What I have been doing is reading the data into R and adding an MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two provided answers below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups), but the variation of the means is large, from 6.615 to 7.093, which is too large a variation for me to start with. Thomas' idea gets a slightly tighter variation, but the group sizes all differ, from 6 to 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team based competition, so I need to randomly put the teams together with them all starting from relatively the same group MPG.
Maybe that helps and can lead us in the correct direction? I tried doing this just in my programming language, but it locks the computer up every time, so I figured that R would probably be able to process the data better.
Thanks again!
If similar means is really all that matters, I've put together a simulation below that basically looks at a bunch of different combinations of the data (n) for a particular group size (k) and then minimizes the variance of the group means. With that minimization you can then extract that grouping from the simulation results.
df <- data.frame(User=1:1000,Data=rnorm(1000,0,1)) # example data
myfun = function(){
k <- 5 # number of groups
tmp <- seq(nrow(df)) %% k # really efficient indexing idea from @qwwqwwq's answer
thisgroup <- sample(tmp, dim(df)[1], FALSE) # pull a sample
# thisgroup <- sample(1:k,dim(df)[1],TRUE) # original version
thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
thisvar <- var(thisavg) # variance of means
return(list(group=thisgroup, avgs=thisavg, var=thisvar))
}
n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify=FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate dataframes for each group
You could also let the number of groups vary, if you don't care how many cases end up in each group, by moving the k <- 5 line into the function and making it a random draw from the range of group counts you're willing to have.
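A minimal sketch of that variant, reusing the same structure (the 4:8 range of allowed group counts is just an illustrative assumption):
myfun_vark <- function() {
  k <- sample(4:8, 1)                                 # draw this run's number of groups
  thisgroup <- sample(seq(nrow(df)) %% k)             # shuffled, near-equal-sized group labels
  thisavg <- as.vector(by(df$Data, thisgroup, mean))  # group means
  list(group = thisgroup, avgs = thisavg, var = var(thisavg))
}
sorts <- replicate(1000, myfun_vark(), simplify = FALSE)
wh <- which.min(sapply(sorts, function(x) x$var))
table(sorts[[wh]]$group)                              # group sizes of the winning draw
Note that draws with fewer groups tend to have a smaller variance of group means, so the minimisation will lean toward the low end of the allowed range.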
There are probably other ways to do this, though.
Going by Thomas' idea, here's a brute-force/greedy approach, which'll give more or less the same values (you can opt for more repetitions until you agree with the closeness of the solution).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
samp <- sample(nrow(df))
s.mean <- tapply(df$Data, samp %% grp, mean)
s.var <- var(s.mean)
list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify=FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data, min.idx %% grp)
$`0`
[1] 6.0 5.7
$`1`
[1] 5.5 5.9
$`2`
[1] 5.0 6.2
$`3`
[1] 3.5 7.0
$`4`
[1] 4.5 6.5
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075
Simplest way I can think of: sort the data, take all the indices modulo the number of groups, and you're done. Should work well if the data are normally distributed, I think. Has the advantage of the groups being as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups = 13
df = data.frame( mpg=mpg, group=seq(length(mpg))%%ngroups)
tapply(df$mpg, df$group, mean)
0 1 2 3 4 5 6 7 8
0.080400272 -0.110797283 -0.046698548 -0.014177675 0.024410834 0.048370962 0.066265303 0.087119914 -0.062259638
9 10 11 12
-0.042172496 -0.003451581 0.033853024 0.056947458
I have a set of user recommendations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show the basic properties: mean, median, quartiles and min/max.
But it gives back a summary of each column separately. I refrained from using a data.frame because the factor 'Star' is ordered.
How can I tell R that Star is an ordered factor (a numeric score) and that Votes are their frequencies?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what the problem is. Why shouldn't you use a data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
Then you can expand it into a vector of individual ratings:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on #Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric, hence calculate the summary (at this moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify -- when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? e.g. mean = 4.0625 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?