Determining observations with same values in one variable - r

My problem is that I want to identify households that have the same income and then use their rank numbers (ranked by income) to create another rank variable.
For example, take a data.frame like the sample data frame in the image. The first 2 observations have no income, so there are 2 (= n) observations with the same income and ranks of 1 (= y) and 2 (= x). The new rank variable I want to create for both observations is rank.new = (y+x)/n, so that there is a new column "rank.new" in which observations 1 and 2 both have the value 1.5.
Of course I have many more observations and more households with identical income, so I want to ask how I could do this in R.

You are looking for the function rank
Income = c(0,0,150,300,300,440,500,500,500)
rank(Income)
[1] 1.5 1.5 3.0 4.5 4.5 6.0 8.0 8.0 8.0
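rank() averages tied ranks by default (ties.method = "average"), which matches the (y+x)/n rule from the question. A minimal sketch of applying it to a data frame follows; the data frame `households` and its columns are made up for illustration, and the rows are deliberately left unsorted to show that rank() does not need sorted input.

```r
# Hypothetical household data; not assumed to be sorted by income
households <- data.frame(id = 1:6,
                         income = c(300, 0, 150, 0, 300, 440))

# ties.method = "average" is the default; spelled out here for clarity
households$rank.new <- rank(households$income, ties.method = "average")
households
```

The two households with income 0 should both end up with rank.new 1.5, and the two with income 300 with 4.5.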

I am making your test data a little bigger to show what happens when more than two points are in the same group. You just need to group the points that have the same income and take the average rank within each group. I am assuming that the data has been sorted by Income.
## Test Data
Income = c(0,0,150,300,300,440,500,500,500)
Rank = 1:length(Income)
Group = cumsum(c(1, diff(Income) != 0))
NewRank = aggregate(Rank, list(Group), mean)[Group,2]
NewRank
[1] 1.5 1.5 3.0 4.5 4.5 6.0 8.0 8.0 8.0
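The same group-then-average idea can probably be written more compactly with ave(), which returns the group mean aligned with the original rows. This is only a sketch under the same assumption that Income is already sorted, so that the row position is the rank.

```r
Income <- c(0, 0, 150, 300, 300, 440, 500, 500, 500)
Rank <- as.numeric(seq_along(Income))  # position in the sorted data

# ave() applies mean() (its default FUN) within each distinct Income value
NewRank <- ave(Rank, Income)
NewRank
```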

Related

Calculating range for all variables

For someone new to R, what is the best way to view the range for a number of variables? I've run the summary command on the entire dataset, can I do range () on the entire dataset as well or do i need to create variables for each variable in the dataset?
For an individual variable, you can use range. To see the range of multiple variables, combine range with one of the apply functions. See below for an example.
range(iris$Sepal.Length)
# [1] 4.3 7.9
sapply(iris[, 1:4], range)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#[1,] 4.3 2.0 1.0 0.1
#[2,] 7.9 4.4 6.9 2.5
(only the first four columns were selected from iris since the 5th is a factor, and range doesn't apply for factors)
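If the positions of the numeric columns are not known in advance, one option is to pick them out by type rather than by index. A small sketch using the built-in iris data; the row labels "min" and "max" are added here only for readability.

```r
# Keep only the numeric columns, then compute min and max of each
num_cols <- Filter(is.numeric, iris)
ranges <- sapply(num_cols, range)
rownames(ranges) <- c("min", "max")
ranges
```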

Compute Variance in each column between certain number of rows

I want to compute the variances for each column of a matrix, but that variance must be calculated every 7 rows, for example
9.8 4.5 0.9 7.8.....
5.4 9.8 1.2 3.5....
3.1 2.6 9.5 7.1.....
3.4 NA 1.1 1.5.....
7.9 5.9 3.4 2.6.....
4.5 5.1 7.4 NA.....
VAR VAR VAR VAR
VAR is the variance of the column.
After every 7 rows in the same matrix I have to compute the variance again, removing the NAs. The dimension of the matrix is 266x107.
I tried colVars from the boa package, but that command computes the variance for the entire column.
Here is the data.table approach:
require(data.table)
# Create the data table
dt <- as.data.table(matrix(rnorm(266*107), 266, 107))
# For every 7 rows, calculate variance of each column, ignoring NAs
dt[, lapply(.SD, var, na.rm = TRUE), by = gl(ceiling(266/7), 7, 266)]
aggregate() is a powerful function for this kind of task; no need for another package in this case:
lolzdf <- matrix(rnorm(266*107), 266, 107)
n <- 7
aggregate(lolzdf, list(rep(1:(nrow(lolzdf) %/% n + 1), each = n, len = nrow(lolzdf))), var, na.rm = TRUE)[-1]
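Neither example above actually contains an NA, so here is a smaller base-R sketch that injects one and checks that it is skipped. The 21x3 matrix `m` and the helper vector `block` are made up for illustration; the block index is built with integer division instead of gl() or rep().

```r
set.seed(1)                                 # reproducible made-up data
m <- matrix(rnorm(21 * 3), nrow = 21, ncol = 3)
m[4, 2] <- NA                               # inject a missing value

n <- 7
block <- (seq_len(nrow(m)) - 1) %/% n + 1   # block id per row: 1 x7, 2 x7, 3 x7

# One row of column variances per block of n rows, ignoring NAs
block_var <- t(sapply(split(seq_len(nrow(m)), block),
                      function(rows) apply(m[rows, , drop = FALSE], 2,
                                           var, na.rm = TRUE)))
block_var
```

The result has one row per block and one column per original column, so the full 266x107 case would give a 38x107 matrix.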

Is there any subtle difference between 'mean' and 'average' in R?

Is there any difference between what these two lines of code do:
mv_avg[i-2] <- (sum(file1$rtn[i-2]:file1$rtn[i+2])/5)
and
mv_avg[i-2] <- mean(file1$rtn[i-2]:file1$rtn[i+2])
I'm trying to calculate the moving average of first 5 elements in my dataset. I was running a for loop and the two lines are giving different outputs. Sorry for not providing the data and the rest of the code for you guys to execute and see (can't do that, some issues).
I just want to know if they both do the same thing or if there's a subtle difference between them both.
It's not an issue with mean or sum. The example below illustrates what's happening with your code:
x = seq(0.5,5,0.5)
i = 8
# Your code
x[i-2]:x[i+2]
[1] 3 4 5
# Index this way to get the five values for the moving average
x[(i-2):(i+2)]
[1] 3.0 3.5 4.0 4.5 5.0
x[i-2]=3 and x[i+2]=5, so x[i-2]:x[i+2] is equivalent to 3:5. You're seeing different results with mean and sum because your code is not returning 5 values. Therefore dividing the sum by 5 does not give you the average. In my example, sum(c(3,4,5))/5 != mean(c(3,4,5)).
@G.Grothendieck mentioned rollmean from the zoo package. Here's an example:
library(zoo)
rollmean(x, k=5, align="center")
[1] 1.5 2.0 2.5 3.0 3.5 4.0
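If adding zoo is not an option, a centred moving average can also be sketched in base R with stats::filter(). This is an alternative to rollmean, not what the answer above used; note that it pads the ends with NA instead of dropping them.

```r
x <- seq(0.5, 5, 0.5)

# Centred 5-point moving average: equal weights of 1/5, window on both sides.
# The first and last two positions are NA because the window is incomplete.
ma <- stats::filter(x, rep(1 / 5, 5), sides = 2)
as.numeric(ma)
```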

Double For Loop and calculate averages in R

I have a minor problem, and I'm unsure how to fix the error.
Basically, I have two columns and I want to use a double for loop to calculate the average of each pair of numbers across the two columns, resulting in a vector of averages. To clarify, apply and mean aren't the best functions here because I need only half of the total possible combinations to obtain averages. For example:
Col1<-c(1,2,3,4,5)
Col2<-c(1,2,3,4,5)
Q1<-data.frame(cbind(Col1, Col2))
Q1$mean<-0
for (i in 1:length(Q1$Col1)) {
for (j in i+1:length(Q1$Col2)) {
Q1$mean[i]<-(Q1$Col1[i]+Q1$Col2[j])/2
}
}
Basically, I want to average each number in Q1$Col1 with each number in Q1$Col2. The reason I want to use a double for loop is to eliminate duplicates. Here is the matrix version as a visualization:
1.0 1.5 2.0 2.5 3.0
1.5 2.0 2.5 3.0 3.5
2.0 2.5 3.0 3.5 4.0
2.5 3.0 3.5 4.0 4.5
3.0 3.5 4.0 4.5 5.0
Here, each row represents a number from Q1$Col1 and each column represents a number from Q1$Col2. However, notice that there is redundancy on both sides of the matrix diagonal. So using the Double For Loop, I eliminate the redundancy to obtain the averages of the unique combination of cases. Using the matrix above, it should look like this:
1.0 1.5 2.0 2.5 3.0
    2.0 2.5 3.0 3.5
        3.0 3.5 4.0
            4.0 4.5
                5.0
What I think you're asking is this: given two vectors of numbers, how can I find the mean of the first items in each vector, the mean of the second items in each vector, and so on. If that's the case, then here is a way to do that.
First, you want to use cbind(), not rbind(), in order to get columns rather than rows.
Col1<-c(1,2,3,4,5)
Col2<-c(2,3,4,5,6)
Q1<-cbind(Col1, Col2)
Then you can use the function rowMeans() to figure out (you guessed it) the means of each row. (See also rowSums() and colMeans() and colSums().)
rowMeans(Q1)
#> [1] 1.5 2.5 3.5 4.5 5.5
The more general way to do this is the apply() function, which will let us apply a function to each column or row. Here we use the argument 1 to apply it to rows (because the first row takes the first item from Col1 and Col2, etc.).
apply(Q1, 1, mean)
The results are these:
#> [1] 1.5 2.5 3.5 4.5 5.5
If you really want them in your existing matrix, you could do something like this:
means <- rowMeans(Q1)
cbind(Q1, means)
You do not need the loops to get the averages, you can use vectorised operations:
Col1 <- c(1,2,3,4,5)
Col2 <- c(2,3,4,5,6)
Mean <- (Col1+Col2)/2
Q1 <- rbind(Col1, Col2, Mean)
However, rbind treats your vectors as rows; you could use cbind for columns.
You could just use the outer function to first calculate the averages, then use lower.tri to fill the area underneath the diagonal of the matrix with NA values.
matrix<-outer(Q1$Col1, Q1$Col2, "+")/2
matrix[lower.tri(matrix)] = NA
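If a double loop is still preferred, the loop in the question seems to have two problems: `i+1:length(x)` is parsed as `i + (1:length(x))` because `:` binds tighter than `+`, and `Q1$mean[i]` is overwritten on every inner iteration. A minimal sketch that collects each unique pair once, assuming the diagonal should be included as in the triangle shown in the question:

```r
Col1 <- c(1, 2, 3, 4, 5)
Col2 <- c(1, 2, 3, 4, 5)
n <- length(Col1)

averages <- numeric(0)
for (i in seq_len(n)) {
  for (j in i:n) {                 # j starts at i, so each pair is used once
    averages <- c(averages, (Col1[i] + Col2[j]) / 2)
  }
}
averages
```

This yields the 15 values of the upper triangle, read row by row. Growing a vector inside a loop is fine at this size; for large inputs the outer() approach above is likely the better choice.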

How to get the x which belongs to a quintile?

I am learning to use R for an econometrics project at the university, so forgive my n00bness
Basically, given a matrix "stock prices" (rows = days, columns = firms' stock prices) and another matrix "market capitalisation" (rows = days, columns = firms' market caps), I have to gather in a third matrix the prices of the shares belonging to the first quintile of the distribution of market capitalisation for every day of observation, and then put the mean of these "small caps" in a fourth vector.
The professor I am working for suggested that I use the quantile function, so my question is: how do I find out whether stock "i" belongs to the first or the last quintile?
thanks for the forthcoming help!
for (i in 1:ndays){
quantile(marketcap[i,2:nfirms],na.rm=TRUE)
for (j in 1:nfirms){
if marketcap[j,i] #BELONGS TO THE FIRST QUINTILE OF THE MARKETCAPS
thirdmatrix <- prices[i,j]
}
fourthvector[i] <- mean(thirdmatrix[i,])
}
Here's a way to find out to which quintile a value belongs. Note that I used a quintile with "open" ends, i.e., each value belongs to exactly one quintile.
a <- 2:9 # reference vector
b <- 1:10 # test vector
quint <- quantile(a, seq(0, 1, 0.2)) # find quintiles
# 0% 20% 40% 60% 80% 100%
# 2.0 3.4 4.8 6.2 7.6 9.0
# to which quintile do the values in 'b' belong?
findInterval(b, quint, all.inside = TRUE)
# [1] 1 1 1 2 3 3 4 5 5 5
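Applied to the question, the same two calls could be run once per day (row). The sketch below uses tiny made-up matrices; `marketcap`, `prices` and `smallcap_mean` are hypothetical names, with rows as days and columns as firms as described in the question.

```r
# Hypothetical toy data: 2 days (rows) x 10 firms (columns)
marketcap <- rbind(c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                   c(100, 90, 80, 70, 60, 50, 40, 30, 20, 10))
prices    <- rbind(1:10, 11:20)

smallcap_mean <- numeric(nrow(marketcap))
for (i in seq_len(nrow(marketcap))) {
  # Quintile boundaries of that day's market caps
  quint <- quantile(marketcap[i, ], seq(0, 1, 0.2), na.rm = TRUE)
  # Quintile (1 to 5) of each firm on day i; NA caps stay NA
  q <- findInterval(marketcap[i, ], quint, all.inside = TRUE)
  # Mean price of the firms in the first quintile
  smallcap_mean[i] <- mean(prices[i, which(q == 1)], na.rm = TRUE)
}
smallcap_mean
```

Replacing `q == 1` with `q == 5` would select the last quintile instead.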
