I want to sum a normalized vector separately in two directions in R.
For example, the vector 3, 4, 5, 6, 10, 9, 8, 7 becomes 0.3, 0.4, 0.5, 0.6, 1.0, 0.9, 0.8, 0.7 after normalization. I want to sum the values < 1 on the left and on the right of the maximum separately and find their difference. In this case, left = 0.3 + 0.4 + 0.5 + 0.6 = 1.8, right = 0.9 + 0.8 + 0.7 = 2.4, and the difference (right minus left) is 0.6.
Below are some of my thoughts:
a <- c(3,4,5,6,10,9,8,7)
norm <- a/max(a) # normalization
left <- sum(norm[1:(which.max(norm)-1)]) # left sum
right <- sum(norm[(which.max(norm)+1):length(norm)]) # right sum
diff <- right-left
Any suggestions for improvement?
We can use rleid to create a grouping variable ('ind'), take the sum of 'norm' within each group, and then take the difference of the left and right sums:
library(data.table)
ind <- rleid(norm<1)
diff(as.numeric(tapply(norm[ind!=2], ind[ind!=2], FUN = sum)))
#[1] 0.6
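For completeness, a base R sketch of the same idea, without data.table, assuming the normalized vector has a single maximum that is not at either end:
peak <- which.max(norm)                          # position of the maximum (the value 1)
left <- sum(norm[seq_len(peak - 1)])             # sum of values to its left
right <- sum(norm[seq(peak + 1, length(norm))])  # sum of values to its right
right - left
#[1] 0.6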
Related
I have a simple question, but I am confused about how deciles, quantiles, and percentiles are defined.
My goal is to compute various income and wealth shares, that is, the share of total income or wealth held by x per cent of the population.
So, say that one wants to compute how much wealth the top 10% own.
How can I do this in R? Are my calculations below correct?
MWE
w<-rgamma(10000, 3, scale = 1/3)
per <- quantile(w, c(0.1, 0.9))
top_10_percent <- (per[2]/sum(w))*100
bottom_90_percent <- (per[1]/sum(w))*100
The top 10% should be:
sum(w[w > per[2]])/sum(w)
Alternatively:
sum(tail(sort(w), .1 * length(w))) / sum(w)
The bottom 90% is 1 - top 10%.
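For instance, sticking with the w and per objects from the question, a minimal sketch:
top_10 <- sum(w[w > per[2]]) / sum(w)  # share of total wealth held by the top 10%
bottom_90 <- 1 - top_10                # share held by the remaining 90%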
If I understand the question correctly, the following will do it.
set.seed(1234) # Make the results reproducible
w <- rgamma(10000, 3, scale = 1/3)
per <- quantile(w, c(0.1, 0.9))
Now get an index i1 on the top 10% and sum their wealth.
i1 <- w >= per[2]
sum(w[i1])
#[1] 2196.856
And the same for the bottom 10%, with index i2.
i2 <- w <= per[1]
sum(w[i2])
#[1] 254.6375
Note that I am using >= and <=. See the help page ?quantile for the types of quantile computations R can do; these are selected with the type argument.
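For instance, to see how the type argument changes the cutoff (reusing w from above):
quantile(w, 0.9)            # default, type = 7 (linear interpolation)
quantile(w, 0.9, type = 1)  # inverse of the empirical CDF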
Edit.
To compute proportions and percentages of wealth of the top 10% and bottom 10%, divide by the total wealth and multiply by 100.
top10 <- sum(w[i1])/sum(w)
top10
#[1] 0.221291
100*top10
#[1] 22.1291
bottom10 <- sum(w[i2])/sum(w)
bottom10
#[1] 0.02564983
100*bottom10
#[1] 2.564983
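If you need this for several cutoffs, a small helper function keeps things tidy (a sketch; share_above is just an illustrative name):
# Share of total wealth held by observations at or above the p-th quantile (sketch)
share_above <- function(w, p) {
  cutoff <- quantile(w, p)
  sum(w[w >= cutoff]) / sum(w)
}
share_above(w, 0.9)   # top 10% share
share_above(w, 0.99)  # top 1% share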
I am working on simulating a Markov chain in R in which the states are sunny (S), cloudy (C) and rainy (R), and I am trying to figure out the probability that a sunny day is followed by two consecutive cloudy days.
Here is what I have so far:
P = matrix(c(0.7, 0.3, 0.2, 0.2, 0.5, 0.6, 0.1, 0.2, 0.2), 3)
print(P)
x = c("S", "C", "R")
n = 10000
states = character(n+100)
states[1] = "C"
for (i in 2:(n+100)) {
  if (states[i-1] == "S") {
    cond.prob = P[1,]
  } else if (states[i-1] == "C") {
    cond.prob = P[2,]
  } else {
    cond.prob = P[3,]
  }
  states[i] = sample(x, 1, prob = cond.prob)
}
print(states[1:100])
states = states[-(1:100)]
head(states)
tail(states)
states[1:200]
At the end of this I am left with a sequence of states. I would like to divide this sequence into groups of three states (for the three days in the chain) and then count how many of those three-state groups are equal to SCC.
I am drawing a blank on how to go about this, and any help would be greatly appreciated!
Assuming you want a sliding window (i.e., SCC could occur in positions 1-3, or 2-4, etc.), collapsing the states into a single string and doing a regex search should do the trick:
collapsed <- paste(states, collapse="")
length(gregexpr("SCC", collapsed)[[1]])
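One caveat: gregexpr returns -1 when there is no match at all, so length() would still report 1 for zero occurrences. Counting the positive match positions is slightly safer:
matches <- gregexpr("SCC", collapsed)[[1]]
sum(matches > 0)  # number of SCC windows, 0 when there are none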
On the other hand, if you DON'T want a sliding window (i.e., SCC must be in position 1-3, or 4-6, or 7-9, etc.) then you can chop up the sequence using tapply:
indexer <- rep(1:(ceiling(length(states)/3)), each=3, length=length(states))
chopped <- tapply(states, indexer, paste0, collapse="")
sum(chopped == "SCC")
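If what you are ultimately after is the probability rather than the raw count, dividing by the number of blocks gives a rough estimate (a sketch using the non-overlapping version):
sum(chopped == "SCC") / length(chopped)  # estimated probability of an S, C, C block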
Eric provided the correct answer, but just for the sake of completeness:
you can get the probability you are looking for using the equilibrium distribution of the chain:
# the equilibrium distribution
e <- eigen(t(P))$vectors[,1]
e.norm <- e/sum(e)
# probability of SCC
e.norm[1] * P[1,2] * P[2,2]
This is less computationally expensive and more accurate than the simulation-based estimate, which is subject to sampling error and biased towards the initial state of the chain.
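If you want to convince yourself that e.norm really is the equilibrium distribution, a quick sanity check (reusing the objects above) is that it should be unchanged by one step of the chain:
# Should be TRUE up to numerical tolerance: e.norm %*% P equals e.norm
all.equal(as.numeric(e.norm %*% P), as.numeric(e.norm))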
I need to throw out the outliers of my variable.
I want to reduce the upper 10 percent of my variable.
Yet I have no clue how to find out which values are in my upper 10%.
If I make an arbitrary cut at 30, I cut off the upper 3.45%.
dat$T102_01[dat$T102_01 < 30]
Is there any way to tell R not to take the values < 30 but rather the first 90% of the values?
Since I don't want to make a content-based decision ("anything above 30 is unrealistic"), it would be better to cut off the upper 10% of all the variables I have assessed.
I would be very thankful for any comments.
Sorry, I can't add a picture of my plot. The distribution is skewed: most values are between 0 and 30, and very few values are between 30 and 100.
I would use the quantile function as follows:
x <- rnorm(50)
p90 <- quantile(x = x, probs = 0.9)
want <- x[x < p90]
You can do this by doing a sort and find the value 90% of the way through it:
vec <- rnorm(1000)
cut <- sort( vec )[ round( length( vec ) * 0.9 ) ]
vec <- vec[ vec < cut ]
So we sort the vector, and take the value at the point 90% of the way through the vector as a cut point. We then use the cut point to take only the bottom 90% of the main vector.
The concept of jittering in graphical plotting is intended to make sure points do not overlap. I want to do something similar with a vector.
Imagine I have a vector like this:
v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.8)
As you can see, it is a vector sorted in increasing order, with the caveat that some of the numbers are exactly the same. How can I "jitter" them by adding a very small value, so that they are strictly increasing? I would like to achieve something like this:
0.5, 0.50001, 0.55, 0.60, 0.71, 0.71001, 0.8
How can I achieve this in R?
If the solution allows me to adjust the size of the "added value" it's a bonus!
Jitter and then sort:
sort(jitter(v))
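If you also want to control the size of the noise (the bonus requirement), jitter takes factor and amount arguments, for example:
sort(jitter(v, amount = 1e-5))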
The function rle gives you the run lengths of repeated elements in a vector. Using this information, you can create a sequence along each run of repeats, multiply it by your verySmallNumber, and add it to v.
# New vector to illustrate a triplet
v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.71, 0.8)
# Define the amount you wish to add
verySmallNumber <- 0.00001
# Get the rle
rv <- rle(v)
# Create the sequence, multiply and subtract the verySmallNumber, then add
sequence(rv$lengths) * verySmallNumber - verySmallNumber + v
# [1] 0.50000 0.50001 0.55000 0.60000 0.71000 0.71001 0.71002 0.80000
Of course, a long enough run of repeats could eventually push a value up to the next real value. Adding a check on the longest run of repeats would guard against that, as sketched below.
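A sketch of such a check, assuming v is already sorted in increasing order: the largest offset ever added is (longest run - 1) * verySmallNumber, and it must stay below the smallest gap between distinct neighbouring values.
gaps <- diff(unique(v))
(max(rv$lengths) - 1) * verySmallNumber < min(gaps)  # TRUE means the order is preserved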
At one step of my analysis my data looks like this:
x<-c(-0.023,0.009, .1, 1.02, 1.9)
y<-c(0.5, 2, 2, 3, 4)
data<-cbind(x,y)
However, I want my data to be in this format:
x<-c(0,1)
y<-c(4, 7)
data<-cbind(x,y)
How do I reformat my table so that my x values are integers, with the y values aggregated for each x value?
Also, how do I make the table start from 0 instead of from negative values?
First, clean up the data (round x down and remove any rows with negative x). Then there are a number of ways to finish; I'll give you two:
x <- c(-0.023,0.009, .1, 1.02, 1.9)
y <- c(0.5, 2, 2, 3, 4)
# Dataframes are sometimes easier to work with than matrices
data <- data.frame(x,y)
# Round x down to the nearest integer
data$x <- floor(data$x)
# Remove any negative x
data <- data[data$x>=0,]
# First aggregation method
data1 <- aggregate(data$y,by=list(data$x),sum)
# Second aggregation method
library(plyr)
data2 <- ddply(data,.(x),summarize,y=sum(y))
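For completeness, the same aggregation could also be sketched with dplyr (assuming it is installed), which has largely superseded plyr:
library(dplyr)
data3 <- data %>%
  group_by(x) %>%
  summarise(y = sum(y))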