Percentile rank of column values in R

I am looking for a percentage rank for each value in a column.
It is quite easy in Excel, for example:
=RANK.EQ(A1,$A$1:$A$100,1)/COUNT($A$1:$A$100)
This returns, in a new column, the percentile rank of each value in the column referred to above.
I have no problem finding quantiles in R, but I have not been able to find anything that accurately gives the percentile rank of every single value in a column.

Try this using the data in your picture:
> Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each=2),
3507400)
> Cost.Per.Kilo
[1] 6045170 6045170 5412330 5412330 3719760 3719760 3589220 3589220 3507400
> CPK.rank <- rank(Cost.Per.Kilo, ties.method="min")
> CPK.rank
[1] 8 8 6 6 4 4 2 2 1
> round(CPK.rank/length(CPK.rank) * 100)
[1] 89 89 67 67 44 44 22 22 11
In your picture you seem to have divided the ranks by 10, but there are only 9 values. That is why these percentages do not match.
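As a minimal sketch, the steps above can be wrapped in a small helper. The name pct_rank is hypothetical (it is not a base R function); it mirrors the Excel formula by dividing the "min" rank by the number of values. The ecdf() line shows a related base R alternative that instead gives the share of values less than or equal to each value, so tied values get the higher rank.

```r
# Data from the answer above
Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each = 2),
                   3507400)

# Hypothetical helper: rank (ties get the lowest rank) divided by the count,
# analogous to RANK.EQ(...)/COUNT(...) in Excel
pct_rank <- function(x) rank(x, ties.method = "min") / length(x)

pct <- round(pct_rank(Cost.Per.Kilo) * 100)
pct

# Alternative: empirical cumulative distribution, i.e. the proportion of
# values <= each value (ties get the highest rank instead of the lowest)
cum_share <- ecdf(Cost.Per.Kilo)(Cost.Per.Kilo)
cum_share
```

Which tie rule is appropriate depends on how the percentages will be read, so it is worth checking both against a few known rows.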

Related

How do I create a column using values of a second column that meet the conditions of a third in R?

I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset, and if the age at onset of MDD < the onset of OUD, it equals 1, and if the opposite is true, then it equals 2. I also have another column PhysDis that has values 0-100 (numeric in nature).
What I want to do is make a new column that includes the values of PhysDis, but only if MDDOnset == 1, and another if MDDOnset == 2. I want to make these columns so that I can run a t-test on them and compare the two groups (those with MDD prior to OUD and those who had MDD after OUD) to see which group has a greater physical disability score. I want any case where MDDOnset is not 1 to be NA.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I ran the t-test twice, once for MDDOnset == 1 and once for MDDOnset == 2, the mean of y (Comorbidity$PhysDis) was the same both times. When I checked the original CSV file, it turned out that this was the mean of the entire column, not just of the cases where MDDOnset was 1 or 2. If there is a way to run the t-tests so that one uses the mean of PhysDis only where MDDOnset == 1 and the other only where MDDOnset == 2, without creating new columns, please tell me. Sorry if there are similar questions or if my approach is way off; I'm new to R and programming in general. Thanks in advance.
Here's a smaller data frame where I tried to replicate the error where the new columns have switched lengths. The issue would be that the length of C would be 4, and the length of D would be 6 if I could replicate the error.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
A B C D
1 7 25 NA 25
2 9 34 NA 34
3 4 14 14 NA
4 2 76 76 NA
5 5 56 56 NA
6 10 34 NA 34
7 8 23 NA 23
8 6 12 12 NA
9 1 89 89 NA
10 3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
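I would use the mutate function from the dplyr package. Use NA for the rows that do not match; an empty string would turn the new columns into character vectors, which t.test cannot use.
library(dplyr)
Comorbidity <- mutate(Comorbidity,
newColumn = ifelse(MDDOnset == 1, PhysDis, NA),
newColumn2 = ifelse(MDDOnset == 2, PhysDis, NA))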
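For the t-test itself, new columns may not be needed. The sketch below uses a small made-up data frame (the values are invented for illustration, only the column names come from the question) and compares PhysDis between the two MDDOnset groups, first by subsetting and then with the formula interface of t.test. Both run a Welch two-sample test by default.

```r
# Hypothetical stand-in for the Comorbidity data; values are invented
Comorbidity <- data.frame(
  MDDOnset = rep(c(1, 2), times = c(6, 4)),
  PhysDis  = c(40, 55, 61, 38, 70, 52, 65, 80, 72, 58)
)

# Option 1: pass the PhysDis values of each group as two separate vectors
res_subset <- t.test(Comorbidity$PhysDis[Comorbidity$MDDOnset == 1],
                     Comorbidity$PhysDis[Comorbidity$MDDOnset == 2])

# Option 2: formula interface, PhysDis grouped by MDDOnset
# (assumes MDDOnset takes exactly two distinct values)
res_formula <- t.test(PhysDis ~ MDDOnset, data = Comorbidity)

res_subset$estimate   # group means, not the mean of the whole column
res_formula$statistic
```

The original calls compared a logical vector (MDDOnset == 1) with the full PhysDis column, which is why the reported mean of y was always the mean of the entire column.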

Replace value in a column based on a Frequency Count using R

I have a dataset with multiple columns. Many of these columns contain over 32 factors, so to run a Random Forest (for example), I want to replace values in the column based on their Frequency Count.
One of the column reads like this:
$ country
: Factor w/ 92 levels "China","India","USA",..: 30 39 39 20 89 30 16 21 30 30 ...
What I would like to do is only retain the top N (where N is a value between 5 and 20) countries, and replace the remaining values with "Other".
I know how to calculate the frequency of the values using the table function, but I can't seem to find a solution for replacing values on the basis of such a rule. How can this be done?
Some example data:
set.seed(1)
x <- factor(sample(1:5,100,prob=c(1,3,4,2,5),replace=TRUE))
table(x)
# 1 2 3 4 5
# 4 26 30 13 27
Replace all the levels other than the top 3 (Levels 2/3/5) with "Other":
levels(x)[rank(table(x)) < 3] <- "Other"
table(x)
#Other 2 3 5
# 17 26 30 27
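The one-liner above works for the top 3 of 5 levels. A possible generalisation to the top N is sketched below; keep_top_n is a hypothetical helper name, and the example data are constructed to have the same counts as the table above rather than being sampled. Ties at the cut-off are broken by the order returned by sort(), which is an assumption you may want to change.

```r
# Deterministic example with the same counts as the table above
x <- factor(rep(c("1", "2", "3", "4", "5"), times = c(4, 26, 30, 13, 27)))

# Hypothetical helper: keep the n most frequent levels, merge the rest
keep_top_n <- function(f, n, other = "Other") {
  counts <- sort(table(f), decreasing = TRUE)
  top <- names(counts)[seq_len(min(n, length(counts)))]
  # Assigning the same name to several levels merges them into one level
  levels(f)[!(levels(f) %in% top)] <- other
  f
}

y <- keep_top_n(x, 3)
table(y)
```

If adding a dependency is acceptable, the forcats package provides fct_lump_n() for the same purpose.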

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
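The matrix product above builds an nrow-by-nrow matrix just to read its diagonal, which is where the memory cost comes from. A lighter variant of the same idea, sketched here on the same example data, multiplies the values elementwise by a logical mask and uses rowSums. The names vals, col_idx and mask are introduced only for this illustration.

```r
DF <- data.frame(start = 3:7, end = 4:8)
DF <- cbind(DF, matrix(1:50, nrow = 5, ncol = 10))

vals <- as.matrix(DF[, -(1:2)])   # the 10 value columns

# Position of each value column within DF (start/end refer to DF columns,
# so the first value column is column 3)
col_idx <- col(vals) + 2

# TRUE where the column lies between start and end for that row;
# DF$start and DF$end are recycled down the rows of col_idx
mask <- col_idx >= DF$start & col_idx <= DF$end

out <- rowSums(vals * mask)
out
```

This keeps memory proportional to the size of the data rather than to the square of the number of rows.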
If all your values are of the same type, you typically want to work with matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=TRUE)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=TRUE))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to run a function on each row: drop the first
# two entries (start and end), then sum the remaining values
# from position `start` to position `end`
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. With apply you avoid that by generating the whole answer (the $out column) first and then adding it to your data frame in a single step (for example with cbind), so the data frame is modified just once.
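To make the "modify the data frame once" point concrete, here is a small sketch with invented data. It assumes, as in the matrix answer above, that start and stop are positions among the value columns rather than among all columns of df. The result vector is built with vapply on a matrix and assigned to the data frame in one step.

```r
# Invented example: 3 rows, 4 value columns (X1..X4)
df <- data.frame(start = c(1, 2, 3), stop = c(2, 4, 3),
                 matrix(as.numeric(1:12), nrow = 3))

vals <- as.matrix(df[, -(1:2)])   # work on a matrix, not the data frame

# Build the whole result vector first ...
out <- vapply(seq_len(nrow(vals)),
              function(i) sum(vals[i, df$start[i]:df$stop[i]]),
              numeric(1))

# ... then modify the data frame a single time
df$out <- out
df
```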

How to extract certain rows

As you can see, I have Price and Day columns below:
Price Day
2 1
5 2
8 3
11 4
14 5
17 6
20 7
23 8
26 9
29 10
32 11
35 12
38 13
41 14
44 15
47 16
50 17
53 18
56 19
59 20
I then want the output below
Difference Day
12 5
15 10
15 15
15 20
So now I have the difference in prices every 5 days. It basically subtracts the 1st day's price from the 5th day's, then the 5th day's price from the 10th day's, and so on.
I have already written code that separates my data into 5-day intervals, but I want code that subtracts the 1st day's price from the 5th day's, the 5th day's from the 10th day's, and so on.
So the code should look something like this:
difference<-tapply(Price[,1],Day, ____________)
Here Price[,1] is my Price data, while "Day" is the variable I created to separate my Day data into 5-day intervals. I'm thinking that in the blank I could put a function or another variable that subtracts the 1st day's price from the 5th day's, then the 5th day's from the 10th day's, and so on. You don't have to help me separate my days into intervals, just with the "difference" part. Thanks, guys.
Here's one option, assuming your data.frame is called "SODF":
within(SODF[c(1, seq(5, nrow(SODF), 5)), ], {
Price <- diff(c(0, Price))
})[-1, ]
# Price Day
# 5 12 5
# 10 15 10
# 15 15 15
# 20 15 20
The first step is basic subsetting. According to your description and expected answer, you want the first row, and then every fifth row starting from row 5:
> SODF[c(1, seq(5, nrow(SODF), 5)), ]
Price Day
1 2 1
5 14 5
10 29 10
15 44 15
20 59 20
From there, you can use diff on the "Price" column, but since diff returns a vector that is one element shorter than its input, you need to "pad" the input vector, which I did with diff(c(0, Price)).
# Correct values, but the number of rows needs to be 5
> diff(SODF[c(1, seq(5, nrow(SODF), 5)), "Price"])
[1] 12 15 15 15
Then, the [-1, ] at the end just deletes the extraneous row.
Update
As @geektrader points out in the comments (thanks!), instead of using:
SODF[c(1, seq(5, nrow(SODF), 5)), ]
as your input data.frame, you may consider using the following:
rbind(SODF[1, ], SODF[SODF$Day %% 5 == 0, ])
The difference in the two approaches is that the first approach simply subsets by row number, while the second approach subsets according to the value in the "Day" column, extracting rows where "Day" is a multiple of 5. This second approach might be useful, for instance, when there are missing rows in the dataset.
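Putting the value-based subsetting together with diff, a self-contained sketch might look like this. SODF is rebuilt here from the numbers in the question; sel and result are names introduced only for this example.

```r
# Rebuild the question's data: prices 2, 5, ..., 59 on days 1 to 20
SODF <- data.frame(Price = seq(2, 59, by = 3), Day = 1:20)

# First row plus every row whose Day is a multiple of 5
sel <- rbind(SODF[1, ], SODF[SODF$Day %% 5 == 0, ])

# diff() is one element shorter than its input, so pair it with Day[-1]
result <- data.frame(Difference = diff(sel$Price), Day = sel$Day[-1])
result
```

Because the rows are chosen by the value of Day, this keeps working if some days that are not multiples of 5 are missing from the data.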
Ananda's is a nice approach (I always forget about within myself). Here's another approach:
dat2 <- dat[seq(5, nrow(dat), by=5), ]
data.frame(Difference=diff(c(dat[1,1], dat2[, 1])), Day=dat2[, 2])
Here is a solution if you have a matrix as input.
The following function, given a matrix m, a column col_id and a numeric interval interv, takes every interv-th row of m and subtracts from its value in the col_id column the previously selected value in the same column (interv rows earlier, or row 1 for the first interval).
The results are stored in a new column called diff, appended to the selected rows of m.
In short, the approach is very similar to the one used by @Ananda Mahto.
So, this is the function:
subtract_column <- function(m, col_id, interv) {
select <- c(1, seq(interv, nrow(m), interv))
cbind(m[select[-1], ], diff = diff(m[select, col_id]))
}
Example:
# this emulates your data as a matrix
price_vect <- c(2,5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59)
day_vect <- 1:20
matr <- do.call(cbind, list(price = price_vect, day = day_vect))
# and this calls the function above and does the job:
# subtracts every 5 rows the current and the previous (5 rows back) value in the column `price` of matrix `matr`
subtract_column(matr, 'price', 5)
Output:
price day diff
[1,] 14 5 12
[2,] 29 10 15
[3,] 44 15 15
[4,] 59 20 15

Row aggregation when values are close enough in a column

I have a dataframe with 2 columns
time x
1306247226 5
1306247236 10
1306248127 20
1306248187 36
1306249248 28
1306249258 24
1306249259 20
...
I'd like to aggregate the rows whose values in the 'time' column are close enough
(e.g. let's say their difference is less than 60) and sum their 'x' values in the aggregated row. The 'time' value in the aggregated row will be that of the first row of the aggregation. ('time' is a Unix timestamp.)
The goal is to have as output of this example:
time x
1306247226 15
1306248127 20
1306248187 36
1306249248 72
...
The dataset is quite big, a 'for' loop will take a long time... but if it is the only option I can deal with it and wait.
Any idea?
Thanks a lot!
You can use something like this:
First I create a new grouping column for the aggregation
dat$gg <- cumsum(c(0,diff(dat$time)) > 60)
Then I use the plyr package to apply the aggregation function to each group
library(plyr)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 56
3 2 1306249248 72
Edit after comment
The OP wanted rows grouped only when their difference is less than 60, so a gap of exactly 60 must start a new group. So I need to change the > to >=
dat$gg <- cumsum(c(0,diff(dat$time)) >= 60)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 20
3 2 1306248187 36
4 3 1306249248 72
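The same grouping can be done without plyr. The sketch below rebuilds the question's data and uses only base R; gg and res are names chosen for this example. Note that, like the answer above, it compares each row with the previous row, so a long chain of closely spaced rows ends up in one group even if the last row is more than 60 seconds after the first.

```r
dat <- data.frame(
  time = c(1306247226, 1306247236, 1306248127, 1306248187,
           1306249248, 1306249258, 1306249259),
  x    = c(5, 10, 20, 36, 28, 24, 20)
)

# A new group starts whenever the gap to the previous row is >= 60
gg <- cumsum(c(0, diff(dat$time)) >= 60)

res <- data.frame(
  time = dat$time[!duplicated(gg)],        # first timestamp of each group
  x    = as.vector(tapply(dat$x, gg, sum)) # sum of x within each group
)
res
```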
