R - Rank and Group [closed]

This is going to be a long shot, but I'll try anyway. I want to build a centile (100 groups) or decile (10 groups) ranking based on the data.frame available.
In this example, I have a data frame with 891 records. In this data.frame, I have the following variables:
Unique_ID (numerical): a unique member number
xbeta (numerical): the given credit score (which allows ranking to be performed)
Good (numerical): binary flag (0 or 1), an indicator of whether the member is delinquent
Bad (numerical): binary flag (0 or 1), the inverse of Good
I need your help to build the equivalent table below. By changing the number of groups, I'd be able to split the data into either 10 or 100 groups using xbeta. With the top row being the total (identifiable via TYPE), I'd like to produce the following table (see the table below for more details):
r_xbeta is just the row number based on the number of groups.
TYPE identifies the total versus a group rank.
n = total count
GOOD / BAD = count of the Good / Bad flag within the rank
xbeta stats: min | max | mean | median
GB_Odds = GOOD / BAD for the rank
LN_GB_ODDS = log(GB_Odds)
The rest should be self-explanatory.
Your help is much appreciated.
Jim learning R
r_xbeta _TYPE_ n GOOD BAD xbeta_min xbeta_max xbeta_mean xbeta_MEDIAN GB_ODDS LN_GB_ODDS Cummu_Good Cummu_Bad Cummu_Good_pct Cummu_Bad_pct
. 0 891 342 549 -4.42 3.63 -0.7 -1.09 0.62295 -0.47329 342 549 100% 100%
0 1 89 4 85 -4.42 -2.7 -3.6 -3.57 0.04706 -3.05636 4 85 1.20% 15%
1 1 89 12 77 -2.69 -2.37 -2.55 -2.54 0.15584 -1.8589 16 162 4.70% 30%
2 1 87 12 75 -2.35 -1.95 -2.16 -2.2 0.16 -1.83258 28 237 8.20% 43%
3 1 93 14 79 -1.95 -1.54 -1.75 -1.79 0.17722 -1.73039 42 316 12% 58%
4 1 88 10 78 -1.53 -1.09 -1.33 -1.33 0.12821 -2.05412 52 394 15% 72%
5 1 89 27 62 -1.03 -0.25 -0.67 -0.69 0.43548 -0.8313 79 456 23% 83%
6 1 89 44 45 -0.24 0.33 0.05 0.03 0.97778 -0.02247 123 501 36% 91%
7 1 89 54 35 0.37 1.07 0.66 0.63 1.54286 0.43364 177 536 52% 98%
8 1 88 77 11 1.08 2.15 1.56 1.5 7 1.94591 254 547 74% 100%
9 1 90 88 2 2.18 3.63 2.77 2.76 44 3.78419 342 549 100% 100%

A reproducible example would be great, i.e. something we can copy-paste to our terminal that demonstrates your problem. For example, here is the dataframe I'll work with:
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
                           xbeta = rnorm(891, sd = 10),
                           Good = round(runif(891) < 0.5),
                           Bad = round(runif(891) < 0.5))
head(my_dataframe)
# Unique_ID xbeta Good Bad
# 1 1 -6.264538 1 0
# 2 2 1.836433 1 0
# 3 3 -8.356286 0 1
# 4 4 15.952808 1 1
# 5 5 3.295078 1 0
# 6 6 -8.204684 1 1
(The particular numbers don't matter to your question, which is why I made up random ones.)
The idea is to:
work out which quantile each row belongs to: see ?quantile. You can specify which quantiles you want (I've shown deciles)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# -30.0804860 -13.3880074 -8.7326454 -5.1121923 -3.0097613 -0.4493361 2.3680366 5.3732613 8.7867326 13.2425863 38.1027668
This gives the quantile cutoffs; if you use cut on these you can add a variable that says which quantile each row is in (?cut):
my_dataframe$quantile <- cut(my_dataframe$xbeta,
                             quantile(my_dataframe$xbeta, seq(0, 1, by=.1)))
Have a look at head(my_dataframe) to see what this did. The quantile column is a factor.
split up your dataframe by quantile, and calculate the stats for each. You can use the plyr, dplyr or data.table packages for this; I recommend one of the first two as you are new to R. If you need to do massive merges and calculations on huge tables efficiently (thousands of rows), use data.table, but the learning curve is much steeper. I will show you plyr purely because it's the one I find easiest; dplyr is very similar, just with a different syntax.
# The idea: `ddply(my_dataframe, .(quantile), FUNCTION)` applies FUNCTION
# to each subset of `my_dataframe`, where we split it up into unique
# `quantile`s.
# For us, `FUNCTION` is `summarize`, which calculates summary stats
# on each subset of the dataframe.
# The arguments after `summarize` are the new summary columns we
# wish to calculate.
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
               n = length(Unique_ID), GOOD = sum(Good), BAD = sum(Bad),
               xbeta_min = min(xbeta), xbeta_max = max(xbeta),
               GB_ODDS = GOOD/BAD) # you can calculate the rest yourself,
                                   # "the rest should be self explanatory".
> head(output, 3)
quantile n GOOD BAD xbeta_min xbeta_max GB_ODDS
1 (-30.1,-13.4] 89 41 39 -29.397737 -13.388007 1.0512821
2 (-13.4,-8.73] 89 49 45 -13.353714 -8.732645 1.0888889
3 (-8.73,-5.11] 89 46 48 -8.667335 -5.112192 0.9583333
Calculate the other columns. See, e.g., ?cumsum for cumulative sums, e.g. output$cummu_good <- cumsum(output$GOOD).
Add the 'total' row. You should be able to do this; you can add an extra row to output using rbind (see the sketch below).
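A minimal sketch of that last step (not from the original answer); it assumes output holds exactly the columns created by the ddply call above:
# Hedged sketch: build a 'total' row from the ungrouped data and append it.
output$quantile <- as.character(output$quantile)   # so a non-bin label can be added
total_row <- data.frame(quantile = "TOTAL",
                        n = nrow(my_dataframe),
                        GOOD = sum(my_dataframe$Good),
                        BAD = sum(my_dataframe$Bad),
                        xbeta_min = min(my_dataframe$xbeta),
                        xbeta_max = max(my_dataframe$xbeta),
                        GB_ODDS = sum(my_dataframe$Good) / sum(my_dataframe$Bad),
                        stringsAsFactors = FALSE)
output <- rbind(total_row, output)   # total first, then the ranked groups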

Here is the final version of my script, with math coffee's guidance. I had to use .bincode instead of the suggested cut because of a "'breaks' are not unique" error (see the note after the script).
Thanks everyone.
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
                           xbeta = rnorm(891, sd = 10),
                           Good = round(runif(891) < 0.5),
                           Bad = round(runif(891) < 0.5))
head(my_dataframe)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
my_dataframe$quantile = .bincode(my_dataframe$xbeta, quantile(my_dataframe$xbeta, seq(0, 1, by=.1)))
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
               n = length(Unique_ID), GOOD = sum(Good), BAD = sum(Bad),
               xbeta_min = min(xbeta), xbeta_max = max(xbeta), xbeta_median = median(xbeta), xbeta_mean = mean(xbeta),
               GB_ODDS = GOOD/BAD, LN_GB_ODDS = log(GOOD/BAD))
output$cummu_good = cumsum(output$GOOD)
output$cummu_bad = cumsum(output$BAD)
output$cummu_n = cumsum(output$n)
output$sum_good = sum(output$GOOD)
output$sum_bad = sum(output$BAD)
output$cummu_good_pct = cumsum(output$GOOD/output$sum_good)
output$cummu_bad_pct = cumsum(output$BAD/output$sum_bad)
output[["sum_good"]]=NULL
output[["sum_bad"]]=NULL
output
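An aside not from the original thread: the "'breaks' are not unique" error usually means some quantile cutoffs coincide (many tied xbeta values). A common workaround, sketched here under that assumption, is to deduplicate the breaks so cut() still works; note this can produce fewer than 10 groups when cutoffs collapse.
# Hedged alternative to .bincode: drop duplicated quantile breaks before cut().
brks <- unique(quantile(my_dataframe$xbeta, seq(0, 1, by = .1)))
my_dataframe$quantile <- cut(my_dataframe$xbeta, breaks = brks,
                             include.lowest = TRUE)  # keeps the minimum xbeta in the first bin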

Related

Interpolate contents of one dataset based on another dataset and merge in R - uneven values

I have two datasets (A, the age dataset, and TE, the concentration dataset) and I'm aiming to plot Concentration ~ Age, but I'm stuck on how to merge and expand the age data to fit the much larger dataset containing concentrations. These are examples of my two datasets:
(A) Distance in this case is in multiples of 25 micrometers and is the distance along the slide. The total distance along each slide differs between slides, depending on the size of the item on each slide. Age is the cumulative age along each slide (so everything is nested within slide).
Slide  Age  Distance
1      7    25
1      14   50
1      22   75
1      28   100
2      8    25
2      15   50
(TE) Distance is continuous and is the distance along the slide, but at a finer scale, and the distance from one data point to the next is not consistent.
Slide  Concentration  Distance
1      7800           0.57
1      7895           0.61
1      6547           1.22
1      6589           1.73
1      6887           4.89
1      6342           5.50
2      8560           35.50
2      8657           36.11
2      8500           38.43
2      8352           39.17
2      8334           41.01
2      7456           42.84
2      8912           56.92
I need a way to merge the two so I can do:
ggplot(TE, aes(x = Age, y = Concentration, group = Slide)) +
  geom_line()
...by expanding the age data to fit the continuous distance scale in the TE dataset, i.e. interpolating an age for each distance in the TE dataset. Something like this:
Slide  Concentration  Distance  Age
1      7800           0.57      0.3
1      7895           0.61      0.4
1      6547           1.22      0.8
1      6589           1.73      1.2
1      6887           4.89      4.3
1      6342           5.50      5.5
2      8560           35.50     7.3
2      8657           36.11     7.4
2      8500           38.43     7.6
2      8352           39.17     7.7
2      8334           41.01     7.8
2      7456           42.84     7.9
2      8912           56.92     8.4
Any ideas?
P.S. Sorry if this isn't clear; I can update as necessary if it's not reproducible enough.
Based on the data for slide 1 in Table A, it appears that there is a linear relationship between age and distance. Rather than joining the two tables while simultaneously interpolating distances in Table A based on age, you could instead: 1) split() Table A by slide; 2) use lm() to obtain a linear model of age on distance for each slide; and 3) use predict() with each linear model and the distance data from Table TE. This gives you the linearly interpolated ages for each concentration in Table TE; the interpolated age and concentration data can then be combined for plotting.
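A minimal sketch of that approach (not from the original answer; it assumes data frames named A and TE with the columns shown above):
# Fit Age ~ Distance separately for each slide, then predict ages at the TE distances.
models <- lapply(split(A, A$Slide), function(d) lm(Age ~ Distance, data = d))

# Predict per slide; lm() will extrapolate linearly outside the fitted distance range.
preds <- lapply(split(TE, TE$Slide), function(d)
  predict(models[[as.character(d$Slide[1])]], newdata = d))
TE$Age <- unsplit(preds, TE$Slide)   # put the predictions back in TE's original row order

library(ggplot2)
ggplot(TE, aes(x = Age, y = Concentration, group = Slide)) + geom_line()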

How do I calculate CV of triplicates in R?

I have 1000+ rows and I want to calculate the CV for each set of rows that share the same Condition.
The data look like this:
Condition Y
0.5 25
0.5 26
0.5 27
1 43
1 45
1 75
5 210
5 124
5 20
10 54
10 78
10 10
and then I did:
CV <- function(x) {
  (sd(x) / mean(x)) * 100
}
CV_per_condition <- aggregate(Y ~ Condition,
                              data = df,
                              FUN = CV)
I have the feeling that what I did uses the mean of the whole column, because the results look a bit off.
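No answer was included here, but as a minimal sketch (not from the original page; column names as shown above): the formula interface of aggregate applies the function within each Condition group, not to the whole column.
# Reproduce the example data and compute one CV per Condition.
df <- data.frame(Condition = rep(c(0.5, 1, 5, 10), each = 3),
                 Y = c(25, 26, 27, 43, 45, 75, 210, 124, 20, 54, 78, 10))
CV <- function(x) (sd(x) / mean(x)) * 100   # coefficient of variation, in percent

aggregate(Y ~ Condition, data = df, FUN = CV)
# Spot check for Condition 0.5: sd(c(25, 26, 27)) / mean(c(25, 26, 27)) * 100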

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
                   Employee = sample(1:20, 20),
                   n_Projects = sample(50:100, 20, replace=TRUE),
                   Performance = round(sample(1:100, 20, replace=TRUE)/100, 2),
                   stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to be weighted by its n_Projects value over the total n_Projects for that boss.
For example, Boss 1 has a total of 604 n_Projects, so employee 1 contributes a weighted performance of 0.13 (78/604 * 0.97 = 0.13), employee 3 contributes 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted performances is the boss's performance, which for Boss 1 is 0.52. So the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
In addition to this problem, can you give me any feedback about this problem (my code, specifically) or any recommendation to improve data-manipulations skills? (you can see in my profile that I have asked a lot of questions like this, but still I'm not able to solve them on my own)
We can get the sum of the product of 'n_Projects' and 'Performance' and divide it by the 'total_projects':
library(dplyr)
df %>%
  group_by(Boss) %>%
  summarise(total_projects = sum(n_Projects),
            Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and #akrun's answer:
You must have received the following error message:
df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
##   Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single value for each Boss, but 7. summarise is there to summarise several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise it into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
  group_by(Boss) %>%
  mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
  summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the total number of rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once each row has a project weight (which depends on the boss), you can calculate the weighted mean, which is a summary value, with summarise.
A more compact way that still shows the weighted calculation would be:
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))

# Reordering to minimise parentheses, which gives #akrun's answer
df %>%
  group_by(Boss) %>%
  summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-section of the population of any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass <- c(1, 2, 3, 4, 5, 6)
Sample <- c(100, 99, 87, 46, 32, 19)
for (i in 1:6) {
  PropSurv <- c(Sample / 100)
}
LifeTab1 <- data.frame(cbind(AgeClass, Sample, PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate those that died in each row (DeathInt) by taking the initial number of those survived and subtracting it by the number below it (i.e. 100-99, then 99-87, then 87-46, so on and so forth). And try to look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x that contains numbers, you can calculate the successive differences with the diff function.
In your case it would be:
LifeTab1$DeathInt <- c(-diff(Sample), NA)
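A short self-contained sketch (not from the original answer) tying this together, including a vectorised alternative to the for() loop asked about in the side note:
AgeClass <- c(1, 2, 3, 4, 5, 6)
Sample   <- c(100, 99, 87, 46, 32, 19)

# Vectorised proportion surviving: no loop needed, R divides element-wise.
# Here Sample[1] is 100, so this matches the /100 used in the question.
PropSurv <- Sample / Sample[1]

LifeTab1 <- data.frame(AgeClass, Sample, PropSurv)
LifeTab1$DeathInt <- c(-diff(Sample), NA)   # deaths per interval; the last class has no next class
LifeTab1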

In R, how to select rows based on a statistic of a column attribute? [duplicate]

My table has thousands of rows classified into 400 classes, and a dozen columns.
The ideal outcome will be a table with 400 rows (one row for each class), selected by the max value of column "z", and containing all the original columns.
Here is an example of my data; I need only the 2nd, 4th, 7th, and 8th rows (marked with *) extracted in this example, using R.
x y z cluster
1 712521.75 3637426.49 19.46 12
2 712520.69 3637426.47 19.66 12 *
3 712518.88 3637426.63 17.37 225
4 712518.4 3637426.48 19.42 225 *
5 712517.11 3637426.51 18.81 225
6 712515.7 3637426.58 17.8 17
7 712514.68 3637426.55 18.16 17 *
8 712513.58 3637426.55 18.23 50 *
9 712512.1 3637426.62 17.24 50
10 712513.93 3637426.88 18.08 50
I have tried many different combinations including these:
tapply(data$z, data$cluster, max) # returns only the max value and cluster columns
which.max(data$z) # returns only the index of the max value in the entire table
I have also read through the plyr package, but did not find a solution.
A very straightforward way is to use aggregate and merge:
> merge(aggregate(z ~ cluster, mydf, max), mydf)
cluster z x y
1 12 19.66 712520.7 3637426
2 17 18.16 712514.7 3637427
3 225 19.42 712518.4 3637426
4 50 18.23 712513.6 3637427
You can even use the output of your tapply code to get what you need. Just make it into a data.frame instead of a named vector.
> merge(mydf, data.frame(z = with(mydf, tapply(z, cluster, max))))
z x y cluster
1 18.16 712514.7 3637427 17
2 18.23 712513.6 3637427 50
3 19.42 712518.4 3637426 225
4 19.66 712520.7 3637426 12
For several more options, see the answers at this question.
Thank you all for the help! aggregate() and merge() worked perfectly for me.
An important point: aggregate() selected only one of the duplicate points per cluster, but merge() selected all duplicate points, since they shared the same max value within a cluster (see the short aside after my script below).
This is ideal in this case, since these points are 3D and are not duplicates once the x and y coordinates are considered.
Here is my solution:
df <- read.table("data.txt", header=TRUE, sep=",")
attach(df)
names(df)
[1] "Row" "x" "y" "z" "cluster"
head(df)
Row x y z cluster
1 1 712521.8 3637426 19.46 361
2 2 712520.7 3637426 19.66 361
3 3 712518.9 3637427 17.37 147
4 4 712518.4 3637426 19.42 147
5 5 712517.1 3637427 18.81 147
6 6 712515.7 3637427 17.80 42
new_table_a <- aggregate(z ~ cluster, df, max) # output 400 rows, no duplicates
new_table_b <- merge(new_table_a, df) # output 408 rows, includes duplicates of "z"
head(new_table_b)
cluster z Row x y
1 1 20.44 6043 712416.2 3637478
2 10 26.09 1138 712458.4 3637511
3 100 19.39 6496 712423.4 3637485
4 101 25.74 2141 712521.2 3637488
5 102 17.33 2320 712508.2 3637484
6 103 21.01 6908 712462.2 3637493
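An aside, not from the original answers: if you ever need exactly one row per cluster even when the maximum z is tied, picking the row index of the maximum within each group avoids the duplicates that merge() brings back (column names as in the example above):
# One row per cluster: keep the (first) row with the maximum z in each group.
one_per_cluster <- do.call(rbind,
                           lapply(split(df, df$cluster),
                                  function(d) d[which.max(d$z), ]))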
