Subtracting Values in Previous Rows: Ecological Lifetable Construction - r

I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-sectional sample of the population of any kind of wild fauna), so it essentially leaves out variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass=c(1,2,3,4,5,6)
Sample=c(100,99,87,46,32,19)
for(i in 1:6){
PropSurv=c(Sample/100)
}
LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each row (DeathInt) by taking the number of survivors in that row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on). The result should look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!

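If you have a numeric vector x, you can calculate the differences between consecutive elements with the diff function. diff returns x[i+1] - x[i], so negate it to get the number of deaths, and pad the last row with NA.
In your case it would be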
LifeTab1$DeathInt <- c(-diff(Sample), NA)
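On the side note about the for() loop: arithmetic in R is vectorised, so the loop should not be needed. A minimal sketch, assuming the first age class is the baseline cohort (the question divides by 100, which is the same thing for this data):

```r
# Rebuild the life table from the question without a loop
AgeClass <- 1:6
Sample <- c(100, 99, 87, 46, 32, 19)
LifeTab1 <- data.frame(AgeClass, Sample)

# Vectorised division: every element is divided by the first count
LifeTab1$PropSurv <- Sample / Sample[1]

# diff() gives Sample[i + 1] - Sample[i]; negate it for deaths per interval
# and pad with NA because the last age class has no following row
LifeTab1$DeathInt <- c(-diff(Sample), NA)

print(LifeTab1)
```

The DeathInt column comes out as 1, 12, 41, 14, 13, NA, which matches the table in the question.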

Related

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
Employee = sample(1:20,20),
n_Projects = sample(50:100, 20, replace=TRUE),
Performance = round(sample(1:100,20,replace=TRUE)/100,2),
stringsAsFactors = FALSE)
df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: Each Performance value has to have a weight equal to the n_Project value over the total n_Project for that boss.
For example, Boss 1 has a total of 604 n_Projects. Employee 1 contributes a weighted Performance of 0.13 (78/604 * 0.97 = 0.13), Employee 3 contributes 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted values is the Boss performance, which for Boss 1 is 0.52. So the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
In addition, can you give me any feedback on my code, or any recommendation for improving my data-manipulation skills? (You can see in my profile that I have asked a lot of questions like this, but I'm still not able to solve them on my own.)
We can get the sum of the product of 'n_Projects' and 'Performance' and divide it by 'total_projects':
library(dplyr)
df %>%
group_by(Boss) %>%
summarise(total_projects = sum(n_Projects),
Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and about @akrun's answer:
You must have received the following error message:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you wrote for Weight_Project does not yield a single value for each Boss, but 7 values. summarise is there to reduce several values to one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise the result into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
group_by(Boss) %>%
mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the total number of rows in df, but sum(n_Projects) is calculated separately for each Boss value thanks to group_by.
Once you have a project weight for each row (which depends on the boss), you can calculate the weighted mean, which is a summary value, with summarise.
A more compact way that still shows the weighting explicitly would be:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses, which gives @akrun's answer
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
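For what it's worth, base R's weighted.mean() computes the same quantity, sum(w * x) / sum(w), so it may be the most readable option. The sketch below is a base-R variant rather than the dplyr code above; it retypes the Boss, n_Projects and Performance columns from the data frame printed in the question, and the names by_boss and result are only illustrative.

```r
# Columns copied from the data frame printed in the question
df <- data.frame(
  Boss = c(3, 1, 1, 2, 2, 2, 2, 1, 2, 3, 1, 1, 2, 2, 3, 3, 1, 2, 1, 1),
  n_Projects = c(79, 59, 76, 85, 75, 66, 85, 79, 79, 77,
                 78, 72, 62, 53, 91, 98, 63, 90, 80, 97),
  Performance = c(0.57, 0.18, 0.43, 0.12, 0.10, 0.60, 0.36, 0.65, 0.90, 0.41,
                  0.97, 0.52, 0.69, 0.97, 0.94, 0.63, 0.95, 0.33, 0.48, 0.07)
)

# One sub-data-frame per boss
by_boss <- split(df, df$Boss)

# weighted.mean(x, w) is sum(w * x) / sum(w)
result <- data.frame(
  Boss = as.integer(names(by_boss)),
  total_Projects = sapply(by_boss, function(g) sum(g$n_Projects)),
  Performance = sapply(by_boss, function(g) weighted.mean(g$Performance, g$n_Projects))
)
print(result)
```

Inside a dplyr pipeline the same call can be used directly within summarise(), e.g. weighted.mean(Performance, n_Projects).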

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#11 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot something that looks like this:
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame into a vector and get its sorted unique elements:
sort(unique(unlist(df)))
When we do an rbind, it creates a matrix, and unique on a matrix dispatches to unique.matrix
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows (the default MARGIN is 1) and keeps the unique rows rather than the unique elements. Instead, we can first convert 'price' to a vector with either as.vector or c(price):
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
Alternatively, we can call unique.default directly, which ignores the matrix dimensions:
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
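To get the stacked twenty-row column asked for in part 1, and a first step towards the cumulative shares in part 2, something like the following might work. It is only a sketch: stack() and ecdf() are base R, but the column name too.cheap and the choice of "share of respondents whose too-cheap price is at or below each price point" are my reading of the question.

```r
# Mock data from the question
df <- data.frame(a = c(1, 5, 3, 4, 5),
                 b = c(6, 6, 5, 6, 8),
                 c = c(7, 8, 8, 10, 9),
                 d = c(8, 10, 9, 11, 12))

# stack() puts all columns into one 'values' column (20 rows),
# with 'ind' recording which original column each value came from
stacked <- stack(df)

# Unique price points, sorted, as a one-column data frame
price <- data.frame(price = sort(unique(stacked$values)))

# ecdf() returns a function giving the share of observations <= each price
price$too.cheap <- ecdf(df$a)(price$price)

print(price)
```

The same ecdf() call can be repeated for b, c and d, and the "not" versions are one minus the share.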

R - Rank and Group [closed]

This is going to be a long shot, but I'll try anyway. I want to build centiles (100 groups) or deciles (10 groups) from the data.frame available.
In this example, I have a data frame with 891 records. In this data.frame, I have the following variables.
Unique_ID (numerical). i.e. unique member number
xbeta (numerical) Given credit score. (which allows ranking to be performed)
Good (numerical). Binary Flag (0 or 1). An indicator if member is delinquent
Bad (numerical). Binary Flag (0 or 1) inverse of good
I need your help to build a table equivalent to the one below. By changing the number of groups, I'd be able to split the data into either 10 or 100 groups using xbeta. The top row is the total (identifiable via TYPE), and I'd like to produce the following columns (see the table below for more details):
r_xbeta is just row number based on the # of groups.
TYPE to identify total or group rank
n = Total Count
count of Good | Bad flag within the rank
xbeta stats, min | max | mean | median
GB_Odds = GOOD / BAD for the rank
LN_GB_ODDs = Log(GB_Odds)
rest should be self explanatory
Your help is much appreciated.
Jim learning R
r_xbeta _TYPE_ n GOOD BAD xbeta_min xbeta_max xbeta_mean xbeta_MEDIAN GB_ODDS LN_GB_ODDS Cummu_Good Cummu_Bad Cummu_Good_pct Cummu_Bad_pct
. 0 891 342 549 -4.42 3.63 -0.7 -1.09 0.62295 -0.47329 342 549 100% 100%
0 1 89 4 85 -4.42 -2.7 -3.6 -3.57 0.04706 -3.05636 4 85 1.20% 15%
1 1 89 12 77 -2.69 -2.37 -2.55 -2.54 0.15584 -1.8589 16 162 4.70% 30%
2 1 87 12 75 -2.35 -1.95 -2.16 -2.2 0.16 -1.83258 28 237 8.20% 43%
3 1 93 14 79 -1.95 -1.54 -1.75 -1.79 0.17722 -1.73039 42 316 12% 58%
4 1 88 10 78 -1.53 -1.09 -1.33 -1.33 0.12821 -2.05412 52 394 15% 72%
5 1 89 27 62 -1.03 -0.25 -0.67 -0.69 0.43548 -0.8313 79 456 23% 83%
6 1 89 44 45 -0.24 0.33 0.05 0.03 0.97778 -0.02247 123 501 36% 91%
7 1 89 54 35 0.37 1.07 0.66 0.63 1.54286 0.43364 177 536 52% 98%
8 1 88 77 11 1.08 2.15 1.56 1.5 7 1.94591 254 547 74% 100%
9 1 90 88 2 2.18 3.63 2.77 2.76 44 3.78419 342 549 100% 100%
A reproducible example would be great, i.e. something we can copy-paste to our terminal that demonstrates your problem. For example, here is the dataframe I'll work with:
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
xbeta=rnorm(891, sd=10),
Good=round(runif(891) < 0.5),
Bad=round(runif(891) < 0.5))
head(my_dataframe)
# Unique_ID xbeta Good Bad
# 1 1 -6.264538 1 0
# 2 2 1.836433 1 0
# 3 3 -8.356286 0 1
# 4 4 15.952808 1 1
# 5 5 3.295078 1 0
# 6 6 -8.204684 1 1
(The particular numbers don't matter to your question which is why I made up random ones).
The idea is to:
Work out which quantile each row belongs to: see ?quantile. You can specify which quantiles you want (I've shown deciles)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# -30.0804860 -13.3880074 -8.7326454 -5.1121923 -3.0097613 -0.4493361 2.3680366 5.3732613 8.7867326 13.2425863 38.1027668
This gives the quantile cutoffs; if you use cut on these you can add a variable that says which quantile each row is in (?cut):
my_dataframe$quantile <- cut(my_dataframe$xbeta,
quantile(my_dataframe$xbeta, seq(0, 1, by=.1)))
Have a look at head(my_dataframe) to see what this did. The quantile column is a factor.
Split up your dataframe by quantile, and calculate the stats for each. You can use the plyr, dplyr or data.table packages for this; I recommend one of the first two as you are new to R. If you need to do massive merges and calculations on huge tables efficiently (thousands of rows) use data.table, but the learning curve is much steeper. I will show you plyr purely because it's the one I find easiest. dplyr is very similar, but just has a different syntax.
# The idea: `ddply(my_dataframe, .(quantile), FUNCTION)` applies FUNCTION
# to each subset of `my_dataframe`, where we split it up into unique
# `quantile`s.
# For us, `FUNCTION` is `summarize`, which calculates summary stats
# on each subset of the dataframe.
# The arguments after `summarize` are the new summary columns we
# wish to calculate.
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
n=length(Unique_ID), GOOD=sum(Good), BAD=sum(Bad),
xbeta_min=min(xbeta), xbeta_max=max(xbeta),
GB_ODDS=GOOD/BAD) # you can calculate the rest yourself,
# "the rest should be self explanatory".
head(output, 3)
quantile n GOOD BAD xbeta_min xbeta_max GB_ODDS
1 (-30.1,-13.4] 89 41 39 -29.397737 -13.388007 1.0512821
2 (-13.4,-8.73] 89 49 45 -13.353714 -8.732645 1.0888889
3 (-8.73,-5.11] 89 46 48 -8.667335 -5.112192 0.9583333
Calculate the other columns. See, e.g., ?cumsum for cumulative sums: output$cummu_good <- cumsum(output$GOOD).
Add the 'total' row. You should be able to do this. You can add an extra row to output using rbind.
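The steps above can be sketched end to end in base R. This is an illustration under assumptions, not the code from this answer: the data is simulated (with Bad defined as 1 - Good, as the question describes), and the names scores, breaks, output, total and final are made up. One detail worth knowing is that cut() uses intervals that are open on the left by default, so the single smallest xbeta falls outside the first interval and becomes NA unless include.lowest = TRUE is passed.

```r
set.seed(1)
# Simulated stand-in for the real data; Bad is the inverse of Good
scores <- data.frame(xbeta = rnorm(891, sd = 10),
                     Good = rbinom(891, 1, 0.4))
scores$Bad <- 1 - scores$Good

breaks <- quantile(scores$xbeta, seq(0, 1, by = 0.1))

# Default cut(): the minimum xbeta is not in (min, ...] and becomes NA
n_missing <- sum(is.na(cut(scores$xbeta, breaks)))

# include.lowest = TRUE closes the first interval; labels = FALSE gives codes 1..10
scores$group <- cut(scores$xbeta, breaks, include.lowest = TRUE, labels = FALSE)

# Per-group counts
output <- data.frame(
  group = 1:10,
  n = as.vector(table(scores$group)),
  GOOD = as.vector(tapply(scores$Good, scores$group, sum)),
  BAD = as.vector(tapply(scores$Bad, scores$group, sum))
)
output$Cummu_Good <- cumsum(output$GOOD)
output$Cummu_Bad <- cumsum(output$BAD)

# Total row, built with the same column names so rbind() can stack it on top
total <- data.frame(group = NA_integer_, n = sum(output$n),
                    GOOD = sum(output$GOOD), BAD = sum(output$BAD),
                    Cummu_Good = sum(output$GOOD), Cummu_Bad = sum(output$BAD))
final <- rbind(total, output)
final$GB_ODDS <- final$GOOD / final$BAD
final$LN_GB_ODDS <- log(final$GB_ODDS)
print(final)
```

Changing by = 0.1 to by = 0.01 (and 1:10 to 1:100) would give centiles, provided the breaks stay unique.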
Here is the final version of my script, written with math coffee's guidance. I had to use .bincode instead of the suggested cut because of a "'breaks' are not unique" error.
Thanks everyone.
set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
xbeta=rnorm(891, sd=10),
Good=round(runif(891) < 0.5),
Bad=round(runif(891) < 0.5))
head(my_dataframe)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
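# include.lowest=TRUE keeps the minimum xbeta in bin 1 instead of turning it into NA
my_dataframe$quantile = .bincode(my_dataframe$xbeta,quantile(my_dataframe$xbeta,seq(0,1,by=.1)),include.lowest=TRUE)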
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
n=length(Unique_ID), GOOD=sum(Good), BAD=sum(Bad),
xbeta_min=min(xbeta), xbeta_max=max(xbeta), xbeta_median=median(xbeta), xbeta_mean=mean(xbeta),
GB_ODDS=GOOD/BAD, LN_GB_ODDS = log(GOOD/BAD))
output$cummu_good = cumsum(output$GOOD)
output$cummu_bad = cumsum(output$BAD)
output$cummu_n = cumsum(output$n)
output$sum_good = sum(output$GOOD)
output$sum_bad = sum(output$BAD)
output$cummu_good_pct = cumsum(output$GOOD/output$sum_good)
output$cummu_bad_pct = cumsum(output$BAD/output$sum_bad)
output[["sum_good"]]=NULL
output[["sum_bad"]]=NULL
output

Analysing subsets of data from one data frame defined by another data frame

I need to know how to take the mean/median etc. from rows of one data frame selected according to whether they meet a condition that refers to another. Difficult to explain, so I'll just give an example.
d
Position Value
1 0 0.20
2 5 0.30
3 10 0.45
4 15 0.23
5 20 0.71
6 25 0.10
7 30 0.20
8 35 0.22
9 40 0.80
10 45 0.50
11 50 0.31
12 55 0.40
And also d2:
Segment Start End
1 1 0 15
2 2 20 40
3 3 45 55
Basically, "d" gives a variable's value at a certain 'position.' "d2" gives start and end points (or positions) of several 'segments' of the data from "d". Now, what I want is the mean and median of the "value" entries from "d" in each "segment." So for segment 1, because it has start and end positions 0 and 15, respectively, it would return the mean of the entries for positions 0, 5, 10, and 15 from "d". Note that the segments are not necessarily of equal length, so it would not work to just take the mean of the first n entries, second n entries, third n entries, and so on.
One could think of the segments as segments on a chromosome; and each point on the chromosome has a "value" that describes some characteristic of that point on the chromosome, and I have data on what this value equals at each point, and also data on where each segment begins and ends (segments are all contiguous, just not equal length), and now want to compute, say, the mean value for all the points within each segment. Suffice it to say, unlike with my example, in the actual data set there are far too many segments to compute these manually, hence the question. Thanks.
You could try
mapply(function(s,e) {
mean(d$Value[d$Position>=s & d$Position<=e])}
, d2$Start, d2$End)
That should give you a vector with one element per row of d2, in the same order, so you know which segment each value belongs to.
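A self-contained version using the data from the question, extended to return the median as well, could look like this. The names stats and result are just for illustration, and the segment bounds are treated as inclusive on both ends, as in the mapply call above.

```r
# Data from the question
d <- data.frame(Position = seq(0, 55, by = 5),
                Value = c(0.20, 0.30, 0.45, 0.23, 0.71, 0.10,
                          0.20, 0.22, 0.80, 0.50, 0.31, 0.40))
d2 <- data.frame(Segment = 1:3,
                 Start = c(0, 20, 45),
                 End = c(15, 40, 55))

# For each (Start, End) pair, pick the values whose position lies in the
# segment and return both summaries; mapply gives a 2 x 3 matrix, so transpose
stats <- t(mapply(function(s, e) {
  v <- d$Value[d$Position >= s & d$Position <= e]
  c(mean = mean(v), median = median(v))
}, d2$Start, d2$End))

# Attach the summaries to the segment table
result <- cbind(d2, stats)
print(result)
```

For segment 1 this averages the four values at positions 0, 5, 10 and 15, giving 0.295.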

Integrating Data

I have a large data frame as follows which is a subset of a larger data frame.
tree=data.frame(INVYR=tree$INVYR,
DIA=tree$DIA,PLOT=tree$PLOT,SPCD=tree$SPCD,
D.2=tree$D.2, BA.T=tree$BA.T)
What I am attempting to do is calculate the total BA.T per Plot per Year (plots are remeasured in subsequent years). I do this by ...
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x<- x[with(x, order(Group.1,Group.2)), ]
This gives me the data frame...
x=data.frame(Group.1,Group.2,x,PLOT)
Where Group.1 is the INVYR, Group.2 is the PLOT, and x is the total BA.T per plot per year. So far this works great. Here is where my problem begins. I then want to integrate this back into my original tree data.frame. If I merge the data by plot alone, it doesn't account for year and quadruples the data set because of the four remeasurements. I can't use an if statement because the two data sets are not of equal length. The data.frame I wish to end up with is
tree=data.frame(INVYR, DIA, PLOT, SPCD, D.2, BA.T, x)
where x is the total BA.T for the given INVYR and PLOT of that record.
Any thoughts would be greatly appreciated. Thanks.
Edit
INVYR=rbind(1982,1982,1982,1982,1982,1995,1995,1995,1995,1995,2000,2000,2000,2000,2000)
PLOT=rbind(1,1,2,2,3,1,1,2,2,3,1,1,2,2,3)
BA.T=rbind(.1,.2,.3,.4,.2,.3,.5,.8,.3,.6,.7,.2,.1,1,1.02)
tree=data.frame(INVYR,PLOT,BA.T)
head(tree)
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x$INVYR<-x$Group.1
x<- x[with(x, order(Group.1,Group.2)), ]
head(x)
One solution is to use the reshape2 package.
library(reshape2)
tree.m <- melt(data=tree,id.vars=c('INVYR','PLOT')) ## Notice the choice of the id variables: they are the keys!
dcast(tree.m,formula=...~variable,fun.aggregate=sum)
INVYR PLOT BA.T
1 1982 1 0.30
2 1982 2 0.70
3 1982 3 0.20
4 1995 1 0.80
5 1995 2 1.10
6 1995 3 0.60
7 2000 1 0.90
8 2000 2 1.10
9 2000 3 1.02
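To get the totals back onto every row of the original tree data.frame, which is what the question asks for, two base-R options come to mind: ave(), which returns a vector as long as the input, or merge() on both keys at once. This is a sketch using the example data from the Edit; the column name BA.T.total is made up.

```r
# Example data from the question's Edit, built with c()/rep() instead of rbind()
tree <- data.frame(
  INVYR = rep(c(1982, 1995, 2000), each = 5),
  PLOT = rep(c(1, 1, 2, 2, 3), times = 3),
  BA.T = c(.1, .2, .3, .4, .2, .3, .5, .8, .3, .6, .7, .2, .1, 1, 1.02)
)

# Option 1: ave() computes the group sum and repeats it on every row of the group
tree$x <- ave(tree$BA.T, tree$INVYR, tree$PLOT, FUN = sum)

# Option 2: aggregate, then merge on BOTH keys so years are not mixed up
totals <- aggregate(BA.T ~ INVYR + PLOT, data = tree, FUN = sum)
names(totals)[names(totals) == "BA.T"] <- "BA.T.total"
merged <- merge(tree, totals, by = c("INVYR", "PLOT"))

print(merged)
```

Merging on c("INVYR", "PLOT") keeps the row count at 15, whereas merging on PLOT alone multiplies the rows because each plot matches once per inventory year.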

Resources