merge rows into groups - r

I have a data frame which is constructed like this
age share
...
19 0.02
20 0.01
21 0.03
22 0.04
...
I want to merge each age group into larger cohorts like <20, 20-24, 25-29, 30-34, >=35 (and sum the shares).
Of course this could easily be done by hand, but I can hardly believe there is no dedicated function for it. However, I have not been able to find one. Can you help me?

What you want to use is ?cut. For example:
> myData <- read.table(text="age share
+ 19 0.02
+ 20 0.01
+ 21 0.03
+ 22 0.04", header=TRUE)
>
> myData$ageRange <- cut(myData$age, breaks=c(0, 20, 24, 29, 34, 35, 100))
> myData
age share ageRange
1 19 0.02 (0,20]
2 20 0.01 (0,20]
3 21 0.03 (20,24]
4 22 0.04 (20,24]
Notice that you need to include breakpoints that are below the bottom number and above the top number in order for those intervals to form properly. Notice further that the breakpoint is exactly (e.g.) 20, and not <=20, >=21; that is, there cannot be a 'gap' between 20 and 21 such that 20.5 would be left out.
From there, if you want the shares in rows categorized under the same ageRange to be summed, you can create a new data frame:
> newData <- aggregate(share~ageRange, myData, sum)
> newData
ageRange share
1 (0,20] 0.03
2 (20,24] 0.07
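Note that with the breaks above, age 20 falls into (0,20], whereas the question asks for cohorts <20, 20-24, and so on. A minimal sketch of one way to get those cohorts, assuming integer ages: use left-closed intervals (right=FALSE) and explicit labels. The column name cohort and the object cohortShares are illustrative names, not part of the original answer.

```r
# The four rows shown in the question
myData <- data.frame(age = c(19, 20, 21, 22),
                     share = c(0.02, 0.01, 0.03, 0.04))

# right = FALSE makes the intervals [-Inf,20), [20,25), [25,30), [30,35), [35,Inf),
# so age 20 lands in the "20-24" cohort rather than "<20"
myData$cohort <- cut(myData$age,
                     breaks = c(-Inf, 20, 25, 30, 35, Inf),
                     right = FALSE,
                     labels = c("<20", "20-24", "25-29", "30-34", ">=35"))

# Sum the shares within each cohort (cohorts with no rows are omitted)
cohortShares <- aggregate(share ~ cohort, data = myData, FUN = sum)
cohortShares
```

Using -Inf and Inf as the outer breaks avoids having to guess a lower and upper bound for the ages.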

Related

Optimization function across multiple factors

I am trying to identify the appropriate thresholds for two activities which generate the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2, so that if either criterion is met we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, and that we are maximizing our accuracy (our guess of yes = 'outcome' of 1).
set.seed(145) # set the seed before drawing random values so the data are reproducible
location <- c(1,2,3)
testFile <- data.frame(location = rep.int(location, 20),
activity1 = round(rnorm(20, mean = 10, sd = 3)),
activity2 = round(rnorm(20, mean = 20, sd = 3)),
outcome = rbinom(20,1,0.5)
)
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
I was able to accomplish this by creating a table that contains all of the possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation within the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of space.
I would like to create a function that takes the location id, a set of possible thresholds for activity 1, and a set for activity 2, and then calculates how often we would have guessed yes (i.e. the values in 'activity1' or 'activity2' meet or exceed the respective thresholds we're testing) to ensure our application rate stays within our desired range (50% - 75%). Then, among the sets of thresholds that produce an application rate within our desired range, we would want to store only the set that maximizes accuracy, along with its location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I had tried writing this into a for loop, but was not able to navigate my way through the number of nested arguments I would have to make in order to account for all of these conditions. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create yard IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application, and success rate of thresholds mentioned above using historical data
library(data.table) # provides as.data.table()
as.data.table(testFile)[,
list(
application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0))/
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2) & (outcome == 1), 1, 0))/
sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0)),2)
),
by = location]
Consider expand.grid, which builds a data frame of all combinations of the two sets of thresholds. Then use Map to iterate elementwise over both columns of that data frame to build a list of data tables (each of which now includes a column for each threshold).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
as.data.table(testFile)[, list(
act_1_thresholds = th1, # NEW COLUMN
act_2_thresholds = th2, # NEW COLUMN
application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)),2)
), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to append all elements use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
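The question also asks to keep, for each location, only the threshold pair with the best accuracy among those whose application rate is between 50% and 75%. A minimal base-R sketch of that last step follows. It is an illustration under stated assumptions, not the answerer's code: the simulated data (20 fresh draws per location), the helper name best_for_location, and its lo/hi arguments are all hypothetical. It divides by each location's own row count, whereas nrow(testFile) in the code above is the total across all locations. Processing one location at a time also avoids materialising every observation-by-threshold combination at once.

```r
set.seed(145)
# Hypothetical test data: 3 locations with 20 observations each
testFile <- data.frame(
  location  = rep(1:3, each = 20),
  activity1 = round(rnorm(60, mean = 10, sd = 3)),
  activity2 = round(rnorm(60, mean = 20, sd = 3)),
  outcome   = rbinom(60, 1, 0.5)
)

# All combinations of candidate thresholds
grid <- expand.grid(th1 = 7:12, th2 = 19:24)

# For one location's rows d: rate every threshold pair, keep pairs whose
# application rate is within [lo, hi], and return the most accurate one
best_for_location <- function(d, grid, lo = 0.50, hi = 0.75) {
  rates <- t(mapply(function(th1, th2) {
    guess <- d$activity1 >= th1 | d$activity2 >= th2
    c(application_rate = mean(guess),
      accuracy_rate = if (any(guess)) mean(d$outcome[guess] == 1) else NA_real_)
  }, grid$th1, grid$th2))
  res <- cbind(location = d$location[1], grid, rates)
  res <- res[res$application_rate >= lo & res$application_rate <= hi, ]
  res[which.max(res$accuracy_rate), ]  # zero rows if no pair is in range
}

best <- do.call(rbind, lapply(split(testFile, testFile$location),
                              best_for_location, grid = grid))
best
```

Ties in accuracy are broken by taking the first matching pair in grid order; a different tie-break rule would need an explicit sort.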

ANOVA of subsetted data

I am manipulating a data set comprising several factors with several variables. The idea is that I want to do ANOVA analysis between factor levels nested within one level of another factor.
Here is an example similar to my data set:
treatment category trial individual response
1 A big 1 F1 0.10
2 A big 2 F1 0.20
3 A big 1 F2 0.30
4 A big 2 F2 0.11
5 A small 1 F3 0.12
6 A small 2 F3 0.13
7 A small 1 F4 0.20
8 A small 2 F4 0.30
9 B big 1 F5 0.40
10 B big 2 F5 0.21
11 B big 1 F6 0.22
12 B big 2 F6 0.23
13 B small 1 F7 0.31
14 B small 2 F7 0.32
15 B small 1 F8 0.34
16 B small 2 F8 0.25
So basically, I'd like to do an ANOVA between big and small when treatment is A, then B, then same idea with ANOVA between big and small when treatment is A and trial 1... you get the logic.
It seems I have to use:
anova(lm(Y~x,data=dataset))
and add a subset argument, but I can't work the logic out of it and I can't find any example similar to mine. Any hint for it? Thank you in advance!
By your description, you want to apply separate ANOVAs to different subsets of your data.
Try this:
df1 <- df[df$treatment=="A",]
df2 <- df[df$treatment=="B",]
# summary() prints the ANOVA table (F statistic and p-value) for each fit
summary(aov(response ~ category, data=df1))
summary(aov(response ~ category, data=df2))
If you are interested in the effect of the factor treatment, maybe you should keep it in a more complex model and use a post hoc test to examine differences within treatments A and B. But it's just a suggestion.
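The same idea can be written without creating one data frame per subset. The sketch below rebuilds the example data from the question, shows the subset argument of lm() that the question mentions, and then uses split() with lapply() to fit one model per treatment, and per treatment and trial. The object names fit_A, by_treatment and by_cell are illustrative.

```r
# Example data as shown in the question
df <- data.frame(
  treatment  = rep(c("A", "B"), each = 8),
  category   = rep(rep(c("big", "small"), each = 4), times = 2),
  trial      = rep(1:2, times = 8),
  individual = rep(paste0("F", 1:8), each = 2),
  response   = c(0.10, 0.20, 0.30, 0.11, 0.12, 0.13, 0.20, 0.30,
                 0.40, 0.21, 0.22, 0.23, 0.31, 0.32, 0.34, 0.25)
)

# One subset via lm()'s subset argument
fit_A <- anova(lm(response ~ category, data = df, subset = treatment == "A"))

# One ANOVA per treatment level
by_treatment <- lapply(split(df, df$treatment),
                       function(d) anova(lm(response ~ category, data = d)))

# One ANOVA per treatment-by-trial cell (list names look like "A.1", "B.2")
by_cell <- lapply(split(df, list(df$treatment, df$trial)),
                  function(d) anova(lm(response ~ category, data = d)))

by_treatment$A
```

With only four rows per treatment-by-trial cell, the per-cell fits have very few residual degrees of freedom, so their results should be read with caution.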

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-section of the population of any kind of wild fauna), so essentially without variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass=c(1,2,3,4,5,6)
Sample=c(100,99,87,46,32,19)
for(i in 1:6){
PropSurv=c(Sample/100)
}
LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each row (DeathInt) by taking the number that survived and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on). The result should look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x, that contains numbers, you can calculate the difference by using the diff function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(Sample), NA)
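As a self-contained sketch using the numbers from the question: both columns can be computed with vectorized arithmetic, so no for loop is needed. Dividing by the first age class's count (100 here) is an assumption about how PropSurv is meant to be defined.

```r
AgeClass <- 1:6
Sample <- c(100, 99, 87, 46, 32, 19)

LifeTab1 <- data.frame(AgeClass, Sample)

# Proportion surviving relative to the first age class; arithmetic on a
# vector applies to every element at once
LifeTab1$PropSurv <- LifeTab1$Sample / LifeTab1$Sample[1]

# diff() returns next-minus-current, so negate it to get deaths per interval;
# the last age class has no following row, hence the trailing NA
LifeTab1$DeathInt <- c(-diff(LifeTab1$Sample), NA)

LifeTab1
```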

Grouping consecutive integers in r and performing analysis on groups

I have a data frame, with which I would like to group the intervals based on whether the integer values are consecutive or not and then find the difference between the maximum and minimum value of each group.
Example of data:
x Integers
0.1 14
0.05 15
2.7 17
0.07 19
3.4 20
0.05 21
So Group 1 would consist of 14 and 15 and Group 2 would consist of 19,20 and 21.
The difference of each group then being 1 and 2, respectively.
I have tried the following, to first group the consecutive values, with no luck.
Breaks <- c(0, which(diff(Data$Integer) != 1), length(Data$Integer))
sapply(seq(length(Breaks) - 1),
function(i) Data$Integer[(Breaks[i] + 1):Breaks[i+1]])
Here's a solution using by():
df <- data.frame(x=c(0.1,0.05,2.7,0.07,3.4,0.05), Integers=c(14,15,17,19,20,21))
# the group id increases by one wherever consecutive values do not differ by exactly 1
grp <- cumsum(c(0, diff(df$Integers) != 1))
do.call(rbind, by(df, grp, function(g) data.frame(
  imin=min(g$Integers), imax=max(g$Integers), irange=diff(range(g$Integers)),
  xmin=min(g$x), xmax=max(g$x), xrange=diff(range(g$x)))))
## imin imax irange xmin xmax xrange
## 0 14 15 1 0.05 0.1 0.05
## 1 17 17 0 2.70 2.7 0.00
## 2 19 21 2 0.05 3.4 3.35
I wasn't sure what data you wanted in the output, so I just included everything you might want.
You can filter out the middle group with subset(...,irange!=0).
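To see how the grouping key works on its own, here is a small sketch using the integers from the question. The names group_id and spread are illustrative.

```r
Integers <- c(14, 15, 17, 19, 20, 21)

# diff() gives 1 2 2 1 1; a step other than 1 marks the start of a new run,
# and cumsum() turns those marks into run ids
group_id <- cumsum(c(0, diff(Integers) != 1))
group_id
# 0 0 1 2 2 2

# max minus min within each run
spread <- tapply(Integers, group_id, function(v) max(v) - min(v))

# drop runs that consist of a single value
spread[spread != 0]
```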

Sampling and Calculation in R

I have a file that contains two columns (Time, VA). The file is large, and I managed to read it into R (using read and subset, which is not practical for a large file). Now I want to sample based on time, where each sample has a sample size and a sample shift. The sample size is a fixed value for the whole sampling process, e.g. sampleSize = 10 seconds. The sample shift is the start point of each new sample (after the first sample). For example, if sampleShift = 4 sec and sampleSize = 10 sec, the second sample will start from 5 sec and span 10 sec. For each sample I want to feed the VA values to a function that performs some calculation.
Sampling <- function(values){
# Perform the sampling
lastRowNumber<- #specify the last row manually
sampleSize<-10
lastValueInFile<-lastRowNumber-sampleSize
for (i in 1: (lastValueInFile ) ){
EndOfShift<-9+i
sample<-c(1:sampleSize)
h<-1
for(j in i:EndOfShift){
sample[h] <- values[j,1]
h<-h+1
}
print(sample)
#Perform the Calculation on the extracted sample
#--Samp_Calculation<-SomFunctionDoCalculation(sample)
}
}
The problems with my attempt are:
1) I have to specify the lastRow number manually for each file I read.
2) I was trying to do the sampling based on rows number not the Time value. Also, the shift was by one for each sample.
file sample:
Time VA
0.00000 1.000
0.12026 2.000
0.13026 2.000
0.14026 2.000
0.14371 3.000
0.14538 4.000
..........
..........
15.51805 79.002
15.51971 79.015
15.52138 79.028
15.52304 79.040
15.52470 79.053
.............
Any suggestions for a more professional way to do this?
I've generated some test data as follows:
val <- data.frame (time=seq(from=0,to=15,by=0.01),VA=c(0:1500))
... then the function:
sampTime <- function (values,sampTimeLen)
{
# return a data frame for a random sample of the data frame -values-
# of length -sampTimeLen-
minTime <- values$time[1]
maxTime <- values$time[length(values$time)] - sampTimeLen
startTime <- runif(1,minTime,maxTime)
values[(values$time >= startTime) & (values$time <= (startTime+sampTimeLen)),]
}
... can be used as follows:
> sampTime(val,0.05)
time VA
857 8.56 856
858 8.57 857
859 8.58 858
860 8.59 859
861 8.60 860
... which I think is what you were looking for.
(EDIT)
Following the clarification that you want a sample from a specific time rather than a random time, this function should give you that:
sampTimeFrom <- function (values,sampTimeLen,startTime)
{
# return a data frame for sample of the data frame -values-
# of length -sampTimeLen- from a specific -startTime-
values[(values$time >= startTime) & (values$time <= (startTime+sampTimeLen)),]
}
... which gives:
> sampTimeFrom(val,0.05,0)
time VA
1 0.00 0
2 0.01 1
3 0.02 2
4 0.03 3
5 0.04 4
6 0.05 5
> sampTimeFrom(val,0.05,0.05)
time VA
6 0.05 5
7 0.06 6
8 0.07 7
9 0.08 8
10 0.09 9
11 0.10 10
If you want multiple samples, they can be delivered with sapply() like this:
> samples <- sapply(seq(from=0,to=0.15,by=0.05),function (x) sampTimeFrom(val,0.05,x))
> samples[,1]
$time
[1] 0.00 0.01 0.02 0.03 0.04 0.05
$VA
[1] 0 1 2 3 4 5
In this case the output will overlap but making the sampTimeLen very slightly smaller than the shift value (which is shown in the by= parameter of the seq) will give you non-overlapping samples. Alternatively, one or both of the criteria in the function could be changed from >= or <= to > or <.
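Putting the pieces together, here is a minimal sketch of time-based windows with a fixed size and shift, applying a function to the VA values of each window. It assumes the shift is the offset between consecutive window starts and uses half-open windows [start, start + size). The function name window_apply, its arguments, and the test data are hypothetical.

```r
window_apply <- function(values, sampleSize, sampleShift, fun) {
  # Window start times, derived from the data rather than set by hand;
  # assumes the data span at least one full window
  starts <- seq(from = min(values$time),
                to = max(values$time) - sampleSize,
                by = sampleShift)
  sapply(starts, function(s) {
    in_window <- values$time >= s & values$time < s + sampleSize
    fun(values$VA[in_window])
  })
}

# Hypothetical test data: one reading every 0.5 s for 30 s
val <- data.frame(time = seq(0, 30, by = 0.5), VA = 0:60)

# 10-second windows starting every 4 seconds, summarised by their mean
window_means <- window_apply(val, sampleSize = 10, sampleShift = 4, fun = mean)
window_means
```

Because the last start time is computed from max(values$time), the last row number does not need to be supplied manually, and the windows follow the time column rather than row positions.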
