Grouping consecutive integers in R and performing analysis on groups

I have a data frame in which I would like to group the rows based on whether the integer values are consecutive or not, and then find the difference between the maximum and minimum value of each group.
Example of data:
x Integers
0.1 14
0.05 15
2.7 17
0.07 19
3.4 20
0.05 21
So Group 1 would consist of 14 and 15 and Group 2 would consist of 19,20 and 21.
The difference of each group then being 1 and 2, respectively.
I have tried the following, to first group the consecutive values, with no luck.
Breaks <- c(0, which(diff(Data$Integer) != 1), length(Data$Integer))
sapply(seq(length(Breaks) - 1),
       function(i) Data$Integer[(Breaks[i] + 1):Breaks[i + 1]])

Here's a solution using by():
df <- data.frame(x = c(0.1, 0.05, 2.7, 0.07, 3.4, 0.05),
                 Integers = c(14, 15, 17, 19, 20, 21))
do.call(rbind, by(df, cumsum(c(0, diff(df$Integers) != 1)), function(g)
  data.frame(imin   = min(g$Integers),
             imax   = max(g$Integers),
             irange = diff(range(g$Integers)),
             xmin   = min(g$x),
             xmax   = max(g$x),
             xrange = diff(range(g$x)))))
## imin imax irange xmin xmax xrange
## 0 14 15 1 0.05 0.1 0.05
## 1 17 17 0 2.70 2.7 0.00
## 2 19 21 2 0.05 3.4 3.35
I wasn't sure what data you wanted in the output, so I just included everything you might want.
You can filter out the middle group with subset(..., irange != 0).


Optimization function across multiple factors

I am trying to identify the appropriate thresholds for two activities which generate the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2, so that if either criterion is met we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, and that we are maximizing our accuracy (our guess of 'yes' matches an 'outcome' of 1).
set.seed(145)  # seed before simulating, so the data are reproducible
location <- c(1, 2, 3)
testFile <- data.frame(location  = rep.int(location, 20),
                       activity1 = round(rnorm(20, mean = 10, sd = 3)),
                       activity2 = round(rnorm(20, mean = 20, sd = 3)),
                       outcome   = rbinom(20, 1, 0.5))  # 20-value vectors recycled across the 60 rows
act_1_thresholds <- seq(7, 12, 1)
act_2_thresholds <- seq(19, 24, 1)
I was able to accomplish this by creating a table that contains all of the possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation within the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of space.
I would like to create a function that takes the location id and the sets of possible thresholds for activities 1 and 2, and then calculates how often we would have guessed 'yes' (i.e. the values in 'activity1' or 'activity2' exceed their respective thresholds being tested), to ensure our application rate stays within our desired range (50%-75%). Then, for each set of thresholds which produces an application rate within our desired range, we would want to store only the set which maximizes accuracy, along with its location id, application rate, and accuracy rate. The desired output is listed below.
  location act_1_thresh act_2_thresh application_rate accuracy_rate
1        1           13           19             0.52          0.45
2        2           11           24             0.57          0.53
3        3           14           21             0.67          0.42
I had tried writing this into a for loop, but was not able to navigate my way through the number of nested arguments I would have to make in order to account for all of these conditions. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create yard IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application and accuracy rates of the thresholds above using historical data
library(data.table)  # for as.data.table()

as.data.table(testFile)[,
  list(
    application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 |
                                        single_act_2_threshold <= activity2, 1, 0)) /
                             nrow(testFile), 2),
    accuracy_rate    = round(sum(ifelse((single_act_1_threshold <= activity1 |
                                         single_act_2_threshold <= activity2) & (outcome == 1), 1, 0)) /
                             sum(ifelse(single_act_1_threshold <= activity1 |
                                        single_act_2_threshold <= activity2, 1, 0)), 2)
  ),
  by = location]
Consider expand.grid, which builds a data frame of all combinations between both threshold vectors. Then use Map to iterate elementwise over the two columns of that data frame to build a list of data tables (each of which now includes columns for the thresholds used).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
  as.data.table(testFile)[, list(
    act_1_thresholds = th1,   # NEW COLUMN
    act_2_thresholds = th2,   # NEW COLUMN
    application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
                             nrow(testFile), 2),
    accuracy_rate    = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
                             sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)), 2)
  ), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to combine all elements into one table, use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56

How to match two columns with nearest time points?

I have the following data frame. It is a time series in which each subject has values for days 1-4. There is an additional column that shows the time (in hours) at which the test was made.
dt
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24
I have to make a time series plot such that each line represents a subject.
First I made a plot with days and values, with subjects as colors.
This gave me a line plot for each subject, plotted against days and values. I am happy with it.
However, I have to incorporate when the test was taken into the line plot. I could place it separately at the top or bottom of the plot, but not exactly on the line.
Could someone please help me?
Thanks in advance!
Use the directlabels package to add the times:
library(ggplot2)
library(directlabels)

ggplot(DF, aes(Days, values, color = Name)) +
  geom_line() +
  geom_dl(aes(label = Test), method = "last.points")
Note
The input DF in reproducible form is:
Lines <- "
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24"
DF <- read.table(text = Lines, header = TRUE)

Count values in a data set that exceed a threshold in R

I have 2 data sets. The first data set has a vector of p-values from 0.5 to 0.001, and the corresponding threshold that meets each p-value. For example, for 0.05 the value is 13: any value greater than 13 has a p-value of <0.05. This data set contains all the thresholds I'm interested in. Like so:
V1 V2
1 0.500 10
2 0.200 11
3 0.100 12
4 0.050 13
5 0.010 14
6 0.001 15
The 2nd data set is just one long list of values. I need to write an R script that counts the number of values in this set that exceed each threshold. For example, count how many values in the 2nd data set exceed 13 (and therefore have a p-value of <0.05), and do this for each threshold value.
Here are the first 15 values of the 2nd data set (1000 total):
1 11.100816
2 8.779858
3 10.510090
4 9.503772
5 9.392222
6 10.285920
7 8.317523
8 10.007738
9 11.021283
10 9.964725
11 9.081947
12 11.253643
13 10.896120
14 10.272814
15 10.282408
A function which will help you: length(which(...)) counts the entries of a vector that satisfy a condition. For a single threshold, e.g. 13 (p < 0.05):
length(which(dat2$V2 > 13))
Assuming dat1 and dat2 both have a V2 column, something like this:
colSums(outer(dat2$V2, setNames(dat1$V2, dat1$V2), ">"))
# 10 11 12 13 14 15
# 9 3 0 0 0 0
(reads as follows: 9 items have a value greater than 10, 3 items have a value greater than 11, etc.)

merge rows into groups

I have a data frame which is constructed like this
age share
...
19 0.02
20 0.01
21 0.03
22 0.04
...
I want to merge each age group into larger cohorts like <20, 20-24, 25-29, 30-34, >=35 (and sum the shares).
Of course this could easily be done by hand, but I can hardly believe there is no dedicated function for it. However, I am not able to find that function. Can you help me?
What you want to use is ?cut. For example:
> myData <- read.table(text="age share
+ 19 0.02
+ 20 0.01
+ 21 0.03
+ 22 0.04", header=TRUE)
>
> myData$ageRange <- cut(myData$age, breaks=c(0, 20, 25, 30, 35, Inf), right=FALSE)
> myData
  age share ageRange
1  19  0.02   [0,20)
2  20  0.01  [20,25)
3  21  0.03  [20,25)
4  22  0.04  [20,25)
Notice that you need to include breakpoints below the lowest age and above the highest (here 0 and Inf) in order for the end intervals to form properly. Notice further that with right=FALSE each interval is closed on the left and open on the right, so [20,25) covers exactly the ages 20-24 and adjacent intervals tile the whole range with no gap (20.5 would still land in [20,25)).
From there, if you want the shares in rows categorized under the same ageRange to be summed, you can create a new data frame:
> newData <- aggregate(share~ageRange, myData, sum)
> newData
  ageRange share
1   [0,20)  0.02
2  [20,25)  0.08

How do I transform a vector and a list of lists into a data.frame in R?

Essentially I'm after the product of a vector and a list of lists where the LoL has arbitrary lengths.
dose<-c(10,20,30,40,50)
resp<-list(c(.3),c(.4,.45,.48),c(.6,.59),c(.8,.76,.78),c(.9))
I can get something pretty close with
data.frame(dose,I(resp))
but it's not quite right. I need to expand out the resp column of lists pairing the values against the dose column.
The desired format is:
10 .3
20 .4
20 .45
20 .48
30 .6
30 .59
40 .8
40 .76
40 .78
50 .9
Here is a solution using rep() and unlist().
Use rep to repeat each element of dose as many times as the length of the corresponding element of resp.
Use unlist to turn resp into a flat vector.
The code:
data.frame(
  dose = rep(dose, sapply(resp, length)),
  resp = unlist(resp)
)
dose resp
1 10 0.30
2 20 0.40
3 20 0.45
4 20 0.48
5 30 0.60
6 30 0.59
7 40 0.80
8 40 0.76
9 40 0.78
10 50 0.90
