I have a data frame which has a date column and a cumulative sum column. The cumulative sum data ends at a certain point and I want to use a formula to calculate it for the rest of the dates in the date column. What I am having trouble with is having the formula reference the previous cell in the column, starting from where the count reverts to 0 (where the historical cumulative sum ends).
Example below:
dates.1 <- c("2016-12-06","2016-12-07","2016-12-08","2016-12-09","2016-12-10","2016-12-11","2016-12-12","2016-12-13","2016-12-14")
count.1 <- c(1,3,8,10,0,0,0,0,0)
drift <- .0456
df.1 <- data.frame(cbind(dates.1,count.1))
for (i in df.1$count.1) {
  if (i == 0) {
    head(df.1$count.1, n = 1L) + exp(drift + (qnorm(runif(5, 0, 1))))
  }
}
I can't get the for loop to calculate it right.
n = 5 is passed to runif because that is the number of future entries I want to run the formula for.
The desired output would have something along the lines of
print(df.1$count.1)
[1] 1 3 8 10 12 13 16 17 18
The numbers after the 4th element are just random placeholders; the general idea is that the column would be overwritten, keeping the historical data and replacing the zeroes with the newly calculated entries.
Any ideas?
There is no need to loop. You can get what you want by first identifying the row index at which the cumsum stopped:
last.ind <- which(df.1$count.1==0)[1]-1
Then use this last.ind to restart the cumsum:
set.seed(123) ## for reproducibility
## simulation of rest of data to cumulatively sum
rest.of.data <- exp(drift+(qnorm(runif(5,0,1))))
df.1$count.1[last.ind:length(df.1$count.1)] <- cumsum(c(df.1$count.1[last.ind],rest.of.data))
print(df.1$count.1)
##[1] 1.00000 3.00000 8.00000 10.00000 10.59757 12.92824 13.75970 17.20085 22.17527
If you do want to use a loop, then you should do the following, which gives the same result but will be slower:
for (i in seq_len(length(df.1$count.1))) {
  if (df.1$count.1[i] == 0) {
    df.1$count.1[i] <- df.1$count.1[i-1] + exp(drift + (qnorm(runif(1, 0, 1))))
  }
}
Notes:
Loop over the indices of df.1$count.1, not its values.
If the value at the current index i is 0, write over that value with the sum of the previous value at i-1 and the data to be cumulatively summed.
Also, you should not use cbind to create your data.frame. Doing so in this case first builds a character matrix, so df.1$count.1 ends up as a factor instead of numeric.
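For example, a safer construction (a sketch; here the dates are also parsed with as.Date, whereas the data below keeps them as a factor):
# pass the vectors to data.frame() directly instead of cbind-ing them first
df.1 <- data.frame(dates.1 = as.Date(dates.1), count.1 = count.1)
str(df.1) # count.1 stays numeric; dates.1 is Date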
The data used is:
df.1 <- structure(list(dates.1 = structure(1:9, .Label = c("2016-12-06",
"2016-12-07", "2016-12-08", "2016-12-09", "2016-12-10", "2016-12-11",
"2016-12-12", "2016-12-13", "2016-12-14"), class = "factor"),
count.1 = c(1, 3, 8, 10, 0, 0, 0, 0, 0)), .Names = c("dates.1",
"count.1"), row.names = c(NA, -9L), class = "data.frame")
## dates.1 count.1
##1 2016-12-06 1
##2 2016-12-07 3
##3 2016-12-08 8
##4 2016-12-09 10
##5 2016-12-10 0
##6 2016-12-11 0
##7 2016-12-12 0
##8 2016-12-13 0
##9 2016-12-14 0
I want to generate 300 random data points based on the following criteria:
Class  Value
0      1-8
1      9-11
2      12-14
3      15-16
4      17-20
Logic: when Class = 0, I want a random value between 1 and 8; when Class = 1, a random value between 9 and 11; and so on.
This gives me the following hypothetical table as an example:
Class Value
0 7
0 4
1 10
1 9
1 11
. .
. .
I want to be able to have both equal and unequal mixtures of the classes.
You could do:
df <- data.frame(Class = sample(0:4, 300, TRUE))
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
                   sample, size = 1)
This gives you a data frame with 300 rows and appropriate numbers for each class:
head(df)
#> Class Value
#> 1 0 3
#> 2 1 10
#> 3 4 19
#> 4 2 12
#> 5 4 19
#> 6 1 10
Created on 2022-12-30 with reprex v2.0.2
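As a quick sanity check (a sketch; ranges simply restates the class ranges used above), every sampled Value should fall inside the range of its Class:
ranges <- list(1:8, 9:11, 12:14, 15:16, 17:20)
all(mapply(function(v, cl) v %in% ranges[[cl + 1]], df$Value, df$Class))
#> TRUE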
To provide some additional flexibility, here is a version that allows different probabilities to be used in the sampling and keeps the amount of hard-coded values as small as possible:
# load data.table
library(data.table)
# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14",
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]
# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])
# This is a vector of UNEQUAL Probabilities. If wanted, it should be
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)
# This is the number of Class samples wanted
numberOfSamples = 300
# This is the workhorse
a[sample(.N, numberOfSamples, TRUE, prob = probs), ][,
  smpl := apply(.SD,
                1,
                function(x) sample(eval(parse(text = x)), 1)),
  .SDcols = "value"][, .(Class, smpl)]
What is good about this code?
If you change your classes or the value ranges, the only thing you need to adjust is the original data table (a, as I called it).
If you want to use unequal probabilities for your sampling, you can set them and the code still runs.
If you want to take a smaller or larger sample, you don't have to edit the code; you only change the value of a variable.
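For example, to check that the empirical class frequencies are close to probs, assign the result of the pipeline to a variable (res is a name introduced here purely for illustration):
res <- a[sample(.N, numberOfSamples, TRUE, prob = probs), ][,
  smpl := apply(.SD, 1, function(x) sample(eval(parse(text = x)), 1)),
  .SDcols = "value"][, .(Class, smpl)]
round(prop.table(table(res$Class)), 2) # should be close to probs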
I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit = 0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in example traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
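For reference, these diagnostics can be called as follows (a minimal sketch; densplot() is a further mice diagnostic comparing the densities of observed and imputed values):
plot(imp)     # traceplots of the chain means and standard deviations per imputed variable
densplot(imp) # compare densities of observed and imputed values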
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids object using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
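From here the imputed datasets can be analysed and pooled as usual with with() and pool(); a hypothetical sketch (the model formula is only an illustration, not part of the analysis above):
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(mice::pool(fit))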
I have a set of data in a csv file that I need to group based on transitions of one column. I'm new to R and I'm having trouble finding the right way to accomplish this.
Simplified version of data:
Time Phase Pressure Speed
1 0 0.015 0
2 25 0.015 0
3 25 0.234 0
4 25 0.111 0
5 0 0.567 0
6 0 0.876 0
7 75 0.234 0
8 75 0.542 0
9 75 0.543 0
The length of time that phase changes state is longer than above but I shortened everything to make it readable and this pattern continues on and on. What I'm trying to do is calculate the mean of pressure and speed for each instance where the phase is non-zero. For example, in the output from the sample above there would be two lines, one with the average of the three lines where phase is 25, and with the average of the three lines when phase is 75. It will be possible to see cases where the same numeric value of phase shows up more than once, and I need to treat each of those separately. That is, in the case where phase is 0, 0, 25, 25, 25, 0, 0, 0, 25, 25, 0, I would need to record the first group and the second group of 25s as separate events, as well as any other non-zero groups.
What I've tried:
csv <- read.csv("c:\\test.csv")
ins <- subset(csv, csv$Phase == 25)
exs <- subset(csv, csv$Phase == 75)
mean(ins$Pressure)
mean(exs$Pressure)
This obviously returns the average of the entire file when phase is 25 and 75, but I need to somehow split it into groups using the trailing and leading 0s. Any help is appreciated.
Super quick:
df <- read.csv("your_file_name.csv")
cbind(aggregate(Pressure ~ Phase, df[df$Phase != 0, ], FUN = mean),
      aggregate(Speed ~ Phase, df[df$Phase != 0, ], FUN = mean)[2])
The cbind is a bit of a shortcut - depending on the distribution of values of Phase, you'll need to merge instead.
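A sketch of the merge version (the same aggregations, but joined on Phase so the rows cannot be misaligned):
merge(aggregate(Pressure ~ Phase, df[df$Phase != 0, ], FUN = mean),
      aggregate(Speed ~ Phase, df[df$Phase != 0, ], FUN = mean),
      by = "Phase")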
EDITED: Based on feedback from the asker, they are really seeking to do some aggregations across runs of numbers (i.e. the first group of continuous 25s, then the second group of continuous 25s, and so on). Because of that, I suggest using rle, the run-length encoding function, to get a group number that you can use in the aggregate command.
I've modified the original data so that it contains two runs of 25, just for illustrative purposes, but it should work regardless. Using rle we get the run-length encoded data, and then we create a group number for each row: we take the vector of run indices (1 through the number of runs) and use the rep function to repeat each index by its run length.
After this is done, we can use the same basic aggregation command again.
df_example <- data.frame(Time = 1:9,
                         Phase = c(0,25,25,25,0,0,25,25,0),
                         Pressure = c(0.015,0.015,0.234,0.111,0.567,0.876,0.234,0.542,0.543),
                         Speed = rep(x = 0, times = 9))
encoded_runs <- rle(x = df_example$Phase)
df_example$Group_No <- rep(x = 1:length(x = encoded_runs$lengths),
                           times = encoded_runs$lengths)
aggregate(x = df_example[df_example$Phase != 0, c("Pressure","Speed")],
          by = list(Group_No = df_example[df_example$Phase != 0, "Group_No"],
                    Phase = df_example[df_example$Phase != 0, "Phase"]),
          FUN = mean)
Group_No Phase Pressure Speed
1 2 25 0.120 0
2 4 25 0.388 0
Building upon the comment by Solos and the answer by Cheesman, try:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
df_example = csv
aggregate(x = df_example[df_example$Phase != 0, c("Pressure","Speed")],
          by = list(Phase = df_example[df_example$Phase != 0, "block"]),
          FUN = mean)
actually plyr would be handy:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
require(plyr)
ddply(csv[csv$Phase != 0, ], .(block), summarize,
      mean.Pressure = mean(Pressure), mean.Speed = mean(Speed))
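For completeness, a dplyr equivalent of the same run-grouping idea (a sketch; assumes the same csv data frame as above):
library(dplyr)
csv %>%
  mutate(block = paste(Phase, cumsum(c(1, diff(Phase) != 0)))) %>%
  filter(Phase != 0) %>%
  group_by(block) %>%
  summarise(mean.Pressure = mean(Pressure), mean.Speed = mean(Speed))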
I am trying to create a new column conditional on another column, a bit like a moving average or moving window, but based on distance between points. Take for example row 2 with a CO2 of 399.935. I would like to have the mean of all the points within 100 m (traveled) of that point. In my example (looking at column CumDist), rows 1, 3, 4, 5 would be selected to calculate the mean. The column CumDist (multiplied by 100,000 to get the units in meters) contains the cumulative distance traveled. I have 5000 points and obviously the width (or the number of rows) of the moving window will vary.
I tested over() from the sp package, but it's problematic if the same road is taken more than once. I looked on the web for other solutions and I did not find anything that could help me.
dput(DF)
structure(list(CO2 = c(399.9350305, 399.9350305, 399.9350305,
400.0320031, 400.0320031, 400.0320031, 399.7718229, 399.7718229,
399.7718229, 399.3855075, 399.3855075, 399.3855075, 399.4708139,
399.4708139, 399.4708139, 400.0362474, 400.0362474, 400.0362474,
399.7556753, 399.7556753), lon = c(-103.7093538, -103.709352,
-103.7093492, -103.7093467, -103.7093455, -103.7093465, -103.7093482,
-103.7093596, -103.7094074, -103.7094625, -103.7094966, -103.709593,
-103.709649, -103.7096717, -103.7097349, -103.7097795, -103.709827,
-103.7099007, -103.709924, -103.7099887), lat = c(49.46972027,
49.46972153, 49.46971675, 49.46971533, 49.46971307, 49.4697124,
49.46970636, 49.46968214, 49.46960921, 49.46955984, 49.46953621,
49.46945809, 49.46938994, 49.46935281, 49.46924309, 49.46918635,
49.46914762, 49.46912566, 49.46912407, 49.46913321),distDiff = c(0.000342016147509882,
0.000191466419697602, 0.000569046320857002, 0.000240367540492089,
0.000265977754839834, 0.000103953049523505, 0.000682968856240796,
0.0028176007969857, 0.00882013898948418, 0.00678966015562509,
0.00360774024245839, 0.011149423290729, 0.00859796340323456,
0.00444526066124642, 0.0130344010874029, 0.00709037369666853,
0.00551435348701512, 0.00587377717110946, 0.00169806309901329,
0.00479849401022625), CumDist = c(0.000342016147509882, 0.000533482567207484,
0.00110252888806449, 0.00134289642855657, 0.00160887418339641,
0.00171282723291991, 0.00239579608916071, 0.00521339688614641,
0.0140335358756306, 0.0208231960312557, 0.0244309362737141, 0.0355803595644431,
0.0441783229676777, 0.0486235836289241, 0.0616579847163269, 0.0687483584129955,
0.0742627119000106, 0.08013648907112, 0.0818345521701333, 0.0866330461803596
)), .Names = c("X12CO2_dry", "coords.x1", "coords.x2", "V1",
"CumDist"), row.names = 2:21, class = "data.frame")
thanks, Martin
Man, you beat me to it with a cleaner solution, mra68.
Here's mine using a few loops.
####################
for (j in 1:nrow(DF)) { # loop through all rows of your dataset
  CO2list <- NULL # need to create the variable before storing to it in the loop
  for (i in 1:nrow(DF)) { # loop through all distances in the table
    # check whether the difference in CumDist is <= 100/100000;
    # CumDist[j] is the point with the 100 meter window around it
    if (abs(DF$CumDist[i] - DF$CumDist[j]) <= 0.001) {
      # store the CO2 entries that are within the 100 meter window in a vector
      CO2list <- c(CO2list, DF$X12CO2_dry[i])
    }
  }
  # take the mean of the list and store it in a column named CO2AVG
  DF$CO2AVG[j] <- mean(CO2list)
}
The window that belongs to the i-th row starts at n[i] and ends at m[i]-1. Hence the sum of the CO2-values in the i-th window is CumCO2[m[i]]-CumCO2[n[i]]. (Notice that the indices in CumCO2 are shifted by 1, because of the leading 0.) Dividing this CO2-sum by the window size m[i]-n[i] gives the values meanCO2 for the new column:
n <- sapply(df$CumDist,
            function(x) which.max(df$CumDist >= x - 0.001))
m <- sapply(df$CumDist,
            function(x) which.max(c(df$CumDist, Inf) > x + 0.001))
CumCO2 <- c(0, cumsum(df$X12CO2_dry))
meanCO2 <- (CumCO2[m] - CumCO2[n]) / (m - n)
> n
[1] 1 1 1 2 3 3 5 8 9 10 11 12 13 14 15 16 17 18 19 20
> m
[1] 4 5 7 7 8 8 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> meanCO2
[1] 399.9350 399.9593 399.9835 399.9932 399.9606 399.9606 399.9453 399.7718 399.7718 399.3855 399.3855 399.3855 399.4708 399.4708 399.4708 400.0362
[17] 400.0362 400.0362 399.7557 399.7557
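As a quick cross-check (a sketch, assuming df here is the same data frame that the loop answer above calls DF), the two approaches should agree:
all.equal(DF$CO2AVG, meanCO2) # should return TRUE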
I'm using R to create an occupancy model encounter history. I need to take a list of bird counts for individual leks, separate them by year, then code the count dates into two intervals, either within 10 days of the first count (Interval 1), or after 10 days after the first count (Interval 2). For any year where only 1 count occurred I need to add an entry coded as "U", to indicate that no count occurred during the second interval. Following that I need to subset out only the max count in each year and interval. A sample dataset:
ComplexId Date Males Year category
57 1941-04-15 97 1941 A
57 1942-04-15 67 1942 A
57 1943-04-15 44 1943 A
57 1944-04-15 32 1944 A
57 1946-04-15 21 1946 A
57 1947-04-15 45 1947 A
57 1948-04-15 67 1948 A
57 1989-03-21 25 1989 A
57 1989-03-30 41 1989 A
57 1989-04-13 2 1989 A
57 1991-03-06 35 1991 A
57 1991-04-04 43 1991 A
57 1991-04-11 37 1991 A
57 1991-04-22 25 1991 A
57 1993-03-23 6 1993 A
57 1994-03-06 17 1994 A
57 1994-03-11 10 1994 A
57 1994-04-06 36 1994 A
57 1994-04-15 29 1994 A
57 1994-04-21 27 1994 A
Now here is the code I wrote to accomplish my task, naming the dataframe above "c1" (you'll need to coerce the date column to date, and the category column to character):
c1_Year <- lapply(unique(c1$Year), function(x) c1[c1$Year == x,]) # splits complex counts into a list by year
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][,2] - c1_Year[[i]][1,2]))
} # adds column with the difference between the first survey and subsequent surveys
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- if(length(c1_Year[[i]][,1]) == 1)
    rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
} # adds U values to years with only 1 count, while coercing the "U" into the appropriate interval
for(i in 1:length(c1_Year)){
  c1_Year[[i]]$Interval <- ifelse(c1_Year[[i]][,6] < 10, 1, 2)
} # adds interval code for each survey: 1 = less than ten days after the first count, 2 = more than ten days after the first count
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- ddply(.data = c1_Year[[i]], .(Interval), subset, Males == max(Males))
} # subsets out max count in each interval
The problem arises during the second for loop, which, when options(error=recover) is enabled, returns:
Error in c1_Year[[i]] : subscript out of bounds
No suitable frames for recover()
At that point the code has actually accomplished what it was supposed to: even though the error message is generated, the extra rows with the "U" code are still appended to the data frames. The issue is that I have 750 leks to do this for, so I tried to build the code above into a function; however, when I run the function on any data, the subscript out of bounds error stops the function from running. I could brute-force it and just run the code above for each lek manually, but I was hoping there might be a more elegant solution. What I need to know is why am I getting the subscript out of bounds error, and how can I fix it?
Here's the function I wrote, so that you can see that it doesn't work:
create.OEH <- function(dataset, final_dataframe){
  c1_Year <- lapply(unique(dataset$Year), function(x) dataset[dataset$Year == x,]) # splits complex counts into a list by year
  for(i in 1:length(c1_Year)){
    c1_Year[[i]] <- cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][,2] - c1_Year[[i]][1,2]))
  } # adds column with the difference between the first survey and subsequent surveys
  for(i in 1:length(c1_Year)){
    c1_Year[[i]] <- if(length(c1_Year[[i]][,1]) == 1)
      rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
  } # adds U values to years with only 1 count
  for(i in 1:length(c1_Year)){
    c1_Year[[i]]$Interval <- ifelse(c1_Year[[i]][,6] < 10, 1, 2)
  } # adds interval code for each survey: 1 = less than ten days after the first count, 2 = more than ten days after the first count
  for(i in 1:length(c1_Year)){
    c1_Year[[i]] <- ddply(.data = c1_Year[[i]], .(Interval), subset, Males == max(Males))
  } # subsets out max count for each interval
  df <- rbind.fill(c1_Year) # collapse list into a single dataframe
  final_dataframe <- df[!duplicated(df[,c("Year", "Interval")]),] # remove ties for max count
}
In this bit of code
for(i in 1:length(c1_Year)){
  c1_Year[[i]] <- if(length(c1_Year[[i]][,1]) == 1)
    rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
}
You are assigning NULL when length(c1_Year[[i]][,1]) == 1 is not true, which removes those elements from c1_Year entirely.
You probably want
for(i in 1:length(c1_Year)){
  if (length(c1_Year[[i]][,1]) == 1) {
    c1_Year[[i]] <- rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
  }
}
However, I see you are already using ddply, so you may be able to avoid a lot of your replication.
The ddply(c1, .(Year), ...) splits up c1 into unique years.
c2 <- ddply(c1,
            .(Year),
            function (x) {
              # create 'Interval'
              x$Interval <- ifelse(x$Date - x$Date[1] < 10, 1, 2)
              # extract max males per interval
              o <- ddply(x, .(Interval), subset, Males == max(Males))
              # add the 'U' row if there is no '2' interval
              if (all(o$Interval != 2)) {
                o <- rbind(o,
                           list(o$ComplexId, NA, 0, o$Year, 'U', 2))
              }
              # return the resulting dataframe
              o
            })
I converted your rbind(.., c(...)) to rbind(.., list(...)) to avoid converting everything to character (which is what c() does here, because it cannot handle multiple different types).
Otherwise the code is almost the same as yours.
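A quick illustration of the coercion c() performs compared to list():
c(57, NA, 0, 1941, "U", 2)    # coerced to character: "57" NA "0" "1941" "U" "2"
list(57, NA, 0, 1941, "U", 2) # each element keeps its own type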