Split dataset by means of levels in a factor - R

I want to split my dataset into two subsets. I am analysing behaviours and the percentage of time spent exhibiting each behaviour. I want one subset containing the behaviours that take up, on average, less than 25% of the time, and another containing the rest.
I am currently using
ZB<- split(ZBehaviour, cut(ZBehaviour$Percentage.of.time, c(0, 25, 100), include.lowest=TRUE))
Unfortunately, because I have multiple observations per behaviour, this splits individual rows rather than behaviours: behaviours whose mean is greater than 25% of time still turn up in the under-25% dataset whenever a specific observation contains only a small instance of that behaviour.
Any help would be greatly appreciated. Thanks
Example of my data is below. The issue is that I find the Grazing behaviour in both subsets, when its mean should place it in the subset containing behaviours that account for over 25% of the mean percentage of time.
Behaviour | Percentage | Observation
Grazing   | 78.5       | 1
Sleeping  | 12.5       | 1
Walking   | 10         | 1
Grazing   | 12.3       | 2
Walking   | 20.7       | 2
Sleeping  | 24         | 2
etc.
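A minimal sketch of one way around this (not from the question): compute the mean percentage per behaviour first, then cut on that behaviour-level mean rather than on the individual row values. ZBehaviour and Percentage.of.time come from the code above; the Behaviour column is taken from the example data, and the intermediate names are assumptions.
# mean percentage of time for each behaviour across all observations
behaviour_means <- tapply(ZBehaviour$Percentage.of.time, ZBehaviour$Behaviour, mean)
# attach each row's behaviour-level mean, then split on that mean instead of the raw value
row_means <- behaviour_means[as.character(ZBehaviour$Behaviour)]
ZB <- split(ZBehaviour, cut(row_means, c(0, 25, 100), include.lowest = TRUE))
Because the cut is applied to each behaviour's mean, every row of a given behaviour lands in the same subset.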

Related

R: imputation of values in a data frame column by distribution of that variable

I have searched Stack Overflow and Google regarding this but have not yet found a fitting answer.
I have a data frame column with ages of individuals.
Out of around 10000 observations, 150 are NAs.
I do not want to impute those with the mean age of the whole column, but rather assign random ages based on the distribution of the ages in my data set, i.e. in this column.
How do I do that? I tried fiddling around with the MICE package but didn't make much progress.
Do you have a solution for me?
Thank you,
corkinabottle
You could simply sample 150 values from your observations:
samplevals <- sample(obs, 150)
You could also stratify your observations across quantiles to increase the chances of sampling your tail values by sampling within each quantile range.
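A hedged sketch of that idea, assuming the ages live in a column called age of a data frame called df (both names are placeholders):
# observed (non-missing) ages and the number of NAs to fill
age_obs   <- df$age[!is.na(df$age)]
n_missing <- sum(is.na(df$age))
# draw the imputed ages from the empirical distribution of the observed ages
set.seed(42)  # for reproducibility
df$age[is.na(df$age)] <- sample(age_obs, n_missing, replace = TRUE)
Sampling with replacement means the imputed values follow the distribution of the ages actually present in the column, which is what the question asks for.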

Find samples from numeric vector that have a predefined mean value

I am using historical yearly rainfall data to devise 'what-if' scenarios of altered rainfall in ecological models. To do that, I am trying to sample actual rainfall values to create samples of rainfall years that meet a certain criterion (such as a sample of rainfall years that is 10% wetter than the historical average).
I have come up with a relatively simple brute-force method, described below, that works OK if I have a single criterion (such as a target mean value):
rainfall_values <- c(270.8, 150.5, 486.2, 442.3, 397.7, 593.4191, 165.608,
                     116.9841, 265.69, 217.934, 358.138, 238.25, 449.842,
                     507.655, 344.38, 188.216, 210.058, 153.162, 232.26,
                     266.02801, 136.918, 230.634, 474.984, 581.156, 674.618,
                     359.16)
# brute force
sample_size  <- 10    # number of years included in each sample
n_replicates <- 1000  # number of total samples calculated
target    <- mean(rainfall_values) * 1.1  # try to find samples that are 10% wetter than the historical mean
tolerance <- 0.01 * target                # how close do we want to get to the target specified above?
# create a large matrix of samples
sampled_DF <- t(replicate(n_replicates, sample(x = rainfall_values, size = sample_size, replace = TRUE)))
# calculate the mean of each sample
Sampled_mean_vals <- apply(sampled_DF, 1, mean)
# keep only the samples that meet the criterion
Sampled_DF_on_target <- sampled_DF[Sampled_mean_vals > (target - tolerance) & Sampled_mean_vals < (target + tolerance), ]
The problem is that I will eventually have multiple criteria to match (not only a target mean, but also a standard deviation, autocorrelation coefficients, etc.). With more complex multivariate targets, this brute-force method becomes really inefficient at finding matches: I essentially have to look over millions of samples, which takes days even when parallelized.
So, my question is: is there any way to implement this search with an optimization algorithm or another non-brute-force approach?
Some approaches to this kind of question are covered in this link. One respondent calls what you refer to as the "brute force" method the "rejection" method.
This link addresses a related question.
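As an illustration only (not taken from the linked answers), a swap-based local search can stand in for the rejection step: start from one random sample and repeatedly swap in a single year whenever that moves the sample mean closer to the target. It reuses rainfall_values, sample_size, target, and tolerance from the code above.
set.seed(1)
current <- sample(rainfall_values, sample_size, replace = TRUE)
for (iter in 1:10000) {
  if (abs(mean(current) - target) < tolerance) break                 # close enough, stop early
  candidate <- current
  candidate[sample(sample_size, 1)] <- sample(rainfall_values, 1)    # swap one year
  if (abs(mean(candidate) - target) < abs(mean(current) - target)) {
    current <- candidate                                             # keep only improvements
  }
}
mean(current)  # within tolerance of target if the loop converged
For several criteria the same loop works with a combined objective, for example a weighted sum of the absolute errors in mean, standard deviation and autocorrelation, although a greedy search like this can stall in local minima (simulated annealing is a common remedy).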

Peculiarity with Scale and Z-Score

I was attempting to scale my data in R after doing some research on the scale function, which appears to compute (x - mean) / std.dev. This was just what I was looking for, so I scaled my data frame. I also want to make sure my assumptions are correct so that I don't draw wrong conclusions.
Assumption
R scales each column independently. Therefore, column 1 will have its own mean and standard deviation. Column 2 will have its own.
Assume I have a dataset of 100,000 rows and I scale 3 columns. If I then remove every row with a Z-score over 3 or under -3 in any of those columns, I could have up to (100,000 * 0.003 * 3 columns) = 900 rows removed!
However, when I went to truncate my data, my 100,000 rows were left with 94,798. This means 5,202 rows were removed.
Does this mean my assumption about scale was wrong, and that it doesn't scale by column?
Update
So I ran a test and did the Z-score conversion on my own. The same number of rows was removed in the end, so I believe scale does work. Now I'm just curious why so much more than 0.3% of the data is removed when everything beyond 3 standard deviations is dropped.
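A quick sketch to check both points on simulated data (the data frame df and its columns are assumptions, not the asker's data):
set.seed(1)
df <- data.frame(a = rnorm(1e5), b = rnorm(1e5), c = rexp(1e5))  # third column is skewed
z <- scale(df)                       # standardises every column with its own mean and sd
round(colMeans(z), 3)                # ~0 in each column
round(apply(z, 2, sd), 3)            # ~1 in each column
keep <- apply(abs(z) <= 3, 1, all)   # keep a row only if |z| <= 3 in all three columns
sum(!keep)                           # roughly 0.27% per normal column adds up across columns,
                                     # and a skewed column pushes the total far higher
So scale does work column-wise; filtering across three columns at once, plus any skew or heavy tails in the real data, is one plausible reason the removed fraction ends up well above 0.3%.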

Moving average with dynamic window

I'm trying to add a new column to my data table that contains the average of some of the following rows. How many rows are selected for the average, however, depends on the time stamps of the rows.
Here is some test data:
DT <- data.table(Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
                 Art = c("a", "b", "a", "b", "a", "a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a"),
                 Demand = c(1:16))
I want to add a column with the mean of all Demand values that occurred in the weeks ("Weekstart") up to three weeks before the respective week (grouped by Art, excluding the current week).
With rollapply from the zoo library, it works like this:
setorder(DT, -Weekstart)
DT[, RollMean := rollapply(Demand, width = list(1:3), partial = TRUE, FUN = mean, align = "left", fill = NA), by = .(Art)]
The problem, however, is that some data is missing. In the example, the data for Art b lack week 4: there is no Demand in week 4. As I want the average of the three prior weeks, not the three prior rows, the average is wrong. Instead, the result for Art b for week 6 should look like this:
DT[Art=="b"&Weekstart==6,RollMean:=6]
(6 instead of 14/3, because only Week 5 and Week 3 count: (8+4)/2)
Here is what I tried so far:
It would be possible to loop through the minima of the weeks of the following rows in order to create a vector that defines, for each row, how wide 'width' should be (the new column 'rollwidth'):
i <- 3
DT[, rollwidth := Weekstart - rollapply(Weekstart, width = list(1:3), partial = TRUE, FUN = min, align = "left", fill = 1), by = .(Art)]
while (max(DT[, Weekstart - rollapply(Weekstart, width = list(1:i), partial = TRUE, FUN = min, align = "left", fill = NA), by = .(Art)][, V1], na.rm = TRUE) > 3) {
  i <- i - 1
  DT[rollwidth > 3, rollwidth := i]
}
But that seems very unprofessional (excuse my poor skills). And, unfortunately, rollapply with a width based on 'rollwidth' doesn't work as intended (it produces warnings because 'rollwidth' is treated as the whole vector of rollwidths in the table):
DT[, RollMean2 := rollapply(Demand, width = list(1:rollwidth), partial = TRUE, FUN = mean, align = "left", fill = NA), by = .(Art)]
What does work is
DT[, RollMean3 := rollapply(Demand, width = rollwidth, partial = TRUE, FUN = mean, align = "left", fill = NA), by = .(Art)]
but then again, the average includes the actual week (not what I want).
Does anybody know how to apply a criterion (i.e. the difference in weeks must be <= 3) to the argument width instead of a number of rows?
Any suggestions are appreciated!
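A minimal sketch of one way to apply the week-based criterion directly (my suggestion, not code from the question): compute each row's mean over the Demand values of the prior three weeks within its Art group.
library(data.table)
DT[, RollMean := sapply(seq_len(.N), function(i) {
  idx <- Weekstart < Weekstart[i] & Weekstart >= Weekstart[i] - 3   # up to 3 weeks back, current week excluded
  if (any(idx)) mean(Demand[idx]) else NA_real_
}), by = Art]
DT[Art == "b" & Weekstart == 6]   # RollMean is 6, i.e. (8 + 4) / 2
Because the window is defined on Weekstart values rather than on row counts, missing weeks (such as week 4 for Art b) simply contribute nothing.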

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price such that the variance within each group is minimized, while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size  <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)
I would like to group the data to minimize the variance of Price in each group while respecting 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if group one includes observations (Obs) 1-2 and group two includes observations 3-9, observation 10 can only go into group two or group three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points and then calculates the sum(SDCM), the sum of squared deviations from the class means, for those break points (the quantity underlying the Jenks approach, as described on Wikipedia).
Then I just iterated through all valid combinations of break points and selected the set of points that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set; a sketch of the idea is below.
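A rough sketch of that break-point enumeration, applied to the example data above (the helper and variable names are mine, and the Size constraint is handled by only allowing breaks where Size changes):
ord   <- order(Size)                         # keep the Size ordering
price <- Price[ord]
size  <- Size[ord]
n_groups   <- 4
candidates <- which(diff(size) > 0)          # breaks allowed only between different Sizes
sdcm <- function(x) sum((x - mean(x))^2)     # squared deviations from the class mean
best_val <- Inf
best_grp <- NULL
for (combo in combn(candidates, n_groups - 1, simplify = FALSE)) {
  grp <- cut(seq_along(price), breaks = c(0, combo, length(price)), labels = FALSE)
  val <- sum(tapply(price, grp, sdcm))       # total within-group SDCM for this split
  if (val < best_val) { best_val <- val; best_grp <- grp }
}
best_grp   # group assignment with the minimum total sum(SDCM)
As noted above, the number of break-point combinations grows quickly, so this only stays practical while the set of candidate breaks is modest.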
