I simply don't know how to do this, and my coding experience is very limited. I have 90 reaction times per subject, and some of the trial times are too high (because of a lack of motivation or attention). How can I extract a subject's reaction times that are too high (not in comparison to other subjects, but in comparison to that subject's own mean and SD)?
print(d1$(Reaction.time < mean+2*SD))
as you can see I have no clue what I'm doing, but eh, I'm trying
d1 is the dataframe containing the data of one particular subject.
Reaction.time is one column of the dataframe containing (duh) all the reaction times of that subject.
mean (the mean reaction time) is a column that I added via mutate(and so on), and SD is the standard deviation of that subject, which I added via mutate(and so on) as well. SD and mean can be seen in the dataframe. But how can I take out (or only print the others) all the rows containing reaction times that are more than 2 SD above the mean, and also all rows that are more than 2 SD below the mean?
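A minimal sketch of one way to do this with dplyr, assuming d1 already contains the mutate()-added columns mean and SD described above:

library(dplyr)

# Keep only the trials whose reaction time lies within 2 SD of the subject's own mean
d1_clean <- d1 %>%
  filter(Reaction.time > mean - 2 * SD,
         Reaction.time < mean + 2 * SD)

# The excluded trials (more than 2 SD above or below the mean), for inspection
d1_outliers <- d1 %>%
  filter(Reaction.time <= mean - 2 * SD | Reaction.time >= mean + 2 * SD)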
I have a data frame with three columns: subid, test, and day. For each subject, I want to identify which tests happened within a time frame of x days and calculate the max change in test value. Please see the example below. For each subject and a given test, I want to identify which tests happened within 3 days. So if we look at the Day column, the value 1 won't form any group, as the subsequent test was done 6 days later. The values Day = 10, 7, 8, 9 should be identified as a group and the max change among these calculated. Similarly, Day = 12, 11, 10, 9 should be identified as another group and the max change among these calculated. How can I do this using R? Thank you in advance.
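A sketch of one possible approach with dplyr, assuming the window is the 3 days up to and including each test day, and using a made-up example data frame in place of the one referenced above:

library(dplyr)

# Hypothetical data standing in for the example referenced in the question
df <- data.frame(
  subid = 1,
  test  = c(100, 104, 103, 101, 105, 102, 98),
  day   = c(1, 7, 8, 9, 10, 11, 12)
)

# For each row, collect the tests whose day falls within the 3 days up to and
# including that row's day, then take the largest spread among them
result <- df %>%
  group_by(subid) %>%
  mutate(max_change = sapply(day, function(d0) {
    vals <- test[day >= d0 - 3 & day <= d0]
    if (length(vals) < 2) NA_real_ else diff(range(vals))
  })) %>%
  ungroup()

result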
I am using historical yearly rainfall data to devise 'what-if' scenarios of altered rainfall in ecological models. To do that, I am trying to sample actual rainfall values to create samples of rainfall years that meet certain criteria (such as a sample of rainfall years that is 10% wetter than the historical average).
I have come up with a relatively simple brute-force method, described below, that works OK if I have a single criterion (such as a target mean value):
rainfall_values <- c(270.8, 150.5, 486.2, 442.3, 397.7, 593.4191,
                     165.608, 116.9841, 265.69, 217.934, 358.138, 238.25,
                     449.842, 507.655, 344.38, 188.216, 210.058, 153.162,
                     232.26, 266.02801, 136.918, 230.634, 474.984, 581.156,
                     674.618, 359.16)

# brute force
sample_size <- 10     # number of years included in each sample
n_replicates <- 1000  # number of total samples calculated
target <- mean(rainfall_values) * 1.1  # try to find samples that are 10% wetter than the historical mean
tolerance <- 0.01 * target             # how close do we want to get to the target specified above?

# create a large matrix of samples
sampled_DF <- t(replicate(n_replicates, sample(x = rainfall_values, size = sample_size, replace = TRUE)))

# calculate the mean of each sample
Sampled_mean_vals <- apply(sampled_DF, 1, mean)

# keep only the samples that meet the criterion
Sampled_DF_on_target <- sampled_DF[Sampled_mean_vals > (target - tolerance) & Sampled_mean_vals < (target + tolerance), ]
The problem is that I will eventually have multiple criteria to match (not only a target mean, but also a target standard deviation, autocorrelation coefficients, etc.). With more complex multivariate targets, this brute-force method becomes really inefficient at finding matches: I essentially have to look through millions of samples, which takes days even when parallelized.
So, my question is: is there any way to implement this search using an optimization algorithm or some other non-brute-force approach?
Some approaches to this kind of question are covered in this link. One respondent uses the term "rejection" method for what you refer to as the "brute force" method.
This link addresses a related question.
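As an illustration of one non-brute-force alternative, here is a minimal sketch of a simple stochastic local search: starting from one random sample, it repeatedly swaps a single year and keeps the swap only if the sample moves closer to the targets. The objective function and the second target (the historical standard deviation) are assumptions for the example; packages such as GA, or optim(method = "SANN"), offer more principled versions of the same idea.

set.seed(1)

target_mean <- mean(rainfall_values) * 1.1
target_sd   <- sd(rainfall_values)  # illustrative second criterion

# Distance of a candidate sample from the targets, scaled so both criteria matter
objective <- function(s) {
  abs(mean(s) - target_mean) / target_mean + abs(sd(s) - target_sd) / target_sd
}

sample_size <- 10
current <- sample(rainfall_values, sample_size, replace = TRUE)
best_obj <- objective(current)

for (i in 1:5000) {
  candidate <- current
  candidate[sample(sample_size, 1)] <- sample(rainfall_values, 1)  # swap one year
  cand_obj <- objective(candidate)
  if (cand_obj < best_obj) {  # keep the swap only if it improves the fit to the targets
    current  <- candidate
    best_obj <- cand_obj
  }
}

mean(current); sd(current)  # should now be close to the targets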
I want to split my dataset into two subsets. I am analysing behaviours and the percentage of time spent exhibiting each behaviour. I want one subset containing the behaviours that take up a mean of less than 25% of the time, and another containing the rest.
I am currently using
ZB<- split(ZBehaviour, cut(ZBehaviour$Percentage.of.time, c(0, 25, 100), include.lowest=TRUE))
Unfortunately, because I have multiple observations per behaviour, this splits individual rows rather than behaviours: behaviours that on average take up more than 25% of the time still end up in the less-than-25% dataset whenever a specific observation contains only a small instance of that behaviour.
Any help would be greatly appreciated. Thanks
Here is an example of my data. The issue is that I find the Grazing behaviour in both datasets, when its mean should place it in the dataset containing behaviours that account for over 25% of the time on average.
Behaviour | Percentage | Observation
Grazing   | 78.5       | 1
Sleeping  | 12.5       | 1
Walking   | 10         | 1
Grazing   | 12.3       | 2
Walking   | 20.7       | 2
Sleeping  | 24         | 2
etc
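A sketch of one way to split on the behaviour-level mean instead of the raw rows, assuming ZBehaviour has a Behaviour column as in the example table above, plus the Percentage.of.time column from the split() call:

library(dplyr)

# Mean percentage of time per behaviour, across all observations
behaviour_means <- ZBehaviour %>%
  group_by(Behaviour) %>%
  summarise(mean_pct = mean(Percentage.of.time))

# Behaviours whose mean is below 25% of the time
low_behaviours <- behaviour_means$Behaviour[behaviour_means$mean_pct < 25]

# Split the full dataset by behaviour membership, not by individual rows
ZB_low  <- ZBehaviour %>% filter(Behaviour %in% low_behaviours)
ZB_high <- ZBehaviour %>% filter(!Behaviour %in% low_behaviours)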
I am a beginner with R and I have a question about simple functions such as the mean or standard deviation for a big data set. My data shows monthly returns for hedge funds over the past 30 years and has 1550 columns, one per hedge fund. I saw that I can calculate the mean of a specific column with the mean function by referring to it with the name of my dataset, a $, and the column name. However, I was wondering how I can get the mean for every hedge fund (i.e. every column) without addressing every single column individually. Thanks in advance for your help!
We can use colMeans
colMeans(df1, na.rm=TRUE)
where 'df1' is the dataset.
Or another option would be to loop through the columns and calculate the mean of each:
vapply(df1, mean, numeric(1), na.rm = TRUE)
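For a self-contained illustration, the toy df1 below is an assumption; the real data would have 1550 numeric return columns:

df1 <- data.frame(fund_a = c(0.01, 0.02, NA), fund_b = c(-0.005, 0.015, 0.03))
colMeans(df1, na.rm = TRUE)
vapply(df1, mean, numeric(1), na.rm = TRUE)  # same result, computed column by column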
I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price in a way that minimizes the variance within each group while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)
I would like to group the data to minimize the variance of Price in each group while respecting: 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if Group One includes observations (Obs) 1-2 and Group Two includes observations 3-9, then observation 10 can only go into Group Two or Group Three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can chain multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
It is not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points and then calculates the sum of squared deviations from the class means, sum(SDCM), for those break points (using the algorithm underlying the Jenks approach described on Wikipedia).
I then iterated through all valid combinations of break points and selected the set that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
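A minimal sketch of that exhaustive-search idea, applied to the small Price/Size example from the question. Restricting break points to positions where Size changes (so equal Sizes stay together) and choosing 3 internal breaks (4 groups) are assumptions made for the illustration:

Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size  <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)

# Candidate break points: positions after which Size changes value
breaks_allowed <- which(diff(Size) != 0)

# Sum of squared deviations from the class means (SDCM) for a given set of breaks
sdcm <- function(breaks) {
  groups <- cut(seq_along(Price), c(0, breaks, length(Price)))
  sum(tapply(Price, groups, function(p) sum((p - mean(p))^2)))
}

# Try every way of choosing 3 internal breaks (i.e. 4 groups) and keep the best
combos <- combn(breaks_allowed, 3, simplify = FALSE)
scores <- vapply(combos, sdcm, numeric(1))
best   <- combos[[which.min(scores)]]

best
split(Price, cut(seq_along(Price), c(0, best, length(Price))))  # resulting groups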