Time averaged data in R

I am given a dataset 10 hours long (36,000 time points; 1 time point = 1 second). For my further analysis I am supposed to use 10-minute averaged data, i.e. one average per 600 time points.
So do I understand this right: I take the average of the first 600 time points and that's my new time point, then the average of the next 600, and so on? That means I end up with a time series of length 60.
How do I do that in R? I thought
xF <- filter(x, 600, sides = 2)
would be the required function, but it only changes the values on the y-axis.

If your dataset is ordered, you could just create a grouping variable and use tapply:
# simulate data
x <- rnorm(36000)
# create a grouping variable
group <- factor(rep(1:(36000/600), each = 600))
# compute the mean of each slice of 600 data points
mean_by_10min <- tapply(x, group, mean)
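As a side note (my own addition, not part of the original answer): since the series length (36,000) is an exact multiple of the window (600), the same 60 averages can also be obtained without a grouping variable by reshaping the vector into a 600-row matrix and taking column means:
# each column of the matrix holds one 10-minute block of 600 values
mean_by_10min_alt <- colMeans(matrix(x, nrow = 600))
length(mean_by_10min_alt)  # 60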

It is hard to help you without having your data.
However, the command will most likely look like
aggregate(iris$Petal.Length, by = list(iris$Petal.Width %/% 0.5), mean)
where you need to replace iris$Petal.Length with your values, iris$Petal.Width with the timestamps, and 0.5 with 600.
Filtering does not aggregate your data as I understand your question; you would end up with just as many time points as before.
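Applied to the question, and assuming a timestamp vector in seconds (an assumption on my part, since the actual data was not posted), that would look roughly like this:
x <- rnorm(36000)   # placeholder for the actual values
t <- 1:36000        # assumed timestamps in seconds
# (t - 1) %/% 600 assigns each second to one of 60 ten-minute bins
means_10min <- aggregate(x, by = list(bin = (t - 1) %/% 600), FUN = mean)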

Related

Create a new column with intervals in R

I am looking for a quick solution for how to create a new column (interval) in a data frame based on the values of the time column.
My dummy data
time <- c(7.1,8.2,9.3,10.4,11.5,12.6,50.9)
df <- data.frame(time)
df
My desired output
Based on the info in the TIME column, I would like to determine the interval. In my example, whatever comes between 0.0 and 10.0 (including 10) equals interval 10. The intervals are grouped by 10, as you see, so whatever comes between 10.0 and 20.0 is assigned to interval 20, and so on and so forth.
Any hints on how to get the new column with intervals would be highly appreciated. Thanks
Moved comments to an answer. You can use integer division to do this: divide the time column by 10, as that is the interval size you are looking for, then add 1 and multiply by 10 to get the result.
df["interval"] <- (df$time %/% 10 + 1) * 10

R - filtering negative and positive spikes in data

I am plotting data that consist of some intervals that are more or less constant, plus spikes originating from the data being a quotient of two parameters. The very large and very small quotients aren't relevant for my purpose, so I have been looking for a way to filter them out. The dataset contains 40k+ values, so I cannot remove the high/low quotients manually.
Is there any function that can trim/filter out the very large/small quotients?
You can use the filter() function from dplyr. This can create a new dataframe without outliers that you can then plot. For example:
no_spikes <- filter(original_df, x > -100 & x < 100)
This would create a new dataframe, no_spikes, that only contains observations where the variable x is between the values -100 and 100.
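If the cut-off values are not known in advance, one option (my assumption, not something stated in the question) is to derive them from the data itself, for example keeping only the values between the 1st and 99th percentiles:
library(dplyr)
# data-driven cut-offs: drop the most extreme 1% on each side
cutoffs <- quantile(original_df$x, probs = c(0.01, 0.99), na.rm = TRUE)
no_spikes <- filter(original_df, x > cutoffs[1] & x < cutoffs[2])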

How to calculate column mean at intervals of row values in R?

I have a dataframe which has 253 rows (locations on a chromosome in Mbps) and 1 column (allele score at each location). I need to produce a dataframe which contains the mean of the allele score for every 0.5 Mbps on the chromosome. Please help with R code that can do this. Thanks.
The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy would be to use cut to create a grouping variable and then use tapply to apply the mean function within each group. Presumably this is in a dataframe, which I will assume is named something specific like my_alleles:
tapply(my_alleles$Allele_score,       # act on this vector
       cut(my_alleles$Location,       # in groups defined by this factor
           breaks = seq(0, max(my_alleles$Location), by = 0.5)),
       FUN = mean)                    # with this function
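Since the data was only posted as a picture, here is a minimal reproducible sketch with simulated data (the name my_alleles and its columns are assumptions, as in the answer above); the breaks are extended slightly past the maximum so the largest Location is not dropped:
set.seed(1)
my_alleles <- data.frame(Location = sort(runif(253, 0, 50)),  # positions in Mbps
                         Allele_score = rnorm(253))
mean_by_halfMb <- tapply(my_alleles$Allele_score,
                         cut(my_alleles$Location,
                             breaks = seq(0, max(my_alleles$Location) + 0.5, by = 0.5)),
                         FUN = mean)
head(mean_by_halfMb)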

Sampling according to distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be loaded into a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: the data are just integers, and many of them are repeated multiple times.
large.vec <- c(1,2,3,4,1,1,8,7,4,1,...,216280)
To spread 500k samples across the distribution I will first create a probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to positions in the original sequence.
position.vec <- prob.vec * 11034432564
The reason I created the position vector is so that I can pick the data point at the specific position after I order the population data.
Now I count the occurrences of each integer value in the population, create a data frame with the integer values and their counts, and also create the interval for each of these values:
integer.values       counts  lw.interval  up.interval
             0  300,000,034            0  300,000,034
             1  169,345,364  300,000,034  469,345,398
             2  450,555,321  469,345,399  919,900,719
           ...
Now, using the position vector, I identify which interval each position falls into and, based on that, take the integer value of that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference,
Calculate quantiles for large data.
I wanted to know if there is a better approach, or if this approach could reasonably, albeit crudely, give me a good sample of the population?
This process does take a considerable amount of time, as the position vector has to go through all possible intervals in the data frame. For that I have made it parallel using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly reducing 11 billion values to 500k.
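For what it's worth, here is a minimal sketch of the counting approach described above, using findInterval() to map each position to its interval. The data frame counts_df and its column names are assumptions for illustration, and only the three example rows from the table are filled in:
counts_df <- data.frame(value = c(0, 1, 2),
                        count = c(300000034, 169345364, 450555321))
n_total  <- sum(counts_df$count)
n_sample <- 500000
# evenly spaced positions across the ordered population (1 .. n_total)
positions <- round(seq(1, n_total, length.out = n_sample))
# upper bound of each value's interval in the ordered population
upper <- cumsum(counts_df$count)
# map each position to the interval it falls into and take that interval's value
idx <- findInterval(positions - 1, upper) + 1
sampled <- counts_df$value[idx]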

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4/5 groups based on Price in a way that minimizes the variance within each group while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group the data so as to minimize the variance of Price in each group while respecting 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if Group One includes observations (Obs) 1-2 and Group Two includes observations 3-9, observation 10 can only enter group two or three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points and then calculates the sum(SDCM) for those break points (using the algorithm underlying the Jenks approach described on Wikipedia).
Then I just iterated through all valid combinations of break points and selected the set of points that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
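A rough sketch of that brute-force idea, applied to the example data from the question (my own illustration, not the poster's original code); break points are only allowed where Size changes, so rows with equal Size stay together:
Price  <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size   <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)
n_groups <- 4
sdcm <- function(x) sum((x - mean(x))^2)   # squared deviations from the class mean
# a break after row i is only allowed if Size changes between rows i and i + 1
valid_breaks <- which(diff(mydata$Size) > 0)
best <- list(score = Inf, breaks = NULL)
for (brk in combn(valid_breaks, n_groups - 1, simplify = FALSE)) {
  groups <- cut(seq_len(nrow(mydata)), breaks = c(0, brk, nrow(mydata)))
  score  <- sum(tapply(mydata$Price, groups, sdcm))
  if (score < best$score) best <- list(score = score, breaks = brk)
}
best$breaks   # rows after which each group ends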
