Here is an example of my dataset. I want to calculate bin average based on time (i.e., ts) every 10 seconds. Could you please provide some hints so that I can carry on?
In my case, I want to average time (ts) and Var in every 10 seconds. For example, I will get an averaged value of Var and ts from 0 to 10 seconds; I will get another averaged value of Var and ts from 11 to 20 seconds, etc.
df = data.frame(ts = seq(1,100,by=0.5), Var = runif(199,1, 10))
Are there any functions or libraries in R that I can use for this task?
There are many ways to calculate a binned average: with base aggregate or by, with the packages dplyr or data.table, probably with zoo, and surely with other time series packages...
library(dplyr)
df %>%
group_by(interval = round(ts/10)*10) %>%
summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
interval Var_mean
<dbl> <dbl>
1 0 4.561653
2 10 6.544980
3 20 6.110336
4 30 4.288523
5 40 5.339249
6 50 6.811147
7 60 6.180795
8 70 4.920476
9 80 5.486937
10 90 5.284871
11 100 5.917074
That's the dplyr approach. Note how both it and data.table let you name the intermediate variables, which keeps the code clean and legible.
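For comparison, here is a minimal base R sketch of the same idea using cut and aggregate. It assumes the bins (0, 10], (10, 20], etc. rather than the rounded intervals used above, and it adds a seed (an arbitrary choice, not part of the question) so the result is repeatable. It averages both ts and Var, as the question asks.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))

# Assign each row to a 10-second bin: (0,10], (10,20], ..., (90,100]
df$bin <- cut(df$ts, breaks = seq(0, 100, by = 10))

# Average both ts and Var within each bin
binned <- aggregate(cbind(ts, Var) ~ bin, data = df, FUN = mean)
binned
```

Changing the breaks argument is enough to move to a different bin width or to left-closed intervals (see the right argument of cut).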
Assuming df in the question, convert to a zoo object and then aggregate.
The second argument of aggregate.zoo is a vector the same length as the time vector giving the new times that each original time is to be mapped to. The third argument is applied to all time series values whose times have been mapped to the same value. This mapping could be done in various ways but here we have chosen to map times (0, 10] to 10, (10, 20] to 20, etc. by using 10 * ceiling(time(z) / 10).
In light of some of the other comments in the answers, let me point out that, in contrast to using a data frame, there is significant simplification here. Firstly, the data has been reduced to one dimension (vs. 2 in a data.frame). Secondly, it is more conducive to the whole-object approach, whereas with data frames one needs to continually pick apart the object and work on those parts. Thirdly, one now has all the facilities of zoo to manipulate the time series: numerous NA removal schemes, rolling functions, overloaded arithmetic operators, n-way merges, simple access to classic, lattice and ggplot2 graphics, a design which emphasizes consistency with base R (making it easy to learn), and extensive documentation including 5 vignettes plus help files with numerous examples. There are also likely very few bugs, given the 14 years of development and widespread use.
library(zoo)
z <- read.zoo(df)
z10 <- aggregate(z, 10 * ceiling(time(z) / 10), mean)
giving:
> z10
10 20 30 40 50 60 70 80
5.629926 6.571754 5.519487 5.641534 5.309415 5.793066 4.890348 5.509859
90 100
4.539044 5.480596
(Note that the data in the question is not reproducible because it used random numbers without set.seed so if you try to repeat the above you won't get an identical answer.)
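As a small illustration of the reproducibility point, fixing the seed before generating the data makes the random draws repeatable. The seed value 123 is an arbitrary choice and not part of the question.

```r
# Generating the data twice with the same seed gives identical data frames
set.seed(123)
df1 <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))

set.seed(123)
df2 <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))

identical(df1, df2)
```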
Now we could plot it, say, using any of these:
plot(z10)
library(lattice)
xyplot(z10)
library(ggplot2)
autoplot(z10)
In general, I agree with @smci: the dplyr and data.table approach is the best here. Let me elaborate a bit further.
# the dplyr way
library(dplyr)
df %>%
group_by(interval = ceiling(seq_along(ts)/20)) %>%
summarize(variable_mean = mean(Var))
# the data.table way
library(data.table)
dt <- data.table(df)
dt[,list(Var_mean = mean(Var)),
by = list(interval = ceiling(seq_along(ts)/20))]
I would not go to the traditional time series solutions like ts, zoo or xts here. Their methods are more suitable for regular frequencies, such as monthly or quarterly data. Apart from ts, they can handle irregular frequencies and also high-frequency data, but many methods, such as the print methods, don't work well or at least do not give you an advantage over data.table or data.frame.
As long as you're just aggregating and grouping, both data.table and dplyr are also likely to be faster. I would guess data.table has the edge over dplyr in terms of speed, but you would have to benchmark / profile that, e.g. using microbenchmark. So if you're not working with a classic R time series format anyway, there's no reason to switch to one just for aggregating.
This is a follow up to previous question. My question was not fully formulated and therefore not fully answered in my last post. Forgive me, I'm new to using stack overflow.
My professor has assigned a problem set, and we are required to use dplyr and other tidyverse packages. I'm very aware that most (if not all) the tasks that I'm trying to execute are possible in base r, but that's not in agreement with my instructions.
First we are asked to generate a tibble of 1000 random samples from a uniform distribution:
2a. Create a new tibble called uniformDf containing a variable called unifSamples that contains 10000 random samples from a uniform distribution. You should use the runif() function to create the uniform samples. {r 2a}
uniformDf <- tibble(unifSamples = runif(1000))
This goes well.
Then we are asked to loop thru this tibble 1000 times, each time choosing 20 random samples and computing the mean and saving it to a tibble:
2c. Now let's loop through 1000 times, sampling 20 values from a uniform distribution and computing the mean of the sample, saving this mean to a variable called sampMean within a tibble called uniformSampleMeans. {r 2c}
unif_sample_size = 20 # sample size
n_samples = 1000 # number of samples
# set up a data frame to contain the results
uniformSampleMeans <- tibble(sampMean=rep(NA,n_samples))
# loop through all samples. for each one, take a new random sample,
# compute the mean, and store it in the data frame
for (i in 1:n_samples){
uniformSampleMeans$sampMean[i] <- uniformDf %>%
sample_n(unif_sample_size) %>%
summarize(sampMean = mean(sampMean))
}
This all runs well, I believe, until I look at my uniformSampleMeans tibble, which looks like this:
1 0.471271611726843
2 0.471271611726843
3 0.471271611726843
4 0.471271611726843
5 0.471271611726843
6 0.471271611726843
7 0.471271611726843
...
1000 0.471271611726843
All the values are identical! Does anyone have any insight as to why my output is like this? I'd be less concerned if they varied by +/- 0.000x values seeing as how this is from a distribution that ranges from 0 to 1 but the values are all identical even out to the 15th decimal place! Any help is much appreciated!
The following selects unif_sample_size random rows and gives their mean:
library(dplyr)
uniformDf %>% sample_n(unif_sample_size) %>% pull(unifSamples) %>% mean
#[1] 0.5563638
If you want to do this n times, use replicate:
n <- 10
replicate(n, uniformDf %>%
sample_n(unif_sample_size) %>%
pull(unifSamples) %>% mean)
#[1] 0.5070833 0.5259541 0.5617969 0.4695862 0.5030998 0.5745950 0.4688153 0.4914363 0.4449804 0.5202964
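If the assignment requires keeping the for loop, the original code can also be repaired directly. In the question, summarize() averages sampMean, which is not a column of uniformDf (the column is unifSamples), so R presumably picked up some other object of that name, which would explain the constant result. The sketch below is one possible repair, not the only one; the seed and the temporary column name m are assumptions added for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only for reproducibility
uniformDf <- tibble(unifSamples = runif(1000))

unif_sample_size <- 20   # sample size
n_samples <- 1000        # number of samples

# numeric NA so the column has the right type from the start
uniformSampleMeans <- tibble(sampMean = rep(NA_real_, n_samples))

for (i in seq_len(n_samples)) {
  uniformSampleMeans$sampMean[i] <- uniformDf %>%
    sample_n(unif_sample_size) %>%
    summarize(m = mean(unifSamples)) %>%  # average the column that exists
    pull(m)                               # extract a plain number, not a tibble
}
```

The pull() step matters because summarize() returns a one-row tibble, while each element of sampMean should be a single number.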
I am working with a dataframe which contains text data that has been categorised and coded. Each numerical value from 1 to 12 represents a type of word.
I want to count the frequency of occurrence of each number (1 to 12) over 6 columns (pre1 to pre6) so I know how many types of words have been used. Could anyone please advise on how to do this?
My df is structured as such:
Would something like that work for you?
library(dplyr)
df <- data.frame(pre1 = c(sample(1:12, 10)),
pre2 = c(sample(1:12, 10)),
pre3 = c(sample(1:12, 10)),
pre4 = c(sample(1:12, 10)),
pre5 = c(sample(1:12, 10)),
pre6 = c(sample(1:12, 10)))
count <- count(df, pre1, pre2, pre3, pre4, pre5, pre6)
One solution is this:
library(tidyverse)
mtcars %>%
select(cyl, am, gear) %>% # select the columns of interest
gather(column, number) %>% # reshape
count(column, number) # get counts of numbers for each column
# # A tibble: 8 x 3
# column number n
# <chr> <dbl> <int>
# 1 am 0 19
# 2 am 1 13
# 3 cyl 4 11
# 4 cyl 6 7
# 5 cyl 8 14
# 6 gear 3 15
# 7 gear 4 12
# 8 gear 5 5
In your case, column will get the values pre1, pre2, etc., number will get the values 1 to 12, and n will be the count of a specific number in a specific column.
It is not entirely clear from the question whether you want frequency tables for all of these columns together or for each column separately. In possible further questions you should also make clear whether those numbers are coded as numerics, as characters or as factors (the result of str(pCat) is a good way to do that). For this particular question, it fortunately does not matter.
The answers I have already given in the comments are
table(unlist(pCat[,4:9]))
and
table(pCat$pre3)
As an extension of the latter, I shall also point to the comment by ANG, which boils down to
lapply(pCat[,4:9], table)
These are straightforward solutions with base R without any further unnecessary packages. The answers by JonGrub and AntoniosK are based on the tidyverse. There is no obvious need to import dplyr or tidyverse for this problem, but I guess the authors load those packages anyway whenever they use R, so it does not really impose any cost on them. Other great packages to base good answers on are data.table and sqldf. Those are good packages, and many people use them to do a lot of things that could be done in base R. The packages promise to be clearer or faster, or to reuse knowledge you might already have. Nothing is wrong with that. However, I take your question as an indication that you are still learning R, and I would advise you to learn R first, before you become distracted by learning special packages and DSLs.
People have been using base R for decades and they will continue to do so. Learning base R will not distract you from R, and the knowledge will continue to be worthwhile for decades. Whether the same can be said for the tidyverse or data.table, time will tell (although sqldf is probably also a solid investment in the future, maybe more so than R).
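To make the base R suggestions concrete, here is a small sketch on made-up data. The data frame df and its contents are hypothetical stand-ins for the six pre columns of pCat. Wrapping the values in factor(..., levels = 1:12) is an addition of mine so that codes which never occur still show up with a count of zero.

```r
set.seed(1)  # assumption: seed and data are invented for illustration
df <- as.data.frame(replicate(6, sample(1:12, 10, replace = TRUE)))
names(df) <- paste0("pre", 1:6)

# Frequencies of each code across all six columns together
overall <- table(factor(unlist(df), levels = 1:12))

# Frequencies of each code per column: a 12 x 6 matrix, one column per variable
per_column <- sapply(df, function(x) table(factor(x, levels = 1:12)))

overall
per_column
```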
I have a simple question. The aggregate() function in R operates on a dataframe based on the conditions specified.
aggregate(my.data.frame, list(desired column), function to be applied) is the default usage.
It is useful to compute simple functions like mean and median of a dataframe's column specific values. What I have, though, is a function which doesn't operate on dataframes, but I need to aggregate my dataframe after performing this function on a specific column. Let me show the dataset:
GPS Dataset
So I need to compute the centroid for the longitude and latitude points for EACH BSSID, I need to aggregate it that way. The functions I found online from various packages compute the centroid for a matrix of values and not a dataframe, whereas aggregate() doesn't work on non-dataframes.
Many thanks in advance :)
Aggregate works fine on matrices (and not just data frames).
Here's a reproducible example of your problem, using a matrix instead of a data frame:
my_matrix <- matrix(c(100,100,200,200,11,22,33,44,-1,-2,-3,-4),
nrow=4,ncol=3,
dimnames=list(c(1,2,3,4),c('BSSID','lat','long')))
> my_matrix
BSSID lat long
1 100 11 -1
2 100 22 -2
3 200 33 -3
4 200 44 -4
> aggregate(cbind(lat,long)~BSSID,my_matrix,mean)
BSSID lat long
1 100 16.5 -1.5
2 200 38.5 -3.5
So that would be the mean (or the centroid) of the latitudes and longitudes for each BSSID. The cbind function (column-bind) lets you select multiple variables to be aggregated, similar to an Excel Pivot Table.
If still in doubt, you can always convert matrices to data-frames by using the as.data.frame() function and revert back to matrices using as.matrix() if needed.
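A sketch of that conversion route, in case the formula interface refuses a matrix in your R version (recent versions may insist on a data frame for the data argument). The values are the same toy data as above.

```r
my_matrix <- matrix(c(100, 100, 200, 200,
                      11, 22, 33, 44,
                      -1, -2, -3, -4),
                    nrow = 4, ncol = 3,
                    dimnames = list(NULL, c("BSSID", "lat", "long")))

# Convert to a data frame first, then aggregate per BSSID
my_df <- as.data.frame(my_matrix)
centroids <- aggregate(cbind(lat, long) ~ BSSID, data = my_df, FUN = mean)
centroids

# Convert back to a matrix if the rest of the code expects one
centroid_matrix <- as.matrix(centroids)
```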
I like dplyr for this - the syntax looks nice to me.
my.data.frame %>%
group_by(bssid) %>%
summarise(centroidlon = myfunction(lon, lat)[1],
centroidlat = myfunction(lon, lat)[2])
If myfunction is fast, then this will work, but if it is slow, you probably want to rework it so that you only call the function once per bssid.
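One way to call the function only once per bssid is to split the data by group and bind the results back together. This is a base R sketch under assumptions: myfunction here is a hypothetical stand-in that returns c(lon, lat), and the data are invented.

```r
# Hypothetical centroid function: returns c(centroid lon, centroid lat)
myfunction <- function(lon, lat) c(mean(lon), mean(lat))

my.data.frame <- data.frame(bssid = c("a", "a", "b", "b"),
                            lon = c(-1, -2, -3, -4),
                            lat = c(11, 22, 33, 44))

per_group <- lapply(split(my.data.frame, my.data.frame$bssid), function(g) {
  cen <- myfunction(g$lon, g$lat)  # evaluated once per bssid
  data.frame(bssid = g$bssid[1], centroidlon = cen[1], centroidlat = cen[2])
})

centroids <- do.call(rbind, per_group)
centroids
```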
Edit to show alternative method without %>% operator
grouped.data.frame = group_by(my.data.frame, bssid)
summarised.data.frame = summarise(grouped.data.frame,
centroidlon = myfunction(lon, lat)[1],
centroidlat = myfunction(lon, lat)[2])
The %>% operator takes the left hand side, and passes it as the first argument to the right hand side. It's useful for chaining your statements together without getting confused by hundreds of nested brackets. It makes things easier to read, in my opinion.
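A tiny illustration of that equivalence, using a made-up vector: the nested call and the piped chain compute the same thing, but the piped version reads left to right.

```r
library(dplyr)  # provides %>%

x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: each result becomes the first argument of the next call
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```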
This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: 1 of categorical data names, and the second a count column (count for each of the categories). With a small dataset, I would use 'reshape' and the function 'untable' to make 1 column and do analysis that way. The question is, how to handle this with a large data set?
In this case, my data is humungous and that just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
A 5
B 7
C 1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
The desired input rather than output would be for R to understand that the data would be the same as follows,
A
A
A
A
A
B
B
B
B
B
B
B
C
In reasonable size data, I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data, trying to understand the correlation between this new data and other pieces of data. Need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
# Cat Count
#1 A 5
#2 B 7
#3 C 1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
Cat Count
1 XYHVQNTDRS 5154
2 LSYGMYZDMO 4724
3 XDZYCNKXLV 8691
4 TVKRAVAFXP 2429
5 JLAZLYXQZQ 5704
6 IJKUBTREGN 4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
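As a rough base R sketch of working with the counts directly, the example below uses xtabs and weighted.mean. The value column is a hypothetical numeric measurement attached to each category, added only so there is something to average.

```r
dat <- data.frame(Cat = c("A", "B", "C"),
                  Count = c(5, 7, 1),
                  value = c(1, 2, 3))  # hypothetical measurement per category

# Frequency table built from the counts, without expanding the data
tab <- xtabs(Count ~ Cat, data = dat)

# Mean of value, weighting each category by its count
wmean <- weighted.mean(dat$value, w = dat$Count)

# Same result as expanding first, but without the large intermediate vector
expanded_mean <- mean(rep(dat$value, times = dat$Count))

all.equal(wmean, expanded_mean)
```

Many modelling functions, lm among them, also accept a weights argument, though whether frequency weights give the inference you want depends on the model.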
The existing answers all expand the pre-binned dataset into a full distribution and then use R's histogram function, which is memory-inefficient and will not scale to very large datasets like the one the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function, which takes arguments for breaks and counts to create a histogram object in R without massively expanding the dataset.
For example, if the data set has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far expand that into a list of 13 elements first and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input list without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find rows in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
You can do with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
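The practical difference between the two approaches can be sketched on small data. Comparing whole columns with as.list only flags exact copies, whereas comparing summaries also flags a column that holds the same values in a different order. The data frame x below is invented for illustration.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# Exact comparison: drops height2 and gender2
deduped <- testframe[!duplicated(as.list(testframe))]
names(deduped)

# Same values, different order: not an exact duplicate column
x <- data.frame(a = 1:4, b = 4:1)
exact_dups <- duplicated(as.list(x))
summary_dups <- duplicated(lapply(x, summary))  # order is ignored here
```

So the summary version is only safe when columns with the same distribution really should be treated as duplicates.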
I'm not 100% sure, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually you just would need to invert the duplicated-result in your code and could stick to using subset (which is more readable compared to bracket notation imho)
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('\\.x$','',names(testframe2))
Since this Q&A is a popular Google search result, but the answers are a bit slow for a large matrix, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
The function
dataPreparation::which_are_bijection
which_are_in_double(testframe)
which returns 3 and 5, the columns that are duplicated in your example.
Build a data set with wanted dimensions for performance tests
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times.
library(rbenchmark)
library(digest)
library(dataPreparation)
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that, because of the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, one would get much better performance from which_are_in_double and similar performance from the other methods.