Aggregating functions which operate on non-data frame objects in R

I have a simple question. The aggregate() function in R operates on a dataframe based on the conditions specified.
aggregate(my.data.frame, list(desired column), function to be applied) is the default usage.
It is useful for computing simple functions like the mean and median of specific values in a data frame's columns. What I have, though, is a function which doesn't operate on data frames, but I need to aggregate my data frame after applying this function to a specific column. Let me show the dataset:
GPS Dataset
So I need to compute the centroid of the longitude and latitude points for EACH BSSID; that is how I need to aggregate. The functions I found online from various packages compute the centroid for a matrix of values and not a data frame, whereas aggregate() doesn't work on non-data frames.
Many thanks in advance :)

Aggregate works fine on matrices (and not just data frames).
Here's a reproducible example of your problem, using a matrix instead of a data frame:
my_matrix <- matrix(c(100,100,200,200,11,22,33,44,-1,-2,-3,-4),
                    nrow=4, ncol=3,
                    dimnames=list(c(1,2,3,4), c('BSSID','lat','long')))
> my_matrix
BSSID lat long
1 100 11 -1
2 100 22 -2
3 200 33 -3
4 200 44 -4
> aggregate(cbind(lat,long)~BSSID,my_matrix,mean)
BSSID lat long
1 100 16.5 -1.5
2 200 38.5 -3.5
So that would be the mean (or the centroid) of the latitudes and longitudes for each BSSID. The cbind function (column-bind) lets you select multiple variables to be aggregated, similar to an Excel Pivot Table.
If still in doubt, you can always convert matrices to data frames using the as.data.frame() function and convert back to matrices using as.matrix() if needed.
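For example, a minimal sketch of that round trip, reusing the my_matrix object from above (the object names are just illustrative):
my_df <- as.data.frame(my_matrix)                                   # matrix -> data frame
centroids <- aggregate(cbind(lat, long) ~ BSSID, data = my_df, FUN = mean)
centroids_matrix <- as.matrix(centroids)                            # back to a matrix if needed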

I like dplyr for this - the syntax looks nice to me.
my.data.frame %>%
group_by(bssid) %>%
summarise(centroidlon = myfunction(lon, lat)[1],
centroidlat = myfunction(lon, lat)[2])
If myfunction is fast, then this will work, but if it is slow, you probably want to rework it so that you only call the function once per bssid.
Edit to show an alternative method without the %>% operator:
grouped.data.frame = group_by(my.data.frame, bssid)
summarised.data.frame = summarise(grouped.data.frame,
centroidlon = myfunction(lon, lat)[1],
centroidlat = myfunction(lon, lat)[2])
The %>% operator takes the left hand side, and passes it as the first argument to the right hand side. It's useful for chaining your statements together without getting confused by hundreds of nested brackets. It makes things easier to read, in my opinion.
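For illustration only, here is one hypothetical shape myfunction could take, matching the calls above (a plain coordinate mean; a proper geographic centroid may need projection or spherical averaging):
# hypothetical helper: returns c(centroid_lon, centroid_lat)
myfunction <- function(lon, lat) {
  c(mean(lon, na.rm = TRUE), mean(lat, na.rm = TRUE))
}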

Related

Timeseries average based on a defined time interval (bin)

Here is an example of my dataset. I want to calculate a binned average based on time (i.e., ts), in bins of 10 seconds. Could you please provide some hints so that I can carry on?
In my case, I want to average time (ts) and Var over every 10 seconds. For example, I will get one averaged value of Var and ts from 0 to 10 seconds, another averaged value of Var and ts from 11 to 20 seconds, and so on.
df = data.frame(ts = seq(1,100,by=0.5), Var = runif(199,1, 10))
Are there any functions or libraries in R that I can use for this task?
There are many ways to calculate a binned average: with base aggregate or by, with the packages dplyr or data.table, probably with zoo, and surely with other time series packages...
library(dplyr)
df %>%
group_by(interval = round(df$ts/10)*10) %>%
summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
interval Var_mean
<dbl> <dbl>
1 0 4.561653
2 10 6.544980
3 20 6.110336
4 30 4.288523
5 40 5.339249
6 50 6.811147
7 60 6.180795
8 70 4.920476
9 80 5.486937
10 90 5.284871
11 100 5.917074
That's the dplyr approach; note how it and data.table let you name the intermediate variables, which keeps the code clean and legible.
Assuming df in the question, convert to a zoo object and then aggregate.
The second argument of aggregate.zoo is a vector the same length as the time vector giving the new times that each original time is to be mapped to. The third argument is applied to all time series values whose times have been mapped to the same value. This mapping could be done in various ways but here we have chosen to map times (0, 10] to 10, (10, 20] to 20, etc. by using 10 * ceiling(time(z) / 10).
In light of some of the other comments in the answers, let me point out that, in contrast to using a data frame, there is significant simplification here. First, the data has been reduced to one dimension (vs. two in a data frame). Second, it is more conducive to the whole-object approach, whereas with data frames one needs to continually pick apart the object and work on those parts. Third, one now has all the facilities of zoo to manipulate the time series, such as numerous NA-removal schemes, rolling functions, overloaded arithmetic operators, n-way merges, simple access to classic, lattice and ggplot2 graphics, a design which emphasizes consistency with base R (making it easy to learn), extensive documentation including 5 vignettes plus help files with numerous examples, and likely very few bugs given the 14 years of development and widespread use.
library(zoo)
z <- read.zoo(df)
z10 <- aggregate(z, 10 * ceiling(time(z) / 10), mean)
giving:
> z10
10 20 30 40 50 60 70 80
5.629926 6.571754 5.519487 5.641534 5.309415 5.793066 4.890348 5.509859
90 100
4.539044 5.480596
(Note that the data in the question is not reproducible because it used random numbers without set.seed so if you try to repeat the above you won't get an identical answer.)
Now we could plot it, say, using any of these:
plot(z10)
library(lattice)
xyplot(z10)
library(ggplot2)
autoplot(z10)
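As noted above, the question built df from runif() without set.seed, so your numbers will differ; if you want a reproducible run, seed the generator before creating df, e.g.:
set.seed(42)   # any fixed seed makes the example repeatable
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))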
In general, I agree with @smci: the dplyr and data.table approach is the best here. Let me elaborate a bit further.
# the dplyr way
library(dplyr)
df %>%
group_by(interval = ceiling(seq_along(ts)/20)) %>%
summarize(variable_mean = mean(Var))
# the data.table way
library(data.table)
dt <- data.table(df)
dt[,list(Var_mean = mean(Var)),
by = list(interval = ceiling(seq_along(dt$ts)/20))]
I would not go to the traditional time series solutions like ts, zoo or xts here. Their methods are more suited to regular frequencies, like monthly or quarterly data. Apart from ts, they can also handle irregular frequencies and high-frequency data, but many methods, such as the print methods, don't work well or at least do not give you an advantage over data.table or data.frame.
As long as you're just aggregating and grouping, both data.table and dplyr are also likely faster in terms of performance. My guess is that data.table has the edge over dplyr in terms of speed, but you would have to benchmark / profile that, e.g. using microbenchmark. So if you're not working with a classic R time series format anyway, there's no reason to switch to these packages just for aggregating.
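A minimal sketch of such a benchmark, assuming the df and dt objects created above and that the microbenchmark package is installed:
library(dplyr)
library(data.table)
library(microbenchmark)
microbenchmark(
  dplyr = df %>%
    group_by(interval = ceiling(seq_along(ts)/20)) %>%
    summarize(Var_mean = mean(Var)),
  data.table = dt[, list(Var_mean = mean(Var)),
                  by = list(interval = ceiling(seq_along(dt$ts)/20))],
  times = 100
)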

Can dplyr and data.table be used traditionally and inside loops to extract results from data frames?

Suppose I have a data frame of over 700,000 observations and four variables, and I would like to extract some values by first indexing on the district variable (shown here as Dist):
Date X Y Dist
2003/01 2.4 5.5 1
2003/02 2.3 4.0 1
2003/03 1.9 4.4 1
.
.
.
2004/11 3.7 2.9 700
2004/12 2.6 5.9 700
That is, a dataset of Xs and Ys for 700 districts, with each district having a yearly record of Xs and Ys. For each district, some values need to be extracted, so I thought I could use dplyr here instead of traditional loops and conditions; however, I'm new to it and not very used to its syntax, and in spite of passing some seemingly correct commands, I'm not getting the proper results. The resulting data frame should look something like:
X Dist
Some avg. 5
Or even values for multiple districts, arranged in ascending order:
X Dist
Some avg. 4
" 5
" 6
At first, I 'sliced' off the data for the districts and saved it as test, intending to extract the mean and the number of non-NA observations, but the operation produced warnings that I don't understand. For example, for districts 1 to 10:
test <- slice(df, Dist == c(1:10))
This gave a warning that the longer object length is not a multiple of the shorter object length. I could slice each district separately and merge the results row-wise, but that is tedious. I did use a for loop to come up with similar values, but loops are simply incomparable to dplyr's efficiency and speed in extracting valuable insights through one-liners instead of many lines of code and conditions. It speeds everything up, besides making markdown files cleaner and more readable. How can the chained operation %>% be used here to come up with similar results? Can dplyr and data.table be used together with traditional loops and conditions?
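For what it's worth, a minimal sketch of the kind of dplyr pipeline being asked about might look like this (districts 4 to 6 and the mean of X are just illustrative choices):
library(dplyr)
df %>%
  filter(Dist %in% 4:6) %>%                   # keep only the districts of interest
  group_by(Dist) %>%
  summarise(X = mean(X, na.rm = TRUE)) %>%    # one average of X per district
  arrange(Dist)                               # districts in ascending order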

Calculating the distance between points in different data frames

I am trying to find the distance between points in two different data frames given that they have the same value in one of their columns.
I figure the first step is to join or relate the data in the two data frames. For example, there are data frames A and B which both have lat/long information in them and which share the column Name. Note that for a given Name the lat/long information is different in each data frame. That's why I want to calculate the distance between them.
I envision the final function being something like if A$Name=B$Name then use their corresponding lat/long data to calculate the distance between them.
Any thoughts?
Example data:
A <- data.frame(Lat=1:4,Long=1:4,Name=c("a","b","c","d"))
B <- data.frame(Lat=5:8,Long=5:8,Name=c("a","b","c","d"))
Now I want to relate A and B so that I can ask the ultimate question: if A$Name == B$Name, what is the distance between them using their corresponding lat/long data?
I should also note that I will not be able to do a straightforward Euclidean distance because the points occur in water and the path distance between them needs to stay in the water (or be bounded by some area). Any help with that would be appreciated as well.
Without a reproducible example, all I can do is offer you a general solution.
I like data.table and the syntax here will look very simple. Check out the Getting Started vignettes for more on the package.
I'm going to create two data.tables that match your general description first:
library(data.table)
set.seed(1734)
A<-data.table(Name=1:10,x=rnorm(10),key="Name")
B<-data.table(Name=1:10,y=rnorm(10),key="Name")
Now, we want to merge A and B by Name (to merge, we need a key set, which I've conveniently done already), then use the respective x and y coordinates to calculate (Euclidean) distance. To do so is simple:
A[B,distance:=sqrt(x^2+y^2)]
The distance you seek is now stored in the data.table A under the column distance. If you don't want to store the distance, and just want the output, you could do: A[B,sqrt(x^2+y^2)].
To start from scratch if A and B are already stored as data.frames, it's not much more complicated:
setDT(A,key="Name")[setDT(B,key="Name"),distance:=sqrt(x^2+y^2)]
We've used the convenient setDT function to convert A and B (in-line) to a data.table by reference, simultaneously declaring the key to be Name for both*.
*It may not be strictly necessary to set the key of B, but I think it is good practice to do so. Also, the key option of setDT is only currently available in the development version of data.table (1.9.5+); with the CRAN version, use setkey(setDT(A),Name), etc.
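For completeness, a sketch of that CRAN-version variant on the same hypothetical A and B (starting again from the data.frame versions):
setkey(setDT(A), Name)   # convert to data.table by reference, then key by Name
setkey(setDT(B), Name)
A[B, distance := sqrt(x^2 + y^2)]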
For calculating the distance between lat/long points, you can use the distm function from the geosphere package. Within this function you can use several formulas for calculating the distance: distCosine, distHaversine, distVincentySphere and distVincentyEllipsoid. The last one is considered the most accurate (according to the package author).
library(geosphere)
A <- data.frame(Lat=1:4, Long=1:4, Name=c("a","b","c","d"))
B <- data.frame(Lat=5:8, Long=5:8, Name=c("a","b","c","d"))
A$distance <- distVincentyEllipsoid(A[,c('Long','Lat')], B[,c('Long','Lat')])
this gives:
> A
Lat Long Name distance
1 1 1 a 627129.5
2 2 2 b 626801.7
3 3 3 c 626380.6
4 4 4 d 625866.6
Note that you have to include the lat/long columns in the order of first longitude and then latitude.
Although this works perfectly on this simple example, in larger datasets where the names are not in the same order, this will lead to problems. In that case you can use data.table and set the keys so you can match the points and calculate the distance (as @MichaelChirico did in his answer):
library(data.table)
A <- data.table(Lat=1:4, Long=1:4, Name=c("a","b","c","d"), key="Name")
B <- data.table(Lat=8:5, Long=8:5, Name=c("d","c","b","a"), key="Name")
A[B,distance:=distVincentyEllipsoid(A[,.(Long,Lat)], B[,.(Long,Lat)])]
as you can see, this gives the correct (i.e., the same) result as in the previous method:
> A
Lat Long Name distance
1: 1 1 a 627129.5
2: 2 2 b 626801.7
3: 3 3 c 626380.6
4: 4 4 d 625866.6
To see what key="Name" does, compare the following two data.tables (B1 is stored sorted by Name, while B2 keeps the original row order):
B1 <- data.table(Lat=8:5, Long=8:5, Name=c("d","c","b","a"), key="Name")
B2 <- data.table(Lat=8:5, Long=8:5, Name=c("d","c","b","a"))
See also this answer for a more elaborate example.
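As mentioned at the top of this answer, distm can also build the full pairwise distance matrix between the two sets of points, which can be handy when the rows don't line up one-to-one. A minimal sketch using the A and B data.tables defined just above:
library(geosphere)
# 4 x 4 matrix: element [i, j] is the distance from point i of A to point j of B
dist_matrix <- distm(A[, .(Long, Lat)], B[, .(Long, Lat)], fun = distVincentyEllipsoid)
diag(dist_matrix)   # with both tables keyed by Name, the diagonal reproduces the row-wise distances above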

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into a for loop to do this, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
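The question's end goal was to split the data by the current event; once the filled column exists (from this or any of the other answers), that is a one-liner with base split(). A sketch, assuming the result of the within() call above was saved as d2:
split(d2, d2$CURRENT_EVENT)   # one data frame per event, named by event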
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character string.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
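If you only need the gap-filling itself and not the _EVENT suffix stripping, the na.locf() step alone is enough; a minimal sketch on the same dat (Current_Event_full is just an illustrative column name):
library(zoo)
dat$Current_Event_full <- as.character(na.locf(dat$log_message))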

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish data frame (50K rows, 215 columns). The frame has a mix of discrete, continuous and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find rows in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
(as.list() turns each column into a list element, so duplicated() compares whole columns, and the negated logical index keeps only the first occurrence of each distinct column.)
You can do it with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order. (Two genuinely different columns could in principle share the same summary, so treat this as a fast heuristic rather than an exact check.)
Not 100% certain about this, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates (note that t() coerces a mixed-type data frame to a character matrix, so all columns are compared as text):
duplicated(t(testframe))
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually, you would just need to invert the duplicated result in your code and could stick to using subset (which is more readable than bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('\\.x$', '', names(testframe2))
Since this Q&A is a popular Google search result but the answers above are a bit slow for a large matrix, I propose a new version using exponential search and data.table power.
This is a function I implemented in the dataPreparation package.
The function
dataPreparation::which_are_bijection
which_are_in_double(testframe)
which returns 3 and 4, the columns that are duplicated in your example.
Build a data set with the wanted dimensions for the performance tests:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the library rbenchmark, which will repeat each computation 100 times:
library(rbenchmark)
library(dataPreparation)
library(digest)
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that, because of the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of the exponential search. With just a few duplicated columns, one would have much better performance for which_are_bijection and similar performance for the other methods.
