Group by and conditionally count - r

I am still learning data management in R. I know I am really close, but I can't get the syntax quite right. I have looked at
count a variable by using a condition in R
and
Conditional count and group by in R
but can't quite translate them to my problem. I am trying to get, for each ST, a count of the rows where dist.km equals 0. Eventually I will want to add columns with counts of various distance ranges, but I should be able to work that out once this piece is solved. The final table should include every state and its count of zeros. Here is a 20 row sample.
structure(list(ST = structure(c(12L, 15L, 13L, 10L, 15L, 16L,
11L, 12L, 8L, 14L, 10L, 14L, 6L, 11L, 5L, 5L, 15L, 1L, 6L, 4L
), .Label = c("CT", "DE", "FL", "GA", "MA", "MD", "ME", "NC",
"NH", "NJ", "NY", "PA", "RI", "SC", "VA", "VT", "WV"), class = "factor"),
Rfips = c(42107L, 51760L, 44001L, 34001L, 51061L, 50023L,
36029L, 42101L, 37019L, 45079L, 34029L, 45055L, 24003L, 36027L,
25009L, 25009L, 51760L, 9003L, 24027L, 1111L), zip = c(17972L,
23226L, 2806L, 8330L, 20118L, 5681L, 14072L, 19115L, 28451L,
29206L, 8741L, 29020L, 20776L, 12545L, 1922L, 1938L, 23226L,
6089L, 21042L, 36278L), Year = c(2010L, 2005L, 2010L, 2008L,
2007L, 2006L, 2005L, 2008L, 2009L, 2008L, 2010L, 2006L, 2007L,
2008L, 2011L, 2011L, 2008L, 2005L, 2008L, 2009L), dist.km = c(0,
42.4689368078209, 28.1123394088972, 36.8547005648639, 0,
49.7276501081775, 0, 30.1937156926235, 0, 0, 31.5643658415831,
0, 0, 0, 0, 0, 138.854136893762, 0, 79.4320981205195, 47.1692144550079
)), .Names = c("ST", "Rfips", "zip", "Year", "dist.km"), row.names = c(132931L,
105670L, 123332L, 21361L, 51576L, 3520L, 47367L, 99962L, 18289L,
126153L, 19321L, 83224L, 6041L, 46117L, 49294L, 48951L, 109350L,
64465L, 80164L, 22687L), class = "data.frame")
Here are a couple chunks of code I have tried.
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
state= aggregate(dist.km ~ ST, function(x) sum(dist.km==0, data=DDcomplete))
state = (DDcomplete[DDcomplete$dist.km==0,], .(ST), function(x) nrow(x))

If you want to add it as a column you can do:
DDcomplete %>% group_by(ST) %>% mutate(count = sum(dist.km == 0))
Or if you just want the counts per state:
DDcomplete %>% group_by(ST) %>% summarise(count = sum(dist.km == 0))
Actually, you were very close to the solution. Your code
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
is almost correct. You can remove the DDcomplete$ from within the call to sum because within dplyr chains, you can access variables directly.
Also note that by using summarise, you will condense your data frame to 1 row per group with only the grouping column(s) and whatever you computed inside the summarise. If you just want to add a column with the counts, you can use mutate as I did in my answer.
If you're only interested in positive counts, you could also use dplyr's count function together with filter to first subset the data:
filter(DDcomplete, dist.km == 0) %>% count(ST)
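Since you mention eventually wanting columns for several distance ranges, one way to extend this (a sketch; the break points are illustrative, not taken from your data) is to bin dist.km with cut() and spread the counts into columns with tidyr:
library(dplyr)
library(tidyr)
DDcomplete %>%
  mutate(range = cut(dist.km, breaks = c(-Inf, 0, 50, 100, Inf),
                     labels = c("0 km", "0-50 km", "50-100 km", ">100 km"))) %>%
  count(ST, range) %>%
  pivot_wider(names_from = range, values_from = n, values_fill = 0)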

I hope I'm not missing something, but it sounds like you just want table after doing some subsetting:
table(df[df$dist.km == 0, "ST"])
#
# CT DE FL GA MA MD ME NC NH NJ NY PA RI SC VA VT WV
#  1  0  0  0  2  1  0  1  0  0  2  1  0  2  1  0  0
Other approaches might be:
## dplyr, since you seem to be using it
library(dplyr)
df %>%
filter(dist.km == 0) %>%
group_by(ST) %>%
summarise(n())
## aggregate, since you tried that too
aggregate(dist.km ~ ST, df, function(x) sum(x == 0))
## data.table
library(data.table)
as.data.table(df)[dist.km == 0, .N, by = ST]
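Note that the filter-then-count approaches above drop states that have no zero-distance rows at all. If, as in the question, the final table should list every state even when its count is 0, one option with a recent dplyr (a sketch) is to keep empty factor levels via .drop = FALSE:
DDcomplete %>%
  group_by(ST, .drop = FALSE) %>%
  summarise(zero = sum(dist.km == 0, na.rm = TRUE))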

Related

Linking Shiny Reactive inputs and input updates

I have a dataset of baby names that are ranked by popularity for each year.
What I currently have: a simple shiny app that filters on year based on a slider, plus a select input that identifies which year's rank column to use (and applies a colour highlight to it). In reality this will be two datasets, one for names marked M and one for names marked F, but I've kept the example simple here.
What I would like to do: make the select input react to the values of the slider, so that the choice of which column to sort and highlight stays consistent with the selected year range.
The current approach works in its simple form, but the year selector obviously throws an error if its value no longer exists in the range selected on the slider.
I've dug around and tried a few approaches, but I just haven't been able to get the reactivity portion to work. I'm sure I'm missing something elementary, but I'm hitting a wall. Thank you for any input.
Example:
library(shiny)
library(tidyverse)
library(DT)
#Fake Data
dat <- structure(list(Name = c("Bill", "Sean", "Kirby", "Philbert",
"Bob", "Lucius", "Fry", "Tyron", "Lionel", "Alister", "Newt",
"Craig", "A-Aron", "Bill", "Sean", "Kirby", "Philbert", "Bob",
"Lucius", "Fry", "Tyron", "Lionel", "Alister", "Newt", "Craig",
"A-Aron", "Bill", "Sean", "Kirby", "Philbert", "Bob", "Lucius",
"Fry", "Tyron", "Lionel", "Alister", "Newt", "Craig", "A-Aron"
), rank = c(8L, 1L, 2L, 3L, 4L, 6L, 5L, 9L, 7L, 25L, 10L, 35L,
99L, 4L, 1L, 3L, 2L, 5L, 6L, 7L, 11L, 5L, 12L, 8L, 9L, 10L, 4L,
2L, 3L, 10L, 8L, 11L, 5L, 6L, 12L, 7L, 13L, 9L, 1L), year = c(2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L,
2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L)), class = "data.frame", row.names = c(NA, -39L))
#Get years
years <- unique(dat$year)
ui <- fluidPage(
  titlePanel("Top Ten Male Baby Names"),
  sliderInput("range",
              label = "Choose year range",
              min = min(as.numeric(years)),
              max = max(as.numeric(years)),
              sep = "",
              value = c(max(as.numeric(years)) - 1, max(as.numeric(years)))),
  selectInput("year",
              label = "Choose year for rank",
              choices = as.numeric(years),
              selected = max(as.numeric(years))),
  mainPanel(
    dataTableOutput("DataTable")
  )
)
server <- function(input, output) {
  output$DataTable <- renderDataTable({
    dat1 <- dat %>%
      filter(year >= input$range[1] & year <= input$range[2]) %>%
      pivot_wider(id_cols = Name,
                  values_from = rank,
                  names_from = year) %>%
      filter(.[colnames(.) == as.character(input$year)] < 11) %>%
      arrange(.[colnames(.) == as.character(input$year)])
    datatable(dat1,
              options = list(ordering = FALSE,
                             lengthChange = FALSE,
                             pageLength = -1)) %>%
      formatStyle(input$year,
                  backgroundColor = "lightgreen")
  })
}
shinyApp(ui, server)
You could set up an observeEvent to watch for changes to the sliderInput; then, if the selected year no longer falls within the slider range, update the selection.
Note: you need to add the session parameter to the server function.
Also, since output$DataTable is filtered by both the range and the year, I've added a validate statement in case the user manually chooses a year that is not in the current range.
server <- function(input, output, session) {
  # Observe changes to the slider input
  observeEvent(input$range, {
    sel <- as.numeric(input$year)
    # update the selection if the currently selected year falls outside the slider range
    if (sel < input$range[1] || sel > input$range[2]) {
      updateSelectInput(session, "year", selected = input$range[1])
    }
  })
  output$DataTable <- renderDataTable({
    # guard against a year that is outside the currently selected range
    validate(need(as.numeric(input$year) >= input$range[1] &&
                    as.numeric(input$year) <= input$range[2],
                  "Current selection not in range"))
    dat1 <- dat %>%
      filter(year >= input$range[1] & year <= input$range[2]) %>%
      pivot_wider(id_cols = Name,
                  values_from = rank,
                  names_from = year) %>%
      filter(.[colnames(.) == as.character(input$year)] < 11) %>%
      arrange(.[colnames(.) == as.character(input$year)])
    datatable(dat1,
              options = list(ordering = FALSE,
                             lengthChange = FALSE,
                             pageLength = -1)) %>%
      formatStyle(input$year,
                  backgroundColor = "lightgreen")
  })
}
shinyApp(ui, server)
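An alternative along the same lines (a sketch, not part of the original answer; it would replace the observeEvent above) is to restrict the select input's choices to the years that fall inside the slider range, so an out-of-range year can never be chosen:
observeEvent(input$range, {
  in_range <- years[years >= input$range[1] & years <= input$range[2]]
  updateSelectInput(session, "year",
                    choices = in_range,
                    selected = if (input$year %in% in_range) input$year else max(in_range))
})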

How to plot Unequal Interval on x axis of a time series data?

I am using the plotrix package to visualize changes in the data using colours. The data is available here.
I am using the code below to plot the data.
library(plotrix)
my_colors1=c("red", "green","blue")
a<-read.csv("DataSt.csv")
x<-a$Year
y<-a$TP
clplot(x, y, main="",lwd=5,labels=y,levels=c(37,964,4377),col=my_colors1, showcuts=T, bty="n",xlab="Year", ylab = "numbers", axes=F)
axis(1, at = a$Year, las=2)
axis(2, at = seq(0, 4400, by = 100), las=2)
I am getting the chart shown above.
I want to reduce the axis space between the years 1975 and 1989. Please help me get unequal intervals on the x axis.
It's a bit dangerous to do this, given that the viewer might not realize the spacing among the x-axis values is inconsistent. Nevertheless, the following example shows a possible solution by treating the x values as factor levels. The problem is that the plotting function only accepts numeric values, so I plot with factors but then use the underlying numeric codes to draw interpolated values in between using segments():
a <- structure(list(Year = c(2021L, 2020L, 2019L, 2018L, 2017L, 2016L,
2015L, 2014L, 2013L, 2012L, 2011L, 2010L, 2009L, 2008L, 2007L,
2006L, 2005L, 2004L, 2003L, 2002L, 2001L, 2000L, 1999L, 1998L,
1997L, 1996L, 1995L, 1994L, 1993L, 1992L, 1991L, 1990L, 1989L,
1975L), TP = c(785L, 848L, 1067L, 1079L, 1263L, 678L, 1204L,
542L, 661L, 387L, 3534L, 4377L, 964L, 244L, 237L, 145L, 86L,
37L, 39L, 23L, 14L, 11L, 7L, 9L, 6L, 3L, 7L, 7L, 6L, 1L, 1L,
1L, 2L, 1L)), class = "data.frame", row.names = c(NA, -34L))
a$Year <- factor(a$Year)
a <- a[order(a$Year),]
head(a)
my_colors1=c("red", "green","blue")
plot(TP ~ Year, a, col = NA, border = NA, las = 2)
for (i in 2:nrow(a)) {
  b <- as.data.frame(approx(x = as.numeric(a$Year[(i-1):i]), y = a$TP[(i-1):i], n = 100))
  b$col <- my_colors1[as.numeric(cut(b$y, breaks = c(-Inf, 37, 964, 4377, Inf)))]
  segments(x0 = b$x[-nrow(b)], x1 = b$x[-1], y0 = b$y[-nrow(b)], y1 = b$y[-1], col = b$col[-1])
}
abline(h = c(37,964), lty = 2)
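Since you are already loading plotrix, another possibility (an untested sketch; the gap limits are illustrative) is gap.plot(), which draws an explicit break in the x axis rather than compressing the early years:
library(plotrix)
yr <- as.numeric(as.character(a$Year))  # back to numeric if Year was made a factor above
gap.plot(yr, a$TP, gap = c(1976, 1988), gap.axis = "x",
         xlab = "Year", ylab = "numbers")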

How can I calculate the distance of a state within a cluster from the center of the cluster and visualize it on a graph?

I have a sample of 28 states. I want to plot them in one cluster, identify the center, and then calculate the distance of every state from the center, per year.
My input file resembles the following: first column: Country; second column: Year (from 2008 to 2017); third column: PI (an index).
Question 1: I am getting the error:
Error in eval(e, x, parent.frame()) : object 'mydata.year' not found
when I run:
table_2008 = subset(table1, mydata.year == 2008)
Question 2: What code is best suited to calculate the distance of a state from the center of the cluster?
Initially, I designed four aggregate indices to analyse progress in EU member states (economic, political, social, and institutional). I then clustered the states into six clusters and visualized them as in the image below (k-means and ggplot). I was calculating the distance between clusters, and the distance between states within a cluster, using withinss and totss.
However, I have decided to analyse the states from another perspective. Instead of redesigning an index, I decided to use the Prosperity Index, an aggregate index covering the same four issue areas (economic, political, social, and institutional). I therefore ended up with a single variable. What I wish to do is treat the EU member states as one cluster (so I have one cluster of 28 states for every year from 2008 to 2017) and calculate the distance of every state from the center. I need a specific numeric value for every state for every year.
Lastly, I want to visualize this in a graph similar to the one in the attached image. I will have ten graphs, one for every year (2008 to 2017). Each graph will contain all 28 states and the center.
My code is:
mydata = read.csv("C:/Users/TA/Desktop/R4./PI4.csv",sep = ",", header=TRUE)
mydata$Country
mydata$Category
mydata$PI
data_cluster = data.frame(mydata$Country,mydata$Category,mydata$PI)
write.csv(data_cluster,"C:/Users/TA/Desktop/R4./OutputPI.csv", row.names = FALSE)
table1 = data_cluster
#plot(uk_line[,4])
table1 = na.omit(table1)
within_results = ts(,start = c(2008), end = c(2017), frequency = 1)
within_resultsbetweenss = ts(,start = c(2008), end = c(2017), frequency = 1)
within_results_withinss = matrix(data= NA, nrow = 10, ncol = 4)
#nrow = years, ncols = number of clusters
#colnames(mydata, c("Country","Year"))
#YEAR 2008
#SELECTING A GIVEN YEAR (subset of rows such that year = 2008)
table_2008 = subset(table1, mydata.year ==2008)
table_2008
data2008_clus = table_2008[,3:ncol(table_2008)]
#NAMING THE ROWS USING THE COUNTRY NAMES
rownames(data2008_clus) = table_2008$mydata.Country
data2008_clus
plot(table_2008)
wss <- (nrow(data2008_clus)-1)*sum(apply(data2008_clus,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(data2008_clus,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
# Compute k-means with k = 1
fit1=kmeans(x = data2008_clus,centers = 1)
fit1$cluster
fviz_cluster(fit1,data = data2008_clus)
fit1$withinss
fit1$totss
fit1$betweenss
table_2008$cluster = factor(fit1$cluster)
centers=as.data.frame(fit1$centers)
table_2008
within_results[1] = fit1$totss
within_resultsbetweenss[1] = fit1$betweenss
within_results_withinss[1,] = fit1$withinss
within_results_withinss[1,] = fit1$withinss
plot(within_results)
plot(within_resultsbetweenss)
plot(within_results_withinss)
# Print the results
print(km.res)
table_2008
As for my data set,
mydata_struct = structure(list(Year = c(2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L,
2009L), Country = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 1L), .Label = c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus",
"Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece",
"Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
"Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain",
"Sweden", "United Kingdom"), class = "factor"), Prosperity.Index = c(79.4,
76.1, 62, 65.1, 69.9, 70.9, 83.2, 73.5, 81.2, 75.9, 79.9, 66, 66.7, 78.9, 69.6,
67.7, 66.6, 79.9, 73.4, 81.2, 66.9, 71, 62.6, 68.2, 72.7, 72.6, 82.8, 78, 79.4)),
row.names = c(NA, 29L), class = "data.frame")
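A note on the two questions above (a sketch, using the mydata_struct object just shown; the column names differ from those read from PI4.csv). The error in question 1 arises because data.frame(mydata$Country, mydata$Category, mydata$PI) creates columns named mydata.Country, mydata.Category and mydata.PI, so there is no mydata.year column to subset on; naming the columns explicitly avoids this. For question 2, with a single variable and a single cluster the center is simply the mean, so each state's distance from the center is the absolute difference (the same value kmeans with centers = 1 would give):
# build the data frame with explicit column names (sketch based on mydata_struct)
data_cluster <- data.frame(Country = mydata_struct$Country,
                           Year = mydata_struct$Year,
                           PI = mydata_struct$Prosperity.Index)
table_2008 <- subset(data_cluster, Year == 2008)
# one cluster, one variable: the center is the mean PI,
# and each state's distance from the center is the absolute difference
table_2008$dist_to_center <- abs(table_2008$PI - mean(table_2008$PI))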

Multiple Timeseries graph in R

I am trying to create a time series plot with multiple series over the years. I would like to plot just the years and have each series run from its start date to its end date. Below I have converted the respective columns to dates and then combined them, but I do not get the result I am looking for.
The data is available from this website: https://www.businessinsider.co.za/coronavirus-deaths-how-pandemic-compares-to-other-deadly-outbreaks-2020-4?r=US&IR=T
Something like this, where the series don't all start or end in the same year:
https://ichef.bbci.co.uk/news/410/cpsprodpb/6E25/production/_111779182_optimised-mortality-nc.png
(time period vs deaths caused)
library(lubridate)
library(ggplot2)
otherDiseaseData <- structure(list(Disease = structure(c(11L, 2L, 12L, 6L, 3L,
1L, 9L, 7L, 13L, 4L, 5L, 8L, 10L), .Label = c("Asian Flu", "blackdeath",
"Cholera", "Covid 19", "Ebola", "HIV", "Hong Kong Flu", "Mers",
"Russian Flu", "Sars", "smallpox", "spanish flu", "Swine Flu"
), class = "factor"), Start = c(0L, 1347L, 1918L, 1981L, 1899L,
1957L, 1889L, 1968L, 2009L, 2019L, 2014L, 2012L, 2002L), End = c(1979L,
1351L, 1919L, 2020L, 1923L, 1958L, 1890L, 1970L, 2010L, 2020L,
2016L, 2020L, 2003L), Death = c(300000L, 225000000L, 50000L,
2360000L, 1500000L, 1100000L, 1000000L, 1000000L, 151700L, 101526L,
11300L, 866L, 774L)), class = "data.frame", row.names = c(NA,
-13L))
yrs <- otherDiseaseData$Start
yr <- as.Date(as.character(yrs), format = "%Y")
yStart <- year(yr)
yrs <- otherDiseaseData$End
yr <- as.Date(as.character(yrs), format = "%Y")
yStart <- year(yr)
otherDiseaseData$x <- paste(otherDiseaseData$Start,otherDiseaseData$End)
otherDiseaseData
ggplot(otherDiseaseData, aes(y = Death, x = otherDiseaseData$x),xlim=0000-2000) + geom_point()
I'm not sure I've fully understood what you're asking for, but my interpretation is this:
library(dplyr)  # needed for filter() and the %>% pipe used below
df <- reshape::melt(otherDiseaseData, measure.vars = c("Start", "End"))
ggplot(df %>% filter(Disease != "smallpox", Death != 225000000)) +
  geom_line(aes(value, Death, colour = Disease), size = 2) +
  theme_minimal() +
  ggrepel::geom_label_repel(data = filter(df, Disease != "smallpox", Death != 225000000,
                                          variable != "Start"),
                            aes(label = Disease, x = value, y = Death)) +
  scale_y_log10() +
  theme(legend.position = "none", aspect.ratio = 1) +
  ylab("Number of Deaths") + xlab("Year")
I've used the reshape package to reorganise the given data, and then ggrepel to label the lines. I had to remove some data because it really throws off the scale, which I've ended up making logarithmic to spread the data out a little. It gives you this plot:
It's not perfect but it might be heading in the right direction? Apologies if I've misunderstood what you were angling for.
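For reference, the same reshaping can also be done with tidyr's pivot_longer (a sketch equivalent to the melt() call above, avoiding the superseded reshape package):
library(tidyr)
df <- pivot_longer(otherDiseaseData, cols = c(Start, End),
                   names_to = "variable", values_to = "value")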

Collapsing a data frame by factors with multiple criteria

I have a data frame that describes the sequential movements of animals (ID column) and the time spent there (start and end columns). These movements are recorded over small scales but are classified within larger regions (classification column), such that an animal can move multiple times within a region before later moving to another region and moving around. They can also stay in one region for the whole time, or never move at all.
The sequence of movements within each region is tracked in the sequent_moves column (see this question for a more thorough explanation of how these are created). Animals can potentially move back to a region they earlier left. There is also a column of chemical data, Mean_8786Sr which is related to that region.
I want to collapse this data frame so that I end up with a description of only the regional movements. So, subsetting by Sample and sequent_moves, I want to keep the minimum start value and the maximum end value, ending up with the start and end time within the region. I also want the mean of the chemical data in Mean_8786Sr. For the rest of the columns I want to keep either the minimum value or the (first) factor value, as shown in the example code below.
I can do this using by(), but so far it requires a statement for each column. My actual data has quite a few more columns and many thousand rows. I'm pretty sure there is a faster, more elegant way to do this, perhaps with data.table (since I'm liking what I've seen from that package so far).
Below is my result. Is there a more efficient way to do this?
movement = data.frame(structure(list(start = c(0, 0, 110, 126, 235, 0, 17, 139, 251,
0, 35, 47, 99, 219, 232, 269, 386, 398, 414, 443, 459), end = c(782L,
110L, 126L, 235L, 612L, 17L, 139L, 251L, 493L, 35L, 47L, 99L,
219L, 232L, 269L, 386L, 398L, 414L, 443L, 459L, 765L), Mean_8786Sr = c(0.709269349163555,
0.710120935400909, 0.70934948311875, 0.71042744033211, 0.709296068424668,
0.708621911917647, 0.709358583256557, 0.710189508916071, 0.709257758963636,
0.711148891471429, 0.712470115258333, 0.713742475130769, 0.714572498375,
0.713400790353846, 0.711656338391892, 0.710380629097436, 0.711571667241667,
0.71290867871875, 0.712009033513793, 0.71104293234375, 0.709344687326471
), Sample = c("2006_3174", "2006_3185", "2006_3185", "2006_3185",
"2006_3185", "2006_3189", "2006_3189", "2006_3189", "2006_3189",
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194",
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194",
"2006_3194", "2006_3194"), ID = c("1", "1", "2", "3", "4", "1",
"2", "3", "4", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12"), return_year = c(2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L), classification = c("CW",
"CW", "SK", "CW", "CW", "SK", "SK", "CW", "CW", "CW", "CW", "CW",
"CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW"), sequent_moves = c(1L,
1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), Sample_cptID = c("2006_3174 1", "2006_3185 1",
"2006_3185 2", "2006_3185 3", "2006_3185 3", "2006_3189 1", "2006_3189 1",
"2006_3189 2", "2006_3189 2", "2006_3194 1", "2006_3194 1", "2006_3194 1",
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1",
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1")), .Names = c("start",
"end", "Mean_8786Sr", "Sample", "ID", "return_year", "classification",
"sequent_moves", "Sample_cptID"), class = "data.frame", row.names = 6:26))
Here is my solution using by():
moves = by(movement, INDICES = factor(movement$Sample_cptID), function(x) {
  start = min(x[, "start"])
  end = max(x[, "end"])
  Mean_8786Sr = mean(x[, "Mean_8786Sr"])
  Sample = x[1, "Sample"]
  ID = min(x[, "ID"])
  return_year = x[1, "return_year"]
  classification = x[1, "classification"]
  sequent_moves = x[1, "sequent_moves"]
  move = cbind(start, end, Mean_8786Sr, Sample, ID, return_year, classification, sequent_moves)
  move
})
regional_moves = do.call(rbind.data.frame, moves)
regional_moves
Is there:
a more efficient way to do this?
an easier or more compact way to specify which columns I want max(), min(), etc.?
Edit: Adding partial data.table solution per Jeannie's comment.
Here is what I have so far using data.table.
require('data.table')
m=setDT(movement)
m[, .(start=base::min(start),
end=base::max(end),
Mean_8786Sr=mean(Mean_8786Sr),
ID = base::min(ID),
return_year = return_year[1],
classification = classification[1],
Sample_cptID = Sample_cptID[1])
, by=c('Sample', 'sequent_moves')]
If I run this without base::min() I get errors. The current error is:
Error in `g[`(Sample_cptID, 1) : object 'Sample_cptID' not found
in a prior iteration (that didn't work) I got:
Error in gmin(ID) :
GForce min can only be applied to columns, not .SD or similar. To find min of all items in a list such as .SD, either add the prefix base::min(.SD) or turn off GForce optimization using options(datatable.optimize=1). More likely, you may be looking for 'DT[,lapply(.SD,min),by=,.SDcols=]'
Running it with the base min() and max() functions, it works. I'm trying to understand what GForce is really doing to optimize speed; I assume that has something to do with why it isn't returning the functionality I expected. This thread talks about it, but I haven't digested it completely. Any ideas?
It would be nice to be able to pass min, max, and mean in a list that I can populate with column names. For the vast majority of columns I just want the first element. It would be more compact if there were a way to specify the max, min, and mean columns directly and then say the equivalent of "for every other column, give me the first element".
The OP has asked if there is a more efficient way to aggregate the movement data frame than by specifying each column individually.
I'm afraid that it is unavoidable to specify which columns need to be aggregated by which aggregation function. However, data.table syntax is quite compact in general. So, the call to by() can be implemented with data.table as follows:
library(data.table)
setDT(movement)[
, .(start = min(start), end = max(end), Mean_8786Sr = mean(Mean_8786Sr), ID = min(ID)),
by = .(Sample, return_year, classification, sequent_moves)]
      Sample return_year classification sequent_moves start end Mean_8786Sr ID
1: 2006_3174        2006             CW             1     0 782   0.7092693  1
2: 2006_3185        2006             CW             1     0 110   0.7101209  1
3: 2006_3185        2006             SK             2   110 126   0.7093495  2
4: 2006_3185        2006             CW             3   126 612   0.7098618  3
5: 2006_3189        2006             SK             1     0 139   0.7089902  1
6: 2006_3189        2006             CW             2   139 493   0.7097236  3
7: 2006_3194        2006             CW             1     0 765   0.7120207  1
Note that all variables which are invariant or constant within each group are treated as grouping variables in by = .... This saves some typing but puts the columns in front of the other (aggregated) columns.
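If you do want the "first element of every other column" behaviour asked about in the question, one possible pattern (a sketch, not tested on the full data) is to combine the explicitly aggregated columns with lapply(.SD, ...) over the remaining ones:
library(data.table)
setDT(movement)
# columns that get an explicit aggregation
agg_cols   <- c("start", "end", "Mean_8786Sr", "ID")
# everything else (apart from the grouping columns) just keeps its first element
other_cols <- setdiff(names(movement), c(agg_cols, "Sample", "sequent_moves"))
movement[, c(list(start = min(start), end = max(end),
                  Mean_8786Sr = mean(Mean_8786Sr), ID = min(ID)),
             lapply(.SD, function(x) x[1])),
         by = c("Sample", "sequent_moves"), .SDcols = other_cols]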
