Collapsing a data frame by factors with multiple criteria - r

I have a data frame that describes the sequential movements of animals (ID column) and the time spent there (start and end columns). These movements are recorded over small scales but are classified within larger regions (classification column), such that an animal can move multiple times within a region before later moving to another region and moving around. They can also stay in one region for the whole time, or never move at all.
The sequence of movements within each region is tracked in the sequent_moves column (see this question for a more thorough explanation of how these are created). Animals can potentially move back to a region they earlier left. There is also a column of chemical data, Mean_8786Sr which is related to that region.
I want to collapse this data frame so that I end up with a description of only the regional movements. So, subsetting by Sample and sequent_moves I want to keep the minimum start value and the maximum end value, ending up with the start and end time within the region. I further want a mean of the chemical data in Mean_8786Sr. The rest of the columns I want to either keep the minimum value or the factor value as shown in the example code below.
I can do this using by(), but so far it requires a statement for each column. My actual data has quite a few more columns and many thousand rows. I'm pretty sure there is a faster, more elegant way to do this, perhaps with data.table (since I'm liking what I've seen from that package so far).
Below is my result. Is there a more efficient way to do this?
movement = data.frame(structure(list(start = c(0, 0, 110, 126, 235, 0, 17, 139, 251,
0, 35, 47, 99, 219, 232, 269, 386, 398, 414, 443, 459), end = c(782L,
110L, 126L, 235L, 612L, 17L, 139L, 251L, 493L, 35L, 47L, 99L,
219L, 232L, 269L, 386L, 398L, 414L, 443L, 459L, 765L), Mean_8786Sr = c(0.709269349163555,
0.710120935400909, 0.70934948311875, 0.71042744033211, 0.709296068424668,
0.708621911917647, 0.709358583256557, 0.710189508916071, 0.709257758963636,
0.711148891471429, 0.712470115258333, 0.713742475130769, 0.714572498375,
0.713400790353846, 0.711656338391892, 0.710380629097436, 0.711571667241667,
0.71290867871875, 0.712009033513793, 0.71104293234375, 0.709344687326471
), Sample = c("2006_3174", "2006_3185", "2006_3185", "2006_3185",
"2006_3185", "2006_3189", "2006_3189", "2006_3189", "2006_3189",
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194",
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194",
"2006_3194", "2006_3194"), ID = c("1", "1", "2", "3", "4", "1",
"2", "3", "4", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12"), return_year = c(2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L), classification = c("CW",
"CW", "SK", "CW", "CW", "SK", "SK", "CW", "CW", "CW", "CW", "CW",
"CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW"), sequent_moves = c(1L,
1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), Sample_cptID = c("2006_3174 1", "2006_3185 1",
"2006_3185 2", "2006_3185 3", "2006_3185 3", "2006_3189 1", "2006_3189 1",
"2006_3189 2", "2006_3189 2", "2006_3194 1", "2006_3194 1", "2006_3194 1",
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1",
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1")), .Names = c("start",
"end", "Mean_8786Sr", "Sample", "ID", "return_year", "classification",
"sequent_moves", "Sample_cptID"), class = "data.frame", row.names = 6:26))
Here is my solution using by():
moves = by(movement, INDICES = factor(movement$Sample_cptID), function (x) {
start = min(x[,"start"])
end = max(x[,"end"])
Mean_8786Sr = mean(x[,"Mean_8786Sr"])
Sample = x[1,"Sample"]
ID = min(x[,"ID"])
return_year = x[1,"return_year"]
classification = x[1,"classification"]
sequent_moves = x[1,"sequent_moves"]
move = cbind(start, end, Mean_8786Sr, Sample, ID, return_year, classification, sequent_moves)
move
}
)
regional_moves = do.call(rbind.data.frame, moves)
regional_moves
Is there:
1. a more efficient way to do this?
2. an easier or more compact way to specify which columns get max(), min(), etc.?
Edit: Adding partial data.table solution per Jeannie's comment.
Here is what I have so far using data.table.
require('data.table')
m=setDT(movement)
m[, .(start=base::min(start),
end=base::max(end),
Mean_8786Sr=mean(Mean_8786Sr),
ID = base::min(ID),
return_year = return_year[1],
classification = classification[1],
Sample_cptID = Sample_cptID[1])
, by=c('Sample', 'sequent_moves')]
If I run this without base::min() I get errors. The current error is:
Error in `g[`(Sample_cptID, 1) : object 'Sample_cptID' not found
In a prior iteration (which also didn't work) I got:
Error in gmin(ID) :
GForce min can only be applied to columns, not .SD or similar. To find min of all items in a list such as .SD, either add the prefix base::min(.SD) or turn off GForce optimization using options(datatable.optimize=1). More likely, you may be looking for 'DT[,lapply(.SD,min),by=,.SDcols=]'
With the base min() and max() functions it works. I'm trying to understand what GForce is really doing to optimize speed; I assume that has something to do with why I'm not getting the behavior I expected. This thread talks about it, but I haven't digested it completely. Any ideas?
It would be nice to be able to pass min, max and mean to a list that I can populate with colnames. The vast majority of columns I just want the first element. It would be more compact if there was a way to specify the max, min and mean columns directly and then say the equivalent of " for every other column, give me the first element".

The OP has asked if there is a more efficient way to aggregate the movement data frame than by specifying each column individually.
I'm afraid that it is unavoidable to specify which columns need to be aggregated by which aggregation function. However, data.table syntax is quite compact in general. So, the call to by() can be implemented with data.table as follows:
library(data.table)
setDT(movement)[
, .(start = min(start), end = max(end), Mean_8786Sr = mean(Mean_8786Sr), ID = min(ID)),
by = .(Sample, return_year, classification, sequent_moves)]
      Sample return_year classification sequent_moves start end Mean_8786Sr ID
1: 2006_3174        2006             CW             1     0 782   0.7092693  1
2: 2006_3185        2006             CW             1     0 110   0.7101209  1
3: 2006_3185        2006             SK             2   110 126   0.7093495  2
4: 2006_3185        2006             CW             3   126 612   0.7098618  3
5: 2006_3189        2006             SK             1     0 139   0.7089902  1
6: 2006_3189        2006             CW             2   139 493   0.7097236  3
7: 2006_3194        2006             CW             1     0 765   0.7120207  1
Note that all variables which are invariant or constant within each group are treated as grouping variables in by = .... This saves some typing but puts the columns in front of the other (aggregated) columns.
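One way to cut down the typing when most columns only need their first element is to start from a "first element" default for every column and override just the columns named in a small lookup list. The sketch below is an illustration under stated assumptions, not part of the original answer: the toy table dt, the agg list, and by_cols are made-up names, and the values are invented. Because the logic sits inside a { } block, data.table's GForce optimization will not apply, so on large data it may well be slower than spelling out each column.

```r
library(data.table)

# Toy stand-in for the `movement` data (values are invented for illustration)
dt <- data.table(
  Sample         = c("a", "a", "a", "b"),
  sequent_moves  = c(1L, 1L, 2L, 1L),
  start          = c(0, 10, 20, 0),
  end            = c(10, 20, 35, 50),
  Mean_8786Sr    = c(0.70, 0.72, 0.71, 0.73),
  classification = c("CW", "CW", "SK", "CW")
)

by_cols <- c("Sample", "sequent_moves")

# Column name -> aggregation function; columns not listed keep their first element
agg <- list(start = min, end = max, Mean_8786Sr = mean)

res <- dt[, {
  # default for every non-grouping column: first element of the group
  out <- lapply(.SD, function(x) x[1L])
  # override the columns that have an explicit aggregation function
  for (nm in names(agg)) out[[nm]] <- agg[[nm]](.SD[[nm]])
  out
}, by = by_cols]

res
```

The which-function-for-which-column decision still has to be written down once, in agg, but new "first element" columns need no extra code.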

Related

manually scale color of a factor in ggplot

Let's say I have a data frame like this:
id password year length Something
1 1234567 2001 7 good
2 pass4 2001 5 bad
3 angel3 2003 6 bad
4 pizza 2004 5 ok
I'm trying to write code that creates a geom_point plot with three variables, but I only want to highlight a single level of the factor "Something". I don't want any of the other levels of Something (like good or bad) to be colored, or at least they can stay black.
I was thinking maybe something like this:
graph <- dat %>%
ggplot(aes(x=(year), y=length, color=Something$ok))+
geom_point()
but I can't use $.
You can color just one level by mapping color to a TRUE/FALSE condition, so that all other points share one color and only the points you want to highlight get a different one. To do this you can use scale_color_manual.
Data:
dat <- structure(list(id = 1:4, password = structure(c(1L, 3L, 2L, 4L
), .Label = c("1234567", "angel3", "pass4", "pizza"), class = "factor"),
year = c(2001L, 2001L, 2003L, 2004L), length = c(7L, 5L,
6L, 5L), Something = structure(c(2L, 1L, 1L, 3L), .Label = c("bad",
"good", "ok"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
Plot:
library(ggplot2)
library(dplyr)  # provides the %>% pipe
dat %>%
ggplot(aes(x=(year), y=length, color = Something == "ok"))+
geom_point() +
scale_color_manual(values = c("blue", "orange"))
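Since the question asks for the other levels to stay black, a small variation is to name the entries of values so that it is explicit which color goes with TRUE and which with FALSE. This is only a sketch: the data frame below is a trimmed copy of the example data, and the legend title "Something is ok" is an arbitrary choice.

```r
library(ggplot2)

# Trimmed copy of the example data (id and password columns left out)
dat <- data.frame(
  year      = c(2001L, 2001L, 2003L, 2004L),
  length    = c(7L, 5L, 6L, 5L),
  Something = factor(c("good", "bad", "bad", "ok"))
)

p <- ggplot(dat, aes(x = year, y = length, color = Something == "ok")) +
  geom_point() +
  # named values: points where the condition is FALSE stay black
  scale_color_manual(
    name   = "Something is ok",
    values = c("TRUE" = "orange", "FALSE" = "black")
  )

p
```

Naming the values avoids relying on the order in which ggplot2 sorts the FALSE and TRUE groups.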

How to conduct an ANOVA of several variables taken on individuals separated by multiple grouping variables?

I have a data frame similar to the one created by the code below. In this example, measurements of 5 variables are taken on 30 individuals represented by ID. The individuals can be separated by any of three grouping variables: GroupVar1, GroupVar2, GroupVar3. For each of the grouping variables, I need to conduct an ANOVA for each of the 5 variables, and return the results of each (possibly onto a pdf or separate document?). How can I write a function, or use iteration, to handle this problem and minimize repetition in my code? What is the best way to extract and visualize the results if you have a large dataset (my real data set has several hundred individuals, and the grouping variables range in size from 6 to 30 groups)?
library(tidyverse)
GroupVar1 <- rep(c("FL", "GA", "SC", "NC", "VA", "GA"), each = 5)
GroupVar2 <- rep(c("alpha", "beta", "gamma"), each = 10)
GroupVar3 <- rep(c("Bravo", "Charlie", "Delta", "Echo"), times = c(7,8,10,5))
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z", "a","b","c","d")
Var1 <- rnorm(30)
Var2 <- rnorm(30)
Var3 <- rnorm(30)
Var4 <- rnorm(30)
Var5 <- rnorm(30)
data <- tibble(GroupVar1,GroupVar2,GroupVar3,ID,Var1,Var2,Var3,Var4,Var5)
> dput(data[1:10,])
structure(list(Location = structure(c(21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L), .Label = c("ALTE", "ASTR", "BREA",
"CAMN", "CFU", "COEN", "JENT", "NAT", "NEAU", "NOCO", "OOGG",
"OPMM", "PING", "PITC", "POMO", "REAN", "ROND", "RTD", "SANT",
"SMIT", "SUN", "TEAR", "WINC"), class = "factor"), PR = structure(c(16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("ALTE",
"ASTR", "CF", "CHOW", "JENT", "NAT", "NEAU", "NSE", "OOGG", "PALM",
"POMO", "REAN", "ROND", "RTD", "SS", "SUN", "WINC"), class = "factor"),
Est = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AS",
"CB", "CF", "CS", "OS", "PS", "SS", "WB"), class = "factor"),
State = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("FL", "GA", "MD", "NC", "SC", "VA"), class = "factor"),
Year = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L), ID = c(90L, 92L, 93L, 95L, 96L, 98L,
99L, 100L, 103L, 109L), Sex = structure(c(1L, 2L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 2L), .Label = c("F", "M"), class = "factor"),
DOB = c(-0.674706816, 2.10472846, 0.279952847, -0.26959379,
-1.243977657, 0.188828771, 0.026530709, 0.483363306, -0.63599302,
-0.979506001), Mg = c(-1.409815618, 1.180920604, 0.765102543,
1.828057339, -0.689841498, -0.604272366, 0.194867939, -1.015964127,
-0.520136693, 0.769042585), Mn7 = c(1.387385913, 0.320582444,
-0.490356598, -0.020540649, -0.594210249, -1.119170306, -0.225065868,
-1.892064456, -2.434101506, -0.816518662), Cu7 = c(-0.176599651,
0.100529267, 1.4967142, 0.094840221, 1.791653259, -0.191723817,
-1.526868086, -0.308696916, -2.046613977, -2.228513411),
Zn7 = c(-0.338454617, -0.235800727, -0.785876374, 0.114698826,
0.202960987, 0.432013987, 0.164099621, 0.609232311, 0.169329098,
-0.284402654), Sr7 = c(-0.010929071, -1.616835312, -0.208856,
-0.362538736, 1.662066318, -0.893155185, 0.699406559, -0.333176495,
-2.026364633, -1.324456127), Ba7 = c(-1.041126455, 0.551165907,
0.126849272, -1.069762666, -0.922501551, -1.36095076, 1.57800858,
-0.842518997, -1.017894235, 0.265895019)), row.names = c(NA,
10L), class = "data.frame")
Without knowing too much about the underlying data, my hunch is this may be improper use of an ANOVA. I would advise that you post to Cross Validated to confirm you aren't breaking any assumptions here.
Regardless, here is the code I would use to tackle the problem presented:
# We will use dplyr, tidyr, purrr, stats, and broom to accomplish this
# I am using tidyr v1.0.0. For older versions you will need to modify code for pivot_longer
results <- data %>%
# First pivot the data longer so each dependent variable is on its own row
pivot_longer(
cols = Var1:Var5,
names_to = "name",
values_to = "value"
) %>%
# Second, pivot longer again, so each row is now its unique grouping var
pivot_longer(
cols = GroupVar1:GroupVar3,
names_to = "group_name",
values_to = "group_value"
) %>%
# group by both group name and dependent variable
group_by(name, group_name) %>%
# nest the data, so each dataset is unique for each dependent and independent variable
nest() %>%
mutate(
# run an anova on each nested data frame
anova = map(data, ~aov(data = .x, value ~ group_value)), # may need to change aov() call here
# use broom to tidy the output
tidied_results = map(anova, broom::tidy)
)
# To easily access the ANOVA results, you can do something like the following:
results %>%
# select columns of interest
select(name, group_name, tidied_results) %>%
# unnest to access summary information of ANOVA
unnest(cols = c(tidied_results))
I think you'll also want to use some sort of multiple-comparison correction, such as Bonferroni Correction. Again, Cross Validated can lead you in the right direction here.
Edited answer based on updated question with dput data:
Assuming the columns representing the grouping variables are 1:5 and 7, and assuming the dependent numeric variables are in columns 8:14, this can be done using a double loop, with no other dependencies:
tests <- list()
Groups <- c(1:5, 7)
Variables <- 8:14
for(i in Groups)
{
Group <- as.factor(data[[i]])
for(j in Variables)
{
test_name <- paste0(names(data)[j], "_by_", names(data[i]))
Response <- data[[j]]
tests[[test_name]] <- anova(lm(Response ~ Group))
}
}
Now you can do what you like with all these tests using lapply, such as
lapply(tests, print)
I agree with @DaveGruenewald about multiple hypothesis testing though - in fact, this example gave a nice demonstration of why Bonferroni or Sidak's corrections are needed, since there were (as expected) a few "significant" p values among the random data simply due to the number of tests involved.
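As a rough sketch of the correction both answers mention, the p-values from a set of one-way ANOVAs can be collected and passed to p.adjust() from base R. Everything below is simulated: the data frame dat, the single grouping column Group, and the three variable names are assumptions for illustration, and the Holm column is an extra that neither answer asked for.

```r
set.seed(1)  # simulated data only, not the OP's measurements

dat <- data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 10)),
  Var1  = rnorm(30),
  Var2  = rnorm(30),
  Var3  = rnorm(30)
)

vars <- c("Var1", "Var2", "Var3")

# One-way ANOVA per variable; keep the p-value of the Group term (first row)
p_raw <- sapply(vars, function(v) {
  fit <- anova(lm(dat[[v]] ~ dat$Group))
  fit[["Pr(>F)"]][1]
})

# Raw and adjusted p-values side by side
p_table <- data.frame(
  variable     = vars,
  p_raw        = unname(p_raw),
  p_bonferroni = p.adjust(unname(p_raw), method = "bonferroni"),
  p_holm       = p.adjust(unname(p_raw), method = "holm")
)

p_table
```

The same call works on p-values pulled out of the tests list above; whether Bonferroni, Holm, or another method is appropriate is a statistical question better settled on Cross Validated.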

R: How to create multiple maps (rworldmap) using apply?

I want to create multiple maps (similar to this example) using the apply family. Here is a small sample of my data (the full set is ~200 rows x 150 cols). UN and ISO3 are codes for rworldmap:
df <- structure(list(BLUE.fruits = c(12803543,
3745797, 19947613, 0, 130, 4), BLUE.nuts = c(21563867, 533665,
171984, 0, 0, 0), BLUE.veggies = c(92690, 188940, 34910, 0, 0,
577), GREEN.fruits = c(3389314, 15773576, 8942278, 0, 814, 87538
), GREEN.nuts = c(6399474, 1640804, 464688, 0, 0, 0), GREEN.veggies = c(15508,
174504, 149581, 0, 0, 6190), UN = structure(c(4L, 5L, 1L, 6L,
2L, 3L), .Label = c("12", "24", "28", "4", "8", "n/a"), class = "factor"),
ISO3 = structure(c(1L, 3L, 6L, 4L, 2L, 5L), .Label = c("AFG",
"AGO", "ALB", "ASM", "ATG", "DZA"), class = "factor")), .Names = c("BLUE.fruits", "BLUE.nuts", "BLUE.veggies", "GREEN.fruits", "GREEN.nuts",
"GREEN.veggies", "UN", "ISO3"), row.names = c(97L, 150L, 159L,
167L, 184L, 191L), class = "data.frame")
and the code I used before to plot one single map:
library(rworldmap)
mapDevice('x11')
spdf <- joinCountryData2Map(df, joinCode="ISO3", nameJoinColumn="ISO3")
mapWF <- mapCountryData(spdf, nameColumnToPlot="BLUE.nuts",
catMethod="quantiles")
Note: in mapCountryData() I used the names of single columns (in this case "BLUE.nuts"). My question is: is there a way to apply this mapping code for the different columns creating six different maps? Either in one multi-panel using layout() or even better creating six different plots that get saved according to their colnames. Ideas? Thanks a lot in advance
You are close.
Add this to save one plot per column.
#put column names to plot in a vector
col_names <- names(df)[1:6]
lapply(col_names, function(x) {
#opens device to store pdf
pdf(paste0(x,'.pdf'))
#plots map
mapCountryData(spdf, nameColumnToPlot=x)
#closes created pdf
dev.off()
})

How to display separate rows in histogram in R?

I have a set of data that I've assigned to a variable named "data1". I know how to make a histogram of a certain column with hist(data1$RT). But the RT values fall under the "high", "med", and "low" levels of a factor, and I want to make 3 separate histograms, one for each level, but can't figure out how to do this. Here's an example of the data:
Frequency Prime_type RT
1 high prime 450
2 high prime 460
3 med prime 520
4 med prime 430
5 low prime 450
6 low prime 420
I can display hist(data1$RT), but how would I just display RT's 'high' or 'med' factors for example? I've tried a lot of things and am still stumped.
You can do it by faceting the plot with ggplot2. First, we modify df$Frequency to have the panels in order: high, med and low. Then we create the histogram specifying the breaks and using facet_wrap to divide the chart in panels. Note that we add the argument right = TRUE (right-closed and left-open intervals) to calculate the intervals as the hist function does.
library(ggplot2)
df$Frequency <- factor(df$Frequency, levels=unique(df$Frequency))
h <- ggplot(df, aes(x=RT), xlim=c(420,520)) +
geom_histogram(breaks=seq(420, 520, by=20), col="white", right = TRUE) +
facet_wrap( ~ Frequency) +
scale_x_continuous(breaks=seq(420, 520, by=20))
h
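If you would rather stay with base graphics and hist(), a minimal sketch is to split RT by the factor and draw one histogram per level. The data frame rt_data below is a retyped copy of the six example rows (Prime_type left out), and the shared breaks are an assumption chosen to cover this small sample.

```r
# Retyped copy of the example rows; levels ordered high, med, low
rt_data <- data.frame(
  Frequency = factor(c("high", "high", "med", "med", "low", "low"),
                     levels = c("high", "med", "low")),
  RT        = c(450L, 460L, 520L, 430L, 450L, 420L)
)

# List with one vector of RT values per Frequency level
rt_by_freq <- split(rt_data$RT, rt_data$Frequency)

# Histogram of a single level
hist(rt_by_freq[["high"]], main = "high", xlab = "RT")

# One panel per level, using the same breaks so the panels are comparable
old_par <- par(mfrow = c(1, length(rt_by_freq)))
for (lv in names(rt_by_freq)) {
  hist(rt_by_freq[[lv]], breaks = seq(420, 520, by = 20), main = lv, xlab = "RT")
}
par(old_par)  # restore the previous plotting layout
```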
Data:
df <- structure(list(Frequency = structure(c(1L, 1L, 3L, 3L, 2L, 2L
), .Label = c("high", "low", "med"), class = "factor"), Prime_type = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "prime", class = "factor"), RT = c(450L,
460L, 520L, 430L, 450L, 420L)), .Names = c("Frequency", "Prime_type",
"RT"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))

Group by and conditionally count

I am still learning data management in R. I know I am really close, but can't get the precise syntax. I have looked at
count a variable by using a condition in R
and
Conditional count and group by in R
but can't quite translate to my work. I am trying to get a count of dist.km that equal 0 by ST. Eventually I will want to add columns with counts of various distance ranges, but should be able to get it after getting this. The final table should have all states and a count of 0s. Here is a 20 row sample.
structure(list(ST = structure(c(12L, 15L, 13L, 10L, 15L, 16L,
11L, 12L, 8L, 14L, 10L, 14L, 6L, 11L, 5L, 5L, 15L, 1L, 6L, 4L
), .Label = c("CT", "DE", "FL", "GA", "MA", "MD", "ME", "NC",
"NH", "NJ", "NY", "PA", "RI", "SC", "VA", "VT", "WV"), class = "factor"),
Rfips = c(42107L, 51760L, 44001L, 34001L, 51061L, 50023L,
36029L, 42101L, 37019L, 45079L, 34029L, 45055L, 24003L, 36027L,
25009L, 25009L, 51760L, 9003L, 24027L, 1111L), zip = c(17972L,
23226L, 2806L, 8330L, 20118L, 5681L, 14072L, 19115L, 28451L,
29206L, 8741L, 29020L, 20776L, 12545L, 1922L, 1938L, 23226L,
6089L, 21042L, 36278L), Year = c(2010L, 2005L, 2010L, 2008L,
2007L, 2006L, 2005L, 2008L, 2009L, 2008L, 2010L, 2006L, 2007L,
2008L, 2011L, 2011L, 2008L, 2005L, 2008L, 2009L), dist.km = c(0,
42.4689368078209, 28.1123394088972, 36.8547005648639, 0,
49.7276501081775, 0, 30.1937156926235, 0, 0, 31.5643658415831,
0, 0, 0, 0, 0, 138.854136893762, 0, 79.4320981205195, 47.1692144550079
)), .Names = c("ST", "Rfips", "zip", "Year", "dist.km"), row.names = c(132931L,
105670L, 123332L, 21361L, 51576L, 3520L, 47367L, 99962L, 18289L,
126153L, 19321L, 83224L, 6041L, 46117L, 49294L, 48951L, 109350L,
64465L, 80164L, 22687L), class = "data.frame")
Here are a couple chunks of code I have tried.
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
state= aggregate(dist.km ~ ST, function(x) sum(dist.km==0, data=DDcomplete))
state = (DDcomplete[DDcomplete$dist.km==0,], .(ST), function(x) nrow(x))
If you want to add it as a column you can do:
DDcomplete %>% group_by(ST) %>% mutate(count = sum(dist.km == 0))
Or if you just want the counts per state:
DDcomplete %>% group_by(ST) %>% summarise(count = sum(dist.km == 0))
Actually, you were very close to the solution. Your code
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
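is almost correct. You need to remove the DDcomplete$ from within the call to sum: within dplyr chains you can access variables directly, and DDcomplete$dist.km refers to the whole column rather than the current group, so every state would get the same overall count.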
Also note that by using summarise, you will condense your data frame to 1 row per group with only the grouping column(s) and whatever you computed inside the summarise. If you just want to add a column with the counts, you can use mutate as I did in my answer.
If you're only interested in positive counts, you could also use dplyr's count function together with filter to first subset the data:
filter(DDcomplete, dist.km == 0) %>% count(ST)
I hope I'm not missing something, but it sounds like you just want table after doing some subsetting:
table(df[df$dist.km == 0, "ST"])
#
# CT DE FL GA MA MD ME NC NH NJ NY PA RI SC VA VT WV
#  1  0  0  0  2  1  0  1  0  0  2  1  0  2  1  0  0
Other approaches might be:
## dplyr, since you seem to be using it
library(dplyr)
df %>%
filter(dist.km == 0) %>%
group_by(ST) %>%
summarise(n())
## aggregate, since you tried that too
aggregate(dist.km ~ ST, df, function(x) sum(x == 0))
## data.table
library(data.table)
as.data.table(df)[dist.km == 0, .N, by = ST]
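For the follow-up mentioned in the question (counts for several distance ranges per state), one possible sketch is to bin dist.km with cut() and cross-tabulate against ST. The toy data and the break points 0 and 50 km are made up for illustration, and the "0" bin relies on the assumption that distances are never negative.

```r
# Toy data; the real DDcomplete has many more rows and columns
df <- data.frame(
  ST      = c("PA", "VA", "VA", "NJ", "NY", "NY"),
  dist.km = c(0, 42.5, 0, 36.9, 0, 138.9)
)

# Intervals are right-closed: (-Inf, 0], (0, 50], (50, Inf)
# With non-negative distances the first bin holds exactly the zeros
bins <- cut(df$dist.km,
            breaks = c(-Inf, 0, 50, Inf),
            labels = c("0", "0-50", ">50"))

# States as rows, distance ranges as columns
counts <- table(df$ST, bins)
counts
```

If ST is a factor, as in the posted data, states with no rows in a given bin still appear with a count of 0.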
