I've just started using R and would like to look at the autocorrelation in my data using acf(). My data frame (GL) looks something like this:
GL
well year month value area
684 1994 Jan 8.53 H
684 1994 Feb 8.62 H
684 1994 Mar 8.12 H
684 1994 Apr 8.21 H
684 1995 Jan 8.53 H
684 1995 Feb 8.62 H
684 1995 Mar 8.12 H
684 1995 Apr 8.21 H
684 1996 Jan 8.53 H
684 1996 Feb 8.62 H
684 1996 Mar 8.12 H
684 1996 Apr 8.21 H
101 1994 Jan 8.53 R
101 1994 Feb 8.62 R
101 1994 Mar 8.12 R
101 1994 Apr 8.21 R
101 1995 Jan 8.53 R
101 1995 Feb 8.62 R
101 1995 Mar 8.12 R
101 1995 Apr 8.21 R
101 1996 Jan 8.53 R
101 1996 Feb 8.62 R
101 1996 Mar 8.12 R
101 1996 Apr 8.21 R
I would like to:
1. Calculate the ACF for each well using lapply or some kind of loop (my actual data set has about 100 wells and three groups).
2. Plot the ACF values (as lines) for each well on one graph for each group (so in this case I would have two ACF graphs, H and R).
I can use split and lapply to calculate ACF for each well e.g.
split <- split(GL$value,GL$well)
test <- lapply(split,acf)
But splitting this way doesn't save the area information. If I split like this:
split1 <- split(GL,GL$well)
Then I don't know how to perform lapply on the values for each well.
As you split the data by well,
spl1 <- split(GL, GL$well)
the lapply would look like this.
lapply(spl1, function(x) acf(x$value))
We could make this somewhat nicer, though.
If we lapply over the list indices (seq_along) instead of over the list itself, we get a "counter" that we can use to look up the list names and paste together informative titles. With par(mfrow=c(<rows>, <columns>)) we can set the arrangement of the plots.
par(mfrow=c(1, 2))
lapply(seq_along(spl1), function(x) acf(spl1[[x]]$value,
main=paste0("well ", names(spl1)[x], ", ",
"area ", unique(spl1[[x]]$area))))
Result
This will probably have to be adapted according to how your wells are divided into groups.
(As a side note: it is better to avoid overwriting function names. You call split() and give the result the same name as the function, which can cause confusion, both for yourself and for R. Other popular candidates are data, df, and table. We can always quickly check with ? whether a name is already taken, e.g. ?df.)
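For the second part of the question (one graph per area, one line per well), a minimal base-R sketch could look like the following. It is only one possible approach, not the method used above. The toy data frame GL2, its values, and the object names acf_by_well and m are made up for illustration. It assumes value is numeric (if it is stored as a factor, as in the dput() output under "Data", convert it first) and that all wells have the same number of observations, so that the ACF vectors have equal length.

```r
# Toy data in the same shape as GL (hypothetical values, for illustration only)
set.seed(42)
GL2 <- data.frame(
  well  = rep(c("684", "685", "101", "102"), each = 12),
  value = rnorm(48, mean = 8.5, sd = 0.3),
  area  = rep(c("H", "R"), each = 24)
)
# If value were a factor: GL2$value <- as.numeric(as.character(GL2$value))

# ACF per well without plotting; keep the area label and the ACF values
acf_by_well <- lapply(split(GL2, GL2$well), function(d) {
  list(area = as.character(d$area[1]),
       acf  = as.vector(acf(d$value, plot = FALSE)$acf))
})

# One panel per area, one line per well
areas <- unique(sapply(acf_by_well, `[[`, "area"))
par(mfrow = c(1, length(areas)))
for (a in areas) {
  wells <- Filter(function(w) w$area == a, acf_by_well)
  m <- sapply(wells, `[[`, "acf")   # rows = lags, columns = wells
  matplot(seq_len(nrow(m)) - 1, m, type = "l", lty = 1,
          xlab = "Lag", ylab = "ACF", main = paste("area", a))
  legend("topright", legend = colnames(m), col = seq_len(ncol(m)), lty = 1)
}
```

With about 100 wells a legend becomes unreadable, so it may be better to drop the legend() call and colour the lines by some other attribute.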
Data
# result of `dput(GL)`
GL <- structure(list(well = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("101", "684"), class = "factor"), year = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1996"
), class = "factor"), month = structure(c(3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L), .Label = c("Apr", "Feb", "Jan", "Mar"), class = "factor"),
value = structure(c(3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L), .Label = c("8.12",
"8.21", "8.53", "8.62"), class = "factor"), area = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("H", "R"), class = "factor")), row.names = c(NA,
-24L), class = "data.frame")
You can solve it with data.table:
Let's start with the data (slightly modified from yours, so there will be different values for each well):
GL <- structure(list(well = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("101", "684"), class = "factor"), year = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1996"), class = "factor"), month = structure(c(3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("Apr", "Feb", "Jan", "Mar"), class = "factor"),
value = c(4.65144120692275, 8.98342372477055, 17.983893298544,
15.3687085728161, 8.9577708535362, 7.47583840973675, 16.6564453896135, 11.6158618542831, 23.6109819535632, 14.1604918171652, 11.3882310683839, 20.4579487598967, 3.31275907787494, 22.109053656226, 13.598402187461, 12.3686389743816, 17.9585587936454, 17.3689122993965, 7.38424337399192, 6.93579732463695, 13.2789171519689, 21.2500206897967, 13.5766511948314, 3.58588649751619), area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("H", "R"), class = "factor")), row.names = c(NA, -24L), class = c("data.table", "data.frame"))
Then we create a list for each well:
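library(data.table)
setDT(GL)  # make sure GL is a proper data.table after recreating it from dput() output
GL[, datos := .(list(value)), by = well]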
Each row in the datos variable will have a list with all the values corresponding to the well, so we can drop most of them and keep only the first row of each well, as it has all the information already. That is done with GL[, .SD[1,], by = well] so the result will be a two-row data table. After that, we can chain another expression that will produce and save each plot:
GL[, .SD[1,], by = well][
, {png(filename = paste0(well, "-", area, ".png"),
width = 1600,
height = 1600,
units = "px",
res = 100);
acf(datos[[1]], main = paste("Well:", well,
"Area:", area, sep = " "));
dev.off()},
by = well]
Your two plots will be saved in the current directory with names like "684-H.png" and "101-R.png".
Key point here: data.table takes expressions and not just functions, so it's absolutely possible to produce the plots and save them to any given location.
I am given a big data set with several columns. As an example
set.seed(1)
x <- 1:15
y <- letters[1:3][sample(1:3, 15, replace = T)]
z <- letters[10:13][sample(1:3, 15, replace = T)]
r <- LETTERS[1:3][sample(1:3, 15, replace = TRUE)]
df <- data.frame("Number"=x, "Area"=y, "Section"=z, "Rating"=r)
dput(df)
structure(list(Number = 1:15, Area = structure(c(1L, 2L, 2L, 3L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 1L, 3L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"), Section = structure(c(2L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 3L, 2L), .Label = c("j", "k", "l"), class = "factor"), Rating = structure(c(2L, 2L, 2L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor")), class = "data.frame", row.names = c(NA,-15L))
I would now like to create frequency tables and graphs split by Rating and a chosen category, e.g. given as a string:
Category<-"Section"
data_count <- ddply(df, .(get(Category),Rating), 'count')
data_rel_freq <- ddply(data_count, .(Rating), transform, rel_freq = freq/sum(freq))
dput(data_rel_freq)
structure(list(get.Category. = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("j", "k","l"), class = "factor"), Number = c(4L, 8L, 10L, 12L, 1L, 15L, 2L, 3L, 14L, 7L, 9L, 11L, 13L, 5L, 6L), Area = structure(c(3L, 2L, 1L, 1L, 1L, 3L, 2L, 2L, 2L, 3L, 2L, 1L, 3L, 1L, 3L), .Label = c("a", "b", "c"), class = "factor"), Section = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("j", "k", "l"), class = "factor"), Rating = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), freq = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), rel_freq = c(0.5, 0.5, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.166666666666667, 0.166666666666667, 0.166666666666667, 0.166666666666667, 0.166666666666667, 0.166666666666667)), class = "data.frame", row.names = c(NA, -15L))
Using ggplot
ggplot(data_rel_freq,aes(x = Rating, y = rel_freq,fill = get(Category)))+
geom_bar(position = "fill",stat = "identity",color="black") +
scale_y_continuous(labels = percent_format())+
labs(x = "Rating", y="Relative Frequency")
The issue is that "get(Category)" is now treated as a new column:
get.Category. Number Area Section Rating freq rel_freq
1 k 4 c k A 1 0.5000000
2 k 8 b k A 1 0.5000000
3 j 10 a j B 1 0.1428571
4 j 12 a j B 1 0.1428571
5 k 1 a k B 1 0.1428571
6 k 15 c k B 1 0.1428571
7 l 2 b l B 1 0.1428571
Moreover, the Number column should be summed and the other categories (here: Area) should be dropped, so that we have just one line for Section "k" with Rating "A".
We can use count to get the frequency of the column named in 'Category' (here 'Section') by converting the string to a symbol (sym) and unquoting it (!!). Within the ggplot syntax, aes can take the unquoted symbol in the same way.
library(tidyverse)
library(scales)
library(ggplot2)
df %>%
count(!! rlang::sym(Category), Rating) %>%
group_by(Rating) %>%
mutate(rel_freq = n/sum(n)) %>%
ggplot(., aes(x =Rating, y = rel_freq, fill = !! rlang::sym(Category))) +
geom_bar(position = "fill",stat = "identity",color="black") +
scale_y_continuous(labels = percent_format())+
labs(x = "Rating", y="Relative Frequency")
I am curious how to create another dataset in R that stores, for each level of a grouping variable, the maximum value and the observation matching that maximum.
Here is a fragment of the dataset with just 4 subjects, together with my code:
library(data.table)
my.data <- structure(list(Subject = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), Supervisor = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Emmi", "Pauli"), class = "factor"), Time = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L), .Label = c("01.02.2016 09:45", "01.02.2016 09:48", "01.03.2016 09:55"), class = "factor"), Trials = c(1L, 2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 4L), Force = c(403.8, 464.6, 567.6, 572.9, 572.4, NA, 533.1, 547, 532.6, 503.8,464.6, 367.6, 372.9), ForceProduction = c(1073, 1149.6, 1944.7, 1906.4, 2260.9, NA, 2634.5, 2471.6, 1187.9, 1073, 1149.6,1944.7, 1906.4)), .Names = c("Subject", "Supervisor", "Time", "Trials", "Force", "ForceProduction"), class = "data.frame", row.names = c(NA, -13L))
DT=as.data.table(my.data)
new.data <- DT[,.SD[which.max(Force)],by=Trials]
Each subject did 2-4 trials. For each subject I need to select the trial with the maximum value in the Force column. All other observations in the row with the maximum Force should be preserved; rows that do not hold the maximum Force should be abandoned.
The result of my code is strange: it covers just 3 subjects and ignores the rest, and it does not pick the best trial. I suspect I am going wrong somewhere fundamental.
Can you please direct me to a better solution?
Here's a simple dplyr chain that should give you what you want: group by subject, then keep only the rows where Force is the maximum for that subject.
library(dplyr)
my.data %>%
group_by(Subject) %>%
filter(Force == max(Force, na.rm = TRUE))
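The data.table attempt in the question groups by Trials, which is why it returns the best subject per trial number rather than the best trial per subject. A minimal sketch of the same idea grouped by Subject is shown below. The small table is a hypothetical stand-in for my.data with the same column names.

```r
library(data.table)

# Small stand-in for my.data (hypothetical values, same column names)
DT <- data.table(
  Subject = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Trials  = c(1L, 2L, 3L, 1L, 2L, 1L, 2L),
  Force   = c(403.8, 567.6, 464.6, 572.4, NA, 533.1, 547.0)
)

# Group by Subject (not Trials) and keep the row where Force is largest;
# which.max() ignores NA values and returns the first maximum
best <- DT[, .SD[which.max(Force)], by = Subject]
print(best)
```

One difference from the dplyr version: which.max() keeps only the first row if a subject has several trials tied at the maximum, whereas filter(Force == max(...)) keeps all tied rows.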
For an example dataframe:
df <- structure(list(id = 1:18, region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), age.cat = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("0-18",
"19-35", "36-50", "50+"), class = "factor")), .Names = c("id",
"region", "age.cat"), class = "data.frame", row.names = c(NA,
-18L))
I want to reshape the data, as detailed below:
region 0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Do I simply aggregate or reshape the data? Any help would be much appreciated.
You can do it just using table:
table(df$region, df$age.cat)
0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Using reshape2:
install.packages('reshape2')
library(reshape2)
df1 <- melt(df, measure.vars = 'age.cat')
df1 <- dcast(df1, region ~ value)
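If a plain data frame is wanted without an extra package, one possible base-R sketch is to build the contingency table with xtabs and convert it. The data frame below is rebuilt by hand from the dput() output in the question; the object names tab and wide are just illustrative.

```r
# Same data as in the question, written out without dput()
df <- data.frame(
  id      = 1:18,
  region  = rep(c("a", "b"), times = c(8, 10)),
  age.cat = c("0-18", "0-18", "19-35", "19-35", "19-35", "36-50", "36-50", "50+",
              "0-18", "0-18", "0-18", "0-18", "19-35", "19-35", "36-50",
              "50+", "50+", "50+")
)

# Count rows per region / age category
tab <- xtabs(~ region + age.cat, data = df)

# Convert the 2-d table to a data frame with region as a regular column
wide <- as.data.frame.matrix(tab)
wide <- cbind(region = rownames(wide), wide)
rownames(wide) <- NULL
print(wide)
```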
Is it possible to return ddply results for only certain values of the splitting variable? For example, with the dataframe example:
example <- structure(list(shape = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("circle", "square", "triangle"
), class = "factor"), property = structure(c(1L, 3L, 2L, 1L,
2L, 3L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 1L), .Label = c("color",
"intensity", "size"), class = "factor"), value = structure(c(5L,
2L, 1L, 5L, 4L, 1L, 5L, 6L, 6L, 7L, 4L, 3L, 6L, 5L), .Label = c("3",
"5", "6", "7", "blue", "green", "red"), class = "factor")), .Names = c("shape",
"property", "value"), class = "data.frame", row.names = c(NA,
-14L))
which looks like this
shape property value
1 circle color blue
2 circle size 5
3 circle intensity 3
4 circle color blue
5 square intensity 7
6 square size 3
7 square color blue
8 square color green
9 square color green
10 triangle color red
11 triangle intensity 7
12 triangle size 6
13 triangle color green
14 triangle color blue
I want to return a dataframe containing the number of each shape that has a certain color, which would be something like this:
shape property blue green red
1 circle color 2 0 0
2 square color 1 2 0
3 triangle color 1 1 1
However, I can't seem to get this to return properly! I've gotten part of the way using something like this:
ColorSummary <- ddply(example,.(shape,property="color"), function(example) summary(example$value))
But this returns a data frame with columns for all of the other unique values (from the properties size and intensity, which I do not want):
shape property 3 5 6 7 blue green red
1 circle color 1 1 0 0 2 0 0
2 square NA 1 0 0 1 1 2 0
3 triangle NA 0 0 1 1 1 1 1
What am I doing wrong - is there a way to return a dataframe like the first result that I showed?
Also, while this is a small and fast example, my "real" data are much bigger and take a long time to calculate. Does the speed of ddply improve by limiting to only property="color"?
EDIT: Thanks for the answers so far! Unfortunately for me, I oversimplified the situation and I'm not sure if the dcast solution will work for me. Let me explain - I am actually working with a dataframe example2:
example2 <- structure(list(factory = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), shape = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L), .Label = c("circle",
"square", "triangle"), class = "factor"), property = structure(c(1L,
3L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 3L, 2L
), .Label = c("color", "intensity", "size"), class = "factor"),
value = structure(c(5L, 2L, 1L, 5L, 4L, 1L, 5L, 6L, 6L, 7L,
4L, 3L, 6L, 5L, 5L, 2L, 1L), .Label = c("3", "5", "6", "7",
"blue", "green", "red"), class = "factor")), .Names = c("factory",
"shape", "property", "value"), class = "data.frame", row.names = c(NA,
-17L))
and I am trying to split by both factory and shape. I have a messy solution using ddply:
ColorSummary2 <- ddply(example2,.(factory,shape,property="color"), function(example2) summary(example2$value))
which gives
factory shape property 3 5 6 7 blue green red
1 A circle color 1 1 0 0 2 0 0
2 A square NA 1 0 0 1 1 2 0
3 A triangle NA 0 0 1 1 1 1 1
4 B circle NA 1 1 0 0 1 0 0
but what I would like to return is this (sorry for the messy table, I have trouble formatting tables on here):
factory shape property blue green red
1 A circle color 2 0 0
2 A square NA 1 2 0
3 A triangle NA 1 1 1
4 B circle NA 1 0 0
Is this possible?
EDIT 2: Sorry for all of the edits, I oversimplified my situation way too much. Here is a more complex dataframe that is closer to my real example. This one has a column state, which I do not want to use for splitting. I can do this (messily) with ddply, but can I ignore state using dcast?
example3 <- structure(list(state = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("CA", "FL"
), class = "factor"), factory = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), shape = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L), .Label = c("circle",
"square", "triangle"), class = "factor"), property = structure(c(1L,
3L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 3L, 2L
), .Label = c("color", "intensity", "size"), class = "factor"),
value = structure(c(5L, 2L, 1L, 5L, 4L, 1L, 5L, 6L, 6L, 7L,
4L, 3L, 6L, 5L, 5L, 2L, 1L), .Label = c("3", "5", "6", "7",
"blue", "green", "red"), class = "factor")), .Names = c("state",
"factory", "shape", "property", "value"), class = "data.frame", row.names = c(NA,
-17L))
Using dcast from reshape2:
dcast(...~value,data=subset(example,property=='color'))
Aggregation function missing: defaulting to length
shape property blue green red
1 circle color 2 0 0
2 square color 1 2 0
3 triangle color 1 1 1
EDIT
Using the second data set, example2:
dcast(...~value,data=subset(example2,property=='color'))
Aggregation function missing: defaulting to length
factory shape property blue green red
1 A circle color 2 0 0
2 A square color 1 2 0
3 A triangle color 1 1 1
4 B circle color 1 0 0
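For the third data set, where state should not be used for splitting, the ... in the formula would pull state in as an id variable. A sketch of one way around that is to name the id columns explicitly on the left-hand side of the formula, so state is simply left out. The small data frame ex3 below is a hypothetical, reduced stand-in for example3 with the same column names.

```r
library(reshape2)

# Reduced stand-in for example3 (hypothetical rows, same column names)
ex3 <- data.frame(
  state    = c("CA", "FL", "CA", "FL", "CA", "FL"),
  factory  = c("A", "A", "A", "A", "B", "A"),
  shape    = c("circle", "circle", "square", "square", "circle", "circle"),
  property = c("color", "color", "color", "color", "color", "size"),
  value    = c("blue", "blue", "green", "blue", "blue", "5")
)

# Only factory, shape and property are id variables; state is ignored.
# fun.aggregate = length counts the rows in each cell.
out <- dcast(subset(ex3, property == "color"),
             factory + shape + property ~ value,
             fun.aggregate = length, value.var = "value")
print(out)
```

Subsetting to property == "color" before casting also means less data is processed, which should help with larger inputs.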