Using ggplot to map mean values by group

Using an example dataframe:
df <- structure(list(value = c(10L, 8L, 6L, 4L, 2L, 9L, 7L, 5L, 3L,
1L, 1L, 1L, 2L, 3L, 4L, 3L, 3L, 4L, 5L, 2L, 2L, 4L, 6L, 4L, 7L,
3L, 5L, 4L, 6L, 3L), length = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L), wave = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L)), .Names = c("value", "length", "wave"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-30L), spec = structure(list(cols = structure(list(value = structure(list(), class = c("collector_integer",
"collector")), length = structure(list(), class = c("collector_integer",
"collector")), wave = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("value", "length", "wave")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
I want to plot the mean of 'value' as a line graph against 'length', with one line per group ('wave').
Is this possible directly in ggplot, or do I need to compute the means first?
Without averaging, I would have used:
ggplot(df, aes(x=length, y=value, color=wave)) + geom_point(shape=1)

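We can use stat_summary for this task. The fun.y argument was deprecated in ggplot2 3.3.0 in favour of fun (on older versions, use fun.y = mean instead):
library(ggplot2)
ggplot(df, aes(x = length, y = value, col = as.factor(wave))) +
  stat_summary(geom = "line", fun = mean)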
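If you would rather do the preliminary analysis yourself, the per-group means can be computed first in base R. This is only a sketch on a small made-up data frame (toy and its values are assumptions for illustration, not the asker's data):

```r
# Hypothetical toy data with the same three columns as the question
toy <- data.frame(
  value  = c(10, 8, 9, 7, 1, 1, 3, 3),
  length = c(1, 2, 1, 2, 1, 2, 1, 2),
  wave   = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# One row per length/wave combination, holding the mean of value
means <- aggregate(value ~ length + wave, data = toy, FUN = mean)
print(means)

# The summarised table could then be plotted with geom_line(), e.g.
# ggplot(means, aes(length, value, col = as.factor(wave))) + geom_line()
```

stat_summary performs the same aggregation internally, so this route is mainly useful when you also need the means as a table.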

Related

cld() output has a wrong order of factor levels

I am using the cld() function with emmeans in R, but the order of the factor levels in the output differs from the order I set. Before calling cld(), the by.years output is in the desired order, but the cld() output appears to be in alphabetical order (Light - Moderate - No). I also checked cld.years$Grazing.intensity, and its levels are correct. Is there a way to specify the order of factor levels in the cld() output? Any help is appreciated.
# sample data
plants <- structure(list(Grazing.intensity = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 3L), .Label = c("Light-grazing", "Moderate-grazing", "No-grazing"), class = "factor"), Grazing.intensity1 = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 3L), .Label = c("LG", "MG", "NG"), class = "factor"), Years = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("Dry-year", "Wet-year"), class = "factor"), Month = structure(c(2L, 2L, 2L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 2L, 2L, 3L), .Label = c("Aug.", "Jul.", "Sept."), class = "factor"), Plots = c(1L, 3L, 8L, 6L, 9L, 7L, 2L, 2L, 10L, 10L, 7L, 7L, 9L, 4L, 2L), Species.richness = c(8L, 6L, 10L, 11L, 9L, 5L, 7L, 13L, 10L, 6L, 5L, 5L, 14L, 8L, 10L)), class = "data.frame", row.names = c(NA, -15L))
# set the order of factor levels
plants$Grazing.intensity <- factor(plants$Grazing.intensity, levels =
c('No-grazing','Light-grazing','Moderate-grazing'))
library(lme4)     # lmer()
library(emmeans)  # emmeans()
library(multcomp) # cld() generic
lmer.mod <- lmer(Species.richness ~ Grazing.intensity*Years + (1|Month), data = plants)
by.years <- emmeans(lmer.mod, specs = ~ Grazing.intensity:Years, by = 'Years', type = "response")
# display cld
cld.years <- cld(by.years, Letters = letters)
This is my first time posting sample data on Stack Overflow, so it may be wrong. I used dput().

I solved the issue. The order changed because cld() displays the levels in increasing order of emmean by default. I set sort = FALSE, and the result was displayed in the original factor-level order. I should have read the documentation more thoroughly.
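A minimal sketch of the difference, using the built-in warpbreaks data and a plain lm() instead of the mixed model from the question (the model and object names here are assumptions chosen only to keep the example small; it needs the emmeans, multcomp and multcompView packages installed):

```r
library(emmeans)
library(multcomp)

# Simple stand-in model: mean number of breaks by tension level (L, M, H)
mod <- lm(breaks ~ tension, data = warpbreaks)
emm <- emmeans(mod, ~ tension)

# Default behaviour: rows are ordered by increasing estimated mean
sorted <- cld(emm, Letters = letters)

# sort = FALSE keeps the rows in factor-level order
unsorted <- cld(emm, Letters = letters, sort = FALSE)

print(sorted)
print(unsorted)
```

The factor itself is never reordered. Only the row order of the printed summary changes.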

Re-grouping data based on report run time

I have a folder that serves as a container for a standardized report from a system. The report is run daily, but it may need to be re-run for a certain date or range of dates depending on user requests, so the file content may change significantly.
I would like to create a script that collects each unique date into one dataframe based on its latest run time, and puts the dates that are being revised into another dataframe.
Here is a simplified version of the table:
structure(list(Source = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("11-Feb-20", "12-Feb-20"
), class = "factor"), FarmType = structure(c(3L, 4L, 5L, 1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L), .Label = c("AJSKJA",
"ASKJKA", "GHDGH", "KLKIUK", "KLSAKJ"), class = "factor"), FarmName = structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L), .Label = c("",
"JJHGH", "JKJKK", "JUISO", "SDLLS"), class = "factor"), Perform = c(13.04144378,
1.230474165, 1.230474165, 13.9407486, 13.9407486, 13.04144378,
1.230474165, 1.230474165, 13.9407486, 13.9407486, 13.04144378,
15.26566, 1.230474165, 13.9407486), RunDate = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("02/14/2020",
"02/15/2020"), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
Please note that the number of columns does not change; however, the number of rows may increase or decrease after each re-run.
The idea is that the first group, based on the most recent run, represents the up-to-date information (corrections, revisions, etc.), while the second group shows what is being revised and how the numbers and data are changing.
Expected output for the first group:
structure(list(Source = c(3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L),
Date = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("11-Feb-20",
"12-Feb-20"), class = "factor"), FarmType = structure(c(3L,
4L, 5L, 1L, 3L, 4L, 5L, 1L, 2L), .Label = c("AJSKJA", "ASKJKA",
"GHDGH", "KLKIUK", "KLSAKJ"), class = "factor"), FarmName = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L), .Label = c("", "JJHGH",
"JKJKK", "JUISO", "SDLLS"), class = "factor"), Perform = c(13.04144378,
15.26566, 1.230474165, 13.9407486, 13.04144378, 1.230474165,
1.230474165, 13.9407486, 13.9407486), RunDate = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("02/14/2020",
"02/15/2020"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
Expected output for the second group:
structure(list(Source = c(1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L),
Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "11-Feb-20", class = "factor"),
FarmType = structure(c(3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L
), .Label = c("AJSKJA", "ASKJKA", "GHDGH", "KLKIUK", "KLSAKJ"
), class = "factor"), FarmName = structure(c(1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L), .Label = c("", "JJHGH", "JKJKK",
"JUISO", "SDLLS"), class = "factor"), Perform = c(13.04144378,
1.230474165, 1.230474165, 13.9407486, 13.9407486, 13.04144378,
15.26566, 1.230474165, 13.9407486), RunDate = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("02/14/2020",
"02/15/2020"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
Thank you for your time. Please let me know if you have questions.

We could group by 'Date' and, after converting 'RunDate' to Date class, keep the rows in each group where 'RunDate' is the latest (df1 is the input data frame):
library(lubridate)
library(dplyr)
new1 <- df1 %>%
  group_by(Date) %>%
  filter(mdy(RunDate) == max(mdy(RunDate)))
For the second set, we can keep the groups where the number of distinct 'RunDate' values is more than 1:
new2 <- df1 %>%
  group_by(Date) %>%
  filter(n_distinct(RunDate) > 1)
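The same two filters can be sketched in base R with ave(), which returns a per-group value for every row. The small runs data frame below is a made-up stand-in for the real report (an assumption for illustration only):

```r
# Hypothetical miniature of the report table
runs <- data.frame(
  Source  = c(1, 1, 2, 3, 3),
  Date    = c("11-Feb-20", "11-Feb-20", "12-Feb-20", "11-Feb-20", "11-Feb-20"),
  RunDate = c("02/14/2020", "02/14/2020", "02/14/2020", "02/15/2020", "02/15/2020")
)

# Parse the run date and work with its numeric form so ave() stays numeric
run_num <- as.numeric(as.Date(runs$RunDate, format = "%m/%d/%Y"))

# Latest run per report date, repeated for each row of that date
latest <- ave(run_num, runs$Date, FUN = max)
new1 <- runs[run_num == latest, ]

# Number of distinct runs per report date; more than one means a revision
n_runs <- ave(run_num, runs$Date, FUN = function(x) length(unique(x)))
new2 <- runs[n_runs > 1, ]

print(new1)
print(new2)
```

Here new1 holds the most recent version of each date, and new2 holds every run of the dates that were re-run.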

how to count the number of rows of specific column that has specific character

I have data in which I want to count the rows of a specific column that contain a specific character. The data looks like the following:
df<-structure(list(Gene.refGene = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1BG", "A1BG-AS1", "A1CF",
"A1CF;PRKG1"), class = "factor"), Chr = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr10", "chr19"
), class = "factor"), Start = c(58858232L, 58858615L, 58858676L,
58859052L, 58859055L, 58859066L, 58859510L, 58863162L, 58864479L,
58864150L, 58864867L, 58864879L, 58865857L, 52566433L, 52569637L,
52571047L, 52573510L, 52576068L, 52580561L, 52603659L, 52619845L,
52625849L, 52642500L, 52650951L, 52675605L, 52703952L, 52723140L,
52723638L), End = c(58858232L, 58858615L, 58858676L, 58859052L,
58859055L, 58859066L, 58859510L, 58863166L, 58864479L, 58864150L,
58864867L, 58864879L, 58865857L, 52566433L, 52569637L, 52571047L,
52573510L, 52576068L, 52580561L, 52603659L, 52619845L, 52625849L,
52642500L, 52650958L, 52675605L, 52703952L, 52723140L, 52723638L
), Ref = structure(c(3L, 5L, 2L, 2L, 3L, 2L, 5L, 7L, 6L, 6L,
2L, 1L, 5L, 6L, 5L, 3L, 2L, 5L, 6L, 3L, 3L, 6L, 3L, 4L, 3L, 6L,
6L, 3L), .Label = c("-", "A", "C", "CTCTCTCT", "G", "T", "TTTTT"
), class = "factor"), Alt_df1 = structure(c(1L, 1L, 4L, 4L, 1L,
4L, 5L, 1L, 3L, 3L, 4L, 4L, 3L, 1L, 2L, 5L, 1L, 2L, 1L, 5L, 5L,
2L, 5L, 1L, 4L, 3L, 4L, 2L), .Label = c("-", "A", "C", "G", "T"
), class = "factor")), class = "data.frame", row.names = c(NA,
-28L))
I want to know how many rows of the column named "Alt_df1" are missing, "-", or NA.

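Here is an answer using which and base R's LETTERS constant. It counts every value that is not a single uppercase letter, NA included:
length(which(!df$Alt_df1 %in% LETTERS))
#[1] 8
Or, to count only the "-" entries (NA is not counted here):
length(which(df$Alt_df1 == "-"))
#[1] 8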

One way would be to create a logical vector using %in% and then sum over them to count the number of occurrences.
sum(df$Alt_df1 %in% c("-", NA))
#[1] 8
Or we can also subset and count the number of rows.
nrow(subset(df, Alt_df1 %in% c("-", NA)))
which can also be done in dplyr by
library(dplyr)
df %>% filter(Alt_df1 %in% c("-", NA)) %>% nrow
Another option using grepl
with(df, sum(grepl("-", Alt_df1)) + sum(is.na(Alt_df1)))
and I am sure there are multiple other ways.
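The reason %in% is convenient here is that it treats NA as an ordinary value to match, whereas == returns NA for missing entries. A small sketch on a made-up factor (alt is a hypothetical stand-in for df$Alt_df1, which contains no NA in the sample data):

```r
# Hypothetical column with two "-" entries and one NA
alt <- factor(c("-", "A", NA, "G", "-"))

# %in% matches NA explicitly, so both "-" and NA are counted
n_in <- sum(alt %in% c("-", NA))

# == yields NA for the missing entry, so it must be dropped or handled
n_eq <- sum(alt == "-", na.rm = TRUE)

# Combining == with is.na() gives the same count as %in%
n_or <- sum(alt == "-" | is.na(alt))

print(c(n_in = n_in, n_eq = n_eq, n_or = n_or))
```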

how do you create linear line on geom_bar in ggplot2

I need to create a stacked ggplot bar plot from this data set, with a linear trend line drawn over it:
dput(t)
structure(list(Date = structure(c(16436, 16436, 16436, 16467,
16467, 16467, 16467, 16467, 16679, 16679, 16679, 16679, 16679
), class = "Date"), Applicatio = structure(c(4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L), .Label = c("DB", "Opt",
"Tom", "Web"), class = "factor"), Code = structure(c(1L, 2L,
4L, 3L, 1L, 2L, 4L, 3L, 3L, 1L, 2L, 4L, 3L), .Label = c("ch",
"db", "tt", "zz"), class = "factor"), m = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("2015-01",
"2015-02", "2015-09"), class = "factor"), count = c(1L, 3L, 1L,
4L, 1L, 7L, 1L, 9L, 1L, 6L, 4L, 7L, 9L), Total = c(1L, 12L, 1L,
2L, 1L, 20L, 1L, 7L, 7L, 9L, 50L, 3L, 6L)), .Names = c("Date",
"Applicatio", "Code", "m", "count", "Total"), row.names = c(NA,
-13L), class = "data.frame")
I am trying this:
ggplot(subset(t, Date> as.Date(c("2015-01-01", format="%Y-%m-%d"))), aes(m,fill=Code))+geom_bar()+
geom_smooth(aes(m,Total),method="lm", se=FALSE)+
guides(colour=FALSE)

I am not entirely sure what you are trying to achieve, but it looks like you want this:
ggplot(subset(t, Date > as.Date("2015-01-01", format = "%Y-%m-%d")), aes(m, fill = Code)) +
  geom_bar() +
  geom_smooth(aes(m, Total, group = 1), method = "lm", se = FALSE) +
  guides(colour = "none")
Basically, the c() call wrapped around the arguments of as.Date() was not needed, and you have to set group = 1 inside geom_smooth, as the warning mentions. I have also written guides(colour = "none"), because passing FALSE to guides() is deprecated in recent ggplot2 versions.
So yes, you can have a linear line on geom_bar.

Compare columns and put the output in additional column

Let's start with the example of the data:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor")), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors"), class = "data.frame", row.names = c(NA,
-20L))
I would like to compare two pairs of columns. The first pair is P1_location_subacon with P2_location_subacon. The second pair is P1_location_all_predictors with P2_location_all_predictors.
How do I want to compare them? Each column holds the "locations" of the fruit or vegetable. So:
if the location is the same in the first pair (P1/2_location_subacon), I would like to put the number 2 in the additional column.
if the locations match in the second pair (P1/2_location_all_predictors), I would like to put the number 1 in the additional column. This one is a bit more complicated, because not all of the locations have to be the same: at least one of them has to be shared by both fruits/vegetables.
if they differ in both cases, put 0. This situation does not occur in the example data.
To summarize, here is the output I would like to achieve:
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple",
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L,
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange",
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"),
P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair",
"Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"),
P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge",
"Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L,
3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed",
"Table,Shelf,Fridge"), class = "factor"), X = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), Correct = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("P1",
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon",
"P2_location_all_predictors", "X", "Correct"), class = "data.frame", row.names = c(NA,
-20L))

EDIT: I have improved my answer using feedback from the question "Test two columns of strings for match row-wise in R".
Here DT is your table:
library(data.table)
setDT(DT)
DT <- data.table(sapply(DT,as.character))
DT[, P1_location_all_predictors := gsub(",","|",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub(",","|",P1_location_subacon)]
DT[, match_all_pred := grepl(P1_location_all_predictors, P2_location_all_predictors) + 0, by = P1_location_all_predictors]
DT[, match_subacon := grepl(P1_location_subacon, P2_location_subacon), by = P1_location_subacon]
DT[, P1_location_all_predictors := gsub("\\|",",",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub("\\|",",",P1_location_subacon)]
I opted for two columns instead of your 0/1/2 notation, which makes the code less straightforward because you have to rely on nested ifs. I also think that multiple columns are better, as you can clearly see the F/F, T/F, F/T, and T/T cases.
If you must create the 0/1/2 column, you can call
DT[, MyCol := match_all_pred - match_subacon*match_all_pred+match_subacon*2]
which assumes that a subacon match supersedes an all-predictors match.
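To see that the formula gives the intended coding, it can be evaluated over the four possible combinations of the two match flags. This is a quick check in base R (the cases table is made up for illustration):

```r
# All four combinations of the two 0/1 match flags
cases <- expand.grid(match_all_pred = c(0, 1), match_subacon = c(0, 1))

# Same arithmetic as in the data.table call above
cases$MyCol <- with(
  cases,
  match_all_pred - match_subacon * match_all_pred + match_subacon * 2
)

print(cases)
```

Whenever match_subacon is 1 the first two terms cancel and the result is 2. Otherwise the result is simply match_all_pred, that is 0 or 1.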

Here is another way:
myData <- data.frame(sapply(myData, as.character), stringsAsFactors=FALSE)
doesIntersect <- function(setA, setB) {length(intersect(setA,setB)) > 0}
myData$Correct <- 0
myData$Correct[mapply(doesIntersect, strsplit(myData$P1_location_all_predictors, ","), strsplit(myData$P2_location_all_predictors, ","))] <- 1
myData$Correct[mapply(setequal, strsplit(myData$P1_location_subacon, ","), strsplit(myData$P2_location_subacon, ","))] <- 2
> myData$Correct
[1] 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
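The "at least one shared location" test is the core of this approach, so here it is in isolation on two made-up pairs of location strings (p1 and p2 are hypothetical, not taken from the question's data):

```r
# Hypothetical comma-separated location lists for two pairs of items
p1 <- c("Table,Shelf,Fridge", "Desk,Bag")
p2 <- c("Shelf,Fridge,Bed", "Chair")

# TRUE when the two sets of locations share at least one element
does_intersect <- function(set_a, set_b) {
  length(intersect(set_a, set_b)) > 0
}

# strsplit() turns each string into a character vector; mapply() pairs them row-wise
overlaps <- mapply(does_intersect, strsplit(p1, ","), strsplit(p2, ","))

print(overlaps)
```

The first pair shares "Shelf" and "Fridge", while the second pair has nothing in common.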
