Why are my ggplot2 aesthetics the wrong length? - r

I am trying to average reps of data, subset one treatment, then make a bar graph of the response and another factor. My plot ends up not working. Any help would be much appreciated.
My data:
data <- structure(list(Sample = c(1011L, 1012L, 1014L, 1024L, 1025L,
1026L), Collection = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), Irrigation = structure(c(3L, 3L, 3L,
5L, 5L, 5L), .Label = c("Rate1", "Rate2", "Rate3", "Rate4", "Rate5"
), class = "factor"), Variety = structure(c(2L, 1L, 3L, 3L, 2L,
1L), .Label = c("Hodag", "Lamoka", "Snowden"), class = "factor"),
Suc = c(0.7333, 0.4717, 0.5883, 0.6783, 0.8283, 0.6833),
Gluc = c(0.03, 0.04, 0.043, 0.075, 0.057, 0.087), L = c(59.48,
57.59, 59.25, 66.45, 68.29, 65.65), a = c(4.36, 6.85, 3.43,
1.7, 0.78, 2.84), b = c(26.82, 27.6, 26.2, 26.14, 25.37,
27.19), NoDefect = c(100L, 100L, 100L, 92L, 100L, 100L),
Defect = c(0L, 0L, 0L, 8L, 0L, 0L)), row.names = c(NA, 6L
), class = "data.frame")
Averaging between reps:
dataAvgSuc <- data %>%
dplyr::group_by(Collection, Irrigation, Variety) %>%
dplyr::summarise(meanSuc=mean(Suc))
Made 'Collection' a factor:
dataAvgSuc$Collection <- as.factor(dataAvgSuc$Collection)
Subset by variety:
subLamoka <- subset(dataAvgSuc, Variety=="Lamoka")
subHodag <- subset(dataAvgSuc, Variety=="Hodag")
subSnowden <- subset(dataAvgSuc, Variety=="Snowden")
Attempted ggplot:
sucPlot <-ggplot(data=subLamoka, aes(x=dataAvgSuc$Collection,
y=meanSuc)) + geom_bar(stat="identity")
Error code:
Error: Aesthetics must be either length 1 or the same as the data (10):
x, y
However, both the x and y have 30 entries when I look at them.

Trev,
I had some trouble reproducing the issue because the sample data contain only 6 observations, not 30, so I'm not sure whether the solution below will work for you.
I used the code you supplied to create the dataframe:
data <- structure(list(Sample = c(1011L, 1012L, 1014L, 1024L, 1025L, 1026L),
Collection = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"),
Irrigation = structure(c(3L, 3L, 3L,5L, 5L, 5L), .Label = c("Rate1", "Rate2",
"Rate3", "Rate4", "Rate5"
), class = "factor"), Variety = structure(c(2L, 1L, 3L, 3L, 2L,
1L), .Label = c("Hodag", "Lamoka", "Snowden"), class = "factor"),
Suc = c(0.7333, 0.4717, 0.5883, 0.6783, 0.8283, 0.6833),
Gluc = c(0.03, 0.04, 0.043, 0.075, 0.057, 0.087),
L = c(59.48, 57.59, 59.25, 66.45, 68.29, 65.65),
a = c(4.36, 6.85, 3.43, 1.7, 0.78, 2.84),
b = c(26.82, 27.6, 26.2, 26.14, 25.37,27.19),
NoDefect = c(100L, 100L, 100L, 92L, 100L, 100L),
Defect = c(0L, 0L, 0L, 8L, 0L, 0L)),
row.names = c(NA, 6L), class = "data.frame")
data$Collection
However, your Collection factor is defined with two levels, but only one appears in the example. Perhaps this could be why the averages were coming out greater than 1? I modified the code below so that both levels of Collection are represented in the data.
data2 <- structure(list(Sample = c(1011L, 1012L, 1014L, 1024L, 1025L, 1026L),
Collection = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1",
"2"), class = "factor"),
Irrigation = structure(c(3L, 3L, 3L,5L, 5L, 5L), .Label = c("Rate1", "Rate2",
"Rate3", "Rate4", "Rate5"
), class = "factor"), Variety = structure(c(2L, 1L, 3L, 3L, 2L,
1L), .Label = c("Hodag", "Lamoka", "Snowden"), class = "factor"),
Suc = c(0.7333, 0.4717, 0.5883, 0.6783, 0.8283, 0.6833),
Gluc = c(0.03, 0.04, 0.043, 0.075, 0.057, 0.087),
L = c(59.48, 57.59, 59.25, 66.45, 68.29, 65.65),
a = c(4.36, 6.85, 3.43, 1.7, 0.78, 2.84),
b = c(26.82, 27.6, 26.2, 26.14, 25.37,27.19),
NoDefect = c(100L, 100L, 100L, 92L, 100L, 100L),
Defect = c(0L, 0L, 0L, 8L, 0L, 0L)),
row.names = c(NA, 6L), class = "data.frame")
data2$Collection
Since you're using dplyr, you can keep piping that object into ggplot. I don't think you need to create subsets as new data frames; you can instead graph the varieties in separate panels with facet_wrap. I'm also using geom_col instead of geom_bar: by default geom_bar counts the rows in each group (stat = "count"), whereas geom_col plots the values as given, so geom_col may be the better fit for plotting an average. Also, because the example below pipes the data into ggplot, the "data =" argument typically used in ggplot commands is not needed.
First with data:
library(dplyr)
library(ggplot2)
data %>%
dplyr::group_by(Collection,Irrigation, Variety) %>%
dplyr::summarise(meanSuc=mean(Suc)) %>%
ggplot(aes(x = Collection, y = meanSuc)) +
geom_col() +
facet_wrap(.~Variety)
Incorporate Irrigation:
data %>%
dplyr::group_by(Collection,Irrigation, Variety) %>%
dplyr::summarise(meanSuc=mean(Suc)) %>%
ggplot(aes(x = Collection, y = meanSuc, fill = Irrigation)) +
geom_col() +
facet_wrap(.~Variety)
Using data2 instead, as defined above, produces Collection levels 1 and 2 side by side on the graph. With this method I was able to generate a plot, and all averages were below 1, roughly between 0.4 and 0.8.
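For what it's worth, the error in the question most likely comes from mixing two data frames: the plot is given data = subLamoka, but x is mapped to dataAvgSuc$Collection, a vector taken from the larger, unsubsetted table, so the aesthetics no longer match the number of rows in the plot data. A minimal sketch with made-up numbers (toy_avg and toy_sub are hypothetical names, not objects from the question), assuming ggplot2 is installed:

```r
library(ggplot2)

# Hypothetical summary table: 2 collections x 3 varieties (made-up means)
toy_avg <- data.frame(
  Collection = factor(rep(c("1", "2"), each = 3)),
  Variety    = rep(c("Hodag", "Lamoka", "Snowden"), times = 2),
  meanSuc    = c(0.58, 0.78, 0.63, 0.61, 0.70, 0.66)
)
toy_sub <- subset(toy_avg, Variety == "Lamoka")  # 2 rows

# Problematic: x has 6 values (full table) but the plot data has 2 rows
# ggplot(toy_sub, aes(x = toy_avg$Collection, y = meanSuc)) + geom_col()

# Works: bare column names are looked up in the data passed to ggplot()
p <- ggplot(toy_sub, aes(x = Collection, y = meanSuc)) +
  geom_col()
print(p)
```

Referring to columns by bare name inside aes() keeps every aesthetic tied to the same data frame, which also makes faceting and subsetting behave as expected.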

Related

position=dodge in geom_col in barplot

Here is a dataset of soccer players. I need to visualise the total number of yellow cards received next to the number of games played per country in one bar plot, so I need to calculate the total number of yellow cards and the total number of games per league country and bring the data into long format.
dput(head(new_soccer_referee))
structure(list(playerShort = c("lucas-wilchez", "john-utaka",
"abdon-prats", "pablo-mari", "ruben-pena", "aaron-hughes"), player = c("Lucas Wilchez",
"John Utaka", " Abdón Prats", " Pablo Marí", " Rubén Peña", "Aaron Hughes"
), club = c("Real Zaragoza", "Montpellier HSC", "RCD Mallorca",
"RCD Mallorca", "Real Valladolid", "Fulham FC"), leagueCountry = c("Spain",
"France", "Spain", "Spain", "Spain", "England"), birthday = structure(c(4990,
4390, 8386, 8643, 7868, 3598), class = "Date"), height = c(177L,
179L, 181L, 191L, 172L, 182L), weight = c(72L, 82L, 79L, 87L,
70L, 71L), position = c("Attacking Midfielder", "Right Winger",
NA, "Center Back", "Right Midfielder", "Center Back"), games = c(1L,
1L, 1L, 1L, 1L, 1L), victories = c(0L, 0L, 0L, 1L, 1L, 0L), ties = c(0L,
0L, 1L, 0L, 0L, 0L), defeats = c(1L, 1L, 0L, 0L, 0L, 1L), goals = c(0L,
0L, 0L, 0L, 0L, 0L), yellowCards = c(0L, 1L, 1L, 0L, 0L, 0L),
yellowReds = c(0L, 0L, 0L, 0L, 0L, 0L), redCards = c(0L,
0L, 0L, 0L, 0L, 0L), photoID = c("95212.jpg", "1663.jpg",
NA, NA, NA, "3868.jpg"), rater1 = c(0.25, 0.75, NA, NA, NA,
0.25), rater2 = c(0.5, 0.75, NA, NA, NA, 0), refNum = c(1L,
2L, 3L, 3L, 3L, 4L), refCountry = c(1L, 2L, 3L, 3L, 3L, 4L
), Alpha_3 = c("GRC", "ZMB", "ESP", "ESP", "ESP", "LUX"),
meanIAT = c(0.326391469021736, 0.203374724564378, 0.369893594187172,
0.369893594187172, 0.369893594187172, 0.325185154120009),
nIAT = c(712L, 40L, 1785L, 1785L, 1785L, 127L), seIAT = c(0.000564112354334542,
0.0108748941063986, 0.000229489640866464, 0.000229489640866464,
0.000229489640866464, 0.00329680952361961), meanExp = c(0.396,
-0.204081632653061, 0.588297311544544, 0.588297311544544,
0.588297311544544, 0.538461538461538), nExp = c(750L, 49L,
1897L, 1897L, 1897L, 130L), seExp = c(0.0026964901062936,
0.0615044043187379, 0.00100164730649311, 0.00100164730649311,
0.00100164730649311, 0.013752210497518), BMI = c(22.98190175237,
25.5922099809619, 24.1140380330271, 23.8480304816206, 23.6614386154678,
21.4346093466973), position_new = c("Offense", "Offense",
"Goalkeeper", "Defense", "Midfield", "Defense"), rater_mean = c(0.375,
0.75, NA, NA, NA, 0.125), ageinyear = c(28, 30, 19, 18, 20,
32), ageinyears = c(28, 30, 19, 18, 20, 32)), row.names = c(NA,
6L), class = "data.frame")
Use the data to draw a bar plot with the following characteristics:
– The x-axis displays the league country while the y-axis displays the number of games and the number of cards
– For each country there are two bars next to each other: one for the games played and one for the cards received
barplot <- ggplot(new_soccer_referee,aes(x=leagueCountry,y=number))
barplot +
geom_bar(fill=c("games","yellowCards")) +
geom_col(Position="dodge") +
labels(x="leagueCountry", y="number")
ggplot
I know it is pretty messy, but I am really confused about how to build up the layers with ggplot and how to get the data into long format. Can anyone help?
One option would be to first aggregate your data to compute the number of yellowCards and games by leagueCountry. Afterwards you could convert to long format, which makes it easy to plot via ggplot2.
Using some fake random example data to mimic your real data:
set.seed(123)
new_soccer_referee <- data.frame(
player = sample(letters, 20),
leagueCountry = sample(c("Spain", "France", "England", "Italy"), 20, replace = TRUE),
yellowCards = sample(1:5, 20, replace = TRUE),
games = sample(1:20, 20, replace = TRUE)
)
library(dplyr)
library(tidyr)
library(ggplot2)
new_soccer_referee_long <- new_soccer_referee %>%
group_by(leagueCountry) %>%
summarise(across(c(yellowCards, games), sum)) %>%
pivot_longer(-leagueCountry, names_to = "variable", values_to = "number")
ggplot(new_soccer_referee_long, aes(leagueCountry, number, fill = variable)) +
geom_col(position = "dodge")
Something like this:
library(tidyverse)
new_soccer_referee %>%
select(leagueCountry, games, yellowCards) %>%
group_by(leagueCountry) %>%
summarise(games = sum(games),
yellowCards = sum(yellowCards)
) %>%
pivot_longer(-leagueCountry) %>%
ggplot(aes(x=leagueCountry, fill=name, y=value)) +
geom_col(position = position_dodge())
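As a side note on the original attempt: R argument names are case-sensitive, so position has to be lower-case, and axis titles are set with labs() rather than labels(), which is an unrelated base R function. A minimal sketch with hypothetical, already-aggregated totals (totals_long and its numbers are made up for illustration):

```r
library(ggplot2)

# Hypothetical totals per country, already in long format
totals_long <- data.frame(
  leagueCountry = rep(c("England", "France", "Spain"), each = 2),
  variable      = rep(c("games", "yellowCards"), times = 3),
  number        = c(40, 12, 35, 9, 50, 20)
)

p <- ggplot(totals_long, aes(x = leagueCountry, y = number, fill = variable)) +
  geom_col(position = "dodge") +             # lower-case `position`
  labs(x = "League country", y = "Number",   # labs(), not labels()
       fill = NULL)
print(p)
```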

Multiple vertical shaded area

I am plotting the proportion of deep sleep (y axis) against days (x axis). I would like to add vertical shaded areas to make the plot easier to read (e.g. grey for weekends, orange for sick periods...).
I have tried geom_ribbon (I created a variable that takes the value 30, which is the top of my y axis, when the day falls on a weekend; that information is given in another column), but instead of rectangles I get trapezoids.
In another post, someone proposed using "geom_rect", or "annotate" if one knows the x and y coordinates, but I don't see how to adapt that to my case, where I want the coloured area repeated for every weekend (it is not exactly every 7 days because some data are missing).
Do you have any ideas?
Many thanks in advance!
ggplot(Sleep.data, aes(x = DATEID)) +
geom_line(aes(y = P.DEEP, group = 1), col = "deepskyblue3") +
geom_point(aes(y = P.DEEP, group = 1, col = Sign.deep)) +
guides(col=FALSE) +
geom_ribbon(aes(ymin = min, ymax = max.WE), fill = '#6495ED80') +
facet_grid(MONTH~.) +
geom_hline(yintercept = 15, col = "forestgreen") +
geom_hline(yintercept = 20, col = "forestgreen", linetype = "dashed") +
geom_vline(xintercept = c(7,14,21,28), col = "grey") +
scale_x_continuous(breaks=seq(0,28,7)) +
scale_y_continuous(breaks=seq(0,30,5)) +
labs(x = "Days",y="Proportion of deep sleep stage", title = "Deep sleep")
[Figure: Proportion of deep sleep vs time]
Head(Sleep.data)
> dput(head(Sleep.data))
structure(list(DATE = structure(c(1L, 4L, 7L, 10L, 13L, 16L), .Label = c("01-Dec-17",
"01-Feb-18", "01-Jan-18", "02-Dec-17", "02-Feb-18", "02-Jan-18",
"03-Dec-17", "03-Feb-18", "03-Jan-18", "04-Dec-17", "04-Feb-18",
"04-Jan-18", "05-Dec-17", "05-Feb-18", "05-Jan-18", "06-Dec-17",
"06-Feb-18", "06-Jan-18", "07-Dec-17", "07-Feb-18", "07-Jan-18",
"08-Dec-17", "08-Jan-18", "09-Dec-17", "09-Feb-18", "09-Jan-18",
"10-Dec-17", "10-Jan-18", "11-Dec-17", "11-Feb-18", "11-Jan-18",
"12-Dec-17", "12-Jan-18", "13-Dec-17", "13-Feb-18", "13-Jan-18",
"14-Dec-17", "14-Feb-18", "14-Jan-18", "15-Dec-17", "15-Jan-18",
"16-Dec-17", "16-Jan-18", "17-Dec-17", "17-Jan-18", "18-Dec-17",
"18-Jan-18", "19-Dec-17", "19-Jan-18", "20-Dec-17", "21-Dec-17",
"21-Jan-18", "22-Dec-17", "22-Jan-18", "23-Dec-17", "23-Jan-18",
"24-Dec-17", "24-Jan-18", "25-Dec-17", "25-Jan-18", "26-Dec-17",
"26-Jan-18", "27-Dec-17", "27-Jan-18", "28-Dec-17", "28-Jan-18",
"29-Dec-17", "29-Jan-18", "30-Dec-17", "30-Jan-18", "31-Dec-17",
"31-Jan-18"), class = "factor"), DATEID = 1:6, MONTH = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("Decembre", "Janvier", "Février"
), class = "factor"), DURATION = c(8.08, 7.43, 6.85, 6.23, 7.27,
6.62), D.DEEP = c(1.67, 1.37, 1.62, 1.75, 1.95, 0.9), P.DEEP = c(17L,
17L, 21L, 24L, 25L, 12L), STIMS = c(0L, 0L, 0L, 0L, 390L, 147L
), D.REM = c(1.7, 0.95, 0.95, 1.43, 1.47, 0.72), P.REM = c(17L,
11L, 12L, 20L, 19L, 9L), D.LIGHT = c(4.7, 5.12, 4.27, 3.05, 3.83,
4.98), P.LIGHT = c(49L, 63L, 55L, 43L, 49L, 66L), D.AWAKE = c(1.45,
0.58, 0.47, 0.87, 0.37, 0.85), P.AWAKE = c(15L, 7L, 6L, 12L,
4L, 11L), WAKE.UP = c(-2L, 0L, 2L, -1L, 3L, 1L), AGITATION = c(-1L,
-3L, -1L, -2L, 2L, -1L), FRAGMENTATION = c(1L, -2L, 2L, 1L, 0L,
-1L), PERIOD = structure(c(3L, 3L, 4L, 4L, 4L, 4L), .Label = c("HOLIDAYS",
"SICK", "WE", "WORK"), class = "factor"), SPORT = structure(c(2L,
1L, 2L, 2L, 2L, 1L), .Label = c("", "Day", "Evening"), class = "factor"),
ACTIVITY = structure(c(6L, 1L, 3L, 4L, 5L, 1L), .Label = c("",
"Bkool", "eBike", "Gym", "Natation", "Run"), class = "factor"),
TABLETS = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), Ratio = c(1.15,
2.36, 3.45, 2.01, 5.27, 1.06), Sign = structure(c(2L, 2L,
2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
Sign.ratio = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), Sign.deep = structure(c(2L, 2L,
2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
Sign.awake = structure(c(1L, 2L, 2L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), Sign.light = structure(c(2L, 1L,
1L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
index = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), min = c(0, 0, 0, 0, 0, 0), max.WE = c(30,
30, 0, 0, 0, 0)), .Names = c("DATE", "DATEID", "MONTH", "DURATION",
"D.DEEP", "P.DEEP", "STIMS", "D.REM", "P.REM", "D.LIGHT", "P.LIGHT",
"D.AWAKE", "P.AWAKE", "WAKE.UP", "AGITATION", "FRAGMENTATION",
"PERIOD", "SPORT", "ACTIVITY", "TABLETS", "Ratio", "Sign", "Sign.ratio",
"Sign.deep", "Sign.awake", "Sign.light", "index", "min", "max.WE"
), row.names = c(NA, 6L), class = "data.frame")
Thanks for adding the data, that makes it easier to understand exactly what you're working with and to confirm that an answer actually addresses your question.
I thought it would be helpful to make a separate table with just the start and end of each contiguous set of rows with the same PERIOD. I did this using dplyr::case_when, assuming we should mark dates as a "start" if they are the first row in the table (row_number() == 1), or they have a different PERIOD value than the prior row. I mark dates as an "end" if they are the last row of the table, or have a different PERIOD than the next row. I only keep the starts and ends, and spread these into new columns called start and end.
library(tidyverse)
Period_ranges <- Sleep.data %>%
mutate(period_status = case_when(row_number() == 1 ~ "start",
PERIOD != lag(PERIOD) ~ "start",
row_number() == n() ~ "end",
PERIOD != lead(PERIOD) ~ "end",
TRUE ~ "other")) %>%
filter(period_status %in% c("start", "end")) %>%
select(DATEID, PERIOD, period_status) %>%
mutate(PERIOD_NUM = cumsum(PERIOD != lag(PERIOD) | row_number() == 1)) %>%
spread(period_status, DATEID)
# Output based on sample data only. If there's a problem with the full data, please add more. To share full data, use `dput(Sleep.data)` or to share 20 rows use `dput(head(Sleep.data, 20))`.
> Period_ranges
  PERIOD PERIOD_NUM end start
1     WE          1   2     1
2   WORK          2   6     3
We can now use that table in the plot. If you want to show only some PERIOD types, or style them separately, you could replace Period_ranges in the code below with, for example, Period_ranges %>% filter(PERIOD == "WE").
ggplot(Sleep.data, aes(x = DATEID)) +
# Here I specify that this geom should use its own data.
# I start the rectangles half a day before and end half a day after to fill the space.
geom_rect(data = Period_ranges, inherit.aes = F,
aes(xmin = start - 0.5, xmax = end + 0.5,
ymin = 0, ymax = 30,
fill = PERIOD), alpha = 0.5) +
# Here we can specify the shading color for each type of PERIOD
scale_fill_manual(values = c(
"WE" = '#6495ED80',
"WORK" = "gray60"
)) +
# rest of your code
Chart based on data sample:
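As an alternative way to build the range table, here is a minimal self-contained sketch that uses base R's rle() instead of the case_when/spread approach above. It is a swapped-in technique, not the code from this answer. The data frame sleep_toy, its values, and the fill colours are all hypothetical; the sketch also ignores the faceting by MONTH. Note that rle() needs an atomic vector, so a factor column such as PERIOD is converted with as.character() first.

```r
library(ggplot2)

# Hypothetical stand-in for Sleep.data: one row per day
sleep_toy <- data.frame(
  DATEID = 1:10,
  P.DEEP = c(17, 17, 21, 24, 25, 12, 18, 20, 22, 19),
  PERIOD = c("WE", "WE", "WORK", "WORK", "WORK", "WORK", "WORK",
             "WE", "WE", "SICK")
)

# Run-length encode PERIOD to find each contiguous block of rows
runs   <- rle(as.character(sleep_toy$PERIOD))
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
period_ranges <- data.frame(
  PERIOD = runs$values,
  start  = sleep_toy$DATEID[starts],
  end    = sleep_toy$DATEID[ends]
)

p <- ggplot(sleep_toy, aes(x = DATEID)) +
  # Rectangles use their own data and extend half a day on each side
  geom_rect(data = period_ranges, inherit.aes = FALSE,
            aes(xmin = start - 0.5, xmax = end + 0.5,
                ymin = 0, ymax = 30, fill = PERIOD), alpha = 0.5) +
  geom_line(aes(y = P.DEEP), colour = "deepskyblue3") +
  scale_fill_manual(values = c(WE = "cornflowerblue", WORK = "gray60",
                               SICK = "orange"))
print(p)
```

Because each run yields both a start and an end, a period that lasts a single day (the last row here) still gets a rectangle.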

How do I get rid of pre-printed text in Forest Plot using Metafor/R?

I am using the Metafor package to run a meta-analysis and then draw a Forest Plot. When I draw the Forest Plot, the text "RE Model" appears automatically, as shown in the attached screenshot. I can't figure out how to remove "RE Model", even though I add my own text with a separate text() call. I just want my own text to appear aligned with the polygon. Can you help?
First "load data" and then my script:
### Load data
dat <- structure(list(study = structure(c(4L, 5L, 3L, 1L, 2L, 7L, 6L
), .Label = c("Battaglia et al.", "Hong et al.", "Kosyakove et al.",
"Lim et al.", "Rauch et al.", "Swachia et al.", "Tsounis et al."
), class = "factor"), n1i = c(20L, 121L, 25L, 18L, 31L, 35L,
22L), m1i = c(12.8, 30.2, 24.6, 21, 25, 27,
18.2), sd1i = c(15.4,
21.6, 17, 33, 18, 13.8, 8.72), n2i = c(20L, 129L, 25L, 17L, 32L,
34L, 20L), m2i = c(12.1, 28.7, 25.1, 31, 26, 28.6, 14.7), sd2i = c(14.6,
21.6, 12.2, 25, 19, 24.2, 12.9), ntotal = c(40L, 250L, 50L, 35L,
63L, 69L, 42L), mean.age = c(3L, 3L, 1L, 4L, 4L, 3L, 3L), demograhic =
c(0L,
2L, 1L, 1L, 0L, 1L, 0L), adjusted.comorbid = c(1L, 1L, 1L, 1L,
0L, 1L, 1L), follow.up = c(1L, 3L, 3L, 1L, 2L, 2L, 2L), severity = c(2L,
4L, 1L, 4L, 4L, 4L, 4L), treat.sys = c(1L, 2L, 1L, 2L, 1L, 2L,
1L), treat.int = c(1L, 1L, 2L, 1L, 2L, 1L, 1L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L), citation = c(1L, 2L,
3L, 4L, 5L, 6L, 6L), yi = structure(c(0.700000000000001, 1.5,
-0.5, -10, -1, -1.6, 3.5), measure = "MD", ni = c(40L, 250L,
50L, 35L, 63L, 69L, 42L)), vi = c(22.516, 7.47261195464155, 17.5136,
97.2647058823529, 21.7328629032258, 22.6658487394958, 11.7767909090909
)), .Names = c("study", "n1i", "m1i", "sd1i", "n2i", "m2i", "sd2i",
"ntotal", "mean.age", "demograhic", "adjusted.comorbid", "follow.up",
"severity", "treat.sys", "treat.int", "year", "citation", "yi",
"vi"), row.names = c(NA, -7L), class = c("escalc", "data.frame"
), digits = 4, yi.names = "yi", vi.names = "vi")
AND my code
### My code
library(metafor)
res <- rma(yi, vi, data=dat, slab=paste(study, year, citation, sep=", "), method = "REML")
forest(res, xlim=c(-39,24), at=c(-12,-9,-6,-3,0,3,6,9,12), showweights = TRUE,
ilab=cbind(dat$m1i, dat$n1i, dat$sd1i, dat$m2i, dat$n2i, dat$sd2i),
ilab.xpos=c(-26,-24,-22,-18,-16,-14), cex=1, ylim=c(-2, 10), font=1, digits=2, col="darkgrey")
### Add own text
text(-39, -2, pos=4, cex=0.9, font=2,
bquote(paste("Random-effects model for all studies: Q = ",
.(formatC(q1$QE, digits=2, format="f")),
", df = ", .(q1$k - q1$p),", p = ",
.(formatC(q1$QEp, digits=2, format="f")),
", ", I^2, " = ",
.(formatC(q1$I2, digits=1, format="f")),
"%", ", ", tau^2 ==
.(formatC(q1$tau2, digits=2, format="f")))))
Thank you in advance!
To make your forest plot without the "RE Model" text in the bottom left hand corner, just use the mlab = "" argument in your forest function call.
forest(res, xlim=c(-39,24), at=c(-12,-9,-6,-3,0,3,6,9,12), showweights = TRUE,
ilab=cbind(dat$m1i, dat$n1i, dat$sd1i, dat$m2i, dat$n2i, dat$sd2i),
ilab.xpos=c(-26,-24,-22,-18,-16,-14), cex=1, ylim=c(-2, 10),
font=1, digits=2, col="darkgrey", mlab = "")
Unfortunately I can't run the "Add own text" section of your provided code as you do not provide your q1 object. But you should be able to solve that yourself.
I figured this out using the Metafor-Project site, specifically their page on forest plots.
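For completeness, here is a minimal sketch of the whole pattern on a small subset of the question's effect sizes. It assumes that the heterogeneity statistics should come from the fitted model res itself, since the q1 object is not shown; the data frame toy, the xlim values, and the shortened label text are chosen only for illustration.

```r
library(metafor)

# Three of the question's studies: effect sizes (yi) and variances (vi)
toy <- data.frame(
  study = c("Lim et al.", "Rauch et al.", "Hong et al."),
  yi    = c(0.7, 1.5, -1.0),
  vi    = c(22.516, 7.4726, 21.7329)
)
res <- rma(yi, vi, data = toy, slab = study, method = "REML")

# mlab = "" suppresses the default "RE Model" label
forest(res, xlim = c(-30, 20), mlab = "")

# Assumption: statistics are read from `res`, standing in for the missing `q1`.
# The summary polygon is drawn at y = -1, so the text goes on that row.
text(-30, -1, pos = 4, cex = 0.9,
     bquote(paste("RE model: Q = ",
                  .(formatC(res$QE, digits = 2, format = "f")),
                  ", df = ", .(res$k - res$p),
                  ", p = ", .(formatC(res$QEp, digits = 2, format = "f")),
                  "; ", I^2, " = ",
                  .(formatC(res$I2, digits = 1, format = "f")), "%")))
```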

R - Changing colnames with a loop with names from another table

I am trying to split a large dataset and
assign colnames with a loop and
save all individual data back again in a single stacked file
I am using some sample data as follows:
First, I split the dataset in two, based on the sources in the first column, and store the pieces in a list using the following code:
out <- split( sample , f = sample$Source)
Now I am struggling to set up a loop that changes the colnames of columns 2 to 8
by matching the existing colnames to the following 'info' table and replacing them based on the source name in the first column of the 'info' table.
the info table looks like this:
so the loop should change the colnames similar to this:
I am just wondering if anyone has done something similar could advise me?
Also, when I try to join them together with the merge function, I can only set the colnames once. Is there any way to stack them so that each table keeps its own colnames, looking something like this?
my sample input files are:
> dput(sample)
structure(list(Source = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), .Label = c("Stack 1", "Stack 2"), class = "factor"),
year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L), day = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), hour = c(0L, 1L, 2L, 3L, 0L, 1L, 2L, 3L, 4L), `EXIT VEL` = c(26.2,
26.2, 26.2, 26.2, 22.4, 22.4, 22.4, 22.4, 22.4), TEMP = c(341L,
341L, 341L, 341L, 328L, 328L, 328L, 328L, 328L), `STACK DIAM` = c(1.5,
1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 2.5, 2.5), W = c(0L, 0L, 0L,
0L, 15L, 15L, 15L, 15L, 15L), Nox = c(39, 39, 39, 39, 33.3,
33.3, 33.3, 33.3, 33.3), Sox = c(15.5, 15.5, 15.5, 15.5,
17.9, 17.9, 17.9, 17.9, 17.9)), .Names = c("Source", "year",
"day", "hour", "EXIT VEL", "TEMP", "STACK DIAM", "W", "Nox",
"Sox"), class = "data.frame", row.names = c(NA, -9L))
> dput(stack_info)
structure(list(SNAME = structure(1:2, .Label = c("Stack 1", "Stack 2"
), class = "factor"), ISVARY = c(1L, 4L), VELVOL = c(1L, 4L),
TEMPDENS = c(0L, 2L), `DUM 1` = c(999L, 999L), `DUM 2` = c(999L,
999L), NPOL = c(2L, 2L), `EXIT VEL` = c(26.2, 22.4), TEMP = c(341L,
328L), `STACK DIAM` = c(1.5, 2.5), W = c(0L, 15L), Nox = c(39,
33.3), Sox = c(15.5, 17.9)), .Names = c("SNAME", "ISVARY",
"VELVOL", "TEMPDENS", "DUM 1", "DUM 2", "NPOL", "EXIT VEL", "TEMP",
"STACK DIAM", "W", "Nox", "Sox"), class = "data.frame", row.names = c(NA,
-2L))
thanks in advance
The best I ended with is this:
out <- split( sample , f = sample$Source) # your original step
stack_info[,1] <- as.character(stack_info[,1]) # To get strings column as strings and not index number later
out <- lapply( names(out), function(x) {
# Get the future names
new_cnames <- unname(unlist(stack_info[stack_info$SNAME == x,1:7]))
# replace the column names
colnames(out[[x]]) <- c("Source",new_cnames,colnames(out[[x]])[9:10] )
# Return the modified version without first column
out[[x]][,-1] })
sapply(out,write.table,append=T,file="",row.names=F,sep="|") # write (change "" to the file name you wish and sep to your desired separator and see ?write.table for more documentation)
The main idea is to loop over the data frames to change their colnames. I update the list and then loop again to write; you may want to append to the file in the first loop instead.
I hope the comments are enough to understand the code; tell me if anything needs more detail.
Output on screen (omitting warnings):
"Stack 1"|"1"|"1.1"|"0"|"999"|"999.1"|"2"|"Nox"|"Sox"
2010|1|0|26.2|341|1.5|0|39|15.5
2010|1|1|26.2|341|1.5|0|39|15.5
2010|1|2|26.2|341|1.5|0|39|15.5
2010|1|3|26.2|341|1.5|0|39|15.5
"Stack 2"|"4"|"4.1"|"2"|"999"|"999.1"|"2.1"|"Nox"|"Sox"
2010|1|0|22.4|328|2.5|15|33.3|17.9
2010|1|1|22.4|328|2.5|15|33.3|17.9
2010|1|2|22.4|328|2.5|15|33.3|17.9
2010|1|3|22.4|328|2.5|15|33.3|17.9
2010|1|4|22.4|328|2.5|15|33.3|17.9
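The single-loop variant mentioned above (renaming and appending in one pass) could look roughly like this. It is only a sketch on tiny made-up stand-ins for sample and stack_info: emis, info, and out_file are hypothetical names, and the toy tables have fewer columns than the real ones.

```r
# Toy stand-ins for `sample` and `stack_info` (hypothetical values)
emis <- data.frame(Source = c("Stack 1", "Stack 1", "Stack 2"),
                   hour   = c(0L, 1L, 0L),
                   TEMP   = c(341L, 341L, 328L),
                   stringsAsFactors = FALSE)
info <- data.frame(SNAME  = c("Stack 1", "Stack 2"),
                   ISVARY = c(1L, 4L),
                   stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".txt")
for (src in unique(emis$Source)) {
  # Rows for this source, without the Source column
  block <- emis[emis$Source == src, -1, drop = FALSE]
  # Header comes from the matching row of `info`, coerced to character
  names(block) <- unlist(info[info$SNAME == src, ], use.names = FALSE)
  # Appending a header to an existing file raises a warning, so silence it
  suppressWarnings(
    write.table(block, file = out_file, append = file.exists(out_file),
                row.names = FALSE, sep = "|")
  )
}
writeLines(readLines(out_file))
```

Each source then contributes its own header row followed by its data rows, all in the same file.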

ggplot2: how to use facet_wrap_labeller to get correct subscripts

This question refers to, and requires (but maybe not?), the function facet_wrap_labeller written by Roland here: https://stackoverflow.com/a/16964861/3275826
MWE for my case:
dput(test)
structure(list(V1 = c(0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 3.67, 3.73,
3.79, 3.85, 3.91, 3.97), V2 = c(0.0291598, 3.40333, 1.3881, 0.15733,
0.0200618, 0.00145373, 0.332262, 0.30233, 0.288497, 0.267876,
0.264134, 0.227544), V3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L, 19L, 19L, 19L), .Label = c("Param0", "Param1",
"Param2", "Param3", "Param4", "Param5", "Param6", "Param7", "Param8",
"Param9", "Param10", "Param11", "Param12", "Param13", "Param14",
"Param15", "Param16", "Param17", "Param18"), class = "factor")), .Names = c("V1",
"V2", "V3"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 945L, 946L,
947L, 948L, 949L, 950L), class = "data.frame")
dput(testLabels)
structure(c(2L, 1L), .Label = c("K[12:3]", "K[12]"), class = "factor")
## Will show Param0 in the strip, but this is modified by the facet_wrap_labeller
pp = ggplot(test, aes(x=test$V1)) + geom_ribbon(aes(ymin=0,ymax=test$V2), fill="blue", alpha=0.2, colour="blue") + facet_wrap(~V3, scales="free", ncol=3)
facet_wrap_labeller(pp, testLabels)
Problem: Facet strip labels are non-trivial to manipulate when using facet_wrap. In my case the plot shows K[12:3] etc rather than the [12:3] as a subscript as I would like.
I like Roland's function but I don't know how to adapt it for the case where your label subscripts are not just integers.
A solution can be found at https://stackoverflow.com/a/6539953/3275826 however I would like to know how to adapt the aforementioned function as I have a lot of these instances.
Someone who is more confident than me with "computing on the language" could help, but perhaps I am overcomplicating the problem? Using R 3.0.2 and ggplot2_0.9.3.1.
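One possible route that avoids the helper function entirely is sketched below. It assumes a newer ggplot2 (2.0.0 or later), where facet_wrap() accepts a labeller argument; that argument is not available in the 0.9.3.1 release mentioned above. The data frame test_toy and the column strip are hypothetical names, and the subscript 12:3 is quoted so that it is treated as literal text.

```r
library(ggplot2)

# Cut-down version of the question's `test` data (values rounded)
test_toy <- data.frame(
  V1 = c(0.1, 0.3, 0.5, 3.67, 3.73, 3.79),
  V2 = c(0.029, 3.403, 1.388, 0.332, 0.302, 0.288),
  V3 = rep(c("Param0", "Param18"), each = 3)
)

# Plotmath strings as factor labels; '12:3' is quoted to stay literal text
test_toy$strip <- factor(test_toy$V3,
                         levels = c("Param0", "Param18"),
                         labels = c("K[12]", "K['12:3']"))

p <- ggplot(test_toy, aes(x = V1)) +
  geom_ribbon(aes(ymin = 0, ymax = V2), fill = "blue", alpha = 0.2,
              colour = "blue") +
  # label_parsed interprets each strip label as a plotmath expression
  facet_wrap(~strip, scales = "free", ncol = 3, labeller = label_parsed)
print(p)
```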

Resources