I have a dataset containing 120 observations of 6 variables. Five variables are factors, 1 variable is my target variable.
I need to write a function that creates a matrix for each factor, with the levels of the factor as columns, the maximum value of the target variable in the first row, and the minimum value of the target variable in the second row.
I know how to create a matrix, but I am lost when it has to be done inside a function.
Is there someone who can help?
Here is a simple example of what I want to achieve, using a small made-up dataset.
Example
As you can see, for each level of the factor (factor 1 in the picture), I want to show the highest and the lowest value of the target.
Here is a subset of my own data:
> dput(data_plu[1:4, ])
structure(list(NaNO3 = structure(c(2L, 8L, 8L, 3L), .Label = c("10",
"14", "18", "2", "22", "26", "30", "6"), class = "factor"),
CaCl2 = structure(c(4L,
8L, 8L, 8L), .Label = c("0.1", "0.28", "0.46", "0.64", "0.82",
"1", "1.19", "1.37"), class = "factor"), PO4 = structure(c(1L,
5L, 5L, 6L), .Label = c("0.1", "0.8", "1.5", "2.2", "2.9", "3.6",
"4.3", "5"), class = "factor"), NH4Cl = structure(c(5L, 3L, 3L,
6L), .Label = c("0.5", "10.86", "12.93", "15", "2.58", "4.65",
"6.72", "8.79"), class = "factor"), MgSO4 = structure(c(4L, 7L,
1L, 7L), .Label = c("0.21", "0.35", "0.5", "0.64", "0.79", "0.93",
"1.08", "1.22"), class = "factor"), DC = c(15000L, 707500L, 720000L,
872500L)), row.names = c(NA, 4L), class = "data.frame")
You may be able to modify this to meet your needs. I wrote a function to handle one factor and then used lapply to apply it to all of them. I've called your sample data dta:
stats <- function(x, y) {
  # range() returns c(min, max) of y for each level of x
  minmax <- aggregate(y, list(x), range)
  cols <- minmax[, 1]
  # transpose so that the levels become columns
  result <- as.matrix(t(minmax[, -1]))
  dimnames(result) <- list(c("Min", "Max"), Levels=as.character(cols))
  return(result)
}
out <- lapply(dta[, -6], function(x) stats(x, dta$DC))
head(out, 1)
# $NaNO3
# Levels
# 14 18 6
# Min 15000 872500 707500
# Max 15000 872500 720000
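The question asked for the maximum in the first row and the minimum in the second, whereas the function above returns Min first. A minimal base-R sketch of that variant follows. The toy data frame `toy` and the helper name `max_min` are made up for illustration, and the sketch has not been run against the full data set.

```r
# Toy data (hypothetical): two factors and a numeric target called DC
toy <- data.frame(
  f1 = factor(c("a", "a", "b", "b")),
  f2 = factor(c("x", "y", "x", "y")),
  DC = c(10, 30, 20, 5)
)

# One matrix per factor: levels as columns, Max in row 1, Min in row 2.
# drop = TRUE skips factor levels that have no observations.
max_min <- function(x, y) {
  sapply(split(y, x, drop = TRUE), function(v) c(Max = max(v), Min = min(v)))
}

# Apply the helper to every column except the target
factor_cols <- setdiff(names(toy), "DC")
out <- lapply(toy[factor_cols], max_min, y = toy$DC)
out$f1
```

With the real data, `toy` would be replaced by dta and the result is again a named list with one matrix per factor.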
After importing a csv file, R split my data into columns at every comma it read.
My issue is that I originally had two columns: the first held several floating-point values, and the second held the sum of all of them. R spread these over 5 or 6 columns, sometimes fewer and sometimes more, depending on the number of commas present.
One thing makes this easier: the first column is delimited by parentheses. For example, the first column of the first row is (-5,5+9)+(-10+12), and the second column would be the sum of these floating-point numbers. So I can easily see where the first column stops. After the second column (the sum of the elements of the first column) there are at least 2 empty columns, so I can also easily recognize where the second column ends. What I have to do now is rearrange my dataset back into its original form. I am posting the structure of the dataset to make this easier to understand.
Here is the code for the first rows:
Y= structure(list(V24 = structure(c(66L, 15L, 44L, 28L, 68L, 10L
), .Label = c("", "(-0", "(-0+7", "(-1", "(-1+11", "(-1+11)+(-13",
"(-1+11)+(-18+18", "(-1+3)+(-10+14)", "(-1+8)", "(-2", "(-2+10",
"(-2+10)", "(-2+10)+(-13", "(-2+11", "(-2+11)", "(-2+11)+(-13",
"(-2+11)+(-14+17)", "(-2+12", "(-2+12)", "(-2+12)+(-14", "(-2+12)+(-14+15",
"(-2+12)+(-14+16)", "(-2+6)+(-8+10)+(-14", "(-2+7", "(-2+7)",
"(-2+7)+(-11", "(-2+7)+(-13", "(-2+8", "(-2+8)+(-10", "(-2+8)+(-11",
"(-2+8)+(-13", "(-2+8)+(-15", "(-2+9", "(-2+9)", "(-2+9)+(-13",
"(-2+9)+(-14", "(-3", "(-3+10", "(-3+10)", "(-3+10)+(-13", "(-3+10)+(-13+14",
"(-3+10)+(-14+14", "(-3+11", "(-3+11)", "(-3+11)+(-13", "(-3+12",
"(-3+12)", "(-3+12)+(-13", "(-3+13)", "(-3+7", "(-3+8", "(-3+8)",
"(-3+8)+(-11+12", "(-3+9", "(-3+9)", "(-4", "(-4+10", "(-4+10)",
"(-4+10)+(-11+12)", "(-4+11", "(-4+11)", "(-4+12", "(-4+12)",
"(-4+13)", "(-4+14)", "(-4+6)+(-9", "(-4+8", "(-4+8)+(-10+14)",
"(-4+9", "(-4+9)+(-10+11)+(-13", "(-4+9)+(-12+13)+(-18+18", "(-4+9)+(-13+14",
"(-4+9)+(-14+15)", "(-4+9)+(-9", "(-5", "(-5+10", "(-5+10)",
"(-5+10)+(-13", "(-5+11)", "(-5+12)", "(-5+13)+(-14", "(-6",
"(1+6)+(-8+9", "S"), class = "factor"), V25 = structure(c(7L,
67L, 66L, 58L, 66L, 54L), .Label = c("", "(-4+11", "(-5", "10",
"12", "25)+(-14+15)", "25+12", "25+14", "3)", "3+6)", "3+7",
"5", "5)", "5)+(-10", "5)+(-11", "5)+(-11+13)+(-14", "5)+(-13",
"5)+(-13+13", "5)+(-13+14", "5)+(-14", "5)+(-16", "5)+(-16+16",
"5)+(-16+17)+(-21+22", "5)+(-17", "5)+(-18+18", "5+10", "5+10)",
"5+10)+(-13", "5+11", "5+11)", "5+11)+(-13", "5+11)+(-17+17",
"5+11)+(-21+21", "5+12", "5+12)", "5+12)+(-13", "5+12)+(-20+20",
"5+13", "5+13-13", "5+14", "5+15", "5+16", "5+16)", "5+18)",
"5+6)+(-14+14", "5+7", "5+7)+(-13", "5+7)+(-15", "5+7)+(-9+12",
"5+8", "5+8)", "5+8)+(-17", "5+9", "5+9)", "5+9)+(-13", "5+9)+(-14",
"5+9)+(-22", "50)", "50+10)+(-14", "50+14", "50+7", "6", "7",
"75)", "75)+(-14+15", "8", "9", "T"), class = "factor"), V26 = structure(c(31L,
1L, 1L, 29L, 1L, 29L), .Label = c("", "10", "11", "25)", "25)+(-14+15",
"25+15", "4", "5", "5)", "5)+(-13", "5)+(-14", "5)+(-16", "5)+(-16+17)",
"5)+(-20+21)", "5+10)+(-13", "5+13", "5+14", "5+14)", "5+14)+(-18+18",
"5+15", "5+15)", "5+16)", "5+17", "5+18", "5+18)", "5+23", "50)",
"50+16", "6", "7", "75)", "75+14", "75+15", "8", "9"), class = "factor"),
V27 = structure(c(9L, 1L, 1L, 9L, 1L, 9L), .Label = c("",
"10", "11", "12", "25", "25)", "25+17", "3", "5", "5)", "5+14",
"5+15)", "5+15)+(-18", "50)", "6", "7", "75", "75)", "8",
"9"), class = "factor"), V28 = structure(c(9L, 12L, 15L,
1L, 8L, 1L), .Label = c("", "1", "10", "11", "2", "25)",
"3", "4", "5", "5)", "5+19", "6", "7", "75", "8", "9"), class = "factor"),
V29 = structure(c(1L, 5L, 10L, 1L, 6L, 1L), .Label = c("",
"25", "2prol", "30", "40", "41", "5", "5)", "50", "52", "75",
"8", "9"), class = "factor"), V30 = structure(c(1L, 6L, 12L,
5L, 7L, 13L), .Label = c("", "25", "3", "3conc", "4", "45",
"46", "5", "52", "56", "6", "60", "8", "9"), class = "factor"),
V31 = structure(c(15L, 7L, 10L, 3L, 8L, 7L), .Label = c("",
"35", "40", "43", "4mot", "5", "52", "53", "54", "55", "56",
"57", "60", "63", "7"), class = "factor"), V32 = c(43L, NA,
NA, 52L, NA, 57L), V33 = c(45L, NA, NA, 59L, NA, 56L), V34 = c(55L,
NA, NA, NA, NA, NA)), row.names = 3:8, class = "data.frame")
So my idea is:
Read all the columns, then identify and separate the first column from the second: the last element of the first column is marked by the closing parenthesis.
For the first column: take the value of the next column and append it to the previous column, with a comma in front.
For the second column: since it starts at the first element after the closing parenthesis, I would take that first value (in that case we have an integer) or, if the following column also holds a number (so it is not empty), append that number to the first value, joined by a comma.
Note: I have an idea of how to do it, but I can't translate these ideas into code. How can I do this?
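One possible way to express this idea in base R is sketched below. It works on the raw text lines instead of the already split columns, and it rests on several assumptions: the unsplit lines are available as a character vector (for instance from readLines), the sum uses a comma as its decimal separator, and the sum is followed by at least two empty fields as described above. The names `raw_lines`, `pattern` and `fixed` are made up for illustration, and the two example lines are only modeled on the posted rows.

```r
# Hypothetical raw lines, as they might look before R split them on commas
raw_lines <- c(
  "(-4+6)+(-9,25+12,75),5,5,,,7,43,45,55",
  "(-2+11),9,,,6,40,45,52"
)

# Group 1: everything up to the last closing parenthesis (the expression).
# Group 2: the number right after it, with an optional ",digits" decimal part.
pattern <- "^(.*\\)),(-?[0-9]+(,[0-9]+)?)"
m <- regmatches(raw_lines, regexec(pattern, raw_lines))

fixed <- data.frame(
  expression = vapply(m, function(x) x[2], character(1)),
  # turn the decimal comma into a point before converting to numeric
  total = as.numeric(sub(",", ".", vapply(m, function(x) x[3], character(1)),
                         fixed = TRUE)),
  stringsAsFactors = FALSE
)
fixed
```

The greedy `.*` is what finds the last closing parenthesis. The rule that empty fields follow the sum is what keeps an integer sum such as 9 from swallowing the next field.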
I am facing the problem that I am not able to specify the shape of the line symbols (without this specification the code works fine):
Below the data in reproducible format (it is the effects data put into a data frame):
structure(list(varL = c(0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07, 0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07, 0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07, 0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07), varP = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L), .Label = c("(0,0.1]", "(0.1,0.2]",
"(0.2,0.3]", "(0.3,0.4]", "(0.4,0.5]", "(0.5,0.6]", "(0.6,0.7]",
"(0.7,0.8]", NA), class = "factor"), fit = c(0.0496509727291671,
0.0889644199210129, 0.147763911240627, 0.228140612498209, 0.328558663864939,
0.0137066329240178, 0.0170188110490053, 0.0209924787528359, 0.0257246732663005,
0.0313187292462082, 0.0289376730565942, 0.0324367840687503, 0.036277818691311,
0.0404834466212193, 0.0450765434401318, 0.0377500587733006, 0.0506605267612627,
0.0668653640284829, 0.0868169793966305, 0.110912824327041, 0.0461062991171287,
0.0536136421990573, 0.0620580975149222, 0.071506100162885, 0.0820206662867591,
0.0271688764980807, 0.0310122602430318, 0.0352949603875076, 0.0400511628245002,
0.0453154762467586, 0.0593111130006543, 0.0777425439930874, 0.100226912943776,
0.127122712337706, 0.158670602546708, 0.02092268966042, 0.0481738946672621,
0.0984225581163725, 0.179214944179607, 0.292488347088707), se = c(0.0259513690928884,
0.0478802966619357, 0.0959400030912549, 0.146319368888539, 0.197248937550513,
0.033511891943933, 0.0649738808934063, 0.13283528902344, 0.203454843482363,
0.274713638499851, 0.0399137666412373, 0.0836182332502119, 0.170994872374127,
0.261409298049175, 0.352531889503407, 0.0128068165036135, 0.0265824058594164,
0.054035051049317, 0.0824833429902055, 0.111165505837411, 0.00821998219695643,
0.0204628357910751, 0.0416140898624852, 0.0632975717285407, 0.08510744963605,
0.0111710559049469, 0.0241847618850518, 0.0491238092261353, 0.0748967974373985,
0.100866484066391, 0.0158269724358688, 0.0376131484048352, 0.0769417704226139,
0.117330108518709, 0.157967414110193, 0.041410334660995, 0.0756439112597116,
0.154046905957391, 0.236093539915582, 0.318984533128398), lower = c(0.0446491361632188,
0.0747918828643712, 0.108580794230823, 0.151091116521613, 0.203128798877193,
0.0115654911703096, 0.0123209025069961, 0.0108946289492482, 0.00947592068351736,
0.00819339231583828, 0.0241414249675257, 0.022214689650845, 0.0165544754495112,
0.0119897952488384, 0.00852702139073778, 0.0357322013203882,
0.0454582184069632, 0.05419654291372, 0.0639689796722408, 0.0749947791208939,
0.0445700854939895, 0.0493806984121361, 0.0526928942349234, 0.0560610804290682,
0.0595674652939625, 0.0258256186623322, 0.0278406506196297, 0.0284298721928641,
0.0289213813577344, 0.0293941320590341, 0.0557369600349524, 0.0675693013920541,
0.0762061129774737, 0.0853339447169982, 0.0951745618665348, 0.0171631624694616,
0.0350640216006218, 0.0556341130925104, 0.0836247441564837, 0.120733474483254
), upper = c(0.0550901244832386, 0.10504537627394, 0.195437125694758,
0.323403092731095, 0.477154780078397, 0.0161814201791539, 0.0231702136215256,
0.0380881102321762, 0.0606970430990062, 0.0928632628051039, 0.0345006405904498,
0.046261189541596, 0.0720801941481896, 0.10883443823233, 0.1577758610065,
0.0398599758225597, 0.0563263318177106, 0.0817125182150079, 0.115272694473414,
0.157735685375171, 0.0476847568332783, 0.0581290329012679, 0.072673983374196,
0.0900244403895002, 0.110325327955545, 0.0285699305757627, 0.0344771441605397,
0.0434644025927379, 0.0544344789594513, 0.0675376260500437, 0.0630625316437756,
0.0890383062732732, 0.129352639664315, 0.181375571131999, 0.244994560270329,
0.0253570051034752, 0.06494141884307, 0.161413333266735, 0.324329403439929,
0.531510876579345)), .Names = c("varL", "varP",
"fit", "se", "lower", "upper"), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26",
"27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
"38", "39", "40"), class = "data.frame") -> effectdat
The following code yields an error:
library(ggplot2)
ggplot(effectdat) + geom_line(aes(varL,fit,linetype=varP)) + theme_bw() + geom_point(aes(shape = varP))
Error: geom_point requires the following missing aesthetics: x, y
In the post "ggplot2_Error: geom_point requires the following missing aesthetics: y" I read that I should use the unlist function. However, this produces another error:
ggplot(unlist(effectdat)) + geom_line(aes(varL,fit,linetype=varP)) + theme_bw() + geom_point(aes(shape = varP))
Error: ggplot2 doesn't know how to deal with data of class numeric
Any ideas what is wrong? What surprises me is that the function without geom_point() seems to work fine.
There is no need to unlist the data.frame. The code below works:
ggplot(effectdat) + geom_line(aes(x = varL,y = fit,linetype=varP)) + theme_bw() + geom_point(aes(x = varL,y = fit, shape = varP))
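Explanation: I added the missing x and y aesthetics that geom_point() requires. Aesthetics set inside geom_line(aes(...)) apply only to that layer; only aesthetics set in ggplot(aes(...)) are inherited by every layer.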
I'm trying to create a stacked area graph with R and ggplot2. I'd like it to look like this, but instead the areas overlap and have holes. I'm trying to ensure that the areas are stacked so that the area with the largest value in the most recent month (2016-05 in this case) is on the bottom.
Related posts like this one seem to have holes in the data, which doesn't seem to be the issue here.
Here's sample code to recreate the issue:
sample.data <- structure(
list(
rank = structure(
c(34L, 34L, 34L, 35L, 35L, 35L, 34L, 34L, 34L, 34L, 35L, 35L, 35L, 35L, 35L, 34L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35"),
class = "factor"),
vendor = structure(
c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L),
.Label = c("34", "35"),
class = "factor"),
year.month = c("2015-12", "2016-01", "2015-11", "2015-12", "2016-01", "2015-10", "2016-03", "2016-02", "2015-10", "2016-04", "2015-11", "2016-05", "2016-04", "2016-03", "2016-02", "2016-05"),
value = c(431616L, 272224L, 229288L, 195284L, 155168L, 154194L, 149784L, 137302L, 126612L, 117408L, 94141L, 56161L, 54606L, 53173L, 49898L, 45348L)),
.Names = c("rank", "vendor", "year.month", "value"),
row.names = c(6L, 8L, 4L, 5L, 7L, 1L, 12L, 10L, 2L, 14L, 3L, 15L, 13L, 11L, 9L, 16L),
class = "data.frame"
)
ggplot(data = sample.data, aes(x = year.month, y = value, group = vendor, color = vendor, reorder(-value), fill=vendor)) +
geom_area()
Thanks in advance for your help.
Try: + geom_area(position="dodge",stat="identity")
The following works:
ggplot(data = sample.data[order(sample.data$vendor),],
aes(x = year.month, y = value, group = vendor, color = vendor,
reorder(-value), fill=vendor)) + geom_area()
You just had to order your data: sample.data[order(sample.data$vendor),].
If you want to change the order of the graph, you have to "relevel" the vendor variable which is stored as a factor:
sample.data$vendor <- relevel(sample.data$vendor, ref="35")
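relevel() moves a single level to the front. With more than two vendors the whole level order matters, and it can be derived from the data in one step. The following is a minimal base-R sketch of that idea, using a small toy data frame with the same column names as sample.data. The names `toy`, `latest` and `ord` are made up for illustration.

```r
# Toy data (hypothetical), same column names as sample.data
toy <- data.frame(
  vendor = factor(c("34", "35", "34", "35")),
  year.month = c("2016-04", "2016-04", "2016-05", "2016-05"),
  value = c(117408, 54606, 45348, 56161)
)

# Rows for the most recent month; "yyyy-mm" strings sort chronologically
latest <- toy[toy$year.month == max(toy$year.month), ]

# Vendors ordered from largest to smallest value in that month
ord <- as.character(latest$vendor[order(latest$value, decreasing = TRUE)])

# Rebuild the factor with that level order
toy$vendor <- factor(toy$vendor, levels = ord)
levels(toy$vendor)
```

Whether the first level is drawn at the bottom or at the top of the stack may depend on the ggplot2 version, so wrap `ord` in rev() if the plot comes out upside down.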
Here is some code to figure out which vendor to set as the base level according to your criterion:
with(sample.data, sample.data[year.month=="2016-05",
"vendor"][which.max(sample.data[year.month=="2016-05", "value"])])
I'm trying to plot against dates in R. I've run into trouble trying to create vertical lines against a plot that I already have. All of the different formats that I try either result in nothing showing up on the plot, or a line at 1970 (the default date). The year-data is in the form yyyy-mm-dd. For example, "1914-07-01".
I've also tried inputting these dates in a data.frame, but got the same problem.
I've been trying to make a reproducible example, but I haven't seen any example datasets to do so with, and got frustrated trying to create one... sorry about that. Here's the relevant code:
ggplot(M,aes(x=date,color=origin,y=value)) +
geom_point() +
geom_line() +
facet_grid(topic~origin) +
geom_vline(xintercept=as.numeric(as.Date("1914-07-01")))
Everything plots correctly without the addition of the final line.
Edit: here's the result of dput(head(M)):
structure(list(topic = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25"), class = "factor"), date = structure(c(-1767196800, -1765987200,
-1764518400, -1763308800, -1762099200, -1760889600), class = c("POSIXct",
"POSIXt"), tzone = ""), origin = structure(c(2L, 2L, 2L, 2L,
2L, 2L), .Label = c("Blast", "The_Egoist"), class = "factor"),
value = c(6.69960398194253e-07, 7.48757156068349e-07, 7.04834977806836e-07,
7.10226526475778e-07, 6.8295233938925e-07, 6.16466066169137e-07
)), .Names = c("topic", "date", "origin", "value"), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), vars = list(
topic, date), drop = TRUE, indices = list(0L, 1L, 2L, 3L,
4L, 5L), group_sizes = c(1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
topic = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22",
"23", "24", "25"), class = "factor"), date = structure(c(-1767196800,
-1765987200, -1764518400, -1763308800, -1762099200, -1760889600
), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-6L), .Names = c("topic", "date"), vars = list(topic, date)))
You were very close. The problem is that your data is in POSIXct, and you were trying to convert to Date. To fix it, convert to POSIXct instead:
ggplot(M,aes(x=date,color=origin,y=value)) +
geom_point() +
geom_line() +
facet_grid(topic~origin) +
geom_vline(xintercept=as.numeric(as.POSIXct("1914-07-01")))
You can see the difference in the calls:
as.numeric(as.POSIXct("1914-07-01"))
[1] -1751569200
as.numeric(as.Date("1914-07-01"))
[1] -20273
This explains why the intercept was so close to 1970 (the 0 for both): the Date value counts days, but the POSIXct axis reads it as seconds.
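The difference in units can be checked directly. The sketch below fixes the time zone to UTC so that the numbers are reproducible, which is an assumption on my part: the POSIXct value printed above comes from a local time zone and therefore differs by a few hours. The names `days`, `secs` and `misread` are made up for illustration.

```r
# Date is stored as days since 1970-01-01
days <- as.numeric(as.Date("1914-07-01"))

# POSIXct is stored as seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(as.POSIXct("1914-07-01", tz = "UTC"))

# One day has 86400 seconds, so the two values differ by exactly that factor
secs == days * 86400

# Reading the day count as seconds lands only a few hours before 1970
misread <- as.POSIXct(days, origin = "1970-01-01", tz = "UTC")
format(misread, "%Y-%m-%d %H:%M:%S")
# "1969-12-31 18:22:07"
```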