CSV import issues in R

After importing a CSV file, R split my data into a new column at every comma it read.
My issue is that I originally had two columns: one holding several floating-point values, and the other holding the sum of all of those values. R spread these elements over 5 or 6 columns, sometimes fewer and sometimes more, depending on how many commas there were.
One thing makes this easier: the first column is delimited by parentheses. For example, the first column of the first row is (-5,5+9)+(-10+12), and the second column is the sum of these floating-point numbers. So I can easily see where the first column stops. After the second column (the sum of the elements of the first column) there are at least two empty columns, so I can also recognize where the second column ends. What I need to do now is rearrange my dataset into its original form. I am posting the structure of the dataset to make this easier to understand.
Here is the code for the first rows:
Y= structure(list(V24 = structure(c(66L, 15L, 44L, 28L, 68L, 10L
), .Label = c("", "(-0", "(-0+7", "(-1", "(-1+11", "(-1+11)+(-13",
"(-1+11)+(-18+18", "(-1+3)+(-10+14)", "(-1+8)", "(-2", "(-2+10",
"(-2+10)", "(-2+10)+(-13", "(-2+11", "(-2+11)", "(-2+11)+(-13",
"(-2+11)+(-14+17)", "(-2+12", "(-2+12)", "(-2+12)+(-14", "(-2+12)+(-14+15",
"(-2+12)+(-14+16)", "(-2+6)+(-8+10)+(-14", "(-2+7", "(-2+7)",
"(-2+7)+(-11", "(-2+7)+(-13", "(-2+8", "(-2+8)+(-10", "(-2+8)+(-11",
"(-2+8)+(-13", "(-2+8)+(-15", "(-2+9", "(-2+9)", "(-2+9)+(-13",
"(-2+9)+(-14", "(-3", "(-3+10", "(-3+10)", "(-3+10)+(-13", "(-3+10)+(-13+14",
"(-3+10)+(-14+14", "(-3+11", "(-3+11)", "(-3+11)+(-13", "(-3+12",
"(-3+12)", "(-3+12)+(-13", "(-3+13)", "(-3+7", "(-3+8", "(-3+8)",
"(-3+8)+(-11+12", "(-3+9", "(-3+9)", "(-4", "(-4+10", "(-4+10)",
"(-4+10)+(-11+12)", "(-4+11", "(-4+11)", "(-4+12", "(-4+12)",
"(-4+13)", "(-4+14)", "(-4+6)+(-9", "(-4+8", "(-4+8)+(-10+14)",
"(-4+9", "(-4+9)+(-10+11)+(-13", "(-4+9)+(-12+13)+(-18+18", "(-4+9)+(-13+14",
"(-4+9)+(-14+15)", "(-4+9)+(-9", "(-5", "(-5+10", "(-5+10)",
"(-5+10)+(-13", "(-5+11)", "(-5+12)", "(-5+13)+(-14", "(-6",
"(1+6)+(-8+9", "S"), class = "factor"), V25 = structure(c(7L,
67L, 66L, 58L, 66L, 54L), .Label = c("", "(-4+11", "(-5", "10",
"12", "25)+(-14+15)", "25+12", "25+14", "3)", "3+6)", "3+7",
"5", "5)", "5)+(-10", "5)+(-11", "5)+(-11+13)+(-14", "5)+(-13",
"5)+(-13+13", "5)+(-13+14", "5)+(-14", "5)+(-16", "5)+(-16+16",
"5)+(-16+17)+(-21+22", "5)+(-17", "5)+(-18+18", "5+10", "5+10)",
"5+10)+(-13", "5+11", "5+11)", "5+11)+(-13", "5+11)+(-17+17",
"5+11)+(-21+21", "5+12", "5+12)", "5+12)+(-13", "5+12)+(-20+20",
"5+13", "5+13-13", "5+14", "5+15", "5+16", "5+16)", "5+18)",
"5+6)+(-14+14", "5+7", "5+7)+(-13", "5+7)+(-15", "5+7)+(-9+12",
"5+8", "5+8)", "5+8)+(-17", "5+9", "5+9)", "5+9)+(-13", "5+9)+(-14",
"5+9)+(-22", "50)", "50+10)+(-14", "50+14", "50+7", "6", "7",
"75)", "75)+(-14+15", "8", "9", "T"), class = "factor"), V26 = structure(c(31L,
1L, 1L, 29L, 1L, 29L), .Label = c("", "10", "11", "25)", "25)+(-14+15",
"25+15", "4", "5", "5)", "5)+(-13", "5)+(-14", "5)+(-16", "5)+(-16+17)",
"5)+(-20+21)", "5+10)+(-13", "5+13", "5+14", "5+14)", "5+14)+(-18+18",
"5+15", "5+15)", "5+16)", "5+17", "5+18", "5+18)", "5+23", "50)",
"50+16", "6", "7", "75)", "75+14", "75+15", "8", "9"), class = "factor"),
V27 = structure(c(9L, 1L, 1L, 9L, 1L, 9L), .Label = c("",
"10", "11", "12", "25", "25)", "25+17", "3", "5", "5)", "5+14",
"5+15)", "5+15)+(-18", "50)", "6", "7", "75", "75)", "8",
"9"), class = "factor"), V28 = structure(c(9L, 12L, 15L,
1L, 8L, 1L), .Label = c("", "1", "10", "11", "2", "25)",
"3", "4", "5", "5)", "5+19", "6", "7", "75", "8", "9"), class = "factor"),
V29 = structure(c(1L, 5L, 10L, 1L, 6L, 1L), .Label = c("",
"25", "2prol", "30", "40", "41", "5", "5)", "50", "52", "75",
"8", "9"), class = "factor"), V30 = structure(c(1L, 6L, 12L,
5L, 7L, 13L), .Label = c("", "25", "3", "3conc", "4", "45",
"46", "5", "52", "56", "6", "60", "8", "9"), class = "factor"),
V31 = structure(c(15L, 7L, 10L, 3L, 8L, 7L), .Label = c("",
"35", "40", "43", "4mot", "5", "52", "53", "54", "55", "56",
"57", "60", "63", "7"), class = "factor"), V32 = c(43L, NA,
NA, 52L, NA, 57L), V33 = c(45L, NA, NA, 59L, NA, 56L), V34 = c(55L,
NA, NA, NA, NA, NA)), row.names = 3:8, class = "data.frame")
So my idea is:
Read all the columns, then identify and separate the first column from the second: the last piece of the first column is marked by the closing parenthesis.
For the first column: take the value of the next column and append it to the previous one, with a comma in between.
For the second column: it starts at the first element after the closing parenthesis, so I would take that first value (an integer in this case) and, if the following column is not empty, append that column's number to it, joined by a comma.
Note: I know what I want to do, but I can't translate these ideas into code. How can I do this?
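One way to turn the idea above into code, as a minimal sketch rather than a tested solution for the full dataset: paste the pieces of each row back together with commas, treating the last cell that ends in a closing parenthesis as the end of the first column and the first empty cell after it as the end of the second. The helper name rejoin_row, the toy data frame split_df, and the output column names first_col and second_col are made up for illustration.

```r
# Toy stand-in for the split columns (hypothetical values, not the real data).
split_df <- data.frame(
  V1 = c("(-5", "(-2+10)"),
  V2 = c("5+9)+(-10+12)", "8"),
  V3 = c("5", ""),
  V4 = c("5", ""),
  V5 = c("", ""),
  V6 = c("", "40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rebuild the two original fields from one row of cells.
rejoin_row <- function(cells) {
  cells <- as.character(cells)
  cells[is.na(cells)] <- ""
  # Assumption: the first field ends at the last cell that finishes with ")".
  closing <- which(grepl("\\)$", cells))
  if (length(closing) == 0) {
    return(c(first_col = NA_character_, second_col = NA_character_))
  }
  end_first <- max(closing)
  first <- paste(cells[seq_len(end_first)], collapse = ",")
  # Assumption: the second field runs up to the first empty cell after that.
  rest <- cells[-seq_len(end_first)]
  stop_at <- which(rest == "")[1]
  if (!is.na(stop_at)) rest <- rest[seq_len(stop_at - 1)]
  c(first_col = first, second_col = paste(rest, collapse = ","))
}

# Factor columns must become character before the pieces are pasted together.
split_df[] <- lapply(split_df, as.character)
rebuilt <- as.data.frame(t(apply(split_df, 1, rejoin_row)),
                         stringsAsFactors = FALSE)

# Turn the decimal comma of the sum into a decimal point and convert to numeric.
rebuilt$second_col <- as.numeric(sub(",", ".", rebuilt$second_col, fixed = TRUE))
```

On the real data frame the same two steps (convert every column to character, then apply the helper row by row) should carry over, but any later cell that happens to end in a closing parenthesis would break the rule, so the result is worth checking by eye.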

Related

Warning Messages when running linear regression in R

I'm attempting to run a linear regression in R, but get the following warnings:
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
The code is:
reg_ex1 <- lm(V45~TotalScore,data = Combineddatainprogresscsv)
Both variables, V45 and TotalScore, are numerical. A Google search turned up a similar question where it was suggested that the CSV file might contain commas, but I'm not an expert and don't know how to check this.
Thank you!
There are 1300 lines, so here is just the final part of the output. Let me know if you need more.
"50", "60", "70", "80", "90", "Compared to others who may have taken this test, how well do you think you scored? - 1"
), class = "factor"), V46 = structure(c(23L, 6L, 4L, 22L,
4L, 8L), .Label = c("", "0", "1", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "2", "20", "3", "4",
"5", "6", "7", "8", "9", "Score"), class = "factor"), TotalScore = c(0L,
12L, 10L, 9L, 10L, 14L)), row.names = c(NA, 6L), class = "data.frame")
It seems your response variable V46 is a factor. You can see it in the output you pasted: V46 = structure(c(23L, 6L, 4L, 22L,
4L, 8L), .Label = c("", "0", "1", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "2", "20", "3", "4",
"5", "6", "7", "8", "9", "Score"), class = "factor")
I would suggest converting V46 to character, then to numeric, and finally filtering out the missing values that the "Score" level will produce.
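A minimal sketch of that conversion, assuming a toy data frame dat built from the six rows visible in the pasted output (the name dat is illustrative; the real data frame is Combineddatainprogresscsv):

```r
# Toy data mirroring the pasted output: V46 is a factor with a stray "Score" level.
dat <- data.frame(
  V46 = factor(c("Score", "12", "10", "9", "10", "14")),
  TotalScore = c(0L, 12L, 10L, 9L, 10L, 14L)
)

# Check which columns were read in as factors.
sapply(dat, class)

# as.numeric() on a factor returns the internal level codes, so go through
# character first. "Score" cannot be parsed and becomes NA; the resulting
# coercion warning is suppressed here on purpose.
dat$V46 <- suppressWarnings(as.numeric(as.character(dat$V46)))

# Drop the rows where the conversion produced NA.
dat <- dat[!is.na(dat$V46), ]

# The regression now sees a numeric response.
fit <- lm(V46 ~ TotalScore, data = dat)
```

The same check with sapply(..., class) on the full data frame should show every other column that was read as a factor and may need the same treatment.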
You should definitely listen to the people in the comments so it's easier to help you :)

ggplot2: "geom_point requires the following missing aesthetics: x, y"

I am facing the problem that I am not able to specify the shape of the line symbols (without this specification the code works fine):
Below is the data in reproducible format (it is the effects data put into a data frame):
structure(list(varL = c(0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07, 0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07, 0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07, 0, 1e+07, 2e+07, 3e+07, 4e+07,
0, 1e+07, 2e+07, 3e+07, 4e+07), varP = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L), .Label = c("(0,0.1]", "(0.1,0.2]",
"(0.2,0.3]", "(0.3,0.4]", "(0.4,0.5]", "(0.5,0.6]", "(0.6,0.7]",
"(0.7,0.8]", NA), class = "factor"), fit = c(0.0496509727291671,
0.0889644199210129, 0.147763911240627, 0.228140612498209, 0.328558663864939,
0.0137066329240178, 0.0170188110490053, 0.0209924787528359, 0.0257246732663005,
0.0313187292462082, 0.0289376730565942, 0.0324367840687503, 0.036277818691311,
0.0404834466212193, 0.0450765434401318, 0.0377500587733006, 0.0506605267612627,
0.0668653640284829, 0.0868169793966305, 0.110912824327041, 0.0461062991171287,
0.0536136421990573, 0.0620580975149222, 0.071506100162885, 0.0820206662867591,
0.0271688764980807, 0.0310122602430318, 0.0352949603875076, 0.0400511628245002,
0.0453154762467586, 0.0593111130006543, 0.0777425439930874, 0.100226912943776,
0.127122712337706, 0.158670602546708, 0.02092268966042, 0.0481738946672621,
0.0984225581163725, 0.179214944179607, 0.292488347088707), se = c(0.0259513690928884,
0.0478802966619357, 0.0959400030912549, 0.146319368888539, 0.197248937550513,
0.033511891943933, 0.0649738808934063, 0.13283528902344, 0.203454843482363,
0.274713638499851, 0.0399137666412373, 0.0836182332502119, 0.170994872374127,
0.261409298049175, 0.352531889503407, 0.0128068165036135, 0.0265824058594164,
0.054035051049317, 0.0824833429902055, 0.111165505837411, 0.00821998219695643,
0.0204628357910751, 0.0416140898624852, 0.0632975717285407, 0.08510744963605,
0.0111710559049469, 0.0241847618850518, 0.0491238092261353, 0.0748967974373985,
0.100866484066391, 0.0158269724358688, 0.0376131484048352, 0.0769417704226139,
0.117330108518709, 0.157967414110193, 0.041410334660995, 0.0756439112597116,
0.154046905957391, 0.236093539915582, 0.318984533128398), lower = c(0.0446491361632188,
0.0747918828643712, 0.108580794230823, 0.151091116521613, 0.203128798877193,
0.0115654911703096, 0.0123209025069961, 0.0108946289492482, 0.00947592068351736,
0.00819339231583828, 0.0241414249675257, 0.022214689650845, 0.0165544754495112,
0.0119897952488384, 0.00852702139073778, 0.0357322013203882,
0.0454582184069632, 0.05419654291372, 0.0639689796722408, 0.0749947791208939,
0.0445700854939895, 0.0493806984121361, 0.0526928942349234, 0.0560610804290682,
0.0595674652939625, 0.0258256186623322, 0.0278406506196297, 0.0284298721928641,
0.0289213813577344, 0.0293941320590341, 0.0557369600349524, 0.0675693013920541,
0.0762061129774737, 0.0853339447169982, 0.0951745618665348, 0.0171631624694616,
0.0350640216006218, 0.0556341130925104, 0.0836247441564837, 0.120733474483254
), upper = c(0.0550901244832386, 0.10504537627394, 0.195437125694758,
0.323403092731095, 0.477154780078397, 0.0161814201791539, 0.0231702136215256,
0.0380881102321762, 0.0606970430990062, 0.0928632628051039, 0.0345006405904498,
0.046261189541596, 0.0720801941481896, 0.10883443823233, 0.1577758610065,
0.0398599758225597, 0.0563263318177106, 0.0817125182150079, 0.115272694473414,
0.157735685375171, 0.0476847568332783, 0.0581290329012679, 0.072673983374196,
0.0900244403895002, 0.110325327955545, 0.0285699305757627, 0.0344771441605397,
0.0434644025927379, 0.0544344789594513, 0.0675376260500437, 0.0630625316437756,
0.0890383062732732, 0.129352639664315, 0.181375571131999, 0.244994560270329,
0.0253570051034752, 0.06494141884307, 0.161413333266735, 0.324329403439929,
0.531510876579345)), .Names = c("varL", "varP",
"fit", "se", "lower", "upper"), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26",
"27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
"38", "39", "40"), class = "data.frame") -> effectdat
The following code yields an error:
library(ggplot2)
ggplot(effectdat) + geom_line(aes(varL,fit,linetype=varP)) + theme_bw() + geom_point(aes(shape = varP))
Error: geom_point requires the following missing aesthetics: x, y
I read in "ggplot2_Error: geom_point requires the following missing aesthetics: y" that I should use the unlist function. However, this produces another error:
ggplot(unlist(effectdat)) + geom_line(aes(varL,fit,linetype=varP)) + theme_bw() + geom_point(aes(shape = varP))
Error: ggplot2 doesn't know how to deal with data of class numeric
Any ideas what is wrong? What surprises me is that the function without geom_point() seems to work fine.
No need to unlist the data.frame. Code below works:
ggplot(effectdat) + geom_line(aes(x = varL,y = fit,linetype=varP)) + theme_bw() + geom_point(aes(x = varL,y = fit, shape = varP))
Explanation: aesthetics set inside geom_line() apply only to that layer, so geom_point() needs its own x and y as well.
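An alternative that avoids repeating the mapping, sketched on a small made-up data frame (toy and its values are illustrative, not the asker's effectdat): aesthetics given in ggplot() are inherited by every layer, so x and y can be declared once.

```r
library(ggplot2)

# Toy data with the same column names as effectdat (values are made up).
toy <- data.frame(
  varL = rep(c(0, 1e7, 2e7), times = 2),
  fit  = c(0.05, 0.09, 0.15, 0.01, 0.02, 0.03),
  varP = factor(rep(c("(0,0.1]", "(0.1,0.2]"), each = 3))
)

# x and y are set once in ggplot() and inherited by both layers;
# only the layer-specific aesthetics stay inside each geom.
p <- ggplot(toy, aes(x = varL, y = fit)) +
  geom_line(aes(linetype = varP)) +
  geom_point(aes(shape = varP)) +
  theme_bw()
```

Printing p draws the plot. Note that the default shape scale warns when a factor has more than six levels, which may matter for the eight levels of varP in the original data.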

Stacked Area Graph Using R and ggplot2 Has Holes

I'm trying to create a stacked area graph with R and ggplot2. I'd like it to look like this, but instead the areas overlap and have holes. I'm also trying to ensure that the areas are stacked so that the area with the largest value in the most recent month (2016-05 in this case) is on the bottom.
Related posts like this one seem to have holes in the data, which doesn't seem to be the issue here.
Here's sample code to recreate the issue:
sample.data <- structure(
list(
rank = structure(
c(34L, 34L, 34L, 35L, 35L, 35L, 34L, 34L, 34L, 34L, 35L, 35L, 35L, 35L, 35L, 34L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35"),
class = "factor"),
vendor = structure(
c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L),
.Label = c("34", "35"),
class = "factor"),
year.month = c("2015-12", "2016-01", "2015-11", "2015-12", "2016-01", "2015-10", "2016-03", "2016-02", "2015-10", "2016-04", "2015-11", "2016-05", "2016-04", "2016-03", "2016-02", "2016-05"),
value = c(431616L, 272224L, 229288L, 195284L, 155168L, 154194L, 149784L, 137302L, 126612L, 117408L, 94141L, 56161L, 54606L, 53173L, 49898L, 45348L)),
.Names = c("rank", "vendor", "year.month", "value"),
row.names = c(6L, 8L, 4L, 5L, 7L, 1L, 12L, 10L, 2L, 14L, 3L, 15L, 13L, 11L, 9L, 16L),
class = "data.frame"
)
ggplot(data = sample.data, aes(x = year.month, y = value, group = vendor, color = vendor, reorder(-value), fill=vendor)) +
geom_area()
Thanks in advance for your help.
Try: + geom_area(position="dodge",stat="identity")
The following works:
ggplot(data = sample.data[order(sample.data$vendor),],
aes(x = year.month, y = value, group = vendor, color = vendor,
reorder(-value), fill=vendor)) + geom_area()
You just had to order your data: sample.data[order(sample.data$vendor),].
If you want to change the order of the graph, you have to "relevel" the vendor variable which is stored as a factor:
sample.data$vendor <- relevel(sample.data$vendor, ref="35")
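To see what the ordering and releveling steps do, here is a small sketch on a four-row subset of the sample data (the subset and the name ordered.data are chosen for illustration only):

```r
# Four rows taken from the sample data: two vendors, two months.
sample.data <- data.frame(
  vendor = factor(c("34", "35", "34", "35")),
  year.month = c("2016-04", "2016-04", "2016-05", "2016-05"),
  value = c(117408L, 54606L, 45348L, 56161L),
  stringsAsFactors = FALSE
)

levels(sample.data$vendor)  # "34" "35"

# relevel() moves the chosen level to the front of the level order.
sample.data$vendor <- relevel(sample.data$vendor, ref = "35")
levels(sample.data$vendor)  # "35" "34"

# Ordering by a factor follows its level order, then by month within vendor.
ordered.data <- sample.data[order(sample.data$vendor, sample.data$year.month), ]
```

Whether the first level ends up at the bottom or the top of the stack may depend on the ggplot2 version, so it is worth checking the resulting plot.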
Here is some code to figure out what vendor to set as the base level according to your criterion:
with(sample.data, sample.data[year.month=="2016-05",
"vendor"][which.max(sample.data[year.month=="2016-05", "value"])])

Trouble plotting dates with ggplot2

I'm trying to plot against dates in R. I've run into trouble trying to create vertical lines against a plot that I already have. All of the different formats that I try either result in nothing showing up on the plot, or a line at 1970 (the default date). The year-data is in the form yyyy-mm-dd. For example, "1914-07-01".
I've also tried inputting these dates in a data.frame, but got the same problem.
I've been trying to make a reproducible example, but I haven't seen any example datasets to do so with, and got frustrated trying to create one... sorry about that. Here's the relevant code:
ggplot(M,aes(x=date,color=origin,y=value)) +
geom_point() +
geom_line() +
facet_grid(topic~origin) +
geom_vline(xintercept=as.numeric(as.Date("1914-07-01")))
Everything plots correctly without the addition of the final line.
Edit: here's the result of dput(head(M)):
structure(list(topic = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25"), class = "factor"), date = structure(c(-1767196800, -1765987200,
-1764518400, -1763308800, -1762099200, -1760889600), class = c("POSIXct",
"POSIXt"), tzone = ""), origin = structure(c(2L, 2L, 2L, 2L,
2L, 2L), .Label = c("Blast", "The_Egoist"), class = "factor"),
value = c(6.69960398194253e-07, 7.48757156068349e-07, 7.04834977806836e-07,
7.10226526475778e-07, 6.8295233938925e-07, 6.16466066169137e-07
)), .Names = c("topic", "date", "origin", "value"), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), vars = list(
topic, date), drop = TRUE, indices = list(0L, 1L, 2L, 3L,
4L, 5L), group_sizes = c(1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
topic = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22",
"23", "24", "25"), class = "factor"), date = structure(c(-1767196800,
-1765987200, -1764518400, -1763308800, -1762099200, -1760889600
), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-6L), .Names = c("topic", "date"), vars = list(topic, date)))
You were very close. The problem is that your data is POSIXct, but you were converting the intercept to Date. To fix it, convert to POSIXct instead:
ggplot(M,aes(x=date,color=origin,y=value)) +
geom_point() +
geom_line() +
facet_grid(topic~origin) +
geom_vline(xintercept=as.numeric(as.POSIXct("1914-07-01")))
You can see the difference in the calls:
as.numeric(as.POSIXct("1914-07-01"))
[1] -1751569200
as.numeric(as.Date("1914-07-01"))
[1] -20273
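This explains why the intercept landed so close to 1970 (the zero point for both classes): a Date is stored as a number of days, while the POSIXct axis counts seconds.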
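The relationship between the two numbers can be checked directly. This sketch fixes the time zone to UTC so the result does not depend on the local setting (the output above was evidently produced in a different time zone); the variable names are illustrative.

```r
d <- as.Date("1914-07-01")
p <- as.POSIXct("1914-07-01", tz = "UTC")

days    <- as.numeric(d)  # days since 1970-01-01
seconds <- as.numeric(p)  # seconds since 1970-01-01 00:00:00 UTC

# In UTC the two differ exactly by the number of seconds in a day.
stopifnot(seconds == days * 86400)

# Reading the day count as seconds lands a few hours before 1970,
# which is where the misplaced vertical line was drawn.
misread <- as.POSIXct(days, origin = "1970-01-01", tz = "UTC")
```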

R - plot vertical profile

I have measurements of CH4 concentration with depth:
df <- structure(list(Depth = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 15L, 16L, 17L), .Label = c("0", "10",
"12", "14", "16", "18", "2", "20", "22", "24", "26", "28", "30",
"32", "4", "6", "8", "AR"), class = "factor"), Conc_CH4 = c(4.30769230769231,
23.1846153846154, 14.5615384615385, 21.1769230769231, 16.2615384615385,
132.007692307692, 5.86923076923077, 389.353846153846, 823.023076923077,
948.684615384615, 1436.56923076923, 1939.88461538462, 26.2769230769231,
27.5538461538462, 19.6461538461538)), .Names = c("Depth", "Conc_CH4"
), row.names = c(NA, -15L), class = "data.frame")
And I need to create a plot like this:
But I have some problems: the factors in my data are in the wrong order, and I don't know how to plot this kind of data using ggplot2.
Any ideas?
Here's a solution with base plotting functions (you reverse the limits of ylim):
df$Depth <- as.numeric(as.character(df$Depth))
df <- df[order(df$Depth),]
plot(Depth~Conc_CH4, df, t="l", ylim=rev(range(df$Depth)))
Why not convert Depth to a number and plot?
ggplot(transform(df, Depth=as.numeric(as.character(df$Depth))),
aes(x=Conc_CH4, y=Depth)) +
geom_line() + scale_y_reverse()
The as.numeric(as.character(...)) is needed because your Depth is a factor: calling as.numeric directly on a factor returns the internal level codes rather than the numbers in the labels.
The scale_y_reverse reverses the y scale.
If your actual data contains a depth of "AR", you will have to omit those rows or handle them some other way.
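One caveat worth sketching: geom_line() connects points in order of the x variable, which here is the concentration, whereas a vertical profile is usually meant to be connected in depth order. A possible variant, assuming a small subset of the data plus a made-up "AR" row, sorts by depth and uses geom_path(), which connects points in the order they appear in the data:

```r
library(ggplot2)

# Subset of the measurements plus a hypothetical "AR" row to show the cleanup.
df <- data.frame(
  Depth = factor(c("0", "10", "2", "20", "AR")),
  Conc_CH4 = c(4.3, 23.2, 5.9, 389.4, NA)
)

# Factor -> character -> numeric; "AR" becomes NA and is dropped.
df$Depth <- suppressWarnings(as.numeric(as.character(df$Depth)))
df <- df[!is.na(df$Depth), ]

# Sort by depth so that geom_path() joins the points from top to bottom.
df <- df[order(df$Depth), ]

p <- ggplot(df, aes(x = Conc_CH4, y = Depth)) +
  geom_path() +
  scale_y_reverse()
```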
