ggplot geom_area failing - r

Using the data and the line of code below, I am trying to produce a stacked area plot showing planned spend by project across the quarters in the data, with Capex on the y axis and quarters on the x axis. I have looked at many examples here and elsewhere, and I cannot understand why it is failing. I'd like to post a screenshot of the result but can't see a way to do that. The plot has the legend, and the axes look correct, but the panel itself is just an empty grey grid.
Code:
ggplot(short, aes(x=Quarter,y=Capex, fill=ProjectName, )) + geom_area(position = "stack") + ylim (1, 100000)
data:
ProjectName Quarter Capex
a F01 Jul 41709
a F02 Aug 41696
a F03 Sep 41667
a F04 Oct 41712
a F05 Nov 41676
a F06 Dec 41674
a F07 Jan 41694
a F08 Feb 41693
a F09 Mar 41698
a F10 Apr 41710
a F11 May 41694
a F12 Jun 41671
b F01 Jul 265197
b F02 Aug 265200
b F03 Sep 265187
b F04 Oct 265190
b F05 Nov 265179
b F06 Dec 265170
b F07 Jan 265167
b F08 Feb 265174
b F09 Mar 265187
b F10 Apr 265169
b F11 May 265186
b F12 Jun 265208
c F01 Jul 233335
c F02 Aug 233352
c F03 Sep 233344
c F04 Oct 233344
c F05 Nov 233344
c F06 Dec 233350
c F07 Jan 32
c F08 Feb 31
c F09 Mar 23
c F10 Apr 5046
c F11 May 5005
c F12 Jun 50
d F01 Jul 40
d F02 Aug 43
d F03 Sep 30
d F04 Oct 5038
d F05 Nov 45
d F06 Dec 8
d F07 Jan 45
d F08 Feb 20034
d F09 Mar 40
d F10 Apr 40
d F11 May 2
d F12 Jun 500045
e F01 Jul 300011

I'm pretty sure you want a stacked bar chart, not an area chart? Is this what you're after?
ggplot(short, aes(x = Quarter, y = Capex, fill = ProjectName)) +
  geom_bar(stat = "identity")
I'm not sure why you've set those y-axis limits, since they cut off your data. If you do need limits, set them with scale_y_continuous(limits = c(min, max)).
As a note, it's better to use the output from dput(data) when sharing your data, as it brings the structure of the data along with it. Have a look at How to make a great R reproducible example?
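If an area chart is still what you want, here is a minimal sketch of one likely fix. It assumes Quarter is a discrete (character or factor) column. In that case ggplot2 groups the rows by every Quarter/ProjectName combination, so each group holds a single point and no area can be drawn. Mapping group = ProjectName explicitly is the usual way around that. The small data frame below is only a subset of the posted data (projects a and b, first three quarters), rebuilt by hand for illustration.

```r
library(ggplot2)

# Hand-built subset of the posted data (projects a and b, first three quarters)
short <- data.frame(
  ProjectName = rep(c("a", "b"), each = 3),
  Quarter     = rep(c("F01 Jul", "F02 Aug", "F03 Sep"), times = 2),
  Capex       = c(41709, 41696, 41667, 265197, 265200, 265187)
)

# group = ProjectName gives one series per project across the discrete x axis
p <- ggplot(short, aes(x = Quarter, y = Capex,
                       fill = ProjectName, group = ProjectName)) +
  geom_area(position = "stack")

print(p)
```

The ylim(1, 100000) call is left out on purpose: scale limits drop values outside the range, and most of the Capex values here are above 100000. If the goal is only to zoom in, coord_cartesian(ylim = ...) changes the visible range without discarding data.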

Related

Reading a date, time text file and converting to string using strptime()?

I have a text file with many rows, each containing a date and time. The end goal is to count the number of rows that fall in each week, so that I can plot a scatter diagram with the week number on the x axis and the frequency on the y axis. For example, the text file (dates.txt):
Mon May 11 22:51:27 2013
Mon May 11 22:58:34 2013
Wed May 13 23:15:27 2013
Thu May 14 04:11:22 2013
Sat May 16 19:46:55 2013
Sat May 16 22:29:54 2013
Sun May 17 02:08:45 2013
Sun May 17 23:55:15 2013
Mon May 18 00:42:07 2013
So from here, week 1 will have a frequency of 6 and week 2 will have a frequency of 1.
As I want to plot a scatter diagram for this, I want to convert them to text values first using strptime() with format %a %b.
My attempt so far has been:
time_stamp <- strptime(time_stamp, format='%a.%b')
However, it reports that the input string is too long. I'm very new to RStudio, so could somebody please help me figure this out?
Thank you
Example of final output graph : https://imgur.com/a/3o3DivA
You could use readLines() to avoid the data frame, then read time using strptime, and finally strftime to format the output.
strftime(strptime(readLines('dates.txt'), '%c'), '%a.%b')
# [1] "Sat.May" "Sat.May" "Mon.May" "Tue.May" "Thu.May" "Thu.May" "Fri.May" "Fri.May" "Sat.May"
Edit
So it appears that your dates have a time zone abbreviation, e.g. "Mon Apr 06 23:49:29 PDT 2009". Since it is constant across the dates, we can specify it literally in the pattern.
We will use '%d_%m' for strftime to get numeric parts separated by _, which we feed to strsplit and then convert to numerics with type.convert.
Finally we unlist, create a matrix that we fill byrow, and plot it.
strptime(readLines('timestamp.txt'), '%a %b %d %H:%M:%S PDT %Y') |>
  strftime('%d_%m') |>
  strsplit('_') |>
  type.convert(as.is=TRUE) |>
  unlist() |>
  matrix(ncol=2, byrow=TRUE) |>
  plot(pch=20, col=4, main='My Plot', xlab='day', ylab='month')
Note: Please use R>=4.1 for the |> pipes.
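To get from the parsed timestamps to the weekly frequencies the question asks for, one possible base R sketch follows. It assumes an English locale (so that %a and %b match "Mon" and "May"), parses in UTC, and uses ISO 8601 week numbers via %V. The names stamps, parsed, week and freq are made up for this example, and the sample lines are pasted in directly instead of being read from dates.txt.

```r
# Sample lines from the question, inlined instead of readLines('dates.txt')
stamps <- c("Mon May 11 22:51:27 2013", "Mon May 11 22:58:34 2013",
            "Wed May 13 23:15:27 2013", "Thu May 14 04:11:22 2013",
            "Sat May 16 19:46:55 2013", "Sat May 16 22:29:54 2013",
            "Sun May 17 02:08:45 2013", "Sun May 17 23:55:15 2013",
            "Mon May 18 00:42:07 2013")

# Parse the full timestamp; the format must describe the whole string
parsed <- as.POSIXct(strptime(stamps, format = "%a %b %d %H:%M:%S %Y", tz = "UTC"))

# ISO 8601 week of the year for each row
week <- as.integer(format(parsed, "%V"))

# Count rows per week and turn the table into a plain data frame
freq <- as.data.frame(table(week = week), stringsAsFactors = FALSE)
freq$week <- as.integer(freq$week)

# Scatter plot of frequency against week number
plot(freq$week, freq$Freq, pch = 20, xlab = "week", ylab = "frequency")
```

The week boundaries come from the calendar dates, not from the weekday names in the file, so the counts may differ from a hand count that trusts the printed weekdays.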
You need to first read (or assign) the data, parse it to a date type and then use that to e.g. get the number of the week.
Here is one example
text <- "Mon May 11 22:51:27 2013
Mon May 11 22:58:34 2013
Wed May 13 23:15:27 2013
Thu May 14 04:11:22 2013
Sat May 16 19:46:55 2013
Sat May 16 22:29:54 2013
Sun May 17 02:08:45 2013
Sun May 17 23:55:15 2013
Mon May 18 00:42:07 2013"
data <- read.table(text=text, sep='\n', col.names="dates")
data$parse <- anytime::anytime(data$dates)
data$week <- as.integer(format(data$parse, "%V"))
data
The result is a new data.frame object:
> data
dates parse week
1 Mon May 11 22:51:27 2013 2013-05-11 22:51:27 19
2 Mon May 11 22:58:34 2013 2013-05-11 22:58:34 19
3 Wed May 13 23:15:27 2013 2013-05-13 23:15:27 20
4 Thu May 14 04:11:22 2013 2013-05-14 04:11:22 20
5 Sat May 16 19:46:55 2013 2013-05-16 19:46:55 20
6 Sat May 16 22:29:54 2013 2013-05-16 22:29:54 20
7 Sun May 17 02:08:45 2013 2013-05-17 02:08:45 20
8 Sun May 17 23:55:15 2013 2013-05-17 23:55:15 20
9 Mon May 18 00:42:07 2013 2013-05-18 00:42:07 20
>

Model failed to converge (lme4)

I would like to achieve the following task. Using a linear mixed model, I would like to check whether "Month" (see the dat table) has a significant effect on the "Response" variable. Because data for some tanks come from different months, I included "Tank" as a random factor in my model. Please note that sampling the same tank in different months does not change the "Response" variable. For some tank-month combinations there are multiple records, because we included the compartment of the tank that was sampled (e.g. NW = north west).
Here is the data:
print(dat)
Tank Month ID Response
1 AEW1 Jul AEW01SOBFJul2008 1.80522937
2 AEW10 Jul AEW10NWBFJul2008 2.13374401
3 AEW10 Jul AEW10NWBFJul2008 2.13374401
4 AEW11 Jun AEW11SWBFJun2008 1.65010205
5 AEW14 Jun AEW14SWBFJun2008 1.75459326
6 AEW15 Jun AEW15SOBFJun2008 2.82200903
7 AEW15 Jun AEW15SOBFJun2008 2.82200903
8 AEW18 Jul AEW18SOBFJul2008 0.39349330
9 AEW19 Jul AEW19NWBFJul2008 0.65886661
10 AEW20 Jul AEW20NWBFJul2008 1.07838018
11 AEW24 Jun AEW24NOBFJun2008 2.56677635
12 AEW27 Jul AEW27SWBFJul2008 2.64019328
13 AEW27 Jul AEW27SWBFJul2008 2.64019328
14 AEW29 Jul AEW29SOBFJul2008 2.06251217
15 AEW30 Jul AEW30NWBFJul2008 1.17010646
16 AEW31 Jun AEW31SWBFJun2008 2.25518873
17 AEW32 Jun AEW32SOBFJun2008 2.38707614
18 AEW33 Jun AEW33SOBFJun2008 2.30498448
19 AEW33 Jun AEW33SOBFJun2008 2.30498448
20 AEW36 Jul AEW36NOBFJul2008 1.92368247
21 AEW37 Jun AEW37NOBFJun2008 0.99387013
22 AEW39 Jul AEW39NOBFJul2008 1.24163732
23 AEW4 Jul AEW04SWBFJul2008 1.56327732
24 AEW42 Jun AEW42SWBFJun2008 1.26012579
25 AEW44 Jun AEW44SWBFJun2008 0.75985267
26 AEW48 Aug AEW48SOBFAug2008 1.57920494
27 AEW50 Jul AEW50NOBFJul2008 0.90052629
28 AEW8 Jul AEW08NOBFJul2008 0.00000000
29 AEW8 Jul AEW08NOBFJul2008 0.00000000
30 AEW9 Jul AEW09NOBFJul2008 0.48529647
31 HEW10 Jun HEW10SWBFJun2008 0.06412823
32 HEW10 Aug HEW10SOBFAug2008 0.06412823
33 HEW12 Jul HEW12NOBFJul2008 0.00000000
34 HEW13 Aug HEW13NWBFAug2008 2.24515850
35 HEW13 Jul HEW13SOBFJul2008 2.24515850
36 HEW13 Jul HEW13NOBFJul2008 2.24515850
37 HEW13 Jun HEW13SOBFJun2008 2.24515850
38 HEW13 Jun HEW13NWBFJun2008 2.24515850
39 HEW14 Jul HEW14SOBFJul2008 1.64783184
40 HEW18 Jun HEW18NWBFJun2008 1.32435721
41 HEW18 Jun HEW18NWBFJun2008 1.32435721
42 HEW19 Jul HEW19SWBFJul2008 1.01761003
43 HEW19 Jul HEW19SWBFJul2008 1.01761003
44 HEW22 Aug HEW22SWBFAug2008 0.63861037
45 HEW23 Jun HEW23SWBFJun2008 1.38472769
46 HEW23 Jun HEW23NWBFJun2008 1.38472769
47 HEW28 Jun HEW28NOBFJun2008 1.44377199
48 HEW3 Jun HEW03SWBFJun2008 2.19793633
49 HEW3 Jul HEW03SWBFJul2008 2.19793633
50 HEW30 Aug HEW30NWBFAug2008 0.76260579
51 HEW31 Jul HEW31SWBFJul2008 1.07879539
52 HEW35 Jun HEW35NWBFJun2008 0.86098152
53 HEW35 Jun HEW35NWBFJun2008 0.86098152
54 HEW36 Aug HEW36SOBFAug2008 0.36533352
55 HEW39 Jun HEW39SOBFJun2008 0.09283168
56 HEW4 Jun HEW04SWBFJun2008 1.89046783
57 HEW41 Aug HEW41NWBFAug2008 0.31996275
58 HEW41 Aug HEW41NWBFAug2008 0.31996275
59 HEW41 Jul HEW41NWBFJul2008 0.31996275
60 HEW41 Jul HEW41NWBFJul2008 0.31996275
61 HEW42 Jul HEW42NWBFJul2008 0.53998250
62 HEW43 Jun HEW43SWBFJun2008 1.85594061
63 HEW43 Jun HEW43SWBFJun2008 1.85594061
64 HEW44 Jun HEW44SOBFJun2008 1.79972095
65 HEW44 Jun HEW44SOBFJun2008 1.79972095
66 HEW49 Jun HEW49SWBFJun2008 1.25229249
67 HEW5 Aug HEW05SWBFAug2008 0.95559764
68 HEW50 Jun HEW50NWBFJun2008 0.42309531
69 HEW50 Jun HEW50NWBFJun2008 0.42309531
70 HEW7 Jul HEW07NWBFJul2008 0.69484213
71 HEW7 Jun HEW07NWBFJun2008 0.69484213
72 HEW8 Jul HEW08SWBFJul2008 1.15617440
73 SEW1 Aug SEW01NWBFAug2008 1.90030109
74 SEW1 Sep SEW01SWBFSep2008 1.90030109
75 SEW11 Aug SEW11NWBFAug2008 2.11940912
76 SEW12 Aug SEW12SOBFAug2008 2.29658624
77 SEW12 Jul SEW12SOBFJul2008 2.29658624
78 SEW17 Aug SEW17NOBFAug2008 1.49277937
79 SEW17 Jul SEW17NOBFJul2008 1.49277937
80 SEW17 Sep SEW17NOBFSep2008 1.49277937
81 SEW17 Aug SEW17SOBFAug2008 1.49277937
82 SEW18 Aug SEW18SOBFAug2008 1.70247509
83 SEW19 Aug SEW19SOBFAug2008 2.11617036
84 SEW20 Jul SEW20SWBFJul2008 1.87718089
85 SEW20 Jul SEW20SOBFJul2008 1.87718089
86 SEW22 Aug SEW22NOBFAug2008 0.77473833
87 SEW23 Aug SEW23NWBFAug2008 0.96183454
88 SEW23 Aug SEW23NOBFAug2008 0.96183454
89 SEW24 Jul SEW24SWBFJul2008 0.64090368
90 SEW24 Jul SEW24NWBFJul2008 0.64090368
91 SEW29 Jul SEW29SOBFJul2008 1.54699664
92 SEW29 Aug SEW29SWBFAug2008 1.54699664
93 SEW29 Aug SEW29SOBFAug2008 1.54699664
94 SEW34 Aug SEW34NWBFAug2008 1.79425003
95 SEW36 Jul SEW36SOBFJul2008 1.20337761
96 SEW4 Aug SEW04SWBFAug2008 1.59611963
97 SEW40 Sep SEW40SOBFSep2008 1.36486039
98 SEW40 Aug SEW40SWBFAug2008 1.36486039
99 SEW43 Sep SEW43SOBFSep2008 1.03169382
100 SEW44 Aug SEW44SWBFAug2008 0.79705660
101 SEW45 Jul SEW45NWBFJul2008 0.34130398
102 SEW46 Aug SEW46SOBFAug2008 0.20690386
103 SEW47 Aug SEW47SWBFAug2008 0.01564703
104 SEW47 Sep SEW47SWBFSep2008 0.01564703
105 SEW48 Aug SEW48SWBFAug2008 0.46745254
106 SEW5 Aug SEW05SWBFAug2008 0.68900435
107 SEW50 Aug SEW50NWBFAug2008 1.10731406
108 SEW7 Aug SEW07SWBFAug2008 0.08552432
109 SEW8 Jul SEW08NWBFJul2008 0.18731374
The model I generated so far is: Mod1 <- lmer(Response ~ Month + (1|Tank), data=dat)
Again, I included "Tank" because we sampled some tanks in several months but that does not change the response variable. Consequently, the response variable is fixed for each tank. Nevertheless, multiple data points originate from the same tank and I tried to account for that by including it as a random factor.
Fitting Mod1 results in the following message:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.306567 (tol = 0.002, component 1)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
The question now is whether the model is overly complex and whether I can drop "Tank" as a random factor, since measuring tanks repeatedly had no effect on the response variable.
In other words, would a simple linear model, Mod1 <- lm(Response ~ Month, data = dat), be valid? And if not, how can I resolve the two convergence warnings?
Any help is very much appreciated! :)
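As a quick diagnostic, the statement that Response is fixed for each tank can be checked directly. The sketch below uses a handful of rows copied from the table above (the data frame name dat_sub is made up here) and measures the spread of Response within each tank. If that spread is zero everywhere, there is no within-tank variation left for the residual term. That could plausibly explain the "nearly unidentifiable" warning, though this is an assumption and not a confirmed diagnosis.

```r
# A few rows copied from the posted table (tanks sampled in several months)
dat_sub <- data.frame(
  Tank     = c("HEW10", "HEW10", "HEW13", "HEW13", "HEW13", "SEW1", "SEW1"),
  Month    = c("Jun", "Aug", "Aug", "Jul", "Jun", "Aug", "Sep"),
  Response = c(0.06412823, 0.06412823, 2.24515850, 2.24515850,
               2.24515850, 1.90030109, 1.90030109)
)

# Spread (max - min) of Response within each tank
spread <- tapply(dat_sub$Response, dat_sub$Tank, function(x) diff(range(x)))
print(spread)

# TRUE means Response never varies within a tank in this subset
all(spread == 0)
```

Running the same tapply() call on the full dat would show whether this holds for every tank.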

Converting Month character to date for time series without "0" before Month

How do I convert this data set into a time series format in R? Let's call the data set Bob. This is what it looks like:
1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24
Are you looking for something like this....?
> dat <- read.table(text = "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24
", header=FALSE) # your data
> ts(dat$V2, start=c(2013, 1), frequency = 12) # time series object
Jan Feb Mar Apr May Jun
2013 25 865 26 33 74 24
Assuming that your starting point is the data frame DF defined reproducibly in the Note at the end, this converts it to a zoo series z as well as a ts series tt.
library(zoo)
z <- read.zoo(DF, FUN = as.yearmon, format = "%m/%Y")
tt <- as.ts(z)
z
## Jan 2013 Feb 2013 Mar 2013 Apr 2013 May 2013 Jun 2013
## 25 865 26 33 74 24
tt
## Jan Feb Mar Apr May Jun
## 2013 25 865 26 33 74 24
Note
Lines <- "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24"
DF <- read.table(text = Lines)
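If a Date column is wanted instead of a ts or zoo object, a base R sketch is to prepend a day and parse the result with as.Date. On input, as.Date accepts a month without a leading zero, so "1/2013" needs no padding. Using the first day of each month is an assumption made here for illustration, as are the column names month_year, value and date.

```r
# Same sample data, with illustrative column names
dat <- read.table(text = "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24", header = FALSE, col.names = c("month_year", "value"))

# Prepend day 1 so the string is a complete date, then parse it
dat$date <- as.Date(paste0("1/", dat$month_year), format = "%d/%m/%Y")

print(dat)
```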

What is wrong with my R codes for transforming a wide data frame into the long format?

I am running the following R code in RStudio with the aim of converting a wide data frame (called 'merged') into a long one.
> merged
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2017 (A) 5980 5341 5890 5596 5753 5470 5589 5545 5749 5938 5844 5356
2017 (P) 5762 5275 5733 5411 5406 4954 5464 5536 5805 5819 5903 5630
I'm after the following output:
Description Month RN
2017 (A) Jan 5980
2017 (P) Jan 5762
2017 (A) Feb 5341
2017 (P) Feb 5275
... ... ...
I have tried the following (but with no success):
library(reshape2)
merged_long <- melt(data=merged,
                    id.vars="Description",
                    variable.name="Month",
                    value.name="RN")
I'm getting the following error message:
Error: id variables not found in data: Description
What am I doing wrong?
As noted by @Sotos in the comments, data in the rownames of the merged data set is required to uniquely identify an observation in the melted data set. To include the rownames in the melted data set, add the following to your code.
merged$Description <- rownames(merged)
Then your original code should produce the expected result.
library(reshape2)
merged_long <- melt(data=merged,
                    id.vars="Description",
                    variable.name="Month",
                    value.name="RN")
It's easiest to just use melt(as.matrix(...)) given the nature of your data. Omit the as.matrix part if your data is already a matrix, obviously.
melt(as.matrix(mydf))
You can use setNames to rename the columns at the same time:
setNames(melt(as.matrix(mydf)), c("Description", "Month", "RN"))
# Description Month RN
# 1 2017 (A) Jan 5980
# 2 2017 (P) Jan 5762
# 3 2017 (A) Feb 5341
# .........................
# .........................
# 23 2017 (A) Dec 5356
# 24 2017 (P) Dec 5630
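For comparison, the same reshaping can be sketched in base R without reshape2 by going through a table: as.table() keeps the row and column names, and as.data.frame() unrolls them into one row per cell. The merged data frame below is a hand-built three-month subset of the posted data, and the name long is made up for this example.

```r
# Hand-built subset of 'merged' (first three months only)
merged <- data.frame(
  Jan = c(5980, 5762),
  Feb = c(5341, 5275),
  Mar = c(5890, 5733),
  row.names = c("2017 (A)", "2017 (P)")
)

# Matrix -> table -> long data frame; row names become the first column
long <- as.data.frame(as.table(as.matrix(merged)), responseName = "RN")
names(long)[1:2] <- c("Description", "Month")

print(long)
```

The first two columns come back as factors, which keeps the months in their original order for plotting.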

Calculate average of last 3 non holiday weekdays

I have a data frame of profile hits with date, week, and weekday across various categories.
Sample data is given below (Input data). I want to output a data frame with the average of the last 3 occurrences of each weekday (Sunday to Saturday) from non-holiday weeks, across all categories.
As you can see in the required output below, none of the data from the holiday week is considered. Is there an easy way of achieving this without loops? If so, how?
required output:
CAT Day Avg
A SUN =(1 + 3+99) /3
A MON =(6+67+ 45) /3
A TUE = (2+ 53+ 68)/3
A WED
A THU
A FRI
A SAT
Input data:
CAT DATE WEEJ DAY Hits Holiday Week
A 9/3/2016 2016-35 SAT 58 No
A 9/2/2016 2016-35 FRI 9 No
A 9/1/2016 2016-35 THU 20 No
A 8/31/2016 2016-35 WED 92 No
A 8/30/2016 2016-35 TUE 2 No
A 8/29/2016 2016-35 MON 6 No
A 8/28/2016 2016-35 SUN 1 No
A 8/27/2016 2016-34 SAT 58 Yes
A 8/26/2016 2016-34 FRI 56 Yes
A 8/25/2016 2016-34 THU 40 Yes
A 8/24/2016 2016-34 WED 42 Yes
A 8/23/2016 2016-34 TUE 59 Yes
A 8/22/2016 2016-34 MON 21 Yes
A 8/21/2016 2016-34 SUN 98 Yes
A 8/20/2016 2016-33 Sat 2 No
A 8/19/2016 2016-33 FRI 85 No
A 8/18/2016 2016-33 THU 29 No
A 8/17/2016 2016-33 WED 37 No
A 8/16/2016 2016-33 TUE 53 No
A 8/15/2016 2016-33 MON 67 No
A 8/14/2016 2016-33 SUN 3 No
A 8/13/2016 2016-32 SAT 35 No
A 8/12/2016 2016-32 FRI 24 No
A 8/11/2016 2016-32 THU 94 No
A 8/10/2016 2016-32 WED 81 No
A 8/9/2016 2016-32 TUE 68 No
A 8/8/2016 2016-32 MON 45 No
A 8/7/2016 2016-32 SUN 99 No
We can use data.table
library(data.table)
setDT(df1)[order(-as.IDate(DATE, "%m/%d/%Y"), toupper(DAY))
][HolidayWeek=="No",.(Ave = sum(Hits[1:3])/.N) , by = .(DAY=toupper(DAY))]
# DAY Ave
#1: SAT 31.66667
#2: FRI 39.33333
#3: THU 47.66667
#4: WED 70.00000
#5: TUE 41.00000
#6: MON 39.33333
#7: SUN 34.33333
If it is the average of the 3 'Hits'
setDT(df1)[order(-as.IDate(DATE, "%m/%d/%Y"), toupper(DAY))
][HolidayWeek=="No",.(Ave = mean(Hits[1:3])) , by = .(DAY=toupper(DAY))]
Here's a solution with dplyr:
library(dplyr)
answer <- x %>% filter(Holiday=="No") %>% group_by(Day) %>%
top_n(3,desc(Date)) %>% summarise(Avg = sum(Hits)/n())
It removes all holiday rows, then for every 'DAY' takes the last three dates and finally sums the hits and divides by the number of those days, giving you the average.
Please note that your days of the week aren't all uppercase.
library(data.table)
setDT(df)[Holiday_Week == 'No', .(Avg = sum(head(Hits, 3))/.N), by = .(CAT, DAY = tolower(DAY))]
# CAT DAY Avg
#1: A sat 31.66667
#2: A fri 39.33333
#3: A thu 47.66667
#4: A wed 70.00000
#5: A tue 41.00000
#6: A mon 39.33333
#7: A sun 34.33333
A base R solution
do.call("rbind",
        lapply(split(df, df[, c("Holiday", "CAT", "DAY")]),
               function(x) if (x$Holiday[1] == "Yes") {
                 NULL
               } else {
                 data.frame(CAT = x$CAT[1],
                            DAY = x$DAY[1],
                            MN = mean(tail(x[order(x$DATE), ], 3)$Hits))
               }))
# CAT DAY MN
#No.A.FRI A FRI 39.33333
#No.A.MON A MON 39.33333
#No.A.SAT A SAT 31.66667
#No.A.SUN A SUN 34.33333
#No.A.THU A THU 47.66667
#No.A.TUE A TUE 41.00000
#No.A.WED A WED 70.00000
Average by day, split into holiday and non-holiday weeks:
library(data.table)
# data <- your input data frame
setDT(data)[, mean(Hits), by = .(DAY, Holiday)]
Perhaps use tolower(DAY), as there are some naming differences in your data.
For non-holiday weeks only:
setDT(data)[Holiday == "No", mean(Hits), by = tolower(DAY)]
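A self-contained base R sketch of the "last 3 non-holiday occurrences per weekday" idea is shown below. It uses only the SUN and MON rows from the posted input, and it assumes the column is called Holiday and that DATE is in month/day/year form. The names df, keep and avg are made up for this example.

```r
# SUN and MON rows copied from the posted input data
df <- data.frame(
  CAT     = "A",
  DATE    = c("8/29/2016", "8/28/2016", "8/22/2016", "8/21/2016",
              "8/15/2016", "8/14/2016", "8/8/2016", "8/7/2016"),
  DAY     = c("MON", "SUN", "MON", "SUN", "MON", "SUN", "MON", "SUN"),
  Hits    = c(6, 1, 21, 98, 67, 3, 45, 99),
  Holiday = c("No", "No", "Yes", "Yes", "No", "No", "No", "No")
)

# Parse dates so that ordering is chronological, not alphabetical
df$DATE <- as.Date(df$DATE, format = "%m/%d/%Y")

# Drop holiday weeks, then put the most recent rows first
keep <- df[df$Holiday == "No", ]
keep <- keep[order(keep$DATE, decreasing = TRUE), ]

# Mean of the first (most recent) 3 hits per category and weekday
avg <- aggregate(Hits ~ CAT + toupper(DAY), data = keep,
                 FUN = function(h) mean(head(h, 3)))
names(avg) <- c("CAT", "DAY", "Avg")

print(avg)
```

Parsing DATE first matters: sorting the raw strings would order "8/9/2016" after "8/29/2016".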
