Creating a line graph using ggplot in R

Days Profit
4672 5195 79823.72824
4673 5196 79823.72824
4674 5197 79823.72824
4675 5198 79823.72824
4676 5199 79823.72824
4677 5200 79823.72824
4678 5201 79823.72824
4679 5203 77760.56168
4680 5204 77760.56168
4681 5205 77760.56168
4682 5206 77760.56168
4683 5207 77760.56168
4684 5212 85379.47144
4685 5213 85379.47144
4686 5214 85379.47144
4687 5215 85379.47144
4688 5216 85379.47144
Above is an example of the data frame I created. I have posted only a small chunk because the full data frame has around 7000 rows. I am trying to create a line graph from the data using ggplot. The graphs I see others post look clean and professional, but mine does not. Below is how I used ggplot, followed by the resulting graph.
df <- data.frame(Days = Day_value, Profit = PnL_value)
p <- ggplot(data=df, aes(x=Days, y=Profit)) + geom_line() + geom_point()
The graph below is the print of my entire data set and not just the select data I shared. One thing to notice is that my Days column does not always increment by 1.
I would ideally have my graph look like this one with only 1 line instead of 3.
Ideal Graph
When I check the Structure I get:
'data.frame': 6993 obs. of 2 variables:
$ Days : Factor w/ 6993 levels "100","1000","1001",..: 4180 4286 4397 4489 4598 4699 4810 4910 5008 5111 ...
$ Profit: num 0 0 0 0 0 0 0 0 0 0 ...
NULL
When I run it I also get this error:
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
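The structure output above shows Days stored as a factor, which is the usual cause of this geom_path message: ggplot treats every factor level as its own group, so no two points are joined. A minimal sketch of one possible fix, assuming the factor labels are all plain numbers (the small data frame below is a hypothetical stand-in built from a few of the rows shown above):

```r
library(ggplot2)

# Hypothetical stand-in for the asker's data: Days stored as a factor
df <- data.frame(
  Days   = factor(c("5195", "5196", "5197", "5203", "5204", "5212")),
  Profit = c(79823.73, 79823.73, 79823.73, 77760.56, 77760.56, 85379.47)
)

# Convert the factor labels (not the internal level codes) to numbers
df$Days <- as.numeric(as.character(df$Days))

# With a numeric x-axis, all points fall into one group and form a single line
p <- ggplot(df, aes(x = Days, y = Profit)) +
  geom_line()
print(p)
```

Going through as.character() first matters: calling as.numeric() directly on a factor returns the level codes (1, 2, 3, ...) rather than the day numbers.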

Related

Violin plot from summary data

I'd like to use a violin plot to visualise the number of archaeological artefacts by site (A and B) and by century with data in the following format (years are Before Present):
Year SiteA SiteB
22400 356 182
22500 234 124
22600 144 231
22700 12 0
...
24800 112 32
There are some 6000 artefacts in total. In ggplot2, it would seem as if the preferred data entry format is of one line per observation (artefact) for a violin plot:
Site Year
A 22400
A 22400
... (356 times)
A 22400
B 22400
B 22400
... (182 times)
A 22500
A 22500
... (234 times)
A 22500
... ... ... (~5000 lines)
B 24800
B 24800
... (32 times)
B 24800
Is there an efficient way of converting the summary data frame (first block above) into an observation-by-observation data frame (second block above) for use in a violin plot?
Alternatively, is there a way of making violin plots from data formatted as in the first block?
Update:
With the answer provided by eipi10, if either Site A or B has zero artefacts (as in the updated example above for the year 22,700), I get the following error:
Error in data.frame(Year = rep(dat$Year[i], dat$value[i]), Site = dat$key[i]) :
arguments imply differing number of rows: 0, 1
The plot would look like this:
How about this:
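library(tidyverse)
# Read the summary table: one row per year, one count column per site
dat = read.table(text="Year SiteA SiteB
22400 356 182
22500 234 124
22600 144 231
24800 112 32", header=TRUE, stringsAsFactors=FALSE)
# Reshape to long form: one row per (Year, site) pair with its count in `value`
dat = gather(dat, key, value, -Year)
# Repeat each Year and site label `value` times, giving one row per artefact
dat.long = data.frame(Year = rep(dat$Year, dat$value), Site=rep(dat$key, dat$value))
ggplot(dat.long, aes(Site, Year)) +
  geom_violin()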
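The same expansion can be sketched in base R, including the zero-count row (year 22700) from the updated example. This is only an illustration of the vectorised rep() approach, not a replacement for the answer above; the variable name counts is introduced here for the sketch:

```r
# Summary data from the question, including the year with zero artefacts at Site B
dat <- read.table(text = "Year SiteA SiteB
22400 356 182
22500 234 124
22600 144 231
22700 12 0
24800 112 32", header = TRUE)

# One count per (site, year) pair: all Site A counts, then all Site B counts
counts <- c(dat$SiteA, dat$SiteB)

# rep() with a vector of times repeats each element that many times;
# a count of 0 simply contributes no rows
dat.long <- data.frame(
  Year = rep(rep(dat$Year, 2), times = counts),
  Site = rep(rep(c("A", "B"), each = nrow(dat)), times = counts)
)
```

Because rep() is applied to whole vectors rather than row by row, a zero count never produces a mismatched data.frame() call. The resulting dat.long has the same Year and Site columns as in the answer, so it can be passed to the same ggplot() call.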

Change axes' scale in a plot without creating a new variable

I have a dataset like below (this is only the first 20 rows and the first 3 columns of data):
row fitted measured
1 1866 1950
2 2489 2500
3 1486 1530
4 1682 1720
5 1393 1402
6 2524 2645
7 2676 2789
8 3200 3400
9 1455 1456
10 1685 1765
11 2587 2597
12 3040 3050
13 2767 2769
14 3300 3310
15 4001 4050
16 1918 2001
17 2889 2907
18 2063 2150
19 1591 1640
20 3578 3601
I plotted this data with:
plot(data$measured~data$fitted, ylab = expression("Measured Length (" * mu ~ "m)"),
xlab = expression("NIR Fitted Length (" * mu ~ "m)"), cex.lab=1.5, cex.axis=1.5)
and got the following:
As you can see, the axis scales are in micrometers; I need the axes to be in millimeters.
How can I plot the data with the axes in millimeters WITHOUT creating a new variable?
Like this:
If I create a new variable, I will have to change the 2000 lines of code I have already written, and that is not a road I want to go down! :|
Thanks much :)
I used @bdemarest's method for the plot and @IukeA's method for abline:
plot(y=data$measured/1000,x=data$fitted/1000, ylab = expression("Measured Length (mm)"),
xlab = expression("NIR Fitted Length (mm)"), cex.lab=1.5, cex.axis=1.5)
a = lm(I(data$measured/1000)~I(data$fitted/1000), data=data)
abline(a)
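Another way to get millimeter axes without touching the data, sketched here as an alternative and not taken from the answers mentioned above, is to draw the plot without axes and then add tick labels divided by 1000. The data frame below is a small excerpt of the rows shown in the question; x_ticks and y_ticks are names introduced for this sketch:

```r
# Small excerpt of the data shown above (values in micrometers)
data <- data.frame(
  fitted   = c(1866, 2489, 1486, 1682, 1393, 2524),
  measured = c(1950, 2500, 1530, 1720, 1402, 2645)
)

# Draw the points on the original scale, suppressing the default axes
plot(measured ~ fitted, data = data, axes = FALSE,
     xlab = "NIR Fitted Length (mm)", ylab = "Measured Length (mm)")

# Tick positions stay in micrometers; only the printed labels are divided by 1000
x_ticks <- pretty(data$fitted)
y_ticks <- pretty(data$measured)
axis(1, at = x_ticks, labels = x_ticks / 1000)
axis(2, at = y_ticks, labels = y_ticks / 1000)
box()

# A regression line fitted on the original scale still lines up with the points
abline(lm(measured ~ fitted, data = data))
```

With this approach the plotted coordinates remain in micrometers, so anything added later (lines, points, text) must also use micrometer coordinates.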
Here is the final plot:

R String split by spaces

I have a data frame with 2 columns, one of which holds strings that contain spaces. I have used strsplit to split each string into character tokens on the spaces. I defined a function for that (split), which I want to apply to the entire data frame:
split <- function (str) strsplit(str, "\\s+")[[1]]
data.frame(raw_rt_data, apply(raw_rt_data$stimulusitem1,2, split) )
Here is more information about my data frame:
str(raw_rt_data)
'data.frame': 5372 obs. of 2 variables:
$ stimulusitem1: Factor w/ 4313 levels "ABILITY TAX",..: 2483 3645 1339 2455 2769 3033 3998 2712 1313 250 ...
$ latency : int 4051 1266 2145 2959 1086 2956 3814 4924 4771 2654 ...
> head(raw_rt_data)
stimulusitem1 latency
1 MORNING BUBBLE 4051
2 SYSTEM MEN 1266
3 FRIEND PAIN 2145
4 MOMMY TINYURL 2959
5 PEACE INFORMATION 1086
6 PUBLIC SCRITS 2956
The problem is that executing the above code yields an error:
Error in apply(raw_rt_data$stimulusitem1, 2, split) :
dim(X) must have a positive length
What am I doing wrong? The desired result is 2 new columns: one containing the first token and the other containing the second token.
Any help is appreciated.
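The error arises because apply() expects a matrix or data frame, while raw_rt_data$stimulusitem1 is a single column with no dimensions; it is also a factor, which strsplit() does not accept. A minimal sketch of one way to get the two token columns, assuming every string contains exactly two space-separated tokens (the column names token1 and token2 are hypothetical):

```r
# Stand-in built from the first rows shown in head(raw_rt_data)
raw_rt_data <- data.frame(
  stimulusitem1 = factor(c("MORNING BUBBLE", "SYSTEM MEN", "FRIEND PAIN")),
  latency       = c(4051L, 1266L, 2145L)
)

# strsplit() is already vectorised: it returns one character vector per row.
# The factor must be converted to character first.
tokens <- strsplit(as.character(raw_rt_data$stimulusitem1), "\\s+")

# Stack the per-row token vectors into a two-column character matrix
token_mat <- do.call(rbind, tokens)

raw_rt_data$token1 <- token_mat[, 1]
raw_rt_data$token2 <- token_mat[, 2]
```

If some rows hold more or fewer than two tokens, rbind() will recycle values with a warning, so the token counts are worth checking with lengths(tokens) first.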

Customizing x-axis labels on plot

I plotted my data and successfully suppressed the automatic x-axis labeling.
Now I'm using the following command to customize my x-axis labels:
axis(
1,
at = min(LoopVariable[ ,"TW"]) - 1 : max(LoopVariable[ ,"TW"]) + 1,
labels = min(LoopVariable[ ,"TW"]) - 1 : max(LoopVariable[ ,"TW"]) + 1,
las = 2
)
And I'm getting:
This is correct in the sense that I have 28 data points, but when I run:
LoopVariable[ ,"TW"]
Then I get:
[1] 2801 2808 2813 2825 2833 2835 2839 2840 2844 2856 2858 2863 2865 2868 2870 2871 2873 2879 2881 2903 2904 2914 2918 2947 2970 2974 2977 2986
These are the values I want as x-axis labels rather than 1:28. There is obviously something missing in my line that I cannot figure out.
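Two things seem to be going on in the axis() call above. First, in R the `:` operator binds more tightly than binary `-` and `+`, so the unparenthesised expression does not produce the intended range. Second, if the points were plotted against their index (an assumption, since the plot call is not shown), the tick positions should be 1:28 and only the labels should carry the TW values. A small sketch using the first five TW values and hypothetical y values:

```r
tw <- c(2801, 2808, 2813, 2825, 2833)   # first five TW values from the question
y  <- c(3, 1, 4, 1, 5)                  # hypothetical y values, for illustration only

# ':' is evaluated first, so this is min(tw) - (1:max(tw)) + 1
unparenthesized <- min(tw) - 1 : max(tw) + 1

# Parentheses give the intended integer range from min - 1 to max + 1
parenthesized <- (min(tw) - 1):(max(tw) + 1)

# Assuming the data were plotted against their index:
# put ticks at the index positions and label them with the TW values
plot(seq_along(tw), y, xaxt = "n", xlab = "TW")
axis(1, at = seq_along(tw), labels = tw, las = 2)
```

If the data were instead plotted with the TW values on the x-axis, the equivalent call would use at = tw and labels = tw.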

lmList - loss of group information

I am using lmList to do linear models on many subsets of a data frame:
res <- lmList(Rds.on.fwd~Length | Wafer, data=sub, na.action=na.omit, pool=F)
This works fine, and I get the desired output (full output not shown):
(Intercept) Length
2492 5816.726 1571.260
2493 2520.311 1361.317
2494 3058.408 1286.516
2502 4727.328 1344.728
2564 3790.942 1576.223
2567 2350.296 1290.396
I have subsetted by "Wafer" (first column above). However, within my data frame ("sub"), the data is grouped by another factor "ERF" (there are many other factors but I am only concerned with "ERF"):
head(sub):
ERF Wafer Device Row Col Width Length Date Von.fwd Vth.fwd STS.fwd On.Off.fwd Ion.fwd Ioff.fwd Rds.on.fwd
1 474 2492 11.06E 11 6 100 5 09/10/2014 12:05 0.596747 3.05655 0.295971 7874420 0.000104 1.32e-11 9626.54
3 474 2492 11.08E 11 8 100 5 09/10/2014 12:05 0.581131 3.08380 0.299050 7890780 0.000109 1.38e-11 9193.62
5 474 2492 11.09E 11 9 100 5 09/10/2014 12:05 0.578171 3.06713 0.298509 8299740 0.000107 1.29e-11 9337.86
7 474 2492 11.10E 11 10 100 5 09/10/2014 12:05 0.565504 2.95532 0.298349 8138320 0.000109 1.34e-11 9173.15
9 474 2492 11.11E 11 11 100 5 09/10/2014 12:05 0.581289 2.97091 0.297885 8463620 0.000109 1.29e-11 9178.50
11 474 2492 11.12E 11 12 100 5 09/10/2014 12:05 0.578003 3.05802 0.294260 9326360 0.000112 1.20e-11 8955.51
I do not want ERF included in my lm, but I do want to keep the factor "ERF" with the lm results for colouring graphs later, i.e. I want this:
ERF Wafer (Intercept) Length
474 2492 5816.726 1571.260
474 2493 2520.311 1361.317
474 2494 3058.408 1286.516
475 2502 4727.328 1344.728
475 2564 3790.942 1576.223
476 2567 2350.296 1290.396
I know I could do this manually later by adding a column to the results with a vector containing the correct sequence of ERF values. However, I regularly add data to the set and don't want to do this every time. I'm sure there is a more elegant way?
Thanks
Edit - data added for solution:
res <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
ERF Wafer (Intercept) Length
1 474 2492 5816.726 1571.260
2 474 2493 2520.311 1361.317
3 474 2494 3058.408 1286.516
4 474 2502 4727.328 1344.728
5 479 2564 3790.942 1576.223
6 479 2567 2350.296 1290.396
If I drop ERF:
res <- ddply(sub, c("Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
Wafer (Intercept) Length
1 2492 5816.726 1571.260
2 2493 2520.311 1361.317
3 2494 3058.408 1286.516
4 2502 4727.328 1344.728
5 2564 3790.942 1576.223
6 2567 2350.296 1290.396
Does this make sense? Did I ask the question incorrectly?
Ah, with a bit more research I've answered my own question based on this answer:
Regression on subset of data set
Must look harder next time. I used ddply instead of lmList (makes me wonder why anyone uses lmList...maybe I should ask another question?):
res1 <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
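For comparison, the same idea can be sketched in base R without plyr: fit one model per Wafer, then merge the Wafer-to-ERF mapping back in. This is an illustrative alternative, not the method used above, and the toy data frame is hypothetical (constructed so that each wafer follows an exact straight line):

```r
# Hypothetical toy data: three wafers, each nested within one ERF
sub <- data.frame(
  ERF    = rep(c(474, 474, 475), each = 4),
  Wafer  = rep(c(2492, 2493, 2502), each = 4),
  Length = rep(c(5, 10, 20, 40), times = 3)
)
# Exact linear response per wafer: slopes 10, 20, 30 and intercepts 100, 200, 300
sub$Rds.on.fwd <- rep(c(100, 200, 300), each = 4) +
  rep(c(10, 20, 30), each = 4) * sub$Length

# Fit one linear model per wafer and collect the coefficients
fits  <- lapply(split(sub, sub$Wafer),
                function(x) coef(lm(Rds.on.fwd ~ Length, data = x)))
coefs <- do.call(rbind, fits)

res <- data.frame(Wafer = as.numeric(rownames(coefs)), coefs,
                  check.names = FALSE, row.names = NULL)

# Attach ERF by looking up each wafer's ERF in the original data
res <- merge(unique(sub[, c("ERF", "Wafer")]), res, by = "Wafer")
```

This relies on each Wafer belonging to exactly one ERF; if a wafer appeared under two ERF values, the merge would duplicate its row.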
