I am trying to plot a graph with ggplot. Currently, I can only plot my drc results with the plot function in R, not with ggplot. I want to use ggplot because I already have working code for it and it is more customizable than the plot function. My line of code with the drc function is:
try<-drm(X..bound~Dilution,data=Titration.8.31,Sample,robust="mean",fct=LL.4())
which generates my drc fit. I can plot this with the plot function, but I would rather not, since I can't easily change labels or anything else. My existing ggplot code uses loess lines of best fit, which I want to remove (they drop below zero) and replace with my drc curves:
ggplot(Titration.8.31, aes(x = Dilution, y = `X..bound`)) +
  geom_point(size = 5, aes(color = Sample, shape = Sample)) +
  scale_shape_manual(values = c(0, 2, 5, 8, 13, 15, 16, 17, 18, 19, 20, 10, 9, 3)) +
  scale_x_continuous(trans = "log10",
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x)),
                     minor_breaks = 10^(seq(0, 7, by = 0.25))) +
  labs(x = "Antibody Dilution", y = "% Cell Binding") +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 22)) +
  theme(axis.title.y = element_text(size = 22)) +
  theme(axis.text = element_text(size = 18)) +
  scale_color_manual(values = c("dodgerblue2", "#E31A1C", "green4",
                                "#6A3D9A", "#FF7F00", "black", "gold1", "skyblue2", "palegreen2", "#FDBF6F", "gray70", "maroon", "orchid1", "darkturquoise", "darkorange4", "brown")) +
  coord_cartesian(ylim = c(0, 100)) +
  theme(legend.key.size = unit(1, "in")) +
  theme(legend.text = element_text(size = 11))
How do I change this code so that my drc curves become the new lines of best fit? If I can't use ggplot, how do I change the x-axis label in the plot function (which I think might be easier)?
The data is dput(Titration.8.31):
structure(list(Dilution = c(300L, 900L, 2700L, 8100L, 24300L,
72900L, 218700L, 300L, 900L, 2700L, 8100L, 24300L, 72900L, 218700L,
300L, 900L, 2700L, 8100L, 24300L, 72900L, 218700L, 300L, 900L,
2700L, 8100L, 24300L, 72900L, 218700L, 300L, 900L, 2700L, 8100L,
24300L, 72900L, 218700L, 300L, 900L, 2700L, 8100L, 24300L, 72900L,
218700L, 300L, 900L, 2700L, 8100L, 24300L, 72900L, 218700L),
X..bound = c(92.43, 92.95, 92.26, 86.55, 67.49, 21.86, 0.72,
89.57, 87.84, 82.35, 65.84, 24.18, 3.56, 0.32, 91.63, 90.57,
87.22, 77.03, 39.52, 5.39, 1.24, 93.51, 93.56, 90.33, 80.49,
38.97, 4.7, 0.93, 95.37, 94.44, 91.24, 77.74, 28.76, 2.14,
0.15, 0.01, 0, 0, 0, 0, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0),
Sample = c("CoV77-39 1mer 0DA", "CoV77-39 1mer 0DA", "CoV77-39 1mer 0DA",
"CoV77-39 1mer 0DA", "CoV77-39 1mer 0DA", "CoV77-39 1mer 0DA",
"CoV77-39 1mer 0DA", "CoV77-39 5mer 0DA", "CoV77-39 5mer 0DA",
"CoV77-39 5mer 0DA", "CoV77-39 5mer 0DA", "CoV77-39 5mer 0DA",
"CoV77-39 5mer 0DA", "CoV77-39 5mer 0DA", "CoV77-39 5mer 2DA GGG",
"CoV77-39 5mer 2DA GGG", "CoV77-39 5mer 2DA GGG", "CoV77-39 5mer 2DA GGG",
"CoV77-39 5mer 2DA GGG", "CoV77-39 5mer 2DA GGG", "CoV77-39 5mer 2DA GGG",
"CoV77-39 5mer 2DA GDGDG", "CoV77-39 5mer 2DA GDGDG", "CoV77-39 5mer 2DA GDGDG",
"CoV77-39 5mer 2DA GDGDG", "CoV77-39 5mer 2DA GDGDG", "CoV77-39 5mer 2DA GDGDG",
"CoV77-39 5mer 2DA GDGDG", "CoV77-39 5mer 2DA GDG", "CoV77-39 5mer 2DA GDG",
"CoV77-39 5mer 2DA GDG", "CoV77-39 5mer 2DA GDG", "CoV77-39 5mer 2DA GDG",
"CoV77-39 5mer 2DA GDG", "CoV77-39 5mer 2DA GDG", "CoV77-39 HA",
"CoV77-39 HA", "CoV77-39 HA", "CoV77-39 HA", "CoV77-39 HA",
"CoV77-39 HA", "CoV77-39 HA", "CoV77-39 WT", "CoV77-39 WT",
"CoV77-39 WT", "CoV77-39 WT", "CoV77-39 WT", "CoV77-39 WT",
"CoV77-39 WT")), class = "data.frame", row.names = c(NA,
-49L))
Any help is appreciated and very welcome, as I am very new to coding :) Thank you in advance for your time! I really am stuck.
I have a titration curve plotted with this code:
ggplot(Titration.Aug.9, aes(x = Dilution, y = `X..bound`)) +
geom_line(aes(color = Sample)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "loess", se = FALSE, linetype = "dashed") +
scale_x_continuous(trans = "log10",
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)),
minor_breaks = 10^(seq(0, 7, by = 0.25))) +
scale_color_brewer(type = "Sample", palette = "Set1") +
labs(x = "Antibody Dilution", y = "% Cell Binding") +
theme_minimal()
and it generates a plot that looks pretty nice. However, I need a line of best fit for each individual sample. I tried to do this with loess, but I need each line to be a different color. (graph image) This is the graph I had, but I need the lines to be a best fit for each sample, not lines connecting the actual points. (graph image 2) Something like this one, generated with this code:
ggplot(Titration.Aug.9, aes(x = Dilution, y = `X..bound`)) +
  geom_line(aes(color = Sample)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "loess", se = FALSE, aes(Color = Sample)) +
  scale_x_continuous(trans = "log10",
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x)),
                     minor_breaks = 10^(seq(0, 7, by = 0.25))) +
  scale_color_brewer(type = "Sample", palette = "Set1") +
  labs(x = "Antibody Dilution", y = "% Cell Binding") +
  theme_minimal()
However, I need each of these lines to be colored by sample, as in the first image, but with the fitted-line style of the second image. My data is:
dput(Titration.Aug.9)
structure(list(Dilution = c(300L, 900L, 2700L, 8100L, 24300L,
72900L, 218700L, 300L, 900L, 2700L, 8100L, 24300L, 72900L, 218700L,
300L, 900L, 2700L, 8100L, 24300L, 72900L, 218700L, 300L, 900L,
2700L, 8100L, 24300L, 72900L, 218700L, 300L, 900L, 2700L, 8100L,
24300L, 72900L, 218700L, 300L, 900L, 2700L, 8100L, 24300L, 72900L,
218700L, 300L, 900L, 2700L, 8100L, 24300L, 72900L, 218700L, 300L,
900L, 2700L, 8100L, 24300L, 72900L, 218700L, 300L, 900L, 2700L,
8100L, 24300L, 72900L, 218700L), X..bound = c(52.74, 40.31, 30.63,
18.89, 7.57, 0.8, 0.01, 20.23, 11.29, 7.55, 3.24, 0.54, 0.12,
0.03, 53.27, 46.82, 38.17, 26.77, 11.59, 2.23, 0.07, 69.25, 63.55,
56.34, 40.95, 19.35, 2.4, 0.05, 75.8, 68.21, 62.82, 40.33, 11.73,
0.82, 0.04, 85.75, 82.82, 74.29, 46.63, 9.36, 0.24, 0.05, 71.65,
66.54, 56.63, 33.96, 6.33, 0.19, 0.03, 85.43, 86.49, 75.73, 51.62,
15.16, 1.05, 0.01, 92.44, 90.13, 85.92, 72.06, 30.08, 3.15, 0.12
), Sample = c("1mer 0DA", "1mer 0DA", "1mer 0DA", "1mer 0DA",
"1mer 0DA", "1mer 0DA", "1mer 0DA", "1mer 2DA", "1mer 2DA", "1mer 2DA",
"1mer 2DA", "1mer 2DA", "1mer 2DA", "1mer 2DA", "1mer 3DA", "1mer 3DA",
"1mer 3DA", "1mer 3DA", "1mer 3DA", "1mer 3DA", "1mer 3DA", "1mer 4DA",
"1mer 4DA", "1mer 4DA", "1mer 4DA", "1mer 4DA", "1mer 4DA", "1mer 4DA",
"5mer 0DA", "5mer 0DA", "5mer 0DA", "5mer 0DA", "5mer 0DA", "5mer 0DA",
"5mer 0DA", "5mer 2DA", "5mer 2DA", "5mer 2DA", "5mer 2DA", "5mer 2DA",
"5mer 2DA", "5mer 2DA", "5mer 4DA", "5mer 4DA", "5mer 4DA", "5mer 4DA",
"5mer 4DA", "5mer 4DA", "5mer 4DA", "5mer 2DA GDG", "5mer 2DA GDG",
"5mer 2DA GDG", "5mer 2DA GDG", "5mer 2DA GDG", "5mer 2DA GDG",
"5mer 2DA GDG", "5mer 2DA GDGDG", "5mer 2DA GDGDG", "5mer 2DA GDGDG",
"5mer 2DA GDGDG", "5mer 2DA GDGDG", "5mer 2DA GDGDG", "5mer 2DA GDGDG"
)), class = "data.frame", row.names = c(NA, -63L))
Any help is appreciated!! Thank you in advance for your time!! :)
You simply need to map Sample to the color aesthetic inside geom_smooth. I have removed the original geom_line to make the result clearer, but this could be added back in.
ggplot(Titration.Aug.9, aes(x = Dilution, y = `X..bound`)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "loess", se = FALSE,
linetype="dashed", aes(color = Sample)) +
scale_x_continuous(trans = "log10",
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)),
minor_breaks = 10^(seq(0, 7, by = 0.25))) +
scale_color_brewer(type="Sample",palette="Set1") +
labs(x="Antibody Dilution",y="% Cell Binding") +
theme_minimal()
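As a smaller illustration of the same idea, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not the asker's measurements, and `scale_x_log10()` stands in for the longer axis specification). It suggests that mapping `color` inside `geom_smooth` also splits the smoother into one group per sample, and that the aesthetic name is case-sensitive: `aes(Color = Sample)` is not recognised, which would explain why the attempt in the question did not color the lines.

```r
library(ggplot2)

# Hypothetical toy data: two samples measured at the same seven dilutions
toy <- data.frame(
  Dilution = rep(c(300, 900, 2700, 8100, 24300, 72900, 218700), times = 2),
  X..bound = c(92, 93, 92, 87, 67, 22, 1,
               90, 88, 82, 66, 24, 4, 0),
  Sample   = rep(c("A", "B"), each = 7)
)

p <- ggplot(toy, aes(x = Dilution, y = X..bound)) +
  geom_point() +
  # lower-case `color`: one loess curve (and one colour) per Sample
  geom_smooth(aes(color = Sample), formula = y ~ x,
              method = "loess", se = FALSE) +
  scale_x_log10()

# Inspect the computed smooth layer (layer 2) to count the fitted groups
smooth_data <- layer_data(p, 2)
n_curves <- length(unique(smooth_data$group))
n_curves
```

Printing `p` draws the plot; `layer_data()` is only used here to confirm how many separate curves were fitted.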
I have 4 columns: date & time, stage_duration, various_stages, Vehicle_ID. I want to plot date and time (in minutes) on the x-axis and ID and stage_duration on the y-axis, with fill by the various stages, as a line or bar chart.
Something like this would be good:
Here is my data:
var_events  time_date         event_duration  veh_id
LD          17-06-2018 13:25  6.52            B33
WL          17-06-2018 13:25  14.52           B31
TL          17-06-2018 13:26  0.32            B32
TE          17-06-2018 13:26  4.58            B13
UL          17-06-2018 13:26  3.45            B12
WT          17-06-2018 13:26  5.46            B25
UL          17-06-2018 13:26  1.56            B17
TL          17-06-2018 13:26  13.6            B33
SL          17-06-2018 13:26  0.05            B32
Here is a minimal example that creates the plot:
# load packages; the presidential and economics data sets ship with ggplot2
library(ggplot2)
#install.packages("dplyr")
library(dplyr)
# load data
data(presidential)
data(economics)
# events of interest
events <- presidential[-(1:3),]
# extract the year from the economics dates
economics$year <- as.numeric(format(economics$date, format = "%Y"))
# use dplyr to summarise data by year
economics_mean <- economics %>%
  group_by(year) %>%
  summarise(mean_unemployment = mean(unemploy))
# add president terms to the summarised data frame (one label per year)
president <- c(rep(NA,14), rep("Reagan", 8), rep("Bush", 4), rep("Clinton", 8), rep("Bush", 8), rep("Obama", 7))
economics_mean$president <- president
# create ggplot
p <- ggplot(data = economics_mean, aes(x = year, y = mean_unemployment)) +
  geom_point(aes(color = president)) +
  geom_line(alpha = 1/3)
# print the plot
p
Update
This is the output:
structure(list(Event_stage = c("SE", "MN", "MN", "TE", "TE",
"TE", "TE", "TE", "TE", "TE", "TE", "WL", "TE", "TE", "SE", "TE",
"TE", "WL", "WT", "MN", "WL", "TE", "WL", "WL", "WT", "WL", "LD",
"WT", "WL", "WT", "WT", "TE", "WL", "LD", "WT", "LD", "MN", "TL",
"TE", "WL", "TL", "TL", "WT", "TE", "TE", "LD", "WT", "TL", "LD"),
event_date = structure(c(1529573704, 1529573710, 1529573713,
1529573724, 1529573855, 1529573874, 1529573880, 1529573895, 1529573906,
1529573918, 1529573925, 1529573931, 1529573931, 1529573941, 1529573947,
1529573969, 1529574006, 1529574054, 1529574088, 1529574114, 1529574120,
1529574123, 1529574134, 1529574137, 1529574148, 1529574163, 1529574164,
1529574148, 1529574169, 1529574170, 1529574178, 1529574188, 1529574189,
1529574196, 1529574178, 1529574188, 1529574203, 1529574213, 1529574214,
1529574214, 1529574215, 1529574227, 1529574231, 1529574242, 1529574244,
1529574245, 1529574248, 1529574260, 1529574262), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), stage_duration = c(3.78, 3.47, 2.78,
3.45, 3.32, 4.93, 4.23, 4.22, 3.85, 3.37, 5.88, 5.92, 3.97, 3.7,
NA, 4.08, 3.05, 0.57, 11.18, 12.08, 2.6, 3.3, 0.23, 0.85, 0.27,
0.25, 0.82, 10.42, 0.15, 0.43, 1.4, 0.25, 0.7, 0.52, 1.12, 0.45,
12.87, 12.18, 2.92, 0.57, 14.07, 12.72, 17.12, 4.13, 3.13, 0.25,
0.33, 18.98, 1.05), veh_id = c("B35", "B05", "B04", "B08", "B14",
"B13", "B04", "B17", "B41", "B05", "B26", "B08", "B35", "B19a",
"B10a", "B01a", "B28", "B14", "B14", "B18", "B05", "B37", "B04",
"B41", "B04", "B19a", "B04", "B17", "B35", "B13", "B35", "B02b",
"B28", "B13", "B19a", "B41", "B02b", "B04", "B15", "B01a", "B41",
"B13", "B28", "B27", "B33", "B19a", "B01a", "B19a", "B35")),
.Names = c("Event_stage", "event_date", "stage_duration", "veh_id"),
row.names = c(NA, -49L), class = c("tbl_df", "tbl", "data.frame"))
require(ggplot2)
require(dplyr)
df = structure(list(Event_stage = c("SE", "MN", "MN", "TE", "TE", "TE", "TE", "TE", "TE", "TE", "TE", "WL", "TE", "TE", "SE", "TE", "TE", "WL", "WT", "MN", "WL", "TE", "WL", "WL", "WT", "WL", "LD", "WT", "WL", "WT", "WT", "TE", "WL", "LD", "WT", "LD", "MN", "TL", "TE", "WL", "TL", "TL", "WT", "TE", "TE", "LD", "WT", "TL", "LD" ), event_date = structure(c(1529573704, 1529573710, 1529573713, 1529573724, 1529573855, 1529573874, 1529573880, 1529573895, 1529573906, 1529573918, 1529573925, 1529573931, 1529573931, 1529573941, 1529573947, 1529573969, 1529574006, 1529574054, 1529574088, 1529574114, 1529574120, 1529574123, 1529574134, 1529574137, 1529574148, 1529574163, 1529574164, 1529574148, 1529574169, 1529574170, 1529574178, 1529574188, 1529574189, 1529574196, 1529574178, 1529574188, 1529574203, 1529574213, 1529574214, 1529574214, 1529574215, 1529574227, 1529574231, 1529574242, 1529574244, 1529574245, 1529574248, 1529574260, 1529574262), class = c("POSIXct", "POSIXt"), tzone = "UTC"), stage_duration = c(3.78, 3.47, 2.78, 3.45, 3.32, 4.93, 4.23, 4.22, 3.85, 3.37, 5.88, 5.92, 3.97, 3.7, NA, 4.08, 3.05, 0.57, 11.18, 12.08, 2.6, 3.3, 0.23, 0.85, 0.27, 0.25, 0.82, 10.42, 0.15, 0.43, 1.4, 0.25, 0.7, 0.52, 1.12, 0.45, 12.87, 12.18, 2.92, 0.57, 14.07, 12.72, 17.12, 4.13, 3.13, 0.25, 0.33, 18.98, 1.05), veh_id = c("B35", "B05", "B04", "B08", "B14", "B13", "B04", "B17", "B41", "B05", "B26", "B08", "B35", "B19a", "B10a", "B01a", "B28", "B14", "B14", "B18", "B05", "B37", "B04", "B41", "B04", "B19a", "B04", "B17", "B35", "B13", "B35", "B02b", "B28", "B13", "B19a", "B41", "B02b", "B04", "B15", "B01a", "B41", "B13", "B28", "B27", "B33", "B19a", "B01a", "B19a", "B35")), .Names = c("Event_stage", "event_date", "stage_duration", "veh_id"), row.names = c(NA, -49L), class = c("tbl_df", "tbl", "data.frame"))
# create ggplot
ggplot(data = df, aes(x = event_date,
y = stage_duration)) +
geom_point(aes(color = Event_stage), size= 3) +
geom_line(alpha = 1/2)+
facet_wrap(~veh_id, nrow = 4) +
labs(x = "Event date", y = "Stage duration")
I have data on earnings which looks like the following:
# A tibble: 6 x 24
m_ticker ticker comp_name comp_name_2 exchange currency_code per_end_date_fr0
<chr> <chr> <chr> <chr> <chr> <chr> <date>
1 AAPL AAPL APPLE INC Apple Inc. NSDQ USD 2017-09-30
2 AXP AXP AMER EXPRES~ American Express~ NYSE USD 2017-12-31
3 BA BA BOEING CO The Boeing Compa~ NYSE USD 2017-12-31
4 CTR CAT CATERPILLAR~ Caterpillar Inc. NYSE USD 2017-12-31
5 CSCO CSCO CISCO SYSTE~ Cisco Systems, I~ NSDQ USD 2017-07-31
6 SD CVX CHEVRON CORP Chevron Corporat~ NYSE USD 2017-12-31
# ... with 17 more variables: per_end_date_qr1 <date>, eps_mean_est_qr1 <dbl>,
# street_mean_est_qr1 <dbl>, exp_rpt_date_qr1 <date>, exp_rpt_date_qr2 <date>,
# exp_rpt_date_fr1 <date>, exp_rpt_date_fr2 <date>, late_last_flag <dbl>,
# late_last_desc <chr>, source_flag <dbl>, source_desc <chr>, time_of_day_code <dbl>,
# time_of_day_desc <chr>, per_end_date_qr0 <date>, eps_act_qr0 <dbl>,
# per_end_date_qrm3 <date>, eps_act_qrm3 <dbl>
I also have a vector of ticker symbols called tickers.
tickers <- c("PYPL", "GOOG", "AAPL", "MSFT", "CSCO")
I am trying to create an ifelse statement which will print a small table of the following columns from the earnings data:
ticker | comp_name | exp_rpt_date_qr1 | exp_rpt_date_qr2 | time_of_day_desc
So, if ticker matches earnings$ticker then print the above columns.
I tried using grepl to print a basic yes/no, but it reports a warning message:
ifelse(grepl(tickers, earnings$ticker), "yes", "no")
Data:
earnings <- structure(list(m_ticker = c("AAPL", "AXP", "BA", "CTR", "CSCO",
"SD", "DIS", "GE", "GS&", "HOMD", "IBM", "ITL", "JNJ", "CHL",
"KO", "MCD", "MMM", "MRK", "MSFT", "NIKE", "PFE", "PG", "SPM",
"UNIH", "UA", "VISA", "BEL", "WMS", "J"), ticker = c("AAPL",
"AXP", "BA", "CAT", "CSCO", "CVX", "DIS", "GE", "GS", "HD", "IBM",
"INTC", "JNJ", "JPM", "KO", "MCD", "MMM", "MRK", "MSFT", "NKE",
"PFE", "PG", "TRV", "UNH", "UTX", "V", "VZ", "WMT", "XOM"), comp_name = c("APPLE INC",
"AMER EXPRESS CO", "BOEING CO", "CATERPILLAR INC", "CISCO SYSTEMS",
"CHEVRON CORP", "DISNEY WALT", "GENL ELECTRIC", "GOLDMAN SACHS",
"HOME DEPOT", "INTL BUS MACH", "INTEL CORP", "JOHNSON & JOHNS",
"JPMORGAN CHASE", "COCA COLA CO", "MCDONALDS CORP", "3M CO",
"MERCK & CO INC", "MICROSOFT CORP", "NIKE INC-B", "PFIZER INC",
"PROCTER & GAMBL", "TRAVELERS COS", "UNITEDHEALTH GP", "UTD TECHS CORP",
"VISA INC-A", "VERIZON COMM", "WALMART INC", "EXXON MOBIL CRP"
), comp_name_2 = c("Apple Inc.", "American Express Company",
"The Boeing Company", "Caterpillar Inc.", "Cisco Systems, Inc.",
"Chevron Corporation", "The Walt Disney Company", "General Electric Company",
"The Goldman Sachs Group, Inc.", "The Home Depot, Inc.", "International Business Machines Corporation",
"Intel Corporation", "Johnson & Johnson", "JPMorgan Chase & Co.",
"Coca-Cola Company (The)", "McDonald's Corporation", "3M Company",
"Merck & Co., Inc.", "Microsoft Corporation", "NIKE, Inc.", "Pfizer Inc.",
"Procter & Gamble Company (The)", "The Travelers Companies, Inc.",
"UnitedHealth Group Incorporated", "United Technologies Corporation",
"Visa Inc.", "Verizon Communications Inc.", "Walmart Inc.", "Exxon Mobil Corporation"
), exchange = c("NSDQ", "NYSE", "NYSE", "NYSE", "NSDQ", "NYSE",
"NYSE", "NYSE", "NYSE", "NYSE", "NYSE", "NSDQ", "NYSE", "NYSE",
"NYSE", "NYSE", "NYSE", "NYSE", "NSDQ", "NYSE", "NYSE", "NYSE",
"NYSE", "NYSE", "NYSE", "NYSE", "NYSE", "NYSE", "NYSE"), currency_code = c("USD",
"USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD",
"USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD",
"USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD",
"USD"), per_end_date_fr0 = structure(c(17439, 17531, 17531, 17531,
17378, 17531, 17439, 17531, 17531, 17562, 17531, 17531, 17531,
17531, 17531, 17531, 17531, 17531, 17347, 17682, 17531, 17347,
17531, 17531, 17531, 17439, 17531, 17562, 17531), class = "Date"),
per_end_date_qr1 = structure(c(17712, 17712, 17712, 17712,
17743, 17712, 17712, 17712, 17712, 17743, 17712, 17712, 17712,
17804, 17712, 17712, 17712, 17712, 17712, 17774, 17712, 17712,
17712, 17712, 17712, 17712, 17712, 17743, 17712), class = "Date"),
eps_mean_est_qr1 = c(2.19, 1.83, 3.43, 2.66, 0.63, 2.1, 2.04,
0.18, 4.67, 2.85, 3.03, 0.99, 2.06, 2.27, 0.6, 1.93, 2.59,
1.03, 1.07, 0.61, 0.75, 0.91, 2.44, 3.03, 1.86, 1.09, 1.15,
1.21, 1.24), street_mean_est_qr1 = c(2.187, 1.83, 3.434,
2.658, 0.689, 2.098, 2.043, 0.178, 4.674, 2.847, 3.031, 0.99,
2.055, 2.27, 0.601, 1.929, 2.594, 1.03, 1.074, 0.609, 0.748,
0.906, 2.436, 3.031, 1.858, 1.089, 1.145, 1.212, 1.244),
exp_rpt_date_qr1 = structure(c(17743, 17730, 17737, 17742,
17758, 17739, 17750, 17732, 17729, 17764, 17730, 17738, 17729,
17815, 17737, 17738, 17736, 17739, 17731, 17799, 17743, 17743,
17731, 17729, 17736, 17737, 17736, 17759, 17739), class = "Date"),
exp_rpt_date_qr2 = structure(c(17836, 17821, 17828, 17827,
17856, 17830, 17843, 17823, 17820, 17848, 17820, 17829, 17820,
17907, 17828, 17827, 17827, 17830, 17829, 17885, 17834, 17823,
17822, 17820, 17827, 17828, 17822, 17850, 17830), class = "Date"),
exp_rpt_date_fr1 = structure(c(17836, 17913, 17926, 17920,
17758, 17928, 17843, 17919, 17912, 17946, 17913, 17920, 17918,
17907, 17942, 17925, 17920, 17928, 17731, 18074, 17925, 17743,
17918, 17911, 17919, 17828, 17918, 17946, 17928), class = "Date"),
exp_rpt_date_fr2 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), class = "Date"), late_last_flag = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), late_last_desc = c("Not late",
"Not late", "Not late", "Not late", "Not late", "Not late",
"Not late", "Not late", "Not late", "Not late", "Not late",
"Not late", "Not late", "Not late", "Not late", "Not late",
"Not late", "Not late", "Not late", "Not late", "Not late",
"Not late", "Not late", "Not late", "Not late", "Not late",
"Not late", "Not late", "Not late"), source_flag = c(1, 1,
1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 1, 2, 1), source_desc = c("Company", "Company",
"Company", "Company", "Estimated", "Company", "Company",
"Company", "Company", "Estimated", "Company", "Company",
"Company", "Estimated", "Company", "Company", "Company",
"Company", "Company", "Estimated", "Company", "Company",
"Company", "Company", "Company", "Company", "Company", "Estimated",
"Company"), time_of_day_code = c(1, 1, 2, 2, 4, 2, 1, 2,
2, 4, 1, 1, 2, 4, 2, 2, 2, 2, 1, 4, 2, 2, 2, 2, 2, 1, 2,
4, 2), time_of_day_desc = c("After market close", "After market close",
"Before the open", "Before the open", "Unknown", "Before the open",
"After market close", "Before the open", "Before the open",
"Unknown", "After market close", "After market close", "Before the open",
"Unknown", "Before the open", "Before the open", "Before the open",
"Before the open", "After market close", "Unknown", "Before the open",
"Before the open", "Before the open", "Before the open",
"Before the open", "After market close", "Before the open",
"Unknown", "Before the open"), per_end_date_qr0 = structure(c(17621,
17621, 17621, 17621, 17651, 17621, 17621, 17621, 17621, 17651,
17621, 17621, 17621, 17712, 17621, 17621, 17621, 17621, 17621,
17682, 17621, 17621, 17621, 17621, 17621, 17621, 17621, 17651,
17621), class = "Date"), eps_act_qr0 = c(2.73, 1.86, 3.64,
2.82, 0.6, 1.9, 1.84, 0.16, 6.95, 2.08, 2.45, 0.87, 2.06,
2.29, 0.47, 1.79, 2.5, 1.05, 0.95, 0.69, 0.77, 1, 2.46, 3.04,
1.77, 1.11, 1.17, 1.14, 1.09), per_end_date_qrm3 = structure(c(17347,
17347, 17347, 17347, 17378, 17347, 17347, 17347, 17347, 17378,
17347, 17347, 17347, 17439, 17347, 17347, 17347, 17347, 17347,
17409, 17347, 17347, 17347, 17347, 17347, 17347, 17347, 17378,
17347), class = "Date"), eps_act_qrm3 = c(1.67, 1.47, 2.55,
1.49, 0.55, 0.91, 1.58, 0.28, 3.95, 2.25, 2.97, 0.72, 1.83,
1.76, 0.59, 1.73, 2.58, 1.01, 0.98, 0.57, 0.67, 0.85, 1.92,
2.46, 1.85, 0.86, 0.96, 1.08, 0.78)), .Names = c("m_ticker",
"ticker", "comp_name", "comp_name_2", "exchange", "currency_code",
"per_end_date_fr0", "per_end_date_qr1", "eps_mean_est_qr1", "street_mean_est_qr1",
"exp_rpt_date_qr1", "exp_rpt_date_qr2", "exp_rpt_date_fr1", "exp_rpt_date_fr2",
"late_last_flag", "late_last_desc", "source_flag", "source_desc",
"time_of_day_code", "time_of_day_desc", "per_end_date_qr0", "eps_act_qr0",
"per_end_date_qrm3", "eps_act_qrm3"), row.names = c(NA, -29L), class = c("tbl_df",
"tbl", "data.frame"))
Instead of grepl you can use %in%.
Furthermore, if all you're doing is choosing specific rows and columns to print, you could use subset.
> keepcols = c('ticker','comp_name','exp_rpt_date_qr1','exp_rpt_date_qr2','time_of_day_desc')
> subset(earnings, ticker %in% tickers, select = keepcols)
   ticker      comp_name exp_rpt_date_qr1 exp_rpt_date_qr2   time_of_day_desc
1    AAPL      APPLE INC       2018-07-31       2018-11-01 After market close
5    CSCO  CISCO SYSTEMS       2018-08-15       2018-11-21            Unknown
19   MSFT MICROSOFT CORP       2018-07-19       2018-10-25 After market close
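To see why grepl produced a warning where %in% does not, here is a minimal sketch on a hypothetical four-row table (`toy` and its contents are made up for illustration and only mimic two columns of earnings). grepl expects a single pattern, so when it is given a vector it uses only the first element; %in% instead tests each ticker for membership in the whole vector.

```r
tickers <- c("PYPL", "GOOG", "AAPL", "MSFT", "CSCO")

# Hypothetical toy table standing in for the earnings data
toy <- data.frame(
  ticker    = c("AAPL", "AXP", "CSCO", "MSFT"),
  comp_name = c("APPLE INC", "AMER EXPRESS CO", "CISCO SYSTEMS", "MICROSOFT CORP"),
  stringsAsFactors = FALSE
)

# grepl() takes one pattern; given a vector it warns and searches only for "PYPL"
first_only <- suppressWarnings(grepl(tickers, toy$ticker))

# %in% checks every ticker against the whole vector
matches <- toy$ticker %in% tickers

# the yes/no vector the question was after
ifelse(matches, "yes", "no")

# the matching rows, restricted to the columns of interest
toy[matches, c("ticker", "comp_name")]
```

The logical vector `matches` can be used directly for row indexing, so the ifelse step is only needed if a literal yes/no label is wanted.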
In R I'd like to run a correlation or simple linear regression lm(userScoreDF$Score ~ Stock$Adj.Close) between two variables from different data frames, but I'm getting an error because they're of unequal length. I have not combined the data because I'm unsure how to combine them in a way that matches the two variables by date.
Is there a way to run a correlation or simple linear regression with two variables of unequal lengths from different data frames? Is there a way to combine the variables into one data frame that matches the two variables by date? Here's my data:
dput(userScoreDF)
structure(list(Group.date = structure(c(15737, 15746, 15747,
15748, 15749, 15750, 15751, 15752, 15753, 15754, 15755, 15738,
15756, 15757, 15758, 15759, 15760, 15761, 15762, 15763, 15764,
15739, 15740, 15741, 15742, 15743, 15744, 15745, 15765, 15774,
15775, 15776, 15777, 15778, 15779, 15780, 15781, 15782, 15783,
15766, 15784, 15785, 15786, 15787, 15788, 15789, 15790, 15791,
15792, 15793, 15767, 15794, 15795, 15768, 15769, 15770, 15771,
15772, 15773, 15796, 15805, 15806, 15807, 15808, 15809, 15810,
15811, 15812, 15813, 15814, 15797, 15815, 15816, 15817, 15818,
15819, 15820, 15821, 15822, 15823, 15824, 15798, 15825, 15799,
15800, 15801, 15802, 15803, 15804, 15826, 15835, 15836, 15837,
15838, 15839, 15840, 15841, 15842, 15843, 15844, 15827, 15845,
15846, 15847, 15848, 15849, 15850, 15851, 15852, 15853, 15854,
15828, 15855, 15856, 15829, 15830, 15831, 15832, 15833, 15834,
15857, 15866, 15867, 15868, 15869, 15870, 15871, 15872, 15873,
15874, 15875, 15858, 15876, 15877, 15878, 15879, 15880, 15881,
15882, 15883, 15884, 15885, 15859, 15886, 15860, 15861, 15862,
15863, 15864, 15865, 15887, 15896, 15897, 15898, 15899, 15900,
15901, 15902, 15903, 15904, 15905, 15888, 15906, 15907, 15908,
15909, 15910, 15911, 15912, 15913, 15914, 15915, 15889, 15916,
15917, 15890, 15891, 15892, 15893, 15894, 15895, 15918, 15919,
15920), class = "Date"), Score = c(-1.13, -0.93, -1.14, -1.04,
-0.81, -0.64, -1.12, -1.01, -0.6, -0.82, -1.05, -1.34, -0.86,
-0.93, -0.99, -0.9, -0.76, -0.91, -1.03, -0.95, -1.22, -0.74,
-0.95, -0.98, -0.96, -0.97, -0.95, -0.79, -1.27, -0.72, -1.06,
-0.95, -1.05, -1.02, -0.67, -0.9, -0.7, -1.1, -0.95, -1.14, -1.07,
-1.02, -0.88, -0.79, -1.05, -0.97, -0.9, -1.13, -1.05, -0.8,
-0.84, -0.82, -0.53, -0.96, -0.84, -0.95, -0.99, -1.06, -0.98,
-0.91, -0.94, -0.98, -1.03, -0.77, -0.75, -1.17, -1.02, -0.96,
-0.95, -0.81, -0.96, -1.32, -0.9, -1.11, -1.05, -1.08, -0.8,
-1.14, -0.82, -0.92, -0.96, -1.14, -1, -0.96, -1.14, -0.84, -0.83,
-1.13, -1.11, -0.96, -1.06, -0.94, -0.85, -1.21, -0.95, -0.98,
-0.99, -1.15, -1.18, -0.86, -0.9, -1.09, -1.04, -1.05, -1.07,
-1.11, -1.18, -1.07, -0.99, -1.43, -1.02, -0.96, -1.18, -1.05,
-0.88, -0.84, -1.11, -1.15, -1.18, -1.14, -1.4, -1.6, -1.16,
-1.28, -1.33, -1.07, -0.98, -1.24, -0.81, -1.23, -1.05, -0.99,
-1.53, -1.06, -1.26, -1.18, -1.46, -1.25, -1.31, -1.12, -0.98,
-1.08, -1.13, -1.24, -1, -1.3, -1.04, -1.02, -1.19, -1.09, -1.21,
-0.99, -1.07, -1.21, -1.06, -0.96, -1.05, -1.47, -1.52, -1.36,
-1.22, -1.33, -1.36, -1.27, -1.16, -1.36, -1.25, -1.27, -1.3,
-1.04, -0.71, -1.34, -1.19, -1.26, -1.55, -1.53, -1.59, -1.17,
-1, -1.26, -1.14, -1.19, -1.17, -1.12)), .Names = c("Group.date",
"Score"), row.names = c(NA, -184L), class = "data.frame")
dput(Stock)
structure(list(Date = structure(c(15737, 15740, 15741, 15742,
15743, 15744, 15747, 15748, 15749, 15750, 15751, 15755, 15756,
15757, 15758, 15761, 15762, 15763, 15764, 15765, 15768, 15769,
15770, 15771, 15772, 15775, 15776, 15777, 15778, 15779, 15782,
15783, 15784, 15785, 15786, 15789, 15790, 15791, 15792, 15796,
15797, 15798, 15799, 15800, 15803, 15804, 15805, 15806, 15807,
15810, 15811, 15812, 15813, 15814, 15817, 15818, 15819, 15820,
15821, 15824, 15825, 15826, 15827, 15828, 15831, 15832, 15833,
15834, 15835, 15838, 15839, 15840, 15841, 15842, 15845, 15846,
15847, 15848, 15849, 15853, 15854, 15855, 15856, 15859, 15860,
15861, 15862, 15863, 15866, 15867, 15868, 15869, 15870, 15873,
15874, 15875, 15876, 15877, 15880, 15881, 15882, 15883, 15884,
15887, 15888, 15889, 15891, 15894, 15895, 15896, 15897, 15898,
15901, 15902, 15903, 15904, 15905, 15908, 15909, 15910, 15911,
15912, 15915, 15916, 15917, 15918, 15919), class = "Date"), Adj.Close = c(5.69,
5.74, 5.71, 5.77, 5.74, 5.77, 5.79, 5.91, 5.86, 5.87, 5.91, 5.9,
5.79, 5.79, 5.82, 5.73, 5.78, 5.86, 5.8, 5.8, 5.83, 5.87, 5.87,
5.85, 5.88, 5.86, 5.92, 5.88, 5.86, 5.81, 5.87, 6.03, 6.03, 6.06,
6.14, 6.03, 6.05, 6.04, 6.21, 6.25, 6.23, 6.16, 6.21, 6.23, 6.3,
6.28, 6.25, 6.26, 6.22, 7.06, 7.2, 7.09, 7.19, 7.17, 7.17, 7.1,
7.09, 7.14, 7.12, 7.12, 7.05, 7.06, 7.1, 7.15, 7.2, 7.22, 7.32,
7.35, 7.36, 7.18, 7.26, 7.25, 7.28, 7.32, 7.29, 7.39, 7.3, 7.31,
7.33, 7.27, 7.28, 7.34, 7.3, 7.22, 7.26, 7.2, 7.34, 7.24, 7.18,
7.35, 7.35, 7.32, 7.32, 7.22, 7.32, 7, 7.07, 6.97, 6.86, 6.88,
6.97, 6.98, 7.02, 7.07, 7.15, 7.19, 7.16, 7.07, 7.06, 7.18, 6.28,
6.45, 6.72, 6.48, 6.25, 6.05, 6.07, 5.92, 5.85, 5.77, 5.82, 5.74,
5.74, 6.16, 5.96, 6.38, 6.67)), .Names = c("Date", "Adj.Close"
), row.names = c(NA, 127L), class = "data.frame", na.action = structure(128:231, .Names = c("128",
"129", "130", "131", "132", "133", "134", "135", "136", "137",
"138", "139", "140", "141", "142", "143", "144", "145", "146",
"147", "148", "149", "150", "151", "152", "153", "154", "155",
"156", "157", "158", "159", "160", "161", "162", "163", "164",
"165", "166", "167", "168", "169", "170", "171", "172", "173",
"174", "175", "176", "177", "178", "179", "180", "181", "182",
"183", "184", "185", "186", "187", "188", "189", "190", "191",
"192", "193", "194", "195", "196", "197", "198", "199", "200",
"201", "202", "203", "204", "205", "206", "207", "208", "209",
"210", "211", "212", "213", "214", "215", "216", "217", "218",
"219", "220", "221", "222", "223", "224", "225", "226", "227",
"228", "229", "230", "231"), class = "omit"))
Merge the data frames along their respective dates and perform the regression:
M <- merge(Stock, userScoreDF, by = 1)
lm(Score ~ Adj.Close, M)
or to calculate the correlation:
with(M, cor(Score, Adj.Close))
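To make the effect of the merge concrete, here is a small sketch on made-up data (`scores` and `prices` are hypothetical stand-ins for userScoreDF and Stock, with the same column names). By default merge keeps only the dates present in both data frames, which is what makes the two variables the same length.

```r
# Hypothetical daily scores: five consecutive days
scores <- data.frame(
  Group.date = as.Date(c("2013-02-01", "2013-02-02", "2013-02-03",
                         "2013-02-04", "2013-02-05")),
  Score = c(-1.1, -0.9, -1.2, -1.0, -0.8)
)

# Hypothetical prices: trading days only, so two dates are missing
prices <- data.frame(
  Date = as.Date(c("2013-02-01", "2013-02-04", "2013-02-05")),
  Adj.Close = c(5.7, 5.8, 5.9)
)

# Inner join on the date columns, named explicitly instead of by position
merged <- merge(prices, scores, by.x = "Date", by.y = "Group.date")

# Both variables now have one value per shared date
fit <- lm(Score ~ Adj.Close, data = merged)
r <- with(merged, cor(Score, Adj.Close))
```

Naming the key columns with by.x and by.y is equivalent to by = 1 here, but does not depend on the date being the first column of each data frame.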
Based on your description, I'd normally say this is a terrible idea. But you just neglected to specify that they have overlapping dates. You just need to merge them.
Here, I name your first df x and your second df y.
x2 <- merge(x[which(x$Group.date %in% y$Date),], y, by.x= "Group.date", by.y= "Date")
lm(Score ~ Adj.Close, data= x2)
Of course, a better question might be why you are using lm on time series data (i.e., data with a correlated error structure). That is to say that you're doing it wrong. But, hey, you didn't ask about the statistical validity of your approach.
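As a rough illustration of the concern about correlated errors, the sketch below uses simulated data (not the asker's) in which the errors follow an AR(1) process with an assumed coefficient of 0.9. It fits lm and then looks at the lag-1 autocorrelation of the residuals, which is one simple diagnostic rather than a full remedy.

```r
set.seed(1)
n <- 200

# Simulated predictor and strongly autocorrelated errors (assumed AR(1), phi = 0.9)
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.9), n = n))
y <- 0.5 * x + e

fit <- lm(y ~ x)

# Autocorrelation of the residuals; element 1 of $acf is lag 0, element 2 is lag 1
res_acf <- acf(resid(fit), plot = FALSE)
lag1 <- res_acf$acf[2]
lag1
```

A lag-1 value far from zero suggests the independence assumption behind lm's standard errors does not hold, so the reported p-values should be treated with caution.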