I have an interesting conundrum. I can create the type of chart I seek interactively, but not automatically. Or, I nearly had it automatically, but something broke. (example data at end of post).
I have my loop working the way I would like, but have run into errors when I add some geom_vline() statements (for us, denoting significant changes in our production environment). I've tried working through it outside of the loop and am able to recreate the issue with details below.
I have the following steps:
create a vector with the list of changes:
changeVector <- c(as.Date("2011-11-30"),as.Date("2011-12-05"))
[WORKS] create a plot with the data below, and it works:
ggplot(df,aes(x=OBSDATE,y=AVG_RESP))+geom_line(aes(group=REGION,color=REGION))
[WORKS] try to add the geom_vline(xintercept=c(15308,15313)), and it works (but only if the geom_vline is at the end):
ggplot(df,aes(x=OBSDATE,y=AVG_RESP))+geom_line(aes(group=REGION,color=REGION))+geom_vline(xintercept=c(15308,15313))
[FAIL] try to add the geom_vline(xintercept=changeVector) - I had problems with this for some reason, and had to add as.numeric to recognize the vector values properly:
ggplot(df,aes(x=OBSDATE,y=AVG_RESP))+geom_vline(xintercept=as.numeric(changeVector))+geom_line(aes(group=REGION,color=REGION))
When this step runs, I get the wonderfully useful error message:
Error: Non-continuous variable supplied to scale_x_continuous.
So, any ideas? If I try to add an aesthetic component to the geom_vline, I still make no progress. My desire was to have the geom_vline preceding the geom_line because the vline is context, not data.
Thank you for your help!
Here is a subset of the data (a data frame named df):
OBSDATE REGION COUNT AVG_RESP
2011-11-29 EMEA 293 4.430375
2011-11-30 EMEA 299 4.802876
2011-12-01 EMEA 292 4.362363
2011-12-02 EMEA 293 4.209829
2011-12-03 EMEA 294 4.262959
2011-12-04 EMEA 294 4.207959
2011-12-05 EMEA 293 4.172594
2011-12-06 EMEA 293 4.230887
2011-12-07 EMEA 298 4.259329
2011-12-08 EMEA 293 4.197645
2011-11-29 Americas 296 2.841182
2011-11-30 Americas 296 2.932196
2011-12-01 Americas 292 2.766438
2011-12-02 Americas 293 2.819556
2011-12-03 Americas 291 2.710584
2011-12-04 Americas 295 2.728407
2011-12-05 Americas 290 2.764310
2011-12-06 Americas 290 2.817483
2011-12-07 Americas 295 2.733864
2011-12-08 Americas 291 2.732405
2011-11-29 APAC 328 7.294024
2011-11-30 APAC 325 7.091046
2011-12-01 APAC 314 6.969236
2011-12-02 APAC 327 6.920428
2011-12-03 APAC 325 7.226308
2011-12-04 APAC 324 7.046296
2011-12-05 APAC 318 7.075094
2011-12-06 APAC 317 7.016467
2011-12-07 APAC 318 7.187358
2011-12-08 APAC 318 7.310220
I'm not exactly sure why it is doing that, but here is a workaround that keeps the vertical lines behind the data lines:
ggplot(df,aes(x=OBSDATE,y=AVG_RESP)) +
geom_blank() +
geom_vline(xintercept=as.numeric(changeVector)) +
geom_line(aes(group=REGION,color=REGION))
EDIT:
Here is another workaround: explicitly specify that the x axis is a date scale, rather than letting ggplot guess. When it guesses, it looks at the first layer plotted, which is the vertical lines. Because the xintercept values have to be given as numbers rather than dates, the x axis is assumed to be continuous/numeric. When the next layer is drawn, the dates cannot be mapped onto that scale and an error is thrown.
ggplot(df,aes(x=OBSDATE,y=AVG_RESP)) +
geom_vline(xintercept=as.numeric(changeVector)) +
geom_line(aes(group=REGION,color=REGION)) +
scale_x_date()
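As a small check on the numbers used above, here is a base-R sketch of why 15308 and 15313 stand in for the two change dates. A Date is stored as the number of days since 1970-01-01, which is what as.numeric() exposes; the names day_numbers and back are only for illustration.

```r
# The change dates from the question.
changeVector <- as.Date(c("2011-11-30", "2011-12-05"))

# A Date is stored as days since 1970-01-01; as.numeric() exposes that count.
day_numbers <- as.numeric(changeVector)
print(day_numbers)  # 15308 15313

# Converting back with the same origin recovers the original dates.
back <- as.Date(day_numbers, origin = "1970-01-01")
print(back)
```

So the hard-coded xintercept=c(15308,15313) and xintercept=as.numeric(changeVector) describe the same two vertical lines.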
I am using RStudio and I have a time series data (ts object) called data1.
Here is how data1 looks:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 135 172 179 189 212 47 301 183 247 292 280 325
2015 471 243 386 235 388 257 344 526 363 261 189 173
2016 272 267 197 217 393 299 343 341 315 305 384 497
To plot the above, I have run this code:
plot (data1)
and I get the following plot:
I want to have a plot that is broken by Year and I was thinking of implementing the facet_grid feature found in ggplot2 but since my data is a ts object, I can't use ggplot2 directly on it.
After some research, I've found that the ggfortify library works with ts objects. However, I am having a hard time figuring out how to use the facet_grid feature with it.
My aim is to plot something like the example below from my ts data:
'Female' and 'Male' will be replaced by the years 2014, 2015 and 2016. The x-axis will be the months (Jan, Feb, Mar, and so on) and the y-axis will be the values in the ts object. I would prefer a line plot rather than a dot plot.
Am I on the right track here or is there another way of approaching this problem?
We can use autoplot. The method for ts objects comes from ggfortify, so load it alongside ggplot2. I will use the AirPassengers data as an example.
library(ggplot2)
library(ggfortify)
library(lubridate)
autoplot(AirPassengers) +
facet_grid(. ~ year(Index), scales = "free_x") +
scale_x_date(date_labels = "%b")
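If you would rather not depend on autoplot, a rough alternative is to reshape the ts object into a long data frame yourself and facet on the year column. The sketch below rebuilds data1 from the values in the question; the names long and months_elapsed are made up for illustration, and the plotting step only runs if ggplot2 is installed.

```r
# Rebuild the ts object from the values shown in the question.
data1 <- ts(c(135, 172, 179, 189, 212,  47, 301, 183, 247, 292, 280, 325,
              471, 243, 386, 235, 388, 257, 344, 526, 363, 261, 189, 173,
              272, 267, 197, 217, 393, 299, 343, 341, 315, 305, 384, 497),
            start = c(2014, 1), frequency = 12)

# Months elapsed since January of the starting year, one per observation.
months_elapsed <- seq_along(data1) - 1 + (start(data1)[2] - 1)

# One row per observation: year, month label (ordered Jan..Dec), value.
long <- data.frame(
  year  = start(data1)[1] + months_elapsed %/% 12,
  month = factor(month.abb[as.integer(cycle(data1))], levels = month.abb),
  value = as.numeric(data1)
)

# Optional plotting step: one line-plot panel per year.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = month, y = value, group = year)) +
    geom_line() +
    facet_grid(. ~ year)
}
```

Because month is a factor, group = year is needed so that geom_line connects the points within each panel.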
I would like to take two columns and append them to two other columns. For example, I have the data below:
EU.Member.States X. Other.countries..continued. X..1
Austria 122 Cameroon 203
Belgium 150 Canada 156
Denmark 179 Canary Islands 132
Finland 156 Cape Verde 147
France 130 Cayman Islands 213
How can I take the rows under "Other.countries..continued." and "X..1" and add them directly under "EU.Member.States" and "X." respectively?
I have tried using unite() from tidyr with no success.
Your question is almost identical to this one. Using the piping from dplyr package I can suggest a solution by first duplicating your column names, and then applying classic rbind. I used only the first 2 lines of your example:
df %>% setNames(names(df)[c(1,2,1,2)]) %>% {rbind(.[,1:2], .[,3:4])}
#### EU.Member.States X.
#### 1 Austria 122
#### 2 Belgium 150
#### 3 Cameroon 203
#### 4 Canada 156
Note: the brackets are here to tell the piping not to take the . as an implicit first argument.
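For comparison, the same stacking can be sketched in base R without the pipe. The data frame below recreates the first two rows of the example; right and stacked are names chosen here for illustration.

```r
# First two rows of the example data.
df <- data.frame(
  EU.Member.States            = c("Austria", "Belgium"),
  X.                          = c(122, 150),
  Other.countries..continued. = c("Cameroon", "Canada"),
  X..1                        = c(203, 156),
  stringsAsFactors = FALSE
)

# Give the right-hand pair of columns the same names as the left-hand pair,
# so that rbind() can match them up by name.
right <- setNames(df[, 3:4], names(df)[1:2])

# Stack the two column pairs on top of each other.
stacked <- rbind(df[, 1:2], right)
print(stacked)
```

This does in two explicit steps what the one-liner does in a single pipe: rename, then rbind.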
I'm making an app that will predict an NFL running back's number of rush attempts and rush yards AFTER a season of 1800+ rush yards. I use slider inputs for the # of rushing yards and attempts, which gets run through lm() and predict() and returns estimates for next year's attempts and rush yards (I know it's not a very good predictor at all, but this is just an exercise in making a Shiny app). Here's the data from my excel file and then the code.
Player Yr. Team Attempts Att.Next.Yr Yards Yards.Next.Yr YPC YPC.Next.Yr
1 Adrian Peterson 2012 MIN 348 279 2097 1266 6.0 4.5
2 Chris Johnson 2009 TEN 358 316 2006 1364 5.6 4.3
3 LaDainian Tomlinson 2006 SD 348 315 1815 1474 5.2 4.7
4 Shaun Alexander 2005 SEA 370 252 1880 896 5.1 3.6
5 Tiki Barber 2005 NYG 357 327 1860 1662 5.2 5.1
6 Jamal Lewis 2003 BAL 387 235 2066 1006 5.3 4.3
7 Ahman Green 2003 GB 355 259 1883 1163 5.3 4.5
8 Ricky Williams 2002 MIA 383 392 1853 1372 4.8 3.5
9 Terrell Davis 1998 DEN 392 67 2008 211 5.1 3.1
10 Jamal Anderson 1998 ATL 410 19 1846 59 4.5 3.1
11 Barry Sanders 1997 DET 335 343 2053 1491 6.1 4.3
12 Barry Sanders 1994 DET 331 314 1883 1500 5.7 4.8
13 Eric Dickerson 1986 RAM 404 60 1821 277 4.5 4.6
14 Eric Dickerson 1984 RAM 379 292 2105 1234 5.6 4.2
15 Eric Dickerson 1983 RAM 390 379 1808 2105 4.6 5.6
16 Earl Campbell 1980 HOU 373 361 1934 1376 5.2 3.8
17 Walter Payton 1977 CHI 339 333 1852 1395 5.5 4.2
18 O.J. Simpson 1975 BUF 329 290 1817 1503 5.5 5.2
19 O.J. Simpson 1973 BUF 332 270 2003 1125 6.0 4.2
20 Jim Brown 1963 CLE 291 280 1863 1446 6.4 5.2
Server.R
# server.R
library(UsingR)
library(xlsx)
rawdata <- read.xlsx("RushingYards.xlsx", sheetIndex=1)
data <- rawdata[c(2:21),]
rownames(data) <- NULL
# Att
set.seed(1)
fitAtt <- lm(Att.Next.Yr ~ Yards + Attempts, data)
# Yds
set.seed(1)
fitYds <- lm(Yards.Next.Yr ~ Yards + Attempts, data)
shinyServer(
function(input, output) {
output$newPlot <- renderPlot({
iYards <- input$Yards
iAttempts <- input$Attempts
test <- data.frame(iYards,iAttempts)
names(test) <- c("Yards", "Attempts")
predictAtt <- predict(fitAtt, test)
predictYds <- predict(fitYds, test)
qplot(data=data, x=Attempts, y=Yards) +
geom_point(aes(x=predictAtt, y=predictYds, color="Estimate"))
output$renderYds <- renderPrint({predictYds})
output$renderAtt <- renderPrint({predictAtt})
})
}
)
UI.R
# ui.R
shinyUI(pageWithSidebar(
headerPanel("Rushing Projections"),
sidebarPanel(
sliderInput('Yards', 'How many yards rushed for this season',
value=1700, min=1500, max=2500, step=25,),
sliderInput('Attempts', 'How many attempts this season',
value=350, min=250, max=450, step=5,),
submitButton('Submit')
),
mainPanel(
plotOutput('newPlot'),
h3('Predicted rushing yards next year: '),
verbatimTextOutput("renderYds"),
h3('Predict attempts next year: '),
verbatimTextOutput("renderAtt")
)
))
The problem I'm having is I can't seem to output BOTH the plot (next year's estimates plotted in red against historical performances for running backs > 1800 rush yards) and the text of next year's estimated rushing yards and attempts at the same time. I can get one or the other to show up depending on where I put those statements. If I put
output$renderYds <- renderPrint({predictYds})
output$renderAtt <- renderPrint({predictAtt})
outside of output$newPlot (but still inside function(input, output)), the plot shows up and the point for next year's estimates changes as the input changes, but the text outputs fail with the error messages
object 'predictYds' not found and object 'predictAtt' not found. If I put those two lines inside output$newPlot (as in the code above), then the two text values show up correctly but the plot doesn't generate.
Can anyone help with this please?
I changed the structure of Server.R and now it works.
shinyServer(function(input, output) {
predictYds <- function(Y, A){
test <- data.frame(Y, A)
names(test) <- c("Yards", "Attempts")
predict(fitYds, test)
}
predictAtt <- function(Y, A){
test <- data.frame(Y, A)
names(test) <- c("Yards", "Attempts")
predict(fitAtt, test)
}
output$newPlot <- renderPlot({
newYards <- predictYds(input$Yards, input$Attempts)
newAttempts <- predictAtt(input$Yards, input$Attempts)
qplot(data=data, x=Attempts, y=Yards) +
geom_point(aes(x=newAttempts, y=newYards, color="Estimate"))
})
output$renderYds <- renderPrint({predictYds(input$Yards, input$Attempts)})
output$renderAtt <- renderPrint({predictAtt(input$Yards, input$Attempts)})
}
)
Basically, predictYds and predictAtt were rewritten as normal functions, which are then called inside the render functions with the input values.
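One side benefit of plain functions is that they can be tried outside Shiny. The sketch below is only an illustration: it types in the four numeric columns from the table in the question (instead of reading the Excel file), fits the same two models, and uses a single hypothetical helper, predict_next, in place of the two separate functions.

```r
# Numeric columns copied from the table in the question.
rush <- data.frame(
  Attempts      = c(348, 358, 348, 370, 357, 387, 355, 383, 392, 410,
                    335, 331, 404, 379, 390, 373, 339, 329, 332, 291),
  Att.Next.Yr   = c(279, 316, 315, 252, 327, 235, 259, 392,  67,  19,
                    343, 314,  60, 292, 379, 361, 333, 290, 270, 280),
  Yards         = c(2097, 2006, 1815, 1880, 1860, 2066, 1883, 1853, 2008, 1846,
                    2053, 1883, 1821, 2105, 1808, 1934, 1852, 1817, 2003, 1863),
  Yards.Next.Yr = c(1266, 1364, 1474,  896, 1662, 1006, 1163, 1372,  211,   59,
                    1491, 1500,  277, 1234, 2105, 1376, 1395, 1503, 1125, 1446)
)

# Same model formulas as in server.R.
fitAtt <- lm(Att.Next.Yr ~ Yards + Attempts, data = rush)
fitYds <- lm(Yards.Next.Yr ~ Yards + Attempts, data = rush)

# Hypothetical helper: build the one-row newdata frame and predict from a fit.
predict_next <- function(fit, yards, attempts) {
  newdata <- data.frame(Yards = yards, Attempts = attempts)
  unname(predict(fit, newdata))
}

# The values each render function would compute for the slider defaults.
est_yds <- predict_next(fitYds, 1700, 350)
est_att <- predict_next(fitAtt, 1700, 350)
```

Each render function can then call the helper with input$Yards and input$Attempts, so no render block depends on a variable created inside another one.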
I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching on one column ('movie title') as well as on the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep' but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good when you have typos in strings.
In this scenario, fuzzy matching alone may not give good results. For example, the movie title 'toy story' in one dataset can be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the movies that are matched are unique.
I want to know if there is a way to achieve this task without using a loop. In the worst case, if I have to use a loop, how can I make it work efficiently and as fast as possible?
I have tried the following code, but it takes an awfully long time to process.
for(i in 1:nrow(test))
for(j in 1:nrow(test1))
{
test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
test$title, NA)
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85, dataset_1$title, NA)
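Another way to bring the release date in, sketched here under some loud assumptions: join the two tables on release_date first, so only titles from the same year are ever compared, and then score each candidate pair. To keep the example free of extra packages it swaps Jaro-Winkler for the Levenshtein distance from base R's adist(), scaled to a 0-1 similarity. The toy data, the names d1, d2, pairs and matches, and the 0.75 cutoff are all made up for illustration.

```r
# Toy data standing in for the two datasets (titles already lower-cased).
d1 <- data.frame(title = c("toy story", "sliding doors", "mirage"),
                 release_date = c(1995, 1998, 1995),
                 stringsAsFactors = FALSE)
d2 <- data.frame(title = c("toy story 2", "toy stroy", "sliding doors"),
                 release_date = c(1999, 1995, 1998),
                 stringsAsFactors = FALSE)

# Blocking step: only pair up titles that share a release year.
pairs <- merge(d1, d2, by = "release_date", suffixes = c(".1", ".2"))

# Levenshtein distance for each candidate pair (base R, not Jaro-Winkler).
pairs$dist <- mapply(function(a, b) adist(a, b)[1, 1],
                     pairs$title.1, pairs$title.2, USE.NAMES = FALSE)

# Scale to a 0-1 similarity using the longer title's length.
pairs$sim <- 1 - pairs$dist / pmax(nchar(pairs$title.1), nchar(pairs$title.2))

# Keep pairs above an arbitrary, tunable cutoff.
matches <- pairs[pairs$sim > 0.75, ]
print(matches)
```

Here 'toy story' is never compared with 'toy story 2' because the years differ, while the typo 'toy stroy' from the same year still matches. On the real data the join may still produce many pairs for busy years, so the cutoff and the distance measure would need tuning.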
I have the data frame new1 with 20 columns of variables one of which is new1$year. This includes 25 years with the following count:
> table(new1$year)
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
2770 3171 3392 2955 2906 2801 2930 2985 3181 3059 2977 2884 3039 2428 2653 2522 2558 2370 2666 3046 3155 3047 2941 2591 1580
I tried to prepare a histogram of this with
hist(new1$year, breaks=25)
but I obtain a histogram where the height of the bars differs from the numbers in table(new1$year). For example, the first bar is above 4000 in the histogram, while it should be 2770. Another example is 1995, where the bar should be lower relative to the years around it, but it is also a little higher.
What am I doing wrong? I have tried numeric(new1$year) (the error says 'invalid length argument'), but with no different result.
Many thanks
Marco
Per my comment, try:
barplot(table(new1$year))
The reason hist does not work exactly as you intend has to do with specification of the breaks argument. See ?hist:
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.
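To see the effect, here is a small sketch that rebuilds a year vector from the counts in the question and compares hist() under the two kinds of breaks specification. The variable names are made up, and plot = FALSE is used so that only the bin counts are computed.

```r
# Yearly counts from table(new1$year) in the question, for 1988..2012.
counts <- c(2770, 3171, 3392, 2955, 2906, 2801, 2930, 2985, 3181, 3059,
            2977, 2884, 3039, 2428, 2653, 2522, 2558, 2370, 2666, 3046,
            3155, 3047, 2941, 2591, 1580)
years <- rep(1988:2012, times = counts)

# A single number is only a suggestion: hist() picks "pretty" breakpoints.
# On this data they fall on the whole years, so the first cell [1988, 1989]
# holds both 1988 and 1989.
h_default <- hist(years, breaks = 25, plot = FALSE)
print(h_default$breaks)
print(h_default$counts[1])

# Explicit breakpoints halfway between years give exactly one cell per year.
h_fixed <- hist(years, breaks = seq(1987.5, 2012.5, by = 1), plot = FALSE)
print(h_fixed$counts)
```

With the explicit vector of breakpoints the bar heights agree with table(new1$year), which is also what barplot(table(new1$year)) draws.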