This question already has an answer here:
ggplot year by year comparison
I have a data frame with data like
year range count
2011 '0 to 500' 10
2011 '500 to 1000' 100
2012 '0 to 500' 12
2012 '500 to 1000' 50
2013 '0 to 500' 22
2013 '500 to 1000' 75
How can I use ggplot to plot range on the x axis and count on the y axis, with one line per year?
I don't think this is a duplicate, so I've provided an answer. Your data structure requires some minimal text processing (to extract the from and to values), and probably geom_segment instead of geom_line.
s<-"year;range;count
2011;'0 to 500';10
2011;'500 to 1000';100
2012;'0 to 500';12
2012;'500 to 1000';50
2013;'0 to 500';22
2013;'500 to 1000';75"
d<-read.delim(textConnection(s),header=TRUE,sep=";",strip.white=TRUE)
d$range <- as.character(d$range) # remove factor
d$range <- gsub("'","",d$range) # remove character
# strsplit returns a list, one per line, with two elements, here i'm getting
# each of those elements
d$from<-as.numeric(sapply(strsplit(d$range,' to '),function(X)X[1]))
d$to<-as.numeric(sapply(strsplit(d$range,' to '),function(X)X[2]))
d$year <- as.factor(d$year)
library(ggplot2) # provides ggplot() and geom_segment()
ggplot(d)+geom_segment(aes(x=from,xend=to,y=count,yend=count,col=year))
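If a literal line per year is wanted, with the range labels themselves on the x axis, one alternative to the segment plot is to treat range as an ordered factor and group by year. This is only a sketch, not the method used above; it assumes the ranges should appear in the order in which they first occur in the data, and it rebuilds the data frame from the question so that it runs on its own.

```r
library(ggplot2)

# Data copied from the question
d <- data.frame(
  year  = rep(c(2011, 2012, 2013), each = 2),
  range = rep(c("0 to 500", "500 to 1000"), times = 3),
  count = c(10, 100, 12, 50, 22, 75)
)

# Keep the ranges in order of first appearance, not alphabetical order
d$range <- factor(d$range, levels = unique(d$range))
d$year  <- factor(d$year)

# group = year is needed because the x axis is discrete;
# without it ggplot does not know which points to connect
p <- ggplot(d, aes(x = range, y = count, group = year, colour = year)) +
  geom_line() +
  geom_point()
# print(p) draws the plot
```

With more than two ranges the factor levels may need to be set by hand, since first-appearance order is not guaranteed to be numeric order.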
I am building a stochastic model to predict the movement of objects floating in the ocean. I have thousands of data from drifter buoys all around the world. In the format as below:
index month year lat long
72615 10 2010 35,278 129,629
72615 11 2010 37,604 136,365
72615 12 2010 39,404 137,775
72615 1 2011 39,281 138,235
72620 1 2011 35,892 132,766
72620 2 2011 38,83 133,893
72620 3 2011 39,638 135,513
72620 4 2011 41,297 139,448
The general concept for the model is to divide the whole world into 2592 cells of 5x5 degrees, and then to build the Markov chain transition matrix using the rule that
the probability of going from cell i to cell j in 1 month equals to:
the number of times any buoy went from cell i to cell j in 1 month
divided by the
number of times any buoy exited i (including going from i to i).
However I have two troubles related to managing the data.
1. Is there an easy solution (preferably in Excel or R) to add a sixth column to the data set, whose value would depend only on the latitude and longitude, such that it would equal:
1 when both latitude and longitude are between 0 and 5
2 when latitude is between 0 and 5 and longitude between 5 and 10
3 when latitude is between 0 and 5 and longitude between 10 and 15
and so on up to the number 2592
2. Is there an easy way to count the number of times any buoy went from cell i to cell j in 1 month?
I was trying to figure out the solution to the question 1 in Excel, but could not think of anything more efficient than just sorting by the latitude / longitude columns and then writing the values manually.
I've been also told that R is much better for managing such data sets, but I am not experienced with it and couldn't find the solution myself.
I would really appreciate any help.
Someone can probably come up with something much more sophisticated/faster, but this is a crude approach that has the benefit of being relatively easy to understand.
Sample data:
dd <- read.table(header=TRUE,dec=",",text="
index month year lat long
72615 10 2010 35,278 129,629
72615 11 2010 37,604 136,365
72615 12 2010 39,404 137,775
72615 1 2011 39,281 138,235
72620 1 2011 35,892 132,766
72620 2 2011 38,83 133,893
72620 3 2011 39,638 135,513
72620 4 2011 41,297 139,448")
Generate indices that equal 1 for [0, 5), 2 for [5, 10), etc. (negative coordinates give indices below 1, so shift them first if your data include any):
dd$x <- (dd$lat %/% 5) + 1
dd$y <- (dd$long %/% 5) + 1
Set up an empty matrix (not sure I have the rows/columns right)
mm <- matrix(0,nrow=36,ncol=72)
(you might want to use the dimnames argument here for clarity)
Fill it in:
for (i in 1:nrow(dd)) {
mm[dd[i,"x"],dd[i,"y"]] <- mm[dd[i,"x"],dd[i,"y"]]+1
}
If you have only thousands of rows, this might be fast enough. I would try it and see if you need something fancier. (If you need to collapse the matrix back to a set of columns, you can use reshape2::melt or tidyr::gather ...)
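The matrix above counts how often each cell is occupied. For the two things the question actually asks for (a single cell number per row, and counts of month-to-month moves between cells), a sketch along the following lines may help. It rests on several assumptions that are not in the question: latitude lies in [-90, 90] and longitude in [-180, 180], cells are numbered row by row starting at the south-west corner (so the numbering differs from the 0-based one described in the question), and a move only counts when two records of the same buoy are exactly one month apart. The names cell, t, counts and P are made up for illustration.

```r
dd <- read.table(header = TRUE, dec = ",", text = "
index month year lat long
72615 10 2010 35,278 129,629
72615 11 2010 37,604 136,365
72615 12 2010 39,404 137,775
72615 1 2011 39,281 138,235
72620 1 2011 35,892 132,766
72620 2 2011 38,83 133,893
72620 3 2011 39,638 135,513
72620 4 2011 41,297 139,448")

# Assumption: shift coordinates so that bins start at 0; pmin() keeps the
# boundary values lat = 90 and long = 180 inside the last bin
lat.bin  <- pmin((dd$lat + 90) %/% 5, 35)    # 0..35
long.bin <- pmin((dd$long + 180) %/% 5, 71)  # 0..71
dd$cell  <- lat.bin * 72 + long.bin + 1      # 1..2592

# Order each buoy's records in time and give every month a running number
dd   <- dd[order(dd$index, dd$year, dd$month), ]
dd$t <- dd$year * 12 + dd$month

# Pair each row with the next one; keep pairs from the same buoy, one month apart
n    <- nrow(dd)
from <- dd[-n, ]
to   <- dd[-1, ]
ok   <- from$index == to$index & (to$t - from$t) == 1

# Count moves between the cells that actually occur in the data
cells  <- sort(unique(dd$cell))
counts <- table(from = factor(from$cell[ok], levels = cells),
                to   = factor(to$cell[ok],   levels = cells))

# Row-normalise to get transition probabilities; rows with no exits stay at 0
P <- counts / pmax(rowSums(counts), 1)
```

Restricting the table to observed cells keeps it small; using levels = 1:2592 instead would give the full 2592 x 2592 matrix at the cost of memory.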
I have a hypothetical data-frame as follows:
# inventory of goods
year category count-of-good
2010 bikes 1
2011 bikes 3
2013 bikes 5
2010 skates 1
2011 skates 1
2013 skates 0
2010 skis 0
2011 skis 2
2013 skis 2
my end goal is to show a stacked bar chart of how the %-<good>-of-decade-total has changed year-to-year.
therefore, i want to compute the following:
now, i should be able to ggplot(df, aes(factor(year), fill=percent.total.decade.goods)) + geom_bar(), or similar (hopefully!), creating a bar chart where each bar sums to 100%.
however, i'm struggling to determine how to get percent.good.of.decade.total (the far right column) in non-hacky way. Thanks for your time!
You can use dplyr to compute the sum:
library("dplyr")
newDf=df%>%group_by(year)%>%mutate(decades.total.goods=sum(count.of.goods))%>%ungroup()
Either use mutate or normal R syntax to compute the "% good of decade total"
Note: you have not shared your exact data-frame, so the names are obviously made up.
We can do this with ave from base R
df1$decades.total.goods <- with(df1, ave(count.of.good, year, FUN = sum))
df1$decades.total.goods
#[1] 2 6 7 2 6 7 2 6 7
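To go from the total to the percentage itself, a minimal base R sketch follows. It assumes, as both answers do, that the denominator is the total for each year; the data frame is rebuilt from the table in the question, and the column name percent.of.year.total is made up for illustration.

```r
# Data copied from the question's table
df1 <- data.frame(
  year          = rep(c(2010, 2011, 2013), times = 3),
  category      = rep(c("bikes", "skates", "skis"), each = 3),
  count.of.good = c(1, 3, 5, 1, 1, 0, 0, 2, 2)
)

# Total count per year, repeated on every row of that year
df1$decades.total.goods <- with(df1, ave(count.of.good, year, FUN = sum))

# Share of each good within its year, in percent (hypothetical column name)
df1$percent.of.year.total <- 100 * df1$count.of.good / df1$decades.total.goods
```

Each year's percentages then sum to 100, so something like ggplot(df1, aes(factor(year), percent.of.year.total, fill = category)) + geom_col() should give bars of equal height.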
MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join two datasets together?
It says that you can join using the combined key of Month and Year.
But I have been struggling with this question for several days.
There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data=my.df)
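Put together, the merge option might look like the sketch below. It uses the four rows shown in the question with the labels swapped as assumed above (so MARRIAGE_LICENSES lives in MarriageLicen and Amount in BusinessLicen), and the value 1000 passed to predict() is a made-up input, not something from the question.

```r
# Rows copied from the question, labels swapped as assumed in this answer
MarriageLicen <- data.frame(
  Month = c("Jan", "Feb", "Mar", "Apr"),
  Year  = 2011,
  MARRIAGE_LICENSES = c(754, 2706, 2689, 738)
)
BusinessLicen <- data.frame(
  Year   = 2011,
  Month  = c("Jan", "Feb", "Mar", "Apr"),
  Amount = c(742, 796, 1210, 1376)
)

# Join on the combined key; column order in the inputs does not matter
my.df <- merge(BusinessLicen, MarriageLicen, by = c("Month", "Year"))

# Simple linear regression of marriage licenses on business licenses
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data = my.df)

# Prediction for a hypothetical month with 1000 business licenses
new.obs    <- data.frame(Amount = 1000)
prediction <- predict(my.model, newdata = new.obs)
```

Because merge() matches on both Month and Year, rows that appear in only one of the two data frames are dropped by default.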
I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on stock trading days, and sometimes we have Nx observations on non-trading days. We want to place an "NA" where we do not have any Nx observations on trading days, but eliminate Nx observations where we have them on non-trading days, since without trading data for the same day Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R Documentation for merge.zoo regarding the use of "all": that its use "allows
intersection, union and left and right joins to be expressed". But trying all combinations of the
following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged
(otherwise expanded)
and you want to keep all rows from the first argument but not the second so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason for the change in all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments whereas merge.data.frame only handles two so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
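A compact way to try this out is sketched below. To keep it short it uses only the ABC columns from the question, builds the zoo objects from the numeric columns alone (so the date strings do not end up in the data), and uses %y for the two-digit years. The object names are shortened for illustration.

```r
library(zoo)

# ABC columns copied from the question
Stk_data <- data.frame(
  Date_Stk = c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13"),
  ABC_Stk  = c(65.73, 66.85, 66.92, 66.60, 66.07)
)
Nx_data <- data.frame(
  Date_Nx = c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13"),
  ABC_Nx  = c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564)
)

# Numeric columns only; %y parses two-digit years such as "13"
z.Stk <- zoo(as.matrix(Stk_data[, -1, drop = FALSE]),
             as.Date(Stk_data$Date_Stk, format = "%m/%d/%y"))
z.Nx  <- zoo(as.matrix(Nx_data[, -1, drop = FALSE]),
             as.Date(Nx_data$Date_Nx, format = "%m/%d/%y"))

# Keep every trading day, drop Nx dates that have no trading day
z.merged <- merge(z.Stk, z.Nx, all = c(TRUE, FALSE))

# Carry the last Nx observation forward over the Jan 3 gap
z.filled <- na.locf(z.merged)
```

The merged object has one row per trading day: Jan 5 and Jan 6 are gone, and Jan 3 holds NA in the Nx column until na.locf fills it.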
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034
I have computed monthly returns from a price series. I then build a dataframe as follows:
y.ret_1981 y.ret_1982 y.ret_1983 y.ret_1984 y.ret_1985
1 0.0001015229 0.0030780203 -0.0052233836 0.017128325 -0.002427308
2 0.0005678989 0.0009249838 -0.0023294622 -0.030531971 0.001831160
3 -0.0019040392 -0.0021614791 0.0022451252 -0.003345983 0.005773503
4 -0.0006015118 0.0010695681 0.0052680258 0.008592513 0.009867972
5 0.0052736054 -0.0003181347 -0.0008505673 -0.000623061 -0.012225140
6 0.0014266119 -0.0101045071 -0.0003073150 -0.016084505 -0.005883687
7 -0.0069002733 -0.0078170620 0.0070058676 -0.007870294 -0.010265335
8 -0.0041963258 0.0039905142 0.0134996961 -0.002149331 -0.007860940
9 0.0020778541 -0.0038834826 0.0052289589 0.007271409 -0.005320848
10 0.0030956487 -0.0005027686 -0.0021452210 0.002502301 -0.001890657
11 -0.0032375542 0.0063916686 0.0009331531 0.004679741 0.004338580
12 0.0014882164 0.0039578527 0.0136663415 0.000000000 0.003807668
... where columns are the monthly returns for the years 1981 to 1985 and the rows 1 to 12 are the months of the year.
I would like to plot a boxplot similar to the one below:
How can I do this? I would also like my graph to show the names of the months instead of the integers 1 to 12.
Thank you.
First, add a new column month to your original data frame containing month.name (a built-in constant in R) and use it as a factor. It is important to also set levels= inside factor() to ensure that the months are arranged in chronological order rather than alphabetical order.
Then melt this data frame from wide format to long format. In ggplot() use month as x values and value as y.
df$month<-factor(month.name,levels=month.name)
library(reshape2)
df.long<-melt(df,id.vars="month")
library(ggplot2)
ggplot(df.long,aes(month,value))+geom_boxplot()
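To see what the wide-to-long step produces without reshape2, the same long data frame can be assembled in base R. This is a sketch using only the 1981 and 1982 columns from the question; the column names variable and value mirror what melt() returns, and the construction by hand is a stand-in for melt(), not a different method of plotting.

```r
# First two columns copied from the question's data frame
df <- data.frame(
  y.ret_1981 = c(0.0001015229, 0.0005678989, -0.0019040392, -0.0006015118,
                 0.0052736054, 0.0014266119, -0.0069002733, -0.0041963258,
                 0.0020778541, 0.0030956487, -0.0032375542, 0.0014882164),
  y.ret_1982 = c(0.0030780203, 0.0009249838, -0.0021614791, 0.0010695681,
                 -0.0003181347, -0.0101045071, -0.0078170620, 0.0039905142,
                 -0.0038834826, -0.0005027686, 0.0063916686, 0.0039578527)
)

# One row per (month, year column) pair: months repeat for every column
df.long <- data.frame(
  month    = factor(rep(month.name, times = ncol(df)), levels = month.name),
  variable = rep(names(df), each = nrow(df)),
  value    = unlist(df, use.names = FALSE)
)
```

The resulting df.long has 24 rows here (12 months times 2 years) and can be passed to the ggplot() call above unchanged.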