ggplot: Best way to plot this using R with facets - r

Quarter Team Year Units Sales
2015Q3 A 2015 25000 61.1038751
2015Q3 B 2015 1370 4.5081774
2015Q3 C 2015 19103 34.9492249
2015Q3 D 2015 10757 0.5222169
2015Q3 E 2015 2658 6.0959838
2015Q2 A 2015 2500 10.38751
2015Q2 B 2015 370 3.508
2015Q2 C 2015 1103 2.94
2015Q2 D 2015 757 4.5222169
2015Q2 E 2015 658 5.09
...
2005Q3 A 2015 25000 31.1038751
2005Q3 B 2015 1370 4.5081774
2005Q3 C 2015 12345 4.9492249
2005Q3 D 2015 102 3.5222169
2005Q3 E 2015 2658 5.0959838
I have above data set which is quarterly sales by every team. I am trying to figure out what is the best way to plot this so there can be a good visual representation of each teams sales and units gets shown clear per year per quarter basis. I do have large number of teams for example 50.
I was trying to do something like this
library(ggplot2)
ggplot(df) +
geom_line(aes(x=Quarter,y=Sales,group=Year))+
facet_grid(.~Year,scales="free")
But this doesn't give a good idea of whats going on. I was going over http://docs.ggplot2.org/0.9.3.1/facet_wrap.html but unable to grasp how I should plot it.

Related

How can I change row and column indexes of a dataframe in R?

I have a dataframe in R which has three columns Product_Name(name of books), Year and Units (number of units sold in that year) which looks like this:
Product_Name Year Units
A Modest Proposal 2011 10000
A Modest Proposal 2012 11000
A Modest Proposal 2013 12000
A Modest Proposal 2014 13000
Animal Farm 2011 8000
Animal Farm 2012 9000
Animal Farm 2013 11000
Animal Farm 2014 15000
Catch 22 2011 1000
Catch 22 2012 2000
Catch 22 2013 3000
Catch 22 2014 4000
....
I intend to make a R Shiny dashboard with that where I want to keep the year as a drop-down menu option, for which I wanted to have the dataframe in the following format
A Modest Proposal Animal Farm Catch 22
2011 10000 8000 1000
2012 11000 9000 2000
2013 12000 11000 3000
2014 13000 15000 4000
or the other way round where the Product Names are row indexes and Years are column indexes, either way goes.
How can I do this in R?
Your general issue is transforming long data to wide data. For this, you can use data.table's dcast function (amongst many others):
dt = data.table(
Name = c(rep('A', 4), rep('B', 4), rep('C', 4)),
Year = c(rep(2011:2014, 3)),
Units = rnorm(12)
)
> dt
Name Year Units
1: A 2011 -0.26861318
2: A 2012 0.27194732
3: A 2013 -0.39331361
4: A 2014 0.58200101
5: B 2011 0.09885381
6: B 2012 -0.13786098
7: B 2013 0.03778400
8: B 2014 0.02576433
9: C 2011 -0.86682584
10: C 2012 -1.34319590
11: C 2013 0.10012673
12: C 2014 -0.42956207
> dcast(dt, Year ~ Name, value.var = 'Units')
Year A B C
1: 2011 -0.2686132 0.09885381 -0.8668258
2: 2012 0.2719473 -0.13786098 -1.3431959
3: 2013 -0.3933136 0.03778400 0.1001267
4: 2014 0.5820010 0.02576433 -0.4295621
For the next time, it is easier if you provide a reproducible example, so that the people assisting you do not have to manually recreate your data structure :)
You need to use pivot_wider from tidyr package. I assumed your data is saved in df and you also need dplyr package for %>% (piping)
library(tidyr)
library(dplyr)
df %>%
pivot_wider(names_from = Product_Name, values_from = Units)
Assuming that your dataframe is ordered by Product_Name and by year, I will generate artificial data similar to your datafrme, try this:
Col_1 <- sort(rep(LETTERS[1:3], 4))
Col_2 <- rep(2011:2014, 3)
# artificial data
resp <- ceiling(rnorm(12, 5000, 500))
uu <- data.frame(Col_1, Col_2, resp)
uu
# output is
Col_1 Col_2 resp
1 A 2011 5297
2 A 2012 4963
3 A 2013 4369
4 A 2014 4278
5 B 2011 4721
6 B 2012 5021
7 B 2013 4118
8 B 2014 5262
9 C 2011 4601
10 C 2012 5013
11 C 2013 5707
12 C 2014 5637
>
> # Here starts
> output <- aggregate(uu$resp, list(uu$Col_1), function(x) {x})
> output
Group.1 x.1 x.2 x.3 x.4
1 A 5297 4963 4369 4278
2 B 4721 5021 4118 5262
3 C 4601 5013 5707 5637
>
output2 <- output [, -1]
colnames(output2) <- levels(as.factor(uu$Col_2))
rownames(output2) <- levels(as.factor(uu$Col_1))
# transpose the matrix
> t(output2)
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
> # or convert to data.frame
> as.data.frame(t(output2))
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame that has hourly observational climate data over multiple years, I have included a dummy data frame below that will hopefully illustrate my QU.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or in this example, by month) to find if df$WS has 75% or more of valid data for that month. My filtering criteria is NA as 0 is still a valid observation. I have real NAs as it is observational climate data.
I have tried dplyr piping using %>% function to filer by a new column "Month" as well as reviewing several questions on here
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script that works in a looping function that will go through all my stations and all the years in each station to produce a wind rose if this criteria is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many way of doing this. This one appears quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values, NA, in your data, the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I believe that maybe a boxplot is not the best way to depict your data. You only have one value per year/month, when a boxplot would require more. Maybe a simple scatter plot will do better.

From monadic to dyadic data in R

For the sake of simplicity, let's say I have a dataset at the country-year level, that lists organizations that received aid from a government, how much money was that, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes so many organizations each year, so there are a lot a blank spaces. Moreover, they do not follow any order: one organization can be in the first spot one year, and the next year be coded in the second spot. The data looks like this:
> State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2 Org3 Aid3 Proj3 Org4 Aid4 Proj4 ...
Italy 2000 A 1000 Arts B 500 Arts C 300 Social
Italy 2001 B 700 Social A 1000 Envir
Italy 2002 A 1000 Arts C 300 Envir
UK 2000
UK 2001 Z 2000 Social
UK 2002 Z 2000 Social
...
I'm trying to transform this into dyadic data, which would look like this:
> State Org Year Aid Proj
Italy A 2000 1000 Arts
Italy A 2001 1000 Envir
Italy A 2002 1000 Arts
Italy B 2000 500 Arts
Italy B 2001 700 Social
Italy C 2000 300 Social
Italy C 2002 300 Envir
UK Z 2001 2000 Social
...
I'm using R, and the best way I could find was building a pre-defined possible set of dyads —using something like expand.grid(unique(State), unique(Org))— and then looping through the data, finding the corresponding column and filling the data frame. But I don't thing this is the most effective method, so I was wondering whether there would be a better way. I thought about dplyror reshape but can't find a solution.
I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.
Thanks a lot in advance.
Since you did not use dput, I will try and make some data that resemble yours:
dat = data.frame(State = rep(c("Italy", "UK"), 3),
Year = rep(c(2014, 2015, 2016), 2),
Org1 = letters[1:6],
Aid1 = sample(800:1000, 6),
Proj1 = rep(c("A", "B"), 3),
Org2 = letters[7:12],
Aid2 = sample(600:700, 6),
Proj2 = rep(c("C", "D"), 3),
stringsAsFactors = FALSE)
dat
# State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014 a 910 A g 658 C
# 2 UK 2015 b 926 B h 681 D
# 3 Italy 2016 c 834 A i 625 C
# 4 UK 2014 d 858 B j 620 D
# 5 Italy 2015 e 831 A k 650 C
# 6 UK 2016 f 821 B l 687 D
Next I gather the data and then use extract to make 2 new columns and then spread it all again:
library(tidyr)
library(dplyr)
dat %>%
gather(key, value, -c(State, Year)) %>%
extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>%
spread(key, value) %>%
select(-num)
# State Year Aid Org Proj
# 1 Italy 2014 910 a A
# 2 Italy 2014 658 g C
# 3 Italy 2015 831 e A
# 4 Italy 2015 650 k C
# 5 Italy 2016 834 c A
# 6 Italy 2016 625 i C
# 7 UK 2014 858 d B
# 8 UK 2014 620 j D
# 9 UK 2015 926 b B
# 10 UK 2015 681 h D
# 11 UK 2016 821 f B
# 12 UK 2016 687 l D
Is this the desired output?

Reassign the starting Julian date (July 1st = Julian date 1, southern hemisphere)

My data set includes various observations at different stages throughout the year.
year when samples were collected.
site location of measurement
Class physical stage during r of measurement
date date of measurement
Julian Julian date
The final measurements usually occur in the early part of the new year, which is the summer time in the southern hemisphere. (e.g. summer is winter, spring is fall).
year site Class date Julian
1 2009 10C Early 2008-09-15 259
2 2009 10C L2 2008-09-29 273
3 2009 10C L3 2008-12-15 350
4 2010 10C Early 2009-08-31 243
5 2010 10C L2 2009-09-14 257
6 2010 10C L3 2009-12-11 345
7 2012 10C Early 2011-08-23 235
8 2012 10C L2 2011-09-22 265
9 2012 10C L3 2011-12-03 337
10 2012 10C LSample 2012-03-26 86
11 2013 10C Early 2012-09-07 251
12 2013 10C L2 2012-09-30 274
13 2013 10C L3 2012-12-17 352
14 2014 10C Early 2013-09-02 245
15 2014 10C L2 2013-09-16 259
16 2014 10C L3 2013-12-16 350
17 2014 10C LMid 2014-01-07 7
18 2015 10C Early 2014-09-08 251
19 2015 10C L2 2014-09-30 273
20 2015 10C L3 2014-12-01 335
I am having a difficult time converting/reassigning the Julian start date to July 1st instead of January 1st. The dot plot below illustrates the final sampling that occurs at the beginning of the year (February-March).
The chron package has an option to reorder the origin but I cannot get it to work properly with my data.
library(chron)
library(dplyr)
data.date <- data %>%
mutate(July.Julian = chron(date,format = c(dates = "ymd"), options(chron.origin = c(month=7, day=1, year=2008))))
Error in chron(c("2008-09-15", "2008-09-29", "2008-12-15", "2009-08-31", :
misspecified chron format(s) length
or
July.Julian = chron(data$date, format = c(dates = "ymd"), options(chron.origin = c(month=7, day=1, year=2008)))
Error in chron(c("2008-09-15", "2008-09-29", "2008-12-15", "2009-08-31", :
misspecified chron format(s) length
I am trying to start the Julian date as 1 instead of 182.
Thoughts or suggestions are welcome.
Assuming that July.Julian is supposed to be Julian days past July 1st:
transform(date.data, July.Julian = as.chron(sprintf("%d-07-01", year)) + Julian)
or
date.data %>% mutate(July.Julian = as.chron(sprintf("%d-07-01", year)) + Julian)
Note that one does not actually need chron here. Just replace as.chron with as.Date and either of these work.

Resources