ggplot2 or sjPlot sum stacked barplot columns - r

I am running R version 3.5.2 (2018-12-20) -- "Eggshell Igloo" on a MacBook Pro, OS 10.14.2.
I have tried a few methods to get these plots. My preferred method is a stacked barplot of my data: the year factor on the x axis, counts on the y axis, and each column split by the counts of a dichotomous (0/1) variable, so that the segments add up to the total on the y axis. However, I am flexible. I have the code below, which works; being able to overlay a barplot on it would also help.
ggplot(dat, aes(x=factor(yr),y=n, group=(n>0)))+
stat_summary(aes(color=(n>0)),fun.y=length, geom="line")+
scale_color_discrete("Key",labels=c("NN", "N"))+
labs(title= "1992-2018", x="Years",y="n")
Using my full dataset, I tried this and got really close to the stacked barplot. It gave me the correct counts per the "yr" variable, but for my variable "n" it gave me a continuous range of 0-1.0.
p<-ggplot(data=dat, aes(x=yr, y=n, fill=n)) +
geom_bar(stat="identity")
This is the data I am most interested in. I tried to then coerce it into a table then a data frame.
t2<- table(dat$yr, dat$n)
0 1
1992 6 0
1993 10 0
1994 3 1
1995 20 2
1996 15 2
1997 16 0
1998 16 0
1999 9 3
2000 5 0
2001 5 1
2002 7 1
2003 9 2
2004 4 3
2005 6 3
2006 5 3
2007 6 3
2008 4 3
2009 8 4
2010 7 1
2011 4 5
2012 4 5
2013 6 2
2014 0 2
2015 3 3
2016 5 5
2017 4 4
2018 8 5
t<-table(dat$yr)
1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
6 10 4 22 17 16 16 12 5 6 8 11 7
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
9 8 9 7 12 8 9 9 8 2 6 10 8
2018
13
I then tried:
df<- data.frame(t, t2)
head(df)
Var1 Freq Var1.1 Var2 Freq.1
1 1992 6 1992 0 6
2 1993 10 1993 0 10
3 1994 4 1994 0 3
4 1995 22 1995 0 20
5 1996 17 1996 0 15
6 1997 16 1997 0 16
p<-ggplot(data=df, aes(x=Var1, y=Var2)) +
geom_bar(stat="identity")
p
Replacing these with the dataset variables gave me worse results: the y-axis showed no counts per year for the "yr" variable, and each column was filled all the way to the top of the range of "1".
Again, I would like a stacked barplot in which the binary "n" splits each year column into its 0/1 counts, which should sum to the 'yr' counts on the y-axis. Alternatively, I could use the ggplot from the first code I posted and get the sums for each year there; I would take that as well.
This comes really, really close. If it also gave a total at the top it would be perfect.
Package sjPlot:
sjp.grpfrq(dat$yr, dat$n, bar.pos = c("stack"), show.values = TRUE, show.n = TRUE, show.prc = FALSE, title = NULL)
The major issue with the sjPlot code is that I cannot change the legend labels. It shows n = 0, 1, and I need to change this to something specific.
Thanks so much in advance!

Try this and see if it's what you want:
ggplot(data=df, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity")

Resolved.
sjp.grpfrq(dat$yr, dat$n, bar.pos = c("stack"), legend.title = "Key", legend.labels = c("NN", "N"), show.values = TRUE, show.n = TRUE, show.prc = FALSE, show.axis.values = TRUE, title = "1992-2018")
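For anyone who wants the same result in ggplot2 alone, here is a minimal sketch. It assumes dat has one row per observation with a binary n; the small data frame below is a made-up stand-in, not the real data. Letting geom_bar() count rows (instead of stat="identity") and mapping fill to factor(n) gives discrete 0/1 segments, and a separate table of per-year totals supplies the label on top of each bar.

```r
library(ggplot2)

# Hypothetical stand-in for `dat`: one row per observation, binary `n`
dat <- data.frame(
  yr = c(1992, 1992, 1994, 1994, 1994, 1995, 1995),
  n  = c(0, 0, 0, 0, 1, 0, 1)
)

# Per-year totals, used only for the labels above each bar
totals <- as.data.frame(table(yr = dat$yr))

p <- ggplot(dat, aes(x = factor(yr), fill = factor(n))) +
  geom_bar() +  # default stat = "count": one stacked segment per level of n
  geom_text(data = totals, aes(x = yr, y = Freq, label = Freq),
            inherit.aes = FALSE, vjust = -0.5) +
  scale_fill_discrete("Key", labels = c("NN", "N")) +
  labs(title = "1992-2018", x = "Years", y = "Count")
```

Printing p draws the plot. The legend labels are set the same way as in the line plot from the question, just on the fill scale instead of the color scale.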

Related

R - Expand dataframe to create panel data

I have a dataset where I observe individuals for different years (e.g., individual 1 is observed in 2012 and 2014, while individuals 2 and 3 are only observed in 2016). I would like to expand the data for each individual (i.e., each individual would have 3 rows: 2012, 2014 and 2016) in order to create a panel data with an indicator for whether an individual is observed or not.
My initial dataset is:
year individual_id rank
2012 1 11
2014 1 16
2016 2 76
2016 3 125
And I would like to get something like that:
year individual_id rank present
2012 1 11 1
2014 1 16 1
2016 1 . 0
2012 2 . 0
2014 2 . 0
2016 2 76 1
2012 3 . 0
2014 3 . 0
2016 3 125 1
So far I have tried to play with "expand":
bys researcher: egen count=count(year)
replace count=3-count+1
bys researcher: replace count=. if _n>1
expand count
which gives me 3 rows per individual. Unfortunately, this just copies one of the initial rows, and I am unable to get from there to the final desired dataset.
Thanks in advance for your help!
You can use expand.grid to create a data frame of all combinations of your inputs. Then full join the tables together and add a condition to determine whether the individual was present that year or not.
library(dplyr)
dt = data.frame(
year = c(2012,2014,2016,2016),
individual_id = c(1,1,2,3),
rank = c(11,16,76,125)
)
exp = expand.grid(year = c(2012,2014,2016), individual_id = c(1:3))
dt %>%
full_join(exp, by = c("year","individual_id")) %>%
mutate(present = ifelse(!is.na(rank), 1, 0)) %>%
arrange(individual_id, year)
year individual_id rank present
1 2012 1 11 1
2 2014 1 16 1
3 2016 1 NA 0
4 2012 2 NA 0
5 2014 2 NA 0
6 2016 2 76 1
7 2012 3 NA 0
8 2014 3 NA 0
9 2016 3 125 1
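If you would rather not depend on dplyr, the same idea can be sketched in base R. This version assumes the same dt as above; the names grid and panel are mine. It builds the grid from the values found in the data and uses a left join via merge(..., all.x = TRUE).

```r
dt <- data.frame(
  year          = c(2012, 2014, 2016, 2016),
  individual_id = c(1, 1, 2, 3),
  rank          = c(11, 16, 76, 125)
)

# Every year paired with every individual seen in the data
grid <- expand.grid(year = unique(dt$year),
                    individual_id = unique(dt$individual_id))

# Left join: keep every grid row, fill rank with NA where nothing was observed
panel <- merge(grid, dt, by = c("year", "individual_id"), all.x = TRUE)
panel$present <- as.integer(!is.na(panel$rank))
panel <- panel[order(panel$individual_id, panel$year), ]
rownames(panel) <- NULL
```

Note that present is derived from rank being missing, so this only works if rank is never NA for an observed row.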

How do I ggplot a geom_bar with instance counts by year + a line graph with the sum of a column by year?

I'm trying to replicate a graph like this in ggplot (I'm very new to R). I've managed to plot the bars but I'm unable to figure out the line.
My data would look roughly like this:
uid casualties year
1 1 34 1999
2 2 1 1999
3 3 6 1999
4 4 1 2000
5 5 1 2000
6 6 1 2000
7 7 5 2001
8 8 1 2001
9 9 0 2001
10 10 1 2001
11 11 0 2002
12 12 0 2002
13 13 1 2002
14 14 1 2002
My bar graph would depict the number of row instances grouped by year, and my line graph would be the sum of casualties in that year. So for the data sample above, the plot should depict the following
year instances (bar) casualties (line)
1999 3 41
2000 3 3
2001 4 7
2002 4 2
So far my code is as follows:
casualties_by_year = aggregate(data["casualties"],by=data["year"],sum)
ggplot(data, aes(x=factor(year))) +
geom_bar(width=0.4, fill='darkorange') + # works fine
geom_line(data=casualties_by_year, aes(x=year, y=casualties)) + # unexpected results
theme_minimal() +
scale_x_discrete(breaks = NULL)
If I change the year in geom_line to factor(year), it shows me the expected bar graph but no line.
What is my code missing?
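One likely cause, sketched below with the sample data from the question: on a discrete x axis geom_line() treats each x value as its own group, so every group holds a single point and nothing is drawn. Using factor(year) in the line layer and adding group = 1 puts all points into one group. Treat this as a sketch rather than a full replica of the target graph; in particular, bars and line share one y axis here.

```r
library(ggplot2)

# Sample data from the question
data <- data.frame(
  uid        = 1:14,
  casualties = c(34, 1, 6, 1, 1, 1, 5, 1, 0, 1, 0, 0, 1, 1),
  year       = rep(c(1999, 2000, 2001, 2002), times = c(3, 3, 4, 4))
)

casualties_by_year <- aggregate(data["casualties"], by = data["year"], sum)

p <- ggplot(data, aes(x = factor(year))) +
  geom_bar(width = 0.4, fill = "darkorange") +
  # group = 1 joins the per-year points into a single line
  geom_line(data = casualties_by_year,
            aes(x = factor(year), y = casualties, group = 1)) +
  theme_minimal()
```

I left out scale_x_discrete(breaks = NULL) so the year labels stay visible; add it back if you want them hidden.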

How to delete observations in R based criterion that observations have same value?

I have the following data frame, from which I would like to remove observations based on three criteria: x=x, y=y and z>=60.
df <- data.frame(x=c(1,1,2,2,3,3,4,4),
y=c(2011,2012,2011,2011,2013,2014,2011,2012),
z=c(15,15,60,60,15,15,30,15))
> df
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 2 2011 60
5 3 2013 15
6 3 2014 15
7 4 2011 30
8 4 2012 15
The data frame I'm looking for is thus (which one of the x=2 observations is removed doesn't matter):
> df1
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 3 2013 15
5 3 2014 15
6 4 2011 30
7 4 2012 15
My first thoughts included using unique or duplicate, but I cannot seem to understand how to implement it in practice.
This should do the trick. Look for duplicated x and y entries where z is also greater than or equal to 60:
df[!(duplicated(df[,1:2]) & df$z >= 60), ]
# x y z
#1 1 2011 15
#2 1 2012 15
#3 2 2011 60
#5 3 2013 15
#6 3 2014 15
#7 4 2011 30
#8 4 2012 15
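To see why the one-liner works, it may help to split it into its two logical vectors. The names is_repeat and is_large are mine, chosen for illustration.

```r
df <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4),
                 y = c(2011, 2012, 2011, 2011, 2013, 2014, 2011, 2012),
                 z = c(15, 15, 60, 60, 15, 15, 30, 15))

# TRUE for a row whose (x, y) pair already appeared in an earlier row
is_repeat <- duplicated(df[, c("x", "y")])

# TRUE where z meets the threshold
is_large <- df$z >= 60

# Drop only the rows that satisfy both conditions
df1 <- df[!(is_repeat & is_large), ]
rownames(df1) <- NULL
```

Because duplicated() flags only the second and later occurrences, the first x=2, y=2011 row is the one that is kept.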

How to populate a matrix from a for loop in R

I keep getting a 'subscript out of bounds' error when I try to populate a matrix using a for loop that I have scripted below. My data are a large csv file that look similar to the following dummy dataset:
Sample k3 Year
1 B92028UUU 1 1990
2 B93001UUU 1 1993
3 B93005UUU 1 1993
4 B93006UUU 1 1993
5 B93010UUU 1 1993
6 B93011UUU 1 1994
7 B93022UUU 1 1994
8 B93035UUU 1 2014
9 B93036UUU 1 2014
10 B95015UUU 2 2013
11 B95016UUU 2 2013
12 B98027UUU 2 1990
13 B05005FUS 2 1990
14 B06006FIS 2 2001
15 B06010MUS 2 2001
16 B05023FUN 2 2001
17 B05024FUN 3 2001
18 B05025FIN 3 2001
19 B05034MMN 3 2002
20 B05037MMS 3 1996
21 B05041MUN 3 1996
22 B06047FUS 3 2007
23 B05048MUS 3 2000
24 B06059FUS 3 2000
25 B05063MUN 3 2000
My script is as follows:
Year.Matrix = matrix(1:75,nrow=25,byrow=T)
colnames(Year.Matrix)=c("Group 1","Group 2","Group 3")
rownames(Year.Matrix)=1990:2014
for(i in 1:3){
x=subset(data2,k3==i)
for(j in 1990:2014){
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j,i]=z
}
}
Not sure why I am getting the error message but from other posts I gather that the issue arises when I try to populate my matrix, and perhaps because I do not have an entry for each year from each of my k3 levels?
Any commentary would be helpful!
No need to use a loop here. You are just computing length by year and k3 columns:
library(data.table)
setDT(dat)[,.N,"Year,k3"]
Year k3 N
1: 1990 1 1
2: 1993 1 4
3: 1994 1 2
4: 2014 1 2
5: 2013 2 2
6: 1990 2 2
7: 2001 2 3
8: 2001 3 2
9: 2002 3 1
10: 1996 3 2
11: 2007 3 1
12: 2000 3 3
You can also use dplyr to do this. A dplyr solution would be the following:
library(dplyr)
dat %>%
group_by(Year, k3) %>%
summarize(N=n())
I'm not sure what you are trying to do, but as Hubert L said, the row index j must be a position in Year.Matrix (1, 2, 3, ..., 25). Because the loop runs over 1990:2014, j takes the values 1990, 1991, 1992, ..., 2014, which are far beyond the 25 rows of the matrix.
Adding print statements to your for loop makes this visible:
for(i in 1:3){
print(i)
x=subset(data2,k3==i)
for(j in 1990:2014){
print(j)
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j,i]=z
}
}
Keep using print statements to debug your code. Running this loop shows immediately that the first assignment is Year.Matrix[1990,1], which throws the 'subscript out of bounds' error.
Fix the for loop by offsetting the row index:
for(i in 1:3){
print(i)
x=subset(data2,k3==i)
for(j in 1990:2014){
print(j)
y=subset(x,Year==j)
z=nrow(y)
Year.Matrix[j-1990+1,i]=z
}
}
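Another way to avoid the offset arithmetic, sketched here in base R, is to index rows by name: the row names are the years, so as.character(j) selects the right row. The small data2 below is a made-up subset used only for illustration. The last lines show a loop-free equivalent with table(), where the factor levels make sure that years with no entries still get a row of zeros.

```r
# Hypothetical small subset of the data, for illustration only
data2 <- data.frame(k3   = c(1, 1, 2, 3, 3),
                    Year = c(1990, 1993, 1990, 2001, 2001))

years <- 1990:2014
Year.Matrix <- matrix(0L, nrow = length(years), ncol = 3,
                      dimnames = list(as.character(years),
                                      paste("Group", 1:3)))

for (i in 1:3) {
  for (j in years) {
    # A character index looks the row up by name instead of by position
    Year.Matrix[as.character(j), i] <- sum(data2$k3 == i & data2$Year == j)
  }
}

# Loop-free equivalent: cross-tabulate with fixed factor levels
tab <- table(factor(data2$Year, levels = years),
             factor(data2$k3, levels = 1:3))
```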

Merging data frames in R

Let's say I have two data frames. Each has a DAY, a MONTH, and a YEAR column along with one other variable, C and P, respectively. I want to merge the two data frames in two different ways. First, I merge by date:
test<-merge(data1,data2,by.x=c("DAY","MONTH","YEAR"),by.y=c("DAY","MONTH","YEAR"),all.x=T,all.y=F)
This works perfectly. The second merge is the one I'm having trouble with. Currently, I have merged the value for January 5, 1996 from data1 and the value for January 5, 1996 from data2 into one data frame, but now I would like to merge a third value onto each row of the new data frame. Specifically, I want to merge the value for January 4, 1996 from data2 with the two values from January 5, 1996. Any tips on getting merge to be flexible in this way?
sample data:
data1
C DAY MONTH YEAR
1 1 1 1996
6 5 1 1996
5 8 1 1996
3 11 1 1996
9 13 1 1996
2 14 1 1996
3 15 1 1996
4 17 1 1996
data2
P DAY MONTH YEAR
1 1 1 1996
4 2 1 1996
8 3 1 1996
2 4 1 1996
5 5 1 1996
2 6 1 1996
7 7 1 1996
4 8 1 1996
6 9 1 1996
1 10 1 1996
7 11 1 1996
3 12 1 1996
2 13 1 1996
2 14 1 1996
5 15 1 1996
9 16 1 1996
1 17 1 1996
Make a new column that is a Date type, not just separate day, month, and year integers. You can use as.Date() to do this, though you will need to look up the right value for the format= argument given your string. Let's call that column D1 in both data frames. Now do data2$D2 = data2$D1 + 1, so that D2 is the day after each data2 observation. The key point here is that Date types allow simple date arithmetic. Now just merge with by.x = "D1" and by.y = "D2", and each row picks up the data2 value from the previous day.
In case that was confusing, the bottom line is that you need to convert your columns to Date types so that you can do date arithmetic.
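A minimal sketch of that idea, using the first few rows of the sample data. The helper to_date and the names same_day, prev, and P_prev are mine. I build the dates with ISOdate() rather than a format string, which sidesteps the format lookup.

```r
data1 <- data.frame(C = c(1, 6, 5), DAY = c(1, 5, 8), MONTH = 1, YEAR = 1996)
data2 <- data.frame(P = c(1, 4, 8, 2, 5, 2, 7, 4), DAY = 1:8,
                    MONTH = 1, YEAR = 1996)

# Build a real Date column from the integer parts
to_date <- function(d) as.Date(ISOdate(d$YEAR, d$MONTH, d$DAY))
data1$D1 <- to_date(data1)
data2$D1 <- to_date(data2)

# First merge: same-day value of P
same_day <- merge(data1, data2[, c("D1", "P")], by = "D1", all.x = TRUE)

# Second merge: shift data2 forward one day, so each date lines up
# with the value observed the day before
prev <- data.frame(D1 = data2$D1 + 1, P_prev = data2$P)
result <- merge(same_day, prev, by = "D1", all.x = TRUE)
```

In result, the January 5 row carries C from data1, P from January 5, and P_prev from January 4. January 1 has no previous day in data2, so its P_prev is NA.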
