Creating stacked chart - r

I have two tables that store users' login attempts. One table contains all successful logins and the other contains failed attempts. I'm trying to create a stacked chart from the failed and successful login counts. This is what my tables look like:
Success_login Table:
User_ID  Site_Address  Login_Attempts
1        xxx.xxx.xxx   5
2        xxx.xxy.yyy   10
Fail_login Table:
User_ID  Site_Address  Login_Attempts
1        xxx.xxx.xxx   2
2        xxx.xxy.yyy   8
How do I use the Login_Attempts columns of those two tables to create a stacked chart so that I can highlight successful and failed attempts? I looked online and found this code:
# Stacked Bar Plot with Colors and Legend
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
However, it does not work, as my two tables have different numbers of records. I would appreciate it if you could guide me to a solution.
Thanks

Discussion
First you have to unify your data into a single table. This can be done with a kind of outer join, if you're familiar with SQL. See How to join (merge) data frames (inner, outer, left, right)?. The resulting NAs (for records which failed to join to the opposite table) must be replaced with zeroes in order for the final call to barplot() to work.
You must then derive a matrix in the format required by barplot() for producing stacked bar charts, which can be done pretty easily with a single call to matrix(). Taking care to set labels/titles/legends/colors correctly, you can get a nice stacked bar chart:
Code
# Example data: user 3 has only successful logins, user 4 only failed ones.
s <- data.frame(User_ID=c(1,2,3), Site_Address=c('xxx.xxx.xxx','xxx.xxy.yyy','xxx.yyy.zzz'), Login_Attempts=c(5,10,3))
f <- data.frame(User_ID=c(1,2,4), Site_Address=c('xxx.xxx.xxx','xxx.xxy.yyy','xxx.yyy.zzz'), Login_Attempts=c(2,8,4))
# Full outer join on user and site; rows without a match get NA.
all <- merge(s, f, by=c('User_ID','Site_Address'), suffixes=c('.successful','.failed'), all=TRUE)
# Replace the NAs from unmatched rows with zero counts.
all[is.na(all)] <- 0
# barplot() stacks the rows of a matrix: one row per outcome, one column per bar.
stackData <- matrix(c(all$Login_Attempts.failed, all$Login_Attempts.successful), 2, byrow=TRUE)
colnames(stackData) <- paste0(all$User_ID, '#', all$Site_Address)
rownames(stackData) <- c('failed','successful')
barplot(stackData, main='Successful and failed login attempts', xlab='User_ID#Site_Address', ylab='Login_Attempts', col=c('red','blue'), legend=rownames(stackData))
Resulting data
r> s;
  User_ID Site_Address Login_Attempts
1       1  xxx.xxx.xxx              5
2       2  xxx.xxy.yyy             10
3       3  xxx.yyy.zzz              3
r> f;
  User_ID Site_Address Login_Attempts
1       1  xxx.xxx.xxx              2
2       2  xxx.xxy.yyy              8
3       4  xxx.yyy.zzz              4
r> all;
  User_ID Site_Address Login_Attempts.successful Login_Attempts.failed
1       1  xxx.xxx.xxx                         5                     2
2       2  xxx.xxy.yyy                        10                     8
3       3  xxx.yyy.zzz                         3                     0
4       4  xxx.yyy.zzz                         0                     4
r> stackData;
           1#xxx.xxx.xxx 2#xxx.xxy.yyy 3#xxx.yyy.zzz 4#xxx.yyy.zzz
failed                 2             8             0             4
successful             5            10             3             0
References
How to join (merge) data frames (inner, outer, left, right)?
R: merge unequal dataframes and replace missing rows with 0
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
http://www.statmethods.net/graphs/bar.html
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/matrix.html
Edit: It's a little strange to create a one-bar stacked bar chart, but ok, here's how you can do it, using the above data (all) as a base:
barplot(matrix(c(sum(all$Login_Attempts.failed),sum(all$Login_Attempts.successful))),main='Successful and failed login attempts',ylab='Login_Attempts',col=c('red','blue'),legend=c('failed','successful'));
Edit: Yeah, the y-axis should really cover the stack completely by default, it's a weakness in the base graphics package that it doesn't. You can add ylim=c(0,1.2*sum(do.call(c,all[,3:4]))) as an argument to the barplot() call to force the y-axis to extend at least 20% beyond the high point of the stack. (It's unfortunate that you have to calculate that manually from the input data, but as I said, it's a weakness in the package.)
Also, with regard to my comment about the oneness of the bar, it's just more common for stacked bar charts to be used to compare multiple bars, rather than showing a single bar. (That's why my initial assumption was that you wanted a separate bar for each user/site.) Instead of a single stacked bar, normally you'd see a plain old bar chart showing the different data points side-by-side. But it really depends on your application, so do what works best for you.
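A minimal sketch of the ylim adjustment described above, applied to the multi-bar chart. It assumes the stackData matrix from the answer, rebuilt here so the snippet runs on its own. With one bar per user/site, the tallest stack is the largest column sum, so max(colSums(...)) gives a tighter limit than the grand total; the names tallest and yLimit are illustrative, and the factor 1.2 is the same 20% headroom.

```r
# Rebuild the example matrix from the answer (rows = outcome, columns = bars).
stackData <- matrix(c(2, 8, 0, 4,
                      5, 10, 3, 0),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("failed", "successful"),
                                    c("1#xxx.xxx.xxx", "2#xxx.xxy.yyy",
                                      "3#xxx.yyy.zzz", "4#xxx.yyy.zzz")))

# The height of each stacked bar is the sum of its column.
tallest <- max(colSums(stackData))

# Leave 20% headroom above the tallest stack.
yLimit <- c(0, 1.2 * tallest)

barplot(stackData, ylim = yLimit, col = c("red", "blue"),
        legend = rownames(stackData))
```

For the single-bar variant there is only one column, so the column sum and the grand total coincide.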

1. Try drawing, by hand, the stacked chart you are trying to create. Does it even make sense?
2. Once you are convinced that you know what your desired result should look like, create by hand the single data.frame or matrix that barplot needs to produce it. Remember to include special cases, e.g. a user who has only successful or only unsuccessful logins.
3. Figure out how to combine your input data.frames into the single data.frame from the previous step.
The result of step 2 is the reproducible example you need in order to ask a sensible question here.
Step 3 is what you are asking here, but it does not seem that you are sure what the intermediate result should look like.
Step 1 is about visualising the end product, and working back from there.

Related

R Question: How can I create a histogram with 2 variables against each other?

Okay, let me be as clear as I can in my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
             | =
             | =
Frequency of | =
Incidents    | = =
             | = = =
             | = = = = =
               - - - - - -
             Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
  Group.1  x
1 XXX1002 13
2 XXX1991 27
3 XXX1992  8
barplot(TAB$x, names.arg=TAB$Group.1 )
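As an alternative sketch, tapply() does the same summing and returns a result named by model, which barplot() then uses for the bar labels without names.arg. The per-model standard deviation is included only because the question mentions it; the names totals and spread are illustrative.

```r
data <- read.table(text = "Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4", header = TRUE)

# Sum incidents per model; the result is named by model.
totals <- tapply(data$Incidents, data$Model, sum)

# Standard deviation of incidents within each model.
spread <- tapply(data$Incidents, data$Model, sd)

# The names of `totals` become the x-axis labels.
barplot(totals, xlab = "Model", ylab = "Total incidents")
```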

R view lines based on specific values

I was wondering how I could view certain rows of data based on specific values, e.g. to inspect anomalies in results.
E.g. I have the following results from the command table(df$A)
     2      3      4      5      6     19
143914  52194  30856  10662   2901      1
I'm surprised by the 1 observation where df$A=19. How can I see this observation easily in the console without having to make a subset (x<-subset(df, df$A==19)) ?
Thanks in advance
If your goal is to just view the output in an interactive session, and you have no interest in storing that value, you can use [ to "interactively" subset and view the result:
df[df$A == 19, ]
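One caveat worth sketching: if column A contains NA, the comparison yields NA for those rows, and [ then returns an all-NA row alongside the real match. Wrapping the condition in which() avoids that. The small data frame below is made up for illustration.

```r
# Hypothetical data: one anomalous value and one missing value.
df <- data.frame(A = c(2, 3, 19, NA, 2),
                 B = c("a", "b", "c", "d", "e"))

# Plain logical indexing also returns a row of NAs for the missing value.
withNA <- df[df$A == 19, ]

# which() drops the NA comparisons, leaving only the true matches.
matches <- df[which(df$A == 19), ]
print(matches)
```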

R histogram from already summarized count

I have a really huge file, so I had to count the frequencies for the histogram outside of R.
I couldn't find the correct answer in existing threads. Everything I tried led to either a bar plot or a failure (R's errors wouldn't let me plot it as a histogram the way I tried).
file looks like (it's tab delimited):
freq cov
394104974 1
387288861 3
141169009 4
105488813 2
60039934 6
45109486 5
26318120 7
9691068 8
7532886 9
3973434 10
It has something like 3k lines.
How can I plot this with ggplot2 as a nice histogram? (cov column holds x axis values)
Cheers,
Irek

How to indicate factors in ggplot with horizontal line and Text

My data looks like this example:
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance1",
"Substance1","Substance2","Substance2","Substance2",
"Substance2","Substance1","Substance1"))
dataExample
   Time    Data1    Data2 Application
1     1 6.511573 5.385265  Substance1
2     2 5.870173 4.512775  Substance1
3     3 6.822132 5.109790  Substance1
4     4 5.940528 6.281412  Substance1
5     5 7.269394 4.680380  Substance2
6     6 6.122454 6.015899  Substance2
7     7 5.660429 6.113362  Substance2
8     8 6.649749 4.344978  Substance2
9     9 7.252656 4.764667  Substance1
10   10 7.204440 5.835590  Substance1
I would like to indicate at which time any Substance was applied that is different from dataExample$Application[1].
Here is how I currently get this plotted, but I assume there is a much easier way to do it with ggplot.
library(reshape2)
library(ggplot2)
plotDataExample<-function(DataFrame){
  # Long format: one row per Time/variable pair
  longDF<-melt(DataFrame,id.vars=c("Time","Application"))
  p=ggplot(longDF,aes(Time,value,color=variable))+geom_line()
  maxValue=max(longDF$value)
  minValue=min(longDF$value)
  # Horizontal marker line sits slightly above the highest data value
  yAppLine=maxValue+((maxValue-minValue)/20)
  xAppLine1=min(longDF$Time[which(longDF$Application!=longDF$Application[1])])
  xAppLine2=max(longDF$Time[which(longDF$Application!=longDF$Application[1])])
  lineData=data.frame(x=c(xAppLine1,xAppLine2),y=c(yAppLine,yAppLine))
  # Label is centred over the marker line
  xAppText=xAppLine1+(xAppLine2-xAppLine1)/2
  yAppText=yAppLine+((maxValue-minValue)/20)
  appText=longDF$Application[which(longDF$Application!=longDF$Application[1])[1]]
  textData=data.frame(x=xAppText,y=yAppText,appText=appText)
  p=p+geom_line(data=lineData,aes(x=x, y=y),color="black")
  p=p+geom_text(data=textData,aes(x=x,y=y,label = appText),color="black")
  return(p)
}
plotDataExample(dataExample)
Question:
Do you know a better way to get a similar result so that I could possibly indicate more than one factor (e.g. Substance3, Substance4 ...).
First, made new sample data to have more than 2 levels and twice repeated Substance2.
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance2",
"Substance2","Substance1","Substance1","Substance2",
"Substance2","Substance3","Substance3"))
Didn't make this as function to show each step.
Add new column groups to original data frame - this contains identifier for grouping of Applications - if substance changes then new group is formed.
dataExample$groups<-c(cumsum(c(1,tail(dataExample$Application,n=-1)!=head(dataExample$Application,n=-1))))
Convert to long format data for lines of data.
longDF<-melt(dataExample,id.vars=c("Time","Application","groups"))
Calculate positions for Substance identifiers. Used function ddply() from library plyr. For calculation only data that differs from first Application value are used (that's subset()). Then Application and groups are used for grouping of data. Calculated starting, middle and ending positions on x axis and y value taken as maximal value +0.3.
library(plyr)
lineData<-ddply(subset(dataExample,Application != dataExample$Application[1]),
.(Application,groups),
summarise,minT=min(Time),maxT=max(Time),
meanT=mean(Time),ypos=max(longDF$value)+0.3)
Now plot longDF data with ggplot() and geom_line() and add segments above plot with geom_segment() and text with annotate() using new data frame lineData.
ggplot(longDF,aes(Time,value,color=variable))+geom_line()+
geom_segment(data=lineData,aes(x=minT,xend=maxT,y=ypos,yend=ypos),inherit.aes=FALSE)+
annotate("text",x=lineData$meanT,y=lineData$ypos+0.1,label=lineData$Application)
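The run identifiers built with cumsum() in this answer can also be sketched with base R's rle(), which may be easier to read. This is only an illustration on the Application column of the sample data; the names runs, runStart and runEnd are made up here. Because Time is 1:10 in the sample data, the run positions coincide with Time values.

```r
# Application column from the answer's sample data.
application <- c("Substance1", "Substance1", "Substance2", "Substance2",
                 "Substance1", "Substance1", "Substance2", "Substance2",
                 "Substance3", "Substance3")

# rle() finds consecutive runs of equal values.
runs <- rle(application)

# Number the runs and repeat each number over the length of its run.
groups <- rep(seq_along(runs$lengths), runs$lengths)

# Start and end positions of each run (indices into the vector).
runEnd <- cumsum(runs$lengths)
runStart <- runEnd - runs$lengths + 1
data.frame(Application = runs$values, start = runStart, end = runEnd)
```

The groups vector matches what the cumsum() expression produces, so it could be assigned to dataExample$groups in the same way.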

Lines between certain points in a plot, based on the data? (with R)

I have done my research and googling but have yet to find a solution to the following problem. I have quite often found solutions to R-related issues from this forum, so I thought I'd give it a try and hope that somebody can suggest something. I would need it for my PhD thesis; anybody whose code or suggestions I use will naturally be acknowledged and credited.
So: I need to draw lines/segments to connect points in a plot (of multidimensional scaling, specifically) in R (SPSS-based solutions are welcome as well) - but not between all points, just those that represent properties/variables that at least one data item shares - the placement of the lines should be based on the data that the plot in question is based on itself. Let me exemplify; below are some fictional data with dummy variables, where '1' means that the item has the property:
          "properties"
           a  b  c
"items"   ---------
tree    |  1  1  0
house   |  0  1  1
hut     |  0  1  1
book    |  1  0  0
The plot is a multidimensional scaling plot (distances are to be interpreted as dissimilarities). This is the logic:
there's a line between A and B, because there is at least one item/variable ("tree") in the data that has both properties;
there is a line between B and C, because there is at least one item in the data ("house" and "hut") that has both properties;
there is an item ("book") that has only one property (A), so it does not affect the placement of the lines;
importantly, there is no line between A and C because there are no items in the data that have both properties.
What I am looking for is a way to add the grey lines automatically/computationally that I have for now drawn manually on the plot above. The automatic drawing should be based on the data as described above. With a small data set, drawing the lines manually is no problem, but becomes a problem when there are tens of such "properties" and hundreds of items/rows of data.
Any ideas? Some R code (commented if possible) would be most welcome!
EDIT: It seems I forgot something very important. First thing, the solution proposed by #GaborCsardi below works perfectly with the example data, thanks for that! But I forgot to include that the linking of the points should also be "conservative", with as few connecting lines as possible. For example, if there is an item that has all the "properties", then it should not create lines between every single property point in the plot just because of that, if the points are connected by other items already, even if indirectly. So a plot based on the following data should not be a full triangle, even though item1 has all three properties:
       A  B  C
item1  1  1  1
item2  1  1  0
item3  0  1  1
Instead, A,B and B,C should be connected by a line, but a line between A and C would be excessive, as they are already indirectly connected (through B). Could this be done with incidence graphs?
This is very easy if you use graphs, and create the projection of the bipartite graph that you have in your table. E.g.
library(igraph)
## Some example data
mat <- " properties
items a b c
tree 1 1 0
house 0 1 1
hut 0 1 1
book 1 0 0
"
tab <- read.table(textConnection(mat), skip=1,
header=TRUE, row.names=1)
## Create a bipartite graph
graph <- graph.incidence(as.matrix(tab))
## Project the bipartite graph
proj <- bipartite.projection(graph)
## Plot one of the projections, the one you need
## happens to be the second one
plot(proj$proj2)
## Minimum spanning tree of the projection
plot(minimum.spanning.tree(proj$proj2))
For more information see the manual pages, i.e. ?"igraph-package", ?graph.incidence, ?bipartite.projection and ?plot.igraph.
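To see what the projection amounts to, here is a rough base-R sketch that does not use igraph. It assumes the item-by-property table from the question, entered as a 0/1 matrix; the names shared, pairs and edges are illustrative. It only lists which property pairs share at least one item; it does not perform the minimum spanning tree step or draw anything.

```r
# Item-by-property indicator matrix from the question.
tab <- matrix(c(1, 1, 0,
                0, 1, 1,
                0, 1, 1,
                1, 0, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("tree", "house", "hut", "book"),
                              c("a", "b", "c")))

# t(tab) %*% tab counts, for each pair of properties, the items having both.
shared <- crossprod(tab)

# Keep each unordered pair once and drop the diagonal.
pairs <- which(upper.tri(shared) & shared > 0, arr.ind = TRUE)
edges <- data.frame(from = rownames(shared)[pairs[, "row"]],
                    to = colnames(shared)[pairs[, "col"]],
                    items = shared[pairs])
print(edges)
```

Each row of edges is a candidate line between two property points; drawing them on the scaling plot would additionally need the point coordinates, which are not part of the data shown here.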
