R histogram from already summarized counts

I have a really huge file, so I had to count the frequencies for the histogram outside of R.
I couldn't find the right answer in existing threads. Everything I tried gave me either a bar plot or a failure (R's errors stopped me from plotting it as a histogram in the ways I tried).
The file looks like this (it's tab-delimited):
freq cov
394104974 1
387288861 3
141169009 4
105488813 2
60039934 6
45109486 5
26318120 7
9691068 8
7532886 9
3973434 10
It has about 3,000 lines.
How can I plot this with ggplot2 as a nice histogram? (cov column holds x axis values)
Cheers,
Irek
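A minimal sketch of one way to do this, assuming the counts have already been read into a data frame with the columns freq and cov shown above (the data frame name counts is made up here, and only the ten rows from the question are used). Because the counting is already done, the bar heights can be mapped directly with geom_col(); alternatively, geom_histogram() accepts a weight aesthetic, so each cov value is counted freq times.

```r
library(ggplot2)

# The ten rows shown in the question; the real file would be read with
# something like read.delim() (the file name is not given in the question).
counts <- data.frame(
  freq = c(394104974, 387288861, 141169009, 105488813, 60039934,
           45109486, 26318120, 9691068, 7532886, 3973434),
  cov  = c(1, 3, 4, 2, 6, 5, 7, 8, 9, 10)
)

# Option 1: bars of the precomputed heights; width = 1 makes neighbouring
# bars touch, so the result looks like a histogram.
p_col <- ggplot(counts, aes(x = cov, y = freq)) +
  geom_col(width = 1)

# Option 2: let the histogram stat do the binning, weighting each cov
# value by its precomputed frequency.
p_hist <- ggplot(counts, aes(x = cov, weight = freq)) +
  geom_histogram(binwidth = 1)

print(p_col)
```

Option 2 is the one to reach for if the cov values should be grouped into wider bins, since binwidth can then be changed without recounting.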

Related

Trying to make a graph with multiple lines using ggplot

I am new to R and I have been trying to make a line graph with multiple lines. I tried the 'plot' function but didn't get the desired result, so I am now trying ggplot.
I keep running into error:
Aesthetics must be either length 1 or the same as the data (100): x
and there's obviously no graph output.
Any help is much appreciated
I have rearranged my data, before it had 4 separate columns for different consumer types but now I have merged them and made a column that identifies each consumer.
This is the part of the code that generates the error
ggplot(data=consumers,aes(x=scenarios,y=unitary.bill)) +
geom_line(aes(color=consumer.type,group=consumer.type))
my data looks like this:
scenario unitary.bill consumer.type
1 1 0.076536835 net.cons
2 2 0.075835361 net.cons
3 3 0.076696548 net.cons
4 4 0.076431602 net.cons
5 5 0.076816135 net.cons
.........
27 2 0.076794287 smart.cons
28 3 0.075555555 smart.cons
29 4 0.077126955 smart.cons
30 5 0.077925161 smart.cons
.......
100 25 0.049247761 smart.pros
I expect the line graph to have four different colors (one for each consumer type) and the scenarios on the x-axis.
Thanks for all the help from Camille and Infominer. My code now looks like this (I added some more details)
ggplot(data = consumers, aes(x = scenarios, y = unitary.bill, colour = SMCs)) +
  geom_line(size = 1) +
  scale_colour_manual(values = c("indianred1", "yellowgreen", "lightpink", "springgreen4")) +
  ggtitle(" Unitary bill for each SMC type at the end of the scenario runs") +
  scale_x_continuous(breaks = 1:25)
and the graph looks the way I wanted. However, I would like to put some more distance between the title and the graph to make it prettier.
you can view the graph here
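A hedged sketch covering both points in this thread. It assumes the original error came from aes() referring to scenarios while the printed data frame has a column called scenario; if a separate object named scenarios with a different length exists in the workspace, ggplot picks that up and reports the length mismatch. The data below are made up (random bills, and the fourth consumer type name net.pros is a guess). The extra space under the title is added through the bottom margin of plot.title.

```r
library(ggplot2)

# Made-up stand-in for the asker's data: 4 consumer types x 25 scenarios.
# "net.pros" is a hypothetical name; the question only shows three types.
set.seed(1)
consumers <- data.frame(
  scenario      = rep(1:25, times = 4),
  unitary.bill  = runif(100, 0.045, 0.08),
  consumer.type = rep(c("net.cons", "smart.cons", "net.pros", "smart.pros"),
                      each = 25)
)

p <- ggplot(consumers,
            aes(x = scenario, y = unitary.bill,
                colour = consumer.type, group = consumer.type)) +
  geom_line() +
  scale_x_continuous(breaks = 1:25) +
  ggtitle("Unitary bill per consumer type") +
  # b = 20 adds 20 pt of space below the title, pushing the panel down
  theme(plot.title = element_text(margin = margin(b = 20)))

print(p)
```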

Barplots Not Plotting Frequency Correctly in R

Struggling to plot a barplot where the height of each bar is related to the value in a column (in this case, freq). Table name is Tag_Count_2
Tag.1 freq
Hello 3
My 4
Name 10
I tried:
Counts<-table(Tag_Count_2$freq)
barplot(Counts)
But it plots the # of times values under Tag.1 share the same freq. I also tried:
barplot(Tag_Count_2, height=freq)
But it said freq wasn't found.
Are you looking for this?
barplot(Tag_Count_2$freq)
You don't need the table command in this case, as you already have the frequencies.
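As a small self-contained sketch of that answer, with the three rows from the question typed in by hand: passing names.arg as well puts the Tag.1 labels under the bars, which the bare call above does not do.

```r
# The three rows shown in the question
Tag_Count_2 <- data.frame(
  Tag.1 = c("Hello", "My", "Name"),
  freq  = c(3, 4, 10)
)

# Bar heights come straight from freq; names.arg labels each bar.
# barplot() returns the x positions of the bar midpoints.
mids <- barplot(Tag_Count_2$freq,
                names.arg = as.character(Tag_Count_2$Tag.1),
                ylab = "freq")
```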

Creating stacked chart

I have two tables that store the login attempts of users. One table contains all successful logins and the other contains the failed attempts. I'm trying to create a stacked chart from the failed and successful login counts. This is what my tables look like:
Success_login Table:
User_ID Site_Address Login_Attempts
1 xxx.xxx.xxx 5
2 xxx.xxy.yyy 10
Fail_login Table:
User_ID Site_Address Login_Attempts
1 xxx.xxx.xxx 2
2 xxx.xxy.yyy 8
How do I use the Login_Attempts columns of those two tables to create a stacked chart that highlights successful and failed attempts? I looked online and found this code:
# Stacked Bar Plot with Colors and Legend
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
However, it does not work, because my two tables have different numbers of records. I would appreciate it if you could guide me to a solution.
Thanks
Discussion
First you have to unify your data into a single table. This can be done with a kind of outer join, if you're familiar with SQL. See How to join (merge) data frames (inner, outer, left, right)?. The resulting NAs (for records which failed to join to the opposite table) must be replaced with zeroes in order for the final call to barplot() to work.
You must then derive a matrix in the format required by barplot() for producing stacked bar charts, which can be done pretty easily with a single call to matrix(). Taking care to set labels/titles/legends/colors correctly, you can get a nice stacked bar chart:
Code
s <- data.frame(User_ID = c(1, 2, 3),
                Site_Address = c('xxx.xxx.xxx', 'xxx.xxy.yyy', 'xxx.yyy.zzz'),
                Login_Attempts = c(5, 10, 3))
f <- data.frame(User_ID = c(1, 2, 4),
                Site_Address = c('xxx.xxx.xxx', 'xxx.xxy.yyy', 'xxx.yyy.zzz'),
                Login_Attempts = c(2, 8, 4))
all <- merge(s, f, by = c('User_ID', 'Site_Address'),
             suffixes = c('.successful', '.failed'), all = TRUE)
all[is.na(all)] <- 0
stackData <- matrix(c(all$Login_Attempts.failed, all$Login_Attempts.successful), 2, byrow = TRUE)
colnames(stackData) <- paste0(all$User_ID, '#', all$Site_Address)
rownames(stackData) <- c('failed', 'successful')
barplot(stackData, main = 'Successful and failed login attempts',
        xlab = 'User_ID#Site_Address', ylab = 'Login_Attempts',
        col = c('red', 'blue'), legend = rownames(stackData))
Resulting data
r> s;
User_ID Site_Address Login_Attempts
1 1 xxx.xxx.xxx 5
2 2 xxx.xxy.yyy 10
3 3 xxx.yyy.zzz 3
r> f;
User_ID Site_Address Login_Attempts
1 1 xxx.xxx.xxx 2
2 2 xxx.xxy.yyy 8
3 4 xxx.yyy.zzz 4
r> all;
User_ID Site_Address Login_Attempts.successful Login_Attempts.failed
1 1 xxx.xxx.xxx 5 2
2 2 xxx.xxy.yyy 10 8
3 3 xxx.yyy.zzz 3 0
4 4 xxx.yyy.zzz 0 4
r> stackData;
1#xxx.xxx.xxx 2#xxx.xxy.yyy 3#xxx.yyy.zzz 4#xxx.yyy.zzz
failed 2 8 0 4
successful 5 10 3 0
Output
References
How to join (merge) data frames (inner, outer, left, right)?
R: merge unequal dataframes and replace missing rows with 0
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
http://www.statmethods.net/graphs/bar.html
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/matrix.html
Edit: It's a little strange to create a one-bar stacked bar chart, but ok, here's how you can do it, using the above data (all) as a base:
barplot(matrix(c(sum(all$Login_Attempts.failed),sum(all$Login_Attempts.successful))),main='Successful and failed login attempts',ylab='Login_Attempts',col=c('red','blue'),legend=c('failed','successful'));
Edit: Yeah, the y-axis should really cover the stack completely by default, it's a weakness in the base graphics package that it doesn't. You can add ylim=c(0,1.2*sum(do.call(c,all[,3:4]))) as an argument to the barplot() call to force the y-axis to extend at least 20% beyond the high point of the stack. (It's unfortunate that you have to calculate that manually from the input data, but as I said, it's a weakness in the package.)
Also, with regard to my comment about the oneness of the bar, it's just more common for stacked bar charts to be used to compare multiple bars, rather than showing a single bar. (That's why my initial assumption was that you wanted a separate bar for each user/site.) Instead of a single stacked bar, normally you'd see a plain old bar chart showing the different data points side-by-side. But it really depends on your application, so do what works best for you.
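To make the single-bar variant and the ylim suggestion concrete, here is a minimal sketch. It types in the failed and successful counts from the merged table by hand (the vector names failed and successful are just for illustration) and extends the y-axis 20% beyond the top of the stack.

```r
# Counts per user/site taken from the merged table above
failed     <- c(2, 8, 0, 4)
successful <- c(5, 10, 3, 0)

# One column with two rows gives a single stacked bar
heights <- matrix(c(sum(failed), sum(successful)))

# Upper axis limit: 20% above the total height of the stack
top <- 1.2 * sum(heights)

barplot(heights,
        main = 'Successful and failed login attempts',
        ylab = 'Login_Attempts',
        ylim = c(0, top),
        col = c('red', 'blue'),
        legend.text = c('failed', 'successful'))
```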
1. Try drawing, by hand, the stacked chart you are trying to create. Does it even make sense?
2. Once you are convinced that you know what the desired result should look like, create, by hand, the single data.frame or matrix that barplot needs to produce it. Remember to include special cases, e.g. where a user has only successful or only unsuccessful logins.
3. Figure out how to combine your input data.frames into the single data.frame from the previous step.
The result of step 2 is your reproducible example you need in order to ask a sensible question here.
Step 3 is what you are asking here, but it does not seem you are sure what the intermediate result should look like.
Step 1 is about visualising the end product, and working back from there.

Excel: Select data for graph

To put it simply, I have three columns in Excel like the ones below:
Vehicle x y
1 10 10
1 15 12
1 12 9
2 8 7
2 11 6
3 7 12
x and y are the coordinates of customers assigned to the corresponding vehicle. This file is the output of a program I run in advance. The list will always be sorted by vehicle, but the number of customers assigned to vehicle "k" may change from one experiment to the next.
I would like to plot a graph containing 3 series, one for each vehicle, where the customers of each vehicle would appear (as dots in 2D based on their x- and y- values) in different color.
In my real file, I have 12 vehicles and 3200 customers, and the ranges change from one experiment to the next, so I would like to automate the process, i.e. copy-paste the list into Excel and see the graph appear automatically (if this is possible).
Thanks in advance for your time and effort.
EDIT: There is a similar post here: Use formulas to select chart data but requires the use of VB. Moreover, I am not sure whether it has been indeed answered.
You should try this free online tool: www.cloudyexcel.com/excel-to-graph/

How to indicate factors in ggplot with horizontal line and Text

My data looks like this example:
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance1",
"Substance1","Substance2","Substance2","Substance2",
"Substance2","Substance1","Substance1"))
dataExample
Time Data1 Data2 Application
1 1 6.511573 5.385265 Substance1
2 2 5.870173 4.512775 Substance1
3 3 6.822132 5.109790 Substance1
4 4 5.940528 6.281412 Substance1
5 5 7.269394 4.680380 Substance2
6 6 6.122454 6.015899 Substance2
7 7 5.660429 6.113362 Substance2
8 8 6.649749 4.344978 Substance2
9 9 7.252656 4.764667 Substance1
10 10 7.204440 5.835590 Substance1
I would like to indicate at which time any Substance was applied that is different from dataExample$Application[1].
Here is how I currently get this plotted, but I assume there is a much easier way to do it with ggplot.
library(reshape2)
library(ggplot2)
plotDataExample<-function(DataFrame){
  longDF<-melt(DataFrame,id.vars=c("Time","Application"))
  p=ggplot(longDF,aes(Time,value,color=variable))+geom_line()
  maxValue=max(longDF$value)
  minValue=min(longDF$value)
  yAppLine=maxValue+((maxValue-minValue)/20)
  xAppLine1=min(longDF$Time[which(longDF$Application!=longDF$Application[1])])
  xAppLine2=max(longDF$Time[which(longDF$Application!=longDF$Application[1])])
  lineData=data.frame(x=c(xAppLine1,xAppLine2),y=c(yAppLine,yAppLine))
  xAppText=xAppLine1+(xAppLine2-xAppLine1)/2
  yAppText=yAppLine+((maxValue-minValue)/20)
  appText=longDF$Application[which(longDF$Application!=longDF$Application[1])[1]]
  textData=data.frame(x=xAppText,y=yAppText,appText=appText)
  p=p+geom_line(data=lineData,aes(x=x, y=y),color="black")
  p=p+geom_text(data=textData,aes(x=x,y=y,label = appText),color="black")
  return(p)
}
plotDataExample(dataExample)
Question:
Do you know a better way to get a similar result, so that I could indicate more than one factor (e.g. Substance3, Substance4 ...)?
First, I made new sample data with more than two levels and with Substance2 appearing in two separate runs.
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance2",
"Substance2","Substance1","Substance1","Substance2",
"Substance2","Substance3","Substance3"))
I didn't wrap this in a function, so that each step is visible.
Add a new column, groups, to the original data frame. It holds an identifier for each run of Applications: whenever the substance changes, a new group starts.
dataExample$groups<-c(cumsum(c(1,tail(dataExample$Application,n=-1)!=head(dataExample$Application,n=-1))))
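To see what that one-liner does, here is a minimal sketch on the Application values alone (the vector name app is just for illustration). Each element is compared with the one before it, the first element always starts a group, and cumsum() turns the "changed" flags into a running group number. rle() gives the same ids and is shown as a cross-check.

```r
app <- c("Substance1", "Substance1", "Substance2", "Substance2",
         "Substance1", "Substance1", "Substance2", "Substance2",
         "Substance3", "Substance3")

# TRUE wherever the value differs from the previous one;
# the first element always opens a new group
changed <- c(TRUE, app[-1] != app[-length(app)])

# Running count of changes = group id of each run
groups <- cumsum(changed)

# Same ids via run-length encoding
r <- rle(app)
groups_rle <- rep(seq_along(r$lengths), times = r$lengths)

print(groups)
```

Note that the two Substance2 runs get different ids (2 and 4), which is what lets them be drawn as separate segments later.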
Convert to long format data for lines of data.
longDF<-melt(dataExample,id.vars=c("Time","Application","groups"))
Calculate the positions of the Substance identifiers, using ddply() from the plyr library. Only rows whose Application differs from the first Application value are used (that's the subset() call). Application and groups are then used to group the data. For each group, the starting, middle and ending positions on the x axis are calculated, and the y position is set to the maximal value + 0.3.
library(plyr)
lineData<-ddply(subset(dataExample,Application != dataExample$Application[1]),
.(Application,groups),
summarise,minT=min(Time),maxT=max(Time),
meanT=mean(Time),ypos=max(longDF$value)+0.3)
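For readers without plyr, a rough base-R equivalent of this summarising step might look like the sketch below. It is not the answer's method, just the same per-run min/max/mean computed with split() and lapply(). The Data1/Data2 columns and ypos are left out because they depend on random values, and the name applied is made up for the subset.

```r
dataExample <- data.frame(
  Time = 1:10,
  Application = c("Substance1", "Substance1", "Substance2", "Substance2",
                  "Substance1", "Substance1", "Substance2", "Substance2",
                  "Substance3", "Substance3"),
  stringsAsFactors = FALSE
)

# Run ids: a new group starts whenever Application changes
app <- dataExample$Application
dataExample$groups <- cumsum(c(TRUE, app[-1] != app[-length(app)]))

# Keep only rows whose Application differs from the first one
applied <- dataExample[dataExample$Application != dataExample$Application[1], ]

# One summary row per run: start, end and midpoint on the Time axis
lineData <- do.call(rbind, lapply(split(applied, applied$groups), function(run) {
  data.frame(Application = run$Application[1],
             groups = run$groups[1],
             minT = min(run$Time),
             maxT = max(run$Time),
             meanT = mean(run$Time))
}))

print(lineData)
```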
Now plot longDF data with ggplot() and geom_line() and add segments above plot with geom_segment() and text with annotate() using new data frame lineData.
ggplot(longDF,aes(Time,value,color=variable))+geom_line()+
geom_segment(data=lineData,aes(x=minT,xend=maxT,y=ypos,yend=ypos),inherit.aes=FALSE)+
annotate("text",x=lineData$meanT,y=lineData$ypos+0.1,label=lineData$Application)

Resources