How to generate a histogram with many values in R

I have a txt file called values.txt that has 29172 numbers in a single column. I wish to make a histogram from this by entering the commands below:
val = read.table("values.txt", col.name = c("col1"))
hist(val$col1)
But this gives a weird histogram like the below. What's the matter with the code?

Check what's the actual range of your values via
summary(val$col1)
If the histogram shows a seemingly empty range (as in your example, where the x-axis extends to -15000), you typically have outliers in that range. By default the x-axis covers the full range of the values, so I would expect summary to report a minimum somewhere around -15000.
You can pass xlim = c(-500, 1000) to hist() to "zoom in". You may also want to pass breaks = 500 to get narrower bins; note that a single number for breaks is only a suggested number of bins, which hist() may adjust.
Alternatively, I suggest working with ggplot2's geom_histogram:
library(ggplot2)
ggplot(val) + geom_histogram(aes(x = col1))
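A minimal sketch of the effect described above, using simulated data rather than the asker's values.txt (the numbers, the seed, and the names x and core are assumptions for illustration only). A handful of extreme negative values stretches the default x-axis; either restricting the data or restricting the axis brings the bulk of the distribution back into view.

```r
set.seed(1)
# Simulated stand-in for val$col1: mostly moderate values plus a few outliers
x <- c(rnorm(29000, mean = 200, sd = 150), rep(-15000, 5))

summary(x)  # the minimum reveals the outliers

# Default histogram: the axis spans the full range, so the bulk is squeezed
h_all <- hist(x, main = "All values")

# Option 1: drop the outliers before plotting, then ask for finer bins
core <- x[x > -500 & x < 1000]
h_core <- hist(core, breaks = 50, main = "Values between -500 and 1000")

# Option 2: keep all data but zoom the axis; bins are still computed over
# the full range, so many breaks are needed for narrow bins in the visible part
hist(x, breaks = 500, xlim = c(-500, 1000), main = "Zoomed with xlim")
```

Option 1 changes which values are counted, while option 2 only changes what is visible, so the choice depends on whether the outliers are real data or errors.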

Automated labeling of logarithmic plots

I would like to automate the graph axis values for a series of plots in Stata 13.
In particular, I would like to show axis labels like 10^-1, 0, 10^1, 10^2 etc. by feeding the axis options macros.
The solution in the following blog post gives a good starting point:
Labeling logarithmic axes in Stata graphs
I could also produce nicer labels by using "10{sup:`x'}".
However, I cannot find the solution to the following additional elements:
The axis labels would need to run from 10^-10 up to 10^10. Moreover, my variable is a natural logarithm (ln), so the values corresponding to powers of 10 are 2.3, 4.6, etc. The problem is the line below, which only takes integers as input:
label define expon `vallabel', replace
I would like to force the range of the axis values for each graph (e.g. a particular axis runs from 10^-2 to 10^5). I understand that range() only extends axes but does not allow trimming them.
Any ideas on either or both of the above?
This is a very straightforward output in R or Python, even standard without many additional arguments, but unfortunately not so in Stata.
Q1. You should check out niceloglabels from SSC as announced in this thread. niceloglabels together with other tricks and devices in this territory will be discussed in a column expected to appear in Stata Journal 18(1) in the first quarter of 2018.
Value labels are limited to association with integers but that does not bite here. All you need focus on is the text which is to be displayed as axis labels at designated points on either axis; such points can be specified using any numeric value within the axis range.
Your specific problem appears to be that one of your variables is a natural logarithm but you wish to label axes in terms of powers of 10. Conversion to a variable containing logarithms to base 10 is surely easy, but another program mylabels (SSC) can help here. This is a self-contained example.
* ssc inst mylabels
sysuse auto, clear
set scheme s1color
gen lnprice = ln(price)
mylabels 4000 8000 16000, myscale(ln(#)) local(yla)
gen lnweight = ln(weight)
mylabels 2 3 4, myscale(ln(1000*#)) suffix(" x 10{sup:3}") local(xla)
scatter lnprice lnweight, yla(`yla') xla(`xla') ms(Oh) ytitle(Price (USD)) xtitle(Weight (lb))
I have used different styles for the two axes just to show what is possible. On other grounds it is usually preferable to be consistent about style.
Broadly speaking, the use of niceloglabels is often simpler as it boils down to specifying xscale(log) or yscale(log) with the labels you want to see. niceloglabels also looks at a variable range or a specified minimum and maximum to suggest what labels might be used.
Q2. range() is an option with twoway function that allows extension of the x axis range. For most graph commands, the pertinent options are xscale() or yscale(), which again extend axis ranges if suitably specified. None of these options will omit data as a side-effect or reduce axis ranges as compared with what axis label options imply. If you want to omit data you need to use if or in or clone your variables and replace values you don't want to show with missing values in the clone.
FWIW, superscripts look much better to me than ^ for powers.
I have finally found a clunky but working solution.
The trick is to first generate 2 locals: one to evaluate the axis value, another to denote the axis label. Then combine both into a new local.
Somehow I need to do this separately for positive and negative values.
I'm sure this can be improved...
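// define macros
// a: axis position on the ln scale, b: label text, l: position plus label
// ln(10) is used instead of the rounded 2.3 so labels sit exactly at powers of 10
forvalues i = 0(1)10 {
local a`i' = `i'*ln(10)
local b`i' `" "10{sup:`i'}" "'
local l`i' `a`i'' `"`b`i''"'
}
// same again for negative exponents
forvalues i = 1(1)10 {
local am`i' = -`i'*ln(10)
local bm`i' `" "10{sup:-`i'}" "'
local lm`i' `am`i'' `"`bm`i''"'
}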
// graph
hist lnx, ///
xl(`lm4' `lm3' `lm2' `lm1' `l0' `l1' `l2' `l3' `l4')

ggplot2: Why do symbol sizes differ when 'size' is included inside vs outside the aes statement?

I have created quite a few maps using base-R but I am now trying to perform similar tasks using ggplot2 due to the ease by which multiple plots can be arranged on a single page. Basically, I am plotting the locations at which samples of a particular species of interest have been collected and want the symbol size to reflect the total weight of the species collected at that location. Creating the base map and various layers has not been an issue but I'm having trouble getting the symbol sizes and associated legend the way I want them.
The problem is demonstrated in the workable example below. When I include 'size' outside of aes, the symbol sizes appear to be scaled appropriately (plot1). But when I put 'size' inside the aes statement (in order to get a legend) the symbol sizes are no longer correct (plot2). It looks like ggplot2 has rescaled the data. This should be a simple task so I am clearly missing something very basic. Any help understanding this would be appreciated.
library(ggplot2)
#create a very simple dataset that includes locations and total weight of samples collected from each site
catch.data<-data.frame(long=c(-50,-52.5,-52,-54,-53.8,-52),
lat=c(48,54,54,55,52,50),
wt=c(2,38,3,4,25,122))
#including 'size' outside of aes results in no legend
#but the symbol sizes are represented correctly
plot1<-ggplot(catch.data,aes(x=long,y=lat)) +
geom_point(size=catch.data$wt,colour="white",fill="blue",shape=21)
#including 'size' within aes appears necessary in order to create a legend
#but the symbol sizes are not represented correctly
plot2<-ggplot(catch.data,aes(x=long,y=lat)) +
geom_point(aes(size=catch.data$wt),colour="white",fill="blue",shape=21)
First, you shouldn't reference the data frame name inside aes(); it messes up the legend. The correct version is
plot3 <- ggplot(catch.data,aes(x=long,y=lat)) +
geom_point(aes(size=wt),colour="white",fill="blue",shape=21)
Now, to control how large the symbols are drawn, adjust the range argument of scale_size_continuous, which sets the smallest and largest point sizes, e.g.
plot3 + scale_size_continuous(range = range(catch.data$wt) / 5)
Change it a few times and see which one works for you. Please note that there exists a common visualization pitfall of representing numbers as areas (google e.g. "why pie charts are bad").
Edit: answering the comment below, you could introduce a fixed scaling with, e.g.,
scale_size_continuous(limits = c(1, 200), range = c(1, 20))
Any value inside aes() is mapped from a variable in the data and passed through a scale (here the size scale, which rescales wt to its output range), whereas a value given outside aes() is used as-is. That is why the symbol sizes differ between plot1 and plot2.
Refer to "Difference between passing options in aes() and outside of it in ggplot2".
Also see the documentation: http://ggplot2.tidyverse.org/reference/aes.html
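To see the difference between setting and mapping in numbers, here is a small sketch that reuses the question's data and inspects the sizes ggplot2 actually draws via layer_data(). The names p_set, p_map, size_set and size_map are introduced here for illustration, and the sketch assumes the default size scale is in effect.

```r
library(ggplot2)

catch.data <- data.frame(long = c(-50, -52.5, -52, -54, -53.8, -52),
                         lat  = c(48, 54, 54, 55, 52, 50),
                         wt   = c(2, 38, 3, 4, 25, 122))

# size set outside aes(): the values are used literally as point sizes
p_set <- ggplot(catch.data, aes(x = long, y = lat)) +
  geom_point(size = catch.data$wt, shape = 21)

# size mapped inside aes(): the values pass through the size scale
p_map <- ggplot(catch.data, aes(x = long, y = lat)) +
  geom_point(aes(size = wt), shape = 21)

# layer_data() returns the computed data for a layer, including point sizes
size_set <- layer_data(p_set)$size
size_map <- layer_data(p_map)$size

range(size_set)  # same as range(catch.data$wt)
range(size_map)  # the scale's output range, not the raw weights
```

Comparing the two ranges shows that the mapped version compresses the weights into the scale's output range, which is why changing range (and optionally limits) in scale_size_continuous is the way to control the drawn sizes while keeping the legend.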

Knitr renders graphic although it shouldn't

I am using RStudio version 1.0.136, and I am trying to understand why knitr renders the histograms produced by the command below. Any help is appreciated.
min_ct<-as.numeric(min(hist(myfdata[myfdata$slope>low & myfdata$slope<up, ]$dy, breaks = bi)$counts))
Screenshot of 4 rendered graphics, which are not explicitly generated.
It's not a knitr issue. Calling hist draws a histogram even if you assign the output to a variable. In the console, try x = hist(rnorm(100)). What gets saved to the variable is a list with the data used to generate the histogram, but the histogram is still drawn.
To create bins without drawing a histogram, use the cut function to create the bins, then use table to count the number of values in each bin. For example, table(cut(rnorm(100), breaks=seq(-3,3,0.5))).
cut has options that affect how it assigns bins, so take a look at the help (?cut) for more info. In particular, take note of the right and include.lowest arguments.
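A short sketch of both routes on simulated data (the seed, the break points and the names h, tab and min_ct are assumptions for illustration). Besides cut and table, hist itself accepts plot = FALSE, which returns the same list of bin data without drawing anything.

```r
set.seed(42)
x <- rnorm(100)
brk <- seq(-5, 5, by = 0.5)

# hist() with plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = brk, plot = FALSE)
min_ct <- min(h$counts)

# The same counts via cut() + table(); right and include.lowest are set
# explicitly to match hist()'s defaults
tab <- table(cut(x, breaks = brk, right = TRUE, include.lowest = TRUE))
```

With plot = FALSE the original one-liner can stay almost unchanged, while the cut/table route is useful when the bin labels themselves are needed.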

Gnuplot: How to remove vectors below a certain magnitude from vector field?

I have a 2D CFD code that gives me the x and y flow velocity at every point on a grid. I am currently visualizing the data using a vector field in gnuplot. My goal is to see how far the plume from an eruption extends, so it would be much cleaner if I could prevent vectors from showing up at all in the field if they fall below a certain magnitude. Does anyone have an idea how to go about this? My current gnuplot script is below. I can also modify the input file as necessary.
reset
set nokey
set term png
set xrange [0:5.1]
set yrange [0:10.1]
do for [i=0:10] {
set title 'Eruption simulation: Timestep '.i
set output 'path/FlowVel'.sprintf('%04.0f',i).'.png'
plot 'path/Flow'.sprintf('%04.0f',i).'.dat' using 1:2:3:4 with vec
}
I guess you want a kind of filtering, which gnuplot doesn't really have, but can be achieved with the following trick (taken from "help using examples" in gnuplot):
One trick is to use the ternary `?:` operator to filter data:
plot 'file' using 1:($3>10 ? $2 : 1/0)
which plots the datum in column two against that in column one provided
the datum in column three exceeds ten. `1/0` is undefined; `gnuplot`
quietly ignores undefined points, so unsuitable points are suppressed.
Or you can use the pre-defined variable NaN to achieve the same result.
So I guess you would want something like this in your case
plot "data.dat" u 1:2:($3**2+$4**2>mag_sq?$3:NaN):($3**2+$4**2>mag_sq?$4:NaN) w vector
where mag_sq is the square of your desired magnitude.

How can 'arrange' command be used to generate a set of bins for histogram plot in R

When I try to plot a histogram in R with a user-defined set of bins, I get this error: 'some 'x' not counted; maybe 'breaks' do not span range of 'x''.
I am following the information on the website 'http://msenux.redwoods.edu/math/R/hist.php' which states 'Use the arange command to produce this set of bins'. I searched the internet for how to produce a suitable range of bins for my data set, but in vain.
Could anyone tell me how it's done, or whether there is another way?
I have tried to set the bins as
bins=seq(0,3,by=0.2)
and plot histogram as
hist(log10(a),col=4,breaks=bins)
I suspect that some values of log10(a) fall outside the range [0,3]. In that case, you can simply do something like
bins<-seq(min(log10(a)), max(log10(a))+1, by=0.2)
This ensures that all the values fall within the bins.
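As a variation on the same idea, the sketch below rounds the limits outward to multiples of the bin width, so the breaks land on tidy values and still cover every observation. The data are simulated (a here is a hypothetical stand-in for the asker's vector), and the names la, width, lo and hi are introduced for illustration.

```r
set.seed(7)
# Hypothetical data standing in for 'a'; some log10 values fall outside [0, 3]
a <- 10^runif(200, min = -0.5, max = 3.5)
la <- log10(a)

# Round the limits outward to whole multiples of the bin width
width <- 0.2
lo <- floor(min(la) / width)
hi <- ceiling(max(la) / width)
bins <- (lo:hi) * width

# Every value now lies inside the breaks, so hist() counts all of them
h <- hist(la, col = 4, breaks = bins)
```

Building the breaks from integer multiples avoids the extra +1 padding and keeps the bin edges aligned with round numbers.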
