How to combine two stacked bar charts onto the same axis - graph

I have been tasked to make a figure with two stacked bar charts on the same axis.
While this is generally easy enough, I have run into trouble because the two bar charts do not share a common set of values (such as year).
The dataset I am using can be found below:
clear
input str42 outcomes perct str42 highcred hperct
"Certificate & Diploma Only" 8.33 "Certificate & Diploma" 8.33
"Associate Only" 2.93 "Associate" 14.29
"Bachelor Only" 11.36 "Bachelor" 6.93
"Certificate + AA" 2.2 "" .
"Associate + Bachelor" 4.33 "" .
end
My code is the following:
*B. Create a separate variable for each value of outcomes
levelsof outcomes, local(out)
tokenize "1 2 3 4 5"
foreach level of local out {
gen outcome`1' = .
replace outcome`1' = perct if outcomes=="`level'"
mac shift
}
*C. Create a separate variable for each value of highcred
levelsof highcred, local(high)
tokenize "1 2 3"
foreach level of local high {
gen highcred`1' = .
replace highcred`1' = hperct if highcred=="`level'"
mac shift
}
//2: Create Bar graphs
*A. Bar 1
graph bar outcome1-outcome5, stack saving(bar1)
*B. Bar 2
graph bar highcred1-highcred3, stack saving(bar2)
*C. Combine graphs
graph combine bar1.gph bar2.gph, ycommon

The ycommon option works as intended. However, your solution for stacking the bars in separate graphs and then combining them is problematic in the sense that the two graphs share the same colors, which makes it impossible to distinguish the different categories. An additional challenge is how these categories can be incorporated in a single legend.
Below you can find a solution that addresses both of these problems:
levelsof outcomes, local(out)
levelsof highcred, local(high)
local highcopy "`high'"
local c1: word count `out'
local c2: word count `high'
local colors1 ebblue pink brown
local colors2 `colors1'
forvalues i = 1 / `= `c1' + `c2'' {
generate outcome`i' = .
gettoken outc out : out
if `i' <= `c1' replace outcome`i' = perct if outcomes == "`outc'"
if `i' > `c1' {
gettoken color colors1 : colors1
local bars1 `bars1' bar(`i', color(`color'))
}
if `i' <= `c2' {
generate highcred`i' = .
gettoken highcc highcopy : highcopy
replace highcred`i' = hperct if highcred == "`highcc'"
gettoken color colors2 : colors2
local bars2 `bars2' bar(`i', color(`color'))
}
if `i' <= `c1' local legend `legend' label(`i' "`outc'")
else {
gettoken highc high : high
local legend `legend' label(`i' "`highc'")
}
}
order outcome* high*
graph bar outcome1-outcome8, stack ///
ylabel(, nogrid) ///
graphregion(color(white)) ///
`bars1' ///
name(bar1, replace) ///
legend(`legend')
graph bar highcred1-highcred3, stack ///
ylabel(, nogrid) ///
yscale(off) ///
graphregion(color(white)) ///
`bars2' ///
name(bar2, replace)
grc1leg bar1 bar2, ycommon graphregion(color(white)) legendfrom(bar1)
Adding the option blabel(bar, position(base)) to each graph bar command labels every bar segment with its value, placed at the base of the segment.
Note that the community-contributed command grc1leg is used to create the combined graph.

I don't really understand the data. I guess that the order of values is worth preserving, although I don't think your code does that. I suggest that you will be much, much better off with a different data structure, horizontal bars and no stacking. graph bar (asis) is a better idea to avoid nonsense about means in the legend if you have a legend, but you don't need a legend at all.
For this, you would need to install labmask from the Stata Journal (search labmask to get a link).
You should be able to use better text than outcomes highcred.
clear
input str42 outcomes perct str42 highcred hperct
"Certificate & Diploma Only" 8.33 "Certificate & Diploma" 8.33
"Associate Only" 2.93 "Associate" 14.29
"Bachelor Only" 11.36 "Bachelor" 6.93
"Certificate + AA" 2.2 "" .
"Associate + Bachelor" 4.33 "" .
end
rename (outcomes-hperct) (x1 p1 x2 p2)
gen id = _n
reshape long x p , i(id) j(which)
sort which id
replace id = _n
drop in 9/10
labmask id, values(x)
label def which 1 "outcomes" 2 "highcred"
label val which which
graph hbar (asis) p, over(id) over(which) nofill scheme(s1color) ytitle(percent) ///
bar(1, bfcolor(none)) blabel(total, pos(base) format(%3.2f)) yla(none) ysc(alt)
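
To make the suggested data structure concrete, here is a minimal Python sketch of the same wide-to-long step that reshape long performs above. It is only an illustration of the idea, not part of the Stata solution; the names wide and to_long are hypothetical, and missing values are written as None.

```python
# Wide layout from the question: (outcomes, perct, highcred, hperct).
wide = [
    ("Certificate & Diploma Only", 8.33, "Certificate & Diploma", 8.33),
    ("Associate Only", 2.93, "Associate", 14.29),
    ("Bachelor Only", 11.36, "Bachelor", 6.93),
    ("Certificate + AA", 2.2, "", None),
    ("Associate + Bachelor", 4.33, "", None),
]


def to_long(rows):
    """Stack the two (label, percent) column pairs, dropping empty padding rows."""
    long_rows = []
    # which = 1 for the outcomes pair, 2 for the highcred pair.
    for which, (label_col, value_col) in enumerate([(0, 1), (2, 3)], start=1):
        for row in rows:
            if row[label_col]:  # skip the "" rows that only pad the shorter series
                long_rows.append((which, row[label_col], row[value_col]))
    return long_rows


for which, label, percent in to_long(wide):
    print(which, label, percent)
```

Each output row carries its own group, label and value, which is what lets a single bar command draw both series with over(id) over(which).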


How to correct the output generated through str_detect/str_contains in R

I just have a column "methods_discussed" in CSV (link is https://github.com/pandas-dev/pandas/files/3496001/multiple_responses.zip)
multi<- read.csv("multiple_responses.csv", header = T)
This file contains the names of family planning methods in the column methods_discussed, for example:
methods_discussed
emergency female_sterilization male_sterilization iud NaN injectables male_condoms -77 male_condoms female_sterilization male_sterilization injectables iud male_condoms
I have created a vector of the 8 family planning methods (all values except -77 and NaN):
method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')
I want to create a new indicator variable for each name in the vector method_names in the existing data frame multi2. For this I tried (I)
for (abc in method_names) {
multi2[abc]<- as.integer(str_detect(multi2$methods_discussed, fixed(abc)))
}
(II)
for (abc in method_names) {
multi2[abc]<- as.integer(str_contains(abc,multi2$methods_discussed))
}
(III) I also tried
for (abc in method_names) {
multi2[abc]<- as.integer(stri_detect_fixed(multi2$methods_discussed, abc))
}
but the output does not match what I expected. Because male_sterilization is a substring of female_sterilization, the male_sterilization indicator is also 1 (TRUE) for rows that contain only female_sterilization. This is shown in the Actual Output below at row 2, where it should be 0 (FALSE) because only female_sterilization appears in the methods_discussed column. I also don't want any 0/1 (FALSE/TRUE) values generated for rows where methods_discussed is -77 or blank; those cells should stay blank (all highlighted in the Expected Output).
Actual Output
Expected Output
The code runs without errors; only the output is wrong.
You can add word boundaries to fix that issue.
multi<- read.csv("multiple_responses.csv", header = T)
method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')
for (abc in method_names) {
multi[abc]<- as.integer(grepl(paste0('\\b', abc, '\\b'), multi$methods_discussed))
}
multi[multi$methods_discussed %in% c('', -77), method_names] <- ''
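
The effect of the word boundaries can be checked in isolation. The following is a small Python sketch of the same idea using the standard re module, not a translation of the R answer; the function name indicators is hypothetical, and treating the literal string "NaN" as missing is an assumption that goes beyond what the R code above handles.

```python
import re

METHOD_NAMES = [
    "female_condoms", "emergency", "male_condoms", "pill",
    "injectables", "iud", "male_sterilization", "female_sterilization",
]


def indicators(response):
    """Return {method: 0/1} for one response, or None for blank, -77 or NaN."""
    # Assumption: "NaN" is treated as missing, like "" and "-77".
    if response is None or response.strip() in ("", "-77", "NaN"):
        return None
    return {
        # \b only matches between a word and a non-word character, so
        # "male_sterilization" is not found inside "female_sterilization".
        name: int(re.search(r"\b" + re.escape(name) + r"\b", response) is not None)
        for name in METHOD_NAMES
    }


row = indicators("female_sterilization iud")
print(row["female_sterilization"], row["male_sterilization"])  # prints: 1 0
print(indicators("-77"))  # prints: None
```

The underscore counts as a word character, so each method name is matched as a single word and only spaces (or the ends of the string) act as boundaries.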

(gnu) diff - display corresponding line numbers

I'm trying to build a diff viewer plugin for my text editor (Kakoune). I'm comparing two files and marking any lines that are different across two panes.
However the two views don't scroll simultaneously. My idea is to get a list of line numbers that correspond to each other, so I can place the cursor correctly when switching panes, or scroll the secondary window when the primary one scrolls, etc.
So - Is there a way to get a list of corresponding numbers from commandline diff?
I'm hoping for something along the following example: Given file A and B, the output should tell me which line numbers (of the ones that didn't change) would correspond.
File A       File B        Output
1: hello     1: hello      1:1
2: world     2: another    2:3
3: this      3: world      3:4
4: is        4: this       4:5
5: a         5: is
6: test      6: eof
The goal is that when I scroll to line 4 in file A, I'll know to scroll file B such that its line 5 is rendered at the same position.
It doesn't have to be GNU diff, but it should only use tools that are available on most/all Linux boxes.
I can get some of the way using GNU diff, but then need a python script to post-process it to turn a group-based output into line-based.
#!/usr/bin/env python3
import subprocess
import sys

file1, file2 = sys.argv[1:]

# Emit "first1 last1 first2 last2," for every unchanged group; print nothing for changed groups.
cmd = ["diff",
       "--changed-group-format=",
       "--unchanged-group-format=%df %dl %dF %dL,",
       file1,
       file2]

# universal_newlines=True makes stdout a str rather than bytes on Python 3.
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
output = p.communicate()[0]

for item in output.split(",")[:-1]:
    start1, end1, start2, end2 = [int(s) for s in item.split()]
    n = end1 - start1 + 1
    assert n == end2 - start2 + 1  # unchanged group, should be same length in each file
    for i in range(n):
        print("{}:{}".format(start1 + i, start2 + i))
gives:
$ ./compare.py fileA fileB
1:1
2:3
3:4
4:5
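
Since a Python post-processing step is needed anyway, one alternative is to skip the external diff call and let the standard library's difflib find the unchanged blocks. This is only a sketch of that alternative: the function name corresponding_lines is hypothetical, and difflib's matching algorithm is not GNU diff's, so on other inputs the two may pair lines differently.

```python
import difflib


def corresponding_lines(lines_a, lines_b):
    """Return 1-based (line_in_a, line_in_b) pairs for lines that difflib matches."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # The last block is a zero-length sentinel, so it contributes no pairs.
        for offset in range(block.size):
            pairs.append((block.a + offset + 1, block.b + offset + 1))
    return pairs


# The example files from the question, one list entry per line.
file_a = ["hello", "world", "this", "is", "a", "test"]
file_b = ["hello", "another", "world", "this", "is", "eof"]

for a, b in corresponding_lines(file_a, file_b):
    print("{}:{}".format(a, b))
```

For the two example files this prints the same four pairs as the script above. For real files, the lists could be filled with open(path).read().splitlines().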

Geoviews points: changing colormap with high and low options

I recently looked at this question on manually setting the limits of a holoviews colorbar, but after changing the range on one of my vdims, it didn't change the high and low limits of the colorbar. Is there a way to pass a Bokeh LinearColorMapper (and its options) directly for a particular vdim?
opts = {'plot' : dict(width=width_val, height=height_val, tools=[hover_shipments],
size_index='po_qty',
color_index='magnitude',
size_fn=(lambda x : x/100),
click_policy='hide', colorbar=True),
'style': dict(cmap='Viridis', line_width=0.25, alpha=0.75, fill_alpha=0.75,
muted_alpha=0.05)}
ds_time_store.to(gv.Points,
kdims=['longitude_qty','latitude_qty'],
vdims=['store_num',
'city_nm',
'po_qty',
hv.Dimension('magnitude', range=(0, 50))], label='late').opts({'Points' : opts})
By calling redim(aggregate_rating=dict(range=(0, 5))) on my data set locality_ratings before setting up the Points, I was able to set the boundaries for the colorbar from 0 to 5 as per my ratings.
points = locality_ratings.redim(aggregate_rating=dict(range=(0, 5))).to(gv.Points, ['longitude', 'latitude'],
['aggregate_rating', 'votes', 'cuisine_count', 'average_cost_for_two', 'price_range', 'locality'])
(gts.Wikipedia * points.options(width=700, height=600,
tools=['hover', 'save', 'zoom_in', 'zoom_out', 'pan' , 'wheel_zoom'],
colorbar=True, toolbar='above', xaxis=None, yaxis=None,
size_index=3, color_index=2, size=3, cmap=bokeh.palettes.all_palettes['Dark2'][5])).redim(longitude="Longitude",
latitude="Latitude",
aggregate_rating='Rating',
locality="Locality",
votes="Votes",
price_range="Price Range",
average_cost_for_two="Avg. Cost for 2 (R)",
cuisine_count="No. Cuisines"
)

Gnuplot: data normalization

I have several time-based datasets which are of very different scale, e.g.
[set 1]
2010-01-01 10
2010-02-01 12
2010-03-01 13
2010-04-01 19
…
[set 2]
2010-01-01 920
2010-02-01 997
2010-03-01 1010
2010-04-01 1043
…
I'd like to plot the relative growth of both since 2010-01-01. To put both curves on the same graph I have to normalize them. So I basically need to pick the first Y value and use it as a weight:
plot "./set1" using 1:($2/10), "./set2" using 1:($2/920)
But I want to do it automatically instead of hard-coding 10 and 920 as dividers. I don't even need the max value of the second column, I just want to pick the first value or, better, a value for a given date.
So my question: is there a way to refer to the value of a given column in the row where the X column (X is a time axis) has a given value? Something like
plot "./set1" using 1:($2/$2($1="2010-01-01")), "./set2" using 1:($2/$2($1="2010-01-01"))
where $2($1="2010-01-01") is the feature I'm looking for.
Picking the first value is quite easy. Simply remember its value and divide all data values by it:
ref = 0
plot "./set1" using 1:(ref = ($0 == 0 ? $2 : ref), $2/ref),\
"./set2" using 1:(ref = ($0 == 0 ? $2 : ref), $2/ref)
Using the value at a given date is more involved:
Using an external tool (awk)
ref1 = system('awk ''$1 == "2010-01-01" { print $2; exit; }'' set1')
ref2 = system('awk ''$1 == "2010-01-01" { print $2; exit; }'' set2')
plot "./set1" using 1:($2/ref1), "./set2" using 1:($2/ref2)
Using gnuplot
You can use gnuplot's stats command to pick the desired value, but you must apply all time settings only after running it:
a) String comparison
stats "./set1" using (strcol(1) eq "2010-01-01" ? $2 : 1/0)
ref1 = STATS_max
...
set timefmt ...
set xdata time
...
plot ...
b) Compare the actual time value (works like this only since version 5.0):
reftime = strptime("%Y-%m-%d", "2010-01-01")
stats "./set1" using (timecolumn(1, "%Y-%m-%d") == reftime ? $2 : 1/0)
ref1 = STATS_max
...
set timefmt ...
set xdata time
...
plot ...
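
The normalization itself is simple arithmetic, so it can also be checked outside gnuplot. Below is a minimal Python sketch of dividing every value by the value at a reference date; the function name normalize and the in-memory lists are assumptions for illustration, using the sample rows from the question.

```python
def normalize(rows, ref_date):
    """Divide every value by the value recorded at ref_date."""
    # next() raises StopIteration if ref_date does not occur in rows.
    ref = next(value for date, value in rows if date == ref_date)
    return [(date, value / ref) for date, value in rows]


set1 = [("2010-01-01", 10), ("2010-02-01", 12), ("2010-03-01", 13), ("2010-04-01", 19)]
set2 = [("2010-01-01", 920), ("2010-02-01", 997), ("2010-03-01", 1010), ("2010-04-01", 1043)]

for name, rows in (("set1", set1), ("set2", set2)):
    for date, relative in normalize(rows, "2010-01-01"):
        print(name, date, round(relative, 3))
```

Both series start at 1.0 on the reference date, which is what puts them on a common scale in the plot.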

Large grouped data plotting

I have a large amount of data to plot, and I'm trying to use gnuplot. The data is a sorted array of around 80000 elements. By simply using
plot "myData.txt" using 1:2 with linespoints linetype 1 pointtype 1
I get the output, but: it takes time to render, and the points are often cluttered, with occasional gaps. To address the second, I thought of doing the bar chart: each of the entries would correspond to a bar. However, I'm not sure how to achieve this. I would like to have some space between consecutive bars, but I don't expect that it would be visible. What would be your suggestion to plot the data?
........................
Due to large data volume, I guess it's best to group.
Note that my data looks like
1 11041.9
2 11041.9
3 9521.07
4 9521.07
5 9520.07
6 9519.07
7 9018.07
...
I would like to plot the data in groups of 3, i.e., the first vertical line should start at 9521.07, the minimum of points 1, 2 and 3, and end at 11041. The second vertical line should consider the following 3 points (4, 5 and 6), starting at 9519.07 and ending at 9521.07, and so on.
Could this be achieved with gnuplot, given the data file as illustrated? If so, I would appreciate if someone posts a set of commands I should use.
To reduce the number of points gnuplot actually draws, you can use the every keyword, e.g.
plot "myData.txt" using 1:2 with linespoints linetype 1 pointtype 1 every 100
will plot every 100th data point.
I am not sure if it's possible to do what you want (plotting vertical lines) elegantly within gnuplot, but here is my solution (assuming a UNIX-y environment). First make an awk script called sort.awk:
BEGIN { RS = "" }
{
# the next two lines handle the case where
# there are not three lines in a record
xval = $1 + 1
ymin = ymax = $2
# find y minimum
if ($2 <= $4 && $2 <= $6)
ymin=$2
else if ($4 <= $2 && $4 <= $6 && $4 != "")
ymin=$4
else if ($6 <= $2 && $6 <= $4 && $6 != "")
ymin=$6
# find y maximum
if ($2 >= $4 && $2 >= $6)
ymax=$2
else if ($4 >= $2 && $4 >= $6)
ymax=$4
else if ($6 >= $2 && $6 >= $4)
ymax=$6
# print the formatted line
print ($1+1) " " ymin " " ymin " " ymax " " ymax
}
Now this gnuplot script will call it:
set terminal postscript enhanced color
set output 'plot.eps'
set boxwidth 3
set style fill solid
plot "<sed 'n;n;G;' myData.txt | awk -f sort.awk" with candlesticks title 'pretty data'
It's not pretty but it works. sed adds a blank line every 3 lines, and awk formats the output for the candlesticks style. You can also try embedding the awk script in the gnuplot script.
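
If the sed and awk pipeline is hard to follow, the same preprocessing can be sketched in Python. This is an illustration of the grouping step only, under the assumption that the data is already loaded as (x, y) pairs; the function name group_ranges is hypothetical, and a short final group is handled by taking the minimum and maximum of whatever points remain.

```python
def group_ranges(points, size=3):
    """Return (x_mid, ymin, ymax) for each consecutive group of `size` points."""
    rows = []
    for start in range(0, len(points), size):
        group = points[start:start + size]  # the last group may be shorter
        xs = [x for x, _ in group]
        ys = [y for _, y in group]
        rows.append((sum(xs) / len(xs), min(ys), max(ys)))
    return rows


data = [(1, 11041.9), (2, 11041.9), (3, 9521.07), (4, 9521.07),
        (5, 9520.07), (6, 9519.07), (7, 9018.07)]

for x, ymin, ymax in group_ranges(data):
    # Five columns, matching the "x ymin ymin ymax ymax" layout printed by the awk script.
    print(x, ymin, ymin, ymax, ymax)
```

The printed rows could be written to a file and plotted with candlesticks in the same way as the output of the pipeline.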
You can do something like that (it'll be easiest on Unix). You will need to insert a blank line after every third line; I don't see any way around that. If you're on Unix, the command
awk '1; NR % 3 == 0 {print ""}' myfile
should do it. (See How do I insert a blank line every n lines using awk?)
Of course, you could (and probably should) pack that straight into your gnuplot file.
So, all said and done, you'd have something like this:
xval(x)=(int(x)-1)/3 #Return the x position on the plot (lines 1-3 -> 0, lines 4-6 -> 1, ...)
plot "< awk '1; NR % 3 == 0 {print \"\"}' datafile" using (xval($1)):2 with lines
