Stata - Preserving encoded variable and stacked graphing

These data represent ice cream preferences, where individuals can change their preferences over time:
id time flavor_str flavor_enc
1 1 C 1
1 2 C 1
1 3 V 2
2 1 S 3
2 2 V 2
2 3 C 1
3 1 V 2
4 1 C 1
4 2 V 2
Note: flavor_enc is shown here as a number, but Stata displays the string label in blue, which stands for the underlying number.
I have two issues.
First, when I create a variable based on the encoded variable, for example
g initial_pref = 0
replace initial_pref = flavor_enc if time == 1
OR
bysort id: egen max_pref = max(flavor_enc)
the new variable (initial_pref or max_pref) takes on the underlying numeric values; however, I would like it to keep the same labelled format as flavor_enc.
Second, I want to create a stacked bar chart with flavor on the x-axis and frequency on the y-axis. Each bar would have one piece that represents the number of times a given flavor was someone's initial preference, a second piece that represents the number of times that flavor was someone's second preference (they switched from their initial preference, 0 otherwise), and a last piece that represents the number of times the flavor was their third preference.
For these data the chart would use these inputs.
C as initial = 2
V as initial = 1
S as initial = 1
C as second = 0
V as second = 3
S as second = 0
C as third = 1
V as third = 0
S as third = 0
I tried graph bar with the stacking option but that did not work. I also could see how to do this outside of Stata but was hoping Stata had the functionality.

The wording is not completely clear to me, but I believe the first issue can be managed with clonevar:
clonevar initial_pref2 = flavor_enc
replace initial_pref2 = 0 if time != 1
Regarding your latest comment (and edit), if you want to compute the maximum and still use clonevar, it is possible:
clonevar max_pref2 = flavor_enc
bysort id (max_pref2): replace max_pref2 = max_pref2[_N]
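If flavor_enc has missing values, adjustments are necessary, because Stata sorts numeric missing values last and max_pref2[_N] would then be missing.
An alternative solution involves extracting the data attributes from the original variable using extended macro functions (help extended_fcn) and assigning them to the new variable; for example, the value label extended function returns the name of the value label attached to flavor_enc, which can then be attached to the new variable with label values.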
One way to tackle the graph issue is as follows:
clear
set more off
*----- example data -----
input ///
id time str1 flavor_str flavor
1 1 C 1
1 2 C 1
1 3 V 2
2 3 C 1
2 1 S 3
2 2 V 2
3 1 V 2
4 2 V 2
4 1 C 1
end
drop flavor_str
sort id time
list, sepby(id)
*----- bar graph -----
quietly tabulate time, gen(tt)
collapse (sum) tt*, by(flavor)
label define lblflavor 1 "flavor 1" 2 "flavor 2" 3 "flavor 3"
label values flavor lblflavor
graph bar (asis) tt*, over(flavor) stack ///
ylabel(none) blabel(bar, position(center)) legend(off)
There is surely a better way; I seldom use these graphs, so my experience is minimal.
I can't say much about its appropriateness except that for this example, it seems like an awful waste of space.

Related

Discovering the dependency relations among the samples of a data set

Working in R, let's say I have a data set similar to the one below:
ID Activity
1 a
1 b
2 a
3 c
2 a
1 c
4 a
4 b
3 b
4 c
As you can see, each ID has a sequence of activities. What matters is the number of times an activity is followed by each of the others.
The results I am looking for are:
1. The existing variants in the data set (the sequence of activities for each ID), like:
<a,b,c> : id: 1 & 4
<a,a> : id: 2
<c,b> : id: 3
2. A matrix like the following, which shows the number of times an activity is followed by another one:
  a b c
a 1 2 0
b 0 0 1
c 0 1 0
Thank you for your help.

Here is a solution with data.table
library(data.table)
dt <- data.table(ID=c(1,1,2,3,2,1,4,4,3,4),Activity=c("a","b","a","c","a","c","a","b","b","c"))
IDs per Sequence:
dt[,.(seq=paste(Activity,collapse = ",")),ID][,.(ids=paste(ID,collapse = ",")),seq]
The counts of consecutive pairs can be obtained quickly:
consecutive_id <- dt[,.(first=(Activity),second=(shift(Activity,type = "lead"))),ID][!is.na(second)]
consecutive <- consecutive_id[,.N,.(first,second)]
If you need the result in matrix form, a few extra steps are needed:
classes <- dt[,unique(Activity)];n <- length(classes)
M_consecutive <- data.table(matrix(0,nrow = n,ncol=n))
setnames(M_consecutive,classes)
M_consecutive$classes <- classes; setkey(M_consecutive,classes)
for(i in 1:nrow(consecutive)) M_consecutive[consecutive[i]$first,(consecutive[i]$second):=consecutive[i]$N]
M_consecutive
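
For readers without data.table, the same two results can be sketched in base R. This is only an illustrative alternative, not part of the answer above: the object names (df, variants, ids_per_variant, pairs, trans) are made up for the example, and the rows are assumed to be in chronological order within each ID.

```r
# Sample data from the question (rows assumed to be in time order within each ID)
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Variant (sequence of activities) per ID, then the IDs that share each variant
variants <- sapply(split(df$Activity, df$ID), paste, collapse = ",")
ids_per_variant <- tapply(names(variants), variants, paste, collapse = ",")

# Build (first, second) pairs of consecutive activities within each ID
acts <- sort(unique(df$Activity))
pairs <- do.call(rbind, lapply(split(df$Activity, df$ID), function(x) {
  if (length(x) < 2) return(NULL)  # a single activity has no successor
  data.frame(first = head(x, -1), second = tail(x, -1),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate; fixed factor levels keep never-observed transitions as zeros
trans <- table(
  first = factor(pairs$first, levels = acts),
  second = factor(pairs$second, levels = acts)
)

print(ids_per_variant)
print(trans)
```

On the sample data, b is followed by c for both ID 1 and ID 4, so that cell of the table is 2.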

How can I fix colors for the numbers in a matrix

I am trying to make multiple visualizations in which each number in a matrix must get a fixed color in the image.
Because I cannot find a way to assign a color to a fixed number, this is causing me more trouble than I expected.
The problem shows in the following examples.
Say we define the following colors to be associated with the following numbers:
cols <- c(
'0' = "#FFFFFF",
'1' = "#99FF66",
'2' = "#66FF33",
'3' = "#33CC00",
'4' = "#009900"
)
Now if we visualise the following matrix all seems good
d<-read.table(text="
0 1 0 3
3 2 1 4
4 1 0 2
3 3 0 1")
image(as.matrix(d), col=cols)
However, if I visualise the following matrix, the problem becomes clear:
d<-read.table(text="
1 1 1 3
3 2 1 4
4 1 2 2
3 3 2 1")
image(as.matrix(d), col=cols)
We should be skipping white ("#FFFFFF"), as the number 0 is not present. However, R uses white ("#FFFFFF") anyway and associates it with the number 1, skipping "#009900" instead.
For the consistency of my visualizations, it is rather important that colors remain associated with the same numbers in all images, so how can I implement this?

Remove the colors whose values are not present in your matrix:
image(as.matrix(d), col=cols[names(cols)%in%unlist(d)])
unlist() is used here because d is a data frame, i.e. a list.
If d is already a matrix, simply use c(d).

Thanks to Andre's advice I can solve it in a rather neat fashion
d<-as.matrix(read.table(text="
1 1 1 3
3 2 1 4
4 1 1 2
3 3 1 1"))
cols <- c(
'0' = "#FFFFFF",
'1' = "#99FF66",
'2' = "#66FF33",
'3' = "#33CC00",
'4' = "#009900"
)
image(as.matrix(d), col= cols[ names(cols) %in% d ])
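
Another way to keep the mapping fixed, whichever values occur, is to pass explicit breaks to image() so that each color covers exactly one integer. The following is a minimal sketch, assuming the matrix only contains the integers 0 to 4; the names vals, brks, d2 and cell_cols are illustrative.

```r
cols <- c("0" = "#FFFFFF", "1" = "#99FF66", "2" = "#66FF33",
          "3" = "#33CC00", "4" = "#009900")

# One interval per color, centred on each integer value
vals <- as.numeric(names(cols))
brks <- c(vals - 0.5, max(vals) + 0.5)

d2 <- as.matrix(read.table(text = "
1 1 1 3
3 2 1 4
4 1 2 2
3 3 2 1"))

# image() needs one more break than colors; 0 is absent here,
# but white stays reserved for it instead of shifting onto 1
image(d2, col = cols, breaks = brks)

# The color each cell falls into, computed from the same breaks for inspection
cell_cols <- unname(cols[findInterval(d2, brks, rightmost.closed = TRUE)])
```

With this approach the color vector never has to be subset, so gaps in the data (for example a matrix containing 1, 2 and 4 but no 3) do not shift the colors either.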

R|ggplot2: unordered stacked bar graph [duplicate]

I have a data set that looks like this:
samp.data <- structure(list(Track = c(1,1,1,1,1,1,1,1,2,2,2),
Base = c("A","C","B","A","D","D","C","A","A","B","B"),
Length = c(1,1,1,1,2,3,1,1,1,1,1)),
.Names = c("Track", "Base", "Length"), class = "data.frame",row.names = c(NA, 11L))
# Track Base Length
# 1 1 A 1
# 2 1 C 1
# 3 1 B 1
# 4 1 A 1
# 5 1 D 2
# 6 1 D 3
# 7 1 C 1
# 8 1 A 1
# 9 2 A 1
# 10 2 B 1
# 11 2 B 1
I am trying to plot an unordered stacked bar, with Tracks on the x axis and Length on the y axis. In other words, the bar graph wouldn't group the A bases together and plot it as one length of 1+1+1+1=4. It would plot each base in order. First it would plot the A base of length 1 in Track 1, C base of length 1 above that, B base of length 1 above that, A base of length 1 above that, D base of length 2 above that, and so on.
Below is a crude ASCII diagram of what I am trying to describe:
  |    C
L |    Y
e |    Y      Key
n |    R      A = Red
g | B  B      B = Blue
t | B  G      C = Green
h | R  R      D = Yellow
  ----------
    2  1
    Track
Sorry if the explanation is a little confusing. Thank you for your help!
Edit: This question is different from the possible duplicate, because I would like to ungroup the stacked sections.

Just use geom_bar(stat='identity'), set x to Track and y to Length, and it all works out.
Note: I converted Base to a factor (which makes sense), as well as Track (this also makes sense to me, but it is fine if you wish to keep it numeric; in that case you may want to add + scale_x_continuous(breaks = 1:2) so that the tracks show up as whole numbers on the x axis).
library(ggplot2)
samp.data$Base <- factor(samp.data$Base)
samp.data$Track <- factor(samp.data$Track)
ggplot(samp.data, aes(x=Track, y=Length, fill=Base)) +
  geom_bar(stat='identity') +
  scale_fill_manual(values=c('red', 'blue', 'green', 'yellow'))
The last line sets the colours as you please.
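If you wish to reverse the x axis order (so that track 2 appears first), add + scale_x_reverse() if Track is numeric; with Track as a factor, reverse the factor levels instead, e.g. factor(samp.data$Track, levels = c(2, 1)).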
I do not know what you mean by "ungroup the base" in your question, but say you wanted to draw an outline around each "chunk" of DNA you could add (e.g.) colour="black" in the geom_bar (e.g. in track 1, there is a D of length 2 immediately followed by a D of length 3 so it's drawn as a big D of length 5 - adding colour="black" outlines the 2-chunk separately to the 3-chunk though they still have the same colour).
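
In recent ggplot2 releases, bars are stacked by group rather than in row order, so segments with the same Base can end up next to each other. One possible sketch for keeping the original row order is to give every row its own group. The helper column row is hypothetical, and position_stack(reverse = TRUE) is used on the assumption that the first row should sit at the bottom of the bar.

```r
library(ggplot2)

samp.data <- data.frame(
  Track  = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  Base   = c("A", "C", "B", "A", "D", "D", "C", "A", "A", "B", "B"),
  Length = c(1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1)
)

# Hypothetical helper column: one group per row, so nothing is regrouped by Base
samp.data$row <- seq_len(nrow(samp.data))

p <- ggplot(samp.data,
            aes(x = factor(Track), y = Length, fill = Base, group = row)) +
  # reverse = TRUE is intended to stack the first row at the bottom
  geom_col(position = position_stack(reverse = TRUE), colour = "black") +
  scale_fill_manual(values = c(A = "red", B = "blue", C = "green", D = "yellow")) +
  labs(x = "Track")

print(p)
```

The black outline makes adjacent segments of the same base (such as the two D segments in track 1) visible as separate pieces.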

handling 'wrong' entries and NAs in a data.table substituting them with entries from other table

I am using data.table in the context of a wider application using shiny and handsontable.js. This is the flow of this part of the app:
I publish a data.table on the browser with numeric columns using handsontable & shiny. This is rendered on the screen.
The user changes values and each time this happens a new data.table is returned with the data.
The problem is with error management, specifically when a user accidentally types a character.
My objective is to correct the user's error by replacing the single cell where the character was entered with the value from the original copy (only this cell, as the others may contain valid changes to be saved at a later stage in the app).
Sadly, I have not been able to find an efficient solution to this problem. This is my code and a reproducible sample:
library(data.table)
# I generate a sample data.table
originTable = data.table( Cat = LETTERS[1:5],
                          Jan=1:5,
                          Feb=sample(1:5),
                          Mar=sample(1:5),
                          Apr=sample(1:5),
                          May=sample(1:5))
# I take a full copy and, to simulate the effect of a character keyed in by mistake,
# I convert the entire column to character
dt_ <- copy(originTable)
dt_[,Jan := as.character(Jan)]
# "q" entered by mistake by the user (row 5, column Jan)
set(dt_, i = 5L, j = "Jan", value = "q")
# This is what I get back:
   Cat Jan Feb Mar Apr May
1:   A   1   1   2   4   4
2:   B   2   5   4   2   2
3:   C   3   4   3   1   5
4:   D   4   3   5   5   1
5:   E   q   2   1   3   3
Now to my code to try to fix this:
valCols <- month.abb[1:5]
for (j in valCols)
set(dt_,
i = NULL,
j = j,
value= as.numeric(as.character(dt_[[j]])))
This gives me a data.table with an NA value somewhere (in place of the character entered by mistake, at a position I do not know in advance).
To substitute the value I've used the following code
for (j in valCols)
set(dt_,
i = which(is.na(dt_[[j]])),
j = j,
value= as.numeric(originTable[[j]]))
But it does not work: it finds the correct column, but ignores the i value and copies the value contained in originTable[1,j] rather than originTable[i,j]. In the example, dt_[5,2] gets 1 (the value at originTable[1,2]) instead of 5.
In other words, I would have expected as.numeric(originTable[[j]]) to be subsetted by i (implicitly) and by j (explicitly).
To be fair the Warning is telling me what is happening:
Warning message:
In set(dt_, i = which(is.na(dt_[[j]])), j = j, value = as.numeric(originTable[[j]])) :
Supplied 5 items to be assigned to 1 items of column 'Jan' (4 unused)
But I remain with my problem unsolved.
I have read countless apparently similar SO posts, but sadly to no avail (possibly because NA handling has evolved in recent releases and older answers no longer fully reflect best practice). A solution that is not based on NA would be equally acceptable. Thanks

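Try the following:
# use your criteria to determine which values are incorrect in each column
# (suppressWarnings hides the expected "NAs introduced by coercion" warning)
wrongs = lapply(dt_[, !"Cat"], function(x) which(is.na(suppressWarnings(as.numeric(x)))))
# now substitute the original values at exactly those rows
for (n in names(wrongs)) dt_[wrongs[[n]], (n) := originTable[[n]][wrongs[[n]]]]
dt_
# (the values below differ from the question because sample() is random)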
# Cat Jan Feb Mar Apr May
#1: A 1 2 5 2 4
#2: B 2 4 3 4 5
#3: C 3 3 2 5 2
#4: D 4 1 1 1 1
#5: E 5 5 4 3 3
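
An alternative that stays close to the set() loop in the question is to subset the replacement values by the same row indices that are being assigned. The sketch below uses a smaller, deterministic table (only Jan and Feb, with Feb fixed to 5:1) purely for illustration; idx is a made-up helper name.

```r
library(data.table)

# Reduced, deterministic version of the question's table (illustrative values)
originTable <- data.table(Cat = LETTERS[1:5], Jan = 1:5, Feb = 5:1)
dt_ <- copy(originTable)
dt_[, Jan := as.character(Jan)]
set(dt_, i = 5L, j = "Jan", value = "q")  # simulated typo by the user

valCols <- c("Jan", "Feb")
for (j in valCols) {
  # Non-numeric entries become NA; the coercion warning is expected here
  set(dt_, j = j,
      value = suppressWarnings(as.numeric(as.character(dt_[[j]]))))
  idx <- which(is.na(dt_[[j]]))
  # Subset the replacement by the same rows that are being assigned
  if (length(idx)) {
    set(dt_, i = idx, j = j, value = as.numeric(originTable[[j]][idx]))
  }
}

print(dt_)
```

The warning in the question arises because value has five elements while i selects a single row; indexing originTable[[j]] with idx makes the two lengths match.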

How to read a user input (character) as a column name (object name)

I have a data set like,
En Mn Hours var1 var2
1 1 1 0.1023488 0.6534707
1 1 2 0.1254325 0.5423215
1 1 3 0.1523245 0.2542354
1 2 1 0.1225425 0.2154533
1 2 2 0.1452354 0.4521255
1 2 3 0.1853324 0.2545545
2 1 1 0.1452369 0.2321542
2 1 2 0.1241241 0.2525212
2 1 3 0.0542232 0.2626214
2 2 1 0.8542154 0.2154522
2 2 2 0.0215420 0.5245125
2 2 3 0.2541254 0.2542512
var <- as.character(readline('Enter the variable of your choice'))
var stores the user input, e.g. var1.
If I then use it in R commands such as cbind, or anything else for that matter, it doesn't work.
A simple example: if a is the data set, a$var doesn't work.
Ex: aggregate(cbind(var)~Mn+Hours,a, FUN=mean)
I have some 30 columns like var1, var2, var3, and I want to read multiple inputs from the user.
For example, when prompted to enter the variable names, the user enters
var1 var2 var3 (space or comma separated, it doesn't matter to me)
and I then need to read them and use them in R commands to get some result.

Let's say your data frame is named df and the user has entered a column name as you described:
var <- as.character(readline('Enter the variable of your choice'))
df[var] = 0 # assign the actual values here; a column named by the content of var is added
You can run the same code in a loop to create any number of columns.
Edit
If you want to use the user input for other purposes, there are two cases.
When you want to refer to the column directly in the data frame, it is easy:
df[var]
When you want to refer to the column inside a function or formula, you have to do it like this (here var1 and var2 are character variables holding the column names entered by the user):
aggregate(cbind(get(var1),get(var2))~Mn+Hours,a, FUN=mean)
The idea behind get() is that it looks up the object (here the column) whose name is stored in the string, so the formula sees the actual column.
Updated
For the aggregate function in particular, see:
Name columns within aggregate in R
I hope this will work for you...
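
For several names entered at once, one possible sketch avoids get() and formulas altogether by splitting the input string and indexing the data frame with the resulting character vector. The variable input stands in for the readline() result, and vars, unknown and res are illustrative names.

```r
# Data from the question
a <- data.frame(
  En    = rep(1:2, each = 6),
  Mn    = rep(rep(1:2, each = 3), times = 2),
  Hours = rep(1:3, times = 4),
  var1  = c(0.1023488, 0.1254325, 0.1523245, 0.1225425, 0.1452354, 0.1853324,
            0.1452369, 0.1241241, 0.0542232, 0.8542154, 0.0215420, 0.2541254),
  var2  = c(0.6534707, 0.5423215, 0.2542354, 0.2154533, 0.4521255, 0.2545545,
            0.2321542, 0.2525212, 0.2626214, 0.2154522, 0.5245125, 0.2542512)
)

# Stand-in for: input <- readline("Enter the variables of your choice: ")
input <- "var1, var2"

# Split on commas and/or whitespace and drop empty pieces
vars <- strsplit(trimws(input), "[,[:space:]]+")[[1]]
vars <- vars[nzchar(vars)]

# Fail early if the user typed a name that is not a column of a
unknown <- setdiff(vars, names(a))
if (length(unknown) > 0) {
  stop("Unknown column(s): ", paste(unknown, collapse = ", "))
}

# Non-formula interface: columns are selected by name, so no get() is needed
res <- aggregate(a[vars], by = a[c("Mn", "Hours")], FUN = mean)
print(res)
```

Because the selection is done with a character vector, the same code works for any number of entered names.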
