R Question: How can I create a histogram with 2 variables against each other? - r

Okay, let me be as clear as I can about my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
             | =
             | =
Frequency of | =
Incidents    | = =
             | = = =
             | = = = = =
               - - - - - -
             Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.

I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
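## Sum the incidents for each model
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
##   Group.1  x
## 1 XXX1002 13
## 2 XXX1991 27
## 3 XXX1992  8
## Bar chart of the totals, labelled with the model names
barplot(TAB$x, names.arg=TAB$Group.1)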
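Since the question also mentions standard deviation lines, here is a minimal sketch of one way to add them to the same bar chart. It assumes the lines should mark the mean and the mean plus or minus one standard deviation of the per-model totals, which is my reading of the question rather than something it states. The names totals, m and s are made up for the illustration.

```r
# Same example data as in the answer above
data <- read.table(text = "Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header = TRUE)

# The formula interface keeps the original column names (Model, Incidents)
totals <- aggregate(Incidents ~ Model, data = data, FUN = sum)

# Bars for the per-model totals
barplot(totals$Incidents, names.arg = totals$Model, ylab = "Total incidents")

# Assumption: reference lines at mean and mean +/- 1 SD of the totals
m <- mean(totals$Incidents)
s <- sd(totals$Incidents)
abline(h = c(m - s, m, m + s), lty = c(2, 1, 2))
```

Models whose bar rises above the upper dashed line would be the ones with unusually many incidents under this reading.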

Related

Stata plot with information from multiple variables

I have 10 binary variables (var1, var2, ..., var10) answering "yes" or "no" (1 or 0) to a certain question, but under different conditions. I want to create a bar plot in Stata that shows the proportion of people who answered "no" for each of the variables (a single plot). How can I do this? If I use the regular bar plot command for frequencies
graph bar, over(varlist)
I get an error because over() only takes a single variable, not a varlist. Something like this is pretty easy to do in R or Python, but I'm not sure how to do this in Stata. My data looks something like below:
     +-------------------------+
     | id   var1   var2   var3 |
     |-------------------------|
  1. |  1      0      0      1 |
  2. |  2      1      1      1 |
  3. |  3      0      1      1 |
     +-------------------------+
As stated, each person has answered 3 questions (rather, the same question presented in three different ways) with "yes" or "no". I want to generate a single barplot with three bars ("var1", "var2", "var3"), each representing the proportion of people who answered no to the question (so 0.67, 0.33, and 0, respectively, in the example data).
There is no reproducible data example here. The Stata tag wiki has very detailed advice on how to give data examples.
Plotting the fraction of zeros directly does not yield to any trick obvious to me as I write, but here is a work-around. The principles for 10 variables aren't different from those for a four-variable example invented here. The main idea is that the default of graph hbar (or of graph bar or graph dot) is to show means, and the mean of a binary variable is a proportion.
clear
set seed 2803
set obs 10
forval j = 1/4 {
generate var`j' = runiform() > (`j' * 0.2)
}
forval j = 1/4 {
generate nvar`j' = 1 - var`j'
label var nvar`j' "var`j'"
}
graph hbar nvar* , ascategory ytitle(fraction of Nos) name(G1, replace)
statplot nvar*, ytitle(fraction of Nos) name(G2, replace)
The statplot solution (dependent on installing that command using ssc install statplot) is just an alternative. It's a personal view that its immediate result here is closer to a civilised graph than the default of graph hbar. But it's not different in principle and you would get closer by spelling out more options directly for graph hbar.
Using graph hbar rather than graph bar is a personal choice. But if your real data have variable labels or longer names, then space to show either readably for 10 variables could be a medium-sized deal.
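Since the question mentions R, the same principle (the mean of a 0/1 variable is a proportion, so the mean of the flipped variable is the fraction of Nos) can be sketched there too. This is only an illustration using the three-variable example data from the question; the names answers and frac_no are made up.

```r
# Example data from the question: three respondents, three binary variables
answers <- data.frame(id   = 1:3,
                      var1 = c(0, 1, 0),
                      var2 = c(0, 1, 1),
                      var3 = c(1, 1, 1))

# Flip 0/1 so that "no" becomes 1, then take column means:
# the mean of a binary indicator is the proportion of 1s
frac_no <- colMeans(1 - answers[, c("var1", "var2", "var3")])

# Horizontal bars, one per variable
barplot(frac_no, horiz = TRUE, xlab = "fraction of Nos")
```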

What is the best way to manage/store results from either posthoc.kruskal.dunn.test() or dunn.test() - where my input data is in dataframe format?

I am a newbie in R programming and seek help in analyzing the Metabolomics data - 118 metabolites with 4 conditions (3 replicates per condition). I would like to know, for each metabolite, which condition(s) is significantly different from which. Here is part of my data
> head(mydata)
  Conditions   HMDB03331   HMDB00699   HMDB00606    HMDB00707    HMDB00725    HMDB00017  HMDB01173
1 DMSO_BASAL 0.001289121 0.001578235 0.001612297 0.0007772231 3.475837e-06 0.0001221674 0.02691318
2 DMSO_BASAL 0.001158363 0.001413287 0.001541713 0.0007278363 3.345166e-04 0.0001037669 0.03471329
3 DMSO_BASAL 0.001043537 0.002380287 0.001240891 0.0008595932 4.007387e-04 0.0002033625 0.07426482
4   DMSO_G30 0.001195253 0.002338346 0.002133992 0.0007924157 4.189224e-06 0.0002131131 0.05000778
5   DMSO_G30 0.001511538 0.002264779 0.002535853 0.0011580857 3.639661e-06 0.0001700157 0.02657079
6   DMSO_G30 0.001554804 0.001262859 0.002047611 0.0008419137 6.350990e-04 0.0000851638 0.04752020
This is what I have so far.
I learned the first line from this post
kwtest_pvl = apply(mydata[,-1], 2, function(x) kruskal.test(x,as.factor(mydata$Conditions))$p.value)
and this is where I loop through the metabolites that passed the KW test
tCol = colnames(mydata[,-1])[kwtest_pvl <= 0.05]
for (k in tCol){
output = posthoc.kruskal.dunn.test(mydata[,k],as.factor(mydata$Conditions),p.adjust.method = "BH")
}
I am not sure how to manage my output such that it is easier to manage for all the metabolites that passed KW test. Perhaps saving the output from each iteration appending to excel? I also tried dunn.test package since it has an option of table or list output. However, it still leaves me at the same point. Kinda stuck here.
Moreover, should I also apply some kind of p-value adjustment (e.g. FWER, FDR, BH) right after the KW test, before performing the post hoc test?
Any suggestion(s) would be greatly appreciated.
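One possible way to keep the results from every iteration is to collect them in a named list instead of overwriting output, and then flatten the p-value matrices into one long table. The sketch below is only an illustration under stated assumptions: the data are a made-up toy set (three hypothetical metabolites M1 to M3, four conditions A to D), and base R's pairwise.wilcox.test is swapped in as a stand-in for posthoc.kruskal.dunn.test so that the example runs without extra packages. Whether the Dunn test object exposes a p.value matrix in the same way is an assumption to check against its documentation.

```r
# Toy stand-in for mydata: 4 conditions x 3 replicates, 3 hypothetical metabolites
mydata <- data.frame(
  Conditions = rep(c("A", "B", "C", "D"), each = 3),
  M1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),   # groups fully separated
  M2 = c(1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12),   # groups overlap heavily
  M3 = c(12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)    # separated, reversed order
)
grp <- as.factor(mydata$Conditions)

# Kruskal-Wallis p-value per metabolite, as in the question
kwtest_pvl <- sapply(mydata[-1], function(x) kruskal.test(x, grp)$p.value)
tCol <- names(kwtest_pvl)[kwtest_pvl <= 0.05]

# Named list: one post hoc result per metabolite, nothing is overwritten
# (pairwise.wilcox.test is a stand-in for the Dunn test here)
results <- lapply(setNames(tCol, tCol), function(k) {
  pairwise.wilcox.test(mydata[[k]], grp, p.adjust.method = "BH")
})

# Flatten every p-value matrix into one long data frame
long <- do.call(rbind, lapply(names(results), function(k) {
  pm  <- results[[k]]$p.value
  idx <- which(!is.na(pm), arr.ind = TRUE)
  data.frame(metabolite = k,
             group1 = rownames(pm)[idx[, "row"]],
             group2 = colnames(pm)[idx[, "col"]],
             p.adj  = pm[idx],
             stringsAsFactors = FALSE)
}))
head(long)
```

A single long table like this can be written out once with write.csv(), which may be easier than appending to a spreadsheet inside the loop.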

Finding Specific Means and Medians in R

I am working on a school project in R that looks at swimming data compiled from 8 different teams, covering each of the 13 events over 6 years. I have appended over 8700 rows of data and am trying to find out how to extract the specific means that I am looking for. For example, I would like to look at the progression of mean times for team 1 for event 3 for men. Thanks!
You can subset your data-frame to only include those variables, e.g.
ss = subset(df, team == 1 & event == 3)
mean(ss$times)
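To follow the progression over the years, the subset can be summarised per year. This is a rough sketch on made-up toy data; the column names team, event, gender, year and times are assumptions, since the question does not show the actual data frame.

```r
# Hypothetical toy data; the column names are assumptions
df <- data.frame(
  team   = c(1, 1, 1, 1, 2, 1),
  event  = c(3, 3, 3, 3, 3, 4),
  gender = c("M", "M", "M", "M", "M", "M"),
  year   = c(2010, 2010, 2011, 2011, 2010, 2010),
  times  = c(60, 62, 58, 59, 70, 30)
)

# Keep only men's results for team 1 in event 3
ss <- subset(df, team == 1 & event == 3 & gender == "M")

# One mean and one median per year
mean_by_year   <- aggregate(times ~ year, data = ss, FUN = mean)
median_by_year <- aggregate(times ~ year, data = ss, FUN = median)
mean_by_year
```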

Discriminant analysis and column name in the code

I have been writing some code to make it easier to perform a discriminant analysis using the lda function, but there is one step I cannot solve: passing the name of the categorical column into the code. Imagine we have the following table (called Smoke), in which the column Factor represents the groups (in our case, smoker and nsmok).
Smoke
  Factor Lung Heart Blood
1 smoker    7    22    15
2 smoker    8    21    12
3  nsmok   22     9     5
This is the code I have been preparing. Please look at the XXXX's in the code (it appears twice). I want the name of the categorical column to be filled in automatically, instead of typing it in by hand twice.
lda=lda(XXXX~.,data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
Tabla=table(Smoke$XXXX,lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
I thought that writing...
colnames(Table)[1]
... would solve it. But actually there still exist some errors when running the code.
Otherwise, I thought that introducing the name directly in this way:
Column_Factor-> Factor
and writing Column_Factor in the two places in the code would solve it. But it doesn't.
Any ideas?
You could do something like this:
library(MASS)
#gets the column name of the factor, maybe check if there is only one factor column first
Column_Factor <- names(Smoke)[sapply(Smoke, class)=="factor"]
#creates the formula by pasting the name and the RHS
lda <- lda(as.formula(paste(Column_Factor,"~.",sep="")),data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
#selects the column using the variable
Tabla=table(Smoke[,Column_Factor],lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
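As a small variation, the formula can also be built with base R's reformulate() instead of paste(), together with the check for a single factor column suggested in the comment above. This is just a sketch on a toy version of the question's table; factor_cols, f and groups are names made up for the illustration, and the lda call itself is left as a comment so the snippet runs without MASS.

```r
# Toy data frame shaped like the question's table;
# stringsAsFactors = TRUE makes the Factor column a factor
Smoke <- data.frame(Factor = c("smoker", "smoker", "nsmok"),
                    Lung   = c(7, 8, 22),
                    Heart  = c(22, 21, 9),
                    Blood  = c(15, 12, 5),
                    stringsAsFactors = TRUE)

# Find the factor column and make sure there is exactly one
factor_cols <- names(Smoke)[sapply(Smoke, is.factor)]
stopifnot(length(factor_cols) == 1)

# Build the formula "Factor ~ ." without pasting strings together
f <- reformulate(".", response = factor_cols)
f

# The formula could then be passed on, e.g. MASS::lda(f, data = Smoke)

# Select the grouping column by name for the confusion table
groups <- Smoke[[factor_cols]]
```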

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
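The column-access options listed above can be checked against the data from the question. A minimal sketch, reading the same values from inline text instead of foo.txt so that it runs on its own:

```r
# Same values as in the question, read from inline text instead of foo.txt
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0", header = TRUE)

# Columns are addressed by their header names, not by "first" / "second"
area   <- tbl$area
energy <- tbl[["energy"]]

# All three access styles return the same numeric vector
identical(tbl$area, tbl[, 1])
identical(tbl$area, tbl[["area"]])

# A name that is not a column gives NULL, which is why tbl$first did not work
is.null(tbl$first)
```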
