R: How to visualize the relationship between continuous and categorical data

I have the following data.frame which contains 3 categorical variables (different types of vascular pathology) and 1 continuous variable (Output). I'm interested in seeing the relationship between Output and the different types of vascular pathologies, i.e. is higher/lower output associated with mild/severe pathology?
> dput(df)
structure(list(Vascular_Pathology_M = structure(c(1L, 2L, 3L,
1L, 1L, 2L, 4L, 3L, 1L, 2L), .Label = c("Absent", "Mild", "Mild/Moderate",
"Moderate/Severe", "Severe"), class = "factor"), Vascular_Pathology_F = structure(c(4L,
2L, 1L, 1L, 1L, 1L, 2L, 4L, 1L, 1L), .Label = c("Absent", "Mild",
"Mild/Moderate", "Moderate/Severe", "Severe"), class = "factor"),
Vascular_Pathology_O = structure(c(1L, 3L, 4L, 3L, 1L, 2L,
1L, 1L, 1L, 2L), .Label = c("Absent", "Mild", "Mild/Moderate",
"Moderate/Severe"), class = "factor"), Output = c(1.01789418758932,
1.05627630598801, 1.49233946102323, 1.38192374975672, 1.13097652937671,
0.861306979571144, 0.707820561413699, 1.16628243128399, 0.983163398006992,
1.23972603843843)), .Names = c("Vascular_Pathology_M", "Vascular_Pathology_F",
"Vascular_Pathology_O", "Output"), row.names = c(1L, 3L, 4L,
5L, 6L, 7L, 8L, 10L, 11L, 12L), class = "data.frame")
> df
Vascular_Pathology_M Vascular_Pathology_F Vascular_Pathology_O Output
1 Absent Moderate/Severe Absent 1.0178942
3 Mild Mild Mild/Moderate 1.0562763
4 Mild/Moderate Absent Moderate/Severe 1.4923395
5 Absent Absent Mild/Moderate 1.3819237
6 Absent Absent Absent 1.1309765
7 Mild Absent Mild 0.8613070
8 Moderate/Severe Mild Absent 0.7078206
10 Mild/Moderate Moderate/Severe Absent 1.1662824
11 Absent Absent Absent 0.9831634
12 Mild Absent Mild 1.2397260

You could look at the interaction of the various pathologies. For example, with a barplot
## Make the interaction variable
df$interact <- interaction(df[, 1:3], sep="_")
## Look at means of groups
library(dplyr)
library(ggplot2)
df %>% group_by(interact) %>%
dplyr::summarise(Output = mean(Output)) -> means
ggplot(means, aes(interact, Output))+
geom_bar(stat="identity") +
theme(axis.text=element_text(angle=90)) +
xlab("Interaction")
or with points
ggplot(df, aes(interact, Output))+
geom_point() +
theme(axis.text=element_text(angle=45, hjust=1)) +
xlab("Interaction") +
geom_point(data=means, col="red") +
ylim(0, 1.6)

You can simply plot the output against the categorical variables
plot(df[, 1], df[, 4])
plot(df[, 2], df[, 4])
plot(df[, 3], df[, 4])
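Because the first three columns are factors, each plot() call above draws box plots of Output per severity level. Here is a minimal base-R sketch of the numbers behind the first of those plots. It assumes a stand-in data frame called dat (a hypothetical name) that holds only Vascular_Pathology_M and Output, transcribed from the question and rounded to four decimals.

```r
# Stand-in data: values copied from the question; `dat` is a hypothetical name
sev_levels <- c("Absent", "Mild", "Mild/Moderate", "Moderate/Severe", "Severe")
dat <- data.frame(
  Vascular_Pathology_M = factor(
    c("Absent", "Mild", "Mild/Moderate", "Absent", "Absent",
      "Mild", "Moderate/Severe", "Mild/Moderate", "Absent", "Mild"),
    levels = sev_levels
  ),
  Output = c(1.0179, 1.0563, 1.4923, 1.3819, 1.1310,
             0.8613, 0.7078, 1.1663, 0.9832, 1.2397)
)

# Number of observations and mean Output per severity level
n_per_level    <- table(dat$Vascular_Pathology_M)
mean_per_level <- tapply(dat$Output, dat$Vascular_Pathology_M, mean)
print(n_per_level)
print(round(mean_per_level, 3))

# Box plot of Output by severity level, written with the formula interface
boxplot(Output ~ Vascular_Pathology_M, data = dat,
        xlab = "Vascular pathology (M)", ylab = "Output")
```

With only ten rows, some levels hold a single observation (and "Severe" holds none), so it is worth reading the counts alongside the plot before interpreting any trend.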

You have a 4 dimensional dataset. One option is to do a scatter plot (x/y = two dimensions), in a small multiple series (there's one more dimension), and map the Output variable to something visual like size (there's a fourth dimension).
For example, put your data in a data.frame called my_dat (df is already the name of a function in R). Points are jittered to show the multiple observations per position, and colored by Y position to make clear which point goes with which category.
library(ggplot2)
my_dat$O_with_labels <-
factor(my_dat[, 3], labels=paste('Vasc Path O:', levels(my_dat[, 3])))
ggplot(my_dat,
aes(x=Vascular_Pathology_M, y=Vascular_Pathology_F)) +
geom_jitter(aes(size=Output, color=Vascular_Pathology_F)) +
facet_wrap(~O_with_labels) +
theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))

Related

How can I create my own factor column in a dataframe?

I have a dataframe and a task: "Define your own criterion of income level, and split data according to levels of this criterion"
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no",
"yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L
), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25),
income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991,
0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384
), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033,
91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no",
"yes"), class = "factor"), selfemp = structure(c(1L, 1L,
1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,
58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L
), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13,
0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
I defined this criterion in this way
inc_l<-c("low","average","above average","high")
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 && x<10, 'above average',
ifelse(x>=3 && x<=6,'average',
ifelse(x<3, 'low'))))
}
And added a column like this
creditcard<-transform(creditcard, incom_level=factor(sapply(creditcard$income, grad_fact), inc_l, ordered = TRUE))
But I need to avoid sapply for this, so I tried to do it in this way
creditcard<-transform(creditcard, incom_level=factor(grad_fact(creditcard$income),inc_l, ordered = TRUE))
But in this case every element of the column takes the value "average", and I don't understand why. Please help me figure out the problem.
We may need to change the && to &, as && returns a single TRUE/FALSE rather than one result per element. According to ?"&&"
& and && indicate logical AND and | and || indicate logical OR. The shorter forms performs elementwise comparisons in much the same way as arithmetic operators. The longer forms evaluates left to right, proceeding only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.
In addition, the last ifelse didn't have a "no" case:
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 & x<10, 'above average',
ifelse(x>=3 & x<=6,'average',
ifelse(x<3, 'low', NA_character_))))
}
and then use
creditcard <- transform(creditcard, incom_level=
factor(grad_fact(income),inc_l, ordered = TRUE))
-output
creditcard
card reports age income share expenditure owner selfemp dependents days majorcards active income_fam incom_level
1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3 54 1 12 1.130000 average
2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3 34 1 13 0.605000 low
3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4 58 1 5 0.900000 average
4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0 25 1 7 2.540000 low
5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2 64 1 5 3.262233 above average
6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0 54 1 1 2.500000 low
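To see why the original version returned "average" everywhere: in R versions where && accepted longer vectors, it looked only at the first element, so the inner ifelse() calls received a length-one test and returned a length-one string, which the outermost ifelse() then recycled. The sketch below emulates that by indexing the first element explicitly (as far as I know, R 4.3.0 and later raise an error for && on longer vectors instead) and compares it with the elementwise version. The income vector is copied from the question; old_style and vectorised are illustrative names.

```r
income <- c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5)
inc_l  <- c("low", "average", "above average", "high")

# Emulating `&&` on a vector: only the first element (4.52) is tested
is_above_avg <- income[1] > 6 & income[1] < 10    # FALSE
is_avg       <- income[1] >= 3 & income[1] <= 6   # TRUE
inner <- ifelse(is_above_avg, "above average",
                ifelse(is_avg, "average", "low"))  # length-one "average"

# The outer test is a full-length vector of FALSE, so `inner` is recycled
old_style <- ifelse(income >= 10, "high", inner)
print(old_style)

# Elementwise `&` tests every income separately
vectorised <- ifelse(income >= 10, "high",
              ifelse(income > 6 & income < 10, "above average",
              ifelse(income >= 3 & income <= 6, "average",
              ifelse(income < 3, "low", NA_character_))))
print(factor(vectorised, levels = inc_l, ordered = TRUE))
```

The first print shows six copies of "average"; the second matches the incom_level column in the output above.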

Add Greek letters on the ggplot axis, and add main Y axis elements [duplicate]

This question already has answers here: "Customize axis labels" and "How to use Greek symbols in ggplot2?"
Hello everyone and happy new year! I need help improving a ggplot figure.
I have a dataframe (df1) that looks like this:
x y z
1 a 1 -0.13031299
2 b 1 0.71407346
3 c 1 -0.15669153
4 d 1 0.39894708
5 a 2 0.64465669
6 b 2 -1.18694632
7 c 2 -0.25720456
8 d 2 1.34927378
9 a 3 -1.03584455
10 b 3 0.14840876
11 c 3 0.50119220
12 d 3 0.51168810
13 a 4 -0.94795328
14 b 4 0.08610489
15 c 4 1.55144239
16 d 4 0.20220334
Here is the data as dput() and my code:
df1 <- structure(list(x = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"
), class = "factor"), y = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L), z = c(-0.130312994048691, 0.714073455094197,
-0.156691533710652, 0.39894708481517, 0.644656691110372, -1.18694632145378,
-0.257204564112021, 1.34927378214664, -1.03584454605617, 0.148408762003154,
0.501192202628166, 0.511688097742773, -0.947953281835912, 0.0861048893885463,
1.55144239199118, 0.20220333664676)), class = "data.frame", row.names = c(NA,
-16L))
library(ggplot2)
df1$facet <- ifelse(df1$x %in% c("c", "d"), "cd", as.character(df1$x))
p1 <- ggplot(df1, aes(x = x, y = y))
p1 <- p1 + geom_tile(aes(fill = z), colour = "grey20")
p1 <- p1 + scale_fill_gradient2(
low = "darkgreen",
mid = "white",
high = "darkred",
breaks = c(min(df1$z), max(df1$z)),
labels = c("Low", "High")
)
p1 + facet_grid(.~facet, space = "free", scales = "free_x") +
theme(strip.text.x = element_blank())
With this code (inspired from here) I get this figure:
But I wondered if someone had an idea for how to:
Add Greek letters in the x axis text (here alpha and beta)
Add sub Y axis elements (here noted as Element 1-3), where Element1 runs from 0 to 1, Element2 from 1 to 3, and Element3 from 3 to the end
The result should be something like:

Raking Weights on Nested Data: R Output Doesn't Match Stata Output

Introduction
I have multilevel survey data of teachers nested in schools. I have manually calculated design weights and non-response adjustment weights based on probability of selection and response rate (oldwt below). Now I want to create post-stratification weights by raking on two marginals: the sex (male or female) and the employment status (full-time or not full-time) of the teacher. With the help of kind people at Statalist (see here), I have seemingly done this successfully in Stata. However, when I try to replicate the results in R, I get vastly different output.
Sample Data
#Variables
#school : unique school id
#caseid : unique teacher id
#oldwt : the product of the design weight and the non-response adjustment
#gender : male or female
#timecat : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime : a combined factor variable of school x timecat
#genderp : the school's true population for gender
#fullp : the school's true population for timecat
#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))
Raking Code
(See here and here for in-depth examples of using anesrake in R).
# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)
# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")
# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime")
# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = F, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave
Comparison with Stata Output
The issue is that the output from R doesn't match the output from Stata (even if I set force1 = TRUE), and the Stata output seems to be the right one, which makes me think my sloppy R code is wrong. Is that the case?
caseid R Stata
1 0.070 0.633
2 0.152 1.367
3 0.404 3.633
4 0.187 1.683
5 0.187 1.683
6 0.143 1.146
7 0.232 1.854
8 0.173 1.382
9 0.107 0.854
10 0.173 1.382
11 0.173 1.382
The distribution of your targets in R should sum to one and represent the distribution in your population. Look at my example. I think the force1 option will not compute the distribution you want unless each school has the same population weight. This is what force1 is doing:
targets[[1]]/sum(targets[[1]])
1.Female 1.Male 2.Female 2.Male
0.278 0.222 0.125 0.375
Is that what you want?
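A small check of that arithmetic, using the per-school proportions from the sample data. Each school's gender proportions sum to one on their own, so the combined school x gender vector sums to the number of schools (two here) until it is rescaled. Whether equal weight per school is the right population distribution is an assumption about the data that the question does not settle.

```r
# Proportions copied from the sample data (genderp per scgender level)
genderp <- c("1.Female" = 0.556, "1.Male" = 0.444,
             "2.Female" = 0.25,  "2.Male" = 0.75)

# One unit of probability per school, so the raw targets do not sum to one
print(sum(genderp))

# Dividing by the total rescales the targets to sum to one,
# which implicitly gives both schools the same population share
rescaled <- genderp / sum(genderp)
print(round(rescaled, 3))
print(sum(rescaled))
```

If the schools differ in size, the within-school proportions would instead need to be multiplied by each school's share of the total teacher population before raking.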

Why ggplot2 pie-chart facet confuses the facet labelling

I have two types of data that looks like this:
Type 1 (http://dpaste.com/1697615/plain/)
Cluster-6 abTcells 1456.74119
Cluster-6 Macrophages 5656.38478
Cluster-6 Monocytes 4415.69078
Cluster-6 StemCells 1752.11026
Cluster-6 Bcells 1869.37056
Cluster-6 gdTCells 1511.35291
Cluster-6 NKCells 1412.61504
Cluster-6 DendriticCells 3326.87741
Cluster-6 StromalCells 2008.20603
Cluster-6 Neutrophils 12867.50224
Cluster-3 abTcells 471.67118
Cluster-3 Macrophages 1000.98164
Cluster-3 Monocytes 712.92273
Cluster-3 StemCells 557.88648
Cluster-3 Bcells 599.94109
Cluster-3 gdTCells 492.61994
Cluster-3 NKCells 524.42522
Cluster-3 DendriticCells 647.28811
Cluster-3 StromalCells 876.27875
Cluster-3 Neutrophils 1025.24105
And type two, (http://dpaste.com/1697602/plain/).
These values are identical with Cluster-6 in type 1 above:
abTcells 1456.74119
Macrophages 5656.38478
Monocytes 4415.69078
StemCells 1752.11026
Bcells 1869.37056
gdTCells 1511.35291
NKCells 1412.61504
DendriticCells 3326.87741
StromalCells 2008.20603
Neutrophils 12867.50224
But when I plot the type 1 data with this code:
library(ggplot2);
library(RColorBrewer);
filcol <- brewer.pal(10, "Set3")
dat <- read.table("http://dpaste.com/1697615/plain/")
ggplot(dat,aes(x=factor(1),y=dat$V3,fill=dat$V2))+
facet_wrap(~V1)+
xlab("") +
ylab("") +
geom_bar(width=1,stat="identity",position = "fill") +
scale_fill_manual(values = filcol,guide = guide_legend(title = "")) +
coord_polar(theta="y")+
theme(strip.text.x = element_text(size = 8, colour = "black", angle = 0))
Ready data:
> dput(dat)
structure(list(V1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Cluster-3",
"Cluster-6"), class = "factor"), V2 = structure(c(1L, 5L, 6L,
9L, 2L, 4L, 8L, 3L, 10L, 7L, 1L, 5L, 6L, 9L, 2L, 4L, 8L, 3L,
10L, 7L), .Label = c("abTcells", "Bcells", "DendriticCells",
"gdTCells", "Macrophages", "Monocytes", "Neutrophils", "NKCells",
"StemCells", "StromalCells"), class = "factor"), V3 = c(1456.74119,
5656.38478, 4415.69078, 1752.11026, 1869.37056, 1511.35291, 1412.61504,
3326.87741, 2008.20603, 12867.50224, 471.67118, 1000.98164, 712.92273,
557.88648, 599.94109, 492.61994, 524.42522, 647.28811, 876.27875,
1025.24105)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-20L))
I get the following figure:
Notice that the facet labels are swapped: the panel labelled Cluster-3 should be Cluster-6, where Neutrophils take up the larger proportion.
How can I resolve the problem?
With the type 2 data there is no problem at all:
library(ggplot2)
df <- read.table("http://dpaste.com/1697602/plain/");
library(RColorBrewer);
filcol <- brewer.pal(10, "Set3")
ggplot(df,aes(x=factor(1),y=V2,fill=V1))+
geom_bar(width=1,stat="identity")+coord_polar(theta="y")+
theme(axis.title = element_blank())+
scale_fill_manual(values = filcol,guide = guide_legend(title = "")) +
theme(strip.text.x = element_text(size = 8, colour = "black", angle = 0))
Ready data:
> dput(df)
structure(list(V1 = structure(c(1L, 5L, 6L, 9L, 2L, 4L, 8L, 3L,
10L, 7L), .Label = c("abTcells", "Bcells", "DendriticCells",
"gdTCells", "Macrophages", "Monocytes", "Neutrophils", "NKCells",
"StemCells", "StromalCells"), class = "factor"), V2 = c(1456.74119,
5656.38478, 4415.69078, 1752.11026, 1869.37056, 1511.35291, 1412.61504,
3326.87741, 2008.20603, 12867.50224)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-10L))
It's because you reference the data frame by name (dat$V3, dat$V2) inside aes(...). This fixes the problem:
ggplot(dat,aes(x=factor(1),y=V3,fill=V2))+
facet_wrap(~V1)+
xlab("") +
ylab("") +
geom_bar(width=1,stat="identity",position = "fill") +
scale_fill_manual(values = filcol,guide = guide_legend(title = "")) +
coord_polar(theta="y")+
theme(strip.text.x = element_text(size = 8, colour = "black", angle = 0))
In defining the facets, you reference V1 in the context of the default dataset, and ggplot sorts alphabetically by level (so "Cluster-3" comes first). In your call to aes(...) you reference dat$V3 directly, so ggplot goes out of the context of the default dataset to the original dataframe. There, Cluster-6 is first.
As a general comment, one should never reference data in aes(...) outside the context of the dataset defined with data=.... So:
ggplot(data=dat, aes(y=V3...)) # good
ggplot(data=dat, aes(y=dat$V3...)) # bad
Your problem is a perfect example of why the second option is bad.
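The ordering mismatch can be seen without ggplot2 at all. This sketch uses a cut-down copy of the data (two cell types per cluster, values taken from the question) and only illustrates the difference between factor-level order and row order; it does not reproduce ggplot2's internals.

```r
# Cut-down copy of the type 1 data: Cluster-6 rows come first, as in the file
dat <- data.frame(
  V1 = factor(c("Cluster-6", "Cluster-6", "Cluster-3", "Cluster-3")),
  V2 = factor(c("Macrophages", "Neutrophils", "Macrophages", "Neutrophils")),
  V3 = c(5656.38478, 12867.50224, 1000.98164, 1025.24105)
)

# Facet panels follow the factor levels, which sort alphabetically
print(levels(dat$V1))

# A raw dat$V3 vector keeps the original row order, where Cluster-6 comes first
print(as.character(dat$V1[1]))

# Splitting by the factor keeps every value attached to its own cluster
by_cluster <- split(dat$V3, dat$V1)
print(by_cluster)
```

Referring to bare column names inside aes() lets the plotting code do this kind of grouping itself, so values and panel labels cannot drift apart.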

return a plot for each level of a factor in r

I want to produce an X,Y plot for each separate ID from the dataframe 'trajectories' :
**trajectories**
X Y ID
2 4 1
1 6 1
2 4 1
1 8 2
3 7 2
1 5 2
1 4 3
1 6 3
7 4 3
I use the code:
sapply(unique(trajectories$ID),(plot(log(abs(trajectories$X)+0.01),log((trajectories$Y)+0.01))))
But this does not seem to work since the error:
Error in match.fun(FUN) :
c("'(plot(log(abs(trajectories$X)+0.01),log((trajectories$Y)' is not a function, character or symbol", "' 0.01)))' is not a function, character or symbol")
Is there a way to rewrite this code so that i get a separate plot for each ID?
You can use the ggplot2 package for this nicely:
library(ggplot2)
trajectories <- structure(list(X = c(2L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 7L), Y = c(4L, 6L, 4L, 8L, 7L, 5L, 4L, 6L, 4L), ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), .Names = c("X", "Y", "ID"), class = "data.frame", row.names = c(NA, -9L))
ggplot(trajectories, aes(x=log(abs(X) + 0.01), y=log(Y))) +
geom_point() +
facet_wrap( ~ ID)
For what it's worth, the reason your code is failing is exactly what the error says: the second argument to sapply needs to be a function. You could define your plot code as a function:
myfun <- function(DF) {
plot(log(abs(DF$X) + 0.01), log(DF$Y))
}
But this will not split your data on ID. You could also use the plyr or data.table package to do this splitting and plotting but you will need to write the plots to a file or they will close as each new plot is created.
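A minimal base-R sketch of the missing splitting step, assuming the goal is one panel per ID on a single device: split() breaks the data frame up by ID and Map() calls a plotting function on each piece. The helper name plot_one and the side-by-side layout are illustrative choices, not part of the original answer.

```r
trajectories <- data.frame(
  X  = c(2, 1, 2, 1, 3, 1, 1, 1, 7),
  Y  = c(4, 6, 4, 8, 7, 5, 4, 6, 4),
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# One data frame per ID
by_id <- split(trajectories, trajectories$ID)

# Plot a single piece; `id` is only used for the panel title
plot_one <- function(d, id) {
  plot(log(abs(d$X) + 0.01), log(d$Y + 0.01),
       xlab = "log(|X| + 0.01)", ylab = "log(Y + 0.01)",
       main = paste("ID", id))
}

# Draw the panels side by side, then restore the previous layout
old_par <- par(mfrow = c(1, length(by_id)))
invisible(Map(plot_one, by_id, names(by_id)))
par(old_par)
```

Because all panels share one device via par(mfrow = ...), nothing is overwritten as each new plot is drawn; for many IDs, writing each panel to its own file may be more practical.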
The lattice package is useful here.
library(lattice)
# Make the data frame
X <- c(2,1,2,1,3,1,1,1,7)
Y <- c(4,6,4,8,7,5,4,6,4)
ID <- c(1,1,1,2,2,2,3,3,3)
trajectories <- data.frame(X=X, Y=Y, ID=ID)
# Plot the graphs as a scatter plot by ID
xyplot(Y~X | ID,data=trajectories)
# Another useful solution is to treat ID as a factor
# Now, the individual plots are labeled
xyplot(Y~X | factor(ID),data=trajectories)
This is possible even with base R. Using the iris dataset:
coplot(Sepal.Length ~ Sepal.Width | Species, data = iris)
