I have a dataset:
https://docs.google.com/spreadsheets/d/1ZgyRQ2uTw-MjjkJgWCIiZ1vpnxKmF3o15a5awndttgo/edit?usp=sharing
that I'm trying to run a PCA on, aiming for a graph like the one provided in this post:
https://stats.stackexchange.com/questions/61215/how-to-interpret-this-pca-biplot-coming-from-a-survey-of-what-areas-people-are-i
However, an error doesn't seem to go away:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
arguments imply differing number of rows: 0, 1006
Below is my code; I have trouble finding the source of the error and would appreciate help detecting it. Any hints?
The goal is to produce a PCA graph grouped by the levels of Happiness.in.life. I modified the original code to fit my dataset. Originally, the group was determined by Gender, which has 2 levels. What I'm attempting to do is build a graph based on the 5 levels of Happiness.in.life. However, it doesn't seem that I can use the old code...
Thanks!
library(magrittr)
library(dplyr)
library(tidyr)
df <- happiness_reduced %>% dplyr::select(Happiness.in.life:Internet.usage, Happiness.in.life)
head(df)
vars_on_hap <- df %>% dplyr::select(-Happiness.in.life)
head(vars_on_hap)
group<-df$Happiness.in.life
fit <- prcomp(vars_on_hap)
pcData <- data.frame(fit$x)
vPCs <- fit$rotation[, c("PC1", "PC2")] %>% as.data.frame()
multiple <- min(
(max(pcData[,"PC1"]) - min(pcData[,"PC1"]))/(max(vPCs[,"PC1"])-
min(vPCs[,"PC1"])),
(max(pcData[,"PC2"]) - min(pcData[,"PC2"]))/(max(vPCs[,"PC2"])-
min(vPCs[,"PC2"]))
)
ggplot(pcData, aes(x=PC1, y=PC2)) +
geom_point(aes(colour=groups)) +
coord_equal() +
geom_text(data=vPCs,
aes(x = fit$rotation[, "PC1"]*multiple*0.82,
y = fit$rotation[,"PC2"]*multiple*0.82,
label=rownames(fit$rotation)),
size = 2, vjust=1, color="black") +
geom_segment(data=vPCs,
aes(x = 0,
y = 0,
xend = fit$rotation[,"PC1"]*multiple*0.8,
yend = fit$rotation[,"PC2"]*multiple*0.8),
arrow = arrow(length = unit(.2, 'cm')),
color = "grey30")
Here is an approach on how to plot the result of PCA in ggplot2:
library(tidyverse)
library(ggrepel)
Scaling the variables prior to PCA is usually a good idea (though not in all cases, for instance when they are all in the same units):
hapiness %>% #this is the data from Google Drive. In the future, try not to post such links on SO because they tend to become unusable after some time has passed
select(-Happiness.in.life) %>%
prcomp(center = TRUE, scale. = TRUE) -> fit
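To see why scaling matters, here is a minimal sketch using the built-in USArrests data. That dataset is an assumption made for illustration; it is not the happiness data from the question. The helper prop_var is also hypothetical.

```r
# Illustration with built-in data (USArrests), not the data from the question
unscaled <- prcomp(USArrests, center = TRUE, scale. = FALSE)
scaled   <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Hypothetical helper: proportion of total variance carried by each component
prop_var <- function(fit) fit$sdev^2 / sum(fit$sdev^2)

# Without scaling, the variable with the largest variance (Assault) dominates PC1
round(prop_var(unscaled), 3)

# After standardising, the variance is spread more evenly across components
round(prop_var(scaled), 3)
```

If all variables share a unit and their spread is itself meaningful, the unscaled fit may be the one you want.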
Now we can proceed to plotting the fit:
fit$x %>% #coordinates of the points are in x element
as.data.frame()%>% #convert matrix to data frame
select(PC1, PC2) %>% #select the first two PC
bind_cols(hapiness = as.factor(hapiness$Happiness.in.life)) %>% #add the coloring variable
ggplot() +
geom_point(aes(x = PC1, y = PC2, colour = hapiness)) + #plot points and color
geom_segment(data = fit$rotation %>% #data we want plotted by geom_segment is in rotation element
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(), #move row names into a column so they can be used as labels later
aes(x = 0, y = 0, xend = PC1 * 7, yend = PC2* 7, group = rowname), #I scaled the rotation by 7 so it fits in the plot nicely
arrow = arrow(angle = 20, type = "closed", ends = "last",length = unit(0.2,"cm")),
color = "grey30") +
geom_text_repel(data = fit$rotation %>%
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(),
aes(x = PC1*7,
y = PC2*7,
label = rowname)) +
coord_equal(ratio = fit$sdev[2]^2 / fit$sdev[1]^2) + #I like setting the ratio to the ratio of eigen values
xlab(paste("PC1", round(fit$sdev[1]^2/ sum(fit$sdev^2) *100, 2), "%")) +
ylab(paste("PC2", round(fit$sdev[2]^2/ sum(fit$sdev^2) *100, 2), "%")) +
theme_bw()
Look at all the happy people on the left. This is hard to notice with the default colors, so I suggest using the jco palette from the ggpubr library, i.e. adding scale_color_manual(values = get_palette('jco', 5)) to the plot.
Quite a similar plot can be achieved with the ggord library:
library(ggord)
ggord(fit, grp_in = as.factor(hapiness$Happiness.in.life),
size = 1, ellipse = FALSE, ext = 1.2, vec_ext = 5)
The major difference is that ggord uses equal scaling for the axes. Also, I scaled the rotation by 5 instead of 7 as in the first plot.
As you can see I do not like many intermediate data frames.
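One thing that stands out in the question's code: the grouping vector is created as group, but the plot maps colour = groups, and ggplot2 is never attached with library(). Whether or not that is the whole cause of the error, keeping the grouping variable inside the plotted data frame avoids this kind of mismatch. A minimal sketch, assuming the built-in iris data as a stand-in for the happiness data:

```r
library(ggplot2)

# Built-in iris data stands in for the happiness data (assumption for illustration)
fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Keep the grouping variable in the same data frame as the scores, so the
# name used in aes() exists and has exactly one value per plotted row
pcData <- data.frame(fit$x, group = factor(iris$Species))

p <- ggplot(pcData, aes(x = PC1, y = PC2, colour = group)) +
  geom_point() +
  coord_equal()
p
```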
Reproduced from this code:
library(haven)
library(survey)
library(dplyr)
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
nhanesAnalysis <- nhanesDemo %>%
mutate(LowIncome = case_when(
INDFMIN2 < 40 ~ T,
T ~ F
)) %>%
# Select the necessary columns
select(INDFMIN2, LowIncome, persWeight, psu, strata)
# Set up the design
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
svyhist(~log10(INDFMIN2), design=nhanesDesign, main = '')
How do I color the histogram by independent variable, say, LowIncome? I want to have two separate histograms, one for each value of LowIncome. Unfortunately I picked a bad example, but I want them to be see-through in case their values overlap.
If you want to plot a histogram from your survey design object, you can get its data with model.frame (this is what svyhist does under the hood). To get the histogram filled by group, you could use this data frame inside ggplot:
library(ggplot2)
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(alpha = 0.5, color = "gray60", breaks = 0:20 / 10) +
theme_classic()
Edit
As Thomas Lumley points out, this does not incorporate sampling weights, so if you wanted this you could do:
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(aes(weight = persWeight), alpha = 0.5,
color = "gray60", breaks = 0:20 / 10) +
theme_classic()
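A quick way to check what the weight aesthetic does is to pull the computed bar heights out of the built plot. This is a minimal sketch on made-up data; the toy data frame and its column names are hypothetical.

```r
library(ggplot2)

# Toy data (hypothetical): three observations fall in the first bin, one in the second
toy <- data.frame(x = c(0.2, 0.4, 0.6, 1.5),
                  w = c(10, 20, 30, 5))

p <- ggplot(toy, aes(x)) +
  geom_histogram(aes(weight = w), breaks = c(0, 1, 2))

# The computed bar heights are sums of the weights in each bin, not row counts
ggplot_build(p)$data[[1]]$count
```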
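To demonstrate that this approach works, we can replicate Thomas's approach in ggplot using the data example from ?svyhist (the first two lines below recreate that example's design). To get the uneven bin sizes (if this is desired), we need two histogram layers, though I'm guessing this would not be required for most use cases.
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)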
ggplot(model.frame(dstrat), aes(enroll)) +
geom_histogram(aes(fill = "E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype == "E"),
breaks = 0:35 * 100,
position = "identity", col = "gray50") +
geom_histogram(aes(fill = "Not E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype != "E"),
position = "identity", col = "gray50",
breaks = 0:7 * 500) +
scale_fill_manual(NULL, values = c("#00880020", "#88000020")) +
theme_classic()
You can't just extract the data and use ggplot, because that won't use the weights and so misses the whole point of svyhist. You can use the add=TRUE argument, though. You do need to set the x and y axis ranges correctly to make sure the whole plot is visible.
Using the data example from ?svyhist:
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
svyhist(~enroll, subset(dstrat,stype=="E"), col="#00880020",ylim=c(0,0.003),xlim=c(0,3500))
svyhist(~enroll, subset(dstrat,stype!="E"), col="#88000020",add=TRUE)
This is a follow-up question of sorts to "ggplot2 stat_density_2d: how to fix polygon errors at the dataset bounding box edges?"
I am trying to animate a 2D density estimate (ggplot2::geom_density_2d_filled) over time so that each frame adds data to what was presented before. So far I have the gganimate animation working for the 2D density estimate so that each point in time (the dataframe column monthly) is shown on its own, but I have no idea how to proceed from here.
Is it possible to use gganimate to cumulatively animate geom_density_2d_filled? Or could this be achieved by manipulating the source dataframe somehow?
Please see my code below:
library(dplyr)
library(sf)
library(geofi)
library(ggplot2)
library(gganimate)
# Finland municipalities
muns <- geofi::get_municipalities(year = 2022)
# Create sample points
points <- sf::st_sample(muns, 240) %>% as.data.frame()
points[c("x", "y")] <- sf::st_coordinates(points$geometry)
monthly <- seq(as.Date("2020/1/1"), by = "month", length.out = 24) %>%
rep(., each = 10)
points$monthly <- monthly
p <- ggplot() +
geom_density_2d_filled(data = points,
aes(x = x, y = y, alpha = after_stat(level))) +
geom_sf(data = muns,
fill = NA,
color = "black") +
coord_sf(default_crs = sf::st_crs(3067)) +
geom_point(data = points,
aes(x = x, y = y),
alpha = 0.1) +
scale_alpha_manual(values = c(0, rep(0.75, 13)),
guide = "none") +
# gganimate specific
transition_states(monthly,
transition_length = 1,
state_length = 40) +
labs(title = "Month: {closest_state}") +
ease_aes("linear")
animate(p, renderer = gganimate::gifski_renderer())
gganimate::anim_save(filename = "so.gif", path = "anim")
The resulting animation is seen below. Could this be portrayed cumulatively?
To get cumulative figures, the easiest way is to repeat each month's data in all future months.
Using the tidyverse (map comes from purrr and unnest from tidyr, so load those as well), add the following statement before you define p...
points <- points %>%
mutate(monthly = map(monthly, ~seq(., max(monthly), by = "month"))) %>%
unnest(monthly)
Note that a cumulative density will not necessarily increase over time - if you want an animation that steadily increases you might want to add contour_var = "count" to your geom_density... term.
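To see what the expansion does, here is a minimal sketch on a toy data frame. The names toy, id and cumulative are hypothetical; only the mutate/unnest pattern is taken from the statement above.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy data (hypothetical): one point per month for three months
toy <- data.frame(
  id      = 1:3,
  monthly = seq(as.Date("2020-01-01"), by = "month", length.out = 3)
)

cumulative <- toy %>%
  # for each row, list its own month and every later month up to the last one
  mutate(monthly = map(monthly, ~ seq(.x, max(monthly), by = "month"))) %>%
  # one row per (point, month in which that point should be shown)
  unnest(monthly)

# Number of points visible in each animation state grows month by month
count(cumulative, monthly)
```

Because each point is copied into every later month, the data grows roughly quadratically with the number of months, which may matter for long series.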
I want to use ggplot2 to create a path with arrows in a plot. However, I have a lot of data points and so I only want the arrow on every nth datapoint. I adapted this answer for every nth label to put an observation point every nth data point, but if I try to use this with path I get straight lines between these points. I just want the arrow head.
The MWE below shows my attempt to get the two paths working together (I do want the full path as a line), and what worked for points (that I want to be directional arrows). In my real data set the arrows will point in different directions (so I can't just use a static arrow head as the observation symbol). I am also working with other filtering within the plots, and so creating new data frames that only keep some points is not a convenient solution.
MWE
library(tidyverse)
library(tidyr)
library(dplyr)
x <- seq(from = -100, to = 100, by = 0.01)
y <- x^3 - 2 * x + x
df<- data.frame(x,y)
df$t<- seq(1:nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
geom_path(size = 0.1, aes(colour = t)) +
geom_path(aes(colour = t),data = . %>% filter(row_number() %% 2000 == 0), arrow = arrow(type = 'open', angle = 30, length = unit(0.1, "inches")))
ggplot(data = df, aes(x = x, y = y)) +
geom_path(size = 0.1, aes(colour = t)) +
geom_point(aes(colour = t),data = . %>% filter(row_number() %% 2000 == 0))
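You can try adding a grouping variable. Here gl() puts each consecutive pair of filtered points into its own group, so geom_path draws a separate arrow for each pair instead of one line through all of them.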
ggplot(data = df, aes(x = x, y = y)) +
geom_path(aes(colour = t, group =factor(gr)),data = . %>% filter(row_number() %% 2000 == 0) %>%
mutate(gr = gl(n()/2, 2)),
arrow = arrow(type = 'open', angle = 30, length = unit(0.1, "inches")))
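The pairing done by gl() can be checked in isolation. A minimal sketch, where idx is a hypothetical stand-in for the filtered rows; the ceiling() variant is an alternative not used in the answer, shown because it does not require an even number of rows:

```r
# gl(n, k) generates a factor with n levels, each repeated k times
idx <- 1:6                      # stand-in for six filtered rows (hypothetical)
gr  <- gl(length(idx) / 2, 2)   # groups: 1, 1, 2, 2, 3, 3
data.frame(idx, gr)

# Alternative pairing that also tolerates an odd number of rows
gr_alt <- ceiling(seq_len(5) / 2)   # groups: 1, 1, 2, 2, 3
gr_alt
```

With an odd count, the last group holds a single point, so no segment is drawn for it.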
I am using ggforce to create a plot like this.
My goal is to facet this type of plot.
For background on how the chart was made, check out update 3 on this question. The only modification that I have made was adding a geom_segment between the x axis and the Y value positions.
The reason why I believe faceting this graph is either difficult, or even impossible, is because continuous value x coordinates are used to determine where the geom_arc_bar is positioned in space.
My only idea for getting this to work has been supplying each "characteristic" that I want to facet with a set of x coordinates (1,2,3). Initially, as I will demonstrate in my code, I worked with a set of highly curated data. Ideally, I would like to scale this to a dataset with many variables.
In the example graph that I have provided, the Y value is from table8, filtered for rows with "DFT". The area of the half-circles is proportional to the values of DDFS and FDFS from table9. Ideally, I would like to be able to create a function allowing for the easy creation of these graphs, with perhaps 3 parameters, the data for the y value, and for both half circles.
Here is my data.
Here is the code that I have written thus far.
For making a single plot
#Filter desired Age and Measurement
table9 %>%
filter(Age == "6-11" & Measurement != 'DFS' ) %>%
select( SurveyYear, Total , Measurement ) %>%
arrange(SurveyYear) %>%
dplyr::rename(Percent = Total) -> table9
#Do the same for table 8.
table8 %>%
filter(Age == "6-11" & Measurement != "DS" & Measurement != "FS") %>%
select(SurveyYear, Total) %>%
dplyr::rename(Y = Total)-> table8
table8 <- table8 %>%
bind_rows(table8) %>%
arrange(Y) %>%
add_column(start = rep(c(-pi/2, pi/2), 3), x = c(1,1,2,2,3,3))
table8_9 <- bind_cols(table8,table9) %>%
select(-SurveyYear1)
#Create the plot
ggplot(table8_9) + geom_segment( aes(x=x, xend=x, y=0, yend=Y), size = 0.5, linetype="solid") +
geom_arc_bar(aes(x0 = x, y0 = Y, r0 = 0, r = sqrt((Percent*2)/pi)/20,
start = start, end = start + pi, fill = Measurement),
color = "black") + guides(fill = guide_legend(title = "Type", reverse = T)) +
guides(fill = guide_legend(title = "Measurement", reverse = F)) +
xlab("Survey Year") + ylab("Mean dfs") + coord_fixed() + theme_pubr() +
scale_y_continuous(expand = c(0, 0), limits = c(0, 5.5)) +
scale_x_continuous(breaks = 1:3, labels = paste0(c("1988-1994", "1999-2004", "2011-2014"))) +
scale_fill_discrete(labels = c("ds/dfs", "fs/dfs")) -> lolliPlot
lolliPlot
Attempt at many plots
#Filter for "DFS"
table8 <- table8 %>%
filter(Measurement=="DFS")
#Duplicate DF vertically, and add column specifying the start point for the arcs.
table8 <- table8 %>%
bind_rows(table8) %>%
add_column(start = rep(c(-pi/2, pi/2), length(.$SurveyYear)/2), x = rep(x = c(1,2,3),length(.$SurveyYear)/3)) %>%
arrange(Age, x)
#Bind the two tables together, removing all of the characteristic columns from table 8.
table8_9 <- bind_cols(table8,table9) %>%
select(-Age1, -SurveyYear1, -Measurement) %>%
gather(key = Variable, value = Y, -x,-start,-Age, -SurveyYear, -Measurement1, -Total1, -Male1, -Female1, -'White, non-Hispanic1', -'Black, non-hispanic1', -'Mexican American1', -'Less than 100% FPG1', -'100-199% FPG1', -'Greater than 200% FPG1')
This is where I get stuck. I can't figure out a way to format the data so that I can facet the graph. If anybody has any ideas or advice, I would greatly appreciate it.
I'm currently finishing off my Masters project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 by 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like it was one graph, but I have had issues with the endpoints matching up since each facet is a separate graphical device. Can anyone give me some advice on how to do this? I am not particularly partial to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would work just to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val)) # plyr:: prefix so summarise is found without attaching plyr
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))
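If plyr is not available, the same table of medians can be built in base R. A minimal sketch, assuming the same df as above (rebuilt here so the block is self-contained):

```r
# Same toy setup as above: first and fifth columns of iris
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")

# Base R grouped median: one row per level of rho
med <- aggregate(val ~ rho, data = df, FUN = median)
names(med)[2] <- "m"   # match the column name used by geom_vline above
med
```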
You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
geom = "line", fun = "median") # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
This produces a violin for each value of rho, and joins the medians for each violin.