Related
I recently conducted a survey within an IT company concerning user satisfaction with a specific data management solution. There was one question about overall satisfaction (the dependent variable for my regression) and then various questions about more specific aspects like data quality etc. (the independent variables in my regression).
With the help of R, I created a multiple regression in order to figure out which of the various aspects are most important for customer satisfaction. However, I believe my results are not 100% correct, since some of them don't make sense. For instance, according to the standardized coefficient, increasing data quality results in less user satisfaction. From my point of view, the coefficient should be positive for all variables.
Maybe somebody here can help me or give me some tips on how to improve my model. Below you can find my code and the (anonymized) results. The rows labeled M-AV are my independent variables; the columns to the right show the standardized coefficient, the standard error, the t value and the p-value.
#https://www.youtube.com/watch?v=EUbujtw5Azc
# Load libraries
library(lmtest)
library(car)
library(sandwich)
# Read in the data
daten <- read.csv(file.choose(), header = T, sep=";")
# Transform column K (read in as chr, but is actually numeric)
daten <- transform(daten, K = as.numeric(K))
str(daten)
# Regression model
#modell <- lm(H ~ M + N + O + P + X + Y + Z + AA + AB + AE + AF + AG + AJ + AL + AM + AN + AQ + AR + AS + AU + AV, daten)
modell <- lm(C ~ M + N + O + P + X + Y + Z + AA + AB + AE + AF + AG + AJ + AL + AM + AN + AQ + AR + AS + AU + AV, daten)
# Assumptions
# 1 Normality of the residuals
# Plot: points should lie roughly on the line (i.e. normally distributed); deviations at the start and end are OK.
plot(modell, 2)
# 2a Homoscedasticity (do the residuals scatter equally?)
plot(modell, 1) # should lie roughly on the ideal line
# Breusch-Pagan test, null hypothesis: homoscedasticity holds
# if the p-value is > 0.05, the null hypothesis is retained
bptest(modell)
# 3 No multicollinearity (independent variables correlating too strongly with each other)
# VIF should definitely be below 10, more conservatively below 6
vif(modell)
# 4 Outliers / influential cases
#https://bjoernwalther.com/cook-distanz-in-r-ermitteln-und-interpretieren-ausreisser-erkennen/
plot(modell, 4)
# Robust standard errors
coeftest(modell, vcov=vcovHC(modell, type ="HC3"))
# Evaluation
summary(modell)
# The F statistic's null hypothesis is that the model makes no explanatory contribution --> here < .05, so it is rejected!
# R2 value --> ca. 60% of the dependent variable is explained by the predictors (really 40%, see adjusted R2)
# standardized coefficients to find the most influential variable
zmodell <- lm(scale(C) ~ scale(M)+ scale(N) + scale(O) + scale(P) + scale(X) + scale(Y) + scale(Z) + scale(AA) + scale(AB) + scale(AE) + scale(AF) + scale(AG) + scale(AJ) + scale(AL) + scale(AM) + scale(AN) + scale(AQ) + scale(AR) + scale(AS) + scale(AU) + scale(AV), data = daten)
summary(zmodell)
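A minimal diagnostic sketch (assuming daten and modell from the code above are still in the workspace) to compare each predictor's simple correlation with C to its coefficient in the full model; an opposite sign hints at multicollinearity or suppression effects:
preds <- c("M", "N", "O", "P", "X", "Y", "Z", "AA", "AB", "AE", "AF", "AG",
           "AJ", "AL", "AM", "AN", "AQ", "AR", "AS", "AU", "AV")
# simple (bivariate) correlation of each predictor with the outcome C
simple_r <- sapply(preds, function(v) cor(daten[[v]], daten$C, use = "pairwise.complete.obs"))
# side by side with the multiple-regression coefficients
cbind(simple_r, model_coef = coef(modell)[preds])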
dput(head(j, 20))
structure(list(A = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L), B = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), C = c(10L, 5L, 9L, 9L, 7L, 10L, 10L, 5L, 10L, 8L,
1L, 8L, 10L, 7L, 8L, 10L, 8L, 2L, 8L, 3L), D = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), E = c(5L, 3L, 4L, 5L, 4L, 4L, 6L, 3L, 5L, 3L, 4L, 2L, 4L,
2L, 3L, 5L, 3L, 4L, 3L, 2L), F = c(5L, 2L, 6L, 5L, 4L, 2L, 6L,
4L, 5L, 6L, 4L, 4L, 6L, 5L, 5L, 6L, 4L, 3L, 5L, 5L), G = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), H = c(6L, 3L, 5L, 4L, 5L, 4L, 5L, 4L, 5L, 4L, 2L,
5L, 5L, 4L, 4L, 6L, 4L, 5L, 4L, 1L), I = c(6L, 2L, 5L, 4L, 4L,
4L, 5L, 3L, 5L, 4L, 2L, 5L, 5L, 3L, 4L, 5L, 3L, 2L, 4L, 1L),
J = c(3L, 6L, 6L, 5L, 6L, 2L, 5L, 4L, 6L, 6L, 5L, 2L, 5L,
5L, 2L, 6L, 5L, 5L, 6L, 6L), K = c(5, 3.67, 5.33, 4.33, 5,
3.33, 5, 3.67, 5.33, 4.67, 3, 4, 5, 4, 3.33, 5.67, 4, 4,
4.67, 2.67), L = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), M = c(4L, 2L, 6L,
6L, 5L, 6L, 6L, 4L, 6L, 6L, 5L, 6L, 5L, 5L, 5L, 6L, 6L, 6L,
6L, 3L), N = c(6L, 5L, 5L, 5L, 6L, 6L, 6L, 5L, 6L, 6L, 4L,
4L, 4L, 3L, 5L, 5L, 4L, 5L, 5L, 2L), O = c(5L, 1L, 5L, 4L,
6L, 6L, 5L, 2L, 6L, 6L, 1L, 5L, 5L, 3L, 4L, 5L, 4L, 2L, 5L,
3L), P = c(6L, 1L, 4L, 4L, 4L, 6L, 6L, 2L, 5L, 3L, 2L, 5L,
5L, 3L, 5L, 5L, 4L, 5L, 2L, 1L), Q = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), R = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), S = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), T = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), U = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), V = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), W = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), X = c(4L, 1L, 3L, 4L, 5L, 6L, 5L, 3L, 5L, 4L, 1L, 5L,
4L, 1L, 4L, 1L, 5L, 2L, 4L, 1L), Y = c(5L, 1L, 3L, 3L, 3L,
6L, 5L, 2L, 6L, 4L, 1L, 3L, 4L, 1L, 5L, 5L, 3L, 2L, 3L, 2L
), Z = c(5L, 1L, 3L, 4L, 3L, 6L, 5L, 2L, 5L, 4L, 2L, 3L,
5L, 3L, 5L, 3L, 2L, 1L, 4L, 1L), AA = c(6L, 4L, 4L, 5L, 5L,
6L, 5L, 3L, 4L, 5L, 3L, 4L, 4L, 3L, 5L, 6L, 5L, 3L, 6L, 2L
), AB = c(6L, 6L, 4L, 4L, 3L, 6L, 5L, 3L, 5L, 3L, 2L, 6L,
5L, 6L, 5L, 5L, 5L, 5L, 6L, 2L), AC = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AD = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AE = c(5L, 1L, 6L, 4L, 6L,
5L, 4L, 3L, 5L, 5L, 2L, 2L, 4L, 1L, 5L, 3L, 3L, 4L, 4L, 1L
), AF = c(4L, 1L, 6L, 2L, 5L, 5L, 4L, 3L, 6L, 4L, 2L, 4L,
5L, 4L, 5L, 4L, 3L, 4L, 6L, 2L), AG = c(4L, 1L, 5L, 2L, 5L,
5L, 4L, 4L, 4L, 4L, 2L, 4L, 5L, 5L, 4L, 2L, 3L, 2L, 6L, 2L
), AH = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AI = c(0L, 0L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L
), AJ = c(3L, 2L, 5L, 3L, 4L, 4L, 6L, 3L, 5L, 5L, 2L, 5L,
5L, 3L, 5L, 5L, 4L, 2L, 5L, 1L), AK = c(NA, NA, 5L, 3L, 4L,
4L, 5L, NA, 6L, 5L, NA, NA, 6L, NA, NA, NA, 4L, NA, NA, NA
), AL = c(4L, 4L, 6L, 4L, 6L, 5L, 5L, 3L, 6L, 5L, 4L, 6L,
5L, 3L, 5L, 4L, 5L, 3L, 6L, 1L), AM = c(5L, 1L, 6L, 4L, 5L,
2L, 4L, 2L, 6L, 4L, 2L, 2L, 6L, 1L, 5L, 3L, 2L, 1L, 4L, 3L
), AN = c(1L, 1L, 6L, 3L, 2L, 6L, 4L, 1L, 6L, 2L, 1L, 4L,
5L, 2L, 5L, 5L, 4L, 4L, 5L, 1L), AO = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AP = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AQ = c(3L, 1L, 6L, 3L, 6L,
1L, 5L, 2L, 6L, 5L, 6L, 3L, 6L, 1L, 5L, 3L, 2L, 2L, 4L, 2L
), AR = c(1L, 4L, 4L, 3L, 6L, 1L, 5L, 1L, 6L, 5L, 5L, 4L,
6L, 2L, 5L, 4L, 2L, 2L, 4L, 2L), AS = c(1L, 1L, 6L, 4L, 6L,
1L, 5L, 3L, 6L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 4L, 2L, 5L, 2L
), AT = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AU = c(5L, 3L, 4L, 4L, 6L,
3L, 5L, 3L, 6L, 5L, 4L, 4L, 4L, 6L, 5L, 6L, 5L, 6L, 5L, 2L
), AV = c(6L, 3L, 5L, 4L, 6L, 2L, 6L, 2L, 6L, 4L, 4L, 4L,
4L, 6L, 4L, 6L, 3L, 6L, 2L, 3L), AW = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AX = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AY = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AZ = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), BA = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), BB = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), BC = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), BD = c(5.25, 2.25, 5, 4.75, 5.25, 6, 5.75, 3.25, 5.75,
5.25, 3, 5, 4.75, 3.5, 4.75, 5.25, 4.5, 4.5, 4.5, 2.25),
BE = c(5.2, 2.6, 3.4, 4, 3.8, 6, 5, 2.6, 5, 4, 1.8, 4.2,
4.4, 2.8, 4.8, 4, 4, 2.6, 4.6, 1.6), BF = c(4.333333333,
1, 5.666666667, 2.666666667, 5.333333333, 5, 4, 3.333333333,
5, 4.333333333, 2, 3.333333333, 4.666666667, 3.333333333,
4.666666667, 3, 3, 3.333333333, 5.333333333, 1.666666667),
BG = c(3.25, 2, 5.75, 3.5, 4.25, 4.25, 4.75, 2.25, 5.75,
4, 2.25, 4.25, 5.25, 2.25, 5, 4.25, 3.75, 2.5, 5, 1.5), BH = c(1.666666667,
2, 5.333333333, 3.333333333, 6, 1, 5, 2, 6, 5, 5.666666667,
4, 6, 2.666666667, 5, 4, 2.666666667, 2, 4.333333333, 2),
BI = c(5.5, 3, 4.5, 4, 6, 2.5, 5.5, 2.5, 6, 4.5, 4, 4, 4,
6, 4.5, 6, 4, 6, 3.5, 2.5)), row.names = c(NA, 20L), class = "data.frame")
I wonder how it would be possible to add labels to single bars in ggplot2, as I would like to label the bars in my barplot as Online Broker, Bank and No Account. Thank you for your help!
Here is my code:
library(gridExtra)
library(ggplot2)
require(gridExtra)
library(tidyverse)
library (scales)
plot1 <- ggplot(data = df, aes(df$InvA, y = after_stat(prop)), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ggtitle("Post-Covid") +
xlab("Accounts") +
ylab("Individuals") +
scale_y_continuous(labels = percent_format(), limits=c(0,0.8))
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
plot2 <- ggplot(data = df, aes(df$InvAcc, y = after_stat(prop)), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ggtitle("Pre-Covid") +
xlab("Accounts") +
ylab("Individuals") +
scale_y_continuous(labels = percent_format(), limits=c(0,0.8))
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
grid.arrange(plot2, plot1, ncol = 2)
The plot then looks like this:
However, the labels should look like those in my previous plot. Since I managed to get the y-scale to percent, my labels disappeared; it doesn't seem possible to compute the y-scale in percent while the values remain factors (Online-Broker, Bank, No-Account), so I had to change them to numeric (1, 2, 3):
dput(dfaccounts) # (with 1=Online Broker, 2=Bank, 3=No Account)
structure(list(df.InvAcc = c(2L, NA, 2L, NA, NA, 3L, 3L, 3L,
NA, 3L, 3L, NA, 1L, NA, 1L, NA, NA, 1L, NA, NA, NA, 1L, 3L, 1L,
NA, NA, 1L, 2L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 2L, NA,
NA, 3L, NA, NA, 1L, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
1L, 1L, 1L, NA, NA, NA, 3L, NA, 1L, NA, NA, 2L, NA, 1L, 1L, 1L,
NA, 1L, 3L, NA, 1L, NA, 3L, NA, NA, 2L, 3L, 2L, 1L, NA, 3L, 2L,
NA, NA, 3L, NA, 2L, 1L, NA, 3L, 2L, 1L, 3L, 3L, 3L, NA, 3L, NA,
3L, NA, 3L, 1L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, 3L, NA,
NA, 3L, 3L, 3L, 3L, NA, 1L, NA, NA, NA, 3L, NA, 3L), df.InvA = c(NA,
1L, NA, 2L, 1L, NA, NA, NA, 3L, NA, NA, 3L, NA, 3L, NA, 1L, 2L,
NA, 1L, 1L, 1L, NA, NA, NA, 1L, 2L, NA, NA, 2L, 1L, NA, NA, NA,
NA, NA, NA, NA, 3L, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, NA, NA, NA, NA, 1L, 1L, 1L, NA, 2L, NA,
2L, 1L, NA, 2L, NA, NA, NA, 2L, NA, NA, 2L, NA, 1L, NA, 3L, 3L,
NA, NA, NA, NA, 1L, NA, NA, 1L, 2L, NA, 1L, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 3L, NA, 1L,
2L, 2L, NA, 1L, 1L, NA, 3L, 1L, NA, NA, NA, NA, 1L, NA, 1L, 3L,
1L, NA, 3L, NA)), class = "data.frame", row.names = c(NA, -133L
))
The issue is that your InvA and InvAcc columns are numeric. Hence, with scale_x_discrete the axis text gets dropped.
To fix your issue I would suggest converting the columns to factors, with your desired labels set via the labels argument. Additionally, to get the right percentages we have to explicitly set the group aesthetic via group = 1:
library(gridExtra)
library(ggplot2)
labels <- c("Online-Broker", "Bank", "No Account")
data$InvA <- factor(data$InvA, labels = labels)
data$InvAcc <- factor(data$InvAcc, labels = labels)
plot1 <- ggplot(data = data, aes(InvA, y = after_stat(prop), group = 1), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ylim(0, 40) +
ggtitle("Post-Covid") +
xlab("Accounts") +
ylab("Total No. of Individuals") +
scale_y_continuous(labels = scales::percent)
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
plot2 <- ggplot(data = data, aes(InvAcc, y = after_stat(prop), group = 1), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ylim(0, 40) +
ggtitle("Pre-Covid") +
xlab("Accounts") +
ylab("Total No. of Individuals") +
scale_y_continuous(labels = scales::percent)
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
grid.arrange(plot2, plot1, ncol = 2)
#> Warning: Removed 64 rows containing non-finite values (stat_count).
#> Warning: Removed 69 rows containing non-finite values (stat_count).
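A small caveat to the factor conversion above: with only labels given, factor() attaches the labels to the sorted unique values of the column, so if one of the codes 1, 2, 3 happened to be missing the mapping could shift. Setting the levels explicitly keeps the code-to-label mapping fixed (a sketch, assuming the 1 = Online-Broker, 2 = Bank, 3 = No Account coding from your question):
labels <- c("Online-Broker", "Bank", "No Account")
data$InvA <- factor(data$InvA, levels = 1:3, labels = labels)
data$InvAcc <- factor(data$InvAcc, levels = 1:3, labels = labels)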
I have already read through all the other similar questions and answers, but the given solutions with scale_y_continuous simply don't work for my dataset. I have two different treatment groups, data$InvA (Post-Covid) and data$InvAcc (Pre-Covid), where in each group the subjects could choose between the options Online Broker (1), Bank (2) and No Account (3). As the subjects were put randomly into group 1 or 2, I logically have a lot of NAs in my dataset. When I use ggplot, I'm able to display both results with the total number of individuals on the y-axis. However, I would like to change this to percent, since that would be a better fit for my thesis. I have already tried every option with scale_y_continuous, but it either doesn't work out properly (3000% percentages, or it doesn't calculate the right percent values) or it doesn't work at all.
This is my code:
library(gridExtra)
library(ggplot2)
require(gridExtra)
library(tidyverse)
plot1 <- ggplot(data = data, aes(InvA), na.rm=TRUE) +
geom_bar()+
scale_x_discrete(na.translate = FALSE)+
ylim(0,40)+
ggtitle("Post-Covid")+
xlab("Accounts")+
ylab("Total No. of Individuals")
plot2 <- ggplot(data = data, aes(InvAcc), na.rm=TRUE) +
geom_bar()+
scale_x_discrete(na.translate = FALSE)+
ylim(0,40)+
ggtitle("Pre-Covid")+
xlab("Accounts")+
ylab("Total No. of Individuals")
grid.arrange(plot2, plot1,ncol=2) # Write the grid.arrange in the file
#dev.off() # Close the file
#pdf("Accountss.pdf", width = 8, height = 6) # Open a new pdf file
My data:
dput(data)
structure(list(data.InvAcc = c(2L, NA, 2L, NA, NA, 3L, 3L, 3L,
NA, 3L, 3L, NA, 1L, NA, 1L, NA, NA, 1L, NA, NA, NA, 1L, 3L, 1L,
NA, NA, 1L, 2L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 2L, NA,
NA, 3L, NA, NA, 1L, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
1L, 1L, 1L, NA, NA, NA, 3L, NA, 1L, NA, NA, 2L, NA, 1L, 1L, 1L,
NA, 1L, 3L, NA, 1L, NA, 3L, NA, NA, 2L, 3L, 2L, 1L, NA, 3L, 2L,
NA, NA, 3L, NA, 2L, 1L, NA, 3L, 2L, 1L, 3L, 3L, 3L, NA, 3L, NA,
3L, NA, 3L, 1L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, 3L, NA,
NA, 3L, 3L, 3L, 3L, NA, 1L, NA, NA, NA, 3L, NA, 3L), data.InvA = c(NA,
1L, NA, 2L, 1L, NA, NA, NA, 3L, NA, NA, 3L, NA, 3L, NA, 1L, 2L,
NA, 1L, 1L, 1L, NA, NA, NA, 1L, 2L, NA, NA, 2L, 1L, NA, NA, NA,
NA, NA, NA, NA, 3L, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, NA, NA, NA, NA, 1L, 1L, 1L, NA, 2L, NA,
2L, 1L, NA, 2L, NA, NA, NA, 2L, NA, NA, 2L, NA, 1L, NA, 3L, 3L,
NA, NA, NA, NA, 1L, NA, NA, 1L, 2L, NA, 1L, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 3L, NA, 1L,
2L, 2L, NA, 1L, 1L, NA, 3L, 1L, NA, NA, NA, NA, 1L, NA, 1L, 3L,
1L, NA, 3L, NA)), class = "data.frame", row.names = c(NA, -133L
))
data$InvAcc: Online Broker --> 31 (45%), Bank --> 11 (16%), No Account --> 27 (39%)
data$InvA: Online Broker --> 40 (63%), Bank --> 13 (20%), No Account --> 11 (17%)
Thank you all for your help, appreciate your time!
The issue is that you are plotting the counts. If you want to plot the percentages, then you have to tell ggplot to do so using e.g. y = after_stat(prop), which maps the proportions instead of the counts on y. Afterwards you can get percent labels using scales::percent:
library(gridExtra)
library(ggplot2)
plot1 <- ggplot(data = data, aes(InvA, y = after_stat(prop)), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ylim(0, 40) +
ggtitle("Post-Covid") +
xlab("Accounts") +
ylab("Total No. of Individuals") +
scale_y_continuous(labels = scales::percent)
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
plot2 <- ggplot(data = data, aes(InvAcc, y = after_stat(prop)), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ylim(0, 40) +
ggtitle("Pre-Covid") +
xlab("Accounts") +
ylab("Total No. of Individuals") +
scale_y_continuous(labels = scales::percent)
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
grid.arrange(plot2, plot1, ncol = 2)
#> Warning: Removed 64 rows containing non-finite values (stat_count).
#> Warning: Removed 69 rows containing non-finite values (stat_count).
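As the messages above show, the ylim(0, 40) call is replaced by the later scale_y_continuous() anyway, so it can simply be dropped; if you want to cap the axis, set the limits inside the percent scale instead, for example:
scale_y_continuous(labels = scales::percent, limits = c(0, 0.8)) # cap the y axis at 80%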
I am really new to coding and I need to run a number of statistics on a dataset, for example the Pearson correlation, but I am having some trouble manipulating the data.
From what I understood, I need to transpose my data in order to calculate the Pearson correlation, but here is where I'm having some problems. For starters, the column names turn into a new row instead of becoming the new column names. Then I get a message that my values are not numeric.
I also have some NAs, and I am trying to calculate the correlation with this code:
cor(cr, use = "complete.obs", method = "pearson")
Error in cor(cr1, use = "complete.obs", method = "pearson") :
'x' must be numeric
I need to know the correlation between Victoria and Nuria, which should yield 0.3651484.
Here is the dput of my dataset:
> dput(cr)
structure(list(User = structure(c(8L, 10L, 2L, 17L, 11L, 1L,
18L, 9L, 7L, 5L, 3L, 14L, 13L, 4L, 20L, 6L, 16L, 12L, 15L, 19L
), .Label = c("Ana", "Anton", "Bernard", "Carles", "Chris", "Ivan",
"Jim", "John", "Marc", "Maria", "Martina", "Nadia", "Nerea",
"Nuria", "Oriol", "Rachel", "Roger", "Sergi", "Valery", "Victoria"
), class = "factor"), Star.Wars.IV...A.New.Hope = c(1L, 5L, NA,
NA, 4L, 2L, NA, 4L, 5L, 4L, 2L, 3L, 2L, 3L, 4L, NA, NA, 4L, 5L,
1L), Star.Wars.VI...Return.of.the.Jedi = c(5L, 3L, NA, 3L, 3L,
4L, NA, NA, 1L, 2L, 1L, 5L, 3L, NA, 4L, NA, NA, 5L, 1L, 2L),
Forrest.Gump = c(2L, NA, NA, NA, 4L, 4L, 3L, NA, NA, NA,
5L, 2L, NA, 3L, NA, 1L, NA, 1L, NA, 2L), The.Shawshank.Redemption = c(NA,
2L, 5L, NA, 1L, 4L, 1L, NA, 4L, 5L, NA, NA, 5L, NA, NA, NA,
NA, 5L, NA, 4L), The.Silence.of.the.Lambs = c(4L, 4L, 2L,
NA, 4L, NA, 1L, 3L, 2L, 3L, NA, 2L, 4L, 2L, 5L, 3L, 4L, 1L,
NA, 5L), Gladiator = c(4L, 2L, NA, 1L, 1L, NA, 4L, 2L, 4L,
NA, 5L, NA, NA, NA, 5L, 2L, NA, 1L, 4L, NA), Toy.Story = c(2L,
1L, 4L, 2L, NA, 3L, NA, 2L, 4L, 4L, 5L, 2L, 4L, 3L, 2L, NA,
2L, 4L, 2L, 2L), Saving.Private.Ryan = c(2L, NA, NA, 3L,
4L, 1L, 5L, NA, 4L, 3L, NA, NA, 5L, NA, NA, 2L, NA, NA, 1L,
3L), Pulp.Fiction = c(NA, NA, NA, 4L, NA, 4L, 2L, 3L, NA,
4L, NA, 1L, NA, NA, 3L, NA, 2L, 5L, 3L, 2L), Stand.by.Me = c(3L,
4L, 1L, NA, 1L, 4L, NA, NA, 1L, NA, NA, NA, NA, 4L, 5L, 1L,
NA, NA, 3L, 2L), Shakespeare.in.Love = c(2L, 3L, NA, NA,
5L, 5L, 1L, NA, 2L, NA, NA, 3L, NA, NA, NA, 5L, 2L, NA, 3L,
1L), Total.Recall = c(NA, 2L, 1L, 4L, 1L, 2L, NA, 2L, 3L,
NA, 3L, NA, 2L, 1L, 1L, NA, NA, NA, 1L, NA), Independence.Day = c(5L,
2L, 4L, 1L, NA, 4L, NA, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 3L, NA,
NA, NA, NA, NA), Blade.Runner = c(2L, NA, 4L, 3L, 4L, NA,
3L, 2L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 4L, NA, 5L),
Groundhog.Day = c(NA, 2L, 1L, 5L, NA, 1L, NA, 4L, 5L, NA,
NA, 2L, 3L, 3L, 2L, 5L, NA, NA, NA, 5L), The.Matrix = c(4L,
NA, 1L, NA, 3L, NA, 1L, NA, NA, 2L, 1L, 5L, NA, 5L, NA, 2L,
4L, NA, 2L, 4L), Schindler.s.List = c(2L, 5L, 2L, 5L, 5L,
NA, NA, 1L, NA, 5L, NA, NA, NA, 1L, 3L, 2L, NA, 2L, NA, 3L
), The.Sixth.Sense = c(5L, 1L, 3L, 1L, 5L, 3L, NA, 3L, NA,
1L, 2L, NA, NA, NA, NA, 4L, NA, 1L, NA, 5L), Raiders.of.the.Lost.Ark = c(NA,
3L, 1L, 1L, NA, NA, 5L, 5L, NA, NA, 1L, NA, 5L, NA, 3L, 3L,
NA, 2L, NA, 3L), Babe = c(NA, NA, 3L, 2L, NA, 2L, 2L, NA,
5L, NA, 4L, 2L, NA, NA, 1L, 4L, NA, 5L, NA, NA)), .Names = c("User",
"Star.Wars.IV...A.New.Hope", "Star.Wars.VI...Return.of.the.Jedi",
"Forrest.Gump", "The.Shawshank.Redemption", "The.Silence.of.the.Lambs",
"Gladiator", "Toy.Story", "Saving.Private.Ryan", "Pulp.Fiction",
"Stand.by.Me", "Shakespeare.in.Love", "Total.Recall", "Independence.Day",
"Blade.Runner", "Groundhog.Day", "The.Matrix", "Schindler.s.List",
"The.Sixth.Sense", "Raiders.of.the.Lost.Ark", "Babe"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Can someone help me?
This code should give you the correlation matrix between all users.
cr2 <- t(cr[, 2:21])                      # Transpose (first column contains names)
colnames(cr2) <- cr[[1]]                  # Assign user names to columns ([[ extracts the vector, since cr is a tibble)
cor(cr2, use = "complete.obs")            # Gives an error because there are no complete obs
# Error in cor(cr2, use = "complete.obs") : no complete element pairs
cor(cr2, use = "pairwise.complete.obs")   # use pairwise deletion
Correlation between Victoria and Nuria is 0.36514837 (using pairwise deletion)
Edit: To get just the correlation between Victoria and Nuria with listwise deletion, run the above and then
cr2<-as.data.frame(cr2)
with(cr2, cor(Victoria, Nuria, use = "complete.obs", method = "pearson"))
[1] 0.3651484
As a summary, in addition to #Niek's answer: first transpose the data frame with t(), excluding the first column (which contains the names, is not numeric, and thus cannot be used for correlation calculations), and assign these names to the new columns in the same step. Then calculate the specific correlations. The solution in one piece would be:
cr2 <- setNames(as.data.frame(t(cr[, -1])), cr[[1]])
with(cr2, cor(Victoria, Nuria, use = "complete.obs"))
[1] 0.3651484
Or for the whole correlation matrix:
cor(cr2, use = "pairwise.complete.obs")
Is it possible to get a p-value for the nodes in a categorical (classification) tree analysis with R? I am using rpart and can't locate a p-value for each node. Maybe this is only possible with regression trees and not with categorical outcomes.
structure(list(subj = c(702L, 702L, 702L, 702L, 702L, 702L, 702L,
702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L,
702L, 702L, 702L, 702L, 702L, 702L), visit = c(4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L), run = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("A", "B", "C", "D", "E", "xdur", "xend60", "xpre"
), class = "factor"), ho = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), hph = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), longexer = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("10min", "60min"), class = "factor"),
esq_sick = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA), esq_sick2 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), ll_sick = c(NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA,
0L, NA, NA, NA, NA, NA), ll_sick2 = c(NA, NA, 0L, NA, NA,
NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA,
NA, NA, NA, NA), esq_01 = c(NA, NA, 2L, NA, NA, NA, NA, NA,
NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
NA), esq_02 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 2L,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_03 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), esq_04 = c(NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA), esq_05 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA,
NA, NA), esq_06 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA),
esq_07 = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_08 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA), esq_09 = c(NA, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA), esq_10 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA,
NA, NA)), .Names = c("subj", "visit", "run", "ho", "hph",
"longexer", "esq_sick", "esq_sick2", "ll_sick", "ll_sick2", "esq_01",
"esq_02", "esq_03", "esq_04", "esq_05", "esq_06", "esq_07", "esq_08",
"esq_09", "esq_10"), row.names = 7:30, class = "data.frame")
alldata = read.table('symptomology CSV2.csv',header=TRUE,sep=",")
library(rpart)
fit <- rpart(esq_sick2~esq_01_bin + esq_02_bin + esq_03_bin + esq_04_bin + esq_05_bin + esq_06_bin + esq_07_bin + esq_08_bin + esq_09_bin + esq_10_bin + esq_11_bin + esq_12_bin + esq_13_bin + esq_14_bin + esq_15_bin + esq_16_bin + esq_17_bin + esq_18_bin + esq_19_bin + esq_20_bin, method="class", data=alldata)
plot(fit, uniform = FALSE, branch = 1, compress = FALSE, margin = 0.1, minbranch = 0.3)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
Here's an example that might help you. I'm using the built-in airquality data set and the example provided in the help for ctree:
library(partykit)
# For the sctest function to extract p-values (see help for ctree and sctest)
library(strucchange)
# Data we'll use
airq <- subset(airquality, !is.na(Ozone))
# Build the tree
airct <- ctree(Ozone ~ ., data = airq)
Look at the tree:
airct
Model formula:
Ozone ~ Solar.R + Wind + Temp + Month + Day
Fitted party:
[1] root
| [2] Temp <= 82
| | [3] Wind <= 6.9: 55.600 (n = 10, err = 21946.4)
| | [4] Wind > 6.9
| | | [5] Temp <= 77: 18.479 (n = 48, err = 3956.0)
| | | [6] Temp > 77: 31.143 (n = 21, err = 4620.6)
| [7] Temp > 82
| | [8] Wind <= 10.3: 81.633 (n = 30, err = 15119.0)
| | [9] Wind > 10.3: 48.714 (n = 7, err = 1183.4)
Extract the p-values:
sctest(airct)
$`1`
Solar.R Wind Temp Month Day
statistic 13.34761286 4.161370e+01 5.608632e+01 3.1126596 0.02011554
p.value 0.00129309 5.560572e-10 3.468337e-13 0.3325881 0.99998175
$`2`
Solar.R Wind Temp Month Day
statistic 5.4095322 12.968549828 11.298951405 0.2148961 2.970294
p.value 0.0962041 0.001582833 0.003871534 0.9941976 0.357956
$`3`
NULL
$`4`
Solar.R Wind Temp Month Day
statistic 9.547191843 2.307676 11.598966936 0.06604893 0.2513143
p.value 0.009972755 0.497949 0.003295072 0.99965679 0.9916670
$`5`
Solar.R Wind Temp Month Day
statistic 6.14094026 1.3865355 1.9986304 0.8268341 1.3580462
p.value 0.06432172 0.7447599 0.5753799 0.8952749 0.7528481
$`6`
Solar.R Wind Temp Month Day
statistic 5.1824354 0.02060939 0.9270013 0.165171 4.6220522
p.value 0.1089932 0.99998062 0.8705785 0.996871 0.1481643
$`7`
Solar.R Wind Temp Month Day
statistic 0.8083249 11.711564549 6.77148538 0.1307643 0.03992875
p.value 0.8996614 0.003101788 0.04546281 0.9982052 0.99990034
$`8`
Solar.R Wind Temp Month Day
statistic 0.9056479 3.1585094 2.9285252 0.008106707 0.008686293
p.value 0.8759687 0.3247585 0.3657072 0.999998099 0.999997742
$`9`
NULL
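Since sctest() here returns a plain named list with one entry per node id, a single value can be pulled out with ordinary indexing, e.g. the Wind p-value in node 2:
sctest(airct)[["2"]]["p.value", "Wind"]
# [1] 0.001582833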