Grouped boxplot in R - simplest way

I have been struggling with creating a very simple grouped boxplot. My data looks as follows
> data
Wörter Sätze Text
P.01 0.15 0.24 0.34
P.02 0.10 0.15 0.08
P.03 0.05 0.18 0.16
P.04 0.55 0.60 0.44
P.05 0.00 0.06 0.26
P.06 0.20 0.65 0.68
P.07 0.15 0.31 0.47
P.08 0.35 0.87 0.69
P.09 0.35 0.75 0.76
N.01 0.40 0.78 0.59
N.02 0.55 0.95 0.76
N.03 0.65 0.96 0.83
N.04 0.60 0.90 0.77
N.05 0.50 0.95 0.82
If I simply execute boxplot(data) I obtain almost what I want. One plot with three boxes, each for one of the variables in my data.
[Image: "Boxplot, almost"]
What I want is to separate these into two boxes per variable (one for the P-indexed, one for the N-indexed observations), for a total of six boxes.
I began by introducing a new variable
data$Gruppe <- c(rep("P",9), rep("N",5))
> data
Wörter Sätze Text Gruppe
P.01 0.15 0.24 0.34 P
P.02 0.10 0.15 0.08 P
P.03 0.05 0.18 0.16 P
P.04 0.55 0.60 0.44 P
P.05 0.00 0.06 0.26 P
P.06 0.20 0.65 0.68 P
P.07 0.15 0.31 0.47 P
P.08 0.35 0.87 0.69 P
P.09 0.35 0.75 0.76 P
N.01 0.40 0.78 0.59 N
N.02 0.55 0.95 0.76 N
N.03 0.65 0.96 0.83 N
N.04 0.60 0.90 0.77 N
N.05 0.50 0.95 0.82 N
Now that the data contains a non-numerical variable I cannot simply execute the boxplot() function as before. What would be a minimal alteration to make here to obtain the six plots that I want? (colour coding for the two groups would be nice)
I have encountered some solutions for grouped boxplots, but the data others start from tends to be organised differently from my (very simple) data.
Many thanks!

As @teunbrand already mentioned in the comments, you could use pivot_longer to reshape your data into a longer format, keeping Gruppe as an identifier column. Mapping Gruppe to fill then draws two boxes per variable, six in total, like this:
library(tidyr)
library(dplyr)
library(ggplot2)
data$Gruppe <- c(rep("P",9), rep("N",5))
data %>%
  pivot_longer(cols = -Gruppe) %>%
  ggplot(aes(x = name, y = value, fill = Gruppe)) +
  geom_boxplot()
Created on 2023-01-10 with reprex v2.0.2
Data used:
data <- read.table(text = " Wörter Sätze Text
P.01 0.15 0.24 0.34
P.02 0.10 0.15 0.08
P.03 0.05 0.18 0.16
P.04 0.55 0.60 0.44
P.05 0.00 0.06 0.26
P.06 0.20 0.65 0.68
P.07 0.15 0.31 0.47
P.08 0.35 0.87 0.69
P.09 0.35 0.75 0.76
N.01 0.40 0.78 0.59
N.02 0.55 0.95 0.76
N.03 0.65 0.96 0.83
N.04 0.60 0.90 0.77
N.05 0.50 0.95 0.82", header = TRUE)
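For readers who prefer to stay in base R, the same reshaping idea can be sketched with stack() and the formula interface of boxplot(). This is only a sketch under assumptions: the data frame is rebuilt by hand from the values in the question, and the helper names value_cols, long and bp as well as the two colours are arbitrary choices, not part of the original answer.

```r
# Values copied from the question; check.names = FALSE keeps the column names unchanged
data <- data.frame(
  `Wörter` = c(0.15, 0.10, 0.05, 0.55, 0.00, 0.20, 0.15, 0.35, 0.35, 0.40, 0.55, 0.65, 0.60, 0.50),
  `Sätze`  = c(0.24, 0.15, 0.18, 0.60, 0.06, 0.65, 0.31, 0.87, 0.75, 0.78, 0.95, 0.96, 0.90, 0.95),
  Text     = c(0.34, 0.08, 0.16, 0.44, 0.26, 0.68, 0.47, 0.69, 0.76, 0.59, 0.76, 0.83, 0.77, 0.82),
  check.names = FALSE
)
data$Gruppe <- c(rep("P", 9), rep("N", 5))

# Stack only the numeric columns into long format (columns: values, ind)
value_cols <- setdiff(names(data), "Gruppe")
long <- stack(data[value_cols])

# stack() concatenates column by column, so the group vector repeats once per column
long$Gruppe <- rep(data$Gruppe, times = length(value_cols))

# One box per Gruppe/variable combination; Gruppe varies fastest,
# so the two colours alternate N, P, N, P, N, P
bp <- boxplot(values ~ Gruppe + ind, data = long,
              col = c("tomato", "steelblue"),
              las = 2, xlab = "", ylab = "value")
legend("topleft", legend = c("N", "P"), fill = c("tomato", "steelblue"))
```

The formula values ~ Gruppe + ind asks boxplot() for one box per combination of the two grouping factors, which is what produces six boxes from three variables and two groups.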


Filtering all rows if any value in a row is less than a threshold value

I would like to remove all rows if any value of the row is less than 0.05. Any suggestions? I need dplyr and base R simple subset solutions.
library(magrittr)
text = '
INNO RISK PRO AMB MKT IP
1 0.00 0.01 0.00 0.00 0.19 0.24
2 1.00 0.83 0.04 0.48 0.60 0.03
3 0.01 0.07 0.79 0.05 0.19 0.00
4 0.99 0.99 0.92 0.86 0.01 0.10
5 0.72 0.93 0.28 0.48 1.00 0.90
6 0.96 1.00 1.00 0.86 1.00 0.75
7 0.02 0.07 0.01 0.86 0.60 0.00
8 0.02 0.01 0.01 0.12 0.60 0.24
9 0.02 0.93 0.92 0.02 0.19 0.90
10 0.99 0.97 0.92 0.86 0.99 0.90'
d10 = textConnection(text) %>% read.table(header = T)
Created on 2020-11-28 by the reprex package (v0.3.0)
We can use rowSums
d10[!rowSums(d10 < 0.05),]
# INNO RISK PRO AMB MKT IP
#5 0.72 0.93 0.28 0.48 1.00 0.90
#6 0.96 1.00 1.00 0.86 1.00 0.75
#10 0.99 0.97 0.92 0.86 0.99 0.90
Or with dplyr
library(dplyr)
d10 %>%
  # if_all() needs dplyr 1.0.4 or later; it replaces the deprecated use of across() inside filter()
  filter(if_all(everything(), ~ .x >= 0.05))
# INNO RISK PRO AMB MKT IP
#5 0.72 0.93 0.28 0.48 1.00 0.90
#6 0.96 1.00 1.00 0.86 1.00 0.75
#10 0.99 0.97 0.92 0.86 0.99 0.90
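To see why the rowSums one-liner works, it can help to split it into steps on a tiny data frame. The sketch below uses made-up values and hypothetical names (m, below, n_below, kept); it is an illustration of the same idea, not part of the original answer.

```r
# Small made-up example: only row 1 has no value below 0.05
m <- data.frame(a = c(0.10, 0.01, 0.50),
                b = c(0.20, 0.30, 0.02))

# Comparing a data frame with a number gives a logical matrix
below <- m < 0.05

# TRUE counts as 1, so rowSums() gives the number of small values per row
n_below <- rowSums(below)

# Keep rows with zero small values; !n_below would give the same logical vector
kept <- m[n_below == 0, ]
print(kept)
```

Writing n_below == 0 instead of !n_below is purely a readability choice; both select the rows in which no value falls below the threshold.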

Create data frame from EFA output in R

I am working on EFA and would like to customize my tables. There is a function, print.psych, that can suppress factor loadings below a certain value to make the table easier to read. When I run this function, it prints the loadings and the summary statistics to the console (in an .Rmd document, it produces console text and a separate data frame of the factor loadings with small loadings suppressed). However, if I try to save the result as an object, it does not keep this data.
Here is an example:
library(psych)
bfi_data=bfi
bfi_data=bfi_data[complete.cases(bfi_data),]
bfi_cor <- cor(bfi_data)
factors_data <- fa(r = bfi_cor, nfactors = 6)
print.psych(factors_data, cut = .32, sort = TRUE)
In an R script, it produces this:
   item   MR2   MR3   MR1   MR5   MR4   MR6    h2   u2 com
N2   17  0.83                               0.654 0.35 1.0
N1   16  0.82                               0.666 0.33 1.1
N3   18  0.69                               0.549 0.45 1.1
N5   20  0.47                               0.376 0.62 2.2
N4   19  0.44        0.43                   0.506 0.49 2.4
C4    9       -0.67                         0.555 0.45 1.3
C2    7        0.66                         0.475 0.53 1.4
C5   10       -0.56                         0.433 0.57 1.4
C3    8        0.56                         0.317 0.68 1.1
C1    6        0.54                         0.344 0.66 1.3
In R Markdown, it produces this:
How can I save that data.frame as an object?
Looking at the str of the object, it doesn't look like what you want is built in. An ugly way would be to use capture.output and convert the character vector to a data frame with string manipulation. However, since the data is being displayed, it must be present somewhere in the object itself. I found vectors of the same length that can be combined to form the data frame.
loadings <- unclass(factors_data$loadings)
h2 <- factors_data$communalities
#There is also factors_data$communality which has same values
u2 <- factors_data$uniquenesses
com <- factors_data$complexity
data <- cbind(loadings, h2, u2, com)
data
This returns:
# MR2 MR3 MR1 MR5 MR4 MR6 h2 u2 com
#A1 0.11 0.07 -0.07 -0.56 -0.01 0.35 0.38 0.62 1.85
#A2 0.03 0.09 -0.08 0.64 0.01 -0.06 0.47 0.53 1.09
#A3 -0.04 0.04 -0.10 0.60 0.07 0.16 0.51 0.49 1.26
#A4 -0.07 0.19 -0.07 0.41 -0.13 0.13 0.29 0.71 2.05
#A5 -0.17 0.01 -0.16 0.47 0.10 0.22 0.47 0.53 2.11
#C1 0.05 0.54 0.08 -0.02 0.19 0.05 0.34 0.66 1.32
#C2 0.09 0.66 0.17 0.06 0.08 0.16 0.47 0.53 1.36
#C3 0.00 0.56 0.07 0.07 -0.04 0.05 0.32 0.68 1.09
#C4 0.07 -0.67 0.10 -0.01 0.02 0.25 0.55 0.45 1.35
#C5 0.15 -0.56 0.17 0.02 0.10 0.01 0.43 0.57 1.41
#E1 -0.14 0.09 0.61 -0.14 -0.08 0.09 0.41 0.59 1.34
#E2 0.06 -0.03 0.68 -0.07 -0.08 -0.01 0.56 0.44 1.07
#E3 0.02 0.01 -0.32 0.17 0.38 0.28 0.51 0.49 3.28
#E4 -0.07 0.03 -0.49 0.25 0.00 0.31 0.56 0.44 2.26
#E5 0.16 0.27 -0.39 0.07 0.24 0.04 0.41 0.59 3.01
#N1 0.82 -0.01 -0.09 -0.09 -0.03 0.02 0.67 0.33 1.05
#N2 0.83 0.02 -0.07 -0.07 0.01 -0.07 0.65 0.35 1.04
#N3 0.69 -0.03 0.13 0.09 0.02 0.06 0.55 0.45 1.12
#N4 0.44 -0.14 0.43 0.09 0.10 0.01 0.51 0.49 2.41
#N5 0.47 -0.01 0.21 0.21 -0.17 0.09 0.38 0.62 2.23
#O1 -0.05 0.07 -0.01 -0.04 0.57 0.09 0.36 0.64 1.11
#O2 0.12 -0.09 0.01 0.12 -0.43 0.28 0.30 0.70 2.20
#O3 0.01 0.00 -0.10 0.05 0.65 0.04 0.48 0.52 1.06
#O4 0.10 -0.05 0.34 0.15 0.37 -0.04 0.24 0.76 2.55
#O5 0.04 -0.04 -0.02 -0.01 -0.50 0.30 0.33 0.67 1.67
#gender 0.20 0.09 -0.12 0.33 -0.21 -0.15 0.18 0.82 3.58
#education -0.03 0.01 0.05 0.11 0.12 -0.22 0.07 0.93 2.17
#age -0.06 0.07 -0.02 0.16 0.03 -0.26 0.10 0.90 2.05
Ronak Shaw answered my question above, and I used his answer to help create the following function, which nearly reproduces the print.psych data frame of the fa.sort output
library(psych)
library(dplyr)
library(tibble)
fa_table <- function(x, cut) {
  # get sorted loadings as a plain numeric matrix
  loadings <- unclass(fa.sort(x)$loadings) %>% round(3)
  # blank out loadings whose absolute value is below the cut-off
  # (this turns the matrix into a character matrix)
  loadings[abs(loadings) < cut] <- ""
  # get additional info
  add_info <- cbind(x$communalities,
                    x$uniquenesses,
                    x$complexity) %>%
    as.data.frame() %>%
    rename("communality" = V1,
           "uniqueness" = V2,
           "complexity" = V3) %>%
    rownames_to_column("item")
  # build table
  loadings %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    left_join(add_info, by = "item") %>%
    mutate(across(where(is.numeric), ~ round(.x, 3)))
}
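The suppression step on its own can be illustrated without psych on a small hand-made matrix. The numbers below are taken from the loadings shown earlier for N2 and C4, but the matrix itself, the 0.32 cut-off and the names loadings, shown and shown_df are only assumptions for this sketch. It compares absolute values so that a large negative loading such as -0.67 stays visible.

```r
# Hand-made 2 x 2 loading matrix (values borrowed from the N2 and C4 rows above)
loadings <- matrix(c(0.83, -0.07,
                     0.02, -0.67),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("N2", "C4"), c("MR2", "MR3")))

# Format as text first so every kept loading shows two decimals
shown <- formatC(loadings, format = "f", digits = 2)

# Blank out loadings whose absolute value is below the cut-off
cut <- 0.32
shown[abs(loadings) < cut] <- ""

# A character data frame that can be printed or passed to a table package
shown_df <- as.data.frame(shown)
print(shown_df)
```

Because the blanked cells are empty strings, the result is a character table meant for display; any further arithmetic should use the numeric loadings instead.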

Predict using psych in R for PCA

I have a data set which I divided into the training and testing set after first recoding qualitative variables to integers. I ran PCA analysis using the psych package.
For the training set, I ran the below code:
train.scale<-scale(trainagain[,-1:-2])
pcafit<-principal(train.scale,nfactors = 11, rotate="Varimax")
It extracted the components as below:
RC1 RC4 RC3 RC5 RC2 RC6 RC7 RC8 RC9 RC11 RC10
SS loadings 2.44 1.92 1.90 1.72 1.65 1.46 1.40 1.15 1.10 1.01 1.01
Proportion Var 0.10 0.08 0.08 0.07 0.07 0.06 0.06 0.05 0.05 0.04 0.04
Cumulative Var 0.10 0.18 0.26 0.33 0.40 0.46 0.52 0.57 0.61 0.66 0.70
Proportion Explained 0.15 0.11 0.11 0.10 0.10 0.09 0.08 0.07 0.07 0.06 0.06
Cumulative Proportion 0.15 0.26 0.37 0.48 0.58 0.66 0.75 0.81 0.88 0.94 1.00
For the test set, I ran the below code:
str(testagain)
testagain.scores<-data.frame(predict(pcafit,testagain[,c(-1:-2)]))
The str(testagain) shows that my data structure is similar to trainagain, with all contents being integers. However, for the testagain.scores, the contents are all NaN.
How can I get "predict" to work? To my knowledge, I am following:
# S3 method for psych
predict(object, data,old.data,options=NULL,missing=FALSE,impute="none",...)
from:
https://www.rdocumentation.org/packages/psych/versions/2.0.7/topics/predict.psych
I think I may have stumbled across the solution: remove any feature/column whose value is exactly the same across all samples.
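A plausible reason, stated here as an assumption rather than something confirmed in the thread, is that a constant column has a standard deviation of zero, so standardising it divides zero by zero and yields NaN. The base R sketch below shows this on made-up data and drops such columns from both sets; train_df, test_df and is_constant are hypothetical names standing in for trainagain and testagain.

```r
# Made-up stand-ins for the training and test sets; x2 is constant
train_df <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(7, 7, 7, 7), x3 = c(2, 5, 1, 8))
test_df  <- data.frame(x1 = c(2, 3),       x2 = c(7, 7),       x3 = c(4, 6))

# Standardising a constant column gives (7 - 7) / 0, which is NaN
scaled <- scale(train_df)

# Flag columns that take a single value in either data set
is_constant <- function(col) length(unique(col)) <= 1
constant_cols <- names(train_df)[
  vapply(train_df, is_constant, logical(1)) |
  vapply(test_df,  is_constant, logical(1))
]

# Drop them from both sets so the fitted model and the new data keep matching columns
keep_cols   <- setdiff(names(train_df), constant_cols)
train_clean <- train_df[, keep_cols, drop = FALSE]
test_clean  <- test_df[, keep_cols, drop = FALSE]
```

The PCA would then need to be refitted on the cleaned training data before calling predict on the cleaned test data, since both must contain the same variables.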

'x' must be numeric ERROR in R while trying to create a Leaf and Stem display

I am a beginner at R and I'm just trying to read a text file that contains values and create a stem display, but I keep getting an error. Here is my code:
setwd("C:/Users/Michael/Desktop/ch1-ch9 data/CH01")
gravity=read.table("C:ex01-11.txt", header=T)
stem(gravity)
Error in stem(gravity) : 'x' must be numeric
The File contains this:
'spec_gravity'
0.31
0.35
0.36
0.36
0.37
0.38
0.4
0.4
0.4
0.41
0.41
0.42
0.42
0.42
0.42
0.42
0.43
0.44
0.45
0.46
0.46
0.47
0.48
0.48
0.48
0.51
0.54
0.54
0.55
0.58
0.62
0.66
0.66
0.67
0.68
0.75
If you can help, I would appreciate it! Thanks!
gravity is a data frame. stem expects a numeric vector. You need to select a column of your data set and pass it to stem, i.e.
## The first column
stem(gravity[,1])
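The difference between the data frame and its column can be checked directly. The sketch below rebuilds a shortened version of the data by hand (only the first six values from the file) and assumes the column is named spec_gravity, as in the header line shown in the question.

```r
# Shortened stand-in for the file contents (first six values only)
gravity <- data.frame(spec_gravity = c(0.31, 0.35, 0.36, 0.36, 0.37, 0.38))

# The data frame as a whole is not numeric, which is what stem() complains about
frame_is_numeric <- is.numeric(gravity)

# A single column extracted with $ (or [[1]]) is a plain numeric vector
column_is_numeric <- is.numeric(gravity$spec_gravity)

# Selecting the column by name is equivalent to gravity[, 1] here
stem(gravity$spec_gravity)
```

Selecting by name tends to be more robust than selecting by position if columns are later added to the file.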

GNU PLOT 2D Curve

I am trying to plot the following data:
SMO LogiBoost BFTree
25(>=7) 0.81 0.72 0.62
30(>=7) 0.83 0.76 0.56
35(>=7) 0.84 0.70 0.75
40(>=7) 0.74 0.67 0.58
25(>=8) 0.73 0.76 0.57
30(>=8) 0.78 0.74 0.65
35(>=8) 0.83 0.78 0.68
40(>=8) 0.75 0.67 0.66
25(>=9) 0.69 0.74 0.62
30(>=9) 0.79 0.75 0.62
35(>=9) 0.82 0.82 0.69
40(>=9) 0.78 0.80 0.53
25(>=12) 0.77 0.78 0.67
30(>=12) 0.76 0.74 0.59
35(>=12) 0.91 0.94 0.75
40(>=12) 0.75 0.75 0.64
25(>=15) 0.74 0.74 0.60
30(>=15) 0.80 0.71 0.64
35(>=15) 0.80 0.71 0.76
40(>=15) 0.75 0.75 0.75
SansVar(>= 7) 0.80 0.77 0.61
SansVar(>=8) 0.71 0.75 0.56
SansVar(>=9) 0.81 0.76 0.71
SansVar(>=12) 0.84 0.82 0.68
SansVar(>=15) 0.81 0.83 0.75
The first column holds the X labels and the first line holds the names of the plotted series (the Y labels).
I tried to add the X labels as well, but they overlap each other. Is it possible to fix this?
Command to plot:
plot "data1.txt" using 1:xtic(1) title 'SMO' with lines, \
     "data.txt" using 2:xtic(1) title 'LogiBoost' with lines, \
     "data.txt" using 3:xtic(1) title 'BFTree' with lines
I found a possible solution, shown below, but the problem remains that the X labels don't fit in the whole image.
set xtics rotate by -45
You could try resizing the margins.
reset
set terminal png
set rmargin at screen 0.85
set bmargin at screen 0.25
set output 'out.png'
set xtics rotate by -45 scale 0
# Column 1 holds the tick labels; columns 2 to 4 hold the SMO, LogiBoost and BFTree values
plot "data.dat" using 2:xtic(1) title 'SMO' with lines, \
     "" using 3:xtic(1) title 'LogiBoost' with lines, \
     "" using 4:xtic(1) title 'BFTree' with lines
