How to remove rows from a data frame without rewriting it? - R

df = structure(list(V1 = c(1, 2, 2, 3, 4, 5, 5, 6, 7), V2 = c(3.5, 3, 2.5, 2, 3, 2, 3, 5, 4), V3 = c(6.5, 8, 9, 5, 7, 4, 3, 6, 7)), row.names = c(NA, 9L), class = "data.frame")
trash = c(2,3)
How to remove the rows having the IDs in trash without rewriting the df?

I don't think there are in-place operations for this in R. Even if you run:
df = structure(list(V1 = c(1, 2, 2, 3, 4, 5, 5, 6, 7), V2 = c(3.5, 3, 2.5, 2, 3, 2, 3, 5, 4), V3 = c(6.5, 8, 9, 5, 7, 4, 3, 6, 7)), row.names = c(NA, 9L), class = "data.frame")
trash = c(2,3)
df = df[-trash,]
df is still rewritten: the subset is a new data frame that is assigned back to the name df.
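The question speaks of "IDs", so it may be worth separating two readings of trash. The sketch below assumes, purely for illustration, that V1 is the ID column; by_position and by_id are names introduced here, not part of the original post. Negative indexing drops rows by position, while %in% drops rows by value.

```r
# Same data as in the question, rebuilt with data.frame() for readability
df <- data.frame(
  V1 = c(1, 2, 2, 3, 4, 5, 5, 6, 7),
  V2 = c(3.5, 3, 2.5, 2, 3, 2, 3, 5, 4),
  V3 = c(6.5, 8, 9, 5, 7, 4, 3, 6, 7)
)
trash <- c(2, 3)

# Reading 1: trash holds row positions, so rows 2 and 3 are dropped
by_position <- df[-trash, ]

# Reading 2 (assumption): trash holds values of the ID column V1,
# so every row whose V1 is 2 or 3 is dropped
by_id <- df[!df$V1 %in% trash, ]
```

Either way the result is a new data frame; assigning it back to df replaces the old object rather than modifying it in place.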


How to create a ggplot graph with the three groups in one plot?

My code is:
ggplot(data=df2, aes(x=stress, fill=as.factor(JP_Gender))) + geom_density(alpha=.3)
ggplot(data=df1, aes(x=CGstress)) + geom_density(alpha=.3)
My dataset 1:
structure(list(CGstress = c(4, 1, 10, 8, 9.5, 5, 5, 6, 6, 6,
7, 3, 4.5, 8, 9, 1, 5, 1, 5.5, 4, 1, 7, 9, 8, 3, NA, 10, 9, 5,
3, NA, 10, 6, NA, 10, 7)), row.names = c(NA, -36L), class = c("tbl_df",
"tbl", "data.frame"))
My dataset 2:
structure(list(stress = c(7, 2, 5, 6, 7, 1, 6, 10, 9, 10, 10,
10, 10, 8, 9, 4, 7, 6, 4, 9, 4, 8, 3.5, 7, 6, 6, 1, 7, 9, 8,
10, 6, 3, 1, 1, 1, 9, 6, 4), JP_Gender = structure(c(1, 2, 1,
2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1), label = "What is your gender?", format.stata = "%12.0g", labels = c(Male = 1,
Female = 2, Transgender = 3, Other = 4), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -39L), class = c("tbl_df",
"tbl", "data.frame"))
The code above gives me two separate graphs. How can I combine them into one plot, and how do I label the legend?
You can try combining the two datasets and then plotting:
library(dplyr)
library(ggplot2)
df1 %>%
  mutate(id = 3) %>%
  rename(stress = CGstress) %>%
  bind_rows(df2 %>%
              mutate(id = as.integer(JP_Gender)) %>%
              select(stress, id)) %>%
  mutate(id = factor(id)) %>%
  ggplot(aes(x = stress, fill = factor(id))) + geom_density(alpha = .3)
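The second part of the question (labelling the legend) can be handled by giving the grouping factor readable labels before plotting. This is a minimal sketch on made-up stand-in data, not the asker's datasets. The label "CG" for the third group and the legend title "Group" are assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the combined data: ids 1 and 2 come from
# JP_Gender in df2, id 3 marks the rows that came from df1
combined <- data.frame(
  stress = c(7, 5, 1, 6, 9, 10,  2, 6, 7, 10, 10, 8,  4, 1, 10, 8, 9.5, 5),
  id     = rep(c(1, 2, 3), each = 6)
)

# Turn the numeric id into a factor with human-readable labels;
# these labels are what the legend displays
combined$group <- factor(combined$id,
                         levels = c(1, 2, 3),
                         labels = c("Male", "Female", "CG"))

p <- ggplot(combined, aes(x = stress, fill = group)) +
  geom_density(alpha = 0.3) +
  labs(fill = "Group")  # legend title
```

Printing p draws the three densities in one panel with a labelled legend.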

What does "Error: Must use a vector in `[`, not an object of class matrix." mean when running a PCA?

I am quite new to R and I am trying to run a PCA for an incomplete data set with the code:
res.comp <- imputePCA(questionaire_results_PCA, ncp = nb$ncp)
but R tells me:
Error: Must use a vector in [, not an object of class matrix.
Run rlang::last_error() to see where the error occurred.
So I run:
rlang::last_error()
R says:
1. missMDA::imputePCA(questionaire_results_PCA, ncp = nb$ncp)
4. tibble:::`[.tbl_df`(X, !is.na(X))
5. tibble:::check_names_df(i, x)
Run `rlang::last_trace()` to see the full context
So I run:
rlang::last_trace()
And R Says:
Must use a vector in `[`, not an object of class matrix.
Backtrace:
█
1. └─missMDA::imputePCA(questionaire_results_PCA, ncp = nb$ncp)
2. ├─base::mean((res.impute$fittedX[!is.na(X)] - X[!is.na(X)])^2)
3. ├─X[!is.na(X)]
4. └─tibble:::`[.tbl_df`(X, !is.na(X))
5. └─tibble:::check_names_df(i, x)
Does anyone know what this means and how I could get it to work?
I have run:
dput(head(questionaire_results_PCA))
and I got:
structure(list(Active = c(6, 6, 5, 7, 5, 6), `Aggressive to people` = c(NA,
4, NA, 2, NA, 1), Anxious = c(NA, 4, NA, 3, NA, 2), Calm = c(NA,
5, NA, 5, NA, 6), Cooperative = c(7, 6, 7, 6, 6, 6), Curious = c(7,
2, 7, 7, 7, 6), Depressed = c(1, 3, 1, 1, 1, 1), Eccentric = c(1,
3, 1, 4, 1, 4), Excitable = c(5, 2, 5, 5, 4, 4), `Fearful of people` = c(1,
2, 1, 2, 1, 1), `friendly of people` = c(5, 6, 7, 7, 7, 7), Insecure = c(2,
5, 2, 3, 2, 2), Playful = c(4, 6, 2, 5, 6, 6), `Self assured` = c(7,
6, 7, 5, 6, 6), Smart = c(6, 2, 7, 5, 7, 3), Solitary = c(4,
4, 3, 4, 3, 2), Tense = c(1, 2, 1, 3, 1, 2), Timid = c(2, 2,
2, 2, 2, 2), Trusting = c(6, 6, 6, 6, 6, 6), Vigilant = c(7,
6, 5, 3, 5, 3), Vocal = c(2, 7, 1, 6, 1, 7)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I then ran the code:
dput(nb$ncp)
and got:
3L
Here's the answer in case anyone comes across the same issue. Using the data provided by OP:
class(questionaire_results_PCA)
[1] "tbl_df" "tbl" "data.frame"
imputePCA expects a plain data.frame as input and does not work with a tibble (tbl_df), so we need to convert it to a data.frame or matrix:
library(missMDA)
res.comp <- imputePCA(data.frame(questionaire_results_PCA), ncp = 2)
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) :
infinite or missing values in 'x'
I get this error because this is only a subset of the data and some of the columns have zero standard deviation. We work around this first by dropping those columns:
sel = which(apply(questionaire_results_PCA,2,sd)!=0)
# returns you a data.frame
res1 <- imputePCA(as.data.frame(questionaire_results_PCA[,sel]), ncp = 2)
# returns you a matrix
res2 <- imputePCA(as.matrix(questionaire_results_PCA[,sel]), ncp = 2)
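Two base R details behind the steps above can be checked on a toy data frame. This is a hedged sketch, and X and the keep_* names are made up here. First, indexing a plain data.frame with a logical matrix, as in X[!is.na(X)] from the backtrace, works, which is why converting the tibble helps. Second, sd() returns NA for a column containing missing values unless na.rm = TRUE is set, and which() silently drops NA, so the column filter shown above would also discard the columns that have missing values.

```r
# Toy data: one complete column, one with NAs, one constant column
X <- data.frame(
  Active  = c(6, 6, 5, 7, 5, 6),
  Anxious = c(NA, 4, NA, 3, NA, 2),
  Timid   = c(2, 2, 2, 2, 2, 2)
)

# Logical-matrix indexing is supported by plain data frames:
# this returns the non-missing values as a vector
observed <- X[!is.na(X)]

# Column standard deviations without and with NA removal
sd_default <- apply(X, 2, sd)                # Anxious is NA here
sd_na_rm   <- apply(X, 2, sd, na.rm = TRUE)  # Anxious is a number here

# which() drops NA, so the default filter loses the column with NAs
keep_default <- names(which(sd_default != 0))
keep_na_rm   <- names(which(sd_na_rm != 0))
```

If the goal is to impute the missing values, keeping those columns (the na.rm = TRUE variant) is probably what is wanted. Whether imputePCA then runs cleanly on such a small subset has not been checked here.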

Code to analyze relationships between responses to different ranking questions on a survey

My goal is to find much simpler code, which can generalize, that shows the relationships between responses to two survey questions. In the MWE, one question asked respondents to rank eight marketing selections from 1 to 8 and the other asked them to rank nine attribute selections from 1 to 9. Higher rankings indicate the respondent favored the selection more. Here is the data frame.
structure(list(Email = c("a", "b", "c", "d", "e", "f", "g", "h",
"i"), Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1), Alumni = c(3, 2, 2,
3, 2, 3, 2, 2, 2), Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3), Referrals = c(4,
3, 4, 8, 7, 8, 8, 6, 4), Speeches = c(7, 7, 6, 7, 4, 7, 4, 5,
5), Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6), Visits = c(5, 8,
7, 6, 6, 6, 6, 4, 8), `Business Savvy` = c(10, 6, 10, 10, 4,
4, 6, 8, 9), Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7), Experience = c(7,
7, 7, 9, 2, 8, 5, 9, 5), Innovation = c(2, 1, 4, 2, 1, 2, 2,
1, 1), Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2), Personal = c(8,
10, 6, 8, 6, 10, 4, 3, 3), Rates = c(9, 5, 9, 6, 9, 7, 10, 5,
4), `Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8), `Total Cost` = c(5,
4, 3, 7, 8, 6, 9, 4, 6)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
If numeric rankings cannot be used for my solution to calculating relationships (correlations), please correct me.
Hoping they can be used, I arrived at the following plodding code, which I hope calculates the correlation matrix of each method selection against each attribute selection.
library(psych)
dataframe2 <- psych::corr.test(dataframe[ , c(2, 9:17)])[[1]][1:10] # the first method vs all attributes
dataframe3 <- psych::corr.test(dataframe[ , c(3, 9:17)])[[1]][1:10] # the 2nd method vs all attributes and so on
dataframe4 <- psych::corr.test(dataframe[ , c(4, 9:17)])[[1]][1:10]
dataframe5 <- psych::corr.test(dataframe[ , c(5, 9:17)])[[1]][1:10]
dataframe6 <- psych::corr.test(dataframe[ , c(6, 9:17)])[[1]][1:10]
dataframe7 <- psych::corr.test(dataframe[ , c(7, 9:17)])[[1]][1:10]
dataframe8 <- psych::corr.test(dataframe[ , c(8, 9:17)])[[1]][1:10]
# create a dataframe from the rbinded rows
bind <- data.frame(rbind(dataframe2, dataframe3, dataframe4, dataframe5, dataframe6, dataframe7, dataframe8))
Rename rows and columns:
colnames(bind) <- c("Sel", colnames(dataframe[9:17]))
rownames(bind) <- colnames(dataframe[2:8])
How can I accomplish the above more efficiently?
By the way, the bind data frame also allows one to produce a heat map with the DataExplorer package.
library(DataExplorer)
DataExplorer::plot_correlation(bind)
[Summary]
In the scope of our discussion, there are two ways to get the correlation data.
Use stats::cor, i.e., cor(subset(dataframe, select = -Email))
Use psych::corr.test, i.e., corr.test(subset(dataframe, select = -Email))[[1]]
Then you may subset the correlation matrix with the desired rows and columns.
To use DataExplorer::plot_correlation, you can simply call plot_correlation(dataframe, type = "c"). Note: the output heatmap will include correlations for all columns, so you can just ignore the columns that are not of interest.
[Original Answer]
## Create data
dataframe <- structure(
list(
Email = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1),
Alumni = c(3, 2, 2, 3, 2, 3, 2, 2, 2),
Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3),
Referrals = c(4, 3, 4, 8, 7, 8, 8, 6, 4),
Speeches = c(7, 7, 6, 7, 4, 7, 4, 5, 5),
Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6),
Visits = c(5, 8, 7, 6, 6, 6, 6, 4, 8),
`Business Savvy` = c(10, 6, 10, 10, 4, 4, 6, 8, 9),
Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7),
Experience = c(7, 7, 7, 9, 2, 8, 5, 9, 5),
Innovation = c(2, 1, 4, 2, 1, 2, 2, 1, 1),
Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2),
Personal = c(8, 10, 6, 8, 6, 10, 4, 3, 3),
Rates = c(9, 5, 9, 6, 9, 7, 10, 5, 4),
`Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8),
`Total Cost` = c(5, 4, 3, 7, 8, 6, 9, 4, 6)
),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")
)
Following your example strictly, we can do the following:
## Calculate correlation
df2 <- subset(dataframe, select = -Email)
marketing_selections <- names(df2)[1:7]
attribute_selections <- names(df2)[8:16]
corr_matrix <- psych::corr.test(df2)[[1]]
bind <- subset(corr_matrix,
               subset = rownames(corr_matrix) %in% marketing_selections,
               select = attribute_selections)
DataExplorer::plot_correlation(bind)
WARNING
However, is this what you really want? psych::corr.test generates the correlation matrix, and DataExplorer::plot_correlation calculates the correlation again. It is like the correlation of the correlation.
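If only the marketing-by-attribute block of the matrix is needed, stats::cor accepts two sets of columns and returns just the cross-correlations, so no subsetting is needed afterwards. The sketch below uses a few columns from the question's data to stay short. The choice of method = "spearman" is an assumption made here because the values are rankings; drop it to get the default Pearson correlation.

```r
# A reduced version of the question's data (three columns from each question)
dataframe <- data.frame(
  Email         = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  Ads           = c(2, 1, 1, 1, 1, 2, 1, 1, 1),
  Alumni        = c(3, 2, 2, 3, 2, 3, 2, 2, 2),
  Articles      = c(6, 4, 3, 2, 3, 4, 3, 3, 3),
  Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7),
  Experience    = c(7, 7, 7, 9, 2, 8, 5, 9, 5),
  Rates         = c(9, 5, 9, 6, 9, 7, 10, 5, 4)
)

marketing_cols <- c("Ads", "Alumni", "Articles")
attribute_cols <- c("Communication", "Experience", "Rates")

# Rows are marketing selections, columns are attribute selections
cross_cor <- cor(dataframe[, marketing_cols],
                 dataframe[, attribute_cols],
                 method = "spearman")
```

The result is a plain numeric matrix with row and column names taken from the two column sets, so it generalizes to any pair of question blocks by changing the two name vectors.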

“For” loop with two different indices, i and j, to create a plot

I have a list called: list_plot
list_plot=list(list(a = c(2, 3, 4, 5), b = c(3, 4, 5, 5), c = c(3, 7, 5,
5), d = c(3, 4, 9, 5), e = c(3, 4, 5, 9), f = c(3, 4, 1, 9),
g = c(3, 1, 5, 9), h = c(3, 3, 5, 9), i = c(3, 17, 3, 9),
j = c(3, 17, 3, 9)), list(a = c(2, 3, 4, 5), b = c(3, 4,
5, 5), c = c(3, 7, 5, 5), d = c(3, 4, 9, 5), e = c(3, 4, 5, 9
), f = c(3, 4, 1, 9), g = c(3, 1, 5, 9), h = c(3, 3, 5, 9), i = c(3,
17, 3, 9), j = c(3, 17, 3, 9)), list(a = c(2, 3, 4, 5), b = c(3,
4, 5, 5), c = c(3, 7, 5, 5), d = c(3, 4, 9, 5), e = c(3, 4, 5,
9), f = c(3, 4, 1, 9), g = c(3, 1, 5, 9), h = c(3, 3, 5, 9),
i = c(3, 17, 3, 9), j = c(3, 17, 3, 9)), list(a = c(2, 3,
4, 5), b = c(3, 4, 5, 5), c = c(3, 7, 5, 5), d = c(3, 4, 9, 5
), e = c(3, 4, 5, 9), f = c(3, 4, 1, 9), g = c(3, 1, 5, 9), h = c(3,
3, 5, 9), i = c(3, 17, 3, 9), j = c(3, 17, 3, 9)), list(a = c(2,
3, 4, 5), b = c(3, 4, 5, 5), c = c(3, 7, 5, 5), d = c(3, 4, 9,
5), e = c(3, 4, 5, 9), f = c(3, 4, 1, 9), g = c(3, 1, 5, 9),
h = c(3, 3, 5, 9), i = c(3, 17, 3, 9), j = c(3, 17, 3, 9)))
In list_plot[[i]], i goes from 1 to 5. Each of these inner lists has 10 entries, so j goes from 1 to 10 (list_plot[[i]][j]).
So, to get the first entry I do this: list_plot[[i]][1]
Each list_plot[[i]][j] is a series of numbers that I plot as a chart. Here is the code for this plot:
for (i in 1:5) {
  for (j in 1:10) {
    x11()
    par(mfrow = c(3, 2))
    plot.ts(list_plot[[i]][j])
  }
}
It is showing an error. I want the loop to go through all values of j, from 1 to 10, while i = 1. Then it can move on to i = 2 and again go through all values of j from 1 to 10, and so on. That is, for each i the priority is to finish all the j's before i advances.
Any help?
You need extra brackets around j. Change the plot call to:
plot.ts(list_plot[[i]][[j]])
The reason is that [ and [[ return objects of different classes (double brackets extract the element itself, here a numeric vector, whereas single brackets keep the subset as a list):
class(list_plot[[1]][1])
"list"
class(list_plot[[1]][[1]])
"numeric"
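Putting the fix into the nested loop might look like the sketch below. It uses a smaller stand-in list (2 outer entries with 3 series each) and a null PDF device so that it runs without a display; both are assumptions made for the example. With the real data you would loop over all 5 x 10 entries and open x11() instead. The layout is set once per i, so all series for one i share a page.

```r
# Smaller stand-in for list_plot with the same nested structure
list_plot <- list(
  list(a = c(2, 3, 4, 5), b = c(3, 4, 5, 5), c = c(3, 7, 5, 5)),
  list(a = c(2, 3, 4, 5), b = c(3, 4, 5, 5), c = c(3, 7, 5, 5))
)

pdf(NULL)     # throwaway device; replace with x11() for on-screen windows
n_plots <- 0  # counter, only used to confirm every series was drawn

for (i in seq_along(list_plot)) {
  par(mfrow = c(3, 2))  # 3 x 2 grid of panels for this i
  for (j in seq_along(list_plot[[i]])) {
    # [[j]] extracts the numeric vector itself, which plot.ts can draw
    plot.ts(list_plot[[i]][[j]],
            main = paste("i =", i, ", j =", j),
            ylab = names(list_plot[[i]])[j])
    n_plots <- n_plots + 1
  }
}

invisible(dev.off())
```

The inner loop always finishes all j for the current i before i advances, which is the ordering asked for. A 3 x 2 grid holds six panels, so with ten series per i the remaining four would spill onto a second page.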

Add legend to graph in R

For a sample dataframe:
df <- structure(list(antibiotic = c(0.828080341411847, 1.52002304506738,
1.31925434545302, 1.66681722567074, 1.17791610945551, 0.950096368502059,
1.10507733691997, 1.0568193215304, 1.03853131016669, 1.02313195567946,
0.868629787234043, 0.902126485349154, 1.12005679002801, 1.88261441540084,
0.137845900627507, 1.07040656448604, 1.41496470588235, 1.30978543173373,
1.16931780610558, 1.05894439450366, 1.24805122785724, 1.21318238007025,
0.497310305098053, 0.872362356327429, 0.902584749481137, 0.999731895498823,
0.907560340983954, 1.05930840957587, 1.40457554864091, 1.09747179272879,
0.944219456216072, 1.10363111431903, 0.974649273935516, 0.989983064420841,
1.14784471036171, 1.17232858907798, 1.44675812720393, 0.727078405331282,
1.36341361598635, 1.06120293299474, 1.06920290856811, 0.711007267992205,
1.39034247642439, 0.710873996527168, 1.30529753573398, 0.781191310196629,
0.921788181250106, 0.932214675722466, 0.752289683770589, 0.942392026874501
), year = c(3, 1, 4, 1, 2, 4, 1, 3, 4, 3, 4, 1, 2, 3, 4, 1, 1,
4, 1, 1, 1, 1, 4, 1, 3, 3, 1, 4, 1, 4, 2, 1, 1, 1, 3, 4, 3, 2,
2, 2, 3, 3, 1, 2, 3, 2, 3, 4, 4, 1), imd.decile = c(8, 2, 5,
5, 4, 3, 2, 8, 6, 4, 3, 6, 9, 2, 5, 3, 5, 6, 4, 2, 9, 11, 2,
8, 3, 5, 7, 8, 7, 4, 9, 7, 6, 4, 8, 10, 5, 6, 6, 11, 6, 4, 2,
4, 10, 8, 2, 8, 4, 3)), .Names = c("antibiotic", "year", "imd.decile"
), row.names = c(17510L, 6566L, 24396L, 2732L, 13684L, 28136L,
1113L, 15308L, 28909L, 21845L, 23440L, 1940L, 8475L, 22406L,
27617L, 4432L, 3411L, 27125L, 6891L, 6564L, 1950L, 5683L, 25240L,
5251L, 20058L, 18068L, 5117L, 29066L, 2807L, 24159L, 12309L,
6044L, 7629L, 2336L, 16583L, 23921L, 17465L, 14911L, 8879L, 13929L,
17409L, 19421L, 7239L, 11570L, 15283L, 8283L, 16246L, 27950L,
23723L, 4411L), class = "data.frame")
I am trying to graph antibiotic against imd.decile, with one line for each year:
library(ggplot2)
p <- ggplot(df, aes(x = imd.decile, y = antibiotic, group = factor(year))) +
stat_summary(geom = "line", fun.y = mean)
p
How do I use the wave (year) to colour the corresponding lines and add a legend? (I can't seem to use the aes command correctly.)
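One possible approach, assuming "wave" refers to the year column: map colour to factor(year) inside aes(), and ggplot2 adds the legend automatically. The sketch uses a small made-up data frame with the same column names rather than the full sample. It also uses the fun argument of stat_summary, which replaced fun.y in ggplot2 3.3.0; on older versions keep fun.y.

```r
library(ggplot2)

# Hypothetical small stand-in with the same columns as the sample data
df <- data.frame(
  antibiotic = c(0.83, 1.52, 1.32, 1.67, 1.18, 0.95, 1.11, 1.06),
  year       = c(1, 1, 2, 2, 3, 3, 4, 4),
  imd.decile = c(2, 8, 2, 8, 2, 8, 2, 8)
)

p <- ggplot(df, aes(x = imd.decile, y = antibiotic,
                    colour = factor(year), group = factor(year))) +
  stat_summary(geom = "line", fun = mean) +  # mean antibiotic per decile and year
  labs(colour = "Year")                      # legend title
```

Because year is wrapped in factor(), the legend shows one discrete colour per year instead of a continuous colour bar.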
