Preparing data before doing Principal component analysis (PCA) - r

I have a data frame (200 x 300) that consists of mixed (character and numeric) variables and has many missing values (NA).
My first problem is how to convert all the data to numeric. I can use factors, but there are about 100 columns to convert.
Secondly, my columns are not all expressed in equivalent units.
I would like some good advice on preparing the data before starting my analysis.
The following is the structure of the data:
structure(list(Hormonal.cycle.status..P4. = c(1, 1, 4, 1, 4,
1), Hormonal.medication.status = c(2, 1, 1, 2, 1, 1), Hormonal.medication.type = c(21,
27, 27, 26, 27, 27), ID.pathologist.main = c(3, 3, 3, 4, 2, 1
), ID.pathologist.sub = c(2, 1, 2, 2, 2, 2), Day.of.the.cycle_calculated = c(10,
8, 22, 19, 19, 12), Cycle.status..histology.and.cycle.day. = c(12,
18, 9, 1, 18, 3), Cycle.status.final..P4..histology..cycle.day. = c(1,
4, 5, 1, 6, 3), Deep.lesion = c(2, 2, 1, 2, 1, 2), Ovarian.lesion = c(2,
2, 2, 1, 2, 2), Peritoneal.lesion = c(2, 2, 2, 2, 2, 2), Combination.of.lesions = c(1,
1, 7, 4, 7, 1), DEEP.all.types = c(2, 2, NA, 2, NA, 2), DEEP.uterosacral = c(2,
2, NA, 2, NA, 2), DEEP.RVE = c(1, 1, 1, 2, 1, 1), DEEP.bowel = c(1,
1, 1, 1, 1, 1), DEEP.bladder = c(1, 1, 1, 1, 1, 1), Ovarian.endo.cyst = c(5,
5, 3, 1, 3, 5), Peritoneal = c(2, NA, 2, NA, 2, 2), Peritoneal.size.total = c(3,
3, 3, 3, 3, 2), Date.of.the.surgery = c(96, 98, 17, 105, 107,
108), Type.of.surgery = c(1, 1, 1, 1, 1, 1), Perit..surface = c(1,
2, 3, 3, 2, 2), Perit..deep = c(3, NA, NA, NA, 3, 3), R.ovary.surface = c(NA,
1, 2, 1, 1, NA), R.ovary.deep = c(NA, NA, 2, NA, 3, NA), L.ovary.surface = c(2,
NA, 2, NA, 1, NA), L.ovary.deep = c(2, 4, 4, NA, 4, 4), F.d.block = c(NA,
NA, 1, 1, NA, 1), R.ovary.frail = c(NA, NA, 3, NA, NA, 3), R.ovary.tight = c(NA,
NA, NA, NA, 3, NA), L.ovary.frail = c(NA, NA, NA, 2, NA, NA),
L.ovary.tight = c(2, 2, 3, NA, 2, 2), R.tuba.frail = c(NA,
NA, NA, NA, NA, 2), R.tuba.tight = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), L.tuba.frail = c(NA,
NA, NA, NA, NA, 2), L.tuba.tight = c(4, NA, NA, NA, 4, NA
), Elsewhere = c(35, NA, NA, 24, NA, NA), Other.diseases = c(16,
NA, NA, NA, NA, 11), In.microarray = c(2, 2, 1, 2, 2, 1),
In.cytokine.plexes = c(2, 2, 2, 2, 2, 2)), .Names = c("Hormonal.cycle.status..P4.",
"Hormonal.medication.status", "Hormonal.medication.type", "ID.pathologist.main",
"ID.pathologist.sub", "Day.of.the.cycle_calculated", "Cycle.status..histology.and.cycle.day.",
"Cycle.status.final..P4..histology..cycle.day.", "Deep.lesion",
"Ovarian.lesion", "Peritoneal.lesion", "Combination.of.lesions",
"DEEP.all.types", "DEEP.uterosacral", "DEEP.RVE", "DEEP.bowel",
"DEEP.bladder", "Ovarian.endo.cyst", "Peritoneal", "Peritoneal.size.total",
"Date.of.the.surgery", "Type.of.surgery", "Perit..surface", "Perit..deep",
"R.ovary.surface", "R.ovary.deep", "L.ovary.surface", "L.ovary.deep",
"F.d.block", "R.ovary.frail", "R.ovary.tight", "L.ovary.frail",
"L.ovary.tight", "R.tuba.frail", "R.tuba.tight", "L.tuba.frail",
"L.tuba.tight", "Elsewhere", "Other.diseases", "In.microarray",
"In.cytokine.plexes"), row.names = c("H003", "H004", "H006",
"H007", "H008", "H011"), class = "data.frame")

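One way to approach the preparation, as a minimal sketch rather than a definitive recipe: convert character columns in one pass, drop columns that are mostly missing, fill the remaining gaps, drop constant columns, and let prcomp() centre and scale so that differing units do not dominate. The toy data frame, the 50% missingness threshold, and the median fill are all assumptions made for illustration. Note also that integer codes impose an arbitrary order on nominal categories, which may not be appropriate for every variable.

```r
# Toy data standing in for the real 200 x 300 frame (values are made up).
df <- data.frame(
  grp   = c("a", "b", "a", "c", "b", "a"),
  x     = c(1.2, NA, 3.1, 4.8, 5.0, 2.2),
  y     = c(100, 220, NA, 410, 530, 180),
  empty = NA_real_,   # a column with no observed values
  const = 1,          # a column with zero variance
  stringsAsFactors = FALSE
)

# 1. Convert every character column to integer codes in one pass.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(col) as.integer(factor(col)))

# 2. Drop columns that are mostly missing (the 50% cut-off is an assumption).
df <- df[, colMeans(is.na(df)) <= 0.5, drop = FALSE]

# 3. Fill the remaining NAs with the column median (a crude placeholder).
df[] <- lapply(df, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})

# 4. Drop constant columns, which cannot be scaled to unit variance.
keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
df <- df[, keep, drop = FALSE]

# 5. Centre and scale each column, then run the PCA.
pca <- prcomp(df, center = TRUE, scale. = TRUE)
summary(pca)
```

With this much missing data, a proper imputation method is likely preferable to the median fill used here.
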
Related

R remove rows with NA in groups of columns containing the same string

I have a dataframe that contains multiple variables, each measured with multiple items at two different time points. I want to remove all rows with NA entries in groups of columns containing the same part of a string. Some of these groups contain multiple columns (e.g., grep("learn")), some only one (e.g., T1_age). This is my original dataframe (a part of it):
data <- data.frame(
T1_age = c(39, 30, 20, 48, 27, 55, 37, 50, 50, 37),
T1_sex = c(2, 1, 1, 2, 2, 1, 1, 2, 1, 1),
T2_learn1 = c(2, NA, 3, 4, 1, NA, NA, 2, 4, 4),
T2_learn2 = c(1, NA, 4, 4, 1, NA, NA, 2, 4, 4),
T2_learn3 = c(2, NA, 4, 4, 1, NA, NA, 3, 4, 4),
T2_learn4 = c(2, NA, 2, 5, 5, NA, NA, 5, 5, 5),
T2_learn5 = c(4, NA, 3, 4, 3, NA, NA, 3, 4, 3),
T2_aut1 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 4),
T2_aut2 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 5),
T2_aut3 = c(NA, NA, 4, 4, 3, NA, NA, 3, 5, 5),
T2_ssup1 = c(1, NA, 4, 5, 4, NA, NA, 2, 4, 3),
T2_ssup2 = c(3, NA, 4, 5, 5, NA, NA, 3, 4, 4),
T2_ssup3 = c(4, NA, 4, 5, 5, NA, NA, 4, 4, 4),
T2_ssup4 = c(2, NA, 3, 5, 5, NA, NA, 3, 4, 4),
T3_learn1 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 4),
T3_learn2 = c(1, NA, NA, 4, 3, NA, NA, 3, 3, 4),
T3_learn3 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 5),
T3_learn4 = c(4, NA, NA, 5, 4, NA, NA, 4, 5, 5),
T3_learn5 = c(4, NA, NA, 3, 4, NA, NA, 3, 3, 4),
T3_aut1 = c(NA, NA, NA, 4, 4, NA, NA, 3, 5, 5),
T3_aut2 = c(NA, NA, NA, 3, 4, NA, NA, 3, 5, 5),
T3_aut3 = c(NA, NA, NA, 3, 2, NA, NA, 3, 5, 5),
T3_ssup1 = c(3, NA, NA, 5, 4, NA, NA, 2, 4, 1),
T3_ssup2 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 5),
T3_ssup3 = c(4, NA, NA, 5, 5, NA, NA, 4, 5, 3),
T3_ssup4 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 4)
)
Now I already found a very horrible solution and I believe that could be improved. So this code basically does what I want:
library(dplyr)
library(tidyr)
data <- data %>% filter(rowSums(is.na(.[ , grep("learn", colnames(.))])) != ncol(.[ , grep("learn", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("aut", colnames(.))])) != ncol(.[ , grep("aut", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("ssup", colnames(.))])) != ncol(.[ , grep("ssup", colnames(.))]))
data <- data %>% drop_na(T1_age)
data <- data %>% drop_na(T1_sex)
So the new data frame (and what I want to achieve) looks like this:
data2 <- data.frame(
T1_age = c(20, 48, 27, 50, 50, 37),
T1_sex = c(1, 2, 2, 2, 1, 1),
T2_learn1 = c(3, 4, 1, 2, 4, 4),
T2_learn2 = c(4, 4, 1, 2, 4, 4),
T2_learn3 = c(4, 4, 1, 3, 4, 4),
T2_learn4 = c(2, 5, 5, 5, 5, 5),
T2_learn5 = c(3, 4, 3, 3, 4, 3),
T2_aut1 = c(4, 4, 4, 3, 5, 4),
T2_aut2 = c(4, 4, 4, 3, 5, 5),
T2_aut3 = c(4, 4, 3, 3, 5, 5),
T2_ssup1 = c(4, 5, 4, 2, 4, 3),
T2_ssup2 = c(4, 5, 5, 3, 4, 4),
T2_ssup3 = c(4, 5, 5, 4, 4, 4),
T2_ssup4 = c(3, 5, 5, 3, 4, 4),
T3_learn1 = c(NA, 4, 4, 3, 3, 4),
T3_learn2 = c(NA, 4, 3, 3, 3, 4),
T3_learn3 = c(NA, 4, 4, 3, 3, 5),
T3_learn4 = c(NA, 5, 4, 4, 5, 5),
T3_learn5 = c(NA, 3, 4, 3, 3, 4),
T3_aut1 = c(NA, 4, 4, 3, 5, 5),
T3_aut2 = c(NA, 3, 4, 3, 5, 5),
T3_aut3 = c(NA, 3, 2, 3, 5, 5),
T3_ssup1 = c(NA, 5, 4, 2, 4, 1),
T3_ssup2 = c(NA, 5, 5, 4, 5, 5),
T3_ssup3 = c(NA, 5, 5, 4, 5, 3),
T3_ssup4 = c(NA, 5, 5, 4, 5, 4)
)
Could you help me improve this a bit? Thank you!!!
You may iterate over grep in an sapply and check if the rowSums in the slices reach their number of columns.
V <- c('learn', 'aut', 'ssup')
res <- data[!rowSums(sapply(V, \(v) {
X <- data[grep(v, names(data))]
rowSums(is.na(X)) == dim(X)[2]
})), ]
stopifnot(all.equal(res, data2, check.attributes=FALSE))
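Alternatively, you can check whether the sum of NA's in the "hot" columns (without the demographics) reaches the number of those columns. Note that this is a weaker condition: it only drops rows in which every item is missing, so a row that lacks just one whole group (such as the first row, where all aut items are NA) is kept and the result is not identical to data2.
hot <- grep(paste(V, collapse='|'), names(data))
res1 <- data[rowSums(is.na(data[hot])) != length(hot), ]
stopifnot(nrow(res1) == 7)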
data2 is the result data frame you provide in OP. dim(data)[2] gives the same as ncol(data).
Note: R version 4.1.2 (2021-11-01)

Trying to manually reorder a bar chart in R

I need this chart to show the bars in a different order.
The "pretest" bar needs to be first. Nothing happens other than my labels changing. Would really appreciate some help!
This research only makes sense if the table is in the correct order.
I have been using this code trying to change the order.
plot_data2 <- main_data %>%
dplyr::select(training_type,
pretest_result,
C1_reps,
C2_reps,
P1_reps,
P2_reps) %>%
drop_na(pretest_result) %>%
gather(test, reps, pretest_result, C1_reps, C2_reps, P1_reps, P2_reps) %>%
group_by(test, training_type) %>%
summarise(
mean = mean(reps),
lci = t.test(reps)$conf.int[[1]],
uci = t.test(reps)$conf.int[[2]]
) %>%
ungroup() %>%
mutate(test = factor(
test,
levels = c("pretest_result", "C1_reps", "C2_reps", "P1_reps", "P2_reps")
))
This is my code for the plot.
ggplot(plot_data2, aes(x=test, y = mean, fill = training_type)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_errorbar(aes(ymin=lci, ymax=uci),
width=.2,
position=position_dodge(.9)) +
scale_y_continuous(breaks = c(5,1,2,3,4)) +
scale_x_discrete(labels = c("Pretest", "C1", "C2", "P1", "P2")) +
labs(x = "Test type", y = "Average repetitions", fill = "Training type") +
theme_bw()
This is my data
main_data <- structure(
list(
Horse = c("Skori", "Raudhetta", "Emma", "Freyr",
"Nick", "Hilda", "Aleiga", "Sinfonia", "Saga", "Fengur", "Herkules",
"Rumur", "Gaia", "Frøya", "Fanta", "Lindus", "Betty", "Sjamina",
"Dimma", "Astrix", "Presley", "Odin", "Poineten", "Gåte", "Skori",
"Raudhetta", "Emma", "Freyr", "Nick", "Hilda", "Aleiga", "Sinfonia",
"Saga", "Fengur", "Herkules", "Rumur", "Gaia", "Frøya", "Fanta",
"Lindus", "Betty", "Sjamina", "Dimma", "Astrix", "Presley", "Odin",
"Poineten", "Gåte"),
C1_reps = c(5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 5, 6,
4, 5, 6, 6, 4, 5, 5, 5, 4, 5, 5, 4, 5, 6, 6, 5, 5, 5, 5, 6, 5,
4, 4, 5, 4, 5),
C2_reps = c(5, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 6, 6, 5, 6, 5, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 6, 6, 6, 4, 5),
Compliance = c(3,
4, 4, 3, 3, 2, 1, 4, 4, 4, 3, 4, 1, 4, 1, 3, 3, 4, 4, 4, 3, 4,
1, 3, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 3, 4, 3, 4, 3, 4, 3, 4,
4, 4, 3, 4, 3),
P1_reps = c(5, 5, 4, 5, 5, 4, 3, 4, 6, 7, 7,
0, 4, 6, 3, 6, 7, 0, 7, 4, 4, 3, 6, 5, 1, 0, 1, 0, 1, 1, 0, 2,
1, 0, 0, 0, 1, 2, 1, 3, 0, 0, 0, 1, 1, 2, 2, 1),
P2_reps = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6, 6, 2, 6, 7, 4,
6, 5, 6, 4, 4, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 1, 0),
Test.group = c(1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3),
training_type = c("PC", "PC", "PC", "PC", "PC",
"PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC",
"PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "TT", "TT", "TT",
"TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT",
"TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT", "TT"),
pretest_result = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 3, 6, 2, 4, 0, 0, 3, 5, 4, 2, 9, 0, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 6, 4, 2, 2, 1, 0, 5, 4, 6,
6, 16, 0)
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-48L))
Bar order is determined by the order of the factor levels. To reorder the bars, reorder the factor levels by adding something like the following after your first call to mutate(), listing the levels of your factor in the order you want the bars to appear:
mutate(test = forcats::fct_relevel(test, "pretest_result", "C1_reps", "C2_reps", "P1_reps", "P2_reps"))
Add the ordered = TRUE parameter to your script. Change
...
mutate(test = factor(
test,
levels = c("pretest_result", "C1_reps", "C2_reps", "P1_reps", "P2_reps")
))
...
to
...
mutate(test = factor(
test,
levels = c("pretest_result", "C1_reps", "C2_reps", "P1_reps", "P2_reps"), ordered = TRUE
))
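To see why the level order matters, here is a small base-R illustration. It is only a sketch with a made-up vector: without explicit levels, factor() sorts the values, whereas passing levels fixes the order that a discrete axis will follow.

```r
# Made-up vector of test names in arbitrary order.
test <- c("C1_reps", "P1_reps", "pretest_result", "C2_reps", "P2_reps")

# Default: levels are the sorted unique values.
levels(factor(test))

# Explicit: levels appear exactly in the order given.
wanted <- c("pretest_result", "C1_reps", "C2_reps", "P1_reps", "P2_reps")
f <- factor(test, levels = wanted)
levels(f)
```

Because scale_x_discrete(labels = ...) with an unnamed vector assigns labels by position, it is worth checking levels() on the plotted column to confirm that the labels line up with the bars.
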

Recoding data from 1,1,1, to 1,2,3

So I have this dataframe. Under the column potential_child, I want to recode the values so that the oldest child == 1, the second oldest == 2, the third oldest == 3, etc. I have the ages of the children, but I am floundering on how exactly to do this.
DHS1 <- structure(list(person_id = c(1, 2, 1, 2, 3, 4, 1, 7, 1, 2), household_id = c(1,1, 6, 6, 6, 6, 7, 63342, 63344, 63344), year = c(2018, 2018,2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018), month = c(1,1, 1, 1, 1, 1, 1, 12, 12, 12), sex = c(2, 1, 1, 2, 1, 2, 1, 1,1, 2), age = c(28, 28, 44, 37, 10, 10, 60, 65, 55, 55), potential_mom = c(1,NA, NA, 1, NA, NA, NA, NA, NA, 1), potential_child = c(NA, NA,NA, NA, 1, 1, NA, NA, NA, NA), momloc = c(0, 0, 0, 0, 2, 2, 0,0, 0, 0), num_child = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0)), row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))
Me trying to think it through (apologies in advance for this ugly rambling):
mutate(potential_child2 = if potential_child == 1 & age =<)
We can arrange the data by household_id and descending age (so that the oldest child comes first) and, for each household_id, get the cumulative sum of the potential_child value after replacing NA with 0.
library(dplyr)
DHS1 %>%
arrange(household_id, desc(age)) %>%
group_by(household_id) %>%
#Or if you also want to do it for every person
#group_by(person_id, household_id) %>%
mutate(potential_child = cumsum(replace(potential_child,
is.na(potential_child), 0)),
potential_child = replace(potential_child, potential_child == 0, NA))
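The same idea can be sketched in base R with ave() and rank(). The small data frame and the column name birth_order below are hypothetical, and ties.method = "first" is an assumption about how children of the same age should be numbered.

```r
# Hypothetical children in two households.
kids <- data.frame(
  household_id = c(6, 6, 6, 9, 9),
  age          = c(10, 12, 7, 3, 5)
)

# Rank by negative age within each household, so the oldest child gets 1.
kids$birth_order <- ave(
  -kids$age, kids$household_id,
  FUN = function(a) rank(a, ties.method = "first")
)
kids
```

Unlike the sorted cumulative sum, this keeps the original row order.
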

R: Error when grouping data using subset or index methods- produces list of all column names

I am encountering an error when trying to group my data based on categories of a variable, and I am not sure why it is happening, because I have used the two most widely recommended methods, subset(dataframe, variable == X) and dataframe[dataframe$variable == X, ], successfully with past data sets and just now with the mtcars dataset.
The problem is that when I try to run my code, I get an error in which R just prints out the names of all of my variables (see below).
I am not quite sure how to "show" this problem; any recommendations regarding what information would be useful to you all would be greatly appreciated. This problem is not reproducible with other datasets. Thank you for any advice you are able to give.
My dataset "wits" has 363 observations and 92 variables. My variable "complete" is a factor variable with four levels: "completed all", "stopped after demos", "stopped after consent", and "skipped manip bc poor id." I would like to create a new dataset made up only of participants with "completed all". I have tried these two methods:
wits_c <- wits[wits$complete=="Completed all", ]
wits_c <-subset(wits,complete=="Completed all")
Which results in the following error:
Error: Columns `Start`, `End`, `GameCode`, `workerID`, `condition`, `about`, `valid`, `consv`, `merit1`, `merit2`, `merit3`, `gender`, `gender_TEXT`, `poorid`, `choseskip`, `age`, `edu`, `race`, `race_TEXT`, `complete`, `distracted`, `happen`, `about__1`, `playread`, `thinking_1`, `thinking_2`, `thinking_3`, `thinking_4`, `thinking_5`, `thinking_6`, `thinking_7`, `text`, `logical_1`, `logical_2`, `logical_3`, `logical_4`, `controll_1`, `controll_2`, `controll_3`, `controll_4`, `controll_5`, `controll_6`, `controll_7`, `controll_8`, `controll_9`, `controll_10`, `privatesol`, `publicsol`, `privatesol_2`, `privatesol_3`, `privatesol_5`, `publicsol_2`, `publicsol_3`, `publicsol_5`, `policy_1`, `policy_2`, `policy_3`, `policy_4`, `colaction_5`, `colaction_6`, `colaction_7`, `colaction_8`, `colaction_10`, `colaction_13`, `joke`, `random`, `say`, `wits`, `wits_nb`, `neutral`, `rural_id`, `relig_id`, `prog_id`, `vignette`, `merit3R`, `policy_3R`, `policy_4R`, `controll_4R`, `controll_5R`,`co
Thank you to user Markdly for the suggestion to include the following output which provides more detailed information about my dataset:
dput(head(wits))
structure(list(Start = structure(c(1499525516, 1499516293, 1499516379,
1499516319, 1499516949, 1499516709), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End = structure(c(1499525762, 1499518121,
1499516954, 1499517222, 1499517412, 1499517512), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), GameCode = c(2991999, 5712506, 1002944,
8916111, 3495462, 9127270), workerID = c("ACIHCWKHNFC7U", "A3UAO2LYUPO7L6",
"A8L94A9EF23BV", "A258JTYUD56LOE", "A12SJSJIUR3A23", "A1HHOCO3ZZHCJZ"
), condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("WITS",
"WITS No Blurb", "Neutral", "Read"), class = "factor"), about = c(2,
2, 2, 2, 2, 2), valid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("valid responses",
"invalid responses"), class = "factor"), consv = c(4, 2, 2, 6,
4, 4), merit1 = c(5, 3, 2, 4, 6, 4), merit2 = c(4, 4, 2, 5, 5,
4), merit3 = c(3, 4, 2, 5, 4, 5), gender = structure(c(1L, 1L,
2L, 2L, 2L, 2L), .Label = c("man", "woman", "non-binary", "other"
), class = "factor"), gender_TEXT = c(NA, NA, NA, NA, NA, NA),
poorid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("not id as poor",
"id as poor"), class = "factor"), choseskip = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Label = c("poor but continued", "poor and skipped"), class = "factor"),
age = c(28, 30, 33, 41, 26, 30), edu = c(5, 5, 6, 5, 3, 5
), race = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("white",
"black", "latino", "asian", "native american", "other", "multiracial"
), class = "factor"), race_TEXT = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), complete = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Completed
all",
"Stopped after demos", "Stopped after consent", "Skipped manip bc poor id"
), class = "factor"), distracted = c(4, 5, 0, 0, 0, 5), happen = c(0,
0, 0, 0, 0, 0), about__1 = c(2, 2, 2, 2, 2, 2), playread = c(1,
1, 1, 1, 1, 1), thinking_1 = c(1, 6, 5, 4, 6, 4), thinking_2 = c(4,
3, 5, 1, 5, 5), thinking_3 = c(1, 4, 7, 4, 5, 5), thinking_4 = c(6,
4, 7, 1, 5, 3), thinking_5 = c(5, 3, 6, 6, 5, 4), thinking_6 = c(4,
3, 7, 6, 6, 4), thinking_7 = c(6, 3, 7, 6, 5, 5), text = c(2,
4, 5, 4, 2, 3), logical_1 = c(1, 3, 4, 4, 4, 4), logical_2 = c(2,
4, 3, 4, 5, 4), logical_3 = c(4, 3, 5, 4, 5, 4), logical_4 = c(1,
6, 3, 4, 4, 3), controll_1 = c(6, 6, 1, 4, 4, 4), controll_2 = c(1,
3, 1, 4, 5, 5), controll_3 = c(1, 4, 3, 6, 4, 5), controll_4 = c(3,
4, 6, 3, 4, 4), controll_5 = c(6, 3, 6, 1, 4, 4), controll_6 = c(3,
3, 1, 3, 5, 4), controll_7 = c(2, 5, 1, 5, 4, 5), controll_8 = c(2,
2, 1, 4, 5, 3), controll_9 = c(6, 2, 6, 3, 5, 5), controll_10 = c(1,
3, 6, 2, 5, 3), privatesol = c(5, 5.33333333333333, 12, 8,
8.33333333333333, 8.33333333333333), publicsol = c(7.66666666666667,
2.66666666666667, 12, 2, 11.3333333333333, 8.33333333333333
), privatesol_2 = c(1, 11, 12, 11, 11, 3), privatesol_3 = c(3,
2, 12, 2, 3, 11), privatesol_5 = c(11, 3, 12, 11, 11, 11),
publicsol_2 = c(11, 3, 12, 2, 12, 11), publicsol_3 = c(11,
2, 12, 2, 11, 3), publicsol_5 = c(1, 3, 12, 2, 11, 11), policy_1 = c(1,
5, 6, 1, 4, 4), policy_2 = c(3, 2, 6, 1, 5, 4), policy_3 = c(1,
3, 2, 6, 5, 3), policy_4 = c(6, 3, 5, 6, 5, 4), colaction_5 = c(2,
5, 6, 1, 2, 5), colaction_6 = c(6, 4, 1, 6, 5, 4), colaction_7 = c(6,
2, 6, 1, 3, 4), colaction_8 = c(4, 5, 6, 1, 2, 4), colaction_10 = c(4,
3, 1, 6, 5, 4), colaction_13 = c(3, 2, 6, 1, 2, 3), joke = c(2,
2, 2, 2, 2, 2), random = c(2, 2, 2, 2, 2, 2), say = c(NA,
"Nope", NA, "This was highly biased survey.", "good survey",
"NO"), wits = c(1, 1, 1, 1, 1, 1), wits_nb = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), neutral = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), rural_id = c(0,
0, 0, 1, 0, 0), relig_id = c(0, 1, 1, 1, 1, 1), prog_id = c(0,
0, 1, 0, 1, 1), vignette = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("WITS", "WITS No Blurb", "Neutral", "Read"
), class = "factor"), merit3R = c(4, 3, 5, 2, 3, 2), policy_3R = c(6,
4, 5, 1, 2, 4), policy_4R = c(1, 4, 2, 1, 2, 3), controll_4R = c(4,
3, 1, 4, 3, 3), controll_5R = c(1, 4, 1, 6, 3, 3), controll_9R = c(1,
5, 1, 4, 2, 2), controll_10R = c(6, 4, 1, 5, 2, 4), colaction_6R = c(1,
3, 6, 1, 2, 3), colaction_10R = c(3, 4, 6, 1, 2, 3), logical = c(2,
4, 3.75, 4, 4.5, 3.75), cognition = structure(c(1, 6, 5,
4, 6, 4, 4, 3, 5, 1, 5, 5, 1, 4, 7, 4, 5, 5, 6, 4, 7, 1,
5, 3, 5, 3, 6, 6, 5, 4, 4, 3, 7, 6, 6, 4, 6, 3, 7, 6, 5,
5), .Dim = 6:7), engage = c(5, 3, 6.66666666666667, 6, 5.33333333333333,
4.33333333333333), pertake = c(3.66666666666667, 3.66666666666667,
6.33333333333333, 2, 5, 4.33333333333333), policy = c(2.75,
3.75, 4.75, 1, 3.25, 3.75), colaction = c(3.16666666666667,
3.5, 6, 1, 2.16666666666667, 3.66666666666667), controllability = c(2.7,
3.9, 1.2, 4.5, 3.7, 3.8), completebi = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = c("Completed all", "Stopped after demos"
), class = "factor"), gender2 = structure(c(1L, 1L, 2L, 2L,
2L, 2L), .Label = c("man", "woman"), class = "factor")), .Names = c("Start", "End", "GameCode", "workerID", "condition", "about", "valid",
"consv", "merit1", "merit2", "merit3", "gender", "gender_TEXT",
"poorid", "choseskip", "age", "edu", "race", "race_TEXT", "complete",
"distracted", "happen", "about__1", "playread", "thinking_1",
"thinking_2", "thinking_3", "thinking_4", "thinking_5", "thinking_6",
"thinking_7", "text", "logical_1", "logical_2", "logical_3",
"logical_4", "controll_1", "controll_2", "controll_3", "controll_4",
"controll_5", "controll_6", "controll_7", "controll_8", "controll_9",
"controll_10", "privatesol", "publicsol", "privatesol_2", "privatesol_3",
"privatesol_5", "publicsol_2", "publicsol_3", "publicsol_5",
"policy_1", "policy_2", "policy_3", "policy_4", "colaction_5",
"colaction_6", "colaction_7", "colaction_8", "colaction_10",
"colaction_13", "joke", "random", "say", "wits", "wits_nb", "neutral",
"rural_id", "relig_id", "prog_id", "vignette", "merit3R", "policy_3R",
"policy_4R", "controll_4R", "controll_5R", "controll_9R", "controll_10R",
"colaction_6R", "colaction_10R", "logical", "cognition", "engage",
"pertake", "policy", "colaction", "controllability", "completebi",
"gender2"), row.names = c(NA, 6L), class = c("tbl_df", "tbl",
"data.frame"))
wits$complete
#[1] Completed \nall Completed \nall Completed \nall Completed \nall Completed \nall Completed \nall
#Levels: Completed \nall Stopped after demos Stopped after consent Skipped manip bc poor id
You can see that the level is "Completed \nall" (with a newline), not "Completed all", in your wits data frame.
## So you can use the which() function to subset your wits data frame.
which(wits$complete == "Completed \nall")
#[1] 1 2 3 4 5 6 ## These are the row indices. Use them to subset your data frame as below and you are good to go.
## So, this will subset your data frame
wits[which(wits$complete == "Completed \nall"),]
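An alternative, sketched here on a made-up factor rather than the real wits data, is to clean the level names once so that the comparison with "Completed all" works as originally written. The regular expression collapses any run of whitespace, including the stray newline, into a single space.

```r
# Made-up factor reproducing the stray newline in one level.
complete <- factor(c("Completed \nall", "Stopped after demos", "Completed \nall"))

# Collapse whitespace runs (space + newline) into a single space.
levels(complete) <- gsub("\\s+", " ", levels(complete))

# The plain comparison now matches.
complete == "Completed all"
```

On the real data the same assignment would be applied to levels(wits$complete), assuming no other levels rely on embedded whitespace.
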

Multiple imputation of categorical data in R

I used multiple imputation in R to complete my data set.
See the example below: x3 (minimum value = 0 and maximum value = 6); x4 (minimum value = 1 and maximum value = 5).
After imputing my data set via mice (with m = 5), I would like to obtain the new proportions after imputation (that is, the imputed proportions) of these two variables (A3 and A4) for each imputed data set (m = 1 to 5). Do you know how to then pool the results of the five estimations into one (proportion and standard error), like this: A3 = x%, x%, x%, x%, x%, x% and A4 = y%, y%, y%, y%, y%?
Do you know of any R code to deal with this?
A1 =c(2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2)
A2 =c(1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2)
A3 =c(NA, NA, NA, 0, 0, 4, 0, 0, 2, 2 , 0, 3, 4, 0, 5, 6, NA, 2, 1, 0, NA, NA, NA, NA, NA, NA, 1, 2, 6, NA, 0, 6, NA, 6, 2)
A4 =c(1, NA, NA, NA, NA, 2, 2, 5, NA, 4, 3, 1, 2, NA, 3, 6, 2, 1, 2, 3, 3, 1, 2, 3, 4, 5, 4, 1, 2, 3, 5, 3, NA, 1, NA)
library(mice)
# columns x1..x4 of the data frame hold the vectors A1..A4 defined above
df <- data.frame(x1 = A1, x2 = A2, x3 = A3, x4 = A4)
imp <- mice(df, m = 5)
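One possible way to get per-imputation proportions and a pooled estimate is Rubin's rules, sketched below in base R. The helper pool_proportion() and the three tiny "completed" data sets are hypothetical. With mice, the list of completed data sets would come from mice::complete(imp, action = "all"). Pooling on the raw proportion scale is an approximation, and a transformation such as the logit may behave better for proportions near 0 or 1.

```r
# Pool the proportion of one category across m completed data sets
# using Rubin's rules. `completed` is a list of data frames.
pool_proportion <- function(completed, var, level) {
  m <- length(completed)
  n <- nrow(completed[[1]])
  # Proportion of `level` in each completed data set.
  p <- vapply(completed, function(d) mean(d[[var]] == level), numeric(1))
  u <- p * (1 - p) / n          # within-imputation variance of each estimate
  w <- mean(u)                  # average within-imputation variance
  b <- var(p)                   # between-imputation variance
  total <- w + (1 + 1 / m) * b  # total variance
  list(per_imputation = p, pooled = mean(p), se = sqrt(total))
}

# Hypothetical completed data sets (values are made up).
completed <- list(
  data.frame(x3 = c(0, 0, 2, 6)),
  data.frame(x3 = c(0, 2, 2, 6)),
  data.frame(x3 = c(0, 0, 0, 6))
)
res <- pool_proportion(completed, "x3", 0)
res
```

Calling the helper once per category of x3 and x4 would give the full set of proportions asked for in the question.
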

Resources