Imputing based on specific columns


I'm about to impute missing values using the mice package, and I need the imputation to be based on specific columns only. I have 24 columns that are used to measure 4 latent variables (using the plspm package). For cols 1-6, I want to impute NAs in those columns based only on the content of these 6 columns (and likewise for cols 7-12, 13-18 and 19-24).
I hope this makes sense.
My data structure is:
p1 p2 p3 p4 p5 p6 l1 l2 l3 l4 l5 l6
4 3 5 4 5 N/A 2 1 4 5 1 N/A
4 4 1 3 1 2 1 1 1 1 1 1
5 4 5 4 4 4 4 4 5 5 4 4
5 4 5 5 4 5 4 4 N/A 5 4 4
5 5 5 5 5 5 3 2 5 5 2 2
4 3 4 3 3 3 3 2 3 4 3 2
5 4 5 5 3 4 4 1 5 5 5 4
5 5 5 5 5 5 5 3 4 5 3 4
4 4 4 4 3 N/A 4 4 5 4 3 3
5 4 4 4 3 2 1 3 2 5 1 1
4 4 4 4 5 5 3 4 5 5 3 3
4 3 2 N/A 1 2 N/A 1 2 N/A 1 N/A
3 3 4 4 3 2 1 3 3 3 1 3
5 3 4 4 4 2 3 4 4 4 3 3
4 4 4 5 2 2 2 2 2 2 3 3
5 4 4 4 4 4 4 4 5 5 4 3
4 3 3 3 5 2 2 2 4 4 1 1
5 4 5 4 5 3 1 1 5 5 2 3
4 3 1 3 4 4 2 1 4 3 2 3
4 3 1 4 3 1 2 1 4 4 3 2
3 3 5 4 5 1 2 2 4 5 3 2
4 4 5 3 5 5 2 2 3 4 2 3
4 4 2 3 2 3 2 2 3 4 2 2
5 5 5 5 5 5 4 3 3 3 3 3
5 5 5 5 5 4 4 N/A 5 5 N/A N/A
So I guess it's essentially splitting data into 4 blocks and then imputing. I read about the blocks()-function in the help(mice), but I'm not sure I can actually use that for this specific task.
The code I've been using so far is:
temp_pmm <- mice(data_predict,
                 m = 3,
                 maxit = 10,
                 method = "pmm",
                 seed = 2374)
But the way I understand the package, it imputes based on entire row content (so my latent variable constructs overlap, which I am trying to mitigate).
Hope you can help me out and I appreciate any help.
Thanks in advance!
Tobias

So Dominix' suggestion of simply running separate imputations seems to be the right way to go. Thanks a lot!
For any future reference, this is how I worked it out:
test_pmm_firstv <- mice(data_predict[,c(1:6)],
m = 10,
maxit = 20,
method = "pmm",
seed = 127493)
test_pmm_secondv <- mice(data_predict[,c(7:12)],
m = 10,
maxit = 20,
method = "pmm",
seed = 1239754111)
test_pmm_thirdv <- mice(data_predict[,c(13:18)],
m = 10,
maxit = 20,
method = "pmm",
seed = 1238603)
test_pmm_fourthv <- mice(data_predict[,c(19:24)],
m = 10,
maxit = 20,
method = "pmm",
seed = 356811)
data_pmm_firstv <- mice::complete(test_pmm_firstv, 1)
data_pmm_secondv <- mice::complete(test_pmm_secondv, 1)
data_pmm_thirdv <- mice::complete(test_pmm_thirdv, 1)
data_pmm_fourthv <- mice::complete(test_pmm_fourthv, 1)
data_fixed <- as.data.frame(cbind(data_pmm_firstv, data_pmm_secondv, data_pmm_thirdv, data_pmm_fourthv))
anyNA(data_fixed)
[1] FALSE
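
As an alternative to four separate runs, the same restriction can probably be expressed in a single call through the predictorMatrix argument of mice(), where a 1 in row i, column j means that column j is used as a predictor when imputing column i. The following is a minimal sketch under assumptions, not the original poster's code: data_predict is replaced by made-up toy data with the same 24-column layout, and the mice call is skipped if the package is not installed.

```r
# Toy stand-in for data_predict: 24 Likert-type columns with a few NAs (made-up data)
set.seed(1)
data_predict <- as.data.frame(matrix(sample(1:5, 60 * 24, replace = TRUE), ncol = 24))
for (j in seq_len(24)) data_predict[sample(60, 4), j] <- NA

# Column indices of the four blocks: 1-6, 7-12, 13-18, 19-24
blocks <- split(seq_len(24), rep(1:4, each = 6))

# Start with no predictors at all, then allow predictors only within each block
pred <- matrix(0, 24, 24,
               dimnames = list(names(data_predict), names(data_predict)))
for (b in blocks) pred[b, b] <- 1
diag(pred) <- 0  # a column never predicts itself

if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(data_predict, m = 3, maxit = 10, method = "pmm",
                    predictorMatrix = pred, seed = 2374, printFlag = FALSE)
  data_fixed <- mice::complete(imp, 1)
}
```

Compared with separate runs, this keeps all columns in one mids object, which can be convenient when the imputed datasets are analysed together later.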

Related

Convert a small dataset written in SPSS to CSV

I have a small dataset written in SPSS syntax which comes from Table 5.3 p. 189 of this book (type 210 in the page slot to see the table).
I was wondering if there might be a way to convert this data to .csv file? (I want to use the data in R afterwards)
# SPSS Code:
DATA LIST FREE/gpid anx socskls assert.
BEGIN DATA.
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4
END DATA.
EDIT - in order to check answers I am adding here the actual way the data looks after reading it in SPSS :
gpid anx socskls assert
1 5 3 3
1 5 4 3
1 4 5 4
1 4 5 4
1 3 5 5
1 4 5 4
1 4 5 5
1 4 4 4
1 5 4 3
1 5 4 3
1 4 4 4
2 6 2 1
2 6 2 2
2 5 2 3
2 6 2 2
2 4 4 4
2 7 1 1
2 5 4 3
2 5 2 3
2 5 3 3
2 5 4 3
2 6 2 3
3 4 4 4
3 4 3 3
3 4 4 4
3 4 5 5
3 4 5 5
3 4 4 4
3 4 5 4
3 4 6 5
3 4 4 4
3 5 3 3
3 4 4 4

If I understand correctly, the 1st, 5th, 9th, and 13th column of the dataset belong to variable gpid, the 2nd, 6th, 10th, and 14th column belong to variable anx, and so on. So, we need to reshape from wide to long format with multiple measure variables, where each measure variable spans several columns and where some values are missing.
Many roads lead to Rome.
This is what I would do using my favourite tools. In particular, this approach uses the feature of data.table::melt() to reshape multiple measure columns simultaneously. There is no manual cleanup of the data section in a text editor required.
The resulting dataset result can be used directly afterwards in any subsequent R code as requested by the OP. There is no need to take a detour using a .csv file (However, feel free to save result as a .csv file).
library(data.table)
library(magrittr)
cols <- c("gpid", "anx", "socskls", "assert")
raw <- fread(text = "
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4",
fill = TRUE)
mv <- colnames(raw) %>%
matrix(ncol = 4L, byrow = TRUE) %>%
as.data.table() %>%
setnames(new = cols)
result <- melt(raw, measure.vars = mv, na.rm = TRUE)[
order(rowid(variable))][
, variable := NULL]
result
gpid anx socskls assert
1: 1 5 3 3
2: 1 5 4 3
3: 1 4 5 4
4: 1 4 5 4
5: 1 3 5 5
6: 1 4 5 4
7: 1 4 5 5
8: 1 4 4 4
9: 1 5 4 3
10: 1 5 4 3
11: 1 4 4 4
12: 2 6 2 1
13: 2 6 2 2
14: 2 5 2 3
15: 2 6 2 2
16: 2 4 4 4
17: 2 7 1 1
18: 2 5 4 3
19: 2 5 2 3
20: 2 5 3 3
21: 2 5 4 3
22: 2 6 2 3
23: 3 4 4 4
24: 3 4 3 3
25: 3 4 4 4
26: 3 4 5 5
27: 3 4 5 5
28: 3 4 4 4
29: 3 4 5 4
30: 3 4 6 5
31: 3 4 4 4
32: 3 5 3 3
33: 3 4 4 4
gpid anx socskls assert
Some explanations
fread() returns a data.table raw with default column names V1, V2, ... V16 and with missing values filled with NA
mv is a data.table which indicates which columns of raw belong to each target variable:
mv
gpid anx socskls assert
1: V1 V2 V3 V4
2: V5 V6 V7 V8
3: V9 V10 V11 V12
4: V13 V14 V15 V16
This information is used by melt(). melt() also removes rows with missing values from the resulting long format.
After reshaping, the rows are ordered by the variable number but need to be reordered in the original row order by using rowid(variable). Finally, the variable column is removed.
EDIT: Improved version
On second thought, here is a streamlined version of the code which skips the creation of mv and uses data.table chaining:
library(data.table)
cols <- c("gpid", "anx", "socskls", "assert")
result <- fread(
text = "
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4",
fill = TRUE, col.names = rep(cols, 4L))[
, melt(.SD, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)][
order(rowid(variable))][
, variable := NULL][]
result
Here, the columns are renamed within the call to fread(). In this case, duplicated column names are desirable (as opposed to the usual use case) because the patterns() function in the subsequent call to melt() uses the duplicated column names to combine the columns which belong to one measure variable.

This requires some manual clean-up in Notepad or similar to place the data in the right format. But essentially, this could be imported using the following
df <- data.frame(
gpid = c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,
2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
anx = c(5,5,4,4,3,4,4,4,5,5,4,6,6,5,6,
4,7,5,5,5,5,6,4,4,4,4,4,4,4,4,4,5,4),
socskls = c(3,4,5,5,5,5,5,4,4,4,4,2,2,2,2,
4,1,4,2,3,4,2,4,3,4,5,5,4,5,6,4,3,4),
assert = c(3,3,4,4,5,4,5,4,3,3,4,1,2,3,2,
4,1,3,3,3,3,3,4,3,4,5,5,4,4,5,4,3,4)
)
write.csv(df, "df.csv", row.names = FALSE)
Note that the first 4 values (1, 5, 3, 3) are the gpid, anx, socskls, and assert values for row 1. Whereas the values 1, 5, 4, 3 which appear to be in the next column of the pasted data in SPSS syntax (i.e. the next 4 values reading the syntax left to right) are actually the values for participant 10.
Note: I'm assuming you don't have SPSS installed. If you did, the easiest option would be to use the SPSS syntax to create the dataset in SPSS and then just export it to R.

Using readLines and some string manipulating tools.
tmp <- readLines("spss1.txt") ## read from .txt
tmp <- trimws(gsub("[A-Z/.]", "", tmp)) ## remove caps and specials
nm <- strsplit(tmp[[1]], " ")[[1]] ## split names
tmp <- unlist(strsplit(tmp[3:11], "\\s{2,}") ) ## split data blocks
Finally, splitting at the spaces gives the result.
dat <- setNames(
type.convert(do.call(rbind.data.frame, strsplit(tmp, "\\s"))),
nm)
Result
dat
# gpid anx socskls assert
# 1 1 5 3 3
# 2 1 5 4 3
# 3 1 4 5 4
# 4 1 4 5 4
# 5 1 3 5 5
# 6 1 4 5 4
# 7 1 4 5 5
# 8 1 4 4 4
# 9 1 5 4 3
# 10 1 5 4 3
# 11 1 4 4 4
# 12 2 6 2 1
# 13 2 6 2 2
# 14 2 5 2 3
# 15 2 6 2 2
# 16 2 4 4 4
# 17 2 7 1 1
# 18 2 5 4 3
# 19 2 5 2 3
# 20 2 5 3 3
# 21 2 5 4 3
# 22 2 6 2 3
# 23 3 4 4 4
# 24 3 4 3 3
# 25 3 4 4 4
# 26 3 4 5 5
# 27 3 4 5 5
# 28 3 4 4 4
# 29 3 4 5 4
# 30 3 4 6 5
# 31 3 4 4 4
# 32 3 5 3 3
# 33 3 4 4 4
Note: Results in the same Wilks' lambda as @emily-kothe's method. Maybe the authors used different data or your manova method is flawed?
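
Since DATA LIST FREE reads the values left to right in groups of four (which matches the SPSS output shown in the question), a base R sketch with scan() and matrix(byrow = TRUE) may be the shortest route. This is an illustration, not one of the answers above; the output file name table53.csv is made up, and the file is written to a temporary directory.

```r
cols <- c("gpid", "anx", "socskls", "assert")

# Read all numbers as one long vector; line breaks and spacing do not matter
vals <- scan(text = "
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4", quiet = TRUE)

# Every consecutive group of four values is one case
dat <- as.data.frame(matrix(vals, ncol = length(cols), byrow = TRUE))
names(dat) <- cols

# Hypothetical output location
out_file <- file.path(tempdir(), "table53.csv")
write.csv(dat, out_file, row.names = FALSE)
```

Because scan() ignores the line layout, no NA padding or reordering step is needed.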

Subsetting a dataframe in R including observations that satisfy condition

I would like to randomly subset a data frame with the condition that if one observation with alpha=1 is included in a subset, then all observations with alpha=1 must be included in that subset. I have simplified the data, so it looks like this.
df
alpha beta gamma
1 5 2
1 6 3
1 5 3
2 3 2
2 5 9
2 2 6
3 3 4
3 4 7
3 3 8
4 3 4
4 8 3
4 4 9
5 9 8
5 5 5
5 3 5
What command should I use to get subsets like the following?
df1
alpha beta gamma
1 5 2
1 6 3
1 5 3
3 3 4
3 4 7
3 3 8
5 9 8
5 5 5
5 3 5
df2
alpha beta gamma
2 3 2
2 5 9
2 2 6
4 3 4
4 8 3
4 4 9
5 9 8
5 5 5
5 3 5
df3
alpha beta gamma
1 5 2
1 6 3
1 5 3
2 3 2
2 5 9
2 2 6
5 9 8
5 5 5
5 3 5
Specifically, if the first observation in df, (1,5,2), randomly falls into subsets df1 and df3, then the 2nd and 3rd observations in df, (1,6,3) and (1,5,3), must also be included in subsets df1 and df3.
I hope that my question is clear. Please help.

Try this
str <- "alpha,beta,gamma
1,5,2
1,6,3
1,5,3
2,3,2
2,5,9
2,2,6
3,3,4
3,4,7
3,3,8
4,3,4
4,8,3
4,4,9
5,9,8
5,5,5
5,3,5"
df <- read.csv(textConnection(str))
df[df$alpha %in% sample(unique(df$alpha), 3), ]
Output
alpha beta gamma
4 2 3 2
5 2 5 9
6 2 2 6
10 4 3 4
11 4 8 3
12 4 4 9
13 5 9 8
14 5 5 5
15 5 3 5
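
The one-liner works because it samples the distinct alpha values rather than individual rows, and then keeps every row whose alpha was drawn. A small reusable sketch of the same idea follows; the helper name sample_groups and its arguments are made up for illustration.

```r
df <- data.frame(alpha = rep(1:5, each = 3),
                 beta  = c(5, 6, 5, 3, 5, 2, 3, 4, 3, 3, 8, 4, 9, 5, 3),
                 gamma = c(2, 3, 3, 2, 9, 6, 4, 7, 8, 4, 3, 9, 8, 5, 5))

# Hypothetical helper: draw n distinct values of `key`, keep all their rows
sample_groups <- function(data, key, n) {
  keep <- sample(unique(data[[key]]), n)
  data[data[[key]] %in% keep, , drop = FALSE]
}

set.seed(42)  # only for reproducibility of the draw
df1 <- sample_groups(df, "alpha", 3)
df2 <- sample_groups(df, "alpha", 3)
```

Each call returns 9 rows here (3 groups of 3), and a group is always either fully present or fully absent.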

From table to data.frame

I have a table that looks like:
dat = data.frame(expand.grid(x = 1:10, y = 1:10),
z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))
tabl
y
z 1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Now how do I transform it into a data.frame that looks like
1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2

Here are a couple of options.
The reason as.data.frame(tabl) doesn't work is that it dispatches to the S3 method as.data.frame.table() which does something useful but different from what you want.
as.data.frame.matrix(tabl)
# 1 2 3 4 5 6 7 8 9 10
# A 5 4 3 1 1 3 3 2 6 2
# B 1 4 3 4 5 3 4 4 3 3
# C 4 2 4 5 4 4 3 4 1 5
## This will also work
as.data.frame(unclass(tabl))
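
To make the difference between the two methods concrete, here is a short sketch that rebuilds a table like the one in the question (with a seed added, so the counts will differ from those shown above) and compares the shapes of the results.

```r
set.seed(1)  # seed added here only for reproducibility
dat <- data.frame(expand.grid(x = 1:10, y = 1:10),
                  z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))

# Dispatches to as.data.frame.table(): long format, one row per (z, y) cell
long <- as.data.frame(tabl)
names(long)   # "z" "y" "Freq"

# Treats the table as a plain matrix: wide format, same layout as the printout
wide <- as.data.frame.matrix(tabl)
dim(wide)     # 3 rows (A, B, C) by 10 columns
```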

using Reduce/do.call with ifelse

This is purely a curiosity (learning more about Reduce). There are way better methods to achieve what I'm doing and I am not interested in them.
Some people use a series of nested ifelse commands to recode/look up something. Maybe it looks like this:
set.seed(10); x <- sample(letters[1:10], 300, T)
ifelse(x=="a", 1,
ifelse(x=="b", 2,
ifelse(x=="c", 3,
ifelse(x=="d", 4, 5))))
Is there a way to use either do.call or Reduce with ifelse to get the job done a little more elegantly?

Try this:
> library(gsubfn)
> strapply(x, ".", list(a = 1, b = 2, c = 3, d = 4, 5), simplify = TRUE)
[1] 5 4 5 5 1 3 3 3 5 5 5 5 2 5 4 5 1 3 4 5 5 5 5 4 5 5 5 3 5 4 5 1 2 5 5 5 5
[38] 5 5 5 3 3 1 5 3 2 1 5 2 5 4 5 3 5 2 5 5 5 4 5 1 2 5 4 5 5 5 5 1 3 1 5 5 5
[75] 1 5 4 5 3 3 5 5 3 5 3 1 5 3 2 2 5 5 5 5 4 5 3 5 5 1 4 1 4 5 5 5 5 5 5 5 5
[112] 5 2 5 5 5 3 5 5 5 2 4 4 5 3 3 5 4 5 5 5 1 5 3 4 3 5 5 2 5 5 3 1 5 2 5 5 5
[149] 1 5 5 2 1 2 4 2 2 3 5 2 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 5 2 3 5 4 4 2 5 5 5
[186] 5 5 5 5 2 1 1 1 5 5 5 5 3 5 5 3 5 5 5 2 5 5 5 3 5 5 5 5 5 1 5 5 5 5 2 2 5
[223] 5 5 4 3 4 5 5 4 5 5 5 3 5 3 5 5 5 5 4 5 5 1 5 5 2 5 5 5 2 5 5 3 2 5 4 5 2
[260] 5 5 3 5 5 1 4 3 5 4 5 2 5 5 3 5 5 5 5 5 1 1 5 2 5 1 5 5 5 5 5 5 5 5 5 5 5
[297] 5 1 5 2

Here is an attempt. It is neither beautiful nor does it use ifelse:
f <- function(w,s) {
if(is.null(s$old))
w$output[is.na(w$output)] <- s$new
else
w$output[w$input==s$old] <- s$new
return(w)
}
set.seed(10); x <- sample(letters[1:10], 300, T)
subst <- list(
list(old="a", new=1),
list(old="b", new=2),
list(old="c", new=3),
list(old="d", new=4),
list(old=NULL, new=5)
)
workplace <- list(
input=x,
output=rep(NA, length(x))
)
Reduce(f, subst, workplace)
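
For comparison, a fold that does use ifelse could look like the sketch below. It is not taken from either answer: the accumulator starts as the default value 5, and each step wraps it in one more ifelse layer. The names keys, vals and recoded are made up for illustration.

```r
set.seed(10); x <- sample(letters[1:10], 300, TRUE)

keys <- c("a", "b", "c", "d")
vals <- c(1, 2, 3, 4)

# Each step is equivalent to one level of the nested ifelse chain
recoded <- Reduce(
  function(acc, i) ifelse(x == keys[i], vals[i], acc),
  seq_along(keys),
  5  # initial accumulator: the "everything else" value
)

# The hand-written nested version, for checking
nested <- ifelse(x == "a", 1,
          ifelse(x == "b", 2,
          ifelse(x == "c", 3,
          ifelse(x == "d", 4, 5))))
all(recoded == nested)
```

The order of the keys does not matter here because the conditions are mutually exclusive.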

Is there an expand.grid like function in R, returning permutations?

To be more specific, here is an example:
> expand.grid(5, 5, c(1:4,6),c(1:4,6))
Var1 Var2 Var3 Var4
1 5 5 1 1
2 5 5 2 1
3 5 5 3 1
4 5 5 4 1
5 5 5 6 1
6 5 5 1 2
7 5 5 2 2
8 5 5 3 2
9 5 5 4 2
10 5 5 6 2
11 5 5 1 3
12 5 5 2 3
13 5 5 3 3
14 5 5 4 3
15 5 5 6 3
16 5 5 1 4
17 5 5 2 4
18 5 5 3 4
19 5 5 4 4
20 5 5 6 4
21 5 5 1 6
22 5 5 2 6
23 5 5 3 6
24 5 5 4 6
25 5 5 6 6
This data frame was created from all combinations of the supplied vectors. I would like to create a similar data frame from all permutations of the supplied vectors. Notice that each row must contain exactly 2 fives, yet not necessarily the first two in line.
Thank you.

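The code below works (it relies on permutations() from gtools):
library(gtools)
# each column of comb is one combination from expand.grid
comb <- t(as.matrix(expand.grid(5, 5, c(1:4,6),c(1:4,6))))
# each column of perms is one ordering of the positions 1..4
perms <- t(permutations(4,4))
# reorder every combination by every permutation, then drop duplicate rows
ans <- apply(comb,2,function(x) x[perms])
ans <- unique(matrix(as.vector(ans), ncol = 4, byrow = TRUE))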

Try ?allPerms in the vegan package.
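
If no extra package is wanted, the same set of rows can probably be obtained in base R by building the full grid over the values 1 to 6 and keeping only the rows with exactly two fives. This is a sketch of an alternative technique, not what either answer above does.

```r
# All 6^4 rows over the values 1..6 in four positions
grid <- expand.grid(rep(list(1:6), 4))

# Keep rows that contain exactly two fives, in any positions
res <- grid[rowSums(grid == 5) == 2, ]
rownames(res) <- NULL

nrow(res)  # choose(4, 2) * 5^2 = 150
```

The row count follows from choosing 2 of the 4 positions for the fives and filling the other two positions with any of the 5 remaining values.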
