How can I combine values from three variables into one variable? - r

What I'm trying to do is to create a single cataract variable from three different datasets that asked about cataract. (Basically, a phone interview, a wave using a short questionnaire, and a wave using a longer questionnaire.) These datasets have been merged, so that missing values are created for participants in the waves they didn't take part in. I've coded each of the three separate cataract vars as 1=YES and 0=NO.
In the following code, I'm trying to say: if you responded yes (1) to any of the three vars, assign a value of 1; otherwise, if you responded no (0) to any, assign a value of 0; otherwise NA.
survey$cataract<-ifelse(survey$ew3_cat==1 | survey$lq3_catnum==1 | survey$sq3_cat==1,1,
ifelse(survey$ew3_cat==0 | survey$lq3_catnum==0 | survey$sq3_cat==0,0,NA))
As you can see from the following result, I get the 1's, but everything else is "NA", no zeros.
> table(survey$cataract,useNA="ifany")
1 <NA>
10303 63322
Now, if I change the order, say do all the zeros first, then I get the correct 0's, but no 1's.
survey$cataract<-ifelse(survey$ew3_cat==0 | survey$lq3_catnum==0 | survey$sq3_cat==0,0,
ifelse(survey$ew3_cat==1 | survey$lq3_catnum==1 | survey$sq3_cat==1,1,NA))
> table(survey$cataract,useNA="ifany")
0 <NA>
63315 10310
The correct count from the three separate vars should be:
10,303 = 1
63,315 = 0
7= NA
I also tried replicating this problem with made-up data as follows:
x <- c(rep(1,100),rep(0,200),rep(NA,400))
y <- c(rep(NA,300),rep(1,100),rep(0,100),rep(NA,200))
z <- c(rep(NA,500),rep(1,100),rep(0,100))
cat <- ifelse(x==1|y==1|z==1,1,
ifelse(x==0|y==0|z==0,0,NA))
> table(cat,useNA="ifany")
cat
1 <NA>
300 400
Same problem if I reverse the order:
cat <- ifelse(x==0|y==0|z==0,0,
ifelse(x==1|y==1|z==1,1,NA))
> table(cat,useNA="ifany")
cat
0 <NA>
400 300
Any suggestions about what logical thing I'm missing here?

This is a little hackish but should give you the right result:
tmp <- as.numeric(mapply(any, as.logical(x),as.logical(y),as.logical(z), na.rm=TRUE))
tmp[which(mapply(all, is.na(x), is.na(y), is.na(z)))] <- NA
Basically it looks for any values of 1, returning 1 for those values and 0 otherwise. Then it goes back and puts NA values back in wherever all of x, y, and z are NA.
> table(tmp)
tmp
0 1
400 300
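As an aside, the logical thing the original ifelse() misses is that a comparison against NA yields NA, and NA | FALSE is still NA, so the whole condition is NA for those rows and ifelse() returns NA instead of falling through to the 0 branch. A quick illustration:
NA == 1           # NA, not FALSE
NA | FALSE        # NA, so a row with one NA and one 0 never tests FALSE
NA | TRUE         # TRUE, which is why the 1's still come through
ifelse(NA, 1, 0)  # NA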
Note: Your example data don't seem particularly good for testing this because you don't have any cases that are NA-NA-NA:
> ftable(x,y,z, useNA='always')
      z   0   1  NA
x  y
0  0      0   0   0
   1      0   0   0
   NA     0   0 200
1  0      0   0   0
   1      0   0   0
   NA     0   0 100
NA 0      0   0 100
   1      0   0 100
   NA   100 100   0
So, here's a slightly modified version of your data that shows the above code works correctly:
x <- c(rep(1,100),rep(0,200),rep(NA,400))
y <- c(rep(NA,300),rep(1,100),rep(0,100),rep(NA,200))
z <- c(rep(NA,500),rep(1,100),rep(0,50),rep(NA,50))
The result for those data:
> ftable(x,y,z, useNA='always')
      z   0   1  NA
x  y
0  0      0   0   0
   1      0   0   0
   NA     0   0 200
1  0      0   0   0
   1      0   0   0
   NA     0   0 100
NA 0      0   0 100
   1      0   0 100
   NA    50 100  50
> table(tmp, useNA='always')
tmp
0 1 <NA>
350 300 50
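For what it's worth, a shorter alternative sketch (assuming, as in the question, that all three columns are coded 0/1 with NA for non-participation) is pmax() with na.rm = TRUE: it returns 1 if any of the three values is 1, 0 if all non-missing values are 0, and NA only when all three are NA.
# sketch, not the method above: rowwise maximum ignoring NAs
survey$cataract <- pmax(survey$ew3_cat, survey$lq3_catnum, survey$sq3_cat, na.rm = TRUE)
table(survey$cataract, useNA = "ifany")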

Related

Compute combinations of pairs of variables for a given operation in R

From a given dataframe:
# Create dataframe with 4 variables and 10 obs
set.seed(1)
df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE)))
I would like to compute a subtraction between all pairwise combinations of columns, but keeping only one subtraction per pair, i.e. column A - column B but not column B - column A, and so on.
What I have is very manual, and it doesn't scale well when there are lots of variables.
# Result
df_result <- as.data.frame(list(df$X1-df$X2,
df$X1-df$X3,
df$X1-df$X4,
df$X2-df$X3,
df$X2-df$X4,
df$X3-df$X4))
Also, the column name should describe the operation, i.e. x1_x2 meaning x1-x2.
You can use combn:
COMBI = combn(colnames(df),2)
res = data.frame(apply(COMBI,2,function(i)df[,i[1]]-df[,i[2]]))
colnames(res) = apply(COMBI,2,paste0,collapse="minus")
head(res)
X1minusX2 X1minusX3 X1minusX4 X2minusX3 X2minusX4 X3minusX4
1 0 0 -1 0 -1 -1
2 1 1 0 0 -1 -1
3 0 0 0 0 0 0
4 0 0 -1 0 -1 -1
5 1 1 1 0 0 0
6 -1 0 0 1 1 0
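If names of the form x1_x2 are preferred, as asked for in the question, only the separator passed to paste0 needs to change:
colnames(res) = apply(COMBI, 2, paste0, collapse = "_")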

merge multiple columns with condition

I have a data frame like this:
Q17a_17 Q17a_18 Q17a_19 Q17a_20 Q17a_21 Q17a_22 Q17a_23
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 1
3 0 0 0 0 0 1 1
4 0 0 0 0 0 0 1
5 1 0 0 0 1 1 0
6 0 0 0 0 0 1 1
7 1 1 0 0 1 0 1
And I would like to merge Q17a_17, Q17a_19 and Q17a_23 into a new column with a new name. The "old" columns Q17a_17, Q17a_19 and Q17a_23 should be deleted.
The new column should contain just one value per row, with the following conditions: NA if there was an NA before, 1 if there was a 1 somewhere before (like in row 3, 4 or 7), and 0 if there were only zeros before.
Maybe this is really simple, but I've been struggling with it for hours...
The approach here is to first compute a vector that is NA when an NA occurs in at least one of the three columns, and zero otherwise. We also compute a vector containing the numerical result you want, which can be obtained by logically ORing the three columns together. Adding these two vectors then produces the desired result.
na.vector <- df$Q17a_17 * df$Q17a_19 * df$Q17a_23
na.vector[!is.na(na.vector)] <- 0
num.vector <- as.numeric(df$Q17a_17 | df$Q17a_19 | df$Q17a_23)
df$new_column <- na.vector + num.vector
df <- df[ , -which(names(df) %in% c("Q17a_17", "Q17a_19", "Q17a_23"))]
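An alternative sketch of the same logic (NA if any of the three columns is NA, 1 if any is 1, otherwise 0), done row by row with apply() and starting again from the original df:
cols <- c("Q17a_17", "Q17a_19", "Q17a_23")
df$new_column <- apply(df[cols], 1, function(r) if (anyNA(r)) NA else as.numeric(any(r == 1)))
df <- df[, setdiff(names(df), cols)]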

Sample random column in dataframe

I have the following:
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this code to get one of the column names:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
you don't need to generate a vector of column-names, only their indices. Keep it simple.
sample your col-indices from 1:ncol(df) instead of 1:nrow(df)
then put those column-indices on the RHS of the comma in df[, ...]
df[, sample(ncol(df), 1)]
the 1 is because you apparently want to take a sample of size 1.
one minor complication is that your dataframe is model$data[[1]], since your model$data looks like a list with one element which is a dataframe, rather than a plain dataframe. So first, assign df <- model$data[[1]]
finally, if you really really want the sampled column-name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df) [samp_col_idxs]
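Putting the pieces together (again assuming model$data[[1]] is the data frame of interest):
df <- model$data[[1]]
idx <- sample(ncol(df), 1)    # one random column index
colnames(df)[idx]             # its name
df[, idx, drop = FALSE]       # the sampled column, kept as a data frame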

Finding "similar" rows performing a conditional join with sqldf

Say I have a data.table (it can also be a data.frame, doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
Particularly, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I don't get it together properly. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add another condition, and a.id < b.id, to the join if each row should not be paired with itself and we want to exclude the reverse of each pair; or add and not a.id = b.id to exclude only the self pairs.
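For instance, to keep each pair of similar rows only once and drop the self pairs, the join described above becomes:
sqldf("select a.id, b.id
       from ans a
       join ans b on abs(a.x1 - b.x1) <= 1 and
                     abs(a.x2 - b.x2) <= 1 and
                     abs(a.x3 - b.x3) <= 1 and
                     a.id < b.id")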
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), c(Var1) < c(Var2) & Freq == 1)[1:2]
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here is the code to draw a network graph of the above:
# network graph
library(igraph)
g <- graph.adjacency(m)
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
I am quite new to R so don't expect too much.
What if you create, from your values (which are basically vectors), a matrix of the pairwise distances between values? That way you can find the combinations that differ by at most 1 from each other, which gives you the matching (a)-pairs. Repeat this for (b) and (c) and keep those pairs that appear in all three.
Alternatively this can probably be done as a cube as well.
Just as a thought hint.
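A rough sketch of that idea (assuming the articles sit in a data frame dt with columns a, b and c; the names are only illustrative):
close_a <- abs(outer(dt$a, dt$a, "-")) <= 1   # TRUE where two rows are close on a
close_b <- abs(outer(dt$b, dt$b, "-")) <= 1
close_c <- abs(outer(dt$c, dt$c, "-")) <= 1
similar <- close_a & close_b & close_c        # pairs that match on all three columns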

How to subset a data frame based on a list of multiple match cases in columns

So I have a list that contains certain characters as shown below
list <- c("MY","GM+" ,"TY","RS","LG")
And I have a variable named "CODE" in the data frame as follows
code <- c("MY GM+","","LGTY", "RS","TY")
df <- data.frame(1:5,code)
df
code
1 MY GM+
2
3 LGTY
4 RS
5 TY
Now I want to create 5 new variables named "MY", "GM+", "TY", "RS" and "LG", which take a binary value: 1 if there's a match in the code variable.
df
code MY GM+ TY RS LG
1 MY GM+ 1 1 0 0 0
2 0 0 0 0 0
3 LGTY 0 0 1 0 1
4 RS 0 0 0 1 0
5 TY 0 0 1 0 0
Really appreciate your help. Thank you.
Since you know how many values will be returned (5), and what you want their types to be (integer), you could use vapply() with grepl(). We can turn the resulting logical matrix into integer values by using integer() in vapply()'s FUN.VALUE argument.
cbind(df, vapply(List, grepl, integer(nrow(df)), df$code, fixed = TRUE))
# code MY GM+ TY RS LG
# 1 MY GM+ 1 1 0 0 0
# 2 0 0 0 0 0
# 3 LGTY 0 0 1 0 1
# 4 RS 0 0 0 1 0
# 5 TY 0 0 1 0 0
I think your original data has a couple of typos, so here's what I used:
List <- c("MY", "GM+" , "TY", "RS", "LG")
df <- data.frame(code = c("MY GM+", "", "LGTY", "RS", "TY"))
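One detail worth keeping: fixed = TRUE matters here because "GM+" contains a regex metacharacter, so the patterns should be matched literally:
grepl("GM+", "GM")               # TRUE  - "+" is read as a regex quantifier, "G" then one or more "M"
grepl("GM+", "GM", fixed = TRUE) # FALSE - the literal string "GM+" does not occur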
