How to programmatically compare an entire row in R? - r

I have the following dataframe in R:
data=
Time X1 X2 X3
1 1 0 0
2 1 1 1
3 0 0 1
4 1 1 1
5 0 0 0
6 0 1 1
7 1 1 1
8 0 0 0
9 1 1 1
10 0 0 0
Is there a way to programmatically select the rows that are equal to (0,1,1)? I know it can be done with data[data$X1 == 0 & data$X2 == 1 & data$X3 == 1,] but, in my scenario, (0,1,1) is a list held in a variable. My ultimate goal is to count the rows that are equal to (0,1,1), or to any other combination that list variable can hold.
Thanks!
Mariano.

Here's a couple of options using a merge:
merge(list(X1=0,X2=1,X3=1), dat)
#or
merge(setNames(list(0,1,1),c("X1","X2","X3")), dat)
Or even using positional indexes based on what columns you want matched up:
L <- list(0,1,1)
merge(L, dat, by.x=seq_along(L), by.y=2:4)
All of which return:
# X1 X2 X3 Time
#1 0 1 1 6
If your matching variables are all of the same type, you could also safely do it via matrix comparison like:
dat[colSums(t(dat[c("X1","X2","X3")]) == c(0,1,1)) == 3,]

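apply(data[c("X1","X2","X3")], 1, function(x) all(x == c(0,1,1)))
This goes down each row of the frame and returns TRUE for each row whose X1, X2 and X3 values equal c(0,1,1). The Time column has to be left out of the comparison; otherwise each four-element row is compared against a three-element vector, which is recycled with a warning and gives wrong results. Wrap the call in sum() to get the number of matching rows.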
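Since the question keeps the pattern in a list variable and ultimately wants a count, here is a minimal base R sketch of that case. The helper name count_matches and the variable names target and cols are made up for illustration; the data frame is rebuilt from the table in the question.

```r
# Rebuild the question's data frame (values copied from the table above).
dat <- data.frame(
  Time = 1:10,
  X1 = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0),
  X2 = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  X3 = c(0, 1, 1, 1, 0, 1, 1, 0, 1, 0)
)

# Hypothetical helper: TRUE for every row whose `cols` equal `target`.
row_matches <- function(df, cols, target) {
  # Compare each column with its own target value ...
  per_column <- Map(function(column, value) column == value, df[cols], target)
  # ... then AND the per-column results together, row by row.
  Reduce(`&`, per_column)
}

# Hypothetical helper: number of matching rows.
count_matches <- function(df, cols, target) sum(row_matches(df, cols, target))

cols <- c("X1", "X2", "X3")
target <- list(0, 1, 1)          # the pattern lives in a list variable

n_011 <- count_matches(dat, cols, target)
rows_011 <- dat[row_matches(dat, cols, target), ]
n_111 <- count_matches(dat, cols, list(1, 1, 1))
```

Because the comparison is done column by column, this also works when the matched columns have different types, which the matrix-based comparison does not guarantee.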

this is your data
mydf <- structure(list(Time = 1:10, X1 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L,
0L, 1L, 0L), X2 = c(0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
X3 = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)), .Names = c("Time",
"X1", "X2", "X3"), class = "data.frame", row.names = c(NA, -10L
))
Using subset
subset(mydf, X1 == 0 & X2==1 & X3==1)
# Time X1 X2 X3
#6 6 0 1 1
another way
mydf[mydf$X1 ==0 & mydf$X2 ==1 & mydf$X3 ==1, ]
# Time X1 X2 X3
#6 6 0 1 1
or like this (this relies on X2 holding only 0/1 values, because the column is coerced to logical)
mydf[mydf$X1 ==0 & mydf$X2 & mydf$X3 %in% c(1,1), ]
# Time X1 X2 X3
#6 6 0 1 1
you can also do that by
library(dplyr)
filter(mydf, X1==0 & X2==1 & X3==1)
# Time X1 X2 X3
#1 6 0 1 1

Related

Alternative of Stata's foreach command in R

I am a new user of R and am used to using Stata software.
I used to loop through multiple variables by foreach command in Stata. So, for example, I can convert multiple numerical variables to factor ones.
In Stata, first, define the label:
label define NoYes 0 "No" 1 "Yes"
Then, apply the loop command:
foreach x in var1 var2 var3 {
label values `x' NoYes
}
I am figuring out how I can do so in R; any help would be appreciated.
In base R we can use lapply.
dat[c(1, 3)] <- lapply(dat[c(1, 3)], factor, levels=0:1, labels=c('No', 'Yes'))
dat
# X1 X2 X3 X4 X5
# 1 No 1 No 1 0
# 2 No 1 Yes 1 1
# 3 No 0 No 0 0
# 4 No 1 No 0 0
# 5 Yes 0 Yes 0 0
# 6 Yes 1 Yes 0 0
To avoid confusion, I generally recommend not using too many fancy packages while you're new to R.
The literal translation could look like this (reload dat before trying):
for (x in c('X1', 'X3')) {
dat[[x]] <- factor(dat[[x]], levels=0:1, labels=c('No', 'Yes'))
}
dat
# X1 X2 X3 X4 X5
# 1 No 1 No 1 0
# 2 No 1 Yes 1 1
# 3 No 0 No 0 0
# 4 No 1 No 0 0
# 5 Yes 0 Yes 0 0
# 6 Yes 1 Yes 0 0
Data:
dat <- structure(list(X1 = c(0L, 0L, 0L, 0L, 1L, 1L), X2 = c(1L, 1L,
0L, 1L, 0L, 1L), X3 = c(0L, 1L, 0L, 0L, 1L, 1L), X4 = c(1L, 1L,
0L, 0L, 0L, 0L), X5 = c(0L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Example data:
library(dplyr)
df <- data.frame(var1 = c(1, 1, 0, 0),
var2 = c(0, 0, 1, 1),
var3 = c(0, 1, 0, 1))
There are several alternatives; across() from dplyr is one of them.
new_df <- df %>% mutate(across(var1:var3, ~ factor(.x, levels = c(0, 1), labels=c("No", "Yes"))))
new_df
var1 var2 var3
1 Yes No No
2 Yes No Yes
3 No Yes No
4 No Yes Yes
You do not really need levels = c(0, 1) here, but I would always do it in real data just to be safe.
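To stay closer to the Stata workflow, where the label is defined once and then applied to a list of variables, the same idea can be sketched in base R. The names no_yes and vars are hypothetical; the example data frame is the one shown above.

```r
# Example data, as in the answer above.
df <- data.frame(var1 = c(1, 1, 0, 0),
                 var2 = c(0, 0, 1, 1),
                 var3 = c(0, 1, 0, 1))

# Plays the role of Stata's `label define NoYes 0 "No" 1 "Yes"`
# (hypothetical name): values are the codes, names are the labels.
no_yes <- c(No = 0, Yes = 1)

# Plays the role of the variable list in `foreach x in var1 var2 var3`.
vars <- c("var1", "var2", "var3")

# Apply the same label definition to every listed column.
df[vars] <- lapply(df[vars], factor,
                   levels = unname(no_yes), labels = names(no_yes))
```

Keeping the label definition in one named vector means a second set of variables can reuse it without repeating the levels and labels.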

R - Linear regression with variables in different dataframes

I have 4 large matrices of the same size: A, B, C and D. Each matrix has n samples (columns) and n observations (rows).
A <- structure(list(S1 = c(0L, 0L, 1L, 1L), S2 = c(0L, 1L, 0L, 0L), S3 = c(0L, 0L, 0L, 1L)), class = "data.frame", row.names = c("Ob1", "Ob2", "Ob3", "Ob4"))
# S1 S2 S3
# Ob1 0 0 0
# Ob2 0 1 0
# Ob3 1 0 0
# Ob4 1 0 1
B <- structure(list(S1 = c(0L, 1L, 1L, 1L), S2 = c(0L, 8L, 0L, 0L), S3 = c(0L, 0L, 0L, 1L)), class = "data.frame", row.names = c("Ob1", "Ob2", "Ob3", "Ob4"))
# S1 S2 S3
# Ob1 0 0 0
# Ob2 1 8 0
# Ob3 1 0 0
# Ob4 1 0 1
C <- structure(list(S1 = c(0L, 0L, 4L, 1L), S2 = c(2L, 1L, 0L, 2L), S3 = c(0L, 0L, 0L, 1L)), class = "data.frame", row.names = c("Ob1", "Ob2", "Ob3", "Ob4"))
# S1 S2 S3
# Ob1 0 2 0
# Ob2 0 1 0
# Ob3 4 0 0
# Ob4 1 2 1
D <- structure(list(S1 = c(0L, 0L, 4L, 1L), S2 = c(8L, 1L, 5L, 0L), S3 = c(0L, 0L, 0L, 1L)), class = "data.frame", row.names = c("Ob1", "Ob2", "Ob3", "Ob4"))
# S1 S2 S3
# Ob1 0 8 0
# Ob2 0 1 0
# Ob3 4 5 0
# Ob4 1 0 1
Each matrix contains a different variable. I want to fit a linear regression of the 4 variables for each sample and observation of the matrices. I don't want regressions between every combination of samples and observations, just pairwise ones: column 1, row 1 of matrix A is fitted against column 1, row 1 of matrices B, C and D; column 2, row 2 against column 2, row 2; and so on.
lm model:
lm(A ~ B * C + D)
I want:
lm(A$S1_Obs1 ~ B$S1_Obs1 * C$S1_Obs1 + D$S1_Obs1)
lm(A$S1_Obs2 ~ B$S1_Obs2 * C$S1_Obs2 + D$S1_Obs2)
lm(A$S1_Obs3 ~ B$S1_Obs3 * C$S1_Obs3 + D$S1_Obs3)
lm(A$S2_Obs1 ~ B$S2_Obs1 * C$S2_Obs1 + D$S2_Obs1)
lm(A$S2_Obs2 ~ B$S2_Obs2 * C$S2_Obs2 + D$S2_Obs2)
lm(A$S2_Obs3 ~ B$S2_Obs3 * C$S2_Obs3 + D$S2_Obs3)
...
Any help appreciated.
We can use asplit to split each matrix by row, and then fit one linear model per row by looping over the split elements with Map:
out <- Map(function(a, b, c, d) lm(a ~ b * c + d),
asplit(A, 1), asplit(B, 1), asplit(C, 1), asplit(D, 1))
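As a rough sketch of how the result of the Map() call can be inspected, the snippet below rebuilds the four example data frames, fits the per-row models and collects the coefficients in one matrix. The names obs and coefs are made up for illustration. Note that with only three samples per observation the model has five coefficients and three data points, so some coefficients are necessarily NA in this toy example.

```r
obs <- paste0("Ob", 1:4)  # hypothetical helper for the row names
A <- data.frame(S1 = c(0, 0, 1, 1), S2 = c(0, 1, 0, 0), S3 = c(0, 0, 0, 1), row.names = obs)
B <- data.frame(S1 = c(0, 1, 1, 1), S2 = c(0, 8, 0, 0), S3 = c(0, 0, 0, 1), row.names = obs)
C <- data.frame(S1 = c(0, 0, 4, 1), S2 = c(2, 1, 0, 2), S3 = c(0, 0, 0, 1), row.names = obs)
D <- data.frame(S1 = c(0, 0, 4, 1), S2 = c(8, 1, 5, 0), S3 = c(0, 0, 0, 1), row.names = obs)

# One model per observation (row); each row is passed to lm() as a vector.
out <- Map(function(a, b, c, d) lm(a ~ b * c + d),
           asplit(A, 1), asplit(B, 1), asplit(C, 1), asplit(D, 1))

# One row of coefficients per fitted model:
# columns are (Intercept), b, c, d and the b:c interaction.
coefs <- t(sapply(out, coef))
```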
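Here is an approach using the purrr package that assigns names as well (it indexes the inputs element by element, so it assumes A, B, C and D are matrices, e.g. after as.matrix()):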
library(purrr)
seq_along(A) %>%
map(~ lm(A[.] ~ B[.] * C[.] + D[.])) %>%
set_names(map(seq_along(.),
~ arrayInd(.x, dim(A)) %>%
paste(collapse = "_")))

Filtering rows by different columns

In the data frame
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 1 2 2 0 3
3 2 2 0 0 2
4 1 3 0 0 2
5 3 3 2 1 4
6 2 0 0 0 1
column x5 indicates where the first non-zero value in a row is. The table should be read from right (x4) to left (x1). Thus, the first non-zero value in the first row is in column x3, for example.
I want to get all rows where 1 is the first non zero entry, i.e.
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 3 3 2 1 4
should be the result. I tried different versions of filter_at but didn't manage to come up with a solution. E.g. one try was
testdf %>% filter_at(vars(
paste("x",testdf$x5, sep = "")),
any_vars(. == 1))
I want to solve that without a for loop, since the real data set has millions of rows and almost 100 columns.
You can do filtering row-wise easily with the new utility function c_across:
library(dplyr) # version 1.0.2
testdf %>% rowwise() %>% filter(c_across(x1:x4)[x5] == 1) %>% ungroup()
# A tibble: 2 x 5
x1 x2 x3 x4 x5
<int> <int> <int> <int> <int>
1 0 1 1 0 3
2 3 3 2 1 4
A vectorised base R solution would be :
result <- df[df[cbind(1:nrow(df), df$x5)] == 1, ]
result
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
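cbind(1:nrow(df), df$x5) creates a two-column matrix of (row, column) index pairs, one per row, pointing at the position given by x5. Indexing df with that matrix extracts the first non-zero value of each row, and we keep the rows where that value is 1.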
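The matrix-indexing step can be unpacked as follows. This is only a sketch using the example data from the question; idx and first_nonzero are hypothetical names, and the data frame is converted with as.matrix() first to make the indexing explicit.

```r
df1 <- data.frame(x1 = c(0, 1, 2, 1, 3, 2),
                  x2 = c(1, 2, 2, 3, 3, 0),
                  x3 = c(1, 2, 0, 0, 2, 0),
                  x4 = c(0, 0, 0, 0, 1, 0),
                  x5 = c(3, 3, 2, 2, 4, 1))

# One (row, column) pair per row: row i, column x5[i].
idx <- cbind(seq_len(nrow(df1)), df1$x5)

# Indexing a matrix with a two-column matrix picks exactly those cells.
first_nonzero <- as.matrix(df1)[idx]

# Keep the rows whose first non-zero value is 1.
result <- df1[first_nonzero == 1, ]
```

No per-row function call is involved, which matters for the millions of rows mentioned in the question.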
Another vectorised solution:
df[t(df)[t(col(df)==df$x5)]==1,]
We can use apply in base R
df1[apply(df1, 1, function(x) x[x[5]] == 1),]
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
data
df1 <- structure(list(x1 = c(0L, 1L, 2L, 1L, 3L, 2L), x2 = c(1L, 2L,
2L, 3L, 3L, 0L), x3 = c(1L, 2L, 0L, 0L, 2L, 0L), x4 = c(0L, 0L,
0L, 0L, 1L, 0L), x5 = c(3L, 3L, 2L, 2L, 4L, 1L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Concatenate dichotomous columns to semicolon-separated column

I have a data frame containing the results of a multiple choice question. Each item has either 0 (not mentioned) or 1 (mentioned). The columns are named like this:
F1.2_1, F1.2_2, F1.2_3, F1.2_4, F1.2_5, F1.2_99
etc.
I would like to concatenate these values like this: The new column should be a semicolon-separated string of the selected items. So if a row has a 1 in F1.2_1, F1.2_4 and F1.2_5 it should be: 1;4;5
The last digit(s) of the dichotomous column names are the item codes to be used in the string.
Any idea how this could be achieved with R (and data.table)? Thanks for any help!
edit:
Here is an example DF with the desired result:
structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), desired_result = structure(c(3L,
2L, 4L, 1L), .Label = c("1;2;3", "1;3;4", "2", "99"), class = "factor")), .Names = c("F1.2_1",
"F1.2_2", "F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "desired_result"
), class = "data.frame", row.names = c(NA, -4L))
F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 desired_result
1 0 1 0 0 0 0 2
2 1 0 1 1 0 0 1;3;4
3 0 0 0 0 0 1 99
4 1 1 1 0 0 0 1;2;3
In his comment, the OP asked how to deal with more multiple choice questions.
The approach below will be able to handle an arbitrary number of questions and choices for each question. It uses melt() and dcast() from the data.table package.
Sample input data
Let's assume the input data.frame DT for the extended case contains two questions, one with 6 choices and the other with 4 choices:
DT
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11
#1: 0 1 0 0 0 0 0 1 1 0
#2: 1 0 1 1 0 0 1 1 1 1
#3: 0 0 0 0 0 1 1 0 1 0
#4: 1 1 1 0 0 0 1 0 1 1
Code
library(data.table)
# coerce to data.table and add row number for later join
setDT(DT)[, rn := .I]
# reshape from wide to long format
molten <- melt(DT, id.vars = "rn")
# alternatively, the measure cols can be specified (in case of other id vars)
# molten <- melt(DT, measure.vars = patterns("^F"))
# split question id and choice id
molten[, c("question_id", "choice_id") := tstrsplit(variable, "_")]
# reshape only selected choices from long to wide format,
# thereby pasting together the ids of the selected choices for each question
result <- dcast(molten[value == 1], rn ~ question_id, paste, collapse = ";",
fill = NA, value.var = "choice_id")
# final join for demonstration only, remove row number as no longer needed
DT[result, on = "rn"][, rn := NULL][]
# F1.2_1 F1.2_2 F1.2_3 F1.2_4 F1.2_5 F1.2_99 F2.7_1 F2.7_2 F2.7_3 F2.7_11 F1.2 F2.7
#1: 0 1 0 0 0 0 0 1 1 0 2 2;3
#2: 1 0 1 1 0 0 1 1 1 1 1;3;4 1;2;3;11
#3: 0 0 0 0 0 1 1 0 1 0 99 1;3
#4: 1 1 1 0 0 0 1 0 1 1 1;2;3 1;3;11
For each question, the final result shows which choices were selected in each row.
Reproducible data
The sample data can be created with
DT <- structure(list(F1.2_1 = c(0L, 1L, 0L, 1L), F1.2_2 = c(1L, 0L,
0L, 1L), F1.2_3 = c(0L, 1L, 0L, 1L), F1.2_4 = c(0L, 1L, 0L, 0L
), F1.2_5 = c(0L, 0L, 0L, 0L), F1.2_99 = c(0L, 0L, 1L, 0L), F2.7_1 = c(0L,
1L, 1L, 1L), F2.7_2 = c(1L, 1L, 0L, 0L), F2.7_3 = c(1L, 1L, 1L,
1L), F2.7_11 = c(0L, 1L, 0L, 1L)), .Names = c("F1.2_1", "F1.2_2",
"F1.2_3", "F1.2_4", "F1.2_5", "F1.2_99", "F2.7_1", "F2.7_2",
"F2.7_3", "F2.7_11"), row.names = c(NA, -4L), class = "data.frame")
We can try
j1 <- do.call(paste, c(as.integer(sub(".*_", "",
names(DF)[-7]))[col(DF[-7])]*DF[-7], sep=";"))
DF$newCol <- gsub("^;+|;+$", "", gsub(";*0;|0$|^0", ";", j1))
DF$newCol
#[1] "2" "1;3;4" "99" "1;2;3"
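For the single-question case, a more explicit (if slower) base R sketch of the same idea is to take the item codes from the column names and paste together the codes of the selected items row by row. The names items, codes, selected and new_col are hypothetical; the values are those of the example DF without the desired_result column.

```r
items <- data.frame(
  F1.2_1 = c(0, 1, 0, 1), F1.2_2 = c(1, 0, 0, 1), F1.2_3 = c(0, 1, 0, 1),
  F1.2_4 = c(0, 1, 0, 0), F1.2_5 = c(0, 0, 0, 0), F1.2_99 = c(0, 0, 1, 0)
)

# Item codes are whatever follows the last underscore in each column name.
codes <- sub(".*_", "", names(items))

# Logical matrix: TRUE where an item was mentioned.
selected <- as.matrix(items) == 1

# For each row, paste the codes of the selected items together with ";".
new_col <- apply(selected, 1, function(sel) paste(codes[sel], collapse = ";"))
```

A row with no selected items yields an empty string here rather than NA.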

How to count the number of combinations of boolean data in R

What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you asked for, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
