R: Join dataframe column to a partially matching grid

I have a data frame object where combinations of variables are represented by 1, but which is sparsely populated in that I do not have all combinations mapped out.
e.g.
A B C Outcome
1 0 0 700
0 1 0 900
0 0 1 450
1 1 0 280
0 1 1 100
... which is missing the potential combinations [1 0 1] and [1 1 1].
From this, I'd like to expand out all combinations of A, B, and C, taking the outcome value where the combination exists, and where not, populate Outcome with a zero.
e.g.
A B C Outcome
1 0 0 700
1 1 0 280
1 0 1 0 <- new row
1 1 1 0 <- new row
0 1 0 900
0 1 1 100
0 0 1 450
I'm afraid I don't really have any idea how to do this. I've had a look at expand.grid(), for example the following, which also uses rlply() from the plyr package:
expand.grid(rlply(n, c(0,1)))
which for n=3 gives
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
which pretty much gives me the grid I'm after, but I'm not clear now how to join my "Outcome" values to this grid, particularly where n is large (say 60 or 70 variables).
Any help gratefully received!

df <- read.table(text =
"A B C Outcome
1 0 0 700
0 1 0 900
0 0 1 450
1 1 0 280
0 1 1 100",
header = TRUE)
# build the full grid from the unique values of every column except Outcome,
# then left-join the original data onto it
res <-
  merge(
    x = do.call(what = "expand.grid", args = lapply(head(as.list(df), -1), unique)),
    y = df,
    all.x = TRUE
  )
# combinations that were missing from df come back as NA; set them to 0
res$Outcome[is.na(res$Outcome)] <- 0
res
# A B C Outcome
# 1 0 0 0 0
# 2 0 0 1 450
# 3 0 1 0 900
# 4 0 1 1 100
# 5 1 0 0 700
# 6 1 0 1 0
# 7 1 1 0 280
# 8 1 1 1 0
Edit:
Not sure whether it should go in a separate answer, but here is a more elegant way with the tidyr package:
library(tidyr)
complete(df, A, B, C, fill = list(Outcome = 0))
If you want to avoid typing all 60 or 70 column names:
complete_(df, cols = setdiff(names(df), "Outcome"), fill = list(Outcome = 0))
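Note that the underscore ("standard evaluation") verbs such as complete_() have since been deprecated in tidyr. A rough equivalent in current tidyr, splicing the column names in as symbols with rlang (a sketch, assuming a reasonably recent tidyr version):
library(tidyr)
library(rlang)  # for syms()
cols <- setdiff(names(df), "Outcome")
complete(df, !!!syms(cols), fill = list(Outcome = 0))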

Related

How to create multiple new columns based on groups of columns that start with a certain prefix and also contain a certain string?

I have data that look like this
df <- data.frame(ID = c(1,2,3,4,5,6),
                 var1_unmod = c(1,0,0,1,0,1),
                 var1_me1 = c(0,1,0,0,0,0),
                 var1_me2 = c(1,1,1,0,1,0),
                 var1_me3 = c(0,0,1,0,0,0),
                 var1_ac1 = c(1,0,1,1,0,1),
                 var2_unmod = c(1,0,1,1,0,0),
                 var2_me1 = c(0,0,0,0,1,0),
                 var2_me2 = c(1,1,0,1,1,1),
                 var2_ac1 = c(1,1,0,1,0,0),
                 var2_me1ac1 = c(1,0,0,0,0,0),
                 var2_me2ac1 = c(1,0,0,1,1,1))
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1
1 1 1 0 1 0 1 1 0 1 1 1 1
2 2 0 1 1 0 0 0 0 1 1 0 0
3 3 0 0 1 1 1 1 0 0 0 0 0
4 4 1 0 0 0 1 1 0 1 1 0 1
5 5 0 0 1 0 0 0 1 1 0 0 1
6 6 1 0 0 0 1 0 0 1 0 0 1
except that in the actual dataset, the prefixes aren't sequential like var1 and var2, they are basically random combinations of letters and numbers, and there are about 30 different ones.
For each of these prefixes (var1, var2, ...), I need to create a single variable that indicates whether any of the columns with that prefix that also contain me1, me2, or me3 (so for var2 this would be var2_me1, var2_me2, var2_me1ac1, var2_me2ac1) are nonzero. The output dataset would have additional columns like this:
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var1_meX var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1 var2_meX
1 1 1 0 1 0 1 1 1 0 1 1 1 1 1
2 2 0 1 1 0 0 1 0 0 1 1 0 0 1
3 3 0 0 1 1 1 1 1 0 0 0 0 0 0
4 4 1 0 0 0 1 0 1 0 1 1 0 1 1
5 5 0 0 1 0 0 1 0 1 1 0 0 1 1
6 6 1 0 0 0 1 0 0 0 1 0 0 1 1
First I need to identify the applicable columns for each prefix (because there is no pattern to the prefixes, I'm thinking I will have to hard code at least this part), and then maybe somehow write a loop that iterates through the columns (stored in a vector?) for each prefix. I tend to have trouble referencing varying column names within loops. Any help is appreciated!
Here is a basic approach:
cols <- colnames(df)
varnames <- c("var1", "var2")
df2 <- df
for (i in varnames) {
  newname <- paste(i, "meX", sep="_")
  # sum across all columns whose name contains both the prefix and "me"
  df2[, newname] <- apply(df2[, grepl(i, cols) & grepl("me", cols)], 1, sum)
  # convert the row sums into a 0/1 indicator
  df2[, newname] <- ifelse(df2[, newname] >= 1, 1, 0)
}
This will probably need to be modified based on the specific details of your data.
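One caveat, assuming the real names could be less tidy than the example: grepl(i, cols) matches the prefix anywhere in the name, so a prefix like "var1" would also pick up a hypothetical "var10" column, and grepl("me", cols) matches "me" anywhere in the name. Anchoring the pattern is safer; a sketch:
cols <- colnames(df)
varnames <- c("var1", "var2")
df2 <- df
for (i in varnames) {
  newname <- paste(i, "meX", sep = "_")
  # require the prefix at the start of the name, immediately followed by "_me"
  hit <- grepl(paste0("^", i, "_me"), cols)
  df2[[newname]] <- as.integer(rowSums(df[, hit, drop = FALSE]) > 0)
}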
Define the unique group prefixes in cols, then use lapply to iterate over each prefix and return 1 if there is at least one 1 in the row across its '_me' columns.
all_cols <- names(df)
cols <- c('var1', 'var2')
df[paste0(cols, '_meX')] <- lapply(cols, function(x)
as.integer(rowSums(df[grep(paste0(x, '_me'), all_cols, value = TRUE)]) > 0))
The new columns look like this:
df[13:14]
# var1_meX var2_meX
#1 1 1
#2 1 1
#3 1 0
#4 0 1
#5 1 1
#6 0 1
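Since the question mentions roughly 30 arbitrary prefixes, you may not want to type cols out by hand. Assuming every non-ID column follows the prefix_suffix naming pattern, the prefixes can be derived from the names:
# strip everything from the first underscore onwards, then de-duplicate
all_cols <- names(df)
cols <- unique(sub("_.*$", "", setdiff(all_cols, "ID")))
cols
# [1] "var1" "var2"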

How to set a loop to assign lots of variables

I just started using R for a psych class, so please go easy on me. I watched a bunch of youtube videos on For loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame together, and save them as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this (with column headings reflecting the letter of the data frame, i.e. B has QB1, QB2, etc.):
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
In order to do it in a for loop, you can use assign() from base R together with eval_tidy() and sym() from the rlang package. Basically, you need to evaluate strings as variable names.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
  # assign value to a string as variable
  assign(
    df_name,
    as.data.frame(
      matrix(
        data = sample(c(0, 1), size = nrows * ncols, replace = TRUE),
        ncol = ncols
      )
    )
  )
  # rename columns
  assign(
    df_name,
    setNames(eval_tidy(sym(df_name)), paste0("Q", df_name, 1:ncols))
  )
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
  new_df_name <- paste0("Q", i)
  # initialize Qi with the same number of rows as A, B, C, D ...
  assign(
    new_df_name,
    data.frame(tmp = matrix(NA, nrow = nrows))
  )
  # loop over A, B, C, D ... and bind their i-th columns
  for(df_name in df_names){
    assign(
      new_df_name,
      cbind(
        eval_tidy(sym(new_df_name)),
        eval_tidy(sym(df_name))[, i, drop = FALSE]
      )
    )
  }
  # drop the tmp column to clean up
  assign(
    new_df_name,
    eval_tidy(sym(new_df_name))[, -1]
  )
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The code would be much simpler if you saved the results in a list using map(); most of the complexity here comes from assigning values to separate variables.
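For example, a minimal sketch of the list-based version (shown with purrr::map(), though base lapply() works the same way):
library(purrr)
# one list element per question; Q_list[[1]] corresponds to Q1, and so on
Q_list <- map(1:25, ~ cbind(A[.x], B[.x], C[.x], D[.x]))
head(Q_list[[25]])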
You can combine dplyr and tidyr verbs in a for loop to bind together the columns from each data set and assign them to 25 new objects.
library(dplyr)
library(tidyr)
# merge data, gather, split by var numbers, assign each df to environment
for (i in 1:25) {
  df <- cbind(q1, q2, q3, q4) %>%
    mutate(id = row_number()) %>%
    gather(k, v, -id) %>%
    mutate(num = sub('A|B|C|D', '', k)) %>%
    filter(num == i) %>%
    select(-num) %>%
    spread(k, v)
  assign(paste0('df', i), df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))

Filtering on a Column Whose Number is Specified in Another Column

I'm looking for a better way to achieve what the code below does with a for loop. The goal is to create a dataframe (or matrix) where each row is a possible n-length sequence of 1s and 0s, followed by an n+1th column which contains a number corresponding to one of the previous columns that contains a 0.
So in the n == 3 case for example, we want to include a row like this:
1 0 0 2
but not this:
1 0 0 1
Here's the code I have now (assuming n == 3 for simplicity):
library(tidyverse)
df <- expand.grid(x = 0:1, y = 0:1, z = 0:1, target = 1:3, keep = FALSE)
for (row in 1:nrow(df)) {
  df$keep[row] <- df[row, df$target[row]] == 0
}
df <- df %>%
filter(keep == TRUE) %>%
select(-keep)
df
# x y z target
# 1 0 0 0 1
# 2 0 1 0 1
# 3 0 0 1 1
# 4 0 1 1 1
# 5 0 0 0 2
# 6 1 0 0 2
# 7 0 0 1 2
# 8 1 0 1 2
# 9 0 0 0 3
# 10 1 0 0 3
# 11 0 1 0 3
# 12 1 1 0 3
Seems like there has to be a better way to do this, especially with dplyr. But I can't figure out how to use the value of target to specify the column to filter on.
Using base R, we can create a row/column index to filter values from the dataframe and keep rows where the extracted value is 0.
df[df[cbind(seq_len(nrow(df)), df$target)] == 0, ]
# x y z target
#1 0 0 0 1
#3 0 1 0 1
#5 0 0 1 1
#7 0 1 1 1
#9 0 0 0 2
#10 1 0 0 2
#13 0 0 1 2
#14 1 0 1 2
#17 0 0 0 3
#18 1 0 0 3
#19 0 1 0 3
#20 1 1 0 3
data
df <- expand.grid(x = 0:1, y = 0:1, z = 0:1, target = 1:3)
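Since the question asks specifically about dplyr: one way to let target pick the column within each row is rowwise() with c_across() (a sketch, assuming dplyr >= 1.0):
library(dplyr)
df %>%
  rowwise() %>%
  # c_across(x:z) is the current row's x, y, z values; target indexes into it
  filter(c_across(x:z)[target] == 0) %>%
  ungroup()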

Change the value of variables that occur 80% of the times in each row, R

In my data, I have 74 observations (rows) and 128 variables (columns), where each variable takes either 0 or 1 as its value. In R, I am trying to write code that, for each row, finds the variables with the value 1 and changes 80% of those 1s to 0. I can calculate 80% of the number of times 1 appears in each row, but I am not able to pick those variables in each row and change their value from 1 to 0.
data# data frame with 74 observations and 128 variables
row1 <- data[1,]
count1 <- length(which(data[1,] == 1)) # #number of 1 in row 1
print(count1)
perform <- 80/100*count1# 80% of count1
The code below works for one row:
test <- t(apply(data[1,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
If I specify all the rows, the code does not work:
test <- t(apply(data[1:74,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
Example of desired output:
original data frame
df
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1
When the code is applied to all three rows in df, the output should look like this (80% of the 1s replaced with 0):
a b c d e f
1 1 0 0 0 1 0
2 0 0 1 0 0 0
3 0 1 1 0 0 0
Any suggestions appreciated. Thanks, Priya
A solution is to use apply row-wise and get the indices where the value is 1 using which. Afterwards, pick 80% of those indices using sample and replace those values with 0.
t(apply(df, 1, function(x){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# [1,] 0 0 0 1 0 0
# [2,] 0 0 0 1 0 0
# [3,] 0 0 1 0 0 1
# [4,] 0 1 0 0 0 0
# [5,] 0 1 0 0 0 0
# [6,] 1 0 0 0 0 0
# [7,] 0 0 0 0 0 1
# [8,] 0 0 1 0 0 0
# [9,] 0 0 1 0 1 0
# [10,] 0 0 0 0 0 1
Sample Data:
set.seed(1)
df <- data.frame(a = sample(c(0,1,1,1), 10, replace = TRUE),
b = sample(c(0,1,1,1), 10, replace = TRUE),
c = sample(c(0,1,1,1), 10, replace = TRUE),
d = sample(c(0,1,1,1), 10, replace = TRUE),
e = sample(c(0,1,1,1), 10, replace = TRUE),
f = sample(c(0,1,1,1), 10, replace = TRUE))
df
# a b c d e f
# 1 1 0 1 1 1 1
# 2 1 0 0 1 1 1
# 3 1 1 1 1 1 1
# 4 1 1 0 0 1 0
# 5 0 1 1 1 1 0
# 6 1 1 1 1 1 0
# 7 1 1 0 1 0 1
# 8 1 1 1 0 1 1
# 9 1 1 1 1 1 1
# 10 0 1 1 1 1 1
# Answer on OP's data
t(apply(df1, 1, function(x){
onesInX <- which(x==1)
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# 1 1 1 0 0 0 0 <- .8*6 = 4.8 => 4 has been converted to 0
# 2 0 0 0 1 0 0 <- .8*5 = 4.0 => 4 has been converted to 0
# 3 0 1 0 0 0 0 <- .8*4 = 3.2 => 3 has been converted to 0
# Data from OP
df1 <- read.table(text="
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1",
header = TRUE)
df1
# a b c d e f
# 1 1 1 1 1 1 1 <- No of 1 = 6
# 2 1 0 1 1 0 1 <- No of 1 = 4
# 3 1 1 1 0 1 1 <- No of 1 = 5
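One thing to keep in mind with sample(): when its first argument has length one (say onesInX is just 5), sample(onesInX, size) draws from 1:5 rather than from c(5). With floor(length * 0.8) the size is 0 for a row with a single 1, so it happens to be harmless here, but a version built on sample.int() avoids the trap if the proportion or rounding ever changes. A sketch on the same df1:
# sample positions within onesInX, then map back to the column indices
t(apply(df1, 1, function(x) {
  onesInX <- which(x == 1)
  drop <- onesInX[sample.int(length(onesInX), floor(length(onesInX) * 0.8))]
  x[drop] <- 0
  x
}))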

How to exclude cases that do not repeat X times in R?

I have long-format unbalanced longitudinal data. I would like to exclude all the cases that do not contain complete information, by which I mean all cases that do not repeat 8 times. Can someone help me find a solution?
Below is an example: I have three subjects {A, B, and C}. I have 8 observations for A and B, but only 2 for C. How can I delete the rows in which C is present, based on the fact that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
table(temp$V1) == 8 identifies the values in the V1 column that occur exactly 8 times. The names(which(...)) part turns those into a character vector that we can then match against with %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
Filter(function(subgroup) nrow(subgroup) == 8,
split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
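If you are already using dplyr, the same filter can be written with group_by() and n() (a sketch; V1 is the name read.table gives the first column by default):
library(dplyr)
temp %>%
  group_by(V1) %>%
  filter(n() == 8) %>%   # keep only subjects with exactly 8 rows
  ungroup()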
