How to set a loop to assign lots of variables - r

I just started using R for a psych class, so please go easy on me. I watched a bunch of YouTube videos on for loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame, and save the result as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this, with column headings reflecting the letter of the data frame (i.e. B has QB1, QB2, etc.):
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0

To do this in a for loop, you can use assign() from base R together with eval_tidy() and sym() from rlang. Basically, you need to evaluate strings as variable names.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
# assign value to a string as variable
assign(
df_name,
as.data.frame(
matrix(
data = sample(
c(0,1),
size = nrows * ncols,
replace = TRUE
),
ncol = ncols
)
)
)
# rename columns
assign(
df_name,
setNames(eval_tidy(sym(df_name)),paste0("Q",df_name,1:ncols))
)
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
new_df_name <- paste0("Q",i)
# initialize Qi with the same number of rows as A,B,C,D ...
assign(
new_df_name,
data.frame(tmp = matrix(NA,nrow = nrows))
)
# loop A,B,C,D ... and bind them
for(df_name in df_names){
assign(
new_df_name,
cbind(
eval_tidy(sym(new_df_name)),
eval_tidy(sym(df_name))[,i,drop = FALSE]
)
)
}
# drop tmp to clean up
assign(
new_df_name,
eval_tidy(sym(new_df_name))[,-1]
)
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The code would be much simpler if you saved the results in a list, e.g. using map() from purrr. Most of the complexity comes from assigning values to separate variables.
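As a rough sketch of the list idea in base R (lapply() is used here instead of map(); the helper make_df and the toy data are assumptions for illustration, not part of the question), each list element holds the i-th column of every data frame:

```r
# Toy stand-ins for A, B, C, D (assumed shape: 10 rows x 25 columns of 0/1)
set.seed(1)
make_df <- function(letter, nrows = 10, ncols = 25) {
  m <- matrix(sample(c(0, 1), nrows * ncols, replace = TRUE), ncol = ncols)
  setNames(as.data.frame(m), paste0("Q", letter, seq_len(ncols)))
}
dfs <- lapply(c(A = "A", B = "B", C = "C", D = "D"), make_df)

# One list element per question: the i-th column of every data frame, side by side
Q <- lapply(seq_len(25), function(i) {
  do.call(cbind, unname(lapply(dfs, function(d) d[i])))
})
names(Q) <- paste0("Q", seq_len(25))

# Access a single result with Q$Q2 or Q[["Q2"]]
head(Q$Q2)
```

If separate objects Q1 to Q25 are really required, list2env(Q, envir = globalenv()) would create them from the named list, but keeping everything in one list is usually easier to loop over later.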

You can combine some dplyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
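# requires dplyr and tidyr; q1..q4 are the toy data frames created further below
library(dplyr)
library(tidyr)
# merge data, gather, split by var numbers, assign each df to environment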
for (i in 1:25) {
df <- cbind(q1,q2,q3,q4) %>% mutate(id=row_number()) %>%
gather(k,v,-id) %>%
mutate(num=sub('A|B|C|D','',k)) %>%
filter(num==i) %>% select(-num) %>% spread(k,v)
assign(paste0('df',i),df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))

Related

How to create multiple new columns based off of groups of columns that start with a certain prefix and also contain a certain string?

I have data that look like this
df <- data.frame(ID = c(1,2,3,4,5,6),
var1_unmod = c (1,0,0,1,0,1),
var1_me1 = c(0,1,0,0,0,0),
var1_me2 = c(1,1,1,0,1,0),
var1_me3 = c(0,0,1,0,0,0),
var1_ac1 = c(1,0,1,1,0,1),
var2_unmod = c(1,0,1,1,0,0),
var2_me1 = c(0,0,0,0,1,0),
var2_me2 = c(1,1,0,1,1,1),
var2_ac1 = c(1,1,0,1,0,0),
var2_me1ac1 = c(1,0,0,0,0,0),
var2_me2ac1 = c(1,0,0,1,1,1))
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1
1 1 1 0 1 0 1 1 0 1 1 1 1
2 2 0 1 1 0 0 0 0 1 1 0 0
3 3 0 0 1 1 1 1 0 0 0 0 0
4 4 1 0 0 0 1 1 0 1 1 0 1
5 5 0 0 1 0 0 0 1 1 0 0 1
6 6 1 0 0 0 1 0 0 1 0 0 1
except that in the actual dataset, the prefixes aren't sequential like var1 and var2, they are basically random combinations of letters and numbers, and there are about 30 different ones.
For each of these prefixes (var1, var2, ...), I need to create a single variable that indicates whether any of the columns with that prefix that also contain me1, me2, or me3 (so for var2 this would be var2_me1, var2_me2, var2_me1ac1, var2_me2ac1) are nonzero. The output dataset would have additional columns like this:
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var1_meX var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1 var2_meX
1 1 1 0 1 0 1 1 1 0 1 1 1 1 1
2 2 0 1 1 0 0 1 0 0 1 1 0 0 1
3 3 0 0 1 1 1 1 1 0 0 0 0 0 0
4 4 1 0 0 0 1 0 1 0 1 1 0 1 1
5 5 0 0 1 0 0 1 0 1 1 0 0 1 1
6 6 1 0 0 0 1 0 0 0 1 0 0 1 1
First I need to identify the applicable columns for each prefix (because there is no pattern to the prefixes, I'm thinking I will have to hard code at least this part), and then maybe somehow write a loop that iterates through the columns (stored in a vector?) for each prefix. I tend to have trouble referencing varying column names within loops. Any help is appreciated!
Here is a basic approach:
cols <- colnames(df)
varnames <- c("var1", "var2")
df2 <- df
for (i in varnames) {
newname <- paste(i, "meX", sep="_")
df2[, newname] <- apply(df2[, grepl(i, cols) & grepl("me", cols)], 1, sum)
df2[, newname] <- ifelse(df2[, newname] >= 1, 1, 0)
}
This will probably need to be modified based on the specific details of your data.
Define the unique column-group prefixes in cols, then use lapply to iterate over each prefix and return 1 if there is at least one 1 in that row among the '_me' columns.
all_cols <- names(df)
cols <- c('var1', 'var2')
df[paste0(cols, '_meX')] <- lapply(cols, function(x)
as.integer(rowSums(df[grep(paste0(x, '_me'), all_cols, value = TRUE)]) > 0))
The new columns look like :
df[13:14]
# var1_meX var2_meX
#1 1 1
#2 1 1
#3 1 0
#4 0 1
#5 1 1
#6 0 1
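Since the real data has about 30 irregular prefixes, hard-coding them may be avoidable. The following is only a sketch under the assumption that every prefix is the text before the first underscore in the column name (and that the prefix itself contains no underscore); it rebuilds the example data from the question:

```r
df <- data.frame(ID = 1:6,
                 var1_unmod  = c(1,0,0,1,0,1),
                 var1_me1    = c(0,1,0,0,0,0),
                 var1_me2    = c(1,1,1,0,1,0),
                 var1_me3    = c(0,0,1,0,0,0),
                 var1_ac1    = c(1,0,1,1,0,1),
                 var2_unmod  = c(1,0,1,1,0,0),
                 var2_me1    = c(0,0,0,0,1,0),
                 var2_me2    = c(1,1,0,1,1,1),
                 var2_ac1    = c(1,1,0,1,0,0),
                 var2_me1ac1 = c(1,0,0,0,0,0),
                 var2_me2ac1 = c(1,0,0,1,1,1))

vars <- setdiff(names(df), "ID")
# Assumption: the prefix is everything before the first underscore
prefixes <- unique(sub("_.*$", "", vars))

for (p in prefixes) {
  # startsWith() does a literal match, so odd characters in a prefix are safe
  me_cols <- vars[startsWith(vars, paste0(p, "_me"))]
  if (length(me_cols) == 0) next  # no "_me" columns for this prefix
  # 1 if any of the matching columns is nonzero in that row, else 0
  df[[paste0(p, "_meX")]] <- as.integer(rowSums(df[me_cols] != 0) > 0)
}
```

The new columns are appended at the end of the data frame rather than placed next to their group; reorder afterwards if the layout matters.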

Create a new column based on several conditions

I want to create a new column based on some conditions imposed on several columns. For example, here is an example dataset:
a <- data.frame(x=c(1,0,1,0,0), y=c(0,0,0,0,0), z=c(1,1,0,0,0))
a
x y z
1 1 0 1
2 0 0 1
3 1 0 0
4 0 0 0
5 0 0 0
Specifically, if a 1 is present anywhere in a row, the new column should be 1 for that row. If all values are 0, the new column should be 0. So the dataset with the new column will be
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
My initial thought was to use %in% but couldn't get the result I want. Thank you for your help!
If your data frame consists of binary values, e.g., only 0 and 1, you can try the code below with rowSums
a$w <- +(rowSums(a)>0)
such that
> a
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
We can use rowMaxs from matrixStats
library(matrixStats)
a$w <- rowMaxs(as.matrix(a))
a$w
#[1] 1 1 1 0 0
You can find max of each row :
a$w <- do.call(pmax, a)
a
# x y z w
#1 1 0 1 1
#2 0 0 1 1
#3 1 0 0 1
#4 0 0 0 0
#5 0 0 0 0
which can also be done with apply :
a$w <- apply(a, 1, max)
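The approaches above rely on the data being strictly 0/1. If other values or missing values might appear (an assumption, the question only shows 0 and 1), a variant that tests explicitly for 1 could look like this:

```r
a <- data.frame(x = c(1, 0, 1, 0, 0),
                y = c(0, 0, 0, 0, 0),
                z = c(1, 1, 0, 0, 0))

# a == 1 gives a logical matrix; count the TRUEs per row, ignoring NAs
a$w <- as.integer(rowSums(a == 1, na.rm = TRUE) > 0)
a
```

Here na.rm = TRUE means a row containing only 0 and NA gets 0; drop it if such rows should yield NA instead.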

R: Generating sparse matrix with all elements as rows and columns

I have a data set of user-to-user relations. Not every user appears in both the rows and the columns. For example,
U1 U2 T
1 3 1
1 6 1
2 4 1
3 5 1
U1 and U2 represent users of the dataset. When I create a sparse matrix using the following code (df holds the dataset above as a data frame), I get:
trustmatrix <- xtabs(T~U1+U2,df,sparse = TRUE)
3 4 5 6
1 1 0 0 1
2 0 1 0 0
3 0 0 1 0
This matrix doesn't have all the users as rows and columns, as the matrix below does:
1 2 3 4 5 6
1 0 0 1 0 0 1
2 0 0 0 1 0 0
3 0 0 0 0 1 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
How can I get the matrix above when creating the sparse matrix in R?
We can convert the columns to factor with levels as 1 through 6 and then use xtabs
df1[1:2] <- lapply(df1[1:2], factor, levels = 1:6)
as.matrix(xtabs(T~U1+U2,df1,sparse = TRUE))
# U2
#U1 1 2 3 4 5 6
# 1 0 0 1 0 0 1
# 2 0 0 0 1 0 0
# 3 0 0 0 0 1 0
# 4 0 0 0 0 0 0
# 5 0 0 0 0 0 0
# 6 0 0 0 0 0 0
Or another option is to get the expanded index filled with 0s and then use sparseMatrix
library(tidyverse)
library(Matrix)
df2 <- crossing(U1 = 1:6, U2 = 1:6) %>%
left_join(df1) %>%
mutate(T = replace(T, is.na(T), 0))
sparseMatrix(i = df2$U1, j = df2$U2, x = df2$T)
Or use spread
spread(df2, U2, T)
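Another possibility, sketched here under the assumption that user IDs are the integers 1 to 6, is to pass the indices straight to sparseMatrix() and fix the size with its dims argument, so no zero-filled rows need to be generated first:

```r
library(Matrix)

# The example data from the question
df <- data.frame(U1 = c(1, 1, 2, 3),
                 U2 = c(3, 6, 4, 5),
                 T  = c(1, 1, 1, 1))

n_users <- 6  # assumed total number of users

# Only the nonzero entries are stored; dims forces a 6 x 6 result
trustmatrix <- sparseMatrix(i = df$U1, j = df$U2, x = df$T,
                            dims = c(n_users, n_users))
trustmatrix
```

This keeps the result sparse throughout, which matters when the number of users is large.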

find how many times and which columns a string is repeated

Each element of my data contains 6 strings, and each string has 6 characters. The data has white space too.
I want to know how many times each string is repeated across all columns.
For example, P67809 is repeated 2 times, in column a and column d,
so the output should look like
string No columns
P67809 2 a,d
Based on this function I can assign a row number to each string
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
Then I apply the function to each column, like
myS <- lapply(mydata, normalize,";")
but I don't know how to then search and get the output
We could melt the data from 'wide' to 'long' format. Split the 'value' column with ; to get a list output. We set the names of the list as the 'variable' column of 'dM'. Then stack the list to a two column output, and get the frequency count with 'tbl'. It may be easier to understand the result from the 'tbl' output.
library(reshape2)
dM <- melt(mydata, id.var=NULL)
lst1 <- setNames(strsplit(dM$value, ";"), dM$variable)
tbl <- table(stack(lst1)[2:1])
tbl
values
#ind A4QPH2 O60814 P0CG47 P0CG48 P14923 P15924 P19338 P35908 P42356 P57053 P58876 P62750 P62807 P62851 P62979 P63241 P67809 Q02413 Q06830 Q07955 Q16658 Q5QNW6 Q6IS14 Q8N8J0 Q93079 Q969S3
# a 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
# b 3 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0
# c 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0
# d 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1
# values
#ind Q99877 Q99879 Q9Y2T7
# a 0 0 1
# b 0 0 0
# c 0 0 0
# d 1 1 1
We get the total number of each element with colSums.
cS <- colSums(tbl)
If we need to get the output as in the OP's post, we can melt the list output to create a 2 column data.frame. From this, we convert to 'data.table' (setDT()), group by the 'value' column, get the number of unique elements of 'variable', and also paste together the unique elements.
library(data.table)
res <- setDT(melt(lst1))[, list(No= uniqueN(L1),
columns= toString(unique(L1))) ,.(string=value)]
head(res,2)
# string No columns
#1: P67809 2 a, d
#2: Q9Y2T7 2 a, d
One approach might be:
res <- apply(mydata, 2, function(x) unlist(strsplit(x, ";")))
un <- unique(unlist(res))
res2 <- sapply(un, function(x) lapply(res, function(y) as.numeric(x %in% y)))
res2
P67809 Q9Y2T7 P42356 Q8N8J0 A4QPH2 P35908 P19338 P15924 P14923 Q02413 P63241 Q6IS14
a 1 1 1 1 1 1 1 1 1 0 0 0
b 0 0 0 0 0 0 0 0 0 1 1 1
c 0 0 0 0 0 0 0 0 0 1 1 1
d 1 1 0 0 0 0 0 0 0 0 0 0
P62979 P0CG47 P0CG48 Q16658 P62851 Q07955 Q06830 P62807 O60814 P57053 Q99879 Q99877
a 0 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 0 0 0 0 0 0 0 0 0
c 1 1 1 1 1 1 0 0 0 0 0 0 0
d 1 1 1 0 0 0 1 1 1 1 1 1 1
Q93079 Q5QNW6 P58876 P62750 Q969S3
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 1 1 1 1 1
as.data.frame(t(apply(t(res2), 1, function(x) cbind(sum(as.numeric(x)), paste(names(x)[which(as.logical(x))], collapse = ",")))))
V1 V2
P67809 2 a,d
Q9Y2T7 2 a,d
P42356 1 a
Q8N8J0 1 a
A4QPH2 1 a
P35908 1 a
P19338 1 a
P15924 1 a
P14923 1 a
Q02413 2 b,c
P63241 2 b,c
Q6IS14 2 b,c
P62979 3 b,c,d
P0CG47 3 b,c,d
P0CG48 3 b,c,d
2 b,c
Q16658 1 c
P62851 1 c
Q07955 1 d
Q06830 1 d
P62807 1 d
O60814 1 d
P57053 1 d
Q99879 1 d
Q99877 1 d
Q93079 1 d
Q5QNW6 1 d
P58876 1 d
P62750 1 d
Q969S3 1 d
An alternative approach with cSplit from splitstackshape and gather from tidyr.
library(splitstackshape)
library(tidyr)
library(dplyr)
splitted <- cSplit(mydata, splitCols = names(mydata), sep = ";") %>% gather() # Split cols and melt data
splitted$key <- substring(splitted$key, 1, 1) # Lose irrelevant string
table(splitted) # Generate frequency table
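For comparison, the same string / No / columns summary can be sketched in base R alone. The mydata below is a small hypothetical stand-in (the original data is not shown in the question), and each string is counted at most once per column:

```r
# Hypothetical toy data: ";"-separated strings, one column per sample
mydata <- data.frame(
  a = c("P67809;Q9Y2T7", "P42356"),
  b = c("Q02413", "P63241;P62979"),
  d = c("P67809", "P62979;Q9Y2T7"),
  stringsAsFactors = FALSE
)

# Long format: one row per (string, column) pair, duplicates within a column dropped
per_col <- lapply(mydata, function(col) unique(trimws(unlist(strsplit(col, ";")))))
long <- stack(per_col)
long <- long[long$values != "", ]  # discard empty strings left by white space

# Group the column names by string, then summarise
by_string <- split(as.character(long$ind), long$values)
res <- data.frame(string  = names(by_string),
                  No      = lengths(by_string),
                  columns = vapply(by_string, paste, character(1), collapse = ","),
                  row.names = NULL)
res
```

Remove the unique() call if repeated occurrences inside one column should be counted separately.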

Using loop to make column selections using different vectors

Let's say I have 3 vectors (each of length 10):
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
Data.frame Y contains 10 columns and 6 rows:
1 2 3 4 5 6 7 8 9 10
0 1 0 0 1 1 1 0 1 0
1 1 1 0 1 0 1 0 0 0
0 0 0 0 1 0 0 1 0 1
1 0 1 1 0 1 1 1 0 0
0 0 0 0 0 0 1 0 0 0
1 1 0 1 0 0 0 0 1 1
I'd like to use vectors X, H and I to make column selections in data.frame Y, using the 1's and 0's in each vector as the selection criterion.
So the result for vector X, using '1' as the selection criterion, should be:
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
1 2 4 6 7
0 1 0 1 1
1 1 0 0 1
0 0 0 0 0
1 0 1 1 1
0 0 0 0 1
1 1 1 0 0
For vector H, using '1' as the selection criterion:
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
3 6 8 9 10
0 1 0 1 0
1 0 0 0 0
0 0 1 0 1
1 1 1 0 0
0 0 0 0 0
0 0 0 1 1
For vector I, using '1' as the selection criterion:
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
6 10
1 0
0 0
0 1
1 0
0 0
0 1
For convenience and speed I'd like to use a loop. It might be something like this:
all.ones <- lapply[,function(x) x %in% 1]
In the outcome (all.ones), the result for each vector should stay separate. For example:
X 1,2,4,6,7
H 3,6,8,9,10
I 6,10
The standard way of doing this is using the %in% operator:
Y[, X %in% 1]
To do this for multiple vectors (assuming you want an AND operation):
mylist = list(X, H, I, D, E, K)
Y[, Reduce(`&`, lapply(mylist, function(x) x %in% 1))]
The problem is the NA, use which to get round it. Consider the following:
x <- c(1,0,1,NA)
x[x==1]
[1] 1 1 NA
x[which(x==1)]
[1] 1 1
How about this?
idx <- which(X==1)
Y[,idx]
EDIT: For six vectors, do
idx <- which(X==1 & H==1 & I==1 & D==1 & E==1 & K==1)
Y[,idx]
Replace & with | if you want all columns of Y where at least one of the vectors has a 1.
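The question also asks for the result of each vector to stay separate. A minimal sketch of that, assuming Y is rebuilt from the table in the question with column names "1" to "10", is to put the vectors in a named list and apply the which() selection to each one:

```r
X <- c(1, 1, 0, 1, 0,  1, 1,  0,  NA, NA)
H <- c(0, 0, 1, 0, NA, 1, NA, 1,  1,  1)
I <- c(0, 0, 0, 0, 0,  1, NA, NA, NA, 1)

# Data frame Y as shown in the question (6 rows, 10 columns)
Y <- as.data.frame(matrix(c(
  0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
  1, 1, 1, 0, 1, 0, 1, 0, 0, 0,
  0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
  1, 0, 1, 1, 0, 1, 1, 1, 0, 0,
  0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
  1, 1, 0, 1, 0, 0, 0, 0, 1, 1), nrow = 6, byrow = TRUE))
names(Y) <- as.character(1:10)

selectors <- list(X = X, H = H, I = I)

# One data frame per vector; which() drops the NA positions,
# drop = FALSE keeps a data frame even if only one column matches
all.ones <- lapply(selectors, function(v) Y[, which(v == 1), drop = FALSE])

# Column names selected by each vector
lapply(all.ones, names)
```

Swapping v == 1 for v == 0 gives the corresponding selections on the zeros.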
