Adding together multiple sets of columns in R

I'm trying to add several sets of columns together.
Example df:
df <- data.frame(
key = 1:5,
ab0 = c(1,0,0,0,1),
ab1 = c(0,2,1,0,0),
ab5 = c(1,0,0,0,1),
bc0 = c(0,1,0,2,0),
bc1 = c(2,0,0,0,0),
bc5 = c(0,2,1,0,1),
df0 = c(0,0,0,1,0),
df1 = c(1,0,3,0,0),
df5 = c(1,0,0,0,6)
)
Giving me:
key ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5
1 1 1 0 1 0 2 0 0 1 1
2 2 0 2 0 1 0 2 0 0 0
3 3 0 1 0 0 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 1 0 1 0 0 1 0 0 6
I want to add each pair of columns ending in 0 and 5 together and place the result in the 0 column.
So the end result would be:
key ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5
1 1 2 0 1 0 2 0 1 1 1
2 2 0 2 0 3 0 2 0 0 0
3 3 0 1 0 1 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 2 0 1 1 0 1 6 0 6
I could add the columns together using 3 lines:
df$ab0 <- df$ab0 + df$ab5
df$bc0 <- df$bc0 + df$bc5
df$df0 <- df$df0 + df$df5
But my real example has over a hundred columns so I'd like to iterate over them and use apply.
The column names of the first set are contained in col0 and the names of the second set are in col5.
col0 <- c("ab0","bc0","df0")
col5 <- c("ab5","bc5","df5")
I created a function to add the columns together and called it with mapply:
fun1 <- function(df, x, y) {
  df[, x] <- df[, x] + df[, y]
}
mapply(fun1,df,col0,col5)
But I get an error: Error in df[, x] : incorrect number of dimensions
Thoughts?
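
The error happens because mapply() iterates over all of its arguments in parallel, so it treats the data frame df as just another vector and walks over its columns; inside fun1, df is then a single column and df[, x] has no second dimension. A minimal working sketch (one possible fix, not the only one) passes df once through MoreArgs so mapply() only loops over the two name vectors:
# A hypothetical variant of fun1: it returns the summed column, and the resulting
# matrix of sums is assigned back into the 0-columns in one step.
df[col0] <- mapply(function(x, y, df) df[[x]] + df[[y]],
                   col0, col5, MoreArgs = list(df = df))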

Simply add the two sets of subsetted columns together as data frames, assuming they are the same length. No loops needed; the whole operation is vectorized.
final_df <- df[grep("0", names(df))] + df[grep("5", names(df))]
final_df <- cbind(final_df, df[grep("0", names(df), invert=TRUE)])
final_df <- final_df[order(names(final_df))]
final_df
# ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5 key
# 1 2 0 1 0 2 0 1 1 1 1
# 2 0 2 0 3 0 2 0 0 0 2
# 3 0 1 0 1 0 1 0 3 0 3
# 4 0 0 0 2 0 0 1 0 0 4
# 5 2 0 1 1 0 1 6 0 6 5
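With over a hundred columns it may be safer to anchor the patterns, so that only names ending in 0 or 5 are matched; a sketch of the same idea:
# "0$" and "5$" match a trailing 0 or 5 only, so a 0 or 5 elsewhere in a
# column name cannot be picked up by accident.
final_df <- df[grep("0$", names(df))] + df[grep("5$", names(df))]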

You could use map2 from the purrr package to iterate over the two vectors at once:
df <- data.frame(
key = 1:5,
ab0 = c(1,0,0,0,1),
ab1 = c(0,2,1,0,0),
ab5 = c(1,0,0,0,1),
bc0 = c(0,1,0,2,0),
bc1 = c(2,0,0,0,0),
bc5 = c(0,2,1,0,1),
df0 = c(0,0,0,1,0),
df1 = c(1,0,3,0,0),
df5 = c(1,0,0,0,6)
)
col0 <- c("ab0","bc0","df0")
col5 <- c("ab5","bc5","df5")
purrr::map2(col0, col5, function(x, y) {
  df[[x]] <<- df[[x]] + df[[y]]
})
> df
key ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5
1 1 2 0 1 0 2 0 1 1 1
2 2 0 2 0 3 0 2 0 0 0
3 3 0 1 0 1 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 2 0 1 1 0 1 6 0 6
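If you would rather avoid the <<- side effect, one sketch (assuming col0 and col5 line up pairwise, as above) builds the summed columns as a list and assigns them back in a single step:
# map2() returns a list of summed vectors; assigning that list to df[col0]
# replaces the three 0-columns in one go, with no global assignment needed.
df[col0] <- purrr::map2(col0, col5, ~ df[[.x]] + df[[.y]])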

Here's an approach using tidyr and dplyr from the tidyverse meta-package.
First, I bring the table into long ("tidy") format, split the column name into two components, and spread by the number part of those components.
Then I do the calculation you describe.
Finally, I bring it back into the original format using the inverse of step 1.
library(tidyverse)
df_tidy <- df %>%
  # Step 1
  gather(col, value, -key) %>%
  separate(col, into = c("grp", "num"), 2) %>%
  spread(num, value) %>%
  # Step 2
  mutate(`0` = `0` + `5`) %>%
  # Step 3, which is just the inverse of Step 1.
  gather(num, value, -key, -grp) %>%
  unite(col, c("grp", "num")) %>%
  spread(col, value)
df_tidy
key ab_0 ab_1 ab_5 bc_0 bc_1 bc_5 df_0 df_1 df_5
1 1 2 0 1 0 2 0 1 1 1
2 2 0 2 0 3 0 2 0 0 0
3 3 0 1 0 1 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 2 0 1 1 0 1 6 0 6
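Note that unite() joins the two components with an underscore by default, which is why the result has names like ab_0 rather than ab0. You can either pass sep = "" to unite() or strip the underscore afterwards, for example:
# Drop the first underscore in each name so the rebuilt columns match the input names.
names(df_tidy) <- sub("_", "", names(df_tidy))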

Related

How to create multiple new columns based of off groups of columns that start with a certain prefix and also contain a certain string?

I have data that look like this
df <- data.frame(ID = c(1,2,3,4,5,6),
var1_unmod = c(1,0,0,1,0,1),
var1_me1 = c(0,1,0,0,0,0),
var1_me2 = c(1,1,1,0,1,0),
var1_me3 = c(0,0,1,0,0,0),
var1_ac1 = c(1,0,1,1,0,1),
var2_unmod = c(1,0,1,1,0,0),
var2_me1 = c(0,0,0,0,1,0),
var2_me2 = c(1,1,0,1,1,1),
var2_ac1 = c(1,1,0,1,0,0),
var2_me1ac1 = c(1,0,0,0,0,0),
var2_me2ac1 = c(1,0,0,1,1,1))
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1
1 1 1 0 1 0 1 1 0 1 1 1 1
2 2 0 1 1 0 0 0 0 1 1 0 0
3 3 0 0 1 1 1 1 0 0 0 0 0
4 4 1 0 0 0 1 1 0 1 1 0 1
5 5 0 0 1 0 0 0 1 1 0 0 1
6 6 1 0 0 0 1 0 0 1 0 0 1
except that in the actual dataset, the prefixes aren't sequential like var1 and var2, they are basically random combinations of letters and numbers, and there are about 30 different ones.
For each of these prefixes (var1, var2, ...), I need to create a single variable indicating whether any of the columns that have that prefix and also contain me1, me2, or me3 (so for var2 this would be var2_me1, var2_me2, var2_me1ac1, var2_me2ac1) are nonzero. The output dataset would have additional columns like this:
ID var1_unmod var1_me1 var1_me2 var1_me3 var1_ac1 var1_meX var2_unmod var2_me1 var2_me2 var2_ac1 var2_me1ac1 var2_me2ac1 var2_meX
1 1 1 0 1 0 1 1 1 0 1 1 1 1 1
2 2 0 1 1 0 0 1 0 0 1 1 0 0 1
3 3 0 0 1 1 1 1 1 0 0 0 0 0 0
4 4 1 0 0 0 1 0 1 0 1 1 0 1 1
5 5 0 0 1 0 0 1 0 1 1 0 0 1 1
6 6 1 0 0 0 1 0 0 0 1 0 0 1 1
First I need to identify the applicable columns for each prefix (because there is no pattern to the prefixes, I'm thinking I will have to hard code at least this part), and then maybe somehow write a loop that iterates through the columns (stored in a vector?) for each prefix. I tend to have trouble referencing varying column names within loops. Any help is appreciated!
Here is a basic approach:
cols <- colnames(df)
varnames <- c("var1", "var2")
df2 <- df
for (i in varnames) {
  newname <- paste(i, "meX", sep = "_")
  df2[, newname] <- apply(df2[, grepl(i, cols) & grepl("me", cols)], 1, sum)
  df2[, newname] <- ifelse(df2[, newname] >= 1, 1, 0)
}
This will probably need to be modified based on the specific details of your data.
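Since the real prefixes are arbitrary strings, it may also be safer to anchor the prefix match so that one prefix cannot match inside another column's name; a sketch of the adjusted loop:
# "^<prefix>_" matches the prefix only at the start of a name, and "_me" avoids
# matching letters inside the prefix itself.
for (i in varnames) {
  newname <- paste(i, "meX", sep = "_")
  hit <- grepl(paste0("^", i, "_"), cols) & grepl("_me", cols)
  df2[, newname] <- as.integer(rowSums(df2[, hit]) >= 1)
}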
Define the unique column groups in cols, then use lapply to iterate over each group and return 1 if at least one of that group's '_me' columns in the row is nonzero.
all_cols <- names(df)
cols <- c('var1', 'var2')
df[paste0(cols, '_meX')] <- lapply(cols, function(x)
as.integer(rowSums(df[grep(paste0(x, '_me'), all_cols, value = TRUE)]) > 0))
The new columns look like :
df[13:14]
# var1_meX var2_meX
#1 1 1
#2 1 1
#3 1 0
#4 0 1
#5 1 1
#6 0 1
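Since the real data has about 30 arbitrary prefixes, they do not have to be typed out by hand. Assuming every prefix ends at the first underscore, they can be derived from the column names:
# Take everything before the first "_" in each non-ID column name, then deduplicate.
cols <- unique(sub("_.*", "", setdiff(names(df), "ID")))
cols
# [1] "var1" "var2"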

How to count number of columns that have a value by a grouping variable in R?

I have data like this:
repetition Ob1 Ob2 Ob3 Ob4
1 0 0 0 1
1 0 0 3 0
1 1 3 3 0
1 2 3 3 0
2 4 0 2 2
2 4 0 3 0
2 0 0 0 0
3 0 0 0 0
3 4 0 4 0
3 0 0 0 0
I want to count the number of columns per repetition that have a certain value e.g. 1.
So in this case repetition 1 should return a 2 because Ob1 and Ob4 have a value of 1. Everything else gets a 0 because there are no other repetitions with a 1.
You can get the count using the dplyr package with the code below. Note that rowSums() counts matching cells rather than distinct columns; the two coincide here because no column contains more than one 1 within a repetition:
library(dplyr)
df$count <- rowSums(df[,2:5] == 1)
df %>% select(repetition, count) %>% group_by(repetition) %>% summarise(count = sum(count))
# A tibble: 3 x 2
repetition count
<int> <dbl>
1 1 2
2 2 0
3 3 0
You can use by like:
by(x[-1]==1, x$repetition, function(y) sum(colSums(y) > 0))
#INDICES: 1
#[1] 2
#------------------------------------------------------------
#INDICES: 2
#[1] 0
#------------------------------------------------------------
#INDICES: 3
#[1] 0
or to return a named vector
c(by(x[-1]==1, x$repetition, function(y) sum(colSums(y) > 0)))
#1 2 3
#2 0 0
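For completeness, the same per-repetition column count can also be sketched with base aggregate(), assuming the data frame is called df as in the first answer:
# For each repetition, flag whether each Ob column contains a 1, then count the flags.
agg <- aggregate(cbind(Ob1, Ob2, Ob3, Ob4) ~ repetition, data = df,
                 FUN = function(v) any(v == 1))
data.frame(repetition = agg$repetition, count = rowSums(agg[-1]))
#   repetition count
# 1          1     2
# 2          2     0
# 3          3     0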

How to set a loop to assign lots of variables

I just started using R for a psych class, so please go easy on me. I watched a bunch of youtube videos on For loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame together, and save them as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this, with column headings reflecting the letter of the data frame (e.g. B has QB1, QB2, etc.):
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
In order to do it in a for loop, you need to use assign() from base R together with eval_tidy() and sym() from the rlang package. Basically, you need to evaluate strings as variable names.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
  # assign value to a string as variable
  assign(
    df_name,
    as.data.frame(
      matrix(
        data = sample(
          c(0,1),
          size = nrows * ncols,
          replace = TRUE
        ),
        ncol = ncols
      )
    )
  )
  # rename columns
  assign(
    df_name,
    setNames(eval_tidy(sym(df_name)), paste0("Q", df_name, 1:ncols))
  )
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
  new_df_name <- paste0("Q", i)
  # initialize Qi with the same number of rows as A, B, C, D ...
  assign(
    new_df_name,
    data.frame(tmp = matrix(NA, nrow = nrows))
  )
  # loop over A, B, C, D ... and bind their i-th columns
  for(df_name in df_names){
    assign(
      new_df_name,
      cbind(
        eval_tidy(sym(new_df_name)),
        eval_tidy(sym(df_name))[, i, drop = FALSE]
      )
    )
  }
  # drop tmp to clean up
  assign(
    new_df_name,
    eval_tidy(sym(new_df_name))[, -1]
  )
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The code would be much simpler if you saved the results in a list, e.g. with lapply() or purrr::map(); most of the complexity here comes from assigning values to separate variables.
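For example, a sketch of that list-based approach, assuming A, B, C and D already exist as above:
# One list element per question; Q[["Q25"]] plays the role of the Q25 object built above.
Q <- lapply(1:25, function(i) cbind(A[i], B[i], C[i], D[i]))
names(Q) <- paste0("Q", 1:25)
head(Q[["Q25"]])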
You can combine some dplyr and tidyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
# merge data, gather, split by var numbers, assign each df to environment
library(dplyr)
library(tidyr)
for (i in 1:25) {
  df <- cbind(q1, q2, q3, q4) %>%
    mutate(id = row_number()) %>%
    gather(k, v, -id) %>%
    mutate(num = sub('A|B|C|D', '', k)) %>%
    filter(num == i) %>%
    select(-num) %>%
    spread(k, v)
  assign(paste0('df', i), df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))
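A loop-free sketch of the same idea, keeping the 25 pieces in a named list instead of 25 separate objects (an alternative, not a drop-in replacement for the loop above):
library(dplyr)
library(tidyr)
# Reshape once into long format, then split by the question number.
long <- cbind(q1, q2, q3, q4) %>%
  mutate(id = row_number()) %>%
  gather(k, v, -id) %>%
  mutate(num = as.integer(sub('A|B|C|D', '', k)))
df_list <- lapply(split(long, long$num), function(d) spread(select(d, -num), k, v))
names(df_list) <- paste0("df", names(df_list))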

Filtering on a Column Whose Number is Specified in Another Column

I'm looking for a better way to achieve what the code below does with a for loop. The goal is to create a dataframe (or matrix) where each row is a possible n-length sequence of 1s and 0s, followed by an n+1th column which contains a number corresponding to one of the previous columns that contains a 0.
So in the n == 3 case for example, we want to include a row like this:
1 0 0 2
but not this:
1 0 0 1
Here's the code I have now (assuming n == 3 for simplicity):
library(tidyverse)
df <- expand.grid(x = 0:1, y = 0:1, z = 0:1, target = 1:3, keep = FALSE)
for (row in 1:nrow(df)) {
  df$keep[row] <- df[row, df$target[row]] == 0
}
df <- df %>%
  filter(keep == TRUE) %>%
  select(-keep)
df
# x y z target
# 1 0 0 0 1
# 2 0 1 0 1
# 3 0 0 1 1
# 4 0 1 1 1
# 5 0 0 0 2
# 6 1 0 0 2
# 7 0 0 1 2
# 8 1 0 1 2
# 9 0 0 0 3
# 10 1 0 0 3
# 11 0 1 0 3
# 12 1 1 0 3
Seems like there has to be a better way to do this, especially with dplyr. But I can't figure out how to use the value of target to specify the column to filter on.
Using base R, we can create a row/column index to filter values from the dataframe and keep rows where the extracted value is 0.
df[df[cbind(seq_len(nrow(df)), df$target)] == 0, ]
# x y z target
#1 0 0 0 1
#3 0 1 0 1
#5 0 0 1 1
#7 0 1 1 1
#9 0 0 0 2
#10 1 0 0 2
#13 0 0 1 2
#14 1 0 1 2
#17 0 0 0 3
#18 1 0 0 3
#19 0 1 0 3
#20 1 1 0 3
data
df <- expand.grid(x = 0:1, y = 0:1, z = 0:1, target = 1:3)
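If you prefer to stay in dplyr, one sketch uses rowwise() to look up the value of the column whose position is stored in target (this is just one possible rewrite, assuming only the x, y, z columns are candidates):
library(dplyr)
# For each row, pick the target-th of the x/y/z values and keep rows where it is 0.
df %>%
  rowwise() %>%
  mutate(target_value = c(x, y, z)[target]) %>%
  ungroup() %>%
  filter(target_value == 0) %>%
  select(-target_value)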

Get indices from each row and merge with original data.frame

I have the following data.frame
user_id 1 2 3 4 5 6 7 8 9
1 54449024717783 0 0 1 0 0 0 0 0 0
2 117592134783793 0 0 0 0 0 1 0 0 0
3 187145545782493 0 0 1 0 0 0 0 0 0
4 245003020993334 0 0 0 0 0 1 0 0 0
5 332625230637592 0 1 0 0 0 0 0 0 0
6 336336752713947 0 1 0 0 0 0 0 0 0
What I would like to do is create a single column (and remove columns 1:9) containing the name of the column where the value is 1; each user has exactly one column with the value 1.
If I run the following function:
rowSums(users_cluster(users_cluster), dims = 1)
it sums the values across every row, but what I need instead is the corresponding column name.
Base R solution:
data.frame(user_id = df[, 1],
name = which(t(df[, -1] == 1)) %% (ncol(df) - 1))
# user_id name
# 1 54449024717783 3
# 2 117592134783793 6
# 3 187145545782493 3
# 4 245003020993334 6
# 5 332625230637592 2
# 6 336336752713947 2
Here's another base R option:
inds <- which(df[,-1]!=0,TRUE)
df$newcol <- inds[order(row.names(inds)),][,2]
df[,c(1,11)]
# user_id newcol
#1 5.444902e+13 3
#2 1.175921e+14 6
#3 1.871455e+14 3
#4 2.450030e+14 6
#5 3.326252e+14 2
#6 3.363368e+14 2
Another approach is max.col from base R, since the OP specified that each user has only one column with the value 1:
cbind(dat[1], ind = max.col(dat[-1], 'first'))
# user_id ind
#1 54449024717783 3
#2 117592134783793 6
#3 187145545782493 3
#4 245003020993334 6
#5 332625230637592 2
#6 336336752713947 2
Another base R solution:
df$ind = apply(df[,-1]>0,1,which)
df[,c("user_id","ind")]
Output:
user_id ind
1 5.444902e+13 3
2 1.175921e+14 6
3 1.871455e+14 3
4 2.450030e+14 6
5 3.326252e+14 2
6 3.363368e+14 2
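Note that apply(df[,-1] > 0, 1, which) silently returns a list instead of a vector if any row has zero or more than one nonzero entry; a slightly more defensive sketch that enforces exactly one match per row:
# vapply() insists on a single integer per row, so it throws an error instead of
# silently producing a list when a row does not have exactly one nonzero value.
df$ind <- vapply(seq_len(nrow(df)), function(i) which(df[i, -1] > 0), integer(1))
df[, c("user_id", "ind")]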
A solution using the tidyverse.
library(tidyverse)
dat2 <- dat %>%
  mutate(ID = 1:n()) %>%
  gather(Column, Value, -user_id, -ID) %>%
  filter(Value == 1) %>%
  arrange(ID) %>%
  select(-Value, -ID) %>%
  as.data.frame()
dat2
# user_id Column
# 1 54449024717783 3
# 2 117592134783793 6
# 3 187145545782493 3
# 4 245003020993334 6
# 5 332625230637592 2
# 6 336336752713947 2
DATA
dat <- read.table(text = " user_id 1 2 3 4 5 6 7 8 9
1 54449024717783 0 0 1 0 0 0 0 0 0
2 117592134783793 0 0 0 0 0 1 0 0 0
3 187145545782493 0 0 1 0 0 0 0 0 0
4 245003020993334 0 0 0 0 0 1 0 0 0
5 332625230637592 0 1 0 0 0 0 0 0 0
6 336336752713947 0 1 0 0 0 0 0 0 0",
header = TRUE, stringsAsFactors = FALSE)
library(tidyverse)
dat <- as_tibble(dat) %>%
  setNames(sub("X", "", names(.))) %>%
  mutate(user_id = as.character(user_id))
For the sake of completeness, here is also a data.table solution which uses melt() to reshape from wide to long format:
library(data.table)
melt(setDT(DF), id = "user_id")[value == 1L][order(user_id), !"value"]
user_id variable
1: 54449024717783 3
2: 117592134783793 6
3: 187145545782493 3
4: 245003020993334 6
5: 332625230637592 2
6: 336336752713947 2
This takes advantage of the fact that the sample dataset is already sorted by ascending user_id.
In case the sample dataset has a different order which should be maintained in the final result, it is necessary to remember that order by introducing a temporary row id:
melt(setDT(DF), id = "user_id")[, rn := rowid(variable)][value == 1L][
order(rn), !c("rn", "value")]
or, alternatively,
melt(setDT(DF), id = "user_id")[, rn := rowid(variable)][, setorder(.SD, rn)][
value == 1L, !c("rn", "value")]
Data
library(data.table)
DF <- fread(
"i user_id 1 2 3 4 5 6 7 8 9
1 54449024717783 0 0 1 0 0 0 0 0 0
2 117592134783793 0 0 0 0 0 1 0 0 0
3 187145545782493 0 0 1 0 0 0 0 0 0
4 245003020993334 0 0 0 0 0 1 0 0 0
5 332625230637592 0 1 0 0 0 0 0 0 0
6 336336752713947 0 1 0 0 0 0 0 0 0"
, drop = 1L)[, lapply(.SD, as.integer), by = user_id]

Resources