How to recode many data frame columns with the same function - r

I have a data frame like this:
CriterionVar Var1 Var2 Var3
3 0 0 0
1 0 0 0
2 0 0 0
5 0 0 0
I want to recode the values of Var1, Var2, and Var3 based on the value of CriterionVar. In pseudocode, it would be something like this:
for each row
if (CriterionVar.value >= Var1.index) Var1 = 1
if (CriterionVar.value >= Var2.index) Var2 = 1
if (CriterionVar.value >= Var3.index) Var3 = 1
The recoded data frame would look like this:
CriterionVar Var1 Var2 Var3
3 1 1 1
1 1 0 0
2 1 1 0
5 1 1 1
Obviously, that is not the way to get it done because (1) the number of VarN columns is determined by a data value, and (2) it's just ugly.
Any help is appreciated.

For more general values of CriterionVar, you can use outer to construct a logical matrix which you can use for indexing like this:
dat[2:4][outer(dat$CriterionVar, seq_along(names(dat)[-1]), ">=")] <- 1
In this example, this returns
dat
CriterionVar Var1 Var2 Var3
1 3 1 1 1
2 1 1 0 0
3 2 1 1 0
4 5 1 1 1
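For reference, the logical index matrix that outer() builds here (one row per CriterionVar value, one column per Var column) is:
outer(dat$CriterionVar, seq_along(names(dat)[-1]), ">=")
#      [,1]  [,2]  [,3]
#[1,]  TRUE  TRUE  TRUE
#[2,]  TRUE FALSE FALSE
#[3,]  TRUE  TRUE FALSE
#[4,]  TRUE  TRUE  TRUE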
A second method using col, which returns a matrix of the column index, is a tad bit more direct:
dat[2:4][dat$CriterionVar >= col(dat[-1])] <- 1
and returns the desired result.
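For reference, col(dat[-1]) is just a matrix of column positions, so the comparison marks every Var column whose position is no greater than that row's CriterionVar:
col(dat[-1])
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    1    2    3
#[3,]    1    2    3
#[4,]    1    2    3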
data
dat <-
structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(0L,
0L, 0L, 0L), Var2 = c(0L, 0L, 0L, 0L), Var3 = c(0L, 0L, 0L, 0L
)), .Names = c("CriterionVar", "Var1", "Var2", "Var3"), class = "data.frame",
row.names = c(NA, -4L))

# Column i of df holds Var(i-1), so set it to 1 when CriterionVar >= i - 1
df[, -1] <- lapply(2:NCOL(df), function(i) as.numeric(df[, 1] >= (i - 1)))
df
# CriterionVar Var1 Var2 Var3
#1 3 1 1 1
#2 1 1 0 0
#3 2 1 1 0
#4 5 1 1 1
DATA
df = structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(1,
1, 1, 1), Var2 = c(1, 0, 1, 1), Var3 = c(1, 0, 0, 1)), .Names = c("CriterionVar",
"Var1", "Var2", "Var3"), row.names = c(NA, -4L), class = "data.frame")

I'm a big proponent of vapply: it's fast, and you know the shape of what it'll return. The only problem is the resulting matrix is usually the "sideways" version of what you want. But t() fixes that easily enough.
n_var_cols <- 3
truncated_criterion <- pmin(dat[["CriterionVar"]], n_var_cols)
row_template <- rep_len(0, n_var_cols)
replace_up_to_index <- function(index) {
  replace(row_template, seq_len(index), 1)
}
over_matrix <- vapply(
  X = truncated_criterion,
  FUN = replace_up_to_index,
  FUN.VALUE = row_template
)
over_matrix <- t(over_matrix)
dat[, -1] <- over_matrix
dat
# CriterionVar Var1 Var2 Var3
# 1 3 1 1 1
# 2 1 1 0 0
# 3 2 1 1 0
# 4 5 1 1 1
There was some bookkeeping in the first three lines, but nothing too bad. I used pmin() to restrict the criteria values to be no greater than the number of VarN columns.
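For example, with the CriterionVar values above:
pmin(c(3, 1, 2, 5), 3)
#[1] 3 1 2 3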

Related

How to replace the values in a binary matrix with values from a dataframe?

The matrix I have looks something like this:
Plot A B C
1 1 0 0
2 1 0 1
3 1 1 0
And I have a dataframe that looks like this
A 5
B 4
C 2
What I would like to do is replace the "1" values in the matrix with the corresponding values in the dataframe, like this:
Plot A B C
1 5 0 0
2 5 0 2
3 5 4 0
Any suggestions on how to do this in R? Thank you!
An option with tidyverse
library(dplyr)
df1 %>%
  mutate(across(all_of(df2$col1),
                ~ replace(.x, .x == 1, df2$col2[match(cur_column(), df2$col1)])))
-output
Plot A B C
1 1 5 0 0
2 2 5 0 2
3 3 5 4 0
data
df1 <- structure(list(Plot = 1:3, A = c(1L, 1L, 1L), B = c(0L, 0L, 1L
), C = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(col1 = c("A", "B", "C"), col2 = c(5, 4, 2)),
class = "data.frame", row.names = c(NA,
-3L))
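The same lookup-and-replace idea can also be sketched in base R (a minimal sketch, assuming the df1/df2 objects from the data block above):
vals <- setNames(df2$col2, df2$col1)
df1[names(vals)] <- Map(function(col, v) replace(col, col == 1, v),
                        df1[names(vals)], vals)
df1
#  Plot A B C
#1    1 5 0 0
#2    2 5 0 2
#3    3 5 4 0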

How to get a merged data frame from two data frames that share some columns (R)

I want to merge two data frames so that, for the columns they share, the values of one are added to the existing values of the other.
For example:
df1
No A B C D
1  1 0 1 0
2  0 1 2 1
3  0 0 1 0
df2
No A B E F
1  1 0 1 1
2  0 1 2 1
3  2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1  2 0 1 0 1 1
2  0 2 2 1 2 1
3  2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
  left_join(df2, by = "No") %>%
  mutate(across(ends_with(".x"),
                ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
  rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
  select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(No) %>%
  summarise(across(c(A, B, C, D, E, `F`), sum, na.rm = TRUE),
            .groups = "drop")
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
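For instance, with the df1/df2 above, the stacked data frame looks like this before summarising, with NA wherever a column is absent from one of the inputs:
bind_rows(df1, df2)
#  No A B  C  D  E  F
#1  1 1 0  1  0 NA NA
#2  2 0 1  2  1 NA NA
#3  3 0 0  1  0 NA NA
#4  1 1 0 NA NA  1  1
#5  2 0 1 NA NA  2  1
#6  3 2 1 NA NA  1  0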
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (with R 4.1.0). Get the objects into a list ('lst1'), then find the union of their column names ('nm1'). Loop over the list, adding zero-valued columns (via setdiff) to each element so all elements share the same columns, rbind them, and use aggregate to get the sum grouped by 'No'.
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
  {\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
  {\(x) do.call(rbind, x)}() |>
  {\(dat) aggregate(. ~ No, data = dat, FUN = sum, na.rm = TRUE,
                    na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0

R - Function to make a binary variable

I have some variables that take values between 1 and 5. I would like to recode them as 0 if the value is between 1 and 3 (inclusive) and 1 if the value is 4 or 5.
My dataset looks like this
var1 var2 var3
1 1 NA
4 3 4
3 4 5
2 5 3
So I would like it to be like this:
var1 var2 var3
0 0 NA
1 0 1
0 1 1
0 1 0
I tried to do a function and to call it
making_binary <- function(var) {
  var <- factor(var >= 4, labels = c(0, 1))
  return(var)
}
df <- lapply(df, making_binary)
But I got an error: incorrect labels: length 2 must be 1 or 1
Where did I go wrong?
Thank you very much for your answers!
You can use :
df[] <- +(df == 4 | df == 5)
df
# var1 var2 var3
#1 0 0 NA
#2 1 0 1
#3 0 1 1
#4 0 1 0
The comparison df == 4 | df == 5 returns logical values (TRUE/FALSE); the unary + then turns those logicals into integers (1/0).
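For instance:
+c(TRUE, FALSE, NA)
#[1]  1  0 NA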
If you want to apply this for selected columns you can subset the columns by position or by name.
cols <- 1:3 #Position
#cols <- grep('var', names(df)) #Name
df[cols] <- +(df[cols] == 4 | df[cols] == 5)
As far as your function is concerned you can do :
making_binary <- function(var) {
  var <- as.integer(var >= 4)
  # which is a faster version of
  # var <- ifelse(var >= 4, 1, 0)
  return(var)
}
df[] <- lapply(df, making_binary)
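Applied to the data below, this reproduces the desired 0/1 coding (as.integer() keeps NA as NA):
df
#  var1 var2 var3
#1    0    0   NA
#2    1    0    1
#3    0    1    1
#4    0    1    0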
data
df <- structure(list(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L,
5L), var3 = c(NA, 4L, 5L, 3L)), class = "data.frame", row.names = c(NA, -4L))
I think ifelse would fit the problem well:
df[] <- lapply(df, function(x) ifelse(x >=1 & x <=3, 0, x))
df
var1 var2 var3
1 0 0 NA
2 4 0 4
3 0 4 5
4 0 5 0
df[] <- lapply(df, function(x) ifelse(x >=4 & x <=5, 1, x))
df
var1 var2 var3
1 0 0 NA
2 1 0 1
3 0 1 1
4 0 1 0
If you need to do the two steps at once, you can look at dplyr::case_when() or data.table::fcase().
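For example, a minimal case_when() sketch that handles both steps in one pass, starting from the original df and assuming every value is between 1 and 5 or NA:
library(dplyr)
df %>%
  mutate(across(everything(),
                ~ case_when(.x >= 1 & .x <= 3 ~ 0,
                            .x >= 4 & .x <= 5 ~ 1)))
#  var1 var2 var3
#1    0    0   NA
#2    1    0    1
#3    0    1    1
#4    0    1    0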
You can simply test if the value is larger than 3, which will return TRUE and FALSE and cast this to a number:
+(x>3)
# var1 var2 var3
#[1,] 0 0 NA
#[2,] 1 0 1
#[3,] 0 1 1
#[4,] 0 1 0
In case you want this only for some columns, you have to subset them. E.g. for column 1 and 2:
+(x[1:2]>3)
#+(x[c("var1","var2")]>3) #Alternative
# var1 var2
#[1,] 0 0
#[2,] 1 0
#[3,] 0 1
#[4,] 0 1
Data:
x <- data.frame(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L, 5L)
, var3 = c(NA, 4L, 5L, 3L))

Repeat a value within each ID

I have a dataset in R in long format. IDs do not all appear the same number of times (e.g. one ID might have a single row, another might have 79).
e.g.
ID V1 V2
1 B 0
1 A 1
1 C 0
2 C 0
3 A 0
3 C 0
I want to create a variable that is 1 for every row of a given ID if any of that ID's rows have V2 == 1, and 0 otherwise
e.g.
ID V1 V2 V3
1 B 0 1
1 A 1 1
1 C 0 1
2 C 0 0
3 A 0 0
3 C 0 0
In base R we can use any, with ave handling the grouping.
DF$V3 <- with(DF, ave(V2, ID, FUN = function(x) any(x == 1)))
DF
# ID V1 V2 V3
#1 1 B 0 1
#2 1 A 1 1
#3 1 C 0 1
#4 2 C 0 0
#5 3 A 0 0
#6 3 C 0 0
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), V1 = c("B", "A",
"C", "C", "A", "C"), V2 = c(0L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"V1", "V2"), class = "data.frame", row.names = c(NA, -6L))
Here's a tidyverse solution.
If V2 can only be 0 or 1:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(V3 = max(V2))
If you want to check that V2 is exactly 1.
df %>%
  group_by(ID) %>%
  mutate(V3 = as.numeric(any(V2 == 1)))
Another base R option is
# rowsum(V2, ID) sums V2 per ID; here the IDs 1, 2, 3 match the positions returned by which()
df$V3 <- with(df, +(ID %in% which(rowsum(V2, ID) > 0)))

Create a two-mode frequency matrix in R

I have a data frame, which looks something like this:
CASENO Var1 Var2 Resp1 Resp2
1 1 0 1 1
2 0 0 0 0
3 1 1 1 1
4 1 1 0 1
5 1 0 1 0
There are over 400 variables in the dataset. This is just an example. I need to create a simple frequency matrix in R (excluding the case numbers), but the table function doesn't work. Specifically, I'm looking to cross-tabulate a portion of the columns to create a two-mode matrix of frequencies. The table should look like this:
Var1 Var2
Resp1 3 1
Resp2 3 2
In Stata, the command is:
gen var = 1 if Var1==1
replace var= 2 if Var2==1
gen resp = 1 if Resp1==1
replace resp = 2 if Resp2==1
tab var resp
This one should work for any number of Var and Resp columns. Because the data are 0/1, t(m) %*% m counts, for each pair of columns, the rows where both are 1; keeping the Resp rows and Var columns of that matrix gives the two-mode frequency table:
d <- structure(list(CASENO = 1:5, Var1 = c(1L, 0L, 1L, 1L, 1L), Var2 = c(0L, 0L, 1L, 1L, 0L), Resp1 = c(1L, 0L, 1L, 0L, 1L), Resp2 = c(1L, 0L, 1L, 1L, 0L)), .Names = c("CASENO", "Var1", "Var2", "Resp1", "Resp2"), class = "data.frame", row.names = c(NA, -5L))
m <- as.matrix(d[,-1])
m2 <- t(m) %*% m
rnames <- grepl('Resp',rownames((m2)))
cnames <- grepl('Var',colnames((m2)))
m2[rnames,cnames]
[UPDATE] A more elegant version, provided in a comment by G.Grothendieck (crossprod(A, B) is equivalent to t(A) %*% B):
m <- as.matrix(d[,-1])
cn <- colnames(m);
crossprod(m[, grep("Resp", cn)], m[, grep("Var", cn)])
I'm sure there's another way, but you could do this:
library(reshape2)
library(plyr)
df1 <- melt(d[, -1], id = 1:2)
ddply(df1, .(variable), summarize,
      Var1 = sum(value == 1 & Var1 == 1),
      Var2 = sum(value == 1 & Var2 == 1))
# variable Var1 Var2
# 1 Resp1 3 1
# 2 Resp2 3 2
Here is an approach using xtabs.
# get names of non "variables"
not_vars <- c("Resp1", "Resp2", "CASENO")
# get names of "variables"
vars <- as.matrix(d[,!names(d) %in% not_vars])
# if you have many more than 2 response variables, this could get unwieldy
result <- rbind(
  xtabs(vars ~ Resp1, data = d, exclude = 0),
  xtabs(vars ~ Resp2, data = d, exclude = 0))
# give resulting table appropriate row names.
rownames(result) <- c("Resp1", "Resp2")
# Var1 Var2
#Resp1 3 1
#Resp2 3 2
sample data:
d <- read.table(text="
CASENO Var1 Var2 Resp1 Resp2
1 1 0 1 1
2 0 0 0 0
3 1 1 1 1
4 1 1 0 1
5 1 0 1 0", header=TRUE)
