I have some variables that take values between 1 and 5. I would like to recode them as 0 if they take a value between 1 and 3 (inclusive) and as 1 if they take the value 4 or 5.
My dataset looks like this:
var1 var2 var3
1 1 NA
4 3 4
3 4 5
2 5 3
So I would like it to be like this:
var1 var2 var3
0 0 NA
1 0 1
0 1 1
0 1 0
I tried to write a function and call it:
making_binary <- function(var) {
  var <- factor(var >= 4, labels = c(0, 1))
  return(var)
}
df <- lapply(df, making_binary)
But I got an error: incorrect labels: length 2 must be 1 or 1.
Where did I go wrong?
Thank you very much for your answers!
You can use:
df[] <- +(df == 4 | df == 5)
df
# var1 var2 var3
#1 0 0 NA
#2 1 0 1
#3 0 1 1
#4 0 1 0
The comparison df == 4 | df == 5 returns logical values (TRUE/FALSE); the unary + then turns those logicals into integers (1/0), leaving NA as NA.
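A quick illustration on a plain vector:
+(c(1, 4, NA) == 4 | c(1, 4, NA) == 5)
#[1]  0  1 NA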
If you want to apply this for selected columns you can subset the columns by position or by name.
cols <- 1:3 #Position
#cols <- grep('var', names(df)) #Name
df[cols] <- +(df[cols] == 4 | df[cols] == 5)
As far as your function is concerned, factor(var >= 4, labels = c(0, 1)) fails whenever a column yields only a single level (for example, when all of a column's values fall below 4), because factor() requires the number of labels to match the number of levels; it would also give you a factor rather than a number. You can instead do:
making_binary <- function(var) {
  var <- as.integer(var >= 4)
  # which is a faster version of
  # var <- ifelse(var >= 4, 1, 0)
  return(var)
}
df[] <- lapply(df, making_binary)
Note the df[] on the left-hand side: it replaces the columns in place so df remains a data.frame, whereas df <- lapply(...), as in your attempt, turns it into a plain list.
data
df <- structure(list(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L,
5L), var3 = c(NA, 4L, 5L, 3L)), class = "data.frame", row.names = c(NA, -4L))
I think ifelse would fit the problem well:
df[] <- lapply(df, function(x) ifelse(x >=1 & x <=3, 0, x))
df
var1 var2 var3
1 0 0 NA
2 4 0 4
3 0 4 5
4 0 5 0
df[] <- lapply(df, function(x) ifelse(x >=4 & x <=5, 1, x))
df
var1 var2 var3
1 0 0 NA
2 1 0 1
3 0 1 1
4 0 1 0
If you need to do the two steps at once, you can look at dplyr::case_when() or data.table::fcase().
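For instance, a minimal case_when() sketch for doing both recodes in one pass (assuming the same df as above):
library(dplyr)
df[] <- lapply(df, function(x) case_when(x >= 1 & x <= 3 ~ 0,
                                         x >= 4 & x <= 5 ~ 1))
Values matched by neither condition (including NA) come back as NA.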
You can simply test whether the value is greater than 3, which returns TRUE/FALSE, and cast this to a number:
+(x>3)
# var1 var2 var3
#[1,] 0 0 NA
#[2,] 1 0 1
#[3,] 0 1 1
#[4,] 0 1 0
In case you want this only for some columns, you have to subset them, e.g. for columns 1 and 2:
+(x[1:2]>3)
#+(x[c("var1","var2")]>3) #Alternative
# var1 var2
#[1,] 0 0
#[2,] 1 0
#[3,] 0 1
#[4,] 0 1
Data:
x <- data.frame(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L, 5L)
, var3 = c(NA, 4L, 5L, 3L))
I have two data frames and want to merge them, adding the values of one data frame to the existing values of the other for the columns they share.
For example:
df1
No A B C D
1  1 0 1 0
2  0 1 2 1
3  0 0 1 0
df2
No A B E F
1  1 0 1 1
2  0 1 2 1
3  2 1 1 0
Finally, I want the output to look like this:
df
No A B C D E F
1  2 0 1 0 1 1
2  0 2 2 1 2 1
3  2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
  left_join(df2, by = "No") %>%
  # add each ".x" column (from df1) to its ".y" counterpart (from df2)
  mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
  # drop the ".x" suffix from the summed columns
  rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
  # remove the now-redundant ".y" columns
  select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(No) %>%
  summarise(across(c(A, B, C, D, E, `F`), sum, na.rm = TRUE),
            .groups = "drop")
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
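For reference, here is the intermediate result on the sample data, showing the NA padding before summing:
bind_rows(df1, df2)
#  No A B  C  D  E  F
#1  1 1 0  1  0 NA NA
#2  2 0 1  2  1 NA NA
#3  3 0 0  1  0 NA NA
#4  1 1 0 NA NA  1  1
#5  2 0 1 NA NA  2  1
#6  3 2 1 NA NA  1  0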
Using data.table:
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (with R 4.1.0). Get the values of the objects in a list ('lst1'), then find the union of the column names ('nm1'). Loop over the list, using setdiff to assign zero-valued columns for any names missing from a list element, rbind the elements, and use aggregate to get the sum grouped by 'No':
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0
I am looking to subtract the same row from multiple other rows within a dataframe.
For example:
Group A B C
A 3 1 2
B 4 0 3
C 4 1 1
D 2 1 2
This is what I want it to look like:
Group A B C
B 1 -1 1
C 1 0 -1
D -1 0 0
So in other words:
Row B - Row A
Row C - Row A
Row D - Row A
Thank you!
Here's a dplyr solution:
library(dplyr)
df %>%
  mutate(across(A:C, ~ . - .[1])) %>%
  filter(Group != "A")
This gives us:
  Group  A  B  C
1     B  1 -1  1
2     C  1  0 -1
3     D -1  0  0
Here's an approach with base R:
data[-1] <- do.call(rbind,
                    apply(data[-1], 1, function(x) x - data[1, -1]))
data[-1,]
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
Data:
data <- structure(list(Group = c("A", "B", "C", "D"), A = c(3L, 4L, 4L,
2L), B = c(1L, 0L, 1L, 1L), C = c(2L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
We could also replicate the first row and subtract it from the rest:
cbind(data[-1, 1, drop = FALSE], data[-1, -1] - data[1, -1][col(data[-1, -1])])
Output:
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
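The same replication idea can also be written with base R's sweep(), which subtracts a vector column-by-column from each row; a minimal sketch on the same data:
cbind(data[-1, 1, drop = FALSE],
      sweep(data[-1, -1], 2, unlist(data[1, -1])))
#  Group  A  B  C
#2     B  1 -1  1
#3     C  1  0 -1
#4     D -1  0  0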
I want to loop through a large dataframe, counting how many values in the first column are >0 and removing the rows that were counted, then moving on to column 2, counting how many of its values are >0 and removing those rows, and so on.
The data frame
taxonomy A B C
1 cat 0 2 0
2 dog 5 1 0
3 horse 3 0 0
4 mouse 0 0 4
5 frog 0 2 4
6 lion 0 0 2
can be generated with
DF1 = structure(list(taxonomy = c("cat", "dog", "horse", "mouse", "frog", "lion"),
A = c(0L, 5L, 3L, 0L, 0L, 0L), B = c(2L, 1L, 0L, 0L, 2L, 0L), C = c(0L, 0L, 0L, 4L, 4L, 2L)),
.Names = c("taxonomy", "A", "B", "C"),
row.names = c(NA, -6L), class = "data.frame")
and I expect the outcome to be
A B C
count 2 2 2
I wrote this loop, but it does not remove the rows as it goes:
res <- data.frame(DF1[1,], row.names = c('count'))
for(n in 1:ncol(DF1)) {
  res[colnames(DF1)[n]] <- sum(DF1[n])
  DF1[!DF1[n]==1]
}
It gives this incorrect result:
A B C
count 2 3 3
You could count each row in the first column where it is nonzero, which is exactly what counting and then removing amounts to:
DF = DF1[, -1]
cond = DF != 0
p = max.col(cond, ties="first")
fp = factor(p, levels = seq_along(DF), labels = names(DF))
table(fp)
# A B C
# 2 2 2
To account for rows that are all zeros, I think this works:
fp[rowSums(cond) == 0] <- NA
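The fix is needed because max.col() always returns some index, even when a row is all zeros; for example:
max.col(rbind(c(0, 0, 0)), ties = "first")
#[1] 1
so an all-zero row would otherwise be counted under column A.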
We can update the dataset in each run. Create a temporary dataset without the 'taxonomy' column ('tmp') and initialize a named vector ('n'). Then loop through the columns of 'tmp': get a logical index of which values are greater than 0 ('i1'), store the sum of the TRUE values in the corresponding element of 'n', and update 'tmp' by removing those rows, using 'i1' as the row index:
tmp <- DF1[-1]
n <- setNames(numeric(ncol(tmp)), names(tmp))
for(i in seq_len(ncol(tmp))) {
  i1 <- tmp[[i]] > 0
  n[i] <- sum(i1)
  tmp <- tmp[!i1, ]
}
n
# A B C
# 2 2 2
It can also be done with Reduce
sapply(Reduce(function(x, y) y[!x] > 0, DF1[3:4],
init = DF1[,2] > 0, accumulate = TRUE ), sum)
#[1] 2 2 2
Or using accumulate from purrr
library(purrr)
accumulate(DF1[3:4], ~ .y[!.x] > 0, .init = DF1[[2]] > 0) %>%
  map_int(sum)
#[1] 2 2 2
This is easy with Reduce and sapply:
> first <- Reduce(function(a,b) b[a==0], DF1[-1], accumulate=TRUE)
> first
[[1]]
[1] 0 5 3 0 0 0
[[2]]
[1] 2 0 2 0
[[3]]
[1] 0 4 2
> then <- sapply(setNames(first, names(DF1[-1])), function(x) length(x[x>0]))
> then
A B C
2 2 2
I have a data frame of arbitrary but non-trivial size. Each entry has one of three distinct values 0, 1, or 2 randomly distributed. For example:
col.1 col.2 col.3 col.4 ...
0 0 1 0 ...
0 2 2 1 ...
2 2 2 2 ...
0 0 0 0 ...
0 1 1 1 ...
... ... ... ... ...
My goal is to remove any row that only contains one unique element or, equivalently, to select only those rows with at least two distinct elements. Originally I selected those rows where the row mean was not a whole number, but I realized that could eliminate rows containing equal amounts of 0 and 2, which I want to keep.
My current thought process is to use unique on each row of the data frame, followed by length, to determine how many unique elements each row contains, but I can't seem to get the syntax right. I'm looking for something like this:
DataFrame[length(unique(DataFrame)) != 1, ]
Try any of these:
nuniq <- function(x) length(unique(x))
subset(dd, apply(dd, 1, nuniq) >= 2)
subset(dd, apply(dd, 1, sd) > 0)
subset(dd, apply(dd[-1] != dd[[1]], 1, any))
subset(dd, rowSums(dd[-1] != dd[[1]]) > 0)
subset(dd, lengths(lapply(as.data.frame(t(dd)), unique)) >= 2)
subset(dd, lengths(apply(dd, 1, table)) >= 2)
# nuniq is from above
subset(dd, tapply(as.matrix(dd), row(dd), nuniq) >= 2)
giving:
col.1 col.2 col.3 col.4
1 0 0 1 0
2 0 2 2 1
5 0 1 1 1
Alternatives to nuniq
In the above nuniq could be replaced with any of these:
function(x) nlevels(factor(x))
function(x) sum(!duplicated(x))
function(x) length(table(x))
dplyr::n_distinct
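Each of these can replace nuniq in the subset() calls above, for example:
subset(dd, apply(dd, 1, dplyr::n_distinct) >= 2)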
Note
dd in reproducible form is:
dd <- structure(list(col.1 = c(0L, 0L, 2L, 0L, 0L), col.2 = c(0L, 2L,
2L, 0L, 1L), col.3 = c(1L, 2L, 2L, 0L, 1L), col.4 = c(0L, 1L,
2L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
What about something like this:
# some fake data
df <- data.frame(col1 = c(2, 2, 1, 1),
                 col2 = c(1, 0, 2, 0),
                 col3 = c(0, 0, 0, 0))
col1 col2 col3
1 2 1 0
2 2 0 0
3 1 2 0
4 1 0 0
# first we can convert 0 to NA
df[df == 0] <- NA
# a function that counts the unique values, not counting NA
fun <- function(x){
  res <- unique(x[!is.na(x)])
  length(res)
}
# apply it row-wise: since NAs are not counted, 2 works as the threshold
df <- df[apply(df, 1, fun) >= 2, ]
# convert the NAs back to 0, as in the original
df[is.na(df)] <- 0
df
col1 col2 col3
1 2 1 0
3 1 2 0
I have a data frame like this:
CriterionVar Var1 Var2 Var3
3 0 0 0
1 0 0 0
2 0 0 0
5 0 0 0
I want to recode the values of Var1, Var2, and Var3 based on the value of CriterionVar. In pseudocode, it would be something like this:
for each row
if (CriterionVar.value >= Var1.index) Var1 = 1
if (CriterionVar.value >= Var2.index) Var2 = 1
if (CriterionVar.value >= Var3.index) Var3 = 1
The recoded data frame would look like this:
CriterionVar Var1 Var2 Var3
3 1 1 1
1 1 0 0
2 1 1 0
5 1 1 1
Obviously, that is not the way to get it done because (1) the number of VarN columns is determined by a data value, and (2) it's just ugly.
Any help is appreciated.
For more general values of CriterionVar, you can use outer to construct a logical matrix which you can use for indexing like this:
dat[2:4][outer(dat$CriterionVar, seq_along(names(dat)[-1]), ">=")] <- 1
In this example, this returns
dat
CriterionVar Var1 Var2 Var3
1 3 1 1 1
2 1 1 0 0
3 2 1 1 0
4 5 1 1 1
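To see what the indexing matrix looks like, the outer() call compares every CriterionVar value with every column position:
outer(dat$CriterionVar, seq_along(names(dat)[-1]), ">=")
#      [,1]  [,2]  [,3]
#[1,]  TRUE  TRUE  TRUE
#[2,]  TRUE FALSE FALSE
#[3,]  TRUE  TRUE FALSE
#[4,]  TRUE  TRUE  TRUE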
A second method using col, which returns a matrix of column indices, is a bit more direct:
dat[2:4][dat$CriterionVar >= col(dat[-1])] <- 1
and returns the desired result.
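Here col(dat[-1]) simply gives each cell's column number, which plays the role of the VarN index:
col(dat[-1])
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    1    2    3
#[3,]    1    2    3
#[4,]    1    2    3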
data
dat <-
structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(0L,
0L, 0L, 0L), Var2 = c(0L, 0L, 0L, 0L), Var3 = c(0L, 0L, 0L, 0L
)), .Names = c("CriterionVar", "Var1", "Var2", "Var3"), class = "data.frame",
row.names = c(NA, -4L))
df[,-1] = lapply(2:NCOL(df), function(i) as.numeric(df[,1] >= (i-1)))
df
# CriterionVar Var1 Var2 Var3
#1 3 1 1 1
#2 1 1 0 0
#3 2 1 1 0
#4 5 1 1 1
DATA
df = structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(1,
1, 1, 1), Var2 = c(1, 0, 1, 1), Var3 = c(1, 0, 0, 1)), .Names = c("CriterionVar",
"Var1", "Var2", "Var3"), row.names = c(NA, -4L), class = "data.frame")
I'm a big proponent of vapply: it's fast, and you know the shape of what it'll return. The only problem is the resulting matrix is usually the "sideways" version of what you want. But t() fixes that easily enough.
n_var_cols <- 3
truncated_criterion <- pmin(dat[["CriterionVar"]], n_var_cols)
row_template <- rep_len(0, n_var_cols)
replace_up_to_index <- function(index) {
  replace(row_template, seq_len(index), 1)
}
over_matrix <- vapply(
  X = truncated_criterion,
  FUN = replace_up_to_index,
  FUN.VALUE = row_template
)
over_matrix <- t(over_matrix)
dat[, -1] <- over_matrix
dat
# CriterionVar Var1 Var2 Var3
# 1 3 1 1 1
# 2 1 1 0 0
# 3 2 1 1 0
# 4 5 1 1 1
There was some bookkeeping in the first three lines, but nothing too bad. I used pmin() to cap the criterion values at the number of VarN columns.