find columns present in vector and replace values that equal 1 - r

I have a binary data set which looks like
a b c d
r1 1 1 0 0
r2 0 1 1 0
r3 1 0 0 1
And a vector
V <- c("a", "c")
I want a command that searches the column names and changes values in the matching columns, for example changing 1 to A. So the output would be:
a b c d
r1 A 1 0 0
r2 0 1 A 0
r3 A 0 0 1

Here is a vectorized way to do it,
df[names(df) %in% V] <- replace(df[names(df) %in% V], df[names(df) %in% V] == 1, 'A')
#or avoid calling the %in% part 3 times by assigning it, i.e.
i1 <- names(df) %in% V
df[i1] <- replace(df[i1], df[i1] == 1, 'A')
#or a simpler syntax, courtesy of @Cath,
df[, V][df[, V]==1] <- "A"
which gives,
a b c d
r1 A 1 0 0
r2 0 1 A 0
r3 A 0 0 1
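One side effect worth checking: writing the string 'A' into an integer column converts that whole column to character. A minimal sketch, rebuilding the sample data inline so it runs on its own:

```r
# Sample data from the question, rebuilt with data.frame()
df <- data.frame(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
                 c = c(0L, 1L, 0L), d = c(0L, 0L, 1L),
                 row.names = c("r1", "r2", "r3"))
V <- c("a", "c")

# Replace 1 with "A" in the selected columns only
df[, V][df[, V] == 1] <- "A"

# Columns a and c are now character; b and d are still integer
sapply(df, class)
```

So any later numeric comparison on these columns has to account for values such as "0" being strings.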

A solution with dplyr:
library(dplyr)
V <- c("a", "c")
df %>%
mutate_at(V, ~replace(.x, .x == 1, 'A'))
# a b c d
# r1 A 1 0 0
# r2 0 1 A 0
# r3 A 0 0 1
mutate_at takes a data.frame and a vector of column names and applies the specified function to each of those columns.
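In dplyr 1.0.0 and later, the scoped verbs such as mutate_at are superseded by across(). A sketch of the same replacement in that style, assuming a dplyr version that provides across() and all_of():

```r
library(dplyr)

df <- data.frame(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
                 c = c(0L, 1L, 0L), d = c(0L, 0L, 1L))
V <- c("a", "c")

# all_of() selects exactly the columns named in the character vector V
res <- df %>%
  mutate(across(all_of(V), ~ replace(.x, .x == 1, "A")))
res
```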
DATA
df <- structure(list(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
c = c(0L, 1L, 0L), d = c(0L, 0L, 1L)),
.Names = c("a", "b", "c", "d"),
class = "data.frame", row.names = c("r1", "r2", "r3"))

If the left-hand side (LHS) and the right-hand side (RHS) are of the same type, then data.table can update only the selected "cells" in place, i.e., without copying the whole column:
library(data.table)
setDT(df)
for (s in V) df[get(s) == 1L, (s) := 99L] # replacement value is of type integer
df[]
a b c d
1: 99 1 0 0
2: 0 1 99 0
3: 99 0 0 1
To verify that only selected rows in each column are updated, we can check the addresses of each column before and after the update using:
df[, lapply(.SD, address), .SDcols = V]
(In addition, the verbose mode can be switched on by options(datatable.verbose = TRUE).)
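A sketch of that address check, run on a fresh copy of the sample data. The names dt, before and after are only for illustration:

```r
library(data.table)

dt <- data.table(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
                 c = c(0L, 1L, 0L), d = c(0L, 0L, 1L))
V <- c("a", "c")

# Record the memory address of each selected column before the update
before <- sapply(V, function(s) address(dt[[s]]))

# Same-type (integer) replacement, done by reference
for (s in V) dt[get(s) == 1L, (s) := 99L]

# Record the addresses again and compare
after <- sapply(V, function(s) address(dt[[s]]))
identical(before, after)
```

If the update really happened in place, the two sets of addresses should match.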
If LHS and RHS are of different types, a type conversion is required anyway. Therefore, the whole column needs to be replaced:
df[, (V) := lapply(.SD, function(x) replace(x, x == 1L, "A")), .SDcols = V]
df
a b c d
1: A 1 0 0
2: 0 1 A 0
3: A 0 0 1
Using address() shows that each of the affected columns has been copied. But only the affected columns are copied, the other columns haven't been touched. This is different to the other answers posted so far where the whole data frame is copied.

Related

R function to convert rows where all binary values are 0 into NA, for a set of multiple columns

I have a dataset with several binary variables. I'd like to convert 0s to NA in rows that add up to 0.
How would I go about doing this?
Example:
RespondentID Popn1 Popn2 Popn3 ...... Popn20 Funding1 Funding2 Funding3
1 1 0 0 0 1 0 1
2 0 0 0 ...... 0 0 0 1
3 1 1 0 ....... 1 0 0 0
I'd like to find and convert rows with all 0s for specific columns to NA. For example, I'd like to convert row 2 (where RespondentID is 2) to NA for Popn1:Popn20, as they add up to 0 (assuming that the variables I haven't shown there are all 0 for that row as well). I'd also like to turn row 3 for columns Funding1:Funding3 to NA/blank, as they add up to 0 (i.e., all values are 0).
Is there a way to do this, and for a number of such columns? (100+). There are sets of such variables, so I'd be selective about which columns to be included.
We can do this with direct assignment:
cols = grepl("Popn", names(df1))
df1[rowSums(df1[cols]) == 0, cols] = NA
cols = grepl("Funding", names(df1))
df1[rowSums(df1[cols]) == 0, cols] = NA
df1
# RespondentID Popn1 Popn2 Popn3 Popn20 Funding1 Funding2 Funding3
# 1 1 1 0 0 0 1 0 1
# 2 2 NA NA NA NA 0 0 1
# 3 3 1 1 0 1 NA NA NA
(Using akrun's kindly provided sample data as input)
We could then turn this into a function and loop over patterns:
# converts rows that sum to 0 to NA
# within columns that match a pattern
row_0_to_na = function(data, pattern) {
cols = grepl(pattern, names(data))
data[rowSums(data[cols]) == 0, cols] = NA
return(data)
}
patterns = c("Popn", "Funding")
for(pat in patterns) {
df1 = row_0_to_na(df1, pat)
}
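Since the question mentions being selective about which columns belong to a set, a variant that takes explicit column names instead of a pattern may be handy. This is only a sketch: rows_0_to_na and col_sets are hypothetical names, and the data is a cut-down version of the sample.

```r
# Set the given columns to NA in rows where all of them are 0
rows_0_to_na <- function(data, cols) {
  zero_rows <- rowSums(data[cols] != 0, na.rm = TRUE) == 0
  data[zero_rows, cols] <- NA
  data
}

df1 <- data.frame(RespondentID = 1:3,
                  Popn1 = c(1L, 0L, 1L), Popn2 = c(0L, 0L, 1L),
                  Funding1 = c(1L, 0L, 0L), Funding2 = c(0L, 1L, 0L))

# Each list element is one set of columns that is checked together
col_sets <- list(popn = c("Popn1", "Popn2"),
                 funding = c("Funding1", "Funding2"))

for (cols in col_sets) df1 <- rows_0_to_na(df1, cols)
df1
```

Comparing with != 0 rather than summing also copes with columns that could hold negative values, which a plain rowSums(...) == 0 test would misjudge.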
We can use split.default to split the data by column-name prefix: remove the digits from the column names with gsub, use the result to split the dataset into a list of sub-datasets, loop over the list with lapply, find the rows that are all zeros (i1), set those rows to NA (x[i1,] <- NA), return the data 'x', and cbind the pieces together:
nm1 <- gsub("\\d+", "", names(df1)[-1])
lst1 <- unname(split.default(df1[-1], nm1))
out1 <- do.call(cbind, lapply(lst1, function(x) {
i1 <- !rowSums(x != 0)
x[i1,] <- NA
x}))
and assign the output back to the original dataset
df1[names(out1)] <- out1
Or create a new dataset
df2 <- cbind(df1[1], out1)
-output
df2
# RespondentID Funding1 Funding2 Funding3 Popn1 Popn2 Popn3 Popn20
#1 1 1 0 1 1 0 0 0
#2 2 0 0 1 NA NA NA NA
#3 3 NA NA NA 1 1 0 1
data
df1 <- structure(list(RespondentID = 1:3, Popn1 = c(1L, 0L, 1L), Popn2 = c(0L,
0L, 1L), Popn3 = c(0L, 0L, 0L), Popn20 = c(0L, 0L, 1L), Funding1 = c(1L,
0L, 0L), Funding2 = c(0L, 0L, 0L),
Funding3 = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))

Dispatch values in list column to separate columns

I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
vec <- vec %>% unlist()
n <- length(vec)
answr <- c(vec, rep(0, l = mmax -n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is :
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach. I rbinded the list column and then dcasted it, obtaining the desired result. Last part is to set the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the variable of names that you defined in your question.
Please note that I avoided using the column name c, as it may be confused with the function c().
Solution :
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values according to all the elements of the list
set(x = df,
i = NULL,
j = paste0("F_",idx), # column's name
value = sapply(df$c, function(x){
if(is.na(x[idx])){
return(0) # 0 instead of NA
} else {
return(x[idx])
}
})
)
}
Explanation:
We can extract the values from a list like this :
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value at that position, we get NA.
We need an iterator to extract all values at the position idx. For that, we'll find the number of values in each element of df$c (the list) and keep the maximum.
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need to create a function in the sapply to convert them :
vec <- c(NA, 5, 1, NA)
> sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
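The padding idea from the question can also be made to work directly. A minimal sketch, assuming every element of the list column is an integer vector and the longest one has more than one value. The helper pad and the intermediate padded are hypothetical names.

```r
library(data.table)

df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))

mmax <- max(lengths(df$c))                  # longest element of the list column
flux <- paste("F", seq_len(mmax), sep = "_")

# Pad an integer vector with zeros up to length n
pad <- function(vec, n) c(vec, rep(0L, n - length(vec)))

# vapply returns an mmax x nrow matrix; transpose so each row matches a row of df
padded <- t(vapply(df$c, pad, integer(mmax), n = mmax))

# Assign the padded columns by reference under the names in flux
df[, (flux) := as.data.table(padded)]
df
```

vapply is used instead of mapply so that a non-integer or over-long element fails loudly rather than silently changing the shape of the result.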

Forming a co-occurrence matrix from a data frame

I have a data frame which looks something like this:
id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a
Think of each row as a label, val, that was given to some observation with an id.
What I ultimately want to get to is a "co-occurrence" matrix that looks something like this, where I get a count of how many times each letter appears within the same id with each other letter:
a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
I've been wracking my brain looking for ways to do this, but have come up empty so far. Any hints? Preferably using tidyverse tools, but open to other options as well at this point.
EDIT: the solutions to the question linked as a possible duplicate do not work in this case. I'm not sure why, but I suspect it has to do with that question having a data frame with 3 columns.
Here's a solution in base R. Not quite elegant but seems to work
temp = data.frame(do.call(cbind, lapply(split(df, df$id), function(a)
combn(a$val, 2))), stringsAsFactors = FALSE)
sapply(sort(unique(df$val)), function(rows)
sapply(sort(unique(df$val)), function(cols)
sum(sapply(temp, function(x)
identical(sort(x), sort(c(rows, cols)))))))
# a b c d
#a 1 1 1 1
#b 1 0 0 0
#c 1 0 0 1
#d 1 0 1 0
OR with igraph
temp = t(do.call(cbind, lapply(split(df, df$id), function(a) combn(a$val, 2))))
library(igraph)
as.matrix(get.adjacency(graph(temp, directed = FALSE)))
# a c b d
#a 1 1 1 1
#c 1 0 0 1
#b 1 0 0 0
#d 1 1 0 0
DATA
df = structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
val = c("a", "b", "a", "c", "d", "a", "a")),
.Names = c("id", "val"),
class = "data.frame",
row.names = c(NA, -7L))
A solution with dplyr + purrr:
library(dplyr)
library(purrr)
df %>%
split(.$id) %>%
map_dfr(function(x){
t(combn(x$val, 2)) %>%
data.frame(stringsAsFactors = FALSE)
}) %>%
mutate_all(funs(factor(., levels = c("a", "b", "c", "d")))) %>%
table() %>%
pmax(., t(.))
Result:
X2
X1 a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
Notes:
I first split the df by id, then used map_dfr from purrr to map the combn function to each id group.
combn finds all combinations of elements within a vector (length(vec) choose 2) and returns a matrix.
_dfr at the end of map_dfr means that the result will be a dataframe by row binding each element of the list. So this is effectively do.call(rbind, lapply()).
mutate_all makes sure that table retains all the levels needed even if a letter does not exist in a column.
Finally, since after the table step, an upper triangular matrix is produced, I fed that matrix and its transpose into pmax
pmax finds the parallel maxima from the two inputs and returns a symmetric matrix as desired.
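For comparison, here is a different technique that is not used in the answers above: build the id-by-val count table and take its cross product. This is a sketch under the assumption that the diagonal should count pairs of the same letter within an id, which is how combn treats the duplicated "a" in id 3.

```r
df <- data.frame(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
                 val = c("a", "b", "a", "c", "d", "a", "a"),
                 stringsAsFactors = FALSE)

# Counts of each letter per id, as a plain integer matrix
m <- unclass(table(df$id, df$val))

# Off-diagonal entries: number of (x, y) pairs that share an id
co <- crossprod(m)

# Diagonal entries: number of same-letter pairs within an id, i.e. choose(n, 2)
diag(co) <- colSums(m * (m - 1) / 2)
co
```

On the sample data this gives the same counts as the matrix in the question, and it avoids enumerating pairs with combn, which may matter when ids have many labels.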
Data:
df = read.table(text= "id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a", header = TRUE, stringsAsFactors = FALSE)

Loop with conditions in R programming

I would like to check whether the previous row's value is the same as the current one (for more than one variable, and also against a list of values). How do I write code for this? I read that the 'apply' functions can be used.
I searched this topic here before posting and found somewhat similar questions, but not an exact match. I'm quite new to R.
Here is my sample table: (Flag needs to be done based on conditions)
Ticket No V1 V2 Flag
Tkt10256 1 X 0
Tkt10257 1 aa 0
Tkt10257 2 bb 1
Tkt10257 3 x 0
Tkt10260 1 cc 0
Tkt10260 2 aa 1
Tkt10262 3 bb 0
I have to Flag based on the below conditions (if all the conditions are satisfied then mark as 1)
Variable 2 should be one of the following 4 names (aa, bb, cc, dd)
Variable 1 should be different from the previous row
Ticket number has to be the same as previous row
Thanks in advance for the help !
An approach without looping:
indx1 <- with(df, V2 %in% paste0(letters[1:4], letters[1:4]) )
indx2 <- with(df, c(TRUE,V1[-1]!=V1[-length(V1)]))
indx3 <- with(df, c(FALSE,Ticket.No[-1]==Ticket.No[-nrow(df)]))
df$Flag <- (indx1 & indx2 & indx3)+0
df$Flag
#[1] 0 0 1 0 0 1 0
data
df <- structure(list(Ticket.No = c("Tkt10256", "Tkt10257", "Tkt10257",
"Tkt10257", "Tkt10260", "Tkt10260", "Tkt10262"), V1 = c(1L, 1L,
2L, 3L, 1L, 2L, 3L), V2 = c("X", "aa", "bb", "x", "cc", "aa",
"bb"), Flag = c(0L, 0L, 1L, 1L, 0L, 1L, 0L)), .Names = c("Ticket.No",
"V1", "V2", "Flag"), class = "data.frame", row.names = c(NA,
-7L))
One more:
Check this on your larger data. I'm not exactly sure if duplicated is the right function to use there. If the numbers in the TicketNo column are increasing (i.e. the Xs in TktXXXXX), then it should work fine.
> dat2 <- dat[dat$V2 %in% c("aa", "bb", "cc", "dd"),]
> rn <- rownames(dat2)[duplicated(dat2[[1]]) & !c(FALSE, diff(dat2[[2]]) == 0)]
> dat$Flag <- (rownames(dat) %in% rn)+0
> dat
# TicketNo V1 V2 Flag
# 1 Tkt10256 1 X 0
# 2 Tkt10257 1 aa 0
# 3 Tkt10257 2 bb 1
# 4 Tkt10257 3 x 0
# 5 Tkt10260 1 cc 0
# 6 Tkt10260 2 aa 1
# 7 Tkt10262 3 bb 0
A variation on @akrun's answer:
with(df,
V2 %in% c("aa","bb","cc","dd") &
c(FALSE,diff(V1) != 0) &
c(FALSE,head(Ticket.No, -1)) == Ticket.No
) + 0
#[1] 0 0 1 0 0 1 0
Try:
for(i in 2:nrow(ddf)){
ddf$Flag[i] = ifelse( ddf$V2[i] %in% c('aa', 'bb', 'cc', 'dd')
&& ddf$V1[i] != ddf$V1[(i-1)]
&& ddf$TicketNo[i] == ddf$TicketNo[(i-1)]
,1,0)
}
ddf
TicketNo V1 V2 Flag
1 Tkt10256 1 X 0
2 Tkt10257 1 aa 0
3 Tkt10257 2 bb 1
4 Tkt10257 3 x 0
5 Tkt10260 1 cc 0
6 Tkt10260 2 aa 1
7 Tkt10262 3 bb 0
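All of the answers above compare each row with a one-row lag of the same column. A base R sketch that makes the lag explicit, where prev and allowed are hypothetical names and the first row, which has no predecessor, is never flagged:

```r
df <- data.frame(
  Ticket.No = c("Tkt10256", "Tkt10257", "Tkt10257", "Tkt10257",
                "Tkt10260", "Tkt10260", "Tkt10262"),
  V1 = c(1L, 1L, 2L, 3L, 1L, 2L, 3L),
  V2 = c("X", "aa", "bb", "x", "cc", "aa", "bb"),
  stringsAsFactors = FALSE
)

allowed <- c("aa", "bb", "cc", "dd")

# Shift a vector down by one row; the first element becomes NA
prev <- function(x) c(NA, head(x, -1))

cond <- df$V2 %in% allowed &
  df$V1 != prev(df$V1) &
  df$Ticket.No == prev(df$Ticket.No)

# Treat NA (possible in the first row) as "not flagged"
df$Flag <- as.integer(!is.na(cond) & cond)
df$Flag
```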

rbindfill like merge of list of vectors

I have a list of named vectors (see below and at end for dput version) that I would like to "merge" together to make a matrix, filling in zeros if a vector doesn't contain a name (a character in this case). This doesn't seem that hard, but I haven't found a working base solution to the problem. I thought about using match, but that seems very time-consuming when I'm sure there's a fancy way to use do.call and rbind together.
List of Named Vectors:
$greg
e i k l
1 2 1 1
$sam
! c e i t
1 1 1 2 1
$teacher
? c i k l
1 1 1 1 1
Final Desired Output
! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
Likely this is the output people will give and filling NAs with 0 is easy
! ? c e i k l t
greg NA NA NA 1 2 1 1 NA
sam 1 NA 1 1 2 NA NA 1
teacher NA 1 1 NA 1 1 1 NA
Sample Data
L2 <- structure(list(greg = structure(c(1L, 2L, 1L, 1L), .Dim = 4L, .Dimnames = structure(list(
c("e", "i", "k", "l")), .Names = ""), class = "table"), sam = structure(c(1L,
1L, 1L, 2L, 1L), .Dim = 5L, .Dimnames = structure(list(c("!",
"c", "e", "i", "t")), .Names = ""), class = "table"), teacher = structure(c(1L,
1L, 1L, 1L, 1L), .Dim = 5L, .Dimnames = structure(list(c("?",
"c", "i", "k", "l")), .Names = ""), class = "table")), .Names = c("greg",
"sam", "teacher"))
Here's a fairly straightforward base solution:
# first determine all possible column names
cols <- sort(unique(unlist(lapply(L2,names), use.names=FALSE)))
# initialize the output
out <- matrix(0, length(L2), length(cols), dimnames=list(names(L2),cols))
# loop over list and fill in the matrix
for(i in seq_along(L2)) {
out[names(L2)[i], names(L2[[i]])] <- L2[[i]]
}
UPDATE with benchmarks:
f1 <- function(L2) {
cols <- sort(unique(unlist(lapply(L2,names), use.names=FALSE)))
out <- matrix(0, length(L2), length(cols), dimnames=list(names(L2),cols))
for(i in seq_along(L2)) out[names(L2)[i], names(L2[[i]])] <- L2[[i]]
out
}
f2 <- function(L2) {
L.names <- sort(unique(unlist(sapply(L2, names))))
L3 <- t(sapply(L2, function(x) x[L.names]))
colnames(L3) <- L.names
L3[is.na(L3)] <- 0
L3
}
f3 <- function(L2) {
m <- do.call(rbind, lapply(L2, as.data.frame))
m$row <- sub("[.].*", "", rownames(m))
m$Var1 <- factor(as.character(m$Var1))
xtabs(Freq ~ row + Var1, m)
}
library(rbenchmark)
benchmark(f1(L2), f2(L2), f3(L2), order="relative")[,1:5]
# test replications elapsed relative user.self
# 1 f1(L2) 100 0.022 1.000 0.020
# 2 f2(L2) 100 0.051 2.318 0.052
# 3 f3(L2) 100 0.788 35.818 0.760
set.seed(21)
L <- replicate(676, {n=sample(10,1); l=sample(26,n);
setNames(sample(6,n,TRUE), letters[l])}, simplify=FALSE)
names(L) <- levels(interaction(letters,LETTERS))
benchmark(f1(L), f2(L), order="relative")[,1:5]
# test replications elapsed relative user.self
# 1 f1(L) 100 1.84 1.000 1.828
# 2 f2(L) 100 4.24 2.304 4.220
I think something like this:
names <- sort(unique(unlist(lapply(L2, names), use.names=FALSE)))
L3 <- t(vapply(L2, function(x) x[names], FUN.VALUE=numeric(length(names))))
colnames(L3) <- names
L3[is.na(L3)] <- 0
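This relies on how R indexes a named vector by names it does not contain: the missing entries come back as NA, with NA names as well, which is why the column names are reset afterwards. A small illustration with made-up values:

```r
x <- c(e = 1, i = 2)

# "c" is not a name in x, so that position is NA and its name is NA too
res <- x[c("c", "e", "i")]
res

is.na(res)         # TRUE only for the missing name
is.na(names(res))  # the name itself is also NA
```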
reshape2 Solution. This can be readily done with the reshape2 package by melting the list into long form and then using dcast to reshape it back into wide form:
> library(reshape2)
> m <- melt(L2)
> m$Var.1 <- factor(as.character(m$Var.1)) # optional - if columns should be sorted
> dcast(m, L1 ~ Var.1, fill = 0)
L1 ! ? c e i k l t
1 greg 0 0 0 1 2 1 1 0
2 sam 1 0 1 1 2 0 0 1
3 teacher 0 1 1 0 1 1 1 0
Base Solution. And here is a corresponding base solution where the first two lines perform the melt, the next line ensures the columns will be sorted and the last line reshapes from long to wide:
> m <- do.call(rbind, lapply(L2, as.data.frame))
> m$row <- sub("[.].*", "", rownames(m))
> m$Var1 <- factor(as.character(m$Var1))
> xtabs(Freq ~ row + Var1, m)
Var1
row ! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
EDIT: Added a base solution and modified the sort line.
While typing this I thought of this solution but wonder if there's a more efficient one:
chars <- sort(unique(unlist(lapply(L2, names))))
L3 <- lapply(L2, function(x){
nots <- chars[!chars %in% names(x)]
new <- rev(c(x, rep(0, length(nots))))
names(new)[1:length(nots)] <- nots
new[order(names(new))]
})
do.call(rbind, L3)
Yielding:
! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0

Resources