Replace a sequence in data frame column - r

I have a data frame in R that looks somewhat like this:
A | B
0 0
1 0
0 0
0 0
0 1
0 1
1 0
1 0
1 0
I now want to replace all sequences of more than one "1" in the columns so that only the first "1" is kept and the others are replaced by "0", so that the result looks like this
A | B
0 0
1 0
0 0
0 0
0 1
0 0
1 0
0 0
0 0
I hope you understood what I meant (English is not my mother tongue and especially the R-"vocabulary" is a bit hard for, which is probably why I couldn't find a solution through googling). Thank you in advance!

Try this solution:
Input data
df<-data.frame(
A=c(1,0,0,0,0,0,1,1,1,0),
B=c(1,1,0,1,0,0,1,1,0,0))
f<-function(X)
{
return(as.numeric((diff(c(0,X)))>0))
}
Your output
data.frame(lapply(df,f))
A B
1 1 1
2 0 0
3 0 0
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 0 0
10 0 0

You can use ave and create groups based on the difference of your values to capture the consecutives 1s and 0s as different groups and replace duplicates with 0, i.e.
df[] <- lapply(df, function(i)ave(i, cumsum(c(1, diff(i) != 0)),
FUN = function(i) replace(i, duplicated(i), 0)))
which gives,
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0

Here's a simple one line answer:
> df * rbind(c(0,0), sapply(df, diff))
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
This takes advantage of the fact that all unwanted 1's in the original data will become 0's with the diff function.

Here is an option with rleid
library(data.table)
df1[] <- lapply(df1, function(x) +(x==1& !ave(x, rleid(x), FUN = duplicated)))
df1
# A B
#1 0 0
#2 1 0
#3 0 0
#4 0 0
#5 0 1
#6 0 0
#7 1 0
#8 0 0
#9 0 0
<

Here's a more functional approach. Though, I find shorter answers here, but it's good to know the possible implementation under the hood:
# helper function
make_zero <- function(val)
{
get_index <- c()
for(i in seq(val))
{
if(val[i] == 1) get_index <- c(get_index, i)
else if (val[i] != 1) get_index <- c()
if(all(diff(get_index)) == 1)
{
val[get_index[-1]] <- 0
}
}
# set values as 0
return (val)
}
df <- sapply(df, make_zero)
head(df)
A B
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 0
[5,] 0 1
[6,] 0 0
[7,] 1 0
[8,] 0 0
[9,] 0 0
Explanation:
1. We save the indexes of consecutive 1s in get_index.
2. Next, we check if the difference between indexes is 1.
3. If found, we update the value in the column.

Related

Create a new column based on several conditions

I want to create a new column based on some conditions imposed on several columns. For example, here is an example dataset:
a <- data.frame(x=c(1,0,1,0,0), y=c(0,0,0,0,0), z=c(1,1,0,0,0))
a
x y z
1 1 0 1
2 0 0 1
3 1 0 0
4 0 0 0
5 0 0 0
Specifically, if for any particular row 1 is present, then the new column returns 1. If all are 0, then the new column returns 0. So the dataset with the new column will be
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
My initial thought was to use %in% but couldn't get the result I want. Thank you for your help!
If your data frame consists of binary values, e.g., only 0 and 1, you can try the code below with rowSums
a$w <- +(rowSums(a)>0)
such that
> a
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
We can use rowMaxs from matrixStats
library(matrixStats)
a$w <- rowMaxs(as.matrix(a))
a$w
#[1] 1 1 1 0 0
You can find max of each row :
a$w <- do.call(pmax, a)
a
# x y z w
#1 1 0 1 1
#2 0 0 1 1
#3 1 0 0 1
#4 0 0 0 0
#5 0 0 0 0
which can also be done with apply :
a$w <- apply(a, 1, max)

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive but I can't figure a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify the columns that have a sum of 2 or more:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solve the problem, but I'm not sure how would I do it without using a lot of logical operators (I don't know what columns have matches or not). What I would like to do is to aggregate the rows that have more than (or equal to) two 1's into a new column, to be able to have a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = T, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0

R: Generating sparse matrix with all elements as rows and columns

I have a data set with user to user. It doesn't have all users as col and row. For example,
U1 U2 T
1 3 1
1 6 1
2 4 1
3 5 1
u1 and u2 represent users of the dataset. When I create a sparse matrix using following code, (df- keep all data of above dataset as a dataframe)
trustmatrix <- xtabs(T~U1+U2,df,sparse = TRUE)
3 4 5 6
1 1 0 0 1
2 0 1 0 0
3 0 0 1 0
Because this matrix doesn't have all the users in row and columns as below.
1 2 3 4 5 6
1 0 0 1 0 0 1
2 0 0 0 1 0 0
3 0 0 0 0 1 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
If I want to get above matrix after sparse matrix, How can I do so in R?
We can convert the columns to factor with levels as 1 through 6 and then use xtabs
df1[1:2] <- lapply(df1[1:2], factor, levels = 1:6)
as.matrix(xtabs(T~U1+U2,df1,sparse = TRUE))
# U2
#U1 1 2 3 4 5 6
# 1 0 0 1 0 0 1
# 2 0 0 0 1 0 0
# 3 0 0 0 0 1 0
# 4 0 0 0 0 0 0
# 5 0 0 0 0 0 0
# 6 0 0 0 0 0 0
Or another option is to get the expanded index filled with 0s and then use sparseMatrix
library(tidyverse)
library(Matrix)
df2 <- crossing(U1 = 1:6, U2 = 1:6) %>%
left_join(df1) %>%
mutate(T = replace(T, is.na(T), 0))
sparseMatrix(i = df2$U1, j = df2$U2, x = df2$T)
Or use spread
spread(df2, U2, T)

Replicate rows by value in column, change values to 1 or 0, in R

I have data structured as:
A B C D
3 2 1 1
I want it restructured as
A B C D
1 0 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
0 0 1 0
0 0 0 1
Any thoughts on how to do this in R? Many thanks.
If the input is a data.frame, you could do the following:
coln <- seq_along(df)
m = do.call(rbind, lapply(coln, function(i) {t(replicate(df[1,i], coln == i))})) +0
This will result in a matrix like this:
# [,1] [,2] [,3] [,4]
#[1,] 1 0 0 0
#[2,] 1 0 0 0
#[3,] 1 0 0 0
#[4,] 0 1 0 0
#[5,] 0 1 0 0
#[6,] 0 0 1 0
#[7,] 0 0 0 1
You can then convert it to a data.frame or set column names if you like.
Here is an option using dcast
library(data.table)
nm1 <- rep(names(df1), unlist(df1))
dcast(data.table(nm1, v1 = seq_along(nm1)), v1 ~ nm1, length)[, v1 := NULL][]
# A B C D
#1: 1 0 0 0
#2: 1 0 0 0
#3: 1 0 0 0
#4: 0 1 0 0
#5: 0 1 0 0
#6: 0 0 1 0
#7: 0 0 0 1
Or after creating the 'nm1', use model.matrix from base R
model.matrix(~-1 + nm1)
or in a single line
model.matrix(~ -1 + rep(names(df1), unlist(df1)))
and change the column names
data
df1 <- data.frame(A = 3, B = 2, C = 1, D = 1)

How can I create this special sequence?

I would like to create the following vector sequence.
0 1 0 0 2 0 0 0 3 0 0 0 0 4
My thought was to create 0 first with rep() but not sure how to add the 1:4.
Create a diagonal matrix, take the upper triangle, and remove the first element:
d <- diag(0:4)
d[upper.tri(d, TRUE)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
If you prefer a one-liner that makes no global assignments, wrap it up in a function:
(function() { d <- diag(0:4); d[upper.tri(d, TRUE)][-1L] })()
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
And for code golf purposes, here's another variation using d from above:
d[!lower.tri(d)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
rep and rbind up to their old tricks:
rep(rbind(0,1:4),rbind(1:4,1))
#[1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
This essentially creates 2 matrices, one for the value, and one for how many times the value is repeated. rep does not care if an input is a matrix, as it will just flatten it back to a vector going down each column in order.
rbind(0,1:4)
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 1 2 3 4
rbind(1:4,1)
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 1 1 1 1
You can use rep() to create a sequence that has n + 1 of each value:
n <- 4
myseq <- rep(seq_len(n), seq_len(n) + 1)
# [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 4
Then you can use diff() to find the elements you want. You need to append a 1 to the end of the diff() output, since you always want the last value.
c(diff(myseq), 1)
# [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
Then you just need to multiply the original sequence with the diff() output.
myseq <- myseq * c(diff(myseq), 1)
myseq
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
unlist(lapply(1:4, function(i) c(rep(0,i),i)))
# the sequence
s = 1:4
# create zeros vector
vec = rep(0, sum(s+1))
# assign the sequence to the corresponding position in the zeros vector
vec[cumsum(s+1)] <- s
vec
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
Or to be more succinct, use replace:
replace(rep(0, sum(s+1)), cumsum(s+1), s)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4

Resources