Create a new variable based on value across columns - r

I have a dataframe that is similar to a simplified version below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
df<-data.frame(MO1,MO2,MO3)
df
I am trying to create a new variable that would scan through the observations looking for all the 1 values. I would then like the observations in this new variable to take on the name of the column variable that it was obtained from, see below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
MOTIVATION<-c("MO2","MO1","MO3","")
df2<-data.frame(MO1,MO2,MO3,MOTIVATION)
df2
Sorry, I do not know how to just show the resulting data frame, df2 from above.
I have 989 observations and 19 different MO.. variables in my dataset.

Another option
> ind <- which(df==1, arr.ind = TRUE)
> df2 <- df # just cloning df
> df2$MOTIVATION <- NA
> df2$MOTIVATION[ind[,1]] <- names(df) [ind[,2]]
> df2
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>

An option is to use apply in combination with which as:
df$MOTIVATION <- apply(df,1,function(x)names(df)[which(x==1)])
df
# MO1 MO2 MO3 MOTIVATION
# 1 0 1 3 MO2
# 2 1 0 2 MO1
# 3 2 3 1 MO3
# 4 3 2 0

1) Try max.col like this. Insert a 1 in front of each row and then find the column of the last 1. Subtract 1 so that it corresponds tot he original column numbers and a missing 1 gives 0. Then replace all zeros with NA and look up the corresponding column names.
ix <- max.col(cbind(1, df) == 1, "last") - 1
transform(df, MOTIVATION = names(df)[replace(ix, ix == 0, NA)])
giving:
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
2) A variation would be the following. We compute max.col and then multiply each result by 1 if there is a 1 in that row or NA if not.
df1 <- df == 1
transform(df, MOTIVATION = names(df)[max.col(df1) * match(rowSums(df1), 1)])

The following does the trick (note that this support the case where two Columns have "1" not sure if this was a valid edge case for you.
(I slightly modified MO4 from original so that it would contain two "1"
MO1<-c("0","1","2","3")
MO2<-c("1","2","3","2")
MO3<-c("3","2","1","0")
MO4<-c("3","2","1","1")
df<-data.frame(MO1,MO2,MO3,MO4)
df
findx <- function(dfx)
{
idx <- which(dfx=="1")
res <- lapply(idx, function(x) paste0('MO', x))
res
}
found <- apply(df,2,findx)
newdf <- unlist(found)
newdf
With an ouput of
"MO2" "MO1" "MO3" "MO3" "MO4"

Related

How can I use vectorisation in R to change a DF value based on a condition?

Suppose I have the following DF:
C1
C2
0
0
1
1
1
1
0
0
.
.
.
.
I now want to apply these following conditions on the Dataframe:
The value for C1 should be 1
A random integer between 0 and 5 should be less than 2
If both these conditions are true, I change the C1 and C2 value for that row to 2
I understand this can be done by using the apply function, and I have used the following:
C1 <- c(0, 1,1,0,1,0,1,0,1,0,1)
C2 <- c(0, 1,1,0,1,0,1,0,1,0,1)
df <- data.frame(C1, C2)
fun <- function(x){
if (sample(0:5, 1) < 2){
x[1:2] <- 2
}
return (x)
}
index <- df$C1 ==1 // First Condition
processed_Df <-t(apply(df[index,],1,fun)) // Applies Second Condition
df[index,] <- processed_Df
Output:
C1
C2
0
0
2
2
1
1
0
0
.
.
.
.
Some Rows have both conditions met, some doesn't (This is the main
functionality, I would like to achieve)
Now I want to achieve this same using vectorization and without using loops or the apply function. The only confusion I have is "If I don't use apply, won't each row get the same result based on the condition's result? (For example, the following:)
df$C1 <- ifelse(df$C1==1 & sample(0:5, 1) < 5, 2, df$C1)
This changes all the rows in my DF with C1==2 to 2 when there should possibly be many 1's.
Is there a way to get different results for the second condition for each row without using the apply function? Hopefully my question makes sense.
Thanks
You need to sample the values for nrow times. Try this method -
set.seed(167814)
df[df$C1 == 1 & sample(0:5, nrow(df), replace = TRUE) < 2, ] <- 2
df
# C1 C2
#1 0 0
#2 2 2
#3 2 2
#4 0 0
#5 1 1
#6 0 0
#7 2 2
#8 0 0
#9 1 1
#10 0 0
#11 1 1
Here is a fully vectorized way. Create the logical index index just like in the question. Then sample all random integers r in one call to sample. Replace in place based on the conjunction of the index and the condition r < 2.
x <- 'C1 C2
0 0
1 1
1 1
0 0'
df1 <- read.table(textConnection(x), header = TRUE)
set.seed(1)
index <- df1$C1 == 1
r <- sample(0:5, length(index), TRUE)
df1[index & r < 2, c("C1", "C2")] <- 2
df1
#> C1 C2
#> 1 0 0
#> 2 1 1
#> 3 2 2
#> 4 0 0
Created on 2022-05-11 by the reprex package (v2.0.1)

How do I drop all observations except the last of a pattern?

I asked a question a few months back about how to identify and keep only observations that follow a certain pattern: How can I identify patterns over several rows in a column and fill a new column with information about that pattern using R?
I want to take this a step further. In that question I just wanted to identify that pattern. Now, if the pattern appears several times within a group, how I keep only the last occurance of that pattern. For example, given df1 how can I achieve df2
df1
TIME ID D
12:30:10 2 0
12:30:42 2 0
12:30:59 2 1
12:31:20 2 0
12:31:50 2 0
12:32:11 2 0
12:32:45 2 1
12:33:10 2 1
12:33:33 2 1
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
I want to end up with the following df2
df2
TIME ID D
12:33:55 2 1
12:34:15 2 0
12:34:30 2 0
12:35:30 2 0
12:36:30 2 0
12:36:45 2 0
12:37:00 2 0
12:38:00 2 1
Thoughts? There were some helpful answers in the question I linked above, but I now want to narrow it.
Here is a base R function I find too complicated but that gets what is asked for.
If I understood the pattern correctly, it doesn't matter if the last sequence ends in a 1 or a 0. The test with df1b has a last sequence ending in a 0.
keep_last_pattern <- function(data, col){
x <- data[[col]]
if(x[length(x)] == 0) x[length(x)] <- 1
#
i <- ave(x, cumsum(x), FUN = \(y) y[1] == 1 & length(y) > 1)
r <- rle(i)
l <- length(r$lengths)
n <- which(as.logical(r$values))
r$values[ n[-length(n)] ] <- 0
r$values[l] <- r$lengths[l] == 1 && r$values[l] == 0
j <- as.logical(inverse.rle(r))
#
data[j, ]
}
keep_last_pattern(df1, "D")
df1b <- df1
df1b[17, "D"] <- 0
keep_last_pattern(df1b, "D")
Do you want to rows the sequence in each ID between second last 1 and last 1 ?
Here is a function to do that which can be applied for each ID.
library(dplyr)
extract_sequence <- function(x) {
inds <- which(x == 1)
inds[length(inds) - 1]:inds[length(inds)]
}
df %>%
group_by(ID) %>%
slice(extract_sequence(D)) %>%
ungroup
# TIME ID D
# <chr> <int> <int>
#1 12:33:55 2 1
#2 12:34:15 2 0
#3 12:34:30 2 0
#4 12:35:30 2 0
#5 12:36:30 2 0
#6 12:36:45 2 0
#7 12:37:00 2 0
#8 12:38:00 2 1
Not sure this will help as it's unclear what your pattern is.
Let's assume you have data like this, with one column indicating in some way whether the row matches a pattern or not:
set.seed(123)
df <- data.frame(
grp = sample(LETTERS[1:3], 10, replace = TRUE),
x = 1:10,
y = c(0,1,0,0,1,1,1,1,1,1),
pattern = rep(c("TRUE", "FALSE"),5)
)
If the aim is to keep only the last occurrence of pattern == "TRUE" per group, this might work:
df %>%
filter(pattern == "TRUE") %>%
group_by(grp) %>%
slice_tail(.)
# A tibble: 3 x 4
# Groups: grp [3]
grp x y pattern
<chr> <int> <dbl> <chr>
1 A 1 0 TRUE
2 B 9 1 TRUE
3 C 5 1 TRUE

Is there a way to make addition of two contiguos columns in R?

I have a data frame where each observation is comprehended in two columns. In this way, columns 1 and 2 represents the individual 1, 3 and 4 the individual 2 and so on.
Basically what I want to do is to add two contigous columns so I have the individual real score.
In this example V1 and V2 represent individual I and V3 and V4 represent individual II. So for the result data frame I will have the half of columns, the same number of rows and each value will be the addition of each value between two contigous colums.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desire Output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- data.frame(NA, nrow = nrow(a), ncol = ncol(data))
for (i in seq(2,ncol(a),by=2)){
for (k in 1:nrow(a)){
b[k,i] <- a[k,i] + a[k,i-1]
}
}
b <- as.data.frame(b)
b <- b[,-c(seq(1,length(b),by=2))]
Is there a way to make it simplier?
We could use split.default to split the data and then do rowSums by looping over the list
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
dplyr's approach with row-wise operations (rowwise is a special type of grouping per row)
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a%>%
rowwise()%>%
transmute(I=sum(c(V1,V2)),
II=sum(c(V3,V4)))
or alternatively with a built-in row-wise variant of the sum
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))

Index based assignment with apply in R

I'm cleaning up some survey data in R; assigning variables 1,0 based on the responses to a question. Say I had a question with 3 options; a,b,c; and I had a data frame with the responses and logical variables:
df <- data.frame(a = rep(0,3), b = rep(0,3), c = rep(0,3), response = I(list(c(1),c(1,2),c(2,3))))
So I want to change the 0's to 1's if the response matches the column index (ie 1=a, 2=b, 3=c).
This is fairly easy to do with a loop:
for (i in 1:nrow(df2)) df2[i,df2[i,"response"][[1]]] <- 1
Is there any way to do this with an apply/lapply/sapply/etc? Something like:
df <- sapply(df,function(x) x[x["response"][[1]]] <- 1)
Or should I stick with a loop?
You can use matrix indexing, from ?[:
A third form of indexing is via a numeric matrix with the one column
for each dimension: each row of the index matrix then selects a single
element of the array, and the result is a vector. Negative indices are
not allowed in the index matrix. NA and zero values are allowed: rows
of an index matrix containing a zero are ignored, whereas rows
containing an NA produce an NA in the result.
# construct a matrix representing the index where the value should be one
idx <- with(df, cbind(rep(seq_along(response), lengths(response)), unlist(response)))
idx
# [,1] [,2]
#[1,] 1 1
#[2,] 2 1
#[3,] 2 2
#[4,] 3 2
#[5,] 3 3
# do the assignment
df[idx] <- 1
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3
or you can try this .
library(tidyr)
library(dplyr)
df1=df %>%mutate(Id=row_number()) %>%unnest(response)
df[,1:3]=table(df1$Id,df1$response)
a b c response
1 1 0 0 1
2 1 1 0 1, 2
3 0 1 1 2, 3
Perhaps this helps
df[1:3] <- t(sapply(df$response, function(x) as.integer(names(df)[1:3] %in% names(df)[x])))
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3
Or a compact option is
library(qdapTools)
df[1:3] <- mtabulate(df$response)

R - How would I check to see if columns corresponding to a given group are all equal within a group?

Let's say I have data like this:
group value
1 0
1 0
1 0
2 1
2 0
3 1
3 0
4 1
4 1
How would I iterate through all values of "group" to see if the values corresponding with the group have all equal values. I want to have a dataset that includes ONLY groups where the values are not identical. I'm not sure of an easy way to do this avoiding a for loop.
You can do:
tapply(DF$value, DF$group, FUN = function(x) length(unique(x))) > 1L
# 1 2 3 4
# FALSE TRUE TRUE FALSE
To subset the table, write the same with ave:
DF[ ave(DF$value, DF$group, FUN = function(x) length(unique(x))) > 1L, ]
# group value
# 4 2 1
# 5 2 0
# 6 3 1
# 7 3 0
With packages, the latter step looks like...
library(data.table)
setDT(DF)[, if (uniqueN(value) > 1L) .SD, by=group]
# or
library(dplyr)
DF %>% group_by(group) %>% filter(n_distinct(value) > 1L)
Here is another option using table
tbl <- rowSums(table(df1)>0)>1
subset(df1, group %in% names(tbl)[tbl])
# group value
#4 2 1
#5 2 0
#6 3 1
#7 3 0

Resources