Replacing values in a list of data frames

I have a list of data frames. Each has an ID column, followed by a number of numeric columns (with column names).
I would like to replace all the 1's with 0's for all the numeric columns, but keep the ID column the same. I can do this in part with a single data frame using
df[,-1] <- 0
But when I try to embed this in lapply, it fails:
df2 <- lapply(df, function(x) {x[,-1] <- 0})
I've tried subset, ifelse, while, and mutate, but I'm struggling with this simple replacement. I could recreate the data frames from scratch, or recombine the ID column at the end, but this strikes me as something that should be easy...
Test list:
test_list <- list(data.frame("ID"=letters[1:3], "col2"=1:3, "col3"=0:2), data.frame("ID"=letters[4:6], "col2"=4:6, "col3"=0:2))
The end result should be:
final_list <- list(data.frame("ID"=letters[1:3], "col2"=0, "col3"=0), data.frame("ID"=letters[4:6], "col2"=0, "col3"=0))

Add return(x) to your function and it should work. Without it, the function returns the value of the assignment (0) rather than the modified data frame.
lapply(test_list, function(x){
x[, -1] <- 0
return(x)
})
# [[1]]
# ID col2 col3
# 1 a 0 0
# 2 b 0 0
# 3 c 0 0
#
# [[2]]
# ID col2 col3
# 1 d 0 0
# 2 e 0 0
# 3 f 0 0
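
To see why the original attempt fails, the two variants can be run side by side. This is a minimal sketch using the question's test_list; the names without_return and with_return are only illustrative and do not appear in the original post.

```r
test_list <- list(
  data.frame(ID = letters[1:3], col2 = 1:3, col3 = 0:2),
  data.frame(ID = letters[4:6], col2 = 4:6, col3 = 0:2)
)

# The last expression is the assignment, so each list element becomes
# the assigned value (0) instead of a data frame
without_return <- lapply(test_list, function(x) { x[, -1] <- 0 })

# Ending the function with x (or return(x)) hands back the modified data frame
with_return <- lapply(test_list, function(x) { x[, -1] <- 0; x })
```

Inspecting without_return[[1]] and with_return[[1]] shows the difference directly.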

Your question is worded a little bit strangely in that it sounds like you want to replace all the 1's with 0's, but your example seems to contradict that.
If you want to replace just 1's with 0's, you could do so like this:
lapply(test_list, function(x) {x[x==1] <- 0; return(x)})
[[1]]
ID col2 col3
1 a 0 0
2 b 2 0
3 c 3 2
[[2]]
ID col2 col3
1 d 4 0
2 e 5 0
3 f 6 2
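
If the ID column could ever contain a value that compares equal to 1, it may be safer to restrict the replacement to the numeric columns. The following is a sketch under that assumption; the helper name replace_ones is hypothetical.

```r
test_list <- list(
  data.frame(ID = letters[1:3], col2 = 1:3, col3 = 0:2),
  data.frame(ID = letters[4:6], col2 = 4:6, col3 = 0:2)
)

replace_ones <- function(x) {
  # TRUE for numeric columns (col2, col3), FALSE for ID
  num_cols <- sapply(x, is.numeric)
  # Replace 1 with 0 only inside the numeric columns
  x[num_cols][x[num_cols] == 1] <- 0
  x
}

result <- lapply(test_list, replace_ones)
```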

Related

Is there a way to add two contiguous columns in R?

I have a data frame where each observation is spread over two columns: columns 1 and 2 represent individual 1, columns 3 and 4 individual 2, and so on.
Basically, I want to add each pair of contiguous columns to get each individual's real score.
In this example V1 and V2 represent individual I, and V3 and V4 represent individual II. The resulting data frame will have half as many columns and the same number of rows, and each value will be the sum of the corresponding values in two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- matrix(NA, nrow = nrow(a), ncol = ncol(a))
for (i in seq(2,ncol(a),by=2)){
for (k in 1:nrow(a)){
b[k,i] <- a[k,i] + a[k,i-1]
}
}
b <- as.data.frame(b)
b <- b[,-c(seq(1,length(b),by=2))]
Is there a way to make it simpler?

We could use split.default to split the data and then do rowSums by looping over the list
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
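
The grouping vector is the key step here, so it may help to look at it on its own. This sketch shows what the gl() call produces and an arithmetic alternative that yields the same groups; the names grp_gl, grp_arith and sums are illustrative only.

```r
a <- data.frame(V1 = c(0, 1, 0, 0), V2 = c(0, 0, 1, 1),
                V3 = c(1, 0, 1, 0), V4 = c(1, 0, 1, 1))

# gl() assigns the same group number to each pair of columns: 1 1 2 2
grp_gl <- as.integer(gl(ncol(a), 2, ncol(a)))

# Integer division gives the same grouping without building a factor
grp_arith <- (seq_along(a) + 1) %/% 2

# Split the columns by group and sum each group row-wise
sums <- sapply(split.default(a, grp_arith), rowSums)
```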

You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
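
To get the column names I, II, ... from the desired output instead of col1, col2, the result can be labelled with Roman numerals. This is a small sketch that assumes utils::as.roman is acceptable for the labels.

```r
a <- data.frame(V1 = c(0, 1, 0, 0), V2 = c(0, 0, 1, 1),
                V3 = c(1, 0, 1, 0), V4 = c(1, 0, 1, 1))

# Odd-numbered columns plus even-numbered columns (logical index is recycled)
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]

# Label the result columns I, II, ... as in the desired output
names(res) <- as.character(utils::as.roman(seq_along(res)))
```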

dplyr's approach with row-wise operations (rowwise is a special type of grouping per row)
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a%>%
rowwise()%>%
transmute(I=sum(c(V1,V2)),
II=sum(c(V3,V4)))
or alternatively with a built-in row-wise variant of the sum
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))

Split dataframe array column into multiple binary columns [R]

Array column is current and the others are the goal
I have a column of arrays and I would like to split it out into multiple binaries.
I have created all the columns by using
dat[,unique(unlist(df$array_column))] = 0
I tried to use an ifelse statement to then set the columns to '1' as needed however using %in% does not work with ifelse. I could create a nested for loop however I have millions of rows and am looking for a faster solution than that.
testdf = data.frame('a'=c(1,2,3,4,5),'array_column'=c('a-b-c','b-a','c-d','d-e-e','e-a'),stringsAsFactors = F)
testdf$array_column = strsplit(testdf$array_column,'-')

I think the question is rather how to convert a list of vectors into a binary matrix/data.frame.
Here is a solution
testdf = data.frame('a'=c(1,2,3,4,5),'array_column'=c('a-b-c','b-a','c-d','d-e-e','e-a'),stringsAsFactors = F)
testdf$array_column = strsplit(testdf$array_column,'-')
library('plyr')
# Creates a list of data.frames with 1s for each value observed
binary <- lapply(testdf$array_column, function(x) {
vals <- unique(x)
x <- setNames(rep(1,length(vals)), vals);
do.call(data.frame, as.list(x))
})
# Joins into single data.frame
result <- do.call(rbind.fill, binary)
result[is.na(result)] <- 0
result
# a b c d e
# 1 1 1 1 0 0
# 2 1 1 0 0 0
# 3 0 0 1 1 0
# 4 0 0 0 1 1
# 5 1 0 0 0 1
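
The same binary matrix can also be built without plyr by testing each row's values against the full set of distinct values. This is a sketch, not a benchmarked solution for millions of rows; the names all_values and binary are illustrative.

```r
testdf <- data.frame(a = 1:5,
                     array_column = c("a-b-c", "b-a", "c-d", "d-e-e", "e-a"),
                     stringsAsFactors = FALSE)
testdf$array_column <- strsplit(testdf$array_column, "-")

# Every distinct value becomes one output column
all_values <- sort(unique(unlist(testdf$array_column)))

# For each row, mark which of the distinct values occur in it (1) or not (0)
binary <- t(sapply(testdf$array_column,
                   function(x) as.integer(all_values %in% x)))
colnames(binary) <- all_values
```

The result is an integer matrix with one row per element of array_column; as.data.frame(binary) converts it if a data frame is needed.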

How to combine information from two rows with ALMOST same name

I have a data frame that contains many columns with almost identical names, like A and A...1, B and B...1, and so on. I would like to combine these columns so that A and A...1 become one column. All these columns contain 0, 1 or NA, and NAs should be treated as zeros (0). So if column A is 0,0,1,1,NA and column A...1 is 1,0,0,0,1, combined_A should be 1,0,1,1,1. In other words, if either column has a 1 in a given row, the combined column should have a 1 there.
Here's some code to produce example
original_table <- data.frame(A = c(0,0,1,1,NA),B = c(1,1,NA,NA,1),A...1 = c(1,0,0,0,1),B...1 = c(0,1,0,1,1))
So the original table looks like this
A B A...1 B...1
0 1 1 0
0 1 0 1
1 NA 0 0
1 NA 0 1
NA 1 1 1
The desired output table would look like this after combining.
combined_table <- data.frame(combined_A = c(1,0,1,1,1),combined_B = c(1,1,0,1,1))
combined_A combined_B
1 1
0 1
1 0
1 1
1 1
I'm fairly familiar with R, but I couldn't find any help for this problem.

We can use split.default to split based on common part in the column names. In this example, it seems we can find common columns by extracting the first letter of each column name.
substr(names(original_table), 1, 1)
#[1] "A" "B" "A" "B"
We use this to split columns and in each group use pmax to get max value in each row removing NA
as.data.frame(lapply(split.default(original_table,
substr(names(original_table), 1, 1)), function(x)
do.call(pmax, c(x, na.rm = TRUE))))
# A B
#1 1 1
#2 0 1
#3 1 0
#4 1 1
#5 1 1
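
Taking only the first letter works for this example but would merge unrelated columns once names get longer. A sketch that instead strips the "...<number>" suffix, assuming the duplicated columns always follow that naming pattern (the names base_names and combined are illustrative):

```r
original_table <- data.frame(A = c(0, 0, 1, 1, NA), B = c(1, 1, NA, NA, 1),
                             A...1 = c(1, 0, 0, 0, 1), B...1 = c(0, 1, 0, 1, 1))

# Remove a trailing "...<digits>" so that A and A...1 share the key "A"
base_names <- sub("\\.\\.\\.[0-9]+$", "", names(original_table))

# Row-wise maximum within each group of columns, ignoring NA
combined <- as.data.frame(lapply(
  split.default(original_table, base_names),
  function(x) do.call(pmax, c(x, na.rm = TRUE))
))
names(combined) <- paste0("combined_", names(combined))
```

If every column in a group is NA for some row, pmax still returns NA there, so that case would need a further replacement to satisfy the "NA counts as 0" rule.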

Another solution (it relies on stringr and magrittr, so it is not strictly base R):
Find the base column names:
library(stringr)
library(magrittr)
initial_col <- str_extract(names(original_table),"[A-Z]")%>%
unique()
> initial_col
[1] "A" "B"
Then, for all columns containing these names (grep(col,names(original_table),value = T)), compute the row sum and transform it to a binary output:
sapply(initial_col,function(col){
tmp <- original_table[,grep(col,names(original_table),value = T)] %>%
rowSums(.,na.rm = T,1)
ifelse( tmp > 0,1,0)
})
A B
[1,] 1 1
[2,] 0 1
[3,] 1 0
[4,] 1 1
[5,] 1 1

Changing values in dataframe iterating over all rows and multiple columns

I need to change some values in my dataframe iterating over rows. For each row, if there is a 1 in some column I need to change 0 values in other columns to NA.
I have a code that works, but is super slow when using a bigger dataset.
data = data.frame(id=c("A","B","C"),V1=c(1,0,0),V2=c(0,0,0),V3=c(1,0,1))
cols = names(data)[2:4]
for (i in 1:nrow(data)){
if(any(data[i,cols]==1)){
data[i,cols][data[i,cols]==0]=NA
}
}
I have an example data set
data
id V1 V2 V3
1 A 1 0 1
2 B 0 0 0
3 C 0 0 1
and the expected (and the actual) result is
data
id V1 V2 V3
1 A 1 NA 1
2 B 0 0 0
3 C NA NA 1
How can I write this in a more efficient way?

A one-liner can be,
data[rowSums(data[-1]) > 0,] <- replace(data[rowSums(data[-1]) > 0,],
data[rowSums(data[-1]) > 0,] == 0,
NA)
data
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
To avoid evaluating the same expression over and over again, we can define it first, i.e.
v1 <- rowSums(data[-1]) > 0
data[v1,] <- replace(data[v1,],
data[v1,] == 0,
NA)

This is easy with dplyr, assuming you want to change the values in columns V1 and V2 based on the values in V3. Name the columns to change in mutate_at and pass the replacement rule as a formula (funs() has been deprecated since dplyr 0.8.0).
library(dplyr)
data %>% mutate_at(vars(V1:V2), ~ replace(., V3 == 1 & . == 0, NA))
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1

We can do this in base R, by creating a logical vector with rowSums and then update the numeric columns based on this index
i1 <- rowSums(data[-1] == 1) > 0
data[-1][i1,] <- NA^ !data[-1][i1,]
data
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
If the index needs to be based on a single column, say 'V3', change the 'i1' to
i1 <- data$V3 == 1
Then update the other numeric columns for the rows selected by 'i1'. Negating the subset with ! gives a logical matrix that is TRUE for 0 values and FALSE for all others, and NA^ applied to that matrix returns NA for TRUE and 1 for FALSE. Because the columns hold only binary values, the result can be assigned straight back:
data[i1, 2:3] <- NA^!data[i1, 2:3]
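
The NA^ trick can be checked on a plain logical vector before applying it to the data frame. This sketch reuses the example data from the question; the names is_zero and na_or_one are illustrative.

```r
# NA^TRUE is NA^1, which is NA; NA^FALSE is NA^0, which is 1
is_zero <- c(TRUE, FALSE, TRUE)
na_or_one <- NA^is_zero

data <- data.frame(id = c("A", "B", "C"),
                   V1 = c(1, 0, 0), V2 = c(0, 0, 0), V3 = c(1, 0, 1))

# Rows where V3 is 1: zeros in V1 and V2 become NA, ones stay 1
i1 <- data$V3 == 1
data[i1, 2:3] <- NA^!data[i1, 2:3]
```

Note that every non-zero value in the selected rows is rewritten as 1, which is harmless only because the columns are binary.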

Create a new variable based on value across columns

I have a dataframe that is similar to a simplified version below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
df<-data.frame(MO1,MO2,MO3)
df
I am trying to create a new variable that would scan through the observations looking for all the 1 values. I would then like the observations in this new variable to take on the name of the column variable that it was obtained from, see below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
MOTIVATION<-c("MO2","MO1","MO3","")
df2<-data.frame(MO1,MO2,MO3,MOTIVATION)
df2
Sorry, I do not know how to just show the resulting data frame, df2 from above.
I have 989 observations and 19 different MO.. variables in my dataset.

Another option
ind <- which(df==1, arr.ind = TRUE)
df2 <- df # just cloning df
df2$MOTIVATION <- NA
df2$MOTIVATION[ind[,1]] <- names(df) [ind[,2]]
df2
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>

An option is to use apply in combination with which as:
df$MOTIVATION <- apply(df,1,function(x)names(df)[which(x==1)])
df
# MO1 MO2 MO3 MOTIVATION
# 1 0 1 3 MO2
# 2 1 0 2 MO1
# 3 2 3 1 MO3
# 4 3 2 0
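
When a row has no "1", the anonymous function above returns a zero-length vector, so apply gives back a list rather than a character vector. A sketch of a variant that always returns exactly one string per row (NA for no match, comma-separated names for several matches); the names mo_cols and hits are illustrative.

```r
df <- data.frame(MO1 = c("0", "1", "2", "3"),
                 MO2 = c("1", "0", "3", "2"),
                 MO3 = c("3", "2", "1", "0"))

# Remember the MO columns before adding the new variable
mo_cols <- names(df)

df$MOTIVATION <- apply(df[mo_cols], 1, function(x) {
  hits <- names(x)[x == "1"]   # names of the columns that hold "1" in this row
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
})
```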

1) Try max.col like this. Insert a 1 in front of each row and then find the column of the last 1. Subtract 1 so that it corresponds to the original column numbers and a missing 1 gives 0. Then replace all zeros with NA and look up the corresponding column names.
ix <- max.col(cbind(1, df) == 1, "last") - 1
transform(df, MOTIVATION = names(df)[replace(ix, ix == 0, NA)])
giving:
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
2) A variation would be the following. We compute max.col and then multiply each result by 1 if there is a 1 in that row or NA if not.
df1 <- df == 1
transform(df, MOTIVATION = names(df)[max.col(df1) * match(rowSums(df1), 1)])

The following does the trick. Note that it supports the case where two columns contain "1" in the same row; I am not sure whether this is a valid edge case for you.
(I changed MO2 and added MO4 to the original data so that one row would contain two "1" values.)
MO1<-c("0","1","2","3")
MO2<-c("1","2","3","2")
MO3<-c("3","2","1","0")
MO4<-c("3","2","1","1")
df<-data.frame(MO1,MO2,MO3,MO4)
df
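# For each row, collect the names of the columns whose value is "1"
findx <- function(row) names(row)[which(row == "1")]
found <- apply(df, 1, findx)
# Collapse multiple matches per row into a single string
newdf <- unname(sapply(found, paste, collapse = ", "))
newdf
With an output of
"MO2" "MO1" "MO3, MO4" "MO4"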