I need to change some values in my data frame by iterating over rows. For each row, if any of the selected columns contains a 1, I need to change the 0 values in those columns to NA.
I have code that works, but it is very slow on a bigger dataset.
data = data.frame(id = c("A","B","C"), V1 = c(1,0,0), V2 = c(0,0,0), V3 = c(1,0,1))
cols = names(data)[2:4]
for (i in 1:nrow(data)) {
  if (any(data[i, cols] == 1)) {
    data[i, cols][data[i, cols] == 0] <- NA
  }
}
I have an example data set
data
id V1 V2 V3
1 A 1 0 1
2 B 0 0 0
3 C 0 0 1
and the expected (and the actual) result is
data
id V1 V2 V3
1 A 1 NA 1
2 B 0 0 0
3 C NA NA 1
How can I write this in a more optimal way?
A one-liner can be,
data[rowSums(data[-1]) > 0, ] <- replace(data[rowSums(data[-1]) > 0, ],
                                         data[rowSums(data[-1]) > 0, ] == 0,
                                         NA)
data
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
To avoid evaluating the same expression over and over again, we can define it first, i.e.
v1 <- rowSums(data[-1]) > 0
data[v1, ] <- replace(data[v1, ], data[v1, ] == 0, NA)
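A further small refinement (a sketch, reusing the cols vector from the question) is to restrict both the rowSums() and the replacement to the numeric columns, so the id column is never compared against 0:
v1 <- rowSums(data[cols]) > 0
data[v1, cols][data[v1, cols] == 0] <- NA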
It is easy with dplyr, assuming you want to change the values of the V1 and V2 columns based on the values in V3. We specify the columns whose values we want to change in mutate_at, and in the funs argument we give the condition under which values should be replaced.
library(dplyr)
data %>% mutate_at(vars(V1:V2), funs(replace(., V3 == 1 & . == 0, NA)))
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
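Note that funs() has since been deprecated in dplyr; with current dplyr (>= 1.0.0) the same idea can be written with across() (a sketch, under the same assumption that V3 drives the replacement):
library(dplyr)
data %>% mutate(across(V1:V2, ~ replace(.x, V3 == 1 & .x == 0, NA)))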
We can do this in base R by creating a logical index with rowSums and then updating the numeric columns based on it:
i1 <- rowSums(data[-1] == 1) > 0
data[-1][i1, ] <- NA^!data[-1][i1, ]
data
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
If the index needs to be based on a single column, say 'V3', change the 'i1' to
i1 <- data$V3 == 1
and update the other numeric columns after subsetting the rows with 'i1'. The negation (!) creates a logical matrix that is TRUE for 0 values and FALSE for everything else; applying NA^ to that logical matrix then returns NA for TRUE and 1 for FALSE. As the columns contain only binary values, the update becomes:
data[i1, 2:3] <- NA^!data[i1, 2:3]
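As a quick illustration of the NA^! trick on a toy vector:
x <- c(0, 1, 2)
!x     # TRUE FALSE FALSE  (TRUE marks the zeros)
NA^!x  # NA 1 1            (NA^TRUE is NA, NA^FALSE is 1)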
I have a data frame where each individual is represented by two columns: columns 1 and 2 correspond to individual 1, columns 3 and 4 to individual 2, and so on.
Basically, what I want to do is add each pair of contiguous columns so that I get each individual's actual score.
In this example V1 and V2 represent individual I and V3 and V4 represent individual II. The resulting data frame will therefore have half as many columns, the same number of rows, and each value will be the sum of the two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- matrix(NA, nrow = nrow(a), ncol = ncol(a))
for (i in seq(2,ncol(a),by=2)){
for (k in 1:nrow(a)){
b[k,i] <- a[k,i] + a[k,i-1]
}
}
b <- as.data.frame(b)
b <- b[,-c(seq(1,length(b),by=2))]
Is there a way to make it simpler?
We could use split.default to split the data into pairs of columns and then apply rowSums to each element of the resulting list:
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
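If a data frame with the desired column names is preferred, the result can be wrapped and renamed (a sketch):
res <- as.data.frame(sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums))
names(res) <- c("I", "II")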
You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
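The same recycling idea generalizes to any even number of columns by selecting the odd- and even-numbered positions explicitly (a sketch):
res <- a[seq(1, ncol(a), by = 2)] + a[seq(2, ncol(a), by = 2)]
names(res) <- paste0('col', seq_along(res))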
dplyr's approach with row-wise operations (rowwise() is a special type of per-row grouping):
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a %>%
  rowwise() %>%
  transmute(I = sum(c(V1, V2)),
            II = sum(c(V3, V4)))
or, alternatively, with rowSums(), the built-in row-wise variant of sum:
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))
I have a list of data frames. Each has an ID column, followed by a number of numeric columns (with column names).
I would like to replace all the 1's with 0's for all the numeric columns, but keep the ID column the same. I can do this in part with a single data frame using
df[,-1] <- 0
But when I try to embed this in lapply, it fails:
df2 <- lapply(df, function(x) {x[,-1] <- 0})
I've tried using subset, ifelse, while, and mutate, but I'm struggling with this simple replacement. I could recreate the data frames from scratch, or recombine the ID column at the end, but this strikes me as something that should be easy...
Test list:
test_list <- list(data.frame("ID"=letters[1:3], "col2"=1:3, "col3"=0:2), data.frame("ID"=letters[4:6], "col2"=4:6, "col3"=0:2))
The end result should be:
final_list <- list(data.frame("ID"=letters[1:3], "col2"=0, "col3"=0), data.frame("ID"=letters[4:6], "col2"=0, "col3"=0))
Add return(x) to your function and then it should work fine.
lapply(test_list, function(x){
x[, -1] <- 0
return(x)
})
# [[1]]
# ID col2 col3
# 1 a 0 0
# 2 b 0 0
# 3 c 0 0
#
# [[2]]
# ID col2 col3
# 1 d 0 0
# 2 e 0 0
# 3 f 0 0
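On R >= 4.1 the same function can be written more compactly with the backslash lambda shorthand:
lapply(test_list, \(x) { x[, -1] <- 0; x })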
Your question is worded a little bit strangely in that it sounds like you want to replace all the 1's with 0's, but your example seems to contradict that.
If you want to replace just 1's with 0's, you could do so like this:
lapply(test_list, function(x) {x[x == 1] <- 0; return(x)})
[[1]]
ID col2 col3
1 a 0 0
2 b 2 0
3 c 3 2
[[2]]
ID col2 col3
1 d 4 0
2 e 5 0
3 f 6 2
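If the ID column could itself contain a 1 (for example, numeric IDs), the replacement can be restricted to the non-ID columns (a sketch):
lapply(test_list, function(x) { x[-1][x[-1] == 1] <- 0; x })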
I have a dataframe that is similar to a simplified version below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
df<-data.frame(MO1,MO2,MO3)
df
I am trying to create a new variable that scans through the observations looking for all the 1 values. I would then like this new variable to take on the name of the column the 1 was found in; see below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
MOTIVATION<-c("MO2","MO1","MO3","")
df2<-data.frame(MO1,MO2,MO3,MOTIVATION)
df2
Sorry, I do not know how to show just the resulting data frame df2 on its own.
I have 989 observations and 19 different MO.. variables in my dataset.
Another option
> ind <- which(df==1, arr.ind = TRUE)
> df2 <- df # just cloning df
> df2$MOTIVATION <- NA
> df2$MOTIVATION[ind[,1]] <- names(df)[ind[,2]]
> df2
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
An option is to use apply in combination with which as:
df$MOTIVATION <- apply(df,1,function(x)names(df)[which(x==1)])
df
# MO1 MO2 MO3 MOTIVATION
# 1 0 1 3 MO2
# 2 1 0 2 MO1
# 3 2 3 1 MO3
# 4 3 2 0
1) Try max.col like this. Insert a 1 in front of each row and then find the column of the last 1. Subtract 1 so that it corresponds to the original column numbers and a missing 1 gives 0. Then replace all zeros with NA and look up the corresponding column names.
ix <- max.col(cbind(1, df) == 1, "last") - 1
transform(df, MOTIVATION = names(df)[replace(ix, ix == 0, NA)])
giving:
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
2) A variation would be the following. We compute max.col and then multiply each result by 1 if there is a 1 in that row or NA if not.
df1 <- df == 1
transform(df, MOTIVATION = names(df)[max.col(df1) * match(rowSums(df1), 1)])
The following does the trick (note that it supports the case where two columns contain a "1"; I am not sure whether that is a valid edge case for you).
(I slightly modified the example data, adding MO4, so that one column contains two "1" values.)
MO1<-c("0","1","2","3")
MO2<-c("1","2","3","2")
MO3<-c("3","2","1","0")
MO4<-c("3","2","1","1")
df<-data.frame(MO1,MO2,MO3,MO4)
df
findx <- function(dfx) {
  idx <- which(dfx == "1")
  res <- lapply(idx, function(x) paste0('MO', x))
  res
}
found <- apply(df,2,findx)
newdf <- unlist(found)
newdf
With an output of
"MO2" "MO1" "MO3" "MO3" "MO4"
I would like to fill a data frame ("DF") with 0's or 1's depending on whether the values in a vector ("Date") match the date values in a second data frame ("df$Date").
If they match, the output value should be 1, otherwise 0.
I tried to adjust this code made by a friend of mine, but it doesn't work:
for (j in 1:length(Date)) {  # Date is a vector with all dates from 1967 to 2006
  # Start count
  count <- 0
  # Check all Dates between 1967-2006
  if (any(Date[j] == df$Date)) {  # df$Date contains specific dates of interest
    count <- count + 1
  }
  # If there is a match between Date and df$Date, its output is 1, else 0.
  DF[j, i] <- count
}
The main data frame "DF" has 190 columns, which have to be filled, and of course a number of rows equal to the length of the Date vector.
extra info
1) Each column is different from the others, so the values within a row will not all be equal (i.e. a single row should contain a mixture of 0's and 1's).
2) The column names in "DF" are also present in "df" as df$Code.
We can vectorize this operation with %in% and as.integer(), leveraging the fact that coercing logical to integer returns 0 for false and 1 for true:
DF[,i] <- as.integer(Date%in%df$Date);
If you want to fill every single column of DF with the exact same result vector:
DF[] <- as.integer(Date%in%df$Date);
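As a quick illustration of the coercion being relied on (with made-up dates):
as.integer(c("1967-01-01", "1970-06-15") %in% "1970-06-15");
## [1] 0 1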
My above code exactly reproduces the logic of the code in your (original) question.
From your edit, I'm not 100% sure I understand the requirement, but my best guess is this:
set.seed(4L);
LV <- 10L; ND <- 10L;
Date <- sample(seq_len(ND),LV,T);
df <- data.frame(Date=sample(seq_len(ND),3L),Code=c('V1','V2','V3'));
DF <- data.frame(V1=rep(NA,LV),V2=rep(NA,LV),V3=rep(NA,LV));
Date;
## [1] 6 1 3 3 9 3 8 10 10 1
df;
## Date Code
## 1 8 V1
## 2 3 V2
## 3 1 V3
for (cn in colnames(DF)) DF[,cn] <- as.integer(Date%in%df$Date[df$Code==cn]);
DF;
## V1 V2 V3
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 1 0
## 5 0 0 0
## 6 0 1 0
## 7 1 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 1
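The per-column loop can also be written without an explicit for loop, e.g. with sapply over the column names (a sketch of the same logic):
DF[] <- sapply(colnames(DF), function(cn) as.integer(Date%in%df$Date[df$Code==cn]));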
I'm looking to compare values within a dataset.
Every row starts with a unique ID followed by a couple of binary variables.
The data looks like this:
row.name v1 v2 v3 ...
1 0 0 0
2 1 1 1
3 1 0 1
I want to know which values are the same (if equal assign value of 1) and which are different (if not equal assign value of 0) for all unique pairings.
For example in column v1: row1 == 0 and row2 == 1, which should result in an assignment of 0.
So, the output should look like this
id1 id2 v1 v2 v3 ...
1 2 0 0 0 ...
1 3 0 1 0 ...
2 3 1 0 1 ...
I'm looking for an efficient way of doing this for more than 1000 rows...
There's no way to do this without expanding each combination of rows, so with 1000 rows (roughly 500,000 pairs) it is going to take a bit of time. But here is a solution:
dat <- read.table(header=T, text="row.name v1 v2 v3
1 0 0 0
2 1 1 1
3 1 0 1")
Create the index rows:
indices <- t(combn(dat$row.name, 2))
colnames(indices) <- c('id1', 'id2')
Loop through the index rows, and collect the comparisons:
res1 <- t(apply(indices, 1, function(x) as.numeric(dat[x[1],-1] == dat[x[2],-1])))
colnames(res1) <- names(dat[-1])
Put them together:
result <- cbind(indices, res1)
result
## id1 id2 v1 v2 v3
## [1,] 1 2 0 0 0
## [2,] 1 3 0 1 0
## [3,] 2 3 1 0 1
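For larger data, the per-row apply can be avoided by comparing the two sets of rows as matrices in one shot (a sketch of the same computation, mapping each id back to its row position first):
m <- as.matrix(dat[-1])
i1 <- match(indices[, 1], dat$row.name)  # row position of id1
i2 <- match(indices[, 2], dat$row.name)  # row position of id2
res2 <- (m[i1, ] == m[i2, ]) * 1L        # elementwise comparison, coerced to 0/1
result2 <- cbind(indices, res2)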