Select rows with identical columns from a data frame

Select rows with identical columns from a data frame - r

I have a data frame with several columns.
I want to select the rows with no NAs (as with complete.cases)
and all columns identical.
E.g., for
> f <- data.frame(a=c(1,NA,NA,4),b=c(1,NA,3,40),c=c(1,NA,5,40))
> f
a b c
1 1 1 1
2 NA NA NA
3 NA 3 5
4 4 40 40
I want the vector TRUE,FALSE,FALSE,FALSE selecting just the first row because there all 3 columns are the same and none is NA.
I can do
Reduce("==",f[complete.cases(f),])
but that creates an intermediate data frame which I would love to avoid (to save memory).

Try this:
R > index <- apply(f, 1, function(x) all(x==x[1]))
R > index
[1] TRUE NA NA FALSE
R > index[is.na(index)] <- FALSE
R > index
[1] TRUE FALSE FALSE FALSE

The best (IMO) solution is from David Winsemius:
which( rowSums(f==f[[1]]) == length(f) )

Related

How to match multiple columns without merge?

I have those two df's:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match is if ID and Date are the same in both df's, for example: "TRZ00897" "2022-08-21" is an exact match, because it is present in both df's
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE because "FFF" "2022-09-22" is in both df. The reason why it is FALSE is, that the Date"2022-09-22" occurred already in Data2 at index position 1.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So at the end, there is index position 6 and 1 which is not equal --> FALSE
How can I change this? Which function should I use to get the correct answer.
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.

Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical outut use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE

Try match with asplit (since you have different column names for two dataframes, I have to manually remove the names using unname, which can be avoided if both of them have the same names)
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another option that is memory-expensive option is using interaction
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6

With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit; for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE

When trying to replace values, "missing values are not allowed in subscripted assignments of data frames"

I have a table that has two columns: whether you were sick (H01) and the number of days sick (H03). However, the number of days sick is NA if H01 == false, and I would like to set it to 0. When I do this:
test <- pe94.person[pe94.person$H01 == 12,]
test$H03 <- 0
It works fine. However, I'd like to replace the values in the original dataframe. This, however, fails:
pe94.person[pe94.person$H01 == 12,]$H03 <- 0
It returns:
> pe94.person[pe94.person$H01 == 12,]$H03 <- 0
Error in `[<-.data.frame`(`*tmp*`, pe94.person$H01 == 12, , value = list( :
missing values are not allowed in subscripted assignments of data frames
Any idea why this is? For what it's worth, here's a frequency table:
> table(pe94.person[pe94.person$H01 == 12,]$H03)
2 3 5 28
3 1 1 1

It is due to missingness in H01 variable.
> x <- data.frame(a=c(NA,2:5), b=c(1:5))
> x
a b
1 NA 1
2 2 2
3 3 3
4 4 4
5 5 5
> x[x$a==2,]$b <- 99
Error in `[<-.data.frame`(`*tmp*`, x$a == 1, , value = list(a = NA_integer_, :
missing values are not allowed in subscripted assignments of data frames
The assignment won't work because x$a has a missing value.
Subsetting first works:
> z <- x[x$a==2,]
> z$b <- 99
> z <- x[x$a==2,]
> z
a b
NA NA NA
2 2 2
But that's because the [<- function apparently can't handle missing values in its extraction indices, even though [ can:
> `[<-`(x,x$a==2,,99)
Error in `[<-.data.frame`(x, x$a == 2, , 99) :
missing values are not allowed in subscripted assignments of data frames
So instead, trying specifying your !is.na(x$a) part when you're doing the assignment:
> `[<-`(x,!is.na(x$a) & x$a==2,'b',99)
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Or, more commonly:
> x[!is.na(x$a) & x$a==2,]$b <- 99
> x
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Note that this behavior is described in the documentation:
The replacement methods can be used to add whole column(s) by specifying non-existent column(s), in which case the column(s) are added at the right-hand edge of the data frame and numerical indices must be contiguous to existing indices. On the other hand, rows can be added at any row after the current last row, and the columns will be in-filled with missing values. Missing values in the indices are not allowed for replacement.

You can use ifelse, like so
pe94.person$foo <- ifelse(!is.na(pe94.person$H01) & pe94.person$H01 == 12, 0, pe94.person$H03)
check if foo meets your criteria and then go ahead and assign it to pe94.person$H03 directly. I find it safer to assign it a new variable and usually use that in subsequent analysis.

There might be an NA somewhere in the column that is causing the error. Run the index on a specific column instead of the entire data frame.
movies[movies$Actors == "N/A",] = NA #ERROR
movies$Actors[movies$Actors == "N/A"] = NA #Works

I realise the question is very old, but I think the most elegant solution is by using the which() function:
pe94.person[which(pe94.person$H01 == 12),]$H03 <- 0
should do what the original poster asked for. Because which() drops the NAs and keeps the (positions of the) TRUE results only.

Simply use the subset() function to exclude all NA from the string.
It works as x[subset & !is.na(subset)]. Look at this data:
> x <- data.frame(a = c(T,F,T,F,NA,F,T, F, NA,NA,T,T,F),
> b = c(F,T,T,F,T, T,NA,NA,F, T, T,F,F))
Subsetting with [ operator returns this:
> x[x$b == T & x$a == F, ]
a b
2 FALSE TRUE
NA NA NA
6 FALSE TRUE
NA.1 NA NA
NA.2 NA NA
And subset() does what we want:
> subset(x, b == T & a == F)
a b
2 FALSE TRUE
6 FALSE TRUE
To change the values of subsetted variables:
> ss <- subset(x, b == T & a == F)
> x[rownames(ss), 'a'] <- T
> x[c(2,6), ]
a b
2 TRUE TRUE
6 TRUE TRUE

Following works. Watch out there is no comma in sub setting:
x <- data.frame(a=c(NA,2:5), b=c(1:5))
x$a[x$a==2] <- 99

Replacing occurrences of a number in multiple columns of data frame with another value in R

ETA: the point of the below, by the way, is to not have to iterate through my entire set of column vectors, just in case that was a proposed solution (just do what is known to work once at a time).
There's plenty of examples of replacing values in a single vector of a data frame in R with some other value.
Replace a value in a data frame based on a conditional (if) statement in R
replace numbers in data frame column in r [duplicate]
And also how to replace all values of NA with something else:
How to replace all values in a data.frame with another ( not 0) value
What I'm looking for is analogous to the last question, but basically trying to replace one value with another. I'm having trouble generating a data frame of logical values mapped to my actual data frame for cases where multiple columns meet a criteria, or simply trying to do the actions from the first two questions on more than one column.
An example:
data <- data.frame(name = rep(letters[1:3], each = 3), var1 = rep(1:9), var2 = rep(3:5, each = 3))
data
name var1 var2
1 a 1 3
2 a 2 3
3 a 3 3
4 b 4 4
5 b 5 4
6 b 6 4
7 c 7 5
8 c 8 5
9 c 9 5
And say I want all of the values of 4 in var1 and var2 to be 10.
I'm sure this is elementary and I'm just not thinking through it properly. I have been trying things like:
data[data[, 2:3] == 4, ]
That doesn't work, but if I do the same with data[, 2] instead of data[, 2:3], things work fine. It seems that logical test (like is.na()) work on multiple rows/columns, but that numerical comparisons aren't playing as nicely?
Thanks for any suggestions!

you want to search through the whole data frame for any value that matches the value you're trying to replace. the same way you can run a logical test like replacing all missing values with 10..
data[ is.na( data ) ] <- 10
you can also replace all 4s with 10s.
data[ data == 4 ] <- 10
at least i think that's what you're after?
and let's say you wanted to ignore the first row (since it's all letters)
# identify which columns contain the values you might want to replace
data[ , 2:3 ]
# subset it with extended bracketing..
data[ , 2:3 ][ data[ , 2:3 ] == 4 ]
# ..those were the values you're going to replace
# now overwrite 'em with tens
data[ , 2:3 ][ data[ , 2:3 ] == 4 ] <- 10
# look at the final data
data

Basically data[, 2:3]==4 gave you the index for data[,2:3] instead of data:
R > data[, 2:3] ==4
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
So you may try this:
R > data[,2:3][data[, 2:3] ==4]
[1] 4 4 4 4

Just to provide a different answer, I thought I would write up a vector-math approach:
You can create a transformation matrix (really a data frame here, but will work the same), using a the vectorized 'ifelse' statement and multiply the transformation matrix and your original data, like so:
df.Rep <- function(.data_Frame, .search_Columns, .search_Value, .sub_Value){
.data_Frame[, .search_Columns] <- ifelse(.data_Frame[, .search_Columns]==.search_Value,.sub_Value/.search_Value,1) * .data_Frame[, .search_Columns]
return(.data_Frame)
}
To replace all values 4 with 10 in the data frame 'data' in columns 2 through 3, you would use the function like so:
# Either of these will work. I'm just showing options.
df.Rep(data, 2:3, 4, 10)
df.Rep(data, c("var1","var2"), 4, 10)
# name var1 var2
# 1 a 1 3
# 2 a 2 3
# 3 a 3 3
# 4 b 10 10
# 5 b 5 10
# 6 b 6 10
# 7 c 7 5
# 8 c 8 5
# 9 c 9 5

Just for continuity
data[,2:3][ data[,2:3] == 4 ] <- 10
But it looks ugly, So do it in 2 steps is better.

R - find all unique values among subsets of a data frame

I have a data frame with two columns. The first column defines subsets of the data. I want to find all values in the second column that only appear in one subset in the first column.
For example, from:
df=data.frame(
data_subsets=rep(LETTERS[1:2],each=5),
data_values=c(1,2,3,4,5,2,3,4,6,7))
data_subsets data_values
A 1
A 2
A 3
A 4
A 5
B 2
B 3
B 4
B 6
B 7
I would want to extract the following data frame.
data_subsets data_values
A 1
A 5
B 6
B 7
I have been playing around with duplicated but I just can't seem to make it work. Any help is appreciated. There are a number of topics tackling similar problems, I hope I didn't overlook the answer in my searches!
EDIT
I modified the approach from #Matthew Lundberg of counting the number of elements and extracting from the data frame. For some reason his approach was not working with the data frame I had, so I came up with this, which is less elegant but gets the job done:
counts=rowSums(do.call("rbind",tapply(df$data_subsets,df$data_values,FUN=table)))
extract=names(counts)[counts==1]
df[match(extract,df$data_values),]

First, find the count of each element in df$data_values:
x <- sapply(df$data_values, function(x) sum(as.numeric(df$data_values == x)))
> x
[1] 1 2 2 2 1 2 2 2 1 1
Now extract the rows:
> df[x==1,]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
Note that you missed "A 5" above. There is no "B 5".

You had the right idea with duplicated. The trick is to combine fromLast = TRUE and fromLast = FALSE options to get a full list of non-duplicated rows.
!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE)
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
Indexing your data.frame with this vector gives:
df[!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE),]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7

A variant of P Lapointe's answer would be
df[! df$data_values %in% df[duplicated( unique(df)$data_values ), ]$data_values,]
The unique() deals with the possibility (not in your test data) that some rows in the data may be identical and you want to keep them once if the same data_values does not appear for distinct data_sets (or distinct other columns).

You can use the 'dplyr' and 'explore' library to overcome this problem.
library(dplyr)
library(explore)
df=data.frame(
data_subsets=rep(LETTERS[1:2],each=5),
data_values=c(1,2,3,4,5,2,3,4,6,7))
df %>% describe(data_subsets)
######## output ########
#variable = data_subsets
#type = character
#na = 0 of 10 (0%)
#unique = 2
# A = 5 (50%)
# B = 5 (50%)

R data frame select by global variable

I'm not sure how to do this without getting an error. Here is a simplified example of my problem.
Say I have this data frame DF
a b c d
1 2 3 4
2 3 4 5
3 4 5 6
Then I have a variable
x <- min(c(1,2,3))
Now I want do do the following
y <- DF[a == x]
But when I try to refer to some variable like "x" I get an error because R is looking for a column "x" in my data frame. I get the "undefined columns selected" error
How can I do what I am trying to do in R?

You may benefit from reading an Introduction to R, especially on matrices, data.frames and indexing. Your a is a column of a data.frame, your x is a scalar. The comparison you have there does not work.
Maybe you meant
R> DF$a == min(c(1,2,3))
[1] TRUE FALSE FALSE
R> DF[,"a"] == min(c(1,2,3))
[1] TRUE FALSE FALSE
R>
which tells you that the first row fits but not the other too. Wrapping this in which() gives you indices instead.

I think this is what you're looking for:
> x <- min(DF$a)
> DF[DF$a == x,]
a b c d
1 1 2 3 4
An easier way (avoiding the 'x' variable) would be this:
> DF[which.min(DF$a),]
a b c d
1 1 2 3 4
or this:
> subset(DF, a==min(a))
a b c d
1 1 2 3 4

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Select rows with identical columns from a data frame - r

Try this: R > index <- apply(f, 1, function(x) all(x==x[1])) R > index [1] TRUE NA NA FALSE R > index[is.na(index)] <- FALSE R > index [1] TRUE FALSE FALSE FALSE

The best (IMO) solution is from David Winsemius: which( rowSums(f==f[[1]]) == length(f) )

Related

How to match multiple columns without merge?

When trying to replace values, "missing values are not allowed in subscripted assignments of data frames"

Replacing occurrences of a number in multiple columns of data frame with another value in R

R - find all unique values among subsets of a data frame

R data frame select by global variable

Categories

Resources