Arrange by multiple conditions in R

Having the following dataset:
test <- data.frame(name= c("A", "B", "C", "D", "E"), v1 = c(2, 4, 1, 1, 2), v2 = c(3, 4, 2, 1, 5))
name v1 v2
A 2 3
B 4 4
C 1 2
D 1 1
E 2 5
I want to test a concept of actor/node dominance: for each row/entry, I want to find the other rows whose values it matches or exceeds in every column. For example, B is at least as high as A, C and D on both v1 and v2, so it "dominates" those rows. E is likewise at least as high as A, C and D, so it dominates those 3 rows.
Mathematically speaking, row i dominates row j when v1_i >= v1_j and v2_i >= v2_j.
Arranging or sorting by columns doesn't work, because sorting first by one column and then by another doesn't show which rows dominate which.
EDIT: Just to add an example, an end output would be:
B dominates A, C, D
E dominates A, D, C
C dominates D
A dominates C, D
It doesn't really matter what the output looks like: a directed network/matrix, or a table with a variable listing all the dominated letters, would both be fine.

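I found a way; hope this helps :)
# Work on a copy of the question's data frame (avoids naming a variable c, which masks base::c)
df <- test
# Note: this ranks rows by the sum of v1 and v2, which is not the same as pairwise dominance
# (E is listed under B below even though E has the higher v2)
df$v3 <- rowSums(df[, -1])
df <- df[order(df$v3, decreasing = TRUE), ]
k <- nrow(df)
for (i in seq_len(k - 1)) {
  # (i + 1):k needs the parentheses; i + 1:k would be parsed as i + (1:k)
  lower <- as.character(df$name[(i + 1):k])
  cat(as.character(df$name[i]), "greater than", lower, "\n", sep = " ")
}
So your output will be: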
B greater than E A C D
E greater than A C D
A greater than C D
C greater than D

for loops are often slower and more verbose than vectorized or apply-style code in R, so it is worth avoiding them where you can.
You can simply do it with apply:
# Names column
names = c("A", "B", "C", "D", "E")
# Dataframe
test <- data.frame(name= names, v1 = c(2, 4, 1, 1, 2), v2 = c(3, 4, 2, 1, 5))
# Display function
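findLowerValues <- function(row, test, names) {
  # apply() passes each row as a character vector, so convert back to numeric before comparing
  rep <- test$v1 <= as.numeric(row["v1"]) & test$v2 <= as.numeric(row["v2"]) & test$name != row["name"]
  cat(row["name"], 'dominates', names[rep], "\n")
}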
# Apply the display function
# axis : row
# Extra args: the full dataset and names
apply(test, 1, findLowerValues, test=test, names=names)
# A dominates C D
# B dominates A C D
# C dominates D
# D dominates
# E dominates A C D
# NULL

Here is a suggestion. It's probably not the most elegant solution.
We can have a function compare that checks if one letter dominates the other (unless it is the identical letter), and then use two nested sapplys.
my_letters <- c("A", "B", "C", "D", "E")
test <- data.frame(name= my_letters, v1 = c(2, 4, 1, 1, 2), v2 = c(3, 4, 2, 1, 5))
get_row<-function(letter){
test[test$name==letter,2:ncol(test)]
}
compare<-function(letter,i){
if(letter!=i){
if(!sum(get_row(letter) < get_row(i))){
return(i)
}
}
}
result <- sapply(my_letters, function(let) unlist(sapply(my_letters, compare, letter=let)))
results in a list:
$A
C D
"C" "D"
$B
A C D
"A" "C" "D"
$C
D
"D"
$D
NULL
$E
A C D
"A" "C" "D"

We first split the data frame into a list of rows and pass it to mapply. Each row is repeated nrow(test) times and compared with the entire data frame test, and we select the names of the rows whose values are all less than or equal to that row's values. As this also matches each row with itself, we use setdiff to remove the row's own name.
mapply(function(x, y) setdiff(
test$name[rowSums(x[rep(1, nrow(test)),] >= test[-1]) == ncol(test) - 1], y),
split(test[-1], test$name), test$name)
#$A
#[1] "C" "D"
#$B
#[1] "A" "C" "D"
#$C
#[1] "D"
#$D
#character(0)
#$E
#[1] "A" "C" "D"
data
test <- data.frame(name= c("A", "B", "C", "D", "E"), v1 = c(2, 4, 1, 1, 2),
v2 = c(3, 4, 2, 1, 5), stringsAsFactors = FALSE)

If you don't mind a data.table solution, a possibility is using non-equi joins as follows:
library(data.table)
setDT(test)
test[test, on=.(v1<=v1, v2<=v2), .(actor=i.name, node=x.name), by=.EACHI, allow.cartesian=TRUE][
actor!=node, .(actor, node)]
output:
actor node
1: A C
2: A D
3: B A
4: B C
5: B D
6: C D
7: E A
8: E C
9: E D
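For comparison, the dominance condition from the question (v1_i >= v1_j and v2_i >= v2_j) can also be sketched in base R as a logical matrix built with outer(). This is only an illustrative sketch; the names dom and dominated are assumptions introduced here, not part of any answer above.

```r
test <- data.frame(name = c("A", "B", "C", "D", "E"),
                   v1 = c(2, 4, 1, 1, 2),
                   v2 = c(3, 4, 2, 1, 5),
                   stringsAsFactors = FALSE)

# dom[i, j] is TRUE when row i is >= row j on both v1 and v2
dom <- outer(test$v1, test$v1, ">=") & outer(test$v2, test$v2, ">=")
diag(dom) <- FALSE                          # a row does not dominate itself
dimnames(dom) <- list(test$name, test$name)

# For each row, the names of the rows it dominates
dominated <- lapply(test$name, function(n) test$name[dom[n, ]])
names(dominated) <- test$name
dominated
```

The matrix dom can be read directly as the adjacency matrix of a directed network, which is one of the output forms the question asks for.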

Related

how to subset data in R using conditional operation booleans [duplicate]

I would like to subset (filter) a dataframe by specifying which rows not (!) to keep in the new dataframe. Here is a simplified sample dataframe:
data
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c
For example, if a row of column v1 has a "b", "d", or "e", I want to get rid of that row of observations, producing the following dataframe:
v1 v2 v3 v4
a v d c
a v d d
c k d c
c r p g
I have been successful at subsetting based on one condition at a time. For example, here I remove rows where v1 contains a "b":
sub.data <- data[data[ , 1] != "b", ]
However, I have many, many such conditions, so doing it one at a time is not desirable. I have not been successful with the following:
sub.data <- data[data[ , 1] != c("b", "d", "e"), ]
or
sub.data <- subset(data, data[ , 1] != c("b", "d", "e"))
I've tried some other things as well, like !%in%, but that doesn't seem to exist.
Any ideas?
Try this
subset(data, !(v1 %in% c("b","d","e")))
The ! should be around the outside of the statement:
data[!(data$v1 %in% c("b", "d", "e")), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
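Since the question mentions that !%in% does not exist, one option is to define a negated operator yourself. A minimal sketch, assuming a helper named %notin% (a hypothetical name, not a base R operator) built with base R's Negate(); the sample data keeps only two of the question's columns for brevity:

```r
# Hypothetical helper: %notin% is not part of base R; it is defined here for illustration
`%notin%` <- Negate(`%in%`)

data <- data.frame(v1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
                   v4 = c("c", "d", "g", "h", "c", "g", "x", "c", "b", "c"),
                   stringsAsFactors = FALSE)

# Keep only the rows whose v1 is not one of the unwanted values
sub.data <- data[data$v1 %notin% c("b", "d", "e"), ]
sub.data$v1
```

This behaves the same as !(v1 %in% ...); whether the extra operator is worth defining is a matter of taste.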
You can also accomplish this by breaking things up into separate logical statements by including & to separate the statements.
subset(my.df, my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e")
This is not elegant and takes more code but might be more readable to newer R users. As pointed out in a comment above, subset is a "convenience" function that is best used when working interactively.
This answer is meant to explain why rather than how. The '==' operator in R is vectorized in the same way as the '+' operator. It matches the elements of whatever is on the left side to the elements of whatever is on the right side, element by element. For example:
> 1:3 == 1:3
[1] TRUE TRUE TRUE
Here the first test is 1==1 which is TRUE, the second 2==2 and the third 3==3. Notice that this returns FALSE in the first and third elements because the order is wrong:
> 3:1 == 1:3
[1] FALSE TRUE FALSE
Now if one object is shorter than the other, the shorter object is repeated as many times as it takes to match the longer object. If the length of the longer object is not a multiple of the length of the shorter object, you get a warning that not all elements are repeated. For example:
> 1:2 == 1:3
[1] TRUE TRUE FALSE
Warning message:
In 1:2 == 1:3 :
longer object length is not a multiple of shorter object length
Here the first match is 1==1, then 2==2, and finally 1==3 (FALSE) because the left side is smaller. If one of the sides is only one element then that is repeated:
> 1:3 == 1
[1] TRUE FALSE FALSE
The correct operator to test if an element is in a vector is indeed '%in%' which is vectorized only to the left element (for each element in the left vector it is tested if it is part of any object in the right element).
Alternatively, you can use '&' to combine two logical statements. '&' takes two elements and checks elementwise if both are TRUE:
> 1:3 == 1 & 1:3 != 2
[1] TRUE FALSE FALSE
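Applied to the question's column, the recycling behaviour described above explains why != with a vector on the right does not filter as intended. A small sketch, assuming a vector v1 that mirrors the v1 column of the sample data (the names drop_values, recycled and keep are illustrative):

```r
v1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
drop_values <- c("b", "d", "e")

# Recycled comparison: v1[1] vs "b", v1[2] vs "d", v1[3] vs "e", v1[4] vs "b", ...
# (R also warns here because 10 is not a multiple of 3)
recycled <- suppressWarnings(v1 != drop_values)

# Membership test: each element of v1 is checked against the whole set
keep <- !(v1 %in% drop_values)

v1[recycled]  # still contains some "b", "d" and "e" entries
v1[keep]      # only the "a" and "c" entries remain
```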
data <- data[-which(data[,1] %in% c("b","d","e")),]
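One caveat with the -which() form above: if no row matches, which() returns an empty vector, and negative indexing with an empty vector selects nothing rather than everything. A minimal sketch of that edge case, using a small made-up data frame (an assumption for illustration) that contains none of the unwanted values:

```r
# Illustrative data with no "b", "d" or "e" in v1
data <- data.frame(v1 = c("a", "c"), v2 = c("v", "k"), stringsAsFactors = FALSE)

drop <- which(data$v1 %in% c("b", "d", "e"))   # integer(0): nothing to drop

nrow(data[-drop, ])                            # 0 rows: every row is lost
nrow(data[!(data$v1 %in% c("b", "d", "e")), ]) # 2 rows: logical indexing keeps them
```

For that reason the logical form with ! is usually the safer choice when the set of matches might be empty.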
my.df <- read.table(textConnection("
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c"), header = TRUE)
my.df[which(my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e" ), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
sub.data<-data[ data[,1] != "b" & data[,1] != "d" & data[,1] != "e" , ]
Longer, but simple to understand (I guess), and it can be used with multiple columns, or even combined with !is.na( data[,1]).
And also
library(dplyr)
data %>% filter(!v1 %in% c("b", "d", "e"))
or
data %>% filter(v1 != "b" & v1 != "d" & v1 != "e")
or
data %>% filter(v1 != "b", v1 != "d", v1 != "e")
The last form works because filter() combines comma-separated conditions with &.

R: create new column with name coming from variable

I have a dataframe with several columns and would like to add a new column and name it according to a previous variable. For example:
df <- data.frame("A" = c(1, 2, 3, 4), "B" = c("a", "c", "d", "b"))
Variable <- "C"
This is part of a function where the variable will be changing and rather than each time specifying:
df$C <- NA
I would like a one-liner that uses the value of "Variable" to name the additional column.
Try [ instead of $:
> df[, Variable] <- NA
> df
A B C
1 1 a NA
2 2 c NA
3 3 d NA
4 4 b NA
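Because the question says this is part of a function, the same idea can be wrapped up as follows. This is a sketch only: add_column is a hypothetical helper name, and it uses [[ ]] assignment, which accepts a column name held in a variable in the same way as [ , ] does.

```r
# Hypothetical helper: adds a column whose name is given as a string
add_column <- function(df, col_name, value = NA) {
  df[[col_name]] <- value   # [[ ]] takes the name from the variable, unlike $
  df
}

df <- data.frame(A = c(1, 2, 3, 4), B = c("a", "c", "d", "b"))
Variable <- "C"
df <- add_column(df, Variable)
names(df)
```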
If the name of the data.frame is itself stored in a variable, this might be helpful:
df <- data.frame("A" = c(1, 2, 3, 4), "B" = c("a", "c", "d", "b") )
Variable<-"C"
dfname<-"df"
df<-within ( assign(dfname , get(dfname) ),
assign(Variable, NA )
)

Access a single cell / subsetted column of a data.table

How can I access just a single cell in a data.table in the way as I could for a data.frame:
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdf[ mdf$a == "B", "c" ]
[1] 2
Doing the analogous operation on a data.table returns a data.table that includes the key column(s):
mdt <- data.table( mdf, key = "a" )
mdt[ "B", c ]
a c
1: B 2
mdt[ "B", c ][ , c]
[1] 2
Did I miss a parameter, or does it have to be done as in the last line?
Either of these will avoid repeating the c but are not as efficient since they involve computing the first [] as well as the final answer:
> mdt[ "B", ][["c"]]
[1] 2
> mdt[ "B", ][, c]
[1] 2
Recent versions of data.table make this easier
mdt[ "B", c]
# [1] 2
Original answer was returning a data.table like:
mdt['B', 'c']
# c
# 1: 2
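As a further illustration, a single value can also be pulled out by filtering on the column explicitly instead of relying on the key. A minimal sketch, assuming the data.table package is installed; the values in column b are fixed placeholders here rather than the rnorm() draws from the question:

```r
library(data.table)

# Same shape as the question's table; column b uses fixed illustrative values
mdt <- data.table(a = c("A", "B", "C"), b = c(0.1, 0.2, 0.3), c = 1:3, key = "a")

mdt[a == "B", c]        # column c as a plain vector
mdt[a == "B"][["c"]]    # same value via list-style extraction
```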

Merge and paste duplicate columns in R

Suppose I have two data frames with some common variable x:
df1 <- data.frame(
x=c(1, 2, 3, 4),
y=c("a", "b", "c", "d")
)
df2 <- data.frame(
x=c(1, 1, 2, 2, 3, 4, 5),
z=c("A", "B", "C", "D", "E", "F", "G")
)
We can assume that each entry of the variable we're merging over, x, appears exactly once in df1; however, it may appear an arbitrary number of times in df2.
I want to merge df2 'into' df1, while preserving df1. Is there a fast way of merging these two data frames such that the merged output would be of the form (for example):
df_merged <- data.frame(
x=c(1, 2, 3, 4),
y=c("a", "b", "c", "d"),
z=c("A B", "C D", "E", "F")
)
Essentially, I want df_merged to be a composition of the original df1, in addition to any variables in df2 coerced to match the format of df1. The various incantations of merge will append new rows to the merged output, which I want to avoid.
Speed is also a priority since I'll be merging fairly large data frames.
merge( df1,
aggregate(df2$z , df2[1], FUN=paste, collapse=" ", sep=""),
by.x="x", by.y=1)
x y x
1 1 a A B
2 2 b C D
3 3 c E
4 4 d F
Warning message:
In merge.data.frame(df1, aggregate(df2$z, df2[1], FUN = paste, collapse = " ", :
column name ‘x’ is duplicated in the result
> M1 <- .Last.value
> names(M1)[3] <- "z"
> M1
x y z
1 1 a A B
2 2 b C D
3 3 c E
4 4 d F
Another option:
df2.z <- with(df2, tapply(z, x, paste, collapse=' '))
transform(df1, z=df2.z[match(x, names(df2.z))])
# x y z
# 1 1 a A B
# 2 2 b C D
# 3 3 c E
# 4 4 d F
If df1$x is in order, then use df2.z[names(df2.z) %in% x] in the transform statement.
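The aggregate-then-merge idea from the answers above can also be sketched with the formula interface of aggregate(), which keeps the collapsed column named z and so avoids the duplicated-name warning. This is an illustrative variant, not a benchmarked solution; df2_collapsed is a name introduced here.

```r
df1 <- data.frame(x = c(1, 2, 3, 4),
                  y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1, 1, 2, 2, 3, 4, 5),
                  z = c("A", "B", "C", "D", "E", "F", "G"),
                  stringsAsFactors = FALSE)

# Collapse df2 to one row per x, pasting the z values together
df2_collapsed <- aggregate(z ~ x, data = df2, FUN = paste, collapse = " ")

# all.x = TRUE keeps every row of df1 and drops x values found only in df2
df_merged <- merge(df1, df2_collapsed, by = "x", all.x = TRUE)
df_merged
```

For large inputs, timing this against the tapply and data.table approaches on your own data would be the way to decide between them.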
I'm submitting this question with my own potential answer, but it is fairly slow and I'm curious what other methods might be available.
by <- "x"
df2_processed <- as.data.frame(
sapply( names(df2), function(x) {
tapply( df2[[x]], df2[[by]], function(xx) {
if( x == by ) {
return(xx[1])
} else {
paste(xx, collapse=" ")
}
})
}), optional=TRUE, stringsAsFactors=FALSE )
merge( df1, df2_processed, all.x=TRUE )
