R: delete father's rows based on sons in hierarchical data

I'm working with some data like these:
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.frame(id,name)
data
> data
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
7 3 e
8 3 f
9 3 k
10 4 f
11 4 u
My goal is this: if a father has even one son that I do not want, remove all the rows with that father's id. For example, if I don't like the son e, the result should be:
> data_e
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Because the groups with id 2 and 3 contain e in their name column.
The task could also be "I do not like e and f together":
> data_eandf
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Or, "I don't want you if you have e or f":
> data_eorf
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
# 10 4 f
# 11 4 u
As you've noticed, to be clearer, I've "commented out" the rows that must be deleted.
I've searched, but I've only found questions based on a single column, like data[which(data$name=='e'),]; this removes rows only at the sons' level, not all the rows of the corresponding father. I've also thought of putting the data in wide format, pasting all the names of an id into a single cell, and checking for e with a function like grepl(), but I think this could be a problem with a large dataset (these data are just an example).
Do you have any idea about how to manage this?
Thanks in advance

Here's a function to handle the different cases
dislike1 <- c('e')
dislike2 <- c('e', 'f')
myfun <- function(df, dislike, ops = NULL) {
  require(dplyr)
  if (is.null(ops) || ops == 'OR') {
    df %>%
      group_by(id) %>%
      filter(!any(name %in% dislike)) %>%
      ungroup()
  } else if (ops == 'AND') {
    df %>%
      group_by(id) %>%
      filter(!all(dislike %in% name)) %>%
      ungroup()
  }
}
myfun(data, dislike1)
# A tibble: 5 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 4 f
# 5 4 u
myfun(data, dislike2, 'AND')
# A tibble: 8 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 4 f
# 8 4 u
myfun(data, dislike2, 'OR')
# A tibble: 3 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k

data[!(data$id %in% unique(data[data$name == 'e', 'id'])),]
unique(data[data$name == 'e', 'id']) gets the unique ids that have 'e' in the name field. Then you can use the %in% operator to find all the rows with those ids; the ! is a negation operator.
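The same idea extends to the "e or f" case with a vector of names; a minimal base-R sketch, assuming the data frame from the question:
# drop every father (id) whose group contains any of the disliked names
bad_ids <- unique(data$id[data$name %in% c('e', 'f')])
data[!(data$id %in% bad_ids), ]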

I have a data.table solution
require(data.table)
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.table(id,name)
# names to be deleted
to_del <- c("e","f")
# returns only ids without any of the names to be deleted
data[, .SD[!any(name %in% to_del), name], by = "id"]
id V1
1: 1 a
2: 1 b
3: 1 k
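An equivalent data.table idiom returns the whole sub-table conditionally, which keeps the original column name instead of V1:
# return all rows of a group only when no disliked name is present
data[, if (!any(name %in% to_del)) .SD, by = id]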

Related

Creating an identifier using pairs of row indices [duplicate]

I would like to generate indices to group observations based on two columns, but I want groups to be made of observations that share at least one value in common.
In the data below, I want to check if values in 'G1' and 'G2' are connected directly (appear on the same row), or indirectly via other intermediate values. The desired grouping variable is shown in 'g'.
For example, A is directly linked to Z (row 1) and X (row 2). A is indirectly linked to 'B' via X (A -> X -> B), and further linked to Y via X and B (A -> X -> B -> Y).
dt <- data.frame(id = 1:10,
                 G1 = c("A","A","B","B","C","C","C","D","E","F"),
                 G2 = c("Z","X","X","Y","W","V","U","s","T","T"),
                 g = c(1,1,1,1,2,2,2,3,4,4))
dt
# id G1 G2 g
# 1 1 A Z 1
# 2 2 A X 1
# 3 3 B X 1
# 4 4 B Y 1
# 5 5 C W 2
# 6 6 C V 2
# 7 7 C U 2
# 8 8 D s 3
# 9 9 E T 4
# 10 10 F T 4
I tried with group_indices from dplyr, but haven't managed it.
Using igraph, get the component membership, then map it on the names:
library(igraph)
# convert to graph, and get clusters membership ids
g <- graph_from_data_frame(dt[, c(2, 3, 1)])
myGroups <- components(g)$membership
myGroups
# A B C D E F Z X Y W V U s T
# 1 1 2 3 4 4 1 1 1 2 2 2 3 4
# then map on names
dt$group <- myGroups[dt$G1]
dt
# id G1 G2 group
# 1 1 A Z 1
# 2 2 A X 1
# 3 3 B X 1
# 4 4 B Y 1
# 5 5 C W 2
# 6 6 C V 2
# 7 7 C U 2
# 8 8 D s 3
# 9 9 E T 4
# 10 10 F T 4
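Since the two values on a row are connected by an edge, they always fall in the same component, so mapping on G2 gives the same grouping; a quick check, assuming the dt object from above:
identical(unname(myGroups[dt$G1]), unname(myGroups[dt$G2]))
# [1] TRUE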

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes, one called inputs looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combined them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on the columns v1 through v3: in the first example you can see that I match id = 1 and v1, since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The example dataframes are simplified versions, as I have more records and the values go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v') %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
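As a side note, passing the key columns explicitly silences the "Joining, by" message:
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v') %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df, by = c('id', 'v'))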
You can use a two-column matrix as the index argument to "[", so this is a one-liner. (Note that the data objects here are named d1 and d2; I'm opposed to using df as a data object name.)
d1[-1][data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So the full solution is:
cbind(d2, key = d1[-1][data.matrix(d2)])
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
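Note this relies on id equaling the row number in d1. If that cannot be assumed, a hedged variant of the same trick builds the row index explicitly with match:
# robust to ids that are not 1..n in order
idx <- cbind(match(d2$id, d1$id), d2$v)  # (row, column) pairs into d1[-1]
cbind(d2, key = d1[-1][idx])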
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
  key <- append(df[df2$id[i], (df2$v[i] + 1L)], key)
}
df2$key <- rev(key)
df2
># id v key
># 1 1 1 A
># 2 1 2 A
># 3 1 3 C
># 4 2 2 D
># 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)
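Growing a vector with append inside a loop copies it on every iteration, which gets slow for large inputs. A sketch of the same loop with preallocation (under the same assumption that id equals the row number), which also removes the need for rev:
key <- character(nrow(df2))
for (i in seq_len(nrow(df2))) {
  key[i] <- df[df2$id[i], df2$v[i] + 1L]
}
df2$key <- key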

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find, for each group [i], the first row where the value in column "u" is bigger than value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function(m) {
  first(which(m[, 2] > v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like, ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector to the data by creating a data.frame with 'w' as the unique values of the 'w' column of 'q', then group by 'w' and get the first row index where 'u' is greater than the corresponding vector value.
library(dplyr)
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  summarise(n = which(u > new)[1])
# or use findInterval:
# summarise(n = findInterval(new[1], u) + 1)
Output:
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP was concerned that the comparison starts from the beginning of the whole data, but that is not the case, because we have a group_by operation. If we create a column of sequence numbers, we can see it resets at each group:
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[, which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
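One caveat: which.max returns 1 when the condition is never TRUE, so a group where u never exceeds v would wrongly report row 1. A hedged guard for that edge case:
setDT(q)[, {
  hit <- v[.GRP] < u
  if (any(hit)) which.max(hit) else NA_integer_
}, by = w]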

Determining if values of previous rows repeat in dataframe

I have some data organized like this:
set.seed(12)
ids <- matrix(replicate(1000,sample(LETTERS[1:4],2)),ncol=2,byrow=T)
df <- data.frame(
  event = 1:100,
  id1 = ids[, 1],
  id2 = ids[, 2],
  grp = rep(1:10, each = 100), stringsAsFactors = FALSE)
head(df,10)
event id1 id2 grp
1 1 A C 1
2 2 D A 1
3 3 A D 1
4 4 A B 1
5 5 A D 1
6 6 B C 1
7 7 B D 1
8 8 B D 1
9 9 B D 1
10 10 C A 1
There are pairs of ids (id1 & id2). Within a row they are never the same. There is a variable called grp. There are 10 groups. Each group could be considered a separate sample of data. The event variable goes from 1-100 in each group.
The first question I have is quite straightforward. Within each group, for each row, is the combination of the two ids (id1-id2) the same as the previous row, the reverse of the previous row, or neither of these two options. Obviously, if there is an A-C combination on row 100 of one group, I am not interested in whether it is reversed, the same or whatever on row 1 of the following group.
This is my temporary solution:
#Give each id pair and identifier:
df$pair <- paste(pmin(df$id1,df$id2), pmax(df$id1,df$id2))
#For each grp, work out using `lag` if previous row contains same pair of ids, and if they are in same or reversed order:
df.sp <- split(df, df$grp)
df$value <- unlist(lapply(df.sp, function(x)
  ifelse(x$pair != lag(x$pair), NA, ifelse(x$id1 == lag(x$id1), 1, 0))))
This gives:
head(df,10)
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D NA
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C NA
This works - showing 0 as a reversal, 1 as a copy and NA as neither.
The more complex question I am interested in is the following. Within each group (grp), for each row, find if its combination of two ids (the pair) previously occurred in that grp. If they did, then return whether they were in the same order or reversed order the immediate previous time they occurred.
That result would look like this:
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D 1
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C 0
e.g. row 10 is returned as a 0 because the combination A-C previously occurred and was in the reverse order (row 1). on row 5 a 1 is returned as A-D previously occurred in the same order on row 3.
You're almost there! The second question is equivalent to the first question, just grouping by pair as well as group. I converted the code to dplyr (though I appreciate the spirit behind keeping the question in base). I also removed the second ifelse, replacing it with a numeric conversion of the logical, which should be more performant (and some will find easier to read).
df %>%
  group_by(grp) %>%
  mutate(
    pair = paste(pmin(id1, id2), pmax(id1, id2)),
    prev_row = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))
  ) %>%
  group_by(grp, pair) %>%
  mutate(prev_any = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))) %>%
  head(10)
# Source: local data frame [10 x 7]
# Groups: grp, pair [5]
#
# event id1 id2 grp pair prev_row prev_any
# (int) (chr) (chr) (int) (chr) (dbl) (dbl)
# 1 1 A C 1 A C NA NA
# 2 2 D A 1 A D NA NA
# 3 3 A D 1 A D 0 0
# 4 4 A B 1 A B NA NA
# 5 5 A D 1 A D NA 1
# 6 6 B C 1 B C NA NA
# 7 7 B D 1 B D NA NA
# 8 8 B D 1 B D 1 1
# 9 9 B D 1 B D 1 1
# 10 10 C A 1 A C NA 0
For such grouping, filtering and mutating tasks, I find dplyr to be very helpful. Here is one way you can achieve your goal:
df %>%
  group_by(grp) %>%
  mutate(value = ifelse(id1 == lag(id1) & id2 == lag(id2), 1,
                        ifelse(id1 == lag(id2) & id2 == lag(id1), 0, NA)))
Within each group, you compare the ID values against the previous row and conditionally assign a new value column. Note that this covers only the immediate-previous-row comparison (the first question); for the "last previous occurrence of the pair" variant, you also need to group by the normalized pair, as in the answer above. Hope this helps.

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you that would also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal <- function(data, factors, responses, responsename) {
  library(reshape2)
  data$ID <- 1:nrow(data)                      # add an ID to preserve row order
  dL <- melt(data, id.vars = c("ID", factors)) # `melt` the data
  dL <- dL[order(dL$ID), ]                     # sort the molten data
  dL[, responsename] <- match(dL$variable, responses) # convert responses to ordinal scores
  dL[, responsename] <- factor(dL[, responsename], ordered = TRUE)
  dL <- dL[dL$value != 0, ]                    # drop rows where `value == 0`
  # use `rep` to "expand" the data.frame & drop unwanted columns
  out <- dL[rep(rownames(dL), dL$value), c(factors, responsename)]
  rownames(out) <- NULL
  return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data, c("factor1", "factor2"), c("count_1", "count_2", "count_3"), "aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
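For completeness, current tidyr can do the same reshape, with uncount replacing the rep step; a minimal sketch, assuming the wide data shown in the question:
library(dplyr)
library(tidyr)
data <- read.table(text = "
factor1 factor2 count_1 count_2 count_3
a a 1 2 0
a b 3 0 0
b a 1 2 3
b b 2 2 0
c a 3 4 0
c b 1 1 0", header = TRUE)
data %>%
  pivot_longer(starts_with('count_'),
               names_prefix = 'count_',
               names_to = 'aggression') %>%
  uncount(value) %>%   # repeat each row 'value' times; a count of 0 drops the row
  mutate(aggression = factor(aggression, ordered = TRUE))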
