Determining if values of previous rows repeat in dataframe - r

I have some data organized like this:
set.seed(12)
ids <- matrix(replicate(1000, sample(LETTERS[1:4], 2)), ncol = 2, byrow = TRUE)
df <- data.frame(
  event = 1:100,
  id1 = ids[, 1],
  id2 = ids[, 2],
  grp = rep(1:10, each = 100),
  stringsAsFactors = FALSE
)
head(df,10)
event id1 id2 grp
1 1 A C 1
2 2 D A 1
3 3 A D 1
4 4 A B 1
5 5 A D 1
6 6 B C 1
7 7 B D 1
8 8 B D 1
9 9 B D 1
10 10 C A 1
There are pairs of ids (id1 & id2); within a row they are never the same. There is a variable called grp with 10 groups, each of which can be considered a separate sample of data. The event variable runs from 1 to 100 within each group.
The first question I have is quite straightforward: within each group, for each row, is the combination of the two ids (id1-id2) the same as the previous row, the reverse of the previous row, or neither? Obviously, if there is an A-C combination on row 100 of one group, I am not interested in whether it is repeated or reversed on row 1 of the following group.
This is my temporary solution:
library(dplyr)  # for lag()

# Give each id pair an identifier:
df$pair <- paste(pmin(df$id1, df$id2), pmax(df$id1, df$id2))

# For each grp, work out using `lag` whether the previous row contains the
# same pair of ids, and if so whether they are in the same or reversed order:
df.sp <- split(df, df$grp)
df$value <- unlist(lapply(df.sp, function(x)
  ifelse(x$pair != lag(x$pair), NA, ifelse(x$id1 == lag(x$id1), 1, 0))))
This gives:
head(df,10)
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D NA
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C NA
This works: 0 indicates a reversal, 1 a copy, and NA neither.
The more complex question I am interested in is the following: within each group (grp), for each row, determine whether its pair of ids occurred previously anywhere in that grp. If it did, return whether the ids were in the same or reversed order the most recent time the pair occurred.
That result would look like this:
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D 1
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C 0
E.g., row 10 returns 0 because the combination A-C occurred previously in the reverse order (row 1); row 5 returns 1 because A-D occurred previously in the same order (row 3).

You're almost there! The second question is equivalent to the first: you just group by pair as well as grp. I converted the code to dplyr (though I appreciate the spirit behind keeping the question in base R). I also removed the second ifelse, replacing it with a numeric conversion of the logical, which should be more performant (and which some will find easier to read).
df %>%
  group_by(grp) %>%
  mutate(
    pair = paste(pmin(id1, id2), pmax(id1, id2)),
    prev_row = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))
  ) %>%
  group_by(grp, pair) %>%
  mutate(prev_any = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))) %>%
  head(10)
# Source: local data frame [10 x 7]
# Groups: grp, pair [5]
#
# event id1 id2 grp pair prev_row prev_any
# (int) (chr) (chr) (int) (chr) (dbl) (dbl)
# 1 1 A C 1 A C NA NA
# 2 2 D A 1 A D NA NA
# 3 3 A D 1 A D 0 0
# 4 4 A B 1 A B NA NA
# 5 5 A D 1 A D NA 1
# 6 6 B C 1 B C NA NA
# 7 7 B D 1 B D NA NA
# 8 8 B D 1 B D 1 1
# 9 9 B D 1 B D 1 1
# 10 10 C A 1 A C NA 0
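Since the question was deliberately kept close to base R, here is a rough base R sketch of the same pair-level grouping, assuming df$pair has been computed as in the question; lag1 is a hypothetical helper standing in for dplyr::lag.
# shift a vector down by one, padding with NA (stand-in for dplyr::lag)
lag1 <- function(x) c(NA, head(x, -1))
# within each (grp, pair) combination, compare id1 to its previous occurrence:
# 1 = same order, 0 = reversed, NA = first occurrence of the pair in the grp
df$prev_any <- ave(
  as.numeric(factor(df$id1)), df$grp, df$pair,
  FUN = function(x) as.numeric(x == lag1(x))
)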

For grouping, filtering and mutating tasks like this, I find dplyr to be very helpful. Here is one way I came up with to achieve your goal:
library(dplyr)
df %>%
  group_by(grp) %>%
  mutate(value = ifelse(id1 == lag(id1) & id2 == lag(id2), 1,
                 ifelse(id1 == lag(id2) & id2 == lag(id1), 0, NA)))
Within each group, this compares the id values against the previous row and conditionally assigns the new value column. Hope this helps.
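As a sketch of an alternative (my addition, not part of the original answer), dplyr::case_when can replace the nested ifelse calls, which some find easier to read:
# equivalent logic with case_when instead of nested ifelse
df %>%
  group_by(grp) %>%
  mutate(value = case_when(
    id1 == lag(id1) & id2 == lag(id2) ~ 1,
    id1 == lag(id2) & id2 == lag(id1) ~ 0,
    TRUE ~ NA_real_
  ))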

Related

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes; one, called inputs, looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on the corresponding column from v1 through v3: in the first example row you can see that id = 1 matches column v1 since the value of v equals 1, giving A. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The dataframe examples are simplified versions, as I have more records and the values go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>%
  pivot_longer(!id, names_prefix = "v", names_to = "v") %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
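If you would rather have the column named key as in the desired output, values_to can be set explicitly; a small variation on the same pipeline (the explicit by= also silences the join message):
key %>%
  pivot_longer(!id, names_prefix = "v", names_to = "v", values_to = "key") %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df, by = c("id", "v"))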
You can use a two-column matrix as the index argument to "[", so this is a one-liner. (Note that the data objects here are named d1 and d2, as I'm opposed to using df as the name of a data object.)
d1[-1][data.matrix(d2)]  # returns [1] "A" "A" "C" "D" "T"
So the full solution is:
cbind(d2, key = d1[-1][data.matrix(d2)])
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
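For completeness, a self-contained version of this approach, assuming d1 and d2 hold the question's two tables (data copied from the question):
d1 <- read.table(text = "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F", header = TRUE)
d2 <- read.table(text = "
id v
1 1
1 2
1 3
2 2
3 1", header = TRUE)
# data.matrix(d2) is a two-column (row, column) index matrix, so
# d1[-1][data.matrix(d2)] picks one cell per row of d2
cbind(d2, key = d1[-1][data.matrix(d2)])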
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- character(nrow(df2))
for (i in seq_len(nrow(df2))) {
  # +1L skips the id column when indexing into df
  key[i] <- df[df2$id[i], df2$v[i] + 1L]
}
df2$key <- key
df2
># id v key
># 1 1 1 A
># 2 1 2 A
># 3 1 3 C
># 4 2 2 D
># 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)
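For reference, a vectorized sketch of the same lookup without the explicit loop, under the same assumption that id equals the row number in df:
# look up one cell per (id, v) pair; +1L skips the id column
df2$key <- mapply(function(i, j) df[i, j + 1L], df2$id, df2$v)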

Which group meet the criterion a < b < c depending on condition

My title might not be very informative, but here is an example that illustrates my problem.
I have this dataframe:
df <- data.frame(
  cond1 = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
  group = c("F","V","M","F","V","M","F","V","M","F","V","M","F","V","M","F","V","M"),
  gene  = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
  value = c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4)
)
df
cond1 group gene value
1 1 F A 1
2 1 V A 2
3 1 M A 3
4 2 F A 4
5 2 V A 5
6 2 M A 6
7 3 F A 7
8 3 V A 8
9 3 M A 9
10 1 F B 1
11 1 V B 3
12 1 M B 2
13 2 F B 4
14 2 V B 3
15 2 M B 2
16 3 F B 2
17 3 V B 3
18 3 M B 4
What I would like to obtain is, for each gene, a count of how many different cond1 values have the value for group F smaller than the value for group V, which in turn is smaller than the value for group M.
In the first 3 lines, we are in gene A with cond1 = 1. The values corresponding to the groups are F = 1, V = 2, M = 3, so F < V < M holds for gene A when cond1 = 1.
My expected output for gene A is 3, as all cond1 groups meet F < V < M on value.
My expected output for gene B is 1, as only the cond1 = 3 group meets F < V < M on value.
My desired output would be ideally a dataframe with gene and the sum of cond1 than meet my criterion :
gene count
1 A 3
2 B 1
I would be very grateful for any tips on how I should proceed.
Check if all the data is in increasing order and count how many such values exist for each gene.
library(dplyr)
df %>%
  # If the data is not ordered, order it first using arrange:
  # arrange(gene, cond1, match(group, c('F', 'V', 'M'))) %>%
  group_by(gene, cond1) %>%
  summarise(cond = all(diff(value) > 0)) %>%
  summarise(count = sum(cond))
# gene count
# <chr> <int>
#1 A 3
#2 B 1
Using data.table
library(data.table)
setDT(df)[, .(cond = all(diff(value) > 0)), .(gene, cond1)][, .(count = sum(cond)), gene]
gene count
1: A 3
2: B 1
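A base R sketch of the same logic (my addition), assuming the rows within each (gene, cond1) block are already ordered F, V, M as in the example:
# logical matrix: gene x cond1, TRUE where values are strictly increasing
ok <- tapply(df$value, list(df$gene, df$cond1), function(x) all(diff(x) > 0))
data.frame(gene = rownames(ok), count = rowSums(ok))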

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find, for each group [i], the first row where the value in column "u" is greater than the value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function(m) {
  first(which(m[, 2] > v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems that ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector to the data by creating a data.frame with 'w' as the unique values of the 'w' column of 'q', then do a group_by on 'w' and get the first row index where 'u' is greater than the corresponding 'new' column value:
library(dplyr)
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  summarise(n = which(u > new)[1])
# or use findInterval:
# summarise(n = findInterval(new[1], u) + 1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP was concerned that the comparison starts from the beginning of the whole data, but that is not the case because we have a group_by operation. If we create a column of sequence numbers, it resets at each group:
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP, compare the value with 'u' (v[.GRP] < u), and get the index of the first TRUE with which.max:
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
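One caveat worth noting (my addition, not part of the original answer): which.max returns 1 even when no element of the group satisfies the condition, so if a group might contain no qualifying row, a guard is safer:
# return NA instead of a spurious 1 when no u exceeds v[.GRP]
setDT(q)[, {
  i <- which(v[.GRP] < u)
  if (length(i)) i[1L] else NA_integer_
}, by = w]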

Sort Data in the Table

For example, now I get the table
A B C
A 0 4 1
B 2 1 3
C 5 9 6
I would like to order the columns and rows by my own defined order, to achieve
B A C
B 1 2 3
A 4 0 1
C 9 5 6
This can be accomplished in base R. First we make the example data:
# make example data
df.text <- 'A B C
0 4 1
2 1 3
5 9 6'
df <- read.table(text = df.text, header = T)
rownames(df) <- LETTERS[1:3]
A B C
A 0 4 1
B 2 1 3
C 5 9 6
Then we simply re-order the columns and rows using a vector of named indices:
# re-order data
defined.order <- c('B', 'A', 'C')
df <- df[, defined.order]
df <- df[defined.order, ]
B A C
B 1 2 3
A 4 0 1
C 9 5 6
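As an aside (my addition), if the object is an actual matrix or table with dimnames rather than a data.frame, the same named indexing reorders both dimensions in one step:
# character indexing by dimnames puts rows and columns in the given order
m <- as.matrix(df)
m[defined.order, defined.order]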
If the defined order is given as
defined_order <- c("B", "A", "C")
and the initial table is created by
library(data.table)
# create data first
dt <- fread("
id A B C
A 0 4 1
B 2 1 3
C 5 9 6")
# note that the row names are added as their own id column
then you could achieve the desired result using data.table as follows:
# change column order
setcolorder(dt, c("id", defined_order))
# change row order; match() works for any defined order
# (order(defined_order) only coincidentally gives the right rows here)
dt[match(defined_order, id)]
# id B A C
# 1: B 1 2 3
# 2: A 4 0 1
# 3: C 9 5 6

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for loops, e.g. using the reshape2 package? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet.)
Edit: For those of you who happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal <- function(data, factors, responses, responsename) {
  library(reshape2)
  data$ID <- 1:nrow(data)                       # add an ID to preserve row order
  dL <- melt(data, id.vars = c("ID", factors))  # `melt` the data
  dL <- dL[order(dL$ID), ]                      # sort the molten data
  dL[, responsename] <- match(dL$variable, responses)  # convert responses to ordinal scores
  dL[, responsename] <- factor(dL[, responsename], ordered = TRUE)
  dL <- dL[dL$value != 0, ]                     # drop rows where `value == 0`
  # use `rep` to "expand" the data.frame & drop unwanted columns
  out <- dL[rep(rownames(dL), dL$value), c(factors, responsename)]
  rownames(out) <- NULL
  return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
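For newer codebases, a tidyr sketch of the same transformation (my addition; reshape2 has since been superseded by tidyr), using pivot_longer plus uncount to replicate the rows by count:
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(starts_with("count_"), names_prefix = "count_",
               names_to = "aggression", values_to = "n") %>%
  filter(n > 0) %>%                     # drop zero counts
  uncount(n) %>%                        # repeat each row n times
  mutate(aggression = factor(aggression, ordered = TRUE))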
