How to flag a progressive decrease in values in list - r

I have a dataframe with columns containing lists of values:
dt
   onset_l  coda_l
1     3, 7 7, 4, 1
3        7 7, 1, 3
12    1, 7 7, 4, 1
21    6, 7 7, 4, 1
23       7 7, 1, 5
What I want to do is create a new column, say, coda_flag, that flags whether the values in column coda_l progressively decrease or not.
I've tried this lapply method:
dt$coda_flag <- lapply(dt$coda_l, function(x) ifelse(x[1] > x[2] & x[2] > x[3], 1, 0))
The output is okay but I dislike that I have to enumerate x[1], x[2] and so on because in some cases I may not know the exact number of values in the lists.
The desired output would be:
dt
   onset_l  coda_l coda_flag
1     3, 7 7, 4, 1         1  # values do progressively decrease
3        7 7, 1, 3         0  # values do not progressively decrease
12    1, 7 7, 4, 1         1
21    6, 7 7, 4, 1         1
23       7 7, 1, 5         0
How can this be achieved?
Reproducible data:
dt <- structure(list(onset_l = list(c("3", "7"), "7", c("1", "7"),
c("6", "7"), "7"), coda_l = list(c("7", "4", "1"), c("7",
"1", "3"), c("7", "4", "1"), c("7", "4", "1"), c("7", "1", "5"
))), row.names = c(1L, 3L, 12L, 21L, 23L), class = "data.frame")

You can use is.unsorted() - although it can only detect increasing order so it's necessary to reverse the vector first.
dt$coda_flag <- +!sapply(dt$coda_l, function(x) is.unsorted(rev(as.numeric(x))))
onset_l coda_l coda_flag
1 3, 7 7, 4, 1 1
3 7 7, 1, 3 0
12 1, 7 7, 4, 1 1
21 6, 7 7, 4, 1 1
23 7 7, 1, 5 0
Note the strictly argument controls what happens in the event of tied values.
x <- c(1, 2, 2, 3)
is.unsorted(x)
[1] FALSE
is.unsorted(x, strictly = TRUE)
[1] TRUE
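The same check can also be written with diff(), which makes the choice between strict and non-strict decrease explicit. This is only a minimal sketch: the helper name is_decreasing and the small coda_l list (including the tied vector) are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper: TRUE when no value is larger than the one before it
# (or, with strictly = TRUE, when every value is smaller than the one before it).
is_decreasing <- function(x, strictly = FALSE) {
  d <- diff(as.numeric(x))
  if (strictly) all(d < 0) else all(d <= 0)
}

# Illustrative list column; the third element contains a tie (7, 7).
coda_l <- list(c("7", "4", "1"), c("7", "1", "3"), c("7", "7", "1"))

flags_loose  <- vapply(coda_l, is_decreasing, logical(1))
flags_strict <- vapply(coda_l, is_decreasing, logical(1), strictly = TRUE)
```

The tied vector is accepted by the non-strict check and rejected by the strict one, which mirrors what the strictly argument of is.unsorted() does.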

One dplyr and purrr solution could be:
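library(dplyr)
library(purrr)
dt %>%
  # diff() <= 0 means no value is larger than its predecessor; use < 0 to reject ties
  mutate(coda_flag = map_int(.x = coda_l, ~ +(all(diff(as.numeric(.x)) <= 0))))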
onset_l coda_l coda_flag
1 3, 7 7, 4, 1 1
2 7 7, 1, 3 0
3 1, 7 7, 4, 1 1
4 6, 7 7, 4, 1 1
5 7 7, 1, 5 0

You can also try with a loop:
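# Flag
dt$Flag <- NA
# Loop over the rows
for (i in seq_len(nrow(dt)))
{
  # Extract elements
  vecvar <- do.call(c, dt$coda_l[i])
  # Compute successive differences in the original order (no sorting,
  # otherwise the order being tested is lost)
  difvec <- diff(as.numeric(vecvar))
  # Assign 1 only when every step goes down
  dt$Flag[i] <- ifelse(all(difvec < 0), 1, 0)
}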
Output:
dt
onset_l coda_l Flag
1 3, 7 7, 4, 1 1
3 7 7, 1, 3 0
12 1, 7 7, 4, 1 1
21 6, 7 7, 4, 1 1
23 7 7, 1, 5 0

An option with data.table
library(data.table)
setDT(dt)[, coda_flag := +(sapply(coda_l, function(x)
  all(as.numeric(x) - shift(as.numeric(x), fill = first(as.numeric(x))) <= 0)))]
Output:
dt
# onset_l coda_l coda_flag
#1: 3,7 7,4,1 1
#2: 7 7,1,3 0
#3: 1,7 7,4,1 1
#4: 6,7 7,4,1 1
#5: 7 7,1,5 0

Related

dplyr solution: absolute difference of two values in one column matched by other column

I have a dataframe that looks like this, but there will be many more IDs:
# Groups: ID [1]
ID ARS stim
<int> <int> <chr>
1 3 0 1
2 3 4 2
3 3 2 3
4 3 3 4
5 3 1 5
6 3 0 6
7 3 2 10
8 3 4 11
9 3 0 12
10 3 3 13
11 3 2 14
12 3 2 15
I would like to calculate the sum of the absolute difference abs() between the values in ARS, e.g. for stim=1 and stim=10 plus for stim=2 and stim=11 and so on.
Any good solutions are appreciated!
The desired output calculation is:
abs(0-2) + abs(4-4) + abs(2-0) + abs(3-3) + abs(1-2) + abs(0-2)
Hence, 2+0+2+0+1+2
Output for ID==3: 7
A possible solution:
library(dplyr)
df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), ARS = c(0, 4, 2, 3, 1, 0, 2, 4, 0, 3, 2, 2), stim = c(1, 2, 3, 4, 5, 6,
10, 11, 12, 13, 14, 15)), row.names = c(NA, -12L), class = "data.frame")
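df %>%
  group_by(ID) %>%
  # match() looks up each stim value explicitly instead of relying on
  # vector recycling and on the rows being in stim order
  summarise(value = sum(abs(ARS[match(1:6, stim)] - ARS[match(10:15, stim)])),
            .groups = "drop") %>%
  pull(value)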
#> [1] 7
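Since the question mentions that there will be many more IDs, here is a rough base R sketch of the same pairing computed per ID. The second ID and its ARS values are invented for illustration, and the pairing of stim 1 to 6 with stim 10 to 15 is assumed to hold for every ID.

```r
# Made-up data with two IDs; ID 3 repeats the values from the question.
df2 <- data.frame(
  ID   = rep(c(3, 4), each = 12),
  ARS  = c(0, 4, 2, 3, 1, 0, 2, 4, 0, 3, 2, 2,
           1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  stim = rep(c(1:6, 10:15), times = 2)
)

# For each ID, pair stim 1..6 with stim 10..15 and sum the absolute differences.
paired_sum <- sapply(split(df2, df2$ID), function(g) {
  sum(abs(g$ARS[match(1:6, g$stim)] - g$ARS[match(10:15, g$stim)]))
})
paired_sum
```

The result is a named vector with one sum per ID, so the value for ID 3 can be read off as paired_sum["3"].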

replace negative values with na using na_if{dplyr}

Let's say I have the following dataframe:
dat <- tribble(
~V1, ~V2,
2, -3,
3, 2,
1, 3,
3, -4,
5, 1,
3, 2,
1, -4,
3, 4,
4, 1,
3, -5,
4, 2,
3, 4
)
How can I replace negative values with NA using na_if()? I know how to do this using ifelse(), but I can't come up with a correct condition for na_if():
> dat %>%
+ mutate(V2 = ifelse(V2 < 0, NA, V2))
# A tibble: 12 x 2
V1 V2
<dbl> <dbl>
1 2 NA
2 3 2
3 1 3
4 3 NA
5 5 1
6 3 2
7 1 NA
8 3 4
9 4 1
10 3 NA
11 4 2
12 3 4
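One thing worth noting is that na_if(x, y) turns the elements of x that are equal to y into NA; it does not take a logical condition, so a test such as V2 < 0 does not map onto it directly. A possible alternative, shown here only as a sketch on a plain data frame holding the first four rows of the data, is base R's replace(), which accepts a logical index:

```r
# First four rows of the example data, as a plain data frame.
dat <- data.frame(V1 = c(2, 3, 1, 3), V2 = c(-3, 2, 3, -4))

# replace() sets every element selected by the logical index to NA.
dat$V2 <- replace(dat$V2, dat$V2 < 0, NA)
dat$V2
```

Inside a dplyr pipeline the same call should work within mutate(), i.e. mutate(V2 = replace(V2, V2 < 0, NA)).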

How to find corresponding ID comparing values in two columns?

I have the following problem, which was illustrated with a picture in the original post. I would appreciate help solving it.
In Table 1, the IDs are correctly linked with the values in column A. Column B contains some values which are not ordered and whose corresponding IDs are not given in the same rows. We need to find the IDs of the values in column B by using the IDs of column A.
If we run the following code in R, we find the IDs corresponding to the values in column B:
mydata <- read.csv('C:/Users/Windows/Desktop/practice_1.csv', header = TRUE)
df <- data.frame(mydata$B, mydata$A, mydata$ID)
library(qdap)
df[, "New ID"] <- df[, 1] %l% df[, -1]
After running the code above, the new IDs appear in the column New ID, as in Table 2.
What you need is a simple match operation:
table1$ID2 <- table1$ID1[match(table1$z, table1$y)]
table1
# ID1 y z ID2
# 1 0 1 11 10
# 2 1 2 3 2
# 3 2 3 5 4
# 4 3 4 4 3
# 5 4 5 8 7
# 6 5 6 7 6
# 7 6 7 15 15
# 8 7 8 6 5
# 9 8 9 2 1
# 10 9 10 16 17
# 11 10 11 1 0
# 12 11 12 NA NA
# 13 15 15 NA NA
# 14 17 16 NA NA
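To see what match() is doing in the line above, here is a small stand-alone sketch with made-up vectors (ids, y and z are illustrative names, not the columns of table1). match(z, y) returns, for each element of z, the position of its first occurrence in y, or NA when there is none; indexing the ID vector with those positions performs the lookup.

```r
ids <- c(0, 1, 2, 3)        # IDs known for the values in y
y   <- c(10, 20, 30, 40)
z   <- c(30, 10, 99, NA)    # values whose IDs we want

pos <- match(z, y)          # position of each z value in y, NA if absent
id2 <- ids[pos]             # look up the ID at those positions
```

Values of z that do not occur in y (here 99 and NA) simply end up with an NA ID.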
The next time you ask a question where sample data is necessary (most questions), please provide the data in this format:
Data
# dput(table1)
structure(list(ID1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 17), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16), z = c(11, 3, 5, 4, 8, 7, 15, 6, 2, 16, 1, NA, NA, NA), ID2 = c(10, 2, 4, 3, 7, 6, 15, 5, 1, 17, 0, NA, NA, NA)), row.names = c(NA, -14L), class = "data.frame")

build a network edge table from a sparse table

I don't know exactly how to explain it but...
I have a sparse table where each group represents a level. The columns are ordered; that is, the downstream (left) column represents a child node and the upstream (right) column represents a parent node.
I'd like a two-column table where the 1st column is the parent node and the 2nd is the child node and, if possible, a 3rd column with the length (the number of final nodes) of each parent.
An example follows:
>tt <- tibble(
ID = letters[1:8],
`1` = c( 1, 1, 1, 1, 2, 2, 2, 2),
`2` = c( 3, 3, 4, 4, 5, 5, 5, 6),
`3` = c( 7, 7, 8, 9,10,10,11,12)
)
> tt
# A tibble: 8 x 4
ID `1` `2` `3`
<chr> <dbl> <dbl> <dbl>
1 a 1 3 7
2 b 1 3 7
3 c 1 4 8
4 d 1 4 9
5 e 2 5 10
6 f 2 5 10
7 g 2 5 11
8 h 2 6 12
>dput(tt)
structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h"),
`1` = c(1, 1, 1, 1, 2, 2, 2, 2), `2` = c(3, 3, 4, 4, 5, 5,
5, 6), `3` = c(7, 7, 8, 9, 10, 10, 11, 12)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
the result should be:
>ttt <- tibble(
parent = c(1,1,2,2,3,4,4, 5, 5, 6, 7,7,8,9,10,10,11,12),
child = c(3,4,5,6,7,8,9,10,11,12, letters[1:8] ),
length = c(4,4,4,4,2,2,2, 3, 3, 1, 2,2,1,1, 2, 2, 1, 1)
)
>ttt
# A tibble: 18 x 3
parent child length
<dbl> <chr> <dbl>
1 1 3 4
2 1 4 4
3 2 5 4
4 2 6 4
5 3 7 2
6 4 8 2
7 4 9 2
8 5 10 3
9 5 11 3
10 6 12 1
11 7 a 2
12 7 b 2
13 8 c 1
14 9 d 1
15 10 e 2
16 10 f 2
17 11 g 1
18 12 h 1
> dput(ttt)
structure(list(parent = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7,
8, 9, 10, 10, 11, 12), child = c("3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "a", "b", "c", "d", "e", "f", "g", "h"
), length = c(4, 4, 4, 4, 2, 2, 2, 3, 3, 1, 2, 2, 1, 1, 2, 2,
1, 1)), row.names = c(NA, -18L), class = c("tbl_df", "tbl", "data.frame"
))
Any help is appreciated.
Thanks in advance.
This gets you 90% of the way there:
tt_correct <- tt[, c(2,3,4,1)]
ttt <- do.call(
rbind,
lapply(seq_len(length(tt)-1),
function(i){
DF <- tt_correct[, c(i, i+1)]
names(DF) <- c('parent', 'child')
DF$length <- ave(DF$parent, DF$parent, FUN = length)
unique(DF)
}
)
)
ttt
# A tibble: 18 x 3
parent child length
<dbl> <chr> <dbl>
1 1 3 4
2 1 4 4
3 2 5 4
4 2 6 4
5 3 7 2
6 4 8 2
7 4 9 2
8 5 10 3
9 5 11 3
10 6 12 1
11 7 a 2
12 7 b 2
13 8 c 1
14 9 d 1
15 10 e 2
16 10 f 2
17 11 g 1
18 12 h 1
The first part is correcting the order. Your expected output indicates that the 1st column is a child of the 4th column. The lapply() statement largely walks along the data.frame and stacks the data.
This is 90% of the way because the answer doesn't agree with your expected output for lengths. I think this is correct but I could be wrong.
Finally, and I'm not that good with igraph, you could likely find additional information doing:
library(igraph)
plot(graph_from_data_frame(ttt[, 1:2]))
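For comparison, the stacking step and the length column can also be sketched in base R on a plain data frame. The column names l1, l2 and l3 are hypothetical stand-ins for the question's `1`, `2` and `3`, and length is taken to mean the number of rows of tt (that is, final nodes) under each parent, which is an assumption about what the question intends.

```r
# Same values as tt in the question, as a plain data frame.
tt <- data.frame(
  ID = letters[1:8],
  l1 = c(1, 1, 1, 1, 2, 2, 2, 2),
  l2 = c(3, 3, 4, 4, 5, 5, 5, 6),
  l3 = c(7, 7, 8, 9, 10, 10, 11, 12)
)
cols <- c("l1", "l2", "l3", "ID")  # ordered from parent to child

edges <- do.call(rbind, lapply(seq_len(length(cols) - 1), function(i) {
  pair <- data.frame(
    parent = tt[[cols[i]]],
    child  = as.character(tt[[cols[i + 1]]])  # character, so levels and IDs can stack
  )
  # Every row of tt is one final node, so the row count per parent is its length.
  pair$length <- as.numeric(table(pair$parent)[as.character(pair$parent)])
  unique(pair)
}))
edges
```

Converting child to character before stacking avoids relying on rbind() to reconcile numeric and character columns.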

Concatenating cells in a data frame using aggregate.data.frame

I have a data frame like the following:
df <- data.frame(bee.num=c(1,1,1,2,2,3,3), plant=c("d","d","w","d","d","w","d"))
df$visits = list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4,5:11)
df
bee.num plant visits
1 1 d 1, 2, 3
2 1 d 4, 5, 6, 7, 8, 9
3 1 w 10, 11
4 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
5 2 d 11, 12
6 3 w 1, 2, 3, 4
7 3 d 5, 6, 7, 8, 9, 10, 11
I would like to aggregate visits by bee.num and plant with a function that concatenates the values for visit based on matching bee.num and plant values, like the one below
bee.num plant visits
1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
4 3 w 1, 2, 3, 4
5 3 d 5, 6, 7, 8, 9, 10, 11
I've tried
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
and
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=unlist)
but I always get an "arguments imply differing number of rows" error. Any help would be greatly appreciated. Thanks in advance.
The function works as expected if you pass a data frame containing the list as a column, rather than pass the list itself.
x <- aggregate.data.frame(df['visits'], list(df$bee.num, df$plant) , FUN=c)
names(x) <- c('bee.num', 'plant', 'visits')
x
## bee.num plant visits
## 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
## 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## 3 3 d 5, 6, 7, 8, 9, 10, 11
## 4 1 w 10, 11
## 5 3 w 1, 2, 3, 4
Note:
> class(df$visits)
[1] "list"
> class(df['visits'])
[1] "data.frame"
Since df['visits'] is a data frame, it would thus suffice to call the generic aggregate() above; it dispatches to aggregate.data.frame().
Note also, the error is from trying to coerce the list to a data frame. The first two lines of aggregate.data.frame are as follows:
if (!is.data.frame(x))
x <- as.data.frame(x)
Applying this to df$visits results in:
as.data.frame(df$visits)
## Error in data.frame(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11, check.names = TRUE, :
## arguments imply differing number of rows: 3, 6, 2, 10, 4, 7
Only "rectangular" lists can be coerced to data.frame. All entries must be the same length.
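A small sketch of that coercion rule, using made-up lists: the first has elements of lengths 3 and 2 and fails, the second has elements of equal length and converts cleanly. (Note that data.frame() does recycle a shorter vector when the longest length is a whole multiple of it, so lengths such as 3 and 6 would not raise the error.)

```r
ragged      <- list(1:3, 10:11)         # lengths 3 and 2: cannot be aligned
rectangular <- list(a = 1:3, b = 4:6)   # both of length 3

# tryCatch() turns the coercion error into a marker value so the script keeps running.
ragged_result <- tryCatch(as.data.frame(ragged), error = function(e) "error")
rect_result   <- as.data.frame(rectangular)
```

This is why wrapping the list column as df['visits'], which is already a data frame, sidesteps the problem.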
You can also get the output you're looking for if you unlist the list column first and make it so you have a long data.frame to start with:
visits <- unlist(df$visits, use.names=FALSE)
df <- df[rep(rownames(df), sapply(df$visits, length)), c("bee.num", "plant")]
df$visits <- visits
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
# bee.num plant x
# 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
# 3 3 d 5, 6, 7, 8, 9, 10, 11
# 4 1 w 10, 11
# 5 3 w 1, 2, 3, 4
## Or, better yet:
aggregate(visits ~ bee.num + plant, df, c)
By the way, "data.table" can handle this listing and unlisting pretty directly:
library(data.table)
DT <- data.table(df)
setkey(DT, bee.num, plant)
DT[, list(visits = list(unlist(visits))), by = key(DT)]
# bee.num plant visits
# 1: 1 d 1,2,3,4,5,6,
# 2: 1 w 10,11
# 3: 2 d 1,2,3,4,5,6,
# 4: 3 d 5,6,7,8,9,10,
# 5: 3 w 1,2,3,4
The output there only looks truncated. All the information is there:
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# $ bee.num: num 1 1 2 3 3
# $ plant : Factor w/ 2 levels "d","w": 1 2 1 1 2
# $ visits :List of 5
# ..$ : int 1 2 3 4 5 6 7 8 9
# ..$ : int 10 11
# ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
# ..$ : int 5 6 7 8 9 10 11
# ..$ : int 1 2 3 4
# - attr(*, "sorted")= chr "bee.num" "plant"
# - attr(*, ".internal.selfref")=<externalptr>
In answer to your specific question, I don't think aggregate.data.frame will do this easily.
As I've stated in previous posts, most R users would probably come up with a way to do this in plyr.
However, as my first exposure to data analysis was through database scripting, I remain partial to the sqldf package for these sorts of tasks.
I also find SQL to be more transparent to non-R users (something I frequently encounter in the social science community where I do most of my work).
Here is a solution to your problem using sqldf:
#your data assigned to dat
bee.num <- c(1,1,1,2,2,3,3)
plant <- c("d", "d", "w", "d", "d", "w", "d")
visits <- c("1, 2, 3"
,"4, 5, 6, 7, 8, 9"
,"10, 11"
,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
,"11, 12"
,"1, 2, 3, 4"
,"5, 6, 7, 8, 9, 10, 11")
#build the data frame, naming the first column bee_num so it matches the query
dat <- data.frame(bee_num = bee.num, plant, visits)
#load sqldf
library(sqldf)
#write a simple SQL aggregate query using group_concat()
#i.e. "select" your fields specifying the aggregate function for the
#relevant field, "from" a table called dat, and "group by" bee_num
#(named with "_" because a "." in a column name would need quoting in SQL) and plant.
sqldf('select
bee_num
,plant
,group_concat(visits) visits
from dat
group by
bee_num
,plant')
bee_num plant visits
1 1 d 1, 2, 3,4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12
4 3 d 5, 6, 7, 8, 9, 10, 11
5 3 w 1, 2, 3, 4
