Concatenating cells in a data frame using aggregate.data.frame - r

I have a data frame like the following:
df <- data.frame(bee.num=c(1,1,1,2,2,3,3), plant=c("d","d","w","d","d","w","d"))
df$visits = list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4,5:11)
df
bee.num plant visits
1 1 d 1, 2, 3
2 1 d 4, 5, 6, 7, 8, 9
3 1 w 10, 11
4 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
5 2 d 11, 12
6 3 w 1, 2, 3, 4
7 3 d 5, 6, 7, 8, 9, 10, 11
I would like to aggregate visits by bee.num and plant with a function that concatenates the visits values of all rows sharing the same bee.num and plant, giving output like the one below:
bee.num plant visits
1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
4 3 w 1, 2, 3, 4
5 3 d 5, 6, 7, 8, 9, 10, 11
I've tried
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
and
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=unlist)
but I always get an "arguments imply differing number of rows" error. Any help would be greatly appreciated. Thanks in advance.

The function works as expected if you pass a data frame containing the list as a column, rather than pass the list itself.
x <- aggregate.data.frame(df['visits'], list(df$bee.num, df$plant) , FUN=c)
names(x) <- c('bee.num', 'plant', 'visits')
x
## bee.num plant visits
## 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
## 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## 3 3 d 5, 6, 7, 8, 9, 10, 11
## 4 1 w 10, 11
## 5 3 w 1, 2, 3, 4
Note:
> class(df$visits)
[1] "list"
> class(df['visits'])
[1] "data.frame"
Since df['visits'] is already a data frame, it would also suffice to call the generic aggregate() above; it dispatches to aggregate.data.frame.
Note also, the error is from trying to coerce the list to a data frame. The first two lines of aggregate.data.frame are as follows:
if (!is.data.frame(x))
x <- as.data.frame(x)
Applying this to df$visits results in:
as.data.frame(df$visits)
## Error in data.frame(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11, check.names = TRUE, :
## arguments imply differing number of rows: 3, 6, 2, 10, 4, 7
Only "rectangular" lists can be coerced to data.frame. All entries must be the same length, or at least recyclable to a common length, which is not the case here.
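To make the coercion rule concrete, here is a minimal sketch using two made-up lists (rect and ragged are hypothetical names, not the asker's data). It only illustrates when as.data.frame() succeeds and that a ragged list can still be stored as a list column.

```r
# Hypothetical toy lists, not the asker's data.
rect   <- list(a = 1:3, b = 4:6)   # both entries have length 3
ragged <- list(a = 1:3, b = 4:5)   # lengths 3 and 2 cannot be recycled

ok <- as.data.frame(rect)          # succeeds: 3 rows, 2 columns

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  { as.data.frame(ragged); NA_character_ },
  error = function(e) conditionMessage(e)
)

# A ragged list can still live inside a data frame as a list column.
wrapped <- data.frame(id = 1:2)
wrapped$vals <- ragged
```

Here msg holds the same "differing number of rows" message the asker ran into, while wrapped keeps the ragged list intact.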

You can also get the output you're looking for if you unlist the list column first and make it so you have a long data.frame to start with:
visits <- unlist(df$visits, use.names=FALSE)
df <- df[rep(rownames(df), sapply(df$visits, length)), c("bee.num", "plant")]
df$visits <- visits
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
# bee.num plant x
# 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
# 3 3 d 5, 6, 7, 8, 9, 10, 11
# 4 1 w 10, 11
# 5 3 w 1, 2, 3, 4
## Or, better yet:
aggregate(visits ~ bee.num + plant, df, c)
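The row-expansion step can be easier to follow when written with lengths() and integer row indices. The following is only a sketch of the same idea; df0, long and wide are names made up here so that the original df is not overwritten.

```r
# Same data as in the question, under a new name (df0).
df0 <- data.frame(bee.num = c(1, 1, 1, 2, 2, 3, 3),
                  plant   = c("d", "d", "w", "d", "d", "w", "d"))
df0$visits <- list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11)

# lengths() gives the number of visits stored in each row.
n <- lengths(df0$visits)

# Repeat each row index once per visit, then attach the flattened values.
long <- df0[rep(seq_len(nrow(df0)), n), c("bee.num", "plant")]
long$visit <- unlist(df0$visits, use.names = FALSE)
rownames(long) <- NULL

# Regroup into one vector per (bee.num, plant) pair.
wide <- aggregate(visit ~ bee.num + plant, data = long, FUN = c)
```

long has one row per individual visit, and wide has one row per bee/plant pair with the visits collected back into a list column.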
By the way, "data.table" can handle this listing and unlisting pretty directly:
library(data.table)
DT <- data.table(df)
setkey(DT, bee.num, plant)
DT[, list(visits = list(unlist(visits))), by = key(DT)]
# bee.num plant visits
# 1: 1 d 1,2,3,4,5,6,
# 2: 1 w 10,11
# 3: 2 d 1,2,3,4,5,6,
# 4: 3 d 5,6,7,8,9,10,
# 5: 3 w 1,2,3,4
The output there only looks truncated. All the information is there:
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# $ bee.num: num 1 1 2 3 3
# $ plant : Factor w/ 2 levels "d","w": 1 2 1 1 2
# $ visits :List of 5
# ..$ : int 1 2 3 4 5 6 7 8 9
# ..$ : int 10 11
# ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
# ..$ : int 5 6 7 8 9 10 11
# ..$ : int 1 2 3 4
# - attr(*, "sorted")= chr "bee.num" "plant"
# - attr(*, ".internal.selfref")=<externalptr>

In answer to your specific question, I don't think aggregate.data.frame will do this easily.
As I've stated in previous posts, most R users would probably come up with a way to do this in plyr.
However, as my first exposure to data analysis was through database scripting, I remain partial to the sqldf package for these sorts of tasks.
I also find SQL to be more transparent to non-R users (something I frequently encounter in the social science community where I do most of my work).
Here is a solution to your problem using sqldf:
#your data assigned to dat
bee.num <- c(1,1,1,2,2,3,3)
plant <- c("d", "d", "w", "d", "d", "w", "d")
visits <- c("1, 2, 3"
,"4, 5, 6, 7, 8, 9"
,"10, 11"
,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
,"11, 12"
,"1, 2, 3, 4"
,"5, 6, 7, 8, 9, 10, 11")
#name the first column bee_num explicitly, since "." is awkward in SQL field names
dat <- data.frame(bee_num = bee.num, plant, visits)
#load sqldf
require(sqldf)
#write a simple SQL aggregate query using group_concat()
#i.e. "select" your fields specifying the aggregate function for the
#relevant field, "from" a table called dat, and "group by" bee_num and plant.
sqldf('select
bee_num
,plant
,group_concat(visits) visits
from dat
group by
bee_num
,plant')
bee_num plant visits
1 1 d 1, 2, 3,4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12
4 3 d 5, 6, 7, 8, 9, 10, 11
5 3 w 1, 2, 3, 4
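For readers without sqldf installed, roughly the same grouping can be sketched in base R. This is not the sqldf approach itself: paste() with collapse is used here as a stand-in for group_concat(), and the row order of the result differs from the SQL output.

```r
# Same string-valued data as above, rebuilt so the sketch is self-contained.
dat <- data.frame(
  bee_num = c(1, 1, 1, 2, 2, 3, 3),
  plant   = c("d", "d", "w", "d", "d", "w", "d"),
  visits  = c("1, 2, 3", "4, 5, 6, 7, 8, 9", "10, 11",
              "1, 2, 3, 4, 5, 6, 7, 8, 9, 10", "11, 12",
              "1, 2, 3, 4", "5, 6, 7, 8, 9, 10, 11"),
  stringsAsFactors = FALSE
)

# paste(..., collapse = ",") plays the role of group_concat():
# it joins the strings within each (bee_num, plant) group with a comma.
res <- aggregate(visits ~ bee_num + plant, data = dat,
                 FUN = paste, collapse = ",")
```

Note that the result is a single string per group, not a list of numbers, just as with group_concat().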

Related

print from specific rows with highest value from multiple columns using R Studio

I have attached an image of my data. I want to extract only those rows in which a particular column holds the maximum value compared with the other columns.
First, provide a reproducible version of your data (not a picture):
dput(dta)
structure(list(A = c(45, 20, 9, 6, 6), B = c(23, 34, 7, 10, 5
), C = c(12, 15, 8, 0, 4), D = c(4, 4, 6, 0, 3), E = c(5, 6,
3, 1, 2)), class = "data.frame", row.names = c("BOX_A", "BOX_B",
"BOX_C", "BOX_D", "BOX_E"))
Now find which column is the maximum:
idx <- apply(dta, 1, which.max)
Now display the rows where the maximum is in the first column. This is not what you asked for but it is what your picture shows:
dta[idx==1, ]
# A B C D E
# BOX_A 45 23 12 4 5
# BOX_C 9 7 8 6 3
# BOX_E 6 5 4 3 2
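If the column of interest is known by name rather than position, the index can be translated into column names first. A small sketch along those lines, where top_col and rows_b are names introduced here for illustration:

```r
dta <- data.frame(A = c(45, 20, 9, 6, 6), B = c(23, 34, 7, 10, 5),
                  C = c(12, 15, 8, 0, 4), D = c(4, 4, 6, 0, 3),
                  E = c(5, 6, 3, 1, 2),
                  row.names = c("BOX_A", "BOX_B", "BOX_C", "BOX_D", "BOX_E"))

# Position of the row-wise maximum (ties go to the first column that reaches it).
idx <- apply(dta, 1, which.max)

# Translate positions into column names, one per row.
top_col <- names(dta)[idx]
names(top_col) <- rownames(dta)

# Keep the rows whose maximum sits in a chosen column, here "B".
rows_b <- dta[top_col == "B", ]
```

With this data, rows_b contains BOX_B and BOX_D, the two rows whose largest value is in column B.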

How to find corresponding ID comparing values in two columns?

I have the following problem, which is easiest to see in the attached picture. Any help solving it would be much appreciated.
In Table 1, the IDs are correctly linked to the values in column A. Column B contains values that are not ordered and whose IDs are not given in the corresponding rows. We need to find the IDs of the values in column B by using the IDs of column A.
Now if we run the following code in R, we find the IDs corresponding to the values in column B:
mydata <- read.csv('C:/Users/Windows/Desktop/practice_1.csv', header = TRUE)
df <- data.frame(mydata$B, mydata$A, mydata$ID)
library(qdap)
df[, "New ID"] <- df[, 1] %l% df[, -1]
After running the above code, the new IDs appear in the column New ID, as in Table 2.
What you need is a simple match operation:
table1$ID2 <- table1$ID1[match(table1$z, table1$y)]
table1
# ID1 y z ID2
# 1 0 1 11 10
# 2 1 2 3 2
# 3 2 3 5 4
# 4 3 4 4 3
# 5 4 5 8 7
# 6 5 6 7 6
# 7 6 7 15 15
# 8 7 8 6 5
# 9 8 9 2 1
# 10 9 10 16 17
# 11 10 11 1 0
# 12 11 12 NA NA
# 13 15 15 NA NA
# 14 17 16 NA NA
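In case the match() idiom is unfamiliar, here is a tiny sketch on made-up data (lookup and wanted are hypothetical) showing what it returns and how unmatched values turn into NA:

```r
# Hypothetical lookup table: each value has a known ID.
lookup <- data.frame(id = c(10, 20, 30), value = c("a", "b", "c"))

# Values whose IDs we want, in arbitrary order; "z" is not in the table.
wanted <- c("c", "a", "z", "b")

# match() returns, for each wanted value, its row position in lookup$value.
pos <- match(wanted, lookup$value)   # 3 1 NA 2

# Indexing the ID column by those positions carries the NA through.
ids <- lookup$id[pos]                # 30 10 NA 20
```

This is the same pattern as table1$ID1[match(table1$z, table1$y)], just with the lookup column and the wanted values kept in separate objects.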
Next time you ask a question where sample data is necessary (most questions), please provide the data in this format:
Data
# dput(table1)
structure(list(ID1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 17), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16), z = c(11, 3, 5, 4, 8, 7, 15, 6, 2, 16, 1, NA, NA, NA), ID2 = c(10, 2, 4, 3, 7, 6, 15, 5, 1, 17, 0, NA, NA, NA)), row.names = c(NA, -14L), class = "data.frame")

How to flag a progressive decrease in values in list

I have a dataframe with columns containing lists of values:
dt
onset_l coda_l
1 3, 7 7, 4, 1
3 7 7, 1, 3
12 1, 7 7, 4, 1
21 6, 7 7, 4, 1
23 7 7, 1, 5
What I want to do is create a new column, say, coda_flag, that flags whether the values in column coda_l progressively decrease or not.
I've tried this lapply method:
dt$coda_flag <- lapply(dt$coda_l, function(x) ifelse(x[1] > x[2] & x[2] > x[3], 1, 0))
The output is okay but I dislike that I have to enumerate x[1], x[2] and so on because in some cases I may not know the exact number of values in the lists.
The desired output would be:
dt
onset_l coda_l coda_flag
1 3, 7 7, 4, 1 1 # values do progressively decrease
3 7 7, 1, 3 0 # values do not progressively decrease
12 1, 7 7, 4, 1 1
21 6, 7 7, 4, 1 1
23 7 7, 1, 5 0
How can this be achieved?
Reproducible data:
dt <- structure(list(onset_l = list(c("3", "7"), "7", c("1", "7"),
c("6", "7"), "7"), coda_l = list(c("7", "4", "1"), c("7",
"1", "3"), c("7", "4", "1"), c("7", "4", "1"), c("7", "1", "5"
))), row.names = c(1L, 3L, 12L, 21L, 23L), class = "data.frame")
You can use is.unsorted() - although it can only detect increasing order so it's necessary to reverse the vector first.
dt$coda_flag <- +!sapply(dt$coda_l, function(x) is.unsorted(rev(as.numeric(x))))
onset_l coda_l coda_flag
1 3, 7 7, 4, 1 1
3 7 7, 1, 3 0
12 1, 7 7, 4, 1 1
21 6, 7 7, 4, 1 1
23 7 7, 1, 5 0
Note the strictly argument controls what happens in the event of tied values.
x <- c(1, 2, 2, 3)
is.unsorted(x)
[1] FALSE
is.unsorted(x, strictly = TRUE)
[1] TRUE
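The same check can also be sketched with diff(), which makes the handling of ties explicit. is_decreasing is a hypothetical helper written for this illustration, not a base R function, and the coda list below is made up to include a tie.

```r
# Hypothetical helper, not part of base R: TRUE when x decreases.
# diff() returns an empty vector for length-1 input, and all() of an empty
# vector is TRUE, so single values count as decreasing.
is_decreasing <- function(x, strictly = TRUE) {
  d <- diff(as.numeric(x))
  if (strictly) all(d < 0) else all(d <= 0)
}

coda <- list(c("7", "4", "1"), c("7", "1", "3"), c("7", "4", "4"))

# Strict version: the tie in the third vector is not a decrease.
flags <- +vapply(coda, is_decreasing, logical(1))                    # 1 0 0

# Non-strict version: ties are tolerated.
loose <- +vapply(coda, is_decreasing, logical(1), strictly = FALSE)  # 1 0 1
```

Because it works on successive differences, the helper does not depend on how many values each list element holds.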
One dplyr and purrr solution could be:
library(dplyr)
library(purrr)
dt %>%
mutate(coda_flag = map_int(.x = coda_l, ~ +(all(diff(as.numeric(.x)) < 1))))
onset_l coda_l coda_flag
1 3, 7 7, 4, 1 1
2 7 7, 1, 3 0
3 1, 7 7, 4, 1 1
4 6, 7 7, 4, 1 1
5 7 7, 1, 5 0
You can also try with a loop:
#Flag
dt$Flag <- NA
#Loop
for(i in seq_len(nrow(dt)))
{
#Extract elements
vecvar <- do.call(c,dt$coda_l[i])
#Compute successive differences, keeping the original order
difvec <- diff(as.numeric(vecvar))
#Assign 1 only if every step is a decrease
dt$Flag[i] <- ifelse(all(difvec < 0),1,0)
}
Output:
dt
onset_l coda_l Flag
1 3, 7 7, 4, 1 1
3 7 7, 1, 3 0
12 1, 7 7, 4, 1 1
21 6, 7 7, 4, 1 1
23 7 7, 1, 5 0
An option with data.table
library(data.table)
setDT(dt)[, coda_flag := +(sapply(coda_l, function(x)
all(as.numeric(x) - shift(as.numeric(x), fill = first(as.numeric(x))) < 1)))]
-output
dt
# onset_l coda_l coda_flag
#1: 3,7 7,4,1 1
#2: 7 7,1,3 0
#3: 1,7 7,4,1 1
#4: 6,7 7,4,1 1
#5: 7 7,1,5 0

pivot_longer with groups of columns [duplicate]

I've got a dataset that looks like this:
df_start <- tribble(
~name, ~age, ~x1_sn_ctrl1, ~x1_listing2_2, ~x1_affect1, ~x2_sn_ctrl1, ~x2_listing2_2, ~x2_affect1, ~number,
"John", 28, 1, 1, 9, 4, 5, 9, 6,
"Paul", 27, 2, 1, 4, 1, 3, 3, 4,
"Ringo", 31, 3, 1, 2, 2, 5, 8, 9)
I need to pivot_longer() while handling the groupings within my columns:
There are 2 x-values (1 and 2)
There are 3 questions (sn_ctrl1, listing2_2, affect1) for each x-value
In my actual dataset, there are 14 x's.
Essentially, what I'd like to do is to apply pivot_longer() to the x-values but leave my 3 questions (sn_ctrl1, listing2_2, affect1) wide.
What I'd like to end up with is this:
df_end <- tribble(
~name, ~age, ~xval, ~sn_ctrl1, ~listing2_2, ~affect1, ~number,
"John", 28, 1, 1, 1, 9, 6,
"John", 28, 2, 4, 5, 9, 6,
"Paul", 27, 1, 2, 1, 4, 4,
"Paul", 27, 2, 1, 3, 3, 4,
"Ringo", 31, 1, 3, 1, 2, 9,
"Ringo", 31, 2, 2, 5, 8, 9)
I have tried lots of very unsuccessful attempts playing with regex in names_pattern & pivot_longer but am completely striking out.
Anyone know how to tackle this?
THANKS!
PS: Note that I tried to make a straightforward reproducible example. The actual names of my columns vary slightly. For instance, there is x1_sn_ctrl1 & x1_attr1_ctrl2.
You can use :
tidyr::pivot_longer(df_start,
cols = -c(name, age, number),
names_to = c("xval", ".value"),
names_pattern = 'x(\\d+)_(.*)')
Which yields (assuming the seventh column of df_start is named x2_listing2_2 rather than a second x1_listing2_2):
# A tibble: 6 x 7
name age number xval sn_ctrl1 listing2_2 affect1
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 John 28 6 1 1 1 9
2 John 28 6 2 4 5 9
3 Paul 27 4 1 2 1 4
4 Paul 27 4 2 1 3 3
5 Ringo 31 9 1 3 1 2
6 Ringo 31 9 2 2 5 8
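To see what the names_pattern regex captures, the two groups can be pulled out with base R's sub(). This is only an illustration of the pattern, not of how tidyr implements it; the column names assume the seventh column is x2_listing2_2, and x14_attr1_ctrl2 is a made-up name in the style the asker describes.

```r
# Column names to be reshaped.
nm <- c("x1_sn_ctrl1", "x1_listing2_2", "x1_affect1",
        "x2_sn_ctrl1", "x2_listing2_2", "x2_affect1")

pattern <- "x(\\d+)_(.*)"

# First capture group: the digits after "x" become the xval column.
xval <- sub(pattern, "\\1", nm)

# Second capture group: the remainder names the output column (".value").
value_col <- sub(pattern, "\\2", nm)

data.frame(nm, xval, value_col)

# Multi-digit prefixes and extra underscores in the question name still split
# at the first underscore, because \\d+ can only consume digits.
sub(pattern, "\\1", "x14_attr1_ctrl2")   # "14"
sub(pattern, "\\2", "x14_attr1_ctrl2")   # "attr1_ctrl2"
```

Columns that share the same second group end up in the same output column, which is what keeps the three questions wide.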

Frequency of vectors inside list

Let's say I have a list
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
and I need to count how often each of these vectors occurs, so the desired output should look like:
Category Count
1, 2, 3 3
2, 4, 6 1
1, 5, 10 2
Is there a simple way to achieve this in R?
You can just paste and use table, i.e.
as.data.frame(table(sapply(test, paste, collapse = ' ')))
which gives,
Var1 Freq
1 1 2 3 3
2 1 5 10 2
3 2 4 6 1
The function unique() can work on a list. For counting one can use identical():
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
Lcount <- function(xx, L) sum(sapply(L, identical, y=xx))
sapply(unique(test), FUN=Lcount, L=test)
unique(test)
The result as data.frame:
result <- data.frame(
Set=sapply(unique(test), FUN=paste0, collapse=','),
count= sapply(unique(test), FUN=Lcount, L=test)
)
result
# > result
# Set count
# 1 1,2,3 3
# 2 2,4,6 1
# 3 1,5,10 2
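If the order of first appearance matters, as in the desired output, the paste-and-table idea can be combined with a factor whose levels are fixed in advance. A minimal sketch, where keys, counts and result are names chosen here:

```r
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10),
             c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))

# One string key per vector.
keys <- vapply(test, paste, character(1), collapse = ", ")

# A factor whose levels follow first appearance keeps table() from re-sorting.
counts <- table(factor(keys, levels = unique(keys)))

result <- data.frame(Category = names(counts),
                     Count = as.vector(counts))
result
```

This keeps the categories in the order 1, 2, 3 then 2, 4, 6 then 1, 5, 10, instead of the alphabetical order that table() applies to plain strings.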
