Arrange rows by pairs in R based on two columns

I need to order a data table by pairs of users who sent messages. Currently, the data looks like this:
I want to rearrange the rows so that I can see how many messages users exchanged with each other. If one user sent a message but the other one did not respond, I need a value of 0 in the column Messages_sent.
As a next step, I need to calculate the conversation length between two users, that is, sum Messages_sent for every two lines.
Please advise how I can rearrange the data table!

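With dplyr, this code should produce the table given in your description. If you only want to sum the counts in both directions, the merge step already contains everything you need.
library(dplyr)
# self-join with the id columns swapped; the id column names (from_id, to_id)
# and all.x=TRUE are assumptions, adjust them to your data
df <- merge(df,df,by.x=c("from_id","to_id"),by.y=c("to_id","from_id"),all.x=TRUE)
df <- mutate(df,Messages_sent.x=coalesce(Messages_sent.x,0),
             Messages_sent.y=coalesce(Messages_sent.y,0))
df$row <- 1:nrow(df)
rbind(select(df,-Messages_sent.y) %>%
        rename(Messages_sent=Messages_sent.x),
      select(df,-Messages_sent.x) %>%
        rename(Messages_sent=Messages_sent.y,from_id=to_id,to_id=from_id)
) %>% arrange(row) %>% select(-row)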

Here are the steps using base R functions:
df <- data.frame(from_id=c(624227,624227,624227,624227,624227,624227,667255,667255,667255,7134655,713465),
to_id = c(352731,693915,184455,771100,503940,91558,626814,857601,862512,156874,419242),
# merge dataset together with itself swapping from_id and to_id columns
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),suffixes = c(".x",".y"), all=TRUE)
# fill missing values with 0
# those records will correspond to all the pairs where
# someone did not send any messages back
df.full[is.na(df.full)] <- 0
# calculate total number of messages for each pair:
df.full$total <- df.full$message_sent.x + df.full$message_sent.y
# from_id to_id message_sent.x message_sent.y total
# 1 91558 624227 0 1 1
# 2 156874 7134655 0 1 1
# 3 184455 624227 0 2 2
# 4 352731 624227 0 1 1
# 5 419242 713465 0 1 1
# 6 503940 624227 0 1 1
For very large datasets, base R functions might be slow. In that case you can look into using the dplyr library, which has similar functions for most of the steps here:
library(dplyr)
df.full.2 <- merge(df,df,  # merge dataframe and switched one
                   by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),
                   all.x=TRUE,all.y=TRUE) %>%
  mutate(message_sent.x=coalesce(message_sent.x,0),  # replace NAs with 0
         message_sent.y=coalesce(message_sent.y,0)) %>%
  mutate(total=rowSums(.[3:4]))  # calculate total number of messages
# from_id to_id message_sent.x message_sent.y total
#1 91558 624227 0 1 1
#2 156874 7134655 0 1 1
#3 184455 624227 0 2 2
#4 352731 624227 0 1 1
#5 419242 713465 0 1 1
#6 503940 624227 0 1 1
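If it's important that the records of each pair follow each other, you can also add the following code:
df.full.3 <- df.full.2 %>%
  # build a sort key shared by both directions of a pair (the column name `pair` is an assumption)
  mutate(pair=sprintf("%06d%6d",pmin(from_id,to_id),
                      pmax(from_id,to_id))) %>%
  arrange(pair) %>% select(-pair)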
# from_id to_id message_sent.x message_sent.y total
#1 91558 624227 0 1 1
#2 624227 91558 1 0 1
#3 156874 7134655 0 1 1
#4 7134655 156874 1 0 1
#5 184455 624227 0 2 2
#6 624227 184455 2 0 2
The data.table package is also very efficient for very large datasets:
library(data.table)
# convert dataframe to datatable
setDT(df)
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),
                 suffixes = c(".x",".y"), all=TRUE)
# substitute NAs with zeros
for (j in 3:4) set(df.full, which(is.na(df.full[[j]])), j, 0)
# calculate the total number of messages
df.full[, total:=message_sent.x+message_sent.y]
# from_id to_id message_sent.x message_sent.y total
# 1: 91558 624227 0 1 1
# 2: 156874 7134655 0 1 1
# 3: 184455 624227 0 2 2
# 4: 352731 624227 0 1 1
# 5: 419242 713465 0 1 1
# 6: 503940 624227 0 1 1
Depending on the size of your dataset one of these methods might be more efficient than the other two.
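The follow-up step from the question (conversation length per pair of users) can also be sketched directly, without the self-merge, by building an order-independent key with pmin() and pmax(), the same idea as the sort key used above. This is only a minimal sketch: the data frame msgs, its values, and the helper columns user_a and user_b are made up for illustration.

```r
# Hypothetical data: the values are invented for illustration only
msgs <- data.frame(
  from_id      = c(1, 2, 1, 3),
  to_id        = c(2, 1, 3, 4),
  message_sent = c(5, 2, 1, 4)
)

# Order-independent pair identifiers: smaller id first, larger id second,
# so that (1, 2) and (2, 1) end up with the same key
msgs$user_a <- pmin(msgs$from_id, msgs$to_id)
msgs$user_b <- pmax(msgs$from_id, msgs$to_id)

# Sum the messages sent in both directions for each unordered pair
conv_length <- aggregate(message_sent ~ user_a + user_b, data = msgs, FUN = sum)
conv_length
```

Pairs where only one side wrote simply keep that side's count, so no explicit zero-filling is needed when only the total matters.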


aggregating repeat entries prior to summarizing on other variables

id event var1 var2 var_total
1 1 a 1 1 1
2 2 a 1 0 1
3 3 a 0 0 0
4 4 a 0 0 0
5 3 a 1 1 1
6 4 a 0 1 1
7 3 a 1 0 1
8 1 b 1 1 1
9 1 b 0 1 1
10 2 b 1 0 1
I am reformatting/cleaning data from a data entry site that produces very diagonal data. I have got it into a manageable form at the moment, but I still have one problem. There are events that repeat, and I would like each id/event combination to be unique. As you can see, lines 5 and 6 are duplicate in that aspect, but the variables are not identical. The variables are a binary response (yes=1,no=0), but if there is any yes in the event the variable should be 1. Additionally, the 'var_total' column should be 1 if ANY of the variables are positive.
My data set has 77 of these 'repeat events' out of over 6000 entries, and it's likely to change every time more data is entered. How do I isolate ids with repeat events so I can aggregate() (?summarise) them and be sure it's done correctly? There are over 15 variables. I need to report the number of ids per event for all variables.
library(dplyr)
df %>%
  group_by(id, event) %>%
  summarize(
    across(var1:var2, ~ +any(. > 0)),
    var_total = +((var1 + var2) > 0),
    .groups = "drop"
  ) %>%
  arrange(event, id)
# # A tibble: 6 x 5
# id event var1 var2 var_total
# <dbl> <chr> <int> <int> <int>
# 1 1 a 1 1 1
# 2 2 a 1 0 1
# 3 3 a 1 1 1
# 4 4 a 0 1 1
# 5 1 b 1 1 1
# 6 2 b 1 0 1
The arrange is purely getting it back to the order you had in your question, not required for the operation of the code.
summarize is wiping out var_total and then recreating it based on the logic you stated (either var* is 1).
I could just as easily have used across(var1:var2, max) instead of ~ +any(. > 0); it produces the same results here. I show the ~ any version purely to demonstrate something a little more complex than max.
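To address the "isolate ids with repeat events" part of the question and to check the claim about max, here is a small sketch on the example data from the question. The object names repeats, with_any, and with_max are made up for illustration.

```r
library(dplyr)

# Example data copied from the question (var_total omitted, it is recomputed anyway)
df <- data.frame(
  id    = c(1, 2, 3, 4, 3, 4, 3, 1, 1, 2),
  event = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b"),
  var1  = c(1, 1, 0, 0, 1, 0, 1, 1, 0, 1),
  var2  = c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0)
)

# id/event combinations that occur more than once
repeats <- df %>%
  count(id, event) %>%
  filter(n > 1)

# Collapse with any() ...
with_any <- df %>%
  group_by(id, event) %>%
  summarize(across(var1:var2, ~ +any(. > 0)), .groups = "drop")

# ... and with max(), which gives the same values for 0/1 data
with_max <- df %>%
  group_by(id, event) %>%
  summarize(across(var1:var2, max), .groups = "drop")

all(with_any$var1 == with_max$var1, with_any$var2 == with_max$var2)
```

Inspecting repeats before and after new data is entered is one way to see which combinations the summarize step will collapse.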

Using lag in mutate() for rolling values forward for the created column

I am trying to identify sessions in clickstream data. I group rows by month and userId and try to create another variable, session, that looks at the diff_days column: it increases by one if diff_days > 0.00209 and keeps the previous value otherwise. So basically I am trying to create the session variable and use its lagged version at the same time. The first row in a group always has session = 1.
So take for example this data is one of the groups from group_by:
ID Month diff_days
2 0 NA
2 0 0.0002
2 0 0.001
2 0 0.01
2 0 0.00034
2 0 0.1
2 0 0.3
2 0 0.00005
and I want to create session variable within each group like this:
ID Month diff_days session
2 0 NA 1
2 0 0.0002 1
2 0 0.001 1
2 0 0.01 2
2 0 0.00034 2
2 0 0.1 3
2 0 0.3 4
2 0 0.00005 4
The code that I am using and not giving the right answer:
data <- data %>% group_by(ID, Month)
%>% mutate(session = ifelse(row_number() == 1, 1 ,
ifelse(diff_days < 0.0209, lag(session) , lag(session) + 1))) %>% ungroup()
I have been struggling with this for quite some time so any help would be greatly appreciated.
We can use cumsum on a logical vector after grouping by 'ID' and 'Month'. Create the logical vector diff_days[-1] >= 0.00209 (dropping the first observation, which is NA) and prepend TRUE as the first element. Then take the cumulative sum, so that every TRUE value increments the session by 1.
data %>%
group_by(ID, Month) %>%
mutate(session = cumsum(c(TRUE, diff_days[-1] >= 0.00209)))
# A tibble: 8 x 4
# Groups: ID, Month [1]
# ID Month diff_days session
# <int> <int> <dbl> <int>
#1 2 0 NA 1
#2 2 0 0.0002 1
#3 2 0 0.001 1
#4 2 0 0.01 2
#5 2 0 0.00034 2
#6 2 0 0.1 3
#7 2 0 0.3 4
#8 2 0 0.00005 4
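To make the mechanics visible, the same logic can be traced on a plain vector for a single group. This is a minimal base R sketch using the diff_days values from the question; the intermediate name new_session is made up for illustration.

```r
# diff_days for one ID/Month group, copied from the question
diff_days <- c(NA, 0.0002, 0.001, 0.01, 0.00034, 0.1, 0.3, 0.00005)

# TRUE marks the start of a new session: the first row always starts one,
# every later row starts one when its gap reaches the threshold
new_session <- c(TRUE, diff_days[-1] >= 0.00209)

# Each TRUE adds 1 to the running count, each FALSE keeps the previous value
session <- cumsum(new_session)
session
```

Note that this assumes NA occurs only in the first row of each group; an NA further down would propagate through cumsum and make the remaining session values NA.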

Sum a group of columns by row count

I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.
For example, with this dataset, if we split it into groups of 3 rows:
a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0
We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:
a b c d
1 1 3 0 2
2 1 0 2 0
I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.
I've also considered recommendations from How to sum rows based on multiple conditions - R?,How to select every xth row from table,Remove last N rows in data frame with the arbitrary number of rows,sum two columns in R, and Merging multiple rows into single row. I'm all out of ideas. Any help would be greatly appreciated.
Create a grouping variable, group_by that variable, then summarise_all.
# your data
d <- data.frame(a = c(1,0,0,0,0,1),
b = c(1,1,1,0,0,0),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
# create the grouping variable
d$group <- rep(c("A","B"), each = 3)
# apply the sum to all columns
library(dplyr)
d %>%
  group_by(group) %>%
  summarise_all(sum)
# A tibble: 2 x 5
#   group     a     b     c     d
#   <chr> <dbl> <dbl> <dbl> <dbl>
# 1 A         1     3     0     2
# 2 B         1     0     3     0
After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.
Here, you'd like to split d1 into multiple data frames of 3 rows each. To do so, use rep() to specify that each element in the sequence 1:2 should be repeated three times (the number of rows divided by the length of the sequence).
After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.
Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.
# load necessary packages ----
library(dplyr)
library(purrr)
# load necessary data ----
df1 <-
read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)
# perform operations --------
df2 <-
df1 %>%
# split df1 into two data frames
# based on three consecutive rows
split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
# for each data frame, apply the sum() function to all the columns
map(.f = ~ .x %>% summarize_all(sum)) %>%
# collapse data frames together
bind_rows()
# view results -----
df2
# a b c d
# 1 1 3 0 2
# 2 1 0 2 0
# end of script #
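Both answers hard-code the number of groups. When the number of rows is not known in advance, the grouping variable can be derived from the row index with integer division. The following is a minimal base R sketch of that idea; the names n and grp are made up for illustration, and it uses rowsum() rather than the dplyr or purrr functions from the answers above.

```r
# Example data from the question
d1 <- read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)

# Rows per group (this would be 60 for seconds to minutes)
n <- 3

# Rows 1..n get group 0, rows n+1..2n get group 1, and so on;
# a final incomplete block simply forms its own smaller group
grp <- (seq_len(nrow(d1)) - 1) %/% n

# Column sums within each group, for however many columns there are
d2 <- rowsum(d1, group = grp)
d2
```

The row names of d2 are the group labels (0, 1, ...), so they can be reset with rownames(d2) <- NULL if plain row numbers are preferred.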

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, The following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns to this (note: y is value that is being counted)
EDIT: To explain the transformation: x is what I'm grouping by. For each value of x, I want to count how many times 0, 1, and 2 occur in each of the other columns. For example, the first row of the transformed dataframe counts how often the value 0 (y) occurs where x = 1: once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but in this case that is not necessary because the default aggregation function is length. If you do not specify an aggregation function, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently, this method can be used with reshape2 as well:
library(reshape2)
dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
Which allows you to use count to get the information regarding how often certain values occur in columns a to c. After that, you reformat the dataset to your required format using spread.
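In current versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The same reshape, count, reshape idea can be sketched with the newer functions as follows; this is an equivalent formulation under that assumption, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c(1, 1, 1, 2), a = c(0, 1, 2, 1),
                 b = c(0, 0, 2, 2), c = c(0, 1, 1, 1))

res <- df %>%
  # long format: one row per (x, column name, value)
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  # how often each value occurs per x and column
  count(x, variable, value) %>%
  # back to wide format, filling combinations that never occur with 0
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)
res
```

The intermediate long table is the same one shown in the head() output above, so the counting step is unchanged.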

Splitting one Column to Multiple R and Giving logical value if true

I am trying to split one column in a data frame into multiple columns that use the values from the original column as the new column names. Then, if a value occurred in the original column for a given row, the corresponding new column should get a 1, or a 0 if there was no match. I realize this is not the best way to explain it, so here is an example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format, something such as, with 1's and 0's (or T and F):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr and the separate function and reshape2 and the cast function but seem to getting hung up on giving logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from the splitstackshape package:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
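One caveat with the grepl step is that it matches substrings, so a location named 'A' would also match a hypothetical location such as 'AB'. A variant that compares whole elements after splitting could look like the sketch below; the names parts and res are made up for illustration.

```r
df <- data.frame(subject = 1:4,
                 Location = c("A", "A/B", "B/C/D", "A/B/C/D"))

# split each Location into its individual elements
parts <- strsplit(as.character(df$Location), "/", fixed = TRUE)

# all distinct elements, used as the new column names
u <- sort(unique(unlist(parts)))

# one row per subject: 1 if the element is present, 0 otherwise
m <- matrix(unlist(lapply(parts, function(p) as.integer(u %in% p))),
            ncol = length(u), byrow = TRUE, dimnames = list(NULL, u))

res <- cbind(df["subject"], m)
res
```

Because %in% compares complete strings, each element only matches itself, regardless of how the location names overlap.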
