Counting successor combinations in a data.frame

I have a data.frame that looks like this:
OBJECT ID TASK
1 A
1 C
1 D
1 E
2 A
2 B
2 C
2 D
2 F
Now I would like to count the unique successive combinations within the data.frame in order to get the following result:
PREDECESSOR SUCCESSOR COUNT
A C 1
C D 2
D E 1
A B 1
B C 1
D F 1
I've already figured out how to extract the successive values using two for loops, but I'm stuck on assigning and counting them in a new data.frame (or list).

aggregate(COUNT~.,
data.frame(PREDECESSOR = head(df1$TASK, -1),
SUCCESSOR = tail(df1$TASK, -1),
COUNT = 1),
length)
# PREDECESSOR SUCCESSOR COUNT
#1 E A 1
#2 A B 1
#3 A C 1
#4 B C 1
#5 C D 2
#6 D E 1
#7 D F 1
Note that this also counts the pair E A, which spans two objects. To count pairs within each object only, you could use a similar approach after first splitting by OBJECT.ID:
temp = do.call(rbind, lapply(split(df1, df1$OBJECT.ID), function(X){
aggregate(COUNT~., data.frame(PREDECESSOR = head(X$TASK, -1),
SUCCESSOR = tail(X$TASK, -1),
COUNT = 1),
length)
}))
aggregate(COUNT~., temp, sum) # add up the per-object counts (length would only count objects)
# PREDECESSOR SUCCESSOR COUNT
#1 A C 1
#2 B C 1
#3 C D 2
#4 D E 1
#5 A B 1
#6 D F 1
DATA
df1 = structure(list(OBJECT.ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F")), .Names = c("OBJECT.ID",
"TASK"), class = "data.frame", row.names = c(NA, -9L))
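As a variation on the split-based approach above, here is a minimal base R sketch that skips the split and instead masks the pairs that cross an object boundary. It assumes the rows are already ordered so that each OBJECT.ID forms one contiguous block; the names `same_object` and `pairs` are only illustrative.

```r
df1 <- data.frame(OBJECT.ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                  TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F"),
                  stringsAsFactors = FALSE)

# TRUE where a row and the row after it belong to the same object
same_object <- head(df1$OBJECT.ID, -1) == tail(df1$OBJECT.ID, -1)

# Keep only predecessor/successor pairs that stay within one object
pairs <- data.frame(PREDECESSOR = head(df1$TASK, -1)[same_object],
                    SUCCESSOR   = tail(df1$TASK, -1)[same_object],
                    COUNT = 1)

# Sum the ones for each distinct pair
res <- aggregate(COUNT ~ ., pairs, sum)
res
```

If the data are not sorted by object, ordering them first (for example with `order(df1$OBJECT.ID)`) would be needed for the boundary mask to be meaningful.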

Solution using data.table:
Code:
library(data.table)
setDT(df)
df[, TASK0 := shift(TASK), OBJECT]
df[!is.na(TASK0), .N, .(TASK, TASK0)][, .(
COUNT = sum(N)), .(PREDECESSOR = TASK0, SUCCESSOR = TASK)]
Result:
PREDECESSOR SUCCESSOR COUNT
1: A C 1
2: C D 2
3: D E 1
4: A B 1
5: B C 1
6: D F 1
Explanation:
setDT(df): turns data.frame into a data.table object
[, TASK0 := shift(TASK), OBJECT]: gets previous letter for each OBJECT
!is.na(TASK0): gets rid of first row for each OBJECT (they don't have PREDECESSOR)
.N, .(TASK, TASK0): counts occurrences of each TASK and TASK0 (previous letter) combination
sum(N): sums counts
Data (df):
structure(list(OBJECT = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F")), .Names = c("OBJECT",
"TASK"), row.names = c(NA, -9L), class = c("data.table", "data.frame"
))
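The "previous letter per OBJECT" step can also be imitated without data.table. The following is a rough base R sketch of the same idea, using `ave` in place of `shift`; the variable names `prev`, `keep` and `counts` are made up for illustration, and the data are rebuilt as a plain data.frame.

```r
df <- data.frame(OBJECT = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                 TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F"),
                 stringsAsFactors = FALSE)

# Previous task within each OBJECT; the first row of each object gets NA
prev <- ave(df$TASK, df$OBJECT, FUN = function(x) c(NA, head(x, -1)))

# Drop rows without a predecessor, then tabulate the remaining pairs
keep <- !is.na(prev)
counts <- as.data.frame(table(PREDECESSOR = prev[keep],
                              SUCCESSOR = df$TASK[keep]),
                        responseName = "COUNT",
                        stringsAsFactors = FALSE)

# table() also lists combinations that never occur, so remove the zeros
counts <- counts[counts$COUNT > 0, ]
counts
```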

To get just the counts, two lines are enough:
cc <- cbind(df$TASK,c(df$TASK[-1],"LAST"))
table(paste(cc[,1],cc[,2],sep="-"))
The result is
A-B A-C B-C C-D D-E D-F E-A F-LAST
1 1 1 2 1 1 1 1

Related

check if numbers in a column are ascending by a certain value (R dataframe)

I have a column of numbers (index) in a dataframe like the below. I am attempting to check if these numbers are in ascending order by the value of 1. For example, group B and C do not ascend by 1. While I can check by sight, my dataframe is thousands of rows long, so I'd prefer to automate this. Does anyone have advice? Thank you!
group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2
...
I think this works. diff calculates the differences between consecutive numbers, and all then checks whether every difference is 1. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
group_by(group) %>%
summarize(Result = all(diff(index) == 1)) %>%
ungroup()
dat2
# # A tibble: 3 x 2
# group Result
# <chr> <lgl>
# 1 A TRUE
# 2 B FALSE
# 3 C FALSE
DATA
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2",
header = TRUE, stringsAsFactors = FALSE)
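To make the diff/all idea concrete, here is a small base R sketch of the same check using `tapply` instead of dplyr. The data are rebuilt inline, and `res` and `lone` are illustrative names. It also shows one edge case worth knowing: a group with a single row has no differences at all, so the check returns TRUE for it.

```r
dat <- data.frame(group = c(rep("A", 5), rep("B", 4), rep("C", 4)),
                  index = c(0, 1, 2, 3, 4, 0, 1, 2, 2, 0, 3, 1, 2),
                  stringsAsFactors = FALSE)

# Successive differences for group B: the repeated 2 gives a difference of 0
diff(c(0, 1, 2, 2))

# One TRUE/FALSE per group: do all successive differences equal 1?
res <- tapply(dat$index, dat$group, function(v) all(diff(v) == 1))
res

# Edge case: a single value has no differences, and all() of nothing is TRUE
lone <- all(diff(5) == 1)
```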
Maybe aggregate could help
> aggregate(.~group,df1,function(v) all(diff(v)==1))
group index
1 A TRUE
2 B FALSE
3 C FALSE
We can do a group by group, get the difference between the current and previous value (shift) and check if all the differences are equal to 1.
library(data.table)
setDT(df1)[, .(Result = all((index - shift(index))[-1] == 1)), group]
# group Result
#1: A TRUE
#2: B FALSE
#3: C FALSE
data
df1 <- structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), index = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 2L, 0L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-13L))

How to swap row values in the same column of a data frame?

I have a data frame that looks like the following:
ID Loc
1 N
2 A
3 N
4 H
5 H
I would like to swap A and H in the column Loc while not touching rows that have values of N, such that I get:
ID Loc
1 N
2 H
3 N
4 A
5 A
This dataframe is the result of a pipe so I'm looking to see if it's possible to append this operation to the pipe.
You could try:
df$Loc <- chartr("AH", "HA", df$Loc)
df
ID Loc
1 1 N
2 2 H
3 3 N
4 4 A
5 5 A
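One thing to keep in mind with `chartr` is that it translates individual characters, so it fits here only because every value is a single letter. The sketch below shows that behaviour and, as an assumption-based alternative for longer values, a named lookup vector (`swap` and `hit` are hypothetical names):

```r
# chartr swaps the characters A and H wherever they occur
chartr("AH", "HA", c("N", "A", "N", "H", "H"))

# ...including inside longer strings, which may not be what you want
chartr("AH", "HA", "HAT")

# Alternative: swap whole values through a named lookup vector
loc <- c("N", "A", "N", "H", "H")
swap <- c(A = "H", H = "A")
hit <- loc %in% names(swap)          # rows whose value should be swapped
loc[hit] <- unname(swap[loc[hit]])   # leave everything else (e.g. "N") alone
loc
```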
We can try chaining together two calls to ifelse, for a base R option:
df <- data.frame(ID=c(1:5), Loc=c("N", "A", "N", "H", "H"), stringsAsFactors=FALSE)
df$Loc <- ifelse(df$Loc=="A", "H", ifelse(df$Loc=="H", "A", df$Loc))
df
ID Loc
1 1 N
2 2 H
3 3 N
4 4 A
5 5 A
If Loc is a factor, you could simply swap the two levels:
l <- levels(df$Loc)
l[l %in% c("A", "H")] <- c("H", "A")
levels(df$Loc) <- l
df
# ID Loc
# 1 1 N
# 2 2 H
# 3 3 N
# 4 4 A
# 5 5 A
Data:
df <- structure(list(ID = 1:5, Loc = structure(c(3L, 1L, 3L, 2L, 2L
), .Label = c("A", "H", "N"), class = "factor")), .Names = c("ID",
"Loc"), class = "data.frame", row.names = c(NA, -5L))

Iterate through grouped rows to get different pair combinations

Having the following table:
read.table(text = "route origin dest seq
1 a b 1
1 b c 2
1 c d 3
1 d e 4
2 f g 1
2 g h 2
2 h i 3", header = TRUE)
I'm trying to find a way of going through each row, grouped by route, and iterate every potential combination of origin destination pairs, taking into account the seq variable and the route as mentioned.
The output should look something like this:
origin dest
a b
a c
a d
a e
b c
b d
(...) (...)
The idea behind this is that a train on a route, e.g. route 1, goes from a to e, and I want to list every possible origin-destination pair along that route. I tried igraph, but without success.
Any ideas with dplyr or so?
library(dplyr)
library(tidyr)
df %>%
mutate_if(is.factor, as.character) %>% #convert factor variable to character
group_by(route) %>%
expand(origin = paste(origin, seq, sep = "_"), dest = paste(dest, seq, sep = "_")) %>% #all possible combination of origin & destination grouped by route
rowwise() %>%
filter(strsplit(origin, split = "_")[[1]][1] != strsplit(dest, split = "_")[[1]][1] &
as.integer(strsplit(origin, split = "_")[[1]][2]) <= as.integer(strsplit(dest, split = "_")[[1]][2])) %>% #compare seq as numbers, not strings
mutate(origin = gsub("_.*$", "", origin),
dest = gsub("_.*$", "", dest))
Output is:
route origin dest
1 1 a b
2 1 a c
3 1 a d
4 1 a e
5 1 b c
...
Sample data:
df <- structure(list(route = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), origin = structure(1:7, .Label = c("a",
"b", "c", "d", "f", "g", "h"), class = "factor"), dest = structure(1:7, .Label = c("b",
"c", "d", "e", "g", "h", "i"), class = "factor"), seq = c(1L,
2L, 3L, 4L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
# route origin dest seq
#1 1 a b 1
#2 1 b c 2
#3 1 c d 3
#4 1 d e 4
#5 2 f g 1
#6 2 g h 2
#7 2 h i 3
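If each route is a simple chain, meaning every dest is the origin of the next leg, the same pairs can be sketched in base R with `combn`. This is only an illustration under that assumption, not a replacement for the dplyr answer; `stops` and `pairs` are made-up names and the data are rebuilt with character columns.

```r
df <- data.frame(route = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                 origin = c("a", "b", "c", "d", "f", "g", "h"),
                 dest = c("b", "c", "d", "e", "g", "h", "i"),
                 seq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
                 stringsAsFactors = FALSE)

pairs <- do.call(rbind, lapply(split(df, df$route), function(r) {
  r <- r[order(r$seq), ]                     # follow the travel order
  stops <- c(r$origin, r$dest[nrow(r)])      # every stop on the route, in order
  p <- combn(stops, 2)                       # each earlier stop paired with each later one
  data.frame(route = r$route[1], origin = p[1, ], dest = p[2, ],
             stringsAsFactors = FALSE)
}))
rownames(pairs) <- NULL
head(pairs)
```

Because `combn` keeps the order of its input, each pair always runs from an earlier stop to a later one.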

Summation of the corresponding number of values which are in different columns

My data frame looks like below:
df<-data.frame(alphabets1=c("A","B","C","B","C"," ","NA"),alphabets2=c("B","A","D","D"," ","E","NA"),alphabets3=c("C","F","G"," "," "," ","NA"), number = c("1","2","3","1","4","1","2"))
alphabets1 alphabets2 alphabets3 number
1 A B C 1
2 B A F 2
3 C D G 3
4 B D 1
5 C 4
6 E 1
7 NA NA NA 2
NOTE1: within the row all the values are unique, that is, below shown is not possible.
alphabets1 alphabets2 alphabets3 number
1 A A C 1
NOTE2: data frame may contains NA or is blank
I am struggling to get the output below: a data frame with each letter and the sum of the numbers from the rows in which it appears. For example, A appears in rows 1 and 2, so its sum is 1+2 = 3; B appears in rows 1, 2 and 4, so its sum is 1+2+1 = 4.
output <-data.frame(alphabets1=c("A","B","C","D","E","F","G"), number = c("3","4","8","4","1","2","3"))
output
alphabets number
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
NOTE3: output may or may not have the NA or blanks (it doesn't matter!)
We can reshape it to 'long' format and do a group by operation
library(data.table)
melt(setDT(df), id.var="number", na.rm = TRUE, value.name = "alphabets1")[
!grepl("^\\s*$", alphabets1), .(number = sum(as.integer(as.character(number)))),
alphabets1]
# alphabets1 number
#1: A 3
#2: B 4
#3: C 8
#4: D 4
#5: E 1
#6: F 2
#7: G 3
Or we can use xtabs from base R
xtabs(number~alphabets1, data.frame(alphabets1 = unlist(df[-4]),
number = as.numeric(as.character(df[,4]))))
NOTE: In the OP's dataset the missing values were the string "NA" rather than real NA, and the 'number' column is a factor (it was converted to integer before doing the sum).
data
df <- data.frame(alphabets1=c("A","B","C","B","C"," ",NA),
alphabets2=c("B","A","D","D"," ","E",NA),
alphabets3=c("C","F","G"," "," "," ",NA),
number = c("1","2","3","1","4","1","2"))
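The reshape-then-group idea can also be sketched in base R only. The version below stacks the three letter columns, drops NA and blank entries, and sums per letter. It assumes character columns (hence `stringsAsFactors = FALSE`), and `num`, `letter_cols` and `long` are illustrative names.

```r
df <- data.frame(alphabets1 = c("A", "B", "C", "B", "C", " ", NA),
                 alphabets2 = c("B", "A", "D", "D", " ", "E", NA),
                 alphabets3 = c("C", "F", "G", " ", " ", " ", NA),
                 number = c("1", "2", "3", "1", "4", "1", "2"),
                 stringsAsFactors = FALSE)

num <- as.numeric(df$number)
letter_cols <- df[c("alphabets1", "alphabets2", "alphabets3")]

# "Long" format: one row per cell, with the row's number repeated per column
long <- data.frame(alphabets = unlist(letter_cols, use.names = FALSE),
                   number = rep(num, times = ncol(letter_cols)),
                   stringsAsFactors = FALSE)

# Drop missing and blank entries, then sum the numbers per letter
long <- long[!is.na(long$alphabets) & grepl("\\S", long$alphabets), ]
res <- aggregate(number ~ alphabets, long, sum)
res
```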
Here is a base R method using sapply and table. I first converted df$number into a numeric. See data section below.
data.frame(table(sapply(df[-length(df)], function(i) rep(i, df$number))))
Var1 Freq
1 11
2 A 3
3 B 4
4 C 8
5 D 4
6 E 1
7 F 2
8 G 3
9 NA 6
To make the output a little bit nicer, we could wrap a few more functions and perform a subsetting within sapply.
data.frame(table(droplevels(unlist(sapply(df[-length(df)],
function(i) rep(i[i %in% LETTERS],
df$number[i %in% LETTERS])),
use.names=FALSE))))
Var1 Freq
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
It may be easier to do this afterward, though.
data
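I ran
df$number <- as.numeric(as.character(df$number))
on the OP's data (going through as.character so that the factor labels, not the internal level codes, are converted), resulting in this.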
df <-
structure(list(alphabets1 = structure(c(2L, 3L, 4L, 3L, 4L, 1L,
5L), .Label = c(" ", "A", "B", "C", "NA"), class = "factor"),
alphabets2 = structure(c(3L, 2L, 4L, 4L, 1L, 5L, 6L), .Label = c(" ",
"A", "B", "D", "E", "NA"), class = "factor"), alphabets3 = structure(c(2L,
3L, 4L, 1L, 1L, 1L, 5L), .Label = c(" ", "C", "F", "G", "NA"
), class = "factor"), number = c(1, 2, 3, 1, 4, 1, 2)), .Names = c("alphabets1",
"alphabets2", "alphabets3", "number"), row.names = c(NA, -7L), class = "data.frame")
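Since several answers here convert a factor column to numbers, a short illustration of why the conversion usually goes through `as.character` may help. The factor `f` below is a made-up example, not part of the OP's data.

```r
f <- factor(c("10", "20", "10"))

# Direct conversion returns the internal level codes
codes <- as.numeric(f)

# Converting the labels first returns the values that were intended
values <- as.numeric(as.character(f))

codes
values
```

In the OP's data the labels "1" to "4" happen to coincide with their level codes, so both conversions give the same numbers there, but that is a coincidence of this particular dataset.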

R: Subsetting a data.table with repeated column names with numerical positions

I have a data.table that looks like this
> DT
A B C A B C D
1: 1 2 3 3 5 6 7
2: 2 1 3 2 1 3 4
Here's the dput
DT <- structure(list(A = 1:2, B = c(2L, 1L), C = c(3L, 3L), A = c(3L,
2L), B = c(5L, 1L), C = c(6L, 3L), D = c(7L, 4L)), .Names = c("A",
"B", "C", "A", "B", "C", "D"), row.names = c(NA, -2L), class = c("data.table",
"data.frame"))
Basically, I want to subset them according to their headers. So for header "B", I would do this:
subset(DT,,grep(unique(names(DT))[2],names(DT)))
B B
1: 2 2
2: 1 1
As you can see, the values are wrong as the second column is simply a repeat of the first. I want to get this instead:
B B
1: 2 5
2: 1 1
Can anyone help me please?
The following alternatives work for me:
pos <- grep("B", names(DT))
DT[, ..pos]
# B B
# 1: 2 5
# 2: 1 1
DT[, .SD, .SDcols = patterns("B")]
# B B
# 1: 2 5
# 2: 1 1
DT[, names(DT) %in% unique(names(DT))[2], with = FALSE]
# B B
# 1: 2 5
# 2: 1 1
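The underlying issue, selecting by name versus selecting by position, can be seen in a plain data.frame as well. The following base R sketch uses a hypothetical copy `DF` of the data, built with `check.names = FALSE` so that the duplicate names are kept:

```r
DF <- data.frame(A = 1:2, B = c(2L, 1L), C = c(3L, 3L),
                 A = c(3L, 2L), B = c(5L, 1L), C = c(6L, 3L), D = c(7L, 4L),
                 check.names = FALSE)

# Positions of every column named "B"
pos <- which(names(DF) == "B")

# Extracting by position reaches each column separately
first_B  <- DF[[pos[1]]]
second_B <- DF[[pos[2]]]

# Extracting by name only ever reaches the first match
by_name <- DF[["B"]]
```

This is why the alternatives above work from column positions or a pattern match over `names(DT)` rather than repeating the name itself.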
