Sequence of patterns in R sequence and events issues - r

I am trying to work with frequent sequences in R (SPADE). I have the following data set:
d1 <- c(1:10)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")
day widget status
1 1 nut c
2 2 nut b
3 3 nut b
4 4 nut b
5 5 nut a
6 6 nut a
7 7 nut b
8 8 nut c
9 9 nut c
10 10 nut b
11 1 bolt a
12 2 bolt b
...
I have not been able to get the data into a format that works with the various packages available. I think the basic issue is that most packages expect sequences that are tied to an identity and an event. In my case that doesn't exist.
I want to answer the question of:
If on any day the status of widget[bolt] is an "a" and widget[screw] is a "c" and on the next day widget[screw] is "b" then on the 3rd day widget[nut] is likely to be "a".
So there is no identity or transaction/event to use. Am I overcomplicating this issue? Or is there a package that is well suited for this? So far I have tried arulesSequences and TraMineR.
Thank you

Not sure what you want to do. If you would like to use TraMineR, here is how you could input your data assuming the widgets are your sequence ids:
library(TraMineR)
## Transforming into the STS form expected by seqdef()
sts.data <- seqformat(data, from="SPELL", to="STS", id="widget",
begin="day", end="day", status="status",
limit=10)
## Setting position names and sequence names
names(sts.data) <- paste0("d",rep(1:10))
rownames(sts.data) <- d2
sts.data
# d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
# nut b a b b b a c a a a
# bolt c b a b a c b a c c
# screw a b a a c c b b b c
## Creating the state sequence object
sseq <- seqdef(sts.data)
## Plotting the sequences
seqiplot(sseq, ytlab="id", ncol=3)

The key here is to reshape your dataset based on your objective. You have to make sure each row has all the input information (your criteria/conditions) and the target variable (what you want to find out).
Based on the problem you described:
The input info is "widget[bolt] value at a given day, widget[screw] value at the same given day and on widget[screw] value the day after", so you need to make sure each row of your new dataset has this info.
The target info is "3rd day widget[nut] value".
# for reproducibility reasons
set.seed(16)
# example dataset
d1 <- c(1:100)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")
library(tidyverse)
data %>%
spread(widget, status) %>% # reshape data
mutate(screw_next_1 = lead(screw), # add screw next day
nut_next_2 = lead(nut, 2)) %>% # add nut 2 days after (target variable)
filter(bolt == "a" & screw == "c" & screw_next_1 == "b") # get rows that satisfy your criteria
# day nut bolt screw screw_next_1 nut_next_2
# 1 8 c a c b a
# 2 19 c a c b c
# 3 62 c a c b c
# 4 97 c a c b b
With a simple calculation you can say that, based on the data you have, the probability of nut = "a" on the 3rd day, given your criteria, is 1/4 (one of the four matching rows).
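The "simple calculation" can also be done in code. The following is only a base R sketch on hypothetical toy data (the `wide` data frame and the `lead_by` helper are made up for illustration and are not part of the answer above); it assumes the data is already in wide form with one row per day.

```r
# Hypothetical toy data: one row per day, one status column per widget
wide <- data.frame(
  day   = 1:8,
  bolt  = c("a", "b", "a", "c", "a", "b", "c", "a"),
  screw = c("c", "b", "c", "a", "c", "b", "a", "b"),
  nut   = c("b", "c", "a", "b", "a", "c", "b", "a"),
  stringsAsFactors = FALSE
)

# Illustrative helper: value k rows later, padded with NA at the end
lead_by <- function(x, k) c(x[-seq_len(k)], rep(NA, k))

wide$screw_next_1 <- lead_by(wide$screw, 1)  # screw status on the next day
wide$nut_next_2   <- lead_by(wide$nut, 2)    # nut status two days later (target)

# Rows that satisfy the conditions; which() drops comparisons that are NA
hits <- which(wide$bolt == "a" & wide$screw == "c" & wide$screw_next_1 == "b")

# Share of matching rows whose nut status two days later is "a"
p_nut_a <- mean(wide$nut_next_2[hits] == "a", na.rm = TRUE)
p_nut_a
```

With only a handful of matching rows the estimate is very noisy, so it is worth looking at `length(hits)` alongside the proportion.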

I think you'll find this type of question is most easily addressed by reshaping your data from long to wide, and then implementing a logical test. For example:
# reshape from long to wide
data2 <- reshape2::dcast(data, day ~ widget, value.var = "status")
# get the next row's value for "nut"
data2$next_nut <- dplyr::lead(data2$nut)
# implement your test
data2$bolt == "a" & data2$screw == "c" & data2$next_nut == "a"
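If you would rather not depend on reshape2, the same long-to-wide step can be sketched with base R's `reshape()`. The toy `long` data frame below is hypothetical; the column names follow the question.

```r
# Hypothetical long data: one row per (day, widget) pair
long <- data.frame(
  day    = rep(1:3, times = 3),
  widget = rep(c("nut", "bolt", "screw"), each = 3),
  status = c("a", "b", "c",   # nut on days 1..3
             "b", "b", "a",   # bolt on days 1..3
             "c", "a", "a"),  # screw on days 1..3
  stringsAsFactors = FALSE
)

# One row per day, one column per widget (columns come out as status.<widget>)
wide <- reshape(long, idvar = "day", timevar = "widget", direction = "wide")

# Drop the "status." prefix so the columns are simply nut, bolt, screw
names(wide) <- sub("^status\\.", "", names(wide))
wide
```

From here the logical test works exactly as with the `dcast()` result.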

Related

Subset Data Frame Rows by value in row.names in R

I have seen "Subsetting a data frame based on a logical condition on a subset of rows" and https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
I want to subset a data.frame according to a specific value in the row.names.
data <- data.frame(x1 = c(3, 7, 1, 8, 5), # Create example data
x2 = letters[1:5],
group = c("ga1", "ga2", "gb1", "gc3", "gb1"))
data # Print example data
# x1 x2 group
# 3 a ga1
# 7 b ga2
# 1 c gb1
# 8 d gc3
# 5 e gb1
I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?
The result should look like this
data.a
# x1 x2 group
# 3 a ga1
# 7 b ga2
data.b
# x1 x2 group
# 1 c gb1
# 5 e gb1
data.c
# 8 d gc3
I would be interested in how to subset one of these output examples, or perhaps a loop would work too.
I modified the example from here https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
Extract the part of group that you want to split on:
sub('\\d+', '', data$group)
#[1] "ga" "ga" "gb" "gc" "gb"
and use the above in split to divide the data into groups.
new_data <- split(data, sub('\\d+', '', data$group))
new_data
#$ga
# x1 x2 group
#1 3 a ga1
#2 7 b ga2
#$gb
# x1 x2 group
#3 1 c gb1
#5 5 e gb1
#$gc
# x1 x2 group
#4 8 d gc3
It is better to keep the data in a list; however, if you want separate data frames for each group, you can use list2env.
list2env(new_data, .GlobalEnv)
We can use group_split with str_remove in tidyverse
library(dplyr)
library(stringr)
data %>%
group_split(grp = str_remove(group, "\\d+$"), .keep = FALSE)
Good question. This solution uses inputs and outputs that closely match the request: "I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?".
The code below uses the data frame that was provided (named data), and uses grep(), and subsets by group.
code:
ga <- grep("ga", data$group) # separate the data by group type
gb <- grep("gb", data$group)
gc <- grep("gc", data$group)
ga1 <- data[ga,] # subset ga
gb1 <- data[gb,] # subset gb
gc1 <- data[gc,] # subset gc
print(ga1)
print(gb1)
print(gc1)
Windows and Jupyter Lab were used. This output here closely matches the output that was shown above.
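Since the question also asks whether a loop would work, here is a small base R sketch that generalizes the `grep()` approach with `lapply()` instead of one line per group. The `prefixes` vector and the `subsets` list are names chosen for illustration; the example data is the one from the question.

```r
data <- data.frame(x1 = c(3, 7, 1, 8, 5),
                   x2 = letters[1:5],
                   group = c("ga1", "ga2", "gb1", "gc3", "gb1"),
                   stringsAsFactors = FALSE)

# Group prefixes to look for (assumed known in advance)
prefixes <- c("ga", "gb", "gc")

# One subset per prefix; "^" anchors the match at the start of group
subsets <- lapply(prefixes, function(p) {
  data[grepl(paste0("^", p), data$group), ]
})

# Name the list elements data.a, data.b, data.c as in the question
names(subsets) <- paste0("data.", substr(prefixes, 2, 2))
subsets$data.b
```

Keeping the pieces in a named list avoids creating one global variable per group.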

How to transpose a long data frame every n rows

I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b and c, but some items may be missing attributes, e.g. they only have a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
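We can use dcast. Note that rowid(type) numbers the occurrences of each type independently, so when an item is missing an attribute, the later values of that attribute move up a row (see c in the output below):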
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, which is slightly more complex but would be needed if some groups have "a" missing, is the following. It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group. The factor() call is needed because, since R 4.0, data.frame() no longer converts strings to factors by default.
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example if one group can have b and c missing and then next group can have a missing then whether b and c in the second group actually form a second group or are part of the first group is not determinable.
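To illustrate the level-based grouping on a case where the first group has no a, here is a small base R sketch. The data frame `x2` is hypothetical (a variant of `x` with the first a removed), and the level order a < b < c is an assumption spelled out in the code.

```r
# Hypothetical variant of x in which the first group lacks "a"
x2 <- data.frame(type  = c("b", "c", "a", "b", "a", "b", "c"),
                 value = c(2, 3, 2, 10, 6, 7, 8),
                 stringsAsFactors = FALSE)

# Numeric level codes, assuming the within-group order a < b < c
code <- as.numeric(factor(x2$type, levels = c("a", "b", "c")))

# A new group starts wherever the code drops; the leading -1 starts group 1
g <- cumsum(c(-1, diff(code)) < 0)

# Cross-tabulate values by group and type; empty cells become NA
tab <- with(x2, tapply(value, list(g, type), c))
as.data.frame(tab)
```

With `cumsum(x2$type == "a")` instead, the first two rows would land in a group 0, which is why the level-based definition is needed here.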

randomly ordering across groups (not within group) in data.table

Let's say I want to order the iris dataset (as a data.table) by Species, keeping observations grouped by species and randomly ordering across species.
How do I do that?
I am not talking about generating a random order within groups (species).
My intuition was to write the code below, but it actually creates a random variable within species. At least it makes the question reproducible:
library(data.table)
d <- as.data.table(iris)
set.seed(12345)
d[, g := runif(.N), by = Species]
You may do a binary search in i. A smaller example:
d <- data.table(Species = rep(letters[1:4], each = 2), ri = 1:8)
set.seed(1)
d[.(sample(unique(Species))), on = "Species"]
# Species ri
# 1: b 3
# 2: b 4
# 3: d 7
# 4: d 8
# 5: c 5
# 6: c 6
# 7: a 1
# 8: a 2
Alternatively you could do:
e <- d[, .N, Species]
e[, g2 := runif(.N)]
d <- e[, .(Species, g2)][d, on = 'Species']
We can randomly sample from a series 1...N where N is the # of levels of the factor (Species) in question.
We then map the new order to a column and sort by it. Broken apart into steps for illustration it looks like this:
tmp <- sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1]
d$index <- tmp[as.numeric(d$Species)]
d <- d[order(d$index),]
You could compact this into 1 line/step:
d <- d[order(sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1][as.numeric(d$Species)]),]
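The same idea (shuffle the group labels once, then sort rows by the shuffled position) can be sketched in base R without hard-coding the number of groups. This is only an illustration on a small hypothetical table; `group_order` and `shuffled` are names made up here.

```r
set.seed(42)

# Small hypothetical table: four groups of two rows each
d <- data.frame(Species = rep(c("a", "b", "c", "d"), each = 2),
                ri = 1:8,
                stringsAsFactors = FALSE)

# Shuffle the group labels once
group_order <- sample(unique(d$Species))

# Rank each row by its group's position in the shuffled labels.
# order() leaves ties in their original order, so rows within a
# group keep their relative order.
shuffled <- d[order(match(d$Species, group_order)), ]
shuffled
```

Because only the labels are sampled, rows of one species always stay together; the randomness is purely across groups.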

connecting groups of duplicates

I have some data which has lots of duplication. For example, this data frame shows IDs in the data set that are known to be identical (e.g. row 1 indicates a = b, and the rest of the data indicate that a = b = c and d = e = f):
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
duplicates <- cbind(a,b)
Is there any easy way to split these into two groups that are true IDs (e.g. here a,b & c are all the same and d,e & f are also all the same). So for my sample data:
a <- c('a','b','c','d','e','f')
b <- c('c1','c1','c1','c2','c2','c2')
new_id <- cbind(a,b)
The actual data has thousands of rows and is not fully connected (i.e. in a cluster of duplicates this could occur: a=b, a=c,b=/=c), due to some errors in duplicate detection.
Sounds like you are looking at network analysis. There are a few packages that deal with this, so you might want to use the one you are most familiar with (network, tidygraph, igraph, DiagrammeR). I use igraph, because I know that one a bit better than the others.
Steps:
First create a graph from the data using the dup data.frame. Next use the clusters function (or one of the other cluster options) to create clusters based on the data. Last step is to transform the clusters into a data.frame. Additionally you could plot the data (depends on how much data you have).
library(igraph)
g <- graph_from_data_frame(dup, directed = FALSE)
clust <- clusters(g)
clusters <- data.frame(name = names(clust$membership),
cluster = clust$membership,
row.names = NULL,
stringsAsFactors = FALSE)
clusters
name cluster
1 a 1
2 b 1
3 c 1
4 d 2
5 e 2
6 f 2
# plot graph if needed
plot(g)
data:
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
dup <- data.frame(a,b, stringsAsFactors = FALSE)
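To make explicit what the clustering step computes (the connected components of the duplicate graph), here is a base R union-find sketch that does not need igraph. It is an illustration only, not what igraph does internally; `parent`, `find_root` and `roots` are names invented for this example.

```r
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')

# Every id starts as the root of its own group
ids <- sort(unique(c(a, b)))
parent <- setNames(ids, ids)

# Follow parent links until reaching an id that is its own parent
find_root <- function(id) {
  while (parent[[id]] != id) id <- parent[[id]]
  id
}

# Merge the groups of each duplicate pair
for (k in seq_along(a)) {
  ra <- find_root(a[k])
  rb <- find_root(b[k])
  if (ra != rb) parent[[rb]] <- ra
}

# Number the groups in order of first appearance
roots <- vapply(ids, find_root, character(1))
new_id <- data.frame(name = ids,
                     cluster = match(roots, unique(roots)),
                     stringsAsFactors = FALSE)
new_id
```

Like connected components in general, this puts a, b and c in one group as soon as a = b and a = c are present, even if the pair b, c is missing from the data.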
You could work with factors.
df.1$id <- with(df.1, ifelse(as.numeric(a) %in% 1:3, "c1", "c2"))
new_id <- unique(df.1[, -2])
rownames(new_id) <- NULL # just in case
Yielding
> new_id
a id
1 a c1
2 b c1
3 c c1
4 d c2
5 e c2
6 f c2
Data
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
df.1 <- data.frame(a, b, stringsAsFactors = TRUE) # as.numeric(a) above needs a factor (not the default since R 4.0)

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (and it's a lot of trial and error, but that's OK).
Now I need to write more sophisticated code and I can't find a way to do this.
Here is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID' variable is the same.
As this is the third day I've been trying to solve this, I have already tried for and apply, but both run for hours (canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)
OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
Values is character now, we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
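As a rough illustration of the summarise-then-look-up idea with the question's own column names, here is a base R sketch that replaces the row-by-row loop. The two small data frames are hypothetical stand-ins for Sub02 and Sub03. It assumes IDs match exactly (the loop in the question uses grepl(), which also matches substrings) and that the value to update is VAL9.

```r
# Hypothetical stand-ins for the two sub-data-sets
Sub02 <- data.frame(ID   = c("x_1", "x_2", "x_3"),
                    VAL9 = c(8, 11, 9),
                    stringsAsFactors = FALSE)
Sub03 <- data.frame(Source = c("x_1", "x_1", "x_2"),
                    VAL1   = c("4", "6", "5"),   # stored as character
                    VAL2   = c("1", "7", "8"),
                    stringsAsFactors = FALSE)

# Contribution of each Sub03 row: 1 for the entry itself plus VAL1 and VAL2
Sub03$contrib <- 1 + as.numeric(Sub03$VAL1) + as.numeric(Sub03$VAL2)

# Sum the contributions once per Source value
add <- tapply(Sub03$contrib, Sub03$Source, sum)

# Look up each ID's total; IDs without any entries get 0
inc <- as.numeric(add[match(Sub02$ID, names(add))])
inc[is.na(inc)] <- 0

# Update every row in one vectorized step
Sub02$VAL9 <- Sub02$VAL9 + inc
Sub02
```

The grouping is done once over Sub03 rather than once per row of Sub02, which is where the loop spends its time.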

Resources