I have a dataset of numerical values, each representing a zone.
e.g.
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
I need to identify whether there are repeated subsequences within the data, i.e. whether the subject repeatedly travelled from zone 1 to 2 to 3. In the example above, the subsequence 1,2,3 occurs three times. I don't know the subsequences in advance; I need R to find them from the data.
Following that I need to calculate how many times this subsequence appears in the data.
I have very basic knowledge of R, so forgive my ignorance if this is a simple task!
Here's a way to find which subsequences of length n repeat, and how many times each occurs.
For n = 3
library(tidyverse) # not necessary, see base version below
n <- 3
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% # get index of all subsequences
map_chr(~ paste(x[.], collapse = ',')) %>% # paste together as character
table %>% # get number of times each occurs
`[`(. > 1) # select sequences occurring > 1 time
# 1,2,3
# 3
For n = 2
n <- 2
lapply(seq(0, length(x) - n), `+`, seq(n)) %>%
map_chr(~ paste(x[.], collapse = ',')) %>%
table %>%
`[`(. > 1)
# 1,2 2,3 5,9
# 3 3 2
Without Tidyverse
seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
seqs.char <- sapply(seqs, function(i) paste(x[i], collapse = ','))
tbl <- table(seqs.char)
tbl[tbl > 1]
I'll add my own question: does anyone know how to do this without converting to character first? E.g. a function fun where fun(list(1:2, 1:2, 2:3)) tells you 1:2 occurs twice and 2:3 occurs once?
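One possible sketch for that aside (not a full answer): compare the list elements directly with identical(), so the counting itself never converts to character; the paste at the end is only to label the printed result. With x and n as above:
count_subseqs <- function(subs) {
  uniq <- unique(subs)                                   # unique subsequences (still numeric vectors)
  counts <- vapply(uniq,
                   function(u) sum(vapply(subs, identical, logical(1), u)),
                   integer(1))                           # count exact matches with identical()
  setNames(counts, vapply(uniq, paste, character(1), collapse = ","))  # names only for display
}
seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
counts <- count_subseqs(lapply(seqs, function(i) x[i]))
counts[counts > 1]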
An alternative tidyverse approach that creates one big data frame of results, based on how many values you want your subsequences to have:
library(tidyverse)
# example vector
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)
# function that takes the number of consecutive elements in a subsequence
# and returns a data frame ordered by count of occurrence
f <- function(n) {
  data.frame(value = x) %>%                 # start from the vector x
    slice(1:(nrow(.) - n + 1)) %>%          # drop positions at the end that can't start a subsequence
    mutate(position = row_number()) %>%     # add the position of each value
    rowwise() %>%                           # for each value/row
    mutate(vec = paste0(x[position:(position + n - 1)], collapse = ",")) %>% # build the subsequence as a string
    ungroup() %>%                           # drop the rowwise grouping
    count(vec, sort = TRUE)                 # count occurrences, descending
}
2:5 %>%                                             # subsequence lengths to investigate (say, 2 to 5)
  map_df(~ data.frame(NumElements = ., f(.))) %>%   # apply the function, keeping the subsequence length
  arrange(desc(n)) %>%                              # order by counts descending
  tbl_df()                                          # (only for visualisation purposes)
# # A tibble: 88 x 3
# NumElements vec n
# <dbl> <chr> <int>
# 1 2 1,2 3
# 2 2 2,3 3
# 3 3 1,2,3 3
# 4 2 5,9 2
# 5 2 1,6 1
# 6 2 10,1 1
# 7 2 10,7 1
# 8 2 3,10 1
# 9 2 3,4 1
# 10 2 4,1 1
# # ... with 78 more rows
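Since the original question only asks about repeated subsequences, you can keep just those by filtering on the count (a small addition to the pipeline above):
2:5 %>%
  map_df(~ data.frame(NumElements = ., f(.))) %>%
  filter(n > 1) %>%     # keep only subsequences occurring more than once
  arrange(desc(n))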
The approach below finds repeated subsequences of any length (k): the input vector is converted into a matrix with k columns; this is done k times, prepending 0 to (k-1) NAs so that every offset is covered. Finally, all rows in these k matrices are counted (after paste-ing the elements together):
frs <- function(x, k = 2){
  padit <- function(.) c(., rep(NA, k - length(.) %% k))
  xx <- lapply(1:k, function(iii) padit(c(rep(NA, iii - 1), x)))
  xx <- do.call(rbind, lapply(xx, function(.) matrix(., ncol = k, byrow = TRUE)))
  xx <- sapply(split(xx, 1:NROW(xx)), paste, collapse = ",")
  (function(x) x[x > 1])(table(xx))
}
Output:
> frs(x,2)
xx
1,2 2,3 5,9
3 3 2
> frs(x,3)
1,2,3
3
> frs(x,4)
named integer(0)
I need to extract a sample that has an equal distribution in each experience-level group. For your info, there are 4 groups in total (1, 2, 3, 4 years of experience) and 8 people in total (A, B, C, D, E, F, G, H) in this example scenario. I was trying to come up with a function with loops, but I don't know how. Please help me out! Thank you! :)
library(tidyverse)
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
               year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
               pre_year_exp = year_exp - 1)
data_0 <- data %>% filter(year_exp == max(year_exp) - 0) %>% sample_n(2)
data_1 <- data %>% filter(year_exp == max(year_exp) - 1) %>% anti_join(data_0, by = 'id') %>% sample_n(2)
data_2 <- data %>% filter(year_exp == max(year_exp) - 2) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% sample_n(2)
data_3 <- data %>% filter(year_exp == max(year_exp) - 3) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% anti_join(data_2, by = 'id')
#Result Table
result <- data_0 %>% bind_rows(data_1, data_2, data_3)
result
The code below produces the same output as yours and extends the idea to an arbitrary number of values of year_exp using a for loop.
Please note that because this simply extends your code, it shares the following (possibly undesirable) features with your code:
The code moves sequentially through the groups, sampling from the members of later groups who were not sampled for earlier groups. Accordingly, there is a risk that the code throws an error because it tries to sample from a group whose members have all already been sampled for previous groups.
The probabilities of selection are not uniformly distributed across the members of a group. Accordingly, the samples drawn from each group are not representative of that group.
In the event that the data were instead a balanced panel, there are much more efficient and simpler ways to accomplish this (a sketch follows the output below).
library(tibble)
library(dplyr)
set.seed(123)
# Create original data
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
pre_year_exp = year_exp - 1)
# Assign values to parameters used by/in the loop.
J <- data$id %>% unique %>% length # unique units/persons (8)
K <- data$year_exp %>% unique %>% length # unique groups/years (4)
N <- 2 # sample size per group (2)
# Initialize objects loop will modify
samples_list <- vector(mode = "list", length = K) # stores each sample
used_ids <- rep(NA_character_, J) # stores used ids
index <- 1:N # initial indices for used ids
# For-loop solution
for (k in 1:K) {
  # Identifier for current group
  cur_group <- 1 + K - k
  # Sample from persons in current group who were not previously sampled
  one_sample <- data %>%
    filter(year_exp == cur_group, !(id %in% used_ids)) %>%
    slice_sample(n = N)
  # Save sample and the id values for those sampled
  samples_list[[k]] <- one_sample
  used_ids[index] <- one_sample$id
  index <- index + N
}
# Bind into a single data.frame
bind_rows(samples_list)
#> # A tibble: 8 x 3
#> id year_exp pre_year_exp
#> <chr> <dbl> <dbl>
#> 1 H 4 3
#> 2 D 4 3
#> 3 G 3 2
#> 4 E 3 2
#> 5 C 2 1
#> 6 B 2 1
#> 7 F 1 0
#> 8 A 1 0
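As noted above, if the data were instead a balanced panel (every person observed in every year), the draw becomes much simpler. A rough sketch, reusing data and N from above and assuming that balanced structure (which the example data do not have):
ids <- sample(unique(data$id))                      # shuffle the ids once
assignment <- tibble(id = ids,
                     year_exp = rep(sort(unique(data$year_exp), decreasing = TRUE),
                                    each = N))      # assign N shuffled ids to each year; assumes #ids == #groups * N
data %>% semi_join(assignment, by = c("id", "year_exp"))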
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 columns by creating a vector with rep in which each element repeats 3 times. Then, with this list of sub data frames, (s)apply a function that adds the sums of the second and third columns and divides by the sum of the first column.
out_vec <- sapply(
  split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X))),
  function(x) (sum(x[2]) + sum(x[3])) / sum(x[1])
)
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <- sapply(
  split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X))),
  function(x) (x[2] + x[3]) / x[1]
)
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x) sum(x[-1])/x[1])
1 2
7.454545 2.845070
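If you do want an explicit loop, as the question mentions, here is a minimal base R sketch of the same computation:
groups <- split.default(X, rep(seq_len(ncol(X) / 3), each = 3))  # list of 3-column chunks
value <- numeric(length(groups))
for (i in seq_along(groups)) {
  g <- groups[[i]]
  value[i] <- (sum(g[[2]]) + sum(g[[3]])) / sum(g[[1]])
}
data.frame(value)
#      value
# 1 7.454545
# 2 2.845070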
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(.fn = sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
I'm trying to separate a string column into two pieces by chopping up the string; it's best illustrated with the example below. rowwise does work, but given the size of the data.frame, I'd like a more efficient method. How can I avoid using rowwise?
library(dplyr)
library(stringr)
library(tidyr)
#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"
df <- data.frame(size = c(a,b,c))
# Goal is to separate the 'size' column into 'lower' and 'upper' by
# extracting the value contained in the parens and split by a comma.
# Once the column is split into 'upper' and 'lower' I will perform
# additional operations.
# DESIRED RESULT
size lower upper
<fct> <chr> <chr>
1 (1, 10) 1 10
2 (10, 20) 10 20
3 (20, 30) 20 30
# WHAT I HAVE TRIED
> #This works... but too inefficient
> df %>%
+ rowwise() %>%
+ mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+ str_split("\\(") %>% .[[1]] %>% .[2])
size lower
<fct> <chr>
1 (1, 10) 1
2 (10, 20) 10
3 (20, 30) 20
> # I'm not sure why this doesn't work
> df %>%
+ mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+ str_split("\\(") %>% .[[1]] %>% .[2])
size lower
1 (1, 10) 1
2 (10, 20) 1
3 (20, 30) 1
> #Not obvious how to use separate (tidyr)
> df %>%
+ separate(size, sep=",", c("lower", "upper"))
lower upper
1 (1 10)
2 (10 20)
3 (20 30)
You don't state your goal explicitly, but it seems like you want to extract the first number from a string. This is easy with stringi::stri_extract_first_regex:
library(stringi)
stri_extract_first_regex(df$size, "[0-9]+")
# [1] "1" "10" "20"
So in your case,
df %>% mutate(lower = as.numeric(stri_extract_first_regex(size, "[0-9]+")))
You can extract all numbers with stri_extract_all_regex (or stringr's str_extract_all, as below).
Based on your edits:
df$nums = str_extract_all(df$size, "[0-9]+")
df$lower = as.numeric(sapply(df$nums, `[[`, 1))
df$upper = as.numeric(sapply(df$nums, `[[`, 2))
df
# size nums lower upper
# 1 (1, 10) 1, 10 1 10
# 2 (10, 20) 10, 20 10 20
# 3 (20, 30) 20, 30 20 30
Another way to go is to get rid of the parens and whitespace and then use separate:
df %>%
mutate(just_nums = str_replace_all(size, "[^0-9,]", "")) %>%
separate(just_nums, into = c("lower", "upper"))
# size lower upper
# 1 (1, 10) 1 10
# 2 (10, 20) 10 20
# 3 (20, 30) 20 30
The regex pattern "[^0-9,]" matches everything except numbers and commas.
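A quick way to see what that pattern does, using stringr (already loaded above):
str_replace_all("(1, 10)", "[^0-9,]", "")
# [1] "1,10"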
For rowwise operations, I prefer data.table.
Try this
library(data.table)
library(stringi)
#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"
dt <- data.table(c(a,b,c))
dt[, lower := tstrsplit(V1, ",")[1]]
dt[, lower := stri_replace_all_regex(lower, '\\(', '')]
dt
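The upper bound can be pulled out the same way; a sketch following the same pattern (not in the original answer):
dt[, upper := tstrsplit(V1, ",")[2]]
dt[, upper := stri_replace_all_regex(upper, '[) ]', '')]  # drop the ")" and the leading space
dt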
An option is to use tidyr::separate after removing both ( and ) from the data.
library(tidyverse)
df %>% mutate(size = gsub("\\(|\\)", "", size)) %>% # both ( and ) removed
  separate(size, c("Min", "Max"), sep = ",")
# Min Max
# 1 1 10
# 2 10 20
# 3 20 30
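If you want numeric columns directly, separate() can also type-convert the pieces for you; a sketch of the same idea with convert = TRUE:
df %>%
  mutate(size = gsub("\\(|\\)", "", size)) %>%
  separate(size, c("Min", "Max"), sep = ",", convert = TRUE)  # Min and Max become integers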
You are almost there. Here is my explanation of two approaches, one of which is similar to yours:
In the first one, I have used unnest_tokens from the tidytext package, which splits each string into word tokens on separate rows. Since you want the first item before the comma (I have assumed this based on your example, although you should state it), I keep the first row of each pair using filter.
In the second one, I have used a regex (note that you could also use str_replace here). I use map (since str_split returns a list) to iterate over the returned items and pass each to gsub, which replaces the regex match with the back-referenced group. To select only the first item, I have used [[1]] at the end of the gsub call.
library(tidyverse)
library(stringr)
library(tidytext)
df %>%
unnest_tokens(lower,size, token="words",drop=F) %>%
filter(row_number()%%2==T)
df %>%
mutate(lower = map(str_split(df$size, ","), function(x)gsub("\\((\\w+)","\\1",x)[[1]]))
Output:
# size lower
# 1 (1, 10) 1
# 2 (10, 20) 10
# 3 (20, 30) 20
In case you want to extract both the terms before and after the comma, you can use tidyr's extract function as well.
tidyr::extract(df, size, c("lower", "upper"), regex= "\\((\\w+),\\s+(\\w+)\\)")
Output:
# lower upper
# 1 1 10
# 2 10 20
# 3 20 30
This question is similar to selecting the top N values within a group by column here.
However, I want to select the last N values by group, with N depending on the value of a corresponding count column. The count represents the number of occurrences of a specific name. If the count is 3 or more, I want only the last three entries; if it is less than 3, I want only the last entry.
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), Value = c(1,2,3,4,5,6,7,8,9))
# Obtain count for each name
count <- df %>%
group_by(Name) %>%
summarise(Count = n_distinct(Value))
# Merge dataframe with count
merge(df, count, by=c("Name"))
# Delete the first entry for x and the first entry for z
# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))
Another dplyrish way:
df %>%
  group_by(Name) %>%
  slice(tail(row_number(), if (n_distinct(Value) < 3) 1 else 3))
# A tibble: 7 x 2
# Groups: Name [3]
Name Value
<fctr> <dbl>
1 x 2
2 x 3
3 x 4
4 y 5
5 y 6
6 y 7
7 z 9
The analogue in data.table is...
library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]
The closest thing in base R is...
with(df, {
  len = tapply(Value, Name, FUN = length)
  nv = tapply(Value, Name, FUN = function(x) length(unique(x)))
  df[sequence(len) > rep(nv - ifelse(nv < 3, 1, 3), len), ]
})
... which is way more difficult to come up with than it should be.
Another possibility:
library(tidyverse)
df %>%
split(.$Name) %>%
map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))
Which gives:
# Name Value
#1 x 2
#2 x 3
#3 x 4
#4 y 5
#5 y 6
#6 y 7
#7 z 9
In base R, split df by df$Name first. Then, for each subgroup, check the number of rows and extract the last 3 rows or the last row, conditionally.
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]),]))
Or
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)),]))
# Name Value
#x.2 x 2
#x.3 x 3
#x.4 x 4
#y.5 y 5
#y.6 y 6
#y.7 y 7
#z z 9
For three conditional values
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))
If you're already using dplyr, the natural approach is:
library(dplyr)
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
Value = c(1,2,3,4,5,6,7,8,9))
df %>%
  group_by(Name) %>%
  mutate(Count = n_distinct(Value),
         Rank = dense_rank(desc(Value))) %>%
  filter((Count >= 3 & Rank <= 3) | (Rank == 1)) %>%
  select(-c(Count, Rank))
There's no need for a merge since you are just counting and ranking on groups defined by Name. Then, you apply a filter on your count and rank requirements, and (optionally, for clean-up) drop the counts and ranks.
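For reference, the intermediate grouped tibble just before the filter step looks like this for the sample data above:
df %>%
  group_by(Name) %>%
  mutate(Count = n_distinct(Value),
         Rank = dense_rank(desc(Value)))
# # A tibble: 9 x 4
# # Groups:   Name [3]
#   Name  Value Count  Rank
# 1 x         1     4     4
# 2 x         2     4     3
# 3 x         3     4     2
# 4 x         4     4     1
# 5 y         5     3     3
# 6 y         6     3     2
# 7 y         7     3     1
# 8 z         8     2     2
# 9 z         9     2     1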
How do I sum only the rows that contain a number in column b?
a <- c(1,5,3,1,-8,6,-1)
b <- c(4,-2,1,0,"c",2,"DX")
df <- data.frame(a,b)
df
# a b
# 1 1 4
# 2 5 -2
# 3 3 1
# 4 1 0
# 5 -8 c
# 6 6 2
# 7 -1 DX
I just can't figure out how to sum only the numeric rows of column b.
In fact, I have multiple datasets from multiple CSV files that I read and store in R as data frames with this code:
path <- "C:/Users/Visitor/Desktop/Unesco/"
files <- list.files(path=path, pattern="*.csv")
for (file in files) {
  perpos <- which(strsplit(file, "")[[1]] == ".")
  assign(
    gsub(" ", "", substr(file, 1, perpos - 1)),
    read.csv(paste(path, file, sep = ""))
  )
}
I can then access each of them (the CSV files) by typing its name.
The problem is that some of the columns contain both numeric and character values, so I can't do a simple sum on them.
I'd suggest using stringsAsFactors = FALSE when creating your data.frame or converting factors to character in your existing data.frame. Then you could extract numeric values into a new column, which you can sum with regular tidyverse tools.
library(tidyverse)
a <- c(1,5,3,1,-8,6,-1)
b <- c(4,-2,1,0,"c",2,"DX")
df <- data.frame(a, b, stringsAsFactors = F)
df %>%
mutate(b_numbers = parse_double(b)) %>%
summarise(sum = sum(b_numbers, na.rm = T))
#> Warning: 2 parsing failures.
#> row col expected actual
#> 5 -- a double c
#> 7 -- a double DX
#> sum
#> 1 5
This approach generalizes nicely to working on multiple columns:
df %>%
mutate_all(funs(parse_double)) %>%
summarise_all(funs(sum(., na.rm = T)))
#> Warning: 2 parsing failures.
#> row col expected actual
#> 5 -- a double c
#> 7 -- a double DX
#> a b
#> 1 7 5
If you have many data.frames like this, you could turn it into a function and map it to your list:
my_fun <- function(x) {
  x %>%
    mutate_all(funs(suppressWarnings(parse_double(.)))) %>%
    summarise_all(funs(sum(., na.rm = T)))
}
# create list with multiple data.frames
my_list <- list(a = df, b = df)
my_list %>%
map(my_fun)
#> $a
#> a b
#> 1 7 5
#>
#> $b
#> a b
#> 1 7 5
As d.b pointed out, you may use suppressWarnings() to silence the warnings from parse_double. I would prefer parse_double over as.numeric in this example, since it is more informative about what is happening.
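A small illustration of that difference (my own example): as.numeric() only warns that NAs were introduced, while parse_double() records which values failed, and you can retrieve them with readr::problems().
as.numeric(c("4", "c", "DX"))                    # warning: NAs introduced by coercion
parsed <- readr::parse_double(c("4", "c", "DX"))
readr::problems(parsed)                          # reports the failing rows and their actual values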
Edit:
Since you stated in your comment to the original question that the columns of your data.frames always have the same names, you could use mutate_at and summarise_at:
my_fun <- function(x) {
  x %>%
    mutate_at("b", funs(suppressWarnings(parse_double(.)))) %>%
    summarise_at("b", funs(sum(., na.rm = T)))
}
or for multiple columns:
my_fun <- function(x) {
  x %>%
    mutate_at(c("a", "b"), funs(suppressWarnings(parse_double(.)))) %>%
    summarise_at(c("a", "b"), funs(sum(., na.rm = T)))
}
If you are looking to sum by row:
a <- c(1,5,3,1,-8,6,-1)
b <- c(4,-2,1,0,"c",2,"DX")
df <- data.frame(a,b = as.numeric(b))
rowSums(df)
For only the non-NA rows:
rowSums(df[!is.na(df$b),])
If you want sums by column:
colSums(df[!is.na(df$b),])
or depending on what you want to sum in the columns:
colSums(df, na.rm = TRUE)
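Note that as.numeric(b) in the data.frame() call above warns that NAs were introduced by coercion; if that is expected, the warning can be silenced (a small sketch):
df <- data.frame(a, b = suppressWarnings(as.numeric(b)))
colSums(df, na.rm = TRUE)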