Frequent Sequential Patterns - r

What would be the best way to get the sequential pattern for such data in R?
The idea is to get the frequency of each letter in processes 1, 2, and 3. Is there a GSP function that can do that? Any insight or tutorial is appreciated.

You can combine apply and table (once your data has been read into R):
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'), process2 = c('B', 'C', 'B', 'B', 'A'), process3 = c('C', 'C', 'A', 'B', 'B'))
apply(dat, 2, table)
#   process1 process2 process3
# A        3        1        1
# B        1        3        2
# C        1        1        2
apply iterates over the columns of dat (this is what the second argument, 2, refers to) and applies table to each one, which counts the occurrences of each unique element. See the help pages for the *apply family of functions for more information.
d.b's solution above, lapply(dat, table), does the same thing but returns a list rather than a matrix.
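One caveat with counting column by column: the rows only line up when every column contains the same set of letters. A minimal sketch that fixes the set of levels up front, assuming hypothetical data in which process2 never contains 'A' (the name lv and the modified data are illustrative, not from the question):

```r
# Hypothetical data: 'A' never occurs in process2
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'),
                  process2 = c('B', 'C', 'B', 'B', 'B'),
                  process3 = c('C', 'C', 'A', 'B', 'B'),
                  stringsAsFactors = FALSE)

# All letters that occur anywhere in the data
lv <- sort(unique(unlist(dat)))

# Count each column against the same levels, so absent letters are counted as 0
counts <- sapply(dat, function(x) table(factor(x, levels = lv)))
counts
```

Because every column is tabulated against the same levels, the result is a matrix with one row per letter, and a letter missing from a column shows up as 0 rather than being dropped.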

Related

Create a dataframe that is the result of a difference between vectors from other dataframes R

I have the following datasets and information: first, I have i different plots that I want to analyze. In each plot, I have j species for which I want to obtain some information, such as:
plot1 = c(rep(1, 3), rep(2, 4), rep(3, 5))
spp1 = c('a', 'b', 'c', 'a', 'b', 'c', 'd', 'b', 'b', 'b', 'e', 'f')
data.1 = data.frame(plot1, spp1)
The above mentioned information repeats for a second dataframe of similar structure:
plot2 = c(rep(1, 2), rep(2, 3), rep(3, 5))
spp2 = c('a', 'a', 'b', 'c', 'c', 'b', 'b', 'b', 'e', 'f')
data.2 = data.frame(plot2, spp2)
What I'm trying to do is, for each plot i, compute setdiff(unique(data.1$spp1), unique(data.2$spp2)) and add the result to a dataframe that has 2 columns: plot and spp_name
For the example datasets I'd like to obtain a final dataframe such as:
df_result = data.frame(plot = c(1,1,2,2,3), spp_name = c('b','c','a','d',0))
0 (or similar) must be returned when setdiff(unique()) returns 'character(0)'. So, in a way, my df_result needs to have, for each plot i, as many rows as there are setdiff strings between data.1$spp1 and data.2$spp2.
The first thing I did was use a for loop over each plot i. Getting the setdiff() result is fine, but I don't know how to add this information to an empty dataframe... do I need to loop over each species as well? I really hope my question is comprehensible.
Thanks in advance
You could use anti_join and add rows for the missing values:
library(dplyr)
anti_join(data.1, data.2, by = c("plot1" = "plot2", "spp1" = "spp2")) %>%
  add_row(plot1 = setdiff(data.1$plot1, .$plot1))
#   plot1 spp1
# 1     1    b
# 2     1    c
# 3     2    a
# 4     2    d
# 5     3 <NA>
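For comparison, the per-plot setdiff idea from the question can also be written in base R. This is only a sketch of one possible approach: it builds one small data frame per plot and binds them together, using NA as the placeholder when a plot has no differing species (the names pieces and only_in_1 are illustrative).

```r
plot1 <- c(rep(1, 3), rep(2, 4), rep(3, 5))
spp1 <- c('a', 'b', 'c', 'a', 'b', 'c', 'd', 'b', 'b', 'b', 'e', 'f')
data.1 <- data.frame(plot1, spp1, stringsAsFactors = FALSE)

plot2 <- c(rep(1, 2), rep(2, 3), rep(3, 5))
spp2 <- c('a', 'a', 'b', 'c', 'c', 'b', 'b', 'b', 'e', 'f')
data.2 <- data.frame(plot2, spp2, stringsAsFactors = FALSE)

# One data frame per plot: species present in data.1 but not in data.2
pieces <- lapply(unique(data.1$plot1), function(p) {
  only_in_1 <- setdiff(data.1$spp1[data.1$plot1 == p],
                       data.2$spp2[data.2$plot2 == p])
  # Keep a placeholder row when the plot has no differing species
  if (length(only_in_1) == 0) only_in_1 <- NA_character_
  data.frame(plot = p, spp_name = only_in_1, stringsAsFactors = FALSE)
})

# Stack the per-plot pieces into the final result
df_result <- do.call(rbind, pieces)
df_result
```

Collecting the pieces in a list and binding once at the end avoids growing a data frame row by row inside a loop.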

Finding All Pairwise Commonalities in R

I have been a StackExchange Lurker forever now, but have not had much luck finding this question in R before, so I created a username just for this.
Basically I have a data set of Customers and Stores (roughly 260k customers and 300 stores with most of the customers visiting at least 10 unique stores), and I want to see which sites overlap on customers the most (i.e. Site A and B share this many customers, A and C that many, ... for ALL Pairs of sites).
Reproducible example:
begindata<-data.frame(customer_id=c(1,2,3,1,2,3,4,1,4,5), site_visited=c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'))
and I would like to see the following, if possible:
final_table<-data.frame(site_1=c('A', 'A', 'A', 'B', 'B', 'C'), site_2=c('B', 'C', 'D', 'C', 'D', 'D'), number_of_commonalities=c(3, 1, 0, 1, 1, 0))
I have tried joining begindata to itself based on customer_id, something like this...
attempted_df<-begindata %>% left_join(begindata, by="customer_id") %>% count(site_visited.x, site_visited.y)
Which I know is redundant (lines that go A, B, 3; B, A, 3; as well as lines that go A, A, 3).
However, this cannot be executed with my actual data set (260k members and 300 sites) due to size limitations.
Any advice would be greatly appreciated! Also go easy on me if my post sucks--I think I have followed the rules and suggestions?
We could use combn:
number_of_commonalities <- combn(unique(begindata$site_visited), 2,
    FUN = function(x) with(begindata,
        length(intersect(customer_id[site_visited == x[1]],
                         customer_id[site_visited == x[2]]))))
names(number_of_commonalities) <- combn(unique(begindata$site_visited), 2,
    FUN = paste, collapse = "_")
stack(number_of_commonalities)[2:1]
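Since the question mentions size limits, a different technique may be worth trying: build a customer-by-site incidence matrix and multiply it by its own transpose, which yields all pairwise shared-customer counts in one step. The sketch below is an alternative to the combn answer, not part of it; the names inc, co, and pairs are illustrative, and whether a dense table fits in memory for 260k customers and 300 sites is an assumption that would need checking.

```r
begindata <- data.frame(customer_id = c(1, 2, 3, 1, 2, 3, 4, 1, 4, 5),
                        site_visited = c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'),
                        stringsAsFactors = FALSE)

# Incidence matrix: 1 if the customer visited the site, 0 otherwise.
# unique() drops repeat visits so each customer/site pair counts once.
inc <- unclass(table(unique(begindata[c("customer_id", "site_visited")])))

# t(inc) %*% inc: entry [i, j] is the number of customers shared by sites i and j
co <- crossprod(inc)

# Keep each unordered pair once (upper triangle, excluding the diagonal)
idx <- which(upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(site_1 = rownames(co)[idx[, "row"]],
                    site_2 = colnames(co)[idx[, "col"]],
                    number_of_commonalities = co[idx])
pairs
```

The diagonal of co holds the number of distinct customers per site, which can be a useful sanity check.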

index vector by value in R

Say I have two character vectors
vec <- c('A', 'B', 'C', 'D', 'E')
pat <- c('D', 'B', 'A')
How do I get the indexes in vec of the values in pat, in the order they appear in pat?
I can try
which(vec %in% pat)
but this returns them in the wrong order: 1 2 4. I want them as 4 2 1.
I have tried different ways to solve this problem before and always found that the easiest is the one mentioned in @DavidArenburg's comment:
match(pat, vec)
# [1] 4 2 1
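Two properties of match are worth knowing when using it this way: values that do not occur in vec come back as NA, and only the first occurrence of a repeated value is reported. A small sketch with made-up inputs ('Z' and the repeated 'A' are illustrative additions):

```r
vec <- c('A', 'B', 'C', 'D', 'E')
pat <- c('D', 'B', 'Z', 'A')    # 'Z' does not occur in vec

idx <- match(pat, vec)
idx                             # 4 2 NA 1

# Drop the positions of values that were not found
found <- idx[!is.na(idx)]
found                           # 4 2 1

# With duplicates in the lookup vector, only the first position is returned
first_a <- match('A', c('A', 'B', 'A'))
first_a                         # 1
```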

R : Map a column to column using key-value list

In R, I want to use a key-value list to convert a column of keys to values. It's similar to: How to map a column through a dictionary in R, but I want to use a list not a data.frame.
I've tried to do this using list and columns:
d = list('a'=1, 'b'=2, 'c'=3)
d[c('a', 'a', 'c', 'b')] # I want this to return c(1,1,3,2) but it doesn't
However, the above returns a list:
list('a'=1, 'a'=1, 'c'=3, 'b'=2)
unlist is a useful function in this situation:
unlist(d[c('a', 'a', 'c', 'b')], use.names=FALSE)
#[1] 1 1 3 2
Another option is stack, which returns the key/value pairs as columns of a data.frame. Subsetting the values column gives:
stack(d[c('a', 'a', 'c', 'b')])[,1]
#[1] 1 1 3 2
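If all values in the list have the same type, one further option is to flatten the list once into a named vector and index that by key. The following sketch applies the idea to a data frame column; the names lookup, df, key, and value are illustrative and not from the question.

```r
d <- list('a' = 1, 'b' = 2, 'c' = 3)

# Flatten the list once into a named numeric vector: c(a = 1, b = 2, c = 3)
lookup <- unlist(d)

df <- data.frame(key = c('a', 'a', 'c', 'b'), stringsAsFactors = FALSE)

# Index by name; as.character() guards against factor columns,
# which would otherwise be used as integer positions
df$value <- unname(lookup[as.character(df$key)])
df$value
```

Keys that are missing from the lookup vector come back as NA, so they are easy to detect afterwards.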

Combine vector and data.frame matching column values and vector values

I have
vetor <- c(1,2,3)
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
I need a data.frame output that matches each vector value to a specific id, resulting in:
  id vector1
1  a       1
2  b       2
3  a       1
4  c       3
5  a       1
Here are two approaches I often use for similar situations:
vetor <- c(1,2,3)
key <- data.frame(vetor=vetor, mat=c('a', 'b', 'c'))
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- key[match(data$id, key$mat), 'vetor']
#or with merge
merge(data, key, by.x = "id", by.y = "mat")
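The two approaches differ in one practical respect: match keeps the rows of data in their original order, while merge by default sorts the result by the key column. A short sketch that makes the difference visible (by_match and by_merge are illustrative names):

```r
key <- data.frame(vetor = c(1, 2, 3), mat = c('a', 'b', 'c'),
                  stringsAsFactors = FALSE)
data <- data.frame(id = c('a', 'b', 'a', 'c', 'a'),
                   stringsAsFactors = FALSE)

# match: one lookup per row, original row order preserved
by_match <- data
by_match$vector1 <- key$vetor[match(by_match$id, key$mat)]
by_match$vector1

# merge: same values, but rows come back sorted by id
by_merge <- merge(data, key, by.x = "id", by.y = "mat")
by_merge
```

If the original row order matters, the match version is usually the simpler choice.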
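So you want one unique integer for each distinct value in the id column?
This is what a factor provides in R: each level is stored as an integer code.
To get the numeric representation, convert id to a factor and apply as.numeric:
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- as.numeric(factor(data$id))
In R versions before 4.0.0, data.frame() converted strings to factors by default, so as.numeric(data$id) worked directly. Since R 4.0.0 the id column is a character vector and the explicit factor() call is needed; without it, as.numeric returns NA with a warning.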
Here's an answer I found that follows the "mathematical.coffee" tip:
vector1 <- c('b','a','a','c','a','a') # 3 elements to be labeled: a, b and c
labels <- factor(vector1, labels= c('char a', 'char b', 'char c') )
data.frame(vector1, labels)
The one thing to watch is that factor(vector1, ...) sorts the unique values of vector1 to form its levels, so the labels must be supplied in that same sorted order.
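To avoid relying on the implicit sort order, the levels can be spelled out next to the labels so that the pairing is visible in the code. A minimal sketch using the same example values:

```r
vector1 <- c('b', 'a', 'a', 'c', 'a', 'a')

# Stating levels explicitly makes the level-to-label pairing unambiguous:
# 'a' -> 'char a', 'b' -> 'char b', 'c' -> 'char c'
labels <- factor(vector1,
                 levels = c('a', 'b', 'c'),
                 labels = c('char a', 'char b', 'char c'))

data.frame(vector1, labels)

# The underlying integer codes follow the order of the levels
codes <- as.integer(labels)
codes    # 2 1 1 3 1 1
```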

Resources