How to create a hashed dataframe in R

Given the following data (myinput.txt):
A q,y,h
B y,f,g
C n,r,q
### more rows
How can I convert it into the following data structure (a named list of character vectors) in R?
$A
[1] "q" "y" "h"
$B
[1] "y" "f" "g"
$C
[1] "n" "r" "q"

I've assumed this as your data:
dat <- read.table(text="q,y,h
y,f,g
n,r,q", header=FALSE, sep=",", row.names=c("A", "B", "C"))
If you want an automatic method:
as.list(as.data.frame((t(dat)), stringsAsFactors=FALSE))
## $A
## [1] "q" "y" "h"
##
## $B
## [1] "y" "f" "g"
##
## $C
## [1] "n" "r" "q"
Another couple of methods which work are:
lapply(apply(dat, 1, list), "[[", 1)
unlist(apply(dat, 1, list), recursive=FALSE)

Using a bit of readLines, strsplit and regex to break the names off the start of each line:
dat <- readLines(textConnection("A q,y,h
B y,f,g
C n,r,q"))
# \\s+ matches one or more whitespace characters between the name and the values
result <- lapply(strsplit(dat,"\\s+|,"),function(x) x[-1])
names(result) <- sub("^(\\S+)\\s+.*$","\\1",dat)
> result
$A
[1] "q" "y" "h"
$B
[1] "y" "f" "g"
$C
[1] "n" "r" "q"
or with less regex and more steps:
result <- strsplit(dat,"\\s+|,")
names(result) <- sapply(result,"[",1)
result <- lapply(result,function(x) x[-1])
> result
$A
[1] "q" "y" "h"
$B
[1] "y" "f" "g"
$C
[1] "n" "r" "q"
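If the name and the values always sit on one line separated by whitespace, the same result can be sketched without capture groups by cutting each line into a key part and a value part. This is only a minimal sketch, not the method from either answer: `lines` is a stand-in for readLines("myinput.txt"), and the names `keys`, `vals` and `h` are made up for illustration. Since the title asks for something hash-like, the last lines copy the list into an environment, which R can back with a hash table.

```r
# Stand-in for readLines("myinput.txt"); the file name comes from the question
lines <- c("A q,y,h", "B y,f,g", "C n,r,q")

keys <- sub("\\s.*$", "", lines)      # everything before the first whitespace
vals <- sub("^\\S+\\s+", "", lines)   # everything after the first run of whitespace

# Named list of character vectors, one entry per input line
result <- setNames(strsplit(vals, ",", fixed = TRUE), keys)

# Optional: copy the list into a hashed environment for key lookups
h <- list2env(result, envir = new.env(hash = TRUE))
h[["B"]]
```

A plain named list is usually enough for a few hundred keys; the environment is mainly worth considering when lookups by name dominate.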

Related

Association rules: how do I need to convert the data for fpgrowth?

I have data like this
set.seed(123)
#fake data
dat <- list()
for(i in 1:1000) dat[[i]] <- LETTERS[sample(20)] [1:sample(10)[1]]
head(dat)
[[1]]
[1] "M" "L" "N" "H" "F" "P" "D"
[[2]]
[1] "F" "J" "C" "A" "G" "R" "M" "O" "S"
[[3]]
[1] "Q" "C" "E" "O" "P" "D"
[[4]]
[1] "K" "C" "P" "J" "N" "O" "B" "F" "Q"
[[5]]
[1] "K" "Q" "C"
[[6]]
[1] "M" "S" "A" "O" "E" "Q"
I need to find all association rules with "L" on the right-hand side.
I used to use apriori from the arules package:
library(arules)
rules <- apriori(data = dat,
parameter = list(supp = 0.01,
conf = 0.01),
appearance = list(rhs="L")
)
rules
It works great, but at some point it became too slow for me, so I decided to try fpgrowth from the rCBA package:
library(rCBA)
dat2 <- as(dat,"transactions")
rules <- rCBA::fpgrowth(dat2, support=0.01,
confidence=0.01,
consequent="L")
but I get an error:
2020-06-20 08:09:35 rCBA: initialized
2020-06-20 08:09:36 rCBA: data 1000x20
took: 1.21 s
Jun 20, 2020 8:09:36 AM cz.jkuchar.rcba.fpg.FPGrowth run
INFO: FPG: start
java.lang.NullPointerException
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
at java.base/java.lang.String.<init>(String.java:254)
at cz.jkuchar.rcba.rules.Tuple.getCopy(Tuple.java:41)
at cz.jkuchar.rcba.fpg.FPGrowth.insert(FPGrowth.java:304)
at cz.jkuchar.rcba.fpg.FPGrowth.lambda$buildTree$0(FPGrowth.java:62)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1510)
at cz.jkuchar.rcba.fpg.FPGrowth.buildTree(FPGrowth.java:59)
at cz.jkuchar.rcba.fpg.FPGrowth.run(FPGrowth.java:125)
at cz.jkuchar.rcba.fpg.FPGrowth.run(FPGrowth.java:99)
at cz.jkuchar.rcba.r.RPruning.fpgrowth(RPruning.java:204)
whereas the example from the rCBA package works fine for me.
I understand that I need to convert my data differently, but I don't know how.
Thank you.
UPDATE (for MarBlo):
Please give a more complete answer; I do not understand how to use it.
dat[[1]]
[1] "M" "L" "N" "H" "F" "P" "D"
DAT2 <- do.call(rbind.data.frame,dat)
DAT2[1,]
c..M....F....Q....K....K....M....O....E....O....N..
1 M
c..L....J....C....C....Q....S....D....I....H....O..
1 L
c..N....C....E....P....C....A....O....T....J....L..
1 N
c..H....A....O....J....K....O....D....G....B....F..
1 H
c..F....G....P....N....Q....E....O....C....R....M..
1 F
c..P....R....D....O....C....Q....D....Q....L....S..
1 P
c..D....M....Q....B....K....M....O....R....C....G..
1 D
c..M....O....C....F....Q....S....D....A....T....C..
1 M
c..L....S....E....Q....C....A....O....K....O....T..
1 L
c..N....F....O....K....K....O....D....P....H....K..
1 N
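The do.call(rbind.data.frame, dat) attempt above recycles the shorter vectors and produces one column per list position, which is not a usable table. One tabular shape that could be tried instead is one row per transaction and one TRUE/FALSE factor column per item. The sketch below uses base R only; the names `items`, `onehot` and `df` are made up for illustration. Whether rCBA::fpgrowth accepts this shape, and how the consequent would then have to be named, is an assumption that has not been verified here.

```r
set.seed(123)
# Same fake data as in the question
dat <- list()
for (i in 1:1000) dat[[i]] <- LETTERS[sample(20)][1:sample(10)[1]]

# All distinct items, used as column names
items <- sort(unique(unlist(dat)))

# One row per transaction, one logical column per item
onehot <- t(vapply(dat, function(tx) items %in% tx, logical(length(items))))
colnames(onehot) <- items

# Data frame of factor columns (assumption: a factor-based table is the
# kind of input the package example builds before converting to transactions)
df <- as.data.frame(onehot)
df[] <- lapply(df, function(col) factor(col, levels = c(FALSE, TRUE)))

str(df[, 1:3])
```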

Randomly shuffle letters in words in sentences

I want to randomly disrupt the order of the letters that make up words in sentences. I can do the shuffling for single words, e.g.:
library(stringr)  # provides str_split
a <- "bach"
sample(unlist(str_split(a, "")), nchar(a))
[1] "h" "a" "b" "c"
but I fail to do it for sentences, e.g.:
b <- "bach composed fugues and cantatas"
What I've tried so far:
split into words:
b1 <- str_split(b, " ")
[[1]]
[1] "bach" "composed" "fugues" "and" "cantatas"
calculate the number of characters per word:
n <- lapply(b1, function(x) nchar(x))
n
[[1]]
[1] 4 8 6 3 8
split words in b1 into single letters:
b2 <- str_split(unlist(str_split(b, " ")), "")
b2
[[1]]
[1] "b" "a" "c" "h"
[[2]]
[1] "c" "o" "m" "p" "o" "s" "e" "d"
[[3]]
[1] "f" "u" "g" "u" "e" "s"
[[4]]
[1] "a" "n" "d"
[[5]]
[1] "c" "a" "n" "t" "a" "t" "a" "s"
Jumble the letters in each word based on the above:
lapply(b2, function(x) sample(unlist(x), unlist(n), replace = T))
[[1]]
[1] "h" "a" "c" "b"
[[2]]
[1] "o" "p" "o" "s"
[[3]]
[1] "g" "s" "s" "u"
[[4]]
[1] "d" "d" "a" "d"
[[5]]
[1] "c" "n" "s" "a"
That's obviously not the right result. How can I randomly jumble the sequence of letters in each word in the sentence?
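In your attempt, unlist(n) is the whole vector of word lengths, so, as the output shows, every word was sampled with size 4, and replace = T allowed letters to repeat. Starting from b2, you can shuffle the characters of each word with sample and paste the words back together: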
paste0(sapply(b2, function(x) paste0(sample(x), collapse = "")), collapse = " ")
#[1] "bhac moodscpe uefusg and tsatnaac"
Note that you don't need to pass size to sample here: by default it returns a permutation of the whole input (size = length(x), replace = FALSE).
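The same idea can be wrapped into a small base-R helper so that it works on any sentence without stringr. This is a minimal sketch; the function name `shuffle_words` is made up for illustration, and it assumes words are separated by single spaces.

```r
# Shuffle the letters inside each space-separated word of a sentence
shuffle_words <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  shuffled <- vapply(words, function(w) {
    chars <- strsplit(w, "")[[1]]
    # sample.int() permutes positions, so every letter is used exactly once
    paste(chars[sample.int(length(chars))], collapse = "")
  }, character(1), USE.NAMES = FALSE)
  paste(shuffled, collapse = " ")
}

set.seed(1)
shuffle_words("bach composed fugues and cantatas")
```

Word order and word lengths are preserved; only the order of letters within each word changes.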

Efficient way of running multiple successive for loops in R?

I am trying to run several for loops in succession in R. I hope this simplified example of the kind of thing I am trying to do provides enough information, and that the question is relevant/interesting enough to a general audience.
Essentially, I have a pool of individuals (here represented by the 26 LETTERS and saved in a vector called 'ids'). I start with 2 of them randomly selected (called 'ids1') and run a for loop (here 5 times as defined by 'runs'). Those letters not picked get put into another vector called 'ids.left1'.
The first thing going on in the for loop in this example is that I am just randomly picking one of the letters five times. I am storing the result of this in another vector called result1. In this example I'm also storing those letters not used in another vector called 'otherresult1'. (My real-world reason for doing this would be using loops containing several different processes, not just these two).
set.seed(123)
#Initializing
ids<-LETTERS[1:26]
runs<-5
#1st time
result1 <- vector("list",runs)
otherresult1 <- vector("list",runs)
ids1<-sample(ids,2)
ids.left1<-setdiff(ids,ids1)
for (i in 1:runs) {
  picked1<-sample(ids1, 1)
  result1[[i]] <- picked1
  otherresult1[[i]] <- setdiff(ids1,picked1)
}
result1x<-unlist(result1) #[1] "H" "T" "T" "H" "T"
The above is trivial. What I am trying to do next is to add an extra letter (randomly selected) to the pool (so we now have 3) and run the for loop again for the same number of times (5). I also want to store the now 23 letters not being used in a vector (ids.left2) and also store the results of this loop in result2. Those not selected get stored in otherresult2.
#2nd time
result2 <- vector("list",runs)
otherresult2 <- vector("list",runs)
ids2<-c(ids1, sample(ids.left1,1))
ids.left2<-setdiff(ids,ids2)
for (i in 1:runs) {
  picked2<- sample(ids2, 1)
  result2[[i]] <- picked2
  otherresult2[[i]] <- setdiff(ids2,picked2)
}
result2x<-unlist(result2) #[1] "T" "T" "X" "T" "X"
This is repeated again. Another letter is added (so we now have 4), and the same for loop is run 5 times, and the results stored again in another vector. Those not used again get stored in otherresult3.
#3rd time
result3 <- vector("list",runs)
otherresult3 <- vector("list",runs)
ids3<-c(ids2, sample(ids.left2,1))
for (i in 1:runs) {
  picked3 <- sample(ids3, 1)
  result3[[i]] <- picked3
  otherresult3[[i]] <- setdiff(ids3,picked3)
}
result3x<-unlist(result3) #[1] "H" "O" "H" "H" "T"
This is just putting the results all together.
#putting results together
results.final <- c(result1x,result2x,result3x)
results.final #[1] "H" "T" "T" "H" "T" "T" "T" "X" "T" "X" "H" "O" "H" "H" "T"
unlist(otherresult1) #[1] "T" "H" "H" "T" "H"
unlist(otherresult2) #[1] "H" "X" "H" "X" "H" "T" "H" "X" "H" "T"
unlist(otherresult3) #[1] "T" "X" "O" "H" "T" "X" "T" "X" "O" "T" "X" "O" "H" "X" "O"
This is all pretty easy when I am only running the for loop 3 times. However, if I wanted to do the same thing (adding in one individual into a pool of individuals) 1000 times, it would be crazy to manually write the code. (Obviously, I wouldn't be using letters if I ran it 1000 times but some other identifier).
My question is therefore, is it possible to more efficiently code these successive for loops?
EDIT: I added in another process in the for-loop (the result being stored in 'otherresult' vector) to try and make this more realistic.
A perfect time to use recursion
recCount <- 1 #which recursive iteration we are in
allLetters <- LETTERS[1:26]
endPoint <- 6 #after how many recursions do we stop
runs <- 5
recEx <- function(resultList, otherResultList,
                  inLetters, outLetters,
                  recCount)
{
  newLetter <- sample(inLetters, ifelse(recCount == 1, 2, 1)) # pick a letter, 2 if this is the first run
  outLetters <- c(outLetters, newLetter) # add the new letter(s) to our pool of usable letters
  inLetters <- setdiff(inLetters, newLetter) # remove them from the remaining pool; setdiff also handles the 2-letter first draw
  excludedList <- includedList <- list() # initialize the lists we will add to
  for (i in 1:runs) {
    picked1 <- sample(outLetters, 1)
    includedList[[i]] <- picked1
    excludedList[[i]] <- setdiff(outLetters, picked1)
  }
  if (recCount == endPoint) {
    # if we're done, return both accumulated lists
    return(list(c(resultList, list(includedList)),
                c(otherResultList, list(excludedList))))
  }
  # otherwise pass in our results so far, with the new "included" and "excluded" lists appended
  recEx(c(resultList, list(includedList)),
        c(otherResultList, list(excludedList)),
        inLetters, outLetters, recCount + 1)
}
finalResult <- recEx(list(),list(),allLetters,NULL,1)
> finalResult
[[1]]#1 is for your final results, #2 is for the excluded results
[[1]][[1]]# 1 through 6 are your 6 iterations, with 2 through 7 letters in each iteration
[[1]][[1]][[1]] #1 through 5 are your 5 runs
[1] "H"
[[1]][[1]][[2]]
[1] "T"
[[1]][[1]][[3]]
[1] "T"
[[1]][[1]][[4]]
[1] "H"
[[1]][[1]][[5]]
[1] "T"
[[1]][[2]]
[[1]][[2]][[1]]
[1] "T"
[[1]][[2]][[2]]
[1] "T"
[[1]][[2]][[3]]
[1] "X"
[[1]][[2]][[4]]
[1] "T"
[[1]][[2]][[5]]
[1] "X"
[[1]][[3]]
[[1]][[3]][[1]]
[1] "H"
[[1]][[3]][[2]]
[1] "N"
[[1]][[3]][[3]]
[1] "H"
[[1]][[3]][[4]]
[1] "H"
[[1]][[3]][[5]]
[1] "T"
[[1]][[4]]
[[1]][[4]][[1]]
[1] "Y"
[[1]][[4]][[2]]
[1] "N"
[[1]][[4]][[3]]
[1] "N"
[[1]][[4]][[4]]
[1] "Y"
[[1]][[4]][[5]]
[1] "N"
[[1]][[5]]
[[1]][[5]][[1]]
[1] "N"
[[1]][[5]][[2]]
[1] "N"
[[1]][[5]][[3]]
[1] "T"
[[1]][[5]][[4]]
[1] "H"
[[1]][[5]][[5]]
[1] "Q"
[[1]][[6]]
[[1]][[6]][[1]]
[1] "Y"
[[1]][[6]][[2]]
[1] "Q"
[[1]][[6]][[3]]
[1] "H"
[[1]][[6]][[4]]
[1] "N"
[[1]][[6]][[5]]
[1] "Q"
[[2]] #your excluded letters
[[2]][[1]]
[[2]][[1]][[1]]
[1] "T"
[[2]][[1]][[2]]
[1] "H"
[[2]][[1]][[3]]
[1] "H"
[[2]][[1]][[4]]
[1] "T"
[[2]][[1]][[5]]
[1] "H"
[[2]][[2]]
[[2]][[2]][[1]]
[1] "H" "X"
[[2]][[2]][[2]]
[1] "H" "X"
[[2]][[2]][[3]]
[1] "H" "T"
[[2]][[2]][[4]]
[1] "H" "X"
[[2]][[2]][[5]]
[1] "H" "T"
[[2]][[3]]
[[2]][[3]][[1]]
[1] "T" "X" "N"
[[2]][[3]][[2]]
[1] "H" "T" "X"
[[2]][[3]][[3]]
[1] "T" "X" "N"
[[2]][[3]][[4]]
[1] "T" "X" "N"
[[2]][[3]][[5]]
[1] "H" "X" "N"
[[2]][[4]]
[[2]][[4]][[1]]
[1] "H" "T" "X" "N"
[[2]][[4]][[2]]
[1] "H" "T" "X" "Y"
[[2]][[4]][[3]]
[1] "H" "T" "X" "Y"
[[2]][[4]][[4]]
[1] "H" "T" "X" "N"
[[2]][[4]][[5]]
[1] "H" "T" "X" "Y"
[[2]][[5]]
[[2]][[5]][[1]]
[1] "H" "T" "X" "Y" "Q"
[[2]][[5]][[2]]
[1] "H" "T" "X" "Y" "Q"
[[2]][[5]][[3]]
[1] "H" "X" "N" "Y" "Q"
[[2]][[5]][[4]]
[1] "T" "X" "N" "Y" "Q"
[[2]][[5]][[5]]
[1] "H" "T" "X" "N" "Y"
[[2]][[6]]
[[2]][[6]][[1]]
[1] "H" "T" "X" "N" "Q" "V"
[[2]][[6]][[2]]
[1] "H" "T" "X" "N" "Y" "V"
[[2]][[6]][[3]]
[1] "T" "X" "N" "Y" "Q" "V"
[[2]][[6]][[4]]
[1] "H" "T" "X" "Y" "Q" "V"
[[2]][[6]][[5]]
[1] "H" "T" "X" "N" "Y" "V"
This isn't the best structure for the results in my opinion, but it is what you specified. Unpacking these lists is straightforward, though.
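As a rough illustration of what unpacking can look like, the sketch below flattens a nested list of picks. The object `picked` is a small hypothetical stand-in with the same shape as finalResult[[1]] (a list of iterations, each a list of single letters); it is not the actual output above.

```r
# Hypothetical stand-in shaped like finalResult[[1]]
picked <- list(list("H", "T", "T"), list("T", "X", "T"))

# One character vector per iteration
per_iteration <- lapply(picked, unlist)

# One flat vector over all iterations
flat <- unlist(picked)

# Long-format table recording which iteration produced which pick
long <- data.frame(
  iteration = rep(seq_along(per_iteration), lengths(per_iteration)),
  picked = flat
)
long
```

The excluded letters have a different length in every iteration, so for those the lapply(..., unlist) form is probably the more convenient one.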
How about this?
set.seed(123)
#Initializing
ids =LETTERS[1:26]
runs=5
result1 = list()
temp = sample(ids,2)
j=1
results = c()
while(j<6) {
  ids.left = ids[!(ids%in%temp)]
  for(i in 1:runs){
    result1[[i]] = sample(temp,1)
  }
  temp = c(temp, sample(ids.left,1))
  j=j+1
  results = c(results, unlist(result1))
}
results # [1] "H" "T" "T" "H" "T" "T" "T" "X" "T" "X" "H" "O" "H" "H" "T" "Y" "O" "O" "Y" "O" "O" "O" "T" "H" "Q"
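The loop above keeps only the picked letters. A variant that also stores the letters not picked (the 'otherresult' part of the question) for an arbitrary number of rounds could look like the sketch below. The function name `run_rounds` and its arguments are made up for illustration; it is one possible arrangement, not the approach of either answer.

```r
# Grow a pool by one id per round and record, for every run,
# the id that was picked and the ids that were not
run_rounds <- function(ids, n_rounds, runs, start_size = 2) {
  pool <- sample(ids, start_size)
  picked <- vector("list", n_rounds)
  others <- vector("list", n_rounds)
  for (r in seq_len(n_rounds)) {
    picked[[r]] <- character(runs)
    others[[r]] <- vector("list", runs)
    for (i in seq_len(runs)) {
      p <- sample(pool, 1)
      picked[[r]][i] <- p
      others[[r]][[i]] <- setdiff(pool, p)
    }
    # add one unused id to the pool, if any are left
    left <- setdiff(ids, pool)
    if (length(left) > 0) pool <- c(pool, left[sample.int(length(left), 1)])
  }
  list(picked = picked, others = others)
}

set.seed(123)
res <- run_rounds(LETTERS, n_rounds = 3, runs = 5)
unlist(res$picked)
```

Both result lists are preallocated, so running 1000 rounds only requires changing n_rounds, provided ids holds enough distinct identifiers.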

Matching between a vector and multiple vectors in a list in R

I have a list of vectors such as:
>list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "a" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s" "i" "d"
I want to find the matches between each of them and the remaining (i.e. between the 1st and the other 2, the 2nd and the other 2, and so on) and keep the couple with the highest number of matches. I could do it with a "for" loop and intersect by couples. For example
for (i in 2:3) { intersect(list[[1]],list[[i]]) }
and then save the output into a vector or some other structure. However, this seems inefficient to me (given that I have thousands of vectors rather than 3), and I am wondering whether R has a built-in function to do this in a clever way.
So the question would be:
Is there a way to look for matches of one vector to a list of vectors without the explicit use of a "for" loop?
I don't believe there is a built-in function for this. The best you could try is something like:
lsts <- lapply(1:5, function(x) sample(letters, 10)) # make some data (see below)
maxcomb <- which.max(apply(combs <- combn(length(lsts), 2), 2,
function(ix) length(intersect(lsts[[ix[1]]], lsts[[ix[2]]]))))
lsts <- lsts[combs[, maxcomb]]
# [[1]]
# [1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
# [[2]]
# [1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
A dump of the original:
[[1]]
[1] "z" "r" "j" "h" "e" "m" "w" "u" "q" "f"
[[2]]
[1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
[[3]]
[1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
[[4]]
[1] "c" "o" "t" "j" "d" "g" "u" "k" "w" "h"
[[5]]
[1] "f" "g" "q" "y" "d" "e" "n" "s" "w" "i"
datal <- list (a=c(2,2,1,2),
b=c(2,2,2,4,3),
c=c(1,2,3,4))
# all possible combinations
combs <- combn(length(datal), 2)
# split into list
combs <- split(combs, rep(1:ncol(combs), each = nrow(combs)))
# calculate length of intersection for every combination
intersections_length <- sapply(combs, function(y) {
length(intersect(datal[[y[1]]],datal[[y[2]]]))
}
)
# What lists have biggest intersection
combs[which(intersections_length == max(intersections_length))]
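With thousands of vectors, another option is to count all pairwise overlaps at once through a 0/1 incidence matrix and a matrix product. The sketch below uses the three vectors from the question; the names `vecs`, `inc`, `counts` and `best` are made up for illustration, and it counts distinct shared elements, as intersect does.

```r
# The three vectors from the question
vecs <- list(c("a", "m", "l", "s", "t", "o"),
             c("a", "y", "o", "t", "e"),
             c("n", "a", "s", "i", "d"))

# Incidence matrix: one row per distinct item, one column per vector
items <- sort(unique(unlist(vecs)))
inc <- vapply(vecs, function(v) as.numeric(items %in% v), numeric(length(items)))

# counts[i, j] is the number of items shared by vectors i and j
counts <- crossprod(inc)

# Keep each pair once and ignore self-matches
counts[lower.tri(counts, diag = TRUE)] <- NA
best <- which(counts == max(counts, na.rm = TRUE), arr.ind = TRUE)
best  # row and column index of the best-matching pair(s)
```

For this data the pair with the most matches is vectors 1 and 2, which share "a", "o" and "t". The incidence matrix can become large when there are many distinct items, so this trades memory for speed.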

How to print a character list from A to Z?

In R, how can I print a character list from A to Z? With integers I can say:
my_list = c(1:10)
> my_list
[1] 1 2 3 4 5 6 7 8 9 10
But can I do the same with characters? e.g.
my_char_list = c(A:Z)
my_char_list = c("A":"Z")
These don't work, I want the output to be: "A" "B" "C" "D", or separated by commas.
LETTERS
"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
[25] "Y" "Z"
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
[25] "y" "z"
> LETTERS[5:10]
[1] "E" "F" "G" "H" "I" "J"
strsplit(intToUtf8(c(97:122)),"")
for a,b,c,...,z
strsplit(intToUtf8(c(65:90)),"")
for A,B,C,...,Z
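Note that strsplit returns a list holding one character vector. If a plain character vector is wanted, a small variation is to let intToUtf8 do the splitting through its multiple argument. The sketch below also derives the code points from the letters themselves instead of hard-coding 65:90 and 97:122; the variable names are made up for illustration.

```r
# Code points are taken from the letters instead of being hard-coded
upper <- intToUtf8(utf8ToInt("A"):utf8ToInt("Z"), multiple = TRUE)
lower <- intToUtf8(utf8ToInt("a"):utf8ToInt("z"), multiple = TRUE)

upper            # a character vector, not a list
toString(upper)  # one comma-separated string
```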
#' range_ltrs() returns a vector of letters
#'
#' range_ltrs() returns a vector of letters,
#' starting with arg start and ending with arg stop.
#' Start and stop must be the same case.
#' If start is after stop, then a "backwards" vector is returned.
#'
#' @param start an upper or lowercase letter.
#' @param stop an upper or lowercase letter.
#'
#' @examples
#' range_ltrs(start = 'A', stop = 'D')
#' # [1] "A" "B" "C" "D"
#'
#' # If start is after stop, then a "backwards" vector is returned.
#' range_ltrs('d', 'a')
#' # [1] "d" "c" "b" "a"
range_ltrs <- function (start, stop) {
  is_start_upper <- toupper(start) == start
  is_stop_upper <- toupper(stop) == stop
  if (is_start_upper) stopifnot(is_stop_upper)
  if (is_stop_upper) stopifnot(is_start_upper)
  ltrs <- if (is_start_upper) LETTERS else letters
  start_i <- which(ltrs == start)
  stop_i <- which(ltrs == stop)
  ltrs[start_i:stop_i]
}
