Randomize vector order with maximum variance - r

I have a vector that looks something like this.
v <- as.data.frame(list(v=(c("a","b","c",'d','e'))))
v
v
1 a
2 b
3 c
4 d
5 e
My vector has 5 different values. This means I can make 120 permutations of my vector.
Here are some examples of permutations
v v2 v3
1 a a a
2 b b c
3 c c b
4 d e d
5 e d e
I would like to create only 10 different vectors out of the 120 possible ones, but I would like to select the combination that maximises their covariance. Any idea how I could do this?
Thanks a lot in advance for your help.

Related

quanteda::dfm_lookup(): capture found term

I would like to perform the amazing quanteda's dfm_lookup() on a dictionary but also retrieve the matches.
Consider the following example:
dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxglob = "tax*",
taxregex = "tax.+$",
country = c("United_States", "Sweden")))
dfmat_ex <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?")),
remove = stopwords("english"))
dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)
This gives me:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
features
docs christmas opposition taxglob taxregex country
text1 1 1 1 0 0
text2 0 0 1 0 2
However, since every dictionary key has multiple entries, I would like to know which token produced the match. (My real dictionary is rather long, so the example might seem trivial, but for the real use case it is not.)
I would like to achieve a result like this:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
features
docs christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country country.match
text1 1 Christmas 1 Opposition 1 tax 0 NA 0 NA
text2 0 NA 0 NA 1 taxation 0 NA 2 United_States, Sweden
Can someone help me with this? Many thanks in advance! :)
That's not really possible for two reasons.
First, a matrix(-like) object (dfm or otherwise) cannot mix element modes, here a mixture of counts and character values. This would be possible with a data.frame, but then you lose the advantages of sparsity, and here you would have a data.frame of dimensions n x 2*V (where V = number of features).
Second, "christmas.match" could have more than one feature/token matching it, so the character value would require a list, straining the object class even further.
A better way would be to use kwic() to match the tokens to the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary as the pattern argument, or for each value by unlisting the dictionary first.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
toks <- tokens(c(d1 = "a b c d e f g and another"))
# where the dictionary keys are the patterns matched
kwic(toks, dict) %>%
as.data.frame()
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f     one
## 2      d1    2  2           a       b       c d e f g     one
## 3      d1    5  5     a b c d       e f g and another     two
## 4      d1    6  6   a b c d e       f   g and another     two
## 5      d1    8  8   c d e f g     and         another     one
## 6      d1    9  9 d e f g and another                     one
# where the dictionary values are the patterns matched
kwic(toks, unlist(dict)) %>%
as.data.frame()
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f      a*
## 2      d1    2  2           a       b       c d e f g       b
## 3      d1    5  5     a b c d       e f g and another       e
## 4      d1    6  6   a b c d e       f   g and another       f
## 5      d1    8  8   c d e f g     and         another      a*
## 6      d1    9  9 d e f g and another                      a*
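To get closer to the "which tokens matched each key" summary the question asks for, the kwic() result can be collapsed per document and key. The sketch below is only an illustration in base R: the `hits` data frame is typed in by hand as a stand-in for the docname, keyword and pattern columns of `as.data.frame(kwic(toks, dict))` shown above, and the names `hits` and `matches` are made up for this example.

```r
# Hand-copied stand-in for three columns of as.data.frame(kwic(toks, dict))
hits <- data.frame(
  docname = rep("d1", 6),
  keyword = c("a", "b", "e", "f", "and", "another"),
  pattern = c("one", "one", "two", "two", "one", "one"),
  stringsAsFactors = FALSE
)

# One row per document and dictionary key, with the matched tokens pasted together
matches <- aggregate(
  hits["keyword"],
  by = hits[c("docname", "pattern")],
  FUN = function(x) paste(unique(x), collapse = ", ")
)
matches
```

The result stays a separate long data frame, so the dfm itself keeps its sparse numeric form.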

How to skip not completely empty rows in r

So, I'm trying to read an Excel file. Some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip the lines 1,5,6,7,8 and so on.
There is probably a more elegant way of doing it, but a possible solution is to count the number of non-NA elements per row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count the number of non-NA elements in each row:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
Then keep only the rows where that count equals the number of columns (these are the complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does it answer your question?
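As for the "more elegant way" mentioned above, base R's complete.cases() returns the same row filter directly. This is a small sketch on the same dummy data (with fixed values in B instead of sample() so it is reproducible), and it assumes the blank Excel cells were read in as NA:

```r
df <- data.frame(A = LETTERS[1:6],
                 B = c(5, 9, 1, 3, 4, NA),
                 C = letters[1:6])

# TRUE for rows with no NA in any column
keep <- complete.cases(df)
df_complete <- df[keep, ]
df_complete

# na.omit(df) drops the same rows in a single call
```

If the empty cells arrive as empty strings rather than NA, they would need to be converted to NA first.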

Grouping of two variables in r

I have a data frame like below.
dat <- data.frame(v1=c("a","b","c","c","a","w","f"),
v2=c("z","a","a","w","p","e","h"))
v1 v2
1 a z
2 b a
3 c a
4 c w
5 a p
6 w e
7 f h
I want to add a group column based on whether these letters appear in the same row.
v1 v2 gp
1 a z 1
2 b a 1
3 c a 1
4 c w 1
5 a p 1
6 w e 1
7 f h 2
My idea is to first assign the first row to group 1, and then any row where v1 or v2 is "a" or "z" will also be assigned to group 1.
There are scenarios like row 3 and 4, where c is assigned to group 1, because, in row 3, v2 is "a". And "w" is assigned to group 1 because in row 4 v1 is "c", which is assigned to group 1 previously. But my list is very long, so I cannot keep checking all the "descendants".
I wonder if there is a way to group these letters, and return a list with group number. Something like the below table will do.
letter gp
a 1
b 1
c 1
e 1
f 2
h 2
w 1
z 1
One way to solve this is to consider the letters as vertices of a graph and being in the same row as a link between the vertices. Then what you are asking for is the connected components of the graph. All of that is easy using the igraph package in R.
library(igraph)
G = graph_from_edgelist(as.matrix(dat), directed=FALSE)
letters = sort(unique(c(as.character(dat$v1), as.character(dat$v2))))
(gp = components(G)$membership[letters])
a b c e f h p w z
1 1 1 1 2 2 1 1 1
If you want a data.frame containing this information
(Groups = data.frame(letters, gp, row.names=NULL))
letters gp
1 a 1
2 b 1
3 c 1
4 e 1
5 f 2
6 h 2
7 p 1
8 w 1
9 z 1
In order to think through why this works, it may help you to look at the graph that was created and think how that represents your problem.
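If igraph is not available, the same connected-components idea can be sketched in base R by repeatedly letting both letters in a row share the smaller of their two labels until nothing changes. This is only an illustrative alternative, not what components() does internally; the names `nodes`, `label` and `gp` are chosen for this example. The last line also attaches the group to each row of dat, as in the desired output of the question.

```r
dat <- data.frame(v1 = c("a", "b", "c", "c", "a", "w", "f"),
                  v2 = c("z", "a", "a", "w", "p", "e", "h"),
                  stringsAsFactors = FALSE)

# Start with every letter in its own group
nodes <- sort(unique(c(dat$v1, dat$v2)))
label <- setNames(seq_along(nodes), nodes)

# Propagate the smaller label across each row until no label changes
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(dat))) {
    a <- dat$v1[i]
    b <- dat$v2[i]
    m <- min(label[[a]], label[[b]])
    if (label[[a]] != m || label[[b]] != m) {
      label[[a]] <- m
      label[[b]] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Renumber the surviving labels as 1, 2, ... and attach them to the rows
gp <- setNames(match(label, unique(label)), nodes)
dat$gp <- unname(gp[dat$v1])
dat
```

The loop may need several passes over a long list, so for large data the igraph version above is likely the better choice.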

Convert from n x m matrix to long matrix in R [duplicate]

This question already has answers here:
Create dataframe from a matrix
(6 answers)
Closed 1 year ago.
Note: This is not a graph question.
I have an n x m matrix:
> m = matrix(1:6, 2, 3, byrow=TRUE, dimnames=list(c("d","e"), c("a","b","c")))
> m
a b c
d 1 2 3
e 4 5 6
I would like to convert this to a long matrix:
> m.l
a d 1
a e 4
b d 2
b e 5
c d 3
c e 6
Obviously nested for loops would work, but I know there are a lot of nice tools for reshaping matrices in R. So far, I have only found literature on converting from long or wide matrices to an n x m matrix and not the other way around. Am I missing something obvious? How can I do this conversion?
Thank you!
If you need a single column matrix
matrix(m, dimnames=list(t(outer(colnames(m), rownames(m), FUN=paste)), NULL))
# [,1]
#a d 1
#a e 4
#b d 2
#b e 5
#c d 3
#c e 6
For a data.frame output, you can use melt from reshape2
library(reshape2)
melt(m)
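A base R route to the same long data.frame, without reshape2, is to go through as.table(). This is a sketch that assumes m carries the row names d, e and column names a, b, c shown in the question; the column names `row`, `col` and `value` are chosen here for readability.

```r
m <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(c("d", "e"), c("a", "b", "c")))

# as.table() + as.data.frame() gives one row per cell of the matrix
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("row", "col", "value")

# Put the column name first to match the layout asked for in the question
long <- long[, c("col", "row", "value")]
long
```

The rows come out in column-major order (a d, a e, b d, ...), which matches the m.l layout in the question.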

create all possible permutations of two vectors in R [duplicate]

This question already has answers here:
Generate all possible permutations (or n-tuples)
(2 answers)
Closed 9 years ago.
I have two vectors like this:
f1=c('a','b','c','d')
e1=c('e','f','g')
There are 4^3 different permutations of them. I need to create all possible permutations of them in R. For example:
(1):
a e
a f
a g
(2):
a e
a f
b g
...
Moreover, my real data are very huge and I need speed codes.
It sounds like you are looking for expand.grid.
> expand.grid(f1, e1)
Var1 Var2
1 a e
2 b e
3 c e
4 d e
5 a f
6 b f
7 c f
8 d f
9 a g
10 b g
11 c g
12 d g
I don't know what "speed codes" are, so I'm not sure I can help from that aspect.
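Note that expand.grid(f1, e1) yields the 12 pairs, not the 4^3 = 64 arrangements mentioned in the question. If the intent is to pick one element of f1 for each element of e1 (one possible reading of examples (1) and (2)), a sketch along these lines enumerates them, one arrangement per row; the name `assignments` is made up for this example.

```r
f1 <- c("a", "b", "c", "d")
e1 <- c("e", "f", "g")

# One column per element of e1, each ranging over all of f1
assignments <- expand.grid(setNames(rep(list(f1), length(e1)), e1),
                           stringsAsFactors = FALSE)

nrow(assignments)   # 4^3 = 64 rows
assignments[1, ]    # e = a, f = a, g = a, i.e. example (1)
```

The number of rows grows as length(f1)^length(e1), so for very large inputs the full table may not fit in memory regardless of how it is generated.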
