quanteda::dfm_lookup(): capture found term

I would like to run quanteda's dfm_lookup() on a dictionary and also retrieve the tokens that matched.
Consider the following example:
dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxglob = "tax*",
taxregex = "tax.+$",
country = c("United_States", "Sweden")))
dfmat_ex <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?")),
remove = stopwords("english"))
dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)
This gives me:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
features
docs christmas opposition taxglob taxregex country
text1 1 1 1 0 0
text2 0 0 1 0 2
However, since every dictionary key has multiple entries, I would like to know which token produced the match. (My real dictionary is rather long, so the example might seem trivial, but the real use case is not.)
I would like to achieve a result like this:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
features
docs christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country country.match
text1 1 Christmas 1 Opposition 1 tax 0 NA 0 NA
text2 0 NA 0 NA 1 taxation 0 NA 2 United_States, Sweden
Can someone help me with this? Many thanks in advance! :)

That's not really possible for two reasons.
First, a matrix(-like) object (dfm or otherwise) cannot mix element modes, here counts and character values. This would be possible with a data.frame, but then you lose the advantages of sparsity, and the data.frame would have n x 2V dimensions (where n is the number of documents and V the number of features).
Second, "christmas.match" could have more than one feature/token matching it, so the character value would require a list, straining the object class even further.
A better way would be to use kwic() to match the tokens to the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary as the pattern argument, or for each value by supplying the unlisted dictionary instead.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
toks <- tokens(c(d1 = "a b c d e f g and another"))
# where the dictionary keys are the patterns matched
kwic(toks, dict) %>%
as.data.frame()
## docname from to pre keyword post pattern
## 1 d1 1 1 a b c d e f one
## 2 d1 2 2 a b c d e f g one
## 3 d1 5 5 a b c d e f g and another two
## 4 d1 6 6 a b c d e f g and another two
## 5 d1 8 8 c d e f g and another one
## 6 d1 9 9 d e f g and another one
# where the dictionary values are the patterns matched
kwic(toks, unlist(dict)) %>%
as.data.frame()
## docname from to pre keyword post pattern
## 1 d1 1 1 a b c d e f a*
## 2 d1 2 2 a b c d e f g b
## 3 d1 5 5 a b c d e f g and another e
## 4 d1 6 6 a b c d e f g and another f
## 5 d1 8 8 c d e f g and another a*
## 6 d1 9 9 d e f g and another a*
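To get closer to the layout asked for in the question, the kwic() data frame can be collapsed to one row per document and dictionary key. The following is a minimal sketch in base R, not a quanteda feature: `kw` is a hand-built stand-in that copies the docname, keyword, and pattern columns from the first kwic() output above, and the column name `matches` is introduced here for illustration.

```r
# Stand-in for as.data.frame(kwic(toks, dict)), keeping only the three
# columns needed here (values copied from the output shown above).
kw <- data.frame(
  docname = "d1",
  keyword = c("a", "b", "e", "f", "and", "another"),
  pattern = c("one", "one", "two", "two", "one", "one"),
  stringsAsFactors = FALSE
)

# One row per document and dictionary key, with the distinct matched
# tokens pasted into a single comma-separated string.
matches <- aggregate(
  keyword ~ docname + pattern,
  data = kw,
  FUN = function(x) paste(unique(x), collapse = ", ")
)
names(matches)[names(matches) == "keyword"] <- "matches"
matches
```

With the real kwic() result, the same aggregate() call should apply unchanged, since it only relies on the docname, keyword, and pattern columns. The result stays a long data.frame rather than a dfm, which avoids the mixed-mode problem described above.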

Related

Randomize vector order with maximum variance

I have a vector that looks something like this.
v <- as.data.frame(list(v=(c("a","b","c",'d','e'))))
v
v
1 a
2 b
3 c
4 d
5 e
My vector has 5 different values. This means I can make 120 permutations of my vector.
Here are some examples of permutations
v v2 v3
1 a a a
2 b b c
3 c c b
4 d e d
5 e d e
I would like to create only 10 different vectors out of the 120 possible ones, and I would like to select the combination that maximises their covariance. Any idea how I could do this?
thanks a lot in advance for your help

How to match missing IDs?

I have a large table with 50000 obs. The following mimics its structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID ,a,b,c)
There are some missing values in the vector "a". However, I have other tables that include both the ID and the missing strings:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to include the missing strings from key into the column a using the ID?
Another option is to use data.table's fast binary join and update-by-reference capabilities
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (not all the joined cases), a slightly more complicated implementation would be
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both your datasets are using characters instead of factors to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector. You can access each value by its ID as a name, e.g. lookup["2"] (make sure the index is a character, not a number, otherwise R indexes by position rather than by name).
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]
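Putting the named-vector idea together with the data from the question, a minimal sketch could look like this. It rebuilds `df` and `key` so that it runs on its own; `missing` is a helper name introduced here, and only the NA entries of column a are filled.

```r
# Data from the question (characters kept as characters, not factors).
ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(
  ID,
  a = c("A", "B", NA, "D", "E", NA, "G", "H", "I"),
  b = c(11, 2233, 12, 2, 22, 13, 23, 23, 100),
  c = c(12, 10, 12, 23, 16, 17, 7, 9, 7),
  stringsAsFactors = FALSE
)
key <- data.frame(
  ID,
  a = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  stringsAsFactors = FALSE
)

# Named vector: names are the IDs (as character), values are the strings.
lookup <- setNames(key$a, as.character(key$ID))

# Fill only the missing entries, indexing the lookup by name.
missing <- is.na(df$a)
df$a[missing] <- unname(lookup[as.character(df$ID[missing])])
df
```

IDs that do not appear in the lookup would simply stay NA, so the approach degrades gracefully when the key table is incomplete.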

Assigning rows of data.frame to another data.frame in R based on frequency of element's occurrence

I have a data.frame df
> df
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
I found the frequency of each element's occurrence considering column V1 and sorted the Freq column in ascending order
>dfFreq <- as.data.frame(table(df$V1))
Var1 Freq
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 1
7 g 1
>dfFreqSorted <- dfFreq[order(dfFreq$Freq),]
Var1 Freq
5 e 1
6 f 1
7 g 1
4 d 2
1 a 3
2 b 3
3 c 3
Now I want to create a new data.frame from the original one such that each "Var1" item in "dfFreqSorted" is used as many times as its Freq, but only once per pass, going from the top of "dfFreqSorted" to the bottom.
Consider the first Var1 item, "e". The first matching row for "e" in the V1 column of df is (e,f), which would be the first row of the new data.frame.
I figured that this can be done using:
>subset(df, V1==dfFreqSorted$Var[1])[1,]
V1 V2
12 e f
So if I used a for loop over all the elements in the Var1 column of dfFreqSorted, applied the subset command above, and rbind-ed each returned row to another data.frame, I would get something like this:
V1 V2
12 e f
13 f g
14 g h
10 d g
1 a b
4 b c
7 c d
This result shows each Var1 item once. I also need the remaining rows: after one pass through all the Var1 items, the loop should start again from the top, consider only the items whose frequency is greater than 1, and fetch the next unused row from df for each of them. Repeating this should append the following rows to the same data.frame:
11 d h
2 a e
5 b e
8 c g
3 a f
6 b f
9 c h
As you can see, every element is used once in the first pass, in ascending order of frequency. In the second pass only the elements with a frequency of at least 2 are used, and in the third pass only those with a frequency of at least 3. Each time, the next unused row for that element is fetched from df.
In short, all the rows of df are arranged in a new data.frame such that the elements are used in ascending order of their frequencies, once per pass, for as many passes as their frequency allows.
I am not asking for the whole code, just a few guidelines on how I can achieve this. Thanks in advance.
Hello @akrun, I am a beginner, so this solution may be a beginner-level approach, but it solved my problem perfectly well.
> a<-read.table("isnodes.txt")
> a
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
> aF<-as.data.frame(table(a$V1))
> aF
Var1 Freq
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 1
7 g 1
> aFsorted <- aF[order(aF$Freq),]
> aFsorted
Var1 Freq
5 e 1
6 f 1
7 g 1
4 d 2
1 a 3
2 b 3
3 c 3
> sortedEdgeList <- a[-c(1:nrow(a)),]
> sortedEdgeList
[1] V1 V2
<0 rows> (or 0-length row.names)
> aFsorted <- cbind(aFsorted, Used=0)
> aFsorted
Var1 Freq Used
5 e 1 0
6 f 1 0
7 g 1 0
4 d 2 0
1 a 3 0
2 b 3 0
3 c 3 0
> maxFreq <- max(aFsorted$Freq)
> maxFreq
[1] 3
> for(i in 1:maxFreq){
+ rows<-nrow(aFsorted)
+ for(j in 1:rows){
+ Var1Value<-aFsorted$Var[j]
+ Var1Edge<-a[match(aFsorted$Var1[j],a$V1),]
+ sortedEdgeList<-rbind(sortedEdgeList,Var1Edge)
+ a<-a[!(a$V1==Var1Edge$V1 & a$V2==Var1Edge$V2),]
+ aFsorted$Used[j]=aFsorted$Used[j]+1
+ }
+ # drop the elements whose rows have all been used
+ aFsorted<-aFsorted[aFsorted$Used<aFsorted$Freq,]
+ }
> sortedEdgeList
V1 V2
12 e f
13 f g
14 g h
10 d g
5 a b
4 b c
7 c d
11 d h
2 a e
51 b e
8 c g
3 a f
6 b f
9 c h
I'm not sure this is what you want, but it might be close. It helps conceptually to keep the frequencies in the original data frame.
library("plyr")
set.seed(3)
df <- data.frame(V1 = sample(letters[1:10], 20, replace = TRUE),
V2 = sample(letters[1:10], 20, replace = TRUE),
stringsAsFactors = FALSE)
df$freqV1 <- NA_integer_
for (i in 1:nrow(df)) {
df$freqV1[i] <- length(grep(pattern = df$V1[i], x = df$V1))
}
df2 <- arrange(df, freqV1, V2) # you may want just arrange(df, freqV1)
which gives:
V1 V2 freqV1
1 h c 1
2 d a 2
3 d b 2
4 c c 2
5 c j 2
6 b c 3
7 g c 3
8 b f 3
9 g h 3
10 g h 3
11 b i 3
12 i a 4
13 i c 4
14 i d 4
15 i f 4
16 f b 5
17 f d 5
18 f d 5
19 f e 5
20 f f 5
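The round-robin ordering described in the question can also be obtained without a loop. The sketch below is one possible approach in base R, not the method used in either answer: it numbers each row within its V1 group (`occ`), attaches the group frequency (`freq`), and then sorts by pass number, frequency, and V1. The names `occ`, `freq`, and `res` are introduced here for illustration.

```r
# Data from the question.
df <- data.frame(
  V1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "e", "f", "g"),
  V2 = c("b", "e", "f", "c", "e", "f", "d", "g", "h", "g", "h", "f", "g", "h"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(df))
# occ: 1 for the first row of each V1 value, 2 for the second, and so on.
occ <- ave(idx, df$V1, FUN = seq_along)
# freq: how many rows share the same V1 value.
freq <- ave(idx, df$V1, FUN = length)

# Pass number first, then ascending frequency, then V1 to break ties.
res <- df[order(occ, freq, df$V1), ]
res
```

On this data the row names of `res` come out in the same order as the result requested in the question (12, 13, 14, 10, 1, 4, 7, 11, 2, 5, 8, 3, 6, 9). Ties between elements of equal frequency are broken alphabetically here, which is an assumption about what the asker wants.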

Creating a new variable column based on data from another column

I'm pretty new to R, and programming in general, and I'm wondering the best way to loop through a column so I can add a column to the data frame further describing the observations I looped through.
I currently have a list of amino acids and their positions on a protein that looks like this:
Residue Position
H 1
R 2
K 3
D 4
E 5
H 6
R 7
K 8
D 9
E 10
I'd like something that looks like this (where H, R, and K are basic amino acids, and D and E are acidic amino acids):
Residue Position Properties
H 1 Basic
R 2 Basic
K 3 Basic
D 4 Acidic
E 5 Acidic
H 6 Basic
R 7 Basic
K 8 Basic
D 9 Acidic
E 10 Acidic
I'm really not sure where to start, and I'm having difficulty finding a good resource for this kind of situation in R.
I started by trying to subset the data, but then I realized that wouldn't do the trick:
# Basic
h.dat <- subset(all, all$Residue == "H")
r.dat <- subset(all, all$Residue == "R")
k.dat <- subset(all, all$Residue == "K")
# Acidic
d.dat <- subset(all, all$Residue == "D")
e.dat <- subset(all, all$Residue == "E")
Thanks!
Note:
H = Histidine (Basic amino acid)
R = Arginine (Basic)
K = Lysine (Basic)
E = Glutamic Acid (Acidic)
D = Aspartic Acid (Acidic)
You can use ifelse. If df is the name of your original data,
df$Property <- ifelse(df$Residue %in% c("H", "R", "K"), "Basic", "Acidic")
df
# Residue Position Property
# 1 H 1 Basic
# 2 R 2 Basic
# 3 K 3 Basic
# 4 D 4 Acidic
# 5 E 5 Acidic
# 6 H 6 Basic
# 7 R 7 Basic
# 8 K 8 Basic
# 9 D 9 Acidic
# 10 E 10 Acidic
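Note that the ifelse() call labels every residue other than H, R, or K as "Acidic", including any residue that is neither. A possible alternative, sketched below, is a named lookup vector, which leaves unknown residues as NA; `props` is a name introduced here, and the data frame is rebuilt from the question so the example runs on its own.

```r
# Data from the question.
df <- data.frame(
  Residue = c("H", "R", "K", "D", "E", "H", "R", "K", "D", "E"),
  Position = 1:10,
  stringsAsFactors = FALSE
)

# Lookup vector: names are residues, values are their properties.
props <- c(H = "Basic", R = "Basic", K = "Basic", D = "Acidic", E = "Acidic")

# as.character() guards against Residue being a factor, in which case
# indexing would use the integer codes instead of the labels.
df$Property <- unname(props[as.character(df$Residue)])
df
```

Adding further classes later only requires extending `props`, with no change to the assignment line.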
Try:
> df1
Residue Position
1 H 1
2 R 2
3 K 3
4 D 4
5 E 5
6 H 6
7 R 7
8 K 8
9 D 9
10 E 10
Create a reference table:
> df2
Residue Property
1 H Basic
2 R Basic
3 K Basic
4 D Acidic
5 E Acidic
Then merge:
> merge(df1, df2)
Residue Position Property
1 D 9 Acidic
2 D 4 Acidic
3 E 5 Acidic
4 E 10 Acidic
5 H 1 Basic
6 H 6 Basic
7 K 8 Basic
8 K 3 Basic
9 R 7 Basic
10 R 2 Basic
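As the output shows, merge() sorts the result by the join column, so the original Position order is lost. A minimal sketch of the full workflow, assuming the reference table is built by hand as below (`merged` is a name introduced here), restores the order afterwards:

```r
# Original data and hand-built reference table.
df1 <- data.frame(
  Residue = c("H", "R", "K", "D", "E", "H", "R", "K", "D", "E"),
  Position = 1:10,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Residue = c("H", "R", "K", "D", "E"),
  Property = c("Basic", "Basic", "Basic", "Acidic", "Acidic"),
  stringsAsFactors = FALSE
)

# Join on Residue, then put the rows back in Position order.
merged <- merge(df1, df2, by = "Residue")
merged <- merged[order(merged$Position), ]
rownames(merged) <- NULL
merged
```

By default merge() keeps only rows with a match in both tables; passing all.x = TRUE would keep residues missing from the reference table, with NA as their Property.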
I think you might want to allow for non-polar amino acids as well:
c(rep("Basic",3),rep("Acidic",2),"Non-Polar")[ # those are the choices
match(dat$Residue, c("H","R","K","E","D"), nomatch=6) ] #select indices
So I added an 11th residue named "Z" and tested:
> dat$Property <- c(rep("Basic",3),rep("Acidic",2),"Non-Polar")[
match(dat$Residue, c("H","R","K","E","D"), nomatch=6) ]
> dat
Residue Position Property
1 H 1 Basic
2 R 2 Basic
3 K 3 Basic
4 D 4 Acidic
5 E 5 Acidic
6 H 6 Basic
7 R 7 Basic
8 K 8 Basic
9 D 9 Acidic
10 E 10 Acidic
11 Z 11 Non-Polar

How to find out whether a variable is a factor or continuous in R

I have a table with a bunch of variables. What statement can I use to find out whether these variables are treated as factors or as continuous?
Assuming foo is the name of your object and it is a data frame,
f <- sapply(foo, is.factor)
will apply the is.factor() function to each component (column) of the data frame. is.factor() checks if the supplied vector is a factor as far as R is concerned.
Then
which(f)
will tell you the indices of the factor columns. f is itself a logical vector, so you could select the factor columns via
foo[, f]
or select all but them
foo[, !f]
Here is an example:
> ## some dummy data
> foo <- data.frame(a = factor(1:10), b = 1:10, c = factor(letters[1:10]))
> foo
a b c
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
6 6 6 f
7 7 7 g
8 8 8 h
9 9 9 i
10 10 10 j
> ## apply is.factor
> f <- sapply(foo, is.factor)
> f
a b c
TRUE FALSE TRUE
> ## which are factors
> which(f)
a c
1 3
> ## select those
> foo[, f]
a c
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
There are equivalent checks for numeric and integer too, amongst others: is.numeric() and is.integer(), but you only need is.numeric() if you don't care about the type of numbers:
> is.numeric(1L)
[1] TRUE
(Also is.character(), is.logical(), ...)
You have to use is.factor and is.numeric.
You can use str(Data)
?str - Compactly display the internal structure of an R object, a
diagnostic function and an alternative to summary (and to some extent,
dput)
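The checks above can be combined into a quick per-column summary. This is a small sketch on the same dummy data; `col_class`, `col_kind`, and `only_numeric` are names introduced here for illustration.

```r
# Same dummy data as in the example above.
foo <- data.frame(a = factor(1:10), b = 1:10, c = factor(letters[1:10]))

# Class of every column (first class only, since some objects such as
# ordered factors carry more than one).
col_class <- sapply(foo, function(x) class(x)[1])

# Label each column as factor, numeric, or other.
col_kind <- ifelse(sapply(foo, is.factor), "factor",
                   ifelse(sapply(foo, is.numeric), "numeric", "other"))

# drop = FALSE keeps a data frame even when a single column is selected;
# without it, foo[, f] collapses to a plain vector in that case.
only_numeric <- foo[, col_kind == "numeric", drop = FALSE]

# Compact overview of the whole object.
str(foo)
```

Here `col_class` reports "integer" for column b while `col_kind` reports "numeric", which reflects the point made earlier that is.numeric() is TRUE for integers as well.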

Resources