IDF (Inverse Document Frequency) calculations - information-retrieval

I've calculated the TF (term frequency) of my dataset and I'm now trying to calculate the IDF for it. I'm confused about which number to use for the calculation.
id uid
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 e
3 b
3 c
3 e
3 f
(3 items)
Occurrence
a = 2
b = 3
c = 3
d = 1
e = 2
f = 1
Which gives something like this below:
A B C
A - 2 2
B 2 - 3
C 2 3 -
Formula
IDF(t, D) = log(total number of documents / number of documents containing term t)
For example, using (A,B), whose value is 2: how should I go about calculating it?
Total documents = 3
Number of documents matching the term = should I be using A's or B's value? (2 or 3)
(A,B) * log(total / matching)
= 2 * log(3 / 2 or 3)?

I am not sure what you meant by (A,B).
But I assume that from your dataset: the first column is document id, and the second column is term.
If my assumption is correct then:
doc id 1 is "a b c d"
doc id 2 is "a b c e"
doc id 3 is "b c e f"
Your formula for IDF(t, D) is log(# of documents / # of documents that contain that term). Thus, we can calculate IDF for each term as follows:
IDF('a', D) = log(3 / 2)
IDF('b', D) = log(3 / 3)
and so on...
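If it helps, here is a minimal R sketch of that calculation (the docs layout and variable names are mine, reconstructed from your dataset):
docs <- list(`1` = c("a", "b", "c", "d"),
             `2` = c("a", "b", "c", "e"),
             `3` = c("b", "c", "e", "f"))
terms <- sort(unique(unlist(docs)))
N <- length(docs)
# document frequency: in how many documents each term appears
df_t <- sapply(terms, function(t) sum(sapply(docs, function(d) t %in% d)))
# IDF(t, D) = log(N / df_t)
round(log(N / df_t), 3)
#     a     b     c     d     e     f
# 0.405 0.000 0.000 1.099 0.405 1.099
Note that the pairwise counts in your (A,B) matrix never enter into IDF; it is a per-term quantity.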
Here is my reference: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Related

find the length of the longest chain of characters in each row of a dataframe

I want to find the length of the longest chain of characters following a pattern. Say I have this dataframe; I want to find, for each row, the length of the longest run in which "a" repeats. How do I find it?
id = c(1, 2, 3,4,5)
A = c("a","a","a","a","a")
B = c("a","a","b","a","d")
C = c("b","a","c","a","a")
D = c("a","a","a","b","c")
E = c("a","a","e","c","a")
df = data.frame(id,A,B,C,D,E,stringsAsFactors=FALSE)
df$Count = c(2, 5, 1, 3, 1)  # the desired result
id A B C D E Count
1 a a b a a 2
2 a a a a a 5
3 a b c a e 1
4 a a a b c 3
5 a d a c a 1
You can use rle (run-length encoding).
# run-length encode each row across columns A..E
rles = apply(df[2:6], 1, rle)
# for each row, take the longest run whose value is "a"
# (note: a row with no "a" at all would give -Inf with a warning)
result = sapply(rles, function(x) max(x$lengths[x$values == "a"]))
df$new_count = result
df
# id A B C D E Count new_count
# 1 1 a a b a a 2 2
# 2 2 a a a a a 5 5
# 3 3 a b c a e 1 1
# 4 4 a a a b c 3 3
# 5 5 a d a c a 1 1
See ?rle, or search this site for "[r] rle", for additional details.
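To get a feel for what rle returns, here is what it produces for the first row of your data:
rle(c("a", "a", "b", "a", "a"))
# Run Length Encoding
#   lengths: int [1:3] 2 1 2
#   values : chr [1:3] "a" "b" "a"
The answer above simply keeps the lengths whose values equal "a" and takes the maximum.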

Grouping of two variables in R

I have a data frame like below.
dat <- data.frame(v1 = c("a","b","c","c","a","w","f"),
                  v2 = c("z","a","a","w","p","e","h"))
v1 v2
1 a z
2 b a
3 c a
4 c w
5 a p
6 w e
7 f h
I want to add a group column based on whether these letters appear in the same row.
v1 v2 gp
1 a z 1
2 b a 1
3 c a 1
4 c w 1
5 a p 1
6 w e 1
7 f h 2
My idea is to first assign the first row to group 1, and then any row that v1 or v2 is "a" or "z" will also be assigned to group 1.
There are scenarios like rows 3 and 4, where "c" is assigned to group 1 because, in row 3, v2 is "a"; and "w" is assigned to group 1 because, in row 4, v1 is "c", which was assigned to group 1 previously. But my list is very long, so I cannot keep checking all the "descendants".
I wonder if there is a way to group these letters, and return a list with group number. Something like the below table will do.
letter gp
a 1
b 1
c 1
e 1
f 2
h 2
w 1
z 1
One way to solve this is to consider the letters as vertices of a graph and being in the same row as a link between the vertices. Then what you are asking for is the connected components of the graph. All of that is easy using the igraph package in R.
library(igraph)
# each row of dat is an (undirected) edge between two letters
G = graph_from_edgelist(as.matrix(dat), directed = FALSE)
letters = sort(unique(c(as.character(dat$v1), as.character(dat$v2))))
# connected-component membership, indexed by letter
(gp = components(G)$membership[letters])
a b c e f h p w z
1 1 1 1 2 2 1 1 1
If you want a data.frame containing this information
(Groups = data.frame(letters, gp, row.names=NULL))
letters gp
1 a 1
2 b 1
3 c 1
4 e 1
5 f 2
6 h 2
7 p 1
8 w 1
9 z 1
In order to think through why this works, it may help you to look at the graph that was created and think how that represents your problem.
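For instance, a quick plot (the coloring is my own suggestion, not part of the original answer) makes the two components visible at a glance:
# color each vertex by its component; f and h form their own island
plot(G, vertex.color = components(G)$membership)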

Count of rows that do not contain a factor and whose value is not zero

I have the following data
path value
1 b,b,a,c 3
2 c,b 2
3 a 10
4 b,c,a,b 0
5 e,f 0
6 a,f 1
df <- data.frame(path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"),
                 value = c(3, 2, 10, 0, 0, 1))
I wish to compute, for each factor, the total number of rows that do not contain that factor and where the value is not zero. So my desired output will be:
# desired output
path value
1: b 2
2: a 1
3: c 2
4: e 4
5: f 3
For instance, for a the result is 1: only in row 2 do we not have a while the value is not zero. (Hope it is clear; please let me know if more explanation is required.)
I tried the following code but the output for b is wrong. Does anyone know why?
total <- sum(df$value != 0)
library(splitstackshape)
# total number of non-zero rows minus the per-factor non-zero count
output <- cSplit(df, "path", ",", "long")[, .(value = total - sum(value != 0)), .(path)]
output
This code results in the following output, which is not correct for b:
path value
1: b 1
2: a 1
3: c 2
4: e 4
5: f 3
Your cSplit approach double-counts rows where a factor repeats within a path ("b,b,a,c" splits into two "b" rows, so row 1 is subtracted twice for b), which is why b comes out wrong. Instead, read the factors into facs, then grep them out and count:
# all distinct factors appearing anywhere in path
facs <- unique(scan(textConnection(as.character(df$path)), what = "", sep = ","))
# for each factor, count the rows that do NOT mention it and have a non-zero value
data.frame(path = facs,
           value = colSums(!sapply(facs, grepl, as.character(df$path)) & df$value != 0))
giving:
path value
b b 2
a a 1
c c 2
e e 4
f f 3
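An equivalent base-R sketch that splits each path instead of using regular expressions (safer if factor names ever share characters; the variable names are mine):
# split each path into its factors once
path_list <- strsplit(as.character(df$path), ",")
facs <- unique(unlist(path_list))
# count rows per factor that lack the factor and have a non-zero value
sapply(facs, function(f)
  sum(!vapply(path_list, function(p) f %in% p, logical(1)) & df$value != 0))
# b a c e f
# 2 1 2 4 3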

Cumulative sum conditional over multiple columns in an R dataframe containing the same values

Say my data.frame is as outlined below:
df <- as.data.frame(cbind("Home" = c("a","c","e","b","e","b"),
                          "Away" = c("b","d","f","c","a","f")))
df$Index <- rep(1, nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative count, using the Index column, for each character a-f, regardless of whether it appears in the Home or Away column. A column called Cumulative_Sum_Home takes the character in the Home column ("b" in the case of row 6) and counts how many times "b" has appeared in either the Home or Away column in all rows up to and including row 6. Since "b" has appeared 3 times cumulatively in the first 6 rows, Cumulative_Sum_Home takes the value 3. The same logic applies to the Cumulative_Sum_Away column: taking row 5, character "a" appears in the Away column and has cumulatively appeared 2 times in either column up to that row, so Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at data.table approaches, but I've never used that package before, so I can't immediately see how to solve it. Any tips would be greatly appreciated.
There is scope to make this leaner, but if that doesn't matter much for you, then this should be okay.
NewColumns = list()
# (with character columns rather than factors, use unique(c(df$Home, df$Away)) instead of levels)
for (i in sort(unique(c(levels(df[, "Home"]), levels(df[, "Away"]))))) {
  # TRUE at the rows where letter i appears in either column
  NewColumnAddition = i == df$Home | i == df$Away
  # replace the TRUEs with the running count of appearances
  NewColumnAddition[NewColumnAddition] = cumsum(NewColumnAddition[NewColumnAddition])
  NewColumns[[i]] = NewColumnAddition
}
# look up each row's running count for its Home and Away letters
df$Cumulative_Sum_Home = sapply(seq(nrow(df)), function(i) {
  NewColumns[[as.character(df[i, "Home"])]][i]
})
df$Cumulative_Sum_Away = sapply(seq(nrow(df)), function(i) {
  NewColumns[[as.character(df[i, "Away"])]][i]
})
> df
  Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1    a    b     1                   1                   1
2    c    d     1                   1                   1
3    e    f     1                   1                   1
4    b    c     1                   2                   2
5    e    a     1                   2                   2
6    b    f     1                   3                   2
Here's a data.table alternative -
setDT(df)
for (i in sort(unique(c(levels(df[, Home]), levels(df[, Away]))))) {
  df[, TotalSum := cumsum(i == Home | i == Away)]
  df[Home == i, Cumulative_Sum_Home := TotalSum]
  df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[, TotalSum := NULL]
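A more compact base-R alternative, offered as a sketch of my own (not from the original answers), starting from the plain data.frame: flatten the Home/Away pairs in row order, take a running occurrence count per letter with ave, and fold the counts back into two columns.
# interleave Home and Away so each row contributes Home first, then Away
long <- c(t(as.matrix(df[, c("Home", "Away")])))
# running count of each letter's appearances, in that order
counts <- ave(seq_along(long), long, FUN = seq_along)
df$Cumulative_Sum_Home <- counts[seq(1, length(counts), by = 2)]
df$Cumulative_Sum_Away <- counts[seq(2, length(counts), by = 2)]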

How do I find the highest common number for each group in SQLite?

Here is my table example:
LETTER NUMBER
a 1
a 2
a 4
b 1
b 2
b 3
c 1
c 2
c 3
d 1
d 2
d 3
e 1
e 2
e 3
The result I want:
LETTER NUMBER
a 2
b 2
c 2
d 2
e 2
The highest number that matches an 'a' is 4, while it's 3 for the other letters. However, the highest number they all have in common is 2. That is why the result table has 2 for NUMBER.
Does anyone know how I can accomplish this?
Let's call your table l. Here's a horribly inefficient solution: keep only the rows whose NUMBER appears with every distinct LETTER, then take each letter's maximum among what's left.
select l.LETTER, max(l.NUMBER)
from l
where (select count(distinct LETTER) from l)
    = (select count(distinct l2.LETTER)
       from l as l2
       where l2.NUMBER = l.NUMBER)
group by l.LETTER;
Kind of a mess, huh?
