Term-document matrix seems to combine words? ("sootawn", "abilitiesbrush") - R

head(myTDM)
                 Docs
Terms             1  2
  abatement      20  0
  abilities      18 80
  abilitiesbrush  0  1
  abilitiescreate 0 13
  abilitiesmarket 0  6
  able            0 30
I thought I was doing something wrong, since the output contains combinations of two or sometimes more words, but then I saw the same thing in a text-mining book. Why does it appear like that? Many thanks in advance.
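This usually happens when punctuation or line breaks are deleted outright rather than replaced with spaces during preprocessing (tm's removePunctuation, for instance, deletes punctuation characters). A minimal Python sketch of the difference, with made-up strings for illustration:

```python
import re

raw = "abilities-brush abilities,create"

# Deleting punctuation outright fuses the adjacent words:
fused = re.sub(r"[^\w\s]", "", raw)
# Replacing punctuation with a space keeps them separate:
separated = re.sub(r"[^\w\s]", " ", raw)

print(fused)      # abilitiesbrush abilitiescreate
print(separated)  # abilities brush abilities create
```

Substituting a space and then squeezing repeated whitespace is the usual fix before building the term-document matrix.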


How to calculate similarity of numbers (in list)

I am looking for a method of calculating a similarity score for a list of numbers. Ideally the method should give results in a fixed range, for example from 0 to 1, where 0 means not similar at all and 1 means all numbers are identical.
For clarity let me provide a few examples:
0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => score should be still pretty high as only the last value is different
I have tried calculating the similarity based on normalization and the average, but that gives really bad results when there is one 'bad' number.
Thank you.
Similarity tests are always incredibly subjective, and the right one to use depends heavily on what you're trying to use it for. We already have three typical measures of central tendency (mean, median, mode). It's hard to say what test will work for you because there are different ways of measuring that will do what you're asking, but have wildly different measures for other lists (like [1]*7 + [100] * 7). Here's one solution:
import statistics as stats

def tester(ell):
    # Fraction of duplicated values: 1 when every value is identical.
    mode_measure = 1 - len(set(ell)) / len(ell)
    # Relative spread: 1 when the standard deviation is zero.
    # (stats.stdev needs at least two values; stats.mean must be non-zero.)
    avg_measure = 1 - stats.stdev(ell) / stats.mean(ell)
    return max(avg_measure, mode_measure)

How do I make a selected table confined to a matrix, rather than a running list?

My previous lines of code for making tables from column names successfully produced short, dense matrices that let me readily cross-tabulate survey responses from two questions (second example below).
However, when I use the same line of code on a new column, I don't get that sleek matrix; I end up with a running list of unlinked sub-tables, which I do not want (first example below). Perhaps it's because the new column only has 0's and 1's as values, versus the others that have more than two.
[Please forgive my formatting issues (Stack Overflow status: newbie). Also, many thanks in advance to those checking in on and answering my question!]
> table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))

Relationship 2Affected Individual  1
  1                                0
  2                                0
  3                                0
  6                                0
  Other (please specify)           0

, , 1 = 1, Response = 10679308122
0

Relationship 2Affected Individual  1
  1                                0
  2                                0
  3                                0
  6                                0
  Other (please specify)           0

, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
                                  Indirect_Benefits
Relationship 2Affected Individual  0  1  2  3
  1                                4  1  0  0
  2                               42 17  9  3
  3                               12  1  1  0
  6                                5  2  2  0
  Other (please specify)           1  0  0  0
> # rstudioapi::versionInfo()
> # packageVersion("dplyr")
table(data_final$`Relationship 2Affected Individual`, data_final$Satisfied_Treatments)
Passing exactly two vectors (note the backticks, since the column name contains spaces) gives the flat two-way matrix. Problem solved.
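For intuition: cross-tabulating exactly two vectors always yields a single two-way matrix, whereas passing extra columns to table() produces one sub-table per combination of the extra values. A rough Python analogue of the two-vector case, with made-up data:

```python
from collections import Counter

relationship = [1, 2, 2, 3, 2, 1]
benefits     = [0, 1, 0, 0, 1, 0]

# Counting (row, column) pairs gives the cells of one two-way table.
cells = Counter(zip(relationship, benefits))

print(cells[(2, 1)])  # 2 -- relationship 2, benefit 1 occurs twice
print(cells[(1, 0)])  # 2 -- relationship 1, benefit 0 occurs twice
```

A third list in the zip would key the counts on triples instead, which is exactly the "list of sub-tables" behaviour.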

Optimized data sort in R

This is my first post, so please bear with me if I get something wrong.
I have a set of data that I want to sort in a specific way using R.
It is a non-square matrix of tasks (>100) that require specific components (>100) of a specific material (1, 2, 3, or 4). I want to figure out which tasks can be executed as a group because they require the same components. That part is easy. But I also want to optimize it, meaning that components with a low-value material (1) can be "upgraded" to a higher-value material (2, 3, or 4) if that decreases the number of groups.
So far I have only managed to sort the data while upgrading all materials, which is not what I want.
My minimum example looks like this:
1 0 0 2 0 4 0
2 2 4 0 0 1 1
0 4 4 0 0 3 0
1 3 0 1 1 0 4
4 2 1 0 4 1 2
0 means that this component is not required.
I hope I have described my problem clearly enough.
Thanks a lot in advance for your suggestions.
Miguel
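The first step described above, grouping tasks that need exactly the same components while ignoring material values, can be sketched by keying each row on its set of non-zero columns; the upgrade optimization would then try to merge groups whose patterns differ only in low-value materials. A minimal Python sketch over the example matrix (rows are tasks, columns are components):

```python
from collections import defaultdict

tasks = [
    [1, 0, 0, 2, 0, 4, 0],
    [2, 2, 4, 0, 0, 1, 1],
    [0, 4, 4, 0, 0, 3, 0],
    [1, 3, 0, 1, 1, 0, 4],
    [4, 2, 1, 0, 4, 1, 2],
]

# Group task indices by which components they require (0 = not required).
groups = defaultdict(list)
for i, row in enumerate(tasks):
    pattern = tuple(v > 0 for v in row)
    groups[pattern].append(i)

print(len(groups))  # 5 -- every example row has a distinct requirement pattern
```

In the example every pattern is distinct, so no grouping happens; the interesting (and much harder) part is searching over upgrades that make patterns coincide, which is a combinatorial optimization problem rather than a sort.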

Using findAssocs in R to find words that frequently occur with a central term

While working with findAssocs in R, I realised that the function doesn't actually pick up the words that occur together with the searched term across documents, but rather the words that occur in the documents where the searched term appears most frequently.
I've tried using a simple test script below:
library(tm)

test <- list("housekeeping bath towel housekeeping room",
             "housekeeping dirty",
             "housekeeping very dirty",
             "housekeeping super dirty")
test <- Corpus(VectorSource(test))
test_dtm <- DocumentTermMatrix(test)
test_dtms <- removeSparseTerms(test_dtm, 0.99)
findAssocs(test_dtms, "housekeeping", corlimit = 0.1)
And the returning result from R is:
$housekeeping
 bath  room towel
    1     1     1
Notice that the word "dirty" occurs in 3 out of the 4 documents, while the returned keywords each occur only once across all documents.
Does anyone have any idea what went wrong in my script, or is there a better way to do this?
The result I want is a model that reflects the words occurring frequently with the search term across all documents, not within a specific document. I tried combining the 4 documents into 1, but that doesn't work, since findAssocs can't operate on a single document.
Any advice?
How about an alternative, using the quanteda package? It imposes no mystery restrictions on the correlations returned, and has many other options (see ?similarity).
require(quanteda)
testDfm <- dfm(unlist(test), verbose = FALSE)
## Document-feature matrix of: 4 documents, 7 features.
## 4 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    housekeeping bath towel room dirty very super
##   text1            2    1     1    1     0    0     0
##   text2            1    0     0    0     1    0     0
##   text3            1    0     0    0     1    1     0
##   text4            1    0     0    0     1    0     1
similarity(testDfm, "housekeeping", margin = "features")
## similarity Matrix:
## $housekeeping
##    bath   towel    room    very   super   dirty
##  1.0000  1.0000  1.0000 -0.3333 -0.3333 -1.0000

R if then else loop

I have the following output and would like to insert a column that is 0 when net.results$net.result is < 0.5 and 1 when it is >= 0.5 but < 1.0; basically rounding down or up.
How do I go about doing this in a loop? Can I insert this column using the data.frame below, in between the predicted and the test-set columns?
Assume I don't know the number of rows that net.results$net.result has.
Thank you for your help.
data.frame(net.results$net.result, diabTest$class)
   predicted col           Test set col
   net.results.net.result  diabTest.class
4            0.2900909633               0
7            0.2900909633               1
10           0.4912509122               1
12           0.4912509122               1
19           0.2900909633               0
21           0.2900909633               0
23           0.4912509122               1
26           0.2900909633               1
27           0.4912509122               1
33           0.2900909633               0
As the commenters have pointed out, this will not work in every situation, but based on the appearance of the data it should produce the desired output.
df$rounded <- round(df$net.results.net.result,0)
Here are a few test values showing what the function does for different numbers; see the round help page (?round) for more info.
round(0.2900909633,0)
[1] 0
round(0.51, 0)
[1] 1
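One caveat the commenters likely had in mind: round() in R (and in Python) rounds halves to the nearest even number, so 0.5 rounds to 0, not 1. If the intent is a hard cutoff at 0.5, an explicit threshold avoids that edge case entirely; a small Python sketch with made-up predictions:

```python
preds = [0.2900909633, 0.4912509122, 0.5, 0.51]

# Explicit cutoff: 1 when the prediction reaches 0.5, else 0.
labels = [int(p >= 0.5) for p in preds]
print(labels)      # [0, 0, 1, 1]

# round() would disagree on the exact-0.5 case:
print(round(0.5))  # 0 -- half rounds to the nearest even number
```

The R equivalent of the threshold is ifelse(x >= 0.5, 1, 0), which needs no loop and works for any number of rows.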
You can help everyone by supplying a reproducible example, doing research, and explaining approaches that you've tried.
