Autocoding using RQDA

I'm trying to use RQDA for quantitative text analysis. I want to automatically code text passages that contain the same characters.
Let's say I have the category dog, and I marked "dog" in the first sentence and "dogfood" in the fourth. I want RQDA to also mark "dog" in the second sentence and "dogfood" in the fifth.
In MAXQDA, for example, this is done automatically if I enable the corresponding option. Is there a function to do this?

If I understand correctly, you want to do automatic coding in RQDA. The function for this is codingBySearch:
codingBySearch(pattern, fid = getFileIds(), cid, seperator,
               concatenate = FALSE)
But this function only handles a single pattern at a time. If you want to run a list of patterns, a loop will sort it out:
X <- c("pattern1", "pattern2", "pattern3", "pattern4", "pattern5", "pattern6")
for (i in X) {
codingBySearch(i,fid=getFileIds(),cid=cid_number, seperator="[.!?]",ignore.case=TRUE)
}
Here cid is the number of the code you created in the GUI. You can also adapt the separators as you see fit.
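For a concrete end-to-end sketch (the project file name and the "dog" code are hypothetical, and the freecode lookup assumes RQDA's standard project schema, so verify it against your own project):
library(RQDA)
openProject("myproject.rqda")  # hypothetical project file
# Look up the cid for the code "dog" instead of reading it off the GUI;
# RQDAQuery() runs SQL against the project's SQLite tables
codes <- RQDAQuery("SELECT id, name FROM freecode WHERE status = 1")
cid_number <- codes$id[codes$name == "dog"]
# Code every sentence that matches either pattern
for (i in c("dog", "dogfood")) {
  codingBySearch(i, fid = getFileIds(), cid = cid_number, seperator = "[.!?]", ignore.case = TRUE)
}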

Related

How to check whether an English word is meaningful in Julia?

In Julia, how can I check whether an English word is meaningful? Suppose I want to know whether "Hello" is meaningful or not. In Python, one can use the enchant or nltk packages (examples: [1], [2]). Is it possible to do this in Julia as well?
What I need is a function like this:
is_english("Hello")
>>>true
is_english("Hlo")
>>>false
# Because it has no meaning! There is no such word in English!
is_english("explicit")
>>>true
is_english("eeplicit")
>>>false
Here is what I've tried so far:
I have a dataset that contains frequent 5-character English words (link to google drive). So I decided to add it to my question for better understanding. Although this dataset is not adequate (it only contains frequent 5-character meaningful words, not all meaningful English words of any length), it's suitable for showing what I want:
using CSV
using DataFrames
df = CSV.read("frequent_5_char_words.csv", DataFrame, skipto=2)
df = [lowercase(item) for item in df[:, "0"]]
function is_english(word::String)::Bool
    return lowercase(word) in df
end
Then when I try these:
julia>is_english("Helo")
false
julia>is_english("Hello")
true
But I don't have a comprehensive dataset! So this isn't enough. I'm curious whether there are any packages in Julia like the ones I mentioned before.
(not enough rep to post a comment!)
You can still use NLTK in Julia via PyCall. Or, since it seems you don't need an NLP tool but just a dictionary, you can use Wiktionary to do lookups or to build the dataset.
There is a relatively new package named LanguageDetect.jl. It does not return true/false, but a list of probabilities. You could define something like:
using LanguageDetect: detect

function is_english(text, threshold=0.8)
    langs = detect(text)
    for lang in langs
        if lang.language == "en"
            return lang.probability >= threshold
        end
    end
    return false  # no English detected at all
end

I am using R code to count occurrences of a specific word in a string. How can I update it to count when the word's synonyms are used?

I'm using the following code to find whether the word "assist" is used in a string variable.
string <- c("assist")
assist <- (1:nrow(df) %in% c(sapply(string, grep, df$textvariable, fixed = TRUE))) + 0
sum(assist)
If I also wanted to check whether synonyms such as "help" and "support" are used in the string, how can I update the code? If any of these synonyms is used, I want to code it as 1; if none of them is used, I want to code it as 0. It doesn't matter whether all of the words appear in the string or how many times they are used.
I tried changing it to
string<- c("assist", "help", "support")
But it looks like it is searching for strings in which all of these words are used?
I'd appreciate your help!
Thank you
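One common way to handle this (a sketch, not from the original thread; df$textvariable as in the question) is to collapse the words into a single regular expression, so a row scores 1 when any of them matches:
string <- c("assist", "help", "support")
pattern <- paste(string, collapse = "|")  # "assist|help|support": regex alternation
assist <- as.integer(grepl(pattern, df$textvariable, ignore.case = TRUE))
sum(assist)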

Use substr with start and stop words instead of integers

I want to extract information from downloaded HTML code. The HTML code is given as a string. The required information is stored between specific HTML expressions. For example, if I want every headline in the string, I have to search for "H1>" and "/H1>" and take the text between these HTML expressions.
So far I have used substr(), but I had to calculate the positions of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but calculating every single start and stop position is a lot of work. Instead, I am looking for a function similar to substr() where you can use start and stop words instead of positions. For example, like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")
I'd agree that using a package built for HTML processing is probably the best way to handle the example you give. However, one potential way to substring a string based on character values would be the following.
Step 1: Define a simple function to return the positions of a character string within a string; in this example I am only using fixed character strings.
strpos_fixed <- function(string, char) {
  a <- gregexpr(char, string, fixed = TRUE)
  b <- a[[1]][1:length(a[[1]])]  # extract all match positions
  return(b)
}
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
char_substr <- function(string, start, stop) {
  x <- strpos_fixed(string, start) + nchar(start)  # first character after each start tag
  y <- strpos_fixed(string, stop) - 1              # last character before each stop tag
  z <- cbind(x, y)
  apply(z, 1, function(x) { substr(string, x[1], x[2]) })
}
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
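For reference, the three calls above should return "headline" / "headline2", "baa dee ya" / "say do you remember?", and "baa dee ya" / "dancing in september", respectively (sketched output, assuming the two functions as defined).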
You have two options here. First, use a package that has been developed explicitly for parsing HTML structures, e.g., rvest; there are a number of tutorials online.
Second, for edge cases where you may need to extract from strings that are not necessarily well-formed HTML, you can use regular expressions. One of the simpler implementations comes from stringr::str_match:
library(stringr)
# 1. the parentheses define regex groups
# 2. ".*?" means any characters, non-greedy
# 3. so together we are matching the expression <H1>text of any length</H1>
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This will yield a matrix where the columns are (in order) the fully matched string followed by each regex group we specified. Here you want the second group, i.e. whatever text is between the <H1> tags (the 3rd column).
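For completeness, a minimal sketch of the rvest route mentioned above (assuming rvest is installed; htmlcode as defined in the question):
library(rvest)
page <- read_html(htmlcode)           # parse the fragment into an HTML document
html_text(html_elements(page, "h1"))  # [1] "headline"  "headline2"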

Two PASTE functions in a character vector

attach.files = c(paste("/users/joesmith/nosection_", currentDate,".csv",sep=""),
paste("/users/joesmith/withsection_", currentDate,".csv",sep=""))
I'm trying to attach files to an automated email, but when I structure the file names like this, it doesn't work. If I wrote them out manually, like
c("nosection_051418.csv", "withsection_051418.csv")
it would work fine, but since I'm automating this to run every day I can't do that. How can I restructure this so that the character vector accepts it?
I thought your example implied the need for "parallel" inputs for the path stem, the first portion of the file name, and the date portion of those full paths. Consider this illustration of using a two-item vector and a one-item vector (produced by Sys.Date, replacing your currentDate) to populate the %s positions in a sprintf string (suggested by @Gregor):
sprintf("/users/joesmith/%s_%s.csv", c("nosection", "withsection"), Sys.Date() )
[1] "/users/joesmith/nosection_2018-05-14.csv" "/users/joesmith/withsection_2018-05-14.csv"

readline is considering every record in the spreadsheet as a new line [R]

I am trying to create a function that will calculate the frequency count of keywords using the tm package. The function works fine if the text entered at the readline prompt is free-form text without newlines. The problem is that when I paste a bunch of text copied from a spreadsheet, readline treats each row as a new line.
library(tm)  # provides Corpus, VectorSource, TermDocumentMatrix

keyword <- function() {
  x <- readline(as.character('Input text here: '))
  x <- Corpus(VectorSource(x))
  ...
  tdm <- TermDocumentMatrix(x)
  ...
  tdm
}
Here's the full code: https://github.com/CSCDataAnalytics/PM-Analysis/blob/master/Keyword.R
How can I prevent this from happening, or at least treat the text from every row of the spreadsheet as one vector?
If I'm understanding you correctly, the problem is that when the user pastes text from another application, the newline causes R to stop accepting the subsequent lines.
One technique (fragile as it may be) is to look for a specific terminating line, such as an empty line "" or a period ".". It's a little fragile because (1) you need assurance that the data will "never" include that as a whole line, and (2) the user must be able to append it easily.
Try:
endofinput <- ""
totalstr <- ""
while(! endofinput == (x <- readline('prompt (empty string when done): ')))
totalstr <- paste(totalstr, x)
In this case, the empty string is the catch, and when the while loop is done, totalstr contains all input separated by a space (this can be changed in the paste function).
NB: one problem with this technique is that it "grows" the vector totalstr, which will eventually incur performance penalties (depending on the size of the input data): on every loop iteration, more memory is allocated and the entire string is copied along with the new line of text. There are more verbose ways to side-step this problem (e.g., pre-allocating a vector larger than your anticipated input data), but if you aren't anticipating thousands of lines you may accept this naive approach for simplicity.
Another option would be to have the user save the data to a text file and use file.choose() and readLines() to get your data.
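A quick sketch of that file-based route (file.choose() opens an interactive picker, so the file is whatever the user selects):
lines <- readLines(file.choose())  # one element per spreadsheet row
x <- paste(lines, collapse = " ")  # collapse into a single string for the corpus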
Try collapsing the data into a single string after using readline
x <- paste(readline(as.character('Input text here: ')), collapse=' ')
