R: Want to do a dictionary check and remove unwanted space in between where removing space will make it a proper word

I am using R for text mining and have data that has been concatenated from different text columns. There are cases where words have been split by a space, like "functi oning". I want to detect all such cases and remove the space in between by doing a dictionary check. I know the splitWords function in aspell; I want a function that does exactly the opposite.

Here is an approach, based on some code I found, but you need to provide some example text and even just pseudo code to help others respond.
First create an object that holds a huge set of correctly spelled words. Then compare each of your words to that set with adist, keeping candidates within a small edit distance (at most 2 here) -- ideally, the only difference is the internal space you would like to remove. I doubt that this will solve everything, but it may help.
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
Then apply the correct function to each word or fragment, for example with sapply.
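A different, simpler idea that targets the split-word case directly can be sketched as follows: glue two neighbouring tokens together whenever the glued form is in the dictionary and at least one of the two pieces is not. This is only a minimal sketch, not the approach from the answer above. The function name merge_split_words and the tiny dictionary are hypothetical and stand in for a real word list such as sorted_words.

```r
# Hypothetical mini-dictionary for illustration; use a large word list in practice.
dictionary <- c("the", "system", "is", "functioning", "well", "function")

# Join neighbouring tokens when the joined form is a dictionary word
# and the two pieces are not both valid words on their own.
merge_split_words <- function(text, dictionary) {
  tokens <- strsplit(text, " +")[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens)) {
      joined <- paste0(tokens[i], tokens[i + 1])
      both_valid <- all(tolower(tokens[i + 0:1]) %in% dictionary)
      if (!both_valid && tolower(joined) %in% dictionary) {
        out <- c(out, joined)
        i <- i + 2   # skip the token that was just merged
        next
      }
    }
    out <- c(out, tokens[i])
    i <- i + 1
  }
  paste(out, collapse = " ")
}

fixed <- merge_split_words("the system is functi oning well", dictionary)
print(fixed)
```

The check on both_valid is a design choice: it avoids merging two legitimate words that happen to form a third one, at the cost of missing splits where both halves are themselves words.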

Related

transform character to a number in R

I loaded a big data set with read_delim(), since it lets me skip the first 4 rows of the data set, which are not important for me. The data set is separated by ";". My problem is the following:
I have some numbers like
-0,000364929204806685
0,00367021351121366
-0,0184237491339445
As you can see, these numbers use a comma as the decimal separator. Therefore, if I set the column type to "numeric" while loading, I get a wrongly parsed value such as -3.649292e+14 for the first number. Thus I have to load the data as characters.
But now I am not able to do numeric calculations; as.numeric() doesn't work.
Is there any way to convert these character values to numeric?
Thanks
Matthias
Thanks, everybody, for the help. It can be solved by using gsub(). For the example above:
as.numeric(gsub(",", ".", Dat[1,12]))
provides:
-0.0003649292
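As a rough illustration, here are two base R ways to handle decimal commas: converting an already loaded character vector, and telling the reader about the decimal mark up front. The sample text, the column names id and value, and the single junk line are made up for this sketch. readr's locale(decimal_mark = ",") is meant for the same purpose with read_delim(), but it is not shown here.

```r
# Character values with a decimal comma, as in the question.
x <- c("-0,000364929204806685", "0,00367021351121366", "-0,0184237491339445")

# Option 1: swap the comma for a dot, then convert.
num <- as.numeric(sub(",", ".", x, fixed = TRUE))

# Option 2: declare the decimal mark while reading (hypothetical sample data).
raw <- paste(
  "junk line to skip",
  "id;value",
  "1;-0,000364929204806685",
  "2;0,00367021351121366",
  sep = "\n"
)
dat <- read.table(text = raw, sep = ";", dec = ",", header = TRUE, skip = 1)

print(num)
str(dat)
```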

Understanding the logic of R code [closed]

I am learning R through tutorials, but I have difficulties in "how to read" R code, which in turn makes it difficult to write R code. For example:
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)
vs
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
While I know what these lines of code do, I cannot read or interpret the logic of each line, whether I read left to right or right to left. What strategies should I use when reading/writing R code?
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Don't let lines of code like this ruin writing R code for you
I'm going to be honest here. The code is bad. And for many reasons.
Not a lot of people can read a line like this and intuitively know what the output is.
The point is you should not write lines of code that you don't understand. This is not Excel; you are not limited to one single line to fit everything in. You have a whole deliciously large script, an empty canvas. Use that space to break your code into smaller bits that make a beautiful mosaic piece of art! Let's dive in~
Dissecting the code: Data Frames
Reading a line of code is like looking at a face for familiar features. You can read left to right, middle to out, whatever -- as long as you can lock onto something that is familiar.
Okay you see data.combined. You know (hope) it has rows and columns... because it's data!
You spot a $ in the code and you know it has to be a data.frame. This is because only lists and data.frames (which are really just lists) allow you to subset columns using $ followed by the column name. Subset, by the way, just means looking at a portion of the whole. In R, subsetting for data.frames and matrices can be done using single brackets [ ], within which you will see [row, column]. Thus if we type data.combined[1,2], it would give you the value in row 1 of column 2.
Now, if you knew that the name of column 2 was name you can use data.combined[1,"name"] to get the same output as data.combined$name[1]. Look back at that code:
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Okay, so now our eyes should be locked on data.combined[SOMETHING IS IN HERE?!] and slowly be picking out data.combined[ ?ROW? , Oh the "name" column]. Cool.
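To make the [row, column] idea concrete, here is a small sketch on a made-up data frame. The contents of data.combined are invented for illustration; only the column name "name" comes from the question.

```r
# A toy stand-in for data.combined (values are invented).
data.combined <- data.frame(
  id = 1:4,
  name = c("Ann", "Bob", "Ann", "Cy"),
  stringsAsFactors = FALSE
)

by_position <- data.combined[1, 2]        # row 1, column 2
by_name     <- data.combined[1, "name"]   # row 1, column called "name"
by_dollar   <- data.combined$name[1]      # column first, then element 1

print(c(by_position, by_name, by_dollar))
```

All three expressions pick out the same cell; they differ only in how the column is addressed.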
Finding those ROW values!
which(duplicated(as.character(data.combined$name)))
Anytime you see the which function, it is just giving you locations. An example: for the numeric vector a = c(1,2,2,1), which(a == 1) would give you 1 and 4, the locations of the 1s in a.
Now duplicated is simple too. duplicated(a) (which is just duplicated(c(1,2,2,1))) will give you back FALSE FALSE TRUE TRUE. If we ran which(duplicated(a)) it would return 3 and 4. Now here is a secret you will learn. If you have TRUEs and FALSEs, you don't need to use the which function! So maybe which was unnecessary here. And so was as.character, since duplicated works on numbers and strings.
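The claims about which and duplicated can be checked directly in the console. The sketch below uses the same vector a as the text; the variable names on the left are just labels for this example.

```r
a <- c(1, 2, 2, 1)

ones_at  <- which(a == 1)          # positions where a equals 1
is_dupe  <- duplicated(a)          # TRUE for repeats of an earlier value
dupes_at <- which(is_dupe)         # positions of those repeats

# Subsetting with the logical vector or with the positions gives the same values.
via_logical   <- a[is_dupe]
via_positions <- a[dupes_at]

print(ones_at)
print(is_dupe)
print(identical(via_logical, via_positions))
```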
What You Should Be Writing
Who am I to tell you how to write code? But here's my take.
Don't mix up ways of subsetting: use EITHER data.frame[,column] or data.frame$column...
The code could have been written a little bit more legibly as:
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]
or equally:
dupes <- duplicated(data.combined[,"name"])
dupe.names <- data.combined[dupes,"name"]
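As a quick sanity check, the original one-liner and the two-step rewrite can be run side by side on a made-up data frame. The data below is invented, and the sketch assumes the name column is already a character vector (with a factor column the two results would differ in type).

```r
# Toy data; only the column name "name" comes from the question.
data.combined <- data.frame(
  name = c("Ann", "Bob", "Ann", "Cy", "Bob"),
  age  = c(30, 41, 30, 25, 41),
  stringsAsFactors = FALSE
)

# Original one-liner.
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])

# Two-step rewrite.
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]

print(dup.names)
print(identical(dup.names, dupe.names))
```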
I know this was lengthy but I hope it helps.
An easier way to read any code is to break it up into its components.
dup.names <-
  as.character(
    data.combined[which(
      duplicated(
        as.character(
          data.combined$name
        )
      )
    ), "name"]
  )
For each of the functions (those parts with round brackets following them, e.g. as.character()) you can learn more about what they do and how they work by typing ?as.character in the console.
Square brackets [] are used to subset data frames, which are stored in your environment (the box to the upper right if you're using R within RStudio contains your values as well as any defined functions). In this case, you can tell that data.combined is the name that has been given to such a data frame in this example (type ?data.frame to find out more about data frames).
"Unwrapping" long lines of code can be daunting at first. Start by breaking it down into parentheses, brackets, and commas. Parentheses directly tacked onto a word indicate a function, and any commas that lie within them (unless they are part of another nested function or bracket) separate the arguments, which modify the way the function behaves. We can reduce your 2nd line to an outer function as.character and its argument:
dup.names <- as.character(argument_1)
Just from this, we know that dup.names will be assigned a value with the data type "character" off of a single argument.
Two functions in the first line, file.path() and dir.create(), contain a comma to denote two arguments. Arguments can be passed either as a bare value (by position) or by name with an equals sign. In this case, the output of file.path serves as argument #1 of dir.create().
file.path(argument_1,argument_2)
dir.create(argument_1,argument_2)
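One way to see this nesting is to evaluate the inner call first and store its result. The sketch below does that inside a temporary directory so nothing is created in the working directory. The use of tempdir() and the variable names base and target are additions for this example.

```r
# Work inside R's per-session temporary directory.
base <- tempdir()

# Inner call first: file.path() only builds a path string.
target <- file.path(base, "testdir2", "testdir3")

# Outer call second: the path is argument 1, recursive = TRUE is a named argument.
dir.create(target, recursive = TRUE)

print(target)
print(dir.exists(target))
```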
Brackets are a way of subsetting data frames, with the general notation of dataframe_object[row,column]. Within your second line is a dataframe object, data.combined. The brackets with a comma directly tacked onto it suggest that it is a dataframe object, and knowing this allows you to see that any functions inside the brackets are contributing to subsetting this data frame.
data.combined[row, column]
So from there, we can see that the internal functions within this bracket will produce an output that specifies the rows of data.combined that will contribute to the subset, and that only the column named "name" will be selected.
Use the help function to start to unpack these lines by discovering what each function does, and what its arguments are.

Convert characters or symbols to existing variables in R

I'm using R to compute the best fit of a sequence of initializations, and I named them Initialization1, Initialization2, etc. I picked the best fit as the one with the largest result_probs value. Now I want to use that one again, say Initialization1, the one with the best property.
best_fit <- paste("Initialization", which.max(results_probObs), sep = "")
best_estimated <- somefunction(best_fit, string1)
However, best_fit here is a character and can't be used as the existing Initialization1 (which is a list). I've tried as.name() too. It gave me a symbol and couldn't be used as a list as well.
Thank you very much for helping.
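One possible way to go from the character string to the object is base R's get(), which looks up a variable by its name. The sketch below uses made-up contents for Initialization1 and Initialization2 and made-up values for results_probObs. The list called inits is a hypothetical alternative that avoids name lookup altogether.

```r
# Made-up stand-ins for the objects in the question.
Initialization1 <- list(loglik = -120.5)
Initialization2 <- list(loglik = -98.1)
results_probObs <- c(0.2, 0.9)

# Build the name as in the question, then fetch the object with that name.
best_fit <- paste("Initialization", which.max(results_probObs), sep = "")
best_object <- get(best_fit)

# Alternative: keep the initializations in one list and index it directly.
inits <- list(Initialization1, Initialization2)
best_from_list <- inits[[which.max(results_probObs)]]

print(best_fit)
str(best_object)
```

Storing the candidates in a list is usually easier to maintain than numbered variable names, since the index from which.max() can be used as is.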

grep or gsub for everything except a specific string in R

I'm trying to match everything except a specific string in R, and I've seen a bunch of posts on this suggesting a negative lookaround, but I haven't gotten that to work.
I have a dataset looking at crime incidents in SF, and I want to sort cases that have a resolution or do not. In the resolution field, cases have things listed like arrest booked, arrest cited, juvenile booked, etc., or none. I want to relabel all the specific resolutions like the different arrests to "RESOLVED" and keep the instances with "NONE" as such. So, I thought I could gsub or grep for not "NONE".
Based on what I've read on finding all strings except one specific string, I would have thought this would work:
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, fixed=TRUE)
Where I make a vector that searches through my training dataset, specifically the resolution column, and finds the terms that aren't "NONE". But, I just get an empty vector.
Does anyone have suggestions, or know why this might not be working in R? Or, even if there was a way to just use gsub, how do I say "not NONE" for my regex in R?
trainData$Resolution = gsub("!NONE", RESOLVED, trainData$Resolution) << what's the way to negate the string here?
Based on your explanation, it seems as though you don't need regular expressions (i.e. gsub()) at all. You can use != since you are looking for all non-matches of an exact string. Perhaps you want
within(trainData, {
    ## next line only necessary if you have a factor column
    Resolution <- as.character(Resolution)
    Resolution[Resolution != "NONE"] <- "RESOLVED"
})
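resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, perl=TRUE)
You need to use the option perl=TRUE, because lookaheads such as (?!...) are only supported by the Perl-style regex engine. You also need to drop fixed=TRUE: it makes grep treat the pattern as a literal string, and perl=TRUE is ignored when both are set.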
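Both suggestions can be tried on a small made-up vector. The values in resolution are invented stand-ins for trainData$Resolution, and sub() is used here instead of gsub() because each string needs at most one replacement.

```r
# Made-up stand-in for trainData$Resolution.
resolution <- c("ARREST, BOOKED", "NONE", "JUVENILE BOOKED", "NONE")

# Positions of everything that is not exactly "NONE" (negative lookahead, PCRE).
not_none <- grep("^(?!NONE$).*", resolution, perl = TRUE)

# Relabel with the same pattern ...
relabelled_regex <- sub("^(?!NONE$).*", "RESOLVED", resolution, perl = TRUE)

# ... or without any regex, using a plain comparison.
relabelled_plain <- resolution
relabelled_plain[relabelled_plain != "NONE"] <- "RESOLVED"

print(not_none)
print(relabelled_regex)
```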

Add textual property to nodes in igraph with R

I have an issue adding properties to nodes in igraph when working with R. I made a text list named journal.txt and I want to give the nodes of my graph a property from it. With other textual or numeric lists I had absolutely no issues, but with this one I do.
With the code below I read the txt file, take just the first column (there is only one), and convert it to character. I also tried without the conversion, and it doesn't work either.
journalList = read.csv("c:/temp/biblioCoupling/journals.txt", header=FALSE)
journalLR = (journalList[1:303,1])
journalLR = as.character(journalLR)
V(g)$journalName = journalLR
Then, when I save the file,
write.graph(gr,"filename.gml",format=c("gml"), creator="Claudio Biscaro")
I see all the other properties I added to the nodes, but not this one!
Could it be because some entries in journalLR are more than 15 characters long?
I have absolutely no idea why I can't do that.
Your code is not reproducible, so it is impossible to tell for sure, but I guess that V(g)$journalName is a complex attribute, i.e. it is not a vector of values, but a list of values.
To check, you can do str(g) and then look at the code letter after the journalName attribute. If it is x, then it is complex, if it is c, then it is character.
If this is the problem and you don't really need a list, then the workaround is to do
g <- remove.vertex.attribute(g, "journalName")
V(g)$journalName <- journalLR
Solved by adding the values one at a time. That was weird, after a long time trying!
for (i in seq_along(journalLR))
{
    V(g)[i]$journalName = journalLR[i]
}
It is probably not a formally good solution, but it works!
