Error extracting nouns in R using KoNLP

I am trying to extract nouns in R using KoNLP. When I run the program, an error appears. I wrote the following code:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
and the error appears like this:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
Backup was just finished!
87007 words were added to dic_user.txt.
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
java.lang.ArrayIndexOutOfBoundsException
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
  a character vector argument expected
Why is this happening? I loaded the 1_2000.csv file, which contains 2000 lines of data. Is this too much data? How do I extract nouns from a large data file? I am using R 3.2.4 with RStudio, and Excel 2016, on Windows 8.1 x64.

The number of lines shouldn't be a problem.
I think that there might be a problem with the encoding. See this post. Your .csv file is encoded as EUC-KR.
I changed the encoding to UTF-8 using
txtUTF <- read.csv(file.choose(), encoding = 'UTF-8')
nouns <- sapply(txtUTF, extractNoun, USE.NAMES = F)
But that results in the following error:
Warning message:
In preprocessing(sentence) : Input must be legitimate character!
So this might be an error with your input. I can't read Korean, so I can't help you further.
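If the file really is EUC-KR, one option is to read the raw lines, convert them to UTF-8 with iconv, and drop anything that failed to convert before calling extractNoun. A minimal sketch, not tested against the original data; the source encoding and file name are taken from the question:
txt <- readLines("1_2000.csv")
txt <- iconv(txt, from = "EUC-KR", to = "UTF-8")  # convert Korean text to UTF-8
txt <- txt[!is.na(txt) & nzchar(txt)]             # drop lines that failed to convert or are empty
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
Lines that iconv cannot convert become NA, and empty strings trigger the "Input must be legitimate character!" warning, so both are filtered out first.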

Related

R: Error opening file (corrupted data) using fread

I ran into an issue opening a large file using fread.
df <- fread("userprofile.csv", encoding = "UTF-8")
In fread("userprofile.csv", encoding = "UTF-8") :
Stopped early on line 342637. Expected 77 fields but found 81. Consider fill=TRUE and comment.char=. First discarded non-empty line:
Edit1:
df <- fread("userprofile.csv", encoding = "UTF-8", fill=TRUE)
This gave me an "R Session Aborted" crash.
Is there an alternative way to open a large file, or to handle this error?
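A hedged sketch of the usual recovery steps for this fread error (whether stray quote characters are the culprit here is an assumption):
library(data.table)
# unbalanced quotes often make fread miscount fields mid-file; quote = "" disables quote handling
df <- fread("userprofile.csv", encoding = "UTF-8", fill = TRUE, quote = "")
# or inspect the reported line directly to see what breaks the field count
readLines("userprofile.csv", n = 342638)[342637:342638]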

How can I solve this R error message relating to atomic vectors?

I am using R in RStudio and running the following code to perform a sentiment analysis on a set of unstructured texts.
Since the texts contain some invalid characters (caused by the use of emoticons and other typos), I want to remove them before proceeding with the analysis.
My R code (extract) is as follows:
setwd("E:/sentiment")
doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# replace specific characters in doc1
doc1<-gsub("[^\x01-\x7F]", "", doc1)
library(tm)
#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))
I get the following error message when I reach the line corpus <- iconv(doc1$Review.Text, to = 'utf-8'):
Error in doc1$Review.Text : $ operator is invalid for atomic vectors
I had a look at the following StackOverflow questions:
remove emoticons in R using tm package
Replace specific characters within strings
I have also tried the following to clean my texts before running the tm package, but I get the same error: doc1 <- iconv(doc1, "latin1", "ASCII", sub="")
How can I solve this issue?
With
doc1 <- gsub("[^\x01-\x7F]", "", doc1)
you overwrite the object doc1; from then on it is no longer a data frame but a character vector. See:
doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)
and it is now clear why
doc1$Species
produces the error.
What you actually want to do is:
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
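Put together, a minimal sketch of the corrected pipeline (the file and column names are taken from the question):
doc1 <- read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# clean only the text column, so doc1 remains a data frame
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
library(tm)
corpus <- Corpus(VectorSource(iconv(doc1$Review.Text, to = 'utf-8')))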

Error saving data for csv using R

When using R, an error appears. I wrote the following code:
txt <- readLines(file("test.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
head(unlist(nouns), 30)
tail(unlist(nouns), 30)
nouns2 <- unlist(nouns)
nouns <- Filter(function(x) {nchar(x) >= 2}, nouns2)
nouns <- gsub("지금", "", nouns)
show <- unlist(lapply(nouns, extractNoun))
showfrq <- data.frame(table(show), stringsAsFactors = F)
aa<-as.matrix(showfrq)
write(aa, "test2.xls")
There is no error when the script runs, but when I look at the csv file, the sheet is divided incorrectly. I was expecting this.
Why is this happening?
I am using R version 3.2.4, Windows 8 x64, and Excel 2015.
Two immediate things you need to do: change your write function to write.csv, and use the file name "test2.csv". It is not necessary to create the aa matrix before writing to .csv.
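A minimal sketch of that suggestion (row.names = FALSE is an assumption, added to keep the first column clean):
showfrq <- data.frame(table(show), stringsAsFactors = F)
write.csv(showfrq, "test2.csv", row.names = FALSE)  # a real csv file that Excel splits into columns correctly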

Error while trying to read .data file in R

I am trying to read the car.data file at https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data using read.table, as below. I have tried various solutions listed earlier, but they did not work. I am using Windows 8 and R version 3.2.3. I can save this file as a .txt file and then read it, but I am not able to read the .data file directly from the URL, or even after saving it, using read.table.
t <- read.table(
"https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
fileEncoding="UTF-16",
sep = ",",
header=F
)
Here is the error I am getting; it results in an empty data frame with a single cell containing "?":
Warning messages:
1: In read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", : invalid input found on input connection 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
2: In read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", :
incomplete final line found by readTableHeader on 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
Please help!
Don't use read.table when the data is not stored in a table. Data at that link is clearly presented in comma-separated format. Use the RCurl package instead and read the data as CSV:
library(RCurl)
x <- getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
y <- read.csv(text = x)
Now y contains your data.
Thanks to cory, here is the solution: use read.csv directly:
x <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
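One caveat worth checking: UCI .data files usually ship without a header row, so read.csv may consume the first record as column names. A hedged variant:
x <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
              header = FALSE)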

R - textcat not executing because of supposed invalid UTF-8 strings

I'm trying to run a seemingly simple task: identifying the languages of a vector of texts using the textcat package. I've cleaned the text data (a sample of tweets) so as to be left with only standard characters. However, when I try to execute the textcat command as follows
text.df$language <- textcat(text.df$text)
I get the following error message:
Error in textcnt(x, n = max(n), split = split, tolower = tolower, marker = marker, :
not a valid UTF-8 string
Despite the fact that the following test
nchar(text.df$text, "c", allowNA=TRUE)
suggested that there are no non-UTF-8 characters in the data.
Does anyone have any ideas? Thanks in advance.
Try iconv on your input text...
text <- "i💙you"
> iconv(text, "UTF8", "ASCII", sub="")
[1] "iyou"
