I am trying to extract nouns in R, but when I run the program an error appears. I wrote the following code:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
and the error appears like this:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
Backup was just finished!
87007 words were added to dic_user.txt.
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
java.lang.ArrayIndexOutOfBoundsException
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
  a character vector argument expected
Why is this happening? I load the 1_2000.csv file, which has 2000 lines of data. Is this too much data? How do I extract nouns from a large data file like this? I use R 3.2.4 with RStudio, and Excel 2016 on Windows 8.1 x64.
The number of lines shouldn't be a problem.
I think that there might be a problem with the encoding. See this post. Your .csv file is encoded as EUC-KR.
I changed the encoding to UTF-8 using
txtUTF <- read.csv(file.choose(), encoding = 'UTF-8')
nouns <- sapply(txtUTF, extractNoun, USE.NAMES = F)
But that results in the following error:
Warning message:
In preprocessing(sentence) : Input must be legitimate character!
So this might be an error with your input. I can't read Korean so can't help you further.
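If the file really is EUC-KR, a minimal sketch of one way to read it (reusing the file and dictionary setup from the question; the encoding name comes from the answer above) is to declare the encoding on the connection so R converts the text before it reaches extractNoun:
library(KoNLP)
useSejongDic()
# Declare the source encoding on the connection so readLines converts the text
con <- file("1_2000.csv", encoding = "EUC-KR")
txt <- readLines(con)
close(con)
txt <- txt[nchar(txt) > 0]                         # drop empty lines
nouns <- sapply(txt, extractNoun, USE.NAMES = FALSE)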
I ran into an issue opening a large file with fread.
df <- fread("userprofile.csv", encoding = "UTF-8")
In fread("userprofile.csv", encoding = "UTF-8") :
Stopped early on line 342637. Expected 77 fields but found 81. Consider fill=TRUE and comment.char=. First discarded non-empty line:
Edit1:
df <- fread("userprofile.csv", encoding = "UTF-8", fill=TRUE)
This gave me "R Session Aborted".
Is there an alternative way to open a large file, or a way to handle this error?
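One way to investigate, sketched here as an assumption rather than a confirmed fix, is to pull out the lines around line 342637 (the line number from the warning) and see why they contain 81 fields instead of 77:
# Read only the lines around the one fread complained about
bad <- scan("userprofile.csv", what = character(), sep = "\n",
            skip = 342636, n = 3, quiet = TRUE)
cat(bad, sep = "\n")
# If the extra fields come from unbalanced quotes, disabling fread's quote
# handling is sometimes enough (a guess, not a verified fix for this file):
library(data.table)
df <- fread("userprofile.csv", encoding = "UTF-8", quote = "")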
I am using R in RStudio and running the following code to perform a sentiment analysis on a set of unstructured texts.
Since the texts contain some invalid characters (caused by emoticons and other typos), I want to remove them before proceeding with the analysis.
My R code (extract) is as follows:
setwd("E:/sentiment")
doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# replace specific characters in doc1
doc1<-gsub("[^\x01-\x7F]", "", doc1)
library(tm)
#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))
I get the following error message when I reach the line corpus <- iconv(doc1$Review.Text, to = 'utf-8'):
Error in doc1$Review.Text : $ operator is invalid for atomic vectors
I had a look at the following StackOverflow questions:
remove emoticons in R using tm package
Replace specific characters within strings
I have also tried the following to clean my texts before running the tm package, but I am getting the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")
How can I solve this issue?
With
doc1<-gsub("[^\x01-\x7F]", "", doc1)
you overwrite the object doc1; from this point on it is no longer a data frame but a character vector. See:
doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)
and now it is clear why
doc1$Species
produces the error.
What you actually want to do is:
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
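Putting it together, a minimal sketch of the corrected flow (assuming book1.csv and its Review.Text column from the question) would be:
library(tm)
doc1 <- read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# Clean the column, not the whole data frame, so doc1 stays a data frame
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
corpus <- iconv(doc1$Review.Text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))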
When using R, an error appears.
I wrote the following code.
txt <- readLines(file("test.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
head(unlist(nouns), 30)
tail(unlist(nouns), 30)
nouns2 <- unlist(nouns)
nouns <- Filter(function(x) {nchar(x) >= 2}, nouns2)
nouns <- gsub("지금", "", nouns)
show <-unlist(lapply(nouns,extractNoun))
showfrq <- data.frame(table(show), stringsAsFactors = F)
aa<-as.matrix(showfrq)
write(aa, "test2.xls")
There is no error in the script, but when I look at the csv file, the columns are not divided correctly.
I was expecting this.
Why is this happening?
I am using R version 3.2.4, Windows 8 x64, and Excel 2015.
Two immediate things you need to do: change your write function to write.csv, and use the filename "test2.csv".
It is not necessary to create the aa matrix before writing to .csv.
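A minimal sketch of that change, reusing showfrq from the question, would be:
# Write the frequency table directly as CSV; Excel will then split the columns correctly
write.csv(showfrq, "test2.csv", row.names = FALSE)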
I am trying to read the car.data file at this location - https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data - using read.table as below. I tried various solutions listed earlier, but they did not work. I am using Windows 8, R version 3.2.3. I can save this file as a .txt file and then read it, but I am not able to read the .data file directly from the URL, or even after saving it, using read.table.
t <- read.table(
"https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
fileEncoding="UTF-16",
sep = ",",
header=F
)
Here is the error I am getting; it results in an empty data frame with a single cell containing "?":
Warning messages:
1: In read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", : invalid input found on input connection 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
2: In read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", :
incomplete final line found by readTableHeader on 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
Please help!
Don't use read.table when the data is not stored in a table. Data at that link is clearly presented in comma-separated format. Use the RCurl package instead and read the data as CSV:
library(RCurl)
x <- getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
y <- read.csv(text = x)
Now y contains your data.
Thanks to cory, here is the solution - just use read.csv directly:
x <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
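Note that car.data has no header row, so a minimal sketch (with column names taken from the dataset's car.names description, added here as an assumption) would pass them explicitly so the first data row is not consumed as a header:
cars <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
  header = FALSE,
  col.names = c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")
)
str(cars)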
I'm trying to run a seemingly simple task: identifying the languages of a vector of texts using the 'textcat' package. I've cleaned the text data (a sample of tweets) so as to be left with only standard characters. However, when I try to execute the textcat command as follows
text.df$language <- textcat(text.df$text)
I get the following error message:
Error in textcnt(x, n = max(n), split = split, tolower = tolower, marker = marker, :
not a valid UTF-8 string
Despite the fact that the following test
nchar(text.df$text, "c", allowNA=TRUE)
suggested that there are no non-UTF-8 characters in the data.
Does anyone have any ideas? Thanks in advance.
Try iconv on your input text...
text <- "i💙you"
iconv(text, "UTF8", "ASCII", sub="")
[1] "iyou"
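Applied to the question's data, a minimal sketch (assuming text.df$text holds the cleaned tweets) would strip anything iconv cannot map to ASCII before calling textcat:
library(textcat)
# Drop characters that are not valid ASCII (e.g. emoji), then detect languages
text.df$text <- iconv(text.df$text, "UTF-8", "ASCII", sub = "")
text.df$language <- textcat(text.df$text)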