Can't Inspect Text Corpus in R - r

I am trying to create a Corpus for further analysis. The code below suddenly stopped working and I cannot find a solution for this error. I execute this:
library("tm")
library("SnowballC")
library("wordcloud")
library("arules")
library("arulesViz")
#library("e1071")
#WCZYTAJ_DANE######################################################################
setwd("D:/Dysk Google/Shared/SGGW/MGR_R2/Metody Eksploracji Danych/_PROJEKT")
smSPAM <- read.table("smSPAM.txt", sep="\t", quote="", stringsAsFactors = F)
dim(smSPAM)
colnames(smSPAM) <- c("class", 'text')
head(smSPAM,50)
#zamienia spam ham na 1 0
smSPAM$class=ifelse(smSPAM$class=="ham", "0", "1")
head(smSPAM$text,50)
#View(smSPAM[smSPAM$class=="1",])
#STWORZ_KORPUS#####################################################################
#tworze korpus na potrzeby documenttermmatrix
smSPAM.corp <- Corpus(VectorSource(smSPAM$text))
inspect(smSPAM.corp)
But I get this error in the log:
Error in (function (classes, fdef, mtable):
unable to find an inherited method for function ‘inspect’ for signature ‘"VCorpus"’
However I can still perform stemming, removing white spaces etc. on this Corpus, only inspect doesn't work.

OK, I found what my problem was: both the tm and arules packages contain an inspect function, so I had to detach arulesViz and then arules (in that order, because the latter is needed by the former), and it's working again.
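When both packages need to stay loaded, the clash can also be avoided by calling tm's method with a namespace-qualified name instead of detaching; a minimal sketch using the corpus object from the question:

```r
# Option 1: call tm's method explicitly so arules' generic is bypassed
tm::inspect(smSPAM.corp)

# Option 2: detach the masking packages (arulesViz first, since it depends on arules)
detach("package:arulesViz", unload = TRUE)
detach("package:arules", unload = TRUE)
inspect(smSPAM.corp)  # now dispatches to tm's inspect again
```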

Related

Problem extracting specific transcript sequences from genome

I'm working on Linux in R to get specific transcript sequences from the genome. I am able to import the genome, extract transcript annotations (tx), and get transcript sequences (tx_seq). I am also able to import genomic annotations for the transcripts I'm interested in as GRanges (uorf_annotations) and convert them to transcript coordinates (uorf_tx_annotations).
The libraries I have loaded are
#BiocManager::install(version="3.9")
#BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
#BiocManager::install("S4Vectors")
#BiocManager::install(c("GenomicFeatures", "Biostrings", "GenomicRanges"))
library(broom)
library(BSgenome.Hsapiens.UCSC.hg19)
library(Biostrings)
library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(tidyverse)
library(plyranges)
options(scipen=999)
The chunk of code that works is
uorf_tx_annotations <- mapToTranscripts(uorf_annotations, tx) %>%
print()
I get stuck when I try and get the transcript sequences from uorf_tx_annotations. The code I am using for this step is
uorf_transcript_seq <- extractTranscriptSeqs(genome, uorf_tx_annotations) %>%
print()
and the error I am getting is "Error in extractTranscriptSeqs(genome, uorf_tx_annotations) : failed to extract the exon ranges from 'transcripts' with exonsBy(transcripts, by="tx", ...)"
I use extractTranscriptSeqs earlier on to get tx_seq, so I'm not sure what's wrong here. I thought that everything I needed was in GRanges, but in case uorf_annotations wasn't in the GRanges I tried (inspired from documentation at https://rdrr.io/bioc/GenomicFeatures/man/extractTranscriptSeqs.html)
uorf_tx_annotations <- mapToTranscripts(uorf_annotations, tx) %>%
exonsBy(by = "tx") %>%
print()
but this just introduces a new error: "Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘exonsBy’ for signature ‘"GRanges"’"
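One possible cause, offered as an assumption rather than a verified fix: extractTranscriptSeqs() expects its second argument to be a GRangesList of ranges grouped by transcript, but mapToTranscripts() returns a flat GRanges, and exonsBy() only accepts a TxDb-like object, which would explain the second error. A sketch of grouping the mapped ranges manually, using the object names from the question:

```r
# mapToTranscripts() returns ranges whose seqnames are transcript names;
# split() groups them into the GRangesList that extractTranscriptSeqs() expects.
# Extracting from tx_seq (the transcript sequences) rather than the genome,
# because the ranges are now in transcript coordinates -- an untested assumption.
uorf_grl <- split(uorf_tx_annotations, seqnames(uorf_tx_annotations))
uorf_transcript_seq <- extractTranscriptSeqs(tx_seq, uorf_grl)
```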

How can I solve this R error message relating to atomic vectors?

I am using R in RStudio and I am running the following code to perform a sentiment analysis on a set of unstructured texts.
Since the texts contain some invalid characters (caused by emoticons and other typos), I want to remove them before proceeding with the analysis.
My R code (extract) is as follows:
setwd("E:/sentiment")
doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# replace specific characters in doc1
doc1<-gsub("[^\x01-\x7F]", "", doc1)
library(tm)
#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))
I get the following error message when I reach this line of code corpus<- iconv(doc1$Review.Text, to = 'utf-8'):
Error in doc1$Review.Text : $ operator is invalid for atomic vectors
I had a look at the following StackOverflow questions:
remove emoticons in R using tm package
Replace specific characters within strings
I have also tried the following to clean my texts before running the tm package, but I am getting the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")
How can I solve this issue?
With 
doc1<-gsub("[^\x01-\x7F]", "", doc1)
you overwrite the object doc1; from then on it is no longer a data frame but a character vector; see:
doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)
and it is now clear why
doc1$Species
produces the error.
What you probably want instead is:
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)

Error in if (xn == xx) { : missing value where TRUE/FALSE needed

I'm trying to combine a large number of raster tiles into a single mosaic using the R code below. The error that appears is:
Error in if (xn == xx) { : missing value where TRUE/FALSE needed
The error appears after the for loop.
I would highly appreciate your suggestions.
require(raster)
rasters1 <- list.files("D:/lidar_grid_metrics/ElevMax",
pattern="*.asc$", full.names=TRUE, recursive=TRUE)
rast.list <- list()
for(i in 1:length(rasters1)) { rast.list[i] <- raster(rasters1[i]) }
rast.list$fun <- mean
rast.mosaic <- do.call(mosaic,rast.list)
plot(rast.mosaic)
First, a better way to write what you do (use lapply):
library(raster)
ff <- list.files("D:/lidar_grid_metrics/ElevMax",
pattern="\\.asc$", full.names=TRUE, recursive=TRUE)
rast.list <- lapply(ff, raster)
rast.list$fun <- mean
rast.mosaic <- do.call(mosaic,rast.list)
Now, to the error you get. It is useful to show the results of traceback() after the error occurs. But from the error message, I infer that one of the RasterLayers has an extent with an NA value. That makes it invalid. You can check whether that is true (and if so figure out what is going on) by doing
t(sapply(rast.list, function(i) as.vector(extent(i))))
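Note that at this point rast.list already contains the $fun element added earlier, which would trip up the check. A sketch that restricts the check to the RasterLayer elements and flags tiles with an NA extent:

```r
# keep only the RasterLayer elements (drop the $fun entry added earlier)
layers <- rast.list[sapply(rast.list, inherits, "RasterLayer")]

# one row of (xmin, xmax, ymin, ymax) per tile
ext <- t(sapply(layers, function(i) as.vector(extent(i))))

# rows containing NA point at the problematic tiles
which(apply(ext, 1, anyNA))
```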
EDIT
With the files Ram sent me I figured out what was going on. There was a bug when creating a RasterLayer from an ASCII file with the native driver if the file specifies "xllcenter" rather than "xllcorner".
This is now fixed on the development version (2.9-1) available on github.
The problem can also be avoided by installing rgdal because if rgdal is available, the native driver won't be used.

Error extracting noun in R using KoNLP

I tried to extract nouns in R. When running the program, an error appears. I wrote the following code:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
and the error appears like this:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
Backup was just finished!
87007 words were added to dic_user.txt.
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
java.lang.ArrayIndexOutOfBoundsException Error in
`Encoding<-`(`*tmp*`, value = "UTF-8") : a character vector
argument expected
Why is this happening? I am loading the 1_2000.csv file, which has 2000 lines of data. Is this too much data? How do I extract nouns from a large data file? I use R 3.2.4 with RStudio, and Excel 2016, on Windows 8.1 x64.
The number of lines shouldn't be a problem.
I think that there might be a problem with the encoding. See this post. Your .csv file is encoded as EUC-KR.
I changed the encoding to UTF-8 using
txtUTF <- read.csv(file.choose(), encoding = 'UTF-8')
nouns <- sapply(txtUTF, extractNoun, USE.NAMES = F)
But that results in the following error:
Warning message:
In preprocessing(sentence) : Input must be legitimate character!
So this might be an error with your input. I can't read Korean so can't help you further.
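One more thing worth trying, offered as an assumption building on the EUC-KR diagnosis above: declare the file's on-disk encoding when reading, so R re-encodes it for you. Also note that sapply() over a data frame iterates over columns, not lines, so the text column should be extracted first:

```r
# read the file declaring its actual encoding; R converts it on read
txt <- read.csv("1_2000.csv", fileEncoding = "EUC-KR", stringsAsFactors = FALSE)

# sapply over a data frame would iterate over columns; pull out the text column
nouns <- sapply(txt[[1]], extractNoun, USE.NAMES = FALSE)
```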

Invalid 'path' argument with XLConnect

I am trying and failing to get the following process to complete in R Version 3.1.2:
library(RCurl)
library(XLConnect)
yr <- substr(Sys.Date(), 1, 4)
mo <- as.character(as.numeric(substr(Sys.Date(), 6, 7)) - 1)
temp <- tempfile()
temp <- getForm("http://strikemap.clb.org.hk/strikes/api.v4/export",
FromYear = "2011", FromMonth = "1",
ToYear = yr, ToMonth = mo,
`_lang` = "en")
CLB <- readWorksheetFromFile(temp, sheet=1)
unlink(temp)
I have been able to manually export the requested data set and then read it into R from a local directory using the same readWorksheetFromFile syntax. My goal now is to do the whole thing in R. The call to the API seems to work (thanks to some earlier help), but the process fails at the next step, when I try to ingest the results. Here's what happens:
> CLB <- readWorksheetFromFile(temp, sheet=1)
Error in path.expand(filename) : invalid 'path' argument
Any thoughts on what I'm doing wrong or what's broken?
Turns out the problem didn't lie with XLConnect at all. Based on Hadley's tip that I needed to save the results of my query to the API to a file before reading them back into R, I have managed (almost) to complete the process using the following code:
library(httr)
library(readxl)
yr <- substr(Sys.Date(), 1, 4)
mo <- as.character(as.numeric(substr(Sys.Date(), 6, 7)) - 1)
baseURL <- paste0("http://strikemap.clb.org.hk/strikes/api.v4/export?FromYear=2011&FromMonth=1&ToYear=", yr, "&ToMonth=", mo, "&_lang=en")
queryList <- parse_url(baseURL)
clb <- GET(build_url(queryList), write_disk("clb.temp.xlsx", overwrite=TRUE))
CLB <- read_excel("clb.temp.xlsx")
The object this creates, CLB, includes the desired data with one glitch: the dates in the first column are not being read properly. If I open "clb.temp.xlsx" in Excel, they show up as expected (e.g., 2015-06-30, or 6/30/2015 if I click on the cell). But read_excel() is reading them as numbers that don't map to those dates in an obvious way (e.g., 42185 for 2015-06-30). I tried fixing that by specifying that they were dates in the call to read_excel, but that produced a long string of warnings about expecting dates but getting those numbers.
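Those numbers look like Excel date serials (day counts from Excel's epoch), which can be converted after the fact; alternatively, readxl lets you declare column types up front. A sketch, where the col_types line assumes the date column comes first:

```r
# Excel stores dates as day counts; R's matching origin for them is 1899-12-30
as.Date(42185, origin = "1899-12-30")
# [1] "2015-06-30"

# or declare the first column as a date when reading (remaining columns guessed)
# CLB <- readxl::read_excel("clb.temp.xlsx",
#                           col_types = c("date", rep("guess", 10)))
```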
If I use readWorksheetFromFile() instead of read_excel at that last step, here's what happens:
> CLB <- readWorksheetFromFile("clb.temp.xlsx")
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readWorksheet’ for signature ‘"workbook", "missing"’
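The ‘"workbook", "missing"’ part of that signature suggests the sheet argument was left out of the call; readWorksheetFromFile needs to know which sheet to read, as in the earlier working call:

```r
# supply the sheet (by index or name); omitting it triggers the
# "missing" signature error above
CLB <- readWorksheetFromFile("clb.temp.xlsx", sheet = 1)
```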
I will search for a solution to the problem using read_excel and will create a new question if I can't find one.