Create vocabulary of corpus from multiple txt files - r

I am playing around with R. I want to create my dictionary from text files. I have two .txt files:
#1.txt
sky,
sun
#2.txt
blue,
bright
To load these 2 files in R, I am doing the following:
library(tm)
txt_files = list.files(pattern = '*.txt');
data = lapply(txt_files, read.table, sep = ",")
# here I get an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 2 elements
In addition: Warning message:
In FUN(c("1.txt", "2.txt")[[1L]], ...) :
incomplete final line found by readTableHeader on '1.txt'
dict <- c(data)
# dict <- c("sky","blue","bright","sun")  # original dictionary; I want to replace this with the method above
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf,dictionary = dict))
I am getting the following error:
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
Can anybody tell me what I am doing wrong?

I don't think you should use read.table for those irregular data files. Why not just use readLines() instead?
txt_files <- list.files(pattern = '*.txt')
data <- lapply(txt_files, readLines)
dict <- gsub(",$", "", unlist(data))  # strip the trailing comma from each term
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(dd,
                          control = list(weighting = weightTfIdf, dictionary = dict))
inspect(dtm)
Note we had to remove the trailing comma ourselves with this method, but that's pretty easy.
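If a dictionary file ever holds several comma-separated terms on one line, a small variation of the same idea (a sketch; the two files above don't strictly need it) splits on commas instead of only stripping the trailing one:
txt_files <- list.files(pattern = '*.txt')
data <- lapply(txt_files, readLines)
# split every line on commas, trim whitespace, and drop empty tokens
dict <- trimws(unlist(strsplit(unlist(data), ",")))
dict <- dict[nzchar(dict)]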

Related

Trying to create a dataframe in R from a directory that contains different types of files, i.e. png, tif, rds

As the question states, I am trying to make a data frame in R from a directory that has different types of files. I have tried this code:
setwd("/working/directory/here")
file_list <- list.files()
# Creating the dataset for all the files in file_list.
for (file in file_list) {
  # if the merged dataset does not exist, create it.
  if (!exists("dataset")) {
    dataset <- read.table(file, header = TRUE, sep = "\t")
  }
  # if the merged dataset does exist, append to it.
  if (exists("dataset")) {
    temp_dataset <- read.table(file, header = TRUE, sep = "\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}
But I end up receiving several different errors and am not sure how to go about fixing them:
Error in match.names(clabs, names(xi)) :
names do not match previous names
In addition: Warning messages:
1: In read.table(file, header = TRUE, sep = "\t") :
line 1 appears to contain embedded nulls
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
3: In read.table(file, header = TRUE, sep = "\t") :
line 1 appears to contain embedded nulls
4: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
5: In read.table(file, header = TRUE, sep = "\t") :
line 2 appears to contain embedded nulls
6: In read.table(file, header = TRUE, sep = "\t") :
line 3 appears to contain embedded nulls
7: In read.table(file, header = TRUE, sep = "\t") :
line 4 appears to contain embedded nulls
8: In read.table(file, header = TRUE, sep = "\t") :
line 5 appears to contain embedded nulls
9: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Code when using rbindlist:
setwd("/srv/shiny-server/magneto/Storage/1880")
file_list_1880 <- list.files()
all_data <- rbindlist(lapply(file_list_1880, fread), fill = TRUE)
all_data
Error:
Error in FUN(X[[i]], ...) :
embedded nul in string: '\xf5\xfd\x9e\x9a\xc0:\xea~\xa1\u07fcV\xfd\xbd\xe4s\xf9\x99\U02e6aead\xdfC\xb6y\x97\xfa\xbd\xa6$g\xa9\xef۩\xf7\xaf>g\xdf\023\xe0\f\xfa:\0p\x97\xfaߛw\xed+\xf5\xf3?\xfb^\xf5sJ99\001\xe0\021\xe6\r\0\0\x85\xfaw\023\xfb-\xafP\xdf\xe7\xa9'
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
Previous fread() session was not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
2: In FUN(X[[i]], ...) :
Detected 3 column names but the data has 2 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
3: In FUN(X[[i]], ...) :
Stopped early on line 26. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: < >>
4: In FUN(X[[i]], ...) :
Detected 3 column names but the data has 5 columns (i.e. invalid file). Added 2 extra default column names at the end.
With your list of files, lapply over it and then use dplyr::bind_rows to form one large data frame.
Here is a small example
data_list = lapply(file_list, function(x) {
  read.table(x, header = TRUE, sep = "\t")  # read each file in turn
})
all_data = dplyr::bind_rows(data_list)
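One caveat worth adding (my note, not part of the answer above): read.table can only parse text, so the png/tif/rds files in the directory are what produce the "embedded nulls" warnings. A sketch that filters the listing down to text files first, assuming the tab-delimited ones carry a .txt extension:
file_list <- list.files(pattern = "\\.txt$")  # skip png/tif/rds, which are binary
data_list <- lapply(file_list, function(x) read.table(x, header = TRUE, sep = "\t"))
all_data <- dplyr::bind_rows(data_list)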

Dealing with NA value in data importing loop in R

I'm importing several hundred files into a single file in order to analyze it afterwards, using:
files.pet <- sort(list.files(pattern = '1998[0-9][0-9][0-9][0-9].pet'), decreasing = FALSE)
all_data.pet <- NA;
for (pet.atual in files.pet) {
  data.atual <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  data.atual <- cbind(data.atual, Desig = pet.atual)
  all_data.pet <- rbind(all_data.pet, data.atual)
}
This runs fine until it hits one file that gives this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 399 did not have 9 elements
That file has an NA value in one of the columns. Is there a way to tell the loop to ignore this and keep importing? Or should I just erase/replace the NA in that row?
Also, while I'm asking, can anyone give me some insight into the meaning of:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
I read around but had little luck understanding what it actually means.
Thanks a lot! (Sorry if the questions are pretty obvious, but I'm new to R.)
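One direction that may help (a sketch, not tested against these particular .pet files): read.table(fill = TRUE) pads short lines with NAs instead of aborting, which should get past "line 399 did not have 9 elements". As for "embedded nul(s) found in input", it usually means R hit a zero byte, i.e. the file is partly binary rather than plain text.
data.atual <- read.table(file = pet.atual,
                         header = FALSE,
                         sep = ",",
                         quote = "\"",
                         comment.char = ";",
                         fill = TRUE)  # pad short rows with NA instead of erroring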

Batch load files to R while excluding bad rows using 'sqldf'

I have a chunk of R code that recursively loads, tidies, and exports all .txt files in a directory (files are tab-delimited but I used read.fwf to drop columns). The code works for .txt files with complete data after 9 lines of unnecessary headers. However, when I expanded the code to the directory with the full set of .txt files (>500), I found that some of the files have bad rows embedded within the data (essentially, automated repeats of a few header lines, sample available here). I have tried just loading all rows, both good and bad, with the intent of removing the bad rows from within R, but get error messages about column numbers.
Original error: Batch load using read.fwf (Note: I only need the first three columns from each .txt file)
setwd("C:/Users/Seth/Documents/testdata")
library(stringr)
filesToProcess <- dir(pattern="*.txt", full.names=T)
listoffiles <- lapply(filesToProcess, function(x) read.fwf (x,skip=9, widths=c(10,20,21), col.names=c("Point",NA,"Location",NA,"Time"), stringsAsFactors=FALSE))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 344 did not have 5 elements #error from bad rows
Next I tried pre-processing the data to exclude the bad rows using 'sqldf'.
Fix attempt 1: Batch pre-process using 'sqldf'
library(sqldf)
listoffiles <- lapply(filesToProcess, function(x) read.csv.sql(x, sep = "\t",
                      skip = 9, field.types = c("Point","Location","Time","V4","V5","V6","V7","V8","V9"),
                      header = F, sql = "select * from file where Point = 'Trackpoint' "))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 9 elements
Fix attempt 2: Single file pre-process using 'sqldf'
test.v1 <- read.csv.sql("C:/Users/Seth/Documents/testdata/test/2008NOV28_MORNING_Hunknown.txt",
                        sep = "\t", skip = 9, field.types = c("Point","Location","Time","V4","V5","V6","V7","V8","V9"),
                        header = F, sql = "select * from file where Point = 'Trackpoint' ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 9 elements
I'd prefer to do this cleanly with something like 'sqldf' or 'dplyr', but am open to pulling in all rows and then post-processing within R. My questions: How do I exclude the bad rows of data during import? Or, how do I get the full data set imported and then remove the bad rows within R?
Here are a few ways. They all make use of the fact that the good lines all contain the degree symbol (octal 260) and junk lines do not. In all of these we have assumed that columns 1 and 3 are to be dropped.
1) This code assumes you have grep but you may need to quote the first argument of grep depending on your shell. (On Windows, to get grep you would need to install Rtools, and under a normal Rtools install grep is found here: C:\Rtools\bin\grep.exe. The Rtools bin directory would have to be placed on your Windows path or else the entire pathname would need to be used when referencing the Rtools grep.) These comments only apply to (1) and (4) as (2) and (3) do not use the system's grep.
File <- "2008NOV28_MORNING_trunc.txt"
library(sqldf)
DF <- read.csv.sql(File, header = FALSE, sep = "\t", eol = "\n",
                   sql = "select V2, V4, V5, V6, V7, V8, V9 from file",
                   filter = "grep [\260] ")
2) You may not need sqldf for this:
DF <- read.table(text = grep("\260", readLines(File), value = TRUE),
                 sep = "\t", as.is = TRUE)[-c(1, 3)]
3) Alternately try the following which is more efficient than (2) but involves specifying the colClasses vector:
colClasses <- c("NULL", NA, "NULL", NA, NA, NA, NA, NA, NA)
DF <- read.table(text = grep("\260", readLines(File), value = TRUE),
                 sep = "\t", as.is = TRUE, colClasses = colClasses)
4) We can also use the system's grep with read.table. The comments in (1) about grep apply here too:
DF <- read.table(pipe(paste("grep [\260]", File)),
                 sep = "\t", as.is = TRUE, colClasses = colClasses)
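To fold this back into the batch workflow from the question, a sketch based on (2) (same degree-symbol assumption) could be:
listoffiles <- lapply(filesToProcess, function(x)
  read.table(text = grep("\260", readLines(x), value = TRUE),
             sep = "\t", as.is = TRUE)[-c(1, 3)])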

R - reading several data tables from one text file

I am new to R and trying to learn how to read the text below. I am using
data <- read.table("myvertices.txt", stringsAsFactors=TRUE, sep=",")
hoping to convey that each "FID..." line should be associated with the comma-separated numbers below it.
The error I get is:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 13 did not have 2 elements
How would I read the following format
FID001:
-120.9633,51.8496
-121.42749,52.293
-121.25453,52.3195
FID002:
-65.4794,47.69011
-65.4797,47.0401
FID003:
-65.849,47.5215
-65.467,47.515
into something like
FID001 -120.9633 51.8496
FID001 -121.42749 52.293
FID001 -121.25453 52.3195
FID002 -65.4794 47.69011
FID002 -65.4797 47.0401
FID003 -65.849 47.5215
FID003 -65.467 47.515
Here is a possible way to achieve this:
data <- read.table("myvertices.txt")              # Read as-is: one field per line.
fid1 <- c(grep("^FID", data$V1), nrow(data) + 1)  # Get the row numbers containing "FID.."
df1 <- diff(x = fid1, lag = 1)                    # Calculate the length+1 rows to read
listdata <- lapply(seq_along(df1),
                   function(n) cbind(FID = data$V1[fid1[n]],
                                     read.table("myvertices.txt",
                                                skip = fid1[n],
                                                nrows = df1[n] - 1,
                                                sep = ",")))
data2 <- do.call(rbind, listdata)                 # Combine all the read tables into a single data frame.
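An alternative sketch (my own variation, reading the file only once): tag each coordinate line with the FID header above it via cumsum, then parse all coordinates in a single read.table call.
lines <- readLines("myvertices.txt")
is_fid <- grepl("^FID", lines)                       # which lines are "FID..." headers
fid <- sub(":$", "", lines[is_fid])[cumsum(is_fid)]  # repeat each FID over its block
data2 <- cbind(FID = fid[!is_fid],
               read.table(text = lines[!is_fid], sep = ","))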

Using formatC to add 0's to a sequence in an argument returns an error in R

Firstly, I'm really new to R here.
I have a function intended to evaluate the values in columns spread over many tables. The function has an argument that lets the user specify which sequence of tables to draw data from. The function seems to work fine, but when the user inputs a single number (over 9) instead of a sequence using the : operator, I get an error message.
The input for this example is 30:
Error in file(file, "rt") : cannot open the connection
5 file(file, "rt")
4 read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
3 FUN("specdata/3e+01.csv"[[1L]], ...)
2 lapply(filepaths, read.csv)
1 pollutantmean("specdata", "nitrate", 30)
In addition: Warning message:
In file(file, "rt") :
cannot open file 'specdata/3e+01.csv': No such file or directory
Below is my code:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  filenames <- paste0(formatC(id, digits = 0, width = 3, flag = "0"), ".csv")
  filepaths <- file.path(directory, filenames)
  list_of_data_frames <- lapply(filepaths, read.csv)
  big.df <- do.call(rbind, list_of_data_frames)
  print(str(big.df))
  mean(big.df[, pollutant], na.rm = TRUE)
}
Many of you probably recognize this as a Coursera assignment. It's past the due date and I've submitted some (not very good) code for it. I just want to have a good understanding of what's going on here, as this type of work is directly related to my research.
You want to make sure you format your values as integers as well:
formatC(id, digits = 0, width = 3, flag = "0", format = "d")
or you could use
sprintf("%03d", id)
inside your paste statement. Both of those should prevent the scientific notation.
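For example, a quick check of both forms (using ids like the assignment's 1:332):
sprintf("%03d", c(1, 30, 332))
# [1] "001" "030" "332"
formatC(c(1, 30, 332), width = 3, flag = "0", format = "d")
# [1] "001" "030" "332"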
