I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, what I need to be able to do is to automate this: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v" for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help as your example is not reproducible.
Assuming you're happy with the result in mars.word.v,
you can turn this portion of the code into a function that accepts a single argument,
the result of scan.
processing_section <- function(x){
unlist(strsplit(tolower(x), "\\W"))
}
Then, if all .txt files are in the current working directory, you should be able to list them,
and apply this function with:
lf <- list.files(pattern="\\.txt$")
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
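If you also want the thirty results in one table (the data frame with an entry per file that you describe), a minimal sketch building on the call above:
words.l <- lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
names(words.l) <- lf # keep track of which file each word vector came from
words.df <- data.frame(file = rep(names(words.l), lengths(words.l)),
                       word = unlist(words.l, use.names = FALSE),
                       stringsAsFactors = FALSE)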
Related
I have multiple txt files, each referring to a different month of the year (for many years). How could I analyze these files (text mining) separately from within a single corpus (or something similar), while keeping track of the month-year reference? Thank you.
Here is an example I programmed for Game of Thrones subtitles. The subtitles come as 60 text files, one per episode, named in the form S01E01, where we wanted to keep the episode information.
The following code reads the files into a list and turns it into a dataframe with episode information and text. You will have to adapt it to your own problem.
library(plyr)
####### Read data ######
filenames <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=TRUE)
filenames_short <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=FALSE)
ldf <- alply(.data=filenames,.margins=1,.fun=scan,what = "character", quiet = T, quote = "")
names(ldf) <- filenames_short
# Loop over all filenames
# Turns list into two columns of a dataframe, episode and word
# create empty dataframe
df_got_subs <- as.data.frame(NULL)
for (i in 1:60) {
# extract listname
# vector with list name
listenname <- filenames_short[i]
vec_listenname <- rep.int(listenname,length(ldf[[i]]))
# Doublecheck
cat("listenname: ",listenname,"\n")
# turn list element into vector
vec_subs <- as.vector(ldf[[i]])
# create dataframe from vectors
df_subs <- cbind.data.frame(vec_listenname,vec_subs,stringsAsFactors=FALSE)
# attach to the "big" dataframe
df_got_subs <- rbind.data.frame(df_got_subs,df_subs)
}
# test datastructure
str(df_got_subs)
# change column names
colnames(df_got_subs) <- c("episode","subs")
We did the whole text mining with the tidytext package by Julia Silge. I didn't post that code because she gives much better examples in this post:
http://juliasilge.com/blog/Life-Changing-Magic/
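Just as a rough sketch (the linked post covers this much better), tokenizing df_got_subs with tidytext would look something like:
library(tidytext)
library(dplyr)
# one row per word, keeping the episode column for later grouping and counting
got_words <- df_got_subs %>%
  unnest_tokens(word, subs)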
I hope this helps with your problem.
I want to automate the extraction of certain information from text files using grep, grepl and regexpr. I have code that works when I run it on each individual file, but I cannot get the loop to work to automate the process for all files in my working directory.
I am reading in the txt files as strings because of the structure of the data. The loop seems to iterate through the first file numerous times, corresponding to the number of files in the directory, obviously because of the length(txtfiles) command in the for statement.
txtfiles = list.files(pattern="*.txt")
for (i in 1:length(txtfiles)){
all_data <- readLines(txtfiles[i])
#select hours of operation
hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:",all_data)]
hours_op[i] <-regmatches(hours_op, regexpr("[0-9]{1,9}.[0-9]{1,9}",hours_op))
}
I would be grateful if someone could point me in the right direction to repeat this routine for each file, rather than the same file multiple times over. I want to end up with a list of the file names and the corresponding hours_op.
You need to either add an index ([i]) to every one of your references to hours_op, as in:
for (i in 1:length(txtfiles)){
all_data <- readLines(txtfiles[i])
hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:",all_data)]
hours_op[i] <-regmatches(hours_op[i], regexpr("[0-9]{1,9}.[0-9]{1,9}",hours_op[i]))
}
or better yet, use a temporary variable:
for (i in 1:length(txtfiles)){
all_data <- readLines(txtfiles[i])
temp <- all_data[hours_of_operation <- grep("Annual Hours of Operation:",all_data)]
hours_op[i] <-regmatches(temp, regexpr("[0-9]{1,9}.[0-9]{1,9}",temp))
}
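If you also want the file names next to the extracted values (a sketch only; it assumes hours_op is preallocated and that every file contains exactly one matching line):
hours_op <- character(length(txtfiles)) # preallocate before the loop
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  temp <- all_data[grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(temp, regexpr("[0-9]{1,9}.[0-9]{1,9}", temp))
}
# pair each file name with its extracted value
result <- data.frame(file = txtfiles, hours_op = hours_op, stringsAsFactors = FALSE)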
I have a dataframe data with information on tiffs, including one column txt describing the content of each tiff. Unfortunately, txt is not always correct and we need to correct the entries by hand. Therefore I want to loop over each row in data, show the tiff and ask for feedback, which is then put into data$txt.cor.
setwd(file.choose())
Some test tiffs (with nonsense inside, but to show the idea...):
txt <- sample(100:199, 5)
for (i in 1:length(txt)){
tiff(paste0(i, ".tif"))
plot(txt[i], ylim = c(100, 200))
dev.off()
}
and the dataframe:
pix.files <- list.files(getwd(), pattern = "*.tif", full.names = TRUE)
pix.file.info <- file.info(pix.files)
data <- cbind(txt, pix.file.info)
data$file <- row.names(pix.file.info)
data$txt.cor <- ""
data$txt[5] <- 200 # wrong one
My feedback function (error handling stripped):
read.number <- function(){
n <- readline(prompt = "Enter the value: ")
n <- as.character(n) #Yes, character. Sometimes we have alphanumerical data or leading zeros
}
Now the loop, for which help would be very much appreciated:
for (i in nrow(data)){
file.show(data[i, "file"]) # show the image file
data[i, "txt.cor"] <- read.number() # aks for the feedback and put it back into the dataframe
}
In my very first attempts I was thinking of the plot.lm idea, where you go through the diagnostic plots after pressing return. I suspect that plot and tiffs are not big friends. file.show turned out to be easier. But now I am having a hard time with that loop...
Your problem is that you don't loop over the data, you only evaluate the last row. Simply write 1:nrow(data) to iterate over all rows.
To display your tiff images in R you can use the package rtiff:
library(rtiff)
for (i in 1:nrow(data)){
tif <- readTiff(data[i,"file"]) # read in the tiff data
plot(tif) # plot the image
data[i, "txt.cor"] <- read.number() # aks for the feedback and put it back into the dataframe
}
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that takes three arguments: the location of the files, the name of the column (inside the data frames) whose mean you want to calculate, and the files to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could proceed?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv")[10:25] # here [10:25] ... in production use your function parameter here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles # OPTIONAL: name the list elements (and the resulting row names) after the csv files
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
These code snippets should be easy to adapt and incorporate into your routine.
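To tie this to the id argument of your function instead of hard-coded positions (a sketch, assuming the zero-padded 001.csv naming and that the working directory has already been set as in your function):
csvFiles <- sprintf("%03d.csv", id)   # id = 10:25 -> "010.csv" ... "025.csv"
df_list <- lapply(csvFiles, read.csv, header = TRUE)
df <- do.call("rbind", df_list)
mean(df[, pollutant], na.rm = TRUE)   # pollutant is the column-name argument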
You can aggregate your csv files into one big table like this:
bigtable <- NULL # start empty so the first rbind works
for(i in 100:250)
{
  infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep="")
  newtable <- read.csv(infile)
  newtable <- cbind(newtable, rep(i, dim(newtable)[1])) # if you want to be able to identify tables after they are aggregated
  bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099; you'll have to handle those differently from the others because of the leading zeros, but it's fixable with a little extra treatment.
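For example (a sketch of that small fix), sprintf can zero-pad the index so the same loop covers 001 through 332:
infile <- paste0("C:/Users/cw/Documents/", sprintf("%03d", i), ".csv") # 1 -> "001.csv", 250 -> "250.csv"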
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=TRUE).
They should also teach you to never use <<-.
I am trying to copy and paste tables from R into Excel. Consider the following code from a previous question:
data <- list.files(path=getwd())
n <- length(data)
for (i in 1:n)
{
data1 <- read.csv(data[i])
outline <- data1[,2]
outline <- as.data.frame(table(outline))
print(outline) # this prints all n tables
name <- paste0(i,"X.csv")
write.csv(outline, name)
}
This code writes each table into separate Excel files (i.e. "1X.csv", "2X.csv", etc..). Is there any way of "shifting" each table down some rows instead of rewriting the previous table each time? I have also tried this code:
output <- as.data.frame(output)
wb = loadWorkbook("X.xlsx", create=TRUE)
createSheet(wb, name = "output")
writeWorksheet(wb,output,sheet="output",startRow=1,startCol=1)
writeNamedRegion(wb,output,name="output")
saveWorkbook(wb)
But this does not copy the dataframes exactly into Excel.
I think, as mentioned in the comments, the way to go is to first merge the data frames in R and then write them into (one) output file:
# get vector of filenames
filenames <- list.files(path=getwd())
# for each filename: load file and create outline
outlines <- lapply(filenames, function(filename) {
data <- read.csv(filename)
outline <- data[,2]
outline <- as.data.frame(table(outline))
outline
})
# merge all outlines into one data frame (by appending them row-wise)
outlines.merged <- do.call(rbind, outlines)
# save merged data frame
write.csv(outlines.merged, "all.csv")
Despite what Microsoft would like you to believe, .csv files are not Excel files; they are a common file type that can be read by Excel and many other programs.
The best approach depends on what you really want to do. Do you want all the tables to be read into a single worksheet in Excel? If so, you could write to a single file using the append argument of write.table (write.csv ignores append), or use a connection that you keep open so each new table is appended. You may want to use cat to put a couple of newlines before each new table.
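For instance, a minimal sketch of that single-file approach, reusing data and n from the loop in the question (all_tables.csv is just an example name; write.table warns about appending column names, which is harmless here):
out <- "all_tables.csv"
for (i in 1:n) {
  data1 <- read.csv(data[i])
  outline <- as.data.frame(table(data1[, 2]))
  cat("\n\n", file = out, append = TRUE)  # blank lines between tables
  write.table(outline, file = out, append = TRUE, sep = ",",
              row.names = FALSE, col.names = TRUE)
}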
Your second attempt looks like it uses the XLConnect package (but you don't say, so it could be something else). I would think this the best approach, how is the result different from what you are expecting?