Escape single quote in R variables - r

I have a table with names in one column. I have an R script that reads this table and then calls write.table to write it to a CSV file for further processing. The script barfs when writing my table if it encounters a name with an apostrophe (single quote) character, such as "O'Reilly", in the matrix:
library(RCurl)
library(RJSONIO)
dir <- "C:/Users/rob/Data"
setwd(dir)
filename <- "employees.csv"
url <- "https://obscured/employees.html"
html <- getURL(url, ssl.verifypeer = FALSE)
initdata <- gsub("^.*?emp.allEployeeData = (.*?);.*", "\\1", html)
initdata <- gsub("'", '"', initdata)
data <- fromJSON( initdata )
table <- list()
for(i in seq_along(data))
{
job <- data[[i]][[1]]
name <- data[[i]][[2]]
age <- data[[i]][[6]]
sex <- data[[i]][[7]]
m <- matrix(nrow = 1, ncol = 4)
colnames(m) <- c("job", "name", "age", "sex")
m[1, ] <- c(job, name, age, sex)
table[[i]] <- as.data.frame(m)
write.table(table[[i]],file = filename,append = TRUE,sep = ",",col.names = FALSE,row.names = FALSE)
}
When I encounter O'Reilly, the error I am receiving is:
Error in m[1, ] <- c(job, name, age, sex) :
number of items to replace is not a multiple of replacement length
I end up with a csv file that includes data for all employees before O'Reilly is encountered. My Googling revealed people trying to add quotes to strings or parse strings already containing escape characters.
Is there a way to escape or remove single quotes inside my data?

I was replacing single quotes with double quotes in line 11, which I don't need to do for this data set. So it wasn't the single quote in a name that was messing things up; it was replacing that single quote with a double quote, which presumably left a stray double quote inside the string for the name.
Removed this line:
initdata <- gsub("'", '"', initdata)
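A minimal sketch of why that line hurts, using a made-up record shaped like the data in the question (the values and the `count_dq` helper are assumptions for illustration, not the real page content). It shows that the replacement leaves an unbalanced double quote, and that write.table copes with an apostrophe on its own:

```r
# Hypothetical JSON-like row; the real page content is not shown in the question
json_ok <- '["clerk", "O\'Reilly", 41, "M"]'
json_broken <- gsub("'", '"', json_ok)

# Helper (made up for this sketch): count the double quotes in a string
count_dq <- function(x) lengths(regmatches(x, gregexpr('"', x, fixed = TRUE)))
n_ok <- count_dq(json_ok)          # balanced: every string has two quotes
n_broken <- count_dq(json_broken)  # one extra quote in the middle of the name

# write.table needs no escaping for an apostrophe inside a value
emp <- data.frame(job = "clerk", name = "O'Reilly", age = 41, sex = "M",
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv")
write.table(emp, file = out, sep = ",", col.names = FALSE, row.names = FALSE)
back <- read.csv(out, header = FALSE, stringsAsFactors = FALSE)
```

With an odd number of double quotes, a JSON parser can no longer tell where the name ends, so the parsed record may not have the fields the loop expects.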

Related

How to convert .txt to .csv file in R

I found this code here, and it converts the '.txt' file to '.csv', but the output is not broken into columns. I'm pretty sure there's an easy fix or a line to add, but I'm not finding it. I'm still new to R and working through it, so any help or direction is appreciated.
EDIT: The file contains the following, a list of invasive plants:
Header: Noxious Weed List.
'(a) Abrus precatorius – rosary pea '
'(b) Aeginetia spp. – aeginetia'
'(c) Ageratina adenophora – crofton weed '
'(d) Alectra spp. – alectra '
So I would like to get each part (genus, species, and common name) in a separate column and, if possible, delete the leading letters such as '(a)' and the separating ' – ' dash.
filelist <- list.files(pattern = "\\.txt$") # only names that end in .txt
for (input in filelist) {
  output <- sub("\\.txt$", ".csv", input)
  print(paste("Processing the file:", input))
  data <- read.delim(input, header = TRUE) # read.delim splits on tabs by default
  write.table(data, file = output, sep = ",", col.names = TRUE, row.names = FALSE)
}
You'll need to adjust if you have common names with three or more words, but this is the general idea:
path <- "C:\\Your File Path Here\\"
file <- paste0(path, "WeedList.txt")
DT <- read.delim(file, header = FALSE, sep = " ")
DT <- DT[-c(1),-c(1,4,7)]
colnames(DT) <- c("Genus", "Species", "CommonName", "CommonName2")
DT$CommonName <- gsub("'", "", DT$CommonName)
DT$CommonName2 <- gsub("'", "", DT$CommonName2)
DT$CommonName <- paste(DT$CommonName, DT$CommonName2, sep = " ")
DT <- DT[,-c(4)]
write.csv(DT, paste0(path, "WeedList.csv"), row.names = FALSE)
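An alternative sketch that does not depend on how many words the common name has: strip the quotes and the "(a)" label, then split each line at the dash. The sample lines are copied from the question; the assumption is that every line has exactly one dash surrounded by spaces and that the genus is the first word of the scientific name.

```r
# Sample lines in the shape shown in the question ("\u2013" is the en dash)
weeds <- c("'(a) Abrus precatorius \u2013 rosary pea '",
           "'(b) Aeginetia spp. \u2013 aeginetia'",
           "'(c) Ageratina adenophora \u2013 crofton weed '")

x <- trimws(gsub("'", "", weeds))           # drop quotes and outer spaces
x <- sub("^\\([a-z]+\\)\\s*", "", x)         # drop the leading "(a) " label
parts <- strsplit(x, "\\s+(\u2013|-)\\s+")   # split at the dash

# First piece is the scientific name, second piece is the common name
scientific <- vapply(parts, `[`, character(1), 1)
DT <- data.frame(
  Genus      = sub("\\s.*$", "", scientific),     # first word
  Species    = sub("^\\S+\\s+", "", scientific),  # everything after it
  CommonName = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
```

For a real file, `weeds` could come from `readLines()` on the text file with the header line dropped, and `DT` can be passed to write.csv as above.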

Error in aggregate.data.frame(as.data.frame(x), ...) : arguments must have same length

Hi I'm working with the last example in this tutorial: Topics proportions over time.
https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html
I run it for my data with this code
library(readxl)
library(tm)
# Import text data
tweets <- read_xlsx("C:/R/data.xlsx")
textdata <- tweets$text
#Load in the library 'stringr' so we can use the str_replace_all function.
library('stringr')
#Remove URL's
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")
textdata <- gsub("@\\w+", " ", textdata) # Remove user names (all proper names if you're wise!)
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata) # Remove hashtags
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new <- datatm[rowTotals> 0, ]
library("ldatuning")
library("topicmodels")
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
#####################################################
#topics by year
tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames
# reshape data frame
library(reshape2) # provides melt()
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here is the Excel file with the input data:
https://www.mediafire.com/file/4w2hkgzzzaaax88/data.xlsx/file
I get the error when I run the line with the aggregate function, and I can't figure out what is going wrong. I created the "decade" variable the same way as in the tutorial; when I print it, it looks OK, and the theta variable also looks OK. I changed the aggregate call several times, following for example this post:
Error in aggregate.data.frame : arguments must have same length
But I still get the same error. Please help.
I am not sure what you want to achieve with the command
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
As far as I see you produce only one decade with
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
table(tweets$decade)
2010
3481
With all the preprocessing from tweets to textdata you're producing a few empty lines. This is where your problem starts.
Textdata with its new empty lines is the basis of your corpus and your dtm. You get rid of them with the lines:
ui = unique(dtm$i)
dtm.new = dtm[ui,]
At the same time you're basically deleting the empty rows (documents) in the dtm, thereby changing the length of your object. This new dtm without the empty rows is then your new basis for the topic model. This comes back to haunt you when you try to use aggregate() with two objects of different lengths: tweets$decade, which still has the old length of 3481, and theta, which is produced by the topic model, which in turn is based on dtm.new (remember, the one with fewer rows).
What I would suggest is to first add an ID column to tweets. Later on you can use the IDs to find out which texts were deleted by your preprocessing and match the lengths of tweets$decade and theta.
I rewrote your code -- try this out:
library(readxl)
library(tm)
# Import text data
tweets <- read_xlsx("data.xlsx")
## Include ID for later
tweets$ID <- 1:nrow(tweets)
textdata <- tweets$text
#Load in the library 'stringr' so we can use the str_replace_all function.
library('stringr')
#Remove URL's
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")
textdata <- gsub("@\\w+", " ", textdata) # Remove user names (all proper names if you're wise!)
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata) # Remove hashtags
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new <- datatm[rowTotals> 0, ]
library("ldatuning")
library("topicmodels")
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
#####################################################
#topics by year
tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)
id <- data.frame(ID = dtm.new$dimnames$Docs)
colnames(id) <- "ID"
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
tweets_new <- merge(id, tweets, by.x="ID", by.y = "ID", all.x = T)
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets_new$decade), mean)
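The length mismatch can be reproduced on toy data. The sketch below uses made-up values and assumes that the row names of theta carry the IDs of the documents that survived preprocessing (worth checking on your own object). It also lines the decades up by index rather than with merge(), since merge() may reorder rows and the decades would then no longer match the rows of theta.

```r
# Toy stand-ins: 5 tweets, of which documents 2 and 4 ended up empty
tweets <- data.frame(ID = 1:5,
                     decade = c("2000", "2000", "2010", "2010", "2010"),
                     stringsAsFactors = FALSE)
kept_docs <- c("1", "3", "5")  # plays the role of dtm.new$dimnames$Docs
theta <- matrix(c(0.2, 0.8,
                  0.6, 0.4,
                  0.5, 0.5),
                ncol = 2, byrow = TRUE,
                dimnames = list(kept_docs, c("topic1", "topic2")))

# 3 rows in theta versus 5 decades: this reproduces the error message
msg <- tryCatch(
  aggregate(theta, by = list(decade = tweets$decade), mean),
  error = function(e) conditionMessage(e)
)

# Look up the decade of each surviving document, in the row order of theta
decade_kept <- tweets$decade[match(as.integer(rownames(theta)), tweets$ID)]
agg <- aggregate(theta, by = list(decade = decade_kept), mean)
```

Here `agg` has one row per decade, and each topic column holds the mean proportion over the documents that were actually modelled.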

How to create a for loop to open, mutate and save .csv files using R?

I have several .csv files that have to be reformatted and saved again using an R script.
The function that makes the changes and reformats the files already exists and works perfectly fine. But as there are always lots of documents to change, I would like to have a for loop so that I don't have to adapt my code for every single document. Unfortunately, I have no experience with loops in R so far.
My code looks like this at the moment:
setwd("C:/users/Desktop/Raw/.")
df <- read.csv("A1.csv", sep= ",")
new_df <- wrap_frame(df, nr = 61, rownames = "", unique_names = FALSE)
write.csv(new_df, "C:/users/Desktop/Data/A1.csv", row.names = FALSE)
The original .csv files are always named the same way: a letter (A to Z) followed by a number from 1 to 12. The number of .csv files to change may vary, but their names always follow these rules.
I would be very grateful, if somebody could help me with this issue!
You can get a vector with all filenames that exist in your folder (assuming this folder contains no files other than those you want to edit) with
setwd( "C:/users/Desktop/Raw/" )
files <- Sys.glob( "*.csv" )
and then process them one by one with
for( i in files )
{
df <- read.csv( i )
new_df <- wrap_frame(df, nr = 61, rownames = "", unique_names = FALSE)
write.csv(new_df, paste( "C:/users/Desktop/Data/", i, sep = "" ), row.names = FALSE)
}
Try out:
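# vector of candidate file names: A1.csv ... Z12.csv
my.files <- paste0(c(outer(LETTERS, 1:12, FUN = "paste0")), ".csv")
# keep only the files that are actually present in the working directory
my.files <- my.files[file.exists(my.files)]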
# for loop
for (i in seq_along(my.files)) {
df <- read.csv(my.files[i], sep= ",") # open
new_df <- wrap_frame(df, nr = 61, rownames = "", unique_names = FALSE) # mutate
write.csv(new_df, paste0("C:/users/Desktop/Data/", my.files[i]),
row.names = FALSE) # save
}
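Both answers can be combined into a sketch that avoids setwd() and only picks up files matching the naming rule. Everything here is illustrative: `reformat_stub` is a hypothetical stand-in for the asker's wrap_frame(), `process_folder` is a made-up helper name, and the folders are throwaway temporary directories.

```r
# Hypothetical stand-in for wrap_frame(); it returns its input unchanged
reformat_stub <- function(df) df

process_folder <- function(in_dir, out_dir, reformat = reformat_stub) {
  # Keep only names like A1.csv ... Z12.csv
  files <- list.files(in_dir, pattern = "^[A-Z]([1-9]|1[0-2])\\.csv$")
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    df <- read.csv(file.path(in_dir, f))                                # open
    write.csv(reformat(df), file.path(out_dir, f), row.names = FALSE)  # mutate and save
  }
  invisible(files)
}

# Demo on throwaway folders
in_dir <- tempfile("Raw")
out_dir <- tempfile("Data")
dir.create(in_dir)
write.csv(data.frame(a = 1:2), file.path(in_dir, "A1.csv"), row.names = FALSE)
write.csv(data.frame(a = 3:4), file.path(in_dir, "B12.csv"), row.names = FALSE)
write.csv(data.frame(a = 5:6), file.path(in_dir, "notes.csv"), row.names = FALSE)
done <- process_folder(in_dir, out_dir)
```

In real use, the call would look something like `process_folder("C:/users/Desktop/Raw", "C:/users/Desktop/Data", function(df) wrap_frame(df, nr = 61, rownames = "", unique_names = FALSE))`.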

rbind txt files from online directory (R)

I am trying to concatenate text files from a URL, but I don't know how to deal with the HTML and the different folders.
This is the code I tried, but it only lists the text files and returns a lot of HTML code. How do I fix this so that I can combine the text files into one CSV file?
library(RCurl)
library(readr) # provides read_delim()
url <- "http://weather.ggy.uga.edu/data/daily/"
dir <- getURL(url, dirlistonly = TRUE)
filenames <- unlist(strsplit(dir, "\n")) # split into filenames
# append the files one after another
for (i in seq_along(filenames)) {
  file <- paste0(url, filenames[i]) # concatenate for url
  if (i == 1) {
    cp <- read_delim(file, col_names = FALSE, delim = ",")
  } else {
    temp <- read_delim(file, col_names = FALSE, delim = ",")
    cp <- rbind(cp, temp) # append to existing data
    rm(temp) # remove the temporary object
  }
}
Here is a code snippet that I got to work for me. I like to use rvest over RCurl, just because that's what I've learned. In this case, I was able to use the html_nodes function to isolate each link ending in .txt. The resulting table has the times saved as character strings, but you could fix that later. Let me know if you have any questions.
library(rvest)
library(readr)
url <- "http://weather.ggy.uga.edu/data/daily/"
doc <- xml2::read_html(url)
text <- rvest::html_text(rvest::html_nodes(doc, "tr td a:contains('.txt')"))
# define column types of fwf data ("c" = character, "n" = number)
ctypes <- paste0("c", paste0(rep("n",11), collapse = ""))
data <- data.frame()
for (i in 1:2){ # first two files as a demo; use seq_along(text) to read them all
file <- paste0(url, text[i])
date <- as.Date(read_lines(file, n_max = 1), "%m/%d/%y")
# Read file to determine widths
columns <- fwf_empty(file, skip = 3)
# Manually expand `solar` column to be 3 spaces wider
columns$begin[8] <- columns$begin[8] - 3
data <- rbind(data, cbind(date,read_fwf(file, columns,
skip = 3, col_types = ctypes)))
}
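The same idea can be sketched offline in base R: pull the .txt links out of the listing, read each file into a list, and bind once at the end instead of growing the data frame inside the loop. The HTML string, the file names, and the comma-separated contents below are all made up for illustration; the real files on that server are fixed-width, as handled above.

```r
# Hypothetical directory listing; a real page would come from getURL() or read_html()
listing <- '<tr><td><a href="d001.txt">d001.txt</a></td></tr>
<tr><td><a href="d002.txt">d002.txt</a></td></tr>
<tr><td><a href="?C=N;O=D">Name</a></td></tr>'

# Pull out the href values that end in .txt
hrefs <- regmatches(listing, gregexpr('href="[^"]+\\.txt"', listing))[[1]]
txt_files <- gsub('^href="|"$', "", hrefs)

# Stand-in files on disk so the sketch runs without network access
base_dir <- tempfile("daily")
dir.create(base_dir)
writeLines(c("1,10", "2,20"), file.path(base_dir, "d001.txt"))
writeLines("3,30", file.path(base_dir, "d002.txt"))

# Read every file into a list, tag each row with its source, bind once
pieces <- lapply(txt_files, function(f) {
  df <- read.csv(file.path(base_dir, f), header = FALSE)
  df$source <- f
  df
})
combined <- do.call(rbind, pieces)
write.csv(combined, file.path(base_dir, "combined.csv"), row.names = FALSE)
```

Against the real site, `file.path(base_dir, f)` would be replaced by `paste0(url, f)` and read.csv by the read_fwf call from the answer.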

How can I write out multiple files with different filenames in R

I have one BIG file (>10000 lines of data) and I want to write out a separate file per ID. I have 50 unique ID names and I want a separate text file for each one. Here's what I've got so far, and I keep getting errors. My ID is actually a character string, and ideally each file would be named after that string.
for (i in 1:car$ID) {
a <- data.frame(car[,i])
carib <- car1[,(c("x","y","time","sd"))]
myfile <- gsub("( )", "", paste("C:/bridge", carib, "_", i, ".txt"))
write.table(a, file=myfile,
sep="", row.names=F, col.names=T quote=FALSE, append=FALSE)
}
One approach would be to use the plyr package and the d_ply() function. d_ply() expects a data.frame as an input. You also provide the column(s) by which you want to slice and dice that data.frame, so that each piece is operated on independently of the others. In this case, you have the column ID. This specific function does not return an object, and is thus useful for plotting, making charts iteratively, etc. Here's a small working example:
library(plyr)
dat <- data.frame(ID = rep(letters[1:3],2) , x = rnorm(6), y = rnorm(6))
d_ply(dat, "ID", function(x)
write.table(x, file = paste(x$ID[1], "txt", sep = "."), sep = "\t", row.names = FALSE))
This will generate three tab-separated files named after the values in the ID column (a.txt, b.txt, c.txt).
EDIT - to address follow up question
You could always subset the columns you want before passing it into d_ply(). Alternatively, you can use/abuse the [ operator and select the columns you want within the call itself:
dat <- data.frame(ID = rep(letters[1:3],2) , x = rnorm(6), y = rnorm(6)
, foo = rnorm(6))
d_ply(dat, "ID", function(x)
write.table(x[, c("x", "foo")], file = paste(x$ID[1], "txt", sep = ".")
, sep = "\t", row.names = FALSE))
For the data frame called mtcars separated by mtcars$cyl:
lapply(split(mtcars, mtcars$cyl),
function(x)write.table(x, file = paste(x$cyl[1], ".txt", sep = "")))
This produces "4.txt", "6.txt", "8.txt" with the corresponding data. This should be faster than looping/subsetting since the subsetting (splitting) is vectorized.
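For the original question, where the ID is a character string and only some columns should be written, the split() idea can be sketched as follows. The column names x, y, time and sd come from the question; the data values, the "bridge_" file prefix, and the temporary output folder are assumptions.

```r
# Made-up data with a character ID column
car <- data.frame(ID = c("alpha", "beta", "alpha", "gamma", "beta"),
                  x = 1:5, y = 6:10, time = 11:15,
                  sd = c(0.1, 0.2, 0.3, 0.4, 0.5),
                  other = letters[1:5],
                  stringsAsFactors = FALSE)

out_dir <- tempfile("bridge")
dir.create(out_dir)

# One data frame per ID, keeping only the columns of interest
groups <- split(car[, c("x", "y", "time", "sd")], car$ID)

# The list names are the IDs, so they can be used directly in the file names
for (id in names(groups)) {
  write.table(groups[[id]],
              file = file.path(out_dir, paste0("bridge_", id, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
written <- list.files(out_dir)
```

Looping over `names(groups)` rather than over row numbers is what lets each file be named after its ID.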

Resources