I am trying to do text mining in Chinese with R.
In my data set, I have a column with people's comment like "连锁店购买的". And I have 2 other columns that I created thanks to JiebaR. These hold the segmented message ("连锁店", "购买", "的") and the keywords from these messages ("连锁店", "购买"). The keyword selection removes "不"("no" in Chinese) so I am trying to fetch it back from the words and add it to the keywords. Simple, right ?
To have a clean code, I put all my functions in a separate file and source it in my main file. And NOW something VERY weird happens : the function works when it's in the main file but doesn't work when it's in the file that I source ! (I just copied and pasted the function from my main to the "function" file and run the source(...) line...).
fetchingNeg <- function(df){
for (i in 1:nrow(df)){
if ("不" %in% unlist(df[i,]$words)){
df[i,]$keywords <- list(append(unlist(df[i,]$keywords),"不"))
}
}
return(df)
}
So I found the error : Encoding !
There was a character c that I knew was "不" but when I was doing print("不" == c) it would give FALSE... "不" is not encoded in UTF-8 in this case, so to make my code work I had to change it to
fetchingNeg <- function(df){
for (i in 1:nrow(df)){
# "不" is "\u{4e0d}" in UTF-8
if ("\u{4e0d}" %in% unlist(df[i,]$words)){
df[i,]$keywords <- list(append(unlist(df[i,]$keywords),"\u{4e0d}"))
}
}
return(df)
}
Related
There are several code snippets that are invaluable to my workflow and play nicely with functions in my custom R package. Can I include these code snippets in my R package so that they are added to users' code snippets (with permissions of course) when they install my package?
Rmd snippet example that creates a sql chunk:
snippet sql
```{sql, connection = conn, output.var = "${1:df}"}
${2}
```
Short answer: Yes
One way to achieve what you want (that works for my package) is:
Store the packages snippet definitions in two text files somewhere in the packages inst/ directory. It's important that the snippets follow exactly the formatting rules (e.g. tabs at the start of the lines, not spaces). I have one file for R code snippets and one for markdown.
Define a function that reads these files and copies their content into RStudios user snippets files. These files are generated at the first attempt to edit the snippets (Tools -> Global Options -> Code -> Edit Snippets) (I think RStudio uses an other, not user exposed file before one tries to edit, not sure though). On ubuntu the RStudio files are called 'r.snippets' and 'markdown.snippets' and are in '~/.R/snippets/'. I also check if the snipped definition already exists, and double check the tabs at the start of the lines before using cat(..., append=TRUE) to add the packages snippet definitions.
I first used an elaborate .onLoad function with configs and all but now I just export a addPackageSnippets function ;)
Edit
Some code:
Part that checks for already existing snippet definitons:
I just read the rstudio file and extract the lines starting with 'snippet'. I do the same for the packages snipptes definition file and use setdiff (one might want to also use trimws on the lists, just in case there is some trailing white-space)
# load package snippets definitions
#
pckgSnippetsFileContent <- readLines(pckgSnippetsFilesPath)
# Extract names of package snippets
#
pckgSnippetsFileDefinitions <- pckgSnippetsFileContent[grepl("^snippet (.*)", pckgSnippetsFileContent)]
# Extract 'names' of already existing snitppets
#
rstudioSnippetsFileContent <- readLines(rstudioSnippetsFilePath)
rstudioSnippetDefinitions <- rstudioSnippetsFileContent[grepl("^snippet (.*)", rstudioSnippetsFileContent)]
# find definitions appearing in packageSnippets but not in rstudioSnippets
# if no snippets are missing go to next file
#
snippetsToCopy <- setdiff(pckgSnippetsFileDefinitions, rstudioSnippetDefinitions)
For context here is the whole 'addPackageSnippets' function. The function is using only the base package, except getOS which returns one of 'linux', 'windows' or 'mac' (i.e. a wrapper around Sys.info()
#' #title Export snippets
#'
#' #description \code{addPackageSnippets} copies all (missing) snippet definitions
#' in 'inst/rstudio/Rsnippets.txt' and 'Rmdsnippets.txt' to the RStudios user snippet location.
#'
#' #return boolean invisible(FALSE) if nothing was added, invisible(TRUE) if snipped definitions were added
#' #export
#'
#' #examples \dontrun{addPackageSnippets()}
addPackageSnippets <- function() {
added <- FALSE
# if not on RStudio or RStudioServer exit
#
if (!nzchar(Sys.getenv("RSTUDIO_USER_IDENTITY"))) {
return(NULL)
}
# Name of files containing snippet code to copy
#
pckgSnippetsFiles <- c("Rsnippets.txt", "Rmdsnippets.txt")
# Name of files to copy into. Order has to be the same
# as in 'pckgSnippetsFiles'
#
rstudioSnippetsFiles <- c("r.snippets", "markdown.snippets")
# Path to directory for RStudios user files depends on OS
#
if (getOS() == "linux") {
rstudioSnippetsPathBase <- "~/.R/snippets"
} else if (getOS() == "windows") {
rstudioSnippetsPathBase <- file.path(path.expand('~'), ".R", "snippets")
} else {
warning(paste0("goSnippets() is only implemented on linux and windows"))
return(NULL)
}
# Read each file in pckgSnippetsFiles and add its contents
#
for (i in seq_along(pckgSnippetsFiles)) {
# Try to get template, if template is not found skip it
#
pckgSnippetsFilesPath <- system.file("rstudio", pckgSnippetsFiles[i], package = "myFunc")
if (pckgSnippetsFilesPath == "") {
next()
}
# load package snippets definitions
#
pckgSnippetsFileContent <- readLines(pckgSnippetsFilesPath)
# Extract names of package snippets
#
pckgSnippetsFileDefinitions <- pckgSnippetsFileContent[grepl("^snippet (.*)", pckgSnippetsFileContent)]
# Construct path for destination file
#
rstudioSnippetsFilePath <- file.path(rstudioSnippetsPathBase, rstudioSnippetsFiles[i])
# If targeted RStudios user file does not exist, raise error (otherwise we would 'remove')
# the default snippets from the 'user file'
#
if (!file.exists(rstudioSnippetsFilePath)) {
stop(paste0( "'", rstudioSnippetsFilePath, "' does not exist yet\n.",
"Use RStudio -> Tools -> Global Options -> Code -> Edit Snippets\n",
"To initalize user defined snippets file by adding dummy snippet\n"))
}
# Extract 'names' of already existing snitppets
#
rstudioSnippetsFileContent <- readLines(rstudioSnippetsFilePath)
rstudioSnippetDefinitions <- rstudioSnippetsFileContent[grepl("^snippet (.*)", rstudioSnippetsFileContent)]
# replace two spaces with tab, ONLY at beginning of string
#
pckgSnippetsFileContentSanitized <- gsub("(?:^ {2})|\\G {2}|\\G\t", "\t", pckgSnippetsFileContent, perl = TRUE)
# find defintions appearing in packageSnippets but not in rstudioSnippets
# if no snippets are missing go to next file
#
snippetsToCopy <- setdiff(trimws(pckgSnippetsFileDefinitions), trimws(rstudioSnippetDefinitions))
snippetsNotToCopy <- intersect(trimws(pckgSnippetsFileDefinitions), trimws(rstudioSnippetDefinitions))
if (length(snippetsToCopy) == 0) {
# cat(paste0("(\nFollowing snippets will NOT be added because there is already a snippet with that name: ",
# paste0(snippetsNotToCopy, collapse=", ") ,")"))
next()
}
# Inform user about changes, ask to confirm action
#
if (interactive()) {
cat(paste0("You are about to add the following ", length(snippetsToCopy),
" snippets to '", rstudioSnippetsFilePath, "':\n",
paste0(paste0("-", snippetsToCopy), collapse="\n")))
if (length(snippetsNotToCopy) > 0) {
cat(paste0("\n(The following snippets will NOT be added because there is already a snippet with that name:\n",
paste0(snippetsNotToCopy, collapse=", ") ,")"))
}
answer <- readline(prompt="Do you want to procedd (y/n): ")
if (substr(answer, 1, 1) == "n") {
next()
}
}
# Create list of line numbers where snippet definitons start
# This list is used to determine the end of each definition block
#
allPckgSnippetDefinitonStarts <- grep("^snippet .*", pckgSnippetsFileContentSanitized)
for (s in snippetsToCopy) {
startLine <- grep(paste0("^", s, ".*"), pckgSnippetsFileContentSanitized)
# Find last line of snippet definition:
# First find start of next defintion and return
# previous line number or lastline if already in last definiton
#
endLine <- allPckgSnippetDefinitonStarts[allPckgSnippetDefinitonStarts > startLine][1] -1
if (is.na(endLine)) {
endLine <- length(pckgSnippetsFileContentSanitized)
}
snippetText <- paste0(pckgSnippetsFileContentSanitized[startLine:endLine], collapse = "\n")
# Make sure there is at least one empty line between entries
#
if (tail(readLines(rstudioSnippetsFilePath), n=1) != "") {
snippetText <- paste0("\n", snippetText)
}
# Append snippet block, print message
#
cat(paste0(snippetText, "\n"), file = rstudioSnippetsFilePath, append = TRUE)
cat(paste0("* Added '", s, "' to '", rstudioSnippetsFilePath, "'\n"))
added <- TRUE
}
}
if (added) {
cat("Restart RStudio to use new snippets")
}
return(invisible(added))
}
For anyone who comes across this thread, and to add to Dario's great answer: from RStudio v1.3, the filepaths have changes. So in his function, the section for setting rstudioSnippetsPathBase would need to change into something like the following
if (rstudioapi::versionInfo()$version < "1.3") {
rstudioSnippetsPathBase <- file.path(path.expand('~'),".R", "snippets")
} else {
if (.Platform$OS.type == "windows") {
rstudioSnippetsPathBase <- file.path(Sys.getenv("APPDATA"), "RStudio", "snippets")
} else {
rstudioSnippetsPathBase <- file.path(path.expand('~'), ".config/rstudio", "snippets")
}
}
I am trying to read 23 excel files, store each in a list, and then rbind them to one csv. Some of these file are csv and some of them are xlsx. However, I got the following message:
Error: Can't establish that the input is either xls or xlsx.
So I want to identify which ones are giving error and then append it manually.
My function is the following:
make_df<-function(filename){
library(readxl)
library(foreign)
if (str_sub(filename,-3,-1) == "csv"){
df<-read.csv(filename,fileEncoding="latin1")
}
else{
df<-read_excel(filename)
}
return(df)
}
filenames_vector<-list.files(# directory)
datalist = list()
for (i in 1:23){
datalist[[i]] <- make_df(filenames_vector[i])
}
mega_data = do.call(rbind,datalist)
How can I add something in make_df to print out the names of files that are causing the error message? Also, is there another work around, when the the error message is on not being able to distinguish xlsx from xls?
This can be done with a tryCatch block. Without example data it's a little hard to recreate. I'm not sure what you mean in your second question.
Try the code below to catch errors and print out the filename if there's an error, otherwise return the df object.
make_df<-function(filename){
library(readxl)
library(foreign)
df = tryCatch(
{ # try block
if (str_sub(filename,-3,-1) == "csv"){
df<-read.csv(filename,fileEncoding="latin1")
}
else{
df<-read_excel(filename)
}
},
error=function(cond){return(filename)} # grab the filename if there was an error
)
if (class(df) == 'character') {
print(df)
} else{return(df)}
}
I have a file that I would to write away in a certain dir. Therefore I have the following code:
function <- {
file_path_new <- file.path("C:", "Users", "MavanderPeet", "Documents", "data")
setwd(file_path_new)
now <- Sys.time()
file_name <- paste0(now, "data_set.csv")
write.csv(data_frame, file_name)
# write.csv(data_frame, "file.csv") #for checking purposes
}
The part where I want to create a name with timestamp does not seem to work however... When I uncomment the line
write.csv(data_frame, "file.csv")
Everything works fine. So I guess it should be a syntax error....
Any thoughts??
The colon (:) is not allowed in Windows file names (reference).
Use a different format:
paste0(format(now, "%Y%m%d_%H%M%S_"), "data_set.csv")
In the answer by #Roland, now needs parenthesis:
paste0(format(now(), "%Y%m%d_%H%M%S_"), "data_set.csv")
I would like to have a function is.binary that gives the following results
tmp_vec=1:3
save(tmp_vec,file="temp_vec.RData")
write.csv(tmp_vec,"temp_vec.csv")
is.binary("temp_vec.RData")
#TRUE
is.binary("temp_vec.csv")
#FALSE
Is there a function like this in R?
The best solution I was able to come up with is
is.binary=function(filename) {
is_binary=TRUE
#suppress warnings and try to read file with binary reader
#if it throws an error, set is_binary to FALSE
withCallingHandlers(expr=tryCatch(load(filename),
error=function(err) is_binary<<-FALSE),
warning=function(w) invokeRestart("muffleWarning"))
#since R loads the objects of the binary file into the memory, delete them
#maybe this is unnecessary ...
#... I do not know how R handles memory of objects that go out of scope
rm(list=ls()[ls()!="is_binary"])
is_binary
}
Obviously, this function is not very efficient when dealing with large files. Any pointers to better solutions? Thank you!
Edit: I am on a Windows machine.
All files on a computer are binary files. The exceptions are those that happen to look like text files. And even those can have different encodings that are required to actually read the files properly.
If you're just trying to distinguish between the results of save() and write.csv(), the easiest thing to check would be if the file is compressed. By default save() will compress a file. You can look for the magic number for compression in the file. Here's one way to do that
is_compressed <- function(filename, magic.number=as.raw(c("0x1f","0x8b"))) {
fh<-file(filename, "rb")
on.exit(close(fh))
magic <- readBin(fh, "raw", length(magic.number))
if(length(magic) != length(magic.number)) return(FALSE)
if(all(magic == magic.number)) return(TRUE)
return (FALSE)
}
This only requires reading two bytes of a file rather than the whole thing. Once you know it's compressed you can then try to look for the magic number for an rData file. Here's a more complete function
is_rdata <- function(filename) {
#check for magic number
#https://github.com/wch/r-source/blob/b99d403f4b7337553acb2d2108c7a00e6c19f908/src/main/saveload.c#L1786
fh <- if(!is_compressed(filename))
file(filename, "rb")
else {
gzfile(filename, "rb")
}
on.exit(close(fh))
magic <- rawToChar(readBin(fh, "raw", 5))
if(nchar(magic)<5) return(FALSE)
if(magic %in% c("RDA1\n","RDB1\n","RDX1\n","RDA2\n","RDB2\n","RDX2\n")) return(TRUE)
return (FALSE)
}
We can test with
dd <- data.frame(a=1:4, b=letters[1:4])
save(dd, file="test1.file")
write.csv(dd, file="test2.file")
is_rdata("test1.file")
is_rdata("test2.file")
of course, if you are careful about your file extensions, that would probably be the easiest way to identify files.
I created a function which produces a matrix as a result, but I can't figure out how to make the output of this function usable outside of the function environment, so that I could for instance save it in csv file.
My code for function is the following:
created function which takes url's from specific site and returns page title:
getTitle <- function(url) {
webpage <- readLines(url)
first.row <- webpage[1]
start <- regexpr("<title>", first.row)
end <- regexpr("</title>", first.row)
title <- substr(first.row,start+7,end-1)
return(title)
}
created function which takes vector of urls and returns n*2 matrix with urls and page titles:
getTitles <- function(pages) {
my.matrix <- matrix(NA, ncol=2, nrow=nrow(pages))
for (i in seq_along(1:nrow(pages))) {
my.matrix[i,1] <- as.character(pages[i,])
my.matrix[i,2] <- getTitle(as.character(pages[i,])) }
return(my.matrix)
print(my.matrix)}
After running this functions on a sample file from here http://goo.gl/D9lLZ which I import with read.csv function and name "mypages" I get the following output:
getTitles(mypages)
[,1] [,2]
[1,] "http://support.google.com/adwords/answer/1704395" "Create your first ad campaign - AdWords Help"
[2,] "http://support.google.com/adwords/answer/1704424" "How costs are calculated in AdWords - AdWords Help"
[3,] "http://support.google.com/adwords/answer/2375470" "Organizing your account for success - AdWords Help"
This is exactly what I need, but I'd love to be able to export this output to csv file or reuse for further manipulations. However, when I try to print(my.matrix), I am getting an error saying "Error: object 'my.matrix' not found"
I feel like it's quite basic gap in my knowledge, but have not been working with R for a while and could not solve that.
Thanks!
Sergey
That's easy: use <<- for assignment to a global.
But then again, global assignment is evil and not functional. Maybe you'd rather return
a list with several results from your function? Looking at your code, it seems that your second function may confuse the return and print. Make sure you return the correct data structure.
A little about functional programming. First of all, when you define your function:
getTitles <- function(pages) {
[...]
return(my.matrix)
print(my.matrix)
}
know that when the function is called it will never reach the print statement. Instead, it will exit right before, with return. So you can remove that print statement, it is useless.
Now the more important stuff. Inside your function, you define and return my.matrix. The object only exists within the scope of the function: as the function exits, what is returned is an unnamed object (and my.matrix is lost.)
In your session, when you call
getTitles(mypages)
the result is printed because you did not assign it. Instead, you should do:
out.matrix <- getTitles(mypages)
Now the result won't be printed but you can definitely do so by typing print(out.matrix) or just out.matrix on a single line. And because you have stored the result in an object, you can now reuse it for further manipulations.
If it help you grasp the concept, this is all the same as calling the c() function from the command line:
c(1, 5, 2) # will return and print a vector
x <- c(1, 5, 2) # will return and assign a vector (not printed.)
Bonus: Really, I don't think you need to define getTitles, but you can use one of the *apply functions. I would try this:
url <- as.character(mypages)
title <- sapply(url, getTitle)
report <- data.frame(url, title)
write.csv(report, file = "report.csv", row.names = FALSE)
Can't you just use <<- to assign it the object to the workspace? The following code works for me and saves the amort_value object.
amortization <- function(cost, downpayment, interest, term) {
amort_value <<- (cost)*(1-downpayment/100)*(interest/1200)*((1+interest/1200)^(term*12))/((1+interest/1200)^(term*12)-1)
sprintf("$%.2f", amort_value)
}
amortization(445000,20,3,15)
amort_value
At the end of the function, you can return the result.
First define the function:
getRangeOf <- function (v) {
numRange <- max(v) - min(v)
return(numRange)
}
Then call it and assign the output to a variable:
scores <- c(60, 65, 70, 92, 99)
scoreRange <- getRangeOf(scores)
From here on use scoreRange in the environment. Any variables or nested functions within your defined function is not accessible to the outside, unless of course, you use <<- to assign a global variable. So in this example, you can't see what numRange is from the outside unless you make it global.
Usually, try to avoid global variables at an early stage. Variables are "encapsulated" so we know which one is used within the current context ("environment"). Global variables are harder to tame.