R plots with X11 cannot show CJK fonts

I loaded a UTF-8 CSV file containing Japanese characters. Its str() output looks like this:
> str(purchases)
'data.frame': 168996 obs. of 7 variables:
$ ITEM_COUNT : int 1 1 1 1 1 1 2 2 1 1 ...
$ I_DATE : Date, format: "2012-03-28" "2011-07-04" ...
$ SMALL_AREA_NAME: Factor w/ 55 levels "キタ","ミナミ他",..: 6 47 26 26 26 26 26 35 35 26 ...
$ USER_ID_hash : Factor w/ 22782 levels "0000b53e182165208887ba65c079fc21",..: 19467 7623 7623 7623 7623 7623 7623 7623 7623 7623 ...
$ COUPON_ID_hash : Factor w/ 19368 levels "000eba9b783cec10658308b5836349f6",..: 3929 8983 5982 5982 5982 5982 5982 2737 18489 5018 ...
$ category : Factor w/ 13 levels "Beauty","Delivery service",..: 2 3 2 2 2 2 2 7 2 3 ...
So I think there's nothing wrong with my encoding or locale (en_US.UTF-8). But when I plot with
> barplot(table(purchases$SMALL_AREA_NAME))
why do the Japanese characters turn into little blocks like this?
I think I have the font to display Japanese characters
> names(X11Fonts())
[1] "serif" "sans" "mono" "Times" "Helvetica"
[6] "CyrTimes" "CyrHelvetica" "Arial" "Mincho"
Additional info:
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_1.0.1

You may want to take a look at the showtext package, which allows you to use different fonts in R graphs. It also ships with a CJK font that can be used directly.
Try to run the code below:
library(showtext)
showtext_auto()  ## named showtext.auto() in older showtext releases
## ... code to generate data
barplot(table(purchases$SMALL_AREA_NAME))


Use grep to delete any string containing year less than 2014

Edited to add more context and data 5/12/2017
Using R version 3 on Windows
I have a data frame data2:
'data.frame': 1504 obs. of 14 variables:
$ Member.Name : chr "A" "B" "C"...
$ MSTATUS : Factor w/ 14 levels "","ACTIVE","ACTIVE;CHANGEDROLES;NONQUALIF",..: 13 2 2 2 2 4 13 13 2 13 ...
$ MCAT : Factor w/ 9 levels "","EDNEWCLASS",..: 5 4 9 6 6 6 9 9 4 4 ...
$ SALUTATION : Factor w/ 822 levels "","Aaron","Abigail",..: 285 2 2 2 4 4 4 4 5 5 ...
$ MEM_SUBCATEGORY : Factor w/ 22 levels "","AGENCYCEO",..: 22 6 8 15 8 6 8 1 6 6 ...
$ MEM_SUBTYPE : Factor w/ 25 levels "","AGENCY","AGENCYCEO",..: 24 6 6 20 6 6 6 6 6 6 ...
$ COUNTRY : Factor w/ 33 levels "","AE","AT","AU",..: 33 33 33 33 7 33 33 33 33 33 ...
$ F500 : Factor w/ 243 levels "","#1406 on Forbes Global 2000 ($11B)",..: 1 1 96 1 242 1 147 1 1 76 ...
$ OPT_LINE : Factor w/ 1467 levels "","(Formerly) Condé Nast",..: 1 1170 609 1333 251 1427 444 258 814 1207 ...
$ FLAGS : chr "2014PAGEJAMPARTICIPANT, \nPHOTO" "" "PUFOUNDINGMEMBER" "2014FLESPEAKER" ...
$ FLAGS_DESCR : chr "2014 Page Jam Participant, \nPhoto on File" "" "Page Up Founding Member" "2014 Future Leaders Experience Speaker" ...
$ Enroll.Date : Date, format: "2012-12-04" "2010-08-24" "2013-09-20" "2013-05-06" ...
$ Expiration.Date : Date, format: "2014-12-31" "2017-12-31" "2017-12-31" "2017-12-31" ...
$ Sponsorship.Amount: num 0 0 0 0 0 0 0 0 0 0 ...
For the FLAGS variable, I'd like to remove all row elements that contain a year less than 2014.
head(data2$FLAGS, n=3)
[1] "2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL" ""
[3] "2012COI"
So that FLAGS will look like:
head(data2$FLAGS, n=3)
[1] "\n2016CHAIRCOUNCIL" ""
[3] ""
The rows with no values can either be blank or NA, BUT if a row contains both an event with a year >= 2014 and an event with a year < 2014, then just delete the event before 2014 and keep the other events in the row.
This regex works for your example. The idea is to match the first 3 characters of year for those elements that fail and drop them.
FLAGS[-grep("20(0|1[0123])", FLAGS)]
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT" "\n2014PUSPONSOR, \nPHOTO"
or, using invert, you'd have
FLAGS[grep("20(0|1[0123])", FLAGS, invert=TRUE)]
Note that it won't catch pre-2000s and you should be cautious if there are other "numeric" values in the vector.
To return a vector of the same length, with NAs replacing the earlier years, you could use is.na<- and grepl like this
is.na(FLAGS) <- grepl("20(0|1[0123])", FLAGS)
original data
FLAGS<-c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
"\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT", "2013NEWMEMBERNOMINATOR",
"\n2014PUSPONSOR, \nPHOTO")
Given the OP's second question, the following more or less works:
sapply(strsplit(FLAGS, ","),
function(x) paste(gsub("(\\n)?20(0|1[0123]).*?(, |$)", "", trimws(x)), collapse=" "))
[1] " 2016CHAIRCOUNCIL" "" ""
Note that the "\n" is missing at the beginning and there is an additional (set of) space(s) at the beginning of the first element. The "\n" is removed by trimws, which makes the string a bit easier to work with. The additional spaces can be removed by wrapping the above expression in trimws, for example, trimws(sapply(strsplit(...))).
additional data
FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
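If the goal is to drop individual events rather than whole elements, one alternative is to split each element on commas and test the leading four-digit year of every event. The following is a minimal base R sketch under that assumption. The names drop_old_events and cutoff are hypothetical, not part of the answers here, and events without a leading year (such as "PHOTO") are assumed to be worth keeping. Note that the "\n" prefixes are stripped and the surviving events are re-joined with ", ".

```r
# Hypothetical helper: keep only events whose leading 4-digit year is >= cutoff.
drop_old_events <- function(flags, cutoff = 2014) {
  vapply(strsplit(flags, ","), function(events) {
    events <- trimws(events)  # strips surrounding spaces and "\n"
    # Events that do not start with a number yield NA here.
    year <- suppressWarnings(as.integer(substr(events, 1, 4)))
    keep <- is.na(year) | year >= cutoff  # assumption: events without a year are kept
    paste(events[keep], collapse = ", ")
  }, character(1))
}

FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
drop_old_events(FLAGS)
# [1] "2016CHAIRCOUNCIL" ""                 ""
```

Unlike the regex above, this compares the year numerically, so pre-2000 years are also dropped.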
Here is one solution using stringr package:
library(stringr)
FLAGS[sapply(str_extract_all(FLAGS, '[0-9]{4}'),
function(x) !any(as.integer(x) < 2014))]
This solution assumes you may have more than one year in each value. If that is not the case, you can do something simpler like:
FLAGS[as.integer(str_extract(FLAGS, '[0-9]{4}')) >= 2014]
Assuming FLAGS is as follows:
FLAGS
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "2001ANNUALCONFERENCECOMM"
[3] "\n2011GOVERNANCE" "\n2014PAGEJAMPARTICIPANT"
[5] "2013NEWMEMBERNOMINATOR" "\n2014PUSPONSOR, \nPHOTO"
You get result as:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT"
[3] "\n2014PUSPONSOR, \nPHOTO"
EDITING ANSWER BASED ON QUESTION EDIT ABOVE
You can keep only values with 2014 or above and fill with NAs otherwise as follows:
data2$FLAGS <- ifelse(as.integer(str_extract(data2$FLAGS, '\\d+')) >= 2014,
data2$FLAGS, NA)
Result is as follows:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" NA
[3] NA "\n2014PAGEJAMPARTICIPANT"
[5] NA "\n2014PUSPONSOR, \nPHOTO"
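For readers who would rather not depend on stringr, the first snippet in this answer (drop every element that contains any year before 2014) can probably be written with base R's regmatches and gregexpr. This is only a sketch of the same idea; the variable names years, has_old and kept are made up for illustration.

```r
FLAGS <- c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
           "\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT", "2013NEWMEMBERNOMINATOR",
           "\n2014PUSPONSOR, \nPHOTO")

# One character vector of 4-digit matches per element (possibly empty).
years <- regmatches(FLAGS, gregexpr("[0-9]{4}", FLAGS))

# TRUE where at least one extracted year is before 2014.
# Elements with no year at all give any(logical(0)), which is FALSE, so they are kept.
has_old <- vapply(years, function(y) any(as.integer(y) < 2014), logical(1))

kept <- FLAGS[!has_old]
print(kept)
```

This keeps the same three elements as the stringr version shown above.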

Error in R code - c50 code called exit with value 1

So, I'm new to machine learning in R. I'm trying the Kaggle Home Depot product search relevance competition in R.
The structure of my training data set is -
'data.frame': 74067 obs. of 6 variables:
$ id : int 2 3 9 16 17 18 20 21 23 27 ...
$ product_uid : int 100001 100001 100002 100005 100005 100006 100006 100006 100007 100009 ...
$ product_title: Factor w/ 53489 levels "# 62 Sweetheart 14 in. Low Angle Jack Plane",..: 44305 44305 5530 12404 12404 51748 51748 51748 30638 25364 ...
$ search_term : Factor w/ 11795 levels "$ hole saw",". exterior floor stain",..: 1952 6411 3752 8652 9528 3499 7146 7148 4417 7026 ...
$ relevance : Factor w/ 13 levels "1","1.25","1.33",..: 13 10 13 9 11 13 11 13 11 13 ...
$ levsim1 : num 0.1818 0.1212 0.0886 0.1795 0.2308 ...
where levsim1 is the vector containing Levenshtein similarity coefficients after comparing the search term and product name. The target value is the relevance and I have tried using the C50 package in R for modeling this training set. However once I run this command:
relevance_model <- C5.0(train.combined[,-5],train.combined$relevance)
(the relevance vector is in the factor format with 13 levels). My computer hangs for about 15 - 20 minutes because of the computations in R, and I later get this message in R:
c50 code called exit with value 1
I know that this error is common if there are empty cells, however no cells are empty in the data set.
I'm not sure if I'm using the wrong kind of data for this package. If someone could shed light on why I'm getting this error, or what to read up on in terms of how to model this data set, that would be great.

R convert data to factor will corrupt all other data.frame columns

I have a data.frame in which all columns are numeric. I want to convert one integer column to a factor, but doing so appears to convert all the other columns to class character. Is there any way to convert just one column to a factor?
The example is from Converting variables to factors in R:
myData <- data.frame(A=rep(1:2, 3), B=rep(1:3, 2), Pulse=20:25)
myData$A <-as.factor(myData$A)
The result
apply(myData,2,class)
# A B Pulse
# "character" "character" "character"
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base ...
str(myData$A)
# Factor w/ 2 levels "1","2": 1 2 1 2 1 2
Your code actually works when I test it.
This is my output from str(myData):
'data.frame': 6 obs. of 3 variables:
$ A : Factor w/ 2 levels "1","2": 1 2 1 2 1 2
$ B : int 1 2 3 1 2 3
$ Pulse: int 20 21 22 23 24 25
Your issue is because, as ?apply states:
‘apply’ attempts to coerce to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data frame)
This is done before executing the function on each column. And when you run as.matrix(myData) you end up with everything forced to one class, in this case character data:
is.character(as.matrix(myData))
#[1] TRUE
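To inspect the column classes without going through as.matrix, one option is to iterate over the data frame as a list, for example with sapply. A minimal sketch using the same myData as in the question (col_classes is just an illustrative name):

```r
myData <- data.frame(A = rep(1:2, 3), B = rep(1:3, 2), Pulse = 20:25)
myData$A <- as.factor(myData$A)

# A data frame is a list of columns, so sapply() visits each column
# with its own class intact instead of coercing everything to a matrix first.
col_classes <- sapply(myData, class)
print(col_classes)
#         A         B     Pulse
#  "factor" "integer" "integer"
```

So only A was converted; the "character" result in the question comes from apply, not from as.factor.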

Graphing data that is read using readHTMLTable

I want to read the following table from a webpage and then create a bar graph.
Language............ Jobs
PHP.................... 12,664
Java................... 12,558
Objective C......... 8,925
SQL.................... 5,165
Android (Java).... 4,981
Ruby................... 3,859
JavaScript........... 3,742
C#....................... 3,549
C++..................... 1,908
ActionScript......... 1,821
Python................. 1,649
C.......................... 1,087
ASP.NET............... 818
My questions:
1. The problem is that my bars get messed up and each bar does not correspond to the correct language.
The following is my code:
library(XML)
tables2 <-(readHTMLTable("http://www.sitepoint.com/best-programming-language-of-2013/",which=1))
barplot(as.numeric(tables2$Job),names.arg=tables2$Language)
Since I am a beginner at R, I would also like to know in what format readHTMLTable saves the data. Is it a matrix, a data frame, or some other format?
The main problem here is that Jobs is being read as a factor. Because of the commas in that field, you can't do a direct numeric conversion. You can find out what 'format' your object is in R by doing str(). Here str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : Factor w/ 13 levels "1,087","1,649",..: 6 5 12 11 10 9 8 7 4 3 ...
So you can see Jobs is a factor, and that tables2 is a data.frame. To convert it to numeric you need to remove the commas. You can do that with gsub().
tables2$Jobs <- as.numeric(gsub(",","",tables2$Jobs))
Now str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : num 12664 12558 8925 5165 4981 ...
and when you do your plot, all should be well:
barplot(tables2$Jobs,names.arg=tables2$Language)
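To see why the original bars looked scrambled, it may help to look at what as.numeric does to a factor: it returns the internal level codes rather than the numbers printed in the labels. Here is a small sketch with a hand-built factor standing in for the Jobs column (the three values are taken from the table in the question; jobs, codes and counts are illustrative names):

```r
# Toy stand-in for the Jobs column as readHTMLTable returned it (a factor).
jobs <- factor(c("12,664", "12,558", "8,925"))

# Direct conversion gives the internal level codes, not the job counts.
codes <- as.numeric(jobs)

# Convert the labels to text, strip the commas, then convert to numbers.
counts <- as.numeric(gsub(",", "", as.character(jobs)))

print(codes)
print(counts)
```

gsub() would coerce the factor to character on its own; the explicit as.character() just makes that step visible.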

"Subscript out of bounds" when using effects package

I am using the effects package to construct some probability graphs showing the predicted probabilities from a logistic regression model. However, I get an odd error message and don't know what the issue is.
When I attempt to generate the plots, I get the following error. The warning is not the issue; what I don't understand is the error message.
library(effects)
dat$won_ping = as.factor(dat$won_ping)
mod2 = glm(won_ping ~ our_bid +
age_of_oldest_driver2 +
credit_type2 +
coverage_type2 +
home_owner2 +
vehicle_driver_score +
currently_insured2 +
zipcode2,
data=dat, family=binomial(link="logit"))
> plot(effect("our_bid*vehicle_driver_score", mod2), rescale.axis=FALSE, multiline=TRUE)
Warning message:
In analyze.model(term, mod, xlevels, default.levels) :
our_bid:vehicle_driver_score does not appear in the model
Error in plot(effect("our_bid*vehicle_driver_score", mod2), rescale.axis = FALSE, :
error in evaluating the argument 'x' in selecting a method for function 'plot': Error in apply(mod.matrix[, components], 1, prod) :
subscript out of bounds
Here's info on my data and my glm commands:
> str(dat)
'data.frame': 85240 obs. of 71 variables:
$ our_bid : num 155 123 183 98 108 159 98 123 98 200 ...
$ won_ping : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ zipcode2 : Factor w/ 4 levels "1:6999","10000:14849",..: 4 3 2 1 3 2 3 1 2 2 ...
$ age_of_oldest_driver2 : Factor w/ 4 levels "18 to 21","22 to 25",..: NA 3 NA NA NA NA 3 NA 3 NA ...
$ currently_insured2 : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 1 1 ...
$ credit_type2 : Ord.factor w/ 4 levels "POOR"<"FAIR"<..: 2 3 2 3 2 2 1 3 3 2 ...
$ coverage_type2 : Factor w/ 4 levels "BASIC","MINIMUM",..: 4 3 3 3 3 3 3 3 4 3 ...
$ home_owner2 : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
$ vehicle_driver_score : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
And finally, here might be some useful info:
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] effects_2.2-1 colorspace_1.1-1 nnet_7.3-1 MASS_7.3-16 lattice_0.20-0 foreign_0.8-46
loaded via a namespace (and not attached):
[1] tools_2.14.0
Help! What does the error message mean? Normally, a "subscript out of bounds" error would mean I'm selecting something outside the bounds of that data structure, but that simply is not occurring.
EDIT:
To #Rowland
As I said above, the warning and error messages are separate and unrelated. Let's say I take out zipcode2 and run the glm:
mod2 = glm(won_ping ~ our_bid +
age_of_oldest_driver2 +
credit_type2 +
coverage_type2 +
home_owner2 +
vehicle_driver_score +
currently_insured2,
data=dat, family=binomial(link="logit"))
> plot(effect("our_bid*home_owner2", mod2), rescale.axis=FALSE, multiline=TRUE)
Warning message:
In analyze.model(term, mod, xlevels, default.levels) :
our_bid:home_owner2 does not appear in the model
This produces just the warning, which is fine as I get the desired result. So the fact that ":" does not appear in the model is not the issue, and DOES NOT cause the error message.
Try this:
with(dat, table(our_bid, vehicle_driver_score))
I suspect you have some unpopulated cells. With your edit, it seems unlikely that the sparseness I am hypothesizing as the problem lies with these two variables. It remains possible that, despite your large number of cases, there are still empty cells when the model is constructed with all of those factor variables.
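As a rough illustration of what checking for unpopulated cells could look like, here is a sketch on made-up data. The data frame toy and its columns are hypothetical stand-ins, not the OP's dat. It cross-tabulates two factors and lists the combinations that never occur:

```r
# Toy data (hypothetical): level "1" of driver_score is declared but never observed.
toy <- data.frame(
  home_owner   = factor(c("0", "1", "1", "0", "1", "1")),
  driver_score = factor(rep("0", 6), levels = c("0", "1"))
)

# Cross-tabulate the two factors; unused levels still get a column.
cells <- with(toy, table(home_owner, driver_score))
print(cells)

# Row/column positions of the combinations with no observations.
empty <- which(cells == 0, arr.ind = TRUE)
print(empty)
```

Whether empty cells are what actually triggers the error in the effects package is still only a hypothesis; the sketch just shows one way to look for them.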
