Graphing data that is read using readHTMLTable - r

I want to read the following table from a webpage and then create a bar graph.
Language          Jobs
PHP               12,664
Java              12,558
Objective C       8,925
SQL               5,165
Android (Java)    4,981
Ruby              3,859
JavaScript        3,742
C#                3,549
C++               1,908
ActionScript      1,821
Python            1,649
C                 1,087
ASP.NET           818
My questions:
1. My bars get messed up: each bar does not correspond to the correct language.
The following is my code:
library(XML)
tables2 <-(readHTMLTable("http://www.sitepoint.com/best-programming-language-of-2013/",which=1))
barplot(as.numeric(tables2$Job),names.arg=tables2$Language)
Since I am a beginner at R, I would also like to know what format readHTMLTable saves the data in. Is it a matrix, a data frame, or some other format?

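The main problem here is that Jobs is being read as a factor. Calling as.numeric() on a factor returns its internal integer codes rather than the values, which is why the bars do not match the languages. Because of the commas in that field, you can't do a direct numeric conversion of the labels either. You can find out what 'format' your object is in R by calling str(). Here str(tables2) gives: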
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : Factor w/ 13 levels "1,087","1,649",..: 6 5 12 11 10 9 8 7 4 3 ...
So you can see Jobs is a factor, and that tables2 is a data.frame. To convert it to numeric you need to remove the commas. You can do that with gsub().
tables2$Jobs <- as.numeric(gsub(",","",tables2$Jobs))
Now str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : num 12664 12558 8925 5165 4981 ...
and when you do your plot, all should be well:
barplot(tables2$Jobs,names.arg=tables2$Language)
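To see why the original as.numeric() call scrambled the bars, here is a small self-contained sketch. The three values are copied from the table in the question; the variable names (jobs, codes, clean) are made up for illustration, and no web page is fetched.

```r
# A few Jobs values from the table, stored as a factor (as in the str() output)
jobs <- factor(c("12,664", "12,558", "818"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(jobs)

# Go through character, strip the commas, then convert to numeric
clean <- as.numeric(gsub(",", "", as.character(jobs)))

print(codes)
print(clean)
```

The codes are just positions in levels(jobs), so plotting them gives bar heights that have nothing to do with the job counts.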

Related

Converting a factor to a numeric to then create a subset is not working

I am new to R and am having issues trying to work with a large dataset. I have a variable called DifferenceMonths and I would like to create a subset of my large dataset with only observations where the variable DifferenceMonths is less than 3.
It is coded into R as a factor so I have tried multiple times to convert it to a numeric. It finally showed up as numeric in my Global Environment, but then I checked using str() and it still shows up as a factor variable.
Log:
DifferenceMonths<-as.numeric(levels(DifferenceMonths))[DifferenceMonths]
Warning message:
NAs introduced by coercion
KRASDiff<-subset(KRASMCCDataset_final,DifferenceMonths<=2)
Warning message:
In Ops.factor(DifferenceMonths, 2) : ‘<=’ not meaningful for factors
str(KRASMCCDataset_final)
'data.frame': 7831 obs. of 25 variables:
$ Age : Factor w/ 69 levels "","21","24","25",..: 29 29 29 29 29 29 29 29 29 29 ...
$ Alive.Dead : Factor w/ 4 levels "","A","D","S": 2 2 2 2 2 2 2 2 2 2 ...
$ Status : Factor w/ 5 levels "","ambiguous",..: 4 4 5 5 4 5 5 5 4 5 ...
$ DifferenceMonths : Factor w/ 75 levels "","#NUM!","0",..: 14 14 14 14 14 14 14 14 14 14 ...
Thank you!
It's ugly, but you want:
as.numeric(as.character(DifferenceMonths))
The problem here, which you may have discovered, is that as.numeric() gives you the internal integer codes for the factor. The values are stored in the levels. If you run as.numeric(levels(DifferenceMonths)), you'll get those values, but only one per level, in the order they appear in levels(DifferenceMonths), not one per observation. The way around this is to coerce to character first and get away from the internal integer codes altogether.
EDIT: I learned something today. See this answer
as.numeric(levels(DifferenceMonths))[DifferenceMonths]
is the more efficient and preferred way, in particular if length(levels(DifferenceMonths)) is less than length(DifferenceMonths).
EDIT 2: On review after @MrFlick's comment, and some initial testing, x <- as.numeric(levels(x))[x] can behave strangely. Try assigning it to a new variable name. Let me see if I can figure out how and when this behavior occurs.
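As a rough illustration of both conversions, here is a sketch on a toy factor whose levels imitate those in the question ("", "#NUM!", and digit strings). The data frame dat and its id column are hypothetical. One possible reading of the log in the question, which is an assumption on my part, is that the converted values were assigned to a standalone variable while subset() kept using the factor column inside the data frame, so the sketch writes the result back into the column before subsetting.

```r
# Toy factor with the same kinds of levels as in the question
DifferenceMonths <- factor(c("3", "10", "#NUM!", "", "0", "10"))
dat <- data.frame(id = 1:6, DifferenceMonths = DifferenceMonths)

# Internal codes: positions in levels(), not the month values
codes <- as.integer(dat$DifferenceMonths)

# Both conversions give the same numbers; non-numeric levels become NA
via_char   <- suppressWarnings(as.numeric(as.character(dat$DifferenceMonths)))
via_levels <- suppressWarnings(
  as.numeric(levels(dat$DifferenceMonths))[dat$DifferenceMonths]
)

# Write the result back into the data frame, then subset on the column
dat$DifferenceMonths <- via_char
small <- subset(dat, DifferenceMonths <= 2)
print(small)
```

The "NAs introduced by coercion" warning is expected whenever levels such as "" or "#NUM!" are present; it is suppressed here only to keep the output short.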

How can I store a value in a name?

I use the neotoma package, where I get data from a geographical site that is identified by an ID. What I want to do is "store" the number in a name, like Sitenum, so that I only need to write the ID once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...
I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing like this
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c will coerce temp to be a character vector in the presence of "other" and "that" character vectors.
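Applied to the names used in the question, the same idea might look like the sketch below. The Site object here is a hand-built stand-in with two taxon names taken from the str() output, not a real neotoma download, so get_download() is not called.

```r
# Toy stand-in for the download: a list keyed by the site ID as a string
Site <- list("20131" = list(
  taxon.list = data.frame(taxon.name = c("Alnus", "Amaranthaceae"))
))
Sitenum <- 20131

# `$` treats Sitenum as a literal element name, so nothing is found
missing_with_dollar <- is.null(Site$Sitenum)

# `[[` evaluates its argument, so the stored ID works once it is a character string
taxa <- as.vector(Site[[as.character(Sitenum)]]$taxon.list$taxon.name)
print(taxa)
```

The as.character() call matters: Site[[Sitenum]] with a bare number would be treated as a position (element number 20131) rather than as a name.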

KNNCAT error "some classes have only one member"

I'm trying to run a KNN analysis on auto data using knncat's knncat function. My training set is around 700,000 observations. The following happens when I try to implement the analysis. I've attempted to remove NA using the complete cases method while reading the data in. I'm not sure exactly how to take care of the errors or what they mean.
kdata.training = kdataf[ind==1,]
kdata.test = kdataf[ind==2,]
kdata_pred = knncat(train = kdata.training, test = kdata.test, classcol = 4)
Error in knncat(train = kdata.training, test = kdata.test, classcol = 4) :
Some classes have only one member. Check "classcol"
When I attempt to run a small subsection of the training and test sets (200 and 70 observations respectively), I get the following error:
kdata_strain = kdata.training[1:200,]
kdata_stest = kdata.test[1:70,]
kdata_pred = knncat(train = kdata_strain, test = kdata_stest, classcol = 4)
Error in knncat(train = kdata_strain, test = kdata_stest, classcol = 4) :
Some factor has empty levels
Here is the output of str() called on kdataf, the data frame from which the above data was sampled:
str(kdataf)
'data.frame': 1159712 obs. of 9 variables:
$ vehicle_sales_price: num 13495 11999 14499 12495 14999 ...
$ week_number: Factor w/ 27 levels "1","2","3","4",..: 11 10 13 10 10 9 18 10 10 10 ...
$ county: Factor w/ 219 levels "Anderson","Andrews",..: 49 49 49 49 49 49 49 49 49 49 ...
$ ownership_code : Factor w/ 23 levels "1","2","3","4",..: 11 11 3 1 11 11 11 11 11 11 ...
$ X30_days_late : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ X60_days_late : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ penalty : num 0 0 55.3 0 0 ...
$ processing_time : int 28 24 32 29 19 20 63 27 28 24 ...
$ transaction_code : Factor w/ 2 levels "TITLE","WDTA": 2 2 2 2 2 2 2 2 2 2 ...
The seed was set to '1234' and the ratio of the training to test data was 2:1
First, I know very little about R, so take my answer with a grain of salt.
I had the same problem, which made no sense because there were no NAs. At first I thought the cause was strange characters like ', /, etc. in my data. But no, the knncat algorithm works with those characters when I put the following lines of code after defining my training set (I use data.table because my data are huge):
library(data.table)  # provides fread() and the := operator
write.csv(train, file="train.csv")
train <- fread("train.csv", sep=",", header=T, stringsAsFactors=T)
train[,V1:=NULL]  # drop the row-name column that write.csv added
Then, there are no more messages 'Some factor has empty levels' or 'Some classes have only one member. Check "classcol"'.
I know this is not a real solution to the problem, but at least, you can finish your work.
Hope it helps.
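A plausible explanation for why the CSV round trip helps (my assumption, not something the answer confirms) is that subsetting a data frame keeps every original factor level, including ones that no longer occur, while re-reading the file rebuilds each factor from the values actually present. If that is the cause, droplevels() from base R should have the same effect without writing a file. The sketch below uses a made-up data frame and does not call knncat.

```r
# Toy data: 'county' has three levels, 'code' plays the role of the class column
kdataf <- data.frame(
  county = factor(c("A", "B", "C", "A", "B")),
  code   = factor(c("x", "y", "x", "y", "x"))
)

# Subsetting rows keeps every original level, even those no longer present
train <- kdataf[kdataf$county != "C", ]
levels_before <- nlevels(train$county)

# droplevels() removes unused levels from every factor column
train <- droplevels(train)
levels_after <- nlevels(train$county)

# Class counts: a count of 1 would be consistent with
# "Some classes have only one member"
print(table(train$code))
```

Checking table() on the class column before fitting is a cheap way to see whether any class really does have a single member in the training subset.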

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training svm using my traindata. (e1071 package in R). Following is the information about my data.
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as the following.
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing "ticket" predictor from both train and test data. But still same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
For example, the output below shows that my.test$foo has only 4 levels, while my.train$foo has 5:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
That's correct: the train data contains 2 blanks for embarked. Because of this there is one extra categorical value for the blanks, and you get this error.
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first is blank
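One way to act on this, sketched under the assumption that mismatched factor levels are indeed the cause, is to rebuild each factor column in the test set with the levels from the training set. The two data frames below are toy stand-ins with a single embarked column; no svm model is fitted.

```r
# Toy train/test frames: train has an extra blank level, as in the question
train <- data.frame(embarked = factor(c("", "C", "Q", "S")))
test  <- data.frame(embarked = factor(c("S", "C", "Q")))

# Check whether the levels differ before changing anything
mismatch <- !identical(levels(train$embarked), levels(test$embarked))

# Rebuild each factor column in test with the levels used in train
for (col in names(test)) {
  if (is.factor(train[[col]])) {
    test[[col]] <- factor(as.character(test[[col]]),
                          levels = levels(train[[col]]))
  }
}
print(levels(test$embarked))
```

Note that any value in test that is not a level in train becomes NA with this approach, so it is worth checking for new NAs afterwards.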
I encountered the same problem today. It turned out that the svm model in e1071 package can only use rows as the objects, which means one row is one sample, rather than column. If you use column as the sample and row as the variable, this error will occur.
Probably your data is good (no new levels in test data), and you just need a small trick, then you are fine with prediction.
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick was from R Random Forest - type of predictors in new data do not match.
Today I encountered this problem, used above trick and then solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one of the things you can do is explicitly include only the columns you feel will add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by eliminating columns that contribute nothing (like ticket number) which have no relevant data.
Another possible issue, which resolved my code, was that I had forgotten to make some of my independent variables factors.

Reverting to Factor Codes R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now I would like to convert it back to the integer codes:
as.numeric(df.test[1])  ## replies with an error
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here because the task is not converting from a factor with levels that have a numeric interpretation. Your original effort that produced an error was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[["
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
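The differences between the three ways of extracting the column can be checked directly. This is a small sketch that rebuilds df.test from the question in one call; the helper names (one_col_df, as_vector1, as_vector2, codes) are made up for illustration.

```r
# Same data as in the question: column a is a factor with letter labels
df.test <- data.frame(
  a = factor(1:26, levels = 1:26, labels = letters),
  b = 1:26
)

one_col_df <- df.test[1]     # single bracket, one index: still a data frame
as_vector1 <- df.test[[1]]   # double bracket: the factor itself
as_vector2 <- df.test[, 1]   # row/column indexing: also the factor

# Integer codes underlying the factor
codes <- as.integer(df.test$a)

print(class(one_col_df))
print(class(as_vector1))
print(class(as_vector2))
```

Only the last two forms hand as.numeric() or as.integer() an atomic vector, which is why they succeed where df.test[1] fails.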

Resources