Name collisions in data.table columns - r

> str(tester)
Classes ‘data.table’ and 'data.frame': 6402 obs. of 2419 variables:
$ h1 : int 1 5 6 12 13 16 19 22 26 28 ...
$ joinno : int 2 6 7 11 12 14 16 17 19 21 ...
$ h1 : int 1 5 6 12 13 16 19 22 26 28 ...
$ joinno : int 2 6 7 11 12 14 16 17 19 21 ...
Could somebody enlighten me as to how/why cbind-ing these two objects together with identical column names doesn't cause problems? The duplicated columns happen to be identical, so it's kind of moot, but when I subset by one of the duplicated column names I get a single column back. So how does R decide which column I'm referring to (presumably the first)? Is there an easy/canned way to de-dupe columns in R?
Thanks in Advance.

@Frank is right. The defaults are check.names=TRUE for ?data.frame and check.names=FALSE for ?data.table. However, when cbind-ing, check.names doesn't come into play:
cbind(data.frame(a=1),data.frame(a=2))
cbind(data.table(a=1),data.table(a=2))
...both give duplicate names. You could apply:
names(out) <- make.unique(names(out))
...after cbind-ing to fix it up. Another option would be to not use cbind in favour of:
data.frame(data.frame(a=1),data.frame(a=2))
data.table(data.table(a=1),data.table(a=2),check.names=TRUE)
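To make the behaviour asked about in the question concrete, here is a minimal sketch using base data frames only (the object names out, first_a, and deduped are made up for illustration). It assumes the usual list semantics, where $ returns the first column whose name matches, and shows one possible way to drop the later duplicates rather than rename them:

```r
# Two one-column data frames that share the column name "a"
out <- cbind(data.frame(a = 1), data.frame(a = 2))
names(out)                 # both columns are called "a"

# $ (and [[ ]]) pick the first column with a matching name
first_a <- out$a

# Option 1: keep both columns but make the names unique
renamed <- out
names(renamed) <- make.unique(names(renamed))

# Option 2: keep only the first occurrence of each column name
deduped <- out[, !duplicated(names(out)), drop = FALSE]
```

Option 2 silently discards data if the duplicated columns are not actually identical, so it is worth comparing them first.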

Related

Converting a factor to a numeric to then create a subset is not working

I am new to R and am having issues trying to work with a large dataset. I have a variable called DifferenceMonths and I would like to create a subset of my large dataset with only observations where the variable DifferenceMonths is less than 3.
It is coded into R as a factor so I have tried multiple times to convert it to a numeric. It finally showed up as numeric in my Global Environment, but then I checked using str() and it still shows up as a factor variable.
Log:
DifferenceMonths<-as.numeric(levels(DifferenceMonths))[DifferenceMonths]
Warning message:
NAs introduced by coercion
KRASDiff<-subset(KRASMCCDataset_final,DifferenceMonths<=2)
Warning message:
In Ops.factor(DifferenceMonths, 2) : ‘<=’ not meaningful for factors
str(KRASMCCDataset_final)
'data.frame': 7831 obs. of 25 variables:
$ Age : Factor w/ 69 levels "","21","24","25",..: 29 29 29 29 29 29 29 29 29 29 ...
$ Alive.Dead : Factor w/ 4 levels "","A","D","S": 2 2 2 2 2 2 2 2 2 2 ...
$ Status : Factor w/ 5 levels "","ambiguous",..: 4 4 5 5 4 5 5 5 4 5 ...
$ DifferenceMonths : Factor w/ 75 levels "","#NUM!","0",..: 14 14 14 14 14 14 14 14 14 14 ...
Thank you!
It's ugly, but you want:
as.numeric(as.character(DifferenceMonths))
The problem here, which you may have discovered, is that as.numeric() on a factor gives you the internal integer codes, not the values. The values are stored in the levels. If you run as.numeric(levels(DifferenceMonths)), you'll get those values, but only one per level, in the order they appear in levels(DifferenceMonths), rather than one per observation. The way around this is to coerce to character first and get away from the internal integer codes altogether.
EDIT: I learned something today. See this answer:
as.numeric(levels(DifferenceMonths))[DifferenceMonths]
is the more efficient and preferred way, in particular if length(levels(DifferenceMonths)) is less than length(DifferenceMonths).
EDIT 2: On review after @MrFlick's comment, and some initial testing, x <- as.numeric(levels(x))[x] can behave strangely. Try assigning it to a new variable name. Let me see if I can figure out how and when this behavior occurs.
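A small sketch of the whole round trip, using a made-up data frame dat in place of KRASMCCDataset_final. One likely reason str() still reported a factor in the question is that the conversion was assigned to a standalone variable rather than back into the data frame column; the sketch assigns to dat$DifferenceMonths instead. Levels such as "" and "#NUM!" have no numeric interpretation and become NA, which is where the coercion warning comes from:

```r
# Hypothetical stand-in for the real dataset
dat <- data.frame(
  DifferenceMonths = factor(c("0", "2", "10", "#NUM!", "", "1"))
)

# Internal integer codes, NOT the month values
codes <- as.numeric(dat$DifferenceMonths)

# Go through character and write the result back into the column;
# "" and "#NUM!" become NA (warning suppressed here on purpose)
dat$DifferenceMonths <- suppressWarnings(
  as.numeric(as.character(dat$DifferenceMonths))
)

# subset() treats NA in the condition as FALSE, so NA rows are dropped
small <- subset(dat, DifferenceMonths <= 2)
```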

Error in R code - c50 code called exit with value 1

So, I'm new to machine learning in R. I'm trying the Kaggle Home Depot product search relevance competition in R.
The structure of my training data set is -
'data.frame': 74067 obs. of 6 variables:
$ id : int 2 3 9 16 17 18 20 21 23 27 ...
$ product_uid : int 100001 100001 100002 100005 100005 100006 100006 100006 100007 100009 ...
$ product_title: Factor w/ 53489 levels "# 62 Sweetheart 14 in. Low Angle Jack Plane",..: 44305 44305 5530 12404 12404 51748 51748 51748 30638 25364 ...
$ search_term : Factor w/ 11795 levels "$ hole saw",". exterior floor stain",..: 1952 6411 3752 8652 9528 3499 7146 7148 4417 7026 ...
$ relevance : Factor w/ 13 levels "1","1.25","1.33",..: 13 10 13 9 11 13 11 13 11 13 ...
$ levsim1 : num 0.1818 0.1212 0.0886 0.1795 0.2308 ...
where levsim1 is the vector containing Levenshtein similarity coefficients after comparing the search term and product name. The target value is the relevance and I have tried using the C50 package in R for modeling this training set. However once I run this command:
relevance_model <- C5.0(train.combined[,-5],train.combined$relevance)
(the relevance vector is in the factor format with 13 levels). My computer hangs for about 15 - 20 minutes because of the computations in R, and I later get this message in R:
c50 code called exit with value 1
I know that this error is common if there are empty cells; however, no cells are empty in the data set.
I'm not sure if I'm using the wrong kind of data for this package. If someone could shed light on why I'm getting this error, or on what to read up on to model this data set, that would be great.

Ranking entries in a column based on sums of entries in another column

I am a beginner in R with a question I can't quite figure out. I've run multiple searches on Stack Overflow to address my question (links to results here, here, and here) but none have addressed my issue. On to the problem: I have subset a data frame, DAV, from a larger dataset.
> str(DAV)
'data.frame': 994 obs. of 9 variables:
$ MIL.ID : Factor w/ 18840 levels "","0000151472",..: 7041 9258 10513 5286 5759 5304 5312 5337 5337 5547 ...
$ Name : Factor w/ 18395 levels ""," Atticus Finch",..: 1226 6754 12103 17234 2317 14034 15747 4542 4542 14819 ...
$ Center : int 2370 2370 2370 2370 2370 2370 2370 2370 2370 2370 ...
$ Gift.Date : Factor w/ 339 levels "","01/01/2015",..: 6 6 6 7 10 13 13 13 13 13 ...
$ Gift.Amount: num 100 47.5 150 41 95 ...
$ Solic. : Factor w/ 31 levels "","aa","ac","an",..: 20 31 20 29 20 8 28 8 8 8 ...
$ Tender : Factor w/ 10 levels "","c","ca","cc",..: 3 2 3 5 2 9 3 9 9 9 ...
$ Account : Factor w/ 16 levels "","29101-0000",..: 4 4 4 11 2 11 2 11 2 11 ...
$ Restriction: Factor w/ 258 levels "","AAU","ACA",..: 216 59 216 1 137 1 137 1 38 1 ...
The two relevant columns for my issue are MIL.ID, which contains a unique ID for a donor, and Gift.Amount, which contains a dollar amount for a single gift the donor gave. A single MIL.ID is often associated with multiple Gift.Amount entries, meaning that donor has given on multiple different occasions for various amounts. Here is what I want to do:
I want to separate out the above mentioned columns from the rest of the data frame;
I want to sum(Gift.Amount) but only do so for each donor, i.e. I want to create a sum of all gifts for MIL.ID 1234 in the above data.frame; and
I want to rank all the MIL.IDs based on the sum Gift.Amount entries associated with their ID.
I apologize for how basic this is, and if it is redundant to a question already asked, but I couldn't find anything.
Edit to address comment:
shot of table
> print(ranking)
Desired output
I am struggling to get the formatting correct here so I included screen shots
This should do it:
df <- DAV[, c("MIL.ID", "Gift.Amount")]                # extract the two columns
df <- aggregate(Gift.Amount ~ MIL.ID, df, sum)         # sum amounts with the same ID
df <- df[order(df$Gift.Amount, decreasing = TRUE), ]   # sort by total, decreasing
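The three steps above can be tried on a small made-up data frame (the IDs and amounts below are invented for illustration). The sketch also adds an explicit Rank column with rank(), which is an addition to the answer above for the case where the rank itself is needed and not just the sort order:

```r
# Hypothetical miniature version of DAV
DAV <- data.frame(
  MIL.ID      = factor(c("0001", "0002", "0001", "0003", "0002", "0001")),
  Gift.Amount = c(100, 47.5, 150, 41, 95, 10)
)

# Keep the two relevant columns, then total the gifts per donor
gifts  <- DAV[, c("MIL.ID", "Gift.Amount")]
totals <- aggregate(Gift.Amount ~ MIL.ID, data = gifts, FUN = sum)

# Largest total first
totals <- totals[order(totals$Gift.Amount, decreasing = TRUE), ]

# Rank 1 = largest total; tied donors share the smaller rank
totals$Rank <- rank(-totals$Gift.Amount, ties.method = "min")
```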

Graphing data that is read using readHTMLTable

I want to read the following table from a webpage and then create a bar graph.
Language          Jobs
PHP               12,664
Java              12,558
Objective C       8,925
SQL               5,165
Android (Java)    4,981
Ruby              3,859
JavaScript        3,742
C#                3,549
C++               1,908
ActionScript      1,821
Python            1,649
C                 1,087
ASP.NET           818
My questions:
1. The problem is that my bars get messed up and each bar does not correspond to the correct language.
The following is my code:
library(XML)
tables2 <-(readHTMLTable("http://www.sitepoint.com/best-programming-language-of-2013/",which=1))
barplot(as.numeric(tables2$Job),names.arg=tables2$Language)
2. Since I am a beginner at R, I would like to know in what format readHTMLTable saves the data. Is it a matrix, a data frame, or some other format?
The main problem here is that Jobs is being read as a factor. Because of the commas in that field, you can't do a direct numeric conversion. You can find out what 'format' your object is in R by doing str(). Here str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : Factor w/ 13 levels "1,087","1,649",..: 6 5 12 11 10 9 8 7 4 3 ...
So you can see Jobs is a factor, and that tables2 is a data.frame. To convert it to numeric you need to remove the commas. You can do that with gsub().
tables2$Jobs <- as.numeric(gsub(",","",tables2$Jobs))
Now str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : num 12664 12558 8925 5165 4981 ...
and when you do your plot, all should be well:
barplot(tables2$Jobs,names.arg=tables2$Language)
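The difference between the factor codes and the cleaned values can be checked without fetching the page. The sketch below builds a three-row stand-in for tables2 by hand (an assumption; the real object comes from readHTMLTable) and compares the two conversions:

```r
# Hand-built stand-in for the scraped table, with Jobs stored as a factor
tables2 <- data.frame(
  Language = c("PHP", "Java", "Objective C"),
  Jobs     = c("12,664", "12,558", "8,925"),
  stringsAsFactors = TRUE
)

# What the original barplot() call was plotting: internal factor codes
codes <- as.numeric(tables2$Jobs)

# Strip the thousands separators, then convert the text to numbers
tables2$Jobs <- as.numeric(gsub(",", "", tables2$Jobs, fixed = TRUE))
```

After this step tables2$Jobs holds the job counts themselves, so bar heights match the table.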

Reverting to Factor Codes R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now I would like to convert it back to the integer codes:
as.numeric(df.test[1]) ## gives an error
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here because the task is not converting from a factor whose levels have a numeric interpretation. Your original effort that produced an error was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[["
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
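The points from these answers can be checked side by side. The sketch below rebuilds df.test more compactly than in the question (same content) and inspects what each indexing form returns; the names codes and labels_chr are made up for illustration:

```r
# Same data as in the question: column a is a factor labelled a..z
df.test <- data.frame(
  a = factor(1:26, levels = 1:26, labels = letters),
  b = 1:26
)

class(df.test[1])     # single bracket, one index: a one-column data frame
class(df.test[, 1])   # row/column indexing: the factor itself
class(df.test[[1]])   # double bracket: the factor itself

# The two parts of the factor
codes      <- as.integer(df.test$a)    # underlying integer codes
labels_chr <- as.character(df.test$a)  # the labels
```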

Resources