Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now I would like to convert it back to the integer codes:
as.numeric(df.test[1]) ## throws an error
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here, because the task is not converting from a factor whose levels have a numeric interpretation. Your original attempt, the one that produced an error, was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[[":
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
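The difference between the indexing forms can be checked directly with class(). This is a minimal sketch that rebuilds the df.test from the question; the variable name codes is only for illustration.

```r
# Rebuild the example data frame from the question
df.test <- data.frame(a = factor(1:26, levels = 1:26, labels = letters),
                      b = 1:26)

class(df.test[1])    # "data.frame": single bracket keeps the list container
class(df.test[, 1])  # "factor": matrix-style indexing drops to a vector
class(df.test[[1]])  # "factor": double bracket extracts the element itself
class(df.test$a)     # "factor": same as [[ but by name

# Any form that returns the factor itself can be converted
codes <- as.numeric(df.test[[1]])
```

Only the forms that return the factor itself can be passed to as.numeric().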
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
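To make the labels versus codes distinction concrete, here is a small sketch with a made-up factor f whose labels happen to look like numbers (the values are illustrative, not from the question):

```r
# A factor whose labels look numeric
f <- factor(c("10", "20", "10", "30"))

# The internal integer codes index into levels(f)
codes <- as.integer(f)                     # 1 2 1 3

# The labels, converted to numbers
labels_num <- as.numeric(as.character(f))  # 10 20 10 30

# Equivalent conversion of the labels, done once per level
labels_num2 <- as.numeric(levels(f))[f]
```

In the question the labels are letters, so only the codes have a numeric meaning, which is why as.integer() (or as.numeric()) on the factor is the right tool there.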
I'm working on an assignment practicing Logistic Regression models. Our data is on shots made in NBA games and each row includes a column for what team the player making the shot belongs to and a column for who the home team was.
I am trying to add a column with TRUE/FALSE values based on whether or not the shot was taken by the home team, based on some example code we were provided.
df$home.advntg <- df$Team == df$Home
However I keep getting the error: "Error in Ops.factor(df$Team, df$Home) :
level sets of factors are different"
When I check the columns with str() however these are the results:
str(df$Team) : "Factor w/ 30 levels "ATL","BKN","BOS",..: 7 16 27 3 24 1 10 8 12 12 ..."
str(df$Home) : " Factor w/ 30 levels "ATL","BKN","BOS",..: 7 20 27 5 28 1 10 8 1 12 ..."
The data I'm using is a subset of a much larger dataset which covered shots made from 1997 to 2020. The code worked on the original data, so something about how I've reduced it to just the 2020 shots is probably responsible. The dates of the games are in YMD format, so to filter down to just 2020 I ran:
df0 <- read_csv("NBA Shot Locations 1997 - 2020.csv")
df0$Year <- substr(df0$"Game Date",1,4)
df <- filter(df0, Year == 2020)
df <- df[,-23]
When I run str and check the columns with the original data (for which there was no error) I get:
str(df$Team) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 18 17 8 2 18 7 1 10 9 12 ..."
str(df$Home) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 6 17 5 28 18 2 1 10 9 12 ..."
In both cases the Factor levels look like they're the same. I don't really understand what the numbers being returned by the str function represent.
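For what it is worth, the numbers printed by str() are the integer codes of the factor, that is, positions in its levels vector. str() also truncates the level list, so two factors can show the same first three levels and the same level count and still have different level sets. The sketch below uses made-up team and home vectors (not the real data) and assumes that differing level sets are the cause here; comparing the character labels sidesteps the check.

```r
# Hypothetical data: the two factors have different level sets
team <- factor(c("BOS", "ATL", "LAL"))  # levels: ATL BOS LAL
home <- factor(c("BOS", "MIA", "LAL"))  # levels: BOS LAL MIA

# The integers shown by str() are positions in levels()
team_codes <- as.integer(team)          # 2 1 3
levels(team)[team_codes]                # back to the labels

# Comparing factors with different level sets is an error
cmp_error <- tryCatch(team == home,
                      error = function(e) conditionMessage(e))

# Comparing the labels as character strings works regardless of levels
home_advantage <- as.character(team) == as.character(home)
```

Running levels(df$Team) and levels(df$Home) in full, rather than relying on the truncated str() output, would show whether the two sets really match.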
I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape,dplyr,tidyr, and/or data.table could do the job but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. that doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
reshape2
You could use dcast from reshape2:
library(reshape2)
dcast(tastycheez,
      Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
      value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
tidyr
Stealing @DavidArenburg's comment, here's the tidyr way:
library(tidyr)
tastycheez %>%
  unite(temp, StatsType, MeasureType, sep = ".") %>%
  spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
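If neither package is available, the same idea (paste the two category columns into one key, then go wide on that key) can be sketched with base R's reshape(). This is an alternative not taken from the answers above; the helper column key and the result name wide are made up for illustration.

```r
# Mock data from the question
tastycheez <- data.frame(
  Day = rep(1:3, 9),
  StatsType = rep(c(rep("Ave", 3), rep("Max", 3), rep("Min", 3)), 3),
  MeasureType = rep(c("Temp", "H2O", "Tastiness"), each = 9),
  Data_values = 1:27,
  Label1 = rep("SamplingU1", 27)
)

# Combine the two category columns into a single key column
tastycheez$key <- paste(tastycheez$StatsType, tastycheez$MeasureType, sep = ".")

# One row per Day/Label1, one column per key
wide <- reshape(
  tastycheez[, c("Day", "Label1", "key", "Data_values")],
  idvar = c("Day", "Label1"),
  timevar = "key",
  direction = "wide"
)

# reshape() prefixes the new columns with "Data_values."; strip that prefix
names(wide) <- sub("^Data_values\\.", "", names(wide))
```

The columns come out in order of first appearance rather than alphabetically, so the column order differs from the dcast and spread results even though the values are the same.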
> str(tester)
Classes ‘data.table’ and 'data.frame': 6402 obs. of 2419 variables:
$ h1 : int 1 5 6 12 13 16 19 22 26 28 ...
$ joinno : int 2 6 7 11 12 14 16 17 19 21 ..
$ h1 : int 1 5 6 12 13 16 19 22 26 28 ...
$ joinno : int 2 6 7 11 12 14 16 17 19 21 ...
Could somebody enlighten me as to how/why cbind-ing these two objects together with identical column names doesn't cause problems? These happen to actually be identical columns so it's kind of moot but when I subset that column name(s) I get a single value. So how does R decide which column I'm referring to (presumably the first)? Is there an easy/canned way to de-dupe columns in R?
Thanks in Advance.
@Frank is right. The defaults are check.names=TRUE for ?data.frame and check.names=FALSE for ?data.table. However, in the case of cbind-ing, that setting doesn't come into play:
library(data.table)
cbind(data.frame(a=1),data.frame(a=2))
cbind(data.table(a=1),data.table(a=2))
...both give duplicate names. You could apply:
names(out) <- make.unique(names(out))
...after cbind-ing to fix it up. Another option would be to not use cbind in favour of:
data.frame(data.frame(a=1),data.frame(a=2))
data.table(data.table(a=1),data.table(a=2),check.names=TRUE)
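On the "which column do I get" part of the question: with duplicated names, $ and [[ return the first match. A minimal sketch using plain data frames only (the names out and deduped are illustrative):

```r
# cbind keeps both columns under the same name
out <- cbind(data.frame(a = 1), data.frame(a = 2))
dup_names <- names(out)   # "a" "a"

# Name-based extraction returns the first matching column
first_a <- out$a          # 1
also_first <- out[["a"]]  # 1

# Drop columns whose name has already appeared (keeps the first of each)
deduped <- out[!duplicated(names(out))]

# Or keep every column and make the names unique instead
names(out) <- make.unique(names(out))
unique_names <- names(out)  # "a" "a.1"
```

Dropping by duplicated(names(...)) is only safe when the repeated columns really hold the same data, as in the question; otherwise make.unique() keeps everything.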
I want to read the following table from a webpage and then create a bar graph.
Language          Jobs
PHP               12,664
Java              12,558
Objective C       8,925
SQL               5,165
Android (Java)    4,981
Ruby              3,859
JavaScript        3,742
C#                3,549
C++               1,908
ActionScript      1,821
Python            1,649
C                 1,087
ASP.NET           818
My questions:
1. The problem is that my bars get messed up and each bar does not correspond to the correct language.
The following is my code:
library(XML)
tables2 <-(readHTMLTable("http://www.sitepoint.com/best-programming-language-of-2013/",which=1))
barplot(as.numeric(tables2$Job),names.arg=tables2$Language)
2. Since I am a beginner at R, I would like to know what format readHTMLTable saves the data in. Is it a matrix, a data frame, or some other format?
The main problem here is that Jobs is being read as a factor. Because of the commas in that field, you can't do a direct numeric conversion. You can find out what 'format' your object is in R by doing str(). Here str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : Factor w/ 13 levels "1,087","1,649",..: 6 5 12 11 10 9 8 7 4 3 ...
So you can see Jobs is a factor, and that tables2 is a data.frame. To convert it to numeric you need to remove the commas. You can do that with gsub().
tables2$Jobs <- as.numeric(gsub(",","",tables2$Jobs))
Now str(tables2) gives:
'data.frame': 13 obs. of 2 variables:
$ Language: Factor w/ 13 levels "ActionScript",..: 10 7 9 13 2 12 8 5 6 1 ...
$ Jobs : num 12664 12558 8925 5165 4981 ...
and when you do your plot, all should be well:
barplot(tables2$Jobs,names.arg=tables2$Language)
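To see why the original plot was scrambled, compare what as.numeric() returns on the factor with what it returns after the commas are removed. This sketch uses a three-value stand-in for the Jobs column rather than the scraped table; the variable names are illustrative.

```r
# Stand-in for the Jobs column as read from the page
jobs <- factor(c("12,664", "12,558", "8,925"))

# as.numeric() on a factor returns the level codes, not the job counts
codes <- as.numeric(jobs)            # 2 1 3

# Removing the commas first gives the real values
# (gsub() converts the factor to its character labels)
clean <- as.numeric(gsub(",", "", jobs))  # 12664 12558 8925
```

The bar heights in the question were these level codes, which is why they did not line up with the languages.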
I have a set of data with 3 columns: index column (with no name), colour, colour of seed, and germination time.
How do I create a numerical variable called 'order' with values 1 to 22 (the number of data sets)?
I don't know if I understand you correctly, but the simplest way would be:
> order <- c(1:22)
> order
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Now, if you run:
class(order)
you will get:
[1] "integer"
but you can easily get every element of object order (especially in a loop)
# seq_along() is safer than 1:length() when the vector might be empty
for (i in seq_along(order)) {
  print(order[i])
}
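If the goal is to have order as a column of the data set itself, one option is to build it from the number of rows. This is a sketch that assumes the data sit in a data frame; the name seeds and its contents are made up, since the question does not show the data.

```r
# Hypothetical stand-in for the 22-row data set in the question
seeds <- data.frame(
  colour = rep(c("brown", "black"), 11),
  germination = seq(5, 26)
)

# Add a numeric column running from 1 to the number of rows
seeds$order <- seq_len(nrow(seeds))
```

Using seq_len(nrow(...)) rather than a hard-coded 1:22 keeps the column correct if rows are added or removed later.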