R error: level sets of factors are different - r

I'm working on an assignment practicing Logistic Regression models. Our data is on shots made in NBA games and each row includes a column for what team the player making the shot belongs to and a column for who the home team was.
I am trying to add a column with TRUE/FALSE values based on whether or not the shot was taken by the home team, based on some example code we were provided.
df$home.advntg <- df$Team == df$Home
However I keep getting the error: "Error in Ops.factor(df$Team, df$Home) :
level sets of factors are different"
When I check the columns with str() however these are the results:
str(df$Team) : "Factor w/ 30 levels "ATL","BKN","BOS",..: 7 16 27 3 24 1 10 8 12 12 ..."
str(df$Home) : " Factor w/ 30 levels "ATL","BKN","BOS",..: 7 20 27 5 28 1 10 8 1 12 ..."
The data I'm using is a subset of a much larger dataset which covered shots made from 1997 to 2020. The code worked on the original data, so something about how I've reduced it to just the 2020 shots is probably responsible. The dates of the games are in YMD format, so to filter down to just 2020 I ran:
df0 <- read_csv("NBA Shot Locations 1997 - 2020.csv")
df0$Year <- substr(df0$"Game Date",1,4)
df <- filter(df0, Year == 2020)
df <- df[,-23]
When I run str and check the columns with the original data (for which there was no error) I get:
str(df$Team) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 18 17 8 2 18 7 1 10 9 12 ..."
str(df$Home) : "Factor w/ 36 levels "ATL","BKN","BOS",..: 6 17 5 28 18 2 1 10 9 12 ..."
In both cases the Factor levels look like they're the same. I don't really understand what the numbers being returned by the str function represent.
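A possible explanation, sketched on made-up data since the full level lists are not shown: the numbers that str() prints after the levels are the internal integer codes, each one an index into that factor's levels. Two factors can both report 30 levels and share the first few while the full sets still differ, and == on two factors stops with this error whenever the level sets are not identical. The toy vectors team and home below are hypothetical; comparing the labels as character strings sidesteps the check.

```r
# Hypothetical toy data: both factors have 3 levels, but not the same 3.
team <- factor(c("ATL", "BOS", "CHI"))
home <- factor(c("ATL", "BKN", "CHI"))

# Levels present in one factor but not in the other.
setdiff(levels(team), levels(home))
setdiff(levels(home), levels(team))

# Comparing the factors directly fails because the level sets differ.
res <- try(team == home, silent = TRUE)

# Comparing the labels as character strings works regardless of levels.
same_team <- as.character(team) == as.character(home)

# Alternatively, rebuild both factors on a shared set of levels first.
all_levels <- union(levels(team), levels(home))
same_team2 <- factor(team, levels = all_levels) == factor(home, levels = all_levels)
```

On the data frame from the question the same idea would read df$home.advntg <- as.character(df$Team) == as.character(df$Home).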

Related

Converting a factor to a numeric to then create a subset is not working

I am new to R and am having issues trying to work with a large dataset. I have a variable called DifferenceMonths and I would like to create a subset of my large dataset with only observations where the variable DifferenceMonths is less than 3.
It is coded into R as a factor so I have tried multiple times to convert it to a numeric. It finally showed up as numeric in my Global Environment, but then I checked using str() and it still shows up as a factor variable.
Log:
DifferenceMonths<-as.numeric(levels(DifferenceMonths))[DifferenceMonths]
Warning message:
NAs introduced by coercion
KRASDiff<-subset(KRASMCCDataset_final,DifferenceMonths<=2)
Warning message:
In Ops.factor(DifferenceMonths, 2) : ‘<=’ not meaningful for factors
str(KRASMCCDataset_final)
'data.frame': 7831 obs. of 25 variables:
$ Age : Factor w/ 69 levels "","21","24","25",..: 29 29 29 29 29 29 29 29 29 29 ...
$ Alive.Dead : Factor w/ 4 levels "","A","D","S": 2 2 2 2 2 2 2 2 2 2 ...
$ Status : Factor w/ 5 levels "","ambiguous",..: 4 4 5 5 4 5 5 5 4 5 ...
$ DifferenceMonths : Factor w/ 75 levels "","#NUM!","0",..: 14 14 14 14 14 14 14 14 14 14 ...
Thank you!
It's ugly, but you want:
as.numeric(as.character(DifferenceMonths))
The problem here, which you may have discovered, is that as.numeric() on a factor gives you the internal integer codes, not the values; the values are stored in the levels. If you run as.numeric(levels(DifferenceMonths)) you'll get those values, but only one per level, in the order they appear in levels(DifferenceMonths), not one per observation. The way around this is to coerce to character first and get away from the internal integer codes altogether.
EDIT: I learned something today. See this answer
as.numeric(levels(DifferenceMonths))[DifferenceMonths]
is the more efficient and preferred way, in particular if length(levels(DifferenceMonths)) is less than length(DifferenceMonths).
EDIT 2: On review after @MrFlick's comment, and some initial testing, x <- as.numeric(levels(x))[x] can behave strangely. Try assigning it to a new variable name. Let me see if I can figure out how and when this behavior occurs.
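A minimal sketch of the three conversions on a made-up factor (the vector x and the data frame dat are hypothetical, chosen to mimic the "" and "#NUM!" levels in the question). It also illustrates one likely reason the subset() call in the question still complained: subset() looks up DifferenceMonths inside the data frame first, so converting a standalone variable leaves the factor column untouched. Writing the converted values back into the data frame avoids that.

```r
# Hypothetical factor with numeric-looking labels plus two junk levels.
x <- factor(c("10", "2", "", "#NUM!", "2"))

# Internal integer codes: positions in levels(x), not the values.
codes <- as.numeric(x)

# Both of these recover the values; "" and "#NUM!" become NA
# (the coercion warning is suppressed here on purpose).
vals  <- suppressWarnings(as.numeric(as.character(x)))
vals2 <- suppressWarnings(as.numeric(levels(x))[x])

# Store the result as a column so subset() sees the numeric version.
dat <- data.frame(DifferenceMonths = x)
dat$months_num <- vals
small <- subset(dat, months_num <= 2)  # rows with NA are dropped
```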

Getting a difference between time(n+1)-time(n) in a dataframe in r

I have a dataframe where the columns represent monthly data and the rows represent different simulations. The data I am working with accumulates over time, so I want to take the difference between consecutive months to get the true value for each month. There are no headers in my data frame.
For example:
View(df)=
1 3 4 6 19 23 24 25 26 ...
1 2 3 4 5 6 7 8 9 ...
0 0 2 3 5 7 14 14 14 ...
My plan was to use the diff() function or something like it, but I am having trouble using it on a dataframe.
I have tried:
df1<-diff(df, lag = 1, differences = 1)
but only get zeros.
I am grateful for any advice.
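See ?apply. If it's a data frame,
apply(df,2,diff)
should work. Also, since a data frame is a list of vectors, sapply(df,diff) should work. Both of these take differences down each column; if the months run across the columns, as in the question, difference along each row instead with t(apply(df,1,diff)).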
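To make the direction of the differencing concrete, here is a small sketch using only the nine values per row shown in the question (the elided columns are left out). It assumes, as the question states, that rows are simulations and columns are months; the names between_sims, monthly and per_month are made up for illustration.

```r
# Data from the question: rows are simulations, columns are months.
df <- as.data.frame(rbind(
  c(1, 3, 4, 6, 19, 23, 24, 25, 26),
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  c(0, 0, 2, 3, 5, 7, 14, 14, 14)
))

# apply(df, 2, diff) differences down each column, i.e. between simulations.
between_sims <- apply(df, 2, diff)

# Differencing along each row gives the month-to-month change instead.
# apply() returns one column per input row here, so transpose back.
monthly <- t(apply(df, 1, diff))

# Keep the first month as-is so every month has a non-cumulative value.
per_month <- cbind(df[[1]], monthly)
```

Each row of per_month should then sum back to the last cumulative value of that simulation, which is a quick sanity check.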

How to make data in a single column (long) with multiple, nested group categories wide

I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape, dplyr, tidyr, and/or data.table could do the job, but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. that doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
reshape2
You could use dcast from reshape2:
library(reshape2)
dcast(tastycheez,
      Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
      value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
tidyr
Stealing @DavidArenburg's comment, here's the tidyr way:
library(tidyr)
tastycheez %>%
  unite(temp, StatsType, MeasureType, sep = ".") %>%
  spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
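For readers on a newer tidyr, the same reshaping can probably be written in one step with pivot_wider(). This is only a sketch and assumes tidyr 1.0.0 or later is installed; it is not part of the original answers. The columns come out in order of first appearance rather than alphabetically, so the layout may differ from the output above even though the values are the same.

```r
library(tidyr)

# Mock data from the question.
tastycheez <- data.frame(
  Day = rep(1:3, 9),
  StatsType = rep(c(rep("Ave", 3), rep("Max", 3), rep("Min", 3)), 3),
  MeasureType = rep(c("Temp", "H2O", "Tastiness"), each = 9),
  Data_values = 1:27,
  Label1 = rep("SamplingU1", 27)
)

# One column per StatsType/MeasureType pair, named like "Ave.Temp".
# Day and Label1 are the remaining columns, so they identify each row.
wide <- pivot_wider(
  tastycheez,
  names_from = c(StatsType, MeasureType),
  names_sep = ".",
  values_from = Data_values
)
```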

How do I create a survival object in R?

The question I am posting here is closely linked to another question I posted two days ago about Gompertz aging analysis.
I am trying to construct a survival object, see ?Surv, in R. This will hopefully be used to perform Gompertz analysis to produce an output of two values (see original question for further details).
I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data is available to me in several layouts so the choice of which is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98 with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
Each genotype has approximately 50 flies in it. What I need help with now is how to go from one of the above dataframes to a working survival object. What does this object look like? And how do I get from the above to the survival object smoothly?
After noting that the total of Deaths was 55 and that you said the number of flies was "around 50", I decided the likely assumption was that this was a completely observed process. So you need to replicate the rows with multiple deaths so there is one row for each death, and assign an event marker of 1. The "long" format is clearly the preferred format. You can then create a Surv-object with the 'Day' and 'event' columns:
library(survival) # provides Surv()
?Surv
df3 <- df2[rep(rownames(df2), df2$Deaths), ]
str(df3)
#---------------------
'data.frame': 55 obs. of 3 variables:
$ Exp : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
$ Deaths: num 2 2 3 3 3 1 3 3 3 4 ...
$ Day : num 10 10 12 12 12 14 16 16 16 18 ...
#----------------------
df3$event=1
str(with(df3, Surv(Day, event) ) )
#------------------
Surv [1:55, 1:2] 10 10 12 12 12 14 16 16 16 18 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "time" "status"
- attr(*, "type")= chr "right"
Note: If this were being done in the coxph function, the expansion to individual lines of data might not have been needed, since that function allows specification of case weights. (I'm guessing that the other regression functions in the survival package would not have needed this to be done either.) In the past Terry Therneau has expressed puzzlement that people are creating Surv-objects outside the formula interface of coxph. The intended use of this Surv-object was not described in sufficient detail to know whether a weighted analysis without expansion would be possible.
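As a rough illustration of the case-weight idea in the note, the sketch below fits a Kaplan-Meier curve two ways with survfit(): once on the expanded rows and once on the unexpanded rows with Deaths passed as weights. This is an assumption-laden sketch, not the analysis the asker needs: it uses survfit() rather than coxph() or a Gompertz fit, and it assumes every death is observed (no censoring), as the answer does. The names fit_expanded and fit_weighted are made up.

```r
library(survival)

# Long-format example data from the question.
df2 <- data.frame(
  Exp = "A",
  Deaths = c(0, 0, 0, 2, 3, 1, 3, 4, 5, 3, 4, 7, 8, 2, 10, 1, 2),
  Day = seq(4, 36, by = 2)
)
df2$event <- 1  # assumption: every death is observed, no censoring

# Expanded version: one row per death, as in the answer.
df3 <- df2[rep(seq_len(nrow(df2)), df2$Deaths), ]
fit_expanded <- survfit(Surv(Day, event) ~ 1, data = df3)

# Weighted version: keep one row per day and use the death counts as
# case weights (days with zero deaths are dropped first).
fit_weighted <- survfit(Surv(Day, event) ~ 1,
                        data = df2[df2$Deaths > 0, ],
                        weights = Deaths)
```

If the two fits agree on the estimated survival curve, the expansion step is a convenience rather than a requirement for this kind of model.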

Reverting to Factor Codes R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now I would like to convert it back to the integer codes:
as.numeric(df.test[1]) ## replies with an error code.
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here because the task is not converting from a factor whose levels have a numeric interpretation. Your original effort that produced an error was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[["
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
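The points made in these answers can be checked side by side. The sketch below rebuilds df.test from the question and compares what single and double brackets return; the names one_col_df, col_factor, res, codes and labs are made up for illustration.

```r
df.test <- data.frame(a = 1:26, b = 1:26)
df.test$a <- factor(df.test$a, levels = 1:26, labels = letters)

# Single brackets with one index keep the data frame wrapper.
one_col_df <- df.test[1]
# Double brackets (or $, or [, 1]) extract the factor itself.
col_factor <- df.test[[1]]

# as.numeric() cannot coerce the one-column data frame ...
res <- try(as.numeric(one_col_df), silent = TRUE)
# ... but the extracted factor converts to its integer codes.
codes <- as.integer(col_factor)
# The labels, by contrast, are recovered with as.character().
labs <- as.character(col_factor)
```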
