I am trying to call a function that provides a value for specific data elements in a table.
A data table (gameData) might be:
Date TeamA TeamB TeamAScore TeamBScore
1 2016-03-06 NYC HOU 67 76
2 2016-02-14 BOS SEA NaN NaN
3 2016-01-30 LAS DAL 63 74
I would like to populate TeamAScore with the return value of a function when it is NaN. I tried something like the following:
gameData$TeamAScore <- ifelse(
is.nan(gameData$TeamAScore),
getTeamAScore(gameData$TeamA,gameData$TeamB,gameDate=gameData$Date),
gameData$TeamAScore
)
When I run this, I get an error like the following:
Error in Ops.factor(teamdata$Team, TeamA) :
level sets of factors are different
It seems to be passing all of the TeamA values to the function call instead of only the value for that row.
The problem here is that the TeamA and TeamB columns do not have the data you think they have. Factors are tricky in R...
Let's create two factors here to see what is happening:
> TeamA <- factor(c("NYC", "BOS", "LAS", "SEA"))
> TeamB <- factor(c("HOU", "LAS", "NYC", "SEA"))
> TeamA
[1] NYC BOS LAS SEA
Levels: BOS LAS NYC SEA
OK, so TeamA has four elements: NYC, BOS, LAS and SEA. So we can compare it with TeamB element by element to see whether the two vectors match in any position. Right? Wrong:
> TeamA == TeamB
Error in Ops.factor(TeamA, TeamB) : level sets of factors are different
That is the same error you are receiving! That happens because what is really stored in these vectors is a number representing each "factor level".
> str(TeamA)
Factor w/ 4 levels "BOS","LAS","NYC",..: 3 1 2 4
> levels(TeamA)
[1] "BOS" "LAS" "NYC" "SEA"
> levels(TeamB)
[1] "HOU" "LAS" "NYC" "SEA"
So, 1 represents BOS in the TeamA vector, but it represents HOU in the TeamB vector. Of course they can't be compared!
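Comparing the underlying strings instead behaves as expected:
> as.character(TeamA) == as.character(TeamB)
[1] FALSE FALSE FALSE  TRUE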
How do you avoid factors when they are getting in your way? Use the argument stringsAsFactors=FALSE when you create the data.frame, either with data.frame(x, y, z, stringsAsFactors=FALSE) or with read.csv("filename.csv", ..., stringsAsFactors=FALSE).
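If gameData already exists with factor columns, a quick alternative (a sketch, assuming the rest of your code stays as-is) is to convert the team columns to character before calling your function:
gameData$TeamA <- as.character(gameData$TeamA)
gameData$TeamB <- as.character(gameData$TeamB)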
I am working through creating pivot tables with the pivottabler package to summarise frequencies of rock art classes by location. The data I am summarising here are from published papers; I have them stored in an RDS file created in R, and they look like this:
> head(cyp_art_freq)
Class Location value
1: Figurative Princess Charlotte Bay 347
2: Track Princess Charlotte Bay 35
3: Non-Figurative Princess Charlotte Bay 18
4: Figurative Mitchell-Palmer and Chillagoe 320
5: Track Mitchell-Palmer and Chillagoe 79
6: Non-Figurative Mitchell-Palmer and Chillagoe 1002
> str(cyp_art_freq)
Classes 'data.table' and 'data.frame': 12 obs. of 3 variables:
 $ Class   : chr "Figurative" "Track" "Non-Figurative" "Figurative" ...
 $ Location: chr "Princess Charlotte Bay" "Princess Charlotte Bay" "Princess Charlotte Bay" "Mitchell-Palmer and Chillagoe" ...
 $ value   : num 347 35 18 320 79 ...
 - attr(*, ".internal.selfref")=<externalptr>
The problem is that pivottabler does not sum the contents of the 'value' column. Instead, it counts the number of rows/cases: as the screenshot below shows, the resulting table reports a total of 12 cases when the result should be in the thousands. I think this relates to the 'value' column being a count derived from a larger dataset. I've tried pivot_longer and pivot_wider, changed data types, and used CSVs instead of RDS for import (and more).
The code block I'm using works with the built-in bhmtrains dataset and with my other datasets, but I suspect I need to either tell pivottabler to sum the contents of the 'value' column (see the sketch after the code block below) or expand the underlying dataset to one row per observation.
How might I ensure that the 'Count' calculation actually sums the contents of the input 'value' column? I hope that is clear, and thanks for any suggestions on how to address this issue.
table01 <- PivotTable$new()
table01$addData(cyp_art_freq)
table01$addColumnDataGroups("Class", totalCaption = "Total")
table01$defineCalculation(calculationName="Count", summariseExpression="n()", caption="Count", visible=TRUE)
filterOverrides <- PivotFilterOverrides$new(table01, keepOnlyFiltersFor="Count")
table01$defineCalculation(calculationName="TOCTotal", filters=filterOverrides,
summariseExpression="n()", caption="TOC Total", visible=FALSE)
table01$defineCalculation(calculationName="PercentageAllMotifs", type="calculation",
basedOn=c("Count", "TOCTotal"),
calculationExpression="values$Count/values$TOCTotal*100",
format="%.1f %%", caption="Percent")
table01$addRowDataGroups("Location")
table01$theme <- "compact"
table01$renderPivot()
table01$evaluatePivot()
(Screenshot: the pivot table returned by this code.)
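A possible fix for the counting behaviour, assuming pivottabler's summariseExpression accepts a dplyr-style summary such as sum(value) in the same way it accepts "n()": sum the 'value' column instead of counting rows, and make the same change to the TOCTotal calculation so the percentages stay consistent.
table01$defineCalculation(calculationName="Count", summariseExpression="sum(value)",
                          caption="Count", visible=TRUE)
table01$defineCalculation(calculationName="TOCTotal", filters=filterOverrides,
                          summariseExpression="sum(value)", caption="TOC Total", visible=FALSE)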
So, I have a small data frame and like my title says, I would like to remove all rows that end in a certain letter, "n".
Here is the code that will give you the data I am working with:
url = "http://www.basketball-reference.com/leagues/NBA_1980.html"
library(XML)
x1 = readHTMLTable(url)
east.1980 = x1[["E_standings"]]
west.1980 = x1[["W_standings"]]
east.1980 = east.1980[c(1,2)]
west.1980 = west.1980[c(1,2)]
names(east.1980) = c("Team", "W")
names(west.1980) = c("Team", "W")
wins.1980 = rbind(east.1980, west.1980)
wins.1980$Team = gsub("\\b\\d+\\b", "", wins.1980$Team)
wins.1980$Team = gsub(" +"," ",gsub("^ +","",gsub("[^a-zA-Z0-9 ]","",wins.1980$Team)))
View(wins.1980)
Here is an example of how the data frame will look:
Team W
1 Atlantic Division NA
2 Boston Celtics 61
3 Philadelphia 76ers 59
4 Washington Bullets 39
5 New York Knicks 39
6 New Jersey Nets 34
7 Central Division NA
8 Atlanta Hawks 50
9 Houston Rockets 41
10 San Antonio Spurs 41
11 Indiana Pacers 37
12 Cleveland Cavaliers 37
13 Detroit Pistons 16
14 Midwest Division NA
15 Milwaukee Bucks 49
16 Kansas City Kings 47
17 Denver Nuggets 30
So basically, I want to remove the division rows "Atlantic Division, Central Division, etc...". It just so happens that all of these strings end with "n", so I am trying to write a for loop to remove all of the rows where the wins.1980$Team string ends with "n".
I want to be able to repeat the process over 30+ years of the data so being repeatable is a must.
Here are the two for loops I have tried so far:
for (i in 1:nrow(wins.1980)) {
if ((str_sub(wins.1980$Team[i], -1)) == "n") {
eval(parse(text=paste0("wins.","1980","[-", i, ",]")))
}
}
for (i in 1:nrow(wins.1980)) {
if ((str_sub(wins.1980$Team[i], -1)) == "n") {
wins.1980[-i,]
}
}
I have used a for loop with if ((str_sub(myData$Column[i], -1)) == "letter") before to do something when the last character equals "letter", so I am pretty sure that part of the loop works.
Since there are only 6 divisions in the NBA, I would also be okay with something repeatable along the lines of if (wins.1980$Team == "Atlantic Division" | "Midwest Division" | etc...) then remove that row; however, I do not think the problem in my loop is selecting the right rows, just removing them.
I do not get any errors when I run either of the above loops; they run, but I think the result is just not being saved anywhere.
Pulling from my example data frame above, I would like to result to look like:
Team W
2 Boston Celtics 61
3 Philadelphia 76ers 59
4 Washington Bullets 39
5 New York Knicks 39
6 New Jersey Nets 34
8 Atlanta Hawks 50
9 Houston Rockets 41
10 San Antonio Spurs 41
11 Indiana Pacers 37
12 Cleveland Cavaliers 37
13 Detroit Pistons 16
15 Milwaukee Bucks 49
16 Kansas City Kings 47
17 Denver Nuggets 30
And again, I would like to be able to repeat this over many more data frames. Any ideas?
I am pretty new to R, so I might be oblivious to simpler solutions, and simplicity would be much appreciated! Thanks in advance!
Here is an easier way:
wins.1980[grep("Division$", wins.1980$Team, invert = TRUE), ]
grep("Division$"... matches anything that ends in "Division" in the Team column (this is probably safer than choosing anything that ends in n, but you could do that with the same technique), and invert = TRUE inverts these matches so you get everything that doesn't end in "Division". Using this to subset gets you all the rows where Team doesn't end in "Division".
You could make this a function to apply to many data frames:
no_div <- function(x) {
x[grep("Division$", x$Team, invert = TRUE), ]
}
This assumes you want to subset them all based on the Team column; if you're using a different column you'd have to modify the function to take an additional argument (a sketch of that follows below). Then call it on your data with no_div(wins.1980).
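A sketch of that more general version, with the column name as a second argument (defaulting to "Team"):
no_div <- function(x, col = "Team") {
  x[grep("Division$", x[[col]], invert = TRUE), ]
}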
You can use grepl like so,
df <- data.frame(Team=c("Boston Celtics","Atlantic Division",
"Central Division","Atlanta Hawks"),
W=sample(10:20, 4))
df <- df[!grepl("n$", df$Team),]
Where "n$" is a regular expression meaning 'string ends with n'
You should be able to use substr and subset to do this.
First, find the rows which end in "Division" (the last 8 characters):
matches <- substr(wins.1980$Team, nchar(wins.1980$Team)-7, nchar(wins.1980$Team)) %in% c("Division")
Then subset the dataframe based on this
wins.1980 <- subset(wins.1980, !matches)
Edit: better example here - https://stackoverflow.com/a/13012423/1502898
If you like the syntax of the dplyr and magrittr packages:
library(dplyr) ; library(magrittr)
wins.1980 %<>% filter(!grepl("Division", Team))
The data.frame (d1.csv) looks like:
Age Height Weight Sport
23 170 60 Judo
33 193 125 Athletics
I have to make a new data.frame, d2, with the top 20 sports, and shall use the character values below, stored in
names(top.20.sports)
[1] "Athletics" "Swimming" "Football" "Rowing"
... and I have to use match() or %in%, for example with subset() on d1 using subset = Sport %in% names(top.20.sports).
I tried several things, but I'm new at this and am missing something...
d2<-subset(d1, (Sport %in% names(top.20.sports)))
gives the whole list, same as with
d2 <- d1[d1$Sport %in% names(top.20.sports),]
match() gives me a bunch (42) of NAs.
d2<-d1[,tolower(names(top.20.sports)) %in% d1[,4]]
gives a data frame with 0 columns and 9038 rows
(the 9038 rows are correct, but where is the data? See the note after the output below).
There was no error, as BondedDust told me: "If subset(d1, (Sport %in% names(top.20.sports))) gives the whole list then .... it is what it is. All of the Sport entries are in the top-20."
But it never was the whole list:
I thought I had 10384 rows, because the last row prints as
10384 24 221 110 Basketball
with Basketball as the last entry. But that row name is not the same as the number of rows:
nrow(d2)
[1] 8009
dim(d2)
[1] 8009 4
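A quick note on that zero-column result: in d1[rows, cols], a condition placed after the comma selects columns, not rows, which is why the tolower() attempt returned a data frame with 0 columns. A sketch of the row-wise form, plus a sanity check on how many Sport entries actually fall inside the top 20 (using the object names above):
## Condition BEFORE the comma selects rows; after the comma it selects columns.
d2 <- d1[d1$Sport %in% names(top.20.sports), ]
## Sanity check: how many rows are inside vs outside the top 20?
table(d1$Sport %in% names(top.20.sports))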
I am using the sapply() function to create a new column of data. First, from my raw data of observations, every patient receives a number between 1-999, each number has a unique description, but they all fall into 1 out of 27 categories. My problem is that the 27 categories are not given in the raw data, so I have to look them up in a dictionary which has the categories that match the numbers 1-999.
Here is the raw data from a data set titled inova9:
ID AgeGroup Race SexCode Org_DRGCode
9 9 75-84 White F 435
10 10 75-84 White F 441
11 11 45-54 White F 301
40 40 14-17 White F 775
70 70 75-84 White F 853
120 120 55-64 White M 395
Here is part of my dictionary:
MSDRG_num MS.DRG_Descriptions_
1 1 Heart transplant or implant of heart assist system w MCC
2 2 Heart transplant or implant of heart assist system w/o MCC
3 3 ECMO or trach w MV 96+ hrs or PDX exc face, mouth & neck w maj O.R.
4 4 Trach w MV 96+ hrs or PDX exc face, mouth & neck w/o maj O.R.
5 5 Liver transplant w MCC or intestinal transplant
6 6 Liver transplant w/o MCC
New_CI_Category
1 Organ Transplant
2 Organ Transplant
3 General/Other Surgery
4 General/Other Surgery
5 Organ Transplant
6 Organ Transplant
here are the 27 categories:
> levels(DRG$New_CI_Category)
[1] "Bariatric Surgery" "Behavioral"
[3] "Cardiovasc Medicine" "CV Surg - Open Heart"
[5] "General/Other Surgery" "GYN Med/Surg"
[7] "Hem/Onc Medicine" "Interventional Cardiology - EP"
[9] "Interventional Cardiology - PCI" "Medicine"
[11] "Neonates" "Neurology"
[13] "Neurosurgery - Brain" "Neurosurgery - Other"
[15] "Normal Newborns" "OB Deliveries"
[17] "OB Other" "Organ Transplant"
[19] "Ortho Medicine" "Ortho Surg - Other"
[21] "Ortho Surgery - Joints" "Rehab"
[23] "Spine" "Thoracic Surgery"
[25] "Unspecified" "Urology Surgery"
[27] "Vascular Procedure - Surgery or IR"
So, I need to match up inova9$Org_DRGCode with MSDRG_num from my dictionary, then pull the corresponding category from DRG$New_CI_Category.
I implemented the following:
ServiceLine1 = matrix(nrow=length(inova9$Org_DRGCode),ncol=1)
ServiceLine1 = sapply(1:length(inova9$Org_DRGCode),function(i)as.character(DRG$New_CI_Category[DRG$MSDRG_num==inova9$Org_DRGCode[i]]))
Svc = as.factor(ServiceLine1)
inova9 = data.frame(inova9,Svc)
As you can see, I created a column and now I can merge it with my original data, one-to-one.
I have four data sets like this, but it only works for two of them. For the other two I receive this error:
> Svc = as.factor(ServiceLine2)
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
And my data looks like this:
[[1]]
[1] "Neurology"
[[2]]
[1] "Medicine"
[[3]]
[1] "GYN Med/Surg"
[[4]]
[1] "Vascular Procedure - Surgery or IR"
[[5]]
[1] "Neurology"
[[6]]
[1] "Medicine"
How did sapply() turn my matrix into a list, and how do I stop it from happening?
You might save yourself a headache by converting to data.table, setting a key on each table, then simply joining.
library(data.table)
DT.inova <- as.data.table(inova9)  # the patient data
DT.dict  <- as.data.table(DRG)     # the dictionary
## Set the keys to the columns you want to join on
setkey(DT.inova, Org_DRGCode)
setkey(DT.dict, MSDRG_num)
## Assign the category from DT.dict into DT.inova, joining on the keys
DT.inova[DT.dict, New_CI_Category := i.New_CI_Category]
Make sure the keys are of the same type, meaning that they are both factor, both character, both integer, etc.
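For example, if MSDRG_num was read in as a factor while Org_DRGCode is stored as an integer, you could align them before setting the keys (a sketch; whether this step is needed depends on how your data were read in):
DT.dict[, MSDRG_num := as.integer(as.character(MSDRG_num))]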
This happens because sapply is a wrapper for lapply that tries to be smart about its return structure. When, for whatever reason, it can't figure it out, it will always fall back to a list because that is what lapply returns.
Now, I'm not entirely sure why that's happening here. Just reading your code, I would also expect sapply to return a vector and not a list. One possibility is that, for some value of i, the expression as.character(DRG$New_CI_Category[DRG$MSDRG_num==inova9$Org_DRGCode[i]]) has length greater than one, or length zero (a code with no match in the dictionary); either prevents simplification. You can check this on the data set that fails with any(sapply(ServiceLine2, length) != 1).
In any case, the function unlist will compress a list down to a vector, so you can do as.factor(unlist(ServiceLine2)); just be aware that unlist silently drops zero-length elements, which would leave the result misaligned with your rows.
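Alternatively, a vectorised lookup with match() sidesteps the list problem entirely, because match() always returns a plain vector and codes missing from the dictionary simply come back as NA (a sketch using the objects from the question):
idx <- match(inova9$Org_DRGCode, DRG$MSDRG_num)
inova9$Svc <- factor(as.character(DRG$New_CI_Category)[idx])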
I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.
For example, rows #74807 and #74809 in this case would be selected, but not row #74823, because its elasmo.name is "Skates" while its tspp.name is "Northern shrimp".
I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.
> catch[4:6,]
gear tripID obsID sortie setID date time NAFO lat long dur depth bodymesh
74807 GRL2 G00001 A 1 13 2000-01-04 13:40:00 2H 562550 594350 2.000000 377 80
74809 GRL2 G00001 A 1 14 2000-01-04 23:30:00 2H 562550 594350 2.166667 370 80
74823 GRL2 G00001 A 1 16 2000-01-05 07:45:00 2H 561450 593050 3.000000 408 80
codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp tspp.name elasmo
74807 45 NA NA NA NA NA 3.3 2211 Northern shrimp 2211
74809 45 NA NA NA NA NA 3.2 2211 Northern shrimp 2211
74823 45 NA NA NA NA NA 3.3 2211 Northern shrimp 211
elasmo.name kept discard Tcatch date.1 latitude longitude EID
74807 Northern shrimp 2747 50 2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919 100 5019 2000-01-04 56.91667 -60.21667 G00001-14
74823 Skates 0 50 50 2000-01-05 56.73333 -60.00000 G00001-16
fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid
I know what the problem is: you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all the methods suggested above will fail (as you've discovered!).
Once you've read in the data correctly, you can use either
catch[which(catch$tspp.name == catch$elasmo.name),]
or
subset(catch, tspp.name == elasmo.name)
to obtain the matching rows. Do not omit the which in the first one, otherwise the comparison with NA values will pull in unwanted all-NA rows.
Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.
First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...
col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
Let's load it in the default way first (so the strings are converted to factors), just to see the methods proposed above fall over:
> dat=read.csv("F:/test.dat",sep="~") # don't forget to check the filename
> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different
This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
Now let's do the same, but this time reading in the data "as is":
> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE) # note the as.is=TRUE
> dat[which(dat$col1==dat$col2),]
col1 col2
2 a a
3 b b
> dat[dat$col1==dat$col2,]
col1 col2
2 a a
3 b b
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
> subset(dat,col1==col2)
col1 col2
2 a a
3 b b
As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:
dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
stringsAsFactors=FALSE)
Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!
Hope this helps :)