Today I was confronted with a bug in my code caused by a data frame subset operation. I would like to know whether what I found is a bug or whether I am violating R semantics.
I am running RHEL x86_64 with R 2.15.2-61015 (Trick or Treat), using subset() from the base package.
The following code should be reproducible; it was run in a clean R console started for the purpose of this test.
> teste <- data.frame(teste0 = c(1, 2, 3), teste1 = c(3, 4, 5))
> teste0 <- 1
> teste1 <- 1
> subset(teste, teste[, "teste0"] == 1 & teste[, "teste1"] == 1)
[1] teste0 teste1
<0 rows> (or 0-length row.names)
> subset(teste, teste[, "teste0"] == teste0 & teste[, "teste1"] == teste1)
  teste0 teste1
1      1      3
2      2      4
3      3      5
However, if I run the logical test outside the subset operation:
> teste[, "teste0"] == teste0 & teste[, "teste1"] == teste1
[1] FALSE FALSE FALSE
I would expect both subset operations to yield an empty data frame; instead, the second one returns the complete data frame. Is this a bug, or am I missing something about R environments and namespaces?
Thank you for your help,
Miguel
In this statement:
subset(teste, teste[, "teste0"] == teste0 & teste[, "teste1"] == teste1)
teste0 means teste$teste0, because subset() evaluates its condition within the data frame before looking in the calling environment. The same goes for teste1.
In this statement:
teste[, "teste0"] == teste0 & teste[, "teste1"] == teste1
teste0 and teste1 are the vectors you defined above (not columns of the data frame), because here the expression is evaluated in the global environment.
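One simple way to sidestep the masking (a sketch, not the only fix) is to give the outside variables names that cannot collide with the columns; target0 and target1 below are illustrative names, not from the question:

```r
teste <- data.frame(teste0 = c(1, 2, 3), teste1 = c(3, 4, 5))
# Names that cannot collide with the columns of `teste`
target0 <- 1
target1 <- 1

# Now the column comparison behaves as the plain logical test does
subset(teste, teste0 == target0 & teste1 == target1)
# [1] teste0 teste1
# <0 rows> (or 0-length row.names)
```

With distinct names, subset() finds target0 and target1 in the calling environment, and the result matches the expectation of an empty data frame.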
I am working with ICD-10 data, and I wish to create a new variable called complications based on the pattern "E1X.9X" using a regular expression, but I keep getting an error. Please help.
dm_2$icd9_9code<- (E10.49, E11.51, E13.52, E13.9, E10.9, E11.21, E16.0)
dm_2$DM.complications<- "present"
dm_2$DM.complications[regexpr("^E\\d{2}.9$",dm_2$icd9_code)]<- "None"
# Error in dm_2$DM.complications[regexpr("^E\\d{2}.9", dm_2$icd9_code)] <-
# "None" : only 0's may be mixed with negative subscripts
I want
icd9_9code complications
E10.49 present
E11.51 present
E13.52 present
E13.9 none
E10.9 none
E11.21 present
This problem has already been solved. The 'icd' R package, which my co-authors and I have maintained for five years, can do this. In particular, it uses standardized sets of comorbidities, including the diabetes with complications you seek, from AHRQ, the original Elixhauser, Charlson, etc.
E.g., for ICD-10 AHRQ, you can see the codes for diabetes with complications here. From icd 4.0, these include ICD-10 codes from the WHO and all years of ICD-10-CM.
icd::icd10_map_ahrq$DMcx
To use them, first just take your patient data frame and try:
library(icd)
pts <- data.frame(visit_id = c("encounter-1", "encounter-2", "encounter-3",
                               "encounter-4", "encounter-5", "encounter-6"),
                  icd10 = c("I70401", "E16", "I70.449", "E13.52", "I70.6", "E11.51"))
comorbid_ahrq(pts)
# and for diabetes with complications only:
comorbid_ahrq(pts)[, "DMcx"]
Or, you can get a data frame instead of a matrix this way:
comorbid_ahrq(pts, return_df = TRUE)
# then you can do:
comorbid_ahrq(pts, return_df = TRUE)$DMcx
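To get from that logical DMcx flag back to the present/None coding you wanted, something like the following may help. This is a sketch: cmb is an invented stand-in for the data frame that comorbid_ahrq(pts, return_df = TRUE) would give you.

```r
# Invented stand-in for the comorbid_ahrq() result
cmb <- data.frame(visit_id = c("encounter-1", "encounter-2"),
                  DMcx = c(FALSE, TRUE))

# Recode the logical flag into the labels from the question
cmb$DM.complications <- ifelse(cmb$DMcx, "present", "None")
cmb$DM.complications
```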
If you give an example of the source data and your goal, I can help more.
It seems like there are a few errors in your code; I'll note them in the code below.
You'll want to start by wrapping your ICD codes in quotes: "E13.9"
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9", "E10.9", "E11.21", "E16.0"))
Next, let's use grepl() to search for the particular ICD pattern. Make sure you're applying it to the proper column: your code above attempts to use dm_2$icd9_code rather than dm_2$icd9_9code.
dm_2$DM.complications <- "present"
dm_2$DM.complications[grepl("^E\\d{2}\\.9$", dm_2$icd9_9code)] <- "None"  # "\\." matches a literal period
Finally,
dm_2
#> icd9_9code DM.complications
#> 1 E10.49 present
#> 2 E11.51 present
#> 3 E13.52 present
#> 4 E13.9 None
#> 5 E10.9 None
#> 6 E11.21 present
#> 7 E16.0 present
A quick side note -- there is a wonderful ICD package you may find handy as well: https://cran.r-project.org/web/packages/icd/index.html
I am trying to make a template of code to filter data. The problem I am having is that there are various levels of categorical data, and if I use the dplyr function filter(), R returns no data when the filtering level is not present in the data.
For example:
library(dplyr)
lease <-c(1,2,1)
year<-c(2010,2011,2010)
beg <-c(1,2,1)
gas<-c(1,2,2)
pelelts<-c(1,2,2)
df<-data.frame(lease, year, beg, gas, pelelts)
df %>%
  mutate_all(as.character) %>%
  filter(lease == 1 | year == 2010) %>%
  filter(beg == 1 & gas == 2) %>%
  filter(pelelts == 3)
this returns <0 rows> (or 0-length row.names), which I believe is because pelelts == 3 doesn't exist (I do get data if I remove that line of code). The problem is that I don't want to check every data set for which levels are there, as that will vary on a subset-by-subset basis. Any help will be much appreciated.
By saying pelelts == 3 you're telling R that you only want rows where pelelts is 3. You need to modify your code to catch the other conditions that are acceptable: if 3 doesn't exist, something else has to happen, or no results will be returned.
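One way to make a filter step a no-op when the level is absent (a sketch of one option, not the only approach) is to apply the condition only when the value actually occurs in the data:

```r
library(dplyr)

df <- data.frame(lease = c(1, 2, 1), year = c(2010, 2011, 2010),
                 beg = c(1, 2, 1), gas = c(1, 2, 2), pelelts = c(1, 2, 2))

out <- df %>%
  mutate_all(as.character) %>%
  filter(lease == 1 | year == 2010) %>%
  filter(beg == 1 & gas == 2) %>%
  # If no row has pelelts == 3, keep everything instead of dropping all rows
  filter(if (any(pelelts == 3)) pelelts == 3 else TRUE)

out
```

Because no row here has pelelts == 3, the last filter() evaluates to TRUE and the surviving row is kept instead of being dropped.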
This code got the required result, thanks to @H5470:
df %>%
  mutate_all(as.character) %>%
  filter(lease == 1 | year == 2010) %>%
  filter(beg == 1 & gas == 2) %>%
  mutate(pelelts = case_when(pelelts %in% '3' ~ '3',
                             pelelts %in% c('0', '1', '2') ~ '',
                             TRUE ~ as.character(pelelts)))
returning:
lease year beg gas pelelts
1 2010 1 2
I am trying to process municipal information in R, and it seems that factors (to be exact, factor()) are the best way to achieve my goal. I am only starting to get the hang of R, so I imagine my problem is probably quite simple.
I have the following example dataframe to share (a tiny portion of Finnish municipalities):
municipality <- c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki", "Kerava")
region <- c("Uusimaa", "Pohjois-Pohjanmaa", "Pirkanmaa",
            "Pohjois-Karjala", "Etelä-Pohjanmaa", "Uusimaa")
myData <- cbind(municipality, region)
myData <- as.data.frame(myData)
By default R converts my character columns into factors, which can be checked with str(myData). Now to the part where my beginner-to-novice R skills end: I can't seem to find a way to apply the factor levels from column region to column municipality.
Let me demonstrate. Instead of having the original result
as.numeric(factor(myData$municipality))
[1] 1 4 6 2 5 3
I would like to get this, the factors from myData$region applied to myData$municipality.
as.numeric(factor(myData$municipality))
[1] 5 4 2 3 1 5
I welcome any help with open arms. Thank you.
To better understand the use of factors in R, have a look here.
If you want to add factor levels, you have to do something like this with your data frame:
> levels(myData$region)
[1] "Etelä-Pohjanmaa"   "Pirkanmaa"         "Pohjois-Karjala"   "Pohjois-Pohjanmaa" "Uusimaa"
> levels(myData$municipality)
[1] "Espoo"     "Joensuu"   "Kerava"    "Oulu"      "Seinäjoki" "Tampere"
> levels(myData$municipality) <- c(levels(myData$municipality), levels(myData$region))
> levels(myData$municipality)
 [1] "Espoo"             "Joensuu"           "Kerava"            "Oulu"              "Seinäjoki"
 [6] "Tampere"           "Etelä-Pohjanmaa"   "Pirkanmaa"         "Pohjois-Karjala"   "Pohjois-Pohjanmaa"
[11] "Uusimaa"
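If the goal is just the integer codes shown in the question (each municipality labelled with its region's code), note that the rows already line up, so it may be enough to take the codes straight from the region column. A sketch against the example data above:

```r
municipality <- c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki", "Kerava")
region <- c("Uusimaa", "Pohjois-Pohjanmaa", "Pirkanmaa",
            "Pohjois-Karjala", "Etelä-Pohjanmaa", "Uusimaa")
myData <- data.frame(municipality, region)

# Each municipality gets the integer code of its region
as.numeric(factor(myData$region))
# [1] 5 4 2 3 1 5
```

This gives 5 4 2 3 1 5, matching the desired output, because both Espoo and Kerava are in Uusimaa and so share code 5.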
I have a large data frame of survey data read from a .csv that looks like this when simplified.
x <- data.frame("q1" = c("yes","no","don’t_know"),
"q2" = c("no","no","don’t_know"),
"q3" = c("yes","don’t_know","don’t_know"))
I want to create a column using rowSums as below
x$dntknw<-rowSums(x=="don’t_know")
I can do it for all the yes and no answers easily, but in my data frame it just generates zeros for the don’t_know's.
I previously had an issue with the apostrophe rendering oddly in don’t_know, and I added encoding = "UTF-8" to my read.table call to fix it. However, now I can't seem to get any R functions to recognise it: I tried gsub("’", "", df), but this didn't work any more than rowSums did.
Is this a problem with the encoding? Is there a regex solution for removing them? What solutions are there for dealing with this?
It is an encoding issue, not a regex one. I am unable to reproduce the problem, and my encoding is set to UTF-8 in R. Try setting the encoding to UTF-8 in R's defaults rather than at read time.
Here is my sample output with your code:
> x
q1 q2 q3 dntknw
1 yes no yes 0
2 no no don’t_know 1
3 don’t_know don’t_know don’t_know 3
> Sys.setlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Here is some more detail that may be helpful.
https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding
As #Drj stated, it is probably an encoding error. When I paste your code into my console, I get
> x$q1
[1] yes no don<U+0092>t_know
Even if the encoding is off, you can still match it using regex:
grepl("don.+t_know", x$q1)
# [1] FALSE FALSE TRUE
Hence, you can calculate the row sums as follows:
x$dntknw <- rowSums(apply(x, 2, function(y) grepl("don.+t_know", y)))
Which results in
> x
q1 q2 q3 dntknw
1 yes no yes 0
2 no no don<U+0092>t_know 1
3 don<U+0092>t_know don<U+0092>t_know don<U+0092>t_know 3
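If normalising the text itself is preferable, the gsub() attempt in the question was close: it failed only because gsub() operates on vectors, not whole data frames. A sketch that applies it column by column, assuming the stray character is U+2019 (the right single quotation mark):

```r
x <- data.frame(q1 = c("yes", "no", "don\u2019t_know"),
                q2 = c("no", "no", "don\u2019t_know"),
                q3 = c("yes", "don\u2019t_know", "don\u2019t_know"),
                stringsAsFactors = FALSE)

# Replace the curly apostrophe in every column, then count per row
x[] <- lapply(x, function(col) gsub("\u2019", "'", col))
x$dntknw <- rowSums(x == "don't_know")
x
```

After the substitution, the plain comparison in rowSums() works, giving counts 0, 1, 3 for the three rows.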
I have a tab-delimited DAT file that I want to read into R. When I import the data using read.delim, my data frame has the correct number of columns, but has more rows than expected.
My datafile represents responses to a survey. After digging a little deeper, it appears that R is creating a new record when there is a "." in a column that represents an open-ended response. It appears that there are times when a respondent may have hit "enter" to add a new line.
Is there a way to get around this? I read the help, but I am not sure how I can tell R to ignore this character in the character response.
Here is an example response that parses incorrectly. This is one response, but you can see that there are returns that put this onto multiple lines when parsed by R.
possible ask for size before giving free tshirt.
Also maybe have the interview in conference rooms instead of tight offices. I felt very cramped.
I would of loved to have gone, but just had to make a choices and had more options then I expected.
I am analyzing the data with SPSS, and the data were brought in fine; however, I need to use R for more advanced modeling.
Any help will be greatly appreciated. Thanks in advance.
There is an 'na.strings' argument. You don't offer any test case, but perhaps you can do this:
read.delim(file="myfil.DAT", na.strings=".")
I think it would be good if you could produce an edit to your question that better demonstrated the problem. I cannot create an error with a simple effort:
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE)
V1 V2 V3
1 a b .
2 c d e
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE, na.strings=".")
V1 V2 V3
1 a b <NA>
2 c d e
(After the clarification that the comments above are not particularly relevant.) This will bring in a field that has a linefeed in it, but it requires that the "field" be quoted in the original file:
> scan(file=textConnection("'a\nb'\nx\t.\nc\td\te\n"), what=list("","","") )
Read 2 records
[[1]]
[1] "a\nb" "c"
[[2]]
[1] "x" "d"
[[3]]
[1] "." "e"
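If the stray line breaks are not quoted in the file, one fallback is to pre-process with readLines() and glue short lines back onto the previous record before parsing. This is a sketch, assuming every real record has a fixed number of tab-separated fields; the sample lines are invented:

```r
# Invented sample: the 2nd physical line is a continuation of the 1st record
raw <- c("a\tb\tpossible ask for size before giving free tshirt.",
         "Also maybe have the interview in conference rooms.",
         "c\td\te")
n_fields <- 3  # expected number of tab-separated fields per record

fixed <- character(0)
for (line in raw) {
  if (length(strsplit(line, "\t")[[1]]) < n_fields && length(fixed) > 0) {
    # Too few fields: treat this line as a continuation of the previous record
    fixed[length(fixed)] <- paste(fixed[length(fixed)], line)
  } else {
    fixed <- c(fixed, line)
  }
}

read.delim(text = paste(fixed, collapse = "\n"), header = FALSE)
```

In a real script, raw would come from readLines("myfil.DAT"), and n_fields from the known number of survey columns.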