How to filter alphanumeric characters range?

I need to create dummy variables using ICD-10 codes. For example, chapter 2 starts with C00 and ends with D48X. Data looks like this:
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
LINHAA2 = c("U071", "C99", "D68X", "J061"),
LINHAA3 = c("D48X", "Y098", "X223", "D640"))
Then I need to create a column that receives 1 if it's between the C00-D48X range and 0 if it's not. The result I desire:
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
B342 U071 D48X 1
C000 C99 Y098 1
D450 D68X X223 1
0985 J061 D640 0
It needs to go through LINHAA1 to LINHAA3. Thanks in advance!

This should do it:
as.numeric(apply(apply(data, 1,
function(x) { x >="C00" & x <= "D48X" }), 2, any))
[1] 1 1 1 0
A little explanation: Whether a code falls in the range can be checked using alphabetic order, which is what <= and >= give you for character strings. The inner apply checks each element and produces a matrix of logical values. The outer apply uses any to check whether at least one of the three logical values in a row is true. as.numeric changes the result from TRUE/FALSE to 1/0.
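The same range check can be sketched without the nested apply calls, since the comparison operators are vectorized over a character matrix. This is only a minimal sketch using the question's data; the name in_range is made up for illustration. Note that string comparison in R follows the collation order of the current locale, so it may be worth checking the result on your own system.

```r
# Data from the question
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
                   LINHAA2 = c("U071", "C99", "D68X", "J061"),
                   LINHAA3 = c("D48X", "Y098", "X223", "D640"))

# Character matrix, so the comparisons below are applied element-wise
m <- as.matrix(data)

# in_range is a hypothetical name: TRUE where a code sorts between the bounds
in_range <- m >= "C00" & m <= "D48X"

# A row gets 1 if any of its codes is in range, 0 otherwise
data$CHAPTER2 <- as.numeric(rowSums(in_range) > 0)
data$CHAPTER2
```

rowSums counts the TRUE values per row, so comparing the count with 0 plays the role of any.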

This is the typical case for dplyr::if_any. if_any returns TRUE if a given condition is met in any of the tested columns, rowwise:
library(dplyr)
data %>%
mutate(CHAPTER2 = +if_any(starts_with("LINHAA"),
~.x >= 'C00' & .x <='D48X'))
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
1 B342 U071 D48X 1
2 C000 C99 Y098 1
3 D450 D68X X223 1
4 0985 J061 D640 0

Using the dedicated icd package:
# remotes::install_github("jackwasey/icd")
library(icd)
#get the 2nd chapter start and end codes
ch2 <- icd::icd10_chapters[[ 2 ]]
# start end
# "C00" "D49"
#expand the codes to include all chapter 2 codes
ch2codes <- expand_range(ch2[ "start" ], ch2[ "end" ])
# length(ch2codes)
# 2094
#check if codes in a row match
ix <- apply(data, 1, function(i) any(i %in% ch2codes))
# [1] FALSE TRUE FALSE FALSE
data$chapter2 <- as.integer(ix)
#data
# LINHAA1 LINHAA2 LINHAA3 chapter2
# 1 B342 U071 D48X 0
# 2 C000 C99 Y098 1
# 3 D450 D68X X223 0
# 4 0985 J061 D640 0
Note that you have some invalid codes:
#invalid
is_defined("D48X")
# [1] FALSE
explain_code("D48X")
# character(0)
#Valid
is_defined("D48")
# [1] TRUE
explain_code("D48")
# [1] "Neoplasm of uncertain behavior of other and unspecified sites"

Related

Using row number to create a 0/1 column in R

I want to create a new column in my dataset for when 'death_code' contains an 'I' (could be I001-I100) then it would return a 1, otherwise it would return a 0
death_code
I099
E045
T054
I065
I022
I have used grepl to search for rows in a variable which contain 'I' and saved the row numbers
rows<-which(grepl('I', fulldata$deathcode))
However, I now want to assign a 1 to these rows in a new column and I cannot work out how to do this.
This is what I anticipate the data to look like
death_code CVD_death
I099 1
E045 0
T054 0
I065 1
I022 1

Instead of using which, use as.integer on the grepl result - TRUE/FALSE will be converted to 1/0.
fulldata$CVD_death <- as.integer(grepl("I", fulldata$deathcode))
Alternately, you could do it with which by setting all values in the column to 0, and then setting the which values to 1:
fulldata$CVD_death <- 0
fulldata$CVD_death[which(grepl("I", fulldata$deathcode))] <- 1

Using a stringr approach:
library(dplyr)
library(stringr)
df %>% mutate(CVD_death = case_when(str_detect(death_code, '^I\\d{3}') ~ 1, TRUE ~ 0))
# A tibble: 5 x 2
death_code CVD_death
<chr> <dbl>
1 I099 1
2 E045 0
3 T054 0
4 I065 1
5 I022 1

Another option is to use unary + to convert the logical to integer:
fulldata$CVD_death <- +(grepl("I", fulldata$deathcode))
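One difference between the answers above is worth a small illustration: grepl("I", ...) matches an "I" anywhere in the code, while a pattern anchored with ^ (as in the stringr answer) only matches codes that start with "I". The sketch below uses the codes from the question plus one made-up code, "XI12", which is purely hypothetical and only there to show the difference.

```r
# Codes from the question, plus the hypothetical "XI12" (not from the original data)
death_code <- c("I099", "E045", "T054", "I065", "I022", "XI12")

# Matches an "I" anywhere in the string
anywhere <- as.integer(grepl("I", death_code))

# Matches only codes that begin with "I"
leading <- as.integer(grepl("^I", death_code))

# Base R alternative without a regular expression
prefix <- as.integer(startsWith(death_code, "I"))

data.frame(death_code, anywhere, leading, prefix)
```

If every code in the real data has its letter in the first position, all three variants give the same result.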

R data.table struggling with conditional subsetting when column name is predefined elsewhere

Let's say I have a data table
library(data.table)
DT <- data.table(x=c(1,1,0,0),y=c(0,1,2,3))
column_name <- "x"
x y
1: 1 0
2: 1 1
3: 0 2
4: 0 3
And I want to access all the rows where x = 1, but by using column_name.
The desired output should behave like this:
DT[x==1,]
x y
1: 1 0
2: 1 1
but with x replaced by column_name in the input.
Note that this problem is similar to but not quite the same as Select subset of columns in data.table R, and the solution there (using with=FALSE) doesn't work here.
Here are all the things I've tried. None of them work.
DT[column_name ==1,]
DT[.column_name ==1,]
DT[.(column_name) ==1,]
DT[..column_name ==1,]
DT[."column_name" ==1,]
DT[,column_name ==1,]
DT[,column_name ==1,with=TRUE]
DT[,column_name ==1,with=FALSE]
DT[,.column_name ==1,with=TRUE]
DT[,.column_name ==1,with=FALSE]
DT[,..column_name ==1,with=TRUE]
DT[,..column_name ==1,with=FALSE]
DT[,."column_name" ==1,with=TRUE]
DT[,.column_name ==1,with=FALSE]
DT[column_name ==1,with=TRUE]
DT[column_name ==1,with=FALSE]
DT[[column_name==1,]]
subset(DT,column_name==1)
I also have options(datatable.WhenJisSymbolThenCallingScope=TRUE) enabled
There's obviously some kind of lexical trick I'm missing. I've spent several hours looking through vignettes and SO questions to no avail.

I can imagine this was very frustrating for you. I applaud the number of things you tried before posting. Here's one approach:
DT[get(column_name) == 1,]
x y
1: 1 0
2: 1 1
If you need to use column_name in J, you can use get(..column_name):
DT[,get(..column_name)]
[1] 1 1 0 0
The .. instructs evaluation to occur in the parent environment.
Another approach for using a string in either I or J is with eval(as.name(column_name)):
DT[eval(as.name(column_name)) == 1]
x y
1: 1 0
2: 1 1
DT[,eval(as.name(column_name))]
[1] 1 1 0 0

You can subset the column by name and then select rows.
library(data.table)
DT[DT[[column_name]] == 1]
# x y
#1: 1 0
#2: 1 1

A small caveat: using get() directly with paste0() doesn't work. You have to assign the pasted name to a variable first, like:
# Doesn't work:
dt[get(paste0(column_name, 'some_string')) == 1]
# Does work:
this_col_name = paste0(column_name, 'some_string')
dt[get(this_col_name) == 1]

An additional answer I just discovered:
If there are multiple columns named this way and you want to return all of them, don't use get, use mget.
Example:
df <- data.table(x=1:4,y=1:4,z=1:4,w=1:4) # here's my data table
desired_columns <- c("y","z","w") # I want to return only columns y, z and w
If I try:
> df[,get(desired_columns)]
Error in get(desired_columns) : first argument has length > 1
Instead:
> df[,mget(desired_columns)]
y z w
1: 1 1 1
2: 2 2 2
3: 3 3 3
4: 4 4 4
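As a complement to mget, here is a minimal sketch of two other data.table idioms for selecting several columns whose names are stored in a character vector: .SDcols and the .. prefix. It assumes the data.table package is installed and reuses the example data from the answer above; the result names by_sdcols and by_prefix are made up for illustration.

```r
library(data.table)

df <- data.table(x = 1:4, y = 1:4, z = 1:4, w = 1:4)
desired_columns <- c("y", "z", "w")

# .SD is the subset of the data restricted to the columns named in .SDcols
by_sdcols <- df[, .SD, .SDcols = desired_columns]

# The .. prefix looks desired_columns up in the calling scope rather than in df
by_prefix <- df[, ..desired_columns]

by_sdcols
```

Both forms return a data.table containing only y, z and w.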

Assign id by cluster in R

I have a vector like this
var1=c("A","A","B"," "," ","C","A","","A")
How can I create a vector of ids that groups adjacent non-empty elements? Like
id1=c(1,1,1,0,0,2,2,0,3)
So I want to assign an id to each cluster. Is there a way to do that in R?

We can take the cumsum of the diff of var1 != "" to generate a sequence that numbers the clusters (empty strings included) and then replace the empty string positions with 0:
replace(cumsum(c(T, diff(var1 != "") == 1)), var1 == "", 0)
gives:
# [1] 1 1 1 0 0 2 2 0 3
for:
var1=c("A","A","B","","","C","A","","A")
This assumes var1 does not start with an empty string. To generalize it to that case, we can check the first element of var1 and use that condition as the initial value:
replace(cumsum(c(var1[1] != "", diff(var1 != "") == 1)), var1 == "", 0)
gives:
# [1] 0 1 1 1 0 0 2 2 0 3
for:
var1=c("", "A","A","B","","","C","A","","A")

Here is one option with rle. We remove leading/trailing whitespace with trimws, convert to a logical vector based on whether each element is a non-empty string (nzchar), and get the run-length encoding (rle). We then change the elements of the 'values' vector in 'rl' that are TRUE to a sequence, and replicate the values by the lengths:
rl <- rle(nzchar(trimws(var1)))
rl$values[rl$values] <- seq_along(rl$values[rl$values])
rep(rl$values, rl$lengths)
#[1] 1 1 1 0 0 2 2 0 3
data
var1=c("A","A","B"," "," ","C","A","","A")
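The rle approach can be wrapped in a small helper so that it can be checked against both inputs used above, including the one that starts with an empty string. This is only a sketch; cluster_ids is a hypothetical name, and treating whitespace-only strings as empty is an assumption carried over from the trimws step.

```r
# cluster_ids is a hypothetical helper name, not from the original answers
cluster_ids <- function(x) {
  # TRUE for elements that are non-empty after trimming whitespace
  filled <- nzchar(trimws(x))
  runs <- rle(filled)
  # Number the runs of non-empty elements 1, 2, 3, ...; empty runs stay 0
  ids <- integer(length(runs$values))
  ids[runs$values] <- seq_len(sum(runs$values))
  # Expand each run id back to the original length
  rep(ids, runs$lengths)
}

cluster_ids(c("A", "A", "B", " ", " ", "C", "A", "", "A"))
cluster_ids(c("", "A", "A", "B", "", "", "C", "A", "", "A"))
```

Because the ids are assigned per run rather than via diff, a leading empty string needs no special case.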

appending to a data frame row by row character formatting issue

I am trying to build a dataframe from the output of a mapply.
Here is one example of my output.
> out[1:9,1]
$statistic
X-squared
1311.404
$parameter
df
1
$p.value
[1] 1.879366e-287
$estimate
prop 1 prop 2
0.001680737 0.009517644
$null.value
NULL
$conf.int
[1] -1.000000000 -0.007153045
attr(,"conf.level")
[1] 0.95
$alternative
[1] "less"
$method
[1] "2-sample test for equality of proportions with continuity correction"
$data.name
[1] "members out of enrolled"
I want to put these values into a dataframe. I have 1684 rows in this matrix. I want a dataframe with 1684 rows.
I also have codes from outside of this data that I want to incorporate into the dataframe. These are strings from fwa$proc.
> out[,1]$p.value
[1] 1.879366e-287
> out[,1]$estimate[[1]]
[1] 0.001680737
> out[,1]$estimate[[2]]
[1] 0.009517644
> as.character(fwa$proc[1])
[1] "10022"
I have looked here for support for doing this. I am creating a dataframe first and then attempting to fill my dataframe from another dataframe row by row as such...
n<-1684
new.df <- data.frame(cpt=character(n), FFS_prop=numeric(n), PHN_prop=numeric(n)
, differnce=numeric(n), results=character(n), Null_HO = character(n), Alt_HA=character(n), stringsAsFactors=FALSE)
Here is the head.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Now to fill data row by row...
for (i in 1:n) new.df[i, ] <- data.frame(cpt = toString(fwa$proc[i])
,FFS_prop=round(out[,i]$estimate[[1]],5)
,PHN_prop=round(out[,i]$estimate[[2]],5)
,differnce=round(out[,i]$estimate[[1]]-out[,i]$estimate[[2]],5)
,results=if(out[,i]$p.value <.05) {"Reject NUll"} else {"Fail to Reject Null"}
,Null_HO = toString('FFS = pHN')
,Alt_HA = toString('FFS < PHN')
)
Here is the head after the code runs.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 1 0.00168 0.00952 -0.00784 1 1 1
2 1 0.00033 0.00142 -0.00109 1 1 1
3 1 0.00239 0.01461 -0.01222 1 1 1
4 1 0.00135 0.00919 -0.00783 1 1 1
5 1 0.00008 0.00180 -0.00172 1 1 1
6 1 0.00036 0.00177 -0.00141 1 1 1
Please friends, why don't my strings make it into the data frame?
I have tried to put as.character() around them, toString() around them all for naught.
Wiser ones please advise.
Thanks.

You can either set options(stringsAsFactors=F) or set stringsAsFactors=F in the data.frame() call inside your loop. The problem is that because you are building a new data.frame in each iteration, it doesn't know about the rules you've set on the data.frame that it's going to be added to later. So at the time of creation, it converts its character values to factors, which are stored as a unique integer for each observed character string. Since you are only adding one value, each factor has one level, so they are each coded as the integer 1.
Then when you go to do the assignment to the master data.frame, that integer 1 is converted to a character "1". So the str(new.df) should show that your character columns are still characters, they just happen to contain the character "1" for each row.
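The mechanism described above can be reproduced on a plain vector. The sketch below builds the factor explicitly, because since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE and would no longer create one on its own; the variable names are made up for illustration.

```r
# A one-level factor, as an older data.frame() call would have produced
f <- factor("10022")

# The factor is stored as the integer code 1 with a single level "10022"
as.integer(f)

# Assigning the factor into a character vector copies the code, not the label
x <- character(2)
x[1] <- f
x[1]

# Converting to character first keeps the original string
y <- character(2)
y[1] <- as.character(f)
y[1]
```

This is why every character column in the question ends up holding the string "1".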
Building data.frames row-by-row is always a messy process that should be avoided if at all possible. It's better to build the data column-wise and then assemble your data.frame at the end. You said that out was the result of using mapply on a prop.test, so I've created a sample
out<-mapply(prop.test, replicate(10, rbinom(1, size = 100, prob = .5)), 100)
That gives something that matches your out with only 10 columns I believe. But then you can extract all the p-values with
apply(out, 2, '[[', "p.value")
and all of your FFS values with
apply(out, 2, function(x) x$estimate[[1]])
so your data.frame construction would look more like
new.df<- data.frame(cpt = fwa$proc
,FFS_prop=apply(out, 2, function(x) x$estimate[[1]])
,PHN_prop=apply(out, 2, function(x) x$estimate[[2]])
,pval = apply(out, 2, '[[', "p.value")
,Null_HO = 'FFS = pHN'
,Alt_HA = 'FFS < PHN'
,stringsAsFactors=F
)
new.df <- transform(new.df,
differnce=FFS_prop-PHN_prop
,results=ifelse(pval<.05, "Reject NUll", "Fail to Reject Null")
)
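For a version that runs without the original data, here is a minimal column-wise sketch on simulated two-sample tests. Everything in it is an assumption made for illustration: the simulated counts, the sample sizes, the list named tests, and the column names. It uses a plain list of prop.test results and vapply rather than the mapply matrix from the question.

```r
set.seed(1)

# Simulated stand-in for the real data: five two-sample proportion tests
tests <- lapply(1:5, function(i) {
  x <- rbinom(2, size = 1000, prob = c(0.05, 0.08))
  prop.test(x, n = c(1000, 1000), alternative = "less")
})

# Extract one column at a time, then assemble the data frame once
prop1 <- vapply(tests, function(t) t$estimate[[1]], numeric(1))
prop2 <- vapply(tests, function(t) t$estimate[[2]], numeric(1))
pval  <- vapply(tests, function(t) t$p.value, numeric(1))

res <- data.frame(
  id = seq_along(tests),
  prop1 = round(prop1, 5),
  prop2 = round(prop2, 5),
  difference = round(prop1 - prop2, 5),
  results = ifelse(pval < .05, "Reject Null", "Fail to Reject Null"),
  stringsAsFactors = FALSE
)
str(res)
```

vapply checks that each extracted value is a single number, which catches malformed test results early.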

R set value in dependency of another value

For each row of my dataframe, I want to calculate a value from numbers taken from columns of this dataframe. If the calculated value is above 2, I want to set another columns value to 0, else to 1.
x=(df$firstnumber+df$secondnumber)/2
if(x>2){
df$binaryValue=0}
else{ df$binaryValue=1}
this throws the error
the condition has length > 1 and only the first element will be used
because x is a vector
How can I solve this? One way would be to write this as a function and to apply it to the dataframe - are there any other options?
Also, how could I write this to work with apply()?
Thanks in advance

You could simply do...
df$BinaryValue <- ifelse( x > 2 , 0 , 1 )
So you get...
df <- data.frame( x = 1:5 , y = -2:2 )
x <- df$x + df$y
df$BinaryValue <- ifelse( x > 2 , 0 , 1 )
df
# x y BinaryValue
# 1 1 -2 1
# 2 2 -1 1
# 3 3 0 0
# 4 4 1 0
# 5 5 2 0

transform(df, BinaryValue = as.numeric(firstnumber + secondnumber > 4))
There's no need to divide by two in the first place. You could check whether the sum is greater than four. The function as.numeric is employed to transform boolean to numeric (0 and 1) values.
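Since the question also asks about apply(), here is a minimal sketch comparing a row-wise apply call with the vectorized ifelse form. The data values are made up for illustration; the column names follow the question.

```r
# Made-up example values; column names follow the question
df <- data.frame(firstnumber = c(1, 3, 5), secondnumber = c(1, 2, 4))

# Row-wise: each row arrives as a vector r, so a scalar if/else works here
by_apply <- apply(df[, c("firstnumber", "secondnumber")], 1,
                  function(r) if (mean(r) > 2) 0 else 1)

# Vectorized: ifelse evaluates the condition for all rows at once
by_ifelse <- ifelse((df$firstnumber + df$secondnumber) / 2 > 2, 0, 1)

df$binaryValue <- by_ifelse
df
```

Both give the same values; the vectorized form avoids calling a function once per row.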
