Assign id by cluster in R

I have a vector like this
var1=c("A","A","B"," "," ","C","A","","A")
How can I create a vector of ids indicating which elements are adjacent? Like
id1=c(1,1,1,0,0,2,2,0,3)
So I want to assign an id to each cluster. Is there a way to do that in R?

We can take the cumsum of the diff of var1 != "" to generate a sequence that numbers the clusters (empty strings temporarily share the number of the preceding cluster), and then replace the empty-string positions with 0:
replace(cumsum(c(TRUE, diff(var1 != "") == 1)), var1 == "", 0)
gives:
# [1] 1 1 1 0 0 2 2 0 3
for:
var1=c("A","A","B","","","C","A","","A")
This assumes var1 does not start with an empty string. To generalize to that case, we can check the first element of var1 and use the result as the initial value:
replace(cumsum(c(var1[1] != "", diff(var1 != "") == 1)), var1 == "", 0)
gives:
# [1] 0 1 1 1 0 0 2 2 0 3
for:
var1=c("", "A","A","B","","","C","A","","A")
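The same idea can be wrapped in a small helper that also treats whitespace-only strings (as in the question's original var1) as empty. This is only a sketch: the name cluster_id is made up for illustration and does not come from any package.

```r
# Hypothetical helper: number runs of non-empty strings, 0 for empty ones.
cluster_id <- function(x) {
  filled <- nzchar(trimws(x))             # TRUE where there is real content
  prev   <- c(FALSE, head(filled, -1))    # was the previous element filled?
  starts <- filled & !prev                # a cluster starts right after a gap
  ifelse(filled, cumsum(starts), 0L)      # running cluster count, 0 in the gaps
}

var1 <- c("A", "A", "B", " ", " ", "C", "A", "", "A")
id1 <- cluster_id(var1)
id1
# [1] 1 1 1 0 0 2 2 0 3
```

Because a cluster start is detected from the previous element rather than from diff, a leading empty string needs no special case here.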

Here is one option with rle. We remove leading/trailing whitespace with trimws, convert to a logical vector with nzchar (TRUE for non-empty strings), and get the run-length encoding (rle). We then replace the TRUE entries of the 'values' vector in 'rl' with a running sequence and replicate the values by their lengths:
rl <- rle(nzchar(trimws(var1)))
rl$values[rl$values] <- seq_along(rl$values[rl$values])
rep(rl$values, rl$lengths)
#[1] 1 1 1 0 0 2 2 0 3
data
var1=c("A","A","B"," "," ","C","A","","A")
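To see why this works, it may help to look at the intermediate pieces of the run-length encoding. The following sketch only prints what rle produces for the question's data before and after the TRUE runs are numbered.

```r
var1 <- c("A", "A", "B", " ", " ", "C", "A", "", "A")
rl <- rle(nzchar(trimws(var1)))

rl$lengths   # how long each run is
# [1] 3 2 2 1 1
rl$values    # whether each run is non-empty
# [1]  TRUE FALSE  TRUE FALSE  TRUE

# Number only the TRUE runs; the remaining FALSE entries become 0
# when the vector is coerced to integer by the assignment.
rl$values[rl$values] <- seq_len(sum(rl$values))
ids <- rep(rl$values, rl$lengths)
ids
# [1] 1 1 1 0 0 2 2 0 3
```

A leading empty string simply forms a FALSE run, so it gets 0 without any extra handling.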

Related

How to filter alphanumeric characters range?

I need to create dummy variables using ICD-10 codes. For example, chapter 2 starts with C00 and ends with D48X. Data looks like this:
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
LINHAA2 = c("U071", "C99", "D68X", "J061"),
LINHAA3 = c("D48X", "Y098", "X223", "D640"))
Then I need to create a column that receives 1 if it's between the C00-D48X range and 0 if it's not. The result I desire:
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
B342    U071    D48X    1
C000    C99     Y098    1
D450    D68X    X223    1
O985    J061    D640    0
It needs to go through LINHAA1 to LINHAA3. Thanks in advance!
This should do it:
as.numeric(apply(apply(data, 1,
function(x) { x >="C00" & x <= "D48X" }), 2, any))
[1] 1 1 1 0
A little explanation: whether a code falls in the range can be checked using alphabetical order (which is what >= and <= give you for strings). The inner apply checks each element and produces a matrix of logical values. The outer apply uses any to check whether at least one of the three logical values for each original row is TRUE. as.numeric changes the result from TRUE/FALSE to 1/0.
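The two nested apply calls can also be replaced by one comparison on a character matrix. This is a sketch under the assumption that strings are compared in a locale where digits sort before letters and uppercase letters sort alphabetically; the variable names m and in_range are illustrative.

```r
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
                   LINHAA2 = c("U071", "C99", "D68X", "J061"),
                   LINHAA3 = c("D48X", "Y098", "X223", "D640"),
                   stringsAsFactors = FALSE)

m <- as.matrix(data)                     # character matrix, one row per record
in_range <- m >= "C00" & m <= "D48X"     # element-wise string comparison
CHAPTER2 <- as.integer(rowSums(in_range) > 0)  # 1 if any column is in range
CHAPTER2
# [1] 1 1 1 0
```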
This is the typical case for dplyr::if_any. if_any returns TRUE if a given condition is met in any of the tested columns, rowwise:
library(dplyr)
data %>%
mutate(CHAPTER2 = +if_any(starts_with("LINHAA"),
~.x >= 'C00' & .x <='D48X'))
  LINHAA1 LINHAA2 LINHAA3 CHAPTER2
1    B342    U071    D48X        1
2    C000     C99    Y098        1
3    D450    D68X    X223        1
4    0985    J061    D640        0
Using the dedicated icd package:
# remotes::install_github("jackwasey/icd")
library(icd)
#get the 2nd chapter start and end codes
ch2 <- icd::icd10_chapters[[ 2 ]]
# start end
# "C00" "D49"
#expand the range to include all chapter 2 codes
ch2codes <- expand_range(ch2[ "start" ], ch2[ "end" ])
# length(ch2codes)
# 2094
#check if codes in a row match
ix <- apply(data, 1, function(i) any(i %in% ch2codes))
# [1] FALSE TRUE FALSE FALSE
data$chapter2 <- as.integer(ix)
#data
# LINHAA1 LINHAA2 LINHAA3 chapter2
# 1 B342 U071 D48X 0
# 2 C000 C99 Y098 1
# 3 D450 D68X X223 0
# 4 0985 J061 D640 0
Note that you have some invalid codes:
#invalid
is_defined("D48X")
# [1] FALSE
explain_code("D48X")
# character(0)
#Valid
is_defined("D48")
# [1] TRUE
explain_code("D48")
# [1] "Neoplasm of uncertain behavior of other and unspecified sites"
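If installing a package is not an option, one hedged base R alternative is to compare only the three-character category of each code, which sidesteps suffixes such as the trailing X. The sketch assumes, as the question states, that chapter 2 runs from category C00 to D48; the helper name in_chapter2 is hypothetical.

```r
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
                   LINHAA2 = c("U071", "C99", "D68X", "J061"),
                   LINHAA3 = c("D48X", "Y098", "X223", "D640"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when the 3-character category lies in C00..D48
# (assumed chapter 2 boundaries, taken from the question).
in_chapter2 <- function(code) {
  category <- substr(code, 1, 3)
  category >= "C00" & category <= "D48"
}

# Check every column, then combine the logical vectors row-wise with OR
hit <- Reduce(`|`, lapply(data, in_chapter2))
chapter2 <- as.integer(hit)
chapter2
# [1] 1 1 1 0
```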

Using row number to create a 0/1 column in R

I want to create a new column in my dataset that is 1 when 'death_code' contains an 'I' (could be I001-I100) and 0 otherwise.
death_code
I099
E045
T054
I065
I022
I have used grepl to search for rows in a variable which contain 'I' and saved the row numbers
rows<-which(grepl('I', fulldata$deathcode))
However, I now want to assign a 1 to these rows in a new column and I cannot work out how to do this.
This is what I anticipate the data to look like
death_code CVD_death
I099. 1
E045. 0
T054. 0
I065. 1
I022. 1
Instead of using which, use as.integer on the grepl result - TRUE/FALSE will be converted to 1/0.
fulldata$CVD_death <- as.integer(grepl("I", fulldata$deathcode))
Alternately, you could do it with which by setting all values in the column to 0, and then setting the which values to 1:
fulldata$CVD_death <- 0
fulldata$CVD_death[which(grepl("I", fulldata$deathcode))] <- 1
Using a stringr approach:
library(dplyr)
library(stringr)
df %>% mutate(CVD_death = case_when(str_detect(death_code, '^I\\d{3}') ~ 1, TRUE ~ 0))
# A tibble: 5 x 2
death_code CVD_death
<chr> <dbl>
1 I099 1
2 E045 0
3 T054 0
4 I065 1
5 I022 1
Another option is the unary + to convert the logical to integer:
fulldata$CVD_death <- +(grepl("I", fulldata$deathcode))
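One caveat with the unanchored pattern "I" is that it matches the letter anywhere in the code. If only codes that begin with I should count, a base R sketch with startsWith (or the pattern "^I") may be closer to the intent. The code "E0I5" below is invented purely to show the difference.

```r
death_code <- c("I099", "E045", "T054", "I065", "I022")

# 1 only when the code begins with "I"
CVD_death <- as.integer(startsWith(death_code, "I"))
CVD_death
# [1] 1 0 0 1 1

# Hypothetical code with an I in the middle, for illustration only
anywhere <- grepl("I", "E0I5")        # TRUE: matches anywhere in the string
at_start <- startsWith("E0I5", "I")   # FALSE: only the first character counts
```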

How can I use rowSums with conditions to return binary value?

Say I have a data frame with a column for summed data. What is the most efficient way to return a binary 0 or 1 in a new column if any value in columns a, b, or c is NOT zero? rowSums is fine for a total, but I also need a simple indicator if anything differs from a value.
tt <- data.frame(a=c(0,-5,0,0), b=c(0,5,10,0), c=c(-5,0,0,0))
tt[, ncol(tt)+1] <- rowSums(tt)
This yields:
> tt
   a  b  c V4
1  0  0 -5 -5
2 -5  5  0  0
3  0 10  0 10
4  0  0  0  0
The fourth column is a simple sum of the data in the first three columns. How can I add a fifth column that returns a binary 1/0 value if any value differs from a criterion set on the first three columns?
For example, is there a simple way to return a 1 if any of a, b, or c are NOT 0?
as.numeric(rowSums(tt != 0) > 0)
# [1] 1 1 1 0
tt != 0 gives us a logical matrix telling us where there are values not equal to zero in tt.
When the sum of each row is greater than zero (rowSums(tt != 0) > 0), we know that at least one value in that row is not zero.
Then we convert the result to numeric (as.numeric(.)) and we've got a binary vector result.
We can use Reduce
+(Reduce(`|`, lapply(tt, `!=`, 0)))
#[1] 1 1 1 0
One could also use the good old apply loop:
+apply(tt != 0, 1, any)
#[1] 1 1 1 0
The argument tt != 0 is a logical matrix whose entries state whether each value is different from zero. Then apply() with margin 1 performs a row-wise operation that checks whether any of the entries is TRUE. The prefix + converts the logical output into integer 0 or 1. It is a shorthand version of as.integer().
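Since tt already contains the sum column by the time the indicator is computed, it can be safer to restrict the test to the columns of interest and to keep the criterion in a variable. A minimal sketch, where the column names total and flag and the variables cols and criterion are chosen for illustration:

```r
tt <- data.frame(a = c(0, -5, 0, 0), b = c(0, 5, 10, 0), c = c(-5, 0, 0, 0))
tt$total <- rowSums(tt)          # row 2 sums to 0 although a and b are non-zero

cols <- c("a", "b", "c")         # only these columns are tested
criterion <- 0                   # value that counts as "nothing to report"

# 1 if at least one of the selected columns differs from the criterion
tt$flag <- as.integer(rowSums(tt[cols] != criterion) > 0)
tt$flag
# [1] 1 1 1 0
```

Row 2 shows why the indicator is useful next to the total: the sum is 0, yet the flag is 1.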

appending to a data frame row by row character formatting issue

I am trying to build a dataframe from the output of a mapply.
Here is one example of my output.
> out[1:9,1]
$statistic
X-squared
1311.404
$parameter
df
1
$p.value
[1] 1.879366e-287
$estimate
prop 1 prop 2
0.001680737 0.009517644
$null.value
NULL
$conf.int
[1] -1.000000000 -0.007153045
attr(,"conf.level")
[1] 0.95
$alternative
[1] "less"
$method
[1] "2-sample test for equality of proportions with continuity correction"
$data.name
[1] "members out of enrolled"
I want to put these values into a dataframe. I have 1684 rows in this matrix. I want a dataframe with 1684 rows.
I also have codes from outside of this data that I want to incorporate into the dataframe. These are strings from fwa$proc.
> out[,1]$p.value
[1] 1.879366e-287
> out[,1]$estimate[[1]]
[1] 0.001680737
> out[,1]$estimate[[2]]
[1] 0.009517644
> as.character(fwa$proc[1])
[1] "10022"
I have looked here for support for doing this. I am creating a dataframe first and then attempting to fill my dataframe from another dataframe row by row as such...
n<-1684
new.df <- data.frame(cpt=character(n), FFS_prop=numeric(n), PHN_prop=numeric(n)
, differnce=numeric(n), results=character(n), Null_HO = character(n), Alt_HA=character(n), stringsAsFactors=FALSE)
Here is the head.
> head(new.df)
  cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1            0        0         0
2            0        0         0
3            0        0         0
4            0        0         0
5            0        0         0
6            0        0         0
Now to fill data row by row...
for (i in 1:n) new.df[i, ] <- data.frame(cpt = toString(fwa$proc[i])
,FFS_prop=round(out[,i]$estimate[[1]],5)
,PHN_prop=round(out[,i]$estimate[[2]],5)
,differnce=round(out[,i]$estimate[[1]]-out[,i]$estimate[[2]],5)
,results=if(out[,i]$p.value <.05) {"Reject NUll"} else {"Fail to Reject Null"}
,Null_HO = toString('FFS = pHN')
,Alt_HA = toString('FFS < PHN')
)
Here is the head after the code runs.
> head(new.df)
  cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1   1  0.00168  0.00952  -0.00784       1       1      1
2   1  0.00033  0.00142  -0.00109       1       1      1
3   1  0.00239  0.01461  -0.01222       1       1      1
4   1  0.00135  0.00919  -0.00783       1       1      1
5   1  0.00008  0.00180  -0.00172       1       1      1
6   1  0.00036  0.00177  -0.00141       1       1      1
Please friends, why don't my strings make it into the dataframe?
I have tried to put as.character() around them and toString() around them, all for naught.
Wiser ones please advise.
Thanks.
You can either set options(stringsAsFactors=FALSE), or you can set stringsAsFactors=FALSE in the data.frame call in your loop. (Since R 4.0.0 the default is already FALSE, so this only bites on older versions.) The problem is that because you are building a new data.frame in each iteration, it doesn't know about the rules you've set on the data.frame that it's going to be added to later. So at the time of creation, it converts its character values to factors, which are stored as a unique integer for each observed character string. Since you are only adding one value, each factor has one level, so they are each coded as the integer 1.
Then when you go to do the assignment to the master data.frame, that integer 1 is converted to the character "1". So str(new.df) should show that your character columns are still characters; they just happen to contain the character "1" for each row.
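The coercion can be reproduced on a plain vector, without any data.frame involved. This is a minimal sketch of the mechanism only, using the code "10022" from the question:

```r
f <- factor("10022")
label <- as.character(f)   # "10022": the label the factor displays
code  <- as.integer(f)     # 1: the integer code stored underneath

x <- character(1)
x[1] <- f                  # assignment stores the integer code, not the label
x
# [1] "1"

x[1] <- as.character(f)    # converting explicitly keeps the original string
x
# [1] "10022"
```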
Building data.frames row-by-row is always a messy process that should be avoided if at all possible. It's better to build the data column-wise and then assemble your data.frame at the end. You said that out was the result of using mapply on a prop.test, so I've created a sample:
out<-mapply(prop.test, replicate(10, rbinom(1, size = 100, prob = .5)), 100)
That gives something that matches your out with only 10 columns I believe. But then you can extract all the p-values with
apply(out, 2, '[[', "p.value")
and all of your FFS values with
apply(out, 2, function(x) x$estimate[[1]])
so your data.frame construction would look more like
new.df <- data.frame(cpt = fwa$proc
    ,FFS_prop = apply(out, 2, function(x) x$estimate[[1]])
    ,PHN_prop = apply(out, 2, function(x) x$estimate[[2]])
    ,pval = apply(out, 2, '[[', "p.value")
    ,Null_HO = 'FFS = pHN'
    ,Alt_HA = 'FFS < PHN'
    ,stringsAsFactors = FALSE
)
new.df <- transform(new.df
    ,differnce = FFS_prop - PHN_prop
    ,results = ifelse(pval < .05, "Reject NUll", "Fail to Reject Null")
)
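For a version that runs end to end, here is a sketch on simulated two-sample counts. The vectors ffs and phn, the sample size of 1000, and the seed are all invented for illustration. It keeps the test results in a plain list (SIMPLIFY = FALSE) and uses vapply so that each extracted column has a guaranteed type.

```r
set.seed(42)                                   # simulated data, illustration only
ffs <- rbinom(5, size = 1000, prob = 0.01)     # hypothetical FFS counts
phn <- rbinom(5, size = 1000, prob = 0.02)     # hypothetical PHN counts

# One two-sample prop.test per pair of counts, kept as a list of htest objects
tests <- mapply(function(a, b) prop.test(c(a, b), c(1000, 1000),
                                         alternative = "less"),
                ffs, phn, SIMPLIFY = FALSE)

# Build each column as a whole vector, then assemble the data.frame once
res <- data.frame(
  FFS_prop = vapply(tests, function(h) h$estimate[[1]], numeric(1)),
  PHN_prop = vapply(tests, function(h) h$estimate[[2]], numeric(1)),
  pval     = vapply(tests, function(h) h$p.value, numeric(1)),
  stringsAsFactors = FALSE
)
res$differnce <- res$FFS_prop - res$PHN_prop
res$results   <- ifelse(res$pval < 0.05, "Reject Null", "Fail to Reject Null")
str(res)
```

The character column is created from a whole vector at once, so no factor conversion ever gets a chance to interfere.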

R set value in dependency of another value

For each row of my dataframe, I want to calculate a value from numbers taken from columns of this dataframe. If the calculated value is above 2, I want to set another columns value to 0, else to 1.
x=(df$firstnumber+df$secondnumber)/2
if(x>2){
df$binaryValue=0}
else{ df$binaryValue=1}
this throws the error
the condition has length > 1 and only the first element will be used
because x is a vector
How can I solve this? One way would be to write this as a function and to apply it to the dataframe - are there any other options?
Also, how could I write this to work with apply()?
Thanks in advance
You could simply do...
df$BinaryValue <- ifelse( x > 2 , 0 , 1 )
So you get...
df <- data.frame( x = 1:5 , y = -2:2 )
x <- df$x + df$y
df$BinaryValue <- ifelse( x > 2 , 0 , 1 )
df
# x y BinaryValue
# 1 1 -2 1
# 2 2 -1 1
# 3 3 0 0
# 4 4 1 0
# 5 5 2 0
transform(df, BinaryValue = as.numeric(firstnumber + secondnumber <= 4))
There's no need to divide by two in the first place: you can compare the sum with four instead. Since the value should be 0 when the mean is above 2 and 1 otherwise, the comparison is written as <= 4. The function as.numeric is employed to transform logical to numeric (0 and 1) values.
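The question also asked how this could be written with apply(). The sketch below puts the vectorised form and a row-wise apply() form side by side on made-up numbers; inside apply() the condition is a single value per row, so a plain if/else is fine there.

```r
df <- data.frame(firstnumber = c(1, 3, 5), secondnumber = c(1, 1, 2))

# Vectorised: compute the mean per row, then compare the whole vector at once
x <- (df$firstnumber + df$secondnumber) / 2
df$binaryValue <- ifelse(x > 2, 0, 1)

# Row-wise with apply(): r is one row, so mean(r) > 2 has length one
via_apply <- apply(df[c("firstnumber", "secondnumber")], 1,
                   function(r) if (mean(r) > 2) 0 else 1)

df$binaryValue
# [1] 1 1 0
```

Both forms give the same values here; the vectorised one avoids the per-row function call and is usually the more idiomatic choice.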
