Using row number to create a 0/1 column in R

I want to create a new column in my dataset that is 1 when 'death_code' contains an 'I' (could be I001-I100) and 0 otherwise.
death_code
I099
E045
T054
I065
I022
I have used grepl to search for rows in a variable which contain 'I' and saved the row numbers
rows<-which(grepl('I', fulldata$deathcode))
However, I now want to assign a 1 to these rows in a new column and I cannot work out how to do this.
This is what I anticipate the data to look like
death_code CVD_death
I099 1
E045 0
T054 0
I065 1
I022 1

Instead of using which, use as.integer on the grepl result - TRUE/FALSE will be converted to 1/0.
fulldata$CVD_death <- as.integer(grepl("I", fulldata$deathcode))
Alternately, you could do it with which by setting all values in the column to 0, and then setting the which values to 1:
fulldata$CVD_death <- 0
fulldata$CVD_death[which(grepl("I", fulldata$deathcode))] <- 1
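As a minimal sketch, the as.integer approach can be checked on the five sample codes from the question. The data frame is rebuilt here for illustration, and anchoring the pattern with ^ is an assumption: it restricts the match to codes that start with I rather than codes that contain an I anywhere.

```r
# Sample data rebuilt from the question (the column is named death_code here)
fulldata <- data.frame(death_code = c("I099", "E045", "T054", "I065", "I022"),
                       stringsAsFactors = FALSE)

# grepl() returns TRUE/FALSE; as.integer() turns that into 1/0.
# "^I" only matches codes that begin with I (an assumption about the intent).
fulldata$CVD_death <- as.integer(grepl("^I", fulldata$death_code))

fulldata$CVD_death
# [1] 1 0 0 1 1
```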

Using a stringr approach:
library(dplyr)
library(stringr)
df %>% mutate(CVD_death = case_when(str_detect(death_code, '^I\\d{3}') ~ 1, TRUE ~ 0))
# A tibble: 5 x 2
death_code CVD_death
<chr> <dbl>
1 I099 1
2 E045 0
3 T054 0
4 I065 1
5 I022 1

Another option is to use unary + to convert the logical to integer:
fulldata$CVD_death <- +(grepl("I", fulldata$deathcode))

Related

looping within a variable in panel data using loop in R

I have a panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is like
dta <- data.frame(cbind(id,intm)) which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1; # new column to be created
# loop
for (j in unique(dta$id)){ # let's separate the unique values of ID
index <- which(dta$id==j) # which row index satisfy id==1, or id==2, ...?
value <- dta$intm[index[1]] # which value of intm corresponds to the first value of the index?
new_column[i:tail(index,n = 1)] <- rep(value,nrow(dta[dta$id==j,])) # repeat this value once for every row that has this ID
i <- tail(index,n = 1)+1 # the new_column component must start with its last value + 1
}
dta <- cbind(dta,new_column)
Alternatively, you can use the subset() function, i.e
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 = df %>% group_by(id) %>% do({
.$intm = .$intm[1, drop = TRUE]
.
})
You can also try the data.table library, which should be faster.
BTW: you do not need cbind to make a data.frame.
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))
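A runnable sketch of the ave approach on the data from the question. The match() line is an alternative that does not come from the answers above: match(id, id) returns the position of the first occurrence of each id, so indexing intm with it picks up the first value per group.

```r
id   <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
dta  <- data.frame(id, intm)

# ave() applies FUN within each id; the single value is recycled over the group
first_by_ave <- with(dta, ave(intm, id, FUN = function(x) x[1]))

# Alternative (an assumption, not part of the answers): first-occurrence lookup
first_by_match <- dta$intm[match(dta$id, dta$id)]

first_by_ave
# [1] 1 1 1 1 1 1 1 0 0 0 0 0
```

Both versions are vectorised, which may matter for a big data set more than the explicit loop does.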

Check and replace column values in R dataframe

I have multiple files to read in using R. I iterate through the files in a loop, obtain dataframes and then try to change values of a particular column. Examples of the R dataframes are as follows:
df_A:
ID ZN
1 0
2 1
3 1
4 0
df_B:
ID ZN
1 2
2 1
3 1
4 2
As shown above, the column 'ZN' for some dataframes may have 0's and 1's, and other dataframes have 1's and 2's. As I'm iterating through the files, I want to make changes only in the dataframes where column ZN has 1's and 2's, like this: 1 to 0 and 2 to 1. Dataframes with ZN values of 0's and 1's will be left unchanged.
My attempt did not work:
if(dataframe$ZN > 1){
dataframe$ZN<-recode(dataframe$ZN,"1=0;2=1")
}
else{
dataframe$ZN
}
Any solutions please?
One approach might be to decrement the value of ZN by one if we detect a single value of 2 anywhere in the column:
if (max(df_A$ZN) == 2) {
df_A$ZN = df_A$ZN - 1
}
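The check above can be wrapped in a small helper and applied to every data frame as it is read. This is only a sketch: recode_zn is a hypothetical name, the two data frames are rebuilt from the question, and it assumes ZN only ever holds 0/1 or 1/2. Note that max() collapses the column to a single value, which is what if() needs, whereas a condition such as dataframe$ZN > 1 yields one value per row.

```r
df_A <- data.frame(ID = 1:4, ZN = c(0, 1, 1, 0))
df_B <- data.frame(ID = 1:4, ZN = c(2, 1, 1, 2))

# Hypothetical helper: shift 1/2 coding down to 0/1, leave 0/1 coding alone
recode_zn <- function(df) {
  if (max(df$ZN, na.rm = TRUE) == 2) {
    df$ZN <- df$ZN - 1
  }
  df
}

recoded <- lapply(list(A = df_A, B = df_B), recode_zn)
recoded$B$ZN
# [1] 1 0 0 1
```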
If there are only two values i.e. 0 and 1, then
df_A$ZN <- (df_A$ZN==0) + 1
df_A$ZN
#[1] 2 1 1 2
Or using case_when for multiple values
library(dplyr)
df_A %>%
mutate(ZN = case_when(ZN==0 ~2, TRUE ~ 1))

compare current cell and previous cell in excel style without loop

I want to create an indicator variable by comparing the current value of a variable with the previous value. The logic is: if current value = previous value, then indicator = 1, else 0. The first indicator value is NA because there is nothing to compare it with.
It needs to be fast because I have lots of groups to compare in my data (I did not include the group for simplicity).
> dt<-c('a','a','a','b','a','a','c','c')
> indicator
[1] NA 1 1 0 0 1 0 1
Using base R, you can drop the last element with head() and the first element with tail(), compare the two vectors, and then add the NA to the front.
c(NA, as.numeric(head(dt, -1) == tail(dt, -1)))
If dt were a vector of numbers, you could use diff like
dn <- c(1,1,1,2,1,1,3,3)
c(NA, (diff(dn)==0)+0)
(using +0 rather than as.numeric to make the booleans 1's and 0's.)
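Since the question mentions groups, here is a sketch of how the head/tail comparison might be applied within each group. same_as_prev and grp are hypothetical names, and the grouped data is made up for illustration.

```r
# Hypothetical helper: 1 if an element equals the previous one, 0 otherwise, NA first
same_as_prev <- function(x) {
  c(NA, as.integer(head(x, -1) == tail(x, -1)))
}

dt <- c('a','a','a','b','a','a','c','c')
same_as_prev(dt)
# [1] NA  1  1  0  0  1  0  1

# Made-up grouping: the comparison restarts at the first row of each group
vals <- c('a','a','b','b','b','a')
grp  <- c(1, 1, 1, 2, 2, 2)
by_group <- unsplit(lapply(split(vals, grp), same_as_prev), grp)
by_group
# [1] NA  1  0 NA  1  0
```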
You can use Lag from the Hmisc package,
dropping the first value with [-1] and adding NA at the beginning.
library(Hmisc)
c(NA, as.numeric(dt== Lag(dt))[-1])
#[1] NA 1 1 0 0 1 0 1
You could also use rle in base R:
v <- rle(dt)[[1]]
x <- rep(1:length(v),v)
indicator <- c(NA, (diff(x)==0)*1)
#[1] NA 1 1 0 0 1 0 1
v: gets the number of times each character is repeated
x: contains the respective numeric vector from dt to benefit from diff
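The intermediate values of the rle approach, written out as a small sketch on the same dt vector:

```r
dt <- c('a','a','a','b','a','a','c','c')

v <- rle(dt)$lengths          # run lengths: 3 1 2 2
x <- rep(seq_along(v), v)     # run id per element: 1 1 1 2 3 3 4 4

# diff(x) is 0 wherever the run id did not change, i.e. current == previous
indicator <- c(NA, (diff(x) == 0) * 1)
indicator
# [1] NA  1  1  0  0  1  0  1
```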

Assign id by cluster in R

I have a vector like this
var1=c("A","A","B"," "," ","C","A","","A")
How can I create a vector of ids indicating which elements are adjacent? Like
id1=c(1,1,1,0,0,2,2,0,3)
So I want to assign an id to each cluster. Is there a way to do that in R?
We can cumsum on the diff of var1 to generate a sequence representing the clusters including empty strings and then replace empty string positions with 0:
replace(cumsum(c(T, diff(var1 != "") == 1)), var1 == "", 0)
gives:
# [1] 1 1 1 0 0 2 2 0 3
for:
var1=c("A","A","B","","","C","A","","A")
This assumes var1 does not start with an empty string. To generalize it to that case, we can check the first element of var1 and use the condition as the initial value:
replace(cumsum(c(var1[1] != "", diff(var1 != "") == 1)), var1 == "", 0)
gives:
# [1] 0 1 1 1 0 0 2 2 0 3
for:
var1=c("", "A","A","B","","","C","A","","A")
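The one-liner can be unpacked step by step. This sketch uses the vector from the question, which contains " " as well as ""; wrapping the test in nzchar(trimws(...)) so that both count as empty is an assumption about the intent, and nonempty, starts and ids are illustrative names.

```r
var1 <- c("A","A","B"," "," ","C","A","","A")

# TRUE for real values, FALSE for "" or whitespace-only entries
nonempty <- nzchar(trimws(var1))

# A cluster begins wherever nonempty flips from FALSE to TRUE (diff == 1);
# the first element starts a cluster only if it is itself non-empty
starts <- c(nonempty[1], diff(nonempty) == 1)

# Running count of cluster starts, with empty positions reset to 0
ids <- replace(cumsum(starts), !nonempty, 0)
ids
# [1] 1 1 1 0 0 2 2 0 3
```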
Here is one option with rle. We remove the leading/trailing whitespace with trimws, convert to a logical vector (nzchar) based on whether each element is a non-empty string, and get the run-length encoding (rle). Then we change the 'values' vector in the list 'rl' to a sequence where it is TRUE, and replicate the values by the lengths.
rl <- rle(nzchar(trimws(var1)))
rl$values[rl$values] <- seq_along(rl$values[rl$values])
rep(rl$values, rl$lengths)
#[1] 1 1 1 0 0 2 2 0 3
data
var1=c("A","A","B"," "," ","C","A","","A")

appending to a data frame row by row character formatting issue

I am trying to build a dataframe from the output of a mapply.
Here is one example of my output.
> out[1:9,1]
$statistic
X-squared
1311.404
$parameter
df
1
$p.value
[1] 1.879366e-287
$estimate
prop 1 prop 2
0.001680737 0.009517644
$null.value
NULL
$conf.int
[1] -1.000000000 -0.007153045
attr(,"conf.level")
[1] 0.95
$alternative
[1] "less"
$method
[1] "2-sample test for equality of proportions with continuity correction"
$data.name
[1] "members out of enrolled"
I want to put these values into a dataframe. I have 1684 rows in this matrix. I want a dataframe with 1684 rows.
I also have codes from outside of this data that I want to incorporate into the dataframe. These are strings from fwa$proc.
> out[,1]$p.value
[1] 1.879366e-287
> out[,1]$estimate[[1]]
[1] 0.001680737
> out[,1]$estimate[[2]]
[1] 0.009517644
> as.character(fwa$proc[1])
[1] "10022"
I have looked here for support for doing this. I am creating a dataframe first and then attempting to fill my dataframe from another dataframe row by row as such...
n<-1684
new.df <- data.frame(cpt=character(n), FFS_prop=numeric(n), PHN_prop=numeric(n)
, differnce=numeric(n), results=character(n), Null_HO = character(n), Alt_HA=character(n), stringsAsFactors=FALSE)
Here is the head.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Now to fill data row by row...
for (i in 1:n) new.df[i, ] <- data.frame(cpt = toString(fwa$proc[i])
,FFS_prop=round(out[,i]$estimate[[1]],5)
,PHN_prop=round(out[,i]$estimate[[2]],5)
,differnce=round(out[,i]$estimate[[1]]-out[,i]$estimate[[2]],5)
,results=if(out[,i]$p.value <.05) {"Reject NUll"} else {"Fail to Reject Null"}
,Null_HO = toString('FFS = pHN')
,Alt_HA = toString('FFS < PHN')
)
Here is the head after the code runs.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 1 0.00168 0.00952 -0.00784 1 1 1
2 1 0.00033 0.00142 -0.00109 1 1 1
3 1 0.00239 0.01461 -0.01222 1 1 1
4 1 0.00135 0.00919 -0.00783 1 1 1
5 1 0.00008 0.00180 -0.00172 1 1 1
6 1 0.00036 0.00177 -0.00141 1 1 1
Please friends, why don't my strings make it into the data dataframe?
I have tried to put as.character() around them, toString() around them all for naught.
Wiser ones please advise.
Thanks.
You can either set options(stringsAsFactors=F) or set stringsAsFactors=F in the data.frame call inside your loop. The problem is that because you are building a new data.frame in each iteration, it doesn't know about the rules you've set on the data.frame that it's going to be added to later. So at the time of creation, it converts its values to a factor, which is stored as a unique integer for each observed character string. Since you are only adding one value, each factor has one level, so they are each coded as the integer 1.
Then when you go to do the assignment to the master data.frame, that integer 1 is converted to a character "1". So the str(new.df) should show that your character columns are still characters, they just happen to contain the character "1" for each row.
Building data.frames row-by-row is always a messy process that should be avoided if at all possible. It's better to build your data column-wise and then build your data.frame at the end. You said that out was the result of using mapply on a prop.test, so I've created a sample
out<-mapply(prop.test, replicate(10, rbinom(1, size = 100, prob = .5)), 100)
That gives something that matches your out with only 10 columns I believe. But then you can extract all the p-values with
apply(out, 2, '[[', "p.value")
and all of your FFS values with
apply(out, 2, function(x) x$estimate[[1]])
so your data.frame construction would look more like
new.df<- data.frame(cpt = fwa$proc
,FFS_prop=apply(out, 2, function(x) x$estimate[[1]])
,PHN_prop=apply(out, 2, function(x) x$estimate[[2]])
,pval = apply(out, 2, '[[', "p.value")
,Null_HO = 'FFS = pHN'
,Alt_HA = 'FFS < PHN'
,stringsAsFactors=F
)
new.df <- transform(new.df,
differnce=FFS_prop-PHN_prop
,results=ifelse(pval<.05, "Reject NUll", "Fail to Reject Null")
)
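A self-contained sketch of the same column-wise idea on simulated data. It swaps in lapply/vapply over a plain list of test results instead of apply over the mapply matrix; successes, tests, est, pval and res are illustrative names, and the simulated one-sample tests only stand in for the real two-sample ones.

```r
set.seed(1)
successes <- rbinom(10, size = 100, prob = 0.5)

# Keep each prop.test() result as one list element
tests <- lapply(successes, function(s) prop.test(s, 100))

# Pull one field out of every result, then build the data frame in one go
est  <- vapply(tests, function(ht) unname(ht$estimate[1]), numeric(1))
pval <- vapply(tests, function(ht) ht$p.value, numeric(1))

res <- data.frame(est = est,
                  pval = pval,
                  results = ifelse(pval < .05, "Reject Null", "Fail to Reject Null"),
                  stringsAsFactors = FALSE)
str(res)
```

Because every column is built as a whole vector, the character columns never pass through a one-level factor, so the "1" problem from the row-by-row loop does not arise.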
