R data frame, new vector with condition

Hello, I am new to R and I need some help.
I have data like this:
ID Age Sex
A01 30 m
A02 35 f
B03 45 m
C99 50 m
...
I would like to create a new column Group based on conditions like these:
if data1$Age < 30 then Group = 1
else if data1$Age >= 30 and data1$Age < 40 then Group = 2
else if data1$Age >= 40 and data1$Age < 50 then Group = 3
else (data1$Age >= 50) Group = 4
so that the result looks like this:
ID Age Sex Group
A01 30 m 2
A02 35 f 2
B03 45 m 3
C99 50 m 4
How do I do that in R?

You can try findInterval, which can be used like this (using @Tim's sample data):
> findInterval(data1$Age, c(0, 30, 40, 50))
[1] 2 2 3 4
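To attach the result to the data frame, the call above can be assigned to a new column. The sketch below rebuilds the sample data so that it runs on its own; the extra boundary ages (29, 49.9) are made up for illustration and are not part of the question.

```r
# Sample data from the question
data1 <- data.frame(ID  = c("A01", "A02", "B03", "C99"),
                    Age = c(30, 35, 45, 50),
                    Sex = c("m", "f", "m", "m"))

# Each age gets the index of the last break that is <= the age,
# so the intervals are [0,30), [30,40), [40,50), [50,Inf)
data1$Group <- findInterval(data1$Age, c(0, 30, 40, 50))
data1$Group
# [1] 2 2 3 4

# Hypothetical boundary values, to check the edges of the intervals
edge_groups <- findInterval(c(29, 30, 49.9, 50), c(0, 30, 40, 50))
edge_groups
# [1] 1 2 3 4
```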

Some good old-fashioned Base R will come in handy for your problem:
data1 <- data.frame(ID=c("A01", "A02", "B03", "C99"),
Age=c(30, 35, 45, 50),
Sex=c("m", "f", "m", "m"))
data1$Group[data1$Age < 30] <- 1
data1$Group[data1$Age >= 30 & data1$Age < 40] <- 2
data1$Group[data1$Age >= 40 & data1$Age < 50] <- 3
data1$Group[data1$Age >= 50] <- 4
> data1
ID Age Sex Group
1 A01 30 m 2
2 A02 35 f 2
3 B03 45 m 3
4 C99 50 m 4
By the way, you miscategorized ID A01 in your example. Since his age is 30, he belongs in group 2 according to your logic.

We can also do this with cut:
cut(data1$Age, c(0,seq(30,50,10),Inf), right=FALSE, labels=FALSE)
#[1] 2 2 3 4
EDIT: Based on @thelatemail's comments.
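If readable group names are preferred over integer codes, cut can also return a labelled factor. This is only a sketch: the label strings and the extra age of 25 are assumptions chosen for illustration.

```r
# Hypothetical ages, including one below 30
ages <- c(25, 30, 35, 45, 50)

# right = FALSE makes the intervals closed on the left: [0,30), [30,40), ...
group <- cut(ages,
             breaks = c(0, 30, 40, 50, Inf),
             right  = FALSE,
             labels = c("<30", "30-39", "40-49", "50+"))

as.character(group)
# [1] "<30"   "30-39" "30-39" "40-49" "50+"

# The underlying integer codes match the Group numbers in the question
as.integer(group)
# [1] 1 2 2 3 4
```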

Related

Replace, based on other value in DF

I have a data frame and would like to replace certain values if other values in the same row meet a specific condition, e.g.:
DF <- data.frame(a= c(2,4,67),
b= c("TSS",".","TSS"),
c= c(3,46,5),
d= c(45,"-",47))
resulting in:
a b c d
1 2 TSS 3 45
2 4 . 46 -
3 67 TSS 5 47
Now I'd like to replace values in row 2 column c and d with "." and [2,c], respectively, if the value of [2,b] is ".". The result would look like this:
a b c d
1 2 TSS 3 45
2 4 . . 46
3 67 TSS 5 47
I tried using a for loop, but since I have a huge dataset this takes too much time. Is there a better way to solve this problem?
This should work:
DF <- data.frame(
a = c(2, 4, 67),
b = c("TSS", ".", "TSS"),
c = c(3, 46, 5),
d = c(45, "-", 47),
stringsAsFactors = FALSE
)
DF$d[DF$b == "."] <- DF$c[DF$b == "."]
DF$c[DF$b == "."] <- "."
First we replace the d-Value in rows where b is a "." with the value from c. The second line then replaces the value in c with a ".".
> DF
a b c d
1 2 TSS 3 45
2 4 . . 46
3 67 TSS 5 47
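The same two assignments work unchanged when several rows match, which is what makes this faster than a loop. A small sketch, assuming a made-up fourth row that also has "." in b, and storing the condition once so it is not recomputed:

```r
# Hypothetical data: rows 2 and 4 both have "." in column b
DF <- data.frame(
  a = c(2, 4, 67, 8),
  b = c("TSS", ".", "TSS", "."),
  c = c(3, 46, 5, 9),
  d = c(45, "-", 47, "-"),
  stringsAsFactors = FALSE
)

dot_rows <- DF$b == "."        # logical index of the rows to change

# Order matters: copy c into d first, then overwrite c
DF$d[dot_rows] <- DF$c[dot_rows]
DF$c[dot_rows] <- "."          # column c becomes character here

DF$c
# [1] "3" "." "5" "."
DF$d
# [1] "45" "46" "47" "9"
```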

Calculate medication adherence: cumulative pill shortage

It is difficult to find a proper way to calculate medication adherence. Knowing that no perfect solution exists, I want to calculate the number of days a patient could definitely not have taken his medication, because he didn't have any. I want to do this for every time medication was issued.
I have pharmacy data that looks like this:
x <- data.frame(
patient_id = 1,
issue_date = as.Date( "1990-01-01" ) + cumsum( c( 0, 35, 30, 25, 30 ) ),
no_tablets = 60
)
patient_id issue_date no_tablets
1 1 1990-01-01 60
2 1 1990-02-05 60
3 1 1990-03-07 60
4 1 1990-04-01 60
5 1 1990-05-01 60
I can of course calculate the difference between issue_dates and see whether it equals no_tablets / 2, the number of days covered at a dose of two tablets per day.
But if a patient collects his drugs early, he can wait longer than this period before the next collection, because he has a stock of medication.
I tried to do the calculations on a cumulative number of days and a cumulative number of doses, and then sum all non-negative cumulative pill shortages. However, if a patient is late once and then picks up the medication precisely on time, this number still stands and is then counted multiple times.
Do you have any other ideas how I can do this?
Thank you in advance!
In your example, it looks like this patient's supply of pills continues to exceed the number of days between refills. Without an example of what your desired output would look like, I think the script below may suffice:
x <- data.frame(
patient_id = c(rep(1, 5), rep(2, 5)),
issue_date = c(rep(as.Date( "1990-01-01" ) + cumsum( c( 0, 35, 30, 25, 30 ) ), 2)),
no_tablets = c(rep(60, 5), rep(31, 5))
)
y <- data.frame(
patient_id = c(1, 2),
pills_per_day = c(2, 1)
)
z <- merge(x, y, by = "patient_id")
library(data.table)
z <- data.table(z)
z[, days_excess := (no_tablets/pills_per_day) - ifelse(patient_id == shift(patient_id, 1, type = "lead"),
-as.integer(issue_date - shift(issue_date, 1, type="lead")),
NA)]
print(z)
patient_id issue_date no_tablets pills_per_day days_excess
1: 1 1990-01-01 60 2 -5
2: 1 1990-02-05 60 2 0
3: 1 1990-03-07 60 2 5
4: 1 1990-04-01 60 2 0
5: 1 1990-05-01 60 2 NA
6: 2 1990-01-01 31 1 -4
7: 2 1990-02-05 31 1 1
8: 2 1990-03-07 31 1 6
9: 2 1990-04-01 31 1 1
10: 2 1990-05-01 31 1 NA
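The days_excess column above is computed per refill interval, so a surplus from an early pickup is not carried into the next interval. One possible way to carry stock forward, as the question describes, is a small loop per patient. This is a sketch under assumptions, not a validated adherence measure: the function name days_without is made up, the dose is assumed to be known, and the patient is assumed to take every dose while stock lasts.

```r
# Days without medication in each interval between two issues,
# carrying unused stock (in days) forward to the next interval.
days_without <- function(issue_date, no_tablets, pills_per_day) {
  # Days covered by each issue (recycled if a single value is given)
  supply <- rep_len(no_tablets / pills_per_day, length(issue_date))
  gap <- as.numeric(diff(issue_date))   # days until the next issue
  shortage <- numeric(length(gap))
  stock <- 0                            # leftover days of medication
  for (i in seq_along(gap)) {
    available   <- stock + supply[i]
    shortage[i] <- max(0, gap[i] - available)
    stock       <- max(0, available - gap[i])
  }
  shortage
}

# Patient 1 from the question: 60 tablets per issue, 2 tablets per day
issue_date <- as.Date("1990-01-01") + cumsum(c(0, 35, 30, 25, 30))
short_days <- days_without(issue_date, no_tablets = 60, pills_per_day = 2)
short_days
# [1] 5 0 0 0
```

The result has one value per interval between consecutive issues, so it is one element shorter than the number of issues. A late pickup is counted once, in the interval where it happens, and the surplus from the early pickup in the third interval is simply carried forward.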

Divide vector into groups according to the difference between two neighbouring numbers

My dummy input vector looks like this:
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
What I want: add a group factor to each number. The group is assigned according to the difference between neighbouring numbers.
Example:
Difference (absolute) between 10 and 20 is 10, hence they belong to same group.
Difference between 30 and 20 is 10 - they belong to same group.
Difference between 30 and 70 is 40 - they belong to different groups.
Given maximal difference 20 wanted result is:
x group
10 1
20 1
30 1
70 4
80 4
90 4
130 7
190 8
200 8
My code:
library(data.table)
library(foreach)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
x <- data.table(x, group = 1)
y <- nrow(x)
maxGap <- 20
g <- 1
groups <-
foreach(i = 2:y, .combine = rbind) %do% {
if (x[i, x] - x[i - 1, x] < maxGap) {
g
} else {
g <- i
g
}
}
x[2:y]$group <- as.vector(groups)
My question
The given code works, but it is too slow with large data (more than 10 million rows). Is there a simpler and quicker solution that does not use a loop?
library(IRanges)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
# Each value is padded by `ther` on both sides. If the distance
# between two integers is larger than 2 * ther = 30, their ranges
# do not overlap and they end up in two groups. Otherwise, they
# are in the same group.
ther <- 15
df.1 <- data.frame(val=x, left=x-ther, right=x+ther)
df.ir <- IRanges(df.1$left, df.1$right)
df.ir.re <- findOverlaps(df.ir, reduce(df.ir))
df.1$group <- subjectHits(df.ir.re)
df.1
# val left right group
# 1 10 -5 25 1
# 2 20 5 35 1
# 3 30 15 45 1
# 4 70 55 85 2
# 5 80 65 95 2
# 6 90 75 105 2
# 7 130 115 145 3
# 8 190 175 205 4
# 9 200 185 215 4
An implementation which uses the rleid and shift functions of data.table:
library(data.table)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
DT <- data.table(x)
DT[, grp := rleid(cumsum(x - shift(x,1L,0) > 20))]
which gives:
> DT
x grp
1: 10 1
2: 20 1
3: 30 1
4: 70 2
5: 80 2
6: 90 2
7: 130 3
8: 190 4
9: 200 4
Explanation: With x - shift(x,1L,0) you calculate the difference with the previous observation of x. By comparing it to 20 (i.e.: the > 20 part) and wrapping that in cumsum and rleid a runlength id is created.
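To make the intermediate values visible, the steps can be retraced in base R. This is only an illustration of the logic: c(0, head(x, -1)) stands in for shift(x, 1L, 0), and the rle-based lines stand in for rleid; neither is how data.table implements them.

```r
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)

prev <- c(0, head(x, -1))   # previous observation, 0 as fill for the first
gap  <- x - prev            # difference with the previous observation
gap
# [1] 10 10 10 40 10 10 40 60 10

new_group <- gap > 20       # TRUE where a new group starts
cs <- cumsum(new_group)     # counts the group starts seen so far
cs
# [1] 0 0 0 1 1 1 2 3 3

# Run-length id: number the runs of equal values 1, 2, 3, ...
r   <- rle(cs)
grp <- rep(seq_along(r$lengths), r$lengths)
grp
# [1] 1 1 1 2 2 2 3 4 4
```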
In response to @Roland's comments: you can leave the rleid part out if you set the fill parameter in shift to -Inf:
DT[, grp := cumsum((x - shift(x, 1L, -Inf)) > 20)]
A base R alternative using diff and cumsum:
test <- c(TRUE, diff(x) > 20) #test the differences
res <- factor(cumsum(test)) #groups
#[1] 1 1 1 2 2 2 3 4 4
#Levels: 1 2 3 4
levels(res) <- which(test) #fix levels
res
#[1] 1 1 1 4 4 4 7 8 8
#Levels: 1 4 7 8

How to subtract a column by row?

I want to do a simple subtraction in R, but I don't know how to solve it. I would like to know whether I have to write a loop or whether there is a function for it.
I have a column of numeric values, and I would like to subtract element n-1 from element n.
Time_Day Diff
10 10
15 5
45 30
60 15
Thus, I would like to compute the variable "Diff".
You can also try the dplyr package:
library(dplyr)
mutate(df, dif=Time_Day-lag(Time_Day))
# Time_Day Diff dif
# 1 10 10 NA
# 2 15 5 5
# 3 45 30 30
# 4 60 15 15
Does this do what you need?
Here we save the column as a variable:
c <- c(10, 15, 45, 60)
Now we add a 0 to the beginning and then cut off the last element:
cm1 <- c(0, c)[1:length(c)]
Now we subtract the two:
dif <- c - cm1
If we print that out, we get what you're looking for:
dif # 10 5 30 15
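The same shift-and-subtract idea, written as a sketch with a variable name that does not shadow the built-in c function (time_day is just a name chosen here) and checked against the base function diff:

```r
time_day <- c(10, 15, 45, 60)

# Put a 0 in front and drop the last element, so each position
# holds the previous value
shifted <- c(0, head(time_day, -1))
shifted
# [1]  0 10 15 45

dif <- time_day - shifted
dif
# [1] 10  5 30 15

# diff() gives the differences from the second element on,
# so the first value has to be prepended by hand
same_as_diff <- identical(dif, c(time_day[1], diff(time_day)))
same_as_diff
# [1] TRUE
```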
With diff:
df <- data.frame(Time_Day = c(10, 15, 45, 60))
df$Diff <- c(df$Time_Day[1], diff(df$Time_Day))
df
## Time_Day Diff
##1 10 10
##2 15 5
##3 45 30
##4 60 15
It works fine in dplyr too:
library("dplyr")
df <- data.frame(Time_Day = c(10, 15, 45, 60))
df %>% mutate(Diff = c(Time_Day[1], diff(Time_Day)))

R handling NA values while doing a comparison ifelse [duplicate]

How can I tell the > and < operators to ignore NA values? The code below returns NA for the first row. I want it to return 0, as both conditions fail on that row.
##sum by values
df <- data.frame(sex=c('M','F','M'),occupation=c('Student','Analyst','Analyst'),age=c(NA,6,9), marks=c(34,65,21))
df
#df$counting <- ifelse(df$age > 5 & df$age < 8, 1, 0)
df$counting <- ifelse(df$age > 5 & df$age < 8, 1, 0)+ifelse(df$marks > 60 & df$marks < 70, 1, 0)
df
Please see the following SO post: How to ignore NA in ifelse statement
With respect to your question:
df$counting <- ifelse(df$age > 5 & df$age < 8 & !is.na(df$age), 1, 0) + ifelse(df$marks > 60 & df$marks < 70, 1, 0)
> df
sex occupation age marks counting
1 M Student NA 34 0
2 F Analyst 6 65 2
3 M Analyst 9 21 0
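You could also use cut or findInterval. Note that with its default right = TRUE, cut treats the intervals as (5, 8] and (60, 70], so the upper bounds are included:
df$counting <- colSums(rbind(cut(df$age, c(5,8), labels=FALSE),cut(df$marks, c(60,70), labels=FALSE)), na.rm=TRUE)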
df
# sex occupation age marks counting
#1 M Student NA 34 0
#2 F Analyst 6 65 2
#3 M Analyst 9 21 0
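When the same kind of range check is needed for several columns, the !is.na() test can be moved into a small helper. The function name in_open_interval is made up for this sketch; it is not part of base R.

```r
# TRUE only when x is not NA and lies strictly between lower and upper
in_open_interval <- function(x, lower, upper) {
  !is.na(x) & x > lower & x < upper
}

df <- data.frame(age = c(NA, 6, 9), marks = c(34, 65, 21))

# Adding two logical vectors counts how many conditions hold per row
df$counting <- in_open_interval(df$age, 5, 8) +
  in_open_interval(df$marks, 60, 70)
df$counting
# [1] 0 2 0
```

Because FALSE & NA evaluates to FALSE, a missing value makes the whole condition FALSE instead of NA.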
