Processing of hospital admission data using R

I have a set of hospital admission data that I need to process. I am stuck trying to loop over the data and pick out what I need. Here is an example:
Date Ward
1 A
2 A
3 A
4 A B
5 A
6 A
7 A C
8 C
9 C
10 C
And I need them to be transformed into:
Ward Adm_Date Dis_Date
A 1 4
B 4 4
A 4 7
C 7 10
In words, this is the admission record of patient X, who:
goes to ward A from day 1 to day 4
goes to ward B (maybe an ICU ward) for less than a day on day 4, and moves back to ward A the same day
stays in ward A from day 4 to day 7
moves from ward A to ward C on day 7 and stays in ward C until day 10
I am thinking of using ddply by filtering the ward but it is not OK since B will be "omitted" and the period of time for A is not broken down into 2 pieces.
Any suggestions? Thanks!

dat <- data.frame(Date=1:10,Ward=c(rep("A",3),"A B",rep("A",2),"A C",rep("C",3)))
dat$Ward <- as.character(dat$Ward)
# Change data to a "long" format
Date2 <- rep(dat$Date,nchar(gsub(" ","",dat$Ward)))
Ward2 <- unlist(strsplit(dat$Ward," "))
dat2 <- data.frame(Date=Date2,Ward=Ward2)
dat2$Ward <- as.character(dat2$Ward) # pesky factors!
# Create output
Ward3 <- unlist(strsplit(gsub("(\\w)\\1+","\\1",paste(dat2$Ward,collapse="")),""))
#helper function to find lengths of repeated characters, probably a better way of doing this
repCharLength <- function(str)
{
out <- numeric(0)
tmp <- 1
for (i in 2:length(str))
{
if (str[i]!=str[i-1])
{out<-c(out,tmp)
tmp<-1}
else
tmp <- tmp+1
}
return(c(out,tmp))
}
stays <- repCharLength(dat2$Ward)
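# the first stay starts on the first recorded date; each later stay starts on the date the previous one ended
Adm_Date <- c(dat2$Date[1],dat2$Date[head(cumsum(stays),-1)])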
Dis_Date <- dat2$Date[cumsum(stays)]
dat3 <- data.frame(Ward=Ward3,Adm_Date=Adm_Date,Dis_Date=Dis_Date)
> dat3
Ward Adm_Date Dis_Date
1 A 1 4
2 B 4 4
3 A 4 7
4 C 7 10
A bit more involved than I first thought, and there is probably a better way to get the stay lengths than using the helper function I wrote, but this seems to do the job.
Edit
In light of Spacedman's comment, there is a built-in base R function, rle(), that computes both Ward3 and stays:
Ward3 <- rle(dat2$Ward)$values
stays <- rle(dat2$Ward)$lengths

It's not a complex answer but you can transform your data
X <- data.frame(
Date=1:10,
Ward=c("A","A","A","A B","A","A","A C","C","C","C"),
stringsAsFactors=FALSE
)
w <- strsplit(X$Ward," +")
n <- sapply(w, length)
X_mod <- data.frame(
Date = rep(X$Date, n),
Ward = unlist(w, FALSE, FALSE)
)
With X_mod you can write a vectorized (i.e. fast) solution. As a start, with(X_mod, c(0,cumsum(Ward[-1]!=Ward[-length(Ward)]))) gives you an id for each visit.
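One way the long-format idea above might be carried through to the requested table is sketched below. It assumes, as in the example, that a row listing two wards means a same-day transfer, so each stay is admitted on the date the previous stay ended. The names runs, first, last and stays_tbl are illustrative only and do not come from the answers.

```r
# Example data from the question
X <- data.frame(
  Date = 1:10,
  Ward = c("A", "A", "A", "A B", "A", "A", "A C", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Long format: one row per (date, ward) pair
w <- strsplit(X$Ward, " +")
X_mod <- data.frame(
  Date = rep(X$Date, lengths(w)),
  Ward = unlist(w),
  stringsAsFactors = FALSE
)

# Run-length encode the ward sequence: one run per stay
runs  <- rle(X_mod$Ward)
last  <- cumsum(runs$lengths)   # row index where each stay ends
first <- c(1, head(last, -1))   # assumption: a stay begins on the date the previous one ended

stays_tbl <- data.frame(
  Ward     = runs$values,
  Adm_Date = X_mod$Date[first],
  Dis_Date = X_mod$Date[last],
  stringsAsFactors = FALSE
)
stays_tbl
```

Unlike the regular-expression step in the first answer, rle() does not depend on ward names being a single character.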

Related

Find the GROWTH RATE of FaceValue for 5 days in percentage

I'm trying to add another column holding the per-day growth rate of the FaceValue column, in percent.
Day FaceValue
1 ₦72,077,680.94
2 ₦112,763,770.99
3 ₦118,146,250.01
4 ₦74,446,035.80
5 ₦77,026,183.71
here is the code but it's not working
value_performance%>%
mutate(change=(value_performance$FaceValue-lag(FaceValue,5))/lag(FaceValue,5)*100)
Thanks
Three problems:
FaceValue appears to be a string, not numeric; fix that first by stripping the currency symbol and commas and then calling as.numeric;
(Almost) never use value_performance$ inside of a dplyr-pipe verb. ("Almost" because there are rare times when you need it. Otherwise you are at best being inefficient, possibly using incorrect values depending on what is happening in the pipe before its use.); and
You say "per day" but you are lagging by 5. While I'm assuming your real data has more than 5 rows, you are still not calculating by-day.
Try this.
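library(dplyr)
value_performance %>%
  mutate(
    # strip the currency symbol and thousands separators before converting
    FaceValue = as.numeric(gsub("[^0-9.]", "", FaceValue)),
    # change is a fraction here; multiply by 100 for a percentage
    change = (FaceValue - lag(FaceValue))/lag(FaceValue)
  )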
# Day FaceValue change
# 1 1 7.21e+07 NA
# 2 2 1.13e+08 0.5645
# 3 3 1.18e+08 0.0477
# 4 4 7.44e+07 -0.3699
# 5 5 7.70e+07 0.0347
With similar data:
library(dplyr)
Day <- c(1,2,3,4,5)
FaceValue <- c(72077680.94, 112763770.99, 118146250.01, 74446035.80, 77026183.71)
df <- data.frame(Day, FaceValue)
df
df %>%
  mutate(change = 100*(FaceValue/lag(FaceValue)-1))
Results in:
Day FaceValue change
1 1 72077681 NA
2 2 112763771 56.447557
3 3 118146250 4.773234
4 4 74446036 -36.988236
5 5 77026184 3.465796
Not sure what is wrong. Maybe check your data classes and make sure FaceValue is numerical.
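For readers not using dplyr, the same calculation can be sketched in base R. This is only an illustration: it assumes FaceValue arrives as text with a currency symbol and comma separators, as shown in the question, and the helper names amount and previous are made up here.

```r
# Data as shown in the question, with FaceValue stored as text (assumption)
value_performance <- data.frame(
  Day = 1:5,
  FaceValue = c("₦72,077,680.94", "₦112,763,770.99", "₦118,146,250.01",
                "₦74,446,035.80", "₦77,026,183.71"),
  stringsAsFactors = FALSE
)

# Keep only digits and the decimal point, then convert to numeric
amount <- as.numeric(gsub("[^0-9.]", "", value_performance$FaceValue))

# Previous day's value; the first day has no predecessor
previous <- c(NA, head(amount, -1))

# Day-over-day growth in percent
value_performance$change <- 100 * (amount / previous - 1)
value_performance
```

The first row is NA because there is no earlier day to compare against.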

How to create a shared householdID for spouses by referencing two separate character IDs

This question is similar to my previous question (How to create a "householdID" for rows with shared "customerID" and "spouseID"?), although this version deals with a rat's nest of mixed character and numeric strings instead of simple numeric IDs. I'm trying to create a "household ID" for all couples who appear in a larger dataframe. In short, each individual has a "customerID" and "spouseID". If a customer is married, their spouse's ID appears in the "spouseID" column. If they are not married, the spouseID field is empty. Each member of a married couple appears on its own row, so the couple needs a common "householdID" that both rows share.
What is the best way to add a unique householdID that is repeated for couples? A small and over-simplified example of the original data is as follows. Note that the original IDs are far more complex, with varying lengths and patterns of numbers and characters.
df <- data.frame(
prospectID=as.character(c("G1339jf", "6dhd54G1", "Cf14c", "Bvmkm1", "kda-1qati", "pwn9enr", "wj44v04t4t", "D15", "dkfs044nng", "v949s")),
spouseID=as.character( c( "", "wj44v04t4t", "", "pwn9enr", "", "Bvmkm1", "6dhd54G1", "", "v949s", "dkfs044nng")),
stringsAsFactors = FALSE)
> df
prospectID spouseID
1 G1339jf
2 6dhd54G1 wj44v04t4t
3 Cf14c
4 Bvmkm1 pwn9enr
5 kda-1qati
6 pwn9enr Bvmkm1
7 wj44v04t4t 6dhd54G1
8 D15
9 dkfs044nng v949s
10 v949s dkfs044nng
An example of my desired result is as follows:
> df
prospectID spouseID HouseholdID
1 G1339jf 1
2 6dhd54G1 wj44v04t4t 2
3 Cf14c 3
4 Bvmkm1 pwn9enr 4
5 kda-1qati 5
6 pwn9enr Bvmkm1 4
7 wj44v04t4t 6dhd54G1 2
8 D15 6
9 dkfs044nng v949s 7
10 v949s dkfs044nng 7
This is an edited solution due to comments made by OP.
Illustrative data:
df <- data.frame(
prospectID=as.character(c("A1jljljljl344asbvc", "A2&%$ll##fffh", "B1665453sskn:;", "B2gavQWEΩΩø⁄", "C1", "D1", "E1#+'&%", "E255646321", "F1", "G1")),
spouseID=as.character(c("A2&%$ll##fffh", "A1jljljljl344asöbvc", "B2gavQWEΩΩø⁄", "B1665453sskn:;_", "", "", "E255646321", "E1#+'&%", "", "")),
stringsAsFactors = FALSE)
First define a pattern to match:
patt <- paste(df$prospectID, df$spouseID, sep = "|")
Second, define a for loop; here, a little editing is necessary for the first and the last value. Maybe others can improve on this part:
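df$HousholdID <- 0
for(i in 1:(nrow(df)-1)){
  # flag is 1 when the next row's prospectID matches this row's pattern
  # (this relies on spouses sitting on adjacent rows)
  df$HousholdID[i] <- ifelse(grepl(patt[i], df$prospectID[i+1]), 1, 0)
}
# the first and the last row are set by hand
df$HousholdID[1] <- 1
df$HousholdID[nrow(df)] <- 1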
The final step is to run cumsum:
df$HousholdID <- cumsum(df$HousholdID)
The result:
df
prospectID spouseID HousholdID
1 A1jljljljl344asbvc A2&%$ll##fffh 1
2 A2&%$ll##fffh A1jljljljl344asöbvc 1
3 B1665453sskn:; B2gavQWEΩΩø⁄ 2
4 B2gavQWEΩΩø⁄ B1665453sskn:;_ 2
5 C1 3
6 D1 4
7 E1#+'&% E255646321 5
8 E255646321 E1#+'&% 5
9 F1 6
10 G1 7
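When spouses are not on adjacent rows, as in the question's own example, a different technique may be easier: build an order-independent key for each couple and number the keys in order of first appearance. The sketch below is not the method of the answer above. It assumes every spouseID points back at a prospectID in the same table and that the tab character never occurs inside an ID. The names married and key are illustrative.

```r
# Example data from the question
df <- data.frame(
  prospectID = c("G1339jf", "6dhd54G1", "Cf14c", "Bvmkm1", "kda-1qati",
                 "pwn9enr", "wj44v04t4t", "D15", "dkfs044nng", "v949s"),
  spouseID   = c("", "wj44v04t4t", "", "pwn9enr", "",
                 "Bvmkm1", "6dhd54G1", "", "v949s", "dkfs044nng"),
  stringsAsFactors = FALSE
)

married <- df$spouseID != ""

# Singles are keyed by their own ID; couples by the sorted pair of IDs,
# so both partners produce the same key regardless of row order
key <- df$prospectID
key[married] <- paste(
  pmin(df$prospectID, df$spouseID)[married],
  pmax(df$prospectID, df$spouseID)[married],
  sep = "\t"  # assumed not to appear in any ID
)

# Number the keys in order of first appearance
df$HouseholdID <- match(key, unique(key))
df
```

Because the IDs are compared as plain strings and never used as regular expressions, special characters in them need no escaping.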

Adding missing end_of_months values by different variables in R

I have the following xlsx file df.xlsx which looks like this:
client id dax dpd
1 2000-05-30 7
1 2000-12-31 6
2 2003-05-21 6
3 1999-12-30 5
3 2000-10-30 6
3 2001-12-30 5
4 1999-12-30 5
4 2002-05-30 6
It's about a loan migration from one snapshot to another. The problem is that I don't have all the months in between (e.g. for client_id = 1, dax goes from 2000-05-30 to 2000-12-31). I have tried several approaches without success. I need to fill in, per client_id, all the months between the dax values and keep the same "dpd" as the first month (e.g. for client_id = 1, dpd = 7 for all months except the last one, "2000-12-31", where dpd = 6). If a client_id appears only once (like client_id = 2), it should remain unchanged.
(dpd means days past due aka rating bucket)
I have tried this code:
df2 <- data.frame(dax=seq(min(df$dax), max(df$dax), by="month"))
df3 <- merge(x=df2a, y=df, by="dax", all.x=T)
idx <- which(is.na(df3$values))
for (client_id in idx)
df3$values[client_id] <- df3$values[client_id-1]
df3
but the results were not quite what I need.
I appreciate any advice. Thank you very much!
If I understand your question correctly, you want to generate a sequence of dates, given the start and end dates.
R code to do this would be (insert values from your dataframe):
seq(as.Date("2017-01-30"), as.Date("2017-12-30"), "month")
Edit after comment:
In this case you can split your data by clients first and then generate the sequences:
new_data <- data.frame()
customerslist <- split(YOURDATA, YOURDATA$id)
for(i in 1:length(customerslist)){
dates <- seq(min(as.Date(customerslist[[i]]$dax)), max(as.Date(customerslist[[i]]$dax)), "month")
id <- rep(customerslist[[i]]$id[1], length(dates))
dpd <- rep(customerslist[[i]]$dpd[1], length(dates))
add <- cbind(id, as.character(dates), dpd)
new_data <- rbind(new_data, add)
}
new_data$V2 <- as.Date(new_data$V2)
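A variant that carries each snapshot's dpd forward until the next snapshot, rather than repeating the first dpd throughout, could look roughly like the sketch below. It rests on two assumptions that are not in the question: each month is represented by its first day (which avoids seq() overshooting short months when starting from the 30th or 31st), and the column is called client_id. The function name fill_months is hypothetical.

```r
# Example data from the question
df <- data.frame(
  client_id = c(1, 1, 2, 3, 3, 3, 4, 4),
  dax = as.Date(c("2000-05-30", "2000-12-31", "2003-05-21", "1999-12-30",
                  "2000-10-30", "2001-12-30", "1999-12-30", "2002-05-30")),
  dpd = c(7, 6, 6, 5, 6, 5, 5, 6)
)

# Hypothetical helper: expand one client's snapshots to one row per month
fill_months <- function(d) {
  d <- d[order(d$dax), ]
  # Represent every snapshot by the first day of its month (assumption)
  snap <- as.Date(format(d$dax, "%Y-%m-01"))
  months <- seq(min(snap), max(snap), by = "month")
  # For each month, index of the latest snapshot at or before it
  idx <- findInterval(as.numeric(months), as.numeric(snap))
  data.frame(client_id = d$client_id[1], month = months, dpd = d$dpd[idx])
}

filled <- do.call(rbind, lapply(split(df, df$client_id), fill_months))
rownames(filled) <- NULL
head(filled, 10)
```

A client with a single snapshot yields a single row, so such clients stay unchanged.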

Adding variables to a data.frame using a string as syntax

Suppose I have these variables:
data <- data.frame(x=rnorm(10), y=rnorm(10))
form <- 'z = x*y'
How can I compute z (using data's variables) and add as a new variable to data?
I tried with parse() and eval() (based on an old question), but without success :/
Given what @Nico said is correct, you might do:
d1 <- within(data, eval(parse(text=form)) )
d1
x y z
1 0.5939462 1.58683345 0.94249368
2 0.3329504 0.55848643 0.18594826
3 1.0630998 -1.27659221 -1.35714497
4 -0.3041839 -0.57326541 0.17437812
5 0.3700188 -1.22461261 -0.45312970
6 0.2670988 -0.47340064 -0.12644474
7 -0.5425200 -0.62036668 0.33656135
8 1.2078678 0.04211587 0.05087041
9 1.1604026 -0.91092165 -1.05703586
10 0.7002136 0.15802877 0.11065390
transform() is the easy way if using this interactively:
data <- data.frame(x=rnorm(10), y=rnorm(10))
data <- transform(data, z = x * y)
R> head(data)
x y z
1 -1.0206 0.29982 -0.30599
2 -1.6985 1.51784 -2.57805
3 0.8940 1.19893 1.07187
4 -0.3672 -0.04008 0.01472
5 0.5266 -0.29205 -0.15381
6 0.2545 -0.26889 -0.06842
You can't do this using form though, but within(), which is similar to transform(), does allow this, e.g.
R> within(data, eval(parse(text = form)))
x y z
1 -0.8833 -0.05256 0.046428
2 1.6673 1.61101 2.686115
3 1.1261 0.16025 0.180453
4 0.9726 -1.32975 -1.293266
5 -1.6220 -0.51079 0.828473
6 -1.1981 2.62663 -3.147073
7 -0.3596 -0.01506 0.005416
8 -0.9700 0.21865 -0.212079
9 1.0626 1.30377 1.385399
10 -0.8020 -1.04639 0.839212
though it involves some amount of jiggery-pokery with the language which to my mind is not elegant. Effectively, you are doing something like this:
R> eval(eval(parse(text = form), data), data, parent.frame())
[1] 0.046428 2.686115 0.180453 -1.293266 0.828473 -3.147073 0.005416
[8] -0.212079 1.385399 0.839212
(and assigning the result to the named component in data.)
Does form have to come like this, as a character string representing some expression to be evaluated?
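If form really must arrive as a string, another option is to split it into a target name and a right-hand side and evaluate only the right-hand side with the data frame as the environment. This is a sketch under the assumption that the string contains exactly one "=" (so no "==" or "<=" on the right-hand side). The names parts, target and rhs are illustrative.

```r
set.seed(1)  # only to make the example reproducible
data <- data.frame(x = rnorm(10), y = rnorm(10))
form <- "z = x*y"

# Split "z = x*y" into the new column name and the expression to evaluate
parts  <- strsplit(form, "=", fixed = TRUE)[[1]]
target <- trimws(parts[1])
rhs    <- trimws(parts[2])

# Evaluate the right-hand side using the columns of data as variables
data[[target]] <- eval(parse(text = rhs), envir = data)
head(data)
```

As with any use of eval(parse()), this should only be run on strings from a trusted source.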

counting unique factors in r

I would like to know the number of unique dams which gave birth on each of the birth dates recorded. My data frame is similar to this one:
dam <- c("2A11","2A11","2A12","2A12","2A12","4D23","4D23","1X23")
bdate <- c("2009-10-01","2009-10-01","2009-10-01","2009-10-01",
"2009-10-01","2009-10-03","2009-10-03","2009-10-03")
mydf <- data.frame(dam,bdate)
mydf
# dam bdate
# 1 2A11 2009-10-01
# 2 2A11 2009-10-01
# 3 2A12 2009-10-01
# 4 2A12 2009-10-01
# 5 2A12 2009-10-01
# 6 4D23 2009-10-03
# 7 4D23 2009-10-03
# 8 1X23 2009-10-03
I used aggregate(dam ~ bdate, data=mydf, FUN=length) but it counts all the dams that gave birth on a particular date
bdate dam
1 2009-10-01 5
2 2009-10-03 3
Instead, I need to have something like this:
mydf2
bdate dam
1 2009-10-01 2
2 2009-10-03 2
Your help is very much appreciated!
What about:
aggregate(dam ~ bdate, data=mydf, FUN=function(x) length(unique(x)))
You could also run unique on the data first:
aggregate(dam ~ bdate, data=unique(mydf[c("dam","bdate")]), FUN=length)
Then you could also use table instead of aggregate, though the output is a little different.
> table(unique(mydf[c("dam","bdate")])$bdate)
2009-10-01 2009-10-03
2 2
This is just an example of how to think of the problem and one of the approaches on how to solve it.
split.mydf <- with(mydf, split(x = dam, f = bdate)) # each list element holds the dams for one date
# it's just a matter of counting unique dams
unique.mydf <- lapply(X = split.mydf, FUN = unique)
# and then count the number of unique elements
unilen.mydf <- lapply(unique.mydf, length)
# you can do these last two steps in one go like so
lapply(split.mydf, FUN = function(x) length(unique(x)))
as.data.frame(unlist(unilen.mydf)) #data.frame is just a special list, so this is water to your mill
unlist(unilen.mydf)
2009-10-01 2
2009-10-03 2
In dplyr you can use n_distinct :
library(tidyverse)
mydf %>%
group_by(bdate) %>%
summarize(dam = n_distinct(dam))
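For completeness, the same count can also be sketched with base R's tapply, which applies a function to one vector within groups defined by another. The names counts and mydf2 are used only for illustration.

```r
dam <- c("2A11", "2A11", "2A12", "2A12", "2A12", "4D23", "4D23", "1X23")
bdate <- c("2009-10-01", "2009-10-01", "2009-10-01", "2009-10-01",
           "2009-10-01", "2009-10-03", "2009-10-03", "2009-10-03")
mydf <- data.frame(dam, bdate)

# Number of distinct dams within each birth date
counts <- tapply(mydf$dam, mydf$bdate, function(x) length(unique(x)))

# Reshape the named vector into the two-column layout asked for
mydf2 <- data.frame(bdate = names(counts), dam = as.vector(counts))
mydf2
```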
