Reshaping data in R with multiple variable levels - "aggregate function missing" warning - r

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2 dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
**visit.dx**
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <-
dcast(
setDT(long),
id + visit.date + visit.id + bill.num ~ visit.dx,
value.var = c(
"dx.code",
"FY"
)
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and new dataframe full of 0's and 1's. There are no duplicate rows in the dataframe, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.

The data.table package extended dcast with rowid and allowing multiple value.var, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA

Related

Create a variable and apllying a function to another one

I have a dataframe with a column Arrivo (formatted as date) and a column Giorni (formatted as integer) with number of days (es.: 2, 3, 6 etc..).
I would like to apply two function to these columns and precisely, I would like to duplicate a row for the number in the column Giorni and while duplicating these rows, I would like to create a new column called data.osservazione that is equal to Arrivo and augmented of one day iteratively.
From this:
No. Casa Anno Data Categoria Camera Arrivo Stornata.il Giorni
1 2.867 SEELE 2019 03/09/2019 CDV 316 28/03/2020 NA 3
2 148.000 SEELE 2020 20/01/2020 CDS 105 29/03/2020 NA 3
3 3.684 SEELE 2019 16/11/2019 CD 102 02/04/2020 NA 5
to this:
No. data.osservazione Casa Anno Data Categoria Camera Arrivo
1 2867 3/28/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
2 2867 3/29/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
3 2867 3/30/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
4 148 3/29/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
5 148 3/30/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
6 148 3/31/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
Stornata.il Giorni
1 #N/D 3
2 #N/D 3
3 #N/D 3
4 #N/D 3
I was able to duplicate the rows but I don't know how to concurrently create the new column with the values I need.
Please don't mind the date values in the columns, I'll fix them in the end.
Thanks in advance
Since I am a fan of the data.table package, I will propose a solution using data.table. You can install it by typing on the console: install.packages("data.table").
My approach was to create a separate data.frame with an index going from 0 to whatever number is in Giorni by row from the original data.frame and then merge this new data.frame with the original data that you have and, by virtue of many to one matches from the key specified, the resulting data.frame will "expand" to the size desired, therefore "duplicating" the rows when necessary.
For this, I used seq_len(). If you do seq_len(3L), you get: [1] 1 2 3 which is the sequence from 1L to whatever integer you've given in length.out when length.out >= 1L. Thus seq_len() will produce a sequence that ends in whatever is in Giorni, the challenge is to do by row since length.out in seq_len() needs to be a vector of size 1. We use by in data.table syntax to accomplish this.
So let's get started, first you load data.table:
library(data.table) # load data.table
setDT(data) # data.frame into data.table
In your example, it isn't clear whether Arrivo is in Date format, I am assuming it isn't so I convert to Date --you will need this to add the days later.
# is `Arrivo`` date? If no, into date fmt
data[["Arrivo"]] <- as.Date(data[["Arrivo"]], format = "%d/%m/%y")
The next bit is key, using seq_len() and by in data.table syntax, I create a separate data.table --which is always a data.frame, but not the other way around-- with the sequence by every single element of Giorni, therefore expanding the data to the desired size. I use by = "No." because I want to apply seq_len() to every value of Giorni associated with No. the values in No..
# create an index with the count from `Giorni`, subtract by 1 so the first day is 0.
d1 <- data[, seq_len(Giorni) - 1, by = "No."]
Check the result, you can see where I am going now:
> d1
No. V1
1: 2867 0
2: 2867 1
3: 2867 2
4: 148 0
5: 148 1
Lastly, you inner join d1 with the original data, I am using data.table join syntax here. Then you add the index V1 to Arrivo:
# merge with previous data
res <- d1[data, on = "No."]
# add days to `Arrivo``, create column data.osservazione
res[ , data.osservazione := V1 + Arrivo]
Result:
> res
No. V1 Casa Anno Data Categoria Camera Arrivo
1: 2867 0 SEELE 2019 03/09/2019 CDV 316 2020-03-28
2: 2867 1 SEELE 2019 03/09/2019 CDV 316 2020-03-28
3: 2867 2 SEELE 2019 03/09/2019 CDV 316 2020-03-28
4: 148 0 SEELE 2019 20/01/2020 CDS 105 2020-03-29
5: 148 1 SEELE 2019 20/01/2020 CDS 105 2020-03-29
Stornata.il Giorni data.osservazione
1: NA 3 2020-03-28
2: NA 3 2020-03-29
3: NA 3 2020-03-30
4: NA 2 2020-03-29
5: NA 2 2020-03-30
The next commands are just cosmetic, formatting dates and deleting columns:
# reformat `Arrivo` and `data.osservazione`
cols <- c("Arrivo", "data.osservazione")
res[, (cols) := lapply(.SD, function(x) format(x=x, format="%d/%m/%Y")), .SDcols=cols]
# remove index
res[, V1 := NULL]
Console:
> res
No. V1 Casa Anno Data Categoria Camera Arrivo
1: 2867 0 SEELE 2019 03/09/2019 CDV 316 2020-03-28
2: 2867 1 SEELE 2019 03/09/2019 CDV 316 2020-03-28
3: 2867 2 SEELE 2019 03/09/2019 CDV 316 2020-03-28
4: 148 0 SEELE 2019 20/01/2020 CDS 105 2020-03-29
5: 148 1 SEELE 2019 20/01/2020 CDS 105 2020-03-29
Stornata.il Giorni data.osservazione
1: NA 3 2020-03-28
2: NA 3 2020-03-29
3: NA 3 2020-03-30
4: NA 2 2020-03-29
5: NA 2 2020-03-30
Hi #JdeMello and really thank you for the quick answer!
Indeed it was what I was looking for, but in the mean time I kinda found a solution myself using lubridate and tidyverse and purrr.
What I did was transform variables from Posix to date (revenue is my df):
revenue <- revenue %>% mutate(Data = as_date(Data), Arrivo = as_date(Arrivo), `Stornata il` = as_date(`Stornata il`), Partenza = as_date(Partenza))
Then I created another data frame but included variables id and data_obs:
revenue_1 <- revenue %>% mutate(data_obs = Arrivo, id = 1:nrow(revenue))
I created another data frame with the variable data_obs iterated by the number of Giorni:
revenue_2 <- revenue_1 %>% group_by(id, data_obs) %>%
complete(Giorni = sequence(Giorni)) %>%
ungroup() %>%
mutate(data_obs = data_obs + Giorni -1)
I extracted data_obs:
data_obs <- revenue_2$data_obs
I created another data frame to duplicate the rows:
revenue_3 <- revenue %>% map_df(.,rep, .$Giorni)
And finally created the ultimate data frame that I needed:
revenue_finale <- revenue_3 %>% mutate(data_obs = data_obs)
I know it's kinda redundant having created all those data frame but I have very little knowledge of R at the moment and had to work around.
I wanted to merge data frames but for unknown reasons to me, it didn't work out.
What I found kinda fun is that you can play with many packages to get to your point instead of using just one.
I've never used data.table so your answer is very interesting and I'll try to memorize it.
So again, really really thank you!!

R- combine rows of a data frame to be unique by 3 columns

I have data frame looking like this:
> head(temp)
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 <NA> 36.4 <NA>
24339 2008253059 695 696 132 <NA> <NA>
72450 2008953178 527 528 <NA> 38.6 <NA>
72957 2008953178 527 528 123 <NA> <NA>
73976 2008965669 527 528 <NA> 36.2 <NA>
74504 2008965669 527 528 116 <NA> <NA>
First and second row are both for the same patient(same VisitIDCode), in the first row I have the value of heart rate and in the second I have the value of temperature from time 2 to 3. I want to combine these rows so that the result is one row that looks like:
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 132 36.4 <NA>
In other words, I want my data frame to be unique by combination of VisitIDCode, start and stop. This is a large dataframe with more columns that need to be combined.
What is the best way of doing it and if at all possible, avoiding for loop?
Edit: I don't want to remove the NAs. If there are 2 rows each of which have one value and 2 NAs, I want to combine them to one row so it has two values and one NA. Like the example above.
nasim,
It's useful to create a reproducible example when posting questions. It makes it much easier to sort out how to help. I created a toy example here. Hopefully, that reproduces your issue:
> df <- data.frame(MRN = c(123,125,213,214),
+ VID = c(2008,2008,2011,2011),
+ start=c(695,695),
+ heart.rate = c(NA,112,NA,96),
+ temp = c(39.6,NA,37.4,NA))
> df
MRN VID start heart.rate temp
1 123 2008 695 NA 39.6
2 125 2008 695 112 NA
3 213 2011 695 NA 37.4
4 214 2011 695 96 NA
Here is a solution using dplyr:
> library(dplyr)
> df <- df %>%
+ group_by(VID) %>%
+ summarise(MRN = max(MRN,na.rm=T),
+ start=max(start,na.rm=T),
+ heart.rate=max(heart.rate,na.rm=T),
+ temp = max(temp,na.rm=T))
> df
# A tibble: 2 × 5
VID MRN start heart.rate temp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2008 125 695 112 39.6
2 2011 214 695 96 37.4
After I made sure all columns classes are numeric (not factors) by defining the classes of columns while reading the data in, this worked for me:
CompleteCoxObs<-aggregate(x=CompleteCoxObs[c("stop","Value_EVS current weight kg CAL","Value_EVS hr heart rate NU EE0A","Value_EVS temp celsius CAL 113C")], by=list(VisitIDCode=CompleteCoxObs$VisitIDCode,start=CompleteCoxObs$start), max, na.rm = FALSE);

Find and tag a number between a range

I have two dfs as below
>codes1
Country State City Start No End No
IN Telangana Hyderabad 100 200
IN Maharashtra Pune (Bund Garden) 300 400
IN Haryana Gurgaon 500 600
IN Maharashtra Pune 700 800
IN Gujarat Ahmedabad (Vastrapur) 900 1000
Now i want to tag ip address from table 1
>codes2
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
Now i want to tag numbers in codes2 df as per the range given in codes1 df for No column , expected ouput is
ID No Country State City
1 157 IN Telangana Hyderabad
2 346 IN Maharashtra Pune(Bund Garden)
.
.
.
Basically want to tag No column in codes 2 with codes1 according to the range (Start No and End No) that No observations falls in.
Also the order could be anything in codes 2 df .
You could use the non-equi join capability of the data.table package for that:
library(data.table)
setDT(codes1)
setDT(codes2)
codes2[codes1, on = .(No > StartNo, No < EndNo), ## (1)
`:=`(cntry = Country, state = State, city = City)] ## (2)
(1) obtains matching row indices in codes2 corresponding to each row in codes1, while matching on the condition provided to the on argument.
(2) updates codes2 values for those matching rows for the columns specified directly by reference (i.e., you don't have to assign the result back to another variable).
This gives:
codes2
# ID No cntry state city
# 1: 1 157 IN Telangana Hyderabad
# 2: 2 346 IN Maharashtra Pune (Bund Garden)
# 3: 3 389 IN Maharashtra Pune (Bund Garden)
# 4: 4 453 NA NA NA
# 5: 5 562 IN Haryana Gurgaon
# 6: 6 9874 NA NA NA
# 7: 7 98745 NA NA NA
if you're comfortable writing SQL, you might consider using the sqldf package to do something like
library('sqldf')
result <- sqldf('select * from codes2 left join codes1 on codes2.No between codes1.StartNo and codes1.EndNo')
you may have to remove special characters and spaces from the columnnames of your dataframes beforehand.

Deleting rows dynamically based on certain condition in R

Problem description:
From the below table, I would want to remove all the rows above the quarter value of 2014-Q3 i.e. rows 1,2
Also note that this is a dynamic data-set. Which means when we move on to the next quarter i.e. 2016-Q3, I would want to remove all the rows above quarter value of 2014-Q4 automatically through a code without any manual intervention
(and when we move to next qtr 2016-Q4, would want to remove all rows above 2015-Q1 and so on)
I have a variable which captures the first quarter I would like to see in my final data-frame (in this case 2014-Q3) and this variable would change as we progress in the future
QTR Revenue
1 2014-Q1 456
2 2014-Q2 3113
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
How do I code this?
Here is a semi-automated method using which:
myFunc <- function(df, year, quarter) {
dropper <- paste(year, paste0("Q",(quarter-1)), sep="-")
df[-(1:which(as.character(df$QTR)==dropper)),]
}
myFunc(df, 2014, 3)
QTR Revenue
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
To subset, you can just assign output
dfNew <- myFunc(df, 2014, 3)
At this point, you can pretty easily change the year and quarter to perform a new subset.
Thanks lmo
Was going through articles and I think we can use the dplyr package to do this in a much simpler way:
>df % slice((nrow(df)-7):(nrow(df)))
Get the below result
>df
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
This would act in a dynamic way too as once we have more rows entered beyond 2016-Q2, the range of 8 rows (to be selected) is maintained by the nrow function

Exclude intervals that overlap between two data frame's (by range of two column values)

This is almost an extension of a previous question I asked, but I've run into a new problem I haven't found a solution for.
Here is the original question and answer: Find matching intervals in data frame by range of two column values
(this found overlapping intervals that were common among different names within same data frame)
I now want to find a way to exclude row's in DF1 when there are overlapping intervals with a new data-frame, DF2.
Using the same DF1 :
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID1
ADAM 2 A 384 407 23 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
ADAM 6 C 522 599 77 ID1
This continues for 18 different names and two ID groups.
Now have a second data frame with intervals that I wish to exclude from the above data frame.
Here is an example of DF2:
Name Event Order Sequence start_event end_event duration Group
GAP1 1 A 55 121 66 ID1
GAP2 2 A 394 419 25 ID1
GAP3 3 C 502 635 133 ID1
I.E., I am hoping to find any interval for each "Name" in DF1, that is in the same "Sequence" and has overlapping time at any point of the interval found in DF2 (any portion, whether it begins before the start event, or begins midway and ends after the end event). I would like to iterate through each distinct "Name" in DF1. Also, the sequence matters, so I would only like to return results found common between sequence A and sequence A, then sequence B and sequence B, and finally sequence C and sequence C.
Desired Result (showing just the first name):
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
Last time the answer was resolved in part with foverlaps, but I am still not overly familiar with it to be able to solve this problem - assuming that's the best way to answer this.
Thanks!
This piece of code should work for you
library(data.table)
Dt1 <- data.table(a = 1:1000,b=1:1000 + 100)
Dt2 <- data.table(a = 100:200,b=100:200+10)
#identify the positions that are not allowed
badSeq <- unique(unlist(lapply(1:nrow(Dt2),function(i) Dt2[i,a:b,])))
#select for the rows outside of the range
correctPos <- sapply(1:nrow(Dt1),
function(i)
all(!Dt1[i,a:b %in% badSeq]))
Dt1[correctPos,]
I have done it with data.tables rather than data.frames. I like them better and they can be faster. But you can apply the same ideas to a data.frame

Resources