I am new to r and I am surprised at how long it takes to run what I believe to be rather simple lines of code, this leads me to believe I am missing something rather obvious. I have searched the internet and tried a few different iterations of the function but nothing has improved the efficiency (measured in time).
The Extract data is a data frame with 18.5m rows and 11 variables. I am trying to establish two things, first what percentage of patients stay in a hospital for longer than 7 as a percentage of all patients and second 21 days stays as a proportion of 7 days.
LOS_prob_providerage <- function(x,y){
Var1 = which(Extract$LOS>=0 & Extract$ProviderCode == x & Extract$age_group == y)
Var2 = which(Extract$LOS>=7 & Extract$ProviderCode == x & Extract$age_group == y)
return(list(Strand=(sum(Extract$LOS[Var1] >= 7)/length(Var1))*100, ELOS=(sum(Extract$LOS[Var2] >= 21)/length(Var2))*100))
}
When I call this function I give it a list of hospitals as the x variable and 1 age group from a list for the y variable (I can't seem to get it to take both as lists and output all hospitals for all age groups) using the following set of code
Providerage_prob_strand = mapply(LOS_prob_providerage,Provider_unique, agelabels[1], SIMPLIFY = FALSE)
I then create a data frame using the 2 lists that the function outputs using the code below
National = data.frame(matrix(unlist(Providerage_prob_strand), ncol=2,
byrow=T),row.names = Provider_unique)
colnames(National) <- c("Stranded_010","ELOS_010")
I subsequently re-run the last portions of code for all 11 elements in my age group list and append to the National data frame.
Question 1: Is there a less computationally intensive way to code my loop using r, or is the loop just taking that length of time due to the way r stores everything in memory?
Question 2: Is there anywhere to give r two lists for both the x and y varibale using mapply/sapply and for it to output the results to both Strand and ELOS across all hospitals /age groups?
I would use the data.table package for this.
Some dummy data to demonstrate (usually it is good practice for the question asker to provide this):
set.seed(123)
df1 = data.frame(
provider = sample(LETTERS[1:4], 1000, T),
los = round(runif(1000,0,40)),
age_group = sample(1:4,1000, T))
Now we turn this into a data table
library(data.table)
setDT(df1)
and we can extact the values you want like this:
providerlist = c('A','B')
age_list = c(1,2)
df1[provider %in% providerlist & age_group %in% age_list,
.(los_greater_than7 = 100*sum(los>7)/.N),
keyby = .(provider, age_group)]
# provider age_group los_greater_than7
# 1: A 1 92.40506
# 2: A 2 81.81818
# 3: B 1 77.27273
# 4: B 2 87.50000
df1[provider %in% providerlist & age_group %in% age_list & los>7,
.(los_greater_than20 = 100*sum(los>20)/.N),
by = .(provider, age_group)]
# provider age_group los_greater_than20
# 1: A 1 56.16438
# 2: A 2 66.66667
# 3: B 1 56.86275
# 4: B 2 58.92857
Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
if(!exists("MergedData")){
MergedData <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(MergedData) <- c("id", "name", file)
}
else if(exists("MergedData")){
temp_data <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(temp_data) <- c("id", "name", file)
MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
rm(temp_data)
}
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
dat = read_csv(file)
dat$source.file = file
return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with Aggregation function missing: defaulting to length warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id=rep(1:2,2),
name=rep(c("A","B"), 2),
source.file=rep(c("001","002"), each=2),
value=11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
mutate(counter=1:n())
As you can see, the counter value only has a value of 1 in cases where there's only one combination of id, name, and source.file, but has values of 1 and 2 for one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
* I'm not sure what your file names look like, so you'll probably need to adjust this create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file).
I have two dataframes df.o and df.m as defined below. I need to find which observation in df.o (dimension table) corresponds which observations in df.m (fact table) based on two criteria: 1) df.o$Var1==df.o$Var1 and df.o$date1 < df.m$date2 < df.o$date3 such that I get the correct value of df.o$oID in df.m$oID (the correct value is manually entered in df.m$CORRECToID). I need the ID to complete a merge afterwards.
df.o <- data.frame(oID=1:4,
Var1=c("a","a","b","c"),
date3=c(2015,2011,2014,2015),
date1=c(2013,2009,2012,2013),
stringsAsFactors=FALSE)
df.m <- data.frame(mID=1:3,
Var1=c("a","a","b"),
date2=c(2014,2010,2013),
oID=NA,
CORRECToID=c(1,2,3),
points=c(5, 10,15),
stringsAsFactors=FALSE)
I have tried various combinations of like the code below, but without luck:
df.m$oID[df.m$date2 < df.o$date3 & df.m$date2 > df.o$date1 & df.o$Var1==df.m$Var1] <- df.o$oID
I have also tried experimenting with various combinations of ifelse, which and match, but none seem to do the trick.
The problem I keep encountering is that my replacement was a different number of rows than data and that "longer object length is not a multiple of shorter object length".
What you are looking for is called an "overlap join", you could try the data.table::foverlaps function in order to achieve this.
The idea is simple
Create the columns to overlap on (add an additional column to df.m)
key by these columns
run foverlaps and select the column you want back
library(data.table)
setkey(setDT(df.m)[, date4 := date2], Var1, date2, date4)
setkey(setDT(df.o), Var1, date1, date3)
foverlaps(df.m, df.o)[, names(df.m), with = FALSE]
# mID Var1 date2 oID CORRECToID points date4
# 1: 2 a 2010 2 2 10 2010
# 2: 1 a 2014 1 1 5 2014
# 3: 3 b 2013 3 3 15 2013
EDIT:
(I apologize for the fact that my example was oversimplified, and I will try to remedy this, as well as format my more relevant example in a more convenient format for copying directly into R. In particular, there are multiple value columns, and some preceding columns with other information that does not need to be parsed.)
I am fairly new to R, and to data.table as well, so I would appreciate input on an issue I am finding. I am working with a data table where one column is a colon-separated format string that serves as a legend for values in other colon-separated columns. In order to parse it, I have to first split it into its components, and then search for the indices of the components I need to later index the value strings. Here is a simplified example of the sort of situation I might be working with
DT <- data.table(number=c(1:5),
format=c("name:age","age:name","age:name:height","height:age:name","weight:name:age"),
person1=c("john:30","40:bill","20:steve:100","300:70:george","140:fred:20"),
person2=c("jane:31","42:ivan","21:agnes:120","320:72:vivian","143:rose:22"))
When evaluated, we get
> DT
number format person1 person2
1: 1 name:age john:30 jane:31
2: 2 age:name 40:bill 42:ivan
3: 3 age:name:height 20:steve:100 21:agnes:120
4: 4 height:age:name 300:70:george 320:72:vivian
5: 5 weight:name:age 140:fred:20 143:rose:22
Let's say that for each person, I need to know ONLY their name and age, and don't need their height or weight; in this example, and in my actual data, every format string has fields for name and age, but possibly in different positions (the fields that I am actually looking for are usually fixed in certain columns, but I am reluctant to hard-code any indices as I am not completely familiar with the production of the data files I am working with). I would first split up the format string and then do a match() search for the names of the fields I want.
DT[, format.split := strsplit(format, ":")]
At this point, the only method I used that worked to perform the match was a vapply:
DT[, index.name := vapply(format.split, function (x) match('name', x), 0L)]
DT[, index.age := vapply(format.split, function (x) match('age', x), 0L)]
because I don't know of any other way to let R know that it should be looking at the rows in the columns individually, and not bunched together as a vector, and perform the match on the vector-valued format.split column of each row, rather than trying to match the whole column of rows. Even then, once I find the indices for each row, I have to perform another strsplit and then an mapply to parse the name-value and age-value out of each person's value-string:
DT[, person1.split := strsplit(person1, ':')]
DT[, person1.name := mapply(function (x,y) x[y], person1.split, index.name]
DT[, person1.age := mapply(function (x,y) x[y], person1.split, index.age]
DT[, person2.split := strsplit(person2, ':')]
DT[, person2.name := mapply(function (x,y) x[y], person2.split, index.name]
DT[, person2.age := mapply(function (x,y) x[y], person2.split, index.age]
(And, of course, I would do the same thing for age as well)
I am working with fairly large data sets, so I'd like my code to be as efficient as possible. Does anyone have recommendations for ways I can speed up or otherwise optimize my code?
(NOTE: I am really looking for the right approach to take, not the right *apply or *ply or Map function to use. If *(ap)ply or Map really is the right approach, I would appreciate knowing which is the most efficient or appropriate for my situation, but if there is a better way of testing for intra-row duplication, I would prefer recommendations about that to function suggestions. Suggestions are welcome, though).
EDIT 2:
It turns out that my example was much more general than it need have been. I only need two fields, which are always going to be the first two fields in the format string, without variation. The first field is just a literal character string. The second field, however, consists of at least 2 numbers, separated by commas (ultimately, I filter out any rows with more than 2 numbers in the second field, so the possibility of more is only relevant if the filtering happens after the parsing). For each of the (3) value strings, I only need to create three columns: a character column for the first field, and two numeric columns, one for each member of the comma-separated pair in the second field. Any other fields are irrelevant. My current method, which is probably quite inefficient, is to use sub() to pattern-match on the desired fields and subfields with back-references.
> DT <- data.table(id=1:5,
format=c(rep("A:B:C:D:E", 5)),
person1=paste(paste0("foo",LETTERS[1:5]), paste(1:5, 10:6, sep=','), "blah", "bleh", "bluh", sep=':'),
person2=paste(paste0("bar",LETTERS[1:5]), paste(16:20, 5:1, sep=','), "blah", "bleh", "bluh", sep=':'),
person3=paste(paste0("baz",LETTERS[1:5]), paste(0:4, 12:8, sep=','), "blah", "bleh", "bluh", sep=':'))
> DT
id format person1 person2 person3
1: 1 A:B:C:D:E fooA:1,10:blah:bleh:bluh barA:16,5:blah:bleh:bluh bazA:0,12:blah:bleh:bluh
2: 2 A:B:C:D:E fooB:2,9:blah:bleh:bluh barB:17,4:blah:bleh:bluh bazB:1,11:blah:bleh:bluh
3: 3 A:B:C:D:E fooC:3,8:blah:bleh:bluh barC:18,3:blah:bleh:bluh bazC:2,10:blah:bleh:bluh
4: 4 A:B:C:D:E fooD:4,7:blah:bleh:bluh barD:19,2:blah:bleh:bluh bazD:3,9:blah:bleh:bluh
5: 5 A:B:C:D:E fooE:5,6:blah:bleh:bluh barE:20,1:blah:bleh:bluh bazE:4,8:blah:bleh:bluh
My code then does this:
DT[, `:=`(person1.A=sub("^([^:]*):.*$","\\1", person1),
person2.A=sub("^([^:]*):.*$","\\1", person2),
person3.A=sub("^([^:]*):.*$","\\1", person3),
person1.B.first=sub("^[^:]*:([^:,]*),.*$","\\1", person1),
person1.B.second=sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$","\\1", person1),
person2.B.first=sub("^[^:]*:([^:,]*),.*$","\\1", person2),
person2.B.second=sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$","\\1", person2),
person3.B.first=sub("^[^:]*:([^:,]*),.*$","\\1", person3),
person3.B.second=sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$","\\1", person3))]
for the splitting, and filters by
DT <- DT[grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person1) &
grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person2) &
grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person3) ]
I understand that this method is probably very inefficient, but it was the first improvement I came up with over my old approach of repeatedly applying strsplit. With the new conditions in mind, is there an even better way of doing things than melt, csplit, dcast?
EDIT 3:
Since I only needed the first two fields, I ended up trimming all the value strings, removing those with more than two commas (i.e. more than 3 2nd-field numbers), changing the commas to colons, replacing the format string of every line with the names of the (now 3) fields, and performing the dcast(csplit(melt)) as suggested by #AnandaMahto. It seems to work well.
#bskaggs has the right idea that it might just make more sense to put your data into a long form, or even a structured wide form.
I'll show you two options, but first, it's always better to share your data in a way that others can actually use it:
DT <- data.table(
format = c("name:age", "name:age:height", "age:height:name",
"height:weight:name:age", "name:age:weight:height",
"name:age:height:weight"),
values = c("john:30", "rene:33:183", "100:10:speck",
"100:400:sumo:11", "james:43:120:120",
"plink:2:300:400"))
I'm also going to suggest you use my cSplit function.
Here's how you would easily convert this dataset into a long form:
cSplit(DT, c("format", "values"), ":", "long")
# format values
# 1: name john
# 2: age 30
# 3: name rene
# 4: age 33
# 5: height 183
# 6: age 100
# 7: height 10
# 8: name speck
# 9: height 100
# 10: weight 400
# 11: name sumo
# 12: age 11
# 13: name james
# 14: age 43
# 15: weight 120
# 16: height 120
# 17: name plink
# 18: age 2
# 19: height 300
# 20: weight 400
Once the data are in a "long" form, you can convert it easily to a "wide" form using dcast.data.table, like this. (I've also reordered the columns using setcolorder, which lets you rearrange the data without copying.)
X <- dcast.data.table(
cSplit(cbind(id = 1:nrow(DT), DT),
c("format", "values"), ":", "long"),
id ~ format, value.var = "values")
setcolorder(X, c("id", "name", "age", "height", "weight"))
X
# id name age height weight
# 1: 1 john 30 NA NA
# 2: 2 rene 33 183 NA
# 3: 3 speck 100 10 NA
# 4: 4 sumo 11 100 400
# 5: 5 james 43 120 120
# 6: 6 plink 2 300 400
How does this fare in terms of speed?
First, a very moderate dataset:
DT <- rbindlist(replicate(2000, DT, FALSE))
dim(DT)
# [1] 12000 2
## #bskaggs's suggestion
system.time(colonMelt(DT))
# user system elapsed
# 0.27 0.00 0.27
## cSplit. It would be even faster if you already had
## an id column and didn't need to cbind one in
system.time(cSplit(cbind(id = 1:nrow(DT), DT),
c("format", "values"), ":", "long"))
# user system elapsed
# 0.02 0.00 0.01
## cSplit + dcast.data.table
system.time(dcast.data.table(
cSplit(cbind(id = 1:nrow(DT), DT),
c("format", "values"), ":", "long"),
id ~ format, value.var = "values"))
# user system elapsed
# 0.08 0.00 0.08
Update
For your updated problem, you can melt the "data.table" first, and then proceed similarly:
library(reshape2)
## Melting, but no reshaping -- a nice long format
cSplit(melt(DT, id.vars = c("number", "format")),
c("format", "value"), ":", "long")
## Try other combinations for the LHS and RHS of the
## formula. This seems to be what you might be after
dcast.data.table(
cSplit(melt(DT, id.vars = c("number", "format")),
c("format", "value"), ":", "long"),
number ~ variable + format, value.var = "value")
I think you may be better served by using a tall tidy format:
colonMelt <- function(DT) {
formats <- strsplit(DT$format, ":")
rows <- rep(row.names(DT), sapply(formats, length))
data.frame(row = rows,
key = unlist(formats),
value = unlist(strsplit(DT$values, ":"))
)
}
newDT <- colonMelt(DT)
The result is a format that is much easier to do search and filtering without string splitting all the time:
row key value
1 1 name john
2 1 age 30
3 2 name rene
4 2 age 33
5 2 height 183
6 3 age 100
7 3 height 10
8 3 name speck
Apologises for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subject whose first vist (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine, it makes a 'lookup' table where I can see which IDs have a their first visit within 180 days of their diagnosis, which I could then take my original test and go through and make an indicator variable, but I should be able to do this is one step I'd have thought.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)