Aggregate contiguous rows - R

I have a data.frame in R with a series of variables
userID (numeric), var1 (factor), var2 (factor), time (character), action (character)
The first 3 columns are the same for each user session, which can have many rows. However, time and action change for each row.
I have tried aggregate to combine each entire session (a block of contiguous rows) into a single row:
dat <- aggregate(cbind(time, action) ~ userID + var1 + var2,
data = log, FUN = paste, collapse = "|")
That would solve my problem if each user appeared in the file only once. However, that is not the case, and the line above aggregates all of a user's sessions into a single row.
How can I avoid that? How do I aggregate contiguous rows (sessions) only?

Like what MrFlick said, except create the user.session variable with:
runs <- with(log, rle(as.character(interaction(user, var1, var2))))  # run lengths of consecutive identical user/var1/var2 rows
log$user.session <- rep(seq_along(runs$lengths), runs$lengths)       # one ID per contiguous run
It just seems easier to understand for us mere mortals...
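For completeness, a minimal sketch of how this ID feeds into the aggregate call (the same call as in the answer below, assuming that answer's sample log data frame):
dat <- aggregate(cbind(time, action) ~ user + var1 + var2 + user.session,
                 data = log, FUN = paste, collapse = "|")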

I would create a user.session ID such that each continuous sequence of rows for a given user/var1/var2 combination is assigned a unique ID. First, a sample data set:
times <- c(5, 3, 1, 2, 7, 3)  # rows per session (assumed; chosen to match the output below)
log <- data.frame(
  user = rep(c(1,2,3,1,3,4), times),
  var1 = factor(rep(letters[c(1,2,3,1,3,4)+7], times)),
  var2 = factor(rep(letters[c(1,2,3,1,3,4)+13], times)),
  time = "a",
  action = "b",
  stringsAsFactors = FALSE
)
Now we add the user.session id
log$user.session <- with(log,
  ave(seq_len(nrow(log)), user, var1, var2, FUN = function(x) {
    # within each user/var1/var2 group, start a new session whenever
    # the row numbers stop being consecutive
    cumsum(c(0, diff(x)) != 1)
  })
)
And now you can do
dat <- aggregate(cbind(time, action) ~ user + var1 + var2 + user.session,
data = log, FUN = paste, collapse = "|")
which gives
user var1 var2 user.session time action
1 1 h n 1 a|a|a|a|a b|b|b|b|b
2 2 i o 1 a|a|a b|b|b
3 3 j p 1 a b
4 4 k q 1 a|a|a b|b|b
5 1 h n 2 a|a b|b
6 3 j p 2 a|a|a|a|a|a|a b|b|b|b|b|b|b

So in bed I had the same realization, which is to add a user.session column. I like both of your solutions, but the second one is easier to understand, and that is why I chose it. In any case, here is a third possibility:
log$user.session <- cumsum(c(TRUE, diff(log$userID) != 0))  # new session whenever userID changes from one row to the next
Then, including this column in the aggregate function does the trick as well.
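For reference, a minimal sketch of that call with the column names from the original question:
dat <- aggregate(cbind(time, action) ~ userID + var1 + var2 + user.session,
                 data = log, FUN = paste, collapse = "|")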
Cheers.

Related

How to return the range of values shared between two data frames in R?

I have several data frames that share the same column names: an ID, the start (from) and end (to) of a range, and a group label identifying each data frame.
What I want is to find which from and to values of one data frame fall inside the ranges of the other one. I leave an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but no luck. The idea would be: when between() returns TRUE, return the maximum of the two from values and the minimum of the two to values across the data frames.
Here are example data frames and the results I want to obtain:
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(1,500,1000,1,500,1000,1,500,1000),
to=c(400,900,1400,400,900,1400,400,900,1400),group=rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(300,1200,1900,1400,2800,3700,1300,2500,3500),
to=c(500,1500,2000,2500,3000,3900,1400,2800,3900),group=rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),from=c(300,500,1200,1400,1300),
to=c(400,500,1400,1400,1400),group=rep("a, b",5))
I tried using this function, which returns values when there is a match, but it doesn't return the range shared between them:
f <- function(vec, id) {
if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can play with the idea that the starting and ending points are in different columns and that the ranges within the same group (a or b) do not overlap. This is my solution. For clarity, I have called your mutated 'from' and 'to' columns 'point_1' and 'point_2'.
You can bind the two data frames, sort them, and compare the from column with the previous row's end, lag(to), to see whether the current range starts before the previous one ends. You also compare the previous lag(to) with the current to column, taking whichever ends first as the end of the shared range.
Important: these operations do not distinguish whether the two rows being compared come from the same group (a or b). Filtering out the NAs in point_1 (the new mutated 'from' column) removes those wrongly mutated values.
Also, note that I assume a range in 'a' cannot overlap two rows in 'b'. In your 'results' table that doesn't happen, but you should check it in your data frames.
res = rbind(a,b) %>% # Bind by rows
arrange(ID,from) %>% # arrange by ID and starting point (from)
group_by(ID) %>% # perform the following operations grouped by IDs
# Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
# not overlap, when you mutate the following cols the result will be NA for
# point_1.
mutate(point_1 = ifelse(from <= lag(to), from, NA),
point_2 = ifelse(lag(to)>=to, to, lag(to)),
groups = paste(lag(group), group, sep = ',')) %>%
filter(! is.na(point_1)) %>% # remove NAs in from
select(ID,point_1, point_2, groups) # get the result dataframe
If you play a bit with the code without the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
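For comparison, a hedged sketch of the join-based idea from the question (keep overlapping pairs, then take the later start and the earlier end); unlike the lag() approach it also copes with a range in 'a' overlapping more than one row in 'b':
library(dplyr)
# Hedged sketch: pair rows by ID, keep overlapping pairs, and return the
# shared range as max(from)..min(to)
inner_join(a, b, by = "ID", suffix = c(".a", ".b")) %>%
  filter(from.a <= to.b, from.b <= to.a) %>%           # ranges overlap
  transmute(ID,
            from = pmax(from.a, from.b),               # later start
            to = pmin(to.a, to.b),                     # earlier end
            group = paste(group.a, group.b, sep = ", "))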

Calculate all possible product combinations between variables

I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easily doable manually with a small number of columns, but with more than 5 columns it gets lengthy and error-prone, especially when the column names contain prefixes, suffixes or embedded codes.
It would be a bonus if I could also control the maximum number of columns to consider at the same time (in the example above I only considered 2 columns, but it would be great to choose that parameter too, so as to add an extra variable a*b*c if needed).
My initial idea was to use expand.grid() with the column names and then somehow look up the full column values for the product, but I hope there is an easier way that I am not aware of.
You can use combn to create combinations of column names taken 2 at a time and multiply them to create new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
#. a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
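To also control the maximum combination size (the a*b*c case mentioned in the question), a hedged sketch along the same lines; m is an illustrative parameter, not part of the original answer:
m <- 3  # largest number of columns to multiply together
new_cols <- do.call(cbind, unlist(lapply(2:m, function(k) {
  combn(names(test), k, function(x) {
    setNames(data.frame(Reduce(`*`, test[x])), paste0(x, collapse = '-'))
  }, simplify = FALSE)
}), recursive = FALSE))
cbind(test, new_cols)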
Could this also be a solution? Ronak's solution is more elegant!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
output <- test %>%
  mutate(a_b = a * b,   # element-wise products; prod() would collapse each pair to a single number
         a_c = a * c,
         b_c = b * c) %>%
  select(-a, -b, -c)

How to assign the column name to the variable dynamically

I am currently developing an application and I need to loop through the columns of the data frame. For instance, if the data frame has the columns
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
If the input is given as "a", then the column name "b" should be assigned to the variable, say promote.
The code below throws an error: Error in `[.data.frame`(char_set, i + 1) : undefined columns selected. Is there any solution?
char_name <- "a"
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
for (i in 1:ncol(char_set)) {
promote <- ifelse(names(char_set) == char_name,char_set[i+1], "-")
print(promote)
}
Thanks in advance!!!
This is actually quite interesting. I would suggest doing something along those lines:
char_name <- "a"
char_set <- data.frame(
a = 1:2,
b = 3:4,
c = 5:6,
d = 8:9,
stringsAsFactors = FALSE
)
res_dta <- data.frame(matrix(nrow = 2, ncol = 3))
for (i in wrapr::seqi(1, NCOL(char_set) - 1)) {
print(i)
if (names(char_set)[i] == char_name) {
res_dta[i] <- char_set[i + 1]
} else {
res_dta[i] <- char_set[i]
}
}
Results
char_set
a b c d
1 1 3 5 8
2 2 4 6 9
res_dta
X1 X2 X3
1 3 3 5
2 4 4 6
There are a few general points:
When you are looping through columns, be mindful not to fall outside the data frame's dimensions; running i + 1 at i = 4 asks for column 5, which returns an error for a data frame with four columns. You may then decide to loop to one column less, or to break at a specific i value.
I'm not sure I got your request right: for column name a you want to take the values of column b; does column b then stay as it was?
Broadly speaking, I'm of the view that the names(char_set)[i] == char_name condition requires more thought, but this answer gives you a start. Updating your post with the desired results would help in designing a solution.
The problem in your code is that you loop from 1 to the number of columns of the char_set data frame and then access char_set[i+1].
Thus, when the index i takes its maximum value, the instruction char_set[i+1] returns an error because there is no column with that index.
You can try with this solution:
char_name<-"a"
promote<-ifelse((which(names(char_set)==char_name)+1)<ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "b"
char_name<-"d"
promote<-ifelse((which(names(char_set)==char_name)+1)<ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "-"
However, when the variable char_name takes the value a, promote takes the column name that char_set has at the position after the element named a (the one matching char_name).
I suggest you think about the case in which char_name takes the value d and there are no columns in char_set after d.

How to optimise an R function with 2 inputs within a loop

I am new to R and I am surprised at how long it takes to run what I believe to be rather simple lines of code; this leads me to believe I am missing something rather obvious. I have searched the internet and tried a few different iterations of the function, but nothing has improved the efficiency (measured in time).
The Extract data is a data frame with 18.5m rows and 11 variables. I am trying to establish two things: first, the percentage of patients who stay in hospital for 7 days or longer out of all patients, and second, stays of 21 days or longer as a proportion of the 7-day-plus stays.
LOS_prob_providerage <- function(x,y){
Var1 = which(Extract$LOS>=0 & Extract$ProviderCode == x & Extract$age_group == y)
Var2 = which(Extract$LOS>=7 & Extract$ProviderCode == x & Extract$age_group == y)
return(list(Strand=(sum(Extract$LOS[Var1] >= 7)/length(Var1))*100, ELOS=(sum(Extract$LOS[Var2] >= 21)/length(Var2))*100))
}
When I call this function I give it a list of hospitals as the x argument and one age group from a list as the y argument (I can't seem to get it to take both as lists and output all hospitals for all age groups), using the following code:
Providerage_prob_strand = mapply(LOS_prob_providerage,Provider_unique, agelabels[1], SIMPLIFY = FALSE)
I then create a data frame from the two lists that the function outputs, using the code below:
National = data.frame(matrix(unlist(Providerage_prob_strand), ncol=2,
byrow=T),row.names = Provider_unique)
colnames(National) <- c("Stranded_010","ELOS_010")
I subsequently re-run the last portion of code for all 11 elements in my age-group list and append the results to the National data frame.
Question 1: Is there a less computationally intensive way to code my loop in R, or is the loop just taking that long because of the way R stores everything in memory?
Question 2: Is there a way to give R two lists, for both the x and y variables, using mapply/sapply, and have it output the Strand and ELOS results across all hospitals/age groups?
I would use the data.table package for this.
Some dummy data to demonstrate (usually it is good practice for the question asker to provide this):
set.seed(123)
df1 = data.frame(
provider = sample(LETTERS[1:4], 1000, T),
los = round(runif(1000,0,40)),
age_group = sample(1:4,1000, T))
Now we turn this into a data table
library(data.table)
setDT(df1)
and we can extract the values you want like this:
providerlist = c('A','B')
age_list = c(1,2)
df1[provider %in% providerlist & age_group %in% age_list,
.(los_greater_than7 = 100*sum(los>7)/.N),
keyby = .(provider, age_group)]
# provider age_group los_greater_than7
# 1: A 1 92.40506
# 2: A 2 81.81818
# 3: B 1 77.27273
# 4: B 2 87.50000
df1[provider %in% providerlist & age_group %in% age_list & los>7,
.(los_greater_than20 = 100*sum(los>20)/.N),
by = .(provider, age_group)]
# provider age_group los_greater_than20
# 1: A 1 56.16438
# 2: A 2 66.66667
# 3: B 1 56.86275
# 4: B 2 58.92857
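Both measures can also be computed in a single pass (here using the >= 7 and >= 21 cut-offs from the question's function); a hedged sketch with illustrative column names:
df1[provider %in% providerlist & age_group %in% age_list,
    .(pct_7plus = 100 * sum(los >= 7) / .N,                          # 7+ day stays as % of all stays
      pct_21plus_of_7plus = 100 * sum(los >= 21) / sum(los >= 7)),   # 21+ day stays as % of 7+ day stays
    keyby = .(provider, age_group)]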

Find similar strings and reconcile them within one dataframe

Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this data frame for similar strings, i.e. ABBA and ABA as well as STING and STRING, and make them the same (it doesn't matter whether ABBA or ABA, either is fine), without my having to know the variations in advance? My actual data.frame is very big, so it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked around for agrep and stringdist, but the examples I found refer to two data frames or require naming the column, which I can't do since I have many of those.
Anyone an idea? Many thanks!
Best regards,
Steffi
This worked for me, but there might be a better solution.
The idea is to use a recursive function, special, that uses agrepl, the logical version of approximate grep (https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep). Note that you can specify the 'error tolerance' with which agrep groups similar strings. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
if (nrow(y) < 1) { # if y is empty return data
return(x)
} else {
similar <- agrepl(y$s[1], y$s) # find similar occurring strings
x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
y <- setdiff(y, y[similar,])
special(x, y, grp+1)
}
}
desired <- special(desired,df,grp)
To change the stringency of the string matching, adjust max.distance, e.g. agrepl(x, y, max.distance = 0.5).
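If you would rather expose that tolerance as an argument, a hedged sketch of the same recursion with an illustrative tol parameter:
# Hedged sketch: same recursion as above, with the agrepl tolerance passed through
special <- function(x, y, grp, tol = 0.1) {
  if (nrow(y) < 1) {          # nothing left to group: return the result
    return(x)
  } else {
    similar <- agrepl(y$s[1], y$s, max.distance = tol)   # tolerance passed through
    x <- rbind(x, y[similar, ] %>% mutate(s = head(s, 1)) %>% mutate(grp = grp))
    y <- setdiff(y, y[similar, ])
    special(x, y, grp + 1, tol)
  }
}
desired <- special(NULL, df, 1, tol = 0.1)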
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
