How to concatenate rows based on group as quickly as possible - r

I have a dataframe as follows
ClientVisitGUID LineNum TextCol
1 1 This was a great
1 2 report I did
2 3 was performed today
2 1 Another great report
2 2 for this person
3 2 good stuff
3 1 I really write very
3 3 when I put my
3 4 mind to it
I'd like to concatenate the rows based on the ClientVisitGUID and the line number so I can get the following output:
ClientVisitGUID TextCol
1 This was a great report I did
2 Another great report for this person was performed today
3 I really write very good stuff when I put my mind to it
I tried dplyr, but it takes a long time and can't cope with the thousands of rows I have:
resultset2<-resultset %>%
group_by(ClientVisitGUID) %>%
arrange(LineNum) %>%
summarize_all(paste, collapse=",")
Is there a faster way? I'm not really familiar with data.table, but is it fast?

A second data.table option, also using stringi for its performance
library(data.table)
library(stringi)
setDT(df)
setkey(df, ClientVisitGUID, LineNum)
df1 <- df[, .(new = stri_c(TextCol, collapse = " ")), by = ClientVisitGUID]
Result
df1
# ClientVisitGUID new
#1: 1 This was a great report I did
#2: 2 Another great report for this person was performed today
#3: 3 I really write very good stuff when I put my mind to it
Data (thanks to @ThomasIsCoding)
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))

A base R option is to use aggregate:
result <- aggregate(TextCol~ClientVisitGUID,
df[order(df$ClientVisitGUID,df$LineNum),],
paste0,
collapse = " ")
which gives
> result
ClientVisitGUID TextCol
1 1 This was a great report I did
2 2 Another great report for this person was performed today
3 3 I really write very good stuff when I put my mind to it
Data
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))

If you want speed, data.table is indeed a great candidate:
library(data.table)
setDT(resultset)
# Keying on both columns sorts by ClientVisitGUID, then LineNum
setkeyv(resultset, c("ClientVisitGUID", "LineNum"))
# Collapse only the text column, one row per ClientVisitGUID
resultset[, lapply(.SD, paste, collapse = ","), by = "ClientVisitGUID", .SDcols = "TextCol"]
Setting the key takes some time at first, but you will end up with faster operations afterwards. Setting the key reorders the table so that rows belonging to the same group sit in contiguous memory.
Example
data = data.table("a" = c("aaa","ffff","ttt"), "b" = c(1,1,2))
data[, lapply(.SD, paste, collapse = ","), by = "b"]
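To see what keying does, here is a minimal sketch on a made-up table (dt, id, line, and txt are illustrative names, not from the question). It assumes data.table is installed. After setkey the rows are physically sorted by the key columns, so pasting within each group already follows line order.

```r
library(data.table)

# Hypothetical toy table, deliberately out of order
dt <- data.table(
  id   = c(2L, 1L, 2L, 1L),
  line = c(2L, 2L, 1L, 1L),
  txt  = c("d", "b", "c", "a")
)

# Keying sorts the table in place by id, then line
setkey(dt, id, line)
key(dt)   # "id" "line"
dt$txt    # "a" "b" "c" "d"

# Grouped paste now sees the rows in line order
res <- dt[, .(txt = paste(txt, collapse = " ")), by = id]
res$txt   # "a b" "c d"
```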

Related

How to combine multiple text entries for a variable once dplyr has grouped by another variable [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
For hundreds of matters, my data frame has daily text entries by dozens of timekeepers. Not every timekeeper enters time each day for each matter. Text entries can be any length. Each entry for a matter is for work done on a different day (but for my purposes, figuring out readability measures for the text, dates don't matter). What I would like to do is to combine for each matter all of its text entries.
Here is a toy data set and what it looks like:
> dput(df)
structure(list(Matter = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 4L, 4L), .Label = c("MatterA", "MatterB", "MatterC", "MatterD"
), class = "factor"), Timekeeper = structure(c(1L, 2L, 3L, 4L,
2L, 3L, 1L, 1L, 3L, 4L), .Label = c("Alpha", "Baker", "Charlie",
"Delta"), class = "factor"), Text = structure(c(5L, 8L, 1L, 3L,
7L, 6L, 9L, 2L, 10L, 4L), .Label = c("all", "all we have", "good men to come to",
"in these times that try men's souls", "Now is", "of", "the aid",
"the time for", "their country since", "to fear is fear itself"
), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
dplyr groups the time records by matter, but I am stumped as to how to combine the text entries for each matter so that the result is along these lines -- all text gathered for a matter:
1 MatterA Now is the time for all good men to come to
5 MatterB the aid of their country since
8 MatterC all we have
9 MatterD to fear is fear itself in these times that try men's souls
dplyr::mutate() does not work with various concatenation functions:
textCombined <- df %>% group_by(Matter) %>% mutate(ComboText = str_c(Text))
textCombined2 <- df %>% group_by(Matter) %>% mutate(ComboText = paste(Text))
textCombined3 <- df %>% group_by(Matter) %>% mutate(ComboText = c(Text)) # creates numbers
Maybe a loop will do the job, as in "while the matter stays the same, combine the text" but I don't know how to write that. Or maybe dplyr has a conditional mutate, as in "mutate(while the matter stays the same, combine the text)."
Thank you for your help.
You can use group_by and summarise with paste:
> df %>% group_by(Matter) %>% summarise(line= paste(Text, collapse = " "))
# A tibble: 4 x 2
# Matter line
# <fct> <chr>
#1 MatterA Now is the time for all good men to come to
#2 MatterB the aid of their country since
#3 MatterC all we have
#4 MatterD to fear is fear itself in these times that try men's souls
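The reason mutate with paste(Text) did not combine anything is that paste without collapse works element by element and returns one string per input. A minimal base R sketch of the difference (txt is an illustrative vector, not the question's data):

```r
txt <- c("Now is", "the time for", "all")

# Without collapse: still three strings, one per element
per_element <- paste(txt)
length(per_element)   # 3

# With collapse: the elements are joined into a single string
joined <- paste(txt, collapse = " ")
joined                # "Now is the time for all"
```

That single string per group is what summarise needs, which is why the collapse argument matters here.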

Return a single row out of multiple rows with partially matching entries

I am reposting this question with a bit more clarity. Unfortunately, I didn't get any solutions from my previous posting. Please help me with this.
Below is what I want to do:
I have a dataset named Proteome. It has 14 columns and thousands of rows.
Row 1, column 5: GHFCLKPGCNFHAESTRGYR
Row 2, column 5: FCLKPGCNFHAESTRGYR
Row 3, column 5: GHFCLKPGCNFHAESTR
Row 4, column 5: GCNFHAESTR
Please click on this link to see a screenshot of part of the original data frame: i67.tinypic.com/2wd0ap3.png
So, in row 2, the first two letters of row 1 are missing; in row 3, the last three letters of row 1 are missing; in row 4, the first seven and last three letters of row 1 are missing.
Rows 2, 3, and 4 reflect the artifacts of the scientific method I have been using to generate the data, and therefore I want to remove these entries.
I want R to return only one of the four rows, ideally row 1, and remove the rest. The way R can do it is by first finding all rows with a matching string of letters and then eliminating such rows while keeping only one. For example, in the above data set, GCNFHAESTR match in all four rows, so I want R to return me only one row, ideally the top one. But I don't know how to do this.
Hope this makes better sense this time. I look forward to hearing from the experts.
Thanks!
In response to Julian_Hn's suggestion, here is the dput of my dataset:
dput(Proteome)
structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L,
3L), .Label = c("HCTF", "IFT", "ROSF"), class = "factor"), X..Proteins = c(5L,
5L, 5L, 5L, 3L, 7L), X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L), Previous.5.amino.acids = structure(c(4L,
5L, 4L, 2L, 3L, 1L), .Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY",
"TMYFC"), class = "factor"), Sequence = structure(c(5L, 1L, 4L,
2L, 3L, 6L), .Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"), class = "factor")), .Names = c("Protein.name",
"X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"
), class = "data.frame", row.names = c(NA, -6L))
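One possible approach, sketched under the assumption that a row is an artifact whenever its Sequence occurs inside a strictly longer Sequence elsewhere in the data. The data frame proteome_demo below is a hypothetical stand-in that only copies the six Sequence values from the dput. The comparison is all-against-all, so it may be slow for very large tables.

```r
# Hypothetical stand-in holding the six sequences from the dput above
proteome_demo <- data.frame(
  Sequence = c("GHFCLKPGCNFHAESTRGYR", "FCLKPGCNFHAESTRGYR",
               "GHFCLKPGCNFHAESTR", "GCNFHAESTR",
               "GFGFNWPHAVR", "GNFSVKLMNR"),
  stringsAsFactors = FALSE
)

seqs <- as.character(proteome_demo$Sequence)

# TRUE when a sequence is contained in some strictly longer sequence
is_fragment <- vapply(seqs, function(s) {
  longer <- seqs[nchar(seqs) > nchar(s)]
  any(grepl(s, longer, fixed = TRUE))
}, logical(1), USE.NAMES = FALSE)

# Keep only the rows that are not fragments of another row
kept <- proteome_demo[!is_fragment, , drop = FALSE]
kept$Sequence
# "GHFCLKPGCNFHAESTRGYR" "GFGFNWPHAVR" "GNFSVKLMNR"
```

Exact duplicates are not treated as fragments of each other here; they could be dropped separately with duplicated().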

Can I use %in% to search and match two columns?

I have a large dataframe and a vector of terms of interest to pull out. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull out information into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content of the vector and output the matching rows as a new dataframe.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not pick up the third row, which I guess you want extracted. You would have to change your y to
y <- c("ibp", "ORF1")
From your example I am not sure this is exactly what you want, but R has the operator | for "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
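If changing y by hand is not practical, a case-insensitive variant is to lower-case both sides before matching. This is a minimal sketch, assuming exact whole-value matches are wanted; demo is a hypothetical rebuild of the example data with plain character columns.

```r
# Hypothetical rebuild of the example data as character columns
demo <- data.frame(
  Gene     = c("ibp", "repA1", "pLeuDn_02", "leuA", "repA"),
  LocusTag = c("pBPS1_01", "pBPS1_02", "pLeuDn_02", "pleuBTgp4", "pleuBTgp5"),
  hit      = c("Ibp protein", "repA1 protein", "ORF1",
               "2-isopropylmalate synthase", "replication-associated protein"),
  stringsAsFactors = FALSE
)
y <- c("ibp", "orf1")

# Lower-case the columns and the search terms so case no longer matters
y_low <- tolower(y)
in_gene <- tolower(as.character(demo$Gene)) %in% y_low
in_hit  <- tolower(as.character(demo$hit)) %in% y_low

# Keep rows where either column matches
new.demo <- demo[in_gene | in_hit, ]
rownames(new.demo)   # "1" "3"
```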

Constructing All Possible Pairs within Groups

I have a large amount of graph data in the following form. Suppose a person has multiple interests.
person,interest
1,1
1,2
1,3
2,1
2,5
2,2
3,2
3,5
...
I want to construct all pairs of interests for each user. I would like to convert this into an edgelist like the following. I want the data in this format so that I can convert it into an adjacency matrix for graphing etc.
person,x_interest,y_interest
1,1,2
1,1,3
1,2,3
2,1,5
2,1,2
2,5,2
3,2,5
There is one solution here: Pairs of Observations within Groups but it works only for small datasets as the call to table wants to generate more than 2^31 elements. Is there another way that I can do this without having to rely on table?
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'person', get the unique pairwise combinations of 'interest' to create two columns ('x_interest' and 'y_interest').
library(data.table)
setDT(df1)[, {
  tmp <- combn(unique(interest), 2)  # 2 x k matrix, one pair per column
  list(x_interest = tmp[1, ], y_interest = tmp[2, ])
}, by = person]
NOTE: To speed up, combnPrim from library(gRbase) could be used in place of combn.
data
df1 <- structure(list(person = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
interest = c(1L,
2L, 3L, 1L, 5L, 2L, 2L, 5L)), .Names = c("person", "interest"
), class = "data.frame", row.names = c(NA, -8L))
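One caveat worth guarding against: when combn is given a single positive integer n, it treats it as seq_len(n), so a person with exactly one interest would produce spurious pairs. Below is a base R sketch of the guard; pairs_for is a hypothetical helper and person 4 is added only to illustrate the edge case. The same length check could be placed inside the data.table call.

```r
# Example data plus a hypothetical person 4 with a single interest
demo <- data.frame(
  person   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L),
  interest = c(1L, 2L, 3L, 1L, 5L, 2L, 2L, 5L, 7L)
)

# All pairs of interests for one person, or NULL if fewer than two interests
pairs_for <- function(d) {
  u <- unique(d$interest)
  if (length(u) < 2L) return(NULL)  # avoids combn(7, 2) expanding to 1:7
  tmp <- combn(u, 2)                # 2 x k matrix, one pair per column
  data.frame(person = d$person[1], x_interest = tmp[1, ], y_interest = tmp[2, ])
}

per_person <- lapply(split(demo, demo$person), pairs_for)
per_person <- Filter(Negate(is.null), per_person)  # drop people without pairs
edges <- do.call(rbind, per_person)
rownames(edges) <- NULL
edges
```

Splitting in base R is slower than the grouped data.table call, so this is meant to show the guard, not to replace the answer on large data.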

Filling in missing (blanks) in a data table, per category - backwards and forwards

I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in but do not know how to implement it in data.table under R. I know that there are things such as na.locf in the zoo package and self rolling join in the data.table package. The examples that I have seen are too simplistic and do not help me.
Here is some fictitious data to orient you (as a dput ASCII text representation)
structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy",
"Timothy"), class = "factor"), patient.last.name = structure(c(3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones",
"Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455,
4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775,
2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087,
39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360,
39184, 39194, 39198, 39216), procedure.code = c(44750, 38995,
40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750,
44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87,
456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87,
456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413,
413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe",
"Mark"), class = "factor"), referring.doctor.last = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead",
"Wydell"), class = "factor"), referring.docotor.zip = c(15209,
15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA,
15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L,
NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio",
"cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name",
"patient.last.name", "medical.record.nr", "date.of.service",
"procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first",
"referring.doctor.last", "referring.docotor.zip", "some.other.stuff"
), row.names = c(NA, 14L), class = "data.frame")
The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words, the LOCF must only be applied within one patient, who is identified by the combination of patient.first.name, patient.last.name, and medical.record.nr. Also note that some patients are missing the referring doctor on their very first visit, which means that some observations have to be carried backwards. To complicate matters, some patients change primary care physicians, so there may be one referring doctor earlier on and another one later on. The algorithm therefore needs to be aware of the date order of the rows with missing values.
In zoo's na.locf I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen would not work here because I cannot simply take out the rows with the missing referring.doctor information, since I would then lose date.of.service, procedure.code, et cetera. I would love your help in learning how R can fill in my missing data.
A more concise example would have been easier to answer. For example you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?
Since you already have NAs in the data that you wish to fill, it's not really roll in data.table. A rolling join is more for when your data has no NA but you have another time series (for example) that joins to positions in between the data. (One efficiency advantage there is the very fact that you don't create NAs first, which you then have to fill in a second step.) Or, in other words, in your question you just have one dataset; you aren't joining two.
So you do need na.locf as @Joshua suggested. I'm not aware of a function that fills NA forward and then the first value backwards, though.
In data.table, to use na.locf by group it's just:
require(data.table)
require(zoo)
DT[, doctor := na.locf(doctor, na.rm = FALSE), by = patient]  # na.rm = FALSE keeps leading NAs so lengths match
which has the efficiency advantages of fast aggregation and update by reference. You would have to write a new small function on top of na.locf to roll the first non NA backwards.
Ensure the data is sorted by patient then date, first. Then the above will cope with changes in doctor over time, since by maintains the order of rows within each group.
Hope that gives you some hints.
@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.
In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.
library(data.table)
library(zoo)
setDT(bill)
# Carry the last observation forward within each patient
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]
# Then carry the next observation backward to fill any leading NAs
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE, fromLast = TRUE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]
Then do something similar for referring.doctor.first.
A few pointers:
The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.
One must use the na.rm=FALSE argument. If one does not then a patient who is missing information for a referring physician on their very first visit will have the NA removed and the vector of new values (existing + carried forward) will be one element short of the number of rows. The shortened vector is recycled and everything gets shifted up and the last row gets the first element of the vector as it is recycled. In other words, a big mess. And worst of all you will only see it sometimes.
Use fromLast=TRUE to run through the column again. That fills in the NAs that preceded any data: with fromLast=TRUE, zoo carries the next observation backward (NOCB) instead of the last observation forward (LOCF). Happiness - you have now filled in the missing data in a way that is correct for most circumstances.
You can pass multiple := per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]
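The pointers above can be tried on a small made-up table. This is a sketch only: DT, patient, date, and doctor are illustrative names, and it assumes data.table and zoo are installed. It shows the length problem with the default na.rm = TRUE, then the forward and backward fill within each patient.

```r
library(data.table)
library(zoo)

# Hypothetical toy table: patient 2 has no doctor on the first visit
DT <- data.table(
  patient = c(1, 1, 1, 2, 2, 2),
  date    = c(1, 2, 3, 1, 2, 3),
  doctor  = c("Abe", NA, NA, NA, "Mark", NA)
)
setkey(DT, patient, date)  # sort by patient, then date

# The default na.rm = TRUE drops the leading NA, so the result is too short
len_default <- length(na.locf(c(NA, "Mark", NA)))                 # 2
len_kept    <- length(na.locf(c(NA, "Mark", NA), na.rm = FALSE))  # 3

# Forward fill, then backward fill, both restricted to one patient
DT[, doctor := na.locf(na.locf(doctor, na.rm = FALSE),
                       na.rm = FALSE, fromLast = TRUE),
   by = patient]
DT$doctor
# "Abe" "Abe" "Abe" "Mark" "Mark" "Mark"
```

A patient with no doctor recorded on any row would simply stay NA under this approach.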

Resources