How many people received 4 drugs of interest? (R)

I have a long list of people receiving drugs coded in the variable ATC. I want to find out how many people have used 4 specific drugs. For example, I want to count how many people have used this particular pattern of drugs: "C07ABC" & "C09XYZ" & "C08123" & "C03ZOO". Some people may have used some agents (e.g. C07 or C08) more than once; that's OK, I just want to count how many unique people had the regimen I'm interested in. I don't care how many times they had each drug. However, because I have various patterns that I want to look up, I would like to use the grepl function. To explain this further, my first attempt at this problem tried a sum command:
sum(df[grepl('^C07.*?'|'^C09.*?'|'^C08.*?|C03.*?', as.character(df$atc)),])
However, this doesn't work, because I think the sum command needs a boolean function. Also, I think the | sign isn't correct here either (I want an &), but I'm just showing the code so that you know what I'm after. Maybe an ave function is what I need, but I am unsure how I would code this.
Thanks in advance.
df
names fruit dates atc
4 john kiwi 2010-07-01 C07ABC
7 john apple 2010-09-01 C09XYZ
9 john banana 2010-11-01 C08123
13 john orange 2010-12-01 C03ZOO
14 john apple 2011-01-01 C07ABC
2 mary orange 2010-05-01 C09123
5 mary apple 2010-07-01 C03QRT
8 mary orange 2010-07-01 C09ZOO
10 mary apple 2010-09-01 C03123
12 mary apple 2010-11-01 C09123
1 tom apple 2010-02-01 C03897
3 tom banana 2010-03-01 C02CAMN
6 tom apple 2010-06-01 C07123
11 tom kiwi 2010-08-01 C02DA12

You might consider avoiding the use of regular expressions, and instead derive some set of meaningful columns from column atc. For combinations, you probably want a 2-way table of person and drug, and then compute on the matrix to count combinations.
For example:
tab <- xtabs(~ names + atc, df)
combo <- c("C07ABC", "C09XYZ", "C08123", "C03ZOO")
haveCombo <- rowSums(tab[,combo] > 0) == length(combo)
sum(haveCombo)
The last two lines could easily be turned into a function for each combination.
EDIT: This approach can be applied to other, derived columns, so if you're interested in the prefix then,
df$agent <- substring(df$atc, 1, 3)
tab <- xtabs(~ names + agent, df)
combo <- c("C07", "C09", "C08", "C03")
and proceed as before.
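As a rough sketch of the "turn it into a function" suggestion: count_combo() below is a hypothetical helper name (not from any package), and the data frame is a cut-down copy of the question's data with only the names and atc columns. The prefix_len argument is my own addition to cover the EDIT case. It also guards against a code that nobody received, which would otherwise make tab[, combo] fail with a subscript error.

```r
# Toy data mirroring the question (only the columns needed here).
df <- data.frame(
  names = c(rep("john", 5), rep("mary", 5), rep("tom", 4)),
  atc = c("C07ABC", "C09XYZ", "C08123", "C03ZOO", "C07ABC",
          "C09123", "C03QRT", "C09ZOO", "C03123", "C09123",
          "C03897", "C02CAMN", "C07123", "C02DA12"),
  stringsAsFactors = FALSE
)

# count_combo() is a hypothetical helper, not part of any package.
# It counts the people who have every code in `combo` at least once;
# `prefix_len` optionally truncates the codes first (e.g. 3 for "C07").
count_combo <- function(data, combo, prefix_len = NULL) {
  codes <- as.character(data$atc)
  if (!is.null(prefix_len)) codes <- substring(codes, 1, prefix_len)
  tab <- table(data$names, codes)
  # A code that nobody received is absent from the table,
  # so nobody can have the full combination.
  if (!all(combo %in% colnames(tab))) return(0L)
  sum(rowSums(tab[, combo, drop = FALSE] > 0) == length(combo))
}

count_combo(df, c("C07ABC", "C09XYZ", "C08123", "C03ZOO"))
count_combo(df, c("C07", "C09", "C08", "C03"), prefix_len = 3)
```

Calling it once per regimen (for example with lapply over a list of combos) would then give one count per pattern.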

In addition to not needing to pass entire data frame rows to sum, you also had extra quote marks in that pattern:
> sum( grepl('^C07.*|^C09.*|^C08.*|C03.*', df$atc) )
[1] 12
I think this is easier to read:
> sum( grepl('^(C07|C09|C08|C03).*', df$atc) )
[1] 12
But now I read that you want all of those used, and to do the calculation within a patient id. That might have required using & as the connector, but I decided to try a different route: use unique and then count the number of unique matches, all within an aggregate operation.
> aggregate(atc ~ names, data=df,
function(drgs) length(unique(grep('^(C07|C09|C08|C03)', drgs))))
names atc
1 john 5
2 mary 5
3 tom 2
That, however, is the number of matching items, not the number of unique items, because I forgot to put value=TRUE in the grep call (and I also need to use substr to avoid separately counting congeners with different trailing ATC codes):
> aggregate(atc ~ names, data=df, function(drgs) length(unique(grep('^C0[7983]', substr(drgs,1,3), value=TRUE))))
names atc
1 john 4
2 mary 2
3 tom 2
This would be somewhat similar to @MichaelLawrence's matrix/table approach, but I think it would scale better since the "tables" being created would be much smaller:
combo <- c("C07", "C09", "C08", "C03")
tapply(df$atc, df$names, function(drgs) sum(combo %in% substr(drgs,1,3)) )
#------
john mary tom
4 2 2
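If you would rather keep grepl (say, because the patterns are more complicated than a fixed prefix), the "&" logic can be expressed as "every pattern matches at least one of this person's codes". This is only a sketch under that reading of the question: has_all() is a made-up helper name, and the data frame is a cut-down copy of the question's data.

```r
# Toy data mirroring the question (only the columns needed here).
df <- data.frame(
  names = c(rep("john", 5), rep("mary", 5), rep("tom", 4)),
  atc = c("C07ABC", "C09XYZ", "C08123", "C03ZOO", "C07ABC",
          "C09123", "C03QRT", "C09ZOO", "C03123", "C09123",
          "C03897", "C02CAMN", "C07123", "C02DA12"),
  stringsAsFactors = FALSE
)

# One regular expression per drug in the regimen.
patterns <- c("^C07", "^C09", "^C08", "^C03")

# has_all() is a hypothetical helper: TRUE when every pattern
# matches at least one of the codes in `drgs`.
has_all <- function(drgs) {
  all(vapply(patterns, function(p) any(grepl(p, drgs)), logical(1)))
}

# Apply it within each person, then count the people who qualify.
per_person <- tapply(df$atc, df$names, has_all)
per_person
sum(per_person)
```

The any() inside handles repeated use of the same agent, and the all() outside plays the role of the & you were after.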

You can try this:
drugs <- c("C07ABC","C09XYZ", "C08123", "C03ZOO")
table(unique(df[df$atc %in% drugs, c("names", "atc")])$names)
# john mary tom
# 4 0 0
names(which(table(unique(df[df$atc %in% drugs, c("names", "atc")])$names) > 3))
# [1] "john"
Data
df <- structure(list(names = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("john", "mary", "tom"
), class = "factor"), fruit = structure(c(3L, 1L, 2L, 4L, 1L,
4L, 1L, 4L, 1L, 1L, 1L, 2L, 1L, 3L), .Label = c("apple", "banana",
"kiwi", "orange"), class = "factor"), dates = structure(c(5L,
7L, 8L, 9L, 10L, 3L, 5L, 5L, 7L, 8L, 1L, 2L, 4L, 6L), .Label = c("2010-02-01",
"2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-08-01",
"2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01"), class = "factor"),
atc = structure(c(8L, 11L, 9L, 6L, 8L, 10L, 5L, 12L, 3L,
10L, 4L, 1L, 7L, 2L), .Label = c("C02CAMN", "C02DA12", "C03123",
"C03897", "C03QRT", "C03ZOO", "C07123", "C07ABC", "C08123",
"C09123", "C09XYZ", "C09ZOO"), class = "factor")), .Names = c("names",
"fruit", "dates", "atc"), class = "data.frame", row.names = c("4",
"7", "9", "13", "14", "2", "5", "8", "10", "12", "1", "3", "6",
"11"))

This is just a continuation of @Michael Lawrence's answer. I changed the drugs to what @user2363642 wanted, and I also truncated the atc column to its first three characters, which, again, I believe is what @user2363642 wanted. Also, before taking the rowSums, I changed all non-zero counts to 1, to ensure we don't double-count drugs.
drugs <- c("C07", "C09", "C08", "C03")
df$atc.abbr <- substring(df$atc, 1, 3)
xt <- xtabs(~ names + atc.abbr, df)
xt[xt>0] <- 1
rowSums(xt[,drugs]) >= length(drugs)
Output:
john mary tom
TRUE FALSE FALSE

Related

group_by edit distance between rows over multiple columns

I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
cheating:
This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.
In this example I am using the standard dist() function to calculate differences; the stringdist function may be better, or there may be another option out there to try.
df<- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))
plot(scores)
#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)
#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)
#select groups with 2 or more people in them
suspectgroups <- table(groups)[table(groups) >=2]
#%>%, filter() and arrange() come from dplyr
library(dplyr)
potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
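For the pairwise table the question actually asks for, here is a minimal base R sketch. It assumes "difference" means the sum of absolute differences across q1 to q4 (Manhattan distance), which reproduces the numbers in the expected output, and it splits by class first so students are only compared with classmates. pairwise_by_class() is a hypothetical helper name.

```r
df <- data.frame(
  class = c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
  id = c(12L, 16L, 18L, 24L, 35L, 36L),
  q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
  q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
  q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
  q4 = c(3L, 2L, 4L, 3L, 3L, 2L),
  stringsAsFactors = FALSE
)
question_cols <- c("q1", "q2", "q3", "q4")

# Hypothetical helper: all pairs of students within one class,
# with the sum of absolute score differences for each pair.
pairwise_by_class <- function(d) {
  if (nrow(d) < 2) return(NULL)  # a class of one has no pairs
  m <- as.matrix(dist(d[, question_cols], method = "manhattan"))
  idx <- combn(nrow(d), 2)       # row positions of each pair
  data.frame(
    class = d$class[1],
    id1 = d$id[idx[1, ]],
    id2 = d$id[idx[2, ]],
    difference = m[cbind(idx[1, ], idx[2, ])]
  )
}

pairs <- do.call(rbind, lapply(split(df, df$class), pairwise_by_class))
rownames(pairs) <- NULL
pairs

# Flag pairs at or below a threshold (0 here means identical answers).
suspects <- pairs[pairs$difference <= 0, ]
suspects
```

This gives a long table with one row per pair rather than the packed "(id1, id2 = difference)" strings; the long form is easier to filter, and it can be joined back onto df by id if the per-student columns are needed.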

Select unique values in dataframe based on sorted value

Has anyone selected unique values from a dataframe based on a second value's highest value?
Example:
name value
cheese 15
pepperoni 12
cheese 9
tomato 4
cheese 3
tomato 2
The best I've come up with (and I am SURE there's a better way) is to sort df by value descending, extract df$name, run unique() on that, then do a left join back with dplyr.
The ideal outcome is this:
name value
cheese 15
pepperoni 12
tomato 4
Thanks in advance!
Seeing your expected result, for each name, you are looking for the row that has the largest number. One way to achieve this task is the following.
library(dplyr)
group_by(mydf, name) %>%
slice(which.max(value))
# A tibble: 3 x 2
# Groups: name [3]
# name value
# <fct> <int>
#1 cheese 15
#2 pepperoni 12
#3 tomato 4
DATA
mydf <- structure(list(name = structure(c(1L, 2L, 1L, 3L, 1L, 3L), .Label = c("cheese",
"pepperoni", "tomato"), class = "factor"), value = c(15L, 12L,
9L, 4L, 3L, 2L)), class = "data.frame", row.names = c(NA, -6L
))
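For comparison, the sort-then-deduplicate idea from the question can be done in base R without the join. This is a sketch, with mydf rebuilt here using character names instead of a factor: sort by value descending, then keep the first row seen for each name.

```r
mydf <- data.frame(
  name = c("cheese", "pepperoni", "cheese", "tomato", "cheese", "tomato"),
  value = c(15L, 12L, 9L, 4L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Sort so the largest value for each name comes first...
sorted <- mydf[order(mydf$value, decreasing = TRUE), ]
# ...then drop every later (smaller) row with a name already seen.
top <- sorted[!duplicated(sorted$name), ]
top
```

Because whole rows are kept, any other columns in the data frame come along with the maximum.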

Is there an equivalent for the tidyr fill() for strings in R?

So I have a data frame like this one:
First Group Bob
Joe
John
Jesse
Second Group Jane
Mary
Emily
Sarah
Grace
I would like to fill in the empty cells in the first column in the data frame with the last string in that column i.e
First Group Bob
First Group Joe
First Group John
First Group Jesse
Second Group Jane
Second Group Mary
Second Group Emily
Second Group Sarah
Second Group Grace
With tidyr, there is fill() but it obviously doesn't work with strings. Is there an equivalent for strings? If not is there a way to accomplish this?
It seems fill() is designed to be used as its own step rather than inside mutate(): it expects a data frame as its first argument, not a single column. When fill() is used inside a mutate() statement an error appears (regardless of the data type), but it works when used as just a component of the pipe structure. Could that have been the problem?
Just for full clarity, a quick example. Assuming you have a data frame called 'people' with columns 'group' and 'name', the right structure would be:
people %>%
fill(group)
and the following would give the error you described (and a similar error when using numbers):
people %>%
mutate(
group = fill(group)
)
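One more thing worth checking: fill() only fills in missing values (NA), so if the blank cells are empty strings they have to be converted first. A small sketch, assuming the blanks really are "" and using the same made-up 'people' data frame with 'group' and 'name' columns:

```r
library(tidyr)

# Assumed layout: blanks in `group` are empty strings, not NA.
people <- data.frame(
  group = c("First Group", "", "", "", "Second Group", "", "", "", ""),
  name = c("Bob", "Joe", "John", "Jesse", "Jane", "Mary", "Emily", "Sarah", "Grace"),
  stringsAsFactors = FALSE
)

# fill() skips non-NA values, so turn the empty strings into NA first.
people$group[people$group == ""] <- NA

# Carry the last non-missing group label downwards.
filled <- fill(people, group)
filled
```

With the blanks recoded, the character column fills the same way a numeric one would.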
(I made the assumption that this was output from an R console session. If it's a raw text file the data input may need to be done with read.fwf.)
The display suggests those are empty character values in the "spaces".
First set them to NA and then use na.locf from zoo:
dat[dat==""] <- NA
dat[1:2] <- lapply(dat[1:2], zoo::na.locf)
dat
#------------
V1 V2 V3
1 First Group Bob
2 First Group Joe
3 First Group John
4 First Group Jesse
5 Second Group Jane
6 Second Group Mary
7 Second Group Emily
8 Second Group Sara
9 Second Group Grace
This is the data I was starting with:
dat <-
structure(list(V1 = structure(c(2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L,
1L), .Label = c("", "First", "Second"), class = "factor"), V2 = structure(c(2L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", "Group"), class = "factor"),
V3 = structure(c(1L, 6L, 7L, 5L, 4L, 8L, 2L, 9L, 3L), .Label = c("Bob",
"Emily", "Grace", "Jane", "Jesse", "Joe", "John", "Mary",
"Sara"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
If I have to take a stab at what your data structure is, I might have something like this:
df <- data.frame(c1=c("First Group", "", "", "", "Second Group", "", "", "", ""),
c2=c("Bob","Joe","Jon","Jesse","Jane","Mary","Emily","Sara","Grace"),
stringsAsFactors = FALSE)
Then, a very basic way to do this would be by simply looping:
for(i in 2:nrow(df)) if(df$c1[i]=="") df$c1[i] <- df$c1[i-1]
df
c1 c2
1 First Group Bob
2 First Group Joe
3 First Group Jon
4 First Group Jesse
5 Second Group Jane
6 Second Group Mary
7 Second Group Emily
8 Second Group Sara
9 Second Group Grace
However, I would suggest you accept @42-'s solution if you have anything other than a small data set, as zoo::na.locf is optimized to work with large numbers of records and zoo is a well-respected, widely used, stable package.

Add rows when values in columns are equal in df

For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3
etc. (Those with different animals would be left as they are.) Importantly, the sum of 'number' in both dataframes would be the same.
My real df is >400K observations, so anything anyone could recommend that copes with a large dataset would be great!
Thanks in advance.
One option would be to use data.table. Convert the "data.frame" to a "data.table" with setDT(); where "animal.1" is equal to "animal.2", replace "number" with the sum of "number" after grouping by the two columns; and finally get the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with sum only when "animal.1" is equal to "animal.2", and get the unique rows
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()
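A base R variant of the same idea, as a sketch only: sum the same-animal rows with aggregate() and bind the mixed rows back on untouched. It avoids the final unique() step, which would also drop a mixed row if it happened to be an exact duplicate of another (same animals and same number) and so change the total. The data is the sample df rebuilt with character columns instead of factors.

```r
df <- data.frame(
  animal.1 = c("cat", "cat", "dog", "dog", "dog", "rabbit", "rabbit", "horse", "cat", "cat"),
  animal.2 = c("cat", "dog", "dog", "dog", "rabbit", "rabbit", "cat", "cat", "hamster", "cat"),
  number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L),
  stringsAsFactors = FALSE
)

# Rows where both columns name the same animal.
same <- df$animal.1 == df$animal.2

# Collapse only those rows, summing `number` per animal pair.
summed <- aggregate(number ~ animal.1 + animal.2, data = df[same, ], FUN = sum)

# Mixed rows are left exactly as they are.
result <- rbind(summed, df[!same, ])
rownames(result) <- NULL
result

# The total should be unchanged.
sum(result$number) == sum(df$number)
```

I have not timed this on 400K rows; for data of that size the data.table version is likely the faster of the options.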

Pairs of Observations within Groups

I've got a problem that I know how to solve using SQL, but I'm looking to implement a solution in R with a new data set. I've been trying to figure out things with the reshape2 package, but I haven't had any luck with what I'm trying to accomplish. Here's my problem:
I have a dataset in which I need to look at all pairs of items that are together from within another group. I've created a toy example below to further explain.
BUNCH FRUITS
1 apples
1 bananas
1 mangos
2 apples
3 bananas
3 apples
4 bananas
4 apples
What I want is a listing of all possible pairs and sum the frequency they occur together within a bunch. My output would ideally look like this:
FRUIT1 FRUIT2 FREQUENCY
APPLES BANANAS 3
APPLES MANGOS 1
My end goal is to make something that I'll eventually be able to import into Gephi for a network analysis. For this I need a Source and Target column (aka FRUIT1 and FRUIT2 above).
The original solution in SQL is here if that would help anyone: PROC SQL in SAS - All Pairs of Items
The following seems valid:
tmp = table(DF$FRUITS, DF$BUNCH) != 0
#> tmp
# 1 2 3 4
# apples TRUE TRUE TRUE TRUE
# bananas TRUE FALSE TRUE TRUE
# mangos TRUE FALSE FALSE FALSE
do.call(rbind,
combn(unique(as.character(DF$FRUITS)),
2,
function(x) data.frame(fr1 = x[1],
fr2 = x[2],
freq = sum(colSums(tmp[x, ]) == 2)),
simplify = F))
# fr1 fr2 freq
#1 apples bananas 3
#2 apples mangos 1
#3 bananas mangos 1
Where DF:
DF = structure(list(BUNCH = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L), FRUITS = structure(c(1L,
2L, 3L, 1L, 2L, 1L, 2L, 1L), .Label = c("apples", "bananas",
"mangos"), class = "factor")), .Names = c("BUNCH", "FRUITS"), class = "data.frame", row.names = c(NA,
-8L))
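An alternative sketch of the same computation that avoids calling a function once per pair: build a bunch-by-fruit 0/1 incidence matrix and take its cross-product, which gives the number of bunches every pair of fruits shares in one step. The FRUIT1/FRUIT2/FREQUENCY column names follow the question; DF is rebuilt here with character fruit names.

```r
DF <- data.frame(
  BUNCH = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L),
  FRUITS = c("apples", "bananas", "mangos", "apples",
             "bananas", "apples", "bananas", "apples"),
  stringsAsFactors = FALSE
)

# 0/1 matrix: rows are bunches, columns are fruits.
incidence <- unclass(table(DF$BUNCH, DF$FRUITS)) > 0
storage.mode(incidence) <- "integer"

# Entry [i, j] counts the bunches containing both fruit i and fruit j.
co <- crossprod(incidence)

# Keep each unordered pair once (upper triangle, no diagonal).
idx <- which(upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(
  FRUIT1 = rownames(co)[idx[, "row"]],
  FRUIT2 = colnames(co)[idx[, "col"]],
  FREQUENCY = co[idx],
  stringsAsFactors = FALSE
)

# Drop pairs that never occur together.
pairs <- pairs[pairs$FREQUENCY > 0, ]
pairs
```

The FRUIT1 and FRUIT2 columns can serve directly as the Source and Target columns for the Gephi import.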
