Pairs of Observations within Groups

Pairs of Observations within Groups - r

I've got a problem that I know how to solve using SQL, but I'm looking to implement a solution in R with a new data set. I've been trying to figure out things with the reshape2 package, but I haven't had any luck with what I'm trying to accomplish. Here's my problem:
I have a dataset in which I need to look at all pairs of items that are together from within another group. I've created a toy example below to further explain.
BUNCH FRUITS
1 apples
1 bananas
1 mangos
2 apples
3 bananas
3 apples
4 bananas
4 apples
What I want is a listing of all possible pairs and sum the frequency they occur together within a bunch. My output would ideally look like this:
FRUIT1 FRUIT2 FREQUENCY
APPLES BANANAS 3
APPLES MANGOS 1
My end goal is to make something that I'll eventually be able to import into Gephi for a network analysis. For this I need a Source and Target column (aka FRUIT1 and FRUIT2 above).
The original solution in SQL is here if that would help anyone: PROC SQL in SAS - All Pairs of Items

The following seems valid:
tmp = table(DF$FRUITS, DF$BUNCH) != 0
#> tmp
# 1 2 3 4
# apples TRUE TRUE TRUE TRUE
# bananas TRUE FALSE TRUE TRUE
# mangos TRUE FALSE FALSE FALSE
do.call(rbind,
combn(unique(as.character(DF$FRUITS)),
2,
function(x) data.frame(fr1 = x[1],
fr2 = x[2],
freq = sum(colSums(tmp[x, ]) == 2)),
simplify = F))
# fr1 fr2 freq
#1 apples bananas 3
#2 apples mangos 1
#3 bananas mangos 1
Where DF:
DF = structure(list(BUNCH = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L), FRUITS = structure(c(1L,
2L, 3L, 1L, 2L, 1L, 2L, 1L), .Label = c("apples", "bananas",
"mangos"), class = "factor")), .Names = c("BUNCH", "FRUITS"), class = "data.frame", row.names = c(NA,
-8L))

Related

Select unique values in dataframe based on sorted value

Has anyone selected unique values from a dataframe based on a second value's highest value?
Example:
name value
cheese 15
pepperoni 12
cheese 9
tomato 4
cheese 3
tomato 2
The best I've come up with - which I am SURE there's a better way - is to sort df by value descending, extract df$name, run unique() on that, then do a left join back with dplyr.
The ideal outcome is this:
name value
cheese 15
pepperoni 12
tomato 4
Thanks in advance!

Seeing your expected result, for each name, you are looking for the row that has the largest number. One way to achieve this task is the following.
library(dplyr)
group_by(mydf, name) %>%
slice(which.max(value))
# A tibble: 3 x 2
# Groups: name [3]
# name value
# <fct> <int>
#1 cheese 15
#2 pepperoni 12
#3 tomato 4
DATA
mydf <- structure(list(name = structure(c(1L, 2L, 1L, 3L, 1L, 3L), .Label = c("cheese",
"pepperoni", "tomato"), class = "factor"), value = c(15L, 12L,
9L, 4L, 3L, 2L)), class = "data.frame", row.names = c(NA, -6L
))

Combine several data.frames into one (keep every rownames)

I have input dataframes Berry and Orange
Berry = structure(list(Name = c("ACT", "ACTION", "ACTIVISM", "ACTS",
"ADDICTION", "ADVANCE"), freq = c(2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
Orange = structure(list(Name = c("ACHIEVE", "ACROSS", "ACT", "ACTION",
"ADVANCE", "ADVANCING"), freq = c(1L, 3L, 1L, 1L, 1L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
Running the following operation will give me the desired output
output = t(merge(Berry,Orange, by = "Name", all = TRUE))
rownames(output) = c("","Berry","Orange")
colnames(output) = output[1,]
output = output[2:3,]
output = data.frame(output)
However, now I have to create output from 72 dataframes similar to Berry and Orange. Since merge appears to work with only two data.frame at a time, I'm not sure what would be the best approach for me. I tried rbind.fill which kept the values but lost the Names. I found this and this but couldn't figure out a solution on my own.
Here is one more data.frame in order to provide a reproducible example
Apple = structure(list(Name = c("ABIDING", "ABLE", "ABROAD", "ACROSS",
"ACT", "ADVANTAGE"), freq = c(1L, 1L, 1L, 4L, 2L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to figure out how to obtain outputfrom Apple, Berry, and Orange. I am looking for a solution that would work for multiple dataframes preferably without me having to provide the dataframes manually.
You can assume that the data.frame names to be processed for getting the output is available in a list df_names:
df_names = c("Apple","Berry","Orange")
Or, you can also assume that every data.frame in the Global Environment needs to be processed to create output.

If you have all your data frames in an environment, you can get them into a named list then use package reshape2 to reshape the list. If desired, you can then set the first column as the row names.
library(reshape2)
dcast(melt(Filter(is.data.frame, mget(ls()))), L1 ~ Name)
# L1 ABIDING ABLE ABROAD ACHIEVE ACROSS ACT ACTION ACTIVISM ACTS ADDICTION ADVANCE ADVANCING ADVANTAGE
# 1 Apple 1 1 1 NA 4 2 NA NA NA NA NA NA 1
# 2 Berry NA NA NA NA NA 2 2 1 1 1 1 NA NA
# 3 Orange NA NA NA 1 3 1 1 NA NA NA 1 1 NA
Note: This assumes all your data is in the global environment and that no other data frames are present except the ones to be used here.

We can use tidyverse
library(dplyr)
library(tidyr)
list(Apple = Apple, Orange = Orange, Berry = Berry) %>%
bind_rows(.id = "objName") %>%
spread(Name, freq, fill = 0)
# objName ABIDING ABLE ABROAD ACHIEVE ACROSS ACT ACTION ACTIVISM ACTS ADDICTION ADVANCE ADVANCING ADVANTAGE
#1 Apple 1 1 1 0 4 2 0 0 0 0 0 0 1
#2 Berry 0 0 0 0 0 2 2 1 1 1 1 0 0
#3 Orange 0 0 0 1 3 1 1 0 0 0 1 1 0
As you have 72 data.frames, it is better not to create all these objects in the global environment. Instead, read the dataset files in a list and then do the processing. Suppose, if the files are all in the working directory
files <- list.files(pattern = ".csv")
lapply(files, read.csv, stringsAsFactors=FALSE)
and then do the processing with bind_rows as above. As it is not clear about the file names, we cannot comment on how to create the 'objName'

r creating an adjacency matrix from columns in a dataframe

I am interested in testing some network visualization techniques but before trying those functions I want to build an adjacency matrix (from, to) using the dataframe which is as follows.
Id Gender Col_Cold_1 Col_Cold_2 Col_Cold_3 Col_Hot_1 Col_Hot_2 Col_Hot_3
10 F pain sleep NA infection medication walking
14 F Bump NA muscle NA twitching flutter
17 M pain hemoloma Callus infection
18 F muscle pain twitching medication
My goal is to create an adjacency matrix as follows
1) All values in columns with keyword Cold will contribute to the rows
2) All values in columns with keyword Hot will contribute to the columns
For example, pain, sleep, Bump, muscle, hemaloma are cell values under the columns with keyword Cold and they will form the rows and cell values such as infection, medication, Callus, walking, twitching, flutter are under columns with keywords Hot and this will form the columns of the association matrix.
The final desired output should appear like this:
infection medication walking twitching flutter Callus
pain 2 2 1 1 1
sleep 1 1 1
Bump 1 1
muscle 1 1
hemaloma 1 1
[pain, infection] = 2 because the association between pain and infection occurs twice in the original dataframe: once in row 1 and again in row 3.
[pain, medication]=2 because association between pain and medication occurs twice once in row 1 and again in row 4.
Any suggestions or advice on producing such an association matrix is much appreciated thanks.
Reproducible Dataset
df = structure(list(id = c(10, 14, 17, 18), Gender = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), Col_Cold_1 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "Bump", "muscle", "pain"), class = "factor"), Col_Cold_2 = structure(c(4L, 2L, 3L, 1L), .Label = c("", "NA", "pain", "sleep"), class = "factor"), Col_Cold_3 = structure(c(1L, 3L, 2L, 4L), .Label = c("NA", "hemaloma", "muscle", "pain" ), class = "factor"), Col_Hot_1 = structure(c(4L, 3L, 2L, 1L), .Label = c("", "Callus", "NA", "infection"), class = "factor"), Col_Hot_2 = structure(c(2L, 3L, 1L, 3L), .Label = c("infection", "medication", "twitching"), class = "factor"), Col_Hot_3 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "flutter", "medication", "walking" ), class = "factor")), .Names = c("id", "Gender", "Col_Cold_1", "Col_Cold_2", "Col_Cold_3", "Col_Hot_1", "Col_Hot_2", "Col_Hot_3" ), row.names = c(NA, -4L), class = "data.frame")

One way is to make the dataset into a "tidy" form, then use xtabs. First, some cleaning up:
df[] <- lapply(df, as.character) # Convert factors to characters
df[df == "NA" | df == "" | is.na(df)] <- NA # Make all blanks NAs
Now, tidy the dataset:
library(tidyr)
library(dplyr)
out <- do.call(rbind, sapply(grep("^Col_Cold", names(df), value = T), function(x){
vars <- c(x, grep("^Col_Hot", names(df), value = T))
setNames(gather_(select(df, one_of(vars)),
key_col = x,
value_col = "value",
gather_cols = vars[-1])[, c(1, 3)], c("cold", "hot"))
}, simplify = FALSE))
The idea is to "pair" each of the "cold" columns with each of the "hot" columns to make a long dataset. out looks like this:
out
# cold hot
# 1 pain infection
# 2 Bump <NA>
# 3 <NA> Callus
# 4 muscle <NA>
# 5 pain medication
# ...
Finally, use xtabs to make the desired output:
xtabs(~ cold + hot, na.omit(out))
# hot
# cold Callus flutter infection medication twitching walking
# Bump 0 1 0 0 1 0
# hemaloma 1 0 1 0 0 0
# muscle 0 1 0 1 2 0
# pain 1 0 2 2 1 1
# sleep 0 0 1 1 0 1

Create merged tables in R

I have some data in multiple large data tables in R. I wish to merge and produce counts of various variables.
I can produce the counts within individual tables easily using the 'table' command, but I have not yet figured out the economical (preferably base R, one liner) command to then produce combined counts.
aaa<-table(MyData1$MyVar)
bbb<-table(MyData2$MyVar)
> aaa
Dogs 3
Cats 4
Horses 1
Sheep 2
Giraffes 3
> bbb
Dogs 27
Cats 1
Sheep 2
Ocelots 1
Desired Output:
Dogs 30
Cats 5
Horses 1
Sheep 4
Giraffes 3
Ocelots 1
I am sure there is a straightforward Base R way to do this I am just not seeing it.

Base package:
aggregate(V2 ~ V1, data = rbind(df1, df2), FUN = sum)
dplyr:
library(dplyr)
rbind(df1, df2) %>% group_by(V1) %>% summarise(V2 = sum(V2))
Output:
V1 V2
1 Cats 5
2 Dogs 30
3 Giraffes 3
4 Horses 1
5 Sheep 4
6 Ocelots 1
Data:
df1 <- structure(list(V1 = structure(c(2L, 1L, 4L, 5L, 3L), .Label = c("Cats",
"Dogs", "Giraffes", "Horses", "Sheep"), class = "factor"), V2 = c(3L,
4L, 1L, 2L, 3L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(V1 = structure(c(2L, 1L, 4L, 3L), .Label = c("Cats",
"Dogs", "Ocelots", "Sheep"), class = "factor"), V2 = c(27L, 1L,
2L, 1L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))

First merge/concatenate your input, then apply table to it.
table(c(MyData1$MyVar, MyData2$MyVar))
You may run into issue if MyVar is a factor and its levels are different in MyData1 and MyData2. In this case, just lookup how to merge factor variables.
EDIT: if that doesn't suit your need, I suggest the following:
Merge the levels of all "MyVar" throughout all your "MyDatai" tables (from your example, I assume that it makes sense to do this).
total_levels <- unique(c(levels(MyData1$MyVar), levels(MyData2$MyVar)))
MyData1$MyVar <- factor(MyData1$MyVar, levels=total_levels)
MyData2$MyVar <- factor(MyData1$MyVar, levels=total_levels)
Obviously you will need wrap this into an apply-like function if you have around 100 data.frames.
Note that this is a one-time preprocessing operation, so I think it's ok if it is a bit costly. Ideally you can integrate it upstream when you generate/load the data.
At this point, all your "MyVar" have the same levels (but are still the same in terms of content of course). Now the good thing is, since table works with the levels, all your tables will have the same entries:
aaa<-table(MyData1$MyVar)
bbb<-table(MyData2$MyVar)
> aaa
Dogs 3
Cats 4
Horses 1
Sheep 2
Giraffes 3
Ocelot 0
> bbb
Dogs 27
Cats 1
Horses 0
Sheep 2
Giraffes 0
Ocelots 1
And you can just sum them with aaa+bbb or sum if you have a lot. Addition of vectors is lightning fast :)

how many people received 4 drugs of interest? R

I have a long list of people receiving drugs coded in the variable ATC. I want to find out how many people have used 4 specific drugs. For example, I want to count how many people have used this particular pattern of drugs "C07ABC" & "C09XYZ" &"C08123" &"C03ZOO". Some people may have used some agents (eg C07 or C08) more than once, thats ok, I just want to count how many unique people had the regimen I'm interested in. I don't care how many times they had the unique drugs. However, because I have various patterns that I want to look up - I would like to use the grepl function. To explain this further, my first attempt at this problem tried a sum command:
sum(df[grepl('^C07.*?'|'^C09.*?'|'^C08.*?|C03.*?', as.character(df$atc)),])
However this doesn't work because I think the sum command needs a boolean function. ALso, I think the | sign isn't correct here either (I want an &) but I'm just showing the code so that you know what I'm after. Maybe an ave function is what I need - but am unsure of how I would code this?
Thanks in advance.
df
names fruit dates atc
4 john kiwi 2010-07-01 C07ABC
7 john apple 2010-09-01 C09XYZ
9 john banana 2010-11-01 C08123
13 john orange 2010-12-01 C03ZOO
14 john apple 2011-01-01 C07ABC
2 mary orange 2010-05-01 C09123
5 mary apple 2010-07-01 C03QRT
8 mary orange 2010-07-01 C09ZOO
10 mary apple 2010-09-01 C03123
12 mary apple 2010-11-01 C09123
1 tom apple 2010-02-01 C03897
3 tom banana 2010-03-01 C02CAMN
6 tom apple 2010-06-01 C07123
11 tom kiwi 2010-08-01 C02DA12

You might consider avoiding the use of regular expressions, and instead derive some set of meaningful columns from column atc. For combinations, you probably want a 2-way table of person and drug, and then compute on the matrix to count combinations.
For example:
tab <- xtabs(~ names + atc, df)
combo <- c("C07ABC", "C09XYZ", "C08123", "C03ZOO")
haveCombo <- rowSums(tab[,combo] > 0) == length(combo)
sum(haveCombo)
The last two lines could easily be turned into a function for each combination.
EDIT: This approach can be applied to other, derived columns, so if you're interested in the prefix then,
df$agent <- substring(df$atc, 1, 3)
tab <- xtabs(~ names + agent, df)
combo <- c("C07", "C09", "C08", "C03")
and proceed as before.

In addition to not needing to deliver entire dataframe lines to sum you also had extra quote marks in that pattern:
> sum( grepl('^C07.*|^C09.*|^C08.*|C03.*', df$atc) )
[1] 12
I think this is easier to read:
> sum( grepl('^(C07|C09|C08|C03).*', df$atc) )
[1] 12
But now I read that you want all of thos used and to do the calculation within a patient id. That might have requiree using & as the connector but I decide to try a different route and use unique and then count then number of unique matches while doing it within an aggregate operation.
> aggregate(atc ~ names, data=df,
function(drgs) length(unique(grep('^(C07|C09|C08|C03)', drgs))))
names atc
1 john 5
2 mary 5
3 tom 2
Although that's the number of matching items but not the number of unique items, because I forgot to put value=TRUE in the grep call (and also need to use substr to avoid separately counting congeners with different trailing ATC codes):
> aggregate(atc ~ names, data=df, function(drgs) length(unique(grep('^C0[7983]', substr(drgs,1,3), value=TRUE))))
names atc
1 john 4
2 mary 2
3 tom 2
This would be somewhat similar to #MichaelLawrence's matrix/table approach, but I think it would scale better since the "tables" being created would be much smaller:
combo <- c("C07", "C09", "C08", "C03")
tapply(df$atc, df$names, function(drgs) sum(combo %in% substr(drgs,1,3)) )
#------
john mary tom
4 2 2

you can try this
drugs <- c("C07ABC","C09XYZ", "C08123", "C03ZOO")
table(unique(df[df$atc %in% drugs, c("names", "atc")])$names)
# john mary tom
# 4 0 0
names(which(table(unique(df[df$atc %in% drugs, c("names", "atc")])$names) > 3))
# [1] "john"
Data
df <- structure(list(names = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("john", "mary", "tom"
), class = "factor"), fruit = structure(c(3L, 1L, 2L, 4L, 1L,
4L, 1L, 4L, 1L, 1L, 1L, 2L, 1L, 3L), .Label = c("apple", "banana",
"kiwi", "orange"), class = "factor"), dates = structure(c(5L,
7L, 8L, 9L, 10L, 3L, 5L, 5L, 7L, 8L, 1L, 2L, 4L, 6L), .Label = c("2010-02-01",
"2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-08-01",
"2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01"), class = "factor"),
atc = structure(c(8L, 11L, 9L, 6L, 8L, 10L, 5L, 12L, 3L,
10L, 4L, 1L, 7L, 2L), .Label = c("C02CAMN", "C02DA12", "C03123",
"C03897", "C03QRT", "C03ZOO", "C07123", "C07ABC", "C08123",
"C09123", "C09XYZ", "C09ZOO"), class = "factor")), .Names = c("names",
"fruit", "dates", "atc"), class = "data.frame", row.names = c("4",
"7", "9", "13", "14", "2", "5", "8", "10", "12", "1", "3", "6",
"11"))

This is just a continuation of #Michael Lawrence's answer. I changed the drugs to what #user2363642 wanted, and I also substringed the atc column to only use the three first characters, which again, I believe is what #user2363642 wanted. Also, for the rowSums, I first changed all non-zero quantities to 1, to ensure we don't double count drugs.
drugs <- c("C07", "C09", "C08", "C03")
df$atc.abbr <- substring(df$atc, 1, 3)
xt <- xtabs(~ names + atc.abbr, df)
xt[xt>0] <- 1
rowSums(xt[,drugs]) >= length(drugs)
Output:
john mary tom
TRUE FALSE FALSE

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Pairs of Observations within Groups - r

Related

Select unique values in dataframe based on sorted value

Combine several data.frames into one (keep every rownames)

r creating an adjacency matrix from columns in a dataframe

Create merged tables in R

how many people received 4 drugs of interest? R

Categories

Resources