For each combination of a set of variables in a list, calculating correlations between this combination and another variable in R - r

In R I want to generate correlation co-efficients by comparing 2 variables whilst also retaining a phylogenetic signal.
The initial way I thought to do this is not computationally efficient, and I think there is a much simpler, but I do not have the skills in R to do it.
I have a csv file which looks like this:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
What I want to do is, for each possible combination of the percentages within the 20 single letter columns (amino acids, so 10 million combinations). Is to calculate the correlation between each different combination and the OGT variable in the CSV.... (whilst retaining a phylogenetic signal)
My current code is this:
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
library(ape)
library(geiger)
library(caper)
taxonomynex <- read.nexus("taxonomyforzeldospecies.nex")
zeldodata <- read.csv("COMPLETECOPYFORR.csv")
Species <- dput(zeldodata)
SpeciesLong <-
Species %>%
gather(protein, proportion,
A:Y) %>%
arrange(Species)
S <- unique(SpeciesLong$protein)
Scombi <- unlist(lapply(seq_along(S),
function(x) combn(S, x, FUN = paste0, collapse = "")))
joint_protein <- function(protein_combo, data){
sum(data$proportion[vapply(data$protein,
grepl,
logical(1),
protein_combo)])
}
SplitSpecies <-
split(SpeciesLong,
SpeciesLong$Species)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("Scombi", "joint_protein"))
SpeciesAggregate <-
parLapply(cl,
X = SplitSpecies,
fun = function(data){
X <- lapply(Scombi,
joint_protein,
data)
names(X) <- Scombi
as.data.frame(X)
})
Species <- cbind(Species, SpeciesAggregate)
`
Which attempts to feed in each combination into memory and then calculate the sum of each proportion of each of the acids, but this takes forever to finish and crashes before completion.
I think it would be better to feed in correlation co-efficents into a vector, and then just print out the relative co-efficients of each different combination for each species, but I don't know the best way of doing this in R.
I also aim to retain a phylogenetic signal using the ape package using something along the lines of this:
pglsModel <- gls(OGT ~ AminoAcidCombination, correlation = corBrownian(phy = taxonomynex),
data = zeldodata, method = "ML")
summary(pglsModel)
Apologies for how unclear this is, if anyone has any advice, much appreciated!
Edit: Link to taxonomyforzeldospecies.nex
Output from dput(Zeldodata):
1 Species OGT Domain A C D E F G H I K L M N P Q R S T V W Y
------------------------------- ----- ---------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
2 Aeropyrum pernix 95 Archaea 9.7659115711 0.6720465616 4.3895390781 7.6501943794 2.9344881615 8.8666657183 1.5011817208 5.6901432494 4.1428307243 11.0604191603 2.21143353 1.9387130928 5.1038552753 1.6855017182 7.7664358772 6.266067034 4.2052190807 9.2692433532 1.318690698 3.5614200159
3 Argobacterium fabrum 26 Bacteria 11.5698896021 0.7985475923 5.5884500155 5.8165463343 4.0512504104 8.2643271309 2.0116736244 5.7962804605 3.8931525401 9.9250463349 2.5980609708 2.9846761128 4.7828063605 3.1262365491 6.5684282943 5.9454781844 5.3740045968 7.3382308193 1.2519739683 2.3149400984
4 Anaeromyxobacter dehalogenans 27 Bacteria 16.0337898849 0.8860252895 5.1368827707 6.1864992608 2.9730203513 9.3167603253 1.9360386851 2.940143349 2.3473650439 10.898494736 1.6343905351 1.5247123262 6.3580285706 2.4715303021 9.2639057482 4.1890063803 4.3992339725 8.3885969061 1.2890166336 1.8265589289
5 Aquifex aeolicus 85 Bacteria 5.8730327277 0.795341216 4.3287799008 9.6746388172 5.1386954322 6.7148035486 1.5438364179 7.3358775924 9.4641440609 10.5736658776 1.9263080969 3.6183861236 4.0518679067 2.0493569604 4.9229955632 4.7976564501 4.2005259246 7.9169763709 0.9292167138 4.1438942987
6 Archaeoglobus fulgidus 83 Archaea 7.8742687687 1.1695110027 4.9165979364 8.9548767369 4.568636662 7.2640358917 1.4998752909 7.2472039919 6.8957233203 9.4826333048 2.6014466253 3.206476915 3.8419576418 1.7789787933 5.7572748236 5.4763351139 4.1490633048 8.6330814159 1.0325605451 3.6494619148

this will give you a long data frame with each combination and sum per Species (takes about 35 seconds on my machine)...
zeldodata <-
Species %>%
gather(protein, proportion, A:Y) %>%
group_by(Species) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion)
an example of calculating each species separately and saving the data to disk before reading each one in and combining them...
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
# read in CSV file
zeldodata <-
read_delim(
delim = "|",
trim_ws = TRUE,
col_names = TRUE,
col_types = "cicdddddddddddddddddddd",
file = "Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y
Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159
Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984
Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289
Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987
Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148"
)
# save an RDS file for each species
for(species in unique(zeldodata$Species)) {
zeldodata %>%
filter(Species == species) %>%
gather(protein, proportion, A:Y) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion) %>%
saveRDS(file = paste0(species, ".RDS"))
}
# read in and combine all the RDS files
zeldodata <-
list.files(pattern = "\\.RDS") %>%
map(read_rds) %>%
bind_rows()

Related

frequency table for banner list

I am trying to create a function to generate frequency table (to show count , valid percentage , percentage) for list of banner.
I want to export tables in xlsx files.
like for variable "gear" , i want to calculate the table for banner below ()
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
banner <- list(dat$all,dat$small,dat$large)
data <- df
var <- "gear"
var1 <- rlang::parse_expr(var)
expss::var_lab(data[[var]])
#tab1 <- expss::fre(data[[var1]])
table1 <- expss::fre(data[[var1]],
stat_lab = getOption("expss.fre_stat_lab", c("Count N", "Valid percent", "Percent",
"Responses, %", "Cumulative responses, %")))
table1
the output table should be like below
You need to make custom function around fre:
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
my_fre <- function(curr_var) setNames(expss::fre(curr_var)[, 1:3],
c("row_labels", "Count N", "Valid percent"))
cross_fun_df(df, gear, list(all, small, large), fun = my_fre)
# | | Total | | Small | | Large | |
# | | Count N | Valid percent | Count N | Valid percent | Count N | Valid percent |
# | ------ | ------- | ------------- | ------- | ------------- | ------- | ------------- |
# | 3 | 15 | 46.88 | 3 | 21.43 | | |
# | 4 | 12 | 37.50 | 10 | 71.43 | 8 | 61.54 |
# | 5 | 5 | 15.62 | 1 | 7.14 | 5 | 38.46 |
# | #Total | 32 | 100.00 | 14 | 100.00 | 13 | 100.00 |
# | <NA> | 0 | | 0 | | 0 | |

How to drop unused value labels in crosstabulations table outputs using cro function from expss package?

I'm using heaven labelled dataframes (variables already have value labels when importing datasets). I need to run many crosstabulations of two variables. I’m using the cro function from expss package because by default displays value labels, and computes weighted crosstabs.
However, the output tables I get display unused value labels. How can I drop unused labels without manually dropping unused value labels for each variable? (by the way: the fre function from expss package has this argument by default: drop_unused_labels = TRUE, but cro function doesn’t)
Here is a reproducible example:
# Dataframe
df <- data.frame(sex = c(1, 2, 99, 2, 1, 2, 2, 2, 1, 2),
agegroup= c(1, 2, 99, 2, 3, 3, 2, 2, 2, 1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
library(expss)
# Variable labels
var_lab(df$sex) <-"Sex"
var_lab(df$agegroup) <-"Age group"
# Value labels
val_lab(df$sex) <- make_labels("1 Male
2 Female
97 Didn't know
98 Didn't respond
99 Abandoned survey")
val_lab(df$agegroup) <- make_labels("1 1-29
2 30-49
3 50 and more
97 Didn't know
98 Didn't respond
99 Abandoned survey")
cro(df$sex, df$agegroup, weight = df$weight)
| | | Age group | | | | | |
| | | 1-29 | 30-49 | 50 and more | Didn't know | Didn't respond | Abandoned survey |
| --- | ---------------- | --------- | ----- | ----------- | ----------- | -------------- | ---------------- |
| Sex | Male | 100 | 100 | 50 | | | |
| | Female | 100 | 650 | 50 | | | |
| | Didn't know | | | | | | |
| | Didn't respond | | | | | | |
| | Abandoned survey | | | | | | 400 |
| | #Total cases | 2 | 5 | 2 | | | 1 |
I want to get rid of the columns and rows called ‘Didn't know’ and ‘Didn't respond’.
You can use drop_unused_labels function to remove the labels which are not used.
library(expss)
df1 <- drop_unused_labels(df)
cro(df1$sex, df1$agegroup, weight = df1$weight)
| | | Age group | | | |
| | | 1-29 | 30-49 | 50 and more | Abandoned survey |
| --- | ---------------- | --------- | ----- | ----------- | ---------------- |
| Sex | Male | 100 | 100 | 50 | |
| | Female | 100 | 650 | 50 | |
| | Abandoned survey | | | | 400 |
| | #Total cases | 2 | 5 | 2 | 1 |

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurence of numbers
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))

How do I analyze Market Basket Output?

I have a sale data as below:
+------------+------+-------+
| Receipt ID | Item | Value |
+------------+------+-------+
| 1 | a | 2 |
| 1 | b | 3 |
| 1 | c | 2 |
| 1 | k | 4 |
| 2 | a | 2 |
| 2 | b | 5 |
| 2 | d | 6 |
| 2 | k | 7 |
| 3 | a | 8 |
| 3 | k | 1 |
| 3 | c | 2 |
| 3 | q | 3 |
| 4 | k | 4 |
| 4 | a | 5 |
| 5 | b | 6 |
| 5 | a | 7 |
| 6 | a | 8 |
| 6 | b | 3 |
| 6 | c | 4 |
+------------+------+-------+
Using APriori algorithm, I modified the Rules into different columns:
For eg, I got output as below, I trimmed support, confidence, Lift value.. I am only considering rules which mapped into different columns into Target Item, Item1, Items ({Item1,Item2} -> {Target Item})
Output is as below:
+-------------+-------+-------+
| Target Item | Item1 | Item2 |
+-------------+-------+-------+
| a | b | |
| a | b | c |
| a | k | |
+-------------+-------+-------+
I am looking to calculate the all the receipts having the rules combination and identify the Target item Sale value only in those receipts and also Combined sale value of Item 1 and Item 2 in the combination receipts:
Output should be something like below (I dont need receipt ID's from below)
+-------------+-------+-------+--------------+----------------------+------------------------------+
| Target Item | Item1 | Item2 | Receipt ID's | Value of Target Item | Remaining value(Item1+item2) |
+-------------+-------+-------+--------------+----------------------+------------------------------+
| a | b | | 1,2,5,6 | 2+2+7+8 | 3+5+6+3 |
| a | b | c | 1,6 | 2 | (3+3) + (2+4) |
| a | k | | 1,2,3,4 | 2+2+8+5 | 4+7+1+4 |
+-------------+-------+-------+--------------+----------------------+------------------------------+
To replicate the Apriori:
library(arules)
Data <- data.frame(
Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
item = c('a','b','c','k','a','b','d','k','a','k','c','q','k', 'a','b','a','a', 'b', 'c'
)
,
value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4
)
)
write.table(Data,"item.csv",sep=',',row.names = F)
data_frame = read.transactions(
file = "item.csv",
format = "single",
sep = ",",
cols = c("Receipt_ID","item"),
rm.duplicates = T
)
rules_apriori <- apriori(data_frame)
rules_apriori
rules_tab <- as(rules_apriori, "data.frame")
rules_tab
out <- strsplit(as.character(rules_tab$rules),'=>')
rules_tab$rhs <- do.call(rbind, out)[,2]
rules_tab$lhs <- do.call(rbind, out)[,1]
rules_tab$rhs <- gsub("\\{", "", rules_tab$rhs)
rules_tab$rhs <- gsub("}", "", rules_tab$rhs)
rules_tab$lhs = gsub("}", "", rules_tab$lhs)
rules_tab$lhs = gsub("\\{", "", rules_tab$lhs)
rules_final <- data.frame (target_item = character(),item_combination = character() )
rules_final <- cbind(target_item = rules_tab$rhs,item_Combination = rules_tab$lhs)
rules_final

How to make a multiple corpora in R

This is a car review data which has more than 40,000 rows and each review has more than 500 characters. This is sample data : https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge review column by brands like this :
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So, I tired to use aggregate().
temp <- aggregate(data$review ~ data$brand , data, as.list )
But, It takes very long.
Is there any simple way to merge that?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
newDf <- data.frame(brand = names(brand.split),
text <- favorite.grouped,
favorite <- favorite.grouped,
stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.

Resources