I'm trying to plot confidence intervals for data that isn't normally distributed. I was advised to use stat_summary, but I can't find any details on how to specify fun.args so that the 5th and 95th percentiles are plotted as a ribbon. Taking the Excel percentiles for 0.05 and 0.95 for year 2 of S4, I get 31.015 and 31.104, which is not what the plot shows. I assume the issue is with fun.data = mean_cl_normal, but there is very little information on what the options are.
Here is the data I'm using:
+-----------+-----------+-----------+-----------+------+
| S4 | S5 | S6 | S7 | year |
+-----------+-----------+-----------+-----------+------+
| 31.052168 | 30.612594 | 30.328008 | 30.162733 | 2 |
| 31.0111 | 30.664017 | 30.277935 | 30.118793 | 2 |
| 31.049706 | 30.70231 | 30.341677 | 30.202466 | 2 |
| 31.077554 | 30.701983 | 30.355643 | 30.161663 | 2 |
| 31.056968 | 30.696955 | 30.323812 | 30.186214 | 2 |
| 31.096337 | 30.679318 | 30.261566 | 30.080544 | 2 |
| 31.073879 | 30.618196 | 30.281664 | 30.187808 | 2 |
| 31.115269 | 30.700809 | 30.301731 | 30.211642 | 2 |
| 31.085665 | 30.716211 | 30.362345 | 30.16574 | 2 |
| 31.076053 | 30.720127 | 30.319381 | 30.14898 | 2 |
| 31.017615 | 30.73175 | 30.326711 | 30.142657 | 2 |
| 31.020176 | 30.660135 | 30.274531 | 30.144741 | 2 |
| 31.04606 | 30.635148 | 30.362041 | 30.061961 | 2 |
| 31.06509 | 30.65724 | 30.305546 | 30.062432 | 3 |
| 30.974952 | 30.690091 | 30.305273 | 30.186476 | 3 |
| 30.99952 | 30.658606 | 30.29415 | 30.203725 | 3 |
| 31.013494 | 30.621646 | 30.2701 | 30.169807 | 3 |
| 31.081632 | 30.702792 | 30.326554 | 30.063521 | 3 |
| 31.033945 | 30.650637 | 30.334073 | 30.158865 | 3 |
| 31.075722 | 30.627908 | 30.331883 | 30.125196 | 3 |
| 31.036684 | 30.694549 | 30.322353 | 30.125278 | 3 |
| 31.054786 | 30.60339 | 30.356116 | 30.125177 | 3 |
| 31.089391 | 30.652875 | 30.268113 | 30.173289 | 3 |
| 31.063207 | 30.65264 | 30.346941 | 30.174659 | 3 |
| 31.050838 | 30.7144 | 30.28113 | 30.104956 | 3 |
| 31.002156 | 30.727084 | 30.28905 | 30.15026 | 3 |
| 31.052874 | 30.672237 | 30.325414 | 30.055 | 3 |
| 31.116682 | 30.737313 | 30.309537 | 30.13867 | 3 |
| 31.051456 | 30.662466 | 30.264082 | 30.125838 | 3 |
| 31.082019 | 30.646523 | 30.300457 | 30.119709 | 3 |
+-----------+-----------+-----------+-----------+------+
and the code:
library(tidyverse)
dat <- read.table("C:/temp.txt", sep = "\t", header = TRUE)
df <- dat %>%
  pivot_longer(cols = c(S4), names_to = "variable", values_to = "value")
ggplot(df, aes(x = year, y = value, color = variable)) +
  stat_summary(geom = "line", fun = mean, linetype = "solid") +
  stat_summary(geom = "ribbon", fun.data = mean_cl_normal, fun.args = list(conf.int = 0.95), alpha = .1)
Adjusted code that draws the ribbon from the 5th and 95th percentiles instead:
ggplot(df, aes(x = year, y = value, color = variable)) +
  stat_summary(geom = "line", fun = mean, linetype = "solid") +
  stat_summary(geom = "ribbon",
               fun.min = function(z) { quantile(z, 0.05) },
               fun.max = function(z) { quantile(z, 0.95) },
               alpha = .1)
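For context on why the two ribbons differ: as far as I understand it, mean_cl_normal gives a t-based confidence interval for the mean, which shrinks as the sample grows, whereas the quantile version describes the spread of the data itself. Here is a minimal base R sketch of both summaries, using the year 2 values of S4 from the table above. The helper names t_interval and quantile_band are made up for illustration, and the t-interval formula is my assumption about what mean_cl_normal computes.

```r
# Year 2 values of S4, copied from the data table in the question
s4_year2 <- c(31.052168, 31.0111, 31.049706, 31.077554, 31.056968,
              31.096337, 31.073879, 31.115269, 31.085665, 31.076053,
              31.017615, 31.020176, 31.04606)

# Hypothetical helper: t-based confidence interval for the mean
t_interval <- function(z, conf.int = 0.95) {
  half_width <- qt((1 + conf.int) / 2, df = length(z) - 1) * sd(z) / sqrt(length(z))
  data.frame(y = mean(z), ymin = mean(z) - half_width, ymax = mean(z) + half_width)
}

# Hypothetical helper: mean with an empirical quantile band
quantile_band <- function(z, lower = 0.05, upper = 0.95) {
  data.frame(y = mean(z),
             ymin = unname(quantile(z, lower)),
             ymax = unname(quantile(z, upper)))
}

ci   <- t_interval(s4_year2)
band <- quantile_band(s4_year2)
print(ci)
print(band)
```

Because quantile_band returns y, ymin and ymax, it should also work as a single fun.data argument, e.g. stat_summary(geom = "ribbon", fun.data = quantile_band, alpha = .1). The band should come out close to the Excel percentiles quoted in the question.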
When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
+--------+--------------+--------------+-------+
| | V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurrence of a number in each row name
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))
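To also give that first column a name, one option is to store the extracted number in a regular column and then drop the row names. A small sketch, assuming a data frame shaped like the one in the question (only four rows are reproduced here, and the column name Row is an arbitrary choice):

```r
# A few rows mimicking the rbind output in the question (values copied from its table)
df1 <- data.frame(
  V1    = c(0.130880519, 0.197243133, 0.103130056, 0.133080106),
  V2    = c(0.02085533, -0.000502744, 0.013655756, 0.038049071),
  Trial = c(1, 1, 2, 2),
  row.names = c("[1,]", "[2,]", "[1,]1", "[2,]1")
)

# Capture the digits between "[" and "," and store them as integers
df1$Row <- as.integer(sub("^\\[(\\d+).*", "\\1", row.names(df1)))

# Move the new column to the front and drop the original row names
df1 <- df1[, c("Row", "V1", "V2", "Trial")]
row.names(df1) <- NULL
print(df1)
```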
When using the expss package in R for creating tables, how does one get the row_percentages to be calculated within a nested variable? In the example below, I would like the row percentage to be calculated within each time period. Thus, I would like the row percentage to sum to 100% within each time period (2015-2016 and 2017-2018). Now however, the percentage is calculated over the entire row.
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot()
Created on 2019-09-28 by the reprex package (v0.3.0)
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 13.6 | 36.4 | 13.6 | 36.4 |
| | 6 | row_perc | 28.6 | 21.4 | 28.6 | 21.4 |
| | 8 | row_perc | 42.9 | 7.1 | 42.9 | 7.1 |
| | #Total cases | row_perc | 19.0 | 13.0 | 19.0 | 13.0 |
I believe this is what you are after:
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am ) %>%
tab_subgroup(period =="2015-2016") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period =="2017-2018") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "inside_rows")
Note the use of tab_subgroup(), which selects the period within which the percentages are calculated, and of stat_position = "inside_rows", which determines where the calculated output is placed in the final table.
Output:
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 27.3 | 72.7 | | |
| | | | | | 27.3 | 72.7 |
| | 6 | row_perc | 57.1 | 42.9 | | |
| | | | | | 57.1 | 42.9 |
| | 8 | row_perc | 85.7 | 14.3 | | |
| | | | | | 85.7 | 14.3 |
| | #Total cases | row_perc | 19.0 | 13.0 | | |
| | | | | | 19.0 | 13.0 |
EDIT:
We do not need %nest% if we do not want nested rows (which double the number of rows). In that case, the final part of the code should be modified as follows:
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period,am) %>%
tab_subgroup(period ==c("2015-2016")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period ==c("2017-2018")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "outside_columns")
Output:
| | | Measurement period | Transmission | | | | Measurement period |
| | | 2015-2016 | Automatic | Manual | Automatic | Manual | 2017-2018 |
| | | row_perc | row_perc | row_perc | row_perc | row_perc | row_perc |
| ------------------- | ------------ | ------------------ | ------------ | -------- | --------- | -------- | ------------------ |
| Number of cylinders | 4 | 100 | 27.3 | 72.7 | 27.3 | 72.7 | 100 |
| | 6 | 100 | 57.1 | 42.9 | 57.1 | 42.9 | 100 |
| | 8 | 100 | 85.7 | 14.3 | 85.7 | 14.3 | 100 |
| | #Total cases | 32 | 19.0 | 13.0 | 19.0 | 13.0 | 32 |
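As a cross-check on the per-period percentages, the same numbers can be reproduced in base R without expss. This is only a sketch for verification, not a replacement for the expss tables; the object names mt, counts and row_pct are arbitrary.

```r
data(mtcars)

# Duplicate mtcars into two measurement periods, as in the question
mt <- rbind(transform(mtcars, period = "2015-2016"),
            transform(mtcars, period = "2017-2018"))

# Count cases by cylinders, transmission (0 = automatic, 1 = manual) and period
counts <- table(cyl = mt$cyl, am = mt$am, period = mt$period)

# Row percentages within each cyl x period cell, so each period sums to 100
row_pct <- round(100 * prop.table(counts, margin = c(1, 3)), 1)
print(ftable(row_pct, row.vars = "cyl", col.vars = c("period", "am")))
```

Using margin = c(1, 3) is what makes the percentages add up within each period rather than across the whole row.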
In R, I want to generate correlation coefficients by comparing two variables while also retaining a phylogenetic signal.
The way I initially thought of doing this is not computationally efficient. I think there is a much simpler way, but I do not have the R skills to do it.
I have a csv file which looks like this:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
What I want to do is calculate, for each possible combination of the 20 single-letter columns (amino acids; 2^20 - 1 non-empty combinations, so roughly one million), the correlation between the summed percentages of that combination and the OGT variable in the CSV, while retaining a phylogenetic signal.
My current code is this:
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
library(ape)
library(geiger)
library(caper)
taxonomynex <- read.nexus("taxonomyforzeldospecies.nex")
zeldodata <- read.csv("COMPLETECOPYFORR.csv")
Species <- dput(zeldodata)
SpeciesLong <-
Species %>%
gather(protein, proportion,
A:Y) %>%
arrange(Species)
S <- unique(SpeciesLong$protein)
Scombi <- unlist(lapply(seq_along(S),
function(x) combn(S, x, FUN = paste0, collapse = "")))
joint_protein <- function(protein_combo, data){
sum(data$proportion[vapply(data$protein,
grepl,
logical(1),
protein_combo)])
}
SplitSpecies <-
split(SpeciesLong,
SpeciesLong$Species)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("Scombi", "joint_protein"))
SpeciesAggregate <-
parLapply(cl,
X = SplitSpecies,
fun = function(data){
X <- lapply(Scombi,
joint_protein,
data)
names(X) <- Scombi
as.data.frame(X)
})
Species <- cbind(Species, SpeciesAggregate)
This attempts to load every combination into memory and then calculate the sum of the proportions of the acids in each one, but it takes forever to run and crashes before completion.
I think it would be better to store the correlation coefficients in a vector and then just print out the coefficients of each combination for each species, but I don't know the best way of doing this in R.
I also aim to retain a phylogenetic signal with the ape package, using something along the lines of this:
pglsModel <- gls(OGT ~ AminoAcidCombination, correlation = corBrownian(phy = taxonomynex),
data = zeldodata, method = "ML")
summary(pglsModel)
Apologies for how unclear this is. If anyone has any advice, it would be much appreciated!
Edit: Link to taxonomyforzeldospecies.nex
Output from dput(Zeldodata):
1 Species OGT Domain A C D E F G H I K L M N P Q R S T V W Y
------------------------------- ----- ---------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
2 Aeropyrum pernix 95 Archaea 9.7659115711 0.6720465616 4.3895390781 7.6501943794 2.9344881615 8.8666657183 1.5011817208 5.6901432494 4.1428307243 11.0604191603 2.21143353 1.9387130928 5.1038552753 1.6855017182 7.7664358772 6.266067034 4.2052190807 9.2692433532 1.318690698 3.5614200159
3 Argobacterium fabrum 26 Bacteria 11.5698896021 0.7985475923 5.5884500155 5.8165463343 4.0512504104 8.2643271309 2.0116736244 5.7962804605 3.8931525401 9.9250463349 2.5980609708 2.9846761128 4.7828063605 3.1262365491 6.5684282943 5.9454781844 5.3740045968 7.3382308193 1.2519739683 2.3149400984
4 Anaeromyxobacter dehalogenans 27 Bacteria 16.0337898849 0.8860252895 5.1368827707 6.1864992608 2.9730203513 9.3167603253 1.9360386851 2.940143349 2.3473650439 10.898494736 1.6343905351 1.5247123262 6.3580285706 2.4715303021 9.2639057482 4.1890063803 4.3992339725 8.3885969061 1.2890166336 1.8265589289
5 Aquifex aeolicus 85 Bacteria 5.8730327277 0.795341216 4.3287799008 9.6746388172 5.1386954322 6.7148035486 1.5438364179 7.3358775924 9.4641440609 10.5736658776 1.9263080969 3.6183861236 4.0518679067 2.0493569604 4.9229955632 4.7976564501 4.2005259246 7.9169763709 0.9292167138 4.1438942987
6 Archaeoglobus fulgidus 83 Archaea 7.8742687687 1.1695110027 4.9165979364 8.9548767369 4.568636662 7.2640358917 1.4998752909 7.2472039919 6.8957233203 9.4826333048 2.6014466253 3.206476915 3.8419576418 1.7789787933 5.7572748236 5.4763351139 4.1490633048 8.6330814159 1.0325605451 3.6494619148
This will give you a long data frame with each combination and its sum per Species (it takes about 35 seconds on my machine):
zeldodata <-
Species %>%
gather(protein, proportion, A:Y) %>%
group_by(Species) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion)
Here is an example of calculating each species separately and saving the data to disk, before reading each file back in and combining them:
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
# read in CSV file
zeldodata <-
read_delim(
delim = "|",
trim_ws = TRUE,
col_names = TRUE,
col_types = "cicdddddddddddddddddddd",
file = "Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y
Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159
Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984
Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289
Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987
Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148"
)
# save an RDS file for each species
for(species in unique(zeldodata$Species)) {
zeldodata %>%
filter(Species == species) %>%
gather(protein, proportion, A:Y) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion) %>%
saveRDS(file = paste0(species, ".RDS"))
}
# read in and combine all the RDS files
zeldodata <-
list.files(pattern = "\\.RDS") %>%
map(read_rds) %>%
bind_rows()
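For the correlation step itself, the combination sums can also be computed directly from the wide columns with rowSums, which avoids string matching. The sketch below uses only three of the amino-acid columns (A, C, D) for the five species shown in the question, and it computes a plain Pearson correlation, so it does not account for phylogeny; each combination would still need a model fit such as the gls call in the question for that. The object names are arbitrary.

```r
# OGT and three amino-acid columns (A, C, D) for the five species in the question
ogt <- c(95, 26, 27, 85, 83)
aa <- cbind(
  A = c(9.7659115711, 11.5698896021, 16.0337898849, 5.8730327277, 7.8742687687),
  C = c(0.6720465616, 0.7985475923, 0.8860252895, 0.795341216, 1.1695110027),
  D = c(4.3895390781, 5.5884500155, 5.1368827707, 4.3287799008, 4.9165979364)
)

# Every non-empty subset of the columns, as a list of character vectors
combos <- unlist(
  lapply(seq_len(ncol(aa)), function(k) combn(colnames(aa), k, simplify = FALSE)),
  recursive = FALSE
)

# Plain Pearson correlation between each subset's summed percentage and OGT
cors <- vapply(combos,
               function(cols) cor(rowSums(aa[, cols, drop = FALSE]), ogt),
               numeric(1))
names(cors) <- vapply(combos, paste0, character(1), collapse = "")
print(cors)

# Number of non-empty subsets of all 20 amino acids
n_subsets_20 <- sum(choose(20, 1:20))
print(n_subsets_20)
```

Storing one number per combination keeps memory use small, since only the named vector of coefficients is retained rather than a column of sums for every species.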
I have sales data as below:
+------------+------+-------+
| Receipt ID | Item | Value |
+------------+------+-------+
| 1 | a | 2 |
| 1 | b | 3 |
| 1 | c | 2 |
| 1 | k | 4 |
| 2 | a | 2 |
| 2 | b | 5 |
| 2 | d | 6 |
| 2 | k | 7 |
| 3 | a | 8 |
| 3 | k | 1 |
| 3 | c | 2 |
| 3 | q | 3 |
| 4 | k | 4 |
| 4 | a | 5 |
| 5 | b | 6 |
| 5 | a | 7 |
| 6 | a | 8 |
| 6 | b | 3 |
| 6 | c | 4 |
+------------+------+-------+
Using the Apriori algorithm, I split the rules into separate columns.
For example, I got the output below, with the support, confidence and lift values trimmed. I am only considering rules mapped into the columns Target Item, Item1 and Item2 ({Item1, Item2} -> {Target Item}):
+-------------+-------+-------+
| Target Item | Item1 | Item2 |
+-------------+-------+-------+
| a | b | |
| a | b | c |
| a | k | |
+-------------+-------+-------+
I want to find all the receipts that contain each rule's item combination, then calculate the sale value of the Target Item in only those receipts, along with the combined sale value of Item1 and Item2 in the same receipts.
The output should look something like the table below (I don't need the receipt IDs):
+-------------+-------+-------+--------------+----------------------+------------------------------+
| Target Item | Item1 | Item2 | Receipt ID's | Value of Target Item | Remaining value(Item1+item2) |
+-------------+-------+-------+--------------+----------------------+------------------------------+
| a | b | | 1,2,5,6 | 2+2+7+8 | 3+5+6+3 |
| a | b | c | 1,6 | 2+8 | (3+3) + (2+4) |
| a | k | | 1,2,3,4 | 2+2+8+5 | 4+7+1+4 |
+-------------+-------+-------+--------------+----------------------+------------------------------+
To replicate the Apriori:
library(arules)
Data <- data.frame(
  Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
  item = c('a','b','c','k','a','b','d','k','a','k','c','q','k','a','b','a','a','b','c'),
  value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4)
)
write.table(Data,"item.csv",sep=',',row.names = F)
data_frame = read.transactions(
file = "item.csv",
format = "single",
sep = ",",
cols = c("Receipt_ID","item"),
rm.duplicates = T
)
rules_apriori <- apriori(data_frame)
rules_apriori
rules_tab <- as(rules_apriori, "data.frame")
rules_tab
out <- strsplit(as.character(rules_tab$rules),'=>')
rules_tab$rhs <- do.call(rbind, out)[,2]
rules_tab$lhs <- do.call(rbind, out)[,1]
rules_tab$rhs <- gsub("\\{", "", rules_tab$rhs)
rules_tab$rhs <- gsub("}", "", rules_tab$rhs)
rules_tab$lhs = gsub("}", "", rules_tab$lhs)
rules_tab$lhs = gsub("\\{", "", rules_tab$lhs)
rules_final <- data.frame (target_item = character(),item_combination = character() )
rules_final <- cbind(target_item = rules_tab$rhs,item_Combination = rules_tab$lhs)
rules_final
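One way to get from a rule to the values described above is to find the receipts that contain every item in the rule and then sum the values by role. This is a base R sketch on the same sales data; rule_values is a hypothetical helper name, and it takes the target item and the left-hand-side items as plain character vectors rather than reading them from the arules output.

```r
# Sales data from the question
Data <- data.frame(
  Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
  item = c('a','b','c','k','a','b','d','k','a','k','c','q','k','a','b','a','a','b','c'),
  value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: values for one rule {lhs} -> {target}
rule_values <- function(data, target, lhs) {
  items <- c(target, lhs)
  # Items bought on each receipt
  by_receipt <- split(data$item, data$Receipt_ID)
  # Receipts containing the target and every left-hand-side item
  hits <- names(by_receipt)[vapply(by_receipt, function(x) all(items %in% x), logical(1))]
  sel <- data[as.character(data$Receipt_ID) %in% hits, ]
  list(
    receipts        = as.numeric(hits),
    target_value    = sum(sel$value[sel$item == target]),
    remaining_value = sum(sel$value[sel$item %in% lhs])
  )
}

ab  <- rule_values(Data, "a", "b")
abc <- rule_values(Data, "a", c("b", "c"))
ak  <- rule_values(Data, "a", "k")
print(ab)
print(abc)
print(ak)
```

The same function could be applied row by row to the target_item and item_combination columns once the left-hand side has been split on commas.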
This is car review data with more than 40,000 rows, where each review has more than 500 characters. Here is the sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge the review column by brand, like this:
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So I tried to use aggregate().
temp <- aggregate(data$review ~ data$brand , data, as.list )
But it takes a very long time.
Is there a simple way to merge them?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
# combine the grouped columns; use = (not <-) so the columns are named correctly
newDf <- data.frame(brand = names(brand.split),
                    text = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.
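To see the shape of the result without downloading the file, here is the same split-and-paste idea on a toy data frame built from the placeholder rows in the question. The column names brand, review and favorite follow the question's table and are an assumption about the real file; collapse_by is a made-up helper name.

```r
# Toy data mirroring the placeholder table in the question
sampleDF <- data.frame(
  brand    = c("brand1", "brand2", "brand2", "brand2", "brand3", "brand3"),
  review   = paste0("500 characters", 1:6),
  favorite = c("100 characters1", "100 Characters2", "100 Characters3",
               "100 Characters4", "100 Characters5", "100 characters6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste together all values of x within each group g
collapse_by <- function(x, g) {
  vapply(split(x, g), paste, character(1), collapse = " ")
}

review.grouped   <- collapse_by(sampleDF$review, sampleDF$brand)
favorite.grouped <- collapse_by(sampleDF$favorite, sampleDF$brand)

# One row per brand, with the merged text columns
newDf <- data.frame(brand = names(review.grouped),
                    review = review.grouped,
                    favorite = favorite.grouped,
                    row.names = NULL,
                    stringsAsFactors = FALSE)
print(newDf)
```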