How to make multiple corpora in R

I have car review data with more than 40,000 rows, and each review has more than 500 characters. Sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge the review column by brand, like this:
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So I tried to use aggregate():
temp <- aggregate(data$review ~ data$brand , data, as.list )
But it takes a very long time.
Is there a simpler way to merge the reviews?
Thank you in advance!

Try splitting the reviews by brand and then pasting each group together. aggregate() can be very slow on large data frames, so it is best avoided for anything beyond small datasets.
This should do the trick (note that I downloaded your Google file as sampleDF.csv here):
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
# one row per brand; data.frame() arguments are named with =, not <-
newDf <- data.frame(brand = names(brand.split),
                    text = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)
If you want to bring in other variables, they must be constant within each brand (that is, they may vary at the brand level only).
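The split-and-paste approach can be tried on a tiny data frame before running it on the full file. This is a minimal sketch: the data frame `reviews`, its contents, and the names `by_brand`, `merged` and `corpus_df` are made up for illustration and are not taken from the sample file.

```r
# Toy data standing in for the real reviews (hypothetical values).
reviews <- data.frame(
  brand  = c("brand1", "brand2", "brand2", "brand3"),
  review = c("good car", "noisy", "cheap to run", "fast"),
  stringsAsFactors = FALSE
)

# Split the review text into one character vector per brand.
by_brand <- split(reviews$review, reviews$brand)

# Collapse each brand's reviews into a single string.
merged <- vapply(by_brand, paste, character(1), collapse = " ")

# One row per brand, ready to be turned into a corpus.
corpus_df <- data.frame(
  brand  = names(merged),
  review = unname(merged),
  stringsAsFactors = FALSE
)
```

vapply() is used here instead of sapply() only so that the result is guaranteed to be a character vector with one element per brand.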

Related

How can I print the complete query output without line breaks in the output file?

The output is printed, but after every 1024 rows a break is inserted and the table header is printed again.
Please suggest how to produce the output without these breaks.
Command used:
(impala-shell -k --ssl -i <impala_url> -c -f sample_202112.sql 2>&1)> sample_202112.txt
The output looks like this:
| VOL | BC | VOL00109 | VA-V5B | APR22DM |
| VOL | BC | VOL00109 | VC | APR22DM |
| VOL | BC | VOL00109 | VE-V5G | APR22DM |
| VOL | BC | VOL00109 | VH | APR22DM |
| VOL | BC | VOL00109 | VJ | APR22DM |
| VOL | BC | VOL00104 | VK-V7G | APR22DM |
| VOL | BC | VOL00103 | VL | APR22DM |
+-----------+---------------+-----------+--------------------+---------+
+-----------+-----------+-----------+------------+---------+
| zone_code | zone_desc | zone_code | data | dm |
+-----------+-----------+-----------+------------+---------+
| VOL | BC | VOL00109 | VM | APR22DM |
| VOL | BC | VOL00103 | VN | APR22DM |
| VOL | BC | VOL00103 | VP | APR22DM |
| VOL | BC | VOL00103 | VR | APR22DM |
| VOL | BC | VOL00103 | S | APR22DM |
+-----------+-----------+-----------+------------+---------+

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
+--------+--------------+--------------+-------+
|        | V1           | V2           | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurrence of a number in each row name:
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names (the result is a character vector):
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))
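To also get a named first column, as asked, the extracted numbers can be bound in front of the data and the old row names dropped. A minimal sketch, assuming a small stand-in for `df1` with made-up values and the hypothetical column name `Row`:

```r
# Stand-in for the bound data frame, with row names like "[1,]" and "[1,]1".
df1 <- data.frame(
  V1    = c(0.13, 0.20, 0.10, 0.13),
  V2    = c(0.02, 0.00, 0.01, 0.04),
  Trial = c(1, 1, 2, 2),
  row.names = c("[1,]", "[2,]", "[1,]1", "[2,]1")
)

# Capture the digits inside the square brackets and convert them to integers.
row_id <- as.integer(sub("^\\[(\\d+).*", "\\1", rownames(df1)))

# Put the new, named column first and reset the row names to 1..n.
df1 <- cbind(Row = row_id, df1)
rownames(df1) <- NULL
```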

Why does the frequency of my Gnocchi measurements not match the set granularity?

I'm running OpenStack and am trying to get my Gnocchi meters to come through more frequently so that I can run a scaling demo without lots of 5-minute lags. In Gnocchi I have changed the archive policy to a custom policy with granularity set to 30 seconds (I've also tried the following using the existing 'medium' policy, with the same result):
+---------------------+--------------------------------------------------------+
| Field | Value |
+---------------------+--------------------------------------------------------+
| aggregation_methods | std, count, min, max, sum, mean |
| back_window | 0 |
| definition | - points: 120, granularity: 0:00:30, timespan: 1:00:00 |
| name | test |
+---------------------+--------------------------------------------------------+
The cpu_util meter is picking it up correctly:
+------------------------------------+-------------------------------------------------------------------+
| Field | Value |
+------------------------------------+-------------------------------------------------------------------+
| archive_policy/aggregation_methods | std, count, min, max, sum, mean |
| archive_policy/back_window | 0 |
| archive_policy/definition | - points: 120, granularity: 0:00:30, timespan: 1:00:00 |
| archive_policy/name | test |
| created_by_project_id | e499d0c2e0fb4a05ac39c3f8c260052b |
| created_by_user_id | 21759a51f3834b9bbae49c3ed17a13e4 |
| creator | 21759a51f3834b9bbae49c3ed17a13e4:e499d0c2e0fb4a05ac39c3f8c260052b |
| id | e5a02f3a-9fbe-4e44-bb91-e1cfe6b86143 |
| name | cpu_util |
| resource/created_by_project_id | e499d0c2e0fb4a05ac39c3f8c260052b |
| resource/created_by_user_id | 21759a51f3834b9bbae49c3ed17a13e4 |
| resource/creator | 21759a51f3834b9bbae49c3ed17a13e4:e499d0c2e0fb4a05ac39c3f8c260052b |
| resource/ended_at | None |
| resource/id | 243b9715-95ba-4532-9728-3e61776e1c29 |
| resource/original_resource_id | 243b9715-95ba-4532-9728-3e61776e1c29 |
| resource/project_id | 43a7db62d5d54c4590e363868fff49e2 |
| resource/revision_end | None |
| resource/revision_start | 2018-08-08T14:05:09.770765+00:00 |
| resource/started_at | 2018-08-08T13:20:45.948842+00:00 |
| resource/type | instance |
| resource/user_id | 4e5015006b304e7ca57edc5419b42be3 |
| unit | % |
+------------------------------------+-------------------------------------------------------------------+
But the measurements are still only coming out every 5 minutes:
gnocchi measures show e5a02f3a-9fbe-4e44-bb91-e1cfe6b86143
+---------------------------+-------------+--------------+
| timestamp | granularity | value |
+---------------------------+-------------+--------------+
| 2018-08-08T13:30:00+00:00 | 30.0 | 0.0400002375 |
| 2018-08-08T13:35:00+00:00 | 30.0 | 0.0366666763 |
| 2018-08-08T13:40:00+00:00 | 30.0 | 0.0366667101 |
| 2018-08-08T13:45:00+00:00 | 30.0 | 0.0399999545 |
| 2018-08-08T13:50:00+00:00 | 30.0 | 0.0366664861 |
| 2018-08-08T13:55:00+00:00 | 30.0 | 0.0400000543 |
| 2018-08-08T14:00:00+00:00 | 30.0 | 0.0366665877 |
+---------------------------+-------------+--------------+
Any ideas what I am missing?
I had the same issue. In Gnocchi-backed Ceilometer there is a new configuration file, polling.yaml, and the resource polling interval is set there.
https://review.opendev.org/#/c/405682/
https://docs.openstack.org/ceilometer/pike/admin/telemetry-best-practices.html

For each combination of a set of variables in a list, calculating correlations between this combination and another variable in R

In R I want to generate correlation coefficients by comparing two variables whilst also retaining a phylogenetic signal.
The way I first thought of doing this is not computationally efficient. I think there is a much simpler way, but I do not have the R skills to do it.
I have a CSV file which looks like this:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
What I want to do is, for each possible combination of the 20 single-letter columns (amino acids; 2^20 - 1 = 1,048,575 non-empty combinations, so about 1 million), calculate the correlation between the summed percentages of that combination and the OGT variable in the CSV (whilst retaining a phylogenetic signal).
My current code is this:
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
library(ape)
library(geiger)
library(caper)
taxonomynex <- read.nexus("taxonomyforzeldospecies.nex")
zeldodata <- read.csv("COMPLETECOPYFORR.csv")
Species <- dput(zeldodata)
SpeciesLong <-
Species %>%
gather(protein, proportion,
A:Y) %>%
arrange(Species)
S <- unique(SpeciesLong$protein)
Scombi <- unlist(lapply(seq_along(S),
function(x) combn(S, x, FUN = paste0, collapse = "")))
joint_protein <- function(protein_combo, data){
sum(data$proportion[vapply(data$protein,
grepl,
logical(1),
protein_combo)])
}
SplitSpecies <-
split(SpeciesLong,
SpeciesLong$Species)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("Scombi", "joint_protein"))
SpeciesAggregate <-
parLapply(cl,
X = SplitSpecies,
fun = function(data){
X <- lapply(Scombi,
joint_protein,
data)
names(X) <- Scombi
as.data.frame(X)
})
Species <- cbind(Species, SpeciesAggregate)
This attempts to load every combination into memory and then calculate the sum of the proportions of its amino acids, but it takes forever and crashes before completion.
I think it would be better to store the correlation coefficients in a vector and then just print the coefficient of each combination for each species, but I don't know the best way of doing this in R.
I also aim to retain a phylogenetic signal using the ape package, with something along the lines of this:
pglsModel <- gls(OGT ~ AminoAcidCombination, correlation = corBrownian(phy = taxonomynex),
data = zeldodata, method = "ML")
summary(pglsModel)
Apologies for how unclear this is. If anyone has any advice, it would be much appreciated!
Edit: Link to taxonomyforzeldospecies.nex
Output from dput(Zeldodata):
1 Species OGT Domain A C D E F G H I K L M N P Q R S T V W Y
------------------------------- ----- ---------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
2 Aeropyrum pernix 95 Archaea 9.7659115711 0.6720465616 4.3895390781 7.6501943794 2.9344881615 8.8666657183 1.5011817208 5.6901432494 4.1428307243 11.0604191603 2.21143353 1.9387130928 5.1038552753 1.6855017182 7.7664358772 6.266067034 4.2052190807 9.2692433532 1.318690698 3.5614200159
3 Argobacterium fabrum 26 Bacteria 11.5698896021 0.7985475923 5.5884500155 5.8165463343 4.0512504104 8.2643271309 2.0116736244 5.7962804605 3.8931525401 9.9250463349 2.5980609708 2.9846761128 4.7828063605 3.1262365491 6.5684282943 5.9454781844 5.3740045968 7.3382308193 1.2519739683 2.3149400984
4 Anaeromyxobacter dehalogenans 27 Bacteria 16.0337898849 0.8860252895 5.1368827707 6.1864992608 2.9730203513 9.3167603253 1.9360386851 2.940143349 2.3473650439 10.898494736 1.6343905351 1.5247123262 6.3580285706 2.4715303021 9.2639057482 4.1890063803 4.3992339725 8.3885969061 1.2890166336 1.8265589289
5 Aquifex aeolicus 85 Bacteria 5.8730327277 0.795341216 4.3287799008 9.6746388172 5.1386954322 6.7148035486 1.5438364179 7.3358775924 9.4641440609 10.5736658776 1.9263080969 3.6183861236 4.0518679067 2.0493569604 4.9229955632 4.7976564501 4.2005259246 7.9169763709 0.9292167138 4.1438942987
6 Archaeoglobus fulgidus 83 Archaea 7.8742687687 1.1695110027 4.9165979364 8.9548767369 4.568636662 7.2640358917 1.4998752909 7.2472039919 6.8957233203 9.4826333048 2.6014466253 3.206476915 3.8419576418 1.7789787933 5.7572748236 5.4763351139 4.1490633048 8.6330814159 1.0325605451 3.6494619148
This will give you a long data frame with each combination and its sum per Species (it takes about 35 seconds on my machine):
zeldodata <-
Species %>%
gather(protein, proportion, A:Y) %>%
group_by(Species) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion)
Here is an example of calculating each species separately and saving the data to disk, then reading each file back in and combining them:
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
# read in CSV file
zeldodata <-
read_delim(
delim = "|",
trim_ws = TRUE,
col_names = TRUE,
col_types = "cicdddddddddddddddddddd",
file = "Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y
Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159
Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984
Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289
Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987
Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148"
)
# save an RDS file for each species
for(species in unique(zeldodata$Species)) {
zeldodata %>%
filter(Species == species) %>%
gather(protein, proportion, A:Y) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion) %>%
saveRDS(file = paste0(species, ".RDS"))
}
# read in and combine all the RDS files
zeldodata <-
list.files(pattern = "\\.RDS") %>%
map(read_rds) %>%
bind_rows()
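The combination sums still have to be correlated with OGT. A minimal sketch of that step, under several assumptions: it uses plain Pearson correlation via cor(), which ignores phylogeny and is therefore not a substitute for the gls() model in the question; it uses only three of the twenty amino-acid columns, with values rounded from the table above; and the names `zeldo`, `acids`, `combos` and `combo_cor` are hypothetical.

```r
# Three amino-acid columns from the question's table, rounded to 2 decimals.
zeldo <- data.frame(
  Species = c("Aeropyrum pernix", "Argobacterium fabrum",
              "Anaeromyxobacter dehalogenans", "Aquifex aeolicus",
              "Archaeoglobus fulgidus"),
  OGT = c(95, 26, 27, 85, 83),
  A = c(9.77, 11.57, 16.03, 5.87, 7.87),
  C = c(0.67, 0.80, 0.89, 0.80, 1.17),
  D = c(4.39, 5.59, 5.14, 4.33, 4.92)
)
acids <- c("A", "C", "D")

# Every non-empty subset of the amino-acid columns, as a list of name vectors.
combos <- unlist(
  lapply(seq_along(acids), function(k) combn(acids, k, simplify = FALSE)),
  recursive = FALSE
)
names(combos) <- vapply(combos, paste0, character(1), collapse = "")

# For each subset: sum its columns per species, then correlate with OGT.
combo_cor <- vapply(
  combos,
  function(cols) cor(rowSums(zeldo[, cols, drop = FALSE]), zeldo$OGT),
  numeric(1)
)
```

Working on the wide data frame with rowSums() keeps one number per combination in memory instead of one row per species and combination, which may matter when all 20 columns are used.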

Change column value only if two other columns are duplicates

I am having a hard time figuring this out in R.
This is what I would like to do.
In a data frame like the one below, if both Name and Class are duplicated, I would like to add the scores of the matching rows together; otherwise the row should be left as it is.
+------------------+-----------+-------+
| Name | Class | Score |
+------------------+-----------+-------+
| Sara | Sophomore | 10 |
| John | Freshman | 20 |
| Taylor | Sophomore | 30 |
| Tyler | Junior | 10 |
| Keith | Junior | 20 |
| Andrew | Senior | 30 |
| Victor | Senior | 10 |
| Nancy | Sophomore | 20 |
| Taylor | Junior | 30 |
| John | Senior | 10 |
| Victor | Freshman | 20 |
| Sara | Sophomore | 30 |
| John | Freshman | 10 |
| Taylor | Sophomore | 20 |
| John | Senior | 30 |
+------------------+-----------+-------+
So basically, the end result should look like:
+--------+-----------+-------+
| Name | Class | Score |
+--------+-----------+-------+
| Sara | Sophomore | 40 |
| John | Freshman | 30 |
| Taylor | Sophomore | 50 |
| Tyler | Junior | 10 |
| Keith | Junior | 20 |
| Andrew | Senior | 30 |
| Victor | Senior | 10 |
| Nancy | Sophomore | 20 |
| Taylor | Junior | 30 |
| John | Senior | 40 |
| Victor | Freshman | 20 |
+--------+-----------+-------+
As you can see, if Name is the only duplicated value, the score does not change (for example, John Freshman and John Senior). If Class is the only duplicated value, it does not change either. Both columns in a row have to be duplicated for the scores to be added.
My attempt is below, but it is not working and I am getting this error message:
'Error in if ((experiment[i, 1] == experiment[j, 1]) & (experiment[i, 2] == : missing value where TRUE/FALSE needed'
My code:
# creating an empty data frame
experiment1<-data.frame(matrix(ncol=3, nrow=15))
for(i in 1: nrow(experiment)){
for(j in i+1: nrow(experiment)){
if((experiment[i,1] == experiment[j,1]) & (experiment[i,2] == experiment[j,2])){
experiment1[i,1] <- experiment[i,1]
experiment1[i,2] <- experiment[i,2]
experiment1[i,3] <- experiment[i,3] + experiment[j,3]}
else{
experiment1[i,1] <- experiment[i,1]
experiment1[i,2] <- experiment[i,2]
experiment1[i,3] <- experiment[i,3]}}}
Could anyone help fix my code or suggest "nobler" code?
This is a standard aggregation task, covered in most introductory R tutorials. Here are three ways to sum Score for each Name and Class pair.
base R
aggregate(formula = Score ~ Name + Class, data = mydf, FUN = sum)
dplyr
mydf %>% group_by(Name, Class) %>% summarize(scoreSum = sum(Score))
data.table
setDT(mydf)[ , .(scoreSum = sum(Score)), by = .(Name, Class)]
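A self-contained run of the base R variant on the data from the question may help to check the result. This is a sketch: the name `mydf` follows the answer, while `totals` is a hypothetical name. On the error in the original loop: `i+1: nrow(experiment)` is parsed as `i + (1:nrow(experiment))`, so `j` runs past the last row and the comparison yields NA.

```r
# The 15 rows from the question.
mydf <- data.frame(
  Name  = c("Sara", "John", "Taylor", "Tyler", "Keith", "Andrew", "Victor",
            "Nancy", "Taylor", "John", "Victor", "Sara", "John", "Taylor",
            "John"),
  Class = c("Sophomore", "Freshman", "Sophomore", "Junior", "Junior",
            "Senior", "Senior", "Sophomore", "Junior", "Senior", "Freshman",
            "Sophomore", "Freshman", "Sophomore", "Senior"),
  Score = c(10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Sum Score within each (Name, Class) pair; unique pairs keep their score.
totals <- aggregate(Score ~ Name + Class, data = mydf, FUN = sum)

# Look up one pair to check the result against the expected table.
totals[totals$Name == "Sara" & totals$Class == "Sophomore", ]
```

Note that aggregate() returns the groups sorted by Class and then Name, so the row order differs from the expected table in the question even though the sums agree.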

Resources