This question already has answers here:
R Optimizing double for loop, matrix manipulation
(4 answers)
Closed 7 years ago.
Suppose I have an extremely large data frame with 2 columns and 0.5 million rows.
For example, a few rows may look like this:
# Start End
# 89 100
# 93 120
# 95 125
# 101 NA
# 115 NA
# 123 NA
# 124 NA
I would like to manipulate this data frame to output a data frame that looks
like this:
# End Start
# 100 89, 93, 95
# 120 101, 115
# 125 123, 124
What would be the absolute quickest way to do this, given that there are
0.5 million rows? bgoldst suggested this awesome piece of code:
# m is a large two column data frame
end <- na.omit(m[,'V2'])
out <- data.frame(End = end,
                  Start = unname(sapply(split(m[,'V1'], findInterval(m[,'V1'], end))[as.character(0:(length(end) - 1))],
                                        paste, collapse = ', ')))
However, this is taking a bit too long.
Thanks for the help!
The answers on the possible duplicate post did not address the time issue. bgoldst's answer produced the desired outcome, but was very slow on my computer. I was wondering if there was something further that I could do to make this run faster.
A solution with data.table may be faster:
library(data.table)
dt <- setDT(df)[, id := findInterval(Start, End[!is.na(End)])][, paste(Start, collapse = ","), by = id]
result <- data.frame(End = df$End[!is.na(df$End)], Start = dt$V1)
# End Start
#1 100 89,93,95
#2 120 101,115
#3 125 123,124
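If the remaining concern is speed on the full 0.5 million rows, it is worth timing the two versions directly rather than guessing. A minimal sketch, assuming the microbenchmark package is installed and using a toy data frame with Start/End columns as above (scale it up for a realistic test):
library(data.table)
library(microbenchmark)  # assumed available, used only for timing

df <- data.frame(Start = c(89, 93, 95, 101, 115, 123, 124),
                 End   = c(100, 120, 125, NA, NA, NA, NA))

base_version <- function(d) {
  end <- na.omit(d$End)
  data.frame(End = end,
             Start = unname(sapply(split(d$Start, findInterval(d$Start, end))[as.character(0:(length(end) - 1))],
                                   paste, collapse = ", ")))
}

dt_version <- function(d) {
  end <- d$End[!is.na(d$End)]
  dt  <- as.data.table(d)[, id := findInterval(Start, end)][, paste(Start, collapse = ","), by = id]
  data.frame(End = end, Start = dt$V1)
}

microbenchmark(base_version(df), dt_version(df), times = 10)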
Related
I have a data frame with 25 million rows and I need to run a substring function to all 25 million rows of data. Because of the size of the data frame I thought apply would be the most efficient way of doing this.
df <- data.frame( seq_start=c(75, 59, 44),
seq_end=c(151, 135, 120),
sequence=c("NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA", "NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG", "NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA"))
The function call that I thought would be the most efficient way to accomplish this:
apply(df,1,substr(sequence,seq_start,seq_end))
I'm not familiar with the apply function, and a loop is way too inefficient to process 25 million lines.
Not 100% sure what you need/want, but it seems that using the dplyr syntax is useful here (more useful than apply, as you're only looking to extract a substring from a single column):
library(dplyr)
df %>%
mutate(substring = substr(sequence,seq_start,seq_end))
seq_start seq_end
1 75 151
2 59 135
3 44 120
sequence
1 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG
3 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA
substring
1 ATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 TAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACAC
3 AAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATAT
Base R:
df$substring <- substr(df$sequence,df$seq_start,df$seq_end)
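The vectorised substr() call above already avoids row-by-row looping, which is the main thing that matters at 25 million rows. If it is still too slow, stri_sub() from the stringi package is a commonly used drop-in that tends to be faster; a sketch, assuming stringi is installed and that sequence may have been read in as a factor:
library(stringi)

df$sequence  <- as.character(df$sequence)   # guard against factor columns
df$substring <- stri_sub(df$sequence, df$seq_start, df$seq_end)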
I would like to enter a frequency table into an R data.table.
The data are in a format like this:
Height
Gender 3 35
m 173 125
f 323 198
... where the entries in the table (173, 125, etc.) are counts.
I have a 2 by 2 table, and I want to turn it into a long-format data.table.
The data are from a study of birds that nest at different heights. The question is whether different genders of the bird prefer certain heights.
I thought the frequency table should be turned into something like this:
Gender height N
m 3 173
m 35 125
f 3 323
f 35 198
but now I'm not so sure. Some of the models I want to run need every case itemized.
Can I do this conversion in R? Ideally, I'd like a way to switch back and forth between the two formats.
Based on a review of ?table.
Make a data frame (x) with columns for Gender, Height, and Freq (which would be your N value).
Convert that to a table using:
tabledata <- xtabs(Freq ~ ., x)
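For the bird data above, that step might look like this (a minimal sketch; the data frame x is built here just for illustration, and Freq ~ Gender + Height is equivalent to Freq ~ . for this data frame):
# Counts in long form: one row per Gender x Height cell
x <- data.frame(Gender = c("m", "m", "f", "f"),
                Height = c(3, 35, 3, 35),
                Freq   = c(173, 125, 323, 198))

# Long counts -> contingency table (rows/columns follow factor level order)
tabledata <- xtabs(Freq ~ Gender + Height, x)

# ... and back again: contingency table -> long counts
as.data.frame(tabledata)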
There are a number of base functions that can work with this kind of data, which is obviously much more compact than individual rows.
Also from ?loglin, here is an example using a table:
loglin(HairEyeColor, list(c(1, 2), c(1, 3), c(2, 3)))
Thanks, everybody (@simon and @Elin), for the help. I thought I was conducting a poll that would get answers like "start with the 4-row version" or "start with the 719-row version", and you all have given me an entire toolbox of ways to move from one to the other. It's really great, informative, and way more than the inquiry deserves.
I unquestionably need to work harder and be more explicit in forming a question. The -3 rating this boondoggle has earned crystallizes the fact that I'm not adding anything to the knowledge base, so I will delete the question to keep future searchers from finding it. I've had a bad run recently with my questions, and as a former teacher of the year, writer of five books, and PhD statistician, it's extremely embarrassing to have been on Stack Exchange for as long as I have and to stand here with one reputation point. One. That means my upvotes of your answers don't count for a thing.
That reputation point should be scarlet colored.
Here's what I was getting at:
In a book, a common way to express data is in a 2×2 table:
Height
Gender 3 35
M 173 125
F 323 198
My tic-tac-sized mind sees two ways of entering that into a data table:
require(data.table)
GENDER <- c("m","m","f","f")
HEIGHT <- c(3, 35, 3, 35)
N <- c(173, 125, 323, 198)
SANDFLIERS <-data.table(GENDER, HEIGHT, N)
That gives the four-line flat-file/tidy representation of the data:
GENDER HEIGHT N
1: m 3 173
2: m 35 125
3: f 3 323
4: f 35 198
The other option is to make a 719-row data table with 173 males at 3 ft, 125 males at 35 ft, and so on. It's not too bad if you use the rep() command and build your table columns carefully. I hate doing arithmetic, so I leave some of these numbers bare and untotaled.
# I need 173+125 males, and 323+198 females.
# One c(rep()) for "m", one c(rep()) for "f", and one c() to merge them
gender <- c(c(rep("m", 173+25)), c(rep("f",(323+198))))
# Same here, except the c() functions are one level 'deeper'. I need two
# sets for males (at heights 3 and 35, 173 and 125 of each, respectively)
# and two sets for females (at heights 3 and 35, 323 and 198 respectively)
heights <-c(c(c(rep(3, 173)), c(rep(35,25))), c(c(rep(3, 323)), c(rep(35,198))))
which, when merged into a data.table, gives 719 rows, one for each observed bird.
1: m 3
2: m 3
3: m 3
4: m 3
5: m 3
---
715: f 35
716: f 35
717: f 35
718: f 35
719: f 35
Now that I have the data in two formats, I start looking for ways to do plots and analyses.
I can get a mosaic plot using the 719-row version, but you can't see it because of my 1-point reputation
mosaicplot(table(sandfliers), color = TRUE)
Mosaic Plot
and you can get a balloon plot using the 4-row version
Balloon Plot
So my question was, for those of you with lots and lots of experience with this sort of thing: do you find the 4-row or the 719-row table more common? I can change from one to the other, but that's more code to add to the book (again I hear my editor: "You're teaching statistics, not R").
So, as I said at the top, this was just an informal poll on whether one is used more often than the other, or whether beginners are better off with one.
This is in the form of a contingency table. It isn't easy to enter directly into R but it can be done as follows (based on http://cyclismo.org/tutorial/R/tables.html):
> f <- matrix(c(173,125,323,198),nrow=2,byrow=TRUE)
> colnames(f) <- c(3,35)
> rownames(f) <- c("m","f")
> f <- as.table(f)
> f
3 35
m 173 125
f 323 198
You can then create a count or frequency table with:
> as.data.frame(f)
Var1 Var2 Freq
1 m 3 173
2 f 3 323
3 m 35 125
4 f 35 198
The R Cookbook gives a short function to convert to a table of cases (i.e. a long list of the individual items), as follows:
> countsToCases(as.data.frame(f))
... where:
# Convert from data frame of counts to data frame of cases.
# `countcol` is the name of the column containing the counts
countsToCases <- function(x, countcol = "Freq") {
  # Get the row indices to pull from x
  idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
  # Drop count column
  x[[countcol]] <- NULL
  # Get the rows from x
  x[idx, ]
}
... thus you can convert the data to the format needed by any analysis method from any starting format.
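Going the other way again (cases back to counts) is just table(), so a complete round trip might look like this (a sketch reusing f and countsToCases from above):
cases <- countsToCases(as.data.frame(f))   # one row per individual bird
head(cases)

# Collapse the itemised cases back into a contingency table of counts
table(cases$Var1, cases$Var2)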
(EDIT)
Another way to read in the contingency table is to start with text like this:
> ss <- " 3 35
+ m 173 125
+ f 323 198"
> read.table(text=ss, row.names=1)
X3 X35
m 173 125
f 323 198
Instead of using text =, you can also use a file name to read the table from (for example) a CSV file.
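If you need an actual "table" object again (for as.data.frame, loglin, mosaicplot, and so on) rather than a plain data frame, one extra step does it; a sketch, where check.names = FALSE keeps the column labels as 3 and 35 instead of X3 and X35:
f2 <- as.table(as.matrix(read.table(text = ss, row.names = 1, check.names = FALSE)))
names(dimnames(f2)) <- c("Gender", "Height")   # optional dimension labels
as.data.frame(f2)                              # same long count format as before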
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I'm attempting to merge two dataframes. One dataframe contains rownames which appear as values within a column of the other dataframe. I would like to append a single column (Top.Viral.TaxID.Name) from the second dataframe to the first, matching on these mutual values.
The first dataframe looks like this:
ERR1780367 ERR1780369 ERR2013703 xxx...
374840 73 0 0
417290 56 57 20
1923444 57 20 102
349409 40 0 0
265522 353 401 22
322019 175 231 35
The second dataframe looks like this:
Top.Viral.TaxID Top.Viral.TaxID.Name
1 374840 Enterobacteria phage phiX174 sensu lato
2 417290 Saccharopolyspora erythraea prophage pSE211
3 1923444 Shahe picorna-like virus 14
4 417290 Saccharopolyspora erythraea prophage pSE211
5 981323 Gordonia phage GTE2
6 349409 Pandoravirus dulcis
However, I would also like to preserve the rownames of the first dataframe, so the result would look something like this:
ERR1780367 ERR1780369 ERR2013703 xxx... Top.Viral.TaxID.Name
374840 73 0 0 Enterobacteria phage phiX174 sensu lato
417290 56 57 20 Saccharopolyspora erythraea prophage pSE211
1923444 57 20 102 Shahe picorna-like virus 14
349409 40 0 0 Pandoravirus dulcis
265522 353 401 22 Hyposoter fugitivus ichnovirus
322019 175 231 35 Acanthocystis turfacea Chlorella virus 1
Thanks in advance.
I would strongly recommend against relying on rownames. They are embarrassingly often removed, and the functions in dplyr/tidyr always strip them.
Always make the rownames a part of the data, i.e. use "tidy" data sets as in the example below
data(iris)
# We mix the data a bit, to check if rownames are conserved
iris = iris[sample.int(nrow(iris), 20),]
head(iris)
description =
data.frame(Species = unique(iris$Species))
description$fullname = paste("The wonderful", description$Species)
description
# .... the above are your data
iris = cbind(row = rownames(iris), iris)
# Now it is easy
merge(iris, description, by="Species")
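If you also want the original rownames back on the merged result, a small follow-up sketch (capturing the merge above in a variable first; merged is just an illustrative name):
merged <- merge(iris, description, by = "Species")
rownames(merged) <- merged$row   # restore the saved rownames
head(merged)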
And please, use reproducible data when asking questions on SO to get fast answers. It is a lot of work to reformat the data you presented into a form that can be tested.
Use sapply to loop through the rownames of dataframe 1 (df1), look up each id in dataframe 2 (df2), and return the description from the matching row.
Something like this
df1$Top.Viral.TaxID.Name <- sapply(rownames(df1), function(id) {
  df2$Top.Viral.TaxID.Name[df2$Top.Viral.TaxID == id]
})
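A vectorised alternative worth knowing (not what the answer above uses, just a common base R idiom) is match(), which does a single lookup and returns NA for any rowname that has no entry in df2 instead of producing an awkward empty result:
# For each rowname of df1, find the matching row in df2 (NA if absent)
hit <- match(rownames(df1), df2$Top.Viral.TaxID)
df1$Top.Viral.TaxID.Name <- df2$Top.Viral.TaxID.Name[hit]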
This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
I have some data that I am working on. I need to run a repeated measures ANOVA test on it, but first I have to reshape the data to long format. I did something as shown on a website; the reshaping doesn't give any error, but I don't think it works, so the ANOVA test gives an error. Here is my code and the error.
# reshaping to long format
id=1:length(veri$SIRA)
k.1 <- veri$KOLEST
k.2 <- veri$KOLEST2
k.3 <- veri$KOLEST3
veri2 <- data.frame(id,k.1,k.2,k.3)
longformat <- reshape(veri2,direction="long", varying=list("k.1","k.2","k.3"), idvar="id")
This is output for longformat
id time k.1 k.2 k.3
1 1 1 209 195 181
2 2 1 243 184 172
3 3 1 192 178 162
4 4 1 210 112 93
5 5 1 190 188 172
6 6 1 232 169 156
Time is 1 all along. This seems a little odd to me. I thought it should be 1-2-3 according to the 3 different measures.
And this is error when I run the test:
repmesao <- aov(k~time+Error(id/time), data=longformat)
Error in model.frame.default(formula = k ~ id/time, data = longformat, :
invalid type (list) for variable 'k'
How can I fix this problem? Any suggestions?
For reshaping, use library(tidyr) and a command like so
data_long <- gather(data, group, dv, range of columns)
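Applied to the veri2 data frame from the question, that could look something like this (a sketch; gather() is the older tidyr interface, pivot_longer() is its newer equivalent):
library(tidyr)

# Stack k.1, k.2, k.3 into a time/k pair of columns
longformat <- gather(veri2, time, k, k.1:k.3)
longformat$time <- factor(longformat$time)
longformat$id   <- factor(longformat$id)

# Repeated-measures ANOVA on the properly stacked data
repmesao <- aov(k ~ time + Error(id/time), data = longformat)
summary(repmesao)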
If you have more than one dv, then this procedure is not ideal. What I usually do is split the data by dv, like data_dv1 <- data[1:3] and data_dv2 <- cbind(data[1:2], data[4]). I reshape each as shown above, then just cbind(data_dv1_long, data_dv2_long), keeping in mind that not all columns should be combined (you will have, for instance, your subject id in both data frames), so choose the columns for cbind accordingly.
Also, I don't know what you are going to use for the ANOVA, but I recommend library(ez) with a command like so:
ezANOVA(data=data, dv=.(dv), wid=.(subject_id), within=.(group1), between=.(group2), detailed=TRUE)
This question already has answers here:
How to drop factor levels while scraping data off US Census HTML site
(2 answers)
Closed 5 years ago.
I used as.data.frame(table(something_to_count)) and get a result like:
Var1 Freq
1 20 2970
2 30 1349
3 40 322
4 50 1009
I just want the $Var1 value, but if I write d[1,]$Var1 or d[1,1], I always get these things:
[1] 20
305 Levels: 20 30 40 50 60 70 80 90 100 110 120 130 150 160 170 190 200 ... 4120
And when I try to output the value, it is always not 20, but 1, and as.numeric() also only returns 1. How can I literally get the Var1 value as it is, instead of getting the id of the row? Also, why is the output a level number? What is wrong?
The as.data.frame method for objects of class "table" returns the first column (along with any other "marginal labels" columns) as a factor, and only the last column as the numeric counts. See the help page for ?table and look at the Value section. Tyler's recommendation to use the R-FAQ-recommended as.numeric(as.character(.)) conversion strategy is "standard R".
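Concretely, for the asker's data frame d, the conversion looks like this (a sketch):
# Factor -> character -> numeric recovers the printed values, not the level codes
d$Var1 <- as.numeric(as.character(d$Var1))
d[1, 1]   # now 20 rather than the level code 1

# Equivalent, and a bit faster on large factors (apply it to the original factor):
# as.numeric(levels(d$Var1))[d$Var1]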
This is because the function table turns the argument into a factor (type table into your console and you'll see the line a <- factor(a, exclude = exclude)).
The best solution is just to do what Tyler suggested and transform the results of table into a data.frame.