Extracting block of m rows at regular interval from large dataset

Extracting block of m rows at regular interval from large dataset - r

I have a small problem. I have a dataset with 8208 rows of data. It's a single column of data, I want to take every n rows as a block and add this to a new data frame.
So, for example:
newdf has column 1 to column 23.
column 1 is composed of rows 289:528 from the original dataset
column 2 is composed of rows 625:864 from the original dataset
And so on. The "block" size is 239 rows, the jump between blocks is every 336 rows.
I can do this manually, but it just becomes tedious. I have to repeat this entire procedure for another 11 sets of data so obviously a more automated approach would be preferable.

The trick here is to create an index of integers that refer to the row numbers you want to keep. This is simple enough with some use of rep, sequences and R's recycling rule.
Let me demonstrate using iris. Say you want to skip 25 rows, then return 3 rows:
skip <- 25
take <- 3
total <- nrow(iris)
reps <- total %/% (skip + take)
index <- rep(0:(reps-1), each=take) * (skip + take) + (1:take) + skip
The index now is:
index
[1] 26 27 28 54 55 56 82 83 84 110 111 112 138 139 140
And the rows of iris:
iris[index, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica

Update
Note the OP states the block size is 239 elements but it is clear from the examples rows indicated that the block size is 240
> length(289:528)
[1] 240
I'll leave the example below at a block length of 239, but adjust if it is really 240.
It isn't clear from the Question, but assuming that you have something like this
df <- data.frame(A = runif(8208))
a data frame with 8208 rows.
First compute the indices of the elements of A that you need to keep. This is done via
want <- sapply(seq(289, nrow(df)-239, by = 336),
function(x) x + (seq_len(239) - 1))
Then we can use the fact that R fills matrices by columns and convert the required elements of A to a matrix with 239 rows
mat <- matrix(df$A[want], nrow = 239)
This works
> all.equal(mat[,1], df$A[289:527])
[1] TRUE
but do note that I have taken a block length of 239 here (289:527) not the indices the OP quotes as that is a block size of 240 (see Update above)
If you want this is a data frame, just add
df2 <- as.data.frame(mat)

Try this:
1) Create a list of indices
lapply(seq(1, 8208, 336), function(X) X:(X+239)) -> Indices
2) Select Data
Columns <- lapply(Indices, function(X) OldDF[X,])
3) Combine selected data in columns
NewDF <- do.call(cbind, Columns)

Why not just:
as.dataframe(matrix(orig, nrow=528 )[289:528 ,])
Since the 8028 is not an exactl multiple of the row count we need to determine the columns:
> 8208/528
[1] 15.54545 # so either 15 or 16
> 8208-15*528
[1] 288 # all in the to-be-discarded section
as.dataframe(matrix(orig, nrow=528, col=15 )[289:528 ,])
Or:
as.dataframe(matrix(orig, nrow=528, col=8208 %/% 528)[289:528 ,])

Related

Function to filter data equal to or greater than a certain value

I have a dataframe containing thousands of rows and columns. The rows contain the names of genes and the columns the names of samples.
I only want to keep the rows that contain a value equal to or greater than 5 in more than 3 samples.
I tried this so far but I can't figure out how to set multiple conditions:
data.frame1 %>% filter_all(all_vars(.>= 5))
I hope I have stated this question correctly.

The way I do it in my gene expression filtering pre-differential gene expression pipeline is as follows:
data.frame1[rowSums(data.frame1 >= 5) > 3, ] -> filtered.counts
And if your first column is your gene identifier, with all the other columns being numeric, you can have the evaluation skip the first column as follows:
data.frame1[rowSums(data.frame1[-1] >= 5) > 3, ] -> filtered.counts

The way to do this in dplyr 1.0.0 is
iris %>%
filter(rowSums(across(where(is.numeric)) > 6) > 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.6 3.0 6.6 2.1 virginica
2 7.3 2.9 6.3 1.8 virginica
3 7.2 3.6 6.1 2.5 virginica
4 7.7 3.8 6.7 2.2 virginica
5 7.7 2.6 6.9 2.3 virginica
6 7.7 2.8 6.7 2.0 virginica
7 7.4 2.8 6.1 1.9 virginica
etc
For your case
data.frame1 %>%
filter(rowSums(across(where(is.numeric)) >= 5) > 3)

Plotting sales over time in R

I am trying to show the top 100 sales on a scatterplot by year. I used the below code to take top 100 games according to sales and then set it as a data frame.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
as.data.frame(top100)
I then tried to plot this with the below code:
ggplot(top100)+
aes(x=Year, y = Global_Sales) +
geom_point()
I bet the below error when using the subset top100
Error: data must be a data frame, or other object coercible by fortify(), not a numeric vector
if i use the actual games dataseti get the plot attached.
Any ideas?

As pointed out in comments by #CMichael, you have several issues in your code.
In absence of reproducible example, I used iris dataset to explain you what is wrong with your code.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
By doing that you are only extracting a single column.
The same command with the iris dataset:
> head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)
[1] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 7.2 7.1 7.0 6.9 6.9 6.9 6.9 6.8 6.8 6.8
So, first, you do not have anymore two dimensions to be plot in your ggplot2. Second, even colnames are not kept during the extraction, so you can't after ask for ggplot2 to plot Year and Global_Sales.
So, to solve your issue, you can do (here the example with the iris dataset):
top100 = as.data.frame(head(iris[order(iris$Sepal.Length, decreasing = TRUE), 1:2], n = 100))
And you get a data.frame of of this type:
> str(top100)
'data.frame': 100 obs. of 2 variables:
$ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
$ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
> head(top100)
Sepal.Length Sepal.Width
132 7.9 3.8
118 7.7 3.8
119 7.7 2.6
123 7.7 2.8
136 7.7 3.0
106 7.6 3.0
And then if you are plotting:
library(ggplot2)
ggplot(top100, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Warning Based on what you provided in your example, I will suggest you to do:
top100 <- as.data.frame(head(games[order(games$NA_Sales,decreasing=TRUE),c("Year","Global_Sales")], 100))
However, if this is not satisfying to you, you should consider to provide a reproducible example of your dataset How to make a great R reproducible example

Sum column every n column in a data frame R

I have a df(A) with 10 column and 300 row. I need to sum every two column, between them, in this way:
A[,1]+A[,2] = # first result
A[,3]+A[,4] = # second result
A[,5]+A[,6]= # third result
....
A[,9]+A[,10] # last result
The expected final result is a new dataframe with 5 column and 300 row.
Any way to do this? with tapply or loop for?
I know that i can try with the upon example, but i'm looking for a fast method
Thank you

We could use sapply:
df <- data.frame(replicate(expr=rnorm(100),n = 10))
sapply(seq(1,9,by=2),function(i) rowSums(df[,i:(i+1)]))

You can do it without *apply loops.
Sample data:
df <- head(iris[-5])
df
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
Now you can use vector recycling of a logicals:
df[c(TRUE,FALSE)] + df[c(FALSE,TRUE)]
# Sepal.Length Petal.Length
#1 8.6 1.6
#2 7.9 1.6
#3 7.9 1.5
#4 7.7 1.7
#5 8.6 1.6
#6 9.3 2.1

It's a bit cryptic but I it should be fast. We add each column to the adjacent column. Then delete the unnecessary results with c(T,F) which recycles through odd columns:
(A[1:(ncol(A)-1)] + A[2:ncol(A)])[c(T,F)]

R range referencing with different dimensions not causing an error

In using range referencing I normally expect to see an error or at least a warning message when the operations in '[' ']' do not match the dimensions of the parent object, however I have just discovered that I am not seeing said warnings and errors. Is there a setting for this or a way to force an error? Example:
x = 1:5
y = 10:12
x[y>10]
y[x>2]
likewise this applies to data frames and other R objects:
dat = data.frame(x=runif(100),y=1:100)
dat[sample(c(TRUE,FALSE),23),c(TRUE,FALSE)]
The silent repetition and truncation of the references to match the dimensions of the parent object is unexpected, having used R for years, I've somehow never noticed this before.
I'm using R Console (64-bit) 3.0.1 for Windows (could be updated yes, but I hope this isn't the cause).
Edit: Fixed data.frame example as data.frame's don't allow more column references than columns. Thanks zero323.

You could modify the `[.data.frame` function to throw a warning when indexing with a logical vector that doesn't evenly divide the number of rows:
`[.data.frame` <- function(x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1) {
if (!missing(i) && is.logical(i) && nrow(x) %% length(i) != 0) {
warning("Indexing data frame with logical vector that doesn't evenly divide row count")
}
base::`[.data.frame`(x, i, j, drop)
}
Here's a demonstration with the 150-row iris dataset, passing logical indexing vectors of length 11 (should cause warning) and 15 (should not cause warning):
iris[c(rep(FALSE, 10), TRUE),]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 11 5.4 3.7 1.5 0.2 setosa
# 22 5.1 3.7 1.5 0.4 setosa
# 33 5.2 4.1 1.5 0.1 setosa
# 44 5.0 3.5 1.6 0.6 setosa
# 55 6.5 2.8 4.6 1.5 versicolor
# 66 6.7 3.1 4.4 1.4 versicolor
# 77 6.8 2.8 4.8 1.4 versicolor
# 88 6.3 2.3 4.4 1.3 versicolor
# 99 5.1 2.5 3.0 1.1 versicolor
# 110 7.2 3.6 6.1 2.5 virginica
# 121 6.9 3.2 5.7 2.3 virginica
# 132 7.9 3.8 6.4 2.0 virginica
# 143 5.8 2.7 5.1 1.9 virginica
# Warning message:
# In `[.data.frame`(iris, c(rep(FALSE, 10), TRUE), ) :
# Indexing data frame with logical vector that doesn't evenly divide number of rows
iris[c(rep(FALSE, 14), TRUE),]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 15 5.8 4.0 1.2 0.2 setosa
# 30 4.7 3.2 1.6 0.2 setosa
# 45 5.1 3.8 1.9 0.4 setosa
# 60 5.2 2.7 3.9 1.4 versicolor
# 75 6.4 2.9 4.3 1.3 versicolor
# 90 5.5 2.5 4.0 1.3 versicolor
# 105 6.5 3.0 5.8 2.2 virginica
# 120 6.0 2.2 5.0 1.5 virginica
# 135 6.1 2.6 5.6 1.4 virginica
# 150 5.9 3.0 5.1 1.8 virginica

Expanding on #josilber I've written the following for atomic vector and matrix subsetting in case anyone else wants it:
`[` <- function(x, i) {
if(!missin
g(i) && is.logical(i) && (length(x) %% length(i) != 0 || length(i) > length(x))) {
warning("Indexing atomic vector with logical vector that doesn't evenly divide row count")
}
base::`[`(x,i)
}
`[` <- function(x,i,j,...,drop=TRUE) {
if (!missing(i) && is.logical(i) && nrow(x) %% length(i) != 0) {
warning("Indexing matrix with logical vector that doesn't evenly divide row count")
}
if (!missing(j) && is.logical(j) && nrow(x) %% length(j) != 0) {
warning("Indexing matrix with logical vector that doesn't evenly divide column count")
}
base::`[`(x,i,j,...,drop)
}
Testing my original example afterwards with this modification now produces the warning and other operations behave as per normal:
> x =
1:5
> y = 10:12
> x[y>10]
[1] 2 3 5
Warning message:
In x[y > 10] :
Indexing atomic vector with logical vector that doesn't evenly divide row count
> y[x>2]
[1] 12 NA NA
Warning message:
In y[x > 2] :
Indexing atomic vector with logical vector that doesn't evenly divide row count
> x[x>2]
[1] 3 4 5
> x[1:2]
[1] 1 2

remove rows containing certain data

In my data frame the first column is a factor and I want to delete rows that have a certain value of factorname (when the value is present). I tried:
df <- df[-grep("factorname",df$parameters),]
Which works well when the targeted factor name is present. However if the factorname is absent, this command destroys the data frame, leaving it with 0 rows. So I tried:
df <- df[!apply(df, 1, function(x) {df$parameters == "factorname"}),]
that does not remove the offending lines. How can I test for the presence of factorname and remove the line if factorname is present?

You could use:
df[ which( ! df$parameter %in% "factorname") , ]
(Used %in% since it would generalize better to multiple exclusion criteria.) Also possible:
df[ !grepl("factorname", df$parameter) , ]

l<-sapply(iris,function(x)is.factor(x)) # test for the factor variables
>l
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
FALSE FALSE FALSE FALSE TRUE
m<-iris[,names(which(l=="TRUE"))]) #gives the data frame of factor variables only
iris[iris$Species !="setosa",] #generates the data with Species other than setosa
> head(iris[iris$Species!="setosa",])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Extracting block of m rows at regular interval from large dataset - r

Try this: 1) Create a list of indices lapply(seq(1, 8208, 336), function(X) X:(X+239)) -> Indices 2) Select Data Columns <- lapply(Indices, function(X) OldDF[X,]) 3) Combine selected data in columns NewDF <- do.call(cbind, Columns)

Related

Function to filter data equal to or greater than a certain value

Plotting sales over time in R

Sum column every n column in a data frame R

R range referencing with different dimensions not causing an error

remove rows containing certain data

Categories

Resources