Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row - r

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?

You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55

If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that your are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column), which you can then cbind back to your data frame (and that means you modify your data frame just once).

Related

How to extract rows with similar names into a submatrix?

I am building an asymmetrical matrix of values with the rows being coefficient names and the column the value of each coefficient:
# Set up Row and Column Names.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
output:
> matrix1
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
and I wish to extract certain rows that share row names (i.e. all rows with "unitBreaks'x'") into a submatrix.
I tried this
est_actsBreaks <- est_coef_mtrx[c("actsBreaks0","actsBreaks1","actsBreaks2","actsBreaks3",
"actsBreaks4","actsBreaks5","actsBreaks6","actsBreaks7",
"actsBreaks8","actsBreaks9"),c("Value")]
but it returns a vector and I need a matrix. I have seen other questions concerning similar procedures but their columns and rows all had identical names and/or values. Is there a way to do the operation I have in mind, such as grep()?
Welcome to StackOverflow.
As usual in R, there would probably be many ways to do what you request.
EDIT: I realized that my solution was going a little bit too far, sorry about that.
To extract only the rows that contain the pattern "unitBreaks" followed by several numbers, and still keep a matrix structure, you can run the following code. In a nutshell, grep is going to look for the pattern that you need and the argument drop = FALSE is going to make sure that you get a matrix as a result and not a vector.
uniBreakLines <- grep("unitBreaks[0-9]*", rows)
matrix1[uniBreakLines, , drop = FALSE]
Below is the first version of my answer.
First, I create a vector that describes the groups of rows. For this, I remove the numbers at the end of the row names.
grp <- gsub("[0-9]+$", "", rows)
Then, I transform the data matrix into a data-frame (why I do that is explained a little bit later).
dat1 <- data.frame(matrix1)
Finally, I use "split" on the data-frame, with the groups defined earlier. Using split on the data-frame will keep the structure: the result will be a list of data-frames, even though there is only one column.
dat1.split <- split(dat1, grp)
The result is a list of data-frames.
lapply(dat1.split, head)
$actsBreaks
Value
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
$covgBreaks
Value
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
$Intercept
Value
Intercept 1
$plugBump
Value
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
$tBreaks
Value
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
$unitBreaks
Value
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
$yearBreaks
Value
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
After that, if you still need matrices, you can convert them with the function as.matrix in an "lapply":
matrix1.split <- lapply(dat1.split, as.matrix)
You might want to consider combining your data in a "tibble" with the "grouping" column. You will then be able to use these groups with the group_by function or other functions from the dplyr package (or other packages from the tidyverse).
For example:
library(dplyr)
tib1 <- tibble(rows, simpler_rows, value = 1:48)
And an example on how to use the grouping variable:
tib1 %>%
group_by(simpler_rows) %>%
summarize(sum(value))
EDIT bis: what if I don't know the pattern?
I played around a little bit with your example to answer the question (that nobody asked, but still, it's fun!): "what if I don't know the pattern?"
In this case, I would use a distance between the row names. This distance would look like this:
... and would be the output of the following lines of code
library(stringdist)
library(pheatmap)
strdist <- stringdistmatrix(rows)
pheatmap(strdist, border_color = "white", cluster_rows = F, cluster_cols = FALSE, cellwidth = 10, cellheight = 10, labels_row = rows, fontsize_row = 7)
After that, I only need to get the number of cluster, which can be done with a silhouette plot (similar to this one), that tells me that there are 8 clusters of words, which seems about right:
The cluster can be extracted then with the function used to create the silhouette plot (I used hclust and cutree).
Here a solution with dplyr and stringr to extract rownames that contain a certain string.
At the end change back to matrix:
library(dplyr)
library(stringr)
df1 <- df %>%
filter(!str_detect(rownames(df), "unitBreaks"))
df1 <- as.matrix(df1)
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48

R: Randomly sampling (with replacement) each column of a data frame independently

I am trying to create a new data frame by randomly sampling an existing data frame. Specifically, I want create a data frame that is the same size as the original data frame, but each column of the new data frame is a random sample (with replacement) of the corresponding column in the original data frame. My first attempt looked like this:
# Create toy data set
data.set <- as.data.frame(matrix(1:50, ncol = 5))
# Change names
colnames(data.set) <- c("Stuff", "Things", "Foo", "Bar", "Guff")
# Try to create randomly sampled data frame
data.set %>% sample_n(replace = TRUE, size = nrow(data.set))
The problem here is that it just randomly samples rows, but not elements within each column individually. For example, here is some output.
Stuff Things Foo Bar Guff
2 2 12 22 32 42
10 10 20 30 40 50
2.1 2 12 22 32 42
3 3 13 23 33 43
5 5 15 25 35 45
3.1 3 13 23 33 43
8 8 18 28 38 48
9 9 19 29 39 49
1 1 11 21 31 41
6 6 16 26 36 46
Notice that the first and third rows are exactly the same, as are the fourth and sixth rows. What I would like is for each and every column to be randomly sampled independently. So, I tried this.
apply(data.set, MARGIN = 2, sample_n, replace = TRUE, size = nrow(data.set))
which produced the following error:
Error: Don't know how to sample from objects of class integer
although, I don't see what I did incorrectly. Can anyone offer a concise way of achieving my goal?
First, the apply function should have argument. In this case we use columns since the margin is 2.
apply(df, MARGIN = 2, function(x) sample(x, replace = TRUE, size = length(x)))

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames, one with genotype counts and one with a number that I need to normalize my counts from the first dataset.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables don't have factors
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I see that lapply loops through my counts, one column at the time and through my coldata that contains the normalization value (Norm). All is looking good, until I combined the two action in the same step
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (dividing each column by its normalization number). After further inspection I noticed it's normalizing by rows, in other words it's normalizing across different columns, which shouldn't be the case as I am looping through one column at the time. I am probably missing a basic concept but looking through other SO posts didn't find anything I could use. My goal is to fix the code to make the right calculation but I also would like to understand why this code above is not working. Thanks so much.
The problem is in using [ and not [[. So, instead of looping through each of the elements in 'Group' column, we have a list of length 1 with all the elements. So, either use coldata[, 'Group'] or coldata[['Group']] or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
If the column name in 'countsdata' and 'Group' column from 'countsdata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")

How to extract certain rows

So As you can see I have a price and Day columns below
Price Day
2 1
5 2
8 3
11 4
14 5
17 6
20 7
23 8
26 9
29 10
32 11
35 12
38 13
41 14
44 15
47 16
50 17
53 18
56 19
59 20
I then want the output below
Difference Day
12 5
15 10
15 15
15 20
So now I have the difference in prices every 5 days...it just basically subtracts the 5th day with the first day.....and then the 10th day with the 5th day etc....
I already made a code that will seperate my data into 5 day intervals...but I want the code that will let me minus the 5th with the 1st day....the 10th day with the 5th day...etc
So the code should look something like this
difference<-tapply(Price[,1],Day, ____________)
So basically Price[,1] will be my Price data.....while "Day" is the variable that I created that will let me seperate my Day data into 5 day intervals.....I'm thinking that in the blank section I could put in the function or another variable that will let me subtract the 5th day with the 1st day prices and then the 10th day and 5th day prices...etc.....you dont have to help me to seperate my Days into intervals...just how to do "difference" section....thanks guys
Here's one option, assuming your data.frame is called "SODF":
within(SODF[c(1, seq(5, nrow(SODF), 5)), ], {
Price <- diff(c(0, Price))
})[-1, ]
# Price Day
# 5 12 5
# 10 15 10
# 15 15 15
# 20 15 20
The first step is basic subsetting. According to your description and expected answer, you want the first row, and then every fifth row starting from row 5:
> SODF[c(1, seq(5, nrow(SODF), 5)), ]
Price Day
1 2 1
5 14 5
10 29 10
15 44 15
20 59 20
From there, you can use diff on the "Price" column, but since diff will result in a vector that is one in length shorter than your input, you need to "pad" the input vector, which I did with diff(c(0, Price)).
# Correct values, but the number of rows needs to be 5
> diff(SODF[c(1, seq(5, nrow(SODF), 5)), "Price"])
[1] 12 15 15 15
Then, the [-1, ] at the end just deletes the extraneous row.
Update
In the comments below, #geektrader points out in the comments (thanks!), an alternative to using:
SODF[c(1, seq(5, nrow(SODF), 5)), ]
as your input data.frame, you may consider using the following instead:
rbind(SODF[1,], SODF[$Day %% 5 == 0,] )
The difference in the two approaches is that the first approach simply subsets by row number, while the second approach subsets according to the value in the "Day" column, extracting rows where "Day" is a multiple of 5. This second approach might be useful, for instance, when there are missing rows in the dataset.
Ananda's is a nice approach (always forget about within myself). Here's another approach:
dat2 <- dat[seq(0, nrow(dat), by=5), ]
data.frame(Difference=diff(c(dat[1,1], dat2[, 1])), Day=dat2[, 2])
Here a solution if you have a matrix as input.
The subsequent function, given a matrix m, a column col_id and a numeric interval interv, subtracts every interv rows the current value in the col_id column of the m matrix with the previous value (5 rows before, same column, obiviously).
The results are stored in a new column called diff and appended to the end of the m matrix.
In short, the approach is very similar to that used by #Ananda Mahto.
So, this is the function:
subtract_column <- function(m, col_id, interv) {
select <- c(1, seq(interv, nrow(m), interv))
cbind(m[select[-1], ], diff = diff(m[select, col_id]))
}
Example:
# this emulates your data as a matrix
price_vect <- c(2,5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59)
day_vect <- 1:20
matr <- do.call(cbind, list(price = price_vect, day = day_vect))
# and this calls the function above and does the job:
# subtracts every 5 rows the current and the previous (5 rows back) value in the column `price` of matrix `matr`
subtract_column(matr, 'price', 5)
Output:
price day diff
[1,] 14 5 12
[2,] 29 10 15
[3,] 44 15 15
[4,] 59 20 15

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)

Resources