I'm a newer user of R and can make my code work, but I suspect there is a dplyr or purrr function that does this more efficiently and with a lot less code; if there is, I haven't found it yet. My PI wants a summation of our race data, with the twist that respondents who answered exactly one race are counted on their own, and those who answered more than one get a breakdown of the combinations. I subsetted the data down to just those columns, then added the columns pairwise in each row and wrote the sums into a new 7x7 matrix.
This is my code. My question: is there a much more efficient way of doing this?
# sum races to create totaled matrix of all races
subset <- dataset[, 11:17]  # note: calling this "subset" shadows base::subset()
test <- matrix(NA, nrow = 7, ncol = 7)
colnames(test) <- c("African_American", "Asian", "Hawaiian_Pacific", "Native_Alaskan", "White_Euro", "Hispanic_Latino", "No-Answer")
rownames(test) <- c("African_American", "Asian", "Hawaiian_Pacific", "Native_Alaskan", "White_Euro", "Hispanic_Latino", "No-Answer")
# basic design: if == 1 then strictly one race; if > 1, put in the appropriate category
test[1,1] <- sum(subset$African_American==1, na.rm=TRUE)
test[1,2] <- sum(subset$African_American+subset$Asian>1, na.rm=TRUE)
test[1,3] <- sum(subset$African_American+subset$Hawaiian_Pacific>1, na.rm=TRUE)
test[1,4] <- sum(subset$African_American+subset$Native_Alaskan>1, na.rm=TRUE)
test[1,5] <- sum(subset$African_American+subset$White_Euro>1, na.rm=TRUE)
test[1,6] <- sum(subset$African_American+subset$Hispanic_Latino>1, na.rm=TRUE)
test[1,7] <- sum(subset$African_American+subset$`No-Answer`>1, na.rm=TRUE)
test[2,1] <- sum(subset$Asian+subset$African_American>1, na.rm=TRUE)
test[2,2] <- sum(subset$Asian==1, na.rm=TRUE)...
There are seven columns to add to each other so it moves all the way through the matrix and outputs something similar to this where the diagonal are actual counts of only one race and the others are multiple occurrences:
I found a way which does not use plyr but the base-R function apply.
data = data.frame(set1 = round(runif(n = 10, min = 0, max = 1)),
                  set2 = round(runif(n = 10, min = 0, max = 1)),
                  set3 = round(runif(n = 10, min = 0, max = 1)),
                  set4 = round(runif(n = 10, min = 0, max = 1)),
                  set5 = round(runif(n = 10, min = 0, max = 1)),
                  set6 = round(runif(n = 10, min = 0, max = 1)),
                  set7 = round(runif(n = 10, min = 0, max = 1)))
res = apply(combn(1:ncol(data), 2), 2, function(x) sum(data[, x[1]] & data[, x[2]]))
test <- matrix(0, nrow = 7, ncol = 7)
# combn() lists the pairs as (1,2), (1,3), ..., (1,7), (2,3), ... which matches the
# column-major order of the *lower* triangle (not the upper one), so fill that and transpose:
test[lower.tri(test)] <- res
test <- t(test)
> test
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 5 3 2 2 4 2 2
[2,] 0 5 5 3 4 5 4
[3,] 0 0 6 3 1 0 5
[4,] 0 0 0 8 3 3 1
[5,] 0 0 0 0 2 2 2
[6,] 0 0 0 0 0 6 3
[7,] 0 0 0 0 0 0 6
The first part produces some test data.
combn(1:ncol(data), 2) tells apply to run a function on each combination of 2 columns. The & function then returns TRUE for the entries of data[, x[1]] and data[, x[2]] (the 2 selected columns) where both values are 1, and the sum counts these. As a return you get the desired values. The following lines construct a matrix as you wanted.
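For reference, combn lays the pairs out as the columns of a matrix; a tiny 4-column case makes the order visible:

```r
# combn returns one column per pair of indices,
# in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
combn(1:4, 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4
```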
Please note that with the addition of
res2 = apply(combn(1:ncol(data), 1), 2, function(x) sum(data[, x[1]]))
test[cbind(1:7,1:7)] <- res2
you can also fill the diagonal (note these count every 1 in each column, not only the strictly-one-race answers). Also, this approach only handles respondents who answered 1 in exactly 2 columns. It won't find those who are, say, Asian, Hispanic and African American all at once, but you can compute that with a slight change, using combinations of 3 columns:
apply(combn(1:ncol(data), 3), 2, function(x) sum(data[, x[1]] & data[, x[2]] & data[, x[3]]))
Please also note that my random test data may not be realistic.
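As an aside (a sketch of mine, not from the answer above): when the seven columns are 0/1 indicators, the whole pairwise table can be had in one step with crossprod(), since t(M) %*% M counts, for each pair of columns, the rows where both are 1. Its diagonal counts everyone who ticked a given race at all, so the strictly-one-race diagonal the question asks for still needs its own computation:

```r
set.seed(1)
M <- matrix(rbinom(70, 1, 0.5), ncol = 7)       # stand-in for the 0/1 race columns
co <- crossprod(M)                               # co[i, j] = rows where columns i and j are both 1
diag(co) <- colSums(M == 1 & rowSums(M) == 1)    # replace diagonal with strictly-one-race counts
```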
I want to write a function to select i and j from the number of columns in ret_Nifty, where the sum of i and j should be 5.
optimize.portfolio(R = ret_Nifty[, i:j], portfolio = ObjSpec,
xxxxxx = "xxx",
search_size = 2000, trace = TRUE,
traceDE = 5, itermax = 50)
But I’m not able to write a function that selects both variables.
Maybe something like this: calculate which pairs of column numbers add to 5, out of every combination:
get_ij <- function(df, adds_to){
  max_col <- ncol(df)
  x <- t(combn(1:max_col, 2))
  x[rowSums(x) == adds_to, ]
}
get_ij(mtcars, 5)
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 3
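Just as a sketch of an alternative (mine, not from the answer above): the same pairs can be read off with outer() and which(..., arr.ind = TRUE):

```r
# s[i, j] = i + j over all column-index pairs of mtcars (11 columns)
s  <- outer(seq_len(ncol(mtcars)), seq_len(ncol(mtcars)), `+`)
# keep only pairs with i < j whose indices sum to 5; here the rows of ij are (2, 3) and (1, 4)
ij <- which(s == 5 & upper.tri(s), arr.ind = TRUE)
```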
I have a large data matrix with many numeric values (counts) in it. I would like to remove 10% of all counts. So, for example, a matrix which looks like this:
30 10
0 20
The sum of all counts here is 60. 10% of 60 is 6. So I want to randomly remove 6. A correct output could be:
29 6
0 19
(As you can see it removed 1 from 30, 4 from 10 and 1 from 20). There cannot be negative values.
How could I program this in R?
Here is a way. It subtracts 1 from positive matrix elements until the requested total has been removed.
subtract_int <- function(X, n){
  inx <- which(X != 0, arr.ind = TRUE)
  N <- nrow(inx)
  while(n > 0){
    i <- sample(N, 1)
    if(X[ inx[i, , drop = FALSE] ] > 0){
      X[ inx[i, , drop = FALSE] ] <- X[ inx[i, , drop = FALSE] ] - 1
      n <- n - 1
    }
    if(any(X[inx] == 0)){
      inx <- which(X != 0, arr.ind = TRUE)
      N <- nrow(inx)
    }
  }
  X
}
set.seed(2021)
to_remove <- round(sum(A)*0.10)
subtract_int(A, to_remove)
# [,1] [,2]
#[1,] 30 6
#[2,] 0 18
Data
A <- structure(c(30, 0, 10, 20), .Dim = c(2L, 2L))
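A different sketch (my own suggestion, not part of the answer above; remove_units is a hypothetical helper): expand the matrix into individual unit counts, sample exactly n of them without replacement, and tabulate what was removed. This removes exactly n and can never produce a negative cell:

```r
# hypothetical helper: remove exactly n unit counts, never going negative
remove_units <- function(X, n) {
  units <- rep(seq_along(X), times = X)   # one entry per unit count, labelled by cell index
  drop  <- sample(units, n)               # draw n units without replacement
  X - array(tabulate(drop, nbins = length(X)), dim = dim(X))
}

A <- structure(c(30, 0, 10, 20), .Dim = c(2L, 2L))
remove_units(A, round(sum(A) * 0.10))     # removes exactly 6; the total drops from 60 to 54
```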
Maybe this helps you at least to get on the right track. It's nothing more than a draft though:
randomlyRemove <- function(mat, prop = 0.1) {
  remaining <- round(sum(mat) * prop)            # total count left to remove
  while (remaining > 0) {
    x <- sample(nrow(mat), 1)                    # pick a random cell
    y <- sample(ncol(mat), 1)
    dec <- min(sample(remaining, 1), mat[x, y])  # random chunk, capped at the cell's value
    mat[x, y] <- mat[x, y] - dec
    remaining <- remaining - dec
  }
  return(mat)
}
You might want to play with the random-number generation to get more evenly distributed subtractions.
I think we can make it work using sample. This solution is a lot more compact.
The data
A <- structure(c(30, 0, 11, 20), .Dim = c(2L, 2L))
sum(A)
#> [1] 61
The logic
UseThese <- which(A > 0)                          # indices that can be modified because > 0
Sample <- sample(UseThese, round(sum(A) * 0.1), replace = TRUE)  # draw a sample of indices
tab <- table(factor(Sample, levels = UseThese))   # count hits per index, keeping zero-hit cells
A[UseThese] <- A[UseThese] - as.vector(tab)       # subtract, handling repeated indices in the sample
Check the result
A
#> [,1] [,2]
#> [1,] 28 8
#> [2,] 0 19
sum(A) # should be the value above minus 6
#> [1] 55
One disadvantage of this solution is that it could lead to negative
values. So check with:
any(A < 0)
#> [1] FALSE
I have a large data frame for which a for loop takes too long to compute. I've tried removing all computations just to time the bare loop, but the code is still inefficient. I'm new to R, but I think there should be a better way of writing my for loop.
If you could provide some guidance it would be appreciated.
My dataFrame has 2,772,807 obs of 6 variables.
Simplified code (Still takes long):
library("tictoc")
library("readr")  # for read_delim()
tic()
dataFlights <- read_delim("U.S._DOT_O&D_Monthly_Traffic_Report.tsv",
                          "\t", escape_double = FALSE, trim_ws = TRUE)
dataFlights["Connections"] = ""
pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)
for (row in 1:nrow(dataFlights)) {
  dataFlights[row, 7] <- 1
  setTxtProgressBar(pb, row)
}
close(pb)
toc()
Original Code:
#Reads DOT public flight information for 2017 & 2018,
#and computes the number of connections
#per route (Cp#1 or Cp#2) into a new column. Possible results 0,1, or 2 connections.
library("tictoc")
library("readr")  # for read_delim()
tic()
dataFlights <- read_delim("U.S._DOT_O&D_Monthly_Traffic_Report.tsv",
                          "\t", escape_double = FALSE, trim_ws = TRUE)
dataFlights["Connections"] = ""
pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)
for (row in 1:nrow(dataFlights)) {
  if (is.na(dataFlights[row, 2]) & is.na(dataFlights[row, 3])) {
    dataFlights[row, 7] <- 0
  } else if (is.na(dataFlights[row, 2]) | is.na(dataFlights[row, 3])) {
    dataFlights[row, 7] <- 1
  } else {
    dataFlights[row, 7] <- 2
  }
  setTxtProgressBar(pb, row)
}
close(pb)
toc()
As indicated in the comments, this can be done effortlessly with ifelse:
# data
set.seed(123)
n <- 1e+6
dataFlights <- data.frame(x1 = runif(n),
x2 = sample(c(runif(n/2), rep(NA, n/2)), n),
x3 = sample(c(runif(n/2), rep(NA, n/2)), n),
stringsAsFactors = FALSE
)
# conditions
na_2 <- is.na(.subset2(dataFlights, 2))
na_3 <- is.na(.subset2(dataFlights, 3))
na_sum <- na_2 + na_3
# ifelse
dataFlights$x4 <- ifelse(na_sum == 2, 0, ifelse(na_sum == 1, 1, 2))
head(dataFlights)
# x1 x2 x3 x4
# 1 0.2875775 NA NA 0
# 2 0.7883051 0.4415287 NA 1
# 3 0.4089769 NA 0.3130298 1
# 4 0.8830174 0.3077688 NA 1
# 5 0.9404673 NA NA 0
# 6 0.0455565 0.5718788 NA 1
where for simplicity I set column 4 as opposed to column 7.
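Since na_sum already counts how many of the two columns are NA, the nested ifelse can even be dropped: the desired value is just 2 minus that count. A minimal self-contained sketch (toy data of my own):

```r
# equivalent, without ifelse: 0 NAs -> 2, one NA -> 1, both NA -> 0
df <- data.frame(x2 = c(NA, 0.4, NA, 0.3), x3 = c(NA, NA, 0.3, 0.1))
na_sum <- is.na(df$x2) + is.na(df$x3)
df$x4 <- 2 - na_sum
df$x4
# [1] 0 1 1 2
```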
A few suggestions:
dataFlights["Connections"] = ""
Here, if you use NA instead of "", it keeps the object smaller. For comparison, I created a 3,000,000 x 3 matrix to compare sizes: with only one column different, the version filled with "" was 268Mb while the one with NA was only about 60Mb. The smaller the object, the faster it is to index.
pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)
for (row in 1:nrow(dataFlights)) {
dataFlights[row,7] <- 1
setTxtProgressBar(pb, row)
}
In each iteration you are assigning 1 to a single matrix / data.frame cell, and that per-cell assignment is a computationally expensive step. For your example this can be completely vectorized. Here are a few ways to compute the 7th column and replace your for loop:
rowSums
col7.rowSums = rowSums(!is.na(dataFlights[, 2:3]))
sapply
col7.sapply = sapply(1:nrow(dataFlights), function(x) sum(!is.na(dataFlights[x, 2:3])))
apply
col7.apply = apply(!is.na(dataFlights[, 2:3]), 1, sum)
Microbenchmark
Unit: microseconds
expr min lq mean median uq max neval
for.loop 52604.86 56768.5590 58810.55595 58137.651 60064.056 81958.717 100
rowSums 35.87 49.2225 61.23889 53.845 72.010 139.409 100
sapply 49756.32 53131.1065 55778.95541 54414.455 56154.496 102558.473 100
apply 997.21 1060.5380 1225.48577 1135.066 1254.936 3864.779 100
I have a list of vectors, stored as follows:
library(seqinr)
mydata <- read.fasta(file = "mydata.fasta")
mydatavec <- mydata[[1]]
lst <- split(mydatavec, as.integer(gl(length(mydatavec), 100,length(mydatavec))))
df <- data.frame(matrix(unlist(lst), nrow=2057, byrow=T), stringsAsFactors=FALSE)
Now, each vector in df is 100 long and made up of the letters "a", "c", "g", "t". I would like to calculate the Shannon entropy of each of these vectors; I will give an example of what I mean:
v1 <- count(df[1,], 1)
a c g t
27 26 24 23
v2 <- v1/sum(v1)
a c g t
0.27 0.26 0.24 0.23
v3 <- -sum(log(v2)*v2) ; print(v3)
[1] 1.384293
In total I need 2057 printed values, because that is how many vectors I have. My question: is it possible to create a for loop or repeat loop that would do this operation for me? I tried myself but didn't get anywhere.
dput(head(sequence))
structure(c("function (nvec) ", "unlist(lapply(nvec, seq_len))"
), .Dim = c(2L, 1L), .Dimnames = list(c("1", "2"), ""), class = "noquote")
My attempt: I wanted to focus on the count function only and created this
A <- matrix(0, 2, 4)
for (i in 1:2) {
A[i] <- count(df[i,], 1)
}
The loop correctly calculates the number of "a" in the first vector and then moves on to the second one, but it completely ignores the rest of the letters:
A
[,1] [,2] [,3] [,4]
[1,] 27 0 0 0
[2,] 28 0 0 0
Additionally, I naively thought that adding a bunch of "i" indices everywhere would make it work:
s <- matrix(0, 1, 4)
s1 <- matrix(0, 1, 4)
s2 <- numeric(4)
for (i in 1:2) {
s[i] <- count(df[i,],1)
s1[i] <- s[i]/sum(s[i])
s2[i] <- -sum(log(s1[i])*s1[i])
}
But that didn't get me anywhere either.
If you don't need to save the counts and only need to print or save the calculation you show, this should work:
entropy <- numeric(nrow(df))  # create this vector first if you want to save the values
for (i in 1:nrow(df)) {
  v1 <- count(df[i, ], 1)
  v2 <- v1 / sum(v1)
  v3 <- -sum(log(v2) * v2)
  print(v3)         # to print
  entropy[i] <- v3  # to save the value in the vector
}
The problem with the loop that you show may be that the output of count is a table with 1 row and 4 columns, which you assign to a single matrix element. Another possible problem is the assignment itself: you write s[i] <- count(df[i,],1) when it should be s[i,] <- count(df[i,],1).
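If looping row by row with seqinr's count() proves awkward, here is a base-R sketch of my own (not from the answers above) that computes every row's entropy at once with apply() and table(); table() simply drops letters that never occur, so log(0) never arises. The toy df below stands in for the real 2057-row one:

```r
# Shannon entropy of one vector of letters
shannon <- function(v) {
  p <- table(v) / length(v)  # relative letter frequencies (absent letters are dropped)
  -sum(p * log(p))
}

set.seed(42)
# toy stand-in for df: 3 rows of 100 letters each
df <- data.frame(matrix(sample(c("a", "c", "g", "t"), 300, replace = TRUE), nrow = 3),
                 stringsAsFactors = FALSE)
entropy <- apply(df, 1, shannon)  # one entropy value per row
```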
Would this work for you:
df <- data.frame (x = c("a","c","g","g","g"),
y = c("g","c","a","a","g"),
z = c("g","t","t","a","g"),stringsAsFactors=FALSE)
A <- sapply(1:nrow(df), FUN=function(i){count(df[i,],1)})
> A
[,1] [,2] [,3] [,4] [,5]
a 1 0 1 2 0
c 0 2 0 0 0
g 2 0 1 1 3
t 0 1 1 0 0
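As a further sketch (my own extension, not in the answer): once you have a counts matrix like A above (rows = letters, columns = vectors), the per-vector entropies follow from column-wise proportions:

```r
# entropy per column of a counts matrix; the small A here mirrors the answer's example shape
A <- matrix(c(1, 0, 2, 0,
              0, 2, 0, 1,
              1, 0, 1, 1), nrow = 4)
P <- sweep(A, 2, colSums(A), "/")                                   # proportions within each column
entropy <- apply(P, 2, function(p) -sum(p[p > 0] * log(p[p > 0])))  # skip zeros to avoid log(0)
```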
I would like to make a new matrix from another matrix, but only with the rows which do not contain 0. How can I do that?
Here is a more vectorized way.
x <- matrix(c(0,0,0,1,1,0,1,1,1,1), ncol = 2, byrow = TRUE)
x[rowSums(x==0)==0,]
I found that it can be done very simply:
x <- matrix(c(0,0,0,1,1,0,1,1,1,1), ncol = 2, byrow = TRUE)
y <- cbind(x[which(x[, 1] * x[, 2] > 0), 1:2])
I am only piecing together the great suggestions others have already given. I like the ability to store this as a function and generalize to values besides 0, including categorical values (it also selects positively or negatively via the select argument):
v.omit <- function(dataframe, v = 0, select = "neg") {
  switch(select,
         neg = dataframe[apply(dataframe, 1, function(y) !any(y %in% v)), ],
         pos = dataframe[apply(dataframe, 1, function(y) any(y %in% v)), ])
}
Let's try it.
x <- matrix(c(0,0,0,1,1,0,1,1,1,1,NA,1), ncol = 2, byrow = TRUE)
v.omit(x)
v.omit(mtcars, 0)
v.omit(mtcars, 1)
v.omit(CO2, "chilled")
v.omit(mtcars, c(4,3))
v.omit(CO2, c('Quebec', 'chilled'))
v.omit(x, select="pos")
v.omit(CO2, c('Quebec', 'chilled'), select="pos")
v.omit(x, NA)
v.omit(x, c(0, NA))
Please do not mark my answer as the correct one as others have answered before me, this is just to extend the conversation. Thanks for the code and the question.
I'm sure there are better ways, but here's one approach: use apply() and the all() function to create a boolean vector to index into the matrix of interest.
x <- matrix(c(0,0,0,1,1,0,1,1,1,1), ncol = 2, byrow = TRUE)
x
> x
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 1 0
[4,] 1 1
[5,] 1 1
> x[apply(x, 1, function(y) all(y > 0)) ,]
[,1] [,2]
[1,] 1 1
[2,] 1 1