Deleting x rows every y rows in a dataframe in R - r

How can I remove (for example) 3 rows every 10 rows?
For example, if I have a dataframe with 100 rows, I need to end up with a dataframe of 70 rows (missing the first, second, third, eleventh, twelfth, thirteenth rows, and so on).

Using a toy dataframe with 100 lines, try this:
df <- data.frame(x = 1:100, y = 1:100)
rem <- as.vector(sapply(1:3, function(i) seq(i, nrow(df), 10)))
df[-rem, ]

We can use outer to create a sequence of the rows to remove.
result <- df[-c(outer(seq(1, nrow(df), 10), 0:2, `+`)), ]

We could do this with rep in a vectorized way
i1 <- seq(1, nrow(df), 10)
out <- df[-(rep(0:2, each = length(i1)) + i1),]
data
df <- data.frame(x = 1:100, y = 1:100)
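The answers above all build the vector of row numbers to remove. As an alternative sketch, the same selection can be written with modular arithmetic, which also copes with a row count that is not a multiple of the block size. The helper name drop_block_rows and its arguments are made up for this illustration.

```r
df <- data.frame(x = 1:100, y = 1:100)

# drop_block_rows is a hypothetical helper name used only for this sketch
drop_block_rows <- function(data, n_drop, block) {
  pos <- (seq_len(nrow(data)) - 1) %% block  # 0-based position inside each block
  data[pos >= n_drop, , drop = FALSE]        # keep rows past the first n_drop of a block
}

out <- drop_block_rows(df, n_drop = 3, block = 10)
nrow(out)     # 70
head(out$x)   # 4 5 6 7 8 9
```

Because it keeps rows by a logical mask rather than removing rows by negative index, no index can ever point past the end of the data frame.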

Related

How to quantify the frequency of all possible row combinations of a binary matrix in R in a more efficient way?

Let's assume I have a binary matrix with 24 columns and 5000 rows.
The columns are parameters (P1 - P24) of 5000 subjects. The parameters are binary (0 or 1).
(Note: my real data can contain as many as 40,000 subjects.)
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
Now I would like to determine all possible combinations of the 24 measured parameters:
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
The final question is: how often does each of the possible row combinations from comb appear in matrix m?
I managed to write code for this that adds the counts as a new column in comb. But my code is really slow and would take 328 days to finish, so the code below only considers the first 20 combinations:
comb$count <- 0
for (k in 1:20) { # considers only the first 20 combinations of comb
  for (i in 1:nrow(m)) {
    if (all(m[i, ] == comb[k, 1:24])) {
      comb$count[k] <- comb$count[k] + 1
    }
  }
}
Is there a computationally more efficient way to compute this, so that I can count all combinations in a short time?
Thank you very much for your help in advance.
data.table is fast at this type of operation:
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
library(data.table)
data_t = data.table(m)
ans = data_t[, .N, by = P1:P24]
dim(ans)
head(ans)
The core of the call is by = P1:P24, which groups by all the columns, and .N, which gives the number of records in each group.
I used this as inspiration: How does one aggregate and summarize data quickly?
and the data.table manual: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
If all you need is the combinations that occur in the data and how many times, this will do it:
m2 <- apply(m, 1, paste0, collapse="")
m2.tbl <- xtabs(~m2)
head(m2.tbl)
# m2
# 000000000001000101010010 000000000010001000100100 000000000010001110001100 000000000100001000010111 000000000100010110101010 000000000100101000101100
#                        1                        1                        1                        1                        1                        1
You can use apply to paste the values of each row into a single string and use table to count how often each string occurs.
table(apply(m, 1, paste0, collapse = '-'))
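The answers above count only the combinations that actually occur. If the counts are needed as a column of comb, as in the question, one possible sketch is to turn each row into an integer key and tabulate the keys. It relies on the fact that expand.grid varies its first column fastest, so row 1 + P1*1 + P2*2 + P3*4 + ... of the grid holds that combination. The 3-column matrix m_small and the name comb_small are stand-ins invented for this example.

```r
# Small deterministic stand-in for the 5000 x 24 matrix in the question
m_small <- matrix(c(0, 1, 0,
                    1, 1, 0,
                    0, 1, 0,
                    1, 1, 1),
                  ncol = 3, byrow = TRUE)
k <- ncol(m_small)

# All 2^k combinations, first column varying fastest
comb_small <- expand.grid(rep(list(0:1), k))

# Row key: 1 + P1 * 1 + P2 * 2 + P3 * 4 matches the row number in comb_small
key <- as.vector(m_small %*% 2^(seq_len(k) - 1)) + 1

# Count how many rows of m_small fall on each key, zeros included
comb_small$count <- tabulate(key, nbins = 2^k)
comb_small
```

With 24 columns the same idea needs a count vector of length 2^24 (about 16.8 million integers), which should fit in memory on most machines, but that is worth checking before running it on the real data.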

R - randomly picking columns to sum up row values

I have a dataset with 20 columns and 1000 rows generated using:
sim_data <- do.call(cbind, replicate(20, rexp(1000, 1/120), simplify = FALSE))
How can I pick a random number of columns per row to add up their values, and have a column indicating how many columns were picked?
I have:
picked <- sim_data[sample(nrow(sim_data), 5)]
sim_data$Sum <- sum(picked)
sim_data$Number <- length(picked)
but how do I pick a random size from 1 to 20, instead of "5", and repeat over all rows?
We can use apply:
cbind(sim_data, t(apply(sim_data, 1, function(x) {
  i1 <- sample(seq_along(x), 1)  # how many columns to pick for this row
  out <- sum(sample(x, i1))      # sum of i1 randomly chosen values from the row
  c(Length = i1, Sum = out)
})))
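A reproducible variant of the apply approach, as a sketch only: it samples column positions rather than values, which sidesteps the known behaviour of sample(x, n) when x has length one. The seed, the 5-row matrix sim_small and the helper name pick_random_sum are assumptions made for this example.

```r
set.seed(42)  # assumption: fixed seed, only to make the sketch reproducible
sim_small <- matrix(rexp(5 * 20, 1 / 120), nrow = 5, ncol = 20)

# pick_random_sum is a hypothetical helper name, not from the original answer
pick_random_sum <- function(x) {
  n_pick <- sample.int(length(x), 1)    # random size between 1 and length(x)
  idx <- sample.int(length(x), n_pick)  # column positions, drawn without replacement
  c(Number = n_pick, Sum = sum(x[idx]))
}

# One (Number, Sum) pair per row, bound to the right of the data
res <- cbind(sim_small, t(apply(sim_small, 1, pick_random_sum)))
res[, c("Number", "Sum")]
```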

Vectorization of a nested for-loop that inputs all paired combinations

I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-col. data frame and copies it next to each of the other rows, to form a 6-col. data frame (with all possible combinations). This works fine, but with a medium sized data set (800 rows), the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
x = c(10,20,30,40),
y = c(15,25,35,45),
ID =c(1,2,3,4)
)
compar <- data.frame(matrix(nrow=0, ncol=6)) # to contain all combinations
names(compar) <- c("x","y", "ID", "x","y", "ID")
N <- nrow(Sdat) # how many different points we have
for (i in 1:N) {
  for (j in 1:N) {
    Temp1 <- Sdat[i, ] # data from 1st point
    Temp2 <- Sdat[j, ] # data from 2nd point
    C <- cbind(Temp1, Temp2)
    compar <- rbind(C, compar)
  }
}
These loops provide exactly the output that I need for further analysis. Any suggestion for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
                     y = rep(Sdat$y, each = N),
                     id = rep(Sdat$ID, each = N),
                     x1 = rep(Sdat$x, N),
                     y1 = rep(Sdat$y, N),
                     id_1 = rep(Sdat$ID, N))
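Another base R option, sketched here as an alternative to the answers above: merge with by = NULL returns the Cartesian product of two data frames, so merging Sdat with itself pairs every row with every row. The .x and .y suffixes are merge's defaults for clashing column names; the row order differs from the loop's output.

```r
Sdat <- data.frame(
  x = c(10, 20, 30, 40),
  y = c(15, 25, 35, 45),
  ID = c(1, 2, 3, 4)
)

# by = NULL asks merge for the full cross product instead of a keyed join
compar <- merge(Sdat, Sdat, by = NULL)
dim(compar)     # 16 rows, 6 columns
names(compar)   # x.x, y.x, ID.x, x.y, y.y, ID.y
```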

apply a function with two dataframes as input in r

I want to get the total number of NA mismatches between two dataframes.
I have found the way to get this for two vectors as follows:
compareNA <- function(v1, v2) {
  # TRUE where both values are equal or both are NA
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  n <- 0
  for (i in seq_along(same)) {
    if (!same[i]) {
      n <- n + 1
    }
  }
  return(n)
}
Let's say I have vectors a and b; comparing them gives 2:
a <- c(1,2,NA, 4,5,6,NA,8)
b <- c(NA,2,NA, 4,NA,6,NA,8)
h <- compareNA(a,b)
h
[1] 2
My question is: how to apply this function for dataframes instead of vectors?
Take these dataframes as an example:
a2 <- c(1,2,NA,NA,NA,6,NA,8)
b2 <- c(1,NA,NA,4,NA,6,NA,NA)
df1 <- data.frame(a,b)
df2 <- data.frame(a2,b2)
What I expect as a result is 5, since this is the total number of NAs that appear in df2 but not in df1. Any suggestions on how to make this work?
Here's a second thought. Note that this counts the cells where both data frames hold the same non-NA value.
xy1 <- data.frame(a = c(NA, 2, 3), b = rnorm(3))
xy2 <- data.frame(a = c(NA, 2, 4), b = rnorm(3))
com <- intersect(colnames(xy1), colnames(xy2))
sum(xy1[, com] == xy2[, com], na.rm = TRUE)
If you don't want to worry about column names (but you should), you can make sure the columns align perfectly. In that case, the intersect step is redundant.
sum(xy1 == xy2, na.rm = TRUE)
A third way (assuming dimensions of df1 & df2 are same):
sum(sapply(1:ncol(df1), function(x) compareNA(df1[,x], df2[,x])))
# 5
It would be easier to force both dataframes to have the same column names and compare column by column when those have the same name. You can then simply use a loop over columns and increment a running total by applying your function.
compareNA.df <- function(df1, df2) {
  total <- 0
  common_columns <- intersect(colnames(df1), colnames(df2))
  for (col in common_columns) {
    total <- total + compareNA(df1[[col]], df2[[col]])
  }
  return(total)
}
colnames(df2) <- c("a", "b")
compareNA.df(df1, df2)
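The same comparison can also be done on the whole data frame at once, without looping over columns. The following is a sketch under the assumption that both data frames have identical dimensions and that columns are compared by position, not by name; count_mismatches is a name invented for this example.

```r
a  <- c(1, 2, NA, 4, 5, 6, NA, 8)
b  <- c(NA, 2, NA, 4, NA, 6, NA, 8)
a2 <- c(1, 2, NA, NA, NA, 6, NA, 8)
b2 <- c(1, NA, NA, 4, NA, 6, NA, NA)
df1 <- data.frame(a, b)
df2 <- data.frame(a2, b2)

# count_mismatches is a hypothetical helper; it compares cells by position
count_mismatches <- function(d1, d2) {
  m1 <- as.matrix(d1)
  m2 <- as.matrix(d2)
  stopifnot(identical(dim(m1), dim(m2)))
  same <- (m1 == m2) | (is.na(m1) & is.na(m2))  # equal values, or NA in both
  same[is.na(same)] <- FALSE                    # NA on one side only is a mismatch
  sum(!same)
}

count_mismatches(df1, df2)   # 5
```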

Alternative to for loop in R

I have this script:
library(dplyr)  # provides filter(), used in the loop below
x <- seq(1, 5)
y <- seq(6, 10)
z <- sample(25)
x.range <- range(x)
y.range <- range(y)
df <- expand.grid(x = seq(from = x.range[1], to = x.range[2], by = 1),
                  y = seq(from = y.range[1], to = y.range[2], by = 1))
df$z <- z
x1 <- c(1, 2, 3)
y1 <- c(6, 7, 8)
z1 <- c(10, 12, 13)
df_1 <- data.frame(x1, y1, z1)
n <- length(df_1$x1)
df_pred <- data.frame(0, 0, 0)
names(df_pred)[1:3] <- c("x", "y", "z_pred")
for (i in 1:n) {
  df_pred[i, ] <- filter(df, x == df_1$x1[i], y == df_1$y1[i])
}
sqm <- mean((df_pred[, 3] - df_1[, 3])^2)
I want to calculate the mean quadratic error between the z values of df and the z1 values of df_1. To do this I use a for loop to extract the rows that I need from df, based on the x1 and y1 values of df_1.
Is there an alternative to this for loop that does the same thing (using, for example, the dplyr package)? Thanks.
If you name the columns of df_1 "x", "y" and "z", as in df, then you can use:
df_1 <- data.frame(x = x1, y = y1, z = z1)
library(dplyr)
inner_join(df,df_1,by=c("x","y"))
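I am not sure what your loop is for, but you may want to try this as a replacement. Note that it matches x and y independently, so it keeps every row whose x is in df_1$x1 and whose y is in df_1$y1, not only the paired (x1, y1) rows.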
df_pred <- subset(df, x %in% df_1$x1 & y %in% df_1$y1)
Let me know if it solves your problem
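For completeness, here is a base R sketch of the join-based approach that also carries through to the error itself. It keeps the original column names of df_1 and tells merge which columns correspond; the fixed seed is an assumption added only so the result is reproducible.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- expand.grid(x = 1:5, y = 6:10)
df$z <- sample(25)
df_1 <- data.frame(x1 = c(1, 2, 3), y1 = c(6, 7, 8), z1 = c(10, 12, 13))

# Match rows on the (x, y) pairs, keeping z and z1 side by side
joined <- merge(df, df_1, by.x = c("x", "y"), by.y = c("x1", "y1"))

# Mean of the squared differences over the matched rows
sqm <- mean((joined$z - joined$z1)^2)
sqm
```

Only rows whose (x, y) pair appears in both data frames survive the merge, so a point of df_1 that is missing from df is silently dropped; checking nrow(joined) against nrow(df_1) catches that case.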
