Find k highest values in each column and compute mean in R

I am trying to calculate the average of the top k values in each column of a data frame. I have a solution, but it is slow and ham-fisted. Here is what I came up with:
k <- 100
# give the matrix rownames so it can be indexed by column name below
predictMat <- matrix(0, nrow = length(colnames(DT)), ncol = 1,
                     dimnames = list(colnames(DT), NULL))
for (i in colnames(DT)) {
  u <- data.frame(DT[, i, drop = FALSE])
  sortU1 <- data.frame(u[order(u[, i], decreasing = TRUE), , drop = FALSE])
  u1Neighbors <- data.matrix(sortU1[1:k, , drop = FALSE])
  predictMat[i, 1] <- mean(u1Neighbors, na.rm = TRUE)
}

You can do this in one line using the apply function:
# Sample data frame
set.seed(144)
DT <- matrix(rnorm(1000), nrow=100)
k <- 10
# Compute average of 10 largest values in each column
apply(DT, 2, function(x) mean(tail(sort(x), k)))
# [1] 1.721765 1.658917 1.630231 1.558280 1.606363 1.526322 1.810814 1.678135
# [9] 1.541305 1.621984
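Note that sort() drops NA values by default, so the na.rm = TRUE from your original code isn't needed here.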

You could do this with back-to-back apply calls:
set.seed(100)
x <- as.data.frame(matrix(runif(5000, 0, 10), nrow = 1000, ncol = 5))
x1 <- apply(x, 2, sort, decreasing = TRUE)
apply(x1[1:100, ], 2, mean)
#       V1       V2       V3       V4       V5
# 9.548000 9.572912 9.422325 9.547370 9.462894
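Since x1 is a matrix, the second apply can also be replaced with colMeans, which is shorter and faster:
colMeans(x1[1:100, ])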
edit: looks like I was a few seconds behind in my answer!

Related

Sample rows from a dataframe using a for loop in R

There is a df given with nrow=600 and ncol=18.
Now I need to draw 10000 samples from each of these columns, with replacement.
According to the specifications, first I need to create an empty matrix:
df1 <- as.data.frame(matrix(NA,nrow = 10000,ncol=18))
Now I want to use a for loop to do all the samples (one for each column) at once:
for (i in 1:18){
df1[1:10000, i) <- sample(df[,i], 10 000, replace=true)
When I run this code, my df1 is still empty.
Can anyone help?
Many thanks in advance
There are several syntax issues in your code: df1[1:10000, i) closes with ) instead of ], 10 000 contains a space, replace=true should be replace = TRUE (R is case-sensitive), and the loop is missing its closing }. Try the following:
df1 <- as.data.frame(matrix(NA, nrow = 10000, ncol = 18))
for (i in 1:18) {
  df1[, i] <- sample(df[, i], 10000, replace = TRUE)
}
Without an explicit for loop, you can also use sapply/lapply:
# With `sapply`
df1 <- as.data.frame(sapply(df, sample, 10000, replace = TRUE))
# Using `lapply`
df1 <- do.call(cbind.data.frame, lapply(df, sample, 10000, replace = TRUE))
It works for the data shared in the comments:
df <- data.frame(V1, V2, V3)
df1 <- as.data.frame(matrix(NA, nrow = 10000, ncol = 3))
for (i in 1:3) {
  df1[, i] <- sample(df[, i], 10000, replace = TRUE)
}
dim(df1)
#[1] 10000 3
head(df1)
# V1 V2 V3
#1 0.02527926 0.039423826 0.097738594
#2 0.03391239 0.039423826 0.036153091
#3 0.03919354 -0.004922473 0.097738594
#4 -0.06703827 0.039423826 0.097738594
#5 0.02168909 0.048176052 0.036153091
#6 0.02527926 0.074435079 -0.009444024

Function in R that performs multiple operations over columns of two datasets

I have two datasets, each with 5 columns and 10,000 rows. I want to calculate y from values in corresponding columns of the two datasets: column 1 in dataset 1 with column 1 in dataset 2, then column 2 in dataset 1 with column 2 in dataset 2, and so on. The y nonetheless needs to follow a set of rules before being calculated. What I did so far doesn't work, and I cannot figure out why, or whether there is an easier way to do all of this.
Create data from t-distributions
mx20 <- as.data.frame(replicate(10000, rt(20,19)))
mx20.50 <- as.data.frame(replicate(10000, rt(20,19)+0.5))
Calculate the mean for each simulated sample
m20 <- apply(mx20, FUN=mean, MARGIN=2)
m20.05 <- apply(mx20.50, FUN=mean, MARGIN=2)
Steps 1 and 2 above are repeated for the remaining sample sizes, drawing from rt(30,29), rt(50,49), rt(100,99), and rt(1000,999), for five sample sizes in total
Bind tables (create data.frame) for each t-distribution specification
tbl <- cbind(m20, m30, m50, m100, m1000)
tbl.50 <- cbind(m20.05, m30.05, m50.05, m100.05, m1000.05)
Finally, I want to calculate the y as specified above. But here is where I get totally lost. Please see below my best attempt so far.
y = (mtheo-m0)/(m1-m0), where y = 0 when m1 < m0 and y = y when m1 >= m0. mtheo is a constant (e.g. 0.50), m1 is value in column 1 of tbl and m0 is value in column 1 of tbl.50.
ycalc <- function(mtheo, m1, m0) {
ifelse(m1>=m0) {
y = (mteo-m0)/(m1-m0)
} ifelse(m1<m0) {
y=0
} returnValue(y)
}
You can try this. I used data frames instead of data tables.
This code is more versatile. You can add or remove parameters. Below are the parameters that you can use to create t distributions.
params = data.frame(
  n = c(20, 30, 50, 100, 1000),
  df = c(19, 29, 49, 99, 999)
)
And here is a loop that creates the values you need for each t distribution. You can ignore this part if you already have those values (or code to create those values).
tbl = data.frame(i = c(1:10000))
tbl.50 = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
  mx = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])))
  m <- apply(mx, FUN = mean, MARGIN = 2)
  tbl = cbind(tbl, m)
  names(tbl)[ncol(tbl)] = paste("m", params[i, 1], sep = "")
  mx.50 = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2]) + .5))
  m.50 <- apply(mx.50, FUN = mean, MARGIN = 2)
  tbl.50 = cbind(tbl.50, m.50)
  names(tbl.50)[ncol(tbl.50)] = paste("m", params[i, 1], ".50", sep = "")
}
tbl = tbl[-1]
tbl.50 = tbl.50[-1]
And here is the loop that does the calculations. I save them in a data frame (y). Each column in this data frame is the result of your function applied for all rows.
mtheo = .50
y = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
  y$dum = 0
  idx = which(tbl[, i] >= tbl.50[, i])
  y[idx, ]$dum =
    (mtheo - tbl.50[idx, i]) /
    (tbl[idx, i] - tbl.50[idx, i])
  names(y)[ncol(y)] = paste("y", params[i, 1], sep = "")
}
y = y[-1]
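As a side note, the inner bookkeeping can be avoided by operating on whole tables at once; a minimal sketch, assuming tbl and tbl.50 are the data frames built above (y2 is a hypothetical name):
# element-wise rule: y = (mtheo - m0)/(m1 - m0), forced to 0 wherever m1 < m0
y2 <- (mtheo - tbl.50) / (tbl - tbl.50)
y2[tbl < tbl.50] <- 0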
You could try this, if the first column in tbl is called m1 and the first column in tbl.50 is called m0:
mteo <- 0.5
ycalc <- ifelse(tbl$m1 >= tbl.50$m0, (mteo - tbl.50$m0) / (tbl$m1 - tbl.50$m0),
                ifelse(tbl$m1 < tbl.50$m0, 0, "no"))
Using the same column names provided by your code, and transforming your matrices into dataframes:
tbl <- data.frame(tbl)
tbl.50 <- data.frame(tbl.50)
mteo <- 0.5
ycalc <- ifelse(tbl$m20 >= tbl.50$m20.05, (mteo - tbl.50$m20.05)/(tbl$m20 - tbl.50$m20.05),
ifelse(tbl$m20 < tbl.50$m20.05, "0", "no"))
This results in:
head(ycalc)
[1] "9.22491706576716" "0" "0" "0" "0" "1.77027049630147"

How to calculate Mode (Statistics) for a set of every 10 numbers in a large data set

For example, if I have 1223455567 1777666666, I want the output to be 5 and 6.
How can I do this in R?
I know how to find the mean for every 10 data points, but what I want is the mode.
Here is what I tried for the mean:
mean10 <- aggregate(level, list(rep(1:(nrow(level) %/% n + 1), each = n, len = nrow(level))), mean)[-1]
and there is a function for the mode as follows:
MODE <- function(dataframe){
  DF <- as.data.frame(dataframe)
  MODE2 <- function(x){
    if (is.numeric(x) == FALSE){
      df <- as.data.frame(table(x))
      df <- df[order(df$Freq), ]
      m <- max(df$Freq)
      MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1]))
      if (sum(df$Freq)/length(df$Freq) == 1){
        warning("No Mode: Frequency of all values is 1", call. = FALSE)
      }else{
        return(MODE1)
      }
    }else{
      df <- as.data.frame(table(x))
      df <- df[order(df$Freq), ]
      m <- max(df$Freq)
      MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))
      if (sum(df$Freq)/length(df$Freq) == 1){
        warning("No Mode: Frequency of all values is 1", call. = FALSE)
      }else{
        return(MODE1)
      }
    }
  }
  return(as.vector(lapply(DF, MODE2)))
}
This should work:
Mode <- function(x) {
  y <- unique(x)
  y[which.max(tabulate(match(x, y)))]
}
library(zoo)
x <- c(1,2,2,3,4,5,5,5,6,7,1,7,7,7,6,6,6,6,6,6)
rollapply(data = x, width = 10, FUN = Mode, by = 10)
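On the example vector this returns 5 for the first block of ten values and 6 for the second.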
Given you're not after a rolling mode but really a group mode, none of the other answers are accurate. It's actually much easier to do this in the case you have in mind; I'll use data.table.
#fixed cost: set-up of 'data.table'
library(data.table)
setDT(DF)
Now solving:
#this works on a single column;
# the rep(...) bit is about creating the
# sequence (1, ..., 1, 2, ..., 2, ...)
# of integers each repeated 10 times.
# Here, .N will give the frequency -- i.e.,
# this first step is basically running 'table' for every 10 rows
DF[ , .N, by = .(col1, grp = rep(1:(.N %/% 10 + 1), length.out = .N))
#by going in descending order on frequency, we can simply
# extract the first element of each 'grp' to get the mode.
# (this glosses over the issue of ties, but you haven't given
# any guidance to that end)
][order(-N), .SD[1L], by = grp]
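To sanity-check the chain, here it is run on the vector from the previous answer, wrapped into a one-column table (DF and col1 are the names assumed above):
library(data.table)
DF <- data.table(col1 = c(1,2,2,3,4,5,5,5,6,7,1,7,7,7,6,6,6,6,6,6))
DF[ , .N, by = .(col1, grp = rep(1:(.N %/% 10 + 1), length.out = .N))
   ][order(-N), .SD[1L], by = grp]
#    grp col1 N
# 1:   2    6 6
# 2:   1    5 3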
You can use the zoo package to calculate a moving mode:
library(zoo)
# sample data
d <- data.frame(x = sample(1:3, 100, T))
# mode function (handles ties by choosing the first;
# take the name of the max table entry, not its index)
my_mode <- function(x) as.numeric(names(which.max(table(x))))
# add moving mode as new variable
transform(d, moving_mode = rollapply(x, 10, FUN = my_mode, fill = NA))
You can always split the number into characters and see which one is the most frequent in a table, e.g.
> which.max(table(strsplit(as.character(1777666666), "")[[1]]))
6
2
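If you'd rather stay in base R, the same group mode can be computed with tapply, reusing the Mode() helper defined in the earlier answer:
x <- c(1,2,2,3,4,5,5,5,6,7,1,7,7,7,6,6,6,6,6,6)
# integer-divide the positions so every block of 10 shares one group id
tapply(x, (seq_along(x) - 1) %/% 10, Mode)
# 0 1
# 5 6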

R: How to write a for loop that reads every two lines in a matrix?

I want to calculate correlation statistics using cor.test(). I have a data matrix where the two pairs to be tested are on consecutive lines (I have more than a thousand pairs, so I also need to correct for that later). I was thinking that I could loop through every two lines in the matrix and perform the test (i.e. first test correlation between row 1 and row 2, then row 3 and row 4, row 5 and row 6, etc.), but I don't know how to write this kind of loop.
This is how I do the test on a single pair:
d = read.table(file="cor-test-sample-data.txt", header=T, sep="\t", row.names = 1)
d = as.matrix(d)
cor.test(d[1,], d[2,], method = "spearman")
You could try
res <- lapply(split(seq_len(nrow(mat1)), (seq_len(nrow(mat1)) - 1) %/% 2 + 1),
              function(i) {
                m1 <- mat1[i, ]
                if (NROW(m1) == 2) {
                  cor.test(m1[1, ], m1[2, ], method = "spearman")
                } else {
                  NA
                }
              })
To get the p-values
resP <- sapply(res, function(x) x$p.value)
indx <- t(`dim<-`(seq_len(nrow(mat1)), c(2, nrow(mat1)/2)))
names(resP) <- paste(indx[,1], indx[,2], sep="_")
resP
# 1_2 3_4 5_6 7_8 9_10 11_12 13_14
#0.89726818 0.45191660 0.14106085 0.82532260 0.54262680 0.25384239 0.89726815
# 15_16 17_18 19_20 21_22 23_24 25_26 27_28
#0.02270217 0.16840791 0.45563229 0.28533447 0.53088721 0.23453161 0.79235990
# 29_30 31_32
#0.01345768 0.01611903
Or using mapply (assuming that the number of rows is even):
ind <- seq(1, nrow(mat1), by = 2) # similar to the one used by @CathG in the for loop
mapply(function(i, j) cor.test(mat1[i, ], mat1[j, ],
                               method = "spearman")$p.value, ind, ind + 1)
data
set.seed(25)
mat1 <- matrix(sample(0:100, 20*32, replace=TRUE), ncol=20)
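Since the question mentions correcting for the large number of pairs later, the collected p-values can be adjusted directly with p.adjust, e.g. using the Benjamini-Hochberg method:
resP.adj <- p.adjust(resP, method = "BH")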
Try
d <- matrix(rep(1:9, 3), ncol = 3, byrow = TRUE)
sapply(2 * (1:(nrow(d) %/% 2)), function(pair)
  unname(cor.test(d[pair - 1, ], d[pair, ], method = "spearman")$estimate))
pvalues <- c()
# stop at nrow(d) - 1 so d[i + 1, ] never runs past the last row
for (i in seq(1, nrow(d) - 1, by = 2)) {
  pvalues <- c(pvalues, cor.test(d[i, ], d[i + 1, ], method = "spearman")$p.value)
}
names(pvalues) <- paste(row.names(d)[seq(1, nrow(d) - 1, by = 2)],
                        row.names(d)[seq(2, nrow(d), by = 2)], sep = "_")

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample size for each correlation is likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
Like this?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
You can also get the degrees of freedom like this:
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you show the code for how you are estimating the correlation, so you can get exact help.
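For a single pair of variables, complete.cases() gives the usable sample size in one step:
# n used by cor.test(x, y): rows where both are non-NA
sum(complete.cases(x, y))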
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
  u <- if (is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  h <- expand.grid(x = u, y = u)
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  h$n <- mapply(f, h[, 1], h[, 2])
  h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() that result. On my Databricks cluster it cut the compute time for a 1865-row x 69-column matrix from 2.5-3 minutes down to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
  cols <- if (is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
  rownames(nn) <- colnames(nn) <- cols
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  for (i in 1:nrow(nn))
    for (j in 1:ncol(nn))
      nn[i, j] <- f(rownames(nn)[i], colnames(nn)[j])
  nn
}
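For reference, the whole n x n matrix of pairwise complete counts can also be computed in a single matrix product, since crossprod() on the non-missing indicator counts, for every pair of columns, the rows where both entries are present. A sketch using the dd matrix from above (pairwise_counts is a hypothetical name):
# entry (i, j) = number of rows where columns i and j are both non-NA
pairwise_counts <- crossprod(!is.na(dd))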
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?
