Dynamically append column to dataframe R - r

How can I append a column to a dataframe?
I'm iterating over my data matrix, and if some values meet a threshold I've set, I want to store them in a 1-row dataframe so I can print it at the end of the loop.
My code looks like this:
for (i in 1:nrow(my.data.frame)) {
# Store gene name in a variable and use it as row name for the 1-row dataframe.
gene.symbol <- rownames(my.data.frame)[i]
# init the dataframe to output
gene.matrix.of.truth <- data.frame(matrix(ncol = 0, nrow = 0))
for (j in 1:ncol(my.data.frame)) {
if (my.data.frame[i,j] < threshold) {
str <- paste(colnames(my.data.frame)[j], ';', my.data.frame[i,j], sep='')
# And I want to append this str to the gene.matrix.of.truth
# I tried gene.matrix.of.truth <- cbind(gene.matrix.of.truth, str) But didn't get me anywhere.
}
}
# Ideally I want to print the dataframe here.
# but, no need to print if nothing met my requirements.
if (ncol(gene.matrix.of.truth) != 0) {
write.table(gene.matrix.of.truth, file = paste('out_',gene.symbol,sep=''), row.names = T, col.names = F, sep='|', quote = F)
}
}

I do this sort of thing all the time, but with rows instead of columns. Start with
gene.matrix.of.truth = data.frame(x = character(0))
instead of the gene.matrix.of.truth <- data.frame(matrix(ncol = 0, nrow = 0)) you have at initiation. Your append step inside the for j loop will be
gene.matrix.of.truth = rbind(gene.matrix.of.truth, data.frame(x = str))
(i.e. create a dataframe around str and append it to gene.matrix.of.truth).
Obviously, your final if statement will be if(nrow(...)) instead of if(ncol(...)), and if you want the final table as one big row you'll need t() to transpose your dataframe at printing time.
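Put together, the approach above might look like the following sketch. The toy values in my.data.frame, the threshold, and the results list are assumptions for illustration only. The hits are collected in a list instead of being written to disk, so the block runs without touching any files.

```r
# Toy data: the values and the threshold are assumptions for illustration
my.data.frame <- data.frame(s1 = c(0.01, 0.50), s2 = c(0.02, 0.90),
                            row.names = c("geneA", "geneB"))
threshold <- 0.05

results <- list()  # hypothetical container, one entry per gene with hits
for (i in seq_len(nrow(my.data.frame))) {
  gene.symbol <- rownames(my.data.frame)[i]
  # Start each gene with an empty one-column dataframe
  gene.matrix.of.truth <- data.frame(x = character(0))
  for (j in seq_len(ncol(my.data.frame))) {
    if (my.data.frame[i, j] < threshold) {
      str <- paste(colnames(my.data.frame)[j], ";", my.data.frame[i, j], sep = "")
      # Wrap str in a dataframe and append it as a new row
      gene.matrix.of.truth <- rbind(gene.matrix.of.truth, data.frame(x = str))
    }
  }
  if (nrow(gene.matrix.of.truth) != 0) {
    # Transpose so the hits form a single row named after the gene
    one.row <- t(gene.matrix.of.truth)
    rownames(one.row) <- gene.symbol
    results[[gene.symbol]] <- one.row
  }
}
print(results)
```

Each one.row could be passed to write.table() in place of the list assignment if one file per gene is wanted.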

Related

Using lapply instead of a for loop to add information to an empty dataframe

Like the title says, I wish to use lapply instead of a for loop to parse data from a data frame and put it into an empty data frame. My motivation is that the data frame I'm parsing contains thousands of genes and I've read that the apply functions are faster at iterating through large tables.
### My data table ###
rawCounts <- data.frame(ensembl_gene_id_version = c('ENSG00000000003.15', 'ENSG00000000005.6', 'ENSG00000000419.14'),
HS1 = c(1133, 0, 1392),
HS2 = c(900, 0, 1155),
HS3 = c(1251, 0, 2011),
HS4 = c(785, 0, 1022),
stringsAsFactors = FALSE)
## Function
extract_counts <- function(df, esdbid){
counts <- data.frame()
plyr::ldply(esdbid, function(i) {counts <- df[grep(pattern = i, x = df),] %>% rbind()})
return(counts)
}
## Call the first one
extract_counts(df = rawCounts, esdbid = c('ENSG00000000003.15'))
I want this to return a data frame, so I used the plyr::ldply function from this post - Extracting outputs from lapply to a dataframe
However, it isn't returning anything. Eventually I want to scale up my esdbid vector to include multiple values; such as any combination of gene IDs to quickly retrieve the gene counts.
Strangely, when I run this in the console, it appears to work as intended for a vector of length 1, i.e.:
esdbid <- 'ENSG00000000003.15'
plyr::ldply(esdbid, function(i) {counts <- rawCounts[grep(pattern = i, x = rawCounts),] %>% rbind()})
Returns a data frame with the correct values. However, when I increase the length of the vector it returns only the first value for each row. For example if esdbid <- c('ENSG00000000003.15', 'ENSG00000000005.6', 'ENSG00000000419.14') then the console code will return the values for ENSG00000000003.15 three times.
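Maybe subset can handle this more effectively? Note that grepl should be matched against the ID column, not the whole data frame; otherwise it returns one value per column rather than one per row.
extract_counts <- function(.data, esdbid) {
  # fixed = TRUE treats the ID as a literal string, so "." is not a regex wildcard
  subset(.data, grepl(esdbid, .data$ensembl_gene_id_version, fixed = TRUE))
}
esdbid <- "ENSG00000000003.15"
rawCounts |> extract_counts(esdbid)
Then you can use lapply if you want a list with all dataframe subsets:
lapply(
  unique(rawCounts$ensembl_gene_id_version),
  function(id) { rawCounts |> extract_counts(id) }
)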
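If the goal is a single data frame for any combination of gene IDs, exact matching with %in% may be simpler than pattern matching. This is only a sketch: extract_counts_exact is a hypothetical helper name, and the table is a shortened copy of rawCounts from the question.

```r
# Shortened copy of the question's table (two sample columns only)
rawCounts <- data.frame(
  ensembl_gene_id_version = c("ENSG00000000003.15", "ENSG00000000005.6",
                              "ENSG00000000419.14"),
  HS1 = c(1133, 0, 1392),
  HS2 = c(900, 0, 1155),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows whose ID is exactly one of `ids`
extract_counts_exact <- function(.data, ids) {
  .data[.data$ensembl_gene_id_version %in% ids, , drop = FALSE]
}

hits <- extract_counts_exact(rawCounts,
                             c("ENSG00000000003.15", "ENSG00000000419.14"))
print(hits)
```

Because %in% is vectorized, no lapply or for loop is needed, and the result is already one data frame.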

How to run a function on individual columns instead of the whole data frame?

Hello everyone. I have two data frames and am trying to do bootstrapping with Script1 below, which takes the number of rows from data frames one and two. Instead of taking the row count from the entire data frame, I want to split each column into its own data frame, remove the zero values, take the row count, and then do the bootstrapping. In script2 I create the individual data frames in a for loop, but as I am new to R I am confused about how to add the Script1 logic to it efficiently.
Please advise. Below are Script1, which runs, and script2, where I try to subset each column into an individual data frame.
Script1
set.seed(2)
m1 <- matrix(sample(c(0, 1:10), 100, replace = TRUE), 10)
m2 <- matrix(sample(c(0, 1:5), 50, replace = TRUE), 5)
m1 <- as.data.frame(m1)
m2 <- as.data.frame(m2)
nboot <- 1e3
n_m1 <- nrow(m1); n_m2 <- nrow(m2)
temp<- c()
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
value <- colSums(m2)/colSums(m1[boot,])
temp <- rbind(temp, value)
}
boot_data<- apply(temp, 2, median)
script2
for (i in colnames(m1)){
m1_subset=(m1[m1[[i]] > 0, ])
m1_subset=m1_subset[i]
m2_subset=m2[m2[[i]] >0, ]
m2_subset=m2_subset[i]
num_m1 <- nrow(m1_subset); n_m2 <- nrow(m2_subset)# after this wanted add above script changing input
}
If I understand correctly, you want to do the sampling and calculation on each column individually, after removing the 0 values. I modified your code to work on a single vector instead of a dataframe (i.e., using length() instead of nrow() and sum() instead of colSums()). I also suggest creating the empty matrix for your results ahead of time and filling it in; it will be faster.
temp <- matrix(nrow = nboot, ncol = ncol(m1))
for (i in seq_along(m1)){
m1_subset = m1[m1[,i] > 0, i]
m2_subset = m2[m2[,i] > 0, i]
n_m1 <- length(m1_subset); n_m2 <- length(m2_subset)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
temp[j, i] <- sum(m2_subset)/sum(m1_subset[boot])
}
}
boot_data <- apply(temp, 2, median)
boot_data <- setNames(data.frame(t(boot_data)), names(m1))
boot_data
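The same per-column calculation can also be written without explicit index bookkeeping. The sketch below is one possible variant, not a replacement for the loop above: boot_median is a hypothetical helper name, and it assumes the columns of m1 and m2 are paired by position.

```r
set.seed(2)
m1 <- as.data.frame(matrix(sample(c(0, 1:10), 100, replace = TRUE), 10))
m2 <- as.data.frame(matrix(sample(c(0, 1:5), 50, replace = TRUE), 5))
nboot <- 1e3

# Hypothetical helper: bootstrap median of sum(y) / sum(resampled x)
# for one pair of columns, after dropping the zero values
boot_median <- function(x, y, nboot) {
  x <- x[x > 0]
  y <- y[y > 0]
  ratios <- replicate(
    nboot,
    sum(y) / sum(x[sample.int(length(x), size = length(y), replace = TRUE)])
  )
  median(ratios)
}

# mapply walks over the columns of m1 and m2 in parallel
boot_data <- mapply(boot_median, m1, m2, MoreArgs = list(nboot = nboot))
boot_data
```

The result is a named numeric vector with one bootstrap median per column.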

for loop to make new variables in r

I want to create 9 new variables called bank1, bank2, through bank9. These will be the column names. The values will be a full column of 1 for bank1, 2 for bank2, and so on. I have been reading about loops and have code that does the loop, but I do not know how to store these values. This is what I have so far. I want to add these columns to the Subs dataframe.
set.seed(3)
Subs <- data.frame(value = rnorm(10, 0, 1))
for(i in 1:9){
Subs <- assign(paste("bank", i, sep = ""), i)
}
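The loop above overwrites Subs on every pass: assign() creates a separate variable named bank1, bank2, and so on, and returns the assigned value, so Subs ends up holding the number i instead of a dataframe. A minimal sketch of one way to add the columns, assuming the goal is nine constant columns on Subs:

```r
set.seed(3)
Subs <- data.frame(value = rnorm(10, 0, 1))

for (i in 1:9) {
  # [[<- adds (or replaces) a column by name; the scalar i is recycled
  # to fill every row
  Subs[[paste0("bank", i)]] <- i
}

head(Subs)
```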

R, define a function then apply to a list

I am trying to write a function (I am new to R, and most of my knowledge of R was learned from this website, thanks).
I want to apply my function to a list. The list contains some ".CSV" files.
All CSV files in my folder look like the picture below: the same structure but with different numbers of columns.
I want to:
based on the "Frame" column, delete all the rows containing the word "T",
which leaves me with "110*n1" rows of data.
delete all the columns containing the word "Flag"; they are blank columns.
delete the 1st column, which leaves me with "2*n2" columns.
reshape the multi-column data into 2 columns, which gives "110*n3" rows of data.
repeat "1,2,3,4,...,110" as series numbers n times (n=n3) and bind them as a column.
from "1,2,3,...,n3", repeat each value 110 times and add them as a column.
export the new table as txt files.
Here is what I've done so far:
T_function <- function(x) {
data.df <- read.csv(x, skip = 1,header=TRUE, na.strings=c("NA","NaN", " ","*"),
dec=".", strip.white=TRUE)
filename <- substr(x = x, start = 1, stop = (nchar(x)-4))
data.df <- data.df[!grepl("T", data.df$Frame),]
data.df <- data.df [,-1]
data.df <- data.df [,colSums(is.na(data.df))<nrow(data.df)]
splitter <- function(indf, ncols) {
if (ncol(indf) %% ncols != 0) stop("Not the right number of columns to split")
inds <- split(sequence(ncol(indf)), c(0, sequence(ncol(indf)-1) %/% ncols))
temp <- unlist(lapply(inds, function(x) c(t(indf[x]))), use.names = FALSE)
as.data.frame(matrix(temp, ncol = ncols, byrow = TRUE))
}
out <- splitter(data.df, 2)
list <- 1:110
from <- which(out$V1 == 1)
to <- c((from-1)[-1], nrow(out))
end <- c(to/110)
list2 <- rep(list,length(to/110))
out$Number <- unlist(list2)
out$Number <- as.factor(out$Number)
list3 <- rep(1:end,each=110)
out$slice <- unlist(list3)
out$slice <- as.factor(out$slice)
write.table(x = data.df,
file = paste0(filename, "_analysis.txt"),
sep = ",",quote=F)
}
It seems the function cannot add the correct "out$Number" and "out$slice" columns.
filenames <- list.files(path = "",pattern="csv",full.names = T)
sapply(filenames, FUN = T_function)
I am trying to apply my function to all the files in the list, but apart from the 1st file I can't get the other files to work.
Could anybody help me find and solve the problems?
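One possible way to build the two index columns is with rep(), using times for the repeating 1 to 110 series and each for the slice number. This is only a sketch under assumptions: out is replaced by a toy data frame, its row count is taken to be an exact multiple of 110, and the names rows_per_slice and n_slices are hypothetical.

```r
set.seed(1)
rows_per_slice <- 110  # length of each series, as described in the question
n_slices <- 3          # assumed number of slices in the toy data

# Toy stand-in for the reshaped two-column table `out`
out <- data.frame(V1 = runif(rows_per_slice * n_slices),
                  V2 = runif(rows_per_slice * n_slices))

# Guard against files whose row count is not a multiple of 110
stopifnot(nrow(out) %% rows_per_slice == 0)
n <- nrow(out) / rows_per_slice

# 1..110 repeated n times, then 1..n with each value repeated 110 times
out$Number <- factor(rep(seq_len(rows_per_slice), times = n))
out$slice  <- factor(rep(seq_len(n), each = rows_per_slice))

table(out$slice)
```

Deriving n from nrow(out) inside the function means each file gets its own count, which may matter when the files have different numbers of columns.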

MHSMM package R input data format with multiple variables

My problem is similar to this question: the problem of R-input Format
I have tried the code from the above link and revised some parts to suit my data. My data looks like the following.
I want my data to be created as a data frame with 4 variable vectors. The code I have revised is
formatMhsmm <- function(data){
nb.sequences = nrow(data)
nb.variables = ncol(data)
data_df <- data.frame(matrix(unlist(data), ncol = 4, byrow = TRUE))
# iterate over these in loops
rows <- 1: nb.sequences
# build vector with id value
id = numeric(length = nb.sequences)
for( i in rows)
{
id[i] = data_df[i,2]
}
# build vector with time value
time = numeric (length = nb.sequences)
for( i in rows)
{
time[i] = data_df[i,3]
}
# build vector with observation values
sequences = numeric(length = nb.sequences)
for(i in rows)
{
sequences[i] = data_df[i, 4]
}
data.df = data.frame(id,time,sequences)
# creation of hsmm data object need for training
N <- as.numeric(table(data.df$id))
train <- list(x = data.df$sequences, N = N)
class(train) <- "hsmm.data"
return(train)
}
library(mhsmm)
dataset <- read.csv("location.csv", header = TRUE)
train <- formatMhsmm(dataset)
print(train)
The output observation is not the data of the 4th column; it's a list of (4, 8, 12,...,396, 1, 1, ..., 56, 192,...,6550, 68, NA, NA,...). It has picked up 1/4 of the data of each column. Why is it like this?
Thank you very much!
Why don't you simply count your observations by id and create the hsmm.data object directly? Supposing your dataframe is called "data", we have:
N <- as.numeric(table(data$id))
train <- list(x=data$location, N = N)
class(train) <- "hsmm.data"
Extracted from http://www.jstatsoft.org/v39/i04/paper
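The scrambled output most likely comes from matrix(unlist(data), ncol = 4, byrow = TRUE): unlist() walks the data frame column by column, and refilling row by row then mixes values from different columns. The sketch below illustrates this on toy data. The column names and values are assumptions standing in for location.csv, and the rows are assumed to be sorted by id so that the counts from table() line up with the observations.

```r
# Toy stand-in for location.csv; names and values are assumptions
data <- data.frame(
  X = 1:6,
  id = c(1, 1, 1, 2, 2, 3),
  time = c(1, 2, 3, 1, 2, 1),
  location = c(10, 11, 12, 20, 21, 30)
)

# Column-wise unlist() followed by a row-wise fill mixes the columns,
# so the 4th column no longer holds the location values
scrambled <- matrix(unlist(data), ncol = 4, byrow = TRUE)
scrambled[, 4]

# Taking the columns directly keeps the observations intact
N <- as.numeric(table(data$id))  # number of observations per id
train <- list(x = data$location, N = N)
class(train) <- "hsmm.data"
str(unclass(train))
```

The block does not load mhsmm, so it only builds the object; passing train to the package's fitting functions is left out.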
