How can I delete values by column in a data frame?

I need to extract the abundance values for each column, excluding zeros. To do this I created an empty list and filled it in a for loop. When I delete [i] in the first line of my loop, I get the desired result only for the column of total values (the sum per object), but when I write the loop the way I learned to, I only get an undesired result.
set.seed(1000)
df <- data.frame(Category = sample(LETTERS[1:10]),
                 Object = sample(letters[1:10]),
                 A = sample(0:20, 10, replace = TRUE),
                 B = sample(0:20, 10, replace = TRUE),
                 C = sample(0:20, 10, replace = TRUE))
sincero <- list()
for (i in colnames(df[ , 3:5])) {
  sincero[i] = df[df[ , i] != 0, ]
  sincero
}
sincero
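One possible fix, offered as a sketch rather than a definitive answer: `sincero[i] = ...` is a single-bracket assignment, so R tries to fit the five columns of the filtered data frame into one list slot, keeps only the first column, and warns that the number of items to replace is not a multiple of the replacement length. Assigning with `[[i]]` stores the whole filtered data frame as one list element. Keeping only `Category`, `Object`, and the current abundance column is my assumption about the desired output.

```r
set.seed(1000)
df <- data.frame(Category = sample(LETTERS[1:10]),
                 Object = sample(letters[1:10]),
                 A = sample(0:20, 10, replace = TRUE),
                 B = sample(0:20, 10, replace = TRUE),
                 C = sample(0:20, 10, replace = TRUE))

sincero <- list()
for (i in colnames(df)[3:5]) {
  # `[[<-` stores the filtered data frame as a single list element.
  # Assumption: only the identifying columns and the current column are wanted.
  sincero[[i]] <- df[df[[i]] != 0, c("Category", "Object", i)]
}
sincero
```

Each element of `sincero` is then a data frame named after its column (`sincero$A`, `sincero$B`, `sincero$C`) containing only the rows where that column is non-zero.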

Related

How to run a function on individual columns instead of a data frame?

Hello everyone. I have two data frames and am trying to do bootstrapping with Script1 below, which takes the number of rows from data frames one and two. Instead of taking the row count from the entire data frame, I want to split out each column as its own data frame, remove the zero values, take the row count, and then do the bootstrapping. In Script2 I create the individual data frames in a for loop, but as I am new to R, I am unsure how to add the Script1 logic to it efficiently.
Please advise. Below are Script1, which runs, and Script2, where I am trying to subset each column into an individual data frame.
Script1
set.seed(2)
m1 <- matrix(sample(c(0, 1:10), 100, replace = TRUE), 10)
m2 <- matrix(sample(c(0, 1:5), 50, replace = TRUE), 5)
m1 <- as.data.frame(m1)
m2 <- as.data.frame(m2)
nboot <- 1e3
n_m1 <- nrow(m1); n_m2 <- nrow(m2)
temp <- c()
for (j in seq_len(nboot)) {
  boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
  value <- colSums(m2) / colSums(m1[boot, ])
  temp <- rbind(temp, value)
}
boot_data <- apply(temp, 2, median)
Script2
for (i in colnames(m1)) {
  m1_subset = (m1[m1[[i]] > 0, ])
  m1_subset = m1_subset[i]
  m2_subset = m2[m2[[i]] > 0, ]
  m2_subset = m2_subset[i]
  num_m1 <- nrow(m1_subset); n_m2 <- nrow(m2_subset) # after this I want to add the script above with the changed input
}

If I understand correctly, you want to do the sampling and calculation on each column individually, after removing the 0 values. I modified your code to work on a single vector instead of a data frame (i.e., using length() instead of nrow() and sum() instead of colSums()). I also suggest creating the empty matrix for your results ahead of time and filling it in; it will be faster.
temp <- matrix(nrow = nboot, ncol = ncol(m1))
for (i in seq_along(m1)) {
  m1_subset = m1[m1[, i] > 0, i]
  m2_subset = m2[m2[, i] > 0, i]
  n_m1 <- length(m1_subset); n_m2 <- length(m2_subset)
  for (j in seq_len(nboot)) {
    boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
    temp[j, i] <- sum(m2_subset) / sum(m1_subset[boot])
  }
}
boot_data <- apply(temp, 2, median)
boot_data <- setNames(data.frame(t(boot_data)), names(m1))
boot_data
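As an alternative sketch (not part of the answer above), the same per-column calculation can be wrapped in a small helper and applied to matching columns with `mapply`. The helper name `boot_median` is hypothetical, and `nboot` is lowered to 200 here only to keep the example quick. It assumes every column still has at least one non-zero value after filtering.

```r
set.seed(2)
m1 <- as.data.frame(matrix(sample(c(0, 1:10), 100, replace = TRUE), 10))
m2 <- as.data.frame(matrix(sample(c(0, 1:5), 50, replace = TRUE), 5))
nboot <- 200

# Hypothetical helper: bootstrap median of sum(x2) / sum(resampled x1)
# for one pair of columns, after dropping zeros from both.
boot_median <- function(x1, x2, nboot) {
  x1 <- x1[x1 > 0]
  x2 <- x2[x2 > 0]
  ratios <- replicate(nboot, {
    boot <- sample.int(length(x1), size = length(x2), replace = TRUE)
    sum(x2) / sum(x1[boot])
  })
  median(ratios)
}

# mapply walks over the columns of m1 and m2 in parallel.
boot_data <- mapply(boot_median, m1, m2, MoreArgs = list(nboot = nboot))
boot_data
```

The result is a named numeric vector with one bootstrap median per column, so there is no results matrix to preallocate by hand.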

list to dataframe without unique columns

I have this loop to generate some values:
for (j in 1:2) {
  table <- rep(data.frame(
    matrix(c(letters[1:2],
             sample(c(rep(1,100),0), size = 1),
             sample(c(rep(0,100),1), size = 1)), ncol = 2) ), j)
}
I would like to get output like this:
X1 X2
a  1
b  0
a  1
b  1
That is, a table with the letters in one column and the numbers in the second column.
I tried:
do.call(rbind, table)
data.frame(matrix(unlist(table), nrow=length(table), byrow=TRUE))
But I am not able to get the values into the right columns of the data table.

The table object is overwritten in each iteration of the loop. Instead, we may use replicate to create a list and then bind its elements together:
lst1 <- replicate(2, data.frame(
  matrix(c(letters[1:2],
           sample(c(rep(1,100),0), size = 1),
           sample(c(rep(0,100),1), size = 1)), ncol = 2) ), simplify = FALSE)
do.call(rbind, lst1)
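One caveat worth noting: combining letters and numbers in a single `c()` call coerces everything to character, so `X2` does not end up numeric. A variation, sketched here under the assumption that a numeric second column is wanted, builds the two columns separately. The helper name `make_block` is hypothetical.

```r
set.seed(1)

# Hypothetical helper: one two-row block with a letter column
# and a numeric column drawn the same way as in the question.
make_block <- function() {
  data.frame(X1 = letters[1:2],
             X2 = c(sample(c(rep(1, 100), 0), size = 1),
                    sample(c(rep(0, 100), 1), size = 1)))
}

lst1 <- replicate(2, make_block(), simplify = FALSE)
out <- do.call(rbind, lst1)
out
```

Because the columns are created directly in `data.frame()`, no matrix is involved and `X2` keeps its numeric type.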

function applied to dataset R

Below are two data frames labeled 'A' and 'C'. I have created a function that takes the top 5 rows of a data frame and want it applied to data frame C. However, it only returns the rows of A. How can I have this function applied to C instead? Thanks!
L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
A <- data.frame(1, 1:10, sample(L3, 10, replace = TRUE))
L3 <- LETTERS[7:9]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
C <- data.frame(1, 1:10, sample(L3, 10, replace = TRUE))
function_y <- function(Data_Analysis_Task) {
  sample2 <- head(A, 5)
  return(sample2)
}
D <- function_y(C)

The function body needs to use the argument that is passed in:
function_y <- function(Data_Analysis_Task) {
  head(Data_Analysis_Task, 5)
}
D <- function_y(C)
If we use head(A, 5) inside the function, R looks for the object 'A' in the function's own environment first; if it is not found there, R looks in the enclosing environment, and so on until it finds 'A' in the global environment. So the function returns the head of 'A' every time it is called, regardless of the argument.
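A minimal sketch of the difference, using two small made-up data frames (the `id` column and both function names are illustrative, not from the question):

```r
A <- data.frame(id = 1:10)
C <- data.frame(id = 101:110)

# Ignores its argument: `A` is not defined inside the function,
# so R finds it in the global environment.
ignores_argument <- function(Data_Analysis_Task) {
  head(A, 5)
}

# Uses whatever data frame is passed in.
uses_argument <- function(Data_Analysis_Task) {
  head(Data_Analysis_Task, 5)
}

from_global <- ignores_argument(C)
from_argument <- uses_argument(C)
```

Here `from_global` holds rows of `A` even though `C` was passed, while `from_argument` holds rows of `C`.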

R: How to pick rows of a data frame that match a criteria but not filter out others

Assume a data frame that looks something like this:
set.seed(42)
seqs <- sapply(1:20, FUN = function(x) { paste(sample(letters, size = 11, replace = T), collapse = "") })
annot1 <- sapply(1:1000, FUN = function(x) { sample(c("A", "B","C"), size = 1, replace = T)})
annot2 <- sapply(1:1000, FUN = function(x) { sample(c("X", "Y","Z"), size = 1, replace = T)})
values <- rnorm(n = length(annot1), mean = 1, sd = 1)
df <- data.frame(id=sample(seqs, size = length(annot1), replace = T), annot1, annot2, values)
I would like to get the rows that have a value above a given threshold, e.g. values > 1.5, in either 1 or all 3 conditions (but not 2), where the conditions are given by the variables annot1 or annot2. For the ids that match this criterion, I want all values (not only the ones above the threshold).
My usual approach, which consists of chaining filter() and n_distinct(), doesn't work in this case because it filters out observations where the value isn't above the threshold, which creates issues when I go to wide format later on to do clustering on these variables.
I have thought about creating intermediate variables and using them to pick up ids, but it feels like there must be a more elegant solution.
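One possible approach, sketched in base R under my own reading of the criterion: for each id, count how many distinct `annot1` conditions contain at least one value above the threshold, keep the ids where that count is 1 or 3, and then subset the full data frame by id so that values below the threshold are retained. The simplified data, the use of `annot1` only, and the names `threshold`, `above`, `n_cond`, and `keep_ids` are all assumptions for illustration.

```r
set.seed(42)
seqs <- sapply(1:20, function(x) paste(sample(letters, 11, replace = TRUE), collapse = ""))
annot1 <- sample(c("A", "B", "C"), 1000, replace = TRUE)
values <- rnorm(1000, mean = 1, sd = 1)
df <- data.frame(id = sample(seqs, 1000, replace = TRUE), annot1, values)

threshold <- 1.5

# Rows above the threshold are used only to decide which ids qualify.
above <- df[df$values > threshold, ]

# Number of distinct conditions with a value above the threshold, per id.
n_cond <- tapply(above$annot1, above$id, function(x) length(unique(x)))

# Keep ids with exactly 1 or all 3 conditions above the threshold.
keep_ids <- names(n_cond)[n_cond %in% c(1, 3)]

# Subset by id, so every row of a qualifying id is kept.
result <- df[df$id %in% keep_ids, ]
```

The same two-step idea (compute qualifying ids, then filter by membership) should carry over to dplyr with `group_by()` and a grouped `filter()`, though that is not shown here.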

MHSMM package R input data format with multiple variables

My problem is similar to the one in this question: the problem of R-input Format
I tried the code from the linked question and revised parts of it to suit my data. My data looks like this:
I want my data to be turned into a data frame with 4 variable vectors. The code I revised is:
formatMhsmm <- function(data) {
  nb.sequences = nrow(data)
  nb.variables = ncol(data)
  data_df <- data.frame(matrix(unlist(data), ncol = 4, byrow = TRUE))
  # iterate over these in loops
  rows <- 1:nb.sequences
  # build vector with id value
  id = numeric(length = nb.sequences)
  for (i in rows) {
    id[i] = data_df[i, 2]
  }
  # build vector with time value
  time = numeric(length = nb.sequences)
  for (i in rows) {
    time[i] = data_df[i, 3]
  }
  # build vector with observation values
  sequences = numeric(length = nb.sequences)
  for (i in rows) {
    sequences[i] = data_df[i, 4]
  }
  data.df = data.frame(id, time, sequences)
  # creation of hsmm data object needed for training
  N <- as.numeric(table(data.df$id))
  train <- list(x = data.df$sequences, N = N)
  class(train) <- "hsmm.data"
  return(train)
}
library(mhsmm)
dataset <- read.csv("location.csv", header = TRUE)
train <- formatMhsmm(dataset)
print(train)
The output observations are not the data from the 4th column; instead I get a list like (4, 8, 12,...,396, 1, 1, ..., 56, 192,...,6550, 68, NA, NA,...). It has picked up a quarter of the data from each column. Why does this happen?
Thank you very much!!!!

Why don't you simply count your observations by id and create the hsmm.data object directly? Supposing your data frame is called "data", we have:
N <- as.numeric(table(data$id))
train <- list(x=data$location, N = N)
class(train) <- "hsmm.data"
Extracted from http://www.jstatsoft.org/v39/i04/paper
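As for why the original function scrambles the data, a likely explanation is that `unlist()` flattens a data frame column by column, while `matrix(..., byrow = TRUE)` then fills rows with consecutive values, so each row of the rebuilt matrix holds values from a single original column. The toy data frame below is made up for illustration; its column names `obs`, `id`, `time`, and `location` are assumptions.

```r
# Hypothetical 4-column data standing in for the CSV in the question.
data <- data.frame(obs = 1:4,
                   id = c(1, 1, 2, 2),
                   time = c(1, 2, 1, 2),
                   location = c(10, 20, 30, 40))

# unlist() goes column by column, so byrow = TRUE puts each original
# column into a row instead of restoring the original layout.
scrambled <- matrix(unlist(data), ncol = 4, byrow = TRUE)
scrambled[, 4]   # no longer the location column

# Direct construction, as suggested in the answer above.
N <- as.numeric(table(data$id))
train <- list(x = data$location, N = N)
class(train) <- "hsmm.data"
```

In the scrambled matrix the fourth column takes every fourth value of the flattened vector, which matches the 4, 8, 12, ... pattern reported in the question.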
