Cant create a data.frame with my findings from loop - r

I have a for-loop which return 4 different answeres, which is correct, but when I try to retrieve these values to my data.frame I get "Error in [<-.data.frame(*tmp*, p, 1, value = 29.1520685791182) :
missing values are not allowed in subscripted assignments of data frames"
Goal: Im trying to get values, which is printed 29, 485,-14, 12, in a data.frame
library("xts")
library("quantmod")
library("fredr")
Tesla <- getSymbols("TSLA", from=as.Date("2014-11-03"),to=as.Date("2019-11-03"))
Amazon <- getSymbols("AMZN", from=as.Date("2014-11-03"),to=as.Date("2019-11-03"))
Equinor <- getSymbols("EQNR",from="2014-11-03",to="2019-11-03")
FTSE100 <- getSymbols("^FTSE",from="2014-11-03",to="2019-11-03")
dftest <- data.frame(merge(TSLA$TSLA.Close, AMZN$AMZN.Close, EQNR$EQNR.Close,FTSE$FTSE.Close))
df <- data.frame(matrix(nrow = 1, ncol = 4)) #The data.frame where i want my returned values from print(pros) to be in.
colnames(dfProsent) <- c("TESLA", "AMAZON","EQUINOR","FTSE")
for (p in dftest) {
pros <- ((last(as.numeric(p)))-(first(as.numeric(p))))/(first(as.numeric(p)))*100
print(pros) #this print out 29, 485,-14,12
df[p,1] <- pros #the problem
}

Related

Dataframe output from a for-loop

I am trying to populate the output of a for loop into a data frame. The loop is repeating across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the dataframe called "data2" have 2 columns, when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
if (i==34) {break}
a[i] <- colnames(data[i])
b <- data$cycle
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame() help page, you'll see that it does not take arguments nrow and ncol--those are arguments for the matrix() function.
This is how you initialize data2, and you can see it starts with 2 columns, one column is named ncol, the second column is named nrow.
data2 <- data.frame(ncol=4, nrow=33)
data2
# ncol nrow
# 1 4 33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much outside of the loop as possible. This is just guesswork without having sample data, these changes seem like a start at improving your code.
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
if (i==34) {break}
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
## it's weird to assign a row of data.frame at once...
## maybe you should keep it as a matrix?
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2

How to run function on indivisual columns instead of data frame?

Hello everyone I have two data frame trying to do bootstrapping with below script1 in my script1 i am taking number of rows from data frame one and two. Instead of taking rows number from entire data frame I wanted split individual columns as a data frame and remove the zero values and than take the row number than do the bootstrapping using below script. So trying with script2 where I am creating individual data frame from for loop as I am new to R bit confused how efficiently do add the script1 function to it
please suggest me below I am providing script which is running script1 and the script2 I am trying to subset each columns creating a individual data frame
Script1
set.seed(2)
m1 <- matrix(sample(c(0, 1:10), 100, replace = TRUE), 10)
m2 <- matrix(sample(c(0, 1:5), 50, replace = TRUE), 5)
m1 <- as.data.frame(m1)
m2 <- as.data.frame(m2)
nboot <- 1e3
n_m1 <- nrow(m1); n_m2 <- nrow(m2)
temp<- c()
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
value <- colSums(m2)/colSums(m1[boot,])
temp <- rbind(temp, value)
}
boot_data<- apply(temp, 2, median)
script2
for (i in colnames(m1)){
m1_subset=(m1[m1[[i]] > 0, ])
m1_subset=m1_subset[i]
m2_subset=m2[m2[[i]] >0, ]
m2_subset=m2_subset[i]
num_m1 <- nrow(m1_subset); n_m2 <- nrow(m2_subset)# after this wanted add above script changing input
}
If I understand correctly, you want to do the sampling and calculation on each column individually, after removing the 0 values. I. modified your code to work on a single vector instead of a dataframe (i.e., using length() instead of nrow() and sum() instead of colSums(). I also suggest creating the empty matrix for your results ahead of time, and filling in -- it will be fasted.
temp <- matrix(nrow = nboot, ncol = ncol(m1))
for (i in seq_along(m1)){
m1_subset = m1[m1[,i] > 0, i]
m2_subset = m2[m2[,i] > 0, i]
n_m1 <- length(m1_subset); n_m2 <- length(m2_subset)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
temp[j, i] <- sum(m2_subset)/sum(m1_subset[boot])
}
}
boot_data <- apply(temp, 2, median)
boot_data <- setNames(data.frame(t(boot_data)), names(m1))
boot_data

R for loop to calculate wilcox.test

I am trying to write a code that would automatically calculate Wilcoxon test p-value for several comparisons.
Data used: 2 data sets with the same information representing two groups of participants completed the same 5 tasks which means that the each table contains 5 columns (tasks) and X rows with tasks scores.
data_17_18_G2 # first data set (in data.table format)
data_18_20_G2 # second data set (in data.table format)
Both data sets have identical names of column which are to be used in the W-test the next way:
wilcox.test(Group1Task1, Group2Task1, paired = F)
wilcox.test(Group1Task2, Group2Task2, paired = F)
and so on.
The inputs (e.g., Grou1Task1) are two vectors of task scores (the first one will be from data_17_18_G2 and the other one from data_18_20_G2
Desired output: a data table with a column of p-values
The problem I faced is that no matter how I manipulated the val1 and val2 empty objects, in the second and the third lines the right size "as.numeric(unlist(data_17_18_G2[, ..i]))" gives a correct output (a numeric vector) but it's left size "val1[i]" always returns only one value from the vector. That gave me the idea that the main problem appeared on the step of creating an empty vector, however, I wasn't able to solve it.
Empty objects:
result <- data.table(matrix(ncol=2))
val1 <- as.numeric() # here I also tried functions "numeric" and "vector"
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
For loop
for (i in 1:5) {
val1[i] <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2[i] <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
Output:
Error in `[<-.data.table`(`*tmp*`, i, 2, value = NULL) :
When deleting columns, i should not be provided
1: В val1[i] <- as.numeric(unlist(data_17_18_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
2: В val2[i] <- as.numeric(unlist(data_18_20_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
3: В res[i] <- wilcox.test(val1[i], val2[i], paired = F) :
number of items to replace is not a multiple of replacement length
Alternative:
I changed the second and the third lines
for (i in 1:5) {
val1[i] <- as.numeric(data_17_18_G2[ , ..i])
val2[i] <- as.numeric(data_18_20_G2[ , ..i])
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
And got this
Error in as.numeric(data_17_18_G2[, ..i]) :
(list) object cannot be coerced to type 'double'
which means that the function wilcox.test cannot interpret this type of input.
How can I improve the code so that I get a data table of p-values?
There would appear to be some bugs in the code. I have rewritten the code using the cars dataset as a example.
## use the cars dataset as a example (change with appropriate data)
data(cars)
data_17_18_G2 <- as.data.table(cars)
data_18_20_G2 <- data_17_18_G2[,2:1]
## Fixed code
result <- data.table(matrix(as.numeric(), nrow=ncol(data_17_18_G2), ncol=2))
val1 <- as.numeric()
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
for (i in 1:ncol(data_17_18_G2)) {
val1 <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2 <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[[i]] <- wilcox.test(val1, val2, paired = F)
result[i, 1] <- as.numeric(i)
result[i, 2] <- as.numeric(res[[i]]$p.value)
}
Hope this gives you the output you are after.

Automatically conducting t tests and chi square tests across multiple variables in a data frame in R

Here is some example data:
set.seed(1234) # Make the results reproducible
count <- 100
cs1 <- round(rchisq(count, 1), 2)
cs2 <- round(rchisq(count, 2), 2)
c(rep("Present", 30), rep("Absent", 30), rep("NA", 40)) -> temp
temp[temp == "NA"] <- NA
as.factor(temp) -> temp
temp1 <- round(rnorm(count, 3), 2)
temp1[7] <- NA
temp2 <- round(rnorm(count, 7), 2)
temp2[54] <- NA
c(rep("Yes", 30), rep("No", 30), rep("Maybe", 30), rep("NA", 10)) -> temp3
temp3[temp3 == "NA"] <- NA
as.factor(temp3) -> temp3
c(rep("Group A", 55), rep("Group B", 45)) -> temp4
as.factor(temp4) -> temp4
mydata <- data.frame(cs1, cs2, temp, temp1, temp2, temp3, temp4)
mydata$cs2[56:100] <- NA ; mydata
I know I can compute summary statistics for each variable stratified by temp4 like so:
by(mydata, mydata$temp4, summary)
However, I would also like to compute either a t.test or a chisq.test for each variable stratified by temp4. I've tried simply modifying the above code to do that but it always gives me an error. It seems the error stems from the fact that some of the variables in the data frame are numeric (and thus, would need a t.test) while others are factors (and thus, would need a chisq.test).
Is there a simple way to tell R to check the variable to see what kind it is, and then run the appropriate test, all at once? And to still print out all of the results even if it encounters an error?
I am not worried about the appropriateness of doing this (e.g., I am aware of the risks of multiple testing, etc) but rather just need to know how to do it. Thanks!
You can use lapply to loop through the variables and decide inside the anonymous function which test to conduct.
When an error occurs, it's caught by tryCatch and instead of a test result the final list will have the error message as a member.
tests_list <- lapply(mydata[-ncol(mydata)], function(x){
tryCatch({
if(is.numeric(x)){
if(length(levels(mydata$temp4)) == 2){
t.test(x ~ temp4, data = mydata)
}else{
aov(x ~ temp4, data = mydata)
}
}else{
tbl <- table(x, mydata$temp4)
chisq.test(tbl)
}
}, error = function(e) e)
})
err <- sapply(tests_list, inherits, "error")
tests_list$cs1
tests_list$temp3
tests_list[[err]]
Yes, you can loop through designated columns, keeping temp4 as the factor, and check class of each column (named x within the anonymous function). You can use sapply or apply(X, MARGIN = 2, FUN ...). Note that I'm explicitly subsetting mydata because I find it more explicit and readable.
sapply(mydata[, c("cs1", "cs2", "temp", "temp1", "temp2", "temp3")], FUN = function(x, group) {
if (class(x) == "numeric") {
# perform t-test, e.g. t.test(x ~ group)
return(result_of_t_test)
}
if (class(x) == "factor") {
# perform chi-square test
return(result_of_chisq_test)
}
}, group = mydata$temp4)

MHSMM package R input data format with multiple variables

my problem is similar to the question as followingthe problem of R-input Format
I have tried the above code in the above link and revised some part to suit my data. my data is like follow
I want my data can be created as a data frame with 4 variable vectors. The code what I have revised is
formatMhsmm <- function(data){
nb.sequences = nrow(data)
nb.variables = ncol(data)
data_df <- data.frame(matrix(unlist(data), ncol = 4, byrow = TRUE))
# iterate over these in loops
rows <- 1: nb.sequences
# build vector with id value
id = numeric(length = nb.sequences)
for( i in rows)
{
id[i] = data_df[i,2]
}
# build vector with time value
time = numeric (length = nb.sequences)
for( i in rows)
{
time[i] = data_df[i,3]
}
# build vector with observation values
sequences = numeric(length = nb.sequences)
for(i in rows)
{
sequences[i] = data_df[i, 4]
}
data.df = data.frame(id,time,sequences)
# creation of hsmm data object need for training
N <- as.numeric(table(data.df$id))
train <- list(x = data.df$sequences, N = N)
class(train) <- "hsmm.data"
return(train)
}
library(mhsmm)
dataset <- read.csv("location.csv", header = TRUE)
train <- formatMhsmm(dataset)
print(train)
The output observation is not the data of 4th col, it's a list of (4, 8, 12,...,396, 1, 1, ..., 56, 192,...,6550, 68, NA, NA,...) It has picked up 1/4 data of each col. Why it is like this?
Thank you very much!!!!
Why don't you simply count yout observations by Id, and create the hsmm.data object directly? Supposing yout dataframe is called "data", we have:
N <- as.numeric(table(data$id))
train <- list(x=data$location, N = N)
class(train) <- "hsmm.data"
Extracted from http://www.jstatsoft.org/v39/i04/paper

Resources