I have an Excel spreadsheet with a bunch of numbers in it (and some empty cells as well).
What I want to do is read that file into R so that I can take the numbers from all the non-empty cells and put them into a single vector. What is the best way to do this?
It's difficult to demonstrate a solution here without an example, but it's straightforward to create one:
library(openxlsx)
set.seed(1)
sample(c(round(rnorm(100), 2), rep("", 100)), 50, TRUE) |>
  matrix(10) |>
  as.data.frame() |>
  write.xlsx("myfile.xlsx")
In our spreadsheet software, the file looks like this:
To get the values in the spreadsheet into a single vector in R, we read it into a data frame, unlist it, convert to numeric, and remove NA values:
all_numbers <- read.xlsx("myfile.xlsx") |>
  unlist() |>
  as.numeric() |>
  na.omit() |>
  c()
all_numbers
#> [1] -0.44 -0.57 0.82 -0.02 1.36 -1.22 -1.25 -0.04 0.76 0.58 0.88
#> [12] 1.21 -2.21 -0.04 1.12 -0.74 -0.02 -0.16 -0.71 -0.41 -0.11 -0.16
#> [23] 0.03 0.34
You will see these match the numbers in the picture of the spreadsheet.
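The unlist, convert, and drop-NA steps can also be tried without a spreadsheet. The following is a minimal sketch in base R; the small data frame is made up to mimic what read.xlsx might return for a sheet with blank cells, so its column names and values are assumptions.

```r
# Hypothetical stand-in for a sheet read as text, with blank and missing cells
sheet <- data.frame(V1 = c("1.5", "", "2"),
                    V2 = c(NA, "-0.3", ""),
                    stringsAsFactors = FALSE)

# Flatten column by column, then convert; blanks become NA
vals <- as.numeric(unlist(sheet, use.names = FALSE))

# Keep only the cells that held a number
vals <- vals[!is.na(vals)]
vals
```

Subsetting with !is.na() returns a plain numeric vector, so no final c() is needed to strip the attributes that na.omit() attaches.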
So I am trying to get an output in csv file but I am having trouble formatting as per my need.
My Code
method.metric <- mmetric(testCenScal[[course_name]], method.pred, c("RMSE", "R2", "MAE", "COR"))
write.table(method.metric, "metric.csv", sep = ",", col.names = T, append = T)
Current Output
"x"
"MAE",0.636059658390333
"RMSE",0.814405873704867
"COR",0.581863604936215
"R2",0.338565254749368
"x"
"MAE",0.636059658390333
"RMSE",0.814405873704867
"COR",0.581863604936215
"R2",0.338565254749368
"x"
"RMSE",0.869309100173694
"R2",0.356594555638249
"MAE",0.653084184175849
"COR",0.597155386510286
"x"
"RMSE",0.869309100173694
"R2",0.356594555638249
"MAE",0.653084184175849
"COR",0.597155386510286
It would be nice if I could format this output into something like:
RMSE R2 MAE COR param1 param2
0.89 0.35 0.65 0.59 courseA Blackboost
0.89 0.35 0.65 0.59 courseB Blackboost
0.89 0.35 0.65 0.59 courseC Blackboost
0.89 0.35 0.65 0.59 courseD Blackboost
0.89 0.35 0.65 0.59 courseE Blackboost
0.89 0.35 0.65 0.59 courseA Rpart
0.89 0.35 0.65 0.59 courseB Rpart
0.89 0.35 0.65 0.59 courseC Rpart
0.89 0.35 0.65 0.59 courseD Rpart
0.89 0.35 0.65 0.59 courseE Rpart
I don't know what "x" is or where it is coming from. I guess that because I haven't given a column name, it prints "x" by default?
I have this code in a function, so I am passing two parameters: one is the method and the other is the target field. I would like to print those as well when appending to the CSV file.
If I type dput(method.metric)
I get output as:
structure(c(0.869309100173694, 0.356594555638249, 0.653084184175849,
0.597155386510286), .Names = c("RMSE", "R2", "MAE", "COR"))
I already tried using the code write.csv(method.metric, file ="metric.csv", row.names=FALSE, eol=",", append=T) but it did not help much.
I will try to work on what you suggested, formatting the output in R using cbind and other functions. If I get the output in the above format, I will be able to create graphs with ease, as I have a lot of predictive model results being output.
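One way to get a layout like the one above is to turn each result into a one-row data frame before appending it. This is only a sketch under some assumptions: append_metrics is a hypothetical helper, param1 and param2 are placeholder column names taken from the desired output, and the metric values are the ones from the dput output. The "x" header most likely appears because write.table coerces a plain vector to a one-column data frame, whose default column name is x.

```r
# Hypothetical helper: append one row of metrics plus two labels to a CSV file
append_metrics <- function(metric, course, method, file) {
  # Fix the metric order so every appended row lines up with the header
  metric <- metric[c("RMSE", "R2", "MAE", "COR")]
  # t() turns the named vector into a 1-row matrix; the names become columns
  row <- data.frame(t(metric), param1 = course, param2 = method)
  new_file <- !file.exists(file)
  # Write the header only once, when the file is first created
  write.table(row, file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
}

method.metric <- c(RMSE = 0.869309100173694, R2 = 0.356594555638249,
                   MAE = 0.653084184175849, COR = 0.597155386510286)

out_file <- tempfile(fileext = ".csv")  # temporary file for the demonstration
append_metrics(method.metric, "courseA", "Blackboost", out_file)
append_metrics(method.metric, "courseB", "Blackboost", out_file)

metrics <- read.csv(out_file)
metrics
```

Reordering the metrics inside the helper matters here because the current output shows them arriving in different orders from one call to the next.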
I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column is then treated as the new column 1, and the most-correlated column is removed from the set, so the size of the data decreases by one column. I then repeat the process until the data frame result has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. I am wondering if there is an efficient or quicker way to get to the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash comes from memory overhead during the loop rather than from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.table and taking advantage of its "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA,nrow=r,ncol=n)))
temp <- copy(df) # Create a temporary copy in which the correlations are calculated
set(result, j=1L, value=temp[[1]]) # The first column is the same
for (icol in as.integer(2:n)) {
  mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1]) # Determine which column is most correlated with column 1
  set(x=result, i=NULL, j=as.integer(icol), value=(temp[[1]] + temp[[mch]])) # Aggregate and place result in the result data.table
  set(x=temp, i=NULL, j=1L, value=result[[icol]]) # Set result as new 1st column
  set(x=temp, i=NULL, j=as.integer(mch), value=NULL) # Remove column
}
Try
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i] # Set result as new 1st column
  temp[, maxcor] <- NULL # Remove column
}
The error occurred because in the last iteration temp[,-1] is left with a single column, and standard R behavior is to reduce the class from data frame to vector in such cases (hence the drop argument in the code above), which causes sapply to pass on only the first element, etc.
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
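If the strongest correlation regardless of sign is what you want, taking the absolute value before which.max is one option. A small sketch on made-up data (the data frame and the names cors, most_positive and strongest are only for illustration):

```r
# Hypothetical toy data: b falls as a rises, c rises loosely with a
temp <- data.frame(a = c(1, 2, 3, 4, 5),
                   b = c(10, 8, 6.5, 4, 2),
                   c = c(2, 1, 4, 3, 6))

# Correlation of every remaining column with the first column
cors <- sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))

most_positive <- names(which.max(cors))      # what the loop above picks
strongest <- names(which.max(abs(cors)))     # largest magnitude, either sign
```

Here the two choices differ: c has the most positive correlation with a, while b has the strongest one because it is almost perfectly negatively correlated.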
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.
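Going one step further, only the correlations with column 1 are ever used, so the full correlation matrix does not have to be computed at all. The following is a sketch of the whole loop along those lines, using the example df from the question; the variable names cors and best are my own, and this has not been timed against the other versions.

```r
df <- data.frame(
  X1 = c(0.32, 0.52, 0.84, 0.12, 0.40, 0.55),
  X2 = c(0.88, 0.61, 0.71, 0.30, 0.62, 0.28),
  X3 = c(0.12, 0.44, 0.50, 0.72, 0.48, 0.33),
  X4 = c(0.91, 0.19, 0.67, 0.40, 0.39, 0.81),
  X5 = c(0.18, 0.65, 0.36, 0.05, 0.95, 0.60)
)
n <- ncol(df)
temp <- df
result <- df  # same shape as df; column 1 stays, the rest are overwritten

for (i in 2:n) {
  # cor() of a vector against a data frame gives a 1 x k matrix:
  # only the correlations with column 1, nothing else
  cors <- cor(temp[[1]], temp[, -1, drop = FALSE])
  best <- colnames(cors)[which.max(cors)]
  result[[i]] <- temp[[1]] + temp[[best]]  # aggregate
  temp[[1]] <- result[[i]]                 # new 1st column
  temp[[best]] <- NULL                     # remove the used column
}
result
```

As a sanity check, the last column of result should equal the row sums of df, since by then every column has been added in.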
I want to save the following output I get in the R console into a csv or txt file.
Discordancy measures (critical value 3.00)
0.17 3.40 1.38 0.90 1.62 0.13 0.15 1.69 0.34 0.39 0.36 0.68 0.39
0.54 0.70 0.70 0.79 2.08 1.14 1.23 0.60 2.00 1.81 0.77 0.35 0.15
1.55 0.78 2.87 0.34
Heterogeneity measures (based on 100 simulations)
30.86 14.23 3.75
Goodness-of-fit measures (based on 100 simulations)
glo gev gno pe3 gpa
-3.72 -12.81 -19.80 -32.06 -37.66
This is the outcome I get when I run the following
Heter<-regtst(regsamlmu(-extremes), nsim=100)
where Heter is a list (i.e., is.list(Heter) returns TRUE)
You could use capture.output:
capture.output(regtst(regsamlmu(-extremes), nsim=100), file="myoutput.txt")
Or, to capture the output of several consecutive commands:
sink("myfile.txt")
#
# [commands generating desired output]
#
sink()
You could make a character vector which you write to a file. Each entry in the vector will be separated by a newline character.
out <- capture.output(regtst(regsamlmu(-extremes), nsim=100))
write(out, "output.txt", sep="\n")
If you would like to add more lines just do something like c(out, "hello Kostas")
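Here is a self-contained sketch of the same idea that runs without the regtst call. summary() is only a stand-in for any call that prints to the console, and the file is a temporary one created for the demonstration.

```r
out_file <- tempfile(fileext = ".txt")  # temporary file for the demonstration

# summary() stands in for any call whose printed output you want to keep
out <- capture.output(summary(c(1, 2, 3, 10)))

# Add an extra line, then write one vector element per line
out <- c(out, "hello Kostas")
writeLines(out, out_file)

# Read the file back to check what was saved
saved <- readLines(out_file)
saved
```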
I would like to remove cases from a data frame based on whether they contain a particular pattern. For example, in the data frame below I would like to remove all the rows that contain (Intercept), iyeareducc, ibphdtdep and gender_R22 (or alternatively select the rows that contain _carrier1 or adri).
OR CI P
apoee4_carrier.(Intercept) 1.96 0.97-3.94 0.06
apoee4_carrier.apoee4_carrier1 1.03 0.77-1.37 0.84
apoee4_carrier.iyeareducc 0.86 0.82-0.9 0.00
apoee4_carrier.ibphdtdep 1.01 0.96-1.05 0.81
apoee4_carrier.gender_R22 0.87 0.67-1.12 0.28
BDNF_carrier.(Intercept) 2.05 1.01-4.14 0.04
BDNF_carrier.BDNF_carrier1 0.87 0.66-1.14 0.33
BDNF_carrier.iyeareducc 0.86 0.82-0.9 0.00
BDNF_carrier.ibphdtdep 1.00 0.96-1.05 0.82
BDNF_carrier.gender_R22 0.87 0.67-1.12 0.28
adri.(Intercept) 1.60 0.78-3.31 0.20
adri.adri 1.03 1-1.06 0.04
adri.iyeareducc 0.89 0.84-0.94 0.00
adri.ibphdtdep 1.00 0.95-1.04 0.87
adri.gender_R22 0.87 0.67-1.12 0.27
While I could use a sequence to subset out the rows I require, like so
dat[(seq(2,nrow(dat),5)),]
OR CI P
apoee4_carrier.apoee4_carrier1 1.03 0.77-1.37 0.84
BDNF_carrier.BDNF_carrier1 0.87 0.66-1.14 0.33
adri.adri 1.03 1-1.06 0.04
this will only work if the sequence is the same throughout the entire data frame, which may not necessarily be the case, as this data frame is created from a list of data frames that have been combined with rbind.
Thanks.
You can use grep to select the rows you want/don't want:
dat[-grep("Intercept|iyeareducc|ibphdtdep|gender", rownames(dat)),]
grep returns the row numbers of the rows for which the row names contain at least one of your search strings (the | between each string means "OR"). Putting a minus sign in front of grep tells R to return only the rows of dat that are not returned by grep.
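One caveat with the minus-sign form: if none of the row names match, grep returns an empty vector and the subset comes back with zero rows instead of all of them. A logical index from grepl avoids that. A small sketch on a cut-down, made-up version of the data (only the OR column and five of the rows):

```r
dat <- data.frame(OR = c(1.96, 1.03, 0.86, 1.60, 1.03),
                  row.names = c("apoee4_carrier.(Intercept)",
                                "apoee4_carrier.apoee4_carrier1",
                                "apoee4_carrier.iyeareducc",
                                "adri.(Intercept)",
                                "adri.adri"))

# TRUE for every row whose name contains one of the search strings
drop <- grepl("Intercept|iyeareducc|ibphdtdep|gender", rownames(dat))
kept <- dat[!drop, , drop = FALSE]

# With a pattern that matches nothing, all rows are kept as expected
none <- grepl("no_such_term", rownames(dat))
all_rows <- dat[!none, , drop = FALSE]

# The minus-sign form returns zero rows in that situation
empty <- dat[-grep("no_such_term", rownames(dat)), , drop = FALSE]
```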
Pretty newb question here, but I have not been able to track down a solution for some time:
I have an xts object of trading indicators (indicate) for stock data that looks like
A XOM MSFT
2000-11-30 -0.59 0.22 0.10
2000-12-29 0.55 -0.23 0.05
2001-01-30 -0.52 0.09 -0.10
And a table with an identical index for the corresponding period returns (return) that looks like
A XOM MSFT
2000-11-30 -0.15 0.10 0.03
2000-12-29 0.03 -0.05 0.02
2001-01-30 -0.04 0.02 -0.05
I have sorted the indicator table and had it return the column name with the following code:
indicate.label <- colnames(indicate)
indicate.rank <- t(apply(indicate, 1, function(x) indicate.label[order(-x)]))
indicate.rank <- xts(indicate.rank, order.by = index(returns))
Which gives the table (indicate.rank) of the symbol names ranked by their trading indicator:
1 2 3
2000-11-30 XOM MSFT A
2000-12-29 A MSFT XOM
2001-01-30 XOM MSFT A
I would like to also have a table that gives the period returns based on the indicator rank:
2000-11-30 0.10 0.03 -0.15
2000-12-29 0.03 0.02 -0.05
2001-01-30 0.02 -0.05 -0.04
I cannot figure out how to call the correct symbol for all rows or just sort the table return based on the order of indicate.
Thank you for any suggestions.
Trevor J
I'm not particularly satisfied with this solution, but it works.
row.rank <- t(apply(indicate, 1, order, decreasing=TRUE))
indicate.rank <- return.rank <- indicate # pre-allocate
for(i in 1:NROW(indicate.rank)) {
  indicate.rank[i,] <- colnames(indicate)[row.rank[i,]]
  return.rank[i,] <- return[i,row.rank[i,]]
}
It would probably be easier to handle this if the returns and the indicators for each symbol were in the same object, but I don't know how that would fit with the rest of your workflow.
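If the loop bothers you, the same reordering can be done in one step with matrix indexing. The sketch below uses plain base R matrices built from the example data rather than xts objects, so it is an illustration of the indexing idea only; the names indicate_m, return_m, row_rank, return_rank and indicate_rank are made up for it.

```r
dates <- c("2000-11-30", "2000-12-29", "2001-01-30")
syms <- c("A", "XOM", "MSFT")

# Plain-matrix stand-ins for the indicator and return tables
indicate_m <- matrix(c(-0.59,  0.22,  0.10,
                        0.55, -0.23,  0.05,
                       -0.52,  0.09, -0.10),
                     nrow = 3, byrow = TRUE, dimnames = list(dates, syms))
return_m <- matrix(c(-0.15,  0.10,  0.03,
                      0.03, -0.05,  0.02,
                     -0.04,  0.02, -0.05),
                   nrow = 3, byrow = TRUE, dimnames = list(dates, syms))

# Column positions per row, highest indicator first
row_rank <- t(apply(indicate_m, 1, order, decreasing = TRUE))

# Pair each row number with its ranked column positions and index once
idx <- cbind(as.vector(row(row_rank)), as.vector(row_rank))
return_rank <- matrix(return_m[idx], nrow = nrow(return_m),
                      dimnames = list(dates, NULL))

# The symbol names in ranked order, built the same way
indicate_rank <- matrix(syms[row_rank], nrow = nrow(indicate_m),
                        dimnames = list(dates, NULL))
```

A two-column matrix used as an index picks one element per (row, column) pair, which is what replaces the loop. The results would still need to be wrapped back into xts with the original index if the rest of the workflow expects that.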