Hopefully a relatively easy one for those more experienced than me!
I'm trying to perform a Box-Cox transformation using the following code:
fit <- lm(ABOVEGROUND_BIO ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
bc <- boxcox(fit)
lambda<-with(bc, x[which.max(y)])
MYCORRHIZAL_VARIANCE$bc <- ((x^lambda)-1/lambda)
boxplot(bc ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
However, when I run it, I get the following error message:
Error: object 'x' not found. (on line 4)
For context, here's the str of my dataset:
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 24 obs. of 14 variables:
$ TREATMENT : Factor w/ 2 levels "Mycorrhizal",..: 1 1 1 1 1 1 1 1 1 1 ...
$ P_LEVEL : Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 2 2 2 2 ...
$ REP : int 1 2 3 4 5 6 1 2 3 4 ...
$ ABOVEGROUND_BIO : num 7.5 6.8 5.3 6 6.7 7 12 12.7 12 10.2 ...
$ BELOWGROUND_BIO : num 3 2.4 2 4 2.7 3.6 7.9 8.8 9.5 9.2 ...
$ ROOT_SHOOT : num 0.4 0.35 0.38 0.67 0.4 0.51 0.66 0.69 0.79 0.9 ...
$ ROOT_SHOOT.log : num -0.916 -1.05 -0.968 -0.4 -0.916 ...
$ ABOVEGROUND_BIO.log : num 2.01 1.92 1.67 1.79 1.9 ...
$ ABOVEGROUND_BIO.sqrt : num 2.74 2.61 2.3 2.45 2.59 ...
$ ABOVEGROUND_BIO.cubert: num 1.96 1.89 1.74 1.82 1.89 ...
$ BELOWGROUND_BIO.log : num 1.099 0.875 0.693 1.386 0.993 ...
$ BELOWGROUND_BIO.sqrt : num 1.73 1.55 1.41 2 1.64 ...
$ BELOWGROUND_BIO.cubert: num 1.44 1.34 1.26 1.59 1.39 ...
$ TOTAL_BIO : num 10.5 9.2 7.3 10 9.4 10.6 19.9 21.5 21.5 19.4 ...
- attr(*, "spec")=
.. cols(
.. TREATMENT = col_factor(levels = c("Mycorrhizal", "Non-mycorrhizal"), ordered = FALSE, include_na = FALSE),
.. P_LEVEL = col_factor(levels = c("Low", "High"), ordered = FALSE, include_na = FALSE),
.. REP = col_integer(),
.. ABOVEGROUND_BIO = col_number(),
.. BELOWGROUND_BIO = col_number(),
.. ROOT_SHOOT = col_number()
.. )
I understand there's no variable named bc in the MYCORRHIZAL_VARIANCE dataset, but I'm just following the basic instructions I was given for performing a Box-Cox, and I'm confused as to what 'x' should actually be, since I thought 'x' was being defined in line 3. Any suggestions on how to fix this error?
Thanks in advance!
I thought 'x' was being defined in line 3?
Line 3 is lambda<-with(bc, x[which.max(y)]). It doesn't define x; it defines lambda. It does use x, which it looks for within the bc object. If you're using boxcox() from the MASS package, bc does include x and y components, so that line shouldn't produce the error. The "object 'x' not found" error comes from line 4, where x is used on its own and doesn't exist anywhere. Even if you wrote bc$x there you'd avoid that message, but I'd expect an error about replacement lengths instead, because...
bc$x holds the candidate lambda values tried by boxcox - with the default seq(-2, 2, 1/10) that's 41 of them - and assigning 41 values to a new column only works if your data has a multiple of 41 rows, which your 24-row dataset does not.
Line 3 picks out the lambda value that maximizes the likelihood, so you shouldn't need the rest of the values in bc ever again. I'd expect you to use that lambda value to transform your response variable, as that's what the Box-Cox transformation is for. ((x^lambda)-1/lambda) doesn't make any statistical or programmatic sense. Use this instead:
MYCORRHIZAL_VARIANCE$bc <- (MYCORRHIZAL_VARIANCE$ABOVEGROUND_BIO ^ lambda - 1) / lambda
(Note that I also corrected the parentheses. You want (y ^ lambda - 1) / lambda, not (y ^ lambda) - 1 / lambda.)
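Putting the pieces together, here is a minimal sketch of the whole workflow (assuming boxcox() comes from MASS and using the column names from your str() output; the new column is named bc only to match your original code):
library(MASS)
fit <- lm(ABOVEGROUND_BIO ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
bc <- boxcox(fit)                  # profiles the log-likelihood over candidate lambdas
lambda <- bc$x[which.max(bc$y)]    # lambda with the highest log-likelihood
# Box-Cox transform of the response: (y^lambda - 1) / lambda
MYCORRHIZAL_VARIANCE$bc <-
  (MYCORRHIZAL_VARIANCE$ABOVEGROUND_BIO ^ lambda - 1) / lambda
# the formula picks up the bc column inside the data frame, not the boxcox object
boxplot(bc ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)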
I have two survival functions: one is not truncated, so I have experience for all time periods; the other is left-truncated until t = 4, so it has no experience until t > 4. I can plot the two together with the following code in R, using the survival package.
library(tidyverse)
library(survival)
library(ggfortify)
# create two survival functions
set1 <- tibble(start0 = rep(0,10), end0 = 1:10, event0 = rep(1,10))
set2 <- tibble(start0 = rep(4,10), end0 = c(5, 5, 7, 9, rep(10, 6)), event0 = rep(1,10))
combined_set <- bind_rows(set1, set2)
survival_fn <- survfit(Surv(start0, end0, event0) ~ start0, data = combined_set)
# plot the survival function:
autoplot(survival_fn, conf.int = FALSE)
I would like to show the difference in survival between the two functions if they had both had the same survival experience during the truncation period - i.e. up to t = 4. I've manually sketched the approximate graph I am trying to achieve (size of steps not to scale).
This is a simplified example - in practice I have eight different sets of data with different truncation periods, and around 2000 data-points in each set.
If you look at the structure of the survival_fn object (which is not a function but rather a list), you see:
str(survival_fn)
List of 17
$ n : int [1:2] 10 10
$ time : num [1:14] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:14] 10 9 8 7 6 5 4 3 2 1 ...
$ n.event : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
$ n.censor : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
$ surv : num [1:14] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 ...
$ std.err : num [1:14] 0.105 0.158 0.207 0.258 0.316 ...
$ cumhaz : num [1:14] 0.1 0.211 0.336 0.479 0.646 ...
$ std.chaz : num [1:14] 0.1 0.149 0.195 0.242 0.294 ...
$ strata : Named int [1:2] 10 4
..- attr(*, "names")= chr [1:2] "start0=0" "start0=4"
$ type : chr "counting"
$ logse : logi TRUE
$ conf.int : num 0.95
$ conf.type: chr "log"
$ lower : num [1:14] 0.732 0.587 0.467 0.362 0.269 ...
$ upper : num [1:14] 1 1 1 0.995 0.929 ...
$ call : language survfit(formula = Surv(start0, end0, event0) ~ start0, data = combined_set)
- attr(*, "class")= chr "survfit"
So one way of getting something like your goal, although still with an automatic start to the survival function at (t=0, S=1), would be to multiply all the $surv items in the 'start0=4' stratum by the surv value at t=4, and then redo the plot:
survival_fn[['surv']][11:14] <- survival_fn[['surv']][11:14]*survival_fn[['surv']][4]
I can see why this might not be a totally conforming answer, since there is still a blue line from 1 out to t=5 and it doesn't actually start at the surv value of stratum 1 at t=4. That is, however, a limitation of using a "high-level" abstraction plotting paradigm. The customizability is inhibited by the many "helpful" assumptions built into the plotting grammar. It would not be as difficult to do this in base plotting, since you could "move things around" without as many constraints.
If you do need to build a step function from estimated survival proportions and times, you might look at this answer and then build an augmented dataset with a y-at-time=4 adjustment for the later stratum. You would need to add a time=0 value for the main stratum and a time=4 value (the first stratum's value) for the second stratum, as well as doing the adjustment shown above. See this question and answer: Reconstruct survival curve from coordinates
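If you go that augmented-dataset route, here is a minimal sketch of the idea using ggplot2's geom_step() (assuming the survival_fn object above with stratum labels "start0=0" and "start0=4"; the steps data frame and its column names are just for illustration):
library(ggplot2)
# one row per step of each fitted curve, labelled by stratum
steps <- data.frame(time    = survival_fn$time,
                    surv    = survival_fn$surv,
                    stratum = rep(names(survival_fn$strata), survival_fn$strata))
# survival of the untruncated stratum at the truncation time t = 4
s4 <- steps$surv[steps$stratum == "start0=0" & steps$time == 4]
# rescale the truncated stratum and anchor each curve at its starting point
steps$surv[steps$stratum == "start0=4"] <- steps$surv[steps$stratum == "start0=4"] * s4
steps <- rbind(steps,
               data.frame(time = c(0, 4), surv = c(1, s4),
                          stratum = c("start0=0", "start0=4")))
steps <- steps[order(steps$stratum, steps$time), ]
ggplot(steps, aes(time, surv, colour = stratum)) +
  geom_step()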
I'm getting the following error while trying to generate the confusion matrix - this used to work.
str(credit_test)
# Generate predicted classes using the model object
class_prediction <- predict(object=credit_model,
newdata=credit_test,
type="class")
class(class_prediction)
class(credit_test$ACCURACY)
# Calculate the confusion matrix for the test set
confusionMatrix(data=class_prediction, reference=credit_test$ACCURACY)
'data.frame': 20 obs. of 4 variables:
$ ACCURACY : Factor w/ 2 levels "win","lose": 1 1 1 2 2 1 1 1 1 1 ...
$ PM_HIGH : num 5.7 5.12 10.96 7.99 1.73 ...
$ OPEN_PRICE: num 4.46 3.82 9.35 7.77 1.54 5.17 1.88 2.65 5.71 4.09 ...
$ PM_VOLUME : num 0.458 0.676 1.591 3.974 1.785 ...
[1] "factor"
[1] "factor"
Error in confusionMatrix(data=class_prediction, reference=credit_test$ACCURACY) :
unused arguments (data=class_prediction, reference=credit_test$ACCURACY)
For some reason I had to run it this way; something has changed:
caret::confusionMatrix(data=class_prediction,reference=credit_test$ACCURACY)
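Most likely another attached package defines its own confusionMatrix() with a different argument list and is masking caret's version, which would explain both the "unused arguments" error and why the caret:: prefix fixes it. A quick way to check which attached packages provide the name:
# list every package on the search path that defines confusionMatrix()
find("confusionMatrix")
# if more than one shows up, qualify the call so caret's version is used
caret::confusionMatrix(data = class_prediction,
                       reference = credit_test$ACCURACY)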
So I have seen questions regarding this error before, but the suggested troubleshooting that worked for those authors didn't help me diagnose it. I'm self-learning R and new to Stack Overflow, so please give me constructive feedback on how to better ask my question, and I will do my best to provide the necessary information. I've seen many similar questions put on hold, so I want to help you help me. I'm sure the error probably stems from my lack of experience in data prep.
I'm trying to run a panel data model on data loaded from a .csv, and this error is returned when the model is run:
fixed = plm(Y ~ X, data=pdata, model = "within")
Error in `colnames<-`(`*tmp*`, value = "1") :
length of 'dimnames' [2] not equal to array extent
Running str() on my dataset shows that ID and Time are factors with 162 and 7 levels, respectively.
str(pdata)
Classes ‘plm.dim’ and 'data.frame': 1127 obs. of 11 variables:
$ ID : Factor w/ 162 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
$ Time : Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7 1 2 3 ...
$ Online.Service.Index : num 0.083 0.131 0.177 0.268 0.232 ...
$ Eparticipation : num 0.0345 0.0328 0.0159 0.0454 0.0571 ...
$ CPI : num 2.5 2.6 2.5 1.5 1.4 0.8 1.2 2.5 2.5 2.4 ...
$ GE.Est : num -1.178 -0.883 -1.227 -1.478 -1.466 ...
$ RL.Est : num -1.67 -1.71 -1.72 -1.95 -1.9 ...
$ LN.Pop : num 16.9 17 17 17.1 17.1 ...
$ LN.GDP.Cap : num 5.32 5.42 5.55 5.95 6.35 ...
$ Human.Capital.Index : num 0.268 0.268 0.268 0.329 0.364 ...
$ Telecommunication.Infrastructure.Index: num 0.0016 0.00173 0.00202 0.01576 0.03278 ...
Still, I don't see how that would create this error. I've tried converting the data to a data frame or a matrix, with the same result (I got desperate, and that worked for some people).
dim() yields
[1] 1127 11
I have some NA values, but I understand that these shouldn't cause a problem. Again, I'm self-taught and new here, so please take it easy on me! Hope I explained the problem well.
I'm using a continuous Morlet wavelet transform (cwt) analysis on a time series with the R package dplR. The time series is 15-minute data (gam_15min) of length 7968 (corresponding to 83 days of measurements).
I have the following output:
cwtGamma=morlet(gam_15min,x1=seq_along(gam_15min),p2=NULL,dj=0.1,siglvl=0.95)
str(cwtGamma)
List of 9
$ y : Time-Series [1:7968] from 1 to 1993: 672 674 673 672 672 ...
$ x : int [1:7968] 1 2 3 4 5 6 7 8 9 10 ...
$ wave : cplx [1:7968, 1:130] -0.00332+0.0008i 0.00281-0.00181i -0.00194+0.00234i ...
$ coi : num [1:7968] 0.73 1.46 2.19 2.92 3.65 ...
$ period: num [1:130] 1.03 1.11 1.19 1.27 1.36 ...
$ Scale : num [1:130] 1 1.07 1.15 1.23 1.32 ...
$ Signif: num [1:130] 0.000382 0.001418 0.005197 0.018514 0.062909 ...
$ Power : num [1:7968, 1:130] 1.17e-05 1.11e-05 9.26e-06 7.09e-06 5.54e-06 ...
$ siglvl: num 0.95
In my analysis I want to truncate the time series (I suppose $wave) by removing one period length at the beginning and one period length at the end. How do I do that? Maybe it's easy, but I'm not seeing how... Thanks
Implemented:
I am importing a .xlsx file into R.
This file consists of three sheets.
I am binding all the sheets into a list.
Need to Implement
Now I want to combine this list of matrices into a single data.frame, with the header being names(dataset).
I tried using as.data.frame with read.xlsx as given in the help, but it did not work.
I explicitly tried as.data.frame(as.table(dataset)), but it still generates a long list of data.frames and nothing like what I want.
I want to have a structure like
header = names and the values below that, just like how the read.table imports the data.
This is the code I am using:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
b <- rbind(list(lapply(1:sheet_ct, function(x) {
res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
})))
b <- b [-c(1),] # Just want to remove the second header
I want to have the data arrangement something like below.
Ei Mi hours Nphy Cphy CHLphy Nhet Chet Ndet Cdet DON DOC DIN DIC AT dCCHO TEPC Ncocco Ccocco CHLcocco PICcocco par Temp Sal co2atm u10 dicfl co2ppm co2mol pH
1 1 1 1 0.1023488 0.6534707 0.1053458 0.04994161 0.3308593 0.04991916 0.3307085 0.05042275 49.76304 14.99330000 2050.132 2150.007 0.9642220 0.1339044 0.1040715 0.6500288 0.1087667 0.1000664 0.0000000 9.900000 31.31000 370 0.01 -2.963256000 565.1855 0.02562326 7.879427
2 1 1 2 0.1045240 0.6448216 0.1103250 0.04988347 0.3304699 0.04984045 0.3301691 0.05085697 49.52745 14.98729000 2050.264 2150.007 0.9308690 0.1652179 0.1076058 0.6386706 0.1164099 0.1001396 0.0000000 9.900000 31.31000 370 0.01 -2.971632000 565.7373 0.02564828 7.879042
3 1 1 3 0.1064772 0.6369597 0.1148174 0.04982555 0.3300819 0.04976363 0.3296314 0.05130091 49.29323 14.98221000 2050.396 2150.007 0.8997098 0.1941872 0.1104229 0.6291149 0.1225822 0.1007908 0.8695131 9.900000 31.31000 370 0.01 -2.980446000 566.3179 0.02567460 7.878636
4 1 1 4 0.1081702 0.6299084 0.1187672 0.04976784 0.3296952 0.04968840 0.3290949 0.05175249 49.06034 14.97810000 2050.524 2150.007 0.8705440 0.2210289 0.1125141 0.6213265 0.1273103 0.1018360 1.5513170 9.900000 31.31000 370 0.01 -2.989259000 566.8983 0.02570091 7.878231
5 1 1 5 0.1095905 0.6239005 0.1221460 0.04971029 0.3293089 0.04961446 0.3285598 0.05220978 48.82878 14.97485000 2050.641 2150.007 0.8431960 0.2459341 0.1140222 0.6152447 0.1308843 0.1034179 2.7777070 9.900000
Please don't suggest that I put all the data on a single sheet, or that I convert the .xlsx to .csv or plain text. I am trying really hard to get a proper data frame from a .xlsx file.
Following is the file
And this is the follow-up post: Followup
This is what resulted:
str(full_data)
'data.frame': 0 obs. of 19 variables:
$ Experiment : Factor w/ 2 levels "#","1":
$ Mesocosm : Factor w/ 10 levels "#","1","2","3",..:
$ Exp.day : Factor w/ 24 levels "1","10","11",..:
$ Hour : Factor w/ 24 levels "108","12","132",..:
$ Temperature: Factor w/ 125 levels "10","10.01","10.02",..:
$ Salinity : num
$ pH : num
$ DIC : Factor w/ 205 levels "1582.2925","1588.6475",..:
$ TA : Factor w/ 117 levels "1813","1826",..:
$ DIN : Factor w/ 66 levels "0.2","0.3","0.4",..:
$ Chl.a : Factor w/ 156 levels "0.171","0.22",..:
$ PIC : Factor w/ 194 levels "-0.47","-0.96",..:
$ POC : Factor w/ 199 levels "-0.046","1.733",..:
$ PON : Factor w/ 151 levels "1.675","1.723",..:
$ POP : Factor w/ 110 levels "0.032","0.034",..:
$ DOC : Factor w/ 93 levels "100.1","100.4",..:
$ DON : Factor w/ 1 level "µmol/L":
$ DOP : Factor w/ 1 level "µmol/L":
$ TEP : Factor w/ 100 levels "10.4934","11.0053",..:
[Note: Above is the structure after reading from the .xlsx file... the levels make the calculation and manipulation tedious and messy.]
This is what I want to achieve:
str(a)
'data.frame': 9936 obs. of 29 variables:
$ Ei : int 1 1 1 1 1 1 1 1 1 1 ...
$ Mi : int 1 1 1 1 1 1 1 1 1 1 ...
$ hours : int 1 2 3 4 5 6 7 8 9 10 ...
$ Cphy : num 0.653 0.645 0.637 0.63 0.624 ...
$ CHLphy : num 0.105 0.11 0.115 0.119 0.122 ...
$ Nhet : num 0.0499 0.0499 0.0498 0.0498 0.0497 ...
$ Chet : num 0.331 0.33 0.33 0.33 0.329 ...
$ Ndet : num 0.0499 0.0498 0.0498 0.0497 0.0496 ...
$ Cdet : num 0.331 0.33 0.33 0.329 0.329 ...
$ DON : num 0.0504 0.0509 0.0513 0.0518 0.0522 ...
$ DOC : num 49.8 49.5 49.3 49.1 48.8 ...
$ DIN : num 15 15 15 15 15 ...
$ DIC : num 2050 2050 2050 2051 2051 ...
$ AT : num 2150 2150 2150 2150 2150 ...
$ dCCHO : num 0.964 0.931 0.9 0.871 0.843 ...
$ TEPC : num 0.134 0.165 0.194 0.221 0.246 ...
$ Ncocco : num 0.104 0.108 0.11 0.113 0.114 ...
$ Ccocco : num 0.65 0.639 0.629 0.621 0.615 ...
$ CHLcocco: num 0.109 0.116 0.123 0.127 0.131 ...
$ PICcocco: num 0.1 0.1 0.101 0.102 0.103 ...
$ par : num 0 0 0.87 1.55 2.78 ...
$ Temp : num 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
$ Sal : num 31.3 31.3 31.3 31.3 31.3 ...
$ co2atm : num 370 370 370 370 370 370 370 370 370 370 ...
$ u10 : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ dicfl : num -2.96 -2.97 -2.98 -2.99 -3 ...
$ co2ppm : num 565 566 566 567 567 ...
$ co2mol : num 0.0256 0.0256 0.0257 0.0257 0.0257 ...
$ pH : num 7.88 7.88 7.88 7.88 7.88 ...
[Note: sorry for the extra columns; this is another dataset (plain text), which I am reading with read.table.]
With NA's handled:
> unique(mydf_1$Exp.num)
[1] # 1
Levels: # 1
> unique(mydf_2$Exp.num)
[1] # 2
Levels: # 2
> unique(mydf_3$Exp.num)
[1] # 3
Levels: # 3
> unique(full_data$Exp.num)
[1] 2 3 4
Without handling NA's:
> unique(full_data$Exp.num)
[1] 1 NA 2 3
> unique(full_data$Mesocosm)
[1] 1 2 3 4 5 6 7 8 9 NA
I think this is what you need. I add a few comments on what I am doing:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
for (i in 1:sheet_ct) { # read the sheets into 3 separate data frames (mydf_1, mydf_2, mydf_3)
print(i)
variable_name <- sprintf('mydf_%s',i)
assign(variable_name, read.xlsx(xlfile, sheetIndex = i, startRow = 1, endRow = 209)) # with startRow/endRow you don't need the NA-removal function below, but you do need to specify the first and last rows
}
colnames(mydf_1) <- names(mydf_2) # this was unclear in the question. I chose the second sheet's
# names as the column names, but you can choose whichever you want in the same way (the second and third sheets had the same names).
#some of the sheets were loaded with a few blank rows (full of NAs) which I remove
#with the following function according to the first column which is always populated
#according to what I see
remove_na_rows <- function(x) {
  sum(!is.na(x))  # number of populated entries, i.e. the last row index to keep
}
mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num),]
mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num),]
mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num),]
full_data <- rbind(mydf_1[-1,],mydf_2[-1,],mydf_3[-1,]) #making one dataframe here
full_data[] <- lapply(full_data, function(x) as.numeric(as.character(x))) # convert factor columns to numeric (via as.character, so you get the values rather than the level codes)
full_data$Ei <- as.integer(full_data$Ei) # use this pattern to convert any column to integer
full_data$Mi <- as.integer(full_data$Mi)
full_data$hours <- as.integer(full_data$hours)
#*********code to use for removing NA rows *****************
#so if you rbind not caring about the NA rows you can use the below to get rid of them
#I just tested it and it seems to be working
n_row <- NULL
for (i in 1:nrow(full_data)) {
  x <- full_data[i, ]
  if (all(is.na(x))) {
    n_row <- append(n_row, i)
  }
}
if (length(n_row) > 0) { # guard: subsetting with -NULL would throw an error
  full_data <- full_data[-n_row, ]
}
I think now this is what you need
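As an aside, the same NA-row cleanup can be done without a loop; a one-line sketch (assuming full_data is still a plain data frame at that point):
# keep only rows that have at least one non-NA value
full_data <- full_data[rowSums(is.na(full_data)) < ncol(full_data), ]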