Agricolae, tapply, error: arguments must have same length - r

I am new to R and am stuck on my data analysis. My Excel data has a lot of NAs, and I have been trying to troubleshoot the error below. Here is my code, if anyone can help, and a link to a sample of my data:
file:///C:/Users/steph/Documents/DLI%20ANOVA%20Sample.htm
Some of my variables have 4 reps instead of the full 8 reps, so there are a lot of NAs in the Excel file. I keep getting this error when I call tapply:
Error in tapply(X = data1$gi..m3., INDEX = data1$cultivar, FUN = mean, :
arguments must have same length
library(agricolae)
data1=read.csv("DLI ANOVA Sample.csv", header=T, as.is=T)
#setting factors
block = as.factor(data1$block)
treatmentt = as.factor(data1$trt)
cultivar<-factor(data1$cv,c("CR", "LB","RF","RR","S","SNS","SNY","SSJ","YC"))
str(data1)
#Summary statistics
tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm=T)
tapply(X = data1$growth.index, INDEX = data1$treatment, FUN = mean, na.rm=T)
'data.frame': 288 obs. of 24 variables:
$ block : int 1 1 2 2 3 3 4 4 1 1 ...
$ trt : chr "HL-L" "HL-L" "HL-L" "HL-L" ..
$ cv : chr "CR" "CR" "CR" "CR" ...
$ rep : int 1 2 3 4 5 6 7 8 1 2 ...
$ height : int 23 20 25 19 23 19 22 19 19 24
$ growth.index : num 0.0221 0.0258 0.0276 0.0227 0.0209
$ number.of.mature.fruit : int 34 30 35 34 28 25 40 24 12 16 ...
$ mature.fruit.fw : num 163 163 186 152 169 ...
$ number.of.immature.fruit : int 38 28 40 27 35 37 44 48 20 30 ...
$ immature.fruit.fw : num 77.4 66.6 87.6 43.4 81.3 ...
$ Total.number.of.fruit : num 72 58 75 61 63 62 84 72 32 46 ...
$ Total.fruit.fw : num 241 230 273 195 250 ...
$ Fruit.Water.Content..g. : num NA 209 NA 176 NA ...
$ Brix.. : num 4.9 NA 5.6 NA 4.7 NA 5.1 NA 5.6 NA ...
$ pH : num 4.17 NA 4.3 NA 4.1 ...
$ EC.uS.mL : num 4.46 NA 9.19 NA 8.24 ...
$ X..citric.Acid : num 0.704 NA 0.397 NA 0.653 ...
$ Sugar.Acid.Ratio : num 6.96 NA 14.11 NA 7.2 ...
$ oedema.injury.level..1.6. : int 3 3 1 2 1 1 1 2 2 1 ...
$ Stomatal.conductance : num NA 365 NA 422 NA ...
$ spad : num NA NA NA 64.3 NA 65.5 NA 68.7 NA 55.6 ...
$ Irrigation.Events : int NA 14 NA 12 NA 13 NA 16 NA 13 ...
$ WUE : num NA 0.00584 NA 0.00693 NA ...
$ transpiration..g.H2O.lost..g.dry.biomass.: num NA 117 NA 111 NA ...
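One possible cause, judging only from the str() output above: the factors are stored in standalone variables (block, treatmentt, cultivar), so data1 itself has no column named cultivar or treatment. data1$cultivar is then NULL, and tapply() is given an INDEX of length 0 against an X of length 288. The sketch below reproduces this with a small made-up data frame (the values are invented for illustration) and shows one way around it, by storing the factor as a column of the data frame. The same would apply to data1$treatment.

```r
# Toy stand-in for the real data; values are made up for illustration
data1 <- data.frame(
  cv = c("CR", "CR", "LB", "LB"),
  growth.index = c(0.020, NA, 0.030, 0.025)
)

# A standalone factor does not create a column in data1
cultivar <- factor(data1$cv)
is.null(data1$cultivar)   # TRUE: there is no such column

# Indexing by the missing column fails; capture the message instead of stopping
msg <- tryCatch(
  tapply(data1$growth.index, data1$cultivar, mean, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)

# Store the factor in the data frame, then summarise per cultivar
data1$cultivar <- factor(data1$cv)
means <- tapply(data1$growth.index, data1$cultivar, mean, na.rm = TRUE)
means
```

With na.rm = TRUE the NAs from the missing reps are simply skipped within each group.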

Related

convert tibble to data frame in R

I'm new to R and have received data from others who are much better at R than I am.
I need to do some simple descriptive statistics with a deadline tomorrow, and I noticed that the data is in "tibble" format; I would like it as a data frame instead. Can anybody help? It's probably very simple, but my tidyverse skills are still very limited (I'm working on it :)).
The tibble is as I would like it to be, with 165 rows (one for each patient) and 16 columns.
I would just like it to be a data frame.
Thank you for your time.
I tried the very simple
data_dataframe <- as.data.frame(model_data)
But it didn't work.
My str out is:
> str(model_data)
tibble [165 × 16] (S3: tbl_df/tbl/data.frame)
$ PatientID : Factor w/ 165 levels "Patient_001",..: 1 2 3 4 5 6 7 8 9 10 ...
$ VAS_baseline : num [1:165] NA 99 25 75 50 50 90 81 80 100 ...
$ VAS_followup_30 : num [1:165] 75 95 53 85 88 92 98 NA NA 80 ...
$ VAS_followup_180 : num [1:165] 95 83 35 NA 94 68 98 NA NA 100 ...
$ Index_baseline : num [1:165] NA 1 0.847 1 0.813 0.826 0.967 1 0.96 0.952 ...
$ Index_followup_30 : num [1:165] 0.967 0.919 0.764 0.919 1 0.96 1 NA NA 0.919 ...
$ Index_followup_180: num [1:165] 1 0.88 0.728 NA 1 1 1 NA NA 0.952 ...
$ Age : num [1:165] 68 74 61 64 69 55 74 68 79 66 ...
$ Group : Factor w/ 3 levels "Group_1","Group_2",..: 2 3 1 1 1 3 1 3 2 2 ...
$ Surgeon : Factor w/ 6 levels "Surgeon_1","Surgeon_2",..: 1 3 1 1 5 3 1 6 4 6 ...
$ VAS_to_30 : num [1:165] NA -4 28 10 38 42 8 NA NA -20 ...
$ VAS_to_180 : num [1:165] NA -16 10 NA 44 18 8 NA NA 0 ...
$ VAS_30_to_180 : num [1:165] 20 -12 -18 NA 6 -24 0 NA NA 20 ...
$ Index_to_30 : num [1:165] NA -0.081 -0.083 -0.081 0.187 ...
$ Index_to_180 : num [1:165] NA -0.12 -0.119 NA 0.187 0.174 0.033 NA NA 0 ...
$ Index_30_to_180 : num [1:165] 0.033 -0.039 -0.036
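For what it's worth, as.data.frame() is normally the right call here, and the result can be checked with class(). A minimal sketch, using a base R stand-in for the tibble (a data frame with the tibble class vector set by hand, which is an assumption made so the example runs without the tibble package; the two rows are made up):

```r
# Stand-in for model_data: a data frame carrying the tibble class vector.
# With a real tibble the as.data.frame() call below is the same.
model_data <- data.frame(PatientID = c("Patient_001", "Patient_002"),
                         Age = c(68, 74))
class(model_data) <- c("tbl_df", "tbl", "data.frame")

data_dataframe <- as.data.frame(model_data)

class(data_dataframe)   # only "data.frame" remains
dim(data_dataframe)     # rows and columns are unchanged
```

If class(data_dataframe) shows only "data.frame", the conversion did work and the problem is likely elsewhere.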

How do I resolve this error using lapply and my own function?

I have a list of 18 data frames that I read in using read.xlsx. Each data frame has the same number of columns but some columns contain NA for some rows.
Also, in the Abundance column there are rows that contain non-numeric data and I suspect that I may need to remove these rows from each data frame but I have not been able to find a way to remove those rows.
My data frame structure is like this:
$ :'data.frame': 118 obs. of 10 variables:
..$ Locus : Factor w/ 24 levels "A","CS",..: 14 14 14 14 22 22 NA 22 10 10 ...
..$ Target : Factor w/ 96 levels "[AAAGA]14","[AAAGA]15",..: 88 91 90 87 11 12 NA 9 65 67 ...
..$ Length : num [1:118] 60 76 72 56 24 39 NA 20 139 141 ...
..$ Abundance : num [1:118] 1479 1108 180 144 1786 ...
..$ Size : num [1:118] 15 19 18 14 6 9.3 NA 5 32 32.2 ...
..$ Call : Factor w/ 4 levels "Al","HAs",..: 1 1 3 3 1 1 NA 3 1 1 ...
..$ RAR : num [1:118] NA 74.92 12.17 9.74 NA ...
..$ Position : num [1:118] NA NA NA NA NA NA NA NA NA NA ...
..$ Al.1.s.percent: num [1:118] NA NA 12.17 9.74 NA ...
..$ Al.2.s.percent: num [1:118] NA NA 16.2 13 NA ...
I want to apply this function to each data frame in my list of data frames.
add.sum = function(df){
transform(df, Tot.count = ave(df[[Abundunce]], df[[Locus]], FUN = sum))
}
I tried using this line with lapply
transformed.data = lapply(mydata, add.sum)
I also tried it this way
transformed.data = lapply(mydata, function (x) add.sum(x))
But these give me the following error
Error in .subset2(x, i, exact = exact) : no such index at level 1
Any suggestions on how to get this working correctly?
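A possible fix, as a sketch: inside [[ ]] the column names need to be quoted strings, and "Abundunce" looks like a misspelling of the Abundance column shown in the str() output. The two small data frames below are made up for illustration, and na.rm = TRUE is an assumption about how NA abundances should be treated.

```r
# Per-locus total of Abundance, added as a new column Tot.count
add.sum <- function(df) {
  transform(df, Tot.count = ave(df[["Abundance"]], df[["Locus"]],
                                FUN = function(v) sum(v, na.rm = TRUE)))
}

# Made-up stand-in for the list of 18 data frames
mydata <- list(
  data.frame(Locus = c("A", "A", "CS"), Abundance = c(10, 5, 7)),
  data.frame(Locus = c("CS", "CS", "A"), Abundance = c(1, NA, 4))
)

transformed.data <- lapply(mydata, add.sum)
transformed.data[[1]]
```

Both lapply forms from the question are equivalent once add.sum itself is fixed.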

Have issues in getting probability values using SVM in R

I have a data set with 28 attributes. The response variable is binary (0 and 1). I tried running SVM with "probability=T", but I still could not get the probability values from the result.
Here is my training data set (last attribute is my response variable):
str(train)
'data.frame': 73630 obs. of 29 variables:
$ EMOTION_INDICATOR: num -2 -0.625 0.9 0 1.625 ...
$ CLUSTER : Factor w/ 8 levels "","cluster0",..: 4 7 5 1 1 3 8 6 7 1 ...
$ GENDER : Factor w/ 3 levels "","Female","Male": 2 2 2 1 1 3 3 2 3 1 ...
$ AGE : num 36 37 70 NA NA ...
$ REGION : Factor w/ 6 levels "","'Northern Ireland'",..: 5 6 5 1 1 6 4 4 6 1 ...
$ WORKING : Factor w/ 14 levels "","A","B","C",..: 4 14 8 1 1 6 4 3 4 1 ...
$ MUSIC : Factor w/ 7 levels "","A","B","C",..: 5 7 6 1 1 4 5 2 5 1 ...
$ LIST_OWN : num 1 1 6 NA NA ...
$ LIST_BACK : num 1 3 2 NA NA 0.5 2 0.5 6 NA ...
$ Q1 : num 10 51 35 NA NA 29 51 25 69 NA ...
$ Q2 : num 53 51 36 NA NA 7 49 25 71 NA ...
$ Q3 : num 12 70 37 NA NA 26 51 23 70 NA ...
$ Q4 : num 11 31 36 NA NA 2 50 24 7 NA ...
$ Q5 : num 12 6 37 NA NA 51 73 22 10 NA ...
$ Q6 : num 12 6 9 NA NA 51 47 22 68 NA ...
$ Q7 : num 76 5 36 NA NA 29 50 30 11 NA ...
$ Q8 : num 76 24 13 NA NA 72 52 10 10 NA ...
$ Q9 : num 51 7 70 NA NA 12 36 48 53 NA ...
$ Q10 : num 53 70 69 NA NA 9 91 18 75 NA ...
$ Q11 : num 76 89 65 NA NA 53 53 18 86 NA ...
$ Q12 : num 76 91 63 NA NA 5 52 16 89 NA ...
$ Q13 : num 52 50 6 NA NA 51 77 17 99 NA ...
$ Q14 : num 75 73 62 NA NA 70 78 21 100 NA ...
$ Q15 : num 11 72 31 NA NA 33 48 19 67 NA ...
$ Q16 : num 12 47.7 24.3 NA NA ...
$ Q17 : num 71 74 51 NA NA 51 52 27 98 NA ...
$ Q18 : num 23.6 52 31 NA NA ...
$ Q19 : num 22.5 52 32 NA NA ...
$ AVERAGE_RATING : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
My test set looks similar too. It has 24544 obs. with 29 variables.
This is the code that I used for SVM:
fitSVM <- svm(AVERAGE_RATING ~., data=train, na.action = na.omit,probability=T)
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),],type="probability")
table(predSVM,test$AVERAGE_RATING[!rowSums(is.na(test))],useNA="no")
predSVM 0 1
0 8091 1523
1 3259 9865
I get proper output, but without probability values:
attr(predSVM,"probabilities")
NULL
Am I doing something wrong?
You need to pass probability=T to predict as well (predict.svm takes a probability argument, not type="probability"):
predSVM <- predict(fitSVM,test[!rowSums(is.na(test)),], probability=T)
See ?predict.svm

Caret error: "all the Accuracy metric values are missing"

I'm getting the following error and I don't know what may have gone wrong.
I'm using RStudio with R 3.1.3 on Windows 8.1, and the caret package for data mining.
I have the following training data:
str(training)
'data.frame': 212300 obs. of 21 variables:
$ FL_DATE_MDD_MMDD : int 101 101 101 101 101 101 101 101 101 101 ...
$ FL_DATE : int 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 ...
$ UNIQUE_CARRIER : Factor w/ 13 levels "9E","AA","AS",..: 11 10 2 5 8 9 11 10 10 10 ...
$ DEST : Factor w/ 150 levels "ABE","ABQ","ALB",..: 111 70 82 8 8 31 110 44 53 80 ...
$ DEST_CITY_NAME : Factor w/ 148 levels "Akron, OH","Albany, NY",..: 107 61 96 9 9 29 106 36 97 78 ...
$ ROUNDED_TIME : int 451 451 551 551 551 551 551 551 551 551 ...
$ CRS_DEP_TIME : int 500 520 600 600 600 600 600 600 602 607 ...
$ DEP_DEL15 : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
$ CRS_ARR_TIME : int 746 813 905 903 855 815 901 744 901 841 ...
$ Conditions : Factor w/ 28 levels "Blowing Snow",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Dew.PointC : num -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 ...
$ Events : Factor w/ 10 levels "","Fog","Fog-Rain",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Gust.SpeedKm.h : num NA NA NA NA NA NA NA NA NA NA ...
$ Humidity : int 68 68 71 71 71 71 71 71 71 71 ...
$ Precipitationmm : num NA NA NA NA NA NA NA NA NA NA ...
$ Sea.Level.PressurehPa: num 1021 1021 1022 1022 1022 ...
$ TemperatureC : num -9.4 -9.4 -10 -10 -10 -10 -10 -10 -10 -10 ...
$ VisibilityKm : num 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 ...
$ Wind.Direction : Factor w/ 18 levels "Calm","East",..: 9 9 7 7 7 7 7 7 7 7 ...
$ WindDirDegrees : int 320 320 330 330 330 330 330 330 330 330 ...
$ Wind.SpeedKm.h : num 20.4 20.4 13 13 13 13 13 13 13 13 ...
- attr(*, "na.action")=Class 'omit' Named int [1:22539] 3 32 45 87 94 325 472 548 949 1333 ...
.. ..- attr(*, "names")= chr [1:22539] "3" "32" "45" "87" ...
and when I execute the following command:
ldaModel <- train(DEP_DEL15~.,data=training,method="lda",preProc=c("center","scale"),na.remove=TRUE)
I get:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :1 NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
It is probably due to having an outcome factor with levels "0" and "1".
There is a specific warning issued when this happens: "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1"
It seems that people uniformly ignore warnings so I'm going to make this throw an error in the next version.
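As a sketch of one way to avoid that warning, the outcome levels can be renamed to valid R names before calling train(). The small factor below is a made-up stand-in for training$DEP_DEL15; make.names() is base R.

```r
# Made-up stand-in for the outcome column training$DEP_DEL15
DEP_DEL15 <- factor(c("0", "1", "0", "0"))

# "0" and "1" are not valid R variable names; make.names() prefixes them with X
levels(DEP_DEL15) <- make.names(levels(DEP_DEL15))

levels(DEP_DEL15)
```

On the real data the same assignment would be applied to levels(training$DEP_DEL15).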
If the variables Gust.SpeedKm.h and Precipitationmm contain only NA's try omitting those variables from your data before running the model. If they contain partial NA's and you think they could have predictive value as features then use imputation. Follow this documentation for pre-processing in caret, including imputation.
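A minimal sketch of the first suggestion, dropping columns that contain only NAs. The three-row data frame is made up for illustration and only borrows column names from the question.

```r
# Made-up stand-in for the training data
training <- data.frame(
  Humidity = c(68, 71, 71),
  Gust.SpeedKm.h = c(NA_real_, NA_real_, NA_real_),
  Precipitationmm = c(NA, 0.5, NA)
)

# Flag columns where every value is NA
all_na <- vapply(training, function(col) all(is.na(col)), logical(1))
names(training)[all_na]

# Keep everything else; partially missing columns are left for imputation
training_clean <- training[, !all_na, drop = FALSE]
names(training_clean)
```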

Split and unsplit a dataframe in four parts

I'd like to split a dataframe into 4 equal parts, because I'd like to use the 4 cores of my computer.
I did this :
df2 <- split(df, 1:4)
unsplit(df2, f=1:4)
and that
df2 <- split(df, 1:4)
unsplit(df2, f=c('1','2','3','4'))
But the unsplit function did not work; I get these warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Do you have any idea what the reason is?
How many rows are in df? You will get that warning if the number of rows in your table is not divisible by 4. I think you are using the split factor f incorrectly, unless what you want is to put each successive row into a different split data.frame.
If you really want to split your data into 4 data frames, one row after the other, then make your splitting factor the same length as the number of rows in your data frame using rep_len, like this:
## Split like this:
split(df , f = rep_len(1:4, nrow(df) ) )
## Unsplit like this:
unsplit( split(df , f = rep_len(1:4, nrow(df) ) ) , f = rep_len(1:4,nrow(df) ) )
Hopefully this example illustrates why the error occurs and how to avoid it (i.e. use a proper splitting factor!).
## Want to split our data.frame into two halves, but rows not divisible by 2
df <- data.frame( x = runif(5) )
df
## Splitting still works but...
## We get a warning because the data length is not a multiple of the length of the split factor 'f'
split( df , f = 1:2 )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
## Instead let's use the same split levels (1:2)...
## but make it equal to the length of the rows in the table:
splt <- rep_len( 1:2 , nrow(df) )
splt
#[1] 1 2 1 2 1
## Split works, and f is not recycled because there are
## the same number of values in 'f' as rows in the table
split( df , f = splt )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
## And unsplitting then works as expected and reconstructs our original data.frame
unsplit( split( df , f = splt ) , f = splt )
# x
#1 0.6970968
#2 0.6206521
#3 0.5614762
#4 0.1798006
#5 0.5910995
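If contiguous blocks are preferred over the row-by-row interleaving above (for example, to hand one block to each core), sorting the same rep_len() factor is one option. This is only a sketch on a made-up 10-row data frame; the names chunk and parts are introduced here for illustration.

```r
# Made-up data frame with 10 rows (not divisible by 4)
df <- data.frame(id = 1:10, x = seq(0.1, 1, by = 0.1))

# Same levels as before, but sorted so each part is a contiguous run of rows
chunk <- sort(rep_len(1:4, nrow(df)))
chunk

parts <- split(df, f = chunk)
vapply(parts, nrow, integer(1))   # part sizes differ by at most one row

# Unsplitting with the same factor puts the rows back in their original order
restored <- unsplit(parts, f = chunk)
restored
```

Because the factor has one value per row, no recycling happens and no warning is issued.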
In the R language 'split' example . . .
aq <- airquality
g <- aq$Month
l <- split(aq,g)
After the 'scale' function is executed
l <- lapply(l, transform, Ozone = scale(Ozone))
I am guessing that at one time in R's history the function 'scale' did not add extra attributes to the column it is modifying.
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
As seen in here . . .
> str(l)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 0.782 0.557 -0.523 -0.253 NA ...
.. ..- attr(*, "scaled:center")= num 23.6
.. ..- attr(*, "scaled:scale")= num 22.2
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] NA NA NA NA NA ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 2.399 -0.32 -0.857 NA 0.154 ...
.. ..- attr(*, "scaled:center")= num 59.1
.. ..- attr(*, "scaled:scale")= num 31.6
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] -0.528 -1.284 -1.108 0.455 -0.629 ...
.. ..- attr(*, "scaled:center")= num 60
.. ..- attr(*, "scaled:scale")= num 39.7
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] 2.674 1.928 1.721 2.467 0.644 ...
.. ..- attr(*, "scaled:center")= num 31.4
.. ..- attr(*, "scaled:scale")= num 24.1
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
But now it does add those attributes
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
and the very simple 'unsplit' function is not programmed to handle those attributes.
> unsplit(l,g)
Error in xj[i, , drop = FALSE] : (subscript) logical subscript too long
The (direct and simple) solution is to get rid of those attributes.
attributes(l[[1]]$Ozone) <- NULL
attributes(l[[2]]$Ozone) <- NULL
attributes(l[[3]]$Ozone) <- NULL
attributes(l[[4]]$Ozone) <- NULL
attributes(l[[5]]$Ozone) <- NULL
Then try to unsplit again.
> str( unsplit(l,g) )
'data.frame': 153 obs. of 6 variables:
$ Ozone : num 0.782 0.557 -0.523 -0.253 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
So, now it works.
Andre Mikulec
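A shorter variant of the same idea, as a sketch: the attributes can be dropped at the moment of scaling with as.vector(), so nothing has to be cleaned up per list element afterwards. It uses the built-in airquality data as above; the anonymous function is used here in place of transform and is not part of the original example.

```r
aq <- airquality
g <- aq$Month
l <- split(aq, g)

# scale() returns a one-column matrix with "scaled:*" attributes;
# as.vector() strips the dim and the attributes, leaving a plain numeric column
l <- lapply(l, function(d) {
  d$Ozone <- as.vector(scale(d$Ozone))
  d
})

restored <- unsplit(l, g)
str(restored)
```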
