random forest error NA not permitted in predictors

random forest error NA not permitted in predictors - r

Hi I am using the following r script to build a random forest:
# load the necessary libraries
library(randomForest)
testPP<-numeric()
# load the dataset
QdataTrain <- read.csv('train.csv',header = FALSE)
QdataTest <- read.csv('test.csv',header = FALSE)
QdataTrainX <- subset(QdataTrain,select=-V1)
QdataTrainY<-as.factor(QdataTrain$V1)
QdataTestX <- subset(QdataTest,select=-V1)
QdataTestY<-as.factor(QdataTest$V1)
mdl <- randomForest(QdataTrainX, QdataTrainY)
where I am getting the following error:
Error in randomForest.default(QdataTrainX, QdataTrainY) :
NA not permitted in predictors
however i see no occurence of NA in my data.
for reference here is my data:
https://docs.google.com/file/d/0B0iDswLYaZ0zUFFsT01BYlRZU0E/edit
does anyone know why this error is being thrown? I'll keep looking in the mean time.
Thanks in advance for any help!

The given data does contain some missing values (7 in particular):
sapply(QdataTrainX, function(x) sum(is.na(x)))
## V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29
## 0 0 0 0 0 0 1 1 1 1 1 1 1
Therefore columns V23 to V29 have one missing value each
which(is.na(QdataTrainX$V23))
## 318
Gives the row number for that.

Related

Find the Maximum with R

I want to find the maximum in one function. I'll start with one input variable in the function, but the goal is at least 4 variables.
The function is continuous (but for some variables like first variable) if you give it a decimal it will get the same return (example: 99.6,99.8102,100,100.2,100.3,100.4 = 2.28)
In below, I plot the first 500 values (just change VAR1), we can easily see the maximum is x=110 y =3.31
I've tried several approaches, but it seems like something is missing (maybe my knowledge).
The goal is to use at least 4 variables in function and find the maximum (of course it is not feasible to draw all the plots and check which is the maximum).
I used optim (below with one variable just to make it easier explain what I want).
optimize(SMA_R, interval=c(1,500), maximum=TRUE,tol=0.0001)
$maximum 111.5218 #Expecting 110
$objective 3.137111
optimize(SMA_R, interval=c(1,100), maximum=TRUE,tol=0.0001)
$maximum 38.81469 #Expecting 35
$objective 2.370557 #Expecting 2.41
optim(par = c(100), fn = SMA_R,control =list(fnscale=-1))
$par 110 #It work as expected
$value 3.314204
optim(par = c(40), fn = SMA_R,control =list(fnscale=-1))
$par 39 #local maximum expected 35
$value 2.370557`
It evident for one variable I can use Brent or Nelder-Mead.
Now adding the second variable
function(x,y) {optim(par = c(x,y), fn = SMA_R,control =list(fnscale=-1))}
l <- list(VAR1 = c(35,110), VAR2 = c(1,1)) #Inputs 35 and 110 both best local points for VAR1.
par value counts convergence message
V1 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V2 110, 1 3.314204 47, NA 0 NULL
V3 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V4 110, 1 3.314204 47, NA 0 NULL
Adding this second variable slightly increases the output.
I was checking manually for some results to be sure.
After several tries a find better maximum for VAR1=118, VAR2=0.81 RESULT=3.41
If Add more options to VAR2=c(0.5,0.8,1,1.2,1.5), I find the result that I want.
par value counts convergence message
V1 35.0, 0.5 1 5, NA 0 NULL
V2 121.00, -2.25 1 15, NA 0 NULL
V3 37.9257812, 0.9640625 3.140239 33, NA 0 NULL
V4 118.2875977, 0.8107422 3.415413 45, NA 0 NULL
V5 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V6 110, 1 3.314204 47, NA 0 NULL
V7 35.0, 4.7 0.9415069 7, NA 0 NULL
V8 110.4605713, 0.8374512 3.31747 49, NA 0 NULL
V9 35.0, 1.5 0.9415069 3, NA 0 NULL
V10 110.0, 12.5 0.9415069 7, NA 0 NULL
My question is how can you be sure that this is the best Maximum?
Maybe there's a better one and I don't give all the combinations to find the maximum.
Given more options I have V16 with 3.8 a huge increase.
VAR1 = c(20,35,50,70,110,250)
VAR2 = c(0.5,0.8,1,1.2,1.5))
par value counts convergence message
V1 20.0, 0.5 1 5, NA 0 NULL
V2 35.0, 0.5 1 5, NA 0 NULL
V3 50.0, 0.5 1 5, NA 0 NULL
V4 70.0, 0.5 1 5, NA 0 NULL
V5 121.00, -2.25 1 15, NA 0 NULL
V6 274.7558594, 0.5976562 2.718835 43, NA 0 NULL
V7 20.0566406, 0.8957031 2.499993 45, NA 0 NULL
V8 37.9257812, 0.9640625 3.140239 33, NA 0 NULL
V9 54.8144531, 0.9367188 3.679734 55, NA 0 NULL
V10 76.7539062, 0.8546875 3.496945 45, NA 0 NULL
V11 118.2875977, 0.8107422 3.415413 45, NA 0 NULL
V12 269.4213867, 0.6291016 2.718835 47, NA 0 NULL
V13 20.8669434, 0.9204102 2.496346 35, NA 0 NULL
V14 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V15 55.1074219, 0.9414062 3.679734 43, NA 0 NULL
V16 69.1523438, 0.9453125 3.823568 41, NA 0 NULL
V17 110, 1 3.314204 47, NA 0 NULL
V18 262.0971680, 0.6337891 2.718835 45, NA 0 NULL
V19 22.0, -0.8 1 9, NA 0 NULL
V20 35.0, 4.7 0.9415069 7, NA 0 NULL
V21 54.7412109, 0.9363281 3.679734 43, NA 0 NULL
V22 76.2617188, 0.9265625 3.672373 41, NA 0 NULL
V23 110.4605713, 0.8374512 3.31747 49, NA 0 NULL
V24 266.6488528, 0.6399309 2.721007 55, NA 0 NULL
V25 20.0, 1.5 0.9415069 3, NA 0 NULL
V26 35.0, 1.5 0.9415069 3, NA 0 NULL
V27 50.0, 6.5 0.9415069 7, NA 0 NULL
V28 70.0, 8.5 0.9415069 7, NA 0 NULL
V29 110.0, 12.5 0.9415069 7, NA 0 NULL
V30 275.4394531, 0.6210938 2.718835 45, NA 0 NULL
Is there a better way to get the results I want?
Each variable adds a lot of complexity (and processing time) and the results it always depends on my inputs.

R - extract of one character string column for each row: variable-names an values - data cleaning

How can I extract the variables and their values for each row when they are stuck in a character row like {"variable1":value1,"variable2":value2, ...}?
In my df about activities, there is one "info" column (a string with variables following their values) which should be made to columns themselves. The strings are buit like this: {"variable1":value1,"variable2":value2, ...} and NA rows are: {}).
activity_type #There is also the column "activity_type"
94 Running #(what activity the measurement was)
95 Multi Sport
96 Running
98 Walking
info
94 "{\"calories\":2994,\"intensity\":30,\"manual_distance\":0,\"manual_calories\":0,\"hr_average\":60,\"hr_min\":54,\"hr_max\":66,\"hr_zone_0\":86,\"hr_zone_1\":0,\"hr_zone_2\":0,\"hr_zone_3\":0,\"device_startdate\":1612293557,\"device_enddate\":1612315157,\"pause_duration\":0,\"steps\":767,\"distance\":566,\"elevation\":50,\"metcumul\":50}"
95 {}
96 "{\"calories\":2994,\"intensity\":30,\"manual_distance\":0,\"manual_calories\":0,\"hr_average\":52,\"hr_min\":49,\"hr_max\":58,\"hr_zone_0\":59,\"hr_zone_1\":0,\"hr_zone_2\":0,\"hr_zone_3\":0,\"device_startdate\":1612336821,\"device_enddate\":1612358421,\"pause_duration\":0,\"steps\":1162,\"distance\":827,\"elevation\":101,\"metcumul\":101}"
98 "{\"calories\":1208,\"intensity\":30,\"manual_distance\":0,\"manual_calories\":0,\"hr_average\":0,\"hr_min\":0,\"hr_max\":0,\"hr_zone_0\":0,\"hr_zone_1\":0,\"hr_zone_2\":0,\"hr_zone_3\":0,\"device_startdate\":1612457880,\"device_enddate\":1612479480,\"pause_duration\":0,\"steps\":0,\"distance\":0,\"elevation\":0,\"metcumul\":0}"
#what I want in the end
row_number calories intensity manual_distance manual_calories ...
94 2994 30 0 0 ...
95 NA NA 0 0 ...
96 2994 30 0 0 ...
98 1208 30 0 0 ...
I tried:
info2 <-as.data.frame(do.call(rbind, strsplit(infos, ":|,")))
FYI: The sequence of variables are different between rows ("calories", "intensity"...all start similar but the sequences differ). So the resulting df was not "consistent".
I thought the resulting df would be constructed like: one col is one type of variable and the next col the corresponding value :
#I thought:
# V25=steps V27=distance V29=evaluation V31=metcul
V25 V26 V27 V28 V29 V30 V31
1 "steps" 0 "distance" 0 "elevation" 17 "metcumul"
2 "steps" 0 "distance" 0 "elevation" 17 "metcumul"
3 "steps" 2420 "distance" 1971 "elevation" 110 "metcumul"
But as the sequence of variables in each row differ, the result is shifted:
info2 [c(1, 52, 86, 93:95), c(25:36)]
V25 V26 V27 V28 V29 V30
1 "steps" 0 "distance" 0 "elevation" 17
52 "steps" 828 "distance" 536 "elevation" 33
86 "laps" 0 "mvts" 0 "pool_length" 25
93 "steps" 2420 "distance" 1971 "elevation" 110
94 "device_enddate" 1612315157 "pause_duration" 0 "steps" 767
95 {} {} {} {} {} {}
V31 V32 V33 V34 V35 V36
1 "metcumul" 17} {"calories" 123 "intensity" 30
52 "metcumul" 33} {"calories" 30 "intensity" 30
86 "type" 9} {"calories" 1881 "hr_average" 55
93 "metcumul" 110} {"calories" 218 "intensity" 30
94 "distance" 566 "elevation" 50 "metcumul" 50}
95 {} {} {} {} {} {}
FYI: There are even differences within one type of activity (some have more/less variables, so the shorter rows have to repeat the first variables at the end) :
swim2[2:4,c(1,29:34)]
V1 V29 V30 V31 V32 V33 V34
2 {"calories" "pool_length" 25 "version" 0 "type" 3}
3 {"calories" "pool_length" 25 "version" 0 "type" 4}
4 {"calories" "pool_length" 25 "type" 9} {"calories" 1881
walking2[17:19,c(1:2,21:31)]
V1 V2 V21 V22 V23 V24
17 {"calories" 30 "hr_zone_3" 0 "pause_duration" 0
18 {"calories" 13 "hr_zone_3" 0 "pause_duration" 0
19 {"calories" 1208 "hr_zone_3" 0 "device_startdate" 1612457880
V25 V26 V27 V28 V29 V30 V31
17 "steps" 511 "distance" 339 "elevation" 23 "metcumul"
18 "steps" 405 "distance" 282 "elevation" 16 "metcumul"
19 "device_enddate" 1612479480 "pause_duration" 0 "steps" 0 "distance"
Conclusion: How can I extract the variables and their values for each row (when they are stuck in a row like {"variable1":value1,"variable2":value2, ...})?
Maybe with stringr? For example looking for "calories" in the row and then use the next number?
Or maybe I should have used the strsplitin a loop/ apply function?
I am absolutly open to other approachs!
Thanks a lot to all R-cracks!

Here is one idea.
library(tidyverse)
infogroup2 <- infogroup %>%
mutate(ID = 1:n()) %>%
mutate(info = str_remove_all(info, pattern = regex("\\{|\\}"))) %>%
mutate(info = ifelse(info %in% "", NA, info)) %>%
separate_rows(info, sep = ",") %>%
separate(info, into = c("parameter", "value"), sep = ":", convert = TRUE) %>%
mutate(parameter = str_remove_all(parameter, pattern = regex('\\"|\\"'))) %>%
pivot_wider(names_from = "parameter", values_from = "value", values_fill = 0) %>%
select(-all_of(c("NA", "ID")))
DATA
infogroup <- tibble(
activity_type = c("Running", "Multi Sport", "Running", "Walking"),
info = c('{"calories":2994,"intensity":30,"manual_distance":0,"manual_calories":0,"hr_average":60,"hr_min":54,"hr_max":66,"hr_zone_0":86,"hr_zone_1":0,"hr_zone_2":0,"hr_zone_3":0,"device_startdate":1612293557,"device_enddate":1612315157,"pause_duration":0,"steps":767,"distance":566,"elevation":50,"metcumul":50}',
'{}',
'{"calories":2994,"intensity":30,"manual_distance":0,"manual_calories":0,"hr_average":52,"hr_min":49,"hr_max":58,"hr_zone_0":59,"hr_zone_1":0,"hr_zone_2":0,"hr_zone_3":0,"device_startdate":1612336821,"device_enddate":1612358421,"pause_duration":0,"steps":1162,"distance":827,"elevation":101,"metcumul":101}',
'{"calories":1208,"intensity":30,"manual_distance":0,"manual_calories":0,"hr_average":0,"hr_min":0,"hr_max":0,"hr_zone_0":0,"hr_zone_1":0,"hr_zone_2":0,"hr_zone_3":0,"device_startdate":1612457880,"device_enddate":1612479480,"pause_duration":0,"steps":0,"distance":0,"elevation":0,"metcumul":0}'
))

R: read.table seperator every XX characters

Unable to find the solution to my problem.
I am trying to read a text file using r. File contains a single row and separated by number of characters.
000341656.0000000000000000004.6000000000000000009.0000000000000000050.9566787004000000052.0000000000000000072.8621215573000000007.0000000000000000050.0361010830000000047.2490974729000000054.5560183531000000006.0000000000000000049.9711191336000000047.0397111913000000043.1488475260000000023.0000000000000000046.6281588448000000040.1516245487000000038.4653540241000000002.0000000000000000046.2129963899000000041.9963898917000000037.3850068798000000030.0000000000000000046.0144404332000000040.0324909747000000027.0930952140000000003.0000000000000000043.3971119134000000032.4801444043000000010.4757238771
First value is of 20 digit floating.9 digit followed by 10 decimal digit.
The file contains between 22 and 30 values, each 20 digits long (decimal digit set to '.')
What i am unable to figure out how to get rid of this extra 0.
Any lead of help is highly appreciable.

You can read data in a fixed-width format with read.fwf:
> read.fwf("./d.txt", widths=rep(20,30))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 341656 4.6 9 50.95668 52 72.86212 7 50.0361 47.2491 54.55602 6 49.97112
2 341656 4.6 9 50.95668 52 72.86212 7 50.0361 47.2491 54.55602 6 49.97112
V13 V14 V15 V16 V17 V18 V19 V20 V21 V22
1 47.03971 43.14885 23 46.62816 40.15162 38.46535 2 46.213 41.99639 37.38501
2 47.03971 43.14885 23 46.62816 40.15162 38.46535 2 46.213 41.99639 37.38501
V23 V24 V25 V26 V27 V28 V29 V30
1 30 46.01444 40.03249 27.0931 3 43.39711 32.48014 10.47572
2 30 46.01444 40.03249 27.0931 3 43.39711 32.48014 10.47572
You need to know how many fields and how big they are. You didnt say how many lines in the file but I copied your line in twice (hence the duplication).

library(stringi)
stri_match_all_regex(
"000341656.0000000000000000004.6000000000000000009.0000000000000000050.9566787004000000052.0000000000000000072.8621215573000000007.0000000000000000050.0361010830000000047.2490974729000000054.5560183531000000006.0000000000000000049.9711191336000000047.0397111913000000043.1488475260000000023.0000000000000000046.6281588448000000040.1516245487000000038.4653540241000000002.0000000000000000046.2129963899000000041.9963898917000000037.3850068798000000030.0000000000000000046.0144404332000000040.0324909747000000027.0930952140000000003.0000000000000000043.3971119134000000032.4801444043000000010.4757238771",
".{20}"
) %>%
unlist() %>%
as.numeric()
## [1] 341656.00000 4.60000 9.00000 50.95668 52.00000
## [6] 72.86212 7.00000 50.03610 47.24910 54.55602
## [11] 6.00000 49.97112 47.03971 43.14885 23.00000
## [16] 46.62816 40.15162 38.46535 2.00000 46.21300
## [21] 41.99639 37.38501 30.00000 46.01444 40.03249
## [26] 27.09310 3.00000 43.39711 32.48014 10.47572
Also:
as.numeric(readChar("~/Data/20.txt", rep(20, file.size("~/Data/20.txt")/20)))

Skipping columns in R read.table()

I wanna skip the first three columns. Couldn't quite understand the posts about colClasses because I'm new to R.
YDL025C YDL025C 1 -0.1725 -0.5375 -0.4970 -0.3818 -0.5270 -0.4260 -0.6929 -0.4020 -0.3263 -0.3373 -0.3532 -0.2771 -0.2732 -0.3307 -0.4660 -0.4314 -0.3135
YKL032C YKL032C 1 -0.2364 0.0794 0.1678 0.2389 0.3847 0.2625 0.1889 0.2681 0.0363 -0.1992 -0.0521 -0.0307 0.0584 0.2817 0.2239 -0.0253 0.0751

If you have to use read.table and you want to filter on the way in, you can use col.classes as follows. You have 20 columns. Say the first 2 are character, rest are numeric and you want to drop 4,5,6. You construct a vector of length 20 detailing that information. The NULL will not pull in those columns.
x<- read.table(file="datat.txt",
colClasses = c(rep("character", 2),
rep("numeric", 1),
rep("NULL", 3),
rep("numeric", 14)),
header = FALSE)
x
V1 V2 V3 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 YDL025C YDL025C 1 -0.3818 -0.5270 -0.4260 -0.6929 -0.4020 -0.3263 -0.3373 -0.3532 -0.2771 -0.2732 -0.3307 -0.4660 -0.4314 -0.3135
2 YKL032C YKL032C 1 0.2389 0.3847 0.2625 0.1889 0.2681 0.0363 -0.1992 -0.0521 -0.0307 0.0584 0.2817 0.2239 -0.0253 0.0751

As commented above, easier to remove the columns after reading in. For example:
mydf <- read.table("mydf.txt")
Then,
mydf[, 4:ncol(mydf)]
will remove the first 3 columns.

R: format csv file as.data.frame

I have a csv file i'm importing but some columns are not getting the correct formatting. It's very strange and I can't figure it out. The entire top row is formatting the columns as characters, instead of numeric. Believe it is getting the formatting from V1/time1 column?
> dde = read.csv("dde.csv",header=F, sep=",",stringsAsFactors=FALSE)
> dde <- na.omit(dde)
> dde <- as.data.frame(dde)
> time1 = as.POSIXct(strptime(paste(dde$V1, sep=" "),format="%m/%d/%Y %I:%M:%S %p"))
> head(dde)
V1 V2 V3 V4 V5 V6 V7 V8
1 9/7/2014 9:20:00 PM 105.061 136.099 169.961 98.391 96.515 112.802 87.277
2 9/7/2014 9:26:00 PM 105.068 136.074 169.954 98.399 96.521 112.790 87.276
3 9/7/2014 9:31:00 PM 105.078 136.107 170.031 98.414 96.528 112.813 87.287
4 9/7/2014 9:35:00 PM 105.068 136.102 170.001 98.424 96.516 112.789 87.289
5 9/7/2014 9:41:00 PM 105.074 136.109 169.994 98.422 96.519 112.821 87.300
6 9/7/2014 9:45:00 PM 105.091 136.114 170.028 98.420 96.539 112.829 87.302
V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1.29531 0.80054 1.38283 1.40974 1.20601 1.55867 1.61761 0.93644 1.08825
2 1.29503 0.80041 1.38256 1.40949 1.20607 1.55817 1.61749 0.93643 1.08828
3 1.29514 0.80026 1.38256 1.40963 1.20607 1.55828 1.61796 0.93650 1.08832
4 1.29520 0.80038 1.38250 1.40957 1.20594 1.55819 1.61791 0.93666 1.08835
5 1.29517 0.80042 1.38259 1.40965 1.20590 1.55843 1.61777 0.93658 1.08840
6 1.29519 0.80046 1.38275 1.40969 1.20588 1.55860 1.61780 0.93648 1.08834
V18 V19 V20 V21 V22 V23 V24 V25 V26
1 0.93103 0.83073 1.72682 1.77693 1.50608 1.94649 1.01918 0.87190 1.12698
2 0.93106 0.83075 1.72689 1.77693 1.50593 1.94627 1.01912 0.87187 1.12676
3 0.93109 0.83069 1.72704 1.77693 1.50638 1.94661 1.01929 0.87202 1.12684
4 0.93110 0.83082 1.72687 1.77693 1.50645 1.94631 1.01941 0.87213 1.12694
5 0.93101 0.83080 1.72681 1.77693 1.50613 1.94643 1.01934 0.87199 1.12701
6 0.93097 0.83070 1.72706 1.77693 1.50613 1.94680 1.01927 0.87190 1.12696
V27 V28 V29 V30
1 0.85511 0.90400 0.77324 1268.81
2 0.85520 0.90390 0.77332 1268.81
3 0.85517 0.90405 0.77328 1268.81
4 0.85515 0.90415 0.77333 1268.81
5 0.85508 0.90423 0.77344 1268.81
6 0.85513 0.90412 0.77334 1268.81
> V22 = xts(dde$V22, order.by=time1)
> V22 <-to.minutes(V22[,1],240,'minutes')
> V22 <- align.time(xts(V22),5 * 60)
>
> V2 = xts(dde$V2, order.by=time1)
> V2 <-to.minutes(V2[,1],240,'minutes')
Error in to.period(x, "minutes", k = k, name = name, ...) :
unsupported type
> V2 <- align.time(xts(V2),5 * 60)
>
> class(dde$V22)
[1] "numeric"
> class(dde$V2)
[1] "character"
> typeof(dde$V22)
[1] "double"
> typeof(dde$V2)
[1] "character"

When I do the things you said you did in the comments, it works for me
dde <- read.csv("dde.csv", header=FALSE, stringsAsFactors=FALSE)
dde <- na.omit(dde)
colnames(dde)[1] <- "time"
colnames(dde)[2] <- "test1"
time1 = as.POSIXct(strptime(paste(dde$time, sep=" "),format="%m/%d/%Y %I:%M:%S %p"))
test1 = xts(dde[, c("test1")], order.by=time1)
test1 <- to.minutes(test1[,1],240,'minutes')
test1 <- align.time(xts(test1),5 * 60)
test1
# minutes.Open minutes.High minutes.Low minutes.Close
#2014-09-07 21:30:00 105.061 105.068 105.061 105.068

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

random forest error NA not permitted in predictors - r

Related

Find the Maximum with R

R - extract of one character string column for each row: variable-names an values - data cleaning

R: read.table seperator every XX characters

Skipping columns in R read.table()

R: format csv file as.data.frame

Categories

Resources