I want to find the maximum of a function. I'll start with one input variable, but the goal is at least 4 variables.
The function is continuous, but for some variables (like the first one) decimal inputs return the same value (example: 99.6, 99.8102, 100, 100.2, 100.3, 100.4 all return 2.28).
Below, I plot the first 500 values (changing only VAR1); we can easily see the maximum is at x = 110, y = 3.31.
I've tried several approaches, but something seems to be missing (maybe in my knowledge).
The goal is to use at least 4 variables in the function and find the maximum (of course it is not feasible to draw all the plots and check which one holds the maximum).
I used optim (below with one variable, just to make it easier to explain what I want).
optimize(SMA_R, interval=c(1,500), maximum=TRUE,tol=0.0001)
$maximum 111.5218 #Expecting 110
$objective 3.137111
optimize(SMA_R, interval=c(1,100), maximum=TRUE,tol=0.0001)
$maximum 38.81469 #Expecting 35
$objective 2.370557 #Expecting 2.41
optim(par = c(100), fn = SMA_R,control =list(fnscale=-1))
$par 110 #It works as expected
$value 3.314204
optim(par = c(40), fn = SMA_R,control =list(fnscale=-1))
$par 39 #local maximum; expected 35
$value 2.370557
It is evident that for one variable I can use Brent or Nelder-Mead.
Now I add the second variable:
function(x,y) {optim(par = c(x,y), fn = SMA_R,control =list(fnscale=-1))}
l <- list(VAR1 = c(35,110), VAR2 = c(1,1)) #Inputs 35 and 110, the two best local points for VAR1.
par value counts convergence message
V1 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V2 110, 1 3.314204 47, NA 0 NULL
V3 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V4 110, 1 3.314204 47, NA 0 NULL
Adding this second variable slightly increases the output.
I checked some results manually to be sure.
After several tries I found a better maximum at VAR1 = 118, VAR2 = 0.81, RESULT = 3.41.
If I add more options, VAR2 = c(0.5, 0.8, 1, 1.2, 1.5), I find the result that I want.
par value counts convergence message
V1 35.0, 0.5 1 5, NA 0 NULL
V2 121.00, -2.25 1 15, NA 0 NULL
V3 37.9257812, 0.9640625 3.140239 33, NA 0 NULL
V4 118.2875977, 0.8107422 3.415413 45, NA 0 NULL
V5 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V6 110, 1 3.314204 47, NA 0 NULL
V7 35.0, 4.7 0.9415069 7, NA 0 NULL
V8 110.4605713, 0.8374512 3.31747 49, NA 0 NULL
V9 35.0, 1.5 0.9415069 3, NA 0 NULL
V10 110.0, 12.5 0.9415069 7, NA 0 NULL
My question is: how can you be sure that this is the best (global) maximum?
Maybe there's a better one and I simply didn't supply the combinations of starting points needed to find it.
Given more options I get V16 with 3.8, a huge increase.
l <- list(VAR1 = c(20,35,50,70,110,250),
          VAR2 = c(0.5,0.8,1,1.2,1.5))
par value counts convergence message
V1 20.0, 0.5 1 5, NA 0 NULL
V2 35.0, 0.5 1 5, NA 0 NULL
V3 50.0, 0.5 1 5, NA 0 NULL
V4 70.0, 0.5 1 5, NA 0 NULL
V5 121.00, -2.25 1 15, NA 0 NULL
V6 274.7558594, 0.5976562 2.718835 43, NA 0 NULL
V7 20.0566406, 0.8957031 2.499993 45, NA 0 NULL
V8 37.9257812, 0.9640625 3.140239 33, NA 0 NULL
V9 54.8144531, 0.9367188 3.679734 55, NA 0 NULL
V10 76.7539062, 0.8546875 3.496945 45, NA 0 NULL
V11 118.2875977, 0.8107422 3.415413 45, NA 0 NULL
V12 269.4213867, 0.6291016 2.718835 47, NA 0 NULL
V13 20.8669434, 0.9204102 2.496346 35, NA 0 NULL
V14 36.0048828, 0.9589844 3.148596 39, NA 0 NULL
V15 55.1074219, 0.9414062 3.679734 43, NA 0 NULL
V16 69.1523438, 0.9453125 3.823568 41, NA 0 NULL
V17 110, 1 3.314204 47, NA 0 NULL
V18 262.0971680, 0.6337891 2.718835 45, NA 0 NULL
V19 22.0, -0.8 1 9, NA 0 NULL
V20 35.0, 4.7 0.9415069 7, NA 0 NULL
V21 54.7412109, 0.9363281 3.679734 43, NA 0 NULL
V22 76.2617188, 0.9265625 3.672373 41, NA 0 NULL
V23 110.4605713, 0.8374512 3.31747 49, NA 0 NULL
V24 266.6488528, 0.6399309 2.721007 55, NA 0 NULL
V25 20.0, 1.5 0.9415069 3, NA 0 NULL
V26 35.0, 1.5 0.9415069 3, NA 0 NULL
V27 50.0, 6.5 0.9415069 7, NA 0 NULL
V28 70.0, 8.5 0.9415069 7, NA 0 NULL
V29 110.0, 12.5 0.9415069 7, NA 0 NULL
V30 275.4394531, 0.6210938 2.718835 45, NA 0 NULL
Is there a better way to get the results I want?
Each variable adds a lot of complexity (and processing time), and the result always depends on the starting values I supply.
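One way to make the multi-start idea more systematic is sketched below (a rough illustration only, assuming the two-variable SMA_R from above; the grid ranges are my own guesses): generate the starting points with expand.grid(), run optim() from each one, and keep the best local optimum. This still cannot prove the result is the global maximum, but it automates the "give all the combinations" step.
# Multi-start Nelder-Mead over a grid of starting points (sketch; ranges are assumptions)
starts <- expand.grid(VAR1 = seq(10, 300, by = 20),
                      VAR2 = seq(0.5, 1.5, by = 0.25))
fits <- lapply(seq_len(nrow(starts)), function(i) {
  optim(par = c(starts$VAR1[i], starts$VAR2[i]),
        fn = SMA_R, control = list(fnscale = -1))
})
values <- vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.max(values)]]
best$par    # parameters of the best local maximum found
best$value  # corresponding function value
For a more global search there are dedicated packages such as DEoptim (differential evolution; it minimizes, so you would pass function(p) -SMA_R(p)), but whether that pays off depends on how rough the surface of SMA_R is.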
How can I extract the variables and their values for each row when they are stuck in a character string like {"variable1":value1,"variable2":value2, ...}?
In my df about activities, there is one "info" column (a string with variable names followed by their values) which should be turned into columns of its own. The strings are built like this: {"variable1":value1,"variable2":value2, ...}, and NA rows are {}.
activity_type #There is also the column "activity_type"
94 Running #(what activity the measurement was)
95 Multi Sport
96 Running
98 Walking
info
94 "{\"calories\":2994,\"intensity\":30,\"manual_distance\":0,\"manual_calories\":0,\"hr_average\":60,\"hr_min\":54,\"hr_max\":66,\"hr_zone_0\":86,\"hr_zone_1\":0,\"hr_zone_2\":0,\"hr_zone_3\":0,\"device_startdate\":1612293557,\"device_enddate\":1612315157,\"pause_duration\":0,\"steps\":767,\"distance\":566,\"elevation\":50,\"metcumul\":50}"
95 {}
96 "{\"calories\":2994,\"intensity\":30,\"manual_distance\":0,\"manual_calories\":0,\"hr_average\":52,\"hr_min\":49,\"hr_max\":58,\"hr_zone_0\":59,\"hr_zone_1\":0,\"hr_zone_2\":0,\"hr_zone_3\":0,\"device_startdate\":1612336821,\"device_enddate\":1612358421,\"pause_duration\":0,\"steps\":1162,\"distance\":827,\"elevation\":101,\"metcumul\":101}"
98 "{\"calories\":1208,\"intensity\":30,\"manual_distance\":0,\"manual_calories\":0,\"hr_average\":0,\"hr_min\":0,\"hr_max\":0,\"hr_zone_0\":0,\"hr_zone_1\":0,\"hr_zone_2\":0,\"hr_zone_3\":0,\"device_startdate\":1612457880,\"device_enddate\":1612479480,\"pause_duration\":0,\"steps\":0,\"distance\":0,\"elevation\":0,\"metcumul\":0}"
#what I want in the end
row_number calories intensity manual_distance manual_calories ...
94 2994 30 0 0 ...
95 NA NA 0 0 ...
96 2994 30 0 0 ...
98 1208 30 0 0 ...
I tried:
info2 <-as.data.frame(do.call(rbind, strsplit(infos, ":|,")))
FYI: The sequence of variables differs between rows ("calories", "intensity"... they all start similarly, but the sequences diverge), so the resulting df was not "consistent".
I thought the resulting df would be constructed like this: one column is the variable name and the next column the corresponding value:
#I thought:
# V25=steps V27=distance V29=elevation V31=metcumul
V25 V26 V27 V28 V29 V30 V31
1 "steps" 0 "distance" 0 "elevation" 17 "metcumul"
2 "steps" 0 "distance" 0 "elevation" 17 "metcumul"
3 "steps" 2420 "distance" 1971 "elevation" 110 "metcumul"
But as the sequence of variables in each row differs, the result is shifted:
info2 [c(1, 52, 86, 93:95), c(25:36)]
V25 V26 V27 V28 V29 V30
1 "steps" 0 "distance" 0 "elevation" 17
52 "steps" 828 "distance" 536 "elevation" 33
86 "laps" 0 "mvts" 0 "pool_length" 25
93 "steps" 2420 "distance" 1971 "elevation" 110
94 "device_enddate" 1612315157 "pause_duration" 0 "steps" 767
95 {} {} {} {} {} {}
V31 V32 V33 V34 V35 V36
1 "metcumul" 17} {"calories" 123 "intensity" 30
52 "metcumul" 33} {"calories" 30 "intensity" 30
86 "type" 9} {"calories" 1881 "hr_average" 55
93 "metcumul" 110} {"calories" 218 "intensity" 30
94 "distance" 566 "elevation" 50 "metcumul" 50}
95 {} {} {} {} {} {}
FYI: There are even differences within one type of activity (some rows have more/fewer variables, so the shorter rows repeat the first variables at the end):
swim2[2:4,c(1,29:34)]
V1 V29 V30 V31 V32 V33 V34
2 {"calories" "pool_length" 25 "version" 0 "type" 3}
3 {"calories" "pool_length" 25 "version" 0 "type" 4}
4 {"calories" "pool_length" 25 "type" 9} {"calories" 1881
walking2[17:19,c(1:2,21:31)]
V1 V2 V21 V22 V23 V24
17 {"calories" 30 "hr_zone_3" 0 "pause_duration" 0
18 {"calories" 13 "hr_zone_3" 0 "pause_duration" 0
19 {"calories" 1208 "hr_zone_3" 0 "device_startdate" 1612457880
V25 V26 V27 V28 V29 V30 V31
17 "steps" 511 "distance" 339 "elevation" 23 "metcumul"
18 "steps" 405 "distance" 282 "elevation" 16 "metcumul"
19 "device_enddate" 1612479480 "pause_duration" 0 "steps" 0 "distance"
Conclusion: how can I extract the variables and their values for each row when they are stuck in a string like {"variable1":value1,"variable2":value2, ...}?
Maybe with stringr? For example, looking for "calories" in the row and then using the next number?
Or maybe I should have used strsplit in a loop / apply function?
I am absolutely open to other approaches!
Thanks a lot to all R experts!
Here is one idea.
library(tidyverse)
infogroup2 <- infogroup %>%
  mutate(ID = 1:n()) %>%
  mutate(info = str_remove_all(info, pattern = regex("\\{|\\}"))) %>%
  mutate(info = ifelse(info %in% "", NA, info)) %>%
  separate_rows(info, sep = ",") %>%
  separate(info, into = c("parameter", "value"), sep = ":", convert = TRUE) %>%
  mutate(parameter = str_remove_all(parameter, pattern = regex('\\"'))) %>%
  pivot_wider(names_from = "parameter", values_from = "value", values_fill = 0) %>%
  select(-all_of(c("NA", "ID")))
DATA
infogroup <- tibble(
activity_type = c("Running", "Multi Sport", "Running", "Walking"),
info = c('{"calories":2994,"intensity":30,"manual_distance":0,"manual_calories":0,"hr_average":60,"hr_min":54,"hr_max":66,"hr_zone_0":86,"hr_zone_1":0,"hr_zone_2":0,"hr_zone_3":0,"device_startdate":1612293557,"device_enddate":1612315157,"pause_duration":0,"steps":767,"distance":566,"elevation":50,"metcumul":50}',
'{}',
'{"calories":2994,"intensity":30,"manual_distance":0,"manual_calories":0,"hr_average":52,"hr_min":49,"hr_max":58,"hr_zone_0":59,"hr_zone_1":0,"hr_zone_2":0,"hr_zone_3":0,"device_startdate":1612336821,"device_enddate":1612358421,"pause_duration":0,"steps":1162,"distance":827,"elevation":101,"metcumul":101}',
'{"calories":1208,"intensity":30,"manual_distance":0,"manual_calories":0,"hr_average":0,"hr_min":0,"hr_max":0,"hr_zone_0":0,"hr_zone_1":0,"hr_zone_2":0,"hr_zone_3":0,"device_startdate":1612457880,"device_enddate":1612479480,"pause_duration":0,"steps":0,"distance":0,"elevation":0,"metcumul":0}'
))
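Since the info strings are themselves valid JSON, another idea (my own sketch, assuming the jsonlite package is installed) is to parse each entry with fromJSON() and bind the resulting one-row tibbles; the "{}" entries come back as all-NA rows:
library(tidyverse)
library(jsonlite)
rows <- lapply(infogroup$info, function(s) {
  parsed <- fromJSON(s)
  if (length(parsed) == 0) return(tibble(.rows = 1))  # "{}" -> placeholder row, all NA after binding
  as_tibble(parsed)                                    # named list of scalars -> one-row tibble
})
infogroup_json <- bind_cols(select(infogroup, activity_type), bind_rows(rows))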
I am trying to get a frequency table for one dataset ("sim") using the intervals and classes from another dataset ("obs") (both of the same type). I've tried using the table() function in R, but it doesn't give me the frequencies of "sim" within the "obs" intervals. There may be data that falls outside the range defined by "obs"; the idea is that those values are omitted. Is there a simple way to get the frequency table for this case?
Here is a sample of my data (vector):
X obs sim
1 1 11.2 8.44
2 2 22.5 15.51
3 3 26.0 20.08
4 4 28.1 23.57
5 5 29.0 26.46
6 6 29.5 28.95
...etc...
Here are my lines of code:
# Set working directory
setwd("C:/Users/...")
# Vector has 2 set of data, "obs" and "sim"
vector <- read.csv("vector.csv", fileEncoding = 'UTF-8-BOM')
# Divide the range of "obs" into intervals, using Sturges for number of classes:
factor_obs <- cut(vector$obs, breaks=nclass.Sturges(vector$obs), include.lowest = T)
# Get a frequency table using the table() function for "obs"
obs_out <- as.data.frame(table(factor_obs))
obs_out <- transform(obs_out, cumFreq = cumsum(Freq), relative = prop.table(Freq))
# Get a frequency table using the table() function for "sim", using cut from "obs"
sim_out <- as.data.frame(table(factor_obs, vector$sim > 0))
This is what I get from "obs" frequency table:
> obs_out
factor_obs Freq cumFreq relative
1 [11.1,25.6] 2 2 0.04166667
2 (25.6,40.1] 10 12 0.20833333
3 (40.1,54.5] 17 29 0.35416667
4 (54.5,69] 4 33 0.08333333
5 (69,83.4] 8 41 0.16666667
6 (83.4,97.9] 5 46 0.10416667
7 (97.9,112] 2 48 0.04166667
This is what I get from "sim" frequency table:
> sim_out
factor_obs Var2 Freq
1 [11.1,25.6] TRUE 2
2 (25.6,40.1] TRUE 10
3 (40.1,54.5] TRUE 17
4 (54.5,69] TRUE 4
5 (69,83.4] TRUE 8
6 (83.4,97.9] TRUE 5
7 (97.9,112] TRUE 2
These are the same frequencies as in the "obs" table.
The idea is that the elements of "sim" falling into each interval defined by the "obs" classes are counted, and that extreme values outside the "obs" range are omitted.
It would be helpful if someone could guide me. Thanks a lot!
You will need to define your own breakpoints, because if you let cut() choose them the break values are not saved for you to reuse with the sim variable. First, use dput(vector) to put the data in a simple form for R:
vector <- structure(list(X = 1:48, obs = c(11.2, 22.5, 26, 28.1, 29, 29.5,
30.8, 32, 33.5, 35, 35.5, 38.9, 41, 41, 41, 43, 43.51, 44, 46,
48.5, 50, 50, 50, 50, 50.8, 51.5, 51.5, 53, 54.4, 55, 57.5, 59.5,
66.9, 70.6, 74.2, 75, 77, 80.2, 81.5, 82, 83, 83.6, 85, 85.1,
93.8, 94, 106.7, 112.3), sim = c(8.44, 15.51, 20.08, 23.57, 26.46,
28.95, 31.16, 33.17, 35.02, 36.75, 38.37, 39.92, 41.39, 42.81,
44.19, 45.52, 46.82, 48.09, 49.34, 50.56, 51.78, 52.98, 54.18,
55.37, 56.55, 57.75, 58.94, 60.14, 61.36, 62.59, 63.83, 65.1,
66.4, 67.74, 69.11, 70.53, 72.01, 73.55, 75.18, 76.9, 78.75,
80.76, 82.98, 85.46, 88.35, 91.84, 96.41, 103.48)), class = "data.frame",
row.names = c(NA, -48L))
Now we need the number of categories and the breakpoints:
nbreaks <- nclass.Sturges(vector$obs)
minval <- min(vector$obs)
maxval <- max(vector$obs)
int <- round((maxval - minval) / nbreaks, 3) # round to one digit more than obs or sim
brks <- c(minval, minval + seq(nbreaks-1) * int, maxval)
The table for the obs data:
factor_obs <- cut(vector$obs, breaks=brks, include.lowest=TRUE)
obs_out <- transform(table(factor_obs), cumFreq = cumsum(Freq), relative = prop.table(Freq))
print(obs_out, digits=3)
# factor_obs Freq cumFreq relative
# 1 [11.2,25.6] 2 2 0.0417
# 2 (25.6,40.1] 10 12 0.2083
# 3 (40.1,54.5] 17 29 0.3542
# 4 (54.5,69] 4 33 0.0833
# 5 (69,83.4] 8 41 0.1667
# 6 (83.4,97.9] 5 46 0.1042
# 7 (97.9,112] 2 48 0.0417
Now the sim data:
factor_sim <- cut(vector$sim, breaks=brks, include.lowest=TRUE)
sim_out <- transform(table(factor_sim), cumFreq = cumsum(Freq), relative = prop.table(Freq))
print(sim_out, digits=3)
# factor_sim Freq cumFreq relative
# 1 [11.2,25.6] 3 3 0.0638
# 2 (25.6,40.1] 8 11 0.1702
# 3 (40.1,54.5] 11 22 0.2340
# 4 (54.5,69] 11 33 0.2340
# 5 (69,83.4] 9 42 0.1915
# 6 (83.4,97.9] 4 46 0.0851
# 7 (97.9,112] 1 47 0.0213
Notice there are only 47 cases shown instead of 48, since one sim value is below the minimum. A cross-tabulation of the two factors shows where the counts land:
addmargins(table(factor_obs, factor_sim, useNA="ifany"))
# factor_sim
# factor_obs [11.2,25.6] (25.6,40.1] (40.1,54.5] (54.5,69] (69,83.4] (83.4,97.9] (97.9,112] <NA> Sum
# [11.2,25.6] 1 0 0 0 0 0 0 1 2
# (25.6,40.1] 2 8 0 0 0 0 0 0 10
# (40.1,54.5] 0 0 11 6 0 0 0 0 17
# (54.5,69] 0 0 0 4 0 0 0 0 4
# (69,83.4] 0 0 0 1 7 0 0 0 8
# (83.4,97.9] 0 0 0 0 2 3 0 0 5
# (97.9,112] 0 0 0 0 0 1 1 0 2
# Sum 3 8 11 11 9 4 1 1 48
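For what it's worth, a related sketch: hist() with plot = FALSE also returns the breakpoints it computes, so they can be reused for sim without building them by hand (the counts can differ slightly from the manual breaks above because hist() prettifies the endpoints):
# Let hist() choose Sturges-style breaks for obs, then reuse them for sim
h <- hist(vector$obs, breaks = "Sturges", plot = FALSE)
factor_sim2 <- cut(vector$sim, breaks = h$breaks, include.lowest = TRUE)
table(factor_sim2)  # sim values outside the obs breaks become NA and are dropped by default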
My dataset is like this:
d <- read.table('age.txt', header = F,sep=' ')
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 101 12 3.531704 16.0 40.8 1.449648 1.080353 20.85738 74.53056 0
2 102 15 -9.000000 24.0 36.4 -9.000000 -9.000000 -9.00000 -9.00000 0
3 103 13 3.306023 26.2 48.4 2.178820 1.349228 22.51904 72.82571 2.3
4 104 12 2.715226 18.2 42.6 2.343138 1.414314 23.13632 72.73414 4.5
and I need to perform a log10 transform on columns 6:10, but only for the values that are not equal to 0 or -9. Well, I tried this:
if(d[,6:10]!=-9 || 0){d[,6:10]=log10(d[,6:10])}
but it did not work. If anyone can help, thanks.
log2 or log10? Assuming log10:
m <- as.matrix(df[6:10])
df[6:10] <- ifelse(m > 0, log10(m), m)
One option would be to loop through columns 6:10, get the index of elements that are not 0 or -9, apply log10 on those, and return the vector:
d[6:10] <- lapply(d[6:10], function(x) {
  i1 <- !x %in% c(0, -9)
  x[i1] <- log10(x[i1])
  x
})
Or another option would be to create a logical matrix ('i1'), subset the elements from those columns, and update them with the log10:
i1 <- d[6:10]!=0 & d[6:10] != -9
d[6:10][i1] <- log10(d[6:10][i1])
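If the data is already in a dplyr pipeline, a rough equivalent (assuming dplyr 1.0+ for across() and the column names V6:V10 shown above) would be:
library(dplyr)
d <- d %>%
  mutate(across(V6:V10, ~ ifelse(.x %in% c(0, -9), .x, log10(.x))))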
Hi, I am using the following R script to build a random forest:
# load the necessary libraries
library(randomForest)
testPP<-numeric()
# load the dataset
QdataTrain <- read.csv('train.csv',header = FALSE)
QdataTest <- read.csv('test.csv',header = FALSE)
QdataTrainX <- subset(QdataTrain,select=-V1)
QdataTrainY<-as.factor(QdataTrain$V1)
QdataTestX <- subset(QdataTest,select=-V1)
QdataTestY<-as.factor(QdataTest$V1)
mdl <- randomForest(QdataTrainX, QdataTrainY)
where I am getting the following error:
Error in randomForest.default(QdataTrainX, QdataTrainY) :
NA not permitted in predictors
However, I see no occurrence of NA in my data.
For reference, here is my data:
https://docs.google.com/file/d/0B0iDswLYaZ0zUFFsT01BYlRZU0E/edit
Does anyone know why this error is being thrown? I'll keep looking in the meantime.
Thanks in advance for any help!
The given data does contain some missing values (7, in fact):
sapply(QdataTrainX, function(x) sum(is.na(x)))
## V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29
## 0 0 0 0 0 0 1 1 1 1 1 1 1
So columns V23 to V29 have one missing value each; for example,
which(is.na(QdataTrainX$V23))
## 318
gives the row number of the missing value in V23.
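As a possible way forward (a sketch, not necessarily the best treatment of the missing values): either drop the incomplete rows before fitting, or use the simple median imputation that ships with randomForest, na.roughfix():
# Option 1: drop rows with any missing predictor
ok  <- complete.cases(QdataTrainX)
mdl <- randomForest(QdataTrainX[ok, ], QdataTrainY[ok])
# Option 2: impute missing predictor values with column medians (na.roughfix from randomForest)
mdl <- randomForest(na.roughfix(QdataTrainX), QdataTrainY)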
Is there a variant of lag somewhere that keeps NAs in position? I want to compute returns of price data where data could be missing.
Col 1 is the price data
Col 2 is the lag of price
Col 3 shows p - lag(p); the return from 99 to 104 is effectively missed, so the path length of the computed returns will differ from the true path.
Col 4 shows the lag with NA position preserved
Col 5 shows the new difference - now the return of 5 for 2009-11-07 is available
Cheers, Dave
x <- xts(c(100, 101, 97, 95, 99, NA, 104, 103, 103, 100), as.Date("2009-11-01") + 0:9)
# fake the lag I want, with NA kept in position
x.pos.lag <- lag.xts(x)
x.pos.lag['2009-11-07']=99
x.pos.lag['2009-11-06']=NA
cbind(x, lag.xts(x), x - lag.xts(x), x.pos.lag, x-x.pos.lag)
..1 ..2 ..3 ..4 ..5
2009-11-01 100 NA NA NA NA
2009-11-02 101 100 1 100 1
2009-11-03 97 101 -4 101 -4
2009-11-04 95 97 -2 97 -2
2009-11-05 99 95 4 95 4
2009-11-06 NA 99 NA NA NA
2009-11-07 104 NA NA 99 5
2009-11-08 103 104 -1 104 -1
2009-11-09 103 103 0 103 0
2009-11-10 100 103 -3 103 -3
There is no function that does this natively in R, but you can record the original NA positions and then swap the values around after the lag.
x <- xts(c(100, 101, 97, 95, 99, NA, 104, 103, 103, 100), as.Date("2009-11-01") + 0:9)
lag.xts.na <- function(x, ...) {
  na.idx <- which(is.na(x))          # remember where the NAs were
  x2 <- lag.xts(x, ...)
  x2[na.idx + 1, ] <- x2[na.idx, ]   # push the lagged value past each NA
  x2[na.idx, ] <- NA                 # restore the NA in its original position
  return(x2)
}
lag.xts.na(x)
[,1]
2009-11-01 NA
2009-11-02 100
2009-11-03 101
2009-11-04 97
2009-11-05 95
2009-11-06 NA
2009-11-07 99
2009-11-08 104
2009-11-09 103
2009-11-10 103
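With that helper, the return column the question asks for is just the difference (reusing the same x as above):
# NA position preserved: the 2009-11-07 return (104 - 99 = 5) is now available
cbind(x, lag.xts.na(x), x - lag.xts.na(x))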
Incidentally, are you just trying to deal with weekends/holidays or something along that line? If so, you might consider dropping those positions from your series; that will dramatically simplify things for you. Alternatively, the timeSeries package in Rmetrics has a number of functions to deal with business days.