Filter out columns in R [closed] - r

Closed 9 years ago.
Following up on the post "Filtering out columns in R": the columns consisting entirely of 1's or entirely of 0's were successfully eliminated from the training_data. However, the classification algorithm still complains about columns where almost all values are 0 (every value in the column is 0 except for one or two).
I am using the penalizedSVM R package to perform feature selection. Looking more closely at the data set, the function svm.fs complains about the columns where all values are 0 except for one or two.
How can one modify (or add something to) the following code to filter out these columns as well?
lambda1.scad<-c(seq(0.01, 0.05, .01), seq(0.1, 0.5, 0.2), 1)
lambda1.scad<-lambda1.scad[2:3]
seed <- 123
f0 <- function(x) any(x!=1) & any(x!=0) & is.numeric(x)
trainingdata <- lapply(trainingdata, function(data) cbind(label=data$label,
colwise(identity, f0)(data)))
datax <- trainingdata[[1]]
levels(datax$label) <- c(-1, 1)
train_x<-datax[, -1]
train_x<-data.matrix(train_x)
trainy<-datax[, 1]
idx <- is.na(train_x) | is.infinite(train_x)
train_x[idx] <- 0
tryCatch(scad.fix<-svm.fs(train_x, y=trainy, fs.method="scad",
cross.outer=0, grid.search="discrete",
lambda1.set=lambda1.scad, parms.coding="none",
show="none", maxIter=1000, inner.val.method="cv",
cross.inner=5, seed=seed, verbose=FALSE), error=function(e) e)
Or one may propose an entirely different solution.

Use the fact that boolean values can be summed and define some tolerance of zeros:
sum(x == 0) / length(x) >= tolerance
This becomes your condition for dropping a column. However, zeros are often not only valid data but critical to the phenomenon being studied. You should think carefully about your algorithm choice and the decision to drop columns before going forward with this approach.
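As a minimal sketch of how that condition could be applied column by column (the helper name mostly_zero, the example data frame, and the tolerance of 0.75 are all made up for illustration and are not from the question):

```r
# Hypothetical helper: TRUE when the share of zeros in x reaches `tolerance`.
# mean() of a logical vector is the proportion of TRUE values,
# i.e. the same as sum(x == 0) / length(x).
mostly_zero <- function(x, tolerance = 0.75) {
  mean(x == 0, na.rm = TRUE) >= tolerance
}

# Toy data: column a is zero except for one value, column c is all zeros.
df <- data.frame(
  a = c(0, 0, 0, 0, 1),
  b = c(1, 2, 0, 4, 5),
  c = c(0, 0, 0, 0, 0)
)

# Flag the columns to drop, then keep the rest.
to_drop  <- vapply(df, mostly_zero, logical(1))
filtered <- df[, !to_drop, drop = FALSE]
names(filtered)
```

With this toy data only column b survives. The tolerance is a judgment call and would need to be tuned to the real data set.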

Related

what is the meaning of these data.table expressions? [closed]

Closed 11 months ago.
Could someone tell me what these two data.table expressions do?
dt[,R_Fuel:=c(0, diff(dt[, Fuel]))]
dt[R_Fuel < 0 | R_Fuel > 5, R_Fuel:=NA]
That is the data.table parlance/dialect.
In data.table, assignments should be done inside of the [ brackets, and instead of the typical R assignment operators <-/=, one needs to use :=. Your first line is equivalent to
dt$R_Fuel <- c(0, diff(dt$Fuel))
However, the original line is not "good" data.table code: the inner dt[, Fuel] is unnecessary, because columns can be referenced by name inside the brackets. It should be just
dt[, R_Fuel := c(0, diff(Fuel))]
If you're curious about what the R code itself is doing, diff(.) returns the differences between consecutive values of a vector. Because each difference needs two neighbouring elements, a vector of length n yields a result of length n - 1. Since data.frames (and data.tables) require that all columns have the same number of elements, the diffs need to have one value padded; in this case, pre-padded with 0.
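A quick base-R illustration of that length difference and the padding (the fuel values are invented for the example):

```r
# Made-up fuel readings, for illustration only.
fuel <- c(10, 12, 11, 20, 21)

d <- diff(fuel)        # differences between consecutive values
length(d)              # one element shorter than fuel

padded <- c(0, d)      # pre-pad with 0 so it lines up with the rows
length(padded) == length(fuel)
```

Here d is 2, -1, 9, 1, and padded has the same length as fuel, so it can be stored as a column next to it.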
Similar to base-R, when using [i,j]-notation, the i is a row-selector. Unlike base R, though, when j includes an assignment (as both of your expressions do), then the i-component does not subset the data in the return, it just changes which rows get the calculation. The second expression is similar to any of the following (R-basic and data.table-canonical versions, generally equivalent):
## basic R
dt$R_Fuel <- ifelse(dt$R_Fuel < 0 | dt$R_Fuel > 5, NA, dt$R_Fuel)
## canonical data.table
dt[, R_Fuel := ifelse(R_Fuel < 0 | R_Fuel > 5, NA, R_Fuel)]
## canonical data.table using the preferred `fifelse`
dt[, R_Fuel := fifelse(R_Fuel < 0 | R_Fuel > 5, NA_real_, R_Fuel)]
FYI, it might be more readable to use between here:
dt[ !between(R_Fuel, 0, 5), R_Fuel := NA ]
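Putting both expressions together on a toy table may make the behaviour easier to see. This is only a sketch: it assumes the data.table package is installed, and the Fuel values are invented.

```r
library(data.table)

# Made-up data for illustration.
dt <- data.table(Fuel = c(10, 12, 11, 20, 21))

# Step 1: row-over-row change, pre-padded with 0 for the first row.
dt[, R_Fuel := c(0, diff(Fuel))]

# Step 2: i selects the rows to modify; := assigns in place.
# No rows are removed, only the selected rows are overwritten with NA.
dt[!between(R_Fuel, 0, 5), R_Fuel := NA]

nrow(dt)       # still 5 rows
dt$R_Fuel      # changes outside [0, 5] are now NA
```

Rows 3 and 4 (changes of -1 and 9) end up as NA, while the table keeps all five rows.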
The first expression creates a new column (R_Fuel) in the data.table dt, which holds the row-over-row change (see ?diff) in the values of Fuel in dt. Since there is no difference for the first row, 0 is prepended to the set of differences. It would be better to write dt[,R_Fuel:=c(0,diff(Fuel))].
The second line then sets the new column R_Fuel to NA in all rows where R_Fuel is less than 0 or greater than 5.

is.null function does not identify NULL values [closed]

Closed 1 year ago.
I want to calculate the logarithm of total assets, and I want R to set the result to NULL if the total assets element itself is 0. Otherwise, R will show me -Inf as the result.
I have tried to solve this problem as follows:
at <- c(0.028, 0, 0, 0, 0, 0)
name <- c("comp 1","comp2", "comp3", "comp4", "comp5", "comp6")
df <- as.data.frame(cbind(name, at))
df$log_at <- ifelse(is.null(at), 0, log(at))
However, if you run this line of code, the log_at column shows -3.575551 in every row of the data frame, instead of 0 for the rows where total assets are NULL. Also, I wanted to know how many NULL values are in the total assets column, but:
sum(is.null(df$at))
gives me 0, so I am wondering why R does not identify the NULL values as such, since the values in the total assets column are all numeric.
I know I can use the following as an alternative, since the amounts of total assets are pretty big, but I am wondering why the code above does not work.
df$at <- as.numeric(df$at)
df$log_at <- log(df$at + 1)
I hope someone can help me out !
NULL is not 0. Check:
is.null(df$at)
[1] FALSE
So if you want to transform at using ifelse, then it should be:
df$log_at <- ifelse(at == 0, 0, log(at))
Likewise, to find out how often at is 0:
sum(df$at == 0)
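To see why the original attempt filled every row with the same number, it may help to look at what is.null returns for a vector. A small sketch using the at vector from the question (the names scalar_result and log_at are only for this example):

```r
at <- c(0.028, 0, 0, 0, 0, 0)

# is.null() tests the object as a whole, not each element,
# so it returns a single FALSE for any non-NULL vector.
is.null(at)

# ifelse() returns a result shaped like its test, here length 1:
# just log(at)[1], which is then recycled into every row.
scalar_result <- ifelse(is.null(at), 0, log(at))
length(scalar_result)

# An element-wise test gives one result per row.
log_at <- ifelse(at == 0, 0, log(at))
sum(at == 0)
```

The element-wise version leaves log(0.028) in the first position and 0 everywhere else, and the count of zeros is 5.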

Why does sample() not work for a single number? [duplicate]

This question already has answers here:
Sample from vector of varying length (including 1)
(4 answers)
Closed 3 years ago.
sample(x, n): the parameters are the vector to sample from and the number of items to draw.
sample(c(5,9),1) returns either 5 or 9
however,
sample(5,1) returns 1,2,3,4, or 5?
I've read the help section:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x. Note that this convenience
feature may lead to undesired behaviour when x is of varying length in
calls such as sample(x). See the examples.
But is there a way to make it not do this? Or do I just need to include an if statement to avoid this?
"Or do I just need to include an if statement to avoid this?"
Yeah, unfortunately. Something like this:
result = if(length(x) == 1) {x} else {sample(x, ...)}
Here's an alternative approach: you simply subset a random value from your vector like this -
set.seed(4)
x <- c(5,9)
x[sample(length(x), 1)]
[1] 9
x <- 5
x[sample(length(x), 1)]
[1] 5
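The same indexing idea can be wrapped in a small helper. The examples section of ?sample suggests a function along these lines; the version below is a sketch of it:

```r
# Sample positions rather than values, so a length-1 numeric x
# is never expanded to 1:x.
resample <- function(x, ...) x[sample.int(length(x), ...)]

resample(5, 1)         # can only return 5
resample(c(5, 9), 1)   # returns 5 or 9
resample(c(5, 9), 2)   # a permutation of both values
```

Because sample.int draws from the positions 1 to length(x), the helper behaves the same way whether x has one element or many.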

How to place a condition on a variable [duplicate]

This question already has answers here:
Finding which element of a vector is between two values in R
(3 answers)
Closed 3 years ago.
I have a sampling function of:
z = rnorm(n, 0.3, 1)
And would like my variable f to equal 1 if -pi < z_i < pi and equal 0 otherwise.
I'm not sure how to achieve this. My other idea was to use a reject function but this seems overly complicated.
Assuming you have created the variable z, then you can do this with ifelse():
f <- ifelse(abs(z) < pi, 1, 0)
Or with:
f <- as.integer(abs(z) < pi)
The second one might be quicker, although you probably won't notice unless n is large.
EDIT: The second one is much faster: just checked it on a vector of 1,000,000 values and the first takes 0.2 seconds, the second method shows elapsed 0!
Here you can use the function ifelse, which does exactly what you want: it returns one value where a condition is satisfied and a different value where it is not. For example, for n = 10 and a threshold of 1 in place of pi:
f = ifelse (rnorm(n=10, 0.3, 1) > 1,1,0)
You should easily figure out how to solve your problem with this example.
You could also save your results in a temporary vector, and then check the exact condition you want with boolean operators:
ri = rnorm(n=10, 0.3, 1)
ifelse(-1 < ri & ri < 1,1,0)
Following @M-M's comment, you can also obtain values of 1 or 0 directly from the evaluated conditional expression. For instance:
as.integer(-1 < ri & ri < 1)
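For the condition actually asked about, -pi < z_i < pi, the two approaches can be checked against each other. A short sketch with an arbitrary seed and n = 10 (both chosen only for the example):

```r
set.seed(1)                      # arbitrary seed, for reproducibility
z <- rnorm(10, 0.3, 1)

# Explicit two-sided condition with ifelse().
f_ifelse <- ifelse(-pi < z & z < pi, 1, 0)

# Same condition via abs(), converting the logical result directly.
f_int <- as.integer(abs(z) < pi)

all(f_ifelse == f_int)           # both give the same 0/1 vector
```

Note that ifelse returns a double vector here while as.integer returns integers; the values agree element by element.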

Most significant decimal digit (or 0.3 - 0.1 = 0.1 ) [closed]

Closed 9 years ago.
This operation should return 2, but it returns 1 instead because of the floating point representation:
a <- .3
b <- .1
floor((a-b)*10)
I basically want the first digit after the point, of the actual base-10 result, not the floating-point computer's result. In this case a and b only have one decimal digit, but in most situations there will be more. Examples:
0.3-0.1=0.2 so I want the 2
0.5-0.001=0.499 so I want the 4
0.925-0.113=0.812 so I want the 8
0.57-0.11=0.46 so I want the 4
0.12-0.11=0.01 so I want the 0
that is, not rounding but truncating. I thought of using this:
floor(floor((a-b)*100)/10)
but I'm not sure if that is the best I can do.
update: indeed, it doesn't work (see comments below):
floor(floor((.9-.8)*100)/10) # gives 0 instead of 1
floor(round((.5-.001)*100)/10) # gives 5 instead of 4
update 2: I think this does work (at least in all cases listed so far):
substring(as.character(a-b),first=3,last=3)
Suggestions?
This is not possible, because the information is no longer there:
doubles cannot exactly represent most decimal fractions.
If you are fine with an approximate solution,
you can add a small number, and truncate the result.
For instance, if you know that your numbers have at most 14 digits,
the following would work:
first_digit <- function(x, epsilon=5e-15)
  floor( (x+epsilon) * 10 )
first_digit( .3 - .1 ) # 2
first_digit( .5 - .001 ) # 4
first_digit( .925 - .113 ) # 8
first_digit( .57 - .11 ) # 4
first_digit( .12 - .11 ) # 0
If you wanted the first significant digit (that means "first non-zero digit"),
you could use:
first_significant_digit <- function(x, epsilon=5e-14)
  floor( (x+epsilon) * 10^-floor(log10(x+epsilon)) )
first_significant_digit(0.12-0.11) # 1
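To make the underlying problem visible, one can print the stored value of 0.3 - 0.1 with more digits than R shows by default. A small sketch that reuses first_digit as defined above:

```r
a <- 0.3
b <- 0.1

# Print the difference with 17 decimal places: it sits just below 0.2.
sprintf("%.17f", a - b)

(a - b) == 0.2                    # FALSE: not exactly 0.2
isTRUE(all.equal(a - b, 0.2))     # TRUE: equal within tolerance

# Truncating the raw value drops to 1; nudging it by a small epsilon
# first (as in first_digit above) recovers the expected 2.
first_digit <- function(x, epsilon = 5e-15) floor((x + epsilon) * 10)
floor((a - b) * 10)
first_digit(a - b)
```

The epsilon has to be larger than the accumulated rounding error but smaller than the finest decimal step in the data, so the value 5e-15 is an assumption that suits inputs like these rather than a general rule.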
