Basically I want to exclude 0 and 1 from the dataset and keep two classes: prime numbers (2, 3, 5, 7) and composite numbers (4, 6, 8, 9).
I want to train a binary classifier, i.e. prime numbers vs. composite numbers.
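For concreteness, a minimal sketch in R of the intended two-class dataset (the column names are illustrative only):
x <- 2:9   # 0 and 1 excluded
data.frame(number = x,
           class = ifelse(x %in% c(2, 3, 5, 7), "prime", "composite"))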
I wish to write a program in R to generate one random number (a positive integer) at each length from 3 digits up to 12 digits, following these conditions:
There is no particular order in the digits of a number.
Strictly no repetition of digits within a number up to the 9-digit number.
0 can be used only after the 9-digit number.
From 10 digits onward, a digit can be used twice, but again with no order.
And most importantly:
**The first digit of a number must not be the last digit of the next number, and vice versa.**
All I know how to use is the sample command in R:
sample(1:9, size=n, replace=FALSE)
where n is the number of digits I wish to generate. However, I need to write a more general function or program that strictly obeys these conditions.
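A minimal rejection-sampling sketch, under my reading of the conditions (no 0 and no repeats up to 9 digits; from 10 digits onward 0 is allowed and a digit may appear at most twice; the first digit must differ from the previous number's last digit):
gen_number <- function(n_digits, prev_last = NA) {
  repeat {
    if (n_digits <= 9) {
      # Up to 9 digits: draw from 1:9 without replacement (no 0, no repeats)
      d <- sample(1:9, size = n_digits, replace = FALSE)
    } else {
      # 10-12 digits: 0 allowed, but no digit more than twice
      d <- sample(0:9, size = n_digits, replace = TRUE)
      if (max(table(d)) > 2) next
    }
    if (d[1] == 0) next                                # no leading zero
    if (!is.na(prev_last) && d[1] == prev_last) next   # boundary condition
    return(paste(d, collapse = ""))
  }
}

prev_last <- NA
for (n in 3:12) {
  num <- gen_number(n, prev_last)
  print(num)
  prev_last <- as.integer(substr(num, nchar(num), nchar(num)))
}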
My dataset is in the following format, where for each disease I generate a 2D vector using word2vec (2D vectors shown for the example, but in practice the vectors are 100D):
Disease                 Vectors
disease a, disease c    [[0.2520773, 0.433798], [0.38915345, 0.5541569]]
disease b               [0.12321666, 0.64195603]
disease c, disease b    [[0.38915345, 0.5541569], [0.12321666, 0.64195603]]
disease c               [0.38915345, 0.5541569]
From here I generate a 1D array for each disease/disease combination by taking the average of the vectors. The issue with averaging word vectors is that a combination of two or more diseases can have the same average vector as a totally different, irrelevant disease, in which case the average vectors match spuriously. This makes the concept of averaging vectors flawed; to counter this, the understanding is that as the dimension of the vectors increases, such collisions should become even less likely.
So, a couple of questions in all:
Is there a better way than averaging the word2vec output vectors to generate a 1D array?
These generated vectors will be treated as features for a classifier model that I am trying to build for each disease/disease combination. So, if I generate a 100D feature vector from word2vec, should I use something like PCA on it to reduce the dimension, or should I just treat the 100D feature vector as 100 features for my classifier?
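A minimal sketch of both the averaging step and the PCA option, reusing the toy 2D vectors above (with 100D vectors the code is identical; prcomp() is base R's PCA):
vecs <- list(
  "disease a" = c(0.2520773, 0.433798),
  "disease b" = c(0.12321666, 0.64195603),
  "disease c" = c(0.38915345, 0.5541569)
)
# Average the vectors of a disease combination into one 1D feature vector
combo_feature <- function(names) colMeans(do.call(rbind, vecs[names]))
X <- rbind(
  combo_feature(c("disease a", "disease c")),
  combo_feature("disease b"),
  combo_feature(c("disease c", "disease b")),
  combo_feature("disease c")
)
# Option A: pass X to the classifier directly (each column is one feature)
# Option B: reduce the dimension first with PCA
pc <- prcomp(X, center = TRUE)
head(pc$x)   # principal-component scores as the reduced features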
Consider
> data.frame(n=runif(6),m=1:6)
n m
1 0.44000000 1
2 0.12102262 2
3 0.95483015 3
4 0.35628753 4
5 0.55000000 5
6 0.50189420 6
where you want to partition the numbers into the fewest possible sets such that the sum of the numbers in each set is less than 1.
Here is an example attempt at finding the partitions, not necessarily an optimal way to find them (particularly with bigger sets).
For example, one partition is the set containing row 3 alone, because 0.95483015 < 1. Another partition is the set of rows 5 and 1, because 0.55 + 0.44 < 1. The remaining numbers go into a third partition, such that
partition: 3
partition: 5,1
partition: 2,4,6
Now I have a big list of numbers like this that I need to split into the least number of such partitions, i.e. the least number of sets of decimal numbers.
Does an R package exist to find partitions under an optimality criterion like this, i.e. the least number of partitions subject to a sum condition?
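This is the classic bin-packing problem, which is NP-hard in general. A minimal first-fit-decreasing sketch (a common greedy heuristic, not guaranteed optimal), assuming each partition's sum must stay below 1:
pack <- function(x, capacity = 1) {
  ord <- order(x, decreasing = TRUE)
  bins <- list()       # row indices assigned to each partition
  sums <- numeric(0)   # running sum of each partition
  for (i in ord) {
    fit <- which(sums + x[i] < capacity)[1]
    if (is.na(fit)) {  # nothing fits: open a new partition
      bins[[length(bins) + 1]] <- i
      sums <- c(sums, x[i])
    } else {           # place into the first partition that still fits
      bins[[fit]] <- c(bins[[fit]], i)
      sums[fit] <- sums[fit] + x[i]
    }
  }
  bins
}

d <- data.frame(n = c(0.44, 0.12102262, 0.95483015, 0.35628753, 0.55, 0.5018942),
                m = 1:6)
pack(d$n)   # gives the partitions {3}, {5, 1}, {6, 4, 2} for this data
For an exact optimum on small inputs this could also be formulated as an integer program (e.g. with the lpSolve package), but the heuristic above already reproduces the example partitions.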
I'm trying to produce a set of 480 random integers between 1-9. However there are some constraints:
the sequence cannot contain 2 duplicate digits in a row.
the sequence must include exactly 4 sequences of odd numbers and 4 sequences of even numbers (in any order) within every 80-digit sequence (e.g. 6 4 5 2 4 8 3 4 6 9 1 5 4 6 1).
I have been able to produce a set of random numbers that allows repeated digits, using:
NumRep <- sample(1:9, 480, replace=T)
but I have not been able to work out how to allow digits to repeat over the entire set while disallowing sequential repeats (e.g. 2 5 3 would be okay, 2 5 5 would not). I have got nowhere with the odd/even constraint.
For context, this is not homework! I am a researcher, and this is part of a psychological experiment that I am creating.
Any help would be greatly appreciated!
First, note that with these conditions the sequence is no longer purely random; the constraints bias the sampling. Anyway, this code handles the first constraint:
# Build a vector
C <- vector()
# Length of the vector
n <- 480
# The first element
C[1] <- sample(1:9, 1)
# Complete the rest
for (i in 2:n) {
  # Take a random digit not equal to the previous one
  C[i] <- sample((1:9)[1:9 != C[i - 1]], 1)
}
# Is each element even?
C %% 2 == 0
# How long are the runs of consecutive even/odd numbers?
# Build a run-length table with this information
TAB <- rle(C %% 2 == 0)
# Longest run of consecutive even numbers
max(TAB$lengths[TAB$values == TRUE])
# Longest run of consecutive odd numbers
max(TAB$lengths[TAB$values == FALSE])
I don't fully understand the second constraint, but I hope the final part of the code helps: you could use that run-length information to swap some values around.
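If the second constraint means that each 80-digit block contains exactly 4 runs of odd digits and 4 runs of even digits, a constructive sketch is to pick the 8 run lengths first and then fill each run (this is only my reading of the question; rejection sampling on whole blocks would almost never terminate, since an unconstrained block has far more than 8 runs on average):
gen_block <- function(len = 80, runs_each = 4) {
  n_runs <- 2 * runs_each
  # Split len into n_runs positive run lengths via random cut points
  cuts <- sort(sample(1:(len - 1), n_runs - 1))
  lens <- diff(c(0, cuts, len))
  # Alternate parity between runs: 4 odd runs and 4 even runs
  parities <- rep(c(1, 0), runs_each)
  if (sample(2, 1) == 1) parities <- rev(parities)  # random starting parity
  out <- integer(0)
  for (j in seq_len(n_runs)) {
    pool <- if (parities[j] == 1) c(1, 3, 5, 7, 9) else c(2, 4, 6, 8)
    run <- sample(pool, 1)
    if (lens[j] > 1) for (i in 2:lens[j]) {
      run[i] <- sample(pool[pool != run[i - 1]], 1)  # no immediate repeats
    }
    out <- c(out, run)
  }
  # Parity flips at every run boundary, so no repeats across runs either
  out
}
NumRep <- unlist(lapply(1:6, function(k) gen_block()))  # 6 x 80 = 480 digits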
I have many columns in a dataframe that are 0/1 flags. They come in as class "integer" when I import the dataframe.
0 denotes absence and 1 denotes presence in all columns.
Do I need to convert them to factors? (Factors will make the levels 1 and 2, whereas currently the values are 0 and 1, albeit integers.)
I plan to later use xgboost to build a predictive model.
xgboost works only on numeric columns, so if I convert the columns to factors I will then need to one-hot encode them back to numeric.
(Side question: do we always need to drop one column when one-hot encoding, to remove collinearity?)
Short answer: it depends. Yes, for better variable interpretation; no, because for pure 0/1 variables, integers and factors behave the same.
If you ask my personal opinion, I lean towards YES: you will most likely also have categorical variables that have string values, more than 2 levels, or 2 integer levels other than 0 and 1. In all of those cases, integers and factors are NOT the same; only in the specific case of 0/1 binary levels are they equivalent. So you may want consistency in your coding and adopt factors for the 0/1 case as well.
To see for yourself:
a <- c(1, 2, 1, 2, 1, 2, 5)
c <- as.character(a)   # "1" "2" "1" "2" "1" "2" "5"
b <- as.factor(c)      # factor with levels "1", "2", "5"
d <- as.integer(b)     # returns the level codes, not the original values
Here I am just playing with a vector, which in the end gives me:
> d
[1] 1 2 1 2 1 2 3
Note that the 5 has become a 3, its level code. So if you don't want to debug why values change later, use as.factor() from the start.
Side answer: yes. Look at model.matrix() and its contrasts.arg argument for getting this done in R.
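A small sketch of that side answer: with the default treatment contrasts, model.matrix() drops one level per factor (which is what removes the collinearity), while contrasts.arg lets you keep all levels:
f <- factor(c("a", "b", "c", "a"))
# Default treatment contrasts: level "a" is absorbed into the intercept
model.matrix(~ f)
# Full one-hot encoding via contrasts.arg (collinear if an intercept is kept)
model.matrix(~ f, contrasts.arg = list(f = contrasts(f, contrasts = FALSE)))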
The error states that xgb.DMatrix takes numeric values, whereas the data were integers.
To convert the data to numeric, use
train[] <- lapply(train, as.numeric)
and then use
xgb.DMatrix(data = data.matrix(train))
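Put together, a minimal runnable sketch (the flag columns f1/f2 and the label y below are made up for illustration):
library(xgboost)
train <- data.frame(f1 = c(0L, 1L, 1L, 0L),   # hypothetical 0/1 flag columns
                    f2 = c(1L, 1L, 0L, 0L))
y <- c(0, 1, 1, 0)                            # hypothetical label
train[] <- lapply(train, as.numeric)          # integer columns -> numeric
dtrain <- xgb.DMatrix(data = data.matrix(train), label = y)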