onehot is converting NA to arbitrary numbers - r

I have a data frame with categorical and numeric variables. I try to onehot transform the categories, while the numerics should stay unchanged.
It took me some time to reproduce the issue that I get with my own data, but apparently it comes from some of the numeric columns loaded as integers.
So the following code should reproduce the issue I got:
library(onehot)
set.seed(123)
d_tmp<-data.frame(a=as.integer(sample(c(120, 94, 140, 100, 130, NA),10, replace = T)),
b=sample(c("I", "II", NA), 10, replace=T))
data.frame(predict(onehot(d_tmp), d_tmp))
Outcome:
1 140 0 1
2 -2147483648 1 0
3 140 0 1
4 94 0 0
5 94 1 0
6 -2147483648 0 0
7 140 0 0
8 130 1 0
9 100 1 0
10 -2147483648 1 0
So the NA values are replaced by some highly negative numbers, which seem arbitrary to me. While trying to reproduce the dataframe, I figured that this happens only if I add the as.integer() to the dataframe creation (my own original data is loaded as integer by default).
Why is this happening? And how should I handle this robustly? Of course I could convert all integer columns to numeric, but I am not 100% sure that this fixes the problem once and for all. I want to know the reason, so I don't have to worry about any implicit data errors later. I hope I am addressing this correctly here; if you think this problem should be addressed somewhere else, please let me know.
Thanks for the help.
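One likely explanation, offered as an assumption rather than a confirmed diagnosis of the onehot package: R stores the integer NA as the smallest 32-bit integer, -2147483648, so compiled code that copies an integer column into a numeric matrix without checking for NA would expose exactly that number. Under that assumption, a minimal workaround sketch is to convert integer columns to double before encoding. The data frame below is a small stand-in for d_tmp, not the original data.

```r
# Small stand-in for d_tmp (hypothetical values)
d_tmp <- data.frame(a = as.integer(c(120, NA, 140)),
                    b = c("I", "II", NA))

# Find the integer columns and convert them to double;
# NA_integer_ becomes NA_real_, which is a proper floating-point NA
int_cols <- vapply(d_tmp, is.integer, logical(1))
d_tmp[int_cols] <- lapply(d_tmp[int_cols], as.numeric)

str(d_tmp)
```

After this step the encoding call from the question can be applied to d_tmp as before. Whether that removes the problem in every version of the package is not something this sketch can guarantee.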

Related

How to return the values of a factor that have a 0 level as the raw numbers

I am using a model to create predictions. The model returns a factor whose values range from 0 to 6.
I am trying to report these values as they are, but when I try to convert the factor to a number or put it into a data frame, the 0 becomes a 1 and all the other values shift up by one... sometimes.
out = as.factor(c(0,1,2,3,4,5))
out
[1] 0 1 2 3 4 5
Levels: 0 1 2 3 4 5
as.numeric(out)
[1] 1 2 3 4 5 6
I would simply subtract 1 if this increased the value by 1 every time, but if my model returns only non-zero values, it will not increase the value:
out = as.factor(c(1,2,3,4,5,6))
as.numeric(out)
[1] 1 2 3 4 5 6
Is there a simple way to get the raw values out of the factor rather than R converting the 0 to a 1 and adjusting the rest of the values?
Thank you,
RStudio 1.3.1093
r 4.0.3
From my own comments, I found the solution here: How to convert a factor to integer\numeric without loss of information?
"In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f))." - Joshua Ulrich
This solved the issue and I was able to put into a data frame without a problem.
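The difference between the two conversions can be seen on the factor from the question. This is a minimal sketch using only base R:

```r
out <- as.factor(c(0, 1, 2, 3, 4, 5))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(out)

# Indexing the numeric levels by the factor recovers the original values
values <- as.numeric(levels(out))[out]

# Equivalent, slightly less efficient alternative
values_chr <- as.numeric(as.character(out))

codes   # 1 2 3 4 5 6
values  # 0 1 2 3 4 5
```

Both recovery forms also give the right answer when the factor contains no 0, so no conditional subtraction is needed.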

Cannot get completed dataset using imputeMCA

I use missMDA package to fill in multiple categorical columns. However, I cannot get any result from these two functions: estim_ncpMCA(test_fill) and imputeMCA(test_fill). The program keeps running without any progress bar or results popped out.
This is the sample from the dataset.
Hybrid G1 G5 G8 G9 G10
P1000:P2030 0 -1 0 1 0
P1006:P1384 0 0 0 0 1
P1006:P1401 0 NA NA NA 1
P1006:P1412 0 0 0 0 1
P1006:P1594 0 0 0 0 1
P1013:P1517 0 0 0 0 1
I am working on a genetic project in R. In this dataset, there are 497 rows and 11,226 columns. Every row is a genetic marker series for a particular hybrid, and every column is a genetic marker ("G1", "G2", etc.) with value 1, 0, -1 or NA. There are 746,433 missing values in total, and I am trying to fill them in with imputeMCA.
I also made some transformations on test_fill before running imputeMCA.
test_fill = as.matrix(test_fill)
test_fill[, -1] <- lapply(test_fill[, -1], as.numeric)
I am wondering whether this is the result of too many columns in my dataset. Do I need to transpose my columns and rows?
I don't know if you found your answer, but I think your function doesn't run because of your first column, which seems to be the label of the individuals. You can specify that it should not be taken into the analysis.
estim_ncpMCA(test_fill[,2:11226], ncp.max = 5)
imputeMCA(test_fill[,2:11226], ncp = X)  # X is a placeholder: the number of dimensions chosen from estim_ncpMCA
I hope this can help.
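A minimal sketch of the preparation step, assuming the data are held in a data frame rather than a matrix: the label column is moved into the row names and every marker column is turned into a factor, since MCA works on categorical variables. The three rows below are copied from the sample in the question; the object name markers is hypothetical.

```r
# First three rows and three markers from the sample in the question
test_fill <- data.frame(
  Hybrid = c("P1000:P2030", "P1006:P1384", "P1006:P1401"),
  G1 = c(0, 0, 0),
  G5 = c(-1, 0, NA),
  G8 = c(0, 0, NA)
)

# Drop the label column from the analysis but keep it as row names
markers <- test_fill[, -1]
rownames(markers) <- as.character(test_fill$Hybrid)

# Convert every marker column to a factor (NA stays NA)
markers[] <- lapply(markers, factor)

str(markers)
```

The resulting markers data frame is what would then be passed to estim_ncpMCA() and imputeMCA() in place of test_fill[,2:11226].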

2 numbers in R not equal despite being the same, fails in left_join

I have a strange problem. I am trying to do a left_join from dplyr between two data frames, say table_a and table_b, which have the column C in common. I get lots of NAs except where the values are zero in both tables, even though many more rows have matching values.
One thing I did notice was that the C column in table_b on which I would like to match, has values 0 as 0.0 whereas in the table_a, 0 is displayed as simply 0.
A sample is here
head(table_a) gives
likelihood_ols LR_statistic_ols decision_ols C
1 -1.51591 0.20246 0 -10
2 -1.51591 0.07724 0 -9
3 -1.51591 0.00918 0 -8
4 -1.51591 0.00924 0 -7
5 -1.51591 0.08834 0 -6
6 -1.51591 0.25694 0 -5
and the other one is here
head(table_b)
quantile C pctile
1 2.96406 0.0 90
2 4.12252 0.0 95
3 6.90776 0.0 99
4 2.78129 -1.8 90
5 3.92385 -1.8 95
6 6.77284 -1.8 99
Now, there are definitely overlaps between the C columns but only the zeroes are found, which is confusing.
When I subset the unique values in the C columns according to
a <- sort(unique(table_a$C)) and b <- sort(unique(table_b$C)) I get the following confusing output:
> a[2]
[1] -9
> b[56]
[1] -9
> a[2]==b[56]
[1] FALSE
What is going on here? I am reading in the values using read.csv, and the csvs are generated once on CentOS and once on RedHat/Fedora, if that plays a role at all. I have tried forcing them to be tibbles, or converting them first to characters and then to numerics. I have also checked all of R's classes and the types discussed here, but to no avail: they all match.
What else could make them different and how do I tell R that they are so I can run my merge function?
Just because two floating point numbers print out the same doesn't mean they are identical.
A simple enough solution is to round, e.g.:
table_a$new_a_likelihood_ols <- signif(table_a$likelihood_ols, 6)
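Since the join in the question is on C, the same idea can be applied to the key column itself. The sketch below uses two tiny hypothetical tables and base R's merge() as a stand-in for left_join, so that it runs without extra packages. The choice of 6 decimal places is an assumption and should match the precision of the real data.

```r
# Two values that print alike but are stored differently
a <- 0.1 + 0.2
b <- 0.3
a == b                    # FALSE
isTRUE(all.equal(a, b))   # TRUE: comparison with a tolerance

# Hypothetical tables whose key columns differ only by floating-point error
table_a <- data.frame(C = c(0, 0.1 + 0.2), x = c("a0", "a1"))
table_b <- data.frame(C = c(0, 0.3),       y = c("b0", "b1"))

# Exact matching on C misses the second row
before <- merge(table_a, table_b, by = "C", all.x = TRUE)

# Round the key in both tables, then join again
table_a$C <- round(table_a$C, 6)
table_b$C <- round(table_b$C, 6)
joined <- merge(table_a, table_b, by = "C", all.x = TRUE)
joined
```

Rounding both key columns to the same number of digits before the join is what makes the keys comparable. all.equal() is useful for checking individual pairs but cannot be used as a join condition directly.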

R if then else loop

I have the following output and would like to insert a column that is 0 if net.results$net.result is < 0.5 and 1 if it is > 0.5 but < 1.0. Basically rounding up or down.
How do I go about doing this in a loop? Can I insert this column using data.frame below, just in between the predicted and the test set columns?
Assume I don't know the number of rows that net.results$net.result has.
Thank you for your help.
data.frame(net.results$net.result,diabTest$class)
predicted col Test set col
net.results.net.result diabTest.class
4 0.2900909633 0
7 0.2900909633 1
10 0.4912509122 1
12 0.4912509122 1
19 0.2900909633 0
21 0.2900909633 0
23 0.4912509122 1
26 0.2900909633 1
27 0.4912509122 1
33 0.2900909633 0
As the commenters have pointed out, this will not work for some situations, but based on the appearance of the data, this should produce the desired output.
df$rounded <- round(df$net.results.net.result,0)
Here are a few test values to see what the function does for different numbers. Read the round help page for more info.
round(0.2900909633,0)
[1] 0
round(0.51, 0)
[1] 1
You can help everyone by supplying a reproducible example, doing research, and explaining approaches that you've tried.
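No loop is needed, because both round() and ifelse() are vectorized and work for any number of rows. As an alternative sketch, an explicit threshold with ifelse() avoids one edge case: round(0.5) is 0 in R. The data frame below is a hypothetical stand-in built from values shown above, and it also shows how to place the new column between the other two.

```r
# Hypothetical stand-in for data.frame(net.results$net.result, diabTest$class)
df <- data.frame(net.results.net.result = c(0.2900909633, 0.4912509122, 0.51),
                 diabTest.class = c(0, 1, 1))

# Vectorized threshold: 1 above 0.5, otherwise 0
df$rounded <- ifelse(df$net.results.net.result > 0.5, 1, 0)

# Reorder so the new column sits between predicted and test set columns
df <- df[, c("net.results.net.result", "rounded", "diabTest.class")]
df
```

How a value of exactly 0.5 should be classified is not stated in the question; here it is assigned 0.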

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
So for the G.test(x) to work if x is a matrix, it must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function I can run the G.test on each row if my data set contains only two columns of numbers. I want to look only at the treated and control, for example, but I'm not sure how to omit columns so the G.test can ignore the headers and other columns. I've tried using the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is, is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than printed as above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
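For the second part of the question (collecting the statistic, df and p-value into a table), one possible sketch follows. It assumes that G.test() returns a standard "htest" object, which its printed output suggests but which is not confirmed here. So that the example runs with base R only, chisq.test() is used as a stand-in for G.test(); the counts and the helper name htest_row are hypothetical.

```r
# Hypothetical counts, two rows per treatment (not the asker's data)
bio <- data.frame(
  Trt     = c("mix", "mix", "cital", "cital"),
  Treated = c(12, 15, 20, 9),
  Control = c(18, 14, 11, 16)
)

# Turn one "htest" object into a one-row data frame
htest_row <- function(h) {
  data.frame(statistic = unname(h$statistic),
             df        = unname(h$parameter),
             p.value   = h$p.value)
}

# chisq.test() stands in for G.test() here
res <- by(bio, bio$Trt,
          function(x) chisq.test(as.matrix(x[, c("Treated", "Control")])))

# One row per treatment, then column totals for statistic and df
tab <- do.call(rbind, lapply(res, htest_row))
tab
totals <- colSums(tab[, c("statistic", "df")])
totals
```

If the assumption about the return value holds, replacing chisq.test with G.test in the by() call should give the corresponding table of G values.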
