I have the following output and would like to insert a column that is 0 if net.results$net.result is < 0.5 and 1 if it is >= 0.5 but < 1.0. Basically rounding up or down.
How do I go about doing this in a loop? Can I insert this column using data.frame below, just in between the predicted and the test-set columns?
Assume I don't know the number of rows that net.results$net.result has.
Thank you for your help.
data.frame(net.results$net.result,diabTest$class)
predicted col Test set col
net.results.net.result diabTest.class
4 0.2900909633 0
7 0.2900909633 1
10 0.4912509122 1
12 0.4912509122 1
19 0.2900909633 0
21 0.2900909633 0
23 0.4912509122 1
26 0.2900909633 1
27 0.4912509122 1
33 0.2900909633 0
As the commenters have pointed out, this will not work in some situations, but based on the appearance of the data, it should produce the desired output.
df$rounded <- round(df$net.results.net.result,0)
Here are a few test values to see what the function does for different numbers; read the round help page for more info. (Note that R rounds exact halves to the even digit, so round(0.5, 0) is 0.)
round(0.2900909633,0)
[1] 0
round(0.51, 0)
[1] 1
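Since round() is vectorised over all rows, no loop is needed, and the column order you pass to data.frame() controls where the new column lands. A minimal sketch with made-up values (predicted and test_class are placeholders standing in for net.results$net.result and diabTest$class):

```r
# Made-up stand-ins for net.results$net.result and diabTest$class
predicted  <- c(0.2900909633, 0.4912509122, 0.51, 0.73)
test_class <- c(0, 1, 1, 1)

# The middle argument places the rounded column between the
# predicted column and the test-set column.
df <- data.frame(predicted = predicted,
                 rounded   = round(predicted, 0),
                 test      = test_class)
df
```

This works regardless of the number of rows, since round() recycles over the whole vector.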
You can help everyone by supplying a reproducible example, doing research, and explaining approaches that you've tried.
I have a data frame with categorical and numeric variables. I am trying to one-hot encode the categorical variables while the numeric ones stay unchanged.
It took me some time to reproduce the issue I get with my own data, but apparently it comes from some of the numeric columns being loaded as integers.
So the following code should reproduce the issue I got:
library(onehot)
set.seed(123)
d_tmp <- data.frame(a = as.integer(sample(c(120, 94, 140, 100, 130, NA), 10, replace = TRUE)),
                    b = sample(c("I", "II", NA), 10, replace = TRUE))
data.frame(predict(onehot(d_tmp), d_tmp))
Outcome:
1 140 0 1
2 -2147483648 1 0
3 140 0 1
4 94 0 0
5 94 1 0
6 -2147483648 0 0
7 140 0 0
8 130 1 0
9 100 1 0
10 -2147483648 1 0
So the NA values are replaced by some highly negative number, which seems arbitrary to me. While trying to reproduce the data frame, I figured out that this happens only if I add as.integer() to the data-frame creation (my own original data is loaded as integer by default).
Why is this happening? And how should I handle it robustly? Of course I could convert all integer columns to numeric, but I am not 100% sure that fixes the problem once and for all. I want to know the reason, so I don't have to worry about implicit data errors later. I hope I am addressing this in the right place; if you think this problem should be addressed somewhere else, please let me know.
Thanks for the help.
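The highly negative number is not arbitrary: R stores NA_integer_ internally as INT_MIN (-2147483648), and code that reads the raw integer values without an NA check exposes that sentinel. A small sketch (the as.numeric() conversion is the usual workaround, though I can't promise it covers every code path inside the onehot package):

```r
x <- c(120L, NA_integer_, 94L)

# INT_MIN is -(.Machine$integer.max + 1) = -2147483648, the
# bit pattern R uses internally for NA_integer_.
-(.Machine$integer.max + 1)

# Converting to double first keeps NA as a proper NA:
x_num <- as.numeric(x)
is.na(x_num[2])   # TRUE
```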
I have a strange problem: when doing a left_join from dplyr between two data frames, say table_a and table_b, which have the column C in common, I get lots of NAs, except where the values are zero in both, even though the values in the rows match more often.
One thing I did notice is that the C column in table_b, on which I would like to match, displays 0 as 0.0, whereas table_a displays 0 simply as 0.
A sample is here
head(table_a) gives
likelihood_ols LR_statistic_ols decision_ols C
1 -1.51591 0.20246 0 -10
2 -1.51591 0.07724 0 -9
3 -1.51591 0.00918 0 -8
4 -1.51591 0.00924 0 -7
5 -1.51591 0.08834 0 -6
6 -1.51591 0.25694 0 -5
and the other one is here
head(table_b)
quantile C pctile
1 2.96406 0.0 90
2 4.12252 0.0 95
3 6.90776 0.0 99
4 2.78129 -1.8 90
5 3.92385 -1.8 95
6 6.77284 -1.8 99
Now, there are definitely overlaps between the C columns, but only the zeroes are matched, which is confusing.
When I subset the unique values in the C columns with
a <- sort(unique(table_a$C)) and b <- sort(unique(table_b$C)) I get the following confusing output:
> a[2]
[1] -9
> b[56]
[1] -9
> a[2]==b[56]
[1] FALSE
What is going on here? I am reading in the values using read.csv, and the CSVs are generated once on CentOS and once on RedHat/Fedora, if that plays a role at all. I have tried forcing them to be tibbles, reading them first as characters and then as numerics, checked all of R's classes, and checked the types discussed here, but to no avail; they all match.
What else could make them different, and how do I tell R what that is so I can run my merge?
Just because two floating point numbers print out the same doesn't mean they are identical.
A simple enough solution is to round the join column in both tables before merging, e.g.:
table_a$C <- signif(table_a$C, 6)
table_b$C <- signif(table_b$C, 6)
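To see why rounding helps, here is a small illustration with made-up values: two doubles can print identically and still differ in the low-order bits, so == (and therefore join keys) fail until the noise is rounded away.

```r
a <- -9
b <- -9 + 1e-12   # prints as -9 under the default 7 significant digits

print(a == b)     # FALSE: the two values differ in the low bits

# After signif() (or round()), the keys compare equal again:
print(signif(a, 6) == signif(b, 6))   # TRUE
```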
This question already has answers here:
Convert *some* column classes in data.table
(2 answers)
Closed 4 years ago.
I am trying to write a for loop that iterates through each column in a data.table and returns a frequency table. However, I keep getting an error:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
print(table(cars[,i]))
}
Error in `[.data.table`(cars, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.
When I use each column individually like below, I do not have any problem:
> table(cars[,dist])
2 4 10 14 16 17 18 20 22 24 26 28 32 34 36 40 42 46 48 50 52 54 56 60 64 66
1 1 2 1 1 1 1 2 1 1 4 2 3 3 2 2 1 2 1 1 1 2 2 1 1 1
68 70 76 80 84 85 92 93 120
1 1 1 1 1 1 1 1 1
My data is quite large (8921483 x 52), which is why I want to use the for loop and run everything at once, then look at the results.
I included the cars dataset (which is easier to run) to demonstrate my code.
If I convert the dataset to a data.frame, there is no problem running the for loop. But I want to know why this does not work with data.table, because I am learning it, and I believe it works better with large datasets.
If by chance someone has seen a post with an answer already, please let me know, because I have been trying for several hours to find one.
There is a solution here. My personal preference, though, is the apply function:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
apply(cars, 2, table)
To make your loop work, don't pass the loop variable as a bare symbol in j; extract the column by name instead:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)) {
  print(table(cars[[i]]))
}
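The error message itself suggests another option: the `..i` prefix tells data.table to look the symbol up in the calling scope rather than among the column names. A self-contained sketch with a toy table:

```r
library(data.table)

dt <- data.table(speed = c(4, 4, 7), dist = c(2, 10, 4))

for (i in names(dt)) {
  # ..i means "look up i outside the data.table", so this selects
  # the column whose name is stored in the loop variable i
  print(table(dt[, ..i]))
}
```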
I am trying to use svm() to classify my data. A sample of my data is as follows:
ID call_YearWeek week WeekCount oc
x 2011W01 1 0 0
x 2011W02 2 1 1
x 2011W03 3 0 0
x 2011W04 4 0 0
x 2011W05 5 1 1
x 2011W06 6 0 0
x 2011W07 7 0 0
x 2011W08 8 1 1
x 2011W09 9 0 0
x 2011W10 10 0 0
x 2011W11 11 0 0
x 2011W12 12 1 1
x 2011W13 13 1 1
x 2011W14 14 1 1
x 2011W15 15 0 0
x 2011W16 16 2 1
x 2011W17 17 0 0
x 2011W18 18 0 0
x 2011W19 19 1 1
The third column shows the week of the year, the fourth shows the number of calls in that week, and the last is a binary factor (whether or not a call was received in that week). I used the following lines of code:
library(e1071)
train <- data[1:105, ]
test <- data[106:157, ]
model <- svm(oc ~ week, data = train)
plot(model, train, week)
plot(model, train)
Neither of the last two lines works: they don't show any plots, and they return no error. I wonder why this is happening.
Thanks
It seems there are two problems here. The first is that not all svm types are supported by plot.svm -- only the classification methods are, not the regression methods. Because your response is numeric, svm() assumes you want to do regression, so it chooses "eps-regression" by default. If you want to do classification, change your response to a factor:
model <- svm(factor(oc)~week,data=train)
which will then use "C-classification" by default.
The second problem is that there does not seem to be a univariate predictor plot implemented; plot.svm wants two variables (one for x and one for y).
It may be better to take a step back and describe exactly what you want your plot to look like.
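To illustrate both points, here is a sketch with simulated data (not the asker's data; the column names just mirror the sample above): a factor response makes svm() pick classification, and with two predictors plot.svm has an x and a y dimension to draw.

```r
library(e1071)

set.seed(1)
# Simulated stand-in for the asker's data
train <- data.frame(week      = 1:100,
                    WeekCount = rpois(100, 1),
                    oc        = factor(rbinom(100, 1, 0.4)))

# Factor response => "C-classification" is chosen by default
model <- svm(oc ~ week + WeekCount, data = train)

# With two predictors, plot.svm can draw the decision regions;
# the formula picks which variable goes on each axis.
plot(model, train, week ~ WeekCount)
```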
I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
So for G.test(x) to work when x is a matrix, it must be constructed as two columns containing numbers, with one row per population. If I use the apply() function I can run the G.test on each row, provided my data set contains only two columns of numbers. I want to look only at the Treated and Control columns, for example, but I'm not sure how to omit columns so that G.test ignores the headers and the other columns. I've tried the following, but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about using something like this rather than creating subsets:
by(bio, Trt, rowG.test)
The G.test spits out this when you compare two numbers:
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than row by row as above?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
library(RVAideMemoire)
by(bio, bio$Trt, function(x) G.test(as.matrix(x[, 3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle: passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame, so you need to convert with as.matrix(...).
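As for collecting the G, df and p-values into a table to be summed: each call returns an "htest" object, so you can pull out statistic, parameter and p.value and rbind them. A sketch (chisq.test is used as a stand-in here only so the example is self-contained without RVAideMemoire; it returns the same "htest" structure as G.test, so the same pattern applies):

```r
# Toy data shaped like the bio data frame above
bio <- data.frame(Trt     = rep(c("mix", "cital"), each = 3),
                  Treated = c(0, 0, 0, 1, 1, 0),
                  Control = c(5, 5, 2, 3, 5, 5))

# One test per treatment, on the pooled Treated/Control counts
res <- by(bio, bio$Trt, function(x)
  chisq.test(colSums(x[, c("Treated", "Control")])))

# One row per treatment: statistic, df and p-value in a table
tab <- do.call(rbind, lapply(res, function(h)
  data.frame(G  = unname(h$statistic),
             df = unname(h$parameter),
             p  = h$p.value)))
tab
colSums(tab[, c("G", "df")])   # summed G and df, as the question asks
```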