How to return the values of a factor that have a 0 level as the raw numbers - r

I am using a model to create prediction. The model is giving me a factor out which ranges from 0 to 6.
I am trying to report this as this value, but when I try to convert this to a number or put it into a data frame, it converts the 0 value to a 1 and all the other values up one...sometimes.
out = as.factor(c(0,1,2,3,4,5))
out
[1] 0 1 2 3 4 5
Levels: 0 1 2 3 4 5
as.numeric(out)
[1] 1 2 3 4 5 6
I would simply subtract by 1 if this increased the value by 1 everytime, but if my model returns only non-zero values, it will not increase the value:
out = as.factor(c(1,2,3,4,5,6))
as.numeric(out)
[1] 1 2 3 4 5 6
Is there a simple way to get the raw values out of the factor rather than R converting the 0 to a 1 and adjusting the rest of the values?
Thank you,
RStudio 1.3.1093
r 4.0.3

From my own comments, I found the solution here: How to convert a factor to integer\numeric without loss of information?
"In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f))." - Joshua Ulrich
This solved the issue and I was able to put into a data frame without a problem.

Related

can I select some rows in my data set whose have the same value in 2 of the columns?

I have a data set with 40 columns and 2000 rows. the value of 2 columns are important. I want to select rows whose have the same value in these 2 columns.
a small sample of my data is like this
2 3 4 5 6 3 23 32
4 3 4 1 0 5 6 43
4 4 3 22 1 2 23
Suppose I want to select rows whose have same value in first and third columns. So I want the second row to be stored in a new data set
I take from your comments that you have numbers stored as factors in that dataframe. Factors have different internal values. So when the console output shows the factor level to be 4 it is not necessarily a 4 in the internal representation. In general, two different factors are not compatible with each other except if they have the same level set. To see the 'internal representation' of your first column use as.numeric(df[[1]]).
Now to the solution of your problem. You first have to convert the factors in your columns 1 and 3 (or all columns) into numeric values using the factor levels. Instructions for that can be found here.
## converting factor levels to numeric values
df[[1]] <- as.numeric(levels(df[[1]]))[df[[1]]]
df[[3]] <- as.numeric(levels(df[[3]]))[df[[3]]]
## filter data
df[df[1] == df[3],]

Factor Level issues after filling data frame using match

I am using two large data files, each having >2m records. The sample data frames are
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled the x$Category using following command
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories but the Levels shows six. Similarly, the frequency shows me 512 and 621 with 0 frequency. I am using the same data for classification where it shows six classes instead of four which effects the f measure and recall etc. negatively.
table(x$Category)
0 1 120 512 621 S
2 4 2 0 0 2
while I want
table(x$Category)
0 1 120 S
2 4 2 2
I tried merge this and this with a number of other questions but it is giving me an error message. I found here Practical limits of R data frame that it is the limitation of R.
I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
dplyr::select(x, -Category) %>%
dplyr::left_join(y, by = "ItemID") %>%
droplevels()
0 1 120 S
2 4 4 4
I think this may have to do with the repeat ItemIDs in x?

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
So for the G.test(x) to work if x is a matrix, it must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function I can run the G,test on each row if my data set contains only two columns of numbers. I want to look only at the treated and control for example, but I'm not sure how to omit columns so the G.test can ignore the headers, and other columns. I've tried using the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is, is there someway to add all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed rather than like above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as it's first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).

Why is my use of predict in r not working - should be very simple case

So I'm having some trouble getting r to actually predict given a very, very simple linear model. Using the following,
> x=1:10
> y=1:10*2
> lm(y~x)
we get the corect answer, namely, y is twice x. But when I do,
predict(lm(y~x),newdata=2.5)
I just get
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
What is going on?
The newdata parameter should be a data.frame with column names matching the names used as covariates. So the correct case is
predict(lm(x~y),newdata=data.frame(y=2.5))
or
predict(lm(y~x),newdata=data.frame(x=2.5))
depending on which way you wanted to do the regression.

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
my data (a sample, original has 20 observations per imputation and 13 vars per group, all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifer for each observartion.
I want to draw all the FTE_* variables in a single plot (and the `OMZ_* in another), but wonder what to do with all the imputations, can I just include all values? The imputated data now has 500 observations. With for instance an ANOVA I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifier for .imp and .id after melting so rather put:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
This would look like this.
EDIT: fixed the data mess-up

Resources