Issues solving a Regression with numeric and categorical variables in R

I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I should try asking it here.
I have a data frame dataset with a whole lot of different variables, very similar to the following:
Item | Size | Value | Town
----------------------------------
A | 10 | 800 | 1
B | 11 | 100 | 2
A | 17 | 900 | 2
D | 13 | 200 | 3
B | 15 | 500 | 1
C | 12 | 250 | 3
E | 14 | NA | 2
A | | 800 | 1
C | | 800 | 2
Basically, I have to try to 'guess' the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.
I tried a polynomial regression (although I'm not even sure that's the right choice) to see how it looks, using a call similar to the following:
summary(lm(Size~ polym(factor(Item), Value, factor(Town), degree=2, raw=TRUE), dataset))
But I get this Warning message when I try to do this:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value Column. From what I understand, R ignores rows which have an NA value in a column. But what if I have a lot of NA values? Also, it seems like a waste of data to automatically eliminate entire rows if there is only one NA value in a column, so I was wondering if there is perhaps a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?

Yes, follow @RemkoDuursma's suggestion and use lm(Size ~ factor(Item) + factor(Town) + Value, ...), and look into other degrees as well (was there a reason why you chose squared?) by comparing residuals.
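For concreteness, a minimal sketch of what that could look like, assuming your full data frame is called dataset as in the question:

# additive model: Item and Town enter as factors, Value as numeric
fit1 <- lm(Size ~ factor(Item) + factor(Town) + Value, data = dataset)
summary(fit1)

# a higher degree for Value only, since factors cannot be raised to powers
fit2 <- lm(Size ~ factor(Item) + factor(Town) + poly(Value, 2, raw = TRUE), data = dataset)
anova(fit1, fit2)  # compare the two fits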
As regards substituting NA values, you have many options (a rough sketch follows below):
substitute all with the median of the variable
substitute all with the mean of the variable
substitute each with a prediction based on the values of the other variables
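For example, a minimal sketch of the first and last options, assuming the missing entries sit in the Value column of dataset:

# option: replace missing Values with the column median
med_filled <- dataset
med_filled$Value[is.na(med_filled$Value)] <- median(dataset$Value, na.rm = TRUE)

# alternative: predict missing Values from the other columns
# (this only works if every Item/Town level among the NA rows also appears
#  in the complete rows used to fit the model - the same 'new levels' issue
#  mentioned in the EDIT)
value_fit <- lm(Value ~ factor(Item) + factor(Town), data = dataset)
missing <- is.na(dataset$Value)
pred_filled <- dataset
pred_filled$Value[missing] <- predict(value_fit, newdata = dataset[missing, ])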
good luck, and next time you might want to check out https://stats.stackexchange.com/!

Related

Converting 1 row instance to a suitable format in R for repeated measures ANOVA

I'm really struggling with how to get my data into a suitable format in R.
At the moment, I have my data in the format of:
ParticipantNo | Sex | Age | IV1(0)_IV2(0)_DV1 | IV1(1)_IV2(0)_DV1 | etc
There are two levels for IV1, and 3 for IV2, so 6 columns per DV.
I've stacked them, so that I compare all IV1 results with each other, and the same for IV2 using a Friedman test.
However, I'd like to compare across groups like Sex and Age, and was told ANOVA is the best for this. I've used ANOVA directly before in SPSS, which accepts this data format.
The problem I have is getting this data into the correct format in R.
As I understand it, it should look like:
1 | M | 40 | IV1(0)_IV2(0)_DV1_Result
1 | M | 40 | IV1(1)_IV2(0)_DV1_Result
1 | M | 40 | IV1(0)_IV2(1)_DV1_Result
1 | M | 40 | IV1(1)_IV2(1)_DV1_Result
1 | M | 40 | IV1(0)_IV2(2)_DV1_Result
1 | M | 40 | IV1(1)_IV2(2)_DV1_Result
Then I can do
aov(sex~DV1_result, data=data)
Does this seem like the correct thing to do, and if so, how can I convert from the format I have to the one I need in R?
Figured it out!
I used stack() on my data, and then separate() (i.e. s <- separate(stack(data), "ind", c("IV1", "IV2"))).
Then I could do the ANOVA by aov(values ~ IV1 * IV2, data = s)
Hope this helps someone!
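In case it helps, a self-contained sketch of that workflow with made-up column names and values (separate() is from the tidyr package):

library(tidyr)

# hypothetical wide data: one row per participant, one column per IV1/IV2 cell
data <- data.frame(
  `IV1(0)_IV2(0)` = rnorm(10),
  `IV1(1)_IV2(0)` = rnorm(10),
  `IV1(0)_IV2(1)` = rnorm(10),
  `IV1(1)_IV2(1)` = rnorm(10),
  check.names = FALSE
)

# stack() produces two columns, values and ind (the original column name),
# which separate() then splits into the two IV columns
s <- separate(stack(data), "ind", c("IV1", "IV2"), sep = "_")

# two-way ANOVA on the stacked data
summary(aov(values ~ IV1 * IV2, data = s))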

Apply a formula through a function in many columns with different column names of data frame in R

I want to use the pb2gen function from the WRS2 package. It runs a robust t-test on your data; here is its documentation:
pb2gen(formula, data, est = "mom")
formula an object of class formula.
data an optional data frame for the input data.
est estimate to be used for the group comparisons: either "onestep"
for one-step M-estimator of location using Huber's Psi, "mom" for the
modified one-step (MOM) estimator of location based on Huber's Psi, or
"median", "mean".
Anyway, the thing is that I'm trying to apply this function to 5 columns of a data frame.
The data frame looks like this:
| Ge/treat | Control_1 | Control_2 | Cancer_1 | Cancer_2 | Cancer_3 |
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1 | 2.65 | 3.01 | 2.20 | 3.65 | 4.01 |
and I want to run the t-test between the Control and Cancer groups.
The formula I want to apply is, roughly, Control_{1,2} ~ Cancer_{1,2,3}.
I mean I want it to take into account both Control columns and all of the Cancer ones.
So far I can only run the t-test for the first Control and Cancer columns, with pb2gen(Control_1 ~ Cancer_1, data = data, est = "mom").
I'm wondering if it is possible to run the same command while including the other columns as well, so any idea or hint is welcome.
Thank you.
EDIT:
I also tried
pb2gen(c(Control_1,Control_2) ~ c(Cancer_1,Cancer_2,Cancer_3) , data = data, est="mom")
but got
Error in model.frame.default(formula, data) : variable lengths
differ

Friedman test unreplicated complete block design error

I'm having trouble running a Friedman test over my data.
I'm trying to run a Friedman test using this command:
friedman.test(mean ~ isi | expId, data=monoSum)
On the following database (https://www.dropbox.com/s/2ox0y1b4gwld0ai/monoSum.csv):
> monoSum
expId isi N mean
1 m80B1 1 10 100.000000
2 m80B1 2 10 73.999819
3 m80B1 3 10 45.219362
4 m80B1 4 10 116.566174
. . . . .
18 m80L2 2 10 82.945491
19 m80L2 3 10 57.675480
20 m80L2 4 10 207.169277
. . . . . .
25 m80M2 1 10 100.000000
26 m80M2 2 10 49.752687
27 m80M2 3 10 19.042592
28 m80M2 4 10 150.411035
It gives me back the error:
Error in friedman.test.default(c(100, 73.9998193095267, 45.2193621626293, :
not an unreplicated complete block design
I figure it gives the error because, when monoSum$isi == 1, the value of mean is always 100. Is this correct?
However, monoSum$isi == 1 is always 100 because it is the control group on which all the other monoSum$isi groups are normalized. I cannot assume a normal distribution, so I cannot run a repeated-measures ANOVA…
Is there a way to run a friedman test on this data or am I missing a very essential point here?
Many thanks in advance!
I don't get an error if I run your dataset:
Friedman rank sum test
data: mean and isi and expId
Friedman chi-squared = 17.9143, df = 3, p-value = 0.0004581
However, you have to make sure that expId and isi are coded as factors. Run these commands:
monoSum$expId <- factor(monoSum$expId)
monoSum$isi <- factor(monoSum$isi)
Then run the test again. This has worked for me with a similar problem.
I know this is pretty old but for future generations (see also: me when I forget and google this again):
You can determine which values are missing from your data frame by running table(groups, blocks), or in the case of this question table(monoSum$isi, monoSum$expId). This will return a table of 0s and 1s; the missing records are in the cells with 0s.
I ran into this problem after trying to remove the blocks that had incomplete results; taking a subset of the data did not remove the blocks for some reason.
Just thought I would mention that I found this post because I was getting a similar error message, and the above suggestions did not solve it. Strangely, I had to sort my data frame so that, block by block, the groups appeared in the same order (i.e. I could not have the following:
Block 1 A
Block 1 B
Block 2 B
Block 2 A
It has to appear as A, B, A, B)
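A guess at what that sort could look like for the monoSum data from the question (order first by block, then by group, so every block lists the groups in the same order):

monoSum <- monoSum[order(monoSum$expId, monoSum$isi), ]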
I ran into the same cryptic error message in R, though in my case it was resolved when I applied the as.matrix() function to what was originally a data frame from the CSV file I had imported with read.csv().
I also had a missing data point in my original data set, and I found that when my data was transformed into a matrix for the friedman.test() call, the entire row containing the missing data point was omitted automatically.
Using the function as.matrix() to transform my dataframe is the magic that got the function to run for me.
I had this exact error too with my dataset.
It turns out that the function friedman.test() accepts data frames (e.g. those created by data.frame()) but not tibbles (those created by dplyr and other tidyverse tools). The solution for me was to convert my dataset to a data frame first.
library(dplyr)  # for %>% and select()
D_fri <- D_all %>% dplyr::select(FrustrationEpisode, Condition, Participant)
D_fri <- as.data.frame(D_fri)
str(D_fri)  # confirm the object is now a plain 'data.frame'
friedman.test(FrustrationEpisode ~ Condition | Participant, D_fri)
I ran into this problem too. Fixed mine by removing the NAs.
# My data (called layers) looks like:
| resp.no | av.l.all | av.baseem | av.base |
| 1 | 1.5 | 1.3 | 2.3 |
| 2 | 1.4 | 3.2 | 1.4 |
| 3 | 2.5 | 2.8 | 2.9 |
...
| 1088 | 3.6 | 1.1 | 3.3 |
# Remove NAs
layers1 <- na.omit(layers)
# Re-organise the data so the scores are stacked, with a factor column holding the original column name
# (gather() is from tidyr, convert_as_factor() from rstatix, %>% from dplyr)
library(dplyr); library(tidyr); library(rstatix)
layers2 <- layers1 %>%
  gather(key = "layertype", value = "score", av.l.all, av.baseem, av.base) %>%
  convert_as_factor(resp.no, layertype)
# Data now looks like this
| resp.no | layertype | score |
| 1 | av.l.all | 1.5 |
| 1 | av.baseem | 1.3 |
| 1 | av.base | 2.3 |
| 2 | av.l.all | 1.4 |
...
| 1088 | av.base | 3.3 |
# Then do Friedman test
friedman.test(score ~ layertype | resp.no, data = layers2)
Just want to share what my problem was. My ID factor did not have the correct levels after doing pivot_longer(), and because of this the same error was given. I reset the levels and it worked with the following: df$ID <- as.factor(as.character(df$ID))
Reviving an old thread with new information. I ran into a similar problem after removing NAs. My group and block were factors before the NA removal. However, after removing NAs, the factors retained the levels before the removal even though some levels were no longer in the data!
Running the friedman.test() with the as.matrix() trick (e.g., friedman.test(a ~ b | c, as.matrix(df))) was fine but running frdAllPairsExactTest() or friedman_effsize() would throw the not an unreplicated complete block design error. I ended up re-factoring the group and block (i.e., dropping the levels that were no longer in the data, df$block <- factor(df$block)) to make things work. After the re-factor, I did not need the as.matrix() trick, either.

Using ggpairs with NA-containing data

ggpairs in the GGally package seems pretty useful, but it appears to fail when an NA is present anywhere in the data set:
require(GGally)
data(tips, package="reshape")
pm <- ggpairs(tips[,1:3]) #works just fine
#introduce NA
tips[1,1] <- NA
ggpairs(tips[,1:3])
> Error in if (lims[1] > lims[2]) { : missing value where TRUE/FALSE needed
I don't see any documentation for dealing with NA values, and solutions like ggpairs(tips[,1:3], na.rm=TRUE) (unsurprisingly) don't change the error message.
I have a data set in which perhaps 10% of values are NA, randomly scattered throughout the dataset. Therefore na.omit(myDataSet) will remove much of the data. Is there any way around this?
Some functions in GGally, like ggparcoord(), support handling NAs via the missing parameter (one of "exclude", "mean", "median", "min10", "random"). However, this is unfortunately not the case for ggpairs().
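For illustration, a small sketch of that parameter on the tips data used above (my own example call; columns 1, 2 and 7 are the numeric ones):

library(GGally)
data(tips, package = "reshape")
tips[1, 1] <- NA
# ggparcoord() can impute the NA itself, e.g. with the column median
ggparcoord(tips, columns = c(1, 2, 7), missing = "median")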
What you can do is replace the NAs with a good estimate of the data, something you might have expected ggpairs() to do automatically for you. There are good solutions such as replacing them with row means, zeros, the median, or even the closest point.
I see that this is an old post. Recently I encountered the same problem but still could not find a solution on the Internet. So I provide my workaround below FYI.
I think the aim is to use pairwise-complete observations for plotting (i.e. in a manner that is specific to each panel/facet of the ggpairs grid plot), instead of using complete observations across all variables. The former keeps "usable" observations to the maximal extent, without introducing "artificial" data by imputing missing values. To date, it seems that ggpairs still does not support this. My workaround is to:
Encode NA with another value not present in the data, e.g. for numerical variables, I replaced NA's with -666 for my dataset. For each dataset you can always pick something that is out of the range of its data values. BTW it seems that Inf doesn't work;
Then retrieve the pair-wise complete cases with user-created plotting functions. For example, for scatter plots of continuous variables in the lower triangle, I do something like:
library(data.table)  # the custom panel uses data.table() for the pairwise filtering
library(ggplot2)

scat.my <- function(data, mapping, ...) {
  x <- as.character(unclass(mapping$x))[2]  # my way of parsing the x variable name from `mapping`; there may be a better way
  y <- as.character(unclass(mapping$y))[2]  # my way of parsing the y variable name from `mapping`; there may be a better way
  dat <- data.table(x = data[[x]], y = data[[y]])[x != -666 & y != -666]  # assuming NA values have been replaced with -666
  ggplot(dat, aes(x = x, y = y)) +
    geom_point()
}
ggpairs(my.data, lower=list(continuous=scat.my), ...)
This can be similarly done for the upper triangle and the diagonal. It is somewhat labor-intensive as all the plotting functions need to be re-done manually with customized modifications as above. But it did work.
I'll take a shot at it with my own horrible workaround, because I think this needs stimulation. I agree with OP that filling in data based on statistical assumptions or a chosen hack is a terrible idea for exploratory analysis, and I think it's guaranteed to fail as soon as you forget how it works (about five days for me) and need to adjust it for something else.
Disclaimer
This is a terrible way to do things, and I hate it. It's useful for when you have a systematic source of NAs coming from something like sparse sampling of a high-dimensional dataset, which maybe the OP has.
Example
Say you have a small subset of some vastly larger dataset, making some of your columns sparsely represented:
| Sample (0:350)| Channel(1:118)| Trial(1:10)| Voltage|Class (1:2)| Subject (1:3)|
|---------------:|---------------:|------------:|-----------:|:-----------|--------------:|
| 1| 1| 1| 0.17142245|1 | 1|
| 2| 2| 2| 0.27733185|2 | 2|
| 3| 1| 3| 0.33203066|1 | 3|
| 4| 2| 1| 0.09483775|2 | 1|
| 5| 1| 2| 0.79609409|1 | 2|
| 6| 2| 3| 0.85227987|2 | 3|
| 7| 1| 1| 0.52804960|1 | 1|
| 8| 2| 2| 0.50156096|2 | 2|
| 9| 1| 3| 0.30680522|1 | 3|
| 10| 2| 1| 0.11250801|2 | 1|
require(data.table) # needs the latest rForge version of data.table for dcast
sample.table <- data.table(Sample = seq_len(10), Channel = rep(1:2, length.out = 10),
                           Trial = rep(1:3, length.out = 10), Voltage = runif(10),
                           Class = as.factor(rep(1:2, length.out = 10)),
                           Subject = rep(1:3, length.out = 10))
The example is hokey but pretend the columns are uniformly sampled from their larger subsets.
Let's say you want to cast the data to wide format along all channels to plot with ggpairs. Now, a canonical dcast back to wide format will not work, with an id column or otherwise, because the column ranges are sparsely (and never completely) represented:
wide.table <- dcast.data.table(sample.table, Sample ~ Channel,
                               value.var = "Voltage",
                               drop = TRUE)
> wide.table
Sample 1 2
1: 1 0.1714224 NA
2: 2 NA 0.27733185
3: 3 0.3320307 NA
4: 4 NA 0.09483775
5: 5 0.7960941 NA
6: 6 NA 0.85227987
7: 7 0.5280496 NA
8: 8 NA 0.50156096
9: 9 0.3068052 NA
10: 10 NA 0.11250801
It's obvious in this case what id column would work because it's a toy example (sample.table[,index:=seq_len(nrow(sample.table)/2)]), but it's basically impossible in the case of a tiny uniform sample of a huge data.table to find a sequence of id values that will thread through every hole in your data when applied to the formula argument. This kludge will work:
setkey(sample.table,Class)
We'll need this at the end to ensure the ordering is fixed.
chan.split <- split(sample.table,sample.table$Channel)
That gets you a list of data.frames for each unique Channel.
cut.fringes <- min(sapply(chan.split,function(x) nrow(x)))
chan.dt <- cbind(lapply(chan.split, function(x) {
  x[1:cut.fringes, ]$Voltage
}))
There has to be a better way to ensure each data.frame has an equal number of rows, but for my application, I can guarantee they're only a few rows different, so I just trim off the excess rows.
chan.dt <- as.data.table(matrix(unlist(chan.dt),
                                ncol = length(unique(sample.table$Channel)),
                                byrow = TRUE))
This will get you back to a big data.table, with Channels as columns.
chan.dt[, Class :=
  as.factor(rep(0:1, each = sampling.factor/2 * nrow(original.table)/ncol(chan.dt))[1:cut.fringes])]
Finally, I rebind my categorical variable back on. The tables should be sorted by category already so this will match. This assumes you have the original table with all the data; there are other ways to do it.
ggpairs(data = chan.dt,
        columns = 1:length(unique(sample.table$Channel)), colour = "Class", axisLabels = "show")
Now it's plottable with the above.
As far as I can tell, there is no way around this with ggpairs(). Also, you are absolutely correct to not fill in with 'fake' data. If it is appropriate to suggest here, I would recommend using a different plotting method. For example
cor.data <- cor(data, use = "pairwise.complete.obs")  # data correlations ignoring pairwise NAs
chart.Correlation(cor.data)  # library(PerformanceAnalytics)
or using code from here http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library/

R heatmap with different color scales for different rows

I am wondering what would be an easy way to produce a heatmap() of composite data that requires different scaling for different rows.
So in my case the columns represent different events of the same type, and the rows are different observations of these events, which can be binary or continuous data on different scales.
For example:
Event: ev1 | ev2 | ev3 | ev4 | ev5 | ev6
Obs1: 1 | 0 | 1 | 1 | 0 | 0
Obs2: 5.6 | 0.2 | 4.8 | 7.1 | 0.1 | 0.8
Thanks in advance for hints and help
To have a single heatmap where different rows have been scaled differently would be rather confusing, since you would need a legend for each shading. This would quickly clutter up the plot.
If you don't need a legend, then simply scale the values of each row to range between zero and one and you should get what you are after. So for Obs2 you would have something like:
scaled_obs2 = (Obs2 - min(Obs2)) / (max(Obs2) - min(Obs2))
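For instance, a short sketch of applying that row-wise rescaling before calling heatmap() (the matrix m is made up from the two example rows, and heatmap()'s own per-row scaling is switched off because the data are already scaled):

# made-up matrix matching the example: one binary row, one continuous row
m <- rbind(Obs1 = c(1, 0, 1, 1, 0, 0),
           Obs2 = c(5.6, 0.2, 4.8, 7.1, 0.1, 0.8))
colnames(m) <- paste0("ev", 1:6)

# rescale every row to the 0-1 range
m_scaled <- t(apply(m, 1, function(r) (r - min(r)) / (max(r) - min(r))))

heatmap(m_scaled, Rowv = NA, Colv = NA, scale = "none")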

Resources