using ggpairs with NA-containing data - r

ggpairs in the GGally package seems pretty useful, but it appears to fail when an NA is present anywhere in the data set:
require(GGally)
data(tips, package="reshape")
pm <- ggpairs(tips[,1:3]) #works just fine
#introduce NA
tips[1,1] <- NA
ggpairs(tips[,1:3])
> Error in if (lims[1] > lims[2]) { : missing value where TRUE/FALSE needed
I don't see any documentation for dealing with NA values, and solutions like ggpairs(tips[,1:3], na.rm=TRUE) (unsurprisingly) don't change the error message.
I have a data set in which perhaps 10% of values are NA, randomly scattered throughout the dataset. Therefore na.omit(myDataSet) will remove much of the data. Is there any way around this?

Some GGally functions, such as ggparcoord(), support handling NAs via a missing = [exclude, mean, median, min10, random] parameter. Unfortunately, this is not the case for ggpairs().
What you can do is replace the NAs yourself with a reasonable estimate of the missing values, which is what you might have expected ggpairs() to do automatically. Common solutions include replacing them with row means, zeros, the median, or the value of the closest point.
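For example, a minimal sketch of median imputation on the numeric columns before calling ggpairs(), using the tips example from the question (the choice of the median is just an illustration, not something GGally does for you):
library(GGally)
data(tips, package = "reshape")
tips[1, 1] <- NA                       # the NA introduced in the question

tips.imp <- tips[, 1:3]
num.cols <- sapply(tips.imp, is.numeric)
for (v in names(tips.imp)[num.cols]) { # replace NAs with the column median
  tips.imp[[v]][is.na(tips.imp[[v]])] <- median(tips.imp[[v]], na.rm = TRUE)
}
ggpairs(tips.imp)                      # no NAs left, so this runs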

I see that this is an old post. Recently I encountered the same problem but still could not find a solution online, so I am providing my workaround below, FYI.
I think the aim is to use pair-wise complete observations for plotting (i.e. separately for each panel/facet of the ggpairs grid), instead of using only observations that are complete across all variables. The former keeps "useable" observations to the maximal extent, without introducing "artificial" data by imputing missing values. As of this writing, ggpairs still does not seem to support this. My workaround is to:
Encode NA with another value not present in the data; e.g. for numerical variables, I replaced NAs with -666 in my dataset. For each dataset you can always pick something that is outside the range of its data values. (It seems that Inf doesn't work, by the way.)
Then retrieve the pair-wise complete cases with user-created plotting functions. For example, for scatter plots of continuous variables in the lower triangle, I do something like:
scat.my <- function(data, mapping, ...) {
  # my way of parsing the x and y variable names from `mapping`; there may be a better way
  x <- as.character(unclass(mapping$x))[2]
  y <- as.character(unclass(mapping$y))[2]
  # pair-wise complete cases only; uses the `data.table` package and assumes
  # NA values have been replaced with -666
  dat <- data.table(x = data[[x]], y = data[[y]])[x != -666 & y != -666]
  ggplot(dat, aes(x = x, y = y)) +
    geom_point()
}
ggpairs(my.data, lower=list(continuous=scat.my), ...)
This can be done similarly for the upper triangle and the diagonal. It is somewhat labor-intensive, as all the plotting functions need to be redone manually with customized modifications as above, but it did work.
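For instance, a diagonal panel can be handled with the same pattern (a sketch under the same assumptions: NAs re-coded as -666, data.table and ggplot2 loaded; dens.my and my.data are illustrative names):
dens.my <- function(data, mapping, ...) {
  x <- as.character(unclass(mapping$x))[2]    # same name-parsing trick as above
  dat <- data.table(x = data[[x]])[x != -666] # drop the re-coded NAs
  ggplot(dat, aes(x = x)) +
    geom_density()
}
ggpairs(my.data,
        lower = list(continuous = scat.my),
        diag  = list(continuous = dens.my))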

I'll take a shot at it with my own horrible workaround, because I think this needs stimulation. I agree with OP that filling in data based on statistical assumptions or a chosen hack is a terrible idea for exploratory analysis, and I think it's guaranteed to fail as soon as you forget how it works (about five days for me) and need to adjust it for something else.
Disclaimer
This is a terrible way to do things, and I hate it. It's useful for when you have a systematic source of NAs coming from something like sparse sampling of a high-dimensional dataset, which maybe the OP has.
Example
Say you have a small subset of some vastly larger dataset, making some of your columns sparsely represented:
| Sample (0:350)| Channel(1:118)| Trial(1:10)| Voltage|Class (1:2)| Subject (1:3)|
|---------------:|---------------:|------------:|-----------:|:-----------|--------------:|
| 1| 1| 1| 0.17142245|1 | 1|
| 2| 2| 2| 0.27733185|2 | 2|
| 3| 1| 3| 0.33203066|1 | 3|
| 4| 2| 1| 0.09483775|2 | 1|
| 5| 1| 2| 0.79609409|1 | 2|
| 6| 2| 3| 0.85227987|2 | 3|
| 7| 1| 1| 0.52804960|1 | 1|
| 8| 2| 2| 0.50156096|2 | 2|
| 9| 1| 3| 0.30680522|1 | 3|
| 10| 2| 1| 0.11250801|2 | 1|
require(data.table) # needs the latest R-Forge version of data.table for dcast
sample.table <- data.table(Sample  = seq_len(10),
                           Channel = rep(1:2, length.out = 10),
                           Trial   = rep(1:3, length.out = 10),
                           Voltage = runif(10),
                           Class   = as.factor(rep(1:2, length.out = 10)),
                           Subject = rep(1:3, length.out = 10))
The example is hokey, but pretend the columns are uniformly sampled from their much larger ranges.
Let's say you want to cast the data to wide format along all channels to plot with ggpairs. Now, a canonical dcast back to wide format will not work, with an id column or otherwise, because the column ranges are sparsely (and never completely) represented:
wide.table <- dcast.data.table(sample.table, Sample ~ Channel,
                               value.var = "Voltage",
                               drop = TRUE)
> wide.table
Sample 1 2
1: 1 0.1714224 NA
2: 2 NA 0.27733185
3: 3 0.3320307 NA
4: 4 NA 0.09483775
5: 5 0.7960941 NA
6: 6 NA 0.85227987
7: 7 0.5280496 NA
8: 8 NA 0.50156096
9: 9 0.3068052 NA
10: 10 NA 0.11250801
It's obvious in this case which id column would work, because it's a toy example (sample.table[,index:=seq_len(nrow(sample.table)/2)]), but with a tiny uniform sample of a huge data.table it's basically impossible to find a sequence of id values that will thread through every hole in your data when applied to the formula argument. This kludge will work:
setkey(sample.table,Class)
We'll need this at the end to ensure the ordering is fixed.
chan.split <- split(sample.table,sample.table$Channel)
That gets you a list of data.tables, one for each unique Channel.
cut.fringes <- min(sapply(chan.split, nrow))
chan.dt <- cbind(lapply(chan.split, function(x) {
  x[1:cut.fringes, ]$Voltage
}))
There has to be a better way to ensure each data.frame has an equal number of rows, but for my application, I can guarantee they're only a few rows different, so I just trim off the excess rows.
chan.dt <- as.data.table(matrix(unlist(chan.dt),
                                ncol = length(unique(sample.table$Channel)),
                                byrow = FALSE)) # fill column-wise so each Channel becomes a column
This will get you back to a big data.table, with Channels as columns.
chan.dt[, Class :=
  as.factor(rep(0:1, each = sampling.factor/2 * nrow(original.table)/ncol(chan.dt))[1:cut.fringes])]
Finally, I rebind my categorical variable back on. The tables should be sorted by category already so this will match. This assumes you have the original table with all the data; there are other ways to do it.
ggpairs(data = chan.dt,
        columns = 1:length(unique(sample.table$Channel)),
        colour = "Class", axisLabels = "show")
Now it's plottable with the above.

As far as I can tell, there is no way around this with ggpairs(). Also, you are absolutely correct not to fill in with 'fake' data. If it is appropriate to suggest here, I would recommend using a different plotting method, for example:
library(PerformanceAnalytics)
cor.data <- cor(data, use = "pairwise.complete.obs") # data correlations ignoring pair-wise NAs
chart.Correlation(data) # note: chart.Correlation() takes the raw data (not the correlation matrix)
                        # and uses pairwise-complete observations itself
or using code from here http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library/
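Along the same lines, a minimal sketch using the ellipse package from the linked post, assuming data is the all-numeric data frame containing NAs:
library(ellipse)
cor.data <- cor(data, use = "pairwise.complete.obs")
plotcorr(cor.data) # one ellipse per pairwise-complete correlation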

Related

Join and merge are only partially working in R

I am working with a dataset that was obtained from a global environment and loaded into R. It has been saved as a CSV and is being read back in as a data frame from that CSV. This dataset (survey_df) has almost 3 million entries. I am trying to join it, based on an ID column (repeated multiple times, since there are multiple entries per ID), to what was originally a shapefile and is now loaded in R as a data frame, shapefile_df. That data frame has 60,000 unique entries, each representing a geometry in a country, and we expect many survey entries per geometry in most cases.

I am using a simple left_join, which in theory should join these two datasets together, but I am running into an issue where they are not fully joining; only some entries match. I have tried inner, full and right joins as well as merge and I keep getting the same issue. I made a full_join plus a copy of the ID columns to compare the ones that are not joining, and I do not see any pattern: the IDs look identical, yet they are not joining for some reason. I tried formatting them with as.character and as.factor, and still nothing. Below I pasted a sample of the joined/unjoined IDs.
Matched ids
| survey_df_id  | survey_id_copy | shapefile_df_id |
|---------------|----------------|-----------------|
| 0901200010229 | 0901200010229  | 0901200010229   |
| 0901500010729 | 0901500010729  | 0901500010729   |
| 090050001087A | 090050001087A  | 090050001087A   |
| 0900600010467 | 0900600010467  | 0900600010467   |
| 0901400010897 | 0901400010897  | 0901400010897   |
| 0901200011960 | 0901200011960  | 0901200011960   |
Unmatched ids
| survey_df_id   | survey_id_copy | shapefile_df_id |
|----------------|----------------|-----------------|
| 01903900010480 | 01903900010480 | NA              |
| 070470001010A  | NA             | 070470001010A   |
| 0704700010117  | NA             | 0704700010117   |
| 0704700010140  | NA             | 0704700010140   |
| 0705200010672  | NA             | 0705200010672   |
| 0705200010742  | NA             | 0705200010742   |
Most of the unmatched entries are like the first row, where shapefile_df_id is NA. However, there are a few where survey_id_copy is NA. That field is simply a mutate() copy of survey_df_id and in theory should not differ, yet it does. Any idea what could be causing this? I suspect it is a formatting issue, but as I said, coercing with as.character and as.factor hasn't fixed it. I am using tidyverse and read.csv. Any help?
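A small diagnostic sketch for this kind of mismatch (assuming the join column is called id in both data frames; the specific checks are illustrative, not from the question):
library(dplyr)

# rows of survey_df whose id has no match in shapefile_df
unmatched <- anti_join(survey_df, shapefile_df, by = "id")

# common culprits: differing string lengths (lost or extra leading zeros),
# stray whitespace, or numeric vs character storage
table(nchar(as.character(survey_df$id)))
table(nchar(as.character(shapefile_df$id)))
survey_df$id    <- trimws(as.character(survey_df$id))
shapefile_df$id <- trimws(as.character(shapefile_df$id))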

Issues solving a Regression with numeric and categorical variables in R

I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I should ask it here.
I have a data frame dataset with a whole lot of different variables, very similar to the following:
| Item | Size | Value | Town |
|------|------|-------|------|
| A    | 10   | 800   | 1    |
| B    | 11   | 100   | 2    |
| A    | 17   | 900   | 2    |
| D    | 13   | 200   | 3    |
| B    | 15   | 500   | 1    |
| C    | 12   | 250   | 3    |
| E    | 14   | NA    | 2    |
| A    |      | 800   | 1    |
| C    |      | 800   | 2    |
Basically, I have to try and 'guess' the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.
I tried a polynomial regression (although I'm not even sure that's the right choice) to see how it looks, using a call similar to the following:
summary(lm(Size~ polym(factor(Item), Value, factor(Town), degree=2, raw=TRUE), dataset))
But I get this error when I try to do this:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value column. From what I understand, R ignores rows that have an NA value in any column used in the model. But what if I have a lot of NA values? It also seems like a waste of data to automatically eliminate an entire row when only one value in it is missing, so I was wondering whether there is a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?
Yes, follow @RemkoDuursma's suggestion of using lm(Size ~ factor(Item) + factor(Town) + Value, ...), and look into other degrees as well (was there a reason you chose squared?) by comparing residuals.
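A minimal sketch of that suggestion, assuming the data frame is called dataset as above; the I(Value^2) model is just one way to check whether a higher degree helps:
fit1 <- lm(Size ~ factor(Item) + factor(Town) + Value, data = dataset)
summary(fit1)

# compare against a model with a squared Value term
fit2 <- lm(Size ~ factor(Item) + factor(Town) + Value + I(Value^2), data = dataset)
anova(fit1, fit2)     # does the extra term improve the fit?
plot(fit1, which = 1) # residuals vs fitted values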
In regards to substituting NA values, you have many options (a small sketch of the first two is below):
- substitute all with the median of the variable
- substitute all with the mean of the variable
- substitute each with a prediction based on the values of the other variables
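A minimal sketch of the first two options, assuming the data frame is called dataset; this is simple column-wise substitution, shown only as an illustration:
dataset$Value[is.na(dataset$Value)] <- median(dataset$Value, na.rm = TRUE)
# or, with the mean instead:
# dataset$Value[is.na(dataset$Value)] <- mean(dataset$Value, na.rm = TRUE)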
good luck, and next time you might want to check out https://stats.stackexchange.com/!

Renaming and adding variables to a list of datasets in R

I've got a list of datasets, and I want to make a few changes to these datasets using R.
First, if the variable "mac_sector" exists, I want to rename it to "sector". (Edit: it always says mac_sector not found, even though it is present in at least one of the datasets. Also, if something is not found by if(exists()), does the script just continue with the rest, or does it terminate?)
Second, if there is no variable called "mac_sector" or "sector", I want to create a new column variable called "sector" with putting "total" as the values.
Lastly, I rearrange the columns because I want variable "sector" to be the 3rd column in each dataset.
I wrote the script (some parts are not even in R language) below, but obviously it's not working, so I'm hoping that some of you may be able to help me with this.
I also want to save these changes back to the respective datasets, but I have no idea how to even go about that in this particular case (I know of the save() command, but I feel like it wouldn't work here).
setwd("C:\\Users\\files")
mylist = list.files(pattern="*.dta")
# Loop through all of the datasets in C:\\Users\\files
# Reading the datasets into R
df <- lapply(mylist, read.dta13)
# Naming the list elements to match the files for convenience
names(df) <- gsub("\\.dta$", "", mylist)
# If column mac_sector exists, rename it to sector
if(exists(mac_sector, df)){
  df <- rename(df, c(mac_sector="sector"))
}
# If no column variable matching pattern("sector") exists, create variable sector = "total"
if(does not exist(pattern="sector")){
  sector <- c("total")
  df$sector <- sector
}
# rearrange variables, sector must be placed 3rd
df <- arrange.vars(df, c("sector" = 3))
edit:
I want all datasets to look like this (and some already do look like this):
Country|sector| Variable1| Variable2| Variable3|....
GER | M | value | value | value |....
BELG | K | value | value | value |....
and so on.
Now some of them look like this:
Country|mac_sector| Variable1| Variable2| Variable3|....
GER | F | value | value | value |....
BELG | L | value | value | value |....
In which case I want to rename mac_sector to sector.
They can also look like this:
Country| Variable1| Variable2| Variable3|....
GER | value | value | value |....
BELG | value | value | value |....
In which case I want to add a variable sector = total:
Country|sector| Variable1| Variable2| Variable3|....
GER | total| value | value | value |....
BELG | total| value | value | value |....
*Variable1, Variable2, Variable3 and so on, do not represent the same thing across datasets, just thought I should mention that.
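For what it's worth, a minimal sketch of one way to do all three steps over the list (assuming read.dta13 from the readstata13 package and that every dataset has at least two columns before sector; the object names are illustrative):
library(readstata13)

files <- list.files("C:\\Users\\files", pattern = "\\.dta$", full.names = TRUE)
df <- lapply(files, read.dta13)
names(df) <- gsub("\\.dta$", "", basename(files))

df <- lapply(df, function(d) {
  # 1. rename mac_sector to sector if present
  names(d)[names(d) == "mac_sector"] <- "sector"
  # 2. if there is still no sector column, add one filled with "total"
  if (!"sector" %in% names(d)) d$sector <- "total"
  # 3. move sector to the 3rd position
  other <- setdiff(names(d), "sector")
  d[, c(other[1:2], "sector", other[-(1:2)])]
})

# one possible way to write each modified dataset back to a .dta file:
# mapply(save.dta13, df, paste0(names(df), ".dta"))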

Using R to group values within different vectors so they can be plotted (ggplot2)

I have a question about how to group different vectors from a dataframe in order to compare and analyse them, for example by using ggplot2 to plot some graphs. To make it clearer, I will provide the type of dataframe I am working with.
| ID       | Date    | X | Y | Z | BR  |
|----------|---------|---|---|---|-----|
| 6001-102 | 2016-03 | 1 | 1 | 1 | 1.0 |
| 6001-102 | 2016-03 | 1 | 1 | 1 | 1.0 |
| 6001-102 | 2016-03 | 1 | 1 | 1 | 1.0 |
| 6044-460 | 2016-03 | 2 | 1 | 4 | 0.5 |
The data columns I am focused on here are Date, Z and BR.
The dates are characters containing the month and year, for example 2016-03 and 2015-05, whilst Z is numeric and ranges from 1-8. I am finding this complicated because what I want R to do is first group the results by date (for example looking only at May 2015) and then get the average BR for each level of Z. Z represents different time groups, so if I were using ggplot I would see the average BR for each time group in May.
Can anyone show me a good example or maybe a previous question that is trying to accomplish the same as me? Hopefully with ggplot2? I haven't found one, but I am sorry if this is a duplicate question.
Thank you for your help!
Edit: Removed dput as question answered.
You can use mydf %>% group_by(Date_fill, y) %>% summarise(z = sum(z, na.rm=TRUE))
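Building on that, a minimal sketch that filters one month, averages BR per level of Z and plots it with ggplot2 (column names follow the table above; mydf is an assumed name for the data frame):
library(dplyr)
library(ggplot2)

mydf %>%
  filter(Date == "2015-05") %>%               # e.g. only May 2015
  group_by(Z) %>%
  summarise(mean_BR = mean(BR, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(Z), y = mean_BR)) +
  geom_col() +
  labs(x = "Z (time group)", y = "average BR")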

Is it possible to combine separate boxplot summaries into one and create the combined graph?

I am working with rather large datasets (approx. 4 million rows per month, with 25 numeric attributes and 4 factor attributes). I would like to create a graph that contains, per month (for the last 36 months), a boxplot for each numeric attribute per product (one of the 4 factor attributes).
So as an example for product A:
-
_ | -
_|_ | _|_
| | | | |
| | _|_ | |
| | | | |---|
| | |---| | |
|---| | | | |
|_ _| | | |_ _|
| |_ _| |
| | |
- | -
-
--------------------------------------------------------------
jan '10 feb '10 mar '10 ................... feb '13
But since these are quite large datasets, I would like some advice on how to approach this. My idea (though I am not sure if it is possible) is to:
a) extract the data per month per product
b) create a boxplot for that specific month (so let's say jan '10 for product A)
c) store the boxplot summary data somewhere
d) repeat a-c for all months until feb '13
e) combine all the stored boxplot summary data into one
f) plot the combined boxplot
g) repeat a-f for all other products
So my main question is: is it possible to combine separate boxplot summaries into one and create the combined graph, as sketched above, from this?
Any help would be appreciated,
Thank you
Here's a long-hand example that you can probably cook something up around:
Read in the individual datasets - you might want to overwrite the same data or wrap this step in a function given the large data you are using.
dset1 <- 1:10
dset2 <- 10:20
dset3 <- 20:30
Store some boxplot info, notice the plot=FALSE
result1 <- boxplot(dset1,plot=FALSE,names="month1")
result2 <- boxplot(dset2,plot=FALSE,names="month2")
result3 <- boxplot(dset3,plot=FALSE,names="month3")
Group up the data and plot with bxp
mylist <- list(result1, result2, result3)
groupbxp <- do.call(mapply, c(cbind, mylist)) # cbind the corresponding components of each boxplot object
bxp(groupbxp)
Result: (the combined plot drawn by bxp(), showing the three boxplots side by side)
You will not be able to predict with absolute precision what the combined "fivenum" values will be. Think about the situation with two groups for which you have the 75th percentiles and the counts of observations in each group: if the percentiles are unequal, you cannot just take the weighted mean of the percentiles to get the 75th percentile of the aggregated values. See the help page for ?boxplot.stats. I would think, however, that you might come very close by using the median values of the fivenum collections. This might be a place to start your examinations:
mo.mtx <- tapply(dat$values, dat$month,
                 function(mo.dat) c(fivenum(mo.dat), length(mo.dat)))
mo.mtx <- do.call(rbind, mo.mtx) # bind the per-month summaries into a matrix
matplot(mo.mtx[, 1:5], type = "l")
