Single boxplot instead of multiple boxplots with ggplot - r

I would like to make a boxplot for a variable (Theta..vol..) depending on two factors (Tiefe) and (Ort).
> str(data)
'data.frame': 30 obs. of 6 variables:
$ Nummer : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : int 11 12 13 14 15 16 17 18 19 20 ...
$ Ort : Factor w/ 2 levels "NNW","S": 2 2 2 2 2 2 2 2 2 2 ...
$ Tiefe : int 20 20 20 20 20 50 50 50 50 50 ...
$ Gerät : int 2 2 2 2 2 2 2 2 2 2 ...
$ Theta..vol..: num 15 16.4 14.9 16.6 10.6 22.1 17.6 10 18 20.3 ...
My code is:
ggplot(data, aes(x = Tiefe, y = Theta..vol.., fill=Ort))+geom_boxplot()
Since Tiefe has 3 levels and Ort has 2 levels, I expect to see three pairs of boxplots (one pair for each value of Tiefe).
But I see just a single pair: one boxplot for the first level of Ort and another for the second level.
What should I change to get one pair for each Tiefe? Thank you.

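In your data, Tiefe is stored as an integer, not a factor, so ggplot treats it as a single continuous x variable and groups the boxes only by Ort.
An easy fix using dplyr with ggplot2:
First, I made some dummy data:
library(dplyr)
library(ggplot2)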
data <- tibble(
Ort = ifelse(runif(30) > 0.5, "NNW", "S"),
Tiefe = rep(c(20, 50, 75), times = 10),
Theta..vol.. = rnorm(30,15))
Next, we modify the Tiefe column before piping into the ggplot:
data %>%
mutate(Tiefe = factor(Tiefe)) %>%
ggplot(aes(x = Tiefe, y = Theta..vol.., fill = Ort)) +
geom_boxplot()
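The same idea can be sketched without dplyr by converting Tiefe inside aes(). The data frame `demo` and the column name `Theta` below are made-up stand-ins for the asker's data, not the real measurements.

```r
library(ggplot2)

# Hypothetical stand-in data: 3 depths x 2 sites, 5 observations each.
set.seed(1)
demo <- data.frame(
  Ort   = rep(c("NNW", "S"), each = 15),
  Tiefe = rep(c(20, 50, 75), times = 10),
  Theta = rnorm(30, mean = 15)
)

# factor(Tiefe) makes the x axis discrete, so ggplot draws one pair of
# boxes (one per Ort) at each depth instead of one pair overall.
p <- ggplot(demo, aes(x = factor(Tiefe), y = Theta, fill = Ort)) +
  geom_boxplot() +
  labs(x = "Tiefe")
p
```

Converting inside aes() leaves the original column numeric, which may be convenient if Tiefe is still needed as a number elsewhere.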

Related

R round correlate function from corrr package

I'm creating a correlation table using the correlate function in the corrr package. Here is my code and a screenshot of the output.
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I think this would look better and be easier to read if I could round off the values in the correlation table. I tried this code:
correlation_table <- round(corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson"),2)
But I get this error:
Error in Math.data.frame(list(term = c("prof_rank_factor", "yrs.since.phd", : non-numeric variable(s) in data frame: term
The non-numeric variables part of this error message doesn't make sense to me. When I look at the structure I only see integer or numeric variable types.
'data.frame': 397 obs. of 6 variables:
$ prof_rank_factor : num 3 3 1 3 3 2 3 3 3 3 ...
$ yrs.since.phd : int 19 20 4 45 40 6 30 45 21 18 ...
$ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
$ salary : num 139750 173200 79750 115000 141500 ...
$ sex_factor : num 1 1 1 1 1 1 1 1 1 2 ...
$ discipline_factor: num 2 2 2 2 2 2 2 2 2 2 ...
How can I clean up this correlation table with rounded values?
correlate() returns a tibble whose first column, term, is character, which is why round() fails on the whole object. Round only the numeric columns instead:
library(dplyr)
corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson") %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 2)))
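As a self-contained illustration, here is a sketch on the built-in mtcars data (used only as a stand-in for the asker's data set). It also shows corrr's fashion() helper, which formats a correlation table for display; note that fashion() returns text columns, so it is meant for printing rather than further computation.

```r
library(corrr)
library(dplyr)

# mtcars is a stand-in for the asker's data (an assumption for illustration).
cors <- correlate(mtcars[, c("mpg", "hp", "wt")],
                  method = "pearson", quiet = TRUE)

# The first column ("term") holds variable names as text, which is what
# makes round() on the whole tibble fail.
sapply(cors, class)

# Round only the numeric columns; the result is still numeric.
rounded <- cors %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 2)))
rounded

# Display-only alternative from corrr itself.
fashion(cors, decimals = 2)
```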
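Alternatively, set the global digits option, which affects how numbers are printed rather than the stored values: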
options(digits=2)
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table

How do you add jitter to a scatterplot matrix in ggpairs?

I want to add jitter to a scatterplot matrix. The question was addressed on the following page (and nowhere else) on stackoverflow:
How to produce a meaningful draftsman/correlation plot for discrete values
But both solutions to the jitter problem which were suggested there involve deprecated code (plotmatrix and params):
library(ggplot2)
plotmatrix(y) + geom_jitter(alpha = .2)
library(GGally)
ggpairs(y, lower = list(params = c(alpha = .2, position = "jitter")))
I would have simply commented asking for an update there so as to not create a new question, but that appears to require reputation points, and I'm new to the site. My apologies if I've done something wrong in posting the question.
EDIT:
Here's what the data looks like:
> str(EHRound4.subset)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 301 obs. of 22 variables:
$ Subject# : int 1 2 3 4 6 7 8 13 14 16 ...
$ Condition : Factor w/ 2 levels "CDR","Mturk": 1 1 1 1 1 1 1 1 1 1 ...
$ Launch4 : int 5 8 8 5 8 5 3 8 5 6 ...
$ NewSong4 : int 6 8 8 6 8 6 8 8 8 7 ...
$ StudCom5 : int 6 5 8 3 1 3 4 8 7 7 ...
$ Textbook5 : int 8 1 8 3 1 7 8 8 8 8 ...
And here are several attempts at getting jitter:
> ggpairs(EHRound4.subset, columns = 3:6,
ggplot2::aes(colour=Condition), lower = list(geom_jitter(alpha = .2)))
> ggpairs(EHRound4.subset, columns = 3:6,
ggplot2::aes(colour=Condition, alpha=.2), lower = list(geom_jitter()))
> ggpairs(EHRound4.subset, columns = 3:6,
ggplot2::aes(colour=Condition, alpha=.2, position="jitter"))
@user20650 answered the question in the comments below the question. For completeness, here it is in the form of an answer:
Use wrap, such as:
library(GGally)
ggpairs(y, lower = list(continuous=wrap("points", position=position_jitter(height=3, width=3))))
Using position = position_jitter() instead of just position = "jitter" (which also works) lets you control the additional jitter parameters, such as height and width.
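A fuller sketch combining jitter, transparency, and colour by group might look like the following. The data frame `likert` is simulated; its column names are borrowed from the question, but the values are random and the jitter amounts are arbitrary choices.

```r
library(ggplot2)
library(GGally)

# Simulated 1-8 rating data (not the asker's data).
set.seed(42)
likert <- data.frame(
  Condition = factor(sample(c("CDR", "Mturk"), 200, replace = TRUE)),
  Launch4   = sample(1:8, 200, replace = TRUE),
  NewSong4  = sample(1:8, 200, replace = TRUE),
  StudCom5  = sample(1:8, 200, replace = TRUE)
)

# wrap() attaches extra arguments to the "points" panel function, so the
# lower-triangle scatterplots are drawn jittered and semi-transparent.
pm <- ggpairs(
  likert,
  columns = 2:4,
  mapping = aes(colour = Condition),
  lower = list(
    continuous = wrap("points",
                      position = position_jitter(width = 0.3, height = 0.3),
                      alpha = 0.2)
  )
)
# print(pm) draws the matrix
```

Passing alpha through wrap() rather than aes() keeps it a fixed setting instead of a mapped aesthetic, so it does not show up in the legend.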

R: ggplot: plot shows vertical lines instead of time course

I am trying to get a simple plot showing the time course of worry duration over 6 days for two groups. However, I get vertical lines instead of a line showing the time course.
This is what my data looks like:
> head(alldays_dur)
ParticipantID Session Day Time Worry_duration group
1 1 2 1 71804 15 intervention
2 1 4 1 56095 5 intervention
3 2 2 1 36739 15 intervention
4 2 4 1 45013 10 intervention
5 2 5 1 51026 5 intervention
This is the structure of my data
> str(alldays_dur)
'data.frame': 2620 obs. of 10 variables:
$ ParticipantID : num 113 113 113 113 113 113 113 113 113 113 ...
$ Session : num 9 10 11 12 14 15 16 21 22 24 ...
$ Day : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 2 3 3
$ Time : num 37350 42862 47952 51555 61499 ...
$ Worry_duration: num 5 5 5 5 10 0 5 5 5 5 ...
$ group : Factor w/ 2 levels "Intervention group",..: 1 1 1 1 1 1
I have tried the following code:
p <- ggplot(alldays_dur, aes(x=Day, y=Worry_duration, group=1)) +
geom_line() +
labs(x = "Day",
y = "Mean worry duration in minutes per day")
print(p)
However, I get a plot with vertical lines at each day instead of a time course (image not shown).
I have included the group=1 in the code after reading some earlier posts on this topic. However, it didn't help me as I had hoped.
Do you maybe have some useful tips for me? Thank you in advance.
P.S. I am sorry if the post is unclear in any way. This is my first time posting on Stack Overflow, so I am not yet familiar with all the post options.
You need to summarize your data first, so that there is one value per day and group. With ddply, for example:
library(plyr) # ddply
library(ggplot2) # ggplot
# Create an example dataset
raw_data <- data.frame(Day = sample(1:6, 100, replace = TRUE),
group = sample(c("group_1", "group_2"), 100, replace = TRUE),
Worry_duration = sample(seq(0, 30, 5), 100, replace = TRUE))
# Summarize: mean worry duration per day and group
DF <- ddply(raw_data, c("Day", "group"), summarize,
Worry_duration.mean = mean(Worry_duration, na.rm = TRUE))
# Plot one line per group
ggplot(DF, aes(x = Day, y = Worry_duration.mean, group = group, color = group)) +
geom_line() + xlab("Day") + ylab("Mean worry duration in minutes per day")
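For readers who prefer dplyr over plyr, a roughly equivalent sketch is shown below. The data are simulated, and Day is made a factor to mirror the structure in the question; `daily_means` and `mean_duration` are names chosen for this example.

```r
library(dplyr)
library(ggplot2)

# Simulated data with several observations per day and group.
set.seed(1)
raw_data <- data.frame(
  Day            = factor(sample(1:6, 200, replace = TRUE)),
  group          = sample(c("group_1", "group_2"), 200, replace = TRUE),
  Worry_duration = sample(seq(0, 30, 5), 200, replace = TRUE)
)

# Collapse to one mean per Day and group; without this step geom_line()
# connects every raw value at the same x, which shows up as vertical lines.
daily_means <- raw_data %>%
  group_by(Day, group) %>%
  summarise(mean_duration = mean(Worry_duration, na.rm = TRUE),
            .groups = "drop")

# group = group is needed because Day is a factor.
p <- ggplot(daily_means,
            aes(x = Day, y = mean_duration, group = group, colour = group)) +
  geom_line() +
  labs(x = "Day", y = "Mean worry duration in minutes per day")
p
```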

Modeling a very big data set (1.8 Million rows x 270 Columns) in R

I am working on Windows 8 with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I have to fit a glm (logit or any other classification model).
I've tried using the ff package and bigglm for handling the data.
But I still get the error "Error: cannot allocate vector of size 81.5 Gb".
So I decreased the number of rows to 10 and tried the steps for bigglm on an object of class ffdf. However, the error persists.
Can anyone suggest a solution for building a classification model with this many rows and columns?
**EDITS**:
I am not using any other program when I am running the code.
About 60% of the RAM is free before I run the code, and the used portion is due to R itself: when I terminate R, 80% of the RAM is free.
I am adding some of the columns I am working with, as suggested by the commenters, for reproduction.
OPEN_FLG is the dependent variable (DV) and the others are independent variables (IDVs).
str(x[1:10,])
'data.frame': 10 obs. of 270 variables:
$ OPEN_FLG : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1
$ new_list_id : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1
$ new_mailing_id : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1
$ NUM_OF_ADULTS_IN_HHLD : num 3 2 6 3 3 3 3 6 4 4
$ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5
$ OCCUP_DETAIL : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2
$ OCCUP_MIX_PCT : num 0 0 0 0 0 0 0 0 0 0
$ PCT_CHLDRN : int 28 37 32 23 36 18 40 22 45 21
$ PCT_DEROG_TRADES : num 41.9 38 62.8 2.9 16.9 ...
$ PCT_HOUSEHOLDS_BLACK : int 6 71 2 1 0 4 3 61 0 13
$ PCT_OWNER_OCCUPIED : int 91 66 63 38 86 16 79 19 93 22
$ PCT_RENTER_OCCUPIED : int 8 34 36 61 14 83 20 80 7 77
$ PCT_TRADES_NOT_DEROG : num 53.7 55 22.2 92.3 75.9 ...
$ PCT_WHITE : int 69 28 94 84 96 79 91 29 97 79
$ POSTAL_CD : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577
$ PRES_OF_CHLDRN_0_3 : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4
$ PRES_OF_CHLDRN_10_12 : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3
[list output truncated]
And this is the example of code which I am using.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
require(ff)
x$id <- ffseq_len(nrow(x))
xex <- expand.ffgrid(x$id, ff(1:100))
colnames(xex) <- c("id","explosion.nr")
xex <- merge(xex, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)
The problem is that both times I get the same error: "Error: cannot allocate vector of size 81.5 Gb".
Please let me know if this is enough or whether I should include any more details about the problem.
I have the impression you are not using ffbase::bigglm.ffdf but you want to. Namely the following will put all your data in RAM and will use biglm::bigglm.function, which is not what you want.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load package ffbase which exports bigglm.ffdf.
If you use ffbase, you can use the following:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
Explanation:
Because you don't limit yourself to the columns you use in the model, all columns of your xex ffdf are pulled into RAM, which is not needed. You were also fitting a gaussian model (the default family) on a factor response, which is odd. I believe you were trying to do a logistic regression, so use the appropriate family argument. With ffbase loaded, the call uses ffbase::bigglm.ffdf and not biglm::bigglm.function.
If that does not work, which I doubt, it is because you have other things in RAM that you are not aware of. In that case, do the following:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
ffsave(mymodeldataset, file = "mymodeldataset")
## Open R again
require(ffbase)
require(biglm)
ffload("mymodeldataset")
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
And off you go.
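To see the two key ingredients (only the model columns, and family = binomial()) in isolation, here is a small sketch that uses biglm alone on a simulated in-memory data frame. It does not involve ff or ffbase; the data frame `sim`, its columns, and the chunk size are assumptions made for this illustration.

```r
library(biglm)

# Simulated data with a binary response; only the columns used in the
# model are kept in the data frame.
set.seed(123)
n <- 20000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1,
                prob = plogis(-0.5 + 1.0 * sim$x1 - 0.7 * sim$x2))

# bigglm processes the data frame in chunks of `chunksize` rows;
# family = binomial() requests a logistic regression rather than the
# default gaussian model.
fit <- bigglm(y ~ x1 + x2, data = sim, family = binomial(),
              chunksize = 5000)
coef(fit)
```

The response here is coded 0/1, in the same spirit as the TRUE/FALSE recoding of OPEN_FLG above.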

stacked barchart with lattice: is my data too big?

I want a graph that looks similar to the example given in the lattice docs:
#EXAMPLE GRAPH, not my data
> barchart(yield ~ variety | site, data = barley,
+ groups = year, layout = c(1,6), stack = TRUE,
+ auto.key = list(points = FALSE, rectangles = TRUE, space = "right"),
+ ylab = "Barley Yield (bushels/acre)",
+ scales = list(x = list(rot = 45)))
I melted my data to obtain this "long" form dataframe:
> str(MDist)
'data.frame': 34560 obs. of 6 variables:
$ fCycle : Factor w/ 2 levels "Dark","Light": 2 2 2 2 2 2 2 2 2 2 ...
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 96 levels "c1","c10","c11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ timepoint: num 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 55.7 75.3 99.2 45.9 73.8 79.3 73.5 69.8 67.6 ...
I want to create a stacked barchart for each groupname and fCycle. I tried this:
barchart(value~timepoint|groupname*fCycle, data=MDist, groups=variable,stack=T)
It doesn't throw any errors, but it's still thinking after 30 minutes. Is this because it doesn't know how to deal with the 36 values that contribute to each bar? How can I make this data easier for barchart to digest?
I don't know lattice well, but could it be because your timepoint variable is numeric, not a factor?
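One way to test that guess is sketched below on a small simulated data set: aggregate to one value per bar segment first, then convert timepoint to a factor before calling barchart. Taking the mean per timepoint and variable is an assumption about what each bar should show, and the variable names `var_a`, `var_b`, `var_c` are placeholders.

```r
library(lattice)

# Small simulated long-format data: several locations per timepoint.
set.seed(1)
toy <- expand.grid(
  timepoint = 1:5,
  variable  = c("var_a", "var_b", "var_c"),
  location  = paste0("c", 1:4)
)
toy$value <- runif(nrow(toy), 0, 100)

# One value per timepoint and variable, so each bar has exactly one
# segment per variable instead of many overplotted ones.
agg <- aggregate(value ~ timepoint + variable, data = toy, FUN = mean)

# A factor on the x side gives barchart one discrete category per timepoint.
agg$timepoint <- factor(agg$timepoint)

p <- barchart(value ~ timepoint, data = agg, groups = variable,
              stack = TRUE, auto.key = TRUE)
p
```

With the real data, groupname and fCycle could be added to both the aggregate() formula and the conditioning part of the barchart formula.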
