Data frame error when trying to do histogram - r

I converted https://users.stat.ufl.edu/~winner/data/sexlierel.dat into a csv file. How would I make a histogram like this https://d33wubrfki0l68.cloudfront.net/73b1aa01f56fdebb5829f8bb9efefd2d424165dd/0799c/eda_files/figure-html/unnamed-chunk-6-1.png? When I knit the document, I get a "data must be a data frame" error.
```{r}
data_set <- read.csv("project_data.csv", header = TRUE)
names(data_set)
summary(data_set)
summary(data_set$Gender)
data=data.frame("Gender","Count")
```
```{r}
plot_density(data_set)
ggplot(data = "Gender") + geom_histogram(mapping = aes(x = "Count"), binwidth = 1)
scatter=ggplot(data=data, aes("Gender", "Count")) + geom_point()
```

The following is probably closer to what you want:
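library(ggplot2)
data_set <- read.csv("project_data.csv", header = TRUE)
# Pass the data frame itself as data, and refer to its columns by bare name (no quotes)
ggplot(data = data_set, aes(Gender)) + geom_histogram(binwidth = 1)
scatter = ggplot(data = data_set, aes(Gender, Count)) + geom_point()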
This assumes there is a Count column in your data set. I don't know anything about plot_density() or what package it's from, so I have no advice about that.
Explanation of the other bits:
1) You don't need to call data.frame() to get your data into a data frame. read.csv() has already done that for you.
2) data.frame() doesn't work that way anyway; passing it "Gender" and "Count" just makes it construct an otherwise empty data frame with those two strings as cells. In my experience, calling data.frame() directly never does what I want it to, so I avoid it and use other functions that do that work for me.
3) ggplot(data = "Gender")... this part doesn't work because you are just giving ggplot a string instead of a data set.
4) scatter = ggplot(data = data ... this part would be fine, but data doesn't contain what you want because of #2 above.
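To make the points about data.frame() concrete, here is a minimal sketch contrasting what data.frame("Gender", "Count") actually builds with a data frame that has real columns. The toy values are invented for illustration and are not taken from the linked data set. geom_col() is used because the toy counts are already tabulated, which is an assumption about the shape of your data.

```r
library(ggplot2)

# Passing bare strings creates a 1-row data frame whose cells are the strings
wrong <- data.frame("Gender", "Count")
dim(wrong)   # one row, two columns

# Hypothetical stand-in for the result of read.csv(): real columns with values
toy <- data.frame(Gender = c("F", "M"), Count = c(10, 12))

# The data frame goes in `data`; column names go unquoted inside aes()
p <- ggplot(data = toy, aes(x = Gender, y = Count)) + geom_col()
print(p)
```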

Related

ggplot par new=TRUE option

I am trying to plot 400 ecdf graphs in one image using ggplot.
As far as I know ggplot does not support the par(new=T) option.
So the first solution I thought was use the grid.arrange function in gridExtra package.
However, the ecdfs I am generating are in a for loop format.
Below is my code, but you could ignore the steps for data processing.
i=1
for(i in 1:400)
{
test<-subset(df,code==temp[i,])
test<-test[c(order(test$Distance)),]
test$AI_ij<-normalize(test$AI_ij)
AI = test$AI_ij
ggplot(test, aes(AI)) +
stat_ecdf(geom = "step") +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
new_theme +
xlab("Calculated Accessibility Value") +
ylab("Percent")
}
So I have values stored in "AI" in the for loop.
In this case how should I plot 400 graphs in the same chart?
This is not the way to put multiple lines on a ggplot. To do this, it is far easier to pass all of your data together and map code to the "group" aesthetic to give you one ecdf line for each code.
By far the hardest part of answering this question was attempting to reverse-engineer your data set. The following data set should be close enough in structure and naming to allow the code to be run on your own data.
library(dplyr)
library(BBmisc)
library(ggplot2)
set.seed(1)
all_codes <- apply(expand.grid(1:16, LETTERS), 1, paste0, collapse = "")
temp <- data.frame(sample(all_codes, 400), stringsAsFactors = FALSE)
df <- data.frame(code = rep(all_codes, 100),
Distance = sqrt(rnorm(41600)^2 + rnorm(41600)^2),
AI_ij = rnorm(41600),
stringsAsFactors = FALSE)
Since you only want the 400 codes from temp that appear in df to be shown on the plot, you can use dplyr::filter to keep only the rows where code %in% temp[[1]], rather than iterating through the codes one element at a time.
You can then group_by code, and arrange by Distance within each group before normalizing AI_ij, so there is no need to split your data frame into a new subset for every line: the data is processed all at once and the data frame is kept together.
Finally, you plot this using the group aesthetic. Note that because you have 400 lines on one plot, you need to make each line faint in order to see the overall pattern more clearly. We do this by setting the alpha value to 0.05 inside stat_ecdf.
Note also that there are multiple packages with a function called normalize, and I don't know which one you are using. I have guessed you are using BBmisc.
So you can get rid of the loop and do:
df %>%
filter(code %in% temp[[1]]) %>%
group_by(code) %>%
arrange(Distance, .by_group = TRUE) %>%
mutate(AI = normalize(AI_ij)) %>%
ggplot(aes(AI, group = code)) +
stat_ecdf(geom = "step", alpha = 0.05) +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
xlab("Calculated Accessibility Value") +
ylab("Percent")

Changing title of plots in a loop with colnames() in R

I am creating a for loop which creates a ggplot2 plot for each of the first six columns in a dataframe. Everything works except for the looping of the title names. I have been trying to use title = colnames(df[,i]) and title = paste0(colnames(df[,i]) to create the proper title but it simply ends up repeating the 2nd column name. The plots themselves produce the data correctly for each column, but the title is for some reason not looping. For the first plot it produces the correct title, but then for the second plot and beyond it just keeps on repeating the third column name, completely skipping over the second column name. I even tried creating a variable within the loop to store the respective title name to then use within the ggplot2 title labels: changetitle <- colnames(df[,i]) and then using title = changetitle but that also loops incorrectly.
Here is an example of what I have so far:
plot_6 <- list()
for(i in df[1:6]){
plot_6[i] <- print(ggplot(df, aes(x = i, ...) ...) +
... +
labs(title = colnames(df[,i]),
x = ...) +
...)
}
Thank you very much.
df[1:6] is a data frame with six columns. When used as a loop variable, this results in i being a vector of values each time through the loop. This might "work" in the sense that ggplot will produce a plot, but it breaks the link between the data frame provided to ggplot (df in this case) and the mapping of df's columns to ggplot's aesthetics.
Here are a few options, using the built-in mtcars data frame:
library(tidyverse)
library(patchwork)
plot_6 <- list()
for(i in 1:6) {
var = names(mtcars)[i]
plot_6[[i]] <- ggplot(mtcars, aes(x = !!sym(var))) +
geom_density() +
labs(title = var)
}
# Use column names directly as loop variable
plot_6 <- list()
for(i in names(mtcars)[1:6]) {
plot_6[[i]] <- ggplot(mtcars, aes(x = !!sym(i))) +
geom_density() +
labs(title = i)
}
# Use map, which directly generates a list of plots
plot_6 = map(names(mtcars)[1:6],
~ggplot(mtcars, aes(x = !!sym(.x))) +
geom_density() +
labs(title = .x)
)
Any of these produces the same list of plots:
wrap_plots(plot_6)
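As a further variation on the options above, ggplot2's .data pronoun can stand in for !!sym(), so only ggplot2 needs to be loaded. This is a sketch using the same mtcars columns; the helper name make_density_plot is made up for illustration.

```r
library(ggplot2)

# Hypothetical helper: one density plot for the column named `var`
make_density_plot <- function(var) {
  ggplot(mtcars, aes(x = .data[[var]])) +
    geom_density() +
    labs(title = var, x = var)
}

vars <- names(mtcars)[1:6]
# setNames(nm = vars) names each element after its column, so the list is named too
plot_list <- lapply(setNames(nm = vars), make_density_plot)
names(plot_list)
```

The resulting list can be passed to wrap_plots() in the same way as plot_6.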

Overlaying trials in separate files onto one ggplot graph

I am trying to plot one graph with multiple trials (from separate text files). In the below case, I am plotting the "place" variable with the "firing rate" variable, and it works when I use ggplot on its own:
a <- read.table("trial1.txt", header = T)
library(ggplot2)
ggplot(a, aes(x = place, y = firing_rate)) + geom_point() + geom_path()
But when I try to create a for loop to go through each trial file in the folder and plot it on the same graph, I am having issues. This is what I have so far:
files <- list.files(pattern=".txt")
for (i in files){
p <- lapply(i, read.table)
print(ggplot(p, aes(x = place, y = firing_rate)) + geom_point() + geom_path())
}
It gives me a "Error: data must be a data frame, or other object coercible by fortify(), not a list" message. I am a novice in R so I am not sure what to make of that.
Thank you in advance for the help!
In general, avoiding explicit loops is good advice in R. Since you are using ggplot, you may be interested in the map_df function from the tidyverse (it lives in the purrr package):
First, create a read function that stores the file name as a trial label:
readDataFile = function(x){
a <- read.table(x, header = T)
a$trial = x
return(a)
}
Next up map_df:
dataComplete = map_df(files, readDataFile)
This runs our little function on each file and combines them all to a single data frame (of course assuming they are compatible in format).
Finally, you can plot almost as before but can distinguish based on the trial variable:
ggplot(dataComplete, aes(x = place, y = firing_rate, color=trial, group=trial)) + geom_point() + geom_path()
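The same idea can be sketched in base R, for readers who prefer not to depend on purrr. The error in the question comes from lapply() returning a list, which ggplot() cannot use as data; binding the per-file data frames into one data frame avoids that. The two temporary files and their values below are invented so that the example runs on its own.

```r
library(ggplot2)

# Create two small hypothetical trial files in a temporary folder
trial_dir <- file.path(tempdir(), "trials")
dir.create(trial_dir, showWarnings = FALSE)
for (k in 1:2) {
  write.table(data.frame(place = 1:3, firing_rate = c(2, 5, 3) * k),
              file.path(trial_dir, paste0("trial", k, ".txt")),
              row.names = FALSE)
}

# "\\.txt$" matches only names ending in .txt (".txt" alone is a looser regex)
files <- list.files(trial_dir, pattern = "\\.txt$", full.names = TRUE)

# Read one file and record which trial it came from
read_trial <- function(path) {
  d <- read.table(path, header = TRUE)
  d$trial <- basename(path)
  d
}

# Stack all trials into one data frame, then draw one path per trial
all_trials <- do.call(rbind, lapply(files, read_trial))
p <- ggplot(all_trials, aes(place, firing_rate, colour = trial, group = trial)) +
  geom_point() +
  geom_path()
print(p)
```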

how to pass an arguments to function to get a line plot using ggplot2?

I am trying to write a function to create time series plot (line graph). How do I pass an argument to function so that the plot is created? I tried different ways like using aes_string etc. but no success.
lineplotfun <- function(feature){
ggplot(aes(x = 1:length(feature), y = feature), data = mtcars) +
geom_line()
}
lineplotfun(mpg)
I want to pass mpg as string or name.
There are numerous problems with the code in the question.
1) y is not in aes()
2) if ggplot2 is loaded, mpg is a tibble
3) y = feature with data = mtcars is meaningless
4) 1:length(feature) only makes sense if feature is a vector
One way of achieving what you want is to set data = NULL and pass a vector to the function:
lineplotfun <- function(feature){
require(ggplot2)
ggplot2::ggplot(data = NULL, aes(x = seq_along(feature), y = feature)) +
ggplot2::geom_line()
}
lineplotfun(mtcars$mpg)
The result is a line plot of the mpg values against their index.
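Since the question asks for passing mpg as a string, here is one possible sketch that does so with ggplot2's .data pronoun. The function name lineplot_by_name and its data argument are illustrative choices, not part of the original answer.

```r
library(ggplot2)

# Hypothetical variant: `feature` is a column name given as a string
lineplot_by_name <- function(data, feature) {
  idx <- seq_len(nrow(data))   # 1, 2, ..., number of rows
  ggplot(data, aes(x = idx, y = .data[[feature]])) +
    geom_line() +
    labs(x = "Index", y = feature)
}

p <- lineplot_by_name(mtcars, "mpg")
print(p)
```

Taking the data frame as an argument keeps the column lookup explicit, so the same function works for any column of any data frame.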

creating a subset of data frame when running a loop

I'm quite new to R and trying to find my way around. I have created a new data frame based on the "original" data frame.
library(dplyr)
prdgrp <- as.vector(mth['MMITCL'])
prdgrp %>% distinct(MMITCL)
When doing this, the result is a list of unique values of the column MMITCL. I would like to use this data in a loop that first creates a new subset of the original data and then prints a graph based on it:
#START LOOP
for (i in 1:length(prdgrp))
{
# mth[c(MMITCL==prdgrp[i],]
mth_1 <- mth[c(mth$MMITCL==prdgrp[i]),]
# Development of TPC by month
library(ggplot2)
library(scales)
ggplot(mth_1, aes(Date, TPC_MTD))+ geom_line()
}
# END LOOP
Doing this gives me the following error message:
Error in mth$MMITCL == prdgrp[i] :
comparison of these types is not implemented
In addition: Warning:
In `[.data.frame`(mth, c(mth$MMITCL == prdgrp[i]), ) :
Incompatible methods ("Ops.factor", "Ops.data.frame") for "=="
What am I doing wrong.
If you just want to plot the outputs, there is no need to create a separate subset of the data frame first; it is simpler to subset inside the ggplot call in a loop (or, more likely, to use facet_wrap). Without seeing your data it is hard to give a precise answer, but the two generic iris examples below should also show where the subsetting of your data frame went wrong. Please let me know if you have any questions.
library(ggplot2)
#looping example
for(i in 1:length(unique(iris$Species))){
g <- ggplot(data = iris[iris$Species == unique(iris$Species)[i], ],
aes(x = Sepal.Length,
y = Sepal.Width)) +
geom_point()
print(g)
}
#facet_wrap example
g <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
facet_wrap(~Species)
g
However, if you need to save the data frames for later use, one option is to put them into a list. If you only need the data frame within the loop, you can drop the list and use whatever variable name you wish.
myData4Later <- list()
for(i in 1:length(unique(iris$Species))){
myData4Later[[i]] <- iris[iris$Species == unique(iris$Species)[i], ]
g <- ggplot(data = myData4Later[[i]],
aes(x = Sepal.Length,
y = Sepal.Width)) +
geom_point()
print(g)
}
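As for the error itself, the message suggests that prdgrp is still a data frame: mth['MMITCL'] is a one-column data frame, and as.vector() does not turn it into a plain vector, so prdgrp[i] is a data frame as well and "==" cannot compare it with a factor column. The sketch below uses a small invented mth (only the column names follow the question) to show a plain vector of unique values working in the loop.

```r
library(ggplot2)

# Invented stand-in for `mth`; only the column names follow the question
mth <- data.frame(
  MMITCL  = factor(rep(c("A", "B"), each = 3)),
  Date    = rep(as.Date("2020-01-01") + c(0, 31, 60), times = 2),
  TPC_MTD = c(5, 7, 6, 2, 3, 4)
)

is.data.frame(mth["MMITCL"])   # single brackets keep a data frame

# `$` extracts the column as a vector; unique() gives each group once
prdgrp <- unique(mth$MMITCL)

plots <- list()
for (g in as.character(prdgrp)) {
  mth_1 <- mth[mth$MMITCL == g, ]
  plots[[g]] <- ggplot(mth_1, aes(Date, TPC_MTD)) + geom_line()
  print(plots[[g]])   # print() is needed for plots to appear inside a loop
}
```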
