creating a subset of data frame when running a loop - r

I'm quite new to R and still trying to find my way around. I have created a new data frame based on the "original" data frame.
library(dplyr)
prdgrp <- as.vector(mth['MMITCL'])
prdgrp %>% distinct(MMITCL)
When I do this, the result is a list of the unique values in column MMITCL. I would like to use these values in a loop that first creates a new subset of the original data and then prints a graph based on it:
#START LOOP
for (i in 1:length(prdgrp))
{
# mth[c(MMITCL==prdgrp[i],]
mth_1 <- mth[c(mth$MMITCL==prdgrp[i]),]
# Development of TPC by month
library(ggplot2)
library(scales)
ggplot(mth_1, aes(Date, TPC_MTD))+ geom_line()
}
# END LOOP
Doing this gives me the following error message:
Error in mth$MMITCL == prdgrp[i] :
comparison of these types is not implemented
In addition: Warning:
In `[.data.frame`(mth, c(mth$MMITCL == prdgrp[i]), ) :
Incompatible methods ("Ops.factor", "Ops.data.frame") for "=="
What am I doing wrong?

If you just want to plot the outputs, there is no need to subset the data frame first; it is simpler to put ggplot in a loop (or, more likely, use facet_wrap). Without seeing your data it is hard to give a precise answer, but the two generic iris examples below should also show where the subsetting of your data frame went wrong. Please let me know if you have any questions.
library(ggplot2)
#looping example
for(i in 1:length(unique(iris$Species))){
g <- ggplot(data = iris[iris$Species == unique(iris$Species)[i], ],
aes(x = Sepal.Length,
y = Sepal.Width)) +
geom_point()
print(g)
}
#facet_wrap example
g <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
facet_wrap(~Species)
g
However, if you need to save the data frames for later use, one option is to put them into a list. If you only need the data frame within the loop, you can drop the list and use whatever variable name you wish.
myData4Later <- list()
for(i in 1:length(unique(iris$Species))){
myData4Later[[i]] <- iris[iris$Species == unique(iris$Species)[i], ]
g <- ggplot(data = myData4Later[[i]],
aes(x = Sepal.Length,
y = Sepal.Width)) +
geom_point()
print(g)
}
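For the error in the question itself, the message points at comparing a factor column with a data frame: mth['MMITCL'] is a one-column data frame rather than a vector of values, and the result of distinct() is never assigned. A minimal sketch of one way around it follows; the contents of mth here are made up for illustration, since the real data was not posted.

```r
# Hypothetical stand-in for the asker's `mth` data frame
mth <- data.frame(
  MMITCL  = factor(c("A", "A", "B", "B")),
  Date    = as.Date(c("2020-01-01", "2020-02-01", "2020-01-01", "2020-02-01")),
  TPC_MTD = c(10, 12, 7, 9)
)

# Single-bracket indexing keeps the data frame structure,
# so `==` has no vector of values to compare against.
is.data.frame(mth["MMITCL"])

# `$` (or [[ ]]) extracts the column itself; unique() then gives the groups
groups <- as.character(unique(mth$MMITCL))

# One subset per group, stored by name
subsets <- list()
for (g in groups) {
  subsets[[g]] <- mth[mth$MMITCL == g, ]
}
```

Each element of subsets can then be passed to ggplot inside the loop, wrapped in print() as in the examples above.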

Related

Data frame error when trying to do histogram

I converted https://users.stat.ufl.edu/~winner/data/sexlierel.dat into a csv file. How would I do a histogram like this https://d33wubrfki0l68.cloudfront.net/73b1aa01f56fdebb5829f8bb9efefd2d424165dd/0799c/eda_files/figure-html/unnamed-chunk-6-1.png? When I knit it gives me a data must be in data frame error.
data_set <- read.csv("project_data.csv", header = TRUE)
names(data_set)
summary(data_set)
summary(data_set$Gender)
data=data.frame("Gender","Count")
```
```{r}
plot_density(data_set)
ggplot(data = "Gender") + geom_histogram(mapping = aes(x = "Count"), binwidth = 1)
scatter=ggplot(data=data, aes("Gender", "Count")) + geom_point()
```
The following is probably closer to what you want:
library(ggplot2)
# read.csv() already returns a data frame
data_set <- read.csv("project_data.csv", header = TRUE)
ggplot(data = data_set, aes(Gender)) + geom_histogram(binwidth = 1)
scatter = ggplot(data = data_set, aes(Gender, Count)) + geom_point()
This assumes there is a Count column in your data set. I don't know anything about plot_density() or what package it's from, so I have no advice about that.
Explanation of the other bits:
1. You don't need to call data.frame() to get your data into a data frame. read.csv() has already done that for you.
2. data.frame() doesn't work that way anyway; passing it "Gender" and "Count" just makes it construct a one-row data frame with those two strings as its cells. In my experience, calling data.frame() directly never does what I want it to, so I avoid it and use other functions that do that work for me.
3. ggplot(data = "Gender")... this part doesn't work because you are giving ggplot a string instead of a data set.
4. scatter = ggplot(data = data ... this part would be fine, but data doesn't contain what you want because of #2 above.
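To see why the data.frame() call does not help, it can be run on its own. This is only a small illustration; the data_set built below is a made-up stand-in for what read.csv() would return, not the real file.

```r
# Passing two strings creates one row with two cells holding those strings
d <- data.frame("Gender", "Count")
dim(d)
str(d)

# Hypothetical stand-in for the result of read.csv("project_data.csv")
data_set <- data.frame(Gender = c(0, 1, 1, 0), Count = c(3, 5, 2, 4))

# This is already a data frame, so it can go straight into ggplot(data = ...)
is.data.frame(data_set)
```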

How to combine jobs to avoid nested lapply

I have a data frame on which I would like to perform multiple operations. Here is an example to illustrate it, in this case creating a list of plots:
library(tidyverse)
plot_fun = function(data, geom) {
plot = ggplot(data, aes(x = factor(0), y = Sepal.Length))
if (geom == 'bar') {
plot = plot + geom_col()
} else if (geom == 'box') {
plot = plot + geom_boxplot()
}
plot +
labs(x = unique(data$Species)) +
theme_bw() +
theme(axis.text.x = element_blank())
}
As you can see, this function takes a data frame and draws one of two types of plot, depending on the geom parameter.
In my real problem, I have to split the data frame by one or multiple factors, and do the job. Do not take care about this specific example (I know I can put iris$Species on x-axis)
iris_ls = split(iris, iris$Species)
geom_ls = c('bar', 'box')
lapply(geom_ls, function(g) {
lapply(iris_ls, function(x) {
plot_fun(x, g)
})
})
My problem is that if I want to create both types of plots, I have to write a nested lapply (which performs badly when parallelizing).
So my question is: how can I avoid the nested lapply?
Should I multiply the length of iris_ls by the length of the geom_ls vector?
I do not know how to approach this. Imagine I have multiple geom-like parameters in my function.
PS: Using drop = TRUE in the split function does not drop the unused factor levels within each element of the list. I don't know if that is the correct way to do it; I have to use another lapply to drop them.
Use the purrr package:
cross_ls <- purrr::cross(list(iris = split(iris, iris$Species),
geom = list('bar', 'box')))
cross_ls %>% purrr::map(~{plot_fun(.x$iris,.x$geom)})
or, in its parallel version:
library(furrr)
plan(multisession)  # plan(multiprocess) is deprecated in current versions of the future package
cross_ls %>% furrr::future_map(~{plot_fun(.x$iris,.x$geom)})
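Recent purrr releases deprecate cross(), so the same flattening idea can also be sketched in base R: build one row per (species, geom) combination with expand.grid() and make a single pass over the rows with Map(). The names jobs and plots are made up for this sketch, and plot_fun is repeated from the question with the axis label coerced to character.

```r
library(ggplot2)

# plot_fun as defined in the question
plot_fun <- function(data, geom) {
  plot <- ggplot(data, aes(x = factor(0), y = Sepal.Length))
  if (geom == "bar") {
    plot <- plot + geom_col()
  } else if (geom == "box") {
    plot <- plot + geom_boxplot()
  }
  plot +
    labs(x = as.character(unique(data$Species))) +
    theme_bw() +
    theme(axis.text.x = element_blank())
}

iris_ls <- split(iris, iris$Species)
geom_ls <- c("bar", "box")

# One row per (species, geom) combination: 3 x 2 = 6 jobs
jobs <- expand.grid(species = names(iris_ls), geom = geom_ls,
                    stringsAsFactors = FALSE)

# A single flat pass over all combinations instead of a nested lapply
plots <- Map(function(s, g) plot_fun(iris_ls[[s]], g),
             jobs$species, jobs$geom)
```

So yes, the number of jobs is the product of the two lengths. Additional parameters would simply become extra columns in jobs and extra arguments to the function passed to Map().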

Changing title of plots in a loop with colnames() in R

I am creating a for loop which creates a ggplot2 plot for each of the first six columns in a dataframe. Everything works except for the looping of the title names. I have been trying to use title = colnames(df[,i]) and title = paste0(colnames(df[,i]) to create the proper title but it simply ends up repeating the 2nd column name. The plots themselves produce the data correctly for each column, but the title is for some reason not looping. For the first plot it produces the correct title, but then for the second plot and beyond it just keeps on repeating the third column name, completely skipping over the second column name. I even tried creating a variable within the loop to store the respective title name to then use within the ggplot2 title labels: changetitle <- colnames(df[,i]) and then using title = changetitle but that also loops incorrectly.
Here is an example of what I have so far:
plot_6 <- list()
for(i in df[1:6]){
plot_6[i] <- print(ggplot(df, aes(x = i, ...) ...) +
... +
labs(title = colnames(df[,i]),
x = ...) +
...)
}
Thank you very much.
df[1:6] is a data frame with six columns. When used as a loop variable, this results in i being a vector of values each time through the loop. This might "work" in the sense that ggplot will produce a plot, but it breaks the link between the data frame provided to ggplot (df in this case) and the mapping of df's columns to ggplot's aesthetics.
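A quick way to see this is to record what the loop variable holds in each case. The sketch below uses mtcars; seen and col_names are names made up for the illustration.

```r
# Looping over a data frame visits its columns, so `i` is a vector of values
seen <- list()
for (i in mtcars[1:3]) {
  seen[[length(seen) + 1]] <- i
}
length(seen[[1]])   # one value per row of mtcars, with no column name attached

# Looping over the names keeps the column name available for titles
col_names <- character(0)
for (nm in names(mtcars)[1:3]) {
  col_names <- c(col_names, nm)
}
col_names
```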
Here are a few options, using the built-in mtcars data frame:
library(tidyverse)
library(patchwork)
plot_6 <- list()
for(i in 1:6) {
var = names(mtcars)[i]
plot_6[[i]] <- ggplot(mtcars, aes(x = !!sym(var))) +
geom_density() +
labs(title = var)
}
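# Use column names directly as loop variable
plot_6 <- list()  # start from an empty list so entries are not appended to the first loop's results
for(i in names(mtcars)[1:6]) {
plot_6[[i]] <- ggplot(mtcars, aes(x = !!sym(i))) +
geom_density() +
labs(title = i)  # here the loop variable itself holds the column name
}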
# Use map, which directly generates a list of plots
plot_6 = map(names(mtcars)[1:6],
~ggplot(mtcars, aes(x = !!sym(.x))) +
geom_density() +
labs(title = .x)
)
Any of these produces the same list of plots:
wrap_plots(plot_6)

Overlaying trials in separate files onto one ggplot graph

I am trying to plot one graph with multiple trials (from separate text files). In the below case, I am plotting the "place" variable with the "firing rate" variable, and it works when I use ggplot on its own:
a <- read.table("trial1.txt", header = T)
library(ggplot2)
ggplot(a, aes(x = place, y = firing_rate)) + geom_point() + geom_path()
But when I try to create a for loop to go through each trial file in the folder and plot it on the same graph, I am having issues. This is what I have so far:
files <- list.files(pattern=".txt")
for (i in files){
p <- lapply(i, read.table)
print(ggplot(p, aes(x = place, y = firing_rate)) + geom_point() + geom_path())
}
It gives me an "Error: data must be a data frame, or other object coercible by fortify(), not a list" message. I am a novice in R, so I am not sure what to make of that.
Thank you in advance for the help!
In general, avoiding loops is the best advice in R. Since you are using ggplot, you may be interested in the map_df function from purrr (part of the tidyverse).
First, create a read function that includes the file name as a trial label:
readDataFile = function(x){
a <- read.table(x, header = T)
a$trial = x
return(a)
}
Next up, map_df:
library(tidyverse)  # map_df comes from purrr
dataComplete = map_df(files, readDataFile)
This runs our little function on each file and combines the results into a single data frame (assuming, of course, that the files are compatible in format).
Finally, you can plot almost as before but can distinguish based on the trial variable:
ggplot(dataComplete, aes(x = place, y = firing_rate, color=trial, group=trial)) + geom_point() + geom_path()
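The error in the question comes from lapply() returning a list, which ggplot cannot use as data. For comparison, here is a minimal base R sketch of the same read-and-combine step. It writes two small made-up trial files to a temporary directory so that it runs on its own; the file contents and the name read_trial are assumptions for illustration only.

```r
# Write two small hypothetical trial files to a temporary directory
tmp <- tempfile("trials")
dir.create(tmp)
for (k in 1:2) {
  write.table(
    data.frame(place = 1:3, firing_rate = c(0.5, 1.5, 1.0) * k),
    file.path(tmp, paste0("trial", k, ".txt")),
    row.names = FALSE
  )
}

# The pattern is a regular expression, so escape the dot and anchor the end
files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Read one file and tag its rows with the file name
read_trial <- function(path) {
  a <- read.table(path, header = TRUE)
  a$trial <- basename(path)
  a
}

trials <- lapply(files, read_trial)     # a list of data frames, not a data frame
all_trials <- do.call(rbind, trials)    # stacked into one data frame for ggplot
```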

Function for formatting and plotting in R

I am currently trying to create a function that will format my data properly and return a sorted bar plot. Yet for some reason I keep getting this error:
Error in `$<-.data.frame`(`*tmp*`, "Var1", value = integer(0)) :
replacement has 0 rows, data has 3
I have tried debugging it, but have had no luck. I have an example of what I expect down at the bottom. Can anyone spot what I am doing wrong?
x <- rep(c("Mark","Jimmy","Jones","Jones","Jones","Jimmy"),2)
y <- rnorm(12)
df <- data.frame(x,y)
plottingfunction <- function(data, name,xlabel,ylabel,header){
newDf <- data.frame(table(data))
order <- newDf[order(newDf$Freq, decreasing = FALSE), ]$Var1
newDf$Var1 <- factor(newDf$Var1,order)
colnames(newDf)[1] <- name
plot <- ggplot(newDf, aes(x=name, y=Freq)) +
xlab(xlabel) +
ylab(ylabel) +
ggtitle(header) +
geom_bar(stat="identity", fill="lightblue", colour="black") +
coord_flip()
return(plot)
}
plottingfunction(df$x, "names","xlabel","ylabel","header")
A few comments. Your function didn't work because this part isn't correct:
order <- newDf[order(newDf$Freq, decreasing = FALSE), ]$Var1
We have no guarantee that the counted data frame will have a column named Var1. What appears to have happened is that, while trying out your code, you ran:
newDf <- data.frame(table(df$x))
which immediately named your column Var1, but when you ran your function, the name changed. To get around this I would recommend being explicit with your column names. In this example, I used the dplyr library to make my life easier. Following your code and logic, it would look like this:
newDf <- data %>% group_by_(col_name) %>% tally
order <- newDf[order(newDf$n, decreasing = FALSE), col_name][[col_name]]
data[,col_name] <- factor(data[,col_name], order)
Then within your ggplot we can use aes_string to refer to the column name of the data frame instead. So then the whole function would look like this:
library(dplyr)
library(ggplot2)
plottingFunction <- function(data, col_name, xlabel, ylabel, header) {
#' create a dataframe with the data that we're interested in
#' make sure that you preserve the name of the column that you're
#' counting on...
newDf <- data %>% group_by_(col_name) %>% tally
order <- newDf[order(newDf$n, decreasing = FALSE), col_name][[col_name]]
data[,col_name] <- factor(data[,col_name], order)
plot <- ggplot(data, aes_string(col_name)) +
xlab(xlabel) +
ylab(ylabel) +
ggtitle(header) +
geom_bar(fill="lightblue", colour="black") +
coord_flip()
return(plot)
}
plottingFunction(df, "x", "xlabel","ylabel","header")
Which would have output like:
I think for your plot having stat="identity" is redundant since you can just use your original data frame rather than having a transformed one.
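Both group_by_() and aes_string() are deprecated in current dplyr and ggplot2 releases. As a hedged alternative, the same idea can be sketched with base R counting and the .data pronoun in aes(); order_by_count and plot_sorted_counts are names invented for this sketch.

```r
library(ggplot2)

# Reorder the levels of one column by ascending frequency
order_by_count <- function(data, col_name) {
  counts <- table(data[[col_name]])
  data[[col_name]] <- factor(data[[col_name]], levels = names(sort(counts)))
  data
}

# Bar plot of counts, with bars ordered by frequency
plot_sorted_counts <- function(data, col_name, xlabel, ylabel, header) {
  data <- order_by_count(data, col_name)
  ggplot(data, aes(x = .data[[col_name]])) +
    geom_bar(fill = "lightblue", colour = "black") +
    xlab(xlabel) +
    ylab(ylabel) +
    ggtitle(header) +
    coord_flip()
}

x <- rep(c("Mark", "Jimmy", "Jones", "Jones", "Jones", "Jimmy"), 2)
df <- data.frame(x = x, y = rnorm(12))
p <- plot_sorted_counts(df, "x", "xlabel", "ylabel", "header")
```

Keeping the level ordering in its own helper makes it easy to check separately from the plot.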
