I have a data frame with two categorical variables: column 1 is Var1 and column 2 is Var2. I want a frequency table counting how often Var1 takes status 1, 2 and 3 when Var2 has status 1, and likewise when Var2 has status 2 and 3. Finally, I want to plot a histogram with the Var2 status (1, 2, 3) on the x-axis and the frequency of each Var1 status on the y-axis. Thanks for the help.
structure(list(`1` = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1), `2` = c(3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 3, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2)), row.names = c(NA, -101L), class = c("tbl_df", "tbl",
"data.frame"))
You are probably looking for a bar plot of frequencies rather than the histogram mentioned in your description.
library(dplyr)    # group_by(), summarise(), %>%
library(ggplot2)  # ggplot(), geom_col()
# add column names to the example data frame (assumed to be stored as df)
names(df) <- paste0("Var", 1:2)
# group and summarise to count the occurrences of each Var2, Var1 combination
plotting_df <- df %>% group_by(Var2, Var1) %>% summarise(n = n())
# plot using the new summary data frame
ggplot(plotting_df, aes(x = factor(Var2), y = n, fill = factor(Var1))) +
  geom_col(position = "dodge")
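If the frequency table itself is also wanted, base R's table() may be enough. The sketch below uses a small made-up data frame in place of the real one; toy, freq and freq_long are illustrative names, not part of the original answer.

```r
# Hypothetical toy data standing in for the two status columns
toy <- data.frame(Var1 = c(2, 2, 2, 1, 1), Var2 = c(3, 2, 2, 2, 2))

# Cross-tabulate: rows are Var1 statuses, columns are Var2 statuses
freq <- table(Var1 = toy$Var1, Var2 = toy$Var2)
print(freq)

# Long format with one row per (Var1, Var2) pair, convenient for plotting
freq_long <- as.data.frame(freq, responseName = "n")
print(freq_long)
```

The long version has the same shape as the summarised data frame used for the bar plot, so either could be passed to ggplot.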
I have a problem similar to what is found here. I have a loop that runs some modelling for different pairs of variables. I probably should not have used a loop, but it is too late to change that now. I want to create a plot for each run. At first nothing showed at all. After implementing the best answer from that post I could at least print the plots, but they still were not stored. The idea is to generate the plots and then use grid.arrange to plot them together. Could someone show how to fix this? Here is some random data and the loop from the example:
col1 <- c(2, 4, 1, 2, 5, 1, 2, 0, 1, 4, 4, 3, 5, 2, 4, 3, 3, 6, 5, 3, 6, 4, 3, 4, 4, 3, 4,
2, 4, 3, 3, 5, 3, 5, 5, 0, 0, 3, 3, 6, 5, 4, 4, 1, 3, 3, 2, 0, 5, 3, 6, 6, 2, 3)
col2 <- c(2, 4, 4, 0, 4, 4, 4, 4, 1, 4, 4, 3, 5, 0, 4, 5, 3, 6, 5, 3, 6, 4, 4, 2, 4, 4, 4,
1, 1, 2, 2, 3, 3, 5, 0, 3, 4, 2, 4, 5, 5, 4, 4, 2, 3, 5, 2, 6, 5, 2, 4, 6, 3, 3)
col3 <- c(2, 5, 4, 1, 4, 2, 3, 0, 1, 3, 4, 2, 5, 1, 4, 3, 4, 6, 3, 4, 6, 4, 1, 3, 5, 4, 3,
2, 1, 3, 2, 2, 2, 4, 0, 1, 4, 4, 3, 5, 3, 2, 5, 2, 3, 3, 4, 2, 4, 2, 4, 5, 1, 3)
data2 <- data.frame(col1,col2,col3)
data2[,1:3] <- lapply(data2[,1:3], as.factor)
colnames(data2)<- c("A","B","C")
myplots <- vector('list', ncol(data2))
for (i in seq_along(data2)) {
message(i)
myplots[[i]] <- local({
i <- i
p1 <- ggplot(data2, aes(x = data2[[i]])) +
geom_histogram(fill = "lightgreen") +
xlab(colnames(data2)[i])
print(p1)
})
}
I tried changing print to return, but to no avail. The plots are printed in the View window in RStudio, but they are not stored at all.
You can use the following code:
library(ggplot2)
myplots <- vector('list', ncol(data2))
for (i in seq_along(data2)) {
  nm <- colnames(data2)[i]
  # aes() captures its arguments lazily, so inject the column name now with !!
  myplots[[i]] <- ggplot(data2, aes(x = .data[[!!nm]])) +
    geom_histogram(fill = "lightgreen")
}
However, using lapply would be easier.
myplots <- lapply(names(data2), function(x)
ggplot(data2, aes(x = .data[[x]])) + geom_histogram(fill = "lightgreen"))
Plot the list of plots with grid.arrange.
gridExtra::grid.arrange(grobs = myplots)
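A likely reason the lapply version is more robust than a loop is lazy evaluation: anything that still refers to the loop index when it is finally evaluated sees the index's last value. The base-R sketch below illustrates this with plain closures instead of plots; aes() captures expressions in a comparable way. The names make_loop and make_lapply are made up for this illustration.

```r
# Closures created in a for loop all share the same variable i
make_loop <- list()
for (i in 1:3) {
  make_loop[[i]] <- function() i
}
loop_values <- sapply(make_loop, function(f) f())

# lapply calls the function once per element, so each closure gets its own i
make_lapply <- lapply(1:3, function(i) function() i)
lapply_values <- sapply(make_lapply, function(f) f())

print(loop_values)    # every closure reports the final value of i
print(lapply_values)  # each closure reports the value it was created with
```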
data
A <- c(2, 4, 1, 2, 5, 1, 2, 0, 1, 4, 4, 3, 5, 2, 4, 3, 3, 6, 5, 3, 6, 4, 3, 4, 4, 3, 4,
2, 4, 3, 3, 5, 3, 5, 5, 0, 0, 3, 3, 6, 5, 4, 4, 1, 3, 3, 2, 0, 5, 3, 6, 6, 2, 3)
B <- c(2, 4, 4, 0, 4, 4, 4, 4, 1, 4, 4, 3, 5, 0, 4, 5, 3, 6, 5, 3, 6, 4, 4, 2, 4, 4, 4,
1, 1, 2, 2, 3, 3, 5, 0, 3, 4, 2, 4, 5, 5, 4, 4, 2, 3, 5, 2, 6, 5, 2, 4, 6, 3, 3)
C <- c(2, 5, 4, 1, 4, 2, 3, 0, 1, 3, 4, 2, 5, 1, 4, 3, 4, 6, 3, 4, 6, 4, 1, 3, 5, 4, 3,
2, 1, 3, 2, 2, 2, 4, 0, 1, 4, 4, 3, 5, 3, 2, 5, 2, 3, 3, 4, 2, 4, 2, 4, 5, 1, 3)
data2 <- data.frame(A,B,C)
Does this work for you? With patchwork and purrr::reduce we can combine these graphs so that they sit next to each other or stack vertically. Using a slash (/) instead of a plus (+) in reduce arranges the plots vertically instead of horizontally. A histogram needs continuous data; if you want to plot counts for discrete data, use geom_bar instead, in which case you need to convert the columns to factors. I am not sure which plot you want, so I am assuming that your data are continuous and that you want histograms. Please let me know if this doesn't work in your scenario.
library(tidyverse)
library(patchwork)
data2 <- data.frame(col1, col2, col3) ## No conversion of factors
nm <- names(data2)
g1 <- reduce(map2(data2,nm, ~ggplot(data2,aes(x =.x )) + geom_histogram(fill = "yellow4") + labs(x=.y, y = 'count')), `+`)
print(g1)
Or with slashes:
g2 <- reduce(map2(data2,nm, ~ggplot(data2,aes(x =.x )) + geom_histogram(fill = "yellow4") + labs(x=.y, y = 'count')), `/`)
print(g2)
Or, if you want to keep the for loop, you can do this as well. You have already initialised myplots, so I am not repeating that here:
for (i in seq_along(data2)) {
  nm <- colnames(data2)[i]
  # aes() captures its arguments lazily, so inject the column name now with !!
  myplots[[i]] <-
    ggplot(data2, aes(x = .data[[!!nm]])) +
    geom_histogram(fill = "lightgreen") +
    xlab(nm)
}
Explanation:
Now you can use reduce on myplots to arrange the plots. Note that myplots should contain your 3 plots at this point:
reduce(myplots, `+`)
The map2 and reduce approach is a similar solution. map2 returns a list of 3 plots, so the code below produces 3 objects:
plots <- map2(data2,nm, ~ggplot(data2,aes(x =.x )) + geom_histogram(fill = "yellow4") + labs(x=.y, y = 'count'))
To combine (arrange) them, all you have to do is add them with patchwork:
plots[[1]] + plots[[2]] + plots[[3]]
That is quite cumbersome, so we use reduce to do the same thing:
reduce(plots, `+`)
As mentioned earlier, you can use a slash instead of a plus to make the arrangement vertical rather than horizontal. The plot_layout option in patchwork allows more flexible layouts; see the patchwork documentation for details.
With gridExtra: gridExtra::grid.arrange(grobs = myplots). Again, instead of myplots you can pass any list that contains ggplot objects.
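As a further option, patchwork's wrap_plots() accepts a list of plots directly, which avoids reduce altogether. The following is only a sketch on made-up data: toy, plots and stacked are illustrative names, and bins = 5 is an arbitrary choice for such a small sample.

```r
library(ggplot2)
library(patchwork)

# Toy numeric data (hypothetical stand-in for data2)
toy <- data.frame(A = c(2, 4, 1, 2, 5), B = c(2, 4, 4, 0, 4), C = c(2, 5, 4, 1, 4))

# One histogram per column; each function call has its own nm
plots <- lapply(names(toy), function(nm) {
  ggplot(toy, aes(x = .data[[nm]])) +
    geom_histogram(bins = 5, fill = "yellow4") +
    labs(x = nm, y = "count")
})

# Arrange the list in a single column (three rows)
stacked <- wrap_plots(plots, ncol = 1)
print(stacked)
```

Changing ncol = 1 to nrow = 1 should give the side-by-side arrangement instead.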
I want to create a variable region based on a series of similar variables zipid1 to zipid26. My current code is like this:
dat$region <- with(dat, ifelse(zipid1 == 1, 1,
ifelse(zipid2 == 1, 2,
ifelse(zipid3 == 1, 3,
ifelse(zipid4 == 1, 4,
5)))))
How can I write a loop to avoid typing from zipid1 to zipid26? Thanks!
We subset the 'zipid' columns, create a logical matrix by comparing with 1 (== 1), and get the column index of the TRUE value with max.col (assuming there is only a single 1 in each row), then assign it to create 'region'.
dat$region <- max.col(dat[paste0("zipid", 1:26)] == 1, "first")
Using a small reproducible example
max.col(dat[paste0("zipid", 1:5)] == 1, "first")
data
dat <- data.frame(id = 1:5, zipid1 = c(1, 3, 2, 4, 5),
zipid2 = c(2, 1, 3, 5, 4), zipid3 = c(3, 2, 1, 5, 4),
zipid4 = c(4, 3, 6, 2, 1), zipid5 = c(5, 3, 8, 1, 4))
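One caveat worth checking: for a row that contains no 1 at all, max.col with "first" still returns column 1, which would be indistinguishable from a real match in zipid1. The sketch below, on a made-up three-column data frame, marks such rows as NA; toy, hit and region are illustrative names, and a default code could be assigned instead of NA.

```r
# Hypothetical data: the third row has no 1 in any zipid column
toy <- data.frame(zipid1 = c(1, 0, 0),
                  zipid2 = c(0, 1, 0),
                  zipid3 = c(0, 0, 0))

# Logical matrix: TRUE where a zipid column equals 1
hit <- toy[paste0("zipid", 1:3)] == 1

# Column index of the first TRUE in each row
region <- max.col(hit, "first")

# Rows without any TRUE would otherwise be reported as column 1
region[rowSums(hit) == 0] <- NA
print(region)
```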
I am trying to add a new variable to a data frame using dplyr, but I find it difficult.
The new variable should be the number of runs of length 2 (across all the variable values in each row). Using apply I would do this:
tmp$rle = apply(tmp,1,function(x) sum(rle(x)$lengths==2))
How can I perform this action using dplyr and mutate (without defining all variable names) ?
tmp <- structure(list(X1 = c(3, 1, 1, 4, 4, 1, 3, 2, 2, 2, 1, 3, 3,
2, 3, 1, 4, 2, 3, 2), X2 = c(2, 4, 2, 2, 3, 2, 1, 1, 3, 1, 3,
1, 4, 4, 4, 1, 3, 1, 2, 1), X3 = c(2, 4, 3, 3, 3, 2, 4, 3, 4,
4, 2, 3, 3, 3, 1, 3, 1, 4, 4, 2), X4 = c(1, 3, 3, 1, 1, 3, 2,
4, 4, 1, 4, 4, 1, 1, 1, 3, 1, 3, 1, 1), X5 = c(4, 2, 4, 2, 1,
4, 1, 2, 2, 4, 3, 4, 1, 1, 4, 4, 2, 4, 4, 3), X6 = c(3, 1, 4,
3, 4, 4, 4, 1, 1, 3, 4, 2, 2, 2, 3, 2, 3, 2, 2, 3), X7 = c(4,
2, 1, 1, 2, 1, 3, 3, 3, 3, 2, 2, 4, 4, 2, 4, 4, 3, 3, 4), X8 = c(1,
3, 2, 4, 2, 3, 2, 4, 1, 2, 1, 1, 2, 3, 2, 2, 2, 1, 1, 4)), .Names = c("X1",
"X2", "X3", "X4", "X5", "X6", "X7", "X8"), row.names = c(NA,
20L), class = "data.frame")
Rather than dplyr, you might consider by_row, which was originally part of purrr (a complement to dplyr for, among other things, handling vectors and lists) and now lives in the purrrlyr package. In your case, tmp is a numeric data frame where you want to treat each row as a vector. The code could look like:
library(purrrlyr)
# each row arrives as a one-row data frame; unlist() turns it into a plain vector for rle()
tmp <- by_row(tmp, ..f = function(x) sum(rle(unlist(x))$lengths == 2),
              .to = "rle", .collate = "cols")
In dplyr:
tmp <- mutate(tmp, rle = apply(tmp, 1, function(x) sum(rle(x)$lengths==2)))
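A more dplyr-native variant is possible with rowwise() and c_across(), assuming dplyr 1.0 or later. This is a sketch on a small made-up data frame (toy and result are illustrative names), not the method of the original answers.

```r
library(dplyr)

# Hypothetical two-row example: row 1 has one run of length 2, row 2 has two
toy <- data.frame(X1 = c(3, 1), X2 = c(2, 1), X3 = c(2, 4), X4 = c(1, 4))

result <- toy %>%
  rowwise() %>%
  # c_across(everything()) collects the current row's values into one vector
  mutate(rle = sum(rle(c_across(everything()))$lengths == 2)) %>%
  ungroup()

print(result)
```

No column names have to be typed out, which was the requirement in the question.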
I am having a difficult time QA'ing this as I am unfamiliar with what results I should expect out of the rle function. I tried comparing results with your apply version of the code, and it seems that set.seed() is perhaps important for replicability? Am I understanding this correctly?
Here is the QA attempt I made: (original tmp should be exactly the same: I just wrapped the lines at the list() and structure() arguments.)
library(dplyr)  # needed for %>% and mutate() below
set.seed(1)
tmp <- structure(list(X1 = c(3, 1, 1, 4, 4, 1, 3, 2, 2, 2, 1, 3, 3, 2, 3, 1, 4, 2, 3, 2),
X2 = c(2, 4, 2, 2, 3, 2, 1, 1, 3, 1, 3, 1, 4, 4, 4, 1, 3, 1, 2, 1),
X3 = c(2, 4, 3, 3, 3, 2, 4, 3, 4, 4, 2, 3, 3, 3, 1, 3, 1, 4, 4, 2),
X4 = c(1, 3, 3, 1, 1, 3, 2, 4, 4, 1, 4, 4, 1, 1, 1, 3, 1, 3, 1, 1),
X5 = c(4, 2, 4, 2, 1, 4, 1, 2, 2, 4, 3, 4, 1, 1, 4, 4, 2, 4, 4, 3),
X6 = c(3, 1, 4, 3, 4, 4, 4, 1, 1, 3, 4, 2, 2, 2, 3, 2, 3, 2, 2, 3),
X7 = c(4, 2, 1, 1, 2, 1, 3, 3, 3, 3, 2, 2, 4, 4, 2, 4, 4, 3, 3, 4),
X8 = c(1, 3, 2, 4, 2, 3, 2, 4, 1, 2, 1, 1, 2, 3, 2, 2, 2, 1, 1, 4)),
.Names = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"),
row.names = c(NA, 20L), class = "data.frame")
tmpApply <- tmp
tmpApply$rle = apply(tmp, 1, function(x) sum(rle(x)$lengths==2))
tmpDplyr <- tmp %>% mutate(rle = apply(tmp, 1, function(x) sum(rle(x)$lengths==2)))
tmpApply
tmpDplyr
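For readers unsure what to expect from rle, a short base-R sketch on a made-up vector may help. rle is deterministic, so the result does not depend on set.seed(); x, r and n_runs_of_two are illustrative names.

```r
# A vector with runs: 3 | 2 2 | 1 | 4 4 4 | 1
x <- c(3, 2, 2, 1, 4, 4, 4, 1)

r <- rle(x)
print(r$lengths)  # length of each run of identical consecutive values
print(r$values)   # the value repeated in each run

# Number of runs whose length is exactly 2 (only the "2 2" run here)
n_runs_of_two <- sum(r$lengths == 2)
print(n_runs_of_two)
```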
I have the 2 tables as below
subj <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
gamble <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
ev <- c(4, 5, 6, 4, 5, 6, 4, 5, 6)
table1 <- data.frame(subj, gamble, ev)
subj2 <- c(1, 2, 3)
gamble2 <- c(1, 3, 2)
table2 <- data.frame(subj2, gamble2)
I want to merge the two tables by gamble, keeping only the rows of table 1 whose gamble matches the gamble listed for the same subject in table 2. The expected output is as follows:
sub gamble ev
1 1 4
2 3 6
3 2 5
You are looking for merge
merge(table1, table2, by.x=c("subj", "gamble"), by.y=c("subj2", "gamble2"), all=FALSE, sort=TRUE)
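The same result can probably be obtained with dplyr's inner_join, if that package is already in use. The sketch below rebuilds the two tables from the question; result is an illustrative name.

```r
library(dplyr)

table1 <- data.frame(subj   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                     gamble = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                     ev     = c(4, 5, 6, 4, 5, 6, 4, 5, 6))
table2 <- data.frame(subj2 = c(1, 2, 3), gamble2 = c(1, 3, 2))

# Keep only the table1 rows whose (subj, gamble) pair appears in table2
result <- inner_join(table1, table2,
                     by = c("subj" = "subj2", "gamble" = "gamble2"))
print(result)
```

The named vector passed to by plays the role of by.x and by.y in merge.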