I am writing a function to iterate my plots over various variables. Unfortunately, I am getting an error I don't understand.
library(ggplot2)
library(dplyr)
library(purrr)
df <- data.frame(af = c(rep(1,6),rep(2,6),rep(3,6)),
p = c(rep(c(rep("A",2),rep("B",2),rep("C",2)),3)),
ele.1 = sample(c(1:100), size=6),
ele.2 = sample(c(1:100), size=6),
ele.3 = sample(c(1:100), size=6))
af p ele.1 ele.2 ele.3
1 A 99 1 68
1 A 55 38 72
1 B 70 36 13
1 B 86 77 89
1 C 7 24 49
1 C 89 23 53
2 A 99 1 68
2 A 55 38 72
....
test <- function(.x = df, .af = 1,.p=c("A","B"), .var = ele.1) {.x %>%
filter(af == .af & p %in% .p) %>%
ggplot(aes(x = .var, y = ele.2)) +
geom_point() +
geom_path()}
test(df)
This results in:
Error in FUN(X[[i]], ...) : object 'ele.1' not found
In addition: Warning message:
In FUN(X[[i]], ...) : restarting interrupted promise evaluation
How can I refer to the column ele.1 inside ggplot when it is wrapped in that function?
I hope this does not duplicate another question.
Cheers
If you pass the .var argument as a character string, it runs. Is this what you are looking for?
test <- function(.x = df, .af = 1, .p = c("A", "B"), .var = "ele.1") {
  # Pipe the filtered rows into ggplot; .data[[.var]] looks the column up by name
  .x %>%
    filter(af == .af & p %in% .p) %>%
    ggplot(aes(x = .data[[.var]], y = ele.2)) +
    geom_point() +
    geom_path()
}
test(df)
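If you would rather keep passing the column unquoted, as in the question, a minimal sketch with the embrace operator could look like the following. It assumes ggplot2 3.0.0 or later with rlang 0.4.0 or later; the function name plot_var and its argument names are made up for illustration, and the data values are copied from the table printed in the question.

```r
library(ggplot2)
library(dplyr)

# Small deterministic version of the question's data (values from the printed table)
df <- data.frame(
  af = rep(1:3, each = 6),
  p = rep(rep(c("A", "B", "C"), each = 2), times = 3),
  ele.1 = rep(c(99, 55, 70, 86, 7, 89), times = 3),
  ele.2 = rep(c(1, 38, 36, 77, 24, 23), times = 3)
)

# plot_var is a hypothetical name; {{ var }} forwards the unquoted column to aes()
plot_var <- function(data, var, af_value = 1, p_values = c("A", "B")) {
  data %>%
    filter(af == af_value, p %in% p_values) %>%
    ggplot(aes(x = {{ var }}, y = ele.2)) +
    geom_point() +
    geom_path()
}

p1 <- plot_var(df, ele.1)   # unquoted column name, as in the original call
p1
```

The embrace form suits interactive use, while the string form with .data[[.var]] is easier when the column names come from a vector you loop over.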
We have a set of 50 csv files from participants, currently being read into a list as
file_paths <- fs::dir_ls("data")
file_paths
file_contents <- list ()
for (i in seq_along (file_paths)) {
file_contents[[i]] <- read_csv(
file = file_paths[[i]]
)
}
dt <- set_names(file_contents, file_paths)
My data looks like this:
level time X Y Type
1 1 355. -10.6 22.36 P
1 1 371. -33 24.85 O
1 2 389. -10.58 17.23 P
1 2 402. -16.7 30.46 O
1 3 419. -29.41 17.32 P
1 4 429. -10.28 26.36 O
2 5 438. -26.86 32.98 P
2 6 451. -21 17.06 O
2 7 463. -21 32.98 P
2 8 474. -19.9 17.06 O
We have 70 sets of coordinates per csv.
Time does not matter for this, but I would like to split up by the level column at some stage.
For every 'P' I want to compare it to the matching 'O' and get the distance between their coordinates. The first P will always match the first O, and so on.
For now, I have them split into two different lists, though this may be completely the wrong way to do it! I'm having trouble figuring out how to take all of these csv files and get the distances for all of them; the list seems to cause issues with most functions (like dist).
Here is how I've pulled the right information so far
for (i in seq_along (dt)) {
pLoc[[i]] <- dplyr::filter(dt[[i]], grepl("P", type))
oLoc[[i]] <- dplyr::filter(dt[[i]], grepl("o", type))
pX[[i]] <- pLoc[[i]] %>% pull(as.numeric(headX))
pY[[i]] <- pLoc[[i]] %>% pull(as.numeric(headY))
pCoordinates[[i]] <- cbind(pX[[i]], pY[[i]])
}
[EDITED] Following comments, here is how you can do it with the raster library:
library(raster)
library(dplyr)
df = data.frame(
x = c(10, 20 ,15,9),
y = c(45,34,54,24),
type = c("P","O","P","O")
)
df = cbind(df[df$type=="P",] %>%
dplyr::select(-type) %>%
dplyr::rename(xP = x,
yP = y),
df[df$type=="O",] %>%
dplyr::select(-type) %>%
dplyr::rename(xO = x,
yO = y))
The following could probably be achieved more efficiently with some form of the apply() function:
v = c()
for(i in 1:nrow(df)){
dist = raster::pointDistance(lonlat = F,
p1 = c(df$xP[i],df$yP[i]),
p2 = c(df$xO[i],df$yO[i]))
v = c(v,dist)
}
df$dist = v
print(df)
xP yP xO yO dist
1 10 45 20 34 14.86607
3 15 54 9 24 30.59412
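Since these are planar coordinates, the same distances can also be computed without a loop or the raster package. The sketch below uses plain Euclidean distance in base R; the helper name paired_distance and the lower-case column names x, y and type are assumptions for illustration, so adjust them to the real column names in your files.

```r
# paired_distance is a hypothetical helper: it pairs the i-th "P" row with
# the i-th "O" row and returns the Euclidean distance for each pair
paired_distance <- function(d) {
  p <- d[d$type == "P", ]
  o <- d[d$type == "O", ]
  stopifnot(nrow(p) == nrow(o))   # every P needs a matching O
  sqrt((p$x - o$x)^2 + (p$y - o$y)^2)
}

example <- data.frame(
  x = c(10, 20, 15, 9),
  y = c(45, 34, 54, 24),
  type = c("P", "O", "P", "O")
)

dists <- paired_distance(example)
print(dists)
```

For a list of data frames such as dt, something like lapply(dt, paired_distance) would return one distance vector per file.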
I am using a dataset of jokes, Dataset 2 (jester_dataset_2.zip) from the Jester project, and I would like to divide the jokes into groups with similar ratings and visualize the results appropriately.
The data look like this
> str(tabulka)
'data.frame': 1761439 obs. of 3 variables:
$ User : int 1 1 1 1 1 1 1 1 1 1 ...
$ Joke : int 5 7 8 13 15 16 17 18 19 20 ...
$ Rating: num 0.219 -9.281 -9.281 -6.781 0.875 ...
Here is a subset of Dataset 2.
> head(tabulka)
User Joke Rating
1 1 5 0.219
2 1 7 -9.281
3 1 8 -9.281
4 1 13 -6.781
5 1 15 0.875
6 1 16 -9.656
I found out I can't use ANOVA since the assumption of homogeneity of variance does not hold. Hence I am using the Kruskal–Wallis test from the agricolae package in R.
KWtest <- with ( tabulka , kruskal ( Rating , Joke ))
Here are the groups.
> head(KWtest$groups)
trt means M
1 53 1085099 a
2 105 1083264 a
3 89 1077435 ab
4 129 1072706 b
5 35 1070016 bc
6 32 1062102 c
The thing is I don't know how to visualize the joke groups appropriately. I am using boxplot to show the confidence intervals for each joke.
barvy <- c ("yellow", "grey")
boxplot (Rating ~ Joke, data = tabulka,
col = barvy,
xlab = "Joke",
ylab = "Rating",
ylim=c(-7,7))
It would be nice to somehow color each box (each joke) with an appropriate color according to the color given by the KW test.
How could I do that? Or is there some better way to find the best and the worst jokes in the dataset?
Interesting question per se. It's easy to color each box according to the group the joke belongs to. However, I think it is just an intermediate solution; there must be a better visualization for these data. So, certainly not the best one, but here is my version:
library(tidyverse)
# download data (jokes, part 1) to a temporary file, and unzip
tmp <- tempfile()
download.file("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip", tmp)
tmp <- unzip(tmp)
# read data from temp
vtipy <- readxl::read_excel(tmp, col_names = F, na = '99')
# clean data
vtipy <- vtipy %>%
mutate(user = 1:n()) %>%
gather(key = 'joke', value = 'rating', -c('..1', 'user')) %>%
rename(n = '..1', ) %>%
filter(!is.na(rating)) %>%
mutate(joke = as.character(as.numeric(gsub('\\.+', '', joke)) - 1)) %>%
select(user, n, joke, rating)
# your code
KWtest <- with(vtipy, agricolae::kruskal(rating, joke))
# join groups from KWtest to original data, clean and plot
KWtest$groups %>%
rownames_to_column('joke') %>%
select(joke, groups) %>%
right_join(vtipy, by = 'joke') %>%
mutate(joke = stringi::stri_pad_left(joke, 3, '0')) %>%
ggplot(aes(x = joke, y = rating, fill = groups)) +
geom_boxplot(show.legend = F) +
scale_x_discrete(breaks = stringi::stri_pad_left(c(1, seq(5, 100, by = 5)), 3, '0')) +
ggthemes::theme_tufte() +
labs(x = 'Joke', y = 'Rating')
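If you prefer to stay with the base boxplot from the question, the group letters can be turned into a colour vector and passed to col. This is only a sketch on made-up ratings: the groups table below is hypothetical and merely shaped like a grouping table with joke ids as row names and a groups column, and the palette is arbitrary.

```r
set.seed(42)

# Made-up ratings for four jokes (not the Jester data)
ratings <- data.frame(
  Joke = rep(1:4, each = 25),
  Rating = c(rnorm(25, 3), rnorm(25, 2.8), rnorm(25, 0), rnorm(25, -2))
)

# Hypothetical grouping table: joke ids as row names, group letters in `groups`
groups <- data.frame(
  groups = c("a", "a", "b", "c"),
  row.names = c("1", "2", "3", "4")
)

# One colour per group letter; overlapping letters such as "ab" need their own entry
palette_by_group <- c(a = "gold", b = "grey70", c = "steelblue")

# boxplot() draws boxes in factor-level order, so look colours up in that order
joke_levels <- levels(factor(ratings$Joke))
box_cols <- unname(palette_by_group[as.character(groups[joke_levels, "groups"])])

boxplot(Rating ~ Joke, data = ratings, col = box_cols,
        xlab = "Joke", ylab = "Rating")
```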
I'm trying to make a heatmap using ggplot2. What I want to be plotted is in the form of a matrix which is the result of a function.
Here is the data:
Image A B C D E F
1 3 23 45 23 45 90
2 4 34 34 34 34 89
3 34 33 24 89 23 67
4 3 45 234 90 12 78
5 78 89 34 23 12 56
6 56 90 56 67 34 45
Here is the function:
vector_a <- names(master)[2:4]
vector_b <- names(master)[5:6]
heatmap_prep <- function(dataframe, vector_a,vector_b){
dummy <- as.data.frame(matrix(0, nrow=length(vector_a), ncol=length(vector_b)))
for (i in 1:length(vector_a)){
first_value <- dataframe[[ vector_a[i] ]]
# print(first_value)
for(j in 1:length(vector_b)){
second_value <- dataframe[[ vector_b[j] ]]
result <- cor(first_value, second_value, method = "spearman")
dummy [i,j] <- result
}
}
rownames(dummy) <- vector_a
return(as.matrix(dummy))
}
heatmap_data_matrix1 <- heatmap_prep(master,vector_a, vector_b)
Using the data in heatmap_data_matrix1, I want to create a heatmap using the following code:
library(ggplot2)
if (length(grep("ggplot2", (.packages() ))) == 0){
library(ggplot2)
}
p <- ggplot(data = heatmap_data_matrix1, aes(x = vector_a, y = vector_b)
+ geom_tile(aes(fill = ))
However, this does not work. How should I reformat my data/code so this heatmap can be created? What should I put under "fill="?
Thanks!
Because many R functions are vectorized and, for the most part, you don't need to pre-allocate or define a vector, the for loop is unnecessary. You can simply run cor(x, y, method = "spearman") without the complications of the loop.
Regarding your question of what to put in for fill, you'll need to reshape your data to the configuration that ggplot2 uses (long format).
The gather function from tidyr does this: it places the row and column names of the correlation matrix into two separate columns, and the correlation value can then be mapped to fill.
library(tidyverse) # for tidyr, tibble, ggplot2, and magrittr
heatmap_function <- function(df, a, b) {
cor_data <- cor(df[a], df[b], method = "spearman") %>%
as.data.frame() %>%  # cor() already uses the names in a as row names
rownames_to_column("x") %>%
gather(y, fill, -x)
ggplot(cor_data, aes(x = x, y = y, fill = fill)) +
geom_tile()
}
This results in:
heatmap_function(master, c("A","B","C"), c("D","E"))
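The reshaping step can also be done in base R, which may make it easier to see what "long format" means here. The sketch below rebuilds master from the table in the question and uses as.data.frame(as.table(...)) instead of gather; the column names x, y and rho are chosen for illustration.

```r
library(ggplot2)

# The data from the question, typed in by hand
master <- data.frame(
  Image = 1:6,
  A = c(3, 4, 34, 3, 78, 56),
  B = c(23, 34, 33, 45, 89, 90),
  C = c(45, 34, 24, 234, 34, 56),
  D = c(23, 34, 89, 90, 23, 67),
  E = c(45, 34, 23, 12, 12, 34),
  F = c(90, 89, 67, 78, 56, 45)
)

# 3 x 2 matrix of Spearman correlations (rows A, B, C; columns D, E)
cor_mat <- cor(master[c("A", "B", "C")], master[c("D", "E")], method = "spearman")

# Long format: one row per cell of the matrix
cor_long <- as.data.frame(as.table(cor_mat), responseName = "rho")
names(cor_long)[1:2] <- c("x", "y")

p <- ggplot(cor_long, aes(x = x, y = y, fill = rho)) +
  geom_tile()
p
```

Each row of cor_long is one tile of the heatmap, which is why the correlation column is what goes under fill.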
I have a question about ifelse in data.frame in R. I checked several SO posts about it, and unfortunately none of these solutions fitted my case.
My case involves a conditional calculation in a data frame, but it returns the warning "the condition has length > 1 and only the first element will be used" even though I used the ifelse function, which should work according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
rate = function(yest, tod){
value = tod/yest
if(value>1){
return(paste("+", percent(value-1), sep = ""))
}
else{
return(paste("-", percent(1-value), sep = ""))
}
}
So if I run the ifelse one, I will get following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with it? I tried to assign 0 to the last column before I run ifelse, but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you want. Maybe the second one is what you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
yest <- temp$previous[i]
tod <- temp$current[i]
if (yest <= 0)
return(tod)
value = tod/yest
if (value>1) {
return(paste("+", percent(value-1), sep = ""))
} else {
return(paste("-", percent(1-value), sep = ""))
}
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is the same as you've coded it.
Here's another way to do the same
# 1: load dplyr
#if needed install.packages("dplyr")
library(dplyr)
# 2: I recreate your data
your_dataframe = as_tibble(cbind(c(0,50,92,84,30,162),
c(10,57,177,153,68,341))) %>%
rename(previous = V1, current = V2)
# 3: obtain the change using your conditions
your_dataframe %>%
mutate(change = ifelse(previous > 0,
ifelse(current/previous > 1,
paste0("+%", (current/previous-1)*100),
paste0("-%", (1 - current/previous)*100)),
current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Only the first element of value is evaluated by if, so the output of rate depends solely on the first row of temp.
Adopting the advice I received from warm-hearted SO users, I vectorized some of my functions and it worked! Raise a glass to the SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)
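If the signed percentage labels from the original rate function are still wanted, a vectorized version could be sketched as below. The name rate_label and the choice to round to whole percentages with sprintf are assumptions for illustration; the data are the six rows shown in the question.

```r
# rate_label is a hypothetical vectorized replacement for rate():
# every step works on whole vectors, so no if/else on a single value is needed
rate_label <- function(previous, current) {
  ratio <- current / previous
  pct <- abs(ratio - 1) * 100            # size of the change in percent
  sign <- ifelse(ratio > 1, "+", "-")    # same rule as the original if/else
  ifelse(previous > 0,
         sprintf("%s%.0f%%", sign, pct), # e.g. "+14%"
         as.character(current))          # fall back to current when previous is 0
}

temp <- data.frame(
  previous = c(0, 50, 92, 84, 30, 162),
  current = c(10, 57, 177, 153, 68, 341)
)
temp$change <- rate_label(temp$previous, temp$current)
print(temp)
```

Note that the change column is character, because it mixes labels such as "+14%" with the plain current value.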
I am attempting to draw a stacked bar plot with the following data, using either ggplot2 or the barplot function in R. I have failed with both.
str(ISCE_LENGUAJE5_APE_DEC)
'data.frame': 50 obs. of 5 variables:
$ Nombre : Factor w/ 49 levels "C.E. DE BORAUDO",..: 6 5 25 21 16 7 27 45 24 38 ...
$ v2014_5L_porNivInsu: int 100 93 73 67 67 65 63 60 59 54 ...
$ v2014_5L_porNivMini: int 0 7 22 26 32 32 37 26 34 35 ...
$ v2014_5L_porNivSati: int 0 0 4 6 2 3 0 12 6 10 ...
$ v2014_5L_porNivAvan: int 0 0 1 2 0 0 0 2 1 2 ...
The integers are percentage values: the sum of the v2014... columns for each observation is 100.
I have attempted to use ggplot2, but I only manage to plot one of the variables, not the stacked bar with all four.
ggplot(ISCE_LENGUAJE5_APE_DEC, aes(x=Nombre, y= v2014_5L_porNivInsu)) + geom_bar(stat="identity")
I can't figure out how to pass the values for all four columns to the y parameter.
If I only pass x, I get an error:
ggplot(ISCE_LENGUAJE5_APE_DEC, aes(x=Nombre)) + geom_bar(stat="identity")
Error in exists(name, envir = env, mode = mode) :
argument "env" is missing, with no default
I found this answer, but don't understand the data transformations used. Thank you for any help provided.
ggplot2 works with data expressed in "long" format. The function melt from package reshape2 is your friend.
Because you did not provide a reproducible example, I generated some data.
v2014 <- data.frame(v2014_5L_porNivInsu = sample(1:100, 50, replace = TRUE),
v2014_5L_porNivMini = sample(1:50, 50, replace = TRUE),
v2014_5L_porNivSati = sample(0:10, 50, replace = TRUE),
v2014_5L_porNivAvan = sample(0:2, 50, replace = TRUE))
v2014_prop <- t(apply(v2014, 1, function(x) {x / sum(x) * 100}))
ISCE_LENGUAJE5_APE_DEC <- data.frame(Nombre = factor(sample(1:100, 50)),
v2014_prop)
You first express your table in long format using melt.
library(reshape2)
gg <- melt(ISCE_LENGUAJE5_APE_DEC, id = "Nombre")
See what your new table, gg, looks like.
str(gg)
head(gg)
In your ggplot call, use the data.frame gg. The x-axis is Nombre and the y-axis is value, i.e. the proportions. The bars are segmented by fill colours taken from the variable column, which holds the v2014_... names as factor levels instead of column headers, thanks to the melt function.
library(ggplot2)
ggplot(gg, aes(x = Nombre, y = value, fill = variable)) +
geom_bar(stat = "identity")
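The same reshaping can be sketched with tidyr::pivot_longer, in case reshape2 is not installed. The three rows below use the first values shown in the str() output of the question; the school names are placeholders, since the real Nombre levels are not shown in full.

```r
library(tidyr)
library(ggplot2)

# First three observations from the question's str() output;
# the Nombre labels are made-up placeholders
scores <- data.frame(
  Nombre = factor(c("School A", "School B", "School C")),
  v2014_5L_porNivInsu = c(100, 93, 73),
  v2014_5L_porNivMini = c(0, 7, 22),
  v2014_5L_porNivSati = c(0, 0, 4),
  v2014_5L_porNivAvan = c(0, 0, 1)
)

# One row per school and level, like the output of melt()
long <- pivot_longer(scores, cols = -Nombre,
                     names_to = "variable", values_to = "value")

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(long, aes(x = Nombre, y = value, fill = variable)) +
  geom_col()
p
```

With 50 schools on the x-axis, adding coord_flip() may make the names easier to read.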