I have a dataframe ("data") with 7 columns (2 factor, 5 numeric). The first column contains the names of 7 different countries, and the following columns hold data for different parameters (like population, GDP etc.) characterizing each country. In the last column, a factor variable records which continent the respective country belongs to.
The data looks like this:
structure(list(Country = structure(c(5L, 4L, 7L, 2L, 1L, 6L,
3L), .Label = c("Brazil", "Chile", "China", "France", "Germany",
"India", "Netherlands"), class = "factor"), GDP = c(0.46, 0.57,
0.75, 0.56, 0.28, 0.88, 1), Population = c(0.18, 0.09, 0.54,
0.01, 0.02, 0.17, 0.84), Birth.rate = c(87.21, 18.34, 63.91,
14.21, 5.38, 51.19, 209.26), Income = c(43.89, 18.23, 63.91,
12.3, 0.1, 14.61, 160.82), Savings = c(43.32, 0.11, 0, 1.91,
5.29, 36.58, 50.38), Continent = structure(c(2L, 2L, 2L, 3L,
3L, 1L, 1L), .Label = c("Asia", "Europe", "South America"), class = "factor")), .Names = c("Country",
"GDP", "Population", "Birth.rate", "Income", "Savings", "Continent"
), class = "data.frame", row.names = c(NA, -7L))
I need some sort of loop function which plots (e.g. scatter plot) every single column against each other so that in the end every column (except the first and the last, i.e. the two factor variables) has been plotted against all other columns but each in a single plot chart (not all plots in one). Preferably all these plots are being saved to some folder on my local machine.
Also it would be great if the x and y axis are already labeled according to the respective two columns that are plotted against each other. Moreover it would be convenient to have a label next to each point in the plot displaying the respective country name. Lastly it would be nice to have three different colors for the points of the countries according to the three different continents.
So far I only have a piece of code that goes like
for (i in seq(1,length(data),1)) {
plot(data[,i], ylab=names(data[i]), xlab="Country",
text(i, labels=Country, pos=4, cex =.5))
}
As you can see it only plots each column against the first column ("Country") which is not what I want in the end.
Do you have any idea how I could achieve this?
You can use pairs() from base R, which draws a scatterplot matrix of all columns against each other in a single figure. Note that dt represents your dataset (defined below).
pairs(dt)
dt <- structure(list(Country = structure(c(5L, 4L, 7L, 2L, 1L, 6L,
3L), .Label = c("Brazil", "Chile", "China", "France", "Germany",
"India", "Netherlands"), class = "factor"), GDP = c(0.46, 0.57,
0.75, 0.56, 0.28, 0.88, 1), Population = c(0.18, 0.09, 0.54,
0.01, 0.02, 0.17, 0.84), Birth.rate = c(87.21, 18.34, 63.91,
14.21, 5.38, 51.19, 209.26), Income = c(43.89, 18.23, 63.91,
12.3, 0.1, 14.61, 160.82), Savings = c(43.32, 0.11, 0, 1.91,
5.29, 36.58, 50.38), Continent = structure(c(2L, 2L, 2L, 3L,
3L, 1L, 1L), .Label = c("Asia", "Europe", "South America"), class = "factor")), .Names = c("Country",
"GDP", "Population", "Birth.rate", "Income", "Savings", "Continent"
), class = "data.frame", row.names = c(NA, -7L))
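pairs() puts every combination into one figure. If each pair should instead go into its own file, with axis labels, country labels and one colour per continent, a loop over combn() is one possible sketch. The output folder (a temporary directory here), the file-name pattern and the colour choices are assumptions for illustration; the data are retyped from the question.

```r
# Data retyped from the question
dt <- data.frame(
  Country    = factor(c("Germany", "France", "Netherlands", "Chile",
                        "Brazil", "India", "China")),
  GDP        = c(0.46, 0.57, 0.75, 0.56, 0.28, 0.88, 1),
  Population = c(0.18, 0.09, 0.54, 0.01, 0.02, 0.17, 0.84),
  Birth.rate = c(87.21, 18.34, 63.91, 14.21, 5.38, 51.19, 209.26),
  Income     = c(43.89, 18.23, 63.91, 12.3, 0.1, 14.61, 160.82),
  Savings    = c(43.32, 0.11, 0, 1.91, 5.29, 36.58, 50.38),
  Continent  = factor(c("Europe", "Europe", "Europe", "South America",
                        "South America", "Asia", "Asia"))
)

# Hypothetical output folder; replace with a folder of your choice
out_dir <- file.path(tempdir(), "pair_plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# All unordered pairs of numeric columns (each pair appears once)
num_cols  <- names(dt)[sapply(dt, is.numeric)]
col_pairs <- combn(num_cols, 2)

# One colour per continent (arbitrary choice)
cont_cols <- c(Asia = "red", Europe = "blue", "South America" = "darkgreen")
point_col <- cont_cols[as.character(dt$Continent)]

for (k in seq_len(ncol(col_pairs))) {
  xv <- col_pairs[1, k]
  yv <- col_pairs[2, k]
  # One file per pair, e.g. GDP_vs_Population.pdf
  pdf(file.path(out_dir, paste0(xv, "_vs_", yv, ".pdf")))
  plot(dt[[xv]], dt[[yv]], xlab = xv, ylab = yv, col = point_col, pch = 19)
  # Country name next to each point; xpd = NA lets labels extend past the plot region
  text(dt[[xv]], dt[[yv]], labels = as.character(dt$Country),
       pos = 4, cex = 0.7, xpd = NA)
  legend("topleft", legend = names(cont_cols), col = cont_cols,
         pch = 19, bty = "n")
  dev.off()
}
```

pdf() is used here only because it needs no extra graphics capabilities; png() with a ".png" file name should work the same way. With 5 numeric columns this writes choose(5, 2) = 10 files.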
I've always thought that the splom function in the 'lattice' package was quite useful for this sort of exploratory analysis. This is obviously not a great example since it obscures the group memberships, but it shows the combinations of points and a non-parametric regression line in the "pairs" format:
library(lattice)
png() # writes to the default file name, Rplot001.png
print( splom(~iris[1:4], groups = Species, data = iris,
panel = function(x, y, i, j, ...) {
panel.points(x,y, ...)
panel.loess(x,y, ...)
})); dev.off()
I hope this title wasn't too confusing. I have data from 10 dates, and each data point is either labelled as "Irrigation" or "Precipitation". I would like to make one bar graph that has a bar for each value in chronological order with the date labelled under the bar, but would also like the bars to be color coded (for example, all bars from "Irrigation" would be red, while those from "Precipitation" would be blue.)
I am fairly new to R, is this possible? If so, any help would be appreciated.
Here is my data:
dput(head(data))
structure(list(Date = structure(4:9, .Label = c("7/11/2019",
"7/13/2019", "7/15/2019", "7/2/2019", "7/3/2019", "7/4/2019",
"7/5/2019", "7/6/2019", "7/7/2019", "7/9/2019"), class = "factor"),
Inches = c(1.6, 0.02, 0.35, 0.64, 0.07, 0.35), Type = structure(c(2L,
2L, 1L, 2L, 2L, 1L), .Label = c("Irrigation", "Precipitation"
), class = "factor")), row.names = c(NA, 6L), class = "data.frame")
You can use dplyr, tidyr and ggplot2 to make a bar graph: tidyr's gather() reshapes the data to long format, stat='identity' means each bar height uses the corresponding value in the data, and position='dodge' places the bars side by side rather than stacked.
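library(dplyr) # %>%
library(tidyr) # gather()
library(ggplot2)
# example data: one row per date, one column per water source
df <- data.frame(date=c(paste0('2019-0',1:9),'2019-10'),
precipitation=runif(10),irrigation=runif(10))
df
# reshape to long format (key k = source, value v = amount), then plot
df %>% gather(k,v,-date) %>%
ggplot(aes(date,v,fill=k)) + geom_bar(stat='identity',position = 'dodge') +
theme(axis.text.x = element_text(angle = 45,hjust=1))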
EDIT
Here is the bar graph using the data you provided from dput. I didn't realize your data were already grouped in one column.
df <- structure(list(Date = structure(4:9, .Label = c("7/11/2019",
"7/13/2019", "7/15/2019", "7/2/2019", "7/3/2019", "7/4/2019",
"7/5/2019", "7/6/2019", "7/7/2019", "7/9/2019"), class = "factor"),
Inches = c(1.6, 0.02, 0.35, 0.64, 0.07, 0.35), Type = structure(c(2L,
2L, 1L, 2L, 2L, 1L), .Label = c("Irrigation", "Precipitation"
), class = "factor")), row.names = c(NA, 6L), class = "data.frame")
df %>% ggplot(aes(Date,Inches,fill=Type)) + geom_bar(stat='identity',position = 'dodge') +
theme(axis.text.x = element_text(angle = 45,hjust=1))
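One caveat: Date is a factor, so ggplot2 orders the bars by factor level, and for these strings the levels are sorted as text ("7/11/2019" comes before "7/2/2019"). With all ten dates the bars would therefore not be chronological. A minimal sketch of one way around this, assuming the dates are month/day/year, is to reorder the levels by the parsed date. The red and blue fills follow the colours mentioned in the question; only the six rows shown in the dput are used.

```r
library(ggplot2)

# The six rows from the question, keeping all ten factor levels of Date
df <- data.frame(
  Date = factor(
    c("7/2/2019", "7/3/2019", "7/4/2019", "7/5/2019", "7/6/2019", "7/7/2019"),
    levels = c("7/11/2019", "7/13/2019", "7/15/2019", "7/2/2019", "7/3/2019",
               "7/4/2019", "7/5/2019", "7/6/2019", "7/7/2019", "7/9/2019")
  ),
  Inches = c(1.6, 0.02, 0.35, 0.64, 0.07, 0.35),
  Type   = factor(c("Precipitation", "Precipitation", "Irrigation",
                    "Precipitation", "Precipitation", "Irrigation"))
)

# Reorder the levels chronologically (assumes month/day/year strings)
lv <- levels(df$Date)
df$Date <- factor(df$Date, levels = lv[order(as.Date(lv, format = "%m/%d/%Y"))])

# geom_col() is shorthand for geom_bar(stat = 'identity')
p <- ggplot(df, aes(Date, Inches, fill = Type)) +
  geom_col() +
  scale_fill_manual(values = c(Irrigation = "red", Precipitation = "blue")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

Converting the column itself with as.Date() would also give a chronological axis, but then the bars are spaced by calendar distance rather than evenly.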
Replicating a visualization I saw in print media using ggplot2
Context:
I am always looking to make data visualizations more appealing/aesthetic specifically for non-data people, who are the majority of people I work with (stakeholders like marketers, management, etc) -- I've noted that when visualizations look like academic-publication-quality (standard ggplot2 aesthetics) they tend to assume they can't understand it and don't bother trying, defeating the whole purpose of visualizations in the first place. However, when it looks more graphic'y (like something you may see on websites or marketing material) they focus and try to understand the visualization, usually successfully. Often we'll end up in the most interesting discussions from these types of visualizations, so that is my ultimate goal.
The Visualization:
Here is something I saw in a marketing brochure on the device share of web traffic by geography. Although it is actually a bit busy and unclear, it resonated better than a similar stacked bar chart I created in standard ggplot2. I have not the slightest idea how I might replicate something like this within ggplot2; any attempts would be much appreciated! Here is some sample tidy data to use in a data.table:
structure(list(country = c("Argentina", "Argentina", "Argentina",
"Brazil", "Brazil", "Brazil", "Canada",
"Canada", "Canada", "China", "China",
"China", "Japan", "Japan", "Japan", "Spain",
"Spain", "Spain", "UK", "UK", "UK", "USA",
"USA", "USA"),
device_type = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L),
class = "factor",
.Label = c("desktop",
"mobile",
"multi")),
proportion = c(0.37, 0.22, 0.41, 0.3, 0.31, 0.39,
0.35, 0.06, 0.59, 0.19, 0.2, 0.61,
0.4, 0.18, 0.42, 0.16, 0.28, 0.56,
0.27, 0.06, 0.67, 0.37, 0.08, 0.55)),
.Names = c("country", "device_type", "proportion"),
row.names = c(NA, -24L),
class = c("data.table", "data.frame"))
You could also consider googleVis
library(googleVis)
dat <- structure(list(country = c("Argentina", "Argentina", "Argentina",
"Brazil", "Brazil", "Brazil", "Canada",
"Canada", "Canada", "China", "China",
"China", "Japan", "Japan", "Japan", "Spain",
"Spain", "Spain", "UK", "UK", "UK", "USA",
"USA", "USA"),
device_type = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L),
class = "factor",
.Label = c("desktop",
"mobile",
"multi")),
proportion = c(0.37, 0.22, 0.41, 0.3, 0.31, 0.39,
0.35, 0.06, 0.59, 0.19, 0.2, 0.61,
0.4, 0.18, 0.42, 0.16, 0.28, 0.56,
0.27, 0.06, 0.67, 0.37, 0.08, 0.55)),
.Names = c("country", "device_type", "proportion"),
row.names = c(NA, -24L),
class = c("data.table", "data.frame"))
link_order <- unique(dat$country)
node_order <- unique(as.vector(rbind(dat$country, as.character(dat$device_type))))
link_cols <- data.frame(color = c('#ffd1ab', '#ff8d14', '#ff717e', '#dd2c40', '#d6b0ea',
'#8c4fab','#00addb','#297cbe'),
country = c("UK", "Canada", "USA", "China", "Spain", "Japan", "Argentina", "Brazil"),
stringsAsFactors = FALSE)
node_cols <- data.frame(color = c("#ffc796", "#ff7100", "#ff485b", "#d20000",
"#cc98e6", "#6f2296", "#009bd2", "#005daf",
"grey", "grey", "grey"),
type = c("UK", "Canada", "USA", "China", "Spain", "Japan",
"Argentina", "Brazil", "multi", "desktop", "mobile"),
stringsAsFactors = FALSE) # keep colours as character strings, as in link_cols
link_cols2 <- sapply(link_order, function(x) link_cols[x == link_cols$country, "color"])
node_cols2 <- sapply(node_order, function(x) node_cols[x == node_cols$type, "color"])
actual_link_cols <- paste0("[", paste0("'", link_cols2,"'", collapse = ','), "]")
actual_node_cols <- paste0("[", paste0("'", node_cols2,"'", collapse = ','), "]")
opts <- paste0("{
link: { colorMode: 'source',
colors: ", actual_link_cols ," },
node: {colors: ", actual_node_cols ,"}}")
Sankey <- gvisSankey(dat,
from = "country",
to = "device_type",
weight = "proportion",
options = list(height = 500, width = 1000, sankey = opts))
plot(Sankey)
You can also try the "ggalluvial" package and its geoms (geom_alluvium() and geom_stratum()).
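As a rough sketch of what that could look like with the sample data from the question (the layout, theme and fill mapping are illustrative choices, not a reproduction of the brochure graphic):

```r
library(ggplot2)
library(ggalluvial)

# Sample data retyped from the question
dat <- data.frame(
  country = rep(c("Argentina", "Brazil", "Canada", "China",
                  "Japan", "Spain", "UK", "USA"), each = 3),
  device_type = factor(rep(c("desktop", "mobile", "multi"), times = 8)),
  proportion = c(0.37, 0.22, 0.41, 0.30, 0.31, 0.39,
                 0.35, 0.06, 0.59, 0.19, 0.20, 0.61,
                 0.40, 0.18, 0.42, 0.16, 0.28, 0.56,
                 0.27, 0.06, 0.67, 0.37, 0.08, 0.55)
)

# Two axes (country -> device type); flow width is the proportion
p <- ggplot(dat, aes(axis1 = country, axis2 = device_type, y = proportion)) +
  geom_alluvium(aes(fill = country)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("country", "device_type"), expand = c(0.1, 0.1)) +
  theme_minimal()
print(p)
```

Since each country's proportions sum to 1, every country stratum has the same height here; a per-country traffic weight would be needed to make the strata differ in size.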
I am trying to calculate a CAGR value, defined as (Ending/Beginning)^(1/number of years)-1.
I have a df which has columns "Stock", "date", "Annual.Growth.Rate". To quickly note: I was trying to do this using the lag function; however, I wasn't able to restart the recursive formula at the beginning of each stock. It'll make more sense looking at the dput:
structure(list(Stock = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
date = structure(c(6L, 2L, 3L, 4L, 5L, 1L, 12L, 8L, 9L, 10L,
11L, 7L), .Label = c("3/28/16", "3/29/12", "3/29/13", "3/29/14",
"3/29/15", "3/30/11", "6/28/16", "6/29/12", "6/29/13", "6/29/14",
"6/29/15", "6/30/11"), class = "factor"), Annual.Growth.Rate = c(0.1,
0.2, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2, 0.14, 0.14, 0.14, 0.14
), Growth = c(110, 132, 145.2, 159.72, 175.692, 193.2612,
130, 156, 177.84, 202.7376, 231.120864, 263.477785), CAGR = c(0.098479605,
0.098479605, 0.098479605, 0.098479605, 0.098479605, 0.098479605,
0.125, 0.125, 0.125, 0.125, 0.125, 0.125)), .Names = c("Stock",
"date", "Annual.Growth.Rate", "Growth.on.100", "CAGR"), class = "data.frame", row.names = c(NA,
-12L))
This is the expected output. Before, there were only the stock, date, and growth rate columns. The growth on 100 is not simply a "lag" of the previous row: the first available date is multiplied by a given starter, in this case 100, giving (1+.1)*100 = 110, and each following growth value is the previous value (110) times (1 + the next growth rate). I can figure out how to do the CAGR using dplyr, but I'm really stuck on growth on 100.
You could use cumprod in a mutate. Also the starting 100 value is arbitrary. It is all a product. You can calculate the rest of the product then multiply by the starter.
library(dplyr)
starter <- 100
my.data <- data.frame(stock=c('a','a','a','b','b','b'), growth = c(.1,.2,.1,.1,.1,.1), date = c(1,2,3,1,2,3)) #example Data
my.data
my.data %>%
group_by(stock) %>%
mutate(growth.unit = order_by(date,cumprod(1+growth)),
growth = growth.unit*starter) -> new.data
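Applied to the data from the question, the same idea can be sketched as follows. The CAGR line is an assumption about the convention behind the expected output: taking the first Growth.on.100 value as "Beginning", the last as "Ending", and the number of rows per stock as the number of years reproduces the 0.0985 shown for stock A. A different convention (for example starting from 100) would give a different number.

```r
library(dplyr)

# Stock, date and growth rate retyped from the question
stocks <- data.frame(
  Stock = rep(c("A", "B"), each = 6),
  date  = as.Date(c("3/30/11", "3/29/12", "3/29/13", "3/29/14", "3/29/15", "3/28/16",
                    "6/30/11", "6/29/12", "6/29/13", "6/29/14", "6/29/15", "6/28/16"),
                  format = "%m/%d/%y"),
  Annual.Growth.Rate = c(0.1, 0.2, 0.1, 0.1, 0.1, 0.1,
                         0.3, 0.2, 0.14, 0.14, 0.14, 0.14)
)

starter <- 100

result <- stocks %>%
  arrange(Stock, date) %>%   # make sure rows are chronological within each stock
  group_by(Stock) %>%
  mutate(
    # running product of (1 + rate), restarted for every stock
    Growth.on.100 = starter * cumprod(1 + Annual.Growth.Rate),
    # assumed convention: (last / first)^(1 / number of rows) - 1
    CAGR = (last(Growth.on.100) / first(Growth.on.100))^(1 / n()) - 1
  ) %>%
  ungroup()

print(result)
```

Because the data are grouped by Stock, cumprod() restarts at the first row of each stock, which is the part a plain lag() approach makes awkward.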
I want to plot a dendrogram for a dataframe data with 7 columns (2 factor, 5 numeric). The first column contains the names of 7 different countries, and the following columns hold data for different parameters (like population, GDP etc.) characterizing each country. In the last column, a factor variable records which continent the respective country belongs to.
Here is the data
structure(list(Country = structure(c(5L, 4L, 7L, 2L, 1L, 6L,
3L), .Label = c("Brazil", "Chile", "China", "France", "Germany",
"India", "Netherlands"), class = "factor"), GDP = c(0.46, 0.57,
0.75, 0.56, 0.28, 0.88, 1), Population = c(0.18, 0.09, 0.54,
0.01, 0.02, 0.17, 0.84), Birth.rate = c(87.21, 18.34, 63.91,
14.21, 5.38, 51.19, 209.26), Income = c(43.89, 18.23, 63.91,
12.3, 0.1, 14.61, 160.82), Savings = c(43.32, 0.11, 0, 1.91,
5.29, 36.58, 50.38), Continent = structure(c(2L, 2L, 2L, 3L,
3L, 1L, 1L), .Label = c("Asia", "Europe", "South America"), class = "factor")), .Names = c("Country",
"GDP", "Population", "Birth.rate", "Income", "Savings", "Continent"
), class = "data.frame", row.names = c(NA, -7L))
Now the dendrogram which I want to obtain should have the following characteristics:
1. the leaf labels should be colored according to their continent membership
2. the leaves should be labeled according to the respective country (NOT numbers)
3. there should be rectangles around the clusters
I have tried the dendextend package which can be found here https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches but 2. and 3. of the above characteristics seem not to work together at the same time. My code looks like this (after having normalized data to norm)
#color codes for continents
regionCodes = c(rep("Europe",3), rep("South America", 2), rep("Asia",2))
rownames(data) = make.unique(regionCodes)
colorCodes = c(Europe="blue", "South America"="yellow", Asia="red")
#dendrogram generation and plot
dc = as.dendrogram(hclust(dist(norm), method="complete"))
labels_colors(dc) = colorCodes[regionCodes][order.dendrogram(dc)]
labels(dc) = data$Country
labels_cex(dc) = .7
dc %>% plot
dc %>% rect.dendrogram(k=4, border = 8, lty = 5, lwd = 2)
But it produces the following error
Error in data$Country : object of type 'closure' is not subsettable
Can you help me?
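A note on the error: "object of type 'closure' is not subsettable" means that data$Country is being evaluated on a function. If no data frame called data exists in the workspace, the name data resolves to the base R function data(), so the data frame probably needs to be assigned first (ideally under a different name). The sketch below shows one way to get all three characteristics using only base R, without dendextend. It assumes the normalization was done with scale() and reuses the colours and k = 4 from the code above; the name dat is arbitrary.

```r
# Data retyped from the question, stored as 'dat' to avoid clashing with data()
dat <- data.frame(
  Country    = factor(c("Germany", "France", "Netherlands", "Chile",
                        "Brazil", "India", "China")),
  GDP        = c(0.46, 0.57, 0.75, 0.56, 0.28, 0.88, 1),
  Population = c(0.18, 0.09, 0.54, 0.01, 0.02, 0.17, 0.84),
  Birth.rate = c(87.21, 18.34, 63.91, 14.21, 5.38, 51.19, 209.26),
  Income     = c(43.89, 18.23, 63.91, 12.3, 0.1, 14.61, 160.82),
  Savings    = c(43.32, 0.11, 0, 1.91, 5.29, 36.58, 50.38),
  Continent  = factor(c("Europe", "Europe", "Europe", "South America",
                        "South America", "Asia", "Asia"))
)

# Assumed normalization: centre and scale each numeric column
norm <- scale(dat[sapply(dat, is.numeric)])
# Row names become the leaf labels, so leaves show countries, not numbers
rownames(norm) <- as.character(dat$Country)

hc <- hclust(dist(norm), method = "complete")

# Colour lookup: country name -> colour of its continent
colorCodes <- c(Europe = "blue", "South America" = "yellow", Asia = "red")
leaf_col <- setNames(colorCodes[as.character(dat$Continent)],
                     as.character(dat$Country))

# Set the label colour on every leaf of the dendrogram
dend <- dendrapply(as.dendrogram(hc), function(node) {
  if (is.leaf(node)) {
    attr(node, "nodePar") <- list(lab.col = leaf_col[[attr(node, "label")]],
                                  lab.cex = 0.7, pch = NA)
  }
  node
})

plot(dend)
# Rectangles around the clusters obtained by cutting the tree into k groups
rect.hclust(hc, k = 4, border = 8)
```

rect.hclust() works on the hclust object but draws onto the current plot; since the dendrogram keeps the same leaf order and heights, the rectangles line up with plot(dend).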
I'd like to make a simple graph: mean in the middle and min and max as whiskers. No box. What is the easiest way to do it?
Thanks.
structure(list(Country = structure(c(1L, 9L, 6L, 5L, 3L, 8L,
7L), .Label = c("BU", "CZ", "ES", "HU", "LT", "LV", "PL", "SL",
"UK"), class = "factor"), Mean = c(0.68, 0.56, 0.44, 0.31, 0.27,
0.8, 0.13), Min = c(0.44, 0.34, -0.35, -0.05, -0.16, 0.76, -0.44
), Max = c(0.85, 0.83, 0.83, 0.84, 0.55, 0.85, 0.84)), .Names = c("Country",
"Mean", "Min", "Max"), row.names = c(1L, 2L, 3L, 4L, 5L, 8L,
9L), class = "data.frame")
That's how I did it, thanks to @astrosyam.
corr1 = corr[c(1:5, 8:9),1:4] # to remove NAs
# to order the countries the way I need
Country1 = factor(corr1$Country, levels(corr1$Country)[c(1,9,6,5,3,8,7)])
x = as.integer(Country1) # x positions of the countries on the plot
plot(Country1, corr1$Mean, pch=19, ylim=range(c(corr1$Min, corr1$Max)))
# hack: we draw arrows from Min to Max but with very special "arrowheads"
arrows(x, corr1$Min, x, corr1$Max, length=0.05, angle=90, code=3)
Here's a hackish way of getting it done using bxp:
bxp(
list(
stats=rbind(df$Min,df$Max,df$Mean,df$Min,df$Max),
n=seq_len(nrow(df)),
names=df$Country
),
lty=1,
boxlty=0
)
One way to do it with ggplot is to recast the data in long format using the tidyr package's gather function, as follows:
library(tidyr)
library(ggplot2)
df <- df %>% gather(Type, Value, -Country)
ggplot(df, aes(x = Country, y = Value, col = Country)) +
geom_line() +
geom_point(size = 2) +
theme(legend.position = 'none')
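If reshaping is not wanted, the Min and Max columns can also be mapped directly to an error bar. This is a minimal sketch with the data retyped from the question (the name corr and the bar width are arbitrary choices):

```r
library(ggplot2)

# Data retyped from the question
corr <- data.frame(
  Country = c("BU", "UK", "LV", "LT", "ES", "SL", "PL"),
  Mean = c(0.68, 0.56, 0.44, 0.31, 0.27, 0.80, 0.13),
  Min  = c(0.44, 0.34, -0.35, -0.05, -0.16, 0.76, -0.44),
  Max  = c(0.85, 0.83, 0.83, 0.84, 0.55, 0.85, 0.84)
)

# Keep the countries in the order they appear in the data
corr$Country <- factor(corr$Country, levels = corr$Country)

# Whiskers run from Min to Max, the point marks the mean, no box
p <- ggplot(corr, aes(x = Country, y = Mean, ymin = Min, ymax = Max)) +
  geom_errorbar(width = 0.2) +
  geom_point(size = 2)
print(p)
```

geom_pointrange() would give a similar picture without the horizontal caps on the whiskers.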