I have a data set like this (simplified for illustration purposes):
zz <- textConnection("Company Market.Cap Institutions.Own Price.Earnings Industry
ExxonMobil 405.69 50% 9.3 Energy
Citigroup 156.23 67% 18.45 Banking
Pfizer 212.51 73% 20.91 Pharma
JPMorgan 193.1 75% 9.12 Banking
")
Companies <- read.table(zz, header= TRUE)
close(zz)
I would like to create a bubble chart (well, something like a bubble chart) with the following properties:
each bubble is a company, with the size of the bubble tied to market cap,
the color of the bubble tied to industry,
with the x-axis having two categories, Institutions.Own and Price.Earnings,
and the y-axis being a 1-10 scale, each company's values being normalized to that scale. (I could of course do the normalization outside R but I believe R makes that possible.)
To be clear, each company will appear in each column of the result, for example ExxonMobil will be near the bottom of both the Institutions.Own column and the Price.Earnings column; ideally, the name of the company would appear in or next to both of its bubbles.
I think this touches on all of your points. Note: your Institutions.Own column is read in as a factor because of the % sign. I simply deleted the % signs to save time, so you'll need to address that somewhere. Once that is done, I would use ggplot and map your different aesthetics accordingly. You can adjust the axis titles and so on if you need to.
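One possible way to handle the % sign, and to get the 1-10 scale the question asks for, is sketched below. The data frame is a hand-typed copy of the example data, and rescale_1_10 is a hypothetical helper name, not part of any package.

```r
# Hand-typed copy of the example data, with the percentages kept as text
Companies <- data.frame(
  Company          = c("ExxonMobil", "Citigroup", "Pfizer", "JPMorgan"),
  Market.Cap       = c(405.69, 156.23, 212.51, 193.1),
  Institutions.Own = c("50%", "67%", "73%", "75%"),
  Price.Earnings   = c(9.3, 18.45, 20.91, 9.12),
  Industry         = c("Energy", "Banking", "Pharma", "Banking")
)

# Strip the "%" and convert to numeric; as.character() also covers the
# case where the column was read in as a factor
Companies$Institutions.Own <- as.numeric(
  sub("%", "", as.character(Companies$Institutions.Own), fixed = TRUE)
)

# Hypothetical helper: linear rescale so min(x) maps to 1 and max(x) to 10
rescale_1_10 <- function(x) 1 + 9 * (x - min(x)) / (max(x) - min(x))

rescale_1_10(Companies$Price.Earnings)
```

With the column converted this way, the code below runs unchanged on the original data.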
#Requisite packages
library(ggplot2)
library(reshape2)
#Define function (maps values onto 0-10); adjust this as necessary, e.g. for a 1-10 scale
rescaler <- function(x) 10 * (x-min(x)) / (max(x)-min(x))
#Rescale your two variables
Companies$Inst.Scales <- with(Companies, rescaler(Institutions.Own))
Companies$Price.Scales <- with(Companies, rescaler(Price.Earnings))
#Melt into long format
Companies.m <- melt(Companies, measure.vars = c("Inst.Scales", "Price.Scales"))
#Plotting code
ggplot(Companies.m, aes(x = variable, y = value, label = Company)) +
geom_point(aes(size = Market.Cap, colour = Industry)) +
geom_text(hjust = 1, size = 3) +
scale_size(range = c(4,8)) +
theme_bw()
Results in:
So I have a data set that sorts DJs by Rank, the year they received that rank, and the name of the DJ that received the previously mentioned information, laid out on a horizontal axis in Excel.
When I plot the data I'm currently working with, I get a line chart with a vertical line from 1 to 5 for each year, and I'm not sure what to do from here.
library(ggplot2)
library(plyr)
DJMAG <- DJMAG_MOdified
Top <-data.frame(DJMAG$Year, DJMAG$Rank , DJMAG$DJ)
names(Top) <- c("Year","Rank","DJ")
ggplot(Top, aes(Top$Year)) +
geom_line(aes(y = as.numeric(Top$Rank), color = "Hardwell")) + xlab("2004 to 2018") + ylab("Rank")
There are no error messages, but what I'm trying to show is how each DJ's ranking increased or decreased from 2004 to 2017: one line per DJ, with Year on the x-axis and the top-5 ranks (1-5) on an inverted y-axis.
So I took the liberty of coming up with some example data.
DJMAG_MOdified <- data.frame(Year=rep(2004:2018,3),
Rank=runif(45,0,1),
DJ=rep(c("A","B","C"),each=15),
Other=runif(45,0,1))
I deliberately added the Other column so that we still need to subset the data, as you have done.
Instead of your method which was:
Top <-data.frame(DJMAG$Year, DJMAG$Rank , DJMAG$DJ)
names(Top) <- c("Year","Rank","DJ")
It would be preferable to do it in one line, where you don't need to change the column names, as follows:
Top <- DJMAG_MOdified[,c("Year","Rank","DJ")]
As for the plot, I am thinking maybe this is what you are looking for, where each DJ is represented by a different coloured line?
ggplot(Top, aes(x=Year,y=as.numeric(Rank))) +
geom_line(aes(col = DJ)) +
xlab("2004 to 2018") +
ylab("Rank")
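I didn't understand where the color = "Hardwell" part of your code came from. Inside aes(), a constant string like that is treated as a single category, so every row ends up in one group and the line joins all DJs' ranks within each year, which would explain the vertical lines.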
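The question also asks for an inverted y-axis with ranks 1-5, which the plot above does not do. A minimal sketch of one way to get that, using scale_y_reverse() and made-up integer ranks (the data here is hypothetical, not the real DJ Mag table):

```r
library(ggplot2)

# Hypothetical example data: three DJs, one integer rank (1-5) per year
set.seed(1)
Top <- data.frame(
  Year = rep(2004:2018, 3),
  Rank = sample(1:5, 45, replace = TRUE),
  DJ   = rep(c("A", "B", "C"), each = 15)
)

# One line per DJ; scale_y_reverse() puts rank 1 at the top of the axis
p <- ggplot(Top, aes(x = Year, y = Rank, colour = DJ, group = DJ)) +
  geom_line() +
  scale_y_reverse(breaks = 1:5) +
  labs(x = "2004 to 2018", y = "Rank")
p
```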
I am trying to vary the color of the LINE in an R graph (within the same series of data). For example if you had points showing temperature for weeks of the year, and they're connected with geom_line() or equivalent, how would I show say the line being deeper red for higher temperature weeks and gradually changing to say yellow in colder weeks?
(if BOTH points and line could vary across the same palette/gradient based on the same variable - say temperature, that would be ideal).
bp.df <- NULL
bp.weeks <- 26
bp.days <- 7 * bp.weeks
bp.df$day <- 1:bp.days
bp.df$week <- ceiling(bp.df$day / 7)
bp.mean.normal <- 100
bp.sd.normal <- 20
bp.df <- as.data.frame(bp.df)
bp.df$normal <- rnorm(nrow(bp.df),bp.mean.normal, bp.sd.normal)
bp.df$day <- as.numeric(bp.df$day)
# make sure ggplot2 is installed and loaded
g <- ggplot(bp.df, aes(x = day, y = normal, color = normal)) +
geom_point() +
geom_line(col = bp.df$normal)
g
Why do the line colors not match the levels of the "normal" variable? I understand that, since they connect two dots, the lines must "decide" which one's value to use, but this output seems to make the colors completely random.
If varying across a gradient won't work, how would I make the line be red for the first 50 days, green for the next 50, etc?
Try this:
g <- ggplot(bp.df, aes(x=day, y = normal, color = normal)) + geom_point() +
geom_line(aes(color=normal))
g
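It seems the col argument behaves differently from color inside aes(): a value supplied outside aes() is used as a literal colour specification rather than being mapped through the colour scale, which is why the line colours looked unrelated to the point colours.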
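For the second part of the question (red for the first 50 days, green for the next 50, and so on), one option is to bin the days and map the bin to colour. This is only a sketch: the segment column name and the colour choices are assumptions, and the data is simulated the same way as in the question.

```r
library(ggplot2)

# Simulated data in the same shape as the question's bp.df
set.seed(42)
bp.df <- data.frame(day = 1:182)
bp.df$normal <- rnorm(nrow(bp.df), 100, 20)

# Bin the days into blocks of 50; 'segment' is a hypothetical column name
bp.df$segment <- cut(
  bp.df$day,
  breaks = seq(0, 200, by = 50),
  labels = c("1-50", "51-100", "101-150", "151-200")
)

# group = 1 keeps a single connected line while the colour changes per block
g <- ggplot(bp.df, aes(x = day, y = normal)) +
  geom_line(aes(colour = segment, group = 1)) +
  scale_colour_manual(values = c("red", "green", "blue", "orange"))
g
```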
I am trying to build a parallel coordinate diagram in R for showing the difference in ranking in different age groups. And I want to have a fixed scale on the Y axis for showing the values.
Here is a PC plot :
The goal is to see the slopes of the lines really well. So if I have value 1 that is bound with the value 1000, I want to see the line going aaall the way down steeply.
In R so far, if I have values that are too big, my plot is all squished so everything fits and it's hard to visualize anything.
My code for drawing the parallel coordinate plot is the following so far:
pc_18_34 <- read.table("parCoordData_18_24_25_34.csv", header=FALSE, sep="\t")
#name columns of data frame
colnames(pc_18_34) = c("18-25","25-34")
#build the parallel coordinate plot
# doc : http://docs.ggplot2.org/current/geom_path.html
group <- rep(c("Top 10", "Top 10-29", "Top 30-49"), each = 18)
df <- data.frame(id = seq_along(group), group, pc_18_34[,1], pc_18_34[,2])
colnames(df)[3] = "18-25"
colnames(df)[4] = "25-34"
library(reshape2) # for melt
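dfm <- melt(df, id.var = c("id", "group"))
# rename the melted columns before referring to them by name
colnames(dfm)[3] = "AgeGroup"
colnames(dfm)[4] = "ArtistRank"
# sort by group and rank, and keep the sorted result
dfm <- dfm[order(dfm$group,dfm$ArtistRank,decreasing=TRUE),]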
ggplot(dfm, aes(x=AgeGroup, y=ArtistRank, group = id, colour = group), main="Tops across age groups")+ geom_path(alpha = 0.5, size=1) + geom_path(aes(color=group))
I have looked into how to change the scales in ggplot, using libraries like scales, but when I add a scale layer, the diagram doesn't even show up anymore.
Any thoughts on how to use a fixed scale (say, a difference of 1 in rank shown as 5px in the plot), even if it means that the plot is very tall?
Thaanks !! :)
You can set the panel height to an absolute size based on the number of axis breaks. Note that the device won't scale automatically, so you'll have to adjust it manually for your plot to fit well.
library(ggplot2)
library(gtable)
library(grid) # provides unit(), grid.newpage() and grid.draw()
p <- ggplot(Loblolly, aes(height, factor(age))) +
geom_point()
gb <- ggplot_build(p)
gt <- ggplot_gtable(gb)
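# number of y-axis breaks; this accessor is tied to the ggplot2 version in use
# (on releases where gb$panel no longer exists, the ranges live under gb$layout instead)
n <- length(gb$panel$ranges[[1]]$y.major_source)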
# locate the panel in the gtable layout
panel <- gt$layout$t[grepl("panel", gt$layout$name)]
# assign new height to the panels, based on the number of breaks
gt$heights[panel] <- list(unit(n*25,"pt"))
grid.newpage()
grid.draw(gt)
I have a question regarding data handling in R. I have two datasets. Both are originally .csv files.
I've prepared two example Datasets:
Table A - Persons
http://pastebin.com/HbaeqACi
Table B - City
http://pastebin.com/Fyj66ahq
To keep the work to a minimum, here is the corresponding R code for loading and visualizing the data.
# Read csv files
# check pastebin links and save content to persons.csv and city.csv.
persons_dataframe = read.csv("persons.csv", header = TRUE)
city_dataframe = read.csv("city.csv", header = TRUE)
# plot them on a map
# load used packages
library(RgoogleMaps)
library(ggplot2)
library(ggmap)
library(sp)
persons_ggplot2 <- persons_dataframe
city_ggplot2 <- city_dataframe
gc <- geocode('new york, usa')
center <- as.numeric(gc)
G <- ggmap(get_googlemap(center = center, color = 'color', scale = 4, zoom = 10, maptype = "terrain", frame=T), extent="panel")
G1 <- G + geom_point(aes(x=POINT_X, y=POINT_Y ),data=city_dataframe, shape = 22, color="black", fill = "yellow", size = 4) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=persons_dataframe, shape = 8, color="red", size=2.5)
plot(G1)
As a result I have a map, which visualizes all cities and persons.
My problem: All persons are distributed only on these three cities.
My questions:
A more general question: Is this a problem for R?
I want to create something like a bubble map, which visualized the amount of persons at one position. Like: In City A there are 20 persons, in City B are 5 persons. The position at city A should get a bigger bubble than City B.
I want to create a label, which states the amount of persons at a certain position. I've already tried to realize this with the ggplot2 geom_text options, but I can't figure out how to sum up all points at a certain position and write this to a label.
A more theoretical approach (maybe I'll come back to this later on): I want to create something like a density map / cluster map, which shows the area with the highest amount of persons. I've already searched for some packages which I could use. Suggested ones were SpatialEpi, spatstat and DCluster. My question: Do I need the distance from the persons to a certain object (let's say a supermarket) to perform a cluster analysis?
Hopefully, these were not too many questions.
Any help is much appreciated. Thanks in advance!
Btw: Is there a better way to prepare a question containing example datasets? Should I upload a file somewhere or is the pastebin way okay?
You can create the bubble chart by counting the number in each city and mapping the size of the points to the counts:
library(plyr)
persons_count <- count(persons_dataframe, vars = c("city", "POINT_X", "POINT_Y"))
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red")
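If plyr is not available, the same per-city counts can be produced in base R. A minimal sketch, using a made-up persons table that only borrows the column names used above (the real data comes from the pastebin files):

```r
# Hypothetical stand-in for persons_dataframe: 20 people in city A, 5 in city B
persons_dataframe <- data.frame(
  city    = c(rep("A", 20), rep("B", 5)),
  POINT_X = c(rep(-74.00, 20), rep(-73.90, 5)),
  POINT_Y = c(rep(40.70, 20), rep(40.80, 5))
)

# Add a column of ones and sum it per location; the result has a 'freq'
# column, matching the name that plyr::count() produces
persons_count <- aggregate(
  freq ~ city + POINT_X + POINT_Y,
  data = cbind(persons_dataframe, freq = 1),
  FUN  = sum
)
persons_count
```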
You can map the counts to the area of the points, which perhaps gives a better sense of the relative sizes:
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red") +
scale_size_area(breaks = unique(persons_count$freq))
You can add the frequency labels, though this is somewhat redundant with the size scale legend:
G + geom_point(aes(x=POINT_X, y=POINT_Y, size=freq),data=persons_count, color="red") +
geom_text(aes(x = POINT_X, y=POINT_Y, label = freq), data=persons_count) +
scale_size_area(breaks = unique(persons_count$freq))
You can't really plot densities with your example data because you only have three points. But if you had more fine-grained location information you could calculate and plot the densities using the stat_density2d function in ggplot2.
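As a rough illustration of the stat_density2d idea, here is a sketch on simulated coordinates. The points are invented (scattered around an arbitrary location) purely to have enough distinct positions for a density estimate, and the layer is drawn on a plain ggplot rather than on the map object G. The density estimate relies on the MASS package being installed.

```r
library(ggplot2)

# Simulated, finer-grained locations (hypothetical; not the pastebin data)
set.seed(1)
pts <- data.frame(
  POINT_X = rnorm(200, mean = -74.0, sd = 0.05),
  POINT_Y = rnorm(200, mean = 40.7, sd = 0.05)
)

# Points plus 2D kernel density contours
p <- ggplot(pts, aes(x = POINT_X, y = POINT_Y)) +
  geom_point(alpha = 0.3) +
  stat_density2d(colour = "red")
p
```

The same stat_density2d layer could be added to G with data = pts and the aesthetics supplied inside the layer, as in the geom_point calls above.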
I have a data frame, where I am talking about different flows of water at a dam (water units are kcfs—1000 cubic feet per second—if anyone is interested)
Call it df4plot
date kcfs Flowtype
10/1/2010 50 Power
10/1/2010 10 Spill_Overgen
10/1/2010 8 Spill_Force
10/2/2010 52 Power
10/2/2010 7 Spill_Overgen
10/2/2010 10 Spill_Force
(there are 3x365 rows in the data frame)
So what I want to do is make an aggregated area graph that shows each of these flows
p <- ggplot(data = df4plot, aes(date,kcfs)) +
geom_area(aes(colour = Flowtype, fill=Flowtype), position = "stack")
I want to control the colors used, so I added
plot_colors_aggregate <- c("forestgreen","lightsalmon","dodgerblue")
p <- p +
scale_color_manual(values = plot_colors_aggregate) +
scale_fill_manual(values = plot_colors_aggregate)
Now I want to add a dashed line, showing the maximum turbine capacity—the flow limits for power generation—that vary by month. I have a separate dataframe for this (365 rows long), df4FGline
Date FGlimit
10/1/2010 52
10/2/2010 52
…
11/1/2010 60
11/2/2010 60
...
Etc
So now I have
p <- p +
geom_line(data = df4FGline, aes(x=date,y=FGlimit), colour = "darkblue", linetype = "dashed")
p
The legend is currently just the three blocks for the three types of Flowtype. I’d like to add the dashed line for the flow gate limits to the bottom, but I can’t get it to show up there.
It is probably related to my incomplete understanding of aes (help(aes) is AMAZINGLY unhelpful).
I’ve tried something similar to this and something similar to this, but since I’m only trying to add 1 line to a pre-existing legend, maybe?, this is not working for me.
I tried adding “legend = TRUE” inside the parentheses for the geom_line, but it put a dashed line inside each color box in the legend, AND created a 4th entry for the legend, but offset from the rest of the legend (below and to the right)... ARG!
I swear I have the book on order... any help you can share so that I understand this aesthetic thing and how it relates to the legend a little better, I'd be extremely grateful.
This should help:
df <- data.frame(x = 1:10,y = 1:10)
ggplot(df,aes(x = x,y = y)) +
geom_line(aes(linetype = "dashed")) +
scale_linetype_manual(name = "Linetype",values = "dashed")
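Applied to the dam example, the same idea might look like the sketch below: linetype is mapped inside aes() to a constant label, so ggplot2 builds a legend entry for it, and scale_linetype_manual() assigns the dashed style. The two data frames are small made-up stand-ins for df4plot and df4FGline, and the label "Flow gate limit" is an assumed name.

```r
library(ggplot2)

# Hypothetical stand-ins for the question's data frames (10 days only)
dates <- as.Date("2010-10-01") + 0:9
df4plot <- data.frame(
  date     = rep(dates, each = 3),
  kcfs     = rep(c(50, 10, 8), times = 10),
  Flowtype = rep(c("Power", "Spill_Overgen", "Spill_Force"), times = 10)
)
df4FGline <- data.frame(date = dates, FGlimit = 52)

p <- ggplot(df4plot, aes(date, kcfs)) +
  geom_area(aes(fill = Flowtype), position = "stack") +
  scale_fill_manual(values = c("forestgreen", "lightsalmon", "dodgerblue")) +
  # linetype is mapped (inside aes), so it gets its own legend entry;
  # colour is set (outside aes), so it stays a fixed dark blue
  geom_line(data = df4FGline,
            aes(x = date, y = FGlimit, linetype = "Flow gate limit"),
            colour = "darkblue") +
  scale_linetype_manual(name = NULL, values = c("Flow gate limit" = "dashed"))
p
```

Anything placed inside aes() is treated as data to be mapped through a scale, and scales are what generate legends; anything set outside aes() is a fixed property of the layer and gets no legend entry.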