R column dataframe names number - r

I have a dataframe like this
geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13
Because the names of the 2nd and 3rd columns are treated as numbers, I'm not able to get a bar chart with ggplot2. Is there a way to make the column names be treated as text?
data
pivot_dat <- read.table(text="geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13", stringsAsFactors = FALSE, header = TRUE)
pivot_dat <- setNames(pivot_dat,c("geo","2001","2002"))

Here's how to do it :
library(ggplot2)
ggplot(pivot_dat, aes(x = geo, y = `2002`)) + geom_col()+ coord_flip()
By using backticks instead of single or double quotes, you make sure you pass a name to the function and not a string.
If you use quotes, ggplot will convert this character value to a factor and recycle it, so all bars will have the same length of 1, and a label of value "2002".
Note 1 :
You might want to learn the difference between geom_col and geom_bar :
?ggplot2::geom_bar
In short geom_col is geom_bar with stat = "identity", which is what you want here since you want to show on your plot the raw values from your table.
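A minimal sketch of that difference, using a small made-up table (the data frame dat and its value column are placeholders, not the asker's data). geom_col() and geom_bar(stat = "identity") draw the values as given, while plain geom_bar() counts rows:

```r
library(ggplot2)

# Placeholder data for illustration only
dat <- data.frame(geo = c("Spain", "Germany"), value = c(23, 50))

# Both of these plot the raw values from the table
p_col <- ggplot(dat, aes(x = geo, y = value)) + geom_col()
p_bar <- ggplot(dat, aes(x = geo, y = value)) + geom_bar(stat = "identity")

# Default geom_bar() uses stat = "count": bar height is the number of rows per geo
p_count <- ggplot(dat, aes(x = geo)) + geom_bar()
```

With one row per country, p_count would draw every bar at height 1, which is rarely what you want for a table of pre-computed values.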
Note 2:
aes_string can be used to give strings instead of names, but here it doesn't work because aes_string parses "2002" as R code, so it is evaluated as the number 2002 rather than as a column name:
ggplot(pivot_dat, aes_string(x = "geo", y = "2002")) +
geom_col()+ coord_flip() # incorrect output
ggplot(pivot_dat, aes_string(x = "geo", y = "`2002`")) +
geom_col()+ coord_flip() # correct output
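As far as I know, aes_string() has been soft-deprecated since ggplot2 3.0.0, and the .data pronoun is now the usual way to refer to a column by a string. A sketch, assuming a reasonably recent ggplot2; year_col is a hypothetical variable name and the data frame is rebuilt here only to keep the example self-contained:

```r
library(ggplot2)

pivot_dat <- data.frame(
  geo = c("Spain", "Germany", "Italy", "France"),
  `2001` = c(21, 34, 57, 19),
  `2002` = c(23, 50, 89, 13),
  check.names = FALSE  # keep "2001"/"2002" instead of "X2001"/"X2002"
)

year_col <- "2002"  # hypothetical: the column name held as a string

# .data[[year_col]] looks the string up as a column name, so no backticks are needed
p <- ggplot(pivot_dat, aes(x = geo, y = .data[[year_col]])) +
  geom_col() +
  coord_flip()
```

This form is handy when the year is chosen programmatically, for example inside a function or a loop over columns.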

Without an example to see exactly what your problem is, and what you want, it is hard to give you a perfect answer. But here's the thing.
You can do a geom_bar with numeric data. There are 3 possible ways I see that you could have problems (but I may not be able to guess every way).
First, let's set up R for plotting.
library(readr)
library(ggplot2)
test <- read_csv("geo,2001,2002
Spain,21,23
Germany,34,50
Italy,57,89
France,19,13")
Next, let's make the first mistake: incorrectly calling the column name. In the next example I will tell ggplot to make a bar of the number 2001, not the column 2001! R has to decide whether we mean the number 2001 or the object named 2001. A bare 2001 is always read as the number, not the column.
ggplot(test) +
geom_bar(aes(x=2001))
OK, that just gives you a bar at 2001, because you gave it a single number instead of a column. Let's fix that. Use backticks (``) to identify the column name 2001 instead of the number 2001.
ggplot(test) +
geom_bar(aes(x=`2001`))
This creates a perfectly workable bar chart. But maybe you don't want the gaps between the bars? That's the only reason I can think of to use text instead of a number. Since you asked for text, here is how to use as.factor to do something similar (and more powerful).
ggplot(test) +
geom_bar(aes(x=as.factor(`2001`)))

Related

Making a time plot of the frequency that a certain value appears in data set

I have a dataset about a university's student body with 10 columns that represent different factors such as their student id, gender, ethnicity, etc.
For right now I'm just interested in the term they were admitted, and their ethnicity because I want to see how the number of students from different ethnic backgrounds has changed over time. So I created a new data frame with two columns called ethnicitydf:
> head(ethnicitydf)
admit_term ethn_desc
1 2011-10-01 White/Caucasian
2 2011-10-01 Filipino/Filipino-American
3 2011-10-01 White/Caucasian
4 2011-10-01 Latino/Other Spanish
5 2011-10-01 East Indian/Pakistani
6 2011-10-01 White/Caucasian
I'm not exactly sure how I would create a plot that has the admit_term (time) in the x-axis and the frequency that each ethnicity occurs for each admit_term. There are 12 unique ethnicities in the second column and I want to have the frequency of all 12 ethnicities for each admit_term (6 terms in total) in one graph, each ethnicity having a different color.
The first step I was thinking was counting up each ethnicity for each term using length(which(ethnicitydf$admit_term == "2011-10-01" & ethnicitydf$ethn_desc == "White/Caucasian")) for example and recording the data in a new data frame, but I feel like there should be a faster and more efficient way of doing this. Maybe the use of a package? Could any body help me out? Thank you!
A bar plot will do the counts for you.
library(ggplot2)
ethnicitydf <- data.frame(admit_term = sample(c("2011-10-01","2012-10-01","2013-10-01"), 100, TRUE),
ethn_desc =sample(c("White/Caucasian","Filipino/Filipino-American","East Indian/Pakistani"), 100, TRUE))
ggplot() +
geom_bar(data=ethnicitydf, mapping=aes(x=admit_term, fill=ethn_desc), position="dodge")
Created on 2019-07-03 by the reprex package (v0.3.0)
You can also just plot points if you have a lot of series, like this.
ggplot() +
geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
To get lines you will need to make sure your x axis is continuous (convert the text dates into dates or numbers, e.g. years).
ethnicitydf$admit_term <- as.Date(ethnicitydf$admit_term)
ggplot() +
geom_line(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count") +
geom_point(data=ethnicitydf, mapping=aes(x=admit_term, colour=ethn_desc), stat="count")
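If you would rather have the counts in their own data frame, as the question suggests, base R's table() can do the counting in one step. This is only a sketch on made-up data of the same shape; the column name n and the seed are my own choices:

```r
library(ggplot2)

set.seed(42)  # only so the made-up sample is reproducible
ethnicitydf <- data.frame(
  admit_term = sample(c("2011-10-01", "2012-10-01", "2013-10-01"), 100, TRUE),
  ethn_desc  = sample(c("White/Caucasian", "Filipino/Filipino-American",
                        "East Indian/Pakistani"), 100, TRUE)
)

# One row per (term, ethnicity) combination, with the count in column n
counts <- as.data.frame(
  table(admit_term = ethnicitydf$admit_term, ethn_desc = ethnicitydf$ethn_desc),
  responseName = "n",
  stringsAsFactors = FALSE
)
counts$admit_term <- as.Date(counts$admit_term)

# The counts are already computed, so no stat = "count" is needed here
p <- ggplot(counts, aes(x = admit_term, y = n, colour = ethn_desc)) +
  geom_line() +
  geom_point()
```

Unlike stat = "count", table() also keeps combinations that never occur (with n = 0), so a line drops to zero instead of skipping that term.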

Simple barplot displaying voting of a county

I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker AP FRP H KrF SP
Ostlandet 3 2 2 0 1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data is called "Ostlandet" and is only 1 row. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need for extracting rows. Second, the data are in "wide" format (mandates in columns) instead of "long format" (a column named "mandate" with values). Third, you want to plot counts so geom_col() is better than geom_histogram().
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
gather(Mandate, Votes, -Fylker)
That should generate something like this:
Fylker Mandate Votes
1 Ostlandet AP 3
2 Ostlandet FRP 2
3 Ostlandet H 2
4 Ostlandet KrF 0
5 Ostlandet SP 1
You can pass that to ggplot:
Ostfold %>%
gather(Mandate, Votes, -Fylker) %>%
ggplot(aes(Mandate, Votes)) + geom_col()
Result for your one row:
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.
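gather() still works, but the tidyr documentation marks it as superseded by pivot_longer() (tidyr 1.0.0 and later). A sketch of the same reshape with pivot_longer(), rebuilding the one-row data frame from the question so the example is self-contained:

```r
library(tidyr)
library(ggplot2)

# The single row from the question, typed in by hand
Ostfold <- data.frame(Fylker = "Ostlandet", AP = 3, FRP = 2, H = 2, KrF = 0, SP = 1)

# Wide to long: every column except Fylker becomes a (Mandate, Votes) pair
long <- pivot_longer(Ostfold, cols = -Fylker,
                     names_to = "Mandate", values_to = "Votes")

p <- ggplot(long, aes(x = Mandate, y = Votes)) + geom_col()
```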

geom_tile adds a third fill colour not in the data

It's difficult for me to create a reproducible example of this as the issue only seems to show as the size of the data frame goes up to too large to paste here. I hope someone will bear with me and help here. I'm sure I'm doing something stupid but reading the help and searching is failing (perhaps on the "stupid" issue.)
I have a data frame of 2,319 rows and three variables: clientID, month and nSlots where clientID is character, month is 1:12 and nSlots is 1:2.
> head(tmpDF2)
month clientID2 nSlots
21 1 8 1
30 2 8 1
31 4 8 1
28 5 8 1
25 6 8 1
24 7 8 1
Here's table(tmpDF2$nSlots)
> table(tmpDF2$nSlots, useNA = "always")
1 2 <NA>
1844 15 0
I'm trying to use ggplot and geom_tile to plot the attendance of clients and I expect two colours for the tiles depending on the two values of nSlots but when the size of the data frame goes up, I am getting a third colour. Here is the plot.
OK. Well I gather you can't see that so perhaps I should stop here! Aha, or maybe you can click through to that link. I hope so!
Here's the code then for what it's worth.
ggplot(dat=tmpDF2,
aes(x=month,y=clientID2,fill=nSlots)) +
geom_tile() +
# geom_text(aes(label=nSlots)) +
theme(panel.background = element_blank()) +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.line=element_line()) +
ylab("clients")
The bizarre thing (to me) is that when I keep the number of rows small, the plot seems to work fine but as the number goes up, there's a point, and I've failed utterly to find if one row in the data or value of nrow(tmpDF2) triggers it, when this third colour, a paler value than the one in the legend, appears.
TIA,
Chris

r ggplot x axis tick marks order

I've seen several questions about the order of x axis marks but still none of them could solve my problem.
I'm trying to do a density plot which shows the distribution of people by percentile within each score given like this
library(dplyr); library(ggplot2); library(ggthemes)
ggplot(KA,aes(x=percentile,group=kscore,color=kscore))+
xlab('Percentil')+ ylab('Frecuencia')+ theme_tufte()+ ggtitle("Prospectos")+
scale_color_brewer(palette = "Greens")+geom_density(size=3)
but the x axis marks get ordered like 1,10,100,11,12,..,2,20,21,..,99 instead of just 1,2,3,..,100, which is my desired output.
I fear this affects the whole plot, not just the labels.
I'll turn my comment to an answer so this can be marked resolved:
Your x variable is (almost certainly) a factor. You probably want it to be numeric.
KA$percentile = as.numeric(as.character(KA$percentile))
When you're seeing weird stuff, it's good to check on your data. Running str(KA) is a good way to see what's there. If you just want to see classes, sapply(KA, class) is a nice summary.
And it's a common R quirk that if you're converting from factor to numeric, go by way of character or you risk ending up with just the level numbers:
year_fac = factor(1998:2002)
as.numeric(year_fac) # not so good
# [1] 1 2 3 4 5
as.numeric(as.character(year_fac)) # what you want
# [1] 1998 1999 2000 2001 2002
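To see where the 1,10,100,11 ordering comes from, here is a small sketch with a made-up vector (percentile_chr is a placeholder, not the asker's KA data). Numbers stored as text become a factor whose levels are sorted as text:

```r
# Placeholder values, stored as text the way they might arrive from a file
percentile_chr <- as.character(c(1, 2, 10, 11, 100))

f <- factor(percentile_chr)
levels(f)  # sorted as strings, so "10" and "100" typically come before "2"

# Going through character recovers the real numbers, which sort numerically
percentile_num <- as.numeric(as.character(f))
sort(percentile_num)
```

ggplot2 lays out a discrete axis in level order, which is why the text ordering shows up on the plot.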

In R, plotting wide form data with ggplot2 or base plot. Is there a way to use ggplot2 without melting wide form data frame?

I have a data frame that looks like this (though thousands of times larger).
df<-data.frame(sample(1:100,10,replace=F),sample(1:100,10,replace=F),runif(10,0,1),runif(10,0,1),runif(10,0,1), rep(c("none","summer","winter","spring","allyear"),2))
names(df)<-c("Mother","ID","Wavelength1","Wavelength2","Wavelength3","WaterTreatment")
df
Mother ID Wavelength1 Wavelength2 Wavelength3 WaterTreatment
1 2 34 0.9143670 0.03077356 0.82859497 none
2 24 75 0.6173382 0.05958151 0.66552338 summer
3 62 77 0.2655572 0.63731302 0.30267893 winter
4 30 98 0.9823510 0.45690437 0.40818031 spring
5 4 11 0.7503750 0.93737900 0.24909228 allyear
6 55 76 0.6451885 0.60138475 0.86044856 none
7 97 21 0.5711019 0.99732068 0.04706894 summer
8 87 14 0.7699293 0.81617911 0.18940531 winter
9 92 30 0.5855559 0.70152698 0.73375917 spring
10 93 44 0.1040359 0.85259166 0.37882469 allyear
I want to plot wavelength values on the y axis, and wavelength on the x. I have two ways of doing this:
First method which works, but uses base plot and requires more code than should be necessary:
colors=c("red","blue","green","orange","yellow")
plot(0,0,xlim=c(1,3),ylim=c(0,1),type="l")
for (i in 1:10) {
if (df$WaterTreatment[i]=="none"){
a<-1
} else if (df$WaterTreatment[i]=="allyear") {
a<-2
}else if (df$WaterTreatment[i]=="summer") {
a<-3
}else if (df$WaterTreatment[i]=="winter") {
a<-4
}else if (df$WaterTreatment[i]=="spring") {
a<-5
}
lines(seq(1,3,1),df[i,3:5],type="l",col=colors[a])
}
Second method: I attempt to melt the data to put it in long form, then use ggplot2. The plot it produces is not correct because there is a line for each water treatment, rather than a line for each "Mother" "ID" (the unique identifier, what were the rows in the original data frame).
require(reshape2)
require(data.table)
df_m<-melt(df,id.var=c("Mother","ID","WaterTreatment"))
df_m$variable<-as.numeric(df_m$variable) #sets wavelengths to numeric
qplot(x=df_m$variable,y=df_m$value,data=df_m,color=df_m$WaterTreatment,geom = 'line')
There is probably something simple I'm missing about ggplot2 that would fix the plotting of the lines. I'm a newbie with ggplot, but am working to get more familiar with it and would like to use it in this application.
But more broadly, is there an efficient way to plot this type of wide form data in ggplot2? The time it takes to transform/melt the data is enormous and I'm wondering if it is worth it, or if there is some kind of work around that can eliminate the redundant cells created when melting.
Thanks for your help, if you need more clarity on this question please let me know and I can edit.
I'd like to point out that you are basically re-inventing an existing base plotting function, namely matplot. This could replace your plot and for-loop:
matplot(1:3, t( df[ ,3:5] ), type="l",col=colors[ as.numeric(factor(df$WaterTreatment))] ) # factor() is needed in R >= 4.0, where strings are no longer factors by default
With that in mind you might want to search SO for: [r] matplot ggplot2, as I did, and see if this or any of the other hits are effective.
It looks like you want a separate line for each ID, but you want the lines colored based on the value of WaterTreatment. If so, you can do it like this in ggplot:
ggplot(df_m, aes(x=variable, y=value, group=ID, colour=WaterTreatment)) +
geom_line() + geom_point()
You can also use faceting to make it easier to see the different levels of WaterTreatment
ggplot(df_m, aes(x=variable, y=value, group=ID, colour=WaterTreatment)) +
geom_line() + geom_point() +
facet_grid(WaterTreatment ~ .)
To answer your general question: ggplot is set up to work most easily and powerfully with a "long" (i.e., melted) data frame. I guess you could work with a "wide" data frame and plot separate layers for each combination of factors you want to plot. But that would be a lot of extra work compared to a single melt command to get your data into the right format.
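Putting the pieces together, here is a sketch of the whole wide-to-long route with one line per original row. It uses tidyr's pivot_longer() in place of melt(), made-up data in the same shape as the question's, and a hypothetical row_id column that I add as the grouping variable in case ID alone is not unique:

```r
library(tidyr)
library(ggplot2)

set.seed(1)  # reproducible made-up data in the same shape as the question's
df <- data.frame(
  Mother = sample(1:100, 10),
  ID = sample(1:100, 10),
  Wavelength1 = runif(10),
  Wavelength2 = runif(10),
  Wavelength3 = runif(10),
  WaterTreatment = rep(c("none", "summer", "winter", "spring", "allyear"), 2)
)
df$row_id <- seq_len(nrow(df))  # hypothetical helper: one id per original row

# Wide to long: one row per (original row, wavelength) pair
df_long <- pivot_longer(
  df,
  cols = c(Wavelength1, Wavelength2, Wavelength3),
  names_to = "variable",
  values_to = "value"
)
# "Wavelength2" -> 2, so the x axis is numeric
df_long$wavelength <- as.integer(sub("Wavelength", "", df_long$variable))

# group = row_id draws one line per original row; colour only sets its colour
p <- ggplot(df_long, aes(x = wavelength, y = value,
                         group = row_id, colour = WaterTreatment)) +
  geom_line() +
  geom_point()
```

The long table has three times as many rows as the wide one here, so the cost of reshaping grows with the number of wavelength columns; whether that is acceptable for a very large data set is something you would have to measure.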
