r ggplot x axis tick marks order - r

I've seen several questions about the order of x axis marks but still none of them could solve my problem.
I'm trying to do a density plot which shows the distribution of people by percentile within each score given like this
library(dplyr); library(ggplot2); library(ggtheme)
ggplot(KA,aes(x=percentile,group=kscore,color=kscore))+
xlab('Percentil')+ ylab('Frecuencia')+ theme_tufte()+ ggtitle("Prospectos")+
scale_color_brewer(palette = "Greens")+geom_density(size=3)
but the x axis mark gets ordered like 1,10,100,11,12,..,2,20,21,..,99 instead of just 1,2,3,..,100 which is my desired output
I fear this affects the whole plot not just the labels

I'll turn my comment to an answer so this can be marked resolved:
Your x variable is (almost certainly) a factor. You probably want it to be numeric.
KA$percentile = as.numeric(as.character(KA$percentile))
When you're seeing weird stuff, it's good to check on your data. Running str(KA) is a good way to see what's there. If you just want to see classes, sapply(KA, class) is a nice summary.
And it's a common R quirk that if you're converting from factor to numeric, go by way of character or you risk ending up with just the level numbers:
year_fac = factor(1998:2002)
as.numeric(year_fac) # not so good
# [1] 1 2 3 4 5
as.numeric(as.character(year_fac)) # what you want
# [1] 1998 1999 2000 2001 2002

Related

R column dataframe names number

I have a dataframe like this
geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13
As the names of 2nd an 3rd column are considered as number I'm not able to get a bar chart wth ggplot2. Is there any solution to set the column names to be considered as text?
data
pivot_dat <- read.table(text="geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13",strin=F,h=T)
pivot_dat <- setNames(pivot_dat,c("geo","2001","2002"))
Here's how to do it :
library(ggplot2)
ggplot(pivot_dat, aes(x = geo, y = `2002`)) + geom_col()+ coord_flip()
by using ticks instead of quotes/double quotes you make sure you pass a name to the function and not a string.
If you use quotes, ggplot will convert this character value to a factor and recycle it, so all bars will have the same length of 1, and a label of value "2002".
Note 1 :
You might want to learn the difference between geom_col and geom_bar :
?ggplot2::geom_bar
In short geom_col is geom_bar with stat = "identity", which is what you want here since you want to show on your plot the raw values from your table.
Note 2:
aes_string can be used to give string instead of names but here it doesn't work as "2002" is evaluated as a number :
ggplot(pivot_dat, aes_string(x = "geo", y = "2002")) +
geom_col()+ coord_flip() # incorrect output
ggplot(pivot_dat, aes_string(x = "geo", y = "`2002`")) +
geom_col()+ coord_flip() # correct output
Without an example to see exactly what your problem is, and what you want, it is hard to give you a perfect answer. But here's the thing.
You can do a geom_bar with numeric data. There are 3 possible ways I see that you could have problems (but I may not be able to guess every way.
First, let's set up the r for plotting.
library(readr)
library(ggplot2)
test <- read_csv("geo,2001,2002
Spain,21,23
Germany,34,50
Italy,57,89
France,19,13")
Next, let's make the first mistake...incorrectly calling the column name. In the next example I will tell ggplot to make a bar of the number 2001. Not the column 2001! r has to guess whether we mean 2001 or whether we mean the object 2001. By default it always picks the number instead of the column.
ggplot(test) +
geom_bar(aes(x=2001))
Ok, that just gives you a bar at 2001...because you gave it a single number input instead of a column. Let's fix that. Use the right facing quotes `` to identify the column name 2001 instead of the number 2001.
ggplot(test) +
geom_bar(aes(x=`2001`))
This creates a perfectly workable bar chart. But maybe you don't want the spaces? That's the only possible reason you would use text instead of a number. But you want text so I'm going to show you how to use as.factor to do something similar (and more powerful).
ggplot(test) +
geom_bar(aes(x=as.factor(`2001`)))

Grouping/stacking factor levels in ggplot bar chart

I'm relatively new to R and a complete beginner with ggplot, but I haven't managed to find an answer to the seemingly simple problem I have. Using ggplot, I would like to make a bar chart in which two of three or more graphed factor levels are stacked.
Essentially, this is the type of data I am looking at:
df <- data.frame(Answer=c("good","good","kinda good","kinda good",
"kinda good","good","bad","good","bad"))
This provides me with a factor with three levels, two of which are very similar:
Answer
1 good
2 good
3 kinda good
4 kinda good
5 kinda good
6 good
7 bad
8 good
9 bad
If I let ggplot go over these data for me now,
c <- ggplot(df, aes(df$Answer))
c + geom_bar()
I will get a bar chart with three columns. However, I would like to end up with two columns, one of which should be a stack of the two factor levels "good" and "kinda good", still visibly separated.
I am working with 100 columns of input (study on orthography), which I will need to go through manually, so I would like to make the code as easily adjustable as possible. Some of them have more than ten levels, and I would need to sort them into three columns. Therefore, in most cases my data would more likely look like this:
df <- data.frame(Answer=c("good","goood","goo0d","good",
"I don't know","Bad","bad","baaad","really bad"))
I would consequently group this into three categories. In approximately half of the cases, I could probably still filter using pattern matching because I will be looking at the use of spaces. The other half, however, is looking at capitalization, which would get a little messy, or at least very tedious.
I have thought of two different approaches to solve this issue more efficiently:
Simply rewriting the factor levels, but this would result in a loss of information (and I would like to keep the two levels separate). I would like to keep the original levels names because I think I need them to graph the ratio within that stacked column and to label the column properly.
I could split the respective column/factor into two separate columns/factors and graph them next to each other, and thus create a "fake" third dimension. This is looking to be the most promising approach, but before I work through 100 columns of data with this - is there a more elegant approach, maybe within the ggplot2 package, where I could just point/group the level names instead of changing/reordering the data frame behind it?
Thanks!
You can try the following for a more automated approach in grouping the answers.
We select some keywords based on your data and loop over them to see which answers may contain each keyword
groups <- c('good','bad','ugly','know')
df <- data.frame(Answer=c("good","medium good","kinda good","still good",
"I don't know","good","bad","good","really bad"))
idx <- sapply(groups, function(x) grepl(x, df$Answer, ignore.case = TRUE))
df$group <- rep(colnames(idx), nrow(idx))[t(idx)]
df
# Answer group
# 1 good good
# 2 medium good good
# 3 kinda good good
# 4 still good good
# 5 I don't know know
# 6 good good
# 7 bad bad
# 8 good good
# 9 really bad bad
library('ggplot2')
ggplot(df, aes(group, fill = Answer)) + geom_bar()

Using abline() when x-axis is date (ie, time-series data)

I want to add multiple vertical lines to a plot.
Normally you would specify abline(v=x-intercept) but my x-axis is in the form Jan-95 - Dec-09. How would I adapt the abline code to add a vertical line for example in Feb-95?
I have tried abline(v=as.Date("Jan-95")) and other variants of this piece of code.
Following this is it possible to add multiple vertical lines with one piece of code, for example Feb-95, Feb-97 and Jan-98?
An alternate solution could be to alter my plot, I have a column with month information and a column with the year information, how do I collaborate these to have a year month on the X-axis?
example[25:30,]
Year Month YRM TBC
25 1997 1 Jan-97 136
26 1997 2 Feb-97 157
27 1997 3 Mar-97 163
28 1997 4 Apr-97 152
29 1997 5 May-97 151
30 1997 6 Jun-97 170
The first note: your YRM column is probably a factor, not a datetime object, unless you converted it manually. I assume we do not want to do that and our plot is looking fine with YRM as a factor.
In that case
vline_month <- function(s) abline(v=which(s==levels(df$YRM)))
# keep original order of levels
df$YRM <- factor(df$YRM, levels=unique(df$YRM))
plot(df$YRM, df$TBC)
vline_month(c("Jan-97", "Apr-97"))
Disclaimer: this solution is a quick hack; it is neither universal nor scalable. For accurate representation of datetime objects and extensible tools for them, see packages zoo and xts.
I see two issues:
a) converting your data to a date/POSIX element, and
b) actually plotting vertical lines at specific rows.
For the first, create a proper date string then use strptime().
The second issue is resolved by converting the POSIX date to numeric using as.numeric().
# dates need Y-M-D
example$ymd <- paste(example$Year, '-', example$Month, '-01', sep='')
# convet to POSIX date
example$ymdPX <- strptime(example$ymd, format='%Y-%m-%d')
# may want to define tz otherwise system tz is used
# plot your data
plot(example$ymdPX, example$TBC, type='b')
# add vertical lines at first and last record
abline(v=as.numeric(example$ymdPX[1]), lwd=2, col='red')
abline(v=as.numeric(example$ymdPX[nrow(example)]), lwd=2, col='red')

R histogram from already summarized count

I have a really huge file, thus I had to count frequencies for histogram generation outside the R.
Couldn't find the correct answer in already existing threads. Everything I tried led me to bar plot or failure (even R's exceptions didn't let it plot as histogram the way I tried)
file looks like (it's tab delimited):
freq cov
394104974 1
387288861 3
141169009 4
105488813 2
60039934 6
45109486 5
26318120 7
9691068 8
7532886 9
3973434 10
it has sth like 3k lines.
How can I plot this with ggplot2 as a nice histogram? (cov column holds x axis values)
Cheers,
Irek

Loop to create series of graphs from different files

I am trying to plot histograms with long term (several years) mean precipitation (pp) for each day of the month from a series of files. Each file has data collected from a different place (and has a different code). Each of my files looks like this:
X code year month day pp
1 2867 1945 1 1 0.0
2 2867 1945 1 2 0.0
...
And I am using the following code:
files <- list.files(pattern=".csv")
par(mfrow=c(4,6))
for (i in 1:24) {
obs <- read.table(files[i],sep=",", header=TRUE)
media.dia <- ddply(obs, .(day), summarise, daily.mean<-mean(pp))
codigo <- unique(obs$code)
hist(daily.mean, main=c("hist per day of month", codigo))
}
I get 24 histograms with 24 different codes in the title, but instead of 24 DIFFERENT histograms from 24 different locations, I get the same histogram 24 times (with 24 different titles). Can anybody tell me why? Thanks!
There are at least two errors I can see in your code.
There is an error in your ddply statement.
You are passing the wrong variable to hist, thus plotting something that may or may not exist depending on previous session actions.
The problem in your ddply statement is that you are doing an invalid assign (using <- ). Fix this by using =:
media.dia<- ddply(obs, .(day),summarise, daily.mean = mean(pp))
Then edit your hist statement:
hist(media.dia$daily.mean,main=c("hist per day of month",codigo))
I suspect the problem is that you are not passing the correct parameter to hist. The reason that your code actually produces a plot at all, is because in some previous step in your session you must have created a variable called daily.mean (as Brandon points out in the comment.)
I think the daily.mean calculated in the ddply function is assigned in a separate environment, and does not exist in an environment hist can see.
Try daily.mean<<-mean(pp)

Resources