I have a data frame in which two of the columns are Age and Income. I have clustered the data using k-means. Now I want to plot Income against Age, distinguishing the data points by cluster (using colours).
df
Age Income Cluster
20 10000 1
30 20000 2
40 25000 1
50 20000 2
60 10000 3
70 15000 3
.
plot(df$Age,df$Income)
I want to plot Income against Age, with each data point coloured according to its cluster.
You could use ggplot() for this:
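library(ggplot2)
ggplot(data = df) +
  geom_point(mapping = aes(x = Age, y = Income, color = factor(Cluster)))
Here the aesthetics are created from the values in the data: the x position of each point is based on Age, the y position on Income, and the colour on the variable Cluster. Wrapping Cluster in factor() makes ggplot treat the cluster labels as discrete categories rather than as a continuous scale.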
You could also do this in base R. Here's an example using the mtcars dataset:
plot(x = mtcars$wt, y = mtcars$mpg, col = mtcars$cyl)
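Applied to the question's own data, the base R approach might look like the sketch below. The colour names and the legend position are arbitrary choices made for illustration, not part of the original answer.

```r
# Data copied from the question
df <- data.frame(
  Age     = c(20, 30, 40, 50, 60, 70),
  Income  = c(10000, 20000, 25000, 20000, 10000, 15000),
  Cluster = c(1, 2, 1, 2, 3, 3)
)

# Treat the cluster label as a category, not a number
df$Cluster <- factor(df$Cluster)

# One colour per cluster level (colour names are an arbitrary choice)
cluster_cols <- c("steelblue", "darkorange", "forestgreen")
point_cols <- cluster_cols[as.integer(df$Cluster)]

plot(df$Age, df$Income, col = point_cols, pch = 19,
     xlab = "Age", ylab = "Income")
legend("topright", legend = levels(df$Cluster),
       col = cluster_cols, pch = 19, title = "Cluster")
```

Passing a numeric vector straight to col (as in the mtcars line) also works, because R looks the numbers up in the current palette; building the colour vector explicitly just makes the mapping and the legend easier to control.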
Try something like this:
library(ggplot2)
ggplot() + geom_point(data = df, aes(x = Age, y = Income, group = Cluster, color = Cluster))
I found a way using the plot function:
df
Age Income
20 10000
30 20000
40 25000
50 20000
60 10000
70 15000
clust <- kmeans(df, centers = 3) # df without the "Cluster" column from the question
# df contains only the columns Age and Income; cluster is one of the components of the kmeans object
plot(df, col = clust$cluster, las = 1, xlab = "Age", ylab = "Income")
If your data frame contains more than two columns, subset it to the two columns you want to plot.
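As a minimal sketch of the whole workflow on the question's six rows (the seed and the nstart value are arbitrary choices, not from the answer), the data frame is first subset to the two columns, then clustered, then plotted:

```r
# Data copied from the question, including the extra Cluster column
df <- data.frame(
  Age     = c(20, 30, 40, 50, 60, 70),
  Income  = c(10000, 20000, 25000, 20000, 10000, 15000),
  Cluster = c(1, 2, 1, 2, 3, 3)
)

# Keep only the two columns to cluster and plot
xy <- df[, c("Age", "Income")]

set.seed(42)  # k-means starts from random centres, so fix the seed
clust <- kmeans(xy, centers = 3, nstart = 10)

# clust$cluster holds one integer label (1, 2 or 3) per row
plot(xy, col = clust$cluster, pch = 19, las = 1,
     xlab = "Age", ylab = "Income")
```

Note that Income is on a much larger scale than Age, so it dominates the distances k-means uses here; standardising the columns first (for example with scale()) may be worth considering.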
I am looking to plot the most commonly occurring FINAL_CALL_TYPE in my dataset by BOROUGH in NYC. I have a dataset with over 3 million obs. I broke this down into a sample of 2000, and have refined it further to just the incident type and the borough it occurred in.
Essentially, I want to create a plot that visualizes the 5 most common call types in each borough, with the count of each call type in each borough.
Below is a brief look at my data with just Call Type and Borough
> head(df)
FINAL_CALL_TYPE BOROUGH
1804978 INJURY BRONX
1613888 INJMAJ BROOKLYN
294874 INJURY BROOKLYN
1028374 DRUG BROOKLYN
1974030 INJURY MANHATTAN
795815 CVAC BRONX
This shows how many unique values there are
> str(df)
'data.frame': 2000 obs. of 2 variables:
$ FINAL_CALL_TYPE: Factor w/ 139 levels "ABDPFC","ABDPFT",..: 50 48 50 34 50 25 17 138 28 28 ...
$ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 1 2 2 2 3 1 4 2 4 4 ...
This is the code that I have tried
> ggplot(df, aes(x=BOROUGH, y=FINAL_CALL_TYPE)) +
+ geom_bar(stat = 'identity') +
+ facet_grid(~BOROUGH)
and below is the result
I have tried a few suggestions from across this community, but I have not found any that show how to do this with 2 columns.
It would be much appreciated if someone knows a solution for this.
Thanks!
If I understand correctly, you can use the tidyverse to do something like:
library(tidyverse)
df <- df %>%
  group_by(BOROUGH, FINAL_CALL_TYPE) %>%
  summarise(count = n()) %>%
  top_n(n = 5, wt = count)
then plot
ggplot(df, aes(x = FINAL_CALL_TYPE, y = count)) +
  geom_col() +
  facet_wrap(~BOROUGH, scales = "free")
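The count-then-keep-top-five step can also be sketched in base R, without any packages. The data below is synthetic (random boroughs and single-letter call types invented for illustration), and the object names calls, counts and top5 are hypothetical:

```r
# Synthetic stand-in for the question's data (values are made up)
set.seed(1)
calls <- data.frame(
  BOROUGH = sample(c("BRONX", "BROOKLYN", "QUEENS"), 300, replace = TRUE),
  FINAL_CALL_TYPE = sample(LETTERS[1:8], 300, replace = TRUE)
)

# Count every BOROUGH / FINAL_CALL_TYPE combination
counts <- as.data.frame(table(calls$BOROUGH, calls$FINAL_CALL_TYPE),
                        stringsAsFactors = FALSE)
names(counts) <- c("BOROUGH", "FINAL_CALL_TYPE", "count")

# Within each borough, sort by count (largest first) and keep five rows
top5 <- do.call(rbind, lapply(split(counts, counts$BOROUGH), function(d) {
  head(d[order(-d$count), ], 5)
}))
```

Unlike top_n, which keeps all rows tied for fifth place, head() here always returns exactly five rows per borough, so ties are broken by row order.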
creating the barplot
The first part of your problem is to create the barplot. With geom_bar you only need to supply the x variable, as the y-axis is the count of observations of that variable. You can then use the facet option to separate that count into different panels for another grouping variable.
library(ggplot2)
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_grid(.~cut)
filtering to top 5 observations
The second part of your problem, limiting the data to only the top five in each group is slightly more complex. An easy way to do this is to first tally the data which will create a column n that has the count of observations. By adding the sort option we can filter the data to the first five rows in each group. tally, like summarize, automatically removes the last group.
In the ggplot call I now use geom_col instead of geom_bar and I explicitly specify that the y-variable is n (n is created by tally).
geom_bar plots the count of observations per x-variable, geom_col plots a y-variable value for each value of the x-variable.
scales = "free_x" removes values from the x-axis that are present in one cut panel but not another.
library(tidyverse)
df <- diamonds %>%
group_by(cut, color) %>%
tally(sort = TRUE) %>%
filter(row_number() <= 5)
ggplot(data = df, aes(x = color, y = n)) +
geom_col() +
facet_grid(.~cut, scales = "free_x")
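To make the geom_bar / geom_col distinction concrete, the sketch below draws the same bars both ways: once letting geom_bar count the raw rows, and once counting first and handing the result to geom_col. The object names counts, p_bar and p_col are illustrative only.

```r
library(ggplot2)

# Pre-compute the counts that geom_bar would otherwise compute itself
counts <- as.data.frame(table(color = diamonds$color))

# geom_bar: supply only x, the bar height is the number of rows
p_bar <- ggplot(diamonds, aes(x = color)) + geom_bar()

# geom_col: supply x and y, the bar height is the y value as given
p_col <- ggplot(counts, aes(x = color, y = Freq)) + geom_col()

# The heights computed by geom_bar match the pre-computed counts
all(layer_data(p_bar)$count == counts$Freq)
```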
I've tried to search for an answer, but can't seem to find the right one that does the job for me.
I have a dataset (data) with two variables: people's ages (age) and number of awards (awards)
My objective is to plot the number of awards against age in R. FYI, a person can have multiple awards and people can have the same age.
I tried to plot a histogram and barplot, but the problem with that is that it counts the number of observations instead of summing the number of awards.
A sample dataset:
age <- c(21,22,22,25,30,34,45,26,37,46,49,21)
awards <- c(0,3,2,1,0,0,1,3,1,1,1,1)
data <- data.frame(cbind(age,awards))
What I'm looking for is a histogram (or barplot) that represents this data.
Ideally, I'd want the ages to be split into age groups. For example,
20-30, 31-40, 41-50 and then the total number of awards for each group.
The age group would be on the x-axis and the total number of awards for each age group would be on the y-axis.
Thanks!
We can use the aggregate function and then use the ggplot2 package. I don't make too many barplots in base R these days so I'm not sure of the best way to do it without loading ggplot2:
create sample data
#data
set.seed(123)
dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
head(dat)
age awards
1 28 2
2 44 6
3 32 3
4 47 3
5 49 2
6 21 5
By age
#aggregate
sum_by_age <- aggregate(awards ~ age, data = dat, FUN = sum)
library(ggplot2)
ggplot(sum_by_age, aes(x = age, y = awards))+
geom_bar(stat = 'identity')
By age group
#create groups
dat$age_group <- ifelse(dat$age <= 30, '20-30',
ifelse(dat$age <= 40, '31-40',
'41 +'))
sum_by_age_group <- aggregate(awards ~ age_group, data = dat, FUN = sum)
ggplot(sum_by_age_group, aes(x = age_group, y = awards))+
geom_bar(stat = 'identity')
Note
We could skip the aggregate step altogether and just use:
ggplot(dat, aes(x = age, y = awards)) + geom_bar(stat = 'identity')
but I prefer not to do it that way, because I think having an intermediate data step may be useful within your analytical pipeline for comparisons other than visualizing.
For completeness, I am adding the base R solution to @bouncyball's great answer. I will use their synthetic data, but I will use cut to create the age groups before aggregation.
# Creates data for plotting
> set.seed(123)
> dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
# Created a new column containing the age groups
> dat[["ageGroups"]] <- cut(dat[["age"]], c(-Inf, 20, 30, 40, Inf),
right = FALSE)
cut divides a numeric vector into groups based on the breaks given in the second argument. right = FALSE makes each interval closed on the left and open on the right, so a group includes its lower bound rather than its upper one (i.e. 20 <= x < 30 rather than the default of 20 < x <= 30). The groups do not have to be equally spaced. If you do not want to include data above or below a certain value, simply remove the Inf from the end or the -Inf from the beginning, respectively, and the function will return <NA> for those values instead. If you would like to give your groups names, you can do so with the labels argument.
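A small sketch of how right changes the treatment of values that sit exactly on a break (the three test values are chosen for illustration):

```r
x <- c(20, 25, 30)

# Default (right = TRUE): intervals are (20,30] and (30,40],
# so 20 falls outside the lowest interval and becomes NA
default_groups <- as.character(cut(x, c(20, 30, 40)))
default_groups
# NA "(20,30]" "(20,30]"

# right = FALSE: intervals are [20,30) and [30,40),
# so 20 is included and 30 moves up to the next group
left_closed_groups <- as.character(cut(x, c(20, 30, 40), right = FALSE))
left_closed_groups
# "[20,30)" "[20,30)" "[30,40)"
```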
Now we can aggregate based on the groups we created.
> (summedGroups <- aggregate(awards ~ ageGroups, dat, FUN = sum))
ageGroups awards
1 [20,30) 188
2 [30,40) 212
3 [40, Inf) 194
Finally, we can plot these data using the barplot function. The key here is to use names for the age groups.
> barplot(summedGroups[["awards"]], names = summedGroups[["ageGroups"]])
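For the exact groups named in the question (20-30, 31-40, 41-50), a sketch on the question's own twelve rows could use cut with the labels argument. The break at 19 is an assumption that works because the ages are whole numbers:

```r
# Sample data from the question
age    <- c(21, 22, 22, 25, 30, 34, 45, 26, 37, 46, 49, 21)
awards <- c(0, 3, 2, 1, 0, 0, 1, 3, 1, 1, 1, 1)
dat <- data.frame(age, awards)

# Intervals are (19,30], (30,40], (40,50] with the default right = TRUE
dat$age_group <- cut(dat$age, breaks = c(19, 30, 40, 50),
                     labels = c("20-30", "31-40", "41-50"))

# Sum the awards within each age group
totals <- aggregate(awards ~ age_group, data = dat, FUN = sum)
totals
#   age_group awards
# 1     20-30     10
# 2     31-40      1
# 3     41-50      3

barplot(totals$awards, names.arg = as.character(totals$age_group),
        xlab = "Age group", ylab = "Total awards")
```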
I have a dataset with 34 columns and 600+ rows.
I successfully managed to reshape it for my data to be predicted for 5 columns (5 years) using reshape2
Dataset_name <- melt(data=XYZ, id.vars=c("A", "B", "C",.... {so on minus 5 columns}))
Now I have the reshaped data and have plotted the graph, but since it has 600+ points in each column, I can't make sense of it.
Is it possible to plot rows 1 to 50 in one graph, rows 51 to 100 in another, and so on?
Also, I want to connect the dots to see whether they varied over the years.
Thanks.
You can assign row numbers to groups (the first 50 designated as 1, the second 50 as 2, ...) and use that variable in facet_wrap. Each facet would thus hold 50 data points. Here's an example on the iris dataset, which ships with R.
library(ggplot2)
nrow(iris) # 150, let's do 50 obs. per facet
iris <- iris[sample(1:nrow(iris)), ] # shuffle the dataset
iris$desig <- rep(c("set1", "set2", "set3"), each = 50)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
theme_bw() +
geom_point() +
facet_wrap(~ desig)
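When the number of rows is not an exact multiple of 50, the group label can be computed from the row index instead of being typed out with rep. The sketch below assumes a hypothetical data frame d with 120 rows; the column names row_id, value and chunk are made up for illustration:

```r
# Hypothetical data frame standing in for the real dataset
set.seed(1)
d <- data.frame(row_id = 1:120, value = rnorm(120))

# Rows 1-50 get chunk 1, rows 51-100 get chunk 2, and so on
chunk_size <- 50
d$chunk <- ceiling(seq_len(nrow(d)) / chunk_size)

table(d$chunk)
#  1  2  3
# 50 50 20
```

The chunk column can then be used in facet_wrap(~ chunk) in the same way as desig above. To connect the dots across years, adding geom_line() with a group aesthetic that identifies each original row is one option.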
I'm trying to create simple line charts with R that connect data points showing the averages of groups of respondents (it would also be nice to label them or distinguish them with different colours, etc.).
My data is in long format and sorted as shown below (I also have it in wide format, if that's of any value):
ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
...
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
...
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA
...
Basically, every respondent was measured a total of n times and the occasions (week) were the same for everyone. Some respondents were missing on one or more occasions. Take motivation as the example: variables like gender, class and ID don't change, but motivation does.
I tried to get a line chart using ggplot2
## define base for the graphs and store in object 'plot'
plot <- ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender))
plot + geom_line()
As grouping variable, I want to use class or gender for example.
However, my approach does not lead to lines that connect the averages per group.
I also get vertical lines for each measurement occasion. What does this mean? The only way I could imagine fixing this is to create a new variable average.motivation, compute the average for every group per occasion, and then assign this average to all members of the group. However, this would mean that I would have to do this for every single grouping variable when I want to display group lines based on another factor.
Also, how does the plot handle missing data? (If one member of a group has a missing value, I still want the group average of this occasion to calculate the point rather than omitting the whole occasion for that group ).
Edit:
Thank you, the solution with dplyr works great for all my categorical variables.
Now, I'm trying to figure out how I can distinguish between subgroups by colouring their lines based on a second/third factor.
For example, I plot 20 lines for the groups of "class2", but rather than having all of them in 20 different colors, I would like them to use the same colour, if they belong to the same type of class ("class_type", e.g. A, B or C =20 lines, three groups of colours).
I've added the second factor to "mean_data2". That works well. Next, I've tried to change the colour argument in ggplot, (also tried as in geom_line), but that way, I don't have 20 lines anymore.
mean_data2 <- group_by(DataRlong, class2, class_type, occ)%>%
summarise(procras = mean(procras, na.rm = TRUE))
library(ggplot2)
ggplot(na.omit(mean_data2), aes(x = occ, y = procras, colour = class2)) +
  geom_point() + geom_line(aes(colour = class_type))
You can also use the dplyr package to aggregate the data:
library(dplyr)
mean_data <- group_by(data, gender, week) %>%
summarise(motivation = mean(motivation, na.rm = TRUE))
You can use na.omit() to get rid of the NA values as follows:
library(ggplot2)
ggplot(na.omit(mean_data), aes(x = week, y = motivation, colour = gender)) +
geom_point() + geom_line()
There is no need here to explicitly use the group aesthetic because ggplot will automatically group the lines by the categorical variables in your plot. And the only categorical variable you have is gender. (See this answer for more information).
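For the follow-up in the question's edit (one line per class2, but coloured by class_type), automatic grouping is no longer enough, because colour alone would merge all classes of the same type into a single line. One possible approach is to set group explicitly and map colour to the second factor. The data below is a synthetic stand-in for mean_data2, with six classes instead of twenty and random values:

```r
library(ggplot2)

# Synthetic stand-in for mean_data2: 6 classes of 3 types, 3 occasions each
mean_data2 <- expand.grid(class2 = paste0("class", 1:6), occ = c(0, 6, 10),
                          stringsAsFactors = FALSE)
mean_data2$class_type <- rep(c("A", "A", "B", "B", "C", "C"), times = 3)
set.seed(1)
mean_data2$procras <- round(rnorm(nrow(mean_data2), mean = 50, sd = 5), 1)

# group decides which points are joined into a line;
# colour decides only how each line is drawn
p <- ggplot(mean_data2, aes(x = occ, y = procras,
                            group = class2, colour = class_type)) +
  geom_point() +
  geom_line()
p
```

This gives six lines in three colours here; with the real data it should give twenty lines in as many colours as there are class types.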
Another possibility is using stat_summary, so you can do it only with ggplot.
ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender)) +
  stat_summary(geom = "line", fun = mean) # use fun.y = mean with ggplot2 versions before 3.3.0
You almost certainly have to make sure those grouping variables are factors.
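A runnable sketch of the stat_summary approach on the nine sample rows from the question follows. It assumes ggplot2 3.3.0 or later, where the argument is called fun; the added point layer and the colour mapping are additions for illustration:

```r
library(ggplot2)

# The sample rows from the question
DataRlong <- data.frame(
  ID         = rep(1:3, each = 3),
  gender     = rep(c("male", "female", "male"), each = 3),
  week       = rep(c(0, 6, 10), times = 3),
  class      = rep(c(1, 1, 2), each = 3),
  motivation = c(100, 120, 130, 90, NA, 117, 89, 112, NA)
)
DataRlong$gender <- factor(DataRlong$gender)

# Rows with a missing motivation are dropped before averaging, so each
# mean is taken over the respondents who were present on that occasion
p <- ggplot(DataRlong, aes(x = week, y = motivation,
                           group = gender, colour = gender)) +
  stat_summary(geom = "line", fun = mean, na.rm = TRUE) +
  stat_summary(geom = "point", fun = mean, na.rm = TRUE)
p
```

In this sample the only female respondent is missing at week 6, so the female line runs directly from week 0 to week 10, while the male line uses the single available value at week 10.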
I'm not quite sure what you want, but here's a shot...
library("ggplot2")
df <- read.table(textConnection("ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA"), header=TRUE, stringsAsFactors=FALSE)
df2 <- aggregate(df$motivation, by=list(df$gender, df$week),
function(x)mean(x, na.rm=TRUE))
names(df2) <- c("gender", "week", "avg")
df2$gender <- factor(df2$gender)
ggplot(data = df2[!is.na(df2$avg), ],
aes(x = week, y = avg, group=gender, color=gender)) +
geom_point()+geom_line()