I would like to represent a sequence of categorical states with different colours.
This kind of plot is also known as an individual sequence plot (TraMineR).
I would like to use ggplot2.
My data look like this:
> head(dta)
  V1 V2 V3 V4 V5 id
1  b  a  e  d  c  1
2  d  b  a  e  c  2
3  b  c  a  e  d  3
4  c  b  a  e  d  4
5  b  c  e  a  d  5
with the personal id in the last column.
The plot looks like this.
Each letter (state) is represented by a colour. Basically, this plot visualises the successive states of each individual.
Blue is a, Red is b, Purple is c, Yellow is d and Brown is e.
Any idea how I could do this with ggplot2?
dta = structure(list(V1 = structure(c(1L, 3L, 1L, 2L, 1L), .Label = c("b",
"c", "d"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), V3 = structure(c(2L,
1L, 1L, 1L, 2L), .Label = c("a", "e"), class = "factor"), V4 = structure(c(2L,
3L, 3L, 3L, 1L), .Label = c("a", "d", "e"), class = "factor"),
V5 = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("c", "d"
), class = "factor"), id = 1:5), .Names = c("V1", "V2", "V3",
"V4", "V5", "id"), row.names = c(NA, -5L), class = "data.frame")
What I have tried so far:
nr = nrow(dta3)
nc = ncol(dta3)
# space
m = 0.8
n = 1 # do not touch this one
plot(0, xlim = c(1,nc*n), ylim = c(1, nr), type = 'n', axes = F, ylab = 'individual sequences', xlab = 'Time')
axis(1, at = c(1:nc*m), labels = c(1:nc))
axis(2, at = c(1:nr), labels = c(1:nr) )
for(i in 1:nc){
points(x = rep(i*m,nr) , y = 1:nr, col = dta3[,i], pch = 15)
}
But it is not with ggplot2 and not very satisfying.
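Here you go:
library(reshape2)
library(ggplot2)
# reshape to long format: one row per (id, time point) pair
m_dta <- melt(dta, id.vars = "id")
m_dta
# one tile per individual and time point, coloured by state
p1 <- ggplot(m_dta, aes(x = variable, y = id, fill = value)) +
  geom_tile()
p1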
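To get the exact colours listed in the question (blue for a, red for b, purple for c, yellow for d, brown for e), a manual fill scale can be added. This is only a sketch: the data frame is retyped from head(dta), the reshaping is done in base R instead of melt(), and the names time_cols, long and state_cols are my own.

```r
library(ggplot2)

# Data retyped from head(dta) in the question
dta <- data.frame(
  V1 = c("b", "d", "b", "c", "b"),
  V2 = c("a", "b", "c", "b", "c"),
  V3 = c("e", "a", "a", "a", "e"),
  V4 = c("d", "e", "e", "e", "a"),
  V5 = c("c", "c", "d", "d", "d"),
  id = 1:5,
  stringsAsFactors = FALSE
)

# Long format: one row per (id, time point) pair
time_cols <- paste0("V", 1:5)
long <- data.frame(
  id    = rep(dta$id, times = length(time_cols)),
  time  = factor(rep(time_cols, each = nrow(dta)), levels = time_cols),
  state = unlist(dta[time_cols], use.names = FALSE)
)

# Named vector: each state is tied to a fixed colour
state_cols <- c(a = "blue", b = "red", c = "purple", d = "yellow", e = "brown")

p <- ggplot(long, aes(x = time, y = factor(id), fill = state)) +
  geom_tile(colour = "white") +
  scale_fill_manual(values = state_cols) +
  labs(x = "Time", y = "Individual sequences", fill = "State")
print(p)
```

Because state_cols is a named vector, the colour of a state does not depend on the order in which the states appear in the data.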
I have a fairly large list that looks something like this, where the first two variables are stored as factors:
Product Vendor Sales Product sales share
a x 100
b y 200
a y 250
c y 700
a z 150
Ideally, I'd like to create a new column containing the vendor's share of that product's total sales, i.e. Share_{p=a,v=x} = 100/(100+250+150) = 0.2.
I figure lapply() would be viable, but I am not sure how to write the function.
> dput(list)
list(structure(list(Product = structure(c(1L, 2L, 1L, 3L, 1L), .Label = c("a",
"b", "c"), class = "factor"), Vendor = structure(c(1L, 2L, 2L,
2L, 3L), .Label = c("x", "y", "z"), class = "factor"), Sales = c(100,
200, 250, 700, 150)), class = "data.frame", row.names = c(NA,
-5L)))
Using the dplyr package, you could calculate the total sales for each product, then calculate each vendor's share as its own sales divided by that total.
library(dplyr)
df %>%
  group_by(Product) %>%
  mutate(Total_Sales = sum(Sales),
         Vendor_Share = Sales / Total_Sales)
A base R approach could use prop.table as an alternative:
df$Vendor_Share <- with(df, ave(Sales, Product, FUN = prop.table))
Output
Product Vendor Sales Vendor_Share
1 a x 100 0.2
2 b y 200 1.0
3 a y 250 0.5
4 c y 700 1.0
5 a z 150 0.3
Data
df <- structure(list(Product = structure(c(1L, 2L, 1L, 3L, 1L), .Label = c("a",
"b", "c"), class = "factor"), Vendor = structure(c(1L, 2L, 2L,
2L, 3L), .Label = c("x", "y", "z"), class = "factor"), Sales = c(100,
200, 250, 700, 150), Vendor_Share = c(0.2, 1, 0.5, 1, 0.3)), row.names = c(NA,
-5L), class = "data.frame")
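The dput() in the question shows a list that contains the data frame, so the same calculation can be wrapped in a function and applied with lapply(), as the question suggests. A base R sketch; sales_list and add_share are hypothetical names, and the list is retyped from the question.

```r
# A list holding one data frame, retyped from the question's dput(list)
sales_list <- list(
  data.frame(
    Product = factor(c("a", "b", "a", "c", "a")),
    Vendor  = factor(c("x", "y", "y", "y", "z")),
    Sales   = c(100, 200, 250, 700, 150)
  )
)

# Add each vendor's share of its product's total sales
add_share <- function(d) {
  total <- ave(d$Sales, d$Product, FUN = sum)  # per-product total, repeated per row
  d$Vendor_Share <- d$Sales / total
  d
}

result <- lapply(sales_list, add_share)
result[[1]]
```

If the list only ever holds a single data frame, add_share(sales_list[[1]]) does the same job without lapply().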
My two dataframes are:
df1<-structure(list(header1 = structure(1:4, .Label = c("a", "b",
"c", "d"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
and
df2<-structure(list(sample_x = structure(c(1L, 1L, 2L, 3L), .Label = c("0",
"a", "c"), class = "factor"), sample_y = structure(c(1L, 3L,
2L, 4L), .Label = c("0", "a", "m", "t"), class = "factor"), sample_z = structure(c(3L,
2L, 1L, 1L), .Label = c("0", "a", "c"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
A 0 in df2 means "no value".
Now I want to overlap df1 and df2 to make an output data frame (df3):
df3<-structure(list(sample_x = c(2L, 2L, 0L), sample_y = c(1L, 3L,
2L), sample_z = c(2L, 2L, 0L)), class = "data.frame", row.names = c("overlap_df1_df2",
"unique_df1", "unique_df2"))
I tried the data.table function foverlaps:
setkeyv(df1, names(df1))
setkeyv(df2, names(df2))
df3<-foverlaps(df1,df2)
But it seems that I need some common column names in the two data frames, which is obviously not the case here.
Thank you!
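Loop through the columns and use set operations, treating "0" as "no value":
sapply(df2, function(i){
  h = as.character(df1$header1)
  x = as.character(i[ i != "0" ])   # drop the "0" placeholders
  o = intersect(h, x)               # values present in both
  u_df1 = setdiff(h, o)             # values only in df1
  u_df2 = setdiff(x, o)             # values only in this df2 column
  c(o = length(o),
    u_df1 = length(u_df1),
    u_df2 = length(u_df2))
})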
# sample_x sample_y sample_z
# o 2 1 2
# u_df1 2 3 2
# u_df2 0 2 0
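A solution using purrr's map_dbl:
library(purrr)
h <- as.character(df1$header1)
rbind(
  overlap    = map_dbl(df2, ~length(intersect(h, as.character(.x)))),
  unique_df1 = map_dbl(df2, ~length(setdiff(h, as.character(.x)))),
  # values of df2 (ignoring the "0" placeholders) that are absent from df1
  unique_df2 = map_dbl(df2, ~length(setdiff(as.character(.x[.x != "0"]), h)))
)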
sample_x sample_y sample_z
overlap 2 1 2
unique_df1 2 3 2
unique_df2 0 2 0
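Both answers above return a matrix. If a data frame shaped like the requested df3 (same row names) is needed, one possible base R sketch follows. The helper name count_overlap is my own, and the inputs are retyped from the question as plain character columns.

```r
# Inputs retyped from the question; "0" stands for "no value"
df1 <- data.frame(header1 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(
  sample_x = c("0", "0", "a", "c"),
  sample_y = c("0", "m", "a", "t"),
  sample_z = c("c", "a", "0", "0"),
  stringsAsFactors = FALSE
)

# Count shared and unique values between one df2 column and the reference
count_overlap <- function(col, ref) {
  vals <- unique(as.character(col))
  vals <- vals[vals != "0"]            # ignore the placeholder
  ref  <- unique(as.character(ref))
  c(overlap_df1_df2 = length(intersect(ref, vals)),
    unique_df1      = length(setdiff(ref, vals)),
    unique_df2      = length(setdiff(vals, ref)))
}

# vapply() returns a 3 x ncol(df2) integer matrix; convert it to a data frame
df3 <- as.data.frame(vapply(df2, count_overlap, integer(3), ref = df1$header1))
df3
```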
I have five factor columns with 2 levels each, and their column names are like c(a,b,x,y,z). The commands below work for one column at a time, but I need to do it for all five columns at once.
levels(car_data[,"x"]) <- c(0,1)
car_data[,"x"] <- as.numeric(levels(car_data[,"x"]))[car_data[,"x"]]
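If there are two levels, then we can do
library(dplyr)
# across() replaces the deprecated mutate_all(funs(...)); it needs dplyr >= 1.0.0
car_data %>%
  mutate(across(everything(), ~ as.integer(.x) - 1L))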
# a b c
#1 0 0 0
#2 1 1 1
#3 0 0 0
#4 1 1 1
data
car_data <- structure(list(a = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), b = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), c = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), .Names = c("a", "b", "c"), row.names = c(NA,
-4L), class = "data.frame")
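If only some of the columns should be converted (the question names a, b, x, y and z), a base R sketch is to index those columns by name and assign the result of lapply() back. The data frame below is made up for illustration, and cols is a hypothetical name.

```r
# Made-up example: three two-level factors plus a numeric column to leave alone
car_data <- data.frame(
  a     = factor(c("no", "yes", "no", "yes")),
  b     = factor(c("low", "high", "high", "low")),
  x     = factor(c("m", "f", "m", "f")),
  price = c(10, 20, 30, 40)
)

# Columns to recode; the first level becomes 0, the second becomes 1
cols <- c("a", "b", "x")
car_data[cols] <- lapply(car_data[cols], function(f) as.integer(f) - 1L)

str(car_data)
```

Which label becomes 0 depends on the level order of each factor (alphabetical by default), so it is worth checking levels() before recoding.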
I am trying to read the following .csv file into R. As you can see from the image below, Row 2 has the unique variable names, while Row 3 has the values for those variables, so Rows 2/3 together represent one observation. The pattern repeats: Row 4 is a line of variable names and Row 5 holds the corresponding values, and so on, so that each two-row pair (2/3, 4/5, 6/7, ..., 999/1000) represents one observation. There are 1,000 observations in total in the data set.
What I am having trouble with is reading this into R so that I have a more usable dataset. My goal is to have a standard set of variable names across the top row, and each subsequent line representing one observation.
Any expert R coders have suggestions?
Thank you,
Here is a solution that worked on a simple test case I made. You'd need to import your data into a data.frame without treating any row as a header, e.g. x = read.csv(file="your-file.csv", header=FALSE)
To test this, though, I used the test data.frame x:
x=structure(list(V1 = structure(c(2L, 1L, 4L, 3L), .Label = c("1",
"a", "ab", "h"), class = "factor"), V2 = structure(c(2L, 1L,
4L, 3L), .Label = c("2", "b", "cd", "i"), class = "factor"),
V3 = structure(c(3L, 1L, 2L, 4L), .Label = c("3", "a", "c",
"ef"), class = "factor"), V4 = structure(c(3L, 1L, 2L, 4L
), .Label = c("4", "b", "d", "gh"), class = "factor"), V5 = structure(c(3L,
1L, 2L, 4L), .Label = c("5", "c", "e", "ij"), class = "factor"),
V6 = structure(c(3L, 1L, 2L, 4L), .Label = c("6", "d", "f",
"kl"), class = "factor"), V7 = structure(c(3L, 1L, 2L, 4L
), .Label = c("7", "e", "g", "mno"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7"), class = "data.frame", row.names = c(NA,
-4L))
Which turns this table (rows 1 and 3 are your labels):
V1 V2 V3 V4 V5 V6 V7
1 a b c d e f g
2 1 2 3 4 5 6 7
3 h i a b c d e
4 ab cd ef gh ij kl mno
Using this script to generate a final data.frame dat:
library(plyr)
variables = x[seq(1,nrow(x),2),] #df of all variable rows
values = x[seq(2,nrow(x),2),] #df of all value rows
dat=data.frame() #generate blank data.frame
for(i in 1:nrow(variables)) {
  dat.temp=data.frame(values[i,]) #make temporary df for the row i of your values
  colnames(dat.temp)=as.matrix(variables[i,]) #name the temporary df from row i of your variables
  print(dat.temp) #check that they are coming out right (comment this out as necessary)
  dat=rbind.fill(dat,dat.temp) #create the final data.frame
  rm(dat.temp) #remove the temporary df
}
Into this final table (variables are the column names now):
a b c d e f g h i
1 1 2 3 4 5 6 7 <NA> <NA>
2 ef gh ij kl mno <NA> <NA> ab cd
Hope it works.
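For 1,000 observations, growing dat inside a loop can get slow. A possible base R alternative without plyr is sketched below, assuming that names and values strictly alternate from the first row. The inline CSV text and the names name_rows, value_rows and all_names are made up for illustration.

```r
# Made-up input: odd rows hold variable names, even rows hold their values
x <- read.csv(text = "a,b,c\n1,2,3\nc,a,d\n4,5,6",
              header = FALSE, stringsAsFactors = FALSE)

name_rows  <- x[seq(1, nrow(x), by = 2), ]
value_rows <- x[seq(2, nrow(x), by = 2), ]

# Every variable name that occurs, in order of first appearance (row by row)
all_names <- unique(as.vector(t(name_rows)))

# One character vector per observation, aligned to all_names (NA where absent)
dat <- t(vapply(seq_len(nrow(name_rows)), function(i) {
  vals <- as.character(unlist(value_rows[i, ], use.names = FALSE))
  names(vals) <- as.character(unlist(name_rows[i, ], use.names = FALSE))
  vals[all_names]
}, character(length(all_names))))
colnames(dat) <- all_names

dat <- as.data.frame(dat, stringsAsFactors = FALSE)
# Convert columns that contain only numbers back to numeric types
dat[] <- lapply(dat, type.convert, as.is = TRUE)
dat
```

If the real file has an extra first line before the name/value pairs, it would need to be skipped first, for example with the skip argument of read.csv().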
I have the following data and code:
dd
grp categ condition value
1 A X P 2
2 B X P 5
3 A Y P 9
4 B Y P 6
5 A X Q 4
6 B X Q 5
7 A Y Q 8
8 B Y Q 2
dput(dd)
structure(list(grp = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("A", "B"), class = "factor"), categ = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("X", "Y"), class = "factor"),
condition = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("P",
"Q"), class = "factor"), value = c(2, 5, 9, 6, 4, 5, 8, 2
)), .Names = c("grp", "categ", "condition", "value"), out.attrs = structure(list(
dim = structure(c(2L, 2L, 2L), .Names = c("grp", "categ",
"condition")), dimnames = structure(list(grp = c("grp=A",
"grp=B"), categ = c("categ=X", "categ=Y"), condition = c("condition=P",
"condition=Q")), .Names = c("grp", "categ", "condition"))), .Names = c("dim",
"dimnames")), row.names = c(NA, -8L), class = "data.frame")
ggplot(dd, aes(grp,value, fill=condition))+geom_bar(stat='identity')+facet_grid(~categ)
How can I convert this bar chart to pie charts? I want 4 pies here, with their sizes corresponding to the heights of the respective bars. I tried the following, but neither worked:
ggplot(dd, aes(grp,value, fill=condition))+geom_bar(stat='identity')+facet_grid(~categ)+coord_polar()
ggplot(dd, aes(grp,value, fill=condition))+geom_bar(stat='identity')+facet_grid(~categ)+coord_polar('y')
I also tried to make pie charts similar to "Pie charts in ggplot2 with variable pie sizes", but I could not get it to work with my data. Thanks for your help.
Using the same idea as in the link you posted, you could add a column size to your data frame that holds the sum of the values for each group, and use that as the width aesthetic:
library(dplyr)
dd <- dd %>% group_by(categ, grp) %>% mutate(size = sum(value))
ggplot(dd, aes(x = size/2, y = value, fill = condition, width = size)) +
  geom_bar(position = "fill", stat = 'identity') +
  facet_grid(grp ~ categ) +
  coord_polar("y")
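To see what the size column holds before plotting, the same grouping can be reproduced in base R. A small sketch with dd retyped from the question; the share column is my own addition for illustration and is not used by the plot code.

```r
# dd retyped from the question
dd <- data.frame(
  grp       = rep(c("A", "B"), times = 4),
  categ     = rep(rep(c("X", "Y"), each = 2), times = 2),
  condition = rep(c("P", "Q"), each = 4),
  value     = c(2, 5, 9, 6, 4, 5, 8, 2)
)

# Total value per (categ, grp) pair, i.e. the height of each stacked bar
dd$size <- ave(dd$value, dd$categ, dd$grp, FUN = sum)

# Fraction of each pie taken by a condition
dd$share <- dd$value / dd$size

unique(dd[c("grp", "categ", "size")])
```

The four totals are 6 (A, X), 10 (B, X), 17 (A, Y) and 8 (B, Y), which is what the pie widths are scaled by.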
You want both the group and the category to be facet variables for the grid, not aesthetics inside any single plot. Here are two different layouts. x should be a single constant value, such as factor(1), a string, or anything else, so that each panel contains one stacked bar.
ggplot(dd, aes(x=factor(1),y=value,
fill=condition))+geom_bar(stat='identity')+
facet_grid(~grp+categ)+coord_polar("x")
ggplot(dd, aes(x=factor(1),y=value,
fill=condition))+geom_bar(stat='identity')+
facet_grid(grp~categ)+coord_polar("x")
Something strange happened with the opening at the top here; maybe it's just my interface. It should be enough to get you going, though!