Accessing values by row name and column name instead of numbers


I have a table with multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs     Exam1  Exam2  Exam3  Exam4  ....  Exam100
Ellie   12            48     33           64
Kate    98     34     21     76
Joe     22     53     49                  72
Van     77            40     12
Xavier                       88           92
What I want is to be able to reach the marks for a given row (IDs) and a given column (exams), as in:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of how Ellie did, comparing Exam2, Exam3 and Exam100 with the rest of the exams.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selected values, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should plot Ellie's marks for the exams of interest against the rest of her marks (a blank should not be treated as zero; it stays blank).
Red: marks for Exam3, Exam100 and Exam2. Blue: marks for the remaining 97 exams.
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!

To access your data, you can at least do the following:
library(dplyr) # for %>% and select()
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
> df['Joe','Exam2']
[1] 53
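As a small sketch of the same name-based lookup, including how an empty cell behaves, the following uses a cut-down copy of the question's table. The object name `marks` and the other variable names are only illustrative. A blank mark is stored as NA, not zero, and several columns can be requested at once.

```r
# Cut-down copy of the question's table; NA marks an exam with no score
marks <- data.frame(
  Exam1 = c(12, 98, 22, 77, NA),
  Exam2 = c(NA, 34, 53, NA, NA),
  Exam3 = c(48, 21, 49, 40, NA),
  row.names = c("Ellie", "Kate", "Joe", "Van", "Xavier")
)

# Single cell, addressed by row name and column name
ellie_exam3 <- marks["Ellie", "Exam3"]

# An empty cell comes back as NA rather than 0
ellie_exam2_missing <- is.na(marks["Ellie", "Exam2"])

# Several exams at once, flattened to a named numeric vector
ellie_selected <- unlist(marks["Ellie", c("Exam1", "Exam3")])
```

Because the values are looked up by name, the code keeps working if columns are reordered or new exams are added.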
Now I have prepared an example with randomly generated numbers to illustrate what you could do. First, let us create an example data frame:
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot it is recommended to convert the data to long format:
library(tidyr) # for gather()
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exam2, Exam3 and Exam100 and plotting the density per exam:
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
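The chained `==` comparisons can also be written with `%in%`. The base R sketch below uses a toy long-format table with made-up scores, and the names `long`, `wanted` and `group` are assumptions for illustration. It shows both filtering the rows and flagging them, which keeps the remaining exams available for a side-by-side plot.

```r
# Toy long-format data (hypothetical values, same column names as df0)
long <- data.frame(
  IDs = rep(c("Ellie", "Kate"), each = 4),
  exams = rep(c("Exam1", "Exam2", "Exam3", "Exam100"), times = 2),
  score = c(12, NA, 48, 64, 98, 34, 21, NA)
)

wanted <- c("Exam2", "Exam3", "Exam100")

# Keep only the exams of interest
selected <- long[long$exams %in% wanted, ]

# Or flag them instead of filtering, so both groups stay in one table
long$group <- ifelse(long$exams %in% wanted, "selected", "rest")
```

The `group` column can then be mapped to `colour` in `aes()` to draw the selected exams and the rest as two density curves.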

IIUC, you want to plot, for each ID, the selected exams against all other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs, then build the aggregated mean data and the ggplots.
Within by(), create a SelectValues indicator column for the selected exams, then graph with a vertical line at each group mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
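The question asks that a blank not be counted as zero. One possible variation, sketched here under that assumption, is to skip the "FILL NA" step and compute the group means with `na.rm = TRUE`. The data below is Ellie's row only, and the object names are illustrative.

```r
# Ellie's marks in long format, with the missing Exam2 left as NA
scores <- data.frame(
  Exam = c("Exam1", "Exam2", "Exam3", "Exam4", "Exam100"),
  Score = c(12, NA, 48, 33, 64)
)

# Same indicator idea as in the by() call, using the question's exams of interest
scores$SelectValues <- ifelse(
  scores$Exam %in% c("Exam2", "Exam3", "Exam100"),
  "Select Exams", "All Else"
)

# Group means that ignore missing marks instead of counting them as 0
group_means <- tapply(scores$Score, scores$SelectValues, mean, na.rm = TRUE)
```

With the NA values kept, `geom_density()` drops those rows and issues a warning, so the curves are based only on exams that actually have a mark.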

Related

Plot Column of Dataframe as Y, Another Column as X, and Group by Another Column in ggplot in R [duplicate]

I have a survey file in which rows are observations and columns are questions.
Here is some fake data showing what it looks like:
People,Food,Music,People
P1,Very Bad,Bad,Good
P2,Good,Good,Very Bad
P3,Good,Bad,Good
P4,Good,Very Bad,Very Good
P5,Bad,Good,Very Good
P6,Bad,Good,Very Good
My aim is to create this kind of plot with ggplot2.
I absolutely don't care about the colors, design, etc.
The plot doesn't correspond to the fake data
Here are my fake data:
raw <- read.csv("http://pastebin.com/raw.php?i=L8cEKcxS",sep=",")
raw[,2]<-factor(raw[,2],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,3]<-factor(raw[,3],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,4]<-factor(raw[,4],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
But if I choose Y as the count, then I face an issue choosing the X and the Group values... I don't know if I can succeed without using reshape2... I've also tried to use reshape with the melt function, but I don't understand how to use it...

EDIT: Many years later
For a pure ggplot2 + utils::stack() solution, see the answer by @markus!
A somewhat verbose tidyverse solution, with all non-base packages explicitly stated so that you know where each function comes from:
library(magrittr) # needed for %>% if dplyr is not attached
"http://pastebin.com/raw.php?i=L8cEKcxS" %>%
utils::read.csv(sep = ",") %>%
tidyr::pivot_longer(cols = c(Food, Music, People.1),
names_to = "variable",
values_to = "value") %>%
dplyr::group_by(variable, value) %>%
dplyr::summarise(n = dplyr::n()) %>%
dplyr::mutate(value = factor(
value,
levels = c("Very Bad", "Bad", "Good", "Very Good"))
) %>%
ggplot2::ggplot(ggplot2::aes(variable, n)) +
ggplot2::geom_bar(ggplot2::aes(fill = value),
position = "dodge",
stat = "identity")
The original answer:
First you need to get the counts for each category, i.e. how many Bads and Goods and so on are there for each group (Food, Music, People). This would be done like so:
raw <- read.csv("http://pastebin.com/raw.php?i=L8cEKcxS",sep=",")
raw[,2]<-factor(raw[,2],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,3]<-factor(raw[,3],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,4]<-factor(raw[,4],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw=raw[,c(2,3,4)] # getting rid of the "people" variable as I see no use for it
freq=table(col(raw), as.matrix(raw)) # get the counts of each factor level
Then you need to create a data frame out of it, melt it and plot it:
Names=c("Food","Music","People") # create list of names
data=data.frame(cbind(freq),Names) # combine them into a data frame
data=data[,c(5,3,1,2,4)] # sort columns
# melt the data frame for plotting
library(reshape2) # for melt()
data.m <- melt(data, id.vars='Names')
# plot everything
library(ggplot2)
ggplot(data.m, aes(Names, value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity")
Is this what you're after?
To clarify a little bit, in ggplot multiple grouping bar you had a data frame that looked like this:
> head(df)
ID Type Annee X1PCE X2PCE X3PCE X4PCE X5PCE X6PCE
1 1 A 1980 450 338 154 36 13 9
2 2 A 2000 288 407 212 54 16 23
3 3 A 2020 196 434 246 68 19 36
4 4 B 1980 111 326 441 90 21 11
5 5 B 2000 63 298 443 133 42 21
6 6 B 2020 36 257 462 162 55 30
Since you have numerical values in columns 4-9, which would later be plotted on the y axis, this can be easily transformed with reshape and plotted.
For our current data set, we needed something similar, so we used freq=table(col(raw), as.matrix(raw)) to get this:
> data
Names Very.Bad Bad Good Very.Good
1 Food 7 6 5 2
2 Music 5 5 7 3
3 People 6 3 7 4
Just imagine you have Very.Bad, Bad, Good and so on instead of X1PCE, X2PCE, X3PCE. See the similarity? But we needed to create such structure first. Hence the freq=table(col(raw), as.matrix(raw)).
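To see what `table(col(raw), as.matrix(raw))` does, here is a minimal sketch on a three-row toy survey (the data and the name `answers` are made up). `col()` returns the column index of every cell and `as.matrix()` returns the answer in that cell, so cross-tabulating the two counts each answer per question.

```r
# Toy survey: two questions, three respondents (hypothetical answers)
answers <- data.frame(
  Food = c("Bad", "Good", "Good"),
  Music = c("Good", "Good", "Good")
)

# Rows of the result are column indices (1 = Food, 2 = Music),
# columns are the distinct answers found anywhere in the data
freq <- table(col(answers), as.matrix(answers))
```

Here row "1" holds the Food counts and row "2" the Music counts, which is why the answer attaches a `Names` column afterwards.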

In @jakub's answer the calculations are done before the data is passed to ggplot(), which is why the stat in geom_bar is set to "identity" (i.e. take the data as is and do nothing with it).
Another approach is to let ggplot do the counting for you, hence we can make use of stat = "count", the default of geom_bar:
library(ggplot2)
ggplot(stack(df1[, -1]), aes(ind, fill = values)) +
geom_bar(position = "dodge")
data
df1 <- read.csv(text = "People,Food,Music,People
P1,Very Bad,Bad,Good
P2,Good,Good,Very Bad
P3,Good,Bad,Good
P4,Good,Very Bad,Very Good
P5,Bad,Good,Very Good
P6,Bad,Good,Very Good
P7,Bad,Very Bad,Good
P8,Very Good,Very Bad,Good
P9,Very Bad,Good,Bad
P10,Bad,Good,Very Bad
P11,Good,Bad,Very Bad
P12,Very Bad,Bad,Very Good
P13,Bad,Very Good,Bad
P14,Bad,Very Good,Very Bad
P15,Good,Good,Good
P16,Very Bad,Very Good,Very Bad
P17,Very Bad,Good,Good
P18,Very Bad,Very Bad,Bad
P19,Very Good,Very Bad,Very Bad
P20,Very Bad,Bad,Good", header = TRUE)
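For readers unfamiliar with `stack()`, this short sketch shows the shape it produces on a two-row toy table (the values and the name `survey` are illustrative only). The `values` and `ind` columns are the ones referenced in the `aes()` call above.

```r
# Toy wide table with one column per question
survey <- data.frame(
  Food = c("Bad", "Good"),
  Music = c("Good", "Very Bad"),
  stringsAsFactors = FALSE
)

# stack() puts all answers in `values` and the source column name in `ind`
stacked <- stack(survey)
```

Each row of `stacked` is one answer, so `geom_bar()` can count them with its default `stat = "count"`.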

Creating a loop with compare_means

I am trying to create a loop to use compare_means (ggpubr library in R) across all columns in a dataframe and then select only significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data<-list()
for (i in 2:length(df3)){
data<-compare_means(df3[[i]] ~ Cluster2, data=df3, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to collect the output in a data frame containing all the information returned by compare_means.
Thanks a lot

Try this:
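library(ggpubr)
data<-list()
for (i in 2:(length(df3)-1)){
# select by name: c(i,"Cluster2") would coerce i to the string "2"
new<-df3[,c(names(df3)[i],"Cluster2")]
colnames(new)<-c("interest","Cluster2")
# store each result under the column name instead of overwriting data
data[[names(df3)[i]]]<-compare_means(interest ~ Cluster2, data=new, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")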
}
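The question also asks to keep only the significant adjusted p-values. As a rough base R sketch of that idea, the code below swaps `compare_means` for `wilcox.test` plus `p.adjust`, so it is not the ggpubr output format. The toy `df3` values are made up, and names such as `measure_cols` and `results` are assumptions.

```r
# Toy data in the shape of df3 (values are hypothetical)
df3 <- data.frame(
  sampleID = paste0("gut", 1:10),
  Actio = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15),
  Beta = c(5, 1, 7, 3, 9, 6, 2, 8, 4, 10),
  Cluster2 = rep(c("HIGH", "LOW"), each = 5)
)

# Every column except the ID and the grouping variable gets tested
measure_cols <- setdiff(names(df3), c("sampleID", "Cluster2"))

# One Wilcoxon rank-sum test per column, HIGH versus LOW
p_values <- sapply(measure_cols, function(col) {
  values <- split(df3[[col]], df3$Cluster2)
  wilcox.test(values$HIGH, values$LOW)$p.value
})

# Collect everything in one data frame and adjust across all columns
results <- data.frame(
  variable = measure_cols,
  p = unname(p_values),
  p.adj = p.adjust(p_values, method = "bonferroni")
)

# Keep only the columns that stay significant after adjustment
significant <- results[results$p.adj < 0.05, ]
```

Note that the adjustment here is applied across all tested columns at once, which may differ from adjusting within each `compare_means` call.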

Creating a Unique Key in R for time series data based on a binary starttime and endtime columns

I'm trying to plot lane data for trips taken in driving studies. The data-set looks like this:
Dataset
Using the following code, I was able to create plots for each trip
library(ggplot2)
library(grid)
LaneData <- read.csv(file="c:\\Users\\jasonzb\\Desktop\\Cars overtaking trucks Project\\Honda Image\\Truck Overpassing Event Device 10150.csv", header=TRUE, sep=",")
LaneData <- cbind(LaneData[,1:3], LaneData[,7:8])
#rename first column
colnames(LaneData)[1] <- "device"
#Create segment to split data by
LaneData$SplitID <- seq.int(nrow(LaneData))
LaneData$SplitID = round(LaneData$SplitID, digits = -4)
#melt the data
LaneData <-melt(LaneData, id=c("device", "trip", "time", "SplitID"))
#Create levels for SplitID
LaneData$SplitID = factor(LaneData$SplitID)
for (i in levels(LaneData$SplitID)){
LaneData_temp <- subset(LaneData, LaneData$SplitID == i)
print(ggplot(LaneData_temp, aes(LaneData_temp$value, LaneData_temp$time, col =LaneData_temp$variable)) + geom_point() + facet_wrap(~ trip, scales = "free"))
}
This results in plots that look like this:
Plots
For the plot on the right (Trip 5820), everything looks fine. However, for the plot on the right (Trip 5813), there are several time series events in different intervals combined due to gaps in the data that looks like this: Dataset: Time series Break
Is there a way to create a unique key to separate the changes in the timestamps so each segment of time series data could be plotted separately?
Thanks!

Consider creating a new grouping column from multiples of the time variable, then run a multiple-variable facet_wrap or facet_grid. For example:
time <- c(523490 + seq(0, 70, 10), 764350 + seq(0, 50, 10))
time
# [1] 523490 523500 523510 523520 523530 523540 523550 523560 764350 764360 764370 764380 764390 764400
# TEN-THOUSAND MULTIPLES
time_grp <- as.integer(time / 10000)
time_grp
# [1] 52 52 52 52 52 52 52 52 76 76 76 76 76 76
# HUNDRED-THOUSAND MULTIPLES
time_grp <- as.integer(time / 100000)
time_grp
# [1] 5 5 5 5 5 5 5 5 7 7 7 7 7 7
To integrate this into the current code, replace the last for loop with by() to split the data frame into subsets for plotting:
...
by(LaneData, LaneData$SplitID, function(sub) {
sub$time_grp <- as.integer(sub$time / 10000)
ggplot(sub, aes(value, time, col=variable)) + geom_point() +
facet_wrap(trip ~ time_grp, scales = "free")
# TRY: facet_grid(. ~ time_grp + trip, scales = "free")
})
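An alternative way to build the unique key, sketched here as an assumption rather than as part of the answer above, is to start a new segment whenever the jump between consecutive timestamps exceeds a threshold. The threshold of 100 is an assumed value chosen for this toy vector, where the normal step is 10.

```r
# Same toy timestamps as above: two runs separated by a large gap
time <- c(523490 + seq(0, 70, 10), 764350 + seq(0, 50, 10))

# Assumed cut-off: any jump larger than this starts a new segment
gap_threshold <- 100

# TRUE marks the first row of each segment; cumsum turns that into a running ID
segment_id <- cumsum(c(TRUE, diff(time) > gap_threshold))
```

Unlike integer division, this does not split a continuous run that happens to cross a multiple of 10000. With several trips in one table, the same calculation would need to be done within each trip, for example with `ave()`.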

Heatmap in ggplot2 issue with fill

I'm trying to make a heatmap using ggplot2. What I want to be plotted is in the form of a matrix which is the result of a function.
Here is the data:
Image A B C D E F
1 3 23 45 23 45 90
2 4 34 34 34 34 89
3 34 33 24 89 23 67
4 3 45 234 90 12 78
5 78 89 34 23 12 56
6 56 90 56 67 34 45
Here is the function:
vector_a <- names(master)[2:4]
vector_b <- names(master)[5:6]
heatmap_prep <- function(dataframe, vector_a,vector_b){
dummy <- as.data.frame(matrix(0, nrow=length(vector_a), ncol=length(vector_b)))
for (i in 1:length(vector_a)){
first_value <- dataframe[[ vector_a[i] ]]
# print(first_value)
for(j in 1:length(vector_b)){
second_value <- dataframe[[ vector_b[j] ]]
result <- cor(first_value, second_value, method = "spearman")
dummy [i,j] <- result
}
}
rownames(dummy) <- vector_a
return(as.matrix(dummy))
}
heatmap_data_matrix1 <- heatmap_prep(master,vector_a, vector_b)
Using the data in heatmap_data_matrix1, I want to create a heatmap using the following code:
library(ggplot2)
if (length(grep("ggplot2", (.packages() ))) == 0){
library(ggplot2)
}
p <- ggplot(data = heatmap_data_matrix1, aes(x = vector_a, y = vector_b)
+ geom_tile(aes(fill = ))
However, this does not work. How should I reformat my data/code so this heatmap can be created? What should I put under "fill="?
Thanks!

Because many R functions are vectorized and, for the most part, you don't need to pre-allocate or define a vector, the for loop is unnecessary. You can simply run cor(x, y, method = "spearman") on the two sets of columns without the complications of the loop.
Regarding your question of what to put in for fill, you'll need to reshape your data to the configuration that ggplot2 uses (long format).
The gather function from tidyr does this, placing the rows/columns of the correlation into separate columns, and then using the r value for fill.
library(tidyverse) # for tidyr, tibble, ggplot2, and magrittr
heatmap_function <- function(df, a, b) {
cor_data <- cor(df[a], df[b], method = "spearman") %>%
as.data.frame(rownames = a) %>%
rownames_to_column("x") %>%
gather(y, fill, -x)
ggplot(cor_data, aes(x = x, y = y, fill = fill)) +
geom_tile()
}
This results in:
heatmap_function(master, c("A","B","C"), c("D","E"))
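For comparison, the reshape to long format can also be done in base R. This is only a sketch: the `master` values are copied from the question's table, while the names `cor_matrix` and `cor_long` and the column name `r` are assumptions.

```r
# Columns A to E from the question's data
master <- data.frame(
  A = c(3, 4, 34, 3, 78, 56),
  B = c(23, 34, 33, 45, 89, 90),
  C = c(45, 34, 24, 234, 34, 56),
  D = c(23, 34, 89, 90, 23, 67),
  E = c(45, 34, 23, 12, 12, 34)
)

# 3 x 2 matrix of Spearman correlations, computed without any loop
cor_matrix <- cor(master[c("A", "B", "C")], master[c("D", "E")],
                  method = "spearman")

# One row per (x, y) pair, with the correlation in column r
cor_long <- as.data.frame(as.table(cor_matrix), responseName = "r")
names(cor_long)[1:2] <- c("x", "y")
```

The result can be passed to `ggplot(cor_long, aes(x = x, y = y, fill = r)) + geom_tile()`.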

R Plot Bar graph transposed dataframe

I'm trying to plot the following data frame as a bar plot, where the values for the filteredprovince column are listed in a separate column (n).
Usually, ggplot and the other plotting functions work on a horizontal data frame, and after several searches I have not been able to find a way to plot this "transposed" version of the data frame.
The cluster should group each bar graph, and within each cluster I would plot each filteredprovince based on the value of the n column.
Thank you for the support.
d <- read.table(text=
" cluster PROVINCIA n filteredprovince
1 1 08 765 08
2 1 28 665 28
3 1 41 440 41
4 1 11 437 11
5 1 46 276 46
6 1 18 229 18
7 1 35 181 other
8 1 29 170 other
9 1 33 165 other
10 1 38 153 other ", header=TRUE,stringsAsFactors = FALSE)
UPDATE
Thanks to the suggestion in the comments I almost achieved the desired format:
ggplot(tab_s, aes(x = cluster, y = n, fill = factor(filteredprovince))) + geom_col()
Is there any way to show percentages on the Y axis instead of frequencies?

If I understand correctly, you're trying to use geom_bar(), which gives you problems because it wants to build a sort of histogram by counting rows, but you have already done this kind of summary.
(If you had provided the code you have tried so far, I would not have to guess.)
In that case you can use geom_col() instead.
ggplot(d, aes(x = filteredprovince, y = n, fill = factor(PROVINCIA))) + geom_col()
Alternatively, you can change the default stat of geom_bar() from "count" to "identity"
ggplot(d, aes(x = filteredprovince, y = n, fill = factor(PROVINCIA))) +
geom_bar(stat = "identity")
See this SO question for what a stat is
EDIT: Update in response to OP's update:
To display percentages, you will have to modify the data itself.
Just divide n by the sum of all n and multiply by 100.
d$percentage <- d$n / sum(d$n) * 100
ggplot(d, aes(x = cluster, y = percentage, fill = factor(filteredprovince))) + geom_col()
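The code above divides by the grand total, which is fine for the single cluster in the sample data. If the real data has several clusters and each bar should add up to 100, one possible variation is to divide by the per-cluster total instead. The toy data below is made up for illustration.

```r
# Toy data with two clusters (hypothetical counts)
d <- data.frame(
  cluster = c(1, 1, 2, 2),
  filteredprovince = c("08", "other", "08", "other"),
  n = c(30, 70, 10, 30)
)

# ave() returns the cluster total on every row, so this is a within-cluster share
d$percentage <- d$n / ave(d$n, d$cluster, FUN = sum) * 100
```

The plotting call stays the same, with `y = percentage`.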

I'm not sure I perfectly understand, but if the problem is the orientation of your dataframe, you can transpose it with t(data) where data is your dataframe.
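A brief sketch of what `t()` returns, using a tiny all-numeric data frame with illustrative values: the result is a matrix, not a data frame, so it usually needs converting back before being passed to ggplot.

```r
# Small all-numeric data frame with row names
d <- data.frame(n = c(765, 665), share = c(0.5, 0.4),
                row.names = c("08", "28"))

# t() swaps rows and columns and returns a matrix
transposed <- t(d)

# Convert back to a data frame if one is needed
transposed_df <- as.data.frame(transposed)
```

If the data frame mixes numeric and character columns, the transposed matrix becomes character throughout, which is one reason reshaping to long format is often preferred.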
