I want to produce an x,y plot (with ggplot2 or whatever works) from the table below. The lines should be grouped by Day, Soil_Number, and Sample; Mean is my y value, SD is my error bar, and the Day column should also serve as my x value as a timeline. How do I manage this?
Results_CMT
# A tibble: 22 x 5
# Groups: Day, Soil_Number [10]
Day Soil_Number Sample Mean SD
<int> <int> <chr> <dbl> <dbl>
1 0 65871 R 3.84 0.230
2 0 65872 R 4.82 0.679
3 1 65871 R 3.80 1.10
4 1 65872 R 3.24 1.61
5 3 65871 fLF NA NA
6 3 65871 HF 1.73 0.795
7 3 65871 oLF 0.360 0.129
8 3 65871 R 3.13 1.36
9 3 65872 fLF NA NA
10 3 65872 HF 1.86 0.374
# ... with 12 more rows
At the end there should be 8 lines (if data is found):
65871 R
65871 HF
65871 fLF
65871 oLF
65872 R
65872 HF
65872 fLF
65872 oLF
Do I have to produce another column with a combined character value of Day, Soil_Number, and Sample?
Thanks for any help.
Try this:
library(ggplot2)
ggplot(Results_CMT, aes(x = Day, y = Mean, colour = interaction(Sample, Soil_Number))) +
geom_line() +
geom_errorbar(aes(ymin = Mean-SD, ymax = Mean+SD), width = .2)
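To the follow-up question: no extra column is needed, because interaction() builds the combined grouping on the fly. If an explicit combined column is preferred anyway, a sketch using tidyr::unite (assuming the tibble is named Results_CMT as above) could look like:

```r
library(tidyr)
library(ggplot2)

Results_CMT |>
  unite(group, Soil_Number, Sample, remove = FALSE) |>  # e.g. "65871_R"
  ggplot(aes(x = Day, y = Mean, colour = group)) +
  geom_line() +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2)
```

Either way the legend entries end up as one per Soil_Number/Sample combination, which matches the 8 expected lines.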
My tibble looks like this:
# A tibble: 5 × 6
clusters neuroticism introverty empathic open unconscious
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.242 1.02 0.511 0.327 -0.569
2 2 -0.285 -0.257 -1.36 0.723 -0.994
3 3 0.904 -0.973 0.317 0.0622 -0.0249
4 4 -0.836 0.366 0.519 0.269 1.00
5 5 0.0602 -0.493 -1.03 -1.53 -0.168
I was wondering how I can plot this with ggplot2 so that it looks like the big five personality profiles shown in this picture:
My goal is to plot a personality profile for each cluster.
In order to plot it you'd typically need to have the data in a long format. A tidyverse solution using pivot_longer and then ggplot could look like:
library(tidyr)
library(ggplot2)

df |>
pivot_longer(-clusters) |>
ggplot(aes(x = name,
y = value,
fill = as.factor(clusters))) +
geom_col(position = "dodge")
Plot:
Data:
library(tibble)
df <-
tribble(
~clusters,~neuroticism,~introverty,~empathic,~open,~unconscious,
1,0.242,1.02,0.511,0.327,-0.569,
2,-0.285,-0.257,-1.36,0.723,-0.994,
3,0.904,-0.973,0.317,0.0622,-0.0249,
4,-0.836,0.366,0.519,0.269,1.00,
5,0.0602,-0.493,-1.03,-1.53,-0.168
)
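Since the stated goal is one profile per cluster, a variant of the answer above that facets by cluster (one panel per profile) may come closer to the linked picture. This is a sketch assuming the same df as in the Data section:

```r
library(tidyr)
library(ggplot2)

df |>
  pivot_longer(-clusters) |>
  ggplot(aes(x = name, y = value)) +
  geom_col() +
  facet_wrap(~ clusters)  # one personality profile per cluster
```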
Hello, I am trying to add a legend to my graph.
Having looked at a few previous answers, they all seem to rely on aes() or on the lines being related to a factor in some way. I didn't understand this answer: Add legend to geom_line() graph in r.
In my case I simply want a legend that states "RED = No Cross Validation" and "BLUE = Cross Validation".
R Code
ggplot(data=graphDF,aes(x=rev(kAxis)))+
geom_line(y=rev(noCVErr),color="red")+
geom_point(y=rev(noCVErr),color="red")+
geom_line(y=rev(CVErr),color="blue")+
geom_point(y=rev(CVErr),color="blue")+
ylim(minErr,maxErr)+
ggtitle("The KNN Error Rate for Cross Validated and Non-Cross Validated Models")+
labs(y="Error Rate", x = "1/K")
Dataset
ks kAxis noCVAcc noCVErr CVAcc CVErr
1 1 1.00000000 1.0000000 0.00000000 0.8279075 0.1720925
2 3 0.33333333 0.9345238 0.06547619 0.8336898 0.1663102
3 5 0.20000000 0.8809524 0.11904762 0.8158645 0.1841355
4 7 0.14285714 0.8690476 0.13095238 0.8272727 0.1727273
5 9 0.11111111 0.8809524 0.11904762 0.7857398 0.2142602
6 11 0.09090909 0.8809524 0.11904762 0.7500891 0.2499109
7 13 0.07692308 0.8511905 0.14880952 0.7622103 0.2377897
8 15 0.06666667 0.7976190 0.20238095 0.7320856 0.2679144
9 17 0.05882353 0.7916667 0.20833333 0.7320856 0.2679144
10 19 0.05263158 0.7559524 0.24404762 0.7201426 0.2798574
11 21 0.04761905 0.7678571 0.23214286 0.7023173 0.2976827
12 23 0.04347826 0.7440476 0.25595238 0.6903743 0.3096257
13 25 0.04000000 0.7559524 0.24404762 0.6786096 0.3213904
It might help if you put your data into "long" form, such as this for your data frame graphDF (perhaps using pivot_longer from tidyr if necessary):
library(tidyr)
graphDF_long <- pivot_longer(data = graphDF,
cols = c(noCVErr, CVErr),
names_to = "model",
values_to = "errRate")
This creates a new data.frame called graphDF_long that has a single column for the error rate, and a new column that specifies model:
ks kAxis noCVAcc CVAcc model errRate
<int> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 1 0.828 noCVErr 0
2 1 1 1 0.828 CVErr 0.172
3 3 0.333 0.935 0.834 noCVErr 0.0655
4 3 0.333 0.935 0.834 CVErr 0.166
5 5 0.2 0.881 0.816 noCVErr 0.119
6 5 0.2 0.881 0.816 CVErr 0.184
....
Then you can simplify your ggplot statement and use an aesthetic with the column model for color. Note that rev() is no longer needed; in fact, calling rev() inside aes() would pair reversed x/y values with un-reversed colors:
library(ggplot2)
ggplot(data = graphDF_long, aes(x = kAxis, y = errRate, color = model)) +
geom_line() +
geom_point() +
scale_color_manual(values = c("blue", "red"),
labels = c("Cross Validation", "No Cross Validation")) +
ylim(min(graphDF_long$errRate), max(graphDF_long$errRate)) +
ggtitle("The KNN Error Rate for Cross Validated and Non-Cross Validated Models") +
labs(y="Error Rate", x = "1/K")
This will generate the legend automatically:
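If the automatic legend title "model" is not wanted, it can be renamed via the name argument of the manual scale. A minimal self-contained sketch (the demo data frame here is a hypothetical stand-in for graphDF_long):

```r
library(ggplot2)

# Hypothetical long-format frame standing in for graphDF_long:
demo <- data.frame(kAxis   = rep(c(1, 0.5, 0.33), 2),
                   errRate = c(0, 0.065, 0.12, 0.17, 0.166, 0.18),
                   model   = rep(c("noCVErr", "CVErr"), each = 3))

p <- ggplot(demo, aes(x = kAxis, y = errRate, color = model)) +
  geom_line() +
  geom_point() +
  scale_color_manual(name = "Model",  # legend title
                     values = c("blue", "red"),
                     labels = c("Cross Validation", "No Cross Validation"))
p
```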
I have 7 variables (density of plankton functional groups) in a time series which I want to place in a single plot to compare their trends over time. I used ggplot, geom_point and geom_line. Since each of the variables vary in range, those with smaller values end up as almost flat lines when plotted against those with the larger values. Since I am only after the trends, not the density, I would prefer to see all lines in one plot. I considered using the sec.axis function, but could not figure out how to assign the variables to either of the y-axes.
Below is my sample data:
seq=1:6
fgrp:
Cop<-c(4.166667,4.722222,3.055556,4.444444,2.777778,2.222222)
Cyan<-c(7.222222,3.888889,1.388889,0.555556,6.944444,3.611111)
Dia<-c(96.66667,43.88889,34.44444,111.8056,163.0556,94.16667)
Dino<-c(126.9444,71.11111,50,55.97222,65,38.33333)
Naup<-c(271.9444,225.5556,207.7778,229.8611,139.7222,92.5)
OT<-c(22.5,19.16667,10.27778,18.61111,18.88889,8.055556)
Prot<-c(141.9444,108.8889,99.16667,113.8889,84.44444,71.94444)
And the ggplot script with my sec.axis attempt, which I could not make work yet:
ggplot(data = df, aes(x = seq, y = mean, shape = fgrp, linetype = fgrp)) +
  geom_point(size = 2.5) +
  geom_line(size = 0.5) +
  scale_shape_manual(values = c(16, 17, 15, 18, 8, 1, 0),
                     guide = guide_legend(title = "Functional\nGroups")) +
  scale_linetype_manual(values = c("solid", "longdash", "dotted", "dotdash",
                                   "dashed", "twodash", "12345678"),
                        guide = FALSE) +
  scale_y_continuous(sec.axis = sec_axis(~ . / 3)) +
  geom_errorbar(aes(ymax = mean + se, ymin = mean - se),
                width = 0.04, linetype = "longdash", color = "gray30") +
  theme_minimal() +
  labs(title = "Control", x = "time", y = "density") +
  theme(plot.title = element_text(size = 12, hjust = 0.5))
The lines do not look terrible, as is, but here's an example that leverages facet_wrap with scales = "free_y" that should get you going in the right direction:
library(tidyverse)
seq <- 1:6
Cop <- c(4.166667,4.722222,3.055556,4.444444,2.777778,2.222222)
Cyan <- c(7.222222,3.888889,1.388889,0.555556,6.944444,3.611111)
Dia <- c(96.66667,43.88889,34.44444,111.8056,163.0556,94.16667)
Dino <- c(126.9444,71.11111,50,55.97222,65,38.33333)
Naup <- c(271.9444,225.5556,207.7778,229.8611,139.7222,92.5)
OT <- c(22.5,19.16667,10.27778,18.61111,18.88889,8.055556)
Prot <- c(141.9444,108.8889,99.16667,113.8889,84.44444,71.94444)
df <- tibble(
seq = seq,
cop = Cop,
cyan = Cyan,
dia = Dia,
dino = Dino,
naup = Naup,
ot = OT,
prot = Prot
)
df
#> # A tibble: 6 x 8
#> seq cop cyan dia dino naup ot prot
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4.17 7.22 96.7 127. 272. 22.5 142.
#> 2 2 4.72 3.89 43.9 71.1 226. 19.2 109.
#> 3 3 3.06 1.39 34.4 50 208. 10.3 99.2
#> 4 4 4.44 0.556 112. 56.0 230. 18.6 114.
#> 5 5 2.78 6.94 163. 65 140. 18.9 84.4
#> 6 6 2.22 3.61 94.2 38.3 92.5 8.06 71.9
df_tidy <- df %>%
gather(grp, value, -seq)
df_tidy
#> # A tibble: 42 x 3
#> seq grp value
#> <int> <chr> <dbl>
#> 1 1 cop 4.17
#> 2 2 cop 4.72
#> 3 3 cop 3.06
#> 4 4 cop 4.44
#> 5 5 cop 2.78
#> 6 6 cop 2.22
#> 7 1 cyan 7.22
#> 8 2 cyan 3.89
#> 9 3 cyan 1.39
#> 10 4 cyan 0.556
#> # ... with 32 more rows
ggplot(df_tidy, aes(x = seq, y = value, color = grp)) +
geom_line()
ggplot(df_tidy, aes(x = seq, y = value, color = grp)) +
geom_line() +
facet_wrap(~ grp, scales = "free_y")
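A side note: gather() still works but has since been superseded; on current tidyr the same reshape can be written with pivot_longer. A small self-contained sketch (the df here is a minimal stand-in for the one built above):

```r
library(tidyr)

# Minimal stand-in for the df built above:
df <- data.frame(seq = 1:3, cop = c(4.2, 4.7, 3.1), cyan = c(7.2, 3.9, 1.4))

# Equivalent of gather(grp, value, -seq) with the newer API:
df_tidy <- pivot_longer(df, -seq, names_to = "grp", values_to = "value")
df_tidy
```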
I assume this has been asked multiple times but I couldn't find the proper words to find a workable solution.
How can I spread() a data frame based on multiple keys for multiple values?
A simplified version of the data I'm working with looks like this (the real data has many more columns to spread, but still only two keys: id and the time point of a given measurement):
df <- data.frame(id = rep(seq(1:10),3),
time = rep(1:3, each=10),
x = rnorm(n=30),
y = rnorm(n=30))
> head(df)
id time x y
1 1 1 -2.62671241 0.01669755
2 2 1 -1.69862885 0.24992634
3 3 1 1.01820778 -1.04754037
4 4 1 0.97561596 0.35216040
5 5 1 0.60367158 -0.78066767
6 6 1 -0.03761868 1.08173157
> tail(df)
id time x y
25 5 3 0.03621258 -1.1134368
26 6 3 -0.25900538 1.6009824
27 7 3 0.13996626 0.1359013
28 8 3 -0.60364935 1.5750232
29 9 3 0.89618748 0.0294315
30 10 3 0.14709567 0.5461084
What i'd like to have is a dataframe populated like this:
One row per id, with a column for each combination of time point and measurement variable.
With the devel version of tidyr (tidyr_0.8.3.9000), we can use pivot_wider to reshape multiple value columns from long to wide format
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(time = str_c("time", time)) %>%
pivot_wider(names_from = time, values_from = c("x", "y"), names_sep="")
# A tibble: 10 x 7
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 -0.256 0.483 -0.254 -0.652 0.655 0.291
# 2 2 1.10 -0.596 -1.85 1.09 -0.401 -1.24
# 3 3 0.756 -2.19 -0.0779 -0.763 -0.335 -0.456
# 4 4 -0.238 -0.675 0.969 -0.829 1.37 -0.830
# 5 5 0.987 -2.12 0.185 0.834 2.14 0.340
# 6 6 0.741 -1.27 -1.38 -0.968 0.506 1.07
# 7 7 0.0893 -0.374 -1.44 -0.0288 0.786 1.22
# 8 8 -0.955 -0.688 0.362 0.233 -0.902 0.736
# 9 9 -0.195 -0.872 -1.76 -0.301 0.533 -0.481
#10 10 0.926 -0.102 -0.325 -0.678 -0.646 0.563
NOTE: The numbers are different as there was no set seed while creating the sample dataset
Reshaping with multiple value variables can best be done with dcast from data.table or reshape from base R.
library(data.table)
out <- dcast(setDT(df), id ~ paste0("time", time), value.var = c("x", "y"), sep = "")
out
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# 1: 1 0.4334921 -0.5205570 -1.44364515 0.49288757 -1.26955148 -0.83344256
# 2: 2 0.4785870 0.9261711 0.68173681 1.24639813 0.91805332 0.34346260
# 3: 3 -1.2067665 1.7309593 0.04923993 1.28184341 -0.69435556 0.01609261
# 4: 4 0.5240518 0.7481787 0.07966677 -1.36408357 1.72636849 -0.45827205
# 5: 5 0.3733316 -0.3689391 -0.11879819 -0.03276689 0.91824437 2.18084692
# 6: 6 0.2363018 -0.2358572 0.73389984 -1.10946940 -1.05379502 -0.82691626
# 7: 7 -1.4979165 0.9026397 0.84666801 1.02138768 -0.01072588 0.08925716
# 8: 8 0.3428946 -0.2235349 -1.21684977 0.40549497 0.68937085 -0.15793111
# 9: 9 -1.1304688 -0.3901419 -0.10722222 -0.54206830 0.34134397 0.48504564
#10: 10 -0.5275251 -1.1328937 -0.68059800 1.38790593 0.93199593 -1.77498807
Using reshape we could do
# setDF(df) # in case df is a data.table now
reshape(df, idvar = "id", timevar = "time", direction = "wide")
Your input data frame is not tidy. You should use gather to make it so before spreading:
gather(df, key, value, -id, -time) %>%
mutate(key = paste0(key, "time", time)) %>%
select(-time) %>%
spread(key, value)
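The same gather/spread pipeline in current tidyr idiom (pivot_longer plus pivot_wider) could be sketched like this, using a minimal stand-in for the question's df:

```r
library(tidyr)
library(dplyr)

# Minimal stand-in for the df from the question:
df <- data.frame(id   = rep(1:2, 2),
                 time = rep(1:2, each = 2),
                 x = 1:4, y = 5:8)

res <- df %>%
  pivot_longer(c(x, y)) %>%
  mutate(name = paste0(name, "time", time)) %>%
  select(-time) %>%
  pivot_wider()  # defaults: names_from = name, values_from = value
res
```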
Say I have a dataframe called RaM that holds cumulative return values. In this case, they literally are just a single row of cumulative return values along with column headers, but I would like to apply the logic to not just single row dataframes.
Say I want to sort by the max cumulative return value of each column, or even the average, or the sum of each column.
Each column would then be reordered so that, comparing the max cumulative returns of each column, the column with the highest return becomes the first column and the one with the lowest becomes the last.
Then say I want to take the top 10 (the first 10 columns after rearranging), or even the top 10%.
I know how to derive the column averages, but I don't know how to do the remaining operations effectively. There is an order function, but when I used it, it stripped my column names, which I need. I could then easily cut the first, say, 10 columns, but is there a way that preserves the names? My goal is to extract the column names of the top n columns of RaM in terms of some aggregate function applied to each column over the entire dataframe.
something like
top10 <- getTop10ColumnNames(colSums(RaM))
that would then output a dataframe of the top 10 columns in terms of their sum from RaM
Here's the output of RaM:
> head(RaM,2)
ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
2013-01-31 0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
2013-02-28 0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438
CPST EA EGY EXEL FCSC FOLD GNC GTT HEAR HK HZNP
2013-01-31 -0.05269663 0.08333333 -0.01849711 0.01969365 0 0.4179104 0.07992677 0.250000000 0.2017417 0.10404624 -0.085836910
2013-02-28 0.15051595 0.11443102 -0.04475854 -0.02145923 0 -0.2947368 0.14079036 0.002857143 0.4239130 -0.07068063 -0.009389671
ICON IMI IMMU INFI INSY KEG LGND LQDT MCF MU
2013-01-31 0.07750896 0.05393258 -0.01027397 -0.01571429 -0.05806459 0.16978417 -0.03085824 -0.22001958 0.01345609 0.1924290
2013-02-28 -0.01746362 0.03091684 -0.20415225 0.19854862 0.36849503 0.05535055 0.02189055 0.06840289 -0.09713487 0.1078042
NBIX NFLX NVDA OREX PFPT PQ PRTA PTX RAS REXX RTRX
2013-01-31 0.2112299 0.7846467 0.00000000 0.08950306 0.06823721 0.03838384 -0.1800819 0.04387097 0.23852335 0.008448541 0.34328358
2013-02-28 0.1677704 0.1382251 0.03888981 0.04020979 0.06311787 -0.25291829 0.0266223 -0.26328801 0.05079882 0.026656512 -0.02222222
SDRL SHOS SSI STMP TAL TREE TSLA TTWO UVE VICL
2013-01-31 0.07826093 0.2023956 -0.07788381 0.07103175 -0.14166875 -0.030504714 0.10746974 0.1053588 0.0365299 0.2302405
2013-02-28 -0.07585546 0.1384419 0.08052150 -0.09633197 0.08009728 -0.002860412 -0.07144761 0.2029581 -0.0330408 -0.1061453
VSI VVUS WLB
2013-01-31 0.06485356 -0.0976155 0.07494647
2013-02-28 -0.13965291 -0.1156069 0.04581673
Here's one way using the first section of your sample data to illustrate. You can gather up all the columns so that we can do summary calculations more easily, calculate all the summaries by group that you want, and then sort with arrange. Here I ordered with the highest sums first, but you could do whatever order you wanted.
library(tidyverse)
ram <- read_table2(
"ABMD ACAD ALGN ALNY ANIP ASCMA AVGO CALD CLVS CORT
0.03794643 0.296774194 0.13009009 0.32219178 0.13008130 0.02857604 0.13014640 -0.07929515 0.23375000 0.5174825
0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138 0.04066986 -0.04457953 -0.2465438"
)
summary <- ram %>%
gather(colname, value) %>%
group_by(colname) %>%
summarise_at(.vars = vars(value), .funs = funs(mean = mean, sum = sum, max = max)) %>%
arrange(desc(sum))
summary
#> # A tibble: 10 x 4
#> colname mean sum max
#> <chr> <dbl> <dbl> <dbl>
#> 1 ALNY 0.152 0.304 0.322
#> 2 ACAD 0.152 0.303 0.297
#> 3 CORT 0.135 0.271 0.517
#> 4 CLVS 0.0946 0.189 0.234
#> 5 ABMD 0.0939 0.188 0.150
#> 6 ALGN 0.0663 0.133 0.130
#> 7 ASCMA 0.0526 0.105 0.0766
#> 8 AVGO 0.0434 0.0868 0.130
#> 9 ANIP 0.0363 0.0725 0.130
#> 10 CALD -0.0193 -0.0386 0.0407
If you then want to reorder your original data frame, you can get the order from this summary output and index with it:
ram[summary$colname]
#> # A tibble: 2 x 10
#> ALNY ACAD CORT CLVS ABMD ALGN ASCMA AVGO ANIP
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.322 0.297 0.517 0.234 0.0379 0.130 0.0286 0.130 0.130
#> 2 -0.0182 0.00663 -0.247 -0.0446 0.150 0.00255 0.0766 -0.0433 -0.0576
#> # ... with 1 more variable: CALD <dbl>
Created on 2018-08-01 by the reprex package (v0.2.0).
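As a base-R footnote: indexing a data frame's columns by order() does preserve the column names, so the hypothetical getTop10ColumnNames() from the question can be sketched in a couple of lines (the RaM here is a tiny hypothetical stand-in):

```r
# Hypothetical data frame standing in for RaM:
RaM <- data.frame(A = c(0.1, 0.2), B = c(0.5, 0.4), C = c(-0.1, 0.3))

# Columns reordered by decreasing column sum; names are preserved:
sorted <- RaM[, order(colSums(RaM), decreasing = TRUE), drop = FALSE]

# Names of the top n columns (here n = 2):
top_names <- names(sorted)[1:2]
top_names  # "B" "A"
```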