My data:
GFP_mywide<-bothdata[1:10,c(2,3,4)]
GFP_long<-melt(GFP_mywide, id =c('Entrez.Symbol'))
looks like this:
>GFP_long
Entrez.Symbol variable value
1 TRIP11 GFP_my 1.015
2 SIN3B GFP_my 0.336
3 SF3B1 GFP_my 0.315
4 PSMD14 GFP_my 0.254
5 RAD51 GFP_my 0.286
6 BARD1 GFP_my 0.157
7 BRCA1 GFP_my 0.275
8 BRCA1 GFP_my 0.230
9 U5200KD GFP_my 0.772
10 SETD5 GFP_my 0.364
11 TRIP11 GFP_wide 0.020
12 SIN3B GFP_wide 0.055
13 SF3B1 GFP_wide 0.071
14 PSMD14 GFP_wide 0.102
15 RAD51 GFP_wide 0.109
16 BARD1 GFP_wide 0.139
17 BRCA1 GFP_wide 0.146
18 BRCA1 GFP_wide 0.146
19 U5200KD GFP_wide 0.151
20 SETD5 GFP_wide 0.179
I want to create a heatmap in which the tiles are sorted by the GFP_wide values, so that the GFP_wide row changes gradually from green to red and the GFP_my row follows the same Entrez.Symbol order. So far I have a heatmap, but I can't find a way to sort the values based on GFP_wide. How do I do that?
This is my code and result:
ggplot(GFP_long, aes(x=Entrez.Symbol,y=variable,fill=value)) +
geom_tile(aes(fill = value))+
scale_fill_gradient(low="red", high="green")
Using the data from your other question:
library(dplyr); library(forcats); library(ggplot2) # please check these load successfully
GFP_long %>%
arrange(desc(variable), -value) %>%
mutate(Entrez.Symbol = fct_inorder(Entrez.Symbol)) %>%
ggplot(aes(x=Entrez.Symbol,y=variable,fill=value)) +
geom_tile(aes(fill = value))+
# I'm modifying here to show the top row is ordered; hard to see in original
scale_fill_gradient2(low="red", mid = "gray80", high="green", midpoint = 0.2)
Data (from the question "how to order column by a different column value r"):
# Name the columns with `=` so no stray global variables are created
GFP_long <- data.frame(
Entrez.Symbol = c("TRIP11","SIN3B","SF3B1","PSMD14","RAD51","BARD1",
"BRCA1","BRCA1","U5200KD","SETD5","TRIP11","SIN3B",
"SF3B1","PSMD14","RAD51","BARD1","BRCA1","BRCA1",
"U5200KD","SETD5"),
variable = c("GFP_my","GFP_my","GFP_my","GFP_my","GFP_my","GFP_my","GFP_my",
"GFP_my","GFP_my","GFP_my","GFP_wide","GFP_wide","GFP_wide","GFP_wide",
"GFP_wide","GFP_wide","GFP_wide","GFP_wide","GFP_wide","GFP_wide"),
value = c(1.015,0.336,0.315,0.254,0.286,0.157,0.275,0.230,0.772,0.364,
0.020,0.055,0.071,0.102,0.109,0.139,0.146,0.146,0.151,0.179))
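The reordering above relies on one idea: a discrete ggplot axis follows the factor levels, so it is enough to set the levels of Entrez.Symbol from the GFP_wide rows. A minimal base R sketch of that step, using a four-row subset of the question's data per variable (the names wide_rows and lvl are only illustrative):

```r
# Subset of the question's data (BRCA1 appears twice, as in the original)
GFP_long <- data.frame(
  Entrez.Symbol = rep(c("TRIP11", "SIN3B", "BRCA1", "BRCA1"), times = 2),
  variable      = rep(c("GFP_my", "GFP_wide"), each = 4),
  value         = c(1.015, 0.336, 0.275, 0.230,
                    0.020, 0.055, 0.146, 0.146)
)

# Take only the GFP_wide rows and sort them by value, largest first
wide_rows <- GFP_long[GFP_long$variable == "GFP_wide", ]
lvl <- unique(wide_rows$Entrez.Symbol[order(wide_rows$value, decreasing = TRUE)])

# Use that order as the factor levels; both heatmap rows will follow it
GFP_long$Entrez.Symbol <- factor(GFP_long$Entrez.Symbol, levels = lvl)
levels(GFP_long$Entrez.Symbol)
```

Passing the result to the original ggplot() call should then draw the tiles in this order; decreasing = FALSE gives the opposite direction.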
I have a data.frame that I want to filter based on whether the range from low to high contains zero. Here's an example:
head(toy)
# A tibble: 6 x 3
difference low high
<dbl> <dbl> <dbl>
1 0.0161 -0.143 0.119
2 0.330 0.0678 0.656
3 0.205 -0.103 0.596
4 0.521 0.230 0.977
5 0.328 0.177 0.391
6 -0.0808 -0.367 0.200
I could swear I have used dplyr::between() to do this kind of filtering operation a million times (even with columns of class datetime, where it warns about S3 objects). But I can't find what's wrong with this one.
# Does not find anything
toy %>%
filter(!dplyr::between(0, low, high))
# Maybe it's because it needs `x` to be a vector, using mutate
# Does not find anything
toy %>%
mutate(zero = 0) %>%
filter(!dplyr::between(zero, low, high))
# if we check the logic, all "keep" go to FALSE
toy %>%
mutate(zero = 0,
keep = !dplyr::between(zero, low, high))
# data.table::between works
toy %>%
filter(!data.table::between(0, low, high))
# regular logic works
toy %>%
filter(low > 0 | high < 0)
The data below:
> dput(toy)
structure(list(difference = c(0.0161058505175378, 0.329976207353122,
0.20517072042705, 0.520837282826481, 0.328289597476641, -0.0807728725339096,
0.660320444135006, 0.310679750033675, -0.743294517440579, -0.00665462977775899,
0.0890903981794149, 0.0643321993757249, 0.157453334405998, 0.107320325893175,
-0.253664041938671, -0.104025850079389, -0.284835573264143, -0.330557762091307,
-0.0300387610595219, 0.081297046765014), low = c(-0.143002432870633,
0.0677907794288728, -0.103344717845837, 0.229753302951895, 0.176601773133456,
-0.366899428200429, 0.403702557199546, 0.0216878391530755, -1.01129163487875,
-0.222395625167488, -0.135193611295608, -0.116654715121314, -0.168581379777843,
-0.281919444558125, -0.605918194917671, -0.364539852350809, -0.500147478407119,
-0.505906196974183, -0.233810558283787, -0.193048952382206),
high = c(0.118860787421672, 0.655558974886329, 0.595905673925067,
0.97748896372657, 0.391043536410999, 0.199727242557477, 0.914173497837859,
0.633804982827898, -0.549942089679123, 0.19745782761473,
0.340823604797603, 0.317956343103116, 0.501279107093568,
0.442497779066522, 0.0721480109893818, 0.280593530192991,
-0.0434862536882377, -0.229723776097642, 0.22550243301984,
0.252686968655449)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Just in case somebody finds it useful
> "between" %in% conflicts()
[1] FALSE
> packageVersion("dplyr")
[1] ‘1.0.2’
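dplyr::between() is not vectorized over its bounds: left and right are expected to be scalars, so only the first low/high pair appears to be used for every row. One thing you could do is:
toy %>%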
rowwise() %>%
filter(!dplyr::between(0, low, high))
difference low high
<dbl> <dbl> <dbl>
1 0.330 0.0678 0.656
2 0.521 0.230 0.977
3 0.328 0.177 0.391
4 0.660 0.404 0.914
5 0.311 0.0217 0.634
6 -0.743 -1.01 -0.550
7 -0.285 -0.500 -0.0435
8 -0.331 -0.506 -0.230
data.table::between() is vectorized: that's the reason why it works.
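If neither rowwise() nor data.table is wanted, an element-wise version is short enough to write by hand. The sketch below defines a hypothetical helper, between_vec() (not part of dplyr), and applies it to the first six rows of toy as printed in the question:

```r
# Hypothetical helper: compares x against left/right element by element
between_vec <- function(x, left, right) {
  x >= left & x <= right
}

# First six rows of `toy`, rounded as printed in the question
toy6 <- data.frame(
  difference = c(0.0161, 0.330, 0.205, 0.521, 0.328, -0.0808),
  low        = c(-0.143, 0.0678, -0.103, 0.230, 0.177, -0.367),
  high       = c(0.119, 0.656, 0.596, 0.977, 0.391, 0.200)
)

# Keep the rows whose [low, high] interval does not contain zero
kept <- toy6[!between_vec(0, toy6$low, toy6$high), ]
kept
```

This is the same test as the `low > 0 | high < 0` filter in the question, just written in the shape of between().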
We could use map2
library(dplyr)
library(purrr)
toy %>%
filter(!map2_lgl(low, high, ~ between(0, .x, .y)))
-output
# A tibble: 8 x 3
difference low high
<dbl> <dbl> <dbl>
1 0.330 0.0678 0.656
2 0.521 0.230 0.977
3 0.328 0.177 0.391
4 0.660 0.404 0.914
5 0.311 0.0217 0.634
6 -0.743 -1.01 -0.550
7 -0.285 -0.500 -0.0435
8 -0.331 -0.506 -0.230
I am fairly new to Stack Overflow and did not find this when searching. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. You can then use a regex to extract the desired rows. My approach feels rather clumsy, though, and I bet there are more elegant ways (maybe also with Unix command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
resp2 = resp2[!is.na(resp2)],
lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
A solution based on the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
enframe() %>%
filter(str_detect(value, 'Resp2|LAI')) %>%
transmute(value = str_squish(value)) %>%
separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36
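Base R alone can also handle this. The following sketch assumes that every entry contains exactly one Resp2 line and one LAI line, and that fields are separated by whitespace. The two-entry sample (including all of the second entry's values) is made up for illustration and stands in for readLines() on the real file:

```r
# Hypothetical sample; in practice: lines <- readLines("testfile.txt")
lines <- c(
  "AI_File  10", "Resp1    H****", "Resp2    1012", "LAI      4.36", "SEL      0.47",
  "AI_File  11", "Resp1    H****", "Resp2    1013", "LAI      3.90", "SEL      0.51"
)

# Keep only lines whose first field is exactly Resp2 or LAI
keep <- grep("^(Resp2|LAI)[[:space:]]", lines, value = TRUE)

# Split each kept line into its key and its value
fields <- strsplit(trimws(keep), "[[:space:]]+")
key <- vapply(fields, `[`, character(1), 1)
val <- vapply(fields, `[`, character(1), 2)

# Pair the values up, one row per entry
result <- data.frame(
  resp2 = val[key == "Resp2"],
  lai   = as.numeric(val[key == "LAI"])
)
result
```

Resp2 is kept as character here because nothing in the question says it is always numeric.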
What is wrong with my R code? I don't want the bar graph; I only want to see the plotting characters (pch):
plot(Graphdata$Sites,ylim=c(-1,2.5), xlab="sites",
ylab="density frequency",lwd=2)
points(Graphdata$A,pch=1,col="blue",lwd=2)
points(Graphdata$B,pch=2,col="green",lwd=2)
points(Graphdata$C,pch=3,col="red",lwd=2)
points(Graphdata$D,pch=4,col="orange",lwd=2)
legend("topright",
legend=c("A","B","C","D"),
col=c("red","blue","green","orange"),lwd=2)
Part of my data looks like:
Sites A B C D
1 A 2.052 2.268 1.828 1.474
2 B 0.549 0.664 0.621 1.921
3 C 0.391 0.482 0.400 0.382
4 D 0.510 0.636 0.497 0.476
5 A 0.214 0.239 0.215 0.211
6 B 1.016 1.362 0.978 0.876
......................................
.....................................
and I want the legend to show the plotting characters (pch) rather than lines.
The barplot is happening because you're plotting the factor of your Sites column. To only plot the points:
library(reshape2)
library(ggplot2)
#plots 6 categorical variables
Graphdata.m = melt(t(Graphdata[-1]))
ggplot(Graphdata.m,aes(Var2,value,group=Var2)) + geom_point() + ylim(-1,2.5)
If you want it to plot 4 categorical variables instead:
Graphdata.1 = t(Graphdata[-1])
colnames(Graphdata.1) = c("A","B","C","D","A","B")
Graphdata.m = melt(Graphdata.1)
ggplot(Graphdata.m,aes(Var2,value,group=Var2)) + geom_point() + ylim(-1,2.5)
EDIT:
In base R:
plot(1,xlim=c(1,4),ylim=c(-1,2.5), xlab="sites",ylab="density frequency",lwd=2,type="n",xaxt = 'n')
axis(1,at=c(1,2,3,4),tick=T,labels=c("A","B","C","D"))
points(Graphdata$A,pch=1,col="blue",lwd=2)
points(Graphdata$B,pch=1,col="green",lwd=2)
points(Graphdata$C,pch=1,col="red",lwd=2)
points(Graphdata$D,pch=1,col="orange",lwd=2)
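To get a legend that shows the plotting characters rather than lines, legend() can be given pch instead of lwd. A sketch along those lines, assuming Graphdata holds only the six rows shown in the question and that the x axis is simply the row index labelled with the Sites value (the named vectors cols and pchs are illustrative):

```r
# Stand-in for Graphdata: the six rows printed in the question
Graphdata <- data.frame(
  Sites = c("A", "B", "C", "D", "A", "B"),
  A = c(2.052, 0.549, 0.391, 0.510, 0.214, 1.016),
  B = c(2.268, 0.664, 0.482, 0.636, 0.239, 1.362),
  C = c(1.828, 0.621, 0.400, 0.497, 0.215, 0.978),
  D = c(1.474, 1.921, 0.382, 0.476, 0.211, 0.876)
)

# One colour and one plotting character per series, kept together by name
series <- c("A", "B", "C", "D")
cols <- c(A = "blue", B = "green", C = "red", D = "orange")
pchs <- c(A = 1, B = 2, C = 3, D = 4)

# Empty frame first, then the points of each series
plot(1, type = "n", xlim = c(1, nrow(Graphdata)), ylim = c(-1, 2.5),
     xlab = "sites", ylab = "density frequency", xaxt = "n")
axis(1, at = seq_len(nrow(Graphdata)), labels = Graphdata$Sites)
for (s in series) {
  points(Graphdata[[s]], pch = pchs[[s]], col = cols[[s]], lwd = 2)
}

# pch (not lwd) makes the legend show symbols instead of lines
lg <- legend("topright", legend = series, pch = pchs, col = cols, pt.lwd = 2)
```

Keeping the colours and symbols in named vectors means the legend cannot drift out of step with the points() calls.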
I have to plot data from immunized animals in a way that visualizes possible correlations with protection. As background, when we vaccinate an animal it produces antibodies, which might or might not be linked to protection. We immunized cattle with 9 different proteins and measured antibody titers, which go up to 1.5 (optical density, O.D.). We also measured tick load, which goes up to 5000. Each animal has different titers for each protein and a different tick load; some proteins may be more important for protection than others, and we think that a heatmap could illustrate this.
TL;DR: Plot a heatmap with one variable (Ticks) that goes from 6 up to 5000, and another variable (Prot1 to Prot9) that goes up to 1.5.
A sample of my data:
Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
G1-54-102 control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
G1-130-102 control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
G1-133-102 control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
G3-153-102 vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
G3-200-102 vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
G3-807-102 vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035
I have little knowledge of R, but I'm really excited to learn more about it. So feel free to post whatever code you want and I will try my best to understand it.
Thank you in advance.
Luiz
Here is an option to use the ggplot2 package to create a heatmap. You will need to convert your data frame from wide format to long format. It is also important to convert the Ticks column from numeric to factor if the numbers are discrete.
library(tidyverse)
library(viridis)
dat2 <- dat %>%
gather(Prot, Value, starts_with("Prot"))
ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
geom_tile() +
scale_fill_viridis()
DATA
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
'G1-54-102' control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
'G1-130-102' control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
'G1-133-102' control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
'G3-153-102' vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
'G3-200-102' vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
'G3-807-102' vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035",
header = TRUE, stringsAsFactors = FALSE)
In recent versions of ggplot2 (3.0.0 and later), you don't even need to load the viridis package explicitly. The scale is included as scale_fill_viridis_c(). Exciting times!
How can I change the y axis to exclude outliers (not just hide them but scale the y axis so as not to include them) for geom_boxplot with multiple individual boxplots using facet_wrap? An example of my dataset is:
Pop. grp1 grp2 grp3 grp4 grp5 grp6 grp7 grp8
a 0.00652 1.27 0.169 0.859 0.388 0.521 3.58 0.0912
a 0.0133 0.136 0.154 0.167 0.845 0.159 0.561 0.108
a 0.0270 1.60 0.119 0.515 0.0386 0.0145 0.884 0.0155
b 0.00846 0.331 0.100 0.897 0.330 2.52 0.663 0.0338
b 0.0154 0.0997 0.122 0.0873 0.905 0.136 0.413 0.139
b 0.0353 0.536 0.171 0.471 0.0280 0.00608 0.414 0.00973
where I'd like to make a boxplot for each column showing populations a and b.
I've melted the data by population and then used geom_boxplot + facet_wrap but some outliers are so far above the whiskers that the boxes themselves barely show. The code I've used is:
wc.m <- melt(w_c_diff_ab, id.var="Pop.")
p.wc <- ggplot(data = wc.m, aes(x=variable, y=value)) + geom_boxplot(aes(fill=Pop.))
p.wc + facet_wrap( ~ variable, scales="free") + scale_fill_manual(values=c("skyblue", "violetred1"))
but I'm struggling to remove outliers as I'm not sure how to calculate limits for the y axes on a per-boxplot basis.
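One possible starting point for per-boxplot limits is boxplot.stats(), whose stats component holds the whisker ends after outliers have been set aside. The sketch below uses two columns of the sample above and a hypothetical helper, whisker_range(). It only computes the limits and makes no claim about how best to feed them into facet_wrap():

```r
# Two columns of the sample data from the question
w_c_diff_ab <- data.frame(
  Pop. = rep(c("a", "b"), each = 3),
  grp1 = c(0.00652, 0.0133, 0.0270, 0.00846, 0.0154, 0.0353),
  grp7 = c(3.58, 0.561, 0.884, 0.663, 0.413, 0.414)
)

# Hypothetical helper: lower and upper whisker ends of one box
whisker_range <- function(x) boxplot.stats(x)$stats[c(1, 5)]

value_cols <- setdiff(names(w_c_diff_ab), "Pop.")

# For each column, take the widest whisker span across the populations
limits <- t(sapply(value_cols, function(v) {
  per_pop <- tapply(w_c_diff_ab[[v]], w_c_diff_ab$Pop., whisker_range)
  range(unlist(per_pop))
}))
colnames(limits) <- c("lower", "upper")
limits
```

With only three values per population nothing is flagged as an outlier here, so the limits equal the data range; on the full data set they would be narrower wherever outliers exist.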