How to use new grouped together values in a plot

I have combined different values that mean the same thing, stored under the variable "Wine_type":
Mock Neg Neg1PCR NegPBS red Red RedWine water Water white White
1 9 1 1 2 18 4 3 4 2 24
into
Mock Neg Neg1PCR NegPBS Redwine Water Whitewine
1 9 1 1 24 7 26
using this code:
dat2 <- data.frame(sample_data(psdata.r), stringsAsFactors = FALSE)
dat2$Project <- as.character(dat2$Wine_type)
table(dat2$Project)
# ignore.case = TRUE makes one pattern per group sufficient
dat2[grepl("water", dat2$Project, ignore.case = TRUE), "Project"] <- "Water"
dat2[grepl("white", dat2$Project, ignore.case = TRUE), "Project"] <- "Whitewine"
dat2[grepl("red", dat2$Project, ignore.case = TRUE), "Project"] <- "Redwine"
Then I produce a plot with this code:
plot_richness(psdata.r, measures = c("Observed","Shannon"), x = "Wine_type", color = "SampleType") + geom_boxplot()
The only problem is that the plot still shows the old values. What am I missing to make the plot use the new grouped values?
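One likely explanation, offered as a sketch rather than a confirmed diagnosis: the recoded Project column exists only in the copy dat2, while plot_richness() still reads Wine_type from psdata.r. The recoding itself can be checked in base R on a stand-in vector (wine_type below is hypothetical sample data). The commented lines at the end show how the new column might be written back through phyloseq's sample_data() accessor and then used as x; they are not run here and assume the object names from the question.

```r
# Hypothetical stand-in for dat2$Wine_type
wine_type <- c("Mock", "Neg", "red", "Red", "RedWine",
               "water", "Water", "white", "White")

# Recode every spelling variant to one label per group
project <- wine_type
project[grepl("red",   project, ignore.case = TRUE)] <- "Redwine"
project[grepl("water", project, ignore.case = TRUE)] <- "Water"
project[grepl("white", project, ignore.case = TRUE)] <- "Whitewine"
table(project)

# With phyloseq (not run here; assumes psdata.r and dat2 from the question):
# sample_data(psdata.r)$Project <- dat2$Project
# plot_richness(psdata.r, measures = c("Observed", "Shannon"),
#               x = "Project", color = "SampleType") + geom_boxplot()
```

The key point of the sketch is that the plot has to be pointed at the new column, and the new column has to be stored in the object that is being plotted.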

Related

Can I separate two groups of vertices in an arc plot in ggraph/ggplot2 in R?

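I have built a graph of two distinct sets of electrodes with the following code:
library(igraph)     # get.edgelist(), get.vertex.attribute(), degree(), vcount()
library(tibble)     # as_tibble()
library(tidygraph)  # tbl_graph()
library(ggraph)     # ggraph() and the geom_edge_*/geom_node_* layers
elec <- c("Fp1","Fp2","F7","F3","Fz","F4","F8","FC5","FC1","FC2","FC6","C3","Cz","C4","CP5","CP1","CP2","CP6","P7","P3","Pz","P4","P8","POz","Oz",
          "Fp1","Fp2","F7","F3","Fz","F4","F8","FC5","FC1","FC2","FC6","C3","Cz","C4","CP5","CP1","CP2","CP6","P7","P3","Pz","P4","P8","POz","Oz")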
edgelist <- get.edgelist(net)
# get vertex labels
label <- get.vertex.attribute(net, "name")
# get vertex groups
group <- get.vertex.attribute(net, "group")
# get vertex fill color
fill <- get.vertex.attribute(net, "color")
# get family
family <- get.vertex.attribute(net, "family")
# get vertex degree
degrees <- degree(net)
# data frame with groups, degree, labels and id
nodes <- data.frame(group, degrees, family, label, fill, id=1:vcount(net))
nodes$family <- factor(nodes$family, levels = unique(nodes$family))
nodes$label <- factor(nodes$label, levels = unique(nodes$label))
nodes <- as_tibble(nodes)
# prepare data for edges
edges <- as_tibble(edgelist)
net.tidy <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE, node_key = "label")
ggraph(net.tidy, layout = "linear") +
  geom_edge_arc(alpha = 0.5) +
  scale_edge_width(range = c(0.2, 2)) +
  scale_colour_manual(values = vrtxc) +
  geom_node_point(aes(size = degrees, color = family)) +
  geom_node_text(aes(label = elec), angle = 90, hjust = 1, nudge_y = -0.5, size = 3) +
  coord_cartesian(clip = "off") +
  theme_graph() +
  theme(legend.position = "top")
And I got a great graph that I like.
However, I would like to separate the two sets of electrodes slightly in the middle, where Oz is, so that the difference between them is visible. In the node attributes they are distinguished by group (1, 2). Can this information be used to spread the two sets of vertices along the x axis, with one set of 25 electrodes on the left and the other on the right, leaving a gap between them?
Here is a look at the data, in case it's useful:
> nodes
# A tibble: 50 x 6
   group degrees family label fill       id
   <fct>   <dbl> <fct>  <fct> <fct>   <int>
 1 1           5 fronp  Fp1_1 #3B9AB2     1
 2 1           9 fronp  Fp2_1 #3B9AB2     2
 3 1           6 fron   F7_1  #5DAABC     3
 4 1           7 fron   F3_1  #5DAABC     4
 5 1          11 fron   Fz_1  #5DAABC     5
 6 1           9 fron   F4_1  #5DAABC     6
 7 1          11 fron   F8_1  #5DAABC     7
 8 1           8 fronc  FC5_1 #88BAAE     8
 9 1           6 fronc  FC1_1 #88BAAE     9
10 1           4 fronc  FC2_1 #88BAAE    10
# … with 40 more rows

In case someone is interested, I ended up inserting some dummy rows in the nodes dataframe and it did the trick.
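The dummy-row idea could look roughly like the sketch below. It is not the poster's actual code: the small nodes table, the gap_ labels and n_gap are all made up for illustration. Edge-free placeholder rows are placed between the two groups so that a linear layout leaves empty positions there. Since the graph in the question is keyed by label (node_key = "label"), edges given by label should not need re-indexing, but that is an assumption to verify on the real data.

```r
# Hypothetical stand-in for the real node table: two groups of three electrodes
nodes <- data.frame(
  group   = factor(rep(1:2, each = 3)),
  degrees = c(5, 9, 6, 4, 7, 8),
  label   = c("Fp1_1", "Fp2_1", "Oz_1", "Fp1_2", "Fp2_2", "Oz_2"),
  stringsAsFactors = FALSE
)

n_gap <- 2  # number of empty positions between the groups (assumption)

# Placeholder rows: no group, no degree, unique labels so they can act as keys
dummy <- data.frame(
  group   = factor(NA, levels = levels(nodes$group)),
  degrees = NA_real_,
  label   = paste0("gap_", seq_len(n_gap)),
  stringsAsFactors = FALSE
)

# Group 1, then the gap, then group 2; renumber ids to follow the new order
nodes_gapped <- rbind(nodes[nodes$group == 1, ], dummy, nodes[nodes$group == 2, ])
rownames(nodes_gapped) <- NULL
nodes_gapped$id <- seq_len(nrow(nodes_gapped))
nodes_gapped
```

When plotting, the placeholder rows would also need blank text labels, for example by mapping the label to an empty string wherever group is NA.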

Using barplot in R studio

I am trying the following code for a barplot. L$neighbourhood is the apartment's neighbourhood in Paris (for example Champs-Élysées or Batignolles), stored as strings, and L$price is the numeric apartment price.
barplot(L$neighbourhood, L$price, main = "TITLE", xlab = "Neighbourhood", ylab = "Price")
But I get an error:
Error in barplot.default(L$neighbourhood, L$price, main = "TITLE",
xlab = "Neighbourhood", : 'height' must be a vector or a matrix
Can string data not be used as input to the barplot function in R? How can I fix this error?

It is not clear what you want to plot. Let's assume you want to see the average price per neighborhood; barplot needs numeric heights, so the strings can only serve as bar labels. If that's what you're after, you can proceed like this.
First, some illustrative data:
set.seed(123)
Neighborhood <- sample(LETTERS[1:4], 10, replace = TRUE)
Price <- sample(10:100, 10, replace = TRUE)
df <- data.frame(Neighborhood, Price)
df
   Neighborhood Price
1             C    23
2             C    34
3             C    99
4             B   100
5             C    78
6             B   100
7             B    66
8             B    18
9             C    81
10            A    35
Now compute the averages by neighborhood using the function aggregate and store the result in a new data frame:
df_new <- aggregate(x = df$Price, by = list(df$Neighborhood), FUN = mean)
df_new
  Group.1  x
1       A 35
2       B 71
3       C 63
Finally, plot the average prices in variable x and add the neighborhood names from the Group.1 column:
barplot(df_new$x, names.arg = df_new$Group.1)
An even simpler solution uses tapply and mean:
df_new <- tapply(df$Price, df$Neighborhood, mean)
barplot(df_new, names.arg = names(df_new))
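As a variation on the same idea, the formula interface of aggregate keeps the original column names instead of Group.1 and x. The sketch below types in the ten rows printed above by hand rather than re-sampling them, so the averages do not depend on the R version's random number generator.

```r
# The illustrative data from above, entered explicitly
df <- data.frame(
  Neighborhood = c("C", "C", "C", "B", "C", "B", "B", "B", "C", "A"),
  Price        = c(23, 34, 99, 100, 78, 100, 66, 18, 81, 35),
  stringsAsFactors = FALSE
)

# Mean price per neighborhood; result columns are named Neighborhood and Price
avg <- aggregate(Price ~ Neighborhood, data = df, FUN = mean)
avg

# Numeric heights, with the neighborhood names as bar labels
barplot(avg$Price, names.arg = as.character(avg$Neighborhood),
        xlab = "Neighborhood", ylab = "Average price")
```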

Find and visualize best and worst items using boxplot

I am using Dataset 2 (jester_dataset_2.zip) of jokes from the Jester project, and I would like to divide the jokes into groups with similar ratings and visualize the results appropriately.
The data look like this:
> str(tabulka)
'data.frame': 1761439 obs. of 3 variables:
 $ User  : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Joke  : int 5 7 8 13 15 16 17 18 19 20 ...
 $ Rating: num 0.219 -9.281 -9.281 -6.781 0.875 ...
Here is a subset of Dataset 2.
> head(tabulka)
  User Joke Rating
1    1    5  0.219
2    1    7 -9.281
3    1    8 -9.281
4    1   13 -6.781
5    1   15  0.875
6    1   16 -9.656
I found out I can't use ANOVA because the assumption of homogeneity of variance is violated. Hence I am using the Kruskal–Wallis test from the agricolae package in R.
KWtest <- with(tabulka, kruskal(Rating, Joke))
Here are the groups.
> head(KWtest$groups)
  trt   means  M
1  53 1085099  a
2 105 1083264  a
3  89 1077435 ab
4 129 1072706  b
5  35 1070016 bc
6  32 1062102  c
The thing is, I don't know how to visualize the joke groups appropriately. I am using a boxplot to show the distribution of ratings for each joke.
barvy <- c("yellow", "grey")
boxplot(Rating ~ Joke, data = tabulka,
        col = barvy,
        xlab = "Joke",
        ylab = "Rating",
        ylim = c(-7, 7))
It would be nice to color each box (each joke) according to the group assigned to it by the KW test.
How could I do that? Or is there some better way to find the best and the worst jokes in the dataset?

Interesting question. It's easy to color each box according to the group the joke belongs to. However, I think this is just an intermediate solution; there must be a better visualization for these data. So it is certainly not the best one, but here is my version:
library(tidyverse)
# download data (jokes, part 1) to a temporary file and unzip
tmp <- tempfile()
download.file("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip", tmp)
tmp <- unzip(tmp)
# read data from the unzipped file
vtipy <- readxl::read_excel(tmp, col_names = FALSE, na = '99')
# reshape to long format and clean
vtipy <- vtipy %>%
  mutate(user = 1:n()) %>%
  gather(key = 'joke', value = 'rating', -c('..1', 'user')) %>%
  rename(n = '..1') %>%
  filter(!is.na(rating)) %>%
  mutate(joke = as.character(as.numeric(gsub('\\.+', '', joke)) - 1)) %>%
  select(user, n, joke, rating)
# your code
KWtest <- with(vtipy, agricolae::kruskal(rating, joke))
# join groups from KWtest to original data, clean and plot
KWtest$groups %>%
  rownames_to_column('joke') %>%
  select(joke, groups) %>%
  right_join(vtipy, by = 'joke') %>%
  mutate(joke = stringi::stri_pad_left(joke, 3, '0')) %>%
  ggplot(aes(x = joke, y = rating, fill = groups)) +
  geom_boxplot(show.legend = FALSE) +
  scale_x_discrete(breaks = stringi::stri_pad_left(c(1, seq(5, 100, by = 5)), 3, '0')) +
  ggthemes::theme_tufte() +
  labs(x = 'Joke', y = 'Rating')
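For anyone who prefers to stay with the base boxplot from the question, the same coloring can be sketched by passing one color per box through col. Everything in the example is made up for illustration: the ratings are random, and the group letters in joke_groups only imitate the kind of letters kruskal() reports. They are not results from the Jester data.

```r
set.seed(1)
# Hypothetical ratings for four jokes
ratings <- data.frame(Joke = rep(1:4, each = 20), Rating = rnorm(80))

# Hypothetical group letter per joke, named by joke id
joke_groups <- c("1" = "a", "2" = "ab", "3" = "b", "4" = "a")

# One color per distinct group letter
group_colors <- c(a = "gold", ab = "grey70", b = "steelblue")

# boxplot() draws boxes in factor-level order, so look colors up in that order
box_levels <- levels(factor(ratings$Joke))
box_cols <- group_colors[joke_groups[box_levels]]

boxplot(Rating ~ Joke, data = ratings, col = box_cols,
        xlab = "Joke", ylab = "Rating")
legend("topright", legend = names(group_colors), fill = group_colors, title = "Group")
```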

How to combine multiple variable data to a single variable data?

After making my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet that acts as my data source was filled in by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data were entered differently, the major groups (Redwine, Whitewine and Water) have been split into subgroups. How do I combine the subgroups into one group, e.g. red+Red+RedWine -> Total wine? I use the phyloseq package for this kind of dataset.

groups <- c("red", "white", "water")
df2 <- setNames(data.frame(matrix(ncol = length(groups), nrow = nrow(df))), groups)
for (col in groups) {
  # drop = FALSE keeps a data.frame even if only one column matches
  df2[, col] <- rowSums(df[, grep(col, tolower(names(df))), drop = FALSE])
}
Here,
grep(col, tolower(names(df)))
finds all columns whose lower-cased name contains the string, e.g. "red". Those columns are then summed row-wise into the new data.frame df2, which was created with the right dimensions.

I would just create a new data.frame. This is easiest with dplyr, but also doable in base R.
With dplyr (transmute keeps only the newly defined columns):
newFrame <- oldFrame %>% transmute(Mock = Mock, Neg = Neg + Neg1PCR + Neg2PCR + NegPBS, Red = red + Red + RedWine, Water = water + Water, White = white + White)
With base R (not complete, but you get the point):
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)

One can use dplyr::select together with dplyr::starts_with to combine columns. ignore.case is TRUE by default in starts_with, which helps with the data.frame the OP has posted.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)
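For comparison, a base R sketch of the same prefix-matching idea, using the data posted above. The names prefixes and combined are only illustrative. Anchoring the pattern with ^ matches the start of the column name, which mirrors starts_with more closely than a plain substring search.

```r
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)

prefixes <- c("red", "white", "water")

# For each prefix, sum all columns whose name starts with it (case-insensitive)
combined <- as.data.frame(lapply(setNames(prefixes, prefixes), function(p) {
  cols <- grepl(paste0("^", p), names(df), ignore.case = TRUE)
  rowSums(df[, cols, drop = FALSE])
}))
combined
```

With the single row posted, this gives 24 for red, 28 for white and 8 for water, the same totals as the dplyr version.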

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!

Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now select only rows where the value of our measured variable 'Val' is above the bottom quartile (25th percentile)
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
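For the follow-up in the question (dropping the whole row when either value is at or below its own lower quartile), the same logical test can be applied per column and combined with &. The sketch below uses the sample data from the question. keep_above_quantile is a made-up helper name, and p is the threshold that could be varied.

```r
Values <- read.table(text = "Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352", header = TRUE)

# Hypothetical helper: keep rows where every listed column exceeds its own p-quantile
keep_above_quantile <- function(data, cols, p = 0.25) {
  tests <- lapply(cols, function(col) data[[col]] > quantile(data[[col]], p))
  data[Reduce(`&`, tests), , drop = FALSE]
}

kept <- keep_above_quantile(Values, c("Abundance_O", "Abundance_S"), p = 0.25)
cor(kept$Abundance_O, kept$Abundance_S)
```

If the script is launched with Rscript, the threshold could presumably be taken from the command line in the same way as the input file name, e.g. p <- as.numeric(commandArgs(trailingOnly = TRUE)[1]), assuming it is passed as the first argument.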

Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function that converts a numerical vector into its quantile groups. The parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  # count how many lower quantile boundaries each value has reached
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = 1:20
qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
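As a side note, the counting done inside sapply looks equivalent to what base R's findInterval computes, so a vectorized variant can be sketched as below. qgroup_fast is a made-up name, and the equivalence is only checked here on the example vector.

```r
# The function from above, repeated so the comparison is self-contained
qgroup <- function(numvec, n = 4) {
  qtile <- quantile(numvec, probs = seq(0, 1, 1/n))
  sapply(numvec, function(x) sum(x >= qtile[-(n + 1)]))
}

# Vectorized sketch: findInterval() counts, for each value, how many of the
# sorted lower boundaries are less than or equal to it
qgroup_fast <- function(numvec, n = 4) {
  qtile <- quantile(numvec, probs = seq(0, 1, 1/n))
  findInterval(numvec, qtile[-(n + 1)])
}

v <- 1:20
all(qgroup(v) == qgroup_fast(v))
```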
Consider now the following data:
library(data.table)
dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863  3  1
2: 0.70373594 0.4389152  3  2
3: 0.04604934 0.5301261  1  3
4: 0.10476643 0.1108709  1  1
5: 0.76907762 0.4913463  4  2
6: 0.38265848 0.9291649  2  4
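Lastly, we only include rows for which both quartile groups are above the first quartile. Both conditions have to hold, so they are tested separately rather than through the sum Q0 + Q1:
dt = dt[Q0 > 1 & Q1 > 1]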