Barplot with threshold - r

I have a huge data frame consisting of binary values (extract):
id,topic,w_hello,w_apple,w_tomato
1,politics,1,1,0
2,sport,0,1,0
3,politics,1,0,1
With:
barplot(col_prefix_matrix)
I plot the number of their occurrences:
As there are many columns, the plot looks very confusing.
Would it be possible to plot only those columns whose count reaches a given threshold, say 5, to make the plot clearer?
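One possible approach, sketched here under assumptions: count the occurrences per word column with colSums() and keep only the columns at or above the threshold before calling barplot(). The data frame below just re-creates the extract from the question, "threshold" is read as "at least this many occurrences", and the threshold is set to 2 only because the extract has three rows (the question suggests 5 for the full data).

```r
# Hypothetical re-creation of the extract shown in the question
df <- data.frame(
  id       = 1:3,
  topic    = c("politics", "sport", "politics"),
  w_hello  = c(1, 0, 1),
  w_apple  = c(1, 1, 0),
  w_tomato = c(0, 0, 1)
)

# Assumption: the word columns are the ones prefixed with "w_"
word_cols   <- grep("^w_", names(df), value = TRUE)
occurrences <- colSums(df[word_cols])

# Keep only columns that occur at least `threshold` times
threshold <- 2  # 5 in the question; 2 here because the extract is tiny
frequent  <- occurrences[occurrences >= threshold]

barplot(frequent, las = 2)  # las = 2 turns the column labels vertical
```

With many columns it may also help to sort the counts first, e.g. barplot(sort(frequent, decreasing = TRUE), las = 2).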

Related

I have labelled the data matrix for PCA. How to colour them according to each label in PCA using r?

My data matrix has 100 rows and 900 columns. Each row represents an IR spectrum and each column a wavenumber. The first 23 rows are IR spectra from the same sample (i.e. spectra from 23 different positions in the sample). In the same way, I have measured 5 samples, each with a certain number of observations. For example, rows 1-23 belong to sample 1 and rows 24-40 belong to sample 2. Now I want to colour the scores in my PCA score plot by sample and add a legend entry with the sample name, e.g. 23 scores in blue with a label reading Sample 1.
I have added an extra column named label to my data matrix that holds the sample names, but I do not know how to proceed.
I used the packages "factoextra" and "sf" for this. Here df is the data frame that contains the data for PCA, to which I added another column holding the labels of my data. In the code, col.ind = df$lab.id means that the labelling id (labels) is used as the colour index, so in the resulting PCA score plot the scores are colour-coded according to their labels.
fviz_pca_ind(PCA, axes = c(1, 2), title = "PC1 vs PC2", label = "none",
             geom.ind = "point", col.ind = df$lab.id, palette = "lancet",
             addEllipses = FALSE, ellipse.level = 0.95, pointsize = 2,
             repel = TRUE,  # avoid text overlapping
             legend.title = "Disease ", mean.point = FALSE,
             xlab = paste0("PC1: ", round(Variance_xplained[1] * 100, 1), "%"),
             ylab = paste0("PC2: ", round(Variance_xplained[2] * 100, 1), "%")
)
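For readers without factoextra, the same idea (use the label column as the colour index) can be sketched in base R with prcomp(). Everything in this block is an assumption made for illustration: the random matrix stands in for the spectra, only two samples are simulated, and the colours are arbitrary.

```r
set.seed(42)

# Hypothetical stand-in for the spectra: 40 rows (spectra) x 20 columns
spectra <- matrix(rnorm(40 * 20), nrow = 40)

# Rows 1-23 belong to sample 1, rows 24-40 to sample 2 (as in the question)
label <- factor(rep(c("Sample 1", "Sample 2"), times = c(23, 17)))

pca    <- prcomp(spectra, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Share of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# One colour per factor level, looked up by the integer code of each label
palette_cols <- c("blue", "red")
cols <- palette_cols[as.integer(label)]

plot(scores, col = cols, pch = 19,
     xlab = paste0("PC1: ", round(var_explained[1] * 100, 1), "%"),
     ylab = paste0("PC2: ", round(var_explained[2] * 100, 1), "%"))
legend("topright", legend = levels(label), col = palette_cols, pch = 19)
```

The legend is built from levels(label), so colours and sample names stay in the same order.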

Data from two data frames in one plot (R)

I've got two data frames in R, both of the same structure - with columns named: Year, Age, Gender and Value1.
What I'd like to do is plot (as points) Value1 (on the Y axis) against Year (on the X axis) for a particular gender and age. The plot should consist of points from both data frames, with a legend indicating which points come from which data frame.
What I've done is:
attach(df1)
plot(Value1[Gender=="Female" & Age==30] ~ Year[Gender=="Female" & Age==30])
which creates the plot with points from one data frame. The question is how to add the points from the second data frame to the same plot, and how to create a proper legend. I tried a few combinations of the points() formula, but it did not help.
Without a reproducible example it is not easy to help. Assuming your data frames are called df1 and df2, you can try this:
library(ggplot2)
library(dplyr)
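# tag each row with the data frame it came from
df1$frame <- "1"
df2$frame <- "2"
# stack both data frames and keep the gender/age of interest
df <- rbind(df1, df2)
df <- filter(df, Gender == "Female" & Age == 30)
# colour the points by source data frame; ggplot adds the legend
ggplot(data = df, aes(x = Year, y = Value1, col = frame)) + geom_point()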
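Since the question started from base plot() and points(), here is a rough base-R sketch of the same plot. The two small data frames are made up for illustration and only mimic the column names given in the question; the colours and symbols are arbitrary.

```r
# Hypothetical data with the columns named in the question
df1 <- data.frame(Year = rep(2000:2002, 2), Age = rep(c(30, 40), each = 3),
                  Gender = "Female", Value1 = c(10, 12, 15, 20, 21, 23))
df2 <- data.frame(Year = rep(2000:2002, 2), Age = rep(c(30, 40), each = 3),
                  Gender = "Female", Value1 = c(11, 14, 13, 19, 22, 25))

# Filter each data frame separately (no attach() needed)
sel1 <- subset(df1, Gender == "Female" & Age == 30)
sel2 <- subset(df2, Gender == "Female" & Age == 30)

# Set the axis limits from both data frames so no point is cut off
plot(Value1 ~ Year, data = sel1, pch = 19, col = "blue",
     xlim = range(sel1$Year, sel2$Year),
     ylim = range(sel1$Value1, sel2$Value1))
points(Value1 ~ Year, data = sel2, pch = 17, col = "red")
legend("topleft", legend = c("df1", "df2"),
       col = c("blue", "red"), pch = c(19, 17))
```

The xlim/ylim arguments matter here: plot() sizes the axes from the first data frame only, so points from the second one can otherwise fall outside the plotting region.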

Count line graphs in ggplot2

I am new to R/ggplot2 and am trying to create a line graph of counts (or percentages, it doesn't matter) for responses to 6 stimuli in ggplot in R. The stimuli should go across the x-axis, and the count on the y-axis. One line will represent the number of participants who responded with a preposition, and the other line will represent the number of participants who responded with a number.
I believe that ggplot with geom_line() requires an x and y (where y is the count or percentage).
Should I create a new data frame with counts so that I can use ggplot? A subquestion would then be how to count responses based on the stimulus data, i.e. how to count responses based on another column in the data frame (how many preposition responses for stimulus 1, how many number responses for stimulus 1, how many preposition responses for stimulus 2, etc.). Maybe with some kind of if statement?
or
Is there a way to automatically produce these counts in ggplot?
Of course, it's entirely possible that I'm going about this the wrong way entirely.
I've tried to search this, but cannot find anything. Thank you so much.
As I said in my comment, I ended up creating a frequency table and using ggplot to plot the resulting data frame. Here's the code below!
library(ggplot2)
# creates data frame
resp <- c("number", "number", "preposition", "number")
sound <- c(1, 1, 2, 2)
df <- data.frame(resp, sound)
# creates frequency table (proportions within each sound)
freq.table <- prop.table(xtabs(~ resp + sound, data = df), 2)
freq.table.df <- as.data.frame(freq.table)
# plots lines based on frequency
ggplot(freq.table.df, aes(sound, Freq, group = resp, color = resp)) +
  geom_line()
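For the subquestion about counting responses per stimulus, no if statement is needed: table() cross-tabulates two columns directly. The sketch below reuses the toy data from the answer and produces raw counts rather than proportions; the name counts is only illustrative.

```r
# Same toy data as in the answer above
df <- data.frame(
  resp  = c("number", "number", "preposition", "number"),
  sound = c(1, 1, 2, 2)
)

# Cross-tabulate response type by stimulus, then flatten to a data frame
# with one row per (resp, sound) combination and a Freq column of counts
counts <- as.data.frame(table(resp = df$resp, sound = df$sound))
print(counts)
```

A data frame of this shape (columns resp, sound, Freq) has the same layout as a flattened proportion table, so it can be plotted with the same kind of ggplot() call. Combinations that never occur appear with a Freq of 0 instead of being dropped.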

R - how to make barplot plot zeros for missing values over the data range?

Let's say I have 10 observations sampled from 200 integers between one and ten:
mysample <- sample(rep(seq(1, 10), 20), 10)
and I want to barplot it
barplot(table(mysample))
In this example, there are no observations of 7. Is there a quick way of telling barplot to set the x-axis range to all integers between 1 and 10, or do I have to manually edit the table?
Try
barplot(table(factor(mysample, levels = 1:10)))
By using a factor with explicit levels, R will know which levels are "missing" and table() will report them with a count of zero.
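A small check of that behaviour, using a fixed sample (made up here so that 7 is reliably absent, instead of relying on a random draw):

```r
# Hypothetical fixed sample with no observation of 7
mysample <- c(1, 2, 2, 3, 4, 5, 6, 8, 9, 10)

# Without explicit levels, 7 is simply absent from the table
plain <- table(mysample)

# With explicit levels, every integer from 1 to 10 gets an entry
counts <- table(factor(mysample, levels = 1:10))

barplot(counts)  # the bar for 7 is drawn with height zero
```

Here plain has 9 entries while counts has 10, and the entry for 7 in counts is 0.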

How to visualize (value, count) dataset with thousands data points

I have a file with 2 numeric columns: value and count. The file may have more than 5000 rows. I do plot(value, count) to find the shape of the distribution, but because there are too many data points the picture is not very clear.
Do you know a better visualization approach? Perhaps a histogram or a barplot that groups close values on the x axis would be a better way to look at the data? I cannot figure out the syntax for using a histogram or barplot in my case.
If you want to relate the two (continuous) quantities value and count to each other, then you want to do a scatterplot. The problem is that if you have too many observations, the points will overlap and the plot ends up as a big opaque mass with a few scattered outliers. There are a couple of ways to solve this:
Use a smaller plotting symbol: plot(value, count, pch=".")
Plot the data points with a transparency factor: plot(value, count, col=rgb(0, 0, 1, alpha=0.1))
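The grouping idea raised in the question (a barplot over bins of close values) could be sketched as follows. The simulated value and count vectors, the range 0 to 100, and the bin width of 10 are all assumptions made for illustration; real data would need breaks chosen to fit its range.

```r
set.seed(1)

# Hypothetical data standing in for the two columns of the file
value <- runif(5000, min = 0, max = 100)
count <- rpois(5000, lambda = 5)

# Assign each value to one of ten equal-width bins
bins <- cut(value, breaks = seq(0, 100, by = 10), include.lowest = TRUE)

# Add up the counts that fall into each bin
binned <- tapply(count, bins, sum)

barplot(binned, las = 2, xlab = "value (binned)", ylab = "total count")
```

hist() is not a direct fit here because it counts rows rather than summing the count column, which is why the totals are built with cut() and tapply() first.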
Why not plot a subset of the data? For example, plot the counts associated with values corresponding to the 5th, 10th, ..., 90th, 95th percentiles, e.g.,
value.subset <- quantile(value, seq(0, 1, 0.05))
Then plot the quantiles against their respective counts.
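One way this might be carried through, as a sketch on simulated data: by default quantile() interpolates between observations, so its results need not be values that actually occur in the data. Using type = 1 (an assumption added here, not part of the answer above) returns observed values, which can then be matched back to their counts.

```r
set.seed(1)

# Hypothetical data standing in for the two columns of the file
value <- rnorm(5000)
count <- rpois(5000, lambda = 5)

# 5th, 10th, ..., 95th percentiles; type = 1 returns observed data values
probs        <- seq(0.05, 0.95, by = 0.05)
value.subset <- quantile(value, probs, type = 1)

# Look up the row of each selected value to get its count
idx <- match(value.subset, value)

plot(value[idx], count[idx], pch = 19,
     xlab = "value (selected percentiles)", ylab = "count")
```

Note that this shows only 19 of the points, so it trades detail for legibility.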
