Identify and plot datapoints surrounded by NAs


I am using ggplot2 and geom_line() to make a line plot of a large number of time series. The dataset has a high number of missing values, and I am generally happy that lines are not drawn across missing segments, as this would look awkward.
My problem is that single non-NA datapoints surrounded by NAs (or points at the beginning/end of the series with an NA on the other side) are not plotted. A potential solution would be adding geom_point() for all observations, but this increases my filesize tenfold, and makes the plot harder to read.
Thus, I want to identify only those datapoints that do not get shown with geom_line() and add points only for those. Is there a straightforward way to identify these points?
My data is currently in long format, and the following MWE can serve as an illustration. I want to identify rows 1 and 7 so that I can plot them:
library(ggplot2)
set.seed(1)
dat <- data.frame(time=rep(1:5,2),country=rep(1:2,each=5),value=rnorm(10))
dat[c(2,6,8),3] <- NA
ggplot(dat) + geom_line(aes(time,value,group=country))
> dat
time country value
1 1 1 -0.6264538
2 2 1 NA
3 3 1 -0.8356286
4 4 1 1.5952808
5 5 1 0.3295078
6 1 2 NA
7 2 2 0.4874291
8 3 2 NA
9 4 2 0.5757814
10 5 2 -0.3053884

You can use the zoo::rollapply function to create a new column that contains only the values surrounded by NAs. Then you can simply plot those points. For example:
library(zoo)
library(ggplot2)
foo <- data.frame(time =c(1:11), value = c(1 ,NA, 3, 4, 5, NA, 2, NA, 4, 5, NA))
# Perform sliding window processing
val <- c(NA, NA, foo$value, NA, NA) # Add NA at the ends of vector
val <- rollapply(val, width = 3, FUN = function(x){
if (all(is.na(x) == c(TRUE, FALSE, TRUE))){
return(x[2])
} else {
return(NA)
}
})
foo$val_clean <- val[c(-1, -length(val))] # Remove first and last values
foo$val_clean
ggplot(foo) + geom_line(aes(time, value)) + geom_point(aes(time, val_clean))
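The rollapply example above handles a single series. For long-format data with several series, as in the question, one possible base-R sketch is to flag isolated points within each group. The helper name is_isolated is hypothetical (it does not come from any package), and the sketch assumes the rows are already ordered by time within each country.

```r
# Hypothetical helper: TRUE where a value is non-NA and both neighbours
# in the same series are NA (the start and end of a series count as NA).
is_isolated <- function(x) {
  n <- length(x)
  prev_na <- c(TRUE, is.na(x[-n]))  # neighbour before each element
  next_na <- c(is.na(x[-1]), TRUE)  # neighbour after each element
  !is.na(x) & prev_na & next_na
}

# Data from the question
set.seed(1)
dat <- data.frame(time = rep(1:5, 2), country = rep(1:2, each = 5), value = rnorm(10))
dat[c(2, 6, 8), 3] <- NA

# Apply the helper per country; ave() returns 0/1 here, so convert back to logical
dat$isolated <- as.logical(ave(dat$value, dat$country, FUN = is_isolated))
which(dat$isolated)

# Plot lines for everything and points only for the isolated rows
# (guarded so the sketch still runs where ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(time, value, group = country)) +
    geom_line() +
    geom_point(data = dat[dat$isolated, ])
}
```

Because only the flagged rows are passed to geom_point(), the plot gains a handful of points rather than one per observation.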

Do you mean something like this?
library(tidyverse)
dat %>%
na.omit() %>%
ggplot() +
geom_line(aes(time, value, group = country))

Related

How to index a dataframe for using ggplot in a loop

I am struggling with creating multiple ggplots using a loop.
I use data in the following format:
a <- c(1,2,3,4)
b <- c(5,6,7,8)
c <- c(9,10,11,12)
d <- c(13,14,15,16)
time <- c(1,2,3,4)
data <- cbind(a,b,c,d,time)
What I want to create is a list of plots that plot one of the letters against the variable time.
Which I tried in the following way:
library(ggplot2)
library(gridExtra)
plots <- list()
for (i in 1:4){
plots[[i]] <- ggplot() + geom_line(data = data, aes(x = time, y = data[,i]))
}
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]])
This results in four times the fourth plot. How do I index this correctly in a way that creates the four intended plots?

(Up front: the reason that your plots are all identical is due to ggplot's "lazy" evaluation of code. See my #2 below, where I identify that the data[,i] is evaluated when you try to plot the data, at which point i is 4, the last pass in the for loop.)
It's generally preferred/recommended to use data.frames instead of matrices or vectors (as you're doing here). It gives a bit more power and control.
data <- data.frame(a,b,c,d,time)
Also, I tend to prefer lapply to for-loops and lists, for various (some subjective) reasons. Ultimately, the issue you're having is that ggplot2 is evaluating the data lazily, so plots is a list with four plots that make reference to i ... and that is realized when you try to plot them all, at which point i is 4 (from the last pass through the loop). One benefit of using lapply is that the i referenced is a local-only (inside of the anon-func) version of i that is preserved as you would expect.
plots <- lapply(names(data)[1:4],
function(nm) ggplot(data, aes(x = time, y = .data[[nm]])) + geom_line())
gridExtra::grid.arrange(plots[[1]], plots[[2]])
I also prefer patchwork to gridExtra, mostly because it makes more-customized layouts a bit more intuitive, plus adds functionality such as axis-alignment, shared legends, shared titles, etc. (None of those other features are demonstrated here.)
library(patchwork)
plots[[1]] / plots[[2]] # same plot
plots[[1]] + plots[[2]] # side-by-side instead of top/bottom
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]]) # grid
Ultimately, though, I suggest that facets can be useful and very powerful. For this, we need to melt/pivot the data into a "long format" so that the column names a-d are actually in one column.
reshape2::melt(data, id.vars = "time") |>
ggplot(aes(time, value)) +
geom_line() +
facet_grid(variable ~ ., scales = "free_y")
I assumed the preference for independent (free) y-scales, ergo the scales="free_y". Try it without if you want to see the options. (There are also scales="free_x" and scales="free" (both).)
To see what I mean by "long" format:
reshape2::melt(data, id.vars = "time")
# time variable value
# 1 1 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 1 b 5
# 6 2 b 6
# 7 3 b 7
# 8 4 b 8
# 9 1 c 9
# 10 2 c 10
# 11 3 c 11
# 12 4 c 12
# 13 1 d 13
# 14 2 d 14
# 15 3 d 15
# 16 4 d 16
This can also be done with tidyr::pivot_longer(data, -time), although the variable column is then called name. For this use, neither reshape2::melt nor tidyr::pivot_longer has an advantage over the other; the latter supports significantly more complex pivoting, which is not relevant with this data.
Data
data <- structure(list(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(9, 10, 11, 12), d = c(13, 14, 15, 16), time = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))

stack bars in plot without preserving label order

ggplot preserves the order of stacked bars according to labels:
d <- read.table(text='Day Location Length Amount
1 2 3 1
1 1 4 2
3 3 3 2
3 2 5 1',header=T)
d$Amount<-as.factor(d$Amount) # in real world is not numeric
ggplot(d, aes(x = Day, y = Length)) +
geom_bar(aes(fill = Amount), stat = "identity")
What I want is something similar to the result of the plot without the as.factor line, that is, the larger bars are always on top. However, I cannot do that with my data because I have categories, not numbers.
Similar post: https://www.researchgate.net/post/R_ggplot2_Reorder_stacked_plot
Solution can come in other R package
Note: data.frame is only demonstrative.

I came up with this solution:
(1) First, sort data.frame by column of values in decreasing order
(2) Then duplicate column of values, as factor.
(3) In ggplot group by new factor (values)
d <- read.table(text='Day Length Amount
1 3 1
1 4 2
3 3 2
3 5 1',header=T)
d$Amount<-as.factor(d$Amount)
d <- d[order(d$Length, decreasing = TRUE),] # (1)
d$LengthFactor<-factor(d$Length, levels= unique(d$Length) ) # (2)
ggplot(d)+
geom_bar(aes(x=Day, y=Length, group=LengthFactor, fill=Amount), # (3)
stat="identity", color="white")
{
library(data.table)
sam<-data.frame(population=c(rep("PRO",8),rep("SOM",4)),
allele=c("alele1","alele2","alele3","alele4",rep("alele5",2),
rep("alele3",2),"alele2","alele3","alele3","alele2"),
frequency=rep(c(10,5,4,6,7,16),2) #,rep(1,6)))
)
sam <- setDT(sam)[, .(frequencySum=sum(frequency)), by=.(population,allele)]
sam <- sam[order(sam$frequencySum, decreasing = TRUE),] # (1)
# (2)
sam$frequency<-factor(sam$frequencySum, levels = unique(sam$frequencySum) )
library(ggplot2)
ggplot(sam)+
geom_bar(aes(x=population, y=frequencySum, group=frequency, fill=allele), # (3)
stat="identity", color="white")
}

Plot every 10 datapoint in a vector by different color in R

I have one dimensional vector in R which I would like to plot like :
Every 10 data points have different color. How do I do this in R with normal plot function, with ggplot and with plotly?

in base R you can try this.
I changed the data a little bit compared to the other answer
# The data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
nCol <- 10;
df$col <- rep(1:10, each = 10);
# base R plot
plot(df[1:2]) #add `type="n"` to remove the points
sapply(1:nrow(df), function(x) lines(df[x+0:1,1:2], col=df$col[x], lwd=2))
Because lines() does not apply the col vector segment by segment, you have to use a loop (here sapply) over the rows and draw each segment separately.
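As a possible alternative to the loop, base R's segments() is vectorised over its colour argument, so each piece can be coloured in a single call. This is only a sketch using the same sample data; the variable names n and seg_col are introduced here for illustration.

```r
# Same sample data as above
set.seed(2017)
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100))
df$col <- rep(1:10, each = 10)

n <- nrow(df)
seg_col <- df$col[-n]  # one colour per segment, taken from the segment's start point

plot(df$x, df$y, type = "n", xlab = "x", ylab = "y")  # empty frame
# Draw segment i from point i to point i + 1
segments(df$x[-n], df$y[-n], df$x[-1], df$y[-1], col = seg_col, lwd = 2)
```

Colour indices above 8 are recycled through the default palette, so pass explicit colour names if ten distinct colours are needed.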

Here is a ggplot solution; unfortunately you don't provide sample data, so I'm generating some random data.
# Sample data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
# The number of different colours
nCol <- 5;
df$col <- rep(1:nCol, each = 10);
# ggplot
library(tidyverse);
ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
geom_line();
For plotly just wrap the ggplot call within ggplotly.

This answer doesn't show you how to do it in a specific plotting package, but instead shows how to assign random colors to your data according to your specifications. The benefit of this approach is that it gives you control over which colors you use if you choose.
library(dplyr) # assumed okay given ggplot2 mention
df = tibble(v1=rnorm(100)) # tibble() replaces the deprecated data_frame()
n = nrow(df)
df$group = (1:n - (1:n %% -10)) / 10
colors = sample(colors(), max(df$group), replace=FALSE)
df$color = colors[df$group]
df %>% group_by(group) %>% filter(row_number() <= 2) %>% ungroup()
# A tibble: 20 x 3
v1 group color
<dbl> <dbl> <chr>
1 -0.6941434087 1 lightsteelblue2
2 -0.4559695973 1 lightsteelblue2
3 0.7567737300 2 darkgoldenrod2
4 0.9478937275 2 darkgoldenrod2
5 -1.2358486079 3 slategray3
6 -0.7068140340 3 slategray3
7 1.3625895045 4 cornsilk
8 -2.0416315923 4 cornsilk
9 -0.6273386846 5 darkgoldenrod4
10 -0.5884521130 5 darkgoldenrod4
11 0.0645078975 6 antiquewhite1
12 1.3176727205 6 antiquewhite1
13 -1.9082708004 7 khaki
14 0.2898018693 7 khaki
15 0.7276799336 8 greenyellow
16 0.2601492048 8 greenyellow
17 -0.0514811315 9 seagreen1
18 0.8122600269 9 seagreen1
19 0.0004641533 10 darkseagreen4
20 -0.9032770589 10 darkseagreen4
The above code first creates a fake dataset with 100 rows of data, and sets n equal to 100. df$group is set by taking the row numbers (1:n) and performing a rather convoluted evaluation to get a vector of numbers like c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ..., 10). It then samples the colors available in base R, returning as many colors as there are groups (max(df$group)), and uses the group variable to index the color vector to get the color for each row. The final output is just the first two rows of each group, to show that the colors are the same within a group but different between groups. The color column can now be passed as a variable to your various plotting environments.
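The grouping expression above can arguably be written more directly. The following base-R sketch checks that a plain ceiling() gives the same group numbers; the names group_original and group_simple are only used for this illustration.

```r
n <- 100

# Expression used in the answer above
group_original <- (1:n - (1:n %% -10)) / 10

# Simpler equivalent: rows 1-10 fall in group 1, rows 11-20 in group 2, and so on
group_simple <- ceiling(seq_len(n) / 10)

all.equal(group_original, group_simple)
table(group_simple)
```

The block size of 10 is the only number that needs changing to colour blocks of a different length.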

How to remove rows that do not have more than 3 values in r?

this is my first time asking a question and hopefully I can get your help!
I need to remove rows that have values for only one or two genes using R
basically I need to get rid of 50S, ABCC8, and ACAT1 because these have a n<3.
My desired output is
thank you very much!

If this is in a data.frame, you can use dplyr package to do some manipulation. We can group the data by the Genes and count how many instances are there. Then we simply set the filter criteria to remove the records.
require(dplyr)
df <- data.frame(
Genes=c('50S' ,'abcb1' ,'abcb1' ,'abcb1' ,'ABCC8' ,'ABL' ,'ABL' ,'ABL' ,'ABL' ,'ACAT1' ,'ACAT1' ),
Values=c(-0.627323448, -0.226358414, 0.347305901 ,0.371632631 ,0.099485307 ,0.078512979 ,-0.426643782, -1.060270668, -2.059157991, 0.608899174 ,-0.048795611)
)
# group, filter and join back to subset the data
# (each pipe must end a line; a line that starts with %>% is a syntax error)
df %>% group_by(Genes) %>%
summarize(count=n()) %>%
filter(count>=3) %>%
inner_join(df) %>%
select(Genes,Values)
As per @Lamia's comments, it is possible to simplify it to just:
df %>% group_by(Genes) %>% filter(n()>=3)
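If dplyr is not available, a similar filter can be sketched in base R with ave(), which returns the group size for every row. The names n_per_gene and df_keep are introduced here for illustration only.

```r
df <- data.frame(
  Genes = c("50S", "abcb1", "abcb1", "abcb1", "ABCC8", "ABL", "ABL", "ABL",
            "ABL", "ACAT1", "ACAT1"),
  Values = c(-0.627323448, -0.226358414, 0.347305901, 0.371632631,
             0.099485307, 0.078512979, -0.426643782, -1.060270668,
             -2.059157991, 0.608899174, -0.048795611)
)

# Number of rows in each gene's group, repeated for every row of that group
n_per_gene <- ave(seq_along(df$Genes), df$Genes, FUN = length)

# Keep only genes that occur at least three times
df_keep <- df[n_per_gene >= 3, ]
df_keep
```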

# generating data
x <- c(NA, NA, NA, NA, 2, 3) # has n < 3!
y <- c(1, 2, 3, 4, 5, 6)
z <- c(1 ,2, 3, NA, 5, 6)
df <- data.frame(x,y,z)
colsToKeep <- c() # making empty vector I will fill with column numbers
for (i in 1:ncol(df)) { # for every column
if (sum(!is.na(df[,i]))>=3) { # if that column has at least 3 valid values (i.e., ones that are not NA)...
colsToKeep <- c(colsToKeep, i) # then save that column number into this vector
}
}
df[,colsToKeep] # then use that vector to call the columns you want
Note that R treats FALSE as 0 and TRUE as 1, so that is how the sum() function works here.
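The same TRUE/FALSE counting idea can be sketched without an explicit loop by using colSums(), which sums each column of the logical matrix !is.na(df). The name df_keep is only used for this illustration.

```r
x <- c(NA, NA, NA, NA, 2, 3)  # only 2 valid values
y <- c(1, 2, 3, 4, 5, 6)
z <- c(1, 2, 3, NA, 5, 6)
df <- data.frame(x, y, z)

# Count non-NA values per column, then keep columns with at least 3 of them;
# drop = FALSE keeps a data frame even if only one column survives
df_keep <- df[, colSums(!is.na(df)) >= 3, drop = FALSE]
names(df_keep)
```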

Another possible solution by using table:
gene <- c("A","A","A","B","B","C","C","C","C","D")
value <- c(seq(1,10,1))
df<-data.frame(gene,value)
df
gene value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 C 6
7 C 7
8 C 8
9 C 9
10 D 10
su<-data.frame(table(df$gene))
df_keep <-df[which(df$gene %in% su[which(su$Freq>2),1]),]
df_keep
gene value
1 A 1
2 A 2
3 A 3
6 C 6
7 C 7
8 C 8
9 C 9

Add a row in a sorted data frame : which solutions?

There is something I don't understand.
I've this data frame :
Var1 Freq
1 2008-05 1
2 2008-07 7
3 2008-08 5
4 2008-09 3
I need to insert a row in the second position; for example, it would be:
2008-06 0
I followed this (Add a new row in specific place in a dataframe). First step : add an index column ; second step : append rows with an index number for each ; then, sort it.
df$ind <- seq_len(nrow(df))
df <- rbind(df,data.frame(Var1 = "2008-06", Freq = "0",ind=1.1))
df <- df[order(df$ind),]
Ok, everything seems good. Even if I don't know why a column called "row.names" has appeared, I get :
row.names Var1 Freq ind
1 1 2008-05 1 1
2 5 2008-06 0 1.1
3 2 2008-07 7 2
4 3 2008-08 5 3
5 4 2008-09 3 4
Now, I plot it, with ggplot2.
ggplot(df, aes(y = Freq, x = Var1)) + geom_bar()
Here we are. On the x axis, "2008-06" is placed at the end, after "2008-09" (i.e. as if it had index 5). Clearly, the data frame has not really been sorted, even though it appears to be.
Where am I wrong? Thanks for your help...

Try this:
df$Var1 <- factor(df$Var1, df$Var1[order(df$ind)])
If you want ggplot2 to order labels, you have to specify the ordering yourself.
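To make the reasoning concrete, here is a small base-R sketch of what happens to the factor levels. It assumes Var1 was read in as a factor (as it would be by default in older versions of R), and it wraps the values in as.character() for explicitness; levels_before and levels_after are names used only for this illustration.

```r
# Assumption: Var1 is a factor, as in the original question
df <- data.frame(Var1 = c("2008-05", "2008-07", "2008-08", "2008-09"),
                 Freq = c(1, 7, 5, 3), stringsAsFactors = TRUE)
df$ind <- seq_len(nrow(df))
df <- rbind(df, data.frame(Var1 = "2008-06", Freq = 0, ind = 1.1,
                           stringsAsFactors = TRUE))
df <- df[order(df$ind), ]

# Sorting the rows does not touch the levels: the new level stays last,
# and the levels (not the row order) drive the axis order in ggplot2
levels_before <- levels(df$Var1)

# Rebuild the factor with levels in index order
df$Var1 <- factor(df$Var1, levels = as.character(df$Var1)[order(df$ind)])
levels_after <- levels(df$Var1)
```

Comparing levels_before with levels_after shows that "2008-06" moves from the last level to the second.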
You might also want to look into converting Var1 to some sort of date class, then dispensing with the index variable altogether. This would make things clearer, I think. The zoo package actually has a nice class for representing months of a given year, and you could use this for Var1. For example:
library(zoo)
df$Var1 <- as.yearmon(df$Var1)
df <- rbind(df,data.frame(Var1 = as.yearmon("2008-06"), Freq = 0)) # numeric 0, so Freq stays numeric
Now you can just order your data frame by Var1 without having to worry about keeping an index:
> df[order(df$Var1), ]
Var1 Freq
1 May 2008 1
5 Jun 2008 0
2 Jul 2008 7
3 Aug 2008 5
4 Sep 2008 3
A plot in ggplot2 will turn out as expected:
ggplot(df, aes(as.Date(Var1), Freq)) + geom_bar(stat="identity")
Though you do have to convert Var1 to Date, since ggplot2 doesn't understand yearmon objects.

It is because somewhere along the way you got a factor in the mix. This produces what you're after (without the rownames column):
df <- read.table(text=" Var1 Freq
1 2008-05 1
2 2008-07 7
3 2008-08 5
4 2008-09 3", header=TRUE, stringsAsFactors = FALSE)
df$ind <- seq_len(nrow(df))
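df <- rbind(df,data.frame(Var1 = "2008-06", Freq = 0,ind=1.1, stringsAsFactors = FALSE)) # numeric 0, so Freq stays numeric
df <- df[order(df$ind),]
ggplot(df, aes(y = Freq, x = Var1)) + geom_bar(stat = "identity") # stat = "identity" is required when y is supplied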
Notice the stringsAsFactors = FALSE?
As far as the order goes if you already have factors (as you do) you need to reorder the factor. If you want more detailed info see this post
