Detect outer rows in the dataset - r

I have a data set that contains the positions of objects:
library(dplyr)
library(ggplot2)
so <- data.frame(x = rep(1:5, each = 5), y = rep(1:5, 5))
so1 <- so %>% mutate(x = x + 5, y = y +2)
so2 <- rbind(so, so1) %>% mutate(x = x + 13, y = y + 7)
so3 <- so2 %>% mutate(x = x + 10)
ggplot(aes(x = x, y = y), data = rbind(so, so1, so2, so3)) + geom_point()
What I want to know is whether there is a method in R that can detect that an object is located in the outer row of the data set, as I have to exclude such objects from the analysis. I want to exclude the objects shown in red in the picture.
So far I have used min, max and ifelse, but this is tedious and I could not create something that generalises to different data sets with different layouts of x and y.
Is there a package that does this? And/or is it possible to solve such a problem?

You could perhaps use a "spatial" approach?
Visualizing your data as a spatial object, your problem would become to remove the borders of your patches...
This can be done quite straightforwardly using the package raster: find the boundaries and mask your data accordingly.
library(dplyr)
library(raster)
# Your reproducible example
myDF = rbind(so,so1,so2,so3)
myDF$z = 1 # there may actually be more 'z' variables
# Rasterize your data
r = rasterFromXYZ(myDF) # if there are more vars, this will be a RasterBrick
par(mfrow=c(2,2))
plot(r, main='Original data')
# Here I artificially add 1 row above and below and 1 column left and right.
# This is a trick needed to make sure to also remove the cells that are
# located at the border of your raster with `boundaries` in the next step.
newextent = extent(r) + c(-res(r)[1], res(r)[1], -res(r)[2], res(r)[2] )
r = extend(r, newextent)
plot(r, main='Artificially extended')
plot(rasterToPoints(r, spatial=T), add=T, col='blue', pch=20, cex=0.3)
# Get the cells to remove, i.e. the boundaries
bounds = boundaries(r[[1]], asNA=T) #[[1]]: in case r is a RasterBrick
plot(bounds, main='Cells to remove (where 1)')
plot(rasterToPoints(bounds, spatial=T), add=T, col='red', pch=20, cex=0.3)
# Then mask your data (i.e. subset to remove boundaries)
subr = mask(r, bounds, maskvalue=1)
plot(subr, main='Resulting data')
plot(rasterToPoints(subr, spatial=T), add=T, col='blue', pch=20, cex=0.3)
# This is your new data (the added NA's are not translated so it's OK)
myDF2 = rasterToPoints(subr)
Would it help you?
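If pulling in raster feels heavy, the same idea can be sketched in base R: on a regular grid, a point is on the outer row when at least one of its eight neighbours is missing. This is only a sketch. The helper name is_outer and its step argument are made up for illustration, and it assumes the coordinates sit exactly on a grid with a known spacing (1 here).

```r
# Hypothetical helper: TRUE for points with fewer than 8 occupied neighbours.
# Assumes x and y lie exactly on a regular grid with spacing `step`.
is_outer <- function(df, step = 1) {
  occupied <- paste(df$x, df$y)
  offsets <- expand.grid(dx = c(-step, 0, step), dy = c(-step, 0, step))
  offsets <- offsets[!(offsets$dx == 0 & offsets$dy == 0), ]
  # For each of the 8 directions, check whether the neighbouring cell exists
  hits <- lapply(seq_len(nrow(offsets)), function(i) {
    paste(df$x + offsets$dx[i], df$y + offsets$dy[i]) %in% occupied
  })
  Reduce(`+`, hits) < 8
}

so <- data.frame(x = rep(1:5, each = 5), y = rep(1:5, 5))
inner <- so[!is_outer(so), ]  # keeps only the 3 x 3 core of the 5 x 5 patch
```

As with the raster approach, patches that touch each other are treated as one patch, so cells along a shared edge count as interior.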

Related

Setting per-column y axis limits with facet_grid

I am, in R and using ggplot2, plotting the development over time of several variables for several groups in my sample (days of the week, to be precise). An artificial sample (using long data suitable for plotting) is this:
library(tidyverse)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>% ggplot(mapping = aes(x = x, y = values)) + geom_line() + facet_grid(groups2 ~ groups1)
which gives
In this example, the first variable -- shown in the left column -- has unlimited range, while the second variable -- shown in the right column -- is weakly positive.
I would like to reflect this in my plot by allowing the Y axes to differ across the columns in this plot, i.e. set Y axis limits separately for the two variables plotted. However, in order to allow for easy visual comparison of the different groups for each of the two variables, I would also like to have the identical Y axes within each column.
I've looked at the scales option to facet_grid(), but it does not seem to be able to do what I want. Specifically,
passing scales = "free_y" allows the Y axes to vary across rows, while
passing scales = "free_x" allows the X axes to vary across columns, but
there is no option to allow the Y axes to vary across columns (nor, presumably, the X axes across rows).
As usual, my attempts to find a solution have yielded nothing. Thank you very much for your help!
I think the easiest would be to create one plot per facet column and bind them with something like {patchwork}. To get the facet look, you can still add a faceting layer.
library(tidyverse)
library(patchwork)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
set.seed(42) ## always better to set a seed before using random functions
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>%
  group_split(groups1) %>%
  map({
    ~ggplot(.x, aes(x = x, y = values)) +
      geom_line() +
      facet_grid(groups2 ~ groups1)
  }) %>%
  wrap_plots()
Created on 2023-01-11 with reprex v2.0.2
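A possible alternative that stays within a single ggplot is to let every panel have a free Y axis and then force all panels of a column to the same range with an invisible geom_blank() layer. The sketch below is not from the answer above; col_limits is a name invented here, and it swaps facet_grid() for facet_wrap(), so each panel carries both group labels in its strip.

```r
library(ggplot2)

set.seed(42)
data <- data.frame(
  x = rep(1:100, times = 14),
  groups1 = rep(1:2, each = 7 * 100),
  groups2 = rep(rep(1:7, times = 2), each = 100),
  values = c(rnorm(700), rgamma(700, shape = 2))
)

# One invisible point at the column-wide min and max for every panel
col_limits <- do.call(rbind, lapply(split(data, data$groups1), function(d) {
  expand.grid(x = d$x[1], groups1 = d$groups1[1],
              groups2 = unique(d$groups2), values = range(d$values))
}))

p <- ggplot(data, aes(x = x, y = values)) +
  geom_line() +
  geom_blank(data = col_limits) +  # stretches each panel to its column's range
  facet_wrap(vars(groups2, groups1), ncol = 2, scales = "free_y")
p
```

Because the blank layer adds the same minimum and maximum to all seven panels of a column, their free Y scales end up identical within the column while still differing between columns.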

Dot plot of multiple X and Y variables?

I am using a gene expression dataset from ~100 cells.
I want to generate a dot plot indicating which cells are expressing which genes, like below, excluding the color delineations.
I have tried ggplot solutions, but (from what I can tell) ggplot2 cannot graph numerous variables on each axis. I've looked into more complex packages like Seurat and cRegulome (the image above is from cRegulome), but these put more information in the graphical output than I want.
Below is an example of the type of data frame I am working with.
Cell_A<-c(0,0,1,0,1,0,1,0)
Cell_B<-c(1,1,1,0,0,0,1,0)
Cell_C<-c(1,0,1,0,0,1,0,1)
Cell_D<-c(0,0,0,1,1,1,1,0)
Cell_E<-c(1,1,1,1,1,0,1,1)
Cell_F<-c(0,0,0,0,0,1,1,0)
Cell_G<-c(1,1,1,1,1,1,1,1)
Cell_H<-c(1,1,1,1,1,1,1,1)
Genes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene6","Gene7","Gene8")
fake_data <- data.frame(Cell_A, Cell_B, Cell_C, Cell_D, Cell_E,
Cell_F, Cell_G,Cell_H, row.names = Genes)
How can I manipulate this dataset to get the graphical output I want?
You can do this by reshaping the data and using geom_point. Map the size aesthetic to your count variable and it will work well. The legend is currently a bit nonsensical but can be manually tweaked if you do not have any other sizes than 0 and 1.
library(tidyverse)
Cell_A<-c(0,0,1,0,1,0,1,0)
Cell_B<-c(1,1,1,0,0,0,1,0)
Cell_C<-c(1,0,1,0,0,1,0,1)
Cell_D<-c(0,0,0,1,1,1,1,0)
Cell_E<-c(1,1,1,1,1,0,1,1)
Cell_F<-c(0,0,0,0,0,1,1,0)
Cell_G<-c(1,1,1,1,1,1,1,1)
Cell_H<-c(1,1,1,1,1,1,1,1)
Genes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene6","Gene7","Gene8")
fake_data <- data.frame(Cell_A, Cell_B, Cell_C, Cell_D, Cell_E,
Cell_F, Cell_G,Cell_H, row.names = Genes)
fake_data %>%
  rownames_to_column(var = "gene") %>%
  gather(cell, count, -gene) %>%
  ggplot() +
  geom_point(aes(x = gene, y = cell, size = count))
Created on 2019-08-02 by the reprex package (v0.3.0)
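If the data really only hold 0 and 1, one way to sidestep the odd size legend is to plot a dot only where a gene is expressed. The following is a sketch under that assumption; it reshapes with base R's as.table() rather than gather(), and the names long and expressed are illustrative.

```r
library(ggplot2)

# Same toy data as above: rows are genes, columns are cells
fake_data <- data.frame(
  Cell_A = c(0,0,1,0,1,0,1,0), Cell_B = c(1,1,1,0,0,0,1,0),
  Cell_C = c(1,0,1,0,0,1,0,1), Cell_D = c(0,0,0,1,1,1,1,0),
  Cell_E = c(1,1,1,1,1,0,1,1), Cell_F = c(0,0,0,0,0,1,1,0),
  Cell_G = c(1,1,1,1,1,1,1,1), Cell_H = c(1,1,1,1,1,1,1,1),
  row.names = paste0("Gene", 1:8)
)

# Long format: one row per gene/cell pair
long <- as.data.frame(as.table(as.matrix(fake_data)), responseName = "count")
names(long)[1:2] <- c("gene", "cell")

# Keep only the expressed pairs, so every dot has the same meaning
expressed <- long[long$count == 1, ]

p <- ggplot(expressed, aes(x = gene, y = cell)) +
  geom_point(size = 3)
p
```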
This solution is a base R solution that relies on matplot().
fake_data2 <- sweep(fake_data, 2, seq_len(length(fake_data)), FUN = '*')
fake_data2[fake_data2 == 0] <- NA_integer_
matplot(x = seq_along(Genes), y = as.matrix(fake_data2)
, cex = colSums(fake_data) / 3, pch = 16, col = 1
, yaxt='n', xaxt='n', ann=FALSE)
axis(1, at = seq_along(Genes), Genes)
axis(2, at = seq_len(length(fake_data)), names(fake_data), las = 1)
You didn't provide enough details on what size you wanted. The size here is based on the number of 1 values in each column.

Density count in heatmaps

I have a problem with my heatmap, which displays the density LEVEL, but doesn't say anything about the density count. (how many points are in the same area for example).
My data is divided in more columns, but the most important ones are: lat,lon.
I would like to have something like this, but with "count" : https://stackoverflow.com/a/24615674/5316566,
However, when I try to apply the code used in that answer, my maximum density "level" doesn't reflect my density count. (Instead of 7500 I get, for example, 6, even though I have thousands and thousands of data points concentrated there.)
That's my code:
us_map_g_str <- get_map(location = c(-90.0,41.5,-81.0,42.7), zoom = 7)
ggmap(us_map_g_str, extent = "device") +
geom_tile(data = data1, aes(x = as.numeric(lon), y = as.numeric(lat)), size = 0.3) +
stat_density2d(data = data1, aes(x = as.numeric(lon), y = as.numeric(lat), fill = ..level.., alpha = ..level..), size = 0.3, bins = 10, geom = "polygon") +
scale_fill_gradient(name= "Ios",low = "green", high = "red", trans= "exp") +
scale_alpha(range = c(0, 0.3), guide = FALSE)
This is what I get:
This is part of the data:
lat lon tag device
1 43.33622 -83.67445 0 iPhone5
2 43.33582 -83.69964 0 iPhone5
3 43.33623 -83.68744 0 iPhone5
4 43.33584 -83.72186 0 iPhone5
5 43.33616 -83.67526 0 iPhone5
6 43.25040 -83.78234 0 iPhone5
(The "tag" column is not important)
REVISED
I realised that my previous answer needs to be revised. So, here it is. If you want to find out how many data points exist in each level of a contour, you actually have a lot of things to do. If you are happy to use the leaflet option below, your life would be much easier.
First, let's get a map of Detroit, and create a sample data frame.
library(dplyr)
library(ggplot2)
library(ggmap)
mymap <- get_map(location = "Detroit", zoom = 8)
### Create a sample data
set.seed(123)
mydata <- data.frame(long = runif(min = -84, max = -82.5, n = 100),
lat = runif(min = 42, max = 42.7, n = 100))
Now, we draw a map and save it as g.
g <- ggmap(mymap) +
stat_density2d(data = mydata,
aes(x = long, y = lat, fill = ..level..),
size = 0.5, bins = 10, geom = "polygon")
The real job begins here. To find the number of data points in each level, you need the data frame that ggplot generates. This data frame holds the polygon data used to draw the level lines. You can see this in the following image, in which I draw three levels on a map.
### Create a data frame so that we can find how many data points exist
### in each level.
mydf <- ggplot_build(g)$data[[4]]
### Check where the polygon lines are. This is just for a check.
check <- ggmap(mymap) +
geom_point(data = mydata, aes(x = long, y = lat)) +
geom_path(data = subset(mydf, group == "1-008"), aes(x = x, y = y)) +
geom_path(data = subset(mydf, group == "1-009"), aes(x = x, y = y)) +
geom_path(data = subset(mydf, group == "1-010"), aes(x = x, y = y))
The next step is to create a level vector for a legend. We group the data by group (e.g., 1-010) and take the first row for each group using slice(). Then we ungroup the data and choose the 2nd column. Finally, we create a vector with unlist(). We come back to lev at the end.
mydf %>%
group_by(group) %>%
slice(1) %>%
ungroup %>%
select(2) %>%
unlist -> lev
Now we split the polygon data (i.e., mydf) by group and create a polygon for each level. Since we have 11 levels (11 polygons), we use lapply(). In the lapply loop, we need to: 1) extract the columns for longitude and latitude, 2) create a polygon, 3) convert the polygon to spatial polygons, 4) assign a CRS, 5) create a dummy data frame, and 6) create a SpatialPolygonsDataFrame.
library(sp) # provides Polygon(), SpatialPolygons(), CRS(), etc.
mylist <- split(mydf, f = mydf$group)
test <- lapply(mylist, function(x){
xy <- x[, c(3,4)]
circle <- Polygon(xy, hole = as.logical(NA))
SP <- SpatialPolygons(list(Polygons(list(circle), ID = "1")))
proj4string(SP) <- CRS("+proj=longlat +ellps=WGS84")
df <- data.frame(value = 1, row.names = "1")
circleDF <- SpatialPolygonsDataFrame(SP, data = df)
})
Now we go back to the original data. What we need to do is convert the data frame to a SpatialPointsDataFrame. This is because we need to subset the data and find how many data points exist in each polygon (in each level). First, get long and lat from your data.frame. Make sure that the order is lon/lat.
xy <- mydata[,c(1,2)]
Then, we create the SPDF (SpatialPointsDataFrame). The spatial polygons and the spatial points data must have an identical proj4string.
spdf <- SpatialPointsDataFrame(coords = xy, data = mydata,
proj4string = CRS("+proj=longlat +ellps=WGS84"))
Then, we subset data (mydata) using each polygon.
ana <- lapply(test, function(y){
mydf <- as.data.frame(spdf[y, ])
})
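To see what the subsetting step does in isolation, here is a small sketch that counts the points inside a single polygon with sp::point.in.polygon(). The rectangle is a hypothetical stand-in for one contour polygon, not one of the levels computed above.

```r
library(sp)

set.seed(123)
pts <- data.frame(long = runif(100, min = -84, max = -82.5),
                  lat = runif(100, min = 42, max = 42.7))

# Hypothetical closed polygon (a rectangle) standing in for one contour level
poly_x <- c(-83.5, -83.0, -83.0, -83.5, -83.5)
poly_y <- c(42.2, 42.2, 42.5, 42.5, 42.2)

# point.in.polygon() returns 0 for points strictly outside the polygon
inside <- point.in.polygon(pts$long, pts$lat, poly_x, poly_y) > 0
n_inside <- sum(inside)
```

Running the same test once per contour polygon would give one count per level, which is what the lapply() call above achieves through spatial subsetting.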
The data points overlap across levels, so there is duplication: a point inside an inner level is also inside every level that encloses it. First we find the unique data points for each level. We bind the data frames in ana for the inner levels into one data frame, foo1. We also take the data frame for the next level out, foo2, whose unique data points we want to count. We make sure that the column names are identical in foo1 and foo2. Using setdiff() and nrow(), we can find the number of unique data points in each level.
total <- lapply(11:2, function(x){
foo1 <- bind_rows(ana[c(11:x)])
foo2 <- as.data.frame(ana[x-1])
names(foo2) <- names(foo1)
nrow(setdiff(foo2, foo1))
})
Finally, we need to find the number of data points for the innermost level, which is level 11. We choose the data frame for level 11 in ana and count its rows.
bob <- nrow(as.data.frame(ana[11]))
out <- c(bob,unlist(total))
### check if total is 100
### sum(out)
### [1] 100
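The bookkeeping above can be checked against a simpler calculation. If the levels are strictly nested, with one polygon per level, the number of points unique to a level is the difference between consecutive cumulative counts. The counts below are made-up numbers used only to illustrate the arithmetic.

```r
# Hypothetical cumulative counts: points inside each level's polygon,
# ordered from the outermost level to the innermost one
inside_counts <- c(100, 80, 55, 30, 10)

# Points unique to a level = points inside it minus points inside the next
# level in; the innermost level keeps all of its points
per_level <- c(-diff(inside_counts), inside_counts[length(inside_counts)])
per_level
sum(per_level)  # equals the count of the outermost level
```

This shortcut does not hold when a level consists of several separate polygons, which is one reason the answer works with the actual point sets.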
We assign reversed out as names for lev. This is because we want to show how many data points exist for each level in a legend.
names(lev) <- rev(out)
Now we are ready to add a legend.
final <- g +
scale_fill_continuous(name = "Total",
guide = guide_legend(),
breaks = lev)
final
LEAFLET OPTION
If you use the leaflet package, you can group data points at different zoom levels. Leaflet counts the data points in certain areas and indicates the numbers in circles, as in the following figure. The more you zoom in, the more leaflet breaks the data points up into smaller groups. In terms of workload, this is much lighter. In addition, your map is interactive. This may be a better option.
library(leaflet)
leaflet(mydata) %>% # the point data (long/lat), not the polygon data frame mydf
addTiles() %>%
addMarkers(clusterOptions = markerClusterOptions())
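If contour levels are not essential, another lightweight option is to bin the points on a grid, since ggplot2's geom_bin2d() fills each cell by the number of points in it. This is a sketch on the sample data without the map background; the bin count of 10 is an arbitrary choice.

```r
library(ggplot2)

set.seed(123)
mydata <- data.frame(long = runif(min = -84, max = -82.5, n = 100),
                     lat = runif(min = 42, max = 42.7, n = 100))

# Each tile is filled by the number of points that fall inside it
p <- ggplot(mydata, aes(x = long, y = lat)) +
  geom_bin2d(bins = 10) +
  scale_fill_gradient(name = "count", low = "green", high = "red")
p

# The per-tile counts are available from the computed layer data
binned <- layer_data(p, 1)
sum(binned$count)  # every point falls in exactly one tile
```

Here the legend shows actual point counts per tile, unlike the level values of stat_density2d(), which are density estimates.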

Combine a ggplot2 object with a lattice object in one plot

I would like to combine a ggplot2 with a lattice plot object. Since both packages are based on grid I was wondering whether this is possible? Ideally, I would do everything in ggplot2 but I cannot plot a 3d scatter.
So assume I have the following data:
set.seed(1)
mdat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
cluster = factor(sample(5, 100, TRUE)))
First, I want to create a scatterplot matrix in ggplot2:
library(ggplot2)
library(gtools)
library(plyr)
cols <- c("x", "y", "z")
allS <- adply(combinations(3, 2, cols), 1, function(r)
data.frame(cluster = mdat$cluster,
var.x = r[1],
x = mdat[[r[1]]],
var.y = r[2],
y = mdat[[r[2]]]))
sc <- ggplot(allS, aes(x = x, y = y, color = cluster)) + geom_point() +
facet_grid(var.x ~ var.y)
So far so good. Now I want to create a lattice 3d scatterplot with all the variables together:
library(lattice)
sc3d <- cloud(z ~ x + y, data = mdat, groups = cluster)
Now I would like to combine sc and sc3d in one single plot. How can I achieve that? Maybe with the help of grid or gridExtra (pushViewport, arrangeGrob)? Or can I produce a 3d scatterplot in ggplot? Ideally, I would like to see the 3d plot in the empty panel of the ggplot, but I guess that's asking too much, so for starters I would be very happy to learn how to arrange these two plots side by side.
library(gridExtra); library(lattice); library(ggplot2)
grid.arrange(xyplot(1~1), qplot(1,1))
You can replace the empty panel by the lattice grob within the gtable, but it doesn't look very good due to the axes etc.
g <- ggplotGrob(sc)
lg <- gridExtra:::latticeGrob(sc3d)
ids <- which(g$layout$name == "panel")
remove <- ids[2]
g$grobs[[remove]] <- lg
grid.newpage()
grid.draw(g)
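Since the question mentions pushViewport, here is a sketch of the side-by-side arrangement done with grid viewports directly instead of grid.arrange(). It uses a plain scatterplot rather than the faceted sc to stay short; gg and lat are names chosen here.

```r
library(grid)
library(ggplot2)
library(lattice)

set.seed(1)
mdat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                   cluster = factor(sample(5, 100, TRUE)))
gg <- ggplot(mdat, aes(x = x, y = y, color = cluster)) + geom_point()
lat <- cloud(z ~ x + y, data = mdat, groups = cluster)

grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 2)))
# ggplot2's print method accepts a viewport to draw into
print(gg, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
# lattice draws into the current viewport when newpage = FALSE
pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2))
print(lat, newpage = FALSE)
popViewport(2)
```

Changing the widths in grid.layout() would give the two plots unequal shares of the page.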

colored points in R

I have a table with 3 numeric columns. Two of them are coordinates and the third one encodes the color. There are hundreds of rows in my text file.
I want to make a picture where the first two numbers are the coordinates of each point and the third one is the color of the point: the bigger the number, the darker the point.
How could I do this?
The example of the row in my file:
99.421875 48.921875 0.000362286050144
Will this do?
library(ggplot2)
# sample data (replace with your own; x, y and col are the column names)
set.seed(45)
df <- data.frame(x = runif(100) * sample(1:10, 100, replace = TRUE),
                 y = runif(100) * sample(1:50, 100, replace = TRUE),
                 col = runif(100) / sample(1:100))
# alpha is mapped to col, so larger values give more opaque (darker) points
ggplot(data = df, aes(x = x, y = y)) +
  geom_point(colour = "red", size = 3, aes(alpha = col))
Plot:
A lattice solution:
library(lattice)
mydata <- matrix(c(1,2,3,1,1,1,2,5,10),nrow=3)
xyplot(mydata[,2] ~ mydata[,1], col = mydata[,3], pch= 19 ,
alpha = (mydata[,3]/10), cex = 15)
alpha here controls the transparency.
Here is a base R solution:
##Generate data
##Here z lies between 0 and 10
dd = data.frame(x = runif(100), y= runif(100), z= runif(100, 0, 10))
First normalise z:
dd$z = dd$z- min(dd$z)
dd$z = dd$z/max(dd$z)
Then plot as normal using the value of z for the shading:
##See ?gray for other colour combinations
##pch=19 gives solid points. See ?points for other shapes
##gray(0) is black and gray(1) is white, so invert z to make larger values darker
plot(dd$x, dd$y, col=gray(1 - dd$z), pch=19)
Another solution using base R. To change the colour, you can replace some of the data[,3] arguments inside rgb() with 0:
n <- 1000
data <- data.frame(x=runif(n),y=runif(n),col=runif(n))
plot(data[,1:2],col=rgb(data[,3],data[,3],data[,3],maxColorValue = max(data[,3])),pch=20)
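For a whitespace-separated file like the one in the question, the whole pipeline might look like the sketch below. Only the first row comes from the question; the other two rows are invented so that the example has some variation, and the column names are assumptions. With a real file, read.table() would take the file path instead of the text argument.

```r
# First row is from the question; the other two are made-up values
pts <- read.table(text = "99.421875 48.921875 0.000362286050144
50.5 20.25 0.0005
10.0 75.0 0.0001", col.names = c("x", "y", "value"))

# Rescale the third column to [0, 1]
scaled <- (pts$value - min(pts$value)) / diff(range(pts$value))

# gray(0) is black and gray(1) is white, so larger values get darker shades;
# the 0.9 factor keeps the smallest value visible on a white background
shade <- gray(0.9 * (1 - scaled))

plot(pts$x, pts$y, col = shade, pch = 19)
```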
