How to avoid dropping levels with 0 frequencies - r

I am trying to compare two questions (columns Q1_b and Q2_b) by plotting them side by side in the same barplot; the answer options are 1-6. The problem is that no one answered 4 for Q1_b, so the barplot skips straight to 5 where 4 should be for Q1_b, and that bar ends up next to the percentage of people who answered 4 for Q2_b. How can I make sure R doesn't do this and automatically adds a 0% bar when an option received no answers?
alldataset<-structure(list(Q1_b = c(6L, 1L, 5L, 3L, 5L, 6L, 6L, 2L),
Q2_b = c(1L, 2L, 2L, 5L, 4L, 3L, 6L, 1L)),
.Names = c("Q1_b", "Q2_b"),
class = "data.frame",
row.names = c(NA, -8L))
Qb<-table(alldataset$Q2_b)
Qf<-table(alldataset$Q1_b)
nrowFUP<-NROW(alldataset$Q1_b)
nrowBL<-NROW(alldataset$Q2_b)
options(digits=6)
newbl <- transform(as.data.frame(table(alldataset$Q2_b)),
percentage_column=Freq/nrowBL*100)
newfup <- transform(as.data.frame(table(alldataset$Q1_b)),
percentage_column=Freq/nrowFUP*100)
matrixQ1<-cbind(newbl$percentage_column, newfup$percentage_column)
matrixQ1dataframe<-data.frame(matrixQ1)
rmatrixQ1<-as.vector(t(matrixQ1dataframe))
roundedrmatrix<-round(rmatrixQ1, digits=0)
barplotmatrix<-matrix(roundedrmatrix)
par(mar=c(7.5,4,3,2), mgp=c(2,.7,0), tck=-.01, las=1, xpd=TRUE)
b<-barplot(matrix(roundedrmatrix, nr=2),
beside=T, xlab="",
ylab="Percentage",
cex.lab=0.9,
main="Comparison",
cex.main=0.9, ylim=c(0,70),
col=c("black","yellow"),
names.arg=c(1:6),
legend=c("Q2_b","Q1_b"),
args.legend=list(x="bottomleft",
cex=0.8,
inset=c(0.4,-0.4)))
text(x=b, y=roundedrmatrix,labels=roundedrmatrix, pos=3, cex=0.8)
R also warns me this will happen by displaying:
Warning message:
In cbind(newbl$percentage_column, newfup$percentage_column) :
number of rows of result is not a multiple of vector length (arg 2)
I have been trying for ages to sort this out but I am not getting anywhere. Can anyone help?

The problem is that you never told R that your vectors represent categorical responses with possible values 1-6, so it does not know to include the zero counts (you would not want it to include a 0 for 7, 8, 1 million, etc.).
Try replacing the two lines that define Qb and Qf with:
Qb<-table(factor(alldataset$Q2_b, levels=1:6))
Qf<-table(factor(alldataset$Q1_b, levels=1:6))
or run something like:
alldataset$Q1_b <- factor(alldataset$Q1_b, levels=1:6)
alldataset$Q2_b <- factor(alldataset$Q2_b, levels=1:6)
before the table commands.
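To see the effect in isolation, here is a minimal sketch using the Q1_b values from the question. The variable names (q1, counts_plain, counts_full, pct_full) are only illustrative and do not come from the original code.

```r
# Q1_b responses from the question; nobody answered 4
q1 <- c(6L, 1L, 5L, 3L, 5L, 6L, 6L, 2L)

# Without declared levels, table() only reports values that occur
counts_plain <- table(q1)

# With levels 1:6 declared, the unused level 4 is kept with a count of 0
counts_full <- table(factor(q1, levels = 1:6))

# Percentages over all six answer options, including the empty one
pct_full <- as.vector(counts_full) / length(q1) * 100
```

Because counts_full always has six entries, two questions tabulated this way line up in cbind() without the recycling warning.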

You need to tell table to use all values from one to six with table(factor(x, seq.int(6))).
Here is an improved version of your code:
dat <- t(round(sapply(rev(alldataset),
function(x) table(factor(x, seq.int(6)))) /
nrow(alldataset) * 100))
par(mar=c(7.5,4,3,2), mgp=c(2,.7,0), tck=-.01, las=1, xpd=TRUE)
b <- barplot(dat, beside=TRUE, xlab="", ylab="Percentage", cex.lab=0.9,
main="Comparison", cex.main=0.9, ylim=c(0,70),
col=c("black","yellow"), names.arg=c(1:6), legend=rownames(dat), # dat is a matrix, so names(dat) would be NULL
args.legend=list(x="bottomleft", cex=0.8, inset=c(0.4,-0.4)))
text(x=b, y=dat,labels=dat, pos=3, cex=0.8)

Related

Rasterize points using the terra package to produce multi-layer SpatRaster

I have some long-standing package code that uses raster::rasterize that I'm trying to update to terra::rasterize. The code takes point data, where each point has one of two possible integer ID values. The output is a raster with two layers, one for each possible point ID, where cell values are counts. The relevant bits are:
# r0 is template raster to define extent and resolution
r <- raster::rasterize(dat[, c("X", "Y")],
r0,
field = dat$flightlineID,
fun = f,
background = 0)
Here, f is a function that takes a vector of point IDs and returns a two-element vector of counts, which results in the desired two layer output raster.
My first attempt to port this to terra::rasterize (package version 1.6-17) was...
r <- terra::rasterize(cbind(dat$X, dat$Y), # seem to need a matrix rather than a data frame
r0, # template SpatRaster
values = dat$flightlineID,
fun = f,
background = 0)
This fails with the error:
Error in w[vv[, 1], ] <- vv[, -1] :
number of items to replace is not a multiple of replacement length
Delving into the code for terra:::rasterize_points it seems that the number of layers for the output raster is determined by treating the 'values' argument as a data frame and checking the number of columns. This is a bit confusing because the package docs state that the values argument is expected to be a numeric vector, of either length 1 or nrow(x) where x is the input point data. Moreover, the length of the vector returned by the user-supplied summary function doesn't seem to play any part in determining the number of output raster layers.
For the moment I've simply retained the old raster::rasterize code and convert the output raster to a SpatRaster, but I think I must be missing something obvious. Is there a way of using just terra::rasterize to accomplish this task?
EDIT: As requested in comments, here is a small sample of the input point data to show the format. Typical input data sizes range from 2 to 40 million points.
structure(list(X = c(420094, 420067, 420017, 420050, 420058,
420090, 420038, 420040, 420081, 420097, 420075, 420041, 420039,
420062, 420050, 420083, 420019, 420019, 420044, 420087, 420099,
420077, 420030, 420014, 420015, 420051, 420033, 420056, 420041,
420030, 420027, 420024, 420058, 420042, 420063, 420028, 420073,
420053, 420010, 420100, 420048, 420062, 420056, 420080, 420053,
420068, 420074, 420004, 420010, 420078), Y = c(6676049, 6676029,
6676034, 6676019, 6676096, 6676010, 6676003, 6676048, 6676073,
6676023, 6676089, 6676082, 6676010, 6676051, 6676039, 6676099,
6676024, 6676073, 6676040, 6676056, 6676072, 6676086, 6676030,
6676042, 6676002, 6676033, 6676078, 6676073, 6676013, 6676056,
6676055, 6676069, 6676072, 6676089, 6676069, 6676058, 6676023,
6676039, 6676043, 6676017, 6676011, 6676054, 6676095, 6676068,
6676098, 6676077, 6676049, 6676073, 6676097, 6676057), flightlineID = c(2L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L)), row.names = c(NA, -50L), class = "data.frame")
EDIT: In the raster package code, the private .pointsToRaster function has a line (see here) where the length of the output from the user-supplied summary function is checked with some arbitrary test values to determine the number of layers in the output raster. This seems to be absent from the terra package code.
It may be that you don't want this as two layers in one raster, though this is hard to tell from the supplied data, as it appears to lie entirely within the overlap. I notice that in your package there is an attempt to throttle/reduce tile edge points, which may just need to be set lower than 1K.
That terra does not behave like raster when rasterizing may be a deliberate design decision: under terra you are expected to build each layer separately and then combine them with add<- or c(), whereas raster inferred the layers through a hard-to-follow logic of 'field' and 'values'. Using your data above (and keeping two rasters):
library(terra)
#las_df <- structure(...)
las_df1 <- las_df[which(las_df$flightlineID == 1L), ]
las_df2 <- las_df[which(las_df$flightlineID == 2L), ]
las_vect1 <- vect(las_df1, geom = c('X', 'Y'), crs = 'EPSG:32755')
las_vect2 <- vect(las_df2, geom = c('X', 'Y'), crs = 'EPSG:32755')
las_rast <- rast(xmin=0, nrow = length(unique(las_df$X)), ncol = length(unique(las_df$Y)), crs='EPSG:32755')
set.ext(las_rast, c(min(las_df$X), max(las_df$X), min(las_df$Y), max(las_df$Y)))
pts1_rast <- rasterize(las_vect1, las_rast, fun = length)
pts2_rast <- rasterize(las_vect2, las_rast, fun = length)
pts1_pts2_rast <- c(pts1_rast, pts2_rast)
names(pts1_pts2_rast) <- c('lyr.1', 'lyr.2') # have to attend to this as both lyr.1 after `c(`
plot(pts1_pts2_rast$lyr.1, col = 'red')
plot(pts1_pts2_rast$lyr.2, col = 'blue', alpha=.75, add = TRUE)
# there is 1 cell that contains points from both pts1_rast and pts2_rast
cells(pts1_rast) %in% cells(pts2_rast)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cells(pts2_rast) %in% cells(pts1_rast)
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE
One might suggest a consistent merge policy where pts1 or pts2 are always favored. In the end, if this is about optimizing allocation of scarce resources, clear bush where you have the best data, inspect, and clear again. But it still seems best to resolve this at the las level upstream.
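For the original goal of one count layer per point ID, a possible alternative is to skip rasterize() and tabulate cell numbers directly. The following is only a sketch under assumptions: the point data and the 2 x 2 template are made up for illustration, and it relies on terra's cellFromXY(), ncell(), setValues() and rast() on a list of SpatRasters behaving as documented.

```r
library(terra)

# Small made-up point set (hypothetical values, not the asker's data)
dat <- data.frame(
  X = c(0.5, 1.5, 1.5, 0.5, 1.5),
  Y = c(0.5, 0.5, 0.5, 1.5, 1.5),
  flightlineID = c(1L, 1L, 2L, 2L, 2L)
)

# Template raster: 2 x 2 cells covering [0, 2] x [0, 2]
r0 <- rast(nrows = 2, ncols = 2, xmin = 0, xmax = 2, ymin = 0, ymax = 2)

# Cell number of every point (NA for points outside the template)
cell <- cellFromXY(r0, cbind(dat$X, dat$Y))

ids <- sort(unique(dat$flightlineID))
layers <- lapply(ids, function(id) {
  # tabulate() ignores NA and returns 0 for every cell without points
  counts <- tabulate(cell[dat$flightlineID == id], nbins = ncell(r0))
  setValues(r0, counts)
})

# Combine the per-ID count rasters into one multi-layer SpatRaster
r <- rast(layers)
names(r) <- paste0("flightline_", ids)
```

Since each point is touched once per ID and no summary function is called per cell, this may also scale reasonably to large inputs, though that has not been measured here.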

How to plot 3D with legends in rgl (R)

This is what the sample looks like:
PC1 PC2 PC3 PC4 clusterNum
1 -3.0278979 -0.9414093 -2.0593369 -0.92992822 6
2 -1.5343149 2.5393680 -0.6645160 -0.42415503 1
3 -3.1827899 0.4878230 -2.1716015 0.87140142 1
4 -2.0630451 -0.6765663 -2.0103567 -1.20913031 6
5 -2.5608251 0.3093504 -1.8429190 -0.08175088 1
6 -2.3229565 2.1314606 -1.0680616 0.53312488 1
7 -1.8015610 -0.4233978 -0.7954366 -0.74790714 6
62378 -2.5379848 -1.3008801 -1.3621545 0.93952670 6
62379 0.5763662 -0.5990910 -0.2045754 0.32887753 5
62380 1.0751095 -0.9948755 0.4209824 0.89306204 5
data <- structure(list(PC1 = c(-3.02789789907534, -1.53431493608036,-3.18278992851587, -2.06304508820853, -2.56082511958789, -2.32295654380193,-1.80156103002696, -2.53798478044841, 0.57636622461764, 1.07510945315635), PC2 = c(-0.94140934359441, 2.53936804189767, 0.487822997171811,-0.676566283079183, 0.309350374661524, 2.13146057296978, -0.423397780929157,-1.30088008176366, -0.599090979848925, -0.994875508747934), PC3 = c(-2.05933693083859,-0.664515950436883, -2.17160152842666, -2.01035669961785, -1.84291903624489,-1.06806160129806, -0.795436603544969, -1.36215450269855, -0.204575393904516,0.420982419847553), PC4 = c(-0.929928223454337, -0.424155026745399,0.871401419380821, -1.20913030836257, -0.0817508821137412, 0.533124880557676,-0.747907142699851, 0.939526696339997, 0.328877528585212, 0.893062041850707), clusterNum = c(6L, 1L, 1L, 6L, 1L, 1L, 6L, 6L, 5L, 5L)), row.names = c(1L,2L, 3L, 4L, 5L, 6L, 7L, 62378L, 62379L, 62380L), class = "data.frame")
So, I'm learning to plot 3d in R with rgl package. I used this code to plot my data.
plot3d(data$PC1, data$PC2, data$PC3, col=data$clusterNum)
and here is my output;
My question is how to add the legends based on my clusterNum column to visualize this graph.
Thank you in advance for any help.
Use rgl::legend3d(). You can use practically all the arguments of the graphics::legend() function, e.g. setting the x and y coordinates of the legend and giving a point character with pch= to get points printed; see ?pch for other shapes. To get the legend= elements, just sort the unique values of your cluster variable. For the point colors, use the same trick you used in the plot.
library(rgl)
with(data, plot3d(PC1, PC2, PC3, col=clusterNum)) ## use `with` to get nicer labs
k <- sort(unique(data$clusterNum))
legend3d(x=.1, y=.95, legend=k, pch=18, col=k, title='Cluster', horiz=TRUE)
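If the default palette colors are not wanted, one option is an explicit cluster-to-color lookup that is shared by the plot and the legend. This is a sketch only: the color names are arbitrary choices, and cluster_cols and point_cols are hypothetical helper names.

```r
# Cluster labels from the sample data in the question
clusterNum <- c(6L, 1L, 1L, 6L, 1L, 1L, 6L, 6L, 5L, 5L)

# Distinct clusters in sorted order: 1, 5, 6
k <- sort(unique(clusterNum))

# One (arbitrarily chosen) color per cluster, named by cluster label
cluster_cols <- setNames(c("steelblue", "orange", "forestgreen"), k)

# Look up the color of every point by its cluster label
point_cols <- unname(cluster_cols[as.character(clusterNum)])

# With rgl loaded, these could then be passed on, for example:
# plot3d(data$PC1, data$PC2, data$PC3, col = point_cols)
# legend3d("topright", legend = k, pch = 18, col = cluster_cols, title = "Cluster")
```

Looking colors up by label rather than by number keeps plot and legend consistent even when the cluster numbers are not consecutive.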

BradleyTerry2 package missing one player in model results

I have data on 23 'players'. Some of them played against each other (but not every possible pair) one or multiple times. The dataset I have (see dput below) includes the number of times one player won and lost against another player. I use it to fit a BT model using BradleyTerry2 package. The issue I have is that the model gives me the coefficients for 22 players not 23. Can anyone help me figure out what the problem is, please?
Below is the dput of my data (head)
structure(list(player1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("a12TTT.pdf",
"a15.pdf", "a17.pdf", "a18.pdf", "a21.pdf", "a2TTT.pdf", "a5.pdf",
"B11.pdf", "B12.pdf", "B13.pdf", "B22.pdf", "B24.pdf", "B4.pdf",
"B7.pdf", "B8.pdf", "cw10-1.pdf", "cw15-1TTT.pdf", "cw17-1.pdf",
"cw18.pdf", "cw3.pdf", "cw4.pdf", "cw7_1TTT.pdf", "cw13-1.pdf"
), class = "factor"), player2 = structure(c(4L, 5L, 8L, 9L, 10L,
12L), .Label = c("a12TTT.pdf", "a15.pdf", "a17.pdf", "a18.pdf",
"a21.pdf", "a2TTT.pdf", "a5.pdf", "B11.pdf", "B12.pdf", "B13.pdf",
"B22.pdf", "B24.pdf", "B4.pdf", "B7.pdf", "B8.pdf", "cw10-1.pdf",
"cw15-1TTT.pdf", "cw17-1.pdf", "cw18.pdf", "cw3.pdf", "cw4.pdf",
"cw7_1TTT.pdf", "cw13-1.pdf"), class = "factor"), win1 = c(0,
1, 1, 1, 2, 0), win2 = c(1, 1, 0, 1, 0, 2)), row.names = c(NA,
6L), class = "data.frame")
The code I am using:
BTm(cbind(win1,win2), player1, player2, data= prep)
I also tried
BTm(cbind(win1,win2), player1, player2, ~player, id="player", data= prep)
And it gives me the same result (i.e. the same player is missing, and the 22 coefficients for the rest are the same).
If that is relevant, I created 'prep' using the below code.
prep<-countsToBinomial(table(ju$winner, ju$loser))
ju$winner and ju$loser are two columns in which rows are individual games and the winner is in the first column.
I also tried the following code to fit the model:
BTm(1, p1, p2, data=ju)
In this case p1 and p2 are the same as columns winner and loser, but transformed so as to have the same factor levels (so that the function would work). I am not sure I used this alternative correctly, and I mention it because in this case I also have one player missing (although a different one).
After reading the package documentation more carefully, I found that when estimating the model the function drops one script/player/contestant to serve as the reference; its ability is always 0. So, as I understand it, for any further analysis you have to work out which player was dropped and add it back to the data frame with an ability of 0.
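Adding the reference player back can be sketched in base R as follows. The coefficient values are made up for illustration, only three of the 23 players are shown, and the ".." prefix on coefficient names is an assumption about how the fitted model labels them. BradleyTerry2 also has a BTabilities() helper that, as far as I know, lists the reference player with ability 0; the sketch does not depend on it.

```r
# All players, in factor-level order (shortened to three for illustration)
players <- c("a12TTT.pdf", "a15.pdf", "a17.pdf")

# Hypothetical coefficients as a fitted model might report them (made-up numbers);
# the reference player "a12TTT.pdf" is absent
est <- c("..a15.pdf" = 0.4, "..a17.pdf" = -0.2)

# Start every player at 0, then overwrite those that have an estimate
abilities <- setNames(numeric(length(players)), players)
abilities[sub("^\\.\\.", "", names(est))] <- est

# The player that never received an estimate is the reference category
reference <- setdiff(players, sub("^\\.\\.", "", names(est)))
```

With the full level set of player1 in place of players, the same comparison would reveal which of the 23 players was used as the reference.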

R: Need to perform multiple matches for each row in data frame

I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, Compound-1 uses IS-1, while IS-1 itself does not have a reference (IS). Within each sample I am trying to match the IS.Name to the row of the compound with that name, grab its Chrom.1.RT, and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList <- unique(df1$Filename)
for (s in sampleList) {
  # s is the filename itself, so compare against it directly (sampleList[s] would be NA)
  SampleRows <- which(df1$Filename == s)
  RefRows <- subset(df1, Filename == s)
  df1$IS.RT[SampleRows] <- RefRows$Chrom.1.RT[match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest that in the future you provide your example as the output of dput(df1), as that is a lot easier to read into R than the space-delimited table you provided.
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1 <- as_tibble(df1)  # tbl_df() is deprecated in current dplyr
left_join(
  df1,
  left_join(df1 %>% select(-Compound),
            df1 %>% group_by(Compound) %>% summarise(unique(Chrom.1.RT)),
            c("IS.Name" = "Compound"))
) %>%
  select(-IS.RT) %>%
  rename(IS.RT = `unique(Chrom.1.RT)`)
Unless I got it wrong, this is what you need?
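A vectorised base R alternative is to build a combined Filename/compound key and look up every reference in one match() call. This is a sketch, not the answerer's method: the data frame is rebuilt from the example table in the question, and own_key, ref_key and RT.diff are names introduced here for illustration.

```r
# Example data rebuilt from the table in the question
df1 <- data.frame(
  Filename   = rep(c("Sample-001", "Sample-002"), each = 4),
  Compound   = rep(c("IS-1", "Compound-1", "IS-2", "Compound-2"), times = 2),
  Chrom.1.RT = rep(c(1.32495, 1.344033333, 0.127416667, 0), times = 2),
  IS.Name    = rep(c("NONE", "IS-1", "NONE", "IS-2"), times = 2),
  stringsAsFactors = FALSE
)

# Key identifying each row's own compound within its file
own_key <- paste(df1$Filename, df1$Compound, sep = "|")

# Key identifying the reference compound each row points to
ref_key <- paste(df1$Filename, df1$IS.Name, sep = "|")

# Rows whose IS.Name is "NONE" find no match and get NA
df1$IS.RT <- df1$Chrom.1.RT[match(ref_key, own_key)]

# Next step mentioned in the question: difference to the reference
df1$RT.diff <- df1$Chrom.1.RT - df1$IS.RT
```

Because the key includes Filename, a reference is only ever taken from the same sample, and no loop over samples is needed.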

ddply run in a function looks in the environment outside the function?

I'm trying to write a function to do some often-repeated analysis, and one part of this is to count the number of groups and the number of members within each group, so ddply to the rescue! However, my code has a problem...
Here is some example data
> dput(BGBottles)
structure(list(Machine = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
weight = c(14.23, 14.96, 14.85, 16.46, 16.74, 15.94, 14.98,
14.88, 14.87, 15.94, 16.07, 14.91)), .Names = c("Machine",
"weight"), row.names = c(NA, -12L), class = "data.frame")
and here is my code
foo<-function(exp1, exp2, data) {
datadesc<-ddply(data, .(with(data, get(exp2))), nrow)
return(datadesc)
}
If I run this function, I get an error
> foo(exp="Machine",exp1="weight",data=BGBottles)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument
However, if I define my exp1, exp2 and data variables in the global environment first, it works
> exp1<-"weight"
> exp2<-"Machine"
> data<-BGBottles
> foo(exp="Machine",exp1="weight",data=BGBottles)
with.data..get.exp2.. V1
1 1 3
2 2 3
3 3 3
4 4 3
So, I assume ddply is running outside of the environment of the function? Is there a way to stop this, or am I doing something wrong?
Thanks
Paul.
You don't need get:
foo<-function(exp1, exp2, data) {
datadesc<-ddply(data, exp2, nrow)
return(datadesc)
}
This is an example of this bug: http://github.com/hadley/plyr/issues#issue/3. But as Marek points out, you don't need get here anyway.
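For comparison, the same group count can be sketched in base R, where a column named by a string is reached with [[ ]] and no evaluation environment is involved. The function name count_groups and the output column name n are hypothetical.

```r
# Example data from the question
BGBottles <- data.frame(
  Machine = factor(rep(1:4, each = 3)),
  weight  = c(14.23, 14.96, 14.85, 16.46, 16.74, 15.94,
              14.98, 14.88, 14.87, 15.94, 16.07, 14.91)
)

# Count rows per group, with the grouping column given as a string
count_groups <- function(data, group_col) {
  # data[[group_col]] looks the column up inside data only
  counts <- table(data[[group_col]], dnn = group_col)
  as.data.frame(counts, responseName = "n")
}

res <- count_groups(BGBottles, "Machine")
```

Here res has one row per machine, with the group label in Machine and the group size in n.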
