I have some long-standing package code that uses raster::rasterize that I'm trying to update to terra::rasterize. The code takes point data, where each point has one of two possible integer ID values. The output is a raster with two layers, one for each possible point ID, where cell values are counts. The relevant bits are:
# r0 is template raster to define extent and resolution
r <- raster::rasterize(dat[, c("X", "Y")],
r0,
field = dat$flightlineID,
fun = f,
background = 0)
Here, f is a function that takes a vector of point IDs and returns a two-element vector of counts, which results in the desired two layer output raster.
My first attempt to port this to terra::rasterize (package version 1.6-17) was...
r <- terra::rasterize(cbind(dat$X, dat$Y), # seem to need a matrix rather than a data frame
r0, # template SpatRaster
values = dat$flightlineID,
fun = f,
background = 0)
This fails with the error:
Error in w[vv[, 1], ] <- vv[, -1] :
number of items to replace is not a multiple of replacement length
Delving into the code for terra:::rasterize_points it seems that the number of layers for the output raster is determined by treating the 'values' argument as a data frame and checking the number of columns. This is a bit confusing because the package docs state that the values argument is expected to be a numeric vector, of either length 1 or nrow(x) where x is the input point data. Moreover, the length of the vector returned by the user-supplied summary function doesn't seem to play any part in determining the number of output raster layers.
For the moment I've simply retained the old raster::rasterize code and convert the output raster to a SpatRaster, but I think I must be missing something obvious. Is there a way of using just terra::rasterize to accomplish this task?
EDIT: As requested in comments, here is a small sample of the input point data to show the format. Typical input data sizes range from 2 to 40 million points.
structure(list(X = c(420094, 420067, 420017, 420050, 420058,
420090, 420038, 420040, 420081, 420097, 420075, 420041, 420039,
420062, 420050, 420083, 420019, 420019, 420044, 420087, 420099,
420077, 420030, 420014, 420015, 420051, 420033, 420056, 420041,
420030, 420027, 420024, 420058, 420042, 420063, 420028, 420073,
420053, 420010, 420100, 420048, 420062, 420056, 420080, 420053,
420068, 420074, 420004, 420010, 420078), Y = c(6676049, 6676029,
6676034, 6676019, 6676096, 6676010, 6676003, 6676048, 6676073,
6676023, 6676089, 6676082, 6676010, 6676051, 6676039, 6676099,
6676024, 6676073, 6676040, 6676056, 6676072, 6676086, 6676030,
6676042, 6676002, 6676033, 6676078, 6676073, 6676013, 6676056,
6676055, 6676069, 6676072, 6676089, 6676069, 6676058, 6676023,
6676039, 6676043, 6676017, 6676011, 6676054, 6676095, 6676068,
6676098, 6676077, 6676049, 6676073, 6676097, 6676057), flightlineID = c(2L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L)), row.names = c(NA, -50L), class = "data.frame")
EDIT: In the raster package code, the private .pointsToRaster function has a line (see here) where the length of the output from the user-supplied summary function is checked with some arbitrary test values to determine the number of layers in the output raster. This seems to be absent from the terra package code.
It may be that you don't want this as two layers in one raster, though that is hard to tell from the supplied data, as it appears to be all 'within' the overlap. I notice that in your package there is an attempt to throttle/reduce tile-edge points, which may just need to be set lower than 1K.
That terra doesn't behave the same as raster when rasterizing may be a deliberate decision: under terra, if you intend two layers, you make each one and then combine them with add<- or c(), whereas raster inferred the layer count through hard-to-follow logic around 'field' and 'values'. Using your data above (and keeping two rasters):
library(terra)
#las_df <- structure(...)
las_df1 <- las_df[which(las_df$flightlineID == 1L), ]
las_df2 <- las_df[which(las_df$flightlineID == 2L), ]
las_vect1 <- vect(las_df1, geom = c('X', 'Y'), crs = 'EPSG:32755')
las_vect2 <- vect(las_df2, geom = c('X', 'Y'), crs = 'EPSG:32755')
las_rast <- rast(xmin=0, nrow = length(unique(las_df$X)), ncol = length(unique(las_df$Y)), crs='EPSG:32755')
set.ext(las_rast, c(min(las_df$X), max(las_df$X), min(las_df$Y), max(las_df$Y)))
pts1_rast <- rasterize(las_vect1, las_rast, fun = length)
pts2_rast <- rasterize(las_vect2, las_rast, fun = length)
pts1_pts2_rast <- c(pts1_rast, pts2_rast)
names(pts1_pts2_rast) <- c('lyr.1', 'lyr.2') # both layers are named lyr.1 after c(), so rename them
plot(pts1_pts2_rast$lyr.1, col = 'red')
plot(pts1_pts2_rast$lyr.2, col = 'blue', alpha=.75, add = TRUE)
# there is 1 cell that contains points from both pts1_rast and pts2_rast
cells(pts1_rast) %in% cells(pts2_rast)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cells(pts2_rast) %in% cells(pts1_rast)
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE
One might suggest a consistent merge policy where pts1 or pts2 are always favored. In the end, if this is about optimizing allocation of scarce resources, clear bush where you have the best data, inspect, and clear again. But it still seems best to resolve this at the las level upstream.
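If a single multi-layer count raster is still what you want, here is a minimal sketch of one way to get it without a summary function at all. It assumes every point can be assigned to a cell with cellFromXY() and counted with tabulate(); count_by_id() is a hypothetical helper name, not part of terra, and the small template raster and points are made up for illustration.

```r
library(terra)

# Hypothetical helper (not a terra function): one count layer per ID value.
count_by_id <- function(xy, ids, template) {
  cells <- cellFromXY(template, xy)      # NA for points outside the template
  keep <- !is.na(cells)
  id_values <- sort(unique(ids))
  # One column of per-cell counts for each ID value
  counts <- sapply(id_values, function(id) {
    tabulate(cells[keep & ids == id], nbins = ncell(template))
  })
  out <- rast(template, nlyrs = length(id_values))
  values(out) <- counts                  # matrix columns become layers
  names(out) <- paste0("id_", id_values)
  out
}

# Made-up 2 x 2 template and four points, for illustration only
r0 <- rast(xmin = 420000, xmax = 420100, ymin = 6676000, ymax = 6676100,
           resolution = 50, crs = "EPSG:32755")
xy <- cbind(c(420010, 420010, 420090, 420010),
            c(6676090, 6676090, 6676010, 6676090))
ids <- c(1L, 1L, 2L, 2L)
r <- count_by_id(xy, ids, r0)
```

Cells with no points get 0 rather than NA, which matches the background = 0 in the original raster call. I have not tested this on inputs in the tens of millions of points.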
I have a dataset that was recorded by observation (each observation has its own row of data). I am looking to combine/condense these rows by the plant they were found on, which is currently a character variable. All other columns are numerical values.
EX:
This is the raw data
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|1|5|10|
|Asclepias viridiflora|15|1|3|
|Viola unknown|0|0|4|
|Zizia aurea|0|2|6|
|Zizia aurea|3|6|3|
|Asclepias viridiflora|8|2|17|
and I want:
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|4|13|19|
|Asclepias viridiflora|23|3|20|
|Viola unknown|0|0|4|
I am currently pulling this data from a CSV that is already in table form. I have been attempting to create a new table/data frame with one entry for each plant species and blanks/0s for every other variable, which I could then cbind to the original. This, however, has been clunky at best, and I am having trouble figuring out how to have each row check itself. I am open to any approach, let me know what you think!
Thanks :D
We can use the formula method of aggregate from base R. On the rhs of the ~, specify the grouping variable; on the lhs, use . to denote all the remaining variables. Specify FUN as sum and it will do the column-wise sum by group.
aggregate(. ~ Sci_Name, df1, sum)
-output
Sci_Name Honeybee_count Other_bee_Obsevrved Stem_count
1 Asclepias viridiflora 23 3 20
2 Viola unknown 0 0 4
3 Zizia aurea 4 13 19
data
df1 <- structure(list(Sci_Name = c("Zizia aurea", "Asclepias viridiflora",
"Viola unknown", "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"
), Honeybee_count = c(1L, 15L, 0L, 0L, 3L, 8L), Other_bee_Obsevrved = c(5L,
1L, 0L, 2L, 6L, 2L), Stem_count = c(10L, 3L, 4L, 6L, 3L, 17L)),
class = "data.frame", row.names = c(NA,
-6L))
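For comparison, here is a small sketch of two equivalent spellings, in case the dot shorthand feels opaque: naming the summed columns explicitly with cbind() on the lhs, and base R's rowsum(). The data frame is rebuilt inline only so that the snippet runs by itself.

```r
df1 <- data.frame(
  Sci_Name = c("Zizia aurea", "Asclepias viridiflora", "Viola unknown",
               "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"),
  Honeybee_count = c(1L, 15L, 0L, 0L, 3L, 8L),
  Other_bee_Obsevrved = c(5L, 1L, 0L, 2L, 6L, 2L),
  Stem_count = c(10L, 3L, 4L, 6L, 3L, 17L)
)

# Same as `. ~ Sci_Name`, but with the summed columns spelled out
by_formula <- aggregate(
  cbind(Honeybee_count, Other_bee_Obsevrved, Stem_count) ~ Sci_Name,
  data = df1, FUN = sum
)

# rowsum() returns the group labels as row names instead of a column
by_rowsum <- rowsum(df1[, -1], group = df1$Sci_Name)
```

One thing to keep in mind: the formula method drops rows containing NA by default (na.action = na.omit), which does not matter for this sample but may matter for the full CSV.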
I have two data frame columns whose factor levels look identical, but R says they are not:
levels(train$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(test$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(train$colA) == levels(test$colA)
## [1] FALSE TRUE TRUE
I have tried comparing both sentences and actually they are equal:
"I am currently using (least once over the last 2 weeks)" == "I am currently using (least once over the last 2 weeks)"
## [1] TRUE
I am trying to apply xgboost trained model to test data. Trained model comes from train dataframe. Now I am trying to apply it to test, but with no success, as I get the error that test has a new factor.
Edited:
Here is the output of dput():
dput(head(train$colA))
structure(c(1L, 1L, 1L, 2L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)", "I have never tried nor used",
"I have tried or used at some point in time"
), class = "factor")
dput(head(test$colA))
structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)", "I have never tried nor used",
"I have tried or used at some point in time"
), class = "factor")
I can see there is a difference between c(1L, 1L, 1L, 2L, 1L, 1L) and c(1L, 1L, 1L, 1L, 1L, 1L). So I guess this is the key, although I don't know exactly what it means.
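One thing that might be worth checking, purely as a guess: two strings can print identically and still differ by an invisible character such as a non-breaking space. The sketch below uses two made-up strings (not the real levels) to show how comparing code points with utf8ToInt() locates such a difference.

```r
a <- "I am currently using"
b <- "I am\u00a0currently using"   # non-breaking space after "am" (illustrative)

same <- a == b                     # FALSE, although both print alike

# Compare the strings code point by code point
ia <- utf8ToInt(a)
ib <- utf8ToInt(b)
first_diff <- which(ia != ib)[1]   # position of the first mismatch
ia[first_diff]                     # 32, an ordinary space
ib[first_diff]                     # 160, a non-breaking space
```

The same comparison could be run on levels(train$colA)[1] and levels(test$colA)[1]. If the two strings have different lengths, compare length(ia) and length(ib) first.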
R stores factors as integer codes plus a set of levels. As a result, identical() returns FALSE for two factors that show the same value but have different sets of levels.
Here's an MWE:
y <- structure(list(portfolio_date = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("2000-10-31", "2001-04-30"), class = "factor"),
security = structure(c(2L, 2L, 1L, 3L, 2L, 4L), .Label = c("Currency Australia (Fwd)",
"Currency Euro (Fwd)", "Currency Japan (Fwd)", "Currency United Kingdom (Fwd)"
), class = "factor")), .Names = c("portfolio_date", "security"
), row.names = c(10414L, 10417L, 10424L, 21770L, 21771L, 21774L
), class = "data.frame")
x <- structure(list(portfolio_date = structure(1L, .Label = "2000-10-31", class = "factor"),
security = structure(1L, .Label = "Currency Euro (Fwd)", class = "factor")),
.Names = c("portfolio_date", "security"), row.names = 10414L, class = "data.frame")
identical(y[1,], x)
This returns FALSE.
But if we look at the objects, they appear identical to the user:
y[1,]
portfolio_date security
10414 2000-10-31 Currency Euro (Fwd)
x
portfolio_date security
10414 2000-10-31 Currency Euro (Fwd)
Ultimately I want to be able to do something like the following:
apply(y, 1, identical, x)
10414 10417 10424 21770 21771 21774
TRUE TRUE FALSE FALSE FALSE FALSE
which(apply(y, 1, identical, x))
1 2
Any suggestions as to how to achieve this? Thanks.
One option is to use rowwise() from dplyr to check row by row. If you also need to compare the row names, create an id column in both data frames first; otherwise the check returns TRUE for the first two rows.
library(dplyr)
x$id <- row.names(x)
y$id <- row.names(y)
rowwise(y) %>% do(check = isTRUE(all.equal(., x, check.attributes = F))) %>% data.frame
check
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
To perform the comparison, the factors need to be converted to character vectors.
Here is a solution using base R alone:
apply(apply(y, 2, as.character), 1, identical, apply(x, 2, as.character))
The inner apply loops convert each column in the source and target data frames to character objects and the outer apply loops through the rows.
If the x data frame has more than one row, the actual behavior may not be as expected.
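To see why the conversion matters, here is a tiny made-up example (not the OP's data): the two factors below show the same value but carry different levels and therefore different integer codes.

```r
f1 <- factor("b", levels = c("a", "b"))  # "b" is stored as code 2
f2 <- factor("b")                        # "b" is stored as code 1

as.integer(f1)                                   # 2
as.integer(f2)                                   # 1
identical(f1, f2)                                # FALSE: levels and codes differ
identical(as.character(f1), as.character(f2))    # TRUE: only the labels are compared
```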
Use the package 'compare'.
library(compare)
result <- NULL
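for (i in seq_len(NROW(y))) {
# compare row i of y with x, ignoring unused factor levels
one <- compare(y[i,], x, dropLevels=TRUE)
# TRUE only if both columns match
two <- one$detailedResult[1]==TRUE & one$detailedResult[2]==TRUE
result <- c(result, two)
}
as.character(result)#TRUE TRUE FALSE FALSE FALSE FALSE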
Solution for data posted in OP
The example posted in the OP can be easily treated by using droplevels().
Let us first look at why the comparison identical(y[1,], x) returns FALSE:
str(y[1,])
#'data.frame': 1 obs. of 2 variables:
#$ portfolio_date: Factor w/ 2 levels "2000-10-31","2001-04-30": 1
#$ security : Factor w/ 4 levels "Currency Australia (Fwd)",..: 2
whereas
str(x)
#'data.frame': 1 obs. of 2 variables:
#$ portfolio_date: Factor w/ 1 level "2000-10-31": 1
#$ security : Factor w/ 1 level "Currency Euro (Fwd)": 1
So the difference lies in the factor levels, even though both objects are displayed in the same way, as shown in the OP's question.
This is where the function droplevels() is useful: it removes unused factor levels. By applying droplevels() to y[1,] with its unused levels, we obtain:
identical(droplevels(y[1,]), x)
#[1] TRUE
If x also contains unused factor levels, it will be necessary to wrap it in droplevels(), too. In any case, it won't do any harm:
identical(droplevels(y[1,]), droplevels(x))
#[1] TRUE
General solution
Using droplevels() may not work if the real data is more complex than the data posted in the "MWE" of the OP. Such situations may include, e.g., equivalent entries in x and y[1,] that are stored as different factor levels. An example where droplevels() fails is given in the data section at the end of this answer.
The following solution represents an efficient possibility to treat such general situations. It works for the data posted in the OP as well as for the more complicated case of the data posted below.
First, two auxiliary vectors are created that contain only the characters of each row. By using paste() we can concatenate each row to a single character string:
temp_x <- apply(x, 1, paste, collapse=",")
temp_y <- apply(y, 1, paste, collapse=",")
With these vectors it becomes easy to compare rows of the original data.frames, even if the entries were originally stored as factors with different levels and numbering.
To identify which rows are identical, we can use the %in% operator, which is more appropriate than the function identical() in this case, as the former checks for equality of all possible row combinations, and not just individual pairs.
With these simple modifications the desired output can be obtained quickly and without further loops:
setNames(temp_y %in% temp_x, names(temp_y))
#10414 10417 10424 21770 21771 21774
# TRUE TRUE FALSE FALSE FALSE FALSE
which(temp_y %in% temp_x)
#[1] 1 2
y[temp_y %in% temp_x,]
# portfolio_date security
#10414 2000-10-31 Currency Euro (Fwd)
#10417 2000-10-31 Currency Euro (Fwd)
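A possible variation on the same idea, sketched under the assumption that the real data might contain commas inside values (which could make two different rows paste to the same string): build the keys column-wise with a separator that is unlikely to occur in the data. row_key() is a hypothetical helper name, and the data frames are rebuilt inline from the OP's MWE so the snippet runs by itself.

```r
y <- data.frame(
  portfolio_date = factor(c("2000-10-31", "2000-10-31", "2000-10-31",
                            "2001-04-30", "2001-04-30", "2001-04-30")),
  security = factor(c("Currency Euro (Fwd)", "Currency Euro (Fwd)",
                      "Currency Australia (Fwd)", "Currency Japan (Fwd)",
                      "Currency Euro (Fwd)", "Currency United Kingdom (Fwd)"))
)
x <- data.frame(portfolio_date = factor("2000-10-31"),
                security = factor("Currency Euro (Fwd)"))

# Hypothetical helper: one key string per row, built column by column.
# "\037" (ASCII unit separator) is assumed not to appear in the data.
row_key <- function(d) {
  do.call(paste, c(unname(lapply(d, as.character)), sep = "\037"))
}

matches <- row_key(y) %in% row_key(x)
which(matches)
```

Because paste() is vectorised over whole columns, this avoids the per-row apply() call, and as.character() is applied to each column separately rather than to the whole data frame at once.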
data
x <- structure(list(portfolio_date = structure(1:2, .Label = c("2000-05-15",
"2000-10-31"), class = "factor"), security = structure(c(2L, 1L),
.Label = c("Currency Euro (Fwd)", "Currency USD (Fwd)"),
class = "factor")), .Names = c("portfolio_date", "security"),
class = "data.frame", row.names = c("10234", "10414"))
y <- structure(list(portfolio_date = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("2000-10-31", "2001-04-30"), class = "factor"),
security = structure(c(2L, 2L, 1L, 3L, 2L, 4L),
.Label = c("Currency Australia (Fwd)", "Currency Euro (Fwd)",
"Currency Japan (Fwd)", "Currency United Kingdom (Fwd)"),
class = "factor")), .Names = c("portfolio_date", "security"),
row.names = c(10414L, 10417L, 10424L, 21770L, 21771L, 21774L),
class = "data.frame")
I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two. For example, Compound-1 uses IS-1, while IS-1 itself has no reference (IS). Within each sample I am trying to match the IS.Name to the row of the compound with that name, grab its Chrom.1.RT, and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value of the compound whose name equals the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList <- unique(df1$Filename)
for (s in sampleList) {
# s is the Filename value itself, so compare against it directly
SampleRows <- which(df1$Filename == s)
RefRows <- df1[SampleRows, ]
df1$IS.RT[SampleRows] <- RefRows$Chrom.1.RT[match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest that in the future you provide your example as the output of dput(df1), as that makes it a lot easier to read into R than the space-delimited table you provided.
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1 <- as_tibble(df1)  # tbl_df() is deprecated
ref <- df1 %>% group_by(Compound) %>% summarise(unique(Chrom.1.RT))
left_join(df1, left_join(df1 %>% select(-Compound), ref, c("IS.Name" = "Compound"))) %>%
select(-IS.RT) %>%
rename(IS.RT = `unique(Chrom.1.RT)`)
Unless I got it wrong, this is what you need?
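For what it's worth, the lookup can also be sketched in base R without a loop or a join, assuming each Compound appears at most once per Filename: build a key from Filename plus Compound, build a second key from Filename plus IS.Name, and match() one against the other. The data frame is rebuilt inline (as plain character columns) so the snippet runs by itself; own_key and ref_key are just illustrative names.

```r
df1 <- data.frame(
  Filename = rep(c("Sample-001", "Sample-002"), each = 4),
  Compound = rep(c("IS-1", "Compound-1", "IS-2", "Compound-2"), times = 2),
  Chrom.1.RT = rep(c(1.32495, 1.344033333, 0.127416667, 0), times = 2),
  IS.Name = rep(c("NONE", "IS-1", "NONE", "IS-2"), times = 2),
  stringsAsFactors = FALSE
)

# Key identifying each row's own compound within its file
own_key <- paste(df1$Filename, df1$Compound, sep = "\037")
# Key identifying the reference compound each row points to
ref_key <- paste(df1$Filename, df1$IS.Name, sep = "\037")

# Rows whose IS.Name is "NONE" find no match and get NA
df1$IS.RT <- df1$Chrom.1.RT[match(ref_key, own_key)]
```

Unlike the join above, this keeps the lookup within each Filename, so it should still behave if the same reference compound has different retention times in different samples.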
I am trying to compare two questions (columns Q1_b and Q2_b) and plot them next to each other in the same barplot; the answer options are 1-6. The problem is that no one answered with 4 for Q1_b, so the barplot skips ahead and displays 5 where 4 should be for Q1_b, next to the percentage of people who answered 4 for Q2_b. How can I make sure R doesn't do this and automatically enters a 0% column if there weren't any answers for a specific option?
alldataset<-structure(list(Q1_b = c(6L, 1L, 5L, 3L, 5L, 6L, 6L, 2L),
Q2_b = c(1L, 2L, 2L, 5L, 4L, 3L, 6L, 1L)),
.Names = c("Q1_b", "Q2_b"),
class = "data.frame",
row.names = c(NA, -8L))
Qb<-table(alldataset$Q2_b)
Qf<-table(alldataset$Q1_b)
nrowFUP<-NROW(alldataset$Q1_b)
nrowBL<-NROW(alldataset$Q2_b)
options(digits=6)
newbl <- transform(as.data.frame(table(alldataset$Q2_b)),
percentage_column=Freq/nrowBL*100)
newfup <- transform(as.data.frame(table(alldataset$Q1_b)),
percentage_column=Freq/nrowFUP*100)
matrixQ1<-cbind(newbl$percentage_column, newfup$percentage_column)
matrixQ1dataframe<-data.frame(matrixQ1)
rmatrixQ1<-as.vector(t(matrixQ1dataframe))
roundedrmatrix<-round(rmatrixQ1, digits=0)
barplotmatrix<-matrix(roundedrmatrix)
par(mar=c(7.5,4,3,2), mgp=c(2,.7,0), tck=-.01, las=1, xpd=TRUE)
b<-barplot(matrix(roundedrmatrix, nr=2),
beside=T, xlab="",
ylab="Percentage",
cex.lab=0.9,
main="Comparison",
cex.main=0.9, ylim=c(0,70),
col=c("black","yellow"),
names.arg=c(1:6),
legend=c("Q2_b","Q1_b"),
args.legend=list(x="bottomleft",
cex=0.8,
inset=c(0.4,-0.4)))
text(x=b, y=roundedrmatrix,labels=roundedrmatrix, pos=3, cex=0.8)
R also warns me this will happen by displaying:
Warning message:
In cbind(newbl$percentage_column, newfup$percentage_column) :
number of rows of result is not a multiple of vector length (arg 2)
I have been trying for ages to sort this out but I am not getting anywhere. Can anyone help?
The problem is that you never told R that your vectors represent categorical responses with potential values of 1-6, so it does not know to include the 0 counts (you would not want it to include a 0 for 7, 8, 1 million, etc.).
Try replacing your first two lines with:
Qb<-table(factor(alldataset$Q2_b, levels=1:6))
Qf<-table(factor(alldataset$Q1_b, levels=1:6))
or run something like:
alldataset$Q1_b <- factor(alldataset$Q1_b, levels=1:6)
alldataset$Q2_b <- factor(alldataset$Q2_b, levels=1:6)
before the table commands.
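A quick illustration of the difference, using the Q1_b values from the question:

```r
q1 <- c(6L, 1L, 5L, 3L, 5L, 6L, 6L, 2L)

# Without levels, table() only knows the values that actually occur
plain <- table(q1)
names(plain)          # "1" "2" "3" "5" "6" (no entry for 4)

# With explicit levels, the missing option is kept with a count of 0
full <- table(factor(q1, levels = 1:6))
full[["4"]]           # 0
```

Both tables then have six entries per question, so the cbind() no longer recycles and the bars line up.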
You need to tell table to use all values from one to six with table(factor(x, seq.int(6))).
Here is an improved version of your code:
dat <- t(round(sapply(rev(alldataset),
function(x) table(factor(x, seq.int(6)))) /
nrow(alldataset) * 100))
par(mar=c(7.5,4,3,2), mgp=c(2,.7,0), tck=-.01, las=1, xpd=TRUE)
b <- barplot(dat, beside=TRUE, xlab="", ylab="Percentage", cex.lab=0.9,
main="Comparison", cex.main=0.9, ylim=c(0,70),
col=c("black","yellow"), names.arg=c(1:6), legend=rownames(dat), # dat is a matrix, so names(dat) is NULL
args.legend=list(x="bottomleft", cex=0.8, inset=c(0.4,-0.4)))
text(x=b, y=dat,labels=dat, pos=3, cex=0.8)