Comparing two apparently identical levels from factors

Comparing two apparently identical levels from factors - r

I have two dataframe columns that have apparently identical factors, but they don't:
levels(train$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(test$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(train$colA) == levels(test$colA)
## [1] FALSE TRUE TRUE
I have tried comparing both sentences and actually they are equal:
"I am currently using (least once over the last 2 weeks)" == "I am currently using (least once over the last 2 weeks)"
## [1] TRUE
I am trying to apply xgboost trained model to test data. Trained model comes from train dataframe. Now I am trying to apply it to test, but with no success, as I get the error that test has a new factor.
Edited:
Here is the output of dput():
dput(head(train$colA))
structure(c(1L, 1L, 1L, 2L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)", "I have never tried nor used"
"I have tried or used at some point in time"
), class = "factor")
dput(head(test$colA))
structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)", "I have never tried nor used"
"I have tried or used at some point in time"
), class = "factor")
I can see there is a difference from: c(1L, 1L, 1L, 2L, 1L, 1L) to c(1L, 1L, 1L, 1L, 1L, 1L) . So I guess here is the key, although I don't know what does it exactly mean.

Related

Rasterize points using the terra package to produce multi-layer SpatRaster

I have some long-standing package code that uses raster::rasterize that I'm trying to update to terra::rasterize. The code takes point data, where each point has one of two possible integer ID values. The output is a raster with two layers, one for each possible point ID, where cell values are counts. The relevant bits are:
# r0 is template raster to define extent and resolution
r <- raster::rasterize(dat[, c("X", "Y")],
r0,
field = dat$flightlineID,
fun = f,
background = 0)
Here, f is a function that takes a vector of point IDs and returns a two-element vector of counts, which results in the desired two layer output raster.
My first attempt to port this to terra::rasterize (package version 1.6-17) was...
r <- terra::rasterize(cbind(dat$X, dat$Y), # seem to need a matrix rather than a data frame
r0, # template SpatRaster
values = dat$flightlineID,
fun = f,
background = 0)
This fails with the error:
Error in w[vv[, 1], ] <- vv[, -1] :
number of items to replace is not a multiple of replacement length
Delving into the code for terra:::rasterize_points it seems that the number of layers for the output raster is determined by treating the 'values' argument as a data frame and checking the number of columns. This is a bit confusing because the package docs state that the values argument is expected to be a numeric vector, of either length 1 or nrow(x) where x is the input point data. Moreover, the length of the vector returned by the user-supplied summary function doesn't seem to play any part in determining the number of output raster layers.
For the moment I've simply retained the old raster::rasterize code and convert the output raster to a SpatRaster, but I think I must be missing something obvious. Is there a way of using just terra::rasterize to accomplish this task?
EDIT: As requested in comments, here is a small sample of the input point data to show the format. Typical input data sizes range from 2 to 40 million points.
structure(list(X = c(420094, 420067, 420017, 420050, 420058,
420090, 420038, 420040, 420081, 420097, 420075, 420041, 420039,
420062, 420050, 420083, 420019, 420019, 420044, 420087, 420099,
420077, 420030, 420014, 420015, 420051, 420033, 420056, 420041,
420030, 420027, 420024, 420058, 420042, 420063, 420028, 420073,
420053, 420010, 420100, 420048, 420062, 420056, 420080, 420053,
420068, 420074, 420004, 420010, 420078), Y = c(6676049, 6676029,
6676034, 6676019, 6676096, 6676010, 6676003, 6676048, 6676073,
6676023, 6676089, 6676082, 6676010, 6676051, 6676039, 6676099,
6676024, 6676073, 6676040, 6676056, 6676072, 6676086, 6676030,
6676042, 6676002, 6676033, 6676078, 6676073, 6676013, 6676056,
6676055, 6676069, 6676072, 6676089, 6676069, 6676058, 6676023,
6676039, 6676043, 6676017, 6676011, 6676054, 6676095, 6676068,
6676098, 6676077, 6676049, 6676073, 6676097, 6676057), flightlineID = c(2L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L)), row.names = c(NA, -50L), class = "data.frame")
EDIT: In the raster package code, the private .pointsToRaster function has a line (see here) where the length of the output from the user-supplied summary function is checked with some arbitrary test values to determine the number of layers in the output raster. This seems to be absent from the terra package code.

It may be that you don't want this as two layers in one raster, though this is hard to tell with the supplied data as it appears to be all 'within' the overlap. I notice in you package, there is an attempt to throttle/reduce tile edge points that maybe just needs to be set lower than 1K.
That terra doesn't work the same as raster when rasterize(ing may be a decision that under terra one should intend two layers via making each then add<-ing or <- c(ing, whereas with raster it was assumed via a hard to follow logic of 'field' and 'values'. Using your above data (and keeping two rasters):
library(terra)
#las_df <- structure(...)
las_df1 <- las_df[which(las_df$flightlineID == 1L), ]
las_df2 <- las_df[which(las_df$flightlineID == 2L), ]
las_vect1 <- vect(las_df1, geom = c('X', 'Y'), crs = 'EPSG:32755')
las_vect2 <- vect(las_df2, geom = c('X', 'Y'), crs = 'EPSG:32755')
las_rast <- rast(xmin=0, nrow = length(unique(las_df$X)), ncol = length(unique(las_df$Y)), crs='EPSG:32755')
set.ext(las_rast, c(min(las_df$X), max(las_df$X), min(las_df$Y), max(las_df$Y)))
pts1_rast <- rasterize(las_vect1, las_rast, fun = length)
pts2_rast <- rasterize(las_vect2, las_rast, fun = length)
pts1_pts2_rast <- c(pts1_rast, pts2_rast)
names(pts1_pts2_rast) <- c('lyr.1', 'lyr.2') # have to attend to this as both lyr.1 after `c(`
plot(pts1_pts2_rast$lyr.1, col = 'red')
plot(pts1_pts2_rast$lyr.2, col = 'blue', alpha=.75, add = TRUE)
# there is 1 cell that contains points from both pts1_rast and pts2_rast
cells(pts1_rast) %in% cells(pts2_rast)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cells(pts2_rast) %in% cells(pts1_rast)
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE
One might suggest a consistent merge policy where pts1 or pts2 are always favored. In the end, if this is about optimizing allocation of scarce resources, clear bush where you have the best data, inspect, and clear again. But it still seems best to resolve this at the las level upstream.

Create a column value based on a matching regular expression

I have the following character string in a column called "Sentences" for a df:
I like an apple
I would like to create a second column, called Type, whose values are determined by matching strings. I would like to take the regular expression \bapple\b, match it with the sentence and if it matches, add the value Fruit_apple in the Type column.
In the long run I'd like to do this with several other strings and types.
Is there an easy way to do this using a function?
dataset (survey_1):
structure(list(slider_8.response = c(1L, 1L, 3L, 7L, 7L, 7L,
1L, 3L, 2L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 7L,
7L, 7L, 1L, 1L, 7L, 6L, 6L, 1L, 1L, 7L, 1L, 7L, 7L, 1L, 7L, 7L,
7L, 7L, 7L, 6L, 7L, 7L, 7L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 7L, 2L
), Sentences = c("He might could do it.", "I ever see the film.",
"I may manage to come visit soon.", "She’ll never be forgotten.",
"They might find something special.", "It might not be a good buy.",
"Maybe my pain will went away.", "Stephen maybe should fix your bicycle.",
"It used to didnʼt matter if you walked in late.", "He’d could climb the stairs.",
"Only Graeme would might notice that.", "I used to cycle a lot. ",
"Your dad belongs to disagree with this. ", "We can were pleased to see her.",
"He may should take us to the city.", "I could never forgot his deep voice.",
"I should can turn this thing over to Ann.", "They must knew who they really are.",
"We used to runs down three flights.", "I don’t care what he may be up to. ",
"That’s something I ain’t know about.", "That must be quite a skill.",
"We must be able to invite Jim.", "She used to play with a trolley.",
"He is done gone. ", "You might can check this before making a decision.",
"It would have a positive effect on the team. ", "Ruth can maybe look for it later.",
"You should tag along at the dance.", "They’re finna leave town.",
"A poem should looks like that.", "I can tell you didn’t do your homework. ",
"I can driving now.", "They should be able to put a blanket over it.",
"We could scarcely see each other.", "I might says I was never good at maths.",
"The next dance will be a quickstep. ", "I might be able to find myself a seat in this place.",
"Andrew thinks we shouldn’t do it.", "Jack could give a hand.",
"She’ll be able to come to the event.", "She’d maybe keep the car the way it is.",
"Sarah used to be able to agree with this proposal.", "I’d like to see your lights working. ",
"I’d be able to get a little bit more sleep.", "John may has a second name.",
"You must can apply for this job.", "I maybe could wait till the 8 o’clock train.",
"She used to could go if she finished early.", "That would meaned something else, eh?",
"You’ll can enjoy your holiday.", "We liketa drowned that day. ",
"I must say it’s a nice feeling.", "I eaten my lunch."), construct = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 54L), class = "data.frame")
type_list:
list("DM_will_can"=c("ll can","will can"), "DM_would_could"=c("d could","would could"),
"DM_might_can"="might can","DM_might_could"="might could","DM_used_to_could"="used to could",
"DM_should_can"="should can","DM_would_might"=c("d might", "would might"),"DM_may_should"="may should",
"DM_must_can"="must can", "SP_will_be_able"=c("ll be able","will be able"),
"SP_would_be_able"=c("d be able","would be able"),"SP_might_be_able"="might be able",
"SP_maybe_could"="maybe could","SP_used_to_be_able"="used to be able","SP_should_be_able"=
"should be able","SP_would_maybe"=c("d maybe", "would maybe"), "SP_maybe_should"="maybe should",
"SP_must_be_able"="must be able", "Filler_will_a"="quickstep","Filler_will_b"="forgotten",
"Filler_would_a"="lights working","Filler_would_b"="positive effect","Filler_can_a"="homework",
"Filler_can_b"="Ruth","Filler_could_a"="scarcely","Filler_could_b"="Jack", "Filler_may_a"="may be up to",
"Filler_may_b"="visit soon", "Filler_might_a"="good buy","Filler_might_be"="something special",
"Filler_should_a"="tag along","Filler_should_b"="Andrew","Filler_used_to_a"="trolley",
"Filler_used_to_b"="cycle a lot","Filler_must_a"="quite a skill","Filler_must_b"="nice feeling",
"Dist_gram_will_went"="will went","Dist_gram_meaned"="meaned","Dist_gram_can_were"="can were",
"Dist_gram_forgot"="never forgot", "Dist_gram_may_has"="may has",
"Dist_gram_might_says"="might says","Dist_gram_used_to_runs"="used to runs",
"Dist_gram_should_looks"="should looks","Dist_gram_must_knew"="must knew","Dist_dial_liketa"="liketa",
"Dist_dial_belongs"="belongs to disagree","Dist_dial_finna"="finna","Dist_dial_used_to_didnt"="used to didn't matter",
"Dist_dial_eaten"="I eaten", "Dist_dial_can_driving"="can driving","Dist_dial_aint_know"="That's something",
"Dist_dial_ever_see"="ever see the film","Dist_dial_done_gone"="done gone")

I want to do this with a Python dictionary, but we're talking about R, so I've more or less translated the approach. There is probably a more idiomatic way to do this in R than two for loops, but this should work:
# Define data
df <- data.frame(
id = c(1:5),
sentences = c("I like apples", "I like dogs", "I have cats", "Dogs are cute", "I like fish")
)
# id sentences
# 1 1 I like apples
# 2 2 I like dogs
# 3 3 I have cats
# 4 4 Dogs are cute
# 5 5 I like fish
type_list <- list(
"fruit" = c("apples", "oranges"),
"animals" = c("dogs", "cats")
)
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
for (item in type_list[[type]]) {
matches <- grep(item, df$sentences, ignore.case = TRUE)
df[matches, "type"] = type
df[matches, "item"] = item
}
}
# Output:
# id sentences type item
# 1 1 I like apples fruit apples
# 2 2 I like dogs animals dogs
# 3 3 I have cats animals cats
# 4 4 Dogs are cute animals dogs
# 5 5 I like fish <NA> <NA>
EDIT
Added after data was added. If I read in your data, and call it df, and your type list and call it type_list, the following works:
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
for (item in type_list[[type]]) {
matches <- grep(item, df$Sentences, ignore.case = TRUE)
df[matches, "type"] = type
df[matches, "item"] = item
}
}
This is exactly the same as my previous code, except Sentences has an upper case S in your data frame.

s3 is there a way to combine prop.table for character variables?

Noob here, I'm stuck trying to use S3 to summarise proportion data for a data.frame where there are four columns of character data. My goal is to build a summary method to show the proportions for every level of every variable at one time.
I can see how to get the propotion for each column
a50survey1 <- table(Student1995$alcohol)
a50survey2 <- table(Student1995$drugs)
a50survey3 <- table(Student1995$smoke)
a50survey4 <- table(Student1995$sport)
prop.table(a50survey1)
prop.table(a50survey1)
Not Once or Twice a week Once a month Once a week More than once a week
0.10 0.32 0.24 0.28 0.06
But I cannot find a way to combine all of the prop.table outputs into one summary output.
Unless I'm really wrong. I cannot find a S3 method like summary.prop.table which would work for me. The goal is to set up for the current data frame and then drop in new same size & observations data frames in the future.
I'm really a step by step guy and if you can help me, that would be great - thank you
Dataframe info here. There are four columns, where each column has a different number of catagorical options for obersvations.
> dput(head(Student1995,5))
structure(list(alcohol = structure(c(3L, 2L, 2L, 2L, 3L), .Label = c("Not",
"Once or Twice a week", "Once a month", "Once a week", "More than once a week"
), class = "factor"), drugs = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("Not",
"Tried once", "Occasional", "Regular"), class = "factor"), smoke = structure(c(2L,
3L, 1L, 1L, 1L), .Label = c("Not", "Occasional", "Regular"), class = "factor"),
sport = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Not regular",
"Regular"), class = "factor")), row.names = c(NA, 5L), class = "data.frame")
The Summary data if it helps - edit
> summary(Student1995)
alcohol drugs smoke sport
Not : 5 Not :36 Not :38 Not regular:13
Once or Twice a week :16 Tried once: 6 Occasional: 5 Regular :37
Once a month :12 Occasional: 7 Regular : 7
Once a week :14 Regular : 1
More than once a week: 3

Maybe this is what you wanted. Values in each category sum up to 100%.
lis <- sapply( Student1995, function(x) t( sapply( x, table ) ) )
sapply( lis, function(x) colSums(prop.table(x)) )
$alcohol
Not Once.or.Twice.a.week Once.a.month
0.0 0.6 0.4
Once.a.week More.than.once.a.week
0.0 0.0
$drugs
Not Tried.once Occasional Regular
0.8 0.2 0.0 0.0
$smoke
Not Occasional Regular
0.6 0.2 0.2
$sport
Not.regular Regular
0.4 0.6
and the whole summary...
prop.table( table(as.vector( sapply( Student1995, unlist ))) )
Not Not regular Occasional
0.35 0.10 0.05
Once a month Once or Twice a week Regular
0.10 0.15 0.20
Tried once
0.05

R: Need to perform multiple matches for each row in data frame

I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, compound-1 is using IS-1, while IS-1 does not have a reference (IS). Within each sample I am trying to match up the IS Name to the compound row for it to grab the CHrom.1.RT and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList<- unique(df1$Filename)
for (i in sampleList){
SampleRows<-which(df1$Filename == sampleList[i])
RefRows <- subset(df1, Filename== sampleList[i])
df1$IS.RT[SampleRows]<- RefRows$Chrom.1.RT[ match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.

First of all, I suggest in the future you provide your example as the output of dput(df1) as it makes it a lot easier to read it into R instead of the space delimited table you provided
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(df1,left_join(df1%>%select(-Compound),df1%>%group_by(Compound)%>%summarise(unique(Chrom.1.RT)),c("IS.Name"="Compound")))%>%select(-IS.RT)%>%rename(IS.RT=`unique(Chrom.1.RT)`)
Unless I got i wrong, this is what you need?

ddply run in a function looks in the environment outside the function?

I'm trying to write a function to do some often repeated analysis, and one part of this is to count the number of groups and number of members within each group, so ddply to the rescue !, however, my code has a problem....
Here is some example data
> dput(BGBottles)
structure(list(Machine = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
weight = c(14.23, 14.96, 14.85, 16.46, 16.74, 15.94, 14.98,
14.88, 14.87, 15.94, 16.07, 14.91)), .Names = c("Machine",
"weight"), row.names = c(NA, -12L), class = "data.frame")
and here is my code
foo<-function(exp1, exp2, data) {
datadesc<-ddply(data, .(with(data, get(exp2))), nrow)
return(datadesc)
}
If I run this function, I get an error
> foo(exp="Machine",exp1="weight",data=BGBottles)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument
However, if I define my exp1, exp2 and data variables int he global environemtn first, it works
> exp1<-"weight"
> exp2<-"Machine"
> data<-BGBottles
> foo(exp="Machine",exp1="weight",data=BGBottles)
with.data..get.exp2.. V1
1 1 3
2 2 3
3 3 3
4 4 3
So, I assume ddply is running outside of the environemtn of the function ? Is there a way to stop this, or am I doing something wrong ?
Thanks
Paul.

You don't need get:
foo<-function(exp1, exp2, data) {
datadesc<-ddply(data, exp2, nrow)
return(datadesc)
}

This is an example of this bug: http://github.com/hadley/plyr/issues#issue/3. But as Marek points out, you don't need get here anyway.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Comparing two apparently identical levels from factors - r

Related

Rasterize points using the terra package to produce multi-layer SpatRaster

Create a column value based on a matching regular expression

s3 is there a way to combine prop.table for character variables?

R: Need to perform multiple matches for each row in data frame

ddply run in a function looks in the environment outside the function?

Categories

Resources