R Data.table strange search issue. (unprintable chars?) - r

Dears,
I have a strange issue.
I have a data.table "harvest":
str(harvest)
Classes ‘data.table’ and 'data.frame': 30005 obs. of 19 variables:
$ Date : Date, format: "2014-07-08" ...
$ Client : Factor w/ 68 levels
etc...
Now
harvest[grepl("Belgilux",Client),]
yields 258 results
To find out the exact name of the Client I do :
harvest[grepl("Belgilux",Client),unique(Client)]
[1] N.V. L'ORÉAL Belgilux S.A.
(out of 68 levels)
So far so good.
But if I do
harvest[Client=="N.V. L'ORÉAL Belgilux S.A."]
Empty data.table (0 rows) of 19 cols: Date,YearMonth,Month,Client,Project,Project.Code...
While I expected the same 258 results
levels(harvest$Client)[levels(harvest$Client)=="N.V. L'ORÉAL Belgilux S.A."] <- "Replace Me"
Gives no result either.
When I do the same with any other Clientname, it returns the correct amount of results.
I tried to give you a reproducible setup, but the issue does not show up there.
dt=data.table(Client=c("N.V. L'ORÉAL Belgilux S.A.", "Oh MyMedia","Testme","Oh MyMedia","N.V. L'ORÉAL Belgilux S.A."),Value=c(1:5))
dt$Client<-as.factor(dt$Client)
My question is: is it possible that there are some unprintable chars in my "Client" string, and how can I see them?
Any other approaches?
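One way to check for hidden or look-alike characters is to compare code points and bytes rather than the printed text. A minimal sketch in base R, assuming the cause is a different Unicode representation of the É (the strings a and b are made up for illustration and are not taken from harvest):

```r
# Two strings that may print identically but are stored differently.
a <- "L'OR\u00c9AL"    # precomposed É (U+00C9)
b <- "L'ORE\u0301AL"   # plain E followed by a combining acute accent (U+0301)

identical(a, b)        # FALSE: the underlying code points differ

# Inspect the code points and the raw bytes of each string.
utf8ToInt(a)
utf8ToInt(b)
charToRaw(a)
charToRaw(b)

# The declared encoding can also differ between a stored value and a typed literal.
Encoding(c(a, b))
```

On the real data, something like lapply(grep("Belgilux", levels(harvest$Client), value = TRUE), charToRaw) and Encoding(levels(harvest$Client)) should show whether the stored level differs from the literal typed in the comparison. An encoding mismatch (for example latin1 versus UTF-8) is another possible cause worth ruling out.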

Related

unimplemented type 'list' in 'listgreater' error in R when using cast

I'm trying to cast a simple dataframe using R. It's been melted using melt(). Here's the structure of the melted dataframe:
> str(pMaster)
'data.frame': 172 obs. of 7 variables:
$ Year : chr "1788" "1792" "1796" "1800" ...
$ TotalVotesCast: num 43782 28579 66840 67280 143028 ...
$ result : chr "RunnerUp" "RunnerUp" "RunnerUp" "RunnerUp" ...
$ candidate : chr "No candidate" "No candidate" "Thomas Jefferson" "Aaron Burr" ...
$ party : chr NA NA "D-R" "D-R" ...
$ PopPct : num 0 0 46.5 38.6 27.2 ...
$ PopVotes : num 0 0 31115 25952 38919 ...
And here's what I get when I try to cast it:
> cast(pMaster, Year ~ result, value = "PopVotes", fun.aggregate = sum)
Error in order(Year = c("1788", "1788", "1792", "1792", "1796", "1796", :
unimplemented type 'list' in 'listgreater'
Nothing I try seems to solve the error I'm getting. I am able to cast things if I use a variable other than Year, but I can't see anything about the data in the Year column that looks like it could be causing trouble. I've done some searching here on SO and elsewhere, but was surprised to see there isn't that much on whatever "listgreater" is. Any ideas, anyone? Thanks for any help.
Can you do something similar with pivot_wider and pivot_longer from the tidyr package? I haven't used the cast/melt framework in some time, and I'm wondering if the language has moved away from it.
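For reference, a rough tidyr equivalent of the cast() call might look like the following. This is only a sketch: the votes data frame is a made-up stand-in for pMaster with invented numbers, and it sidesteps rather than explains the listgreater error.

```r
library(tidyr)

# Hypothetical stand-in for pMaster; the vote counts are invented.
votes <- data.frame(
  Year     = c("1796", "1796", "1800", "1800"),
  result   = c("Winner", "RunnerUp", "Winner", "RunnerUp"),
  PopVotes = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# One row per Year, one column per result, summing PopVotes
# when several rows fall into the same cell.
wide <- pivot_wider(
  votes,
  id_cols     = Year,
  names_from  = result,
  values_from = PopVotes,
  values_fn   = sum
)
wide
```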

Why does lubridate mdy() return an error in lapply()?

I'm trying to understand why my lubridate mdy() function is returning an error in lapply() to convert dates in a dplyr pipeline. I have used mdy() on other data in a similar method but have yet to see this issue. I am relatively new to R but had been able to troubleshoot other issues until now. I am not very familiar with how to use lapply().
My data is a large .csv of water quality data, which I'm subsetting to simply show the data in question.
library(dplyr)
library(lubridate)
require(lubridate)
wq.all<-as.data.frame(read.csv('C:/WQdata.csv',header=TRUE,stringsAsFactors = FALSE))
test.wq<-wq.all[1:5,12:13]
class(test.wq)
[1] "data.frame"
mode(test.wq)
[1] "list"
str(test.wq)
'data.frame': 5 obs. of 2 variables:
$ YearMonth : chr "2019-07" "2019-06" "2019-05" "2019-04" ...
$ SampleTime: chr "07/09/2019 14:44" "06/10/2019 14:17" "05/22/2019 14:31" "04/08/2019 14:15" ...
In str(test.wq), SampleTime is the data in question which I am trying to coerce from chr to date, or at least num.
First, I don't need the time values, so I used dplyr mutate() to create SampleDate with only the 10-character dates, and then was attempting to coerce using mdy():
wq.date<-test.wq%>%
mutate(SampleDate=str_sub(test.wq[[2]],start=0,end=10))%>%
mdy(SampleDate)
But this returns an error:
Error in lapply(list(...), .num_to_date) : object 'SampleDate' not found
If I only use mutate() it all seems to work fine, and gives me the new SampleDate column I was looking for:
wq.date<-test.wq%>%
mutate(SampleDate=str_sub(test.wq[[2]],start=0,end=10))
head(wq.date)
YearMonth SampleTime SampleDate
1 2019-07 07/09/2019 14:44 07/09/2019
2 2019-06 06/10/2019 14:17 06/10/2019
3 2019-05 05/22/2019 14:31 05/22/2019
4 2019-04 04/08/2019 14:15 04/08/2019
5 2019-03 03/13/2019 14:19 03/13/2019
str(wq.date)
'data.frame': 5 obs. of 3 variables:
$ YearMonth : chr "2019-07" "2019-06" "2019-05" "2019-04" ...
$ SampleTime: chr "07/09/2019 14:44" "06/10/2019 14:17" "05/22/2019 14:31" "04/08/2019 14:15" ...
$ SampleDate: chr "07/09/2019" "06/10/2019" "05/22/2019" "04/08/2019" ...
So it only seems to result in error once I attempt to coerce using mdy(), even though SampleDate clearly exists and I believe I was referencing it correctly.
I have researched other posts here and here, but neither seem to get to quite this issue.
Thoughts? Many thanks!
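mdy() has to be called inside mutate(), or on the extracted column. In your pipeline, %>% passes the whole data.frame as the first argument to mdy(), and SampleDate is then looked up as an ordinary object outside the data frame, which is why you get "object 'SampleDate' not found". According to ?mdy:
Transforms dates stored in character and numeric vectors to Date or POSIXct objects
So if the input is not a vector, it won't work: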
library(dplyr)
library(lubridate)
library(stringr)
test.wq%>%
mutate(SampleDate=str_sub(SampleTime,start=0,end=10))%>%
mutate(date = mdy(SampleDate))
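An alternative, if the intermediate SampleDate string is not needed, is to parse the full timestamp and drop the time afterwards. A sketch, assuming every SampleTime value follows the month/day/year hour:minute pattern shown in the question (sample_time copies two of those values):

```r
library(lubridate)

sample_time <- c("07/09/2019 14:44", "06/10/2019 14:17")

# mdy_hm() parses month/day/year plus hour:minute into POSIXct (UTC by default);
# as_date() then keeps only the calendar date.
sample_date <- as_date(mdy_hm(sample_time))
class(sample_date)
sample_date
```

Inside the pipeline this would be mutate(SampleDate = as_date(mdy_hm(SampleTime))), which also avoids the str_sub() step.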

Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector

Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector
occurs when I try to run my script in R.
I have searched for a solution, but the error seems to be pretty specific, and what I found was of little help to me.
My dataset contains 3936 obs of 7 variables.
environment, skill, volume, datetime, year, month, day
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3696 obs. of 7 variables:
$ environment: chr "b2b" "b2b" "b2b" "b2b" ...
$ skill : chr "BO Bedrift" "BO Bedrift" "BO Bedrift" "BO Bedrift" ...
$ year : num 2017 2017 2017 2017 2017 ...
$ month : num 1 1 1 1 1 2 2 2 2 3 ...
$ day : num 2 9 16 23 30 6 13 20 27 6 ...
$ volume : num 360 312 305 222 113 ...
$ datetime : Date, format: "2017-01-02" "2017-01-09" "2017-01-16" "2017-01-23" ...
but when trying to run
volume_ets <- volume_tsbl %>% ETS(volume)
this message shows in the console
Error in match.arg(opt_crit) : 'arg' must be NULL or a character vector
I tried somewhat of a shortcut, but nothing helped,
volume_tsbl$volume <- as.numeric(as.character(volume_tsbl$volume))
volume_ets <- volume_tsbl %>% ETS(volume)
my tsibble looks like this;
volume_tsbl <- volume %>% as_tsibble(key = c(skill, environment), index = c(datetime), regular = TRUE )
Expected the code to run, but it does not.
This is the result of an interface change made in late 2018. The change was to make model functions (such as ETS()) create model definitions, rather than fitted models. Essentially, ETS() no longer accepts data as an input, and the specification for the ETS model would become ETS(volume).
The equivalent code in the current version of fable is:
volume_ets <- volume_tsbl %>% model(ETS(volume))
Where the model() function is used to train one or more model definitions (ETS(volume) in this case) to a given dataset.
You can refer to the pkgdown site for fable to see more details: http://fable.tidyverts.org/
In particular, the ETS() function is documented here: http://fable.tidyverts.org/reference/ETS.html

Data importing Delimiter issue in R

I am trying to import a text file into R, and put it into a data frame, along with other data.
My delimiter is "|" and a sample of my data is here :
|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ones was placed to go in
the cargo hold. When we arrived in Winnipeg it stayed in Toronto, they were most helpful and kind at the Winnipeg
airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to
our home. We are very thankful and more than appreciative of the service we received what a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which
had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving
them. A reasonable dinner but breakfast was a measly piece of banana loaf. That's it! The worst airline breakfast
I have had.
As you can see, there are many "|" , but as this screenshot below shows, when I imported the data in R, it only separated it once, instead of about 152 times.
How do I get each individual piece of text in a different column inside the data frame? I would like a data frame of length 152, not 2.
EDIT: The code lines are:
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3
A better way to read in the text is using scan(), and then put it into a data frame with your other variables (here I just made some up). Note that I took your text above, and pasted it into a file called sample.txt, after removing the starting "|".
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue",
stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
## $ text : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
## $ otherVar2 : num 1 1 1
## $ otherVar2.1: Factor w/ 1 level "blue": 1 1 1
The otherVar2 columns are just placeholders for your own variables, as you said you wanted a data.frame with other variables (the name is used twice above, so R renames the second column to otherVar2.1). I chose a numeric variable and a text variable, and because each is given as a single value, it gets recycled for all observations in the dataset (3 in this example).
I realize that your question asks how to get each text in a different column, but that is not a good way to use a data.frame, since data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really want to do that, you have to coerce the data after transposing it, as follows:
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
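If the real file has hard line breaks inside each review (as the sample in the question appears to), scan() may split on those as well as on "|". A sketch of one way around that in base R: read all lines, join them, and only then split on the delimiter. The temporary file and its contents are made up for illustration.

```r
# Write a small hypothetical sample: three reviews, two of them wrapped over two lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "|First review, line one,",
  "continues on line two.",
  "|Second review on one line.",
  "|Third review,",
  "spans two lines."
), tmp)

raw_lines  <- readLines(tmp)
one_string <- paste(raw_lines, collapse = " ")             # undo the hard line breaks
reviews    <- strsplit(one_string, "|", fixed = TRUE)[[1]]  # split on the literal "|"
reviews    <- trimws(reviews)
reviews    <- reviews[nzchar(reviews)]                      # drop the empty piece before the first "|"

reviews_df <- data.frame(text = reviews, stringsAsFactors = FALSE)
str(reviews_df)
```

This gives one row per review, and other variables can be added as further columns in the same way as above.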
"Measly banana loaf"? Definitely economy class.

data.table() still converts strings to factors?

From what I can see here I would assume that data.table v1.8.0+ does not automatically convert strings to factors.
Specifically, to quote Matthew Dowle from that page:
No need for stringsAsFactors. Done like this in v1.8.0 : o character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported.
I'm not seeing that ... here's my R session transcript:
First, I make sure I have a recent enough version of data.table > 1.8.0
> library(data.table)
data.table 1.8.8 For help type: help("data.table")
Next, I create a 2x2 data.table. Notice that it creates factors ...
> m <- matrix(letters[1:4], ncol=2)
> str(data.table(m))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ V1: Factor w/ 2 levels "a","b": 1 2
$ V2: Factor w/ 2 levels "c","d": 1 2
- attr(*, ".internal.selfref")=<externalptr>
When I use stringsAsFactors in data.frame() and then call data.table(), all is well ...
> str(data.table(data.frame(m, stringsAsFactors=FALSE)))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ X1: chr "a" "b"
$ X2: chr "c" "d"
- attr(*, ".internal.selfref")=<externalptr>
What am I missing? Is data.frame() supposed to convert strings to factors, and if so, is there a "better way" of turning that behavior off?
Thanks!
Update:
This issue seems to have slipped past somehow until now. Thanks to @fpinter for filing the issue recently. It is now fixed in commit 1322. From NEWS, No:39 under bug fixes for v1.9.3:
as.data.table.matrix does not convert strings to factors by default. data.table likes and prefers using character vectors to factors. Closes #745. Thanks to @fpinter for reporting the issue on the github issue tracker and to vijay for reporting here on SO.
It appears that this non-coercion is not yet implemented.
data.table deals with matrix arguments using as.data.table
if (is.matrix(xi) || is.data.frame(xi)) {
    xi = as.data.table(xi, keep.rownames = keep.rownames)
    x[[i]] = xi
    numcols[i] = length(xi)
}
and
as.data.table.matrix
contains
if (mode(x) == "character") {
    for (i in ic) value[[i]] <- as.factor(x[, i])
}
Might be worth reporting this to the bug tracker. (it is still implemented in 1.8.9, the current r-forge version)
As a workaround, and to complement @mnel's answer: if you want to turn off the default behavior of data.frame, you can use the dedicated option.
options(stringsAsFactors=FALSE)
str(data.table(data.frame(m)))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ X1: chr "a" "b"
$ X2: chr "c" "d"
- attr(*, ".internal.selfref")=<externalptr>
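If you already have a data.table whose columns were turned into factors, they can also be converted back in place. A sketch, where the factor columns are built by hand to mimic the behaviour described in the question (the name factor_cols is just for illustration):

```r
library(data.table)

m <- matrix(letters[1:4], ncol = 2)

# Build factor columns explicitly to simulate the coercion seen in the question.
dt <- data.table(V1 = factor(m[, 1]), V2 = factor(m[, 2]))

# Find the factor columns and overwrite them by reference with character versions.
factor_cols <- names(dt)[vapply(dt, is.factor, logical(1))]
dt[, (factor_cols) := lapply(.SD, as.character), .SDcols = factor_cols]

str(dt)
```

Note that since R 4.0.0, data.frame() itself defaults to stringsAsFactors = FALSE, so the global option is mainly relevant on older R versions.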

Resources