Search for timestamp anomalies in a data frame in R

I'm working with some GTFS data from Berlin and I'm hitting a wall right now.
There is a stop_times.txt file for all bus stops in Berlin with 5 million rows.
Two columns (Arrival_time and Departure_time) contain anomalies, such as
Arrival_time: 112:30:0 instead of the regular format 11:20:30.
I don't really know how to extract those specific lines and erase them from the dataset, and I can't come up with an algorithm that detects them. I tried going by string length (00:00:00 should be 8 characters), but the malformed ones are also 8 characters long.
Do you know a simple way to make sure that the format is always xx:xx:xx and to delete all the others?
Thanks...
Edit:
After trying the solution suggested below, it didn't work for me because it only tells me how many rows are malformed, not where they are or how I could delete them.
My idea now is basically this:
Find every timestamp that does not correspond to this exact format:
'00:00:00', i.e. it has to be 8 characters long, with two-digit fields separated by ':'. Is there a way to detect anomalies within this pattern and then delete them? I really don't know how to fix this issue anymore.
Thanks

lubridate is such a useful package that I can't remember how I managed without it.
library(lubridate)
times <- c("112:30:0", "11:20:30")
datetimes <- paste("01.01.2018", times)  # prepend a dummy date so dmy_hms() has a full date-time
parsed.datetimes <- dmy_hms(datetimes)
#[1] NA "2018-01-01 11:20:30 UTC"
#Warning message:
# 1 failed to parse.
This function automatically tells you when parsing has failed. The only catch is that it takes a full date-time as input rather than just a time, but you can easily work around that as shown above.
In order to know exactly which ones have failed to parse, you can then apply:
failed.list <- which(is.na(parsed.datetimes))
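If the goal is the exact xx:xx:xx pattern from the edit, a regular expression avoids the date workaround entirely. A minimal sketch, assuming the data frame is called stop_times and the column names match the question (adjust as needed):
ok <- grepl("^\\d{2}:\\d{2}:\\d{2}$", stop_times$Arrival_time) &
      grepl("^\\d{2}:\\d{2}:\\d{2}$", stop_times$Departure_time)
failed.rows <- which(!ok)        # row indices of the malformed timestamps
stop_times <- stop_times[ok, ]   # keep only the well-formed rows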

Related

When I use difftime() in R with what I think is the proper setup, I still get strange time values

I load a dataset from an Excel file:
library(readxl)
df<-read_excel("excel_file.XLSX")
In the file there is a separate date column stored as POSIXct:
str(df$Datecol)
I also have a time column that gets loaded into R as a date-time. To bring it back to a time I do:
df$Timecol <- format(df$Timecol, "%H:%M:%S")
However, it turns into a character vector. This is where I think the problem lies:
str(df$Timecol)
I would have thought this part resolves the situation:
df$merge_date_time <- as.POSIXct(paste(df$Datecol, df$Timecol), format = "%Y-%m-%d %H:%M:%S")
The date and time are then combined. What I want to do now is reference a timestamp column that is a POSIXct data type:
str(df$Timestamp)
I would then like to find the time difference between them:
df$TIME_SINCE <- difftime(df$Timestamp, df$merge_date_time, tz="UTC", units = "mins" )
but I end up with weird numbers that don't make sense. My guess is it's the character data type for the time. Does anyone know how to solve this?
I ended up finding out that this works:
df$date_time <- paste(df$date, format(as.POSIXct(df$time), '%T'))
I removed the portion below from the script, as it changed the column into a character vector:
df$Timecol <- format(df$Timecol, "%H:%M:%S")
I accepted the obscure POSIXct default with the proper time and an odd date (1899-12-31), and what the script did was replace 1899-12-31 with the proper corresponding df$date column.
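Putting the pieces together, a minimal sketch of the whole flow under the same assumptions (the column names Datecol, Timecol and Timestamp come from the question and may need adjusting):
library(readxl)
df <- read_excel("excel_file.XLSX")
# Keep Timecol as POSIXct and take only its clock part when combining,
# so both inputs to difftime() end up as proper date-times:
df$merge_date_time <- as.POSIXct(
  paste(as.Date(df$Datecol), format(df$Timecol, "%H:%M:%S")),
  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
df$TIME_SINCE <- difftime(df$Timestamp, df$merge_date_time, units = "mins")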

R read csv with comma in column

Update 2020-5-14
Working with a different but similar dataset from here, I found read_csv seems to work fine. I haven't tried it with the original data yet though.
Although the replies didn't help solve the problem because my question was not correct, Shan's reply fits the original question I posted the most, so I accepted his answer.
Update 2020-5-12
I think my original question was not correct. As mentioned in the comments, the data was quoted. Although changing the separator made row 11582 in R look the same as row 11583 in Excel, that doesn't mean it's "right". Maybe there is an incorrect line break due to inappropriate encoding or something similar, causing some of the columns to be displaced. If I open the data with Notepad++, the record at row 11583 in Excel is at row 11596.
Original question
I am trying to read the listings.csv from this dataset on Kaggle into R. I downloaded the file and wrote the code read.csv('listing.csv'). The first column, id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berlín!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80² m of charm in 3 rooms with office space,116862833,Jon,Neukölln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with Microsoft Excel, I can see one of the values in the second column is Ole,Ole...:
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a separator other than the comma (,). First go into Control Panel -> Region and Language -> Additional settings, where you can change the "List separator". The most common one other than the comma is the pipe symbol (|). In R, when you read the CSV, specify the separator as '|'.
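For instance, after re-saving from Excel with the pipe as list separator (the file name here is hypothetical):
listings <- read.csv("listings_pipe.csv", sep = "|", stringsAsFactors = FALSE)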
You could try this:
listings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",", "", listings$name)  # removes the commas in the name column
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).

Character values stored in a data frame with double quotes when reading into R

I have a CSV file with almost 4 million records and 30+ columns.
The columns are of varied types, including numeric, alphanumeric, date, character, etc.
Attempt 1:
When I first read the file into R using the read.csv function, only 2 million of the records were read.
This may have happened because of some special characters in the data.
Attempt 2:
I provided the argument quote = "" to read.csv and all the records were read successfully.
However, this brings up two issues:
a. all the columns got prefixed with an 'x.' modifier,
e.g. x.date, x.name
b. all the character columns were loaded into the data frame enclosed in double quotes ""
Can someone please advise me how to resolve these two issues and get the data loaded into R successfully?
I work for a financial institution and the data is highly sensitive, hence I cannot paste a screenshot here.
I also tried to recreate the scenario at home, but my efforts were of little avail.
The screenshot below is the closest I have come to the exact scenario:
[DATAFRAME SCREENSHOT: not an exact copy]
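Since no real data can be shared, here is a minimal sketch of one way to tackle both issues after reading with quote = "" (the file name and generic cleanup are assumptions, not the poster's actual setup):
df <- read.csv("data.csv", quote = "", stringsAsFactors = FALSE)
# Issue (b): strip the leading/trailing double quotes from character columns
chr <- vapply(df, is.character, logical(1))
df[chr] <- lapply(df[chr], function(x) gsub('^"|"$', "", x))
# Issue (a): repair header names like X.date. back to date
names(df) <- gsub('^X\\.|\\.$', "", names(df))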

Unable to retrieve POSIXct from numeric format after ifelse statement (R)

I have a table like so:
library(data.table)
library(lubridate)
dtab <- data.table(Arr.Dep = c("A", "D"),
                   time = c("2017-05-01 04:50:00", "2017-05-01 04:55:00"))
dtab[, time := parse_date_time(time, "%Y-%m-%d %H:%M:%S")]
Operations on the date and time column seem successful:
dtab[, time2 := time + 2]  # adds 2 seconds to each POSIXct value
But if I try to run an ifelse statement, the POSIXct format goes back to numeric and I seem to be unable to bring it back to date and time.
dtab[, time3 := ifelse(Arr.Dep == "A", time + 2, "hello")]
I saw the issue has already been raised:
R- date time variable loses format after ifelse
Unfortunately it's not of great help to me: when I try to follow that example (adding 2 seconds rather than replacing with NA as in that post), I hit an error anyway.
Any help?
Use library(lubridate) and add the time with dtab$time2 <- dtab$time2 + seconds(2). With this method, the format does not change.
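An alternative sketch that sidesteps ifelse() altogether is to assign to the subset by reference, which never drops the POSIXct class (note that a column can hold only one type, so the "hello" fallback cannot stay character):
dtab[, time3 := time]                    # start from the parsed times
dtab[Arr.Dep == "A", time3 := time + 2]  # add 2 seconds to arrivals only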

Scientific notation issue in R

I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and if I then write the same ID to a CSV file, its value changes.
For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":
Code:
options(scipen=999)
x <- 12345678912345678912
print(x)
Output:
[1] 12345678912345679872
My questions are:
1) Why is this happening?
2) How can I fix this problem?
I know it has to do with how R stores data types, but I still think there should be some way to deal with this. I hope my question is clear.
I don't know whether this question has been asked before on SO, so point me to a link if it's a duplicate and I will remove this post.
I have gone through this, so I can relate it to my issue, but I am unable to fix it.
Any help would be highly appreciated. Thanks
R does not, by default, handle integers numerically larger than 2147483647L.
If you append an L to your number (to tell R it's an integer), you get:
x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value
This also explains the change in the last digits: R stores the number as a double, which only carries about 15-16 significant decimal digits of precision.
The gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store the IDs as character strings, or use a data type from the gmp package.
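For illustration, a minimal sketch with the gmp package (assuming it is installed; the value must be supplied as a string, otherwise it is rounded to a double before gmp ever sees it):
library(gmp)
x <- as.bigz("12345678912345678912")  # parse from a string, not a numeric literal
print(x)
# Big Integer ('bigz') :
# [1] 12345678912345678912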
To circumvent the number-storage problem, you can import your ID variable directly as character with the colClasses option. For example, using read.csv to import a data.frame with the ID column and another numeric column:
mydata <- read.csv("file.csv", colClasses = c("character", "numeric"), ...)
Using readr you can do
mydata <- readr::read_csv("file.csv", col_types = list(ID = col_character()))
where "ID" is the name of your ID column.
