I load a dataset from Excel:
library(readxl)
df <- read_excel("excel_file.XLSX")
In the file there is a separate date column, which loads as POSIXct:
str(df$datecol)
I also have a time column that R loads as a date-time. To bring it back to a time I do:
df$Timecol<-format(df$Timecol,"%H:%M:%S")
However, it turns into a character vector, and this is where I think the problem lies:
str(df$Timecol)
I would think this part resolves the situation:
df$merge_date_time <- as.POSIXct(paste(df$datecol, df$Timecol), format = "%Y-%m-%d %H:%M:%S")
The date and time are then combined. What I want to do now is reference a timestamp column that is a POSIXct data type:
str(df$Timestamp)
I would then like to find the time difference between them:
df$TIME_SINCE <- difftime(df$Timestamp, df$merge_date_time, tz = "UTC", units = "mins")
but I end up with weird numbers that don't make sense. My guess is it's the character data type for the time column. Does anyone know how to solve this?
I ended up finding out that this works:
df$date_time<-paste(df$date, format(as.POSIXct(df$time), '%T'))
I removed the portion below from the script, as it converted the column to character:
df$Timecol<-format(df$Timecol,"%H:%M:%S")
I accepted the odd POSIXct default, which carries the correct time but a placeholder date (1899-12-31), and the script above replaced 1899-12-31 with the proper corresponding df$date value.
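Putting the pieces together, a minimal end-to-end sketch of what ended up working (the column names df$date, df$time and df$Timestamp are the ones used above; adjust to your file):
library(readxl)
df <- read_excel("excel_file.XLSX")
# readxl returns time-only Excel cells as POSIXct with a placeholder
# date (1899-12-31); format() extracts just the clock portion inline,
# so no character column is ever stored back into df.
df$date_time <- as.POSIXct(paste(as.Date(df$date), format(df$time, "%T")),
                           format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
# With both columns POSIXct, difftime() gives sensible results.
df$TIME_SINCE <- difftime(df$Timestamp, df$date_time, units = "mins")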
Update 2020-5-14
Working with a different but similar dataset from here, I found that read_csv seems to work fine. I haven't tried it with the original data yet, though.
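For the record, the call that seemed to work (read_csv keeps commas inside quoted fields intact by default; a sketch, assuming the same file name as below):
library(readr)
listings <- read_csv("listings.csv")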
Although the replies didn't solve the problem (because my question was not correct), Shan's reply best fits the original question I posted, so I accepted his answer.
Update 2020-5-12
I think my original question is not correct. As mentioned in the comment, the data was quoted. Although changing the separator made row 11582 in R look the same as row 11583 in Excel, that doesn't mean it's "right". Maybe there is some incorrect line break due to inappropriate encoding or something, causing some of the columns to be displaced. If I open the data with Notepad++, the instance at row 11583 in Excel is at row 11596.
Original question
I am trying to read the listings.csv from this dataset on Kaggle into R. I downloaded the file and ran read.csv('listings.csv'). The first column, id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with Microsoft Excel, I can see one of the values in the second column is Ole,Ole...
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a separator other than comma (,). First go into Control Panel -> Region and Language -> Additional settings, where you can change the "List separator". The most common one other than comma is the pipe symbol (|). In R, when you read the csv, specify the separator as '|'.
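The read would then look something like this (the file name is hypothetical; any copy re-saved with '|' as the list separator works):
# read.csv with an explicit separator instead of the default comma
listings <- read.csv("listings_pipe.csv", sep = "|", stringsAsFactors = FALSE)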
You could try this:
listings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",", "", listings$name)  # removes the commas in the name column
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
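For comparison, a base-R sketch of the colClasses route mentioned above: marking a column as "NULL" skips it during the read (this assumes the second column is named name, as in the Kaggle file).
# Skip the troublesome 'name' column entirely at read time.
listings <- read.csv("listings.csv", colClasses = c(name = "NULL"))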
If you do need the information in that column, then other methods are needed (see other solutions).
I have a csv file with almost 4 million records and 30+ columns.
The columns are of varied types, including numeric, alphanumeric, date, and character.
Attempt 1:
When I first read the file into R using the read.csv function, only 2 million of the records were read.
This may have happened because of some special characters in the data.
Attempt 2:
I provided the argument quote = "" to read.csv and all the records were read successfully.
However, this brings up 2 issues:
a. all the column names were prefixed with an 'X.' modifier, e.g. X.date, X.name
b. all the character columns were loaded into the data frame enclosed in double quotes ("")
Can someone please advise me how to resolve these 2 issues and get the data loaded into R successfully?
I work for a financial institution and the data is highly sensitive, hence I cannot paste a screenshot here.
I also tried to recreate the scenario at home, but all my efforts were to little or no avail.
The screenshot below is the closest I have come to the exact scenario:
DATAFRAME SCREENSHOT: Not exact copy
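Not having the real file, here is only a sketch of one way to undo both side effects after a quote = "" read (it assumes the stray quotes came from the header and the character columns; the file name is hypothetical):
df <- read.csv("bigfile.csv", quote = "", stringsAsFactors = FALSE)
# (a) The quoted header fields get mangled by make.names() into the
# X.-prefixed form (X.date., X.name.); strip the prefix and trailing dot.
names(df) <- gsub("^X\\.|\\.$", "", names(df))
# (b) Remove the literal double quotes wrapped around character values.
chr <- vapply(df, is.character, logical(1))
df[chr] <- lapply(df[chr], function(col) gsub("\"", "", col))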
I have a table like so:
library(data.table)
library(lubridate)
dtab <- data.table(Arr.Dep = c("A", "D"),
                   time = c("2017-05-01 04:50:00", "2017-05-01 04:55:00"))
dtab[, time := parse_date_time(time, "Ymd HMS")]
Operations on the date and time column seem successful:
dtab[, time2 := time + 2]
But if I try to run an ifelse statement, the POSIXct format goes back to numeric and I seem to be unable to bring it back to date and time.
dtab[, time3 := ifelse(Arr.Dep == "A", time + 2, "hello")]
I saw that the issue has already been raised:
R- date time variable loses format after ifelse
Unfortunately, it's not of great help to me: when I try to follow the example (adding 2 seconds rather than replacing with NA, as in the OP), I hit an error anyway.
Any help?
Use library(lubridate) and add the time with dtab$time2 <- dtab$time2 + seconds(2). With this method, the format does not change.
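Alternatively, a data.table-style sketch that sidesteps ifelse() entirely: assigning by reference to the matching subset never coerces the column, so the POSIXct class survives (non-matching rows stay NA).
library(data.table)
# Only rows where Arr.Dep == "A" get time + 2 seconds.
dtab[Arr.Dep == "A", time3 := time + 2]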
I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and if I then write the same ID to a csv file, its value changes.
For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":
Code:
options(scipen=999)
x <- 12345678912345678912
print(x)
Output:
[1] 12345678912345679872
My questions are:
1) Why is this happening?
2) How can I fix this problem?
I know it has to do with how R stores data types, but I still think there should be some way to deal with this problem. I hope the question is clear.
I don't know if this question has been asked before, so point me to a link if it's a duplicate and I will remove this post.
I have gone through this, so I can relate it to my issue, but I am unable to fix it.
Any help would be highly appreciated. Thanks.
By default, R does not handle integers larger than 2147483647 (.Machine$integer.max).
If you append an L to your number (to tell R it's an integer), you get:
x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value
This also explains the change in the last digits: R stores the number as a double, which holds only about 15-16 significant decimal digits.
I think the gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store the IDs as character strings, or use a data type from the gmp package.
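A minimal sketch with gmp (assuming the package is installed); constructing the value from a string keeps it from ever passing through a double:
library(gmp)
x <- as.bigz("12345678912345678912")  # exact 20-digit integer
x + 1                                 # arbitrary-precision arithmetic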
To circumvent the number storage/representation problem, you can import your ID variable directly as character with the colClasses option. For example, using read.csv to import a data.frame whose ID column is followed by a numeric column:
mydata <- read.csv("file.csv", colClasses = c("character", "numeric"), ...)
Using readr you can do
mydata <- readr::read_csv("file.csv", col_types = list(ID=col_character()))
where "ID" is the name of your ID column