How to apply xts function per row in a dataframe - r

I have a dataframe in R with a column called created_at, which holds text that I want to parse into a datetime. Here is a quick preview:
head(pushes)
created_at repo.url repository.url
1 2013-06-17T00:14:04Z https://github.com/Mindful/blog
2 2013-07-31T21:08:15Z https://github.com/leapmotion/js.leapmotion.com
3 2012-11-04T07:08:15Z https://github.com/jplusui/jplusui
4 2012-06-21T08:16:22Z https://github.com/LStuker/puppet-rbenv
5 2013-03-10T09:15:51Z https://github.com/Fchaubard/CS108FinalProject
6 2013-10-04T11:34:11Z https://github.com/cmmurray/soccer
actor.login payload.actor actor_attributes.login
1 Mindful
2 joshbuddy
3 xuld
4 LStuker
5 ststanko
6 cmmurray
I wrote an instruction that works fine with some test data:
xts::.parseISO8601("2012-06-17T00:14:04",tz="UTC")$first.time returns proper Posix date
But when I apply it to a column with this instruction:
pushes$created_at <- xts::.parseISO8601(substr(pushes$created_at,1,nchar(pushes$created_at)-1),tz="UTC")$first.time
every row in the dataframe gets the same duplicated date, 2012-06-17 00:14:04 UTC.
It looks as if the function ran only once, for the first row, and the result was then copied to the rest of the rows. Can you please help me apply it properly, row by row, to the created_at column?
Thanks.

The first argument to .parseISO8601 is supposed to be a character string, not a vector. You need to use sapply (or equivalent) to loop over your vector.
library(xts)
created_at <-
c("2013-06-17T00:14:04Z", "2013-07-31T21:08:15Z", "2012-11-04T07:08:15Z",
"2012-06-21T08:16:22Z", "2013-03-10T09:15:51Z", "2013-10-04T11:34:11Z")
# Only parses first element
.parseISO8601(substr(created_at,1,nchar(created_at)-1),tz="UTC")$first.time
# [1] "2013-06-17 00:14:04 UTC"
firstParseISO8601 <- function(x) .parseISO8601(x,tz="UTC")$first.time
# parse all elements
datetimes <- sapply(sub("Z$","",created_at), firstParseISO8601, USE.NAMES=FALSE)
# note that "simplifying" the output strips the POSIXct class, so we re-add it
datetimes <- .POSIXct(datetimes, tz="UTC")
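As an alternative sketch that avoids the loop: base R's as.POSIXct is vectorized over its input, so the whole column can be parsed in one call. This assumes every value has exactly the form shown in the question, with a literal trailing Z meaning UTC; the format string is my assumption about the data and is not taken from xts.

```r
# Sample values copied from the question
created_at <- c("2013-06-17T00:14:04Z", "2013-07-31T21:08:15Z", "2012-11-04T07:08:15Z")

# "T" and "Z" are matched as literal characters; tz = "UTC" sets the time zone
datetimes <- as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
print(datetimes)
```

Any value that does not match the format comes back as NA, so it is worth checking anyNA(datetimes) afterwards.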

Related

data frame with mixed date format

I would like to convert a column of mixed date formats into a single format, for example d-m-y.
Here is the data frame:
x <- data.frame("Name" = c("A","B","C","D","E"), "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
I have tried the code below, but it gives NAs:
newdateformat <- as.Date(x$Birthdate,
format = "%m%d%y", origin = "2020-6-25")
newdateformat
Then I tried parse_date_time, but it also gives NAs, which means it failed to parse some of the values:
require(lubridate)
parse_date_time(my_data$Birthdate, orders = c("ymd", "mdy"))
[1] NA NA "2001-09-12 UTC" NA
[5] "2005-02-18 UTC"
I also couldn't work out the format of the first date in the data frame, "36085.0".
I did find this code, but I still don't understand what the number means or what "origin" means:
dates <- c(30829, 38540)
betterDates <- as.Date(dates,
origin = "1899-12-30")
P.S. I'm quite new to R, so I would appreciate a simple explanation. Thank you!
You should parse each format separately. For each format, select the relevant rows with a regular expression and transform only those rows, then move on to the next format. I'll give the answer with data.table instead of data.frame because I've forgotten how to use data.frame.
library(lubridate)
library(data.table)
x = data.table("Name" = c("A","B","C","D","E"),
"Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
# or use setDT(x) to convert an existing data.frame to a data.table
# handle dates like "2001-sep-12" and "2020-6-25"
# this regex matches strings beginning with four numbers and then a dash
x[grepl('^[0-9]{4}-',Birthdate),Birthdate1:=ymd(Birthdate)]
# handle dates like "36085.0": days since 1904 (or 1900)
# see https://learn.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system
# this regex matches strings that only have numeric characters and .
x[grepl('^[0-9\\.]+$',Birthdate),Birthdate1:=as.Date(as.numeric(Birthdate),origin='1904-01-01')]
# assume the rest are like "Feb-18-2005" and "05/27/84" and handle those
x[is.na(Birthdate1),Birthdate1:=mdy(Birthdate)]
# result
> x
Name Birthdate Birthdate1
1: A 36085.0 2002-10-18
2: B 2001-sep-12 2001-09-12
3: C Feb-18-2005 2005-02-18
4: D 05/27/84 1984-05-27
5: E 2020-6-25 2020-06-25
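To spell out what "origin" means in the question's as.Date(dates, origin = ...) snippet: a number like 36085 is a count of days, and the origin is the date that counts as day 0. A minimal base-R sketch, assuming the number really is a spreadsheet serial date (which of the two origins applies depends on where the file came from, so that choice is an assumption):

```r
serial <- as.numeric("36085.0")

# Day 0 as used in the question's snippet (1900 date system)
d1900 <- as.Date(serial, origin = "1899-12-30")

# Day 0 as used in the answer above (1904 date system)
d1904 <- as.Date(serial, origin = "1904-01-01")

# Once a value is a Date, format() writes it in any layout, e.g. d-m-y
format(d1900, "%d-%m-%Y")
format(d1904, "%d-%m-%Y")
```

The two results differ by 1462 days, so it is worth checking one known birthdate against the original spreadsheet before picking an origin.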

List Rows from column selection

Hello, I would like to select rows from a dataframe in the form of a list. Here is my dataframe:
df2 <- data.frame("user_id" = 1:2, "username" = c(215,154), "password" = c("John4","Dora4"))
With this dataframe I can only select one column at a time to view its values, which I did with this code:
df2[["user_id"]]
output is
[1] 1 2
But when I try this with more columns I am told the subscript is out of bounds. What is the problem here?
df2[["user_id", "username"]]
How can I resolve this and get the rows as a list?
If I understood your question correctly, you need to familiarize yourself with subsetting in R. These are ways to select multiple columns in R:
df2[,c('user_id', 'username')]
or
df2[,1:2]
If you want to return all columns as a list, you can use something like this:
lapply(1:ncol(df2), function(x) df2[,x])
The format is df2['rows','columns'], so you should use:
df2[,c("user_id", "username")]
To get them 'in form of list', do:
as.list(df2[,c("user_id", "username")])
The double bracket [[ notation is used to select a single element and return it on its own (in this case a single column as a plain vector, since data frames are essentially lists of column data).
See this answer for more on double vs single bracket notation: https://stackoverflow.com/a/1169495/8444966
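A small sketch of the difference, using the df2 from the question. With two arguments, [[ on a data frame is read as [[row, column]], which is presumably why df2[["user_id", "username"]] complains about bounds: there is no row named "user_id".

```r
df2 <- data.frame(user_id = 1:2, username = c(215, 154), password = c("John4", "Dora4"))

one_col_df  <- df2["user_id"]                  # single bracket: still a data frame
one_col_vec <- df2[["user_id"]]                # double bracket: the bare column vector
two_cols    <- df2[c("user_id", "username")]   # several columns need single brackets
one_cell    <- df2[[2, "username"]]            # [[row, column]] picks a single cell
```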
This should give you a list with one element per row (there is probably an existing answer for this somewhere on the site).
row_list<- as.list(as.data.frame(t(df2[c("user_id", "username")])))
#$V1
#[1] 1 215
#$V2
#[1] 2 154
If you want to keep the names of the rows:
df2_subset <- df2[c("user_id", "username")]
setNames(split(df2_subset, seq(nrow(df2_subset))), rownames(df2_subset))
#$`1`
# user_id username
#1 1 215
#$`2`
# user_id username
#2 2 154
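If each row should itself be a plain named list rather than a vector or a one-row data frame, one possible sketch is to loop over the row indices. The names cols and rows are just illustrative.

```r
df2 <- data.frame(user_id = 1:2, username = c(215, 154), password = c("John4", "Dora4"))
cols <- df2[, c("user_id", "username")]

# One list element per row; each element is a named list of that row's values
rows <- lapply(seq_len(nrow(cols)), function(i) as.list(cols[i, ]))
str(rows[[1]])
```

Unlike the t() approach, this keeps each column's own type, which matters once character columns such as password are included.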

R integer (date) to number

I have a matrix date that looks like this:
Date Time
1 2017-05-19 08:52:21
2
3 2017-05-20 22:29:29
4 2017-05-20 15:21:35
Both date$Date and date$Time are integers.
I would like to obtain a new column like this:
Date Time
1 20170519 085221
2 NA NA
3 20170520 222929
4 20170520 152135
I've tried with as.character, as.numeric, as.Date... But can't find the solution /=
Sorry if the question was already answered in another post, but I wasn't able to find it!
You need format...
format(as.POSIXct("2017-05-19"),"%Y%m%d")
[1] "20170519"
format(as.POSIXct("08:52:21",format="%H:%M:%S"),"%H%M%S")
[1] "085221"
See ?strptime for the formatting codes.
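The same idea applied to whole columns, as a sketch. It assumes the columns can be converted to character and that blanks are empty strings; the vectors below are made-up stand-ins for date$Date and date$Time. Giving an explicit format means blank entries simply become NA.

```r
Date <- c("2017-05-19", "", "2017-05-20", "2017-05-20")
Time <- c("08:52:21", "", "22:29:29", "15:21:35")

# Parse with an explicit format, then write back out without separators
date_out <- format(as.Date(Date, format = "%Y-%m-%d"), "%Y%m%d")
time_out <- format(as.POSIXct(Time, format = "%H:%M:%S", tz = "UTC"), "%H%M%S")

data.frame(Date = date_out, Time = time_out)
```

If the real columns are factors, wrap them in as.character() first.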
Since you apparently don't necessarily want date or time class objects (do you?), and since you don't further specify what exactly you need this for, there seems no need to work with date or time functions.
You could try this:
Step 1: First, if you want empty cells to contain NA, fill those in per column
df$Date[df$Date == ""] <- NA
df$Time[df$Time == ""] <- NA
Step 2: And then simply replace the "-" and ":" in the Date and Time values, respectively, to get the wanted strings
df$Date <- gsub(pattern = "-", x = df$Date, replacement = "")
df$Time <- gsub(pattern = ":", x = df$Time, replacement = "")
Date Time
1 20170519 85221
2 <NA> <NA>
3 20170520 222929
4 20170520 152135
The output might not yield integer classes (my starting df resembling your df did not contain integers, so can't double check; result here were character classes), so if you really want integer classes, simply apply as.integer().
As you see the output is the same as your expected output, except for the leading "0" of the row 1 Time value. If need be, there's a work around to get that in there, although I'm not sure what that would add. And after applying as.integer it would most likely disappear anyway.
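A short sketch of what happens to the leading zero, assuming the Time values are character strings: gsub keeps it, converting to integer drops it, and sprintf can pad it back if a fixed width of six is wanted.

```r
as_text  <- gsub(pattern = ":", replacement = "", x = "08:52:21")  # stays character
as_int   <- as.integer(as_text)                                    # zero is lost here
restored <- sprintf("%06d", as_int)                                # pad back to 6 digits
```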

how to manipulate variables in a factor of a data frame

I need to clean up a factor column named phone_number in my data frame.
The values must be numeric with length 5,
and must not contain special characters.
I want to change values such as AO-11111 and VQ-11111 to 11111, that is, remove the leading characters, and finally set any remaining invalid values to NA.
My data frame is read from a .csv file. Initially phone_number is a factor, such as:
phone_number
VQ-40773
VQ-43685
VQ-44986
40270
41694
42623
.
.
The strsplit function will help you extract the value from the string.
str <- "VQ-40773"
strsplit(str, "-")[[1]][2]  # returns "40773"
If you want to remove anything that precedes a dash, then:
sub("^([^-]+[-])(.+)", "\\2", phone_number)
> phone_number <- scan(what="")
1: VQ-40773
2: VQ-43685
3: VQ-44986
4: 40270
5: 41694
6: 42623
7:
Read 6 items
> sub("^([^-]+[-])(.+)", "\\2", phone_number)
[1] "40773" "43685" "44986" "40270" "41694" "42623"
> as.numeric(sub("^([^-]+[-])(.+)", "\\2", phone_number))
[1] 40773 43685 44986 40270 41694 42623
The nchar function would allow checking the lengths of a character vector. Post an adequate example and, please, do make a greater effort to get punctuation and capitalization correct.
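Putting the pieces together, here is one possible sketch of the whole clean-up. The sample values, including the two deliberately invalid ones, are made up for illustration, and "valid" is taken to mean exactly five digits after the prefix is removed.

```r
phone_number <- factor(c("VQ-40773", "VQ-43685", "40270", "4169", "AB-12x45"))

# Work on character data, not on the factor's integer codes
cleaned <- sub("^[^-]+-", "", as.character(phone_number))

# Keep only strings that are exactly five digits; everything else becomes NA
valid  <- grepl("^[0-9]{5}$", cleaned)
result <- ifelse(valid, cleaned, NA_character_)
result
```

Use as.integer(result) afterwards if a numeric column is needed.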

How can you insert a colon every two characters?

I have a column of time values, except that they are in character format and do not have the colons to separate H, M, S. The column looks similar to the following:
Time
024201
054722
213024
205022
205024
125440
I want to convert all the values in the column to look like actual time values in the format H:M:S. The values are already in HMS format, so it is simply a matter of inserting colons, but that is proving more difficult than I thought. I found a package that adds commas every three digits from the right to make Strings look like currency values, but nothing for time (without also adding a date value, which I do not want to do). Any help would be appreciated.
Since the data is time related, you should consider storing it in a POSIX format:
> df <- data.frame(Time=c("024201", "054722", "213024", "205022", "205024", "125440"))
> df$Time <- as.POSIXct(df$Time, format="%H%M%S")
> df
Time
1 2014-01-05 02:42:01
2 2014-01-05 05:47:22
3 2014-01-05 21:30:24
4 2014-01-05 20:50:22
5 2014-01-05 20:50:24
6 2014-01-05 12:54:40
To output just the times:
> format(df, "%H:%M:%S")
Time
1 02:42:01
2 05:47:22
3 21:30:24
4 20:50:22
5 20:50:24
6 12:54:40
A regular expression with lookaround works for this:
gsub('(..)(?=.)', '\\1:', x$Time, perl=TRUE)
The (?=.) means a character (matched by .) must follow, but is not considered part of the match (and is not captured).
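To see what the lookahead buys you, compare the result with and without it on a couple of the sample values (a small sketch; x here is a plain character vector rather than a data frame column):

```r
x <- c("024201", "125440")

# With the lookahead, the final pair is not followed by a character, so no colon is added
with_lookahead <- gsub("(..)(?=.)", "\\1:", x, perl = TRUE)

# Without it, every pair gets a colon, including the last one
without_lookahead <- gsub("(..)", "\\1:", x)
```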
Here is a regex solution:
x <- readLines(n=6)
024201
054722
213024
205022
205024
125440
gsub("(\\d\\d)(\\d\\d)(\\d\\d)", "\\1:\\2:\\3", x)
## [1] "02:42:01" "05:47:22" "21:30:24"
## [4] "20:50:22" "20:50:24" "12:54:40"
Here each (\\d\\d) says we're looking for 2 digits. The parentheses break the string into 3 parts. Then \\1: says take chunk 1 and place a colon after it, and likewise for \\2 and \\3.
Or via date/times classes:
time <- c("024201", "054722", "213024", "205022", "205024", "125440")
time <- as.POSIXct(paste("1970-01-01", time), format="%Y-%m-%d %H%M%S")
(time <- format(time, "%H:%M:%S"))
# [1] "02:42:01" "05:47:22" "21:30:24" "20:50:22" "20:50:24" "12:54:40"
This gives a chron "times" class vector:
> library(chron)
> times(gsub("(..)(..)(..)", "\\1:\\2:\\3", DF$Time))
[1] 02:42:01 05:47:22 21:30:24 20:50:22 20:50:24 12:54:40
The "times" class can display times without having to display the date and supports various methods on the times.
On the other hand, if only a character string is wanted then only the gsub part is needed.
