I have multiple character columns (around 20) that I would like to convert to dates, dropping the time component, using R. I've tried loops, mutate, and apply.
Here is some sample data using just two columns
col1 = c("2017-04-01 23:00:00", "2017-03-03 00:00:01", "2017-04-02 00:00:01")
col2 = c("2017-04-10 08:41:49", "2017-04-10 08:39:48", "2017-04-10 08:41:51")
df <- cbind(col1, col2)
I've tried:
df <- df %>% mutate(df, funs(ymd))
and
df <- df %>% mutate(df, funs(mdy))
Both gave me an error. I've also tried putting all the column names in a list and looping over it with:
for(i in namedlist) {
as_date(df[i])
glimpse(df)
}
That didn't work either.
I've tried the answer from Convert multiple columns to dates with lubridate and dplyr, and that did not work either. That post wanted only certain variables converted; I want all of my variables converted, so the vars() selection there doesn't apply.
Any suggestions to do this efficiently? Thank you.
If you're applying over all columns, you can do a very short call with lapply. I'll show it here using data.table:
library( data.table )
setDT( df )
df <- df[ , lapply( .SD, as.Date ) ]
On your test data, this gives:
> df
col1 col2
1: 2017-04-01 2017-04-10
2: 2017-03-03 2017-04-10
3: 2017-04-02 2017-04-10
NOTE: your test data actually results in a matrix, so you need to convert it to a data.frame first (or directly to a data.table).
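For example, a minimal sketch of that conversion, using the df built with cbind above:
library( data.table )
dt <- as.data.table( df )            # df is a character matrix, so convert it first
dt <- dt[ , lapply( .SD, as.Date ) ]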
You can do the same thing with just base R, but I personally like the above solution better:
df <- as.data.frame( lapply( df, as.Date ) )
> df
col1 col2
1 2017-04-01 2017-04-10
2 2017-03-03 2017-04-10
3 2017-04-02 2017-04-10
EDIT: This time with the right format codes for the as.Date function. I also added a reproducible example:
library(dplyr)
df <- data.frame(date_1 = c("2019-01-01", "2019-01-02", "2019-01-03"),
                 date_2 = c("2019-01-04", "2019-01-05", "2019-01-06"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)
str(df)
date_cols <- c("date_1", "date_2")
df_2 <- df %>%
  mutate_at(vars(date_cols), funs(as.Date(., "%Y-%m-%d")))
str(df_2)
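As an aside, on dplyr >= 1.0 the mutate_at()/funs() pair is superseded by across(); a sketch of the equivalent call, assuming the same df and date_cols:
df_3 <- df %>%
  mutate(across(all_of(date_cols), ~ as.Date(.x, format = "%Y-%m-%d")))
str(df_3)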
I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID = c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
The output should be a data frame with three rows: Martin, Sammie, and Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
You can do this with filter from dplyr
library(tidyverse)
df2 <- df %>%
  filter(!ID %in% vec)
If you create this as a data.table (and load the data.table package):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
data.table is likely to run a bit quicker than base R, so it is very useful with large datasets. Microbenchmarking a large dataset with base R, tidyverse, and data.table shows data.table to be a bit quicker than tidyverse and a lot faster than base:
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec),], df%>% filter(!ID %in% vec))
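As a further data.table option, the same filter can be written as an anti-join; a sketch, reusing dt and vec from the benchmark above:
# keep only rows of dt whose ID does not appear in vec
dt[!data.table(ID = vec), on = "ID"]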
I've got a data frame with text. I'd like to change all "," to "-" in all observations of selected variables, and I'd like to select the variables based on their names containing the word "date".
I've tried to incorporate various variations of grep() expressions into MyFunc but haven't been able to get it to work.
Thanks!
starting point:
df <- data.frame(dateofbirth=c("25,06,1939","15,04,1941","21,06,1978","06,07,1946","14,07,1935"),recdate=c("26,06,1945","03,04,1964","21,06,1949","15,07,1923","07,12,1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2,st"),disdatenow=c("25,06,1975","25,05,1996","21,06,1932","26,07,1934","07,07,1965"), stringsAsFactors = FALSE)
desired outcome:
df <- data.frame(dateofbirth=c("25-06-1939","15-04-1941","21-06-1978","06-07-1946","14-07-1935"),recdate=c("26-06-1945","03-04-1964","21-06-1949","15-07-1923","07-12-1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2"),disdatenow=c("25-06-1975","25-05-1996","21-06-1932","26-07-1934","07-07-1965"), stringsAsFactors = FALSE)
Current code:
MyFunc <- function(x) {gsub(",","-",df$x)}
You can use mutate_at from dplyr:
df %>%
  mutate_at(vars(contains("date")), function(x) {gsub(",", "-", x)})
and that gives you this:
dateofbirth recdate b disdatenow
1 25-06-1939 26-06-1945 8,ted,st 25-06-1975
2 15-04-1941 03-04-1964 99,tes,rd 25-05-1996
3 21-06-1978 21-06-1949 6,ldk,dr 21-06-1932
4 06-07-1946 15-07-1923 2,sdd,jun 26-07-1934
5 14-07-1935 07-12-1945 asd,2,st 07-07-1965
Using your function MyFunc, this should also work
MyFunc <- function(x) {gsub(",", "-", x)}
library(data.table)
setDT(df)
cols <- c("dateofbirth", "recdate", "disdatenow")
df[, cols] <- df[, lapply(.SD, MyFunc), .SDcols = cols]
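A slightly more idiomatic data.table variant updates the columns by reference with :=, avoiding the copy made by the <- assignment; a sketch under the same setup:
# same result, but modifies df in place
df[, (cols) := lapply(.SD, MyFunc), .SDcols = cols]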
I've hit the limits of the grep() function, or perhaps there are more efficient ways of doing this.
Start off with a sample data frame:
Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
"30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
df <- as.data.frame(cbind(Date, ISIN, price))
And the desired result is a list() containing subsets of the main data file, which looks like the below (x3, one for each of the 3 individual identifiers in Result_I):
The idea is that the data should first be filtered by ISIN and then by Date. This two-step process should keep my data intact.
Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)
Result_df <- cbind(Result_d, Result_I, Result_P)
Please keep in mind that the above is for demo purposes; the real data set has 5M rows and 50 columns over a period of 450+ different dates as per Result_d, so I am looking for something that is applicable irrespective of nrow or ncol.
What I have so far:
I take all unique dates and store them:
Unique_Dates <- unique(df$Date)
The same for the Identifiers:
Unique_ID <- unique(df$ISIN)
Now the grepping issue:
If I wanted all rows containing Unique_Dates I would do something like:
pattern <- paste(Unique_Dates, collapse = "|")
result <- as.matrix(df[grep(pattern, df$Date),])
and this will retrieve basically the entire data set. I am wondering if anyone knows an efficient way of doing this.
Thanks in advance.
Using dplyr:
library(dplyr)
Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
"30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
DF <- data.frame(Date, ISIN, price,stringsAsFactors=FALSE)
DF$Date=as.Date(DF$Date,"%d-%b-%Y")
#Examine data ranges and frequencies
#date range
range(DF$Date)
#date frequency count
table(DF$Date)
#ISIN frequency count
table(DF$ISIN)
#select ISINs for filtering, with user defined choice of filters
# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)
subISIN = names(sort(table(DF$ISIN)))[2]
subDF = DF %>%
  dplyr::group_by(ISIN) %>%
  dplyr::arrange(ISIN, Date) %>%
  dplyr::filter(ISIN %in% subISIN) %>%
  as.data.frame()
#> subDF
# Date ISIN price
#1 2014-12-29 LU0168343191 7
#2 2014-12-30 LU0168343191 4
#3 2014-12-31 LU0168343191 1
We convert the 'data.frame' to a 'data.table' (setDT(df)), specify the 'i' with the row index returned by grep, and return the Subset of Data.table (.SD) grouped by 'Date':
library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
# Date ISIN price
#1: 31-DEC-2014 LU0168343191 1
#2: 30-DEC-2014 LU0168343191 4
#3: 29-DEC-2014 LU0168343191 7
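Since the question actually asks for a list() of per-ISIN subsets, note that base R's split() gives exactly that; a minimal sketch on the question's df (works on a data.frame or data.table):
# one subset per ISIN, named by ISIN
result_list <- split(df, df$ISIN)
result_list[["LU0168343191"]]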
I have a data frame with 300 columns, which has a string variable somewhere that I am trying to remove. I have found this solution on Stack Overflow using lapply (see below), which is what I want to do, but using the dplyr package. I have tried using the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non-numeric character strings turning into NAs."
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
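For instance, a predicate combining several checks could look like this sketch:
# keep numeric columns that are not entirely NA
df %>% select_if(function(x) is.numeric(x) && !all(is.na(x)))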
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
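As an aside, mutate_each() has since been retired; on dplyr >= 1.0 the same conditional conversion can be sketched with across():
df2 <- df %>%
  mutate(across(everything(),
                ~ if (is.numeric(.x)) .x else as.numeric(as.character(.x))))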
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily hunting for strings so much as trawling for non-integers.)
StrScrub <- function(x) {
  # values that are entirely non-digits become NA; everything else is coerced to integer
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson
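For what it's worth, running the above on that example dataframe should give something like this (the non-integer strings are coerced to NA, with a coercion warning):
ScrubbedDF
#   A  B  C
# 1 2  5 NA
# 2 3 NA  9
# 3 4  3 NA
# 4 5  2  1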
I must be missing something obvious here but for this:
> range(data$timestamp)
[1] "2015-06-29 09:32:43.000 UTC" "2015-07-03 15:50:35.986 UTC"
I want to do something like:
df <- data.frame(as.Date(range(data$timestamp)))
names(df) <- c('from', 'to')
and get a data frame with columns 'from' and 'to' without needing an extra variable just for indexing. Written as above, data.frame converts the vector to two rows in a single-column data frame. I've tried various combinations of cbind, matrix, t, list, and attempts at destructuring. What is the best way to do this?
df <- as.data.frame(as.list(as.Date(range(data$timestamp))))
names(df) <- c('from', 'to')
This will work. data.frames are really just special lists after all.
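A quick sketch of that point:
# a data.frame is a list with class "data.frame" and row names
is.list(df)    # TRUE
unclass(df)    # shows the underlying named list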
If you wanted a one-liner, you could use setNames. I've also found this type of thing much more readable now using magrittr:
data$timestamp %>% range %>% as.Date %>% as.list %>% as.data.frame %>% setNames(c("from", "to"))
Alternatively, you could cast via a matrix:
df <- as.data.frame(matrix(as.Date(range(data$timestamp)), ncol = 2))
names(df) <- c('from', 'to')
This will, however, strip the class (and other attributes) from the dates. If you instead set the dimensions of the vector using dim<-, then neither print nor as.data.frame will treat it as a matrix (because it still has the class Date).
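A short sketch of that dim<- behaviour, using the same data:
x <- as.Date(range(data$timestamp))
dim(x) <- c(1, 2)   # sets a dim attribute, but the "Date" class is untouched
class(x)            # still "Date", so matrix-style printing never kicks in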
To get round this, convert to Date after creating the data.frame:
df <- as.data.frame(matrix(range(data$timestamp), ncol = 2))
df[] <- lapply(df, as.Date)
names(df) <- c('from', 'to')
You can try:
range_timestamp <- c("2015-06-29 09:32:43.000 UTC", "2015-07-03 15:50:35.986 UTC")
df <- data.frame(from = as.Date(range_timestamp[1]), to = as.Date(range_timestamp[2]))
df
# from to
#1 2015-06-29 2015-07-03
Another option, using data.table and avoiding indexing:
require(data.table)
df <- `colnames<-`(data.frame(rbind(range_timestamp)), c("from","to"))
df <- setDT(df)[, lapply(.SD, as.Date)]
df
from to
1: 2015-06-29 2015-07-03
Or, as mentioned by @akrun in the comment:
require(data.table)
df <- setnames(setDT(as.list(as.Date(range_timestamp))), c('from', 'to'))[]
I was a few seconds too late with my suggestion; as I see, others have already answered. Anyway, here is an alternative that is similar to what you have attempted:
timestamp <-c("2015-06-29 09:32:43.000 UTC","2015-07-03 15:50:35.986 UTC")
df <- t(data.frame(as.Date(range(timestamp))))
colnames(df) <- c('from', 'to')
rownames(df) <- NULL
#> df
# from to
#[1,] "2015-06-29" "2015-07-03"