How to alter multiple objects in a data frame in R?

I am working with a data frame that has 6 columns: 'cik', 'company.name', 'form.type', 'date.filed', 'nword.hits' and 'ticker'. The 'date.filed' column holds dates of the form 2014-02-21. I would like to extract only the year into a new column, so that only the number 2014 remains. First I converted the column to character with the following code:
t <- transform(fdt, date.filed = as.character(date.filed))
Then, I separated the numbers by using following command:
bb <- strsplit(t$date.filed, split = "-")
In this way, the variables became as follows: '2014''02''21'.
In order to extract the years, I used the following code:
ex11 <- substr(bb, start = 1, stop = 8)
oficial <- data.frame(ex11)
I was able to extract the years, however they looked like this: c("2014". I wonder if there is any way to remove the c, the parentheses and the strings.
Thank you in advance!

Instead of converting to character and then splitting or substringing the dates, it may be better to convert to the Date class and use its methods to extract those components
# // assuming the format to be in YYYY-MM-DD
fdt$date.filed <- as.Date(fdt$date.filed)
fdt$year <- as.integer(format(fdt$date.filed, "%Y"))
In the OP's code, the output of strsplit is a list of vectors. Instead of applying substr to that list (which is already split and only needs its first component extracted, as noted in the comments), we need to apply substr to the original column
substr(fdt$date.filed, 1, 4)
NOTE: For Date columns, the recommended solution is to use Date methods instead of regex or substring
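As a minimal sketch of the Date-based approach, assuming a small made-up data frame (the fdt below and its values are illustrative stand-ins, not the OP's data):

```r
# Hypothetical stand-in for the OP's data frame; values are invented
fdt <- data.frame(
  cik = c(1001, 1002, 1003),
  date.filed = c("2014-02-21", "2015-11-03", "2016-07-15"),
  stringsAsFactors = FALSE
)

# Parse the character column as Date (format assumed to be YYYY-MM-DD)
fdt$date.filed <- as.Date(fdt$date.filed, format = "%Y-%m-%d")

# format() returns the year as character; as.integer() makes it numeric
fdt$year <- as.integer(format(fdt$date.filed, "%Y"))

print(fdt$year)
```

Keeping the column as Date also means the month or weekday can be pulled out later with the same format() call and a different conversion specifier.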

Related

How to filter via a logical expression that filters via a variable [duplicate]

I have a question about the use of a logical expression in combination with a variable.
Imagine that I have a data frame with multiple rows that each contain a date saved as 2021-09-25T06:04:35:689Z.
I also have a variable that contains the date of yesterday as '2021-09-24' - yesterday <- Sys.Date()-1.
How do I filter the rows in my data frame based on the date of yesterday which is stored in the variable 'yesterday'?
To solve my problem, I have looked at multiple posts, for example:
Using grep to help subset a data frame
I am well aware that this question might be a duplicate. However, the existing questions do not give me the help that I need. I hope that one of you can help me.
As an initial matter, it looks like you have a vector instead of a data frame (only one column). If you really do have a data frame and only ran str() on one column, the very similar technique at the end will work for you.
The first thing to know is that your dates are stored as character strings, while your yesterday object is in the Date format. Comparing objects of different types forces R to coerce one of them implicitly, which is unlikely to give the match you want, so you should convert at least one of the two objects explicitly.
I suggest converting both to the POSIXct format so that you do not lose any information in your dates column but can still compare it to yesterday. Make sure to set the timezone to the same as your system time (mine is "America/New_York").
Dates <- c("2021-09-09T06:04:35.689Z", "2021-09-09T06:04:35.690Z", "2021-09-09T06:04:35.260Z", "2021-09-24T06:04:35.260Z")
Dates <- gsub("T", " ", Dates)
Dates <- gsub("Z", "", Dates)
Dates <- as.POSIXct(Dates, format = '%Y-%m-%d %H:%M:%OS', tz = "America/New_York") #name the format argument so it is not mistaken for tz
yesterday <- Sys.time()-86400 #the number of seconds in one day
Now you can tell R to ignore the time and only compare the dates.
trunc(Dates, units = c("days")) == trunc(yesterday, units = c("days"))
The other part of your question was about filtering. The easiest way to filter is subsetting. You first ask R for the indices of the matching values in your vector (or column) by wrapping your comparison in the which() function.
Indices <- which(trunc(Dates, units = c("days")) == trunc(yesterday, units = c("days")))
None of the dates in your str() results match yesterday, so I added one at the end that matches. Calling which() returns a 4 to tell you that the fourth item in your vector matches yesterday's date. If more dates matched, it would return more values. I saved the results in "Indices".
We can then use the Indices from which() to subset your vector or dataframe.
Filtered_Dates <- Dates[Indices]
Filtered_Dataframe <- df[Indices,] #note the comma, which indicates that we are filtering rows instead of columns.
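The same filtering idea can be sketched for a whole data frame. This variant swaps trunc() for as.Date() to drop the time of day, and it assumes the trailing Z means the timestamps are in UTC, which differs from the timezone choice above. The data frame df, its columns, and the fixed reference date are all made up so the result is reproducible:

```r
# Hypothetical data frame; timestamps and ids are invented
df <- data.frame(
  id = 1:4,
  stamp = c("2021-09-09T06:04:35.689Z", "2021-09-23T23:59:59.000Z",
            "2021-09-24T06:04:35.260Z", "2021-09-25T00:00:01.000Z"),
  stringsAsFactors = FALSE
)

# Fixed reference date so the example is reproducible;
# in practice this would be Sys.Date() - 1
yesterday <- as.Date("2021-09-24")

# Parse the timestamps; strptime-style parsing ignores the trailing "Z",
# and the times are assumed (not confirmed by the question) to be UTC
parsed <- as.POSIXct(df$stamp, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# Drop the time of day and compare calendar dates (in UTC)
filtered <- df[as.Date(parsed, tz = "UTC") == yesterday, ]
print(filtered)
```

Comparing Date to Date avoids any implicit coercion, and the logical vector can be used directly for subsetting without which().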

Automatically name the elements of a list after importing using lapply

I have a list of dataframes which I imported using
setwd("C:path")
fnames <- list.files()
csv <- lapply(fnames, read.csv, header = T, sep=";")
I will need to do this multiple times, creating more lists, and I would like to keep all the dataframes available separately (i.e. I don't want or need to combine them). I simply used the above code to import them all quickly. But accessing them now is a little cumbersome and not intuitive (to me at least). Rather than having to use [[1]] to access the first element, is there a way that I could amend the first bit of code so that I can name the elements in the list, for example based on a Date which is a variable in each of the dataframes in the list? The dates are stored as chr in the format "dd-mm-yyyy", so I could potentially just name the dataframes using dd-mm from the Date variable.
You can extract the required data from the 1st value in the Date column of each dataframe and assign it as name of the list.
names(csv) <- sapply(csv, function(x) substr(x$Date[1], 1, 5))
Or extract the data using regex.
names(csv) <- sapply(csv, function(x) sub("(\\w+-\\w+).*", "\\1", x$Date[1]))
We can use
names(csv) <- sapply(csv, function(x) substring(x$Date[1], 1, 5))
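A small self-contained sketch of the naming step, using in-memory data frames in place of the imported files (the values are invented, and vapply() is used here instead of sapply() only to guarantee one string per element):

```r
# Stand-ins for the imported files; in the question these come from read.csv
csv <- list(
  data.frame(Date = c("01-03-2020", "01-03-2020"), value = c(1, 2),
             stringsAsFactors = FALSE),
  data.frame(Date = c("15-04-2020", "15-04-2020"), value = c(3, 4),
             stringsAsFactors = FALSE)
)

# Take the dd-mm part of the first Date in each data frame as its name
names(csv) <- vapply(csv, function(x) substr(x$Date[1], 1, 5), character(1))

# Elements can now be accessed by name instead of by position
print(names(csv))
print(csv[["15-04"]]$value)
```

If two files share the same dd-mm prefix the names will collide, so a longer prefix or the file name may be a safer key in that case.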

R: Daily data to monthly

I have a large xts object with multiple variables. The index is daily, in the sense that it corresponds to exact days; however, there is only one observation for each variable in a month. Is there a way to drop the day from the index and only keep year-month?
To illustrate my problem: I have var1 with an observation on 2011-06-28 and var2 with an observation on 2011-06-30. I would like to index both as 2011-06.
Thanks
Alternatively, you could "tell" R that you are using dates of a certain format with the as.Date() function and then use format() to change them to the format you desire.
Like this:
dates=c("2011-06-28","2011-06-29","2011-06-30","2011-07-1") #test string with dates in original format
dates2 <- format(as.Date(dates,"%Y-%m-%d"), format="%Y-%m") #changing the "%Y-%m-%d" format to the desired "%Y-%m"
print(dates2)
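Edit: If you only want to change how the index of an xts object is displayed:
indexFormat(xts_object) <- "%Y-%m" #newer xts releases deprecate this in favour of tformat(xts_object) <- "%Y-%m"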
Cheers
Chris
You can probably do this:
Use gsub (replace a pattern with whatever you want) with regex (a sequence of characters that define a search pattern in e.g. a string).
The pattern is done with regex, which has lots of metacharacters that allow you to do more advanced things. The dot (.) is a wildcard and the $ anchors it at the back. So the pattern is basically any 3 characters before the end and replace them with nothing.
your_object<-c("2011-06-28","2011-06-30")
gsub(pattern = "...$", replacement = "", x = your_object)
Here is a guide for using gsub with regex (http://uc-r.github.io/regex).

Dynamically specify column name in spread()

I am attempting to automate a simple process of importing some data and using the spread function from the tidyr package to make it wide format data.
Below is a simplified example
Ticker <- c(rep("GOOG",5), rep("AAPL",5))
Prices <- rnorm(10, 95, 5)
Date <- rep(sapply(c("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05"),as.Date), 2)
exStockData <- data.frame(Ticker, Date, Prices)
After reading in a data frame like exStockData, I'd like to be able to create a data frame like the one below
library(tidyr)
#this is the data frame I'd like to be able to create
desiredDataFrame <- spread(exStockData, Ticker, Prices)
However, the column used for the key argument of the spread function will not always be called Ticker and the column used for the value argument of the function will not always be called Prices. The column names are read in from a different portion of the file that gets imported.
#these vectors are removed because of the way my text file is read in
#I don't actually have these vectors
rm(Ticker, Prices, Date)
#the name of the first column (which serves as the key in
#the spread function) of the exStockData data frame will
#vary, and is read in from the file and stored as a one
#element character vector
secID <- "Ticker"
#the name of the last column in the data frame
#(which serves as the value in the spread function)
#is also stored as a one element character vector
fields <- "Prices"
#I'd like to be able to dynamically specify the column
#names using these other character vectors
givesAnError <- spread(exStockData, get(secID), get(fields))
The "See also" section of the documentation for the spread function mentions the spread_ function which is intended to be used in this situation.
In this case the solution is to use:
solved <- spread_(exStockData, secID, fields)
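In more recent tidyr releases the underscore variants are deprecated and spread() is superseded by pivot_wider(). A hedged sketch of the same dynamic-column idea with that function follows; it assumes tidyr 1.0.0 or later is installed, rebuilds the example data from the question, and uses set.seed() only so the random prices are repeatable:

```r
library(tidyr)

# Rebuild the example data (set.seed only makes the random prices repeatable)
set.seed(1)
exStockData <- data.frame(
  Ticker = rep(c("GOOG", "AAPL"), each = 5),
  Date = rep(as.Date("2015-01-01") + 0:4, times = 2),
  Prices = rnorm(10, 95, 5)
)

# Column names held as strings, as in the question
secID <- "Ticker"
fields <- "Prices"

# all_of() selects columns by the names stored in the character vectors
wide <- pivot_wider(exStockData,
                    names_from = all_of(secID),
                    values_from = all_of(fields))
print(wide)
```

The result has one row per Date and one column per ticker, matching the shape of desiredDataFrame in the question.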

as.posixct when applied for an element in the data frame returns a number instead of date and time

Here is the existing data:
I have 2 columns of data. Every row of the first column has data, whereas only certain rows of the second column have data (the others are blank). I want to convert the data with as.POSIXct(). For the first column I used the following code (I named the data frame 'mrkt'):
mrkt[1]<-lapply(mrkt[1],as.POSIXct)
This worked well in terms of converting the existing data to the right format
For the second column the above code won't work as the as.POSIXct() cannot address "" values. So I wrote a loop instead:
for (i in 1:dim(mrkt[2])[1]){
  if (!as.character(mrkt[[2]][i])==""){
    mrkt$open_time[i]<-as.POSIXct(mrkt$open_time[i])
  }
}
However this is giving me weird outputs in the form of a number. How can I avoid that? Here is the output:
An easy way to do this would be to do this:
library(plyr)
library(dplyr)
mrkt %>%
  mutate(send_time = send_time %>%
           as.POSIXct,
         open_time = open_time %>%
           mapvalues("", NA) %>%
           as.POSIXct)
This is due to implicit type coercion from POSIXct to numeric. It only happens in the loop because the vector already has a type, and single values assigned into it are cast to that type. When the whole vector is replaced, a new vector is created with the right type.
The simplest solution is to use as.POSIXct(strptime(mrkt$open_time, format=yourformat)) with a correctly defined format; see ?strptime for the formats. This is vectorized, and strptime handles empty strings correctly (returning NA).
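A minimal sketch of the vectorized conversion, assuming a made-up open_time vector and an assumed timestamp layout (neither is taken from the OP's data, which is not shown):

```r
# Hypothetical column with a blank entry, mimicking open_time in the question
open_time <- c("2015-03-01 09:15:00", "", "2015-03-02 17:40:30")

# Assumed timestamp layout; adjust the format string to match the real data
fmt <- "%Y-%m-%d %H:%M:%S"

# strptime() yields NA for strings that do not match the format, including ""
converted <- as.POSIXct(strptime(open_time, format = fmt, tz = "UTC"))

print(converted)
print(class(converted))
```

Because the whole vector is converted in one call, the result keeps the POSIXct class, and the blank entry simply becomes NA.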
