I am a newbie to Stack Overflow, stats and R, so apologies for the simple nature of my question/request for advice:
I am completing analysis of a large data set comprising two files: a txt file containing internal temperature data and a second SPSS data file.
To kick off, I exported the SPSS data to CSV format and stripped it back to just the few columns I think I need: house type and occupant type. I then imported all the temperature data and merged the two using a common identifier.
So now I have a merged data frame containing all the data I need (to begin with) to start some analysis.
First question: I have year, date and time as separate columns. However, the time column has imported with an incorrect date, "30/12/1899", in front of each time. How can I delete the date part of every observation in this column but retain the time?
Second question: Similar to the above, the date column shows the correct date but is followed by a time, which is not correct (every observation shows 00:00:00). How can I delete all the times from this column?
Third question: How can I combine the correct time with the correct date to end up with DD/MM/YYYY HH:MM:SS?
Fourth question: Should I create subsets of the merged data frame to facilitate the analysis, i.e. a separate subset for each house type vs temperature, time and occupant type?
Dates can be brought in as they are, instead of as factors, via the parameter as.is = TRUE, e.g.
data <- read.csv(choose.files(), as.is = T)
I would try reading the CSV file again and then working with the date-time. It will come in as a chron or some such format and you'll need to change it to POSIXct; well, I do anyway. To view help on a function, type a question mark followed by the function name, e.g. ?as.POSIXct.
# Date.Time: chron "2018/08/04 10:10:00", ... i.e. '%Y/%m/%d %H:%M:%S' is the
# format as read in on my system.
# The format argument below describes how the input is stored, not how you want
# it displayed; the '%d/%m/%Y %H:%M:%S' output you want is a display step you
# can do later with format().
# tz = '' uses your system's default time zone; read up on time zones if that matters.
# Finally, on the left side of the assignment <- I am creating a new column, Date.
# You could overwrite the old column, Date.Time, but it can't hurt to learn how
# to delete a column.
data$Date <- as.POSIXct(data$Date.Time, tz = '', format = '%Y/%m/%d %H:%M:%S')
# Now remove the original column. The minus in -Date.Time drops Date.Time; if you
# leave the minus out, the result will contain only the Date.Time column.
data <- subset(data, select = -Date.Time)
Try this first, and I will look into removing the time within a date field. I have an idea, but I'd rather see if this helps with the problem first.
Though if you do want to merge the Year, Month and Day columns, you could try something like this; it seems like a logical thing to do, and you can always keep the original columns and delete them later. It's not hurting anything.
data$YMD <- paste(data$Year,
                  data$Month,
                  data$Day,
                  sep = " ")
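For the third question, combining the correct date with the correct time into DD/MM/YYYY HH:MM:SS, here is a minimal sketch; it assumes the cleaned-up columns are POSIXct and called Date and Time (adjust to your actual column names):
# Assumed names: Date holds the correct date (its time part is 00:00:00) and
# Time holds the correct clock time (its date part is the bogus 30/12/1899).
date_part <- format(data$Date, "%d/%m/%Y")   # keep only the date
time_part <- format(data$Time, "%H:%M:%S")   # keep only the clock time
data$DateTime <- as.POSIXct(paste(date_part, time_part),
                            format = "%d/%m/%Y %H:%M:%S", tz = "")
# For display in DD/MM/YYYY HH:MM:SS:
format(data$DateTime, "%d/%m/%Y %H:%M:%S")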
Also, while you are at it, install a package called dplyr, written by the same person who wrote ggplot2, Hadley Wickham.
install.packages("dplyr")
# Then add it to the top of your script, as you would with ggplot2.
library(dplyr)
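As a small illustration of dplyr in this context (a sketch only, assuming the Date.Time column from the earlier example), the same clean-up can be written as a pipeline:
data <- data %>%
  mutate(Date = as.POSIXct(Date.Time, tz = "", format = "%Y/%m/%d %H:%M:%S")) %>%
  select(-Date.Time)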
I have a question about the use of a logical expression in combination with a variable.
Imagine that I have a data frame with multiple rows that each contain a date saved as 2021-09-25T06:04:35:689Z.
I also have a variable that contains yesterday's date as '2021-09-24': yesterday <- Sys.Date() - 1.
How do I filter the rows in my data frame based on the date of yesterday which is stored in the variable 'yesterday'?
To solve my problem, I have looked at multiple posts, for example:
Using grep to help subset a data frame
I am well aware that this question might be a duplicate. However, existing questions do not provide the help that I need. I hope that one of you can help me.
As an initial matter, it looks like you have a vector instead of a data frame (only one column). If you really do have a data frame and only ran str() on one column, the very similar technique at the end will work for you.
The first thing to know is that your dates are stored as character strings, while your yesterday object is in the Date format. Comparing objects of different classes directly will not do what you want, so you need to convert at least one of the two objects.
I suggest converting both to the POSIXct format so that you do not lose any information in your dates column but can still compare it to yesterday. Make sure to set the timezone to the same as your system time (mine is "America/New_York").
Dates <- c("2021-09-09T06:04:35.689Z", "2021-09-09T06:04:35.690Z", "2021-09-09T06:04:35.260Z", "2021-09-24T06:04:35.260Z")
Dates <- gsub("T", " ", Dates)
Dates <- gsub("Z", "", Dates)
Dates <- as.POSIXct(Dates, format = '%Y-%m-%d %H:%M:%OS', tz = "America/New_York")
yesterday <- Sys.time()-86400 #the number of seconds in one day
Now you can tell R to ignore the time and only compare the dates.
trunc(Dates, units = c("days")) == trunc(yesterday, units = c("days"))
The other part of your question was about filtering. The easiest way to filter is subsetting. You first ask R for the indices of the matching values in your vector (or column) by wrapping your comparison in the which() function.
Indices <- which(trunc(Dates, units = c("days")) == trunc(yesterday, units = c("days")))
None of the dates in your str() results match yesterday, so I added one at the end that matches. Calling which() returns a 4 to tell you that the fourth item in your vector matches yesterday's date. If more dates matched, it would have more values. I saved the results in "Indices".
We can then use the Indices from which() to subset your vector or dataframe.
Filtered_Dates <- Dates[Indices]
Filtered_Dataframe <- df[Indices,] #note the comma, which indicates that we are filtering rows instead of columns.
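If your dates end up in a data frame column, the same filter can also be written with dplyr; this is just a sketch, assuming the column is called DateTime:
library(dplyr)
df <- data.frame(DateTime = Dates)   # Dates from the example above
Filtered_Dataframe <- df %>%
  filter(as.Date(DateTime, tz = "America/New_York") == Sys.Date() - 1)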
Hello everyone!
As part of my clinical study I created an xlsx spreadsheet containing a data set. Only columns 2 to 12 and rows 1 to 307 are useful to me. I now manipulate my spreadsheet in R, after importing it (read_excel, etc.).
In my columns 11 and 12 ('Data' and 'Raw_data'), some cells correspond to dates (for example the first 2 rows of 'Data' and 'Raw_data'); these are the patients' visit dates. However, as you can see, these dates are given to me as a number of days since the origin "1899-12-30", and I would like to transform them into a normal date format (2019-07-05).
My problem is that in these columns I don't have only dates; I also have various numerical results (times, means, scores, etc.).
I started by transforming the class of my columns from character to factor/numeric so that I could better manipulate the columns later. But I can't change only the format of cells corresponding to a date.
Do you know if it is possible to transform only the cells concerned and if so how?
I attach my code and a preview of my data frame.
Part "Unsuccessful trial": I tried with this kind of thing. Of course the date changes format here but as soon as I try to make this change in the data frame it doesn't work.
Thank you for your help!
library(readxl)   # read_excel, cell_cols
library(dplyr)    # %>%, mutate_at, as_tibble
# Indicate the id of the patient
id = "01_AA"
# Get protocol data of patient
idlst <- dir("/data/protocolData", full.names = T, pattern = id)
# Convert the xlsx database into dataframe
idData <- data.table::rbindlist(lapply(
  idlst,
  read_excel,
  n_max = 307,
  range = cell_cols("B:M")  # just keep the table
), fill = TRUE)
idData <- as_tibble(idData)
idData<- idData %>%
mutate_at(vars(1:10), as.factor)%>%
mutate_at(vars(11:length(idData)), as.numeric)
# Unsuccessful trial
as.Date.character(data[1:2,11:12], origin ='1899-12-30')
Thank you for your comments and indeed this is one of the problems with R.
I solved my problem with the following code where idData is my df.
# Change the date format of the date cells in the columns Data and Raw_data:
idData$Data[grepl("date", idData$Measure)] <- as.character(as.Date(
  as.numeric(idData$Data[grepl("date", idData$Measure)]),
  origin = "1899-12-30"))
First time caller, longtime listener.
I am trying to solve two problems:
1. My function does not perform as anticipated.
2. I cannot figure out how to make a plot from date data.
I have tried to approach my function problem from multiple angles, but I am only making things harder than they need to be. The issue I cannot overcome is that the date sequence I have created for the data set's date range is not the same length as the data set's columns.
For the y-axis of my plot, I want:
f(dates[x]) = number of data set entries on or before dates[x],
where dates[x] refers to a given date in the data set's date range.
I'm sure there is an easy solution but I cannot figure it out.
Note: I used to have a basic understanding of R, but I am relearning after a long break, so please use the simplest terms possible.
# import data
data <- read.csv("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv")
#
# coerce date column into date class
data$date <- as.Date(data$date)
#
# sequence of dates for date range of data set
dates <- seq(data$date[1], data$date[length(data$date)], by = "days")
#
# numeric vector for the number of days in the date range of data set
xx <- c(1:length(dates))
#
# function meant to return a numeric vector of the count of entries in the data set that occurred on or before a given date
# within the data set date range.
fun <- function(x){
sum(dates[x]<=data$date)
}
# This function returns a single value and not a vector as I'd expected.
# This plot is the objective. x = number of days in data set date range, y = number of entries in data set on or before date(x)
plot(xx,y=fun(xx))
Working with dates is a loaded topic. It is extremely powerful, but it pays to be careful. Here is my take:
data <- read.csv(paste0("https://raw.githubusercontent.com/washingtonpost/", # wrapped
"data-police-shootings/master/fatal-police-shootings-data.csv"))
library(anytime) ## helper package
data$date <- anydate(data$date) ## helper function not requiring format
Now we have a date type and you can do
data[ data$date <= anydate(20150110), ]
If you use the date on the x-axis it all works out correctly too.
That said, I tend to do all this inside data.table objects, but that is more learning for you. Another day :) Keep it in mind -- the grouping, aggregation and filtering are absolutely worth it. And it is the fastest tool around.
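Building on that, a minimal sketch of the plot described in the question (the count of entries on or before each date) could look like this; it assumes data$date has already been converted with anydate() as above:
dates  <- seq(min(data$date), max(data$date), by = "days")
# For each date in the range, count how many entries fall on or before it
counts <- sapply(dates, function(d) sum(data$date <= d))
plot(dates, counts, type = "l",
     xlab = "date", ylab = "entries on or before date")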
I'm new to R (having worked in C++ and Python before) so this is probably just a factor of me not knowing some of R's nuances.
The program I'm working on is supposed to construct matrices of data by date. Here's how I might initialize such a matrix:
dates <- seq(as.Date("1980-01-01"), as.Date("2013-12-31"), by="days")
HN3 <- matrix(nrow=length(dates), ncol = 5, dimnames = list(as.character(dates), c("Value1", "Value2", "Value3", "Value4", "Value5")))
Notice that dates includes every day between 1980 and 2013.
So, from there, I have files containing certain dates and measurements of Value1, etc for those dates, and I need to read those files' contents into HN3. But the problem is that most of the files don't contain measurements for every day.
So what I want to do is read a file into a dataframe (say, v1read) with column 1 being dates and column 2 being the desired data. Then I'd match the dates of v1read to that date's row in HN3 and copy all of the relevant v1read values that way. Here is my attempt at doing so:
for (i in 1:nrow(v1read)) {
  HN3[as.character(v1read[i, 1]), "Value1"] <- v1read[i, 4]
}
This gives me an out of index range error when the value of i is bumped up unexpectedly. I understand that R doesn't like to iterate through dates, but since the iterator itself is a numeric value rather than a date, I was hoping I'd found a loophole.
Any tips on how to accomplish this would be enormously appreciated.
Let's use library(dplyr). Start with
dates = seq(as.Date("1980-01-01"), as.Date("2013-12-31"), by="days")
HN3 = data.frame(Date=dates)
Now, load in your first file, the one that has a date and Value1.
file1 = read.csv(value1.file) # or read_excel(), etc.; I'm assuming this file already has columns named "Date" and "Value1"
file1$Date = as.Date(file1$Date) # make sure the join column has the same Date class as HN3$Date
HN3 = left_join(HN3,file1,by="Date")
This will do a left join (SQL style) matching only the rows where a date exists and filling in the rest with NA. Now you have a data frame with two columns, Date and Value1. Load in your other files, do a left_join with each and you'll be done.
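A tiny self-contained sketch of that join, with made-up dates and values purely for illustration:
library(dplyr)
dates <- seq(as.Date("1980-01-01"), as.Date("1980-01-05"), by = "days")
HN3   <- data.frame(Date = dates)
# Pretend this came from one of your files; only two of the five days were measured
file1 <- data.frame(Date   = as.Date(c("1980-01-02", "1980-01-04")),
                    Value1 = c(3.1, 2.7))
HN3 <- left_join(HN3, file1, by = "Date")
HN3   # Value1 is NA on the days with no measurement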
First off, I'm new to programming.
I built a table with 3 columns and I want to evaluate it as a time series, so I'm playing around with the ts() function. The first column of my table is DATE, converted with as.Date in the format "yyyy-mm-dd". I have one observation per variable per day. I've applied ts() to the table and tried start = 1 (the first observation?), and when I check head(df) the DATE column comes back as a loose sequence of numbers that I can't identify (12591, 12592, 12593, 12594, 12597, 12598).
Could it be that as.Date is messing things up?
The line I use is:
ts(dy2, start=1, frequency= 1)
I've also been playing with the deltat argument. In the help file it suggests 1/12 for monthly data. Naturally, I tried 1/365 (for daily data), but have yet to be successful.
As suggested by G. Grothendieck you can use the zoo package. Try this:
require(zoo)
dates <- as.Date(dy2[,1], format = "%Y-%m-%d")
x1 <- zoo(dy2[,2], dates)
plot(x1)
x2 <- zoo(dy2[,3], dates)
plot(x2)
If this does not work, please provide further details about your data as requested by MrFlick. For example, print the output of dput(dy2) or at least head(dy2).
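If it helps, here is a self-contained sketch with made-up data in the same shape as dy2 (a date column followed by two value columns); note that the underlying day counts for dates in 2004 are numbers like 12591, which matches what you saw:
require(zoo)
dy2 <- data.frame(DATE = c("2004-06-22", "2004-06-23", "2004-06-24"),
                  var1 = c(1.2, 1.5, 1.1),
                  var2 = c(10, 12, 9))
dates <- as.Date(dy2[, 1], format = "%Y-%m-%d")
x1 <- zoo(dy2[, 2], dates)
x2 <- zoo(dy2[, 3], dates)
plot(x1)
plot(x2)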