I have a column of dates that I fill with an ifelse statement, and it returns plain numeric values instead of dates.
osa <- read.delim("C:/RMathew/RScripts/osaevents/osaevents.txt", stringsAsFactors=TRUE)
library(lubridate)  # provides ymd_hms() and ymd()
osa$datetime <- ymd_hms(osa$datetime)
osa$date <- as.Date(osa$datetime)
sixoclock <- 6*60*60
osa$daystart <- ymd_hms(ymd(osa$date) + sixoclock)
osa$dateplus <- osa$date + 1
osa$dateminus <- osa$date - 1
osa$dayend <- ymd_hms(ymd(osa$dateplus) + sixoclock)
osa$dateloca <- osa$datetime >= osa$daystart
osa$datelocb <- osa$datetime < osa$dayend
osa$milldate <- ifelse(osa$dateloca==TRUE & osa$datelocb==TRUE,
osa$date,osa$dateminus)
The site where this data originates treats the period from 6 AM on one day to 6 AM on the following day as a single day. The code above checks whether each timestamp falls at or after 6 AM on its calendar date and before 6 AM on the following day; if so, it is assigned that calendar date, otherwise the previous day's date.
So far so good, but osa$milldate comes back as plain numbers instead of the dates in the columns passed to ifelse.
'data.frame': 897 obs. of 16 variables:
$ datetime : POSIXct, format: "2015-08-13 15:11:53" "2015-08-13 14:53:26" "2015-08-13 14:34:58" "2015-08-13 14:16:18" ...
$ stream : Factor w/ 1 level "fc": 1 1 1 1 1 1 1 1 1 1 ...
$ fe : num 18.1 18 17.6 18.1 18.5 ...
$ ni : num 2.97 2.99 2.92 3.2 3.32 ...
$ cu : num 3.41 3.35 2.99 3.58 3.73 ...
$ pd : num 138 157 139 166 183 ...
$ mg : num 13.8 13.8 14.4 14.3 13.9 ...
$ so : num 9.67 9.81 9.65 10.58 11.37 ...
$ date : Date, format: "2015-08-13" "2015-08-13" "2015-08-13" "2015-08-13" ...
$ daystart : POSIXct, format: "2015-08-13 06:00:00" "2015-08-13 06:00:00" "2015-08-13 06:00:00" "2015-08-13 06:00:00" ...
$ dateplus : Date, format: "2015-08-14" "2015-08-14" "2015-08-14" "2015-08-14" ...
$ dateminus: Date, format: "2015-08-12" "2015-08-12" "2015-08-12" "2015-08-12" ...
$ dayend : POSIXct, format: "2015-08-14 06:00:00" "2015-08-14 06:00:00" "2015-08-14 06:00:00" "2015-08-14 06:00:00" ...
$ dateloca : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ datelocb : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ milldate : num 16660 16660 16660 16660 16660 ...
Thoughts? Also, there is likely to be a more elegant way to do this.
See the help file for ifelse
Warning:
The mode of the result may depend on the value of ‘test’ (see the
examples), and the class attribute (see ‘oldClass’) of the result
is taken from ‘test’ and may be inappropriate for the values
selected from ‘yes’ and ‘no’.
Sometimes it is better to use a construction such as
(tmp <- yes; tmp[!test] <- no[!test]; tmp)
, possibly extended to handle missing values in ‘test’.
This describes precisely what is going on in your example (the Date class attribute is lost) and suggests a workaround: a multi-step approach.
osa$milldate <- osa$date
ind <- osa$dateloca & osa$datelocb
osa$milldate[!ind] <- osa$dateminus[!ind]  # subset both sides with the same index
Another option is replace.
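A minimal sketch of the difference, using a small made-up Date vector (d, dminus and keep are hypothetical names, not columns of the osa data frame). It only illustrates that ifelse drops the class while replace keeps it:

```r
# Hypothetical data, not the original osa data frame
d      <- as.Date(c("2015-08-13", "2015-08-14", "2015-08-15"))
dminus <- d - 1
keep   <- c(TRUE, FALSE, TRUE)   # TRUE: keep d, FALSE: use the previous day

# ifelse() builds its result from the logical test, so the Date class is dropped
via_ifelse <- ifelse(keep, d, dminus)
class(via_ifelse)    # "numeric"

# replace() starts from the Date vector itself, so the class survives
via_replace <- replace(d, !keep, dminus[!keep])
class(via_replace)   # "Date"
```

Note that the replacement values are subset with the same index (!keep) as the positions being replaced.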
A. Webb set me on the right path. ifelse was stripping the Date class from the result. When I ran the indexed solution above, the date order came out jumbled, which is what happens when the replacement values are not subset with the same index (osa$dateminus[!ind]). As A. Webb pointed out from the help file, the class attribute was the problem, and the following line fixed it immediately.
class(osa$milldate) <- class(osa$date)
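On the request for something more elegant: since a "mill day" runs from 6 AM to 6 AM, one possible shortcut is to shift every timestamp back six hours and take the calendar date of the result, which avoids ifelse entirely. This is only a sketch on made-up timestamps; the fixed UTC time zone is an assumption made so the example is reproducible.

```r
# Hypothetical timestamps; tz is fixed to UTC so the example is reproducible
datetime <- as.POSIXct(c("2015-08-13 05:59:59",
                         "2015-08-13 06:00:00",
                         "2015-08-14 02:30:00"), tz = "UTC")

# Shift each timestamp back six hours, then take the calendar date
milldate <- as.Date(datetime - 6 * 60 * 60, tz = "UTC")
milldate
class(milldate)   # "Date"
```

Here the first timestamp (just before 6 AM) lands on the previous day, while the other two stay on 2015-08-13.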
I have this preloaded dataset in R that I got from library(dplyr). The dataset is named pedestrian.
I want to convert the column Date_Time, whose class is S3: POSIXct, to chr.
I used as.character(), but it didn't give me what I want. My goal is to create a new dataset that excludes the Date_Time column, but select() still keeps the Date_Time column.
You have some incorrect understandings of where you got the data. There is no such dataset in dplyr. How do I know this?
data(package="dplyr")
# ---- returns ---
Data sets in package ‘dplyr’:
band_instruments Band membership
band_instruments2 Band membership
band_members Band membership
starwars Starwars characters
storms Storm tracks data
There is a dataset named "pedestrian" in the naniar package and it does have entries that match values in the fragment of data imaged in the link (although the column names and column order are different). Specifically:
grep("Birrarung", naniar::pedestrian$sensor_name) # returns 8455+1001 values
But your dataset has too many entries to be an exact match. Your dataset has 66,037 lines, somewhat less than double the number of rows in naniar::pedestrian.
str(naniar::pedestrian)
tibble [37,700 × 9] (S3: tbl_df/tbl/data.frame)
$ hourly_counts: int [1:37700] 883 597 294 183 118 68 47 52 120 333 ...
$ date_time : POSIXct[1:37700], format: "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" "2016-01-01 03:00:00" ...
$ year : int [1:37700] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ month : Ord.factor w/ 12 levels "January"<"February"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ month_day : int [1:37700] 1 1 1 1 1 1 1 1 1 1 ...
$ week_day : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 6 6 6 6 6 6 6 6 6 6 ...
$ hour : int [1:37700] 0 1 2 3 4 5 6 7 8 9 ...
$ sensor_id : int [1:37700] 2 2 2 2 2 2 2 2 2 2 ...
$ sensor_name : chr [1:37700] "Bourke Street Mall (South)" "Bourke Street Mall (South)" "Bourke Street Mall (South)" "Bourke Street Mall (South)" ...
Your request seems a bit confusing. "My goal is to create a new dataset that exclude Date_Time column. Select() still keep Date_Time column." There is no column with that name in the naniar pedestrian dataset (since capitalization needs to be exact), but if there were, it would be a simple matter to exclude it:
new_dat <- old_dat[!names(old_dat) %in% "Date_Time"]
# And maybe
new_dat <- old_dat %>% select(-Date_Time)
It would also be a simple matter to destructively convert to character with:
new_dat$Date_Time <- as.character(old_dat$Date_Time)
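Both operations can be tried on a small stand-in data frame. This is a sketch only: old_dat below is a made-up two-row example, and format() with an explicit format string is used here instead of as.character() simply to make the resulting text predictable.

```r
# Hypothetical stand-in for the real data
old_dat <- data.frame(
  Date_Time = as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 01:00:00"),
                         tz = "UTC"),
  Count     = c(1630L, 826L)
)

# Drop the column by name (drop = FALSE keeps the result a data frame)
new_dat <- old_dat[, setdiff(names(old_dat), "Date_Time"), drop = FALSE]
names(new_dat)

# Or keep the column, converted to character with an explicit format
chr_dat <- old_dat
chr_dat$Date_Time <- format(old_dat$Date_Time, "%Y-%m-%d %H:%M:%S")
class(chr_dat$Date_Time)   # "character"
```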
Since you have not shown what code you tried and cannot even tell us where the data comes from, we can only speculate what you did and how you are failing.
Maybe someone else created a dataset from the publicly accessible data at https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-Monthly-counts-per-hour/b2ak-trbp
AHA! Found it!
https://tsibble.tidyverts.org/reference/pedestrian.html
str(pedestrian)
tbl_ts [66,037 × 5] (S3: tbl_ts/tbl_df/tbl/data.frame)
$ Sensor : chr [1:66037] "Birrarung Marr" "Birrarung Marr" "Birrarung Marr" "Birrarung Marr" ...
$ Date_Time: POSIXct[1:66037], format: "2015-01-01 00:00:00" "2015-01-01 01:00:00" "2015-01-01 02:00:00" "2015-01-01 03:00:00" ...
$ Date : Date[1:66037], format: "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" ...
$ Time : int [1:66037] 0 1 2 3 4 5 6 7 8 9 ...
$ Count : int [1:66037] 1630 826 567 264 139 77 44 56 113 166 ...
- attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
..$ Sensor: chr [1:4] "Birrarung Marr" "Bourke Street Mall (North)" "QV Market-Elizabeth St (West)" "Southern Cross Station"
..$ .rows : list<int> [1:4]
.. ..$ : int [1:14566] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:16414] 14567 14568 14569 14570 14571 14572 14573 14574 14575 14576 ...
.. ..$ : int [1:17518] 30981 30982 30983 30984 30985 30986 30987 30988 30989 30990 ...
.. ..$ : int [1:17539] 48499 48500 48501 48502 48503 48504 48505 48506 48507 48508 ...
.. ..# ptype: int(0)
..- attr(*, ".drop")= logi TRUE
- attr(*, "index")= chr "Date_Time"
..- attr(*, "ordered")= logi TRUE
- attr(*, "index2")= chr "Date_Time"
- attr(*, "interval")= interval [1:1] 1h
..# .regular: logi TRUE
And now I see that the inimitable Rob Hyndman is behind this project: https://robjhyndman.com/hyndsight/tsibbles/
And I guess I am a bit behind the times or perhaps I should say I'm behind the time series. There are over 500 hits for an SO search on "[r] tsibble". Now I need to ask; did this answer your question?
Testing:
new_ped <- pedestrian %>% select(-Date_Time) # fails
pedestrian$Date_Time <- as.character(pedestrian$Date_Time) # succeeds
The tsibble::pedestrian dataset is a tbl_ts object (an S3 class layered on top of a tibble, as the str output above shows) and may not behave in the same manner as ordinary data frames. But my error was in using the wrong operator for column removal.
?select # need to choose dplyr version rather than MASS version
Need to use ! rather than "-":
new_ped <- pedestrian %>% select(!Date_Time)
> str(new_ped)
tibble [66,037 × 4] (S3: tbl_df/tbl/data.frame)
$ Sensor: chr [1:66037] "Birrarung Marr" "Birrarung Marr" "Birrarung Marr" "Birrarung Marr" ...
$ Date : Date[1:66037], format: "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" ...
$ Time : int [1:66037] 0 1 2 3 4 5 6 7 8 9 ...
$ Count : int [1:66037] 1630 826 567 264 139 77 44 56 113 166 ...
How can I use cut while maintaining the POSIXct class of my date.time vector?
library(data.table)
library(lubridate)
Some data:
air.temp <- c(-1.7202,-1.6524,-1.5689,-1.6785,-1.6060,-1.8843)
soil.temp <- c(3.6972,3.6839,3.6716,3.6586,3.6460,3.6701)
date.time <- c('2007-01-01 00:05:00','2007-01-01 00:10:00',
'2007-01-01 00:15:00','2007-01-01 00:20:00',
'2007-01-01 00:25:00','2007-01-01 00:30:00')
DT <- data.table(date.time, air.temp, soil.temp)
DT[, date.time := parse_date_time(date.time, 'YmdHMS')]
Structure shows the date.time column is in the desired POSIXct format:
str(DT)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ date.time: POSIXct, format: "2007-01-01 00:05:00" ...
$ air.temp : num -1.72 -1.65 -1.57 -1.68 -1.61 ...
$ soil.temp: num 3.7 3.68 3.67 3.66 3.65 ...
- attr(*, ".internal.selfref")=<externalptr>
Now I cut five minute data to fifteen minute:
DT_15_min <- DT[, lapply(.SD, mean), by=(date.time = cut(date.time, "15 min"))]
Structure shows the conversion to factor vector:
str(DT_15_min)
Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
$ date.time: Factor w/ 2 levels "2007-01-01 00:05:00",..: 1 2
$ air.temp : num -1.65 -1.72
$ soil.temp: num 3.68 3.66
- attr(*, ".internal.selfref")=<externalptr>
Is it possible to cut while maintaining POSIXct vector class?
My desired result is to have my data aggregated from a five minute interval to a fifteen minute interval while maintaining the original class of the vector (POSIXct in this case).
As always, I am grateful for any advice.
cut is designed to return factors. If you want to group by 15 min intervals, you could try using the rounding functions from lubridate, e.g.
DT_15_min <- DT[, lapply(.SD, mean), by=(date.time = floor_date(date.time, "15 mins"))]
str(DT_15_min)
Classes ‘data.table’ and 'data.frame': 3 obs. of 3 variables:
$ date.time: POSIXct, format: "2007-01-01 00:00:00" "2007-01-01 00:15:00" ...
$ air.temp : num -1.69 -1.62 -1.88
$ soil.temp: num 3.69 3.66 3.67
- attr(*, ".internal.selfref")=<externalptr>
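If lubridate is not available, the same flooring idea can be sketched in base R by rounding the underlying seconds down to a multiple of 900. This is an illustration under assumptions: the UTC time zone and the names bin_secs, bin_start and agg are introduced here and are not part of the original code.

```r
# Same sample values as in the question; tz fixed to UTC (an assumption)
date.time <- as.POSIXct(c("2007-01-01 00:05:00", "2007-01-01 00:10:00",
                          "2007-01-01 00:15:00", "2007-01-01 00:20:00",
                          "2007-01-01 00:25:00", "2007-01-01 00:30:00"),
                        tz = "UTC")
air.temp <- c(-1.7202, -1.6524, -1.5689, -1.6785, -1.6060, -1.8843)

# Round the seconds since the epoch down to the start of each 15-minute bin
bin_secs  <- 15 * 60
bin_start <- as.POSIXct(floor(as.numeric(date.time) / bin_secs) * bin_secs,
                        origin = "1970-01-01", tz = "UTC")

# Average within each bin; the grouping column stays POSIXct
agg <- aggregate(air.temp, by = list(date.time = bin_start), FUN = mean)
class(agg$date.time)
```

The result has three rows (bins starting at 00:00, 00:15 and 00:30), matching the floor_date output above.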
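You can also use dplyr (note that this rounds up with ceiling_date rather than down):
df <- tibble(date.time, air.temp, soil.temp) %>%
  mutate(date.time = ceiling_date(ymd_hms(date.time), unit = "15 mins")) %>%
  group_by(date.time) %>%
  summarize_all(mean)  # funs() is deprecated; pass the function directly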
I have data like this.
> head(new3)
Date Hour Dayahead Actual Difference
1 2015-01-01 0:00 42955 42425 530
2 2015-01-01 0:15 42412 42021 391
3 2015-01-01 0:30 41901 42068 -167
4 2015-01-01 0:45 41355 41874 -519
5 2015-01-01 1:00 40710 41230 -520
6 2015-01-01 1:15 40204 40810 -606
Their characteristics are as below:
> str(new3)
'data.frame': 35044 obs. of 5 variables:
$ Date : Date, format: "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" ...
$ Hour : chr "0:00" "0:15" "0:30" "0:45" ...
$ Dayahead : chr "42955" "42412" "41901" "41355" ...
$ Actual : int 42425 42021 42068 41874 41230 40810 40461 40160 39958 39671 ...
$ Difference: chr "530" "391" "-167" "-519" ...
I tried to change Hour and Dayahead to numeric using as.numeric, but it shows me this.
> new3$Dayahead<-as.numeric(new3$Dayahead)
Warning message:
NAs introduced by coercion
> new3$Hour<-as.numeric(new3$Hour)
Warning message:
NAs introduced by coercion
So when I checked with str again, it showed me this.
> str(new3)
'data.frame': 35044 obs. of 5 variables:
$ Date : Date, format: "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" ...
$ Hour : num NA NA NA NA NA NA NA NA NA NA ...
$ Dayahead : num 42955 42412 41901 41355 40710 ...
$ Actual : int 42425 42021 42068 41874 41230 40810 40461 40160 39958 39671 ...
$ Difference: chr "530" "391" "-167" "-519" ...
My questions are:
1) why do I have 'NAs introduced by coercion' warning message?
2) How can I solve the problem above?
3) Why do I get NA data for Hour and how can I solve it?
Thank you.
As already mentioned in the comments, if your string contains a non-numeric character (i.e., ":" in your Hour column), you cannot convert it to numeric, that's why you get NA.
I am not sure why you want to convert your times to numeric, but if you'd like to perform some operations on them (e.g., calculate time differences), then you should combine the date and time into a POSIXct date-time. In your case, run this on the original character Hour column:
new3$fulldate <- as.POSIXct(paste(new3$Date, new3$Hour), format = "%Y-%m-%d %H:%M")
Try this:
hour <- c("0:00", "0:15", "0:30", "0:45", "1:00", "1:15")
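Replace the ":" with "." and then convert. Note that the result only looks like the time (0:15 becomes 0.15); it is not decimal hours (which would be 0.25).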
hour <- gsub(":", ".", hour)
hour <- as.numeric(hour)
hour
[1] 0.00 0.15 0.30 0.45 1.00 1.15
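If true decimal hours are wanted instead, one possible approach is to split each string on the colon and add the minutes as a fraction of an hour. A minimal sketch, where parts, hours, minutes and decimal_hour are names introduced only for this illustration:

```r
hour <- c("0:00", "0:15", "0:30", "0:45", "1:00", "1:15")

# Split "H:MM" into its hour and minute pieces
parts   <- strsplit(hour, ":", fixed = TRUE)
hours   <- as.numeric(sapply(parts, `[`, 1))
minutes <- as.numeric(sapply(parts, `[`, 2))

# Minutes become a fraction of an hour
decimal_hour <- hours + minutes / 60
decimal_hour
```

With this, 0:15 becomes 0.25 and 1:15 becomes 1.25, so differences between values are real time differences in hours.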
I have several processed microarray data (normalized, .txt files) from which I want to extract a list of 300 candidate genes (ILMN_IDs). I need in the output not only the gene names, but also the expression values and statistics info (already present in the original file).
I have 2 dataframes:
normalizedData with the identifiers (gene names) in the first column, named "Name".
candidateGenes with a single column named "Name", containing the identifiers.
I've tried
1).
all=normalizedData
subset=candidateGenes
x=all%in%subset
2).
all[which(all$gene_id %in% subset)] #(as suggested in other bioinf. forum)#,
but it returns a data frame with 0 columns and >4000 rows. This is not correct, since normalizedData has 24 columns. Whenever I try to compare them, I get an error.
The key is to be able to compare the first column of all ("Name") with subset. Here is the info:
> class(all)
[1] "data.frame"
> dim(all)
[1] 4312 24
> str(all)
'data.frame': 4312 obs. of 24 variables:
$ Name: Factor w/ 4312 levels "ILMN_1651253": 3401..
$ meanbgt:num 0 ..
$ meanbgc: num ..
$ cvt: num 0.11 ..
$ cvc: num 0.23 ..
$ meant: num 4618 ..
$ stderrt: num 314.6 ..
$ meanc: num 113.8 ...
$ stderrc: num 15.6 ...
$ ratio: num 40.6 ...
$ ratiose: num 6.21 ...
$ logratio: num 5.34 ...
$ tp: num 1.3e-04 ...
$ t2p: num 0.00476 ...
$ wilcoxonp: num 0.0809 ...
$ tq: num 0.0256 ...
$ t2q: num 0.165 ...
$ wilcoxonq: num 0.346 ...
$ limmap: num 4.03e-10 ...
$ limmapa: num 4.34e-06 ...
$ SYMBOL: Factor w/ 3696 levels "","A2LD1",..
$ ENSEMBL: Factor w/ 3143 levels "ENSG00000000003",..
and here is the info about subset:
> class(subset)
[1] "data.frame"
> dim(subset)
[1] 328 1
> str(subset)
'data.frame': 328 obs. of 1 variable:
$ V1: Factor w/ 328 levels "ILMN_1651429",..: 177 286 47 169 123 109 268 284 234 186 ...
I really appreciate your help!
What you need to do is
all[all$Name %in% subset$V1, ]
When using a data.frame, it's important to drill down to the correct column that has the data you actually want to use. You need to know which columns hold the matching IDs. That is the only way this solution really differs from the other suggestions or from the things you've tried.
It's also important to note that when subsetting a data.frame by rows, you need to use the [,] syntax where the vector before the comma indicates rows and the vector after indicates columns. Here, since you want all columns, we leave it empty.
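The row-subsetting pattern can be sketched on a toy example. The data frames below are made up (all_genes and candidates are hypothetical names, chosen partly because all and subset are also base R functions, so reusing those names for data is easy to trip over):

```r
# Hypothetical stand-ins for normalizedData and candidateGenes
all_genes <- data.frame(
  Name  = factor(c("ILMN_1", "ILMN_2", "ILMN_3", "ILMN_4")),
  meant = c(10.5, 20.1, 30.7, 40.2)
)
candidates <- data.frame(V1 = factor(c("ILMN_4", "ILMN_2", "ILMN_9")))

# Rows go before the comma; the empty slot after it keeps every column
hits <- all_genes[all_genes$Name %in% candidates$V1, ]
hits

# Candidate IDs that were not found in the full table
missing_ids <- setdiff(as.character(candidates$V1),
                       as.character(all_genes$Name))
missing_ids
```

Checking missing_ids is a cheap way to confirm that every candidate was matched, since %in% silently ignores IDs that are absent.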
I've imported a dataset into R in which a column that should contain numeric values also contains the string NULL. This makes R set the column class to character or factor, depending on whether or not you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset.
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the NULL values make this harder.
I've tried to get around the issue setting the value 0 to all observations where current value is NULL, but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change to 0 all the NULL observations and afterwards transform all the column to numeric using the as.numeric function.
You have an indexing error: data$PCROI is a vector, so it takes a single subscript with no trailing comma:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, since PCROI is a character column, you can simply run:
data$PCROI <- as.numeric(data$PCROI)
This converts each "NULL" to NA automatically (with a coercion warning).
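Two related options, sketched on a tiny inline CSV (the raw string and the names data_na, data_chr and pcroi are made up for illustration). The first treats "NULL" as missing at import time; the second covers the case where the column was read as a factor, where calling as.numeric directly would return the internal level codes rather than the values.

```r
# Hypothetical three-row file contents
raw <- "PCROI,Clicks\n6.5,472\n8.17,13\nNULL,369"

# Option 1: declare "NULL" as a missing-value marker when reading
data_na <- read.csv(text = raw, na.strings = "NULL")
str(data_na$PCROI)   # numeric, with NA in the last row

# Option 2: the column arrived as a factor, so go through character first
data_chr <- read.csv(text = raw, stringsAsFactors = TRUE)
pcroi <- as.character(data_chr$PCROI)
pcroi[pcroi == "NULL"] <- NA          # explicit, so no coercion warning
data_chr$PCROI <- as.numeric(pcroi)
```

Whether NA or 0 is the right replacement depends on what a missing PCROI means in the data; NA keeps missing values out of means and sums unless they are handled explicitly.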