I need to compute a condition over a date column in R. A sample table would be:
PIL_final1<-data.frame( prior_day1_cart=c(4,8),
prior_day1_comp=c('2014-06-03','2014-06-07'),
dia_lim_23_cart=c('201-07-30','201-07-30') )
PIL_final1$prior_day1_comp<-as.Date(PIL_final1$prior_day1_comp, format='%Y-%m-%d')
PIL_final1$dia_lim_23_cart<-as.Date(PIL_final1$dia_lim_23_cart, format='%Y-%m-%d')
So I use ifelse:
PIL_final1$llamar_dia<-ifelse(PIL_final1$prior_day1_cart+6>23,
PIL_final1$dia_lim_23_cart ,
PIL_final1$prior_day1_comp+6)
But I get:
> PIL_final1
prior_day1_cart prior_day1_comp dia_lim_23_cart llamar_dia
1 4 2014-06-03 0201-07-30 16230
2 8 2014-06-07 0201-07-30 16234
And if I do:
> PIL_final1$prior_day1_comp+6
[1] "2014-06-09" "2014-06-13"
I get the right results.
How can I use ifelse and still get a date? Thanks.
Also if I try this, I still get a number (although different):
> PIL_final1$llamar_dia<-ifelse(PIL_final1$prior_day1_cart+6>23,
+ PIL_final1$dia_lim_23_cart ,
+ as.Date(PIL_final$prior_day1_comp+6,format="%Y-%m-%d"))
> PIL_final1
prior_day1_cart prior_day1_comp dia_lim_23_cart llamar_dia
1 4 2014-06-03 0201-07-30 16376
2 8 2014-06-07 0201-07-30 16377
Edit:
Also if I do this:
> as.Date(ifelse(PIL_final1$prior_day1_cart+6>23, PIL_final1$dia_lim_23_cart ,
+ PIL_final1$prior_day1_comp+6), format="%Y-%m-%d", origin="1970-01-01")
[1] "2014-06-09" "2014-06-13"
I get the right results, but if I replace the ifelse with the vector result, I get the wrong dates:
> PIL_final1$llamar_dia<-ifelse(PIL_final1$prior_day1_cart+6>23,
+ PIL_final1$dia_lim_23_cart ,
+ PIL_final$prior_day1_comp+6)
> as.Date(PIL_final1$llamar_dia, format="%Y-%m-%d", origin="1970-01-01")
[1] "2014-11-02" "2014-11-03"
from ?ifelse :
The mode of the result may depend on the value of test (see the examples). Sometimes it is better to use a construction such as
ifelse(test, yes, no) ~~ (tmp <- yes; tmp[!test] <- no[!test]; tmp)
Applying this :
dat$d3 <-
with(dat,{
test <- x+6>23
tmp <- d2; tmp[!test] <- (d1+6)[!test]; tmp
})
dat
x d1 d2 d3
1 4 2014-06-03 0201-07-30 2014-06-09
2 8 2014-06-07 0201-07-30 2014-06-13
Maybe you should modify this to handle missing values in test.
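Note I changed the variable names since yours are really long to type and a real source of errors.
dat <- data.frame( x=c(4,8),
d1=c('2014-06-03','2014-06-07'),
d2=c('201-07-30','201-07-30') )
# convert both columns to Date before doing date arithmetic
dat$d1 <- as.Date(dat$d1, format='%Y-%m-%d')
dat$d2 <- as.Date(dat$d2, format='%Y-%m-%d')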
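As for handling missing values in test, here is a minimal sketch of one way to extend the ?ifelse construction. The helper name date_ifelse and the sample values are made up for illustration; it only assumes that yes and no are Date vectors (or can be recycled to the length of test).

```r
# Hypothetical helper: like ifelse(), but keeps the class of `yes`
# and returns NA wherever `test` is NA.
date_ifelse <- function(test, yes, no) {
  out <- rep(yes, length.out = length(test))   # start from the "yes" values
  no  <- rep(no,  length.out = length(test))
  pick_no <- !is.na(test) & !test              # positions where test is FALSE
  out[pick_no] <- no[pick_no]
  out[is.na(test)] <- NA                       # propagate missing tests
  out
}

# Illustrative data (not the asker's full table)
x  <- c(4, 20, NA)
d1 <- as.Date(c("2014-06-03", "2014-06-07", "2014-06-10"))
d2 <- as.Date("2014-07-30")

res <- date_ifelse(x + 6 > 23, d2, d1 + 6)
res
```

Because the result is built by subsetting a Date vector rather than by filling a logical one, the Date class survives and no as.Date(..., origin=) round trip is needed.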
I have an R problem if you can help.
x <- data.frame("LocationCode" = c("ESC3","RIECAA6","SJHMAU","RIE104","SJH11","SJHAE","RIEAE1","WGH54","RIE205","GSBROB"), "HospitalNumber" = c("701190923R","2905451068","700547389X","AN11295201","1204541612","104010665","800565884R","620063158W","600029720K","1112391223"),"DisciplineName" = c("ESC Biochemistry", "RIE Haematology","SJH Biochemistry","RIE Biochemistry","SJH Biochemistry","WGH Biochemistry","ESC Biochemistry","WGH Biochemistry","SJH Biochemistry","RIE Haematology"))
From the data frame above, I want to add a new column (CRN) containing every "HospitalNumber" value that has 9 digits plus 1 letter at the end (e.g. 701190923R), and another column (TIT) containing the remaining values, which do not meet the first criterion.
You can do this in base using the code
# Identify cases which match 9 digits then one letter
CRMMatch <- grepl("^\\d{9}[[:alpha:]]$", as.character(x$HospitalNumber))
#Create columns from Hospital number among the matches or those that do not match
x$CRN[CRMMatch] <- as.character(x$HospitalNumber)[CRMMatch]
x$TIT[!CRMMatch] <- as.character(x$HospitalNumber)[!CRMMatch]
# clean up by removing the variable created of matches
rm(CRMMatch)
A dplyr version could be
library(dplyr)
x <-
x %>%
mutate(CRN = if_else(grepl("^\\d{9}[[:alpha:]]$", as.character(HospitalNumber)),as.character(HospitalNumber), NA_character_),
TIT = if_else(!grepl("^\\d{9}[[:alpha:]]$", as.character(HospitalNumber)),as.character(HospitalNumber), NA_character_))
You can detect what you need with the instruction
library(stringr)
str_which(x$HospitalNumber,"[:digit:][:alpha:]")
and you get:
> str_which(x$HospitalNumber,"[:digit:][:alpha:]")
[1] 1 3 7 8 9
Then you know what positions you need and what you don't
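Note that a "digit followed by a letter" pattern is looser than "exactly 9 digits, then one letter". A small base R sketch of the difference, using a made-up vector (the value "12345A" is hypothetical and not in the original data):

```r
# Illustrative values; "12345A" is invented to show the difference
hosp <- c("701190923R", "2905451068", "700547389X", "AN11295201", "12345A")

# Loose: any digit immediately followed by a letter, anywhere in the string
loose <- grep("[[:digit:]][[:alpha:]]", hosp)

# Strict: the whole string must be 9 digits followed by exactly one letter
strict <- grep("^[[:digit:]]{9}[[:alpha:]]$", hosp)

loose   # positions matched by the loose pattern
strict  # positions matched by the anchored pattern
```

On the data in the question both patterns happen to select the same rows, so the anchored version only matters if shorter or mixed identifiers can occur.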
Quite similar to Kerry Jackson's approach but using ifelse in base R. I have also converted your x$HospitalNumber from factor to character from the start, assuming that this is what you really want:
x[2] <- as.character( x[ , 2 ] )
x$CRN <- ifelse( grepl( "^\\d{9}[[:alpha:]]$", x$HospitalNumber) , x$HospitalNumber, "" )
x$TIT <- ifelse( x$CRN != "", "", x$HospitalNumber )
gives you
> x
   LocationCode HospitalNumber   DisciplineName        CRN        TIT
1          ESC3     701190923R ESC Biochemistry 701190923R
2       RIECAA6     2905451068  RIE Haematology            2905451068
3        SJHMAU     700547389X SJH Biochemistry 700547389X
4        RIE104     AN11295201 RIE Biochemistry            AN11295201
5         SJH11     1204541612 SJH Biochemistry            1204541612
6         SJHAE      104010665 WGH Biochemistry             104010665
7        RIEAE1     800565884R ESC Biochemistry 800565884R
8         WGH54     620063158W WGH Biochemistry 620063158W
9        RIE205     600029720K SJH Biochemistry 600029720K
10       GSBROB     1112391223  RIE Haematology            1112391223
I have the fmatch function working in a loop, but I was wondering if it's possible to apply it to the whole vector at once rather than looping through.
Here is the code running through the loop, which currently works.
library(readxl)
library(data.table)
library(plyr)
library(tidyr)
library(dplyr)
library(tibble)
library(fastmatch)
library(stringr)
library(magrittr)
library(RcppBDT)
##library(anytime)
## Load time zone data sheet
TZData <- read_excel("TZDataFile.xlsx")
TZData <- as.data.table(TZData)
TZRange <- TZData[,1]
TZRange <- as.data.frame(TZRange)
##Bring in test data
TD <- read_excel("Test dates.xlsx", col_types = c("text", "text"))
TD <- as.data.table(TD)
####Start Time Conversion Code####
## Define variables
Station <- TD[,1] ##Station
GMT <- TD[,2] ##Date/time stamp in GMT to be converted to local
z <- nrow(TD)+0
APLDateTime <- data.frame(RawLocal = double(), RawLocalDateTime = as.Date(character()))
for (i in 1:z) {
STA <- as.character(Station[i,1]) ## Get Station
APCode <- as.integer(fmatch(STA, TZRange[,1])) ## Match station on Time Zone Data sheet
When I try to just run
STA <- as.character(Station[,1]) ## Get Station
APCode <- as.integer(fmatch(STA, TZRange[,1])) ## Match station on Time Zone Data sheet
I get NA_integer_ for APCode.
Sample Data:
> STA
[1] "c(\"LHR\", \"PHL\", \"DFW\", \"PHX\", \"LAX\", \"BCN\")"
> head(TZRange,10)
Code
1 369
2 04G
3 06A
4 06C
5 06N
6 09J
7 0A9
8 0G6
9 0G7
10 0P2
1183 DFW
2748 LHR
3809 PHL
I am looking for a result like
APCode = c(2748, 3809, 1183, etc.)
Thanks for the help.
First I assume you want STA to be a character vector. As I do not have your full data I will use the one provided and convert it
STA<-"c(\"LHR\", \"PHL\", \"DFW\", \"PHX\", \"LAX\", \"BCN\")"%>%substring(2)%>%str_replace_all("[[:punct:]]","")
> STA
[1] "LHR PHL DFW PHX LAX BCN"
Let me put an extra value so it finds a match to TZRange:
STA=c(STA,"04G")
> STA
[1] "LHR PHL DFW PHX LAX BCN" "04G"
For TZRange I have kept the first 10 values you provided
> TZRange
code
1 369
2 04G
3 06A
4 06C
5 06N
6 09J
7 0A9
8 0G6
9 0G7
10 0P2
Now you can specify the index of TZRange where matches with STA are found
APCode <- na.omit(fmatch(STA,TZRange[,1]))[1]
> APCode
[1] 2
Hope this helps
I got it working! I had to set both STA and TZRange as.data.frame then run the code APCode <- fmatch(STA[,1], TZRange[,1]) and it worked perfectly. I did first try the na.omit Antonis suggested, but it did not give me the list of indexes like I was looking for. Thanks for the assistance.
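A minimal sketch of why the unlooped call returned NA, and of the vectorised lookup. It uses base match() so it runs without extra packages; fmatch() from fastmatch is a faster replacement for match() and takes the same first two arguments. The small tables below are stand-ins invented for illustration, not the real sheets.

```r
# Hypothetical stand-ins for the two sheets (values are illustrative)
TZRange <- data.frame(Code = c("369", "04G", "DFW", "LHR", "PHL"),
                      stringsAsFactors = FALSE)
Station <- data.frame(Station = c("LHR", "PHL", "DFW", "PHX"),
                      stringsAsFactors = FALSE)

# as.character() on a whole one-column table deparses the column
# into ONE string, which is what the STA output in the question shows
collapsed <- as.character(Station[, 1, drop = FALSE])
length(collapsed)

# Extracting the column itself keeps one element per station
STA <- Station[[1]]

# One call returns the row index for every station; unmatched codes give NA
APCode <- match(STA, TZRange$Code)
APCode
```

The key point is that the lookup needs a plain vector on both sides; a one-column table passed through as.character() collapses to a single string that matches nothing.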
I am unable to figure out how to use nested for loops in R for solving my problem. Here's a miniature version of what I'm trying to solve:
I have two files, test1 and test2 which look like this:
head(test1)
Date Settlement
2008-08-28 138.29
2008-08-29 135.34
2008-09-01 135.23
2008-09-02 123.36
2008-09-03 126.41
2008-09-04 128.68
2008-09-05 123.70
2008-09-08 124.60
2008-09-09 122.33
2008-09-10 120.85
2008-09-11 120.15
2008-09-12 121.17
2008-09-15 118.97
2008-09-16 114.90
2008-09-17 115.78
2008-09-18 115.60
2008-09-19 115.90
2008-09-22 120.49
2008-09-23 124.10
And here is test2:
test2
X1 X2 X3
2008-08-31 2008-09-05 2008-09-11
2008-09-05 2008-09-11 2008-09-14
2008-09-11 2008-09-14 2008-09-18
2008-09-14 2008-09-18 2009-09-22
The logic that I need to put in is:
1. Select Dates [1,1] and [1,2] from test2
2. Find all Settlement Prices between those 2 dates in test1
3. Get average of those prices, place it in [1,1] of a new dataframe.
4. Repeat by increasing columns, and then rows, in step 1.
The end-result of this would look like this:
X1 X2
128.42 122.87
122.87 120.66
120.66 116.55
116.55 115.75
So, the 1st value in X1 is an average of Settlement prices between 31-Aug-08 (including) and 5-Sep-08 (excluding), and the 1st value in X2 is an average of Settlement prices between 5-Sep-08 (including) and 11-Sep-08 (excluding), and so on for the rows below.
Here's my code that works (if I pass it fixed dates from test2 as given below):
temp1 <- test1 %>%
group_by(Date >= test2$X1[1] & Date < test2$X2[1]) %>%
summarise(AvgPrice2 = mean(Settlement, na.rm = T))
temp1 <- filter(temp1, temp1[,1]==TRUE)
However, no matter what I try (over the last 3 days!), I cannot figure out how to put this into a for loop. I even tried rollapply and sapply, but could not get anything to work. The code need not be time-efficient; I just need to automate this process.
I have been working with R for sometime, but clearly this is a problem for advanced users...Would deeply appreciate any help on this.
Many thanks in advance.
I would use an SQL-like approach through the sqldf package (which lets you apply SQL syntax to your data.frames):
ds = data.frame(Date = c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12", "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18", "2008-09-19", "2008-09-22", "2008-09-23"),
Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70, 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90, 115.78, 115.60, 115.90, 120.49, 124.10))
dr = data.frame(d1=c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14"),
d2=c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18"),
d3=c("2008-09-11", "2008-09-14", "2008-09-18", "2009-09-22"))
# add a variable which I will use to identify the rows
dr$g = 1:NROW(dr);
library(sqldf);
output = sqldf("SELECT dr.g, AVG(s1.Settlement) AS X1, AVG(s2.Settlement) AS X2
FROM dr
JOIN ds AS s1 ON dr.d1 <= s1.Date AND s1.Date < dr.d2
JOIN ds AS s2 ON dr.d2 <= s2.Date AND s2.Date < dr.d3
GROUP BY dr.g");
I found the suggested package in this post. In the same post another user suggested the data.table package, but I don't feel as confident with data.table syntax as with SQL :)
The documentation of sqldf and some usage examples can be found on the GitHub project page.
I'm not sure I got it, one of my results is different from the one in your wanted output. First, make sure the dates are of class Date.
test1$Date <- as.Date(test1$Date)
test2$X1 <- as.Date(test2$X1)
test2$X2 <- as.Date(test2$X2)
test2$X3 <- as.Date(test2$X3)
Now, for the computations you've described.
res1 <- numeric(nrow(test2))
res2 <- numeric(nrow(test2))
for(i in seq_len(nrow(test2))){
inx <- test2$X1[i] <= test1$Date & test1$Date < test2$X2[i]
res1[i] <- mean(test1$Settlement[inx])
inx <- test2$X2[i] <= test1$Date & test1$Date < test2$X3[i]
res2[i] <- mean(test1$Settlement[inx])
}
result <- data.frame(X1 = res1, X2 = res2)
result
X1 X2
1 128.42 122.8700
2 122.87 120.6600
3 120.66 116.5500
4 116.55 119.0225
The value that is different is the very last one, result$X2[4]. Your output is 115.75 and here it's 119.0225.
Your data
Ensuring dates are Dates
library(lubridate)
test1 = data.frame(Date = ymd(c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12", "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18", "2008-09-19", "2008-09-22", "2008-09-23")),
Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70, 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90, 115.78, 115.60, 115.90, 120.49, 124.10))
test2 = data.frame(d1=ymd(c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14")),
d2=ymd(c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18")),
d3=ymd(c("2008-09-11", "2008-09-14", "2008-09-18", "2009-09-22")))
tidyverse solution
library(tidyverse)
result <- map_df(1:nrow(test2), ~data.frame(X1=(filter(test1, Date >= test2$d1[.x] & Date < test2$d2[.x]) %>% summarise(m=mean(Settlement)))$m,
X2=(filter(test1, Date >= test2$d2[.x] & Date < test2$d3[.x]) %>% summarise(m=mean(Settlement)))$m))
Output
X1 X2
1 128.42 122.8700
2 122.87 120.6600
3 120.66 116.5500
4 116.55 119.0225
Thanks a lot for all the answers, I tried all of them, but none seemed to fit my needs given that the files above were a miniaturized version of actual files - so coding by column names / splitting data manually into rows didn't seem like a good option for me.
But I finally figured out what'll work nicely in this case:
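library(lubridate)
library(dplyr)  # for filter()
Makingrows <- function(test1, test2, j){
  m1 = nrow(test2)
  res <- numeric(m1)  # local result vector, one mean per row of test2
  for(i in seq_len(m1)){
    # window bounds come from column j and the column to its right
    d1 <- ymd(test2[i,j])
    d2 <- ymd(test2[i,j+1])
    X1 <- filter(test1, Date < d2 & Date >= d1)
    res[i] <- mean(X1$Settlement, na.rm = T)
  }
  return(res)
}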
mcol1 <- ncol(test2)-1
finalres <- lapply(1:mcol1, function(x) Makingrows(test1, test2, x))
finalres <- as.data.frame(finalres)
And yes, I was also getting the last value as 119.02...and I realized that by mistake I put the year as 2009 in the last cell in test2 file. Due to this, the code was picking up all the values till the end.
Thanks a lot everyone. I hope you'll agree with me as I mark this as the answer to my question.
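For comparison, here is a base R sketch of the same idea that does not depend on column names or on the number of date columns. The function name avg_between and the tiny data set are invented for illustration; it assumes the price table has Date and Settlement columns and that every column of the bounds table is already of class Date.

```r
# Hypothetical helper: for each row of `bounds` and each pair of adjacent
# columns (j, j + 1), average the prices with bounds[i, j] <= Date < bounds[i, j + 1].
avg_between <- function(prices, bounds) {
  n_win <- ncol(bounds) - 1
  out <- sapply(seq_len(n_win), function(j) {
    sapply(seq_len(nrow(bounds)), function(i) {
      in_window <- prices$Date >= bounds[[j]][i] & prices$Date < bounds[[j + 1]][i]
      mean(prices$Settlement[in_window])
    })
  })
  out <- as.data.frame(out)
  names(out) <- names(bounds)[seq_len(n_win)]
  out
}

# Small made-up data set (not the data from the question)
prices <- data.frame(
  Date = as.Date(c("2008-09-01", "2008-09-02", "2008-09-05",
                   "2008-09-08", "2008-09-10")),
  Settlement = c(10, 20, 30, 40, 50)
)
bounds <- data.frame(
  X1 = as.Date(c("2008-08-31", "2008-09-05")),
  X2 = as.Date(c("2008-09-05", "2008-09-09")),
  X3 = as.Date(c("2008-09-09", "2008-09-12"))
)

res <- avg_between(prices, bounds)
res
```

A window that contains no prices yields NaN from mean(), so an empty window is easy to spot in the result.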
Hi, this is a sample of a data.frame / list with two columns, X and Y. My problem is that when I call subset, it cuts off the decimal part. Can you help me figure out why?
(row.names | X | Y)
> var
...
9150 4246838.57 5785639.07
9152 4462019.15 5756344.11
9153 4671745.07 5791092.53
9154 4825699.93 5767058.37
9155 4935126.99 5839357.55
> typeof(var)
[1] "list"
> var = subset(var, Y>10980116 & X>3217133)
...
6569 15163607 11323070
6572 15102381 11079465
6573 16462260 11272569
6577 19028175 11095784
It's the same when I use:
> var = var[var$Y>10980116 & var$X>3217133,]
Thank you for your help.
This is not a subsetting issue, it's a formatting/presentation issue. You're in the first circle of Burns's R Inferno ("[i]f you are using R and you think you’re in hell, this is a map for you"):
another aspect of virtuous pagan beliefs—what is printed is all
that there is
If we just print this bit of the data frame exactly as entered, we "lose" digits.
> df <- read.table(text="
4246838.57 5785639.07
4462019.15 5756344.11
4671745.07 5791092.53
4825699.93 5767058.37
4935126.99 5839357.55",
header=FALSE)
> df
## V1 V2
## 1 4246839 5785639
## 2 4462019 5756344
## 3 4671745 5791093
## 4 4825700 5767058
## 5 4935127 5839358
Tell R you want to see more precision:
> print(df,digits=10)
## V1 V2
## 1 4246838.57 5785639.07
## 2 4462019.15 5756344.11
## 3 4671745.07 5791092.53
## 4 4825699.93 5767058.37
## 5 4935126.99 5839357.55
Or you can set options(digits=10) (the default is 7).
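To check that subsetting really leaves the stored values alone, one can compare the numbers themselves instead of their printed form. A minimal sketch with made-up values (chosen to have exactly representable fractions):

```r
# Illustrative data; the values are invented, not taken from the question
df <- data.frame(X = c(15163607.25, 15102381.5, 4246838.75),
                 Y = c(11323070.75, 11079465.5, 5785639.25))

sub <- subset(df, Y > 10980116 & X > 3217133)

# The default print may show no decimals at all for numbers this large...
print(sub)

# ...but the fractional parts are still stored
sub$X - floor(sub$X)

# Ask for more significant digits to see them
print(sub, digits = 12)
```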
I would like to know the number of unique dams which gave birth on each of the birth dates recorded. My data frame is similar to this one:
dam <- c("2A11","2A11","2A12","2A12","2A12","4D23","4D23","1X23")
bdate <- c("2009-10-01","2009-10-01","2009-10-01","2009-10-01",
"2009-10-01","2009-10-03","2009-10-03","2009-10-03")
mydf <- data.frame(dam,bdate)
mydf
# dam bdate
# 1 2A11 2009-10-01
# 2 2A11 2009-10-01
# 3 2A12 2009-10-01
# 4 2A12 2009-10-01
# 5 2A12 2009-10-01
# 6 4D23 2009-10-03
# 7 4D23 2009-10-03
# 8 1X23 2009-10-03
I used aggregate(dam ~ bdate, data=mydf, FUN=length) but it counts all the dams that gave birth on a particular date
bdate dam
1 2009-10-01 5
2 2009-10-03 3
Instead, I need to have something like this:
mydf2
bdate dam
1 2009-10-01 2
2 2009-10-03 2
Your help is very much appreciated!
What about:
aggregate(dam ~ bdate, data=mydf, FUN=function(x) length(unique(x)))
You could also run unique on the data first:
aggregate(dam ~ bdate, data=unique(mydf[c("dam","bdate")]), FUN=length)
Then you could also use table instead of aggregate, though the output is a little different.
> table(unique(mydf[c("dam","bdate")])$bdate)
2009-10-01 2009-10-03
         2          2
This is just an example of how to think of the problem and one of the approaches on how to solve it.
split.mydf <- with(mydf, split(x = dam, f = bdate)) #each list element holds the dams for one date.
# it's just a matter of counting unique dams
unique.mydf <- lapply(X = split.mydf, FUN = unique)
#and then count the number of unique elements
unilen.mydf <- lapply(unique.mydf, length)
#you can do these two last steps in one go like so
lapply(split.mydf, FUN = function(x) length(unique(x)))
as.data.frame(unlist(unilen.mydf)) #data.frame is just a special list, so this is water to your mill
unlist(unilen.mydf)
2009-10-01 2
2009-10-03 2
In dplyr you can use n_distinct :
library(tidyverse)
mydf %>%
group_by(bdate) %>%
summarize(dam = n_distinct(dam))