I'm exporting data from a medical record platform.
The data looks like this...
Date.time TEMP HR RR SBP DBP
1 Jun-08-2015
2 1323 36.8 O – – – –
3 931 36.8 O 76 MC 22 SP 104 MC 52 MC
4 930 – – – – –
5 929 – – – – –
6 813 36.8 O 76 MC 22 SP 104 MC 52 MC
7 126 36.3 O 78 MC 23 SP 112 MC 55 MC
8 40 36.3 O 78 MC 23 SP 112 MC 55 MC
9 Jun-07-2015
10 2307 36 O 71 MC 22 SP 120 MC 57 MC
I need the date and time in a single column, in the format yyyymmddhhmm.
The values 1323, 931, 930, 929, etc. are times (hhmm).
My expected output is...
Date.time TEMP HR RR SBP DBP
1 201506081323 36.8 O – – – –
2 201506080931 36.8 O 76 MC 22 SP 104 MC 52 MC
3 201506080930 – – – – –
4 201506080929 – – – – –
5 201506080813 36.8 O 76 MC 22 SP 104 MC 52 MC
6 201506080126 36.3 O 78 MC 23 SP 112 MC 55 MC
7 201506080040 36.3 O 78 MC 23 SP 112 MC 55 MC
8 201506072307 36 O 71 MC 22 SP 120 MC 57 MC
Separate Date.time into a date column and a time column, fill each date down over its time rows, then paste date and time back together and convert the result to a date-time (POSIXct) class.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(x1 = if_else(nchar(Date.time) > 4, Date.time, NA_character_),
x2 = if_else(nchar(Date.time) > 4, NA_character_, Date.time),
x2 = str_pad(x2, width = 4, side = "left", pad = "0")) %>%
fill(x1) %>%
filter(!is.na(x2)) %>%
mutate(Date.time.v1 = as.POSIXct(paste(x1, x2), format = "%b-%d-%Y %H%M")) %>%
select(-c(x1, x2))
# Date.time TEMP HR RR SBP DBP Date.time.v1
# 1 1323 36.8 O - - - - 2015-06-08 13:23:00
# 2 931 36.8 O 76 MC 22 SP 104 MC 52 MC 2015-06-08 09:31:00
# 3 930 - - - - - 2015-06-08 09:30:00
# 4 929 - - - - - 2015-06-08 09:29:00
# 5 813 36.8 O 76 MC 22 SP 104 MC 52 MC 2015-06-08 08:13:00
# 6 126 36.3 O 78 MC 23 SP 112 MC 55 MC 2015-06-08 01:26:00
# 7 40 36.3 O 78 MC 23 SP 112 MC 55 MC 2015-06-08 00:40:00
# 8 2307 36 O 71 MC 22 SP 120 MC 57 MC 2015-06-07 23:07:00
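The output above is a POSIXct column rather than the yyyymmddhhmm string the question asks for. A minimal base R sketch of building that string directly is shown here; the `raw` vector is a hypothetical stand-in for the first column of the export, and the sketch assumes the first entry is always a date header.

```r
# Hypothetical stand-in for the first column of the export
raw <- c("Jun-08-2015", "1323", "931", "40", "Jun-07-2015", "2307")

is_date <- nchar(raw) > 4            # date rows are longer than any hhmm value
date_id <- cumsum(is_date)           # index of the date header each row falls under
dates   <- raw[is_date][date_id]     # carry each date down to its time rows
d       <- dates[!is_date]           # keep only the rows that hold a time

# Build yyyymmdd by position, using month.abb so the result does not depend on the locale
ymd <- paste0(substr(d, 8, 11),
              sprintf("%02d", match(substr(d, 1, 3), month.abb)),
              substr(d, 5, 6))

# Left-pad the times to four digits, e.g. 40 -> 0040
hhmm <- sprintf("%04d", as.integer(raw[!is_date]))

Date.time <- paste0(ymd, hhmm)
Date.time
```

Keeping the result as a character string avoids any surprises with how large numbers are printed; convert with as.numeric() only if a numeric column is really needed.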
data
df1 <- read.table(text = "
Date.time TEMP HR RR SBP DBP
Jun-08-2015
1323 36.8 O - - - -
931 36.8 O 76 MC 22 SP 104 MC 52 MC
930 - - - - -
929 - - - - -
813 36.8 O 76 MC 22 SP 104 MC 52 MC
126 36.3 O 78 MC 23 SP 112 MC 55 MC
40 36.3 O 78 MC 23 SP 112 MC 55 MC
Jun-07-2015
2307 36 O 71 MC 22 SP 120 MC 57 MC
", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
This is what I came up with, but I still had to go back to the file in Excel to separate the dates from the times. That didn't take long (maybe 1 minute), and all the files I plan to work with are approximately the same length, so it's not a big deal.
After doing that I ended up with a file like this...
X Date.time TEMP HR RR SBP DBP
1 NA
2 Jun-08-2015 1323 36.8 O – – – –
3 Jun-08-2015 931 36.8 O 76 MC 22 SP 104 MC 52 MC
4 Jun-08-2015 930 – – – – –
5 Jun-08-2015 929 – – – – –
6 Jun-08-2015 813 36.8 O 76 MC 22 SP 104 MC 52 MC
7 Jun-08-2015 126 36.3 O 78 MC 23 SP 112 MC 55 MC
8 Jun-08-2015 40 36.3 O 78 MC 23 SP 112 MC 55 MC
9 NA
10 Jun-07-2015 2307 36 O 71 MC 22 SP 120 MC 57 MC
After that I used the following code. Sorry for all the comments; I need to make the code as easy to understand as possible so that everyone in my lab understands what's going on.
#eliminate empty rows
SJ <- na.omit(SJ)
#Convert month to number
SJ$newdate <- strptime(as.character(SJ$X), "%b-%d-%Y")
#Eliminate dashes from date
SJ$newdate <- gsub("[[:punct:]]","",SJ$newdate)
#Add column with "0000" for later use in proper date conversion
SJ$zeros <- rep("0000",nrow(SJ))
#Combine date column with zeros column to obtain date number of correct length
SJ$date = paste(SJ$newdate, SJ$zeros, sep="")
#Convert time column to number
SJ$Date.time <- as.numeric(SJ$Date.time)
#Convert date column to number
SJ$date <- as.numeric(SJ$date)
#Add time column to date column resulting in desired datetime format. Saves as vector.
Datetime <- SJ$date + SJ$Date.time
#Inserts Datetime column as first column
SJ <- cbind(Datetime,SJ)
The file now looks like this.
Datetime X Date.time TEMP HR RR SBP DBP newdate zeros date
2 201506081323 Jun-08-2015 1323 36.8 O – – – – 20150608 0000 201506080000
3 201506080931 Jun-08-2015 931 36.8 O 76 MC 22 SP 104 MC 52 MC 20150608 0000 201506080000
4 201506080930 Jun-08-2015 930 – – – – – 20150608 0000 201506080000
5 201506080929 Jun-08-2015 929 – – – – – 20150608 0000 201506080000
6 201506080813 Jun-08-2015 813 36.8 O 76 MC 22 SP 104 MC 52 MC 20150608 0000 201506080000
7 201506080126 Jun-08-2015 126 36.3 O 78 MC 23 SP 112 MC 55 MC 20150608 0000 201506080000
8 201506080040 Jun-08-2015 40 36.3 O 78 MC 23 SP 112 MC 55 MC 20150608 0000 201506080000
10 201506072307 Jun-07-2015 2307 36 O 71 MC 22 SP 120 MC 57 MC 20150607 0000 201506070000
Finally, I simply deleted the unnecessary columns: X, Date.time, newdate, zeros, and date.
Thank you all for your help!
Related
My code works 95% correctly, but I am not sure why my map has empty white spaces for certain states. For example, Washington has a count of 152 but appears as NULL. Why?
txt <- "AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT
34 103 78 241 789 200 18 18 13 355 210 26 36 48 119 106 57 98 104 32 81 26 92 62 136 65 34 164 10 30 16 70 107 100 109 151 150 97 113 3 90 15 158 479 68 95 7
WA WI WV WY
152 96 48 14 "
dat <- stack(read.table(text = txt, header = TRUE, fill = TRUE))
names(dat)[2] <-'state.abb'
dat$states <- tolower(state.name[match(dat$state.abb, state.abb)])
mapUSA <- map('state', fill = TRUE, plot = FALSE)
nms <- sapply(strsplit(mapUSA$names, ':'), function(x)x[1])
USApolygons <- map2SpatialPolygons(mapUSA, IDs = nms, CRS('+proj=longlat'))
idx <- match(unique(nms), dat$states)
dat2 <- data.frame(value = dat$value[idx], state = unique(nms))
row.names(dat2) <- unique(nms)
USAsp <- SpatialPolygonsDataFrame(USApolygons, data = dat2)
spplot(USAsp['value'], main = "Armed Males with an Attack Threat Level", sub = "Count Per State", col="transparent")
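One thing worth checking, assuming txt really is wrapped as shown: with header = TRUE, read.table() takes only the first line as column names, so the "WA WI WV WY" line would be read as a data row and those states would never become columns. A minimal sketch of parsing a wrapped table by pairing each line of names with the line of values that follows it (the shortened txt below is a made-up stand-in):

```r
# Shortened, made-up stand-in for the wrapped table in the question
txt <- "AK AL AR AZ
34 103 78 241
WA WI WV WY
152 96 48 14 "

# Split into non-empty lines, then into whitespace-separated fields
lines  <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
fields <- strsplit(lines[nzchar(lines)], "\\s+")

# Odd lines hold state abbreviations, even lines hold the matching counts
odd <- seq(1, length(fields), by = 2)
dat <- data.frame(
  value     = as.numeric(unlist(fields[odd + 1])),
  state.abb = unlist(fields[odd])
)

dat$value[dat$state.abb == "WA"]
```

The resulting dat has the same two columns (value, state.abb) that the stack() call is meant to produce, so the rest of the code could be tried unchanged.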
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only rows where AB is greater than or equal to 100 (use AB > 100 if the threshold should be strict, as in the question), and the second keeps the row or rows with the maximum avg. If you only want the maximum value itself, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))
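For comparison, the same idea can be sketched in base R. The data frame below is made up for illustration (three rows borrow AB and average from the question; the last two are invented to exercise the AB threshold and the NaN case), and it uses the question's column names AB and average.

```r
# Made-up data: the last two rows are invented test cases
players <- data.frame(
  playerID = c("rolliji01", "wilsowi02", "suzukic01", "lowab01", "nanavg01"),
  AB       = c(716, 705, 704, 40, 150),
  average  = c(0.2960894, 0.3262411, 0.3721591, 0.5000000, NaN)
)

# Keep only players with more than 100 at-bats
eligible <- players[players$AB > 100, ]

# Maximum average, ignoring NaN/NA
best_average <- max(eligible$average, na.rm = TRUE)

# Whole row for that maximum; which.max() skips NaN/NA on its own
best_row <- eligible[which.max(eligible$average), ]
```

Note that which.max() returns only the first row in case of ties, whereas the filter() version above keeps all tied rows.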
I have this data:
Year W L PTS GF GA S SA
1 2006 49 25 106 253 224 2380 2662
2 2007 51 23 110 266 207 2261 2553
3 2008 41 32 91 227 224 2425 2433
4 2009 40 34 88 207 228 2375 2398
5 2010 47 29 100 217 221 2508 2389
6 2011 44 27 99 213 190 2362 2506
7 2012 48 26 104 232 205 2261 2517
8 2014 38 32 88 214 233 2382 2365
9 2015 47 25 104 226 202 2614 2304
10 2016 41 27 96 224 213 2507 2231
11 2017 41 29 94 238 220 2557 2458
12 2018 53 18 117 261 204 2641 2650
I've built a VAR model from this data (it's hockey data for one team for the listed years). I converted the above into a time series using the ts() function, and created this model:
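library(vars)      # VARselect(), VAR(), serial.test()
library(forecast)  # forecast(), autoplot()
library(ggplot2)   # scale_x_continuous()
VARselect(NSH_ts[, 3:5], lag.max = 8)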
var1 <- VAR(NSH_ts[, 3:5], p = 2, type = "both", ic = c("AIC"))
serial.test(var1, type = "PT.adjusted")
forecast.var1 <- forecast(var1, h = 2)
autoplot(forecast.var1) +
scale_x_continuous(breaks = seq(2006, 2022))
I want to use the serial.test() function, but I get this error:
Error in t(Ci) %*% C0inv : non-conformable arguments
Why won't serial.test() work? (Overall I'm trying to forecast PTS for the next two years, based on the variables in the set.)
I've been using this as a guide: https://otexts.org/fpp2/VAR.html
I get a different error, which may come from VARselect(): its output table is mostly -Inf entries, with one NaN and the rest 0. Lowering lag.max gave me real numbers, and I had to adjust the other values (p and lags.pt) as well.
VARselect(dfVAR[, 3:5], lag.max = 2)
var1 <- VAR(dfVAR[, 3:5], p = 1, type = "both", ic = c("AIC"))
serial.test(var1, lags.pt = 4, type = "PT.adjusted")
Portmanteau Test (adjusted)
data: Residuals of VAR object var1
Chi-squared = 35.117, df = 27, p-value = 0.1359
The non-conformable error means a matrix multiplication failed: the number of columns in the first matrix has to match the number of rows in the second. Having no knowledge of VAR models, I can't offer help beyond this.
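A small illustration of that rule, using two made-up matrices rather than anything from the VAR model:

```r
A <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3
B <- matrix(1:6, nrow = 3, ncol = 2)   # 3 x 2

# Works: A has 3 columns and B has 3 rows, giving a 2 x 2 result
dim(A %*% B)

# Fails: A has 3 columns but A has only 2 rows
msg <- tryCatch(
  {
    A %*% A
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

The captured message is the same "non-conformable arguments" text that appears in the serial.test() error.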
When I run the command:
H <-length(table(data$Team))
n.h <- rep(5,H)
strata(data, stratanames=data$Team,size=n.h,method="srswor")
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
How can I get this stratified sample? The variable 'Team' is of type factor.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(levels(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a character vector of column names, not the column data itself.
Update:
Now that example data is available, I see another problem: you are sampling without replacement ("wor"), but your requested samples are bigger than the available data. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
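If sampling without replacement is preferred, one option is to cap each stratum's requested size at the number of rows it actually has. A minimal base R sketch, where the `team` vector is a stand-in with the group sizes from the example data rather than the real data frame:

```r
# Stand-in for data$Team with the group sizes from the example data
team <- factor(c(rep("ANA", 2), rep("ARI", 7), rep("ATL", 4), "BAL"))

available <- as.vector(table(team))   # rows available per stratum
n.h <- pmin(5, available)             # never ask for more than a stratum holds
n.h
```

The resulting n.h could then be passed as size with method = "srswor"; check the strata() documentation for how it expects the sizes to be ordered relative to the data.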
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
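The behaviour described above (return every row when a group is smaller than the requested size) can be sketched in a few lines of base R. This is not the implementation of stratified; `sample_per_group` is a hypothetical helper that only covers the single-integer size case, without replacement.

```r
# Hypothetical helper: sample up to n rows from each level of one grouping column
sample_per_group <- function(df, group, n) {
  # Row indices for each group; drop = TRUE skips empty factor levels
  idx_by_group <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
  picked <- lapply(idx_by_group, function(idx) {
    # Small groups are returned whole; larger ones are sampled without replacement
    if (length(idx) <= n) idx else idx[sample.int(length(idx), n)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Made-up data with the same group sizes as the example
toy <- data.frame(
  Team = c(rep("ANA", 2), rep("ARI", 7), rep("ATL", 4), "BAL"),
  id   = 1:14
)
out <- sample_per_group(toy, "Team", 5)
table(out$Team)
```

Indexing with sample.int() rather than calling sample(idx, n) directly avoids the surprise where sample() treats a single number as a range to draw from.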
I'm currently learning R with the help of videos on Coursera. I'm trying to exclude all hospitals in states that have fewer than 20 hospitals, but I couldn't find the right approach with my limited knowledge of R (I've programmed a lot in C, so the logic I tried to implement in R is also C-like).
The code I used is:
test <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
test[, 11] <- as.numeric(test[, 11])
test2 <- table(test$State)
From the table test2 I can get the value of a particular entry with test2[[2]], but I couldn't work out how to use conditional logic to get the states with fewer than 20 hospitals (once I have the state names, I can use subset() to solve the actual problem). I also looked at dimnames() but couldn't see how it would help. So my question is: in R, how can I compare the values in a table against a threshold?
The values stored in test2 are:
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37
MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX
134 133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370
UT VA VI VT WA WI WV WY ##State Name
42 87 2 15 88 125 54 29 ##Count of Hospital
As Arun also noted in his comment, you can use names(test2[test2 >= 20]) to get the states with 20 or more hospitals. Here is a nice explanation of why you should avoid subset.
Or you can transform your table to a data.frame and use subset:
dat <- as.data.frame(test2)
subset(dat, Freq < 20)
nn Freq
1 AK 17
8 DC 8
9 DE 6
12 GU 1
13 HI 19
42 RI 12
49 VI 2
50 VT 15
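The logical indexing suggested above can be tried on a small made-up example. The `state` vector below is a stand-in for the State column, with counts for five states copied from the table in the question.

```r
# Made-up State column; the counts match five entries of test2 in the question
state <- rep(c("AK", "AL", "DC", "HI", "TX"), times = c(17, 98, 8, 19, 370))
counts <- table(state)

small_states <- names(counts[counts < 20])    # states to exclude
large_states <- names(counts[counts >= 20])   # states to keep

# Keep only entries from states with at least 20 hospitals
kept <- state[state %in% large_states]
table(kept)
```

With a data frame, the same %in% condition can be used as a row index, for example test[test$State %in% large_states, ].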