Subsetting tidy data from a vector - r

I'm using R to analyse data about antibiotic use from a number of hospitals.
I've imported this data into a frame, according to the tidy data principles.
>head(data)
date antibiotic usage hospital
1 2006-01-01 amikacin 0.000000 hospital1
2 2006-02-01 amikacin 0.000000 hospital1
3 2006-03-01 amikacin 0.000000 hospital1
4 2006-04-01 amikacin 0.000000 hospital1
5 2006-05-01 amikacin 0.937119 hospital1
6 2006-06-01 amikacin 1.002961 hospital1
(the data set is monthly data x 5 hospitals x 40 antibiotics)
The first thing I would like to do is aggregate the antibiotics into classes.
> head(distinct(select(data, antibiotic)))
antibiotic
1 amikacin
2 amoxicillin-clavulanate
3 amoxycillin
4 ampicillin
5 azithromycin
6 benzylpenicillin
7 cefalotin
8 cefazolin
> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> ceph1 <- c("cefalotin", "cefazolin")
What I would like to do is then subset the data based on these antibiotic class vectors:
filter(data, antibiotic =(any one of the values in the vector "penicillins")
Thanks to thelatemail for pointing out the way to do this is:
d <- filter(data, antibiotic %in% penicillins)
I would then like to analyse the data in a number of ways.
The key analysis (and ggplot output) is:
x = date
y = usage of antibiotic(s) stratified by (drug | class), filtered by hospital
What I'm not clear on now is how to aggregate the data for this sort of thing.
Example:
I want to analyse the use of class "ceph1" across all the hospitals in the district, resulting in (apologies, I know this is not proper code):
x y
Jan-2006 for all in hospitals(usage of cefazolin + usage of cefalotin)
Feb-2006 for all in hospitals(usage of cefazolin + usage of cefalotin)
etc
And, in the long-run, to be able to pass arguments to a function which will let me select which hospitals and which antibiotic or class of antibiotics.
Thanks again - I know this is an order of magnitude more complicated than the original question!
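One way to get "drug or class" stratification is to add a class column from a lookup and then sum per date and class. The following is a minimal base R sketch under assumptions: the toy data frame and its usage values are made up, and the names classes and lookup are hypothetical, not from the original post.

```r
# Toy data in the same long layout as the question (usage values are made up).
toy <- data.frame(
  date = as.Date(rep(c("2006-01-01", "2006-02-01"), each = 3)),
  antibiotic = rep(c("cefalotin", "cefazolin", "ampicillin"), 2),
  usage = c(1, 2, 4, 8, 16, 32),
  stringsAsFactors = FALSE
)

# Class definitions, as in the question.
classes <- list(
  penicillins = c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin"),
  ceph1 = c("cefalotin", "cefazolin")
)

# Named lookup vector: drug name -> class name.
lookup <- setNames(rep(names(classes), lengths(classes)), unlist(classes))

# as.character() matters if antibiotic is a factor; unmatched drugs become NA.
toy$class <- unname(lookup[as.character(toy$antibiotic)])

# Total usage per month and class (rows with an NA class are dropped).
agg <- aggregate(usage ~ date + class, data = toy, FUN = sum)
agg
```

With a class column in place, the same data frame can be grouped either by antibiotic or by class without maintaining separate subsets.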

So after lots of trial and error and heaps of reading, I've managed to sort it out.
>str(data)
'data.frame': 23360 obs. of 4 variables:
$ date : Date, format: "2007-09-01" "2012-06-01" ...
$ antibiotic: Factor w/ 41 levels "amikacin","amoxicillin-clavulanate",..: 17 3 19 30 38 20 20 20 7 25 ...
$ usage : num 21.368 36.458 7.226 3.671 0.917 ...
$ hospital : Factor w/ 5 levels "hospital1","hospital2",..: 1 3 2 1 4 1 4 3 5 1 ...
So I can subset the data first:
>library(dplyr)
>penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
>d <- filter(data, antibiotic %in% penicillins)
And then make the summary using more of dplyr (thanks, Hadley!)
>d1 <- summarise(group_by(d, date), total = sum(usage))
>d1
Source: local data frame [122 x 2]
date total
(date) (dbl)
1 2006-01-01 1669.177
2 2006-02-01 1901.749
3 2006-03-01 2311.008
4 2006-04-01 1921.436
5 2006-05-01 1594.781
6 2006-06-01 2150.997
7 2006-07-01 2052.517
8 2006-08-01 2132.501
9 2006-09-01 1959.916
10 2006-10-01 1751.667
.. ... ...
>
> qplot(date, total, data = d1) + geom_smooth()
> [scatterplot as desired!]
Next step will be to try and build it all into a function and/or to try and do the subsetting in-line, building on what I've worked out here.
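A possible shape for that function, as a sketch only: usage_by_month and its arguments are hypothetical names, and the toy data stands in for the real data set. It wraps the filter / group_by / summarise steps shown above and lets the caller choose drugs and hospitals.

```r
library(dplyr)

# Toy data in the same layout as the question; usage values are made up.
toy <- expand.grid(
  date = as.Date(c("2006-01-01", "2006-02-01")),
  antibiotic = c("cefalotin", "cefazolin", "amikacin"),
  hospital = c("hospital1", "hospital2"),
  stringsAsFactors = FALSE
)
toy$usage <- seq_len(nrow(toy))

# Hypothetical helper: total usage per month for the chosen drugs and hospitals.
# By default every hospital in the data is included.
usage_by_month <- function(data, drugs, hospitals = unique(data$hospital)) {
  data %>%
    filter(antibiotic %in% drugs, hospital %in% hospitals) %>%
    group_by(date) %>%
    summarise(total = sum(usage))
}

ceph1 <- c("cefalotin", "cefazolin")
ceph1_all <- usage_by_month(toy, ceph1)                           # all hospitals
ceph1_h1 <- usage_by_month(toy, ceph1, hospitals = "hospital1")   # one hospital
```

The result has one row per date, so it can be passed straight to the plotting call used above.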

Related

vcdExtra::datasets not working on some Packages

R3.6.1, vcdExtra 0.7.1
vcdExtra::datasets("caret")
Error in get(x) : object 'GermanCredit' not found
vcdExtra::datasets fails on some packages like "caret".
Am I missing something?
thanks
If you only require the dataset of German Credit, try this code:
library(caret)
data("GermanCredit")
GermanCredit
And you will get:
Duration Amount InstallmentRatePercentage ResidenceDuration Age NumberExistingCredits NumberPeopleMaintenance Telephone
1 6 1169 4 4 67 2 1 0
2 48 5951 2 2 22 1 1 1
3 12 2096 2 3 49 1 2 1
4 42 7882 2 4 45 1 2 1
5 24 4870 3 4 53 2 2 1
Please, comment if it is what you need.
Regards,
Alexis
This is the sequence of commands that I need to run for a correct functioning of vcdExtra::datasets("caret")
library(evtree)
library(caret)
data(Sacramento)
data(tecator)
data(BloodBrain)
data(cox2)
data(dhfr)
data(oil)
data(mdrr)
data(pottery)
data(scat)
data(segmentationData)
vcdExtra::datasets("caret")
The output is
Item class dim Title
1 GermanCredit data.frame 1000x21 German Credit Data
2 Sacramento data.frame 932x9 Sacramento CA Home Prices
3 absorp matrix 215x100 Fat, Water and Protein Content of Meat Samples
4 bbbDescr data.frame 208x134 Blood Brain Barrier Data
5 cars data.frame 50x2 Kelly Blue Book resale data for 2005 model year GM cars
6 cox2Class factor 462 COX-2 Activity Data
7 cox2Descr data.frame 462x255 COX-2 Activity Data
8 cox2IC50 numeric 462 COX-2 Activity Data
9 dhfr data.frame 325x229 Dihydrofolate Reductase Inhibitors Data
10 endpoints matrix 215x3 Fat, Water and Protein Content of Meat Samples
11 fattyAcids data.frame 96x7 Fatty acid composition of commercial oils
12 logBBB numeric 208 Blood Brain Barrier Data
13 mdrrClass factor 528 Multidrug Resistance Reversal (MDRR) Agent Data
14 mdrrDescr data.frame 528x342 Multidrug Resistance Reversal (MDRR) Agent Data
15 oilType factor 96 Fatty acid composition of commercial oils
16 potteryClass factor 58 Pottery from Pre-Classical Sites in Italy
17 scat data.frame 110x19 Morphometric Data on Scat
18 scat_orig data.frame 122x20 Morphometric Data on Scat
19 segmentationData data.frame 2019x61 Cell Body Segmentation
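If the goal is only to see which datasets a package ships, base R's data(package = ) reads the package's data index without fetching the objects themselves, so it does not depend on what is already loaded. A small sketch follows; it uses the built-in "datasets" package so that it runs on a stock install, and it is an assumption that substituting "caret" behaves the same on your setup. Unlike vcdExtra::datasets, it does not report class or dimensions.

```r
# List the datasets a package ships, straight from its index (nothing is loaded).
# "datasets" is used here so the sketch runs anywhere; swap in "caret" to try it.
res <- utils::data(package = "datasets")$results

# Keep just the dataset names and their titles.
items <- res[, c("Item", "Title")]
head(items)
```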

Merge one dataframe with a date vector

I would like to create a dataframe by merging the dataframe ss with a vector daily_vector, which holds date information, through the column "ss$Date_R". I would like to keep all rows from daily_vector so that I can see which dates have no data in the dataframe ss. I have tried the function merge; however, when I do, the vector appears as a list of numbers rather than as dates.
The column "ss$Date_R" is a character column because I concatenated the year, month and day information.
head(ss)
Station Variable Value Date_R
1 SAN VICENTE DEL PALACIO TMAX1 90 1985-01-01
910 SAN VICENTE DEL PALACIO TMAX2 90 1985-01-02
1819 SAN VICENTE DEL PALACIO TMAX3 110 1985-01-03
2728 SAN VICENTE DEL PALACIO TMAX4 85 1985-01-04
3637 SAN VICENTE DEL PALACIO TMAX5 110 1985-01-05
4546 SAN VICENTE DEL PALACIO TMAX6 100 1985-01-06
str(ss)
'data.frame': 9418 obs. of 4 variables:
$ Station : Factor w/ 3 levels "MEDINA DE RIOSECO",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Variable: Factor w/ 31 levels "TMAX1","TMAX2",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Value : int 90 90 110 85 110 100 80 30 80 70 ...
$ Date_R : chr "1985-01-01" "1985-01-02" "1985-01-03" "1985-01-04" ...
daily_vector <-as.factor(seq(as.Date("1985-01-01"), as.Date("2010-10-14"), by="days"))
Does someone know how I can merge these two kinds of information?
Do you know a better way to know which day are absent in the dataframe ss?
Thanks in advance
If you just want to check which dates in daily_vector are not in ss$Date_R, you don't need to add a new column. Instead, you can use
ss$Date_R <- as.Date(ss$Date_R)
daily_vector <- seq(as.Date("1985-01-01"), as.Date("2010-10-14"), by="days")
missing <- !daily_vector %in% ss$Date_R
daily_vector[missing]
This will return the dates missing in ss$Date_R as a simple vector of dates.
Edit: To add the rows of missing dates to your dataframe, you can use merge as follows:
daily_ex <- daily_vector[1:6] # 6 total dates
ss <- data.frame(V1=rnorm(5), V2=rnorm(5),
Date_R=c(daily_vector[c(1:4, 6)])) # 5 total rows, skipped date #5 on purpose
Date_R_all <- data.frame(Date_R = daily_ex)
merge(ss, Date_R_all, by="Date_R", all=TRUE)
The result is (V1 and V2 are random draws, so your values will differ)
Date_R V1 V2
1 1985-01-01 -0.2152378 -1.1546424
2 1985-01-02 0.7188043 -0.3882131
3 1985-01-03 0.9581949 1.2717832
4 1985-01-04 -0.6559881 -0.6670120
5 1985-01-05 NA NA
6 1985-01-06 -0.6285255 -1.2645569
I think the merge approach is fine, but first: (a) you need to set the class of your Date_R column to "Date"; (b) daily_vector must be a data.frame (see ?merge for further information). Try the following:
ss$Date_R <- as.Date(ss$Date_R)
daily <- data.frame(Date_R = seq(as.Date("1985-01-01"), as.Date("2010-10-14"), by="days"))
merge(ss, daily, by="Date_R", all=TRUE)
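Putting the pieces together, here is a small end-to-end sketch on toy data (the three-row ss and the five-day calendar are made up for illustration). It keeps every calendar day and then reads the absent dates off the NA rows; it assumes Value has no missing values of its own, otherwise those rows would be flagged too.

```r
# Toy stand-in for ss: character dates with 1985-01-03 and 1985-01-05 absent.
ss <- data.frame(Value = c(90, 90, 85),
                 Date_R = c("1985-01-01", "1985-01-02", "1985-01-04"),
                 stringsAsFactors = FALSE)
ss$Date_R <- as.Date(ss$Date_R)

# Full calendar as a one-column data frame of class Date (not a factor).
daily <- data.frame(Date_R = seq(as.Date("1985-01-01"),
                                 as.Date("1985-01-05"), by = "day"))

# all.y = TRUE keeps every calendar day; days without data get NA in Value.
full <- merge(ss, daily, by = "Date_R", all.y = TRUE)

# Dates that had no matching row in ss.
absent <- full$Date_R[is.na(full$Value)]
absent
```

Because both sides are of class Date, the merged column prints as dates rather than as the integer codes you get when one side is a factor.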

R + ggplot: Order irregular Time Strings for Plot

I have a data frame with two columns. The first is a numerical value, the other is a string describing a time. The time format looks like yyyy-mm-dd--hh-mm-ss-?????? (e.g. 2015-03-04--12-11-35-669696), I don't know what the last 6 digits mean. E.g.
y time
1 4.548 2014-08-11--09-07-44-202586
2 4.548 2014-08-11--09-07-54-442586
3 4.548 2014-08-11--09-08-04-522586
4 4.478 2014-08-11--09-08-14-762586
5 4.431 2014-08-11--09-08-24-522586
6 4.446 2014-08-11--09-08-34-922586
7 4.492 2014-08-11--09-08-44-522586
8 4.508 2014-08-11--09-08-54-442586
9 4.486 2014-08-11--09-09-04-202586
10 4.497 2014-08-11--09-09-14-442586
11 4.461 2014-08-11--09-09-24-202586
I want to plot them with
ggplot(df, aes(x=time, y=y)) + geom_line()
But I have the problem, that ggplot doesn't know how to deal with data of class character and in particular with my given time format.
I tried to use AsciiToInt from the package {sfsmisc} to convert the strings to numerical values, but it returns a list of integers for each string (one number for each character, of course).
I can also sort my time strings with mixedsort from the package {gtools}, but I don't know how to apply it to the plot (also keeping in mind the distance between points).
Another problem is that I don't want every time string to appear as a tick on the x-axis, because I have around 20k rows. Maybe I can solve that problem like in this question, but I cannot check that as long as the first problem occurs.
Can you help me plot such data with the time as a numeric-like value on the x-axis?
I loaded your data from a .csv file called Timedat.csv. First I convert the time column to POSIXct. To make a cleaner graph for test purposes I omit the seconds field; if you want to include them, use the commented-out line instead.
library(ggplot2)
timedat<-read.csv("~/Work/Timedat.csv")
timedat
> str(timedat)
'data.frame': 11 obs. of 3 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ y : num 4.55 4.55 4.55 4.48 4.43 ...
$ time: Factor w/ 11 levels "2014-08-11--09-07-44-202586",..: 1 2 3 4 5 6 7 8 9 10 ...
#timedat$time<-as.POSIXct(as.character(timedat$time),format = "%Y-%m-%d--%H-%M-%S")
timedat$time<-as.POSIXct(as.character(timedat$time),format = "%Y-%m-%d--%H-%M")
qplot(data=timedat,y=y,x=time)+theme_bw()
> timedat
X y time
1 1 4.548 2014-08-11--09-07-44-202586
2 2 4.548 2014-08-11--09-07-54-442586
3 3 4.548 2014-08-11--09-08-04-522586
4 4 4.478 2014-08-11--09-08-14-762586
5 5 4.431 2014-08-11--09-08-24-522586
6 6 4.446 2014-08-11--09-08-34-922586
7 7 4.492 2014-08-11--09-08-44-522586
8 8 4.508 2014-08-11--09-08-54-442586
9 9 4.486 2014-08-11--09-09-04-202586
10 10 4.497 2014-08-11--09-09-14-442586
11 11 4.461 2014-08-11--09-09-24-202586
This produces the following plot with the dates nicely ordered.
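If the sub-second part matters, one option is to keep it when parsing. The sketch below rests on an assumption the question leaves open: that the last six digits are microseconds. The time zone "UTC" is also an arbitrary choice for the example. It rewrites the final "-" as a decimal point so that %OS can read fractional seconds.

```r
# Two of the time strings from the question, deliberately out of order.
x <- c("2014-08-11--09-07-54-442586", "2014-08-11--09-07-44-202586")

# Assumption: the trailing six digits are microseconds. Turn the last "-"
# into "." so the seconds field becomes e.g. "54.442586".
x_frac <- sub("-([0-9]+)$", ".\\1", x)

# %OS reads seconds including the fractional part on input.
times <- as.POSIXct(x_frac, format = "%Y-%m-%d--%H-%M-%OS", tz = "UTC")

# Date-times order and subtract like numbers.
ord <- order(times)
gap <- as.numeric(difftime(times[1], times[2], units = "secs"))
```

A POSIXct column on the x-axis is treated as continuous by ggplot2, so tick marks are chosen automatically instead of one per row.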

Histogram of Weekdays by Year R

I have a .csv file that I have loaded into R using the following basic command:
lace <- read.csv("lace for R.csv")
It pulls in my data just fine. Here is the str of the data:
str(lace)
'data.frame': 2054 obs. of 20 variables:
$ Admission.Day : Factor w/ 872 levels "1/1/2013","1/10/2011",..: 231 238 238 50 59 64 64 64 67 67 ...
$ Year : int 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 12 12 12 1 1 1 1 1 1 1 ...
$ Day : int 28 30 30 3 4 6 6 6 7 7 ...
$ DayOfWeekNumber : int 3 5 5 2 3 5 5 5 6 6 ...
$ Day.of.Week : Factor w/ 7 levels "Friday","Monday",..: 6 5 5 2 6 5 5 5 1 1 ...
What I am trying to do is create three (3) different histograms and then plot them all together on one. I want to create a histogram for each year, where the x axis or labels will be the days of the week starting with Sunday and ending on Saturday.
Firstly how would I go about creating a histogram out of Factors, which the days of the week are in?
Secondly how do I create a histogram for the days of the week for a given year?
I have tried using the following post here but cannot get it working. I use the Admission.Day as the variable and get an error message:
dat <- as.Date(lace$Admission.Day)
Error in charToDate(x) : character string is not in a standard unambiguous format
Thank you,
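Since the data already has separate Year, Month and Day columns, one way to sidestep the ambiguous Admission.Day strings is to build the date from those columns, and to give Day.of.Week an explicit level order so Sunday comes first. This is a sketch on four made-up rows that mirror the str output above; day_levels is a hypothetical name.

```r
# Toy rows mirroring the columns in the question.
lace <- data.frame(Year = c(2010, 2010, 2011, 2011),
                   Month = c(12, 12, 1, 1),
                   Day = c(28, 30, 3, 4),
                   Day.of.Week = c("Tuesday", "Thursday", "Monday", "Tuesday"),
                   stringsAsFactors = FALSE)

# Build an unambiguous Date from the numeric parts.
lace$date <- as.Date(ISOdate(lace$Year, lace$Month, lace$Day))

# Factor levels control the order of bars and table columns.
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
lace$Day.of.Week <- factor(lace$Day.of.Week, levels = day_levels)

# Counts of admissions per year and weekday; one row per year.
counts <- table(lace$Year, lace$Day.of.Week)
counts
```

Each row of counts can be passed to barplot(), or the data frame can be plotted with one facet per year.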
Expanding on the comment above: the problem seems to be with importing dates, rather than making the histogram. Assuming there is an Excel workbook "lace for R.xlsx", with a sheet "lace":
## Not tested...
library(XLConnect)
myData <- "lace for R.xlsx" # NOTE: need path also...
wb <- loadWorkbook(myData)
lace <- readWorksheet(wb, sheet="lace")
lace$Admission.Day <- as.Date(lace$Admission.Day)
should provide dates that work with all the R date functions. Also, the lubridate package provides a number of functions that are more intuitive to use than format(...).
Then, as an example:
library(lubridate) # for year(...) and wday(...)
library(ggplot2)
# random dates around Jun 1, across 5 years...
set.seed(123)
lace <- data.frame(date=as.Date(rnorm(1000,sd=50)+365*(0:4),origin="2008/6/1"))
lace$year <- factor(year(lace$date))
lace$dow <- wday(lace$date, label=TRUE)
# This creates the histograms...
ggplot(lace) +
geom_bar(aes(x=dow, fill=year)) + # counts per weekday, fill color by year (geom_histogram needs a continuous x in current ggplot2)
facet_grid(~year) + # facet by year
theme(axis.text.x=element_text(angle=90)) # to rotate weekday names...
Produces this:

Date handling and splitting

I have a set of data (in csv format) that looks something like:
Date Auto_Index Realty_Index
29-Dec-02 1742.2 1000
2-Jan-03 1748.85 1009.67
3-Jan-03 1758.66 1041.45
4-Jan-03 1802.9 1062.11
5-Jan-03 1797.45 1047.56
...
...
...
26-Nov-12 1665.5 248.75
27-Nov-12 1676.3 257.6
29-Nov-12 1696.7 266.9
30-Nov-12 1682.8 266.55
3-Dec-12 1702.6 270.4
I want to analyse this data over different periods in R. Is there a way I can break this data into different periods say 2002-2005, 2006-2009 and 2009-2012?
If you want to operate on the periods as numbers (rather than text), then this might help:
br <- c(2002, 2006, 2010, 2013) # period starts: 2002-2005, 2006-2009, 2010-2012
df$Int <- findInterval(as.numeric(format(as.Date(df$Date, format='%d-%b-%y'), "%Y")), br)
As @user1317221_G proposed, you should use the function cut.POSIXt. Here's how:
d
Date Auto_Index Realty_Index
1 29-Dec-02 1742.20 1000.00
2 2-Jan-03 1748.85 1009.67
3 3-Jan-03 1758.66 1041.45
4 4-Jan-03 1802.90 1062.11
5 5-Jan-03 1797.45 1047.56
6 26-Nov-12 1665.50 248.75
7 27-Nov-12 1676.30 257.60
8 29-Nov-12 1696.70 266.90
9 30-Nov-12 1682.80 266.55
10 3-Dec-12 1702.60 270.40
# First step, convert your date column to POSIXct (strptime returns POSIXlt, so wrap it)
d$Date <- as.POSIXct(strptime(d$Date, format="%d-%b-%y"))
# Then define your break points for your periods:
breaks <- as.POSIXct(c("2002-01-01","2006-01-01","2010-01-01","2013-01-01"))
# Then cut
d$Period <- cut(d$Date, breaks=breaks,
labels=c("2002-2005","2006-2009","2010-2012"))
d
Date Auto_Index Realty_Index Period
1 2002-12-29 1742.20 1000.00 2002-2005
2 2003-01-02 1748.85 1009.67 2002-2005
3 2003-01-03 1758.66 1041.45 2002-2005
4 2003-01-04 1802.90 1062.11 2002-2005
5 2003-01-05 1797.45 1047.56 2002-2005
6 2012-11-26 1665.50 248.75 2010-2012
7 2012-11-27 1676.30 257.60 2010-2012
8 2012-11-29 1696.70 266.90 2010-2012
9 2012-11-30 1682.80 266.55 2010-2012
10 2012-12-03 1702.60 270.40 2010-2012
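Once each row carries a period label, the per-period analysis is a split or a grouped summary. The following is a minimal sketch on four of the rows above, using the year as a number and cut() with left-closed intervals. It assumes an English locale so that %b recognises "Dec", "Jan" and "Nov"; the mean is only a placeholder for whatever analysis is needed.

```r
# Four rows from the question's data.
d <- data.frame(Date = c("29-Dec-02", "2-Jan-03", "26-Nov-12", "3-Dec-12"),
                Auto_Index = c(1742.2, 1748.85, 1665.5, 1702.6),
                stringsAsFactors = FALSE)

# Extract the calendar year as an integer (%b assumes English month names).
yr <- as.integer(format(as.Date(d$Date, format = "%d-%b-%y"), "%Y"))

# right = FALSE makes intervals [2002, 2006), [2006, 2010), [2010, 2013).
d$Period <- cut(yr, breaks = c(2002, 2006, 2010, 2013), right = FALSE,
                labels = c("2002-2005", "2006-2009", "2010-2012"))

# One data frame per period, and an example summary per period.
by_period <- split(d, d$Period)
means <- tapply(d$Auto_Index, d$Period, mean)
```

Periods with no rows still appear in by_period (as empty data frames) and in means (as NA), which makes gaps in the data easy to spot.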

Resources