Merge one dataframe with a date vector - r

I would like to create a dataframe merging the dataframe ss with a vector daily_vector, which contains date information, through the column "ss$Date_R". I would like to keep all rows from daily_vector so I can tell which dates in the dataframe ss have no data. I have tried the function merge, but when I tried it the vector appears as a list of numbers rather than as dates.
The column "ss$Date_R" is a character column because I concatenated the year, month and day information.
head(ss)
Station Variable Value Date_R
1 SAN VICENTE DEL PALACIO TMAX1 90 1985-01-01
910 SAN VICENTE DEL PALACIO TMAX2 90 1985-01-02
1819 SAN VICENTE DEL PALACIO TMAX3 110 1985-01-03
2728 SAN VICENTE DEL PALACIO TMAX4 85 1985-01-04
3637 SAN VICENTE DEL PALACIO TMAX5 110 1985-01-05
4546 SAN VICENTE DEL PALACIO TMAX6 100 1985-01-06
str(ss)
'data.frame': 9418 obs. of 4 variables:
$ Station : Factor w/ 3 levels "MEDINA DE RIOSECO",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Variable: Factor w/ 31 levels "TMAX1","TMAX2",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Value : int 90 90 110 85 110 100 80 30 80 70 ...
$ Date_R : chr "1985-01-01" "1985-01-02" "1985-01-03" "1985-01-04" ...
daily_vector <-as.factor(seq(as.Date("1985-01-01"), as.Date("2010-10-14"), by="days"))
Does someone know how I can merge these two kinds of information?
Do you know a better way to find out which days are absent from the dataframe ss?
Thanks in advance

If you just want to check which dates in daily_vector are not in ss$Date_R, you don't need to add a new column. Instead, you can use
ss$Date_R <- as.Date(ss$Date_R)
daily_vector <- seq(as.Date("1985-01-01"), as.Date("2010-10-14"), by="days")
missing <- !daily_vector %in% ss$Date_R
daily_vector[missing]
This will return the dates missing in ss$Date_R as a simple vector of dates.
Edit: To add the rows of missing dates to your dataframe, you can use merge as follows:
daily_ex <- daily_vector[1:6] # 6 total dates
ss <- data.frame(V1=rnorm(5), V2=rnorm(5),
Date_R=c(daily_vector[c(1:4, 6)])) # 5 total rows, skipped date #5 on purpose
Date_R_all <- data.frame(Date_R = daily_ex)
merge(ss, Date_R_all, by="Date_R", all=TRUE)
The result is
      Date_R         V1         V2
1 1985-01-01 -0.2152378 -1.1546424
2 1985-01-02  0.7188043 -0.3882131
3 1985-01-03  0.9581949  1.2717832
4 1985-01-04 -0.6559881 -0.6670120
5 1985-01-05         NA         NA
6 1985-01-06 -0.6285255 -1.2645569

I think the merge approach is fine, but first: (a) you need to set the class of your Date_R column to "Date"; (b) your daily_vector must be a data.frame (see ?merge for further information). Try the following:
ss$Date_R <- as.Date.character(ss$Date_R)
daily_vector <- data.frame(seq(as.Date("1985-01-01"), as.Date("2010-10-14"), by="days"))
colnames(daily_vector) <- "Date_R"
merge(ss, daily_vector, all=TRUE)
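Once that merge works, the dates with no data in ss are the rows of the merged result where Value is NA. A small follow-up sketch (the merged_ss name is just for illustration):
merged_ss <- merge(ss, daily_vector, all=TRUE)
missing_dates <- merged_ss$Date_R[is.na(merged_ss$Value)] # dates in daily_vector that have no data in ss
head(missing_dates)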

loop for list element with datetime in r
I have a df named mistake. I split the mistake df by ID. Now I have over 300 different objects in the list.
library(dplyr)
df <- split.data.frame(mistake, mistake$ID)
Every list object has two different datetime stamps. First I need the minutes between these two datetime stamps. Then I duplicate the rows of the object according to the variable stay (this is also the difftime between the start and end time). Then I overwrite the test variable with the increment n_minutes.
library(lubridate)
start_date <- df[[1]]$datetime
end_date <- df[[1]]$gehtzeit
n_minutes <- interval(start_date,end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes) # the diff time in minutes I need
df[[1]]$test <- Sys.time() # a new variable
df[[1]] <- data.frame(df[[1]][rep(seq_len(dim(df[[1]])[1]), df[[1]]$stay + 1), 1:17, drop = F], row.names = NULL)
df[[1]]$test <- format(start_date + minutes(0:n_minutes), format = "%d.%m.%Y %H:%M:%S")
I want to do this with every object of the list, and then 'rbind' or 'unsplit' my list. I know I need a loop, but I don't know how to do this with the list elements.
Any help would be great!
Here is a small df example:
mistake
Baureihe Verbund Fahrzeug Code Codetext Subsystem Kommt.Zeit
71 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 29.07.2018 23:00:07
72 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 04.08.2018 11:16:41
Geht.Zeit Anstehdauer Jahr Monat KW Tag Wartung.geht datetime gehtzeit
71 29.07.2018 23:02:56 00 Std 02 Min 49 Sek 2018 7 KW30 29 0 2018-07-29 23:00:00 2018-07-29 23:02:00
72 04.08.2018 11:19:20 00 Std 02 Min 39 Sek 2018 8 KW31 4 0 2018-08-04 11:16:00 2018-08-04 11:19:00
bleiben ID
71 2 secs 2018-07-29 23:00:00 2018-07-29 23:02:00 1A50
72 3 secs 2018-08-04 11:16:00 2018-08-04 11:19:00 1A50
And here is the structure:
str(mistake)
'data.frame': 2 obs. of 18 variables:
$ Baureihe : int 411 411
$ Verbund : Factor w/ 1 level "ICE1166": 1 1
$ Fahrzeug : Factor w/ 7 levels "93805411066-4",..: 7 7
$ Code : Factor w/ 6 levels "1A07","1A0E",..: 3 3
$ Codetext : Factor w/ 6 levels "ITD Karte gestört",..: 5 5
$ Subsystem : Factor w/ 1 level "Neigetechnik": 1 1
$ Kommt.Zeit : Factor w/ 70 levels "02.08.2018 00:07:23",..: 68 6
$ Geht.Zeit : Factor w/ 68 levels "01.08.2018 01:30:25",..: 68 8
$ Anstehdauer : Factor w/ 46 levels "00 Std 00 Min 01 Sek ",..: 12 4
$ Jahr : int 2018 2018
$ Monat : int 7 8
$ KW : Factor w/ 5 levels "KW27","KW28",..: 4 5
$ Tag : int 29 4
$ Wartung.geht: int 0 0
$ datetime : POSIXlt, format: "2018-07-29 23:00:00" "2018-08-04 11:16:00"
$ gehtzeit : POSIXlt, format: "2018-07-29 23:02:00" "2018-08-04 11:19:00"
$ bleiben :Class 'difftime' atomic [1:2] 2 3
.. ..- attr(*, "units")= chr "secs"
$ ID : chr "2018-07-29 23:00:00 2018-07-29 23:02:00 1A50" "2018-08-04 11:16:00 2018-08-04 11:19:00 1A50"
Consider building a generalized user-defined function that receives a data frame as an input parameter. Then, call the function with by. Like split, by also subsets a data frame by one or more factor(s) such as ID, but unlike split, by can then pass each subset into a function. To row-bind everything back together, run do.call at the end.
The code below removes the redundant df$test <- Sys.time() assignment, which is overwritten later, and reuses the see object inside the format() call at the end to avoid re-calculation and repetition.
library(lubridate) # for interval() and minutes()
calc_datetime <- function(df) {
# INITIAL CALCS
start_date <- df$datetime
end_date <- df$gehtzeit
n_minutes <- interval(start_date, end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes) # the diff time in minutes I need
# BUILD OUTPUT DF
df <- data.frame(df[rep(seq_len(dim(df)[1]), df$stay+1), 1:17, drop= F], row.names=NULL)
df$test <- format(see, format = "%d.%m.%Y %H:%M:%S")
return(df)
}
# BUILD LIST OF SUBSETTED DFs
df_list <- by(mistake, mistake$ID, calc_datetime)
# APPEND ALL RESULT DFs TO SINGLE FINAL DF
final_df <- do.call(rbind, df_list)
Along the same lines as Parfait's answer, and using the same user-defined function calc_datetime, I would use map_dfr from the purrr package:
library(purrr)
df_list <- split(mistake, mistake$ID)
final_df <- map_dfr(df_list, calc_datetime)
If you update the question with data I can use, I can give a working demonstration.

Leftovers from sample function [duplicate]

This question already has answers here:
How to split data into training/testing sets using sample function
(28 answers)
Closed 6 years ago.
I have a question about showing the leftovers from the sample function.
For school we had to make a test dataframe and a train dataframe.
The data that I have to validate has only a train dataframe.
The raw dataframe has 2158 observations. They made a train dataframe with 1529 observations.
set.seed(22)
train <- Gary[sample(1:nrow(Gary), 1529,
replace=FALSE),]
train[, 1] <- as.factor(unlist(train[, 1]))
train[, 2:201] <- as.numeric(as.factor(unlist(train[, 2:201])))
Now I want to have the "leftovers" in a different dataframe.
Do some of you know how to do this?
You can use negative indexing in R if you know the training indices. So we only need to rewrite your first lines:
set.seed(22)
train_indices <- sample(1:nrow(Gary), 1529, replace=FALSE)
train <- Gary[train_indices, ]
test <- Gary[-train_indices, ]
# Proceed with rest of script.
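A quick optional sanity check (assuming the code above ran as is): every row of Gary should now be in exactly one of the two data frames.
nrow(train) + nrow(test) == nrow(Gary) # should return TRUE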
This can be done using the setdiff() function.
Edit: Please note that there is another answer by @AlexR using negative indexing, which is much simpler if the indices are only used for subsetting.
However, first we need to create some dummy data, as the OP hasn't provided any data with the question (for future reference, please read How to make a great R reproducible example?):
Dummy data
Create dummy data frame with 2158 rows and two columns:
n <- 2158
Gary <- data.frame(V1 = seq_len(n), V2 = sample(LETTERS, n , replace =TRUE))
str(Gary)
#'data.frame': 2158 obs. of 2 variables:
# $ V1: int 1 2 3 4 5 6 7 8 9 10 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 21 11 24 10 5 17 18 1 25 7 ...
Sampled and leftover rows
First, the vectors of sampled and leftover rows are computed, before subsetting Gary in subsequent steps:
set.seed(22)
sampled_rows <- sample(seq_len(nrow(Gary)), 1529, replace=FALSE)
leftover_rows <- setdiff(seq_len(nrow(Gary)), sampled_rows)
train <- Gary[sampled_rows, ]
leftover <- Gary[leftover_rows, ]
str(train)
#'data.frame': 1529 obs. of 2 variables:
# $ V1: int 657 1025 2143 1123 1817 1558 1324 1590 898 801 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 19 16 25 15 2 5 8 14 20 3 ...
str(leftover)
#'data.frame': 629 obs. of 2 variables:
# $ V1: int 2 5 6 7 8 9 10 12 20 24 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 11 5 17 18 1 25 7 25 7 18 ...
leftover is a data frame which contains the rows of Gary which haven't been sampled.
Verification
To verify, we combine train and leftover again and sort the rows to compare with the original data frame:
recombined <- rbind(train, leftover)
identical(Gary, recombined[order(recombined$V1), ])
#[1] TRUE

Subsetting tidy data from a vector

I'm using R to analyse data about antibiotic use from a number of hospitals.
I've imported this data into a data frame, following tidy data principles.
>head(data)
date antibiotic usage hospital
1 2006-01-01 amikacin 0.000000 hospital1
2 2006-02-01 amikacin 0.000000 hospital1
3 2006-03-01 amikacin 0.000000 hospital1
4 2006-04-01 amikacin 0.000000 hospital1
5 2006-05-01 amikacin 0.937119 hospital1
6 2006-06-01 amikacin 1.002961 hospital1
(the data set is monthly data x 5 hospitals x 40 antibiotics)
The first thing I would like to do is aggregate the antibiotics into classes.
> head(distinct(select(data, antibiotic)))
antibiotic
1 amikacin
2 amoxicillin-clavulanate
3 amoxycillin
4 ampicillin
5 azithromycin
6 benzylpenicillin
7 cefalotin
8 cefazolin
> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> ceph1 <- c("cefalotin", "cefazolin")
What I would like to do is then subset the data based on these antibiotic class vectors:
filter(data, antibiotic = (any one of the values in the vector "penicillins"))
Thanks to thelatemail for pointing out the way to do this is:
d <- filter(data, antibiotic %in% penicillins)
I would like to analyse the data in a number of ways:
The key analysis (and ggplot output) is:
x = date
y = usage of antibiotic(s) stratified by (drug | class), filtered by hospital
What I'm not clear on now is how to aggregate the data for this sort of thing.
Example:
I want to analyse the use of class "ceph1" across all the hospitals in the district, resulting in (apologies - I know this is not proper code)
x y
Jan-2006 for all in hospitals(usage of cephazolin + usage of cephalotin)
Feb-2006 for all in hospitals(usage of cephazolin + usage of cephalotin)
etc
And, in the long-run, to be able to pass arguments to a function which will let me select which hospitals and which antibiotic or class of antibiotics.
Thanks again - I know this is an order of magnitude more complicated than the original question!
So after lots of trial and error and heaps of reading, I've managed to sort it out.
>str(data)
'data.frame': 23360 obs. of 4 variables:
$ date : Date, format: "2007-09-01" "2012-06-01" ...
$ antibiotic: Factor w/ 41 levels "amikacin","amoxicillin-clavulanate",..: 17 3 19 30 38 20 20 20 7 25 ...
$ usage : num 21.368 36.458 7.226 3.671 0.917 ...
$ hospital : Factor w/ 5 levels "hospital1","hospital2",..: 1 3 2 1 4 1 4 3 5 1 ...
So I can subset the data first:
>library(dplyr)
>penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
>d <- filter(data, antibiotic %in% penicillins)
And then make the summary using more of dplyr (thanks, Hadley!)
>d1 <- summarise(group_by(d, date), total = sum(usage))
>d1
Source: local data frame [122 x 2]
date total
(date) (dbl)
1 2006-01-01 1669.177
2 2006-02-01 1901.749
3 2006-03-01 2311.008
4 2006-04-01 1921.436
5 2006-05-01 1594.781
6 2006-06-01 2150.997
7 2006-07-01 2052.517
8 2006-08-01 2132.501
9 2006-09-01 1959.916
10 2006-10-01 1751.667
.. ... ...
>
> qplot(date, total, data = d1) + geom_smooth()
> [scatterplot as desired!]
Next step will be to try and build it all into a function and/or to try and do the subsetting in-line, building on what I've worked out here.
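A rough sketch of that next step (treat this as illustration only: the usage_by_class name and its arguments are invented here, and it assumes data has the date, antibiotic, usage and hospital columns shown in str(data) above):
library(dplyr)
library(ggplot2)
# usage_by_class: total usage per date for a set of antibiotics,
# optionally restricted to particular hospitals
usage_by_class <- function(data, antibiotics, hospitals = unique(data$hospital)) {
  data %>%
    filter(antibiotic %in% antibiotics, hospital %in% hospitals) %>%
    group_by(date) %>%
    summarise(total = sum(usage))
}
# e.g. class "ceph1" across all hospitals, plotted as before
d_ceph1 <- usage_by_class(data, ceph1)
qplot(date, total, data = d_ceph1) + geom_smooth()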

Ranking entries in a column based on sums of entries in another column

Hi everyone. I am a beginner in R with a question I can't quite figure out. I've created multiple queries within Stack Overflow to address my question (links to results here, here, and here) but none have addressed my issue. On to the problem: I have a subset data frame DAV taken from a larger dataset.
> str(DAV)
'data.frame': 994 obs. of 9 variables:
$ MIL.ID : Factor w/ 18840 levels "","0000151472",..: 7041 9258 10513 5286 5759 5304 5312 5337 5337 5547 ...
$ Name : Factor w/ 18395 levels ""," Atticus Finch",..: 1226 6754 12103 17234 2317 14034 15747 4542 4542 14819 ...
$ Center : int 2370 2370 2370 2370 2370 2370 2370 2370 2370 2370 ...
$ Gift.Date : Factor w/ 339 levels "","01/01/2015",..: 6 6 6 7 10 13 13 13 13 13 ...
$ Gift.Amount: num 100 47.5 150 41 95 ...
$ Solic. : Factor w/ 31 levels "","aa","ac","an",..: 20 31 20 29 20 8 28 8 8 8 ...
$ Tender : Factor w/ 10 levels "","c","ca","cc",..: 3 2 3 5 2 9 3 9 9 9 ...
$ Account : Factor w/ 16 levels "","29101-0000",..: 4 4 4 11 2 11 2 11 2 11 ...
$ Restriction: Factor w/ 258 levels "","AAU","ACA",..: 216 59 216 1 137 1 137 1 38 1 ...
The two relevant columns for my issue are MIL.ID, which contains a unique ID for a donor, and Gift.Amount, which contains a dollar amount for a single gift the donor gave. A single MIL.ID is often associated with multiple Gift.Amount entries, meaning that donor has given on multiple different occasions for various amounts. Here is what I want to do:
I want to separate out the above mentioned columns from the rest of the data frame;
I want to sum(Gift.Amount) but only do so for each donor, i.e. I want to create a sum of all gifts for MIL.ID 1234 in the above data.frame; and
I want to rank all the MIL.IDs based on the sum Gift.Amount entries associated with their ID.
I apologize for how basic this is, and if it is redundant to a question already asked, but I couldn't find anything.
Edit to address comment: I am struggling to get the formatting correct here, so I included screenshots of the table (> print(ranking)) and of the desired output.
This should do it:
df <- DAV[, c("MIL.ID", "Gift.Amount")] #extract columns
df <- aggregate(Gift.Amount ~ MIL.ID, df, sum) #sum amounts with same ID
df <- df[ order(df$Gift.Amount,decreasing = TRUE), ] #sort Decreasing
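If you prefer dplyr, an equivalent version is below (a sketch assuming the same DAV columns; the ranking name mirrors the OP's print(ranking) screenshot):
library(dplyr)
ranking <- DAV %>%
  group_by(MIL.ID) %>%
  summarise(total = sum(Gift.Amount)) %>% # sum of all gifts per donor
  arrange(desc(total)) # highest total first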

Web Scraping in R--readHTMLTable has table names as NULL

Here's my code for reading the tables, but the tables which are read have NULL names. Is there a better method for finding the land area of each state in square miles, without the commas in the numbers? I had the idea of extracting the tables, going to the second table and converting it to a data.frame, but now that they have NULL names I am not sure what I should do or whether there's a better method.
require("XML")
url="http://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"
wiki_page=readLines(url)
length(wiki_page)
tables=readHTMLTable(url)
Here's a sample output:
> tables
$`NULL`
Rank State km² miles²
1 1 Alaska 1,717,854 663,267
2 2 Texas 696,621 268,581
3 3 California 423,970 163,696
4 4 Montana 380,838 147,042
5 5 New Mexico 314,915 121,589
6 6 Arizona 295,254 113,998
7 7 Nevada 286,351 110,561
8 8 Colorado 269,601 104,094
9 9 Oregon 254,805 98,381
....
You should read the headline names from the page and assign them to the tables:
library(XML)
url <- "http://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"
doc <- htmlParse(url)
nn <- xpathSApply(doc, '//*[@class="mw-headline"]', xmlValue)[-4]
tabs <- readHTMLTable(url)
names(tabs) <- nn
Check the result:
str(tabs, max.level = 1)
# List of 3
# $ Total area:'data.frame': 50 obs. of 4 variables:
# $ Land area :'data.frame': 50 obs. of 4 variables:
# $ Water area:'data.frame': 50 obs. of 5 variables:
Numeric conversion:
convert_num <- function(x) as.numeric(gsub(',', '', x))
lapply(tabs, function(y) {
  y[, -c(1, 2)] <- sapply(y[, -c(1, 2)], convert_num)
  y
})
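To get the land area in square miles specifically, keep the converted list and pull out the relevant table. A small follow-up sketch (the tabs_num and land names are for illustration, and the "Land area" name assumes the headline assignment above):
tabs_num <- lapply(tabs, function(y) {
  y[, -c(1, 2)] <- sapply(y[, -c(1, 2)], convert_num)
  y
})
land <- tabs_num[["Land area"]] # 4 columns: rank, state, km2, miles2 (see str(tabs) above)
head(land[, c(2, 4)]) # state and land area in square miles, commas removed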
