Loop over list elements with datetime in R

I have a data frame named mistake. I split it by ID, so I now have a list of over 300 data frames.
library(dplyr)
df <- split.data.frame(mistake, mistake$ID)
Every list element has two datetime stamps. First I need the number of minutes between these two stamps. Then I duplicate the rows of the element according to the variable stay (which is also the difftime between the start and end time). Finally I overwrite the test variable with the timestamps in one-minute increments up to n_minutes.
library(lubridate)
start_date <- df[[1]]$datetime
end_date <- df[[1]]$gehtzeit
n_minutes <- interval(start_date,end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes)#the diff time in minutes I need
df[[1]]$test<- Sys.time()#a new variable
df[[1]] <- data.frame(df[[1]][rep(seq_len(dim(df[[1]])[1]),df[[1]]$stay+1),1:17, drop= F], row.names=NULL)
df[[1]]$test <- format(start_date + minutes(0:n_minutes), format = "%d.%m.%Y %H:%M:%S")
I want to do this with every element of the list, and then 'rbind' or 'unsplit' my list. I know I need a loop, but I don't know how to write it for the list elements.
Any help would be great!
Here is a small df example:
mistake
Baureihe Verbund Fahrzeug Code Codetext Subsystem Kommt.Zeit
71 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 29.07.2018 23:00:07
72 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 04.08.2018 11:16:41
Geht.Zeit Anstehdauer Jahr Monat KW Tag Wartung.geht datetime gehtzeit
71 29.07.2018 23:02:56 00 Std 02 Min 49 Sek 2018 7 KW30 29 0 2018-07-29 23:00:00 2018-07-29 23:02:00
72 04.08.2018 11:19:20 00 Std 02 Min 39 Sek 2018 8 KW31 4 0 2018-08-04 11:16:00 2018-08-04 11:19:00
bleiben ID
71 2 secs 2018-07-29 23:00:00 2018-07-29 23:02:00 1A50
72 3 secs 2018-08-04 11:16:00 2018-08-04 11:19:00 1A50
And here is the structure:
str(mistake)
'data.frame': 2 obs. of 18 variables:
$ Baureihe : int 411 411
$ Verbund : Factor w/ 1 level "ICE1166": 1 1
$ Fahrzeug : Factor w/ 7 levels "93805411066-4",..: 7 7
$ Code : Factor w/ 6 levels "1A07","1A0E",..: 3 3
$ Codetext : Factor w/ 6 levels "ITD Karte gestört",..: 5 5
$ Subsystem : Factor w/ 1 level "Neigetechnik": 1 1
$ Kommt.Zeit : Factor w/ 70 levels "02.08.2018 00:07:23",..: 68 6
$ Geht.Zeit : Factor w/ 68 levels "01.08.2018 01:30:25",..: 68 8
$ Anstehdauer : Factor w/ 46 levels "00 Std 00 Min 01 Sek ",..: 12 4
$ Jahr : int 2018 2018
$ Monat : int 7 8
$ KW : Factor w/ 5 levels "KW27","KW28",..: 4 5
$ Tag : int 29 4
$ Wartung.geht: int 0 0
$ datetime : POSIXlt, format: "2018-07-29 23:00:00" "2018-08-04 11:16:00"
$ gehtzeit : POSIXlt, format: "2018-07-29 23:02:00" "2018-08-04 11:19:00"
$ bleiben :Class 'difftime' atomic [1:2] 2 3
.. ..- attr(*, "units")= chr "secs"
$ ID : chr "2018-07-29 23:00:00 2018-07-29 23:02:00 1A50" "2018-08-04 11:16:00 2018-08-04 11:19:00 1A50"

Consider building a generalized user-defined function that receives a data frame as its input parameter. Then call the function with by. Like split, by subsets a data frame by one or more factors such as ID but, unlike split, by can then pass each subset into a function. To row-bind everything together, run do.call at the end.
The code below removes the redundant df$test <- Sys.time(), which is overwritten later, and reuses the see object inside the format() call at the end to avoid recalculation and repetition.
library(lubridate)  # provides interval() and minutes()
calc_datetime <- function(df) {
  # INITIAL CALCS
  start_date <- df$datetime
  end_date <- df$gehtzeit
  n_minutes <- interval(start_date, end_date) / minutes(1)
  see <- start_date + minutes(0:n_minutes)  # one timestamp per minute from start to end
  # BUILD OUTPUT DF: repeat each row stay + 1 times, keeping the first 17 columns
  df <- data.frame(df[rep(seq_len(nrow(df)), df$stay + 1), 1:17, drop = FALSE], row.names = NULL)
  df$test <- format(see, format = "%d.%m.%Y %H:%M:%S")
  return(df)
}
# BUILD LIST OF SUBSETTED DFs
df_list <- by(mistake, mistake$ID, calc_datetime)
# APPEND ALL RESULT DFs TO SINGLE FINAL DF
final_df <- do.call(rbind, df_list)
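The by plus do.call(rbind, ...) pattern can be tried on a tiny stand-in data set. The following is only a sketch under stated assumptions: mistake_toy, its values, and the helper expand_minutes are invented for illustration, each ID is assumed to have exactly one row, and base R's seq() is swapped in for the lubridate arithmetic so that the example needs no packages.

```r
# Toy stand-in for `mistake` (invented values): one row per ID
mistake_toy <- data.frame(
  ID       = c("a", "b"),
  datetime = as.POSIXct(c("2018-07-29 23:00:00", "2018-08-04 11:16:00"), tz = "UTC"),
  gehtzeit = as.POSIXct(c("2018-07-29 23:02:00", "2018-08-04 11:19:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand a single-row group into one row per minute
expand_minutes <- function(d) {
  stamps <- seq(d$datetime[1], d$gehtzeit[1], by = "1 min")
  out <- d[rep(1, length(stamps)), , drop = FALSE]  # repeat the row once per stamp
  out$test <- format(stamps, format = "%d.%m.%Y %H:%M:%S")
  out
}

# Apply the helper to every ID subset, then stack the results
pieces <- by(mistake_toy, mistake_toy$ID, expand_minutes)
final_toy <- do.call(rbind, pieces)
rownames(final_toy) <- NULL
```

Here the first ID spans two minutes and the second spans three, so the groups expand to three and four rows respectively.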

Along the same lines as Parfait's answer, and using the same user-defined function calc_datetime, I would use map_dfr from the purrr package:
library(purrr)
df_list <- split(mistake, mistake$ID)
final_df <- map_dfr(df_list, calc_datetime)
If you update the question with data I can use, I can give a working demonstration.

Related

Subset function in R works on one column but not on the other

I am trying to get some NBA data. However, I can't seem to select a subset of my data properly. I tried using the subset function to keep only the players with more than 10 games, but it doesn't work for some reason. The subset works when I use a different column, and I don't know why. Here's my code:
# install.packages(c("httr", "jsonlite", "dplyr"))  # run once if the packages are missing
library(httr)
library(jsonlite)
library(dplyr)
params = list(AheadBehind="Ahead or Behind", ClutchTime="Last 5 Minutes",
College="", Conference="", Country="",DateFrom="", DateTo="",
Division="", DraftPick="", DraftYear="", GameScope="",
GameSegment="", Height="", LastNGames= "0", LeagueID="00",
Location="", MeasureType="Base", Month="0", OpponentTeamID="0",
Outcome="", PORound="0", PaceAdjust="N", PerMode="PerGame",
Period="0", PlayerExperience="", PlayerPosition="",
PlusMinus= "N", PointDiff="5", Rank="N", Season="2020-21",
SeasonSegment="",SeasonType="Regular Season", ShotClockRange="",
StarterBench="", TeamID="0", VsConference="", VsDivision="",
Weight="" )
request_headers = c('Accept'='application/json, text/plain, */*',
'Accept-Encoding'='gzip, deflate, br',
'Accept-Language'="en-US,en;q=0.9",
'Connection'='keep-alive',
'Host'='stats.nba.com',
'Origin'='https://www.nba.com',
'Referer'='https://www.nba.com/',
'Sec-Fetch-Dest'='empty',
'Sec-Fetch-Mode'='cors',
'Sec-Fetch-Site'='same-site',
'User-Agent'='Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36',
'x-nba-stats-origin'='stats',
'x-nba-stats-token'='true')
base <- 'https://stats.nba.com/stats/'
endpoint_players_clutch <- 'leaguedashplayerclutch'
call <- paste(base, endpoint_players_clutch, sep = '')
res <- httr::GET(call, httr::add_headers(.headers=request_headers), query=params)
json_resp <- jsonlite::fromJSON(content(res, "text"))
df <- data.frame(json_resp$resultSets$rowSet[1])
colnames(df) <- json_resp[["resultSets"]][["headers"]][[1]]
df2 <- select(df, PLAYER_NAME, GP, PLUS_MINUS)
df3 <- subset(df2, GP > 10)
# The line below works, but not the one above
# df3 <- subset(df2, PLUS_MINUS > 0)
Any solution would help, but it would help if the solution uses the subset function so that I know what I did wrong. Thanks
Let's show what should have been done:
df2 <- select(df, PLAYER_NAME, GP, PLUS_MINUS)
str(df2) # notice that GP displays like 'numeric' but is really 'factor'
#--------------
'data.frame': 382 obs. of 3 variables:
$ PLAYER_NAME: Factor w/ 382 levels "Aaron Gordon",..: 1 2 3 4 5 6 7 8 9 10 ...
$ GP : Factor w/ 23 levels "1","10","11",..: 22 22 12 12 6 1 21 1 2 12 ...
$ PLUS_MINUS : Factor w/ 89 levels "-0.1","-0.2",..: 53 63 33 29 3 39 14 86 78 10 ...
#-----------------------
df2$GP <- as.numeric(as.character(df2$GP)) #convert factor to numeric
df3 <- subset(df2, GP > 10)
str(df3)
#----------------------
'data.frame': 157 obs. of 3 variables:
$ PLAYER_NAME: Factor w/ 382 levels "Aaron Gordon",..: 5 13 14 16 17 21 23 29 31 32 ...
$ GP : num 14 18 16 11 14 18 18 13 17 12 ...
$ PLUS_MINUS : Factor w/ 89 levels "-0.1","-0.2",..: 3 5 48 76 33 11 66 43 10 3 ...
#----------
If this had been the result of a read.table or other read.* operation, then depending on your version of R the GP column might have been either a factor, as it was for me in my 3.6.2 session, or character, as it would be for anyone using an up-to-date version. The default for stringsAsFactors changed to FALSE in R 4.0.0. When it is a factor, the GP column first needs to be converted to character before it can be converted to numeric. In your case it might be that jsonlite has not yet made the same decision about assuming that columns that can be numeric should be numeric.
If you are running R 4.0 or above, you don't need the as.character inside the as.numeric.
With the data you're working with, and since you have a lot of variables (66), you could use the type.convert helper function to convert the variables of the df data object to logical, integer, numeric, character, factor, etc. as appropriate. This would be a good initial step to make sure the variables in the initial data frame df are of the appropriate type.
For example:
df <- type.convert(data.frame(json_resp$resultSets$rowSet[1]), as.is = TRUE)  # as.is = TRUE keeps strings as character
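To see why the comparison misbehaves before conversion, here is a small sketch with invented data (the players and values in df2_toy are made up, not taken from the NBA response). It assumes the column arrives as character, as it would under R 4.0 or later, and also shows what as.numeric does to a factor.

```r
# Toy data: GP stored as text rather than numbers (invented values)
df2_toy <- data.frame(
  PLAYER_NAME = c("A", "B", "C"),
  GP          = c("14", "2", "9"),
  stringsAsFactors = FALSE
)

# Comparing text with a number coerces 10 to "10" and compares strings,
# so "2" and "9" both count as greater than "10"
wrong <- subset(df2_toy, GP > 10)

# A factor converted directly yields its internal level codes, not the values
codes  <- as.numeric(factor(df2_toy$GP))
values <- as.numeric(as.character(factor(df2_toy$GP)))

# Convert first, then subset
df2_toy$GP <- as.numeric(df2_toy$GP)
right <- subset(df2_toy, GP > 10)
```

After the conversion only player A, with 14 games, remains in right, whereas wrong keeps all three rows.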

How to remove $ from all values in a data frame column in a character vector?

I have a data frame in R that has information about NBA players, including salary information. All the data in the salary column have a "$" before the value and I want to convert the character data to numeric for the purpose of analysis. So I need to remove the "$" in this column. However, I am unable to subset or parse any of the values in this column. It seems that each value is a vector of 1. I've included below the structure of the data and what I have tried in my attempt at removing the "$".
> str(combined)
'data.frame': 588 obs. of 9 variables:
$ Player: chr "Aaron Brooks" "Aaron Gordon" "Aaron Gray" "Aaron Harrison" ...
$ Tm : Factor w/ 30 levels "ATL","BOS","BRK",..: 4 22 9 5 9 18 1 5 25 30 ...
$ Pos : Factor w/ 5 levels "C","PF","PG",..: 3 2 NA 5 NA 2 1 1 4 5 ...
$ Age : num 31 20 NA 21 NA 24 29 31 25 33 ...
$ G : num 69 78 NA 21 NA 52 82 47 82 13 ...
$ MP : num 1108 1863 NA 93 NA ...
$ PER : num 11.8 17 NA 4.3 NA 5.6 19.4 18.2 12.7 9.2 ...
$ WS : num 0.9 5.4 NA 0 NA -0.5 9.4 2.8 4 0.3 ...
$ Salary: chr "$2000000" "$4171680" "$452059" "$525093" ...
combined[, "Salary"] <- gsub("$", "", combined[, "Salary"])
The last line of code above runs successfully, but it doesn't change the "Salary" column.
I am able to change it by running the code listed below, but I need to find a way to automate the replacement for the whole data set instead of doing it row by row.
combined[, "Salary"] <- gsub("$2000000", "2000000", combined[, "Salary"])
How can I subset the character vectors in this column to remove the "$"? Apologies for any formatting faux pas ahead of time, this is my first time asking a question. Cheers,
The $ is a regex metacharacter that matches the end of the string. So we need to either escape it (\\$), place it in square brackets ("[$]"), or pass fixed = TRUE to sub. We don't need gsub, as there seems to be only a single $ character in each string.
combined[, "Salary"] <- as.numeric(sub("$", "", combined[, "Salary"], fixed=TRUE))
Or, as @gung mentioned in the comments, using substr would be faster:
as.numeric(substr(combined$Salary, 2, nchar(combined$Salary)))
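The variants described in this answer can be compared side by side on a short vector. This is a minimal sketch; salary is an invented stand-in for the Salary column.

```r
# Toy stand-in for combined$Salary (invented values)
salary <- c("$2000000", "$4171680")

# Unescaped "$" is an end-of-string anchor, so nothing is removed
unchanged <- gsub("$", "", salary)

# Three equivalent ways to strip the literal dollar sign
fixed_way   <- as.numeric(sub("$", "", salary, fixed = TRUE))
escaped_way <- as.numeric(sub("\\$", "", salary))
substr_way  <- as.numeric(substr(salary, 2, nchar(salary)))
```

Note that the substr variant assumes the dollar sign is always the first character, while the two sub variants remove the first dollar sign wherever it occurs.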

How to merge two data frames with non overlapping dates?

I have a data set with the following variables:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken (288 intervals per day)
The main data set:
> head(activityData, 3)
steps date interval
1 1.7169811 2012-10-01 0
2 0.3396226 2012-10-01 5
3 0.1320755 2012-10-01 10
> str(activityData)
'data.frame': 17568 obs. of 3 variables:
$ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: num 0 5 10 15 20 25 30 35 40 45 ...
The data set has a range of two months.
I had to divide it into weekdays and weekend days. I did it with the following code:
> dataAs.xtsWeekday <- dataAs.xts[.indexwday(dataAs.xts) %in% 1:5]
> dataAs.xtsWeekend <- dataAs.xts[.indexwday(dataAs.xts) %in% c(0, 6)]
After doing this I had to make some calculation, at which I failed so I decided to export the files and read them in, again.
After I imported the data again, I made the calculation I wanted, and I tried to merge the 2 datasets, but did not succeed.
First data set:
> head(weekdays, 3)
X steps date interval daytype
1 1 37.3826 2012-10-01 0 weekday
2 2 37.3826 2012-10-01 5 weekday
3 3 37.3826 2012-10-01 10 weekday
> str(weekdays)
'data.frame': 12960 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 37.4 37.4 37.4 37.4 37.4 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekday" "weekday" "weekday" "weekday" ...
Second data set:
> head(weekend, 3)
X steps date interval daytype
1 1 0 2012-10-06 0 weekend
2 2 0 2012-10-06 5 weekend
3 3 0 2012-10-06 10 weekend
> str(weekend)
'data.frame': 4608 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 0 0 0 0 0 0 0 0 0 0 ...
$ date : chr "2012-10-06" "2012-10-06" "2012-10-06" "2012-10-06" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekend" "weekend" "weekend" "weekend" ...
Now I would like to merge the 2 data sets (weekdays, weekends) by date, but the problem is that I don't have any common dates or anything else common.
The final data set should have 4 columns and 17568 observations.
The columns should be:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
daytype: weekends days or normal weekdays.
I tried with:
merge
join(plyr)
union
Everywhere I looked all the data sets had a common ID or a common column in both data sets, not like in my case.
I also looked here, but I did not understand much and at many others, but they had nothing in common with my data set.
The other option I thought about was to add a column called "ID" to the original data set and redo everything I have done so far, which I will have to do if I don't find a way around this problem.
I would like some advice on how to proceed or what to try next.
Since you mentioned that your final data set should have 17568 (= 4608 + 12960) observations/rows, I assume you want to stack the two data.frames on top of each other (and possibly order them by date afterwards). This is done with rbind().
finaldata <- rbind(weekdays, weekend)
If you want to remove column X:
finaldata$X <- NULL
To convert your date column to actual dates:
finaldata$date <- as.Date(finaldata$date, format="%Y-%m-%d")
To order the whole data by date:
finaldata <- finaldata[order(finaldata$date),]
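The whole sequence can be checked on two tiny data frames. This is a sketch with invented rows; weekdays_toy and weekend_toy are hypothetical stand-ins for the real weekdays and weekend data, and rbind() requires that both have the same column names.

```r
# Toy stand-ins for the two imported data sets (invented values)
weekdays_toy <- data.frame(
  X = 1:2, steps = c(37.4, 37.4),
  date = c("2012-10-01", "2012-10-08"),
  interval = c(0L, 0L), daytype = "weekday",
  stringsAsFactors = FALSE
)
weekend_toy <- data.frame(
  X = 1L, steps = 0,
  date = "2012-10-06",
  interval = 0L, daytype = "weekend",
  stringsAsFactors = FALSE
)

# Stack, drop the leftover index column, convert dates, and sort
finaldata_toy <- rbind(weekdays_toy, weekend_toy)
finaldata_toy$X <- NULL
finaldata_toy$date <- as.Date(finaldata_toy$date, format = "%Y-%m-%d")
finaldata_toy <- finaldata_toy[order(finaldata_toy$date), ]
```

No shared key is needed because the rows are appended rather than matched, which is why merge and join were the wrong tools here.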

Error when exporting dataframe to text file in R

I am trying to write a data frame in R to a text file; however, it returns the following error:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
I used the following command for the export:
write.table(df, file ='dfname.txt', sep='\t' )
I have no idea what the problem could stem from. As far as "missing data where TRUE/FALSE is needed", I have only one column which contains TRUE/FALSE values, and none of these values are missing.
Contents of the dataframe:
> str(df)
'data.frame': 776 obs. of 15 variables:
$ Age : Factor w/ 4 levels "","A","J","SA": 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Rep : Factor w/ 11 levels "L","NR","NRF",..: 1 1 4 4 2 2 2 2 2 2 ...
$ FA : num 61.5 62.5 60.5 61 59.5 59.5 59.1 59.2 59.8 59.9 ...
$ Mass : num 20 19 16.5 17.5 NA 14 NA 23 19 18.5 ...
$ Vir1 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir2 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir3 : num 40 999 999 999 999 999 999 999 999 999 ...
$ Location : Factor w/ 4 levels "Loc1",..: 4 4 4 4 4 4 2 2 2 2 ...
$ Site : Factor w/ 6 levels "A","B","C",..: 5 5 5 5 5 5 3 3 3 3 ...
$ Date : Date, format: "2010-08-30" "2010-08-30" ...
$ Record : int 35 34 39 49 69 38 145 112 125 140 ...
$ SampleID : Factor w/ 776 levels "AT1-A-F1","AT1-A-F10",..: 525 524 527 528 529 526 111 78 88 110 ...
$ Vir1Inc : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Month :'data.frame': 776 obs. of 2 variables:
..$ Dates: Date, format: "2010-08-30" "2010-08-30" ...
..$ Month: Factor w/ 19 levels "Apr-2011","Aug-2010",..: 2 2 2 2 2 2 18 18 18 18 ...
I hope I've given enough/the right information ...
Many thanks,
Heather
An example to reproduce the error. I create a nested data.frame:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
str(dd)
'data.frame': 15 obs. of 2 variables:
$ Age : int 1 2 3 4 5 6 7 8 9 10 ...
$ Month:'data.frame': 15 obs. of 2 variables:
..$ Dates: Date, format: "2003-02-02" "2003-02-03" "2003-02-04" ...
..$ Month: Factor w/ 12 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
Now I try to save it and reproduce the error:
write.table(dd)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) : missing value where TRUE/FALSE needed
Without investigating further, one option is to remove the nesting of the data.frame:
write.table(data.frame(subset(dd,select=-c(Month)),unclass(dd$Month)))
The solution by agstudy provides a great quick fix, but there is a simple, more general alternative that does not require you to specify which element(s) of your data.frame are nested.
The following bit is just copied from agstudy's solution to obtain the nested data.frame dd:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
You can use akhilsbehl's LinearizeNestedList() function (which mrdwab made available here) to flatten (or linearize) the nested levels:
library(devtools)
source_gist(4205477) #loads the function
ddf <- LinearizeNestedList(dd, LinearizeDataFrames = TRUE)
# ddf is now a list with two elements (Age and Month)
ddf <- LinearizeNestedList(ddf, LinearizeDataFrames = TRUE)
# ddf is now a list with 3 elements (Age, `Month/Dates` and `Month/Month`)
ddf <- as.data.frame.list(ddf)
# transforms the flattened/linearized list into a data.frame
ddf is now a data.frame without nesting. However, its column names still reflect the nested structure:
names(ddf)
[1] "Age" "Month.Dates" "Month.Month"
If you want to change this (in this case it seems redundant to have Month. written before Dates, for example), you can use gsub and a regular expression that I copied from Sacha Epskamp to remove all text in the column names up to and including the last dot:
names(ddf) <- gsub(".*\\.","",names(ddf))
names(ddf)
[1] "Age" "Dates" "Month"
The only thing left now is exporting the data.frame as usual:
write.table(ddf, file="test.txt")
Alternatively, you could use the flatten function from the jsonlite package to flatten the data frame before export. It achieves the same result as the other functions mentioned and is much easier to use.
jsonlite::flatten
https://rdrr.io/cran/jsonlite/man/flatten.html
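For readers who prefer to avoid extra packages, the same flattening can be sketched in base R. This is an alternative not mentioned in the answers above: it relies on data.frame() prefixing the columns of a data-frame argument with the argument's name, and it reuses the nested dd built earlier.

```r
# Rebuild the nested data.frame from the reproducible example
Month <- data.frame(Dates = as.Date("2003-02-01") + 1:15,
                    Month = gl(12, 2, 15))
dd <- data.frame(Age = 1:15)
dd$Month <- Month

# Passing each column as a separate argument lets data.frame()
# expand the nested data.frame into ordinary columns
flat <- do.call(data.frame, dd)
names(flat)

# The flattened result can be written to a file as usual
out_file <- tempfile(fileext = ".txt")
write.table(flat, file = out_file, sep = "\t")
```

The column names come out with the Month. prefix, so the gsub renaming shown earlier applies here as well.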

colClasses date and time read.csv

I have some data of the form:
date,time,val1,val2
20090503,0:05:12,107.25,1
20090503,0:05:17,108.25,20
20090503,0:07:45,110.25,5
20090503,0:07:56,106.25,5
that comes from a csv file. I am relatively new to R, so I tried
data <-read.csv("sample.csv", header = TRUE, sep = ",")
and using POSIXlt, as well as POSIXct, in the colClasses argument, but I can't seem to create one column or 'variable' out of my date and time data. I want to do so in order to choose arbitrary time frames over which to calculate running statistics such as max, min, and mean (and then boxplots, etc.).
I also thought that I might convert it to a time series and get around it that way,
dataTS <-ts(data)
but have not yet been able to use the start, end, and frequency arguments to my advantage. Thanks for your help.
You can't do this upon reading the data in to R using the colClasses argument because the data span two "columns" in the CSV file. Instead, load the data and process the date and time columns into a single POSIXlt variable:
dat <- read.csv(textConnection("date,time,val1,val2
20090503,0:05:12,107.25,1
20090503,0:05:17,108.25,20
20090503,0:07:45,110.25,5
20090503,0:07:56,106.25,5"))
dat <- within(dat, Datetime <- as.POSIXlt(paste(date, time),
format = "%Y%m%d %H:%M:%S"))
[I presume the date is in year-month-day order. If not, use "%Y%d%m %H:%M:%S".]
Which gives:
> head(dat)
date time val1 val2 Datetime
1 20090503 0:05:12 107.25 1 2009-05-03 00:05:12
2 20090503 0:05:17 108.25 20 2009-05-03 00:05:17
3 20090503 0:07:45 110.25 5 2009-05-03 00:07:45
4 20090503 0:07:56 106.25 5 2009-05-03 00:07:56
> str(dat)
'data.frame': 4 obs. of 5 variables:
$ date : int 20090503 20090503 20090503 20090503
$ time : Factor w/ 4 levels "0:05:12","0:05:17",..: 1 2 3 4
$ val1 : num 107 108 110 106
$ val2 : int 1 20 5 5
$ Datetime: POSIXlt, format: "2009-05-03 00:05:12" "2009-05-03 00:05:17" ...
You can now delete date and time if you wish:
> dat <- dat[, -(1:2)]
> head(dat)
val1 val2 Datetime
1 107.25 1 2009-05-03 00:05:12
2 108.25 20 2009-05-03 00:05:17
3 110.25 5 2009-05-03 00:07:45
4 106.25 5 2009-05-03 00:07:56
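Once the combined column exists, statistics over a chosen time frame can be sketched with cut() and aggregate(). This goes one step beyond the answer above and rests on assumptions: POSIXct is used instead of POSIXlt because it is stored as a single numeric vector, the time zone is fixed to UTC for reproducibility, and the 2-minute window and the names dat2, bin and stats are arbitrary choices for illustration.

```r
# Same sample data as in the question
dat2 <- read.csv(text = "date,time,val1,val2
20090503,0:05:12,107.25,1
20090503,0:05:17,108.25,20
20090503,0:07:45,110.25,5
20090503,0:07:56,106.25,5")

# Combine date and time into one POSIXct column
dat2$Datetime <- as.POSIXct(paste(dat2$date, dat2$time),
                            format = "%Y%m%d %H:%M:%S", tz = "UTC")

# Assign each row to a 2-minute window, then take the maximum per window
dat2$bin <- cut(dat2$Datetime, breaks = "2 min")
stats <- aggregate(val1 ~ bin, data = dat2, FUN = max)
```

Replacing max with min or mean, or "2 min" with another interval such as "1 hour", gives the other summaries mentioned in the question.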
