Removing rows of data in R below a specified value - r

I was wondering if anybody could help...
I have a data frame which includes a continuous time column and I am trying to remove all rows below a specified time.
The data starts at approx. 11:29:00, but I want to remove all rows before 12:30:00 and after 14:20:00.
Since the data is recorded every second, deleting unnecessary rows will be a great help and make managing this data a whole lot easier for me so any help would be greatly appreciated.
This is the head of the data frame, as you can see the time is continuous in seconds. I would like to remove all these rows up to 12:30:00 within the GPS.Time column. Hope that makes sense.
Raw.Vel. Smooth.Vel. GPS.Time
1.486 0.755 11:39:39
1.425 1.167 11:39:40
1.466 1.398 11:39:41
1.533 1.552 11:39:42
1.517 1.594 11:39:43
1.918 1.556 11:39:44
Creating above data frame:
Raw.Vel. <- c(1.486,1.425, 1.466, 1.533, 1.517, 1.918)
Smooth.Vel. <- c(0.755, 1.167, 1.398, 1.552, 1.594, 1.556)
GPS.Time <- c("11:39:39", "11:39:40", "11:39:41", "11:39:42", "11:39:43", "11:39:44")
sample <- data.frame(Raw.Vel., Smooth.Vel., GPS.Time)
Thanks in advance.

Use the lubridate package to transform your string time column into some kind of time class:
library(lubridate)
sample$GPS.Time <- hms(sample$GPS.Time)
To get the required output, subset with brackets ([) using the condition you want to keep. In your example, this removes all rows before 11:39:42:
output <- sample[sample$GPS.Time >= hms("11:39:42"),]

Turn the GPS.Time into a "POSIXct" object:
df$time <- as.POSIXct(df$GPS.Time, format="%H:%M:%S")
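Then you can filter with a logical condition that describes the rows to keep, here everything from 12:30:00 to 14:20:00:
filtered_df <- df[df$time >= as.POSIXct("12:30:00", format="%H:%M:%S") &
                  df$time <= as.POSIXct("14:20:00", format="%H:%M:%S"), ]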

You can convert the entries in the "GPS.Time" column to character (in R versions before 4.0 the column is created as a factor by default). Because the times are zero-padded HH:MM:SS strings, they sort correctly as text, so you can subset by comparing them with a cutoff time stored as a character string in the same format:
sample$GPS.Time <- as.character(sample$GPS.Time)
cutoff_time <- "11:39:42" # modify as necessary
sample <- sample[sample$GPS.Time >= cutoff_time,] # keep rows at or after cutoff_time (also safe when no row is earlier)
#> sample
# Raw.Vel. Smooth.Vel. GPS.Time
#4 1.533 1.552 11:39:42
#5 1.517 1.594 11:39:43
#6 1.918 1.556 11:39:44
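Since the question asks for both a lower and an upper bound, here is a minimal sketch that keeps only the rows between 12:30:00 and 14:20:00. The data frame gps and its values are made up for illustration, and the helper name to_time is hypothetical; the approach is the POSIXct conversion described above, with tz fixed to "UTC" as an assumption so the comparison does not depend on the local time zone.

```r
# Hypothetical sample that straddles the window of interest (values are made up).
gps <- data.frame(
  Raw.Vel. = c(1.486, 1.425, 1.466, 1.533, 1.517),
  GPS.Time = c("11:39:39", "12:29:59", "12:30:00", "14:20:00", "14:20:01"),
  stringsAsFactors = FALSE
)

# Parse HH:MM:SS strings to POSIXct. The strings carry no date, so the
# current date is filled in; that is harmless because the bounds are
# parsed the same way.
to_time <- function(x) as.POSIXct(x, format = "%H:%M:%S", tz = "UTC")

start <- to_time("12:30:00")
end   <- to_time("14:20:00")
t     <- to_time(gps$GPS.Time)

# Keep rows inside the closed interval [start, end].
kept <- gps[t >= start & t <= end, ]
print(kept)
```

Describing the rows to keep, rather than the rows to drop, avoids negative indexing and still behaves sensibly when nothing falls outside the window.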

Related

R: Is there a way to subset a file while reading

I have a huge .csv file (~1.4 GB), and reading it with read.csv takes a long time. The file contains several variables, and all I want is to extract the rows for a few of the values in one column.
For example, suppose ABC.csv is my file and it looks something like this:
ABC.csv
Date Variables Val
2017-11-01 X 23
2017-11-01 A 2
2017-11-01 B 0.5
............................
2017-11-02 X 20
2017-11-02 C 40
............................
2017-11-03 D 33
2017-11-03 X 22
............................
............................
So here the variable of interest is X, and while reading this file I want the Variables column to be scanned so that only the rows with the string X in that column are read. My new data frame would then look something like this:
> df
Date Variables Val
2017-11-01 X 23
2017-11-02 X 20
.........................
.........................
Any Help will be appreciated. Thank you in advance.
Check out the LaF package. It lets you read very large text files in blocks, so you don't have to read the entire file into memory.
library(LaF)
data_model <- detect_dm_csv("yourFile.csv", skip = 1) # detects the file structure
dat <- laf_open(data_model) # opens connection to the file
block_list <- lapply(seq(1,100000,1000), function(row_num){
  goto(dat, row_num)
  data_block <- next_block(dat, nrows = 1000) # reads data blocks of 1000 rows
  data_block <- data_block[data_block$Variables == "X",]
  return(data_block)
})
your_df <- do.call("rbind", block_list)
Admittedly, the package sometimes feels a bit bulky, and in some situations I had to find small hacks to get my results (you might have to adapt my solution for your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
Just wondering if doing this works. It worked for my code but I am not sure whether it is first reading in the entire data and then subsetting or is it only reading the part of the file where Variables == 'X'.
temp <- fread('dat.csv')[Variables == 'X']
I would say that most of the time you can probably just read in the entire file, and then subset within R:
df <- read.csv(file="path/to/your/file.csv", header=TRUE)
df.x <- df[df$Variables=='X', ]
R operates completely in memory, so an exception to what I said above might occur if you have a file whose total size is so massive that it cannot fit into memory, but for some reason the subset of interest can.
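If the file really is too large to load at once, one option that needs no extra packages is to read it through a connection in blocks and keep only the matching rows of each block. This is only a sketch under assumptions: the file is comma-separated with a single header line, the helper name read_filtered and the tiny chunk_size are hypothetical, and a temporary file with made-up rows stands in for ABC.csv.

```r
# Build a small stand-in for ABC.csv in a temporary file (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("Date,Variables,Val",
             "2017-11-01,X,23", "2017-11-01,A,2", "2017-11-01,B,0.5",
             "2017-11-02,X,20", "2017-11-02,C,40",
             "2017-11-03,D,33", "2017-11-03,X,22"), path)

read_filtered <- function(path, keep = "X", chunk_size = 3) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)            # consume the header line once
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)  # next block of raw lines
    if (length(lines) == 0) break            # end of file reached
    # Parse only this block, re-attaching the header for column names.
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    pieces[[length(pieces) + 1]] <- chunk[chunk$Variables == keep, ]
  }
  do.call(rbind, pieces)
}

df_x <- read_filtered(path)
print(df_x)
```

Only one block plus the rows kept so far are held in memory at any time. For a real file a much larger chunk_size (tens of thousands of lines) would be more sensible, and column types are inferred per block, so passing colClasses to read.csv may be needed for consistency.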

Changing multiple columns of a data frame from class 'character' to class 'time' using chron

I have a data frame with multiple columns, some of which I need to change to 'time' class using chron so that I can retrieve basic statistics. These columns are currently times stored as characters and formatted like this: hh:mm.
Here is a subset of it as well as the list of columns that need to change:
> Data
DATE FLT TYPE REG AC DEP ARR STD STA ATD ATA
1 15-01-02 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:00 13:59
2 15-01-04 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:17 14:13
3 15-01-05 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:20 14:14
Time_list <-c("STD","STA","ATD","ATA")
Here is what I have done to change only one column (and it works):
Data$ATA <- paste0(Data$ATA, ':00')
Data$ATA<-chron(times.=Data$ATA)
class(Data$ATA)
[1] "times"
However, I would prefer to be able to do all the columns at the same time since there are many of them. I've tried multiple techniques, and some seem to work for the first part (pasting ':00'), but it always goes wrong for the second part (using chron). I seem to have a length problem that I don't understand.
Using dmap
Data[,Time_list]<-
Data%>%
select(one_of(Time_list)) %>%
dmap(paste0,':00')
Data[,Time_list]<-
Data %>%
select(one_of(Time_list)) %>%
dmap(chron,times.=Data[,Time_list])
**Error in .f(.d[[i]], ...) :
.d[[i]] and Data[, Time_list] must have equal lengths**
Using apply
YEVdata[,(Time_list)] <- lapply(YEVdata[,(Time_list)], paste0,':00')
Data[,(Time_list)] <- lapply(Data[,(Time_list)], chron, times. =Data[,(Time_list)])
**Error in FUN(X[[i]], ...) :
X[[i]] and Data[, (Time_list)] must have equal lengths**
Using a forloop
I tried using a for loop, but I'm just a beginner and couldn't get anywhere.
Using "simple" solution from another Stack Overflow question.
It just made a mess, even pasting.
Efficiently transform multiple columns of a data frame
Any ideas in plain beginner language would be very appreciated! If it is possible to nest both operations, it would be even better!
dplyr::mutate_at would work for this situation. You name the columns you want to mutate and then supply the function to apply to each of them.
You can do the pasting and the conversion to a time in a single step by nesting the two calls inside a formula-style lambda, where . stands for the current column. A lambda is used here because funs() has been deprecated since dplyr 0.8.0.
library(dplyr)
library(chron)
Data = mutate_at(Data, Time_list, ~ chron(times. = paste0(., ":00")))
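For comparison, the same conversion can be sketched in base R with lapply. The attempts in the question failed because chron was handed the whole data frame through times. on every call, in addition to the single column that lapply already passes as the first argument, hence the length mismatch. Assuming the chron package is installed and the columns are hh:mm character strings, wrapping both steps in one anonymous function avoids that. Only the time columns of the question's data are reproduced here.

```r
library(chron)

# Time columns from the question's sample data.
Data <- data.frame(
  STD = c("12:00", "12:00", "12:00"),
  STA = c("12:55", "12:55", "12:55"),
  ATD = c("13:00", "13:17", "13:20"),
  ATA = c("13:59", "14:13", "14:14"),
  stringsAsFactors = FALSE
)
Time_list <- c("STD", "STA", "ATD", "ATA")

# lapply passes one column at a time, so the pasted strings and the
# resulting times object always have the same length.
Data[Time_list] <- lapply(Data[Time_list], function(x) {
  chron(times. = paste0(x, ":00"))
})

print(sapply(Data[Time_list], class))
```

Both operations are nested in a single function, so there is no intermediate pasted copy of the columns to keep track of.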

Subsetting data by multiple date ranges - R

I'll get straight to the point: I have been given some data sets in .csv format containing regularly logged sensor data from a machine. However, this data set also contains measurements taken when the machine is turned off, which I would like to separate from the data logged from when it is turned on. To subset the relevant data I also have a file containing start and end times of these shutdowns. This file is several hundred rows long.
Examples of the relevant files for this problem:
file: sensor_data.csv
sens_name,time,measurement
sens_A,17/12/11 06:45,32.3321
sens_A,17/12/11 08:01,36.1290
sens_B,17/12/11 05:32,17.1122
sens_B,18/12/11 03:43,12.3189
##################################################
file: shutdowns.csv
shutdown_start,shutdown_end
17/12/11 07:46,17/12/11 08:23
17/12/11 08:23,17/12/11 09:00
17/12/11 09:00,17/12/11 13:30
18/12/11 01:42,18/12/11 07:43
To subset data in R, I have previously used the subset() function with simple conditions which has worked fine, but I don't know how to go about subsetting sensor data which fall outside multiple shutdown date ranges. I've already formatted the date and time data using as.POSIXlt().
I'm suspecting some scripting may be involved to come up with a good solution, but I'm afraid I am not yet experienced enough to handle this type of data.
Any help, advice, or solutions will be greatly appreciated. Let me know if there's anything else needed for a solution.
I prefer the POSIXct format for ranges within data frames. We build an index that is TRUE for measurements falling outside every shutdown interval, i.e. t < shutdown_start OR t > shutdown_end holds for all rows of shutdowns. With this index we can then subset the data as necessary:
posixct <- function(x) as.POSIXct(x, format="%d/%m/%y %H:%M")
sensor_data$time <- posixct(sensor_data$time)
shutdowns[] <- lapply(shutdowns, posixct)
ind1 <- sapply(sensor_data$time, function(t) {
  # TRUE only if t lies outside all shutdown intervals
  all(t < shutdowns[,1] | t > shutdowns[,2])})
#Measurements taken while the machine was running (outside every shutdown)
sensor_data[ind1,]
# sens_name time measurement
# 1 sens_A 2011-12-17 06:45:00 32.3321
# 3 sens_B 2011-12-17 05:32:00 17.1122
#Measurements taken during a shutdown
sensor_data[!ind1,]
# sens_name time measurement
# 2 sens_A 2011-12-17 08:01:00 36.1290
# 4 sens_B 2011-12-18 03:43:00 12.3189
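The interval test can also be written the other way round, asking whether a timestamp falls inside at least one shutdown. The sketch below is self-contained and rebuilds the four sample rows from the question; the time zone "UTC" and the names in_shutdown, running and stopped are assumptions made for illustration, not part of the original data.

```r
posixct <- function(x) as.POSIXct(x, format = "%d/%m/%y %H:%M", tz = "UTC")

# Sample rows from the question, rebuilt in code.
sensor_data <- data.frame(
  sens_name   = c("sens_A", "sens_A", "sens_B", "sens_B"),
  time        = posixct(c("17/12/11 06:45", "17/12/11 08:01",
                          "17/12/11 05:32", "18/12/11 03:43")),
  measurement = c(32.3321, 36.1290, 17.1122, 12.3189)
)
shutdowns <- data.frame(
  shutdown_start = posixct(c("17/12/11 07:46", "17/12/11 08:23",
                             "17/12/11 09:00", "18/12/11 01:42")),
  shutdown_end   = posixct(c("17/12/11 08:23", "17/12/11 09:00",
                             "17/12/11 13:30", "18/12/11 07:43"))
)

# TRUE when the i-th timestamp lies inside at least one [start, end] interval.
in_shutdown <- vapply(seq_len(nrow(sensor_data)), function(i) {
  t <- sensor_data$time[i]
  any(t >= shutdowns$shutdown_start & t <= shutdowns$shutdown_end)
}, logical(1))

running <- sensor_data[!in_shutdown, ]  # machine switched on
stopped <- sensor_data[in_shutdown, ]   # logged during a shutdown
print(running)
```

Each timestamp is compared against every interval, so the cost grows with rows times intervals. That is fine for a few hundred shutdowns, but a very large log might call for a sorted-interval approach instead.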

Plot a histogram of subset of a data

(Image: a screenshot of the .txt file of the data.)
The data consists of 2,075,259 rows and 9 columns.
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Only data from the dates 2007-02-01 and 2007-02-02 is needed.
I was trying to plot a histogram of "Global_active_power" in the above mentioned dates.
Note that in this dataset missing values are coded as "?".
This is the code I was using to try to plot the histogram:
{
data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]
my_data <- gsub(";", " ", my_data) # replace ";" with " "
my_data <- gsub("?", "NA", my_data) # convert "?" to "NA"
my_data <- as.numeric(my_data) # turn into numbers
hist(my_data["Global_active_power"])
}
After running the code it is showing this error:
Error in hist.default(my_data["Global_active_power"]) :
invalid number of 'breaks'
Can you please help me spot the mistake in the code?
Link of the data file : https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
You need to provide the separator (";") explicitly and your types aren't what you think they are, observe:
data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
my_data <- data[data$Date >= bottom.date & data$Date <= top.date, 3] # inclusive bounds; column 3 is Global_active_power
hist(my_data)
This gives the histogram. Hope that helps.
Given that you have 2 million rows (though not too many columns), you're firmly in fread territory.
Here's how I would do what you want:
library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
na.strings=c("?","NA"),colClasses="character" #2
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% seq(from=as.Date("2007-02-01"), #3
to=as.Date("2007-02-02"),by="day")]
numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]
data[,hist(Global_active_power)] #5
A brief explanation of what's going on
1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)
2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 ? ? ? ? ? ? NA
Once we set na.strings, we sidestep having to replace ? as NA later:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 NA NA NA NA NA NA
On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.
3: data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there are only marginal performance differences.
4: Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.
5: Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:
ok_dates<-seq(from=as.Date("2007-02-01"),
to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
select=c("Date","Global_active_power"),
na.strings=c("?","NA"),colClasses="character"
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% ok_dates,hist(as.numeric(Global_active_power))]

R date to Excel based number

I know that I can get a date from an Excel based number (days since 1899-12-30) in the following way:
as.Date(41000, origin = "1899-12-30")
which will give me "2012-04-01". I want however the opposite. As a user I would like to input a date as a string and get the number of days since "1899-12-30".
Something along the lines
as.integer(as.Date('2012-04-01', origin="1899-12-30"))
which I hoped would result in 41000 and not in the R based days since 1970-01-01 which is 15431.
Maybe this is silly as I realize that I can add the days manually by writing something like:
as.integer(as.Date('2012-04-01')) + 25569
I just wondered if there is a function which does this?
I think you want difftime as in:
difftime(as.Date('2012-04-01'), as.Date("1899-12-30"))
## Time difference of 41000 days
Do it by hand, simpler and safer:
d0 <- as.Date('1899-12-30')
d1 <- as.Date('2014-10-28')
as.integer(d1 - d0)
##[1] 41940 # This is interpreted by Excel as '2014-10-28'
Of course, you can write a function to convert an R date to an Excel one:
convert_to_excel_date <- function(d) {
  # Converts an R date value to an Excel date value
  #
  # Parameters:
  # d: an R Date object
  d0 <- as.Date('1899-12-30')
  return(as.integer(d - d0))
}
# Example:
# convert_to_excel_date(as.Date('2018-10-28'))
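To check that the conversion really is the inverse of as.Date(n, origin = "1899-12-30"), a small round-trip sketch in base R follows. The function names to_excel_serial and from_excel_serial are hypothetical. The 1899-12-30 origin lines up with Excel's default 1900 date system only for dates from 1900-03-01 onwards, because Excel counts a non-existent 29 February 1900.

```r
excel_origin <- as.Date("1899-12-30")

# Date (or "YYYY-MM-DD" string) -> Excel serial number (whole days).
to_excel_serial <- function(d) as.integer(as.Date(d) - excel_origin)

# Excel serial number -> Date.
from_excel_serial <- function(n) as.Date(n, origin = excel_origin)

serial <- to_excel_serial("2012-04-01")
print(serial)                     # 41000
print(from_excel_serial(serial))  # "2012-04-01"
```

Subtracting two Date objects yields a difftime in days, so as.integer is all that is needed to drop the units.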

Resources