Plot a histogram of subset of a data - r

!The image shows the screen shot of the .txt file of the data.
The data consists of 2,075,259 rows and 9 columns
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Only data from the dates 2007-02-01 and 2007-02-02 is needed.
I was trying to plot a histogram of "Global_active_power" in the above mentioned dates.
Note that in this dataset missing values are coded as "?"]
This is the code i was trying to plot the histogram:
{
data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]
my_data <- gsub(";", " ", my_data) # replace ";" with " "
my_data <- gsub("?", "NA", my_data) # convert "?" to "NA"
my_data <- as.numeric(my_data) # turn into numbers
hist(my_data["Global_active_power"])
}
After running the code it is showing this error:
Error in hist.default(my_data["Global_active_power"]) :
invalid number of 'breaks'
Can you please help me spot the mistake in the code.
Link of the data file : https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip

You need to provide the separator (";") explicitly and your types aren't what you think they are, observe:
data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
my_data <- data[data$Date > bottom.date & data$Date < top.date,3]
hist(my_data)
Gives as the plot. Hope that helps.

Given you have 2m rows (though not too many columns), you're firmly into fread territory;
Here's how I would do what you want:
library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
na.strings=c("?","NA"),colClasses="character" #2
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% seq(from=as.Date("2007-02-01"), #3
to=as.Date("2007-02-02"),by="day")]
numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]
data[,hist(Global_active_power)] #5
A brief explanation of what's going on
1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)
2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 ? ? ? ? ? ? NA
Once we set na.strings, we sidestep having to replace ? as NA later:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 NA NA NA NA NA NA
On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.
data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there's only marginal performance differences.
Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.
Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:
ok_dates<-seq(from=as.Date("2007-02-01"),
to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
select=c("Date","Global_active_power"),
na.strings=c("?","NA"),colClasses="character"
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% ok_dates,hist(as.numeric(Global_active_power))]

Related

Argument is not numeric

I would like to visualize the number of people infected with COVID-19, but I am unable to obtain the mortality rate because the number of deaths is derived by int when obtaining the mortality rate per 100,000 population for each prefecture.
What I want to achieve
I want to find the solution of "covid19j_20200613$POP2019 * 100" by setting the data type of "covid19j_20200613$deaths" to num.
Error message.
Error in covid19j_20200613$deaths/covid19j_20200613$POP2019:
Argument of binary operator is not numeric
Source code in question.
library(spdep)
library(sf)
library(spatstat)
library(tidyverse)
library(ggplot2)
needs::prioritize(magrittr)
covid19j <- read.csv("https://raw.githubusercontent.com/kaz-ogiwara/covid19/master/data/prefectures.csv",
header=TRUE)
# Below is an example for May 20, 2020.
# Month and date may be changed
covid19j_20200613 <- dplyr::filter(covid19j,
year==2020,
month==6,
date==13)
covid19j_20200613$CODE <- 1:47
covid19j_20200613[is.na(covid19j_20200613)] <- 0
pop19 <- read.csv("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/pop2019.csv", header=TRUE)
covid19j_20200613 <- dplyr::inner_join(covid19j_20200613, pop19,
by = c("CODE" = "CODE"))
# Load Japan prefecture administrative boundary data
jpn_pref <- sf::st_read("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/jpn_pref.shp")
# Data and concatenation
jpn_pref_cov19 <- dplyr::inner_join(jpn_pref, covid19j_20200613, by=c("PREF_CODE"="CODE"))
ggplot2::ggplot(data = jpn_pref_cov19) +
geom_sf(aes(fill=testedPositive)) +
scale_fill_distiller(palette="RdYlGn") +
theme_bw() +
labs(title = "Tested Positiv of Covid19 (2020/06/13)")
# Mortality rate per 100,000 population
# Population number in units of 1000
as.numeric(covid19j_20200613$deaths)
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
Source code in question.
prefectures.csv
https://docs.google.com/spreadsheets/d/11C2vVo-jdRJoFEP4vAGxgy_AEq7pUrlre-i-zQVYDd4/edit?usp=sharing
pop2019.csv
https://docs.google.com/spreadsheets/d/1CbEX7BADutUPUQijM0wuKUZFq2UUt-jlWVQ1ipzs348/edit?usp=sharing
What we tried
I tried to put "as.numeric(covid19j_20200613$deaths)" before the calculation and set the number of dead to type
num, but I got the same error message during the calculation.
Additional information (FW/tool versions, etc.)
iMac M1 2021, R 4.2.0
Translated with www.DeepL.com/Translator (free version)
as.numeric() does not permanently change the data type - it only does it temporarily.
So when you're running as.numeric(covid19j_20200613$deaths), this shows you the column deaths as numeric, but the column will stay a character.
So if you want to coerce the data type, you need to also reassign:
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths)
covid19j_20200613$POP2019 <- as.numeric(covid19j_20200613$POP2019)
# Now you can do calculations
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
It's easier to read if you use mutate from dplyr:
covid19j_20200613 <- covid19j_20200613 |>
mutate(
deaths = as.numeric(deaths),
POP2019 = as.numeric(POP2019),
death_rate = deaths / POP2019 * 100
)
Result
deaths POP2019 deaths_rate
1 91 5250 1.73333333
2 1 1246 0.08025682
3 0 1227 0.00000000
4 1 2306 0.04336513
5 0 966 0.00000000
PS: your question is really difficult to follow! There is a lot of stuff that we don't actually need to answer it, so that makes it harder for us to identify where the issue is. For example, all the data import, the join, the ggplot...
When writing a question, please only include the minimal elements that lead to a problem. In your case, we only needed a sample dataset with the deaths and POP2019 columns, and the two lines of code that you tried to fix at the end.
If you look at str(covid19j) you'll see that the deaths column is a character column containing a lot of blanks. You need to figure out the structure of that column to read it properly.

R: Is there a way to subset a file while reading

I have a huge .csv file, its size is ~ 1.4G and reading with read.csv takes time. There are several variables in that file and all i want is to extract data for few variables in a certain column.
For example, suppose ABC.csv is my file and it looks something like this:
ABC.csv
Date Variables Val
2017-11-01 X 23
2017-11-01 A 2
2017-11-01 B 0.5
............................
2017-11-02 X 20
2017-11-02 C 40
............................
2017-11-03 D 33
2017-11-03 X 22
............................
............................
So , here the variable of interest is X and while reading this file i want the df$Variables to be scanned reading only the rows with X string in this column. So that my new data from will look something like this:
> df
Date Variables Val
2017-11-01 X 23
2017-11-02 X 20
.........................
.........................
Any Help will be appreciated. Thank you in advance.
Check out the LaF package, it allows to read very large textfiles in blocks, so you don't have to read the entire file into memory.
library(LaF)
data_model <- detect_dm_csv("yourFile.csv", skip = 1) # detects the file structure
dat <- laf_open(data_model) # opens connection to the file
block_list <- lapply(seq(1,100000,1000), function(row_num){
goto(dat, row_num)
data_block <- next_block(dat, nrows = 1000) # reads data blocks of 1000 rows
data_block <- data_block[data_block$Variables == "X",]
return(data_block)
})
your_df <- do.call("rbind", block_list)
Admittedly, the package sometimes feels a bit bulky and in some situations I had to find small hacks to get my results (you might have to adapt my solution for your data). Nevertheless, I found it a immensely useful solution for dealing with files that exceeded my RAM.
Just wondering if doing this works. It worked for my code but I am not sure whether it is first reading in the entire data and then subsetting or is it only reading the part of the file where Variables == 'X'.
temp <- fread('dat.csv')[Variables == 'X']
I would say that most of the time you can probably just read in the entire file, and then subset within R:
df <- read.csv(file="path/to/your/file.csv", header=TRUE)
df.x <- df[df$Variables=='x', ]
R operates completely in memory, so an exception to what I said above might occur if you have a file whose total size is so massive that it cannot fit into memory, but for some reason the subset of interest can.

Changing multiple columns of a data frame from class 'character' to class 'time' using chron

I have a data frame with multiple columns, some of which I need to change to 'time' class using chron so that I can retrieve basic statistics. These columns are currently times stored as characters and formatted like this: hh:mm.
Here is a subset of it as well as the list of columns that need to change:
> Data
DATE FLT TYPE REG AC DEP ARR STD STA ATD ATA
1 15-01-02 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:00 13:59
2 15-01-04 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:17 14:13
3 15-01-05 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:20 14:14
Time_list <-c("STD","STA","ATD","ATA")
Here is what I have done to change only one column (and it works):
Data$ATA <- paste0(Data$ATA, ':00')
Data$ATA<-chron(times.=Data$ATA)
class(Data$ATA)
[1] "times"
However, I would prefer to be able to do all the columns at the same time since there are many of them. I've tried multiple techniques and some seem to work for the first part, which is pasting ':00', but it always goes wrong for the second part, using chron . I seem to have a length problem that I don't understand
Using dmap
Data[,Time_list]<-
Data%>%
select(one_of(Time_list)) %>%
dmap(paste0,':00')
Data[,Time_list]<-
Data %>%
select(one_of(Time_list)) %>%
dmap(chron,times.=Data[,Time_list])
**Error in .f(.d[[i]], ...) :
.d[[i]] and Data[, Time_list] must have equal lengths**
Using apply
YEVdata[,(Time_list)] <- lapply(YEVdata[,(Time_list)], paste0,':00')
Data[,(Time_list)] <- lapply(Data[,(Time_list)], chron, times. =Data[,(Time_list)])
**Error in FUN(X[[i]], ...) :
X[[i]] and Data[, (Time_list)] must have equal lengths**
Using a forloop
I tried using a for loop, but I'm just a beginner and could get anywhere.
Using "simple" solution from another Stack Overflow question.
It just made a mess, even pasting.
Efficiently transform multiple columns of a data frame
Any ideas in plain beginner language would be very appreciated! If it is possible to nest both operations, it would be even better!
dplyr::mutate_at would work for this situation. You define the variables you want to mutate and then define the function you want to use.
You can do the pasting and converting to a time in a single step within funs using the . notation and nesting functions.
library(dplyr)
Data = mutate_at(Data, Time_list, funs(chron(times. = paste0(., ":00"))))

Removing rows of data in R below a specified value

I was wondering if anybody could help...
I have a data frame which includes a continuous time column and I am trying to remove all rows below a specified time.
The data starts from approx. 11:29:00 but I want to remove all rows before the time 12:30.00 and after the time 14:20.00.
Since the data is recorded every second, deleting unnecessary rows will be a great help and make managing this data a whole lot easier for me so any help would be greatly appreciated.
This is the head of the data frame, as you can see the time is continuous in seconds. I would like to remove all these rows up to 12:30:00 within the GPS.Time column. Hope that makes sense.
Raw.Vel. Smooth.Vel. GPS.Time
1.486 0.755 11:39:39
1.425 1.167 11:39:40
1.466 1.398 11:39:41
1.533 1.552 11:39:42
1.517 1.594 11:39:43
1.918 1.556 11:39:44
Creating above data frame:
Raw.Vel. <- c(1.486,1.425, 1.466, 1.533, 1.517, 1.918)
Smooth.Vel. <- c(0.755, 1.167, 1.398, 1.552, 1.594, 1.556)
GPS.Time <- c("11:39:39", "11:39:40", "11:39:41", "11:39:42", "11:39:43", "11:39:44")
sample <- data.frame(Raw.Vel., Smooth.Vel., GPS.Time)
Thanks in advance.
Use the lubridate package to transform your string time column into some kind of time class:
library(lubridate)
sample$GPS.Time <- hms(sample$GPS.Time)
To achieve the required output, just use subsetting with brackets ([), with the condition you want. In your example, I removed all rows up to 11:39:42.
output <- sample[sample$GPS.Time < hms("11:39:42"),]
Turn the GPS.Time into a "POSIXct" object:
df$time <- as.POSIXct(df$GPS.Time, format="%H:%M:%S")
Then you can filter using logic:
filtered_df <- df[df$time < as.POSIXct("12:30:00", format="%H:%M:%S"), ]
You can convert the entries in the "GPS.Time" columns into characters (this is originally a factor variable). After that you can separate the set by comparing the times with a specified cutoff-time stored as a character string that should be written in the same format (HH:MM:SS):
sample$GPS.Time <- as.character(sample$GPS.Time)
cutoff_time <- "11:39:42" # modify as necessary
sample <- sample[-which(sample$GPS.Time < cutoff_time),] #remove all rows with times smaller than the cutoff_time
#> sample
# Raw.Vel. Smooth.Vel. GPS.Time
#4 1.533 1.552 11:39:42
#5 1.517 1.594 11:39:43
#6 1.918 1.556 11:39:44

unused arguments error using apply() in R

I get an error message when I attempt to use apply() conditional on a column of dates to return a set of coefficients.
I have a dataset (herein modified for simplicity, but reproducible):
ADataset <- data.table(Epoch = c("2007-11-15", "2007-11-16", "2007-11-17",
"2007-11-18", "2007-11-19", "2007-11-20", "2007-11-21"),
Distance = c("92336.22", "92336.23", "92336.22", "92336.20",
"92336.19", "92336.21", "92336.18))
ADataset
Epoch Distance
1: 2007-11-15 92336.22
2: 2007-11-16 92336.23
3: 2007-11-17 92336.22
4: 2007-11-18 92336.20
5: 2007-11-19 92336.19
6: 2007-11-20 92336.21
7: 2007-11-21 92336.18
The analysis begins with establishing start and end dates:
############## Establish dates for analysis
#4.Set date for center of duration
StartDate <- "2007-11-18"
as.numeric(as.Date(StartDate)); StartDate
EndDate <- as.Date(tail(Adataset$Epoch,1)); EndDate
Then I establish time durations for analysis:
#5.Quantify duration of time window
STDuration <- 1
LTDuration <- 3
Then I write functions to regress over both durations and return the slopes:
# Write STS and LTS functions, each with following steps
#6.Define time window- from StartDate less ShortTermDuration to
StartDate plus ShortTermDuration
#7.Define Short Term & Long Term datasets
#8. Run regression over dataset
my_STS_Function <- function (StartDate) {
STAhead <- as.Date(StartDate) + STDuration; STAhead
STBehind <- as.Date(StartDate) - STDuration; STBehind
STDataset <- subset(Adataset, as.Date(Epoch) >= STBehind & as.Date(Epoch)<STAhead)
STResults <- rlm( Distance ~ Epoch, data=STDataset); STResults
STSummary <- summary( STResults ); STSummary
# Return coefficient (Slope of regression)
STNum <- STResults$coefficients[2];STNum
}
my_LTS_Function <- function (StartDate) {
LTAhead <- as.Date(StartDate) + LTDuration; LTAhead
LTBehind <- as.Date(StartDate) - LTDuration; LTBehind
LTDataset <- subset(Adataset, as.Date(Epoch) >= LTBehind & as.Date(Epoch)<LTAhead)
LTResults <- rlm( Distance ~ Epoch, data=LTDataset); LTResults
LTSummary <- summary( LTResults ); LTSummary
# Return coefficient (Slope of regression)
LTNum <- LTResults$coefficients[2];LTNum
Then I test the function to make sure it works for a single date:
myTestResult <- my_STS_Function("2007-11-18")
It works, so I move on to apply the function over the range of dates in the dataset:
mySTSResult <- apply(Adataset, 1, my_STS_Function, seq(StartDate : EndDate))
...in which my desired result is a list or array or vector of mySTSResult (slopes) (and, subsequently, a separate list/array/vector of myLTSResults so then I can create a STSlope:LTSlope ratio over the duration), something like (mySTSResults fabricated)...
> Adataset
Epoch Distance mySTSResults
1: 2007-11-15 92336.22 3
2: 2007-11-16 92336.23 4
3: 2007-11-17 92336.22 5
4: 2007-11-18 92336.20 6
5: 2007-11-19 92336.19 7
6: 2007-11-20 92336.21 8
7: 2007-11-21 92336.18 9
Only I get this error:
Error in FUN(newX[, i], ...) : unused argument(s) (1:1185)
What is this telling me and how to do correct it? I've done some looking and cannot find the correction.
Hopefully I've explained this sufficiently. Please let me know if you need further details.
Ok, it seems the problem is in the additional arguments to my_STS_Function as stated in your apply function call (as you have defined it with only one parameter). The date range is being passed as an additional parameter to that function, and R is complaining that it is unused (a vector of 1185 elements it seems). Are you rather trying to pull a subset of the rows restricted by date range first, then wishing to apply the my_STS_Function? I'd have to think a bit on an exact solution to that.
Sorry - I did my working out in the comments there. A possible solution is this:
subSet <- Adataset[Adataset[,1] %in% seq(StartDate:EndDate),][order(na.exclude(match(Adataset[,1], seq(StartData,EndDate))),]
Adapted from the answer in this question:
R select rows in matrix from another vector (match, %in)
Adding this as a new answer as the previous one was getting confused. A previous commenter was correct, there are bugs in your code, but they aren't a sticking point.
My updated approach was to use seq.Date to generate the date sequence (only works if you have a data point for each day between the start and end - though you could use na.exclude as above):
dates = seq.Date(as.Date(StartDate),as.Date(EndDate),"days")
You then use this as the input to apply, with some munging of types to get things working correctly (I've done this with a lamda function):
mySTSResult <- apply(as.matrix(dates), 1, function(x) {class(x) <- "Date"; my_STS_Function(x)})
Then hopefully you should have a vector of the results, and you should be able to do something similar for LTS, and then manipulate that into another column in your original data frame/matrix.

Resources