I have a dataset titled nypd, which has a column titled OCCUR_TIME. This column contains various times (ex: 3:57:00, 10:31:00, 22:15:00, etc.).
I would like to create a custom TIME_OF_DAY column using R; I wrote this code below:
nypd$TIME_OF_DAY <- 'Night'
nypd[nypd$OCCUR_TIME >= 6:00:00 & nypd$OCCUR_TIME < 12:00:00,] <- 'Morning'
nypd[nypd$OCCUR_TIME >= 12:00:00 & nypd$OCCUR_TIME < 16:00:00,] <- 'Afternoon'
nypd[nypd$OCCUR_TIME >= 16:00:00 & nypd$OCCUR_TIME < 20:00:00,] <- 'Evening'
The error I am getting is Error in `[<-.data.frame`(`*tmp*`, nypd$OCCUR_TIME >= "6:00:00" & nypd$OCCUR_TIME < : missing values are not allowed in subscripted assignments of data frames.
I'm new to R so I am not too familiar with the error codes, but I'm thinking the error is due to my values in the OCCUR_TIME column not being read as a "time" type of value, so I can't use any operators.
Could someone please help me figure out where I'm going wrong? Thank you!
First, as the error is saying, you have missing values in your data. Since we don't have your data to work with, let's make up some data to use:
> data(iris)
> iris$Petal.Length[3:5] <- NA
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 NA 0.2 setosa
4 4.6 3.1 NA 0.2 setosa
5 5.0 3.6 NA 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now R has a problem with subsetting on Petal.Length: the logical index is NA wherever Petal.Length is missing, so it cannot tell whether those rows should be assigned.
> iris[iris$Petal.Length > 1.2 & iris$Petal.Length < 1.5, ] <- 50
Error in `[<-.data.frame`(`*tmp*`, iris$Petal.Length > 1.2 & iris$Petal.Length < :
missing values are not allowed in subscripted assignments of data frames
Also note that when you do this:
nypd[nypd$OCCUR_TIME >= 6:00:00 & nypd$OCCUR_TIME < 12:00:00,] <- 'Morning'
You aren't telling it which column you want to assign 'Morning' to, so every column in the matching rows would be overwritten!
You can add a test for is.na to your boolean, and include the variable name you want to affect:
> iris[!is.na(iris$Petal.Length) & iris$Petal.Length > 1.2 & iris$Petal.Length < 1.5, 'Petal.Length'] <- 50
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 50.0 0.2 setosa
2 4.9 3.0 50.0 0.2 setosa
3 4.7 3.2 NA 0.2 setosa
4 4.6 3.1 NA 0.2 setosa
5 5.0 3.6 NA 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The advice about learning how to deal with dates and times in R is sound: the way you are expressing them here is not right. If they are being read in as a factor, you may need to add stringsAsFactors = FALSE wherever you read your data.
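Applied to the question's data, the same two fixes (guarding against NA and naming the target column) might look like the sketch below. It assumes OCCUR_TIME is stored as character strings in H:MM:SS form. The helper to_seconds and the small nypd data frame are made up for illustration. The times are compared as seconds since midnight because comparing the raw strings would sort "10:31:00" before "6:00:00".

```r
# Hypothetical stand-in for the real data; OCCUR_TIME assumed to be character
nypd <- data.frame(OCCUR_TIME = c("3:57:00", "10:31:00", "12:00:00",
                                  "17:45:00", "22:15:00", NA),
                   stringsAsFactors = FALSE)

# Illustrative helper: "H:MM:SS" -> seconds since midnight (NA stays NA)
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

secs <- to_seconds(nypd$OCCUR_TIME)
ok   <- !is.na(secs)   # FALSE wherever the time is missing

# Default label, then overwrite only the TIME_OF_DAY column of matching rows
nypd$TIME_OF_DAY <- "Night"
nypd[ok & secs >= 6 * 3600  & secs < 12 * 3600, "TIME_OF_DAY"] <- "Morning"
nypd[ok & secs >= 12 * 3600 & secs < 16 * 3600, "TIME_OF_DAY"] <- "Afternoon"
nypd[ok & secs >= 16 * 3600 & secs < 20 * 3600, "TIME_OF_DAY"] <- "Evening"
nypd[!ok, "TIME_OF_DAY"] <- NA   # unknown time, so unknown time of day

print(nypd)
```

Whether a missing time should end up as NA or stay "Night" is a choice for your analysis; the last assignment can simply be dropped if you prefer the latter.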
We could convert 'OCCUR_TIME' to the ITime class with as.ITime from data.table, then do the comparison:
library(dplyr)
library(data.table)
nypd %>%
mutate(OCCUR_TIME = as.ITime(OCCUR_TIME),
TIME_OF_DAY = case_when(between(OCCUR_TIME, as.ITime("06:00:00"),
as.ITime("12:00:00")) ~ "Morning",
between(OCCUR_TIME, as.ITime("12:00:00"),
as.ITime("16:00:00")) ~ "Afternoon",
between(OCCUR_TIME, as.ITime("16:00:00"),
as.ITime("20:00:00")) ~ "Evening", TRUE ~ "Night"))
# OCCUR_TIME TIME_OF_DAY
#1 05:22:34 Night
#2 07:22:29 Morning
#3 12:20:05 Afternoon
#4 15:46:23 Afternoon
#5 19:32:42 Evening
data
nypd <- data.frame(OCCUR_TIME = c("05:22:34", "07:22:29", "12:20:05",
"15:46:23", "19:32:42"), stringsAsFactors = FALSE)
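One thing to keep in mind with between(): it includes both endpoints by default, so a time of exactly 12:00:00 matches the "Morning" condition first, whereas the question asked for >= 6:00 and < 12:00. A base R sketch with half-open intervals is shown below. It uses findInterval on seconds since midnight instead of ITime, and the input vector is invented to include the boundary values.

```r
# Illustrative input; includes the exact boundaries 12:00:00 and 20:00:00
occur_time <- c("05:22:34", "07:22:29", "12:00:00",
                "15:46:23", "19:32:42", "20:00:00")

# Seconds since midnight, computed in base R
hms  <- do.call(rbind, lapply(strsplit(occur_time, ":", fixed = TRUE), as.numeric))
secs <- hms[, 1] * 3600 + hms[, 2] * 60 + hms[, 3]

# Interval i covers [breaks[i], breaks[i + 1]); the last one runs to midnight
breaks <- c(0, 6, 12, 16, 20) * 3600
labels <- c("Night", "Morning", "Afternoon", "Evening", "Night")
time_of_day <- labels[findInterval(secs, breaks)]

print(data.frame(occur_time, time_of_day))
```

Here 12:00:00 falls into "Afternoon" and 20:00:00 into "Night", matching the comparisons in the question.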
I'm using the Ionosphere dataset in R and am trying to write a loop that will create new columns that are standardized iterations of existing columns and name them accordingly.
I've got the "cname" as the new column name and c as the original. The code is:
install.packages("mlbench")
library(mlbench)
data('Ionosphere')
library(robustHD)
col <- colnames(Ionosphere)
for (c in col[1:length(col)-1]){
cname <- paste(c,"Std")
Ionosphere$cname <- standardize(Ionosphere$c)
}
But get the following error:
"Error in `$<-.data.frame`(`*tmp*`, "cname", value = numeric(0)) :
replacement has 0 rows, data has 351
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA"
I feel like there's something super-simple I'm missing but I just can't see it.
Any help gratefully received.
We can use lapply, a custom-made standardization function, setNames, and cbind.
I do not have access to your dataset, so I am using the iris dataset as an example:
df<-iris
cbind(df, setNames(lapply(df[1:4],
\(x) (x - mean(x))/sd(x)),
paste0(names(df[1:4]), '_Std')))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_Std Sepal.Width_Std Petal.Length_Std Petal.Width_Std
1 5.1 3.5 1.4 0.2 setosa -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 4.9 3.0 1.4 0.2 setosa -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 4.7 3.2 1.3 0.2 setosa -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 4.6 3.1 1.5 0.2 setosa -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 5.0 3.6 1.4 0.2 setosa -1.01843718 1.24503015 -1.33575163 -1.3110521482
...
I feel these transformations get easier with dplyr:
library(dplyr)
iris %>% mutate(across(where(is.numeric),
~ (.x - mean(.x))/sd(.x),
.names = "{col}_Std"))
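As for why the original loop fails: `$` takes its argument literally, so Ionosphere$cname looks for a column actually called "cname" and Ionosphere$c for one called "c" (which does not exist, hence the zero-length result). Indexing with [[ ]] uses the value of the variable instead. A minimal sketch of the corrected loop follows. It uses iris as a stand-in for Ionosphere and a hand-written z-score in place of robustHD::standardize, and it skips non-numeric columns, which is what triggers the "argument is not numeric or logical" warning.

```r
df <- iris   # stand-in for Ionosphere

# Only numeric columns can be standardized
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

for (col in num_cols) {
  # paste0() avoids the space that paste() would insert into the name
  new_name <- paste0(col, "_Std")
  # [[ ]] looks up the column whose name is stored in the variable
  df[[new_name]] <- (df[[col]] - mean(df[[col]])) / sd(df[[col]])
}

print(head(df, 3))
```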
I feel I have a simple question, but I cannot get my code to work. In short, I want the condition statement in a subset() function to be a string. This mostly works, except for the logical operator. So I would want something like this;
my.string = "gender == female"
Subsequently I would run;
myData = subset(myData, my.string)
I have tried things like;
myData = subset(myData, parse(text = my.string))
myData = subset(myData, eval(parse(text = my.string)))
But to no avail. The main reason I want to do this is that I want to be able to define filter conditions up front in the code, so this would be;
filter.variable[[1]] = "gender"
filter.condition[[1]] = "==" # or %in%
filter.value[[1]] = "female"
i = 1
my.string = paste(filter.variable[[i]],filter.condition[[i]],filter.value[[i]])
This way I do not have to hardwire any filters in R.
Any suggestions are much appreciated,
Alex
We need quotes around 'female' inside the string, so that it is parsed as a character value rather than as an object name. This can be done easily with dQuote:
my.string <- paste0('gender == ', dQuote('female', FALSE))
Or write the double quotes into the string directly:
my.string = 'gender== "female"'
and then use that in subset with eval(parse(text = ...)).
Using a reproducible example
my.string <- paste0('Species == ', dQuote('setosa', FALSE))
subset(iris, eval(parse(text = my.string)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
#7 4.6 3.4 1.4 0.3 setosa
#8 5.0 3.4 1.5 0.2 setosa
# ...
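Since the question already keeps the variable, the operator and the value in separate pieces, another option is to skip the string entirely and call the operator as a function. This is only a sketch: apply_filter is a made-up helper name, and it assumes the operator is given as a string naming a binary function such as "==" or "%in%".

```r
# Hypothetical helper: filter rows of `data` by <variable> <operator> <value>
apply_filter <- function(data, variable, operator, value) {
  op   <- match.fun(operator)          # e.g. "==" or "%in%" -> the function itself
  keep <- op(data[[variable]], value)  # logical vector, one entry per row
  data[!is.na(keep) & keep, , drop = FALSE]
}

setosa   <- apply_filter(iris, "Species", "==", "setosa")
two_spec <- apply_filter(iris, "Species", "%in%", c("setosa", "virginica"))

print(nrow(setosa))
print(nrow(two_spec))
```

This avoids the quoting problem altogether, because the value is passed as an R object rather than pasted into code.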
I would like to use dplyr to divide a subset of variables by the IQR. I am open to ideas that use a different approach than what I've tried before, which is a combination of mutate_if and %in%. I want to reference the list bin instead of indexing the data frame by position. Thanks for any thoughts!
contin <- c("age", "ct")
data %>%
mutate_if(%in% contin, function(x) x/IQR(x))
You should use:
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Working example:
library(dplyr)
data <- head(iris)
contin <- c("Sepal.Length", "Sepal.Width")
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 15.69231 7.777778 1.4 0.2 setosa
2 15.07692 6.666667 1.4 0.2 setosa
3 14.46154 7.111111 1.3 0.2 setosa
4 14.15385 6.888889 1.5 0.2 setosa
5 15.38462 8.000000 1.4 0.2 setosa
6 16.61538 8.666667 1.7 0.4 setosa
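If you would rather not depend on dplyr, roughly the same thing can be done in base R by assigning back into the named columns. The sketch below reuses the data and contin objects from the working example; na.rm = TRUE is an added assumption in case the real columns contain missing values.

```r
data   <- head(iris)
contin <- c("Sepal.Length", "Sepal.Width")

# Replace each listed column with itself divided by its IQR
data[contin] <- lapply(data[contin], function(x) x / IQR(x, na.rm = TRUE))

print(data)
```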
I want to make a function that accepts an argument, uses it in subset(), and then plots a graph with multiple lines. I wrote the following code:
plot.new( )
rest_o_noise <- function(noise_level, color) {
rest_o_noise_level = subset(yelp_flat, attributes.RestaurantsPriceRange2!= "NA", eval(parse(text=noise_level)))
rest_o_noise_level <- rest_o_noise_level %>%
group_by(attributes.RestaurantsPriceRange2) %>%
summarise(n=mean(stars))
lines(rest_o_noise_level, stars, col=color)
}
rest_o_noise("attributes.NoiseLevel=='loud'", "green")
rest_o_noise("attributes.NoiseLevel=='low'", "green")
I am getting an error:
Error in grouped_df_impl(data, unname(vars), drop) : Column attributes.RestaurantsPriceRange2 is unknown
Just to be clear attributes.RestaurantsPriceRange2 is present in csv.
final output should look like:
Is this the correct way to plot?
Please help!!
Getting a subset of rows from iris data using a conditional in a string
Species <- as.character(iris$Species)
noise_level <- "Species == \"setosa\""
subset(iris, Sepal.Length == 5.1 &
eval(parse(text=noise_level)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 18 5.1 3.5 1.4 0.3 setosa
# 20 5.1 3.8 1.5 0.3 setosa
# 22 5.1 3.7 1.5 0.4 setosa
# 24 5.1 3.3 1.7 0.5 setosa
# 40 5.1 3.4 1.5 0.2 setosa
# 45 5.1 3.8 1.9 0.4 setosa
# 47 5.1 3.8 1.6 0.2 setosa
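In your case it would be something like the code below, which puts the conditions on both attributes.RestaurantsPriceRange2 and attributes.NoiseLevel into the second argument of subset() to select a subset of rows. The third argument of subset() selects columns, which is probably why the grouping column went missing in your version.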
plot.new( )
rest_o_noise <- function(noise_level, color) {
rest_o_noise_level = subset(yelp_flat,
(attributes.RestaurantsPriceRange2!= "NA") &
eval(parse(text=noise_level)))
rest_o_noise_level <- rest_o_noise_level %>%
group_by(attributes.RestaurantsPriceRange2) %>%
summarise(n=mean(stars))
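# After summarise() only the grouping column and n remain (stars is gone); assumes the price range is numeric
lines(rest_o_noise_level$attributes.RestaurantsPriceRange2, rest_o_noise_level$n, col=color)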
}
rest_o_noise("attributes.NoiseLevel==\"loud\"", "green")
You can, of course, select a single column, as below. However, why you would use the "==" for that is not clear, hence I assumed you are trying to select a subset of rows.
noise_level <- "Petal.Width"
subset(iris, Sepal.Length == 5.1,
eval(parse(text=noise_level)))
# Petal.Width
# 1 0.2
# 18 0.3
# 20 0.3
# 22 0.4
# 24 0.5
# 40 0.2
# 45 0.4
# 47 0.2
# 99 1.1
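To make the row-filtering version easier to test outside a plot, the filtering and the group means can be pulled into a small function that returns a data frame. The sketch below uses base R and the built-in mtcars data in place of yelp_flat. The function name mean_by_group and its arguments are invented for illustration, and it assumes the grouping column holds numeric values.

```r
# Hypothetical helper: filter rows by a condition string, then average
# `value` within each level of `group`
mean_by_group <- function(data, condition, group, value) {
  # Evaluate the condition string with the data frame's columns in scope
  keep  <- eval(parse(text = condition), envir = data, enclos = parent.frame())
  rows  <- data[!is.na(keep) & keep, , drop = FALSE]
  means <- tapply(rows[[value]], rows[[group]], mean)
  data.frame(group = as.numeric(names(means)), mean = as.vector(means))
}

manual <- mean_by_group(mtcars, "am == 1", "cyl", "mpg")
auto   <- mean_by_group(mtcars, "am == 0", "cyl", "mpg")

print(manual)
print(auto)
```

The first result can then be drawn with plot(manual$group, manual$mean, type = "l") and each further one added with lines(auto$group, auto$mean, col = "green"), which also sets up the axes that plot.new() alone does not.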
I am using R to classify a data-frame called 'd' containing data structured like below:
The data has 576666 rows and the column "classLabel" has a factor of 3 levels: ONE, TWO, THREE.
I am making a decision tree using rpart:
fitTree = rpart(d$classLabel ~ d$tripduration + d$from_station_id + d$gender + d$birthday)
And I want to predict the values for the "classLabel" for newdata:
newdata = data.frame( tripduration=c(345,244,543,311),
from_station_id=c(60,28,100,56),
gender=c("Male","Female","Male","Male"),
birthday=c(1972,1955,1964,1967) )
p <- predict(fitTree, newdata)
I expect my result to be a matrix of 4 rows each with a probability of the three possible values for "classLabel" of newdata. But what I get as the result in p, is a dataframe of 576666 rows like below:
I also get the following warning when running the predict function:
Warning message:
'newdata' had 4 rows but variables found have 576666 rows
Where am I going wrong?
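The warning comes from the formula: because it refers to d$tripduration, d$from_station_id and so on, predict() keeps using those full-length vectors from d and ignores the columns of newdata. Fit the tree with bare column names and data = d instead, in the same style as the iris example further down. With that fixed, predict() returns the probability matrix by default for a classification tree; if you want the predicted labels instead, add type = "class" to the prediction call: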
predict(fitTree,newdata,type="class")
Try the following code. I use the iris dataset in this example.
> library(rpart)
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# model fitting
> fitTree<-rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,iris)
#prediction-one row data
> newdata<-data.frame(Sepal.Length=7,Sepal.Width=4,Petal.Length=6,Petal.Width=2)
> newdata
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6 2
# perform prediction
> predict(fitTree, newdata,type="class")
1
virginica
Levels: setosa versicolor virginica
#prediction-multiple-row data
> newdata2<-data.frame(Sepal.Length=c(7,8,6,5),
+ Sepal.Width=c(4,3,2,4),
+ Petal.Length=c(6,3.4,5.6,6.3),
+ Petal.Width=c(2,3,4,2.3))
> newdata2
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6.0 2.0
2 8 3 3.4 3.0
3 6 2 5.6 4.0
4 5 4 6.3 2.3
# perform prediction
> predict(fitTree,newdata2,type="class")
1 2 3 4
virginica virginica virginica virginica
Levels: setosa versicolor virginica
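Since the question asked for one row of class probabilities per new observation, here is a short sketch of that variant on the same iris model. The two rows in newdata are made up for illustration. The formula uses bare column names together with data = iris, which is what lets predict() match the columns of newdata.

```r
library(rpart)

# Bare column names plus data =, not iris$Species ~ iris$Sepal.Length + ...
fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             data = iris)

# Invented observations
newdata <- data.frame(Sepal.Length = c(7, 5),
                      Sepal.Width  = c(4, 3.5),
                      Petal.Length = c(6, 1.4),
                      Petal.Width  = c(2, 0.2))

probs  <- predict(fit, newdata)                  # default for a class tree: probabilities
labels <- predict(fit, newdata, type = "class")  # predicted labels instead

print(probs)
print(labels)
```

probs has one row per row of newdata and one column per class level, which is the shape the question was expecting.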