I want to write a function that accepts an argument, uses it in subset(), and then plots a graph with multiple lines. I wrote the following code:
plot.new( )
rest_o_noise <- function(noise_level, color) {
rest_o_noise_level = subset(yelp_flat, attributes.RestaurantsPriceRange2!= "NA", eval(parse(text=noise_level)))
rest_o_noise_level <- rest_o_noise_level %>%
group_by(attributes.RestaurantsPriceRange2) %>%
summarise(n=mean(stars))
lines(rest_o_noise_level, stars, col=color)
}
rest_o_noise("attributes.NoiseLevel=='loud'", "green")
rest_o_noise("attributes.NoiseLevel=='low'", "green")
I am getting an error:
Error in grouped_df_impl(data, unname(vars), drop) : Column attributes.RestaurantsPriceRange2 is unknown
Just to be clear, attributes.RestaurantsPriceRange2 is present in the CSV.
The final output should look like:
Is this the correct way to plot?
Please help!!
Here is how to get a subset of rows from the iris data using a condition stored in a string:
Species <- as.character(iris$Species)
noise_level <- "Species == \"setosa\""
subset(iris, Sepal.Length == 5.1 &
eval(parse(text=noise_level)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 18 5.1 3.5 1.4 0.3 setosa
# 20 5.1 3.8 1.5 0.3 setosa
# 22 5.1 3.7 1.5 0.4 setosa
# 24 5.1 3.3 1.7 0.5 setosa
# 40 5.1 3.4 1.5 0.2 setosa
# 45 5.1 3.8 1.9 0.4 setosa
# 47 5.1 3.8 1.6 0.2 setosa
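In your case, it would be something like the code below, which conditions on the columns attributes.RestaurantsPriceRange2 and attributes.NoiseLevel to select a subset of rows. The key change is that both conditions are combined with & inside the second argument of subset(); in your original call the string condition was passed as the third argument (select), which chooses columns rather than rows.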
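plot.new()
# plot.new() leaves the coordinate system at 0-1, so set the axis ranges first.
# Assumed ranges: price range 1-4, stars 1-5; adjust to your data.
plot.window(xlim = c(1, 4), ylim = c(1, 5))
axis(1); axis(2)
rest_o_noise <- function(noise_level, color) {
  rest_o_noise_level <- subset(yelp_flat,
                               (attributes.RestaurantsPriceRange2 != "NA") &
                                 eval(parse(text = noise_level)))
  rest_o_noise_level <- rest_o_noise_level %>%
    group_by(attributes.RestaurantsPriceRange2) %>%
    summarise(n = mean(stars))
  # summarise() stores the mean in column n; the stars column no longer exists here
  lines(as.numeric(as.character(rest_o_noise_level$attributes.RestaurantsPriceRange2)),
        rest_o_noise_level$n, col = color)
}
rest_o_noise("attributes.NoiseLevel==\"loud\"", "green")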
You can, of course, select a single column instead, as below. However, it is not clear why you would use "==" for that, so I assumed you are trying to select a subset of rows.
noise_level <- "Petal.Width"
subset(iris, Sepal.Length == 5.1,
eval(parse(text=noise_level)))
# Petal.Width
# 1 0.2
# 18 0.3
# 20 0.3
# 22 0.4
# 24 0.5
# 40 0.2
# 45 0.4
# 47 0.2
# 99 1.1
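As an aside, if the only thing that changes between calls is the noise level, you can pass the level itself instead of a code string and skip eval(parse()) entirely. A minimal sketch, assuming a made-up yelp_toy data frame (the real yelp_flat is not shown in the question) and a hypothetical helper named mean_stars_by_price():

```r
library(dplyr)

# Hypothetical stand-in for yelp_flat; column names follow the question.
yelp_toy <- data.frame(
  attributes.RestaurantsPriceRange2 = c(1, 2, 3, 1, 2, 3, 1, 2),
  attributes.NoiseLevel = c("loud", "loud", "loud", "quiet",
                            "quiet", "quiet", "loud", NA),
  stars = c(3, 3.5, 4, 4, 4.5, 5, 4, 2)
)

# Hypothetical helper: mean stars per price range for one noise level.
mean_stars_by_price <- function(data, noise_level) {
  data %>%
    filter(!is.na(attributes.RestaurantsPriceRange2),
           attributes.NoiseLevel == noise_level) %>%  # rows with NA noise are dropped
    group_by(attributes.RestaurantsPriceRange2) %>%
    summarise(mean_stars = mean(stars))
}

loud  <- mean_stars_by_price(yelp_toy, "loud")
quiet <- mean_stars_by_price(yelp_toy, "quiet")

# plot() sets up the axes; lines() then adds further series to the same plot.
plot(loud$attributes.RestaurantsPriceRange2, loud$mean_stars, type = "l",
     col = "red", ylim = c(1, 5), xlab = "Price range", ylab = "Mean stars")
lines(quiet$attributes.RestaurantsPriceRange2, quiet$mean_stars, col = "green")
```

Passing a plain value keeps the function safe from arbitrary code in the string, and the first plot() call takes care of the axis ranges that plot.new() alone does not set.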
I would like to use dplyr to divide a subset of variables by the IQR. I am open to ideas that use a different approach than what I've tried before, which is a combination of mutate_if and %in%. I want to reference the list bin instead of indexing the data frame by position. Thanks for any thoughts!
contin <- c("age", "ct")
data %>%
mutate_if(%in% contin, function(x) x/IQR(x))
You should use:
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Working example:
data <- head(iris)
contin <- c("Sepal.Length", "Sepal.Width")
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 15.69231 7.777778 1.4 0.2 setosa
2 15.07692 6.666667 1.4 0.2 setosa
3 14.46154 7.111111 1.3 0.2 setosa
4 14.15385 6.888889 1.5 0.2 setosa
5 15.38462 8.000000 1.4 0.2 setosa
6 16.61538 8.666667 1.7 0.4 setosa
I'm looking to use a non-across function from mutate to create multiple columns. My problem is that the variable in the function will change along with the crossed variables. Here's an example:
needs=c('Sepal.Length','Petal.Length')
iris %>% mutate_at(needs, ~./'{col}.Width')
This obviously doesn't work, but I'm looking to divide Sepal.Length by Sepal.Width and Petal.Length by Petal.Width.
I think needs should hold the prefix that is common to each pair of columns.
You can then select the columns that match each pattern in needs and divide them by position. !! and := are used to assign the names of the new columns.
library(dplyr)
library(rlang)
needs = c('Sepal','Petal')
purrr::map_dfc(needs, ~iris %>%
select(matches(.x)) %>%
transmute(!!paste0(.x, '_divide') := .[[1]]/.[[2]]))
# Sepal_divide Petal_divide
#1 1.457142857 7.000000000
#2 1.633333333 7.000000000
#3 1.468750000 6.500000000
#4 1.483870968 7.500000000
#...
#...
If you want to add these as new columns, you can bind_cols() the result above with iris.
Here is a base R approach, assuming that the columns you want to divide share a name pattern:
res <- sapply(split.default(iris[-ncol(iris)], sub('\\..*', '', names(iris[-ncol(iris)]))), function(i) i[1] / i[2])
iris[names(res)] <- res
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Petal.Length Sepal.Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 7.00 1.457143
#2 4.9 3.0 1.4 0.2 setosa 7.00 1.633333
#3 4.7 3.2 1.3 0.2 setosa 6.50 1.468750
#4 4.6 3.1 1.5 0.2 setosa 7.50 1.483871
#5 5.0 3.6 1.4 0.2 setosa 7.00 1.388889
#6 5.4 3.9 1.7 0.4 setosa 4.25 1.384615
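If you would rather build the column names explicitly than rely on position, a plain loop also works. A small sketch, assuming every prefix has a matching .Length and .Width column; the .Ratio suffix and the iris_ratio copy are names made up for illustration:

```r
needs <- c("Sepal", "Petal")

# Work on a copy so the built-in iris data stays untouched.
iris_ratio <- iris

for (prefix in needs) {
  length_col <- paste0(prefix, ".Length")
  width_col  <- paste0(prefix, ".Width")
  # New column, e.g. Sepal.Ratio = Sepal.Length / Sepal.Width
  iris_ratio[[paste0(prefix, ".Ratio")]] <-
    iris_ratio[[length_col]] / iris_ratio[[width_col]]
}

head(iris_ratio)
```

This gives one clearly named ratio column per prefix and fails with an obvious error if one of the expected columns is missing.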
I have a dataset titled nypd, which has a column titled OCCUR_TIME. This column contains various times (ex: 3:57:00, 10:31:00, 22:15:00, etc.).
I would like to create a custom TIME_OF_DAY column using R; I wrote this code below:
nypd$TIME_OF_DAY <- 'Night'
nypd[nypd$OCCUR_TIME >= 6:00:00 & nypd$OCCUR_TIME < 12:00:00,] <- 'Morning'
nypd[nypd$OCCUR_TIME >= 12:00:00 & nypd$OCCUR_TIME < 16:00:00,] <- 'Afternoon'
nypd[nypd$OCCUR_TIME >= 16:00:00 & nypd$OCCUR_TIME < 20:00:00,] <- 'Evening'
The error I am getting is Error in `[<-.data.frame`(`*tmp*`, nypd$OCCUR_TIME >= "6:00:00" & nypd$OCCUR_TIME < : missing values are not allowed in subscripted assignments of data frames.
I'm new to R so I am not too familiar with the error codes, but I'm thinking the error is due to my values in the OCCUR_TIME column not being read as a "time" type of value, so I can't use any operators.
Could someone please help me figure out where I'm going wrong? Thank you!
First, as the error is saying, you have missing values in your data. Since we don't have your data to work with, let's make up some data to use:
> data(iris)
> iris$Petal.Length[3:5] <- NA
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 NA 0.2 setosa
4 4.6 3.1 NA 0.2 setosa
5 5.0 3.6 NA 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now R has a problem when we assign into a subset based on Petal.Length: the logical index contains NA for the missing values, and R cannot tell whether those rows should be modified.
> iris[iris$Petal.Length > 1.2 & iris$Petal.Length < 1.5, ] <- 50
Error in `[<-.data.frame`(`*tmp*`, iris$Petal.Length > 1.2 & iris$Petal.Length < :
missing values are not allowed in subscripted assignments of data frames
Also note that when you do this:
nypd[nypd$OCCUR_TIME >= 6:00:00 & nypd$OCCUR_TIME < 12:00:00,] <- 'Morning'
You aren't telling it what variable you want to assign 'Morning' to!
You can add a test for is.na to your boolean, and include the variable name you want to affect:
> iris[!is.na(iris$Petal.Length) & iris$Petal.Length > 1.2 & iris$Petal.Length < 1.5, 'Petal.Length'] <- 50
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 50.0 0.2 setosa
2 4.9 3.0 50.0 0.2 setosa
3 4.7 3.2 NA 0.2 setosa
4 4.6 3.1 NA 0.2 setosa
5 5.0 3.6 NA 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The advice about learning how to deal with dates and times in R is sound: the way you are writing them here (a bare 6:00:00) is not a valid time in R. If the column is being read in as a factor, you may need to add stringsAsFactors = FALSE to whatever call reads your data.
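Putting both points together for a time column, here is a rough sketch. It assumes OCCUR_TIME is a character column in H:MM:SS format; nypd_toy is made-up data, and only the hour is used, which is enough for whole-hour boundaries like the ones in the question:

```r
# Made-up data, including one missing time.
nypd_toy <- data.frame(
  OCCUR_TIME = c("3:57:00", "10:31:00", "22:15:00", NA, "12:00:00", "16:30:00"),
  stringsAsFactors = FALSE
)

# Keep only the part before the first colon and convert it to an integer hour.
hour <- as.integer(sub(":.*$", "", nypd_toy$OCCUR_TIME))

# Default label is "Night"; rows with a missing time stay NA.
nypd_toy$TIME_OF_DAY <- ifelse(is.na(hour), NA_character_, "Night")

# Name the target column and guard each condition against NA.
nypd_toy$TIME_OF_DAY[!is.na(hour) & hour >= 6  & hour < 12] <- "Morning"
nypd_toy$TIME_OF_DAY[!is.na(hour) & hour >= 12 & hour < 16] <- "Afternoon"
nypd_toy$TIME_OF_DAY[!is.na(hour) & hour >= 16 & hour < 20] <- "Evening"

nypd_toy
```

If the boundaries ever need minutes or seconds, converting to a proper time class is the more robust route.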
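We could convert 'OCCUR_TIME' to a time class with as.ITime from data.table and then do the comparison. Note that between() includes both endpoints, so a time of exactly 12:00:00 is labelled "Morning" because case_when() uses the first condition that matches.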
library(dplyr)
library(data.table)
nypd %>%
mutate(OCCUR_TIME = as.ITime(OCCUR_TIME),
TIME_OF_DAY = case_when(between(OCCUR_TIME, as.ITime("06:00:00"),
as.ITime("12:00:00")) ~ "Morning",
between(OCCUR_TIME, as.ITime("12:00:00"),
as.ITime("16:00:00")) ~ "Afternoon",
between(OCCUR_TIME, as.ITime("16:00:00"),
as.ITime("20:00:00")) ~ "Evening", TRUE ~ "Night"))
# OCCUR_TIME TIME_OF_DAY
#1 05:22:34 Night
#2 07:22:29 Morning
#3 12:20:05 Afternoon
#4 15:46:23 Afternoon
#5 19:32:42 Evening
data
nypd <- data.frame(OCCUR_TIME = c("05:22:34", "07:22:29", "12:20:05",
"15:46:23", "19:32:42"), stringsAsFactors = FALSE)
I am trying to replace all columns selected with select() with data of the same size.
A reproducible example is
library(tidyverse)
iris = as_data_frame(iris)
temp = cbind( runif(nrow(iris)), runif(nrow(iris)), runif(nrow(iris)), runif(nrow(iris)))
select(iris, -one_of("Petal.Length")) = temp
Then I get the error
Error in select(iris, -one_of("Petal.Length")) = temp : could not find
function "select"
Thanks for any comments.
You want to bind columns of two data frames, so you can simply use bind_cols():
library(tidyverse)
iris <- as_tibble(iris)
temp <- tibble(r1 = runif(nrow(iris)), r2 = runif(nrow(iris)), r3 = runif(nrow(iris)), r4 = runif(nrow(iris)))
select(iris, -Petal.Length) %>% bind_cols(temp)
# or use:
# bind_cols(iris, temp) %>% select(-Petal.Length)
which gives you:
# A tibble: 150 × 8
Sepal.Length Sepal.Width Petal.Width Species r1 r2 r3 r4
<dbl> <dbl> <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 0.2 setosa 0.7208566 0.1367070 0.04314771 0.4909396
2 4.9 3.0 0.2 setosa 0.4101884 0.4795735 0.75318182 0.1463689
3 4.7 3.2 0.2 setosa 0.6270065 0.5425814 0.26599432 0.1467248
4 4.6 3.1 0.2 setosa 0.8001282 0.4691908 0.73060637 0.0792256
5 5.0 3.6 0.2 setosa 0.5663895 0.4745482 0.65088630 0.5360953
6 5.4 3.9 0.4 setosa 0.8813042 0.1560600 0.41734507 0.2582568
7 4.6 3.4 0.3 setosa 0.5046977 0.9555570 0.22118401 0.9246906
8 5.0 3.4 0.2 setosa 0.5283764 0.4730212 0.24982471 0.6313071
9 4.4 2.9 0.2 setosa 0.5976045 0.4717439 0.14270551 0.2149888
10 4.9 3.1 0.1 setosa 0.3919660 0.5125420 0.95001067 0.5259598
# ... with 140 more rows
We can use -> to assign the output to 'temp'
select(iris, -one_of("Petal.Length")) -> temp
Following the tidyverse paradigm, you could use:
dplyr::mutate_at(iris, vars(-one_of("Petal.Length")), .funs = funs(runif))
Although the sample above reproduces the behaviour with random numbers, it will probably not suit your needs; I suppose you want to match features and rows to the ones in temp.
That can be done by transforming iris and temp into long format and then joining and replacing the data accordingly, for example with the *_join functions.
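If the goal really is to overwrite the selected columns in place, base R indexing by column name may be enough. A minimal sketch, assuming temp has exactly as many columns as are being replaced and that they are matched by position; iris_new is a copy made up for illustration:

```r
set.seed(1)  # only so the random numbers are reproducible

iris_new <- iris
# Replacement data with one column per column being replaced.
temp <- as.data.frame(matrix(runif(nrow(iris_new) * 4), ncol = 4))

# Every column except Petal.Length, in their original order.
cols <- setdiff(names(iris_new), "Petal.Length")

# Assigning a data frame to df[cols] replaces those columns by position
# and keeps the original column names.
iris_new[cols] <- temp

head(iris_new)
```

Note that this also overwrites Species with numbers, just as the select(iris, -one_of("Petal.Length")) in the question would have included it.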
I am using R to classify a data-frame called 'd' containing data structured like below:
The data has 576666 rows and the column "classLabel" has a factor of 3 levels: ONE, TWO, THREE.
I am making a decision tree using rpart:
fitTree = rpart(d$classLabel ~ d$tripduration + d$from_station_id + d$gender + d$birthday)
And I want to predict the values for the "classLabel" for newdata:
newdata = data.frame( tripduration=c(345,244,543,311),
from_station_id=c(60,28,100,56),
gender=c("Male","Female","Male","Male"),
birthday=c(1972,1955,1964,1967) )
p <- predict(fitTree, newdata)
I expect my result to be a matrix of 4 rows each with a probability of the three possible values for "classLabel" of newdata. But what I get as the result in p, is a dataframe of 576666 rows like below:
I also get the following warning when running the predict function:
Warning message:
'newdata' had 4 rows but variables found have 576666 rows
What am I doing wrong?
I think the problem is that you should add type = "class" to the prediction code:
predict(fitTree,newdata,type="class")
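Try the following code. I use the iris dataset in this example. Note that the model is fit with bare column names rather than d$column terms, which is what lets predict() look the variables up in newdata.
> library(rpart)
> data(iris)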
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# model fitting
> fitTree<-rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,iris)
#prediction-one row data
> newdata<-data.frame(Sepal.Length=7,Sepal.Width=4,Petal.Length=6,Petal.Width=2)
> newdata
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6 2
# perform prediction
> predict(fitTree, newdata,type="class")
1
virginica
Levels: setosa versicolor virginica
#prediction-multiple-row data
> newdata2<-data.frame(Sepal.Length=c(7,8,6,5),
+ Sepal.Width=c(4,3,2,4),
+ Petal.Length=c(6,3.4,5.6,6.3),
+ Petal.Width=c(2,3,4,2.3))
> newdata2
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6.0 2.0
2 8 3 3.4 3.0
3 6 2 5.6 4.0
4 5 4 6.3 2.3
# perform prediction
> predict(fitTree,newdata2,type="class")
1 2 3 4
virginica virginica virginica virginica
Levels: setosa versicolor virginica
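The warning in the question ('newdata' had 4 rows but variables found have 576666 rows) is what you get when the formula hard-codes the training data with d$..., because predict() then re-reads those vectors instead of the columns of newdata. A small sketch of the difference using iris; bad_fit and good_fit are names made up for illustration:

```r
library(rpart)

# Formula that hard-codes the data frame: predict() cannot substitute newdata.
bad_fit <- rpart(iris$Species ~ iris$Petal.Length + iris$Petal.Width)

# Formula with bare column names plus a data argument.
good_fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

newdata <- data.frame(Petal.Length = c(6, 1.4), Petal.Width = c(2, 0.2))

# Still one row per training observation (and a warning about row counts).
bad_pred <- suppressWarnings(predict(bad_fit, newdata))
nrow(bad_pred)

# One row per row of newdata; for a classification tree the default
# output is a matrix of class probabilities.
probs <- predict(good_fit, newdata)
probs

# type = "class" returns the predicted label instead.
classes <- predict(good_fit, newdata, type = "class")
classes
```

So if you want the probability matrix described in the question, fitting with data = d and calling predict(fitTree, newdata) without a type should already give one row per new observation.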