I want to replace an existing value in one column of my dataset with a new value: T with 0.0. The column type is character, so I put the values in quotes. The error points at the =. How can I resolve this problem? (I am required to do the replacement in R with the mutate function.)
# Install tidymodels if you haven't done so
install.packages("rlang")
install.packages("tidymodels")
install.packages("dplyr")
# Library for modeling
library(tidymodels)
# Load tidyverse
library(tidyverse)
library(dplyr)
URL <- 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz'
download.file (URL, destfile='noaa-weather-sample-data.tar.gz')
untar('noaa-weather-sample-data.tar.gz',tar = 'internal')
dataset<- read.csv ('noaa-weather-sample-data/jfk_weather_sample.csv')
head(dataset)
glimpse(dataset)
subset_data <- data.frame(dataset$HOURLYRelativeHumidity,dataset$HOURLYDRYBULBTEMPF,dataset$HOURLYStationPressure,dataset$HOURLYWindSpeed,dataset$HOURLYPrecip)
subset_data<-setNames(subset_data,c('HOURLYRelativeHumidity','HOURLYDRYBULBTEMPF','HOURLYStationPressure','HOURLYWindSpeed', 'HOURLYPrecip'))
head(subset_data,10)
unique(subset_data$HOURLYPrecip)
data_new <- subset_data %>% # Replacing values
mutate(subset_data$HOURLYPrecip = replace(subset_data$HOURLYPrecip, subset_data$HOURLYPrecip == 'T', '99'))
View(data_new)
and I got this error:
Error: unexpected '=' in:
"data_new <- subset_data %>% # Replacing values
mutate(subset_data$HOURLYPrecip ="
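For what it's worth, the parser stops at the = because the left-hand side of an argument inside mutate() has to be a bare column name, and subset_data$HOURLYPrecip is not a valid argument name. Inside mutate() the columns can be referred to directly, without the subset_data$ prefix. A minimal sketch, assuming a small made-up data frame in place of the NOAA sample (the values are invented for illustration) and using "0.0" as the replacement, as described in the question text:

```r
library(dplyr)

# Hypothetical stand-in for the HOURLYPrecip column of the NOAA sample data.
subset_data <- data.frame(
  HOURLYPrecip = c("0.00", "T", "0.02", "T"),
  stringsAsFactors = FALSE
)

# Use the bare column name on both sides of '='.
data_new <- subset_data %>%
  mutate(HOURLYPrecip = replace(HOURLYPrecip, HOURLYPrecip == "T", "0.0"))

print(data_new)
```

If the column should end up numeric, as.numeric(HOURLYPrecip) could be applied in a further mutate() step once no non-numeric strings remain.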
This function filters/selects one or more variables from my dataset and writes it to a new CSV file. I'm getting an 'object not found' error when I call the function. Here is the function:
extract_ids <- function(filename, opp, ...) {
#Read in data
df <- read_csv(filename)
#Remove rows 2,3
df <- df[-c(1,2),]
#Filter and select
df_id <- filter(df, across(..., ~ !is.na(.x)) & gc == 1) %>%
select(...) #not sure if my use of ... here is correct
#String together variables for export file path
path <- c("/Users/stephenpoole/Downloads/",opp,"_",...,".csv") #not sure if ... here is correct
#Export the file
write_csv(df_id, paste(path,collapse=''))
}
And here is the function call. I'm trying to get columns "rid" and "cintid."
extract_ids(filename = "farmers.csv",
opp = "farmers",
rid, cintid)
When I run this, I get the below error:
Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `across(..., ~!is.na(.x)) & gc == 1`.
x object 'cintid' not found
The column cintid is correct and appears in the data. I've also tried running it with just one column, rid, and get the same 'object not found' error.
If you are passing multiple values to across(), you need to collect them in the first parameter, otherwise they will spread into the other parameters of across(). Try
filter(df, across(c(...), ~ !is.na(.x)) & gc == 1)
Otherwise every value other than the first one will be passed along as an argument to the function you've specified in across().
Sorry for omitting this in my previous suggestion to you. Unfortunately, your original question was closed before I could post it as an answer:
If you want your function to resemble dplyr, here's a few
modifications you can make. Write your function header as
function(filename, opp, ...) verbatim. Then, replace !is.na(ID)
with across(..., ~ !is.na(.x)) verbatim. Now, you can call
extract_ids() and, just as you would with any dplyr verb, you can
specify any selection of columns you want to filter out NAs:
extract_ids(filename = "farmers.csv", opp = "farmers", rid, another_column_you_want_without_NAs).
Object Not Found
As MrFlick rightly suggests in their comment, you should wrap ... with c(), so everything you pass in ... is interpreted as the first argument to across(): a single tidy-selection of columns from df:
extract_ids <- function(filename, opp, ...) {
# ...
# Filter and select
df_id <- df %>%
# This format is preferred for dplyr workflows with pipes (%>%).
filter(across(c(...), ~ !is.na(.x)) & gc == 1) %>%
select(...)
# ...
}
Without this precaution, R interprets rid and cintid as multiple arguments to across(), rather than as simply columns named by the first argument (the tidy-selection).
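As a side note, newer dplyr releases (1.0.4 and later, as far as I know) provide if_all() for exactly this kind of row filtering, and using across() inside filter() has since been deprecated. Here is a minimal sketch of the same idea with if_all(); the function name keep_complete and the toy data are made up for illustration, only the column names come from the question:

```r
library(dplyr)

# Toy data; rid, cintid and gc mirror the column names in the question.
df <- tibble(
  rid    = c(1, NA, 3, 4),
  cintid = c("a", "b", NA, "d"),
  gc     = c(1, 1, 1, 0)
)

# Hypothetical helper: keep rows where all selected columns are non-missing
# and gc == 1, then keep only the selected columns.
keep_complete <- function(data, ...) {
  data %>%
    filter(if_all(c(...), ~ !is.na(.x)) & gc == 1) %>%
    select(...)
}

result <- keep_complete(df, rid, cintid)
print(result)
```

The c(...) wrapper plays the same role as in the across() version: it turns everything passed in ... into a single tidy-selection.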
Variable Names in the Filepath
To get those variable names within your filepath, use
extract_ids <- function(filename, opp, ...) {
# ...
# Expand the '...' into a list of given variable names, which will get pasted.
path <- c("/Users/stephenpoole/Downloads/", opp, "_", match.call(expand.dots = FALSE)$`...`, ".csv")
# ...
}
though you might want to consider replacing match.call(expand.dots = FALSE)$`...`, which currently mushes together the variable names:
"/Users/stephenpoole/Downloads/farmers_ridcintid.csv"
In exactly the same place, you might use the expression paste(match.call(expand.dots = FALSE)$`...`, collapse = "-"), which will separate those variable names using -
"/Users/stephenpoole/Downloads/farmers_rid-cintid.csv"
or any other separator of your choice that gives a valid filename.
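If you would rather not rely on match.call(), a base R alternative is to capture the unevaluated ... with substitute() and deparse each piece. This is only a sketch; build_path and its sep argument are names I made up, not part of the original function:

```r
# Hypothetical helper: build the export path from the directory, the 'opp'
# label, and the names of the columns passed in '...'.
build_path <- function(dir, opp, ..., sep = "-") {
  # substitute(list(...)) gives the call list(rid, cintid); dropping the
  # first element leaves the unevaluated column names.
  vars <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  file.path(dir, paste0(opp, "_", paste(vars, collapse = sep), ".csv"))
}

path <- build_path("/Users/stephenpoole/Downloads", "farmers", rid, cintid)
print(path)
```

Because the names are never evaluated, this works even though rid and cintid do not exist as objects in the calling environment.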
Imagine I'm using this code in a shiny application where columns are filtered using shiny input:
library(magrittr)
library(DT) # version 0.18
data_viz <- data.frame(Item = c("Milk", "Bread", "Flour"), Quantity = c(2,3,4), Price = c(4,5,6)) # Original data
data_table_viz <- data_viz[, c("Item", "Quantity")] # Filtering columns on the go using Shiny app input
datatable(data = data_table_viz) %>% formatCurrency(c("Price")) # Throws error: You specified the columns: Price, but the column names of the data are , Item, Quantity
It throws error:
Error in name2int(name, names, rownames) :
You specified the columns: Price, but the column names of the data are , Item, Quantity
The error is understandable but I would like to avoid this error and instead ignore the column "Price" and render the remaining data. Below is a workaround:
datatable(data = data_table_viz) %>% formatCurrency(c("Price")[c("Price") %in% colnames(data_table_viz)])
It used to work until DT package version 0.13 but it stopped working afterwards.
It now throws error:
Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) :
zero-length inputs cannot be mixed with those of non-zero length
Does anyone have other workarounds for this issue or should I keep using the older version of the package DT?
You can add an if condition to check if the column is present.
library(magrittr)
library(DT)
dt <- datatable(data = data_table_viz)
if('Price' %in% colnames(data_table_viz)) dt <- dt %>% formatCurrency("Price")
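If several columns may or may not be present, the same check can be wrapped in a small helper. This is a sketch under my own naming (format_if_present is hypothetical, not a DT function); it keeps only the requested columns that exist and skips the formatter entirely when none do, which avoids the zero-length error from the question:

```r
# Hypothetical helper: apply a DT formatter only to columns that exist.
format_if_present <- function(widget, data, cols, formatter, ...) {
  present <- intersect(cols, colnames(data))
  if (length(present) == 0) {
    return(widget)  # nothing to format, return the widget unchanged
  }
  formatter(widget, present, ...)
}

data_viz <- data.frame(Item = c("Milk", "Bread", "Flour"),
                       Quantity = c(2, 3, 4), Price = c(4, 5, 6))
data_table_viz <- data_viz[, c("Item", "Quantity")]

# Only runs when DT is installed.
if (requireNamespace("DT", quietly = TRUE)) {
  dt <- DT::datatable(data = data_table_viz)
  dt <- format_if_present(dt, data_table_viz, "Price", DT::formatCurrency)
}
```

Passing the formatter as an argument means the same helper could be reused with formatRound() or formatPercentage().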
I desperately need help!
I am trying to predict drug use based on 5 characteristics: Age, Gender, Education, Ethnicity, Country. I have already built a tree model in R with rpart
DrugTree3 <- rpart(formula = DrugUser ~ Age+Gender+Education+Ethnicity+Country, data = traindata)
, a logistic regression model
DrugLog <- glm(formula = DrugUser ~ Age+Gender+Ethnicity+Education+Country,data = traindata, family = binomial)
, and a knn model
KnnModel <- train(form = DrugUser~., data = ModelData,method ='knn',tuneGrid=expand.grid(.k=1:100),metric='Accuracy',trControl=trainControl(method='repeatedcv',number=10,repeats=10)) .
I saved those as RDS files and uploaded them successfully in Power BI.
I then created a table for each characteristic and created okviz filters for them.
Then I tried to predict whether a customer gets predicted as a drug user or a non-drug user based on the selections in the okviz filters. This is when everything went horribly wrong:
I created a custom R visual for each model prediction and inserted the following code in each visual:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Error: Can't determine relationship between fields.
What has gone wrong here?
When I then clicked on the diagonal arrow to get to R Studio, this happens: Unable to construct R script data for use in external R IDE.
I need help as I am literally going crazy over this and I don't know how to resolve the issue! I would be really happy if you can help me
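For readers unfamiliar with the chunking trick used in the script, here is a small round-trip sketch in plain R, outside Power BI. It shows a serialized object stored as space-separated hex bytes, split into chunks, damaged by losing a trailing space, and repaired the same way the script does. to_byte_string is a hypothetical inverse I wrote for the demo, the chunk size of 10 is a demo value (the nchar == 9999 check in the script suggests 10000 there), and the list stands in for a fitted model:

```r
# Hypothetical inverse of from_byte_string(): object -> "58 0a 00 ..." string.
to_byte_string <- function(obj) {
  paste(as.character(serialize(obj, connection = NULL)), collapse = " ")
}

# Same logic as from_byte_string() in the script.
from_byte_string <- function(x) {
  xcharvec <- strsplit(x, " ")[[1]]
  unserialize(as.raw(as.hexmode(xcharvec)))
}

model <- list(coef = c(a = 1.5, b = -2), label = "demo")  # stand-in for a model
s <- to_byte_string(model)

# Split the string into fixed-width chunks.
chunk_size <- 10
starts <- seq(1, nchar(s), by = chunk_size)
chunks <- substring(s, starts, starts + chunk_size - 1)

# Simulate the import dropping a trailing space from some chunks.
damaged <- sub(" $", "", chunks)

# Repair: a chunk that is one character short must have lost its space.
repaired <- ifelse(nchar(damaged) == chunk_size - 1,
                   paste0(damaged, " "), damaged)

restored <- from_byte_string(paste(repaired, collapse = ""))
stopifnot(identical(restored, model))
```

This only demonstrates the deserialization step; it says nothing about the "Can't determine relationship between fields" message itself.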
You made an error in line 34 and in line 25.
Below is a fixed version of your code.
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class =
Good Luck!
Data
I work with a large dataset (280 million rows), for which Spark and R seem to work nicely.
Problem
I've had problems with SparkR's regexp_extract function. I expected it to work analogously to stringr's str_detect, but I haven't managed to get it to work. The documentation for regexp_extract is limited. Could you please give me a hand?
Reprex
Here is a reprex where I try to identify strings that do not have a space and paste " 00:01" as a suffix.
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
dplyr::mutate(sampletaken = ifelse(regexp_extract(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Error
error: org.apache.spark.sql.AnalysisException: cannot resolve '(NOT regexp_extract(df.sampletaken, ' ', 1))' due to data type mismatch: argument 1 requires boolean type, however, 'regexp_extract(df.sampletaken, ' ', 1)' is of string type.; line 1 pos 80;
Solution
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
dplyr::mutate(sampletaken1 = ifelse(rlike(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken1)
Probably rlike is what you're after if you're looking for the analog to str_detect, see the SQL API docs:
str rlike regexp - Returns true if str matches regexp, or false otherwise.
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true
On a Column (i.e., in R, rather than in Spark SQL through sql()), it would look like:
rlike(Column, 'regex.*pattern')
# i.e., in magrittr form
Column %>% rlike('regex.*pattern')
Note that like is usually more efficient if you can use it since the set of valid like patterns is much smaller.
I'm not familiar with SparkR, but it seems that the function regexp_extract returns a string (presumably the matched pattern in the string) instead of a boolean, which ifelse requires as its test.
You may try to match the returned value against the empty string.
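To make the difference between "extract" and "detect" concrete, here is a local base R analogue that needs no Spark connection. It is only an illustration of the logic, not SparkR or sparklyr code; the sample values are taken from the reprex:

```r
sampletaken <- c("06/03/2013", "29/11/2005 8:30", "15/01/2007 12:25")

# "Detect" returns a logical vector, which is what ifelse() needs as its test.
has_time <- grepl(" ", sampletaken, fixed = TRUE)

# "Extract" returns strings instead: the matched text, or "" when nothing matches.
m <- regexpr(" ", sampletaken, fixed = TRUE)
extracted <- ifelse(m > 0, substring(sampletaken, m, m), "")

# Comparing the extracted string against "" turns it back into a logical.
stopifnot(identical(extracted != "", has_time))

result <- ifelse(has_time, sampletaken, paste(sampletaken, "00:01"))
print(result)
```

On the Spark side, the corresponding test would presumably be something like regexp_extract(sampletaken, " ", 0) != "" inside mutate(), since Spark SQL's regexp_extract returns an empty string when there is no match; I have not run that against a cluster, so treat it as an assumption.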
I'm trying to write this StatsBomb Data into a CSV but I keep on getting the following error message:
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) :
Don't know how to handle vector of type list.
I'm lost (tried multiple things) and not sure what I did wrong here. Is there anyone out here who knows how to solve this? I've included my code below.
library(StatsBombR)
library(tidyverse)
### Read in all free events and matches from the FAWSL
data <- StatsBombFreeEvents()
matches <- FreeMatches(Competitions = 72)
### Clean and separate all data loaded above
dataclean <- allclean(data)
### Filter event data to include only FAWSL data.
data1 <- dataclean %>%
filter(dataclean$competition_id == 72)
### Join event and match data by "match_id"
data1 <- left_join(data1, matches, by = "match_id")
FullData <- data1 %>%
select(-c(related_events, tactics.lineup, shot.freeze_frame, location, pass.end_location, shot.end_location, goalkeeper.end_location))
setwd()
write_csv(FullData, "StatsBomb_FullData.csv")
I had the same problem. Unlisting the column fixed mine.
df$listcolumn <- sapply(df$listcolumn, function(x) paste0(unlist(x), collapse = "\n"))
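If you don't know in advance which columns are lists, you can detect and flatten all of them in one go before writing. A base R sketch, assuming a small made-up data frame (the column names and the ";" separator are just examples) and write.csv() in place of write_csv():

```r
# Toy data frame with one list column, standing in for the StatsBomb data.
df <- data.frame(id = 1:2)
df$location <- list(c(1.5, 2), c(3, 4.25))

# Find every list column.
is_list_col <- vapply(df, is.list, logical(1))

# Collapse each list element into a single string.
flat <- df
flat[is_list_col] <- lapply(flat[is_list_col], function(col) {
  vapply(col, function(x) paste(unlist(x), collapse = ";"), character(1))
})

# Writing now works because every column is an atomic vector.
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
print(flat)
```

Dropping the list columns with select(), as in the question, also works, but only if every list column is named; the check above catches any that were missed.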