What is wrong with my pattern matching and replacement function - r

I have a dataframe with temperatures in the format XX,X instead of XX.X.
I can use the following code to successfully change them...
df$tempMedian <- sub(",",".",df$tempMedian)
df$tempMedian <- as.numeric(df$tempMedian)
I've tried writing the following function to do the same thing:
comma_to_point <- function(data, colname){
data$colname <- sub(",", ".", data$colname)
data$colname <- as.numeric(data$colname)
}
When I call the function:
comma_to_point(df, tempMedian)
I get the following error:
"Error in `$<-.data.frame`(`tmp`, colname, value = character(0)) :
replacement has 0 rows, data has 365"
My dataframe is 365 obs long.

Give this a shot
comma_to_point <- function(data, colname){
data[[colname]] <- sub(",", ".", data[[colname]])
data[[colname]] <- as.numeric(data[[colname]])
return (data)
}
df = comma_to_point(df, "tempMedian")
When using a variable var='my_column' to reference a column in a data.frame, you can't do df$var, since R will think var is the name of the column. Instead you can get the column with df[[var]].
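To make the difference concrete, here is a small sketch on a toy data frame. The names toy and col are made up for illustration, and fixed = TRUE assumes the comma should be treated as a literal character. `$` looks for a column literally named col, while `[[` uses the value stored in col.

```r
# Toy data; the names `toy` and `col` are illustrative only.
toy <- data.frame(tempMedian = c("12,5", "13,0"), stringsAsFactors = FALSE)
col <- "tempMedian"

toy$col      # NULL: there is no column literally called "col"
toy[[col]]   # the tempMedian column, looked up via the value of `col`

# Replace the decimal comma and convert, using [[ ]] throughout.
toy[[col]] <- as.numeric(sub(",", ".", toy[[col]], fixed = TRUE))
```

This is also why the function in the answer has to be called with the column name as a string, "tempMedian", and why its result must be assigned back to df.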

Related

Using lapply with gsub to replace word in dataframe using another dataframe as 'dictionary'

I have a dataframe called data where I want to replace some word in specific columns A & B.
I have a second dataframe called dict that plays the role of a dictionary/hash, containing the words and the values to use for replacement.
I think it could be done with purrr’s map() but I want to use apply. It's for a package and I don't want to have to load another package.
The following code does not work, but it gives you the idea. I'm stuck.
columns <- c("A", "B" )
data[columns] <- lapply(data[columns], function(x){x}) %>% lapply(dict, function(y){
gsub(pattern = y[,2], replacement = y[,1], x)})
This works for changing one word, but I'm not able to pass the list of changes contained in the dictionary.
data[columns] <- lapply(data[columns], gsub, pattern = "FLT1", replacement = "flt1")
@Gregor_Thomas is right: you need a for loop so that the replacements are applied cumulatively; otherwise you only replace one value at a time.
df <- data.frame("A"=c("PB1","PB2","OK0","OK0"),"B"=c("OK3","OK4","PB1","PB2"))
dict <- data.frame("pattern"=c("PB1","PB2"), "replacement"=c("OK1","OK2"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x <- gsub(pattern = dict$pattern[i], replacement = dict$replacement[i],x)
}
return(x)
})
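Note that apply() returns a character matrix rather than a data frame. A minimal sketch of the same loop with lapply, which lets the result be assigned back to the selected columns, might look like this. Here replace_all is a hypothetical helper name, and fixed = TRUE assumes the patterns are literal strings rather than regular expressions.

```r
df <- data.frame(A = c("PB1", "PB2", "OK0", "OK0"),
                 B = c("OK3", "OK4", "PB1", "PB2"),
                 stringsAsFactors = FALSE)
dict <- data.frame(pattern = c("PB1", "PB2"),
                   replacement = c("OK1", "OK2"),
                   stringsAsFactors = FALSE)
columns <- c("A", "B")

# Hypothetical helper: apply every dictionary entry to one vector in turn.
replace_all <- function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    x <- gsub(dict$pattern[i], dict$replacement[i], x, fixed = TRUE)
  }
  x
}

# lapply keeps the columns as a list, so they can be assigned straight back.
df[columns] <- lapply(df[columns], replace_all, dict = dict)
```

Only base R is used, which fits the constraint of not loading another package.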
Or, if your dict is too long, you can generate the sequence of gsub calls you need, using paste0 as a code generator:
paste0("df[,'A'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'A'])")
It generates all the gsub lines for the "A" column:
"df[,'A'] <- gsub(pattern = 'PB1', replacement = 'OK1',df[,'A'])"
"df[,'A'] <- gsub(pattern = 'PB2', replacement = 'OK2',df[,'A'])"
Then you evaluate the code and wrap it in an lapply over the various columns:
lapply(c("A","B"), FUN = function(v) { eval(parse(text=paste0("df[,'", v,"'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'",v,"'])"))) })
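It's ugly, but it avoids writing the loops out by hand. Note that the assignments are evaluated inside the anonymous function, so lapply returns the replaced columns as a list instead of modifying df itself.
Edit: for exact matching between df and dict, you should probably use a boolean selection with == instead of gsub().
(I don't use match() here because it selects only the first match.)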
df <- data.frame("A"=c("PB1","PB2","OK0","OK0","OK"),"B"=c("OK3","OK4","PB1","PB2","AB"))
dict <- data.frame("pattern"=c("PB1","PB2","OK"), "replacement"=c("OK1","OK2","ZE"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x[x==dict$pattern[i]] <- dict$replacement[i]
}
return(x)
})

Function for looping over common variable names with different suffixes in R

I have some code which I'm looking to replicate many times, each for a different country as the suffix.
Assuming 3 countries as a simple example:
country_list <- c('ALB', 'ARE', 'ARG')
I'm trying to create a series of variables called a_m5_ALB, a_m5_ARE, a_m5_ARG etc which have various functions e.g. addcol or round_df applied to reg_math_ALB, reg_math_ARE, reg_math_ARG etc
for (i in country_list) {
paste("a_m5", i , sep = "_") <- addcol(paste("reg_math", i , sep = "_"))
}
for (i in country_list) {
paste("a_m5", i , sep = "_") <- round_df(paste("reg_math", i , sep = "_"))
}
where addcol and round_df are defined as:
addcol = function(y){
dat1 = mutate(y, p.value = ((1 - pt(q = abs(reg.t.value), df = dof))*2))
return(dat1)
}
round_df <- function(x, digits) {
numeric_columns <- sapply(x, mode) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
The loop errors when any of the functions are added in brackets before the paste variable part but it works if doing it manually e.g.
a_m5_ALB <- addcol(reg_math_ALB)
Please could you help? I think it's the application of the function in a loop which i'm getting wrong.
Errors:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "character"
Error in round(x[numeric_columns], digits) :
non-numeric argument to mathematical function
Thank you
From your examples, you're really in a case where everything should be in a single dataframe. Here, keeping separate variables for each country is not the right tool for the job. Say you have your per-country dataframes saved as csv, you can rewrite everything as:
library(tidyverse)
country_list <- c('ALB', 'ARE', 'ARG')
read_data <- function(ctry){
read_csv(paste0("/path/to/file/", "reg_math_", ctry)) %>%
add_column(country = ctry)
}
total_df <- map_dfr(country_list, read_data)
# digits must be defined beforehand (the question does not give a value)
total_df %>%
mutate(p.value = (1 - pt(q = abs(reg.t.value), df = dof))*2) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = digits)))
And it gives you immediate access to all other dplyr functions that are great for this kind of manipulation.
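For context, the errors in the question arise because paste() returns a character string: addcol() receives the text "reg_math_ALB" rather than the data frame of that name, and a paste() call cannot be the target of <-. If a single data frame is not an option, a named list is a common alternative to one variable per country. The following is only a sketch with made-up toy data; make_toy and addcol_base are hypothetical names, and addcol_base repeats the p-value calculation from addcol() in base R so that the example has no dependencies.

```r
country_list <- c("ALB", "ARE", "ARG")

# Toy stand-ins for reg_math_ALB etc.; the values are made up.
make_toy <- function() data.frame(reg.t.value = c(1.5, -2.2), dof = c(10, 10))
reg_math <- setNames(lapply(country_list, function(ctry) make_toy()),
                     country_list)

# Same calculation as addcol(), written without dplyr.
addcol_base <- function(y) {
  y$p.value <- (1 - pt(abs(y$reg.t.value), df = y$dof)) * 2
  y
}

# One call processes every country; results are looked up by name.
a_m5 <- lapply(reg_math, addcol_base)
a_m5$ALB   # takes the place of a_m5_ALB
```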

Using API in R and converting to dataframe format

I'm basically trying to call an API to retrieve weather information from a government website.
library(data.table)
library(jsonlite)
library(httr)
base<-"https://api.data.gov.sg/v1/environment/rainfall"
date1<-"2020-01-25"
call1<-paste(base,"?","date","=",date1,sep="")
get_rainfall<-GET(call1)
get_rainfall_text<-content(get_rainfall,"text")
get_rainfall_json <- fromJSON(get_rainfall_text, flatten = TRUE)
get_rainfall_df <- as.data.frame(get_rainfall_json)
I'm getting an error
"Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 52, 287, 1"
Not too sure how to resolve this, i'm trying to format the retrieved data into a dataframe format so i can make sense of the readings.
Your "get_rainfall_json" object comes back as a "list". Trying to turn this into a data frame is where you are getting the error. If you specify the "items" object within the list, your error is resolved! (The outcome of this looks like it has some more embedded data within objects... So you'll have to parse through that into a format you're interested in.)
get_rainfall_df <- as.data.frame(get_rainfall_json$items)
Update
To loop through the nested data frame, here is one way you could do it. It loops through each row, extracts the list stored in that row, turns it into a data frame, and appends it to "df". You are then left with one final df with all the data in one place.
library(data.table)
library(jsonlite)
library(httr)
library(dplyr)
base <- "https://api.data.gov.sg/v1/environment/rainfall"
date1 <- "2020-01-25"
call1 <- paste(base, "?", "date", "=", date1, sep = "")
get_rainfall <- GET(call1)
get_rainfall_text <- content(get_rainfall,"text")
get_rainfall_json <- fromJSON(get_rainfall_text, flatten = TRUE)
get_rainfall_df <- as.data.table(get_rainfall_json$items)
df <- data.frame()
for (row in 1:nrow(get_rainfall_df)) {
new_date <- get_rainfall_df[row, ]$readings[[1]]
colnames(new_date) <- c("stationid", "value")
date <- get_rainfall_df[row, ]$timestamp
new_date$date <- date
df <- bind_rows(df, new_date)
}
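Growing a data frame inside a loop can get slow when there are many rows. As a rough alternative, the pieces can be collected first and bound once at the end. The sketch below uses a toy stand-in for get_rainfall_json$items; its structure (a timestamp column plus a readings list-column) and all station ids and values are assumptions made for illustration, not real API output.

```r
# Toy stand-in for get_rainfall_json$items; structure and values are assumed.
items <- data.frame(timestamp = c("2020-01-25T00:05:00", "2020-01-25T00:10:00"),
                    stringsAsFactors = FALSE)
items$readings <- list(
  data.frame(station_id = c("S77", "S109"), value = c(0, 0.2)),
  data.frame(station_id = c("S77", "S109"), value = c(0.4, 0))
)

# Tag each readings table with its timestamp, then bind once at the end.
pieces <- Map(function(r, ts) {
  names(r) <- c("stationid", "value")
  r$date <- ts
  r
}, items$readings, items$timestamp)

rainfall <- do.call(rbind, pieces)
rownames(rainfall) <- NULL
```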

How to create a loop of ppcor function?

I am trying to create a loop to go through and perform a correlation (and in future a partial correlation) using ppcor function on variables stored within a data frame. The first variable (A) will remain the same for all correlations, whilst the second variable (B) will be the next variable along in the next column within my data frame. I have around 1000 variables.
I show the mtcars dataset below as an example, as it is in the same layout as my data.
I've been able to complete the operation successfully when performed manually using cbind to bind 2 columns (the 2 variables of interest) prior to running ppcor on the array ("tmp_df"). I have then been able to bind the output from correlation operation ("mpg_cycl"), ("mpg_disp") into a single object. However I can't get any of this operation to work in a loop. Any ideas please?
library("MASS")
install.packages("ppcor")
library("ppcor")
mtcars_df <- as.data.frame(mtcars)
tmp_df = cbind(mtcars_df$mpg, mtcars_df$cycl)
mpg_cycl <- pcor(as.matrix(tmp_df), method = 'spearman')
tmp_df1= cbind(mtcars_df$mpg, mtcars_df$disp)
mpg_disp <- pcor(as.matrix(tmp_df1), method = 'spearman')
combined_table <- do.call(cbind, lapply(list("mpg_cycl" = mpg_cycl,
mpg_disp" = mpg_disp), as.data.frame, USE.NAMES = TRUE))
Attempting to loop the above operation (amended after the last reviewer's comments):
for (i in mtcars_df[2:7]){
tmp_df = (cbind(i, mtcars_df$mpg)
i <- pcor(as.matrix(tmp_df), method = 'spearman')
write.csv(i, file = paste0("MyDataOutput",i[1],".csv")
}
I expected the loop to output two of the correlations results to MyDataOutput csv file. But this generates an error message, I thought i was in the correct place?:
Error: unexpected symbol in:
" tmp_df = (cbind(i, mtcars_df$mpg)
i"
Even adding a curly bracket at the end does not resolve issue so I have left this out as it introduces another error message '}'
I have redone some of your code and fixed the missing ), } and ". The for loop now writes one file per variable, named with the prefix plus the variable name. Hope this helps.
library("MASS")
#install.packages("ppcor")
library("ppcor")
mtcars_df <- as.data.frame(mtcars)
tmp_df = cbind(mtcars_df$mpg, mtcars_df$cyl)  # the mtcars column is named cyl, not cycl
mpg_cycl <- pcor(as.matrix(tmp_df), method = 'spearman')
tmp_df1= cbind(mtcars_df$mpg, mtcars_df$disp)
mpg_disp <- pcor(as.matrix(tmp_df1), method = 'spearman')
combined_table <- do.call(cbind, lapply(list("mpg_cycl" = mpg_cycl,
"mpg_disp" = mpg_disp), as.data.frame, USE.NAMES = TRUE))
for(i in colnames(mtcars_df[2:7])){
tmp_df = mtcars_df[c(i,"mpg")]
i_result <- pcor(as.matrix(tmp_df), method = 'spearman')
write.csv(i_result, file = paste0("MyDataOutput_",i,".csv"))
}
To merge the results before saving:
dta <- c()
for(i in colnames(mtcars_df[2:7])){
tmp_df = mtcars_df[c(i,"mpg")]
i_result <- pcor(as.matrix(tmp_df), method = 'spearman')
# one row per variable: its name followed by the flattened pcor output
dta <- rbind(dta,c(i,(unlist( i_result))))
}
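With only two columns there are no variables to control for, so the estimate is the ordinary Spearman correlation. Under that assumption, a compact way to collect one estimate per variable is sapply over the column names. In this sketch base cor() is used as a stand-in so it runs without extra packages; it is not the ppcor call itself, and predictors, spearman_with_mpg and result_table are illustrative names.

```r
predictors <- colnames(mtcars)[2:7]

# Spearman correlation of mpg with each predictor, collected in a named vector.
# cor() is a stand-in here; pcor() would be needed once control variables are added.
spearman_with_mpg <- sapply(predictors, function(v) {
  cor(mtcars$mpg, mtcars[[v]], method = "spearman")
})

# One row per variable, ready to be written out with a single write.csv call.
result_table <- data.frame(variable = predictors,
                           estimate = unname(spearman_with_mpg))
```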

Trycatch in for loop- continue to next r dataRetrieval

I have a list containing the following site id numbers:
sitelist <- c("02074500", "02077200", "208111310", "02081500", "02082950")
I want to use the dataRetrieval package to collect additional information about these sites and save it into individual .csv files. Site number "208111310" does not exist, so it returns an error and stops the code.
I want the code to ignore site numbers that do not return data and continue to the next number in sitelist.
I've tried trycatch in several ways but can't get the correct syntax. Here is my for loop without trycatch.
for (i in sitelist){
test_gage <- readNWISdv(siteNumbers = i,
parameterCd = pCode)
df = test_gage
df = subset(df, select= c(site_no, Date, X_00060_00003))
names(df)[3] <- c("flow in m3/s")
df$Year <- as.character(year(df$Date))
write.csv(df, paste0("./gage_flow/",i,".csv"), row.names = F)
rm(list=setdiff(ls(),c("sitelist", "pCode")))
}
You can wrap the call in tryCatch and use its error argument to specify what happens when an error occurs. The value returned by the handler becomes the value of tryCatch, so returning NULL lets you skip that site with next.
for (i in sitelist){
# NULL marks a site for which the request failed
test_gage <- tryCatch(
readNWISdv(siteNumbers = i, parameterCd = pCode),
error = function(e) NULL
)
# skip sites that errored or returned no rows
if (is.null(test_gage) || nrow(test_gage) == 0) next
df = subset(test_gage, select= c(site_no, Date, X_00060_00003))
names(df)[3] <- c("flow in m3/s")
df$Year <- as.character(year(df$Date))
write.csv(df, paste0("./gage_flow/",i,".csv"), row.names = F)
rm(list=setdiff(ls(),c("sitelist", "pCode")))
}
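If you also want to catch warnings, pass a warning handler as well. Each handler takes the condition object as its single argument, and expr stands for the code you want to run.
tryCatch(expr, error=function(e){}, warning=function(w){})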
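The skip-and-continue pattern can be tried without network access. In the sketch below, fetch_site is a hypothetical stand-in for readNWISdv() that fails for one made-up id; the loop logs the failure and moves on.

```r
# Hypothetical stand-in for readNWISdv(): fails for one made-up id.
fetch_site <- function(id) {
  if (id == "bad") stop("no data for site ", id)
  data.frame(site_no = id, flow = 1)
}

results <- list()
for (id in c("a", "bad", "b")) {
  res <- tryCatch(fetch_site(id), error = function(e) {
    message("Skipping ", id, ": ", conditionMessage(e))
    NULL   # the handler's return value becomes the value of tryCatch
  })
  if (is.null(res)) next   # move on to the next id
  results[[id]] <- res
}

names(results)   # only the ids that succeeded
```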
