Check if column exists and if it does check something about it - r

Hi, I want to check whether a column exists in a data.frame and, only if it does, check another condition on it.
I know I can use a nested if statement, as in the example below.
This is normally for checking inputs to functions. The example works and gives me the output I want; I was just wondering if there is a smarter way, as this can get messy, especially if I am doing it for a number of conditions. My example:
testfun <- function(dat, ...) {
  library(dplyr)
  if ("Site" %in% colnames(dat)) {
    # for example check number of sites, this condition could be anything though
    if (n_distinct(dat$Site) > 1) stop("Function must have site specific data")
  }
  # do stuff
  return(1)
}
testdf1 <- data.frame(x = 1:10, y = 1:10)
testdf2 <- data.frame(x = 1:10, y = 1:10,Site = "A")
testdf3 <- data.frame(x = 1:10, y = 1:10,Site = rep(c("A","B"),each = 5))
testfun(testdf1)
testfun(testdf2)
testfun(testdf3)
Edit with a bit more context: In this example, the user may input data that is site specific and therefore doesn't have a Site column (i.e. they have a data.frame with data at only one site, so they never specified the site as a column), or they might be using a data.frame that has data for a number of sites specified in a column. So if there is no Site column, it is safe to assume that the data is for one site and it is valid to continue the calculations, but if there is a Site column I have to check that it has only one distinct value (e.g. it might have been filtered on this column before applying the function, or the function might be applied through plyr::ddply).
There are a lot of other cases, however, where I want to check that the input data to a function is of the expected form, and if the input is a data.frame this often means checking for column names and something about that column.

You can decide whether this is a smarter way or not, but one option is to separate the logic using map_if. Here we check the basic condition ("Site" %in% colnames(dat)) in the predicate part and, based on that, call one of two functions: one for TRUE and the other for FALSE. We still check similar conditions, but keeping the functions separate keeps the code clean and makes it easy to see which part is doing what.
library(dplyr)
library(purrr)
testfun <- function(dat, ...) {
  unlist(map_if(list(dat), "Site" %in% colnames(dat), true_fun, .else = false_fun))
}
true_fun <- function(dat) {
  if (n_distinct(dat$Site) > 1) stop("Function must have site specific data")
  return(1)
}
false_fun <- function(dat) { return(1) }
testfun(testdf1)
#[1] 1
testfun(testdf2)
#[1] 1
testfun(testdf3)
Error in .f(.x[[i]], ...) : Function must have site specific data
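Another option, if you would rather stay with plain if statements, is to rely on && short-circuiting so that the second condition is only evaluated when the column exists, and to wrap that in a small helper. This is only a sketch: check_if_present, is_ok and testfun2 are hypothetical names that are not part of the question or of any package, and base R's length(unique()) stands in for n_distinct so that the block has no dependencies.

```r
# Hypothetical helper (not from any package): validate a column only if it exists.
# `is_ok` is a function that takes the column and returns TRUE when it is valid.
check_if_present <- function(dat, col, is_ok, msg) {
  # && short-circuits, so is_ok() never runs when the column is missing
  if (col %in% names(dat) && !is_ok(dat[[col]])) {
    stop(msg, call. = FALSE)
  }
  invisible(TRUE)
}

testfun2 <- function(dat, ...) {
  check_if_present(dat, "Site",
                   is_ok = function(x) length(unique(x)) == 1,
                   msg = "Function must have site specific data")
  # do stuff
  1
}

testdf1 <- data.frame(x = 1:10, y = 1:10)
testdf3 <- data.frame(x = 1:10, y = 1:10, Site = rep(c("A", "B"), each = 5))

testfun2(testdf1)   # no Site column, so the check is skipped
# capture the error message instead of stopping the script
res3 <- tryCatch(testfun2(testdf3), error = function(e) conditionMessage(e))
```

Each additional input check then becomes one more check_if_present() call rather than another nested if.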

Related

How do you combine objects created in rvest looping function after an iteration?

I hope you are having a good day.
I'm trying to scrape Trustpilot-reviews in the sports-section.
I want four columns with number of reviews, trustscore, subcategories and companynames.
There are 43 pages it should iterate over, with 20 companies in each page.
After an iteration the data should be placed underneath the previous data. This can be cleaned up afterwards using filtering though.
The important part, and what I suspect is my problem, is getting everything put together at the end.
The code as-is produces the error
"Error in .subset2(x, i, exact = exact) : subscript out of bounds"
If you know anything about this, some pointers on how the code can be corrected would be appreciated.
Here is the code I'm having trouble with:
Trustpilot_company_data <- data.frame()
page_urls = sprintf('https://dk.trustpilot.com/categories/sports?page=%s&status=all', 2:43)
page_urls = c(page_urls, 'https://dk.trustpilot.com/categories/sports?status=all')
for (i in 1:length(page_urls)) {
session <- html_session(page_urls[i])
trustscore_data_html <- html_nodes(session,'.styles_textRating__19_fv')
trustscore_data <- html_text(trustscore_data_html)
trustscore_data <- gsub("anmeldelser","",trustscore_data)
trustscore_data <- gsub("TrustScore","",trustscore_data)
trustscore_data <- as.data.frame(trustscore_data)
trustscore_data <- separate(trustscore_data, col="trustscore_data", sep="·", into=c("antal anmeldelser", "trustscore"))
number_of_reviews<- trustscore_data$`antal anmeldelser`
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]") %>%
as.numeric(number_of_reviews)
trustscores <- trustscore_data$trustscore
Trustpilot_company_data[[i]]$trustscores <- trimws(trustscores, whitespace = "[\\h\\v]") %>%
as.numeric(trustscores)
subcategories_data_html <- html_nodes(session,'.styles_categories__c4nU-')
subcategories_data <- html_text(subcategories_data_html)
Trustpilot_company_data[[i]]$subcategories_data <- gsub("·",",",subcategories_data)
company_name_data_html <- html_nodes(session,'.styles_businessTitle__1IANo')
Trustpilot_company_data[[i]]$company_name_data <- html_text(company_name_data_html)
Trustpilot_company_data[[i]]$company_name_data <- rep(i,length(Trustpilot_company_data[[i]]$company_name_data))
}
Best regards
Anders
There seem to be several things going on here.
First, as a rule, growing a data frame this way is not good practice.
Second, in this case you seem to be trying to add the new element for each column one at a time, which makes things more awkward for you. And you are trying to access the data frame as if it were a list. So, for example, this isn't going to work:
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]")
Trustpilot_company_data is a data frame, so it has rows and columns. So to access a particular row and column with [] you say e.g. dat[5,10] for the fifth row and tenth column of dat. Instead you are trying to use [[i]] which is the syntax for accessing the elements of a list. In this case you'd need to write e.g.
Trustpilot_company_data[i, "number_of_reviews"]
to access the thing you're trying to get at.
Third, doing this one column at a time is a bad idea. If you're going to try to grow a data frame, assemble each new mini-data-frame completely first and then add it to the bottom with rbind(). E.g.,
df <- data.frame()
for (i in 1:5) {
  new_piece <- data.frame(a = i,
                          b = i,
                          c = i)
  df <- rbind(df, new_piece)
}
But fourth and most important, don't grow data frames in this way in the first place. Instead, see for example this answer.
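As a minimal sketch of the non-growing approach (the linked answer is not reproduced here, so this is one common pattern rather than necessarily the one it recommends): collect each piece in a pre-allocated list and bind everything once after the loop. The toy columns a, b and c are the same placeholders as above.

```r
n <- 5
pieces <- vector("list", n)   # pre-allocate one slot per iteration

for (i in seq_len(n)) {
  # build the complete mini data frame for this iteration
  pieces[[i]] <- data.frame(a = i, b = i, c = i)
}

# bind all pieces in a single step instead of growing df inside the loop
df <- do.call(rbind, pieces)
```

In the scraping loop, each iteration would build one data frame with all four columns for that page and store it in its own list slot.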

R copying attributes over to another object

I have an initial variable:
a = c(1,2,3)
attr(a,'name') <- 'numbers'
Now I want to create a new variable that is a subset of a and give it the same attributes as a. Is there something like a copy.over.attr function that does this without me having to go in and identify which attributes are user defined, etc.? This gets complicated when I have numerous attributes attached to a single variable.
There is mostattributes<-, which receives a list and attempts to set the attributes in the list on the object in its argument. It should be used with caution and care. At the very least, reading the source code will give you some nice ideas on how to check attributes between objects. Here's a little run on your sample vector a. It succeeds since it doesn't violate any properties of b.
a = c(1,2,3)
attr(a,'name') <- 'numbers'
b <- a[-1]
attributes(b)
# NULL
mostattributes(b) <- attributes(a)
attributes(b)
# $name
# [1] "numbers"
Here's a sample of the source code where names are checked.
if (h.nam <- !is.na(inam <- match("names", names(value)))) {
  n1 <- value[[inam]]
  value <- value[-inam]
}
if (h.dim <- !is.na(idin <- match("dim", names(value)))) {
  d1 <- value[[idin]]
  value <- value[-idin]
}
if (h.dmn <- !is.na(idmn <- match("dimnames", names(value)))) {
  dn1 <- value[[idmn]]
  value <- value[-idmn]
}
attributes(obj) <- value
There is also attr.all.equal. It's not the operation you want, but I think you would benefit from reading that source code too. There are many good checks you can learn about in that one.
Wouldn't a simple attributes(b) <- attributes(a) work?
This will just be executed after creating b from a subset of the data in a, so it's not really a single statement, but should work.
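One caveat with the plain attributes<- assignment, sketched below with made-up data: if a also carries attributes that depend on its length, such as names, copying everything onto a shorter b fails, whereas mostattributes<- skips the ones that do not fit. The attribute name note is just an illustration.

```r
a <- c(x = 1, y = 2, z = 3)    # a named vector, so a has a "names" attribute
attr(a, "note") <- "numbers"   # plus one user-defined attribute
b <- a[-1]                     # subset of length 2

# Copying all attributes fails: names of length 3 do not fit a vector of length 2
plain <- tryCatch({ attributes(b) <- attributes(a); "ok" },
                  error = function(e) "error")

# mostattributes<- skips the names that do not fit and keeps the rest
b2 <- a[-1]
mostattributes(b2) <- attributes(a)
attr(b2, "note")
```

For a plain vector with only user-defined attributes, as in the question, both approaches give the same result.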

Naming different variables and using i to subset a file

I want to go through a vector, create a variable named after each value i, and use i to subset a larger file.
Why does this not work?
x <- c(seq(.1,.9,.1),seq(.9,1,.01))
doplot <- function(y)
{
for (i in unique(y))
{
paste("f_", i, sep = "") <- (F_agg[F_agg$Assort==i,])
}
}
doplot(x)
There are several problems here. First of all, on the left hand side of <- you need a symbol (well, or a special function, but let's not get into that now). So when you do this:
a <- "b"
a <- 15
then a will be set to 15, instead of first evaluating a to be b and then set b to 15.
Then, if you create variables within a function, they will be (by default) local to that function, and destroyed at the end of the function.
Third, it is not good practice to create variables this way (for reasons I will not go into now). It is better to put your data in a named list, and then return the list from the function.
Here is a solution that should work, although I cannot test it, because you did not provide any test data:
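doplot <- function(y) {
  vals <- unique(y)
  # one subset of F_agg per distinct value
  out <- lapply(vals, function(i) {
    F_agg[F_agg$Assort == i, ]
  })
  # name the elements f_<value> so the result is a named list
  setNames(out, paste0("f_", vals))
}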
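For illustration, here is the named-list idea on a tiny made-up stand-in for F_agg (the real data was not provided, so everything except the Assort column name is an assumption). Base R's split() does the subset-per-value step and names the list elements in one call.

```r
# Hypothetical stand-in for F_agg; only the Assort column name comes from the question
F_agg <- data.frame(Assort = c(0.1, 0.1, 0.2, 0.3),
                    value  = c(10, 20, 30, 40))

# One data frame per distinct Assort value, named after that value
pieces <- split(F_agg, F_agg$Assort)
names(pieces) <- paste0("f_", names(pieces))

names(pieces)
pieces[["f_0.1"]]
```

Individual subsets are then reached by name from the list instead of through separately created variables.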

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
  ## Create an empty data frame
  new_frame <- data.frame()
  ## Add every data frame in the directory whose name is in number_seq to new_frame
  ## the file variable specifies the path to the file
  for (i in number_seq) {
    file <- paste("~/", directory, "/", sprintf("%03d", i), ".csv", sep = "")
    x <- read.csv(file)
    new_frame <- rbind.data.frame(new_frame, x)
  }
  ## calculate and return the mean (see the note marked * below)
  mean(new_frame[, variable], na.rm = TRUE)
}
*While calculating the mean I first tried to subset using the $ sign (new_frame$variable) and the subset function (subset(new_frame, select = variable)), but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other ways of subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work I still don't know why the other ways failed. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
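new_frame$variable looks for a column in the data frame that is literally named variable, finds none and returns NULL. subset(new_frame, select = variable) also tries the column names first and only then falls back to the function's variable argument; what it returns is a one-column data frame rather than a vector, which mean() cannot average.
On the other hand, new_frame[, variable] uses the value of the variable argument of f(directory, variable, number_seq) to select the column and returns it as a vector.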
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ took either variables or column names, which would you expect: the dd$var column or the dd$value column (because var == "value")? That's why the dd[, var] way is different: it evaluates var and uses the resulting character vector, rather than treating var as a literal column name. You will get dd$value with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
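A small sketch of the difference inside a function, using made-up data (col_mean, col_dollar and their arguments are hypothetical names): $ treats what follows it as a literal name, while [[ ]] and [, ] evaluate their argument first.

```r
dd <- data.frame(id = 1:4, value = c(2, 4, 6, 8))

col_mean <- function(df, variable) {
  # [[ ]] evaluates `variable`, so the column named by its value is used
  mean(df[[variable]], na.rm = TRUE)
}

col_dollar <- function(df, variable) {
  # $ looks for a column literally called "variable"; there is none, so this is NULL
  df$variable
}

col_mean(dd, "value")
col_dollar(dd, "value")
```

Passing the column name as a string and indexing with [[ ]] is therefore the form to use when the column is chosen by a function argument.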

Assignment to a data.frame with `with`

Here's an example that assigns in two different ways, one which works and one which doesn't:
library(datasets)
dat <- as.data.frame(ChickWeight)
dat$test1 <- with(dat, Time + weight)
with(dat, test2 <- Time + weight)
> colnames(dat)
[1] "weight" "Time" "Chick" "Diet" "test1"
I've grown accustomed to this behavior. Perhaps more surprising is that test2 just disappears (instead of winding up in the base environment, as I'd expect):
> ls(pattern="test")
character(0)
Note that with is a fairly simple^H^H^H^H^H^H short function:
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
First let's replicate with's functionality:
eval( substitute(Time+weight), envir=dat, enclos=parent.frame() )
Now test with a different enclosure:
testEnv <- new.env()
eval( substitute(test3 <- Time+weight), envir=dat, enclos=testEnv )
ls( envir=testEnv )
Which still doesn't assign anywhere. This disproves my hunch that it was related to the enclosing environment being discarded, and rather points to something more fundamental: the enclos argument not doing what I think it does.
I'm curious about the mechanics of why this is going on and if there's an alternative which allows assignment.
Change with to within. with is only for making variables available, not changing them.
Edit: To elaborate, I believe that both with and within create a new environment and populate it with the given list-like object (such as a data frame), and then evaluate the given expression within that environment. The difference is that with returns the result of the expression and discards the environment, while within returns the environment (converted back to whatever class it originally was, e.g. data.frame). Either way, any assignments made within the expression are presumably performed inside the created environment, which is discarded by with. This explains why test2 is nowhere to be found after doing with(dat, test2 <- Time + weight).
Note that since within returns the modified environment instead of editing it in place (i.e. call-by-value semantics), you need to do dat <- within(dat, test2 <- Time + weight).
If you want a function to do assignment to the current environment (or any specified environment), look at assign.
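The three options side by side, as a sketch on a three-row stand-in for ChickWeight (the values are copied from its first rows; test2, test3, test4 and env are just illustrative names):

```r
dat <- data.frame(weight = c(42, 51, 59), Time = c(0, 2, 4))

# with(): the assignment happens in a temporary environment that is discarded
with(dat, test2 <- Time + weight)
"test2" %in% names(dat)   # FALSE

# within(): returns a modified copy, so the result has to be assigned back
dat <- within(dat, test3 <- Time + weight)
dat$test3

# assign(): creates a variable in a chosen environment (here a new one)
env <- new.env()
assign("test4", with(dat, Time + weight), envir = env)
get("test4", envir = env)
```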
Edit 2: The modern answer is to embrace the tidyverse and use magrittr & dplyr:
library(datasets)
library(dplyr)
library(magrittr)
dat <- as.data.frame(ChickWeight)
dat %<>% mutate(test1 = Time + weight)
The last line is equivalent to
dat <- dat %>% mutate(test1 = Time + weight)
which is in turn equivalent to
dat <- mutate(dat, test1 = Time + weight)
Use whichever of the last 3 lines makes the most sense to you.
Inspired by the fact that the following works from the command line ...
eval(substitute(test <- Time + weight, dat))
... I put together the following, which seems to work.
myWith <- function(DAT, expr) {
X <- call("eval",
call("substitute", substitute(expr), DAT))
eval(X, parent.frame())
}
## Trying it out
dat <- as.data.frame(ChickWeight)
myWith(dat, test <- Time + weight)
head(test)
# [1] 42 53 63 70 84 103
(The complicated aspect of this problem is that we need substitute() to search for symbols in one environment (the current frame) while the "outer" eval() assigns into a different environment (the parent frame).)
I get the sense that this is being made way too complex. Both with and within return values calculated by operations on named columns of data frames. If you don't assign them to anything, the value will get garbage collected. The usual way to store them is assignment to a named object, or possibly to a component of an object, with the <- operator. within returns the entire data frame, whereas with returns only the vector that was calculated from whatever operations were performed on the column names. You could, of course, use assign instead of <-, but I think overuse of that function may obfuscate rather than clarify the code. The difference in use is just assignment to an entire data frame or to a single column:
dat <- within(dat, newcol <- oldcol1*oldcol2)
dat$newcol <- with(dat, oldcol1*oldcol2)
