I'd like to do something like an embedded loop, but using apply functions, the goal of which is to check various conditions prior to moving on to the next part of my program.
I have two objects, a list of product descriptions, which can be created as follows:
test_products <- list(c("dingdong","small","affordable","polished"),c("wingding","medium","cheap","dull"),c("doodad","big","expensive","shiny"))
And a data frame of combinations of features that are not allowed, where each row represents a disallowed combination of features. A sample data frame can be created as follows:
disallowed <- data.frame(trait1 = c("dingdong","wingding","doodad"),
trait2 = c("medium","big","big"),
stringsAsFactors = FALSE)
My goal is to check each product against each of the disallowed combinations as efficiently as possible. So far I can check one product against all prohibitions as follows (in this case, the third product):
apply(disallowed, 1, function(x) x %in% unlist(test_products[[3]]))
OR I can check all products against one of the disallowed combinations of traits (the third combination).
lapply(test_products, function(x) disallowed[3,] %in% x)
Is it possible to check all products against all rows of the data frame of disallowed feature combination, without using a loop?
My end result should look something like this:
Product 1: OK
Product 2: OK
Product 3: NOT OK
Since Product 3 runs afoul of the third disallowed row.
There are definitely more elegant ways, but I am going to share my thoughts on this.
First, the way you created the disallowed data frame is convoluted. I decided to use the following code to create disallowed.
# Create a data frame showing disallowed traits
disallowed <- data.frame(trait1 = c("dingdong","wingding","doodad"),
trait2 = c("medium","big","big"),
stringsAsFactors = FALSE)
I then created a function called violate, which has two arguments. The first argument product is a vector of character. The second argument, check_df, is the data frame contains disallowed traits.
The output of violate is a logical vector. TRUE means all two traits from the check_df of the row are both TRUE.
# Create the violate function
violate <- function(product, check_df){
temp_df <- as.data.frame(lapply(check_df, function(Col) Col %in% product))
temp_vec <- apply(temp_df, 1, function(Row) sum(Row) == 2)
return(temp_vec)
}
# Test the violate function
violate(test_products[[3]], check_df = disallowed)
# [1] FALSE FALSE TRUE
After that, I applied the violate function using sapply through the test_products list. The results from violate were evaluated to see if all disallowed checks are FALSE
# Apply the violate function and check if all results from violate is FALSE
sapply(test_products, function(product){
sum(violate(product, check_df = disallowed)) == 0})
# [1] TRUE TRUE FALSE
As you can see, the third element of the results is FALSE, indicating that the third product is not OK, while product 1 and product 2 are OK because the final results from sapply are both TRUE.
Related
The page in question is this:
https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)
I'm using RSelenium to load the page and then getting the pagesource and using rvest to capture the required field. This is the data I'm trying to capture.
The code I've come up so far splits some descriptions data into multiple chunks which is not useful for me.
x <- remdr$getPageSource()
xpg <- read_html(x[[1]])
# get the HS descriptions
treeView <- xpg %>%
html_nodes(xpath = '//*/div[#class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
html_nodes(xpath = '//*/p[contains(#class, "MuiTypography-body1")]') %>%
html_nodes('span') %>%
html_text(trim = TRUE)
I need all the descriptions in order as a list.
Update: This is the output format. Descriptions and the 8-digit code
General thoughts:
RSelenium isn't strictly needed, and you can avoid the overhead of launching a browser. There is an API call, you can see in the browser network tab, which supplies the content of interest, and this can be called with no requirement for additional configuration of the request e.g. headers.
The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me) as we do not know 1) how many levels of nesting there may be in this response (and possible future ones) 2) whether the level of nesting can vary across listings within a given response for the items of interest 3) whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level for a given listing); and we need to consider how we generate columns/lists of equal length for output. These are just some starting considerations that I go on to discuss how I handled below.
The API call:
* You can click on many of the smaller images below to enlarge
The API response:
This request returns nested JSON:
The content of interest is a list of named lists, within the response, accessible via the parent "key" $headingItems:
Each of these named lists is nested as per the levels on the webpage:
You can see the repeated accessor key of headingItems (red boxed), with the first shown above as the parent list stored in data in code to follow.
Below that, indicated by level (orange boxed), are the expanded entries you are after; nested within the response JSON.
Finally, we have the descriptions (green boxed) which contains html for the descriptive text you are after, with English and Norwegian versions of the text:
In addition to this, there is, where present, a commodityCode key within the nested headingItems:
Approach and challenges:
Given that the commodityCode can be at different levels and may not be present (unless assumed to always be present at greatest depth of a given listing), and that it is unknown how many levels of headingItem there can be, the approach I chose was to use regex to identify the relevant child named list's names in a boolean mask (though for purposes here we could just say logical vector); one mask for English headers and one for the commodity codes. I processed each child list separately, using purrr::map and applying a custom function to extract data as a data.table/data.frame.
Example mask (descriptions|text):
The TRUE values are for the following chained accessors (chaining dependent on depth):
Notice how some accessor paths are repeated. This means therefore, that I do not use the mask to retrieve the names and extract the associated values. Instead, I keep the TRUE and FALSE values and thereby have equal lengths for both vectors. I combine the two logical vectors as columns within a data.table; along with the entire set of values within the child list:
This work is done within the custom function get_data, where I also then do the following steps:
I filter for only rows where there is a TRUE value i.e. a value I wish to retrieve
Apply a function utilizing gsub(), to remove non-breaking whitespace, and read_html() to convert those descriptions which are actual html to text. N.B. Some entries are not actually html and are handled by the if statement. In those cases, the input value is returned:
At this point the codes and descriptions/text are in a single column:
I use the booleans in commodity_code to update that columns value where TRUE to match the text column, and wrap in if to replace FALSE with NA.
Knowing that there is actually a 1 row offset between description and associated code, where applicable, I then shift the commodity column values down one row to correctly align with descriptions:
I then keep only the rows where description_header_flag is TRUE:
Finally, I remove the now not needed flag column:
This leaves me with a clean data.table to return from the function.
Generating the final output:
As map() applying the custom function above to a list returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:
df <- rbindlist(map(data, get_data))
This can then be written to csv for example.
fwrite(df, 'result.csv')
Example rows in df:
N.B. I return a data.table as you showed 2 columns in your desired output.
R:
library(jsonlite)
library(tidyverse)
library(rvest)
library(data.table)
get_data <- function(x) {
y <- x %>% unlist(recursive = T)
t <- data.table(text = y, description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description.en$", names(y)), commodity_code = grepl("*commodityCode$", names(y)))
t <- t[description_header_flag | commodity_code, ]
t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
html_text(read_html(.x))
} else {
.x
}))
t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
.y
} else {
NA
})
t[, commodity_code := c(NA, commodity_code[.I - 1])]
t <- t[description_header_flag == T, ]
t[, description_header_flag := NULL]
return(t)
}
data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems
df <- rbindlist(map(data, get_data))
fwrite(df, "result.csv")
Sample output:
Credits:
gsub solution taken from: #shabbychef here
row shift solution adapted from: #Gary Weissman here
I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the relative Volume of every column that corresponds to Column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relative new to R so I it's not obvious to me. What I tried is something like this, but I can't actually run this:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide some data, I've followed the description you provided to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volumn_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you wan to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volum_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volum_OD is equivalent to function(x) x / Volumn_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in same cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by columns 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[3])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
in this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] a time and applies a function.
the function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are writing rel.Volume_unmod[3] as additional parameter.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we will just edit the columns of the dataframe and, as a result, we will have a dataframe instead of a list. Let me rephrase in a more technical way. When you are assigning rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assigning function: [<-. This is a function that allows to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. Then when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
Whithout a minimal working example it's hard to guess what the Variable Volume actually refers to. Apart from that there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL will result in an error, when trying to reassign to that variable, since you haven't told R, what rel.Volume should be.
Try this, if you like (thanks #Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse-packages while you are still learing the basics. From my experience, in the long run you're better off learning base-R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse-functions later (but then, why would anybody? ;-) ).
I have a dataset with the tweets of many people, and their corresponding MBTI personality.
I would like to add another variable to the dataset that is a logical, only based on whether the first letter in their MBTI code, i.e. TRUE if they are extroverts, and FALSE if they are introverts. How could i do this?
I tried something like this:
mbti : is the dataset
type : a variable within the set that gives the different types
mbti <- mbti %>%
`if`(mbti$type == `startsWith`("E")){
`mutate`(extrovert = TRUE)
} `else`{
`mutate`(extrovert = FALSE)
}
i have tried several other variations, including this inside a for loop, since i want to iterate over many different individuals.
Sorry if this is a stupid question, or poorly framed, i am quite new to R.
I'd strongly suggest reading An Introduction to dplyr - I think it will help clear up some misconceptions you have. Or perhaps general introductions to R. Many (most!) R functions are vectorized, so you don't need loops. startsWith is vectorized, thus we can run it on all your individuals at once. And it already returns TRUE or FALSE, so we can directly assign the result to the startsWith output.
## startsWith example
startsWith(c("Echo", "Elephant", "Not an E"), "E")
[1] TRUE TRUE FALSE
## using it for your problem
mbti <- mbti %>%
mutate(extrovert = startsWith(type, "E"))
if(){}else{}, however, is one of the exceptions - it is not vectorized, and is used for controlling what code gets run. ifelse() is vectorized function that we could use, e.g., if you wanted the result to be "Extroverted" if type startw with an E and "Introverted" otherwise (see below). But we don't need it here because you want TRUE/FALSE and startsWith returns TRUE/FALSE:
## an ifelse() example
mbti <- mbti %>%
mutate(vertedness = ifelse(startsWith(type, "E"), "Extrovert", "Introvert))
You could just use ifelse directly. Say your new column is Etype then
mbti$Etype <- ifelse(substring(mbti$type, 1, 1) == "E", TRUE, FALSE)
Noob R question here from a Matlab/Python user. I have a dataset with hundreds of different users, each of whom has a unique number of rows of data, and would like to store the data as a list of matrices. So user 1 may have a matrix of 500 rows, user 2 may have a matrix of 250, and so on. This will be used as the inputs for a hierarchical logit with a mixture of normals to explain each user's betas. Column 1 of my dataset is a user id, and the rest of the cols are numerical values.
data <- read.csv("hierarchical_dataset.csv", header=FALSE)
nlgtt = length(table(data[[1]])) # number of users
users = names(table(data[[1]])) # user ids
All good so far, but here is where I'm getting my error:
TV = matrix()
testdata = list()
for (i in 1:nlgtt)
{ TV[i] = matrix(table(data[[1]])[[i]]) # number of rows per user
print(TV[i]) # should equal the below line
print(dim(data[data[[1]] == users[i], ])) # should equal the above line
testdata[i] = data[data[[1]] == users[i], ] # store hierarchically by user
}
When I run the above, the printed values match, so the correct data are being retrieved, but then I get simple repetitions of the users[i] value (the user id) for each testdata[i], and also a number of items to replace is not a multiple of replacement length error. I'm sure this is just a simple formatting issue, but have looked around and failed to turn up anything that resolves my problem. Help appreciated!
You might want to use base::lapply from the apply family functions. Please see the example below.
dataset <- data.frame(user_id = c(1,1,2), variable_a = c(1,2,3))
lapply(unique(dataset$user_id), function(id) dataset[dataset$user_id == id,])
The output gives a 2 element list with matrices of user_id and variable_a, where each element is unique by user.
For a marketing class I have to write a function that calculates the retention rate of the customers (probability that a customer still is a customer). I've come so far that I isolated the ids of the individual customers and stored them in the matrix first.transactions.data. I then split them into cohorts (group of customers by time) with split() and stored them in the list cohort.
Now comes my problem: I calculated another sub-matrix from the full data set called final.period.data where I will calculate the retention rate. However, therefore I have to isolate the ids in final.period.data for each cohort. My instructor told me that I should create an additional column in final.period.data that shows TRUE or FALSE depending on whether the cohort's id and final.period.data's id are the same. For this I tried to use exists, but I always receive error messages. I tried the following:
final.period.data <- if(exists(cohort$'1'$id, where = final.period.data$id) final.period.data$same = TRUE)
but always receive error messages such as: unexpected symbol or invalid first argument. I also tried to convert the list cohort into a matrix but this didn't help either. How do I have to change the exist command or is there a simpler way to locate cohort's ids in final.period.data?
Thank you for your help.
You can just create a function that does what you want:
funct <-(final.period.data){
if (final.period.data$cohort =='1' & final.period.data$id ==<condition2>){
#Change the number for the TRUE condition}
else{ #If it doesn't fit the two conditions
#Change the number for the FALSE condition}
}
vector <- c(nrow(final.period.data))
final.period.data <- cbind(vector)
And use it as the apply function. Here will you find more information about apply
But I usually do it with a for loop, first creating the new column and then adding it to the data frame.