library(NLP)
library(tm)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
#sample dataset
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- data_frame(tags, web, stringsAsFactors = FALSE)
#tokenize the words
toke <- tags2 %>%
unnest_tokens(word, tags)
toke
#create a dummy variable
toke2 <- toke%>% mutate(
product = ifelse(str_detect(word, "^product$"), "1", "0"))
#unnest the toke
nested_toke <- toke2 %>%
nest(word) %>%
mutate(text = map(data, unlist),
text = map_chr(text, paste, collapse = " "))
nested_toke %>%
select(text)
When I nest the column of tokenized words after creating the dummy variable based on the string "product", "product" ends up in a new row below the row where it was originally located.
It should be in the row above, together with the other words.
When you add a new column after unnesting, you have to think about what to do with it if you want to nest again. Let's work through it and see what we're talking about.
library(tidyverse)
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- data_frame(tags, web)
library(tidytext)
tidy_tags <- tags2 %>%
unnest_tokens(word, tags)
tidy_tags
#> # A tibble: 3 x 2
#> web word
#> <chr> <chr>
#> 1 hardware, sunglasses, eyeware product
#> 2 hardware, sunglasses, eyeware productdesign
#> 3 hardware, sunglasses, eyeware electronicdevice
So that is your data set unnested, converted to a tidy form. Next, let's add the new column that detects whether the word "product" is in the word column.
tidy_product <- tidy_tags %>%
mutate(product = ifelse(str_detect(word, "^product$"),
TRUE,
FALSE))
tidy_product
#> # A tibble: 3 x 3
#> web word product
#> <chr> <chr> <lgl>
#> 1 hardware, sunglasses, eyeware product T
#> 2 hardware, sunglasses, eyeware productdesign F
#> 3 hardware, sunglasses, eyeware electronicdevice F
Now think about what your options are for nesting again. If you nest again without taking into account the new column (nest(word)), the structure has a NEW COLUMN and will have to make a NEW ROW to account for the two different values that it can take. You could instead do something like nest(word, product), but then the TRUE/FALSE values will end up in your text string. If you want to get back to the original text format, you need to remove the new column you created, because having it there changes the relationships between rows and columns.
nested_product <- tidy_product %>%
select(-product) %>%
nest(word) %>%
mutate(text = map(data, unlist),
text = map_chr(text, paste, collapse = ", "))
nested_product
#> # A tibble: 1 x 3
#> web data text
#> <chr> <list> <chr>
#> 1 hardware, sunglasses, eyeware <tibble [3 × 1]> product, productdesign, …
Created on 2018-02-22 by the reprex package (v0.2.0).
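The same grouping effect can be reproduced without tidyr. The sketch below is not part of the original reprex: it rebuilds tidy_product by hand and uses base R's aggregate() as a stand-in for nest() followed by paste(), and the names with_flag and without_flag are illustrative only. Grouping on both web and product gives two rows, while grouping on web alone gives back a single row.

```r
# Hand-built copy of tidy_product from the answer above (assumption: same values)
tidy_product <- data.frame(
  web = rep("hardware, sunglasses, eyeware", 3),
  word = c("product", "productdesign", "electronicdevice"),
  product = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keeping the flag as a grouping column: one row per (web, product) combination
with_flag <- aggregate(word ~ web + product, data = tidy_product,
                       FUN = paste, collapse = ", ")

# Dropping the flag from the grouping: all words collapse into one row
without_flag <- aggregate(word ~ web, data = tidy_product,
                          FUN = paste, collapse = ", ")

nrow(with_flag)
nrow(without_flag)
without_flag$word
```

This is why select(-product) comes before nest() in the answer: every column that is not nested acts as a grouping key.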
I loaded a table from a database which contains a column that has JSON data in each row.
The table looks something like the example below. (I was not able to replicate the data.frame I have, due to the format of the column data)
dataframe_example <- data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
JSON_col = c({"_inv": [10,20,30,40]}, "_person": ["_personid": "green"],
{"_inv": [15,22]}, "_person": ["_personid": "blue"],
{"_inv": []}, "_person": ["_personid": "red"]))
I have the following two issues:
Some of the items (e.g. "_inv") sometimes have the full 4 numeric entries, sometimes less, and sometimes nothing. Some of the other items (e.g. "_person") usually contain another header, but only one character data point.
My goal is to preserve the existing dataframe's columns (such as id and name) and spread the data in the JSON column so that I have new columns containing each point of information. The target dataframe would look a little like this:
data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
`_inv_1` = c(10,15,NA),
`_inv_2` = c(20,22,NA),
`_inv_3` = c(30,NA,NA),
`_inv_4` = c(40,NA,NA),
`_person_id` = c("green","blue","red"))
Please bear in mind that I have very little experience handling JSON data and no experience dealing with uneven JSON data.
Using purrr I got:
frame <- purrr::map(dataframe_example$JSON_col, jsonlite::fromJSON)
This gave me a large list with n elements, where n is the number of rows in the original dataframe. The "Name" item contains n lists [[1]], each one with its own type of object, ranging from double to data.frame. The double objects contain four numeric observations (such as _inv); some of the objects are lists themselves (such as _person), which in turn contain "_personid" and then a single entry. The data.frame contains the datetime stamps for each observation in the JSON data (each _inv item has a timestamp).
Is there a way to obtain the solution above, either by extracting the data from my "frame" object, or an altogether different solution?
library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
dataframe_example <-
data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
JSON_col = c(
"{\"_inv\": [10,20,30,40], \"_person\": {\"_personid\": \"green\"}}",
"{\"_inv\": [15,22], \"_person\": {\"_personid\": \"blue\"}}",
"{\"_inv\": [], \"_person\": {\"_personid\": \"red\"}}"
)
)
dataframe_example %>%
as_tibble() %>%
mutate(
JSON_col = JSON_col %>% map(parse_json)
) %>%
unnest_wider(JSON_col) %>%
unnest(`_inv`) %>%
unnest(`_inv`) %>%
unnest(`_person`) %>%
unnest(`_person`) %>%
group_by(id, name) %>%
mutate(inv_id = row_number()) %>%
pivot_wider(names_from = inv_id, values_from = `_inv`, names_prefix = "_inv_")
#> # A tibble: 2 x 7
#> # Groups: id, name [2]
#> id name `_person` `_inv_1` `_inv_2` `_inv_3` `_inv_4`
#> <dbl> <chr> <chr> <int> <int> <int> <int>
#> 1 1 name1 green 10 20 30 40
#> 2 2 name2 blue 15 22 NA NA
Created on 2021-11-25 by the reprex package (v2.0.1)
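Note that the output above has only two rows: the row for id 3, whose _inv array is empty, does not survive the unnest() steps. If that row needs to be kept with NA values, as in the target data frame from the question, one option is to pad the arrays by hand. The following is a minimal sketch, assuming the same three JSON strings and using jsonlite::fromJSON() with base R; the helper names parsed, inv and inv_mat are made up for illustration.

```r
library(jsonlite)

dataframe_example <- data.frame(
  id = c(1, 2, 3),
  name = c("name1", "name2", "name3"),
  JSON_col = c(
    "{\"_inv\": [10,20,30,40], \"_person\": {\"_personid\": \"green\"}}",
    "{\"_inv\": [15,22], \"_person\": {\"_personid\": \"blue\"}}",
    "{\"_inv\": [], \"_person\": {\"_personid\": \"red\"}}"
  ),
  stringsAsFactors = FALSE
)

# Parse every JSON string into an R list
parsed <- lapply(dataframe_example$JSON_col, fromJSON)

# Pull out "_inv" as a numeric vector; an empty array becomes numeric(0)
inv <- lapply(parsed, function(p) as.numeric(unlist(p[["_inv"]])))

# Pad every vector with NA up to the length of the longest one
n_inv <- max(lengths(inv))
inv_mat <- t(vapply(
  inv,
  function(x) c(x, rep(NA_real_, n_inv - length(x))),
  numeric(n_inv)
))
colnames(inv_mat) <- paste0("_inv_", seq_len(n_inv))

# Combine the original columns, the padded values and the person id
result <- data.frame(
  dataframe_example[c("id", "name")],
  inv_mat,
  `_person_id` = vapply(parsed, function(p) p[["_person"]][["_personid"]],
                        character(1)),
  check.names = FALSE
)
result
```

The padding step assumes that at least one row has two or more _inv values; with real data it is worth checking n_inv before building the matrix.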
I would like to build a string based on ids in other columns, where the real values sit in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx_",
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. That would work, but it doesn't seem pretty to me, as I would get columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.
This may also be solved with a fuzzyjoin. However, given the similarity in the substrings, it makes sense to remove the prefix from the 'id' column of each dataset and do a left_join, then create 'my_string' by pasting the columns together.
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
-output
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
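If the goal is to avoid temporary join columns altogether, a named vector can serve as the dictionary. This is a minimal base R sketch, not the approach from the answer above; it assumes the ids in each dictionary are unique, and the names region_lookup, color_lookup and df are made up for illustration.

```r
region_dict <- data.frame(
  id = c("reg_id1", "reg_id2", "reg_id3"),
  name = c("reg_1", "reg_2", "reg_3"),
  stringsAsFactors = FALSE
)
color_dict <- data.frame(
  id = c("col_id1", "col_id2", "col_id3"),
  name = c("col_1", "col_2", "col_3"),
  stringsAsFactors = FALSE
)

# Turn each dictionary into a named vector: names are ids, values are labels
region_lookup <- setNames(region_dict$name, region_dict$id)
color_lookup <- setNames(color_dict$name, color_dict$id)

df <- data.frame(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color = c("col_id1", "col_id2", "col_id3"),
  stringsAsFactors = FALSE
)

# Index the lookups by id and paste the pieces together in one step
df$my_string <- paste0("xxx_", region_lookup[df$region], "_", color_lookup[df$color])
df
```

With five variables, the same pattern needs one lookup vector per dictionary and no columns to drop afterwards. An id that is missing from a dictionary produces NA in the lookup, which paste0() turns into the text "NA", so unmatched ids are worth checking first.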
I'm trying to read an array out of a JSON structure with tidyjson, as I'm trying to speed up my code.
My input data is of the structure
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
I want my output to be a data frame where key1 is one column and key2 is the second column in which all elements of the array are pasted together and separated by ";".
I tried something like
result <- json %>% spread_values(a = jstring("key1"), b = paste0(jstring("key2"), collapse = ";"))
I really have no idea how to get the array out of the JSON in the spread_values function.
I got what I want with
key2 <- json %>% enter_object("key2")
attributes(key2)$JSON %>% unlist() %>% paste0(collapse = ";")
but as I don't have unique keys I can't join it to the rest of my data and I think there must be a better way.
I'm glad you got something working! In case anyone else happens upon this question, there are definitely many ways to accomplish this task!
One is to use tidyjson to gather the data into a tall structure, then summarize:
library(tidyjson)
library(dplyr)
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
myj <- tidyjson::as.tbl_json(json)
myj %>%
# make the data tall
spread_values(key1 = jstring(key1)) %>%
enter_object("key2") %>%
gather_array("idx") %>%
append_values_string("key2") %>%
# now summarize
group_by(key1) %>%
summarize(key2 = paste(key2, collapse = ";"))
#> # A tibble: 1 x 2
#> key1 key2
#> <chr> <chr>
#> 1 test abc;def
Created on 2021-10-29 by the reprex package (v0.3.0)
Another way is to grab the json data directly with json_get_column() and mutate that:
library(tidyjson)
library(dplyr)
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
myj <- tidyjson::as.tbl_json(json)
myj %>%
spread_values(key1 = jstring(key1)) %>%
enter_object("key2") %>%
json_get_column("array") %>%
mutate(key2 = purrr::map_chr(array, ~ paste(.x, collapse = ";"))) %>%
as_tibble() %>% # drop tbl_json structure
select(key1, key2)
#> # A tibble: 1 x 2
#> key1 key2
#> <chr> <chr>
#> 1 test abc;def
Created on 2021-10-29 by the reprex package (v0.3.0)
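For a single JSON string like this one, the same result can also be sketched without tidyjson. The example below swaps in jsonlite::fromJSON(), which is an assumption on my part and not something the question asked for; the names parsed and result are illustrative.

```r
library(jsonlite)

json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"

# fromJSON() simplifies the array under key2 into a character vector
parsed <- fromJSON(json)

# Collapse the array into one ";"-separated string next to key1
result <- data.frame(
  key1 = parsed$key1,
  key2 = paste(parsed$key2, collapse = ";"),
  stringsAsFactors = FALSE
)
result
```

For many documents, the same two steps can be wrapped in a function and applied with lapply(), then combined with do.call(rbind, ...).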
Consider the following example
> data_text <- data.frame(text = c('where', 'are', 'you'),
blob = c('little', 'nice', 'text'))
> data_text
# A tibble: 3 x 2
text blob
<chr> <chr>
1 where little
2 are nice
3 you text
I want to print the rows that contain the regex text (that is, row 3)
The problem is, I have hundreds of columns and I don't know which one contains this string. str_detect only works with one column at a time...
How can I do that using the stringr package?
Thanks!
With stringr and dplyr you can do this.
You should use filter_all from dplyr >= 0.5.0.
I have extended the data to get a better look at the result:
library(dplyr)
library(stringr)
data_text <- data.frame(text = c('text', 'where', 'are', 'you'),
one_more_text = c('test', 'test', 'test', 'test'),
blob = c('wow', 'little', 'nice', 'text'))
data_text %>%
filter_all(any_vars(str_detect(., 'text')))
# output
text one_more_text blob
1 text test wow
2 you test text
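In more recent dplyr releases the scoped verbs such as filter_all() are marked as superseded. A minimal sketch of the same filter with if_any(), assuming dplyr 1.0.4 or later and the extended data from above (matching_rows is an illustrative name):

```r
library(dplyr)
library(stringr)

data_text <- data.frame(
  text = c("text", "where", "are", "you"),
  one_more_text = c("test", "test", "test", "test"),
  blob = c("wow", "little", "nice", "text"),
  stringsAsFactors = FALSE
)

# Keep a row if the pattern is found in at least one of its columns
matching_rows <- data_text %>%
  filter(if_any(everything(), ~ str_detect(.x, "text")))

matching_rows
```

everything() can be replaced with a narrower selection such as where(is.character) if some columns are not text.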
You can treat the data.frame as a list and use purrr::map to check each column, which can then be reduced into a logical vector that filter can handle. Alternatively, purrr::pmap can iterate over all the columns in parallel:
library(tidyverse)
data_text <- data_frame(text = c('where', 'are', 'you'),
blob = c('little', 'nice', 'text'))
data_text %>% filter(map(., ~.x == 'text') %>% reduce(`|`))
#> # A tibble: 1 x 2
#> text blob
#> <chr> <chr>
#> 1 you text
data_text %>% filter(pmap_lgl(., ~any(c(...) == 'text')))
#> # A tibble: 1 x 2
#> text blob
#> <chr> <chr>
#> 1 you text
matches = apply(data_text,1,function(x) sum(grepl("text",x)))>0
result = data_text[matches,]
No other packages required. Hope this helps!
library(rvest)
df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))
for(i in 1:3) {
webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
data <- webpage %>%
html_nodes(".specs") %>%
.[[1]] %>%
html_table(fill = TRUE)
}
I want the loop to work for all 3 values in df$Links, but the code above only keeps the last one. The downloaded data should also be identifiable by the value it came from (maybe a new column with the variable's name).
The problem is in how you're structuring your for loop. It's much easier just to not use one in the first place, though, as R has great support for iterating over lists, like lapply and purrr::map. One version of how you could structure your data:
library(tidyverse)
library(rvest)
base_url <- "https://www.whatmobile.com.pk/"
models <- data_frame(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
link = paste0(base_url, model),
page = map(link, read_html))
model_specs <- models %>%
mutate(node = map(page, html_node, '.specs'),
specs = map(node, html_table, header = TRUE, fill = TRUE),
specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>%
select(model, specs) %>%
unnest()
model_specs
#> # A tibble: 119 x 5
#> model var1 var2
#> <chr> <chr> <chr>
#> 1 Qmobile_Noir-M6 Build OS
#> 2 Qmobile_Noir-M6 Build Dimensions
#> 3 Qmobile_Noir-M6 Build Weight
#> 4 Qmobile_Noir-M6 Build SIM
#> 5 Qmobile_Noir-M6 Build Colors
#> 6 Qmobile_Noir-M6 Frequency 2G Band
#> 7 Qmobile_Noir-M6 Frequency 3G Band
#> 8 Qmobile_Noir-M6 Frequency 4G Band
#> 9 Qmobile_Noir-M6 Processor CPU
#> 10 Qmobile_Noir-M6 Processor Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>
The data is still pretty messy, but at least it's all there.
It is capturing all three values, but it overwrites them on each iteration. That's why it only shows one value, the one for the last page.
You need to initialise a variable before you go into your loop. I suggest a list, so you can store the data from each successive iteration. So something like:
final_table <- list()
for(i in 1:3) {
webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
data <- webpage %>%
html_nodes(".specs") %>%
.[[1]] %>%
html_table(fill= TRUE)
final_table[[i]] <- data.frame(data, stringsAsFactors = F)
}
This way, each iteration stores its data in its own element of the list.
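To also record which link each table came from, and to end up with one data frame, the list can be tagged and combined afterwards. The sketch below runs offline: it replaces the read_html() call with a small hand-made table per link, so the columns spec and value are placeholders and not the real layout of the scraped pages.

```r
links <- c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8")

# Pre-allocate one list slot per link
final_table <- vector("list", length(links))

for (i in seq_along(links)) {
  # Placeholder for the scraped table (assumption: two made-up columns)
  data <- data.frame(
    spec = c("OS", "Weight"),
    value = paste(links[i], c("os", "weight")),
    stringsAsFactors = FALSE
  )
  # Tag every row with the link it belongs to
  data$model <- links[i]
  final_table[[i]] <- data
}

# Stack the per-link tables into one data frame
all_specs <- do.call(rbind, final_table)
all_specs
```

do.call(rbind, ...) requires every table to have the same columns; if the scraped tables differ, dplyr::bind_rows() fills the missing columns with NA instead.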