I have a dataset with 63,000 rows in R. One of the columns contains a list of phrases in the format '"Fireplace", "Garage", "One story with balcony", "Off street parking",' etc. They are property characteristics listed on sale websites.
I want to extract words from this column and create a new column containing a '1' or a '0' depending on whether the word is present (creating a dummy variable for regression). Once that is done, I want to merge some of those columns together (i.e. take the 'parking', 'Parking', 'garage' and 'Garage' columns and merge them into one that covers all parking and garages). I assume R is sensitive to upper- and lowercase characters, but even if it is not, I still need to be able to merge 'parking' and 'garage' together, for example.
This is for a hedonic pricing method so I need as many property characteristic variables as possible.
I don't know how to create the new dummy variables, or how to merge them into one column once I have them, so I am struggling. I would appreciate any help.
Is this what you are looking for?
library(tidyverse)
data.frame(txt) %>%
  # tidy up `txt`:
  mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
  # split into rows:
  separate_rows(txt, sep = ",") %>%
  # extract keywords matched:
  mutate(keywords = str_extract(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony"))
# A tibble: 19 × 2
txt keywords
<chr> <chr>
1 "Stunning seaside location" NA
2 " 24hour emergency call system and secure video entry" NA
3 " Mature landscaped gardens with large terraces and seating areas" garden
4 " Walk out balconies to selected apartments" NA
5 " Beautifully decorated homeowners8099 lounge" NA
6 " Parking spaces and car ports are available via an annual permit" Parking
7 " Wheelchair access" NA
8 " Lifts to all floors" NA
9 " Fire detection" NA
10 " Intruder alarm" NA
11 " Village Location" NA
12 " 4 Bedrooms" NA
13 " Gardens" Garden
14 " Balcony" Balcony
15 " On streetresidents parking" parking
16 " Central heating" NA
17 " Double glazing" NA
18 " Fireplace" Fireplace
19 " Ruralsecluded" NA
Data:
txt <- '"[\"Stunning seaside location\", \"24-hour emergency call system and secure video entry\", \"Mature landscaped gardens with large terraces and seating areas\", \"Walk out balconies to selected apartments\", \"Beautifully decorated homeownersâ\200\231 lounge\", \"Parking spaces and car ports are available via an annual permit\", \"Wheelchair access\", \"Lifts to all floors\", \"Fire detection\", \"Intruder alarm\"]", "[\"Village Location, 4 Bedrooms, Garden(s)\"]", "[\"Balcony\", \"On street/residents parking\", \"Central heating\", \"Double glazing\", \"Fireplace\", \"Rural/secluded\"]"'
If there may be more than 1 keyword per substring, then use str_extract_all in this way:
data.frame(txt) %>%
  mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
  separate_rows(txt, sep = ",") %>%
  mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
  unnest(where(is.list), keep_empty = TRUE)
EDIT:
If the OP is looking to obtain a variable for each keyword, then this works:
data.frame(txt) %>%
  mutate(txt = gsub("(?![, /])\\W", "", txt, perl = TRUE)) %>%
  separate_rows(txt, sep = ", ") %>%
  mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
  # unnest listed items:
  unnest(where(is.list), keep_empty = TRUE) %>%
  # capitalize initial letter:
  mutate(keywords = sub("^(.)", "\\U\\1", keywords, perl = TRUE)) %>%
  # cast each keyword into its own column:
  pivot_wider(names_from = keywords, values_from = keywords,
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 19 × 6
txt `NA` Garden Parking Balcony Fireplace
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Stunning seaside location 1 0 0 0 0
2 24hour emergency call system and secure video entry 1 0 0 0 0
3 Mature landscaped gardens with large terraces and seating areas 0 1 0 0 0
4 Walk out balconies to selected apartments 1 0 0 0 0
5 Beautifully decorated homeowners8099 lounge 1 0 0 0 0
6 Parking spaces and car ports are available via an annual permit 0 0 1 0 0
7 Wheelchair access 1 0 0 0 0
8 Lifts to all floors 1 0 0 0 0
9 Fire detection 1 0 0 0 0
10 Intruder alarm 1 0 0 0 0
11 Village Location 1 0 0 0 0
12 4 Bedrooms 1 0 0 0 0
13 Gardens 0 1 0 0 0
14 Balcony 0 0 0 1 0
15 On street/residents parking 0 0 1 0 0
16 Central heating 1 0 0 0 0
17 Double glazing 1 0 0 0 0
18 Fireplace 0 0 0 0 1
19 Rural/secluded 1 0 0 0 0
The goal is to read the 1-5yr GIC rates for Guaranteed Investment Certificate - Long-Term and Compound Interest under the Non-Cashable GICs tab.
SelectorGadget tells me that the CSS selector is #container-9565195e5e .cmp-chart__chart span. Using rvest:
page <- read_html('https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/')
page %>%
  html_nodes("#container-9565195e5e .cmp-chart__chart span")
# {xml_nodeset (5)}
# [1] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:1" data-value="postedRate"></span>
# [2] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:2" data-value="postedRate"></span>
# [3] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:3" data-value="postedRate"></span>
# [4] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:4" data-value="postedRate"></span>
# [5] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:5" data-value="postedRate"></span>
rvest can't read the actual rates because the site fills them in with JavaScript.
Turning to RSelenium using the same css selector results in an error:
remDr$navigate("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/")
webElem <- remDr$findElement(using = "css", "#container-9565195e5e .cmp-chart__chart span")
# Selenium message:Unable to locate element: {"method":"css selector","selector":"#container-9565195e5e .cmp-chart__chart span"}
# For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
# Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
# System info: host: 'ef4080d2cb73', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-135-generic', java.version: '1.8.0_91'
# Driver info: driver.version: unknown
#
# Error: Summary: NoSuchElement
# Detail: An element could not be located on the page using the given search parameters.
# class: org.openqa.selenium.NoSuchElementException
# Further Details: run errorDetails method
So how do I use RSelenium to read the 1-5yr rates for Guaranteed Investment Certificate - Long-Term and Compound Interest for Non-registered and Registered (TFSA, RSP, RIF, RESP)?
I replaced RSelenium with Chromote (which is on its way into rvest: r4ds, gh). The selector in the question actually seems to refer to another table, Long-Term and Simple Interest; while the values are currently the same, I still switched to the table mentioned in the question.
library(chromote)
library(rvest)
b <- ChromoteSession$new()
# Display the current session in the Chromote browser:
# b$view()
b$Page$navigate("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/")
b$Page$loadEventFired()
# Non-Cashable GICs >> Guaranteed Investment Certificate - Long-Term and Compound Interest
b$Runtime$evaluate("document.querySelector('#container-8a263227af table').outerHTML")$result$value %>%
minimal_html() %>%
html_element("table") %>%
html_table()
#> # A tibble: 5 × 2
#> Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
#> <chr> <chr>
#> 1 1 year 4.65%
#> 2 2 years 4.35%
#> 3 3 years 3.75%
#> 4 4 years 4%
#> 5 5 years 4.05%
### A few alternatives
# evaluate JS in the runtime:
sapply(1:5, \(x) b$Runtime$evaluate(paste0("document.querySelector('[data-filter-item=\"productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:",x,"\"]').innerText"))$result$value)
#> [1] "4.65" "4.35" "3.75" "4" "4.05"
doc <- b$DOM$getDocument()
# elements where "data-filter-item" attribute starts with "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:"
nodeids <- b$DOM$querySelectorAll(doc$root$nodeId, '[data-filter-item^="productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:"]')
sapply(nodeids$nodeIds, \(x) b$DOM$getOuterHTML(x) %>% minimal_html() %>% html_text())
#> [1] "4.65" "4.35" "3.75" "4" "4.05"
# close session
b$close()
#> [1] TRUE
Created on 2023-01-21 with reprex v2.0.2
The page does an initial POST request that gets all the data (let's call it master) for all the options. It then uses the various data-filter-item attribute values associated with a given table's cells e.g., data-filter-item="productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:1", to filter the master data to the items needed to update that table.
You can replicate a simplified version of this POST request, create a DataFrame of all values in the value part of the response (think of it as a master reference table), and then extract the required filters from a request to the original URI:
> filters
[1] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:1" "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:2"
[3] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:3" "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:4"
[5] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:5"
and turn those into a DataFrame for filtering.
You can then subset the master table using the smaller DataFrame. Column names will match up, since the master's column names are set from the key values of the key:value response.
Finally, take the table extracted from the response for the initial webpage and update the relevant column with the rate % from the filtered master DataFrame.
The HTML of the initial webpage is invalid, so selecting the target table was not as straightforward as I would have liked.
The CSS selector list I went with was designed to hopefully have a longer shelf-life, i.e. to remain valid for longer than a more brittle path might.
One other thing worth showing is the response from the POST request. It has the following key:value format, where I use the key column to generate headers for my master DataFrame; the values get turned into the master DataFrame of all possible rates (and other dynamic page info).
Credit:
I took the approach used by @akrun in their answer here, whereby read.dcf is used to map a set of rows, with potentially repeated/new headers, into a single DataFrame with all headers present; NA is entered if a particular entry is not present in a given processed row.
This allowed me to turn this list of split filtering instructions:
> lapply(filters, str_split, "\\|") %>% unlist(recursive = F)
[[1]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:1"
[[2]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:2"
[[3]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:3"
[[4]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:4"
[[5]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:5"
into this:
> data_df
productId minimumDepositAmt minimumTermYearCnt
1 703 0.01 1
2 703 0.01 2
3 703 0.01 3
4 703 0.01 4
5 703 0.01 5
i.e. the set of filtering instructions for the master DataFrame, as a DataFrame.
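For illustration (plain base R, separate from the pipeline below): read.dcf() parses "field:value" lines from a connection into a one-row character matrix whose column names are the field names, which is what makes the header mapping work.
# illustration only: each "field:value" line becomes a named column
read.dcf(textConnection(c("productId:703", "minimumDepositAmt:0.01", "minimumTermYearCnt:1")))
#      productId minimumDepositAmt minimumTermYearCnt
# [1,] "703"     "0.01"            "1"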
The master DataFrame looks as follows:
> df %>% head()
productId minimumDepositAmt maximumDepositAmt minimumTermYearCnt maximumTermYearCnt minimumTermDayCnt maximumTermDayCnt postedRate
1 107 0.01 4999.99 0 0 90 119 4
2 107 5000 9999.99 0 0 90 119 4
3 107 10000 24999.99 0 0 90 119 4
4 107 25000 49999.99 0 0 90 119 4
5 107 50000 99999.99 0 0 90 119 4
6 107 100000 249999.99 0 0 90 119 4
minimumMarketGrowthRate maximumMarketGrowthRate stepperYear1Rate stepperYear2Rate stepperYear3Rate stepperYear4Rate stepperYear5Rate
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
The subset master DataFrame:
> filtered_df
productId minimumDepositAmt minimumTermYearCnt maximumDepositAmt maximumTermYearCnt minimumTermDayCnt maximumTermDayCnt postedRate
1 703 0.01 1 4999.99 1 0 364 4.65
2 703 0.01 2 4999.99 2 0 364 4.35
3 703 0.01 3 4999.99 3 0 364 3.75
4 703 0.01 4 4999.99 4 0 364 4
5 703 0.01 5 4999.99 5 0 364 4.05
minimumMarketGrowthRate maximumMarketGrowthRate stepperYear1Rate stepperYear2Rate stepperYear3Rate stepperYear4Rate stepperYear5Rate
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
The table extracted from the initial page, before the update:
> table
# A tibble: 5 × 2
Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
<chr> <chr>
1 1 year %
2 2 years %
3 3 years %
4 4 years %
5 5 years %
And the table after the update using the master DataFrame (df, the data from the POST request for rates info):
> print(table)
# A tibble: 5 × 2
Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
<chr> <chr>
1 1 year 4.65%
2 2 years 4.35%
3 3 years 3.75%
4 4 years 4%
5 5 years 4.05%
R:
library(rvest)
library(tidyverse)
library(httr2)
page <- read_html("https://www.td.com/ca/en/personal-banking/personal-investing/products/gic/gic-rates-canada")
table_node <- page %>%
  html_element('div.container:contains("Guaranteed Investment Certificate - Long-Term") .text:contains("Compound") ~ div table')

filters <- table_node %>%
  html_elements("[data-filter-item]") %>%
  html_attr("data-filter-item")

res <- request("https://www.td.com/ca/en/personal-banking/getRates/") %>%
  req_headers(
    "user-agent" = "Mozilla/5.0",
    "content-type" = "application/json",
    "x-kl-ajax-request" = "Ajax_Request"
  ) %>%
  req_body_json(list("errorText" = "Unable to get the rate", "ratesType" = "gic")) %>%
  req_perform() %>%
  resp_body_string()

data <- jsonlite::parse_json(res, simplifyVector = T)
df <- set_names(data$value %>% as.data.frame(), data$key)

data_df <- map_dfr(
  lapply(filters, str_split, "\\|") %>% unlist(recursive = F),
  ~ if (length(.x) > 0) as.data.frame(read.dcf(textConnection(.x))) else NULL
)
filtered_df <- inner_join(data_df, df)
table <- table_node %>% html_table()
table[2] <- str_c(filtered_df$postedRate, table[[2]])
print(table)
I'm working with some wind direction data for a potential paper. I am trying to compare the number of days the wind is blowing easterly (negative U) with the number of days it is blowing westerly (positive U). I need to calculate this over an austral summer, i.e. the period between October and March, e.g. October 1993 to March 1994.
Here is a sample of my data frame:
Year Month Day Hour Minutes Seconds Ws U V
1 1993 1 1 0 0 0 3.750620 2.822403 1.281318
2 1993 1 1 6 0 0 4.207054 3.600465 1.719147
3 1993 1 1 12 0 0 5.050543 3.155271 3.243411
4 1993 1 1 18 0 0 3.165194 -0.477054 2.926124
5 1993 1 2 0 0 0 1.529690 -0.721395 -0.503101
6 1993 1 2 6 0 0 1.950233 0.303333 -1.728295
7 1993 1 2 12 0 0 4.548992 -2.868217 3.307519
8 1993 1 2 18 0 0 6.563643 -6.245194 1.744419
9 1993 1 3 0 0 0 5.868992 -5.805969 -0.594031
10 1993 1 3 6 0 0 6.530620 -6.446667 -0.689535
11 1993 1 3 12 0 0 7.085736 -6.657984 1.834884
12 1993 1 3 18 0 0 7.685349 -7.111008 2.571783
13 1993 1 4 0 0 0 6.508760 -6.414574 -0.678837
14 1993 1 4 6 0 0 6.141860 -6.006822 -0.272558
15 1993 1 4 12 0 0 7.388295 -6.744574 1.862868
16 1993 1 4 18 0 0 7.281163 -7.054264 0.896512
17 1993 1 5 0 0 0 4.847287 -4.431628 -0.813643
18 1993 1 5 6 0 0 3.482558 -1.670078 2.048915
19 1993 1 5 12 0 0 5.698992 1.097287 5.433721
20 1993 1 5 18 0 0 4.894031 1.445736 4.440465
21 1993 1 6 0 0 0 1.983411 0.783023 1.556047
22 1993 1 6 6 0 0 2.315891 -1.225891 1.756744
23 1993 1 6 12 0 0 4.525581 -4.016124 1.723721
24 1993 1 6 18 0 0 5.123566 -4.618682 0.759225
25 1993 1 7 0 0 0 3.449147 -2.639457 -1.627442
26 1993 1 7 6 0 0 2.067364 1.185891 -0.760233
27 1993 1 7 12 0 0 5.675814 3.872171 3.419690
28 1993 1 7 18 0 0 6.278450 3.989767 4.684031
29 1993 1 8 0 0 0 6.562636 5.496667 3.329302
30 1993 1 8 6 0 0 7.762636 5.280310 5.516589
31 1993 1 8 12 0 0 9.283953 5.575659 7.294264
>
So far I have managed to do this calculation for one month only (see code below), but I'm unsure how to do it from October of one year to March of the next. When I tried filter(wind,Year==1993:1994,Month==10:3,U>0) I got this warning:
In Month == 10:3 :
longer object length is not a multiple of shorter object length
This is what I have done so far to calculate the number of positive and negative directions for October 1993, which has worked. I am new to R and Stack Overflow, so I hope I have set this out correctly!
filter(wind,Year==1993,Month==10,U>0)
Oct_1993_pos<-filter(wind,Year==1993,Month==10,U>0)
Oct_1993_pos
filter(wind,Year==1993,Month==10,U<0)
Oct_1993_neg<-filter(wind,Year==1993,Month==10,U<0)
Oct_1993_neg
sum(Oct_1993_pos$U>0)
sum(Oct_1993_neg$U<0)
Your first error (Month == 10:3) occurs because you are comparing a vector (Month) with another vector. When you do this, you get an element-wise comparison, i.e. Month[1] == 10, Month[2] == 9, etc. When the vectors are of unequal length, R recycles the shorter one - and warns if the length of the longer one is not an exact multiple of the length of the shorter one:
c(1,2,3,1,2,3) == c(1,2)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
c(1,2,3,1,2) == c(1,2)
[1] TRUE TRUE FALSE FALSE FALSE
Warning message:
In c(1, 2, 3, 1, 2) == c(1, 2) :
longer object length is not a multiple of shorter object length
For counting positive and negative U's, you can exploit that summing logicals simply counts the number of TRUEs:
sum(c(FALSE, TRUE, TRUE, FALSE))
[1] 2
And you can obtain such logicals with a simple comparison:
sum(U > 0)
For your calculations I would recommend using dplyr. With this you can repeat your counting across any collection of subsets. Try:
# if following fails, run install.packages("dplyr")
library(dplyr)
monthly <- wind %>%
  group_by(Year, Month) %>%
  summarise(
    pos = sum(U > 0),
    neg = sum(U < 0),
    nowind = sum(U == 0),
    entries = n()
  )
Edit in response to comment:
Depending on whether you need intermediate results, we could do a couple of things. Regarding the period October to March, you have to be careful if your data spans several years.
monthly %>%
  filter((Month >= 10 & Year == 1993) | (Month <= 3 & Year == 1994)) %>%
  ungroup %>%
  summarise_at(vars(pos, neg, nowind, entries), sum)
or, just filter before you summarise:
wind %>%
  filter((Month >= 10 & Year == 1993) | (Month <= 3 & Year == 1994)) %>%
  summarise(
    pos = sum(U > 0),
    neg = sum(U < 0),
    nowind = sum(U == 0),
    entries = n()
  )
Beware that I am using the single boolean operators (|, &) and not the double ones (||, &&), as we want to keep the element-wise comparisons (the double variants collapse to a single element).
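A quick illustration of the difference (note that in recent versions of R, && on vectors longer than one is an error; older versions silently used only the first elements):
c(TRUE, TRUE) & c(TRUE, FALSE)
# [1]  TRUE FALSE
# `&&` always collapses to a single TRUE/FALSE, which is not what we want here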
If you want to see winter vs. summer periods across multiple years, we have to figure out how to group the seasons correctly. For this, I will build a data set of years and months:
library(tidyr)
seasons <- crossing(month = 1:12, year = 1992:1994) %>%
  arrange(year, month) %>%
  mutate(
    # seasons change at the start of April and October, so the season that
    # starts in October runs October-March
    season_start = month %in% c(4, 10),
    season = cumsum(season_start)
  )
With this approach, we've split the problem in two: 1) Define the seasons you wish to summarise over, and 2) summarise it.
inner_join(wind, seasons, by = c('Year' = 'year', 'Month' = 'month')) %>%
  group_by(season) %>%
  summarise(
    seasonstart = paste0(min(Year), '-', min(Month)),
    pos = sum(U > 0),
    neg = sum(U < 0),
    nowind = sum(U == 0),
    entries = n()
  )
So, to summarise over the period October-March, same as before, just define a different grouping.
For exercises, try adding Year and/or Month to the group_by call in the last example.
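For instance, a sketch of the first of those exercises - adding Year to the grouping yields per-year counts within each season:
inner_join(wind, seasons, by = c('Year' = 'year', 'Month' = 'month')) %>%
  group_by(season, Year) %>%
  summarise(pos = sum(U > 0), neg = sum(U < 0), nowind = sum(U == 0), entries = n())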
OK, so I have written a loop which is intended, in its first part, to perform a multiplication of two columns. For the remainder of the loop it should perform another multiplication using two different columns than the first.
The columns for multiplication are ocret and clret, which are each multiplied against response.
My code for this:
train.set$output[[1]] = if (train.set$response[[1]] == 1) {
  apply(train.set[, c('ocret', 'response')], 1, function(x) { (x[1] * x[2]) })
}
for (i in 2:nrow(train.set)) {
  train.set$output[i] = if (train.set$response[i] == 1) {
    apply(train.set[, c('clret', 'response')], 1, function(x) { (x[1] * x[2]) })
    train.set$output[i-1]
  }
}
The idea was that on first finding a response == 1, it would perform the ocret * response calculation.
The second part of the loop starts on row 2, so as not to overwrite the first part, and continues down the rows performing the clret * response calculation.
The logic makes sense to me, and this is pretty much my first attempt at a loop. When I run the code, nothing happens - it doesn't create the output column. Can anyone give me any pointers? I keep re-reading it and it makes sense to me, so I'm not sure what I'm missing; any explanation greatly appreciated.
Example data frame and output below:
ocret clret response output
1 0.00730616 0.003382433 0 0
2 -0.084899894 -0.088067766 0 0
3 0.047208568 0.054174679 1 0.047208568
4 -0.002671414 -0.004543992 0 0
5 -0.039943462 -0.040290793 0 0
6 -0.01428499 -0.013506524 0 0
7 -0.037054965 -0.038517845 0 0
8 -0.058027611 -0.057394837 1 -0.058027611
9 -0.004014491 -0.011332705 1 -0.011332705
10 -0.079419682 -0.076167096 1 -0.076167096
11 -0.003424577 -0.011759287 1 -0.011759287
12 0.099260455 0.115800375 1 0.115800375
13 -0.011841897 -0.005322141 1 -0.005322141
14 -0.087230999 -0.090349775 1 -0.090349775
15 0.040570359 0.042507445 1 0.042507445
16 -0.001846555 -0.006212821 1 -0.006212821
17 0.044398056 0.047684898 1 0.047684898
18 -0.025856823 -0.030799705 0 0
19 -0.057677505 -0.061012471 0 0
20 0.010043567 0.012634046 0 0
21 -0.020609404 -0.034511205 0 0
Row 3: ocret * response
Row 8: ocret * response
Rows 9 to 17: clret * response
A for loop may not be required. We can use dplyr and data.table to get the desired output (dt2).
library(dplyr)
library(data.table)
dt2 <- dt %>%
  # give each consecutive run of identical response values its own id
  mutate(RunID = rleid(response)) %>%
  group_by(RunID) %>%
  # within a run of response == 1: the first row uses ocret, the rest use clret
  mutate(output = ifelse(response == 0, 0,
                         ifelse(row_number() == 1, ocret, clret))) %>%
  ungroup() %>%
  select(-RunID)
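To see what rleid() contributes: it assigns a new id each time the value changes, so every consecutive run of response values becomes its own group, and row_number() == 1 then picks out the first row of each run. A quick illustration:
rleid(c(0, 0, 1, 0, 1, 1, 1))
# [1] 1 1 2 3 4 4 4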
Data Preparation
dt <- read.table(text = " ocret clret response
1 0.00730616 0.003382433 0
2 -0.084899894 -0.088067766 0
3 0.047208568 0.054174679 1
4 -0.002671414 -0.004543992 0
5 -0.039943462 -0.040290793 0
6 -0.01428499 -0.013506524 0
7 -0.037054965 -0.038517845 0
8 -0.058027611 -0.057394837 1
9 -0.004014491 -0.011332705 1
10 -0.079419682 -0.076167096 1
11 -0.003424577 -0.011759287 1
12 0.099260455 0.115800375 1
13 -0.011841897 -0.005322141 1
14 -0.087230999 -0.090349775 1
15 0.040570359 0.042507445 1
16 -0.001846555 -0.006212821 1
17 0.044398056 0.047684898 1
18 -0.025856823 -0.030799705 0
19 -0.057677505 -0.061012471 0
20 0.010043567 0.012634046 0
21 -0.020609404 -0.034511205 0",
header = TRUE, stringsAsFactors = FALSE)
When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, as in Chinese, English also has certain set phrases, such as "semantic distance" or "machine learning"; if you segment them into single words, they have totally different meanings. I want to know how to segment a document into phrases rather than words (terms).
You can do this in R using the quanteda package, which can detect multi-word expressions as statistical collocations - probably what you are referring to as phrases in English. To remove the collocations containing stop words, first tokenise the text, then remove the stop words, leaving a "pad" in place to prevent false adjacencies in the results (two words that were not actually adjacent before the stop words between them were removed).
require(quanteda)
pres_tokens <-
  tokens(data_corpus_inaugural) %>%
  tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%
  tokens_remove(stopwords("english"), padding = TRUE)
pres_collocations <- textstat_collocations(pres_tokens, size = 2)
head(pres_collocations)
# collocation count count_nested length lambda z
# 1 united states 157 0 2 7.893307 41.19459
# 2 let us 97 0 2 6.291128 36.15520
# 3 fellow citizens 78 0 2 7.963336 32.93813
# 4 american people 40 0 2 4.426552 23.45052
# 5 years ago 26 0 2 7.896626 23.26935
# 6 federal government 32 0 2 5.312702 21.80328
# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])
tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon" "shall_endeavor" "high_sense" "official_act"
Using this "compounded" token set, we can now turn this into a document-feature matrix where the features consist of a mixture of original terms (those not found in a collocation) and the collocations. As can be seen below, "united" occurs alone and as part of the collocation "united_states".
pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
# features
# docs united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
# 1789-Washington 4 2 0 0 0 0 0 0 0 0
# 1793-Washington 1 0 0 0 0 0 0 0 0 0
# 1797-Adams 3 9 0 0 0 0 0 0 0 0
# 1801-Jefferson 0 0 0 0 0 0 0 0 0 0
# 1805-Jefferson 1 4 0 0 0 0 0 0 0 0
If you want a more brute-force approach, it's possible simply to create a document-by-bigram matrix this way:
# just form all bigrams
head(dfm(data_corpus_inaugural, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens_of of_the the_senate senate_and and_of the_house
## 1789-Washington 1 20 1 1 2 2
## 1797-Adams 0 29 0 0 2 0
## 1793-Washington 0 4 0 0 1 0
## 1801-Jefferson 0 28 0 0 3 0
## 1805-Jefferson 0 17 0 0 1 0
## 1809-Madison 0 20 0 0 2 0
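Note that in quanteda v3 and later the ngrams argument to dfm() has been removed; an equivalent sketch for current versions forms the bigrams at the tokens stage:
# same brute-force bigram dfm, written for quanteda >= 3
tokens(data_corpus_inaugural) %>%
  tokens_ngrams(n = 2) %>%
  dfm()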
I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, and the code itself.
I've got a vector of 200,000 lines containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a matrix of 131x2 string elements, which are 5-grams (parts of words) and the ids of the bags of n-grams (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", "overf", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numerical matrix initialised with 0s: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains the 5-gram.
Example:
In the address "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...] whose id is 5, so I put a 1 in the column called 5. The corresponding line of the idv_x_bags matrix will therefore look like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
idv_x_bags <- matrix(rep(0, nrow(data) * 31), nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags)) {
  for (ngram in list_ngrams$ngram) {
    if (grepl(ngram, data[i, ])) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram = '%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours to run, which is huge. I tried to recode it in C++ using the Rcpp library, but I ran into many problems. I also tried to recode it using apply, but couldn't get it to work.
Here is what I tried:
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    if (grepl(y[2], x[1])) {
      idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1
    }
  })
})
I need some help recoding my loop using apply, or some other method that runs faster than the current one. Thank you very much.
Check this one out, and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2, 2, 2, 3, 3, 3),
                    ngr = c("th", "go", "tt", "drf", "ytu", "bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
  data.frame() %>%
  rename(sen = Var1,
         id_ngr = Var2) %>%
  left_join(dt_ngr, by = "id_ngr") %>%
  group_by(sen, id_ngr, ngr) %>%
  do(data.frame(match = grepl(.$ngr, .$sen))) %>%
  group_by(sen, id_ngr) %>%
  summarise(sum_success = sum(match)) %>%
  mutate(match = ifelse(sum_success > 0, 1, 0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1