all,
I have imported the sotu corpus from quanteda in R. I am somewhat new to dfm objects and want to separate the doc_id column into a name column and a year column. If this were a tibble, this code would work:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- as_tibble(data_corpus_sotu)
sotusubsetted <- sotu %>%
separate(doc_id, c("name","year"),"-")
However, since I am new to dfm objects and regex, I am not sure whether there is an equivalent process if I load the data as:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- corpus(data_corpus_sotu)
sotudfm <- dfm(sotu)
Is there some equivalent way to do this with dfm objects?
The safest method is also one that will work for any core quanteda object, meaning equally for a corpus, tokens, or dfm object. It involves using the accessor functions rather than addressing the internals of the corpus or dfm objects directly, which is strongly discouraged. You can address the internals, but your code could break in the future if those object structures are changed. In addition, our accessor functions are generally also the most efficient method.
For this task, you want to use the docnames() function for accessing the document IDs, and this works for the corpus as well as for the dfm.
library("quanteda")
## Package version: 2.1.2
data("data_corpus_sotu", package = "quanteda.corpora")
data.frame(doc_id = docnames(data_corpus_sotu[1:5])) %>%
tidyr::separate(doc_id, c("name", "year"), "-")
## name year
## 1 Washington 1790
## 2 Washington 1790b
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
data.frame(doc_id = docnames(dfm(data_corpus_sotu[1:5]))) %>%
tidyr::separate(doc_id, c("name", "year"), "-")
## name year
## 1 Washington 1790
## 2 Washington 1790b
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
You could also have taken this from the "President" docvar field and the "Date":
data.frame(
name = data_corpus_sotu$President,
year = lubridate::year(data_corpus_sotu$Date)
) %>%
head()
## name year
## 1 Washington 1790
## 2 Washington 1790
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
## 6 Washington 1794
Created on 2021-02-13 by the reprex package (v1.0.0)
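Note that the second 1790 address has the ID Washington-1790b, so the year column produced by separate() stays character. A minimal base R sketch that drops the suffix and returns an integer year is below; the ids vector is a hypothetical stand-in for the output of docnames().

```r
# Hypothetical document IDs in the same "Name-Year" form as docnames() returns
ids <- c("Washington-1790", "Washington-1790b", "Adams-1797")

parts <- data.frame(
  # everything before the first hyphen
  name = sub("-.*$", "", ids),
  # the four digits after the first hyphen, ignoring a trailing suffix like "b"
  year = as.integer(sub("^[^-]*-([0-9]{4}).*$", "\\1", ids)),
  stringsAsFactors = FALSE
)
parts
```

This assumes every ID contains a hyphen followed by a four-digit year; IDs in any other form would need a different pattern.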
The following code will do exactly what you want, although it might break some operations in quanteda that look for docid_ in sotudfm@docvars, the data frame that stores the document-level variables. For instance, it will break any filtering by sotudfm@Dimnames$docs, which is where the dimension names of the documents are listed.
sotudfm@docvars <- sotudfm@docvars %>% separate(col = docid_, c("name","year"),"-")
> sotudfm@docvars %>% as_tibble()
# A tibble: 241 x 10
docname_ name year segid_ FirstName President Date delivery type party
<chr> <chr> <chr> <int> <chr> <chr> <date> <fct> <fct> <fct>
1 Washington-1790 Washington 1790 1 George Washington 1790-01-08 spoken SOTU Independent
2 Washington-1790b Washington 1790b 1 George Washington 1790-12-08 spoken SOTU Independent
3 Washington-1791 Washington 1791 1 George Washington 1791-10-25 spoken SOTU Independent
4 Washington-1792 Washington 1792 1 George Washington 1792-11-06 spoken SOTU Independent
5 Washington-1793 Washington 1793 1 George Washington 1793-12-03 spoken SOTU Independent
6 Washington-1794 Washington 1794 1 George Washington 1794-11-19 spoken SOTU Independent
7 Washington-1795 Washington 1795 1 George Washington 1795-12-08 spoken SOTU Independent
8 Washington-1796 Washington 1796 1 George Washington 1796-12-07 spoken SOTU Independent
9 Adams-1797 Adams 1797 1 John Adams 1797-11-22 spoken SOTU Federalist
10 Adams-1798 Adams 1798 1 John Adams 1798-12-08 spoken SOTU Federalist
Here is the code that ended up working for me:
sotudfm@docvars <- sotudfm@docvars %>%
separate(col = docname_, c("name","year"),"-")
This kept the doc_id intact when I ran
head(sotudfm, 10)
It appears that docid_ and docname_ are identical.
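For comparison, here is a sketch of the accessor-based route recommended in the first answer, which adds the two columns without writing to the @docvars slot. The toy corpus and its texts are hypothetical, and the code assumes quanteda is installed; docnames() and the docvars() replacement function are the accessors being used.

```r
library(quanteda)

# Toy corpus with hypothetical texts; the vector names become the document names
corp <- corpus(c("Washington-1790" = "Fellow citizens of the Senate",
                 "Adams-1797" = "Gentlemen of the Senate"))
toy_dfm <- dfm(tokens(corp))

# Derive the new fields from the document names and store them as docvars
ids <- docnames(toy_dfm)
docvars(toy_dfm, "name") <- sub("-.*$", "", ids)
docvars(toy_dfm, "year") <- sub("^[^-]*-", "", ids)

docvars(toy_dfm)
```

Because the internal docid_ and docname_ columns are left alone, the document names of the dfm are unchanged.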
I have a dataset with a column, CatSex, that's got data in it in a form similar to "American.Indian.or.Alaska.Native.men"--the characters after the last period, I want to turn into a new pivoted column, so I have two columns, one called Cat with only the demographic info in it, and one called Sex with the sex in it. The characters before the sex designation don't follow any clear pattern. I am not very good at R, but it's better than Tableau Prep with large data sets, it seems. What I ultimately want is to pivot the data so that I have two distinct columns for the different categories here. I used this code to get part of the way there (the original data held like 119 columns with names like "Grand.total.men..C2005_A_RV..First.major..Area..ethnic..cultural..and.gender.studies...Degrees.total"), but I can't figure out how to do this with the pattern I'm now left with in the column CatSex:
pivot_longer(
cols = -c(UnitID, Institution.Name),
names_to = c("CatSex", "Disc"),
names_pattern = "(.*)..C2005_A_RV..First.major..(.*)",
values_to = "Count",
values_drop_na = TRUE
)
Here's a screenshot of the data structure I have now. I'm sorry for not putting in reproducible code--I don't know how to do that in this context!
EDIT: Here's a head(df) of the cleaned data so far:
# A tibble: 6 × 5
UnitID Institution.Name CatSex Disc Count
<int> <fct> <chr> <chr> <int>
1 177834 A T Still University of Health Sciences Grand.total.men Health.professions.and.related.clinical.sciences...Degrees.total. 212
2 177834 A T Still University of Health Sciences Grand.total.women Health.professions.and.related.clinical.sciences...Degrees.total. 359
3 177834 A T Still University of Health Sciences White.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 181
4 177834 A T Still University of Health Sciences White.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 317
5 177834 A T Still University of Health Sciences Black.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 3
6 177834 A T Still University of Health Sciences Black.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 5
Using extract from tidyr package (it is in tidyverse)
Capture 2 groups with ()
Define second group to have one or more characters that are not . up to the end $
library(dplyr)
library(tidyr)
df %>%
extract(CatSex, c("Cat", "Sex"), "(.*)\\.([^.]+)$")
UnitID Institution.Name Cat Sex
1 222178 Abilene Christian University Hispanic men
2 222178 Abilene Christian University Hispanic women
3 222178 Abilene Christian University American.Indian.or.Alaska.Native men
4 222178 Abilene Christian University American.Indian.or.Alaska.Native women
5 222178 Abilene Christian University Asian.or.Pacific.Islander women
6 222178 Abilene Christian University Asian.or.Pacific.Islander men
7 222178 Abilene Christian University Grand.total men
8 222178 Abilene Christian University Grand.total women
9 222178 Abilene Christian University White.non.Hispanic men
10 222178 Abilene Christian University White.non.Hispanic women
11 222178 Abilene Christian University Black.non.Hispanic men
12 222178 Abilene Christian University Black.non.Hispanic women
13 222178 Abilene Christian University Hispanic men
14 222178 Abilene Christian University Hispanic women
15 222178 Abilene Christian University American.Indian.or.Alaska.Native men
Disc
1 Communication..journalism..and.related.programs
2 Communication..journalism..and.related.programs
3 Communication..journalism..and.related.programs
4 Communication..journalism..and.related.programs
5 Communication..journalism..and.related.programs
6 Communication..journalism..and.related.programs
7 Computer.and.information.sciences.and.support.serv
8 Computer.and.information.sciences.and.support.servi
9 Computer.and.information.sciences.and.support.servi
10 Computer.and.information.sciences.and.support.servi
11 Computer.and.information.sciences.and.support.servi
12 Computer.and.information.sciences.and.support.servi.
13 Computer.and.information.sciences.and.support.serv
14 Computer.and.information.sciences.and.support.servi.
15 Computer.and.information.sciences.and.support.servi
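The pattern passed to extract() above can be checked on plain strings in base R. This is a minimal sketch using two made-up values in the same form as CatSex; the greedy first group runs up to the last period, and the second group takes whatever follows it.

```r
# Hypothetical values in the same form as the CatSex column
cat_sex <- c("Grand.total.men", "American.Indian.or.Alaska.Native.women")

# Group 1: everything up to the last period; group 2: the non-period tail
pattern <- "(.*)\\.([^.]+)$"

split_last <- data.frame(
  Cat = sub(pattern, "\\1", cat_sex),
  Sex = sub(pattern, "\\2", cat_sex),
  stringsAsFactors = FALSE
)
split_last
```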
pivot_longer is not the right function in this context.
Here are a few options:
Using tidyr::separate
tidyr::separate(df, 'CatSex', c('Cat', 'Sex'), sep = '(\\.)(?!.*\\.)')
# Cat Sex
#1 Grand.total men
#2 Grand.total women
#3 White.non.Hispanic men
#4 White.non.Hispanic women
#5 Black.non.Hispanic men
#6 Black.non.Hispanic women
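The separator used above matches a period only when no other period follows it, which is what restricts the split to the last one. A small base R sketch of the same idea, with made-up input values, uses strsplit() with perl = TRUE so that the negative lookahead is supported:

```r
# Hypothetical values in the same form as the CatSex column
cat_sex <- c("Grand.total.men", "White.non.Hispanic.women")

# "\\.(?!.*\\.)" = a period that is not followed by any later period
parts <- strsplit(cat_sex, "\\.(?!.*\\.)", perl = TRUE)

# Bind the two pieces of each string into a two-column character matrix
result <- do.call(rbind, parts)
colnames(result) <- c("Cat", "Sex")
result
```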
Using stringr functions
library(dplyr)
library(stringr)
df %>%
mutate(Sex = str_extract(CatSex, '(men|women)$'),
Cat = str_remove(CatSex, '\\.(men|women)$'))
In base R
transform(df, Sex = sub('.*\\.(men|women)$', '\\1', CatSex),
Cat = sub('\\.(men|women)$', '', CatSex))
data
It is easier to help if you provide data in a reproducible format:
df <- data.frame(CatSex = c("Grand.total.men", "Grand.total.women",
"White.non.Hispanic.men", "White.non.Hispanic.women",
"Black.non.Hispanic.men", "Black.non.Hispanic.women"))
I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data:
# xx is the raw text held in a single string, one field per line
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
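The matrix approach can be wrapped in a small helper so that the number of fields per record is not hard-coded. This is only a sketch: lines_to_records() is a hypothetical name, and it assumes every record has exactly one line per field, in the same order.

```r
# Hypothetical helper: reshape a character vector of lines into one row per record
lines_to_records <- function(lines, fields) {
  # The number of lines must be a whole multiple of the number of fields
  stopifnot(length(lines) %% length(fields) == 0)
  m <- matrix(lines, ncol = length(fields), byrow = TRUE)
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  names(out) <- fields
  cbind(ID = seq_len(nrow(out)), out)
}

lines <- c("Oscar Wilde", "De Profundis", "Ireland",
           "Nathaniel Hawthorn", "Birthmark", "USA")
books <- lines_to_records(lines, c("Author", "Book", "Country"))
books
```

For a file, the lines argument could come from readLines(), which avoids any surprises from quote or separator handling in read.table().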
There aren't any built in functions that handle data like this. But you can reshape your data after importing.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
mutate(
rid = ((row_number()-1) %% 3)+1,
pid = (row_number()-1) %/%3 + 1) %>%
mutate(col=case_when(rid==1~"Author",rid==2~"Book", rid==3~"Country")) %>%
select(-rid) %>%
pivot_wider(names_from=col, values_from=V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA
I am using quanteda to analyze party platforms, which are txt files. I have combined two sets of party platforms into one dfm:
corp20dr <- corp20d + corp20r
summary(corp20dr)
Document-feature matrix of: 28 documents, 6,595 features (85.0% sparse).
> summary(corp20dr)
Corpus consisting of 28 documents:
Text Types Tokens Sentences
akdem20.txt 1895 7624 332
azdem20.txt 908 2921 94
cadem20.txt 3255 19881 150
medem20.txt 355 863 39
.....................................
wvgop20.txt 1419 5013 106
wygop20.txt 428 1085 45
I would like to compare the Democratic (corp20d) and Republican platforms (corp20r). But, I seem to need to use docvars to make comparisons between the different groups (15 Dem, 13 GOP). When I use textplot_keyness, I intend to get a comparison of all the texts, but the result is to draw the first text against all other texts in the corpus.
corp20dr_dfm <- dfm(corpus(corp20dr),
remove = stopwords("english"), stem = TRUE, remove_numbers = TRUE,
remove_punct = TRUE)
corp20dr_dfm
result_keyness <- textstat_keyness(corp20dr_dfm)
textplot_keyness(result_keyness,
color = c('blue', 'red'))
The result is a comparison of the Alaska platform to the "reference", which seems to be the other 27 documents. I was hoping to compare differences in word usage between the two groups of documents (15 Democratic platforms compared to the 13 Republican platforms), but I seem to have to identify each group using docvars, and I am not sure how to do this. Any help would be appreciated.
The keyness function only compares one reference document to all others, so you should group the documents by the original corpus, before calling textstat_keyness(). You can do this by using dfm_group() on a new docvar that identifies the corpus. See below for a reproducible example.
library("quanteda")
## Package version: 2.1.2
corp_a <- corpus(data_corpus_inaugural[1:5])
corp_b <- corpus(data_corpus_inaugural[6:10])
# this is the key: identifying the original corpus
# will be used to group the dfm later into just two combined "documents"
corp_a$source <- "a"
corp_b$source <- "b"
corp <- corp_a + corp_b
summary(corp)
## Corpus consisting of 10 documents, showing 10 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## 1809-Madison 535 1261 21 1809 Madison James
## 1813-Madison 541 1302 33 1813 Madison James
## 1817-Monroe 1040 3677 121 1817 Monroe James
## 1821-Monroe 1259 4886 131 1821 Monroe James
## 1825-Adams 1003 3147 74 1825 Adams John Quincy
## Party source
## none a
## none a
## Federalist a
## Democratic-Republican a
## Democratic-Republican a
## Democratic-Republican b
## Democratic-Republican b
## Democratic-Republican b
## Democratic-Republican b
## Democratic-Republican b
Now we can go through the steps of forming the dfm, grouping, and getting the keyness statistics. (Here, I've removed stopwords and punctuation as well.)
# using the separate package since we are moving textstat_*() functions
# to this module package with quanteda v3 release planned in 2021
library("quanteda.textstats")
corp %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("en")) %>%
dfm() %>%
dfm_group(groups = "source") %>%
textstat_keyness() %>%
head()
## feature chi2 p n_target n_reference
## 1 love 11.236174 0.0008021834 10 1
## 2 mind 10.108762 0.0014756604 11 3
## 3 good 9.971163 0.0015901101 17 8
## 4 may 9.190508 0.0024327341 38 31
## 5 can 8.887529 0.0028712512 27 19
## 6 shall 7.728615 0.0054352433 23 16
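To see what the grouping step contributes, here is a rough base R illustration with a made-up count matrix. It is not quanteda's implementation: rowsum() simply adds up the feature counts of all documents that share a group label, which is the effect dfm_group() has on the dfm, and the 2 x 2 table at the end is the kind of target-versus-reference comparison a keyness statistic is computed from.

```r
# Hypothetical document-feature counts: four documents, three features
counts <- matrix(
  c(3, 0, 1,
    2, 1, 0,
    0, 4, 2,
    1, 3, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3", "doc4"),
                  c("love", "tax", "union"))
)
src <- c("a", "a", "b", "b")  # group label for each document

# Collapse the four documents into two combined "documents"
grouped <- rowsum(counts, group = src)

# 2 x 2 table for one feature: its count versus all other features, per group
feature <- "love"
tab <- rbind(
  target    = c(grouped["a", feature], sum(grouped["a", ]) - grouped["a", feature]),
  reference = c(grouped["b", feature], sum(grouped["b", ]) - grouped["b", feature])
)
colnames(tab) <- c(feature, "other")
tab
```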
I've read a number of posts on gather but I'm struggling to create a solution that would restructure a file with different widths into a long format.
My data are here:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv")
df2 <- read.csv(text = x)
In the above case, I have groups of 3 columns, each of which need to be stacked up. I tried the following method but my values get spread into the wrong columns:
longJazz<- df2 %>% gather(key,
value,
X1:X69)
The resulting dataframe should have 782 rows and 3 columns (title, year and artist).
In another case, I have groups of 5 columns, so I'd like a solution that can be simply adapted. For instance, a function that takes as arguments a dataframe and the number of columns per group, would be handy.
We can remove the first column 'X', and then rename the columns until the last column 'id', by a sequence of 'Details', 'year', 'Description', then use pivot_longer from tidyr to reshape into 'long' format
library(stringr)
library(dplyr)
library(readr)
library(tidyr)
df2 <- df2[-1]
i1 <- as.integer(gl(ncol(df2)-1, 3, ncol(df2)-1))
names(df2)[1:69] <- str_c(c("Details", "year", "Description"), i1, sep="_")
df2 %>%
mutate_at(vars(starts_with('year')), ~ as.integer(as.character(.))) %>%
pivot_longer(cols = -id, names_sep="_", names_to = c(".value", "group")) %>%
select(-group)
# A tibble: 1,150 x 4
# id Details year Description
# <int> <fct> <int> <fct>
# 1 1 Sophisticated Lady / Tea For Two 1933 Art Tatum
# 2 1 The Genius Of Art Tatum, No. 21 1955 Art Tatum
# 3 1 The Tatum Group Masterpieces, Vol. 5 1964 Art Tatum / Lionel Hampton / Harry Edison / Buddy Rich / Red Callender / Barney Ke…
# 4 1 Live Sessions 1940 / 1941 1975 Art Tatum
# 5 1 20th Century Piano Genius 1986 Art Tatum
# 6 1 Jazz Masters (100 Ans De Jazz) 1998 Art Tatum
# 7 1 The Art Tatum - Ben Webster Quartet 2015 Art Tatum / Ben Webster
# 8 1 El Gran Tatum NA Art Tatum
# 9 1 Sweet Georgia Brown / Shiek Of Araby / Back O' Town Bl… 1945 Benny Goodman Quintet* / Esquire All Stars Featuring Louis Armstrong
#10 1 The Immortal Live Sessions 1944/1947 1975 Louis Armstrong
# … with 1,140 more rows
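The question also asks for something that adapts to groups of 5 columns. A base R sketch of such a function is below; stack_groups() is a hypothetical name, the toy data frame is made up, and the function assumes the value columns are already in group order (title, year, artist, title, year, artist, and so on).

```r
# Hypothetical helper: stack repeating groups of columns into long format
stack_groups <- function(df, id_col, group_names) {
  value_cols <- setdiff(names(df), id_col)
  n <- length(group_names)
  # The value columns must split evenly into groups of n
  stopifnot(length(value_cols) %% n == 0)
  starts <- seq(1, length(value_cols), by = n)
  pieces <- lapply(starts, function(s) {
    block <- df[value_cols[s:(s + n - 1)]]
    names(block) <- group_names
    cbind(df[id_col], block)
  })
  do.call(rbind, pieces)
}

# Made-up wide data: two groups of three columns
wide <- data.frame(
  id = 1:2,
  X1 = c("Song A", "Song B"), X2 = c(1933, 1955), X3 = c("Artist A", "Artist B"),
  X4 = c("Song C", NA),       X5 = c(1964, NA),   X6 = c("Artist C", NA),
  stringsAsFactors = FALSE
)

long <- stack_groups(wide, "id", c("title", "year", "artist"))
long <- long[!is.na(long$title), ]  # drop groups that were empty
long
```

Unlike pivot_longer(), this stacks block by block rather than row by row, so the rows come out ordered by column group first and id second.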
I'm getting an error message that I can't make sense of. My code:
library(rvest)
url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(url)
testdata <- leg %>%
html_nodes('table') %>%
.[6] %>%
html_table()
To which I get the response:
Error in out[j + k, ] : subscript out of bounds
When I swap out html_table with html_text I don't get the error. Any idea what I'm doing wrong?
library(htmltab)
library(dplyr)
library(tidyr)
url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
url %>%
htmltab(6, rm_nodata_cols = FALSE) %>%
.[,-1] %>%
replace_na(list(Notes = "", "Term-limited?" = "")) %>%
`rownames<-` (seq_len(nrow(.)))
Output is:
District Name Party Residence Term-limited? Notes
1 1 Ted Gaines Republican El Dorado Hills
2 2 Mike McGuire Democratic Healdsburg
3 3 Bill Dodd Democratic Napa
4 4 Jim Nielsen Republican Gerber
5 5 Cathleen Galgiani Democratic Stockton
6 6 Richard Pan Democratic Sacramento
...
Why not just target the table better?
library(rvest)
wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(wp_url)
html_node(leg, xpath=".//table[contains(., 'District')]") %>%
html_table()
## Position Position Name Party District
## 1 Lieutenant Governor Gavin Newsom Democratic
## 2 President pro tempore Kevin de León Democratic 24th–Los Angeles
## 3 Majority leader Bill Monning Democratic 17th–Carmel
## 4 Majority whip Nancy Skinner Democratic 9th–Berkeley
## 5 Majority caucus chair Connie Leyva Democratic 20th–Chino
## 6 Majority caucus vice chair Mike McGuire Democratic 2nd–Healdsburg
## 7 Minority leader Patricia Bates Republican 36th–Laguna Niguel
## 8 Minority caucus chair Jim Nielsen Republican 4th–Gerber
## 9 Minority whip Ted Gaines Republican 1st–El Dorado Hills
## 10 Secretary Secretary Daniel Alvarez Daniel Alvarez Daniel Alvarez
## 11 Sergeant-at-Arms Sergeant-at-Arms Debbie Manning Debbie Manning Debbie Manning
## 12 Chaplain Chaplain Sister Michelle Gorman Sister Michelle Gorman Sister Michelle Gorman
ARGH! Wrong table. It's still unwise to just use numeric indexes like that. We can still target the table you want better:
library(rvest)
library(purrr)
wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(wp_url)
target_table <- html_node(leg, xpath=".//span[@id='Members']/../following-sibling::table")
But, rvest::html_table() is causing the error and you should absolutely file a bug report on the GH page for it.
The htmltab package used in the other answer looks handy (and feel free to accept that answer vs this one since it's shorter and works).
We'll do it the old-fashioned way, but will need a helper function to make better column names:
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
Now, we extract the header row and the data rows:
header_row <- html_node(target_table, xpath=".//tr[th]")
data_rows <- html_nodes(target_table, xpath=".//tr[td]")
We peek at the header row and see that there's an evil colspan in there. We'll make use of this knowledge later.
html_children(header_row)
## {xml_nodeset (6)}
## [1] <th scope="col" width="30" colspan="2">District</th>
## [2] <th scope="col" width="170">Name</th>
## [3] <th scope="col" width="70">Party</th>
## [4] <th scope="col" width="130">Residence</th>
## [5] <th scope="col" width="60">Term-limited?</th>
## [6] <th scope="col" width="370">Notes</th>
Get the column names, and make them tidy:
html_children(header_row) %>%
html_text() %>%
tolower() %>%
mcga() -> col_names
Now, iterate over the rows, pull out the values, remove the extra first value and turn the whole thing into a data frame:
map_df(data_rows, ~{
kid_txt <- html_children(.x) %>% html_text()
as.list(setNames(kid_txt[-1], col_names))
})
## # A tibble: 40 x 6
## district name party residence term_limited notes
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Ted Gaines Republican El Dorado Hills
## 2 2 Mike McGuire Democratic Healdsburg
## 3 3 Bill Dodd Democratic Napa
## 4 4 Jim Nielsen Republican Gerber
## 5 5 Cathleen Galgiani Democratic Stockton
## 6 6 Richard Pan Democratic Sacramento
## 7 7 Steve Glazer Democratic Orinda
## 8 8 Tom Berryhill Republican Twain Harte Yes
## 9 9 Nancy Skinner Democratic Berkeley
## 10 10 Bob Wieckowski Democratic Fremont
## # ... with 30 more rows
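Dropping the first cell works here because only the first header spans two columns. A more general sketch, assuming the header texts and their colspan values have already been pulled out of the header row, is to repeat each header name by its colspan so the names line up with the data cells. The vectors below are hand-written stand-ins, not scraped values.

```r
# Hypothetical header texts and their colspan values (1 when no colspan is set)
header_text <- c("district", "name", "party")
colspan <- c(2, 1, 1)

# Repeat each name by its span, then make the repeated names unique
col_names <- make.unique(rep(header_text, times = colspan), sep = "_")

# One hypothetical data row with four cells, the first one empty
row_cells <- c("", "1", "Ted Gaines", "Republican")
named_row <- as.list(setNames(row_cells, col_names))
str(named_row)
```

With this approach the extra first column is kept under the name district, and the value of interest lands in district_1, so it can be dropped or renamed afterwards.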