Tidy text: Compute Zipf's law from the following term-document matrix - r

I tried the code from http://tidytextmining.com/tfidf.html. My result can be seen in this image.
My question is: How can I rewrite the code to produce the negative relationship between the term frequency and the rank?
The following is the term-document matrix. Any comments are highly appreciated.
# Zipf's law
freq_rk <- DTM_words %>%
group_by(document) %>%
mutate(rank=row_number(),
'term_frequency'=count/total)
freq_rk %>%
ggplot(aes(rank,term_frequency,color=document)) +
geom_line(size=1.2,alpha=0.8)
DTM_words
# A tibble: 4,530 x 5
document term count n total
<chr> <chr> <dbl> <int> <dbl>
1 1 activ 1 1 109
2 1 agencydebt 1 1 109
3 1 assess 1 1 109
4 1 avail 1 1 109
5 1 balanc 2 1 109
# ... with 4,520 more rows

To use row_number() to get rank, you need to make sure that your data frame is ordered by n, the number of times a word is used in a document. Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to a DTM from quanteda.)
library(tidyverse)
library(tidytext)
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)
ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#> document term count
#> <chr> <chr> <dbl>
#> 1 1789-Washington fellow 3
#> 2 1793-Washington fellow 1
#> 3 1797-Adams fellow 3
#> 4 1801-Jefferson fellow 7
#> 5 1805-Jefferson fellow 8
#> 6 1809-Madison fellow 1
#> 7 1813-Madison fellow 1
#> 8 1817-Monroe fellow 6
#> 9 1821-Monroe fellow 10
#> 10 1825-Adams fellow 3
#> # ... with 44,715 more rows
Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, it isn't meaningful because the words are all jumbled up in order.
Instead, we can arrange this by descending count.
ap_td <- tidy(inaug_dfm) %>%
group_by(document) %>%
arrange(desc(count))
ap_td
#> # A tibble: 44,725 x 3
#> # Groups: document [58]
#> document term count
#> <chr> <chr> <dbl>
#> 1 1841-Harrison the 829
#> 2 1841-Harrison of 604
#> 3 1909-Taft the 486
#> 4 1841-Harrison , 407
#> 5 1845-Polk the 397
#> 6 1821-Monroe the 360
#> 7 1889-Harrison the 360
#> 8 1897-McKinley the 345
#> 9 1841-Harrison to 318
#> 10 1881-Garfield the 317
#> # ... with 44,715 more rows
Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.
ap_td <- tidy(inaug_dfm) %>%
group_by(document) %>%
arrange(desc(count)) %>%
mutate(rank = row_number(),
total = sum(count),
`term frequency` = count / total)
ap_td
#> # A tibble: 44,725 x 6
#> # Groups: document [58]
#> document term count rank total `term frequency`
#> <chr> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1841-Harrison the 829 1 9178 0.09032469
#> 2 1841-Harrison of 604 2 9178 0.06580954
#> 3 1909-Taft the 486 1 5844 0.08316222
#> 4 1841-Harrison , 407 3 9178 0.04434517
#> 5 1845-Polk the 397 1 5211 0.07618499
#> 6 1821-Monroe the 360 1 4898 0.07349939
#> 7 1889-Harrison the 360 1 4744 0.07588533
#> 8 1897-McKinley the 345 1 4383 0.07871321
#> 9 1841-Harrison to 318 4 9178 0.03464807
#> 10 1881-Garfield the 317 1 3240 0.09783951
#> # ... with 44,715 more rows
ap_td %>%
ggplot(aes(rank, `term frequency`, color = document)) +
geom_line(alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
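As a small sketch of the same idea applied to a table shaped like the asker's DTM_words: passing desc(count) to row_number() ranks by count directly, so the result does not depend on the row order of the data frame. The toy_dtm values below are made up for illustration and are not the asker's data.

```r
library(dplyr)

# Hypothetical stand-in for DTM_words (values invented for illustration)
toy_dtm <- data.frame(
  document = c("1", "1", "1", "2", "2", "2"),
  term     = c("bank", "the", "rate", "of", "loan", "the"),
  count    = c(2, 10, 5, 8, 1, 4)
)

freq_rk <- toy_dtm %>%
  group_by(document) %>%
  # Rank 1 goes to the largest count within each document
  mutate(rank = row_number(desc(count)),
         total = sum(count),
         term_frequency = count / total) %>%
  ungroup() %>%
  arrange(document, rank)

freq_rk
```

Within each document, term_frequency now falls as rank rises, which is the negative relationship the plot is meant to show.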

To show a linear regression fit on the graph (which is not the same thing as Zipf's law), just add a smoother that uses a linear model (lm).
freq_rk %>%
ggplot(aes(rank,term_frequency,color=document)) +
geom_line(size=1.2,alpha=0.8) +
geom_smooth(method = lm)
To identify the differences between Austen's distributions and yours, run the following code:
Austen:
ggplot(freq_by_rank, aes(rank, fill = book)) + geom_density(alpha = 0.5) + labs(title = "Austen linear")
ggplot(freq_by_rank, aes(rank, fill = book)) + geom_density(alpha = 0.5) + scale_x_log10() + labs(title = "Austen Logarithmic")
Tom's Sample
ggplot(freq_rk, aes(rank, fill = document)) + geom_density(alpha = 0.5) + labs(title = "Sample linear")
ggplot(freq_rk, aes(rank, fill = document)) + geom_density(alpha = 0.5) + scale_x_log10() + labs(title = "Sample Logarithmic")
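A straight-line fit relates to Zipf's law only when both variables are on a log scale. The following is a minimal sketch using synthetic data, not the asker's corpus: it builds an idealised distribution where frequency is proportional to 1/rank and fits lm() on the log10 values. The variable names are illustrative.

```r
# Idealised Zipfian data: frequency proportional to 1 / rank
rank <- 1:100
term_frequency <- (1 / rank) / sum(1 / rank)

# Linear model on the log-log scale; the slope estimates the exponent
fit <- lm(log10(term_frequency) ~ log10(rank))
slope <- unname(coef(fit)[2])
slope  # close to -1 for this idealised input
```

With real text the slope is usually only roughly -1 and tends to deviate at the lowest and highest ranks.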

Related

Filter by value counts within groups

I want to filter my grouped dataframe based on the number of occurrences of a specific value within a group.
Some exemplary data:
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
The “lapse” column is my filter variable in this case.
I want to exclude every “ID” group that has more than 15 counts of “lapse” == 2 within!
data %>% group_by(ID) %>% count(lapse == 2)
So if, for example, group “A” contains 17 rows with “lapse” == 2, the whole group should be removed from the data frame.
First I created some reproducible data using set.seed() and checked the number of values per group. In this case only group D has more than 15 rows with lapse == 2. You can use filter() and sum the rows with lapse == 2 per group like this:
set.seed(7)
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
library(dplyr)
# Check n values per group
data %>%
group_by(ID, lapse) %>%
summarise(n = n())
#> # A tibble: 8 × 3
#> # Groups: ID [4]
#> ID lapse n
#> <chr> <int> <int>
#> 1 A 1 8
#> 2 A 2 7
#> 3 B 1 13
#> 4 B 2 15
#> 5 C 1 18
#> 6 C 2 6
#> 7 D 1 17
#> 8 D 2 16
data %>%
group_by(ID) %>%
filter(!(sum(lapse == 2) > 15))
#> # A tibble: 67 × 3
#> # Groups: ID [3]
#> ID rt lapse
#> <chr> <dbl> <int>
#> 1 B 0.517 2
#> 2 C 0.589 1
#> 3 C 0.598 2
#> 4 C 0.715 1
#> 5 B 0.475 2
#> 6 C 0.965 1
#> 7 B 0.234 1
#> 8 B 0.812 2
#> 9 C 0.517 1
#> 10 B 0.700 1
#> # … with 57 more rows
Created on 2023-01-08 with reprex v2.0.2
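To make the mechanism easier to see, here is a tiny sketch with hand-made data and a threshold of 2 instead of 15 (both are assumptions for illustration). Inside a grouped filter(), sum(lapse == 2) is a single value per group, so the condition keeps or drops every row of a group at once.

```r
library(dplyr)

# Hand-made data: group A has three rows with lapse == 2, group B has one
toy <- data.frame(
  ID    = c("A", "A", "A", "B", "B", "B"),
  lapse = c(2, 2, 2, 1, 2, 1)
)

kept <- toy %>%
  group_by(ID) %>%
  # One TRUE/FALSE per group, recycled to all rows of that group
  filter(sum(lapse == 2) <= 2) %>%
  ungroup()

kept
```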

Select rows within an overlapping range based on another column in R tidyverse

I have a data frame that looks like this:
Column col1 defines the start of a range when the direction is "+", while col2 defines the start of a range when the direction is "-".
library(tidyverse)
df <- tibble(col1=c(1,10,100,40,1000), col2=c(15,20,50,80,2000),
direction=c("+","+","-","+","+"), score=c(50,100,300,10,300))
df
#> # A tibble: 5 × 4
#> col1 col2 direction score
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 15 + 50
#> 2 10 20 + 100
#> 3 100 50 - 300
#> 4 40 80 + 10
#> 5 1000 2000 + 300
Created on 2022-07-28 by the reprex package (v2.0.1)
By considering the direction, I want to extract from the rows with overlapping ranges the ones with the highest score.
I want my data to look like this.
#> col1 col2 direction score
#> <dbl> <dbl> <chr> <dbl>
#> 1 10 20 + 100
#> 3 100 50 - 300
#> 5 1000 2000 + 300
Any ideas and help are highly appreciated.
We could use slice_max after grouping by rleid on the 'direction'
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(direction)) %>%
slice_max(n = 1, order_by = score) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 3 × 4
col1 col2 direction score
<dbl> <dbl> <chr> <dbl>
1 10 20 + 100
2 100 50 - 300
3 1000 2000 + 300
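For readers unfamiliar with rleid(), this base R sketch reproduces the run ids it assigns to the direction column of the example. Note that the grouping is by consecutive runs of the same direction; it does not itself compare col1 and col2, so whether that matches "overlapping" for other data is an assumption worth checking.

```r
direction <- c("+", "+", "-", "+", "+")

# rle() finds runs of identical consecutive values;
# repeating the run index by run length gives one id per row
runs <- rle(direction)
grp <- rep(seq_along(runs$lengths), runs$lengths)
grp
```

Rows sharing an id form one group, and slice_max() then keeps the highest score within each.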

Looking for a way to determine how many points are above a certain (nonlinear) line in R

I have a big data frame (df) which contains x/y coordinates and a nonlinear regression line that is somewhere in the middle, see plot below. Many points overlap, that's why I have an extra column $Freq.
I am looking for a way to determine how many points (many of them on top of each other) are above this line. See the df structure below.
head(df)
x y Freq
0 0 396
1 1 222
1 0 513
2 0 315
2 1 279
2 2 36
...
I am aware of the polygon methods here on StackOverflow but I can't seem to get those to work, maybe in part due to the fact that my line is a series of coordinates rather than a formula:
head(line)
x y
0 0.0000
1 0.4220
2 0.8350
3 1.2545
4 1.6615
5 2.0450
In the end it would be great if I can have three numbers: one that describes the count of the points above the line, the count of the points below the line, and possibly those that are to the right of this particular line.
Thanks!
A lot of unknowns here, nevertheless
df=merge(df,line,by="x",suffixes=c("_df","_line"))
sum(df$Freq[df$y_df>df$y_line])
[1] 537
library(tidyverse)
data <- tibble::tribble(
~x, ~y, ~Freq,
0, 0, 396,
1, 1, 222,
1, 0, 513,
2, 0, 315,
2, 1, 279,
2, 2, 36
)
data
#> # A tibble: 6 x 3
#> x y Freq
#> <dbl> <dbl> <dbl>
#> 1 0 0 396
#> 2 1 1 222
#> 3 1 0 513
#> 4 2 0 315
#> 5 2 1 279
#> 6 2 2 36
line <- tibble::tribble(
~x, ~y,
0, 0.0000,
1, 0.4220,
2, 0.8350,
3, 1.2545,
4, 1.6615,
5, 2.0450
)
line
#> # A tibble: 6 x 2
#> x y
#> <dbl> <dbl>
#> 1 0 0
#> 2 1 0.422
#> 3 2 0.835
#> 4 3 1.25
#> 5 4 1.66
#> 6 5 2.04
data %>%
left_join(line %>% rename(line_y = y)) %>%
filter(y > line_y)
#> Joining, by = "x"
#> # A tibble: 3 x 4
#> x y Freq line_y
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 222 0.422
#> 2 2 1 279 0.835
#> 3 2 2 36 0.835
data %>%
left_join(line %>% rename(line_y = y)) %>%
filter(y > line_y) %>%
summarise(sum(Freq))
#> Joining, by = "x"
#> # A tibble: 1 x 1
#> `sum(Freq)`
#> <dbl>
#> 1 537
data %>%
left_join(line %>% rename(line_y = y)) %>%
filter(y > line_y) %>%
nrow()
#> Joining, by = "x"
#> [1] 3
Created on 2021-11-09 by the reprex package (v2.0.1)
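If the x values of the points do not all coincide with the x values of the line, a join will not match them. One possible alternative, sketched here with the sample rows from the question, is to interpolate the line linearly with base R's approx() and then sum Freq on each side. Linear interpolation between the line's coordinates is an assumption, and the names pts and line_xy are illustrative.

```r
pts <- data.frame(
  x    = c(0, 1, 1, 2, 2, 2),
  y    = c(0, 1, 0, 0, 1, 2),
  Freq = c(396, 222, 513, 315, 279, 36)
)
line_xy <- data.frame(
  x = 0:5,
  y = c(0, 0.4220, 0.8350, 1.2545, 1.6615, 2.0450)
)

# Height of the line at each point's x (linear interpolation)
line_y <- approx(line_xy$x, line_xy$y, xout = pts$x)$y

# Weighted counts: each row stands for Freq overlapping points
above   <- sum(pts$Freq[pts$y >  line_y])
below   <- sum(pts$Freq[pts$y <  line_y])
on_line <- sum(pts$Freq[pts$y == line_y])

c(above = above, below = below, on_line = on_line)
```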

R simulation of correlated data

data = data.frame(GROUP = sample(1:4, size = 1000, replace = TRUE),
SCORE = runif(1000),
HELP = runif(1000, -.2, .8))
I have sample 'data' that has variable GROUP which indicates the GROUP an individual is in and SCORE which is the SCORE for that individual and HELP which is another measure.
Now, suppose you know that GROUP = 1 will have a higher SCORE on average when HELP is greater than 0.3, but GROUP = 2 will have a lower SCORE on average when HELP is less than 0.4. How can you simulate a data set that uses this information?
Here's an unsophisticated solution using dplyr to increase/decrease SCORE by one sd for the relevant conditions.
library(dplyr)
sd(data$SCORE)
#> [1] 0.2868018
data <-
data %>%
mutate(helpgt3 = as.integer(HELP > .3),
helplt4 = as.integer(HELP < .4))
original_means <-
data %>%
group_by(GROUP, helpgt3) %>%
summarise(mean.score = mean(SCORE))
data <-
data %>%
mutate(SCORE = case_when(
helpgt3 == 1 & GROUP == 1 ~ SCORE + 0.2868018,
helplt4 == 1 & GROUP == 2 ~ SCORE - 0.2868018,
TRUE ~ SCORE
))
new_means <-
data %>%
group_by(GROUP, helpgt3) %>%
summarise(mean.score = mean(SCORE))
#> `summarise()` regrouping output by 'GROUP' (override with `.groups` argument)
original_means
#> # A tibble: 8 x 3
#> # Groups: GROUP [4]
#> GROUP helpgt3 mean.score
#> <int> <int> <dbl>
#> 1 1 0 0.486
#> 2 1 1 0.474
#> 3 2 0 0.473
#> 4 2 1 0.525
#> 5 3 0 0.482
#> 6 3 1 0.486
#> 7 4 0 0.545
#> 8 4 1 0.521
new_means
#> # A tibble: 8 x 3
#> # Groups: GROUP [4]
#> GROUP helpgt3 mean.score
#> <int> <int> <dbl>
#> 1 1 0 0.486
#> 2 1 1 0.761
#> 3 2 0 0.186
#> 4 2 1 0.478
#> 5 3 0 0.482
#> 6 3 1 0.486
#> 7 4 0 0.545
#> 8 4 1 0.521
Your data
set.seed(2020)
data = data.frame(GROUP = sample(1:4, size = 1000, replace = TRUE),
SCORE = runif(1000),
HELP = runif(1000, -.2, .8))
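The same adjustment can be sketched in base R without hard-coding the standard deviation. The shift of one sd is the choice made in the answer above; the names sim, boost, lower and SCORE_adj are illustrative.

```r
set.seed(2020)
n <- 1000
sim <- data.frame(
  GROUP = sample(1:4, size = n, replace = TRUE),
  SCORE = runif(n),
  HELP  = runif(n, -0.2, 0.8)
)

# Size of the effect: one standard deviation of the original SCORE
shift <- sd(sim$SCORE)

# Logical flags for the two conditions described in the question
boost <- sim$GROUP == 1 & sim$HELP > 0.3
lower <- sim$GROUP == 2 & sim$HELP < 0.4

# TRUE/FALSE act as 1/0, so only flagged rows are moved
sim$SCORE_adj <- sim$SCORE + shift * boost - shift * lower
```

Keeping the adjusted values in a separate column leaves the original SCORE available for comparison.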

What is the process of applying a dplyr function to a list of values

I have created a dplyr function to evaluate counts of events for a population. The code works when used with explicit naming of variables within the dplyr::filter and dplyr::group_by functions.
I need to apply the function to 24 variables which are column headers within a data frame. Here they are referred to as x.
I have used !! as I understand that the variable is evaluated as a string rather than a column name.
The function
summary_table <- function(x){
assign(paste(x,"sum_tab", sep="_"),
envir = parent.frame(),
value = df %>%
filter(!is.na(!!x)) %>%
group_by(!!x) %>%
summarise(
'Variable name' = paste0(x),
Discharged = sum(admission_status == "Discharged"),
'Re-attended' = sum(!is.na(re_admission_status)),
'Admitted on Re-attendance' = sum(re_admission_status == "Admitted", na.rm = TRUE)))
}
I have used:
sapply(var_names, summary_table)
However, this only outputs one row of the table for each variable in var_names.
In summary, I would like pointers to the correct mechanism for applying the function above to a list of column names within the dplyr pipe.
Reproducible example
example <- mtcars %>%
group_by(vs) %>%
summarise(
'6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
In this example we would want to apply this function to the following list of column names:
cars_var <- c("vs", "am", "carb")
This would produce three tables, one for each column in the list.
As @eipi10 commented, it is usually unwise to automatically create variables. A better idea is to create a single variable that is a list of data frames.
It is also easier to let users apply the groups themselves with group_by() or group_by_at(), so that you don't have to worry about how they provide the names of the variables.
EDIT 2019-05-02
One way is to regard the names of the grouping variables as the 'data', and map over them, creating a copy of the actual data grouped by each one of the grouping variables.
library(dplyr)
library(purrr)
grouping_vars <- c("vs", "am", "carb")
map(grouping_vars, group_by_at, .tbl = mtcars) %>%
map(summarise,
'6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
#> [[1]]
#> # A tibble: 2 x 4
#> vs `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 3 18 14
#> 2 1 4 9 13
#>
#> [[2]]
#> # A tibble: 2 x 4
#> am `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 4 19 19
#> 2 1 3 8 8
#>
#> [[3]]
#> # A tibble: 6 x 4
#> carb `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 1 2 4 7
#> 2 2 0 8 8
#> 3 3 0 3 3
#> 4 4 4 10 9
#> 5 6 1 1 0
#> 6 8 0 1 0
Created on 2019-05-02 by the reprex package (v0.2.1)
Original answer
Here is a function that uses dplyr::groups() to find out which variables have been grouped. Then it iterates over each grouping variable, summarises, and appends the resulting data frame to a list.
library(dplyr)
margins <- function(.data, ...) {
groups <- dplyr::groups(.data)
n <- length(groups)
out <- vector(mode = "list", length = n)
for (i in rev(seq_len(n))) {
out[[i]] <-
.data %>%
dplyr::group_by(!!groups[[i]]) %>%
dplyr::summarise(...) %>%
dplyr::group_by(!!groups[[i]]) # Reapply the original group
}
out
}
mtcars %>%
group_by(vs, am, carb) %>%
margins('6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
#> [[1]]
#> # A tibble: 2 x 4
#> # Groups: vs [2]
#> vs `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 3 18 14
#> 2 1 4 9 13
#>
#> [[2]]
#> # A tibble: 2 x 4
#> # Groups: am [2]
#> am `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 4 19 19
#> 2 1 3 8 8
#>
#> [[3]]
#> # A tibble: 6 x 4
#> # Groups: carb [6]
#> carb `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 1 2 4 7
#> 2 2 0 8 8
#> 3 3 0 3 3
#> 4 4 4 10 9
#> 5 6 1 1 0
#> 6 8 0 1 0
Created on 2019-04-24 by the reprex package (v0.2.1.9000)
If you want to group with a vector of variable names, you can use dplyr::group_by_at() and dplyr::vars().
cars_var <- c("vs", "am", "carb")
mtcars %>%
group_by_at(vars(cars_var)) %>%
margins('6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
I am the author of a small package called armgin that implements this and a few similar ideas.
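A likely reason the function in the question returned a single row per variable: when x is a character string, !!x injects the string itself, so group_by() sees a constant rather than a column. The following is a minimal sketch of one way to group by a column named in a string, using across(all_of()) and returning a named list instead of calling assign(). The helper name summarise_by is hypothetical, and the summaries are the ones from the mtcars example.

```r
library(dplyr)

# Hypothetical helper: summarise `data` grouped by the column named in `var`
summarise_by <- function(data, var) {
  data %>%
    group_by(across(all_of(var))) %>%
    summarise(
      `6 cylinder` = sum(cyl == 6),
      `Large disp` = sum(disp >= 100),
      `low gears`  = sum(gear <= 4),
      .groups = "drop"
    )
}

cars_var <- c("vs", "am", "carb")

# One table per column name, collected in a single named list
tables <- lapply(cars_var, summarise_by, data = mtcars)
names(tables) <- cars_var

tables$vs
```

Each element can then be accessed by name, for example tables$carb, without creating separate objects in the calling environment.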

Resources