Replicating a Google ngram plot in R - r

I'm trying to replicate a plot from the paper Michel et al., 'Quantitative Analysis of Culture Using Millions of Digitized Books' (2011). Specifically I'm trying to make the one on the top right here:
https://pubmed.ncbi.nlm.nih.gov/21163965/#&gid=article-figures&pid=fig-3-uid-2
I know the paper used v1 of the corpus but I'm doing it with v2 as it's easier to work with. When I use the Google Ngram viewer (specifying the English 2012 corpus which corresponds to v2, a year range of 1875 to 1975, and no smoothing) I get this, which looks pretty close.
When I tried to replicate this in R/ggplot I get this:
1950 and 1883 look pretty consistent with what is happening in the viewer plot, but I can't figure out what is happening with 1910. There appear to be very few occurrences of the year '1910' in the data set compared with some of the other years. Would anyone with a better understanding of the Google ngrams data set be able to point me in the right direction? Should I be supplementing this with something other than just the 1-gram dataset? Does the Google Ngram Viewer pick out occurrences of 1-grams in a different way?
The code I've used is below. A couple of other points: 1910 and 1950 do not seem to exist as standalone 1-grams in the v2 data set, but 1883 does. To get this to even remotely work, I had to grepl for 1950 and 1910 to get any hits (they all seem to appear as parts of date ranges like 1890-1910, or with some other characters tacked on), rather than filtering on an exact match of the ngram field. I also used purrr::map_dfr rather than a dplyr::case_when, in case several target years appeared in the same ngram picked up by a grepl (e.g. the range 1883-1910 should count as a hit for both of those years, not just one).
library(ggplot2)
library(dplyr)
library(purrr)
#---- Load data ----
counts_file <- file.path("data", "total_counts.txt")
ngrams_file <- file.path("data", "google_books_1gram_eng_v2.gz")
if (!dir.exists("data")) {
dir.create("data")
}
if (!file.exists(counts_file)) {
download.file(
"http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt",
counts_file
)
}
if (!file.exists(ngrams_file)) {
download.file(
"http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-1.gz",
ngrams_file
)
}
one_grams <- read.delim(
gzfile(ngrams_file),
header = FALSE
)
names(one_grams) <- c("ngram", "year", "match_count", "volume_count")
one_grams_subset <- one_grams %>%
filter(year >= 1875 & year <= 1975)
total_counts_temp <- t(
read.table(
counts_file,
header = FALSE
)
)
total_counts_char <- do.call(
rbind,
strsplit(total_counts_temp, ",")
)
total_counts <- apply(total_counts_char, 2, as.numeric)
colnames(total_counts) <- c("year", "match_count", "page_count", "volume_count")
#---- Recreate plot 3A from Michel et al. (2011) ----
year_subset <- function(year_char, one_grams_data) {
one_grams_data %>%
filter(grepl(year_char, .[["ngram"]], fixed = TRUE)) %>%
group_by(year) %>%
summarise(year_count = sum(match_count, na.rm = TRUE)) %>%
mutate(year_gram = year_char)
}
plot_data <- map_dfr(c("1883", "1910", "1950"),
year_subset,
one_grams_subset) %>%
left_join(as_tibble(total_counts), by = "year") %>%
mutate(frequency = 10000 * year_count/match_count) %>%
select(year_gram, year, frequency, year_count)
ggplot(plot_data) +
geom_line(aes(x = year, y = frequency, colour = year_gram)) +
theme_minimal() +
labs(col = "ngram", x = "Year", y = "Frequency")

Related

How to do parallel processing with rowwise

I am using rowwise to perform a function on each row. This takes a long time. In order to speed things up, is there a way to use parallel processing so that multiple cores are working on different rows concurrently?
As an example, I am aggregating PRISM weather data (https://prism.oregonstate.edu/) to the state level while weighting by population. This is based on https://www.patrickbaylis.com/blog/2021-08-15-pop-weighted-weather/.
Note that the below code requires downloads of daily weather data as well as the shapefile with population estimates at a very small geography.
library(prism)
library(tidyverse)
library(sf)
library(exactextractr)
library(tigris)
library(terra)
library(raster)
library(ggthemes)
################################################################################
#get daily PRISM data
prism_set_dl_dir("/prism/daily/")
get_prism_dailys(type = "tmean", minDate = "2012-01-01", maxDate = "2021-07-31", keepZip=FALSE)
# Get states shape file and limit to lower 48
states = tigris::states(cb = TRUE, resolution = "20m") %>%
  filter(!NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))
setwd("/prism/daily")
################################################################################
#get list of files in the directory, and extract date
##see if it is stable (TRUE) or provisional data (FALSE)
list <- ls_prism_data(name = TRUE) %>%
  mutate(date1 = substr(files, nchar(files) - 11, nchar(files) - 4),
         date2 = substr(product_name, 1, 11),
         year = substr(date2, 8, 11), month = substr(date2, 1, 3),
         month2 = substr(date1, 5, 6), day = substr(date2, 5, 6),
         stable = str_detect(files, "stable"))
################################################################################
#function to get population weighted weather by state
#run the population raster outside of the loop
# SOURCE: https://sedac.ciesin.columbia.edu/data/set/usgrid-summary-file1-2000/data-download - Census 2000, population counts for continental US
pop_rast = raster("/population/usgrid_data_2000/geotiff/uspop00.tif")
pop_crop = crop(pop_rast, states)
states = tigris::states(cb = TRUE, resolution = "20m") %>%
  filter(!NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))
daily_weather <- function(varname, filename, date) {
  weather_rast = raster(paste0(filename, "/", filename, ".bil"))
  weather_crop = crop(weather_rast, states)
  pop_rs = raster::resample(pop_crop, weather_crop)
  states$value <- exact_extract(weather_crop, states, fun = "weighted_mean", weights = pop_rs)
  names(states)[11] <- varname
  states <- data.frame(states) %>% arrange(NAME) %>% dplyr::select(c(6, 11))
  states
}
################################################################################
days <- list %>% rowwise() %>% mutate(states = list(daily_weather("tmean", files, date1)))
As is, each row takes about 7 seconds. This adds up with 3500 rows. And I want to get other variables beside tmean. So it will take a day or more to do everything unless I can speed it up.
I am mainly interested in solutions to be able to use parallel processing with rowwise, but I also welcome other suggestions of how to speed up the code in other ways.
You could try either purrr or its multiprocessing equivalent furrr (either map() or pmap()). The quickest method would be to use data.table. See this blog post that gives some benchmarks behind my recommendation.
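For example, a minimal furrr sketch along those lines (assuming the objects from the question, i.e. daily_weather(), the file listing in list, pop_crop and states, are already in the session; the worker count and package list are illustrative):
library(furrr)
plan(multisession, workers = 4)  # start parallel worker sessions
# future_map2() replaces the rowwise()/mutate() pattern: it maps over the files
# and date1 columns in parallel and returns a list-column of results.
days <- list %>%
  mutate(states = future_map2(
    files, date1,
    ~ daily_weather("tmean", .x, .y),
    .options = furrr_options(packages = c("raster", "exactextractr", "sf"))
  ))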

Power BI - creating custom r visuals relationship error

I desperately need help!
I am trying to predict drug use based on 5 characteristics: Age, Gender, Education, Ethnicity, Country. I have already built a tree model in R with rpart
DrugTree3 <- rpart(formula = DrugUser ~ Age+Gender+Education+Ethnicity+Country, data = traindata)
, a logistic regression model
DrugLog <- glm(formula = DrugUser ~ Age+Gender+Ethnicity+Education+Country,data = traindata, family = binomial)
, and a knn model
KnnModel <- train(form = DrugUser~., data = ModelData,method ='knn',tuneGrid=expand.grid(.k=1:100),metric='Accuracy',trControl=trainControl(method='repeatedcv',number=10,repeats=10)) .
I saved those as RDS files and uploaded them successfully in Power BI.
I then created tables for each characterization and created okviz filters for them.
Then I tried to predict whether a customer gets predicted as a drug user or a non-drug user based on the selections in the okviz filters. This is when everything went horribly wrong:
I created a custom R visual for each model prediction and inserted the following code in each visual:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Error: Can't determine relationship between fields.
What has gone wrong here?
When I then clicked on the diagonal arrow to get to R Studio, this happens: Unable to construct R script data for use in external R IDE.
I need help as I am literally going crazy over this and I don't know how to resolve the issue! I would be really happy if you can help me
You made an error in line 34 and line 25.
Below is a fixed version of your code.
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class =
Good Luck!
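As a side note, the deserialization helper can be sanity-checked outside Power BI; here is a minimal sketch with a toy model (the space-separated hex format simply mirrors what from_byte_string() above expects):
# Round-trip check for from_byte_string(): serialize a small model, render it as
# a space-separated hex string, then rebuild it and compare predictions.
fit <- lm(mpg ~ wt, data = mtcars)
byte_string <- paste(as.character(serialize(fit, connection = NULL)), collapse = " ")
fit2 <- from_byte_string(byte_string)
all.equal(predict(fit), predict(fit2))
#> [1] TRUE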

Census Data Using an API on R studio

So, I am new to using R, so sorry if the questions seem a little basic!
But my work is asking me to look through census data using an API and identify some variables in each tract, then create a csv file they can look at. The code is fully written for me, I believe, but I need to change the variables to:
S2602_C01_023E - black / his
S2602_C01_081E - unemployment rate
S2602_C01_070E - not US citizen (divide by total population)
S0101_C01_030E - # over 65 (divide by total pop)
S1603_C01_009E - # below poverty (divide by total pop)
S1251_C01_010E - # child under 18 (divide by # households)
S2503_C01_013E - median income
S0101_C01_001E - total population
S2602_C01_078E - in labor force
And I need to divide some of the variables, as noted above, and export all of this into a CSV file. I just don't really know what to do with the code... I am lost because I have never used R. I tried changing the variables to the ones I need, but an error comes up. Any help would be greatly appreciated!
library(tidycensus)
library(tidyverse)
library(stringr)
library(haven)
library(profvis)
#list of variables possible
v18 <- load_variables(year = 2018,
dataset = "acs5",
cache = TRUE)
#function to get variables for all states. Year, variables can be easily edited.
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(totpop = "B01003_001",
male = "B01001_002",
female = "B01001_026",
white_alone = "B02001_002",
black_alone = "B02001_003",
americanindian_alone = "B02001_004",
asian_alone = "B02001_005",
nativehaw_alone = "B02001_006",
other_alone = "B02001_007",
twoormore = "B02001_008",
nh = "B03003_002",
his = "B03003_003",
noncit = "B05001_006",
povstatus = "B17001_002",
num_households = "B19058_001",
SNAP_households = "B19058_002",
medhhi = "B19013_001",
hsdiploma_25plus = "B15003_017",
bachelors_25plus = "B15003_022",
greater25 = "B15003_001",
inlaborforce = "B23025_002",
notinlaborforce = "B23025_007",
greater16 = "B23025_001",
civnoninstitutional = "B27010_001",
withmedicare_male_0to19 = "C27006_004",
withmedicare_male_19to64 = "C27006_007",
withmedicare_male_65plus = "C27006_010",
withmedicare_female_0to19 = "C27006_014",
withmedicare_female_19to64 = "C27006_017",
withmedicare_female_65plus = "C27006_020",
withmedicaid_male_0to19 = "C27007_004",
withmedicaid_male_19to64 = "C27007_007",
withmedicaid_male_65plus = "C27007_010",
withmedicaid_female_0to19 = "C27007_014",
withmedicaid_female_19to64 = "C27007_017",
withmedicaid_female_65plus ="C27007_020"),
geography = "tract",
state = st )
return(df)
}
#loops over all states
df_list <- setNames(lapply(states, get_census_data), states)
##if you want to keep margin of error, remove everything after %>% in next two lines
final_df <- bind_rows(df_list) %>%
select(-moe)
colnames(final_df)[3] <- "varname"
#cleaning up final data, making it wide instead of long
final_df_wide <- final_df %>%
gather(variable, value, -(GEOID:varname)) %>%
unite(temp, varname, variable) %>%
spread(temp, value)
#exporting to csv file, adjust your path
write.csv(final_df,"C:\Users\NAME\Documents\acs_2018_tractlevel_dat.
a.csv" )
Since you can't really give a reproducible example without revealing your API key, I'll try my best to figure out what could work here:
Let's first edit the function that pulls data from the API:
get_census_data <- function(st) {
  Sys.sleep(5)
  df <- get_acs(year = 2018,
                variables = c(blackHis = "S2602_C01_023E",
                              unEmployRate = "S2602_C01_081E",
                              notUSCit = "S2602_C01_070E"),
                geography = "tract",
                state = st)
  return(df)
}
I've just put in a few of the variables, but you should get the point. Try whether this works for you and returns the data stored in the respective variables.
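If it does, the remaining steps the question asks about (computing the shares and exporting) could look roughly like the sketch below. Column names such as totpop and over65 are assumptions: they presume you add the corresponding entries (e.g. totpop = "S0101_C01_001E", over65 = "S0101_C01_030E") to the variables vector.
library(dplyr)
library(tidyr)
acs_long <- bind_rows(df_list)   # long output from get_acs(): GEOID, NAME, variable, estimate, moe
acs_wide <- acs_long %>%
  select(GEOID, NAME, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate)
acs_out <- acs_wide %>%
  mutate(notUSCit_share = notUSCit / totpop,   # not a US citizen / total population
         over65_share   = over65 / totpop)     # over 65 / total population
write.csv(acs_out, "acs_2018_tractlevel_data.csv", row.names = FALSE)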

Loop in r to run script

I need to run a script for each station (I was replacing the station numbers one by one in the script), but there are more than 100 stations.
I thought maybe a loop in the script could save me time. I've never written a loop before, so I don't know if it's possible to do what I want. I've tried the code below, but it doesn't work.
Just a bit of my df8 data (txt):
RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0
Code:
station <- sample(50:150,53,replace=F)
for(i in station)
{
df08_1 <- filter(df08, V7==station [i])
colnames(df08_1) <- c("Date","gauging_station", "code", "precp")
df08_1 <- unique(df08_1)
final <- df08_1 %>%
group_by(Date=floor_date(Date, "1 hour"), gauging_station, code) %>%
summarize(precp=sum(precp))
write.csv(final,file="../station [i].csv", row.names = FALSE)
}
If you're not averse to using some tidyverse packages, I think you could simplify this a bit:
Updated with your new sample data - this runs ok on my computer:
Code:
library(dplyr)
dat %>%
  select(-RowNum) %>%
  distinct() %>%
  group_by(date_hour = lubridate::floor_date(date, 'hour'), gauging_station, code) %>%
  summarize(precp = sum(precp)) %>%
  split(.$gauging_station) %>%
  purrr::map(~write.csv(.x,
                        file = paste0('../', .x$gauging_station, '.csv'),
                        row.names = FALSE))
Data:
dat <- data.table::fread("RowNum,date,code,gauging_station,precp
1,01/01/2008 01:00,1586,315,0.4
2,01/01/2008 01:00,10990,16589,0.2
3,01/01/2008 01:00,17221,30523,0.6
4,01/01/2008 01:00,34592,17344,0
5,01/01/2008 01:00,38131,373,0
6,01/01/2008 01:00,44287,370,0
7,01/01/2008 01:00,53903,17314,0.4
8,01/01/2008 01:00,56005,16596,0
9,01/01/2008 01:00,56349,342,0
10,01/01/2008 01:00,57294,346,0
11,01/01/2008 01:00,64423,533,0
12,01/01/2008 01:00,75266,513,0
13,01/01/2008 01:00,96514,19187,0") %>%
mutate(date = as.POSIXct(date, format = '%m/%d/%Y %H:%M'))
Can't comment for lack of reputation, but if the code works when you replace station [i] with the number of the station, it sounds like each station is part of, and has to be extracted from, the df08 object (data frame).
If I understand you correctly, I would do this as follows:
stations <- c(1:100) #put your station IDs into a vector
for(i in stations) { #run the script for each entry in the list
  #assuming that 'V7' is the name of the (unnamed) seventh column of df08, it could
  #work like this:
  df08_1 <- filter(df08, df08$V7==i) #if your station names are something like
  #'station 1' as a string, use paste("station", 1, sep = "")
  colnames(df08_1) <- c("Date","gauging_station", "code", "precp")
  df08_1 <- unique(df08_1)
  final <- df08_1 %>%
    group_by(Date=floor_date(Date, "1 hour"), gauging_station, code) %>%
    summarize(precp=sum(precp)) #floor_date here is probably your own function
  write.csv(final,file=paste("../station", i, ".csv", sep=""), row.names = FALSE)
  #automatically generate names. You can modify the string to whatever you want ofc.
}
If this and all of the other examples don't work, could you provide us with some dummy data to work with, just to see what the df08 dataframe looks like? And also what the floor_date() function does?
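For what it's worth, floor_date() is not a user-defined function; it comes from the lubridate package (the first answer calls it as lubridate::floor_date()). A quick illustration:
library(lubridate)
floor_date(as.POSIXct("2008-01-01 01:37:00", tz = "UTC"), unit = "hour")
#> [1] "2008-01-01 01:00:00 UTC"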

Loop through two data frames and concatenate contents to string

I am trying to create a large list of file URLs by concatenating various pieces together. (Say, ~40 file URLs which represent multiple data types for each of the 50 states.) Eventually, I will download and then unzip/unrar these files. (I have working code for that portion of it.)
I'm very much an R noob, so please bear with me, here.
I have a set of data frames:
states - list of 50 state abbreviations
partial_url - a partial URL for the 50 states
url_parts - a list of each of the remaining URL pieces (5 file types to download)
year
filetype
I need a URL that looks like this:
http://partial_url/state_urlpart_2017_file.csv.gz
I was able to build the partial_url data frame with the following:
for (i in seq_along(states)) {
url_part1 <- as.data.frame(paste0(url,states[[i]],"/",dir,"/"))
}
I was hoping that some kind of nested loop might work to do the rest, like so:
for (i in 1:partial_url){
for (j in 1:url_parts){
for(k in 1:states){
url_part2 <- as.data.frame(paste0(partial_url[[i]],"/",url_parts[[j]],states[[k]],year,filetype))
}}}
Can anyone suggest how to proceed with the final step?
In my understanding, everything the OP needs can be handled by the paste0 function itself, since paste0 is vectorised. Hence, the for-loop shown by the OP is not needed. The data used in my example is stored in vectors, but each piece could just as well be a column of a data.frame.
For example:
states <- c("Alabama", "Colorado", "Georgia")
partial_url <- c("URL_1", "URL_2", "URL_3")
url_parts <- c("PART_1", "PART_2", "PART_3")
year <- 2017
fileType <- "xls"
#Now use paste0 will list out all the URLS
paste0(partial_url,"/",url_parts,states,year,fileType)
#[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls"
#[3] "URL_3/PART_3Georgia2017xls"
EDIT: multiple fileType, based on feedback from @Onyambu
We can use rep(fileType, each = length(states)) to support multiple files.
The solution will look like.
fileType <- c("xls", "doc")
paste0(partial_url,"/",url_parts,states,year,rep(fileType,each = length(states)))
[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls" "URL_3/PART_3Georgia2017xls"
[4] "URL_1/PART_1Alabama2017doc" "URL_2/PART_2Colorado2017doc" "URL_3/PART_3Georgia2017doc"
Here is a tidyverse solution with some simple example data. The approach is to use complete to give yourself a data frame with all possible combinations of your variables. This works because if you make each variable a factor, complete will include all possible factor levels even if they don't appear. This makes it easy to combine your five url parts even though they appear to have different nrow (e.g. 50 states but only 5 file types). unite allows you to join together columns as strings, so we call it three times to include the right separators, and then finally add the http:// with mutate.
Re: your for loop, I find nested for-loop logic hard to work through in the first place. But at least two issues as written are that you have 1:partial_url instead of 1:length(partial_url) (and similarly for the others), and that you simply overwrite the same object with every pass of the loop. I prefer to avoid loops unless they're absolutely necessary.
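For reference, a corrected version of that loop might look like the sketch below (it assumes the five pieces are plain character vectors and, as in the question, that year and filetype are single values); the tidyverse approach that follows avoids the loop entirely.
# Iterate with seq_along() and accumulate into a vector instead of overwriting
urls <- character(0)
for (i in seq_along(partial_url)) {
  for (j in seq_along(url_parts)) {
    for (k in seq_along(states)) {
      urls <- c(urls, paste0(partial_url[[i]], "/", url_parts[[j]],
                             states[[k]], year, filetype))
    }
  }
}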
library(tidyverse)
states <- tibble(state = c("AK", "AZ", "AR", "CA", "NY"))
partial_url <- tibble(part = c("part1", "part2"))
url_parts <- tibble(urlpart = c("urlpart1", "urlpart2"))
year <- tibble(year = 2007:2010)
filetype <- tibble(filetype = c("csv", "txt", "tar"))
urls <- bind_cols(
  states = states[[1]] %>% factor() %>% head(2),
  partial_url = partial_url[[1]] %>% factor() %>% head(2),
  url_parts = url_parts[[1]] %>% factor() %>% head(2),
  year = year[[1]] %>% factor() %>% head(2),
  filetype = filetype[[1]] %>% factor() %>% head(2)
) %>%
  complete(states, partial_url, url_parts, year, filetype) %>%
  unite("middle", states, url_parts, year, sep = "_") %>%
  unite("end", middle, filetype, sep = ".") %>%
  unite("url", partial_url, end, sep = "/") %>%
  mutate(url = str_c("http://", url))
print(urls)
# A tibble: 160 x 1
url
<chr>
1 http://part1/AK_urlpart1_2007.csv
2 http://part1/AK_urlpart1_2007.txt
3 http://part1/AK_urlpart1_2008.csv
4 http://part1/AK_urlpart1_2008.txt
5 http://part1/AK_urlpart1_2009.csv
6 http://part1/AK_urlpart1_2009.txt
7 http://part1/AK_urlpart1_2010.csv
8 http://part1/AK_urlpart1_2010.txt
9 http://part1/AK_urlpart2_2007.csv
10 http://part1/AK_urlpart2_2007.txt
# ... with 150 more rows
Created on 2018-02-22 by the reprex package (v0.2.0).
