So, I am new to using R, so sorry if the questions seem a little basic!
But my work is asking me to look through census data using an API and identify some variables in each tract, then create a csv file they can look at. The code is fully written for me, I believe, but I need to change the variables to:
S2602_C01_023E - black / his
S2602_C01_081E - unemployment rate
S2602_C01_070E - not US citizen (divide by total population)
S0101_C01_030E - # over 65 (divide by total pop)
S1603_C01_009E - # below poverty (divide by total pop)
S1251_C01_010E - # child under 18 (divide by # households)
S2503_C01_013E - median income
S0101_C01_001E - total population
S2602_C01_078E - in labor force
I also need to divide some of the variables, as noted above, and export all of this to a CSV file. I just don't really know what to do with the code; I am lost because I have never used R. When I try changing the variables to the ones I need, an error comes up. Any help would be greatly appreciated!
library(tidycensus)
library(tidyverse)
library(stringr)
library(haven)
library(profvis)
#list of variables possible
v18 <- load_variables(year = 2018,
dataset = "acs5",
cache = TRUE)
#function to get variables for all states. Year, variables can be easily edited.
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(totpop = "B01003_001",
male = "B01001_002",
female = "B01001_026",
white_alone = "B02001_002",
black_alone = "B02001_003",
americanindian_alone = "B02001_004",
asian_alone = "B02001_005",
nativehaw_alone = "B02001_006",
other_alone = "B02001_007",
twoormore = "B02001_008",
nh = "B03003_002",
his = "B03003_003",
noncit = "B05001_006",
povstatus = "B17001_002",
num_households = "B19058_001",
SNAP_households = "B19058_002",
medhhi = "B19013_001",
hsdiploma_25plus = "B15003_017",
bachelors_25plus = "B15003_022",
greater25 = "B15003_001",
inlaborforce = "B23025_002",
notinlaborforce = "B23025_007",
greater16 = "B23025_001",
civnoninstitutional = "B27010_001",
withmedicare_male_0to19 = "C27006_004",
withmedicare_male_19to64 = "C27006_007",
withmedicare_male_65plus = "C27006_010",
withmedicare_female_0to19 = "C27006_014",
withmedicare_female_19to64 = "C27006_017",
withmedicare_female_65plus = "C27006_020",
withmedicaid_male_0to19 = "C27007_004",
withmedicaid_male_19to64 = "C27007_007",
withmedicaid_male_65plus = "C27007_010",
withmedicaid_female_0to19 = "C27007_014",
withmedicaid_female_19to64 = "C27007_017",
withmedicaid_female_65plus ="C27007_020"),
geography = "tract",
state = st )
return(df)
}
#loops over all states
df_list <- setNames(lapply(states, get_census_data), states)
##if you want to keep margin of error, remove everything after %>% in next two lines
final_df <- bind_rows(df_list) %>%
select(-moe)
colnames(final_df)[3] <- "varname"
#cleaning up final data, making it wide instead of long
final_df_wide <- final_df %>%
gather(variable, value, -(GEOID:varname)) %>%
unite(temp, varname, variable) %>%
spread(temp, value)
#exporting to csv file, adjust your path
# forward slashes: a backslash in an R string starts an escape sequence
write.csv(final_df, "C:/Users/NAME/Documents/acs_2018_tractlevel_dat.a.csv")
Since you can't really give a reproducible example without revealing your API key, I'll try my best to figure out what could work here:
Let's first edit the function that pulls data from the API:
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(blackHis= "S2602_C01_023E",
unEmployRate = "S2602_C01_081E",
notUSCit = "S2602_C01_070E"),
geography = "tract",
state = st )
return(df)
}
I've only put in three of the variables, but you should get the point.
Try it and check whether it runs for you and returns the data stored in those variables.
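The question also asks about dividing some variables by total population and exporting a CSV. Here is a minimal sketch of that step. It assumes the data is in the long layout that get_acs() returns (one row per tract and variable, with an estimate column). The toy table, the tract labels, the numbers and the output file name are all made up for illustration; nothing here calls the Census API.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the long output of get_acs(); labels and numbers are made up.
acs_long <- tibble(
  GEOID    = rep(c("tract_A", "tract_B"), each = 3),
  variable = rep(c("totpop", "over65", "notUSCit"), times = 2),
  estimate = c(2000, 300, 100, 4000, 400, 80)
)

# One row per tract, one column per variable, then the requested ratios.
acs_wide <- acs_long %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(
    share_over65   = over65 / totpop,
    share_notUSCit = notUSCit / totpop
  )

# Hypothetical output location; replace with your own path.
out_file <- file.path(tempdir(), "acs_tract_shares.csv")
write.csv(acs_wide, out_file, row.names = FALSE)
```

With real data you would drop the moe column first (as the original script does) and use whatever names you gave the variables in get_census_data().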
I am creating a pipeline that allows for an arbitrary number of dataset names to be put in, where they will all be put through similar cleaning processes. To do this, I am using the targets package, and using the tar_map function from tarchetypes, I subject each dataset to a series of tidying and wrangling functions.
My issue now is that one dataset needs to be split into three datasets by a factor (a la split) while the rest should remain untouched. The pipeline would then theoretically move on by processing each dataset individually, including the three 'daughter' datasets.
Here's my best attempt:
library(targets)
library(tarchetypes)
library(tidyverse)
# dir.create("./data")
# tibble(nums = 1:300, groups = rep(letters[1:3], each = 100)) |>
# write_csv("./data/td1.csv")
# tibble(nums = 301:600, groups = rep(letters[1:3], each = 100)) |>
# write_csv("./data/td2.csv")
# tibble(nums = 601:900, groups = rep(letters[1:3], each = 100)) |>
# write_csv("./data/td3.csv")
tar_option_set(
packages = c("tidyverse")
)
read_data <- function(paths) {
read_csv(paths)
}
get_group <- function(data, groups) {
filter(data, .data$groups == .env$groups) # .env$ stops the column being compared with itself
}
do_nothing <- function(data) {
data
}
list(
map1 <- tar_map(
values = tibble(datasets = c("./data/td1.csv", "./data/td2.csv", "./data/td3.csv")),
tar_target(data, read_data(datasets)),
map2 <- tar_map(values = tibble(groups = c("a", "b", "c")),
tar_skip(tester, get_group(data, groups), !str_detect(tar_name(), "td3\\.csv$"))
),
tar_target(dn, do_nothing(list(data, tester)))
)
)
The skipping method is a bit clumsy, I may be thinking about that wrong as well.
I'm obviously trying to combine the code poorly at the end there by putting them in a list, but I'm at a loss as to what else to do.
The datasets can't be combined by, say, rbind, since in actuality they are SummarizedExperiment objects.
Any help is appreciated - let me know if any further clarification is needed.
If you know the levels of that factor in advance, you can handle the splitting of that third dataset with a separate tar_map() call similar to what you do now. If you do not know the factor levels in advance, then the splitting needs to be handled with dynamic branching, and I recommend something like tarchetypes::tar_group_by().
I do not think tar_skip() is relevant here, and I recommend removing it.
If you start with physical files (or write physical files) then I strongly suggest you track them with format = "file": https://books.ropensci.org/targets/files.html#external-input-files.
library(targets)
library(tarchetypes)
tar_option_set(packages = "tidyverse")
list(
tar_map(
values = list(paths = c("data/td1.csv", "data/td2.csv")),
tar_target(file, paths, format = "file"),
tar_target(data, read_csv(file, col_types = cols()))
),
tar_target(file3, "data/td3.csv", format = "file"),
tar_group_by(data3, read_csv(file3, col_types = cols()), groups),
tar_target(
data3_row_counts,
tibble(group = data3$groups[1], n = nrow(data3)),
pattern = map(data3)
)
)
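For the other case mentioned above, where the factor levels are known in advance, a static-branching version might look like the sketch below. It assumes the levels are "a", "b" and "c" as in the toy data from the question; the names grp, data3_split and pipeline are hypothetical.

```r
library(targets)
library(tarchetypes)
tar_option_set(packages = "tidyverse")

pipeline <- list(
  tar_target(file3, "data/td3.csv", format = "file"),
  tar_target(data3, read_csv(file3, col_types = cols())),
  # One target per known level; tar_map() substitutes each value of grp
  # into the command, so the filter compares against a literal string.
  tar_map(
    values = list(grp = c("a", "b", "c")),
    tar_target(data3_split, filter(data3, groups == grp))
  )
)

# A _targets.R file must end with the list of targets.
pipeline
```

Each of the three generated targets can then be passed through the same downstream steps as the untouched datasets.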
I'm trying to replicate a plot from the paper Michel et al., 'Quantitative Analysis of Culture Using Millions of Digitized Books' (2011). Specifically I'm trying to make the one on the top right here:
https://pubmed.ncbi.nlm.nih.gov/21163965/#&gid=article-figures&pid=fig-3-uid-2
I know the paper used v1 of the corpus but I'm doing it with v2 as it's easier to work with. When I use the Google Ngram viewer (specifying the English 2012 corpus which corresponds to v2, a year range of 1875 to 1975, and no smoothing) I get this, which looks pretty close.
When I tried to replicate this in R/ggplot I get this:
1950 and 1883 look pretty consistent with the viewer plot, but I can't figure out what is happening with 1910. There appear to be very few occurrences of the year '1910' in the data set compared with some of the other years. Would anyone with a better understanding of the Google ngrams data set be able to point me in the right direction? Should I be supplementing this with something other than just the 1-gram dataset? Does the Google Ngram viewer pick out occurrences of 1-grams in a different way?
The code I've used is below. A couple of other points: 1910 and 1950 do not seem to exist as 1-grams in the v2 data set, but 1883 does. To get this to even remotely work, I had to grepl for 1950 and 1910 to get any hits (i.e. they all seem to appear as parts of date ranges like 1890-1910, or with some other characters tacked on), rather than just doing a fixed search for those years in the ngram field. I also used purrr::map_dfr to do this rather than just a dplyr::case_when in case years appeared in the same ngram picked up by a grepl (e.g. the range 1883-1910 should be a hit for both of those years, not just one).
library(ggplot2)
library(dplyr)
library(purrr)
#---- Load data ----
counts_file <- file.path("data", "total_counts.txt")
ngrams_file <- file.path("data", "google_books_1gram_eng_v2.gz")
if (!dir.exists("data")) {
dir.create("data")
}
if (!file.exists(counts_file)) {
download.file(
"http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt",
counts_file
)
}
if (!file.exists(ngrams_file)) {
download.file(
"http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-1.gz",
ngrams_file
)
}
one_grams <- read.delim(
gzfile(ngrams_file),
header = FALSE
)
names(one_grams) <- c("ngram", "year", "match_count", "volume_count")
one_grams_subset <- one_grams %>%
filter(year >= 1875 & year <= 1975)
total_counts_temp <- t(
read.table(
counts_file,
header = FALSE
)
)
total_counts_char <- do.call(
rbind,
strsplit(total_counts_temp, ",")
)
total_counts <- apply(total_counts_char, 2, as.numeric)
colnames(total_counts) <- c("year", "match_count", "page_count", "volume_count")
#---- Recreate plot 3A from Michel et al. (2011) ----
year_subset <- function(year_char, one_grams_data) {
one_grams_data %>%
filter(grepl(year_char, .[["ngram"]], fixed = TRUE)) %>%
group_by(year) %>%
summarise(year_count = sum(match_count, na.rm = TRUE)) %>%
mutate(year_gram = year_char)
}
plot_data <- map_dfr(c("1883", "1910", "1950"),
year_subset,
one_grams_subset) %>%
left_join(as_tibble(total_counts), by = "year") %>%
mutate(frequency = 10000 * year_count/match_count) %>%
select(year_gram, year, frequency, year_count)
ggplot(plot_data) +
geom_line(aes(x = year, y = frequency, colour = year_gram)) +
theme_minimal() +
labs(col = "ngram", x = "Year", y = "Frequency")
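To make the matching choice described above concrete, here is a small sketch on a made-up vector of ngram strings (not taken from the real data set). It contrasts exact matching, the fixed substring matching used in year_subset(), and a digit-bounded pattern, which is a hypothetical alternative that avoids counting longer numbers.

```r
# Made-up ngram strings for illustration only.
ngrams <- c("1883", "1883-1910", "1910.", "1950s", "18830")

# Exact match: only a bare "1910" would count, and there is none here.
exact_1910 <- ngrams == "1910"

# Fixed substring match, as in year_subset(): catches ranges and suffixes.
substring_1910 <- grepl("1910", ngrams, fixed = TRUE)
substring_1883 <- grepl("1883", ngrams, fixed = TRUE)

# Digit-bounded match: the year must not be part of a longer number,
# so "18830" no longer counts as a hit for 1883.
bounded_1883 <- grepl("(^|[^0-9])1883([^0-9]|$)", ngrams)
```

Here substring_1883 flags "18830" while bounded_1883 does not, so the two approaches can give different counts on the real data.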
I am trying to extract some information (metadata) from GenBank using the R package "rentrez" and the example I found here https://ajrominger.github.io/2018/05/21/gettingDNA.html. Specifically, for a particular group of organisms, I search for all records that have geographical coordinates and then want to extract data about the accession number, taxon, sequenced locus, country, lat_long, and collection date. As an output, I want a csv file with the data for each record in a separate row. It seems that the code below can do the job but at some point, rows get muddled with data from different records overlapping the neighbouring rows. For example, from 157 records that rentrez retrieves from NCBI 109 records in the file look like what I want to achieve but the rest is a total mess. I would greatly appreciate any advice on how to fix the issue because I am a total newbie with R and figuring out each step takes a lot of time.
setwd ("C:/R-Works")
library('XML')
library('rentrez')
argasid <- entrez_search(db="nuccore", term = "Argasidae[Organism] AND [lat]", use_history=TRUE, retmax=15000)
x <- entrez_fetch (db="nuccore", id=argasid$ids, rettype= "native", retmode="xml", parse=TRUE)
x <-xmlToList(x)
cleanEntrez <- function(x) {
basePath <- 'Seq-entry_seq.Bioseq'
c(
genbank = as.character(x[paste(basePath,
'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
'Textseq-id', 'Textseq-id_accession',
sep = '.')]),
taxon = as.character(x[paste(basePath,
'Bioseq_descr', 'Seq-descr', 'Seqdesc',
'Seqdesc_source', 'BioSource', 'BioSource_org',
'Org-ref', 'Org-ref_taxname',
sep = '.')]),
bseqdesc_title = as.character(x[paste(basePath,
'Bioseq_descr', 'Seq-descr', 'Seqdesc',
'Seqdesc_title',
sep = '.')]),
lat_lon = as.character(x[grep('lat-lon', x) + 1]),
geo_description = as.character(x[grep('country', x) + 1]),
coll_date = as.character(x[grep('collection-date', x) + 1])
)
}
getGenbankMeta <- function(ids) {
allRec <- entrez_fetch(db = 'nuccore', id = ids,
rettype = 'native', retmode = 'xml',
parsed = TRUE)
allRec <- xmlToList(allRec)[[1]]
o <- lapply(allRec, function(x) {
cleanEntrez(unlist(x))
})
temp <- array(unlist(o), dim = c(length(o[[1]]), length(ids)))
seqVec <- temp[nrow(temp), ]
seqDF <- as.data.frame(t(temp[-nrow(temp), ]))
names(seqDF) <- names(o[[1]])[-nrow(temp)]
return(list(seq = seqVec, data = seqDF))
}
write.csv(getGenbankMeta(argasid$ids), 'argasid_georef.csv')
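One possible cause of the muddled rows, offered as an assumption rather than a confirmed diagnosis: array(unlist(o), ...) expects every record to return the same number of fields, so a record with a missing (or repeated) field shifts everything after it. The sketch below shows the idea of looking fields up by name and padding with NA. The records and accession labels are made up, and pad_record() is a hypothetical helper.

```r
# Made-up per-record results; the second record has no collection date.
records <- list(
  c(genbank = "ACC0001", taxon = "Argas sp.", lat_lon = "10.5 N 20.1 E", coll_date = "2010"),
  c(genbank = "ACC0002", taxon = "Ornithodoros sp.", lat_lon = "35.2 S 18.4 E"),
  c(genbank = "ACC0003", taxon = "Argas sp.", lat_lon = "1.0 N 2.0 E", coll_date = "2015")
)
fields <- c("genbank", "taxon", "lat_lon", "coll_date")

# Look each field up by name; absent fields become NA instead of
# pulling values from the next record into this row.
pad_record <- function(rec, fields) {
  out <- rec[fields]
  names(out) <- fields
  out
}

meta <- as.data.frame(
  do.call(rbind, lapply(records, pad_record, fields = fields)),
  stringsAsFactors = FALSE
)
```

If a field can occur more than once in a record, you would also need to decide which occurrence to keep (for example the first) before padding.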
I am working on a research assignment on COVID and using the datalake API to fetch the different kinds of datasets available to us.
I am wondering if it's possible to fetch all outbreak countries.
ids = list("Australia") works for an individual country, but it doesn't seem to accept a wildcard or "all".
Can anyone give me any insights on this, please?
# Total number of confirmed cases in Australia and proportion of getting infected.
today <- Sys.Date()
casecounts <- evalmetrics(
"outbreaklocation",
list(
spec = list(
ids = list("Australia"),
expressions = list("JHU_ConfirmedCasesInterpolated","JHU_ConfirmedDeathsInterpolated"),
start = "2019-12-20",
end = today-1,
interval = "DAY"
)
)
)
casecounts
The easiest way to access a list of countries is in the Excel file linked at https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation. It has a list of countries in the first sheet, and shows which of those have data from JHU.
You could also fetch an approximate list of country-level locations with:
locations <- fetch(
"outbreaklocation",
list(
spec = list(
filter = "not(contains(id, '_'))"
)
)
)
That should contain all of the countries, but could have some non-countries like World Bank regions.
Then, you'd use this code to get the time series data for all of those locations:
location_ids <-
locations %>%
dplyr::select(-location) %>%
unnest_wider(fips, names_sep = ".") %>%
sample_n(15) %>% # include this to test on a smaller set
pull(id)
today <- Sys.Date()
casecounts <- evalmetrics(
"outbreaklocation",
list(
spec = list(
ids = location_ids,
expressions = list("JHU_ConfirmedCasesInterpolated","JHU_ConfirmedDeathsInterpolated"),
start = "2019-12-20",
end = today-1,
interval = "DAY"
)
),
get_all = TRUE
)
casecounts
I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with id #s 00:59). For each person, there are 2 values I need - time of response and picture response given (a number 1 - 16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])
#Creating long dataframe for times
pam[x]_long_times <- gather(
select(pam[x]df, starts_with("resp")),
key = "time",
value = "resp_times"
)
#Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
select(pam[x]df, starts_with("pic")),
key = "picture",
value = "pic_num"
)
#Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
select(resp_times, pic_num) %>%
add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) set up the files in the format I needed without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
I don't know the structure of your JSON files. If your working directory is not the folder that contains the JSON files, try this:
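library(jsonlite)
# setup - list the JSON files with their full paths
json_folder <- "U:/test/" # adjust your folder here
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)
# import data: one list element per file
pam <- vector("list", length(files))
pam_df <- vector("list", length(files))
for (i in seq_along(files)) {
  pam[[i]] <- fromJSON(files[i]) # jsonlite::fromJSON() takes the path as its first argument
  pam_df[[i]] <- as.data.frame(pam[[i]])
}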
This lists all JSON files in the folder, which gives you a character vector of length 60.
The loop then walks along that vector and reads each file.
At the end you can call bind_rows() on the result, or add your own code inside the for loop. If you build up further objects in the loop, remember to initialise them before the loop starts, e.g. pam_long_pics <- NULL
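As a rough sketch of the "add your code in the loop" idea, the per-person steps can be wrapped in a function and applied to every data set. The example uses two made-up one-row data frames instead of real files. The column prefixes resp and pic come from the question, while reshape_person(), the toy values and the file-name pattern are assumptions.

```r
# Zero-padded ids "00" to "59" and the matching (assumed) file names.
ids <- sprintf("%02d", 0:59)
file_names <- paste0("PAM_u", ids, ".json")

# Hypothetical helper: stack the resp* and pic* columns for one person.
reshape_person <- function(df, id) {
  resp <- unlist(df[startsWith(names(df), "resp")], use.names = FALSE)
  pics <- unlist(df[startsWith(names(df), "pic")], use.names = FALSE)
  data.frame(id = id, resp_times = resp, pic_num = pics, stringsAsFactors = FALSE)
}

# Made-up stand-ins for two people's wide data.
toy <- list(
  "00" = data.frame(resp_1 = 1.2, resp_2 = 3.4, pic_1 = 5, pic_2 = 16),
  "01" = data.frame(resp_1 = 0.9, resp_2 = 2.1, pic_1 = 7, pic_2 = 2)
)

# Apply the helper to each person and append the results.
all_long <- do.call(rbind, Map(reshape_person, toy, names(toy)))
```

With the real files, the toy list would be replaced by the data frames read in the loop above, named by their ids.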
Hope that helped? Let me know.
Something along these lines could work:
#library("tidyverse")
#library("jsonlite")
file_list <- list.files(pattern = "\\.json$", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
  mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
  mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
  # .x is the tibble for one file; keep only the relevant columns before gathering
  mutate(Long_times = map(File_contents, ~ gather(select(.x, starts_with("resp")), key = "time", value = "resp_times")),
         Long_pics = map(File_contents, ~ gather(select(.x, starts_with("pic")), key = "picture", value = "pic_num"))) %>%
  unnest(c(Long_times, Long_pics)) %>%
  select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.