Writing codebook object to a pdf file with R - r

I am using a package called "memisc" to generate a codebook for my survey of 2000 variables. The codebook is essentially a frequency table with a description and the wording of each variable. The package provides a function called codebook that returns a codebook object. The problem is that I can't write this object anywhere. I tried writing it to a text file and to a PDF file, and neither worked.
This is the code to generate a codebook (the author's code):
library(memisc)
Data <- data.set(
vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
income = exp(rnorm(300,sd=.7))*2000
)
Data <- within(Data,{
description(vote) <- "Vote intention"
description(region) <- "Region of residence"
description(income) <- "Household income"
wording(vote) <- "If a general election would take place next tuesday,
the candidate of which party would you vote for?"
wording(income) <- "All things taken into account, how much do all
household members earn in sum?"
foreach(x=c(vote,region),{
measurement(x) <- "nominal"
})
measurement(income) <- "ratio"
labels(vote) <- c(
Conservatives = 1,
Labour = 2,
"Liberal Democrats" = 3,
"Don't know" = 8,
"Answer refused" = 9,
"Not applicable" = 97,
"Not asked in survey" = 99)
labels(region) <- c(
England = 1,
Scotland = 2,
Wales = 3,
"Not applicable" = 97,
"Not asked in survey" = 99)
foreach(x=c(vote,region,income),{
annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
})
missing.values(vote) <- c(8,9,97,99)
missing.values(region) <- c(97,99)
})
r=codebook(Data)
So my final objective is to write the object r to a PDF/Word/Excel/text file. Any of these would be just great.

The easiest way to get the text file from this would be to just use capture.output:
capture.output(r, file="test.txt")
Here are the first few lines read back into R:
head(readLines("test.txt"))
# [1] "==================================================================================="
# [2] ""
# [3] " vote 'Vote intention'"
# [4] ""
# [5] " \"If a general election would take place next tuesday, the candidate of which"
# [6] " party would you vote for?\""

It's possible to output the codebook directly to a txt file using the Write function:
Write(codebook(Data), file = "datacodebook.txt")

Related

readlines function problem from adopted script not working

I am trying to run a script located here. The setup for which looks like this:
readlines <- function(...) {
lapply(list(...), readline)
}
input = readlines(
"Please Input Your Census API Key (Get a Free Census Api Key Here: <https://api.census.gov/data/key_signup.html>): ",
"Enter the State(s) you would like to use separated by a comma (i.e. Oregon, Washington) or enter USA if you want to calculate SVI for all 50 states: ",
"What Year would you like to calculate? (2009-2019 Are Available): ",
"Where would you like to save these files? Please type or copy and paste a complete file path: "
)
#**Install and load required packages**
API.key = input[[1]]
States = as.list(unlist(strsplit(input[[2]], split=",")))
Year = as.integer(input[[3]])
dir.create(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
setwd(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
My understanding is that I would replace the values after input=readlines( with the local info I have. So I rewrote it as:
readlines <- function(...) {
lapply(list(...), readline)
}
input = readlines(
"a952c5c861faf0ec64e05348e67xxxxxxxxxx", # "Please Input Your Census API Key (Get a Free Census Api Key Here: <https://api.census.gov/data/key_signup.html>): "
"USA", # "Enter the State(s) you would like to use separated by a comma (i.e. Oregon, Washington) or enter USA if you want to calculate SVI for all 50 states: "
"2009", # "What Year would you like to calculate? (2009-2019 Are Available): "
"C:/Users/FirstNameLastName/Data/Derived" #Where would you like to save these files? Please type or copy and paste a complete file path: "
)
#**Install and load required packages**
API.key = input[[1]]
States = as.list(unlist(strsplit(input[[2]], split=",")))
Year = as.integer(input[[3]])
dir.create(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
setwd(paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability"))
And I assume it's supposed to work so that the readlines bit pulls in that info and then drops it into the [[1]]..[[4]] spots as appropriate for API.key = input[[1]] etc. But what input has saved after I ran the top bit is:
input
[[1]]
[1] "View(input)"
[[2]]
[1] "View(input)"
[[3]]
[1] "## R"
[[4]]
[1] "readlines <- function(...) {"
Which does not seem right at all. Any advice on where I am going wrong?
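A likely explanation: readline() treats its argument as the prompt to display, not as the answer, and then waits for a line typed at the console. When the whole script is run or pasted at once, the lines that follow get consumed as the "answers", which would explain why input contains pieces of code. A minimal sketch of one way around this, assuming the rest of the script only ever reads input[[1]] to input[[4]], is to skip the prompts and build the list directly. The key and path below are placeholders and out_dir is a hypothetical helper name:

```r
# Hard-coded answers instead of interactive prompts (placeholder values).
input <- list(
  "YOUR_CENSUS_API_KEY",                      # Census API key
  "USA",                                      # state(s), comma separated, or USA
  "2009",                                     # year
  "C:/Users/FirstNameLastName/Data/Derived"   # output folder
)

API.key <- input[[1]]
States  <- as.list(unlist(strsplit(input[[2]], split = ",")))
Year    <- as.integer(input[[3]])

# Same path handling as the original script: backslashes become forward slashes.
out_dir <- paste0(gsub("\\\\", "/", input[[4]]), "/Social_Vulnerability")
```

From here the original dir.create(out_dir) and setwd(out_dir) calls should work unchanged. Alternatively, keep the original readlines() block and answer each prompt by hand in the console.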

Text Mining Scraped Data (R)

I wrote the code below to look for the word "nationality" in a job postings dataset, where I am essentially trying to see how many employers specify that a given candidate must be of a particular visa type or nationality.
I know that in the raw data itself (in Excel), there are several job descriptions where the word "nationality" is mentioned.
nationality_finder = function(string){
nationality = c(" ")
split_string = strsplit(string, split = NULL)
split_string = split_string[[1]]
flag = 0
for(letter in split_string){
if(flag > 0){nationality = append(nationality, letter)}
if(letter == "nationality "){flag = 1}
if(letter == " "){flag = flag-0.5}
}
nationality = paste(nationality, collapse = '')
return(nationality)
}
for(n in 1:length(df2$description)){
df2$nationality[n] <- nationality_finder(df2$description[n])
}
df2%>%
view()
The code runs without errors, but it does not produce what I am looking for. I am essentially looking to create another variable where 1 indicates that the word "nationality" is mentioned, and 0 otherwise. Specifically, I am looking for words such as "citizen" and "nationality" in the job description variable. The text of each job description is extremely long, so I give a summarized version here for brevity.
Text example for a job description in the dataset
Title: Event Planner
Nationality: Saudi National
Location: Riyadh, Saudi Arabia
Salary: Open
Salary depends on the candidates skills, experience, and other attributes.
Another job description:
- Have recently graduated or looking for a career change and be looking for
an entry level role (we will offer full training)
- Priority will be taken for applications by U.S. nationality holders
You can try something like this. I'm assuming you have a data.frame named dats, and you want to add a new column.
dats$check <- as.numeric(grepl("nationality",dats$description,ignore.case=TRUE))
dats$check
[1] 1 1 0 1
grepl() detects the string nationality in the column dats$description, ignoring case (ignore.case = TRUE), and as.numeric() converts TRUE/FALSE into 1/0.
With fake data:
dats <- structure(list(description = c("Title: Event Planner\n \n Nationality: Saudi National\n \n Location: Riyadh, Saudi Arabia\n \n Salary: Open\n \n Salary depends on the candidates skills, experience, and other attributes.",
"- Have recently graduated or looking for a career change and be looking for\n an entry level role (we will offer full training) \n \n - Priority will be taken for applications by U.S. nationality holders ",
"do not have that word here", "aaaaNationalitybb"), check = c(1,
1, 0, 1)), row.names = c(NA, -4L), class = "data.frame")
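Since the question also mentions "citizen", the same idea can be extended to several keywords by joining them with | in the pattern. This is only a sketch: the postings data frame and the keywords vector are made up for illustration.

```r
# Hypothetical example data; in practice this would be the real job postings.
postings <- data.frame(
  description = c(
    "Nationality: Saudi National",
    "Must be a U.S. citizen",
    "No restriction mentioned",
    "Open to all CITIZENS and residents"
  ),
  stringsAsFactors = FALSE
)

# Build one regular expression that matches any of the keywords.
keywords <- c("nationality", "citizen")
pattern  <- paste(keywords, collapse = "|")

# 1 if any keyword occurs in the description (case-insensitive), 0 otherwise.
postings$check <- as.numeric(grepl(pattern, postings$description, ignore.case = TRUE))
```

Because this is plain substring matching, "citizen" also matches "citizens" and "citizenship", which is probably what is wanted here.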

How to improve R code to extract identifiers from a structured text file

My goal is to parse a structured text file and extract 4 unique identifiers into an R data frame.
As a first step, I've run
c <- read_lines("minex_cochrane.txt")
This gives the character vector c, where each element is a line:
c <- c("Record #1 of 3", "ID: CN-00966682", "TI: A multi-center, randomized controlled trial of a group psychological intervention for psychosis with comorbid cannabis dependence over the early course of illness",
"SO: Schizophrenia research", "YR: 2013", "VL: 143", "NO: 1",
"CC: Drugs and Alcohol", "PG: 138‐142", "PM: PUBMED 23187069",
"PT: Journal Article; Multicenter Study; Randomized Controlled Trial",
"DOI: 10.1016/j.schres.2012.10.018", "US: https://www.cochranelibrary.com/central/doi/10.1002/central/CN-00966682/full",
"", "", "Record #2 of 3", "ID: CN-00917992", "TI: Effectiveness of a self-guided web-based cannabis treatment program: randomized controlled trial",
"SO: Journal of medical internet research", "YR: 2013", "VL: 15",
"NO: 2", "PG: e26", "PM: PUBMED 23470329", "XR: EMBASE 23470329",
"PT: Journal Article; Randomized Controlled Trial; Research Support, Non‐U.S. Gov't; Research Support, U.S. Gov't, Non‐P.H.S.",
"KY: Adult; Australia; Female; Health Behavior; Humans; Internet; Male; Marijuana Abuse [psychology, *therapy]; Outcome Assessment (Health Care); Self Care; Telemedicine [*methods]; Therapy, Computer‐Assisted; Treatment Outcome; Young Adult",
"DOI: 10.2196/jmir.2256", "US: https://www.cochranelibrary.com/central/doi/10.1002/central/CN-00917992/full",
"", "", "")
The following detects, then cleans up, my desired unique IDs.
library(stringr)
id_l <- str_detect(c, "ID: ")
id_vec <- c[id_l == TRUE]
id <- str_replace(id_vec, "ID: ", "")
pmid_l <- str_detect(c,"PM: PUBMED")
pmid_vec <- c[pmid_l == TRUE]
pmid <- str_replace(pmid_vec, "PM: PUBMED ", "")
cs <- cbind(id, pmid)
Producing the following output as desired.
> cs
id pmid
[1,] "CN-00966682" "23187069"
[2,] "CN-00917992" "23470329"
However, this seems overly cumbersome. I would like to improve my code to do this in a more compact and efficient fashion to facilitate parsing a large file.
Maybe something like this?
The code assumes that each "ID: xyz" line is followed by a "PM: xyz" line before the next "ID: xyz" line.
ids = c[grepl("^ID: |^PM: ", c)] # select each element starting with either "ID: " or "PM: "
df = matrix(data = ids, nrow = length(ids)/2, ncol = 2, byrow = TRUE) # transform the vector into a matrix, assuming the order is always ID, PM (so every ID entry needs a PM entry too)
df = apply(df, 2, function(f) gsub("ID: |PM: |PUBMED ", "", f)) # remove "ID: ", "PM: " and "PUBMED " from all of the strings
df
You could use a regex with an "or" inside a lookbehind condition:
matrix(na.omit(str_extract(c, "(?<=ID: |PM: PUBMED )(.+)")),
ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "CN-00966682" "23187069"
#[2,] "CN-00917992" "23470329"
More concise would be to apply str_extract() directly:
library(stringr)
library(magrittr)
cs <- data.frame(
id = str_extract(c, '(?<=^ID: ).*') %>% .[!is.na(.)],
pmid = str_extract(c, '(?<=PUBMED ).*') %>% .[!is.na(.)]
)
Resulting in:
id pmid
1 CN-00966682 23187069
2 CN-00917992 23470329
Caveat:
This assumes almost perfect consistency/regularity in your data.
EDIT
Can be simplified using na.omit() that I had forgotten about (credit to Roland):
cs <- data.frame(
id = na.omit(str_extract(c, '(?<=^ID: ).*')),
pmid = na.omit(str_extract(c, '(?<=PUBMED ).*'))
)
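If some records lack a "PM: " line, the ids and pmids extracted above would no longer line up. A minimal sketch of a more defensive variant, assuming every record starts with a "Record #" line as in the example, is to split the lines by record first and then extract each field per record. The lines vector and the first_match helper are made up for illustration:

```r
# Hypothetical input: the second record has no "PM: " line.
lines <- c(
  "Record #1 of 3", "ID: CN-1", "PM: PUBMED 111", "",
  "Record #2 of 3", "ID: CN-2", "",
  "Record #3 of 3", "ID: CN-3", "PM: PUBMED 333"
)

# Running record number: increases by one at every "Record #" line.
rec <- cumsum(grepl("^Record #", lines))
records <- split(lines, rec)

# Return the first line matching `pattern` with the prefix removed, or NA.
first_match <- function(x, pattern) {
  hit <- grep(pattern, x, value = TRUE)
  if (length(hit) == 0) NA_character_ else sub(pattern, "", hit[1])
}

cs <- data.frame(
  id   = vapply(records, first_match, character(1), pattern = "^ID: ",
                USE.NAMES = FALSE),
  pmid = vapply(records, first_match, character(1), pattern = "^PM: PUBMED ",
                USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

This is slower than the vectorised versions above, so it is probably only worth it when the file is known to be irregular.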

Row/Col Name Vectors for Matrix in R on Data Camp

I'm working through R on DataCamp, this is my exercise:
# Box office Star Wars: In Millions (!)
# First element: US, Second element: non-US
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
# Construct matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
# Add your code here such that rows and columns of star_wars_matrix have a name!
This is my original code:
colnames(star_wars_matrix) <- c("US", "non-US")
rownames(star_wars_matrix) <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
After getting frustrated and trying a bunch of different things, I succumbed to the help button and it gave me the solution, this:
rownames(star_wars_matrix) <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix) <- c("US", "non-US")
So my question is why? Do rows always come first when naming rows and cols in a matrix? I appreciate your time.
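As far as R itself is concerned, the order should not matter: both assignments fill in the same dimnames attribute, so either order gives the same matrix. The rejection therefore probably comes from how the exercise is checked, although that part is a guess. A small sketch to verify the R side, with variable names (box_office, titles, regions, m1 to m3) made up for illustration:

```r
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
titles  <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
regions <- c("US", "non-US")

# Columns named first, then rows.
m1 <- matrix(box_office, nrow = 3, byrow = TRUE)
colnames(m1) <- regions
rownames(m1) <- titles

# Rows named first, then columns.
m2 <- matrix(box_office, nrow = 3, byrow = TRUE)
rownames(m2) <- titles
colnames(m2) <- regions

# Both names supplied at construction time (rows first, then columns).
m3 <- matrix(box_office, nrow = 3, byrow = TRUE,
             dimnames = list(titles, regions))

same_result <- identical(m1, m2) && identical(m1, m3)
```

The one place where rows really do come first is the dimnames list itself: its first element holds the row names and its second the column names.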

Using R to parse out Surveymonkey csv files

I'm trying to analyse a large survey created with SurveyMonkey. The CSV file has hundreds of columns, and the output format is difficult to use because the headers run over two lines.
Has anybody found a simple way of managing the headers in the CSV file so that the analysis is manageable?
How do other people analyse results from Surveymonkey?
Thanks!
You can export it in a convenient form that fits R from Surveymonkey, see download responses in 'Advanced Spreadsheet Format'
What I did in the end was print out the headers using libreoffice labeled as V1,V2, etc. then I just read in the file as
m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)
and then just did the analysis against m1$V10, m1$V23 etc...
To get around the mess of multiple columns I used the following little function
# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
# e.g. mcols(df, c(14:18))
exp <- paste('df[,', cols, ']', sep='', collapse=',' )
# this creates something like...
# "df[,14],df[,15],df[,16],df[,17],df[,18]"
# now we just want to do a paste of this expression...
nexp <- paste(" paste(", exp, ", sep=' ')")
# so now nexp looks something like...
# " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep=' ')"
# now we just need to parse this text... and eval() it...
newcol <- eval(parse(text=nexp))
newcol <- gsub(' +', ' ', newcol) # replace runs of spaces by a single one
newcol <- gsub('^ *', '', newcol) # remove leading spaces
gsub(' *$', '', newcol) # remove trailing spaces
}
# mcols(df, c(14:18))
No doubt somebody will be able to clean this up!
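One possible cleanup, as a sketch: do.call() can hand the selected columns to paste() directly, which avoids building and parsing a code string. The name mcols2 is made up here, and the function is meant to behave like mcols() above for character columns:

```r
# Merge the given columns into one string per row, separated by single spaces.
mcols2 <- function(df, cols) {
  # unname() so that column names are never mistaken for paste() arguments
  parts  <- unname(as.list(df[, cols, drop = FALSE]))
  merged <- do.call(paste, c(parts, sep = " "))
  merged <- gsub(" +", " ", merged)  # collapse runs of spaces
  trimws(merged)                     # drop leading and trailing whitespace
}

# Hypothetical example: each answer sits in exactly one of several columns.
demo <- data.frame(
  a = c("x", ""),
  b = c("", "y"),
  c = c("z", "w"),
  stringsAsFactors = FALSE
)
merged_demo <- mcols2(demo, 1:3)
```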
To tidy up Likert-like scales I used:
# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
tidylik4 <- function(x) {
xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
y <- ifelse(x == '', NA, x)
ordered(y, levels=xlevels)
}
for (i in 44:52) {
m2[,i] <- tidylik4(m2[,i])
}
Feel free to comment as no doubt this will come up again!
I have to deal with this pretty frequently, and having the headers on two rows is a bit painful. This function fixes that issue so that you only have a one-row header to deal with. It also joins the multipunch questions so you have "top - bottom" style naming.
#' @param x The path to a surveymonkey csv file
fix_names <- function(x) {
rs <- read.csv(
x,
nrows = 2,
stringsAsFactors = FALSE,
header = FALSE,
check.names = FALSE,
na.strings = "",
encoding = "UTF-8"
)
rs[rs == ""] <- NA
rs[rs == "NA"] <- "Not applicable"
rs[rs == "Response"] <- NA
rs[rs == "Open-Ended Response"] <- NA
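nms <- c()
pre <- NULL # prefix carried over from the most recent top-level header
for(i in seq_len(ncol(rs))) {
current_top <- rs[1,i]
current_bottom <- rs[2,i]
if(i + 1 <= ncol(rs)) {
coming_top <- rs[1, i+1]
coming_bottom <- rs[2, i+1]
} else {
# last column: there is no following column to look at
coming_top <- NA
coming_bottom <- NA
}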
if(is.na(coming_top) & !is.na(current_top) & (!is.na(current_bottom) | grepl("^Other", coming_bottom)))
pre <- current_top
if((is.na(current_top) & !is.na(current_bottom)) | (!is.na(current_top) & !is.na(current_bottom)))
nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")
if(!is.na(current_top) & is.na(current_bottom))
nms[i] <- current_top
}
nms
}
Note that it returns the names only. I typically just read.csv with ..., skip = 2, header = FALSE, save to a variable and overwrite the names of the variable. It also helps a lot to set your na.strings and stringsAsFactors = FALSE.
nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms
As of November 2013, the webpage layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You'll get raw data as first row = question headers / each following row = 1 response, possibly split between multiple files if you have many responses / questions.
The issue with the headers is that columns with "select all that apply" will have a blank top row, and the column heading will be the row below. This is only an issue for those types of questions.
With this in mind, I wrote a loop to go through all columns and replace the column name with the value from the second row if the column name was blank (which has a character length of 1).
Then, you can kill the second row of the data and have a tidy data frame.
for(i in 1:ncol(df)){
newname <- colnames(df)[i]
if(nchar(newname) < 2){
colnames(df)[i] <- df[1,i]
}
}
# drop the row that held the sub-headers
df <- df[-1,]
Coming to the party late, but this is still an issue and the best workaround I've found is using a function to paste the column names and sub-column names together, based on repeating values.
For instance, if exporting to .csv, the repeated column names will automatically be replaced with an X in RStudio. If exporting to .xlsx, the repeated value will be "...".
Here's a base R solution:
sm_header_function <- function(x, rep_val){
orig <- x
sv <- x
sv <- sv[1,]
sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
sv <- t(sv)
sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
names(sv)[1] <- "name"
names(sv)[2] <- "value"
sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
sv$new_value <- paste0(sv$new_value, " ", sv$value)
new_names <- as.character(sv$new_value)
colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
orig <- orig[-c(1),]
return(orig)
}
sm_header_function(df, "X")
sm_header_function(df, "...")
With some sample data, the change in column names would look like this:
Original export from SurveyMonkey:
> colnames(sample)
[1] "Respondent ID" "Please provide your contact information:" "...11"
[4] "...12" "...13" "...14"
[7] "...15" "...16" "...17"
[10] "...18" "...19" "I wish it would have snowed more this winter."
Cleaned export from SurveyMonkey:
> colnames(sample_clean)
[1] "Respondent ID" "Please provide your contact information: Name"
[3] "Please provide your contact information: Company" "Please provide your contact information: Address"
[5] "Please provide your contact information: Address 2" "Please provide your contact information: City/Town"
[7] "Please provide your contact information: State/Province" "Please provide your contact information: ZIP/Postal Code"
[9] "Please provide your contact information: Country" "Please provide your contact information: Email Address"
[11] "Please provide your contact information: Phone Number" "I wish it would have snowed more this winter. Response"
Sample data:
structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621,
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin",
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale",
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's",
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2",
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia",
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa",
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.",
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104",
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country",
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com",
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com",
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646",
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944",
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response",
"Strongly disagree", "Strongly agree", "Neither agree nor disagree",
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
How about the following: use read.csv() with header=FALSE. Make two arrays, one with the two lines of headings and one with the answers to the survey. Then paste() the two header rows together. Finally, use colnames().
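A minimal sketch of that approach, assuming the first column always has a top-level header and that blank top cells belong to the question to their left. The inline csv_text stands in for the exported file and is made up for illustration; with a real export you would pass the file path instead of text =.

```r
# Hypothetical two-row-header export, inlined so the example is self-contained.
csv_text <- c(
  "Respondent ID,Contact information,,Snow",
  ",Name,Company,Response",
  "1,Ada,ACME,Agree",
  "2,Bob,Initech,Disagree"
)

# Read the two header rows and the answers separately.
headers <- read.csv(text = csv_text, header = FALSE, nrows = 2,
                    stringsAsFactors = FALSE, na.strings = "")
answers <- read.csv(text = csv_text, header = FALSE, skip = 2,
                    stringsAsFactors = FALSE)

top    <- unlist(headers[1, ], use.names = FALSE)
bottom <- unlist(headers[2, ], use.names = FALSE)

# Carry each top-level header forward over the blank cells that follow it.
for (i in seq_along(top)[-1]) {
  if (is.na(top[i])) top[i] <- top[i - 1]
}

# Paste the two rows together; keep the top header alone if there is no sub-header.
colnames(answers) <- ifelse(is.na(bottom), top, paste(top, bottom, sep = " - "))
```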
