Selecting following characters after a pattern match in R

I have a data frame with one column of text containing info that I need to extract. Each question has three attributes associated with it: objectives, KeyResults and responsible. Here is one observation from that column:
[{"text":"Newideas.","translationKey":"new.question-4","id":4,"objectives":"Great","KeyResults":"Awesome","responsible":"myself"},{"text":"customer focus.","translationKey":"new.question-5","id":5,"objectives":"Goalset","KeyResults":"Amazing","responsible":"myself"}
-------------------------DESIRED OUTPUT -----------------------
Question# Objectives KeyResults responsible Question# Objectives KeyResults responsible
4 Great Awesome myself 5 Goalset Amazing myself

The data is valid JSON once you add the missing closing square bracket ]. You can read JSON into an R object using a JSON parser package (e.g. jsonlite).
Say your text is in column text of data frame df; then this will transform that text into an R data frame.
library(jsonlite)
dat <- fromJSON(df$text)
dat
# text translationKey id objectives KeyResults responsible
# 1 Newideas. new.question-4 4 Great Awesome myself
# 2 customer focus. new.question-5 5 Goalset Amazing myself
You need to install jsonlite for this to work:
install.packages("jsonlite")
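As a minimal sketch of the approach above, assuming the sample observation from the question is held in a single string (the variable names txt and wanted are made up for illustration), the missing bracket can be appended before parsing and the four columns of interest selected afterwards:

```r
library(jsonlite)

# The sample observation from the question; note the missing closing "]".
txt <- '[{"text":"Newideas.","translationKey":"new.question-4","id":4,"objectives":"Great","KeyResults":"Awesome","responsible":"myself"},{"text":"customer focus.","translationKey":"new.question-5","id":5,"objectives":"Goalset","KeyResults":"Amazing","responsible":"myself"}'

# Append the bracket only when it is absent, then parse.
if (!endsWith(trimws(txt), "]")) txt <- paste0(txt, "]")
dat <- fromJSON(txt)

# Keep the question id and its three attributes, one row per question.
wanted <- dat[, c("id", "objectives", "KeyResults", "responsible")]
wanted
```

This gives one row per question rather than the single wide row shown in the desired output; the long layout is usually easier to work with in R.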

Related

Trying to remove "ZCTA" from rows

I am trying to extract only the zip code values from my imported ACS data file, however, the rows all include "ZCTA" before the 5 digit zip code. Is there a way to remove that so just the 5 digit zip code remains?
Example:
I tried using strtrim on the data but I can't figure out how to target the last 5 digits. I imagine there is a function or loop that could also do this, since the dataset is so large.
To remove "ZCTA5":
gsub("ZCTA5", "", df$zip) # df - your data.frame name
or
library(stringr)
str_replace(df$zip,"ZCTA5","")
To extract ZIP CODE:
str_sub(df$zip,-5,-1)
Here are a few others for fun:
#option 1
stringr::str_extract(df$zip, "(?<=\\s)\\d+$")
#option 2
gsub("^.*\\s(\\d+)$", "\\1", df$zip)
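The options above can be tried on a small vector using base R only. The question's example data is not shown, so the values below are an assumption about what the rows look like (the prefix, a space, then the five digits):

```r
# Hypothetical sample values; the real column may be formatted differently.
zip <- c("ZCTA5 00601", "ZCTA5 90210")

# Strip the prefix and any whitespace that follows it.
stripped <- sub("^ZCTA5\\s*", "", zip)

# Or keep only the last five characters, whatever precedes them.
last5 <- substring(zip, nchar(zip) - 4, nchar(zip))

stripped
last5
```

Both results are character vectors, which preserves leading zeros; converting them to numeric would drop them.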

In R: How do I create a dataframe name from a string plus a column name plus categorical variable?

Apologies, this seems like an easy question but I can't find the answer.
I'm using key word groups to search strings for important phrases. My table (srchtbl) classifies words by category (general thing they refer to) and component (actions vs. descriptions)
My method requires that I drill down to vectors to extract word groups to search. I'm able to create vectors for each category name and each component.
However, I also want to make dataframes for each category that are named by the category.
my data:
word pattern category component
<chr> <chr> <chr> <chr>
1 pack pack pkg action
2 protect protect pkg action
3 well well pkg description
4 clever clever pkg description
5 care care pkg description
6 safe safe pkg description
These statements create the appropriate dataframe with the appropriate name:
catgroups <- unique(srchtbl$category)
assign(paste("df_", catgroups[i], sep = ""), srchtbl %>% filter(category == catgroups[i]) %>% group_by(component))
This works, but how do I refer to the resulting data frame without repeating the whole statement? If I use:
print(paste("df_", catgroups[3], sep = ""))
[1] "df_pkg"
So it's like I can't reference it again without using the entire assign statement.
Is there another way to concatenate a dataframe name and make a simple assignment, like:
"string" + catgroups[i] <- srchtbl %>% filter(category == catgroups[3]) %>% group_by(component))
Ultimately the code will be looped so that the key word table can expand to any number of categories and components, so I don't want to type individual dataframe names
Consider base R's by or split, which create a named list of data frames from one or more grouping variables; you can then reference the individual data frames with the $ or [[ operator. There is no need to flood the global environment with many similarly structured objects. Instead, maintain one list object. A data frame loses no functionality when stored in a list.
df_list1 <- split(srchtbl, srchtbl$category)
df_list1$pkg
# word pattern category component
# 1 pack pack pkg action
# 2 protect protect pkg action
# 3 well well pkg description
# 4 clever clever pkg description
# 5 care care pkg description
# 6 safe safe pkg description
dflist2 <- by(srchtbl, srchtbl$category, identity)
dflist2[['pkg']]
# word pattern category component
# 1 pack pack pkg action
# 2 protect protect pkg action
# 3 well well pkg description
# 4 clever clever pkg description
# 5 care care pkg description
# 6 safe safe pkg description
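Since the question mentions eventually looping over any number of categories, here is a rough sketch of how the list can be used without typing individual names. The second category ("ship") and its words are made up for illustration; only the pkg rows come from the question.

```r
# Small stand-in for srchtbl; the "ship" rows are hypothetical.
srchtbl <- data.frame(
  word      = c("pack", "protect", "well", "fast", "late"),
  pattern   = c("pack", "protect", "well", "fast", "late"),
  category  = c("pkg", "pkg", "pkg", "ship", "ship"),
  component = c("action", "action", "description", "description", "description"),
  stringsAsFactors = FALSE
)

# One data frame per category, named by the category.
by_cat <- split(srchtbl, srchtbl$category)

# For every category, one word vector per component.
words_by_cat <- lapply(by_cat, function(d) split(d$word, d$component))

names(by_cat)             # category names, usable in a loop
words_by_cat$pkg$action   # words for one category/component pair
```

New categories added to the table show up as new list elements automatically, so the loop body never needs to know their names in advance.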

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in and pare down all the data each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.

R Refer to (part of) data frame using string in R

I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in a input data frame. Thus, for example, in case I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name SearchIn codes PlotAs PlotColour
Chemo data[substr(names(data),1,4)=="diag"] 1,2,3,4,5,6 | red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to a data frame (Style$SearchIn[1]) using codes in a data frame as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to work:
> get(Style$SearchIn[1])
Error in get(vars$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival);
plotFromSpec <- function(data,spec) {
filteredData <- spec$filter(data);
## ... draw a plot from filteredData and other stuff in spec ...
};
spec <- list(
Name='Chemo',
filter=function(data) data[,substr(names(data),1,2)=='ph'],
Codes=c(1,2,3,4,5,6),
PlotAs='|',
PlotColour='red'
);
plotFromSpec(lung,spec);
If you want to store multiple specifications, you could create a list of lists.
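A minimal sketch of such a list of lists follows. It uses a small made-up data frame instead of the survival data so that it runs without extra packages; the second spec ("other") and the flagging step are assumptions for illustration, not the asker's actual function.

```r
# Hypothetical data: two diag columns and one unrelated column.
dat <- data.frame(diag1 = c(1, 7), diag2 = c(3, 9), other = c(0, 0))

# Each spec carries its own filter function and the codes to look for.
specs <- list(
  chemo = list(
    Name   = "Chemo",
    filter = function(data) data[, substr(names(data), 1, 4) == "diag", drop = FALSE],
    Codes  = 1:6
  ),
  other = list(
    Name   = "Other",
    filter = function(data) data[, names(data) == "other", drop = FALSE],
    Codes  = 0
  )
)

# For each spec, flag the rows where any selected column holds one of its codes.
flagged <- lapply(specs, function(s) {
  sub <- s$filter(dat)
  apply(as.matrix(sub), 1, function(r) any(r %in% s$Codes))
})
flagged
```

Using drop = FALSE keeps the filter result a data frame even when only one column matches, so the same flagging code works for every spec.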
Have you tried using quote()?
I'm not entirely sure what you want, but maybe you could store the things you're trying to get() like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence counts. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, first I have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file in the same format as the txt file this produces. When you read it back into R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
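If the word file is produced outside R, it may simply have one word and one count per line with no header. That layout is an assumption, as are the column names word and count. A short sketch of reading such a file into a named vector:

```r
# Write a hypothetical two-column file to a temporary location.
f <- tempfile(fileext = ".txt")
writeLines(c("apple 10", "pie 14", "orange 5", "fruit 4"), f)

# Read it back and name the columns.
tab <- read.table(f, col.names = c("word", "count"), stringsAsFactors = FALSE)

# Build the named vector: counts as values, words as names.
words <- setNames(tab$count, tab$word)
words
```

setNames() does the value and names() assignment in one step, which saves the separate names(words) <- line.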
Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserves all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle, as you can see in both Gavin's answer here and my answer on rhelp.
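A quick round trip illustrates the point. This sketch writes to a temporary file rather than the working directory, and the copy named original exists only so the result can be compared:

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)
original <- words

# Save the object, remove it, then restore it under the same name.
f <- tempfile(fileext = ".rda")
save(words, file = f)
rm(words)
load(file = f)

# Names and values come back unchanged.
identical(words, original)
```

Note that load() restores the object under the name it was saved with, so it will overwrite any existing object called words.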
Initial answer:
I suggest you use as.data.frame to coerce to a data frame and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
