I have data that looks like this:
> head(chf)
  Admit.Day.of.Week Type.of.Admission          Patient.Disposition mean_los_dispo
1               SAT         Emergency         Skilled Nursing Home       8.553525
2               FRI          Elective            Home or Self Care       4.224193
3               FRI         Emergency Home w/ Home Health Services       5.789052
4               MON         Emergency         Skilled Nursing Home       8.553525
5               THU         Emergency            Home or Self Care       4.224193
6               WED         Emergency         Skilled Nursing Home       8.553525
I use the following command to get the column labeled mean_los_dispo:
# Mean LOS for each patient disposition
chf$mean_los_dispo <- ave(chf$Length.of.Stay, chf$Patient.Disposition,
                          FUN = mean)
What I want to do is set a variable to hold the value of the mean_los_dispo for each of the four different dispositions, for example
SNH = 8.553525
HSC = 4.224193
...
How would I go about doing this? I want to be able to eventually use paste or something similar to put the information in the title of a graph.
You can use paste. For example, here I create two variables, one numeric (your means) and one character (your dispositions), and concatenate them with paste:
a <- c(1, 2, 3, 4, 5)
b <- c("a", "b", "c", "d", "e")
strs <- paste(b, " = ", as.character(a), sep = "")
This produces:
[1] "a = 1" "b = 2" "c = 3" "d = 4" "e = 5"
In your case you could do something like the following:
unique(paste(chf$Patient.Disposition, " = ", as.character(chf$mean_los_dispo), sep = ""))
unique() will get rid of all of the duplicates.
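If you would rather hold each mean in its own variable for a plot title, a minimal sketch using tapply() instead; this assumes Length.of.Stay is numeric, and the full disposition labels serve as names rather than abbreviations like SNH:
# Named vector of means, one entry per disposition
mean_los <- tapply(chf$Length.of.Stay, chf$Patient.Disposition, mean)

# Look up one value by its disposition label and drop it into a title
snh <- mean_los[["Skilled Nursing Home"]]
plot(chf$Length.of.Stay,
     main = paste("Mean LOS, Skilled Nursing Home:", round(snh, 2)))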
I have searched everywhere trying to find an answer to this question and haven't quite found what I'm looking for, so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which returns its output in XML format. The API is limited to 35 results per call (i.e., you can only provide 35 tracking numbers each time you call the API) and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. When I tried to convert the results from the list into JSON, it dropped the attribute tag, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
final_tracking_info <- list()
for (i in 1:x) { # x = the number of calls to the API the loop will need to make
  usps <- input_tracking_info[i] # input_tracking_info = GET commands
  usps <- read_xml(usps)
  final_tracking_info[[i]] <- usps$TrackResponse
  gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output, "final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(final_tracking_info), file = "Final_Tracking_Info.txt")) # exported the list to a text file, not an ideal format to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is: is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there an easy way or preferred format to select elements given that most of them will have the same name ("TrackDetail") and the results will have different lengths (since each package has a different number of track details) when I'm compiling 1,000s of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
library(magrittr) # for %>%

xml_data <- XML::xmlToList("71563898.xml") %>%
  unlist() %>% # flattening
  unname()     # removing names

data.frame(
  ID      = tail(xml_data, 1), # getting last element
  Summary = head(xml_data, 1), # getting first element
  Info    = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)
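To build the table the question ultimately asks for (tracking number plus first and last track detail), a minimal sketch under the assumption that every saved response converts the way the output above does; files is a hypothetical character vector of paths to the saved XML responses, and if your root node is TrackResponse rather than TrackInfo you would drill down with rec$TrackInfo first:
library(XML)

summaries <- do.call(rbind, lapply(files, function(f) {
  rec <- xmlToList(f) # list of TrackSummary, TrackDetail..., .attrs
  details <- unlist(rec[names(rec) == "TrackDetail"], use.names = FALSE)
  data.frame(
    ID           = rec$.attrs[["ID"]], # tracking number preserved in .attrs
    first_detail = head(details, 1),
    last_detail  = tail(details, 1),
    stringsAsFactors = FALSE
  )
}))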
I have a dataframe with a column with some text in it. I want to do three data pre-processing steps:
1) remove words that occur only once,
2) remove words with a low inverse document frequency (IDF), and
3) remove words that occur most frequently.
This is an example of the data:
head(stormfront_data$stormfront_self_content)
Output:
[1] " , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!"
[2] "bonjour warm brother ! forward speaking !"
[3] " check time time forums. frequently moved columbia distinctly numbered. groups gatherings "
[4] " ! site pretty nice. amount news articles. main concern moment islamification."
[5] " , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed."
[6] " white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
Any help would be greatly appreciated, as I am not too familiar with R.
Here's a solution to Q1 in several steps:
Step 1: clean the data by replacing anything that is not a word character (\\W) with a space:
data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = " "))
Step 2: Make a sorted frequency list of the words:
fw <- as.data.frame(sort(table(strsplit(data2, "\\s{1,}")), decreasing = T))
Step 3: define a pattern to match (namely, all the words that occur only once); make sure you wrap the words in word-boundary markers (\\b) so that only exact matches get matched (e.g., network but not networking):
pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq==1], collapse = "|"), ")\\b")
Step 4: remove matched words:
data3 <- gsub(pattern, "", data2)
Step 5: clean up by removing superfluous spaces:
data4 <- trimws(gsub("\\s{1,}", " ", data3))
Result:
[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
Here is an approach with tidytext. Note that data is a one-column matrix, so it is coerced with as.character() before tokenizing:
library(tidytext)
library(dplyr)

word_count <- tibble(document = seq(1, nrow(data)), text = as.character(data)) %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)

total_count <- tibble(document = seq(1, nrow(data)), text = as.character(data)) %>%
  unnest_tokens(word, text) %>%
  group_by(word) %>%
  summarize(total = n())

words <- left_join(word_count, total_count, by = "word")

words %>%
  bind_tf_idf(word, document, n)
# A tibble: 111 x 7
document word n total tf idf tf_idf
<int> <chr> <int> <int> <dbl> <dbl> <dbl>
1 1 stormfront 10 11 0.139 1.10 0.153
2 1 networking 3 3 0.0417 1.79 0.0747
3 1 site 3 6 0.0417 0.693 0.0289
4 1 board 2 2 0.0278 1.79 0.0498
5 1 forums 2 3 0.0278 1.10 0.0305
6 1 introduction 2 2 0.0278 1.79 0.0498
7 1 local 2 2 0.0278 1.79 0.0498
8 1 main 2 3 0.0278 1.10 0.0305
9 1 member 2 3 0.0278 1.10 0.0305
10 1 online 2 2 0.0278 1.79 0.0498
# … with 101 more rows
From here it is trivial to filter with dplyr::filter, but since you don't define any specific criteria other than "only once", I'll leave that to you; a sketch with placeholder cutoffs follows.
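For illustration only, a sketch of that filtering step; the quantile cutoff for "low" IDF and the max rule for "most frequent" are assumptions, not anything the question specifies:
words_tf_idf <- words %>% bind_tf_idf(word, document, n)

kept <- words_tf_idf %>%
  filter(total > 1,                # 1) drop words that occur only once
         idf > quantile(idf, 0.1), # 2) drop low-IDF words (cutoff is arbitrary)
         total < max(total))       # 3) drop the most frequent word(s)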
Data
data <- structure(c(" , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!",
"bonjour warm brother ! forward speaking !", " check time time forums. frequently moved columbia distinctly numbered. groups gatherings ",
" ! site pretty nice. amount news articles. main concern moment islamification.",
" , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed.",
" white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
), .Dim = c(6L, 1L))
Base R solution:
# Remove double spacing and punctuation at the start of strings:
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
' ', df), "both")), "both")
# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
unique(unlist(strsplit(x, "[^a-z]+")))}))))
# Store the inverse document frequency as a vector: idf => double vector:
document_freq$idf <- log(length(cstr)/document_freq$Freq)
# For each record remove terms that occur only once, terms that occur the
# maximum number of times a word occurs in the dataset, and words with a "low" idf:
# pp_records => data.frame
pp_records <- do.call("rbind", lapply(cstr, function(x){
# Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_,
unlist(strsplit(x, "[^a-z]+")))))),
stringsAsFactors = FALSE)
# Store a vector containing each term's idf: idf => double vector
tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]
# Explicitly return the cleaned record alongside its pre-processed version: => data.frame
return(
data.frame(
cleaned_record = x,
pp_records =
paste0(unique(unlist(
strsplit(gsub("\\s+", " ",
trimws(
gsub(paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
tf_dataf$Freq == max(tf_dataf$Freq)],
collapse = "|"), "", x), "both"
)), "\\s")
)), collapse = " "),
row.names = NULL,
stringsAsFactors = FALSE
)
)
}
))
# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame
ppd_cleaned_df <- cbind(orig_record = df, pp_records)
# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
I have the following data frames
df1 <- data.frame(
Description=c("How are you- doing?", "will do it tomorrow otherwise: next week", "I will work hard to complete it for nextr week1 or tomorrow", "I am HAPPY with this situation now","Utilising this approach can helpα'x-ray", "We need to use interseting <U+0452> books to solve the issue", "Not sure if we could do it appropriately.", "The schools and Universities are closed in f -blook for a week", "Things are hectic here and we are busy"))
and I want to get the following table:
d <- data.frame(
Description=c("Utilising this approach can helpa'x-ray", "How are you- doing", " We need to use interseting <U+0452> books to solve the issue ", " will do it tomorrow otherwise: next week ", " Things are hectic here and we are busy ", "I will work hard to complete it for nextr week1 or tomorrow ", "The schools and Universities are closed in f -blook for a week", " I am HAPPY with this situation now "," I will work hard to complete it for nextr week1 or tomorrow"))
f2<- read.table(text="B12 B6 B9
No Yes Yes
12 6 9
No No Yes
No No Yes
No No Yes
Yes No Yes
11 No Yes
12 11 P
No No Yes
", header=TRUE)
df3 <- cbind(d, f2)
As you can see, the Description column contains stray spaces, colons, and so on, and the 1 after week is a subscript that I was unable to fix. I want to match the data frames based on "Description", i.e., match df1 with df3. Can we do this in R for this case?
We can use stringdist joins from the fuzzyjoin package to match the data based on 'Description', and na.omit to remove the NA rows from the final data frame.
na.omit(fuzzyjoin::stringdist_left_join(df1, df3, by = 'Description'))
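If the defaults miss some rows, the match can be loosened; a sketch (the max_dist value here is an arbitrary assumption, the default is 2):
library(fuzzyjoin)
na.omit(stringdist_left_join(df1, df3, by = "Description",
                             max_dist = 5, ignore_case = TRUE))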
I have a text file that looks like:
1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last
I have another text file that looks like.
1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last
I was wondering how you could read in the data from the first text file and use it to find those sections within the second file, along with all the content up until the next section. So basically, I'm trying to get something like:
Section Content
1 Hello My name is John. It was nice to meet you.
1.1 Hi Hi again. My last name is Doe. 1.1.1 Bye
1.2 Hey Greetings.
.....And so on
I was wondering how I could do so.
The following solution can certainly be improved, but it should give you an idea of how to approach the issue. Depending on the size and structure of the files you need to process, this approach may be fine as-is or require more tuning around the detection of sections and speed.
file1 =
"1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last"
file2 =
"1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last"
file1 = unlist(strsplit(file1, "\n", fixed = TRUE))
file2 = unlist(strsplit(file2, "\n", fixed = TRUE))
positions = unlist(sapply(file1, function(x) grep(paste0("^", x, "$"), file2, ignore.case = TRUE)))
positions = cbind(positions, c(positions[-1]-1, length(file2)))
text = mapply(function(x, y) file2[x:y], positions[,1], positions[,2])
text = lapply(text, function(x) x[-1])
result = cbind(positions, text)
result
# positions text
# 1 Hello 1 2 "My name is John. It was nice to meet you."
# 1.1 Hi 3 5 Character,2
# 1.2 Hey 6 7 "Greetings."
# 2 Next section 8 9 "This is the second section. I am majoring in CS."
# 2.1 New section 10 15 Character,5
# 4 last 16 16 Character,0
# Note that the text column contains lists storing the individual lines.
# e.g. for "2.1 New section":
class(result[5, "text"])
# list
result[5, "text"]
# [[1]]
# [1] "Welcome. I am an undergraduate student." "3 third" #<< note the different spelling of third
# [3] "1. hi" "2. hello"
# [5] "3. hey"
The answer to this question is yes, it can be done. The implementation will vary wildly based on the programming language you are using. A high-level overview would be:
Split the original file into a string array by line. These are your keys for searching the second document.
Read the second file into a string variable.
Iterate through all your keys (iterator x) and find their index in the second document, something like:
int start = secondDocument.indexOf(keys[x]);
int end = secondDocument.indexOf(keys[x + 1]);
Then, with these start and end positions, you can use a substring() function to extract the content:
String matchedContent = secondDocument.substring(start, end);
This works until you get to the last match, because keys[x + 1] will not exist when x is the last key. In that case, end needs to be set to the position of the last character in the document, or you can use a substring method that only takes a starting point.
HTH
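Since the rest of this thread is in R, a rough sketch of that overview in R (the file names are placeholders; unmatched keys such as "3 thrid" come back from regexpr() as -1 and would need filtering first):
keys <- readLines("file1.txt")
doc  <- paste(readLines("file2.txt"), collapse = "\n")

starts <- sapply(keys, function(k) regexpr(k, doc, fixed = TRUE))
ends   <- c(starts[-1] - 1, nchar(doc)) # the last key runs to the end of the document
sections <- substring(doc, starts, ends)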
I want to replace words that repeat themselves one after another in a string with a single occurrence of that word.
My strings look something like this:
text_strings <- c("We have to extract these numbers 12, 47, 48", "The integers numbers are also interestings: 189 2036 314",
"','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456", "We like to to offer you 7890$ per month in order to complete this task... we are joking", "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.", "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.", "you can also extract exotic stuff like a456 gb67 and 45678911ghth", "Writing 1 example is not funny, please consider that 66% is validation+testing", "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]", "Who loves arrays more than me?", "{366,78,90,5}Yes, there are only 4 numbers inside", "Integers are fine but sometimes you like 99 cents after the 99 dollars", "100€ are better than 99€", "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]", "Ok ok 1 2 3 4 5 and the last one is 6", "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando")
I tried:
gsub("\b(?=\\w*(\\w)\1)\\w+", "\\w", text_strings, perl = TRUE)
But nothing happened (the output remained the same).
How can I remove the repeating words such as in
text_strings[9]
#[1] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
Thank you!
You can use gsub with a regular expression that backreferences the repeated word:
gsub("\\b(\\w+)\\W+\\1", "\\1", text_strings, ignore.case = TRUE, perl = TRUE)
[1] "We have to extract these numbers 12, 47, 48"
[2] "The integers numbers are also interestings: 189 2036 314"
[3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
[4] "We like to offer you 7890$ per month in order to complete this task... we are joking"
[5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
[6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
[7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth"
[8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
[9] "You are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
[10] "Who loves arrays more than me?"
[11] "{366,78,90,5}Yes, there are only 4 numbers inside"
[12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
[13] "100€ are better than 99€"
[14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
[15] "Ok 1 2 3 4 5 and the last one is 6"
[16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando
"