Run A For Loop on a Data Frame - r

I have a dataframe that looks like this:
draftclasses
            Name  Yards TDs Class
1      Joe Smith 333.30  34  2017
2 Carson Mathers 386.20  22  2021
3     Bo Someome 345.20  22  2022
4     Im Notgood 170.99   7  2017
What I would like to do is get the subset of Yards for each value in the Class column. I know how to filter out a subset of the data frame:
year2021 = draftclasses[!is.na(draftclasses$Yards) & draftclasses$Class == 2021,]
I am also aware that I could use a for loop, but I don't know how to design it. I've read around online a bit, but I am still unsure how to write a loop that produces this output for each year in the Class column.
Ideally I would like each year in Class to label an object containing all of the yards for that class, like so:
> year2017
[1] "333.3" "170.99"
Any help would be appreciated. Thanks.

With the help of split we can split the Yards for each Class.
result <- split(draftclasses$Yards, draftclasses$Class)
This returns a list in result. If you want a separate vector for each Class, you can name the list elements and use list2env.
names(result) <- paste0('year', names(result))
list2env(result, .GlobalEnv)
year2017
#[1] 333.30 170.99
year2021
#[1] 386.2
year2022
#[1] 345.2
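If you specifically want the explicit for loop the question mentions, one sketch (using assign() and a toy reconstruction of the question's data; split() above is the more idiomatic route) is:

```r
# Toy data frame rebuilt from the question
draftclasses <- data.frame(
  Name  = c("Joe Smith", "Carson Mathers", "Bo Someome", "Im Notgood"),
  Yards = c(333.3, 386.2, 345.2, 170.99),
  TDs   = c(34, 22, 22, 7),
  Class = c(2017, 2021, 2022, 2017)
)

# Loop over each distinct Class and assign() one yearNNNN vector per year
for (yr in unique(draftclasses$Class)) {
  yards <- draftclasses$Yards[!is.na(draftclasses$Yards) & draftclasses$Class == yr]
  assign(paste0("year", yr), yards)
}

year2017
# [1] 333.30 170.99
```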

Regex for variable length

I am looking for a regex or another command/workaround to extract all pKa values from a very large list covering hundreds of chemicals. So far, I have managed to extract the desired pKa values from a subset of my list.
I wonder, however, if it is also possible to extract the whole lines that contain the pKa values. I figured that since they all have a rather comparable length, you could extract these with a regex, but I don't know how to encode the length inside the regex in combination with matching the specific lines containing the pKa values.
The reason I ask is that my regex does not catch pKa values that start with a 0. Chemicals like this are uncommon, but they do exist. By extracting the whole line, I would also catch the few entries that give a temperature value, which my regex does not include.
Down below is a (hopefully) minimal working example with an extract of my list.
library(stringr)
list_pkas <- structure(list(Chemical = c("MCPA", "Aspirin"), pka = c("3.2.13Dissociation Constants\r\npKa= 3.13\r\nCessna AJ, Grover R; J Agric Food Chem 26: 289-92(1978)\r\nHazardous Substances Data Bank (HSDB)",
"3.2.14Dissociation Constants\r\nAcidic pKa\r\n3.47\r\nTested as SID 103164874 in AID 781325: https://pubchem.ncbi.nlm.nih.gov/bioassay/781325#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL\r\nAcidic pKa\r\n3.5\r\nTested as SID 103164874 in AID 781326: https://pubchem.ncbi.nlm.nih.gov/bioassay/781326#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL; DrugBank\r\npKa = 3.49 at 25 °C\r\nO'Neil, M.J. (ed.). The Merck Index - An Encyclopedia of Chemicals, Drugs, and Biologicals. Whitehouse Station, NJ: Merck and Co., Inc., 2006., p. 140\r\nHazardous Substances Data Bank (HSDB)"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
string <- list_pkas$pka[2]
string_sub <- str_sub(string, 7)
pkas <- str_extract_all(string_sub, "([1-9]\\.[0-9]{1,2})")
The expected output should be for MCPA:
3.13
or
pKa=3.13
For Aspirin:
3.47
3.5
pKa = 3.49 at 25 °C
Any help is much appreciated!
You can use the lookbehind assertion (?<=foo):
str_extract_all(list_pkas$pka, "(?<=pKa\\D{0,5})\\d.*")
# [[1]]
# [1] "3.13"
#
# [[2]]
# [1] "3.47" "3.5" "3.49 at 25 °C"
I think that this expression might do what you need:
"pKa\\D{0,5}((?:\\s*\\d+\\.*\\d*)(?:\\s*at\\s*\\d+\\s*.*?\\w)*)"
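As a quick sanity check of that expression on a toy string (an assumption here: str_match_all() is used to pull capture group 1, since str_extract_all() would return the full match including the "pKa" prefix):

```r
library(stringr)

# Toy string modeled on the Aspirin entry; the real data lives in list_pkas$pka
s <- "Acidic pKa\r\n3.47\r\npKa = 3.49 at 25 °C"
pattern <- "pKa\\D{0,5}((?:\\s*\\d+\\.*\\d*)(?:\\s*at\\s*\\d+\\s*.*?\\w)*)"

# Column 1 of the match matrix is the full match, column 2 is group 1
out <- str_match_all(s, pattern)[[1]][, 2]
out
# should give "3.47" and "3.49 at 25 °C"
```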

Compiling API outputs in XML format in R

I have searched everywhere trying to find an answer to this question and haven't quite found what I'm looking for, so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides its output in XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers to get info on each time you call the API) and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. When I tried to convert the results from the list into JSON, it dropped the attribute tag, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
  usps <- input_tracking_info[i] # input_tracking_info = GET commands
  usps <- read_xml(usps)
  final_tracking_info[[i]] <- usps$TrackResponse
  gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output,"final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(working_list),file = "Final_Tracking_Info.txt")) # exported the list to a textfile, was not an ideal format to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is: is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there an easy way/preferred format to select on, given that I know most of the columns will have the same name ("TrackDetail") and the DFs will have different lengths (since each package has a different number of track details), when I'm compiling 1,000s of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
library(XML)
library(magrittr) # for %>%

xml_data <- XML::xmlToList("71563898.xml") %>%
  unlist() %>% # flatten
  unname()     # remove names

data.frame(
  ID      = tail(xml_data, 1), # last element
  Summary = head(xml_data, 1), # first element
  Info    = xml_data %>% head(-1) %>% tail(-1) # drop first and last elements
)

Access Nested List Item for Functions Applied to a Data Frame

I'm trying to split a string in one column...
> df.arpt
arpt
1 CMH 39402
2 IAH 97571
3 DAL 67191
4 HOU 07614
5 OKC 11127
...and break it out into two new columns with a result that looks like this...
> df.arpt
arpt arptCode arptID
1 CMH 39402 CMH 39402
2 IAH 97571 IAH 97571
3 DAL 67191 DAL 67191
4 HOU 07614 HOU 07614
5 OKC 11127 OKC 11127
I really want something like this to be possible...
> df.arpt$arptCode <- strsplit(df.arpt$arpt, " ")[[...]][1]
> df.arpt$arptID <- strsplit(df.arpt$arpt, " ")[[...]][2]
... where the ... in the code represents "for every record in the data frame".
Any suggestions on how to go about this? (I'd like to stick with base R / "out-of-the-box" R rather than higher-level packages.) Am I thinking about this the right way in R?
If the arptCode values are row names, you can convert them into a column.
library(tidyverse)
df.arpt %>%
rownames_to_column(var = "arptCode")
If they are not row names then you can use separate.
library(tidyverse)
df.arpt %>%
  separate(arpt, into = c('arptCode', 'arptID'))
How about this:
df<-data.frame(arpt =c("CMH 39402", "IAH 97571", "DAL 67191", "HOU 07614", "OKC 11127"))
tidyr::separate(df, arpt, into = c("arptCode", "arptID"))
Because the strings are all fixed length, I was able to apply the substr function instead in order to move past the problem. However, I still don't know what the solution would be if the result of the function was a list.
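Since the question asks to stay in base R, here is one sketch of the list-result case: strsplit() returns a list with one character vector per row, and do.call(rbind, ...) stacks those vectors into a matrix whose columns can be assigned directly (assuming every value splits into exactly two pieces):

```r
# Toy data frame rebuilt from the question
df.arpt <- data.frame(
  arpt = c("CMH 39402", "IAH 97571", "DAL 67191", "HOU 07614", "OKC 11127"),
  stringsAsFactors = FALSE
)

# strsplit gives a list of length-2 vectors; rbind them into a 2-column matrix
parts <- do.call(rbind, strsplit(df.arpt$arpt, " ", fixed = TRUE))
df.arpt$arptCode <- parts[, 1]
df.arpt$arptID   <- parts[, 2]
```

Note that keeping the split pieces as character is what preserves the leading zero in "07614"; converting that column to numeric would drop it.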

How do I subset a list with mixed data type and data structure?

I have a list which includes a mix of data types (character) and data structures (data frame).
I want to keep only the data frames and remove the rest.
> head(list)
[[1]]
[1] "/Users/Jane/R/12498798.txt error"
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
[[3]]
[1] "/Users/Jane/R/12498780.txt error"
I expect the final list to contain only dataframes:
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
Based on the example, it is possible that the OP's list elements are vectors and the goal is to remove any element containing the 'error' substring:
list[!sapply(list, function(x) any(grepl("error$", x)))]
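If the kept elements should instead be identified by their structure rather than by the "error" text, a sketch using Filter() with is.data.frame (on a toy list standing in for the OP's object) is:

```r
# Toy list modeled on the question: character elements mixed with a data frame
mixed <- list(
  "/Users/Jane/R/12498798.txt error",
  data.frame(
    match = c("Japan arrests man for taking gun", "Extradition bill turns ugly"),
    file  = "/Users/Jane/R/12498770.txt",
    stringsAsFactors = FALSE
  ),
  "/Users/Jane/R/12498780.txt error"
)

# Keep only the elements that actually are data frames
only_dfs <- Filter(is.data.frame, mixed)
length(only_dfs)
# [1] 1
```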

Replace text in one data-frame using a look up to another data-frame

I have the task of searching through text, replacing peoples names and nicknames with a generic character string.
Here is the structure of my data frame of names and corresponding nicknames:
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of my data frame containing text
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
Giving these dataframes
df_name_nick:
names nicknames
1 Thomas Tom
2 Thomas Tommy
3 Abigail Abi
4 Abigail Abby
5 Abigail Abbey
df_name_comment:
text_names text_comment
1 Abigail Tommy sits next to Abbey
2 Thomas As a footballer Tommy is very good
3 Abigail Abby is a mature young lady
4 Thomas Tom is a handsome man
5 Colin Tom is friends with Colin and Abi
I am looking for a routine that will search through each row of df_name_comment, use df_name_comment$text_names to look up the corresponding nicknames in df_name_nick, and replace any of those nicknames in the comment with XXX.
Note for each person's name there can be several nicknames.
Note that in each text comment, only the appropriate name for that row is replaced, so that we would get this as output:
Abigail "Tommy sits next to XXX"
Thomas "As a footballer, XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
I'm thinking this will require a cunning combination of gsub, match, and apply functions (mapply, sapply, etc.).
I've searched on Stack Overflow for something similar to this request and can only find very specific regex solutions based on data frames with unique row elements, and not something that I think will work with generic text lookups and gsubs via multiple nicknames.
Can anyone please help me solve my predicament?
With thanks
Nevil
(newbie R programmer since Jan 2017)
Here is an idea via base R. We paste the nicknames together for each name, collapsed by |, so we can pass them as a regex to gsub and replace the matched words of each comment with XXX. We use mapply to do that after merging our aggregated nicknames with df_name_comment.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2
Which gives,
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Abigail XXX is a mature young lady
3 Colin Tom is friends with Colin and Abi
4 Thomas As a footballer XXX is very good
5 Thomas XXX is a handsome man
Note 1: Replacing NA in nicknames with 0 is needed because NA (the default fill in merge for unmatched elements) would convert the comment string to NA as well when passed to gsub.
Note 2: The order is also changed by merge, but you can sort as you wish as usual.
Note 3: It is better to have your variables as characters rather than factors, so either read the data frames with stringsAsFactors = FALSE or convert via:
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
EDIT
Based on your comment, we can simply match the comments' names with our aggregated data set, save that in a vector and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Hope this helps!
l <- apply(df_name_comment, 1, function(x) {
  nicks <- df_name_nick[df_name_nick$names == x["text_names"], "nicknames"]
  if (length(nicks) > 0) {
    gsub(paste(nicks, collapse = "|"), "XXX", x["text_comment"])
  } else {
    x["text_comment"]
  }
})
df_name_comment$text_comment <- unname(l)
Don't forget to let us know if it solved your problem :)
Data
df_name_nick <- data.frame(names,nicknames,stringsAsFactors = F)
df_name_comment <- data.frame(text_names,text_comment,stringsAsFactors = F)
Solution 2
EDIT: In this initial solution I manually checked with grepl whether a nickname was present, and then gsubbed with one of the matching nicknames. I knew the '|' operator worked with grepl, but I did not realize it also works with gsub. So credit to Sotos for that idea.
df = df_name_comment
for(i in 1:nrow(df))
{
matching_nicknames = df_name_nick$nicknames[df_name_nick$names==df$text_names[i]]
if(length(matching_nicknames)>0)
{
df$text_comment[i] = mapply(sub, pattern=paste(paste0("\\b",matching_nicknames,"\\b"),collapse="|"), "XXX", df$text_comment[i])
}
}
Output
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Thomas As a footballer XXX is very good
3 Abigail XXX is a mature young lady
4 Thomas XXX is a handsome man
5 Colin Tom is friends with Colin and Abi
Hope this helps!