Run easyPubMed queries row-wise from a data frame column in R

I have a dataframe like the one below.
I would like to run a PubMed query for each row using the easyPubMed package. Each row/query should fetch a list of PMIDs, and that list should be stored in another column called 'PMID'.

This might work
library(easyPubMed)
library(purrr)
Query <- c('rituximab OR bevacizumab','meningitis OR headache')
Heading <- c('A','B')
x <- as.data.frame(cbind(Heading,Query),stringsAsFactors = F)
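x$PMID <- ""
# One call to get_pubmed_ids() per query
ids <- map(x[, "Query"], get_pubmed_ids)
for (i in seq_along(ids)) {
  # Collapse the IDs returned for each query into one comma-separated string
  x[i, "PMID"] <- paste(unlist(ids[[i]][["IdList"]]), collapse = ",")
}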
I think that sapply() may not return the expected result here, so going the map() way from the purrr package is safer.
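
As a hedged sketch of the same idea in base R, vapply() can fill the column without a loop while guaranteeing exactly one string per query. To keep the example runnable offline, fake_get_ids() is a hypothetical stand-in that only mimics the IdList element returned by get_pubmed_ids(); with the real package you would call get_pubmed_ids() in its place.

```r
# Hypothetical stand-in for easyPubMed::get_pubmed_ids(); it only mimics the
# `IdList` element of the real return value so the sketch runs offline.
fake_get_ids <- function(query) {
  list(IdList = as.list(paste0(nchar(query), c("001", "002"))))
}

x <- data.frame(
  Heading = c("A", "B"),
  Query = c("rituximab OR bevacizumab", "meningitis OR headache"),
  stringsAsFactors = FALSE
)

# vapply() enforces that every query yields exactly one character string,
# which is the type guarantee that sapply() does not give.
x$PMID <- vapply(
  x$Query,
  function(q) paste(unlist(fake_get_ids(q)$IdList), collapse = ","),
  character(1),
  USE.NAMES = FALSE
)

print(x)
```

If a query returns no IDs, paste() on an empty vector gives an empty string, so the column stays a plain character vector.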

Related

Binding rows of multiple data frames into one data frame in R

I have a vector of file paths called dfs, and I want to read each of those files into a dataframe and bind them all together into one big dataframe, so I did something like this:
for (df in dfs){
clean_df <- bind_rows(as.data.table(read.delim(df, header=T, sep="|")))
return(clean_df)
}
but only the last item in the dataframe is being returned. How do I fix this?
I'm not sure about your file format, so I'll take common .csv as an example. Replace the a * i part with actually reading all the different files, instead of just generating mockup data.
files = list()
for (i in 1:10) {
a = read.csv('test.csv', header = FALSE)
a = a * i
files[[i]] = a
}
full_frame = data.frame(data.table::rbindlist(files))
The problem is that the loop overwrites clean_df on every iteration, so only the last file is left at the end. read.delim() reads one file per call, so the solution would be to use a function like lapply() to read each file in your vector of paths into a list, and then bind that list in a single step.
Here's an example:
library(tidyverse)
df <- c("file1.txt","file2.txt")
all.files <- lapply(df,function(i){read.delim(i, header=T, sep="|")})
clean_df <- bind_rows(all.files)
(clean_df)
Note that you don't need the function return(); putting clean_df in parentheses prompts R to print the variable.
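
A self-contained sketch of the read-then-bind pattern, using only base R: do.call(rbind, ...) is swapped in here for bind_rows() so that nothing needs to be installed. The two pipe-delimited files and their columns are made up for illustration and written to a temporary folder.

```r
# Create two small pipe-delimited files in a temporary folder (made-up data).
tmp_dir <- tempfile("pipe_files_")
dir.create(tmp_dir)
writeLines(c("id|value", "1|a", "2|b"), file.path(tmp_dir, "file1.txt"))
writeLines(c("id|value", "3|c", "4|d"), file.path(tmp_dir, "file2.txt"))

# Read every file into a list of data frames, one element per file.
paths <- list.files(tmp_dir, full.names = TRUE)
all_files <- lapply(paths, read.delim, header = TRUE, sep = "|")

# Bind the whole list at once instead of overwriting a variable in a loop.
clean_df <- do.call(rbind, all_files)
print(clean_df)
```

The key point is that the binding happens once, after all files have been read, rather than inside the loop.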

Apply function to all dataframes

I work with SAS files (sas7bdat = dataframes) and SAS formats (sas7bcat).
My sas7bdat files are in a "data" folder, so I can list them in the object files_names.
Here is the first part of my code, which works perfectly:
files_names <- list.files(here("data"))
nb_files <- length(files_names)
data_names <- vector("list",length=nb_files)
for (i in 1 : nb_files) {
data_names[i] <- strsplit(files_names[i], split=".sas7bdat")
}
for (i in 1:nb_files) {
assign(data_names[[i]],
read_sas(paste(here("data", files_names[i])), "formats/formats.sas7bcat")
)}
but I run into issues when trying to apply the function as_factor from the haven package (in order to apply labels to my new dataframes and get, for example, SEX = "Male" instead of SEX = 1).
I can make it work one dataframe at a time, like the code below:
df_labelled <- haven::as_factor(df, only_labelled = TRUE)
I would like to do this in a loop, but it didn't work because data_names[i] holds only a name, not a dataframe, and as_factor requires a dataframe as its first argument.
I'm quite new to R, thank you very much if someone could help me.
You might want to think about using a different data structure. For example, you can use a named list to store your dataframes, and then you can easily loop through them.
In fact, you could do everything in one loop. I'm sure there's a more efficient way to do this, but here's an example of one way without changing your code too much:
library(haven)
library(here)
files_names <- list.files(here("data"))
raw_dfs <- list()
labelled_dfs <- list()
for (file_name in files_names) {
  # strsplit() returns a list, so you would have to extract the first element:
  # df_name <- strsplit(file_name, split = ".sas7bdat", fixed = TRUE)[[1]]
  # or use something else like gsub():
  df_name <- gsub(".sas7bdat", "", file_name, fixed = TRUE)
  # Use [[ ]] so that the whole dataframe is stored as a single list element
  raw_dfs[[df_name]] <- read_sas(here("data", file_name), "formats/formats.sas7bcat")
  labelled_dfs[[df_name]] <- haven::as_factor(raw_dfs[[df_name]], only_labelled = TRUE)
}
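
Once the dataframes live in a named list, the loop can also be written with lapply(). The sketch below shows only that pattern and runs without haven: the file names, the SEX coding, and label_sex() are all hypothetical, with label_sex() standing in for haven::as_factor().

```r
# Hypothetical file names; in the real case these come from list.files().
files_names <- c("patients.sas7bdat", "visits.sas7bdat")
df_names <- tools::file_path_sans_ext(files_names)

# Made-up data standing in for what read_sas() would return.
raw_dfs <- setNames(
  list(data.frame(SEX = c(1, 2)), data.frame(SEX = c(2, 2, 1))),
  df_names
)

# Hypothetical stand-in for haven::as_factor(): turns the numeric codes
# into labelled factor levels.
label_sex <- function(df) {
  df$SEX <- factor(df$SEX, levels = c(1, 2), labels = c("Male", "Female"))
  df
}

# lapply() keeps the list names, so each result stays reachable by name.
labelled_dfs <- lapply(raw_dfs, label_sex)
print(labelled_dfs$patients)
```

With the real data, replacing label_sex with a call to haven::as_factor(df, only_labelled = TRUE) should give the same shape of result.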

Rename multiple colnames using a dictionary

I have multiple csv including multiple information (such as "age") with different spellings for the same variable. For standardizing them I plan to read each of them and turn each into a dataframe for standardizing and then writing back the csv.
Therefore, I created a dictionary that looks like this:
I am struggling to find a way to do the following in R:
Look through each of the colnames of the dataframe and compare it to every "old_name" in the dictionary dataframe.
If it finds a match, replace the "old_name" with the "new_name".
Any help would be really useful!
Edit: the issue is not only with upper and lower case. For example, in some cases it could be: "years" instead of "age".
Here is a quick and dirty approach. I wrote a function so you could just change the arguments and quickly cycle through all your files. Using the stringi package is optional -- I'm using it to check the provided .csv file name, but you could remove that if you decide it's unnecessary.
library(stringi)
dict <- data.frame(path=c('../csv1','../csv1','../csv2','../csv3','../csv3'),
                   old_name=c('Age','agE','Name','years','NamE'),
                   new_name=c('age','age','name','age','name'))
example_csv <- data.frame(Age=c(43,34,42,24),NamE=c('Michael','Jim','Dwight','Kevin'))
standardizeColumnNames <- function(df,csvFileName,dictionary){
  colHeaders <- character(ncol(df))
  for(i in 1:ncol(df)){
    index <- which(dictionary$old_name == names(df)[i])
    if(length(index) > 0){
      colHeaders[i] <- as.character(dictionary$new_name[index[1]])
    } else {
      colHeaders[i] <- names(df)[i]
    }
  }
  names(df) <- colHeaders
  if(stri_sub(csvFileName,-4) != '.csv'){
    csvFileName <- paste0(csvFileName,'.csv')
  }
  write.csv(df,csvFileName)
}
standardizeColumnNames(example_csv,'test_file_name',dict)
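
The per-column lookup can also be done in one vectorized step with match(). This is a minimal sketch, not a replacement for the function above: rename_with_dict() is a hypothetical helper, the dictionary is a cut-down version without the path column, and it returns the renamed dataframe instead of writing a file.

```r
dict <- data.frame(
  old_name = c("Age", "agE", "years", "NamE", "Name"),
  new_name = c("age", "age", "age", "name", "name"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename every column that appears in the dictionary
# and leave all other columns untouched.
rename_with_dict <- function(df, dictionary) {
  idx <- match(names(df), dictionary$old_name)  # NA where there is no entry
  found <- !is.na(idx)
  names(df)[found] <- dictionary$new_name[idx[found]]
  df
}

example <- data.frame(
  years = c(43, 34),
  NamE = c("Michael", "Jim"),
  id = 1:2,
  stringsAsFactors = FALSE
)
renamed <- rename_with_dict(example, dict)
print(names(renamed))
```

match() returns the first matching row, which mirrors the index[1] choice in the loop version.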

Using a loop to create multiple data frames in R

I have this function that returns a data frame of JSON data from the NBA stats website. The function takes in the game ID of a certain game and returns a data frame of the halftime box score for that game.
getstats<- function(game=x){
for(i in game){
url<- paste("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&
EndRange=14400&GameID=",i,"&RangeType=2&Season=2015-16&SeasonType=
Regular+Season&StartPeriod=1&StartRange=0000",sep = "")
json_data<- fromJSON(paste(readLines(url), collapse=""))
df<- data.frame(json_data$resultSets[1, "rowSet"])
names(df)<-unlist(json_data$resultSets[1,"headers"])
}
return(df)
}
So what I would like to do with this function is take a vector of several game ID's and create a separate data frame for each one. For example:
gameids<- as.character(c(0021500580:0021500593))
I would want to take the vector "gameids", and create fourteen data frames. If anyone knew how I would go about doing this it would be greatly appreciated! Thanks!
You can save your data.frames into a list by setting up the function as follows:
getstats <- function(games) {
  listofdfs <- list()  # Create a list in which you intend to save your df's.
  for (i in seq_along(games)) {  # Loop through the positions of the IDs instead of the IDs
    # You are going to use games[i] instead of i to get the ID.
    # paste0() on separate pieces keeps line breaks out of the URL string.
    url <- paste0("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10",
                  "&EndRange=14400&GameID=", games[i],
                  "&RangeType=2&Season=2015-16&SeasonType=Regular+Season",
                  "&StartPeriod=1&StartRange=0000")
    json_data <- fromJSON(paste(readLines(url), collapse = ""))
    df <- data.frame(json_data$resultSets[1, "rowSet"])
    names(df) <- unlist(json_data$resultSets[1, "headers"])
    listofdfs[[i]] <- df  # Save your dataframes into the list
  }
  return(listofdfs)  # Return the list of dataframes.
}
gameids <- as.character(c(0021500580:0021500593))
getstats(games = gameids)
Please note that I could not test this because the URLs do not seem to be working properly. I get the connection error below:
Error in file(con, "r") : cannot open the connection
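
Separately from the connection error, one thing worth checking is the ID vector itself. R parses 0021500580 as the number 21500580, so as.character() yields eight-digit strings without the leading zeros. Assuming the endpoint expects the ten-digit, zero-padded form (an assumption, not verified here), sprintf() can rebuild it:

```r
# The leading zeros are dropped as soon as R parses the literal as a number.
ids_numeric <- as.character(0021500580:0021500593)
print(ids_numeric[1])

# sprintf() pads each integer back to ten digits with leading zeros.
gameids <- sprintf("%010d", 21500580:21500593)
print(gameids[1])
```

Storing the IDs as strings from the start avoids the problem entirely.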
Adding to Abdou's answer, you could create dynamic data frames to hold results from each gameID using the assign() function
# `games` is the vector of game IDs, e.g. gameids from the question
for (i in seq_along(games)) {  # Loop through the positions of the IDs instead of the IDs
  # You are going to use games[i] instead of i to get the ID.
  # paste0() on separate pieces keeps line breaks out of the URL string.
  url <- paste0("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10",
                "&EndRange=14400&GameID=", games[i],
                "&RangeType=2&Season=2015-16&SeasonType=Regular+Season",
                "&StartPeriod=1&StartRange=0000")
  json_data <- fromJSON(paste(readLines(url), collapse = ""))
  df <- data.frame(json_data$resultSets[1, "rowSet"])
  names(df) <- unlist(json_data$resultSets[1, "headers"])
  # Create a data frame named X<i> to hold the results
  assign(paste0("X", i), df)
}
The assign function will create as many data frames as there are game IDs. They will be named X1, X2, X3, ..., Xn. Hope this helps.
Use lapply() (or sapply()) to apply a function to each element of a vector or list; lapply() always returns the results as a list. So if you have a vector of several game IDs and a function that does what you want for a single ID, you can use lapply() to get a list of dataframes (since your function returns df).
I haven't been able to test your code (I got an error with the function you provided), but something like this should work:
library(RJSONIO)
# Define the function before calling it; it handles a single game ID
getstats <- function(game) {
  url <- paste0("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&EndRange=14400&GameID=",
                game,
                "&RangeType=2&Season=2015-16&SeasonType=Regular+Season&StartPeriod=1&StartRange=0000")
  json_data <- fromJSON(paste(readLines(url), collapse = ""))
  df <- data.frame(json_data$resultSets[1, "rowSet"])
  names(df) <- unlist(json_data$resultSets[1, "headers"])
  return(df)
}
gameids <- as.character(c(0021500580:0021500593))
# One call per game ID; the result is a list of dataframes
df_list <- lapply(gameids, getstats)
df_list will contain one dataframe per ID you provided in gameids.
Just use lapply() again for additional data processing, including saving the dataframes to disk.
data.table is a nice package if you have to deal with a lot of data. In particular, rbindlist() lets you rbind all the data tables (or dataframes) contained in a list into a single one if needed (split() will do the reverse).
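
The list-to-single-table round trip can be sketched in base R, with do.call(rbind, ...) standing in for rbindlist(). Since the NBA endpoint could not be reached, fake_getstats() is a hypothetical stub returning made-up rows, and the PLAYER, PTS and GAME_ID columns are illustrative only.

```r
# Hypothetical stub for getstats(); returns made-up rows so the sketch
# runs without network access.
fake_getstats <- function(game) {
  data.frame(PLAYER = c("p1", "p2"), PTS = c(10, 12), stringsAsFactors = FALSE)
}

gameids <- c("0021500580", "0021500581")
df_list <- lapply(gameids, fake_getstats)
names(df_list) <- gameids

# Tag every row with its game ID so the rows stay identifiable after binding.
df_list <- Map(function(df, id) { df$GAME_ID <- id; df }, df_list, gameids)

# Stack the list into one data frame ...
all_games <- do.call(rbind, df_list)
rownames(all_games) <- NULL

# ... and split() goes back to one data frame per game.
by_game <- split(all_games, all_games$GAME_ID)
print(names(by_game))
```

Keeping an explicit ID column is what makes the reverse split() possible.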

Updating column values when table name is a variable

First question here, and very new to R as well.
I have a loop creating data frames according to a list of studytables. I can read all the CSVs fine, but I would like to get the field "Subject" and add the variable "study" before what is currently in the field. My trouble is with the 2nd "assign" line, I can't get R to assign the new value to "Subject".
Thanks for all your help.
study <- 'study10'
studytables <- list('ae', 'subject')
studypath <- 'C:/mypath/'
for(table in studytables) {
destinframe <- paste(table,study, sep='')
file <- paste(studypath, table, '.CSV', sep='' )
assign(destinframe, read.csv(file)) # create all dataframes
assign(destinframe['Subject'], rep('testing', nrow(get(destinframe))))
}
Using assign like that really isn't a great idea. And as you can see it doesn't work well when you try to add columns to a data.frame. It's better to add the columns before you do the assign. So replace
assign(destinframe, read.csv(file))
assign(destinframe['Subject'], rep('testing', nrow(get(destinframe))))
with
dd <- read.csv(file)
dd$Subject <- paste(study, dd$Subject)
assign(destinframe, dd)
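
If assign() can be avoided altogether, the same read, modify, store steps work with a named list. This is a sketch under assumptions: the CSV files are generated in a temporary folder with a made-up Subject column, and colClasses = "character" is used so that values such as "001" keep their leading zeros.

```r
study <- "study10"
studytables <- c("ae", "subject")

# Write made-up CSV files to a temporary folder so the sketch is runnable.
studypath <- tempfile("study_")
dir.create(studypath)
for (tbl in studytables) {
  write.csv(data.frame(Subject = c("001", "002")),
            file.path(studypath, paste0(tbl, ".CSV")),
            row.names = FALSE)
}

frames <- list()
for (tbl in studytables) {
  dd <- read.csv(file.path(studypath, paste0(tbl, ".CSV")),
                 colClasses = "character")
  # Modify the column while the data frame is still an ordinary variable.
  dd$Subject <- paste(study, dd$Subject)
  # Store it under the name that assign() would otherwise have created.
  frames[[paste0(tbl, study)]] <- dd
}
print(frames$aestudy10)
```

Each table is then available as frames[["aestudy10"]], frames[["subjectstudy10"]], and so on, and the whole set can be looped over later.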
