I'm aiming to get a list of all files in a Google Drive folder, as well as the associated metadata for those files. When I use drive_ls, it returns 3 columns {name, id, drive_resource}. drive_resource is structured like this: list(kind = "drive#file", id = "abc", ...). However, some elements of the list are not enclosed in quotation marks, and commas occasionally appear inside values rather than as separators.
Any ideas how I might properly unlist this? I can't find anything in the package that handles this.
Using the package 'googledrive', I can get a list of all the files
a <- drive_ls(path = "abc", recursive = TRUE)
The attempt below gets close, but fails to get the column names and also splits some values at the wrong place whenever a value itself contains a comma.
a$drive_resource <- vapply(a$drive_resource, paste, collapse = ",", character(1L))
abcd <- a %>% separate(drive_resource, sep = ",", into = as.character(1:30))
You can try the following approach. It's an example with only four elements of the list (the selected names are specified in the function). The function maps each list contained in each row to a tibble, so you can unnest it afterwards.
require(googledrive)
require(dplyr)
f <- function(l){
l[c("version","webContentLink","viewedByMeTime","mimeType")] %>% as_tibble()
}
dr_content <- drive_ls(path = "<path>", recursive = TRUE)
dr_content <- dr_content %>% mutate(drive_resource = purrr::map(drive_resource, f))
dr_content <- dr_content %>% tidyr::unnest(drive_resource)
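If you'd rather not hard-code the field names, a variant of f could keep every length-one atomic element of drive_resource automatically. This is just a sketch, assuming the scalar fields are the ones you want and that the nested list elements can be dropped:
f_scalars <- function(l){
  ## keep only single atomic values; nested list elements are dropped
  keep <- vapply(l, function(x) is.atomic(x) && length(x) == 1, logical(1))
  as_tibble(l[keep])
}
dr_content <- drive_ls(path = "<path>", recursive = TRUE) %>%
  mutate(drive_resource = purrr::map(drive_resource, f_scalars)) %>%
  tidyr::unnest(drive_resource)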
I am trying to find a more efficient way to import a list of data files with a somewhat awkward structure. The files are generated by a software program that looks like it was intended to be printed and viewed rather than exported and used. Each file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may differ between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed-together functions which gives me what I want, but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency in terms of time (I can be importing hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance; you can insert a head() (or tail()) call after each step (%>%) to display the current stage of the data transformation. There's a bit of cleanup with regular expressions going on in the gsub() calls: modify as desired.
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
The steps above result in a dataframe with a column 'raw' containing the cleaned-up data as strings suited for conversion into tabular data (tab-delimited, newline-separated).
From there on, we can either proceed by keeping and handling the future single tables inside the parent table as a so-called list column (Variant A), or proceed by splitting column 'raw' and mapping over it (Variant B, credits to #Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t"))
)
)
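If what's ultimately wanted is a plain named list of data frames, one per compound, Variant A can also be collapsed with summarise() and tibble::deframe(); a minimal sketch, assuming the intermediate_result from above:
list_of_tables <- intermediate_result %>%
  group_by(compound) %>%
  ## one row per compound, with the parsed table stored in a list column:
  summarise(tables = list(read.table(text = raw, header = TRUE, sep = '\t'))) %>%
  ## turn the two-column result (compound, tables) into a named list of data frames:
  tibble::deframe()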
I have a CSV file with 141 rows and several columns. I want my data to be ordered in ascending order by the first two columns, i.e. 'label' and 'index'. Following is my code:
final_data <- read.csv("./features.csv",
header = FALSE,
col.names = c('label','index', 'nr_pix', 'rows_with_1', 'cols_with_1',
'rows_with_3p', 'cols_with_3p', 'aspect_ratio',
'neigh_1', 'no_neigh_above', 'no_neigh_below',
'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz',
'no_neigh_vert', 'connected_areas', 'eyes', 'custom'))
sorted_data_by_label <- final_data[order(label),]
sorted_data_by_index <- sorted_data_by_label[order(index),]
write.table(sorted_data_by_index, file = "./features.csv",
append = FALSE, sep = ',',
row.names = FALSE)
I chose to read from a CSV and use write.table because my code needs to overwrite the CSV while keeping the column names.
Since I added a , after order(label) and order(index), the sorted data should still keep all of the rows and columns, right?
After running this code, I only get the first row out of 141 rows. Is there a way to fix this problem?
As #akrun has mentioned briefly, what you need to do is to change
sorted_data_by_label <- final_data[order(label),]
to
sorted_data_by_label <- final_data[order(final_data$label),]
and to change
sorted_data_by_index <- sorted_data_by_label[order(index),]
to
sorted_data_by_index <- sorted_data_by_label[order(sorted_data_by_label$index),]
This is because when you write label, R will try to find an object called label in the global environment, not a column within the final_data data frame.
If you intended to use the index column of final_data, you need to write final_data$index explicitly.
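A tiny illustration of that lookup behaviour, using a made-up data frame just to show the error:
d <- data.frame(label = c(2, 1), index = c(1, 2))
d[order(label), ]     # Error: object 'label' not found (unless some other 'label' happens to exist in the workspace)
d[order(d$label), ]   # works: refers explicitly to the column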
Other options
You can use with:
sorted_data_by_label <- with(final_data, final_data[order(label),])
sorted_data_by_index <- with(sorted_data_by_label, sorted_data_by_label[order(index),])
In dplyr you can use
sorted_data_by_label <- final_data %>% arrange(label)
sorted_data_by_index <- sorted_data_by_label %>% arrange(index)
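If the two-step sort is only a means to an end, both columns can also be sorted in a single call; a small sketch of either flavour:
## base R: order by label, breaking ties by index
sorted_data <- final_data[order(final_data$label, final_data$index), ]
## dplyr equivalent
sorted_data <- final_data %>% arrange(label, index)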
I am trying to extract PubMed abstracts and their titles to place them in a dataframe.
With the help of Stack Overflow members, I was able to write the code below, which works. The issue now is that the number of rows in the abstracts variable is higher than that of pmid or title, so I am unable to merge them correctly. Looking at the structure of the XML file I have, it appears some abstracts have more than one node, which is why they get extracted into more than one row.
Any suggestions on how to overcome that and have each abstract in one row, so I can merge the variables?
Here is my code:
library(XML)
library(httr)
library(glue)
library(dplyr)
####
query = 'asthma[mesh]+AND+eosinophils[mesh]+AND+2009[pdat]'
reqq = glue ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&RetMax=50&term={query}')
op = GET(reqq)
content(op)
df_op <- op %>% xml2::read_xml() %>% xml2::as_list()
pmids <- df_op$eSearchResult$IdList %>% unlist(use.names = FALSE)
reqq1 = glue("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={paste0(pmids, collapse = ',')}&rettype=abstract&retmode=xml")
op1 = GET(reqq1)
a = xmlParse(content(op1))
pmidd = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue))
title = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/ArticleTitle', xmlValue))
abstract = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText', xmlValue))
nrow(pmidd)
nrow(abstract)
Some articles come with the abstract spread over several sections (Objective, Methods, ...), some have just one entry, and some don't have an abstract at all. You'll have to take care of all these different scenarios.
XML::xmlToList() can be used to extract a list from the XML data. We can then use purrr's map*() functions to flatten the data.
library(purrr)
b <- xmlToList(a)
res <- map_dfr(b, \(x) {
abstract_l <- x$MedlineCitation$Article$Abstract
if (is.null(abstract_l))
abstract_l <- ""
tibble(
pmid = x$MedlineCitation$PMID$text,
title = x$MedlineCitation$Article$ArticleTitle,
abstract = ifelse(
length(abstract_l) > 1,
map_chr(abstract_l, \(y) y[[1]]) |> paste(collapse = "\n"),
unlist(abstract_l)
)
)
})
res$abstract
I've got a function which I'm trying to apply in a for loop that extracts a dataframe from multiple files and combines them into a single one.
From what I've read, this is what I thought would be the best way to attack it, but I get an empty list returned when I was hoping for a list of dataframes that could be combined using bind_rows.
This is the code I'm using:
combined_functions <- function(file_name) {
#combines the get_dfm_df and get corp function: get dfm tibble straight from the file name
data_frame_returned<- get_dfm_df(getcorp(file_name))
data_frame_returned
}
list_of_dataframes <- list()
file.list <- dir(pattern ="DOCX$")
for (file in file.list) {
dataframe_of_file <- combined_functions(file)
append(list_of_dataframes,dataframe_of_file)
}
bind_rows(list_of_dataframes, .id = "column_label") #https://stackoverflow.com/questions/2851327/convert-a-list-of-data-frames-into-one-data-frame
It creates an empty list, gets a list of the file names which the function combined_functions uses to create a data frame out of each file, and should, to my understanding, append this dataframe to the list. After all the files in the directory have been processed, bind_rows should combine everything into one overall dataframe, but it only returns an empty tibble. list_of_dataframes is also empty.
I've tried the solution in this answer but it didn't help:
Append a data frame to a list
https://www.dropbox.com/sh/z8vh50b370gcb1j/AAAcbnfAUOM6-y8uWn4-lUWLa?dl=0
This is a link to the raw files I am using in this case, but I think the problem is a general one.
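For reference, append() returns a new list rather than modifying list_of_dataframes in place, so in the loop above its return value is simply discarded; a minimal sketch of the assignment pattern (not tested against the linked files):
for (file in file.list) {
  ## assign the result into the list; alternatively:
  ## list_of_dataframes <- append(list_of_dataframes, list(combined_functions(file)))
  list_of_dataframes[[file]] <- combined_functions(file)
}
bind_rows(list_of_dataframes, .id = "column_label")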
Appendix:
These are the functions combined_functions refers to. They work on the individual cases, so I'm confident this is not the cause of the problem, but I've included them for completeness anyway.
rm(list = ls())
library(quanteda)
library(quanteda.corpora)
library(readtext)
library(LexisNexisTools)
library(tidyverse)
library(tools)
getcorp<- function(file_name){
#function to take the lexis word document, convert it into quanteda corpus object, returns duplicate df and date from filename in list
LNToutput <- lnt_read(file_name)
duplicates_df <- lnt_similarity(LNToutput = LNToutput,
threshold = 0.99)
duplicates_df <- duplicates_df[duplicates_df$Similarity > 0.99] #https://github.com/JBGruber/LexisNexisTools creates dataframe of duplicate articles
LNToutput <- LNToutput[!LNToutput@meta$ID %in% duplicates_df$ID_duplicate, ] #removes these duplicates from the main dataframe
corp <- lnt_convert(LNToutput, to = "quanteda") #to return multiple values from the r function, must be placed in a list
corp_date_from_file_name <- basename(file_name)
file_date <- as.Date(corp_date_from_file_name, format ="%d_%m_%y")
list_of_returns <-list(duplicates_df, corp,file_date) #list returns has duplicate df in first position, corpus in second and the file date in third
list_of_returns
}
get_dfm_df <- function(corp_list){
# takes the corp from getcorp, applies lexicoder dictionary, adds the neg_pos etc to their equivalent columns,
# calculates the percentage each category is of the total number of sentiment bearing words, adds the date specified from the file name
corpus_we_want <- corp_list[[2]]
sentiment_df <- dfm(corpus_we_want, dictionary = data_dictionary_LSD2015) %>% #applies the dictionary
convert("data.frame") %>%
cbind(docvars(corpus_we_want)) %>% #https://stackoverflow.com/questions/60419692/how-to-convert-dfm-into-dataframe-but-keeping-docvars
as_tibble() %>%
mutate(combined_negative = negative + neg_positive, combined_positive = positive + neg_negative) %>%
mutate(pos_percentage = combined_positive/(combined_positive + combined_negative ), neg_percentage =combined_negative/(combined_positive + combined_negative ) ) %>%
mutate(date = corp_list[[3]])
sentiment_df
}
I need help modifying my code to do the following tasks. I've used help from the following questions and answers thus far:
Opening all files in a folder, and applying a function
How to assign a unique ID number to each group of identical values in a column
Here are the things I hope to be able to do with my code:
I need to read in several files from a folder
I would like to use the name of each of the files in the folder to add a column. I was able to do this simply with 'mutate', but only for a single file
I would like to save the result for each file separately and also combine them into a single file
I also want to keep the code for reading the files separate from the function, so I can apply it to other projects
I'm trying to avoid using 'loop' statements
Here is a sample of my incomplete code, which gives an error:
library(tidyverse)
library(readr)
cleaningdata<- function(data){
data$Label<-gsub(".tif", "", data$Label)
data %>% select(Label:Solidity) %>%group_by(Label)%>%
mutate(view = seq_along(Label), Station="T1-1")%>%
rename(Species = Label)%>%
mutate(view = recode(view, "1" = "a","2" = "b","3" = "c"))
}
filenames <- list.files("Data", pattern="*.txt", full.names=TRUE)
ldf <- lapply(filenames, read.txt)
res <- lapply(ldf, cleaningdata)
Here is a sample of my dataset (Data Folder), and above is my work thus far.
The fs package contains the useful dir_map() function, which applies a function to each file in a path. If you need more control over the files to use, you could instead pipe a vector of filenames into purrr::map() (a rough sketch of that variant is at the end of this answer).
Your error ("Warning message: Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default") occurred because you were recoding 1, 2, 3 to a, b, c, but one of the Species had 6 rows, so 4, 5, 6 were recoded to NA. I've used letters[n] to avoid this problem.
library(tidyverse)
library(fs)
result <- dir_map(path = 'Data', fun = function(filepath) {
read_tsv(filepath) %>%
select(-1) %>%
rename(Species = Label) %>%
mutate(Species = sub('.tif$', '', Species)) %>%
group_by(Species) %>%
mutate(
View = seq_along(Species),
View = letters[View], # a, b, c, etc. instead of 1, 2, 3, etc.
Station = sub('.txt$', '', basename(filepath))
)
})
# get rows from second file
result[[2]]
# bind rows from all files
result %>% bind_rows()
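For completeness, the purrr::map() route mentioned above could look roughly like this (clean_file is just a name chosen here for a wrapper around the same steps):
library(purrr)
clean_file <- function(filepath) {
  read_tsv(filepath) %>%
    select(-1) %>%
    rename(Species = Label) %>%
    mutate(Species = sub('.tif$', '', Species)) %>%
    group_by(Species) %>%
    mutate(
      View = letters[seq_along(Species)],
      Station = sub('.txt$', '', basename(filepath))
    )
}
result <- list.files('Data', pattern = '\\.txt$', full.names = TRUE) %>%
  map(clean_file)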