I want to read gds_result.txt (the file linked in the code below) into R and get a data frame.
The data frame should have 7 columns, with these column names:
1. title
2. contents
3. Organism
4. Type
5. Platform
6. FTP download
7. DataSet
How can I do this?
You could start with this:
library(tidyverse)
library(stringr)
txt <- read_lines("https://raw.githubusercontent.com/juancholkovich/GEO_DataSet_Browser/master/gds_result.txt")

txt %>%
  tibble(value = .) %>%
  filter(value != '') %>%
  # a line starting with a record number such as "1. " marks a new entry
  mutate(new_group = as.numeric(str_detect(value, "^(\\d*?\\. )")),
         # a running count of entry starts gives each entry a group id
         group = cumsum(new_group),
         # classify each line by its leading keyword
         keyword = str_match(value, "^Organism|^Project|^Type|^FTP|^Sample|^Series|^Source"),
         keyword = ifelse(str_detect(tolower(value), "^dataset|^series|^sample|^platform|related platforms"), "Dataset", keyword),
         keyword = ifelse(str_detect(tolower(value), "accession"), "Accession", keyword),
         keyword = ifelse(new_group == 1, "Name", keyword),
         keyword = ifelse(is.na(keyword), "Comment", keyword)) %>%
  select(-new_group) %>%
  # one row per entry, one column per keyword
  spread(key = keyword, value = value)
There's probably a lot more cleaning to be done, but at least you get some structure to your data.
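In case the grouping step looks opaque: every line starting with a record number like "1. " flags the beginning of a new entry, and cumsum() turns those flags into a group id shared by all lines of the same entry. A toy illustration with made-up lines (not the real file):

x <- c("1. First title", "Organism: Homo sapiens", "2. Second title", "Type: Expression profiling")
cumsum(str_detect(x, "^(\\d*?\\. )"))
# 1 1 2 2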
I'm currently using the group_by() and summarise() functions to get the sum of my columns. Is it possible to save that information to another data frame somehow? Maybe even create a csv file with that information. Thanks
workday %>%
  group_by(Date) %>%
  mutate_if(is.character, as.numeric) %>%
  summarise(across(Axis1:New_Sitting, sum))
Store the pipe result in a new object, say a:
a <- workday %>% group_by(Date) # ... other ops on workday
To save it to a file there are several options, including the base write.csv:
write.csv(a, "Path to file")
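Putting the two steps together, a minimal sketch built from the pipeline in your question (the output filename is just an example):

a <- workday %>%
  group_by(Date) %>%
  mutate_if(is.character, as.numeric) %>%
  summarise(across(Axis1:New_Sitting, sum))

# row.names = FALSE drops the row-number column from the csv
write.csv(a, "workday_summary.csv", row.names = FALSE)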
I'm new to R and have a large list of 30 elements, each of which is a dataframe with a few hundred rows and around 20 columns (this varies depending on the dataframe). Each dataframe is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I go through the whole list and filter only those dataframes that don't have specific text in their filename, AND also add a unique id column to those filtered dataframes (the id value would be the first three characters of the filename)? For example, all the elements/dataframes/files in the list whose names include "XYZ QWERTY" won't be filtered and don't need a unique id. I had this pseudo-style code:
for (i in 1:length(list_of_dataframes)) {
  if (list_of_dataframes[[i]] contains "this text") {
    # don't filter
  } else {
    list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule)
    # AND add a unique id: first three characters of the dataframe's name
  }
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn. (As a bonus, if you have any good resources/websites for learning to automate and do similar things with R, I would be more than glad to get some recommendations! :-))
EDIT:
The code I tried for the filtering part was:
for (i in 1:length(tbl)) {
  if (!(str_detect(tbl[[i]], "OLD"))) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, I got the error messages "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames filtered by a condition on the filename (here: the path does not contain the substring OLD). Just remove the ! to include only the old experiments instead. A new column id is added containing the file path:
library(tidyverse)

list.files("files")

# read all csv files into a list of data frames named by their file paths
paths <- list.files("files", full.names = TRUE)
names(paths) <- paths
list_of_dataframes <- paths %>% map(read_csv)

list_of_dataframes %>%
  # turn the named list into a two-column tibble: name (path) and value (data frame)
  enframe() %>%
  # keep only the files whose path does not contain "OLD"
  filter(! name %>% str_detect("OLD")) %>%
  # add the file path as an id column to each remaining data frame
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)
  })) %>%
  # return a plain list of data frames again
  pull(value)
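If you want the id to be just the first three characters of the file name, as asked, you could adapt the mutate() step, for example (a sketch; basename() strips the directory and str_sub() takes the leading characters):

list_of_dataframes %>%
  enframe() %>%
  filter(! name %>% str_detect("OLD")) %>%
  mutate(value = name %>% map2(value, ~ {
    # id: first three characters of the file name, path stripped
    .y %>% mutate(id = str_sub(basename(.x), 1, 3))
  })) %>%
  pull(value)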
A good resource to start with is the free book R for Data Science.
This is a much simpler approach, without a list, that gets you one big combined table of the files matching the same condition:
list.files("files", full.names = TRUE) %>%
  tibble(id = .) %>%
  # discard old experiments
  filter(! id %>% str_detect("OLD")) %>%
  # read the csv table for every matching file
  mutate(data = id %>% map(read_csv)) %>%
  # combine the tables into one big one
  unnest(data)
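As an aside, the errors from your loop come from applying str_detect() to the whole data frame tbl[[i]] rather than to its name. Testing the name fixes it (a sketch, assuming tbl is a list named after the files, as in your EDIT):

for (i in seq_along(tbl)) {
  if (!str_detect(names(tbl)[i], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}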
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should be grouped, and their corresponding values from column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv("hydrolase_sorted.txt", header = FALSE, sep = "\t")
new <- df %>% select(V1, V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows. The approach in your solution is right, but one more step is required:
library(dplyr)

df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  summarise(x = paste(V1, collapse = " "))
What we did here is simply concatenating the strings by V3; summarise() collapses each group to a single row. Before running the code above you should preprocess and manually fix some improper rows (the TIM, Dannase, and DLH rows). To do that you can use the Text to Columns feature in Excel.
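If you need the domains in separate columns, as in your desired output, rather than in one space-separated string, you could split the collapsed column afterwards with tidyr (a sketch; the domain1..domain5 column names and the assumption of at most five domains per group are mine):

library(tidyr)

df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  summarise(x = paste(V1, collapse = " ")) %>%
  # fill = "right" pads groups with fewer domains with NA
  separate(x, into = paste0("domain", 1:5), sep = " ", fill = "right")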
A question regarding this data extraction I did. I would like to create a bar chart with the data, but unfortunately I am unable to convert the extracted characters to numbers inside R. If I edit the file in a text editor there's no problem at all, but I'd like to do the whole process in R. Here is the code:
install.packages("rvest")
library(rvest)

url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table()

str(corporatetax)
As a result, corporatetax holds a data.frame with 3 variables, all of them character. My question, which I've not been able to resolve, is how to convert the second and third columns to numbers so I can create a bar chart? I've tried with sapply() and dplyr but did not find a correct way to do it.
Thanks!
You might try to clean up the table like this:
library(rvest)
library(stringr)
library(dplyr)

url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  # your xpath defines a single table, so you can use html_node() instead of html_nodes()
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table() %>%
  as_tibble() %>%
  setNames(c("country", "corporate_tax", "combined_tax"))

corporatetax %>%
  # strip the percent sign, then convert to a numeric fraction
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", "")) / 100,
         combined_tax = as.numeric(str_replace(combined_tax, "%", "")) / 100)
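Alternatively, readr::parse_number() strips the percent sign for you (a sketch, assuming readr is installed; note that parse_number() keeps only the first number it finds, so a cell holding a range would lose its upper bound):

library(readr)

corporatetax %>%
  mutate(corporate_tax = parse_number(corporate_tax) / 100,
         combined_tax = parse_number(combined_tax) / 100)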
library(dplyr)
library(tidyr)
library(forcats)
library(readxl)
Using the gss_cat dataset from the forcats package, I created a grouped and summarised dataframe, then split the data by the marital and race variables. (If there's a better tidyverse method than using lapply here, that would be a great bonus.)
Survey <- gss_cat %>%
  select(marital, race, relig, denom) %>%
  group_by(marital, race, relig, denom) %>%
  summarise(Count = n()) %>%
  mutate(Perc = paste0(round(100 * Count / sum(Count), 2), "%")) %>%
  drop_na()

Survey %>%
  split(.$marital) %>%
  lapply(function(x) split(x, x$race))
However, I'm stuck trying to export the final list to an Excel file with readxl. More specifically, I want to export selected tables in the list to separate Excel tabs, for example divided by race, so that each race category is on a different tab in the spreadsheet.
First, readxl does not write Excel files. See the thread for issue 231 on the readxl GitHub page. It looks like the writexl package (not [yet] part of the tidyverse) is recommended instead.
Second, split() can take a list as an argument.
list_of_dfs <- survey %>% split(list(.$marital, .$race), sep='_')
Putting it together, assuming you've installed writexl:
require(tidyverse)
require(forcats)
require(writexl)

survey <- gss_cat %>%
  select(marital, race, relig, denom) %>%
  group_by(marital, race, relig, denom) %>%
  summarise(Count = n()) %>%
  mutate(Perc = paste0(round(100 * Count / sum(Count), 2), "%")) %>%
  drop_na()

list_of_dfs <- survey %>% split(list(.$marital, .$race), sep = '_')
write_xlsx(list_of_dfs, 'out.xlsx')
Note that there are no checks on the suitability of the names of the worksheets that write_xlsx tries to create. If your data contained illegal characters in the marital or race column, or if you used an illegal character in the sep argument to split(), then the operation would fail. (Try using sep = ':' if you don't believe me.)
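If you want to be defensive about it, you could sanitise the list names before writing (a sketch, assuming stringr is loaded; the substitutions reflect Excel's sheet-name rules: none of : \ / ? * [ ] and at most 31 characters):

library(stringr)

# replace forbidden characters and truncate to Excel's 31-character limit
names(list_of_dfs) <- names(list_of_dfs) %>%
  str_replace_all("[\\[\\]\\*\\?/\\\\:]", "_") %>%
  str_trunc(31, ellipsis = "")
write_xlsx(list_of_dfs, 'out.xlsx')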