How to convert dataframe to list of lists? - r

I have a dataset that looks like a sentiment dictionary.
Say the data is like following:
sentiment <- read.table(header = TRUE, text = '
term score
awesome 3
good 0
interesting 2
power 1
bad -1
horrible -2
worst -3' )
I want to convert this data into a list of lists by score, something like the following (list names can be different):
$pos3
[1] awesome
$pos2
[1] interesting
$pos1
[1] power
$zero
[1] good
$neg1
[1] bad
$neg2
[1] horrible
$neg3
[1] worst
Other solutions suggested using plyr's dlply:
dlply(sentiment, .(score), c)
But this produces lists that still contain the scores, which I don't want. So I ended up using clunky code like the following:
list(neg3 = sentiment$term[sentiment$score == -3],
     neg2 = sentiment$term[sentiment$score == -2],
     neg1 = sentiment$term[sentiment$score == -1],
     zero = sentiment$term[sentiment$score == 0],
     pos1 = sentiment$term[sentiment$score == 1],
     pos2 = sentiment$term[sentiment$score == 2],
     pos3 = sentiment$term[sentiment$score == 3])
But I wonder if there is a better way to do this.
How can I convert a dataframe to a list of lists without producing lists that I don't want?
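One compact base-R possibility (a sketch; the pos/neg names below are my own relabelling, chosen to match the desired output) is split(), which groups one vector by another in a single call:
# split() groups terms by score; element names default to "-3" ... "3",
# so relabel them to the pos/neg scheme shown above.
by_score <- split(as.character(sentiment$term), sentiment$score)
names(by_score) <- c("neg3", "neg2", "neg1", "zero", "pos1", "pos2", "pos3")
by_score$pos3
# [1] "awesome"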

Related

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets containing mentions that start with '#'. I need to extract all of them and save the mentions in each particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I think it would be best if you used an "AsIs" column (via I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per tweet, you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
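For instance, on the sample lis vector defined above, this yields (a hedged re-run; use collapse = " " instead for the space-separated form the question asks for):
tweets <- str_extract_all(lis, "#\\w+")
sapply(tweets, function(x) paste(x, collapse = ", "))[1:2]
# [1] "#DineshKarthik, #KKRiders"
# [2] "#IPL, #prabhakaran285, #ChennaiIPL, #cutenessoverload, #lineofduty"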
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can appear in tweets, as can URLs with #'s in them (and not just the silly URLs with a username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.
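A hedged base-R sketch of applying that pattern (gregexec() needs R >= 4.1; the tag body is capture group 2, since group 1 only holds the preceding context character):
pat <- "(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b"
m <- regmatches(lis, gregexec(pat, lis))
# row 1 of each matrix is the full match, row 3 is capture group 2 (the tag)
tags <- lapply(m, function(x) if (length(x)) paste0("#", x[3, ]) else character(0))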

R: Applying factor values from one column to another

I am trying to process municipal information in R, and it seems that factors (to be exact, factor()) are the best way to achieve my goal. I am only starting to get the hang of R, so I imagine my problem is probably very simple.
I have the following example dataframe to share (a tiny portion of Finnish municipalities):
municipality<-c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki",
"Kerava")
region<-c("Uusimaa","Pohjois-Pohjanmaa","Pirkanmaa","Pohjois-Karjala","Etelä-Pohjanmaa","Uusimaa")
myData<-cbind(municipality,region)
myData<-as.data.frame(myData)
By default R converts my character columns into factors, which can be verified with str(myData). Now to the part where my beginner-to-novice-level R skills end: I can't seem to find a way to apply the factor levels from column region to column municipality.
Let me demonstrate. Instead of having the original result
as.numeric(factor(myData$municipality))
[1] 1 4 6 2 5 3
I would like to get this, the factors from myData$region applied to myData$municipality.
as.numeric(factor(myData$municipality))
[1] 5 4 2 3 1 5
I welcome any help with open arms. Thank you.
To better understand the use of factor in R, have a look here.
If you want to add factor levels, you have to do something like this in your dataframe:
> levels(myData$region)
[1] "Etelä-Pohjanmaa" "Pirkanmaa" "Pohjois-Karjala" "Pohjois-Pohjanmaa" "Uusimaa"
> levels(myData$municipality)
[1] "Espoo" "Joensuu" "Kerava" "Oulu" "Seinäjoki" "Tampere"
> levels(myData$municipality) <- c(levels(myData$municipality), levels(myData$region))
> levels(myData$municipality)
 [1] "Espoo"             "Joensuu"           "Kerava"            "Oulu"              "Seinäjoki"
 [6] "Tampere"           "Etelä-Pohjanmaa"   "Pirkanmaa"         "Pohjois-Karjala"   "Pohjois-Pohjanmaa"
[11] "Uusimaa"

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
and I simply want to find their functions; it has been suggested that I use GO analysis tools.
I am not sure whether that is the correct way to do so.
Here is my solution:
library(org.Hs.eg.db)  # Bioconductor annotation package providing org.Hs.egGO
x <- org.Hs.egGO
# Get the GO IDs mapped to each Entrez gene identifier
xx <- as.list(x[gene_list$ENTREZID])
So I've got a list keyed by Entrez ID, with several GO terms assigned to each gene.
for example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is:
How can I find the function of each of these genes in a simpler way, and am I doing this right so far? Ultimately, I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I get what you are aiming at here.
BTW, for bioinformatics-related topics you can also have a look at Biostars, which has the same purpose as SO but for bioinformatics.
If you just want a list of each function related to a gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need internet access, though, to do the query.
Bioconductor proposes packages for bioinformatics studies and these packages come generally along with good vignettes which get you through the different steps of the analysis (and even highlight how you should design your data or which would be then some of the pitfalls).
In your case, this comes directly from the biomaRt vignette - task 2 in particular.
Note: there are slightly quicker ways than the one I report below:
# load the library
library("biomaRt")
# I prefer Ensembl, so that is the one I will query, but you can
# query other marts; try out: listMarts()
ensembl <- useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# if you want other model organisms, have a look at:
# listDatasets(ensembl)
You need to create your query (your list of Entrez IDs). To see which filters you can query:
filters = listFilters(ensembl)
And then you want to retrieve attributes: your GO number and description. To see the list of available attributes:
attributes = listAttributes(ensembl)
For you, the query would look something like this:
goids <- getBM(
  # you want entrezgene so you know which gene is which; 'go_id' is the GO ID
  # and 'name_1006' is actually the identifier of the 'GO term name' attribute
  attributes = c('entrezgene', 'go_id', 'name_1006'),
  filters = 'entrezgene',
  values = gene_list$ENTREZID,
  mart = ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend it for anything other than reporting purposes).
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  tempo <- goids[goids$entrezgene == x, ]
  data.frame(ENTREZGENE = x,
             Go.ID = paste(tempo$go_id, collapse = ' ; '),
             GO.term = paste(tempo$name_1006, collapse = ' ; '))
}))
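Since the aim was a function/GO column on gene_list itself, a hedged final step (a sketch, assuming Go.collapsed as built above) is a merge on the Entrez IDs:
# attach the collapsed GO columns to the original gene list
gene_list_annotated <- merge(gene_list, Go.collapsed,
                             by.x = 'ENTREZID', by.y = 'ENTREZGENE',
                             all.x = TRUE)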
Edit:
If you want to query a past version of the ensembl database:
ens82<-useMart(host='sep2015.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
and then the query would be:
goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',values=gene_list$ENTREZID,
mart=ens82)
However, if what you had in mind was a GO enrichment analysis, your list of genes is too short.

applying alternate to for loop in R

I am looking for a more efficient alternative to a for loop in R,
where data_papers is:
data_papers <- c(1, 3, 47276, 77012, 77012, 79468, ....)
paper_author:
paper_id author_id
1 1 521630
2 1 972575
3 1 1528710
4 1 1611750
5 2 1682088
I need to find the authors that are present in paper_author for each paper in data_papers. There are around 350,000 papers in data_papers and around 2,100,000 papers in paper_author.
So my output would be a list of author_ids for the paper_ids in data_papers.
authors:
[[1]]
[1] 521630 972575 1528710 1611710
[[2]]
[1] 826 338038 788465 1256860 1671245 2164912
[[3]]
[1] 366653 1570981 1603466
The simplest way to do this would be
authors <- vector("list", length(data_papers))
for (i in seq_along(data_papers)) {
  # [[i]] (not [i]) assigns the vector into the list element
  authors[[i]] <- paper_author$author_id[paper_author$paper_id %in% data_papers[i]]
}
But the computation time is very high.
The other alternative is something like the following, taken from advice on efficient programming in R:
i <- 1:length(data_papers)
authors[i] <- as.data.frame(paper_author$author_id[which(paper_author$paper_id %in% data_papers[i])])
But I am not able to make this work.
How could this be done? Thanks.
with(paper_author, split(author_id,paper_id))
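A hedged follow-up to that one-liner: split() returns one element per distinct paper_id, so to keep only the papers in data_papers, subset the result by name:
authors <- with(paper_author, split(author_id, paper_id))[as.character(data_papers)]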
Or you could use R's merge function?
merge(data_papers, paper_author, by=1)
Why are you not able to use the second solution you mentioned? Information on why would be useful.
In any case, what you want to do is to join two tables (data_papers and paper_authors). Doing it with pure nested loops, as your sample code does in either R for loops or the C for loops underlying vector operations, is pretty inefficient. You could use some kind of index data structure, based on e.g. the hash package, but it's a lot of work.
Instead, just use a database. They're built for this sort of thing. sqldf even lets you embed one into R.
install.packages("sqldf")
require(sqldf)
#you probably want to dig into the indexing options available here as well
combined <- sqldf("select distinct author_id from paper_author pa inner join data_papers dp on dp.paper_id = pa.paper_id where dp.paper_id = 1234;")

Import DAT File - Parsing Issue

I have a tab-delimited DAT file that I want to read into R. When I import the data using read.delim, my data frame has the correct number of columns, but has more rows than expected.
My datafile represents responses to a survey. After digging a little deeper, it appears that R is creating a new record when there is a "." in a column that represents an open-ended response. It appears that there are times when a respondent may have hit "enter" to add a new line.
Is there a way to get around this? I read the help, but I am not sure how I can tell R to ignore this character in the character response.
Here is an example response that parses incorrectly. This is one response, but you can see that there are returns that put this onto multiple lines when parsed by R.
possible ask for size before giving free tshirt.
Also maybe have the interview in conference rooms instead of tight offices. I felt very cramped.
I would of loved to have gone, but just had to make a choices and had more options then I expected.
I am analyzing the data with SPSS, where the data were brought in fine; however, I need to use R for more advanced modeling.
Any help will be greatly appreciated. Thanks in advance.
There is an 'na.strings' argument. You don't offer any test case, but perhaps you can do this:
read.delim(file="myfil.DAT", na.strings=".")
I think it would be good if you could produce an edit to your question that better demonstrated the problem. I cannot create an error with a simple effort:
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE)
V1 V2 V3
1 a b .
2 c d e
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE, na.strings=".")
V1 V2 V3
1 a b <NA>
2 c d e
(After the clarification that the above comments are not particularly relevant.) This will bring in a field that has a linefeed in it, but it requires that the "field" be quoted in the original file:
> scan(file=textConnection("'a\nb'\nx\t.\nc\td\te\n"), what=list("","","") )
Read 2 records
[[1]]
[1] "a\nb" "c"
[[2]]
[1] "x" "d"
[[3]]
[1] "." "e"
