I have an input file with different food types:
Corn Fiber 17
Beans Protein 12
Milk Protein 15
Butter Fat 201
Eggs Fat 2
Bread Fiber 12
Eggs Cholesterol 4
Eggs Protein 8
Milk Fat 5
(Don't take these too seriously; I'm no nutrition expert.) Anyway, I have the following script that reads the input file and puts the values into a table:
file = io.open("food.txt")
foods = {}
nutritions = {}
for line in file:lines()
do
local f, n, v = line:match("(%a+) (%a+) (%d+)")
nutritions[n] = {value = v}
--foods[f] = {} Not sure how to implement here
end
file:close()
(It's a little messy right now)
Notice also that different foods can have different nutrients. For example, eggs have both protein and fat. I need a way to let the program know which value I am trying to look up. For example:
> print(foods.Eggs.Fat)
2
> print(foods.Eggs.Protein)
8
I believe I need two tables, as shown above. The foods table will contain a table of nutritions. This way, I can have multiple food types with multiple different nutrient facts. However, I am not sure how to handle a table of tables. How can I implement this within my program?
The straightforward way is to test whether foods[f] exists, to decide whether to create a new table or add an element to the existing one.
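-- foods[food][nutrient] = value
local file = assert(io.open("food.txt"))
foods = {}
for line in file:lines() do
  local f, n, v = line:match("(%a+) (%a+) (%d+)")
  if f then -- skip lines that do not match the pattern
    if foods[f] then
      foods[f][n] = tonumber(v) -- add to the existing inner table
    else
      foods[f] = {[n] = tonumber(v)} -- first nutrient seen for this food
    end
  end
end
file:close()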
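For comparison, here is the same "table of tables" idea as a minimal Python sketch (a different language from the question, used here only for illustration). The function name parse_foods and the sample lines are assumptions made for the example; the inner dictionary is created the first time a food is seen, exactly like the existence test above.

```python
def parse_foods(lines):
    """Build {food: {nutrient: value}} from 'Food Nutrient Value' lines."""
    foods = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        food, nutrient, value = parts
        # setdefault returns the existing inner dict, or stores and
        # returns a new empty one the first time this food appears
        foods.setdefault(food, {})[nutrient] = int(value)
    return foods


sample = ["Eggs Fat 2", "Eggs Cholesterol 4", "Eggs Protein 8", "Milk Fat 5"]
foods = parse_foods(sample)
print(foods["Eggs"]["Fat"])      # 2
print(foods["Eggs"]["Protein"])  # 8
```

The lookup foods["Eggs"]["Fat"] plays the role of foods.Eggs.Fat in Lua.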
I am analyzing a data set which is feedback from teachers. Each line in the data frame is a teacher, each of their answers is a variable, however I've run into a problem inputting the year level for each teacher as a lot of the teachers teach multiple grades.
eg:
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4
How can I enter this data into an Excel sheet and then into R and analyse it usefully? I've never before dealt with a variable that contains multiple options on the same row.
Suppose you already have this data in R in an object called teacher_data. I will show you the way to deal with such responses that I have seen most commonly employed: you create additional columns so that each answer gets its own cell via the convenient tidyr function separate().
library(tidyr)
separate(teacher_data, col = "Year", into = paste0("Year", 1:2), sep = "/")
Here's the result:
Teacher Year1 Year2
1 a 1 <NA>
2 b 3 <NA>
3 c 1 2
4 d 7 <NA>
5 e 3 4
How you then use those columns depends on what sort of question you're trying to answer with the data. This part of your question is probably best asked at the sister site Cross Validated (Stack Exchange for statistics).
As far as Excel goes, I would not even deal with Excel as an intermediate step; it's just unnecessary. If you write the data out when you're done into a CSV, Excel can read CSVs just fine:
write.csv(teacher_data, file = "teacher_data.csv", row.names = FALSE)
Also, just so you know, I put your data into R via the following:
teacher_data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4")
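To make explicit what the split into Year1 and Year2 does, here is a rough Python sketch of the same reshaping. It is not how tidyr implements separate(); the helper name split_years and its parameters are made up for illustration, and missing pieces are filled with None in the way the R result shows NA.

```python
def split_years(rows, max_parts=2, sep="/"):
    """Split each 'Year' string into max_parts columns, padding with None."""
    result = []
    for teacher, year in rows:
        # Assumption: pieces beyond max_parts are dropped by this sketch
        parts = year.split(sep)[:max_parts]
        parts += [None] * (max_parts - len(parts))
        result.append((teacher, *parts))
    return result


teacher_rows = [("a", "1"), ("b", "3"), ("c", "1/2"), ("d", "7"), ("e", "3/4")]
wide = split_years(teacher_rows)
for row in wide:
    print(row)
```

Each teacher ends up with one cell per year level, which is the layout most analyses expect.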
I have a number of trees; when printed, they run to 7 pages. I've had to rebalance the data and need to look at the branches with the highest frequency to see if they make sense, because I need to identify a cancellation rate for different clusters.
Given that the output is so long, what I need is a list of the biggest branches, so that I can validate those rather than go through 210 branches manually. I will have lots of trees, so I need to automate this to look at the important results.
Example code to use:
library(CHAID)
updatecars<-mtcars
updatecars$cyl<-as.factor(updatecars$cyl)
updatecars$vs<-as.factor(updatecars$vs)
updatecars$am<-as.factor(updatecars$am)
updatecars$gear<-as.factor(updatecars$gear)
carsChaid<-chaid(am~ cyl+vs+gear, data=updatecars)
plot(carsChaid)
carsChaid
When you print this data, you see n=15 for the first group. I need a table where I can sort on this value.
What I need is a decision tree table with the variable values and the number within each group from the tree. This is not exactly the same as the answer to "Walk a tree", as that doesn't give the number within each group, but I think it points in the right direction.
Can someone help,
Thanks,
James
I'm sure there is a better way to do this, but this works. Corrections and suggested improvements are welcome.
The particular trouble I had was creating the list of all combinations. When expand.grid goes over 3 factors, it stops working, so I had to build a loop on top of it to create the complete list.
All_canx_rates<-function(Var1,Var2,Var3,Var4,Var5,nametree){
df1<-data.frame("CanxRate"=0,"Num_Canx"=0,"Num_Cust"=0)
pars<-as.list(match.call()[-1])
a<-eval(pars$nametree)[,as.character(pars$Var1)]
b<-eval(pars$nametree)[,as.character(pars$Var2)]
c<-eval(pars$nametree)[,as.character(pars$Var3)]
d<-eval(pars$nametree)[,as.character(pars$Var4)]
e<-eval(pars$nametree)[,as.character(pars$Var5)]
allcombos<-expand.grid(levels(a),levels(b),levels(c))
clean<- allcombos
allcombos$Var4<-levels(d)[1]
for (i in 2:length(levels(d))) {
clean$Var4<-levels(d)[i]
allcombos<-rbind(allcombos,clean)
}
#define a forloop
for (i in 1:nrow(allcombos)) {
#define values
f1<-allcombos[i,1]
f2<-allcombos[i,2]
f3<-allcombos[i,3]
f4<-allcombos[i,4]
y5<-nrow(nametree[(a %in% f1 & b %in% f2 & c %in% f3 & d %in% f4 &
e =='1'),])
y4<-nrow(nametree[(a %in% f1 & b %in% f2 & c %in% f3 & d %in% f4),])
df2<-data.frame("CanxRate"=y5/y4,"Num_Canx"=y5,"Num_Cust"=y4)
df1<-rbind(df1, df2)
}
#endforloop
#make the dataframe available for global viewing
df1<-df1[-1,]
output<<-cbind(allcombos,df1)
}
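A different way to get the same kind of table, sketched here in Python rather than R, is to count only the combinations that actually occur instead of enumerating every possible one first. This is a swapped-in technique, not the function above; the name cancellation_rates, the field name "cancelled" and the sample records are all hypothetical.

```python
from collections import defaultdict


def cancellation_rates(records, keys, flag="cancelled"):
    """Count customers and cancellations per combination of key values."""
    totals = defaultdict(int)
    cancels = defaultdict(int)
    for rec in records:
        combo = tuple(rec[k] for k in keys)
        totals[combo] += 1
        if rec[flag] == 1:
            cancels[combo] += 1
    rows = [
        {
            "combo": combo,
            "num_cust": n,
            "num_canx": cancels[combo],
            "canx_rate": cancels[combo] / n,
        }
        for combo, n in totals.items()
    ]
    # Biggest groups first, so the important branches are at the top
    rows.sort(key=lambda r: r["num_cust"], reverse=True)
    return rows


# Hypothetical sample data, only to show the shape of the result
records = [
    {"cyl": "4", "gear": "4", "cancelled": 1},
    {"cyl": "4", "gear": "4", "cancelled": 0},
    {"cyl": "8", "gear": "3", "cancelled": 0},
    {"cyl": "4", "gear": "4", "cancelled": 1},
]
rates = cancellation_rates(records, ["cyl", "gear"])
for row in rates:
    print(row)
```

Because only observed combinations are counted, no group has zero customers and the rate never divides by zero.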
You can use data.tree to do further operations on a party object, such as sorting, walking the tree, custom plotting, etc. The latest release, v0.3.7 on GitHub, has a conversion from party class objects:
devtools::install_github("gluc/data.tree@v0.3.7")
library(data.tree)
tree <- as.Node(carsChaid)
tree$fieldsAll
The last command shows the names of the converted fields of the party class:
[1] "data" "fitted" "nodeinfo" "partyinfo" "split" "splitlevels" "splitname" "terms" "splitLevel"
You can sort by a function, e.g. the rows of the data on each node:
tree$Sort(attribute = function(node) nrow(node$data), decreasing = TRUE)
print(tree,
"splitname",
count = function(node) nrow(node$data),
"splitLevel")
This prints, for instance, like so:
levelName splitname count splitLevel
1 1 gear 32
2 ¦--3 17 4, 5
3 °--2 15 3
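The underlying idea (walk the tree, record a count per node, then sort) can be sketched independently of data.tree. The Python below is only an illustration: the nested-dict layout and the labels are assumptions, and the counts 32, 17 and 15 are taken from the printout above.

```python
def flatten(node, path=()):
    """Return (path, count) rows for a node and all of its descendants."""
    here = path + (node["label"],)
    rows = [(" > ".join(here), node["count"])]
    for child in node.get("children", []):
        rows.extend(flatten(child, here))
    return rows


# Hypothetical representation of the small tree printed above
tree = {
    "label": "gear",
    "count": 32,
    "children": [
        {"label": "gear in 3", "count": 15},
        {"label": "gear in 4, 5", "count": 17},
    ],
}

# Sort the flat table so the largest groups come first
table = sorted(flatten(tree), key=lambda row: row[1], reverse=True)
for path, count in table:
    print(count, path)
```

Once the tree is a flat table, picking the biggest branches is an ordinary sort, which is easy to repeat for many trees.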
I am very new to GO analysis and I am a bit confused about how to apply it to my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
I simply want to find their functions, and it was suggested that I use GO analysis tools.
I am not sure whether this is the correct way to do so.
Here is my attempt:
library(org.Hs.eg.db)
x <- org.Hs.egGO
# Get the entrez gene identifiers that are mapped to a GO ID
xx<- as.list(x[gene_list$ENTREZID])
So I've got a list, keyed by Entrez ID, in which each gene is assigned several GO terms.
For example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is:
How can I find the function of each of these genes in a simpler way, and am I doing this right?
I ask because I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming for here.
BTW, for bioinformatics-related topics, you can also have a look at Biostar, which has the same purpose as SO but for bioinformatics.
If you just want a list of the functions related to each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need an internet connection to run the query, though.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that take you through the different steps of the analysis (and even highlight how you should design your data, or what some of the pitfalls are).
In your case, this comes directly from the biomaRt vignette, task 2 in particular.
Note: there are slightly quicker ways than the one I report below.
# load the library
library("biomaRt")
# I prefer ensembl, so that is the one I will query, but you can
# query other bases, try out: listMarts()
ensembl=useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
# if you want other model organisms have a look at:
#listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query:
filters = listFilters(ensembl)
Then you want to retrieve attributes: your GO ID and its description. To see the list of available attributes:
attributes = listAttributes(ensembl)
For you, the query would look something like this:
goids = getBM(
#you want entrezgene so you know which is what, the GO ID and
# name_1006 is actually the identifier of 'Go term name'
attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',
values=gene_list$ENTREZID,
mart=ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend it for anything other than reporting purposes).
Go.collapsed<-Reduce(rbind,lapply(gene_list$ENTREZID,function(x){
# rows of the query result that belong to this gene
tempo<-goids[goids$entrezgene==x,]
return(
data.frame('ENTREZGENE'= x,
'Go.ID'= paste(tempo$go_id,collapse=' ; '),
'GO.term'=paste(tempo$name_1006,collapse=' ; '))
)
}))
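The collapsing step is just a group-by followed by a string join. As a language-neutral illustration, here is a small Python sketch; the function name collapse_go is made up, the GO IDs are the two shown for gene 60312 in the question, and the term names are placeholders rather than real GO descriptions.

```python
def collapse_go(rows, sep=" ; "):
    """Collapse (gene, go_id, term) rows into one record per gene."""
    grouped = {}
    for gene, go_id, term in rows:
        # One pair of lists per gene, created on first sight of the gene
        ids, terms = grouped.setdefault(gene, ([], []))
        ids.append(go_id)
        terms.append(term)
    return [
        {"ENTREZGENE": gene, "Go.ID": sep.join(ids), "GO.term": sep.join(terms)}
        for gene, (ids, terms) in grouped.items()
    ]


# Placeholder term names; only the IDs come from the question
rows = [
    ("60312", "GO:0009966", "term A"),
    ("60312", "GO:0051493", "term B"),
    ("6790", "GO:0000000", "term C"),
]
collapsed = collapse_go(rows)
for record in collapsed:
    print(record)
```

As noted above, the collapsed form is convenient for reports, but the long one-row-per-term form is easier to work with for further analysis.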
Edit:
If you want to query a past version of the ensembl database:
ens82<-useMart(host='sep2015.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
and then the query would be:
goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',values=gene_list$ENTREZID,
mart=ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.
I have a CSV file from which I want to extract only the timestamps of the sentences that contain "toward", plus the fruit name in each of those sentences. How can I do this in R (or, if there's a faster way, what is it)?
rosbagTimestamp,data
1438293900729698553,robot is in motion toward [strawberry]
1438293900730571638,Found a plan for avocado in 1.36400008202 seconds
1438293900731434815,current probability is greater than EXECUTION_THRESHOLD
1438293900731554567,ready to execute am original plan of len = 33
1438293900731586463,len of sub plan 1 = 24
1438293900731633713,len of sub plan 2 = 9
1438293900732910799,put in an execution request; now updating the dict
1438293900732949576,current_prediciton_item = avocado
1438293900733070339,current_item_probability = 0.880086981207
1438293901677787230,current probability is greater than PLANNING_THRESHOLD
1438293901681590725,robot is in motion toward [avocado]
1438293902689233770,we have received verbal request [avocado]
1438293902689314002,we already have a plan for the verbal request
1438293902689377800,debug
1438293902690529516,put in the final motion request
1438293902691076051,Found a plan for avocado in 1.95595788956 seconds
1438293902691084147,current predicted item != motion target; calc a new plan
1438293902691110642,current probability is greater than EXECUTION_THRESHOLD
1438293902691885974,have existing requests
1438293904496769068,robot is in motion toward [avocado]
1438293907737142498,ready to pick up the item
Ideally I want the output to be something like this:
1438293900729698553, strawberry
1438293901681590725, avocado
1438293904496769068, avocado
Apparently I have to combine subsetting with grep in R, but I am not really sure how!
stamps <- df$rosbagTimestamp[grep("toward \\[", df$data)]
fruits <- gsub(".*\\[(\\w+)\\].*", "\\1", df$data[grep("toward \\[", df$data)])
data.frame(stamps,fruits)
stamps fruits
1 1438293900729698560 strawberry
2 1438293901681590784 avocado
3 1438293904496769024 avocado
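I used the pattern "toward \\[" to locate fruits. If the log format varies, the pattern can be extended. The stamps variable is created by selecting the timestamps of the rows whose data column matches the pattern. The fruits variable isolates the fruit name inside the brackets. Note that the printed stamps differ from the input in their last digits: the column was read as double, which cannot represent 19-digit integers exactly. Reading that column as character (for example via the colClasses argument of read.csv) keeps the digits intact.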
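Since the question also asks about other ways to do this, here is a minimal Python sketch of the same extraction. It assumes each line has the form "timestamp,message" as in the sample; the function name extract_targets is hypothetical, and the timestamps are kept as strings so no digits are altered.

```python
import re

# Same idea as the R pattern: "toward [" followed by the bracketed word
TOWARD = re.compile(r"toward \[(\w+)\]")


def extract_targets(lines):
    """Return (timestamp, fruit) for every line that mentions a target."""
    results = []
    for line in lines:
        # Split on the first comma only; the message may contain commas
        stamp, _, message = line.partition(",")
        match = TOWARD.search(message)
        if match:
            results.append((stamp, match.group(1)))
    return results


log = [
    "1438293900729698553,robot is in motion toward [strawberry]",
    "1438293900730571638,Found a plan for avocado in 1.36400008202 seconds",
    "1438293901681590725,robot is in motion toward [avocado]",
]
targets = extract_targets(log)
for stamp, fruit in targets:
    print(stamp + ", " + fruit)
```

For a real file, the lines would come from reading the CSV and skipping its header row.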
I have just started programming in R, neo4j and R-neo4j, so please be indulgent if my question is trivial.
I have created the following database (see the attached picture [1]) using R-neo4j and the R code in [2].
The database contains the outcomes of computer game matches between four players. The dataset consists of four nodes, player 1 to player 4. The nodes are connected via the relationship "defeats", which indicates the outcome of a match. There are two properties attached to each relationship: judge and game.
Using Cypher queries, I want to extract data from the graph database in the following form (see the picture in [1]):
Winning player   Losing player   Game        Judge
player 1         player 4        Starcraft   player 2
player 1         player 4        LOL         player 3
player 4         player 1        LOL         player 2
player 1         player 4        Starcraft   player 3
player 1         player 2        LOL         player 3
player 2         player 1        LOL         player 4
player 4         player 1        Starcraft   player 4
I want to run a query (preferably in the R-neo4j environment) against the graph database, where the input is "player 1" and the table above is returned.
I hope that my question is clear and someone can help me with this.
Have a good day.
Christian
[1] https://goo.gl/cMxXHo
[2] The R (Rneo4j) code:
clear(graph)
Y
player1 = createNode(graph,"user",ID="Player 1",male=T)
player2 = createNode(graph,"user",ID="Player 2",male=T)
player3 = createNode(graph,"user",ID="Player 3",male=F)
player4 = createNode(graph,"user",ID="Player 4",male=F)
addConstraint(graph,"user","ID")
rel1 = createRel(player1,"defeats",player4)
rel2 = createRel(player1,"defeats",player4)
rel3 = createRel(player4,"defeats",player1)
rel4 = createRel(player1,"defeats",player4)
rel5 = createRel(player1,"defeats",player2)
rel6 = createRel(player2,"defeats",player1)
rel7 = createRel(player3,"defeats",player1)
rel1 = updateProp(rel1, game = "Starcraft", judge = "Player 2")
rel2 = updateProp(rel2, game = "League of Legends", judge = "Player 3")
rel3 = updateProp(rel3, game = "League of Legends", judge = "Player 2")
rel4 = updateProp(rel4, game = "Starcraft", judge = "Player 3")
rel5 = updateProp(rel5, game = "League of Legends", judge = "Player 3")
rel6 = updateProp(rel6, game = "League of Legends", judge = "Player 4")
rel7 = updateProp(rel7, game = "Starcraft", judge = "Player 4")
A couple things. If you want to use clear(graph) without having to type "Y", you can use clear(graph, input=F). Also, if you weren't aware, you can set properties on relationships when you create them:
rel1 = createRel(player1, "defeats", player4, game="Starcraft", judge="Player 2")
To answer the question, I'd do this:
getDataForPlayer = function(name) {
query = "
MATCH (winner:user)-[game:defeats]->(loser:user)
WHERE winner.ID = {name} OR loser.ID = {name}
RETURN winner.ID AS `Winning Player`,
loser.ID AS `Losing Player`,
game.game AS Game,
game.judge AS Judge
"
return(cypher(graph, query, name=name))
}
getDataForPlayer("Player 1")
Output:
Winning Player Losing Player Game Judge
1 Player 4 Player 1 League of Legends Player 2
2 Player 2 Player 1 League of Legends Player 4
3 Player 3 Player 1 Starcraft Player 4
4 Player 1 Player 2 League of Legends Player 3
5 Player 1 Player 4 Starcraft Player 2
6 Player 1 Player 4 League of Legends Player 3
7 Player 1 Player 4 Starcraft Player 3
Looking at your graph, it strikes me as not having the right structure. Even though every scenario is different, it is always good to consider what happens when you add MUCH more data. Can your model handle it?
For example, you are using relationships to represent the results of games, which then of course requires properties to store the judge and the game. The game names actually look like tournament names to me, but you'll know what works better. When storing the player and tournament names this way, you end up with a lot of repetition, because the same names appear everywhere.
If you continue to add results between players, you will end up with many relationships, and the possibilities for error and repetition keep growing.
What can you do to improve your model, then? Think of your basic relationship as a starting point that has now outgrown the original requirement: you can introduce nodes for tournaments and nodes for games, keep relationships for storing the roles of the players within a game, and so on. There is always more than one way to do it (TIMTOWTDI).
Given that a picture is worth a thousand words, look at the improved model here:
You see how it is also easier to add additional properties to the corresponding nodes or relationships in the model.
In order to produce your desired table with results, you then can use:
MATCH
(g:Game)-[:WINNER]->(w:Player),
(g)-[:LOSER]->(l:Player),
(g)-[:JUDGE]->(j:Player),
(g)<-[:HAS_GAMES]-(t:Tournament)
WHERE
w.name = 'Player 1' OR l.name = 'Player 1'
RETURN
w.name AS `Winning Player`,
l.name AS `Losing Player`,
t.name AS Game,
j.name AS Judge
and adapt it for R as suggested by Nicole. If you intend to add lots of data, I think this structure will adapt better to your needs, and you can also explore different ways of querying the same data, as you can now start from the tournaments or explore games directly.