Automate searching of pages such as Wikipedia [closed]

My question is rather general and not specific to Wikipedia only: is there a way to automate the generation and selection of search results? To give an example of what I intend:
Let's say I'd like to write articles about American food, and I'd like to read information such as ingredients, texture, cuisine (country-wise), preparation methods, etc. about approximately 500 different American foods. Let's say these are all available on Wikipedia too, and I have an Excel sheet with the names of these dishes and columns specifying their properties. But I don't want to look up these dishes/food items manually; can I automate this process? I am looking for some general guidance, open-source links, pseudo-code, or an algorithmic approach to this problem. Any help is appreciated.
Thanks.
P.S.: It'd be great if the logic came with some links to help carry it out in R, since the other aspects of my project have already been built in R. I'd also like to broaden my searches to include other major information-gathering sites/search engines.

You can do it relatively quickly using the WikipediR package:
library(WikipediR)

# Page names to look up; replace with the dish names from your spreadsheet.
phrs <- c("car", "house")

# Fetch the wikitext of each page; pgs must be created before indexing into it.
pgs <- vector("list", length(phrs))
for (j in seq_along(phrs)) {
  pgs[[j]] <- page_content("en", "wikipedia",
                           page_name = phrs[j], as_wikitext = TRUE)
}
The solution rather optimistically assumes that your food names correspond to the page names on Wikipedia; most probably this won't be the case for all the items. You may consider using pages_in_category in order to source more pages at once. I would first match my list against pages_in_category for a given category (foods) and, if the number of mismatches is insignificant, proceed to matching the data.
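A minimal sketch of that matching step (the category name "American cuisine" and the dish names are assumptions for illustration, and the exact shape of the returned list may vary with the API response):

library(WikipediR)

# Titles of all pages in the category (type = "page" skips subcategories).
cat_pages <- pages_in_category("en", "wikipedia",
                               categories = "American cuisine",
                               properties = "title", type = "page")
titles <- vapply(cat_pages$query$categorymembers,
                 function(m) m$title, character(1))

# Which of your dish names have an exact page-title match?
dishes    <- c("Hamburger", "Clam chowder")
matched   <- dishes[dishes %in% titles]
unmatched <- setdiff(dishes, matched)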

Related

How can I learn to use R to do an excel V-lookup function that retrieves data from other excel files? [closed]

I am trying to learn how to use R/RStudio for a project. Some of the initial tasks I will be using R for are described below, and I would be very grateful for a resource that teaches me how to perform them.
I have a column of unique identifiers in one Excel document (document A), i.e. a, b, and c. I have another Excel document for each of these identifiers, with the same name as the identifier. So for each unique ID, I want to look up the spreadsheet with a matching name, and from that spreadsheet I want to retrieve the first and final values in a certain column, as well as the mean and maximum values in that column.
I am interested in finding a resource that will teach me to do all this and more, and I don't mind investing time to learn, i.e. I am not in a rush.
After this step, I have something more complicated I want to do.
I have another document (document B) with a column of identifiers, but the identifiers are repeated multiple times. So again, using the first document with the list of identifiers, I want to search through document B and retrieve values from the rows where each identifier appears for the first and last time in the column.
If you have a resource I can study to learn to do all this and more I would be very grateful. Thank you.
R offers multiple ways to do what you want, and once you understand the basics you will probably find it easy to implement a solution for the tasks you described.
Besides learning the R basics, I'd also suggest looking at the tidyverse collection of packages. Its package dplyr offers an easy-to-write and easy-to-read way of structuring code, and together with tidyr it covers almost all the functions you'll ever need for your day-to-day data-wrangling needs (including the tasks mentioned in your question).
An Introduction to R - CRAN: an official intro to the basics of R. While you would probably use alternative solutions to many of the examples there, I think it's very useful to have read the basics at least once.
tidyverse: here you will find links (by clicking on the icons) to the tidyverse packages, notably ggplot2 (probably the most widely used plotting package in R), the aforementioned dplyr and tidyr, as well as readxl, a package for reading data from Excel files.
Just to give you a glimpse into the future, the workflow to solve the tasks from the question could look something like the following (a rough code sketch follows the list):
Read data from the excel file with the unique identifiers using readxl::read_excel
Loop through the identifiers and load the corresponding files
Use dplyr::summarise to compute the mean and max, together with dplyr::first and dplyr::last
Proceed similarly for document B, maybe using dplyr::group_by with dplyr::first and dplyr::last
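A minimal sketch of that workflow, assuming hypothetical file and column names ("ids.xlsx" with an id column, one "<id>.xlsx" file per identifier with a value column, and "document_b.xlsx"); adapt these to your actual files:

library(readxl)
library(dplyr)
library(purrr)

# Document A: one unique identifier per row.
ids <- read_excel("ids.xlsx")

# For each identifier, load the file named after it and summarise one column.
per_file <- map_dfr(ids$id, function(one_id) {
  d <- read_excel(paste0(one_id, ".xlsx"))
  tibble(id          = one_id,
         first_value = first(d$value),
         last_value  = last(d$value),
         mean_value  = mean(d$value, na.rm = TRUE),
         max_value   = max(d$value, na.rm = TRUE))
})

# Document B: identifiers repeated; keep the values from each identifier's
# first and last rows.
doc_b <- read_excel("document_b.xlsx")
b_summary <- doc_b %>%
  group_by(id) %>%
  summarise(first_value = first(value), last_value = last(value))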

R - Mean of columns in a dataframe? [closed]

Good evening, I'd just like to start off by saying I'm the biggest newbie to coding. I've gone through so many tutorials just to try and make a simple football/soccer data frame.
I currently have something like this:
Home Team | Away Team | Home Goals | Away Goals
M.United  | Liverpool | 0          | 0
I have that for all results of the season so far.
What I'd like to do is get the mean of the home goals column and of the away goals column.
Also, if it's doable, I'd like to filter for a specific team and see what their average goals scored at home are, their average goals conceded at home, etc.
Thanks in advance, and apologies for my total utter noobism.
Jay.
You can use the dplyr package with something like:
library(dplyr)

# Assuming the data has one row per team per match; adjust the column
# names to whatever your data frame actually uses.
data %>%
  group_by(team) %>%
  summarise(mean_home = mean(home_goals), mean_away = mean(away_goals))
(I am pretty sure that will work, but it is always good to create a reproducible example so others can run your code to double-check. For example, I am not exactly sure what your variable/data set names are, and I am not able to run your code as it is. A great resource for this is the reprex package.)
To obtain means, try
summary(your_data_frame_name)
which gives you basic statistics for each column, including Home.Goals. Or, to get just the mean of one column:
mean(your_data_frame_name$Home.Goals)
To filter for a specific team, look into the subset function (note that the team name needs quotes):
M_united_home <- subset(your_data_frame_name, Home.team == "M.United")
Then you can use this data frame to answer any further queries about Man United. If you want to do more, also look into the dplyr package.
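A dplyr version of the same idea, as a hedged sketch (the column names Home.team, Home.Goals, and Away.Goals are assumed from the question):

library(dplyr)

your_data_frame_name %>%
  filter(Home.team == "M.United") %>%
  summarise(
    avg_scored_at_home   = mean(Home.Goals),  # goals scored at home
    avg_conceded_at_home = mean(Away.Goals)   # goals conceded at home
  )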

Why is 'DO LOOP' missing in 328eForth? [closed]

I'm trying to learn Forth directly on an embedded system, using Starting Forth by Leo Brodie as a text. The Forth version I'm using is 328eForth (a port of eForth to the ATmega328), which I've flashed onto an Arduino Uno.
It appears that the DO LOOP words are not implemented in 328eForth, which puts a kink in my learning with Brodie. But looking at the dictionary using WORDS shows that a series of looping words exists, e.g. BEGIN UNTIL WHILE FOR NEXT AFT EXIT AGAIN REPEAT, amongst others.
My questions are as follows:
Q1 Why was DO LOOP omitted from 328eForth?
Q2 Can DO LOOP be implemented in other existing words? If so, how, please; and if not, why not? (I guess there must be a very good reason for the omission of DO LOOP...)
Q3 Can you give some commented examples of the 328eForth looping words?
Q1: A choice was made for a different loop construct.
Q2: The words FOR and NEXT perform a similar function: the loop index simply counts down to 0, so the body runs a fixed number of times, including the pass where the index is zero.
The ( n2 n1 -- ) DO ... LOOP always runs at least once, which requires additional (mental) bookkeeping. People have been complaining about that for as long as I can remember.
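As an illustration, a minimal sketch assuming classic eForth FOR ... NEXT semantics (R@ reads the loop index from the return stack; I have not verified this exact listing on 328eForth):

\ Print the index as it counts down from n to 0.
: COUNTDOWN ( n -- ) FOR R@ . NEXT ;
5 COUNTDOWN  \ prints: 5 4 3 2 1 0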
Q3: The 328eForth documentation, ForthArduino_1.pdf, contains some examples.
Edit: added some exposition to Q2.

Search for duplicates in texts with math equations [closed]

My employer asked me to do a project for our local team; it should be a way to help our work get finished faster.
We have a local database where we add exercises, divided into two fields: the question and the solution. Since we are a team working at the same time, my employer wants me to create a system like Stack Overflow's similar questions. When a team member tries to submit new data to the database, it will check whether other entries may be duplicates.
The reason he asked me is that I have done something similar in the past, but only for text, using techniques like TF-IDF and Latent Semantic Analysis. But now, since the math symbols are all in LaTeX, I cannot find a way to check for duplicates.
I have tried to apply TF-IDF to the text only, but it doesn't work.
Any suggestions?
Edit:
Sorry for the broad topic. I will try to give more examples of my problem.
All the texts are exercises for primary and secondary schools. They are a mix of text and numbers/equations/symbols. If there were only text, I could use TF-IDF to find possible duplicates, but several exercises have little or no text.
Examples:
1) a. Solve the following equation: (x+1)*(x-1) = 5
b. Find the x: x^2 - 1 = 5
They are the same equation but with a different expression. So, I don't want to mark them as duplicates.
2) a. Solve the following equation: 3x + 7 = 12
b. Find the solution: 7 + 3x = 12
c. Find the x: 3x = 12 - 7
a and b should be duplicates, whereas c should not be.
You could try using MathJax to convert the LaTeX equation into MathML, an XML format; you could then use XML tools to examine that structure. There are probably a few other tools that can convert your equation into some kind of tree structure.
Equality of mathematical expressions is a complex problem. There are questions such as whether you should treat (x+1)*(x-1) as being equal to x^2-1; algebraically they are the same.
You might want to investigate computer algebra systems which have a lot of sophisticated features for manipulating expressions.
One technique is to evaluate both expressions at a number of points. If the values agree, then it's a good indication that the expressions are the same.
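For example, a quick sketch of that point-evaluation check in R (illustrative only: the function name and tolerance are my own choices, and a real system would first need to parse the LaTeX into an evaluable form):

# Evaluate two one-variable expressions at random points and compare.
exprs_probably_equal <- function(e1, e2, n = 20, tol = 1e-9) {
  xs <- runif(n, min = -10, max = 10)
  v1 <- vapply(xs, function(x) eval(e1, list(x = x)), numeric(1))
  v2 <- vapply(xs, function(x) eval(e2, list(x = x)), numeric(1))
  all(abs(v1 - v2) < tol)
}

exprs_probably_equal(quote((x + 1) * (x - 1)), quote(x^2 - 1))  # TRUE
exprs_probably_equal(quote(3 * x + 7), quote(7 + 3 * x))        # TRUE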
It might be easier to give a better answer if there were some idea of the type of problems you are working with: polynomials, integrals, etc.?

R normalization with all samples, or just the part that i need? [closed]

I am using the edgeR and limma packages to analyse an RNA-seq count data table.
I only need a subset of the data file, so my question is: do I need to normalize my data across all the samples, or is it better to subset my data first and then normalize?
Thank you.
Regards Lisanne
I think it depends on what you want to prove/show. If you also want to take into account your "dark counts", then you should normalize first, so that you also account for the percentage of cases in which your experiment fails. Here your total number of experiments (good and bad results) sums to one.
If you want to find out the distribution of your "good events", then you should first produce your subset of good samples and normalize afterwards. In this case the number of good events sums to 1.
So once again, it depends on what you want to prove. As a physicist, I would prefer the first method, since we do not remove bad data points.
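In code, the two orders of operations look like this (a hedged sketch using edgeR's DGEList and calcNormFactors; the object names counts_all and samples_of_interest are illustrative):

library(edgeR)

# Option 1: normalize across all samples, then subset.
y_all <- DGEList(counts = counts_all)
y_all <- calcNormFactors(y_all)
y_sub <- y_all[, samples_of_interest]

# Option 2: subset first, then normalize within the subset only.
y_sub2 <- DGEList(counts = counts_all[, samples_of_interest])
y_sub2 <- calcNormFactors(y_sub2)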
Cheers TL
