Rvest: Extracting clickable content

I am trying to extract the table in the link below
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
I want the whole table to be extracted and I am using the following code
html_page <- read_html(curl(url)) # url holds the link above; curl() is from the curl package
tab <- html_page %>% html_table(., fill = TRUE)
I get the table in tab[[1]], however, if you notice that website it has a clickable section within the table that has additional data. That part is missing from the extracted table. Will appreciate any help on how the whole table can be extracted.

I'm not sure what you're getting. However, when I pulled from this website I saw that there are multiple tabs, and I pulled all of the data.
Here is the bottom of the table when you show all rows (screenshot omitted).
Here are the results when I query for the last line of this website's data:
library(rvest)
library(tidyverse)
hx = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
htp <- read_html(hx) %>% html_table(., fill = T)
tbOne = htp[[1]][, 1:10] # just the data
tbOne %>% filter(`State Name` == "Uttar Pradesh",
                 `District Name` == "Badaun",
                 `Market Name` == "Wazirganj")
# # A tibble: 1 × 10
# `State Name` `District Name` `Market Name` Variety Group `Arrivals (Tonnes)`
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Uttar Pradesh Badaun Wazirganj Dara Cereals 3.50
# # … with 4 more variables: `Min Price (Rs./Quintal)` <chr>,
# # `Max Price (Rs./Quintal)` <chr>, `Modal Price (Rs./Quintal)` <chr>,
# # `Reported Date` <chr>
Update
When I pressed the 2, nothing happened (and I did try repeatedly). However, I needed to be really patient and I wasn't. Sorry about that.
The URL has the query in it, so the URL can be used to get all of the data. You could do this by adding the states you're missing, or you could do this for every state. For example, page one ends on Uttar Pradesh, but we don't know if this is all of Uttar Pradesh. That might make more sense when you see what I did.
Using rvest, I collected all of the states' names from the form. Then I put these name-value pairs into a data frame.
# collect form values for State
ht <- read_html(hx) %>% html_form()
df1 <- as.data.frame(ht[[1]][["fields"]][["ctl00$ddlState"]][["options"]]) %>%
  rownames_to_column("State")
names(df1)[2] <- "Abb"
To only look at the states that were not included in page one, you could just query the states after Uttar Pradesh, like this.
which(df1$State == "Uttar Pradesh", arr.ind = T)
# [1] 35
# split the URL
urone = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State="
urtwo = "&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=West+Bengal&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
# collect remaining states' data
df2 <- map(36:nrow(df1),
           function(x){
             # assemble URL
             y = toString(df1$Abb[x])
             urall = paste0(urone, y, urtwo)
             # get table
             tabs <- read_html(urall) %>% html_table(., fill = T)
             tabs
           })
length(df2)
# [1] 2
length(df2[[1]]) # state 36 is empty
length(df2[[2]]) # state 37 is not
# add the new data to the original data
df3 <- df2[[2]][[1]]
tbOne <- rbind(tbOne, df3) # one data frame of tabled data
If you wanted to make sure that you had all the data for each state, you could expand this. However, using map for that much data may be slow, so I used the function mclapply from the parallel package. In this code, I used 15 cores; you may need to change this depending on your computer's processor. Using 15 made this take less than a second.
library(parallel)
# skip row 1, that's "select" or all
df4 <- mclapply(2:nrow(df1), mc.cores = getOption("mc.cores", 15L),
                function(x){
                  # assemble URL
                  y = toString(df1$Abb[x])
                  urall = paste0(urone, y, urtwo)
                  # get table
                  tabs <- read_html(urall) %>% html_table(., fill = T)
                  tabs
                })
length(df4)
# [1] 36
# create storage using first state with data
df5 <- df4[[7]][[1]]
map(8:36,
    function(x){
      y = length(df4[[x]])
      if(y > 0){
        df5 <<- rbind(df5, df4[[x]][[1]])
      }
    })
Now you have a data frame, df5 that started as each state queried separately.
I didn't look at how the data was different. However, my tbOne data frame has 577 observations. My df5 data frame has 584.
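As an aside, the map()/<<- loop above could also be written without growing df5 row by row; a minimal sketch using purrr and dplyr (both loaded earlier via tidyverse), assuming the per-state tables share the same columns:
# equivalent to the loop above: keep non-empty results, then bind once
df5 <- df4 %>%
  keep(~ length(.x) > 0) %>%
  map(1) %>% # take the first (and only) table from each result
  bind_rows()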

R for loop to extract info from a file and add it into tibble?

I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.
I want the row names to start with the file IDs, which I did manage to create:
filelist <- list.files(pattern=".txt") # gives me the filenames in the current directory
# the filenames are something like AA1230.report.txt, for example
file_ID <- trimws(filelist, whitespace="\\..*") # gives me the ID, which comes before ".report.txt"
metadata <- as_tibble(file_ID[1:181]) # create a dataframe with IDs as row names for the 181 files
Now, these report files contain information on species and abundance (kraken report files, for those familiar with kraken), and all I need to extract is the number of reads for each domain. I can easily look up in each file the domains and the number of reads that fall into each domain using something like:
sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))
sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity
sample_data %>% filter(Rank=="D") # D for domain
This gives me a clear output such as:
Percentage Num_Reads_Root Num_Reads_Taxon Rank NCBI_ID Name
<dbl> <int> <int> <fct> <int> <fct>
1 75.9 60533 28 D 2 Bacteria
2 0.48 386 0 D 2759 Eukaryota
3 0.01 4 0 D 2157 Archaea
4 0.02 19 0 D 10239 Viruses
Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:
> metadata
value Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
<chr> <int> <int> <int> <int>
1 AA1230 60533 386 19 4
2 AB0566
3 AA1231
4 AB0567
5 BC1148
6 AW0001
7 AW0002
8 BB1121
9 BC0001
10 BC0002
....with 171 more rows
I'm just having trouble coming up with a for loop to create these sample_data outputs and then, from that, extract the info and place it into a tibble. I guess my first loop should create the sample_data outputs, something like:
for (file in filelist) {
>> get_domains <<
}
Then another loop to extract that info from the above loop and insert it into my metadata tibble.
Any suggestions? Thank you so much!
PS: If regular data frames in R are better for this, let me know. I have just recently learned that the tidyverse is a better way to organize data frames in R, but I have more to learn about it.
You could also do:
library(tidyverse)
filelist <- list.files(pattern=".txt")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
set_names(filelist, filelist) %>%
  map_dfr(read_table, col_names = nms, .id = 'file_ID') %>%
  filter(Rank == 'D') %>%
  select(file_ID, Name, Num_reads_root) %>%
  pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
  mutate(file_ID = str_remove(file_ID, '.txt'))
I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or use try() but throw a warning().
library(tidyverse)
filelist <- list.files(pattern=".txt") #list files
tmp_list <- list()
for (i in seq_along(filelist)) {
  my_table <- read_tsv(filelist[i], col_names = FALSE) %>% # it looks like your files are all tsv's with no header
    rename(Percentage=X1, Num_reads_root=X2, Num_reads_taxon=X3, Rank=X4, NCBI_ID=X5, Name=X6) %>% # read_tsv names headerless columns X1..X6
    filter(Rank=="D") %>%
    mutate(file_ID = trimws(filelist[i], whitespace="\\..*")) %>%
    select(file_ID, everything())
  tmp_list[[i]] <- my_table
}
out <- bind_rows(tmp_list)
out
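To get the wide layout from the question (one row per file ID, one column of counts per domain), you could pivot this long result; a small sketch building on the out object above (tidyr is loaded via tidyverse):
# reshape the long summary to wide: one column per domain
metadata <- out %>%
  select(file_ID, Name, Num_reads_root) %>%
  pivot_wider(names_from = Name, values_from = Num_reads_root)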

Problem formatting spreadsheets in R, how can I read and write to tables using R?

I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bear to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
(Screenshot of my desired table format omitted.) However, the table from the Secretary of State looks like the following:
(Screenshot of the gross, ugly table omitted.)
I wrote some pseudocode; basically we want to construct a new CSV:
'''% First, we want to find a district; the house races are always preceded by a line of dashes, so I will need a function like this:
Create a new CSV;
for(x=1; x<151; x+=1){
  Assign x to the cell under the district number column;
  Find "---------------";
  Go down one line;
  Go over two lines;
  % We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistent, so we need to account for uncontested races, Libertarians, Greens, and write-ins. I want totals for Republicans, Democrats, and Other.
  while(cell is not empty){
    Party <- function which reads cell (but I want to read a string);
    go right one column;
    Votes <- function which reads cell (but I want to read an integer);
    if(Party = Rep){
      put this data in place in new CSV;
    else if (Party = Dem)
      put this data in place in new CSV;
    else
      OtherVote += Votes;
    };
  };
  Assign OtherVote to the column for other party;
  OtherVote <- 0;
  % Now I want to assign 0 to null cells (ones where no Rep, no Dem, or no other party contested)
  read through single row 4 spaces; if it's null, assign it 0;
  Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in RStudio, and how? How can I read specific cells in a table, hopefully by indexing? Lastly, how do I write to a table in R? Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use the brackets after df,example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
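For example, a one-liner (using the df from above):
library(openxlsx)
write.xlsx(df, file = "df.xlsx") # writes df to an Excel workbook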
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from Wikipedia and extract table.
The rvest package helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
  read_html(
    "https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
  )
There are a few ways to get a specific object from an HTML file. In this case I decided to look for the table that has the class name "wikitable plainrowheaders sortable", as I learned from inspecting the code that the only table with that class is the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
  map_lgl(~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
  which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
  html_nodes(wiki_html, "table")[[20]] %>%
  html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names; we can change that by using as_tibble() and its .name_repair argument. We then use dplyr::select() to get the columns, and dplyr::filter() to delete the first two rows, which have "District" as a value in the District column. The columns are still character vectors at this point, but we need them to be numeric, so we first delete commas from all columns and then transform columns 2 to 4 to numeric.
clean_table <-
  raw_table %>%
  as_tibble(.name_repair = "unique") %>%
  filter(District != "District") %>%
  mutate_all(~ gsub(",", "", .)) %>%
  mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names. Finally, we calculate the margin between Democratic and Republican votes by first adding up their votes as total_votes and then dividing the difference by total_votes.
clean_table %>%
  select(District,
         RepVote = Republican...2,
         DemVote = Democratic...4,
         OthVote = Others...6) %>%
  mutate(
    total_votes = RepVote + DemVote,
    margin = abs(RepVote - DemVote) / total_votes * 100
  )
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third, and fourth columns. So what you want to do is:
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
  select(c(1, 3, 4))
clean table
tx18 <- tx18 %>%
  filter(!is.na(X3),
         X3 != "Party",
         X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
  group_by(X3) %>%
  summarise(votes = sum(X4)) # sum the vote-count column (X4) within each party (X3)
Pivot/ Reshape data to wide format
tx18 %>%
  pivot_wider(names_from = X3,
              values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
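A hedged sketch of that margin step, assuming the pivoted result was saved as tx18_wide and that the party columns came out named REP and DEM (the actual names depend on the values in X3):
# hypothetical column names REP/DEM; adjust to whatever pivot_wider produced
tx18_wide %>%
  mutate(total_votes = REP + DEM,
         margin = abs(REP - DEM) / total_votes * 100)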

how to use future_lapply and data.table to read huge folder of csvs in a loop and return summary table

I have a folder of 10,000+ csv files stored on my hard drive. Each csv is for a species and gives presence in raster cells (so over 5million cells if the species were present in every cell on earth).
I need to read each file and use dplyr to join to other data frames and summarise, then return a summary df. I don't have a server to run this on and it's stalling my desktop. It works with a subset of 17 species csvs, but even then it's slow.
This is similar to a few other questions about dealing with big data, but I can't figure out the right combination of packages like data.table, bigmemory, and future. I think the really slow part is the dplyr commands, as opposed to reading the files, but I'm not sure.
I'm not sure if this is possible to answer without the files, but they're huge, so I'm not sure how to make this reproducible.
spp_ids <- <vector of the species ids, in this case 17 of them>
spp_list <- <datafame with ids of the 17 spp in the folder>
spp_info <- <dataframe with the species id and then some other columns>
cellid_df <- <big df with 5 million+ cell ids and corresponding region names>
# Loop (note: future_lapply comes from the future.apply package)
spp_regions <- future_lapply(spp_ids, FUN = function(x) {
  csv_file <- file.path("//filepathtoharddrivefolder",
                        sprintf('chrstoremove_%s.csv', x)) # I pull just the id number from the file names
  # summarise number of regions and cells
  spp_region_summary <- data.table::fread(csv_file, sep = ",") %>%
    dplyr::mutate(spp_id = x) %>%
    dplyr::filter(presence == 1) %>% # select cell ids where the species is present
    dplyr::left_join(cellid_df, by = "cell_id") %>%
    dplyr::group_by(region, spp_id) %>%
    dplyr::summarise(num_cells = length(presence)) %>%
    dplyr::ungroup()
  # add some additional information
  spp_region_summary <- spp_region_summary %>%
    dplyr::left_join(spp_info, by = "spp_id") %>%
    dplyr::left_join(spp_list, by = "spp_id") %>%
    dplyr::select(region, spp_id, num_cells)
  return(spp_region_summary)
})
spp_regions_df <- dplyr::bind_rows(spp_regions)
fwrite(spp_regions_df,"filepath.csv")
Haven't worked with this much data before so I've never had to leave the tidyverse!
I've tried to reproduce this. I generated 10 million rows for cellid_df and each individual file. It only took about 40 seconds for 15 "files" (Using reprex added an extra 20 seconds).
If you can leave your laptop running for half a day or so, this should do it.
A couple of suggestions:
You can write to file if you're worried about memory issues.
Since spp_id is unique in each iteration, you can add it in after the merge. It will save some time.
The "additional information" can be joined to the final dataframe since it is keyed on spp_id. In data.table, left_join(X,Y,by='id') will become Y[X,on='id']
library(data.table)
spp_ids <- 1:15
set.seed(123)
N <- 1e7 # number of cell_ids
# Dummy cell ids + regions
cellid_df <- data.table(cell_id=1:N,region=sample(state.abb,N,replace = T))
head(cellid_df)
#> cell_id region
#> 1: 1 NM
#> 2: 2 IA
#> 3: 3 IN
#> 4: 4 AZ
#> 5: 5 TN
#> 6: 6 WY
#
outfile <- 'test.csv'
if(file.exists(outfile))
  file.remove(outfile)
a=Sys.time()
l <- lapply(spp_ids, function(x){
  # generate a random "file" with cell_id and presence
  spp_file <- data.table(cell_id=1:N, presence=round(runif(N)))
  present_cells <- cellid_df[spp_file[presence==1], on='cell_id'] # filter and merge
  spp_region_summary <- present_cells[,.(spp_id=x, num_cells=.N), by=.(region)] # summarise and add spp_id
  setcolorder(spp_region_summary, c('spp_id','region','num_cells')) # reorder the columns if you want
  fwrite(spp_region_summary, outfile, append = file.exists(outfile)) # write the summary to disk to avoid memory issues
  # if you want to keep it in memory, you can return it and use rbindlist
  # spp_region_summary
})
b=Sys.time()
b-a
#> Time difference of 1.019157 mins
# Check lines in file = (No of species) x (No of regions) + 1
R.utils::countLines(outfile)
#> Registered S3 method overwritten by 'R.oo':
#> method from
#> throw.default R.methodsS3
#> [1] 751
#> attr(,"lastLineHasNewline")
#> [1] TRUE
Created on 2019-12-20 by the reprex package (v0.3.0)
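If you do return spp_region_summary from each iteration instead of writing inside the loop, the per-species summaries can be combined in a single call; a small sketch (the output file name is a placeholder):
# bind the in-memory summaries; requires the function to return spp_region_summary
spp_regions_df <- rbindlist(l)
fwrite(spp_regions_df, "spp_regions_summary.csv")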

.TXT in long form to data.frame in wide form in R

I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is to extract the data from the txt file into a long-format data frame with a column for: Participant # (which is included in the file name), Subtest, Score, and T-Score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple of roadblocks that I could use some input on how to navigate.
1) I only need the information that corresponds to each subtest; these all have a number prior to the subtest name. Therefore, the rows that have only one or two unnecessary words (e.g. cognitive screen) seem to interfere with creating new data frames, because I have a mismatch between the columns provided and the columns wanted.
Some additional quirks to the data:
1) the asterisks are NOT necessary
2) the cognitive TOTAL will never have a value
I am utilizing the readtext package to import the data at the moment, and I am able to get a data frame with two columns: one is the file name (which includes the participant ID), so that problem is fixed. However, the next column is a giant character string holding the data points for both Score and T-Score. Presumably I would then need to split these into the columns of interest, listed previously.
Next problem: when I view the data, the T-scores are in the correct order; however, the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile, "CAToutput.txt"),
                 #docvarsfrom = "filenames",
                 dvsep = " ")
From here I do not know how to split the data; in my head I would do something like this:
data2 <- separate(data, text, sep = " ", into = c("subtest", "score", "t_score"))
This, of course, gives the correct column names but removes almost all the data I am actually interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex
Here is a way of converting that text file to a dataframe that you can do analysis on
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
                   header,
                   by = 'row')
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would be good to know something about those unfamiliar data types.
Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming, and everywhere I've looked, tables are mostly identified with IDs, divs, or classes. On this page there are none; the data is just stored in table format. How should I scrape it?
This is what I did :
library(rvest)
webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[9:10] %>%
  html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
                            "Police Station", "Status", "Mobile Type(GSM/CDMA)",
                            "FIR/DD/GD Date")
You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have #AutoNumber15 css id while the third (in the middle) has the #AutoNumber16 css id.
Below is a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))
# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)
  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows()
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the server is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
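Each record comes back as label/value pairs spread over two column pairs (X1/X2 and X3/X4), with an all-NA separator row between records. A hedged sketch of one way to reshape that into one row per phone record; the record-splitting rule, keyed on the "FIR/DD/GD Number" label, is an assumption based on the output above, and it assumes each label appears once per record:
# stack the two label/value pairs, number the records, then spread to one row each
raw <- map_df(iter_page, get_page)
tidy_records <- raw %>%
  mutate(record = cumsum(!is.na(X1) & X1 == "FIR/DD/GD Number")) %>%
  filter(!is.na(X1)) %>%
  { bind_rows(select(., record, label = X1, value = X2),
              select(., record, label = X3, value = X4)) } %>%
  pivot_wider(names_from = label, values_from = value)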
