How to build dataframe of variable search strings for web scraping - r

I'm trying to build a dataframe that will help me paginate some simple web scraping. What is the best way to build a dataframe where each row uses the same base URL string but varies a few specific characters, which can be specified according to the pagination one needs?
Let's say you have a set of search results where there is a total of 4485 results, 10 per page, spread out over 449 pages. All I want for the moment is to make a dataframe with one variable where each row is a character string of the URL with a variable, sequenced page number along the lines of:
**Var1**
http://begin.com/start=0/index.html
http://begin.com/start=10/index.html
http://begin.com/start=20/index.html
http://begin.com/start=30/index.html
...
http://begin.com/start=4480/index.html
Here's my original strategy but this fails (and yea it's inefficient and newbish).
startstring<-"http://begin.com/start="
variableterm<-seq(from=0, to=4485, by=10)
endstring<-"/index.html"
df <- as.data.frame(matrix(nrow=449, ncol=1))
for (x in 1:length(variableterm)){
for(i in variableterm){
df[x,]<-c(paste(startstring,i,endstring, sep=""))
}
}
But every single row is equal to http://begin.com/start=4480/index.html. How can I change this so that each row gives the same URL but with a different number increasing like in the desired dataframe above?
I would very much appreciate how to achieve this with my strategy (just to learn) but of course better approaches are welcome also. Thanks!

I am not sure why you would need this to be in a data frame. Here is one way to create a vector of page urls.
sprintf("http://begin.com/start=%s/index.html", seq(0, 4480, 10))

Every row ends up with the same value (the last one) because you have two nested loops where you only need one. The outer loop steps through the rows of the data frame, while the inner loop runs through the entire set of URLs for each row, overwriting that row every time and leaving the last URL in place before the outer loop moves to the next row.
This should work as you would expect:
for(i in 1:length(variableterm)){
df[i,]<-paste(startstring,variableterm[i],endstring, sep="")
}
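The loop can also be dropped entirely, since paste0() is vectorised over the page offsets. The following is a minimal sketch that reuses the names from the question; total_results, per_page and offsets are names introduced here for illustration only.

```r
startstring <- "http://begin.com/start="
endstring <- "/index.html"

# Illustrative names, not from the question
total_results <- 4485
per_page <- 10

# Offsets 0, 10, ..., 4480: one per results page
offsets <- seq(from = 0, to = total_results - 1, by = per_page)

# paste0() recycles the two fixed strings across every offset
df <- data.frame(Var1 = paste0(startstring, offsets, endstring),
                 stringsAsFactors = FALSE)

head(df, 3)
```

Deriving the offsets from the result count means the number of rows does not have to be hard-coded.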

Related

Combine lapply and gsub to replace a list of values for another list of values

I am currently looking for a way to simplify searching through a column within a dataframe for a vector of values and replacing each of those values with another value (also contained within a separate vector). I can run a for loop for this, but it must be possible within the apply family; I'm just not seeing it yet. I'm very new to using the apply family and could use help.
So far, I've been able to have it replace all instances of the first value in my vector with the new first value in the new vector, it just isn't iterating past the first level. I hope this makes sense. Here is the code I have:
#standardize tank location
old_tank_list <- c("7.C.4","7.C.5","7.C.6","7.C.7","7.C.8","7.C.9","7.C.10","7.C.11")
new_tank_list <- c("7.B.3-4","7.C.3-4","7.C.1-2","7.C.5-6","7.C.7-8","7.C.9-10","7.E.9-10","7.C.11-12")
sapply(df_growth$Tank,function(y) gsub(old_tank_list,new_tank_list,y))
Tank is the name of the column I am trying to replace all of these values within. I haven't assigned it back yet, because I want to test the functionality first. Thanks for any help you can offer.
Hopefully, this image will help. The photo on the left is the column before my function is applied. The column on the right is after. Basically, I just want to batch change text values.
[Image: Before and After]
library(dplyr)
df %>%
mutate(Tank = recode(Tank, !!!setNames(new_tank_list, old_tank_list)))
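The sapply() attempt in the question stops at the first value because gsub() only uses the first element of a vector passed as its pattern (it warns about this). For whole-value replacement, a base R lookup with match() is one alternative to recode(). This is only a sketch: the tanks vector below is made-up sample data standing in for df_growth$Tank.

```r
old_tank_list <- c("7.C.4", "7.C.5", "7.C.6", "7.C.7",
                   "7.C.8", "7.C.9", "7.C.10", "7.C.11")
new_tank_list <- c("7.B.3-4", "7.C.3-4", "7.C.1-2", "7.C.5-6",
                   "7.C.7-8", "7.C.9-10", "7.E.9-10", "7.C.11-12")

# Made-up sample values standing in for df_growth$Tank
tanks <- c("7.C.4", "7.A.1", "7.C.10", "7.C.4")

# Position of each value in the old list (NA when it is not listed)
idx <- match(tanks, old_tank_list)

# Swap in the new label where a match exists, keep the original otherwise
recoded <- ifelse(is.na(idx), tanks, new_tank_list[idx])
recoded
```

Because match() compares whole strings, "7.C.1" can never be confused with "7.C.10", which is a risk with regular-expression replacement.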

Referencing last used row in a data frame

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the below two lines of code, which take in data from excel in a specific range (using readxl for this). The range itself only goes through row 2589 in the excel document, but it will update dynamically (it's a time series) and to ensure I capture the different observations (rows) as they're added, I've included rows to 10000 in the read_excel range argument.
In the end, I'd like to run charts on this data, but a key part of this is identifying the last used row, without manually updating the code row for the latest date. I've tried using nrow but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "ReturnsA6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you include an example.
Without knowing what your data looks like, answers are likely going to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
It appears you also have control over the excel spreadsheet. So in case your data does contain NAs you could have some default value in your empty rows that will get overwritten as soon as a new data point is recorded. This will allow you to filter your dataframe accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
                                sheet = 1,
                                range = cell_cols("A:P"), # Only cols, no rows
                                col_names = TRUE)
Every time you run the code, R will pull in the data from columns between A:P up until the last populated row.
This will be a more elegant approach to your use case. (Consider what you'd do when your data crosses 10000 rows in the future)
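If the padded range has already been read in, the last used row can also be located directly. The sketch below assumes the unused rows come back as all-NA rows; raw is a toy data frame invented for illustration, not the asker's data. Unlike na.omit(), it keeps rows that are only partly missing.

```r
# Toy stand-in for a sheet read with an over-long range (assumed all-NA padding)
raw <- data.frame(
  date  = c("2020-01-01", "2020-01-02", "2020-01-03", NA, NA),
  value = c(1.5, NA, 2.0, NA, NA),
  stringsAsFactors = FALSE
)

# A row counts as used if at least one cell is not NA
non_empty <- rowSums(!is.na(raw)) > 0

# Index of the last used row
last_used <- max(which(non_empty))

# Keep everything up to and including that row
trimmed <- raw[seq_len(last_used), , drop = FALSE]
trimmed
```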

Changing column name starting with a number

I am trying to run a loop to change the names of two columns in my data, but the names of these two columns start with a number. For similar work I renamed columns whose names did not start with a number by writing the code shown below, but here it is not working.
Here is the code (the loop finishes later) :
#Filtering
for(i in 1:length(names$ID)){
f<-names$ID[[i]]
corrpoints<-sprintf("corrpoints%i",as.numeric(levels(f))[f])
pts=readOGR(dsn="C:/Users/Charlie/Desktop/Stage_permafrost/SIG/Quantification_des_mouvements/Corr_points_disp/Corr_points_ubaye", layer=corrpoints)
pts$Gvalue2004<-pts$2004_red_gr
pts$Gvalue2012<-pts$2012_red_gr
pts$Aspect_mnt<-pts$Aspect_25m_
Any idea on how I could fix this?
Thanks
Your example is not reproducible for us.
Column names CAN start with numbers. Use backticks around the name to access it as below. Whether this is a good idea is an entirely different story.
x <- mtcars
colnames(x)[1] <- '1mpg'
x$`1mpg`
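Applied to the column names from the question, the same idea might look like the sketch below. The pts data frame here is a made-up stand-in for the object returned by readOGR(), and red_gr_2004 is a hypothetical replacement name.

```r
# Made-up stand-in for the spatial data; check.names = FALSE keeps the
# non-syntactic names as they are
pts <- data.frame(`2004_red_gr` = c(0.1, 0.2),
                  `2012_red_gr` = c(0.3, 0.4),
                  check.names = FALSE)

# Backticks make $ work with names that start with a digit
pts$Gvalue2004 <- pts$`2004_red_gr`

# [[ ]] with a quoted name is an equivalent alternative
pts$Gvalue2012 <- pts[["2012_red_gr"]]

# Renaming in place instead of copying the column
names(pts)[names(pts) == "2004_red_gr"] <- "red_gr_2004"
names(pts)
```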

How do I reference previous/next rows when iterating through a data frame in R?

I have a dataset that looks like this (I'm simplifying slightly here):
Column 1 has a user id
Column 2 has a url title
Column 3 has an actual url
The data is already ordered by user and time. So it's User 1 and all the URLs they visited in ascending order of time, then User 2 and the URLs they visited in ascending order of time, and so on.
What I'm trying to do is loop through the dataset and look for "triplets" where the first row's URL doesn't contain my keyword (something like google or facebook or nytimes or whatever), the second row's URL does contain my keyword, and the third row's URL doesn't contain my keyword. Basically, I'm checking which websites users visited before and after any specific website.
I've figured out I can look for the keyword using:
if(length(grep("facebook",url)) > 0)
But I haven't been able to figure out how to loop through the code and achieve what I'm trying to do.
If you could break your response into two parts, I would really appreciate it:
Part 1: Is there any way to loop through a dataframe and have access to all the columns? I was able to work on a single column with this code:
new_data <- data.frame (url)
for (url in data$url)
if(length(grep("keyword",url)) > 0) {
new_data <- rbind(new_data,data.frame(url = url))
}
This approach is limited though because I can only reference a single column in my dataframe. What's the better solution here? I tried:
for (row in data) and then referencing columns by row[column_number] and row['column_name'], to no avail.
I also tried for (i in 1:nrow(data)) and then referencing columns using data[i,column_number], and that didn't work either. (That should have worked, right?) I figured if this method worked I could use i-1 and i+1 to access other rows! I know this isn't the traditional way of doing things in R, but if you could still offer an explanation on how to do it this way I would really appreciate it.
Part 2: How do I accomplish my actual goal, as stated earlier? I'd like to learn to do it the "R way"; I imagine it's going to involve plyr or lapply, but I haven't managed to figure out how to use those functions even after extensive reading, let alone use them and include references to previous/next rows.
Thanks in advance for your help, any guidance is appreciated!
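Index the neighbouring rows with i - 1 and i + 1 while looping over the interior rows:
penu <- nrow(df) - 1 # last row that still has a following row
df$ContainsKeyword <- grepl("keyword", df$url)
# The first and last rows cannot be the middle of a triplet, so they stay NA
df$TripletFound <- NA
for (i in 2:penu){
  # The middle row matches the keyword; the rows before and after do not
  df$TripletFound[i] <- !df$ContainsKeyword[i-1] & df$ContainsKeyword[i] & !df$ContainsKeyword[i+1]
}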
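The same check can be written without a loop by comparing each row with shifted copies of the keyword flags. The sketch below also requires all three rows to belong to the same user, which the loop above does not check. The visits data frame and its column names user and url are assumptions made for illustration.

```r
# Made-up browsing log, already ordered by user and time
visits <- data.frame(
  user = c(1, 1, 1, 1, 2, 2, 2),
  url  = c("google.com", "facebook.com/home", "nytimes.com", "facebook.com",
           "facebook.com", "bbc.co.uk", "facebook.com/x"),
  stringsAsFactors = FALSE
)

n <- nrow(visits)
hit <- grepl("facebook", visits$url, fixed = TRUE)

# Shifted copies: the value from the previous row and from the next row
prev_hit  <- c(NA, hit[-n])
next_hit  <- c(hit[-1], NA)
prev_user <- c(NA, visits$user[-n])
next_user <- c(visits$user[-1], NA)

# All three rows must belong to the same user (NA at the two ends)
same_user <- prev_user == visits$user & next_user == visits$user

# The middle row matches; its neighbours do not
visits$triplet <- !is.na(same_user) & same_user & !prev_hit & hit & !next_hit

visits[visits$triplet, ]
```

The rows before and after each flagged row can then be pulled out with which(visits$triplet) - 1 and which(visits$triplet) + 1.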

Executing for loop in R

I am pretty new to R and have a couple of questions about a loop I am attempting to execute. I will try to explain as best I can what I want the loop to do.
for(i in (1988:1999,2000:2006)){
yearerrors=NULL
binding=do.call("rbind.fill",x[grep(names(x), pattern ="1988.* 4._ data=")])
cmeans=lapply(binding[,2:ncol(binding)],mean)
datcmeans=as.data.frame(cmeans)
finvec=datcmeans[1,]
kk=0
result=RMSE2(yields[(kk+1):(kk+ncol(binding))],finvec)
kk=kk+ncol(binding)
yearerrors=c(result)
}
yearerrors
First I wish for the loop to iterate over file names of data.
Specifically over the years 1988-2006 in the place where 1988 is
placed right now in the binding statement. x is a list of data files
inputted into R and the 1988 is part of the file name. So, I have
file names starting with 1988,1989,...,2006.
yields is a numeric vector and I would like to input the indices of
the vector into the function RMSE2 as indicated in the loop. For
example, over the first iteration I wish for the indices 1 to the
number of columns in binding to be used. Then for the next iteration
I want the first index to be 1 more than what the previous iteration
ended with and continue to a number equal to the number of columns in the next binding
statement. I just don't know if what I have written will accomplish
this.
Finally, I wish to store each of these results in the vector
yearerrors and then access this vector afterwards.
Thanks so much in advance!
OK, there's a heck of a lot of guesswork here because the structure of your data is extremely unclear, and I have no idea what the RMSE2 function is (you've given no detail). Based on your question the other day, I'm going to assume that your data is in .csv files. I'm going to have a stab at your problem.
I would start by building the combined dataframe while reading the files in, not doing one then the other. Like so:
library(plyr) #provides mdply
#Set your working directory to the folder containing the .csv files
#I'm assuming they're all in the form "YEAR.something.csv" based on your pattern matching
filenames <- list.files(".", pattern="\\.csv$") #pattern is a regular expression, not a glob; to match a specific year, add it to the pattern
years <- gsub("([0-9]+).*", "\\1", filenames)
df <- mdply(filenames, read.csv)
df$year <- as.numeric(years[df$X1]) #Adds the year
#Your column mean dataframe didn't work for me
cmeans <- as.data.frame(t(colMeans(df[,2:ncol(df)])))
It then gets difficult to know what you're trying to achieve. Since your datcmeans is a one-row data.frame, datcmeans[1,] doesn't change anything. So if your RMSE2 function requires one row of a data frame (or a numeric vector) as an argument, you can just pass it datcmeans (cmeans in my example).
Your code from then on is pretty much indecipherable to me. Without knowing what yields looks like, or how RMSE2 works, it's pretty much impossible to help more.
If you're going to use a loop here, note that setting kk=kk+ncol(binding) at the end of an iteration is not going to help you: kk=0 sits inside the loop, so kk is reset at the start of every iteration and every iteration uses the same indices, which is, I'm guessing, not what you want. Here's my guess at what you need here (assuming looping is required).
yearerrors=vector("numeric", ncol(df)) #Create empty vector ahead of loop
for(i in 1:ncol(df)) {
yearerrors[i] <- RMSE2(yields[i:ncol(df)], finvec)
}
yearerrors
I honestly can't imagine a function that would work like this, but it seems the most logical adaptation of your code.
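For the running index described in the question, one possible shape is sketched below. Everything in it is assumed: rmse() is a hypothetical stand-in for the unknown RMSE2(), and means_by_year and yields are toy data. The point is only that kk is initialised once, before the loop, and that each result is stored under its year.

```r
# Hypothetical stand-in for the asker's RMSE2(); the real function is unknown
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Toy data: one vector of column means per year (lengths may differ by year)
means_by_year <- list(
  "1988" = c(1, 2, 3),
  "1989" = c(4, 5),
  "1990" = c(6, 7, 8, 9)
)
yields <- c(1, 2, 5, 4, 5, 6, 7, 8, 11)

# Preallocate one result slot per year
yearerrors <- numeric(length(means_by_year))
names(yearerrors) <- names(means_by_year)

kk <- 0  # initialised once, outside the loop, so it carries over
for (yr in names(means_by_year)) {
  finvec <- means_by_year[[yr]]
  idx <- (kk + 1):(kk + length(finvec))  # next unused block of yields
  yearerrors[yr] <- rmse(yields[idx], finvec)
  kk <- kk + length(finvec)              # advance past the block just used
}
yearerrors
```

With real data, the body of the loop would build finvec from the files whose names start with yr, for example by pasting yr into the grep() pattern.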

Resources