How do I reference previous/next rows when iterating through a data frame in R? - r

I have a dataset that looks like this (I'm simplifying slightly here):
Column 1 has a user id
Column 2 has a url title
Column 3 has an actual url
The data is already ordered by user and time: User 1 and all the URLs they visited in ascending order of time, then User 2 and the URLs they visited in ascending order of time, and so on.
What I'm trying to do is loop through the dataset and look for "triplets" where the first row's URL doesn't contain my keyword (something like google or facebook or nytimes or whatever), the second row's URL does contain my keyword, and the third row's URL doesn't contain my keyword. Basically I'm checking to see which websites users visited before and after any specific website.
I've figured out I can look for the keyword using:
if(length(grep("facebook",url)) > 0)
But I haven't been able to figure out how to loop through the code and achieve what I'm trying to do.
If you could break your response into two parts, I would really appreciate it:
Part 1: Is there any way to loop through a dataframe and have access to all the columns? I was able to work on a single column with this code:
new_data <- data.frame(url = character(0))
for (url in data$url) {
  if (length(grep("keyword", url)) > 0) {
    new_data <- rbind(new_data, data.frame(url = url))
  }
}
This approach is limited, though, because I can only reference a single column in my data frame. What's the better solution here? I tried:
for (row in data) and then referencing columns by row[column_number] and row['column_name'] to no avail
I also tried for (i in 1:nrow(data)) and then referencing columns using data[i,column_number], and that didn't work either. (That should have worked, right?) I figured that if this method worked I could use i-1 and i+1 to access other rows! I know this isn't the traditional way of doing things in R, but if you could still offer an explanation of how to do it this way I would really appreciate it.
Part 2: How do I accomplish my actual goal, as stated earlier? I'd like to learn to do it the "R way"; I imagine it's going to involve plyr or lapply, but I haven't managed to figure out how to use those functions even after extensive reading, let alone use them while including references to previous/next rows.
Thanks in advance for your help, any guidance is appreciated!

First flag every row whose URL contains the keyword, then loop over the interior rows and compare each row with its neighbours:
last <- nrow(df)
penu <- last - 1
df$ContainsKeyword <- FALSE
df$ContainsKeyword[grep("keyword", df$url)] <- TRUE
df$TripletFound <- NA
for (i in 2:penu) {
  # TRUE when the middle row matches the keyword but its neighbours don't
  df$TripletFound[i] <- !df$ContainsKeyword[i - 1] &
    df$ContainsKeyword[i] & !df$ContainsKeyword[i + 1]
}
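The loop is easy to follow, but in R this kind of neighbour comparison is usually vectorized with shifted indexing instead, which avoids the explicit loop entirely. A minimal sketch, assuming the ContainsKeyword column from above; note that with several users in one data frame you would also want to require that user_id matches across all three rows:
k <- df$ContainsKeyword
n <- length(k)
# compare each interior row with its two neighbours in one shot;
# the first and last rows have no complete triplet, hence the NAs
df$TripletFound <- c(NA, !k[1:(n - 2)] & k[2:(n - 1)] & !k[3:n], NA)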

Related

Addressing columns based on only part of the name in order to simplify lines

My first question here, and I am not very experienced; however, I hope this question is easy enough to answer, since I only want to know if what I describe in the title is possible.
I have multiple data frames taken from online capacity tests participants did.
For all items I have response, score, and duration variables, among others.
Now I want to delete rows where all response variables are NA. So I can't just use a command that deletes rows where everything is NA, but there are also too many columns to do it by hand. And I want to keep the data frame together while doing it, in order to really drop the complete rows, so just extracting all response variables doesn't sound like a good option.
However, apart from a 3-digit number based on the specific item, the response variable names are basically the same.
So instead of writing a very long, impractical line mentioning all response variables and dropping the row if they all contain NA, is there a way to not use the full name of a variable, but only the end of the name, for example, so that R checks the condition for all variables ending that way?
A simplified example: instead of
newdf <- olddf[!(is.na(olddf$item123response) & is.na(olddf$item131response) & etc),]
can I just do something like newdf <- olddf[!is.na(olddf$xxxresponse),] ?
I tried to google an answer but I didn't know how to frame my question effectively.
Thanks in advance!
Try this:
newdf <- olddf[complete.cases(olddf[, grep('response', names(olddf))]), ]
Note that complete.cases() keeps only the rows with no NA in any of the matched response columns; it drops a row as soon as a single response is missing.
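If you want exactly what the question asks for (drop a row only when every response column is NA), a small variant of the same idea should work — a sketch using rowSums over the matched columns:
resp <- grep('response', names(olddf))
# keep rows where at least one response column is not NA
# (drop = FALSE keeps this working even if only one column matches)
newdf <- olddf[rowSums(!is.na(olddf[, resp, drop = FALSE])) > 0, ]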

How to reference a dynamically assigned dataframe name

I have successfully allocated data frame names and populated them (see code), but I do not know how to subsequently reference them. I loop through to assign df.test1 and populate it with some data, and so on. I know that the data frames have been created, and I can View or summary them in the console, but not in the code.
I am pretty new to R so am not sure if some of the solutions I have looked at apply to me.
num.clusters <- 5
for (i in 1:num.clusters) {
  assign(paste("df.test", i, sep=""), paste("somedata", i))
}
This works, but then I want to do something like:
View(df.test,i)
to view whatever iteration from 1 to 5.
I want to be able to use the assigned data frames like any other data frame. I could hard-code this as View(df.test1), but that would defeat the point. I also want to do other things with the data frame, e.g. subsetting.
I know this doesn't work. Would love to know what does.
Many thanks...
Your question is proof that the approach is problematic: avoid using assign in general, because it makes accessing the variables afterwards awkward (among other issues).
A cleaner way is to just put your "data frames" (copying from your example) in a list:
num.clusters <- 5
df.test <- list()
for (i in 1:num.clusters) {
  df.test[[i]] <- paste("somedata", i)
}
Then you would just access them like this:
View(df.test[[i]])
If what you put in there was an actual data.frame (and not the strings you were using), you could then access its columns like any other data.frame:
df.test[[i]]$Name
Or
df.test[[i]][, "Name"]
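For completeness: if you are stuck with objects that were already created via assign, you can retrieve them by constructing the name and using get(), or fetch several at once with mget() — though restructuring around a list, as above, is still the cleaner fix:
# retrieve one dynamically named object
df.i <- get(paste("df.test", i, sep=""))
# or collect all of them into a single list in one call
all.tests <- mget(paste("df.test", 1:num.clusters, sep=""))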

Assigning observation name to a value when retrieving a variable

I want to create a data frame that contains > 100 observations on ~20 variables. This will be based on a list of HTML files which are saved to my local folder. I would like to make sure that R matches the correct value per variable to each observation. Assuming that R goes through the files in the same order when constructing each variable AND does not skip variables in case of errors or the like, this should happen automatically.
But is there a "safe way" to do this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
# Requires the rvest package
library(rvest)

# Specifying the URL for the desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

# Reading the HTML code from the website
webpage <- read_html(url)

title_data_html <- html_text(html_nodes(webpage, '.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage, '.text-primary'))
description_data_html <- html_text(html_nodes(webpage, '.ratings-bar+ .text-muted'))

df <- data.frame(title_data_html, rank_data_html, description_data_html)
This comes up with a list of rank and description data, but no reference to the observation name for rank or description (before binding them in the df). Now, in my actual code one variable suddenly comes up with one value too many, so 201 descriptions when there are only 200 movies. Without a reference to which movie each description belongs to, it is very tough to see why that happens.
A colleague suggested extracting all variables for one observation at a time and extending the data frame row-wise (one observation at a time), instead of extending it column-wise (one variable at a time), but spotting errors and per-variable clean-up needs seems far more time-consuming that way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!
I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element that has an attribute that is also matching your xpath selector text. Look at both the website as it appears in a browser, as well as the source. Right click, CTL + U (PC), or OPT + CTL + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.
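That said, when each result does live inside a shared container element, a common way to keep fields aligned is to select the container nodes first and then look up each field within them; html_node() (singular) returns NA for a missing match instead of silently dropping it, so rows stay aligned. A sketch, assuming the .lister-item-content class wraps each result on this page:
library(rvest)
# one container node per movie in the result list
items <- html_nodes(webpage, '.lister-item-content')
# looking fields up inside each container keeps one row per movie,
# with NA where a field is absent
df <- data.frame(
  title       = html_text(html_node(items, '.lister-item-header a')),
  rank        = html_text(html_node(items, '.text-primary')),
  description = html_text(html_node(items, '.ratings-bar+ .text-muted')),
  stringsAsFactors = FALSE
)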

How to build dataframe of variable search strings for web scraping

I'm trying to build a dataframe that will help me paginate some simple web scraping. What is the best way to build a dataframe where each row uses the same base URL string but varies a few specific characters, which can be specified according to the pagination one needs.
Let's say you have a set of search results where there is a total of 4485 results, 10 per page, spread out over 449 pages. All I want for the moment is to make a dataframe with one variable where each row is a character string of the URL with a variable, sequenced page number along the lines of:
**Var1**
http://begin.com/start=0/index.html
http://begin.com/start=10/index.html
http://begin.com/start=20/index.html
http://begin.com/start=30/index.html
...
http://begin.com/start=4480/index.html
Here's my original strategy but this fails (and yea it's inefficient and newbish).
startstring <- "http://begin.com/start="
variableterm <- seq(from=0, to=4485, by=10)
endstring <- "/index.html"
df <- as.data.frame(matrix(nrow=449, ncol=1))
for (x in 1:length(variableterm)) {
  for (i in variableterm) {
    df[x,] <- c(paste(startstring, i, endstring, sep=""))
  }
}
But every single row is equal to http://begin.com/start=4480/index.html. How can I change this so that each row gives the URL with a different, increasing page number, as in the desired data frame above?
I would very much appreciate how to achieve this with my strategy (just to learn) but of course better approaches are welcome also. Thanks!
I am not sure why you would need this to be in a data frame. Here is one way to create a vector of page urls.
sprintf("http://begin.com/start=%s/index.html", seq(0, 4480, 10))
The reason every row ends up with the same value (the last URL) is that you have two loops where you only need one: for each row, the inner loop writes every URL into that row in turn, so only the last URL written remains before the outer loop moves on to the next row.
This should work as you would expect:
for (i in 1:length(variableterm)) {
  df[i,] <- paste(startstring, variableterm[i], endstring, sep="")
}
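And if you do want the result as a one-column data frame in a single step, a minimal sketch combining the two approaches above:
urls <- sprintf("http://begin.com/start=%s/index.html", seq(0, 4480, 10))
df <- data.frame(Var1 = urls, stringsAsFactors = FALSE)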

Subsetting a data frame based on a factor level that contains a particular variable in at least one row

I am quite new to R and I always try looking up a solution before asking (so far I have never had to ask, because a solution was already provided somewhere on the internet). That being said, I have trouble even coming up with a search query for my problem.
I have individual pageview data from several websites (see the example below; sorry if it does not meet the usual formatting criteria — the 3rd row shows an example of a missing URL). The data frame is called a and is loaded through read.csv:
a <- read.csv("201311.csv", sep=",", col.names=c("Timestamp","user_id","url"))
which results in:
Timestamp     user_id   url
2013-11-01 176b24938a domain1.xy/z/66546,66546
2013-11-01 6785504947 domain2.xy/z/66346,66346
2013-10-31 0717e6b5dc
I have all the data lumped together in a file with 55M rows. I need to split this file into an individual file for each website. The trouble is, not every pageview has a recorded URL (technical issues); in fact over 20% of the pageviews are missing a URL. Hypothetically there should be little to no overlap between the users of the sites.
I am able to subset the observations with recorded urls through the grepl() function quite easily through:
b <- subset(a,grepl("domain1\\.xy",a$url))
Now my first notion would be to assign the pageviews to individual sites through user_ids, whenever the user_id has at least one pageview with a recorded URL. The trouble is, I have no idea where to begin in R.
The example of an ideal outcome would be as follows (for domain1):
Timestamp     user_id  url
2013-11-01 176b24938a domain1.xy/z/66546,66546
2013-11-05 6785504949 domain1.xy/z/66346,66346
2013-10-31 0717e6b5dc
Thanks for any help and I apologize if this post doesn't follow the usual format.
You should first filter your data to remove rows with missing values. Since you don't give reproducible data, it is hard to know if you have real missing values (NA) or just empty url strings.
dat <- dat[!(is.na(dat$url) | nchar(dat$url)==0),]
Then you can process by url. You have many options, for example using by:
by(dat, dat$url, function(x){
  # note: real URLs contain '/' characters, so in practice you would
  # sanitize them before using them as file names
  fileName <- sprintf("file%s.csv", unique(x$url))
  write.csv(x, fileName)
})
Since you did not give an example dataset with the requested output, I will have to guess:
# generate some data
data <- "Timestamp;user_id;url
2013-11-01;176b24938a;domain1.xy/z/66546,66546
2013-11-01;6785504947;domain2.xy/z/66346,66346
2013-10-31;0717e6b5dc;
2013-12-01;6785504947;"
data <- read.csv2(textConnection(data))
data$url[data$url == ""] <- NA
# select records with url
url <- data[!is.na(data$url), c("user_id", "url")]
# remove duplicate records
url <- url[!duplicated(url), ]
It is possible that some users have visited multiple sites. In the next few lines I removed these. This would, however, be a good time to check your assumption.
# remove user_id with different url's
duplicated_users <- url$user_id[duplicated(url$user_id)]
url <- url[!(url$user_id %in% duplicated_users), ]
Finally, we can use the URLs in the url data frame to impute the missing URLs in the original data set:
data$url2 <- data$url
m <- match(data$user_id, url$user_id)
sel <- is.na(data$url)
data$url2[sel] <- url$url[m[sel]]
Step-by-step explanation of the previous code block:
First, create a copy of the url column. When imputing new values it is usually a good idea to keep the original values.
Match the user ids of the url data frame to those in the original data frame. This gives a vector of indices, which can be used to index url.
Create a vector that selects the records with missing URLs. We only want to impute new values for those.
By doing m[sel] we get the indices of the records in url that correspond to the records with missing URLs. When a user with a missing URL has not visited another site, this index is NA. We then use these indices to select the URLs from url.
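With url2 filled in, the final step from the question (one file per site) could look like the following — a sketch assuming the site name is everything before the first slash of the URL:
# derive the site from the (possibly imputed) URL
data$domain <- sub("/.*$", "", data$url2)
# write one CSV per site; rows that still have no URL are skipped
for (d in unique(na.omit(data$domain))) {
  write.csv(data[data$domain %in% d, ], sprintf("%s.csv", d), row.names=FALSE)
}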
