I have a large data.frame called rain with information on many species measured in different plots at different times (censuses). The data frame has many columns. I want dataF2 to keep the same structure but contain only the rows of rain from the penultimate census (Census.No is one of the columns of rain) in each plot (Plot.Code is another). idx3 holds the number of the penultimate census for each plot.
It's easy to do it for one plot
data1<- rain[Plot.Code==idx3[1,1] & Census.No==idx3[1,2],]
I've been trying to do this with a for loop in R, but I keep overwriting my data.frame and end up with only the result of the last iteration.
dataF2<- data.frame(nrow= nrow (rain), ncol = ncol (rain))
summary (dataF2)
for (i in 1:length (idx3[,1])){
dataF2<- rain[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
}
Here I want to extract from a data frame the information of the penultimate census in each plot (idx3 records which census was the penultimate one in each plot).
I've tried many things, like:
dataF2<- data.frame(nrow= nrow (rain), ncol = ncol (rain))
for (i in 1:length (idx3[,1])){
data1<- rainfor[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
dataF2<- rbind (data1[i])
}
But nothing worked. My problem is that it keeps overwriting dataF2!
Cheers!!!
Your clarifications in the comments helped somewhat, but reproducible examples are always better. Let's start at the beginning:
dataF2<- data.frame(nrow= nrow (rain), ncol = ncol (rain))
This is wrong. I think that you're trying to create an empty data frame with the same dimensions as your data frame rain. If you examine dataF2 you'll see that this is far from what you have done with this line. If you read the documentation for the function ?data.frame it will become clear that there are no arguments called nrow and ncol. What you probably intended was something like this:
dataF2 <- rain
dataF2[] <- NA
Inside your for loop you are overwriting your entire data frame because....you are overwriting your entire data frame.
dataF2<- rain[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
This assigns something to dataF2, replacing it completely. If you want to assign to just a single row of dataF2 you need to assign to that specific row:
dataF2[i,] <- rain[Plot.Code==idx3[i,1] & Census.No==idx3[i,2],]
I can't guarantee that this will work correctly, since you haven't provided a sufficiently detailed example, so I'm not sure that all the dimensions will line up properly when you index on i. But this is the basic idea.
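Since each plot/census pair probably matches several rows of rain (one per species), assigning into a single row may not fit. Here is a minimal sketch of an alternative that collects the pieces in a list and binds them once at the end. It assumes idx3 has the plot code in column 1 and the penultimate census number in column 2; the toy rain and idx3 below are made up for illustration.

```r
# Toy stand-ins for the real data (hypothetical values)
rain <- data.frame(
  Plot.Code = c("A", "A", "A", "B", "B", "B"),
  Census.No = c(1, 2, 2, 1, 1, 2),
  Species   = c("s1", "s1", "s2", "s1", "s2", "s1"),
  stringsAsFactors = FALSE
)
idx3 <- data.frame(Plot.Code = c("A", "B"), Census.No = c(2, 1),
                   stringsAsFactors = FALSE)

# Collect the matching rows for each plot in a list, one element per plot
pieces <- lapply(seq_len(nrow(idx3)), function(i) {
  rain[rain$Plot.Code == idx3[i, 1] & rain$Census.No == idx3[i, 2], ]
})

# Bind all pieces together once, after the loop
dataF2 <- do.call(rbind, pieces)
```

If the two columns of idx3 are named Plot.Code and Census.No, merge(rain, idx3) should select the same rows in a single call.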
I hope you are having a good day.
I'm trying to scrape Trustpilot-reviews in the sports-section.
I want four columns with number of reviews, trustscore, subcategories and companynames.
There are 43 pages it should iterate over, with 20 companies in each page.
After an iteration the data should be placed underneath the previous data. This can be cleaned up afterwards using filtering though.
The important part, and what I suspect is my problem, is getting everything put together at the end.
The code as-is produces the error
"Error in .subset2(x, i, exact = exact) : subscript out of bounds"
If you know anything about this, some pointers on how the code can be corrected would be appreciated.
Here is the code I'm having trouble with:
Trustpilot_company_data <- data.frame()
page_urls = sprintf('https://dk.trustpilot.com/categories/sports?page=%s&status=all', 2:43)
page_urls = c(page_urls, 'https://dk.trustpilot.com/categories/sports?status=all')
for (i in 1:length(page_urls)) {
session <- html_session(page_urls[i])
trustscore_data_html <- html_nodes(session,'.styles_textRating__19_fv')
trustscore_data <- html_text(trustscore_data_html)
trustscore_data <- gsub("anmeldelser","",trustscore_data)
trustscore_data <- gsub("TrustScore","",trustscore_data)
trustscore_data <- as.data.frame(trustscore_data)
trustscore_data <- separate(trustscore_data, col="trustscore_data", sep="·", into=c("antal anmeldelser", "trustscore"))
number_of_reviews<- trustscore_data$`antal anmeldelser`
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]") %>%
as.numeric(number_of_reviews)
trustscores <- trustscore_data$trustscore
Trustpilot_company_data[[i]]$trustscores <- trimws(trustscores, whitespace = "[\\h\\v]") %>%
as.numeric(trustscores)
subcategories_data_html <- html_nodes(session,'.styles_categories__c4nU-')
subcategories_data <- html_text(subcategories_data_html)
Trustpilot_company_data[[i]]$subcategories_data <- gsub("·",",",subcategories_data)
company_name_data_html <- html_nodes(session,'.styles_businessTitle__1IANo')
Trustpilot_company_data[[i]]$company_name_data <- html_text(company_name_data_html)
Trustpilot_company_data[[i]]$company_name_data <- rep(i,length(Trustpilot_company_data[[i]]$company_name_data))
}
Best regards
Anders
There seem to be several things going on here.
First, as a rule, growing a data frame this way is not good practice.
Second, in this case you seem to be trying to add the new element for each column one at a time, which makes things more awkward for you. And you are trying to access the data frame as if it were a list. So, for example, this isn't going to work:
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]")
Trustpilot_company_data is a data frame, so it has rows and columns. So to access a particular row and column with [] you say e.g. dat[5,10] for the fifth row and tenth column of dat. Instead you are trying to use [[i]] which is the syntax for accessing the elements of a list. In this case you'd need to write e.g.
Trustpilot_company_data[i, "number_of_reviews"]
to access the thing you're trying to get at.
Third, doing this one column at a time is a bad idea. If you're going to try to grow a data frame, assemble each new mini-data-frame completely first and then add it to the bottom with rbind(). E.g.,
df <- data.frame()
for(i in 1:5) {
new_piece <- data.frame(a = i,
b = i,
c = i)
df <- rbind(df, new_piece)
}
But fourth and most important, don't grow data frames in this way in the first place. Instead, see for example this answer.
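One common pattern that avoids growing a data frame is to build every piece in a list and bind them once at the end. A minimal sketch, reusing the toy columns a, b and c from the loop above; make_piece is a hypothetical helper standing in for the per-page scraping code.

```r
# Hypothetical helper: returns the complete mini-data-frame for page i
make_piece <- function(i) {
  data.frame(a = i, b = i * 2, c = i * 3)
}

# Build all pieces first, then bind them in a single call
pieces <- lapply(1:5, make_piece)
df <- do.call(rbind, pieces)
```

Each call to rbind() inside a loop copies everything accumulated so far, so binding once at the end tends to scale better as the number of pages grows.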
heatmap(Web_Data$Timeinpage)
str(Web_Data)
heat = c(t(as.matrix(Web_Data$Timeinpage[,-1])))
heatmap(heat)
A few items to note here:
1) By wrapping the expression in c(), as in c(t(as.matrix(Web_Data$Timeinpage[,-1]))), you are creating a single vector and not a matrix. You can see this by running the following: is.matrix(c(t(as.matrix(Web_Data$Timeinpage[,-1])))). heatmap (I believe) is checking for a matrix because...
2) You need to provide a matrix with at least two rows and two columns for this function to work. Currently, you are only giving it one vector, time. You will need to provide some other feature of interest to have it work correctly, such as Continent.
3) If you intend to plot ONLY one field, you may consider doing as suggested here and use the image() function. (I included an example below).
4) I find the heatmap function somewhat dated in look. You may want to consider other popular functions, such as ggplot's geom_tile. (see here).
Below is an example code that should produce an output:
#fake data
Web_Data <- data.frame("Timeinpage" = c(123,321,432,555,332,1221,2,43,0, NA,10, 44),
OTHER = rep(c("good", "bad"), 6) )
#a matrix with TWO rows from my data frame; OTHER is converted to numeric codes because heatmap needs a numeric matrix. Notice I am not transposing. Also removing the , from [,-1]
heat <- matrix(c(Web_Data$Timeinpage[-1], as.numeric(factor(Web_Data$OTHER[-1]))), 2, 11, byrow = TRUE)
#output
heatmap(heat)
#one row
heat2 <- as.matrix(sort(Web_Data$Timeinpage[-1])) #sorting as well
#output
image(heat2)
I've been working on scraping data from the web and manipulating it in R, but I'm having some trouble when I begin to include the dreaded for loop. I am working with some arbitrary sport statistics, with the idea being that I can calculate the per game average for various stats for each player.
Each of these pieces works outside of the loop, but falls apart inside. Ideally, my code will do three things:
1) Scraping the data. I have a list of player names "names" (by row), and the unique piece of the url for each player in column 2. The website has a table named "stats" on each page, which is nice of them.
library(XML)
statMean <- matrix(ncol = 8, nrow = 20)
for(h in 1:20){
webname <- names[h,2]
vurl <- paste("http://www.pro-football-reference.com/players/",
webname, "/gamelog/2015")
tables <- readHTMLTable(vurl)
t1 <- tables$stats
2) Pick the columns I want and turn the values into numeric values.
temp <- t1[, c(9,10,12,13,14,15,17)]
temp <- sapply(temp, function(x) as.numeric(as.character(x)))
3) Calculate the mean of each column, bind the unique player name to the column means, and rbind that vector to a full table.
statMean <- rbind(statMean, c(webname, colMeans(temp)))
When I run through these steps outside of the Loop, it seems to work ... when I run the loop I inevitably get:
Error in colMeans(temp) : 'x' must be an array of at least two dimensions
I've looked at a number of For Loop questions on this site, and my code looks a lot different than how it started. Unfortunately, each time I try something new, I end up with a version of the above error.
Any help would be fantastic. Thanks.
I'm very new to R - but have been developing SAS-programs (and VBA) for some years. Well, the thing is that I have 4 lines of R-code (scripts?) that I would like to repeat 44 times. Two times for each of 22 different train stations, indicating whether the train is in- or out-going. The four lines of code are:
dataGL_FLIin <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FLIin))
names(dataGL_FLIin)[names(dataGL_FLIin)=='FLIin'] <- 'GL_Antal'
dataGL_FLIin$DIR<-"IN"
dataGL_FLIin$STATION<-"FLI"
To avoid repeating the 4 lines 44 times I need 2 "macro variables" (yes, I'm aware, that this is a SAS-thing only, sorry). One "macro variable" indicating the train station and one indicating the direction. In the example above the train station is FLI and the direction is in. Below the same 4 lines are demonstrated for the train station FBE, this time in out-going direction.
dataGL_FBEout <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FBEout))
names(dataGL_FBEout)[names(dataGL_FBEout)=='FBEout'] <- 'GL_Antal'
dataGL_FBEout$DIR<-"OUT"
dataGL_FBEout$STATION<-"FBE"
I have looked in many places and tried many combinations of R functions and R lists, but I can't make it work. Quite possibly I'm getting it all wrong. I apologize in advance if the question is (too) stupid, but I will be very grateful for any help on the matter.
Please note that, in the end, I want 44 different data frames created:
1) dataGL_FLIin
2) dataGL_FBEout
3) Etc. ...
ADDED: 2 STATION 2 DIRECTIONS EXAMPLE OF MY PROBLEM
'The one data frame I have'
Date<-c("01-01-15 04:00","01-01-15 04:20","01-01-15 04:40")
FLIin<-c(96,39,72)
FLIout<-c(173,147,103)
FBEin<-c(96,116,166)
FBEout<-c(32,53,120)
dataGL_all<-data.frame(Date, FLIin, FLIout, FBEin, FBEout)
'The four data frames I would like'
GL_antal<-c(96,39,72)
Station<-("FLI")
Dir<-("IN")
dataGL_FLIin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(173,147,103)
Station<-("FLI")
Dir<-("OUT")
dataGL_FLIout<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(96,116,166)
Station<-("FBE")
Dir<-("IN")
dataGL_FBEin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(32,53,120)
Station<-("FBE")
Dir<-("OUT")
dataGL_FBEout<-data.frame(Date, Station, Dir, GL_antal)
Thanks,
lars
With your example, it is now clearer what you want, so I'll give it a second try. I use dataGL_all as defined in your question and then define
stations <- rep(c("FLI","FBE"),each=2)
directions <- rep(c("in","out"),times=length(stations)/2)
You could also extract the stations and directions from your data frame. Using your example, the following would work
stations <- substr(names(dataGL_all)[-1],1,3)
directions <- substr(names(dataGL_all)[-1],4,6)
Then, I define the function that will work on the data:
dataGLfun <- function(station,direction) {
name <- paste0(station,direction)
dataGL <- dataGL_all[,c("Date", name)]
names(dataGL)[names(dataGL)==name] <- 'GL_Antal'
dataGL$DIR<-direction
dataGL$STATION<-station
dataGL
}
And now I apply this function to all stations with both directions:
dataGL <- mapply(dataGLfun,stations,directions,SIMPLIFY=FALSE)
names(dataGL) <- paste0(stations,directions)
Now, you can get the data frames for each combination of station and direction. For instance, the two examples in your question, you get with dataGL$FLIin and dataGL$FBEout. The reason that there is a $ instead of a _ is that I did not actually create a separate variable for each data frame. Instead, I created a list, where each element of the list is one of the data frames. This has the advantage that it will be easier to do something to all the data frames later. With your solution, you would have to type all the various variable names, but if the data frames are in a list, you can work with them using functions like lapply.
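As a small illustration of that last point, here is a sketch of working on every data frame in the list at once. The toy dataGL below is made up to have the same shape as the list built above, with values taken from your example.

```r
# Toy list shaped like dataGL above (values from the question's example)
dataGL <- list(
  FLIin  = data.frame(GL_Antal = c(96, 39, 72),  DIR = "in",  STATION = "FLI"),
  FBEout = data.frame(GL_Antal = c(32, 53, 120), DIR = "out", STATION = "FBE")
)

# Apply one function to every data frame: total count per station/direction
totals <- sapply(dataGL, function(d) sum(d$GL_Antal))

# Or stack all data frames into one long data frame
dataGL_long <- do.call(rbind, dataGL)
```

With 44 separate variables, each of these steps would have to name every data frame by hand.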
If you prefer to have many different variables, you could do the following
for (i in seq_along(stations)) {
assign(paste0("dataGL_",stations[i],directions[i]), dataGLfun(stations[i],directions[i]))
}
However, in my opinion, this is not how you should solve this problem in R.
So I have some lidar data that I want to calculate some metrics for (I'll attach a link to the data in a comment).
I also have ground plots that I have extracted the lidar points around, so that I have a couple hundred points per plot (19 plots). Each point has X, Y, Z, height above ground, and the associated plot.
I need to calculate a bunch of metrics on the plot level, so I created plotsgrouped with split(plotpts, plotpts$AssocPlot).
So now I have a data frame with a "page" for each plot, so I can calculate all my metrics by the "plot page". This works just dandy for individual plots, but I want to automate it. (yes, I know there's only 19 plots, but it's the principle of it, darn it! :-P)
So far, I've got a for loop going that calculates the metrics and puts the results in a data frame called Results. I pulled the names of the groups into a list called groups as well.
for(i in 1:length(groups)){
Results$Plot[i] <- groups[i]
Results$Mean[i] <- mean(plotsgrouped$PLT01$Z)
Results$Std.Dev.[i] <- sd(plotsgrouped$PLT01$Z)
Results$Max[i] <- max(plotsgrouped$PLT01$Z)
Results$75%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .75)])
Results$50%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .50)])
...
and so on.
The problem arises when I try to do something like:
Results$mean[i] <- mean(paste("plotsgrouped", groups[i],"Z", sep="$")). mean() doesn't recognize the paste as a reference to the vector plotsgrouped$PLT27$Z, and instead fails. I've deduced that it's because it sees the quotes and thinks, "Oh, you're just some text, I can't get the mean of you." or something to that effect.
Btw, groups is a list of the 19 plot names: PLT01-PLT27 (non-consecutive sometimes) and FTWR, so I can't simply put a sequence for the numeric part of the name.
Anyone have an easier way to iterate across my test plots and get arbitrary metrics?
I feel like I have all the right pieces, but just don't know how they go together to give me what I want.
Also, if anyone can come up with a better title for the question, feel free to post it or change it or whatever.
Try with:
for(i in seq_along(groups)) {
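Results$Plot[i] <- groups[[i]] # character name of the group; [[ ]] works whether groups is a list or a character vector
tempZ <- plotsgrouped[[groups[[i]]]][["Z"]] # look the plot up by its name instead of pasting text together
Results$Mean[i] <- mean(tempZ)
Results$Std.Dev.[i] <- sd(tempZ)
Results$Max[i] <- max(tempZ)
# column names that start with a digit or contain % need backticks
Results$`75%Avg.`[i] <- mean(tempZ[tempZ <= quantile(tempZ, .75)])
Results$`50%Avg.`[i] <- mean(tempZ[tempZ <= quantile(tempZ, .50)])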
}
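The same idea can also be written without an explicit loop. This is only a sketch, assuming plotsgrouped is the list produced by split() and every element has a numeric Z column; the toy plotpts data and the summarise_plot helper are made up for illustration.

```r
# Toy stand-in for plotpts (hypothetical values)
plotpts <- data.frame(
  AssocPlot = rep(c("PLT01", "FTWR"), each = 4),
  Z = c(1, 2, 3, 4, 10, 20, 30, 40)
)
plotsgrouped <- split(plotpts, plotpts$AssocPlot)

# Hypothetical helper: one row of metrics for one plot's points
summarise_plot <- function(p) {
  z <- p$Z
  data.frame(Mean = mean(z), Std.Dev. = sd(z), Max = max(z))
}

# One row per plot; the row names carry the plot names
Results <- do.call(rbind, lapply(plotsgrouped, summarise_plot))
```

Further metrics can be added as extra columns inside summarise_plot without touching the rest of the code.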