Referencing last used row in a data frame - r

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the two lines of code below, which read data from Excel over a specific range (using readxl for this). The range itself only extends through row 2589 of the Excel document, but the file updates dynamically (it's a time series), so to make sure I capture new observations (rows) as they're added, I've extended the range argument of read_excel down to row 10000.
In the end, I'd like to run charts on this data, but a key part of that is identifying the last used row without manually updating the code for the latest date. I've tried using nrow, but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "Returns!A6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.

It would be easier to answer your question if you included an example.
Not knowing what your data looks like, answers are likely to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to drop the empty rows with
na.omit(Raw_Index_History)
It appears you also have control over the Excel spreadsheet, so in case your data does contain NAs, you could put a placeholder value in the empty rows that gets overwritten as soon as a new data point is recorded. That lets you filter your data frame accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]

If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
sheet = 1,
range = cell_cols("A:P"), # Only cols, no rows
col_names = TRUE)
Every time you run the code, R will pull in the data from columns A:P up to the last populated row.
This is a more elegant fit for your use case. (Consider what you'd do when your data crosses 10000 rows in the future.)

Related

Sorting a column of values based on index location

I am currently working with a large amount of data. For testing purposes I am using a smaller batch, but my main concern is sorting all the data based on the values in one particular row. I have posted a picture below that shows a small portion of my unsorted data. I want to sort the values in row 2 in ascending order, along with all the other data in the corresponding columns. In other words, I don't want to just order row 2; I want to order row 2 and shift all the other data with those re-ordered values.
Currently what I do is read that csv into a data frame (tmpDF).
After that I transpose the data using tmpDF <- t(tmpDF)
Now I take that data and order the second column into ascending order (or at least that is what I think I am doing): tmpDF <- tmpDF[order(tmpDF[,1]),]
Then I re-transpose the data to get it back to its original orientation, but sorted. The result is shown in the picture below, "Ordered data result". Keep in mind that the numbers differ between the unsorted and sorted pictures because I have not posted my entire data set.
I have a few questions about this.
1) Am I going about this the correct way? I am not a very experienced programmer; I'm just trying to teach myself R to help out my research efforts.
2) Why are values such as "102" represented as "1.01E+02" in my final sorted csv file? I don't believe I am changing the type, and in the original file they were represented as "102".
3) Why does the value 116 get ordered before "1.01E+02"?

NA Values Appear for All Data in Imported .csv File

I imported a set of data into RStudio containing 85 variables and 139 observations. All values are integers, except for the last column, which is blank and for some reason was imported alongside everything else from the .csv file I created from an .xls file.
As such, this last column is all NA values. The problem is that when I try to run any kind of analysis, it seems to read all values as NA. Despite this, everything looks fine in the data window in RStudio. Are there solutions to this problem that don't involve changing the data? Or is it almost certainly the data that's the problem?
It seems strange, given that the file looks fine when opened anywhere else, and even when viewed in R.
The most likely issue is that the file is being imported as all text rather than as numeric data. If all of the data is numeric, you can just pass colClasses = "numeric" to read.csv() and it should import correctly. You could also change the data class once it is in R, or give colClasses a vector of different classes if your file has a variety of data types (logical, character, numeric, etc.).
Edit
Seeing as colClasses is not working (it is hard to say why without looking at your data), you can try this:
MyDF <- data.frame(sapply(MyDF, FUN = as.numeric))
where MyDF is your data frame. That will convert all of your columns to numeric. If you have character/factor/logical values in there, this may not work as expected. You might want to check your Excel/csv file to see why it is importing an NA column. It could be that a cell with a space in it is being pulled in and throwing things off. You could always try deleting that empty column and retrying your import.
If you want to omit the last column while reading the data itself, you can try the following code. In this example, I am assuming that your file has 5 columns and the 5th column has the NA values, so you want to skip reading the 5th column in your data set:
data <- read.csv(fileName, ...)[, 1:4]
or, if you want to use column names, you can use:
data <- read.csv(fileName, ...)[, c('col1', 'col2', 'col3', 'col4')]
This will read all the observations from the selected columns within your data set.
Hope this helps.
If you are trying to find the mean and standard deviation, you can use
Data <- mean(dataframe$colname, na.rm = TRUE)
Data1 <- sd(dataframe$colname, na.rm = TRUE)
This will give you the answer after omitting the NA values from the column.

How to delete all rows in R until a certain value

I have several data frames that start with a bit of text. Sometimes the information I need starts at row 11 and sometimes at row 16, for instance; it varies. What all the data frames have in common is that the useful information starts after a row with the title "location".
I'd like to write a loop that deletes all the rows in each data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn, n = -1, target = "location", ...) {
  r <- readLines(fn, n = n)            # read the file (up to n lines) as raw text
  locline <- grep(target, r)[1]        # index of the first line matching the target
  read.table(fn, skip = locline, ...)  # re-read, skipping everything up to and including it
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (@MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...

How to build dataframe of variable search strings for web scraping

I'm trying to build a data frame to help me paginate some simple web scraping. What is the best way to build a data frame where each row uses the same base URL string but varies a few specific characters, which can be set according to the pagination one needs?
Let's say you have a set of search results with 4485 results in total, 10 per page, spread over 449 pages. All I want for the moment is a data frame with one variable, where each row is a character string of the URL with a varying, sequenced page number, along the lines of:
**Var1**
http://begin.com/start=0/index.html
http://begin.com/start=10/index.html
http://begin.com/start=20/index.html
http://begin.com/start=30/index.html
...
http://begin.com/start=4480/index.html
Here's my original strategy, but it fails (and yes, it's inefficient and newbish):
startstring <- "http://begin.com/start="
variableterm <- seq(from = 0, to = 4485, by = 10)
endstring <- "/index.html"
df <- as.data.frame(matrix(nrow = 449, ncol = 1))
for (x in 1:length(variableterm)){
  for (i in variableterm){
    df[x, ] <- c(paste(startstring, i, endstring, sep = ""))
  }
}
But every single row ends up equal to http://begin.com/start=4480/index.html. How can I change this so that each row gives the same URL but with a different, increasing number, as in the desired data frame above?
I would very much appreciate how to achieve this with my strategy (just to learn) but of course better approaches are welcome also. Thanks!
I am not sure why you would need this to be in a data frame. Here is one way to create a vector of the page URLs:
sprintf("http://begin.com/start=%s/index.html", seq(0, 4480, 10))
The reason every row returns the same value (the last one) is that you have two loops where you only need one. The outer loop walks the rows of the data frame, while the inner loop runs through the entire set of URLs on every pass, so the last URL is what remains in each row before the outer loop moves to the next row.
This should work as you would expect:
for (i in 1:length(variableterm)){
  df[i, ] <- paste(startstring, variableterm[i], endstring, sep = "")
}

Trouble getting my data into wide form with the reshape package

I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files: one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2, for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape):
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probabilities for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but that is not an option with a file this size (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows: 3 with the column "variable" equal to "probability" and 3 with it equal to "time". You want probability to be the value of each cell, with time added on to the right.
I think there's a difficulty in making this work because what you want to do isn't entirely coherent. You have values for each UID-Time-X### cell and values for each UID-Prob-X### cell, so you would have to discard information to get the data into your preferred format (UID-Time-X### with probabilities as the values). It seems to me that you're treating time as an ID variable, but it is storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
which is simply your data reshaped wide.
