How to delete all rows in R until a certain value - r

I have several data frames which start with a bit of text. Sometimes the information I need starts at row 11 and sometimes it starts at row 16, for instance. It changes. What all the data frames have in common is that the useful information starts after a row with the title "location".
I'd like to make a loop to delete all the rows in the data frame above the useful information (including the row with "location").

I'm guessing that you want something like this:
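readfun <- function(fn, n = -1, target = "location", ...) {
  # Read the file as raw text (n = -1 reads every line)
  r <- readLines(fn, n = n)
  # Line number of the first line that matches the target
  locline <- grep(target, r)[1]
  # Re-read the file as a table, skipping everything up to and including that line
  read.table(fn, skip = locline, ...)
}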
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (@MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...
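A minimal sketch of a single-read variant, assuming the files are small enough to hold in memory as text. The name read_after is hypothetical; it uses the anchored "^location" pattern and stops with an error when the target is missing, rather than passing NA to skip.

```r
# Hypothetical single-read variant: read the lines once, drop everything
# up to and including the first line matching `target`, then parse the
# remaining lines from memory with read.table(text = ...).
read_after <- function(fn, target = "^location", ...) {
  r <- readLines(fn)
  hit <- grep(target, r)
  if (length(hit) == 0) {
    stop("target not found in ", fn)
  }
  read.table(text = r[-seq_len(hit[1])], ...)
}
```

Extra arguments such as sep or header are passed through to read.table, so a call might look like read_after("myfile.txt", header = TRUE).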

Related

R Code: csv file data incorrectly breaking across lines

I have some csv data that I'm trying to read in, where lines are breaking across rows weirdly.
An example of the file (the files are the same but the date varies) is here: http://nemweb.com.au/Reports/Archive/DispatchIS_Reports/PUBLIC_DISPATCHIS_20211118.zip
The csv is non-rectangular because there's 4 different types of data included, each with their own heading rows. I can't skip a certain number of lines because the length of the data varies by date.
The data that I want is the third dataset (sometimes the second), and has approximately twice the number of headers as the data above it. So I use read.csv() without a header and ideally it should pull all data and fill NAs above.
But for some reason read.csv() seems to decide that there's 28 columns of data (corresponding to the data headers on row 2) which splits the data lower down across three rows - so instead of the data headers being on one row, it splits across three; and so does all the rows of data below it.
I tried reading it in with the column names explicitly defined, but it still splits the rows weirdly.
I can't figure out what's going on - if I open the csv file in Excel it looks perfectly normal.
If I use readr::read_lines() there's no errant carriage returns or new lines as far as I can tell.
Hoping someone might have some guidance, otherwise I'll have to figure out a kind of nasty read_lines approach.
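One possible explanation, going by the read.table documentation, is that the number of columns is taken from the first five lines of input unless col.names is longer, so wider rows further down get wrapped. A minimal sketch of a workaround, assuming a plain comma-separated file; read_ragged_csv is a hypothetical name, and this has not been tried against the linked file.

```r
# Hypothetical helper: find the widest row first, then tell read.csv
# to expect that many columns so long rows are not wrapped.
read_ragged_csv <- function(path) {
  # count.fields() returns the number of fields on each line
  n_cols <- max(count.fields(path, sep = ",", quote = "\""), na.rm = TRUE)
  read.csv(path, header = FALSE, fill = TRUE,
           col.names = paste0("V", seq_len(n_cols)),
           stringsAsFactors = FALSE)
}
```

Shorter rows are padded on the right, and the block of interest can then be pulled out by filtering on its leading columns.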

R Updating A Column In a Large Dataframe

I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1000 or so API calls. See the commented code below.
My overall question is, when I need to update a column in csv that is large, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)
df <- read.csv("Games.csv")
for (i in 1:nrow(df)) {
  data <- read_json(df$urls[i])
  if (data$analysisLogExists == TRUE) {
    df$Analyzed[i] <- 1
  }
  if (data$analysisLogExists == FALSE) {
    df$Analyzed[i] <- 0
  }
  Sys.sleep(.25)
  ## This won't work because the second time I run it I'll just reread the original lines.
  ## If I try to account for this by subsetting only the columns that haven't been updated,
  ## then this still doesn't work because the write command below will not write the whole dataset to the csv.
  if (i %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
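A minimal sketch of one resumable approach, assuming the Analyzed column is NA for every game that has not been checked yet (that convention is an assumption, not something stated in the question). update_in_chunks and check_fn are hypothetical names; each run reads the file once, checks at most chunk_size unchecked rows, and writes the whole data frame back once.

```r
# Hypothetical helper: each call handles the next `chunk_size` rows whose
# Analyzed value is still NA, so the job can be stopped and resumed.
# `check_fn` takes one URL and returns TRUE or FALSE.
update_in_chunks <- function(path, check_fn, chunk_size = 1000, pause = 0.25) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  # Rows not yet processed (assumption: NA marks "not checked")
  todo <- head(which(is.na(df$Analyzed)), chunk_size)
  for (i in todo) {
    df$Analyzed[i] <- as.integer(check_fn(df$urls[i]))
    Sys.sleep(pause)
  }
  # Write the full data frame, not just the processed subset
  write.csv(df, path, row.names = FALSE)
  invisible(length(todo))
}

# Possible usage with the call from the question (not run here):
# check <- function(u) isTRUE(jsonlite::read_json(u)$analysisLogExists)
# update_in_chunks("Games.csv", check)
```

Because the subset only decides which rows to visit, the file on disk always contains every row, and rerunning the function simply continues where the previous run stopped.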

Set up automatic process for R to read directory and process?

I am so very very new to R. Like had to look up how to open a file in R new. Diving in the deep end. Anyway
I have a bunch of .csv files with results that I need to analyse. Really, I would like to set up some kind of automation so I can just say "go" (a function?)
Basically I have results in one file that are -particle.csv and another that are -ROI.csv. They have the same names so I know which ones match up (e.g. brain1 section1 -particle.csv and brain1 section1 -ROI.csv). I need to do some maths using these two datasets - Divide column 2 rows 2-x in -particle.csv (the row number might change but is there a way of saying row "2-No more content"?) by column 1, 5, 10, etc. row 2 in -ROI.csv (the column number will always stay the same but if it helps they are all called Area1, Area2, Area3,... the number of Area columns can vary but surely there's a way I can say "every column that begins with Area"? Also the area count and the row count will always match up)
Okay, I'm fine to do that manually for each set of results, but I have over 300 brains to analyse! Is there any way I can set it up as a process that I can apply to these and future results that will be in the same format?
Sorry if this is a huge ask!
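A minimal sketch of how such a "go" function might be structured. All function names are hypothetical, and the arithmetic is one reading of the description: the values in column 2 of the particle file are divided element-wise by the first data row of every column whose name begins with "Area" in the matching ROI file.

```r
# Hypothetical helper: pair each "-particle.csv" with its "-ROI.csv".
pair_result_files <- function(dir = ".") {
  particle <- list.files(dir, pattern = "-particle\\.csv$", full.names = TRUE)
  roi <- sub("-particle\\.csv$", "-ROI.csv", particle)
  keep <- file.exists(roi)  # skip particle files with no matching ROI file
  data.frame(particle = particle[keep], roi = roi[keep], stringsAsFactors = FALSE)
}

# Hypothetical helper: do the maths for one pair of files.
process_pair <- function(particle_file, roi_file) {
  particle <- read.csv(particle_file)
  roi <- read.csv(roi_file)
  # "Every column that begins with Area", first data row only
  areas <- unname(unlist(roi[1, grep("^Area", names(roi)), drop = FALSE]))
  # Assumption: one particle row per Area column, divided element-wise
  particle[[2]] / areas
}

# The "go" function: process every matched pair in a directory.
process_all <- function(dir = ".") {
  pairs <- pair_result_files(dir)
  results <- Map(process_pair, pairs$particle, pairs$roi)
  names(results) <- basename(pairs$particle)
  results
}
```

Calling process_all("path/to/results") would return a list with one numeric vector per brain section, named after the particle file it came from.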

Referencing last used row in a data frame

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the below two lines of code, which take in data from excel in a specific range (using readxl for this). The range itself only goes through row 2589 in the excel document, but it will update dynamically (it's a time series) and to ensure I capture the different observations (rows) as they're added, I've included rows to 10000 in the read_excel range argument.
In the end, I'd like to run charts on this data, but a key part of this is identifying the last used row, without manually updating the code row for the latest date. I've tried using nrow but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "ReturnsA6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you include an example.
Without knowing what your data looks like, answers are likely to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
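Note that na.omit drops any row containing at least one NA. A minimal sketch of a gentler filter, using a made-up data frame in place of the spreadsheet: it drops only rows that are entirely NA and then takes the last remaining row.

```r
# Toy stand-in for the imported range (values are made up for illustration)
raw <- data.frame(
  Date  = as.Date(c("2020-01-01", "2020-01-02", NA, NA)),
  Index = c(100, NA, NA, NA)
)

# Keep rows that have at least one non-missing value
used <- raw[rowSums(!is.na(raw)) > 0, , drop = FALSE]

# The last used row
last_row <- tail(used, 1)
```

Here the second row survives even though Index is missing, which na.omit would have removed.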
It appears you also have control over the excel spreadsheet. So in case your data does contain NAs you could have some default value in your empty rows that will get overwritten as soon as a new data point is recorded. This will allow you to filter your dataframe accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
                                sheet = 1,
                                range = cell_cols("A:P"), # Only cols, no rows
                                col_names = TRUE)
Every time you run the code, R will pull in the data from columns between A:P up until the last populated row.
This will be a more elegant approach to your use case. (Consider what you'd do when your data crosses 10000 rows in the future)

Select Rows and Columns At the Same Time in SPSS

I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data = old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
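A quick check of both forms on a toy data frame (the values and the three-column layout are made up for illustration), with the row condition passed as subset = and the columns as select =:

```r
# Toy data standing in for old_data
old_data <- data.frame(
  ColumnA = c(5, 12, 20),
  ColumnB = c("x", "y", "z"),
  ColumnC = c(1.5, 2.5, 3.5)
)

# Rows where ColumnA > 10, keeping two columns by name
by_name <- subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC))

# The same selection using column positions
by_index <- old_data[old_data$ColumnA > 10, c(1, 3)]
```

Both calls keep the same two rows and the same two columns.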
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS is significantly better" than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
1. Load the data you want to work with.
2. Subset for all columns from all the cases you're interested in.
3. Save a new file with only the variables you're interested in.
4. Load that new file before you proceed.
With R, the basic steps are:
1. Load the data you want to work with.
2. Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.
