wildcards to download particular csv - r

Still very new to R, so please excuse me.
I am trying to download CSV data from the Sloan Digital Sky Survey. Within R I do the following:
astro1 <- read.csv("https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=596&plateid=4055")
This downloads one CSV spectrum per fibre ID per plate (here, plateid=4055). However, if there are several hundred fibre IDs it will be a very long couple of days.
Is there a way to batch download the CSV data for all fibre IDs? I tried fiberid=* (and "", " ", #), but got one of the following errors:
"no lines available in input", or "unexpected string constant".
If, for example, there are 100 .csv files per plate, all will have a common x-axis (wavelength) but a different third column (best fit, for the y-axis). Is there a way to combine the downloaded CSV tables into one large dataset, with the same common axis (wavelength) and subsequent columns showing only the Best Fit columns?
Many thx

The best case would be that you have a list of all the links to the CSV files you want. Since this is seemingly not the case, you know that you want to loop over all the fibre IDs. You know the structure of the link, hence we can use it to define
buildFibreIdLink <- function(fibreId) {
  paste0("https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=", fibreId, "&plateid=4055")
}
Now I would just loop over all IDs, whatever "all" means in this case. Just start at 1 and count up. For that I would use the function
getCsvDataList <- function(startId = 1, endId = 10, maxConsecutiveNulls = 5) {
  dataList <- list()
  consecutiveNullCount <- 0
  for(id in startId:endId) {
    csvLink <- buildFibreIdLink(fibreId = id)
    newData <- tryCatch(expr = {
      read.csv(csvLink)
    }, error = function(e) {return(NULL)})
    if(is.null(newData)) {
      consecutiveNullCount <- consecutiveNullCount + 1
    } else {
      dataList <- c(dataList, list(newData))
      consecutiveNullCount <- 0
    }
    if(consecutiveNullCount == maxConsecutiveNulls) {
      print(paste0("reached maxConsecutiveNulls at id ", id))
      break
    }
  }
  return(dataList)
}
Specify the ID range you want to read, so that you can really read the CSVs partially. Now the question is: when have you reached the end? My answer would basically be: you have reached the end when there are maxConsecutiveNulls consecutive read.csv failures. I assume that a link doesn't exist if you can't read it, hence the tryCatch block triggers, and I basically count these triggers up to a given maximum.
If you know that the structure of the CSVs is always the same, you can combine the list of data frames via
dataListFrom1to10 <- getCsvDataList(startId = 1, endId = 10)
merged1to10 <- do.call("rbind",dataListFrom1to10)
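Note that rbind stacks the spectra on top of each other. If instead you want the wide layout asked about in the question, one shared wavelength column plus one best-fit column per fibre, you could merge on the wavelength column instead. A minimal sketch with made-up column names (Wavelength, BestFit; the actual SDSS CSV headers may differ):

```r
# Toy stand-ins for two downloaded spectra; real ones come from getCsvDataList().
spec1 <- data.frame(Wavelength = c(4000, 4001, 4002), BestFit = c(1.1, 1.2, 1.3))
spec2 <- data.frame(Wavelength = c(4000, 4001, 4002), BestFit = c(2.1, 2.2, 2.3))
dataList <- list(spec1, spec2)

# Rename the BestFit column per spectrum so the merged result
# ends up with one best-fit column per fibre.
for (i in seq_along(dataList)) {
  names(dataList[[i]])[names(dataList[[i]]) == "BestFit"] <- paste0("BestFit_", i)
}

# Successively merge all data frames on the shared Wavelength column.
wide <- Reduce(function(x, y) merge(x, y, by = "Wavelength"), dataList)
```

Reduce applies the two-way merge successively, so each spectrum in the list contributes one extra column keyed by wavelength.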
Update: If you have your vector of needed fibre IDs, you can modify the function as follows. Since we didn't know the exact IDs, we looped from 1 upwards. Now, knowing the IDs, you can replace the startId and endId arguments by, say, fibreIdVector, to get the signature
getCsvDataList <- function(fibreIdVector, maxConsecutiveNulls). In the for-loop, replace for(id in startId:endId) with for(id in fibreIdVector). If you know that all your IDs are valid, you can remove the error handling to get a much cleaner function. Since you don't need the results of previous iterations, e.g. for counting consecutiveNullCount, you can just put everything into an lapply like
allCsvData <- lapply(fibreIdVector, function(id) {
  read.csv(buildFibreIdLink(fibreId = id))
})
replacing the whole function.

Related

Searching for target in Excel spreadsheet using R

As an R noob, I'm currently rather stumped by what is probably a rather trivial problem. I have data that looks like in the second image below: essentially a long sheet of rows with values in three columns. What I need is a way to scan the sheet looking for particular combinations of values in the first and second columns (combinations that are specified in a second spreadsheet of targets; see picture 1). When that particular combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
  targets[[i]] <- c(targetsDb[i,1], targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of column 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
  SID_IA[[n]] <- c(data[n,1], data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
  for(i in targets){
    if(SID_IA[[n]] %in% targets){
      print(SID_IA[[n]], "in sentence", data[n,1], "is ", data[n,3])
    }else{
      print(FALSE)
    }
    SID_IA[[n]] <- NULL
  }
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.
OK, I'm attempting an answer that should require minimal use of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows appear exactly the same in data and target, then it will be as simple as a merge:
merge(target,data,all = F)
If, on the other hand, you want to keep the data table with the target rows marked, then the easiest way will be to make an index column:
data$indx<- 1:nrow(data)
data
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"
data
This is like the original image you'd posted.
BTW, if you are on a graphical interface you can also use a file dialogue to open data files; check out file.choose().
(Posted on behalf of the OP).
Following on from @R.S.'s suggestion that didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines and then how to remove them from the original data, outputting both results. I'm leaving it here for future reference and considering this solved.
extracted <- merge(targets,data,all = F)
write.xlsx(extracted,output_file1)
combined <-rbind(data,extracted)
minus.target <- combined[!duplicated(combined,fromLast = FALSE)&!duplicated(combined,fromLast = TRUE),]
write.xlsx(minus.target, output_file2)
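The duplicated(..., fromLast = ...) idiom above keeps only the rows that occur exactly once in the combined table, which removes every row present in both data and extracted. A small self-contained illustration of the idiom on toy data frames (column names are made up):

```r
# Toy data: three rows, one of which ("b") is also in the extracted set.
data <- data.frame(id = c("a", "b", "c"), val = c(1, 2, 3))
extracted <- data.frame(id = "b", val = 2)

combined <- rbind(data, extracted)
# A row survives only if it is flagged as a duplicate neither scanning
# forwards nor scanning backwards, i.e. it appears exactly once overall.
minus.target <- combined[!duplicated(combined, fromLast = FALSE) &
                         !duplicated(combined, fromLast = TRUE), ]
```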

Not sure which way of combining my loop results I should be using

To make a long story short, I'm trying to gather information on 6500 users, so I wrote a loop. Below you can find an example of 10 artists. In this loop I'm trying to use an API call to gather information on all tracks of a user.
test <- fromJSON(getURL('http://api.soundcloud.com/users/52880864/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'))
This short example shows a dataframe with all the tracks uploaded by a user. When I run my loop I'd like to add all the dataframes together so that I can process them with tapply. This way I can, for instance, see what the sum of all track likes is. However, two things are going wrong. First, when I run the loop, each user only shows one uploaded track. Second, I think I'm not combining the dataframes properly. Could somebody please explain to me what I'm doing wrong?
id <- c(20376298, 63320169, 3806325, 12231483, 18838035, 117385796, 52880864, 32704993, 63975320, 95667573)
Partition1 <- paste0("'http://api.soundcloud.com/users/", id, "/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'")
results <- vector(mode = "list", length = length(Partition1))
for (i in seq_along(Partition1)){
  message(paste0('Query #', i))
  tryCatch({
    result_i <- fromJSON(getURL(str_replace_all(Partition1[i], "'", "")))
    clean_i <- function(x) ifelse(is.null(x), NA, ifelse(length(x) == 0, NA, x))
    results[[i]] <- plyr::llply(result_i, clean_i) %>% as_data_frame
    if( i == 4 ) {
      stop('stop')
    }
  }, error = function(e){
    beepr::beep(1)
  })
  Sys.sleep(0.5)
}
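On the combining step specifically: once results holds one data frame per user, a common way to stack them into a single data frame that tapply can work on is do.call(rbind, ...) (dplyr::bind_rows is an alternative that also tolerates differing columns). A minimal sketch with toy data frames standing in for the API results:

```r
# Toy stand-ins for the per-user track data frames collected by the loop.
results <- list(
  data.frame(user_id = 1, likes = c(10, 20)),
  data.frame(user_id = 2, likes = c(5, 15, 25))
)

# Stack all per-user data frames into one; rbind requires matching columns.
all_tracks <- do.call(rbind, results)

# Example aggregation: total likes per user.
likes_per_user <- tapply(all_tracks$likes, all_tracks$user_id, sum)
```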

getting error: 'incorrect number of dimensions' when executing a function in R 3.1.2. What does it mean and how to fix it?

Let me first describe the function and what I have to process.
Basically there's this folder containing some 300 comma-separated value files. Each file has an ID associated with it (e.g. 200.csv has ID 200 in it) and contains data pertaining to sulphate and nitrate pollutants. What I have to do is calculate the mean of these pollutants for either one ID or a range of IDs: for example, calculating the mean of sulphate for ID 5, or the same thing for IDs 5:10.
This is my procedure for processing the data, but I'm getting a silly error at the end.
I have a list vector of these .csv files.
A master data frame combining all these files; I used the data.table package for this.
Time to describe the function:
pollutantmean <- function(spectate, pollutant, id) {
  specdata <- rbindlist(filelist)
  setkey(specdata, 'ID') ## because ID needs to be sorted out
  for(i in id) {
    if(pollutant == 'sulphate'){
      return(mean(specdata[, 'sulphate'], na.rm = TRUE))
    } else if(pollutant == 'nitrate'){
      return(mean(specdata[, 'nitrate'], na.rm = TRUE))
    } else {
      print('NA')
    }
  }
}
Now the function is very simple. I defined spectate, and I defined the for loop to calculate data for each id. I get no error when the function is run. But there is one error that is being the last obstacle:
'Error in "specdata"[, "sulfate"] : incorrect number of dimensions'
when I execute the function. Could someone elaborate?
I think this is the kind of task where the plyr package will be really helpful.
More specifically, I think you should use ldply and a bespoke function.
ldply: a list as the input and a data frame as the output. The list should be the directory contents, and the output will be the summary values from each of the csv files.
Your example isn't fully reproducible so the code below is just an example structure:
require(plyr)
require(stringr)
files_to_extract <- list.files("dir_with_files", pattern=".csv$")
files_to_extract <- str_replace(files_to_extract, ".csv$", "")
fn <- function(this_file_name){
  file_loc <- paste0("dir_with_files", "/", this_file_name, ".csv")
  full_data <- read.csv(file_loc)
  out <- data.frame(
    variable_name = this_file_name,
    variable_summary = mean(full_data$variable_to_summarise)
  )
  return(out)
}
summarised_output <- ldply(files_to_extract, fn)
Hope that helps. It probably won't work first time, and you might want to add some additional conditions and so on to handle files that don't have the expected contents. Happy to discuss what it's doing, but it's probably best to read this, as once you understand the approach it makes all kinds of tasks much easier.

R for loop index issue

I am new to R and I am practicing writing R functions. I have 100 separate csv data files stored in my directory, and each is labeled by its id, e.g. "1" to "100". I would like to write a function that reads some selected files into R, calculates the number of complete cases in each data file, and arranges the results into a data frame.
Below is the function that I wrote. First I read all the files into "dat". Then, using the rbind function, I read the selected files I want into a data.frame. Lastly, I computed the number of complete cases using sum(complete.cases()). This seems straightforward, but the function does not work. I suspect there is something wrong with the index but have not figured out why. I searched through various topics but could not find a useful answer. Many thanks!
complete = function(directory, id) {
  dat = list.files(directory, full.name=T)
  dat.em = data.frame()
  for (i in id) {
    dat.ful = rbind(dat.em, read.csv(dat[i]))
    obs = numeric()
    obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i,]))
  }
  data.frame(ID = id, count = obs)
}
complete("envi", c(1,3,5))
I get an error and a warning message:
Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5
One problem with your code is that you reset obs to numeric() each time you go through the loop, so obs ends up with only one value (the number of complete cases in the last file in dat).
Another issue is that the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame being read in that iteration of the loop. This won't cause an error, but you don't actually need to store the previous data frames, since you're just checking the number of complete cases for each data frame you read in.
Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
# Read files and return data frame with the number of complete cases in each csv file
complete = function(directory, files) {
  # Read each csv file in turn and store its name and number of complete cases
  # in a list
  obs.list = lapply(files, function(x) {
    dat = read.csv(paste0(directory, "/", x))
    data.frame(fileName = x, count = sum(complete.cases(dat)))
  })
  # Return a data frame with the number of complete cases for each file
  return(do.call(rbind, obs.list))
}
Then, to run the function, you need to give it a directory and a list of file names. For example, to read all csv files in the current working directory, you can do this:
filesToRead = list.files(pattern=".csv")
complete(getwd(), filesToRead)

r-project create a data frame function and probably use *apply somewhere too

I'm trying to create a function that looks up a bunch of CSV files in a directory and then, taking the file ID as an argument, outputs a table (actually a data frame; I'm new to the R language) with two columns: one titled ID, for the corresponding id parameter, and a second column with the count of rows in that file.
The files are all titled 001.csv - 322.csv
e.g. the output would look like: first column titled ID, first record 001 (derived from 001.csv); second column titled "count of rows", first record the row count of 001.csv.
The function looks like so: myfunction(directory,id)
Directory is the folder where the csv files are, and id can be a number (or vector?), e.g. simply 1 or 9 or 100, or it can be a vector like 200:300.
In the case of the latter, 200:300, the output would be a table with 100 rows, where the first row would be 200 with, say, 10 rows of data within it.
So far:
complete <- function(directory, id = 1:332) {
  # create an object to help read the appropriate csv files later in the function
  csvfilespath <- sprintf("/Users/gcameron/Desktop/%s/%03d.csv", directory, id)
  colID <- sprintf('%03d', id)
  # now, how do I tell R to create a table with 2 columns titled ID and countrows?
  # And how would I take each instance of an ID and add to this table the id and count of rows in each?
}
I apologize if this seems really basic. The tutorial I'm on moves fast and I have watched each video lecture and done a fair amount of research too.
SO is by far my favourite resource and I learn better by using it. Perhaps because it's personalised and directly applicable to my immediate tasks. I hope my questions also benefit others who are learning R.
BASED ON FEEDBACK BELOW
I now have the following script:
complete <- function(directory, id = 1:332) {
  csvfiles <- sprintf("/Users/gcameron/Desktop/%s/%03d.csv", directory, id)
  nrows <- sapply(csvfiles, function(f) nrow(read.csv(f)))
  data.frame(ID=id, countrows=sapply(csvfiles, function(x) length(count.fields(x)))
}
Does this look like I'm on the right track?
I'm receiving an error: "Error: unexpected '}' in:
"data.frame(ID=id, countrows=sapply(csvfiles,function(x) length(count.fields(x)))
}"
I cannot see where the extra "}" is coming from.
The data.frame(...) call is missing one closing parenthesis, so the parser is still inside the call when it reaches the }. Balanced, it should read:
data.frame(ID=id, countrows=sapply(csvfiles, function(x) length(count.fields(x))))
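For reference, a complete runnable sketch of the function, keeping the %03d file naming from the question; as an assumption on my part, the hard-coded Desktop path is replaced by passing the directory as a full path:

```r
# Count the rows in each id's CSV file and return an ID/countrows data frame.
# Assumes files named 001.csv, 002.csv, ... live directly under `directory`.
complete <- function(directory, id = 1:332) {
  csvfiles <- sprintf("%s/%03d.csv", directory, id)
  data.frame(
    ID = sprintf("%03d", id),
    countrows = sapply(csvfiles, function(x) nrow(read.csv(x))),
    row.names = NULL
  )
}
```

Calling e.g. complete("/Users/gcameron/Desktop/envi", 200:300) would then return one row per requested id.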
