See the below R code, I'm using JSONlite package to scrape data from a website:
library(jsonlite)
url <- "http://fantasy.premierleague.com/web/api/elements/"
seasonsdata <- data.frame(matrix(NA,nrow=1,ncol=20))
seasonsdata <- seasonsdata[-1,]
fetchData <- function(i) {res <- try(a <- fromJSON(paste0(url,i)))
if(!inherits(res,"try-error")) {b<-data.frame(a[1],a[20],a[21],as.data.frame(a$season_history))}}
seasonsdata <- lapply(1:696, fetchData)
seasonsdata <-do.call(rbind,lapply(seasonsdata,data.frame,stringsAsFactors=FALSE))
The code is working fine for 'i' till 10 at least, I'm getting the desired output. However, as I increase 'i' to 696, I'm getting the error:
Error in data.frame(a[1], a[20], a[21], as.data.frame(a$season_history)) :
arguments imply differing number of rows: 1, 0
Any advise?
If a$season_history is empty (page 57 is an example) then when you do data.frame(a[1], a[20], a[21], as.data.frame(a$season_history)) the first 3 elements have one row (they are scalars) and the last element has zero rows. In your function you can first check if a$season_history is there. If it it's not you can create a row of NAs in its places.
However there is another problem with your code you may not be aware of yet. Not every page to 696 exists and you get a 404 error when you try to pull the data from it. I added some steps to remove those pages before you do the final do.call(rbind, ...) step.
library(jsonlite)
url <- "http://fantasy.premierleague.com/web/api/elements/"
seasonsdata <- data.frame(matrix(NA, nrow = 1, ncol = 20))
seasonsdata <- seasonsdata[-1, ]
fetchData <- function(i) {
res <- try(a <- fromJSON(paste0(url, i)))
if (!inherits(res, "try-error")) {
if (nrow(as.data.frame(a$season_history)) == 0) {
b <- data.frame(a[1], a[20], a[21], as.data.frame(matrix(NA, ncol = 17)))
} else {
b <- data.frame(a[1], a[20], a[21], as.data.frame(a$season_history))
}
}
}
seasonsdata <- lapply(1:696, fetchData)
seasonsdata <- seasonsdata[!sapply(seasonsdata, is.null)]
seasonsdata <- seasonsdata[sapply(seasonsdata, is.data.frame)]
seasonsdata <- do.call(rbind,lapply(seasonsdata, data.frame, stringsAsFactors = FALSE))
Related
I have the following data as an example:
IID<-c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
KB<-c(345,1234,2000,1567,
376,1657,9656,234,
1865,1565,123,111,
1999,2032,1565,234)
data<-cbind(IID,KB)
I wrote a script to process it and give me some outcomes:
results_kb <- function(class) {
this_iids_roh <- dat[class,]
my_list<-c("Sum_long"=sum(this_iids_roh$KB[this_iids_roh$KB>=1500]),
"N_long"=length(this_iids_roh$KB[this_iids_roh$KB>=1500]),
"Sum_short"=sum(this_iids_roh$KB[this_iids_roh$KB<1500]),
"N_short"=length(this_iids_roh$KB[this_iids_roh$KB<1500]))
return(my_list)
}
dat <- data.table::as.data.table(data)
dat$IID<-as.factor(dat$IID)
data.table::setkey(dat,"IID")
results <- c()
nLevels <- length(levels(dat$IID))
start <- proc.time()
pb <- txtProgressBar(min = 0, max = nLevels, style = 3)
for (i in 1:nLevels){
this_iid <- levels(dat$IID)[i]
results <- rbind(results,results_kb(this_iid))
setTxtProgressBar(pb,i)
}
close(pb)
proc.time()-start
results<-data.frame(levels(dat$IID),results)
results$IID<-results$levels.dat.IID.
results[results==0] <- NA
I created first a function that gives me the outcome table and then I processed the dataset I want to analyze. I added a time indicator because this script is to process very large samples. As you can test it works perfectly.
Now I want to write my own package, so I have to reduce all this into functions. The first part of the script is already in a function. For the second part I have tried the following:
rohsum<-function(data){
dat <- data.table::as.data.table(data)
dat$IID<-as.factor(dat$IID)
data.table::setkey(dat,"IID")
results <- c()
nLevels <- length(levels(dat$IID))
start <- proc.time()
pb <- txtProgressBar(min = 0, max = nLevels, style = 3)
for (i in 1:nLevels){
this_iid <- levels(dat$IID)[i]
results <- rbind(results,results_kb(this_iid))
setTxtProgressBar(pb,i)
}
close(pb)
proc.time()-start
results<-data.frame(levels(dat$IID),results)
results$IID<-results$levels.dat.IID.
results[results==0] <- NA
return(results)
}
However this seems not to work, since when I tried to run rohsum(data) I get the following error message:
Error in results_kb(this_iid) : object 'dat' not found
Even more, If I tried to run several times I get the following:
How can I solve this issue to be able to build my own package?
I have 1000 json files. And I would like to read them in parallel. I have 4 CPU cores.
I have a character vector which has the names of all the files as following:-
cik_files <- list.files("./data/", pattern = ".json")
And using this vector I load the file and extract the data and add it to the following list:-
data <- list()
Below is the code for extracting the data:-
for(i in 1:1000){
data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
try({if(nrow(data1) > 0){
data1$cik <- strtrim(cik_files[i], 13)
data[[length(data) + 1]] <- data1
}}, silent = TRUE)
}
}
This however, takes quite a lot of time. So I was wondering how I can run the code within the for loop but in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
filename <- file.path(path, cik_files[i])
data1 <- fromJSON(filename, flatten = TRUE)
if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
found <- grep("CY20[0-9]{2}$", data1$frame)
if(length(found) > 0){
tryCatch({
out <- data1[found, c(3, 9)]
out$cik <- strtrim(cik_files[i], 13)
out
},
error = function(e) e,
warning = function(w) w)
} else NULL
} else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)
# Not on Windows
library(jsonlite)
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files)
# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
i am trying to solve this problem an you help me:
kegg_brite_map <- read.table("E:\\Path\\KoG1\\picrust1_KO_BRITE_map.tsv", header=TRUE, sep="\t", quote = "", stringsAsFactors = FALSE, comment.char="", row.names=1)
test_ko <- read.table("E:\Path\KoG1\test_ko.tsv", header=TRUE, sep="\t", row.names=1)
##Run function
### Reproducing the categorize by function (level 3) functionality in plain-text tables.
### Doing this because adding a column of KEGG Pathways to a table and then converting
### that table to BIOM is difficult.
categorize_by_function_l3 <- function(in_ko, kegg_brite_mapping) {
# Function to create identical output as categorize_by_function.py script,
# but with R objects instead of BIOM objects in Python.
# Input KO table is assumed to have rownames as KOs and sample names as columns.
out_pathway <- data.frame(matrix(NA, nrow=0, ncol=(ncol(in_ko) + 1)))
colnames(out_pathway) <- c("pathway", colnames(in_ko))
for(ko in rownames(in_ko)) {
# Skip KO if not in KEGG BRITE mapping df
# (this occurs with newer KOs that weren't present in PICRUSt1).
if(! ko %in% rownames(kegg_brite_mapping)) {
next
}
pathway_list <- strsplit(kegg_brite_mapping[ko, "metadata_KEGG_Pathways"], "\\|")[[1]]
for(pathway in pathway_list) {
pathway <- strsplit(pathway, ";")[[1]][3]
new_row <- data.frame(matrix(c(NA, as.numeric(in_ko[ko,])), nrow=1, ncol=ncol(out_pathway)))
colnames(new_row) <- colnames(out_pathway)
new_row$pathway <- pathway
out_pathway <- rbind(out_pathway, new_row)
}
}
out_pathway = data.frame(aggregate(. ~ pathway, data = out_pathway, FUN=sum))
rownames(out_pathway) <- out_pathway$pathway
out_pathway <- out_pathway[, -which(colnames(out_pathway) == "pathway")]
if(length(which(rowSums(out_pathway) == 0)) > 0) {
out_pathway <- out_pathway[-which(rowSums(out_pathway) == 0), ]
}
return(out_pathway)
}
#Run function to categorize all KOs by level 3 in BRITE hierarchy
test_ko_L3 <- categorize_by_function_l3(test_ko, kegg_brite_map)
#ERROR
Error in rowSums(out_pathway) :
'x' must be an array of at least two dimensions
Called from: rowSums(out_pathway)
Without this question being reproducible (see comment from #jogo), it is difficult to tell where you have issues in the code, but the error is telling you that your argument 'x' to the function rowSums() must be at least two dimensions. This makes sense because you need rows, to be able to take the sums of each of them.
out_pathway is the object that you are passing to rowSums(), so this is the place to start. This object must not have two dimensions, possibly because you are doing some aggregating, and then deleting columns in the lines above.
If out_pathway is just a one-dimensional object, you can just use sum()
I have a spatial polygon dataframe with several shapefiles. I would like to trim these shapefiles by elevation and replace the original shapefile in the dataframe. However, there appears to be an error when I try to replace the polygon after the trim. Currently, my plan is run though the following loop for each shapefile in the dataset. However, when I try dist[i,] <- temp3, I get the following error:
Error in match(value, lx) : 'match' requires vector arguments
In addition: Warning message:
In checkNames(value) :
attempt to set invalid names: this may lead to problems later on. See ?make.names
Any suggestions?
# Load spdf
dist <- rgdal::readOGR('critterDistributions.shp');
# Load elevational ranges
rangeElevation <- read.csv(file = 'elevationRanges.csv');
# Load altitude data
elevation <- raster('ETOPO1_Bed_g_geotiff.tif');
# Tidy up CRSes
crs(elevation) <- crs(dist);
# Run loop
for (i in 1:length(dist)){
subjName <- as.character(dist#data$Species[i]);
if (!(subjName %in% rangeElevation$?..Species_name)){
paste0(subjName, 'does not exist in the elevational range database.');
}
else{
erNameMatch <- match(subjName, rangeElevation$?..Species_name);
temp <- raster::reclassify(elevation, rcl = c(-Inf,rangeElevation[erNameMatch,2],NA,
rangeElevation[erNameMatch,2],rangeElevation[erNameMatch,3],1,
rangeElevation[erNameMatch,3],Inf,NA));
temp2 <- dist[i,];
temp <- mask(temp, temp2);
temp <- crop(temp, temp2);
temp3 <- rasterToPolygons(temp, na.rm = T, dissolve = T);
names(temp3) <- make.names(names(temp2), unique = T);
temp3#data <- temp2#data;
dist[i,] <- temp3; # <<<< This is the line of code that doesn't work.
}
}
Upon further thought, I came up with a workaround: initiating a list, then using rbind after the loop to unite everything back together into a single object. I'm still interested in finding out why dist[i,] <- temp3 doesn't work, but at least I was able to get this job done.
oneSPDFtoRuleThemAll <- vector(mode = "list", length = length(dist));
for (i in 1:length(dist)){
subjName <- as.character(dist#data$Species[i]);
if (!(subjName %in% rangeElevation$?..Species_name)){
paste0(subjName, 'does not exist in the elevational range database.');
}
else{
erNameMatch <- match(subjName, rangeElevation$?..Species_name);
temp <- raster::reclassify(elevation, rcl = c(-Inf,rangeElevation[erNameMatch,2],NA,
rangeElevation[erNameMatch,2],rangeElevation[erNameMatch,3],1,
rangeElevation[erNameMatch,3],Inf,NA));
temp2 <- dist[i,];
temp <- mask(temp, temp2);
temp <- crop(temp, temp2);
temp3 <- rasterToPolygons(temp, na.rm = T, dissolve = T);
names(temp3) <- make.names(names(temp2), unique = T);
temp3#data <- temp2#data;
oneSPDFtoRuleThemAll[[i]] <- temp3; # <<<< This is the line of code that doesn't work.
}
}
finalSPDF <- rbind(unlist(oneSPDFtoRuleThemAll));
I tried to combine two lists into one dataframe:
all_stas <- list()
for(i in vid_id){
stas <- get_stats(video_id = i)
all_stas <- rbind(all_stas,stas)
}
View(all_stas)
all_detail <- list()
for(i in vid_id){
detail1 <- get_video_details(video_id = i)
all_detail <- rbind(all_detail,detail1)
}
View(all_detail)
df <- data.frame(all_stas,all_detail)
write.csv(df, file = "new_file.csv")
Afterwards I would like to store it into a csv file.
When I run it it gives me the following warning message
Warning message:
In rbind(all_stas, stas) :
number of columns of result is not a multiple of vector length (arg 2)
Does anyone of you know how I can make the code work?
This block below is triggering an error
all_stas <- list()
for(i in vid_id){
stas <- get_stats(video_id = i)
all_stas <- rbind(all_stas,stas)}
If I understand your code correctly you can get around that error by
all_stas <- list()
for(i in vid_id){
all_stas[[i]] <- get_stats(video_id = i)}