I'm trying to make an API call per row of my data.frame, passing the value of a row as a parameter. The return value should be added to my frame.
However, I'm failing to make my code work. Any help is appreciated.
Contents of APPLYapiCallTest.csv:
id title returnValue
01 "wine" ""
02 "beer" ""
03 "coffee" ""
Essentially: call the API with callAPI("wine") and replace the empty field with the result, e.g. "beverage".
Here's what I got so far.
#call api per row using apply
library(jsonlite)
library(httr)
callAPI <- function(x) {
  findWhat <- as.character(x)
  # create URL
  url1 <- "http://api.nal.usda.gov/ndb/search/?format=json&q="
  url2 <- "&sort=n&max=1&offset=0&api_key=KYG9lsu0nz31SG5yHGdAsM28IuCEGxWWlvdYqelI&location=Chicago%2BIL"
  fURL <- paste(url1, findWhat, url2, sep = "")
  apiRet <- data.frame(fromJSON(txt = fURL, flatten = TRUE))
  result <- apiRet[1, c(9)]
  return(result)
}
tData <- data.frame(read.delim("~/Documents/R-SCRIPTS/DATA/APPLYapiCallTest"))
apply(tData[,c('title')], 1, function(x) callAPI(x) )
I think you'd be better off doing something like this:
library(jsonlite)
library(httr)
library(pbapply)
callAPI <- function(x) {
res <- GET("http://api.nal.usda.gov/ndb/search/",
query=list(format="json",
q=x,
sort="n",
offset=0,
max=16,
api_key=Sys.getenv("NAL_API_KEY"),
location="Berwick+ME"))
stop_for_status(res)
return(data.frame(fromJSON(content(res, as="text"), flatten=TRUE), stringsAsFactors=FALSE))
}
tData <- data.frame(id=c("01", "02", "03"),
title=c("wine", "beer", "coffee"),
returnValue=c("", "", ""),
stringsAsFactors=FALSE)
dat <- merge(do.call(rbind.data.frame, pblapply(tData$title, callAPI)),
tData[, c("title", "id")], by.x="list.q", by.y="title", all.x=TRUE)
str(dat)
## 'data.frame': 45 obs. of 12 variables:
## $ list.q : chr "beer" "beer" "beer" "beer" ...
## $ list.sr : chr "28" "28" "28" "28" ...
## $ list.start : int 0 0 0 0 0 0 0 0 0 0 ...
## $ list.end : int 13 13 13 13 13 13 13 13 13 13 ...
## $ list.total : int 13 13 13 13 13 13 13 13 13 13 ...
## $ list.group : chr "" "" "" "" ...
## $ list.sort : chr "n" "n" "n" "n" ...
## $ list.item.offset: int 0 1 2 3 4 5 6 7 8 9 ...
## $ list.item.group : chr "Beverages" "Beverages" "Beverages" "Beverages" ...
## $ list.item.name : chr "Alcoholic beverage, beer, light" "Alcoholic beverage, beer, light, BUD LIGHT" "Alcoholic beverage, beer, light, BUDWEISER SELECT" "Alcoholic beverage, beer, light, higher alcohol" ...
## $ list.item.ndbno : chr "14006" "14007" "14005" "14248" ...
## $ id : chr "02" "02" "02" "02" ...
This way you're making a structured call via httr::GET for each title in tData, binding all the results together into a data frame, and finally adding the id back.
This also allows you to put NAL_API_KEY into .Renviron and avoid exposing it in your workflows (and on SO :-)
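For reference, a minimal sketch of that setup (the key name matches the Sys.getenv() call above; the value is of course a placeholder):
# in ~/.Renviron (one line, no quotes), then restart R:
NAL_API_KEY=your-key-here
# back in R:
Sys.getenv("NAL_API_KEY")  # returns the key, or "" if it is not set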
You can trim out the API result responses you don't need either in the function or outside it.
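For example, a sketch that keeps only a few of the columns shown in the str(dat) output above (adjust to whichever fields you actually need):
dat <- dat[, c("id", "list.q", "list.item.name", "list.item.group", "list.item.ndbno")]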
Related
I am trying to figure out how to get this data into R so that I can turn it into a table and store it in a database such as SQL.
API <- "https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/{2020-01-01}/{2020-06-30}"
oxford_covid <- GET(API)
I then try to parse this data and make it into a data frame, but when I do so I get these errors:
"Error: Columns 4, 5, 6, 7, 8, and 178 more must be named.
Use .name_repair to specify repair." and "Error: Tibble columns must have compatible sizes. * Size 2: Columns deaths, casesConfirmed, and stringency. * Size 176: Columns ..2020.12.27, ..2020.12.28, ..2020.12.29, and"
I am not sure how to parse this, or whether there is a better approach. Is there a method or approach for this? I am not having much luck online.
It looks like you're trying to take the JSON returned from that API and call read.table or something similar on it. Don't do that; JSON should be parsed by JSON tools (such as jsonlite::parse_json).
Here's some work on that URL:
js <- jsonlite::parse_json(url("https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/2020-01-01/2020-06-30"))
lengths(js)
# scale countries data
# 3 183 182
str(js, max.level = 2, list.len = 3)
# List of 3
# $ scale :List of 3
# ..$ deaths :List of 2
# ..$ casesConfirmed:List of 2
# ..$ stringency :List of 2
# $ countries:List of 183
# ..$ : chr "ABW"
# ..$ : chr "AFG"
# ..$ : chr "AGO"
# .. [list output truncated]
# $ data :List of 182
# ..$ 2020-01-01:List of 183
# ..$ 2020-01-02:List of 183
# ..$ 2020-01-03:List of 183
# .. [list output truncated]
So this is rather large. Since you're hoping for a data.frame, I'm going to look at js$data only; js$countries looks relatively uninteresting,
str(unlist(js$countries))
# chr [1:183] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "AUS" "AUT" "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ" "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CIV" "CMR" "COD" "COG" "COL" "CPV" ...
and does not correlate with the js$data. The js$scale might be interesting, but I'll skip it for now.
My first go-to for joining data like this into a data.frame is one of the following, depending on your preference for R dialects:
do.call(rbind.data.frame, list_of_frames) # base R
dplyr::bind_rows(list_of_frames) # tidyverse
data.table::rbindlist(list_of_frames) # data.table
But we're going to run into problems. Namely, there are entries that are NULL, when R would prefer that they be something (such as NA).
str(js$data[[1]][1])
# List of 2
# $ ABW:List of 8
# ..$ date_value : chr "2020-01-01"
# ..$ country_code : chr "ABW"
# ..$ confirmed : NULL # <--- problem
# ..$ deaths : NULL
# ..$ stringency_actual : int 0
# ..$ stringency : int 0
# ..$ stringency_legacy : int 0
# ..$ stringency_legacy_disp: int 0
So we need to iterate over each of those and replace NULL with NA. Unfortunately, I don't know of an easy tool to recursively go through lists of lists (even rapply doesn't work well in my tests), so we'll be a little brute-force here with a triple-lapply:
Long-story-short,
str(js$data[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : NULL
# $ deaths : NULL
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
jsdata <-
lapply(js$data, function(z) {
lapply(z, function(y) {
lapply(y, function(x) if (is.null(x)) NA else x)
})
})
str(jsdata[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : logi NA
# $ deaths : logi NA
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
(Technically, if we know that it's going to be integers, we should use NA_integer_. Fortunately, R and its dialects are able to work with this shortcut, as we'll see in a second.)
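A quick sanity check on a toy value (not the API data) shows why the plain-NA shortcut is fine: combining a logical NA with integers promotes it to an integer NA.
str(c(NA, 3L))
#  int [1:2] NA 3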
After that, we can do a double-dive rbinding and get back to the frame-making I discussed a couple of steps ago. Choose one of the following, whichever dialect you prefer:
alldat <- do.call(rbind.data.frame,
lapply(jsdata, function(z) do.call(rbind.data.frame, z)))
alldat <- dplyr::bind_rows(purrr::map(jsdata, dplyr::bind_rows))
alldat <- data.table::rbindlist(lapply(jsdata, data.table::rbindlist))
For simplicity, I'll show the first (base R) version:
tail(alldat)
# date_value country_code confirmed deaths stringency_actual stringency stringency_legacy stringency_legacy_disp
# 2020-06-30.AND 2020-06-30 AND 855 52 42.59 42.59 65.47 65.47
# 2020-06-30.ARE 2020-06-30 ARE 48667 315 72.22 72.22 83.33 83.33
# 2020-06-30.AGO 2020-06-30 AGO 284 13 75.93 75.93 83.33 83.33
# 2020-06-30.ALB 2020-06-30 ALB 2535 62 68.52 68.52 78.57 78.57
# 2020-06-30.ABW 2020-06-30 ABW 103 3 47.22 47.22 63.09 63.09
# 2020-06-30.AFG 2020-06-30 AFG 31507 752 78.70 78.70 76.19 76.19
And if you're curious about the $scale,
do.call(rbind.data.frame, js$scale)
# min max
# deaths 0 127893
# casesConfirmed 0 2633466
# stringency 0 100
## or
data.table::rbindlist(js$scale, idcol="id")
# id min max
# <char> <int> <int>
# 1: deaths 0 127893
# 2: casesConfirmed 0 2633466
# 3: stringency 0 100
## or
dplyr::bind_rows(js$scale, .id = "id")
This is the code that I was using for my data mining assignment in RStudio. I was preprocessing the data.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
                       levels = c("Federal-gov", "Local-gov", "Private",
                                  "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
                       labels = c(1, 2, 3, 4, 5, 6))
And here I will attach the result of the code:
str(x)
[screenshot of the str(x) output]
As you can see in the result, after that last line of code all the data in the Work_Class column is suddenly changed into NA. I don't really know why this occurs, since every other example that I saw online changed the data inside to the labels.
The link for the dataset:
Unfortunately I do not know the original data; possibly you just have to swap the contents of levels and labels:
x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks so you get a leading space on every character field.
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
# $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
# $ Work_Class : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
# $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
# $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
# $ Sex : chr " Male" " Male" " Male" " Male" ...
# $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
# $ Income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
# $ Age : num 2 1 2 3 2 2 1 2 2 3 ...
# $ Work_Class : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
# $ Education : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
# $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
# $ Sex : chr "Male" "Male" "Male" "Female" ...
# $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
# $ Income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
# - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
# ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
# Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
# 6 6 67 3 7 4
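If re-reading the file isn't convenient, trimming the existing column first should also work; a sketch on the same data:
x$Work_Class <- factor(trimws(x$Work_Class),
                       levels = c("Federal-gov", "Local-gov", "Private",
                                  "Self-emp-inc", "Self-emp-not-inc", "State-gov"))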
I googled my error, but that didn't help me.
I've got a data frame with a column x.
unique(df$x)
The result is:
[1] "fc_social_media" "fc_banners" "fc_nat_search"
[4] "fc_direct" "fc_paid_search"
When I try this:
df <- spread(data = df, key = x, value = x, fill = "0")
I got the error:
Error in `[.data.frame`(data, setdiff(names(data), c(key_var, value_var))) :
undefined columns selected
But that is very weird, because I have used the spread function several times in the same script.
So I googled and saw some "solutions":
I removed all the "special" characters. As you can see, my unique values do not contain special characters (I cleaned them), but this didn't help.
I checked whether there are any columns with the same name, but all column names are unique.
@Gregor, @Akrun:
> str(df)
'data.frame': 100 obs. of 22 variables:
$ visitor_id : chr "321012312666671237877-461170125342559040419" "321012366667112237877-461121705342559040419" "321012366661271237877-461170534255901240419" "321012366612671237877-461170534212559040419" ...
$ visit_num : chr "1" "1" "1" "1" ...
$ ref_domain : chr "l.facebook.com" "X.co.uk" "x.co.uk" "" ...
$ x : chr "fc_social_media" "fc_social_media" "fc_social_media" "fc_social_media" ...
$ va_closer_channel : chr "Social Media" "Social Media" "Social Media" "Social Media" ...
$ row : int 1 2 3 4 5 6 7 8 9 10 ...
$ : chr "0" "0" "0" "0" ...
$ Hard Drive : chr "0" "0" "0" "0" ...
The error could be due to a column without a name, i.e. "". Using a reproducible example:
library(tidyr)
spread(df, x, x)
Error in [.data.frame(data, setdiff(names(data), c(key_var,
value_var))) : undefined columns selected
We could make it work by changing the column name
names(df) <- make.names(names(df))
spread(df, x, x, fill = "0")
# X fc_banners fc_direct fc_nat_search fc_paid_search fc_social_media
#1 1 0 0 0 0 fc_social_media
#2 2 fc_banners 0 0 0 0
#3 3 0 0 fc_nat_search 0 0
#4 4 0 fc_direct 0 0 0
#5 5 0 0 0 fc_paid_search 0
data
df <- data.frame(x = c("fc_social_media", "fc_banners",
"fc_nat_search", "fc_direct", "fc_paid_search"), x1 = 1:5, stringsAsFactors = FALSE)
names(df)[2] <- ""
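Alternatively, a sketch on the same reproducible data: rename just the empty column instead of repairing all the names with make.names.
names(df)[names(df) == ""] <- "x1"   # any non-empty, unique name works
spread(df, x, x, fill = "0")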
I have a list of values which I would like to use as names for separate tables scraped from separate URLs on a certain website.
> Fac_table
[[1]]
[1] "fulltime_fac_table"
[[2]]
[1] "parttime_fac_table"
[[3]]
[1] "honorary_fac_table"
[[4]]
[1] "retired_fac_table"
I would like to loop through the list to automatically generate 4 tables with the respective names.
The result should look like this:
> fulltime_fac_table
職稱
V1 "教授兼系主任"
V2 "教授"
V3 "教授"
V4 "教授"
V5 "特聘教授"
> parttime_fac_table
職稱 姓名
V1 "教授" "XXX"
V2 "教授" "XXX"
V3 "教授" "XXX"
V4 "教授" "XXX"
V5 "教授" "XXX"
V6 "教授" "XXX"
I have another list, named 'headers', containing column headings of the respective tables online.
> headers
[[1]]
[1] "職稱" "姓名" " 研究領域"
[4] "聯絡方式"
[[2]]
[1] "職稱" "姓名" "研究領域" "聯絡方式"
I was able to assign values to the respective tables with this code:
> assign(eval(parse(text="Fac_table[[i]]")), as_tibble(matrix(fac_data, nrow = length(headers[[i]]))))
This results in a populated table, without column headings, like this one:
> honorary_fac_table
[,1] [,2]
V1 "名譽教授" "XXX"
V2 "名譽教授" "XXX"
V3 "名譽教授" "XXX"
V4 "名譽教授" "XXX"
But was unable to assign column names to each table.
Neither of the code below worked:
> assign(colnames(eval(parse(text="Fac_table[1]"))), c(gsub("\\s", "", headers[[1]])))
Error in assign(colnames(eval(parse(text = "Fac_table[1]"))), c(gsub("\\s", :
invalid first argument
> colnames(eval(parse(text="Fac_table[i]"))) <- c(gsub("\\s", "", headers[[i]]))
Error in colnames(eval(parse(text = "Fac_table[i]"))) <- c(gsub("\\s", :
target of assignment expands to non-language object
> do.call("<-", colnames(eval(parse(text="Fac_table[i]"))), c(gsub("\\s", "", headers[[i]])))
Error in do.call("<-", colnames(eval(parse(text = "Fac_table[i]"))), c(gsub("\\s", :
second argument must be a list
To simplify the issue, a reproducible example is as follows:
> varNamelist <- list(c("tbl1","tbl2","tbl3","tbl4"))
> colHeaderlist <- list(c("col1","col2","col3","col4"))
> tableData <- matrix(1:12, ncol=4)
This works:
> assign(eval(parse(text="varNamelist[[1]][1]")), matrix(tableData, ncol = length(colHeaderlist[[1]])))
But this doesn't:
> colnames(as.name(varNamelist[[1]][1])) <- colHeaderlist[[1]]
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3", "col4" :
attempt to set 'colnames' on an object with less than two dimensions
It seems like the colnames() function in R is unable to treat a string such as "Fac_table[i]" as the name of a variable in which independent data (separate from Fac_table) can be stored.
> colnames(as.name(Fac_table[[1]])) <- headers[[1]]
Error in `colnames<-`(`*tmp*`, value = c("a", "b", "c", :
attempt to set 'colnames' on an object with less than two dimensions
Using 'fulltime_fac_table' directly works fine:
> colnames(fulltime_fac_table) <- headers[[1]]
Is there any way around this issue?
Thanks!
There is a solution to this, but if I understand correctly, the current setup may be more complex than necessary. So I'll try to make this task easier.
If you're working with one-dimensional data, I'd recommend using vectors, as they're more appropriate than lists for that purpose. So for this project, I'd begin by storing the names of tables and headers, like this:
varNamelist <- c("tbl1","tbl2","tbl3","tbl4")
colHeaderlist <- c("col1","col2","col3","col4")
It's still difficult to tell from your question what the format and origin of the input data for these tables are, but in general a data frame can be easier to work with than a matrix, as long as you're not working with big data. The assign function is also typically not necessary for these sorts of steps. Instead, when setting up a data frame, we can specify the name of the data frame, the names of the columns, and the data contents all at once, like this:
tbl1 <- data.frame("col1"=c(1,2,3),
"col2"=c(4,5,6),
"col3"=c(7,8,9),
"col4"=c(10,11,12))
Again, we're using vectors, noted by the c() instead of list(), to fill each column, since each column is its own single dimension.
To check the output of tbl1, we can then use print():
print(tbl1)
col1 col2 col3 col4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
If it's an option to create the tables closer to the way shown here, that might make things easier than using so many lists and assign calls, which quickly becomes overly complicated.
But if you want at the end to store all the tables in a single place, you could put them in a list:
tableList <- list(tbl1 = tbl1, tbl2 = tbl2, tbl3 = tbl3, tbl4 = tbl4)
str(tableList)
List of 4
$ tbl1:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl2:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl3:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
$ tbl4:'data.frame': 3 obs. of 4 variables:
..$ col1: num [1:3] 1 2 3
..$ col2: num [1:3] 4 5 6
..$ col3: num [1:3] 7 8 9
..$ col4: num [1:3] 10 11 12
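Individual tables can then be pulled back out by name:
tableList$tbl1          # or, equivalently, tableList[["tbl1"]]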
I've found a workaround based on @Ryan's recommendation, given by this code:
for (i in seq_along(url)) {
  webpage <- read_html(url[i])  # loop through the URL list to access the HTML data
  fac_data  <- html_nodes(webpage, '.tableunder')  %>% html_text()
  fac_data1 <- html_nodes(webpage, '.tableunder1') %>% html_text()
  fac_data <- c(fac_data, fac_data1)  # store the table data from each URL in a variable
  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow = TRUE)  # make a matrix to extract column data
  for (j in seq_along(headers[[i]])) {
    y <- cbind(x[, j])                            # extract column data and store it in a temporary variable
    colnames(y) <- as.character(headers[[i]][j])  # add the column name
    print(cbind(y))  # loop through the headers list to print column data in sequence.
                     # ** cbind(y) will be overwritten when I try to store the result in a list with 'z <- cbind(y)'.
  }
}
I am now able to print out all values, complete with headers of the data in question.
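If you want to keep the results rather than just print them, one option (a sketch that reuses the url, headers, and Fac_table objects from above) is to name all the columns at once and collect each finished matrix into a named list:
fac_tables <- list()
for (i in seq_along(url)) {
  webpage  <- read_html(url[i])
  fac_data <- c(html_nodes(webpage, '.tableunder')  %>% html_text(),
                html_nodes(webpage, '.tableunder1') %>% html_text())
  x <- matrix(fac_data, ncol = length(headers[[i]]), byrow = TRUE)
  colnames(x) <- gsub("\\s", "", headers[[i]])  # set all the headers in one step
  fac_tables[[Fac_table[[i]]]] <- x             # e.g. fac_tables$fulltime_fac_table
}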
Follow-up questions have been posted here.
The final code solved this problem as well.
I've imported a dataset into R where a column that should contain numeric values also contains NULL entries. This makes R set the column class to character or factor, depending on whether or not you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset:
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to transform the PCROI column to numeric, but the NULLs make this harder.
I've tried to get around the issue by setting the value 0 for all observations where the current value is "NULL", but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the NULL observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
You have a syntax error:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, you can say:
data$PCROI = as.numeric(data$PCROI)
It will convert your "NULL" values to NA automatically.
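A quick check on a toy vector (not the original data) shows the coercion:
as.numeric(c("6.5", "8.17", "NULL"))
# [1] 6.50 8.17   NA
# Warning message: NAs introduced by coercion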