How to get the result of a package function into a data frame in R

I am still learning R.
I am using library(usdm), where I call vifcor(vardata,th=0.4,maxobservations =50000) to identify the variables that are not multicollinear. I need to get the result of vifcor(vardata,th=0.4,maxobservations =50000) into a structured data frame for further analysis.
Data reading process I am using:
performdata <- read.csv('F:/DGDNDRV_FINAL/OutputTextFiles/data_blk.csv')
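# Note: `:` binds tighter than `-`, so 5:length(...)-2 is (5:length(...))-2, i.e. columns 3 to n-2
vardata <- performdata[, c(names(performdata[5:length(names(performdata))-2]))]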
Content of the csv file:
pointid grid_code Blocks_line_dst_CHT GrowthCenter_dst_CHT Roads_nationa_dst_CHT Roads_regiona_dst_CHT Settlements_CHT_line_dst_CHT Small_Hat_Bazar_dst_CHT Upazilla_lin_dst_CHT resp
1 6 150 4549.428711 15361.31836 3521.391846 318.9043884 3927.594727 480 1
2 6 127.2792206 4519.557617 15388.68457 3500.24292 342.0526123 3902.883545 480 1
3 2 161.5549469 4484.473145 15391.6377 3436.539063 335.4101868 3844.216553 540 1
My tries:
r<-vifcor(vardata,th=0.2,maxobservations =50000) returns
2 variables from the 6 input variables have collinearity problem:
Roads_regiona_dst_CHT GrowthCenter_dst_CHT
After excluding the collinear variables, the linear correlation coefficients ranges between:
min correlation ( Small_Hat_Bazar_dst_CHT ~ Roads_nationa_dst_CHT ): -0.04119076963
max correlation ( Small_Hat_Bazar_dst_CHT ~ Settlements_CHT_line_dst_CHT ): 0.1384278434
---------- VIFs of the remained variables --------
Variables VIF
1 Blocks_line_dst_CHT 1.026743892
2 Roads_nationa_dst_CHT 1.010556752
3 Settlements_CHT_line_dst_CHT 1.038307666
4 Small_Hat_Bazar_dst_CHT 1.026943711
class(r) returns
[1] "VIF"
attr(,"package")
[1] "usdm"
mode(r) returns "S4"
I need the excluded variables (Roads_regiona_dst_CHT and GrowthCenter_dst_CHT) in one data frame and the VIFs of the remaining variables in another data frame.
But nothing I tried worked!

Basically, the returned result is an S4 object, and you can extract its slots with the @ operator:
library(usdm)
example(vifcor) # creates 'v2'
str(v2)
# Formal class 'VIF' [package "usdm"] with 4 slots
# ..@ variables: chr [1:10] "Bio1" "Bio2" "Bio3" "Bio4" ...
# ..@ excluded : chr [1:5] "Bio5" "Bio10" "Bio7" "Bio6" ...
# ..@ corMatrix: num [1:5, 1:5] 1 0.0384 -0.3011 0.0746 0.7102 ...
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:5] "Bio1" "Bio2" "Bio3" "Bio8" ...
# .. .. ..$ : chr [1:5] "Bio1" "Bio2" "Bio3" "Bio8" ...
# ..@ results :'data.frame': 5 obs. of 2 variables:
# .. ..$ Variables: Factor w/ 5 levels "Bio1","Bio2",..: 1 2 3 4 5
# .. ..$ VIF : num [1:5] 2.09 1.37 1.25 1.27 2.31
So you can now extract the results and excluded slots via:
v2@excluded
# [1] "Bio5" "Bio10" "Bio7" "Bio6" "Bio4"
v2@results
# Variables VIF
# 1 Bio1 2.086186
# 2 Bio2 1.370264
# 3 Bio3 1.253408
# 4 Bio8 1.267217
# 5 Bio9 2.309479

You should be able to use the command below to get the contents of the 'results' slot into a data frame. You can then split the information into separate data frames with ordinary subsetting.
df <- r@results
Note that r@results[1:2,2] would give you the VIF for the first two rows.
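To make the slot extraction concrete without needing usdm installed, here is a minimal sketch using a hypothetical stand-in class. ToyVIF, r_toy and all values are made up for illustration; only the slot names mirror what str() reports for the real VIF object. It builds one data frame from the excluded slot and takes the other from the results slot.

```r
library(methods)  # S4 tools; attached by default in interactive sessions

# Hypothetical stand-in for the usdm "VIF" class, reusing its slot names
# (the corMatrix slot is left out for brevity).
setClass("ToyVIF", representation(
  variables = "character",
  excluded  = "character",
  results   = "data.frame"
))

r_toy <- new("ToyVIF",
  variables = c("a", "b", "c", "d"),
  excluded  = c("b", "d"),
  results   = data.frame(Variables = c("a", "c"), VIF = c(1.03, 1.01),
                         stringsAsFactors = FALSE)
)

# Excluded variables as a one-column data frame
excluded_df <- data.frame(Variables = r_toy@excluded, stringsAsFactors = FALSE)

# VIFs of the remaining variables (already a data frame inside the object)
vif_df <- r_toy@results

# slot() is the functional equivalent of the @ operator
same <- identical(slot(r_toy, "results"), vif_df)
```

With the real object, the same two lines would presumably be data.frame(Variables = r@excluded) and r@results.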

Related

How to use an API in R to be able to get data for storing into a db?

I am trying to figure out how to fetch data in R and turn it into a table that I can store in a database such as SQL.
API <- "https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/{2020-01-01}/{2020-06-30}"
oxford_covid <- GET(API)
I then try to parse this data and turn it into a data frame, but when I do so I get these errors:
"Error: Columns 4, 5, 6, 7, 8, and 178 more must be named.
Use .name_repair to specify repair." and "Error: Tibble columns must have compatible sizes. * Size 2: Columns deaths, casesConfirmed, and stringency. * Size 176: Columns ..2020.12.27, ..2020.12.28, ..2020.12.29, and"
I am not sure whether there is a better approach or how to parse this. Is there a recommended method? I am not having much luck finding one online.
It looks like you're trying to take the JSON returned by that API and call read.table or something similar on it. Don't do that; JSON should be parsed by JSON tools (such as jsonlite::parse_json).
Some exploration of that URL:
js <- jsonlite::parse_json(url("https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/2020-01-01/2020-06-30"))
lengths(js)
# scale countries data
# 3 183 182
str(js, max.level = 2, list.len = 3)
# List of 3
# $ scale :List of 3
# ..$ deaths :List of 2
# ..$ casesConfirmed:List of 2
# ..$ stringency :List of 2
# $ countries:List of 183
# ..$ : chr "ABW"
# ..$ : chr "AFG"
# ..$ : chr "AGO"
# .. [list output truncated]
# $ data :List of 182
# ..$ 2020-01-01:List of 183
# ..$ 2020-01-02:List of 183
# ..$ 2020-01-03:List of 183
# .. [list output truncated]
So this is rather large. Since you're hoping for a data.frame, I'm going to look at js$data only; js$countries looks relatively uninteresting,
str(unlist(js$countries))
# chr [1:183] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "AUS" "AUT" "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ" "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CIV" "CMR" "COD" "COG" "COL" "CPV" ...
and does not correlate with the js$data. The js$scale might be interesting, but I'll skip it for now.
My first go-to for joining data like this into a data.frame is one of the following, depending on your preference for R dialects:
do.call(rbind.data.frame, list_of_frames) # base R
dplyr::bind_rows(list_of_frames) # tidyverse
data.table::rbindlist(list_of_frames) # data.table
But we're going to run into problems. Namely, there are entries that are NULL, when R would prefer that they be something (such as NA).
str(js$data[[1]][1])
# List of 2
# $ ABW:List of 8
# ..$ date_value : chr "2020-01-01"
# ..$ country_code : chr "ABW"
# ..$ confirmed : NULL # <--- problem
# ..$ deaths : NULL
# ..$ stringency_actual : int 0
# ..$ stringency : int 0
# ..$ stringency_legacy : int 0
# ..$ stringency_legacy_disp: int 0
So we need to iterate over each of those and replace NULL with NA. Unfortunately, I don't know of an easy tool to recursively go through lists of lists (even rapply doesn't work well in my tests), so we'll be a little brute-force here with a triple lapply.
Long story short, a single record looks like this before and after the replacement:
str(js$data[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : NULL
# $ deaths : NULL
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
jsdata <-
lapply(js$data, function(z) {
lapply(z, function(y) {
lapply(y, function(x) if (is.null(x)) NA else x)
})
})
str(jsdata[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : logi NA
# $ deaths : logi NA
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
(Technically, if we know that it's going to be integers, we should use NA_integer_. Fortunately, R and its dialects are able to work with this shortcut, as we'll see in a second.)
After that, we can do a double-dive rbinding and get back to the frame-making I discussed a couple of steps ago. Choose one of the following, whichever dialect you prefer:
alldat <- do.call(rbind.data.frame,
  lapply(jsdata, function(z) do.call(rbind.data.frame, z))) # base R
alldat <- dplyr::bind_rows(purrr::map(jsdata, dplyr::bind_rows)) # tidyverse
alldat <- data.table::rbindlist(lapply(jsdata, data.table::rbindlist)) # data.table
For simplicity, I'll show the first (base R) version:
tail(alldat)
# date_value country_code confirmed deaths stringency_actual stringency stringency_legacy stringency_legacy_disp
# 2020-06-30.AND 2020-06-30 AND 855 52 42.59 42.59 65.47 65.47
# 2020-06-30.ARE 2020-06-30 ARE 48667 315 72.22 72.22 83.33 83.33
# 2020-06-30.AGO 2020-06-30 AGO 284 13 75.93 75.93 83.33 83.33
# 2020-06-30.ALB 2020-06-30 ALB 2535 62 68.52 68.52 78.57 78.57
# 2020-06-30.ABW 2020-06-30 ABW 103 3 47.22 47.22 63.09 63.09
# 2020-06-30.AFG 2020-06-30 AFG 31507 752 78.70 78.70 76.19 76.19
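For anyone who wants to try the NULL-to-NA step and the double rbind without hitting the API, here is a minimal base R sketch on a made-up nested list. The toy object, the country codes AAA/BBB and all values are assumptions for illustration. It also converts each record with as.data.frame() before binding, which is a slightly more explicit variant of the base R call above rather than the exact same code.

```r
# Toy stand-in for js$data: dates -> countries -> fields (values are made up)
toy <- list(
  "2020-01-01" = list(
    AAA = list(date_value = "2020-01-01", country_code = "AAA",
               confirmed = NULL, stringency = 0),
    BBB = list(date_value = "2020-01-01", country_code = "BBB",
               confirmed = 5L, stringency = 11.1)
  ),
  "2020-01-02" = list(
    AAA = list(date_value = "2020-01-02", country_code = "AAA",
               confirmed = 1L, stringency = 0),
    BBB = list(date_value = "2020-01-02", country_code = "BBB",
               confirmed = 7L, stringency = 11.1)
  )
)

# Step 1: replace every NULL field with NA so all records have the same shape
clean <- lapply(toy, function(day) {
  lapply(day, function(rec) {
    lapply(rec, function(x) if (is.null(x)) NA else x)
  })
})

# Step 2: one-row data frame per record, stacked per day, then across days
per_day <- lapply(clean, function(day) {
  do.call(rbind, lapply(day, as.data.frame, stringsAsFactors = FALSE))
})
alldat <- do.call(rbind, per_day)
```

The result has one row per date and country, with NA where the source had NULL.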
And if you're curious about the $scale,
do.call(rbind.data.frame, js$scale)
# min max
# deaths 0 127893
# casesConfirmed 0 2633466
# stringency 0 100
## or
data.table::rbindlist(js$scale, idcol="id")
# id min max
# <char> <int> <int>
# 1: deaths 0 127893
# 2: casesConfirmed 0 2633466
# 3: stringency 0 100
## or
dplyr::bind_rows(js$scale, .id = "id")

'Unknown column' in R

I am working on economic research and have a data frame of regression coefficients built with melt and the tidy function from the broom package. My df:
> head(LmModGDP, 10)
Country variable term estimate std.error statistic p.value
1 Netherlands FDI_InFlow_MilUSD (Intercept) 5.354083e+02 5.974760e+01 8.961167 1.976417e-09
2 Netherlands FDI_InFlow_MilUSD value 2.400677e-03 1.409779e-03 1.702875 1.005189e-01
3 Netherlands FDI_InFlow_percGDP (Intercept) 6.184273e+02 6.723554e+01 9.197923 1.173719e-09
4 Netherlands FDI_InFlow_percGDP value -1.261933e+00 1.008740e+01 -0.125100 9.014067e-01
5 Netherlands FDI_InStock_MilUSD (Intercept) 3.110956e+02 2.719577e+01 11.439116 1.201802e-11
6 Netherlands FDI_InStock_MilUSD value 7.025298e-04 5.307147e-05 13.237429 4.620706e-13
7 Netherlands FDI_OutFlow_MilUSD (Intercept) 5.106762e+02 5.939921e+01 8.597356 4.465840e-09
8 Netherlands FDI_OutFlow_MilUSD value 1.920313e-03 8.646908e-04 2.220808 3.528536e-02
9 Netherlands FDI_OutFlow_percGDP (Intercept) 2.593453e+02 5.334202e+01 4.861932 4.838082e-05
10 Netherlands FDI_OutFlow_percGDP value 3.931491e+00 5.332541e-01 7.372641 7.896681e-08
After I filter the df using any method (even simple subsetting, or the dplyr package):
LmModGDP[LmModGDP$variable == "FDI_InStock_MilUSD",]
or
LmModGDP %>%
filter(variable == "FDI_InStock_MilUSD")
It returns the desired df, but when I hover over the last column (p.value) in the RStudio viewer it is labelled "Unknown Column", although the data are still correct. Also, str and class report the column as numeric, but the viewer shows something else.
My desired df:
Country variable term estimate std.error statistic p.value
5 Netherlands FDI_InStock_MilUSD (Intercept) 3.110956e+02 2.719577e+01 11.439116 1.201802e-11
6 Netherlands FDI_InStock_MilUSD value 7.025298e-04 5.307147e-05 13.237429 4.620706e-13
19 Romania FDI_InStock_MilUSD (Intercept) 3.122229e+01 3.313134e+00 9.423796 7.188216e-10
20 Romania FDI_InStock_MilUSD value 2.128223e-03 7.035679e-05 30.249006 8.588104e-22
When I try to use the kable function to display it in a Markdown report, the p.value column shows only 0 values, not the actual ones.
Can someone help me?
Update: here is the output of str:
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 28 obs. of 7 variables:
$ Country : chr "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
$ variable : Factor w/ 7 levels "FDI_InFlow_MilUSD",..: 1 1 2 2 3 3 4 4 5 5 ...
$ term : chr "(Intercept)" "value" "(Intercept)" "value" ...
$ estimate : num 535.4083 0.0024 618.4273 -1.2619 311.0956 ...
$ std.error: num 59.7476 0.00141 67.23554 10.0874 27.19577 ...
$ statistic: num 8.961 1.703 9.198 -0.125 11.439 ...
$ p.value : num 1.98e-09 1.01e-01 1.17e-09 9.01e-01 1.20e-11 ...
- attr(*, "vars")= chr "Country" "variable"
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 14
..$ : int 0 1
..$ : int 2 3
..$ : int 4 5
..$ : int 6 7
..$ : int 8 9
..$ : int 10 11
..$ : int 12 13
..$ : int 14 15
..$ : int 16 17
..$ : int 18 19
..$ : int 20 21
..$ : int 22 23
..$ : int 24 25
..$ : int 26 27
- attr(*, "group_sizes")= int 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "biggest_group_size")= int 2
- attr(*, "labels")='data.frame': 14 obs. of 2 variables:
..$ Country : chr "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
..$ variable: Factor w/ 7 levels "FDI_InFlow_MilUSD",..: 1 2 3 4 5 6 7 1 2 3 ...
..- attr(*, "vars")= chr "Country" "variable"
..- attr(*, "drop")= logi TRUE
I cannot comment yet, which is why I am writing this as an answer.
Could you show us the output of str(LmModGDP)? Maybe the df is nested, or it is not a plain data frame but has special properties. Have you tried forcing LmModGDP <- as.data.frame(LmModGDP)?
Have you tried forcing LmModGDP$p.value <- as.numeric(LmModGDP$p.value)?
Have you tried converting to a data.table to see whether the behavior differs after applying your filter?
UPDATE1:
Thanks for posting the str(). Your object is a "grouped_df". Have you tried ungroup(LmModGDP)?
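A minimal sketch of that suggestion, assuming dplyr is installed. The toy data frame and its values are made up, and whether ungrouping actually fixes the viewer label or the kable output is not confirmed here. It only shows that filtering keeps the grouped_df class and that ungroup() followed by as.data.frame() gives back a plain data frame.

```r
library(dplyr)

# Made-up stand-in for LmModGDP
toy <- data.frame(
  Country  = c("A", "A", "B", "B"),
  variable = c("x", "y", "x", "y"),
  p.value  = c(1.2e-11, 4.6e-13, 7.2e-10, 8.6e-22),
  stringsAsFactors = FALSE
)

# Group as in the question's str() output, then filter
grouped  <- group_by(toy, Country, variable)
filtered <- filter(grouped, variable == "x")
class(filtered)   # still includes "grouped_df"

# Drop the grouping and the tibble classes
plain <- as.data.frame(ungroup(filtered))
class(plain)      # plain "data.frame"
```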

Subsetting SPSS data imported into r with package haven?

I've used the package haven to read SPSS data into R. All seems ok, except that when I try to subset the data it doesn't seem to behave correctly. Here's the code (I don't have SPSS to create example data and can't post the real stuff):
require(haven)
df <- read_spss("filename1.sav")
tmp <- df[as_factor(df$variable1) == "factor1",]
tmp <- tmp[!is.na(tmp$variable2), ]
The above df has "NA" scattered throughout. I expected the above to subset only the data, keeping only rows with variable1 with "factor1" and discarding all rows with NAs in variable2. The first subset works as expected. But the second subset does not. It removes rows, but NAs are still present.
I suspect the issue has something to do with the way haven structures the imported data and uses the class labelled instead of an actual factor variable, but it's over my head. Anyone know what could be happening and how to accomplish the same?
Here's the structure of df, variable1 and variable2:
> str(df)
'data.frame': 4573 obs. of 316 variables:
> str(df$variable1)
Class 'labelled' atomic [1:4573] 9 9 9 14 8 8 2 4 8 16 ...
..- attr(*, "labels")= Named num [1:18] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, "names")= chr [1:18] "factor1" "factor2" "factor3" "factor4" ...
> str(df$variable2)
Class 'labelled' atomic [1:4573] 3 NA 3 NA 3 NA 1 1 NA NA ...
..- attr(*, "labels")= Named num [1:3] 1 2 3
.. ..- attr(*, "names")= chr [1:3] "Sponsor" "Not a Sponsor" "Don't Know"

How to access composite elements in a data frame

I've created this data frame and want to access the individual elements for plotting, but it seems I can't. What kind of data frame have I created, and how can I access its individual elements?
> print(df)
B.mean B.conf1 B.conf2
1 0.75000000 -0.18826132 1.68826132
2 0.66666667 0.01334534 1.31998799
3 0.33333333 -0.31998799 0.98665466
> names(df)
[1] "B"
> str(df)
'data.frame': 3 obs. of 1 variable:
$ B: num [1:3, 1:3] 0.75 0.6667 0.3333 -0.1883 0.0133 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mean" "conf1" "conf2"
The 'B' column is a matrix as evident from the str of 'df'. By using do.call with data.frame, it gets converted to 3 columns of a data.frame.
do.call(data.frame, df)
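As a self-contained illustration, the sketch below rebuilds a similar one-column data frame whose only column is a matrix (the numbers are rounded from the question, and the helper column id is hypothetical, used only to construct the object). It then shows both ways to reach the values: indexing the matrix column directly, or flattening it with do.call.

```r
# A 3 x 3 matrix with named columns, as in the question's str() output
m <- matrix(c(0.75, 0.67, 0.33,
              -0.19, 0.01, -0.32,
              1.69, 1.32, 0.99),
            nrow = 3,
            dimnames = list(NULL, c("mean", "conf1", "conf2")))

# Build a data frame whose single column B is that matrix
df <- data.frame(id = 1:3)   # 'id' is only a temporary helper column
df$B <- m
df <- df["B"]

# Option 1: index the matrix column directly
b_mean <- df$B[, "mean"]

# Option 2: flatten into three ordinary columns
flat <- do.call(data.frame, df)
names(flat)
```

After flattening, the columns can be used like any others, for example flat$B.mean.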

Find AUC with tree package - binary response

I am attempting to get the ROC curve and AUC for a CART decision tree built with the "tree" package.
Here is the structure of my tree:
> str(pruned.tree7)
'data.frame': 13 obs. of 6 variables:
$ var : Factor w/ 15 levels "","Age",..: 15 10 1 11 11 5 1 1 15 1 ...
$ n : num 383 158 29 129 110 38 20 18 72 7 ...
$ dev : num 461.1 218.6 29.6 174 141.8 ...
$ yval : Factor w/ 2 levels "Negative","Positive": 2 2 1 2 2 1 2 1 2 1 ...
$ splits: chr [1:13, 1:2] "<19.5" "<81.5" "" "<65" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "cutleft" "cutright"
$ yprob : num [1:13, 1:2] 0.29 0.475 0.793 0.403 0.345 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Negative" "Positive"
Referencing the above structure, I have written (many variations of) the following code:
> preds <- prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob))
Error in prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob)) :
Number of predictions in each run must be equal to the number of labels for each run.
> preds <- prediction(pruned.tree7$frame$yprob, dimnames)
Error in prediction(pruned.tree7$frame$yprob, dimnames) :
Format of labels is invalid.
> preds <- prediction(pruned.tree7$frame$yprob, "dimnames")
Error in prediction(pruned.tree7$frame$yprob, "dimnames") :
Number of cross-validation runs must be equal for predictions and labels.
> preds <- prediction(pruned.tree7$frame$yprob, names(yprob))
Error in is.data.frame(labels) : object 'yprob' not found
> preds <- prediction(pruned.tree7$frame$yprob, names(pruned.tree7$frame$yprob))
Error in prediction(pruned.tree7$frame$yprob, names(pruned.tree7$frame$yprob)) :
Format of labels is invalid.
> preds <- prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob))
Error in prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob)) :
Number of predictions in each run must be equal to the number of labels for each run.
I have searched and found this link: ROCR Package Documentation
It mentions the topic of cross-validation. However, it does not make sense to me.
Thank you in advance!!
