I've imported a dataset into R in which a column that should contain numeric values also contains the string NULL. This makes R set the column class to character or factor, depending on whether the stringsAsFactors argument is used.
To give you an idea, this is the structure of the dataset.
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the NULLs it contains make this harder.
I tried to get around the issue by setting every observation whose current value is NULL to 0, but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the NULL observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
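You have an indexing error: data$PCROI is a vector, so it takes only one subscript and the trailing comma must go:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, you can also write:
data$PCROI = as.numeric(data$PCROI)
On a character column this converts your "NULL" to NA automatically (with a coercion warning). If the column is a factor, use as.numeric(as.character(data$PCROI)) instead, because as.numeric on a factor returns the level codes.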
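A minimal sketch of the conversion described above, using a made-up vector in place of the real PCROI column. The names pcroi_chr, pcroi_fct and imported are hypothetical, and the inline CSV is only there to show the na.strings argument of read.csv, which lets you declare the "NULL" marker at import time instead of fixing it afterwards.

```r
# Hypothetical stand-in for the PCROI column from the question.
pcroi_chr <- c("6.5", "8.17", "6.29", "NULL")

# Replace the literal string "NULL" with NA first, so as.numeric() emits no warning.
pcroi_chr[pcroi_chr == "NULL"] <- NA
pcroi_num <- as.numeric(pcroi_chr)

# If the column was read as a factor, go through as.character() first;
# as.numeric() on a factor returns the internal level codes, not the values.
pcroi_fct <- factor(c("6.5", "8.17", "6.29", "NULL"))
pcroi_from_fct <- suppressWarnings(as.numeric(as.character(pcroi_fct)))

# Alternatively, declare the marker at import time (inline CSV for illustration only).
imported <- read.csv(text = "Clicks,PCROI\n472,6.5\n13,NULL", na.strings = "NULL")
str(imported)
```

Using NA rather than 0 keeps missing values distinguishable from a genuine ROI of zero; whether that matters depends on what you do with the column next.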
I am trying to figure out how to fetch data in R and turn it into a table that I can store in a database such as SQL.
API <- "https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/{2020-01-01}/{2020-06-30}"
oxford_covid <- GET(API)
I then try to parse this data and make it into a dataframe but when I do so I get the errors of:
"Error: Columns 4, 5, 6, 7, 8, and 178 more must be named.
Use .name_repair to specify repair." and "Error: Tibble columns must have compatible sizes. * Size 2: Columns deaths, casesConfirmed, and stringency. * Size 176: Columns ..2020.12.27, ..2020.12.28, ..2020.12.29, and"
I am not sure if there is a better approach or how to parse this. Is there a method or approach? I am not having much luck online.
It looks like you're trying to take the JSON return from that API and call read.table or something on it. Don't do that, JSON should be parsed by JSON tools (such as jsonlite::parse_json).
Some work on that URL.
js <- jsonlite::parse_json(url("https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/2020-01-01/2020-06-30"))
lengths(js)
# scale countries data
# 3 183 182
str(js, max.level = 2, list.len = 3)
# List of 3
# $ scale :List of 3
# ..$ deaths :List of 2
# ..$ casesConfirmed:List of 2
# ..$ stringency :List of 2
# $ countries:List of 183
# ..$ : chr "ABW"
# ..$ : chr "AFG"
# ..$ : chr "AGO"
# .. [list output truncated]
# $ data :List of 182
# ..$ 2020-01-01:List of 183
# ..$ 2020-01-02:List of 183
# ..$ 2020-01-03:List of 183
# .. [list output truncated]
So this is rather large. Since you're hoping for a data.frame, I'm going to look at js$data only; js$countries looks relatively uninteresting,
str(unlist(js$countries))
# chr [1:183] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "AUS" "AUT" "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ" "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CIV" "CMR" "COD" "COG" "COL" "CPV" ...
and does not correlate with the js$data. The js$scale might be interesting, but I'll skip it for now.
My first go-to for joining data like this into a data.frame is one of the following, depending on your preference for R dialects:
do.call(rbind.data.frame, list_of_frames) # base R
dplyr::bind_rows(list_of_frames) # tidyverse
data.table::rbindlist(list_of_frames) # data.table
But we're going to run into problems. Namely, there are entries that are NULL, when R would prefer that they be something (such as NA).
str(js$data[[1]][1])
# List of 2
# $ ABW:List of 8
# ..$ date_value : chr "2020-01-01"
# ..$ country_code : chr "ABW"
# ..$ confirmed : NULL # <--- problem
# ..$ deaths : NULL
# ..$ stringency_actual : int 0
# ..$ stringency : int 0
# ..$ stringency_legacy : int 0
# ..$ stringency_legacy_disp: int 0
So we need to iterate over each of those and replace NULL with NA. Unfortunately, I don't know of an easy tool to recursively go through lists of lists (even rapply doesn't work well in my tests), so we'll be a little brute-force here with a triple-lapply:
Long-story-short,
str(js$data[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : NULL
# $ deaths : NULL
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
jsdata <-
lapply(js$data, function(z) {
lapply(z, function(y) {
lapply(y, function(x) if (is.null(x)) NA else x)
})
})
str(jsdata[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : logi NA
# $ deaths : logi NA
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
(Technically, if we know that it's going to be integers, we should use NA_integer_. Fortunately, R and its dialects are able to work with this shortcut, as we'll see in a second.)
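As an alternative to the fixed-depth triple lapply, here is a small recursive sketch that does the same replacement at any nesting depth. The helper name null_to_na and the toy list are hypothetical; the toy list only mimics the date -> country -> field nesting seen in js$data.

```r
# Hypothetical helper: walk a nested list and swap every NULL for NA.
null_to_na <- function(x) {
  if (is.null(x)) return(NA)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}

# Small made-up structure with the same date -> country -> field nesting.
nested <- list(
  "2020-01-01" = list(
    ABW = list(country_code = "ABW", confirmed = NULL, stringency = 0L)
  )
)
cleaned <- null_to_na(nested)
str(cleaned)
```

Applied to the real data this would be called as null_to_na(js$data), assuming the structure shown in the str output above.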
After that, we can do a double-dive rbinding and get back to the frame-making I discussed a couple of steps ago. Choose one of the following, whichever dialect you prefer:
alldat <- do.call(rbind.data.frame,
lapply(jsdata, function(z) do.call(rbind.data.frame, z)))
alldat <- dplyr::bind_rows(purrr::map(jsdata, dplyr::bind_rows))
alldat <- data.table::rbindlist(lapply(jsdata, data.table::rbindlist))
For simplicity, I'll show the first (base R) version:
tail(alldat)
# date_value country_code confirmed deaths stringency_actual stringency stringency_legacy stringency_legacy_disp
# 2020-06-30.AND 2020-06-30 AND 855 52 42.59 42.59 65.47 65.47
# 2020-06-30.ARE 2020-06-30 ARE 48667 315 72.22 72.22 83.33 83.33
# 2020-06-30.AGO 2020-06-30 AGO 284 13 75.93 75.93 83.33 83.33
# 2020-06-30.ALB 2020-06-30 ALB 2535 62 68.52 68.52 78.57 78.57
# 2020-06-30.ABW 2020-06-30 ABW 103 3 47.22 47.22 63.09 63.09
# 2020-06-30.AFG 2020-06-30 AFG 31507 752 78.70 78.70 76.19 76.19
And if you're curious about the $scale,
do.call(rbind.data.frame, js$scale)
# min max
# deaths 0 127893
# casesConfirmed 0 2633466
# stringency 0 100
## or
data.table::rbindlist(js$scale, idcol="id")
# id min max
# <char> <int> <int>
# 1: deaths 0 127893
# 2: casesConfirmed 0 2633466
# 3: stringency 0 100
## or
dplyr::bind_rows(js$scale, .id = "id")
I googled my error, but that didn't help me.
I have a data frame with a column x.
unique(df$x)
The result is:
[1] "fc_social_media" "fc_banners" "fc_nat_search"
[4] "fc_direct" "fc_paid_search"
When I try this:
df <- spread(data = df, key = x, value = x, fill = "0")
I got the error:
Error in `[.data.frame`(data, setdiff(names(data), c(key_var, value_var))) :
undefined columns selected
But that is very weird, because I used the spread function (in the same script) different times.
So I googled, saw some "solutions":
I removed all the "special" characters. As you can see, my unique
values do not contain special characters (cleaned it). But this didn't
help.
I checked if there are any columns with the same name. But all column names
are unique.
@Gregor, @Akrun:
> str(df)
'data.frame': 100 obs. of 22 variables:
$ visitor_id : chr "321012312666671237877-461170125342559040419" "321012366667112237877-461121705342559040419" "321012366661271237877-461170534255901240419" "321012366612671237877-461170534212559040419" ...
$ visit_num : chr "1" "1" "1" "1" ...
$ ref_domain : chr "l.facebook.com" "X.co.uk" "x.co.uk" "" ...
$ x : chr "fc_social_media" "fc_social_media" "fc_social_media" "fc_social_media" ...
$ va_closer_channel : chr "Social Media" "Social Media" "Social Media" "Social Media" ...
$ row : int 1 2 3 4 5 6 7 8 9 10 ...
$ : chr "0" "0" "0" "0" ...
$ Hard Drive : chr "0" "0" "0" "0" ...
The error could be due to a column without a name, i.e. "". Using a reproducible example:
library(tidyr)
spread(df, x, x)
Error in `[.data.frame`(data, setdiff(names(data), c(key_var, value_var))) :
undefined columns selected
We could make it work by changing the column name
names(df) <- make.names(names(df))
spread(df, x, x, fill = "0")
# X fc_banners fc_direct fc_nat_search fc_paid_search fc_social_media
#1 1 0 0 0 0 fc_social_media
#2 2 fc_banners 0 0 0 0
#3 3 0 0 fc_nat_search 0 0
#4 4 0 fc_direct 0 0 0
#5 5 0 0 0 fc_paid_search 0
data
df <- data.frame(x = c("fc_social_media", "fc_banners",
"fc_nat_search", "fc_direct", "fc_paid_search"), x1 = 1:5, stringsAsFactors = FALSE)
names(df)[2] <- ""
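Before reshaping, it may help to check the column names directly. The sketch below uses only base R on a made-up two-row frame (df_check is a hypothetical name) and flags names that are missing, empty, or duplicated, then repairs them with make.names.

```r
# Same shape as the reproducible example: the second column has an empty name.
df_check <- data.frame(x = c("fc_banners", "fc_direct"), x1 = 1:2,
                       stringsAsFactors = FALSE)
names(df_check)[2] <- ""

# Flag names that are missing, empty, or duplicated.
nm <- names(df_check)
bad <- is.na(nm) | nm == "" | duplicated(nm)
which(bad)

# make.names() turns "" into "X"; unique = TRUE also de-duplicates.
names(df_check) <- make.names(nm, unique = TRUE)
names(df_check)
```

Note that make.names also rewrites names containing spaces, so a column such as "Hard Drive" would become "Hard.Drive".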
I have a dataset on which I plan to use ubRacing from the unbalanced package. However, ubRacing only accepts numeric columns. Is there any way I can convert all the chr columns to numeric in R?
Thanks
'data.frame': 31000 obs. of 22 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : chr "housemaid" "services" "services" "admin." ...
$ marital : chr "married" "married" "married" "married" ...
$ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
$ default : chr "no" "unknown" "no" "no" ...
$ housing : chr "no" "no" "yes" "no" ...
$ loan : chr "no" "no" "no" "no" ...
$ contact : chr "telephone" "telephone" "telephone" "telephone" ...
$ month : chr "may" "may" "may" "may" ...
$ day_of_week : chr "mon" "mon" "mon" "mon" ...
It is not clear how the character columns should be converted to numeric. One possible option is to convert each character column to a factor and then coerce it to numeric. We loop through the columns of the dataset with lapply.
df1[] <- lapply(df1, function(x) if(is.character(x)) as.numeric(factor(x))
else (x))
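To see what this conversion produces, here is the same loop on a made-up three-row miniature of the data (df_small is a hypothetical name). The integer codes follow the factor levels, which are sorted alphabetically by default, so they carry no real ordering; treat this as a sketch of the mechanics rather than a recommendation for how to encode categories.

```r
# Made-up miniature of the data set in the question.
df_small <- data.frame(age = c(56L, 57L, 37L),
                       job = c("housemaid", "services", "services"),
                       housing = c("no", "no", "yes"),
                       stringsAsFactors = FALSE)

# Character columns become level codes; other columns are left untouched.
df_small[] <- lapply(df_small, function(x) {
  if (is.character(x)) as.numeric(factor(x)) else x
})
str(df_small)
```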
I know there is a lot of information in Google about this problem, but I could not solve it.
I have a data frame:
> str(myData)
'data.frame': 1199456 obs. of 7 variables:
$ A: num 3064 82307 4431998 1354 193871 ...
$ B: num 6067 403916 2709997 2743 203434 ...
$ C: num 299 11752 33282 170 2748 ...
$ D: num 105 6676 7065 20 1593 ...
$ E: num 8 572 236 3 170 ...
$ F: num 0 21 95 0 13 ...
$ G: num 583 18512 961328 348 42728 ...
Then I convert it to a matrix in order to apply the Cramer-von Mises test from "cramer" library:
> myData = as.matrix(myData)
> str(myData)
num [1:1199456, 1:7] 3064 82307 4431998 1354 193871 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1199456] "8" "32" "48" "49" ...
..$ : chr [1:7] "A" "B" "C" "D" ...
After that, if I apply a "cramer.test(myData[x1:y1,], myData[x2:y2,])" I get the following error:
Error in rep(0, (RVAL$m + RVAL$n)^2) : invalid 'times' argument
In addition: Warning message:
In matrix(rep(0, (RVAL$m + RVAL$n)^2), ncol = (RVAL$m + RVAL$n)) :
NAs introduced by coercion
I also tried to convert the data frame to a matrix like this, but the error is the same:
> myData = as.matrix(sapply(myData, as.numeric))
> str(myData)
num [1:1199456, 1:7] 3064 82307 4431998 1354 193871 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "A" "B" "C" "D" ...
Your problem is that your data set is too large for the algorithm that cramer.test is using (at least the way it's coded). The code tries to create a lookup table according to
lookup <- matrix(rep(0, (RVAL$m + RVAL$n)^2),
ncol = (RVAL$m + RVAL$n))
where RVAL$m and RVAL$n are the number of rows of the two samples. The standard maximum length of an R vector is 2^31-1 on a 32-bit platform: since your samples have equal numbers of rows N, you'll be trying to create a vector of length (2*N)^2, which in your case is 5.754779e+12 -- probably too big even if R would let you create the vector.
You may have to look for another implementation of the test, or another test.
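The size estimate can be checked with a few lines of arithmetic. This sketch assumes, as the answer does, that both samples contain all 1199456 rows shown in str(myData); the variable names are hypothetical.

```r
n_rows <- 1199456             # rows in each sample, taken from str(myData)
cells  <- (2 * n_rows)^2      # entries in the (m + n) x (m + n) lookup matrix
cells                         # about 5.75e12
cells > .Machine$integer.max  # compare against 2^31 - 1
cells * 8 / 1024^4            # approximate size in TiB at 8 bytes per double
```

Because the lookup table grows with the square of the combined sample size, running the test on much smaller subsamples is the only way this implementation could fit in memory.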
I am having problems creating a boxplot of my data, because one of my variables is in the form of a list.
I am trying to create a boxplot:
boxplot(dist~species, data=out)
and received the following error:
Error in model.frame.default(formula = dist ~ species, data = out) :
invalid type (list) for variable 'species'
I have been unsuccessful in forcing 'species' into the form of a factor:
out[species]<- as.factor(out[[out$species]])
and receive the following error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
How can I convert my 'species' column into a factor which I can then use to create a boxplot? Thanks.
EDIT:
str(out)
'data.frame': 4570 obs. of 6 variables:
$ GridRef : chr "NT73" "NT80" "NT85" "NT86" ...
$ pred : num 154 71 81 85 73 99 113 157 92 85 ...
$ pred_bin : int 0 0 0 0 0 0 0 0 0 0 ...
$ dist : num 20000 10000 9842 14144 22361 ...
$ years_since_1990: chr "21" "16" "21" "20" ...
$ species :List of 4570
..$ : chr "C.splendens"
..$ : chr "C.splendens"
..$ : chr "C.splendens"
.. [list output truncated]
It's hard to imagine how you got the data into this form in the first place, but it looks like
out <- transform(out,species=unlist(species))
should solve your problem.
set.seed(101)
f <- as.list(sample(letters[1:5],replace=TRUE,size=100))
## need I() to make a wonky data frame ...
d <- data.frame(y=runif(100),f=I(f))
## 'data.frame': 100 obs. of 2 variables:
## $ y: num 0.125 0.0233 0.3919 0.8596 0.7183 ...
## $ f:List of 100
## ..$ : chr "b"
## ..$ : chr "a"
boxplot(y~f,data=d) ## invalid type (list) ...
d2 <- transform(d,f=unlist(f))
boxplot(y~f,data=d2)
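One caveat with unlist: it only lines up with the rows if every list element has exactly one value. A short sketch of that check on made-up data follows (out_demo and the second species name are invented for illustration).

```r
# Made-up list-column similar to 'species' in the question.
out_demo <- data.frame(dist = c(20000, 10000, 9842))
out_demo$species <- list("C.splendens", "C.splendens", "C.virgo")

# unlist() only matches the rows if every element has length 1.
stopifnot(all(lengths(out_demo$species) == 1))

# Flatten the list and store it as a factor, ready for boxplot(dist ~ species).
out_demo$species <- factor(unlist(out_demo$species))
str(out_demo)
```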