I have some data in JSON that I am trying to use in R. My problem is that I cannot get the data into the right format.
require(RJSONIO)
json <- "[{\"ID\":\"id1\",\"VALUE\":\"15\"},{\"ID\":\"id2\",\"VALUE\":\"10\"}]"
example <- fromJSON(json)
example <- do.call(rbind,example)
example <- as.data.frame(example,stringsAsFactors=FALSE)
> example
ID VALUE
1 id1 15
2 id2 10
This gets close, but I cannot get the numeric column to convert to numeric. I know I can convert columns manually, but I thought data.frame or as.data.frame scanned the data and made the most appropriate class definitions. Clearly I misunderstood. I am reading in numerous tables - all very different - and I need to have the numeric data treated as such when it's numeric.
Ultimately I am looking to get data tables with numeric columns when the data is numeric.
read.table uses type.convert to convert data to the appropriate type. You could do the same as a cleaning step after reading in the JSON data.
sapply(example,class)
# ID VALUE
# "character" "character"
example[] <- lapply(example, type.convert, as.is = TRUE)
sapply(example, class)
# ID VALUE
# "character" "integer"
I would recommend that you use the jsonlite package, which converts this to a data frame by default:
jsonlite::fromJSON(json)
ID VALUE
1 id1 15
2 id2 10
NOTE: The numeric problem still remains, since the values in this JSON are encoded as strings. So you will have to convert the numeric columns manually.
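For example, combining this with the type.convert cleaning step from above (a sketch):
library(jsonlite)
json <- "[{\"ID\":\"id1\",\"VALUE\":\"15\"},{\"ID\":\"id2\",\"VALUE\":\"10\"}]"
# fromJSON() already returns a data frame, but both columns are character
example <- fromJSON(json)
# convert columns that look numeric, leave the rest as character
example[] <- lapply(example, type.convert, as.is = TRUE)
sapply(example, class)
#          ID       VALUE
# "character"   "integer"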
Just to follow up on Ramnath's suggestion to transition to jsonlite, I did some benchmarking of the two approaches:
##RJSONIO vs. jsonlite for a simple example
require(RJSONIO)
require(jsonlite)
require(microbenchmark)
json <- "{\"ID\":\"id1\",\"VALUE\":\"15\"},{\"ID\":\"id2\",\"VALUE\":\"10\"}"
test <- rep(json,1000)
test <- paste(test,collapse=",")
test <- paste0("[",test,"]")
func1 <- function(x){
temp <- jsonlite::fromJSON(x)
}
func2 <- function(x){
temp <- RJSONIO::fromJSON(x)
temp <- do.call(rbind,temp)
temp <- as.data.frame(temp,stringsAsFactors=FALSE)
}
> microbenchmark(func1(test),func2(test))
Unit: milliseconds
expr min lq median uq max neval
func1(test) 204.05228 221.46047 233.93321 246.90815 341.95684 100
func2(test) 21.60289 22.36368 22.70935 23.75409 27.41851 100
At least for now (and I know the jsonlite package is still new and focusing on accuracy over performance), the older RJSONIO is performing faster for this simple example - even with transforming the list into a data frame.
Update including rjson:
require(rjson)
func3 <- function(x){
temp <- rjson::fromJSON(x)
temp <- do.call(rbind,lapply(temp,unlist))
temp <- as.data.frame(temp,stringsAsFactors=FALSE)
}
> microbenchmark(func1(test),func2(test),func3(test))
Unit: milliseconds
expr min lq median uq max neval
func1(test) 205.34603 220.85428 234.79492 249.87628 323.96853 100
func2(test) 21.76972 22.67311 23.11287 23.56642 32.97469 100
func3(test) 14.16942 15.96937 17.29122 20.19562 35.63004 100
> microbenchmark(func1(test),func2(test),func3(test),times=500)
Unit: milliseconds
expr min lq median uq max neval
func1(test) 206.48986 225.70693 241.16301 253.83269 336.88535 500
func2(test) 21.75367 22.53256 23.06782 23.93026 103.70623 500
func3(test) 14.21577 15.61421 16.86046 19.27347 95.13606 500
> identical(func1(test),func2(test)) & identical(func1(test),func3(test))
[1] TRUE
At least on my machine rjson is only slightly faster, although I did not test how it scales compared to RJSONIO, which may be where it gets the big performance bump Ramnath suggested.
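If you want to check the scaling yourself, here is a rough sketch (the sizes are arbitrary; it reuses the json string and the func2/func3 definitions above):
sizes <- c(1e3, 1e4, 1e5)
for (n in sizes) {
  # build a JSON array with n copies of the two-record json string above
  test_n <- paste0("[", paste(rep(json, n), collapse = ","), "]")
  print(microbenchmark(func2(test_n), func3(test_n), times = 5))
}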
I'm running a simulation where I need to repeatedly extract one column from a matrix and check each of its values against some condition (e.g. < 10). However, doing so with a matrix is 3 times slower than doing the same thing with a data.frame. Why is this the case?
I'd like to use matrices to store the simulation data because they are faster for some other operations (e.g. updating columns by adding/subtracting values). How can I extract columns / subset a matrix in a faster way?
Extract column from data.frame vs matrix:
df <- data.frame(a = 1:1e4)
m <- as.matrix(df)
library(microbenchmark)
microbenchmark(
df$a,
m[ , "a"])
# Results; Unit: microseconds
# expr min lq mean median uq max neval cld
# df$a 5.463 5.8315 8.03997 6.612 8.0275 57.637 100 a
# m[ , "a"] 64.699 66.6265 72.43631 73.759 75.5595 117.922 100 b
Extract single value from data.frame vs matrix:
microbenchmark(
df[1, 1],
df$a[1],
m[1, 1],
m[ , "a"][1])
# Results; Unit: nanoseconds
# expr min lq mean median uq max neval cld
# df[1, 1] 8248 8753.0 10198.56 9818.5 10689.5 48159 100 c
# df$a[1] 4072 4416.0 5247.67 5057.5 5754.5 17993 100 b
# m[1, 1] 517 708.5 828.04 810.0 920.5 2732 100 a
# m[ , "a"][1] 45745 47884.0 51861.90 49100.5 54831.5 105323 100 d
I expected the matrix column extraction to be faster, but it was slower. However, extracting a single value from a matrix (i.e. m[1, 1]) was faster than both of the ways of doing so with a data.frame. I'm lost as to why this is.
Extract row vs column, data.frame vs matrix:
The above is only true for selecting columns. When selecting rows, matrices are much faster than data.frames. Still don't know why.
microbenchmark(
df[1, ],
m[1, ],
df[ , 1],
m[ , 1])
# Result: Unit: nanoseconds
# expr min lq mean median uq max neval cld
# df[1, ] 16359 17243.5 18766.93 17860.5 19849.5 42973 100 c
# m[1, ] 718 999.5 1175.95 1181.0 1327.0 3595 100 a
# df[ , 1] 7664 8687.5 9888.57 9301.0 10535.5 42312 100 b
# m[ , 1] 64874 66218.5 72074.93 73717.5 74084.5 97827 100 d
data.frame
Consider the builtin data frame BOD. Data frames are stored as a list of columns, and the inspect output shown below gives the address of each of the two columns of BOD. We then assign its second column to BOD2. Note that the address of BOD2 is the same memory location as the second column shown in the inspect output for BOD. That is, all R did to create BOD2 was have it point to memory within BOD. There was no data movement at all. Another way to see this is to compare the sizes of BOD, BOD2 and both together: both together take up the same amount of memory as BOD alone, so there can have been no copying. (Continued after code.)
library(pryr)
BOD2 <- BOD[[2]]
inspect(BOD)
## <VECSXP 0x507c278>
## <REALSXP 0x4f81f48>
## <REALSXP 0x4f81ed8> <--- compare this address to address shown below
## ...snip...
BOD2 <- BOD[,2]
address(BOD2)
## [1] "0x4f81ed8"
object_size(BOD)
## 1.18 kB
object_size(BOD2)
## 96 B
object_size(BOD, BOD2) # same as object_size(BOD) above
## 1.18 kB
matrix
Matrices are stored as one long vector with dimensions rather than as a list of columns, so the strategy for extracting a column is different. If we look at the memory used by a matrix m, an extracted column m2, and both together, we see below that both together use the sum of the memories of the individual objects, showing that data was copied.
set.seed(123)
n <- 10000L
m <- matrix(rnorm(2*n), n, 2)
m2 <- m[, 2]
object_size(m)
## 160 kB
object_size(m2)
## 80 kB
object_size(m, m2)
## 240 kB <-- unlike for data.frames this equals sum of above
what to do
If your program uses column extraction only up to a certain point, you could use a data frame for that portion and then do a one-time conversion to a matrix and process it like that for the rest.
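A minimal sketch of that pattern (the object and column names are just illustrative):
# phase 1: column-heavy access -- keep a data frame
sim <- data.frame(a = rnorm(1e4), b = rnorm(1e4))
ok  <- sim$a < 10                  # cheap column extraction
# phase 2: arithmetic-heavy updates -- convert once, then work on the matrix
sim_m <- as.matrix(sim)
sim_m[, "b"] <- sim_m[, "b"] + 1   # fast element-wise updates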
I suppose it comes down to how R lays the data out in memory.
A matrix in R is stored as one long vector with a dim attribute, essentially a 1-d array. A single value is just a direct offset into that memory, which is why extracting it is so fast. Extracting a whole column, however, requires computing the offsets, allocating new memory and copying the values into it. A data frame, on the other hand, is really a list of columns, so returning a column only hands back a reference to an existing vector, which is faster.
That's my guess; I hope someone can confirm it.
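A quick way to see the two layouts using only base R (a small illustration):
m  <- matrix(1:6, nrow = 2)                   # one vector 1,2,3,4,5,6 plus a dim attribute
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)   # a list of three separate column vectors
typeof(m)    # "integer" -- the matrix itself is the underlying vector
typeof(df)   # "list"    -- each column is its own vector object
dim(m)       # 2 3
names(df)    # "a" "b" "c"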
I'm trying to sum the digits of the integers in the last 2 columns of my data frame. I have found a function that does the summing, but I think I may have an issue with how I'm applying the function - not sure?
Dataframe
a = c("a", "b", "c")
b = c(1, 11, 2)
c = c(2, 4, 23)
data <- data.frame(a,b,c)
#Digitsum function
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
#Applying function
data[2:3] <- lapply(data[2:3], digitsum)
This is the error that I get:
Warning messages:
1: In 0:(nchar(as.character(x)) - 1) :
  numerical expression has 3 elements: only the first used
2: In 0:(nchar(as.character(x)) - 1) :
  numerical expression has 3 elements: only the first used
Your function digitsum at the moment works fine for a single scalar input, for example,
digitsum(32)
# [1] 5
But it cannot take a vector input, because ":" only uses the first element of its arguments (hence the warnings). You need to vectorize this function, using Vectorize:
vec_digitsum <- Vectorize(digitsum)
Then it works for a vector input:
b = c(1, 11, 2)
vec_digitsum(b)
# [1] 1 2 2
Now you can use lapply without trouble.
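For example, applied to the data frame from the question (a quick sketch):
data[2:3] <- lapply(data[2:3], vec_digitsum)
data
#   a b c
# 1 a 1 2
# 2 b 2 4
# 3 c 2 5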
@Zheyuan Li's answer solved your problem with lapply. Though I'd like to add several points:
Vectorize is just a wrapper around mapply, so it does not give you the performance of true vectorization.
The function itself can be improved for much better readability; see:
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
vec_digitsum <- Vectorize(digitsum)
sumdigits <- function(x){
digits <- strsplit(as.character(x), "")[[1]]
sum(as.numeric(digits))
}
vec_sumdigits <- Vectorize(sumdigits)
microbenchmark::microbenchmark(digitsum(12324255231323),
sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
digitsum(12324255231323) 12.223 12.712 14.50613 13.201 13.690 96.801 100 a
sumdigits(12324255231323) 13.689 14.667 15.32743 14.668 15.157 38.134 100 a
The performance of the two versions is similar, but the second one is much easier to understand.
Interestingly, the Vectorize wrapper adds considerable overhead for a single input:
microbenchmark::microbenchmark(vec_digitsum(12324255231323),
vec_sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
vec_digitsum(12324255231323) 92.890 96.801 267.2665 100.223 108.045 16387.07 100 a
vec_sumdigits(12324255231323) 94.357 98.757 106.2705 101.445 107.556 286.00 100 a
Another advantage of this function is that it still works if you have really big numbers stored as strings (with the small modification of removing the as.character), while the first version will have problems with big numbers or may introduce errors.
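For example, with the as.character call dropped, a big number passed in as a string still works (a sketch; the function name here is mine):
sumdigits_chr <- function(x) {            # x is already a character string
  sum(as.numeric(strsplit(x, "")[[1]]))
}
sumdigits_chr("123456789012345678901234567890")
# [1] 135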
Note: at first my benchmark compared the vectorized version of the OP's function with the non-vectorized version of my function, which gave me the wrong impression that my function was much faster. It turned out that the difference was caused by the Vectorize overhead.
I have a character vector of stock tickers where the ticker name is concatenated to the country in which that ticker is based in the following form: country_name/ticker_name. I am trying to split each string and delete everything from the '/' back, returning a character vector of only the ticker names. Here is an example vector:
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
My initial thought was to use the stringr library. I don't really have any experience with that package, but here is what I was trying:
library(stringr)
split_string <- str_split(sample_string, '/')
But I was unsure how to return only the second element of each list as a single vector.
How would I do this over a large character vector (~105 million entries)?
Here is a benchmark including all the methods suggested by @David Arenburg, plus another method using str_extract from the stringr package.
sample_string <- rep(sample_string, 1000000)
library(data.table); library(stringr)
s1 <- function() sub(".*/(.*)", "\\1", sample_string)
s2 <- function() sub(".*/", "", sample_string)
s3 <- function() str_extract(sample_string, "(?<=/)(.*)")
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
length(sample_string)
# [1] 7000000
identical(s1(), s2())
# [1] TRUE
identical(s1(), s3())
# [1] TRUE
identical(s1(), s4())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# s1() 3.916555 3.917370 4.046708 3.923246 3.925184 4.551184 5
# s2() 3.584694 3.593755 3.726922 3.610284 3.646449 4.199426 5
# s3() 3.051398 3.062237 3.354410 3.138080 3.722347 3.797985 5
# s4() 1.908283 1.964223 2.349522 2.117521 2.760612 2.996971 5
The tstrsplit method is the fastest.
Update:
Adding another method from @Frank. This comparison is not strictly fair, since it depends on the actual data: when there are a lot of duplicated values, as in the sample_string produced above, the advantage is quite obvious:
s5 <- function() setDT(list(sample_string))[, v := tstrsplit(V1, "/", fixed = TRUE)[[2]], by=V1]$v
identical(s1(), s5())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), s5(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# s1() 3905.97703 3913.264 3922.8540 3913.4035 3932.2680 3949.3575 5
# s2() 3568.63504 3576.755 3713.7230 3660.5570 3740.8252 4021.8426 5
# s3() 3029.66877 3032.898 3061.0584 3052.6937 3086.9714 3103.0604 5
# s4() 1322.42430 1679.475 1985.5440 1801.9054 1857.8056 3266.1101 5
# s5() 82.71379 101.899 177.8306 121.6682 209.0579 373.8141 5
Some helpful notes about your question: firstly, there is a str_split_fixed function in the stringr package which does what you want.
library(data.table); library(stringr)
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
sample_string <- rep(sample_string, 1e5)
split_string <- str_split_fixed(sample_string, '/', 2)[,2]
It works by calling stringi::stri_split_fixed and is not dissimilar to
do.call("c", lapply(str_split(sample_string, '/'),"[[",2))
Secondly, another way to think about extracting each second element of the list is by doing exactly what tstrsplit is doing internally.
transpose(strsplit(sample_string, "/", fixed = T))[[2]]
On a total side note, the above should be marginally faster than calling tstrsplit. This, of course, is probably not worth typing out at length, but it helps to know what the function does.
library(data.table); library(stringr)
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
s5 <- function() transpose(strsplit(sample_string, "/", fixed = T))[[2]]
identical(s4(), s5())
microbenchmark::microbenchmark(s4(), s5(), times = 20)
Unit: milliseconds
expr min lq mean median uq max neval
s4() 161.0744 193.3611 255.8136 234.9945 271.6811 434.7992 20
s5() 140.8569 176.5600 233.3570 194.1676 251.7921 420.3431 20
Regarding this second method: in short, transposing this list of 7 million elements, each of length 2, converts it into a list of length 2, each element containing 7 million values. You then extract the second element of this list.
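A tiny illustration of what transpose does here, on just two strings (a sketch):
library(data.table)   # for transpose()
strsplit(c("US/SPY", "US/AOL"), "/", fixed = TRUE)
# [[1]] "US" "SPY"
# [[2]] "US" "AOL"
transpose(strsplit(c("US/SPY", "US/AOL"), "/", fixed = TRUE))
# [[1]] "US"  "US"
# [[2]] "SPY" "AOL"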
I have a vector of scalar values and I'm trying to find out how many different values it contains.
For instance, in group <- c(1,2,3,1,2,3,4,6) the unique values are 1, 2, 3, 4, 6, so I want to get 5.
I came up with:
length(unique(group))
But I'm not sure it's the most efficient way to do it. Isn't there a better way to do this?
Note: My case is more complex than the example, consisting of around 1000 numbers with at most 25 different values.
Here are a few ideas, all pointing towards your solution already being very fast. length(unique(x)) is what I would have used as well:
x <- sample.int(25, 1000, TRUE)
library(microbenchmark)
microbenchmark(length(unique(x)),
nlevels(factor(x)),
length(table(x)),
sum(!duplicated(x)))
# Unit: microseconds
# expr min lq median uq max neval
# length(unique(x)) 24.810 25.9005 27.1350 28.8605 48.854 100
# nlevels(factor(x)) 367.646 371.6185 380.2025 411.8625 1347.343 100
# length(table(x)) 505.035 511.3080 530.9490 575.0880 1685.454 100
# sum(!duplicated(x)) 24.030 25.7955 27.4275 30.0295 70.446 100
You can use rle from the base package:
x<-c(1,2,3,1,2,3,4,6)
length(rle(sort(x))$values)
rle produces a list with two components (lengths and values). The length of the values vector gives you the number of unique values.
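For example, the intermediate rle result for the sample vector looks like this (a quick illustration):
rle(sort(x))
# Run Length Encoding
#   lengths: int [1:5] 2 2 2 1 1
#   values : num [1:5] 1 2 3 4 6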
I have used this function
length(unique(array))
and it works fine, and doesn't require external libraries.
The uniqueN function from data.table is equivalent to length(unique(group)). It is also several times faster on larger datasets, though not by much on your example.
library(data.table)
library(microbenchmark)
xSmall <- sample.int(25, 1000, TRUE)
xBig <- sample.int(2500, 100000, TRUE)
microbenchmark(length(unique(xSmall)), uniqueN(xSmall),
length(unique(xBig)), uniqueN(xBig))
#Unit: microseconds
# expr min lq mean median uq max neval cld
#1 length(unique(xSmall)) 17.742 24.1200 34.15156 29.3520 41.1435 104.789 100 a
#2 uniqueN(xSmall) 12.359 16.1985 27.09922 19.5870 29.1455 97.103 100 a
#3 length(unique(xBig)) 1611.127 1790.3065 2024.14570 1873.7450 2096.5360 3702.082 100 c
#4 uniqueN(xBig) 790.576 854.2180 941.90352 896.1205 974.6425 1714.020 100 b
We can use n_distinct from dplyr
dplyr::n_distinct(group)
#[1] 5
If one wants to get the number of unique elements in a matrix, data frame or list, the following code would do:
if (typeof(Y) == "list") {                     # Y is a list or data frame
  numUniqueElems <- length(na.exclude(unique(unlist(Y))))   # flatten the columns first
} else if (is.null(dim(Y))) {                  # Y is a vector
  numUniqueElems <- length(na.exclude(unique(Y)))
} else {                                       # length(dim(Y)) == 2, Y is a matrix
  numUniqueElems <- length(na.exclude(unique(c(Y))))
}
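For instance, each branch applied directly to a small object of the matching type (a quick check; the example objects are mine):
Y_vec <- c(1, 2, 3, 1, 2, 3, 4, 6)
Y_mat <- matrix(c(1, 2, 2, 3), nrow = 2)
Y_df  <- data.frame(a = 1:3, b = c(2, 3, 4))
length(na.exclude(unique(Y_vec)))          # vector branch: 5
length(na.exclude(unique(c(Y_mat))))       # matrix branch: 3
length(na.exclude(unique(unlist(Y_df))))   # list / data frame branch: 4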
I would like to have unique numeric factors as part of an xts, so that over time each number refers to a specific factor, independent of time.
To give an example, imagine a stock index that changes its constituents every day. We can simulate this if I have the following universe of two letter stock tickers
universe <- apply(as.data.frame(expand.grid(letters,letters)),1,paste0,collapse="")
and each day an index is created that is a random subsample of 20 of the stock tickers from the universe.
subsample.list <- lapply(1:50, function(y){
sort(sample(universe,20,replace=FALSE))
})
the key of unique stocks over the 50 days is:
uni.subsample <- sort(unique(unlist(subsample.list)))
I would like to basically be able to see which stocks were in the index each day if I had the xts object and the unique factors.
Although it is not meant to be used this way, I was thinking of something like:
tmp <- xts(do.call(rbind,subsample.list),Sys.Date()-c(50:1))
to create the xts.
However, I would like to convert the coredata into a numeric matrix, where each number is the position of the ticker in uni.subsample.
So tmp.adjusted['20130716'][1,] would be a numeric vector of length 20 giving the positions in uni.subsample for 16 July 2013, and I would expect to be able to get all of 2013-07-16's index members from the xts object with uni.subsample[tmp.adjusted['20130716'][1,]]. That is, the adjustment from tmp to tmp.adjusted converts the strings into factors, with unique levels associated with uni.subsample.
I hope this makes sense; it's kind of hard to explain.
Here is a vectorized solution:
tmp.int <- xts(matrix(as.integer(factor(tmp,levels=uni.subsample,ordered=TRUE)),
ncol=ncol(tmp)),index(tmp))
You are basically trying to encode a matrix of ordered factors by the order of their levels.
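The core idea on a tiny example (a sketch with made-up tickers):
uni  <- c("aa", "bb", "cc")
vals <- c("bb", "aa", "cc", "aa")
codes <- as.integer(factor(vals, levels = uni))
codes
# [1] 2 1 3 1
uni[codes]   # recovers the original strings
# [1] "bb" "aa" "cc" "aa"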
EDIT: adding some benchmarking:
set.seed(1233)
N <- 5000
subsample.list <- lapply(seq(N), function(y){
sort(sample(universe,20,replace=FALSE))
})
uni.subsample <- sort(unique(unlist(subsample.list)))
tmp <- xts(do.call(rbind,subsample.list),Sys.Date()-seq(N))
ag <- function() xts(matrix(as.integer(factor(tmp,levels=uni.subsample,ordered=TRUE)),
ncol=ncol(tmp)),index(tmp))
no <- function()xts(apply(X=tmp,
MARGIN=c(1,2), function(x) which(uni.subsample == x)),
index(tmp))
library(microbenchmark)
microbenchmark(ag(),no(),times=1)
## N = 50: ag about 24x faster
microbenchmark(ag(),no(),times=1)
Unit: milliseconds
 expr       min        lq    median        uq       max neval
 ag()  1.126405  1.126405  1.126405  1.126405  1.126405     1
 no() 24.000003 24.000003 24.000003 24.000003 24.000003     1
## N = 500: ag about 135x faster
microbenchmark(ag(),no(),times=10)
Unit: milliseconds
 expr        min         lq     median         uq        max neval
 ag()   23.38484   26.19744   31.13428   35.51057   44.96251    10
 no() 3115.24902 3220.04940 3250.63773 3288.66867 3470.35053    10
How about:
tmp.int <- xts(apply(X=tmp, MARGIN=c(1,2), function(x) which(uni.subsample == x)),
index(tmp))
# to perform the lookup (e.g., 'find the name of the first value on May 27, 2013'):
uni.subsample[tmp.int['2013-05-27'][,1]]