How do I subset a data.table based on another data.table?

I am trying to get my head around how to use data.tables. It is not going well.
I have a large data.table with a bunch of returns and AUM. I subsetted that data.table into two data.tables, one with returns, and one with AUM. I now want to subset the returns data.table, to get only the returns from funds with AUM less than the 50th percentile.
To give you an idea, this is my code:
fundDetails <- data.table(read.table("Fund_Deets.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
fundNAV <- data.table(read.table("NAV_AUM.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
allFundDetails <- fundDetails[Currency == 'USD']
allFundNAV <- fundNAV[Fund.ID %in% allFundDetails$Fund.ID]
allFundAUM <- allFundNAV[Type == 'AUM', -c(1,3), with = FALSE]
allFundAUM <- setnames(data.table(t(sapply(allFundAUM[,-1, with = FALSE],as.numeric))), as.character(allFundAUM$Fund.ID))
allFundReturns <- allFundNAV[Type == 'Return', -c(1,3), with = FALSE]
allFundReturns <- setnames(data.table(t(sapply(allFundReturns[,-1, with = FALSE],as.numeric)/100)), as.character(allFundReturns$Fund.ID))
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
This produces the following three tables (smallFundReturns is obviously just NAs):
> allFundAUM[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 1 27
2: NA NA NA NA NA NA 117 NA 1 27
3: NA NA NA NA NA NA 120 NA 1 27
4: NA NA NA NA NA NA 133 NA 1 27
5: NA NA NA NA NA NA 146 NA 1 29
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> allFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 0.0188 -0.0116
2: NA NA NA NA NA NA -0.0315 NA -0.0120 0.0134
3: NA NA NA NA NA NA -0.0978 NA -0.0908 -0.0206
4: NA NA NA NA NA NA -0.0445 NA -0.0269 -0.0287
5: NA NA NA NA NA NA 0.0139 NA 0.0298 -0.0141
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
I am trying to subset using this for loop (using a for loop in an attempt to debug):
for (i in 1:nrow(allFundReturns)){
theSubset <- as.vector(allFundReturns[i,] <= as.numeric(quantile(allFundAUM[i,], .5, na.rm = TRUE)))
theSubset[is.na(theSubset)] <- FALSE
theSubset <- colnames(allFundReturns)[theSubset]
smallFundReturns[i,theSubset, with = FALSE] = allFundReturns[i,theSubset, with = FALSE]
}
This produces an error:
Error in `[<-.data.table`(`*tmp*`, i, theSubset, with = FALSE, value = list( :
unused argument (with = FALSE)
I tried removing the 'with' part, but this spits out a bunch of warnings:
> warnings()
Warning messages:
1: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '41526' (3019 unused)
2: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45993' (3019 unused)
3: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45994' (3019 unused)
4: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
I am confused on how to do this. Any ideas on how I can subset the second data.table by the subset on the first?
EDIT:
I tried the suggestion below:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
And I got these warnings():
> warnings()
Warning messages:
1: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
2: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
3: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
And the code produced this, with 'TRUE' everywhere I would expect a number:
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA TRUE TRUE
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
EDIT 2:
I figured out the issue. Apparently, this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
created the table as being logical. I changed it to this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) as.numeric(rep(NA, length(x)))))
And everything worked after @HubertL's fix. Thanks!!
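The type issue described in EDIT 2 can be seen in isolation. This is a minimal sketch in base R, not part of the original post: a bare NA is logical, so a column built from rep(NA, n) is logical too, while NA_real_ gives a numeric column from the start. The variable names are made up for illustration.

```r
n <- 3

# A bare NA is a logical constant, so this column is logical
logical_col <- rep(NA, n)

# NA_real_ is the missing value of type double, so this column is numeric
numeric_col <- rep(NA_real_, n)

class(logical_col)  # "logical"
class(numeric_col)  # "numeric"
```

Using NA_real_ directly avoids the extra as.numeric() call and makes the intended column type explicit.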

You have to write it like this:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]

Suggestions for improvement:
Try reading the data with fread instead of read.table if possible. It is much faster, and the result is a data.table rather than a data.frame.
When you pass with = FALSE, j is treated as a set of column names or positions, data.frame style, rather than as an expression evaluated inside the table, so you give up much of what makes data.table fast and concise.
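As a rough illustration of avoiding the row-by-row loop, the sketch below masks the returns in one step. It assumes the goal stated in the question (keep a return only where that fund's AUM is at or below the median AUM for the same period, matching the loop's <=), and it uses small made-up tables in place of allFundAUM and allFundReturns. The names aum, ret, row_median and is_small are hypothetical.

```r
library(data.table)

# Toy stand-ins: rows are periods, columns are fund IDs (values are invented)
aum <- data.table(`101` = c(10, 20, NA), `102` = c(50, 5, 30), `103` = c(30, 40, 10))
ret <- data.table(`101` = c(0.01, 0.02, NA), `102` = c(-0.03, 0.04, 0.05),
                  `103` = c(0.02, -0.01, 0.03))

aum_m <- as.matrix(aum)
ret_m <- as.matrix(ret)

# Median AUM per period (row), ignoring missing values
row_median <- apply(aum_m, 1, median, na.rm = TRUE)

# TRUE where a fund's AUM is known and at or below that period's median;
# the vector is recycled down the rows, so element [i, j] is compared with row_median[i]
is_small <- !is.na(aum_m) & aum_m <= row_median

# Start from the returns and blank out everything that is not a small fund
small_ret_m <- ret_m
small_ret_m[!is_small] <- NA_real_
smallFundReturns <- as.data.table(small_ret_m)
```

Because the mask is built from the AUM table and applied to the returns table, no pre-filled NA table and no per-row assignment are needed.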
Have fun

Related

How to eliminate “Error in as.Date.OtherDate(death) : NAs in foreign function call (arg 4)"

I need to convert Persian dates to Gregorian using the ConvCalendar package. The character vector is as follows:
str(df$death_date)
chr [1:286] NA NA NA NA "1399/03/12" NA NA NA NA NA NA NA NA NA "1399/03/25" NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
death <- OtherDate(day=substr(df$death_date,9,10),
month=substr(df$death_date,6,7),
year=substr(df$death_date,1,4),
calendar="persian")
When I try to convert death to Gregorian using as.Date(death), the following error appears:
Error in as.Date.OtherDate(death) : NAs in foreign function call (arg 4)
Could anyone please tell me what is wrong?
Package ConvCalendar was archived a long time ago, see CRAN:
Archived on 2018-05-24 as check problems were not corrected despite reminders.
If it is installed from source, the following will work. Note that the NA entries are dropped first, since they are what triggers the error.
library(ConvCalendar)
y <- x[!is.na(x)]
y <- as.POSIXlt(y, format = '%Y/%m/%d', origin = '1970-01-01')
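# POSIXlt stores the year as an offset from 1900, so add 1900 to recover the Persian year
pers <- OtherDate(day=y$mday, month=y$mon+1, year=y$year+1900, calendar="persian")
as.Date(pers)
# The original call passed y$year unadjusted and printed "0120-06-02" "0120-06-15";
# with the offset restored, the results should fall in June 2020.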
Data
x <- scan(what = character(), text = '
NA NA NA NA "1399/03/12" NA NA NA NA NA NA NA NA NA "1399/03/25" NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
')

Insert NA elements in vector

I have a vector:
x <- c(1,2,3,4)
I would like to add 23 NA elements before each element of x
Maybe like this?
c(sapply(x, function(x) c(rep(NA,23),x)))
We can do this with vectorization
replace(rep(NA, 23*length(x) + length(x)), rep(c(FALSE, TRUE), c(23, 1)), x)
#[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#[43] NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA NA
#[85] NA NA NA NA NA NA NA NA NA NA NA 4
Another option is to create a matrix of NAs, replace the last column with 'x', and flatten it row by row into a vector
m1 <- matrix(rep(rep(NA, 24), length(x)), nrow = length(x))
m1[,24] <- x
c(t(m1))
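A slightly shorter variant of the matrix idea, offered as a sketch rather than as part of the original answers: stack a 23-row block of NAs on top of x with rbind, then flatten column by column. The name n_pad is introduced here for illustration.

```r
x <- c(1, 2, 3, 4)
n_pad <- 23  # number of NAs to place before each element

# Each column holds 23 NAs followed by one element of x;
# c() reads the matrix column by column, which gives the desired order
padded <- c(rbind(matrix(NA, nrow = n_pad, ncol = length(x)), x))

length(padded)          # 96
which(!is.na(padded))   # 24 48 72 96
```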

How to subset a raster by cell number in R?

I'm trying to subset a raster based on cell numbers. I want to provide a vector of cell numbers and return a raster with the original cell values for those cells referenced in the cell numbers vector. I tried the rasterFromCells() function but this seems to interpolate between cell numbers and doesn't return values, but rather cell numbers. I've tried:
#original raster loaded with 400 sample values ranging from 1:24
foo <- raster(ncol=20, nrow=20)
foo[] <- sample(seq(1,24),400,replace = TRUE)
#vector of desired cell numbers
my.pts <- c(2,20,200)
#rasterFromCells attempt
bar<-rasterFromCells(foo, my.pts, values=TRUE)
How can I return a raster layer with foo's values for cell numbers 2, 20 and 200 and all other cells as NA?
If you want to create a new raster with the values at only the cell locations in my.pts replaced by the values at those cell locations in foo and all other cell values set to NA, you just have to:
1. Create a raster (i.e., bar) the same size as foo.
2. Fill it with NAs.
3. Use bar[my.pts] <- foo[my.pts].
For example:
library(raster)
set.seed(123) ## for reproducible results
foo <- raster(ncols=20, nrows=20)
foo[] <- sample(seq(1,24),400,replace = TRUE)
#vector of desired cell numbers
my.pts <- c(2,20,200)
## create raster the same size as foo filled with NAs
bar <- raster(ncols=ncol(foo), nrows=nrow(foo))
bar[] <- NA
## replace the values with those in foo
bar[my.pts] <- foo[my.pts]
foo[my.pts]
##[1] 19 23 14
bar[]
## [1] NA 19 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 23 NA NA NA NA NA NA NA NA NA NA NA
## [32] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [63] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [94] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[125] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[156] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[187] NA NA NA NA NA NA NA NA NA NA NA NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[218] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[249] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[280] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[311] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[342] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##[373] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Another approach to accomplish the same result is to copy foo to bar and then set all cell locations not in my.pts to NA:
bar <- foo
bar[setdiff(1:ncell(foo),my.pts)] <- NA
The advantage of rasterFromCells is that it returns a smaller raster, cropped to the extent that contains the cells you asked for.
So what you need to do is copy the values of your initial raster (r) into the new one (r2), which is made easy by the fact that r2 holds the original cell numbers as its values:
r <- raster(ncols=100, nrows=100)
r[] <- rnorm(ncell(r))
cells <- c(3:5, 210)
r2 <- rasterFromCells(r, cells, values=TRUE)
ini_cells <- getValues(r2)
Then assign the values according to that index:
r2[] <- r[ini_cells]
This results in a raster of 24 cells instead of 10,000!
c(ncell(r), ncell(r2))
Let us compare the results:
data.frame(Orig=getValues(r)[cells], New=getValues(r2)[ini_cells %in% cells])
[,1] [,2]
[1,] -0.5081512 -0.5081512
[2,] -0.8799739 -0.8799739
[3,] 0.3722788 0.3722788
[4,] -0.7661364 -0.7661364
Note: you wanted to set all others to NA. You would do this with:
r2[!ini_cells %in% cells] <- NA
head(getValues(r2))
-0.5081512 -0.8799739 0.3722788 NA NA NA

error in reading a csv file

I have been facing an error while reading a CSV file. The first few lines of the file are given below:
"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA
Reading the CSV file:
test <- read.csv(file='/home/userxyz/test.csv')
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying the presence of the header.
test <- read.csv(file='/home/userxyz/test.file', header=T)
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying row.names did not work:
test <- read.csv(file='/home/userxyz/test.file', row.names=T)
#Error in read.table(file = file, header = header, sep = sep, quote = quote, :
# invalid 'row.names' specification
I have also tried the read.table and read.delim functions.
Is the error because of special characters in the row.names?
I think you are trying to read in the first column as row names. Pass the column number instead of TRUE:
x <- '"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA'
read.csv(text = x, row.names = 1L)
# X1.CEL X2.CEL X3.CEL X4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
If you want to preserve exactly the header, do
read.csv(text = x, row.names = 1L, check.names = FALSE)
# 1.CEL 2.CEL 3.CEL 4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
Regarding row.names, see ?read.csv:
row.names: a vector of row names. This can be a vector giving the
actual row names, or a single number giving the column of the
table which contains the row names, or character string
giving the name of the table column containing the row names.
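The documentation excerpt above also allows a column name instead of a column number. The sketch below shows both forms side by side on a small made-up file whose first column has an explicit header; the column name "probe" and the values are assumptions for illustration and do not come from the question's data.

```r
# Hypothetical data; the header "probe" is an assumption made for this sketch
csv_text <- 'probe,1.CEL,2.CEL
1_s_at,0.5,NA
2_at,1.2,3.4'

# Row names taken from the first column, selected by position
by_position <- read.csv(text = csv_text, row.names = 1L, check.names = FALSE)

# Row names taken from the same column, selected by name
by_name <- read.csv(text = csv_text, row.names = "probe", check.names = FALSE)

rownames(by_name)  # "1_s_at" "2_at"
```

A logical such as row.names = T fits none of the three documented forms, which is why read.csv rejects it.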

R is unexpectedly transforming field from CSV file to NA

I'm trying to parse a CSV file in R. Here is the first line of the file, which uses ~ as the separator. Please note the literal i in the second field.
2015-10-29 18:49:42~i~186.37.108.44~Mozilla/5.0 (Linux; Android 4.1.2; GT-S6810E Build/JZO54K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.94 Mobile Safari/537.36~ea01627ed45116787d3b1c0224a44d77~?~CL~1443~219~729~335~3155~9214~5
Here is how I'm trying to parse it:
> parsed <- read.csv('i.csv', header=F, sep='~')
> parsed$V2
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[37] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[73] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[109] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[145] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[181] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[217] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[253] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[289] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
> table(count.fields('i.csv', sep='~'))
14
310
Why does this happen? Why is field 2 NA instead of i? All other fields are fine; fields 1 and 3 do not contain the literal i.
> df$V1[1]
[1] 2015-10-29 18:38:04
257 Levels: 2015-10-29 18:38:04 2015-10-29 18:38:07 2015-10-29 18:38:12 ... 2015-10-29 18:51:46
> df$V3[1]
[1] 24.237.158.3
270 Levels: 1.144.97.1 1.187.195.221 1.187.204.84 1.39.12.184 1.39.13.227 1.39.137.12 1.39.33.86 ... 97.44.1.207
For the sake of completeness, I'm adding my comment as an answer.
Most of R's file-reading functions (read.csv, read.csv2, read.delim, read.fwf) call read.table internally.
And read.table calls type.convert to work out each column's type when colClasses is not supplied in the function call.
From type.convert at R docs, it says
This is principally a helper function for read.table. Given a character vector, it attempts to convert it to logical, integer, numeric or complex, and failing that converts it to factor unless as.is = TRUE. The first type that can accept all the non-missing values is chosen.
So, type.convert checks if the value is logical, integer, real or complex, in this specific order and if all these options are ruled out, converts value to factor (or character if as.is=T).
In R-3.2.1, a buggy implementation of strtoc (and possibly of typeconvert) resulted in the conversion of i to NA. strtoc has been corrected in R-3.3.0.
In R-3.3.0, type.convert('n±ki') returns complex only if k ≠ 1.
From Changes in R-3.3.0:
type.convert("i") now returns a factor instead of a complex value with zero real part and missing imaginary part.
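Independently of the R version, type guessing can be bypassed by declaring the column classes up front, in which case read.table does not hand those columns to type.convert. The following is a minimal sketch on two shortened, made-up rows in the question's ~-separated format, not the original file.

```r
# Two shortened sample rows (hypothetical), second field is the literal i
csv_text <- "2015-10-29 18:49:42~i~186.37.108.44\n2015-10-29 18:49:43~i~24.237.158.3"

# colClasses = "character" is recycled across all columns,
# so every field is kept as the text that was read
parsed <- read.csv(text = csv_text, header = FALSE, sep = "~",
                   colClasses = "character")

parsed$V2  # "i" "i"
```

Columns that should be numeric or dates can then be converted explicitly, which keeps the result the same across R versions.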
