Selecting a subset of IDs - r

I have a data.table dt with ids in the column idnum, and a data.table ids that contains a list of ids in the column idnum (all of which exist in dt).
I want to get:
The intersection: dt where dt.idnum == ids.idnum
The complement to the intersection: dt where dt.idnum not in ids.idnum
I got the first one with ease using
setkey(dt, idnum)
setkey(ids, idnum)
dt[ids]
However, I'm stuck getting the second one. My approach was
dt[is.element(idnum, ids[, idnum]) == FALSE]
However, the row numbers of the two groups do not add up to nrow(dt). I suspect the second command. What can I do instead / Where am I going wrong? Is there perhaps a more efficient way of computing the second group given that it's the complement to the first group and I already have that one?
Update
I tried the approach given in the answer, but my numbers don't add up:
> nrow(x[J(ids$idnum)])
[1] 148
> nrow(x[!J(ids$idnum)])
[1] 52730
> nrow(x)
[1] 52863
However, the first two numbers add up to 52878. That is, I have 15 rows too many. My data contains duplicates in adj; could that be the reason?
Here's some description of the data I used:
> str(x)
Classes 'data.table' and 'data.frame': 52863 obs. of 1 variable:
$ idnum: int 6 6 11 21 22 22 22 22 27 27 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "idnum"
> head(x)
idnum
1: 6
2: 6
3: 11
4: 21
5: 22
6: 22
> str(ids)
Classes 'data.table' and 'data.frame': 46 obs. of 1 variable:
$ idnum: int 2909 5012 5031 5033 5478 6289 6405 6519 7923 7940 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "idnum"
> head(ids)
idnum
1: 2909
2: 5012
3: 5031
4: 5033
5: 5478
6: 6289
and here is
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] C/C/C/C/C/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] yaml_2.1.13 ggplot2_1.0.0 mFilter_0.1-3
[4] data.table_1.9.4 foreign_0.8-61
loaded via a namespace (and not attached):
[1] MASS_7.3-35 Rcpp_0.11.3 chron_2.3-45
[4] colorspace_1.2-4 digest_0.6.4 grid_3.1.1
[7] gtable_0.1.2 labeling_0.3 munsell_0.4.2
[10] plyr_1.8.1 proto_0.3-10 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.1

Here is one way:
library(data.table)
set.seed(1) # for reproducible example
dt <- data.table(idnum=1:1e5,x=rnorm(1e5)) # 100,000 rows, unique ids
ids <- data.table(idnum=sample(1:1e5,10)) # 10 random ids
setkey(dt,idnum)
result.1 <- dt[J(ids$idnum)] # inclusive set (records with common ids)
result.2 <- dt[!J(ids$idnum)] # exclusive set (records from dt with ids$idnum excluded)
any(result.2$idnum %in% result.1$idnum)
# [1] FALSE
EDIT: Response to OP's comment.
Comparing the number of rows is not meaningful. The join will return rows corresponding to all matches. So if a given idnum is present twice in dt and three times in ids, you will get 2 X 3 = 6 rows in the result. The important test is the one I did: are any of the ids in result.1 also present in result.2? If so, then there's something wrong.
If you have duplicated ids$idnum, try:
result.1 <- dt[J(unique(ids$idnum))] # inclusive set (records with common ids)
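To see why the row counts need not add up, here is a minimal sketch on made-up data (the tiny tables below are assumptions for illustration, not the OP's data). It shows two effects of a keyed join with default arguments: a duplicated id in the lookup repeats the matching rows, and an id that is absent from dt contributes one NA row. Passing nomatch = 0L drops the unmatched rows.

```r
library(data.table)

# Hypothetical toy data: id 1 appears twice in dt; ids repeats id 1 and
# contains id 4, which is absent from dt.
dt  <- data.table(idnum = c(1L, 1L, 2L, 3L), x = c(10, 11, 20, 30))
ids <- data.table(idnum = c(1L, 1L, 4L))
setkey(dt, idnum)

# Default join: 2 rows x 2 lookups for id 1, plus one NA row for id 4.
n_all <- nrow(dt[J(ids$idnum)])                        # 5

# De-duplicated lookup, unmatched ids dropped.
n_in  <- nrow(dt[J(unique(ids$idnum)), nomatch = 0L])  # 2

# Not-join: rows of dt whose idnum is not in ids.
n_out <- nrow(dt[!J(ids$idnum)])                       # 2

n_in + n_out == nrow(dt)                               # TRUE
```

With the lookup de-duplicated and unmatched ids dropped, the inclusive and exclusive sets partition dt, so their row counts sum to nrow(dt).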

Related

Data Manipulation in R for Apriori

Part of my dataset, which is stored as a CSV file, is shown below; the full dataset has more rows and columns. I want to run apriori on it. Say I have this:
Maths Science C++ Java DC
[1] 75 44 55 56 88
[2] 56 88 54 78 44
The original dataset has 30 columns (one per subject) and 24 rows (one per student).
Dataset: link
I want to convert this dataset into the form shown below:
[1] {Maths,DC}
[2] {Science,Java}
That is, a list of lists (I think that is what it is called) containing the column names. Each student's list shows the subjects in which he/she scored 75 marks or more; the remaining subjects are dropped (this is the only condition of the problem).
E.g., the first student scored 75 or more in DC and Maths, so his list includes only DC and Maths.
I am sorry for posting this, but I searched Stack Overflow a lot and found a few working suggestions, yet couldn't reach the final goal.
My goal is to get a form like this:-
[9834] {semi-finished bread,
bottled water,
soda,
bottled beer}
[9835] {chicken,
tropical fruit,
other vegetables,
vinegar,
shopping bags}
As given in :-
library(arules)
inspect(Groceries)
Alternatively, I would appreciate it if anyone could suggest another way to represent the data that apriori can understand, as long as it satisfies the condition stated above.
(Sorry for the long post. I hope converting my dataset to this format will help me study patterns in the student-subject data. Thanks a ton for all the help.)
library(plyr)
library(arules)
df <- read.table(text =
" 75 44 55 56 88
56 88 54 78 44")
names(df) <- c("Maths", "Science", "C++", "Java", "DC")
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {DC,Maths} 1
# [2] {Java,Science} 2
Edit: It works with your example dataset, too:
library(plyr)
library(arules)
df <- read.csv(file = url("https://drive.google.com/uc?export=download&id=0B3kdblyHw4qLR0dpT24xWUZGcGs"))
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {CD,CG,CN,DA,Data.Struc} 1
# [2] {CD,CG,CO,ML,OS} 2
# [3] {CN,Data.Struc,DC,DM,DMS} 3
# [4] {CHE,DD,DM,EC,EE} 4
# [5] {CHE,CN,MATHS,PHY} 5
# [6] {Data.Science,DM,DMS,ML,OS} 6
# [7] {CD,DA,Data.Struc,EC,MATHS} 7
# [8] {CG,CHE,CN,CO,OS} 8
# [9] {CN,CO,Data.Science,DC,DMS} 9
# [10] {DC,DD,EC,EE,PHY} 10
# [11] {CHE,DD,DMS,MATHS,PHY} 11
# [12] {CN,Data.Science,DM,MATHS,ML} 12
# [13] {CD,CG,DA,Data.Science,Data.Struc} 13
# [14] {CG,CO,EE,MATHS,OS} 14
# [15] {CN,CO,DC,DMS,PHY} 15
# [16] {CN,CO,DD,EC,EE} 16
# [17] {CHE,DA,EE,MATHS,PHY} 17
# [18] {Data.Science,DD,DM,ML,PHY} 18
# [19] {CD,CO,DA,Data.Struc,DC} 19
# [20] {CG,CO,DD,DM,OS} 20
# [21] {CG,CN,DA,DC,DMS} 21
# [22] {DD,EC,EE,ML,OS} 22
# [23] {CHE,CN,Data.Struc,MATHS,PHY} 23
# [24] {CG,Data.Science,DM,EE,ML} 24
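The thresholding step can also be checked in base R alone, without plyr or arules. The following is a minimal sketch, not part of the answer above: it rebuilds the two-row example from the question and collects, per student, the names of the columns with a mark of 75 or more. The variable name items is an assumption for illustration.

```r
# Two-row example from the question; check.names = FALSE keeps "C++" intact.
df <- data.frame(Maths = c(75, 56), Science = c(44, 88), "C++" = c(55, 54),
                 Java = c(56, 78), DC = c(88, 44), check.names = FALSE)

# For each row, keep the column names whose mark is >= 75.
items <- lapply(seq_len(nrow(df)), function(i) {
  names(df)[unlist(df[i, ]) >= 75]
})

items[[1]]  # "Maths" "DC"
items[[2]]  # "Science" "Java"
```

A plain list of character vectors like this is the same shape of input that the alply call produces before it is coerced to "transactions".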

Displaying cyrillic in RStudio console

I am having trouble displaying Russian characters in the RStudio console. I load an Excel file containing Russian text using the readxl package. The Cyrillic displays properly in the data frame. However, if I run a function whose output includes the variable names, the RStudio console displays symbols instead of the proper Cyrillic characters.
test.xlsx contains two columns - зависимая переменная (dependent variable - numeric) and независимая переменная (independent variable, factor).
зависимая_переменная независимая_переменная
5 а
6 б
8 в
8 а
7.5 б
6 в
5 а
4 б
3 в
2 а
5 б
My code:
Sys.setlocale(locale = "Russian")
install.packages("readxl")
require(readxl)
basetable <- readxl::read_excel('test.xlsx',sheet = 1)
View(basetable)
basetable$независимая_переменная <- as.factor(basetable$независимая_переменная)
str(basetable)
This is what I get for the output of the str function:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 2 variables:
$ çàâèñèìàÿ_ïåðåìåííàÿ : num 5 6 8 8 7.5 6 5 4 3 2 ...
$ íåçàâèñèìàÿ_ïåðåìåííàÿ: Factor w/ 3 levels "а","б","в": 1 2 3 1 2 3 1 2 3 1 ...
I want to have the variable names displayed properly in Russian because I will be building many models from this data. For reference, here is my sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
[5] LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_0.1.1 shiny_0.13.1 dplyr_0.4.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.2 digest_0.6.9 assertthat_0.1 mime_0.4
[5] chron_2.3-47 R6_2.1.2 xtable_1.8-2 jsonlite_0.9.19
[9] DBI_0.3.1 magrittr_1.5 lazyeval_0.1.10 data.table_1.9.6
[13] tools_3.2.3 httpuv_1.3.3 parallel_3.2.3 htmltools_0.3
Try changing the encoding of the data frame's column names to UTF-8:
Encoding(colnames(YOURDATAFRAME)) <- "UTF-8"
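As a small illustration of what Encoding<- does, here is a sketch on a plain character vector rather than on the OP's data frame. It builds a string from the two UTF-8 bytes of the Cyrillic letter "а" without any encoding mark, then declares the bytes to be UTF-8. The variable name nm is an assumption for illustration. Note that Encoding<- only changes the declared encoding; it does not convert the bytes.

```r
# The UTF-8 bytes for Cyrillic "а" (U+0430), created without an encoding mark.
nm <- rawToChar(as.raw(c(0xd0, 0xb0)))
before <- Encoding(nm)   # "unknown"

# Declare (not convert) the bytes as UTF-8.
Encoding(nm) <- "UTF-8"
after <- Encoding(nm)    # "UTF-8"

# The string now compares equal to the letter written as a Unicode escape.
nm == "\u0430"
```

If the underlying bytes are not actually UTF-8, marking them this way will not help; in that case a conversion with iconv() would be needed instead.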

R convert data to factor will corrupt all other data.frame columns

I have a data.frame in which all columns are numeric. I want to convert one integer column to a factor, but doing so appears to convert all other columns to class character. Is there any way to convert just one column to a factor?
The example is from Converting variables to factors in R:
myData <- data.frame(A=rep(1:2, 3), B=rep(1:3, 2), Pulse=20:25)
myData$A <-as.factor(myData$A)
The result
apply(myData,2,class)
# A B Pulse
# "character" "character" "character"
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base ...
str(myData$A)
# Factor w/ 2 levels "1","2": 1 2 1 2 1 2
Your code actually works when I test it.
This is my output from str(myData):
'data.frame': 6 obs. of 3 variables:
$ A : Factor w/ 2 levels "1","2": 1 2 1 2 1 2
$ B : int 1 2 3 1 2 3
$ Pulse: int 20 21 22 23 24 25
Your issue is because, as ?apply states:
‘apply’ attempts to coerce
to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
frame)
This is done before executing the function on each column. And when you run as.matrix(myData) you end up with everything forced to one class, in this case character data:
is.character(as.matrix(myData))
#[1] TRUE
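To inspect the column classes without the matrix coercion, one option is sapply (or lapply), which iterates over the data frame's columns as a list. A minimal sketch using the question's myData; the variable name classes is an assumption for illustration:

```r
myData <- data.frame(A = rep(1:2, 3), B = rep(1:3, 2), Pulse = 20:25)
myData$A <- as.factor(myData$A)

# sapply walks the columns one by one, so each keeps its own class.
classes <- sapply(myData, class)
classes
#         A         B     Pulse
#  "factor" "integer" "integer"
```

str(myData) gives the same information, as shown in the output above.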

ff package in R: how to move data from one drive to another, and change filenames

I am working intensively with the amazing ff and ffbase packages.
Due to some technical details, I have to work in my C: drive with my R session. After finishing that, I move the generated files to my P: drive (using cut/paste in windows, NOT using ff).
The problem is that when I load the ffdf object:
load.ffdf("data")
I get the error:
Error: file.access(filename, 0) == 0 is not TRUE
This is ok, because nobody told the ffdf object that it was moved, but trying :
filename(data$x) <- "path/data_ff/x.ff"
or
pattern(data) <- "./data_ff/"
does not help, giving the error:
Error in `filename<-.ff`(`*tmp*`, value = filename) :
ff file rename from 'C:/DATA/data_ff/id.ff' to 'P:/DATA_C/data_ff/e84282d4fb8.ff' failed.
Is there any way to update the ffdf object so that it points to the files' new location?
Thank you !!
If you want to 'correct' your filenames afterwards you can use:
physical(x)$filename <- "newfilename"
For example:
> a <- ff(1:20, vmode="integer", filename="./a.ff")
> saveRDS(a, "a.RDS")
> rm(a)
> file.rename("./a.ff", "./b.ff")
[1] TRUE
> b <- readRDS("a.RDS")
> b
ff (deleted) integer length=20 (20)
> physical(b)$filename <- "./b.ff"
> b[]
opening ff ./b.ff
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Using filename() in the first session would of course have been easier. You could also have a look at the save.ffdf and corresponding load.ffdf functions in the ffbase package, which make this even simpler.
Addition
To rename the filenames of all columns in a ffdf you can use the following function:
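redir <- function(ff, newdir) {
  # Loop over the ff vectors that back each column of the ffdf
  for (x in physical(ff)) {
    fn <- basename(filename(x))                    # keep the original file name
    physical(x)$filename <- file.path(newdir, fn)  # point it at the new directory
  }
  return (ff)
}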
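The path rewriting inside redir uses only base R and can be tried without ff. A minimal sketch, using the old file path and the new directory from the error message in the question (the variable names are assumptions for illustration):

```r
# Old location recorded in the ff object, and the directory the files moved to.
old_path <- "C:/DATA/data_ff/id.ff"
new_dir  <- "P:/DATA_C/data_ff"

# Keep the file name, swap the directory.
new_path <- file.path(new_dir, basename(old_path))
new_path
# [1] "P:/DATA_C/data_ff/id.ff"
```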
You can also use ff:::clone()
R> foo <- ff(1:20, vmode = "integer")
R> foo
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(foo)$filename
[1] "/vol/fftmp/ff69be3e90e728.ff"
R> bar <- clone(foo, pattern = "~/")
R> bar
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(bar)$filename
[1] "/home/ubuntu/69be5ec0cf98.ff"
From what I understand from briefly skimming the code of save.ffdf and load.ffdf, those functions do this for you when you save/load.

Selecting rows in data.frame based on character strings

I've a data.frame with row.names as in test.
test <-
c("Env_1990:trait_KPS", "Env_1990:trait_SPSM", "Env_1990:trait_TKW",
"Env_1990:trait_Yield", "Env_1991:trait_KPS", "Env_1991:trait_SPSM",
"Env_1991:trait_TKW", "Env_1991:trait_Yield", "Env_1992:trait_KPS",
"Env_1992:trait_SPSM", "Env_1992:trait_TKW", "Env_1992:trait_Yield",
"Env_1993:trait_KPS", "Env_1993:trait_SPSM", "Env_1993:trait_TKW",
"Env_1993:trait_Yield", "Env_1994:trait_KPS", "Env_1994:trait_SPSM",
"Env_1994:trait_TKW", "Env_1994:trait_Yield", "Env_1995:trait_KPS",
"Env_1995:trait_SPSM", "Env_1995:trait_TKW", "Env_1995:trait_Yield",
"Gen_B88:Env_1990:trait_KPS", "Gen_B88:Env_1990:trait_SPSM",
"Gen_B88:Env_1990:trait_TKW", "Gen_B88:Env_1990:trait_Yield",
"Gen_B88:Env_1991:trait_KPS", "Gen_B88:Env_1991:trait_SPSM",
"Gen_B88:Env_1991:trait_TKW", "Gen_B88:Env_1991:trait_Yield",
"Gen_B88:Env_1992:trait_KPS", "Gen_B88:Env_1992:trait_SPSM",
"Gen_B88:Env_1992:trait_TKW", "Gen_B88:Env_1992:trait_Yield",
"Gen_B88:Env_1993:trait_KPS", "Gen_B88:Env_1993:trait_SPSM",
"Gen_B88:Env_1993:trait_TKW", "Gen_B88:Env_1993:trait_Yield")
I want to select only those rows that start with Env_. I tried this code in R:
grep(pattern="[Env_]", x=test)
This code gives me all rows because Env_ appears in every row name. I wonder how to select only the rows that start with Env_. Thanks in advance for your help.
You want to add the ^ character for beginning of line/string:
> grep("^Env_", test)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
> grep("^Env_", test, value = TRUE)
[1] "Env_1990:trait_KPS" "Env_1990:trait_SPSM" "Env_1990:trait_TKW"
[4] "Env_1990:trait_Yield" "Env_1991:trait_KPS" "Env_1991:trait_SPSM"
[7] "Env_1991:trait_TKW" "Env_1991:trait_Yield" "Env_1992:trait_KPS"
[10] "Env_1992:trait_SPSM" "Env_1992:trait_TKW" "Env_1992:trait_Yield"
[13] "Env_1993:trait_KPS" "Env_1993:trait_SPSM" "Env_1993:trait_TKW"
[16] "Env_1993:trait_Yield" "Env_1994:trait_KPS" "Env_1994:trait_SPSM"
[19] "Env_1994:trait_TKW" "Env_1994:trait_Yield" "Env_1995:trait_KPS"
[22] "Env_1995:trait_SPSM" "Env_1995:trait_TKW" "Env_1995:trait_Yield"
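To go from the matching names to the rows of the data.frame itself, the same pattern can be used with grepl as a logical row index. A minimal sketch on a shortened, made-up data.frame (the value column and the three row names are assumptions for illustration). It also shows why the original pattern matched everything: "[Env_]" is a character class that matches any single one of E, n, v or _, anywhere in the string.

```r
# Shortened stand-in for the OP's data.frame, with row names as in test.
test <- c("Env_1990:trait_KPS", "Env_1991:trait_TKW", "Gen_B88:Env_1990:trait_KPS")
df <- data.frame(value = c(1.2, 3.4, 5.6), row.names = test)

# Keep only rows whose name starts with "Env_"; drop = FALSE keeps a data.frame.
env_rows <- df[grepl("^Env_", rownames(df)), , drop = FALSE]
rownames(env_rows)
# [1] "Env_1990:trait_KPS" "Env_1991:trait_TKW"

# The character class matches a lone "n" or "_", so this is TRUE.
class_match <- grepl("[Env_]", "Gen_B88")
```

In R 3.3.0 and later, startsWith(rownames(df), "Env_") gives the same logical index without a regular expression.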
