Data Manipulation in R for Apriori

I have a part of the dataset, shown below in CSV form; the actual number of rows and columns is larger than what is shown. I want to implement Apriori on this dataset. Say I have this:
Maths Science C++ Java DC
[1] 75 44 55 56 88
[2] 56 88 54 78 44
The original dataset has 30 columns in total (representing subjects) and 24 serial numbers (representing students).
DATASET: link
I want to convert this dataset into the form shown below:
[1] {Maths,DC}
[2] {Science,Java}
i.e. a list of lists (I think that is what it is called) containing the column names. A student's list contains the subjects in which he/she scored 75 marks or more; the remaining subjects are dropped (this is the only condition of the problem).
E.g. the first student scored 75+ marks in Maths and DC, so his list includes only Maths and DC.
I am sorry for posting this, but I searched a lot on Stack Overflow and found a few working suggestions, yet I couldn't reach the final goal.
My goal is to get a form like this:
[9834] {semi-finished bread,
bottled water,
soda,
bottled beer}
[9835] {chicken,
tropical fruit,
other vegetables,
vinegar,
shopping bags}
As produced by:
library(arules)
inspect(Groceries)
Alternatively, I would appreciate it if anyone can suggest another way to represent the data that apriori can understand, as long as it follows the condition stated above.
(Sorry for the long post. I hope that converting my dataset into this format will help me study the patterns in the student-subject data. Thanks a ton for all the help.)

library(plyr)
library(arules)

df <- read.table(text =
" 75 44 55 56 88
  56 88 54 78 44")
names(df) <- c("Maths", "Science", "C++", "Java", "DC")

# For each row (student), keep the names of the subjects scored >= 75,
# then coerce the resulting list of character vectors to 'transactions'.
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {DC,Maths} 1
# [2] {Java,Science} 2
Edit: It works with your example dataset, too:
library(plyr)
library(arules)
df <- read.csv(file = url("https://drive.google.com/uc?export=download&id=0B3kdblyHw4qLR0dpT24xWUZGcGs"))
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {CD,CG,CN,DA,Data.Struc} 1
# [2] {CD,CG,CO,ML,OS} 2
# [3] {CN,Data.Struc,DC,DM,DMS} 3
# [4] {CHE,DD,DM,EC,EE} 4
# [5] {CHE,CN,MATHS,PHY} 5
# [6] {Data.Science,DM,DMS,ML,OS} 6
# [7] {CD,DA,Data.Struc,EC,MATHS} 7
# [8] {CG,CHE,CN,CO,OS} 8
# [9] {CN,CO,Data.Science,DC,DMS} 9
# [10] {DC,DD,EC,EE,PHY} 10
# [11] {CHE,DD,DMS,MATHS,PHY} 11
# [12] {CN,Data.Science,DM,MATHS,ML} 12
# [13] {CD,CG,DA,Data.Science,Data.Struc} 13
# [14] {CG,CO,EE,MATHS,OS} 14
# [15] {CN,CO,DC,DMS,PHY} 15
# [16] {CN,CO,DD,EC,EE} 16
# [17] {CHE,DA,EE,MATHS,PHY} 17
# [18] {Data.Science,DD,DM,ML,PHY} 18
# [19] {CD,CO,DA,Data.Struc,DC} 19
# [20] {CG,CO,DD,DM,OS} 20
# [21] {CG,CN,DA,DC,DMS} 21
# [22] {DD,EC,EE,ML,OS} 22
# [23] {CHE,CN,Data.Struc,MATHS,PHY} 23
# [24] {CG,Data.Science,DM,EE,ML} 24
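With the data in transactions form, Apriori itself can then be run directly. A minimal sketch, continuing from the transactions object above; the support and confidence thresholds are only illustrative and will need tuning for a 24-student dataset:
# Mine association rules from the transactions built above (illustrative thresholds)
rules <- apriori(transactions,
                 parameter = list(supp = 0.2, conf = 0.8, minlen = 2))
inspect(head(sort(rules, by = "lift"), 10))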

Related

formatting webchem pubchem output to a dataframe

I have about 3500 CAS numbers for which I would like to pull the chemical information from PubChem and put it into a dataframe. I have no idea how to format the output so that I can put it into a dataframe when I use the code below. The output of each call (please see below) seems to have the same structure: a list of 9 elements of varying size, 2 of which are tibbles of varying size. Any ideas would be appreciated! Thank you!
library(dplyr)
library(webchem)
# usage: ci_query(query, from = c("rn", "inchikey"), verbose = getOption("verbose"))
y1 <- ci_query('50-00-0', from = 'rn')
which yields:
y1
$`50-00-0`
$`50-00-0`$name
[1] "Formaldehyde [USP]" "Methanal"
$`50-00-0`$synonyms
[1] "AI3-26806" "Aldehyd mravenci" "Aldehyd
mravenci [Czech]"
[4] "Aldehyde formique" "Aldehyde formique [French]" "Aldehyde
formique [ISO-French]"
[7] "Aldeide formica" "Aldeide formica [Italian]" "BFV"
[10] "Caswell No. 465" "CCRIS 315" "Dormol"
[13] "EC 200-001-8" "EINECS 200-001-8" "EPA
Pesticide Chemical Code 043001"
[16] "Fannoform" "Formalaz"
"Formaldehyd"
[19] "Formaldehyd [Czech, Polish]" "Formaldehyde"
"Formaldehyde solution"
[22] "Formaldehyde, gas" "Formalin" "Formalin
40"
[25] "Formalin [JAN]" "Formalin-loesungen"
"Formalin-loesungen [German]"
[28] "Formalina" "Formalina [Italian]"
"Formaline"
[31] "Formaline [German]" "Formalith" "Formic
aldehyde"
[34] "Formol" "FYDE" "HSDB
164"
[37] "Karsan" "Lysoform"
"Methaldehyde"
[40] "Methanal" "Methyl aldehyde"
"Methylene oxide"
[43] "Morbicid" "NCI-C02799" "NSC
298885"
[46] "Oplossingen" "Oplossingen [Dutch]"
"Oxomethane"
[49] "Oxymethylene" "Paraform" "RCRA
waste number U122"
[52] "Superlysoform" "UN 1198" "UN 2209
(formalin)"
[55] "UNII-1HG84L3525"
$`50-00-0`$cas
[1] "50-00-0"
$`50-00-0`$inchi
[1] "InChI=1S/CH2O/c1-2/h1H2"
$`50-00-0`$inchikey
[1] "WSFSSNUMVMOOMR-UHFFFAOYSA-N"
$`50-00-0`$smiles
[1] "C=O"
$`50-00-0`$toxicity
# A tibble: 24 x 6
   Organism   `Test Type` Route        `Reported Dose (Normalized Dose)` Effect                                                  Source
   <chr>      <chr>       <chr>        <chr>                             <chr>                                                   <chr>
 1 cat        LCLo        inhalation   400mg/m3/2H (400mg/m3)            ""                                                      "\"To~
 2 cat        LDLo        intravenous  30mg/kg (30mg/kg)                 "BLOOD: OTHER CHANGES"                                  "Acta~
 3 dog        LDLo        intravenous  70mg/kg (70mg/kg)                 ""                                                      "Inte~
 4 dog        LDLo        subcutaneous 350mg/kg (350mg/kg)               ""                                                      "Inte~
 5 frog       LDLo        parenteral   800ug/kg (0.8mg/kg)               ""                                                      "Inte~
 6 guinea pig LD50        oral         260mg/kg (260mg/kg)               ""                                                      "Jour~
 7 human      TCLo        inhalation   17mg/m3/30M (17mg/m3)             "LUNGS, THORAX, OR RESPIRATION: OTHER CHANGESSENSE OR~  "JAMA~
 8 man        LDLo        unreported   477mg/kg (477mg/kg)               ""                                                      "\"Po~
 9 man        TCLo        inhalation   300ug/m3 (0.3mg/m3)               "SENSE ORGANS AND SPECIAL SENSES: OTHER CHANGES: OLFA~  "Gigi~
10 man        TDLo        oral         643mg/kg (643mg/kg)               "GASTROINTESTINAL: NAUSEA OR VOMITINGLUNGS, THORAX, O~  "Japa~
# ... with 14 more rows
$`50-00-0`$physprop
# A tibble: 8 x 5
`Physical Property` Value Units `Temp (deg C)` Source
<chr> <dbl> <chr> <int> <chr>
1 Melting Point -9.2 e+ 1 deg C NA EXP
2 Boiling Point -1.91e+ 1 deg C NA EXP
3 pKa Dissociation Constant 1.33e+ 1 (none) 25 EXP
4 log P (octanol-water) 3.5 e- 1 (none) NA EXP
5 Water Solubility 4 e+ 5 mg/L 20 EXP
6 Vapor Pressure 3.89e+ 3 mm Hg 25 EXP
7 Henry's Law Constant 3.37e- 7 atm-m3/mole 25 EXP
8 Atmospheric OH Rate Constant 9.37e-12 cm3/molecule-sec 25 EXP
$`50-00-0`$source_url
[1] "https://chem.nlm.nih.gov/chemidplus/rn/50-00-0"
attr(,"class")
[1] "ci_query" "list"

read csv, skip three lines but include header names in data.frame [duplicate]

This question already has answers here: unable to get column names when using skip along with read.csv (2 answers)
Closed 7 years ago.
I want to read a CSV file and skip three lines (but not the header), while keeping the header names in the data.frame. I've tried the following, but the header names are wrong:
> sine = read.csv(file="sine.csv",head=TRUE,sep=",", skip=3, check.names=TRUE)
> colnames(sine)
[1] "X0" "X0.0" "X0.0.1" "X0.0.2" "None" "X1.0" "X0.0.3" "None.1" "X.."
[10] "X0.1" "X0.2"
When I read dataset without skipping three lines header names are OK:
> sine = read.csv(file="sine.csv",head=TRUE,sep=",", skip=0, check.names=TRUE)
> colnames(sine)
[1] "reset"
[2] "angle"
[3] "sine"
[4] "multiStepPredictions.actual"
[5] "multiStepPredictions.1"
[6] "anomalyScore"
[7] "multiStepBestPredictions.actual"
[8] "multiStepBestPredictions.1"
[9] "anomalyLabel"
[10] "multiStepBestPredictions.multiStep.errorMetric..altMAPE..steps..1..window.1000.field.sine"
[11] "multiStepBestPredictions.multiStep.errorMetric..aae..steps..1..window.1000.field.sine"
What am I doing wrong?
Something like this:
foo <- read.csv("http://www.ats.ucla.edu/stat/r/faq/test.csv", header=T)
foo
# make model mpg weight price
# 1 amc concord 22 2930 4099
# 2 amc oacer 17 3350 4749
# 3 amc spirit 22 2640 3799
# 4 buick century 20 3250 4816
# 5 buick electra 15 4080 7827
colnames(foo)
# [1] "make" "model" "mpg" "weight" "price"
bar <- read.csv("http://www.ats.ucla.edu/stat/r/faq/test.csv", header=T, skip=3)
bar
# amc spirit X22 X2640 X3799
# 1 buick century 20 3250 4816
# 2 buick electra 15 4080 7827
colnames(bar)
# [1] "amc" "spirit" "X22" "X2640" "X3799"
As Richard Scriven pointed out, my initial answer did not work; I don't know how I missed that. I found this SO answer and made the solution below.
all_content <- readLines("http://www.ats.ucla.edu/stat/r/faq/test.csv")
skip_second <- all_content[-(2:4)]   # drop lines 2-4, keep line 1 (the header)
foo2 <- read.csv(textConnection(skip_second),
                 header = TRUE, stringsAsFactors = FALSE)
foo2
# make model mpg weight price
# 1 buick century 20 3250 4816
# 2 buick electra 15 4080 7827
colnames(foo2)
# [1] "make" "model" "mpg" "weight" "price"

Replace every word by an index in 15 million strings

I have a list of 15 million strings and a dictionary of 8 million words. I want to replace every word in each string by that word's index in the dictionary.
I tried using the hash package for faster indexing, but it still takes hours to do the replacement in all 15 million strings.
What is an efficient way of implementing this?
Example[EDITED]:
# Database
[[1]]
[1]"a admit been c case"
[[2]]
[1]"co confirm d ebola ha hospit howard http lik"
# dictionary
"t" 1
"ker" 2
"be" 3
.
.
.
.
# Output:
[[1]]123 3453 3453 567
[[2]]6786 3423 234123 1234 23423 6767 3423 124431 787889 111
Where the index of admit in the dictionary is 3453.
Any kind of help is appreciated.
Updated Example with Code:
This is what I am currently doing.
Example: data =
[1] "a co crimea divid doe east hasten http polit secess split t threaten ukrain via w west xtcnwl youtub"
[2] "billion by cia fund group nazy spent the tweethead ukrain"
[3] "all back energy grandpar home miss my posit radiat the"
[4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
[5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"
library(hash)   # provides the hash() dictionary used below

words.list   <- strsplit(data, "\\W+", perl = TRUE)
words.vector <- unlist(words.list)
sorted.words <- sort(table(words.vector), decreasing = TRUE)
h <- hash(names(sorted.words), 1:length(names(sorted.words)))

index <- lapply(data, function(row) {
  temp <- trim.leading(row)   # helper that strips leading whitespace (not shown here)
  word_list  <- unlist(strsplit(temp, "\\W+", perl = TRUE))
  index_list <- lapply(word_list, function(x) h[[x]])
  # print(index_list)
  unlist(index_list)
})
Output:
index_list
[[1]]
[1] 6 1 19 21 22 23 31 2 40 44 46 3 48 5 51 52 53 54 55
[[2]]
[1] 12 14 16 26 30 38 45 4 49 5
[[3]]
[1] 7 11 25 29 32 36 37 41 42 4
[[4]]
[1] 10 13 15 1 20 24 2 35 39 43 47 3
[[5]]
[1] 8 9 1 17 18 27 28 2 33 34 3 50
The output is index. This runs fast if data is short, but execution is really slow when the length is 15 million.
My task is nearest-neighbour search: I want to search for 1000 queries which are in the same format as the database.
I have tried many things, such as parallel computation, but I ran into memory issues.
[EDIT] How can I implement this using Rcpp?
I think you'd like to avoid the lapply() by splitting the data, unlisting, and then processing the vector of words:
data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
## ... additional processing, e.g., strip white space, on the vector 'words'
perform the match, then re-list to original
relist(match(words, word.vector), data.list)
For downstream applications it might actually pay to retain the vector + 'partitioning' information, partition = sapply(data.list, length) rather than re-listing, since it'll continue to be efficient to operate on the unlisted vector. The Bioconductor S4Vectors package provides a CharacterList class that takes this approach, where one mostly works on something that is list-like, but where the data are stored and most operations are on an underlying character vector.
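A small worked version of that approach, using the two example strings from the question; dict here is just an illustrative stand-in for the 8-million-word dictionary (a word's index is its position in dict):
data <- c("a admit been c case",
          "co confirm d ebola ha hospit howard http lik")
dict <- c("t", "ker", "be", "a", "admit", "been", "c", "case")   # illustrative stand-in

data.list <- strsplit(data, "\\W+", perl = TRUE)
words     <- unlist(data.list)

# One vectorised lookup for every word at once, then restore the per-string
# grouping; words that are not in `dict` come back as NA.
relist(match(words, dict), data.list)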
Sounds like you're doing NLP.
A fast non-R solution (which you could wrap in R) is word2vec
The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

ff package in R: how to move data from one drive to another, and change filenames

I am working intensively with the amazing ff and ffbase packages.
Due to some technical details, I have to work on my C: drive in my R session. After finishing, I move the generated files to my P: drive (using cut/paste in Windows, NOT using ff).
The problem is that when I load the ffdf object:
load.ffdf("data")
I get the error:
Error: file.access(filename, 0) == 0 is not TRUE
This is OK, because nobody told the ffdf object that it was moved, but trying:
filename(data$x) <- "path/data_ff/x.ff"
or
pattern(data) <- "./data_ff/"
does not help, giving the error:
Error in `filename<-.ff`(`*tmp*`, value = filename) :
ff file rename from 'C:/DATA/data_ff/id.ff' to 'P:/DATA_C/data_ff/e84282d4fb8.ff' failed.
Is there any way to change, inside the ffdf object, the path to the files' new location?
Thank you!
If you want to 'correct' your filenames afterwards you can use:
physical(x)$filename <- "newfilename"
For example:
> a <- ff(1:20, vmode="integer", filename="./a.ff")
> saveRDS(a, "a.RDS")
> rm(a)
> file.rename("./a.ff", "./b.ff")
[1] TRUE
> b <- readRDS("a.RDS")
> b
ff (deleted) integer length=20 (20)
> physical(b)$filename <- "./b.ff"
> b[]
opening ff ./b.ff
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Using filename() in the first session would of course have been easier. You could also have a look at the save.ffdf and corresponding load.ffdf functions in the ffbase package, which make this even simpler.
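A minimal sketch of that save.ffdf()/load.ffdf() route, assuming `data` is the ffdf and using the directories from the question as illustrative paths:
library(ffbase)

# Save all .ff files backing the ffdf `data` into a single directory ...
save.ffdf(data, dir = "C:/DATA/data_ff")

# ... move that directory to P: outside of R, then in a new session:
load.ffdf(dir = "P:/DATA_C/data_ff")   # restores `data`, resolving the new file paths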
Addition
To rename the filenames of all columns in a ffdf you can use the following function:
redir <- function(ff, newdir) {
  for (x in physical(ff)) {            # loop over the ff vectors backing each column
    fn <- basename(filename(x))
    physical(x)$filename <- file.path(newdir, fn)
  }
  return(ff)
}
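Usage would then look something like this (the target path is just illustrative), reassigning the result:
# Assuming `data` is the ffdf loaded from the old location and its .ff files
# now live under P:/DATA_C/data_ff:
data <- redir(data, "P:/DATA_C/data_ff")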
You can also use ff:::clone()
R> foo <- ff(1:20, vmode = "integer")
R> foo
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(foo)$filename
[1] "/vol/fftmp/ff69be3e90e728.ff"
R> bar <- clone(foo, pattern = "~/")
R> bar
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(bar)$filename
[1] "/home/ubuntu/69be5ec0cf98.ff"
From what I understand from briefly skimming the code of save.ffdf and load.ffdf, those functions do this for you when you save/load.

Selecting rows in data.frame based on character strings

I have a data.frame with row.names as in test below.
test <-
c("Env_1990:trait_KPS", "Env_1990:trait_SPSM", "Env_1990:trait_TKW",
"Env_1990:trait_Yield", "Env_1991:trait_KPS", "Env_1991:trait_SPSM",
"Env_1991:trait_TKW", "Env_1991:trait_Yield", "Env_1992:trait_KPS",
"Env_1992:trait_SPSM", "Env_1992:trait_TKW", "Env_1992:trait_Yield",
"Env_1993:trait_KPS", "Env_1993:trait_SPSM", "Env_1993:trait_TKW",
"Env_1993:trait_Yield", "Env_1994:trait_KPS", "Env_1994:trait_SPSM",
"Env_1994:trait_TKW", "Env_1994:trait_Yield", "Env_1995:trait_KPS",
"Env_1995:trait_SPSM", "Env_1995:trait_TKW", "Env_1995:trait_Yield",
"Gen_B88:Env_1990:trait_KPS", "Gen_B88:Env_1990:trait_SPSM",
"Gen_B88:Env_1990:trait_TKW", "Gen_B88:Env_1990:trait_Yield",
"Gen_B88:Env_1991:trait_KPS", "Gen_B88:Env_1991:trait_SPSM",
"Gen_B88:Env_1991:trait_TKW", "Gen_B88:Env_1991:trait_Yield",
"Gen_B88:Env_1992:trait_KPS", "Gen_B88:Env_1992:trait_SPSM",
"Gen_B88:Env_1992:trait_TKW", "Gen_B88:Env_1992:trait_Yield",
"Gen_B88:Env_1993:trait_KPS", "Gen_B88:Env_1993:trait_SPSM",
"Gen_B88:Env_1993:trait_TKW", "Gen_B88:Env_1993:trait_Yield")
I want to select only those rows whose names start with Env_. I tried this code in R:
grep(pattern = "[Env_]", x = test)
This code gives me all rows because Env_ appears in every row name. I wonder how to select only the rows that start with Env_. Thanks in advance for your help.
You want to add the ^ character for the beginning of the line/string. (Also, [Env_] is a character class that matches any single one of the characters E, n, v or _ anywhere in the string, which is why your original pattern matched every element.)
> grep("^Env_", test)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
> grep("^Env_", test, value = TRUE)
[1] "Env_1990:trait_KPS" "Env_1990:trait_SPSM" "Env_1990:trait_TKW"
[4] "Env_1990:trait_Yield" "Env_1991:trait_KPS" "Env_1991:trait_SPSM"
[7] "Env_1991:trait_TKW" "Env_1991:trait_Yield" "Env_1992:trait_KPS"
[10] "Env_1992:trait_SPSM" "Env_1992:trait_TKW" "Env_1992:trait_Yield"
[13] "Env_1993:trait_KPS" "Env_1993:trait_SPSM" "Env_1993:trait_TKW"
[16] "Env_1993:trait_Yield" "Env_1994:trait_KPS" "Env_1994:trait_SPSM"
[19] "Env_1994:trait_TKW" "Env_1994:trait_Yield" "Env_1995:trait_KPS"
[22] "Env_1995:trait_SPSM" "Env_1995:trait_TKW" "Env_1995:trait_Yield"
