Find the related items in R programming for data mining

I am working on data mining in R, using RStudio. My dataset looks like this:
In some places I've used 'yes' or 'no' instead of a disease name, just to check whether the rules pick up 'yes' or 'no'.
As you can see, a patient can have several diseases/diagnoses. I am trying to use association rules to display the diseases that a person suffers from along with HTN. I've written the following code:
mytestdata <- read.csv("D:/Senior Thesis/Program/test.csv", header=T,
colClasses = "factor", sep = ",")
library(arules)
myrules <- apriori(mytestdata,
parameter = list(supp = 0.1, conf = 0.1, maxlen=10, minlen=2),
appearance = list(rhs=c("Disease.1=HTN")))
summary(myrules)
inspect(myrules)
But I'm not getting any disease name in the lhs column; you can see that in the following image:
Please help me get lhs to show the names of the diseases associated with the rhs, which is Disease.1=HTN.

Your code treats missing values (e.g. cell E4 in the Excel sheet) as a factor level. You can prevent this by telling read.csv which string represents NA, using the na.strings argument:
mytestdata <- read.csv("D:/Senior Thesis/Program/test.csv", header=T,
colClasses = "factor", sep = ",", na.strings = "")
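To see the effect without the original file, here is a minimal sketch on a small inline CSV. The data and the names csv_text, with_blank and with_na are made up for illustration; only the read.csv arguments match the answer above.

```r
# Hypothetical three-row CSV; the second row has an empty Disease.2 cell.
csv_text <- "Disease.1,Disease.2\nHTN,DM\nHTN,\nDM,HTN"

# Without na.strings, the empty cell becomes its own factor level "".
with_blank <- read.csv(text = csv_text, header = TRUE,
                       colClasses = "factor", sep = ",")
print(levels(with_blank$Disease.2))

# With na.strings = "", the empty cell is read as NA and is no longer a level.
with_na <- read.csv(text = csv_text, header = TRUE,
                    colClasses = "factor", sep = ",", na.strings = "")
print(levels(with_na$Disease.2))
print(sum(is.na(with_na$Disease.2)))
```

When the empty string is a level, apriori can treat "Disease.2=" as an ordinary item, which is presumably why blank items showed up on the lhs.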

It would, if you had more data. There are just 3 rows that satisfy your rhs!
Note that you do get Disease.2=yes.
But I assume you want to ignore order on the diseases...

Related

PupilPre (R package for analyzing pupil data) problem - Prep_data function converts TIMESTEP values to NAs

We are interested in analyzing our pupil data (only interested in size, not position) recorded with an SR eyelink 1000Hz system.
We exported the files using the SR data viewer as sample reports.
After running ppl_prep_data, the class of the TIMESTAMP variable is converted from character to numeric, but it returns all NA and the real timestamp values are lost. The rest of the pipeline is therefore not working.
Does anyone have an idea why we get NA here and, if so, how we could work around it?
Below you can find the code that we are using:
#step 1 Load library
library(PupilPre)
#step 2:load data
# change the folder where the data is in the line below
Pupildat <- read.table("DATAXX.txt", header = T, sep = "\t", na.strings = c(".", "NA"))
# after reading in, the first column gets a weird name (something with ?..), so we rename it for the next line of code
names(Pupildat)[1] <- 'RECORDING_SESSION_LABEL'
## Step 3:PupilPre Pipeline ###
# Check classes of columns and reassigns => creates event variable
data_pre <- ppl_prep_data(data = Pupildat, Subject = "RECORDING_SESSION_LABEL", EventColumns = c("Subject", "TRIAL_INDEX"))
align_msg(data_pre, Msg = "Hashtag_1")
#Using the function check_msg_time you can see that the TIMESTAMP values associated with the message are not the same for each event.
#This indicates that alignment is required. Note that a regular expression (regex) can be used here as the message string.
#example below, though think we want different timings for the events
check_msg_time(data = data_pre, Msg = "Hashtag_1")
### returns NA

Problems with displaying .txt file (delimiters)

I have a problem with one task where I have to load some data set, and I have to make sure that missing values are read in properly and that column names are unambiguous.
The format of .txt file:
At the end, data set should contain only country column and median age.
I tried using read.delim, precisely this chunk:
rawdata <- read.delim("rawdata_343.txt", sep = "", stringsAsFactors = FALSE, header = TRUE)
And when I run it, I get this:
It confuses me that if country has multiple words (Turks and Caicos Islands) it assigns every word to another column.
Since I am still a beginner in R, any suggestion would be very helpful for me. Thanks!
Three points to note about your input file: (1) the first two lines at the top are not tabular and should be skipped with skip = 2, (2) your column separators are tabs and this should be specified with sep = "\t", and (3) you have no headers, so header = FALSE. Your command should be:
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2)
UPDATE: A fourth point is that the first column includes row numbers, so row.names = 1. This also addresses the follow-up comment.
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2, row.names = 1)
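As a rough illustration of those settings, here is a sketch on inline text. The original file is not shown, so the two junk lines, the values and the column names country and median_age are all assumptions made for the example.

```r
# Hypothetical file contents: two non-tabular lines, then tab-separated rows
# of row number, country and median age (made-up values).
txt <- paste(
  "Median age by country",
  "Source: example only",
  "1\tTurks and Caicos Islands\t30.1",
  "2\tFrance\t40.2",
  sep = "\n"
)

# Same arguments as in the answer, but reading from text instead of a file.
rawdata <- read.delim(text = txt, sep = "\t", stringsAsFactors = FALSE,
                      header = FALSE, skip = 2, row.names = 1)

# With header = FALSE the columns get default names, so set clear ones.
names(rawdata) <- c("country", "median_age")
print(rawdata)
```

Because the separator is a tab and not whitespace in general, a multi-word country name stays in a single column.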
It looks like the sep= argument you specified tells R to treat whitespace as the column delimiter. Looking at your data as a .txt file, there is no apparent delimiter (like the commas you would find in a typical .csv). If you can put the data in tabular form in something like a .csv or .xlsx file, R is much better at reading it as expected. As it is, you may struggle to get the .txt format to read in a tabular fashion, which is what I assume you want.
P.s. you can use read.csv() if you do end up putting the data in that format.

Set column types for csv with read.csv.ffdf

I am using a payments dataset from Austin Texas Open Data. I am trying to load the data with the following code:
library(ff)
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv", first.rows = 100, next.rows = 50, FUN = "read.csv", VERBOSE = TRUE)
This shows me the following error:
read.table.ffdf 301..Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '7AHM'
This happens on the 339th line of the CSV file, in the 5th column of the dataset. I think it happens because all the earlier values in the 5th column are integers, whereas this one is a string. The actual type of the column should be string, though.
So I wanted to know whether there is a way to set the column types.
Below I am providing the types for all the columns in a vector:
c("character","integer","integer","character","character", "character","character","character","character","character","integer","character","character","character","character","character","character","character","integer","character","character","character","character","character","integer","integer","integer","character","character","character","character","double","character","integer")
You can also find the type of each column from the description of the dataset.
Please also keep in mind that I am very new to this library. Practically just found out about it today.
Maybe you need to convert your column types. The following is just an example that may help you.
data <- transform(
data,
age=as.integer(age),
sex=as.factor(sex),
cp=as.factor(cp),
trestbps=as.integer(trestbps),
choi=as.integer(choi),
fbs=as.factor(fbs),
restecg=as.factor(restecg),
thalach=as.integer(thalach),
exang=as.factor(exang),
oldpeak=as.numeric(oldpeak),
slope=as.factor(slope),
ca=as.factor(ca),
thai=as.factor(thai),
num=as.factor(num)
)
sapply(data, class)
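An alternative to converting after the fact is to declare the types while reading. The sketch below uses base read.csv on made-up inline data (the columns id, code and amount are hypothetical). As far as I know, read.table.ffdf forwards a colClasses argument to the reader function in the same way, but ff stores strings as factors and not as character, so "factor" may be needed in place of "character" there; please check the ff documentation before relying on that.

```r
# Hypothetical data: the 'code' column looks like integers until "7AHM" appears.
csv_text <- "id,code,amount\n1,1001,10.5\n2,7AHM,20.0\n3,1003,7.25"

# Declare one type per column so the reader does not guess from early rows.
col_types <- c("integer", "character", "numeric")
payments <- read.csv(text = csv_text, colClasses = col_types)

# Check the resulting column classes.
print(sapply(payments, class))
```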

Error in arulesSequences in R

I am trying to do a sequential pattern analysis using arulesSequences in R.
My data set has 626,047 rows after removing all kinds of duplicates. Unfortunately I can't share the dataset here. I have created sample data in a Google sheet to give an idea of what the data looks like; it is here. The data is named df_sq.
It has 3 columns:
numeric_id, of class numeric. This is a user_id.
Product, of class factor.
Time, of class integer.
I have been able to convert the data to 'transaction' format as the package requires. But on running cSpade, I get the following error:
Error in makebin(data, file) : 'eid' invalid (strict order)
Now, I know from reading other questions on Stack Overflow that this means I have to sort my data.
So I went back and sorted my original data by numeric_id and time, and vice versa as well. Then I re-converted the data to 'transaction' format and re-ran cSpade.
I am still getting the same error.
Has anyone worked with this package before?
Here is the code I used:
library(arules)
library(arulesViz)
library(arulesSequences)
library(sqldf)
df_sq = read.csv("service_data.csv", stringsAsFactors = FALSE)
#Changing class of timestamp column and coercing product name to factor
df_sq$time1 = as.integer(as.numeric(df_sq$time1))
df_sq$service_name = as.factor(df_sq$service_name)
#Clearing duplicates
df_sq = sqldf("select distinct numeric_id, service_name, time1
from df_sq")
#Ordering the dataset on numeric id and time
df_sq = df_sq3[order(df_sq3$numeric_id, df_sq3$time1),]
df_sq = df_sq3[order(df_sq3$time1),]
df_sq = df_sq3[order(df_sq3$sequenceID),]
#Converting to transactional format per the package
sq_data = data.frame(item=df_sq3$service_name)
sq_tran = as(sq_data, "transactions")
transactionInfo(sq_tran)$sequenceID = df_sq3$numeric_id
transactionInfo(sq_tran)$eventID = df_sq3$time1
summary(sq_tran)
#Running cSpade
s1 = cspade(sq_tran, parameter = list(support = 0.1),
            control = list(verbose = TRUE), tmpdir = tempdir())
summary(s1)
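One thing worth checking, as an assumption and not a confirmed diagnosis: the "strict order" message suggests that event IDs must be strictly increasing within each sequence ID, so two rows with the same numeric_id and the same time1 would violate it even after sorting. The base R sketch below uses made-up data in the same shape as df_sq; df_sorted, dup_events, strictly_increasing and baskets are hypothetical names.

```r
# Made-up data with the same three columns as df_sq.
df_sq <- data.frame(
  numeric_id   = c(2, 1, 1, 2, 1),
  service_name = c("b", "a", "c", "a", "b"),
  time1        = c(5L, 3L, 1L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Sort once, by sequence ID first and event ID second. Separate successive
# order() calls would each overwrite the previous ordering.
df_sorted <- df_sq[order(df_sq$numeric_id, df_sq$time1), ]

# Flag rows that repeat an earlier (numeric_id, time1) pair.
dup_events <- duplicated(df_sorted[, c("numeric_id", "time1")])

# Check that time1 is strictly increasing within every numeric_id.
strictly_increasing <- all(tapply(
  df_sorted$time1, df_sorted$numeric_id,
  function(t) all(diff(t) > 0)
))

# One possible fix: merge items that share a (numeric_id, time1) pair
# into a single row, so each event ID occurs once per sequence.
baskets <- aggregate(
  service_name ~ numeric_id + time1, data = df_sorted,
  FUN = function(x) paste(sort(unique(x)), collapse = ",")
)
baskets <- baskets[order(baskets$numeric_id, baskets$time1), ]
print(baskets)
```

If dup_events contains any TRUE values on the real data, those rows would need to be merged into one transaction (or given distinct event IDs) before the conversion to transactions.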

cca per groups and row.names

I am somewhat new to R, so forgive my basic questions.
I perform a CCA on a full dataset (358 sites, 40 abiotic parameters, 100 species observation).
library(vegan)
env <- read.table("env.txt", header = TRUE, sep = "\t", dec = ",")
otu <- read.table("otu.txt", header = TRUE, sep = "\t", dec = ",")
cca <- cca(otu~., data=env)
cca.plot <- plot(cca, choices=c(1,2))
vif.cca(cca)
ccared <- cca(formula = otu ~ EnvPar1 + EnvPar2 + EnvParN, data = env)  # EnvPar1 ... EnvParN: placeholders for the retained environmental variables
ccared.plot <- plot(ccared, choices=c(1,2))
orditorp(ccared.plot, display="sites")
This works without sample names in the first column. (Initially, the first column, which contains numeric sample names, got interpreted as a variable, so I used tables without that information. When I add site names to the plot via orditorp, it gives "row.name=n" in the plot.)
I want to use my sample names, however. I tried row.names=1 on both tables with sample name information:
envnames <- read.table("envwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
otunames <- read.table("otuwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
and any combination of env/otu/envnames/otunames. cca worked in every case, but any plot command yielded
plot.ccarownames <- plot(cca(ccarownames, choices=c(1,2)))
Error in rowSums(X) : 'x' must be numeric
My second problem is connected to that: the 358 sites fall into 6 groups (4x60, 2x59). The complete matrix carries this information as an extra column.
Since I couldn't work out the row name problem, I am even more stuck with nominal data.
The original matrix contains a first column (sample names, numeric, but easily transformed to nominal) and a second one (group identity, nominal), followed by the biological observations.
What I would like to have:
A CCA containing all six groups that colors sites per group.
A CCA containing only data for one group (without manual construction of individual input tables).
CCA plots that use my original sample names.
Any help is appreciated! Really, I have been stuck with this since yesterday morning :/
I'm using cca() from vegan myself and have run into some of the same problems; however, I've been able to at least solve your original "row names" problem. I'm doing a CCA analysis on data from 41 soils, with 334 species and 39 environmental factors.
In my case I used
rownames(MyDataSet) <- MyDataSet$ObservationNamesColumn
(I used placeholder names such as MyDataSet for the sake of example here.)
However, I still had environmental factors which weren't numerical (such as soil texture). You could check for non-numerical factors in case you have a mistake in your original dataset, or an abiotic factor which is not interpreted as numerical for some other reason. To do this you can use either str(MyDataSet), which tells you the type of each of your variables, or lapply(MyDataSet, class), which gives the same information in a different output format.
If you have abiotic factors which are not numerical (again, such as texture) and you want to remove them, you can create a new dataset containing only the numerical variables (you will still keep your observation names, as they were defined as row names). This is rather easy to do with something similar to this:
MyDataSet.num <- MyDataSet[,sapply(MyDataSet, is.numeric)]
This creates a new data set which has the same rows as the original but only columns (variables) with numeric values. You should be able then to continue your work using this new data set.
I am very new to both R programming and statistics (I'm a microbiologist) but I hope this helps!
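Putting those steps together for a table laid out as in the question (sample name, group, then observations), a minimal base R sketch might look like this. The inline data and the names full, groups, otu, otu_A and group_cols are made up for illustration; no vegan call is made here.

```r
# Hypothetical tab-separated table: sample name, group, then two species.
txt <- "sample\tgroup\tsp1\tsp2\n101\tA\t3\t0\n102\tA\t1\t4\n103\tB\t0\t2"

# row.names = 1 turns the first column into row names, not a variable.
full <- read.table(text = txt, header = TRUE, sep = "\t", row.names = 1)

# Keep the nominal group column aside; it must not enter the numeric matrix.
groups <- factor(full$group)

# Numeric-only table, which avoids "'x' must be numeric" from rowSums().
otu <- full[, sapply(full, is.numeric), drop = FALSE]

# Subset to a single group without building a separate input file.
otu_A <- otu[groups == "A", , drop = FALSE]

# One integer per site, usable as a colour index when plotting sites.
group_cols <- as.integer(groups)
print(rownames(otu))
```

The numeric table could then be passed to cca(), and group_cols could be supplied as the col argument when drawing the site points, though I have not tested that on the original data.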
