R memory issues for an extremely large dataset

I need to perform regression analysis on a 3.5 GB CSV dataset of mixed (numerical and categorical) data: roughly 1.8 million records and 1,000 variables/columns, mostly containing 0s and 1s plus a few categorical and numeric values. (See the snapshot of the data below.)
I was originally supposed to perform clustering directly on this dataset, but I kept getting a lot of memory-related errors despite running it on a remote virtual machine (64-bit Windows Server 2012 R2) with 64 GB of RAM. So I thought of doing some factor analysis to find correlations between the variables, so that I can reduce the number of columns to 600-700 (as much as possible). Any other ideas are appreciated, as I am very new to data analysis.
I have tried various packages like ff, bigmemory, biganalytics, biglm, FactoMineR, Matrix, etc., but with no success. I have always encountered "cannot allocate vector of size …", "reached maximum allocation of size 65535MB", or some other error.
Can you let me know of a solution to this? I feel memory should not be a problem, as 64 GB of RAM should suffice.
Snapshot of dataset:
SEX  AGE  Adm  Adm  LOS  DRG  DRG RW  Total    DC Disp  Mortality  AAADXRUP
M    17   PSY       291  887  0.8189  31185    PDFU     0          0
M    57   PSY  ER   31   884  0.9529  54960.4  SNF      0          0
F    23   AC   PH   3    775  0.5283  9497.7   HOM      0          0
F    74   AC   PH   3    470  2.0866  23020.3  SNF      0          0
There are additional columns after Mortality mostly containing 0s or 1s
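One direction worth trying (a minimal sketch of an assumed workflow, not a tested solution): read the file with data.table::fread rather than read.csv, and store the 0/1 indicator block as a sparse matrix from the Matrix package, which only keeps the non-zero entries. The file name data.csv and the split at column 11 are placeholders for your actual layout.
library(data.table)
library(Matrix)
# Hypothetical layout: columns 1-10 are the mixed demographic fields,
# columns 11 onward are the 0/1 indicators. Adjust to the real file.
dt <- fread("data.csv")    # much lighter on memory than read.csv
meta <- dt[, 1:10]
# Convert the indicator block to a sparse matrix in chunks of 100 columns,
# so the full dense matrix is never materialised at once.
ind_cols <- names(dt)[11:ncol(dt)]
chunks <- split(ind_cols, ceiling(seq_along(ind_cols) / 100))
ind <- do.call(cbind, lapply(chunks, function(cols)
  Matrix(as.matrix(dt[, ..cols]), sparse = TRUE)))
object.size(ind)    # compare with object.size(dt)
With mostly-zero indicators the sparse form is typically several times smaller, leaving room to look at column redundancy (e.g. crossprod(ind) on the sparse matrix) before clustering.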

Related

Cluster analysis in R on large data set

I have a data set with rankings as the column names and about 15,000 contestants. My data looks like:
contestant    1    2    3    4
       101   13    0    5   12
        14    0    1   34    6
       ...  ...  ...  ...  ...
       500    0    2   23    3
I've been working on doing cluster analysis on this dataset. Dendrograms are obviously not very helpful here; they just produce a thick block of lines because of the large number of entries.
I'm wondering if there is a better way to do cluster analysis with this type of data. I've tried fviz_cluster() and similar commands, as well as gone through multiple tutorials. Many tutorials guided me through making dendrograms, but the data always seems to be different from mine (comparing two variables, etc.) and much smaller. Essentially, I'm asking which types of cluster analysis may work well with this type of data.

How to create a hierarchical cluster using categorical and numerical data in R?

I want to create a hierarchical cluster to show types of careers and the bank balance that people in those careers have.
I have a dataset with two variables, job and balance:
job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88
I want the result to look like a dendrogram where A, B, C, etc. are the job categories.
Can anyone help me start this or give me some help?
I have no idea how to begin.
Thanks!
You can start by using the dist and hclust functions.
df <- read.table(text = " job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88")
dist computes the distance between each element (by default, the Euclidean distance):
distances <- dist(df$balance)
You can then cluster your values using the distance matrix generated above:
clusters <- hclust(distances)
By default, hclust applies complete-linkage clustering to your data.
Finally, you can plot your results as a tree:
plot(clusters, labels = df$job)
Here, we clustered all the entries in your data frame; that's why some jobs are duplicated. If you want a single value per job, you can, for example, take the mean balance for each job using tapply:
means <- tapply(df$balance, df$job, mean)
And then cluster the jobs:
distances <- dist(means)
clusters <- hclust(distances)
plot(clusters)
You can then try to use other distance measures or other clustering algorithms (see help(dist) and help(hclust) for other methods).
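Since the title mentions mixed categorical and numerical data, one further option (not used in the answer above) is the Gower dissimilarity from the cluster package, which handles factors and numeric columns together:
library(cluster)
# Treat job as a categorical variable alongside the numeric balance.
df$job <- factor(df$job)
# daisy() with the Gower metric builds a dissimilarity from mixed columns.
gower <- daisy(df, metric = "gower")
clusters <- hclust(gower, method = "complete")
plot(clusters, labels = df$job)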

Performing a test on counts

In R I have a dataset data1 that contains game and times. There are 6 games, and times simply tells us how many times a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similar for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2 but I do not think that information is relevant.
I want to compare the two datasets, see whether there is a statistically significant difference, and find out which game "causes" that difference.
What test should I use to compare these? I don't think Pearson's chisq.test is the right choice in this case; maybe wilcox.test is the right one to choose?
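For what it's worth, Pearson's chi-squared test on the raw count table is a standard choice here, and its standardised residuals point at the games driving any difference. A minimal sketch, assuming both data frames list the six games in the same order:
# One row per dataset, one column per game.
counts <- rbind(data1 = data1$times, data2 = data2$times)
colnames(counts) <- data1$game
test <- chisq.test(counts)
test
# Cells with a large absolute standardised residual (roughly > 2) show
# which games differ most between the two datasets.
round(test$stdres, 2)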

Positive output and variables in an nnet model in R

Can anyone tell me how to constrain the output and selected variables of a neural network, using the function nnet in R, so that the influence of a characteristic is positive? I have a database (real estate) with numerical values (surface, price) and categorical values (parking Y/N, area code, etcetera). The output of the model is the price. The thing is that the model currently estimates that in a few area codes the homes with a parking spot are worth less than the homes without a parking spot. I would like to constrain the output (price) so that in each area code the influence of a parking spot on the price is positive. Of course, a really small house with a parking spot can still be cheaper than a big house without a parking spot.
Example data (of 80,000 observations):
Price   Surface  Parking Y  Areacode 1  Areacode 2  Areacode 3
100000  100      0          1           0           0
110000  99       1          0           1           0
200000  110      0          0           0           1
150000  130      0          0           1           0
190000  130      1          0           0           1
I modelled this in R using nnet:
model = nnet(Price~ . , data=data6, MaxNWts=2500, size=12, skip=TRUE, linout=TRUE, decay=0.025, na.action=na.omit)
I used nnet because I hope to find different values for parking spots per area code. If there is a better way to do this, please let me know.
I'm using RStudio version 0.98.976 on Windows XP (yes, I know ;)).
Thanks in advance for your replies.
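nnet has no built-in way to impose this kind of monotonicity constraint. As a rough alternative (a different technique from the nnet model above, shown only as a sketch), a plain linear model with a parking-by-area-code interaction makes the per-area-code parking effect explicit; the column names below are placeholders for the real ones in data6:
# Hypothetical column names; adjust to the real ones in data6.
fit <- lm(Price ~ Surface + Areacode1 + Areacode2 + Areacode3 +
            ParkingY:(Areacode1 + Areacode2 + Areacode3),
          data = data6)
# One parking coefficient per area code. If some of these come out negative,
# the data itself prices parking negatively in those areas, and no
# unconstrained model (nnet included) will reverse that.
summary(fit)$coefficients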

How to optimize for loops in an extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the judge within the 30-day window ending on the completion date of the case at issue.
here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. For each record I take a subset of the data for that particular judge, subset the cases decided in the 30-day window, and then assign the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for (i in 1:length(x$idnumber)) {
  e <- x$compdate[i]
  f <- e - 29
  # all cases decided by the same judge
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]
  # that judge's cases decided in the 30-day window ending on date e
  b <- a[a$compdate <= e & a$compdate >= f, ]
  x$caseload[i] <- length(b$idnumber)
}
It is working, but it is taking extremely long to complete. How can I optimize this or do it more easily? Sorry, I'm very new to R and to programming; I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort by judge, then by completion date.
x <- x[order(x$judge, x$compdate), ]
# Create a little rolling-window function: for each (sorted) date, count how
# many dates fall in the window (date - 30, date], i.e. the case itself plus
# any case in the preceding 29 days.
rolling.window <- function(y, window = 30) seq_along(y) - findInterval(y - window, y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
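To see what the helper computes, you can run it on a tiny, already-sorted toy vector (hypothetical input, not from the simulated data above):
# Completion days for one judge, already sorted. Each case counts itself
# plus any case completed in the preceding 29 days.
days <- c(1, 5, 20, 40, 45, 80)
rolling.window(days)
# [1] 1 2 3 2 3 1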
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[, lookbk := sapply(1:.N, function(i) {
  z <- compdate[i] - compdate[i:1]
  older <- which(z > 30)
  if (length(older)) min(older) - 1L else as(NA, 'integer')
}), by = judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[, wload := list(sapply(1:.N, function(i)
  nrun[i] - ifelse(is.na(lookbk[i]), 0, nrun[i - lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
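One possible final step, not shown above: ldt has one row per judge-day, so to get the per-case caseload variable the question asked for, the per-day workload can be carried back onto the case-level table with an update join (caseload is a new, hypothetical column name):
# Join the per-day workload back onto the case-level table.
DT[ldt, caseload := i.wload, on = c("judge", "compdate")]
head(DT)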
