I had asked a question here. I had a simple dataframe, for which I was attempting to remove duplicates. Very basic question.
Akrun gave a great answer, which was to use this line:
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
I went ahead and did this, which worked great on the dummy problem. But I have 3.5 million records that I'm trying to filter.
In an attempt to see where the bottleneck is, I broke the code into steps.
step1 <- apply(df1[1:2], 1, sort)
step2 <- t(step1)
step3 <- data.frame(step2, df1$location)
step4 <- !duplicated(step3)
final <- df1[step4, ,]
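For reference, the time each step takes can be checked with system.time() (a sketch only; df1 stands in for the full 3.5-million-row data frame):
step1_time <- system.time(step1 <- apply(df1[1:2], 1, sort))
step2_time <- system.time(step2 <- t(step1))
step3_time <- system.time(step3 <- data.frame(step2, df1$location))
step4_time <- system.time(step4 <- !duplicated(step3))
rbind(step1_time, step2_time, step3_time, step4_time)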
Step 1 took quite a long time, but it wasn't the worst offender.
Step 2, however, is clearly the culprit.
So I'm in the unfortunate situation where I'm looking for a way to transpose 3.5 million rows in R. (Or maybe not in R. Hopefully there is some way to do it somewhere).
Looking around, I saw a few ideas:
install the WGCNA library, which has a transposeBigData function. Unfortunately this package is no longer being maintained, and I can't install all the dependencies.
write the data to a csv, then read it in line by line, and transpose each line one at a time. For me, even writing the file ran overnight without completing.
This is really strange. I just want to remove duplicates. For some reason, I have to transpose a dataframe in this process. But I can't transpose a dataframe this large.
So I need a better strategy for either removing duplicates, or for transposing. Does anyone have any ideas on this?
By the way, I'm using Ubuntu 14.04, with 15.6 GiB RAM, for which cat /proc/cpuinfo returns
model name : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
cpu MHz : 1200.000
cache size : 6144 KB
Thanks.
df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10), location=c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'), comment=c('cold', 'freezing!', 'nice', 'boring', 'cold'))
A faster option would be using pmin/pmax with data.table
library(data.table)
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2)))]
# id1 id2 location comment
#1: 1 2 Alaska cold
#2: 3 4 California nice
#3: 4 5 Kansas boring
#4: 9 10 Alaska cold
If 'location' also needs to be included to find the unique rows:
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2), location))]
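A rough way to check the speed difference on the real data (a sketch only; microbenchmark is an extra package, and df1 with columns id1, id2, location is assumed to be the full data frame):
library(data.table)
library(microbenchmark)
dt1 <- as.data.table(df1)
microbenchmark(
  base = df1[!duplicated(data.frame(t(apply(df1[1:2], 1, sort)), df1$location)), ],
  dt   = dt1[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2), location))],
  times = 5L
)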
So after struggling with this for most of the weekend (grateful for plenty of selfless help from the illustrious @akrun), I realized that I would need to go about this in a completely different manner.
Since the dataframe was simply too large to process in memory, I ended up using a strategy where I pasted together a (string) key and column-bound it onto the dataframe. Next, I split each key into characters, sorted them, and collapsed them back into a string. From there I could use which to get the index of the rows that contained non-duplicate keys, and with that I could filter my dataframe.
df_with_key <- within(df, key <- paste(boxer1, boxer2, date, location, sep=""))
# sort the characters of each key so that reversed pairs collapse to the same string
strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
df_with_key$key <- strSort(df_with_key$key)
# keep only the first occurrence of each key
idx <- which(!duplicated(df_with_key$key))
final_df <- df[idx,]
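To illustrate what the key ends up looking like, here is a toy example with made-up rows (my real data has the boxer1, boxer2, date and location columns used above):
df_toy <- data.frame(boxer1   = c("Ali", "Frazier"),
                     boxer2   = c("Frazier", "Ali"),
                     date     = c("1971-03-08", "1971-03-08"),
                     location = c("New York", "New York"))
toy_with_key <- within(df_toy, key <- paste(boxer1, boxer2, date, location, sep=""))
toy_with_key$key <- strSort(toy_with_key$key)
# both rows now carry the identical sorted key, so duplicated() flags the second one
toy_with_key$key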
Related
I have two databases. The first one has about 70k rows with 3 columns. The second one has 790k rows with 2 columns. Both databases have a common variable grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work because the grantee_name values do not match perfectly; there are different spellings, etc. So, I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta ("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Totally new to R. This is resulting in an error:
cannot allocate vector of size 375GB (with the big database of course). A sample of 100 rows from forfuzzy always works. So, I thought of iterating over a list of 100 rows at a time.
I have tried the following:
n=100
lst = split(forfuzzy, cumsum((1:nrow(forfuzzy)-1)%%n==0))
df<-as.data.frame(lapply(lst, function(df_)
{
(stringdist_inner_join(df_, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
}
)%>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error happens even though I have tried a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with the option mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working after fixing a novice error in the by="grantee_name" argument. However, it is taking forever and I am not sure it will finish. A sample list from forfuzzy of 100 rows, with n=10 (so 10 lists with 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply)
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
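Putting the pieces together, a sketch might look like this (the chunk size of 100 is arbitrary):
library(dplyr)
library(purrr)
library(fuzzyjoin)
# split forfuzzy into a list of ~100-row data frames
chunks <- split(forfuzzy, ceiling(seq_len(nrow(forfuzzy)) / 100))
# fuzzy-join each chunk against filings and row-bind the results
result <- map_dfr(chunks, ~ stringdist_inner_join(.x, filings,
                                                  by = "grantee_name", method = "jw",
                                                  p = 0.1, max_dist = 0.1,
                                                  distance_col = "distance"))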
I haven't used foreach before but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
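If you do want to try foreach, a rough sketch could look like the following (assumptions on my part: zz1 has already been split into a list of chunks, which I call zz1_chunks here, and zz2 is the filings data):
library(foreach)
library(doParallel)
library(fuzzyjoin)
library(dplyr)
registerDoParallel(cores = 3)
res <- foreach(x = zz1_chunks, .combine = bind_rows,
               .packages = c("fuzzyjoin", "dplyr")) %dopar% {
  # each worker fuzzy-joins one chunk against zz2
  stringdist_inner_join(x, zz2, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance")
}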
I have a huge .csv file; its size is ~1.4 GB and reading it with read.csv takes a long time. There are several variables in that file, and all I want is to extract data for a few of those variables in a certain column.
For example, suppose ABC.csv is my file and it looks something like this:
ABC.csv
Date Variables Val
2017-11-01 X 23
2017-11-01 A 2
2017-11-01 B 0.5
............................
2017-11-02 X 20
2017-11-02 C 40
............................
2017-11-03 D 33
2017-11-03 X 22
............................
............................
So, here the variable of interest is X, and while reading this file I want df$Variables to be scanned so that only the rows with the string X in that column are read. My new data frame will then look something like this:
> df
Date Variables Val
2017-11-01 X 23
2017-11-02 X 20
.........................
.........................
Any help will be appreciated. Thank you in advance.
Check out the LaF package; it allows you to read very large text files in blocks, so you don't have to read the entire file into memory.
library(LaF)
data_model <- detect_dm_csv("yourFile.csv", skip = 1) # detects the file structure
dat <- laf_open(data_model) # opens connection to the file
block_list <- lapply(seq(1,100000,1000), function(row_num){
goto(dat, row_num)
data_block <- next_block(dat, nrows = 1000) # reads data blocks of 1000 rows
data_block <- data_block[data_block$Variables == "X",]
return(data_block)
})
your_df <- do.call("rbind", block_list)
Admittedly, the package sometimes feels a bit bulky, and in some situations I had to find small hacks to get my results (you might have to adapt my solution for your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
Just wondering if doing this works. It worked for my code, but I am not sure whether it first reads in the entire data and then subsets, or whether it only reads the part of the file where Variables == 'X'.
temp <- fread('dat.csv')[Variables == 'X']
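A related trick (an assumption on my part: this needs a unix-like shell and a file that really is comma-separated, with 'X' appearing between commas only in the Variables column) is to let grep filter the lines before fread parses anything, so non-matching rows are never read into R:
library(data.table)
# grep keeps only lines containing ",X,"; the header line is dropped by grep,
# so the column names are supplied explicitly
temp <- fread(cmd = "grep ',X,' dat.csv", header = FALSE,
              col.names = c("Date", "Variables", "Val"))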
I would say that most of the time you can probably just read in the entire file, and then subset within R:
df <- read.csv(file="path/to/your/file.csv", header=TRUE)
df.x <- df[df$Variables=='x', ]
R operates completely in memory, so an exception to what I said above might occur if you have a file whose total size is so massive that it cannot fit into memory, but for some reason the subset of interest can.
This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case zip market
1 44485 NA
2 44488 NA
3 43210 NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market zip
1 44485
1 44486
1 44488
... ... (100 zips in market 1)
2 43210
2 43211
... ... (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the market column in alldata, you can first strip it off and then merge alldata and zipcodes based on the zip column using merge:
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
Another option that worked for me and is very simple:
alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
select(-market) %>%
left_join(zipcodes, by="zip")
which, on my machine, is roughly the same performance as lookup.
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2
I have a panel data set which looks like this
(cut down to only the part relevant to my question):
Persno 122 122 122 333 333 333 333 333 444 444
Income 1500 1500 2000 2000 2100 2500 2500 1500 2000 2200
year 1990 1991 1992 1990 1991 1992 1993 1994 1992 1993
Now I would like to compute, for every row (Persno), the years of work experience at the beginning of the year. I use ddply:
hilf3 <- ddply(data, .(Persno), summarize, Bgwork = 1:(max(year) - min(year)))
To produce output looking like this:
Workexperience: 1 2 3 1 2 3 4 5 1 2
Now I want to merge the ddply results to my original panel data:
data<-(merge(data,hilf3,by.x="Persno",by.y= "Persno"))
The panel data set is very large. The code stops because of a memory size error.
Error message:
1: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
What should I do?
Re-reading your question, I think you don't actually want to use merge here at all. Just sort your original data frame and bind the Bgwork column from hilf3 onto it. Also, your ddply call could perhaps result in a 1:0 sequence, which is most likely not what you want. Try:
data = data[order(data$Persno, data$year),]
hilf3 = ddply(data, .(Persno), summarize, Bgwork=(year - min(year) + 1))
stopifnot(nrow(data) == nrow(hilf3))
stopifnot(all(data$Persno == hilf3$Persno))
data$Bgwork = hilf3$Bgwork
Well, perhaps the surest way of fixing this is to get more memory. However, this isn't always an option. What you can do depends somewhat on your platform. On Windows, check the result of memory.size() and compare it to your available RAM. If the memory size is lower than the RAM, you can increase it. This is not an option on Linux, as by default it will already show all of your memory.
Another issue that can complicate matters is whether you are running a 32-bit or 64-bit system, as 32-bit Windows can only address up to a certain amount of RAM (2-4 GB) depending on settings. This is not an issue if you are using 64-bit Windows 7, which can address far more memory.
A more practical solution is to eliminate all unnecessary objects from your workspace before performing the merge. You should run gc() to see how much memory you have and are using, and also to free any objects which no longer have references. Personally, I would run your ddply() from a script, save the resulting dataframe as a CSV file, close your workspace, reopen it, and then perform the merge.
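For instance, a rough outline of that save-and-restart workflow might be (the file name is just an example):
# session 1: after the ddply step, write the summary out and free the memory it used
write.csv(hilf3, "hilf3.csv", row.names = FALSE)
rm(hilf3); gc()
# session 2 (fresh workspace with only the original panel data loaded): read it back and merge
hilf3 <- read.csv("hilf3.csv")
data <- merge(data, hilf3, by = "Persno")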
Finally, the worst possible option (but one which requires a whole lot less memory) is to create a new dataframe and use the subsetting commands in R to copy the columns you want over, one by one. I really don't recommend this as it is tiresome and error prone, but I have had to do it once when there was no way to complete my analysis otherwise (I ended up investing in a new computer with more RAM shortly afterwards).
Hope this helps.
If you need to merge large data frames in R, one good option is to do it in pieces of, say, 10,000 rows. If you're merging data frames x and y, loop over 10,000-row pieces of x, merge (or rather use plyr::join) each piece with y, and immediately append the results to a single csv file. After all pieces have been merged and written to file, read that csv file back in. This is very memory-efficient with proper use of logical index vectors and well-placed rm and gc calls. It's not fast, though.
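A sketch of that idea (x, y, the join column "key" and the file name are all placeholders):
library(plyr)
chunk_size <- 10000
out_file <- "merged_pieces.csv"
starts <- seq(1, nrow(x), by = chunk_size)
for (s in starts) {
  piece <- x[s:min(s + chunk_size - 1, nrow(x)), ]
  res <- join(piece, y, by = "key", type = "left")
  # append each merged piece to one csv; write the header only for the first piece
  write.table(res, out_file, sep = ",", row.names = FALSE,
              col.names = (s == 1), append = (s != 1))
  rm(piece, res); gc()
}
merged <- read.csv(out_file)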
Since this question was posted, the data.table package has provided a re-implementation of data frames and a merge function that I have found to be much more memory-efficient than R's default. Converting the default data frames to data tables with as.data.table may avoid memory issues.
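For example, a minimal sketch of the data.table route (again, x, y and "key" are placeholders):
library(data.table)
xdt <- as.data.table(x)
ydt <- as.data.table(y)
merged <- merge(xdt, ydt, by = "key", all.x = TRUE)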
Does anyone have any good thoughts on how to code complex tabulations in R?
I am afraid I might be a little vague on this, but I want to set up a script to create a bunch of tables of a complexity analogous to the Statistical Abstract of the United States.
e.g.: http://www.census.gov/compendia/statab/tables/09s0015.pdf
And I would like to avoid a whole bunch of rbind and cbind statements.
In SAS, I have heard, there is a table creation specification language; I was wondering if there was something of similar power for R?
Thanks!
It looks like you want to apply a number of different calculations to some data, grouping it by one field (in the example, by state)?
There are many ways to do this. See this related question.
You could use Hadley Wickham's reshape package (see reshape homepage). For instance, if you wanted the mean, sum, and count functions applied to some data grouped by a value (this is meaningless, but it uses the airquality data from reshape):
> library(reshape)
> names(airquality) <- tolower(names(airquality))
> # melt the data to just include month and temp
> aqm <- melt(airquality, id="month", measure="temp", na.rm=TRUE)
> # cast by month with the various relevant functions
> cast(aqm, month ~ ., function(x) c(mean(x),sum(x),length(x)))
month X1 X2 X3
1 5 66 2032 31
2 6 79 2373 30
3 7 84 2601 31
4 8 84 2603 31
5 9 77 2307 30
Or you can use the by() function, where the index represents the states. In your case, rather than applying one function (e.g. mean), you can apply your own function that does multiple tasks (depending on your needs), for instance function(x) { c(mean(x), length(x)) }. Then run do.call("rbind", ...) (for instance) on the output.
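For example, using the airquality data again just to show the pattern (month plays the role of your state variable):
names(airquality) <- tolower(names(airquality))
# one summary vector per group, then bind the groups into a matrix
res <- by(airquality$temp, airquality$month,
          function(x) c(mean = mean(x, na.rm = TRUE), n = length(x)))
do.call("rbind", res)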
Also, you might give some consideration to using a reporting package such as Sweave (with xtable) or Jeffrey Horner's brew package. There is a great post on the learnr blog about creating repetitive reports that shows how to use it.
Another option is the plyr package.
library(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, "month", function(x){
with(x, c(meantemp = mean(temp), maxtemp = max(temp), nonsense = max(temp) - min(solar.r)))
})
Here is an interesting blog posting on this topic. The author tries to create a report analogous to the United Nations' World Population Prospects: The 2008 Revision report.
Hope that helps,
Charlie