Filtering data in R (complex)

I have a dataset with 7 million records.
I need to filter the data to only show about 9000 of these.
The first field, dmg, is effectively the primary key and takes the format 1-Apr-123456. There are about 12 occurrences of each dmg value.
Another column is O_Y and takes the value of 0 or 1. It is most often 0, but 1 on about 900 occasions.
I would like to return all the rows with the same dmg value where at least one of those records has an O_Y value of 1.

I recommend using data.table for this (fread in data.table will also be quite handy for reading in the large data set, as you say you have enough RAM).
I am not sure that the following is the best way to do this in data.table but, at least, it should get you started. Hopefully, someone else will come along and list the most idiomatic data.table way for this. But this is what I can think of right now:
Assuming your data.table is called DT and has two columns dmg and O_Y: use O_Y as the index key for DT and subset DT for O_Y == 1 (DT[.(1)] in data.table syntax), then find the corresponding dmg values. The unique set of these dmg values is your keys.with.ones. All this is done succinctly as follows:
setkey(DT, O_Y)
keys.with.ones <- unique(DT[.(1), dmg])
Next, we need to extract rows corresponding to these values of dmg. For this we need to change the key for DT to dmg and extract the rows corresponding to the keys above:
setkey(DT, dmg)
DT.filtered <- DT[.(keys.with.ones)]
And we are done. :)
Please refer to ?data.table to figure out a better method if possible and let us know.
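For what it's worth, a more idiomatic single-step form might look like this (a sketch, untested against your data; it assumes the same DT with columns dmg and O_Y):
# keep every dmg group in which at least one row has O_Y == 1;
# .SD is the subset of DT's rows for the current group
DT.filtered <- DT[, if (any(O_Y == 1)) .SD, by = dmg]
# or, avoiding grouped .SD overhead, an equivalent two-step version:
DT.filtered2 <- DT[dmg %in% DT[O_Y == 1, unique(dmg)]]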

Last up-to-x rows per group from DT

I have a data.table from which I want to pull out the last 10,000 lines on a per-group basis. Unfortunately, I have been getting inconsistent results depending on the method employed, so I am obviously not understanding the full picture. I have concerns about each of my methods.
The data is structured in such a way that I have a set of columns that I wish to group by, where I want to grab the entries corresponding with the last 10,000 POSIXct timestamps (if that many exist...otherwise return all). Entries are defined as unique by the combination of the grouping columns along with the timestamps, even though there are several other data columns. In the below example, my timestamp column is in ts and keycol1 and keycol2 are the fields that I'm grouping on. DT has 2,809,108 entries.
setkey(DT,keycol1,keycol2,ts)
DT[DT[,.I[.N-10000:.N], by=c("keycol1","keycol2")]$V1]
returns 1,181,256 entries. No warning or error is issued. My concern is what happens when .N-10000 is < 1 for a group. When I do a DT[-10:10] I receive the following error:
Error in [.data.table(DT, -10:10) : Item 1 of i is -10 and item
12 is 1. Cannot mix positives and negatives.
leading me to believe that the .I[.N-10000:.N] may not be working as intended.
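Indeed, a quick check of operator precedence (with a hypothetical group size, purely for illustration) suggests that is the culprit:
# `:` binds tighter than `-`, so .N - 10000:.N means .N - (10000:.N),
# not (.N - 10000):.N
N <- 15000           # stand-in for .N
head(N - 10000:N)    # 5000 4999 4998 4997 4996 4995 -- a descending run, not 5000:15000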
If I instead try to sort the timestamps backward and then use the strategy described by Jaap in his answer to this question
DT[DT[order(-ts), .I[1:10000], by=c("keycol1","keycol2")]$V1, nomatch=NULL]
returns 3,810,000 entries, some of which are entirely NA, suggesting that the nomatch parameter isn't being honored (nomatch=0 returns the same). Chaining a [!is.na(ts)] tells me that it returns 1,972,166 valid entries, which is more than the previous "solution." However, do the values of .I correspond to the row numbers of the original DT or of the reverse-sorted (within group) DT? That is, does the outer selection return the true matches, or will it actually return the first 10,000 entries per group rather than the last 10,000?
Okay, to resolve this question: can I have the key itself work backward?
setkey(DT,keycol1,keycol2,ts)
setorder(DT,keycol1,keycol2,-ts)
key(DT)
NULL
setkey(DT,keycol1,keycol2,-ts)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: -ts
That'd be a no then.
Can this be resolved by using .SD rather than .I?
DT[
  DT[order(-ts), .SD[1:10000], by=c("keycol1","keycol2")],
  nomatch=0, on=c("keycol1","keycol2","ts")
]
returns 1,972,166 entries. Although I'm fairly confident that these entries are the ones I want, it also results in a duplication of columns not part of the key or timestamp (as i.A, i.B, etc.). I think these are the same entries as in the .I[1:10000] example with order(-ts): if I store each result, delete the extra columns from the .SD version, run setkey(resultA, keycol1, keycol2, ts) on each, and then call identical(resultA, resultB), it returns
TRUE
Related threads:
Throw away first and last n rows
data.table - select first n rows within group
How to extract the first n rows per group?
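For reference, a guarded version of the first approach that sidesteps the precedence trap might look like this (a sketch against the same DT, with the column names above; untested):
setkey(DT, keycol1, keycol2, ts)
# parenthesize the lower bound and clamp it at 1 so short groups return all rows
DT[DT[, .I[max(1L, .N - 9999L):.N], by = c("keycol1", "keycol2")]$V1]
# equivalent and arguably clearer with tail(), which handles short groups itself
DT[DT[, tail(.I, 10000L), by = c("keycol1", "keycol2")]$V1]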

Subsetting Data Table / Counting rows

When I count the rows in my data table by store code using the following:
DailyProduct[, .N, by = ANA_Code]
I get 14 store codes with total rows of 120,237 - the same number of rows in DailyProduct, which is great!
If I then get a unique list of the store codes:
unique(DailyProduct$ANA_Code)
I get 21 store codes which is more than I got above but is actually the correct number.
If I convert the store code from numeric to a factor, everything shows as expected: I get 21 stores, and the row counts of each add up to 120,237.
This is also causing me a problem when I aggregate the data: the sales value is correct, but 7 of the store codes are missing.
Is there a fundamental difference in how data table treats numeric vs. factor when performing these operations?
I don't understand why this is happening so I can't provide an example as such so apologies for that.
This can happen if 'ANA_Code' is a big integer: reading it with fread can then produce incorrect output. One way around it is to load the bit64 library and then read the file with fread:
library(bit64)
library(data.table)
DailyProduct <- fread("yourfile.csv")
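A quick illustration of why this happens (the store codes here are made up; the point is their width):
library(bit64)
# 18-digit codes exceed the 2^53 precision limit of doubles, so two
# distinct codes can collapse to the same numeric value when read
# without integer64 support
x <- c("123456789012345678", "123456789012345679")
as.numeric(x[1]) == as.numeric(x[2])       # TRUE  - precision lost
as.integer64(x[1]) == as.integer64(x[2])   # FALSE - kept distinct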

Create new numeric columns from 1 string column

I'm a beginner. I have a dataset, taken from here, which consists of people's profiles with different attributes, profession being one of them. There are 12 professions: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown.
I'd like to apply k-NN to that dataset, so I'd like to spread the profession column into 12 new columns, assigning 1 to the person's corresponding profession and 0 to the other 11 professions.
I tried the foreach package and for loops, unsuccessfully. I haven't been able to make foreach work, and I don't know what to do next after the following code:
jobs <- data[,2]
jobs
for (job in jobs) {
  print(job)
  # No idea how to create the new columns here, based on if conditionals
}
What would be the best way to do this?
Thanks.
You can certainly solve the problem using a for loop, but may I suggest a solution that is more efficient in the long run: the reshape2 package (https://cran.r-project.org/web/packages/reshape2/).
I have the data from bank-full.csv read into R in the object bank. Next, the reshape2 package needs to be installed and loaded:
install.packages("reshape2")
library(reshape2)
The data can then be shaped into a format where observations are on rows and jobs on columns. An accessory id column is first added to the data:
bank$id<-1:nrow(bank)
Then, taking the columns 2 and 18 (job and id) from the data frame bank and casting them into the aforementioned form can be done as:
tmp<-dcast(bank[,c(2, 18)], id~job, length)
That should give a new data frame tmp, where each job has its own column. Since every id is present in the data only once, the length function used in dcast to aggregate the data puts just zeros and ones in every column.
Last, these new columns can be added to the original data set:
bank<-cbind(bank[,-18], tmp[,-1])
Negative subscripts inside the square brackets delete columns from the dataset, so this simultaneously lets you get rid of the id column.
Another, even more efficient way to do this is to use the function model.matrix:
bank2<-cbind(bank, model.matrix( ~ 0 + job, bank))
This should give you a data frame with each job as a new column. Note however that it changes the column names a bit (adds job to the beginning of the job columns).
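To make that concrete, here is a tiny self-contained illustration (a made-up mini data frame standing in for bank):
mini <- data.frame(job = c("admin.", "student", "admin.", "retired"))
cbind(mini, model.matrix(~ 0 + job, mini))
#       job jobadmin. jobretired jobstudent
# 1  admin.         1          0          0
# 2 student         0          0          1
# 3  admin.         1          0          0
# 4 retired         0          1          0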

R: Function: generate and save multiple matrices based on multiple conditions

I am a new R user and an inexperienced coder, and I have a data-handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50,000 rows. I want to generate and store, for every firm, a (class x year) matrix with class counts as the elements of the matrix. Each matrix would be automatically named something like firm.name and stored so that I can use it afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of the values in columns 4 and 5 (backward and forward citations).
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
# note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
# access the data for the first chick
mytable[,,1]
# turn the table object into a data.frame
as.data.frame(mytable)
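And if you want one named, reusable matrix per firm, a sketch along these lines (assuming your data frame is called df, with columns firm, year, and class) collects them in a list:
tab <- with(df, table(class, year, firm))
# one (class x year) count matrix per firm, accessible by firm name
firm.matrices <- lapply(dimnames(tab)$firm, function(f) tab[, , f])
names(firm.matrices) <- dimnames(tab)$firm
firm.matrices[["firm.name"]]   # e.g. the count matrix for a firm named "firm.name"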

Match values in each group of a data.table column to values in a vector

I recently started to use the data.table package to identify values in a table's column that conform to some conditions. Although I've managed to get most things done, I'm now stuck with this problem:
I have a data table, table1, in which the first column (labels) is a group ID and the second column, o.cell, is an integer. The key is on "labels".
I have another data table, table2, containing a single column: "cell".
Now, I'm trying to find, for each group in table1, the values from the column "o.cell" that are in the "cell" column of table2. table1 has some 400K rows divided into 800+ groups of unequal sizes. table2 has about 1.3M rows of unique cell numbers. Cell numbers in the "o.cell" column of table1 can be found in more than one group.
This seems like a simple task, but I can't find the right way to do it. Depending on the way I structure my call, it either gives me a different result than what I expect or it never completes and I have to end the R task because it's frozen (my machine has 24 GB RAM).
Here's an example of one of the "variant" of the calls I have tried:
overlap <- table1[, list(over.cell = o.cell[!is.na(o.cell) & o.cell %in% table2$cell]),
                  by = labels]
I'm pretty sure this is the wrong way to use data tables for this task, and on top of that I can't get the result I want.
I will greatly appreciate any help. Thanks.
Sounds like this is your setup:
dt1 = data.table(labels = c('a','b'), o.cell = 1:10)
dt2 = data.table(cell = 4:7)
And you simply want to do a merge:
setkey(dt1, o.cell)
dt1[dt2]
# o.cell labels
#1: 4 b
#2: 5 a
#3: 6 b
#4: 7 a
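If the goal is simply the rows of table1 whose o.cell appears in table2 (keeping the labels column intact), grouping isn't needed at all; with the toy tables above:
dt1[o.cell %in% dt2$cell]
#    labels o.cell
# 1:      b      4
# 2:      a      5
# 3:      b      6
# 4:      a      7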