Get all possible combinations on a large dataset in R

I have a large dataset with more than 10 million records and 20 variables. I need to get every possible combination of 11 of these 20 variables, and for each combination the frequency should also be displayed.
I have tried count() in the plyr package and the table() function, but neither can produce all possible combinations, since the number of combinations is very large (greater than 2^32) and the result is huge.
Assume the following dataset with 5 variables and 6 observations.
I want all possible combinations of the first three variables whose frequencies are greater than 0.
Is there any other function to achieve this? I am only interested in combinations whose frequency is non-zero.
Thanks!

OK, I think I have an idea of what you require. If you want the count of rows in your table by N categories, you can do so with the data.table package. It will give you the count of every combination that actually exists in the table (so all counts are non-zero). Simply list the required categories in the by argument:
library(data.table)

DT <- data.table(val  = rnorm(1e7),
                 cat1 = sample.int(10, 1e7, replace = TRUE),
                 cat2 = sample.int(10, 1e7, replace = TRUE),
                 cat3 = sample.int(10, 1e7, replace = TRUE))
DT_count <- DT[, .N, by = .(cat1, cat2, cat3)]
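The same pattern on a small, made-up dataset (hypothetical columns v1, v2, v3) shows the output shape: one row per combination that actually occurs, with its frequency in N, so every listed count is non-zero:

```r
library(data.table)

# small made-up dataset: 6 observations, 3 categorical variables
DT <- data.table(v1 = c("a", "a", "b", "a", "b", "a"),
                 v2 = c(1, 1, 2, 1, 2, 2),
                 v3 = c("x", "x", "x", "y", "x", "x"))

# frequency of every combination of v1, v2, v3 that occurs in the data
DT_count <- DT[, .N, by = .(v1, v2, v3)]
DT_count   # 4 rows: only the combinations present, each with N >= 1
```

Combinations that never occur simply do not appear in the result, which is exactly the non-zero-frequency behaviour the question asks for.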

Related

Filtering iteratively through a data table in R

I have a data table with 3 variables plus 1 frequency column, and I wish to add a proportion column.
Variable 1 has 4 unique values.
Variable 2 has 5,
and Variable 3 has 2.
The frequency column captures the number of times each combination occurs.
But if I apply prop.table to it, it calculates proportions over the whole data.table, when I really want the proportions within the subsets of Variable 2.
I thought of iterating, but that seems complicated with tables.
You could use the aggregate function (or tapply) to sum the counts within the categories of Variable 2, then use prop.table or similar on the result.
If you want to use the tidyverse instead of base R, this would be a group_by followed by summarise to sum within each group, then prop.table again to calculate the proportions.
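A minimal base-R sketch with made-up column names (var1, var2, var3, freq); ave computes each row's Variable 2 group total in place, which is a per-row shortcut for the aggregate-then-divide route described above:

```r
# made-up frequency table: every combination of the three variables
df <- expand.grid(var1 = 1:4, var2 = 1:5, var3 = 1:2)
df$freq <- seq_len(nrow(df))

# divide each frequency by the total of its var2 group
df$prop <- df$freq / ave(df$freq, df$var2, FUN = sum)

# check: proportions now sum to 1 within every var2 group
aggregate(prop ~ var2, data = df, FUN = sum)
```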

Combining mutate and loop

I am currently trying to automate a process that seems simple, but I have to repeat the process 800 times. I have 2 datasets, each with 8 columns. One column is streamflow, and the other is a list of thresholds (each row has a different threshold). I want to know the number of days that the streamflow falls below the threshold. So far I've done this using mutate and ifelse.
lf1 <- Daily_average_Q %>%
  mutate(lf1 = ifelse(Q1B < Thresholds$B1[1], '1', '0'))
This gives me what I want for 1 threshold, but I have 100 thresholds that I need to use. I also need to do this over 8 sites, so I can't afford to do this 800 separate times. I just want the row number in Thresholds$B1[row#] to change automatically each time. I've tried looping with "for" but I can't figure out how to mutate and loop at the same time.
Thanks so much for any help!!
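A sketch of one way the repetition could be avoided, reusing the names from the question (Daily_average_Q$Q1B and Thresholds$B1) but with made-up data: sapply builds one indicator column per threshold, and colSums then gives the number of days below each threshold directly:

```r
# made-up stand-ins for the question's objects
Daily_average_Q <- data.frame(Q1B = c(5, 12, 3, 20, 8, 1))
Thresholds      <- data.frame(B1 = c(4, 10, 25))

# one 0/1 column per threshold: 1 when streamflow is below it
below <- sapply(Thresholds$B1,
                function(th) as.integer(Daily_average_Q$Q1B < th))
colnames(below) <- paste0("lf", seq_along(Thresholds$B1))

# number of days below each threshold
days_below <- colSums(below)
days_below   # 2, 4, 6 for thresholds 4, 10, 25
```

Wrapping this in an outer loop over the 8 sites' column pairs would cover all 800 combinations without writing them out by hand.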

Conditionally create new column in R

I would like to create a new column in my dataframe that assigns a categorical value to each observation based on a condition.
In detail, I have a column that contains timestamps for all observations, and the rows are ordered ascending by timestamp.
Now, I'd like to calculate the difference between each pair of consecutive timestamps, and whenever it exceeds a certain threshold the factor should be increased by 1 (see Desired Output).
Desired Output
I tried to solve it with a for loop, but that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I cannot make it work for my dataset.
Thanks for your help!
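The usual vectorised pattern for this kind of grouping, sketched with made-up timestamps and the 7200-second threshold taken from the attempt above: flag the rows whose gap to the previous row exceeds the threshold, then take the cumulative sum of the flags to get the group factor.

```r
# made-up timestamps in seconds, already sorted ascending
df <- data.frame(timestamp = c(0, 100, 500, 9000, 9300, 20000))

threshold <- 7200
gap <- c(0, diff(df$timestamp))      # gap to the previous observation
df$group <- cumsum(gap > threshold)  # +1 at every gap above the threshold
df$group   # 0 0 0 1 1 2
```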

Split a dataframe into any number of smaller dataframes with no more than N number of rows

I have a number of dataframes, each with different numbers of rows. I want to break them all into smaller dataframes that have no more than 50 rows each, for example.
So, if I had a dataframe with 107 rows, I want to output the following:
A dataframe containing rows 1-50
A dataframe containing rows 51-100
A dataframe containing rows 101-107
I have been reading many examples using the split() function, but I have not found any usage of split(), or any other solution, that does not pre-define the number of dataframes to split into, scramble the order of the data, or introduce other problems.
This seems like such a simple task that I am surprised that I have not been able to find a solution.
Try:
split(df, (seq_len(nrow(df)) - 1) %/% 50)
What do the first 50 rows have in common? If you take the integer division (%/%) of the row index less one by 50, they all give 0 as the result. As you can guess, rows 51-100 give 1, and so on. So (seq_len(nrow(df)) - 1) %/% 50 computes the group each row should be split into.
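On the 107-row example from the question, this produces exactly the three pieces described:

```r
# 107-row demo dataframe
df <- data.frame(x = seq_len(107))

chunks <- split(df, (seq_len(nrow(df)) - 1) %/% 50)

length(chunks)        # 3
sapply(chunks, nrow)  # 50 50 7
```

Row order is preserved within each piece, so the third chunk holds rows 101-107.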

Count the occurrences of just one value in R

I have 34 subsets with a bunch of variables, and I am making a new dataframe with summary information about each variable across the subsets.
- Example: A10, T2 and V2 are all subsets with ~10 variables and 14 observations where one variable is population.
I want my new dataframe to have a column which says how many times per subset variable 2 hit zero.
I've looked at a bunch of different count functions but they all seem to make separate tables and count the occurrences of all variables. I'm not interested in how many times each unique value shows up because most of the values are unique, I just want to know how many times population hit zero for each subset of 14 observations.
I realize this is probably a simple thing to do but I'm not very good at creating my own solutions from other R code yet. Thanks for the help.
I've done something similar with a different dataset where I counted how many times 'NA' occurred in a vector where all the other values were numerical. For that I used:
na.tmin<- c(sum(is.na(s1997$TMIN)), sum(is.na(s1998$TMIN)), sum(is.na(s1999$TMIN))...
This created a column (na.tmin) holding the number of times each subset recorded NA instead of a number. I'd like to count the number of times the value 0 occurred, but is.0 is of course not a function, because 0 is numeric. Is there a function that will count the number of times a specific value shows up? If not, should I use the count-occurrences-of-unique-values approach?
Perhaps:
sum(abs(s1997$TMIN) < 0.00000001)
It's safer to use a tolerance value unless you are sure that your value is an integer. See R FAQ 7.31.
sum(abs(pi - (355/113 + seq(-0.001, 0.001, length = 1000))) < 0.00001)
[1] 10
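Applied per subset, the same counting idea can be mapped over a list; a sketch with made-up stand-ins for the subsets A10, T2 and V2 (exact comparison with == 0 is safe here only on the assumption that population holds whole numbers; otherwise use a tolerance as above):

```r
# made-up subsets, 14 observations each, with a population variable
A10 <- data.frame(population = c(0, 5, 3, 0, rep(2, 10)))
T2  <- data.frame(population = c(1, 0, rep(4, 12)))
V2  <- data.frame(population = rep(7, 14))

# zeros per subset, as one named vector for the summary dataframe
zero.pop <- sapply(list(A10 = A10, T2 = T2, V2 = V2),
                   function(d) sum(d$population == 0))
zero.pop   # A10: 2, T2: 1, V2: 0
```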
