R: Warning when creating a (long) list of dummies

A dummy column for a column c and a given value x equals 1 if c==x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of one's choice, as the last dummy column adds no information beyond the already existing dummy columns.
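For example (just a toy illustration of the definition, not from my actual data):
col <- c("a", "b", "a")
as.integer(col == "b")  # 0 1 0 -- the dummy column for value "b"; a dummy for "a" would add nothing new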
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep = '_') # gives us nice d_value names for the columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
This code reliably worked for previous columns, which had fewer unique values. firm, however, has many more:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this, or of the relevance of truelength.

Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate the required number of columns in your data.table, choosing a number larger than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long table (see EAV). Then you only need one column per data type.
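For illustration, a rough sketch of that long (EAV-style) layout with data.table::melt; the "id" column name below is only a placeholder for whatever identifies a row in myDataTable:
# sketch: one row per (id, dummy) pair instead of ~3000 dummy columns
long <- melt(myDataTable, id.vars = "id", measure.vars = cols,
             variable.name = "dummy", value.name = "value")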

Related

Indexing dataframes in R

Good day,
I don't understand a topic here; it's like it works, but I can't understand why.
I have this dataset:
# planets_df is pre-loaded in your workspace
# Use order() to create positions
positions <- order(planets_df$diameter)
positions
# Use positions to sort planets_df
planets_df[positions,]
I don't understand why, if you take the column diameter and want to order by it, you put the result into the row index of the data frame. To me indexing should be [rows, columns], but here something built from a column goes into the row position and it works; I really don't get that. Why isn't it planets_df[, positions]?
The exercise is solved, I just don't get it. It is a DataCamp exercise, by the way.
Sorry if my English is wrong, it is not my native language.
I believe that I have created an example that matches your description. For the mtcars data set, which is pre-loaded in any R session, we can sort based on the variable mpg.
The function order returns the row indices sorted by mpg in this case. The ordering variable stores those row indices, i.e. the order in which the rows should be presented.
ordering <- order(mtcars$mpg)
This next step indicates that we want the rows of mtcars as specified by ordering. Essentially, ordering is the order of the rows we want, so we pass that object to the row portion of the call to mtcars.
mtcars[ordering,]
If we instead passed a vector of column indices in the column position, we would be reordering the columns of mtcars rather than the rows.
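For example (a quick illustration; order(names(mtcars)) just happens to give a valid permutation of column indices):
# reorder the *columns* of mtcars, here alphabetically by column name
col_order <- order(names(mtcars))
mtcars[, col_order]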

Counting number of rows where a value occurs at least once within many columns

I updated the question with pseudocode to better explain what I would like to do.
I have a data.frame named df_sel, with 5064 rows and 215 columns.
Some of the columns (~80) contain integers that are unique identifiers for a specific trait (medications). These columns are named "meds_0_1", "meds_0_2", "meds_0_3" etc., as well as "meds_1_1", "meds_1_2", "meds_1_3". Each column may or may not contain any of the integer values I am looking for.
Of the specific integer values to look for, some are grouped under one type of medication but coded as specific brand names.
metformin = 1140884600 # not grouped
sulfonylurea = c(1140874718, 1140874724, 1140874726) # grouped
If it were possible to look up a group of medications, e.g. as a vector like the one above, that would be helpful.
I would like to do this:
IF [a specific row]
CONTAINS [the single integer value of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_METFORMIN = 1 ELSE A_NEW_VARIABLE_METFORMIN = 0
and correspondingly
IF [a specific row]
CONTAINS [any of multiple integer values of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_SULFONYLUREA = 1 ELSE A_NEW_VARIABLE_SULFONYLUREA = 0
I have managed to create a vector based on column names:
column_names <- names(df_sel) %>% str_subset('^meds_0')
But I haven't gotten any further, despite some suggestions below.
I hope you understand better what I am trying to do.
As for the selection of the columns, you could do this by first extracting the names in the way you are doing with a regex, and then using select:
library(dplyr)    # for %>% and select()
library(stringr)
column_names <- names(df_sel) %>%
  str_subset('^meds_0')
relevant_df <- df_sel %>%
  select(column_names)
I didn't quite get the structure of your variables (whether they are integers, logicals, etc.), so I'm not sure how to continue, but it would probably involve something like summing across all the columns and then flagging the rows whose sum is not 0, like:
library(tibble)   # for add_column()
meds_taken <- rowSums(relevant_df)
df_sel_med_count <- df_sel %>%
  add_column(meds_taken)
At this point you should have your initial df with the relevant data in one column, and you can summarize by subject, medication or whatever in any way you want.
If this is not enough, please edit your question providing a relevant sample of your data (you can do this with the dput function) and I'll edit this answer to add more detail.
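For example, if the goal really is just the 0/1 flags described in the question, a sketch along these lines might work (untested without a data sample; has_code and the two new column names are placeholders, not from the question):
# flag rows where any of the selected meds_0_* columns contains one of the codes
metformin    <- 1140884600
sulfonylurea <- c(1140874718, 1140874724, 1140874726)
has_code <- function(codes) as.integer(rowSums(sapply(relevant_df, `%in%`, table = codes)) > 0)
df_sel$metformin_0    <- has_code(metformin)
df_sel$sulfonylurea_0 <- has_code(sulfonylurea)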
First, I would like to start off by recommending Bioconductor for R libraries, as it sounds like you may be studying biological data. Now to your question.
Although the tidyverse is the most widely accepted and 'easy' approach, I would recommend in this instance using lapply, as it is extremely fast. From a programming standpoint your code becomes a simple boolean check, as you stated, but I think we can go a little further. Using the built-in mtcars data:
data(mtcars)
head(mtcars, 6)
target <- 6
# TRUEs and FALSEs for each row, one logical vector per column
rows <- lapply(mtcars, function(x) x %in% target)
# number of TRUEs per column, and which columns have more than 0 TRUEs
column_sums <- unlist(lapply(rows, function(x) sum(x, na.rm = TRUE)))
which(column_sums > 0)
This will work with other data types with a few tweaks here and there.
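For instance, to get the per-row 0/1 flag the question asks for instead of per-column counts, one such tweak could be (a sketch, not verified against the real data):
# OR the column-wise logical vectors together into one row-wise indicator
row_hit <- Reduce(`|`, rows)
new_variable <- as.integer(row_hit)  # 1 if the target occurs anywhere in the row, else 0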

Counting unique subsets of data efficiently

I have a relatively large dataset that I wouldn't qualify as 'big data'. It's around 3 to 5 million rows; because of the size I'm using the data.table library to do analysis.
The composition of the dataset (named df, a data.table) can essentially be broken down into:
n identifier fields, hereafter ID_1, ID_2, ..., ID_n, some of which are numeric and some of which are character vectors.
m categorical variables, hereafter C_1, ..., C_m, all of which are character vectors and have very few distinct values apiece (2 in one, 3 in another, etc.)
2 measurement variables, M_1 and M_2, both numeric.
A subset of data is identified by ID_1 through ID_n, has a full set of all values of C_1 through C_m, and has a range of values of M_1 and M_2. A subset of data consists of 126 records.
I need to accurately count the unique sets of data and, because of the size of the data, I would like to know if there already exists a much more efficient way to do this. (Why roll my own if other, much smarter, people have done it already?)
I've already done a fair amount of Google work to arrive at the method below.
What I've done is to use the ht package (https://github.com/nfultz/ht) so that I can use a data frame as a hash key (using digest in the background).
I paste together the ID fields to create a new, single column, hereafter referred to as ID, which resembles...
ID = "ID1:ID2:...:IDn"
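In code, that step looks roughly like this (id_cols is a placeholder for the vector of ID column names):
# id_cols stands in for c("ID_1", "ID_2", ..., "ID_n")
df[, ID := do.call(paste, c(.SD, sep = ":")), .SDcols = id_cols]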
Then I loop through each unique set of identifiers and then, using just the subset data frame of C_1 through C_m, M_1, and M_2 (126 rows of data), hash the value / increment the hash.
Afterwards I'm taking that information and putting it back into the data frame.
# Create the hash structure
datasets <- ht()
# Declare the fields which will denote a subset of data
uniqueFields <- c("C_1",..."C_m","M_1","M_2")
# Create the REPETITIONS field in the original data.table structure
df[,REPETITIONS := 0]
# Create the KEY field in the original data.table structure
df[,KEY := ""]
# Use the updateHash function to fill datasets
updateHash <- function(val) {
  key <- df[ID == val, uniqueFields, with = FALSE]
  if (is.null(datasets[key])) {
    # If this unique set of data doesn't already exist in datasets...
    datasets[key] <- list(val)
  } else {
    # If this unique set of data does already exist in datasets...
    datasets[key] <- append(datasets[key], val)
  }
}
# Loop through the ID fields. I've explored using apply;
# this vector is around 10-15K long. This version works.
for (id in unique(df$ID)) {
  updateHash(id)
}
# Now update the original data.table structure so the analysis can
# be done. Again, I could use the R apply family; this version works.
for (dataset in ls(datasets)) {
  IDS <- unlist(datasets[[dataset]]$val)
  # For this set of information, how many times was it repeated?
  df[ID %in% IDS, REPETITIONS := length(datasets[[dataset]]$val)]
  # For this set, what is a unique identifier?
  df[ID %in% IDS, KEY := dataset]
}
This does what I want, though not blindingly fast. I now have the capability to present some neat analysis revolving around variability in datasets to people who care about it. I don't like that it's hacky and, one way or another, I'm going to clean this up and make it better. Before I do that, I want to do my final due diligence and see if it's simply my Google Fu failing me.
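For what it's worth, one direction that avoids the explicit loops might be to group directly in data.table and build one signature per ID (a rough, untested sketch assuming ID and uniqueFields as defined above; the digest package is an extra dependency here, not part of the original code):
library(digest)
# order rows consistently within each ID so identical subsets hash identically
setkeyv(df, c("ID", uniqueFields))
sig <- df[, .(KEY = digest(as.list(.SD))), by = ID, .SDcols = uniqueFields]
sig[, REPETITIONS := .N, by = KEY]
# copy the signature and the repetition count back onto the original rows
df[sig, on = "ID", `:=`(KEY = i.KEY, REPETITIONS = i.REPETITIONS)]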

R: data.table, most efficient row-wise normalization

This code normalizes each value in each row (all values end up between -1 and 1).
dt <- setDT(knime.in)
df <- as.data.frame(t(apply(dt[, -1], 1, function(x) x / sum(x))))
df1 <- cbind(knime.in$Majors_Final, df)
BUT
It is not dynamic. The code "knows" that the string categorical variable is in column one and removes it before running the calculations.
It seems old school, and I suspect it does not make full use of data.table's by-reference memory handling.
QUESTIONS
How do I achieve the row-wise normalization with the most memory-efficient data.table code?
How do I exclude all is.character() columns (or include only is.numeric() columns) if I do not know their position or name?
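A sketch of one possible direction for both questions (assuming knime.in is available as in the snippet above):
library(data.table)
dt <- as.data.table(knime.in)
# pick the numeric columns by type rather than by position
num_cols <- names(dt)[sapply(dt, is.numeric)]
# divide each numeric value by its row sum, updating the columns by reference
dt[, (num_cols) := .SD / rowSums(.SD), .SDcols = num_cols]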

Sample a subset of dataframe by group, with sample size equal to another subset of the dataframe

Here's my hypothetical data frame:
location <- as.factor(rep(c("town1", "town2", "town3", "town4", "town5"), 100))
visited <- as.factor(rbinom(500, 1, .4)) # 'Yes or No' variable
variable <- rnorm(500, 10, 2)
id <- 1:500
DF <- data.frame(id, location, visited, variable)
I want to create a new data frame where the number of 0's and 1's are equal for each location. I want to accomplish this by taking a random sample of the 0's for each location (since there are more 0's than 1's).
I found this solution to sample by group:
library(plyr)
ddply(DF[DF$visited=="0",],.(location),function(x) x[sample(nrow(x),size=5),])
I entered '5' for the size argument so the code would run, but I can't figure out how to set the 'size' argument equal to the number of observations where DF$visited==1.
I suspect the answer could be in other questions I've reviewed, but they've been a bit too advanced for me to implement.
Thanks for any help.
The key to using ddply well is to understand that it will:
break the original data frame down by groups into smaller data frames,
then, for each group, it will call the function you give it, whose job it is to transform that data frame into a new data frame,
and finally, it will stitch all the little transformed data frames back together.
With that in mind, here's an approach that (I think) solves your problem.
sampleFunction <- function(df) {
  # Determine whether visited==1 or visited==0 is less common for this location,
  # and use that count as our sample size.
  n <- min(nrow(df[df$visited == "1", ]), nrow(df[df$visited == "0", ]))
  # Sample n from the two groups (visited==0 and visited==1).
  ddply(df, .(visited), function(x) x[sample(nrow(x), size = n), ])
}
newDF <- ddply(DF, .(location), sampleFunction)
# Just a quick check to make sure we have the equal counts we were looking for.
ddply(newDF, .(location, visited), summarise, N = length(variable))
How it works
The main ddply simply breaks DF down by location and applies sampleFunction, which does the heavy lifting.
sampleFunction takes one of the smaller data frames (in your case, one for each location) and samples from it an equal number of visited==1 and visited==0 rows. How does it do this? With a second call to ddply: this time, using visited to break it down, so we can sample from both the 1's and the 0's.
Notice, too, that we're calculating the sample size for each location based on whichever sub-group (0 or 1) has fewer occurrences, so this solution will work even if there aren't always more 0's than 1's.
