Determining how to quickly classify columns in huge datasets as factors

No good example here, since the datasets I am working with are huge.
But if I have a dataset with 200 or 300-something columns, I want some sort of rule to quickly classify some of these columns and convert them to factors. Is there some quick R code to do it?
The reason is that I don't have time to go column by column to completely understand or interpret the data, but if I see there are just 4 unique values out of 5000 rows, I assume that this is categorical data.
Does anyone have any quick code snippets or ways to go about doing this?

Assuming that df refers to your data frame:
## Find all columns with fewer than 5 unique values
## (sapply works column by column, so df is not coerced to a matrix first)
cols <- sapply(df, function(x) length(unique(x))) < 5
## Convert columns with fewer than 5 unique values to factor
df[cols] <- lapply(df[cols], factor)
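
The same rule can be wrapped in a small helper so the threshold is easy to adjust. This is only a sketch on a toy data frame: the function name factorize_low_cardinality and its max_levels parameter are made up for illustration and do not come from any package.

```r
# Hypothetical helper: convert columns with fewer than `max_levels`
# distinct values to factors. Name and parameter are illustrative only.
factorize_low_cardinality <- function(df, max_levels = 5) {
  # Count distinct values per column (NA counts as its own value)
  n_unique <- sapply(df, function(x) length(unique(x)))
  cols <- n_unique < max_levels
  df[cols] <- lapply(df[cols], factor)
  df
}

# Toy data: `grade` has at most 4 distinct values, `id` and `score` are unique per row
set.seed(1)
toy <- data.frame(
  id    = 1:100,
  grade = sample(c("a", "b", "c", "d"), 100, replace = TRUE),
  score = rnorm(100),
  stringsAsFactors = FALSE
)

toy2 <- factorize_low_cardinality(toy)
sapply(toy2, class)
```

A fixed cutoff of 5 is a judgment call; on very different row counts it may make more sense to pass a larger max_levels and inspect the result before trusting it.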

Related

Loop over even/odd columns & stack them under specific ones

I have the following data set from Douglas Montgomery's book Introduction to Time Series Analysis & Forecasting:
I created a data frame called pharm from this spreadsheet. We only have two variables but they're repeated over several columns. I'd like to take all odd "Week" columns past the 2nd column and stack them under the 1st Week column in order. Conversely I'd like to do the same thing with the even "Sales, in thousands" columns. Here's what I've tried so far:
pharm2 <- data.frame(week=c(pharm$week, pharm[,3], pharm[,5], pharm[,7]), sales=c(pharm$sales, pharm[,4], pharm[,6], pharm[,8]))
This works because there aren't many columns, but I need a way to do this more efficiently because hard coding won't be practical with many columns. Does anyone know a more efficient way to do this?

If the columns are alternating, just subset with a recycled logical vector, unlist, and create a new data.frame:
out <- data.frame(week = unlist(pharm[c(TRUE, FALSE)]),
sales = unlist(pharm[c(FALSE, TRUE)]))

You may use the seq function to generate a sequence that extracts the alternating columns.
pharm2 <- data.frame(week = unlist(pharm[seq(1, ncol(pharm), 2)]),
sales = unlist(pharm[seq(2, ncol(pharm), 2)]))
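
To see that the two answers select the same columns, here is a small check on made-up data. The pharm data frame below is a toy stand-in, not the data from the book, and use.names = FALSE is added only to drop the long element names that unlist would otherwise generate.

```r
# Toy stand-in for `pharm`: week/sales pairs repeated across columns
pharm <- data.frame(
  week   = 1:3, sales   = c(10, 11, 12),
  week.1 = 4:6, sales.1 = c(13, 14, 15),
  week.2 = 7:9, sales.2 = c(16, 17, 18)
)

# c(TRUE, FALSE) is recycled across all columns, picking 1, 3, 5, ...
by_logical <- data.frame(
  week  = unlist(pharm[c(TRUE, FALSE)], use.names = FALSE),
  sales = unlist(pharm[c(FALSE, TRUE)], use.names = FALSE)
)

# Same selection with explicit odd and even positions
by_seq <- data.frame(
  week  = unlist(pharm[seq(1, ncol(pharm), by = 2)], use.names = FALSE),
  sales = unlist(pharm[seq(2, ncol(pharm), by = 2)], use.names = FALSE)
)

identical(by_logical, by_seq)
```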

Calculate ratios of all column combinations from a dataframe

I have a CSV file imported as df in R. The dimension of this df is 18x11. I want to calculate all possible ratios between the columns. Can you please help me with this? I understand that either a "for loop" or a vectorized function will do the job. The row names will remain the same, while the column name combinations can be merged using paste. However, I don't know how to execute this. I did this in Excel, as it is still a small data set. A larger one would make it tedious and error prone in Excel, so I would like to try it in R.
It would be a great help indeed. Thanks. Let's say below is a data frame subset from my data.
dfn = data.frame(replicate(18,sample(100:1000,15,rep=TRUE)))

If you do:
do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) x/dfn[[y]])))
You will get a matrix that is 15 * 324, with the first 18 columns representing all columns divided by the first column, the next 18 columns representing all columns divided by the second column, and so on.
You can keep track of them by labelling the columns with the following names:
apply(expand.grid(names(dfn), names(dfn)), 1, paste, collapse = " / ")
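
One way to check that the labels line up with the values is to run the idea on a smaller data frame and inspect a single named column. The sketch below uses 3 columns and 4 rows instead of the question's data; the names ratios, labels, num and den are chosen here for illustration.

```r
# Small stand-in for `dfn`: 4 rows, 3 columns named X1, X2, X3
set.seed(42)
dfn <- data.frame(replicate(3, sample(100:1000, 4, replace = TRUE)))

# For each denominator column y, divide every column by it
ratios <- do.call(cbind, lapply(seq_along(dfn), function(y) {
  sapply(dfn, function(x) x / dfn[[y]])
}))

# expand.grid() varies its first argument fastest, which matches the
# column order above: the numerator cycles for each denominator
labels <- expand.grid(num = names(dfn), den = names(dfn),
                      stringsAsFactors = FALSE)
colnames(ratios) <- paste(labels$num, labels$den, sep = " / ")

dim(ratios)
ratios[, "X2 / X1"]
```

The result includes each column divided by itself (all ones) and both orientations of every pair, so it has ncol(dfn)^2 columns.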

Subsetting variables with missing values in R

I have a dataset with 50 variables (columns), and 30 of them have missing values in more than half of their observations.
I want to subset the dataset so that those 30 variables with too many missing values are gone. I think I can do it one by one, but I was wondering if there is a way to do it more quickly in R.

Logic: First iterate over each column using sapply and check which columns have fewer than half of their values missing. The output of the first line is a logical vector, which is then used to subset the data.
ind <- sapply( colnames(df), function(x) sum(is.na(df[[x]])) < nrow(df)/2)
df <- df[colnames(df)[ind]]
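
An equivalent way to express the same test is through the share of missing values per column. The sketch below uses a made-up four-column data frame; na_share and kept are names chosen here for illustration.

```r
# Toy data with different amounts of missingness per column
df <- data.frame(
  a = c(1, 2, 3, 4),           # nothing missing
  b = c(NA, NA, NA, 4),        # 75% missing
  c = c("x", NA, "y", "z"),    # 25% missing
  d = c(NA, NA, 1, 2)          # exactly 50% missing
)

# is.na() on a data frame returns a logical matrix, so colMeans()
# gives the proportion of missing values in each column
na_share <- colMeans(is.na(df))

# Keep columns with strictly less than half of their values missing
kept <- df[na_share < 0.5]
names(kept)
```

As with the sapply version, a column that is exactly half missing is dropped, because the comparison is strict.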

Subsetting every x amount of columns as separate sites

I need a function that recognises every x amount of columns as a separate site. So in df1 below there are 8 columns, with 4 sites each consisting of 2 variables. Previously, I have used a procedure like this as answered here Selecting column sequences and creating variables.
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8*10, replace=TRUE), ncol=8))
I then need to calculate a column sum so that a total for each variable is obtained.
colsums <- as.data.frame(t(colSums(df1)))
I subsequently split the dataframe using this technique...
lst1 <- setNames(lapply(split(1:ncol(colsums), as.numeric(gl(ncol(colsums),
2, ncol(colsums)))), function(i) colsums[,i]), paste0('site', 1:4))
list2env(lst1, envir=.GlobalEnv)
And organise into one dataframe...
Combined <- as.matrix(mapply(c,site1,site2,site3,site4))
rownames(Combined) <- c("Site.1","Site.2","Site.3","Site.4")
Whilst this technique has been great on smaller data frames, where there is a substantial number of sites (>500), typing out each site in the mapply call takes up a lot of code and could lead to some sites being missed if I'm typing them all in manually. Is there an easy way to overcome this following the colsums stage?

A matrix is a vector with dimensions. Matrices are stored in column-major order in R.
Since colsums is a one-row data frame, unlist it first: the call matrix(unlist(colsums), nrow=2) should help you a lot, giving one column per site.
NB.: Polluting the "global" environment is generally a bad idea.
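
Building on the column-major point, the sketch below goes straight from the column totals to a sites-by-variables matrix without creating site1, site2, ... in the global environment. It assumes every site has the same number of columns; vars_per_site is a name introduced here for illustration.

```r
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8 * 10, replace = TRUE), ncol = 8))

vars_per_site <- 2  # assumption: every site contributes the same number of columns

# colSums() returns a plain numeric vector; filling by row puts each
# consecutive pair of column totals on its own row (one row per site)
Combined <- matrix(colSums(df1), ncol = vars_per_site, byrow = TRUE)
rownames(Combined) <- paste0("Site.", seq_len(nrow(Combined)))
Combined
```

Because the row names are generated from nrow(Combined), the same lines should work unchanged for 500 or more sites.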

Using gsub() in a data.table

I have a big data table (about 20,000 rows). One of its columns contains integers from 1 to 6.
I also have a character vector of car models (6 models).
I'm trying to replace the integers with the corresponding car model (just 2 in this example):
gsub("1",paste0(labels[1]),Models)
gsub("2",paste0(labels[2]),Models)
...
"Models" is the name of a column.
labels <- c("Altima","Maxima")
After fighting with it for 12+ hours, gsub() still isn't working.
Sample data:
mydata<-data.table(replicate(1,sample(1:6,10000,rep=TRUE)))
labels<-c("altima","maxima","sentra","is","gs","ls")

I don't think you need gsub here. What you are describing is a factor variable.
If your data is
mydata <- data.table(replicate(1,sample(1:6,1000,rep=TRUE)))
models <- c("altima","maxima","sentra","is","gs","ls")
you could just do
mydata[[1]] <- factor(mydata[[1]], levels=seq_along(models), labels=models)
If you really wanted a character rather than a factor, then
mydata[[1]] <- models[ mydata[[1]] ]
would also do the trick. Both of these assume that the numbers are consecutive integers starting at 1.

You could try using factor() in the following way, which worked for me on your test data, assuming that the name of the first column in mydata is V1 (the default):
mydata$V1 <- factor(mydata$V1, labels=models)
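
To compare the factor route and the indexing route side by side, here is a small check. It uses a plain data frame as a stand-in for the data.table so that it runs without extra packages; as_factor and as_character are names chosen here for illustration.

```r
set.seed(1)
# Plain data frame stand-in for the data.table in the question
mydata <- data.frame(V1 = sample(1:6, 1000, replace = TRUE))
models <- c("altima", "maxima", "sentra", "is", "gs", "ls")

# Factor: code k gets label models[k]; codes outside 1..6 would become NA
as_factor <- factor(mydata$V1, levels = seq_along(models), labels = models)

# Character: use the integer codes as positions in `models`
as_character <- models[mydata$V1]

# Both routes should give the same label for every row
all(as.character(as_factor) == as_character)
```

Passing levels explicitly, as in the first answer, keeps the mapping stable even if some code happens not to occur in the data, whereas labels alone relies on every code being present.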
