Replace NA with mean matching the same ID - r

I have a data frame:
id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)
and I need to replace the NAs with the per-id mean rate (or 1 when all rates for an id are NA), which I currently do with this loop:
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I do this much faster without using a for loop?
Thanks!

This is a solution using data.table:
library(data.table)
dt <- data.table( df, key = "id" )
dt[ , rate := ifelse( is.na(rate), round( mean(rate, na.rm=TRUE), 1), rate ), by = id ]
dt[ is.na(rate), rate := 1 ]
dt
id rate
1: 1 1.0
2: 1 1.0
3: 1 1.0
4: 1 1.0
5: 2 0.5
6: 2 0.6
7: 2 0.6
8: 3 0.7
9: 3 0.7
10: 4 1.0
I am not sure, though, whether the ifelse could/should be avoided.
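One way to sidestep the ifelse is to sub-assign by reference in two passes, using a temporary helper column (a sketch; grp_mean is an illustrative name):
library(data.table)
dt <- data.table(df, key = "id")
dt[, grp_mean := round(mean(rate, na.rm = TRUE), 1), by = id]  # per-id mean; NaN if all NA
dt[is.na(rate), rate := grp_mean]   # sub-assign by reference instead of ifelse
dt[is.na(rate), rate := 1]          # all-NA ids received NaN, which is.na() also catches
dt[, grp_mean := NULL]              # drop the helper column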

As mentioned in my comment, for loops in R are not especially slow. However, a for loop often indicates other inefficiencies in the code. In this case, the subset operation that is repeated for each row to determine the mean is most likely the slowest bit of code.
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1) ## This line!
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
If instead these group averages are determined beforehand, the loop can do a rapid lookup.
foo <- aggregate(df$rate, list(df$id), mean, na.rm=TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    ...
However, I am still doing a subset on the large data.frame at df$id[i]. Instead, using one of the tools that implement a split-apply-combine strategy is a good idea. Also, let's write a function that takes the chunk of data for one id and a pre-computed group average and does the right thing:
myfun <- function(DF) {
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  if (is.nan(avg)) {
    avg <- 1
  }
  DF$rate[is.na(DF$rate)] <- avg
  return(DF)
}
The plyr version:
library(plyr)
avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
result <- ddply(df, .(id), myfun)
And the likely much faster data.table version:
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm=TRUE), by=id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
This way, we've avoided all lookup subsetting in favor of adding a pre-calculated column, and can now do row-wise lookups, which are fast and efficient. The extra column can be dropped inexpensively using:
DT[, avg := NULL]
The whole shebang can be written into a function or a data.table expression. But, IMO, that often comes at the expense of clarity!
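For illustration, here is one way such a single data.table expression could look (a sketch; arguably less clear, as noted, and without rounding, matching the version above):
library(data.table)
DT <- as.data.table(df)
DT[, rate := {
  avg <- mean(rate, na.rm = TRUE)
  if (is.nan(avg)) avg <- 1        # id groups that are all NA fall back to 1
  ifelse(is.na(rate), avg, rate)   # only the NA entries are replaced
}, by = id]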

I'm not sure this exactly answers the OP's question, but for others who read this later, there is a different and much faster method of performing calculations on a subset of data other than actually subsetting the data: vector math. The engineers in the crowd will know what I'm talking about.
Instead of subsetting, use a very fast operation to create an identity (indicator) vector and multiply the data by that identity.
Now, this isn't faster for all cases. There are cases where vectorized functions are actually slower than item-explicit functions, and it all depends on your specific application. [Insert O-notation rant of your choice here.]
Here is how we would do a vector math implementation for this case:
# Create the NA identity vector.
na_identity <- is.na(df$rate)
# Initialize the final data frame.
# This is for non-destructive purposes.
df_revised <- df
# Replace all NA occurrences in final
# data frame with zero values.
df_revised$rate[na_identity] <- 0
# Loop through each unique [id]
# value in the data.
# Create an identity vector for the
# current ID, calculate the mean
# rate for that ID (replacing NaN with 1),
# and insert the mean for any NA values
# associated with that ID.
for (i in unique(df$id)){
  id_identity <- df$id == i
  id_mean <- sum(df_revised$rate * id_identity * !na_identity) / sum(id_identity * !na_identity)
  if(is.nan(id_mean)){id_mean <- 1}
  df_revised$rate <- df_revised$rate + id_mean * id_identity * na_identity
}
# id rate
# 1 1 1.00
# 2 1 1.00
# 3 1 1.00
# 4 1 1.00
# 5 2 0.50
# 6 2 0.60
# 7 2 0.55
# 8 3 0.70
# 9 3 0.70
# 10 4 1.00
From a vector math perspective, this code is easily readable. In this small example, the code is very fast, but the cycle time increases directly with the number of unique ID values. I'm not sure if this is the correct approach for the OP's larger application, but the solution is workable and theoretically sound, and it removes the need for complex and difficult-to-read logic blocks.

Related

What is an alternative for this slow for-loop to fill in single days between dates?

For a project I'm working on, I need to have a dataframe to indicate whether a person was absent (0) or not (1) on a particular day.
The problem is: my data is in a format where it gives the starting date of absenteeism and then the number of days the person was absent.
Example of my dataframe:
df1 <- data.frame(Person = c(1,1,1,1,1),
                  StartDate = c("01-01","02-01","03-01","04-01","05-01"),
                  DAYS = c(3,NA,NA,NA,1))
Instead of the "Start date" and "number of days absent" per person, it should look like this instead:
df2 <- data.frame(Person = c(1,1,1,1,1),
                  Date = c("01-01","02-01","03-01","04-01","05-01"),
                  Absent = c(1,1,1,0,1))
For now I solved it with this for loop with two if-conditions:
for(i in 1:nrow(df1)){
  if(!is.na(df1$DAYS[i])){
    var <- df1$DAYS[i]
  }
  if(var > 0){
    var <- var - 1
    df1$DAYS[i] <- 1
  }
}
This works, however I have thousands of persons with a full year of dates each, meaning that I have more than 5 million rows in my dataframe. You can imagine how slow the loop is.
Does anyone know a quicker way to solve my problem?
I tried looking at the lubridate package to work with periods and dates, but I don't see a solution there.
Here is an approach based upon generating all the indices of observations that should be set to 1, and then filling in the values.
# The data
df1 <- data.frame(Person = c(1,1,1,1,1),
                  StartDate = c("01-01","02-01","03-01","04-01","05-01"),
                  DAYS = c(3,NA,NA,NA,1))
# Initialize the vector we want with zeros
df1$Absent <- 0
# we get the indices of all the non-zero day entries
inds <- which(!is.na(df1$DAYS))
# Now we are going to build a list of all the indices that should be
# set to one. These are the intervals from when absence starts to
# the number of days absent - 1
inds_to_change <- mapply(function(i,d){i:(i+d-1)}, inds, na.omit(df1$DAYS))
df1$Absent[unlist(inds_to_change)] <- 1
df1
#> Person StartDate DAYS Absent
#> 1 1 01-01 3 1
#> 2 1 02-01 NA 1
#> 3 1 03-01 NA 1
#> 4 1 04-01 NA 0
#> 5 1 05-01 1 1
Created on 2019-02-20 by the reprex package (v0.2.1)
A faster solution can be found by using built-in R functions.
The general idea:
For each person, find the position for absent days greater than 1. Let the number of absent days be a and the position be p.
In every position defined by the sequence p:(p + a - 1) insert the value 1.
Return the redefined vector, in place of the old vector.
This can all be implemented in a function and then applied across all the subgroups. Using mapply (as the previous answer suggests) works for this specific case, but using data.table will in general be faster for larger data sets. This is utilized below.
RelocateAbsentees <- function(x){
  # Find the positions in x for which the value is greater than 1
  pos <- which(x > 1)
  # Fill in the vector with the absent days
  for(i in pos){
    val <- x[i]
    x[i:(i + val - 1)] <- 1
  }
  # Return the redefined vector
  x
}
df1 <- data.frame(Person = c(1,1,1,1,1),
                  StartDate = c("01-01","02-01","03-01","04-01","05-01"),
                  DAYS = c(3,NA,NA,NA,1))
library(data.table)
setDT(df1)
df2 <- copy(df1)[,Person := 2]
df3 <- rbind(df1,df2)
#Using data.table package (faster)
df1[, newDays := RelocateAbsentees(DAYS), by = Person]
df3[, newDays := RelocateAbsentees(DAYS), by = Person]
I found a pretty neat solution using the tidyverse:
library(tidyverse)
df1 %>%
  group_by(Person) %>%
  mutate(Abs = map_dbl(DAYS, ~ {
    if (!is.na(.x)) {
      d <<- .x
      +(d > 0)
    } else {
      d <<- d - 1
      +(d > 0)
    }
  }))
Firstly, your original approach was not so bad. Some minor improvements can make it faster than gfgm's (in my testing; I do not know your exact data structure):
improvedOP <- function(d) {
  days <- d$DAYS     # work on a plain vector so we do not repeatedly modify the data.frame column
  ii <- !is.na(days) # this can be calculated outside the loop
  for (i in 1:nrow(d)) {
    if (ii[i]) var <- days[i]
    if (var > 0) {
      var <- var - 1
      days[i] <- 1
    }
  }
  return(days)
}
I came up with this approach:
minem <- function(d) {
  require(zoo)
  rn <- 1:nrow(d)              # row numbers
  ii <- rn + d$DAYS - 1L       # last row number that should be set to 1 (NA where DAYS is NA)
  ii <- na.locf(ii, na.rm = F) # fill NAs forward
  ii <- rn <= ii               # a row is 1 if its number is <= the carried-forward end of run
  ii[ii == 0] <- NA            # set 0 to NA to match the original results
  as.integer(ii)
}
all.equal(minem(d), improvedOP(d))
# TRUE
The idea is that we calculate the row numbers that need to be 1 (current row + DAYS - 1), fill the NAs forward with this value, and mark a row as 1 if its number is within that range. This should be faster than approaches that involve creating sequences.
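To make that concrete, here is a trace of the intermediate vectors on the small example (expected values shown as comments):
library(zoo)
d <- data.frame(Person = c(1,1,1,1,1), DAYS = c(3, NA, NA, NA, 1))
rn <- 1:nrow(d)                   # 1  2  3  4  5
ii <- rn + d$DAYS - 1L            # 3 NA NA NA  5
ii <- na.locf(ii, na.rm = FALSE)  # 3  3  3  3  5
rn <= ii                          # TRUE TRUE TRUE FALSE TRUE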
Benchmark on larger (7.3 mil rows) simulated data:
gfgm <- function(d) {
  days <- rep(0, nrow(d))
  inds <- which(!is.na(d$DAYS))
  inds_to_change <- mapply(function(i, d) {i:(i + d - 1)}, inds, na.omit(d$DAYS))
  days[unlist(inds_to_change)] <- 1
  days
}
nrow(d)/1e6 # 7.3 mil
require(bench)
require(data.table)
bm <- bench::mark(minem(d), improvedOP(d), gfgm(d), iterations = 2, check = F)
as.data.table(bm[, 1:7])
# expression min mean median max itr/sec mem_alloc
# 1: minem(d) 281.34ms 302.85ms 302.85ms 324.35ms 3.3019990 408MB
# 2: improvedOP(d) 747.45ms 754.55ms 754.55ms 761.65ms 1.3252907 139MB
# 3: gfgm(d) 3.23s 3.27s 3.27s 3.31s 0.3056558 410MB
P.S. The real results probably depend on the distribution of the DAYS values.

Replace nth consecutive occurrence of a value

I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1,4,5,6,3,3,2),
                 Code = c(1,1,2,2,2,1,1))
I want a result
df_result <- data.frame(Values = c(1,4,5,6,3,3,2),
                        Code = c(1,0,2,2,2,1,0))
The data frame is time-ordered so I need to keep the same order after replacing the values. I guess that the nth() or duplicated() functions could be useful here but I'm not sure how to use them. What I'm missing is a function that would count the number of consecutive occurrences of a given value. Once I have it, I could then use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
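For reference, the within-run position (the count of consecutive occurrences so far) can be computed with run-length encoding, for example via data.table::rleid(); a minimal sketch:
library(data.table)
Code <- c(1,1,2,2,2,1,1)
run_id <- rleid(Code)               # 1 1 2 2 2 3 3
ave(Code, run_id, FUN = seq_along)  # 1 2 1 2 3 1 2  (position within each run)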
EDIT:
After an answer by @Gregor I wrote the following function which solves the problem
library(data.table)
library(dplyr)
replace_nth <- function(x, nth, code) {
  y <- data.table(x)
  y <- y[, code_rleid := rleid(y$Code)]
  y <- y[, seq := seq_along(Code), by = code_rleid]
  y <- y[seq == nth & Code == code, Code := 0]
  drop.cols <- c("code_rleid", "seq")
  y %>% select(-one_of(drop.cols)) %>% data.frame() %>% return()
}
To get the solution, simply run replace_nth(df, 2, 1)
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(df$Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after). I'll leave it clear and let you make modifications as you like.
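For reference, a combined, chained version might look like this (a sketch, starting again from the original df):
library(data.table)
df <- data.table(Values = c(1,4,5,6,3,3,2),
                 Code   = c(1,1,2,2,2,1,1))
df[, seq := seq_along(Code), by = rleid(Code)   # position within each run
   ][seq == 2 & Code == 1, Code := 0            # zero out the 2nd consecutive 1
   ][, seq := NULL]                             # drop the helper column
df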

apply function to all values in data.table subset

I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example taking the mean (but potentially a custom function), of all values in this subset? I can do it this way, but I wonder if there are better ways in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how different in performance a matrix vs a data.table approach would be:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
Please see the solution posted here for an even more general solution. It may also help:
data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
Part 1. A Step-by-Step Solution
(1.a) Get the data into Data.Table format:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
(1.b) Then, Get the list of Column Names:
# Get the list of names
sub <- c('spB', 'spF', 'spD')
(1.c) Define the function you want to apply
#Define the function you wish to apply
# Here, normalize is just an example function (defined below, not taken from the question):
normalize <- function(X, X.mean = mean(X, na.rm=T), X.sd = sd(X, na.rm=T)){
  X <- (X - X.mean) / X.sd
  return(X)
}
(1.d) Apply the function:
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
Part 2. Some Advantages for Data.Table approach
Since you seem familiar with matrix approach, I just wanted to point out some advantages of keeping the data.table approach
(2.a) Apply functions within group by using the "by ="
One advantage over matrix is that you can still apply functions within group by using the "by =" argument.
In the example here, I assume you have a variable called "Grp."
With the by=Grp line, the normalization is within group now.
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
(2.b) Another advantage is that you can keep other identifying information, for example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
First, we create a new column called Combined.Data for data in all three columns identified in "sub"
Next to each row of the combined data, the appropriate Participant Id will repeat in column P.Id
In the second step, done in this portion of the code:
[order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
We create a new column called Transformed to store the normalized values that result from applying the function normalize()
In addition, we can also include the Combined.Data column as well
So, with this single line:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
we subset columns,
collapse data across the subset,
keep track of the identifier for each datum (P.Id) even when collapsed,
apply a transformation on the entire collapsed data, and
end-up with a neat output in the form of a data table with 3 columns: (1) P.Id, (2) Transformed, & (3) Combined.Data (original values).
and, the order(P.Id) allows the output to appear meaningfully ordered.
The same would be possible with matrix approach, but would be much more cumbersome and take more lines of code.
Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.
(2.c) Finally, if you just wish to keep row information as simple row.numbers, you can use the .I feature of the data.table package:
pdist[, .(.I, normalize(unlist(.SD))), .SDcols = sub]
This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.
Part 3. Disadvantage: Time Cost
I recreated the corrected timing comparison shown above, and the data.table solutions do take significantly longer:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           unlist(., recursive = FALSE, use.names = FALSE) %>%
           mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>%
           melt.data.table(., measure.vars = sub, na.rm=TRUE) %$%
           mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91
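For completeness, one more layout worth exploring is a long-format distance table with a keyed join, which avoids the wide-column unlist entirely; this is only a sketch (pdistLong and subMeanLong are illustrative names, not from the posts above) and is not benchmarked here:
library(data.table)
# Melt the distance matrix (pdist from above) into long format once
pdistLong <- data.table(
  row   = rownames(pdist)[row(pdist)],
  col   = colnames(pdist)[col(pdist)],
  value = as.vector(pdist)
)
setkey(pdistLong, row, col)
# Each subset query then becomes a keyed join over all row/col pairs of the subset
subMeanLong <- function(sub) pdistLong[CJ(sub, sub), mean(value, na.rm = TRUE)]
subMeanLong(c('spB', 'spF', 'spD'))   # 6 on the small example from the question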

R function to calculate mean/median of top highest values

I have a data frame with 2 columns: one with numeric values and one with a name. The name repeats itself but has different values each time.
Data <- data.frame(
  Value = c(1:10),
  Name = rep(LETTERS, each=4)[1:10])
I would like to write a function that takes the 3 highest numbers for each name and calculates their mean and median (returning NA in case there aren't 3 values present), and then takes all the values for each name and calculates mean and median.
My initial attempt looks something like this:
my.mean <- function (x,y){
  top3.x <- ifelse(x > 3 , NA, x)
  return(mean(top3.x), median(top3.x))
}
Any hints on how to improve this will be appreciated.
I would probably recommend by for this.
Something put together really quickly might look like this (if I understood your question correctly):
myFun <- function(indf) {
  do.call(rbind, with(indf, by(Value, Name, FUN=function(x) {
    Vals <- head(sort(x, decreasing=TRUE), 3)
    if (length(Vals) < 3) {
      c(Mean = NA, Median = NA)
    } else {
      c(Mean = mean(Vals), Median = median(Vals))
    }
  })))
}
myFun(Data)
# Mean Median
# A 3 3
# B 7 7
# C NA NA
Note that it is not a very useful function in this form because of how many parameters are hard-coded into the function. It's really only useful if your data is in the form you shared.
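For reference, one way to lift those hard-coded pieces into arguments could look like this (a sketch; topStats, value_col, group_col, and n are illustrative names, not part of the original answer):
topStats <- function(indf, value_col, group_col, n = 3) {
  do.call(rbind, by(indf[[value_col]], indf[[group_col]], function(x) {
    vals <- head(sort(x, decreasing = TRUE), n)
    if (length(vals) < n) {
      c(Mean = NA, Median = NA)
    } else {
      c(Mean = mean(vals), Median = median(vals))
    }
  }))
}
topStats(Data, "Value", "Name")   # same output as myFun(Data)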
Here's a data.table solution, assuming that you don't have any other NAs in your data:
require(data.table) ## 1.9.2+
setDT(Data) ## convert to data.table
Data[order(Name, -Value)][, list(m1=mean(Value[1:3]), m2=median(Value[1:3])), by=Name]
# Name m1 m2
# 1: A 3 3
# 2: B 7 7
# 3: C NA NA
Using dplyr
library(dplyr)
myFun1 <- function(dat){
  dat %>%
    group_by(Name) %>%
    arrange(desc(Value)) %>%
    mutate(n = n(), Value = ifelse(n <= 3, NA_integer_, Value)) %>%
    summarize(Mean = mean(head(Value, 3)), Median = median(head(Value, 3)))
}
myFun1(Data)
#Source: local data frame [3 x 3]
# Name Mean Median
#1 A 3 3
#2 B 7 7
#3 C NA NA

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
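To see this concretely with the dt defined above:
length(dt[, 'a', with = FALSE])   # 1  -- a one-column data.table
length(dt[['a']])                 # 10 -- the underlying vector
length(unique(dt[['a']]))         # 10 -- the count we are actually after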
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation for how to drop columns that contain only one unique observation from a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
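Building on that, one possible way to drop the single-valued columns directly with uniqueN (a sketch):
drop_cols <- names(dt)[dt[, sapply(.SD, uniqueN)] == 1]  # columns with a single unique value
if (length(drop_cols)) dt[, (drop_cols) := NULL]         # remove them by reference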
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) length(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[,names(dt) := lapply(.SD, function(x) if(length(unique(x)) ==1) {return(NULL)} else{return(x)})]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
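A toy illustration of that difference:
DT <- data.table(x = 1:3, i = 4:6)
i <- "x"
# DT[, i := NULL]   # would delete the column literally named "i"
DT[, (i) := NULL]   # deletes the column whose name is stored in i, i.e. "x"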
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
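One possible simplification of the apply/only_space step above (a sketch): select the surviving columns by name rather than by position.
dt[, names(dt)[!sapply(dt, only_space)], with = FALSE]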
There is an easy way to do that using the dplyr library and its select function:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like.
You will then get only the data that you want.
