Creating dummies with apply in R - r

I have data about different study strategies for individuals (stored in columns labeled Strategy_A, Strategy_B, Strategy_C). The strategies are coded 1-15. Because each student can list up to 3 strategies, I want to create a dummy for each strategy (e.g. strategy1, strategy2, etc.).
Example Data
ID = c(1, 2, 3, 4, 5)
Strategy_A = c(10, 12, 13, 1, 2)
Strategy_B = c(1, 2, 1, 4, 5)
Strategy_C = c(2, 3, 6, 8, 15)
all = data.frame(ID, Strategy_A, Strategy_B, Strategy_C)
I thought about using apply and creating a function linked to the fastDummies package.
dummies = function(x){
dummy_cols(x)
}
new = apply(all [,-1], 2, dummies)
new = as.data.frame(new)
However, this creates dummies for StrategyA_1 StrategyA_2 StrategyA_3 rather than summarizing the dummies as Strategy1 Strategy2 Strategy3. Any ideas how to fix this?

After reshaping all to long format (one row per ID and listed strategy), you can use dummy.data.frame() from dummies (or dummy_cols() from fastDummies) and then aggregate per ID.
all <- data.frame(ID = rep(all$ID, 3),
Strategy = c(all$Strategy_A, all$Strategy_B, all$Strategy_C)) # data frame "all" with one column Strategy
library(dummies)
all <- dummy.data.frame(all, "Strategy") # or fastDummies::dummy_cols(all, "Strategy")
aggregate(. ~ ID, all, sum) # since strategies are now dummies, the sum will always be 0 or 1
# output
ID Strategy1 Strategy2 Strategy3 Strategy4 Strategy5 Strategy6 Strategy8 Strategy10 Strategy12 Strategy13 Strategy15
1 1 1 1 0 0 0 0 0 1 0 0 0
2 2 0 1 1 0 0 0 0 0 1 0 0
3 3 1 0 0 0 0 1 0 0 0 1 0
4 4 1 0 0 1 0 0 1 0 0 0 0
5 5 0 1 0 0 1 0 0 0 0 0 1
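If you would rather avoid an extra package, here is a minimal base R sketch of the same idea: stack the three strategy columns, cross-tabulate ID against strategy, and cap the counts at 1. The names dat, long, counts and dummies are mine, and fixing the factor levels at 1:15 (so that unused strategies still get a column of zeros) is an assumption about what you want.

```r
# Example data from the question (named dat here to avoid masking base::all)
dat <- data.frame(ID = c(1, 2, 3, 4, 5),
                  Strategy_A = c(10, 12, 13, 1, 2),
                  Strategy_B = c(1, 2, 1, 4, 5),
                  Strategy_C = c(2, 3, 6, 8, 15))

# Long format: one row per ID and listed strategy
long <- data.frame(ID = rep(dat$ID, times = ncol(dat) - 1),
                   Strategy = factor(unlist(dat[-1], use.names = FALSE),
                                     levels = 1:15))

# Cross-tabulate, then turn counts into 0/1 indicators
counts <- unclass(table(long$ID, long$Strategy))
colnames(counts) <- paste0("Strategy", colnames(counts))
dummies <- data.frame(ID = as.numeric(rownames(counts)), (counts > 0) * 1L)
dummies
```

Capping with `counts > 0` keeps the result 0/1 even if a student lists the same strategy in two columns.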

Here is a tidyverse approach.
library(tidyverse)
new <- all %>% gather(select = -ID) %>%
mutate(key = NULL, num = 1) %>%
spread(value, num)
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 NA NA NA NA NA 1 NA NA NA
# 2 2 NA 1 1 NA NA NA NA NA 1 NA NA
# 3 3 1 NA NA NA NA 1 NA NA NA 1 NA
# 4 4 1 NA NA 1 NA NA 1 NA NA NA NA
# 5 5 NA 1 NA NA 1 NA NA NA NA NA 1
new[is.na(new)] <- 0
new
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 0 0 0 0 0 1 0 0 0
# 2 2 0 1 1 0 0 0 0 0 1 0 0
# 3 3 1 0 0 0 0 1 0 0 0 1 0
# 4 4 1 0 0 1 0 0 1 0 0 0 0
# 5 5 0 1 0 0 1 0 0 0 0 0 1

Related

How to populate a matrix based on criteria for rows and columns (Updated 2x)

I have looked into the issue further and can now give a better explanation.
I am trying to plan a work schedule in R. The problem is as follows.
I have a set of n workers who need to work over a period of m months (p days in total). The only conditions that need to be satisfied are:
Each day, np workers must be working in the factory.
Every worker must have r=8 rest days per month, while np people are still working each day.
I think I could use the days of each month as the columns to fill in, but I am not sure how to assign the rest days so that each worker gets exactly 8 of them per month. As an example, I set up a matrix for 12 workers over two months, with np=8 workers in the factory each day and r=8 rest days per worker per month.
I have a code like this:
#Workers
n <- 12;
#Months (Days)
p <- 59;
#Number of people required each day
np <- 8;
#Rest days per month
r <- 8
#Matrix
#Days
vday <- seq(as.Date('2023-02-01'),as.Date('2023-03-31'),by=1)
mm <- matrix(data =NA,nrow = n,ncol = length(vday))
dimnames(mm)[[2]]<-as.character(vday)
However, I cannot find a way to have np=8 people working each day while each of them rests exactly r=8 days per month. Rest days can be allocated at random within each month, as long as both conditions hold. I would use 1 for working and 0 for resting.
I find the variable naming misleading, since np reads to me like a multiplication. I propose a function that handles your problem with different variable names.
This function returns an error in two cases:
The total number of workers cannot be split into groups of equal size (workers %% group_size != 0). For example, if you need 2 groups, the total number of workers must be divisible by 2 (2, 4, 6, ..., 14). It could accommodate uneven division, but the code gets a bit more complex and you would have to specify what to do with the extra workers.
The work-rest schedule is not feasible with the number of groups you plan to use. For instance, a 2-day work/3-day rest schedule is not possible with 2 groups of workers (unless nobody works on certain days).
If no error occurs then the function returns a data.frame with the following characteristics:
Column one the worker ID
Column two the group ID
All remaining columns represent the working days and can have the following values: 1 - the worker has to work, 0 - the worker has to rest, NA - the worker has neither to work nor rest.
FUNCTION CODE:
work_schedule <- function(workers, total_days, group_size, day_streak, rest_days){
  if(workers %% group_size != 0){
    stop("workers are not divisible in groups of equal sizes")
  }
  n_groups <- workers / group_size
  df <- data.frame(worker = seq_len(workers), # NOTE FOR MORE THAN 26 GROUPS THE GROUP NAMING MUST BE CHANGED
                   group = rep(LETTERS[seq_len(n_groups)], each = group_size))
  schedule <- matrix(nrow = nrow(df), ncol = total_days)
  pttrn <- c(rep(1, day_streak), rep(0, rest_days))
  d0 <- 1
  g_names <- unique(df$group)
  g_ind <- 1
  while(d0 <= total_days){
    d1 <- d0+length(pttrn)-1
    if(d0+length(pttrn)-1 > total_days){
      d1 <- total_days
    }
    mt <- t(schedule[df$group == g_names[g_ind], d0:d1])
    if( !all(is.na(mt)) ) stop("Not enough groups to comply with working and resting days schedule")
    mt[,] <- pttrn[1:length(d0:d1)]
    schedule[df$group == g_names[g_ind], d0:d1] <- t(mt)
    d0 <- d0 + day_streak
    g_ind <- g_ind + 1
    if(g_ind > length(g_names)){
      g_ind <- 1
    }
  }
  colnames(schedule) <- paste0("D", seq_len(total_days))
  return(cbind(df, schedule))
}
EXAMPLES:
# Example 1
work_schedule(12, 15, 3, 2, 2)
worker group D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
1 1 A 1 1 0 0 NA NA NA NA 1 1 0 0 NA NA NA
2 2 A 1 1 0 0 NA NA NA NA 1 1 0 0 NA NA NA
3 3 A 1 1 0 0 NA NA NA NA 1 1 0 0 NA NA NA
4 4 B NA NA 1 1 0 0 NA NA NA NA 1 1 0 0 NA
5 5 B NA NA 1 1 0 0 NA NA NA NA 1 1 0 0 NA
6 6 B NA NA 1 1 0 0 NA NA NA NA 1 1 0 0 NA
7 7 C NA NA NA NA 1 1 0 0 NA NA NA NA 1 1 0
8 8 C NA NA NA NA 1 1 0 0 NA NA NA NA 1 1 0
9 9 C NA NA NA NA 1 1 0 0 NA NA NA NA 1 1 0
10 10 D NA NA NA NA NA NA 1 1 0 0 NA NA NA NA 1
11 11 D NA NA NA NA NA NA 1 1 0 0 NA NA NA NA 1
12 12 D NA NA NA NA NA NA 1 1 0 0 NA NA NA NA 1
# Example 2
work_schedule(14, 15, 7, 2, 2)
worker group D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
1 1 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
2 2 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
3 3 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
4 4 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
5 5 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
6 6 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
7 7 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
8 8 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
9 9 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
10 10 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
11 11 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
12 12 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
13 13 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
14 14 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
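To check a schedule in this 1/0/NA coding against a daily staffing requirement, it can help to count workers per day and rest days per worker. A small sketch on a made-up 4-worker matrix (sched, working and resting are illustrative names, not part of work_schedule):

```r
# Toy schedule: 4 workers x 4 days, 1 = work, 0 = rest, NA = neither
sched <- rbind(c(1, 1, 0, 0),
               c(1, 1, 0, 0),
               c(NA, NA, 1, 1),
               c(NA, NA, 1, 1))
colnames(sched) <- paste0("D", 1:4)

working <- colSums(sched == 1, na.rm = TRUE)  # workers present on each day
resting <- rowSums(sched == 0, na.rm = TRUE)  # rest days taken by each worker
working
resting
```

The same two lines should work on the numeric day columns of a work_schedule() result after dropping the worker and group columns.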
Since it remains unclear what should happen when there are not enough rested workers to fill a full team for the day, this solution assumes there are always enough workers for a rested team to be available every day. It simply rotates through the workers.
library(tidyverse)
n <- 15; p <- 15; np <- 7; k <- 2; r <- 2
sol <- rep(c(integer(k-1), 1), length.out = p-1) |> # Vector of days, 1 for rotation, 0 for same crew
accumulate(
.f = \(lhs, rhs) {if (rhs) {c(tail(lhs, np), head(lhs, n-np))} else {lhs}}, # move last np elements to the front of the vector
.init = c(rep(1L, np), rep(0L, n-np)) # init vector
) %>%
set_names(str_c("D", seq_along(.))) %>%
c(list(Workers = seq_len(n)), .) |> # Add workers col
do.call(what = cbind)
sol
#> Workers D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
#> [1,] 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [2,] 2 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [3,] 3 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [4,] 4 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [5,] 5 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1
#> [6,] 6 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1
#> [7,] 7 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1
#> [8,] 8 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [9,] 9 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [10,] 10 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [11,] 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [12,] 12 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0
#> [13,] 13 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0
#> [14,] 14 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0
#> [15,] 15 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0
If you care about the distinction between resting (0) and rested but not working (NA), you can add that like so:
sol[, -1] <- sol[, -1] |>
apply(1, \(x) ifelse((cumsum(x==0) %>% {. - lag(., r+1, 0)})>r, NA_integer_, x)) |>
t()
sol
#> Workers D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
#> [1,] 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [2,] 2 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [3,] 3 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [4,] 4 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [5,] 5 1 1 0 0 1 1 0 0 1 1 0 0 NA NA 1
#> [6,] 6 1 1 0 0 1 1 0 0 NA NA 1 1 0 0 1
#> [7,] 7 1 1 0 0 NA NA 1 1 0 0 1 1 0 0 1
#> [8,] 8 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [9,] 9 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [10,] 10 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [11,] 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [12,] 12 0 0 1 1 0 0 1 1 0 0 1 1 0 0 NA
#> [13,] 13 0 0 1 1 0 0 1 1 0 0 NA NA 1 1 0
#> [14,] 14 0 0 1 1 0 0 NA NA 1 1 0 0 1 1 0
#> [15,] 15 0 0 NA NA 1 1 0 0 1 1 0 0 1 1 0
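As a side check on the numbers in the question (not part of either answer): with n workers who each rest r days, a month of d days offers n * (d - r) worker-days, while staffing needs np * d. A rough sketch of that arithmetic for the two months in the question follows; supply, demand and the layout of check are my own naming.

```r
n <- 12; np <- 8; r <- 8
vday <- seq(as.Date("2023-02-01"), as.Date("2023-03-31"), by = 1)

# Days per calendar month in the planning period
d <- as.integer(table(format(vday, "%Y-%m")))

check <- data.frame(month = unique(format(vday, "%Y-%m")),
                    days = d,
                    supply = n * (d - r),  # worker-days available with r rest days each
                    demand = np * d)       # worker-days needed for np people per day
check$enough <- check$supply >= check$demand
check$exact <- check$supply == check$demand
check
```

Supply exceeds demand in both months (240 vs 224 and 276 vs 248), so "at least 8 people per day with 8 rest days each" looks achievable, but "exactly 8 per day and exactly 8 rest days" is not: either some days have more than 8 people present, or some workers rest more than 8 days.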

Conditioning error, progression of logic in mutate/elseif_ pipeline

I'm trying to work out why a code like this won't give me the expected results. I understand there are better ways of achieving the results (cut, etc.) but I am specifically trying to understand why the mutate>ifelse pipeline progression to replace values doesn't work.
A <- c(1,0,0,0,NA,0,1,0,1,0,0,1,1,1,NA,NA,NA,1,0,0,0,1,1,1,0,1,NA)
B <- c(1,0,0,NA,0,1,1,1,0,1,NA,1,0,1,NA,NA,1,0,01,0,0,0,NA,0,1,0,1)
C <- c(0,NA,0,1,0,1,NA,1,0,1,NA,0,1,0,NA,NA,1,0,01,NA,0,0,NA,1,NA,NA,1)
df <- data.frame(A, B, C)
df$D <- NA
df <- df %>%
mutate(D=ifelse(A==0 & B==0 & C==0,0,D)) %>% #assign 0 to d IF all 3 variables 0
mutate(D=ifelse(A==0 | B==0 | C==0,0,D)) %>% #now assign 0 to d IF ANY of 3 variables 0
mutate(D=ifelse(A==1 | B==1 | C==1,1,D)) #now reassign d to 1 if any of the variables has the value 1
> summary(as.factor(df$D))
0 1 NA's
2 19 6
But looking at the cross tabulation, my aim is to get 0=2 and NA=2, with the rest assigned 1. I can't figure out why my code's logic is not working.
> ftable(xtabs(~A+B+C, df, addNA = TRUE, na.action = NULL)) #matches AV variable
C 0 1 NA
A B
0 0 2 0 2
1 0 4 1
NA 0 1 1
1 0 3 2 1
1 3 0 1
NA 0 0 1
NA 0 1 0 0
1 0 2 0
NA 0 0 2
Edit: corrected typo
Look at your code step by step, specifically the two mutate commands with the OR conditions. For rows that contain missing values and 1s (but no zeroes), R can't check whether the row contains a zero, because it does not know what NA might be. So the second mutate returns NA for any row that has only 1s and NAs. The third step does the same, just with 1s: any row that contains only 0s and NAs will then return NA.
You can verify this by:
x <- c(0, 0, NA)
any(x == 0)
[1] TRUE
any(x == 1)
[1] NA
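The same three-valued logic applies to | and to ifelse(). A short sketch, using one row pattern from the question (A = 0, B = 0, C = NA); the variable name test is mine:

```r
NA | TRUE    # TRUE: the result is TRUE whatever the missing value is
NA | FALSE   # NA: the result depends on the missing value
NA & FALSE   # FALSE

# A row with A = 0, B = 0, C = NA
A <- 0; B <- 0; C <- NA
test <- A == 1 | B == 1 | C == 1   # FALSE | FALSE | NA gives NA
ifelse(test, 1, 0)                 # NA: neither the yes nor the no value is used
```

So when the test itself is NA, ifelse() returns NA and whatever value D held before is overwritten.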
You can do:
library(tidyverse)
df2 <- df %>%
mutate(D = case_when(A == 0 & B == 0 & C == 0 ~ 0,
is.na(A) & is.na(B) & is.na(C) ~ NA_real_,
TRUE ~ 1))
which gives:
A B C D
1 1 1 0 1
2 0 0 NA 1
3 0 0 0 0
4 0 NA 1 1
5 NA 0 0 1
6 0 1 1 1
7 1 1 NA 1
8 0 1 1 1
9 1 0 0 1
10 0 1 1 1
11 0 NA NA 1
12 1 1 0 1
13 1 0 1 1
14 1 1 0 1
15 NA NA NA NA
16 NA NA NA NA
17 NA 1 1 1
18 1 0 0 1
19 0 1 1 1
20 0 0 NA 1
21 0 0 0 0
22 1 0 0 1
23 1 NA NA 1
24 1 0 1 1
25 0 1 NA 1
26 1 0 NA 1
27 NA 1 1 1
And then
df2 %>% count(D)
D n
1 0 2
2 1 23
3 NA 2

Count several rows and make a new column in R

I want to sum several columns (x1-x4) row by row and store the result in a new column (x1_x4). Can anyone help me?
df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
x1 = c(0,NA,0,1,0,0,1,1,1,NA),
x2 = c(0,NA,1,0,0,NA,0,1,0,0),
x3 = c(0,NA,0,1,1,0,1,1,1,0),
x4 = c(0,NA,0,0,0,0,1,1,1,1))
You can use rowSums with na.rm = TRUE and then use apply to test whether all values in a row are NA.
df$x1_x4 <- rowSums(df[-1], na.rm = TRUE)
df$x1_x4[apply(is.na(df[2:5]), 1, all)] <- NA
# ID x1 x2 x3 x4 x1_x4
#1 1 0 0 0 0 0
#2 2 NA NA NA NA NA
#3 3 0 1 0 0 1
#4 4 1 0 1 0 2
#5 5 0 0 1 0 1
#6 6 0 NA 0 0 0
#7 7 1 0 1 1 3
#8 8 1 1 1 1 4
#9 9 1 0 1 1 3
#10 10 NA 0 0 1 1
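The second line is needed because a sum with na.rm = TRUE over nothing but NAs is 0, not NA. A tiny illustration on a made-up matrix (m, s and all_missing are illustrative names):

```r
m <- rbind(c(NA, NA),
           c(NA, 1),
           c(0, 0))

s <- rowSums(m, na.rm = TRUE)            # all-NA row sums to 0
all_missing <- rowSums(!is.na(m)) == 0   # TRUE only for the all-NA row
s[all_missing] <- NA                     # restore NA where nothing was observed
s
```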
One dplyr solution could be:
library(dplyr)
df %>%
rowwise() %>%
mutate(x1_x4 = any(!is.na(c_across(-ID)))^NA * sum(c_across(-ID), na.rm = TRUE))
ID x1 x2 x3 x4 x1_x4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 0 0
2 2 NA NA NA NA NA
3 3 0 1 0 0 1
4 4 1 0 1 0 2
5 5 0 0 1 0 1
6 6 0 NA 0 0 0
7 7 1 0 1 1 3
8 8 1 1 1 1 4
9 9 1 0 1 1 3
10 10 NA 0 0 1 1
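The ^NA part is the trick in that expression: in R, 1^NA is 1 but 0^NA is NA, so the first factor is 1 when a row has at least one non-missing value and NA otherwise. A quick check on two made-up vectors x and y:

```r
TRUE^NA    # 1
FALSE^NA   # NA

x <- c(NA, 0, 1)    # at least one observed value
y <- c(NA, NA, NA)  # nothing observed

res_x <- any(!is.na(x))^NA * sum(x, na.rm = TRUE)  # 1 * 1
res_y <- any(!is.na(y))^NA * sum(y, na.rm = TRUE)  # NA * 0
res_x
res_y
```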
vars <- paste0("x", 1:4)
df$x1_x4 <- rowSums(df[vars], na.rm = TRUE)
df[rowSums(is.na(df[vars]), na.rm = TRUE) == 4, "x1_x4"] <- NA
df
# ID x1 x2 x3 x4 x1_x4
# 1 1 0 0 0 0 0
# 2 2 NA NA NA NA NA
# 3 3 0 1 0 0 1
# 4 4 1 0 1 0 2
# 5 5 0 0 1 0 1
# 6 6 0 NA 0 0 0
# 7 7 1 0 1 1 3
# 8 8 1 1 1 1 4
# 9 9 1 0 1 1 3
# 10 10 NA 0 0 1 1
Base R one (obfuscated) expression:
within(df, {x1_x4 <- apply(df[,grepl("^x", names(df))], 1,
function(x){ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE))})})

Recoding multiple variables based on logical rules in external table.

Objective
Given data ds, compute a new variable ds$h1 from ds$raw1 and ds$raw2 according to the harmonization rule specified in the object hrule.
The reproducible example contains response of 10 individuals on 2 measures, raw1 and raw2:
>ds
id raw1 raw2
1 1 1 1
2 2 1 0
3 3 0 1
4 4 0 0
5 5 NA 1
6 6 NA 0
7 7 1 NA
8 8 0 NA
9 9 NA NA
10 10 1 1
These two variables need to be transformed into a single, harmonized variable, according to some rule (developed qualitatively). The rules of harmonizational transformation are encoded in the object hrule:
>hrule
raw1 raw2 h1
1 0 0 0
2 0 1 1
3 0 NA 0
4 1 0 1
5 1 1 1
6 1 NA 1
7 NA 0 0
8 NA 1 1
9 NA NA NA
Thus, the rule should be read for row 1 as:
if respondent provides a value of 0 on raw1 and the value of 0 on raw2 then the value of h1 should be 0.
Functional objective
Develop a function that takes ds, hrule, the names of the raw variables as a character vector (c("raw1","raw2")), and the name of the harmonization variable ("h1"), and outputs a new harmonized variable (ds$h1).
Starter code
(ds <- data.frame("id" = 1:10,
"raw1" = c(1,1,0,0,NA,NA,1 ,0 ,NA,1),
"raw2" = c(1,0,1,0,1 ,0 ,NA,NA,NA,1)))
(response_profile <- ds %>% dplyr::group_by(raw1, raw2) %>% dplyr::summarize(count=dplyr::n()))
(hrule <- cbind(response_profile, "h1" = c(0,1,0,1,1,1,0,1,NA)))
new_function <- function(ds, hrule,
variable_names, # variable_names = c("raw1", "raw2"), the number will vary
harmony_name # harmony_name = "h1", there might be "h2"
){
}
Thanks in advance for your ideas!
Here's the full solution, suggested by @Symbolix
rm(list=ls(all=TRUE)) #Clear the memory of variables from previous run. This is not called by knitr, because it's above the first chunk.
cat("\f")
library(magrittr)
(ds <- data.frame("id" = 1:10,
"raw1" = c(1,1,0,0,NA,NA,1 ,0 ,NA,1),
"raw2" = c(1,0,1,0,1 ,0 ,NA,NA,NA,1)))
response_profile <- ds %>% dplyr::group_by(raw1, raw2) %>% dplyr::summarize(count=dplyr::n()) %>% dplyr::select(-count) # dplyr::n() because dplyr itself is not attached
(hrule <- cbind(response_profile,
"h1" = c(0,1,0 ,1,1,1 ,0 ,1 ,NA), # at least one 1 to produce 1
"h2"= c(0,0,NA,0,1,NA,NA,NA,NA) # both must be 1
))
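recode_from_meta <- function(ds, hrule, variable_names, harmony_name){
  # left-join the rule table onto the data; merge() matches NA keys to NA and sorts the result by the keys
  merge(ds, hrule[, c(variable_names, harmony_name)], by=variable_names, all.x=TRUE)
}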
> hrule
raw1 raw2 h1 h2
1 0 0 0 0
2 0 1 1 0
3 0 NA 0 NA
4 1 0 1 0
5 1 1 1 1
6 1 NA 1 NA
7 NA 0 0 NA
8 NA 1 1 NA
9 NA NA NA NA
> (d <- recode_from_meta(ds, hrule,variable_names=c("raw1", "raw2"), harmony_name="h1"))
raw1 raw2 id h1
1 0 0 4 0
2 0 1 3 1
3 0 NA 8 0
4 1 0 2 1
5 1 1 1 1
6 1 1 10 1
7 1 NA 7 1
8 NA 0 6 0
9 NA 1 5 1
10 NA NA 9 NA
> (d <- recode_from_meta(ds, hrule,variable_names=c("raw1", "raw2"), harmony_name="h2"))
raw1 raw2 id h2
1 0 0 4 0
2 0 1 3 0
3 0 NA 8 NA
4 1 0 2 0
5 1 1 1 1
6 1 1 10 1
7 1 NA 7 NA
8 NA 0 6 NA
9 NA 1 5 NA
10 NA NA 9 NA
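Note that merge() reorders the rows by the key columns, as the id column in the output shows. If the original row order matters, one possible alternative is a lookup with match() on pasted keys. This is only a sketch: recode_by_lookup and make_key are hypothetical names, hrule is typed out by hand here, and pasting assumes the separator never occurs in the data.

```r
ds <- data.frame(id = 1:10,
                 raw1 = c(1, 1, 0, 0, NA, NA, 1, 0, NA, 1),
                 raw2 = c(1, 0, 1, 0, 1, 0, NA, NA, NA, 1))
hrule <- data.frame(raw1 = c(0, 0, 0, 1, 1, 1, NA, NA, NA),
                    raw2 = c(0, 1, NA, 0, 1, NA, 0, 1, NA),
                    h1   = c(0, 1, 0, 1, 1, 1, 0, 1, NA))

# Hypothetical helper: look up each row's rule by a pasted key, keeping row order
recode_by_lookup <- function(ds, hrule, variable_names, harmony_name) {
  make_key <- function(d) do.call(paste, c(unname(as.list(d[variable_names])), sep = "|"))
  ds[[harmony_name]] <- hrule[[harmony_name]][match(make_key(ds), make_key(hrule))]
  ds
}

out <- recode_by_lookup(ds, hrule, c("raw1", "raw2"), "h1")
out
```

paste() turns NA into the string "NA", which is what lets the NA rows of the rule table match here.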

How to collapse very large sparse dataframes

I want to sum about 10000 columns like colSparseX over the 1500 sparse rows of a data frame, grouped by colID. I tried this on OriginalDataframe:
coldatfra <- aggregate(. ~colID,datfra,sum)
and this:
coldatfra <- ddply(datfra, .(colID), numcolwise(sum))
But neither works.
If I have the input:
colID <- c(rep(seq(1:6),2), rep(seq(1:2),3))
colSparse1 <- c(rep(1,5), rep(0,4), rep(1,2), rep(0,5), rep(1,2))
cPlSpars2 <- c(rep(1,3), rep(0,6), rep(1,2), rep(0,5), rep(1,2))
coMSparse3 <- c(rep(1,6), rep(0,3), rep(1,2), rep(0,5), rep(1,2))
colSpArseN <- c(rep(1,2), rep(0,7), rep(1,2), rep(0,5), rep(1,2))
(datfra <- data.frame(colID, colSparse1, cPlSpars2, coMSparse3, colSpArseN))
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1 1 1 1 1
2 1 1 1 1
3 1 1 1 0
4 1 0 1 0
5 1 0 1 0
6 0 0 1 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 1 1 1 1
5 1 1 1 1
6 0 0 0 0
1 0 0 0 0
2 0 0 0 0
1 0 0 0 0
2 0 0 0 0
1 1 1 1 1
2 1 1 1 1
I want to sum the elements for each colID across all of the colSparse columns (10000 of them, so the solution needs a placeholder rather than explicit column names, since the names are arbitrary words) in order to get this:
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1 2 2 2 2
2 2 2 2 2
3 1 1 1 0
4 2 1 2 1
5 2 1 2 1
6 0 0 1 0
Note: str(OriginalDataframe)
'data.frame': 1500 obs. of 10000 variables:
$ someword : num 0 0 0 0 0 0 0 0 0 0 ...
$ anotherword : num 0 0 0 0 0 0 0 0 0 0 ...
And on a smaller version (which was terminated) of the OriginalDataframe treated with ddply(datfra, .(colID), numcolwise(sum)) I get:
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1 0019 0 0 0 0
NA <NA> NA NA NA NA
NA.1 <NA> NA NA NA NA
NA.2 <NA> NA NA NA NA
NA.3 <NA> NA NA NA NA
Take a look at my answer to this question:
Mean per group in a data.frame
Your question is similar. If you change the function being applied from mean to sum, you get what you are looking for.
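library(data.table)
mydt <- as.data.table(datfra) # assumes datfra from the question
colstosum <- setdiff(names(mydt), "colID") # every column except the grouping key
mydt.sum <- mydt[,lapply(.SD,sum,na.rm=TRUE),by=colID,.SDcols=colstosum]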
mydt.sum
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1: 1 2 2 2 2
2: 2 2 2 2 2
3: 3 1 1 1 0
4: 4 2 1 2 1
5: 5 2 1 2 1
6: 6 0 0 1 0
Granted, I can't guarantee the speed or lack thereof of sum on a large data.table. Also, there is a way you should be able to incorporate colSums in the lapply function, but I can't figure out the syntax at the moment.
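For a grouped column sum in base R, rowsum() may be what you are after: it sums the rows of a matrix or data frame within each level of a grouping vector, so no column names need to be spelled out. A minimal sketch on the example data from the question (res is my own name, and I have not benchmarked this on a 1500 x 10000 data frame):

```r
colID <- c(rep(1:6, 2), rep(1:2, 3))
colSparse1 <- c(rep(1, 5), rep(0, 4), rep(1, 2), rep(0, 5), rep(1, 2))
cPlSpars2  <- c(rep(1, 3), rep(0, 6), rep(1, 2), rep(0, 5), rep(1, 2))
coMSparse3 <- c(rep(1, 6), rep(0, 3), rep(1, 2), rep(0, 5), rep(1, 2))
colSpArseN <- c(rep(1, 2), rep(0, 7), rep(1, 2), rep(0, 5), rep(1, 2))
datfra <- data.frame(colID, colSparse1, cPlSpars2, coMSparse3, colSpArseN)

# Sum every non-key column within each colID; groups come back as row names
res <- rowsum(datfra[-1], group = datfra$colID)
res <- data.frame(colID = as.integer(rownames(res)), res, row.names = NULL)
res
```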
