How to collapse very large sparse dataframes - r

I want to sum about 10000 columns like colSparseX over 1500 sparse rows of a dataframe, grouped by colID. On OriginalDataframe I tried this:
coldatfra <- aggregate(. ~colID,datfra,sum)
and this:
coldatfra <- ddply(datfra, .(colID), numcolwise(sum))
But neither works. If I have the input:
colID <- c(rep(seq(1:6),2), rep(seq(1:2),3))
colSparse1 <- c(rep(1,5), rep(0,4), rep(1,2), rep(0,5), rep(1,2))
cPlSpars2 <- c(rep(1,3), rep(0,6), rep(1,2), rep(0,5), rep(1,2))
coMSparse3 <- c(rep(1,6), rep(0,3), rep(1,2), rep(0,5), rep(1,2))
colSpArseN <- c(rep(1,2), rep(0,7), rep(1,2), rep(0,5), rep(1,2))
(datfra <- data.frame(colID, colSparse1, cPlSpars2, coMSparse3, colSpArseN))
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1 1 1 1 1
2 1 1 1 1
3 1 1 1 0
4 1 0 1 0
5 1 0 1 0
6 0 0 1 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 1 1 1 1
5 1 1 1 1
6 0 0 0 0
1 0 0 0 0
2 0 0 0 0
1 0 0 0 0
2 0 0 0 0
1 1 1 1 1
2 1 1 1 1
I want to sum the elements for each ID across all colSparse columns (10000 columns, so the solution needs some placeholder, because the column names are highly variable words) in order to get this:
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1 2 2 2 2
2 2 2 2 2
3 1 1 1 0
4 2 1 2 1
5 2 1 2 1
6 0 0 1 0
Note: str(OriginalDataframe)
'data.frame': 1500 obs. of 10000 variables:
$ someword : num 0 0 0 0 0 0 0 0 0 0 ...
$ anotherword : num 0 0 0 0 0 0 0 0 0 0 ...
And on a smaller version (which was terminated) of the OriginalDataframe treated with ddply(datfra, .(colID), numcolwise(sum)) I get:
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1 0019 0 0 0 0
NA <NA> NA NA NA NA
NA.1 <NA> NA NA NA NA
NA.2 <NA> NA NA NA NA
NA.3 <NA> NA NA NA NA

Take a look at my answer to this question:
Mean per group in a data.frame
Your question is similar. If you change the function being applied from mean to sum, you get what you are looking for.
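library(data.table)
mydt <- as.data.table(datfra) # convert the example data frame to a data.table
colstosum <- setdiff(names(mydt), "colID") # every column except the grouping column
mydt.sum <- mydt[,lapply(.SD,sum,na.rm=TRUE),by=colID,.SDcols=colstosum]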
mydt.sum
colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1: 1 2 2 2 2
2: 2 2 2 2 2
3: 3 1 1 1 0
4: 4 2 1 2 1
5: 5 2 1 2 1
6: 6 0 0 1 0
Granted, I can't guarantee the speed or lack thereof of sum on a large data.table. Also, there is a way you should be able to incorporate colSums in the lapply function, but I can't figure out the syntax at the moment.
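As a possible base R alternative to the colSums idea mentioned above, rowsum() computes column sums per group on a matrix in a single call. The following is only a sketch on the example data from the question; whether it is fast enough for 1500 x 10000 values has not been measured here.

```r
# Example data copied from the question
colID      <- c(rep(seq(1:6), 2), rep(seq(1:2), 3))
colSparse1 <- c(rep(1, 5), rep(0, 4), rep(1, 2), rep(0, 5), rep(1, 2))
cPlSpars2  <- c(rep(1, 3), rep(0, 6), rep(1, 2), rep(0, 5), rep(1, 2))
coMSparse3 <- c(rep(1, 6), rep(0, 3), rep(1, 2), rep(0, 5), rep(1, 2))
colSpArseN <- c(rep(1, 2), rep(0, 7), rep(1, 2), rep(0, 5), rep(1, 2))
datfra <- data.frame(colID, colSparse1, cPlSpars2, coMSparse3, colSpArseN)

# Select the value columns by excluding the ID column, so no column names
# have to be typed out (useful when the names are arbitrary words)
value_cols <- setdiff(names(datfra), "colID")

# rowsum() adds up the rows of a numeric matrix within each group
sums <- rowsum(as.matrix(datfra[value_cols]), group = datfra$colID)
sums
```

The result is a matrix with one row per colID (the IDs become the row names) and one column per original value column.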

Related

How to populate a matrix based on criteria for rows and columns (Updated 2x)

Updated 2x I have checked the issue and now I can give a better explanation.
I am trying to do a schedule planning using r. My issue is explained next.
I have a set of n workers who need to work over a period of m months (p days in total). The only conditions that need to be satisfied are the following:
Every day, np workers must be working in the factory.
Every worker must have r=8 rest days per month, while np workers are still working each day.
These are the only conditions. I think I could use the days of the month to fill in the working days, but I am not sure how to assign the rest days so that each worker has only 8 rest days each month. As an example, I would set up a matrix for 12 workers over two months, with np=8 workers in the factory each day and r=8 rest days per worker per month.
I have a code like this:
#Workers
n <- 12;
#Months (Days)
p <- 59;
#Number of people required each day
np <- 8;
#Rest days per month
r <- 8
#Matrix
#Days
vday <- seq(as.Date('2023-02-01'),as.Date('2023-03-31'),by=1)
mm <- matrix(data =NA,nrow = n,ncol = length(vday))
dimnames(mm)[[2]]<-as.character(vday)
But I am struggling to find a way to have np=8 people working each day while each of them rests only r=8 days per month. Rest days could be allocated at random each month, taking that condition into account. I would use 1 for working and 0 for resting.
I think the way you are naming the variables is misleading, as np looks to me like a multiplication. I propose a function that handles your problem with different variable names.
This function returns an error in two cases:
The total number of workers is not evenly divisible by the group size (the number of workers per group). For example, with groups of 2 the total number of workers must be a multiple of 2 (2, 4, 6, ..., 14). The function could accommodate non-integer division, but the coding gets a bit more complex and you would have to specify what to do with the extra workers.
The work-rest schedule is not feasible with the number of groups you plan to use. For instance, a 2-day work/3-day rest schedule is not possible with 2 groups of workers (unless nobody works on certain days).
If no error occurs then the function returns a data.frame with the following characteristics:
Column one the worker ID
Column two the group ID
All remaining columns represent the working days and can have the following values: 1 - the worker has to work, 0 - the worker has to rest, NA - the worker has neither to work nor rest.
FUNCTION CODE:
work_schedule <- function(workers, total_days, group_size, day_streak, rest_days){
  if(workers %% group_size != 0){
    stop("workers are not divisible in groups of equal sizes")
  }
  n_groups <- workers / group_size
  # NOTE FOR MORE THAN 26 GROUPS THE GROUP NAMING MUST BE CHANGED
  df <- data.frame(worker = seq_len(workers),
                   group = rep(LETTERS[seq_len(n_groups)], each = group_size))
  schedule <- matrix(nrow = nrow(df), ncol = total_days)
  pttrn <- c(rep(1, day_streak), rep(0, rest_days))
  d0 <- 1
  g_names <- unique(df$group)
  g_ind <- 1
  while(d0 <= total_days){
    d1 <- d0+length(pttrn)-1
    if(d0+length(pttrn)-1 > total_days){
      d1 <- total_days
    }
    mt <- t(schedule[df$group == g_names[g_ind], d0:d1])
    if( !all(is.na(mt)) ) stop("Not enough groups to comply with working and resting days schedule")
    mt[,] <- pttrn[1:length(d0:d1)]
    schedule[df$group == g_names[g_ind], d0:d1] <- t(mt)
    d0 <- d0 + day_streak
    g_ind <- g_ind + 1
    if(g_ind > length(g_names)){
      g_ind <- 1
    }
  }
  colnames(schedule) <- paste0("D", seq_len(total_days))
  return(cbind(df, schedule))
}
EXAMPLES:
# Example 1
work_schedule(12, 15, 3, 2, 2)
worker group D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
1 1 A 1 1 0 0 NA NA NA NA 1 1 0 0 NA NA NA
2 2 A 1 1 0 0 NA NA NA NA 1 1 0 0 NA NA NA
3 3 A 1 1 0 0 NA NA NA NA 1 1 0 0 NA NA NA
4 4 B NA NA 1 1 0 0 NA NA NA NA 1 1 0 0 NA
5 5 B NA NA 1 1 0 0 NA NA NA NA 1 1 0 0 NA
6 6 B NA NA 1 1 0 0 NA NA NA NA 1 1 0 0 NA
7 7 C NA NA NA NA 1 1 0 0 NA NA NA NA 1 1 0
8 8 C NA NA NA NA 1 1 0 0 NA NA NA NA 1 1 0
9 9 C NA NA NA NA 1 1 0 0 NA NA NA NA 1 1 0
10 10 D NA NA NA NA NA NA 1 1 0 0 NA NA NA NA 1
11 11 D NA NA NA NA NA NA 1 1 0 0 NA NA NA NA 1
12 12 D NA NA NA NA NA NA 1 1 0 0 NA NA NA NA 1
# Example 2
work_schedule(14, 15, 7, 2, 2)
worker group D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
1 1 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
2 2 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
3 3 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
4 4 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
5 5 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
6 6 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
7 7 A 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
8 8 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
9 9 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
10 10 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
11 11 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
12 12 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
13 13 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
14 14 B NA NA 1 1 0 0 1 1 0 0 1 1 0 0 1
Since it remains unclear what should happen when there are not enough rested workers to fill a full team for the day, this solution assumes there are always enough workers for a rested team to be available every day. It simply rotates through the workers.
library(tidyverse)
n <- 15; p <- 15; np <- 7; k <- 2; r <- 2
sol <- rep(c(integer(k-1), 1), length.out = p-1) |> # Vector of days, 1 for rotation, 0 for same crew
  accumulate(
    .f = \(lhs, rhs) {if (rhs) {c(tail(lhs, np), head(lhs, n-np))} else {lhs}}, # move last np elements to the front of the vector
    .init = c(rep(1L, np), rep(0L, n-np)) # init vector
  ) %>%
  set_names(str_c("D", seq_along(.))) %>%
  c(list(Workers = seq_len(n)), .) |> # Add workers col
  do.call(what = cbind)
sol
#> Workers D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
#> [1,] 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [2,] 2 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [3,] 3 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [4,] 4 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [5,] 5 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1
#> [6,] 6 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1
#> [7,] 7 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1
#> [8,] 8 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [9,] 9 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [10,] 10 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [11,] 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [12,] 12 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0
#> [13,] 13 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0
#> [14,] 14 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0
#> [15,] 15 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0
If you care about the distinction between resting (0) and rested but not working (NA), you can add it like this:
sol[, -1] <- sol[, -1] |>
apply(1, \(x) ifelse((cumsum(x==0) %>% {. - lag(., r+1, 0)})>r, NA_integer_, x)) |>
t()
sol
#> Workers D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
#> [1,] 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [2,] 2 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [3,] 3 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [4,] 4 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
#> [5,] 5 1 1 0 0 1 1 0 0 1 1 0 0 NA NA 1
#> [6,] 6 1 1 0 0 1 1 0 0 NA NA 1 1 0 0 1
#> [7,] 7 1 1 0 0 NA NA 1 1 0 0 1 1 0 0 1
#> [8,] 8 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [9,] 9 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [10,] 10 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [11,] 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> [12,] 12 0 0 1 1 0 0 1 1 0 0 1 1 0 0 NA
#> [13,] 13 0 0 1 1 0 0 1 1 0 0 NA NA 1 1 0
#> [14,] 14 0 0 1 1 0 0 NA NA 1 1 0 0 1 1 0
#> [15,] 15 0 0 NA NA 1 1 0 0 1 1 0 0 1 1 0
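Whichever way a schedule is generated, the two conditions from the question can be checked directly on a 0/1 matrix. The helper below is a hypothetical sketch, not part of either answer: check_schedule() and its arguments are made-up names, and it assumes one row per worker, one column per day, and a month label for each day. It reads "np workers each day" as "at least np", which is an assumption.

```r
# Hypothetical helper: check a schedule matrix against the two conditions.
# mm:    workers x days matrix (1 = working, 0 = resting, NA = neither)
# month: vector with one month label per column of mm
check_schedule <- function(mm, month, np, r) {
  working <- !is.na(mm) & mm == 1
  resting <- !is.na(mm) & mm == 0
  # Number of workers in the factory on each day
  staff_per_day <- colSums(working)
  # Rest days per worker and month: sum the day rows within each month
  rest_per_month <- t(rowsum(t(resting) * 1L, group = month))
  list(staff_per_day = staff_per_day,
       rest_per_month = rest_per_month,
       ok = all(staff_per_day >= np) && all(rest_per_month == r))
}

# Tiny illustration: 3 workers, 4 days in one month, np = 2, r = 1
days <- seq(as.Date("2023-02-01"), by = 1, length.out = 4)
mm <- matrix(c(0, 1, 1,
               1, 0, 1,
               1, 1, 0,
               1, 1, 1), nrow = 3,
             dimnames = list(NULL, as.character(days)))
res <- check_schedule(mm, month = format(days, "%Y-%m"), np = 2, r = 1)
res$ok
```

For the matrices printed above, the worker ID column would have to be dropped first, and the D1, D2, ... columns would need a month label each.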

Creating dummies with apply in R

I have data about different study strategies for individuals (stored in columns labeled Strategy_A, Strategy_B, Strategy_C). The strategies are coded 1-15. I want to create a dummy for each strategy (e.g. strategy1, strategy2, etc.) because each student can list up to 3 strategies.
Example Data
ID = c(1, 2, 3, 4, 5)
Strategy_A = c(10, 12, 13, 1, 2)
Strategy_B = c(1, 2, 1, 4, 5)
Strategy_C = c(2, 3, 6, 8, 15)
all = data.frame(ID, Strategy_A, Strategy_B, Strategy_C)
I thought about using apply and creating a function linked to the fastDummies package.
dummies = function(x){
dummy_cols(x)
}
new = apply(all [,-1], 2, dummies)
new = as.data.frame(new)
However, this creates dummies for StrategyA_1 StrategyA_2 StrategyA_3 rather than summarizing the dummies as Strategy1 Strategy2 Strategy3. Any ideas how to fix this?
After a small transformation of all, you can use dummy.data.frame() from dummies (you can also use dummy_cols() from fastDummies) and then aggregate per ID.
all <- data.frame(ID = rep(all$ID, 3),
Strategy = c(all$Strategy_A, all$Strategy_B, all$Strategy_C)) # data frame "all" with one column Strategy
library(dummies)
all <- dummy.data.frame(all, "Strategy") # or fastDummies::dummy_cols(all, "Strategy")
aggregate(. ~ ID, all, sum) # since strategies are now dummies, the sum will always be 0 or 1
# output
ID Strategy1 Strategy2 Strategy3 Strategy4 Strategy5 Strategy6 Strategy8 Strategy10 Strategy12 Strategy13 Strategy15
1 1 1 1 0 0 0 0 0 1 0 0 0
2 2 0 1 1 0 0 0 0 0 1 0 0
3 3 1 0 0 0 0 1 0 0 0 1 0
4 4 1 0 0 1 0 0 1 0 0 0 0
5 5 0 1 0 0 1 0 0 0 0 0 1
Here is a tidyverse approach.
library(tidyverse)
new <- all %>% gather(select = -ID) %>%
mutate(key = NULL, num = 1) %>%
spread(value, num)
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 NA NA NA NA NA 1 NA NA NA
# 2 2 NA 1 1 NA NA NA NA NA 1 NA NA
# 3 3 1 NA NA NA NA 1 NA NA NA 1 NA
# 4 4 1 NA NA 1 NA NA 1 NA NA NA NA
# 5 5 NA 1 NA NA 1 NA NA NA NA NA 1
new[is.na(new)] <- 0
new
# ID 1 2 3 4 5 6 8 10 12 13 15
# 1 1 1 1 0 0 0 0 0 1 0 0 0
# 2 2 0 1 1 0 0 0 0 0 1 0 0
# 3 3 1 0 0 0 0 1 0 0 0 1 0
# 4 4 1 0 0 1 0 0 1 0 0 0 0
# 5 5 0 1 0 0 1 0 0 0 0 0 1
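For comparison, the same dummies can be sketched in base R with table(), without extra packages. This is only an illustration on the example data; declaring the strategy as a factor with levels 1:15 is an assumption that makes all 15 dummy columns appear, including strategies nobody listed.

```r
# Example data from the question
all <- data.frame(ID = c(1, 2, 3, 4, 5),
                  Strategy_A = c(10, 12, 13, 1, 2),
                  Strategy_B = c(1, 2, 1, 4, 5),
                  Strategy_C = c(2, 3, 6, 8, 15))

# Stack the three strategy columns into one long column, repeating the IDs
long <- data.frame(ID = rep(all$ID, times = 3),
                   Strategy = unlist(all[-1], use.names = FALSE))

# Cross-tabulate ID against strategy; fixed levels keep unused strategies
counts <- table(long$ID, factor(long$Strategy, levels = 1:15))

# Turn counts into 0/1 indicators and give the columns readable names
m <- (unclass(counts) > 0) * 1L
colnames(m) <- paste0("Strategy", colnames(m))
dummies <- cbind(ID = as.numeric(rownames(m)), as.data.frame(m))
dummies
```

Using "> 0" rather than the raw counts keeps the result a 0/1 dummy even if a student lists the same strategy twice.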

Modify the column value by other columns in r

I have a CSV table (as a data frame). I want to modify a specific column value by other columns values.
I have prepared a code, but it doesn't work.
The data frame contains 1076 rows and 156 columns.
The logic has to be like this:
if ((a[i,"0Q-state"] == "done") && is.na(a[i,"0Q-01"])) a[i,"0Q-01"] = 0;
else a[i,"0Q-01"] = a[i,"0Q-01"];
but I don't know how to do this in R.
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 NA
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 NA 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
sapply(c("0Q-01","0Q-02","0Q-03","0Q-04","0Q-05","0Q-06","0Q-07","0Q-08","0Q-09"),
function(y) {
dataset4[,y] <- sapply(c(1:1076), function(x)
ifelse (((is.na(dataset4[x,y])) && (dataset4[x,c("0Q-state")] == "done"))
,0, dataset4[x,y]))}
)
Output has to be:
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 0 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
we could try:
df[rep(df[, 1] == "done", ncol(df)) & is.na(df)] <- 0
df
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
or using sapply():
myFunc <- function(x, y) ifelse(is.na(x) & y == "done", 0, x)
data.frame(df[, 1], sapply(df[, -1], myFunc, y = df[, 1]))
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
where you can always substitute df[, 1] with df[, "0Q-state"] and df[, -1] with df[, namesOfDummyVars]
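The mask idea behind both variants can be tried on a small reproducible data frame. This is a minimal sketch: the data frame demo and the column names state, q1 and q2 are made up for illustration and stand in for 0Q-state and the 0Q-xx columns.

```r
# Small stand-in for the real table (hypothetical names and values)
demo <- data.frame(state = c("done", "", "done"),
                   q1 = c(1, NA, NA),
                   q2 = c(NA, 1, 1),
                   stringsAsFactors = FALSE)

value_cols <- setdiff(names(demo), "state")

# is.na() on a data frame gives a logical matrix; the row-wise condition
# (state == "done") is recycled down each column of that matrix
mask <- is.na(demo[value_cols]) & (demo$state == "done")

# Replace only the NA cells that sit in "done" rows
demo[value_cols][mask] <- 0
demo
```

The NA in the second row stays untouched because its state is not "done".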
The question has been tagged with data.table and the printed output of dataset4 suggests that dataset4 already is a data.table object.
Here are three variants in data.table syntax to replace NAs in rows which are marked as "done".
# create vector of names of columns to be changed
cols <- sprintf("0Q-%02i", 1:9)
# variant 1
dataset4[`0Q-state` == "done",
(cols) := lapply(.SD, function(x) replace(x, is.na(x), 0L)),
.SDcols = cols][]
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 NA 1 1 NA
3: done 1 1 1 0 1 1 1 1 1
4: done 1 1 1 1 0 0 0 1 0
5: done 1 1 1 1 0 0 0 1 0
6: 1 NA 1 0 0 0 1 0 NA
7: done 1 1 1 1 0 0 0 1 0
or
# variant 2
lapply(cols, function(i) dataset4[`0Q-state` == "done" & is.na(get(i)), (i) := 0L])
dataset4
returning the same as above
or
# variant 3 --- data.table development version 1.10.5
for (i in cols)
set(dataset4, which(dataset4[, "0Q-state"] == "done" & is.na(dataset4[, ..i])), i, 0L)
dataset4

Row-wise operation by group over time R

Problem:
I am trying to create a variable x2 which is equal to 1 for all rows within each ID group where, over time, x1 switches from 1 to 0.
Additionally, x2 is set to 1 for every consecutive 0 in the run after the switch.
I tried to figure out how to do this using library(dplyr), but could not figure out how to look at previous records within the group.
Input Data:
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<-c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1")
df<-data.frame(ID,time,x1)
Required Output:
ID time x1 x2
1 1 0 0
1 2 1 0
1 3 1 0
1 4 1 0
1 5 1 0
2 1 0 0
2 2 0 0
2 3 0 0
2 4 0 0
3 1 1 0
3 2 0 1
3 3 0 1
4 1 1 0
4 2 1 0
5 1 1 0
5 2 0 1
5 3 1 0
It is better to have the 'x1' as numeric column
library(data.table)
setDT(df)[, x2 := (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)), ID]
df
# ID time x1 x2
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 1 0
# 6: 2 1 0 0
# 7: 2 2 0 0
# 8: 2 3 0 0
# 9: 2 4 0 0
#10: 3 1 1 0
#11: 3 2 0 1
#12: 3 3 0 1
#13: 4 1 1 0
#14: 4 2 1 0
#15: 5 1 1 0
#16: 5 2 0 1
#17: 5 3 1 0
data
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
If you want a dplyr answer, you can use @akrun's code in mutate after grouping by ID
library(dplyr)
ID<-c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time<-c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1<- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df<-data.frame(ID,time,x1)
df <- df %>%
group_by(ID) %>%
mutate(x2 = (cumsum(x1) < 2)*cumsum(c(FALSE, diff(x1) < 0)))
df
# ID time x1 x2
# 1 1 0 0
# 1 2 1 0
# 1 3 1 0
# 1 4 1 0
# 1 5 1 0
# 2 1 0 0
# 2 2 0 0
# 2 3 0 0
# 2 4 0 0
# 3 1 1 0
# 3 2 0 1
# 3 3 0 1
# 4 1 1 0
# 4 2 1 0
# 5 1 1 0
# 5 2 0 1
# 5 3 1 0
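A different way to express the rule, shown here only as a sketch, is to flag every 0 that comes after at least one earlier 1 in the same ID group. This is not the code from the answers above; it assumes the rows are already sorted by time within each ID and that x1 only contains 0 and 1. On the example data it reproduces the required output.

```r
ID   <- c("1","1","1","1","1","2","2","2","2","3","3","3","4","4","5","5","5")
time <- c("1","2","3","4","5","1","2","3","4","1","2","3","1","2","1","2","3")
x1   <- as.integer(c("0","1","1","1","1","0","0","0","0","1","0","0","1","1","1","0","1"))
df <- data.frame(ID, time, x1)

# Running maximum of x1 within each ID: becomes 1 once a 1 has been seen
seen_one <- ave(df$x1, df$ID, FUN = cummax)

# x2 is 1 where x1 is currently 0 but a 1 occurred earlier in the group
df$x2 <- as.integer(df$x1 == 0 & seen_one == 1)
df
```

Under this reading, a group such as 1, 1, 0 would also get x2 = 1 on its last row, and a later return to 1 resets x2 to 0.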

R to recode variables if the categorical variable's frequency is lower than a defined value

Here is an example for the dataset (d):
rs3 rs4 rs5 rs6
1 0 0 0
1 0 1 0
0 0 0 0
2 0 1 0
0 0 0 0
0 2 0 1
0 2 NA 1
0 2 2 1
NA 1 2 1
To check the frequency of the SNP genotype (0,1,2), we can use the table command
table (d$rs3)
The output would be
0 1 2
5 2 1
Here we want to recode genotype 2 as 1 in every variable where genotype 2's frequency is <3. The recoded output should be
rs3 rs4 rs5 rs6
1 0 0 0
1 0 1 0
0 0 0 0
1 0 1 0
0 0 0 0
0 2 0 1
0 2 NA 1
0 2 1 1
NA 1 1 1
I have 70000 SNPs that need to be checked and recoded. How can I use a for loop or another method to do that in R?
Here's another possible (vectorized) solution
indx <- colSums(d == 2, na.rm = TRUE) < 3 # Select columns by condition
d[indx][d[indx] == 2] <- 1 # Insert 1 where the subset selected by the condition equals 2
d
# rs3 rs4 rs5 rs6
# 1 1 0 0 0
# 2 1 0 1 0
# 3 0 0 0 0
# 4 1 0 1 0
# 5 0 0 0 0
# 6 0 2 0 1
# 7 0 2 NA 1
# 8 0 2 1 1
# 9 NA 1 1 1
We can try
d[] <- lapply(d, function(x)
if(sum(x==2, na.rm=TRUE) < 3) replace(x, x==2, 1) else x)
d
# rs3 rs4 rs5 rs6
#1 1 0 0 0
#2 1 0 1 0
#3 0 0 0 0
#4 1 0 1 0
#5 0 0 0 0
#6 0 2 0 1
#7 0 2 NA 1
#8 0 2 1 1
#9 NA 1 1 1
Or the same methodology can be used in dplyr (version 1.0.0 or later, which provides across())
library(dplyr)
d %>%
mutate(across(everything(), ~ if(sum(.x==2, na.rm=TRUE) < 3)
replace(.x, .x==2, 1) else .x))
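Since the question also asks about a for loop, here is a minimal loop-based sketch of the same recoding. The data frame is rebuilt from the table in the question, and threshold is a made-up name for the cutoff of 3.

```r
# Example data rebuilt from the table in the question
d <- data.frame(rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
                rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
                rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
                rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1))

threshold <- 3  # recode genotype 2 when it occurs fewer than this many times

for (snp in names(d)) {
  # TRUE where the genotype is 2; missing values count as FALSE
  is_two <- !is.na(d[[snp]]) & d[[snp]] == 2
  if (sum(is_two) < threshold) {
    d[[snp]][is_two] <- 1
  }
}
d
```

The loop touches one column at a time, so it stays simple to read, but the vectorized versions above are likely to be the better fit for 70000 columns.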
