I would like to collapse a data frame with < 100 columns fourfold,
whereby the code would iterate over groups of 4 adjacent columns and collapse them into one.
However, the resulting value for each set of 4 depends on the priority of the values in it.
For example, the highest priority is '1', so whenever any of the 4 columns has a value '1' for that row it should be the resulting value. The second priority is 0, if the set has one '0' and three NA's, the result should be '0' (as long as there's no '1's). The lowest priority is NA, only sets consisting of NA completely would remain NA. An example below, with reproducible code underneath.
ID c1 c2 c3 c4 c5 c6 c7 c8
row1 1 0 0 0 1 0 0 NA
row2 NA NA NA 0 NA NA NA NA
becomes
ID c1 c2
row1 1 1
row2 0 NA
structure(list(ID = c("row1", "row2"), c1 = c(1, NA), c2 = c(0,
NA), c3 = c(0, NA), c4 = c(0, 0), c5 = c(1, NA), c6 = c(0, NA
), c7 = c(0, NA), c8 = c(NA, NA)), class = "data.frame", row.names = c(NA,
-2L))
How about this:
dat <- structure(list(ID = c("row1", "row2"), c1 = c(1, NA), c2 = c(0,
NA), c3 = c(0, NA), c4 = c(0, 0), c5 = c(1, NA), c6 = c(0, NA
), c7 = c(0, NA), c8 = c(NA, NA)), class = "data.frame", row.names = c(NA,
-2L))
out <- data.frame(ID = dat$ID)
k <- 2 # first column to start
i <- 1 # first variable name
while(k < ncol(dat)){
out[[paste0("c", i)]] <- apply(dat[,k:(k+3)], 1, max, na.rm=TRUE)
out[[paste0("c", i)]] <- ifelse(is.finite(out[[paste0("c", i)]]), out[[paste0("c", i)]], NA)
k <- k+4
i <- i+1
}
#> Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
out
#> ID c1 c2
#> 1 row1 1 1
#> 2 row2 0 NA
Created on 2022-11-21 by the reprex package (v2.0.1)
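A possible variation on the loop above, sketched under the assumption that the value columns come in complete groups of 4 right after ID. Wrapping max in a small helper (collapse_max is a name made up for this example) returns NA for an all-NA set directly, so the -Inf warning and the is.finite clean-up should not be needed.

```r
dat <- data.frame(ID = c("row1", "row2"),
                  c1 = c(1, NA), c2 = c(0, NA), c3 = c(0, NA), c4 = c(0, 0),
                  c5 = c(1, NA), c6 = c(0, NA), c7 = c(0, NA), c8 = c(NA, NA))

vals <- dat[-1]                           # everything except ID
grp  <- (seq_along(vals) - 1) %/% 4 + 1   # group index per column: 1 1 1 1 2 2 2 2

# hypothetical helper: NA if the whole set is NA, otherwise the largest value
collapse_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

out <- data.frame(ID = dat$ID)
for (g in unique(grp)) {
  out[[paste0("c", g)]] <- unname(apply(vals[grp == g], 1, collapse_max))
}
out   # c1 is 1, 0 and c2 is 1, NA for the example data
```

Since 1 > 0, taking the maximum is what implements the priority order here; it would not carry over to codes where the priority differs from the numeric order.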
Here is an alternative approach:
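library(data.table)
# 1 if any value is 1, 0 if there are only 0s and NAs, NA if everything is NA
f <- function(x) fifelse(all(is.na(x)), NA_real_, 1 * (sum(x, na.rm = TRUE) > 0))
dcast(
  melt(setDT(df), "ID", variable.name = "v")[
    # v is a factor in column order, so its integer code gives the group of 4
    , f(value), .(ID, r = paste0("c", (as.integer(v) - 1L) %/% 4L + 1L))],
  ID ~ r, value.var = "V1"
)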
Output:
ID c1 c2
1: row1 1 1
2: row2 0 NA
Use split.default to split the data frame into groups of 4 columns, then apply pmax to each group:
x <- df1
x[ is.na(x) ] <- -1
res <- cbind(df1[ "ID" ],
lapply(split.default(x[, -1], rep(1:2, each = 4)),
function(i) do.call(pmax, i)))
res[ res == -1 ] <- NA
res
# ID 1 2
# 1 row1 1 1
# 2 row2 0 NA
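pmax also accepts na.rm = TRUE, which ignores NAs unless every argument is NA at that position, so the -1 placeholder may not be necessary. A sketch, assuming the number of value columns is a multiple of 4; the "c" prefix on the output names is my own choice, not part of the answer above:

```r
df1 <- data.frame(ID = c("row1", "row2"),
                  c1 = c(1, NA), c2 = c(0, NA), c3 = c(0, NA), c4 = c(0, 0),
                  c5 = c(1, NA), c6 = c(0, NA), c7 = c(0, NA), c8 = c(NA, NA))

vals <- df1[-1]
grp  <- rep(seq_len(ncol(vals) / 4), each = 4)   # 1 1 1 1 2 2 2 2

# parallel maximum over the 4 columns of each group, skipping NAs
collapsed <- lapply(split.default(vals, grp), function(cols) {
  do.call(pmax, c(unname(as.list(cols)), list(na.rm = TRUE)))
})
names(collapsed) <- paste0("c", names(collapsed))

res <- cbind(df1["ID"], as.data.frame(collapsed))
res
```

Computing grp from ncol(vals) rather than hard-coding rep(1:2, each = 4) should let the same lines work for wider data frames.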
Related
I'm trying to convert my data in R, but I can't manage to get the column I want.
My dataset is as below, and the column I want to get is "total", it is the sum of D1 + D2 + D3 + D4 + D5, and ignores "NA".
NR D1 D2 D3 D4 D5 total
A 1 NA NA 1 NA 2
B NA NA NA NA NA NA
C NA 1 NA NA NA 1
It is probably quite a dumb question, but I can't get it to work.
I already tried:
total <- NA
total <- ifelse(D1==1, 1, total)
total <- ifelse(D2==1, total + 1, total)
total <- ifelse(D3==1, total + 1, total)
total <- ifelse(D4==1, total + 1, total)
total <- ifelse(D5==1, total + 1, total)
But it sets all my rows to "NA".
I also tried:
total <- mutate(dataset, total=D1+D2+D3+D4+D5)
but then I don't get an aggregation of the values of D1 to D5.
We could use rowSums
df1$total <- rowSums(df1[startsWith(names(df1), "D")], na.rm = TRUE)
df1$total[df1$total == 0] <- NA
Or the same logic in dplyr
library(dplyr)
df1 %>%
mutate(total = na_if(rowSums(select(., starts_with('D')), na.rm = TRUE), 0))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
data
df1 <- structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA), total = c(2L, NA, 1L)), class = "data.frame", row.names = c(NA,
-3L))
Here is a solution with c_across and rowwise
library(dplyr)
df %>%
rowwise() %>%
mutate(Total = sum(c_across(D1:D5 & where(is.numeric)), na.rm = TRUE))
Output:
NR D1 D2 D3 D4 D5 Total
<chr> <int> <int> <lgl> <int> <lgl> <int>
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA 0
3 C NA 1 NA NA NA 1
data:
structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA)), row.names = c(NA, -3L), class = "data.frame")
You can try the code below
df$total <- replace(u <- rowSums(!is.na(df)) - 1, u == 0, NA)
which gives
> df
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
And also this one:
library(dplyr)
library(purrr)
df1 <- df1[, !names(df1) %in% "total"]
df1 %>%
mutate(total = pmap_dbl(select(cur_data(), starts_with("D")), ~ ifelse(all(is.na(c(...))),
NA, sum(c(...), na.rm = TRUE))))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
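Some of the answers above turn a total of 0 into NA, which would also blank out a row whose values genuinely sum to 0. A base R sketch that instead checks whether a row has any non-missing value at all, assuming the columns to add up are exactly those whose names start with "D":

```r
df1 <- data.frame(NR = c("A", "B", "C"),
                  D1 = c(1L, NA, NA), D2 = c(NA, NA, 1L), D3 = c(NA, NA, NA),
                  D4 = c(1L, NA, NA), D5 = c(NA, NA, NA))

d <- df1[startsWith(names(df1), "D")]

total <- rowSums(d, na.rm = TRUE)         # NAs are treated as 0 here
total[rowSums(!is.na(d)) == 0] <- NA      # rows with nothing observed stay NA
df1$total <- unname(total)
df1
```

For the example data this gives 2, NA, 1, the same as the desired total column.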
A data set I'm using is the following:
C1 C2 C3 R1
R1 NA NA NA 5
R2 NA NA 0.4 7
R3 0.1 NA NA 6
R4 NA NA NA 2
From the data frame, I want to remove the rows that contain a number larger than zero in any of C1 to C3.
The final outcome must be:
C1 C2 C3 R1
R1 NA NA NA 5
R4 NA NA NA 2
I tried with:
df<- df %>% filter_at(vars('C1' : 'C2`), all_vars(. > 0))
but I got an error with this. How can I fix it?
You can use rowSums in base R :
cols <- paste0('C', 1:3)
df[rowSums(df[cols] > 0, na.rm = TRUE) == 0, ]
Or using filter_at :
library(dplyr)
df %>% filter_at(vars(C1:C3), all_vars(. <= 0 | is.na(.)))
# C1 C2 C3 R1
#R1 NA NA NA 5
#R4 NA NA NA 2
filter_at has since been superseded, so you can also write this with across as:
df %>% filter(across(C1:C3, ~. <= 0 | is.na(.)))
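In more recent dplyr versions (1.0.4 and later, as far as I know) there is also if_all(), which is meant for this kind of "condition must hold in every selected column" filter. A sketch on the same data; kept is just a name chosen for this example:

```r
library(dplyr)

df <- data.frame(C1 = c(NA, NA, 0.1, NA),
                 C2 = c(NA, NA, NA, NA),
                 C3 = c(NA, 0.4, NA, NA),
                 R1 = c(5L, 7L, 6L, 2L),
                 row.names = c("R1", "R2", "R3", "R4"))

# keep a row only if every one of C1..C3 is NA or not larger than zero
kept <- df %>% filter(if_all(C1:C3, ~ is.na(.x) | .x <= 0))
kept
```

Only the rows with R1 equal to 5 and 2 should remain.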
data
df <- structure(list(C1 = c(NA, NA, 0.1, NA), C2 = c(NA, NA, NA, NA
), C3 = c(NA, 0.4, NA, NA), R1 = c(5L, 7L, 6L, 2L)),
class = "data.frame", row.names = c("R1", "R2", "R3", "R4"))
A more manual approach is as follows:
library(data.table)
df <- as.data.table(df)
if(length(which(df$C1 > 0)) > 0){df <- df[-(which(df$C1 > 0)),]}
if(length(which(df$C2 > 0)) > 0){df <- df[-(which(df$C2 > 0)),]}
if(length(which(df$C3 > 0)) > 0){df <- df[-(which(df$C3 > 0)),]}
I am trying to replace the 0's in my data frame of thousands of rows and columns with half the minimum value greater than zero from that column. I also don't want to include the first four columns, as they are indexes.
So if I start with something like this:
index <- c("100p", "200p", "300p", "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA, "actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0,3,5,2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 0 0 1
200p 4 NA NA 3 0 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I would hope to gain a result such as this:
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 1 2 1
200p 4 NA NA 3 2 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I have tried the following code:
apply(q[-4], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
but I keep getting the error: Error in min(x[x > 0])/2 : non-numeric argument to binary operator
Any help on this? Thank you very much!
We can use lapply and replace the 0 values with half the minimum positive value in each column.
cols<- 5:7
q[cols] <- lapply(q[cols], function(x) replace(x, x == 0, min(x[x>0], na.rm = TRUE)/2))
q
# index ratio gene species a1 b1 c1
#1 100p 5 gapdh mouse 1 2 1
#2 200p 4 <NA> <NA> 3 2 2
#3 300p 3 <NA> <NA> 5 4 3
#4 400p 2 actb rat 2 6 4
In dplyr, we can use mutate_at
library(dplyr)
q %>% mutate_at(cols,~replace(., . == 0, min(.[.>0], na.rm = TRUE)/2))
data
q <- structure(list(index = structure(1:4, .Label = c("100p", "200p",
"300p", "400p"), class = "factor"), ratio = c(5, 4, 3, 2), gene = structure(c(2L,
NA, NA, 1L), .Label = c("actb", "gapdh"), class = "factor"),
species = structure(c(1L, NA, NA, 2L), .Label = c("mouse",
"rat"), class = "factor"), a1 = c(0, 3, 5, 2), b1 = c(0,
0, 4, 6), c1 = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))
A slightly different (and potentially faster for large datasets) dplyr option with a bit of maths could be:
q %>%
mutate_at(vars(5:length(.)), ~ (. == 0) * min(.[. != 0])/2 + .)
index ratio gene species a1 b1 c1
1 100p 5 gapdh mouse 1 2 1
2 200p 4 <NA> <NA> 3 2 2
3 300p 3 <NA> <NA> 5 4 3
4 400p 2 actb rat 2 6 4
And the same with base R:
q[, 5:length(q)] <- lapply(q[, 5:length(q)], function(x) (x == 0) * min(x[x != 0])/2 + x)
For reference, considering your original code, I believe your function was not the issue. Instead, the error comes from applying the function to non-numeric data.
# original data
index <- c("100p", "200p", "300p" , "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA,"actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0,3,5,2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
# data frame
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
# examine structure (all cols are factors; they are character in R >= 4.0.0)
str(q)
# convert factors to numeric
fac_to_num <- function(x){
x <- as.numeric(as.character(x))
x
}
# apply to cols 5 thru 7 only
q[, 5:7] <- apply(q[, 5:7],2,fac_to_num)
# examine structure
str(q)
# use original function only on numeric data
apply(q[, 5:7], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
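As a sketch of how the coercion can be avoided in the first place: building the data frame with data.frame() instead of cbind() keeps a1, b1 and c1 numeric, so no conversion step is needed. The helper name half_min and the guard for columns without any positive value are my own additions, not part of the answers above.

```r
q <- data.frame(index   = c("100p", "200p", "300p", "400p"),
                ratio   = c(5, 4, 3, 2),
                gene    = c("gapdh", NA, NA, "actb"),
                species = c("mouse", NA, NA, "rat"),
                a1 = c(0, 3, 5, 2),
                b1 = c(0, 0, 4, 6),
                c1 = c(1, 2, 3, 4))

num_cols <- names(q)[-(1:4)]   # skip the four index columns
stopifnot(all(vapply(q[num_cols], is.numeric, logical(1))))

# hypothetical helper: replace zeros with half the smallest positive value
half_min <- function(x) {
  pos <- x[!is.na(x) & x > 0]
  if (length(pos) == 0) return(x)          # nothing positive: leave the column alone
  replace(x, which(x == 0), min(pos) / 2)
}

q[num_cols] <- lapply(q[num_cols], half_min)
q
```

For a1 the smallest positive value is 2, so the zero becomes 1; for b1 it is 4, so both zeros become 2; c1 has no zeros and is unchanged.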
I have a dataframe:
df <- data.frame(id = c('1','2','3'), b = c('b1', 'NA', 'b3'), c = c('c1', 'c2', 'NA'), d = c('d1', 'NA', 'NA'))
id b c d
1 b1 c1 d1
2 NA c2 NA
3 b3 NA NA
I have extracted values with id = 1 from df to another dataframe say df2 so df2 has 1 row
id b c d
1 b1 c1 d1
I need to copy the values from df2 into df wherever df has an NA.
Result Table:
id b c d
1 b1 c1 d1
2 b1 c2 d1
3 b3 c1 d1
Thank you in advance. I asked a similar question before, but I am deleting it.
Based on your last comment that df2[3,3] should be c2 and not c1, a straightforward answer is to use zoo::na.locf.
library(zoo)
df2 <- na.locf(df)
# id b c d
# 1 1 b1 c1 d1
# 2 2 b1 c2 d1
# 3 3 b3 c2 d1
Data
df <- structure(list(id = c(1, 2, 3), b = c("b1", NA, "b3"), c = c("c1",
"c2", NA), d = c("d1", NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
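For anyone who prefers not to load zoo, the carry-forward idea can be sketched in base R. This is only a minimal stand-in written for this example (locf is my own helper name), not how na.locf is implemented, and unlike na.locf's default it keeps leading NAs instead of dropping them.

```r
df <- data.frame(id = c(1, 2, 3),
                 b = c("b1", NA, "b3"),
                 c = c("c1", "c2", NA),
                 d = c("d1", NA, NA),
                 stringsAsFactors = FALSE)

# hypothetical helper: carry the last non-NA value forward within one vector
locf <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the latest non-NA value seen so far
  idx[idx == 0] <- NA        # nothing to carry forward yet
  x[!is.na(x)][idx]
}

df2 <- df
df2[] <- lapply(df, locf)
df2
```

On the example data this fills b with b1, b1, b3, c with c1, c2, c2 and d with d1, d1, d1.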
Assuming there is a mistake in your question (df2 should be b1-c1-d1, not b1-c2-d1), here is a solution:
Initialize dataframe
df <- data.frame(id = c('1','2','3'), b = c('b1', 'NA', 'b3'), c = c('c1', 'c2', 'NA'), d = c('d1', 'NA', 'NA'))
Converting string NAs to actual detectable NAs
df <- data.frame(lapply(df, function(x) { gsub("NA", NA, x) }))
Obtaining default value row
df2<-df[df$id==1,]
For all rows, check if the column cell is na, then fill it with the df2 cell of the same column
for (r in 1:nrow(df)) for( c in colnames(df)) df[r,c]<-ifelse(is.na(df[r,c]),as.character(df2[1,c]),as.character(df[r,c]))
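The cell-by-cell loop can probably be shortened by filling one column at a time. A sketch, assuming the NAs are real NA values (not the string "NA"), the columns are character rather than factor, and the row with id 1 holds the defaults; ref and miss are names chosen for this example:

```r
df <- data.frame(id = c("1", "2", "3"),
                 b = c("b1", NA, "b3"),
                 c = c("c1", "c2", NA),
                 d = c("d1", NA, NA),
                 stringsAsFactors = FALSE)

ref <- df[df$id == "1", ]   # row holding the default values

for (col in setdiff(names(df), "id")) {
  miss <- is.na(df[[col]])            # which rows need a default in this column
  df[[col]][miss] <- ref[[col]][1]
}
df
```

This reproduces the result table from the question, including c1 in the third row of column c.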
How can I add the (n-1)th column to the nth column in a data frame, for a subset of columns?
For example, I have a data frame as follows:
ID C1 C2 C3
1 2000-12-24 3d 2d
2 2000-12-24 2d 1d
I want R to do the following:
ID C1 C2 C3
1 2000-12-24 2000-12-24+3d=2000-12-27 2000-12-27+2d=2000-12-29
2 2000-12-24 2000-12-24+2d=2000-12-26 2000-12-26+1d=2000-12-27
so that the final dataframe looks like this:
ID C1 C2 C3 ...
1 2000-12-24 2000-12-27 2000-12-29
2 2000-12-24 2000-12-26 2000-12-27
UPDATE:
The data has been generated accordingly:
library(plyr)
library(lubridate)
library(reshape2)
Heterotransaction <- rgamma(2,shape=3 , scale=1)
ID <- list(1:2)
Elog <- data.frame(ID,Heterotransaction)
Elog$fist_transaction <- "2000-12-24"
Elog$fist_transaction <- as.Date(Elog$fist_transaction, "%Y-%m-%d")
Heterotransaction <- rgamma(2,shape=3 , scale=1)
f.transaction <- function(x){
y<- (rexp(2,x))
duration(y, units = "years")
}
tbtrans<-ldply(Heterotransaction, f.transaction)
purchases<-data.frame(ID,tbtrans)
Elognew<- merge.data.frame(Elog, purchases)
You could try
df1[3:ncol(df1)] <- lapply(3:ncol(df1), function(i) rowSums(df1[2:i]))
df1
# ID C1 C2 C3
#1 1 2 5 7
#2 2 4 7 8
or
df1[-1] <- t(apply(df1[-1], 1, cumsum))
Or another option would be to use Reduce
library(data.table)
setDT(df1)[,2:ncol(df1) := Reduce(`+`, .SD, accumulate=TRUE),
.SDcols=2:ncol(df1)][]
# ID C1 C2 C3
#1: 1 2 5 7
#2: 2 4 7 8
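The Reduce idea does not depend on data.table. Since a data frame is a list of columns, a base R sketch of the same running sum could look like this:

```r
df1 <- data.frame(ID = 1:2, C1 = c(2L, 4L), C2 = c(3L, 3L), C3 = c(2L, 1L))

# accumulate = TRUE keeps every intermediate sum: C1, C1 + C2, C1 + C2 + C3
df1[-1] <- Reduce(`+`, df1[-1], accumulate = TRUE)
df1
```

Each element of the returned list is one column of running totals, so it can be assigned straight back to df1[-1].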
Update
Based on the new dataset, one option would be to modify the first solution
df2[3:ncol(df2)] <- do.call(rbind, lapply(3:ncol(df2), function(i)
as.Date(df2[,2]+cumsum(as.numeric(sub('[^0-9]+', '', df2[,i]))))))
df2[3:ncol(df2)] <- lapply(df2[3:ncol(df2)], as.Date, origin='1970-01-01')
df2
# ID C1 C2 C3
#1 1 2000-12-24 2000-12-27 2000-12-29
#2 2 2000-12-24 2000-12-26 2000-12-27
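For data with more rows or more offset columns, a plain loop that carries a running date from left to right may be easier to follow. A sketch, assuming every offset is a whole number of days written like "3d" and that C1 is already of class Date:

```r
df2 <- data.frame(ID = 1:2,
                  C1 = as.Date(c("2000-12-24", "2000-12-24")),
                  C2 = c("3d", "2d"),
                  C3 = c("2d", "1d"),
                  stringsAsFactors = FALSE)

running <- df2$C1                      # date reached so far, one per row
for (j in 3:ncol(df2)) {
  days <- as.numeric(sub("d$", "", df2[[j]]))   # "3d" -> 3
  running <- running + days
  df2[[j]] <- running                  # replace the offset with the resulting date
}
df2
```

Because running is updated inside the loop, each column builds on the date in the column before it, which is the behaviour described in the question.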
data
df1 <- structure(list(ID = 1:2, C1 = c(2L, 4L), C2 = c(3L, 3L),
C3 = c(2L, 1L)), .Names = c("ID", "C1", "C2", "C3"),
class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(ID = 1:2, C1 = structure(c(11315, 11315),
class = "Date"),
C2 = c("3d", "2d"), C3 = c("2d", "1d")), .Names = c("ID",
"C1", "C2", "C3"), row.names = c(NA, -2L), class = "data.frame")