Creating a new column based on several existing columns - r

I am hoping to create a new column D based on three existing columns: "A", "B" and "C". The dataset also has other variables E, F, G, etc.
Whenever A, B or C has a value, the other two columns are NA (E, F and G are not affected by them). The new variable "D" should take whatever value exists in any of the A, B or C columns.
A B C D E F G
1 NA NA 1
NA 2 NA 2
NA 4 NA 4
NA NA 2 2
NA NA 3 3
Is there any simple code in any package that can do the trick? Thank you in advance!
I have seen other code that does this, but those datasets only have A, B and C. My dataset has other existing columns, so I need code that can specify the A, B and C columns.

One option is to use coalesce on 'A', 'B', 'C' to create 'D'. coalesce returns the first non-NA value for each row.
library(dplyr)
df1 <- df1 %>%
mutate(D = coalesce(A, B, C), .after = 'C')

A base R way to do it is to use pmax (this works here because each row has only one non-NA value among A, B and C):
Data:
df <- data.frame(A = c(1, NA, NA, NA, NA),
B = c(NA, 2, 4, NA, NA),
C = c(NA, NA, NA, 2, 3))
Code:
df$D <- pmax(df$A, df$B, df$C, na.rm = TRUE)
# or
df$D <- with(df, pmax(A, B, C, na.rm = TRUE))
Output:
# A B C D
# 1 1 NA NA 1
# 2 NA 2 NA 2
# 3 NA 4 NA 4
# 4 NA NA 2 2
# 5 NA NA 3 3
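pmax only agrees with coalesce here because each row holds a single non-NA value. If a row could hold more than one, a base R stand-in for coalesce might look like the sketch below. Note that coalesce_base is a hypothetical helper name (not a function from any package) and the small data frame is made up for illustration.

```r
# Illustrative data: one non-NA value per row among A, B, C, plus an unrelated column E
df <- data.frame(A = c(1, NA, NA), B = c(NA, 2, NA), C = c(NA, NA, 3), E = 1)

# coalesce_base is a hypothetical helper: keep x where it is present, else fall back to y
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

# Only A, B and C are passed in, so E (and any other column) is left untouched
df$D <- coalesce_base(df$A, df$B, df$C)
df
```

Because the columns are named explicitly, this works no matter how many other columns the data frame has.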

Update using across:
library(dplyr)
df %>%
mutate(D = rowSums(across(A:C), na.rm = TRUE))
OR
We could use mutate with rowSums:
library(dplyr)
df %>%
mutate(D = rowSums(.[1:3], na.rm = TRUE))
A B C D E F G
1 1 NA NA 1 1 1 1
2 NA 2 NA 2 1 1 1
3 NA 4 NA 4 1 1 1
4 NA NA 2 2 1 1 1
5 NA NA 3 3 1 1 1
data:
df <- structure(list(A = c(1L, NA, NA, NA, NA), B = c(NA, 2L, 4L, NA,
NA), C = c(NA, NA, NA, 2L, 3L), D = c(1L, 2L, 4L, 2L, 3L), E = c(1L,
1L, 1L, 1L, 1L), F = c(1L, 1L, 1L, 1L, 1L), G = c(1L, 1L, 1L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))
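One thing to watch with the rowSums approach: with na.rm = TRUE, a row where A, B and C are all NA gets 0 rather than NA. The sketch below (base R only, with made-up data in which the third row is entirely NA) shows the effect and one possible way to restore the NA. The variable names are only illustrative.

```r
# Illustrative data: the third row has no value in A, B or C
df <- data.frame(A = c(1, NA, NA), B = c(NA, 2, NA), C = c(NA_real_, NA, NA))
abc <- df[c("A", "B", "C")]

# rowSums with na.rm = TRUE turns the all-NA row into 0
d_sum <- rowSums(abc, na.rm = TRUE)

# Flag rows with no non-NA value and put NA back for them
all_na <- rowSums(!is.na(abc)) == 0
df$D <- replace(unname(d_sum), all_na, NA)
df
```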

Related

With R, how can I separate continuous values from a dataframe with item NA and calculate the average of only variable Y?

X Y
1 1 2
2 2 4
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 1 4
8 2 6
9 1 8
10 1 10
It should be so: In the first case the average of the values 2 and 4 is 3 In the second case, the average of the values 4,6,8,10 is 7 and so on...
Your data:
df = data.frame(X=c(1,2,NA,NA,NA,NA,1,2,1,1),Y=c(2,4,NA,NA,NA,NA,4,6,8,10))
You can identify blocks of consecutive rows without NAs using diff(complete.cases(..)):
blocks = cumsum(c(0,diff(complete.cases(df)) != 0 ))
block_means = tapply(df$Y,blocks,mean)
0 1 2
3 NA 7
block_means[!is.na(block_means)]
0 2
3 7
Or if you don't need to know the order:
na.omit(as.numeric(tapply(df$Y,blocks,mean)))
[1] 3 7
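To make the intermediate steps visible, here is the same idea as a small self-contained sketch. It drops the incomplete rows before averaging, so no NA block has to be filtered out afterwards. The variable name ok is just an illustrative choice.

```r
df <- data.frame(X = c(1, 2, NA, NA, NA, NA, 1, 2, 1, 1),
                 Y = c(2, 4, NA, NA, NA, NA, 4, 6, 8, 10))

# TRUE for rows without any NA
ok <- complete.cases(df)

# A new block id starts every time ok flips between TRUE and FALSE
blocks <- cumsum(c(0, diff(ok) != 0))
blocks
# 0 0 1 1 1 1 2 2 2 2

# Average Y within each block, using complete rows only
block_means <- tapply(df$Y[ok], blocks[ok], mean)
block_means
```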
We can create groups of consecutive values using rleid from data.table and, within each group, calculate the mean of the Y values.
library(dplyr)
df %>%
group_by(gr = data.table::rleid(is.na(Y))) %>%
summarise(Y = mean(Y, na.rm = TRUE)) %>%
filter(!is.na(Y)) -> df1
df1
# gr Y
# <int> <dbl>
#1 1 3
#2 3 7
The data.table way of doing this would be:
library(data.table)
df1 <- setDT(df)[, .(Y = mean(Y, na.rm = TRUE)), rleid(is.na(Y))][!is.na(Y)]
data
df <- structure(list(X = c(1L, 2L, NA, NA, NA, NA, 1L, 2L, 1L, 1L),
Y = c(2L, 4L, NA, NA, NA, NA, 4L, 6L, 8L, 10L)),
class = "data.frame", row.names = c(NA, -10L))
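If data.table is not available, the run ids that rleid produces can be approximated in base R with rle. This is only a sketch of that idea, with gr as an illustrative column name. It relies on the formula interface of aggregate dropping rows with NA by default.

```r
df <- data.frame(X = c(1, 2, NA, NA, NA, NA, 1, 2, 1, 1),
                 Y = c(2, 4, NA, NA, NA, NA, 4, 6, 8, 10))

# Run lengths of "is Y missing?", expanded back to one id per row
runs <- rle(is.na(df$Y))
df$gr <- rep(seq_along(runs$lengths), runs$lengths)
df$gr
# 1 1 2 2 2 2 3 3 3 3

# Rows with NA in Y are omitted by the default na.action, so only runs 1 and 3 remain
out <- aggregate(Y ~ gr, data = df, FUN = mean)
out
```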

Cumulative Count Paste

I have this dataset:
ID Set Type Count
1 1 1 A NA
2 2 1 R NA
3 3 1 R NA
4 4 1 U NA
5 5 1 U NA
6 6 1 U NA
7 7 2 A NA
8 8 3 R NA
9 9 3 R NA
As dputs:
mystart <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
Using the dplyr package, how can I obtain this:
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
Again dputs:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 7L, 8L), .Label = c("A1", "A1R1", "A1R2",
"A1R2U1", "A1R2U2", "A1R2U3", "R1", "R2"), class = "factor")), .Names = c("ID",
"Set", "Type", "Count"), class = "data.frame", row.names = c(NA,
-9L))
In short, I want to count the observations of the column "Type" within column "Set" and paste this count (as text) cumulatively.
Examining similar posts, I got close to this:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(1L,
1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
With the code:
library(dplyr)
myend <- read.table("mydata.txt", header=TRUE, fill=TRUE)
myend %>%
group_by(Set, Type) %>%
mutate(Count = seq(n())) %>%
ungroup()
Thank you very much for your help,
Base R version :
aggregateGroup <- function(x){
vecs <- Reduce(f=function(a,b){ a[b] <- sum(a[b],1L,na.rm=TRUE); a },
init=integer(0),
as.character(x),
accumulate = TRUE)
# vecs is a list with something like this :
# [[1]]
# integer(0)
# [[2]]
# A
# 1
# [[3]]
# A R
# 1 1
# ...
# so we simply turn those vectors into characters using vapply and paste
# (excluding the first)
vapply(vecs,function(y) paste0(names(y),y,collapse=''),FUN.VALUE='')[-1]
}
split(mystart$Count,mystart$Set) <- lapply(split(mystart$Type,mystart$Set), aggregateGroup)
> mystart
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A dplyr version:
mystart %>%
group_by(Set) %>%
mutate(Count = paste0('A', cumsum(Type == 'A'),
'R', cumsum(Type == 'R'),
'U', cumsum(Type == 'U'))) %>%
ungroup()
Which yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1R0U0
2 2 1 R A1R1U0
3 3 1 R A1R2U0
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1R0U0
8 8 3 R A0R1U0
9 9 3 R A0R2U0
If you want to omit the variables with count zero, you'd need to wrap a function around it like so
mygroup <- function(lst) {
name <- names(lst)
vectors <- lapply(seq_along(lst), function(i) {
x <- lst[[i]]
char <- name[i]
x <- ifelse(x == 0, "", paste0(char, x))
return(x)
})
return(do.call("paste0", vectors))
}
mystart %>%
group_by(Set) %>%
mutate(Count = mygroup(list(A = cumsum(Type == 'A'),
R = cumsum(Type == 'R'),
U = cumsum(Type == 'U')))) %>%
ungroup()
This yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
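The version above hard-codes the types A, R and U. A sketch that works for whatever values Type takes could look like this in base R. cum_label is a hypothetical helper name, and it orders the pieces by first appearance within each Set rather than in a fixed A, R, U order, which is an assumption about what is wanted.

```r
mystart <- data.frame(ID = 1:9,
                      Set = c(1, 1, 1, 1, 1, 1, 2, 3, 3),
                      Type = c("A", "R", "R", "U", "U", "U", "A", "R", "R"),
                      stringsAsFactors = FALSE)

# cum_label is a hypothetical helper: one running count per type, zero counts left out
cum_label <- function(type) {
  type <- as.character(type)
  lv <- unique(type)                    # types in order of first appearance
  parts <- lapply(lv, function(l) {
    n <- cumsum(type == l)              # running count of this type
    ifelse(n == 0, "", paste0(l, n))    # skip types not seen yet
  })
  do.call(paste0, parts)                # glue the pieces together row by row
}

# ave applies the helper separately within each Set
mystart$Count <- ave(mystart$Type, mystart$Set, FUN = cum_label)
mystart
```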
One line solve with data.table
you gotta first do
require(data.table)
mystart <- as.data.table(mystart)
then just use one line
mystart[, .(Type,
count = paste0(
'A',
cumsum(Type == 'A'),
'R',
cumsum(Type == 'R'),
'U',
cumsum(Type == 'U')
)),
by = c('Set')]
first you want cumsum each type and paste them together by 'set'
cumsum(Type=='A') equals the count, since when Type==A, it's 1, otherwise it's 0.
you wanted to paste them into one column also. So, paste0() is good to use.
you still wanted the Type column, so I included Type in the line.
The output:
Set Type count
1: 1 A A1R0U0
2: 1 R A1R1U0
3: 1 R A1R2U0
4: 1 U A1R2U1
5: 1 U A1R2U2
6: 1 U A1R2U3
7: 2 A A1R0U0
8: 3 R A0R1U0
9: 3 R A0R2U0
Hope this helps.
btw, if you want count 0 ignored, you gotta design some if-else clause yourself.
basically you want this: if cumsum(something) == 0, NULL, else paste0('something', cumsum(something)), then you paste0() them together.
It's gonna get nasty, I'm not writing it. you get the idea
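For what it's worth, the if-else part does not have to get very nasty. A minimal sketch, assuming a small helper (lab is a hypothetical name, not part of data.table) that returns an empty string when the running count is zero:

```r
library(data.table)

# lab is a hypothetical helper: "" when the running count is 0, otherwise e.g. "R2"
lab <- function(letter, n) ifelse(n == 0, "", paste0(letter, n))

dt <- data.table(Set  = c(1, 1, 1, 1, 1, 1, 2, 3, 3),
                 Type = c("A", "R", "R", "U", "U", "U", "A", "R", "R"))

# Same cumsum idea as above, but each piece is blank until its type first appears
dt[, count := paste0(lab("A", cumsum(Type == "A")),
                     lab("R", cumsum(Type == "R")),
                     lab("U", cumsum(Type == "U"))),
   by = Set]
dt
```

The types are still hard-coded here, so this only covers A, R and U.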
Here's a base solution.
We can paste the raw letters to the seq_along of each letter group to get the last 2 characters, then paste the result to the last element of the previous result, using Reduce.
On top of this we use ave to compute by group.
fun <- function(x,y) paste0(x[length(x)],y,seq_along(y))
mystart$Count <- ave(as.character(mystart$Type),mystart$Set,
FUN = function(x) unlist(Reduce(fun,split(x,x),init=NULL,accumulate = TRUE)))
# ID Set Type Count
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
Details
split(x,x) splits letters as shown here for first Set:
with(subset(mystart,Set==1),split(Type,Type))
# $A
# [1] "A"
#
# $R
# [1] "R" "R"
#
# $U
# [1] "U" "U" "U"
Then fun does this type of operations, helped by Reduce :
fun(NULL,"A") # [1] "A1"
fun("A1",c("R","R")) # [1] "A1R1" "A1R2"
fun(c("A1R1","A1R2"),c("U","U","U")) # [1] "A1R2U1" "A1R2U2" "A1R2U3"
Bonus solution
This other solution, using rle and avoiding split, gives the same output for the given example (and whenever Type values are grouped within Sets), but not with mystart2 <- rbind(mystart,mystart) for instance. Note that lag here is dplyr::lag, so it is not purely base R.
fun2 <- function(x){
rle_ <- rle(x)
suffix <- paste0(x,sequence(rle_$lengths))
prefix <- unlist(mapply(rep,
dplyr::lag(unlist(
Reduce(paste0,paste0(rle_$values,rle_$lengths),accumulate=TRUE)
),1), # shift by one run so each run is prefixed by the runs before it
each=rle_$lengths))
prefix[is.na(prefix)] <- ""
paste0(prefix,suffix)
}
mystart$Count2 <-ave(as.character(mystart$Type), mystart$Set,FUN=fun2)
Many elegant solutions have been provided for this problem. Still, I was looking for something in the dplyr way (without cumsum on fixed types). The function is generic enough to handle additional values of Type.
A solution with the help of a custom function (cumCat must be defined before the pipeline is run):
library(dplyr)
# x holds labels such as "R2", y the running count within each Type
cumCat <- function(x, y){
retVal <- character(length(x))
prevVal = ""
lastGrpVal = ""
for ( i in seq_along(x)){
# a count of 1 marks the start of a new Type, so freeze the prefix built so far
if(y[i]==1){
lastGrpVal = prevVal
}
retVal[i] = paste0(lastGrpVal,x[i])
prevVal = retVal[i]
}
retVal
}
mystart %>% group_by(Set, Type) %>%
mutate(type_count = row_number()) %>%
mutate(TypeMod = paste0(Type,type_count)) %>%
group_by(Set) %>%
mutate(Count = cumCat(TypeMod, type_count)) %>%
select(-type_count, -TypeMod)
# # Groups: Set [3]
# ID Set Type Count
# <int> <int> <fctr> <chr>
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2

R Data Frame remove rows with max values from all columns

Hello, I have a data frame and I need to remove all the rows that contain the max value of any column.
Example
A B C
1 2 3 5
2 4 1 1
3 1 4 3
4 2 1 1
So the output is:
A B C
4 2 1 1
Is there any quick way to do this?
We can do this with %in%
df1[!seq_len(nrow(df1)) %in% sapply(df1, which.max),]
# A B C
#4 2 1 1
If there are ties for the maximum value in a column, then do
df1[!Reduce(`|`, lapply(df1, function(x) x== max(x))),]
df[-sapply(df, which.max),]
# A B C
#4 2 1 1
DATA
df = structure(list(A = c(2L, 4L, 1L, 2L), B = c(3L, 1L, 4L, 1L),
C = c(5L, 1L, 3L, 1L)), .Names = c("A", "B", "C"),
class = "data.frame", row.names = c(NA,-4L))
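The difference between the which.max approach and the ties approach only shows up when a column has its maximum in more than one row. A small sketch with made-up data (column A has its maximum, 4, in rows 2 and 4):

```r
# Illustrative data: A has a tied maximum in rows 2 and 4
df1 <- data.frame(A = c(2, 4, 1, 4), B = c(3, 1, 4, 1), C = c(5, 1, 3, 1))

# which.max only reports the first maximum per column, so row 4 survives
first_only <- df1[!seq_len(nrow(df1)) %in% sapply(df1, which.max), ]
first_only

# Comparing against max(x) flags every tied row, so here no row is left
with_ties <- df1[!Reduce(`|`, lapply(df1, function(x) x == max(x))), ]
with_ties
```

Which of the two is right depends on whether tied maxima should also be removed.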

Merging rows with the same ID variable [duplicate]

I have a dataframe in R with 2186 obs of 38 vars. Rows have an ID variable referring to unique experiments and using
length(unique(df$ID))==nrow(df)
n_occur<-data.frame(table(df$ID))
I know 327 of my rows have repeated IDs with some IDs repeated more than once. I am trying to merge rows with the same ID as these aren't duplicates but just second, third etc. observations within a given experiment.
So for example if I had
x y ID
1 2 a
1 3 b
2 4 c
1 3 d
1 4 a
3 2 b
2 3 a
I would like to end up with
x y ID x2 y2 ID2 x3 y3 ID3
1 2 a 1 4 a 2 3 a
1 3 b 3 2 b na na na
2 4 c na na na na na na
1 3 d na na na na na na
I've seen similar questions for SQL and PHP but they haven't helped me with my attempts in R. Any help would be greatly appreciated.
You could use the enhanced dcast function from the data.table package for that, where you can select multiple value variables. With setDT(mydf) you convert your dataframe to a data.table, and with [, idx := 1:.N, by = ID] you add an index by ID, which you subsequently use in the dcast formula:
library(data.table)
dcast(setDT(mydf)[, idx := 1:.N, by = ID], ID ~ idx, value.var = c("x","y"))
Or with data.table v1.9.8 or later (v1.9.7+ was the development version at the time of writing), you can use the rowid function:
dcast(setDT(mydf), ID ~ rowid(ID), value.var = c("x","y"))
gives:
ID x_1 x_2 x_3 y_1 y_2 y_3
1: a 1 1 2 2 4 3
2: b 1 3 NA 3 2 NA
3: c 2 NA NA 4 NA NA
4: d 1 NA NA 3 NA NA
Used data:
mydf <- structure(list(x = c(1L, 1L, 2L, 1L, 1L, 3L, 2L), y = c(2L, 3L,
4L, 3L, 4L, 2L, 3L), ID = structure(c(1L, 2L, 3L, 4L, 1L, 2L,
1L), .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("x",
"y", "ID"), class = "data.frame", row.names = c(NA, -7L))
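For comparison, roughly the same wide layout can be sketched in base R with ave and reshape. The column names come out as x.1, y.1, x.2, ... rather than x_1, x_2, ..., and idx is just an illustrative name for the within-ID counter.

```r
mydf <- data.frame(x = c(1, 1, 2, 1, 1, 3, 2),
                   y = c(2, 3, 4, 3, 4, 2, 3),
                   ID = c("a", "b", "c", "d", "a", "b", "a"),
                   stringsAsFactors = FALSE)

# Number the observations within each ID: 1st, 2nd, 3rd, ...
mydf$idx <- ave(seq_along(mydf$ID), mydf$ID, FUN = seq_along)

# One row per ID, one pair of x/y columns per observation number
wide <- reshape(mydf, idvar = "ID", timevar = "idx", direction = "wide")
wide
```

IDs with fewer observations are padded with NA, as in the dcast output.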

R: assigning value based on row names

> dput(dat)
structure(list(A = c(1L, 1L, 1L, 1L), B = c(1, 1, 1, 3), C = c(1L,
1L, 1L, 1L), D = c(1L, 2L, 1L, 1L), E = c(1L, 1L, 1L, 1L), F = c(1L,
1L, 1L, 1L), G = c(1L, 2L, 1L, 2L), H = c(1L, 2L, 1L, 1L)), .Names = c("A",
"B", "C", "D", "E", "F", "G", "H"), row.names = c("month1", "month6",
"month12", "month24"), class = "data.frame")
> dat
A B C D E F G H
month1 1 1 1 1 1 1 1 1
month6 1 1 1 2 1 1 2 2
month12 1 1 1 1 1 1 1 1
month24 1 3 1 1 1 1 2 1
Suppose my data looks something like this. I want to assign a value to each of these 8 columns based on when a value > 1 first occurs. If a value > 1 occurs at month 1, I will assign a value of 1 to that column. At month 6, I will assign a value of 1.5 to that column. At month 12, I will assign 2, and at month 24, I will assign 3.
For columns which contain all 1s, I assign NA to them. I would like my output to look like
A B C D E F G H
NA 3 NA 1.5 NA NA 1.5 1.5
We can use max.col. We convert the data.frame to a logical matrix and transpose it ('m1'), get the index of the first TRUE in each row by using max.col with ties.method='first' (in case there are multiple TRUE per row), and change rows that are all FALSE to NA (using rowSums and NA^) to give 'i1'. Now, we can convert 'i1' to a factor, specify the levels and labels, and change it to numeric.
m1 <- t(dat >1)
i1 <- max.col(m1, 'first') * NA^(!rowSums(m1))
as.numeric(as.character(factor(i1, levels= 1:4, labels=c(1, 1.5, 2,3))))
#[1] NA 3.0 NA 1.5 NA NA 1.5 1.5
###Update
If there are rows/columns missing in some of the datasets, e.g. here I am creating a new dataset with the 2nd row missing ('dat1'), we can place the datasets in a list and do this in a loop (lapply(..)) instead of repeating the steps. We create a 0 matrix ('m2') with the dimensions and dimnames that cover all the rows/columns, replace the 0's in 'm2' with the values that are present in each dataset, and then do the steps as before.
dat1 <- dat[-2,]
lst <- list(dat, dat1)
nC1 <- max(sapply(lst, ncol))
nR1 <- max(sapply(lst, nrow))
m2 <- matrix(0, ncol=nC1, nrow=nR1, dimnames=list(paste0('month',
c(1,6, 12,24)), LETTERS[1:8]))
lst1 <- lapply(lst, function(x) {
m2[rownames(x), colnames(x)] <- as.matrix(x)
m2 })
lapply(lst1, function(x) {m1 <- t(x >1)
i1 <- max.col(m1, 'first') * NA^(!rowSums(m1))
as.numeric(as.character(factor(i1, levels= 1:4, labels=c(1, 1.5, 2,3))))
})
# [[1]]
# [1] NA 3.0 NA 1.5 NA NA 1.5 1.5
# [[2]]
# [1] NA 3 NA NA NA NA 3 NA
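As a more explicit (if less compact) alternative, the same lookup can be sketched by matching on row names, which also copes with a missing month without padding. This is only an illustration on a subset of the columns. first_hit and score are hypothetical names, and the mapping of months to values is taken from the question.

```r
# Subset of the question's data, kept small for illustration
dat <- data.frame(A = c(1, 1, 1, 1), B = c(1, 1, 1, 3),
                  D = c(1, 2, 1, 1), G = c(1, 2, 1, 2),
                  row.names = c("month1", "month6", "month12", "month24"))

# Value assigned for the month in which a value > 1 first occurs
score <- c(month1 = 1, month6 = 1.5, month12 = 2, month24 = 3)

# first_hit is a hypothetical helper: per column, look up the score of the first row > 1
first_hit <- function(d) {
  sapply(d, function(col) {
    hit <- which(col > 1)
    if (length(hit) == 0) NA_real_ else score[[rownames(d)[hit[1]]]]
  })
}

r1 <- first_hit(dat)         # all four months present
r2 <- first_hit(dat[-2, ])   # month6 missing
r1
r2
```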
