How to change characters in a data frame

I would like to change every "U" to "N" in columns 3-9, and change every "H" to the character in the "type" column of the same row. For example, the "H" in the first row would become "M", and so on. I would appreciate any help with the R scripting. Thanks. XW
ID type A01 A02 A03 A04 A05 A06 A07
ss001 M C A U A A H A
ss002 R A H A A A G A
ss003 R H A G A A A U
ss004 R A U A A A A A
ss005 Y C C H T T C C
ss006 Y C T U C C C H
ss007 R A G A H G U G
ss008 K G U T G T H G
ss009 Y T H C T T U C
ss010 K T G T H G T T

This should be a pretty efficient way to do this:
M <- as.matrix(df[-c(1, 2)]) ## Faster to work on a matrix
M[M == "U"] <- "N" ## Replace "U" with "N"
H <- which(M == "H", arr.ind=TRUE) ## Identify the Hs
M[H] <- df[cbind(H[, "row"], 2)] ## Replace with values from "type"
cbind(df[1:2], M) ## Combine
# ID type A01 A02 A03 A04 A05 A06 A07
# 1 ss001 M C A N A A M A
# 2 ss002 R A R A A A G A
# 3 ss003 R R A G A A A N
# 4 ss004 R A N A A A A A
# 5 ss005 Y C C Y T T C C
# 6 ss006 Y C T N C C C Y
# 7 ss007 R A G A R G N G
# 8 ss008 K G N T G T K G
# 9 ss009 Y T Y C T T N C
# 10 ss010 K T G T K G T T
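
The key step above is indexing with a two-column matrix of (row, column) positions. As a minimal sketch on a toy matrix (the names `type`, `m`, and `pos` are made up for illustration and are not part of the original answer), this shows what `which(..., arr.ind = TRUE)` returns and how it lines up with the per-row replacement values:

```r
# Toy data: 'type' plays the role of the "type" column
type <- c("M", "R", "Y")
m <- matrix(c("A", "H", "C",
              "H", "A", "H"),
            nrow = 3,
            dimnames = list(NULL, c("A01", "A02")))

# One row per match, with columns named "row" and "col"
pos <- which(m == "H", arr.ind = TRUE)

# Indexing by a two-column matrix addresses individual cells;
# type[pos[, "row"]] picks the replacement for each matched cell
m[pos] <- type[pos[, "row"]]
m
```

Because the replacement vector is built from the same `pos` rows, each "H" receives the type of its own row regardless of which column it sits in.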

You can do this with apply called on the rows of your data:
# Read in data frame with data stored as characters
df = read.table(text="ID type A01 A02 A03 A04 A05 A06 A07
ss001 M C A U A A H A
ss002 R A H A A A G A
ss003 R H A G A A A U
ss004 R A U A A A A A
ss005 Y C C H T T C C
ss006 Y C T U C C C H
ss007 R A G A H G U G
ss008 K G U T G T H G
ss009 Y T H C T T U C
ss010 K T G T H G T T", header=T, stringsAsFactors=F)
# Manipulate rows
df.mod = as.data.frame(t(apply(df, 1, function(x) {
  to.modify <- x[c(-1, -2)]
  to.modify[to.modify == "U"] <- "N"
  to.modify[to.modify == "H"] <- x[2]
  return(c(x[1:2], to.modify))
})))
names(df.mod) <- names(df)
df.mod
# ID type A01 A02 A03 A04 A05 A06 A07
# 1 ss001 M C A N A A M A
# 2 ss002 R A R A A A G A
# 3 ss003 R R A G A A A N
# 4 ss004 R A N A A A A A
# 5 ss005 Y C C Y T T C C
# 6 ss006 Y C T N C C C Y
# 7 ss007 R A G A R G N G
# 8 ss008 K G N T G T K G
# 9 ss009 Y T Y C T T N C
# 10 ss010 K T G T K G T T

Related

Why does cSplit return TRUE instead of the character

I have this (simplified) dataset:
x <- read.table(text = ' id seq
1 1 AACCAAGCCCTTGCTCAAATCGAAAAAAAGTTGAGCAAACCGAGTTTTGAG
2 2 AAGTTGAGCAAACCGAGTTTTGAGACTTGGATGAAGTCAACCAAAGCCCAC')
Which thus looks like this:
id seq
1 1 AACCAAGCCCTTGCTCAAATCGAAAAAAAGTTGAGCAAACCGAGTTTTGAG
2 2 AAGTTGAGCAAACCGAGTTTTGAGACTTGGATGAAGTCAACCAAAGCCCAC
When I then subject it to cSplit:
cSplit(x, 'seq', direction = 'wide', stripWhite = FALSE, sep = '')
it returns TRUE for positions 20 and 32 instead of the character itself:
id seq_01 seq_02 seq_03 seq_04 seq_05 seq_06 seq_07 seq_08 seq_09 seq_10 seq_11 seq_12 seq_13 seq_14 seq_15 seq_16 seq_17 seq_18
1: 1 A A C C A A G C C C T T G C T C A A
2: 2 A A G T T G A G C A A A C C G A G T
seq_19 seq_20 seq_21 seq_22 seq_23 seq_24 seq_25 seq_26 seq_27 seq_28 seq_29 seq_30 seq_31 seq_32 seq_33 seq_34 seq_35 seq_36
1: A TRUE C G A A A A A A A G T TRUE G A G C
2: T TRUE T G A G A C T T G G A TRUE G A A G
seq_37 seq_38 seq_39 seq_40 seq_41 seq_42 seq_43 seq_44 seq_45 seq_46 seq_47 seq_48 seq_49 seq_50 seq_51
1: A A A C C G A G T T T T G A G
2: T C A A C C A A A G C C C A C
(If I instead change direction = 'wide' to direction = 'long' and then spread it myself using tidyr::spread, it looks fine.)

The issue is with type.convert, which is TRUE by default. If a column contains only "T" or "F", it is interpreted as TRUE/FALSE rather than as the strings "T" or "F", and the column is converted to logical type. Setting type.convert = FALSE keeps the characters:
library(splitstackshape)
cSplit(x, 'seq', direction = 'wide', stripWhite = FALSE,
sep = '', type.convert = FALSE)
# id seq_01 seq_02 seq_03 seq_04 seq_05 seq_06 seq_07 seq_08 seq_09 seq_10 seq_11 seq_12 seq_13 seq_14 seq_15
#1: 1 A A C C A A G C C C T T G C T
#2: 2 A A G T T G A G C A A A C C G
# seq_16 seq_17 seq_18 seq_19 seq_20 seq_21 seq_22 seq_23 seq_24 seq_25 seq_26 seq_27 seq_28 seq_29 seq_30
#1: C A A A T C G A A A A A A A G
#2: A G T T T T G A G A C T T G G
# seq_31 seq_32 seq_33 seq_34 seq_35 seq_36 seq_37 seq_38 seq_39 seq_40 seq_41 seq_42 seq_43 seq_44 seq_45
#1: T T G A G C A A A C C G A G T
#2: A T G A A G T C A A C C A A A
# seq_46 seq_47 seq_48 seq_49 seq_50 seq_51
#1: T T T G A G
#2: G C C C A C
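
The same conversion can be seen in isolation with base R's `utils::type.convert`, without involving cSplit at all. This is only an illustrative sketch (the variable names `all_t` and `mixed` are made up here); it shows why positions where every row holds "T" come back as logical:

```r
# A vector made up only of "T" (or "F") is converted to logical
all_t <- utils::type.convert(c("T", "T"), as.is = TRUE)
class(all_t)

# As soon as any other letter is present, the vector stays character
mixed <- utils::type.convert(c("T", "A"), as.is = TRUE)
class(mixed)
```

In the question's data, positions 20 and 32 are "T" in both sequences, which matches the first case.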

Join 3 columns of different lengths in R

I have 3 columns: 2 are the same length and 1 is shorter.
Here are the columns:
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]
I want all 3 columns joined together, with the 5 missing values in column1 filled with NA.
What can I do to achieve this? A tibble?

You can change the length of a vector:
column1 <- letters[1:10]
column2 <- letters[1:15]
length(column1) <- length(column2)
Now
> column1
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA NA NA NA NA
We can wrap it in a function:
cbind_dif <- function(x = list()){
# Find max length
max_length <- max(unlist(lapply(x, length)))
# Set the length of each vector to max_length (pads with NA)
res <- lapply(x, function(x){
length(x) <- max_length
return(x)
})
return(as.data.frame(res))
}
# Example usage:
> cbind_dif(list(column1 = column1, column2 = column2))
column1 column2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
8 h h
9 i i
10 j j
11 <NA> k
12 <NA> l
13 <NA> m
14 <NA> n
15 <NA> o
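
A more compact variant of the same idea, covering all three columns from the question, might look like the sketch below. It relies on the replacement function `length<-` being callable like any other function; the names `cols`, `n`, and `padded` are chosen here for illustration.

```r
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]

cols <- list(column1 = column1, column2 = column2, column3 = column3)

# The longest vector decides the number of rows
n <- max(lengths(cols))

# `length<-`(x, n) returns x padded with NA up to length n
padded <- as.data.frame(lapply(cols, `length<-`, n))
padded
```

Keeping the inputs in a named list means the column names carry over to the data frame without extra work.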

n <- max(length(column1), length(column2), length(column3))
data.frame(column1[1:n],column2[1:n],column3[1:n])
column1.1.n. column2.1.n. column3.1.n.
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 <NA> k k
12 <NA> l l
13 <NA> m m
14 <NA> n n
15 <NA> o o

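Using cbind.fill from the rowr package you can do it easily. Pass fill = NA so the shorter column is padded with NA instead of being recycled.
library(rowr)
new <- cbind.fill(column1, column2, column3, fill = NA)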
I hope this helps

library(tibble)
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]
tibble(a = c(column1, rep(NA, length(column2) - length(column1))), b = column2, c = column3)
# A tibble: 15 × 3
a b c
<chr> <chr> <chr>
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 NA k k
12 NA l l
13 NA m m
14 NA n n
15 NA o o

R: more efficient solution than this for-loop

I wrote a working for loop, but it's slow over thousands of rows and I'm looking for a more efficient alternative. Thanks in advance!
The task:
If column a matches column b, column d becomes NA.
If column a does not match b, but b matches c, then column e becomes NA.
The for loop:
for (i in 1:nrow(data)) {
  if (data$a[i] == data$b[i]) {data$d[i] <- NA}
  if (!(data$a[i] == data$b[i]) & data$b[i] == data$c[i]) {data$e[i] <- NA}
}
An example:
a b c d e
F G G 1 10
F G F 5 10
F F F 2 8
Would become:
a b c d e
F G G 1 NA
F G F 5 10
F F F NA 8

If you're concerned about speed and efficiency, I'd recommend data.table (though technically vectorizing a normal data.frame, as recommended by @parfait, would probably speed things up more than enough):
library(data.table)
DT <- fread("a b c d e
F G G 1 10
F G F 5 10
F F F 2 8")
print(DT)
# a b c d e
# 1: F G G 1 10
# 2: F G F 5 10
# 3: F F F 2 8
DT[a == b, d := NA]
DT[!a == b & b == c, e := NA]
print(DT)
# a b c d e
# 1: F G G 1 NA
# 2: F G F 5 10
# 3: F F F NA 8

Suppose df is your data then:
ab <- with(df, a==b)
bc <- with(df, b==c)
df$d[ab] <- NA
df$e[!ab & bc] <- NA
which would result in
# a b c d e
# 1 F G G 1 NA
# 2 F G F 5 10
# 3 F F F NA 8
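
To check that the vectorized form really does the same thing as the loop, here is a self-contained sketch on the example data. The object names `looped` and `vectorized` are introduced only for this comparison, and the data frame is rebuilt by hand from the question's table.

```r
data <- data.frame(a = c("F", "F", "F"),
                   b = c("G", "G", "F"),
                   c = c("G", "F", "F"),
                   d = c(1, 5, 2),
                   e = c(10, 10, 8),
                   stringsAsFactors = FALSE)

# Loop version, following the logic in the question
looped <- data
for (i in seq_len(nrow(looped))) {
  if (looped$a[i] == looped$b[i]) looped$d[i] <- NA
  if (looped$a[i] != looped$b[i] && looped$b[i] == looped$c[i]) looped$e[i] <- NA
}

# Vectorized version: one logical vector per condition,
# computed once for all rows
ab <- data$a == data$b
bc <- data$b == data$c
vectorized <- data
vectorized$d[ab] <- NA
vectorized$e[!ab & bc] <- NA

# Compare the two results
all.equal(looped, vectorized)
```

The speed-up comes from evaluating each comparison once over whole columns rather than once per row inside the loop.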

We could create a list of quosures and evaluate it:
library(tidyverse)
qs <- setNames(quos(d*NA^(a == b), e*NA^((!(a ==b) & (b == c)))), c("d", "e"))
df1 %>%
mutate(!!! qs)
# a b c d e
#1 F G G 1 NA
#2 F G F 5 10
#3 F F F NA 8

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain a total for the numeric column (ideally I could switch on Excel's total row feature while writing the table to the workbook object, like I did with firstColumn, rather than adding a total row manually).
I searched for a solution both on StackOverflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na

library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.

Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
  # Sum numeric columns; leave NA for everything else
  if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>
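
Since the question asks for openxlsx specifically, the manual total row can then be written with the same writeDataTable call used in the question. The following is a sketch under a few assumptions: the data is cut down to three columns, the output file name `totals.xlsx` is a placeholder, and the total is written as an ordinary data row rather than through Excel's built-in total row feature. The write step is skipped if openxlsx is not installed.

```r
X <- data.frame(A = c("a", "f", "w"),
                B = c("b", "d", "s"),
                H = c(5, 9, 8),
                stringsAsFactors = FALSE)

# Sum numeric columns, leave NA elsewhere
totals <- lapply(X, function(col) if (is.numeric(col)) sum(col) else NA)
X_total <- rbind(X, totals)

# Write the table only if openxlsx is available
if (requireNamespace("openxlsx", quietly = TRUE)) {
  wb <- openxlsx::createWorkbook()
  openxlsx::addWorksheet(wb, "Data")
  openxlsx::writeDataTable(wb, sheet = "Data", x = X_total,
                           withFilter = FALSE, bandedRows = FALSE,
                           firstColumn = TRUE)
  # Placeholder path in the session's temporary directory
  openxlsx::saveWorkbook(wb, file.path(tempdir(), "totals.xlsx"),
                         overwrite = TRUE)
}
X_total
```

Because the total is part of the table data, sorting or filtering in Excel will treat it like any other row.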

Subsetting character data in R

I have a data frame with several columns of varied character data. I want to find the average of each combination of that character data. I think I'm closing in on a solution, but am having trouble figuring out how to loop over characters. An example bit of data would be like:
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11
And trying to get it in the form of:
P1 P2 P3 Avg
a w j 11.667
a w k 10
a d j 20
a d k 7.5
a d l 23
b x L 20
b y k 17.5
c z j 15
c z k 11
c w l 45
I think the idea is something like:
test <- read.table("clipboard",header=T)
newdata <- subset(test,
Var1=='a'
& Var2=='w'
& Var3=='j',
select=M1
)
row.names(newdata)<-NULL
newdata2 <- as.data.frame(matrix(data=NA,nrow=3,ncol=4))
names(newdata2) <- c("P1","P2","P3","Avg")
newdata2[1,1] <- 'a'
newdata2[1,2] <- 'w'
newdata2[1,3] <- 'j'
newdata2[1,4] <- mean(newdata$M1)
Which works for the first line, but I'm not entirely sure how to automate this to loop over each character combination across the columns. Unless, of course, there's a similar apply-like function to use in this case?

library(dplyr)
newdata2 = summarise(group_by(test,Var1,Var2,Var3),Avg=mean(M1))
And the result:
> newdata2
Source: local data frame [10 x 4]
Groups: Var1, Var2
Var1 Var2 Var3 Avg
1 a d j 20.00000
2 a d k 7.50000
3 a d l 23.00000
4 a w j 11.66667
5 a w k 10.00000
6 b x L 20.00000
7 b y k 17.50000
8 c w l 45.00000
9 c z j 15.00000
10 c z k 11.00000

Using the base aggregate function:
mydata <- read.table(header=TRUE, text="
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")
aggdata <-aggregate(mydata$M1, by=list(mydata$Var1,mydata$Var2,mydata$Var3) , FUN=mean, na.rm=TRUE)
output:
> aggdata
Group.1 Group.2 Group.3 x
1 a d j 20.00000
2 a w j 11.66667
3 c z j 15.00000
4 a d k 7.50000
5 a w k 10.00000
6 b y k 17.50000
7 c z k 11.00000
8 a d l 23.00000
9 c w l 45.00000
10 b x L 20.00000
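
aggregate also has a formula interface, which keeps the original column names instead of Group.1, Group.2 and x. A minimal sketch on a subset of the rows from the question (the subset is chosen here only to keep the example short):

```r
mydata <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w j 0
a d k 4
a d k 11
")

# Mean of M1 for every combination of Var1, Var2 and Var3
aggdata <- aggregate(M1 ~ Var1 + Var2 + Var3, data = mydata, FUN = mean)
aggdata
```

The left-hand side of the formula names the value to summarise and the right-hand side lists the grouping columns.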
