I have some sequencing data for some biological samples. The first 7 columns of the file contain characters (gene names, codes, etc.). From the 8th column onward are my samples, which contain count data: a number assigned to each gene depending on how much of that gene is present in a given sample.
The problem is that the CSV file contains non-integer values, so I need to convert them to integers (as.integer).
This works absolutely fine if I delete the columns that contain gene information and keep a matrix with only the values! However, I need the gene information and therefore the columns that contain it, but if I run as.integer on the entire data frame, any characters get returned as NA and I lose all of this information!
I am guessing I should either make the first 7 columns as.character or apply as.integer to the 8th column through the last, but I am struggling to think of the code to do this!
Try using lapply() to apply as.integer() to all except the first 7 columns?
df[, -seq(1, 7)] <- lapply(df[, -seq(1, 7)], as.integer)
#result
> df
c1 c2 c3 c4 c5 c6 c7 c8 c9
1 G F Y M V M X 104 13
2 J E F O Q V H 67 11
3 N Q P L S K L 107 -13
4 U I C E M F Y 102 -14
5 E X Z S L B O 129 7
6 S K I Y Y C F 125 15
7 W O A P A G J 55 -2
8 M S H C J J V 30 17
9 L G X N N L B 129 7
10 B N V G Z T S 99 -12
Sample data:
set.seed(1)
df <- data.frame(
c1 = sample(LETTERS, 10),
c2 = sample(LETTERS, 10),
c3 = sample(LETTERS, 10),
c4 = sample(LETTERS, 10),
c5 = sample(LETTERS, 10),
c6 = sample(LETTERS, 10),
c7 = sample(LETTERS, 10),
c8 = rexp(10, rate = 0.01),
c9 = rnorm(10, sd = 20)
)
> df
c1 c2 c3 c4 c5 c6 c7 c8 c9
1 G F Y M V M X 104.94389 13.939268
2 J E F O Q V H 67.88807 11.133264
3 N Q P L S K L 107.98811 -13.775114
4 U I C E M F Y 102.82469 -14.149903
5 E X Z S L B O 129.22616 7.291639
6 S K I Y Y C F 125.31054 15.370658
7 W O A P A G J 55.46414 -2.246924
8 M S H C J J V 30.12830 17.622155
9 L G X N N L B 129.31247 7.962118
10 B N V G Z T S 99.45558 -12.240528
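One thing worth checking before converting: as.integer() truncates toward zero rather than rounding, which may or may not be what you want for count data. Below is a minimal sketch on a made-up toy data frame (the names counts, first_sample, s1 and s2 are hypothetical, not from the question) that selects the sample columns by position and shows both behaviours side by side.

```r
# Hypothetical toy data: 2 annotation columns followed by 2 sample columns
counts <- data.frame(
  gene = c("g1", "g2", "g3"),
  code = c("a", "b", "c"),
  s1 = c(104.94, 67.88, 0.51),
  s2 = c(13.93, -13.77, 2.49),
  stringsAsFactors = FALSE
)

# Position of the first sample column (assumed; it would be 8 for the
# file described in the question)
first_sample <- 3
sample_cols <- first_sample:ncol(counts)

# as.integer() drops the fractional part (truncation toward zero)
truncated <- counts
truncated[sample_cols] <- lapply(truncated[sample_cols], as.integer)

# Rounding first gives the nearest integer instead
rounded <- counts
rounded[sample_cols] <- lapply(rounded[sample_cols],
                               function(x) as.integer(round(x)))
```

The annotation columns are never touched, so they stay character. Whether truncation or rounding is appropriate depends on how the non-integer values arose, so treat the choice as an assumption to verify against your pipeline.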
I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain a total for the numeric column (ideally I could turn on the Excel total row feature, as I did with firstColumn, while writing the table to the workbook object, rather than adding a total row manually).
I searched for a solution both on StackOverflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>
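If you would like the extra row to carry a label, as an Excel total row does, a small variation on the same idea might look like the sketch below. It assumes the first column is a character column that can hold the label; the names with_total and "Total" are just illustrative choices.

```r
# Same sample data as above
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)

# Sum numeric columns; leave the character columns as missing
totals <- lapply(df, function(col) {
  if (is.numeric(col)) sum(col) else NA_character_
})

# Assumption: the first column is character, so it can hold a label
totals[[1]] <- "Total"

# rbind() accepts a list as an extra row of a data frame
with_total <- rbind(df, totals)
```

The resulting data frame could then be passed as x to writeDataTable() exactly as in the question. Note that this writes the totals as plain values, not as Excel's built-in total row feature.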
Suppose that we have the following dataframe:
set.seed(1)
(tmp <- data.frame(x = 1:10, R1 = sample(LETTERS[1:5], 10, replace =
TRUE), R2 = sample(LETTERS[1:5], 10, replace = TRUE)))
x R1 R2
1 1 B B
2 2 B A
3 3 C D
4 4 E B
5 5 B D
6 6 E C
7 7 E D
8 8 D E
9 9 D B
10 10 A D
I want to do the following: if the difference between the level index
of factor R1 and that of factor R2 is an odd number, the levels of the
two factors need to be switched between them, which can be performed
through the following code:
for(ii in 1:dim(tmp)[1]) {
kk <- which(levels(tmp$R2) %in% tmp[ii,'R2'], arr.ind = TRUE) -
which(levels(tmp$R1) %in% tmp[ii,'R1'], arr.ind = TRUE)
if(kk%%2!=0) { # swap the levels between the two factors
qq <- tmp[ii,]$R1
tmp[ii,]$R1 <- tmp[ii,]$R2
tmp[ii,]$R2 <- qq
}
}
More concise and efficient ways to achieve this?
P.S. A slightly different situation is the following.
set.seed(1)
(tmp <- data.frame(x = 1:10, R1 = sample(LETTERS[1:5], 10, replace =
TRUE), R2 = sample(LETTERS[2:6], 10, replace = TRUE)))
x R1 R2
1 C B
2 B B
3 C E
4 E C
5 E B
6 D E
7 E E
8 D F
9 C D
10 A E
Notice that the factor levels between the two factors, R1 and R2, slide by one level; that is, factor R1 does not have level F while factor R2 does not have level A. I want to swap the factor levels based on the combined levels of the two factors as shown below:
tl <- unique(c(levels(tmp$R1), levels(tmp$R2)))
for(ii in 1:dim(tmp)[1]) {
kk <- which(tl %in% tmp[ii,'R2'], arr.ind = TRUE) - which(tl %in%
tmp[ii,'R1'], arr.ind = TRUE)
if(kk%%2!=0) { # swap the levels between the two factors
qq <- tmp[ii,]$R1
tmp[ii,]$R1 <- tmp[ii,]$R2
tmp[ii,]$R2 <- qq
}
}
How to go about this case? Thanks!
#Find out the indices where difference is odd
inds = abs(as.numeric(tmp$R1) - as.numeric(tmp$R2)) %% 2 != 0
#create new columns where values for the appropriate inds are from relevant columns
tmp$R1_new = replace(tmp$R1, inds, tmp$R2[inds])
tmp$R2_new = replace(tmp$R2, inds, tmp$R1[inds])
tmp
# x R1 R2 R1_new R2_new
#1 1 B B B B
#2 2 B A A B
#3 3 C D D C
#4 4 E B B E
#5 5 B D B D
#6 6 E C E C
#7 7 E D D E
#8 8 D E E D
#9 9 D B D B
#10 10 A D D A
Delete the old R1 and R2 if necessary
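For the P.S. case, where the two factors do not share the same levels, the same replace() idea should still work once both columns are re-levelled on the combined level set. Here is a minimal sketch on a small hand-made data frame (not the seeded one from the question); the columns are built explicitly with factor(), since recent R versions no longer turn strings into factors in data.frame() by default.

```r
# Hand-made example: R1 has levels A-E, R2 has levels B-F
tmp <- data.frame(
  x  = 1:4,
  R1 = factor(c("A", "C", "E", "B"), levels = LETTERS[1:5]),
  R2 = factor(c("B", "E", "F", "B"), levels = LETTERS[2:6])
)

# Combined level set, in order of first appearance: A, B, C, D, E, F
tl <- union(levels(tmp$R1), levels(tmp$R2))

# Re-level both columns so their integer codes are comparable
r1 <- factor(tmp$R1, levels = tl)
r2 <- factor(tmp$R2, levels = tl)

# Rows where the level indices differ by an odd number
odd <- (as.integer(r2) - as.integer(r1)) %% 2 != 0

# Swap the values on those rows only
swapped <- tmp
swapped$R1 <- replace(r1, odd, r2[odd])
swapped$R2 <- replace(r2, odd, r1[odd])
```

Here rows 1 and 3 are swapped (A/B and E/F differ by one level), while rows 2 and 4 are left alone.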
A solution using dplyr. dt is the final output. Notice that we need to use if_else from dplyr here, not the common ifelse from base R.
library(dplyr)
dt <- tmp %>%
mutate(R1_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R2, R1),
R2_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R1, R2)) %>%
select(x, R1 = R1_new, R2 = R2_new)
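To see why base ifelse is a poor fit here, the sketch below (base R only, with made-up vectors f1, f2 and cond) shows that ifelse() drops the factor class and hands back the underlying integer codes, whereas plain indexed assignment keeps the factor intact.

```r
# Two factors sharing the same levels
f1 <- factor(c("B", "C"), levels = LETTERS[1:5])
f2 <- factor(c("A", "D"), levels = LETTERS[1:5])
cond <- c(TRUE, FALSE)

# ifelse() loses the factor attributes: the result holds integer codes
via_ifelse <- ifelse(cond, f2, f1)

# Indexed assignment is a base R alternative that preserves the factor
via_index <- f1
via_index[cond] <- f2[cond]
```

via_ifelse is no longer a factor, so the labels would have to be restored by hand; via_index is still a factor with the original levels.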
Update
For the updated case, add one mutate call to redefine the factor levels of R1 and R2 on the combined level set. The rest is the same.
tl <- unique(c(levels(tmp$R1), levels(tmp$R2)))
dt <- tmp %>%
mutate(R1 = factor(R1, levels = tl), R2 = factor(R2, levels = tl)) %>%
mutate(R1_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R2, R1),
R2_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R1, R2)) %>%
select(x, R1 = R1_new, R2 = R2_new)
Here is an option using data.table
library(data.table)
setDT(tmp)[(as.integer(R1) - as.integer(R2))%%2 != 0, c('R2', 'R1') := .(R1, R2)]
tmp
# x R1 R2
#1: 1 B B
#2: 2 A B
#3: 3 D C
#4: 4 B E
#5: 5 B D
#6: 6 E C
#7: 7 D E
#8: 8 E D
#9: 9 D B
#10:10 D A
I would like to split a data.frame into a list based on row values/characters across all columns of the data.frame.
I wrote lists of data.frames to file using write.list {erer}
So now when I read them in again, they look like this:
dummy data
set.seed(1)
df <- cbind(data.frame(col1=c(sample(LETTERS, 4),"col1",sample(LETTERS, 7))),
data.frame(col2=c(sample(LETTERS, 4),"col2",sample(LETTERS, 7))),
data.frame(col3=c(sample(LETTERS, 4),"col3",sample(LETTERS, 7))))
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
5 col1 col2 col3
6 F M A
7 W R J
8 Y X U
9 P I H
10 N Y K
11 B T M
12 E E Y
And I would like to split into lists by c("col1","col2","col3") producing
[[1]]
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
[[2]]
col1 col2 col3
1 F M A
2 W R J
3 Y X U
4 P I H
5 N Y K
6 B T M
7 E E Y
Feels like it should be straightforward using split, but my attempts so far have failed. Also, as you see, I can't split by a certain row interval.
Any pointers would be highly appreciated, thanks!
Try
lapply(split(df, cumsum(grepl(names(df)[1], df$col1))), function(x) x[!grepl(names(df)[1], x$col1),])
#$`0`
# col1 col2 col3
#1 G E Q
#2 J R D
#3 N J G
#4 U Y I
#$`1`
# col1 col2 col3
#6 F M A
#7 W R J
#8 Y X U
#9 P I H
#10 N Y K
#11 B T M
#12 E E Y
This should be general, if you want to split if a line is exactly like the colnames:
dfSplit<-split(df,cumsum(Reduce("&",Map("==",df,colnames(df)))))
for (i in 2:length(dfSplit)) dfSplit[[i]]<-dfSplit[[i]][-1,]
The second line can be written in a more idiomatic R style, as @DavidArenburg suggested in the comments:
dfSplit[-1] <- lapply(dfSplit[-1], function(x) x[-1, ])
It also has the added benefit of doing nothing if dfSplit has length 1 (unlike my original second line, which would throw an error).
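A variation on the same idea is to drop the repeated header rows before splitting, so no second clean-up pass is needed. This is only a sketch on a small hand-made data frame; is_header, groups and pieces are illustrative names.

```r
# Small example with one embedded header row (row 3)
df <- data.frame(
  col1 = c("G", "J", "col1", "F", "W", "Y"),
  col2 = c("E", "R", "col2", "M", "R", "X"),
  stringsAsFactors = FALSE
)

# TRUE for rows where every cell equals its own column name
is_header <- Reduce(`&`, Map(`==`, df, colnames(df)))

# Running count of header rows seen so far gives the group id
groups <- cumsum(is_header)

# Split only the non-header rows, then reset row names in each piece
pieces <- lapply(
  split(df[!is_header, , drop = FALSE], groups[!is_header]),
  function(piece) {
    rownames(piece) <- NULL
    piece
  }
)
```

If the columns are factors rather than character vectors, the comparison still works, but you may want to call droplevels() on each piece so the header strings do not linger as unused levels.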
I have a data frame with several columns of varied character data. I want to find the average of each combination of that character data. I think I'm closing in on a solution, but am having trouble figuring out how to loop over characters. An example bit of data would be like:
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11
And trying to get it in the form of:
P1 P2 P3 Avg
a w j 11.667
a w k 10
a d j 20
a d k 7.5
a d l 23
b x L 20
b y k 17.5
c z j 15
c z k 11
c w l 45
I think the idea is something like:
test <- read.table("clipboard",header=T)
newdata <- subset(test,
Var1=='a'
& Var2=='w'
& Var3=='j',
select=M1
)
row.names(newdata)<-NULL
newdata2 <- as.data.frame(matrix(data=NA,nrow=3,ncol=4))
names(newdata2) <- c("P1","P2","P3","Avg")
newdata2[1,1] <- 'a'
newdata2[1,2] <- 'w'
newdata2[1,3] <- 'j'
newdata2[1,4] <- mean(newdata$M1)
Which works for the first line, but I'm not entirely sure how to automate this to loop over each character combination across the columns. Unless, of course, there's a similar apply-like function to use in this case?
library(dplyr)
newdata2 = summarise(group_by(test,Var1,Var2,Var3),Avg=mean(M1))
And the result:
> newdata2
Source: local data frame [10 x 4]
Groups: Var1, Var2
Var1 Var2 Var3 Avg
1 a d j 20.00000
2 a d k 7.50000
3 a d l 23.00000
4 a w j 11.66667
5 a w k 10.00000
6 b x L 20.00000
7 b y k 17.50000
8 c w l 45.00000
9 c z j 15.00000
10 c z k 11.00000
Using the base aggregate function:
mydata <- read.table(header=TRUE, text="
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")
aggdata <-aggregate(mydata$M1, by=list(mydata$Var1,mydata$Var2,mydata$Var3) , FUN=mean, na.rm=TRUE)
output:
> aggdata
Group.1 Group.2 Group.3 x
1 a d j 20.00000
2 a w j 11.66667
3 c z j 15.00000
4 a d k 7.50000
5 a w k 10.00000
6 b y k 17.50000
7 c z k 11.00000
8 a d l 23.00000
9 c w l 45.00000
10 b x L 20.00000
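If you would rather keep the original column names instead of Group.1, Group.2 and so on, the formula interface of aggregate is one option. A minimal sketch with the same data (agg, adk and awj are just illustrative names):

```r
mydata <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")

# Mean of M1 for every observed combination of Var1, Var2 and Var3
agg <- aggregate(M1 ~ Var1 + Var2 + Var3, data = mydata, FUN = mean)

# Rename the value column to match the desired output
names(agg)[names(agg) == "M1"] <- "Avg"

# Spot checks on two of the groups
adk <- agg$Avg[agg$Var1 == "a" & agg$Var2 == "d" & agg$Var3 == "k"]
awj <- agg$Avg[agg$Var1 == "a" & agg$Var2 == "w" & agg$Var3 == "j"]
```

Note that the formula interface drops rows with missing values by default (na.action = na.omit), which is slightly different from passing na.rm = TRUE to mean.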
I would like to create a loop that will create a new column, then paste together two columns if a condition is met in a separate column. If the condition is not met, then the column would equal whatever value is in the existing column. Finally, I would like to delete the old columns and rename the new columns to match the old columns. In my example below, I create columns called a1_t, a2_t, a3_t. Then, if a1 == A, paste a1 and a1_c together and place the value in a1_t, otherwise copy the value from a1 into a1_t. Repeat this procedure for a2_t and a3_t.
Here is the data:
set.seed(1)
dat <- data.frame(a1 = sample(LETTERS[1:9],15,replace=T),
a1_c = sample (1:100,15),
a2 = sample(LETTERS[1:9],15,replace=T),
a2_c = sample (1:100,15),
a3 = sample(LETTERS[1:9],15,replace=T),
a3_c = sample (1:100,15))
Here is the long hand way of creating my end goal:
dat$a1_t <- 'none'
dat$a1_t[dat$a1=="A"] <- paste((dat$a1[dat$a1=="A"]),(dat$a1_c[dat$a1=="A"]),sep="_")
dat$a1_t[dat$a1=="B"] <- 'B'
dat$a1_t[dat$a1=="C"] <- 'C'
dat$a1_t[dat$a1=="D"] <- 'D'
dat$a1_t[dat$a1=="E"] <- 'E'
dat$a1_t[dat$a1=="F"] <- 'F'
dat$a1_t[dat$a1=="G"] <- 'G'
dat$a1_t[dat$a1=="H"] <- 'H'
dat$a1_t[dat$a1=="I"] <- 'I'
dat$a2_t <- 'none'
dat$a2_t[dat$a2=="A"] <- paste((dat$a2[dat$a2=="A"]),(dat$a2_c[dat$a2=="A"]),sep="_")
dat$a2_t[dat$a2=="B"] <- 'B'
dat$a2_t[dat$a2=="C"] <- 'C'
dat$a2_t[dat$a2=="D"] <- 'D'
dat$a2_t[dat$a2=="E"] <- 'E'
dat$a2_t[dat$a2=="F"] <- 'F'
dat$a2_t[dat$a2=="G"] <- 'G'
dat$a2_t[dat$a2=="H"] <- 'H'
dat$a2_t[dat$a2=="I"] <- 'I'
dat$a3_t <- 'none'
dat$a3_t[dat$a3=="A"] <- paste((dat$a3[dat$a3=="A"]),(dat$a3_c[dat$a3=="A"]),sep="_")
dat$a3_t[dat$a3=="B"] <- 'B'
dat$a3_t[dat$a3=="C"] <- 'C'
dat$a3_t[dat$a3=="D"] <- 'D'
dat$a3_t[dat$a3=="E"] <- 'E'
dat$a3_t[dat$a3=="F"] <- 'F'
dat$a3_t[dat$a3=="G"] <- 'G'
dat$a3_t[dat$a3=="H"] <- 'H'
dat$a3_t[dat$a3=="I"] <- 'I'
If you are dealing with a small number of columns, you might just want to use within and ifelse, like this:
within(dat, {
a1_t <- ifelse(a1 == "A", paste(a1, a1_c, sep = "_"),
as.character(a1))
a2_t <- ifelse(a2 == "A", paste(a2, a2_c, sep = "_"),
as.character(a2))
a3_t <- ifelse(a3 == "A", paste(a3, a3_c, sep = "_"),
as.character(a3))
})
You can, however, extend the idea programmatically, if necessary.
I've added comments throughout the code below so you can see what it's doing.
## What variables are we checking?
checkMe <- c("a1", "a2", "a3")
## Let's convert those to character first
dat[checkMe] <- lapply(dat[checkMe], as.character)
cbind(dat, ## We'll combine the original data using cbind
setNames( ## setNames is for the resulting column names
lapply(checkMe, function(x) { ## lapply is an optimized loop
Get <- c(x, paste0(x, "_c")) ## We need this for the "if" part
ifelse(dat[, x] == "A", ## logical comparison
## if matched, paste together the value from
## the relevant column
paste(dat[, Get[1]], dat[, Get[2]], sep = "_"),
dat[, x]) ## else return the original value
}),
paste0(checkMe, "_t"))) ## the column names we want
# a1 a1_c a2 a2_c a3 a3_c a1_t a2_t a3_t
# 1 C 50 E 79 I 90 C E I
# 2 D 72 F 3 C 86 D F C
# 3 F 98 E 47 E 39 F E E
# 4 I 37 B 72 C 76 I B C
# 5 B 75 H 67 F 93 B H F
# 6 I 89 G 46 C 42 I G C
# 7 I 20 H 81 E 67 I H E
# 8 F 61 A 41 G 38 F A_41 G
# 9 F 12 G 23 A 30 F G A_30
# 10 A 25 D 7 H 69 A_25 D H
# 11 B 35 H 9 D 19 B H D
# 12 B 2 F 29 H 64 B F H
# 13 G 34 H 95 D 11 G H D
# 14 D 76 E 58 D 22 D E D
# 15 G 30 E 35 E 13 G E E
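Since the question also asks for the new columns to end up under the old names, another option is to overwrite a1, a2, ... in place with a plain for loop, which skips the create/delete/rename steps entirely. A minimal sketch on a small hand-made data frame (not the seeded data above):

```r
# Hand-made example with two letter columns and their companion columns
dat <- data.frame(
  a1 = c("A", "B", "C"), a1_c = c(25, 35, 2),
  a2 = c("D", "A", "F"), a2_c = c(7, 41, 29),
  stringsAsFactors = FALSE
)

checkMe <- c("a1", "a2")

for (v in checkMe) {
  values <- as.character(dat[[v]])      # works for factor columns too
  counts <- dat[[paste0(v, "_c")]]      # companion column, e.g. a1_c
  # Paste value and count together where the value is "A"; keep it otherwise
  dat[[v]] <- ifelse(values == "A", paste(values, counts, sep = "_"), values)
}
```

After the loop, a1 and a2 hold the combined values and the _c columns are unchanged; whether to drop those afterwards depends on what you need downstream.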