I am new to R, so it may be that some of the concepts here are not fully correct...
I have a set of files that I read into a list (only the first 3 lines of each are shown):
myfiles<-lapply(list.files(".",pattern="tab",full.names=T),read.table,skip="#")
myfiles
[[1]]
V1 V2 V3
1 10001 33 -0.0499469
2 30001 65 0.0991478
3 50001 54 0.1564400
[[2]]
V1 V2 V3
1 10001 62 0.0855260
2 30001 74 0.1536640
3 50001 71 0.1020960
[[3]]
V1 V2 V3
1 10001 49 -0.04661360
2 30001 65 0.16961500
3 50001 61 0.07089600
I want to apply an ifelse condition to substitute values in a column and then get back exactly the same list. However, when I do this:
myfiles<-lapply(myfiles,function(x) ifelse(x$V2>50, x$V3, NA))
myfiles
[[1]]
[1] NA 0.0991478 0.1564400
[[2]]
[1] 0.0855260 0.1536640 0.1020960
[[3]]
[1] NA 0.16961500 0.07089600
it does in fact do what I want, but it returns only the column the function was applied to, and I want it to return the same list as before, with all 3 columns (but with the substitutions).
I guess there should be an easy way to do this with some variant of "apply", but I was not able to find or work it out.
Thanks
You can use lapply and transform/within. There are three possibilities:
a) ifelse
lapply(myfiles, transform, V3 = ifelse(V2 > 50, V3, NA))
b) mathematical operators (potentially more efficient)
lapply(myfiles, transform, V3 = NA ^ (V2 <= 50) * V3)
c) is.na<-
lapply(myfiles, within, is.na(V3) <- V2 <= 50)
The result
[[1]]
V1 V2 V3
1 10001 33 NA
2 30001 65 0.0991478
3 50001 54 0.1564400
[[2]]
V1 V2 V3
1 10001 62 0.085526
2 30001 74 0.153664
3 50001 71 0.102096
[[3]]
V1 V2 V3
1 10001 49 NA
2 30001 65 0.169615
3 50001 61 0.070896
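A note on option (b): the trick relies on NA^0 evaluating to 1 and NA^1 evaluating to NA in R, so the exponent acts as a keep/blank mask on V3. A minimal sketch:
NA ^ c(0, 1)                  # [1]  1 NA
v2 <- c(60, 40); v3 <- c(0.1, 0.2)
NA ^ (v2 <= 50) * v3          # [1] 0.1  NA  (values with v2 <= 50 become NA)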
Perhaps this helps
lapply(myfiles, within, V3 <- ifelse(V2 > 50, V3, NA))
#[[1]]
# V1 V2 V3
#1 10001 33 NA
#2 30001 65 0.0991478
#3 50001 54 0.1564400
#[[2]]
# V1 V2 V3
#1 10001 62 0.085526
#2 30001 74 0.153664
#3 50001 71 0.102096
#[[3]]
# V1 V2 V3
#1 10001 49 NA
#2 30001 65 0.169615
#3 50001 61 0.070896
Update
Another option would be to read the files using fread from data.table, which is fast:
library(data.table)
files <- list.files(pattern='tab')
lapply(files, function(x) fread(x)[V2 <= 50, V3 := NA])
#[[1]]
# V1 V2 V3
#1: 10001 33 NA
#2: 30001 65 0.0991478
#3: 50001 54 0.1564400
#[[2]]
# V1 V2 V3
#1: 10001 62 0.085526
#2: 30001 74 0.153664
#3: 50001 71 0.102096
#[[3]]
# V1 V2 V3
#1: 10001 49 NA
#2: 30001 65 0.169615
#3: 50001 61 0.070896
Or, as @Richie Cotton mentioned, you could also bind the datasets together using rbindlist and then do the operation in one step.
library(tools)
dt1 <- rbindlist(lapply(files, function(x)
  fread(x)[, id := basename(file_path_sans_ext(x))]))[V2 <= 50, V3 := NA]
dt1
# V1 V2 V3 id
#1: 10001 33 NA tab1
#2: 30001 65 0.0991478 tab1
#3: 50001 54 0.1564400 tab1
#4: 10001 62 0.0855260 tab2
#5: 30001 74 0.1536640 tab2
#6: 50001 71 0.1020960 tab2
#7: 10001 49 NA tab3
#8: 30001 65 0.1696150 tab3
#9: 50001 61 0.0708960 tab3
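A related shortcut: rbindlist() can generate the id column itself via its idcol argument (available in newer data.table versions), though it gives an integer index rather than the file name:
dt2 <- rbindlist(lapply(files, fread), idcol = "id")[V2 <= 50, V3 := NA]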
This seems harder than it should be because you are working with a list of data frames rather than a single data frame. You can combine all the data frames into one using rbind_all from dplyr (superseded by bind_rows in current dplyr).
library(dplyr)
# Some variable renaming for clarity:
# myfiles now refers to the file names; mydata now contains the data
myfiles <- list.files(pattern="tab", full.names=TRUE)
mydata <- lapply(myfiles, read.table, comment.char = "#")  # note: skip expects an integer; "#" lines are treated as comments
# Get the number of rows in each data frame
n_rows <- vapply(mydata, nrow, integer(1))
# Combine the list of data frames into a single data frame
all_mydata <- rbind_all(mydata)
# Add an identifier to see which data frame the row came from.
all_mydata$file <- rep(myfiles, times = n_rows)
# Now update column 3
is.na(all_mydata$V3) <- all_mydata$V2 <= 50
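If you then want the original list-of-data-frames shape back, one way (a sketch using the objects defined above) is to split on the identifier column and drop it:
# recover a list of data frames, one per input file
mydata_fixed <- lapply(split(all_mydata, all_mydata$file),
                       function(d) d[, c("V1", "V2", "V3")])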
Try adding an id column for each df and binding them together:
for(i in 1:3) myfiles[[i]]$id = i
ddf = myfiles[[1]]
for(i in 2:3) ddf = rbind(ddf, myfiles[[i]])
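As an aside, the incremental rbind can be collapsed into a single call with the same result:
ddf = do.call(rbind, myfiles)  # bind the whole list of data frames at once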
Then apply the changes on the composite df and split it back again:
ddf$V3 = ifelse(ddf$V2>50, ddf$V3, NA)
myfiles = lapply(split(ddf, ddf$id), function(x) x[1:3])
myfiles
$`1`
V1 V2 V3
1 10001 33 NA
2 30001 65 0.0991478
3 50001 54 0.1564400
$`2`
V1 V2 V3
11 10001 62 0.085526
21 30001 74 0.153664
31 50001 71 0.102096
$`3`
V1 V2 V3
12 10001 49 NA
22 30001 65 0.169615
32 50001 61 0.070896
Related
I have a dataset with 8 variables. When I run the dplyr code below, my output dataframe has only the variables I used in the dplyr syntax, while I want all the variables:
ShowID <- MyData %>%
  group_by(ID) %>%
  summarize(count = n()) %>%
  filter(count == min(count))
ShowID
So my output has only two variables, ID and count. How do I get the rest of my variables into the new dataframe? Why is this happening, and what am I clueless about here?
> ncol(ShowID)
[1] 2
> ncol(MyData)
[1] 8
MYDATA
key ID v1 v2 v3 v4 v5 v6
0-0-70cf97 1 89 20 30 45 55 65
3ad4893b8c 1 4 5 45 45 55 65
0-0-70cf97d7 2 848 20 52 66 56 56
0-0-70cf 2 54 4 846 65 5 5
0-0-793b8c 3 56454 28 6 4 5 65
0-0-70cf98 2 8 4654 30 65 6 21
3ad4893b8c 2 89 66 518 156 16 65
0-0-70cf97d8 3 89 20 161 1 55 45465
0-0-70cf 5 89 79 48 45 55 456
0-0-793b8c 5 89 20 48 545 654 4
0-0-70cf99 6 9 20 30 45 55 65
DESIRED
key ID count v1 v2 v3 v4 v5 v6
0-0-70cf99 6 1 9 20 30 45 55 65
RESULT FROM CODE
ID count
6 1
You can use the base R ave function to calculate the number of rows in each group (ID) and then select the group(s) with the minimum number of rows.
# ave() returns a vector as long as the data, giving each row its group's size
num_rows <- ave(MyData$v1, MyData$ID, FUN = length)
# keep the rows belonging to the smallest group(s)
MyData[which(num_rows == min(num_rows)), ]
# key ID v1 v2 v3 v4 v5 v6
#11 0-0-70cf99 6 9 20 30 45 55 65
You could also use which.min here to save a step; however, it returns only the first index, so it would fail with multiple minimum values. Hence I have used which.
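A quick illustration of the difference:
x <- c(1, 3, 1)
which.min(x)        # 1   (only the first minimum)
which(x == min(x))  # 1 3 (all minima)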
No need to summarize: mutate() adds the count to every row while keeping all the other columns, whereas summarize() collapses each group down to the grouping and summary variables:
ShowID <- MyData %>%
group_by(ID) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count == min(count))
Here is an example of my list (the real one actually has > 2,000 data frames):
df1 = read.table(text = 'a b
1 66
1 999
23 89', header = TRUE)
df2 = read.table(text = 'a b
99 61
32 99
83 19', header = TRUE)
lst = list(df1, df2)
I need to create a new column for each data.frame within the list and populate each column with a specific number.
numbers = c(100, 200)
so my output should be:
> lst
[[1]]
a b new_col
1 1 66 100
2 1 999 100
3 23 89 100
[[2]]
a b new_col
1 99 61 200
2 32 99 200
3 83 19 200
With lapply I was able to create a new blank column for each data.frame:
lst = lapply(lst, cbind, new_col = '')
> lst
[[1]]
a b new_col
1 1 66
2 1 999
3 23 89
[[2]]
a b new_col
1 99 61
2 32 99
3 83 19
But I don't know how to populate the columns with my vector of numbers.
Thanks
To iterate over the list of data.frames and the vector of numbers at the same time, use Map(). For example:
Map(cbind, lst, new_col=numbers)
# [[1]]
# a b new_col
# 1 1 66 100
# 2 1 999 100
# 3 23 89 100
#
# [[2]]
# a b new_col
# 1 99 61 200
# 2 32 99 200
# 3 83 19 200
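Under the hood, Map() is just a wrapper around mapply() with SIMPLIFY = FALSE, so an equivalent call would be:
mapply(cbind, lst, new_col = numbers, SIMPLIFY = FALSE)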
I was reading this post (sort matrix) and I was curious whether there is an equivalent in the data.table package, i.e., a way to sort the columns of a matrix independently?
mat <- matrix(c(45,34,1,3,4325,23,1,2,5,7,3,4,32,734,2),ncol=3)
I would like something like:
sort <- matrix(c(1,3,34,45,4325,1,2,5,7,23,2,3,4,32,734),ncol=3)
Thanks!
mat <- matrix(c(45,34,1,3,4325,23,1,2,5,7,3,4,32,734,2),ncol=3)
library(data.table)
DT <- as.data.table(mat)
# V1 V2 V3
#1: 45 23 3
#2: 34 1 4
#3: 1 2 32
#4: 3 5 734
#5: 4325 7 2
DT[, lapply(.SD, sort, method = "radix")]
# V1 V2 V3
#1: 1 1 2
#2: 3 2 3
#3: 34 5 4
#4: 45 7 32
#5: 4325 23 734
You can just use apply, like so:
apply(mat, 2, sort)
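Note that apply() returns a matrix here because sort() yields a vector of the same length for every column; if the columns contained NAs (which sort() drops by default), the lengths could differ and apply() would return a list instead. A quick check:
sorted <- apply(mat, 2, sort)
identical(dim(sorted), dim(mat))  # TRUE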
How can I melt a data frame row by row?
I found a really similar question on the forum but I still can't solve my problem without a different id variable.
This is my data set:
V1 V2 V3 V4 V5
51 20 29 12 20
51 22 51 NA NA
51 14 NA NA NA
51 75 NA NA NA
And I want to melt it into:
V1 variable value
51 V2 20
51 V3 29
51 V4 12
51 V5 20
51 V2 22
51 V3 51
51 V2 14
51 V2 75
Currently my approach is to melt it row by row with a for loop and then rbind the results together.
library(reshape)
df <- read.table(text = "V1 V2 V3 V4 V5
51 20 29 12 20
51 22 51 NA NA
51 14 NA NA NA
51 75 NA NA NA", header = TRUE)
dfall<-NULL
for (i in 1:NROW(df))
{
dfmelt<-melt(df[i,],id="V1",na.rm=TRUE)
dfall<-rbind(dfall,dfmelt)
}
Just wondering if there is any way to do this faster? Thanks!
We replicate the first column "V1" and the remaining column names to create the first and second columns of the expected output, while the 'value' column is created by transposing the dataset without its first column (so the values are read off row by row).
na.omit(data.frame(V1 = df1[1][col(df1[-1])],                # V1 repeated for every cell
                   variable = names(df1)[-1][row(df1[-1])],  # column labels matched to each value
                   value = c(t(df1[-1]))))                   # transpose => values in row order
# V1 variable value
#1 51 V2 20
#2 51 V3 29
#3 51 V4 12
#4 51 V5 20
#5 51 V2 22
#6 51 V3 51
#9 51 V2 14
#13 51 V2 75
NOTE: No additional packages used.
Or we can use gather (from tidyr) to convert from 'wide' to 'long' format after creating a row id column (add_rownames from dplyr), and then arrange the rows.
library(dplyr)
library(tidyr)
add_rownames(df1) %>%
gather(variable, value, V2:V5, na.rm=TRUE) %>%
arrange(rowname, V1) %>%
select(-rowname)
# V1 variable value
# (int) (chr) (int)
#1 51 V2 20
#2 51 V3 29
#3 51 V4 12
#4 51 V5 20
#5 51 V2 22
#6 51 V3 51
#7 51 V2 14
#8 51 V2 75
Or with data.table
library(data.table)
melt(setDT(df1, keep.rownames=TRUE),
id.var= c("rn", "V1"), na.rm=TRUE)[
order(rn, V1)][, rn:= NULL][]
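For reference, the trailing [] after := just forces the result to print, since := modifies the data.table by reference and returns it invisibly. A tiny sketch (with data.table loaded):
dt <- data.table(a = 1:2)
dt[, b := a * 2]    # modifies dt by reference, prints nothing
dt[, b := a * 2][]  # same modification, and prints the result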
You can make a column with a unique ID for each row, so you can sort on it after melting. Using dplyr:
library(reshape2)
library(dplyr)
df %>% mutate(id = seq_len(n())) %>%
melt(id.var = c('V1','id'), na.rm = T) %>%
arrange(V1, id, variable) %>%
select(-id)
# V1 variable value
# 1 51 V2 20
# 2 51 V3 29
# 3 51 V4 12
# 4 51 V5 20
# 5 51 V2 22
# 6 51 V3 51
# 7 51 V2 14
# 8 51 V2 75
...or the same idea without dplyr, using just reshape2 and base R:
library(reshape2)
df$id <- seq_along(df$V1)
df2 <- melt(df, id.var = c('V1', 'id'), na.rm = TRUE)
df2[order(df2$V1, df2$id, df2$variable),-2]
I have a question about the Reduce function in R. I read its documentation, but I am still a bit confused. So, I have 5 vectors with gene names. For example:
v1 <- c("geneA","geneB",""...)
v2 <- c("geneA","geneC",""...)
v3 <- c("geneD","geneE",""...)
v4 <- c("geneA","geneE",""...)
v5 <- c("geneB","geneC",""...)
And I would like to find out which genes are present in at least two vectors. Some people have suggested:
Reduce(intersect,list(a,b,c,d,e))
I would greatly appreciate it if someone could explain how this statement works, because I have seen Reduce used in other scenarios as well.
Reduce takes a binary function and a list of data items and successively applies the function to the list elements in a recursive fashion. For example:
Reduce(intersect,list(a,b,c))
is the same as
intersect(intersect(a, b), c)
However, I don't think that construct will help you here as it will only return those elements that are common to all vectors.
To count the number of vectors that a gene appears in you could do the following:
vlist <- list(v1,v2,v3,v4,v5)
addmargins(table(gene = unlist(vlist),
                 vec = rep(paste0("v", 1:5), times = sapply(vlist, length))),
           2, list(Count = function(x) sum(x[x > 0])))
vec
gene v1 v2 v3 v4 v5 Count
geneA 1 1 0 1 0 3
geneB 1 0 0 0 1 2
geneC 0 1 0 0 1 2
geneD 0 0 1 0 0 1
geneE 0 0 1 1 0 2
A nice way to see what Reduce() is doing is to run it with its argument accumulate=TRUE. With accumulate=TRUE, it returns a vector or list in which each element is the accumulated result after processing the first n elements of the input list x. Here are a couple of examples:
Reduce(`*`, x=list(5,4,3,2), accumulate=TRUE)
# [1] 5 20 60 120
i2 <- seq(0,100,by=2)
i3 <- seq(0,100,by=3)
i5 <- seq(0,100,by=5)
Reduce(intersect, x=list(i2,i3,i5), accumulate=TRUE)
# [[1]]
# [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36
# [20] 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74
# [39] 76 78 80 82 84 86 88 90 92 94 96 98 100
#
# [[2]]
# [1] 0 6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96
#
# [[3]]
# [1] 0 30 60 90
Assuming the input values given at the end of this answer, the expression
Reduce(intersect,list(a,b,c,d,e))
## character(0)
gives the genes that are present in all vectors, not the genes that are present in at least two vectors. It means:
intersect(intersect(intersect(intersect(a, b), c), d), e)
## character(0)
If we want the genes that are in at least two vectors:
L <- list(a, b, c, d, e)
u <- unlist(lapply(L, unique)) # or: Reduce(c, lapply(L, unique))
tab <- table(u)
names(tab[tab > 1])
## [1] "geneA" "geneB" "geneC" "geneE"
or
sort(unique(u[duplicated(u)]))
## [1] "geneA" "geneB" "geneC" "geneE"
Note: We used:
a <- c("geneA","geneB")
b <- c("geneA","geneC")
c <- c("geneD","geneE")
d <- c("geneA","geneE")
e <- c("geneB","geneC")