Changing names in a list of dataframes [duplicate] - r

I read all text files in the working directory into a list and keep only the first three columns:
all.files <- list.files(pattern = "\\.txt$")
data.list <- lapply(all.files, function(x)read.table(x, sep="\t"))
names(data.list) <- all.files
data.list <- lapply(data.list, function(x) x[,1:3])
I am ending up with a "list of 2"
> str(data.list)
List of 2
$ 001.txt:'data.frame': 71330 obs. of 3 variables:
..$ V1: Factor w/ 71321 levels
..$ V2: Factor w/ 1382 levels
..$ V3: num [1:71330] 89.1 99.5 98.8 99.4 99.5 ...
$ 002.txt:'data.frame': 98532 obs. of 3 variables
..$ V1: Factor w/ 98517 levels
..$ V2: Factor w/ 1348 levels
..$ V3: num [1:98532] 99.5 99 99.5 98.4 100 ...
I want to rename V1,V2,V3 according to
new.names<-c("query", "sbjct", "ident")
How is this possible with lapply?

You can use setNames, which lapply applies to each data frame in turn:
data.list <- lapply(data.list, setNames, new.names)
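As a minimal sketch of what that call does, the example below uses two small made-up data frames in place of the files from the question (their contents are assumptions for illustration only). It also shows the equivalent explicit form with an anonymous function.

```r
# Hypothetical stand-ins for the data frames read from 001.txt and 002.txt
data.list <- list(
  `001.txt` = data.frame(V1 = c("a", "b"), V2 = c("x", "y"), V3 = c(89.1, 99.5)),
  `002.txt` = data.frame(V1 = c("c", "d"), V2 = c("z", "w"), V3 = c(99.5, 99.0))
)
new.names <- c("query", "sbjct", "ident")

# setNames(df, new.names) returns df with its column names replaced
renamed <- lapply(data.list, setNames, new.names)

# Equivalent explicit form: assign the names, then return the data frame
renamed2 <- lapply(data.list, function(df) {
  names(df) <- new.names
  df
})

names(renamed[["001.txt"]])
```

The list element names (the file names) are kept, because lapply preserves the names of its input.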


Apply na.locf to multiple datasets

I have multiple datasets (e.g. data01, data02, ...). In all of these datasets, I want to apply na.locf to var1 and create a new variable 'var2' from the locf-applied 'var1'. I tried using the following code:
L=list(data01,data02)
for (i in L){i$var2 <- na.locf(i$var1)}
However, when I try to read the locf column using code:
head(data01$var2)
The result given is NULL.
There are a few problems:
In the question, i is a copy of each data frame, so L is not changed. Index into L to ensure that it is the data frame in L that is changed.
Use na.locf0, or equivalently na.locf(..., na.rm = FALSE), to ensure that the output is the same length as the input.
The data01 and data02 in L are copies of data01 and data02, and modifying one does not modify the other. That is why you get NULL.
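The copy behaviour described in the first and third points can be checked with base R alone. The sketch below uses a made-up data01 and a simple doubling in place of na.locf (both are assumptions for illustration), so it runs without zoo.

```r
# Hypothetical data frame standing in for data01
data01 <- data.frame(var1 = c(1, NA, 3))
L <- list(data01)

# Looping over the elements: i is a copy, so the new column is lost
for (i in L) i$var2 <- i$var1 * 2
in_list_after_copy_loop <- "var2" %in% names(L[[1]])    # FALSE

# Looping over the indices: the element of L itself is modified
for (k in seq_along(L)) L[[k]]$var2 <- L[[k]]$var1 * 2
in_list_after_index_loop <- "var2" %in% names(L[[1]])   # TRUE

# The object outside the list is a separate copy and is still untouched
in_original <- "var2" %in% names(data01)                # FALSE
```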
Using the built-in BOD data frame to construct sample input:
library(zoo)
# construct sample input
BOD1 <- BOD2 <- BOD
BOD1$Time[c(1, 3)] <- BOD2$Time[c(3, 5)] <- NA
L <- list(BOD1, BOD2)
for(i in seq_along(L)) L[[i]]$Time2 <- na.locf0(L[[i]]$Time)
giving:
str(L)
List of 2
$ :'data.frame': 6 obs. of 3 variables:
..$ Time : num [1:6] NA 2 NA 4 5 7
..$ demand: num [1:6] 8.3 10.3 19 16 15.6 19.8
..$ Time2 : num [1:6] NA 2 2 4 5 7
..- attr(*, "reference")= chr "A1.4, p. 270"
$ :'data.frame': 6 obs. of 3 variables:
..$ Time : num [1:6] 1 2 NA 4 NA 7
..$ demand: num [1:6] 8.3 10.3 19 16 15.6 19.8
..$ Time2 : num [1:6] 1 2 2 4 4 7
..- attr(*, "reference")= chr "A1.4, p. 270"
Either of these would also work; instead of modifying L, they produce a new list:
L2 <- lapply(L, function(x) { x$Time2 <- na.locf0(x$Time); x })
L3 <- lapply(L, transform, Time2 = na.locf0(Time))
If your aim is to modify BOD1 and BOD2 themselves, as opposed to creating a list with the modified BOD1 and BOD2, then the following would do that (although it is usually better to organize objects in a list if you intend to iterate over them rather than leave them loose in the global environment).
nms <- c("BOD1", "BOD2")
for(nm in nms) assign(nm, transform(get(nm), Time2 = na.locf0(Time)))
or
nms <- c("BOD1", "BOD2")
for(nm in nms) .GlobalEnv[[nm]]$Time2 <- na.locf0(.GlobalEnv[[nm]]$Time)
or other variations.

Adding a suffix to names when storing results in a loop

I am making some plots in R in a for-loop and would like to store them using a name to describe the function being plotted, but also which data it came from.
So when I have a list of 2 data sets "x" and "y" and the loop has a structure like this:
x = matrix(
c(1,2,4,5,6,7,8,9),
nrow=3,
ncol=2)
y = matrix(
c(20,40,60,80,100,120,140,160,180),
nrow=3,
ncol=2)
data <- list(x,y)
for (i in data){
??? <- boxplot(i)
}
I would like the ??? to be a fixed prefix, a "_" separator, and the name of the data set. In this case the 2 plots would be called "plot_x" and "plot_y".
I tried some things with paste("plot", names(i), sep = "_"), but I'm not sure whether this is what to use, or where and how to use it in this scenario.
We can create an empty list with the length same as that of the 'data' and then store the corresponding output from the for loop by looping over the sequence of 'data'
out <- vector('list', length(data))
for(i in seq_along(data)) {
out[[i]] <- boxplot(data[[i]])
}
str(out)
#List of 2
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 1 1.5 2 3 4 5 5.5 6 6.5 7
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 0.632 3.368 5.088 6.912
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 20 30 40 50 60 80 90 100 110 120
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 21.8 58.2 81.8 118.2
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
If required, set the names of the list elements with the object names
names(out) <- paste0("plot_", c("x", "y"))
It is better not to create multiple objects in the global environment. Instead, as shown above, place the objects in a list.
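If the input list is itself named, the names can be built with paste rather than typed by hand. The following is a sketch under that assumption; the matrices are simplified stand-ins for the ones in the question, and plot = FALSE is used only so that the statistics are computed without drawing.

```r
# Hypothetical named input list; naming it is what makes names(data) usable
data <- list(
  x = matrix(1:6, nrow = 3, ncol = 2),
  y = matrix(seq(20, 120, by = 20), nrow = 3, ncol = 2)
)

# lapply keeps the names of its input; plot = FALSE returns the stats only
out <- lapply(data, boxplot, plot = FALSE)

# Prefix every name with "plot_"
names(out) <- paste("plot", names(data), sep = "_")

names(out)
```

Individual results are then available as out$plot_x and out$plot_y, or as out[["plot_x"]] when the name is held in a variable.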
akrun is right: you should try to avoid creating names in the global environment. But if you really have to, you can try this:
> y = matrix(c(20,40,60,80,100,120,140,160,180),ncol=1)
> .GlobalEnv[[paste0("plot_","y")]] <- boxplot(y)
> str(plot_y)
List of 6
$ stats: num [1:5, 1] 20 60 100 140 180
$ n : num 9
$ conf : num [1:2, 1] 57.9 142.1
$ out : num(0)
$ group: num(0)
$ names: chr "1"
You can read up on .GlobalEnv by typing ?.GlobalEnv at the R command prompt.

Subsetting a dataframe by names in another dataframe [duplicate]

I have a very big reference file with thousands of pairwise comparisons between thousands of objects ("OTUs"). The dataframe is in long format:
'data.frame': 14845516 obs. of 3 variables:
$ OTU1 : chr "0" "0" "0" "0" ...
$ OTU2 : chr "8192" "1" "8194" "3" ...
$ gendist: num 78.7 77.8 77.6 74.4 75.3 ...
I also have a much smaller subset with observed data (slightly different structure):
'data.frame': 286903 obs. of 3 variables:
$ OTU1 : chr "1239" "1603" "2584" "1120" ...
$ OTU2 : chr "12136" "12136" "12136" "12136" ...
$ ecodist: num 2.08 1.85 2 1.73 1.53 ...
- attr(*, "na.action")=Class 'omit' Named int [1:287661] 1 759 760 1517 1518 1519 2275 2276 2277 2278 ...
.. ..- attr(*, "names")= chr [1:287661] "1" "759" "760" "1517" ...
Again, it's a pairwise comparison of objects ('OTUs'). All objects in the smaller dataset are also in the reference dataset.
I want to reduce the reference so that it only contains objects that are also found in the smaller dataset. It is very important that this is done on both columns (OTU1, OTU2).
Here is toy data:
library(reshape)
###reference
Ref <- cor(as.data.frame(matrix(rnorm(100),10,10)))
row.names(Ref) <- colnames(Ref) <- LETTERS[1:10]
Ref[upper.tri(Ref)] <- NA
diag(Ref) <- NA
Ref.m <- na.omit(melt(Ref, varnames = c('row', 'col')))
###query
tmp <- cor(as.data.frame(matrix(rnorm(25),5,5)))
row.names(tmp) <- colnames(tmp) <- LETTERS[1:5]
tmp[upper.tri(tmp)] <- NA
diag(tmp) <- NA
tmp.m <- na.omit(melt(tmp, varnames = c('row', 'col')))
The following works for me using your toy data:
Ref[rownames(tmp), colnames(tmp)]
This selects (by name) only those rows in Ref whose names are also the names of rows in tmp, and likewise for columns.
If you want to stick with the long format in the str outputs in the first part of your question, you can instead use something like:
data1[(data1$OTU1 %in% data2$OTU1) & (data1$OTU2 %in% data2$OTU2), ]
Here I'm creating a logical vector that indicates which rows of your reference data frame (data1) have their OTU1 entry somewhere in data2$OTU1, and the same for OTU2. Said logical vector is then used to select rows of data1.
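One caveat worth checking on your own data: the test above compares OTU1 only with OTU1 and OTU2 only with OTU2. If an object can appear in either column, or the two tables store a pair in opposite orientation (an assumption about the data, not something stated in the question), matching both columns against the union of observed objects may be closer to what is wanted. A small sketch with made-up values:

```r
# Hypothetical reference and observed tables in long format
ref <- data.frame(
  OTU1 = c("B", "C", "C", "D"),
  OTU2 = c("A", "A", "B", "C"),
  gendist = c(78.7, 77.8, 77.6, 74.4),
  stringsAsFactors = FALSE
)
obs <- data.frame(
  OTU1 = c("A", "B"),
  OTU2 = c("B", "C"),
  ecodist = c(2.08, 1.85),
  stringsAsFactors = FALSE
)

# Column-by-column test, as in the answer above
by_column <- ref[(ref$OTU1 %in% obs$OTU1) & (ref$OTU2 %in% obs$OTU2), ]

# Variant: every object seen in either column of the observed data
keep <- union(obs$OTU1, obs$OTU2)
by_union <- ref[(ref$OTU1 %in% keep) & (ref$OTU2 %in% keep), ]

nrow(by_column)  # 0: no pair passes the column-wise test in this toy data
nrow(by_union)   # 3: all pairs among A, B and C are kept
```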

How to make sublist/extract expression data of candidate genes from normalized microarray list

I have several processed microarray data (normalized, .txt files) from which I want to extract a list of 300 candidate genes (ILMN_IDs). I need in the output not only the gene names, but also the expression values and statistics info (already present in the original file).
I have 2 dataframes:
normalizedData with the identifiers (gene names) in the first column, named "Name".
candidateGenes with a single column named "Name", containing the identifiers.
I've tried
1).
all=normalizedData
subset=candidateGenes
x=all%in%subset
2).
all[which(all$gene_id %in% subset)] #(as suggested in other bioinf. forum)#,
but it returns a data frame with 0 columns and >4000 rows. This is not correct, since normalizedData has 24 columns. I have tried to compare them, but I always get an error.
The key is to be able to compare the first column of all ("Name") with subset. Here is the info:
> class(all)
[1] "data.frame"
> dim(all)
[1] 4312 24
> str(all)
'data.frame': 4312 obs. of 24 variables:
$ Name: Factor w/ 4312 levels "ILMN_1651253": 3401..
$ meanbgt:num 0 ..
$ meanbgc: num ..
$ cvt: num 0.11 ..
$ cvc: num 0.23 ..
$ meant: num 4618 ..
$ stderrt: num 314.6 ..
$ meanc: num 113.8 ...
$ stderrc: num 15.6 ...
$ ratio: num 40.6 ...
$ ratiose: num 6.21 ...
$ logratio: num 5.34 ...
$ tp: num 1.3e-04 ...
$ t2p: num 0.00476 ...
$ wilcoxonp: num 0.0809 ...
$ tq: num 0.0256 ...
$ t2q: num 0.165 ...
$ wilcoxonq: num 0.346 ...
$ limmap: num 4.03e-10 ...
$ limmapa: num 4.34e-06 ...
$ SYMBOL: Factor w/ 3696 levels "","A2LD1",..
$ ENSEMBL: Factor w/ 3143 levels "ENSG00000000003",..
and here is the info about subset:
> class(subset)
[1] "data.frame"
> dim(subset)
[1] 328 1
> str(subset)
'data.frame': 328 obs. of 1 variable:
$ V1: Factor w/ 328 levels "ILMN_1651429",..: 177 286 47 169 123 109 268 284 234 186 ...
I really appreciate your help!
What you need to do is
all[all$Name %in% subset$V1, ]
When using a data.frame, it's important to drill down to the correct column that has the data you actually want to use. You need to know which columns have the matching IDs. That's the only way in which this solution really differs from the other suggestions or the other things you've tried.
It's also important to note that when subsetting a data.frame by rows, you need to use the [,] syntax where the vector before the comma indicates rows and the vector after indicates columns. Here, since you want all columns, we leave it empty.
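A small self-contained sketch of the same subsetting follows, with made-up identifiers and values (all of them are assumptions for illustration). It also shows an optional variant with match(), for the case where the rows should come back in the order of the candidate list rather than in the order of the full table.

```r
# Hypothetical miniature versions of the two data frames
all_genes <- data.frame(
  Name = c("ILMN_1", "ILMN_2", "ILMN_3"),
  ratio = c(40.6, 1.2, 3.3),
  stringsAsFactors = FALSE
)
candidates <- data.frame(V1 = c("ILMN_3", "ILMN_1"), stringsAsFactors = FALSE)

# Rows whose Name is among the candidates; the empty slot after the comma keeps all columns
hits <- all_genes[all_genes$Name %in% candidates$V1, ]

# Optional: return the rows in the order of the candidate list instead
ordered_hits <- all_genes[match(candidates$V1, all_genes$Name), ]

hits$Name          # "ILMN_1" "ILMN_3"
ordered_hits$Name  # "ILMN_3" "ILMN_1"
```

Note that match() yields NA for a candidate that is missing from the full table, which would produce a row of NAs, so the %in% form is the safer default.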

as.numeric vs chr

I have the following code
fig4 <- data.frame(chads=NA,age=NA,treatment=NA,mean=NA,lower=NA,upper=NA)
fig4$chads <- as.factor(fig4$chads)
levels(fig4$chads) <- c(0,1,2,3,4,5,6)
fig4$age <- as.factor(fig4$age)
levels(fig4$age ) <- c("u80","o80")
fig4$treatment <- as.factor(fig4$treatment)
levels(fig4$treatment) <- c("OAC","OAP")
fig4$mean <- as.numeric(fig4$mean)
fig4$lower <- as.numeric(fig4$lower)
fig4$upper <- as.numeric(fig4$upper)
> str(fig4)
'data.frame': 1 obs. of 6 variables:
$ chads : Factor w/ 7 levels "0","1","2","3",..: NA
$ age : Factor w/ 2 levels "u80","o80": NA
$ treatment: Factor w/ 2 levels "OAC","OAP": NA
$ mean : num NA
$ lower : num NA
$ upper : num NA
So far so good. But then I do this:
vc <- as.vector(c(6,"o80","OAC",0.1,0.02,0.25), mode = "any")
fig4 <- rbind(fig4,vc)
which results in this:
> str(fig4)
'data.frame': 2 obs. of 6 variables:
$ chads : Factor w/ 7 levels "0","1","2","3",..: NA 7
$ age : Factor w/ 2 levels "u80","o80": NA 2
$ treatment: Factor w/ 2 levels "OAC","OAP": NA 1
$ mean : chr NA "0.1"
$ lower : chr NA "0.02"
$ upper : chr NA "0.25"
Why did the numeric vectors turn into character ones?
Lists can hold objects of multiple types, so to avoid your new data being converted to character, you can do:
fig4[nrow(fig4) + 1, ] <- list(6,"o80","OAC",0.1,0.02,0.25)
For the same reason a matrix would: both vector and matrix can hold only one type. Because you force character values into the mix, everything becomes character.
Use a data.frame to hold "columns" of different types, then subset individual columns.
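The coercion can be seen directly by comparing a vector with a list built from the same values. This is a minimal sketch with a reduced, made-up three-column data frame rather than the full fig4 from the question.

```r
# c() must pick a single type, so mixing numbers and strings gives character
vc <- c(6, "o80", 0.1)
class(vc)  # "character"

# Hypothetical reduced version of the data frame from the question
df <- data.frame(chads = 5, age = "u80", mean = 0.3, stringsAsFactors = FALSE)

# Appending the character vector turns the numeric columns into character
via_vector <- rbind(df, vc)

# Appending a list keeps each value's own type
via_list <- df
via_list[nrow(via_list) + 1, ] <- list(6, "o80", 0.1)

sapply(via_vector, class)
sapply(via_list, class)
```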
