I have a list of data frames. Each data frame has 6 rows and 6 columns. All the values are numbers, but every data frame stores its elements as class character.
Example:
$`A`
V1 V2 V3 V4 V5 V6
V1 0.1212 0.6231 0.4431 0.3213 0.6578 0.1259
V2 2.1234 0.6532 0.9845 0.8743 0.8732
V3 0.2314 0.7648 0.7634 0.8732
V4 0.1234 0.6544 0.3456
V5 0.7653 0.9812
V6 0.1265
$`B`
V1 V2 V3 V4 V5 V6
V1 0.2345 0.1234 0.5647 0.7891 0.6721 0.3259
V2 1.1334 0.4332 0.1245 0.2343 0.5332
V3 0.2914 0.1648 0.2334 0.1232
V4 0.1234 0.6744 0.5656
V5 0.3553 0.9812
V6 0.4665
I would like to convert every data frame in the list to a numeric matrix.
I tried:
lapply(list, data.matrix)
but the result contains integers instead of the original values. Example:
V1 V2 V3 V4 V5 V6
V1 2 2 2 2 2 4
V2 1 3 4 5 5 7
V3 1 1 3 4 6 3
V4 1 1 1 3 4 5
V5 1 1 1 1 1 1
V6 1 1 1 1 1 1
I also tried
lapply(list, as.matrix)
but I got a list of character matrices (the values are quoted), like this:
$`A`
V1 V2 V3 V4 V5 V6
V1 "0.1212" "0.6231" "0.4431" "0.3213" "0.6578" "0.1259"
V2 "2.1234" "0.6532" "0.9845" "0.8743" "0.8732"
V3 "0.2314" "0.7648" "0.7634" "0.8732"
V4 "0.1234" "0.6544" "0.3456"
V5 "0.7653" "0.9812"
V6 "0.1265"
How can I convert the character data frames in my list to numeric matrices?
We can loop over the list with lapply, then loop over the columns of each data.frame with a second lapply and convert each column to numeric. Assigning the result back with x[] <- keeps the data.frame structure, and the function then returns the data.frame ('x'):
list <- lapply(list, function(x) {x[] <- lapply(x, as.numeric);x})
If those are factor columns, convert to character first and then to numeric
lapply(list, function(x) {
  x[] <- lapply(x, function(y) as.numeric(as.character(y)))
  x
})
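If the goal is a list of numeric matrices rather than data frames, the same idea can be sketched as below. The sample list dfs_chr and the helper to_numeric_matrix are made up for illustration and are not from the question. Empty cells are assumed to be empty strings, which end up as NA. As far as I know, the integers returned by data.matrix() are factor codes, because it converts character columns to factors first.

```r
# Hypothetical sample data: a named list of data frames with character columns
dfs_chr <- list(
  A = data.frame(V1 = c("0.1212", "2.1234"),
                 V2 = c("0.6231", ""),
                 stringsAsFactors = FALSE),
  B = data.frame(V1 = c("0.2345", "1.1334"),
                 V2 = c("0.1234", ""),
                 stringsAsFactors = FALSE)
)

# Hypothetical helper: go through a character matrix, then switch its
# storage mode so that dimensions and column names are kept
to_numeric_matrix <- function(x) {
  m <- as.matrix(x)            # character matrix (also for factor columns)
  storage.mode(m) <- "double"  # parse the strings as numbers; "" becomes NA
  m
}

mats <- lapply(dfs_chr, to_numeric_matrix)
str(mats$A)
```

Going through as.matrix() first means factor columns are read by their labels, not by their integer codes.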
You can convert to numeric and then restore the matrix shape, since as.numeric() drops the dimensions:
lapply(dfs, function(x) matrix(as.numeric(x), ncol = n_cols))
Data
set.seed(1L)
n_cols <- 6
n_total <- 36
a <- matrix(rnorm(n_total), ncol = n_cols)
b <- matrix(rnorm(n_total), ncol = n_cols)
a[lower.tri(a)] <- ""
b[lower.tri(b)] <- ""
dfs <- list(a, b)
I have a very large data frame with 63 columns and 1697 rows. The values are packed to the left, so the ends of the rows fill up with NAs. I want each value to sit in the column whose name matches it, with NAs filling the gaps,
a bit like this (updated):
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2","NA","v2","v2","v2")
v3 <- c("v4","v4","v3","NA","v3","v3", "v3")
v4 <- c("v5","v5","v4","NA","v5","v4","NA")
v5 <- c("NA","NA","v5","NA","v6","v6", "NA")
v6 <- c("NA","NA","v6","NA","v7","v7","NA")
v7 <- c("NA","NA","NA","NA","NA","NA","NA")
df <- data.frame(v1,v2,v3,v4,v5,v6,v7)
df
v1 v2 v3 v4 v5 v6 v7
1 v1 v3 v4 v5 NA NA NA
2 v1 v2 v4 v5 NA NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 v5 v6 v7 NA
6 v1 v2 v3 v4 v6 v7 NA
7 v1 v2 v3 NA NA NA NA
but I would like everything aligned like this:
v1 v2 v3 v4 v5 v6 v7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
I tried map.values(), which didn't come out as expected, and ifelse(), but both require me to enter specific cell values and change them one by one.
The column names match the cell values.
I want to use the data for a presence/absence plot, so I figured that afterwards I can just run
for (i in 1:63){
  df[,i] <- gsub("NA", 0, df[,i])
}
and then do the same for anything containing "v", to get a binary 1 or 0 for presence or absence. But the values have to be aligned first.
There are no predefined rules governing the data. The data frame was combined from many other .csv files, and this is the best format I can currently get it into.
Any help would be appreciated!
Updated answer to match new input data
Data
I removed the quotation marks from NA:
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2",NA,"v2","v2","v2")
v3 <- c("v4","v4","v3",NA,"v3","v3", "v3")
v4 <- c("v5","v5","v4",NA,"v5","v4",NA)
v5 <- c(NA,NA,"v5",NA,"v6","v6", NA)
v6 <- c(NA,NA,"v6",NA,"v7","v7",NA)
v7 <- c(NA,NA,NA,NA,NA,NA,NA)
df <- data.frame(v1,v2,v3,v4,v5,v6,v7, stringsAsFactors = FALSE)
Code
l <- list()
u <- c("v1", "v2", "v3", "v4", "v5", "v6", "v7")
for (k in seq_len(nrow(df))) {
  # take row k of the df as a plain character vector
  row_k <- unlist(df[k, ])
  # "help" vector, so that the row is not overwritten while it is being read
  h <- rep(NA_character_, length(u))
  for (i in seq_along(u)) {
    # check if the value exists in the row
    if (u[i] %in% row_k) {
      # find the index of the value, given that it exists
      a <- which(row_k == u[i])[1]
      # copy it to the position that matches its column
      h[i] <- row_k[a]
    }
    # values that do not exist in the row stay NA
  }
  # replace the row by the sorted vector with NA placeholders ("help" vector)
  l[[k]] <- h
}
Result
df1 <- as.data.frame(do.call(rbind, l))
df1
V1 V2 V3 V4 V5 V6 V7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
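Since the stated goal is a presence/absence plot, a loop-free variant may also be worth sketching. It assumes, as the question says, that the column names equal the cell values. The object names present, aligned and binary are made up for this example.

```r
# Same data as above, with real NA values
df <- data.frame(
  v1 = c("v1", "v1", "v1", "v1", "v1", "v1", "v1"),
  v2 = c("v3", "v2", "v2", NA, "v2", "v2", "v2"),
  v3 = c("v4", "v4", "v3", NA, "v3", "v3", "v3"),
  v4 = c("v5", "v5", "v4", NA, "v5", "v4", NA),
  v5 = c(NA, NA, "v5", NA, "v6", "v6", NA),
  v6 = c(NA, NA, "v6", NA, "v7", "v7", NA),
  v7 = NA_character_,
  stringsAsFactors = FALSE
)

# Assumption: every possible cell value is also a column name
u <- names(df)

# For each row, test which of the labels occur anywhere in that row.
# apply() returns one column per row of df, hence the transpose.
present <- t(apply(df, 1, function(r) u %in% r))
colnames(present) <- u

# Aligned character version: the label where present, NA otherwise
aligned <- ifelse(present, u[col(present)], NA)

# Binary 1/0 matrix for the presence/absence plot
binary <- present + 0L
binary
```

The binary matrix skips the gsub() step from the question, because TRUE/FALSE converts directly to 1/0.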
I have two data frames, one of which has a large list of two identifiers:
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
And one which contains one of the two identifiers:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
What I want is to replace the ID in the second data frame (column V2) with the corresponding rsid from the first data frame wherever it matches a uniq_id. (I couldn't think of a succinct way to phrase that for the title of this question, which is why it is worded so awkwardly and why I may not have found duplicates.) I.e.:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 rs6682375 0 14907 G A
My solution at present is to use a for ... if loop as follows:
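for (x in 1:nrow(df2)){
  if (df2$V2[x] %in% df1$uniq_id){
    # look up the rsid in the df1 row whose uniq_id matches, not in row x of df1
    df2$V2[x] = df1$rsid[df1$uniq_id == df2$V2[x]]
  }
}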
However, because both files are extremely large, I believe that this is likely a very inefficient way of doing this and am wondering if there is a faster method.
Someone suggested that the match() function might be quicker. However, the R documentation suggests that %in% is the more intuitive interface, and I have no experience with match(), so I'm not sure how to apply it here.
Any help appreciated.
This is an update-join, in data.table terminology. Assuming the first table is called df and the second is called df2:
library(data.table)
setDT(df)
setDT(df2)
df2[df, on = .(V2 = uniq_id), V2 := rsid]
df2
# V1 V2 V3 V4 V5 V6
# 1: 1 1_10439_A_AC 0 10439 A AC
# 2: 1 1_13417_CGAGA_C 0 13417 C CGAGA
# 3: 1 rs6682375 0 14907 G A
Data used
df <- fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
A method using merge() from base R. It would also be easy to do this with dplyr and its left_join() function if desired. Note that merge() moves the key column V2 to the front, as the output shows.
df <- data.table::fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- data.table::fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
df2 <- merge(df2, df, by.x = "V2", by.y = "uniq_id", all.x = TRUE)
df2$V2 <- ifelse(!is.na(df2$rsid), df2$rsid, df2$V2)
df2$rsid <- NULL
df2
# V2 V1 V3 V4 V5 V6
# 1: 1_10439_A_AC 1 0 10439 A AC
# 2: 1_13417_CGAGA_C 1 0 13417 C CGAGA
# 3: rs6682375 1 0 14907 G A
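For completeness, the match() approach mentioned in the question can be sketched in plain base R as below. The data frames are rebuilt by hand from the example rows, and the names idx and hit are made up for illustration. The sketch assumes each uniq_id appears at most once in df1.

```r
# Example rows from the question, rebuilt as plain data frames
df1 <- data.frame(
  rsid    = c("rs796086906", "rs546169444", "rs6682375", "rs6682385"),
  uniq_id = c("1_13868_G_A", "1_14464_T_A", "1_14907_G_A", "1_14930_G_A"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  V1 = 1,
  V2 = c("1_10439_A_AC", "1_13417_CGAGA_C", "1_14907_G_A"),
  V3 = 0,
  V4 = c(10439, 13417, 14907),
  V5 = c("A", "C", "G"),
  V6 = c("AC", "CGAGA", "A"),
  stringsAsFactors = FALSE
)

# For every V2, the position of the first matching uniq_id in df1 (NA if none)
idx <- match(df2$V2, df1$uniq_id)

# Overwrite only the rows that found a match
hit <- !is.na(idx)
df2$V2[hit] <- df1$rsid[idx[hit]]
df2
```

Unlike merge(), this keeps the row and column order of df2. match() returns positions where %in% only returns TRUE/FALSE, which is why it can replace the loop in a single vectorised step.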
I have this line in one of my functions: result[result>0.05] <- "". It replaces all values in my data frame greater than 0.05, including the row names in the first column. How can I avoid this?
This is a fast way too; df[-1] leaves the first column out of both the test and the assignment:
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
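If the label column is not always the first one, the same idea can be sketched by selecting the numeric columns by type. The data frame below and the name num_cols are made up for illustration. NA is used here where the answer uses '' because assigning '' turns the affected columns into character.

```r
# Hypothetical example: a label column followed by two numeric columns
result <- data.frame(
  name = c("a", "b", "c"),
  p1   = c(0.01, 0.20, 0.04),
  p2   = c(0.50, 0.03, 0.90),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric columns
num_cols <- vapply(result, is.numeric, logical(1))

# Blank out values above the threshold in the numeric columns only
result[num_cols][result[num_cols] > 0.05] <- NA
result
```

The name column is never compared against 0.05, so the labels survive unchanged.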
I have a data.frame with 16 columns. Here's one example row.
> data[16,]
V1 V2 V3 V4
16 comp27182_c0_seq4 ENSP00000442096 ENSG00000011143 ENSFCAP00000011376
V5 V6 V7 V8
16 ENSFCAG00000012261 comp48601_c0_seq1 comp19130_c0_seq3 comp22796_c2_seq3
V9 V10 V11 V12
16 comp146901_c0_seq1 comp157916_c0_seq1 comp158124_c0_seq1
V13 V14 V15 V16
16 comp229797_c0_seq1 comp61875_c0_seq2
I'm only interested in columns 1 and 6-16. The first column contains the name I would like to use as a row name in the matrix; columns 6 to 16 may contain either a string or '' (nothing).
I would like to transform this data.frame into a matrix showing 1 or 0, reflecting the content in columns 6-16.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
comp27182_c0_seq4 1 1 1 1 0 1 1 1 1 0 0
I've tried using a mask, without success. I'm sure there's a very easy option out there.
Thanks for any help.
Try this:
do.call(cbind, lapply(c(1,6:16),
function(x) as.numeric(nchar(as.character(data[,x])) > 0)))
I slightly modified your code to fit my exact needs. Now the first column names the rows.
a <- do.call(cbind, lapply(6:16,
  function(x) as.numeric(nchar(as.character(data[, x])) > 0)))
rownames(a) <- data[, 1]
It works great, thanks!
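The same result can also be sketched without the lapply/cbind step by comparing the whole block of columns against the empty string. The two-row data frame below is invented for the example, and it assumes empty cells are '' and not NA.

```r
# Hypothetical stand-in for the real data: 16 character columns, 2 rows
data <- data.frame(matrix("", nrow = 2, ncol = 16), stringsAsFactors = FALSE)
names(data) <- paste0("V", 1:16)
data$V1 <- c("id_a", "id_b")   # made-up identifiers
data[1, 6:8] <- "hit"          # three filled cells in row 1
data[2, 16]  <- "hit"          # one filled cell in row 2

# TRUE where a cell is non-empty; adding 0L turns TRUE/FALSE into 1/0
a <- (as.matrix(data[, 6:16]) != "") + 0L
rownames(a) <- data$V1
a
```

This version keeps the original column names V6 to V16 on the matrix.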