R: extracting data conditionally

I have a simulation dataset that explores a region of parameter space, and each parameter set is run multiple times (iterations). It looks like this:
p1 p2 p3 iteration result
=================================
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36
...
As this example shows, both (v3, v2, v1) and (v2, v1, v3) are run twice. I am trying to extract only the rows with the maximum result for each parameter setting; in this example,
only rows 3 and 4 should be kept, as they represent the best result for each parameter set. Is there an easy way to accomplish that in R? Thanks

df <- read.table(textConnection("p1 p2 p3 iteration result
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36"), header = T)
library(plyr)
ddply(df, .(p1,p2,p3), function(x) return(x[(which(x$result == max(x$result))), ]))
p1 p2 p3 iteration result
1 v2 v1 v3 1 29.36
2 v3 v2 v1 2 28.80
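If dplyr is available, a minimal sketch of the same grouping idea (assuming dplyr >= 1.0.0 for slice_max()):
library(dplyr)
# keep the row(s) with the largest result within each parameter combination;
# ties on result are all kept, matching the ddply behaviour above
df %>%
  group_by(p1, p2, p3) %>%
  slice_max(result, n = 1) %>%
  ungroup()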

Related

Convert a list of class data.frame objects into objects of class matrix

I have a list of data frames. Each data frame has 6 rows and 6 columns. The values are all numbers; however, every data frame stores its elements as class character.
Example:
$`A`
V1 V2 V3 V4 V5 V6
V1 0.1212 0.6231 0.4431 0.3213 0.6578 0.1259
V2 2.1234 0.6532 0.9845 0.8743 0.8732
V3 0.2314 0.7648 0.7634 0.8732
V4 0.1234 0.6544 0.3456
V5 0.7653 0.9812
V6 0.1265
$`B`
V1 V2 V3 V4 V5 V6
V1 0.2345 0.1234 0.5647 0.7891 0.6721 0.3259
V2 1.1334 0.4332 0.1245 0.2343 0.5332
V3 0.2914 0.1648 0.2334 0.1232
V4 0.1234 0.6744 0.5656
V5 0.3553 0.9812
V6 0.4665
I would like to convert all the data frames in the list to numeric matrices.
I tried:
lapply(list, data.matrix)
but the result is a list of matrices containing integers. Example:
V1 V2 V3 V4 V5 V6
V1 2 2 2 2 2 4
V2 1 3 4 5 5 7
V3 1 1 3 4 6 3
V4 1 1 1 3 4 5
V5 1 1 1 1 1 1
V6 1 1 1 1 1 1
Also tried to run
lapply(list, as.matrix)
however, I got a list of character matrices (the values are still quoted), like this:
$`A`
V1 V2 V3 V4 V5 V6
V1 "0.1212" "0.6231" "0.4431" "0.3213" "0.6578" "0.1259"
V2 "2.1234" "0.6532" "0.9845" "0.8743" "0.8732"
V3 "0.2314" "0.7648" "0.7634" "0.8732"
V4 "0.1234" "0.6544" "0.3456"
V5 "0.7653" "0.9812"
V6 "0.1265"
How can I convert the data frames in my list from character class to numeric matrices?
We may loop over the list, then loop over the data.frame columns with lapply, convert each column to numeric, assign the result back to the original data.frame object, and return the data.frame ('x'):
list <- lapply(list, function(x) {x[] <- lapply(x, as.numeric);x})
If those are factor columns, convert to character first and then to numeric
lapply(list, function(x) {
  x[] <- lapply(x, function(y) as.numeric(as.character(y)))
  x
})
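Both versions above return data frames with numeric columns; since the goal is a list of matrices, here is a minimal sketch that also coerces each element (the list name lst is illustrative):
lst_mat <- lapply(lst, function(x) {
  # convert every column to numeric (blank "" entries become NA, with a warning)
  x[] <- lapply(x, function(y) as.numeric(as.character(y)))
  # coerce the all-numeric data frame to a matrix
  as.matrix(x)
})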
You can also convert the values to numeric and then rebuild the matrix with the original number of columns (this works directly on the character matrices in the data below; a data frame would need as.matrix() first):
lapply(dfs, function(x) matrix(as.numeric(x), ncol = n_cols))
Data
set.seed(1L)
n_cols <- 6
n_total <- 36
a <- matrix(rnorm(n_total), ncol = n_cols)
b <- matrix(rnorm(n_total), ncol = n_cols)
a[lower.tri(a)] <- ""
b[lower.tri(b)] <- ""
dfs <- list(a, b)

How do I sort my columns in a dataframe so that all the identical strings in a row are in the same column

I have a very large dataframe of 63 columns and 1697 rows. The ends of the rows fill up with NAs, but I want the matching values across rows to sit in the same column, with NAs filling the gaps,
a bit like this (updated):
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2","NA","v2","v2","v2")
v3 <- c("v4","v4","v3","NA","v3","v3", "v3")
v4 <- c("v5","v5","v4","NA","v5","v4","NA")
v5 <- c("NA","NA","v5","NA","v6","v6", "NA")
v6 <- c("NA","NA","v6","NA","v7","v7","NA")
v7 <- c("NA","NA","NA","NA","NA","NA","NA")
df <- data.frame(v1,v2,v3,v4,v5,v6,v7)
df
v1 v2 v3 v4 v5 v6 v7
1 v1 v3 v4 v5 NA NA NA
2 v1 v2 v4 v5 NA NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 v5 v6 v7 NA
6 v1 v2 v3 v4 v6 v7 NA
7 v1 v2 v3 NA NA NA NA
but I would like everything aligned like this:
v1 v2 v3 v4 v5 v6 v7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
I have tried map.values(), which didn't come out as expected, as well as an ifelse(), but both require me to enter specific cell values and change them by hand.
The column names do match the cell names.
I want to use the data in a presence/absence plot, so I figured that afterwards I can just run something like
for (i in 1:63) {
  df[, i] <- gsub("NA", "0", df[, i])
}
and then do the same for anything containing "v" to get a binary 1 or 0 for presence or absence, but the values have to be aligned first.
There are no predefined rules governing the data, the dataframe has been conglomerated together from many other .csv files and this is the best format I can get it into currently.
Any help would be appreciated!
Updated answer to match new input data
Data
I removed the quotation marks from NA:
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2",NA,"v2","v2","v2")
v3 <- c("v4","v4","v3",NA,"v3","v3", "v3")
v4 <- c("v5","v5","v4",NA,"v5","v4",NA)
v5 <- c(NA,NA,"v5",NA,"v6","v6", NA)
v6 <- c(NA,NA,"v6",NA,"v7","v7",NA)
v7 <- c(NA,NA,NA,NA,NA,NA,NA)
df <- data.frame(v1,v2,v3,v4,v5,v6,v7, stringsAsFactors = F)
Code
l <- list()
u <- c("v1", "v2", "v3", "v4", "v5", "v6", "v7")
h <- NULL
for (k in 1:nrow(df)) {
  # create a list entry for each row of the df
  l[[k]] <- df[k, ]
  for (i in 1:length(l[[k]])) {
    # check whether the value exists in the row
    if (u[i] %in% l[[k]]) {
      # find the index of the value, given that it exists
      a <- which(l[[k]] == u[i])
      # assign to a "help" vector so existing values are not overwritten
      h[i] <- l[[k]][a]
    } else {
      # values that do not exist in the row are assigned NA
      h[i] <- NA
    }
  }
  # replace the row by the sorted vector with NA placeholders (the "help" vector)
  l[[k]] <- h
}
Result
df1 <- as.data.frame(do.call(rbind, l))
df1
V1 V2 V3 V4 V5 V6 V7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
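For the real 63-column data, a vectorized sketch of the same idea (assuming the target values are v1 to v7, matching the column names, as here); once aligned, presence/absence coding is just a test for NA:
u <- paste0("v", 1:7)
# for each row, keep u[i] where it occurs in that row and NA otherwise
aligned <- as.data.frame(t(apply(df, 1, function(r) ifelse(u %in% r, u, NA))),
                         stringsAsFactors = FALSE)
names(aligned) <- u
# presence/absence coding: 1 if the value occurs in the row, 0 otherwise
presence <- +!is.na(aligned)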

Fast method in R of obtaining a value from one column based on searched value in another?

I have two data frames, one of which has a large list of two identifiers:
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
And one which contains one of the two identifiers:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
What I want is to replace the ID in the second dataframe with the corresponding ID from the first dataframe, where one exists (I couldn't think of a succinct way to phrase that for the title of this question, which is why it's worded so awkwardly and why I may not have been able to find duplicates). I.e.:
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 rs6682375 0 14907 G A
My solution at present is to use a for ... if loop as follows:
for (x in 1:nrow(df2)) {
  if (df2$V2[x] %in% df1$uniq_id) {
    df2$V2[x] = df1$rsid[x]
  }
}
However, because both files are extremely large, I believe that this is likely a very inefficient way of doing this and am wondering if there is a faster method.
Someone suggested that using match() might be quicker, but given that the R documentation suggests %in% is the more intuitive of the two, and given my inexperience with it, I'm not sure how to apply it here.
Any help appreciated.
This is an update-join, in data.table terminology. Assuming the first table is called df and the second is called df2:
library(data.table)
setDT(df)
setDT(df2)
df2[df, on = .(V2 = uniq_id), V2 := rsid]
df2
# V1 V2 V3 V4 V5 V6
# 1: 1 1_10439_A_AC 0 10439 A AC
# 2: 1 1_13417_CGAGA_C 0 13417 C CGAGA
# 3: 1 rs6682375 0 14907 G A
Data used
df <- fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
A method using base R. It would also be easy to do this with dplyr and its left_join() function if desired.
df <- data.table::fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- data.table::fread('
V1 V2 V3 V4 V5 V6
1 1_10439_A_AC 0 10439 A AC
1 1_13417_CGAGA_C 0 13417 C CGAGA
1 1_14907_G_A 0 14907 G A
')
df2 <- merge(df2, df, by.x = "V2", by.y = "uniq_id", all.x = TRUE)
df2$V2 <- ifelse(!is.na(df2$rsid), df2$rsid, df2$V2)
df2$rsid <- NULL
df2
# V2 V1 V3 V4 V5 V6
# 1: 1_10439_A_AC 1 0 10439 A AC
# 2: 1_13417_CGAGA_C 1 0 13417 C CGAGA
# 3: rs6682375 1 0 14907 G A
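Since the question mentions match(), here is a minimal base R sketch along those lines, a vectorized lookup rather than a row-by-row loop (assuming the lookup table is df and the target is df2, as above):
idx <- match(df2$V2, df$uniq_id)
# where a match exists, substitute the rsid; otherwise keep the original ID
df2$V2 <- ifelse(is.na(idx), df2$V2, df$rsid[idx])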

Replace specific values in a data frame except first column

I have this line in one of my functions - result[result > 0.05] <- "" - which replaces all values in my data frame greater than 0.05, including the row names in the first column. How do I avoid this?
This is a fast way too:
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
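If dplyr is available, a sketch of the same idea that leaves the first column untouched (using the df from above; note that, as with the base answer, the modified columns become character):
library(dplyr)
df <- df %>% mutate(across(-1, ~ ifelse(. > 0.05, "", .)))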

Aggregate a dataframe based on three columns in R

I have a dataframe with a structure like this:
V1 V2 V3 V4
1 1.35 A 10241297 10459084
2 16.00 A 10241297 10459084
3 1.47 A 10241297 10459084
I would like to average V1 based on V2, V3 and V4
All the aggregate() examples I have seen deal with aggregating based on a single grouping column.
Any help is appreciated
Thanks
This is one way to accomplish what I think you are looking for (it is hard to tell, because the example data contains only one unique combination of identifiers):
aggregate( V1 ~ V2 + V3 + V4 , df , mean )
# V2 V3 V4 V1
# 1 A 10241297 10459084 6.273333
Here is a plyr approach.
library(plyr)
ddply(df,.(V2,V3,V4),summarise,V1=mean(V1))
V2 V3 V4 V1
1 A 10241297 10459084 6.273333
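And a dplyr sketch of the same aggregation, assuming dplyr (>= 1.0.0) is installed:
library(dplyr)
df %>%
  group_by(V2, V3, V4) %>%
  summarise(V1 = mean(V1), .groups = "drop")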
