I have a vector of characters. Each element of the vector has a name attribute which represents the row index of a data frame and the column index of a data frame, separated by a period. Here's a toy data set:
# Create vector of characters
a <- c("foo","bar","dog","cat")
# Assign attributes. The data frame is 2x2:
attr(a, "names") <- c("1.1", "1.2", "2.1", "2.2")
I am trying to use the attribute names to convert the vector into a data frame, where each element in the data frame is the value in the vector and the element's row is the number before the period in the attribute name and the element's column is the number after the decimal in the attribute name. The toy example's output should look like:
data.frame(var1 = c("foo","dog"), var2 = c("bar", "cat"))
My actual vector is quite large so I am looking to do this efficiently.
You can use indexing by row/column value to do this efficiently:
row.nums <- as.numeric(sapply(strsplit(names(a), "\\."), "[", 1))
col.nums <- as.numeric(sapply(strsplit(names(a), "\\."), "[", 2))
mat <- matrix(NA, max(row.nums), max(col.nums))
mat[cbind(row.nums, col.nums)] <- a
mat
# [,1] [,2]
# [1,] "foo" "bar"
# [2,] "dog" "cat"
Split a on the suffix values and coerce that to a data frame. Omit
the stringsAsFactors=FALSE if you prefer factor columns.
the unname if rownames on the result are acceptable
Code--
as.data.frame(split(unname(a), sub(".*[.]", "", names(a))), stringsAsFactors = FALSE)
giving:
X1 X2
1 foo bar
2 dog cat
I would probably use regex to extract row and column positions, as follows.
my.rows <- as.integer(gsub("\\..*$", "", names(a)))
my.cols <- as.integer(gsub("^.*\\.", "", names(a)))
new.data <- data.frame(matrix(NA, nrow = max(my.rows), ncol = max(my.cols)))
for (i in 1:length(a)) {
new.data[my.rows[i], my.cols[i]] <- a[i]
}
new.data
We can use dplyr and tidyr. b2 is the final output.
library(dplyr)
library(tidyr)
b <- data_frame(Name = names(a), Value = a)
b2 <- b %>%
separate(Name, into = c("Group", "Var")) %>%
spread(Var, Value) %>%
select(-Group)
Related
I have not been programming for that long and have now encountered a problem to which I have not yet been able to find a solution.
In my dataframe there is a column that contains several pieces of information. For example, one row looks like this:
sp|O94910|AGRL1_HUMAN
or like this
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN
Now I want to create a new column with the combination of digits between the two vertical bars.
For the upper example it would be O94910, for the lower Q13554; Q13555
I have already tried functions like str_extract_all, str_match or gsub. But nothing worked.
The "id" is the column I look at. It includes different combinations of digits. I need the one between the two |
> dput(head(anaDiff_PD_vs_CTRL$id, 10))
c("sp|O94910|AGRL1_HUMAN", "sp|P02763|A1AG1_HUMAN", "sp|P19652|A1AG2_HUMAN",
"sp|P25311|ZA2G_HUMAN", "sp|Q8NFZ8|CADM4_HUMAN", "sp|P08174|DAF_HUMAN",
"sp|Q15262|PTPRK_HUMAN", "sp|P78324|SHPS1_HUMAN;sp|Q5TFQ8|SIRBL_HUMAN;sp|Q9P1W8|SIRPG_HUMAN",
"sp|Q8N3J6|CADM2_HUMAN", "sp|P19021|AMD_HUMAN")>
With dplyr and stringr you can try...
library(dplyr)
library(stringr)
dat %>%
rowwise() %>%
mutate(dig = str_extract_all(col, "(?<=sp\\|)[A-Z0-9]+(?=\\|)"),
dig = paste0(dig, collapse = "; "))
#> # A tibble: 4 x 2
#> # Rowwise:
#> col dig
#> <chr> <chr>
#> 1 sp|Q8NFZ8|CADM4_HUMAN Q8NFZ8
#> 2 sp|94910|AGRL1_HUMAN 94910
#> 3 sp|O94910|AGRL1_HUMAN O94910
#> 4 sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN Q13554; Q13555
data
dat <- data.frame(col = c("sp|Q8NFZ8|CADM4_HUMAN", "sp|94910|AGRL1_HUMAN", "sp|O94910|AGRL1_HUMAN", "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"))
Created on 2022-02-02 by the reprex package (v2.0.1)
Here is a solution without tidyverse:
dat <- read.table(text = "
sp|Q8NFZ8|CADM4_HUMAN
sp|94910|AGRL1_HUMAN
sp|O94910|AGRL1_HUMAN
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")
ids <- strsplit(dat$V1, ";")
ids <- lapply(ids, function(x) gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", x))
ids <- lapply(ids, function(x) paste(x, collapse="; "))
dat$newcol <- unlist(ids)
Even with tidyverse, I would define a helper function for more clarity:
extract_ids <- function(x) {
ids <- strsplit(x, ";")
ids <- map(ids, ~ gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", .))
ids <- map(ids, ~ paste(., collapse="; "))
unlist(ids)
}
dat <- dat %>% mutate(ids = extract_ids(V1))
This solution should help if you want to change your column names in a similar fashion:
library(tidyverse)
# create test data frame with column names "sp|O94910|AGRL1_HUMAN" and "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
col1 <- c(1,2,3,4,5)
col2 <- c(6,7,8,9,10)
df <- data.frame(col1, col2)
names(df)[1] <- "sp|O94910|AGRL1_HUMAN"
names(df)[2] <- "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
names <- as.data.frame((str_split(colnames(df), "\\|", simplify = TRUE))) # split the strings representing the column names seperated by "|" into a list
# remove all strings that contain less digits than letters or special characters
for(i in 1:nrow(names)) {
for(j in 1:ncol(names)){
if ( (str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[0-9]") >
str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[:alpha:]|[:punct:]") )){
names[i,j] <- names[i,j]
} else {
names[i,j] <- ""
}
}
}
# combine the list columns into a single column calles "colnames"
names <- names %>% unite("colnames", 1:5, na.rm = TRUE, remove = TRUE, sep = ";")
# remove all ";" separators at the start of the strings, the end of the strings, and series of ";" into a single ";"
for (i in 1:nrow(names)){
names[i,] <- str_replace(names[i,],"\\;+$", "") %>%
str_replace("^\\;+", "") %>%
str_replace("\\;{2}", ";")
}
# convert column with new names into a vector
new_names <- as.vector(names$colnames)
# replace old names with new names
names(df) <- new_names
I have a data frame which I would like to group according to the value in a given row and column of the data frame
my_data <- data.frame(matrix(ncol = 3, nrow = 4))
colnames(my_data) <- c('Position', 'Group', 'Data')
my_data[,1] <- c('A1','B1','C1','D1')
my_data[,3] <- c(1,2,3,4)
grps <- list(c('A1','B1'),
c('C1','D1'))
grp.names = c("Control", "Exp1", "EMPTY")
my_data$Group <- case_when(
my_data$Position %in% grps[[1]] ~ grp.names[1],
my_data$Position %in% grps[[2]] ~ grp.names[2]
)
OR
my_data$Group <- with(my_data, ifelse(Position %in% grps[[1]], grp.names[1],
ifelse(Position %in% grps[[2]], grp.names[2],
grp.names[3])))
These examples work and produce a Group column with appropriate labels, however I need to have flexibility in the length of the grps list from 1 to approximately 25.
I see no way to iterate through case_with or ifelse in a for loop eg.
my_data$Group <- for (i in 1:length(grps)){
case_when(
my_data$Well %in% grps[[i]] ~ grp.names[i])
}
This example simply deletes the Group column
What is the most appropriate way to handle a variable grps length?
I believe your question implies that the grps variable is a list and every element in that list is itself an array that holds all the positions that belong to that group.
Specifically, in your grps variable below, if the Position is "A1" or "B1" it belongs to the whatever your first entry is grp.names. Similarly, if the position is "C1" or "D1" it belongs to whatever your second entry is in grp.names
> grps
[[1]]
[1] "A1" "B1"
[[2]]
[1] "C1" "D1"
Assuming that to be the case you can do the following:
matching_group_df <- sapply(grps, function(x){ my_data$Position %in% x})
selected_group <- apply(matching_group_df, 1, function(x){which(x == TRUE)})
my_data$Group <- grp.names[selected_group]
Position Group Data
1 A1 Control 1
2 B1 Control 2
3 C1 Exp1 3
4 D1 Exp1 4
The way it works is as follows:
matching_group_df is a matrix of True/False (created via the sapply function) that specifies what group index the position belongs to:
> matching_group_df
[,1] [,2]
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] FALSE TRUE
[4,] FALSE TRUE
You then select the column that has the TRUE value row by row using an apply command:
selected_group <- apply(matching_group_df, 1, function(x){which(x == TRUE)})
> selected_group
[1] 1 1 2 2
Finally you pass those indices to your grp.names list to select the appropriate ones and set them into your original dataframe.
grp.names[selected_group]
[1] "Control" "Control" "Exp1" "Exp1"
This also has the small side benefit of just using base R functions if that is important to you.
Approach 1: Hash table
I would opt for a different approach here, as group makeup might change during analysis, specifically a lookup table of key-value pairs, and write a small accessor function.
library(tidyverse)
# First, a small adjustment to `grps` to reflect an empty group.
grps <- list(c('A1','B1'),
c('C1','D1'),
NULL)
names <- unlist(grps, use.names = F)
values <- rep(grp.names, map_dbl(grps, length))
h = as.list(values) %>%
set_names(names) %>%
list2env()
# find x in h
f <- Vectorize(function(x) h[[x]], c("x")) # scoping here
This takes some time to setup, but usage is quite convenient:
my_data %>%
mutate(Groups = f(Position))
Position Group Data
1 A1 Control 1
2 B1 Control 2
3 C1 Exp1 3
4 D1 Exp1 4
This avoids having to change your code in multiple places, and can take on arbitrary length of groups.
Approach 2: Dynamic switch
Alternatively, we can make an arbitrary length switch expression, building it from the group names and their unique values.
constructor <- function(ids, names){
purrr::imap_chr(as.character(ids), ~paste(paste0("\"", .x ,"\""),
paste0("\"", names[.y], "\""),
sep = "=")) %>%
paste0(collapse = ", ") %>%
paste0("Vectorize(function(x) switch(as.character(x), ", ., ", NA))", collapse = "") %>%
str2expression()
}
my_data %>%
mutate(Group = eval(constructor(names, values)))
In this case, it would evaluate the expression
expression(Vectorize(function(x) switch(as.character(x), A1 = "Control",
B1 = "Control", C1 = "Exp1", D1 = "Exp1",
NA)))
For each item in my_data$Position you want to go through each of the grps and look for a match and assign grp.names, if so. If you don't find a match in any grp, assign grp.names[3]:
my_data$Group <- lapply(my_data$Position, function(position){ # Goes through each my_data$Position
for(i in 1:length(grps)){
if(position %in% grps[[i]]){
return(grp.names[i]) # Give matching index of grp.names to grps
} else if (i == length(grps)){ # if no matches assign grp.names[3]
return(grp.names[3])
}
}
}) %>% unlist() # Put the list into a vector
I have a data frame df with 7 columns and I have a list z containing multiple strings.
I want a dataframe containing only the columns in df which contain the sting from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
You can use:
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
You can use select and matches to subest the columns based on z
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means
I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as a function to read out the information I need from the individual rows. However, when I try to use it as argument I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this I get a list with the column names, but all Values are NULL.
Expected output :
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (an also other stuff from the same row of the data table, later).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
Here is one way :
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row){
aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
row["mean.aa"] <- mean(aa)
as.list(row)
})
I would like to check two data set. one data has many columns (this example has two columns df1) and one data has one column (df2)
At first, I want to check the first column of df1 each row with all part of df2 if any similar part is found, then the row number of df1 and df2 is written
for example
Column 1 of df1 has two similar part of the row to df2 Q9Y6Q9 in row 3 of df1 with Q9Y6Q9 in row 4 of df2 . so the output is 3-4 , the same for others
Maybe you should normalize your data first. For instance, you could do:
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
This function can applied to each column of df1 as well as the lookup table df2:
s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")
With this normalized data, it is easy to generate the output you are looking for:
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
return(paste(s[found], pos, sep="-"))
#change the last line to "return(as.character(pos))" to get only the result as in the comment
}
process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4" "3-15" "7-16"
The output is not exactly the same as in the question, but this may be due to manual lookup errors.
In order to iterate over all columns of df1, you could use lapply:
res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)
Now, res is a list indexed by the column names of df1:
res[["sample_1"]]
# [1] "4" "1" "4"