I want to assign a rowname (A_B) while I rbind a row into a new dataframe (d).
The row is the result of the ratio of two rows of another data frame (df).
df <- data.frame(ID = c("A", "B" ),replicate(3,sample(1:100,2,rep=TRUE)))
d <- data.frame()
d<-rbind(d, df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4])
Actual output
X1 X2 X3
1 0.08 0.14 0.66
Expected output. Instead of rowname 1 I want A_B as result of A_B ratio
X1 X2 X3
A_B 0.08 0.14 0.66
Maybe it's a stupid solution, but have you tried giving it the row name directly? Something like this:
rbind(d, your_name = (df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4]))
For me it's working... Regards.
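A minimal base R sketch of the same idea, for comparison: compute the ratio row first, set its row name, and only then bind it. The fixed seed and the intermediate name ratio are assumptions added for illustration; the sketch relies on rbind() keeping character row names that are already set on a data frame.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
df <- data.frame(ID = c("A", "B"),
                 replicate(3, sample(1:100, 2, replace = TRUE)))

# Ratio of row A to row B; the result starts with A's automatic row name "1"
ratio <- df[df$ID == "A", 2:4] / df[df$ID == "B", 2:4]
rownames(ratio) <- "A_B"

d <- data.frame()
d <- rbind(d, ratio)
rownames(d)
```

Naming the row before binding keeps the name attached to the data it describes, which may be easier to follow when several ratio rows are appended in a loop.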
Updated the solution to address multiple rows.
You can work around this and get the desired row names with the following solution:
df <- data.frame(ID = c("A", "B" ),replicate(3, sample(1:100,8,rep=TRUE)))
# This is where you control what you like to see as the row names
rownames(df) <- make.names( paste("NAME", df[ ,"ID"]) , unique = TRUE)
d <- data.frame()
rbind(d, df[df$ID == "A", 2:4] / df[df$ID == "B", 2:4], make.row.names = TRUE)
output
X1 X2 X3
NAME.A 0.8690476 1.1851852 2.40909091
NAME.A.1 1.8181818 0.8095238 1.01408451
NAME.A.2 0.8235294 5.4444444 2.50000000
NAME.A.3 1.4821429 1.8139535 0.05617978
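You can assign row names by piping your rbind() output into the replacement function `rownames<-`, which sets the row names and returns the modified data frame (the %>% pipe needs magrittr or dplyr loaded):
rbind(d, df[df$ID == "A", 2:4] / df[df$ID == "B", 2:4]) %>% `rownames<-`("A_B")
Just make sure you supply one name per row, in row order, so that each name lands on the row you intended.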
My goal is to get a concise way to rename multiple columns in a data frame. Let's consider a small data frame df as below:
df <- data.frame(a=1, b=2, c=3)
df
Let's say we want to change the names from a, b, and c to Y, Z, and E respectively.
Defining a named character vector that maps new names to old names:
df_names <- c(Y = "a", Z = "b", E = "c")
I would use this to rename the columns,
rename(df, !!!names)
df
Any suggestions?
One more !:
df <- data.frame(a=1, b=2, c=3)
df_names <- c(Y = "a", Z ="b", E = "c")
library(dplyr)
df %>% rename(!!!df_names)
## Y Z E
##1 1 2 3
A non-tidy way might be through match:
names(df) <- names(df_names)[match(names(df), df_names)]
df
## Y Z E
##1 1 2 3
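The match() approach above assumes every column appears in the lookup vector; a column that is missing from it would end up with an NA name. A hedged base R sketch that renames only the columns it finds (the extra column d and the helper name idx are assumptions for illustration):

```r
df <- data.frame(a = 1, b = 2, c = 3, d = 4)  # d is deliberately not in the lookup
df_names <- c(Y = "a", Z = "b", E = "c")      # new name = old name

# Position of each column name in the lookup; NA where there is no entry
idx <- match(names(df), df_names)

# Rename only the columns that were found, leave the rest untouched
found <- !is.na(idx)
names(df)[found] <- names(df_names)[idx[found]]
names(df)
```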
You could try:
sample(LETTERS[which(LETTERS %in% names(df) == FALSE)], size= length(names(df)), replace = FALSE)
[1] "S" "D" "N"
Here, you don't really care what the new names are, as you're using sample. Otherwise, a straightforward names(df) <- c('name1', 'name2'...
Is there a built-in function to display a data.frame with zero columns but still show row.names?
> df
DataFrame with 5 rows and 0 columns
> row.names(df)
[1] "ID1" "ID2" "ID3" "ID4" "ID5"
It would be useful if instead:
> df
DataFrame with 5 rows and 0 columns
ID1
ID2
ID3
ID4
ID5
I wrote a custom function to do it via cat, but would be nice to know if there's a built-in way of doing it.
library(tidyverse)
df <- df %>%
  select(-everything())
cat(print(df), cat(rownames(df), sep = "\n"))
Or could also be simplified to:
df %>%
  select(-everything()) %>%
  cat(print(.), cat(rownames(.), sep = "\n"))
Output
data frame with 0 columns and 2 rows
A
B
Or using base R, if you don't care about the information being displayed about the dataframe.
df <- df[1]
df[1] <- rep("", nrow(df))
colnames(df) <- ""
Output
A
B
Data
df <- data.frame(a = c(1, 2),
b = c(1, 2),
c = c(4, 5))
rownames(df) <- c("A", "B")
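If a small helper is acceptable, the two steps above can be wrapped in one base R function. This is only a sketch for ordinary data.frame objects; the function name print_with_rownames is hypothetical, and the header line is written by hand rather than taken from any built-in print method.

```r
# Hypothetical helper: print a dimension header followed by one row name per line
print_with_rownames <- function(x) {
  cat("data frame with", ncol(x), "columns and", nrow(x), "rows\n")
  cat(rownames(x), sep = "\n")
  invisible(x)
}

df <- data.frame(a = c(1, 2), b = c(1, 2), c = c(4, 5))
rownames(df) <- c("A", "B")

# Drop all columns but keep the row names, then print
empty <- df[, 0, drop = FALSE]
out <- capture.output(print_with_rownames(empty))
out
```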
I would like to write a for loop for the actions below. data is a data frame with multiple columns, and each column contains a list. I would like to replace all NULL values in each column's list with NA so that I can bind all the lists into a data frame. If there's a more efficient way to do this than a for loop, I would like to know as well. Thank you.
for (i in names(data)){
list1=sapply(data[,1], function(x) ifelse(x == "NULL", NA, x))
list1=as.data.frame(list1)
list2=sapply(data[,2], function(x) ifelse(x == "NULL", NA, x))
list2=as.data.frame(list2)
.
.
.
fulllist=as.data.frame(cbind(list1,list2,....))
fulllist = as.data.frame(t(fulllist))
}
We first loop over the columns of the data to find the list columns (the logical index 'i1'). We then use that index to loop over those columns and, within each one, loop over the elements of the list and set the NULL elements to NA.
i1 <- sapply(data, is.list)
data[i1] <- lapply(data[i1], function(x) {
  i2 <- sapply(x, is.null)
  x[i2] <- NA
  x
})
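A small worked example of that replacement, assuming a toy data frame with a single list column; the column names id and vals and the final unlist() step are illustrative additions, not part of the original data:

```r
# Toy data: one ordinary column and one list column containing a NULL
data <- data.frame(id = 1:3)
data$vals <- list(1, NULL, 3)

# Find the list columns, then replace NULL elements with NA in each of them
i1 <- sapply(data, is.list)
data[i1] <- lapply(data[i1], function(x) {
  i2 <- sapply(x, is.null)
  x[i2] <- NA
  x
})

# With no NULLs left, every element has length 1, so the list can be flattened
data$vals <- unlist(data$vals)
data$vals
```

Flattening only works cleanly here because each remaining element is a single value; unlist() would silently drop NULL elements and shorten the column.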
If you are indeed working with a dataframe, you could perhaps skip splitting it into lists and recombining them into a dataframe:
purrr::map_df(.x = data, .f = ~ stringr::str_replace(.x, 'NULL', NA_character_))
You pass in a dataframe, data, and apply str_replace to each column, replacing the string NULL with the character version of NA. The output is also a dataframe.
Here is an example:
library(purrr)
library(stringr)
df <- data.frame(
X1 = c('A', 'NULL', 'B'),
X2 = c('NULL', 'C', 'D'),
X3 = c('E', 'NULL', 'NULL')
)
purrr::map_df(.x = df, .f = ~ stringr::str_replace(.x, 'NULL', NA_character_))
# X1 X2 X3
# <chr> <chr> <chr>
# 1 A NA E
# 2 NA C NA
# 3 B D NA
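For comparison, a base R sketch of the same replacement using logical matrix indexing. It assumes the placeholders are exactly the string "NULL" (a whole-cell comparison, not a pattern match) and reuses the example data above:

```r
df <- data.frame(
  X1 = c("A", "NULL", "B"),
  X2 = c("NULL", "C", "D"),
  X3 = c("E", "NULL", "NULL"),
  stringsAsFactors = FALSE
)

# df == "NULL" is a logical matrix marking the cells to blank out
df[df == "NULL"] <- NA
df
```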
I am working with a dataset of more than 3 million observations. This data set includes more than 770,000 unique IDs that are of interest to me. The data includes descriptive information about these IDs. The challenge is that these unique IDs contain non-unique duplicates, which means I need to find a way to consolidate the data.
After much thinking, I decided to take the mode of each column for each ID in the data set. The output gives me the most common value for each column for each ID. By taking the most common value, I am able to consolidate the non-unique duplicates into one row per ID.
The problem: To do so, I have to iterate over 770,000 unique IDs in a for loop. I want to use code that will be as efficient as possible because the for loop I have been using takes days to complete.
Given the code I have provided, is there a way to optimize the code, use parallel processing, or a different way to complete the task more efficiently?
Reproducible code:
ID <- c(1,2,2,3,3,3)
x1 <- c("A", "B", "B","C", "C", "C")
x2 <- c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2")
x3 <- c("apple", "banana", "banana", "plum1", "plum1", "plum")
df <- data.frame(ID, x1, x2, x3)
#Mode Function
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
library(reshape2)
#Takes the mode for every column
mode_row <- function(dat){
  x <- setNames(as.data.frame(apply(dat, 2, getmode)), c("value"))
  x$variable <- rownames(x); rownames(x) <- NULL
  mode_row <- reshape2::dcast(x, . ~ variable, value.var = "value")
  mode_row$. <- NULL
  return(mode_row)
}
#Take the mode of each row to account for duplicate donors
df2 <- NULL
for(i in unique(df$ID)){
  df2 <- rbind(df2, mode_row(subset(df, ID == i)))
  #message(i)
}
df2
Expected Output:
ID x1 x2 x3
1 1 A alpha apple
2 2 B bravo banana
3 3 C charlie2 plum1
There are grouped functions available in base R, dplyr and data.table :
Base R :
aggregate(.~ID, df, getmode)
# ID x1 x2 x3
#1 1 A alpha apple
#2 2 B bravo banana
#3 3 C charlie2 plum1
dplyr :
library(dplyr)
df %>% group_by(ID) %>% summarise(across(x1:x3, getmode))
#Use summarise_at in older version of dplyr
#df %>% group_by(ID) %>% summarise_at(vars(x1:x3), getmode)
data.table :
library(data.table)
setDT(df)[, lapply(.SD, getmode), ID, .SDcols = x1:x3]
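A self-contained base R sketch of the grouped approach, using the non-formula interface of aggregate() on the example data. It also shows how getmode behaves on ties: which.max() picks the first maximum, so the value that appears first in the group wins. The result name res is an assumption for illustration.

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

df <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3),
  x1 = c("A", "B", "B", "C", "C", "C"),
  x2 = c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2"),
  x3 = c("apple", "banana", "banana", "plum1", "plum1", "plum"),
  stringsAsFactors = FALSE
)

# One row per ID, each column reduced to its most common value
res <- aggregate(df[c("x1", "x2", "x3")], by = df["ID"], FUN = getmode)
res

# Tie: "b" and "a" both occur twice, and "b" was seen first
tie <- getmode(c("b", "a", "a", "b"))
tie
```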
I have a dataframe that looks like this.
input dataframe
position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2
These data are nucleotide differences at given positions in a hypothetical genome. mean_freq is relative to the reference, so the first row means the proportion of C's is 0.002, implying that the A's are at 0.998.
I want to transform this to a different structure by creating new columns such that,
desired_output
position,G,C,T,A,sample_id
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2
I have attempted this approach
per_position_full_nt_freq <- function(x){
  df <- data.frame(A=0, C=0, G=0, T=0)
  idx <- names(df) %in% x$alternative
  df[,idx] <- x$mean_freq
  idx2 <- names(df) %in% x$reference
  df[,idx2] <- 1 - x$mean_freq
  df$position <- x$position
  df$sampleName <- x$sampleName
  return(df)
}
desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)
I ran into an error
In matrix(value, n, p) :
data length [8905] is not a sub-multiple or multiple of the number of columns
Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr.
How do I conveniently transform the input dataframe to the desired output dataframe format?
Thank you.
One option would be to create a matrix of 0's with the column names 'G', 'C', 'T', 'A', match the 'alternative' and 'reference' values against those column names, use the resulting row/column index to assign the values, and then cbind the matrix with the original dataset's 'position' and 'sample_id' columns:
m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))] <- df1$mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))] <- 1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
# position G C T A sample_id
#1 1 0.00 0.002 0.00 0.998 name1
#2 2 0.96 0.000 0.04 0.000 name1
#3 3 0.00 0.030 0.00 0.970 name2
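The key step in this answer is indexing a matrix with a two-column matrix of (row, column) pairs, which assigns one value per pair. A minimal standalone sketch of that mechanism, with the alternative alleles and frequencies typed in by hand as assumptions:

```r
m <- matrix(0, nrow = 3, ncol = 4,
            dimnames = list(NULL, c("G", "C", "T", "A")))

alt  <- c("C", "T", "C")       # alternative allele for each row
freq <- c(0.002, 0.04, 0.03)   # frequency to store for each row

# Each row of idx is one (row, column) cell of m
idx <- cbind(seq_len(nrow(m)), match(alt, colnames(m)))
m[idx] <- freq
m
```

Because each index row addresses exactly one cell, the assignment is vectorised over all rows without a loop.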
The following should do the trick:
library(readr)
library(dplyr)
library(tidyr)
input_df <- read_csv(
'position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2'
)
input_df %>%
  mutate(ref_val = 1 - mean_freq) %>%
  spread(alternative, mean_freq, fill = 0) %>%
  spread(reference, ref_val, fill = 0) %>%
  select(position, G, C, T, A, sample_id)
One assumption here is that the alternative and reference values are distinct, that is, no nucleotide appears in both columns; otherwise you will get two columns with the same name but different values. If that can happen in your data, you need to handle it with a couple of commands at the beginning of your code.
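One way to check that assumption up front, sketched in base R on the example data; the name overlap is hypothetical, and what to do when the check fails is left open:

```r
input_df <- data.frame(
  position    = 1:3,
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

# Nucleotides that occur both as a reference and as an alternative
overlap <- intersect(input_df$reference, input_df$alternative)

if (length(overlap) > 0) {
  warning("reference and alternative share: ", paste(overlap, collapse = ", "))
}
overlap
```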