I have a data frame (df), a vector of column names (foo), and a function (spaces) that calculates a value for all rows in a specified column of a df. I am trying to accomplish the following:
Provide foo as input to spaces
Spaces operates on each element of foo matching a column name in df
For each column spaces operates on, store the output of spaces in a new column of df with a column name produced by concatenating the name of the original column and ".counts".
I keep receiving this error:
> Error: unexpected '=' in:
>" new[i] <- paste0(foo[i],".count") # New variable name
> data <- transform(data, new[i] ="
> }
> Error: unexpected '}' in " }"
Below is my code. Note: spaces does what I want when provided an input of a single variable of the form df$x but using transform() should allow me to forego including the prefix df$ for each variable.
# Create data for example
a <- c <- seq(1:5)
b <- c("1","1 2", "1 2 3","1 2 3 4","1 2 3 4 5")
d <- 10
df <- data.frame(a,b,c,d) # data frame df
foo <- c("a","b") # these are the names of the columns I want to provide to spaces
# Define function: spaces
spaces <- function(s) { sapply(gregexpr(" ", s), function(p) { sum(p>=0) } ) }
# Initialize vector with new variable names
new <- vector(length = length(foo))
# Create loop with following steps:
# (1) New variable name
# (2) Each element (e.g. "x") of foo is fed to spaces
# a new variable (e.g. "x.count") is created in df,
# this new df overwrites the old df
for (i in 1:length(foo)) {
new[i] <- paste0(foo[i],".count") # New variable name
df <- transform(df, new[i] = spaces(foo[i])) # Function and new df
}
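transform(df, new[i] = spaces(foo[i])) is not valid syntax: an argument name cannot be an indexed expression such as new[i]. Storing the name in a temporary variable does not help either, because transform(df, tmp = ...) creates a column literally named tmp, and spaces(foo[i]) counts spaces in the string "a" rather than in the column a. Build the new name as a character string and use [[ ]] for both the lookup and the assignment.
for (i in seq_along(foo)) {
new[i] <- paste0(foo[i],".count") # New variable name
df[[new[i]]] <- spaces(df[[foo[i]]]) # Apply spaces to the column, store under the new name
}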
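As a loop-free sketch of the same task (not part of the original answer), the new columns can also be created in a single assignment with lapply. It reuses df, foo and spaces from the question; the data is rebuilt here only so that the snippet runs on its own.

```r
# Example data and spaces() as defined in the question
df <- data.frame(a = 1:5,
                 b = c("1", "1 2", "1 2 3", "1 2 3 4", "1 2 3 4 5"),
                 c = 1:5,
                 d = 10,
                 stringsAsFactors = FALSE)
foo <- c("a", "b")
spaces <- function(s) sapply(gregexpr(" ", s), function(p) sum(p >= 0))

# Apply spaces() to each selected column and store the results
# under "<name>.count" in one step
df[paste0(foo, ".count")] <- lapply(df[foo], spaces)

names(df)
#[1] "a"       "b"       "c"       "d"       "a.count" "b.count"
```

Because lapply returns one vector per column of df[foo], the left-hand side only needs the matching vector of new names.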
I have a data frame df. It has a column named b. I know this column name, although I do not know its position in the data frame. I know that colnames(df) will give me a vector of character strings that are the names of all the columns, but I do not know how to get a string for this particular column. In other words, I want to obtain the string "b". How can I do that? I imagine this may involve the rlang package, which I have difficulty understanding.
Here's an example:
library(rlang)
library(tidyverse)
a <- c(1:8)
b <- c(23,34,45,43,32,45,68,78)
c <- c(0.34,0.56,0.97,0.33,-0.23,-0.36,-0.11,0.17)
df <- data.frame(a,b,c)
tf <- function(df,MYcol) {
print(paste0("The name of the input column is ",MYcol)) # does not work
print(paste0("The name of the input column is ",{{MYcol}})) # does not work
y <- {{MYcol}} # This gives the values in column b as it should
}
z <- tf(df,b) # Gives undesired values - I want the string "b"
If you cannot pass the column name to the function as a string (tf(df,"b")) directly, you can use deparse + substitute.
tf <- function(df,MYcol) {
col <- deparse(substitute(MYcol))
print(paste0("The name of the input column is ",col))
return(col)
}
z <- tf(df,b)
#[1] "The name of the input column is b"
z
#[1] "b"
We can use as_string with ensym
tf <- function(df, MYcol) {
mycol <- rlang::as_string(rlang::ensym(MYcol))
print(glue::glue("The name of the input column is {mycol}"))
return(mycol)
}
z <- tf(df,b)
#The name of the input column is b
z
#[1] "b"
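Once the name is available as a string, it can also be used to pull the column itself with [[ ]]. The base R sketch below shows both steps together; col_summary is a hypothetical helper name chosen for illustration, not something from the question.

```r
df <- data.frame(a = 1:8,
                 b = c(23, 34, 45, 43, 32, 45, 68, 78))

# col_summary is a hypothetical helper used only for illustration
col_summary <- function(df, MYcol) {
  col <- deparse(substitute(MYcol))          # unquoted argument -> "b"
  list(name = col, mean = mean(df[[col]]))   # look the column up by its name
}

res <- col_summary(df, b)
res$name
#[1] "b"
res$mean
#[1] 46
```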
library(dplyr)
clean_name <- function(df,col_name,new_col_name){
#remove whitespace and common titles.
df$new_col_name <- mutate_all(df,
trimws(gsub("MR.?|MRS.?|MS.?|MISS.?|MASTER.?","",df$col_name)))
#remove any chunks of text where a number is present
df$new_col_name<- transmute_all(df,
gsub("[^\\s]*[\\d]+[^\\s]*","",df$col_name,perl = TRUE))
}
I get the following error
"Error: Column new_col_name must be a 1d atomic vector or a list"
What you want to do is make sure that the output of the functions you're using is a one-dimensional vector or list, so that it can be added as a new column to the data frame. mutate_all and transmute_all return whole data frames, which is why the assignment fails. You can check the class of an object with the class function, which comes with the base package.
The mutate function by itself should do what you want, it returns the same data frame but with the new column:
library(dplyr)
clean_name <- function(df, col_name, new_col_name) {
# first_cleaning_to_colname = The first change you want to make to the col_name column. This should be a vector.
# second_cleaning_to_colname = The change you're going to make to the col_name column after the first one. This should be a vector too.
first_change <- mutate(df, col_name = first_cleaning_to_colname)
second_change <- mutate(first_change, new_col_name = second_cleaning_to_colname)
return(second_change)
}
You can make both of these changes at the same time, but I thought this way is easier to read.
If we are passing unquoted column names, then use
library(tidyverse)
clean_name <- function(df,col_name, new_col_name){
col_name <- enquo(col_name)
new_col_name <- enquo(new_col_name)
df %>%
mutate(!! new_col_name :=
trimws(str_replace_all(!!col_name, "MR.?|MRS.?|MS.?|MISS.?|MASTER.?","")) ) %>%
transmute(!! new_col_name := trimws(str_replace_all(!! new_col_name,
"[^\\s]*[\\d]+[^\\s]*","")))
}
clean_name(dat1, col1, colN)
# colN
#1 one
#2 two
data
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
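If the column names can be passed as strings instead, a base R sketch of the same two cleaning steps avoids tidy evaluation altogether. The patterns are the ones from the question. Unlike the transmute version, this keeps the original column, and the function name clean_name_chr is an assumption made for illustration.

```r
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)

# clean_name_chr is a hypothetical name; col_name and new_col_name are strings
clean_name_chr <- function(df, col_name, new_col_name) {
  x <- df[[col_name]]
  # remove common titles
  x <- gsub("MR.?|MRS.?|MS.?|MISS.?|MASTER.?", "", x)
  # remove any chunk of text that contains a digit
  x <- gsub("[^\\s]*[\\d]+[^\\s]*", "", x, perl = TRUE)
  # trim leftover whitespace and store the result under the new name
  df[[new_col_name]] <- trimws(x)
  df
}

clean_name_chr(dat1, "col1", "colN")$colN
#[1] "one" "two"
```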
I'm trying to extract the name of the i column used in a loop:
for (i in df){
print(name(i))
}
Python code solution example:
for i in df:
print(i)
PS: R gives me the column values if I use the same code as in Python (whereas Python gives just the name).
EDIT: It has to be in a loop, as I will do more elaborate things with this.
for (i in names(df)){
print(i)
}
Just do
names(df)
to print all the column names in df. There's no need for a loop, unless you want to do something more elaborate with each column.
If you want the i'th column name:
names(df)[i]
Instead of looping, you can use the imap function from the purrr package. When writing the code, .x is the object and .y is the name.
df <- data.frame(a = 1:10, b = 21:30, c = 31:40)
library(purrr)
imap(df, ~paste0("The name is ", .y, " and the sum is ", sum(.x)))
# $a
# [1] "The name is a and the sum is 55"
#
# $b
# [1] "The name is b and the sum is 255"
#
# $c
# [1] "The name is c and the sum is 355"
This is just a more convenient way of writing the following Base R code, which gives the same output:
Map(function(x, y) paste0("The name is ", y, " and the sum is ", sum(x))
, df, names(df))
You can try the following code:
# Simulating your data
a <- c(1,2,3)
b <- c(4,5,6)
df <- data.frame(a, b)
# Answer 1
for (i in 1:ncol(df)){
print(names(df)[i]) # accessing the name of column
print(df[,i]) # accessing column content
print('----')
}
Or this alternative:
# Answer 2
columns <- names(df)
for(i in columns) {
print(i) # accessing the name of column
print(df[, i]) # accessing column content
print('----')
}
Hope it helps!
I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would stick with packages for this, here is a solution:
initialize data
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
process
ind <- grep(">", mydf$x)
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in 1:length(ind)) {
seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
                              name                     sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2         PROKKA_00003 lipoprotein  MTESSITERGAPELMRTIIVIASLLLT
Try creating a logical index of the rows that contain the target symbol (the header rows), then split the data on that index. cumsum(ind1) coerces the logical vector to numeric and turns it into a running group id, so mydf$x[ind1][cumsum(ind1)] repeats each header for every row in its group. The trailing [!ind1,] then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
#            Name               Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002      MTESSITERGAPEL
# 5 >PROKKA_00003         lipoprotein
# 6 >PROKKA_00003       MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
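The result above keeps one row per sequence line. To get one concatenated sequence per header, as the question asks, the same cumsum index can serve as a grouping variable. This is a sketch on made-up toy data with two sequence lines under the first header; the names is_header, group, name and sequence are assumptions.

```r
mydf <- data.frame(x = c(">PROKKA_00002 Alpha-ketoglutarate permease",
                         "MTESSITERGAPEL", "QLLQTAGVFAAGFL",
                         ">PROKKA_00003 lipoprotein",
                         "MRTIIVIASLLLT"),
                   stringsAsFactors = FALSE)

is_header <- grepl("^>", mydf$x)   # TRUE for lines starting with ">"
group     <- cumsum(is_header)     # running id: 1 1 1 2 2

# paste the non-header lines of each group into one string
seqs <- tapply(mydf$x[!is_header], group[!is_header], paste, collapse = "")

fasta <- data.frame(name     = mydf$x[is_header],
                    sequence = as.vector(seqs),
                    stringsAsFactors = FALSE)
fasta$sequence
#[1] "MTESSITERGAPELQLLQTAGVFAAGFL" "MRTIIVIASLLLT"
```

The sketch assumes every header is followed by at least one sequence line; a header with no sequence would leave name and sequence with different lengths.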
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view t, I believe it is what you were looking for in your original post.
I have a character array that holds the column names and values for a row in a data frame. Unfortunately, if the value of a specific entry is zero, the column name and value are not listed in the array. I create my desired data frame with this information, but I rely on a "for loop".
I want to utilize plyr to avoid the for loop in the working code below.
types <- c("one", "two", "three") # My data
entry <- c("one(1)", "three(2)") # My data
values <- function(entry, types)
{
frame<- setNames(as.data.frame(matrix(0, ncol = length(types), nrow = 1)), types)
for(s1 in 1:length(entry))
{
name <- gsub("\\(\\w*\\)", "", entry[s1]) # get name
quantity <- as.numeric(unlist(strsplit(entry[s1], "[()]"))[2]) # get value
frame[1, which(colnames(frame)==name)] <- quantity # store
}
return(frame)
}
values(entry, types) # This is how I want the output to look
I have tried the following to split the array, but I can't figure out how to get adply to return a single row.
types <- c("one", "two", "three") # data
entry <- c("one(1)", "three(2)") # data
frame<- setNames(as.data.frame(matrix(0, ncol = length(types), nrow = 1)), types)
array_split <- function(entry, frame){
name <- gsub("\\(\\w*\\)", "", entry) # get name
quantity <- as.numeric(unlist(strsplit(entry, "[()]"))[2]) # get value
frame[1, which(colnames(frame)==name)] <- quantity # store
return(frame)
}
adply(entry, 1, array_split, frame)
Is there something like cumsum I should be considering? I want to complete the operation quickly.
I'm not sure why you aren't just doing something more like this:
frame <- setNames(rep(0,length(types)),types)
a <- as.numeric(sapply(strsplit(entry,"[()]"),`[[`,2))
names(a) <- gsub("\\(\\w*\\)", "", entry)
frame[names(a)] <- a
Both gsub and strsplit are already vectorized, so there's no real need for an explicit loop anywhere. You only need the sapply to extract the second element of each strsplit result. The rest is just regular indexing.
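Wrapped as a function, the vectorized approach might look like the sketch below. The name values_vec is hypothetical, and the final as.data.frame(as.list(...)) step is only there to return a one-row data frame like the original values().

```r
types <- c("one", "two", "three")
entry <- c("one(1)", "three(2)")

# values_vec is a hypothetical name for a loop-free version of values()
values_vec <- function(entry, types) {
  # start with zeros, named by type
  frame <- setNames(rep(0, length(types)), types)
  # the text between the parentheses is the quantity
  qty <- as.numeric(sapply(strsplit(entry, "[()]"), `[[`, 2))
  # the text before the parentheses is the type; fill in the listed types
  frame[gsub("\\(\\w*\\)", "", entry)] <- qty
  # convert the named vector to a one-row data frame
  as.data.frame(as.list(frame))
}

values_vec(entry, types)
#   one two three
# 1   1   0     2
```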