Let's say I have a dataframe that looks like this:
Column1 Column2 Column3
a_2019  b_2020  c_2021
d_2019  e_2020  f_2021
a_2019  b_2020  c_2021
d_2019  e_2020  f_2021
And I would like to take out "_2019", "_2020", and "_2021". I could use
df$Column1 <- substr(df$Column1, 1, nchar(df$Column1)-5)
for every column, but I have multiple dataframes with quite a few columns. substr() needs a character vector to work on, so using df[,3:10] doesn't work, and neither does lapply.
Any suggestions on how to achieve this in an elegant way? Thank you
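For reference, a reproducible version of the example data (a sketch built from the table above):
df <- data.frame(
  Column1 = c("a_2019", "d_2019", "a_2019", "d_2019"),
  Column2 = c("b_2020", "e_2020", "b_2020", "e_2020"),
  Column3 = c("c_2021", "f_2021", "c_2021", "f_2021")
)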
We can try using lapply along with sub for a base R option:
df[cols] <- lapply(df[cols], function(x) sub("_(?:2019|2020|2021)$", "", x))
Here cols should be a vector containing the names (or positions) of the columns on which you want to make the replacement.
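For example, to cover the columns 3 to 10 mentioned in the question, or every column, cols could be defined as:
cols <- names(df)[3:10]   # columns 3 to 10
# or
cols <- names(df)         # every column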
More generally, to target an underscore followed by any digits, we can use:
df[cols] <- lapply(df[cols], function(x) sub("_\\d+$", "", x)) # or _\\d{4} for a year
Using dplyr
library(dplyr)
df <- df %>%
  mutate(across(3:10, ~ substr(.x, 1, nchar(.x) - 5)))
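If the suffix length varies, the regex from the base R answer also works inside across(); a sketch, again assuming columns 3 to 10 hold the affected values:
df <- df %>%
  mutate(across(3:10, ~ sub("_\\d+$", "", .x)))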
Related
I would like to do something more efficient than
dataframe$col <- as.character(dataframe$col)
since I have many numeric columns.
In base R, we may use one of the following. Either loop over all the columns and use an if/else condition to change only the numeric ones:
dataframe[] <- lapply(dataframe, function(x) if(is.numeric(x)) as.character(x) else x)
Or create an index of the numeric columns, loop only over those columns, and assign:
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
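A quick sketch of this second approach on a built-in dataset (iris, which mixes numeric and factor columns), just to show the effect:
dataframe <- iris
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
str(dataframe)   # the four measurement columns are now character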
It may be more flexible in dplyr
library(dplyr)
dataframe <- dataframe %>%
  mutate(across(where(is.numeric), as.character))
All said by master akrun! Here is a data.table alternative. Note it converts all columns to character class:
library(data.table)
data.table::setDT(df)
df[, (colnames(df)) := lapply(.SD, as.character), .SDcols = colnames(df)]
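If you only want to touch the numeric columns with data.table, you can restrict .SDcols to those columns (a sketch):
num_cols <- names(df)[sapply(df, is.numeric)]
df[, (num_cols) := lapply(.SD, as.character), .SDcols = num_cols]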
Using the mtcars dataframe, how can I get a new dataframe containing only the rows where the string "3" appears?
So far I have:
mtcars<-lapply(mtcars, function(x) as.character(x))
myindices<-sapply(mtcars, function(x) { grep("3",x, ignore.case = TRUE) })
This gives me a list of indices. How do I get a filtered dataframe from the original?
Feel free to criticise my approach; it is the end result that I am really interested in.
We can use filter_all from dplyr. This returns a dataframe with the rows that have at least one column containing the string "3":
library(dplyr)
mtcars %>%
  filter_all(any_vars(grepl("3", .)))
If we want a dataframe with rows where all columns contain the string "3", we use all_vars instead of any_vars:
mtcars %>%
  filter_all(all_vars(grepl("3", .)))
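Note that filter_all() and any_vars()/all_vars() are superseded in newer dplyr versions (1.0.4 and later); the same results can be written with if_any() and if_all(), for example:
mtcars %>%
  filter(if_any(everything(), ~ grepl("3", .x)))

mtcars %>%
  filter(if_all(everything(), ~ grepl("3", .x)))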
We can use grepl with Reduce from base R:
out <- mtcars[Reduce(`|`, lapply(mtcars, grepl, pattern = "3")),]
dim(out)
#[1] 31 11
Similar to your sapply solution:
mtcars[sapply(1:nrow(mtcars), function(i) any(grepl("3", mtcars[i,], fixed = T))),]
Or, you could do this as well:
mtcars[grepl("3", do.call(paste0, mtcars), fixed = T),]
Another base R solution:
mtcars[apply(mtcars,1,function(x) grepl("3",paste(x,collapse=""))),]
We may use toString.
mtcars.3 <- mtcars[grep("3", apply(mtcars, 1, toString)), ]
Check:
rbind(mtcars=dim(mtcars), mtcars.3=dim(mtcars.3))
         [,1] [,2]
mtcars     32   11
mtcars.3   31   11
I am trying to set some variables as character and others as numeric. What I currently have is:
colschar <- c(1:2, 68:72)
colsnum <- c(3:67)
subset <- as.data.frame(lapply(data[, colschar], as.character), (data[, colsnum], as.numeric))
which returns an error.
I am trying to set columns 1:2 and 68:72 as a character and columns 3:67 all as numeric.
I suggest:
data[colschar] <- lapply(data[colschar], as.character)
data[colsnum] <- lapply(data[colsnum], as.numeric)
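A quick sanity check afterwards:
sapply(data[colschar], class)   # should all be "character"
sapply(data[colsnum], class)    # should all be "numeric"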
It would be better if you shared an extract of your data. In any case, you may try the tidyverse approach:
library(dplyr)
mydf_molt <- mydf %>%
  mutate_at(.vars = c(1:2, 68:72), .funs = funs(as.character(.))) %>%
  mutate_at(.vars = c(3:67), .funs = funs(as.numeric(.)))
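Note that mutate_at() and funs() are superseded in current dplyr; the same idea with across() would look like this (a sketch):
mydf_molt <- mydf %>%
  mutate(across(c(1:2, 68:72), as.character),
         across(3:67, as.numeric))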
I need to multiply columns in an R data.frame. I want to do this based on certain patterns in the column names. This is a very elementary task, but I struggle to make it work with sapply() or some related function. This is what I've tried thus far:
df <- data.frame("pA" = sample(1:100), "pB" = sample(1:100), "qA" = sample(1:100), "qB" = sample(1:100))
cols <- c("A","B")
multip <- function(df, col){
  dfp <- df[which(names(df) %in% paste0("p", col))]
  dfq <- df[which(names(df) %in% paste0("q", col))]
  dfv <- dfp * dfq
  setNames(dfv, paste0("v", col))
}
sapply(df, function(x) multip(x,cols))
I can make it work if I take it apart and drop the function and sapply parts, but that would complicate my work. Is there some solution that would make this work?
You can use multip directly on 'df'
multip(df, cols)
Or without using multip
Map('*', df[grep('p', names(df))], df[grep('q', names(df))])
The problem with the sapply/lapply call is that each list element gives access to only a single column, whereas multip expects the whole data frame as its first argument.
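To get the same v-prefixed columns that multip() returns, the Map() result can be wrapped back into a data frame, e.g. (using the cols vector from the question):
dfv <- setNames(
  as.data.frame(Map(`*`, df[paste0("p", cols)], df[paste0("q", cols)])),
  paste0("v", cols)
)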
I'm new to R. In a data frame, I wanted to create a new column #21 that is equal to the sum of columns #1 to #20, row by row.
I know I could do
df$Col21<-df$Col1+df$Col2+.....+df$Col20
But is there a more concise expression?
Also, can I achieve this using column names instead of numbers? Thanks!
There is rowSums:
df$Col21 = rowSums(df[,1:20])
should do the trick, and with names:
df$Col21 = rowSums(df[,paste("Col", 1:20, sep="")])
If the column names are padded with leading zeros to three digits (Col001, Col002, ...), try:
df$Col21 = rowSums(df[, sprintf("Col%03d", 1:20)])
I find the dplyr functions for column selection very intuitive, like starts_with(), ends_with(), contains(), matches() and num_range():
df <- as.data.frame(replicate(20, runif(10)))
names(df) <- paste0("Col", 1:20)
library(dplyr)
# e.g.
summarise_each(df, funs(sum), starts_with("Col"))
# or
rowSums(select(df, contains("8")))
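For the original Col1 to Col20 question, num_range() maps directly onto the numbered names:
df$Col21 <- rowSums(select(df, num_range("Col", 1:20)))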