I would like to do something more efficient than
dataframe$col <- as.character(dataframe$col)
since I have many numeric columns.
In base R, we can use one of the following approaches, i.e. loop over all the columns with an if/else condition that converts only the numeric ones:
dataframe[] <- lapply(dataframe, function(x)
  if(is.numeric(x)) as.character(x) else x)
Or create an index of the numeric columns, loop only over those columns, and assign:
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
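As a quick sanity check on a small made-up data frame (the column names here are purely illustrative), the numeric columns come back as character while the rest are untouched:
dataframe <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8), label = c("a", "b", "c"),
                        stringsAsFactors = FALSE)
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
str(dataframe)
# 'data.frame': 3 obs. of  3 variables:
#  $ id   : chr  "1" "2" "3"
#  $ score: chr  "2.5" "3.1" "4.8"
#  $ label: chr  "a" "b" "c"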
This can be written more flexibly with dplyr:
library(dplyr)
dataframe <- dataframe %>%
  mutate(across(where(is.numeric), as.character))
All said by master akrun! Here is a data.table alternative. Note it converts all columns to character class:
library(data.table)
data.table::setDT(df)
df[, (colnames(df)) := lapply(.SD, as.character), .SDcols = colnames(df)]
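If only the numeric columns should be converted (matching the other answers), the same idea can be restricted with .SDcols; a minimal sketch, assuming the data is in df as above:
num_cols <- names(df)[sapply(df, is.numeric)]
df[, (num_cols) := lapply(.SD, as.character), .SDcols = num_cols]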
I've got a data frame with text. I'd like to change all "," to "-" in all observations of selected variables, and I'd like to select those variables based on their names containing the word "date".
I've tried to incorporate various variations of grep() expressions into MyFunc but haven't been able to get it to work.
Thanks!
starting point:
df <- data.frame(
  dateofbirth = c("25,06,1939", "15,04,1941", "21,06,1978", "06,07,1946", "14,07,1935"),
  recdate     = c("26,06,1945", "03,04,1964", "21,06,1949", "15,07,1923", "07,12,1945"),
  b           = c("8,ted,st", "99,tes,rd", "6,ldk,dr", "2,sdd,jun", "asd,2,st"),
  disdatenow  = c("25,06,1975", "25,05,1996", "21,06,1932", "26,07,1934", "07,07,1965"),
  stringsAsFactors = FALSE
)
desired outcome:
df <- data.frame(
  dateofbirth = c("25-06-1939", "15-04-1941", "21-06-1978", "06-07-1946", "14-07-1935"),
  recdate     = c("26-06-1945", "03-04-1964", "21-06-1949", "15-07-1923", "07-12-1945"),
  b           = c("8,ted,st", "99,tes,rd", "6,ldk,dr", "2,sdd,jun", "asd,2,st"),
  disdatenow  = c("25-06-1975", "25-05-1996", "21-06-1932", "26-07-1934", "07-07-1965"),
  stringsAsFactors = FALSE
)
Current code:
MyFunc <- function(x) {gsub(",","-",df$x)}
You can use mutate_at from dplyr:
df %>%
  mutate_at(vars(contains("date")), function(x){gsub(",", "-", x)})
and that gives you this:
dateofbirth recdate b disdatenow
1 25-06-1939 26-06-1945 8,ted,st 25-06-1975
2 15-04-1941 03-04-1964 99,tes,rd 25-05-1996
3 21-06-1978 21-06-1949 6,ldk,dr 21-06-1932
4 06-07-1946 15-07-1923 2,sdd,jun 26-07-1934
5 14-07-1935 07-12-1945 asd,2,st 07-07-1965
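Note that mutate_at() has since been superseded; on dplyr 1.0 or later the same result can be written with across() (a sketch, not part of the original answer):
library(dplyr)
df %>%
  mutate(across(contains("date"), ~ gsub(",", "-", .x)))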
Using your function MyFunc, this should also work
MyFunc <- function(x) {gsub(",", "-", x)}
library(data.table)
setDT(df)
cols <- c("dateofbirth", "recdate", "disdatenow")
df[, cols] <- df[, lapply(.SD, MyFunc), .SDcols = cols]
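Since the question mentions trying grep(), here is a base R sketch that selects the columns by name; it assumes df is still a plain data.frame (i.e. run before the setDT() call above):
date_cols <- grep("date", names(df), value = TRUE)
df[date_cols] <- lapply(df[date_cols], function(x) gsub(",", "-", x))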
Let's say I have a following very simple data frame:
a <- rep(5,30)
b <- rep(4,80)
d <- rep(7,55)
df <- data.frame(Column = c(a,b,d))
What would be the most generic way to remove all rows with values that appear fewer than 60 times?
I know you could say "in this case it's just a", but in my real data there are many more frequencies, so I wouldn't want to specify them one by one.
I was thinking of writing a loop such that if length() of an 'i' is smaller than 60, these rows will be deleted, but perhaps you have other ideas. Thanks in advance.
A solution using dplyr.
library(dplyr)
df2 <- df %>%
  group_by(Column) %>%
  filter(n() >= 60)
Or a solution from base R
uniqueID <- unique(df$Column)
targetID <- sapply(split(df, df$Column), function(x) nrow(x) >= 60)
df2 <- df[df$Column %in% uniqueID[targetID], , drop = FALSE]
We create a frequency table and then subset the rows based on the 'count' of values in 'Column'
tbl <- table(df$Column) >=60
subset(df, Column %in% names(tbl)[tbl])
Or with ave from base R
df[with(df, ave(Column, Column, FUN = length)>=60),]
Or we use data.table
library(data.table)
setDT(df)[, .SD[.N >= 60], Column]
Or another option with data.table is .I
setDT(df)[df[, .I[.N >=60], Column]$V1]
If there is more than one column to group by, place them in a list (or, more compactly, in .()):
setDT(df)[df[, .I[.N >=60], by = .(Column1, Column2)]$V1]
If there are many columns, we can also pass the names as a character vector or as an object holding them:
colnms <- paste0("Column", 1:5)
setDT(df)[df[, .I[.N >=60], by = c(colnms)]$V1]
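As a small illustration of the multi-column form, with invented column names Column1/Column2 (not from the original question):
library(data.table)
dt <- data.table(Column1 = rep(c("a", "b"), c(70, 30)),
                 Column2 = rep(c("x", "y"), c(70, 30)),
                 value   = rnorm(100))
# keep only rows belonging to groups with at least 60 rows
dt[dt[, .I[.N >= 60], by = .(Column1, Column2)]$V1]
# only the ("a", "x") group (70 rows) survives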
Using data.table
library(data.table)
setDT(df)
df[Column %in% df[, .N, by = Column][N >= 60, Column]]
There is also a variant of Eric Watt's answer which uses a join instead of %in%:
library(data.table)
setDT(df)
df[df[, .N, by = Column][N >= 60, .(Column)], on = "Column"]
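Whichever variant is used, a quick check confirms that only values occurring at least 60 times remain (shown here with df2 from the dplyr answer above):
table(df2$Column)
#  4
# 80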
I have a data.frame with column names A01, A02, ..., A25, ..., Z01, ..., Z25 (altogether 26*25 = 650 columns). For example:
set.seed(1)
df <- data.frame(matrix(rnorm(26*25),ncol=26*25,nrow=1))
cols <- c(paste("0",1:9,sep=""),10:25)
colnames(df) <- c(sapply(LETTERS,function(l) paste(l,cols,sep="")))
and I want to dcast it to a data.frame of 26x25 (rows will be A-Z and columns 01-25). Any idea what would be the formula for this dcast?
We can use tidyverse
library(tidyverse)
res <- gather(df) %>%
  group_by(key = sub("\\D+", "", key)) %>%
  mutate(n = row_number()) %>%
  spread(key, value) %>%
  select(-n)
dim(res)
#[1] 26 25
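gather() and spread() are superseded in current tidyr; with tidyr 1.0 or later the same reshape can be written with pivot_longer()/pivot_wider() (a sketch that keeps the letter as an explicit column):
library(dplyr)
library(tidyr)
res2 <- df %>%
  pivot_longer(everything()) %>%
  mutate(letter = sub("\\d+", "", name),     # row label: A..Z
         num    = sub("\\D+", "", name)) %>% # column label: 01..25
  select(-name) %>%
  pivot_wider(names_from = num, values_from = value)
dim(res2)
# [1] 26 26   (the extra column is `letter`; drop it if a bare 26 x 25 result is wanted)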
The removal of the helper columns doesn't look very elegant (I'm still learning data.table), so improvements are welcome.
library(data.table)

# convert to data.table
df <- data.table(df)
# melt all the columns first
test <- melt(df, measure.vars = names(df))
# split the original column name into single characters, paste the two digits
# back together, then drop the helper column and the original variable
test[, c("ch1", "ch2", "ch3") := tstrsplit(variable, "")]
test[, ch2 := paste(ch2, ch3, sep = "")]
test[, c("ch3", "variable") := NULL]
# dcast with the letters (ch1) as rows and numbers (ch2) as columns
dcastOut <- dcast(test, ch1 ~ ch2, value.var = "value")
Then just drop the first column (ch1), which contains the letters, if you only want the 25 value columns.
The "formula" you're looking for can come from the patterns argument in the "data.table" implementation of melt. dcast is for going from a "long" form to a "wide" form, while melt is for going from a wide form to a long(er) form. melt() does not use a formula approach.
Essentially, you would need to do something like:
library(data.table)
setDT(df) ## convert to a data.table
cols <- sprintf("%02d", 1:25) ## Easier way for you to make cols in the future
melt(df, measure.vars = patterns(cols), variable.name = "ID")[, ID := LETTERS][]
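For reference, on the example df above this returns 26 rows, with the ID column holding the letters and one value column per pattern (a quick check):
out <- melt(df, measure.vars = patterns(cols), variable.name = "ID")[, ID := LETTERS][]
dim(out)
# [1] 26 26   (ID plus 25 value columns)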
I'm trying to use data.table rather than data.frame (for faster code). Despite the syntax differences between them, I'm having problems when I need to extract a specific character column and use it as a character vector. When I call:
library(data.table)
DT <- fread("file.txt")
vect <- as.character(DT[, 1, with = FALSE])
class(vect)
###[1] "character"
head(vect)
It returns:
[1] "c(\"uc003hzj.4\", \"uc021ofx.1\", \"uc021olu.1\", \"uc021ome.1\", \"uc021oov.1\", \"uc021opl.1\", \"uc021osl.1\", \"uc021ovd.1\", \"uc021ovp.1\", \"uc021pdq.1\", \"uc021pdv.1\", \"uc021pdw.1\")"
Any ideas on how to avoid these "\" characters in the output?
as.character() works on vectors, not on data.frame/data.table objects in the way the OP expected. So, to get the first column as a character vector, subset it with .SD[[1L]] and then apply as.character:
DT[, as.character(.SD[[1L]])]
If there are multiple columns, we can specify the column index with .SDcols and loop over the .SD to convert to character and assign (:=) the output back to the particular columns.
DT[, (1:2) := lapply(.SD, as.character), .SDcols= 1:2]
data
DT <- data.table(Col1 = 1:5, Col2= 6:10, Col3= LETTERS[1:5])
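For completeness, double-bracket (or $) indexing also returns a single column as a plain vector, which avoids the as.character() detour entirely when the column already holds character data; a sketch using the example data above:
vect <- DT[["Col3"]]   # or DT$Col3
class(vect)
# [1] "character"
head(vect)
# [1] "A" "B" "C" "D" "E"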