Formula for dcast'ing a data.frame [duplicate] - r

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I have a data.frame with colnames: A01, A02, ..., A25, ..., Z01, ..., Z25 (altogether 26*25). For example:
set.seed(1)
df <- data.frame(matrix(rnorm(26*25),ncol=26*25,nrow=1))
cols <- c(paste("0",1:9,sep=""),10:25)
colnames(df) <- c(sapply(LETTERS,function(l) paste(l,cols,sep="")))
and I want to dcast it to a data.frame of 26x25 (rows will be A-Z and columns 01-25). Any idea what would be the formula for this dcast?

We can use tidyverse
library(tidyverse)
res <- gather(df) %>%
group_by(key = sub("\\D+", "", key)) %>%
mutate(n = row_number()) %>%
spread(key, value) %>%
select(-n)
dim(res)
#[1] 26 25

The removing of columns doesn't look nice (still learning data.table). Someone needs to make that one nice.
# convert to data.table
df <- data.table(df)
# melt all the columns first
test <- melt(df, measure.vars = names(df))
# split the original column name by letter
# paste the numbers together
# then remove the other columns
test[ , c("ch1", "ch2", "ch3") := tstrsplit(variable, "")][ , "ch2" :=
paste(ch2, ch3, sep = "")][ , c("ch3", "variable") := NULL]
# dcast with the letters (ch1) as rows and numbers (ch2) as columns
dcastOut <- dcast(test, ch1 ~ ch2 , value.var = "value")
Then just remove the first column which contains the number?

The "formula" you're looking for can come from the patterns argument in the "data.table" implementation of melt. dcast is for going from a "long" form to a "wide" form, while melt is for going from a wide form to a long(er) form. melt() does not use a formula approach.
Essentially, you would need to do something like:
library(data.table)
setDT(df) ## convert to a data.table
cols <- sprintf("%02d", 1:25) ## Easier way for you to make cols in the future
melt(df, measure.vars = patterns(cols), variable.name = "ID")[, ID := LETTERS][]

Related

How do I convert all numeric columns to character type in my dataframe?

I would like to do something more efficient than
dataframe$col <- as.character(dataframe$col)
since I have many numeric columns.
In base R, we may either use one of the following i.e. loop over all the columns, create an if/else conditon to change it
dataframe[] <- lapply(dataframe, function(x) if(is.numeric(x))
as.character(x) else x)
Or create an index for numeric columns and loop only on those columns and assign
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
It may be more flexible in dplyr
library(dplyr)
dataframe <- dataframe %>%
mutate(across(where(is.numeric), as.character))
All said by master akrun! Here is a data.table alternative. Note it converts all columns to character class:
library(data.table)
data.table::setDT(df)
df[, (colnames(df)) := lapply(.SD, as.character), .SDcols = colnames(df)]

R: convert mutate across character call to data.table

Apologies if I can't find an original post explaining how to do this tidyverse to data.table conversion.
I would like to mutate across all character vectors replacing <NA> values in character vectors.
How can I translate the following text to data.table?
library(tidyverse)
library(data.table)
mtcars <- mtcars %>%
mutate(across(where(is.character), ~na_if(.,"")))
We can use
library(data.table)
nm1 <- names(mtcars)[sapply(mtcars, is.character)]
setDT(mtcars)[, (nm1) := lapply(.SD, function(x) replace(x, x == "", NA)), .SDcols = nm1]

Combining rows if text is same before certain character [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have a data frame similar to this. I want to sum up values for rows if the text in column "Name" is the same before the - sign.
Remove everything after "-" using sub and then use How to sum a variable by group
df$Name <- sub("-.*", "",df$Name)
aggregate(cbind(val1, val2)~Name, df, sum)
Below is a data.table solution.
Data:
df = data.table(
Name = c('IRON - A', 'IRON - B', 'SABBATH - A', 'SABBATH - B'),
val1 = c(1,2,3,4),
val2 = c(5,6,7,8)
)
Code:
df[, Name := sub("-.*", "", Name)]
mat = df[, .(sum(val1), sum(val2)), by = Name]
> mat
Name V1 V2
1: IRON 3 11
2: SABBATH 7 15
You can rbind your 2 tables (top and bottom) into one data frame and then use dplyr or data.table. The data.table would be much faster for large tables.
data_framme$Name <- sub("-.*", "", data_frame$Name)
library(dplyr)
data_frame %>%
group_by(Name) %>%
summarise_all(sum)
library(data.table)
data.frame <- data.table(data.frame)
data.frame[, lapply(.SD, sum, na.rm=TRUE), by=Name ]

How to omit rows with a value contained in a separate vector [duplicate]

This question already has answers here:
How to delete multiple values from a vector?
(9 answers)
Closed 3 years ago.
I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID= c(101,102,103,104,105), Name = (Martin, Sammie, Reg, Seamus, Aine)
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
You can do this with filter from dplyr
library(tidyverse)
df2 <- df%>%
filter(!ID %in% vec)
If you create this as a data.table (and load data.table package, and fix the errors in the example data):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
Data.table is likely going to run a bit quicker than base R so very useful with large datasets. Microbenchmarking with a large dataset using base R, tidyverse and data.table shows data.table to be a bit quicker than tidyverse and a lot faster than base.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec),], df%>% filter(!ID %in% vec))

changing non character letters in selected variables

I've got a data frame with text. I'd like to change all "," to "-" in all observations of selected variables, and like to select the variables based on their names containing the word date.
I've tried to incorporate various variations of grep() expressions into MyFunc but haven't been able to get it to work.
Thanks!
starting point:
df <- data.frame(dateofbirth=c("25,06,1939","15,04,1941","21,06,1978","06,07,1946","14,07,1935"),recdate=c("26,06,1945","03,04,1964","21,06,1949","15,07,1923","07,12,1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2,st"),disdatenow=c("25,06,1975","25,05,1996","21,06,1932","26,07,1934","07,07,1965"), stringsAsFactors = FALSE)
desired outcome:
df <- data.frame(dateofbirth=c("25-06-1939","15-04-1941","21-06-1978","06-07-1946","14-07-1935"),recdate=c("26-06-1945","03-04-1964","21-06-1949","15-07-1923","07-12-1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2"),disdatenow=c("25-06-1975","25-05-1996","21-06-1932","26-07-1934","07-07-1965"), stringsAsFactors = FALSE)
Current code:
MyFunc <- function(x) {gsub(",","-",df$x)}
You can use mutate_at from dplyr:
df %>%
mutate_at(vars(contains("date")), function(x){gsub(",", "-", x)})
and that gives you this:
dateofbirth recdate b disdatenow
1 25-06-1939 26-06-1945 8,ted,st 25-06-1975
2 15-04-1941 03-04-1964 99,tes,rd 25-05-1996
3 21-06-1978 21-06-1949 6,ldk,dr 21-06-1932
4 06-07-1946 15-07-1923 2,sdd,jun 26-07-1934
5 14-07-1935 07-12-1945 asd,2,st 07-07-1965
Using your function MyFunc, this should also work
MyFunc <- function(x) {gsub(",", "-", x)}
library(data.table)
setDT(df)
cols <- c("dateofbirth", "recdate", "disdatenow")
df[, cols] <- df[, lapply(.SD, MyFunc), .SDcols = cols]

Resources