tidyr or dplyr equivalent of JMP split table - r

JMP has a "split table" platform:
http://www.jmp.com/support/help/Split_Columns.shtml
Here is the image for it:
The "split by" becomes part of the column headers.
The "split columns" are the columns spread out.
The "group" are retained columns.
I have looked at a few links/pages and can't seem to get this right in R. Right now I have to kluge it into a macro in JMP.
Links that didn't help me include:
Use dplyr's group_by to perform split-apply-combine
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Split a column of a data frame to multiple columns
I need to split a table of ~20k rows and ~30 columns along one of the columns (integers between 0 and 13), so that it becomes ~1400 rows, with ~25 of the columns split into ~350.
An inelegant, but repeatable, example is splitting this cars table
according to this:
Yields this:
How do I do this and retain the ~5 non-split columns using an R library like tidyr or dplyr?

Using reshape, it's not too terrible to do one split column at a time. You could then merge the model and engine.disp together. For your real example, you could just change the lists in aggregate and formula in cast.
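x <- read.csv('http://web.pdx.edu/~gerbing/data/cars.csv', stringsAsFactors = FALSE)
names(x) <- tolower(names(x))
# collapse the models within each origin/cylinders/year cell into one comma-separated string
agg <- aggregate(list(model = x$model), list(origin = x$origin, cylinders = x$cylinders, year = x$year), FUN = paste, collapse = ',')
library(reshape)
# one column per year, one row per origin/cylinders combination
output <- cast(data = agg, formula = origin + cylinders ~ year, value = 'model')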
Edit:
I haven't checked all possible cases, but this function should work similarly to the JMP split table platform, or at least give you a good start.
x <- read.csv('http://web.pdx.edu/~gerbing/data/cars.csv',stringsAsFactors = F)
names(x) <- tolower(names(x))
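jmpsplitcol <- function(data, splitby, splitcols, group){
  require(reshape)
  require(tidyr)
  # grouping variables for aggregate(): the split-by column plus the retained group columns
  aggsplitlist <- data[ , names(data) %in% c(splitby, group)]
  aggsplitlist <- lapply(aggsplitlist, `[`)
  # collapse each split column to one comma-separated string per group/split-by cell
  agg <- aggregate(list(data[ , names(data) %in% splitcols]), aggsplitlist, FUN = paste, collapse = ',')
  # stack the split columns into long form (gather_() is deprecated in recent tidyr releases)
  newgat <- gather_(data = agg, key = 'splitcolname', 'myval', splitcols)
  # spread to wide form: one column per split column / split-by value pair
  castformula <- as.formula(paste(paste(group, collapse = ' + '), '~', 'splitcolname', '+', splitby))
  output <- cast(data = newgat, formula = castformula, value = 'myval')
  output
}
res <- jmpsplitcol(x, c('year'), c('engine.disp', 'model'), c('origin', 'cylinders'))
head(res)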
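For readers on a current tidyr (1.1.0 or later is assumed here), the same kind of split can probably be written with pivot_wider(). This is only a sketch: cars_small is a made-up table standing in for cars.csv, and collapsing duplicate cells with paste() is one choice among several.

```r
library(tidyr)

# Hypothetical toy data standing in for the cars table
cars_small <- data.frame(
  origin      = c("US", "US", "US", "Japan", "Japan"),
  cylinders   = c(8, 8, 8, 4, 4),
  year        = c(70, 70, 71, 70, 71),
  engine.disp = c(307, 350, 400, 97, 113),
  model       = c("chevelle", "skylark", "impala", "corona", "celica"),
  stringsAsFactors = FALSE
)

split_wide <- pivot_wider(
  cars_small,
  id_cols     = c(origin, cylinders),       # JMP "group" columns (retained)
  names_from  = year,                       # JMP "split by" column
  values_from = c(engine.disp, model),      # JMP "split columns"
  # several rows can land in one cell, so collapse them into one string
  values_fn   = function(v) paste(v, collapse = ",")
)
split_wide
```

The new columns are named from the split column and the split-by value (engine.disp_70, model_71, and so on), and every split column comes back as character because of the paste() step.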

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
You haven't added an example of your code, so I've made a quick and simple example to demonstrate my answer. I think the logic would be much more straightforward if you first combined the list of tibbles into a single tibble. That lets you do everything you want in a simple dplyr pipe, with outliers ultimately flagged by 1s in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
                  colB = seq(0.1,2.1,0.1),
                  id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
                  colB = seq(21,41,1),
                  id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
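The question also asks about replacing each outlier with the mean of the values in the same position. One way that might work, sketched here in base R, is to pull column "2" from every table into a rows-by-tables matrix, flag and replace there, and write the column back so the list structure is kept. The list tbls below is made up (10 tables of 3 rows, one injected outlier); it should behave the same with tibbles.

```r
# Hypothetical stand-in for the imported list: 10 tables of 3 rows each,
# with the measurement in a column literally named "2"
tbls <- lapply(1:10, function(k) {
  data.frame(row = 1:3, `2` = c(10, 20, 30) + k, check.names = FALSE)
})
tbls[[4]][["2"]][2] <- 500   # inject one obvious outlier

# rows x tables matrix: vals[i, j] is row i of column "2" in table j
vals <- sapply(tbls, function(t) t[["2"]])

pos_mean <- rowMeans(vals)       # mean of each row position across tables
pos_sd   <- apply(vals, 1, sd)   # sd of each row position across tables

# TRUE where a value lies more than 2 sd from its positional mean
is_out <- abs(vals - pos_mean) > 2 * pos_sd

# Replace each outlier with the mean of its row position
# (note: that mean still includes the outlier itself)
vals[is_out] <- pos_mean[row(vals)[is_out]]

# Write the cleaned column back, leaving all other columns untouched
cleaned <- Map(function(t, j) { t[["2"]] <- vals[, j]; t }, tbls, seq_along(tbls))
```

With very few tables this rule cannot flag anything: for n values the largest possible deviation is (n - 1) / sqrt(n) standard deviations, which only exceeds 2 from n = 6 upward. With 60 tables that is not a concern.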

Subset of list in Dataframe R in categorical variable

My data looks like this, but the number of observations is approximately 10,000.
Part <- c(1,2,3,4,5,6,7)
Disease_codes <- c("A101.12","A111.12","A121.13","A130.0","B102","C132","D156")
Disease_codes <- factor(Disease_codes)
df <- data.frame(Part, Disease_codes)
The observations with Disease_codes starting with A10 through A13 are BloodCancer patients. I need to make a subset of them, and I am trying the following:
BloodCancer <- subset(df, grepl('^A10', Disease_codes), select = Part)
Part_without_Blood_cancer <- subset(df, !grepl('^A10', Disease_codes))
If I try the following, it does not work:
BloodCancer <- subset(df, grepl('^A10-A13', Disease_codes), select = Part)
The first version gives me only the participants with A10 codes, but I want the BloodCancer variable to contain all codes from A10 to A13. How can I do this in one command?
The syntax for grepl to return TRUE for any of several strings (e.g. A10, A11) is as follows:
grepl("A10|A11", variable). To keep it as one statement, you can do the following:
BloodCancer = subset(df, grepl(paste(paste("A1", 0:3, sep = ""), collapse = "|"), Disease_codes), select = Part)
Try it this way:
BloodCancer <- subset(df, grepl("^A1[0-3]", as.character(Disease_codes)), select = Part)
An option with dplyr
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Disease_codes, "^A1[0-3]")) %>%
  select(Part)
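A quick check in base R shows why the original attempt fails: in a regular expression, '^A10-A13' is the literal text "A10-A13", not a range, whereas a character class such as [0-3] matches any one of the digits 0 to 3. The codes vector below simply reuses the values from the question.

```r
codes <- c("A101.12", "A111.12", "A121.13", "A130.0", "B102", "C132", "D156")

# Literal match: no code starts with the text "A10-A13"
literal_hits <- grepl("^A10-A13", codes)

# Character class: "A1" followed by one of 0, 1, 2, 3
class_hits <- grepl("^A1[0-3]", codes)

# Alternation built with paste0(); equivalent here, anchored at the start
alt_pattern <- paste0("^A1", 0:3, collapse = "|")
alt_hits <- grepl(alt_pattern, codes)

codes[class_hits]
```

Anchoring with ^ matters if a code such as "XA10" could occur elsewhere in the data; without it the pattern matches anywhere in the string.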

R: Convert frequency to percentage with only a selected number of columns

I would like to convert a data frame filled with frequencies into a data frame filled with row-wise percentages, using dplyr.
My data set also contains other variables, and I only want to calculate the percentages for a set of columns defined by a vector of names.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat()
, z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
You're almost there:
df2 <- df %>%
  mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
  mutate_at(names_to_transform, funs(./sum_freq_codpos))
The dot . roughly translates to "the object I am manipulating here", which in this call is "the focal variable in names_to_transform".
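In newer dplyr releases funs() is deprecated and mutate_at() is superseded by across(), so a version along the following lines may be preferable today. It is a sketch that assumes dplyr 1.0.0 or later and uses a small fixed data frame instead of the simulated one, so the numbers can be checked by hand.

```r
library(dplyr)

# Small deterministic stand-in for the simulated data in the question
df <- data.frame(a = c("x", "y"),
                 b = c(1, 2), c = c(3, 2), d = c(4, 4),
                 z = c("X", "Y"))
names_to_transform <- c("b", "c", "d")

df2 <- df %>%
  # row total over the selected columns only
  mutate(sum_freq_codpos = rowSums(across(all_of(names_to_transform)))) %>%
  # divide each selected column by that row total
  mutate(across(all_of(names_to_transform), ~ .x / sum_freq_codpos))
df2
```

Rows whose selected columns are all zero would come out as NaN, since the row total is zero; whether to keep or recode those depends on the data.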

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell of dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
  parts <- strsplit(x$Pos, ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  # strsplit() returns character strings, so convert before building the sequence
  start <- as.numeric(range[1])
  end <- as.numeric(range[2])
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of each range are inclusive
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# create a column in dat to merge-by
dat <- cbind(dat, VAR=NA)
# use the VAR in dat2 as the merge id
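# an assignment inside sapply() only changes a local copy, so loop and update dat itself
# (if a dat row matches several dat2 rows, only the last match is kept)
for (i in seq_len(ncol(matches)))
  dat$VAR[matches[, i]] <- dat2[i, "VAR"]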
merge(dat, dat2)
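Neither answer compares the chromosome part of the position, and the example data in the question (chr15 versus chr1) has no matching rows. A base R sketch that handles both points is to parse the strings into numeric columns, pair up rows on the same chromosome, and keep the pairs where the position falls inside the range. The column names chr, start, end and pos are introduced here for illustration, and the dat2 rows are made-up values chosen so that two of them match.

```r
dat <- data.frame(
  Locus = c("ACTC1-001_1", "ACTC1-001_2"),
  Pos   = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
  NVAR  = c("1", "2"),
  stringsAsFactors = FALSE
)
# Hypothetical variants: two inside the ranges above, one on another chromosome
dat2 <- data.frame(
  VAR     = c("chr15:35086900", "chr15:35087734", "chr1:116242719"),
  REF.ALT = c("T/A", "A/G", "C/T"),
  FUNC    = "intergenic",
  stringsAsFactors = FALSE
)

# Parse "chr:start..end" into chromosome, start and end
dat$chr   <- sub(":.*$", "", dat$Pos)
dat$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", dat$Pos))
dat$end   <- as.numeric(sub("^.*\\.\\.", "", dat$Pos))

# Parse "chr:pos" into chromosome and position
dat2$chr <- sub(":.*$", "", dat2$VAR)
dat2$pos <- as.numeric(sub("^.*:", "", dat2$VAR))

# All pairs on the same chromosome, then keep positions inside the range
candidates <- merge(dat, dat2, by = "chr")
merged <- candidates[candidates$pos >= candidates$start &
                     candidates$pos <= candidates$end, ]
merged
```

A variant that falls in several ranges appears once per range, as the question asks. The intermediate candidates table grows with the product of the row counts per chromosome, so for large inputs an interval join such as data.table::foverlaps may be a better fit.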

R loops: Adding a column to a table if does not already exist

I am trying to compile data from several files using for loops in R. I would like to get all the data into one table. The following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2), "Density_3" = rnorm(10,4,2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2))
dat <- c("dat1", "dat2")
for(i in 1:length(dat)){
data <- get(dat[i])
melt.data <- melt(data, id = 1)
assign(paste(dat[i], "tbl", sep=""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column into dat2? I would like to get the same column name ("Density_3" in this case) and fill it up with zeros, if it does not already exist. Assume that I have ~100 tables with number of columns (Density_1, 2, 3 etc) varying between 5 and 6.
I tried the following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
dat.all$Density_3 <- 0
} else {
dat.all$Density_3 <- dat.all$Density3}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill
I would have constructed "dat", not as a character vector but as a list of two dataframes, used aggregate( ..., FUN=mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill that is ) and then do.call(rbind.fill, ...) on the resulting list. At any rate this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
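The list-based approach described above can also be sketched in base R without plyr, which makes the "add the column if it does not exist" step explicit. The attempt in the question fails partly because names(data) %in% "Density_3" is a vector with one element per column, while if() needs a single TRUE or FALSE; "Density_3" %in% names(data) would give that. The tables below are made-up values, and the helper names means, all_cols and aligned are only for illustration.

```r
# Hypothetical stand-ins for the ~100 tables, kept in a named list
tables <- list(
  dat1 = data.frame(Specimen = paste0("sp", 1:3), Density_1 = c(1, 2, 3),
                    Density_2 = c(4, 5, 6), Density_3 = c(7, 8, 9)),
  dat2 = data.frame(Specimen = paste0("fg", 1:3), Density_1 = c(2, 4, 6),
                    Density_2 = c(1, 1, 1))
)

# One row of column means per table (the Specimen column is dropped)
means <- lapply(tables, function(d) as.data.frame(as.list(colMeans(d[-1]))))

# Union of all column names seen in any table
all_cols <- unique(unlist(lapply(means, names)))

# Add each missing column as NA, then put the columns in a common order
aligned <- lapply(means, function(m) {
  for (nm in setdiff(all_cols, names(m))) m[[nm]] <- NA_real_
  m[all_cols]
})

combined <- do.call(rbind, aligned)
combined
```

Missing columns are filled with NA here, in line with the point above that zeros would misrepresent values that were never measured; replace NA_real_ with 0 if zeros really are wanted.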
