Subset of list in Dataframe R in categorical variable - r

My data looks like this, but with approximately 10,000 observations.
Part <- c(1, 2, 3, 4, 5, 6, 7)
Disease_codes <- c("A101.12", "A111.12", "A121.13", "A130.0", "B102", "C132", "D156")
Disease_codes <- factor(Disease_codes)
df <- data.frame(Part, Disease_codes)
The observations whose Disease_codes start with A10 through A13 are BloodCancer patients. I need to subset them, and I am trying the following:
BloodCancer <- subset(df, grepl('^A10', Disease_codes), select = Part)
Part_without_Blood_cancer <- subset(df, !grepl('^A10', Disease_codes))
If I try the following, it does not work:
BloodCancer <- subset(df, grepl('^A10-A13', Disease_codes), select = Part)
The first command gives me only the participants with A10 codes, but I want BloodCancer to contain all codes from A10 to A13. How can I do this in one command?

The syntax for grepl to return TRUE for any of several patterns (e.g. A10, A11) is as follows:
grepl("^A10|^A11", variable). There must be no spaces around the |, because a space inside the pattern is matched literally. To keep it as one statement, you can do the following:
BloodCancer = subset(df, grepl(paste0("^A1", 0:3, collapse = "|"), Disease_codes), select = Part)

Try it this way:
BloodCancer <- subset(df, grepl("^A1[0-3]", as.character(Disease_codes)), select = Part)
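To see why the character class works where '^A10-A13' does not, here is a minimal sketch on the sample data from the question. Inside square brackets, 0-3 is a range of single characters; outside brackets, the hyphen is just a literal character, so '^A10-A13' only matches strings that literally begin with "A10-A13". The helper names is_blood_cancer and literal_hyphen are made up for illustration.

```r
# Sample data from the question
Part <- 1:7
Disease_codes <- factor(c("A101.12", "A111.12", "A121.13", "A130.0",
                          "B102", "C132", "D156"))
df <- data.frame(Part, Disease_codes)

# TRUE for codes starting with A10, A11, A12 or A13
is_blood_cancer <- grepl("^A1[0-3]", as.character(df$Disease_codes))

# Both subsets from one logical vector
BloodCancer <- df[is_blood_cancer, "Part", drop = FALSE]
Part_without_Blood_cancer <- df[!is_blood_cancer, ]

# The pattern from the question treats "-" literally and matches nothing here
literal_hyphen <- grepl("^A10-A13", as.character(df$Disease_codes))
```

Computing the logical vector once and negating it keeps the two subsets consistent with each other.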

An option with dplyr:
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Disease_codes, "^A1[0-3]")) %>%
  select(Part)

Related

How to OR Loop in R

I have a data set with 100 values and want to pick only specific items from that data set. That's how I do it right now:
df.match <- subset(df.raw.csv, value == "UC9d" | value == "UCenoM")
It's working, but I want to solve it with a loop. I tried this, but I only get one match, although I know both values are in the data set.
for (ID in c("UC9d" , "UCenoM")){df.match <- subset(df.raw.csv, value == ID)}
Any suggestions?
My suggestion would be not to use loops in R:
library(dplyr)
mydata <- mutate(mydata, TOBEINCL = 0) # rename according to your data
Create a vector of patterns to match against mydata$ID (^ and $ anchor the pattern so that only exact matches count):
toMatch <- c("^UC9d$", "^UCenoM$")
Use pattern matching from base R to flag the matching rows:
mydata$TOBEINCL[grep(paste(toMatch, collapse = "|"), mydata$ID, ignore.case = FALSE)] <- 1
Select data:
mydataINCL <- mydata[mydata$TOBEINCL == 1, ]
mydataINCL$ID <- factor(mydataINCL$ID) # drop unused factor levels left over from the full data
An option:
df.match <- subset(df.raw.csv, value %in% c("UCenoM", "UC9d"))
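The loop in the question returns only one match because each iteration overwrites df.match, so only the result for the last ID survives. A minimal sketch of that behaviour, using a small made-up data frame in place of df.raw.csv (every value other than "UC9d" and "UCenoM", and the column n, is invented for illustration):

```r
# Hypothetical stand-in for df.raw.csv
df.raw.csv <- data.frame(value = c("UC9d", "other1", "UCenoM", "other2"),
                         n = 1:4, stringsAsFactors = FALSE)
ids <- c("UC9d", "UCenoM")

# The loop from the question: the result is overwritten on every pass
for (ID in ids) {
  last_only <- df.raw.csv[df.raw.csv$value == ID, ]
}

# If a loop is really wanted, collect one piece per ID and bind them
pieces <- lapply(ids, function(ID) df.raw.csv[df.raw.csv$value == ID, ])
looped <- do.call(rbind, pieces)

# The vectorised form does the same in one step
vectorised <- subset(df.raw.csv, value %in% ids)
```

Here last_only holds just the "UCenoM" row, while looped and vectorised both hold the two wanted rows.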

How to use the R pipe operator (%>%) in the following cases

1) I have a data frame named df, how can I include an if statement within the mutate function used within the pipe operator? The following does not work:
df %>%
mutate_if(myvar == "A", newColumn = oldColumn*3, newColumn = oldColumn)
The variable myvar is not a column of the data frame; it is a "flag" variable whose value is either "A" or "B". When it is "A", I would like to create a new column named "newColumn" that is three times the old column (named "oldColumn"); otherwise newColumn should be identical to the old column.
2) I would like to divide the column named "numbers" by the entry of numbers in the row where another column, "seconds", has its minimum value, as follows:
df$newCol <- df$numbers / df[df$seconds== min(df$seconds),]$numbers
How can I do that with mutate command and "%>%", so that it looks more handy? Nothing that I tried works unfortunately.
Thanks for any answers,
J.
If myvar is just a variable floating around in the environment, you can use an if-else statement within mutate:
library(dplyr)
# Flag variable that lives outside the data frame
myvar <- "A"
# Generate dataset
df <- tibble(oldColumn = rnorm(100))
# Mutate with if-else conditions: triple the column when myvar is "A", otherwise copy it
df <- df %>% mutate(newColumn = if (myvar == "A") oldColumn * 3 else oldColumn)
If myvar is included as a column in the data frame, then you can use case_when.
# Generate dataset
df <- tibble(myvar = sample(c("A", "B"), 100, replace = TRUE),
oldColumn = rnorm(100))
# Create a new column which depends on the value of myvar
df <- df %>%
mutate(newColumn = case_when(myvar == "A" ~ oldColumn*3,
myvar == "B" ~ oldColumn))
As for question 2, you can use mutate with the "." placeholder, which refers to the left-hand side of the pipe (i.e. "df") inside the right-hand-side call. You can then filter down to the row with the minimum value of seconds (the top_n call with -1 as its argument) and pull out the value of the numbers variable:
# Generate data
df <- tibble(numbers = sample(1:60),
seconds = sample(1:60))
# Do computation
df <- df %>% mutate(newCol = numbers / top_n(.,-1,seconds) %>% pull(numbers))
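A shorter sketch for question 2 is also possible, because column names are visible directly inside mutate. which.min returns the position of the first minimum of seconds, so this assumes the minimum is unique or that taking the first tie is acceptable. The sample values are invented so the result is easy to check by hand.

```r
library(dplyr)

# Small hypothetical data set; the minimum of seconds is in row 2
df <- tibble(numbers = c(10, 20, 30, 40),
             seconds = c(5, 2, 9, 7))

# Divide every entry of numbers by the entry in the row with minimal seconds
df <- df %>% mutate(newCol = numbers / numbers[which.min(seconds)])
```

With these values the divisor is 20, so newCol is 0.5, 1, 1.5, 2.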

tidyr or dplyr equivalent of JMP split table

JMP has a "split table" platform:
http://www.jmp.com/support/help/Split_Columns.shtml
Here is the image for it:
The "split by" becomes part of the column headers.
The "split columns" are the columns spread out.
The "group" are retained columns.
I have looked at a few links/pages and can't seem to get this right in R. Right now I have to kluge it into a macro in JMP.
Links that didn't help me include:
Use dplyr's group_by to perform split-apply-combine
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Split a column of a data frame to multiple columns
I need to split a table of ~20k rows and ~30 columns along one of the columns (integers between 0 and 13), ending up with ~1400 rows, where ~25 of the columns are split into ~350.
An inelegant, but repeatable, example is splitting this cars table
according to this:
Yields this:
How do I do this and retain the ~5 non-split columns using an R library like tidyr or dplyr?
Using reshape, it's not too terrible to do one split column at a time. You could then merge the model and engine.disp together. For your real example, you could just change the lists in aggregate and formula in cast.
x <- read.csv('http://web.pdx.edu/~gerbing/data/cars.csv', stringsAsFactors = FALSE)
names(x) <- tolower(names(x))
agg <- aggregate(list(model = x$model), list(origin = x$origin, cylinders = x$cylinders, year = x$year), FUN = paste, collapse = ',')
library(reshape)
output <- cast(data = agg, formula = origin + cylinders ~ year, value = 'model')
Edit:
I haven't checked all possible cases, but this function should work similarly to JMP's split table, or at least give you a good start.
x <- read.csv('http://web.pdx.edu/~gerbing/data/cars.csv', stringsAsFactors = FALSE)
names(x) <- tolower(names(x))
jmpsplitcol <- function(data, splitby, splitcols, group){
  require(reshape)
  require(tidyr)
  # Grouping columns: the split-by column plus the retained group columns
  aggsplitlist <- data[ , names(data) %in% c(splitby, group)]
  aggsplitlist <- lapply(aggsplitlist, `[`)
  # Collapse the split columns to one comma-separated string per group
  agg <- aggregate(list(data[ , names(data) %in% splitcols]), aggsplitlist, FUN = paste, collapse = ',')
  newgat <- gather_(data = agg, key = 'splitcolname', 'myval', splitcols)
  castformula <- as.formula(paste(paste(group, collapse = ' + '), '~', 'splitcolname', '+', splitby))
  output <- cast(data = newgat, formula = castformula, value = 'myval')
  output
}
res <- jmpsplitcol(x, c('year'), c('engine.disp', 'model'), c('origin', 'cylinders'))
head(res)
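Since the question asks for a tidyr or dplyr equivalent, here is a hedged sketch using tidyr's pivot_wider, assuming tidyr 1.1.0 or later. The small cars data frame below is invented for illustration and only borrows the column names used above. id_cols plays the role of JMP's "group", names_from the "split by" column, and values_from the "split columns".

```r
library(tidyr)

# Hypothetical miniature of the cars table
cars <- data.frame(
  origin      = c("US", "US", "US", "Japan"),
  cylinders   = c(8, 8, 8, 4),
  year        = c(70, 70, 71, 70),
  model       = c("a", "b", "c", "d"),
  engine.disp = c(307, 350, 318, 97),
  stringsAsFactors = FALSE
)

# One row per origin/cylinders group, one column per split column and year;
# several values landing in the same cell are pasted into one string
wide <- pivot_wider(
  cars,
  id_cols     = c(origin, cylinders),
  names_from  = year,
  values_from = c(model, engine.disp),
  values_fn   = function(v) paste(v, collapse = ",")
)
```

The new columns are named model_70, model_71, engine.disp_70 and engine.disp_71, and cells with no data are filled with NA.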

R: Convert frequency to percentage with only a selected number of columns

I would like to convert a dataframe filled with frequencies into a dataframe filled with percentage by row using dplyr.
My data set also contains other variables, and I only want to calculate the percentages for a set of columns defined by a vector of names. I also want to use the dplyr library.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat()
, z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
You're almost there:
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_at(names_to_transform, funs(./sum_freq_codpos))
The dot . roughly translates to "the object I am manipulating here", which in this call is each column named in names_to_transform in turn.
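In more recent dplyr releases, funs() is deprecated and mutate_at() is superseded by across(). A sketch of the same row-percentage calculation, assuming dplyr 1.0.0 or later and a small invented data frame so the result can be checked by hand:

```r
library(dplyr)

# Hypothetical data: frequency columns b, c, d plus unrelated columns a and z
df <- data.frame(a = c("x", "y"), b = c(1, 3), c = c(1, 3), d = c(2, 6),
                 z = c("X", "Y"), stringsAsFactors = FALSE)
names_to_transform <- c("b", "c", "d")

df2 <- df %>%
  # Row total over the selected columns only
  mutate(sum_freq_codpos = rowSums(across(all_of(names_to_transform)))) %>%
  # Divide each selected column by the row total; other columns are untouched
  mutate(across(all_of(names_to_transform), ~ .x / sum_freq_codpos))
```

Rows whose selected frequencies sum to zero would produce NaN, so those may need separate handling.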

removing columns from a data frame which feature in a list, but don't feature in another list

Say my variables are as follows.
df = read.csv('somedataset.csv') #contains 'col1','col2','col3','col4','col5' say
colsSomeRemoveSomeDontRemove = c('col1','col2','col3')
colsDontRemove = 'col2'
I would like to remove all those columns from df which feature in colsSomeRemoveSomeDontRemove, but are not part of colsDontRemove.
So basically, at the end my df should contain only columns 'col2','col4','col5'
How can I do that?
I have tried doing the following, but could not get it to work
df1 = cbind(df[,which(!(names(df) %in% colsSomeRemoveSomeDontRemove))],as.data.frame(df[,colsDontRemove]))
df[, !(colnames(df) %in% setdiff(colsSomeRemoveSomeDontRemove, colsDontRemove))]
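The one-liner above can be checked on a small stand-in for the CSV. The data frame below is invented (somedataset.csv is not available), and colsToDrop is a hypothetical helper name. setdiff(x, y) returns the elements of x that are not in y, which is exactly the set of columns to drop.

```r
# Hypothetical stand-in for read.csv('somedataset.csv')
df <- data.frame(col1 = 1:2, col2 = 3:4, col3 = 5:6, col4 = 7:8, col5 = 9:10)
colsSomeRemoveSomeDontRemove <- c("col1", "col2", "col3")
colsDontRemove <- "col2"

# Columns that are candidates for removal and are not protected
colsToDrop <- setdiff(colsSomeRemoveSomeDontRemove, colsDontRemove)

# drop = FALSE keeps a data frame even if only one column remains
df1 <- df[, !(colnames(df) %in% colsToDrop), drop = FALSE]
```

Here colsToDrop is col1 and col3, leaving col2, col4 and col5 in df1.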
