mutate_at using multiple conditons - r

I would to transform the columns that contain 1 or 2 in their names of my dataframe test by dividing them by the column Unit.
It seems to work with one conditions but I do not know how to add the condition OR 1
test= test%>% mutate_at(vars(contains('2')), funs(./Unit))
any idea?
Identifier Source 196001 200006 Unit
1: top HH NA NA 1e-06
2: top2 BB NA 4569.6 1e+00

This should work:
test = read.table(header = T,text="
Identifier Source 196001 200006 Unit
top HH 65 888 3
top2 BB 0111 9886 8") #I modified your values so you can see the divisions
test %>% mutate_at(vars(contains('2'), contains('1')), funs(./Unit))
Basically you say "select variables that contain 2, oh and also select variables than contain 1".
If you wanted to select variables that contain 1 AND 2, this would be a different problem and might require regexp (with dplyr::match).
Also, please note that funs is soft-deprecated, you should now use lambda functions:
test = test %>% mutate_at(vars(contains('2'), contains('1')), ~./Unit)

Related

How to replace values in all columns in r

Just a quick question: how can I replace some values with others if these values are present in all the dataframe's column? Functions like mapvalues and recode work only if the column is specified, but in my case the dataframe has 89 columns so that would be time-consuming.
For the sake of clarity, take in consideration the following example. I want to replace [NULL] with another value.
Example:
a <- c("NULL",2,"NULL")
b <- c(3, "NULL", 1)
df <- data.frame(a, b)
df
a b
0 NULL 3
1 2 NULL
2 NULL 1
The difference between the example and my case is that the dataset is [35383 x 89], and the values I want to replace are more than one.
Thank you in advance for your time.
An extension to the comment by Ronak Shah. You can add 0 if you want like that. Or you can replace it with desired values, if you like that.
For example, replace the NULLs with mean of the respective columns:
#Run a loop to convert the characters into numbers because for your case it is all characters
#This will change the NULL to NAs.
for (i in colnames(df)){
df[,i] <- as.numeric(df[,i])
}
#Now replace the NAs with the mean of the column
for (i in colnames(df)){
df[,i][is.na(df[,i])] <- mean(df[,i], na.rm=TRUE)
}
You can similarly do this for median also. Let me know in the comment if you have any doubts.
For starters, I have added a few more rows to your example to better show how the code works
df
# a b
#1 NULL 3
#2 2 NULL
#3 NULL 1
#4 a 14
#5 1 a
#6 14 5
First, create two vectors: one with whe values you want to replace (pattern) and one with replacements in the same order. To make sure you have done it right, put them together in a data frame and take a look at the rows (this will also help in next step)
In this case, I want NULL to be 0, "a" to be "alpha", and so on, as shown below
pattern <- c("NULL", "a", 14, 1)
replacement <- c(0, "alpha", "fourteen", "one")
subs <- data.frame(pattern, replacement)
subs
# pattern replacement
#1 NULL 0
#2 a alpha
#3 14 fourteen
#4 1 one
To finish it, we will make a for tthat each time we will pick a pattern and its replacement from the subs data frame we created, and with these values execute a map_df(). This function iterates over the columns from our original data frame (df) and apply the gsub() function with the pattern and replacement
for (i in 1:nrow(subs)) {
df <- map_df(df, gsub, pattern = subs$pattern[i], replacement = subs$replacement[i])
}
df
# a b
#1 0 3
#2 2 0
#3 0 one
#4 alpha fourteen
#5 one alpha
#6 fourteen 5
I hope this was clear. Let me know if you have any doubts

Replacing values based on multiple columns in R

I have raw data with multiple observation and I have a cleaning log which contains some new values for specific columns of raw data I want to replace old values with these new ones.
My raw data is :
raw_df<- data.frame(
id=c(1,2,3,4),
name=c("a","b","c","d"),
age=c(15,16,20,22),
add=c("xyz","bc","no","da")
)
MY cleaning log is :
cleaning_log <- data.frame(
id=c(2,4),
question=c("name","age"),
old_value=c("b",22),
new_value=c("bob",25)
)
And my expected result is :
result<-data.frame(
id=c(1,2,3,4),
name=c("a","bob","c","d"),
age=c(15,16,20,25),
add=c("xyz","bc","no","da")
)
Note:At the end how can I check whether these new values are replaced properly or not?
In addition, in cleaning log question column I may have more than two columns like 10 to 20 which possibly will have new value but here I just give two column names as an example.
Thanks in advance for you help
Find out the row number and column number to change in raw_df using match and replace it with cleaning_log$new_value.
row_inds <- match(cleaning_log$id, raw_df$id)
col_inds <- match(cleaning_log$question, names(raw_df))
raw_df[cbind(row_inds, col_inds)] <- cleaning_log$new_value
raw_df
# id name age add
#1 1 a 15 xyz
#2 2 bob 16 bc
#3 3 c 20 no
#4 4 d 25 da

which pattern from string another string matches

I have a column variable that I want to split into three factor variables. There are the factor variables I want to create:
goal<-c('newref', 'meow', 'woof')
area<-c('eco', 'social', 'bank')
fr<-c('demo', 'hist', 'util')
And the current variable looks more or less like that:
code<-c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')
And let's say the dataframe is something like that
df<-data.frame(var1=c(1,2,3,4,5), var2=c('a', 'b', 'c', 'd', 'e'), code=code)
So I would like to create 3 new columns, one per each factor variable, and use a regular expression that detected what it belongs to. So for example row number one should look as follows:
row1<-data.frame(var1=1, var2=c('a'), code=c('goal\\\\meow'), goal=2, area=NA, fr=NA)
Also note that the value of the factor variables is an abbreviation of the value in code (eg, history / hist).
The database is likely to have 10000 entries, so I would really appreciate any hints on this.
Thank you!
We can define a function that finds the position of the factor variable that, when used as a regular expression, finds a match in the code column:
find_match <- function(code, matches) {
apply(sapply(matches, grepl, code), 1, match, x=T)
}
If there is no match, this function returns NA for that row.
Next, we can simply use mutate from dplyr to add each column of factors:
df %>% mutate(goal = find_match(code, goal),
area = find_match(code, area),
fr = find_match(code, fr))
Which gives:
var1 var2 code goal area fr
1 1 a goal\\\\meow 2 NA NA
2 2 b area\\\\bank NA 3 NA
3 3 c area\\\\bank NA 3 NA
4 4 d fr\\\\utilitarian NA NA 3
5 5 e fr\\\\history NA NA 2
Doing this with tidyverse tools like the pipe %>% and dplyr:
Separate breaks up the code column into two with the separator you specify.
Because "\" is a special character in regex you have to escape each \ you want to look for with another .
Spread converts it from tall form to wide form as you needed.
library(dplyr)
df %>%
separate(code, into = c("colName", "value"), sep = "\\\\\\\\") %>%
spread(colName, value)

Multiple one-to-many matching between vectors in R

I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Matching indices (with NAs for no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Not that this not a very robust solution. It depends on your exact data structure here (repeating 'c', 'd' pattern) but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
df <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store variable into another variable.
df = subset(data, select = -c(genome) )
Using dplyR, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.

Resources