I have a column variable that I want to split into three factor variables. These are the factor variables I want to create:
goal<-c('newref', 'meow', 'woof')
area<-c('eco', 'social', 'bank')
fr<-c('demo', 'hist', 'util')
And the current variable looks more or less like this:
code<-c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')
And let's say the data frame is something like this:
df<-data.frame(var1=c(1,2,3,4,5), var2=c('a', 'b', 'c', 'd', 'e'), code=code)
So I would like to create 3 new columns, one per factor variable, and use a regular expression that detects which variable each value belongs to. For example, row number one should look as follows:
row1<-data.frame(var1=1, var2=c('a'), code=c('goal\\\\meow'), goal=2, area=NA, fr=NA)
Also note that the value of the factor variables is an abbreviation of the value in code (eg, history / hist).
The database is likely to have 10000 entries, so I would really appreciate any hints on this.
Thank you!
We can define a function that, for each row, returns the position of the first element of the factor vector that matches the code column when used as a regular expression:
find_match <- function(code, matches) {
  # one logical column per candidate pattern, then the first TRUE in each row
  apply(sapply(matches, grepl, code), 1, match, x = TRUE)
}
If there is no match, this function returns NA for that row.
Next, we can simply use mutate from dplyr to add each column of factors:
library(dplyr)
df %>% mutate(goal = find_match(code, goal),
              area = find_match(code, area),
              fr = find_match(code, fr))
Which gives:
var1 var2 code goal area fr
1 1 a goal\\\\meow 2 NA NA
2 2 b area\\\\bank NA 3 NA
3 3 c area\\\\bank NA 3 NA
4 4 d fr\\\\utilitarian NA NA 3
5 5 e fr\\\\history NA NA 2
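One caveat with find_match() as written: sapply() returns a plain vector rather than a matrix when code has a single element, and apply() then fails. A minimal sketch of a variant that forces a matrix follows. The name find_match_safe is hypothetical, and fixed = TRUE is an assumption that the abbreviations should be matched as literal text rather than as regular expressions.

```r
# Hypothetical variant of find_match() that also handles a single-row input.
find_match_safe <- function(code, matches) {
  # One logical column per candidate abbreviation.
  # fixed = TRUE (an assumption) matches the abbreviations literally.
  hits <- vapply(matches, grepl, logical(length(code)), x = code, fixed = TRUE)
  # vapply() drops to a vector when length(code) == 1, so rebuild the matrix.
  hits <- matrix(hits, nrow = length(code))
  # Position of the first TRUE in each row, or NA when nothing matches.
  apply(hits, 1, function(row) match(TRUE, row))
}

goal <- c('newref', 'meow', 'woof')
code <- c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')

goal_idx   <- find_match_safe(code, goal)     # all five rows
single_idx <- find_match_safe(code[1], goal)  # a single row
```

The result can be assigned to a column with base R as well, for example df$goal <- find_match_safe(df$code, goal).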
Doing this with tidyverse tools, namely the pipe %>% from dplyr and separate() and spread() from tidyr:
separate() breaks up the code column into two, using the separator you specify.
Because "\" is a special character in regex, you have to escape each \ you want to look for with another \.
spread() converts the result from tall form to wide form, as you needed.
library(dplyr)
library(tidyr)  # separate() and spread() live in tidyr, not dplyr
df %>%
separate(code, into = c("colName", "value"), sep = "\\\\\\\\") %>%
spread(colName, value)
Just a quick question: how can I replace some values with others when these values can appear in any of the data frame's columns? Functions like mapvalues and recode work only if the column is specified, but in my case the data frame has 89 columns, so that would be time-consuming.
For the sake of clarity, take into consideration the following example. I want to replace [NULL] with another value.
Example:
a <- c("NULL",2,"NULL")
b <- c(3, "NULL", 1)
df <- data.frame(a, b)
df
a b
0 NULL 3
1 2 NULL
2 NULL 1
The difference between the example and my case is that the dataset is [35383 x 89], and the values I want to replace are more than one.
Thank you in advance for your time.
An extension to the comment by Ronak Shah: you can replace the NULLs with 0 if you like, or with any other values you need.
For example, replace the NULLs with mean of the respective columns:
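# Loop over the columns and convert them to numeric; in your case every column holds text.
# Going through as.character() first also handles factor columns correctly.
# "NULL" cannot be parsed as a number, so it becomes NA (with a coercion warning).
for (i in colnames(df)){
  df[,i] <- as.numeric(as.character(df[,i]))
}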
#Now replace the NAs with the mean of the column
for (i in colnames(df)){
df[,i][is.na(df[,i])] <- mean(df[,i], na.rm=TRUE)
}
You can similarly do this for median also. Let me know in the comment if you have any doubts.
For starters, I have added a few more rows to your example to better show how the code works
df
# a b
#1 NULL 3
#2 2 NULL
#3 NULL 1
#4 a 14
#5 1 a
#6 14 5
First, create two vectors: one with the values you want to replace (pattern) and one with the replacements in the same order. To make sure you have done it right, put them together in a data frame and take a look at the rows (this will also help in the next step).
In this case, I want NULL to be 0, "a" to be "alpha", and so on, as shown below:
pattern <- c("NULL", "a", 14, 1)
replacement <- c(0, "alpha", "fourteen", "one")
subs <- data.frame(pattern, replacement)
subs
# pattern replacement
#1 NULL 0
#2 a alpha
#3 14 fourteen
#4 1 one
To finish, we write a for loop that on each iteration picks one pattern and its replacement from the subs data frame we created and runs map_df() with them. This function iterates over the columns of our original data frame (df) and applies gsub() with that pattern and replacement. Note that gsub() also replaces partial matches inside longer values, so the order of the patterns matters.
library(purrr)  # provides map_df()
for (i in 1:nrow(subs)) {
  df <- map_df(df, gsub, pattern = subs$pattern[i], replacement = subs$replacement[i])
}
df
# a b
#1 0 3
#2 2 0
#3 0 one
#4 alpha fourteen
#5 one alpha
#6 fourteen 5
I hope this was clear. Let me know if you have any doubts
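Because gsub() substitutes inside longer strings as well (a pattern of "1" would also change "21"), an exact-match lookup may be safer when whole cell values are meant. The following is a minimal base R sketch under that assumption. The lookup vector is a hypothetical name, and the data is the six-row example from the answer above.

```r
# Example data from the answer above, kept as character columns.
df <- data.frame(a = c("NULL", "2", "NULL", "a", "1", "14"),
                 b = c("3", "NULL", "1", "14", "a", "5"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup: names are the values to replace, elements the replacements.
lookup <- c("NULL" = "0", "a" = "alpha", "14" = "fourteen", "1" = "one")

# Apply the same whole-value replacement to every column.
# df[] <- keeps the data.frame structure while replacing its columns.
df[] <- lapply(df, function(col) {
  col <- as.character(col)
  hit <- col %in% names(lookup)           # cells whose full value is in the lookup
  col[hit] <- unname(lookup[col[hit]])    # replace only those cells
  col
})
```

Since every column is handled by the same lapply() call, this scales to the 89-column case without naming any column.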
I want to cbind a column to the data frame with the column name dynamically assigned from a string
y_attribute = "Survived"
cbind(test_data, y_attribute = NA)
This results in a new column named y_attribute instead of the required Survived, which is provided as a string in the y_attribute variable. What needs to be done to get a column in the data frame with the column name taken from a variable?
You don't actually need cbind to add a new column. Any of these will work:
test_data[, y_attribute] = NA # data frame row,column syntax
test_data[y_attribute] = NA # list syntax (would work for multiple columns at once)
test_data[[y_attribute]] = NA # list single item syntax (single column only)
New columns are added after the existing columns, just like cbind.
We can use tidyverse to do this
library(dplyr)
test_data %>%
mutate(!! y_attribute := NA)
# col1 Survived
#1 1 NA
#2 2 NA
#3 3 NA
#4 4 NA
#5 5 NA
data
test_data <- data.frame(col1 = 1:5)
Not proud of this, but I usually do something like this:
dyn.col <- "XYZ"
# add the column under a placeholder name, then rename it
test.data <- cbind(test.data, UNIQUE_NAMEXXX = NA)
colnames(test.data)[colnames(test.data) == 'UNIQUE_NAMEXXX'] <- dyn.col
We can also do it with data.table
library(data.table)
setDT(test_data)[, (y_attribute) := NA]
I want to map the FactorName in the data frame FName to the column header names of Stack, i.e. Factor1 in Stack is actually named Value, Factor2 is Leverage, etc. I have a large dataset, so manual renaming is not an option.
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(FactorID=c("Factor1","Factor2","Factor3"), FactorName=c("Value","Leverage","Growth"))
Thanks.
How about this using match:
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(
FactorID=c("Factor1","Factor2","Factor3"),
FactorName=c("Value","Leverage","Growth"))
# Matching entries from FName
colnames(Stack) <- ifelse(
!is.na(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
as.character(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
colnames(Stack));
Stack;
# rowid Value Leverage Growth
#1 1 2 3 4
#2 2 3 4 5
#3 3 4 5 6
Explanation: We match column names of Stack and entries from FName$FactorID. If there is a match, replace with FName$FactorName, else keep the original column name.
If we have the full set of new names handy as a vector in column order, we can assign them directly:
colnames(Stack) <- c("rowid", "Value", "Leverage", "Growth")
Another approach using match, but using indexing instead of ifelse
# Get indices of matches
m <- match(names(Stack), FName$FactorID)
# replace names where a match is found.
names(Stack)[!is.na(m)] <- as.character(FName$FactorName[m[!is.na(m)]])
I am trying to filter NA, NaN and Inf values out of a tbl using dplyr's filter function.
The trick is that I only want to apply the filter to columns whose names contain a specific pattern. The pattern is: r1, r2, r3, etc.
I have tried to combine grep and filter to achieve this, but can't get it to work. My current code looks like this:
filter_(!is.na(grep("r[1-9]", colnames(DF), value = TRUE))
& !is.infinite(grep("r[1-9]", colnames(DF), value = TRUE))
& !is.nan(grep("r[1-9]", colnames(DF), value = TRUE)))
However, this code returns a warning message: "Truncating vector to length 1."
And the data returned is unfiltered.
I suspect that it's the is.na functions here that are causing the problem, because I've seen an example online where you can apply grep to filter using a normal condition (i.e. condition == value) and not a condition based on is.na
dplyr provides matches() that is useful for this
Example 1: How does matches() work?
library(dplyr)
# remove columns whose names contain "mp"
mtcars %>% select(-matches("mp"))
# keep columns whose names contain "mp"
mtcars %>% select(matches("mp"))
Example 2: Using matches() in the context of your request but using a MWE
# Create a dummy dataset
data = tibble(id = c("John","Paul","George","Ringo"),
r1 = c(1,2,NA,NA),
r2 = c(1,2,NA,4),
s1 = c(1,NA,3,4))
# Filter NAs in columns whose names contain r followed by a digit
data %>% filter_at(vars(matches("r[0-9]")), all_vars(!is.na(.)))
Here is a base R method to filter rows, comparing specific columns.
# sample data
set.seed(1234)
dat <- data.frame(r1=c(NA, 1,NaN, 5, Inf), r2=c(NA, 1,NaN, NA, Inf), d=rnorm(5))
this data set looks like
dat
r1 r2 d
1 NA NA -1.2070657
2 1 1 0.2774292
3 NaN NaN 1.0844412
4 5 NA -2.3456977
5 Inf Inf 0.4291247
We will check the first two columns and ignore the third column. Notice that the only row that should remain is row 2.
dat[Reduce("&", lapply(dat[grep("^r", names(dat))], is.finite)),]
r1 r2 d
2 1 1 0.2774292
Here, a data.frame that is subset using grep to select the appropriate columns (1 and 2) is fed to lapply. The regex "^r" says to include only variables whose names start with "r". In the lapply loop, each vector is checked using is.finite. This function returns FALSE for NA, NaN, and Inf. The resulting list of logical vectors is fed to Reduce, which returns a logical vector with one element per row of the data.frame, where an element is TRUE if and only if every checked value in that row is finite.
With dplyr, you can use the filter_at function:
dat %>% filter_at(vars(matches("^r[1-9]")), all_vars(is.finite(.)))
Using @lmo's sample data, the result is:
r1 r2 d
1 1 1 0.2774292
Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
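As a minimal sketch of the grep idea, the example below drops every column whose name matches a pattern. The data frame and the pattern "^r[0-9]+$" are made up for illustration. Negative indexing with grep() (or which()) needs a guard, because an empty match gives -integer(0), which selects no columns at all; the grepl() form avoids that problem.

```r
# Made-up example data: drop the columns named r1, r2, ...
df <- data.frame(id = 1:3, r1 = 4:6, r2 = 7:9, score = 10:12)

# grepl() returns one TRUE/FALSE per column name, so negating it is always safe,
# even when nothing matches. drop = FALSE keeps the result a data.frame.
kept <- df[, !grepl("^r[0-9]+$", names(df)), drop = FALSE]

# grep() returns positions; guard against the no-match case before negating.
drop_idx <- grep("^r[0-9]+$", names(df))
kept2 <- if (length(drop_idx) > 0) df[, -drop_idx, drop = FALSE] else df

# A pattern that matches nothing leaves the data frame unchanged with grepl().
unchanged <- df[, !grepl("^zzz", names(df)), drop = FALSE]
```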