data.table weird behaviour when used in a function - r

I have a data.frame as follows.
data <- structure(list(V1 = structure(1:3, .Label = c("S01", "S02", "S03"), class = "factor"), V2 = structure(c(1L, 3L, 2L), .Label = c("Alan", "Bruce", "Jay"), class = "factor"), V3 = structure(c(3L, 1L, 2L), .Label = c("Barry", "Dick", "Hal"), class = "factor"), V4 = structure(c(1L, 3L, 2L), .Label = c("Guy", "Jean-Paul", "Wally"), class = "factor"), V5 = structure(c(3L, 1L, 2L), .Label = c("Bart", "Damien", "John"), class = "factor")), .Names = c("V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -3L))
It is not a data.table
is.data.table(data)
[1] FALSE
I have a function foo for example which utilizes data.table for doing some manipulations in the data.frame as follows.
foo <- function(df) {
if(!is.data.frame(df)) stop('"df" is not a data.frame')
setDT(df)
setkey(df, V1)
df[, NEW := paste0(V3, V4)]
setDF(df)
return(df)
}
However when I run the function with the data.frame data (not a data.table), the output out is a data.frame (because of setDF(df)).
out <- foo(data)
is.data.table(out)
[1] FALSE
But now the original data.frame data is a data.table.
is.data.table(data)
[1] TRUE
I understand this is because data.table works by reference. But how should I deal with this when data.table is used inside a function? I don't want to inadvertently change any data.frame in the calling environment. Should I always force a copy with copy or <- instead of using setDT whenever data.table is used in a function, or is there another way?

With regard to
is there another way?
Instead of setDT() inside the function, you could use as.data.table()
foo <- function(df) {
if(!is.data.frame(df)) stop('"df" is not a data.frame')
df <- as.data.table(df)
setkey(df, V1)
df[, NEW := paste0(V3, V4)]
setDF(df)
return(df)
}
foo(data)
# V1 V2 V3 V4 V5 NEW
# 1 S01 Alan Hal Guy John HalGuy
# 2 S02 Jay Barry Wally Bart BarryWally
# 3 S03 Bruce Dick Jean-Paul Damien DickJean-Paul
is.data.table(data)
# [1] FALSE
For some examples of functions that turn the input data frame into a data table but do not change the original data frame at all, I'd definitely recommend looking at source code for the functions in package splitstackshape.
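On the question of whether to force a copy: one option, sketched here on the assumption that a full copy of the input is affordable, is to hand setDT() a deep copy so that the caller's object is never touched. The function name foo_copy and the toy data frame demo are made up for illustration.

```r
library(data.table)

# Hypothetical variant of foo(): convert a deep copy, not the caller's object
foo_copy <- function(df) {
  if (!is.data.frame(df)) stop('"df" is not a data.frame')
  dt <- setDT(copy(df))        # copy() duplicates df; setDT() converts the duplicate
  setkey(dt, V1)               # sorts the copy by V1
  dt[, NEW := paste0(V3, V4)]  # adds the column by reference, on the copy only
  setDF(dt)                    # turn the copy back into a plain data.frame
  dt
}

# Toy input (illustrative only)
demo <- data.frame(V1 = c("S02", "S01"),
                   V3 = c("Barry", "Hal"),
                   V4 = c("Wally", "Guy"))
out <- foo_copy(demo)

is.data.table(demo)  # FALSE: the input is still a plain data.frame
names(demo)          # "V1" "V3" "V4": no NEW column leaked into the input
```

The cost is one extra copy of the data per call, which is the same trade-off as.data.table() makes.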

Related

Filter table by column in R

I would like to filter a table when the column name is stored in a variable. I tried the code below but it did not work. dat is a data frame, the name of the column is Name, and I would like to filter by "John".
colname <- "Name"
dat[dat$colname %in% "John",]
It works fine if I do not use a variable for the column name (the code below works):
dat[dat$"Name" %in% "John",]
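The $ operator does not evaluate its argument, so dat$colname looks for a column literally called colname and returns NULL. Use the bracket functions [[ or [ instead.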
colname <- "Name"
dat[dat[[colname]] %in% "John", ]
dat[dat[, colname] %in% "John", ] # or
# Name X1 X2
# 8 John 0.8646536 1.2688507
# 9 John -1.7201559 -0.3125515
Data
dat <- structure(list(Name = structure(c(3L, 3L, 2L, 4L, 4L, 2L, 3L,
1L, 1L, 2L), .Label = c("John", "Linda", "Mary", "Olaf"), class = "factor"),
X1 = c(0.758396178001042, -1.3061852590117, -0.802519568703793,
-1.79224083446114, -0.0420324540227439, 2.15004261784474,
-1.77023083820321, 0.864653594565389, -1.72015589816109,
0.134125668141181), X2 = c(-0.0758265646523722, 0.85830054437592,
0.34490034810227, -0.582452690107777, 0.786170375925402,
-0.692099286413293, -1.18304353631275, 1.26885070606311,
-0.31255154601115, 0.0305712590978896)), class = "data.frame", row.names = c(NA,
-10L))
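To see why the original attempt selects nothing, here is a small sketch on a made-up data frame (toy is not the data above): $ looks up the literal name, whereas [[ evaluates the variable.

```r
# Toy data, invented for illustration
toy <- data.frame(Name = c("Mary", "John", "Linda", "John"),
                  X1 = c(1, 2, 3, 4))
colname <- "Name"

toy$colname                       # NULL: there is no column called "colname"
empty <- toy[toy$colname %in% "John", ]   # logical(0) index, so zero rows
hits  <- toy[toy[[colname]] %in% "John", ] # [[ evaluates colname to "Name"

nrow(empty)  # 0
nrow(hits)   # 2
```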
An approach with dplyr using non-standard evaluation, using @jay.sf's data:
library(dplyr)
dat %>% filter(!!sym(colname) == "John")
# Name X1 X2
#1 John 0.864654 1.268851
#2 John -1.720156 -0.312552
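Another dplyr spelling, sketched here on a made-up data frame, is the .data pronoun, which accepts a column name held in a string without building a symbol first. The object toy is an assumption for illustration only.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(Name = c("Mary", "John", "Linda", "John"),
                  X1 = c(1, 2, 3, 4))
colname <- "Name"

# .data[[colname]] looks the column up by the string stored in colname
res <- filter(toy, .data[[colname]] == "John")
res$X1  # 2 4
```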
In data.table, we can use get
library(data.table)
setDT(dat)[get(colname) == "John"]
Since we have only one value to compare we can use == here instead of %in%.
With data.table, we can use eval with as.symbol
library(data.table)
setDT(dat)[eval(as.symbol(colname)) == "John"]
# Name X1 X2
#1: John 0.8646536 1.2688507
#2: John -1.7201559 -0.3125515

How to convert a dataframe into named list and remove the NA too

I have the following data frame:
df <- structure(list(cell_type = c("Adipocytes", "Astrocytes", "B cells"
), V1.x = structure(c(NA, 14L, 4L), .Label = c("alb", "beta-s",
"ccr2", "cd74", "cx3cr1", "fosb", "gria2", "gzma", "lck", "myh6",
"plp1", "ptgs2", "s100a9", "slc1a2", "ttr"), class = "factor"),
V2.x = structure(c(7L, 18L, 8L), .Label = c("1500015o10rik",
"apold1", "ccl5", "cd74", "coro1a", "cybb", "fabp4", "h2-aa",
"hpx", "mag", "ms4a4b", "myh7", "s100a8", "selplg", "slc4a1",
"smoc2", "snap25", "xist"), class = "factor"), V3.x = structure(c(8L,
1L, 6L), .Label = c("bcan", "coro1a", "crispld2", "csf1r",
"emcn", "h2-ab1", "itgb2", "lpl", "mal", "mt3", "myl2", "ngp",
"nkg7", "rhd", "s100a8", "serpina1a", "slc1a2", "tyrobp"), class = "factor")), row.names = c(NA,
3L), class = "data.frame")
It looks like this:
cell_type V1.x V2.x V3.x
1 Adipocytes <NA> fabp4 lpl
2 Astrocytes slc1a2 xist bcan
3 B cells cd74 h2-aa h2-ab1
What I want to do is convert it into a named list of vectors, with cell_type as the names, and I also want to remove the <NA> values, yielding:
$Adipocytes
fabp4 lpl
$Astrocytes
slc1a2 xist bcan
$`B cells`
cd74 h2-aa h2-ab1
How can I achieve that?
I'm stuck with this: lapply(group_split(df, cell_type), as.vector)
We could use split to split based on cell_type and then use lapply to remove NA values
lapply(split(df[-1], df$cell_type), function(x) x[!is.na(x)])
#$Adipocytes
#[1] "fabp4" "lpl"
#$Astrocytes
#[1] "slc1a2" "xist" "bcan"
#$`B cells`
#[1] "cd74" "h2-aa" "h2-ab1"
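A variation using dplyr and purrr: use group_split to split by cell_type, discard the NA values from each piece, and assign names with setNames. Note that group_split returns the groups sorted by cell_type, so naming with df$cell_type relies on df already being in that order (as it is here).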
library(dplyr)
library(purrr)
df %>%
mutate_all(as.character) %>%
group_split(cell_type, .keep = FALSE) %>%
map(~discard(flatten_chr(.), is.na)) %>%
setNames(df$cell_type)
We can use base R
setNames(apply(df[-1], 1, function(x) unname(x)[complete.cases(x)]), df[[1]])
#$Adipocytes
#[1] "fabp4" "lpl"
#$Astrocytes
#[1] "slc1a2" "xist" "bcan"
#$`B cells`
#[1] "cd74" "h2-aa" "h2-ab1"
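One caveat with the apply() version: when every row happens to have the same number of non-NA values, apply() simplifies its result to a matrix instead of a list. A minimal sketch that always returns a list, assuming one row per name, could loop over row indices instead. The helper name to_named_list and the toy data are made up for illustration.

```r
# Hypothetical helper: one list element per row, NA values dropped
to_named_list <- function(d, name_col = 1) {
  vals <- as.matrix(d[-name_col])          # character matrix of the value columns
  out <- lapply(seq_len(nrow(vals)), function(i) {
    row <- unname(vals[i, ])               # drop the column names
    row[!is.na(row)]                       # keep only the non-missing entries
  })
  setNames(out, d[[name_col]])             # name each element after its row label
}

# Toy data, invented for illustration
toy <- data.frame(cell_type = c("A", "B"),
                  g1 = c(NA, "x"),
                  g2 = c("y", "z"),
                  stringsAsFactors = FALSE)
res <- to_named_list(toy)
res$A  # "y"
res$B  # "x" "z"
```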

Converting factors into numeric format with signs in R

Suppose I have a data frame (df) in which every element is a factor:
df
---
+100.5
+120.2
-30.0
+75.0
-600.3
How can I convert df into a numeric df using R? I will be very glad for any help. Thanks a lot.
Calling as.numeric() directly on a factor returns its internal integer codes rather than the values shown, so convert the factor to character first and then to numeric.
This should work:
df_n <- as.data.frame(as.numeric(as.character(df[,1])))
colnames(df_n) <- "df_n"
head(df_n)
# df_n
#1 100.5
#2 120.2
#3 -30.0
#4 75.0
#5 -600.3
class(df_n[,1])
#[1] "numeric"
data
df <- structure(list(df = structure(c(4L, 5L, 2L, 3L, 1L),
.Label = c("-600.3", "-30", "75", "100.5", "120.2"),
class = "factor")), .Names = "df",
row.names = c(NA, -5L), class = "data.frame")
Hope this helps.
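A short sketch of the difference, using a factor built from the values in the question (the object names f, vals and alt are made up here). It also shows the as.numeric(levels(f))[f] idiom, which converts each level once and then indexes by the codes.

```r
# Factor holding signed numbers as text, as in the question
f <- factor(c("+100.5", "+120.2", "-30.0", "+75.0", "-600.3"))

codes <- as.numeric(f)              # internal level codes, NOT the values
vals  <- as.numeric(as.character(f)) # the actual numbers; a leading "+" is accepted
alt   <- as.numeric(levels(f))[f]    # same result, converting each level only once

vals  # 100.5  120.2  -30.0   75.0 -600.3
```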

How to prevent data.table to force numeric variables into character variables without manually specifying these?

Consider the following dataset:
dt <- structure(list(lllocatie = structure(c(1L, 6L, 2L, 4L, 3L), .Label = c("Assen", "Oosterwijtwerd", "Startenhuizen", "t-Zandt", "Tjuchem", "Winneweer"), class = "factor"),
lat = c(52.992, 53.32, 53.336, 53.363, 53.368),
lon = c(6.548, 6.74, 6.808, 6.765, 6.675),
mag.cat = c(3L, 2L, 1L, 2L, 2L),
places = structure(c(2L, 4L, 5L, 6L, 3L), .Label = c("", "Amen,Assen,Deurze,Ekehaar,Eleveld,Geelbroek,Taarlo,Ubbena", "Eppenhuizen,Garsthuizen,Huizinge,Kantens,Middelstum,Oldenzijl,Rottum,Startenhuizen,Toornwerd,Westeremden,Zandeweer", "Loppersum,Winneweer", "Oosterwijtwerd", "t-Zandt,Zeerijp"), class = "factor")),
.Names = c("lllocatie", "lat", "lon", "mag.cat", "places"),
class = c("data.table", "data.frame"),
row.names = c(NA, -5L))
When I want to split the strings in the last column into separate rows, I use (with data.table version 1.9.5+):
dt.new <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=list(lllocatie,lat,lon,mag.cat)]
However, when I use:
dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=lllocatie]
I get the same result except that all columns are forced into character variables. For small datasets it is not a big problem to list the variables that should not be split in the by argument, but for datasets with many columns/variables it is. I know it is possible to do this with the splitstackshape package (as mentioned by @ColonelBeauvel in his answer), but I'm looking for a data.table solution as I want to chain more operations to this.
How can I prevent that without manually specifying the variables that do not have to be split in the by argument?
Two solutions with data.table:
1: Use the type.convert=TRUE argument inside tstrsplit() as proposed by @Arun:
dt.new1 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE, type.convert=TRUE))), by=lllocatie]
2: Use setdiff(names(dt),"places") in the by argument as proposed by @Frank:
dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=setdiff(names(dt),"places")]
Both approaches give the same result:
> identical(dt.new1,dt.new2)
[1] TRUE
The advantage of the second solution is that when you have more than one column with string values, only the column you exclude in setdiff(names(dt),"places") is split (supposing you want to split only that specific one, in this case places). The splitstackshape package also offers this advantage.
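The second approach can be tried on a toy table. This is a minimal sketch: the table toy and its columns are invented for illustration, and it calls plain strsplit() on the one column to split rather than tstrsplit() over .SD. It assumes places is a character column (wrap it in as.character() if it is a factor).

```r
library(data.table)

# Toy data, invented for illustration
toy <- data.table(loc    = c("A", "B"),
                  lat    = c(52.9, 53.3),
                  places = c("x,y", "z"))

# Group by every column except the one being split; those columns keep their types
long <- toy[, .(places = unlist(strsplit(places, ",", fixed = TRUE))),
            by = setdiff(names(toy), "places")]

nrow(long)        # 3
class(long$lat)   # "numeric"
```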
It's exactly a job for cSplit from splitstackshape package:
library(splitstackshape)
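# direction = "long" gives one row per place; the default ("wide") adds columns instead
cSplit(dt, 'places', ',', direction = "long")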

Check whether value in one dataframe is in another (larger) dataframe

I'm struggling to come up with a vectorised solution to the following problem. I have two dataframes:
> people <- data.frame(name = c('Fred', 'Bob'), profession = c('Builder', 'Baker'))
> people
name profession
1 Fred Builder
2 Bob Baker
> allowed <- data.frame(name = c('Fred', 'Fred', 'Bob', 'Bob'), profession = c('Builder', 'Baker', 'Barman', 'Biker'))
> allowed
name profession
1 Fred Builder
2 Fred Baker
3 Bob Barman
4 Bob Biker
That is to say, I want to check every person in people has a permitted profession, and return any names which do not.
For instance, Fred can be a Builder or a Baker, and so he is fine. However, Bob can be a Barman or a Biker, but not a Baker (note: there are only ever two permitted professions in my use case).
I would like to return a data frame of those names which do not have a permitted profession:
name profession permitted
1 Bob Baker Biker
2 Bob Baker Barman
Thanks for the help
Simple base-only solution. I'm sure someone can come up with something better.
out <- allowed[!allowed$name %in% merge(people, allowed)$name, ]
This gets you the desired people, along with their permitted professions. If you also want their actual professions:
names(out)[2] <- "permitted"
out <- merge(people, out, all.y=TRUE)
Here's a slightly more readable data.table solution. You can chain the two steps on one line to make it a one-liner, if you consider that readable.
# load library and convert both data frames to data.tables
library(data.table)
people = as.data.table(people)
allowed = as.data.table(allowed)
# anti-join: people whose (name, profession) pair is not in allowed
bad = people[!allowed, on = c("name", "profession")]
# look up the permitted professions for those people
result = allowed[bad, on = "name", .(name, profession = i.profession, permitted = x.profession)]
result
# name profession permitted
#1: Bob Baker Barman
#2: Bob Baker Biker
Probably there's another way, but this should work. I added a third person with an unpermitted profession to show you how to apply the function to the entire dataset.
currentprof <-structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(3L,
2L, 1L), .Label = c("Analyst", "Baker", "Builder"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -3L))
allowed <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(4L,
1L, 2L, 3L, 6L, 5L), .Label = c("Baker", "Barman", "Biker", "Builder",
"Driver", "Teacher"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -6L))
checkprof <- function(name){
allowedn <- allowed[allowed$name == name,]
currentprofn <- currentprof[currentprof$name==name,]
if(!currentprofn$profession %in% allowedn$profession)
{result <- merge(currentprofn, allowedn, by = "name", all.x=TRUE)} else
{result <-data.frame(col1=character(),
col2=character(),
col3=character(),
stringsAsFactors=FALSE)}
colnames(result) <- c("name","profession","permitted")
return(result)
}
do.call(rbind,lapply(levels(allowed$name),checkprof))
This is my take on it. It may need some more testing, and I'd be open to suggestions myself. It works with your example but I am not sure it would generalize.
people$check <- ifelse(people$profession %in% allowed[which(allowed$name == people$name),"profession"], TRUE,FALSE)
people_select <- people[people$check == TRUE,]
EDIT: and just for clarification in case this is holding you back from voting. The ifelse is vectorized and will run very fast.
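A vectorised check that compares whole (name, profession) pairs rather than recycling allowed$name against people$name can be sketched in base R as follows. This is one possible approach, not the method of any answer above; the "|" separator assumes no name or profession contains that character, and the objects key_people, key_allowed, violators and out are made-up names.

```r
people <- data.frame(name = c("Fred", "Bob"),
                     profession = c("Builder", "Baker"),
                     stringsAsFactors = FALSE)
allowed <- data.frame(name = c("Fred", "Fred", "Bob", "Bob"),
                      profession = c("Builder", "Baker", "Barman", "Biker"),
                      stringsAsFactors = FALSE)

# Build one composite key per row (assumes "|" never appears in the data)
key_people  <- paste(people$name, people$profession, sep = "|")
key_allowed <- paste(allowed$name, allowed$profession, sep = "|")

# TRUE when the person's actual profession is among their permitted ones
people$check <- key_people %in% key_allowed

# Keep the violators and attach every profession they are permitted to have
violators <- people[!people$check, c("name", "profession")]
out <- merge(violators, setNames(allowed, c("name", "permitted")), by = "name")
out  # two rows for Bob (profession Baker), permitted Barman and Biker
```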
