I have two data.tables: dt_main is the main one, to which I need to append the values of a single column from dt_add:
dt_main <- data.table(id = c(1:5)
, name = c("a", "b", NA,NA,NA)
, stuff = c(11:15))
dt_add <- data.table(id = c(4:5)
, name = c("aaa", "bbb"))
I got the job done correctly by first joining and then coalescing:
dt_main_final <- dt_add[dt_main, on = "id"]
dt_main_final[, name := fcoalesce(name, i.name)][, i.name:=NULL]
The output is as expected:
id name stuff
1: 1 a 11
2: 2 b 12
3: 3 <NA> 13
4: 4 aaa 14
5: 5 bbb 15
I wonder whether there is a more direct way to get it done; any suggestions? Thanks.
PS: I also tried melting and then dcasting:
dt <- dt_add[dt_main, on = "id"]
setnames(dt, "i.name", "name")
dt_melt <- melt(dt
, measure.vars = patterns("name")
)
dt_main_final <- dcast(dt_melt
, id + stuff ~ variable
, fun.aggregate = fcoalesce
, value.var = "value")
I got the error:
Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.
Any ideas for this one also?
We can do the join and coalesce in one step:
dt_main[dt_add, name := fcoalesce(name, i.name), on = .(id)]
dt_main
# id name stuff
# <int> <char> <int>
# 1: 1 a 11
# 2: 2 b 12
# 3: 3 <NA> 13
# 4: 4 aaa 14
# 5: 5 bbb 15
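Note that this only fills the NA gaps. If the values coming from dt_add should always overwrite whatever dt_main already holds, the coalesce can be dropped entirely; this variant is my assumption about an alternative intent, not part of the question:
# Overwrite name with dt_add's value wherever ids match
dt_main[dt_add, name := i.name, on = .(id)]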
I have the following data.table:
dat<-data.table(Y=as.factor(c("a","b","a")),"a"=c(1,2,3),"b"=c(3,2,1))
It looks like:
Y a b
1: a 1 3
2: b 2 2
3: a 3 1
What I want is to reduce the value of the column indicated by the row's Y value by 1. E.g. the Y value of the first row is "a", so the value of column "a" in the first row should be reduced by one.
The result should be:
Y a b
1: a 0 3
2: b 2 1
3: a 2 1
Is this possible? If yes, how? Thank you!
Using self-joins and get:
for (yval in dat[ , unique(Y)]){
dat[yval, (yval) := get(yval) - 1L, on = "Y"]
}
dat[]
# Y a b
# 1: a 0 3
# 2: b 2 1
# 3: a 2 1
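The parentheses around (yval) on the left of := force it to be read as the value of yval rather than as a literal column name. If you prefer to avoid the loop, here is a sketch of a loop-free equivalent (my suggestion, verified only against this small example):
vals <- dat[, levels(Y)]
# Y == v is a logical vector coerced to 0/1, so each column is
# reduced by 1 exactly in the rows where Y names that column
dat[, (vals) := lapply(vals, function(v) get(v) - (Y == v))]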
We can use melt/dcast to do this: melt the dataset to 'long' format after creating a row sequence ('N'), subtract 1 from the 'value' column where the 'Y' and 'variable' elements are equal, assign (:=) the output to 'value', then dcast the 'long' format back to 'wide'.
dcast(melt(dat[, N := 1:.N], id.var = c("Y", "N"))[Y==variable,
value := value -1], N + Y ~variable, value.var = "value")[, N := NULL][]
# Y a b
#1: a 0 3
#2: b 2 1
#3: a 2 1
First, an apply call to make the actual transformation. We apply by row and use the row's first element (the value of Y) as the name of the element to access and overwrite. For some reason the values I was accessing in a and b were strings, so I used as.numeric to turn them back into numbers. I don't know if this is normal in data.tables or a result of using apply on one, since I don't use data.tables normally.
tformDat <- apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})
Then you need to reformat back to the original data.table format
data.table(t(tformDat))
The whole thing can be done in one line.
data.table(t(apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})))
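For reference, the strings appear because apply() first coerces the data.table to a matrix, and a matrix with mixed column types falls back to character, so every cell becomes a string. A quick way to confirm this (my illustration, not from the original answer):
typeof(as.matrix(dat)) # "character": mixed columns are coerced to strings
This is also why the one-liner returns character columns; a and b would still need to be converted back to numeric afterwards.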
I have a data.table of the form:
d1 <- data.table(read.csv(header=TRUE, file=textConnection("x1,y1,z1
string1,string2,1
string3,string1,2
string3,string5,3")))
I am trying to convert this data for usage in Spark. It seems Spark doesn't take strings as input, or tries to convert them (I am a complete beginner in Spark):
File
"/grid/6/hadoop/yarn/local/usercache/Z076156/appcache/application_1438295298158_169576/container_1438295298158_169576_01_000003/pyspark.zip/pyspark/mllib/util.py",
line 45, in _parse_libsvm_line
label = float(items[0]) ValueError: could not convert string to float:
"505",0,"17661674","MULTI-COLORED","0","75",2,131,"2","",0,"XS","5.10
So I am trying to convert all strings into numerical factors in R. Here is a simple function which I wrote, based on my success with converting just one column:
string2num <- function(d,a){
l<-unique(c(as.character(d$a)))
return(as.numeric(factor(d$a, levels=l)))
}
However, I am not able to apply it over multiple string columns of a table (due to the atomic vector reference inside the function). I am currently writing simple code snippets and debugging, but with no success. I am expecting some solution of the form:
for(i in colnames(d1)){
if(is.character(d1$i))
string2num(d1,i)
}
or:
d1[,lapply(.SD, string2num),.SDcols=is.character(.SD)]
or:
do.call(rbind(lapply(d1[,sapply(d1,is.character)],string2num)))
or maybe I don't have any clue about the right solution. The expected output will be of the form:
x1 y1 z1
1: 1 1 1
2: 2 2 2
3: 2 3 3
Notice that in column x1 both instances of string3 map to the same number (a one-to-one mapping (string -> some number) within each string column).
You could try:
indx <- which(sapply(d1, is.character))
d1[, (indx) := lapply(.SD, as.factor), .SDcols = indx
][, (indx) := lapply(.SD, as.integer), .SDcols = indx]
or as proposed by #Frank everything in one go:
d1[, (indx) := lapply(.SD, function(x) as.integer(as.factor(x))), .SDcols=indx]
this gives:
> d1
x1 y1 z1
1: 1 2 1
2: 2 1 2
3: 2 3 3
Used data:
d1 <- fread("x1,y1,z1
string1,string2,1
string3,string1,2
string3,string5,3", header=TRUE)
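If you need to map Spark's numeric output back to the original strings later, it may help to keep the level tables around before converting; a minimal sketch of that idea (my addition, not part of the answer):
# Store the string -> integer mapping for each character column
indx <- which(sapply(d1, is.character))
maps <- lapply(d1[, .SD, .SDcols = indx], function(x) levels(as.factor(x)))
maps$x1 # "string1" "string3": integer i in column x1 corresponds to maps$x1[i]
Note that as.factor() sorts levels alphabetically, which is why string2 maps to 2 in y1 above rather than following order of appearance.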
Imagine you want to apply a function row-wise on a data.table. The function's arguments correspond to fixed data.table columns as well as dynamically generated column names.
Is there a way to supply fixed and dynamic column names as argument to a function while using data.tables?
The problems are:
Both fixed variable names and dynamically generated strings need to be passed as arguments to a function over a data.table
The dynamic column name strings are stored in a vector with more than one entry (so get() won't work)
The dynamic column's values need to be supplied as a vector to the function
This illustrates it:
library('data.table')
# Sample dataframe
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3) #fixed and dynamic column names
setkey(D, id)
# Sample function
foo <-function(fix, dynvector){ rep(fix,length(dynvector)) %*% dynvector}
# It does not matter what this function does.
# The result when passing column names not dynamically
D[, "new" := foo(fix,c(dyn1,dyn2)), by=id]
# id fix dyn1 dyn2 new
# 1: 1 1 1 1 2
# 2: 2 2 2 2 8
# 3: 3 3 3 3 18
I want to get rid of the c(dyn1,dyn2). I need to get the column names dyn1, dyn2 from another vector which holds them as strings.
This is how far I got:
# Now we try it dynamically
cn <-paste("dyn",1:2,sep="") #vector holding column names "dyn1", "dyn2"
# Approaches that don't work
D[, "new" := foo(fix,c(cn)), by=id] #wrong as using a mere string
D[, "new" := foo(fix,c(cn)), by=id, with=F] #does not work
D[, "new" := foo(fix,c(get(cn))), by=id] #uses only the first element "dyn1"
D[, "new" := foo(fix,c(mget(cn, .GlobalEnv, inherits=T))), by=id] #does not work
D[, "new" := foo(fix,c(.SD)), by=id, .SDcols=cn] #does not work
I suppose mget() is the solution, but I know too little about scoping to figure it out. Thanks!
Update: Solution
based on the answer by BondedDust
D[, "new" := foo(fix,sapply(cn, function(x) {get(x)})), by=id]
I wasn't able to figure out what you were trying to do with the matrix-multiplication, but this shows how to create new variables with varying and fixed inputs to a function:
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
setkey(D, id)
foo <-function(fix, dynvector){ fix* dynvector}
D[, paste("new",1:2,sep="_") := lapply( c(dyn1,dyn2), foo, fix=fix), by=id]
#----------
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 1
2: 2 2 2 2 4 4
3: 3 3 3 3 9 9
So you need to use a vector of character values to get columns. This is a bit of an extension to this question: Why do I need to wrap `get` in a dummy function within a J `lapply` call?
> D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
> setkey(D, id)
> id1 <- parse(text=cn)
> foo <-function( fix, dynvector){ fix*dynvector}
> D[, paste("new",1:2,sep="_") := lapply( sapply( cn, function(x) {get(x)}) , foo, fix=fix) ]
Warning message:
In `[.data.table`(D, , `:=`(paste("new", 1:2, sep = "_"), lapply(sapply(cn, :
Supplied 2 columns to be assigned a list (length 6) of values (4 unused)
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 2
2: 2 2 2 2 2 4
3: 3 3 3 3 3 6
You could probably use the methods in "create an expression from a function for data.table to eval" as well.
I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which do not belong and which contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
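A quick illustration of that point (mine, not from the original post):
length(dt[, 'a', with = F]) # 1: a one-column data.table is a list of length 1
length(dt[, 'a', with = F][[1]]) # 10: the underlying column vector
length(dt[['a']]) # 10: [[ extracts the vector directly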
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
unique.obs <- length(unique(data[, x]))
if (unique.obs == 1) {
data[, x] <- NULL
}
}
Any insight as to how I may more efficiently count the number of unique observations in each column of a data.table would be much appreciated. Alternatively, a recommendation on how to drop columns that contain only one unique observation would be even better.
Update: uniqueN
As of version 1.9.6, there is a built-in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
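The drop step can be written directly with it as well; a sketch assuming data.table >= 1.12.0, where .SDcols accepts a predicate function:
dt <- dt[, .SD, .SDcols = function(x) uniqueN(x) > 1]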
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) length(unique(dt[, x, with = FALSE][[1]])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
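A quick check with the sample dt from the question, where c is the constant column (my usage example):
dt <- data.table(a = 1:10, b = letters[1:10], c = rep(1, times = 10))
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]])) == 1) := NULL]
dt # only a and b remain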
The approaches in the other answers are good. Another way to add to the mix, just for fun:
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
d1 = "",
c = rep(1, times = 10),
d2 = "")
dt
     a b d1 c d2
 1:  1 a     1
 2:  2 b     1
 3:  3 c     1
 4:  4 d     1
 5:  5 e     1
 6:  6 f     1
 7:  7 g     1
 8:  8 h     1
 9:  9 i     1
10: 10 j     1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
length(unique(x))==1 && x[1]==""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
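One possible simplification, though this is my guess rather than part of the original answer: compute the flag per column without apply(), which avoids coercing everything to character, and delete the columns by reference:
empty <- names(dt)[sapply(dt, function(x) is.character(x) && all(x == ""))]
dt[, (empty) := NULL]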
There is an easy way to do that using the "dplyr" library and its select function:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Here first_variable and second_variable stand for whichever columns you want to keep; you can choose as many variables as you like. Then you will get the type of data that you want.
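If the goal is still to drop the single-value columns automatically rather than naming the keepers by hand, dplyr can do that too; a sketch assuming dplyr >= 1.0.0 for where():
library(dplyr)
newdata <- old_data %>% select(where(~ length(unique(.x)) > 1))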