I have a data.table of the form:
d1 <- data.table(read.csv(header=TRUE, file=textConnection("x1,y1,z1
string1,string2,1
string3,string1,2
string3,string5,3")))
I am trying to convert this data for use in Spark. It seems Spark doesn't accept strings as input, or tries to convert them (I am a complete beginner with Spark):
File "/grid/6/hadoop/yarn/local/usercache/Z076156/appcache/application_1438295298158_169576/container_1438295298158_169576_01_000003/pyspark.zip/pyspark/mllib/util.py", line 45, in _parse_libsvm_line
    label = float(items[0])
ValueError: could not convert string to float: "505",0,"17661674","MULTI-COLORED","0","75",2,131,"2","",0,"XS","5.10
So I am trying to convert all strings into numerical factors in R. Here is a simple function which I wrote, based on my success converting just one column:
string2num <- function(d, a){
  l <- unique(as.character(d[[a]]))
  return(as.numeric(factor(d[[a]], levels=l)))
}
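Applied to a single column it produces the first-appearance coding I am after, e.g. (with the d1 above):
string2num(d1, "x1")
# [1] 1 2 2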
However, I am not able to apply it over multiple string columns of a table (because of the atomic vector reference inside the function). I am currently writing simple code snippets and debugging, but without success. I am expecting a solution of the form:
for(i in colnames(d1)){
if(is.character(d1$i))
string2num(d1,i)
}
or:
d1[,lapply(.SD, string2num),.SDcols=is.character(.SD)]
or:
do.call(rbind(lapply(d1[,sapply(d1,is.character)],string2num)))
or maybe I don't have any clue of the right solution. The expected output will be of the form:
x1 y1 z1
1: 1 1 1
2: 2 2 2
3: 2 3 3
Notice that in the x1 column both instances of string3 map to the same number, 2 (a one-to-one mapping (string -> some number) for all string columns).
You could try:
indx <- which(sapply(d1, is.character))
d1[, (indx) := lapply(.SD, as.factor), .SDcols = indx
][, (indx) := lapply(.SD, as.integer), .SDcols = indx]
or, as proposed by @Frank, everything in one go:
d1[, (indx) := lapply(.SD, function(x) as.integer(as.factor(x))), .SDcols=indx]
this gives:
> d1
x1 y1 z1
1: 1 2 1
2: 2 1 2
3: 2 3 3
Used data:
d1 <- fread("x1,y1,z1
string1,string2,1
string3,string1,2
string3,string5,3", header=TRUE)
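Note that factor() sorts its levels alphabetically, which is why y1 comes out as 2, 1, 3 above rather than the 1, 2, 3 in the expected output. If the codes should instead follow order of first appearance, a small variation of the same idea (same indx as above) does it:
d1[, (indx) := lapply(.SD, function(x) as.integer(factor(x, levels = unique(x)))), .SDcols = indx]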
I want to use a list external to my data.table to inform what a new column of data in that data.table should be: in this case, the length of the list element corresponding to a data.table attribute.
# dummy list. I am interested in extracting the vector length of each list element
l <- list(a=c(3,5,6,32,4), b=c(34,5,6,34,2,4,6,7), c = c(3,4,5))
# dummy dt, the underscore number in Attri2 is the element of the list i want the length of
dt <- data.table(Attri1 = c("t","y","h","g","d","e","d"),
Attri2 = c("fghd_1","sdafsf_3","ser_1","fggx_2","sada_2","sfesf_3","asdas_2"))
# extract that number to a new attribute, just for clarity
dt[, list_gp := tstrsplit(Attri2, "_", fixed=TRUE, keep=2)]
# then calculate the lengths of the vectors in the list, and attempt to subset by the index taken above
dt[, list_len := '[['(lapply(l, length), list_gp)]
Error in lapply(l, length)[[list_gp]] : no such index at level 1
I envisaged the list_len column to be 5,3,5,8,8,3,8
A couple of things.
tstrsplit gives you a string; convert it to a number.
I'm not quite sure about the [[ construct there; see the proposed solution:
dt[, list_gp := as.numeric( tstrsplit(Attri2, "_", fixed=TRUE, keep=2)[[1]] )]
dt[, list_len := sapply( l[ list_gp ], length ) ]
Output:
> dt
Attri1 Attri2 list_gp list_len
1: t fghd_1 1 5
2: y sdafsf_3 3 3
3: h ser_1 1 5
4: g fggx_2 2 8
5: d sada_2 2 8
6: e sfesf_3 3 3
7: d asdas_2 2 8
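As an aside, base R's lengths() (available since R 3.2.0) computes all the element lengths in one call, so the sapply() can be replaced with:
dt[, list_len := lengths(l)[list_gp]]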
I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,1,2,2,2,1,1))
I want a result
df_result <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,0,2,2,2,1,0))
The data frame is time-ordered, so I need to keep the same order after replacing the values. I guess that the nth() or duplicated() functions could be useful here, but I'm not sure how to use them. What I'm missing is a function that would count the number of consecutive occurrences of a given value. Once I have it, I could then use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
EDIT:
After an answer by @Gregor, I wrote the following function, which solves the problem:
library(data.table)
library(dplyr)
replace_nth <- function(x, nth, code) {
  y <- data.table(x)
  y[, code_rleid := rleid(Code)]
  y[, seq := seq_along(Code), by = code_rleid]
  y[seq == nth & Code == code, Code := 0]
  drop.cols <- c("code_rleid", "seq")
  y %>% select(-one_of(drop.cols)) %>% data.frame()
}
To get the solution, simply run replace_nth(df, 2, 1)
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after). I'll leave it clear and let you make modifications as you like.
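For instance, a combined one-step version (a sketch, assuming df has already been converted with setDT() as above) could be:
df[df[, .I[seq_along(Code) == 2 & Code == 1], by = rleid(Code)]$V1, Code := 0]
This collects the row numbers (.I) of the second element of every run where Code == 1 and zeroes them in a single assignment, without the helper columns.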
I would like to know if it's possible to name an aggregate variable by a dynamic reference at the time of aggregation in data.table.
Please note that I know I can rename the variable after aggregation by reference and that is not what I'm asking here!
Let's say I've got a data.table DT with three variables var1, var2, and var3.
> DT
var1 var2 var3
1: 1 1 A
2: 3 0 A
3: 2 2 B
4: 1 0 A
5: 0 2 C
I would like to dynamically name the aggregate variable, based on the names stored in a vector OR a string variable
var_string <- c('agg_var1', 'agg_var2')
# the following doesn't work
DT_agg <- DT[, .( (var_string[1]) = sum(var1 + var2)), by = .( (var_string[2]) = var3)]
#this is the output I want
> DT_agg
agg_var2 agg_var1
1: A 6
2: B 4
3: C 2
The code above doesn't work; it gives an error of the sort:
Error: unexpected '=' in "DT_agg <- DT[, .( (var_string[1]) ="
I'm only interested to know if it's possible to do this at the same time as aggregation, rather than renaming the columns afterwards, which i know how to do already.
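For what it's worth, one way to do it at aggregation time is to make j evaluate to a named list, e.g. via setNames(); I believe reasonably recent data.table versions respect names supplied the same way in by, so an untested sketch would be:
var_string <- c('agg_var1', 'agg_var2')
DT_agg <- DT[, setNames(list(sum(var1 + var2)), var_string[1]),
             by = setNames(list(var3), var_string[2])]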
Imagine you want to apply a function row-wise on a data.table. The function's arguments correspond to fixed data.table columns as well as dynamically generated column names.
Is there a way to supply fixed and dynamic column names as argument to a function while using data.tables?
The problems are:
Both variable names and dynamically generated strings must be passed as arguments to a function applied over a data.table
The dynamic column name strings are stored in a vector with > 1 entries (so a plain get() won't work)
The dynamic columns' values need to be supplied as a vector to the function
This illustrates it:
library('data.table')
# Sample dataframe
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3) #fixed and dynamic column names
setkey(D, id)
# Sample function
foo <-function(fix, dynvector){ rep(fix,length(dynvector)) %*% dynvector}
# It does not matter what this function does.
# The result when passing column names not dynamically
D[, "new" := foo(fix,c(dyn1,dyn2)), by=id]
# id fix dyn1 dyn2 new
# 1: 1 1 1 1 2
# 2: 2 2 2 2 8
# 3: 3 3 3 3 18
I want to get rid of the c(dyn1,dyn2): I need to get the column names dyn1 and dyn2 from another vector which holds them as strings.
This is how far I got:
# Now we try it dynamically
cn <-paste("dyn",1:2,sep="") #vector holding column names "dyn1", "dyn2"
# Approaches that don't work
D[, "new" := foo(fix,c(cn)), by=id] #wrong as using a mere string
D[, "new" := foo(fix,c(cn)), by=id, with=F] #does not work
D[, "new" := foo(fix,c(get(cn))), by=id] #uses only the first element "dyn1"
D[, "new" := foo(fix,c(mget(cn, .GlobalEnv, inherits=T))), by=id] #does not work
D[, "new" := foo(fix,c(.SD)), by=id, .SDcols=cn] #does not work
I suppose mget() is the solution, but I know too little about scoping to figure it out.
Thanks! JBJ
Update: Solution
based on the answer by BondedDust
D[, "new" := foo(fix,sapply(cn, function(x) {get(x)})), by=id]
I wasn't able to figure out what you were trying to do with the matrix-multiplication, but this shows how to create new variables with varying and fixed inputs to a function:
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
setkey(D, id)
foo <-function(fix, dynvector){ fix* dynvector}
D[, paste("new",1:2,sep="_") := lapply( c(dyn1,dyn2), foo, fix=fix), by=id]
#----------
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 1
2: 2 2 2 2 4 4
3: 3 3 3 3 9 9
So you need to use a vector of character values to get columns. This is a bit of an extension to this question: Why do I need to wrap `get` in a dummy function within a J `lapply` call?
> D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
> setkey(D, id)
> id1 <- parse(text=cn)
> foo <-function( fix, dynvector){ fix*dynvector}
> D[, paste("new",1:2,sep="_") := lapply( sapply( cn, function(x) {get(x)}) , foo, fix=fix) ]
Warning message:
In `[.data.table`(D, , `:=`(paste("new", 1:2, sep = "_"), lapply(sapply(cn, :
Supplied 2 columns to be assigned a list (length 6) of values (4 unused)
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 2
2: 2 2 2 2 2 4
3: 3 3 3 3 3 6
You could probably use the methods in "create an expression from a function for data.table to eval" as well.
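As a closing note, the mget() idea from the question does work inside j (a hedged sketch, not part of the original answers): mget(cn) looks the columns up in the data.table's scope and returns them as a named list, which unlist() turns into the vector foo() expects.
D[, new := foo(fix, unlist(mget(cn))), by = id]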
I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result: the unique values
unique(dt$a)
### Expected result: the number of unique values
length(unique(dt$a))
However, I wish to calculate the number of unique observations per column for a large data file, so referencing each column by name is not desirable. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudocode for how I would remedy this, the data.frame way:
for (x in names(data)) {
unique.obs <- length(unique(data[, x]))
if (unique.obs == 1) {
data[, x] <- NULL
}
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation on how to drop columns that contain only one unique observation would be even better.
Update: uniqueN
As of version 1.9.6, there is a built-in (optimized) version of this solution: the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table and count the unique rows, or simply use [[ instead (read fortune(312) as well...)
lapply(names(dt), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(dt), function(x) length(unique(dt[[x]])))
will work.
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun:
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names:
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
b = letters[1:10],
d1 = "",
c = rep(1, times = 10),
d2 = "")
dt
     a b d1 c d2
 1:  1 a    1
 2:  2 b    1
 3:  3 c    1
 4:  4 d    1
 5:  5 e    1
 6:  6 f    1
 7:  7 g    1
 8:  8 h    1
 9:  9 i    1
10: 10 j    1
First, I introduce two columns, d1 and d2, that have no values whatsoever. Those are the ones you want to delete, right? If so, I just identify those columns and select all the other columns in the dt.
only_space <- function(x) {
length(unique(x))==1 && x[1]==""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
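To simplify it a bit: on data.table 1.12.0 or later, .SDcols accepts a filter function, so the apply() step can be folded in (a sketch of the same logic as only_space):
dt[, .SD, .SDcols = function(x) !(length(unique(x)) == 1 && x[1] == "")]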
There is an easy way to do this using the dplyr library: use the select function as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like.
Then you will get the data that you want.
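If the goal is specifically the original question's criterion (drop every column with a single distinct value) without naming columns, a hedged dplyr sketch (where() requires dplyr >= 1.0.0):
newdata <- old_data %>% select(where(~ n_distinct(.x) > 1))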
Many thanks,
Fadhah