Error when exporting dataframe to text file in R - r

I am trying to write a dataframe in R to a text file, but it returns the following error:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
I used the following command for the export:
write.table(df, file ='dfname.txt', sep='\t' )
I have no idea what the problem could stem from. As far as "missing data where TRUE/FALSE is needed", I have only one column which contains TRUE/FALSE values, and none of these values are missing.
Contents of the dataframe:
> str(df)
'data.frame': 776 obs. of 15 variables:
$ Age : Factor w/ 4 levels "","A","J","SA": 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Rep : Factor w/ 11 levels "L","NR","NRF",..: 1 1 4 4 2 2 2 2 2 2 ...
$ FA : num 61.5 62.5 60.5 61 59.5 59.5 59.1 59.2 59.8 59.9 ...
$ Mass : num 20 19 16.5 17.5 NA 14 NA 23 19 18.5 ...
$ Vir1 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir2 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir3 : num 40 999 999 999 999 999 999 999 999 999 ...
$ Location : Factor w/ 4 levels "Loc1",..: 4 4 4 4 4 4 2 2 2 2 ...
$ Site : Factor w/ 6 levels "A","B","C",..: 5 5 5 5 5 5 3 3 3 3 ...
$ Date : Date, format: "2010-08-30" "2010-08-30" ...
$ Record : int 35 34 39 49 69 38 145 112 125 140 ...
$ SampleID : Factor w/ 776 levels "AT1-A-F1","AT1-A-F10",..: 525 524 527 528 529 526 111 78 88 110 ...
$ Vir1Inc : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Month :'data.frame': 776 obs. of 2 variables:
..$ Dates: Date, format: "2010-08-30" "2010-08-30" ...
..$ Month: Factor w/ 19 levels "Apr-2011","Aug-2010",..: 2 2 2 2 2 2 18 18 18 18 ...
I hope I've given enough/the right information ...
Many thanks,
Heather

An example to reproduce the error. I create a nested data.frame:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
str(dd)
'data.frame': 15 obs. of 2 variables:
$ Age : int 1 2 3 4 5 6 7 8 9 10 ...
$ Month:'data.frame': 15 obs. of 2 variables:
..$ Dates: Date, format: "2003-02-02" "2003-02-03" "2003-02-04" ...
..$ Month: Factor w/ 12 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
Now when I try to save it, I reproduce the error:
write.table(dd)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) : missing value where TRUE/FALSE needed
Without investigating further, one option is to remove the nested data.frame:
write.table(data.frame(subset(dd,select=-c(Month)),unclass(dd$Month)))
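The one-liner above names the nested column explicitly. As a minimal base R sketch of the same idea that does not name it, the snippet below rebuilds dd, flattens it with do.call(data.frame, dd), and writes it to a temporary file. The names flat, out_file and back are only illustrative, and this approach is assumed to handle a single level of nesting.

```r
# Rebuild the nested data.frame from the example above.
Month <- data.frame(Dates = as.Date("2003-02-01") + 1:15,
                    Month = gl(12, 2, 15))
dd <- data.frame(Age = 1:15)
dd$Month <- Month

# data.frame() expands a data.frame argument into separate columns and
# prefixes them with the argument name, so the nesting disappears.
flat <- do.call(data.frame, dd)
names(flat)

# Export to a temporary file (illustrative path) and read it back as a check.
out_file <- tempfile(fileext = ".txt")
write.table(flat, file = out_file, sep = "\t", row.names = FALSE)
back <- read.table(out_file, header = TRUE, sep = "\t")
```

Because every column of flat is now an ordinary vector, write.table no longer hits the nested data.frame.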

The solution by agstudy provides a great quick fix, but there is a simple alternative/general solution for which you do not have to specify the element(s) in your data.frame that was(were) nested:
The following bit is just copied from agstudy's solution to obtain the nested data.frame dd:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
You can use akhilsbehl's LinearizeNestedList() function (which mrdwab made available here) to flatten (or linearize) the nested levels:
library(devtools)
source_gist(4205477) #loads the function
ddf <- LinearizeNestedList(dd, LinearizeDataFrames = TRUE)
# ddf is now a list with two elements (Age and Month)
ddf <- LinearizeNestedList(ddf, LinearizeDataFrames = TRUE)
# ddf is now a list with 3 elements (Age, `Month/Dates` and `Month/Month`)
ddf <- as.data.frame.list(ddf)
# transforms the flattened/linearized list into a data.frame
ddf is now a data.frame without nesting. However, its column names still reflect the nested structure:
names(ddf)
[1] "Age" "Month.Dates" "Month.Month"
If you want to change this (in this case it seems redundant to have Month. written before Dates, for example), you can use gsub with a regular expression that I copied from Sacha Epskamp to remove everything in the column names up to and including the last period:
names(ddf) <- gsub(".*\\.","",names(ddf))
names(ddf)
[1] "Age" "Dates" "Month"
The only thing left now is exporting the data.frame as usual:
write.table(ddf, file="test.txt")

Alternatively, you could use the "flatten" function from the jsonlite package to flatten the dataframe before export. It achieves the same result as the other functions mentioned and is much easier to use.
jsonlite::flatten
https://rdrr.io/cran/jsonlite/man/flatten.html
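A short sketch of how that might look with the dd example from the earlier answers. It assumes the jsonlite package is installed (the guard skips the export otherwise); flat and out_file are illustrative names.

```r
# Rebuild the nested data.frame used in the earlier answers.
Month <- data.frame(Dates = as.Date("2003-02-01") + 1:15,
                    Month = gl(12, 2, 15))
dd <- data.frame(Age = 1:15)
dd$Month <- Month

# Only run the export if jsonlite is available on this machine.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # flatten() turns data.frame columns into ordinary columns.
  flat <- jsonlite::flatten(dd)
  out_file <- tempfile(fileext = ".txt")
  write.table(flat, file = out_file, sep = "\t", row.names = FALSE)
}
```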

Related

unable to write to the csv file [duplicate]


How to automatically convert a numeric column to categorical data using statistical techniques

>data
ACC_ID REG PRBLT OPP_TYPE_DESC PARENT_ID ACCT_NM INDUSTRY_ID BUY PWR REV QTY
11316456 No 90 A 2122628569 INF 7379 10190.82 6500 1
11456476 Yes 1 I 2385888136 Module 9199 17441.72 466.5 31
13453245 No 10 D 2122628087 Wooden 3559 44279.21 2500 500
15674568 No 1 I 2702074521 Nine 7379 183218.8 25.91 1
Above is the given dataset
When I load the same in R, I have the following structure
>str(data)
$ ACC_ID : int 11316974 11620677 11865091 ...
$ REG : Factor w/ 2 levels "No ","Yes ": 1 2 1 1 1 1 1 1 1 1 ...
$ PRBLT : int 90 1 10 1 30 30 10 1 60 1 ...
$ OPP_TYPE_DESC : Factor w/ 3 levels "D",..: 3 2 1 2 1 1 1 3 3 2 ...
$ PARENT_ID : num 2.12e+09 2.39e+09 2.12e+09 2.70e+09 2.12e+09 ...
$ ACCT_NM : Factor w/ 20 levels "Marketing Vertical",..: 10 15 20 17 8 16 2 14 7 11 ...
$ INDUSTRY_ID : int 7379 9199 3559 7379 2711 7374 7371 8742 4813 2111 ..
$ BUY PWR : num 1014791 17442 ...
$ REV : num 6500 46617 250000 25564 20000 ...
$ QTY : int 1 31 500 1 6 100 ...
However, I would like R to automatically treat the fields below as factors instead of int or num (using statistical modelling or any other technique). These are not continuous fields but categorical, nominal ones:
ACC_ID
PARENT_ID
INDUSTRY_ID
Whereas the REV and QTY columns should be left as is.
Also, the analysis should not be specific to the data and the columns shown here. The logic must be applicable to any data-set (with different columns) that we load in R
Can there be any method through which this is possible? Any ideas are welcome.
Thank you

Error while aggregating factor variable- argument is not numeric or logical [closed]

Below is the str of my data set.
'data.frame': 9995 obs. of 10 variables:
$ Count : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gates : Factor w/ 5 levels "B6","B9","I1",..: 3 3 4 4 3 4 4 4 4 4 ...
$ Entry_Date : Date, format: "0006-10-20" "0006-10-20" "0006-10-20" ...
$ Entry_Time : Factor w/ 950 levels "00:01:00","00:04:00",..: 347 366 450 550 563 700 701 350 460 506 ...
$ Exit_Date : Date, format: "0006-10-20" "0006-10-20" "0006-10-20" ...
$ Exit_Time : Factor w/ 1012 levels "00:00:00","00:01:00",..: 618 556 637 694 770 936 948 590 640 655 ...
$ Type_of_entry : Factor w/ 3 levels "Manual","Pass",..: 3 3 3 3 3 3 3 3 3 3 ...
$ weekday : Factor w/ 7 levels "Friday","Monday",..: 2 2 2 2 2 2 2 6 6 6 ...
$ Ticket.Loss: Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 2 1 1 1 ...
$ Duration : Factor w/ 501 levels "00:01:00","00:02:00",..: 223 142 139 96 159 188 199 192 132 101 ...
I am using the function below:
W <- aggregate(Duration ~ Gates, data=parking, FUN=mean)
But I get the warning below:
Warning messages: 1: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
Duration is a factor of strings that look like time durations, "00:01:00", etc.
The chron package works with character strings such as this.
library(chron)
aggregate(chron(times=Duration) ~ Gates, data=parking, FUN=mean)
This will give the average time for each level in Gates.
See also convert character to time in R
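For readers who prefer to avoid a package, here is a minimal base R sketch of the same idea: convert the "HH:MM:SS" strings to numbers before averaging. The toy values and the helper name to_seconds are assumptions made for illustration, not part of the original data.

```r
# Toy data (hypothetical values); Duration is a factor of "HH:MM:SS" strings,
# as in the question.
parking <- data.frame(
  Gates    = c("B6", "B6", "B9"),
  Duration = factor(c("00:10:00", "00:20:00", "01:00:00"))
)

# Hypothetical helper: convert "HH:MM:SS" text to a number of seconds.
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# mean() works once the column is numeric.
parking$DurationSec <- to_seconds(parking$Duration)
avg <- aggregate(DurationSec ~ Gates, data = parking, FUN = mean)
```

Here avg holds the mean duration in seconds per gate; the original warning arose because mean() was applied to a factor.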
If the OP's Duration column does in fact hold times, we can use as.POSIXct to convert it to a date-time class:
parking$Duration <- as.POSIXct(parking$Duration, format = "%H:%M:%S")
transform(aggregate(Duration ~ Gates, data = parking, FUN = mean),
Duration = sub("\\S+\\s+", "", Duration))
# Gates Duration
#1 B6 11:08:34
#2 B9 11:07:31
#3 I1 11:07:10
NOTE: No external packages used.
data
set.seed(24)
parking <- data.frame(Gates = sample(c("B6", "B9", "I1"), 20, replace=TRUE),
Duration = format(seq(Sys.time(), length.out=20, by = "1 min") , "%H:%M:%S"))

How to apply Naive Bayes model to new data

I asked a question on this this morning but am deleting that and posting here with more betterer wording.
I created my first machine learning model using train and test data. I returned a confusion matrix and saw some summary stats.
I would now like to apply the model to new data to make predictions but I don't know how.
Context: Predicting monthly "churn" cancellations. Target variable is "churned" and it has two possible labels "churned" and "not churned".
head(tdata)
months_subscription nvk_medium org_type churned
1 25 none Community not churned
2 7 none Sports clubs not churned
3 28 none Sports clubs not churned
4 18 unknown Religious congregations and communities not churned
5 15 none Association - Professional not churned
6 9 none Association - Professional not churned
Here's me training and testing:
library("klaR")
library("caret")
# import data
test_data_imp <- read.csv("tdata.csv")
# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]
#training
rn_train <- sample(nrow(tdata),
floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)
# testing
predictions <- predict(model, test)
confusionMatrix(test$churned, predictions$class)
Everything up till here works fine.
Now I have new data, structured and laid out the same way as tdata above. How can I apply my model to this new data to make predictions? Intuitively, I was looking for a new column, cbinded on, that holds the predicted class for each record.
I tried this:
## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]
actual_predictions <- predict(model, pdata)
#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)
# output
head(predicted_data)
Which threw errors
actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 3
How can I apply my model to the new data? I'd like a new data frame with a new column that has the predicted class?
Following a comment, here are the head and str of the new data for prediction:
head(pdata)
months_subscription nvk_medium org_type churned
1 26 none Community not churned
2 8 none Sports clubs not churned
3 30 none Sports clubs not churned
4 19 unknown Religious congregations and communities not churned
5 16 none Association - Professional not churned
6 10 none Association - Professional not churned
> str(pdata)
'data.frame': 6433 obs. of 4 variables:
$ months_subscription: int 26 8 30 19 16 10 3 5 14 2 ...
$ nvk_medium : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
$ org_type : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
$ churned : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...
This is most likely caused by a mismatch in the encoding of factors in the training data (variable tdata in your case) and the new data used in the predict function (variable pdata), typically that you have factor levels in the test data that are not present in the training data. Consistency in the encoding of the features must be enforced by you, because the predict function will not check it. Therefore, I suggest that you double-check the levels of the features nvk_medium and org_type in the two variables.
The error message:
Error in object$tables[[v]][, nd] : subscript out of bounds
is raised when evaluating a given feature (the v-th feature) in a data point, in which nd is the numeric value of the factor corresponding to the feature. You also have warnings, indicating that the posterior probabilities for all the cases in data points ("observation") 1, 2, and 3 are all zero, but it is not clear if this is also related to the encoding of the factors...
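The level check suggested above can be sketched in base R as follows. The helper name level_mismatches and the toy data frames are hypothetical; the column names are borrowed from the question only for illustration.

```r
# Toy stand-ins for the training data and the new data (hypothetical values).
train <- data.frame(
  nvk_medium = factor(c("none", "unknown", "none")),
  org_type   = factor(c("Community", "Sports clubs", "Community"))
)
newdata <- data.frame(
  nvk_medium = factor(c("none", "cloned")),
  org_type   = factor(c("Sports clubs", "Community"))
)

# Hypothetical helper: names of factor columns whose levels (values and
# order) differ between the training data and the new data.
level_mismatches <- function(train, newdata) {
  is_fac <- vapply(train, is.factor, logical(1))
  cols <- intersect(names(train)[is_fac], names(newdata))
  same <- vapply(cols, function(col) {
    identical(levels(train[[col]]), levels(newdata[[col]]))
  }, logical(1))
  cols[!same]
}

mismatched <- level_mismatches(train, newdata)
```

Any column reported in mismatched is a candidate for the encoding problem described above.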
To reproduce the error that you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to that in your data:
# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age', 'Sex', 'Title', 'Survived')]
# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]
with:
> str(Data_train)
'data.frame': 656 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
$ Age : num 35 28 34 28 29 28 28 28 45 28 ...
$ Sex : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
$ Title : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...
> str(Data_test)
'data.frame': 657 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : num 47 63 39 58 19 28 50 37 25 39 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...
Then everything goes as expected:
model <- NaiveBayes(Survived ~ ., data = Data_train)
# This will work
pred_1 <- predict(model, Data_test)
> str(pred_1)
List of 2
$ class : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
$ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:657] "6" "7" "8" "9" ...
.. ..$ : chr [1:2] "0" "1"
However, if the encoding is not consistent, e.g.:
# Mess things up, by "displacing" the factor values (i.e., 'Nothing'
# will now be encoded as number 5, which was not present in the
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
as.character(Data_test_2$Title),
levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)
> str(Data_test_2)
'data.frame': 657 obs. of 5 variables:
$ PClass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : num 47 63 39 58 19 28 50 37 25 39 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
$ Title : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
$ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...
then:
> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
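One way to restore a consistent encoding, sketched here on toy vectors rather than the Titanic data, is to rebuild the factor in the new data using the training levels. The names train_title, test_title and fixed are illustrative; note that values absent from the training levels would become NA with this approach.

```r
# Training factor with four levels, as in Data_train$Title.
train_title <- factor(c("Miss", "Mr", "Mrs", "Nothing"))

# New-data factor with the displaced encoding (extra leading level "Dr").
test_title <- factor(c("Mr", "Nothing"),
                     levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing"))
as.integer(test_title)  # integer codes under the displaced levels

# Re-encode using the training levels so the integer codes line up again.
fixed <- factor(as.character(test_title), levels = levels(train_title))
as.integer(fixed)       # integer codes under the training levels
```

Applied to each factor column of the new data before calling predict, this keeps the codes within the range the model's tables were built for.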

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training svm using my traindata. (e1071 package in R). Following is the information about my data.
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as the following.
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing "ticket" predictor from both train and test data. But still same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
For example, below my.test$foo has only 4 levels, whereas my.train$foo has 5:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
That's correct: the train data contains 2 blanks for embarked. Because of this, there is one extra categorical level for the blanks, and you get this error.
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is blank.
I encountered the same problem today. It turned out that svm in the e1071 package expects one sample per row, not per column. If your samples are in columns and your variables in rows, this error will occur.
Probably your data is good (no new levels in test data), and you just need a small trick, then you are fine with prediction.
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick was from R Random Forest - type of predictors in new data do not match.
I encountered this problem today, used the trick above, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one of the things you can do is explicitly include only the columns you feel will add to the model, like so:
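To see what the rbind trick does, here is a small sketch on toy data (the values are hypothetical; the names train.df and test.df follow the snippet above). Binding a training row on top makes the test factor pick up the training levels, and dropping that row afterwards leaves the test rows unchanged.

```r
# Toy training and test sets; the test factor is missing two levels.
train.df <- data.frame(embarked = factor(c("", "C", "Q", "S")),
                       fare     = c(1, 2, 3, 4))
test.df  <- data.frame(embarked = factor(c("S", "C")),
                       fare     = c(5, 6))
levels(test.df$embarked)  # only the levels seen in the test data

# The trick: prepend one training row, then drop it again.
test.df <- rbind(train.df[1, ], test.df)
test.df <- test.df[-1, ]
levels(test.df$embarked)  # now carries the training levels
```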
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by eliminating columns that contribute nothing (like ticket number) which have no relevant data.
Another possible issue, which turned out to be the cause in my code, was that I had forgotten to make some of my independent variables factors.
