I'm trying to read a table from a SQLite database into an R data frame and have hit a problem that has me stumped. Here are the first three entries in the SQLite table I would like to import:
1|10|0|0|0|0|10|10|0|0|0|6|8|6|20000|30000|2012-02-29 21:27:07.239091|2012-02-29 21:28:24.815385|6|80.67.28.161|||||||||||||||||||||||||||||||33|13.4936||t|t|f||||||||||||||||||4|0|0|7|7|2
2|10|0|0|0|0|0|0|0|2|2|4|5|4|20000|30000|2012-02-29 22:00:30.618726|2012-02-29 22:04:09.629942|5|80.67.28.161|3|7||0|1|3|0|||4|3|4|5|5|5|5|4|5|4|4|0|0|0|0|0|9|9|9|9|9|||1|f|t|f|||||||||||||k|text|l|||-13|0|3|10||2
3|13|2|4|4|4|4|1|1|2|5|6|3|2|40000|10000|2012-03-01 09:07:52.310033|2012-03-01 09:21:13.097303|6|80.67.28.161|2|2||30|1|1|0|||4|2|1|6|8|3|5|6|6|7|6|||||||||||26|13.6336|4|f|t|f|t|f|f|f|f|||||||||some text||||10|1|1|3|2|3
What I'm interested in are columns 53 through 60, which, to save you the trouble of counting in the above, look like this:
|t|t|f||||||
|f|t|f||||||
|f|t|f|t|f|f|f|f|
As you can see, for the first two entries only the first three of those columns are not NULL, while for the third entry all eight columns have values assigned to them.
Here's the SQLite table info for those columns
sqlite> PRAGMA table_info(observations);
0|id|INTEGER|1||1
** snip **
53|understanding1|boolean|0||0
54|understanding2|boolean|0||0
55|understanding3|boolean|0||0
56|understanding4|boolean|0||0
57|understanding5|boolean|0||0
58|understanding6|boolean|0||0
59|understanding7|boolean|0||0
60|understanding8|boolean|0||0
** snip **
Now, when I try to read this into R, here's what those same columns end up becoming:
> library('RSQLite')
> con <- dbConnect("SQLite", dbname = 'db.sqlite3')
> obs <- dbReadTable(con,'observations')
> obs[1:3,names(obs) %in% paste0('understanding',1:8)]
understanding1 understanding2 understanding3 understanding4 understanding5 understanding6 understanding7
1 t t f NA NA NA NA
2 f t f NA NA NA NA
3 f t f 0 0 0 0
understanding8
1 NA
2 NA
3 0
As you can see, the first three columns contain values that are either 't' or 'f', but the other columns are NA where the corresponding values in the SQLite table are NULL and 0 where they are not, irrespective of whether the corresponding values in the SQLite table are t or f. Needless to say, this is not the behavior I expected. The problem is, I think, that these columns are typecast incorrectly:
> sapply(obs[1:3,names(obs) %in% paste0('understanding',1:8)], class)
understanding1 understanding2 understanding3 understanding4 understanding5 understanding6 understanding7
"character" "character" "character" "numeric" "numeric" "numeric" "numeric"
understanding8
"numeric"
Could it be that RSQLite sets the first three columns to the character type upon seeing t and f as values in the corresponding columns in the first entry but goes with numeric because in these columns the first entry just happens to be NULL?
If this is indeed what's happening is there any way of working around this and casting all these columns into character (or, even better, logical)?
The following is hacky, but it works:
# first make a copy of the DB and work with it instead of changing
# data in the original
original_file <- "db.sqlite3"
copy_file <- "db_copy.sqlite3"
file.copy(original_file, copy_file) # duplicate the file
con <- dbConnect("SQLite", dbname = copy_file) # establish a connection to the copied DB
# put together a query to replace all NULLs by 'NA' and run it
columns <- c(paste0('understanding',1:15))
columns_query <- paste(paste0(columns,' = IfNull(',columns,",'NA')"),collapse=",")
query <- paste0("UPDATE observations SET ",columns_query)
dbSendQuery(con, query)
# Now that all columns have string values RSQLite will infer the
# column type to be `character`
df <- dbReadTable(con,'observations') # read the table
file.remove(copy_file) # delete the copy
# replace all 'NA' strings with proper NAs
df[names(df) %in% paste0('understanding',1:15)][df[names(df) %in% paste0('understanding',1:15)] == 'NA'] <- NA
# convert 't' to boolean TRUE and 'f' to boolean FALSE
df[ ,names(df) %in% paste0('understanding',1:15)] <- sapply( df[ ,names(df) %in% paste0('understanding',1:15)], function(x) {x=="t"} )
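A variation on the same idea, as a rough sketch only: the NULL replacement can be done in the SELECT itself with IFNULL, so the database file never has to be copied or updated. The in-memory table and the two-column subset below are hypothetical stand-ins for the real `observations` table, and the sketch assumes RSQLite infers `character` once every value in a column is a string, as described above.

```r
library(DBI)

# Hypothetical in-memory stand-in for the real database file
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE observations (
                  id INTEGER, understanding1 boolean, understanding4 boolean)")
dbExecute(con, "INSERT INTO observations VALUES
                  (1, 't', NULL), (2, 'f', NULL), (3, 'f', 't')")

cols <- c("understanding1", "understanding4")
# Builds: IFNULL(understanding1, 'NA') AS understanding1, IFNULL(...) AS ...
select_list <- paste0("IFNULL(", cols, ", 'NA') AS ", cols, collapse = ", ")
obs <- dbGetQuery(con, paste("SELECT id,", select_list, "FROM observations"))
dbDisconnect(con)

# 'NA' placeholder -> NA, 't' -> TRUE, 'f' -> FALSE
obs[cols] <- lapply(obs[cols], function(x) ifelse(x == "NA", NA, x == "t"))
str(obs)
```

Because the placeholder is only produced in the query result, the original table is left untouched.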
I experience some unexpected behavior when using grouped modification of a column in a data.table:
library(data.table)
# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1
# copying data to data_temp
data_temp <- data
# assigning some random value to data_temp so that it should no longer be a
# copy of "data"
data_temp[1, "random_value"] <- rnorm(1)
# converting data_temp to data.table
setDT(data_temp)
# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]
data_temp comes out as expected with only the "C" sequence entries remaining. However, I would also expect the "data" object to remain unchanged. This is not the case. The "data" object looks as follows:
sequence trim random_value
1 A 2 NA
2 A 2 NA
3 B 2 NA
4 B 2 NA
5 B 2 NA
6 C 0 NA
7 C 0 NA
8 C 0 NA
9 D 1 NA
10 D 1 NA
So the assignment by reference of the "trim" variable also happened in the original data.frame.
I am using data.table_1.11.4 and R version 3.4.3 for compatibility reasons.
Is the error a result of using old versions or am I doing something wrong / do I need to change the code to avoid that error?
As @Roland kindly pointed out in his comment on the original question, it's necessary to use the copy() function to explicitly copy objects in data.table. Otherwise data.table won't treat the copied objects as distinct and will modify columns with the same name in both objects. As @Imo checked, only columns that are changed in just one of the two data.frames, and not by reference (e.g. "random_value" in the example), are actually copied / unlinked.
The issue can be easily fixed by using the copy() function:
library(data.table)
# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1
# copying data to data_temp explicitly
data_temp <- copy(data)
# assigning some random value to data_temp so that it should no longer be a
# copy of "data" - if the copy() function isn't used, that just unlinks the
# "random_value" column, but not the others
data_temp[1, "random_value"] <- rnorm(1)
# converting data_temp to data.table
setDT(data_temp)
# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]
So it's necessary to use the copy() function whenever you don't want data.table modifications by reference on the copied table to affect the original table (or vice versa), even if the tables are not (yet) data.table objects at the time you copy them.
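As a smaller illustration of the same point, here is a minimal sketch with a made-up three-row data frame (the names `original` and `work` are hypothetical). It only demonstrates the safe path: after copy(), the by-reference update stays confined to the copy.

```r
library(data.table)

original <- data.frame(g = c("a", "a", "b"), v = c(1, 0, 0))

# copy() makes a deep copy, so nothing done to `work` can reach `original`
work <- copy(original)
setDT(work)                    # converts `work` to a data.table in place
work[, v := sum(v), by = g]    # updates column v of `work` by reference

print(original$v)              # still the values assigned above
print(work$v)                  # per-group sums
```

Replacing `copy(original)` with a plain `original` assignment is the situation described in the question, where the grouped `:=` also showed up in the source data.frame.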
My goal is to collapse columns 14:18 into a single column, "Type". For each observation, the new column should hold the name of whichever of the 5 columns is 1 (only one of them can be 1 per row). This is my best attempt at doing this in R (and I'm beyond frustrated).
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
carSubset <- apply(carSubset[,types],
2,
function(each_obs){
hit_index <- which(each_obs == 1)
carSubset$Type <- types[hit_index]
})
head(carSubset) # output:
1 2 3 4 5
"sedan" "coupe" "convertible" "convertible" "convertible"
Which is what I wanted ... however, I also wanted the rest of my data.frame to come along with it, like I just wanted the new column of "Type" but I cannot even access it with the following line of code...
head(carSubset$Type) # output: Error in carSubset$Type : $ operator is invalid for atomic vectors
Any help on how to Add a new column dynamically while appending previously related data observations to it?
I actually figured it out! Probably not the best way to do it, but hey, it works.
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
head(carSubset[,types])
carSubset[,types]
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
newSubset <- c()
newSubset <- apply(carSubset[,types],
1,
function(obs){
hit_index <- which(obs == 1)
newSubset <- types[hit_index]
})
newSubset
carSubset$Type <- newSubset # assign the vector directly; cbind() would store a one-column matrix
head(carSubset[, !(names(carSubset) %in% types)])
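For comparison, a sketch of a vectorised alternative using base R's max.col(), which returns the index of the largest value in each row. The toy data frame `cars_toy` is hypothetical and stands in for the caret `cars` data; it assumes exactly one indicator column is 1 per row.

```r
# Hypothetical stand-in for the real data: one price column plus 0/1 indicators
cars_toy <- data.frame(
  Price       = c(100, 200, 300),
  convertible = c(0, 1, 0),
  coupe       = c(0, 0, 1),
  sedan       = c(1, 0, 0)
)
types <- c("convertible", "coupe", "sedan")

# For each row, find which indicator column holds the 1 and look up its name
hit_index <- max.col(as.matrix(cars_toy[, types]), ties.method = "first")
cars_toy$Type <- types[hit_index]

# Keep everything except the original indicator columns
cars_toy <- cars_toy[, !(names(cars_toy) %in% types)]
print(cars_toy)
```

The rest of the data frame comes along unchanged because the new column is assigned into it rather than replacing it.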
I'm trying to write a function that turns empty strings into NA. A summary of one of my columns looks like this:
      a   b
 12 210 468
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
# split df into numeric & non-numeric columns
a<-df[,sapply(df, is.numeric), drop = FALSE]
b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]
# Change empty strings to NA
b<-b[lapply(b,function(x) levels(x) <- c(levels(x), NA) ),] # add NA level
b<-b[lapply(b,function(x) x[x=="",]<- NA),] # change Null to NA
# Put the columns back together
d<-cbind(a,b)
d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.
You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won’t work, and would be weird anyway [1]. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
[1] You’ll almost never encounter NULL inside a table since that only works if the underlying vector is a list. You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE because the NULL values are wrapped inside the list.
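Handling the column types separately could look roughly like the sketch below. The data frame and its column names (`f`, `s`, `n`) are made up for illustration; factor columns additionally get their now-unused "" level dropped, while numeric columns are left alone.

```r
# Hypothetical example data: one factor, one character, one numeric column
df <- data.frame(
  f = factor(c("a", "", "b")),
  s = c("x", "", "z"),
  n = c(1, 2, 3),
  stringsAsFactors = FALSE
)

is_fac <- vapply(df, is.factor, logical(1))
is_chr <- vapply(df, is.character, logical(1))

# Factor columns: blank -> NA, then drop the empty level
df[is_fac] <- lapply(df[is_fac], function(col) {
  col[col == ""] <- NA
  droplevels(col)
})

# Character columns: blank -> NA
df[is_chr] <- lapply(df[is_chr], function(col) {
  col[col == ""] <- NA
  col
})

str(df)
```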
This worked for me
df[df == 'NULL'] <- NA
How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.
This is the function I used to solve this issue.
null_na <- function(vector) {
  new_vector <- rep(NA, length(vector))
  for (i in seq_along(vector)) {
    # test is.na() first: comparing NA with "" inside if() raises an error
    if (is.na(vector[i]) || vector[i] == "") {
      new_vector[i] <- NA
    } else {
      new_vector[i] <- vector[i]
    }
  }
  return(new_vector)
}
Just plug in the column or vector you are having an issue with.
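The same replacement can also be written without a loop. This is a minimal sketch for character vectors; the helper name `blank_to_na` is made up for illustration.

```r
# Replace empty strings with NA; existing NAs are left as they are
blank_to_na <- function(x) {
  x[!is.na(x) & x == ""] <- NA
  x
}

v <- c("a", "", NA, "b")
print(blank_to_na(v))
```

The `!is.na(x)` guard keeps the logical index free of NA values, so the assignment behaves the same whether or not the input already contains missing values.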
I place a data frame inside a list. When I then try to extract it, all the column names come back prefixed with the list key for this data frame. Is there a way to extract the data frame exactly as it was originally passed in?
cols<-c("column1", "Column2", "Column3")
df1<-data.frame(matrix(ncol = 3, nrow = 1))
colnames(df1)<-cols
df1
result<-list()
result['df1']<-list(df1)
newdf1<-as.data.frame(result['df1'])
newdf1
Get as a result (column names are prefixed with df1):
> cols<-c("column1", "Column2", "Column3")
> df1<-data.frame(matrix(ncol = 3, nrow = 1))
> colnames(df1)<-cols
> df1
column1 Column2 Column3
1 NA NA NA
>
> result<-list()
> result['df1']<-list(df1)
>
> newdf1<-as.data.frame(result['df1'])
> newdf1
df1.column1 df1.Column2 df1.Column3
1 NA NA NA
Of course, I can remove the prefixes manually, but probably there is a proper way to do this.
Thanks!
Extract using [[ rather than [:
> newdf1 <- as.data.frame(result[['df1']])
> newdf1
column1 Column2 Column3
1 NA NA NA
The difference is that [ extracts a list containing the requested component(s). [[ extracts the requested component directly (i.e. it retrieves the contents of that component of the list, not a list containing that component).
But as df1 already is a data frame, why not just do:
newdf1 <- result[['df1']]
? You don't need the as.data.frame() part.
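To make the difference between the two operators visible, here is a short sketch that rebuilds the example from the question and inspects what each one returns.

```r
df1 <- data.frame(column1 = NA, Column2 = NA, Column3 = NA)
result <- list(df1 = df1)

# [ keeps the list wrapper; [[ returns the element itself
sub_list <- result["df1"]
element  <- result[["df1"]]

print(class(sub_list))   # a list of length one
print(class(element))    # the data frame that was stored

# Converting the one-element list is what adds the "df1." prefix
print(names(as.data.frame(sub_list)))
print(names(element))
```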
I am trying to figure out why the rbind function is not working as intended when joining data.frames without names.
Here is my testing:
test <- data.frame(
id=rep(c("a","b"),each=3),
time=rep(1:3,2),
black=1:6,
white=1:6,
stringsAsFactors=FALSE
)
# take some subsets with different names
pt1 <- test[,c(1,2,3)]
pt2 <- test[,c(1,2,4)]
# method 1 - rename to same names - works
names(pt2) <- names(pt1)
rbind(pt1,pt2)
# method 2 - works - even with duplicate names
names(pt1) <- letters[c(1,1,1)]
names(pt2) <- letters[c(1,1,1)]
rbind(pt1,pt2)
# method 3 - works - with a vector of NA's as names
names(pt1) <- rep(NA,ncol(pt1))
names(pt2) <- rep(NA,ncol(pt2))
rbind(pt1,pt2)
# method 4 - but... does not work without names at all?
pt1 <- unname(pt1)
pt2 <- unname(pt2)
rbind(pt1,pt2)
This seems a bit odd to me. Am I missing a good reason why this shouldn't work out of the box?
edit for additional info
Using @JoshO'Brien's suggestion to debug, I can identify the error as occurring during this if statement part of the rbind.data.frame function
if (is.null(pi) || is.na(jj <- pi[[j]]))
(online version of code here: http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R starting at: "### Here are the methods for rbind and cbind.")
From stepping through the program, the value of pi does not appear to have been set at this point, hence the program tries to index the built-in constant pi like pi[[3]] and errors out.
From what I can figure, the internal pi object doesn't appear to be set due to this earlier line where clabs has been initialized as NULL:
if (is.null(clabs)) clabs <- names(xi) else { #pi gets set here
I am in a tangle trying to figure this out, but will update as it comes together.
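Because unname() and explicitly assigning NA as column headers are not identical actions. unname() removes the names entirely, so names() returns NULL, whereas assigning NA leaves a names vector whose entries are all NA. rbind() takes the names/colnames of the data frames: when the column names are all NA it still has names to work with and an rbind() is possible, but when they are NULL there is nothing to match on and rbind() fails.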
Here is some code to help see what I mean:
> c1 <- c(1,2,3)
> c2 <- c('A','B','C')
> df1 <- data.frame(c1,c2)
> df1
c1 c2
1 1 A
2 2 B
3 3 C
> df2 <- data.frame(c1,c2) # df1 & df2 are identical
>
> #Let's perform unname on one data frame &
> #replacement with NA on the other
>
> unname(df1)
NA NA
1 1 A
2 2 B
3 3 C
> tem1 <- names(unname(df1))
> tem1
NULL
>
> #Please note above that the column headers though showing as NA are null
>
> names(df2) <- rep(NA,ncol(df2))
> df2
NA NA
1 1 A
2 2 B
3 3 C
> tem2 <- names(df2)
> tem2
[1] NA NA
>
> #Though unname(df1) & df2 look identical, they aren't
> #Also note difference in tem1 & tem2
>
> identical(unname(df1),df2)
[1] FALSE
>
I hope this helps. The names show up as NA each, but the two operations are different.
Hence, two data frames with their column headers replaced to NA can be "rbound" but two data frames without any column headers (achieved using unname()) cannot.
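One possible workaround, sketched under the assumption that the unnamed result is what is ultimately wanted: give both data frames the same names first, bind them, and only then strip the names. The data frames `pt1` and `pt2` below are small made-up stand-ins for the ones in the question.

```r
pt1 <- data.frame(id = c("a", "b"), black = 1:2, stringsAsFactors = FALSE)
pt2 <- data.frame(id = c("c", "d"), white = 3:4, stringsAsFactors = FALSE)

# Align the names so rbind() has something to match on
names(pt2) <- names(pt1)
combined <- rbind(pt1, pt2)

# Remove the names afterwards instead of before
combined <- unname(combined)
print(names(combined))
print(dim(combined))
```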