Bigrquery forcefully coerces strings to integers (schema is a string) - r

I'm working with zip codes, which of course have leading zeros. I am correctly loading my dataframe to preserve the leading zeros in R, but the upload step seems to fail. Here's what I mean:
Here's my minimal.csv file:
zip,val
07030,10
10001,100
90210,1000
60602,10000
Here's the R code
require("bigrquery")
filename <- "minimal.csv"
tablename <- "as_STRING"
ds <- bq_dataset(project='myproject', dataset="zips")
I am also correctly setting the type in my schema to expect them as strings.
# first pass
df <- read.csv(filename, stringsAsFactors=F)
# > df
# zip val
# 1 7030 10
# 2 10001 100
# 3 90210 1000
# 4 60602 10000
# uh oh! Let's fix it!
cols <- unlist(lapply(df, class))
cols[[1]] <- "character" # make zipcode a character
# then reload
df2 <- read.csv(filename, stringsAsFactors=F, colClasses=cols)
# > df2
# zip val
# 1 07030 10
# 2 10001 100
# 3 90210 1000
# 4 60602 10000
# much better! You can see my zips are now strings.
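For reference, the two-pass reload above can probably be collapsed into a single read by naming the column in colClasses. This is only a minimal sketch: it inlines the CSV through the text argument instead of reading minimal.csv, and the names csv_text and df3 are made up for illustration.

```r
# Inline copy of minimal.csv so the sketch is self-contained
csv_text <- "zip,val
07030,10
10001,100
90210,1000
60602,10000"

# A named colClasses entry only affects that column; the others are still guessed
df3 <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                colClasses = c(zip = "character"))

str(df3)  # zip is read as character, val as integer
```

This only covers the read step; the upload problem described next is separate.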
However, when I try to upload strings, the bigrquery interface complains that I am uploading integers, which they are not. Here's the schema, expecting strings:
# create schema
bq_table_create(bq_table(ds, tablename), fields=df2) # using df2, which has strings
# now prove it got the strings right:
> bq_table_meta(bq_table(ds, tablename))$schema$fields
[[1]]
[[1]]$name
[1] "zip"
[[1]]$type
[1] "STRING" # GOOD, ZIP IS A STRING!
[[1]]$mode
[1] "NULLABLE"
[[2]]
[[2]]$name
[1] "val"
[[2]]$type
[1] "INTEGER"
[[2]]$mode
[1] "NULLABLE"
Now it's time to upload....
bq_table_upload(bq_table(ds, tablename), df2) # using df2, with STRINGS
Error: Invalid schema update. Field zip has changed type from STRING to INTEGER [invalid]
Huh? What is this invalid schema update, and how can I stop it from trying to change my strings, which the data contains, and the schema is, to integers, which my data does not contain, and which the schema is not?
Is there a javascript serialization that's happening and turning my strings back to integers?

That is because BigQuery will auto-detect the schema when it is not specified. This can be solved by specifying the fields argument, like this (see this similar question for more details):
bq_table_upload(bq_table(ds, tablename), df2,
                fields = list(bq_field("zip", "string"), bq_field("val", "integer")))
UPDATE:
Looking into the code, bq_table_upload calls bq_perform_upload, which takes the fields argument as the schema. In the end, it serializes the data frame to a JSON file and uploads that to BigQuery.

Simply changing:
bq_table_upload(tab, df)
to
bq_table_upload(tab, df, fields=df)
works.

Related

How to see the difference in two strings

I'm trying to find the difference between two columns in a CSV file, which I named Test.
I'd like to add a new column called 'Results' that contains the difference between Events_1 & Events_2. If there is no difference the Results can be blank.
This is a basic example, for what I'm trying to accomplish, the real list contains hundreds of events in both columns.
Not tested with your data, but
vec2 <- c("hello,goodbye","hello,goodbye")
vec1 <- c("hello","hello,goodbye")
Map(setdiff, strsplit(vec2, "[,[:space:]]+"), strsplit(vec1, "[,[:space:]]+"))
# [[1]]
# [1] "goodbye"
# [[2]]
# character(0)
If you need them to be comma-delimited strings, then
mapply(function(a,b) paste(setdiff(a,b), collapse=","), strsplit(vec2, "[,[:space:]]+"), strsplit(vec1, "[,[:space:]]+"))
# [1] "goodbye" ""
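To connect this back to the question, here is a sketch of writing the result into a new Results column. The data frame below is made up (the question's actual data is not shown), and it assumes that "difference" means events present in Events_2 but missing from Events_1, mirroring the answer above.

```r
# Hypothetical stand-in for the CSV data described in the question
Test <- data.frame(
  Events_1 = c("hello", "hello,goodbye"),
  Events_2 = c("hello,goodbye", "hello,goodbye"),
  stringsAsFactors = FALSE
)

# Split each cell on commas (plus optional whitespace), then keep the
# events that appear in Events_2 but not in Events_1
split_events <- function(x) strsplit(x, ",[[:space:]]*")
Test$Results <- mapply(
  function(a, b) paste(setdiff(a, b), collapse = ","),
  split_events(Test$Events_2),
  split_events(Test$Events_1)
)

Test
```

Rows without a difference end up with an empty string in Results, which matches the "can be blank" requirement.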

Read txt file into list where each list element is delimited by row ending with colon

I've got the following .txt structure
test <- "A n/a:
4001
Exam date:
2020-01-01 15:38
Pos (deg):
18.19
18.37"
I'd like to read this into a list, where each list element is given the name of the row ending with a colon, and the values are given by the following rows. (see: expected output).
Challenges
The number of rows (the length of each list element) can differ. There can be special characters (e.g., "A n/a") and there is the date time value which contains a pesky colon.
My problem
My current solution (see below) is unsafe, because I cannot be sure that I have a full list of all expected elements - the file might contain unexpected list elements which I would then not capture, or worse, they would mess up the entire data.
What I tried
I tried reading the txt to JSON with jsonlite::fromJSON, because the structure somehow resembled it, but this gave an error about an unexpected character.
I tried to read into a single string and split, but this leaves me, again, with all values in a single list element:
readr::read_file(test)
strsplit(test, split = ":\n")
My current approach is to read this in with read.csv2 and generate a lookup on the (expected) row names, create a vector for splitting and using the first element of the resulting list for naming.
myfile <- read.csv2(text = test,
header = FALSE)
lu <- paste(c("A n", "date", "Pos"), collapse = "|")
ls_file <- split(myfile$V1, cumsum(grepl(lu, myfile$V1, ignore.case = TRUE)))
names(ls_file) <- unlist(lapply(ls_file, function(x) x[1]))
ls_file <- lapply(ls_file, function(x) x <- x[2:length(x)])
## expected output is a named list
## The spaces and backticks below do not really bother me,
## but I would get rid of them in a next step.
ls_file
#> $`A n/a:`
#> [1] " 4001"
#>
#> $`Exam date:`
#> [1] " 2020-01-01 15:38"
#>
#> $`Pos (deg):`
#> [1] "18.19" "18.37"
Assuming the name of each element ends with :, then we can:
res <- readLines(textConnection(test))
res <- split(res, cumsum(endsWith(res, ':')))
res <- setNames(lapply(res, `[`, -1), sapply(res, `[`, 1))
# > res
# $`A n/a:`
# [1] " 4001"
#
# $`Exam date:`
# [1] " 2020-01-01 15:38"
#
# $`Pos (deg):`
# [1] "18.19" "18.37"
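The question mentions cleaning up the names and the leading spaces in a next step. A possible sketch of that step follows, under the assumption that the values carry leading spaces as in the expected output. It splits the string with strsplit rather than a textConnection, and the names lines and cleaned are made up.

```r
# Sample input; the leading spaces mimic the expected output in the question
test <- "A n/a:\n 4001\nExam date:\n 2020-01-01 15:38\nPos (deg):\n18.19\n18.37"

# Same grouping idea as the answer: a new group starts at each line ending in ":"
lines <- strsplit(test, "\n", fixed = TRUE)[[1]]
res <- split(lines, cumsum(endsWith(lines, ":")))
res <- setNames(lapply(res, `[`, -1), sapply(res, `[`, 1))

# Clean-up: drop the trailing colon from the names, trim whitespace from values
cleaned <- lapply(res, trimws)
names(cleaned) <- sub(":$", "", names(cleaned))

str(cleaned)
```

The date-time value is left alone because its colon is not at the end of the line.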

Removing text containing non-english character

This is my sample dataset:
Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)
I would like to delete the Name containing non-English character. For this sample, only "apple firm" should stay.
I tried to use the tm package, but it can only help me delete the non-english characters instead of the whole queries.
I would check out the related Stack Overflow post "Regular expression to match non-English characters?", which does the same thing in JavaScript.
To translate this into R, you could do (to match non-ASCII):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
The same match can be written with Unicode escapes, per that same SO post:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
Note: we had to exclude the NUL character for this to work. So instead of starting at \u0000 or \x00 we start at \u0001 and \x01.
The stringi package has the convenience function stri_enc_isascii:
library(stringi)
stri_enc_isascii(data$Name)
# [1] TRUE FALSE FALSE
As the name suggests,
the function checks whether all bytes in a string are in the [ASCII] set 1,2,...,127 (from ?stri_enc_isascii).
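The call above only returns a logical vector; the rows still have to be filtered with it. Below is a sketch of that last step which avoids the stringi dependency by using base R's utf8ToInt as a stand-in for the ASCII check. The sample names are written with Unicode escapes so the snippet does not depend on the script's encoding, and is_ascii is a made-up name.

```r
# Same sample data as in the question, written with Unicode escapes
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank, stringsAsFactors = FALSE)

# TRUE when every code point of the string is below 128, i.e. plain ASCII
is_ascii <- vapply(
  data$Name,
  function(s) all(utf8ToInt(s) < 128L),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the rows whose Name is pure ASCII
res <- data[is_ascii, ]
res
```

With stringi installed, is_ascii could presumably be replaced by stri_enc_isascii(data$Name) without changing the subsetting line.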
An alternative to regex would be to use iconv and then filter for non-NA entries:
library(dplyr)
data <- data %>%
mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
filter(!is.na(Name))
What happens in the mutate statement is that the strings are converted from latin1 (also known as ISO 8859-1) to ASCII. When a string contains a character that cannot be represented in ASCII, the conversion fails and the string becomes NA, so filtering out the NA entries leaves only the ASCII names.

R: Make character string refer to an object

I have a large list of files (file1, file2, file3, etc.) and, for each analysis, I want to refer to two files from this list (e.g. function(file1,file2)). When I try to do this using paste0("file", pairs[1,x]) I get back the character string "file1" rather than the object file1.
How can I refer to the objects rather than create a character string?
Thank you very much!
Additional comment:
pairs is a 2xn matrix where each column is the combination of files for one analysis (e.g. pairs[1,1] = 1 and pairs[2,1] = 2 for the comparison between file1 and file2).
Are you looking for get()???
a <- 1:5
> get("a")
[1] 1 2 3 4 5
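Applied to the setup in the question, get() could be combined with paste0() roughly as follows. This is a sketch with made-up data: file1 to file3 are small numeric vectors and compare() is a hypothetical placeholder for the real analysis function.

```r
# Made-up objects standing in for the real files
file1 <- c(1, 2, 3)
file2 <- c(4, 5, 6)
file3 <- c(7, 8, 9)

# 2 x n matrix: each column names the two files of one analysis
pairs <- matrix(c(1, 2,
                  1, 3), nrow = 2)

# Hypothetical analysis function; replace with the real one
compare <- function(a, b) sum(a) - sum(b)

# Build each object name as a string, then look the object up with get()
results <- vapply(seq_len(ncol(pairs)), function(x) {
  a <- get(paste0("file", pairs[1, x]))
  b <- get(paste0("file", pairs[2, x]))
  compare(a, b)
}, numeric(1))

results
```

Keeping the files in a named list and indexing it with the same strings is a common alternative that avoids looking objects up by name in the environment.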
How to get the variable from a string containing the variable name:
> a = 10
> string = "a"
> string
[1] "a"
> eval(parse(text = string))
[1] 10
> eval(parse(text = "a"))
[1] 10
Hope this helps.
Another alternative:
eval(as.name("file1"))

How to find and replace double quotes in R data frame

I have a data frame that looks like this (sorry, I can't replicate the actual data frame with code as the double quotes don't show up. Vx are variables):
V1, V2, V3, V4
home, 15, "grand", terminal,
"give", 32, "cuz", good,
"miles", 5, "before", ten,
yes, 45, "sorry," fine
Question: how I might be able to fix the double quote issue for my entire data frame that I've imported using the read.csv function, where all the double quotes are removed?
What I'm looking for is the excel or word equivalent of FIND + REPLACE: Find the double quote, and replace with nothing.
Notes:
1) I've confirmed it's a data frame by running is.data.frame() function
2) The actual data frame has hundreds of columns, so going through each one and declaring the type of column it is isn't feasible
3) I tried using the following, and it didn't work: as.data.frame(sapply(my_data, function(x) gsub("\"", "", x)))
4) I confirmed that this isn't a simple print issue by testing using sql on the data frame. It won't find columns in double quotes unless I use LIKE instead of =
Thanks in advance!
7/7/15 EDIT 01: as requested by @alexforrence, here is the dput() output for a couple of columns:
billing_first_name billing_last_name billing_company
3 NA
4 Peldi Guilizzoni NA
5 NA
6 "James Andrew" Angus NA
7 NA
8 Nova Spivack NA
Here is a solution using dplyr and stringr. Note that purely numerical columns will be character columns afterwards. It's not clear to me from your description whether there are purely numerical columns. If there are then you'd probably want to treat them separately, or alternatively convert back into numbers afterwards.
require(dplyr)
require(stringr)
df <- data.frame(V1=c("home", "\"give\"", "\"miles\"", "yes"),
V2=c(15, 32, 5, 45),
V3=c("\"grand\"", "\"cuz\"", "\"before\"", "\"sorry\""),
V4=c("terminal", "good", "ten", "fine"))
df
## V1 V2 V3 V4
## 1 home 15 "grand" terminal
## 2 "give" 32 "cuz" good
## 3 "miles" 5 "before" ten
## 4 yes 45 "sorry" fine
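# mutate_each() and funs() are deprecated; across() is the current equivalent (dplyr >= 1.0.0)
df %>% mutate(across(everything(), ~ str_replace_all(.x, "\"", "")))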
## V1 V2 V3 V4
## 1 home 15 grand terminal
## 2 give 32 cuz good
## 3 miles 5 before ten
## 4 yes 45 sorry fine
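If the numeric columns should stay numeric, one option is to touch only the character columns. The following is a base R sketch of that idea rather than the answerer's method; it reuses the example data frame from above and is_chr is a made-up name.

```r
# Same example data as above
df <- data.frame(V1 = c("home", "\"give\"", "\"miles\"", "yes"),
                 V2 = c(15, 32, 5, 45),
                 V3 = c("\"grand\"", "\"cuz\"", "\"before\"", "\"sorry\""),
                 V4 = c("terminal", "good", "ten", "fine"),
                 stringsAsFactors = FALSE)

# Find the character columns and strip literal double quotes only there
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) gsub("\"", "", x, fixed = TRUE))

str(df)  # V2 is still numeric; the quotes are gone from V1 and V3
```

Because only the character columns are rewritten, no conversion back to numbers is needed afterwards.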
You can identify the double quotes using nchar().
a <- ""
nchar(a)==0
[1] TRUE
In addition to the above I ran into a very strange problem. Using the tips I wrote this very short program:
setClass("char.with.deleted.quotes")
setAs("character", "char.with.deleted.quotes",
function(from) as.character(gsub('„',"xxx", as.character(from), fixed = TRUE)))
TMP = read.csv2("./test.csv", header=TRUE, sep=";", dec=",",
colClasses = c("character","char.with.deleted.quotes"))
temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
print(temp)
with the Output:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
which reads the dummy csv:
Number;Name
X-23;This is some „Test
K-33.01;And another „Test
My goal is to get rid of this double quote before the word Test. However this so far does not work. And this is because of this double quote.
If instead I choose to replace a different part of the character it does work with either read.csv2 and the above definition of a class or directly with gsub saving it into the temp variable.
Now what is really strange is the following. After running the program I copied the two lines "temp <- gsub" and "print(temp)" manually into the command line:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
>
> temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(temp)
[1] "This is some xxxTest" "And another xxxTest"
This for whatever reason works and it does also work if I modify the data frame directly:
> TMP$Name <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(TMP)
Number Name
1 X-23 This is some xxxTest
2 K-33.01 And another xxxTest
But if I repeat this command in the program and run it again, it does not work. And I really have no idea why.
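One possible explanation, which the post itself does not confirm, is an encoding mismatch: if source() decodes test.R differently from how the data was read, the literal „ in the script may not match the character in the data, while the same line typed at the console does. A sketch of a workaround under that assumption is to write the character as a Unicode escape (U+201E, the double low-9 quotation mark), so the pattern no longer depends on how the script file is decoded. The vector below is a made-up stand-in for TMP$Name and assumes the data itself was read as UTF-8, for example through the fileEncoding argument of read.csv2.

```r
# Stand-in for TMP$Name, with the low quotation mark written as an escape
name <- c("This is some \u201eTest", "And another \u201eTest")

# The pattern is given as a Unicode escape instead of a literal character
cleaned <- gsub("\u201e", "", name, fixed = TRUE)

print(cleaned)
```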
