Dealing with commas in a CSV file in sqldf

I am following up on my earlier question, sqldf returns zero observations, with a reproducible example.
I found that the problem probably comes from the comma in one of the cells ("1,500+"), and I think I have to use a filter, as suggested in sqldf, csv, and fields containing commas, but I am not sure how to define that filter. Below is the code:
library(sqldf)
df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c(992, "-"))
write.csv(df, 'df_to_read.csv')
# 'df_to_read.csv' looks like this:
# "","a","b","c","d","e","f"
# 1,8600000US01770,1,500+,***,(x),(x),992
# 2,8600000US01937,-,**,(x),(x),-
Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing', file.format = list(eol="\n"))
When I run this code, I get the following error:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: df_to_read.csv line 2 expected 7 columns of data but found 8

The problem comes from the column created by df$b. The first value in that column contains a comma, so the sqldf() file importer treats it as a field separator.
One way to deal with this is to remove the comma or replace it with some other symbol (like a space). You can also use the read.csv2.sql function:
library(sqldf)
df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c("992", "-"))
write.csv(df, 'df_to_read.csv', row.names = FALSE)
Housing_filtered <- read.csv2.sql("df_to_read.csv", sql = "select * from file", header=TRUE)
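Another workaround that avoids SQLite's file importer altogether: write.csv quoted the field containing the comma, and base read.csv honours those quotes, so you can parse the file with read.csv and let sqldf query the resulting data frame in memory. A minimal sketch:
library(sqldf)
# read.csv respects the quotes that write.csv added, so the embedded
# comma in "1,500+" stays inside a single field
Housing <- read.csv("df_to_read.csv", stringsAsFactors = FALSE)
# sqldf can query an in-memory data frame directly
Housing_filtered <- sqldf("SELECT * FROM Housing")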

The best way would be to clean your file once, so that you don't need to worry about the same issue later in your analysis. This should get you going:
Housing <- readLines("df_to_read.csv") # read the file
n <- 6 # number of separators expected = number of columns expected - 1
library(stringr)
ln_idx <- ifelse(str_count(Housing, pattern = ",") == n, 0 , 1)
which(ln_idx == 1) # line indices with an issue (counting the header line as element 1)
#[1] 2
Check the specific issues and write the fixes back to your file at the same indices, e.g. for line 2:
Housing[2]
#[1] "1,8600000US01770,1,500+,***,(x),(x),992" # hmm.. extra comma
Housing[2] = "1,8600000US01770,1500+,***,(x),(x),992" # removed the extra comma
writeLines(Housing, "df_to_read.csv")
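If many lines are flagged, the same fix can be applied in one go instead of editing each line by hand. This is only a sketch, assuming the stray commas are thousands separators in values such as "1,500+" (a digit, a comma, three digits, a plus sign):
bad <- which(ln_idx == 1)
# drop the comma only where it sits inside a "1,500+"-style value
Housing[bad] <- gsub("(\\d),(\\d{3}\\+)", "\\1\\2", Housing[bad])
writeLines(Housing, "df_to_read.csv")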
Now it's business as usual, good to go:
Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing')
# Housing_filtered
# a b c d e f
# 1 8600000US01770 1500+ *** (x) (x) 992
# 2 8600000US01937 - ** (x) (x) -

Related

Extract emails in brackets

I work with gmailr and I need to extract emails from angle brackets <> (sometimes a few in one row), but when there are no brackets (e.g. name@mail.com) I need to keep those elements as they are.
This is an example
x2 <- c("John Smith <jsmith#company.ch> <abrown#company.ch>","no-reply#cdon.com" ,
"<rikke.hc#hotmail.com>")
I need output like:
[1] "jsmith#company.ch" "abrown#company.ch"
[2] "no-reply#cdon.com"
[3] "rikke.hc#hotmail.com"
I tried this, intending to merge the two results:
library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )
My data code sample:
from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)
gmail_DK <- gmail_DK %>%
  mutate(from = unlist(y1)) %>%
  mutate(from = unlist(y2))
but when I use this function to my data (only one day emails) and unlist I get
Error in mutate():
! Problem while computing cc = unlist(cc2).
x cc must be size 103 or 1, not 104.
Run rlang::last_error() to see where the error occurred.
I suppose that with data from more days the difference would be bigger, so I would prefer not to go this way.
An answer in R is preferred, but if you know how to do it in, for example, Power Query, that would be great too.
We may also use base R: split the strings at the spaces that follow a > (strsplit), then capture the substring between the < and > with sub. In the replacement we give the backreference (\\1) to the captured group; [^>]+ matches one or more characters that are not a >.
sub(".*<([^>]+)>", "\\1", unlist(strsplit(x2,
"(?<=>)\\s+", perl = TRUE)))
[1] "jsmith#company.ch" "abrown#company.ch"
[3] "no-reply#cdon.com" "rikke.hc#hotmail.com"
Clunky but OK?
(x2
## split into single words/tokens
%>% strsplit(" ")
%>% unlist()
## find e-mail-like strings, with or without brackets
%>% stringr::str_extract("<?[\\w-.]+@[\\w-.]+>?")
## drop elements with no e-mail component
%>% na.omit()
## strip brackets
%>% stringr::str_remove_all("[<>]")
)
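A base-R sketch that also side-steps the size-mismatch error from mutate(): gregexpr() finds every e-mail-shaped token, and regmatches() returns a list with exactly one element per input string, so the result always lines up with the original rows. The pattern below is an assumption about what the addresses can contain:
m <- gregexpr("[[:alnum:]._-]+@[[:alnum:].-]+", x2)
# one list element per message, each holding zero or more addresses
regmatches(x2, m)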

Using sqldf with an R variable containing an underscore in its name

This code
> A <- data.frame(col1 = c(1,2,3),col2 = c("red","blue","green"))
> color_num <- 2
> fn$sqldf("select * from A where col1 >= '$color_num'")
yields the error
Error in eval(parse(text = paste(..., sep = "")), env) : object
'color' not found
But if the variable color_num is instead given a name with no underscore (say colornum), then executing fn$sqldf("select * from A where col1 >= '$colornum'") yields the expected results with no error.
I believe sqldf is replacing underscores with periods behind the scenes, causing it to treat the component preceding the underscore as a table and the part following as a column name. This answer (and comments) to a question about column names in sqldf indicates that the library at one time replaced dots with underscores but no longer does, but I couldn't find anything about underscores being replaced with dots.
This is an issue since the naming convention I'm using makes heavy use of underscores for variable names.
Is there any way to get variable names with underscores in them working in sqldf queries?
You can use backticks:
fn$sqldf("select * from A where col1 >= `color_num`")
##   col1  col2
## 1    2  blue
## 2    3 green
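The backticks tell fn$ to evaluate their contents as an R expression rather than interpolate a $name, which is why the underscore causes no trouble. As a hedged sketch (same A and color_num as above), an arbitrary expression inside the backticks should work too:
fn$sqldf("select * from A where col1 >= `color_num + 1`")
## returns the single row with col1 == 3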
You can use paste0 to build your SQL string, so that R evaluates color_num to 2 and then pastes everything together into one SQL statement.
library(sqldf)
A <- data.frame(col1 = c(1,2,3),col2 = c("red","blue","green"))
color_num <- 2
fn$sqldf(paste0("select * from A where col1 >=",color_num))
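An equivalent sketch with sprintf, which keeps the query template readable when several values are spliced in (plain sqldf suffices here, since no $-interpolation is involved):
# %s is replaced by the value of color_num before sqldf sees the query
sqldf(sprintf("select * from A where col1 >= %s", color_num))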
If you want to use the $var approach and are having trouble with _, here is a workaround that gives every variable a name with . instead of _. It is probably inefficient, but it works.
color_col <- "blue"
#get the names of all variables in the global env, except your df
nms <- setdiff(ls(), "A")
#build a named list: names use '.' instead of '_', values are taken
#from the original variables
myvars <- lapply(setNames(nms, gsub("_", ".", nms)), get)
#bring them to global env
list2env(myvars,.GlobalEnv)
#run example query
fn$sqldf("select * from A where col1 >= '$color.num' and col2 = '$color.col' ")

Assigning new strings with conditional match

I have an issue with conditionally replacing strings with new ones.
Here is a short version of my real problem. What I have so far works, but I need a better solution, since the real data has many rows.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace strings with replace_strings: the first item in strings is replaced with the first item in replace_strings.
replace_strings <- c("A1","A2","A3","A4")
So the final strings should look like:
final_strings <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
  ifelse(grepl("A33", x), gsub("A33", "A1", x),
  ifelse(grepl("A32", x), gsub("A32", "A2", x),
  ifelse(grepl("A31", x), gsub("A31", "A3", x),
  ifelse(grepl("A30", x), gsub("A30", "A4", x), x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But let's say I have A1000 down to A1 and want to replace them with A1 up to A1000: then I would need a thousand nested ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First obtain all the characters that follow the _ using regexpr, then replace as shown below:
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very tractable and easy to maintain:
for (i in 1:length(strings)) {
  strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr, as paste is a vectorized function.
foo <- function(x, y){
  x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
  y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
  x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings)
Using the dplyr package:
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
             strings = gsub("_.*",
                            paste0("_", replace_strings),
                            strings))
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
yet another way:
mapply(function(x, y) gsub("(\\w\\w_).*", paste0("\\1", y), x),
       strings, replace_strings, USE.NAMES = FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"

Removing rows based on character conditions in a column

Good morning, I have created the following R code:
setwd("xxx")
library(reshape)
##Insert needed year
url <- "./Quarterly/1990_qtrly.csv"
##Writes data in R with applicable columns
qtrly_data <- read.csv(url, header = TRUE, sep = ",", quote="\"", dec=".", na.strings=" ", skip=0)
relevant_cols <- c("area_fips", "industry_code", "own_code", "agglvl_code", "year", "qtr")
overall <- c(relevant_cols, colnames(qtrly_data)[8:16])
lq <- c(relevant_cols, colnames(qtrly_data)[17:25])
oty <- c(relevant_cols, colnames(qtrly_data)[18:42])
types <- c("overall", "lq", "oty")
overallx <- colnames(qtrly_data)[9:16]
lqx <- colnames(qtrly_data)[18:25]
otyx <- colnames(qtrly_data)[seq(27,42,2)]
###Adding in the disclosure codes from each section
disc_codes <- c("disclosure_code", "lq_disclosure_code", "oty_disclosure_code")
cols_list = list(overall, lq, oty)
denom_list = list(overallx, lqx, otyx)
##Uses a two-loop piece of code to go through data denominations and categories, while melting it into the correct format
for (j in 1:length(types))
{
cat("Working on type: " , types[j], "\n")
these_denominations <- denom_list[[j]]
type_data <- qtrly_data[ , cols_list[[j]] ]
QCEW_County <- melt(type_data, id=c(relevant_cols, disc_codes[j]))
colnames(QCEW_County) <- c(relevant_cols, "disclosure_code", "text_denomination", "value")
Data_Cat <- j
for (k in 1:length(these_denominations))
{
cat("Working on type: " , types[j], "and denomination: ", these_denominations[k], "\n")
QCEW_County_Denominated <- QCEW_County[QCEW_County[, "text_denomination"] == these_denominations[k], ]
QCEW_County_Denominated$disclosure_code <- ifelse(QCEW_County_Denominated$disclosure_code == "", 0, 1)
Data_Denom <- k
QCEW_County_Denominated <- cbind(QCEW_County_Denominated, Data_Cat, Data_Denom)
QCEW_County_Denominated$Source_ID <- 1
QCEW_County_Denominated$text_denomination <- NULL
colnames(QCEW_County_Denominated) <- NULL
###Actually writes the txt file to the QCEW folder
write.table(QCEW_County_Denominated, file="C:\\Users\\jjackson\\Downloads\\QCEW\\1990_test.txt", append=TRUE, quote=FALSE, sep=',', row.names=FALSE)
}
}
Now, there are some things I need to get rid of. Namely, I want to drop all the rows in my QCEW_County_Denominated data frame where the area_fips column begins with the character "C". In that same column, there are also codes that start with "US" that I would like to replace with a 0. Finally, the industry_code column in my final data frame has three values that need to be replaced: 31-33 with 31, 44-45 with 44, and 48-49 with 48. I understand that this is a difficult task. I'm slowly figuring it out on my own, but if anyone could give me a helpful nudge in the right direction in the meantime, it would be much appreciated. Conditional statements in R look like they are my Achilles heel; I always get confused by how the syntax differs from other statistical packages.
Thank you, and have a nice day.
You can remove and recode your data using regex and subsetting.
Using grepl, you can select the rows in the column area_fips that DON'T start with C.
QCEW_County_Denominated <- QCEW_County_Denominated[!grepl("^C", QCEW_County_Denominated$area_fips), ]
Using gsub, you can replace with 0 the values in the area_fips column that start with US.
QCEW_County_Denominated$area_fips <- as.numeric(gsub("^US", 0, QCEW_County_Denominated$area_fips))
Finally, using subsetting you can replace the values in the industry_code.
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "31-33"] <- 31
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "44-45"] <- 44
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "48-49"] <- 48
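An equivalent sketch for the recoding step that uses a named lookup vector instead of three separate assignments (assuming industry_code is stored as character); it is easier to extend if more code ranges need collapsing later:
code_map <- c("31-33" = "31", "44-45" = "44", "48-49" = "48")
hits <- QCEW_County_Denominated$industry_code %in% names(code_map)
QCEW_County_Denominated$industry_code[hits] <-
  code_map[QCEW_County_Denominated$industry_code[hits]]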

Read dataset in R in which comma is used for field separator and decimal point

How could you read this dataset into R? The problem is
that the numbers are floats written like 4,000000059604644E-16,
and the fields are also separated by commas:
4,000000059604644E-16 , 7,999997138977056E-16, 9,000002145767216E-16
4,999999403953552E-16 , 6,99999988079071E-16 , 0,099999904632568E-16
9,999997615814208E-16 , 4,30000066757202E-16 , 3,630000114440918E-16
0,69999933242798E-16 , 0,099999904632568E-16, 55,657576767799999E-16
3,999999761581424E-16, 1,9900000095367432E-16, 0,199999809265136E-16
How would you load this kind of dataset in R so that it has 3 columns?
If I do
dataset <- read.csv("C:\\data.txt",header=T,row.names=NULL)
it returns 6 columns instead of 3...
It might be best to transform the input data to use decimal points rather than commas in the floating point numbers. One way you could do this is with sed (it looks like you are using Windows, so you would likely need to install sed to use this approach):
sed 's/\([0-9]\),\([0-9]\)/\1.\2/g' data.txt > data2.txt
File data2.txt looks like this:
4.000000059604644E-16 , 7.999997138977056E-16, 9.000002145767216E-16
4.999999403953552E-16 , 6.99999988079071E-16 , 0.099999904632568E-16
9.999997615814208E-16 , 4.30000066757202E-16 , 3.630000114440918E-16
0.69999933242798E-16 , 0.099999904632568E-16, 55.657576767799999E-16
3.999999761581424E-16, 1.9900000095367432E-16, 0.199999809265136E-16
Then in R:
dataset <- read.csv("data2.txt",row.names=NULL)
Here is an all-R solution that uses three read calls. The first read.table reads each data row as six character fields; the second read.table pastes the fields back together properly and parses them; and a read.csv grabs the column names from the header.
fn <- "data.txt"
# create a test file
Lines <- "A , B , C
4,000000059604644E-16 , 7,999997138977056E-16, 9,000002145767216E-16
4,999999403953552E-16 , 6,99999988079071E-16 , 0,099999904632568E-16
9,999997615814208E-16 , 4,30000066757202E-16 , 3,630000114440918E-16
0,69999933242798E-16 , 0,099999904632568E-16, 55,657576767799999E-16
3,999999761581424E-16, 1,9900000095367432E-16, 0,199999809265136E-16"
cat(Lines, "\n", file = fn)
# now read it back in
DF0 <- read.table(fn, skip = 1, sep = ",", colClasses = "character")
DF <- read.table(
file = textConnection(do.call("sprintf", c("%s.%s %s.%s %s.%s", DF0))),
col.names = names(read.csv(fn, nrow = 0))
)
which gives:
> DF
A B C
1 4.000000e-16 7.999997e-16 9.000002e-16
2 4.999999e-16 7.000000e-16 9.999990e-18
3 9.999998e-16 4.300001e-16 3.630000e-16
4 6.999993e-17 9.999990e-18 5.565758e-15
5 4.000000e-16 1.990000e-16 1.999998e-17
Note: The read.csv statement in the question implies that there is a header but the sample data does not show one. I assumed that there is a header but if not then remove the skip= and col.names= arguments.
It's not pretty, but it should work:
x <- matrix(scan("c:/data.txt", what = character(), sep = ","),
            byrow = TRUE, ncol = 6)
# paste odd and even fields back together as "integer.fraction"
# numbers, one row at a time
y <- t(apply(x, 1, function(a) {
  left <- seq(1, length(a), by = 2)
  as.numeric(paste(a[left], a[left + 1], sep = "."))
}))
