How to append metadata to columns in R?

I am trying to append metadata to columns in R. I have a large data set, and would like to pull up metadata info in R through comments (or is there somewhere more appropriate?). I have the metadata typed into single cells in a separate file, and am trying to refer to them when defining comments.
I have tried the following:
comment(data$cat)=as.quoted(metadata$catdot)
comment(data$cat)
# NULL
In the next attempt, I tried putting quotation marks around the text I wanted to pull in:
comment(data$cat)=as.quoted(metadata$cat)
# Error in parse(text = x) : <text>:1:5: unexpected symbol
# 1: how many
^
Another attempt using quotation marks. Please ignore the silly text; I'm just testing code with a data set and metadata I made up.
comment(data$place)=metadata$placequote
# Error in `comment<-`(`*tmp*`, value = list("whether the location is a shelter or a foster home because that is really important the poor little dogs and cats just don't have anywhere else to go maybe they are blind or starving and it's really just such a shame")) :
# attempt to set invalid 'comment' attribute
comment(data$place)=readChar(metadata$place,nchars=81)
# Error in readChar(metadata$place, nchars = 81) :
# cannot read from this connection
comment(data$place)=paste(readLines(metadata$place), collapse=" ")
# Error in readLines(metadata$place) : 'con' is not a connection
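The error above shows the value arriving as a list, whereas `comment<-` only accepts a character vector (or NULL). The following is a minimal sketch rather than a definitive answer: it assumes the metadata has been read into a one-row data frame whose column names match those of the data set, and both `data` and `metadata` here are made-up stand-ins.

```r
# Hypothetical stand-ins for the data set and the one-row metadata table.
data <- data.frame(cat = c("tabby", "siamese"), place = c("shelter", "foster"))
metadata <- data.frame(
  cat   = "how many cats were counted",
  place = "whether the location is a shelter or a foster home"
)

# `comment<-` needs a character vector, so coerce each metadata cell explicitly.
for (col in names(metadata)) {
  comment(data[[col]]) <- as.character(metadata[[col]])
}

# Comments are not printed with the data; retrieve them with comment().
comment(data$place)
```

The explicit as.character() call also covers the case where the metadata cells were read in as factors.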

Related

Why does read.csv2 work just fine, yet read.csv2.sql shows an error/warning?

I am trying to read a csv file in R using read.csv2.sql, since I would like to use a SELECT query from SQL to help me filter my data, but before I can even get to my SELECT query, I discovered that simply reading my csv file using read.csv2.sql already generates a warning message.
This is my code:
investment2 <- read.csv2.sql("investmentdata.csv")
This is the warning message:
Warning message:
In result_fetch(res@ptr, n = n) :
Column 'Capital.Investment': mixed type, first seen values of type real, coercing other values of type string
However, when I use the normal read.csv2 function, there is no error. In particular, the following code works fine with no warning messages:
investment <- read.csv2("investmentdata.csv")
Next, I tried to resolve this issue by casting the Capital.Investment column to be real as follows:
investment3 <- read.csv2.sql("investmentdata.csv", "SELECT *, CAST(Capital.Investment AS real) FROM file")
However, R now generates the following error:
Error: no such column: Capital.Investment
Thus, I have two questions. Firstly, why does using read.csv2.sql generate that warning message when read.csv2 works just fine? Secondly, why does R (or SQL) not recognise my Capital.Investment column when I try to cast it as real?
Perhaps it is also worth noting that I cannot simply ignore the warning that read.csv2.sql shows, because I discovered that, as a consequence, it has automatically cast some of the NA rows in my Capital.Investment column to 0, which I cannot allow: the NA rows must stay as NA. I do not seem to be having this problem with the other columns of my csv file, though.
As I am quite new to R, any help and explanations will be greatly appreciated :)
Edit
The coded version of what my truncated csv file looks like is as follows. In particular, the name of the column-in-question is indeed Capital.Investment.
id;targetC;year;comp_id;homeC;Industry.Activity;Capital.Investment;Estimated;Jobs.Created;Estimated.1;Project.Type;geographic distance;SIC;listed;sales;assets;cap_structure;rnd;profit;rndintensity;polcon;homeC_gdp;targetC_gdp;homeC_gdppc;targetC_gdppc
1302;AUS;2008;FR338966385;FRA;Design, Development & Testing;33.1;Yes;36;Yes;New;15.26414042;3669;Unlisted;4333088.972;4037211.732;0;NA;-1339221.733;NA;0.489032525;2.92347E+12;1.05456E+12;45413.06571;49628.11513
1311;AUS;2008;US*190521496652;USA;Research & Development;8.4;Yes;30;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1313;AUS;2008;GB05817296;GBR;Business Services;9.7;Yes;10;Yes;New;15.31094496;7389;Unlisted;NA;87.64187374;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
1318;AUS;2008;US129687150L;USA;Business Services;1.3;Yes;225;Yes;New;15.24712914;7373;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1351;AUS;2008;GB*P0060071;GBR;Electricity;516;No;51;Yes;New;15.31094496;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9925;AUS;2008;GB00034121;GBR;Business Services;34.8;Yes;37;Yes;New;15.31094496;4412;Unlisted;NA;2079288.611;0.355157008;NA;94320.15469;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9932;AUS;2008;CA30060NC;CAN;Sales, Marketing & Support;3.2;Yes;11;Yes;New;14.88812529;1094;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.54913E+12;1.05456E+12;46596.33599;49628.11513
9935;AUS;2008;US940890210;USA;Manufacturing;771;Yes;266;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9938;AUS;2008;US770059951;USA;Technical Support Centre;9.1;Yes;104;Yes;Co-Locati;15.24712914;3661;Listed;34922000;53340000;0.120134983;4598000;7333000;0.086201723;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9946;AUS;2008;US010562944;USA;Extraction;535.8;Yes;198;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9955;AUS;2008;DE5030147191;DEU;Logistics, Distribution & Transportation;21.2;Yes;134;Yes;New;14.6718338;4311;Listed;93495971.01;346629334.8;0.036629492;0;2044745.934;0;0.489032525;3.75237E+12;1.05456E+12;45699.19832;49628.11513
9958;AUS;2008;US126012192L;USA;Business Services;9.7;Yes;10;Yes;New;15.24712914;8111;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9969;AUS;2008;US135409005;USA;Extraction;NA;No;538;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9977;AUS;2008;JP000000728JPN;JPN;ICT & Internet Infrastructure;128.6;Yes;77;Yes;New;7.0333688;3571;Listed;53255396.85;38181450.16;0.190244908;2584585.523;480589.4308;0.067692176;0.489032525;5.03791E+12;1.05456E+12;39339.29757;49628.11513
9984;AUS;2008;US841547578;USA;Sales, Marketing & Support;13.6;Yes;23;Yes;New;15.24712914;2095;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9993;AUS;2008;US258715604L;USA;Customer Contact Centre;1.8;No;40;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513

This was worked out in chat and turned out to be two separate issues:
1. The one covered in my original answer below, which was causing the error. Once that is fixed, we see that ...
2. There is a warning that a column (it happens to be the same column) looks numeric but has a non-numeric cell somewhere within the guts of the file.
The first is resolved below; the second is just a warning.
However, because the OP is asking to convert to numeric via SQL, the NA is converted to 0, which is not good. My recommendation is to either cast([Capital.Investment] as char) as [Capital.Investment] and use R's as.numeric to convert to numeric (preserving the NA-nature), or to just read.csv2(.) the file outright and use sqldf(.) to use its SQL querying on table-like data.
Up front: add brackets or quotes around your column name.
Rationale: Capital.Investment is seen as a dot-delimited table-column or schema-table reference, or something similar that is not what you intend. I believe that, in general, SQL field names with embedded dots need this escaping. If your data has an embedded space, realize that R does not like spaces in its field names, so by default it uses make.names when reading the file in (which replaces spaces with dots).
Setup:
Save the following as "quux.csv". (I've named it csv, but since I'm changing it to be ;-delimited, it behaves the same.)
quux;Capital.Investment
1;100
2;200
(Or you can use Capital Investment, it's the same thing.)
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast(Capital.Investment as real) from file')
# Error: no such column: Capital.Investment
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast([Capital.Investment] as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast("Capital.Investment" as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
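To illustrate the two points above without needing sqldf, here is a minimal base-R sketch. It assumes a made-up two-row sample whose header contains a space, and it uses read.csv2 on inline text in place of the file. It shows the header being rewritten with a dot, and a column read as character being converted with as.numeric while keeping its NA.

```r
# Hypothetical two-row sample; the header deliberately contains a space.
csv_text <- "quux;Capital Investment\n1;100\n2;NA"

# read.csv2() runs the header through make.names() by default,
# so the space in the column name becomes a dot.
dat <- read.csv2(text = csv_text, colClasses = "character")
names(dat)

# Converting in R keeps the missing value missing instead of turning it into 0.
dat$Capital.Investment <- as.numeric(dat$Capital.Investment)
dat$Capital.Investment
is.na(dat$Capital.Investment)
```

The same as.numeric() step would apply to a column returned as text from a SQL query.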

Loading CSV with fread stops because of too large string

This is the command I'm using :
dallData <- fread("data.csv", showProgress = TRUE, colClasses = c(rep("NULL", 2), "character", rep("NULL", 37)))
but I get this error when trying to load it: R character strings are limited to 2^31-1 bytes
Is there any way to skip those values?

Here's a strategy that may work or at least narrow down the possible sources of error. It assumes you have enough working memory to hold the data and that your separators are really commas. If you actually have tabs as separators then you will need to modify accordingly. The plan is to read using readLines which will basically ignore the quotes that are probably mismatched. Then figure out which line or lines are at fault using count.fields, table, and which.
input <- readLines("data.csv") # ignores quotes
counts.def <- count.fields(textConnection(input),
sep=",") # defaults quotes are both ' and "
table(counts.def) # might show a variety of line counts.
# Second try with just double-quotes
counts.dbl <- count.fields(textConnection(input),
sep=",", quote="\"") # just dbl-quotes
table(counts.dbl) # if all the same, then all you do is change the quotes argument
Depending on the results, you may need to edit certain lines, which can be identified using which(counts.def < 40), assuming most of the counts are 40, which your input efforts suggest is the expected number of fields per line.
(If the tag for [ram] means you are limited and getting warnings or using virtual memory, which slows things down horribly, then you should restart your OS and load only R before trying again. R needs a contiguous block of memory, and Windoze isn't very good at memory management.)
Here's a small test case to work with:
input <- readLines(textConnection(
"v1,v2,v3,v4,v5,v6
text, text, text, text, text, text
text, text, O'Malley, text,text,text
junk,junk, more junk, \"text\", tex\"t, nothing
3,4,5,6,7,8"))
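The last step of the strategy, locating the offending lines with which, can be sketched on a smaller made-up input. This is an illustration under simplifying assumptions: quoting is switched off entirely with quote = "" so that the apostrophe is treated as ordinary text, and the expected field count of 3 is specific to this toy input.

```r
# Hypothetical input: the third line is one field short,
# and the fourth contains an apostrophe.
input <- c("v1,v2,v3", "1,2,3", "4,5", "x,O'Malley,z")

con <- textConnection(input)
counts <- count.fields(con, sep = ",", quote = "")  # quote = "" disables quoting
close(con)

table(counts)                           # distribution of field counts
expected <- 3
bad_lines <- which(counts != expected)  # line numbers that need inspecting
bad_lines
input[bad_lines]
```

On the real file, the same which() call with the expected count of 40 would give the line numbers to inspect.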

Comparing the MD5 sum of a string to the contents of a file

I am trying to compare a string (in memory) to the contents of a file to see if they are the same. Boring details on motivation are below the question if anyone cares.
My confusion is that when I hash file contents, I get a different result than when I hash the string.
library(readr)
library(digest)
# write the string to the file
the_string <- "here is some stuff"
the_file <- "fake.txt"
readr::write_lines(the_string, the_file)
# both of these functions (predictably) give the same hash
tools::md5sum(the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
digest(file = the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
# now read it back to a string and get something different
back_to_a_string <- readr::read_file(the_file)
# "here is some stuff\n"
digest(back_to_a_string)
# "03ed1c8a2b997277100399bef6f88939"
# add a newline because that's what write_lines did
orig_with_newline <- paste0(the_string, "\n")
# "here is some stuff\n"
digest(orig_with_newline)
# "03ed1c8a2b997277100399bef6f88939"
What I want to do is just digest(orig_with_newline) == digest(file = the_file) to see if they're the same (they are) but that returns FALSE because, as shown, the hashes are different.
Obviously I could either read the file back to a string with read_file or write the string to a temp file, but both of those seem a bit silly and hacky. I guess both of those are actually fine solutions, I really just want to understand why this is happening so that I can better understand how the hashing works.
Boring details on motivation
The situation is that I have a function that will write a string to a file, but if the file already exists then it will error unless the user has explicitly passed .overwrite = TRUE. However, if the file exists, I would like to check whether the string about to be written to the file is in fact the same thing that's already in the file. If this is the case, then I will skip the error (and the write). This code could be called in a loop and it will be obnoxious for the user to continually see this error that they are about to overwrite a file with the same thing that's already in it.

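Short answer: I think you need to set serialize=FALSE. By default, digest() hashes the serialized R object rather than the raw bytes of the string, which is why the two hashes differ. Supposing that the file doesn't contain the extra newline (see below),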
digest(the_string,serialize=FALSE) == digest(file=the_file) ## TRUE
(serialize has no effect on the file= version of the command)
dealing with newlines
If you read ?write_lines, it only says
sep: The line separator ... [information about defaults for different OSes]
To me, this seems ambiguous as to whether the separator will be added after the last line or not. (You don't expect a "comma-separated list" to end with a comma ...)
On the other hand, ?base::writeLines is a little more explicit,
sep: character string. A string to be written to the connection
after each line of text.
If you dig down into the source code of readr you can see that it uses
output << na << sep;
for each line of text, i.e. it's behaving the same way as writeLines.
If you really just want to write the string to the file with no added nonsense, I suggest cat():
identical(the_string, { cat(the_string,file=the_file); readr::read_file(the_file) }) ## TRUE
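The effect of the trailing separator can be checked with base R alone. This is a small sketch in which base writeLines stands in for readr::write_lines (an assumption based on the observation above that the two behave the same way), and the temporary file names are arbitrary.

```r
the_string <- "here is some stuff"
f_lines <- tempfile()
f_cat <- tempfile()

writeLines(the_string, f_lines)  # appends the line separator after the text
cat(the_string, file = f_cat)    # writes the text and nothing else

# The cat() file holds exactly the bytes of the string.
file.size(f_cat) == nchar(the_string, type = "bytes")
# The writeLines() file is longer because of the trailing separator.
file.size(f_lines) > file.size(f_cat)
# Different bytes on disk, so the MD5 sums of the two files differ.
unname(tools::md5sum(f_lines)) == unname(tools::md5sum(f_cat))
```

Any comparison of a string hash against a file hash therefore has to account for that extra separator on one side or the other.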

Error: "argument is not an atomic vector; coercing[1] FALSE"

I'm new to R and am having trouble (1) generalizing previous stack overflow answers to my situation, and (2) understanding R documentation. So I turn to this community and hope someone will walk me through.
I have this code where data1 is a text file:
data1 <- read.delim(file.choose())
pattern <- c("An Error Has Occurred!")
str_detect(data1, regex(pattern, ignore_case = FALSE))
The error message I see is:
argument is not an atomic vector; coercing[1] FALSE
When I use is.vector() to confirm the data type, it looks like it should be fine:
is.vector(pattern)
#this returns [1] TRUE as the output
Reference I used for str_detect function is https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_detect.
Edit 1: Here is the output of data1 - I'm trying to match the 4th to last line "An Error Has Occurred!":
Silk.Road.Forums
<fctr>
*
Welcome, Guest. Please login or register.
[ ] [ ] [Forever] [Login]
Login with username, password and session length
[ ] [Search]
â\200¢ Home
â\200¢ Search
â\200¢ Login
â\200¢ Register
â\200¢ Silk Road Forums
An Error Has Occurred!
The user whose profile you are trying to view does not exist.
Back
â\200¢ SMF | SMF Â© 2013, Simple Machines
Edit 2: After a bit of rudimentary testing, it looks like the issue is with how I opened up data1, not necessarily str_detect().
When I just create a vector, it works:
dataVector <- c("An Error Has Occurred!", "another one")
pattern <- c("An Error Has Occurred!")
str_detect(dataVector, pattern) # returns [1] TRUE FALSE
But when I try to use the function on the file, it doesn't
data1 <- read.delim(file.choose())
pattern <- c("An Error Has Occurred!")
str_detect(data1, pattern) # returns the atomic vector error message
Problem: So I'm convinced that the problem is that (1) I'm using the wrong function or (2) I'm loading up the file wrong for this file type. I've never used text files in R before so I'm a bit lost.
That's all I have and thank you in advance for anyone willing to take a stab at helping!

I think what is going on here is that read.delim is reading in your text file as a data frame rather than a vector, which is what str_detect requires.
For a quick workaround, you can try:
str_detect(data1[,1], "An Error Has Occurred!")
This works because right now data1 is a one-column data frame. data1[,1] returns all rows for the first (and only) column of that data frame as a vector.
However, the underlying problem is that you are using read.delim, which is for delimited text files (for example, a csv file with a separator such as ','), and your data is not delimited. It would be much better to use the function readLines, which returns a character vector.
# open a connection to your file
con <- file('path/to/file.txt',open="r")
# read file contents
data1 <- readLines(con)
# close the connection
close(con)
Then str_detect should work.
str_detect(data1, "An Error Has Occurred!")
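The difference between the two ways of reading the file can be seen in a small base-R sketch. The three-line file is a made-up stand-in for the one in the question, and grepl with fixed = TRUE is used here as base R's analogue of str_detect so that the example runs without stringr.

```r
# Hypothetical stand-in for the text file from the question.
path <- tempfile(fileext = ".txt")
writeLines(c("Silk Road Forums", "An Error Has Occurred!", "Back"), path)

as_frame <- read.delim(path)  # data frame: the first line becomes the header
as_lines <- readLines(path)   # character vector: one element per line

class(as_frame)
class(as_lines)

# fixed = TRUE matches the pattern literally rather than as a regex.
grepl("An Error Has Occurred!", as_lines, fixed = TRUE)
grepl("An Error Has Occurred!", as_frame[, 1], fixed = TRUE)
```

Note that read.delim consumes the first line as the column name, so the data frame has one fewer row than the file has lines.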

Just as.data.frame() your data, then the str_replace() works fine!

List index out of range (searched for answer, none work)

When I run this code:
f = open("test.txt", "r")
xp_levelup_save=f.readlines(3)
xp_levelup_save=[int(i.replace("\n", "")) for i in xp_levelup_save][2]
f.close()
print (xp_levelup_save)
I get the error "List index out of range". If I change readlines(3) to readlines(2) and [2] to [1], it works fine. I'm not sure why this is happening. Can anyone help me find a fix? I've looked at multiple other discussions, but none of them seem to work with this code.
My text document looks like this
1
2
3
4
5
6

f.readlines() doesn't need an argument to retrieve the lines of a document. If you provide the optional argument, as in f.readlines(n), Python treats n as a size hint rather than a line count: it keeps reading whole lines and stops once the total size of the lines read so far exceeds n. Depending on your text file, this could be one, two or even more lines.
To read the text file, just use
xp_levelup_save = f.readlines()
#resume with your code
where each line is stored as a string list element in your variable xp_levelup_save.
Or simply
xp_levelup_save = [int(i.replace("\n", "")) for i in f.readlines()][2]
