Related
I have a file that that has over a hundred million rows and scattered throughout there are extra tab delimiters in fields. I need to read the problematic rows into R whilst ignoring the others due to the large file size involved.
Example txt file with extra delimiters in some rows:
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
The first thing that I tried was using the 'readLines' function however whilst I can specify the row to stop on it will still read everything else up to that point which could still be too much
readLines(textConnection(text_file), n = 4)
[1] "My\tname\tis\tAlpha" "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta"
I then realised that I could also use the other dataset import functions if I specified the delimiter to be something that should probably never appear. The "fread" function from the data.table package would be perfect for this as it is the fastest way to deal with large datasets like mine however when I tried it the data was in a format that I couldnt really work with further:
library(data.table)
library(stringi)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 3)
> lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
> invalid_delimiter_rows <- which(stri_count_regex(lines, "\\t") != 3)
Warning message:
In stri_count_regex(lines, "\\t") :
argument is not an atomic vector; coercing
Preferably I shouldnt have to convert this data after importing however when I tried changing this to a character vector or list it was still in a bad format (the concatenation is considered part of the string and not a function). What is the most computing time efficient way that I could approach this issue?
> class(lines)
[1] "data.table" "data.frame"
> as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\")"
Let's replicate the process till fread() import:
# your example string
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
# import
library(data.table)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 5)
lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
4: My\tname\tis\tEcho
When you try
as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
it converts all data.table in character, so each column will be a concatenated vector. See below:
as.character(data.table(lines$V1, lines$V1))
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
[2] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
What you want is extract just lines$V1, which is already a character vector.
lines$V1
[1] "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta" "My\tname\tis\tEcho"
I imported a dataset that unfortunately did not have any separators defined, nor in columns or in rows. I have tried to look for an option to define a specific row separator but could not find one that could be applicable to this situation.
df1 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
{lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
df2 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
{lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
newdf <- rbind(df1,df2)
This is a model of the data that I am currently struggling with. Ideally, the row separators in this case would have to be defined as "},{" and the column separators as ",". I tried subsetting this pattern to a tab and defining a different separator but this either returned an error (tried with separate_rows from TidyR) or simply did nothing.
Hope you guys can help
This looks like incomplete (incorrect) JSON, so I suggest you bring it up-to-spec and then parse it with known tools. Some problems, easily mitigated:
sqk should have a comma-separator, perhaps a copy/paste issue. This might be generalized as any "number-letter" progression depending on your process. (Edit: your update seems to have resolved this issue, so I'll remove it. If you still need it, I recommend you go with a very literal gsub("([^,])sqk:", "\\1,sql:", s).)
Labels (e.g., lat, alt, sql) should all be double-quoted.
Non-numeric data needs to be quoted, specifically the dates.
Exception to 3: null should remain unquoted.
There are multiple "dicts" that need to be within a "list", i.e., from {...},{...} to [{...},{...}].
Side note with your data: I read them in with stringsAsFactors=FALSE, since we don't need factors.
fixjson <- function(s) {
gsub(",+", ",",
paste(
gsub('"sqk":([^,]+)', '"sqk":"\\1"',
gsub("\\s*\\b([A-Za-z]+)\\s*(?=:)", '"\\1"', # note 2
gsub('(?<=:)"(-?[0-9.]+|null)"', "\\1", # notes 3, 4
gsub("(?<=:)([^,]+)\\b", "\"\\1\"", # quote all data
s, perl = TRUE), perl = TRUE), perl = TRUE)),
collapse = "," )
)
}
fixjson(df1$V1)
# [1] "{\"lat\":45.493,\"lng\":-76.4886,\"alt\":22400,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":328,\"spd\":null,\"postime\":\"2019-01-15 16:10:39\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":23425,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"}"
From here, we use a well-defined json parser (from either jsonlite or RJSONIO, both use similar APIs):
jsonlite::fromJSON(paste("[", fixjson(df1$V1), "]", sep=""))
# lat lng alt call icao registration sqk trak spd postime
# 1 45.4930 -76.4886 22400 COFPQ C056P X-VLMP 6232 328 NA 2019-01-15 16:10:39
# 2 45.5049 -76.5285 23425 COFPQ C056P X-VLMP 6232 321 NA 2019-01-15 16:11:50
# 3 45.5049 -76.5285 24000 COFPQ C056P X-VLMP 6232 321 NA 2019-01-15 16:11:50
# 4 45.5049 -76.5285 24000 COFPQ C056P X-VLMP 6232 321 NA 2019-01-15 16:11:50
From here, rbind as needed. (Note that the null literal was translated into R's NA, which is "as it should be" in my opinion.)
Follow-on suggestion: you can use as.POSIXct directly on your postime column; I hope you are certain all of your data are in the same timezone since the field contains no hint.
Lastly, you mentioned something about non-ASCII characters gumming up the works. My recent edit included a little added robustness for spaces introduced from the use of iconv (e.g., the use of \\s*), so the following might suffice for you:
jsonlite::fromJSON( paste("[", fixjson(iconv(df2$V1, "latin1", "ASCII", sub="")), "]") )
(Use of iconv suggested by https://stackoverflow.com/a/9935242/3358272)
As example, I have the following XML code
tt = '<Nummeraanduiding>
<identificatie>0010200000114849</identificatie>
<aanduidingRecordInactief>N</aanduidingRecordInactief>
<aanduidingRecordCorrectie>0</aanduidingRecordCorrectie>
<huisnummer>13</huisnummer>
<officieel>N</officieel>
<postcode>9904PC</postcode>
<tijdvakgeldigheid>
<begindatumTijdvakGeldigheid>2010051100000000</begindatumTijdvakGeldigheid>
</tijdvakgeldigheid>
<inOnderzoek>N</inOnderzoek>
<typeAdresseerbaarObject>Verblijfsobject</typeAdresseerbaarObject>
<bron>
<documentdatum>20100511</documentdatum>
<documentnummer>2010/NR002F</documentnummer>
</bron>
<nummeraanduidingStatus>Naamgeving uitgegeven</nummeraanduidingStatus>
<gerelateerdeOpenbareRuimte>
<identificatie>0010300000000444</identificatie>
</gerelateerdeOpenbareRuimte>
</Nummeraanduiding> '
The goal is to convert this node(Nummeraanduiding) to a data.table (or data.frame is also fine). One challenge is that I have a lot of these Nummeraanduiding nodes (millions of them).
The following code is able to process the data:
library(XML)
# This parses the doc...
doc = xmlParse(tt)
# Solution (1) - this is the most obvious solution..
XML::xmlToDataFrame(doc)
# Solution (2) - apparently converting to a list is also possible..
unlist(xmlToList(doc))
# Solution (3) - My own solution
data.frame(as.list(unlist(xmlToList(doc))))
Not all solutions produce the desired result... In the end only the version of Solution (3) satisfies my needs.
It is in a data.frame/data.table format
It contains all the child-child-nodes and has distinct names for each column
It does not 'merge' the information of child-child-nodes
However, running this piece of code for all my data becomes quite slow. It took 8+ hours to complete it for a file containing 2290000 times the 'Nummeraanduiding'-node.
Do you guys know any way to speed up this process? Can my method be improved? Am I missing some useful function maybe?
Given that each field is already on a separate line just grep them out, read what is left using read.table and convert from long to wide using tapply to produce the resulting matrix (which can be converted to a data frame or data.table if desired). Note that in read.table we bypass quote, comment and class processing. Finally, test it out to see if it is faster. No packages are used.
nms <- c("identificatie", "aanduidingRecordInactief", "aanduidingRecordCorrectie",
"huisnummer", "officieel", "postcode", "tijdvakgeldigheid.begindatumTijdvakGeldigheid",
"inOnderzoek", "typeAdresseerbaarObject", "bron.documentdatum",
"bron.documentnummer", "nummeraanduidingStatus",
"gerelateerdeOpenbareRuimte.identificatie")
rx <- paste(nms, collapse = "|")
g <- chartr("<", ">", grep(rx, readLines(textConnection(tt)), value = TRUE))
long <- read.table(text = g, sep = ">", quote = "", comment.char = "",
colClasses = "character")[2:3]
names(long) <- c("field", "value")
long$field <- factor(long$field, levels = nms) # maintain order of columns
long$recno <- cumsum(long$field == "identificatie")
with(long, tapply(value, list(recno, field), c))
If all records have exactly the same set of fields, such as those listed in nms, then the last line could be replaced with this (which is likely faster):
matrix(long$value, ncol = length(nms), byrow = TRUE, dimnames = list(NULL, nms))
Another alternative to the tapply line would be to use reshape from base R or to use dcast from the reshape2 package.
I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contains white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If i understood you correctly then you want to remove all the white spaces from entire data frame, i guess the code which you are using is good for removing spaces in the column names.I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix however, if you want to change it to data frame then do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and trimws function with both=TRUE can remove leading and trailing spaces but not inside it.Since there was no input data provided by OP, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R) , when we want to remove spaces at every place in the dataframe in character columns (inside of a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply, please note a solution using apply seems to work but would be very slow, also the with the question its apparently not very clear if OP really wanted to remove leading/trailing blank or every blank in the data)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
val val1 num num1
1: abc klm 1 2
2: kl m gdfs 2 3
3: dfsd 123 3 4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
val val1 num num1
1: abc klm 1 2
2: klm gdfs 2 3
3: dfsd 123 3 4
Note the difference between the outputs of both situation, In row number 2: you can see that, with trimws we can remove leading and trailing blanks, but with regex solution we are able to remove every blank(s).
I hope this helps , Thanks
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if(class(x)=="character") trimws(x) else(x)), stringsAsFactors=F)
It works for mixed numeric/charactert dataframes manipulates only character-columns.
You could use trimws function in R 3.2 on all the columns.
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
If you're dealing with large data sets like this, you could really benefit form the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = df[[trimws(j)]])
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such file size. However have 2 options :
Use ffdply and ff base
Use ff and ffbase packages:
library(ff)
library(ffabse)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
apply(myData,2,function(x)gsub('\\s+', '',x))
Use sed (my preference)
sed -ir "s/(\S)\s+(/S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think that a simple approach with sapply, also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces on the numeic column simply convert to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on col N (ause sapply will convert it to character)
dat.b$N<-as.numeric(dat.b$N)
I am using R to do some data pre-processing, and here is the problem that I am faced with: I input the data using read.csv(filename,header=TRUE), and then the space in variable names became ".", for example, a variable named Full Code became Full.Code in the generated dataframe. After the processing, I use write.xlsx(filename) to export the results, while the variable names are changed. How to address this problem?
Besides, in the output .xlsx file, the first column become indices(i.e., 1 to N), which is not what I am expecting.
If your set check.names=FALSE in read.csv when you read the data in then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need quote the column names (back quotes in some cases) or refer to the columns by location rather than name while editing.
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the “.” or any other kind of punctation in column names by using a regex with the stringr package in the way like:
require(“stringr”)
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"