Regular expression to convert raw text into columns of data - r

I have a raw text output from a program that I want to convert into a DataFrame. The text file is not formatted and is as shown below.
10037 149439Special Event 11538.00 13542.59 2004.59
10070 10071Weekday 8234.00 9244.87 1010.87
10216 13463Weekend 145.00 0 -145.00
I am able to read the data into R using readLines() in the base package. How can I convert this into data that looks like this (column names can be anything).
A B C D E F
10037 149439 Special Event 11538.00 13542.59 2004.59
10070 10071 Weekday 8234.00 9244.87 1010.87
10216 13463 Weekend 145.00 0 -145.00
What regular expression should I use to achieve this? I know that this is ideal for applying a combination of regexec() and regmatches(). But I am unable to come up with an expression that splits the line into the desired components.

Here's a simple solution:
raw <- readLines("filename.txt")
data.frame(do.call(rbind, strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)))
# X1 X2 X3 X4 X5 X6
# 1 10037 149439 Special Event 11538.00 13542.59 2004.59
# 2 10070 10071 Weekday 8234.00 9244.87 1010.87
# 3 10216 13463 Weekend 145.00 0 -145.00
The regular expression " {2,}|(?<=\\d)(?=[A-Z])" consists of two parts, combined with "|" (logical or).
" {2,}" means at least two spaces. This will split between the different columns only, since the text in the third column has a single space.
"(?<=\\d)(?=[A-Z])" denotes the positions that are preceded by a digit and followed by an uppercase letter. This is used to split between the second and the third column.

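For anyone who wants to try the two-part split without a file, here is a minimal self-contained sketch. It assumes, as the answer above does, that the real columns are separated by two or more spaces; the inline `raw` vector and the column names A to F are made up for illustration.

```r
# Sample lines, assuming columns are separated by at least two spaces
raw <- c(
  "10037  149439Special Event  11538.00  13542.59  2004.59",
  "10070  10071Weekday  8234.00  9244.87  1010.87",
  "10216  13463Weekend  145.00  0  -145.00"
)

# Split on runs of 2+ spaces, or at the boundary digit -> uppercase letter
fields <- strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)

# One row per line, one column per field
df <- data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- LETTERS[1:6]

# Everything except the text column C is numeric
num_cols <- setdiff(names(df), "C")
df[num_cols] <- lapply(df[num_cols], as.numeric)
str(df)
```

If the file really has only single spaces between columns, the " {2,}" part will not split anything and a different pattern is needed.
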
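I created "txt.txt" from your data. Then we do some work with regular expressions.
> read <- readLines("txt.txt")
> S <- strsplit(read, "[A-Za-z]|\\s")
> W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
> # col was not defined in the original post; this is one definition that reproduces the output below
> col <- gsub("[^A-Za-z]", "", read)
> D <- data.frame(W[,1:2], col, W[,3:5])
> names(D) <- LETTERS[seq(D)]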
> D
## A B C D E F
## 1 10037 149439 SpecialEvent 11538.00 13542.59 2004.59
## 2 10070 10071 Weekday 8234.00 9244.87 1010.87
## 3 10216 13463 Weekend 145.00 0 -145.00
Toss it all into some curly brackets and you've got yourself a function to parse your files.
PS: If the space between "Special" and "Event" is important, please comment and I'll revise.
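As a rough sketch of the "toss it into a function" idea, something like the following could work. The function name `parse_report` is hypothetical, and it assumes the text column never contains digits, dots or minus signs. Unlike the code above, it keeps the space in "Special Event".

```r
# Hypothetical helper: parse lines like "10037 149439Special Event 11538.00 ..."
parse_report <- function(lines) {
  # Numeric pieces: split on letters and whitespace, drop empty strings
  pieces <- lapply(strsplit(lines, "[A-Za-z]|\\s"), function(x) x[nzchar(x)])
  nums <- do.call(rbind, pieces)

  # Text column: remove digits, dots and minus signs, then trim the ends
  label <- trimws(gsub("[0-9.-]", "", lines))

  D <- data.frame(nums[, 1:2, drop = FALSE], label, nums[, 3:5, drop = FALSE],
                  stringsAsFactors = FALSE)
  names(D) <- LETTERS[seq_along(D)]

  # All columns except the text column (third) are numeric
  D[-3] <- lapply(D[-3], as.numeric)
  D
}

D <- parse_report(c(
  "10037 149439Special Event 11538.00 13542.59 2004.59",
  "10070 10071Weekday 8234.00 9244.87 1010.87",
  "10216 13463Weekend 145.00 0 -145.00"
))
D
```

For a file, `parse_report(readLines("txt.txt"))` would be the equivalent call.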

Something like this at least works on your example but I don't know all your corner cases...
([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)
Capture groups 1 to 6 correspond to your columns A to F, respectively.
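Since the question mentions regexec() and regmatches(), here is a minimal sketch of how this pattern could be applied in R. The inline `lines` vector and the column names are assumptions for illustration; the trimws() call is only a safeguard in case the text group picks up stray spaces.

```r
lines <- c(
  "10037 149439Special Event 11538.00 13542.59 2004.59",
  "10070 10071Weekday 8234.00 9244.87 1010.87",
  "10216 13463Weekend 145.00 0 -145.00"
)

pattern <- "([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)"

# regexec() finds the match and its capture groups; regmatches() extracts them
m <- regmatches(lines, regexec(pattern, lines))

# Drop the first element (the full match) and keep the six groups
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

df <- data.frame(parts, stringsAsFactors = FALSE)
names(df) <- LETTERS[1:6]
df$C <- trimws(df$C)

num_cols <- c("A", "B", "D", "E", "F")
df[num_cols] <- lapply(df[num_cols], as.numeric)
df
```

Lines that do not match the pattern yield an empty match, so real data would need a check for that before the rbind step.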

Related

Is there a way to correct this y-axis glitch? [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
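Here is a small self-contained version of the same idea that can be run without a file. The CSV text and column names are made up for illustration; `read.csv(text = ...)` stands in for reading 'your.file.here'.

```r
library(methods)

# Virtual class used only as a label for colClasses
setClass("num.with.commas")

# Conversion that read.csv applies to columns of that class
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# Made-up sample data: the amount column uses thousand separators
csv_text <- 'id,amount\na,"1,513"\nb,"20,000"\nc,100'

DF <- read.csv(text = csv_text,
               colClasses = c("character", "num.with.commas"))
str(DF)
```

The class name is arbitrary; what matters is that a setAs() method from "character" to that class exists before read.csv is called.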
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
We can also use readr::parse_number; the columns must be character, though. To apply it to multiple columns, we can loop over them with lapply:
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
A dplyr solution using pipes.
Say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove the commas from the year variables X2014-X2016 and
convert them to numeric. Also, let's say X2014-X2016 were read in as
factors (the default in R versions before 4.0).
dft %>%
mutate_at(vars(X2014:X2016), as.character) %>%
mutate_at(vars(X2014:X2016), function(x) gsub(",", "", x)) %>%
mutate_at(vars(X2014:X2016), as.numeric)
mutate_at applies the given function to the columns selected with vars();
mutate_all would instead transform every column, including Bureau.Name.
I did it sequentially, one function at a time (if you pass several
functions at once you create additional, unnecessary columns).
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's also useful to know, though not directly relevant to this question, that commas as decimal separators can be handled by read.csv2 (automagically) or by read.table (by setting the 'dec' parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
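The "preprocess in R" idea above could be sketched end to end like this. The sample text and its header are made up. Note the assumption that a comma sitting directly between two digits is always a thousand separator; two adjacent numeric fields such as 1,2 would be merged by this pattern.

```r
# Made-up sample: the amount column contains unquoted thousand separators
raw_text <- "name,code,amount,note\nwww, rrr, 1,234, ttt\nrrr, zzz, 1,234,567,987, rrr"

# Read the text line by line through a text connection
con <- textConnection(raw_text)
lines <- readLines(con)
close(con)

# Remove only commas that sit between digits
cleaned <- gsub("([0-9]+),([0-9])", "\\1\\2", lines)

# Parse the cleaned lines; strip.white drops the spaces after the separators
DF <- read.csv(text = cleaned, strip.white = TRUE, stringsAsFactors = FALSE)
str(DF)
```

For a real file, `readLines("file.csv")` would replace the textConnection part.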
Using the read_delim function from the readr package, you can specify an additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*The ";" in the second line tells read_delim that the file is semicolon-separated.
This makes numbers that use a comma as the decimal mark parse as proper numbers.
Regards
Mateusz Kania
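If thousands are separated by "." and decimals by "," (e.g. 1.200.000,00), you must set fixed=TRUE when calling gsub so that "." is treated literally rather than as a regex wildcard: gsub(".","",y,fixed=TRUE). The decimal comma then still has to be replaced with "." before calling as.numeric, e.g. as.numeric(gsub(",",".",gsub(".","",y,fixed=TRUE),fixed=TRUE)).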
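A short sketch of that conversion in base R, with made-up sample values:

```r
# Made-up values: "." groups thousands, "," marks decimals
y <- c("1.200.000,00", "3,5", "12.111")

# Without fixed = TRUE the "." is a regex wildcard and every character is removed
wrong <- gsub(".", "", y)

# Step 1: drop the grouping dots literally
no_grouping <- gsub(".", "", y, fixed = TRUE)

# Step 2: turn the decimal comma into a decimal point, then convert
result <- as.numeric(sub(",", ".", no_grouping, fixed = TRUE))
result
```

Here `wrong` ends up as three empty strings, which is why fixed = TRUE matters.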
A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then you could use the regular expression "([0-9]+),([0-9]+)" (the double quotes are part of the pattern, so only quoted numbers match) and replace it with \1\2, which gives:
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
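The same replacement can also be done inside R instead of Notepad++. This is only a sketch with the three sample lines from above; like the Notepad++ pattern, it handles a single comma per quoted number, so a value such as "1,234,567" would be left unchanged.

```r
lines <- c('"1,234","123","1,234"',
           '"234","123","1,234"',
           '123,456,789')

# The quotes are part of the pattern, so unquoted fields are left alone
cleaned <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", lines)
cleaned

# Parse the cleaned lines as CSV without a header
x <- read.csv(text = cleaned, header = FALSE)
x
```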

Dividing one column by another column in a dataset [duplicate]


Change the way R sorts it's values [duplicate]


r extracting list elements from within dataframe

I am working with data where text comments are used to record a change in field contents rather than have an extra record and start/end dates. So the data looks like this:
Study Fob
1 100
2 101 now 102
3 103
Note: test data can be constructed with:
df <- data.frame(Study = 1:3,
Fob = c("100", "101 now 102", "103"),
stringsAsFactors = FALSE)
I want to end up with the following form so I can process it essentially as a many-to-one conversion from Fob signal data to Study IDs:
Study Fob
1 100
2 101
2 102
3 103
I can get rid of the superfluous text with:
df$IDs <- strsplit(df$Fob, "[^0-9]+")
which gets me to:
Study Fob IDs
1 100 100
2 101 now 102 c("101", "102")
3 103 103
but can't get any further. My first thought was to try and replicate the lines with multiple IDs (like 2) using a counter based on the length of the IDs, but adding df$counter <- length(df$IDs) just gets me a column with the value 3 because it is taking the length of the IDs column, not the element within it.
One option is cSplit from library(splitstackshape). We specify the pattern to split on, set fixed=FALSE (the default is fixed=TRUE) so that the pattern is treated as a regular expression, and set the direction to 'long'.
library(splitstackshape)
cSplit(df, 'Fob', '[^0-9]+', fixed=FALSE, 'long')
# Study Fob
#1: 1 100
#2: 2 101
#3: 2 102
#4: 3 103
[^0-9]+ matches one or more characters that are not digits. So, it will split on all non-numeric characters, leaving only the numeric parts. By default, type.convert=TRUE, so we get a numeric column class after the split.
Or instead of using [^0-9]+, a more compact version would be \\D+ to match all non-numeric characters (from @David Arenburg's comments)
cSplit(df, 'Fob', '\\D+', fixed=FALSE, 'long')
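If you would rather stay in base R, the counter idea from the question can be made to work with lengths(), which returns the length of each list element rather than the length of the whole column. A minimal sketch using the test data from the question (the names `ids`, `n_ids` and `long` are just for illustration):

```r
df <- data.frame(Study = 1:3,
                 Fob = c("100", "101 now 102", "103"),
                 stringsAsFactors = FALSE)

# One character vector of IDs per row
ids <- strsplit(df$Fob, "[^0-9]+")

# Number of IDs in each row (1, 2, 1), not the length of the column
n_ids <- lengths(ids)

# Repeat each Study once per ID and flatten the IDs
long <- data.frame(Study = rep(df$Study, times = n_ids),
                   Fob = as.integer(unlist(ids)))
long
```

One caveat: if a Fob value starts with text rather than a digit, strsplit returns a leading empty string that would need to be dropped first.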

How to read data when some numbers contain commas as thousand separator?


Resources