R: dplyr not recognizing gathered field [duplicate] - r

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.

You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})

This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.

We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)

a dplyr solution using mutate_all and pipes
say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and
convert them to numeric. also, let's say X2014-X2016 are read in as
factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all applies the function(s) inside funs to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs then you create additional, unnecessary columns)

"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?

Using read_delim function, which is part of readr library, you can specify additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*Semicolon in second line means that read_delim will read csv semicolon separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania

If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))

A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7

I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

Related

Is there a way to correct this y-axis glitch? [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
a dplyr solution using mutate_all and pipes
say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and
convert them to numeric. also, let's say X2014-X2016 are read in as
factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all applies the function(s) inside funs to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs then you create additional, unnecessary columns)
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using read_delim function, which is part of readr library, you can specify additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*Semicolon in second line means that read_delim will read csv semicolon separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania
If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))
A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

Dividing one column by another column in a dataset [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
a dplyr solution using mutate_all and pipes
say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and
convert them to numeric. also, let's say X2014-X2016 are read in as
factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all applies the function(s) inside funs to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs then you create additional, unnecessary columns)
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using read_delim function, which is part of readr library, you can specify additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*Semicolon in second line means that read_delim will read csv semicolon separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania
If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))
A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

Change the way R sorts it's values [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
a dplyr solution using mutate_all and pipes
say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and
convert them to numeric. also, let's say X2014-X2016 are read in as
factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all applies the function(s) inside funs to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs then you create additional, unnecessary columns)
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using read_delim function, which is part of readr library, you can specify additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*Semicolon in second line means that read_delim will read csv semicolon separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania
If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))
A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

Dynamically read/process csv thousand separator [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
a dplyr solution using mutate_all and pipes
say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and
convert them to numeric. also, let's say X2014-X2016 are read in as
factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all applies the function(s) inside funs to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs then you create additional, unnecessary columns)
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using read_delim function, which is part of readr library, you can specify additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*Semicolon in second line means that read_delim will read csv semicolon separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania
If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))
A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

R Reading a File with read.csv [duplicate]

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr (character) instead of a numeric.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply
df[2:3] <- lapply(df[2:3], readr::parse_number)
df
# a b c
#1 a 12234 12
#2 b 123 1234123
#3 c 1234 1234
#4 d 13456234 15342
#5 e 12312 12334512
Or use mutate_at from dplyr to apply it to specific variables.
library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)
data
df <- data.frame(a = letters[1:5],
b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
c = c("12", "1,234,123","1234", "15,342", "123,345,12"),
stringsAsFactors = FALSE)
a dplyr solution using mutate_all and pipes
say you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and
convert them to numeric. also, let's say X2014-X2016 are read in as
factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all applies the function(s) inside funs to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs then you create additional, unnecessary columns)
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines on a textConnection. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
Using read_delim function, which is part of readr library, you can specify additional parameter:
locale = locale(decimal_mark = ",")
read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
*Semicolon in second line means that read_delim will read csv semicolon separated values.
This will help to read all numbers with a comma as proper numbers.
Regards
Mateusz Kania
If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))
A very convenient way is readr::read_delim-family. Taking the example from here:
Importing csv with multiple separators into R you can do it as follows:
txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")
Which results in the expected result:
# A tibble: 3 × 6
OBJECTID District_N ZONE_CODE COUNT AREA SUM
<int> <chr> <int> <dbl> <dbl> <dbl>
1 1 Bagamoyo 1 136227 8514187500 352678.8
2 2 Bariadi 2 88350 5521875000 526307.3
3 3 Chunya 3 483059 30191187500 352444.7
I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.
For example, if your file were like this:
"1,234","123","1,234"
"234","123","1,234"
123,456,789
Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2
1234,"123",1234
"234","123",1234
123,456,789
Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

Resources