I know many posts have already answered similar questions like mine, but I've tried to figure it out for 2 days now and it seems as if I'm not seeing the picture here...
I got this csv file looking like this:
Werteformat: wertabh. (Q)
Werte:
01.01.76 00:00 0,363
02.01.76 00:00 0,464
...
31.12.10 00:00 1,03
01.01.11 00:00 Lücke
I want to create a timeline from the data, but I can't import the CSV properly.
I've tried this so far:
data<-read.csv2(file,
header = FALSE,
sep = ";",
quote="\"",
dec=",",
col.names=c("Datum", "Abfluss"),
skip=2,
nrows=length(strs)-2,
colClasses=c("date","numeric"))
But then I get this error:
"Fehler in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() erwartete 'a real', bekam 'L�cke'"
(in English: scan() expected 'a real', got 'L�cke')
So I deleted the colClasses argument and it works; I got rid of all unwanted rows. But everything comes in as factors, so I use as.numeric:
Abfluss1<-as.numeric(data$Abfluss)
Now I can calculate with Abfluss1, but the values are totally different from those in the original CSV...
Abfluss1
[1] 99 163 250 354 398 773 927 844 796 772 1010 1468 1091 955 962 933 881 844 803 772 773 803 1006 969 834 779 755
[28] 743 739
Where did I go wrong? I would really appreciate some helpful hints.
By the way, the files I'm working on can be downloaded here:
http://ehyd.gv.at/#
Just click on one of these blue-ish triangles and download "Q-Tagesmittel"
First of all, there seems to be a problem with the file encoding. The downloaded file evidently has a Latin-1 encoding that is not recognized correctly, which is why it shows L�cke instead of Lücke. You can fix this with:
encoding = "latin1"
Secondly, your example is not reproducible: the variable strs is not defined anywhere. From what I understand, you want to skip the first 28 lines and leave the last one out, so in total:
nrows = length(readLines(file)) - 29
Finally, you bumped into this common R issue: How to convert a factor to an integer/numeric without loss of information? The entire column is read as a character vector because not all of its elements can be interpreted as numeric, and when a character vector is added to a data.frame it is by default cast to a factor column. Although it is not strictly necessary once you specify the correct range of lines, you can avoid this with:
stringsAsFactors = FALSE
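The pitfall is easy to reproduce with a couple of the comma-decimal values from your file (a toy vector, not the real data):

```r
# Minimal sketch of the factor pitfall: as.numeric() on a factor returns the
# internal level codes, not the printed labels.
x <- factor(c("0,363", "0,464", "1,03"))
as.numeric(x)                               # 1 2 3 -- level codes
as.numeric(sub(",", ".", as.character(x)))  # 0.363 0.464 1.030
```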
So in total:
f <- readLines("Q-Tagesmittel-204586.csv")
df <- read.csv2(
  text = f,
  header = FALSE,
  sep = ";",
  quote = "\"",
  dec = ",",
  skip = 28,
  col.names = c("Datum", "Abfluss"),
  nrows = length(f) - 29,
  encoding = "latin1",
  stringsAsFactors = FALSE
)
Oh, and in case you want to convert the Datum column to a date object as a next step, one way to do it is:
df$Datum <- strptime(df$Datum, "%d.%m.%Y %H:%M:%S")
str(df)
'data.frame': 12784 obs. of 2 variables:
$ Datum : POSIXlt, format: "1976-01-01" "1976-01-02" "1976-01-03" "1976-01-04" ...
$ Abfluss: num 0.691 0.799 0.814 0.813 0.795 0.823 0.828 0.831 0.815 0.829 ...
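An optional refinement, assuming Lücke marks missing values in these files: mapping it to NA via na.strings lets the Abfluss column come in as numeric directly. A toy sketch using lines from your sample:

```r
# Toy lines from the question; "Lücke" is treated as NA, so read.csv2
# (sep = ";", dec = "," by default) parses Abfluss as numeric directly.
f <- c("01.01.76 00:00;0,363", "02.01.76 00:00;0,464", "01.01.11 00:00;Lücke")
df2 <- read.csv2(text = f, header = FALSE,
                 col.names = c("Datum", "Abfluss"),
                 na.strings = "Lücke", stringsAsFactors = FALSE)
df2$Abfluss  # 0.363 0.464 NA
```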
Related
I have a CSV file that is read in as just one variable. I want to split it into 6. I need help.
str(nyt_data)
'data.frame': 3104 obs. of 1 variable:
$ Article_ID.Date.Title.Subject.Topic.Code: Factor w/ 3104 levels "16833;7-Dec-03;Ruse in Toyland: Chinese Workers' Hidden Woe;Chinese Workers Hide Woes for American Inspectors;5",..: 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 ...
nyt_data$Article_ID.Date.Title.Subject.Topic.Code
The result displayed by the above line of code is:
> head(nyt_data$Article_ID.Date.Title.Subject.Topic.Code)
[1] 41246;1-Jan-96;Nation's Smaller Jails Struggle To Cope With Surge in Inmates;Jails overwhelmed with hardened criminals;12
[2] 41257;2-Jan-96;FEDERAL IMPASSE SADDLING STATES WITH INDECISION;Federal budget impasse affect on states;20
[3] 41268;3-Jan-96;Long, Costly Prelude Does Little To Alter Plot of Presidential Race;Contenders for 1996 Presedential elections;20
Please help me with code to split these into 6 separate columns: Article_ID, Date, Title, Subject, Topic, Code.
The data is split with ";" but read.csv defaults to ",". Simply do the following:
df <- read.csv(data, sep = ";")
Just read CSV file with custom sep.
Like this:
data <- read.csv(input_file, sep=';')
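If re-reading the file is not an option, the packed column can also be split after the fact in base R. The rows below are shortened, made-up stand-ins for the question's data; its sample rows contain five fields, so five column names are used here (extend the names vector if the real file has six):

```r
# Toy stand-in for nyt_data: one packed, semicolon-separated column.
nyt_data <- data.frame(
  Article_ID.Date.Title.Subject.Topic.Code = c(
    "41246;1-Jan-96;Nation's Smaller Jails;Jails overwhelmed;12",
    "41257;2-Jan-96;FEDERAL IMPASSE;Federal budget impasse;20"
  )
)
# Split each row on ";" and bind the pieces into a data frame.
parts <- strsplit(as.character(nyt_data[[1]]), ";", fixed = TRUE)
nyt_split <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(nyt_split) <- c("Article_ID", "Date", "Title", "Subject", "Code")
```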
I'm having trouble with a data conversion. I have this data that I get from a .csv file, for instance:
comisiones[2850,28:30]
Periodo.Pago Monto.Pago.Credito Disposicion.En.Efectivo
2850 Mensual 11,503.68 102,713.20
The field Monto.Pago.Credito is a factor, and I need it to be numeric with double precision; I need the decimals.
str(comisiones$Monto.Pago.Credito)
Factor w/ 3205 levels "1,000.00","1,000.01",..: 2476 2197 1373 1905 1348 3002 1252 95 2648 667 ...
So I use the generic data conversion function as.numeric():
comisiones$Monto.Pago.Credito <- as.numeric(comisiones$Monto.Pago.Credito)
But then the observation changes to this:
comisiones[2850,28:30]
Periodo.Pago Monto.Pago.Credito Disposicion.En.Efectivo
2850 Mensual 796 102,713.20
str(comisiones$Monto.Pago.Credito)
num [1:5021] 2476 2197 1373 1905 1348 ...
The max of comisiones$Monto.Pago.Credito should be 11,504.68 but now it is 3205.
I don't know if there is a specific data class or type for decimals in R; I've looked for one, but didn't find it.
You need to clean up the column first: remove the commas (which coerces the factor to character), then convert to numeric:
comisiones$Monto.Pago.Credito <- as.numeric(gsub(",", "", comisiones$Monto.Pago.Credito))
The problem shows up when you convert a factor variable directly to numeric.
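A small demonstration of the difference, using two of the values above:

```r
# Direct conversion yields the factor's level codes; stripping the commas
# first recovers the actual amounts.
m <- factor(c("1,000.00", "11,503.68"))
as.numeric(m)                 # 1 2 -- level codes
as.numeric(gsub(",", "", m))  # 1000.00 11503.68
```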
You can use extract_numeric from the tidyr package - it will handle factor inputs and remove commas, dollar signs, etc.
library(tidyr)
comisiones$Monto.Pago.Credito <- extract_numeric(comisiones$Monto.Pago.Credito)
If the resulting numbers are large, they may not print with decimal places when you view them, whether you used as.numeric or extract_numeric (which itself calls as.numeric). But the precision is still stored. For instance:
> x <- extract_numeric("1,200,000.3444")
> x
[1] 1200000
Verify that precision is still stored:
> format(x, nsmall = 4)
[1] "1200000.3444"
> x > 1200000.3
[1] TRUE
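A side note: in current tidyr releases extract_numeric() is deprecated, and readr::parse_number() is the suggested replacement; it handles the same grouping marks. This sketch assumes the readr package is installed:

```r
library(readr)

# parse_number() drops grouping commas, currency symbols, etc.
parse_number("11,503.68")  # 11503.68
parse_number("$1,200.50")  # 1200.5
```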
I am trying to create a specifically formatted file to use as an input file in another software. I have been able, with the help of people here, to create a file that is almost there. Now I just need to remove some empty lines in my *.txt output file. I have tried several different approaches with gsub() but can't figure out a way. Below is an example that produces a file that shows where I'm stuck.
matsplitter<-function(M, r, c) {
rg <- (row(M)-1)%/%r+1
cg <- (col(M)-1)%/%c+1
rci <- (rg-1)*max(cg) + cg
N <- prod(dim(M))/r/c
cv <- unlist(lapply(1:N, function(x) M[rci==x]))
dim(cv)<-c(r,c,N)
cv}
B <- matrix(c(1:1380),ncol=5)
capture.output(matsplitter(B,3,5), file='output.txt')
write.table(gsub('\\[.*\\]', '',
readLines('output.txt')),
file='output.txt', row.names=FALSE, quote=FALSE)
What I need to further remove are the two blank lines between the ", , 1", ", , 2" etc. string and the matrix of numbers.
x
, , 1
1 277 553 829 1105
2 278 554 830 1106
3 279 555 831 1107
, , 2
4 280 556 832 1108
5 281 557 833 1109
6 282 558 834 1110
, , 3
7 283 559 835 1111
8 284 560 836 1112
9 285 561 837 1113
A possible solution if you are willing to go beyond gsub. I have taken the liberty of breaking the answer up into pieces for clarity (hopefully).
# read in the file created by capture.output()
out = gsub('\\[.*\\]', '', readLines('output.txt'))
If you look at this object out, you will see that blocks seem to be separated by five spaces, and that the first of the two lines you want to get rid of is an empty string "". We remove the runs of spaces with:
out = gsub("\\s{5}","",out)
Now, after the header but in front of every block, there are two empty strings, and after every block there is one empty string. As we only want to exclude the empty strings in front of blocks, we use the function rle() to find runs of repeated elements and exclude those.
#get indicator vector
exclvec = rep(rle(out)$lengths,rle(out)$lengths)
#remove values as indicated by exclvec
out = out[!(out == "" & exclvec == 2)]
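The rle() trick in isolation, on a made-up vector with the same shape as out (a header, two blanks before a block, a single blank after it):

```r
# Elements in runs of length 2 get exclvec == 2; dropping empty strings in
# such runs removes the doubled blanks but keeps the single separators.
out <- c("x", "", "", ", , 1", "1 2 3", "", ", , 2", "4 5 6")
exclvec <- rep(rle(out)$lengths, rle(out)$lengths)
out[!(out == "" & exclvec == 2)]
# "x"  ", , 1"  "1 2 3"  ""  ", , 2"  "4 5 6"
```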
As I interpret your question, writing this object provides the desired result:
write.table(out,file='output.txt', row.names=FALSE, quote=FALSE)
Sorry if the title is confusing. I can import a CSV file into R, but when I try to select a single element by providing the row and column index, I get more than one element. All I want is to use this imported CSV as a data.frame from which I can select any column, row, or single cell. Can anyone give me some suggestions?
Here is the data:
SKU On Off Duration(hr) Sales
C010100100 2/13/2012 4/19/2012 17:00 1601 238
C010930200 5/3/2012 7/29/2012 0:00 2088 3
C011361100 2/13/2012 5/25/2012 22:29 2460 110
C012000204 8/13/2012 11/12/2012 11:00 2195 245
C012000205 8/13/2012 11/12/2012 0:00 2184 331
CODE:
Dat = read.table("Dat.csv", header = TRUE, sep = ",")
Dat[1,][1] #This is close to what I need but is not exactly the same
SKU
1 C010100100
Dat[1,1] # Ideally, I want to have results only with C010100100
[1] C010100100
3861 Levels: B013591100 B024481100 B028710300 B038110800 B038140800 B038170900 B038260200 B038300700 B040580700 B040590200 B040600400 B040970200 ... YB11624Q1100
Thanks!
You can convert to character to get the value as a string, and no longer as a factor:
as.character(Dat[1,1])
You have just one element, but the factor contains all levels.
Alternatively, pass the option stringsAsFactors=FALSE to read.table when you read the file, to prevent creation of factors for character values:
Dat = read.table("Dat.csv",header=1,sep=',', stringsAsFactors=FALSE )
Took me a while to figure this out, so I am answering my own question.
You have some .csv file, you want to load it fast, and you want to use the sqldf package. Your usual code is tripped up by a few annoying fields. Example:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3
This code only works on *nix systems.
library(sqldf)
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
If you try to read it with
x <- read.csv.sql("temp.csv", header=FALSE)
R complains
Error in try({ :
RS-DBI driver: (RS_sqlite_import: ./temp.csv line 2 expected 4 columns of data but found 5)
The sqldf-FAQ.13 solution doesn't work either:
x <- read.csv.sql("temp.csv", filter = "tr -d '\"' ", header=FALSE)
Again, R complains
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
In fact, the filter only removes double quotes.
So, how to proceed?
Perl and regexes to the rescue. Digging through SO, and toying with regexes, it is not too hard to come up with the right one:
s/(\"[^\",]+),([^\"]+\")/$1_$2/g
which matches "...,..." (where the dots are anything but double quotes and commas) and substitutes the comma with an underscore. A Perl one-liner is the right filter to pass to sqldf:
x <- read.csv.sql("temp.csv",
                  filter = "perl -p -e 's/(\"[^\",]+),([^\"]+\")/$1_$2/g'",
                  header = FALSE)
Here is the dataframe x
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
Now, do your own cleanup of the strings ...
EDIT: The regex above only replaces the first occurrence of a comma in each field. To replace all occurrences, use this:
s{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg
What's different?
I replaced the delimiters / with {};
The option e at the very end instructs the parser to interpret the replacement field as Perl code;
The replacement is itself a simple regex substitution that replaces every "," with "_" within the matched substring $&.
An example:
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr,More,Commas\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
The file temp.csv looks like:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr,More,Commas",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3
And can be read with
x <- read.csv.sql("temp.csv",
filter = "perl -p -e 's{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg'",
header=FALSE)
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr_More_Commas" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
For Windows, sqldf now comes with trcomma2dot.vbs, which does this by default with read.csv2.sql, although I found it to be slow for very large data (>1 million rows).
It mentions "tr" for non-Windows systems, but I could not try it.