I have a CSV with some metadata in the first 3 rows that I need to skip. But doing so also affects the column names of the value columns.
What can I do to avoid opening every CSV in Excel and deleting these rows manually?
This is how the CSV looks when opened in Excel:
In R, I'm using this command to open it:
android_per <- fread("...\\Todas las adquisiciones de dispositivos de Versión de Android PE.csv",
skip = 3)
And it looks like this:
UPDATE 1:
Similar logic to #G5W's answer, but I think there needs to be a step that squashes the two-row header back into one. E.g.:
txt <- "Some, utter, rubbish,,
Even more rubbish,,,,
,,Col_3,Col_4,Col_5
Col_1,Col_2,,,
1,2,3,4,5
6,7,8,9,0"
## below line writes a file - uncomment if you're happy to do so
##cat(txt, file="testfile.csv", "\n")
header <- apply(read.csv("testfile.csv", nrows=2, skip=2, header=FALSE),
2, paste, collapse="")
read.csv("testfile.csv", skip=4, col.names=header, header=FALSE)
Output:
# Col_1 Col_2 Col_3 Col_4 Col_5
#1 1 2 3 4 5
#2 6 7 8 9 0
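Since the question reads the file with fread(), the same two-step header merge can be done with data.table directly. A sketch along the same lines, using the same hypothetical testfile.csv (untested, but the arguments mirror the read.csv version above):

```r
library(data.table)

## paste the two header rows (file lines 3-4) together, column by column
hdr <- fread("testfile.csv", skip = 2, nrows = 2, header = FALSE)
header <- apply(hdr, 2, paste, collapse = "")

## read the data rows and attach the merged names
android_per <- fread("testfile.csv", skip = 4, header = FALSE)
setnames(android_per, header)
```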
Here is one way to do it. Read the file simply as lines of text. Eliminate the lines that you don't want, then read the remaining good part into a data.frame.
Sample csv file (I saved it as "Temp/Temp.csv")
Col_1,Col_2,Col_3,Col_4,Col_5
Some utter rubbish,,,,
Presumably documentation,,,,
1,2,3,4,5
6,7,8,9,0
Code
CSV_Lines = readLines("Temp/Temp.csv")
CSV_Lines = CSV_Lines[-(2:3)]
DF = read.csv(text=CSV_Lines)
Col_1 Col_2 Col_3 Col_4 Col_5
1 1 2 3 4 5
2 6 7 8 9 0
It skipped the unwanted lines and got the column names.
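If you prefer fread() (as in the question), the same approach works there too: newer data.table versions (>= 1.11.0) accept the filtered lines via the text argument, so no intermediate file is needed:

```r
library(data.table)

CSV_Lines <- readLines("Temp/Temp.csv")
## drop the two metadata lines; line 1 still carries the header
DT <- fread(text = CSV_Lines[-(2:3)])
```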
If you use skip = 3, you definitely lose the column names, with no way to get them back in R. An ugly hack is to use skip = 2, which makes sure that all columns except the first 2 get the correct names.
df <- read.csv('csv_name.csv', skip = 2, header = TRUE)
The headers of the first 2 columns are in the first row so you can do
names(df)[1:2] <- df[1, 1:2]
You probably then need to shift all the rows one step up to get the data frame as intended.
If you read with header = FALSE instead, you can use the code below:
df <- fread("~/Book1.csv", header = FALSE, skip = 2)
## the first two column names sit one row below the rest:
## copy them up, then drop the now-redundant row
df[1, 1] <- df[2, 1]
df[1, 2] <- df[2, 2]
df <- df[-2, ]
## promote the completed first row to column names
names(df) <- as.character(df[1, ])
df <- df[-1, ]
I have code that runs through a long list of URLs to make tables of air quality data. However, I keep getting a parsing error that I don't understand.
Here are my libraries:
library(tidyverse)
library(lubridate)
library(chron)
library(lemon)
The code that generates the URLs, and then creates the tables looks like this:
i <- 1
tempList <- list()
for (p in Pollutants) {
  for (s in Stations) {
    for (y in Years) {
      # Build URL
      url_download <- paste("http://www.airqualityontario.com/history/index.php?s=", s,
                            "&y=", y, "&p=", p, "&m=", mS, "&e=", mE,
                            "&t=csv&submitter=Search&i=1", sep = "")
      print(url_download)
      # Read in file
      if (p == 124) {
        MOE_download <- read_csv(file = url_download, skip = 15, na = c("9999", "-999"))
      } else {
        MOE_download <- read_csv(file = url_download, skip = 14, na = c("9999", "-999"))
      }
    }
  }
}
However, I am getting an odd parsing error that I don't understand. It looks like this:
-- Column specification -------------------------------------------------------------------------------------------------------
cols(
`Chatham (13001)` = col_character()
)
Warning: 379 parsing failures.
row col expected actual file
2 -- 1 columns 2 columns 'http://www.airqualityontario.com/history/index.php?s=13001&y=2020&p=124&m=1&e=12&t=csv&submitter=Search&i=1'
3 -- 1 columns 2 columns 'http://www.airqualityontario.com/history/index.php?s=13001&y=2020&p=124&m=1&e=12&t=csv&submitter=Search&i=1'
4 -- 1 columns 2 columns 'http://www.airqualityontario.com/history/index.php?s=13001&y=2020&p=124&m=1&e=12&t=csv&submitter=Search&i=1'
5 -- 1 columns 2 columns 'http://www.airqualityontario.com/history/index.php?s=13001&y=2020&p=124&m=1&e=12&t=csv&submitter=Search&i=1'
6 -- 1 columns 2 columns 'http://www.airqualityontario.com/history/index.php?s=13001&y=2020&p=124&m=1&e=12&t=csv&submitter=Search&i=1'
It says it got 2 columns but expected 1, which doesn't make sense to me, because when I check the size of the data frame it looks like this:
dim(MOE_download_test)
[1] 384 1
If anyone could help me with this question I would really appreciate some guidance.
I have this problem, but in R:
How can I read a double-semicolon-separated .csv with quoted values using pandas?
The solution there is to drop the additional columns generated. I'd like to know if there's a way to read a file separated by ;; without generating those additional columns.
Thanks!
Read it in normally using read.csv2 (or whichever variant you prefer, including read.table, read.delim, readr::read_csv2, data.table::fread, etc), and then remove the even-numbered columns.
dat <- read.csv2(text = "a;;b;;c;;d\n1;;2;;3;;4")
dat
# a X b X.1 c X.2 d
# 1 1 NA 2 NA 3 NA 4
dat[,-seq(2, ncol(dat), by = 2)]
# a b c d
# 1 1 2 3 4
It is usually recommended to properly clean your data before attempting to parse it, instead of cleaning it WHILE parsing, or worse, AFTER. Either use Notepad++ to replace all ;; occurrences, or R itself, but do not delete the original files (also a rule of thumb: never delete sources of data).
my.text <- readLines('d:/tmp/readdelim-r.csv')
cleaned <- gsub(';;', ';', my.text)
writeLines(cleaned, 'd:/tmp/cleaned.csv')
my.cleaned <- read.delim('d:/tmp/cleaned.csv', header=FALSE, sep=';')
I want to import a tsv file including some non-numeric fields (i.e., date or string) in R:
num1 num2 date
1 2 2012-10-18 12:17:19
2 4 2014-11-16 09:30:23
4 11 2010-03-18 22:18:04
12 3 2015-02-18 12:55:50
13 1 2014-05-16 10:39:11
2 14 2011-05-26 20:48:54
I am using the following command:
a = read.csv("C:\\test\\testFile.tsv", sep="\t")
I want to ignore all non-numeric values automatically (or replace them with something like NA), and I don't want to list all the string column names to be ignored.
I tried "stringsAsFactors" and "as.is" parameters, with no success.
Any ideas?
You have quite a few options here.
First, you can inform R while reading the table:
data <- read.csv("C:\\test\\testFile.tsv",
                 sep = "\t",
                 colClasses = c(NA, NA, "NULL"))
If you have many nonnumeric columns, say 10, you can use rep as colClasses=c(NA, NA, rep("NULL", 10)).
Second, you can read everything and process deletion afterwards (note the stringsAsFactors):
data <- read.csv("C:\\test\\testFile.tsv",
                 sep = "\t", stringsAsFactors = FALSE)
You can then drop every column that is identified as character:
df[, !sapply(df, is.character)]
Or apply a destructive method to your data.frame:
df[sapply(df, is.character)] <- list(NULL)
You can go further to make sure only numeric columns are left:
df[grep("Date|factor|character", sapply(df, class))] <- list(NULL)
Just found this solution:
a = read.csv("C:\\test\\testFile.tsv", sep="\t", colClasses=c(NA, NA, "NULL"))
It is not completely automatic though.
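A more automatic variant (a sketch, assuming the same hypothetical path): read every column as character first, then coerce to numeric, so any non-numeric entry simply becomes NA without naming any columns:

```r
a <- read.csv("C:\\test\\testFile.tsv", sep = "\t",
              colClasses = "character")
## non-numeric entries (dates, strings) turn into NA; the coercion
## warnings are suppressed deliberately
a[] <- lapply(a, function(x) suppressWarnings(as.numeric(x)))
```

Note this keeps the string columns (as all-NA) rather than dropping them; if you want them gone afterwards, subset with something like a[colSums(!is.na(a)) > 0].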
I've looked at several of the functions that can add text to an existing data file (.csv or .txt), such as write.table, writeLines, or sink.
When append = TRUE, the new data is always added after the last existing line of the file. Is it possible to add data to an existing file on the first line (below a header), i.e. the opposite of append?
Given a data frame:
DF <- as.data.frame(matrix(seq(20),nrow=5,ncol=4))
colnames(DF) <- c("A", "B", "C", "D")
write.table(DF, "DF.csv", row.names=FALSE, sep=",")
I can append a new data frame to the last line like this
A <- 1
A <- data.frame(A)
A$B <- 1
A$C <- 1
A$D <- 1
write.table(A, "DF.csv", row.names=FALSE, sep=",", append=TRUE, col.names=FALSE)
Which is close to what I want, but I would really like the above line to be added at the top of DF.csv (right below the header), like so
A B C D
1 1 1 1
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
To be clear, I'm not looking to add a row to a data frame within R. I am hoping to add a row to the beginning of a file outside of the R environment. Just as append can be used to add data to the end of an external .csv file, I am hoping to "append" data to the beginning of a .csv file, so that my latest data always comes up in the first row (to avoid scrolling to the end of a long file to see the most current data).
Write your own function:
my.write.table <- function(df, filename, sep = ",")
{
  ## read the existing content (sep must be passed by name --
  ## read.table's second positional argument is header, not sep)
  temp.df <- read.table(filename, sep = sep, header = TRUE)
  ## put the new rows in front
  df <- rbind(df, temp.df)
  ## write back the whole data frame
  write.table(df, filename, sep = sep, row.names = FALSE)
}
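An alternative sketch that avoids re-parsing the whole table: splice the new row in after the header at the text level. This assumes the DF.csv file and the one-row data frame A from the question above:

```r
## read the raw lines, then rewrite with the new row inserted
## directly after the header (line 1)
lines <- readLines("DF.csv")
new_row <- paste(unlist(A), collapse = ",")
writeLines(c(lines[1], new_row, lines[-1]), "DF.csv")
```

This keeps every existing line byte-for-byte intact, which matters if the file contains quoting or formatting that a read/write round-trip would alter.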
I have a table in csv format, the data is the following:
1 3 1 2
1415_at 1 8.512147859 8.196725061 8.174426394 8.62388149
1411_at 2 9.119200527 9.190318548 9.149239039 9.211401637
1412_at 3 10.03383593 9.575728316 10.06998673 9.735217522
1413_at 4 5.925999419 5.692092375 5.689299161 7.807354922
When I read it with:
m <- read.csv("table.csv")
and print the values of m, I notice that they change to:
X X.1 X1 X3 X1.1 X4
1 1415_at 1 8.512148 8.196725 8.174426 8.623881
I made some manipulation to keep only those columns that are labelled 1 or 2, so I do that with:
smallerdat <- m[ grep("^X$|^X.1$|^X1$|^X2$|1\\.|2\\." , names(m) ) ]
write.csv(smallerdat,"table2.csv")
it writes the file with those annoying headers and adds a first column, which I do not need:
X X.1 X1 X1.1 X2
1 1415_at 1 8.512148 8.174426 8.623881
so when I open that data in Excel the headers are still X, X.1 and so on. What I need is for the headers to remain the same, as in:
1 1 2
1415_at 1 8.196725061 8.174426394 8.62388149
any help?
Also note the first column that is added automatically; I do not need it, so how can I get rid of that column?
There are two issues here.
For reading your CSV file, use:
m <- read.csv("table.csv", check.names = FALSE)
Notice that by doing this, though, you can't use the column names as easily. You have to quote them with backticks instead, and will most likely still run into problems because of duplicated column names:
m$1
# Error: unexpected numeric constant in "m$1"
m$`1`
# [1] 8.512148 9.119201 10.033836 5.925999
For writing your "m" object to a CSV file, use:
write.csv(m, "table2.csv", row.names = FALSE)
After reading your file in using the method in step 1, you can subset as follows. If you wanted the first column and any columns named "3" or "4", you can use:
m[names(m) %in% c("", "3", "4")]
# 3 4
# 1 1415_at 1 8.196725 8.623881
# 2 1411_at 2 9.190319 9.211402
# 3 1412_at 3 9.575728 9.735218
# 4 1413_at 4 5.692092 7.807355
Update: Fixing the names before using write.csv
If you don't want to start from step 1 for whatever reason, you can still fix your problem. While you've succeeded in taking a subset with your grep statement, that doesn't change the column names (not sure why you would expect that it should). You have to do this by using gsub or one of the other regex solutions.
Here are the names of the columns with the way you have read in your CSV:
names(m)
# [1] "X" "X.1" "X1" "X3" "X1.1" "X2"
You want to:
Remove all "X"s
Remove all ".some-number"
So, here's a workaround:
# Change the names in your original dataset
names(m) <- gsub("^X|\\.[0-9]$", "", names(m))
# Create a temporary object to match desired names
getme <- names(m) %in% c("", "1", "2")
# Subset your data
smallerdat <- m[getme]
# Reassign names to your subset
names(smallerdat) <- names(m)[getme]
I am not sure I understand what you are attempting to do, but here is some code that reads a csv file with missing headers for the first two columns, selects only columns with a header of 1 or 2 and then writes that new data file retaining the column names of 1 or 2.
# first read in only the headers and deal with the missing
# headers for columns 1 and 2
b <- readLines('c:/users/mark w miller/simple R programs/missing_headers.csv',
               n = 1)
b <- unlist(strsplit(b, ","))
b[1] <- 'name1'
b[2] <- 'name2'
b <- gsub(" ","", b, fixed=TRUE)
b
# read in the rest of the data file
my.data <- read.table(file = "c:/users/mark w miller/simple R programs/missing_headers.csv",
                      na.strings = NA, header = FALSE, skip = 1, sep = ',')
colnames(my.data) <- b
# select the columns with names of 1 or 2
my.data <- my.data[names(my.data) %in% c("1", "2")]
# retain the original column names of 1 or 2
names(my.data) <- floor(as.numeric(names(my.data)))
# write the new data file with original column names
write.csv(
my.data, "c:/users/mark w miller/simple R programs/missing_headers_out.csv",
row.names=FALSE, quote=FALSE)
Here is the input data file. Note the commas with missing names for columns 1 and 2:
, , 1, 3, 1, 2
1415_at, 1, 8.512147859, 8.196725061, 8.174426394, 8.62388149
1411_at, 2, 9.119200527, 9.190318548, 9.149239039, 9.211401637
1412_at, 3, 10.03383593, 9.575728316, 10.06998673, 9.735217522
1413_at, 4, 5.925999419, 5.692092375, 5.689299161, 7.807354922
Here is the output data file:
1,1,2
8.512147859,8.174426394,8.62388149
9.119200527,9.149239039,9.211401637
10.03383593,10.06998673,9.735217522
5.925999419,5.689299161,7.807354922