Importing in R (read.csv): ignore non-numeric columns automatically

I want to import a tsv file including some non-numeric fields (i.e., date or string) in R:
num1 num2 date
1 2 2012-10-18 12:17:19
2 4 2014-11-16 09:30:23
4 11 2010-03-18 22:18:04
12 3 2015-02-18 12:55:50
13 1 2014-05-16 10:39:11
2 14 2011-05-26 20:48:54
I am using the following command:
a = read.csv("C:/test/testFile.tsv", sep="\t")
I want to ignore all non-numeric values automatically (or replace them with something like "NA"), and I don't want to list all the string column names to be ignored.
I tried the "stringsAsFactors" and "as.is" parameters, with no success.
Any ideas?

You have quite a few options here.
First, you can inform R while reading the table:
# NA lets R guess the column type; "NULL" skips the column entirely
data <- read.csv("C:/test/testFile.tsv",
                 sep="\t",
                 colClasses=c(NA, NA, "NULL"))
If you have many nonnumeric columns, say 10, you can use rep as colClasses=c(NA, NA, rep("NULL", 10)).
Second, you can read everything and process deletion afterwards (note the stringsAsFactors):
data <- read.csv("C:/test/testFile.tsv",
                 sep="\t", stringsAsFactors = FALSE)
You can then keep every column that is not identified as character:
data[, !sapply(data, is.character), drop = FALSE]
Or apply a destructive method to your data.frame:
data[sapply(data, is.character)] <- list(NULL)
You can go further to make sure only numeric columns are left:
data <- data[, sapply(data, is.numeric), drop = FALSE]

Just found this solution:
a = read.csv("C:/test/testFile.tsv", sep="\t", colClasses=c(NA, NA, "NULL"))
It is not completely automatic though.
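One way to make the colClasses approach automatic is to read a few rows first, see which columns R guesses as numeric, and build colClasses from that. The following is only a sketch under assumptions: the inline tsv_text stands in for the real file, and the names probe, is_num and col_classes are made up for illustration.

```r
# Small in-memory sample standing in for the real TSV file (assumption)
tsv_text <- paste(
  "num1\tnum2\tdate",
  "1\t2\t2012-10-18 12:17:19",
  "2\t4\t2014-11-16 09:30:23",
  sep = "\n"
)

# Pass 1: read a few rows and let R guess the column types
probe <- read.delim(text = tsv_text, nrows = 5, stringsAsFactors = FALSE)
is_num <- vapply(probe, is.numeric, logical(1))

# Pass 2: NA keeps a column (type guessed), "NULL" skips it
col_classes <- unname(ifelse(is_num, NA, "NULL"))
numeric_only <- read.delim(text = tsv_text, colClasses = col_classes)
```

With a real file you would pass the file path instead of text = tsv_text in both calls. A column that only looks numeric in the first few rows would still be kept, so the probe size is a trade-off.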

Related

How can I read a double-semicolon-separated .txt in r?

I have this problem but in r:
How can I read a double-semicolon-separated .csv with quoted values using pandas?
The solution there is to drop the additional columns generated. I'd like to know if there's a way to read the file separated by ;; without generating those additional columns.
Thanks!
Read it in normally using read.csv2 (or whichever variant you prefer, including read.table, read.delim, readr::read_csv2, data.table::fread, etc), and then remove the even-numbered columns.
dat <- read.csv2(text = "a;;b;;c;;d\n1;;2;;3;;4")
dat
# a X b X.1 c X.2 d
# 1 1 NA 2 NA 3 NA 4
dat[,-seq(2, ncol(dat), by = 2)]
# a b c d
# 1 1 2 3 4
It is usually recommended to properly clean your data before attempting to parse it, instead of cleaning it WHILE parsing, or worse, AFTER. Either use Notepad++ to replace all ;; occurrences or use R itself, but do not delete the original files (as a rule of thumb, never delete sources of data).
my.text <- readLines('d:/tmp/readdelim-r.csv')
cleaned <- gsub(';;', ';', my.text)
writeLines(cleaned, 'd:/tmp/cleaned.csv')
my.cleaned <- read.delim('d:/tmp/cleaned.csv', header=FALSE, sep=';')
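If you would rather not write an intermediate file, the same cleaning step can be done in memory and passed to the reader through its text argument. This is a minimal sketch; raw_lines stands in for the result of readLines() on your file, and it assumes the data never contains genuinely empty fields (which would also show up as ;;).

```r
# Stand-in for readLines("yourfile.txt") (assumption)
raw_lines <- c("a;;b;;c;;d", "1;;2;;3;;4")

# Collapse the doubled separator into a single one
cleaned <- gsub(";;", ";", raw_lines, fixed = TRUE)

# Parse the cleaned lines directly, no temporary file needed
dat <- read.table(text = cleaned, sep = ";", header = TRUE)
```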

Skipping rows gets rid of necessary colnames?

I have a data frame with some metadata in the first 3 rows that I need to skip. But doing so also affects the colnames of the value columns.
What can I do to avoid opening every CSV in Excel and deleting these rows manually?
This is how the CSV looks when opened in Excel:
In R, I'm using this command to open it:
android_per <- fread("...\\Todas las adquisiciones de dispositivos de Versión de Android PE.csv",
skip = 3)
And it looks like this:
UPDATE 1:
Similar logic to @G5W, but I think there needs to be a step of squashing the header that is in 2 rows back to one. E.g.:
txt <- "Some, utter, rubbish,,
Even more rubbish,,,,
,,Col_3,Col_4,Col_5
Col_1,Col_2,,,
1,2,3,4,5
6,7,8,9,0"
## below line writes a file - uncomment if you're happy to do so
##cat(txt, file="testfile.csv", "\n")
header <- apply(read.csv("testfile.csv", nrows=2, skip=2, header=FALSE),
2, paste, collapse="")
read.csv("testfile.csv", skip=4, col.names=header, header=FALSE)
Output:
# Col_1 Col_2 Col_3 Col_4 Col_5
#1 1 2 3 4 5
#2 6 7 8 9 0
Here is one way to do it. Read the file simply as lines of text. Eliminate the lines that you don't want, then read the remaining good part into a data.frame.
Sample csv file (I saved it as "Temp/Temp.csv")
Col_1,Col_2,Col_3,Col_4,Col_5
Some utter rubbish,,,,
Presumably documentation,,,,
1,2,3,4,5
6,7,8,9,0
Code
CSV_Lines = readLines("Temp/Temp.csv")
CSV_Lines = CSV_Lines[-(2:3)]   # drop the two metadata lines
DF = read.csv(text=CSV_Lines)
DF
#   Col_1 Col_2 Col_3 Col_4 Col_5
# 1     1     2     3     4     5
# 2     6     7     8     9     0
It skipped the unwanted lines and got the column names.
If you use skip = 3, you lose the column names, with no way to get them back from that call. An ugly hack is to use skip = 2, which gives correct names for every column except the first 2.
df <- read.csv('csv_name.csv', skip = 2, header = TRUE, stringsAsFactors = FALSE)
The headers of the first 2 columns are in the first data row, so you can do
names(df)[1:2] <- unlist(df[1, 1:2])
You then probably need to drop that first row (df <- df[-1, ]) to get the data frame as intended. The first 2 columns will have been read as character, so convert them back to numeric afterwards.
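If you read the file with header = FALSE, you can use the code below:
library(data.table)
df <- fread("~/Book1.csv", header = FALSE, skip = 2)
# helper that shifts a vector up by n positions (not used below)
shift_up <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}
# copy the first two column names from the second row into the first row
df[1,1] <- df[2,1]
df[1,2] <- df[2,2]
df <- df[-2,]
# promote the first row to column names, then drop it
names(df) <- as.character(unlist(df[1,]))
df <- df[-1,]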
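When the header sits on one known line and the data starts on another, the two pieces can also be read separately, which avoids shifting rows around afterwards. The sketch below reuses the sample lines from the readLines answer; header_row and first_data_row are hypothetical names, and their values are assumptions you would adjust to your own file.

```r
# Stand-in for readLines("Temp/Temp.csv") (assumption)
csv_lines <- c(
  "Col_1,Col_2,Col_3,Col_4,Col_5",
  "Some utter rubbish,,,,",
  "Presumably documentation,,,,",
  "1,2,3,4,5",
  "6,7,8,9,0"
)

header_row     <- 1  # line holding the column names (assumed)
first_data_row <- 4  # first line of real data (assumed)

# Take the names from the header line only
col_names <- strsplit(csv_lines[header_row], ",", fixed = TRUE)[[1]]

# Parse just the data lines and attach the names
body <- csv_lines[first_data_row:length(csv_lines)]
df <- read.csv(text = body, header = FALSE, col.names = col_names)
```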

Import csv without thousand delimiter and convert from factor to numeric without loss of decimal separator

I have a List data.list with 5 columns, which looks like this:
Code Price_old MB Price_new Product
CZ 898.00 20.00 1.001.00 Type 1
CZ 890.00 300.00 1.016.33 Type 2
CZ 890.00 1.000.00 1.016.63 Type 2
CZ 899.00 200.00 1.019.33 Type 2
NO 999.00 50.00 1.025.75 Type 3
NO 999.00 600.00 1.025.75 Type 3
This is directly imported from a .csv. What I want to know is a way to convert columns 2, 3 and 4 from factor to numeric (as.numeric(levels(f))[f] did not work!) (1 and 5 are character) without losing any information.
Conversion with mutate_if(is.factor, as.numeric) ended up losing all decimal points: 1.025.75 -> 102575, 50.00 -> 5000, etc.
Conversion with sapply
indx <- sapply(data.list, is.factor)
data.list[indx] <- sapply(data.list[indx],
function(x) as.numeric(as.character(x)))
produced roughly 200 NAs by coercion in each column of my full dataset, data I can not do without.
Second, I want to find a solution to convert all numeric values to this format: "####.##".
I searched in many related blogs and posts, but did not find a proper solution to my problem. Hope someone has an ace up the sleeve.
Cheers
Using the answer from https://stackoverflow.com/a/38626760/1017276
Essentially, you want to remove all but the last period.
csvfile <-
"Code,Price_old,MB,Price_new,Product
CZ,898.00,20.00,1.001.00,Type 1
CZ,890.00,300.00,1.016.33,Type 2
CZ,890.00,1.000.00,1.016.63,Type 2
CZ,899.00,200.00,1.019.33,Type 2
NO,999.00,50.00,1.025.75,Type 3
NO,999.00,600.00,1.025.75,Type 3"
csvfile <- textConnection(csvfile)
df <- read.csv(csvfile, stringsAsFactors = FALSE)
df[2:4] <- lapply(df[2:4],
function(x) as.numeric(gsub("\\.(?=[^.]*\\.)", "", x, perl = TRUE)))
df
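To see what the regular expression does on its own, and to cover the second part of the question (the "####.##" format), here is a small sketch on a plain character vector. sprintf with "%.2f" is one base R way to get two fixed decimals; the sample values are taken from the table above.

```r
# A few values from the question, as they appear in the CSV
x <- c("898.00", "1.001.00", "1.016.33", "1.000.00")

# Remove every period that is followed by another period later in the string,
# i.e. keep only the last one as the decimal separator
num <- as.numeric(gsub("\\.(?=[^.]*\\.)", "", x, perl = TRUE))

# Format back to text with exactly two decimals and no thousands separator
formatted <- sprintf("%.2f", num)
```

Note that formatted is character again, so keep num around for any calculations.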

extracting variable from file names in R

I have files that contain multiple rows. I want to add two new rows that I create by extracting variables from the filename and multiplying them by current rows.
For example, I have a bunch of files that are named something like this:
file1[1000,1001].txt
file1[2000,1001].txt
Between the [] there are always 2 numbers separated by a comma.
The file itself has multiple columns, for example column1 & column2.
For each file, I want to extract the 2 values in the name of the file and then use them as variables to make 2 new columns that use the variables to modify the values.
for example
file1[1000,2000]
the file contains two columns
column1 column2
1 2
2 4
I want at the end to add the first file name value to column 1 to create column3 and add the second file name value to column 2 to create column 4, ending up with something like this
column1 column2 column3 column4
1 2 1001 2002
2 4 1002 2004
Thanks for the help. I am almost there, just a few more issues.
The original file has 2 columns, "X_Parameter" and "Y_Parameter", and the file name is "test(64084,4224).txt".
Your code works great at extracting the two values V1 "64084" and V2 "4224" from the file name. I then add these values to the original data set. This yields 4 columns: "X_Parameter", "Y_Parameter", "V1", "V2".
setwd("~/Desktop/txt/")
txt_names = list.files(pattern = ".txt")
for (i in 1:length(txt_names)){assign(txt_names[i], read.delim(txt_names[i]))
DS1 <- read.delim(file = txt_names[i], header = TRUE, stringsAsFactors = TRUE)
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
DS2<-as.data.frame(do.call("rbind", (str_split(step2, ","))))
DS1$V1<-DS2$V1
DS1$V2<-DS2$V2
My issue arises when trying to sum "X_Parameter" and "V1" to make "absoluteX" and sum "Y_Parameter" with "V2" to make "absoluteY" for each row.
Below are the two ways I have tried, with the errors:
DS1$absoluteX<-DS1$X_Parameter+DS1$V1
error
In Ops.factor(DS1$X_Parameter, DS1$V1) : ‘+’ not meaningful for factors
other try was
DS1$absoluteX<-rowSums(DS1[,c("X_Parameter","V1")])
error
Error in rowSums(DS1[, c("X_Parameter", "V1")]) : 'x' must be numeric
I have tried using
as.numeric(DS1$V1)
that causes all values to become 1
Any thoughts? Thanks
You can extract the numbers from a vector of file names as follows (not sure it is the shortest possible code, but it seems to work)
fnams<-c("file1[1000,2000].txt","file1[1500,2500].txt")
opsqbr<-regexpr("\\[",fnams)
comm<-regexpr(",",fnams)
clsqbr<-regexpr("\\]",fnams)
reslt<-data.frame(col1=as.numeric(substring(fnams,opsqbr+1,comm-1)),
col2=as.numeric(substring(fnams,comm+1,clsqbr-1)))
reslt
Which yields
col1 col2
1 1000 2000
2 1500 2500
Once you have this data frame, it is easy to sequentially read the files and do the addition.
## set path to wherever your files are
setwd("path")
## make a vector with names of your files
txt_names <- list.files(pattern = "\\.txt$") # use this to make a complete list of names
## read your files in
for (i in 1:length(txt_names)) assign(txt_names[i], read.csv(txt_names[i], sep = "whatever your separator is"))
## for now I'm making a dummy vector and data frame
txt_names <- c("[1000,2000]")
ds1 <- data.frame(column1 = c(1,2), column2 = c(2,4))
## grab the text you require from the file names
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
## step2 should look like this
step2
# [1] "1000,2000"
## split each string and convert to data frame with two columns
ds2 <- as.data.frame(do.call("rbind", (str_split(step2, ","))))
## cbind with the file
df <- cbind(ds1, ds2)
## coerce factor columns to numeric
df$V1 <- as.numeric(as.character(df$V1))
df$V2 <- as.numeric(as.character(df$V2))
## perform the operation to change the columns
df$V1 <- df$column1 + df$V1
df$V2 <- df$column2 + df$V2
Now you have a data.frame with two columns, each containing the file name parts you need. Just rep them to the length of each of your data.frames and cbind.
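Putting the pieces together for a single file, the sketch below pulls the two numbers straight out of the name as numeric values, so the factor problem from the follow-up never comes up. extract_offsets is a hypothetical helper, ds stands in for the data read from the file, and the code assumes every name contains exactly one [number,number] pair.

```r
# Hypothetical helper: return the two numbers between [ and ] as numeric
extract_offsets <- function(file_name) {
  # Isolate the bracketed part first, so digits elsewhere in the name are ignored
  bracket <- regmatches(file_name, regexpr("\\[[0-9]+,[0-9]+\\]", file_name))
  # Then pull each run of digits out of it
  as.numeric(regmatches(bracket, gregexpr("[0-9]+", bracket))[[1]])
}

# Stand-in for the contents of file1[1000,2000].txt (assumption)
ds <- data.frame(column1 = c(1, 2), column2 = c(2, 4))

offsets <- extract_offsets("file1[1000,2000].txt")

# A single number is recycled over the whole column, so no rep() is needed
ds$column3 <- ds$column1 + offsets[1]
ds$column4 <- ds$column2 + offsets[2]
```

For names that use parentheses, such as test(64084,4224).txt, the pattern would need \\( and \\) in place of the square brackets.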

Parse currency values from CSV, convert numerical suffixes for Million and Billion

I'm curious if there's any sort of out of the box functions in R that can handle this.
I have a CSV file that I am reading into a data frame using read.csv. One of the columns in the CSV contains currency values in the format of
Currency
--------
$1.2M
$3.1B
N/A
I would like to convert those into more usable numbers that calculations can be performed against, so it would look like this:
Currency
----------
1200000
3100000000
NA
My initial thoughts were to somehow subset the dataframe into 3 parts based on rows that contain *M, *B, or N/A. Then use gsub to replace the $ and M/B, then multiply the remaining number by 1000000 or 1000000000, and finally rejoin the 3 subsets back into 1 data frame.
However I'm curious if there's a simpler way to handle this sort of conversion in R.
We could use gsubfn to replace the 'B', 'M' with 'e+9', 'e+6' and convert to numeric (as.numeric).
is.na(v1) <- v1=='N/A'
options(scipen=999)
library(gsubfn)
as.numeric(gsubfn('([A-Z]|\\$)', list(B='e+9', M='e+6',"$"=""),v1))
#[1] 1200000 3100000000 NA
EDIT: Modified based on @nicola's suggestion
data
v1 <- c('$1.2M', '$3.1B', 'N/A')
Another way is to use a for loop:
x <- c("1.2M", "2.5M", "1.6B", "N/A")
x <- ifelse(x=="N/A", NA, x)
# strip the trailing suffix letter and convert what is left to numeric
num <- as.numeric(sub("[^0-9.]+$", "", x))
for(i in seq_along(x)) {
  if(grepl('M', x[i]))
    print(prod(num[i], 1000000))      # millions
  else
    print(prod(num[i], 1000000000))   # billions (NA stays NA)
}
# [1] 1200000
# [1] 2500000
# [1] 1.6e+09
# [1] NA
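A vectorised base R variant is also possible, returning a numeric vector instead of printing each value. This is a sketch, not an out-of-the-box function: parse_currency and its multipliers lookup are made-up names, and only the M and B suffixes from the question are handled.

```r
# Hypothetical helper: "$1.2M" -> 1200000, "N/A" -> NA
parse_currency <- function(x) {
  multipliers <- c(M = 1e6, B = 1e9)

  cleaned <- gsub("[$,]", "", x)  # drop "$" and any commas
  last    <- substr(cleaned, nchar(cleaned), nchar(cleaned))
  has_suffix <- last %in% names(multipliers)

  # Numeric part: everything except the suffix letter, if there is one
  digits <- ifelse(has_suffix, substr(cleaned, 1, nchar(cleaned) - 1), cleaned)
  number <- suppressWarnings(as.numeric(digits))  # "N/A" becomes NA

  number * ifelse(has_suffix, multipliers[last], 1)
}

result <- parse_currency(c("$1.2M", "$3.1B", "N/A"))
```

In a data frame this could be applied to the whole column at once, for example df$Currency <- parse_currency(df$Currency), assuming the column was read as character.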
