Although I can't imagine it not having been asked before in one way or another, I don't seem to be able to find something that answers my question.
I have data that looks like this
> mydata1
V1
,10.00,20.00,30.00,40.00
,11.00,22.00,33.00,44.00
And I'd like to have data that looks like:
> mydata2
V1 V2 V3 V4
10.00 20.00 30.00 40.00
11.00 22.00 33.00 44.00
When I try read.table with separator "," I get:
> mydata2 <- read.table(mydata1, sep = ",")
Error in read.table(mydata1, sep = ",") :
  'file' must be a character string or connection
I tried some regex magic, but it didn't work (mostly because I have no deep understanding of the matter).
Any help is much appreciated!
We can use read.csv after removing the , at the start of each string with sub. The error above occurred because read.table expects a file path or connection as its first argument; since the lines are already in an R object, we pass them via the text = argument instead:
mydata2 <- read.csv(text = sub("^,", "", mydata1$V1), header = FALSE)
mydata2
# V1 V2 V3 V4
#1 10 20 30 40
#2 11 22 33 44
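The same idea works with read.table directly, again passing the in-memory lines via text = (assuming, as in the question, that mydata1$V1 holds the raw lines):
mydata2 <- read.table(text = sub("^,", "", mydata1$V1), sep = ",", header = FALSE)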
library(tidyverse)
separate(mydata1, V1, into = c("V0", "V1", "V2", "V3", "V4"), sep = ",") %>% select(-V0)
# V1 V2 V3 V4
# 1 10.00 20.00 30.00 40.00
# 2 11.00 22.00 33.00 44.00
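Note that separate() leaves the new columns as character; adding convert = TRUE makes tidyr coerce them to numeric:
separate(mydata1, V1, into = c("V0", "V1", "V2", "V3", "V4"), sep = ",", convert = TRUE) %>% select(-V0)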
I'm trying to read a csv file in r.
The issue is that my file has no column names except for the first column
Using the read.csv() function gives me the 'Error in read.table : more columns than column names' error
So I used the read_csv() function from the readr library.
However this creates a df with just one column containing all the values.
(screenshot of the resulting single-column data frame: https://i.stack.imgur.com/Och8A.png)
What should I do to fix this issue?
A first cut at reading the data would be to use skip = 1 (to skip the first line, which appears to be descriptive only) and header = FALSE:
quux <- read.csv("path/to/file.csv", skip = 1, header = FALSE)
I find this format a bit awkward, so we may want to reshape it:
quux <- setNames(data.frame(t(quux[,-1])), sub(":$", "", quux[[1]]))
quux
# LON LAT MMM 1984-Nov-01 1974-Nov-05
# V2 151.0 -24.5 27.11 22.28 22.92
# V3 151.5 -24.0 27.46 22.47 22.83
# V4 152.0 -24.0 27.19 22.27 22.64
Many tools prefer to have the "month" column names as a single column, which means converting this data from "wide" format to "long" format. This is easily done with either tidyr::pivot_longer or reshape2::melt:
dat <- reshape2::melt(quux, c("LON", "LAT", "MMM"), variable.name = "date")
dat
# LON LAT MMM date value
# 1 151.0 -24.5 27.11 1984-Nov-01 22.28
# 2 151.5 -24.0 27.46 1984-Nov-01 22.47
# 3 152.0 -24.0 27.19 1984-Nov-01 22.27
# 4 151.0 -24.5 27.11 1974-Nov-05 22.92
# 5 151.5 -24.0 27.46 1974-Nov-05 22.83
# 6 152.0 -24.0 27.19 1974-Nov-05 22.64
dat <- tidyr::pivot_longer(quux, -c(LON, LAT, MMM), names_to = "date")
This returns the same content as the melt() result, as a tibble, with rows ordered by original row rather than by date.
From here, it might be nice to have the date column be a "proper" Date object so that "number-like" things can be done with it. For example, in its present form sorting is incorrect, since Apr would land before Jan; other number-like operations include finding ranges of dates (which can be done with strings, but not with these strings) and adding/subtracting days (e.g., 7 days prior to a value).
dat$date <- as.Date(dat$date, format = "%Y-%b-%d")
dat
# LON LAT MMM date value
# 1 151.0 -24.5 27.11 1984-11-01 22.28
# 2 151.5 -24.0 27.46 1984-11-01 22.47
# 3 152.0 -24.0 27.19 1984-11-01 22.27
# 4 151.0 -24.5 27.11 1974-11-05 22.92
# 5 151.5 -24.0 27.46 1974-11-05 22.83
# 6 152.0 -24.0 27.19 1974-11-05 22.64
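For instance, with real Date objects the operations mentioned above behave as expected (illustrative only, using dat as built above):
sort(unique(dat$date))  # chronological order, not alphabetical
range(dat$date)         # earliest and latest date
unique(dat$date) - 7    # 7 days prior to each value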
Sample data:
quux <- read.csv(skip = 1, header = FALSE, text = '
LON:,151.0,151.5,152.0
LAT:,-24.5,-24.0,-24.0
MMM:,27.11,27.46,27.19
1984-Nov-01,22.28,22.47,22.27
1974-Nov-05,22.92,22.83,22.64
')
I have a directory of text files named using the following convention: "Location[A-Z]_House[0-15]_Day[0-15].txt", so an example is LA_H05_D14.txt. Is there a way of splitting the names so that they can be made a factor? More specifically, I would like to use the letter [A-Z] that comes after Location, e.g. LB_H01_D01.txt would be location "B", and all data belonging to Location B would be labelled "B".
I have imported all the data from the files into one data frame:
l = list.files(patt="txt$", full.names = T)
library(dplyr)
Df = bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  setNames(temp, c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "AQI(US)", "AQI(CN)",
                   "PM10(ug/m3)", "Outdoor AQI(US)", "Outdoor AQI(CN)", "Temperature(C)",
                   "Temperature(F)", "Humidity(%RH)", "CO2(ppm)", "VOC(ppb)"))
}), .id = "id")
The data looks like this with an "id" column:
head(Df)
id Date Time Timestamp PM2_5(ug/m3) AQI(US) AQI(CN) PM10(ug/m3) Outdoor AQI(US) Outdoor AQI(CN) Temperature(C) Temperature(F)
1 1 2017/10/17 20:31:38 1508272298 102.5 175 135 512 0 0 30 86.1
2 1 2017/10/17 20:31:48 1508272308 93.6 171 124 477 0 0 30 86.1
3 1 2017/10/17 20:31:58 1508272318 98.0 173 129 397 0 0 30 86.0
4 1 2017/10/17 20:32:08 1508272328 98.0 173 129 422 0 0 30 86.0
5 1 2017/10/17 20:32:18 1508272338 104.3 176 137 466 0 0 30 86.0
6 1 2017/10/17 20:32:28 1508272348 101.6 175 134 528 0 0 30 86.0
Humidity(%RH) CO2(ppm) VOC(ppb)
1 43 466 -1
2 43 467 -1
3 42 468 -1
4 42 469 -1
5 42 471 -1
6 42 471 -1
Independent of the issue concerning the content of the id column, you can use the following code to extract the information from the filenames:
#you may use the original filenames
filenames <- basename(l)
#or the content of the id column
filenames <- as.character(Df$id) #if you have read in filenames in the Df
#for demonstration, here is a definition of some example filenames
filenames <- c("LA_H01_D01.txt"
,"LA_H02_D02.txt"
,"LD_H01_D14.txt"
,"LD_H01_D15.txt")
filenames <- gsub("_H|_D", "_", filenames)
filenames <- gsub(".txt|^L", "", filenames)
fileinfo <- as.data.frame(do.call(rbind, strsplit(filenames, "_")))
colnames(fileinfo) <- c("Location", "House", "Day")
fileinfo[, c("House", "Day")] <- apply(fileinfo[, c("House", "Day")], 2, as.numeric)
# Location House Day
# 1 A 1 1
# 2 A 2 2
# 3 D 1 14
# 4 D 1 15
#add the information to your Df as new columns
Df <- cbind(Df, fileinfo)
#the whole thing as a function used in your data import
add_fileinfo <- function(df, filename) {
  filename <- gsub("_H|_D", "_", filename)
  filename <- gsub(".txt|^L", "", filename)
  fileinfo <- as.data.frame(do.call(rbind, strsplit(filename, "_")))
  colnames(fileinfo) <- c("Location", "House", "Day")
  fileinfo[, c("House", "Day")] <- apply(fileinfo[, c("House", "Day")], 2, as.numeric)
  cbind(df, fileinfo[rep(seq_len(nrow(fileinfo)), each = nrow(df)), ])
}
Df = bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  temp <- setNames(temp, c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "AQI(US)", "AQI(CN)",
                           "PM10(ug/m3)", "Outdoor AQI(US)", "Outdoor AQI(CN)", "Temperature(C)",
                           "Temperature(F)", "Humidity(%RH)", "CO2(ppm)", "VOC(ppb)"))
  add_fileinfo(temp, i)
}), .id = "id")
Something like this (generic) solution should get you going.
mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)
Then, merge
myfulldata = merge(mydata1, mydata2)
As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.
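For illustration, a toy example with made-up frames sharing an "id" column:
df1 <- data.frame(id = 1:3, x = c(10, 20, 30))
df2 <- data.frame(id = 1:3, y = c("a", "b", "c"))
merge(df1, df2)  # joins observations on the common "id" column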
What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two… so I could merge the 20 data frames together with 19 merge statements like this:
mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)
That’s tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge(). Here’s the code to define the function:
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x){read.csv(file = x, header = TRUE)})
  Reduce(function(x, y){merge(x, y)}, datalist)
}
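A quick usage sketch (the folder path is hypothetical; all CSVs in it are assumed to share a common key column):
mymergeddata <- multmerge("path/to/my/csv/folder")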
Here is a good resource that should help you out.
https://stats.idre.ucla.edu/r/codefragments/read_multiple/
My data looks like this:
ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,
I would like to first separate this into different columns, then apply a function to each row to calculate the difference of the comma-separated values, such as (237-204), without the use of external library packages.
Try this. If the data is in a file, replace the readLines line with something like L <- readLines("myfile.csv"). After that, replace the colons with commas using gsub, then read the resulting text and transform it:
# test data
Lines <- "ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,"
L <- readLines(textConnection(Lines))
DF <- read.table(text = gsub(":", ",", L), sep = ",")
transform(DF, diff = V3 - V4)
giving:
V1 V2 V3 V4 V5 diff
1 ID 10 237 204 NA 33
2 ID 11 257 239 NA 18
3 ID 12 309 291 NA 18
4 ID 13 310 272 NA 38
5 ID 14 3202 3184 NA 18
6 ID 15 404 388 NA 16
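The trailing comma on each line is what produces the all-NA V5 column; if it's unwanted, it can simply be dropped when computing the difference:
transform(DF[-5], diff = V3 - V4)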
I have the CSV file like below:
data,key
"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128
"VA2,VA3,20140711,,0,0,8824,20140601095714,S1,S6402,175,20140601095839,m/t",4503643113914368
I'm trying to read it with R, but I don't need the key value, and the data value should be read into separate columns. With the following code I get almost what I need:
data <- read.csv(fileCSV, header = FALSE, sep = ",", skip = 1, comment.char = "", quote = "")
There I skip the header line (skip = 1), say that there is no header (header = FALSE), and say that there are no quotes (quote = ""). But in the result I get quote characters in the V1 and V13 columns, plus an extra V14 column:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 "VA1 VA2 20140524 NA 0 0 5969 2.014121e+13 S7 S1147 140 2.014121e+13 m/t" 4.503608e+15
Should I delete it somehow after reading csv? Or, is there any better way to read such csv files?
Update: I use the following approach to delete the quotes:
data[,"V1"] = sub("^\"", "", data[,"V1"])
data[,"V13"] = sub("\"$", "", data[,"V13"])
But factor type is changed to character for these columns.
How about a system command with fread()? The pipeline rev | cut -d '"' -f2 | rev keeps only the text inside the double quotes on each line (i.e., the data field), and tail -n +2 drops the header row:
writeLines(
'data,key
"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128
"VA2,VA3,20140711,,0,0,8824,20140601095714,S1,S6402,175,20140601095839,m/t",4503643113914368', "x.txt"
)
require(bit64)
data.table::fread("cat x.txt | rev | cut -d '\"' -f2 | rev | tail -n +2")
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1: VA1 VA2 20140524 NA 0 0 5969 20140523134902 S7 S1147 140 20140523134902 m/t
# 2: VA2 VA3 20140711 NA 0 0 8824 20140601095714 S1 S6402 175 20140601095839 m/t
Here's a test on the two methods, as requested.
## 150k lines
writeLines(c("data,key\n", rep_len(
'"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128\n', 1.5e5)),
"test.txt"
)
## fread() in well under 1 second (with bit64 loaded)
system.time({
dt <- data.table::fread(
"cat test.txt | rev | cut -d '\"' -f2 | rev | grep -e '^V'"
)
})
# user system elapsed
# 0.945 0.108 0.547
## your current read.csv() method in just over two seconds
system.time({
df <- read.csv("test.txt", header = FALSE, sep = ",", skip = 1, comment.char = "", quote = "")
df[,"V1"] = sub("^\"", "", df[,"V1"])
df[,"V13"] = sub("\"$", "", df[,"V13"])
})
# user system elapsed
# 2.134 0.000 2.129
dim(dt)
# [1] 150000 13
dim(df)
# [1] 150000 14
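For a portable base-R alternative (no shell tools, so it also works on Windows), read the file normally so the quotes are honored, then re-parse the quoted field in a second pass (a sketch, assuming the layout shown above):
raw <- read.csv("x.txt", stringsAsFactors = FALSE)  # columns: data, key
dat <- read.csv(text = raw$data, header = FALSE)    # split the quoted field into 13 columns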
Lake Elsinore 9.7 F W 60.2 131 1 1 0 2310.1
Lake Elsinore 10.4 F W 53.9 67 0 0 0 1815.9
Lake Elsinore 10.1 M W 54.3 96 1 1 1 1872.9
Lake Elsinore 9.6 M W 55.1 72 1 . 1 1980.4
So here I have ten variables, V1-V10. How can I read this into R? You see, the first variable actually contains a space, so I can't simply separate by spaces. Could someone help me find a way to easily import this kind of data?
Thank you so so much!
Here are two approaches:
1) It could be done with read.pattern in the gsubfn package. The matches to the parenthesized portions of the pattern are read in as separate fields:
library(gsubfn)
pattern <- "^(.*) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+)"
read.pattern("myfile.dat", pattern, na.strings = ".")
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Lake Elsinore 9.7 F W 60.2 131 1 1 0 2310.1
2 Lake Elsinore 10.4 F W 53.9 67 0 0 0 1815.9
3 Lake Elsinore 10.1 M W 54.3 96 1 1 1 1872.9
4 Lake Elsinore 9.6 M W 55.1 72 1 NA 1 1980.4
2) Read in the lines as they are, replace the first space on each line with some character (here we use underscore), re-read it now using read.table and then replace the underscore with space:
L <- readLines("myfile.dat")
L <- sub(" ", "_", L)
DF <- read.table(text = L, na.strings = ".")
DF[[1]] <- sub("_", " ", DF[[1]])
giving the same answer.
It's a little clunky, but I usually just read it in raw and parse the data from there. You could do something like:
# First, read in all columns space-separated
df <- read.table(FILE, header = FALSE, sep = " ")
# Create a new column (V12) that's a concatenation of V1 and V2
df <- within(df, V12 <- paste(V1, V2, sep = " "))
# And then keep the combined column plus the remaining nine
df <- df[, c(12, 3:11)]
Remember, you have 11 columns reading it in raw, which is why I'm creating a 12th.