Convert tab delimited string to dataframe - r

I have a long character string that looks like this, except where I've shown double back slashes there is, in reality, only one backslash.
char.string <- "BAT\\tUSA\\t\\tmedium\\t0.8872\\t9\\tOff production\\tCal1|Cal2\\r\\nGNAT\\tCAN\\t\\small\\t0.3824\\t11\\tOff production\\tCal3|Cal8|Cal9\\r\\n"
I tried the following.
df <- data.frame(do.call(rbind, strsplit(char.string, "\t", fixed=TRUE)))
df <- ldply (df, data.frame)
The first returns a vector. The second returns thousands of rows and two columns, one consisting of sequential numbers and the second consisting of all the data.
I'm trying to achieve this:
item = c("BAT", "GNAT")
origin = c("USA", "CAN")
size = c("medium", "small")
lot = c("0.8872", "0.3824")
mfgr = c("9", "11")
stat = c("Off production", "Off production")
line = c("Cal1|Cal2", "Cal3|Cal8|Cal9")
df = data.frame(item, origin, size, lot, mfgr, stat, line)
df
item origin size lot mfgr stat line
1 BAT USA medium 0.8872 9 Off production Cal1|Cal2
2 GNAT CAN small 0.3824 11 Off production Cal3|Cal8|Cal9

read.table() should actually be just fine here, but you have two basic problems:
There are two typos:
a. I'm assuming you don't want \\small, but rather small
b. You have \\t\\tmedium where I think you want just \\tmedium
"\\t" is not the same as "\t"
Try this
# Start with your original input
char.string <- "BAT\\tUSA\\t\\tmedium\\t0.8872\\t9\\tOff production\\tCal1|Cal2\\r\\nGNAT\\tCAN\\t\\small\\t0.3824\\t11\\tOff production\\tCal3|Cal8|Cal9\\r\\n"
# Eliminate the typos
char.string <- sub("\\\\s", "s", char.string)
char.string <- sub("\\\\t\\\\t", "\\\\t", char.string)
# Convert \\t, etc. to actual tabs and newlines
char.string <- gsub("\\\\t", "\t", char.string)
char.string <- gsub("\\\\r", "\r", char.string)
char.string <- gsub("\\\\n", "\n", char.string)
# Read the data into a dataframe
df <- read.table(text = char.string, sep = "\t")
# Add the colnames
colnames(df) <- c("item", "origin", "size", "lot", "mfgr", "stat", "line")
# And take a look at the result
df
item origin size lot mfgr stat line
1 BAT USA medium 0.8872 9 Off production Cal1|Cal2
2 GNAT CAN small 0.3824 11 Off production Cal3|Cal8|Cal9
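The difference between the two-character sequence and a real tab is easy to check directly. This is a minimal sketch in base R; the names `escaped`, `real`, `s`, and `parts` are only for illustration and do not come from the question.

```r
escaped <- "\\t"   # two characters: a backslash followed by the letter t
real    <- "\t"    # one character: an actual tab

nchar(escaped)
nchar(real)

# With fixed = TRUE the pattern is taken literally, so a real tab does not
# match text that only contains backslash-t sequences
s <- "BAT\\tUSA"
grepl(real, s, fixed = TRUE)
grepl(escaped, s, fixed = TRUE)

# Converting the literal sequence to a real tab first makes the split work
parts <- strsplit(gsub("\\t", "\t", s, fixed = TRUE), "\t", fixed = TRUE)[[1]]
parts
```

This is why splitting the original string on "\t" returned a single unsplit element: the string contained no tab characters at all.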

I took some liberties with what I think are typos in your char.string.
library(tidyverse)
char.string <- "BAT\\tUSA\\tmedium\\t0.8872\\t9\\tOff production\\tCal1|Cal2\\r\\nGNAT\\tCAN\\tsmall\\t0.3824\\t11\\tOff production\\tCal3|Cal8|Cal9\\n"
lapply(
str_split(gsub("\\\\n", "", char.string), "\\\\r")[[1]]
, function(x) {
y <- str_split(x, "\\\\t")[[1]]
data.frame(
item = y[1]
, origin = y[2]
, size = y[3]
, lot = y[4]
, mfgr = y[5]
, stat = y[6]
, line = y[7]
, stringsAsFactors = F
)
}) %>%
bind_rows()
item origin size lot mfgr stat line
1 BAT USA medium 0.8872 9 Off production Cal1|Cal2
2 GNAT CAN small 0.3824 11 Off production Cal3|Cal8|Cal9

Related

Is there an R function that reads text files with \n as a (column) delimiter?

The Problem
I'm trying to come up with a neat/fast way to read files delimited by newline (\n) characters into more than one column.
Essentially in a given input file, multiple rows in the input file should become a single row in the output, however most file reading functions sensibly interpret the newline character as signifying a new row, and so they end up as a data frame with a single column. Here's an example:
The input files look like this:
Header Info
2021-01-01
text
...
#
2021-01-02
text
...
#
...
Where the ... represents potentially multiple rows in the input file, and the # signifies what should really be the end of a row in the output data frame. So upon reading this file, it should become a data frame like this (ignoring the header):
X1          X2    ...  Xn
2021-01-01  text  ...  ...
2021-01-02  text  ...  ...
...         ...   ...  ...
My attempt
I've tried base, data.table, readr and vroom, and they all have one of two outputs, either a data frame with a single column, or a vector. I want to avoid a for loop, and so my current solution is using base::readLines(), to read it as a character vector, then manually adding some "proper" column separators (e.g. ;), and then joining and splitting again.
# Save the example data to use as input
writeLines(c("Header Info", "2021-01-01", "text", "#", "2021-01-02", "text", "#"), "input.txt")
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
# Clean up the example input
unlink("input.txt")
My code above works and gives the desired result, but surely there's a better way??
Edit: This is internal in a function, so part (perhaps the larger part) of the intention of any simplification is to improve the speed.
Thanks in advance!
1) Read in the data, locate the # signs giving logical variable at and then create a grouping variable g which has distinct values for each desired line. Finally use tapply with paste to rework it into lines that can be read using read.table and read it. (If there are commas in the data then use some other separating character.)
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","),
sep = ",")
giving this data frame:
V1 V2
1 2021-01-01 text
2 2021-01-02 text
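To see how the grouping variable in (1) lines up with the input, it can help to print the intermediate vectors side by side. A small sketch, assuming the header has already been dropped and using the example lines from the question:

```r
# Example lines as they would look after dropping the header
L <- c("2021-01-01", "text", "#", "2021-01-02", "text", "#")

at <- grepl("#", L)   # TRUE where a record ends
g  <- cumsum(at)      # number of separators seen so far; constant within a record

data.frame(L, at, g)

# Pasting the non-separator lines within each group gives one line per record
rows <- tapply(L[!at], g[!at], paste, collapse = ",")
as.vector(rows)
```

Each separator increments g, so dropping the separator lines leaves exactly one group value per output row.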
2) Another approach is to rework the data into dcf form by removing the # sign and prefacing other lines with their column name and a colon. Then use read.dcf. cnames is a character vector of column names that you want to use.
cnames <- c("Date", "Text")
L <- readLines("input.txt")[-1]
LL <- sub("#", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF
giving this data frame:
Date Text
1 2021-01-01 text
2 2021-01-02 text
3) This approach simply reshapes the data into a matrix and then converts it to a data frame. Note that (1) converts numeric columns to numeric whereas this one just leaves them as character.
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
## V1 V2
## 1 2021-01-01 text
## 2 2021-01-02 text
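The matrix reshape in (3) relies on every record having the same number of lines. If that is not guaranteed, a check along the following lines could be added before reshaping. This is a sketch under that assumption, not part of the original answer; `ends` and `record_len` are illustrative names.

```r
L <- c("2021-01-01", "text", "#", "2021-01-02", "text", "#")

ends <- which(L == "#")          # positions of the record separators
record_len <- diff(c(0, ends))   # lines per record, separator included

if (length(unique(record_len)) == 1) {
  k <- record_len[1]
  DF <- as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
} else {
  stop("records differ in length; the matrix reshape would misalign them")
}
DF
```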
Benchmark
The question did not mention speed as a consideration but in a comment it was later mentioned. Based on the data in the benchmark below (1) runs over twice as fast as the code in the question and (3) runs nearly 25x faster.
library(microbenchmark)
writeLines(c("Header Info",
rep(c("2021-01-01", "text", "#", "2021-01-02", "text", "#"), 10000)),
"input.txt")
microbenchmark(times = 10,
ques = {
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";#;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
},
ans1 = {
L <- readLines("input.txt")[-1]
at <- grepl("#", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
},
ans3 = {
L <- readLines("input.txt")[-1]
k <- grep("#", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01 10 c
## ans1 518.95 522.75 548.33 532.59 561.55 647.14 10 b
## ans3 50.47 51.19 51.68 51.69 52.25 52.52 10 a
You can get round some of the string manipulation with something along the lines of:
input <- readLines("input.txt")[-1] #Read in and remove header
ncol <- which(input=="#")[1]-1 #Number of columns of data
data.frame(matrix(input[input != "#"], ncol = ncol, byrow=TRUE)) #Convert to dataframe
# X1 X2
#1 2021-01-01 text
#2 2021-01-02 text
At this point, you might consider going the full mile and use a proper grammar to parse it. I don't know how big or complex the situation really is, but using pegr it might look something like this:
input <-
"Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#
"
library(pegr)
library(magrittr) # provides the %>% pipe used below
peg <- new.parser(commonRules,action=TRUE) +
c("HEADER <- 'Header Info' EOL" , "{}" ) + # Rule to match literal 'Header Info' and a \n, then discard
c("TYPE <- 'text' EOL" , "{-}" ) + # Rule to match literal 'text', then paste and store as $TYPE
c("DATE <- (!EOL .)* EOL" , "{-}" ) + # Rule to match any character leading up to a new line. Could improve to look for a date format
c("EOS <- '#' EOL" , "{}" ) + # Rule to match end of section, then discard
c("BODY <- (!EOS .)*" , "{-}" ) + # Rule to match body of text, including newlines
c("SECTION <- DATE TYPE BODY EOS" ) + # Combining rules to match each section
c("DOCUMENT <- HEADER SECTION*" ) # Combining more rules to match the entire document
res <- peg[["DOCUMENT"]](input)
final <- matrix( value(res), ncol=3, byrow=TRUE ) %>%
as.data.frame %>%
setNames( names(value(res))[1:3])
final
Produces:
DATE TYPE BODY
1 2021-01-01 text multiple lines\nof\ntext\n
2 2021-01-02 text more\nlines of text\n
It might feel clunky if you don't know the syntax, but once you do, it's a fire-and-forget solution. It'll run according to spec until the spec doesn't hold. You don't have to worry about fragile pretreatment, and it is easy to adapt to changing formats in the future.
There's also the tidyverse way:
library(tidyr)
library(readr)
library(stringr)
max_columns <- 5
d <- {
readr::read_file("file.txt") %>%
stringr::str_remove("^Header Info\n") %>%
tibble::enframe(name = NULL) %>%
separate_rows(value, sep = "#\n") %>%
separate("value", into = paste0("X", 1:max_columns) , sep = "\n")
}
Using your example input in a file called file.txt, d looks like this:
# A tibble: 3 x 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 2021-01-01 text ... "" NA
2 2021-01-02 text ... "" NA
3 ... NA NA NA NA
Warning message:
Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
Note that the warning is only there to make sure you know you are getting NAs; this is inevitable if the number of rows between # markers varies.
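Rather than hard-coding the number of columns, it could be derived from the data by counting the lines in the longest record. A base R sketch, assuming the same layout (one header line, records terminated by a # line); the sample text and the names `is_sep` and `record_id` are made up for illustration.

```r
txt <- "Header Info\n2021-01-01\ntext\nextra\n#\n2021-01-02\ntext\n#\n"

lines <- strsplit(txt, "\n", fixed = TRUE)[[1]][-1]   # drop the header line
is_sep <- lines == "#"
record_id <- cumsum(is_sep)[!is_sep]                  # record index of each data line

# The longest record determines how many columns are needed
max_columns <- max(table(record_id))
max_columns
```

Shorter records would still be padded with NA, but no record would be truncated.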
I am using data similar to that provided by Sirius for demonstration. You can also do something like this to get a variable number of columns in the resulting data frame:
example <- "Header Info
2021-01-01
text
multiple lines
of
text
#
2021-01-02
text
more
lines of text
#"
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate_rows(dummy, sep = '\\n') %>%
filter(row_number() !=1) %>%
group_by(rowid = rev(cumsum(rev(dummy == '#')))) %>%
filter(dummy != '#') %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 2 x 6
#> # Groups: rowid [2]
#> rowid X1 X2 X3 X4 X5
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 2 2021-01-01 text multiple lines of text
#> 2 1 2021-01-02 text more lines of text <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans a day, I was wondering if anyone could help me write better, more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
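The difference between the two patterns can be checked on a plain character vector with base `grepl`. In this sketch the name `r51a` is added to the question's examples as a hypothetical problem case, and a leading `^` is added to the longer pattern as an extra assumption so that it only matches at the start of a name.

```r
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51a")

# Short pattern: also picks up r51a, because "r5" matches regardless of what follows
loose <- x[grepl("r([5-9]|1[0-2])", x)]
loose

# Longer pattern: the number must be followed by a non-digit or the end of the name
strict <- x[grepl("^r([5-9]|1[0-2])([^0-9]|$)", x)]
strict
```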
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can make a function with a generalization of the above code. This function has three arguments, the first is the vector of variables names, the second and the third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6

Iterating over columns in a data frame in order to replace values from matching data in list of data frames

I'm interested in building a function making use of apply/sapply or Map that would iterate over available columns in dta and replace values in each column with matched values from data frame available in a nameless list of data frames with list item index corresponding to the column number of the dta data frame.
Example
Given objects:
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE)
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D")
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q")
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January")
)
)
Desired results
When applied on dta, the function should return a data.frame corresponding to the extract below:
unitA unitB unitC someValue
Letter B small t Apr 912876
Letter B small q March 293604
C s Apr 459066
Letter D p March 332395
Letter A small q March 650871
Letter D small q Apr 258017
Letter D p January 478546
C small q Feb 766311
C small t March 84247
Letter A small q March 875322
Letter A r Feb 339073
Letter A r Apr 839441
C r Feb 346684
Letter B p January 333775
Letter D small t January 476352
(...)
Existing approach
replaceLbls <- function(dataSet, lstDict) {
sapply(seq_along(dataSet), function(i) {
# Take corresponding metadata data frame
dtaDict <- lstDict[[i]]
# Replace values in selected column
# Where matches on V1 push corresponding values from V2
dataSet[,i][match(dataSet[,i], dtaDict[,1])] <- dtaDict[,2][match(dtaDict[,1], dataSet[,i])]
})
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
Of course the approach proposed above does not work as it will try to use NA in assignments; but it summarises what I want to achieve:
Error in x[...] <- m : NAs are not allowed in subscripted assignments
In addition: Warning message: In [<-.factor(*tmp*, match(dataSet[,
i], dtaDict[, 1]), value = c(NA, : invalid factor level, NA
generated
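The error comes from `match` returning NA for values that have no dictionary entry, and those NAs then being used as assignment subscripts. A minimal sketch of the problem and one way around it, using made-up vectors rather than the question's data:

```r
x    <- c("A", "C", "B")
dict <- data.frame(V1 = c("A", "B"), V2 = c("Letter A", "Letter B"),
                   stringsAsFactors = FALSE)

idx <- match(x, dict$V1)   # "C" has no entry in the dictionary, so it maps to NA
idx

# Restricting both sides of the assignment to the matched positions
# keeps NA out of the subscript
hit <- !is.na(idx)
x[hit] <- dict$V2[idx[hit]]
x
```

Unmatched values such as "C" are simply left as they were.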
Additional remarks
Source data set
The key characteristics of the data are:
The list is nameless so subsetting has to be done by item numbers not by names
Item number correspond to column numbers
There is no full match between metadata data frames available in the list of data frames and unit columns available in the data
The someValue column also should be iterated over as it may contain labels that should be replaced
Solution
I'm not interested in dplyr/data.table/sqldf-based solutions.
I'm not interested in nested for-loops
I have a hacky solution that doesn't use for loops or other packages. I needed to convert the factors to characters for it to work but you might be able to improve my solution.
The solution works by only matching values that are found in your lstMeta by creating a vector of indices where matches are found. I also used the <<- operator. If you're better at R than me, you can probably improve this.
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE),
stringsAsFactors = F
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D"),
stringsAsFactors = F
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q"),
stringsAsFactors = F
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January"),
stringsAsFactors = F
)
)
replaceLbls <- function(dataSet, lstDict) {
sapply(1:3, function(i) {
# Take corresponding metadata data frame
dtaDict <- lstDict[[i]]
# Replace values in selected column
# Where matches on V1 push corresponding values from V2
myUniques <- which(dataSet[,i] %in% dtaDict[,1])
dataSet[myUniques,i]<<- dtaDict[,2][match(dataSet[myUniques,i],dtaDict[,1])]
})
return(dataSet)
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
The following approach works well for the example data:
replaceLbls <- function(dataSet, lstDict) {
dataSet[seq_along(lstDict)] <- Map(function(x, lst) {
x <- as.character(x)
idx <- match(x, as.character(lst$V1))
replace(x, !is.na(idx), as.character(lst$V2)[na.omit(idx)])
}, dataSet[seq_along(lstDict)], lstDict)
dataSet
}
head(replaceLbls(dta, lstMeta))
# unitA unitB unitC someValue
# 1 Letter B small t Apr 912876
# 2 Letter B small q March 293604
# 3 C s Apr 459066
# 4 Letter D p March 332395
# 5 Letter A small q March 650871
# 6 Letter D small q Apr 258017
This assumes that you want to apply the changes to the first columns of the data, as many as there are entries in the metadata list. You might want to include an extra step to convert back to factor, since this approach converts the adjusted columns to character class.
Another remark on factors: you could potentially speed things up by working only on the levels of any factor variables instead of the whole column. The general process would be similar but requires a few more steps to check classes etc.
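A rough sketch of the levels-based idea for a single factor column follows. The helper name `relabel_levels` is hypothetical, and the dictionary is assumed to have the same V1/V2 layout as the entries of lstMeta.

```r
# Hypothetical helper: relabel a factor by editing its levels only
relabel_levels <- function(f, dict) {
  lv  <- levels(f)
  idx <- match(lv, as.character(dict$V1))
  hit <- !is.na(idx)
  lv[hit] <- as.character(dict$V2)[idx[hit]]   # replace only the matched levels
  levels(f) <- lv
  f
}

f <- factor(c("A", "B", "C", "A", "D"))
dict <- data.frame(V1 = c("A", "B", "D"),
                   V2 = c("Letter A", "Letter B", "Letter D"),
                   stringsAsFactors = FALSE)

res <- relabel_levels(f, dict)
as.character(res)
```

The lookup runs once per level rather than once per row, and the column stays a factor.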
You can also try this:
mapr<-function(t,meta){
ind<-match(t,meta$V1)
if(!is.na(ind)){return(meta$V2[ind])}
else{return(t)}}
Then using sapply:
dta<-as.data.frame(cbind(sapply(1:3,function(t,df,meta){sapply(df[,t],mapr,lstMeta[[t]])},dta,lstMeta,simplify = T),dta[,4]))
A couple of mapplys can do the job
f1 <- function(df, lst){
d1 <- setNames(data.frame(mapply(function(x, y) x$V2[match(y, x$V1)], lst, df[1:3]),
df$someValue, stringsAsFactors = FALSE),
names(df))
as.data.frame(mapply(function(x, y) replace(x, is.na(x), y[is.na(x)]), d1, df))
}

In R, how can I copy rows from one dataframe to another when the df being copied to has 2 additional columns?

I have a tab delimited text file with 12 columns that I am uploading to my program. I go on to create another dataframe with a structure similar to the one uploaded and add 2 more columns to it.
excelfile = read.delim(ExcelPath)
matchedPictures<- excelfile[0,]
matchedPictures$beforeName <- character()
matchedPictures$afterName <- character()
Now I have a function in which I do the following:
Based on a condition, I obtain the row number pictureMatchNum of the row I need to copy from excelfile to matchedPictures.
I should then copy the row from excelfile to matchedPictures. I tried a couple of different ways so far.
a.
rowNumber = nrow(matchedPictures) + 1
matchedPictures[rowNumber,1:12] <<- excelfile[pictureMatchNum,1:12]
b.
matchedPictures[rowNumber,1:12] <<- rbind(matchedPictures, excelfile[pictureWordMatches,1:12], make.row.names = FALSE)
2a. doesn't seem to work because it copies the indices from excelfile and uses them as row names in matchedPictures, which is why I decided to go with rbind.
2b. doesn't seem to work because rbind needs the columns to be identical and matchedPictures has 2 extra columns.
EDIT START - Including reproducible example.
Here is some reproducible code (with fewer columns and fake data)
excelfile <- data.frame(x = letters, y = words[length(letters)], z= fruit[length(letters)] )
matchedPictures <- excelfile[0,]
matchedPictures$beforeName <- character()
matchedPictures$afterName <- character()
pictureMatchNum1 = match(1, str_detect("A", regex(excelfile$x, ignore_case = TRUE)))
rowNumber1 = nrow(matchedPictures) + 1
pictureMatchNum2 = match(1, str_detect("D", regex(excelfile$x, ignore_case = TRUE)))
rowNumber2 = nrow(matchedPictures) + 1
The 2 options I tried are
2a.
matchedPictures[rowNumber1,1:3] <<- excelfile[pictureMatchNum1,1:3]
matchedPictures[rowNumber1,"beforeName"] <<- "xxx"
matchedPictures[rowNumber1,"afterName"] <<- "yyy"
matchedPictures[rowNumber2,1:3] <<- excelfile[pictureMatchNum2,1:3]
matchedPictures[rowNumber2,"beforeName"] <<- "uuu"
matchedPictures[rowNumber2,"afterName"] <<- "www"
OR
2b.
matchedPictures[rowNumber1,1:3] <<- rbind(matchedPictures, excelfile[pictureMatchNum1,1:3], make.row.names = FALSE)
matchedPictures[rowNumber1,"beforeName"] <<- "xxx"
matchedPictures[rowNumber1,"afterName"] <<- "yyy"
matchedPictures[rowNumber2,1:3] <<- rbind(matchedPictures, excelfile[pictureMatchNum2,1:3], make.row.names = FALSE)
matchedPictures[rowNumber2,"beforeName"] <<- "uuu"
matchedPictures[rowNumber2,"afterName"] <<- "www"
EDIT END
Additionally, I have also seen the suggestions in many places that rather than using empty dataframes, one should have vectors and append data to the vectors and then combine them into a dataframe. Is this suggestion valid when I have so many columns and would need to have 14 separate vectors and copy each one of them individually?
What can I do to make this work?
You could
first determine the row indices of excelfile that match your criteria
extract these rows
then generate the data to fill your columns beforeName and afterName
then append these columns to your new data frame
Example:
library(stringr) # provides str_detect(), regex(), and the words and fruit vectors
excelfile <- data.frame(x = letters, y = words[length(letters)],
z = fruit[length(letters)])
## Vector of patterns:
patternVec <- c("A", "D", "M")
## Look for appropriate rows in file 'excelfile':
indexVec <- vapply(patternVec,
function(myPattern) which(str_detect(myPattern,
regex(excelfile$x, ignore_case = TRUE))), integer(1))
## Extract these rows:
matchedPictures <- excelfile[indexVec,]
## Somehow generate the data for columns 'beforeName' and 'afterName':
## I do not know how this information is generated so I just insert
## some dummy code here:
beforeNameVec <- c("xxx", "uuu", "mmm")
afterNameVec <- c("yyy", "www", "nnn")
## Then assign these variables:
matchedPictures$beforeName <- beforeNameVec
matchedPictures$afterName <- afterNameVec
matchedPictures
# x y z beforeName afterName
# a air dragonfruit xxx yyy
# d air dragonfruit uuu www
# m air dragonfruit mmm nnn
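On the question of growing an empty data frame row by row: if the rows really have to be found one at a time, one common alternative is to collect complete one-row data frames in a list and bind them once at the end, so that no per-column vectors are needed. A base R sketch with a small made-up stand-in for excelfile; the `picks` list and its values are purely illustrative.

```r
# Hypothetical stand-in for the uploaded file
excelfile <- data.frame(x = letters[1:5], y = 1:5, stringsAsFactors = FALSE)

# Rows to copy, together with the extra values that belong to each of them
picks <- list(
  list(row = 1, beforeName = "xxx", afterName = "yyy"),
  list(row = 4, beforeName = "uuu", afterName = "www")
)

# Build one complete one-row data frame per pick ...
pieces <- lapply(picks, function(p) {
  data.frame(excelfile[p$row, ], beforeName = p$beforeName,
             afterName = p$afterName, stringsAsFactors = FALSE)
})

# ... and bind them once at the end
matchedPictures <- do.call(rbind, pieces)
rownames(matchedPictures) <- NULL   # drop the row names inherited from excelfile
matchedPictures
```

Because every piece already has all columns, rbind does not complain about mismatched columns, and resetting the row names removes the indices carried over from excelfile.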
You can make this much simpler by using dplyr:
library(dplyr)
library(stringr)
excelfile <- data.frame(x = letters, y = words[length(letters)], z= fruit[length(letters)],
stringsAsFactors = FALSE ) #add stringsAsFactors to have character columns
pictureMatch <- excelfile %>%
#create a match column
mutate(match = ifelse(str_detect(x,"a") | str_detect(x,'d'),1,0)) %>%
#filter to only the rows that match your condition
filter(match ==1)
pictureMatch <- pictureMatch[['x']] #convert to a vector
matchedPictures <- excelfile %>%
filter(x %in% pictureMatch) %>% #grab the rows that match your condition
mutate(beforeName = c('xxx','uuu'), #add your names
afterName = c('yyy','www'))

how to split fields after reading the file in R

I have a file with this format in each line:
f1,f2,f3,a1,a2,a3,...,an
Here f1, f2, and f3 are fixed fields separated by commas, and the fourth field is the whole of a1,a2,...,an, where n can vary.
How can I read this into R and conveniently store those variable-length a1 to an?
Thank you.
My file looks like the following
3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...
It is not clear what you mean by "conveniently store". If you think a data frame will suit you, try this:
df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
Edit following @Ananda Mahto's comment.
From ?read.table:
"The number of data columns is determined by looking at the first five lines of input".
Thus, if the maximum number of columns with data occurs somewhere after the first five lines, the solution above will fail.
Example of failure
# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")
# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df
Solution
# This can be solved by first counting the maximum number of columns in the text file
ncol <- max(count.fields("df", sep = ","))
# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE,
col.names = paste0("f", seq_len(ncol)))
df
# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
df
#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind)
#
# Output
#
V1 V2 V3 V4
1 3 a -4 news, finance
2 2 b 1 politics
3 1 a 0
4 2 c 2 book, movie
A place to start:
dat <- readLines(file) ## file being your file
dat_split <- strsplit(dat, ",", fixed = TRUE) ## one character vector of fields per line
df <- data.frame(
f1=sapply(dat_split, "[[", 1),
f2=sapply(dat_split, "[[", 2),
f3=sapply(dat_split, "[[", 3),
a=unlist( sapply(dat_split, function(x) {
if (length(x) <= 3) {
return(NA)
} else {
return(paste(x[4:length(x)], collapse=","))
}
}) )
)
and when you need to pull things out of a, you can do splitting as necessary.
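Pulling the variable-length part back out later could look like the sketch below. It assumes the collapsed column is called a, as above, and that rows without extra fields hold NA; the small data frame is made up for illustration.

```r
df <- data.frame(f1 = c(3, 2, 1),
                 a  = c("news,finance", "politics", NA),
                 stringsAsFactors = FALSE)

# One character vector per row; rows without extra fields get an empty vector
a_list <- lapply(df$a, function(s) {
  if (is.na(s)) character(0) else strsplit(s, ",", fixed = TRUE)[[1]]
})

a_list[[1]]
lengths(a_list)
```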

Resources