R: read table with multiple separators and parentheses?

I have received a table in this format (also available here).
How can I efficiently read the table into R?
My output should be split into 3 separate columns, without the parentheses:
id type V1
1 13242924 'SA' 1
2 13035909 'SA' 1
3 6685553 'SA' 1
4 12990163 'SA' 1
For now, I was thinking of doing it in a few steps:
open the file as .csv with a \t separator,
use multiple gsub() calls to replace both parentheses,
split the first column in two, etc.
Isn't there a simpler way? Also, it seems that optim$V1 <- gsub('(', "", optim$V1) simply does not remove the parentheses.
optim <- read.csv("C:/sample.csv",
sep = "\t",
header = FALSE)
# Try to replace the parentheses (this is the call that does not work):
optim$V1 <- gsub('(', "", optim$V1)

One way would be to read it in using readLines(), clean it up and then read it in again with read.table().
txt <- readLines("C:/sample.csv")
read.table(text = gsub("[()\",\t]", " ", txt))
V1 V2 V3
1 13242924 SA 1
2 13035909 SA 1
3 6685553 SA 1
4 12990163 SA 1
5 13243126 SA 1
6 12941091 SA 1
7 12939233 SA 1
8 13242835 SA 1
9 6685130 SA 1
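A self-contained sketch of the same idea, which also names the columns as in the question. The two sample lines are an assumption about the file layout (the original file is not reproduced here): an (id, 'type') pair, a tab, then a count.

```r
# Hypothetical sample lines standing in for readLines("C:/sample.csv");
# the exact layout of the real file is an assumption.
txt <- c("(13242924, 'SA')\t1",
         "(13035909, 'SA')\t1")

# Turn parentheses, double quotes, commas and tabs into spaces,
# so that whitespace is the only separator left.
clean <- gsub("[()\",\t]", " ", txt)

# The single quotes around 'SA' are consumed by read.table's default
# quote setting, so the type column comes out as plain SA.
df <- read.table(text = clean, col.names = c("id", "type", "V1"),
                 stringsAsFactors = FALSE)
df
```

Passing col.names here saves a separate names(df) <- ... step after reading.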

Here you go. (I am not sure whether this can be done while loading the data, but the method below definitely works after the data is loaded.)
library(dplyr)
library(stringr) # provides str_remove_all()
(d <- tibble(id = c("(123","(24"),
type = c("'sa')", "'sa')")))
d %>% mutate_at(vars(id, type), ~str_remove_all(.x, pattern = "\\(|\\)"))
Using base R:
gsub("\\(", "", d$id)
Note: you need to escape the parentheses with a backslash, because ( and ) are regex metacharacters. See here.
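The call in the question does not work because a bare "(" opens a group in a regular expression. Besides escaping, base R offers two other ways around this; here is a minimal sketch on a made-up vector (x is only an illustration, not the asker's data):

```r
x <- c("(123", "'sa')")

# fixed = TRUE treats the pattern as a literal string, not a regex.
no_open <- gsub("(", "", x, fixed = TRUE)

# Inside a character class, ( and ) need no escaping,
# so both parentheses can be removed in one call.
no_parens <- gsub("[()]", "", x)

no_open
no_parens
```

The character-class form is handy when several literal characters have to go at once.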

Related

Looping over patterns list to remove them for a string column in R [duplicate]

I have a df with 2 columns, where the second one contains strings with special characters and other characters I want to remove.
The problem
I have written a for loop that works, but only after being executed three times!
Libraries & Data
library(tidyverse)
client_id <- 1:10
client_name <- c("name5", "-name", "name--", "name-µ", "name²", "name31", "7name8", "name514", "²name8")
df <- data.frame(cbind(client_id, client_name))
Patterns to be removed
patterns <- list("-", "--", "[:digit:]", "[:cntrl:]" , "µ" , "²" , "[:punct:]")
What I have done
To remove the unwanted patterns in column 2 (client_name) I have written the following for loop:
for(ptrn in patterns) {
df <- df %>%
mutate(client_name = str_remove(df$client_name, ptrn))
print(ptrn) # progress
}
The above for loop removes all unwanted patterns, but only after being executed three times.
How can we fix that so that all unwanted patterns are removed on the first execution?
Should I nest the above for loop inside another one in order to iterate over client_name[i]?
Thanks
This is a more straightforward method:
Instead of making a list of all unwanted characters you can str_extract all and only the wanted ones, which, in your case, are the (Roman) alphabetic characters:
library(stringr)
df %>%
mutate(client_name = str_extract(client_name,"[A-Za-z]+"))
client_id client_name
1 1 name
2 2 name
3 3 name
4 4 name
5 5 name
6 6 name
7 7 name
8 8 name
9 9 name
10 10 name
You can collapse the patterns in one regex pattern and use str_remove_all to remove all the occurrences of it.
library(dplyr)
library(stringr)
ptrn <- paste0(patterns, collapse = '|')
df <- df %>% mutate(client_name = str_remove_all(client_name, ptrn))
df
# client_id client_name
#1 1 name
#2 2 name
#3 3 name
#4 4 name
#5 5 name
#6 6 name
#7 7 name
#8 8 name
#9 9 name
data
client_id <- 1:9
client_name <- c("name5", "-name", "name--", "name-µ", "name²", "name31", "7name8", "name514", "²name8")
df <- data.frame(client_id, client_name)
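As for why the original loop needed several passes: str_remove() removes only the first match in each string, whereas str_remove_all() removes every match. The base R pair sub()/gsub() behaves the same way, so the difference can be sketched without extra packages (the sample strings are ASCII-only values from the question; the combined pattern is just for illustration):

```r
x <- c("name514", "7name8", "name--")

# sub() replaces only the first match per string, like str_remove().
first_only <- sub("[[:digit:]]|-", "", x)

# gsub() replaces every match per string, like str_remove_all().
all_matches <- gsub("[[:digit:]]|-", "", x)

first_only
all_matches
```

With a single pass of the first-match variant, strings such as "name514" still contain characters to remove, which is why repeated runs of the loop eventually cleaned everything.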

Skipping rows gets rid of necessary colnames?

I have a data frame with some metadata in the first 3 rows, which I need to skip. But doing so also affects the column names of the value columns.
What can I do to avoid opening every CSV in Excel and deleting these rows manually?
This is how the CSV looks when opened in Excel:
In R, I'm using this command to open it:
android_per <- fread("...\\Todas las adquisiciones de dispositivos de Versión de Android PE.csv",
skip = 3)
And it looks like this:
UPDATE 1:
Similar logic to @G5W's answer, but I think there needs to be a step that squashes the header, which is spread over 2 rows, back into one. E.g.:
txt <- "Some, utter, rubbish,,
Even more rubbish,,,,
,,Col_3,Col_4,Col_5
Col_1,Col_2,,,
1,2,3,4,5
6,7,8,9,0"
## below line writes a file - uncomment if you're happy to do so
##cat(txt, file="testfile.csv", "\n")
header <- apply(read.csv("testfile.csv", nrows=2, skip=2, header=FALSE),
2, paste, collapse="")
read.csv("testfile.csv", skip=4, col.names=header, header=FALSE)
Output:
# Col_1 Col_2 Col_3 Col_4 Col_5
#1 1 2 3 4 5
#2 6 7 8 9 0
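The same header-squashing step can be sketched without writing a file, by passing the string through the text= argument. This variant assumes, as in the example above, that each column name appears in exactly one of the two header rows; colClasses = "character" is added so that empty header cells are read as "" rather than NA.

```r
txt <- "Some, utter, rubbish,,
Even more rubbish,,,,
,,Col_3,Col_4,Col_5
Col_1,Col_2,,,
1,2,3,4,5
6,7,8,9,0"

# Read only the two header rows, keeping empty cells as empty strings.
hdr <- read.csv(text = txt, skip = 2, nrows = 2, header = FALSE,
                colClasses = "character")

# Glue the two rows together column by column.
header <- paste0(unlist(hdr[1, ]), unlist(hdr[2, ]))

# Read the data rows and attach the combined header.
df <- read.csv(text = txt, skip = 4, header = FALSE, col.names = header)
df
```

For a real file, text = txt would be replaced by the file path in both read.csv() calls.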
Here is one way to do it. Read the file simply as lines of text. Eliminate the lines that you don't want, then read the remaining good part into a data.frame.
Sample csv file (I saved it as "Temp/Temp.csv")
Col_1,Col_2,Col_3,Col_4,Col_5
Some utter rubbish,,,,
Presumably documentation,,,,
1,2,3,4,5
6,7,8,9,0
Code
CSV_Lines = readLines("Temp/Temp.csv")
CSV_Lines = CSV_Lines[-(2:3)]
DF = read.csv(text=CSV_Lines)
Col_1 Col_2 Col_3 Col_4 Col_5
1 1 2 3 4 5
2 6 7 8 9 0
It skipped the unwanted lines and got the column names.
If you use skip = 3, you definitely lose the column names, with no way to get them back in R. An ugly hack could be to use skip = 2, which makes sure that all columns except the first 2 are named correctly.
df <- read.csv('csv_name.csv', skip = 2, header = TRUE) # read.csv, since the file is comma-separated
The headers of the first 2 columns end up in the first data row, so you can do
names(df)[1:2] <- as.character(unlist(df[1, 1:2]))
You probably then need to shift all the rows 1 step up to get the data frame as intended.
If you set header to FALSE instead, you can use the code below:
df<-fread("~/Book1.csv", header = F, skip = 2)
shift_up <- function(x, n){
c(x[-(seq(n))], rep(NA, n))
}
df[1,1]<-df[2,1]
df[1,2]<-df[2,2]
df<-df[-2,]
names(df)<-as.character(df[1,])
df<-df[-1,]

R remove multiple text strings in data frame

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define this list of words as a string, and use gsub to remove. Then convert back to a dataframe and maintain same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a)
but clearly this doesn't work. Afterwards I would convert the result back to a data frame.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know
Another option using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
mutate(text = str_remove_all(text, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = TRUE)))
Because lowercase 'ai' could easily be part of a longer word, the words to remove are wrapped in \\b (word boundaries) so that they are not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = TRUE) in case some words are capitalized in the text string.
str_replace_all() could be used if you wanted to replace the words with something else rather than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = TRUE)))
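One side effect of the sapply() versions is that every column is converted to character. A base R sketch that touches only the character columns, and so keeps id and time numeric, could look like this (the data frame is rebuilt from the question's table; trimws() is an extra step added here to drop the leftover spaces):

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

dat <- data.frame(id = 1:4,
                  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
                  time = c(10, 5, 15, 0),
                  username = c("me", "you", "everyone", "know"),
                  stringsAsFactors = FALSE)

# One alternation, wrapped in word boundaries so only whole words match.
pattern <- paste0("\\b(", paste(wordstoremove, collapse = "|"), ")\\b")

# Apply the replacement to character columns only; other columns are untouched.
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], function(x)
  trimws(gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)))
dat
```

Assigning into dat[is_chr] keeps the result a data frame with the same column order, so no conversion back is needed.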

Read data with space character in R

Usually, read.table solves most of my data input problems. Like this one:
China 2 3
USA 1 4
Sometimes, though, the data can be maddening, like:
Chia 2 3
United States 3 4
Here read.table does not work, and any assistance is appreciated.
P.S. The data file format is .dat.
First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Using the above read in the data, insert commas between fields and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last string of spaces with comma twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
If the column separator 'sep' is indeed whitespace, read.table logically cannot differentiate between spaces in a name and spaces that actually separate columns. I'd suggest changing your country names to single strings, i.e., strings without spaces. Alternatively, use semicolons to separate your data columns and use:
data <- read.table("foo.dat", sep = ";")
If you have many rows in your .dat file, you can consider using regular expressions to find the spaces between the columns and replace them with semicolons.
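A minimal sketch of that regular-expression idea, under the assumption that every column after the name is numeric and that no word inside a name starts with a digit: only runs of spaces that are followed by a digit are turned into semicolons. The sample lines are the ones from the question.

```r
# Lines as they would come from readLines() on the .dat file.
L <- c("Chia 2 3", "United States 3 4")

# Replace a run of spaces with ";" only when a digit follows it;
# the lookahead (?=[0-9]) requires perl = TRUE.
L2 <- gsub(" +(?=[0-9])", ";", L, perl = TRUE)

DF <- read.table(text = L2, sep = ";", stringsAsFactors = FALSE)
DF
```

If names can contain digits (or the data columns are not numeric), this heuristic no longer holds and an approach that counts fields from the end of the line is safer.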

Is there a way to use read.csv to read from a string value rather than a file in R?

I'm writing an R package where the R code talks to a Java application. The Java application outputs a CSV formatted string and I want the R code to be able to directly read the string and convert it into a data.frame.
Editing a 7-year-old answer: by now, this is much simpler thanks to the text= argument, which has been added to read.csv() and the like:
R> data <- read.csv(text="flim,flam
+ 1.2,2.2
+ 77.1,3.14")
R> data
flim flam
1 1.2 2.20
2 77.1 3.14
R>
Yes, look at the help for textConnection(). A very powerful notion in R is that essentially all readers (e.g. read.table() and its variants) access a connection object, which may be a file, a remote URL, a pipe coming in from another app, or ... some text, as in your case.
The same trick is used for so-called here documents:
> lines <- "
+ flim,flam
+ 1.2,2.2
+ 77.1,3.14
+ "
> con <- textConnection(lines)
> data <- read.csv(con)
> close(con)
> data
flim flam
1 1.2 2.20
2 77.1 3.14
>
Note that this is a simple way for building something but it is also costly due to the repeated parsing of all the data. There are other ways to get from Java to R, but this should get you going quickly. Efficiency comes next...
Note that in now-current versions of R, you no longer need the textConnection(), it's possible to simply do this:
> states.str='"State","Abbreviation"
+ "Alabama","AL"
+ "Alaska","AK"
+ "Arizona","AZ"
+ "Arkansas","AR"
+ "California","CA"'
> read.csv(text=states.str)
State Abbreviation
1 Alabama AL
2 Alaska AK
3 Arizona AZ
4 Arkansas AR
5 California CA
Yes. For example:
string <- "this,will,be\na,data,frame"
x <- read.csv(con <- textConnection(string), header=FALSE)
close(con)
#> x
# V1 V2 V3
#1 this will be
#2 a data frame
Suppose you have a file called tommy.csv (yes, imaginative, I know...) that has the contents of
col1 col2 \n 1 1 \n 2 2 \n 3 3
where the lines are separated by the escape sequence "\n".
This file can be read with the help of the allowEscapes argument of read.table.
> read.table("tommy.csv", header = TRUE, allowEscapes = TRUE)
col1 col2
1 col1 col2
2 1 1
3 2 2
4 3 3
It's not perfect (modify column names...), but it's a start.
Using a tidyverse approach, you can just specify a text value
library(readr)
read_csv(file = "col1, col2\nfoo, 1\nbar, 2")
# A tibble: 2 x 2
col1 col2
<chr> <dbl>
1 foo 1
2 bar 2
This function wraps Dirk's answer into a convenient form. It's brilliant for answering questions on SO, where the asker has just dumped the data onscreen.
text_to_table <- function(text, ...)
{
dfr <- read.table(tc <- textConnection(text), ...)
close(tc)
dfr
}
To use it, first copy the onscreen data and paste into your text editor.
foo bar baz
1 2 a
3 4 b
Now wrap it with text_to_table, quotes and any other arguments for read.table.
text_to_table("foo bar baz
1 2 a
3 4 b", header = TRUE)
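With the text= argument mentioned in the other answers, the same helper can be sketched without managing a connection by hand. This is a variant under that assumption, not the original author's version:

```r
# Same interface as above, but read.table() opens and closes
# the text connection itself.
text_to_table <- function(text, ...) {
  read.table(text = text, ...)
}

dfr <- text_to_table("foo bar baz
1 2 a
3 4 b", header = TRUE, stringsAsFactors = FALSE)
dfr
```

Any other read.table() argument (sep, col.names, and so on) is still passed through via the dots.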

Resources