Read data with space character in R - r

Usually, read.table will solve many data input problems personally. Like this one:
China 2 3
USA 1 4
Sometimes, the data can madden people, like:
Chia 2 3
United States 3 4
So the read.table cannot work, and any assistance is appreciated.
P.S. the format of data file is .dat

First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Using the above read in the data, insert commas between fields and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last string of spaces with comma twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
ADDED second solution. Minor improvements.

If the column seperator 'sep' is indeed a whitespace, it logically cannot differentiate between spaces in a name and spaces that actually seperate between columns. I'd suggest to change your country names to single strings, ie, strings without spaces. Alternatively, use semicolons to seperate between your data colums and use:
data <- read.table(foo.dat, sep= ";")
If you have many rows in your .dat file, you can consider using regular expressions to find spaces between the columns and replace them with semicolons.

Related

Split columns in a dataframe into a column that contains text not numbers and a column that contains numbers not text in R

Here is a simplified version of data I am working with:
a<-c("There are 5 programs", "2 - adult programs, 3- youth programs","25", " ","there are a number of programs","other agencies run our programs")
b<-c("four", "we don't collect this", "5 from us, more from others","","","")
c<-c(2,6,5,8,2,"")
df<-cbind.data.frame(a,b,c)
df$c<-as.numeric(df$c)
I want to keep both the text and numbers from the data b/c some of the text is important
expected output:
What I think makes sense is the following:
id all columns that have text in them, perhaps in a list (because some columns are just numbers)
subset columns from step 1 to a new dataframe lets call this df1
delete the subsetted columns in df1 from df
split all the columns in df1 into 2 columns, one that keeps the text and one that has the number.
bind the new spit columns from df1 into the orginal df
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expression:
library(stringr)
library(dplyr)
df %>%
mutate(
a.text = gsub("(^|\\s)\\d+", "", a),
a.num = str_extract_all(a, "\\d+"),
b.text = gsub("(^|\\s)\\d+", "", b),
b.num = str_extract_all(b, "\\d+")
) %>%
select(c(4:7,3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
stringr::str_squish()
setDT(df)[, lapply(
.SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings which consist of contiguous digits (aka integer numbers), coerces them to type integer, and computes the rowwise sum of the extracted numbers (as requested in OP's last sentence).
txt() is a function which removes all strings which consist of contiguous digits (aka integer numbers), removes whitespace from start and end of the strings and reduces repeated whitespace inside the strings.
\(x) is a new shortcut for function(x) introduced with R version 4.1
The next steps implement OP's proposed approach in data.table syntax, by and large:
lapply(.SD, ...) loops over each column of df.
if the column is character both functions txt() and num() are applied. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here as it would return a character matrix.
if the column is non-character it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.

How can I read a double-semicolon-separated .txt in r?

I have this problem but in r:
How can I read a double-semicolon-separated .csv with quoted values using pandas?
The solution there is to drop the additional columns generated. I'd like to know if there's a way to read the file separated by ;; without generating those addiotional columns.
Thanks!
Read it in normally using read.csv2 (or whichever variant you prefer, including read.table, read.delim, readr::read_csv2, data.table::fread, etc), and then remove the even-numbered columns.
dat <- read.csv2(text = "a;;b;;c;;d\n1;;2;;3;;4")
dat
# a X b X.1 c X.2 d
# 1 1 NA 2 NA 3 NA 4
dat[,-seq(2, ncol(dat), by = 2)]
# a b c d
# 1 1 2 3 4
It is usually recommended to properly clean your data before attempting to parse it, instead of cleaning it WHILE parsing, or worse, AFTER. Either use Notepad++ to Replace all ;; occurences or R itself, but do not delete the original files (also a rule of thumb - never delete sources of data).
my.text <- readLines('d:/tmp/readdelim-r.csv')
cleaned <- gsub(';;', ';', my.text)
writeLines(cleaned, 'd:/tmp/cleaned.csv')
my.cleaned <- read.delim('d:/tmp/cleaned.csv', header=FALSE, sep=';')

R: read table with multiple separators and parantheses?

I have received this type of table, available also here
I wonder, how to efficiently open the table in R?
My output should be splitted in 3 separate columns, and without the parentheses: :
id type V1
1 13242924 'SA' 1
2 13035909 'SA' 1
3 6685553 'SA' 1
4 12990163 'SA' 1
For now, I was thinking to split it in few steps:
open the file as .csv with \t separator,
use multiple gsub() to replace both parentheses,
split the first column in two, etc.
Isn't there a simpler way? Also, seems that optim$V1 <- gsub('(', "", optim$V1) simply does not remove parantheses.
df<- read.csv("C:/sample.csv",
sep = "\t",
header = F)
# Replace the parantheses:
optim$V1 <- gsub('(', "", optim$V1)
One way would be to read it in using readLines(), clean it up and then read it in again with read.table().
txt <- readLines("C:/sample.csv")
read.table(text = gsub("[()\",\t]", " ", txt))
V1 V2 V3
1 13242924 SA 1
2 13035909 SA 1
3 6685553 SA 1
4 12990163 SA 1
5 13243126 SA 1
6 12941091 SA 1
7 12939233 SA 1
8 13242835 SA 1
9 6685130 SA 1
Here you go, (Not sure if we can do something while data load, but below method you definitely use post data load)
library(dplyr)
(d <- tibble(id = c("(123","(24"),
type = c("'sa')", "'sa')")))
d %>% mutate_at(vars(id, type), ~str_remove_all(.x, pattern = "\\(|\\)"))
using base R
gsub("\\(", "", d$id)
Note: you need to use escape character for parentheses. see here.

R - Parse text file as fscanf ? [duplicate]

This question already has answers here:
How to transform a key/value string into distinct rows?
(2 answers)
Closed 4 years ago.
I have a large text file that I want to import in R with multimodal data encoded as such :
A=1,B=1,C=2,...
A=2,B=1,C=1,...
A=1,B=2,C=1,...
What I'd like to have is a dataframe similar to this :
A B C
1 1 2
2 1 1
1 2 1
Because the column name is being repeated over and over for each row, I was wondering if there was a way import that text file with a fscanf functionality that would parse the A, B, C column names such as "A=%d,B=%d,C=%d,...."
Or maybe there's a simpler way using read.table or scan ? But I couldn't figure out how.
Thanks for any tip
1) read.pattern read.pattern in the gsubfn package is very close to what you are asking. Instead of %d use (\\d+) when specifying the pattern. If the column names are not important the col.names argument could be omitted.
library(gsubfn)
L <- c("A=1,B=1,C=2", "A=1,B=1,C=2", "A=1,B=1,C=2") # test input
pat <- "A=(\\d+),B=(\\d+),C=(\\d+)"
read.pattern(text = L, pattern = pat, col.names = unlist(strsplit(pat, "=.*?(,|$)")))
giving:
A B C
1 1 1 2
2 1 1 2
3 1 1 2
1a) percent format Just for fun we could implement it using exactly the format given in the question.
fmt <- "A=%d,B=%d,C=%d"
pat <- gsub("%d", "(\\\\d+)", fmt)
Now run the read.pattern statement above.
2) strapply Using the same input and the gsubfn package, again, an alternative is to pull out all strings of digits eliminating the need for the pat shown in (1) reducing the pattern to just "\\d+".
DF <- strapply(L, "\\d+", as.numeric, simplify = data.frame)
names(DF) <- unlist(strsplit(L[1], "=.*?(,|$)"))
3) read.csv Even simpler is this base only solution which deletes the headings and reads in what is left setting the column names as in the prior solution. Again, omit the col.names argument if column names are not important.
read.csv(text = gsub("\\w*=", "", L), header = FALSE,
col.names = unlist(strsplit(L[1], "=.*?(,|$)")))

Using R to list and mark multiple csv files with characters from the title of those files, and put those in a dataframe

I have a large number of files that are all numbered and labeled from a CTD cast. These files all contain 3 columns, for bottle number fired, Depth, and Conductivity, and 3 rows, one for each water bottle fired.
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547
These files are named after the cast number as such "OS1505xxx.csv", where the xxx is the cast number. I would like to take the data from multiple casts, label the data with the cast number(which I presume would go in another column for each bottle sample), and then merge that data together in one dataframe.
1,68.93,0.2123,001
2,14.28,0.3139,001
3,8.683,0.3547,001
1,109.5,0.2062,002
2,27.98,0.4842,002
3,5.277,0.3705,002
One other thing, some files only have 1 or 2 bottles fired, While others also have 4 bottles fired. I tried finding files with only 3 rows and making a list of the filenames repeated three times, and then mergeing that with the binded csv files that had three rows into a dataframe but I am very new to R and couldn't figure it out. Any help is appreciated.
This gets all of them into one data frame in order (001-100), and from there you can export it however you want.
df <- data.frame(matrix(ncol = 4, nrow = 1))
colnames(df) <- c("V1", "V2", "V3", "file")
for(i in 1:100) {
file_name <- paste("OS1505",as.name(sprintf("%03d", i)),".csv",sep="")
if(file.exists(file_name)) {
print("match found")
df_tmp <- read.csv(file_name, header = FALSE, sep = ",",fill = TRUE)
df_tmp$file <- sprintf("%03d", i)
df <- rbind(df, df_tmp)
}
}
Try this:
files <- list.files(pattern="OS1505")
lst <- lapply(files, read.csv)
ids <- substr(files, 7,9)
for(i in 1:length(lst)) lst[[i]][,4] <- ids[i]
do.call(rbind, lst)
# X V1 V2 V3
#1 1 1 68.930 001
#2 2 2 14.280 001
#3 3 3 8.683 001
#4 1 1 109.500 002
#5 2 2 27.980 002
#6 3 3 5.277 002
We start by first creating two dummy files to try and save them as csv files to test. I named them in a way to match your files. (i.e. "OS1505001.csv"):
file1 <- read.table(text="
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547", sep=',')
file2 <- read.table(text="
1,109.5,0.2062
2,27.98,0.4842
3,5.277,0.3705", sep=',')
write.csv(file1, "OS1505001.csv")
write.csv(file2, "OS1505002.csv")
Going through the code, files checks the directory for any files that have OS1505 in them. There are two files that match that description "OS1505001.csv" "OS1505002.csv". We bring those two files into R with read.csv. It is wrapped in lapply so that the process can happen to all of the files in the files vector at once and saved in a list called lst. Now ids is a way to grab the id numbers from the file names. In a for loop we assign each id to the 4th column of the data frames. Lastly, do.call brings it all together with the rbind function.

Resources