I need to extract data from the text below (this is just a sample):
text <- c(" 9 A 1427107 -",
" 99 (B) 3997915 -",
" 999 (SOCIO) 7161315 -",
" 9999 #M 4035115 -",
" 99999 01 Z 2136481035115 8,621"
)
So far I have tried the following, but I could not create a pattern for all the columns:
as.numeric(gsub("([0-9]+).*$", "\\1",text))
I want my data frame output to look like this:
row_names Text ID Amount
9 A 1427107 -
99 (B) 3997915 -
999 (SOCIO) 7161315 -
9999 #M 4035115 -
99999 01 Z 2136481035115 8,621
row_names are all numbers; "Text" contains numbers and text.
The ID column contains numbers with 7 to 13 digits.
Amount is either a "-" or a number with thousands separators (,).
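We can use read.table to read the data into a data.frame. Note that with fill = TRUE the last row, whose Text value contains a space, is split into five fields, so its columns will not line up with the others: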
df1 <- read.table(text = text, header = FALSE, fill = TRUE)
Or using extract from tidyr:
library(tibble)
library(tidyr)
tibble(col1 = trimws(text)) %>%
extract(col1, into = c('rn', 'Text', 'ID', 'Amount'),
'^(\\d+)\\s+(.*)\\s+(\\d+)\\s+([-0-9,]+)', convert = TRUE)
In base R, we can use strcapture and provide the pattern and type of data to extract.
strcapture('\\s+(\\d+)\\s(.*?)\\s+(\\d+)\\s(.*)', text,
proto=list(row_names=integer(), Text=character(),
ID = numeric(), Amount = character()))
# row_names Text ID Amount
#1 9 A 1427107 -
#2 99 (B) 3997915 -
#3 999 (SOCIO) 7161315 -
#4 9999 #M 4035115 -
#5 99999 01 Z 2136481035115 8,621
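The Amount column comes back as character because it mixes "-" with numbers such as "8,621". If a numeric column is needed, a minimal sketch could look like the following. It assumes that "-" means "no amount" and should become NA; parse_amount is a hypothetical helper name, and the last field is pulled out with a plain sub() call so the example does not depend on the parsing step above.

```r
text <- c(" 9 A 1427107 -",
          " 99 (B) 3997915 -",
          " 999 (SOCIO) 7161315 -",
          " 9999 #M 4035115 -",
          " 99999 01 Z 2136481035115 8,621")

# Take everything after the last whitespace character, i.e. the Amount field.
amount_chr <- sub("^.*\\s", "", text)

# Hypothetical helper: "-" becomes NA (an assumption about its meaning),
# thousands separators are dropped, and the rest is converted to numeric.
parse_amount <- function(x) {
  x <- trimws(x)
  x[x == "-"] <- NA
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

amount_num <- parse_amount(amount_chr)
amount_num
```

The same helper can be applied to the Amount column of whichever data frame the approaches above produce.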
I have a file separated by semicolons in which one of the character variables contains a semicolon inside its values. The readr::read_csv2 function splits the contents of those values into additional columns, messing up the formatting of the file.
For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.
File:
name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?
read.csv()
You can use the read.csv() function, but there will be some warning messages (which you can silence by wrapping the read.csv() call in suppressWarnings()). If you wish to avoid the warnings altogether, use the scan() method in the next section.
library(dplyr)
read.csv("./path/to/your/file.csv", sep = ";",
col.names = c("name", "hobbies", "age", "X4")) %>%
mutate(hobbies = ifelse(is.na(X4), hobbies, paste0(hobbies, ";" ,age)),
age = ifelse(is.na(X4), age, X4)) %>%
select(-X4)
scan() file
You can first scan() the CSV file into a character vector, then split each string on ; and turn the result into a data frame. After that, use mutate() to rebuild the target column and remove the unnecessary one. Finally, use the first row as the column names.
library(tidyverse)
library(janitor)
semicolon_file <- scan(file = "./path/to/your/file.csv", character())
semicolon_df <- data.frame(str_split(semicolon_file, ";", simplify = T))
semicolon_df %>%
mutate(X4 = na_if(X4, ""),
X2 = ifelse(is.na(X4), X2, paste0(X2, ";" ,X3)),
X3 = ifelse(is.na(X4), X3, X4)) %>%
select(-X4) %>%
janitor::row_to_names(row_number = 1)
Output
name hobbies age
2 Jon cooking 38
3 Bill karate;jogging 41
4 Maria fishing 32
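Before picking one of these repairs, it can help to check which lines actually contain an extra separator. A small sketch using count.fields() from base R follows; the lines are the sample file from the question, held in memory through a textConnection so that no file path has to be assumed.

```r
lines <- c("name;hobbies;age",
           "Jon;cooking;38",
           "Bill;karate;jogging;41",
           "Maria;fishing;32")

# Count the ';'-separated fields on every line.
con <- textConnection(lines)
n_fields <- count.fields(con, sep = ";")
close(con)

# Lines whose field count differs from the header are the ones to repair.
bad_rows <- which(n_fields != n_fields[1])
lines[bad_rows]
```

If bad_rows is empty, the file can be read with read.csv2() as usual and none of the workarounds are needed.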
Assuming that the columns name and age have a single entry per observation and hobbies can have multiple entries, the following approach works:
Read in the file line by line instead of treating it as a table:
tmp <- readLines(con <- file("table.csv"))
close(con)
Find the positions of the separator in every row. The entry before the first separator is the name; the entry after the last is the age:
separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")
Fill the three vectors using a for loop:
# the first line holds the column names
for(line in 2:length(tmp)){
# from the beginning of the row to the first";"
name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1)
# between the first ";" and the last ";".
# Each ";"-separated entry becomes a separate element of the list
hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1] +1,
separator_pos[[line]][length(separator_pos[[line]])]-1),";")
#after the last ";", must be an integer
age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])]+1,
nchar(tmp[line])))
}
Create a separate matrix to hold the hobbies and fill it rowwise:
hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies))
hobbies_matrix[line,1:length(hobbies[[line]])] <- hobbies[[line]]
Add all variables to a data.frame:
df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
name hobbies.1 hobbies.2 age
1 Jon cooking <NA> 38
2 Bill karate jogging 41
3 Maria fishing <NA> 32
You could also do:
read.csv(text=gsub('(^[^;]+);|;([^;]+$)', '\\1,\\2', readLines('file.csv')))
name hobbies age
1 Jon cooking 38
2 Bill karate;jogging 41
3 Maria fishing 32
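To see why this one-liner works, here is the same idea run on in-memory lines instead of a file (a sketch; the lines are the sample from the question). The first alternative, (^[^;]+);, matches the first field and its separator, and the second, ;([^;]+$), matches the last separator and the final field. Only those two semicolons are turned into commas, so any semicolon in between survives inside hobbies.

```r
lines <- c("name;hobbies;age",
           "Jon;cooking;38",
           "Bill;karate;jogging;41",
           "Maria;fishing;32")

# Replace only the first and the last ';' on each line with ','.
fixed <- gsub("(^[^;]+);|;([^;]+$)", "\\1,\\2", lines)
fixed[3]

# The result is ordinary comma-separated text.
df <- read.csv(text = fixed)
df
```

This relies on hobbies never containing a comma; if it can, a different replacement separator would be needed.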
Ideally you'd ask whoever generated the file to do it properly next time :) but of course this is not always possible.
Easiest way is probably to read the lines from the file into a character vector, then clean up and make a data frame by string matching.
library(readr)
library(dplyr)
library(stringr)
# skip header, add it later
dataset <- read_lines("your_file.csv", skip = 1)
dataset_df <- data.frame(name = str_match(dataset, "^(.*?);")[, 2],
hobbies = str_match(dataset, ";(.*?);\\d")[, 2],
age = as.numeric(str_match(dataset, ";(\\d+)$")[, 2]))
Result:
name hobbies age
1 Jon cooking 38
2 Bill karate;jogging 41
3 Maria fishing 32
Using the file created in the Note at the end:
1) read.pattern can read this by specifying the pattern as a regular expression with the portions within parentheses representing the fields.
library(gsubfn)
read.pattern("hobbies.csv", pattern = '^(.*?);(.*);(.*)$', header = TRUE)
## name hobbies age
## 1 Jon cooking 38
## 2 Bill karate;jogging 41
## 3 Maria fishing 32
2) Base R Using base R we can read in the lines, put quotes around the middle field and then read it in normally.
L <- "hobbies.csv" |>
readLines() |>
sub(pattern = ';(.*);', replacement = ';"\\1";')
read.csv2(text = L)
## name hobbies age
## 1 Jon cooking 38
## 2 Bill karate;jogging 41
## 3 Maria fishing 32
Note
Lines <- "name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
"
cat(Lines, file = "hobbies.csv")
I'm not even sure how to approach this situation, I'm probably blocked. I have a wide dataframe, something like this
Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16
I would like to export the data to a csv file with the following format
Date Amy John
X Y X Y
March 14 15 10.5 14.5
April 10 11 15 16
My first question is what the best approach to achieve this is. Should I separate Amy_X into Amy and X, and then create a vector of repeated names (Amy, Amy, John, John) and use that as another header? What's the best solution for this scenario?
The question says to output the file to csv but the output shown is not comma-separated values (csv). We show both.
Using input data frame DF defined reproducibly in the Note at the end, create a data frame from the headers and use separate_rows on it and rbind that to DF. Then do any remaining fix ups. Write it out without the row and column names and without quotes. Replace stdout() with your file name.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
names %>%
as.list %>%
as.data.frame %>%
separate_rows(everything()) %>%
setNames(names(DF)) %>%
rbind(DF)
DF2[2, 1] <- DF2[1, duplicated(unlist(DF2[1, ]))] <- ""
output <- capture.output(prmatrix(DF2, quote = FALSE,
rowlab = rep("", nrow(DF2)), collab = rep("", ncol(DF2))))[-1]
writeLines(output, stdout())
giving the following which reproduces the output shown in the question:
Date Amy John
X Y X Y
March 14 15 10.5 14.5
April 10 11 15 16
If you really did want csv, then use this instead of the writeLines call and the statement before it:
write.table(DF2, stdout(), sep = ",", quote = FALSE, row.names = FALSE,
col.names = FALSE)
giving:
Date,Amy,,John,
,X,Y,X,Y
March,14,15,10.5,14.5
April,10,11,15,16
Note
Lines <- "Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)
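The approach sketched in the question (split Amy_X into Amy and X and use the pieces as two header rows) can also be done in base R. The following is a minimal sketch, not the method used above. It assumes every column name has at most one underscore and that columns for the same person are adjacent; out_file is a hypothetical output path.

```r
DF <- data.frame(Date = c("March", "April"),
                 Amy_X = c(14, 10), Amy_Y = c(15, 11),
                 John_X = c(10.5, 15), John_Y = c(14.5, 16))

# Split each column name into the part before and after the underscore.
parts <- strsplit(names(DF), "_", fixed = TRUE)
top <- vapply(parts, function(p) p[1], character(1))
bottom <- vapply(parts, function(p) if (length(p) > 1) p[2] else "", character(1))

# Blank out repeated names so each person appears once in the top row.
top[duplicated(top)] <- ""

out_file <- tempfile(fileext = ".csv")  # hypothetical output path
writeLines(c(paste(top, collapse = ","), paste(bottom, collapse = ",")), out_file)

# Append the data rows below the two header rows.
write.table(DF, out_file, sep = ",", quote = FALSE, row.names = FALSE,
            col.names = FALSE, append = TRUE)

readLines(out_file)
```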
I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is to extract the data from the .txt file into a long-format data frame with columns for participant number (which is included in the file name), subtest, Score, and T-score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple of roadblocks and could use some input on how to navigate them.
1) I only need the information that corresponds to each subtest; these rows all have a number before the subtest name. The rows that contain only one or two words (e.g. cognitive screen) are not necessary, and they seem to interfere with creating new data frames because the columns provided do not match the columns wanted.
Some additional quirks of the data:
1) the asterisks are NOT necessary
2) the cognitive TOTAL will never have a value
I am using the readtext package to import the data at the moment, and I am able to get a data frame with two columns. One is the file name (which includes the participant name), so that problem is solved. However, the other column is a giant character string containing the data points for both Score and T-Score. Presumably I would then need to split this into the columns of interest listed previously.
Next problem, when I view the data the T scores are in the correct order, however the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile, "CAToutput.txt"),
#docvarsfrom = "filenames",
dvsep = " ")
From here I do not know how to split the data. In my head I would do something like this:
data2 <- separate(data, text, sep = " ", into = c("subtest", "score", "t_score"))
This of course, gives the correct column names but removes almost all the data I actually am interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex
Here is a way of converting that text file to a data frame that you can analyse:
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
header,
by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
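For reference, the numbered-row matching can also be sketched in base R with regexec() and regmatches(). The sample lines below are hypothetical: they are modelled on the rows shown in the output above rather than copied from the real file, so the whitespace between fields is an assumption. Lines without a leading number (such as section titles) simply do not match and are dropped, and the trailing asterisk is left out of the T-score.

```r
# Hypothetical lines, modelled on the output above (not the real file).
sample_lines <- c("Cognitive screen",
                  "1. Line Bisection   9   53",
                  "3. Word Fluency   1   56*",
                  "7. Spoken Words   17   45*")

# index, subtest name, score, T-score; an optional trailing '*' is ignored.
pattern <- "^([0-9]+)[. ]+(.*?)\\s+([0-9]+)\\s+([0-9]+)\\*?$"
m <- regmatches(sample_lines, regexec(pattern, sample_lines, perl = TRUE))

# Non-matching lines give character(0); keep only the numbered rows.
keep <- lengths(m) > 0
mat <- do.call(rbind, m[keep])

scores <- data.frame(subtest = mat[, 3],
                     score   = as.integer(mat[, 4]),
                     t_score = as.integer(mat[, 5]),
                     stringsAsFactors = FALSE)
scores
```

A participant column could then be added from the file name, whose exact format is not shown here.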
What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
library(dplyr)
library(stringr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would be good to know something about those unfamiliar data types.
Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)
I have a data frame, like this:
my.tree <- data.frame(Tree=c("Acer campestre", "Abies alba", "Pyrus communis", "Robinia pseudoacacia", "Tilia cordata"),
Freq=c(23,65,47,69,65))
I want to replace all the spaces between words with a period, all at once. I want to create a new data frame (or modify this one) in which the words of each tree name are separated by periods, e.g. Acer.campestre, Abies.alba, Pyrus.communis etc.
Is it possible to replace them all at once, or is there an easier way to make this change?
You can do:
> library(dplyr); mutate(my.tree, Tree = gsub(" ", ".", Tree))
# Tree Freq
#1 Acer.campestre 23
#2 Abies.alba 65
#3 Pyrus.communis 47
#4 Robinia.pseudoacacia 69
#5 Tilia.cordata 65
It might be safer (and more conventional) to use gsub, but you could also use make.names:
make.names(my.tree$Tree)
# [1] "Acer.campestre" "Abies.alba" "Pyrus.communis"
# [4] "Robinia.pseudoacacia" "Tilia.cordata"
Or even chartr:
chartr(" ", ".", my.tree$Tree)
# [1] "Acer.campestre" "Abies.alba" "Pyrus.communis"
# [4] "Robinia.pseudoacacia" "Tilia.cordata"
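A short sketch of why gsub may be the safer choice: make.names does more than swap spaces, because its job is to produce syntactically valid R names. The second and third values below are hypothetical and are there only to show the difference.

```r
x <- c("Acer campestre", "1st tree", "Tilia-cordata")

# gsub with fixed = TRUE touches only the literal space.
with_gsub <- gsub(" ", ".", x, fixed = TRUE)
with_gsub

# make.names also prefixes names that start with a digit and
# replaces other invalid characters such as "-".
with_make_names <- make.names(x)
with_make_names
```

For the tree names in the question both give the same result, since they contain only letters and single spaces.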
You can do:
my.tree$Tree <- gsub(pattern = " ", replacement = ".", x = my.tree$Tree)
> my.tree
# Tree Freq
#1 Acer.campestre 23
#2 Abies.alba 65
#3 Pyrus.communis 47
#4 Robinia.pseudoacacia 69
#5 Tilia.cordata 65
I want to replace every , - ) ( and space with . in the variable DMA.NAME of the example data frame. I referred to three posts and tried their approaches, but all failed:
Replacing column values in data frame, not included in list
R replace all particular values in a data frame
Replace characters from a column of a data frame R
Approach 1
> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."
Approach 2
> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)
Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
argument 'pattern' has length > 1 and only the first element will be used
Approach 3
> c[c == c(" ", ",", "(", ")", "-")] <- "."
Sample data frame
> df
DMA.CODE DATE DMA.NAME count
111 22 8/14/2014 12:00:00 AM Columbus, OH 1
112 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
79 18 7/30/2014 12:00:00 AM Boston (Manchester) 1
99 22 8/20/2014 12:00:00 AM Columbus, OH 1
112.1 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
208 27 7/31/2014 12:00:00 AM Minneapolis-St. Paul 1
I know the problem: gsub takes a single pattern and uses only the first element of the vector. The other two approaches compare each whole value against the listed characters instead of searching within each value for those characters.
You can use the POSIX character classes [:punct:] and [:space:] inside a bracket expression ([...]) like this:
df <- data.frame(
DMA.NAME = c(
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Boston (Manchester)",
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Minneapolis-St. Paul"),
stringsAsFactors=F)
##
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME)
[1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
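If only the five characters listed in the question should be replaced (rather than all punctuation), they can be put directly into a bracket expression. This is a sketch of that variant, showing both a one-for-one replacement and a merged one; the hyphen is placed first inside the brackets so that it is treated as a literal character.

```r
dma <- c("Columbus, OH", "Boston (Manchester)", "Minneapolis-St. Paul")

# Each of - , space ( ) becomes its own period.
one_for_one <- gsub("[-, ()]", ".", dma)
one_for_one

# Runs of those characters collapse into a single period.
merged <- gsub("[-, ()]+", ".", dma)
merged
```

Note that the period in "St." is not in the set, so "Minneapolis-St. Paul" keeps it and ends up with two consecutive periods in both variants, unlike the [:punct:] version above.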
If your data frame is big, you might want to look at this fast function from the stringi package. It replaces every character of a given class with another string. Here the character class is L (letters, written inside {}), and the capital P before {} selects the complement of that set, that is, every non-letter character. The merge argument indicates that consecutive matches should be merged into a single one.
require(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T)
## [1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
And some benchmarks:
x <- sample(df$DMA.NAME, 1000, T)
gsubFun <- function(x){
gsub("[[:punct:][:space:]]+","\\.",x)
}
striFun <- function(x){
stri_replace_all_charclass(x, "\\P{L}",".", T)
}
require(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
expr min lq median uq max neval
gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984 100
striFun(x) 877.259 893.3945 907.769 929.8065 3189.017 100