I'm trying to chop up a text file into the articles it contains. Usually this is done by identifying a pattern each article begins with. Unfortunately the database I downloaded the articles from doesn't have that. The only pattern I can find is that after each article there are 3 empty lines.
How can I identify three consecutive empty lines?
I know that I can find empty lines with:
Beginnings <- grep('^$', Lines.i)
Beginnings then looks like
> Beginnings[1:50]
[1] 1 2 3 6 8 10 12 13 40 41 42 43 45 49 50 51 53 54 62 63 64 65 67
[24] 69 70 110 111 112 113 115 117 121 122 123 125 131 132 133 135 137 138 150 151 152 153 155
[47] 157 158 169 170
You can see that the first article starts after 1 2 3 and the next one after 41 42 43.
So my idea was to just add the newline expression to the pattern
Beginnings <- grep('^$\n^$\n^$\n', Lines.i)
But this does not work. I would be grateful for any suggestions!
You may try rle
which(inverse.rle(within.list(rle(!nzchar(v1)),
values[lengths<3 & values] <- FALSE)))
#[1] 3 4 5 9 10 11 12
data
v1 <- c('ard', 'b', '', '', '', 'rr', '', 'fr', '', '', '', '', 'gh', 'd')
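The one-liner above is dense, so here is the same idea unrolled into named steps, as a sketch rather than a definitive implementation. It also hints at why the grep() attempt with \n cannot match: each element of the vector holds a single line, so no element ever contains a newline. The names empty, runs, sep, id and articles are mine, not from the answer, and the final split() step is an addition that assumes the separator lines should simply be dropped.

```r
v1 <- c('ard', 'b', '', '', '', 'rr', '', 'fr', '', '', '', '', 'gh', 'd')

# TRUE for every empty line
empty <- !nzchar(v1)

# Run-length encode, then demote runs of fewer than 3 empty lines
runs <- rle(empty)
runs$values[runs$values & runs$lengths < 3] <- FALSE

# Expand back to one flag per line: TRUE marks a separator line
sep <- inverse.rle(runs)
which(sep)
#> [1]  3  4  5  9 10 11 12

# Start a new article id wherever a separator run ends (TRUE -> FALSE)
id <- cumsum(c(!sep[1], diff(sep) == -1))
articles <- split(v1[!sep], id[!sep])
```

A single empty line inside an article (such as line 7 here) stays part of that article, because only runs of three or more are treated as separators.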
Here's a solution for extracting the article lines only. Turned out much more complex and cryptic than I'd been hoping, but I'm pretty sure it works. Also, thanks to akrun for the test data.
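v1 <- c('ard','b','','','','rr','','fr','','','','','gh','d')
# Pad with three FALSEs on each side so the first and last articles are delimited too
runs <- rle(c(rep(FALSE, 3), nzchar(v1), rep(FALSE, 3)))
n <- length(runs$lengths)
ends <- cumsum(runs$lengths[-n])  # padded position at which each run (except the last) ends
ind <- data.frame(
  # a run of >= 3 empty lines followed by text: an article starts on the next line
  start = ends[runs$values[-1] & !runs$values[-n] & runs$lengths[-n] >= 3] - 2,
  # text followed by a run of >= 3 empty lines: the article ends on this line
  end = ends[runs$values[-n] & !runs$values[-1] & runs$lengths[-1] >= 3] - 3
)
articles <- lapply(seq_len(nrow(ind)), function(i) v1[ind[i, 'start']:ind[i, 'end']])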
v1;
## [1] "ard" "b" "" "" "" "rr" "" "fr" "" "" "" "" "gh" "d"
ind;
## start end
## 1 1 2
## 2 6 8
## 3 13 14
articles;
## [[1]]
## [1] "ard" "b"
##
## [[2]]
## [1] "rr" "" "fr"
##
## [[3]]
## [1] "gh" "d"
I have a string:
a = c("112 271 [X];313 179 [X];125 162;123 131 [X];124 107")
I want to first split it by semicolon ;
b = as.list(strsplit(a, ";")[[1]])
> b
[[1]]
[1] "112 271 [X]"
[[2]]
[1] "313 179 [X]"
[[3]]
[1] "125 162"
[[4]]
[1] "123 131 [X]"
[[5]]
[1] "124 107"
then I want to split b by space, and save the result as a 3-column data frame.
The result looks like:
A B C
1 112 271 [X]
2 313 179 [X]
3 125 162
4 123 131 [X]
5 124 107
I don't know how to do it. Thanks for your help.
Replace semicolon with newline then fread with fill, and set the column names:
data.table::fread(gsub(";", "\n", a, fixed = TRUE),
fill = TRUE,
col.names = LETTERS[1:3])
# A B C
# 1: 112 271 [X]
# 2: 313 179 [X]
# 3: 125 162
# 4: 123 131 [X]
# 5: 124 107
A base R option using read.table (similar to @zx8754's data.table solution)
> read.table(text = gsub(";", "\n", a), fill = TRUE, col.names = head(LETTERS, 3))
A B C
1 112 271 [X]
2 313 179 [X]
3 125 162
4 123 131 [X]
5 124 107
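The two steps described in the question (split on semicolons, then split each piece on spaces) can also be written out directly in base R. This is only a sketch: the names records, fields and res are illustrative, and padding short records with NA is my assumption about how the missing third field should be represented.

```r
a <- "112 271 [X];313 179 [X];125 162;123 131 [X];124 107"

# Step 1: split on semicolons; step 2: split each record on spaces
records <- strsplit(a, ";", fixed = TRUE)[[1]]
fields <- strsplit(records, " ", fixed = TRUE)

# Pad short records with NA so every row has exactly 3 fields
fields <- lapply(fields, `length<-`, 3)

# Bind the rows into a character matrix, then convert to a data frame
res <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(res) <- c("A", "B", "C")
res
```

All three columns stay character in this version, and the missing third field is stored as NA.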
A tidyverse solution; the two functions separate and separate_rows are from tidyr (which is part of the tidyverse):
library(tidyr)
data.frame(a) %>%
separate_rows(a, sep = ";") %>%
separate(a,
into = c("A","B","C"),
sep = "\\s")
# A tibble: 5 × 3
A B C
<chr> <chr> <chr>
1 112 271 [X]
2 313 179 [X]
3 125 162 NA
4 123 131 [X]
5 124 107 NA
You can also do this with stringr::str_split(). In the example below, I use two consecutive calls to str_split() with simplified outputs to create a character matrix that can then be converted into a data frame.
## Question data --------------------------------------------------
a <- c("112 271 [X];313 179 [X];125 162;123 131 [X];124 107")
require(stringr)
#> Loading required package: stringr
## Split into character matrix ------------------------------------
str_split(a, ";", simplify = TRUE) |>
str_split("[:space:]", simplify = TRUE) |>
## convert to data frame ----------------------------------------
as.data.frame() |>
setNames(c("A", "B", "C"))
#> A B C
#> 1 112 271 [X]
#> 2 313 179 [X]
#> 3 125 162
#> 4 123 131 [X]
#> 5 124 107
Created on 2022-11-16 with reprex v2.0.2
y <- data.frame(x = c("63,98,131","75,109,145","66,104,139"))
I want to make three columns A, B, C from x by splitting on the commas:
A B C
63 98 131
75 109 145
66 104 139
I tried to use str_split
str_split(y$x, " , ")
[[1]]
[1] "63,98,131"
[[2]]
[1] "75,109,145"
[[3]]
[1] "66,104,139"
But this does not do the job. How can I fix it?
> dt <- as.data.frame(matrix(unlist(strsplit(as.character(y$x), ",")), ncol = 3, byrow = TRUE))
> dt
V1 V2 V3
1 63 98 131
2 75 109 145
3 66 104 139
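The str_split() attempt in the question returns the strings unsplit because the pattern " , " requires a space on each side of the comma, and the data has none. Below is a base R sketch of the fix. The names pieces and abc are illustrative, and the integer conversion at the end is an assumption about what the columns should hold.

```r
y <- data.frame(x = c("63,98,131", "75,109,145", "66,104,139"),
                stringsAsFactors = FALSE)

# " , " (with spaces) never matches, so nothing is split
strsplit(y$x, " , ")[[1]]
#> [1] "63,98,131"

# Split on the bare comma and bind the pieces row by row
pieces <- strsplit(y$x, ",", fixed = TRUE)
abc <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(abc) <- c("A", "B", "C")
abc[] <- lapply(abc, as.integer)  # assumption: the columns should be integers
```

Binding with do.call(rbind, ...) avoids having to state the number of columns, but it does assume every string has the same number of comma-separated parts.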
I have a table with a character field that can follow any of these patterns:
input
97 # a single number
210 foo # a number and a word
87 bar 89 # a number, a word, a number
21 23 # two numbers
123 2 fizzbuzz # two numbers, a word
12 fizz 34 buzz # a number, a word, a number, a word
I'd like to split each line into up to 4 parts, containing respectively the first number, the first word if it exists, the second number if it exists, and the second word if it exists. So my example would give:
input            nb_1  word_1  nb_2  word_2
97               97
210 foo          210   foo
87 bar 89        87    bar     89
21 23            21            23
123 2 fizzbuzz   123           2     fizzbuzz
12 fizz 34 buzz  12    fizz    34    buzz
Please note the case of two numbers followed by a word (the second-to-last example): it has nothing in word_1, as there is no word between the two numbers.
Is there a way to do this without a tedious if / if / else structure?
If it can help, all the words belong to a list of 10 specific words. Also, if there are two words, they can be the same or different. Also, the numbers can be one, two or three digits long.
Thanks
Here is an idea using gsub and cSplit from splitstackshape package,
library(splitstackshape)
df$num <- gsub('\\D', ' ', df$V1)
df$wrds <- gsub('\\d', ' ', df$V1)
newdf <- cSplit(df, 2:3, ' ', 'wide')
newdf
#                V1 num_1 num_2   wrds_1 wrds_2
#1:              97    97    NA       NA     NA
#2:         210 foo   210    NA      foo     NA
#3:       87 bar 89    87    89      bar     NA
#4:           21 23    21    23       NA     NA
#5:  123 2 fizzbuzz   123     2 fizzbuzz     NA
#6: 12 fizz 34 buzz    12    34     fizz   buzz
The only problem is row 5, which can be fixed as follows,
newdf$wrds_1 <- as.character(newdf$wrds_1)
newdf$wrds_2 <- as.character(newdf$wrds_2)
newdf$wrds_2[grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)] <- newdf$wrds_1[grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)]
newdf$wrds_1[grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)] <- NA
which finally gives,
newdf
#                V1 num_1 num_2 wrds_1   wrds_2
#1:              97    97    NA     NA       NA
#2:         210 foo   210    NA    foo       NA
#3:       87 bar 89    87    89    bar       NA
#4:           21 23    21    23     NA       NA
#5:  123 2 fizzbuzz   123     2     NA fizzbuzz
#6: 12 fizz 34 buzz    12    34   fizz     buzz
DATA
dput(df)
structure(list(V1 = c("97", " 210 foo", " 87 bar 89",
" 21 23", " 123 2 fizzbuzz",
" 12 fizz 34 buzz")), .Names = "V1", row.names = c(NA,
-6L), class = "data.frame")
Tried in a different way...
library(splitstackshape)
abc <- data.frame(a=c(97,"210 foo","87 bar 89","21 23","123 2 fizzbuzz","12 fizz 34 buzz"))
abc1 <- data.frame(cSplit(abc, "a", " ", stripWhite = FALSE))
abc <- cbind(abc,abc1)
names(abc) <- c("input","nb_1", "word_1", "nb_2","word_2")
abc[,1:5] <-apply(abc[,1:5] , 2, as.character)
for(i in 1:nrow(abc)){
abc$word_2[i] <- replace(abc$word_2[i] , is.na(abc$word_2[i]),abc$nb_2[grepl("[a-z]",abc$nb_2[i])][i])
abc$nb_2[i] <- replace(abc$nb_2[i] , is.na(abc$nb_2[i])|grepl("[a-z]",abc$nb_2[i]),abc$word_1[grepl("[0-9]",abc$word_1[i])][i])
}
abc$word_1 <- ifelse(grepl("[0-9]",abc$word_1),NA,abc$word_1)
abc[is.na(abc)] <- ""
print(abc)
            input nb_1 word_1 nb_2   word_2
1              97   97
2         210 foo  210    foo
3       87 bar 89   87    bar   89
4           21 23   21          23
5  123 2 fizzbuzz  123           2 fizzbuzz
6 12 fizz 34 buzz   12   fizz   34     buzz
This is a hacky function to do it... although you might have other cases that would break it.
f <- function(x){
string2 <- strsplit(x, " ")[[1]]
if (length(string2) < 2)
return(c(string2, NA, NA, NA))
arenums <- grepl("\\d", string2)
c(string2[which(arenums)[1]],
if (arenums[2]) NA else string2[which(!arenums)[1]],
string2[which(arenums)[2]],
if (arenums[2]) string2[which(!arenums)[1]] else string2[which(!arenums)[2]])
}
> f("97")
[1] "97" NA NA NA
> f("210 foo")
[1] "210" "foo" NA NA
> f("87 bar 89")
[1] "87" "bar" "89" NA
> f("21 23")
[1] "21" NA "23" NA
> f("123 2 fizzbuzz")
[1] "123" NA "2" "fizzbuzz"
> f("12 fizz 34 buzz")
[1] "12" "fizz" "34" "buzz"
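Another way to avoid the if/else chain, offered as a sketch only: count how many numbers have been read so far and use that count to decide which slot each token belongs in. A word seen after one number goes to word_1, a word seen after two numbers goes to word_2. The function name split_fields and the variable names are hypothetical, and the sketch assumes every input starts with a number and contains at most two numbers and two words.

```r
split_fields <- function(x) {
  # Tokenize on runs of whitespace, ignoring leading/trailing blanks
  tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  is_num <- grepl("^[0-9]+$", tokens)
  seen <- cumsum(is_num)  # how many numbers have been read up to each token

  out <- c(nb_1 = NA_character_, word_1 = NA_character_,
           nb_2 = NA_character_, word_2 = NA_character_)
  # The k-th number goes to nb_k; a word read after k numbers goes to word_k
  out[paste0("nb_", seen[is_num])] <- tokens[is_num]
  out[paste0("word_", seen[!is_num])] <- tokens[!is_num]
  out
}

inputs <- c("97", "210 foo", "87 bar 89", "21 23",
            "123 2 fizzbuzz", "12 fizz 34 buzz")
res <- t(sapply(inputs, split_fields, USE.NAMES = FALSE))
data.frame(input = inputs, res, stringsAsFactors = FALSE)
```

For "123 2 fizzbuzz" the word is read after two numbers, so it lands in word_2 and word_1 stays NA, which is the behaviour the question asks for.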
Question
Is a data frame in R a list of columns (a list being, in my understanding, a sequence of objects)?
What was the design rationale in R for making a data frame a column-oriented (not row-oriented) structure?
Any reference to related design document or article of data structure design would be appreciated.
I am used to treating a row as the unit (a record) and would like to know why a data frame is column-oriented. If I have misunderstood something, please correct me.
Background
I had thought a data frame was a sequence of rows, each one like (Ozone, Solar.R, Wind, Temp, Month, Day).
> c ## data frame created from read.csv()
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
> typeof(c)
[1] "list"
However, when lapply() is applied to c to show each list element, each element turns out to be a column.
> lapply(c, function(arg){ return(arg) })
$Ozone
[1] 41 36 12 18 23 19
$Solar.R
[1] 190 118 149 313 299 99
$Wind
[1] 7.4 8.0 12.6 11.5 8.6 13.8
$Temp
[1] 67 72 74 62 65 59
$Month
[1] 5 5 5 5 5 5
$Day
[1] 1 2 3 4 7 8
Whereas what I had expected was
[1] 41 190 7.4 67 5 1
[1] 36 118 8.0 72 5 2
…
1) Is a data frame in R a list of columns?
Yes.
df <- data.frame(a=c("the", "quick"), b=c("brown", "fox"), c=1:2)
is.list(df) # -> TRUE
attr(df, "names") # -> [1] "a" "b" "c"
df[[1]][2] # -> "quick"
2) What was the design rationale in R for making a data frame a column-oriented (not row-oriented) structure?
A data.frame is a list of column vectors.
is.atomic(df[[1]]) # -> TRUE
mode(df[[1]]) # -> [1] "character"
mode(df[[3]]) # -> [1] "numeric"
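Atomic vectors can only store one type of value, while a row usually mixes types. A "row-oriented" data.frame would therefore have to be composed of lists, one per row. Now imagine what the performance of an operation like
df[[1]][20000]
would be in such a list-based data frame. Looking up one element would still be cheap, since R lists are generic vectors with constant-time indexing, but anything that works on a whole column would first have to collect one separately stored value from every row instead of reading a single contiguous, typed vector.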
3) Any reference to related design document or article of data structure design would be appreciated.
http://adv-r.had.co.nz/Data-structures.html#data-frames
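To make the "list of columns" point concrete, and to get the row-wise view the question expected, here is a small sketch. It rebuilds the first three rows and three columns of the data shown in the question by hand; the names aq and rows are mine.

```r
aq <- data.frame(Ozone = c(41, 36, 12),
                 Solar.R = c(190, 118, 149),
                 Wind = c(7.4, 8.0, 12.6))

is.list(aq)   # TRUE
length(aq)    # 3: one list element per column, not per row
names(aq)     # "Ozone" "Solar.R" "Wind"

# Row-wise iteration has to be requested explicitly
rows <- lapply(seq_len(nrow(aq)), function(i) unlist(aq[i, ]))
rows[[1]]     # named numeric vector holding row 1
```

The unlist() call only works cleanly here because every column is numeric. With mixed column types it would coerce the whole row to a common type, which illustrates why each column, not each row, is stored as a vector.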
I am using the code below to calculate the row-wise mode of a data frame:
library(modeest)
apply(df[ ,2:length(df)], 1, mfv)
My data looks like this:
Item A B C
Book001 56 32 56
Book002 95 95 20
Book003 50 89 50
Book004 6 65 40
It gives me the following output:
[[1]]
[1] 56
[[2]]
[1] 95
[[3]]
[1] 50
[[4]]
[1] 6 40 65
This code works as intended only when a row contains a repeated value.
How can I display the mode as NA when there is no recurring term?
Let's try with a custom function:
foo <- function(x){
out <- mfv(x)
if(length(out) > 1) out <- NA
return(out)
}
apply(df[ ,2:length(df)], 1, foo)
# [1] 56 95 50 NA
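If you would rather not depend on modeest, the same rule (return NA unless exactly one value is the most frequent) can be sketched in base R with table(). The function name row_mode is hypothetical, and the data frame below is retyped from the question. One assumption to be aware of: table() matches values through their character form, which is fine for whole numbers like these but may need care with arbitrary decimals.

```r
df <- data.frame(Item = c("Book001", "Book002", "Book003", "Book004"),
                 A = c(56, 95, 50, 6),
                 B = c(32, 95, 89, 65),
                 C = c(56, 20, 50, 40))

row_mode <- function(x) {
  counts <- table(x)
  # All values that share the highest frequency
  top <- names(counts)[counts == max(counts)]
  # More than one candidate means there is no unique mode
  if (length(top) > 1) NA_real_ else as.numeric(top)
}

res <- apply(df[, -1], 1, row_mode)
res
```

Like foo above, this returns NA for any tie, including a row in which two different values are each repeated.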