I have a large dataset in the form shown below:
ID    Scores
1     English 3, French 7, Geography 8
2     Spanish 7, Classics 4
3     Physics 5, English 5, PE 7, Art 4
I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:
ID    English    French    Geography    Spanish    Classics    Physics    PE    Art
1     3          7         8            -          -           -          -     -
2     -          -         -            7          4           -          -     -
3     5          -         -            -          -           5          7     4
I cannot manually predefine the columns as there are hundreds in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:
df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn"
)
I have looked at multiple questions on the subject, such as "Split irregular text column into multiple columns in r", but I cannot find another case where the column titles and data values are mixed together in the text string. How can I generate the full set of columns required and then populate the data values?
You can first split the Scores column into one row per subject-score pair (separate_rows does the strsplit-then-unnest step in one go), then separate each pair into Subject and Score columns, and finally pivot the data from a "long" format to a "wide" format.
Thanks @G. Grothendieck for improving my code :)
library(tidyverse)
df %>%
  separate_rows(Scores, sep = ", ") %>%
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>%
  pivot_wider(names_from = "Subject", values_from = "Score")
# A tibble: 3 × 9
ID English French Geography Spanish Classics Physics PE Art
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 3 7 8 NA NA NA NA NA
2 2 NA NA NA 7 4 NA NA NA
3 3 5 NA NA NA NA 5 7 4
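Note that the score columns come back as character (the <chr> above). If you need them numeric, a minimal follow-up sketch (assuming dplyr >= 1.0 for across()) is to append one more step to the pipeline:

df %>%
  separate_rows(Scores, sep = ", ") %>%
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>%
  pivot_wider(names_from = "Subject", values_from = "Score") %>%
  mutate(across(-ID, as.integer))  # convert the score strings to integers; missing subjects stay NA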
Using data.table
library(data.table)
library(stringr)  # for str_split()

setDT(dt)
dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = "grade")
Results
# ID Art Classics English French Geography PE Physics Spanish
# 1: 1 <NA> <NA> 3 7 8 <NA> <NA> <NA>
# 2: 2 <NA> 4 <NA> <NA> <NA> <NA> <NA> 7
# 3: 3 4 <NA> 5 <NA> <NA> 7 5 <NA>
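As with the tidyverse result, the grades come back from dcast as character. A hedged follow-up to convert them to integer by reference:

wide <- dcast(dt, ID ~ class, value.var = "grade")
cols <- setdiff(names(wide), "ID")
wide[, (cols) := lapply(.SD, as.integer), .SDcols = cols]  # convert each grade column in place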
Data
dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8",
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA,
-3L), class = c("data.frame"))
Related
I want to import a bank statement in CSV format that contains multiple rows of headers and footers that are not required.
I use skip = 7 to skip the unwanted header rows and col_types to select the columns required for automation.
read_csv(files[1], skip = 7, col_types = cols_only(AMOUNT = col_character(), `BILL REF NO.` = col_character()))
Here is the result. I want AMOUNT to be a double, so I change col_types.
A tibble: 34,825 x 2
AMOUNT `BILL REF NO.`
<chr> <chr>
1 "=\"17.58\"" 10000572874151776433
2 "=\"20.88\"" 10001648407332077912
3 "=\"70.60\"" 10002560021836683570
4 "=\"31.60\"" 10002744168017800627
5 "=\"80.00\"" 10003770035224984569
6 "=\"71.70\"" 10005255656409587173
7 "=\"27.97\"" 10005611886756396773
8 "=\"30.00\"" 10005808228105071391
9 "=\"34.58\"" 10006408254089150090
10 "=\"27.81\"" 10006412992762689126
# ... with 34,815 more rows
The code below changes the column type, but the result shows AMOUNT as all NA. What can I do to fix it?
read_csv(files[1], skip = 7, col_types = cols_only(AMOUNT = col_double(), `BILL REF NO.` = col_character()))
# A tibble: 34,825 x 2
AMOUNT `BILL REF NO.`
<dbl> <chr>
1 NA 10000572874151776433
2 NA 10001648407332077912
3 NA 10002560021836683570
4 NA 10002744168017800627
5 NA 10003770035224984569
6 NA 10005255656409587173
7 NA 10005611886756396773
8 NA 10005808228105071391
9 NA 10006408254089150090
10 NA 10006412992762689126
# ... with 34,815 more rows
I agree with @Dave2e that you need to correct the data after importing it. One way is to use the parse_number() function, since you are already using readr.
library(dplyr)
library(readr)
read_csv(files[1], skip = 7) %>%
  mutate(AMOUNT = parse_number(AMOUNT))
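This works here because parse_number() drops the non-numeric characters around the first number it finds, so it copes with the Excel-style ="17.58" values. A quick check of that behaviour:

library(readr)
parse_number("=\"17.58\"")
#> [1] 17.58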
I would like to add a column that combines the characters from the other columns, dropping any NAs.
I have tried paste, str_c and unite, but could not get the expected result; maybe I used them incorrectly.
The real issue is that I cannot know the number of columns in advance, since each dataset covers a different number of years:
some datasets contain 10 years, but some contain 20 years.
Here is the input data:
input <- tibble(
id = c('aa', 'ss', 'dd', 'qq'),
'2017' = c('tv', NA, NA, 'web'),
'2018' = c(NA, 'web', NA, NA),
'2019' = c(NA, NA, 'book', 'tv')
)
# A tibble: 4 x 4
id `2017` `2018` `2019`
<chr> <chr> <chr> <chr>
1 aa tv NA NA
2 ss NA web NA
3 dd NA NA book
4 qq web NA tv
The desired output with the ALL column is:
> output
# A tibble: 4 x 5
id `2017` `2018` `2019` ALL
<chr> <chr> <chr> <chr> <chr>
1 aa tv NA NA tv
2 ss NA web NA web
3 dd NA NA book book
4 qq web NA tv web tv
Thanks for the help!
Here is a base R method:
input$ALL <- apply(input[-1], 1, function(x) paste(na.omit(x), collapse=" "))
input$ALL
#[1] "tv" "web" "book" "web tv"
This is actually a duplicate of (or very close to) this question, but things have changed since then: unite now has an na.rm parameter which drops NAs.
As far as the selection of columns is concerned, here we select all columns except the first without specifying their names, so it should work for your case with a varying number of years.
library(tidyverse)
input %>%
  unite("ALL", names(input)[-1], remove = FALSE, sep = " ", na.rm = TRUE)
# A tibble: 4 x 5
# id ALL `2017` `2018` `2019`
# <chr> <chr> <chr> <chr> <chr>
#1 aa tv tv NA NA
#2 ss web NA web NA
#3 dd book NA NA book
#4 qq web tv web NA tv
It worked for me after installing the development version of tidyr with:
devtools::install_github("tidyverse/tidyr")
For the sake of completeness (and to supplement LocoGris' data.table answer), there are three other approaches which update input by reference, i.e., without copying the whole data object.
All approaches return the same result and can handle an arbitrary number of years.
Note that id is supposed to be a unique key, i.e., without any duplicates.
Reshape, na.omit(), aggregate
library(data.table)
setDT(input)[, ALL := melt(input, id.var = "id")[, toString(na.omit(value)), by = id]$V1][]
id 2017 2018 2019 ALL
1: aa tv <NA> <NA> tv
2: ss <NA> web <NA> web
3: dd <NA> <NA> book book
4: qq web <NA> tv web, tv
By the way, reshaping from wide to long format gives a more concise way to store this sparsely populated data.
melt(input, id.var = "id", na.rm = TRUE)
id variable value
1: aa 2017 tv
2: qq 2017 web
3: ss 2018 web
4: dd 2019 book
5: qq 2019 tv
Reshape, aggregate, join
library(data.table)
setDT(input)[melt(input, id.var = "id", na.rm = TRUE)[, toString(value), by = id],
on = "id", ALL := V1][]
Here the NA values are dropped during the reshape step, which distorts the original row order because of the many NAs. Hence, an update join is required.
Filter(), aggregate
library(data.table)
setDT(input)[, ALL := .SD[, toString(Filter(Negate(is.na), .SD)), by = id]$V1][]
A data.table approach:
library(data.table)
library(tidyverse)
input <- data.table(
id = c('aa', 'ss', 'dd', 'qq'),
'2017' = c('tv', NA, NA, 'web'),
'2018' = c(NA, 'web', NA, NA),
'2019' = c(NA, NA, 'book', 'tv')
)
""-> input[is.na(input)]
input[, ALL:=paste0(.SD,collapse=" "), .SDcols =2:length(input), by=seq_len(nrow(input))]
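Because the NAs were replaced with empty strings, the pasted ALL column keeps the extra separators (for example the first row becomes "tv" followed by two trailing spaces). A hedged clean-up, assuming a single space should separate the values, could be:

# collapse repeated spaces and trim leading/trailing ones
input[, ALL := trimws(gsub(" +", " ", ALL))]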
I have two dataframes, and I want to do a match and merge.
Initially I was using inner_join and coalesce, but realized the match portion wasn't properly matching.
I found an example which seemed to be in the right direction, "How to merge two data frame based on partial string match with R?". One answer suggested using this code:
idx2 <- sapply(df_mouse_human$Protein.IDs, grep, df_mouse$Protein.IDs)
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))
merged <- cbind(df_mouse_human[unlist(idx1),,drop=F], df_mouse[unlist(idx2),,drop=F])
However, it fell short. The issue is that the dataset I want to use as the pattern has strings which are longer than what I want to match against, and so nothing matched. Let me show a subset of the data:
dput(droplevels(df_mouse))
structure(list(Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0", "A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8",
"A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6", "Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15379-2;P15379-3;P15379-6;P15379-11;P15379-5;P15379-10;P15379-9;P15379-4;P15379-8;P15379-7;P15379;P15379-12;P15379-13",
"A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78", "A2AUR7;Q9D031;Q01730"
), Replicate = c(2L, 2L, 2L, 2L, 2L, 2L), Ratio.H.L.normalized.01 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Ratio.H.L.normalized.02 = c(NaN, NaN,
NaN, NaN, NaN, NaN), Ratio.H.L.normalized.03 = c(NaN, NaN, NaN,
NaN, NaN, NaN)), .Names = c("Protein.IDs", "Replicate", "Ratio.H.L.normalized.01",
"Ratio.H.L.normalized.02", "Ratio.H.L.normalized.03"), row.names = 12:17, class = "data.frame")
dput(droplevels(df_mouse_human))
structure(list(Human = c("Q8WZ42", "Q8NF91", "Q9UPN3", "Q96RW7",
"Q8WXG9", "P20929", "Q5T4S7", "O14686", "Q2LD37", "Q92736"),
Protein.IDs = c("A2ASS6", "Q6ZWR6", "Q9QXZ0", "D3YXG0", "Q8VHN7",
"E9Q1W3", "A2AN08", "Q6PDK2", "A2AAE1", "E9Q401")), .Names = c("Human",
"Protein.IDs"), row.names = c(NA, 10L), class = "data.frame")
So I want to match the Protein.IDs in df_mouse to where they exist in df_mouse_human. In the sample data I'm trying to match A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 to the entry A2ASS6. It works well if I do it the other way around, but is there a way so that if part of the pattern matches the query, it comes back TRUE?
My long term goal is to match and merge the data, so that df_mouse gets a new column with the matching Human protein ids, and where there is no match I'll just replace the NA value with the original string of mouse IDs.
thanks
One method I commonly use with partial matches like this is to reduce the more-complex field to make it look like the simpler one. Sometimes this involves just removing extraneous characters (e.g., if "match only on the first four chars", then I'd make a new index column from substr(idcol, 1, 4) and join on that), but in this case it involves breaking one string into multiple.
This involves associating each of the semi-colon-delimited ids with the big-string, making this intermediate frame taller (sometimes much taller) than the original data.
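As a minimal sketch of that simpler substr-key idea (df_a, df_b and idcol are hypothetical names, not columns from this question):

# hypothetical: join two frames on the first four characters of an ID column
df_a$key <- substr(df_a$idcol, 1, 4)
df_b$key <- substr(df_b$idcol, 1, 4)
merged <- merge(df_a, df_b, by = "key")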
(For the sake of presentability/aesthetics, I'm modifying df1 to remove the other invariant columns and, for the sake of "other data", adding a row number column.)
I'm using dplyr and tidyr, so:
library(dplyr)
library(tidyr)
df1 <- select(df1, Protein.IDs) %>%
  mutate(other = row_number())
First I'll break the 6-row frame into a much larger one:
df1ids <- tbl_df(df1) %>%
  select(Protein.IDs) %>%
  mutate(eachID = strsplit(Protein.IDs, ";")) %>%
  unnest()
df1ids
# # A tibble: 46 x 2
# Protein.IDs eachID
# <chr> <chr>
# 1 Q8CBM2;A2AL85;Q8BSY0 Q8CBM2
# 2 Q8CBM2;A2AL85;Q8BSY0 A2AL85
# 3 Q8CBM2;A2AL85;Q8BSY0 Q8BSY0
# 4 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH3
# 5 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH5
# 6 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH4
# 7 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 Q6X893
# 8 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 Q6X893-2
# 9 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH8
# 10 A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 A2AMW0
# # ... with 36 more rows
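On newer tidyr/dplyr versions, where tbl_df() and argument-less unnest() are deprecated, a hedged equivalent way to build df1ids is separate_rows():

df1ids <- df1 %>%
  select(Protein.IDs) %>%
  mutate(eachID = Protein.IDs) %>%
  separate_rows(eachID, sep = ";")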
Notice how the first row of three is now three rows of three. We'll use "eachID" to join.
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~ 4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 <NA> A2AUR7;Q9D031;Q01730 6
If you happen to have multiple Human rows for each Protein.IDs, things change a little.
df2$Protein.IDs[2] <- "E9Q8K5"
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 7 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~ 4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 Q8NF91 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 7 <NA> A2AUR7;Q9D031;Q01730 6
Notice how you now have two rows with other equal to 5? That is likely not what you want. If you intend to continue with the semicolon-delimited theme, though:
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  group_by(Protein.IDs) %>%
  summarize(Human = paste(Human, collapse = ";")) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM~ 4
# 5 Q8WZ42;Q8N~ A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 <NA> A2AUR7;Q9D031;Q01730 6
@r2evans asks a good question about what to do with multiple matches. Once that question gets answered, I may need to edit my answer, but here is a quick solution. First we split up the string of possible IDs, then we see which IDs are matched in the other data frame, then we join on the row index of the match.
library(tidyverse)
df_mouse %>%
  mutate(all_id = str_split(Protein.IDs, ";"),
         row = map(all_id, ~ .x %in% df_mouse_human$Protein.IDs %>% which())) %>%
  unnest(row) %>%
  list(., df_mouse_human %>% rownames_to_column("row") %>% mutate(row = as.numeric(row))) %>%
  reduce(left_join, by = "row")
#> Protein.IDs.x Replicate
#> 1 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 2
#> Ratio.H.L.normalized.01 Ratio.H.L.normalized.02 Ratio.H.L.normalized.03
#> 1 NaN NaN NaN
#> row Human Protein.IDs.y
#> 1 1 Q8WZ42 A2ASS6
I have a data frame from a recall task where participants recall as many words as they can from a list they learned earlier. Here's a mock up of the data. Each row is a subject and each column (w1-w5) is a word recalled:
df <- data.frame(subject = 1:5,
w1 = c("screen", "toad", "toad", "witch", "toad"),
w2 = c("package", "tuna", "tuna", "postage", "dinosaur"),
w3 = c("tuna", "postage", "toast", "athlete", "ranch"),
w4 = c("toad", "witch", "tuna", "package", "NA"),
w5 = c("windwo", "mermaid", "NA", "NA", "NA")
)
Which produces the following data frame:
subject w1 w2 w3 w4 w5
1 1 screen package tuna toad windwo
2 2 toad tuna postage witch mermaid
3 3 toad tuna toast tuna NA
4 4 witch postage athlete package NA
5 5 toad dinosaur ranch NA NA
I want to match each word produced (columns w1 - w5) to a list of the correct words, which are:
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
I only want to award points for words that are spelled correctly and are not repeated. So for example, for the data above I'd like to end up with a data frame that looks like this:
subject nCorrect
1 1 4
2 2 5
3 3 3
4 4 3
5 5 2
Subject 1 would get four points because they misspelled one word.
Subject 2 would get five points.
Subject 3 would get three points because they repeated tuna and are missing one word.
Subject 4 would get three points because they have one incorrect word and one missing word.
Subject 5 would get two points because they have one incorrect word and two missing words.
data.frame(subject = df$subject,
           nCorrect = apply(df[, -1], 1, function(x) sum(unique(x) %in% words)))
# subject nCorrect
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
With data.table (same result)
setDT(df)
df[, sum(unique(unlist(.SD)) %in% words), by = subject]
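The call above returns the count in a column named V1; a hedged variant that names the column nCorrect directly:

df[, .(nCorrect = sum(unique(unlist(.SD)) %in% words)), by = subject]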
Another option is to convert the data to long format, group by subject, and use dplyr::summarise to count the number of correct, unique answers.
library(tidyverse)
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
df %>%
  gather(key, value, -subject) %>%
  group_by(subject) %>%
  summarise(nCorrect = sum(unique(value) %in% words))
# # A tibble: 5 x 2
# subject nCorrect
# <int> <int>
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
I need to convert the values of one column into column names using R.
For example, to convert format1 into format2:
var<-c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num<-c(1, "Tom", 4, 2, 7, 3, "Jim")
format1<-data.frame(var, num)
format1
var num
1 Id 1
2 Name Tom
3 Score 4
4 Id 2
5 Score 7
6 Id 3
7 Name Jim
Be careful: there are missing values in format1, and that's the challenge, I guess.
Id<-c(1, 2, 3)
Name<-c("Tom", NA, "Jim")
Score<-c(4, 7, NA)
format2<-data.frame(Id, Name, Score)
format2
Id Name Score
1 1 Tom 4
2 2 <NA> 7
3 3 Jim NA
# How to convert format1 into format2?
I may not have articulated this precisely, but you can refer to the toy data I give above.
I know a little bit about reshape and reshape2, but I failed to convert the data format using either of them.
format1$ID <- cumsum(format1$var == "Id")
format2 <- reshape(format1, idvar = "ID", timevar = "var", direction = "wide")[-1]
names(format2) <- gsub("num.", "", names(format2))
# Id Name Score
# 1 1 Tom 4
# 4 2 <NA> 7
# 6 3 Jim <NA>
Alternatively, if you'd like to skip the gsub() step, you could directly specify the output column names via the varying argument:
reshape(format1, idvar = "ID", timevar = "var", direction = "wide",
        varying = list(c("Id", "Name", "Score")))[-1]
You can use dcast after adding an identifier column.
format1$pk <- cumsum(format1$var == "Id")

library(reshape2)
dcast(format1, pk ~ var, value.var = "num")
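Because num mixes names and numbers, the reshaped columns all come back as character. If you want Id and Score to be numeric, one hedged follow-up is type.convert():

format2 <- dcast(format1, pk ~ var, value.var = "num")[-1]  # drop the pk helper column
format2[] <- lapply(format2, type.convert, as.is = TRUE)    # Id and Score become numeric, Name stays character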