I have a character vector in an R data frame, such as the one below:
string<-c("Real estate surface: 60m2 Number of rooms: 3 Number of bedrooms: 2 Number of bathrooms: 1 Number of toilets: 0 Year of construction: 1980 Last renovation: Floor: 1/15")
and I want to split the text into an 8-column data frame with the associated values.
How can I do that?
Thanks!
One option would be to create an NA for the missing cases, then use separate_rows/separate to split the string:
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
tibble(col = string) %>%
  mutate(col = str_replace_all(col, ": (?![0-9])", ": NA ")) %>%
  separate_rows(col, sep = "(?<=:\\s\\w{1,5}) ") %>%
  separate(col, into = c('col1', 'col2'), sep = ":\\s+") %>%
  deframe() %>%
  as.data.frame.list(check.names = FALSE) %>%
  type.convert(as.is = TRUE)
#Real estate surface Number of rooms Number of bedrooms Number of bathrooms Number of toilets Year of construction
#1 60m2 3 2 1 0 1980
# Last renovation Floor
#1 NA 1/15
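As a quick check of the first step, running the str_replace_all() call on its own (using the string object and stringr loaded above) shows that the lookahead (?![0-9]) inserts the NA placeholder only after the label with a missing value, here "Last renovation:" (output abbreviated):
str_replace_all(string, ": (?![0-9])", ": NA ")
#[1] "Real estate surface: 60m2 ... Year of construction: 1980 Last renovation: NA Floor: 1/15"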
Related
I have a data frame like this
df <- data.frame(Income = c("$100to$200","under$100","above$1000"))
I would like this as output
df_final <- data.frame(Avg = c(150,100,1000))
I would like to extract the numeric value from the Income column: if there are two numbers, take the average; if there is only one number, take that number.
A few key steps here. First we need to clean our data; in this case, getting rid of the $ makes things easier. Then we'll split into From and To columns. Finally we need to convert to numeric and calculate the row means.
library(tidyverse)
df %>%
  mutate(Income = gsub("$", "", Income, fixed = TRUE)) %>%
  separate(Income, into = c("From", "To"), sep = "to|under|above") %>%
  mutate_all(as.numeric) %>%
  mutate(Avg = rowMeans(., na.rm = TRUE))
From To Avg
1 100 200 150
2 NA 100 100
3 NA 1000 1000
You could try:
library(dplyr)
library(stringr)
df %>%
  mutate(across(Income, ~ sapply(str_extract_all(.x, '\\d+'),
                                 \(x) {strsplit(x, ',') |> as.numeric() |> mean()})))
Income
1 150
2 100
3 1000
A stringr approach using gsub to strip everything but the numbers, str_squish to remove the extra whitespace, and str_split to get the individual entries in case there is more than one value.
library(stringr)
data.frame(Avg = sapply(
  str_split(str_squish(gsub("[[:alpha:]$]", " ", df$Income)), " "),
  function(x) sum(as.numeric(x)) / length(x)))
Avg
1 150
2 100
3 1000
Or, extract the numbers that follow a $ with a lookbehind, then average per row:
df %>%
  transmute(
    Avg = stringr::str_extract_all(Income, "(?<=\\$)\\d+") %>%
      lapply(as.numeric) %>%
      sapply(mean)
  )
Avg
1 150
2 100
3 1000
I have an example concatenated text field (please see sample data below) that is created from two or three different fields; however, there is no guarantee that the words will appear in the same order. I would like to create a new dataset where fields containing the same words, regardless of order, are collapsed. Since I do not know in advance which words will be concatenated together, the code has to recognize that all words in both strings match.
Code for example data:
var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
freq<-c(1,1,1,1)
have<-as.data.frame(cbind(var1,freq))
Have:
var1 freq
BLUE|RED 1
RED|BLUE 1
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1
How can I collapse the data into what I want below?
color freq
BLUE|RED 2
WHITE|BLACK|ORANGE 2
A base R option: sort the words within each string, paste them back together, and tabulate:
data.frame(table(sapply(strsplit(have$var1, '\\|'),
                        function(x) paste(sort(x), collapse = '|'))))
Var1 Freq
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
In the world of piping (R >= 4.1, which introduced the native pipe |> and the \(x) lambda shorthand):
have$var1 |>
  strsplit('\\|') |>
  sapply(\(x) paste0(sort(x), collapse = "|")) |>
  table() |>
  data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
have %>%
  group_by(id = row_number()) %>%
  separate_rows(var1) %>%
  arrange(var1, .by_group = TRUE) %>%
  mutate(var1 = paste(var1, collapse = "|")) %>%
  slice(1) %>%
  ungroup() %>%
  count(var1, name = "freq")
var1 freq
<chr> <int>
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
I am trying to match keywords between two large bibliographic datasets (1.8M obs and 3.9M obs), which are derived from various fields in the record: title, author, publication date, publisher.
For each entry (1.8M), I want to match each keyword in the string against keywords in each entry of the other dataset (3.9M), and return the line with the most matches.
The method I've come up with, using the separate() and gather() functions from tidyverse, along with some basic dplyr, seems to work, but it is impossible to scale to the entire dataset.
Is there a more efficient (or entirely better) way of doing this?
Sample data (three keyword strings in each dataset) and code:
library(dplyr)
library(tidyverse)
df1 <- data.frame("df1.index" = c(1:3),
"keywords" = c("2013 history interpretation oxford the tractatus univ wittgensteins",
"2014 baxi law of oxford pratiksha public secrets univ",
"2014 darwin flinching from looking on oxford scientific shell-shock"))
df2 <- data.frame("df2.index" = c(1:3),
"keywords" = c("2014 darwin flinching from looking on oxford scientific theatricality",
"2013 interpretation oxford tractatushistory univ wittgensteins",
"2014 baxi in india law of oxford pratiksha public rape secrets trials univ"))
#separate up to 10 keywords
df1_sep <- separate(df1, keywords, into = paste0("key", 1:10),
                    sep = " ", remove = FALSE)
df2_sep <- separate(df2, keywords, into = paste0("key", 1:10),
                    sep = " ", remove = FALSE)
#gather separated keywords into one column
df1_gather <- df1_sep %>%
  gather(keys, key.match, key1:key10, factor_key = TRUE) %>%
  distinct()
df2_gather <- df2_sep %>%
  gather(keys, key.match, key1:key10, factor_key = TRUE) %>%
  distinct()
#remove NAs, blanks, trim
df1_gather <- df1_gather %>% filter(!is.na(key.match))
df1_gather <- df1_gather %>% filter(key.match != "")
df1_gather$key.match <- str_trim(df1_gather$key.match)
df2_gather <- df2_gather %>% filter(!is.na(key.match))
df2_gather <- df2_gather %>% filter(key.match != "")
df2_gather$key.match <- str_trim(df2_gather$key.match)
#join, after removing some columns from df2_gather
df2_gather <- df2_gather %>% select(df2.index, key.match)
df_join <- left_join(df1_gather, df2_gather)
#remove NAs
df_join <- df_join %>% filter(!is.na(df2.index))
#tally matches for each index, then take top match
df_join <- df_join %>% group_by(df1.index, df2.index) %>% tally()
df_join <- df_join %>% group_by(df1.index) %>% top_n(1, n)
#add back keywords to review match
df_join$df1.keywords=df1$keywords[match(df_join$df1.index, df1$df1.index)]
df_join$df2.keywords=df2$keywords[match(df_join$df2.index, df2$df2.index)]
Maybe this approach, which counts matches for each keyword directly, could be useful. I hope this helps:
library(tidytext)
#Separate
df1 %>% mutate(keywords=as.character(keywords)) %>% unnest_tokens(word,keywords) -> l1
df2 %>% mutate(keywords=as.character(keywords)) %>% unnest_tokens(word,keywords) -> l2
#Join
l1 %>% left_join(l2) -> l3
l2 %>% left_join(l1) -> l4
#Compute number of occurrences
table(l3$df1.index,l3$df2.index,exclude=NULL)
table(l4$df1.index,l4$df2.index,exclude=NULL)
Output:
1 2 3 <NA>
1 1 5 2 3
2 2 2 9 0
3 8 1 2 2
1 2 3
1 1 5 2
2 2 2 9
3 8 1 2
<NA> 1 1 4
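To get the single best match per df1.index (the question's goal of returning the line with the most matches), one hedged sketch building on l3 from above is to count shared keywords per index pair and keep the top row (slice_max requires dplyr >= 1.0; the column name "matches" is just illustrative):
l3 %>%
  filter(!is.na(df2.index)) %>%                      # drop keywords with no match
  count(df1.index, df2.index, name = "matches") %>%  # shared keywords per pair
  group_by(df1.index) %>%
  slice_max(matches, n = 1) %>%                      # keep the best df2 match
  ungroup()
# With the sample data this should pick df2.index 2, 3 and 1 for
# df1.index 1, 2 and 3 respectively (5, 9 and 8 shared keywords).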
I have the following dataframe:
df1 <- data.frame(word = c("house, garden, flower", "flower, red", "garden, tree, forest", "house, window, door, red"),
                  value = c(10, 12, 20, 5),
                  stringsAsFactors = FALSE)
Now I would like to sum up the values for each single word. This means the table should look like this:
word | value
house | 15
garden | 30
flower | 22
...
I could not find a solution so far. Does anybody have a solution?
Here's an example using unnest_tokens from the tidytext library:
library(tidyverse)
library(tidytext)
df1 %>%
  unnest_tokens(word, word) %>%
  group_by(word) %>%
  summarize(value = sum(value))
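Note that unnest_tokens() lowercases tokens by default; with this sample data that does not matter, but if case were meaningful you could pass to_lower = FALSE (a minor variant of the same idea):
df1 %>%
  unnest_tokens(word, word, to_lower = FALSE) %>%
  group_by(word) %>%
  summarize(value = sum(value))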
You can get all of the words using strsplit, then use sapply to sum the values for each word.
Words = unique(unlist(strsplit(df1$word, ",\\s*")))
sapply(Words, function(w) sum(df1$value[grep(w, df1$word)]))
house garden flower red tree forest window door
15 30 22 17 20 20 5 5
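One caveat (not a problem with this data): grep() does pattern matching, so a word that is a substring of another word (say, "red" inside a hypothetical "redwood") would be over-counted. A safer variant splits each row once and matches whole words with %in%:
# Split each row once, then match exact words instead of patterns
splits <- strsplit(df1$word, ",\\s*")
sapply(Words, function(w) sum(df1$value[sapply(splits, function(x) w %in% x)]))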
One option could be to separate the word column into multiple columns using splitstackshape::cSplit and then use tidyr::gather. Finally, process the data in long format.
library(tidyverse)
library(splitstackshape)
df1 %>%
  cSplit("word", sep = ",", stripWhite = TRUE) %>%
  mutate_at(vars(starts_with("word")), funs(as.character)) %>%
  gather(key, word, -value) %>%
  filter(!is.na(word)) %>%
  group_by(word) %>%
  summarise(value = sum(value)) %>%
  as.data.frame()
# word value
# 1 door 5
# 2 flower 22
# 3 forest 20
# 4 garden 30
# 5 house 15
# 6 red 17
# 7 tree 20
# 8 window 5
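On newer tidyr/dplyr versions, gather() and funs() are superseded; a sketch of the same approach with pivot_longer() and across() (same result expected, untested against older package versions):
df1 %>%
  cSplit("word", sep = ",", stripWhite = TRUE) %>%
  mutate(across(starts_with("word"), as.character)) %>%
  pivot_longer(starts_with("word"), values_to = "word", values_drop_na = TRUE) %>%
  group_by(word) %>%
  summarise(value = sum(value)) %>%
  as.data.frame()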
This question already has answers here:
Split a column of concatenated comma-delimited data and recode output as factors
(3 answers)
Closed 5 years ago.
I have data frame as below
+---+-----------+
|lot|Combination|
+---+-----------+
|A01|A,B,C,D,E,F|
|A01|A,B,C |
|A02|B,C,D,E |
|A03|A,B,D,F |
|A04|A,C,D,E,F |
+---+-----------+
Each letter is a value separated by a comma. I would like to split 'Combination' on each comma and turn the split strings into new columns, in binary form. For instance, the desired output would be:
+---+-+-+-+-+-+-+
|lot|A|B|C|D|E|F|
+---+-+-+-+-+-+-+
|A01|1|1|1|1|1|1|
|A01|1|1|1|0|0|0|
|A02|0|1|1|1|1|0|
|A03|1|1|0|1|0|1|
|A04|1|0|1|1|1|1|
+---+-+-+-+-+-+-+
Any help will be appreciated :)
A solution using dplyr and tidyr. dt2 is the final output.
# Load packages
library(dplyr)
library(tidyr)
# Create example data frame
lot <- c("A01", "A01", "A02", "A03", "A04")
Combination <- c("A,B,C,D,E,F", "A,B,C","B,C,D,E", "A,B,D,F", "A,C,D,E,F")
dt <- data_frame(lot, Combination)
# Process the data
dt2 <- dt %>%
  mutate(ID = 1:n()) %>%
  mutate(Combination = strsplit(Combination, split = ",")) %>%
  unnest() %>%
  mutate(Value = 1) %>%
  spread(Combination, Value, fill = 0) %>%
  select(-ID)
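Printing dt2 should reproduce the desired table from the question (lot plus one 0/1 column per letter; it prints as a tibble, so the exact formatting may differ):
dt2
# lot     A B C D E F
# A01     1 1 1 1 1 1
# A01     1 1 1 0 0 0
# A02     0 1 1 1 1 0
# A03     1 1 0 1 0 1
# A04     1 0 1 1 1 1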
Please provide your sample input data in a form that is directly usable as input for the person answering. I added some sample data myself here; hope that helps.
library(tidyr)
library(dplyr)
lot <- c("A01", "A02", "A03","A04")
Combination <- c("A,B,C,D,E,F", "A,B,C","B,C,D,E", "A,C")
df <- data.frame(lot, Combination)
df
separate(df, Combination, into = paste("V", 1:6, sep = ""), sep = ",") %>%
  gather(key, value, -lot) %>%
  filter(!is.na(value)) %>%
  mutate(yesno = 1) %>%
  distinct %>%
  spread(value, yesno, fill = 0) %>%
  select(-key)
To understand what's happening here, run every step starting from separate() individually. The %>% is a pipe operator that passes the result of the previous step as the first argument of the next.
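For example, the first step on its own (a sketch; separate() will warn about rows with fewer than six pieces and fill them with NA):
separate(df, Combination, into = paste("V", 1:6, sep = ""), sep = ",")
#   lot V1 V2   V3   V4   V5   V6
# 1 A01  A  B    C    D    E    F
# 2 A02  A  B    C <NA> <NA> <NA>
# 3 A03  B  C    D    E <NA> <NA>
# 4 A04  A  C <NA> <NA> <NA> <NA>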
Another option, using the handy separate_rows() function:
df <- read.table( text = "lot|Combination
A01|A,B,C,D,E,F
A01|A,B,C
A02|B,C,D,E
A03|A,B,D,F
A04|A,C,D,E,F", sep ="|", header = TRUE)
library(tidyverse)
df %>%
  mutate(id = row_number(), flg = 1) %>%
  separate_rows(Combination, sep = ",") %>%
  spread(Combination, flg)
gives:
lot id A B C D E F
1 A01 1 1 1 1 1 1 1
2 A01 2 1 1 1 NA NA NA
3 A02 3 NA 1 1 1 1 NA
4 A03 4 1 1 NA 1 NA 1
5 A04 5 1 NA 1 1 1 1
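Since the question asks for 0/1 rather than NA, passing fill = 0 to spread() and dropping the helper id afterwards should give the binary form:
df %>%
  mutate(id = row_number(), flg = 1) %>%
  separate_rows(Combination, sep = ",") %>%
  spread(Combination, flg, fill = 0) %>%
  select(-id)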