R: align two sequences of words

I want to align two datasets that mostly intersect on one column -- but each dataset is missing some rows. For example:
df1 <- data.frame(word = c("my", "dog", "ran", "with", "your", "dog"),
freq = c(5, 2, 2, 6, 5, 10))
df2 <- data.frame(word = c("my", "brown", "dog", "ran", "your", "dog"),
pos = c("a", "b", "c", "d", "a", "e"))
What I want as output is to have gaps inserted wherever there's a missing item. Thus in the output, the new form of df1 will have NAs where df1 was missing a word match that was in df2, and the new form of df2 will have NAs where df2 was missing a word-instance that was in df1.
As in my example, the sequence matters and elements do repeat (so this isn't a generic "merge" situation). I suspect DTW (dynamic time warping) could figure into the solution, but I'm not sure. For present purposes it's fair to stipulate that only exact matches count as matches.
For the above case the desired output would be a data frame with these columns:
$word1 my NA dog ran with your dog
$freq 5 NA 2 2 6 5 10
$word2 my brown dog ran NA your dog
$pos a b c d NA a e
Thus, the sequence in each original data frame is maintained; nothing is deleted; word tokens remain tokens (it's a corpus, not a dictionary); all that's really happened is spaces (NAs) have been inserted where data are missing.

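# Number each occurrence of a word within its data frame (1st "dog", 2nd "dog", ...)
df1$count <- ave(seq_along(df1$word), df1$word, FUN = seq_along)
df2$count <- ave(seq_along(df2$word), df2$word, FUN = seq_along)
# Build a merge key from the occurrence number and the word, e.g. "2 dog"
df1$merge <- paste(df1$count, df1$word)
df2$merge <- paste(df2$count, df2$word)
# Full outer join on the key; unmatched rows are filled with NA
output <- merge(x = df1, y = df2, by = "merge", all.x = TRUE, all.y = TRUE)
# Keep word.x, freq, word.y and pos. Note that merge() sorts rows by the key,
# so the result is not in the original word order.
output[c(2, 3, 5, 6)]
# word.x freq word.y pos
#1 <NA> NA brown b
#2 dog 2 dog c
#3 my 5 my a
#4 ran 2 ran d
#5 with 6 <NA> <NA>
#6 your 5 your a
#7 dog 10 dog e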
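The merge above pairs tokens by occurrence number, and merge() returns rows sorted by key, so the original sequence is not kept. If order matters, one possible alternative (not part of the answer above) is a longest-common-subsequence alignment. The following is a minimal base R sketch under stated assumptions: `align_words` is a hypothetical helper name, only exact matches count, and neither word column contains NA.

```r
df1 <- data.frame(word = c("my", "dog", "ran", "with", "your", "dog"),
                  freq = c(5, 2, 2, 6, 5, 10), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("my", "brown", "dog", "ran", "your", "dog"),
                  pos = c("a", "b", "c", "d", "a", "e"), stringsAsFactors = FALSE)

# Hypothetical helper: returns aligned row indices into x and y (NA = gap)
align_words <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # L[i + 1, j + 1] holds the LCS length of x[1:i] and y[1:j]
  L <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1, j + 1] <- if (x[i] == y[j]) {
        L[i, j] + 1L
      } else {
        max(L[i, j + 1], L[i + 1, j])
      }
    }
  }
  # Walk back from the bottom-right corner, emitting matches and gaps
  i <- n
  j <- m
  ix <- integer(0)
  iy <- integer(0)
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && x[i] == y[j]) {
      ix <- c(i, ix); iy <- c(j, iy)      # exact match: keep both rows
      i <- i - 1; j <- j - 1
    } else if (j > 0 && (i == 0 || L[i + 1, j] >= L[i, j + 1])) {
      ix <- c(NA, ix); iy <- c(j, iy)     # word only in y: gap on the x side
      j <- j - 1
    } else {
      ix <- c(i, ix); iy <- c(NA, iy)     # word only in x: gap on the y side
      i <- i - 1
    }
  }
  data.frame(ix = ix, iy = iy)
}

idx <- align_words(df1$word, df2$word)
# Indexing with NA yields NA, which inserts the gaps
aligned <- data.frame(word1 = df1$word[idx$ix], freq = df1$freq[idx$ix],
                      word2 = df2$word[idx$iy], pos = df2$pos[idx$iy],
                      stringsAsFactors = FALSE)
aligned
```

For the example data this keeps both sequences in order, with an NA row in df1's columns opposite "brown" and an NA row in df2's columns opposite "with". The nested loop is quadratic in the sequence lengths, so it is only a sketch for modest inputs.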

Related

Move subgroup under repeated main group while keeping main group once in data.frame R

I'm aware that the question is awkward. If I could phrase it better I'd probably find the solution in another thread.
I have this data structure...
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"))
... and would like to turn it into this...
data.frame(group = c("X", "F", "camel", "horse", "dog", "cat", "C", "orange", "banana"))
... which is surprisingly confusing. Also, I would prefer not using a loop.
EDIT: I updated the example to clarify that solutions that depend on sorting unfortunately do not do the trick.
Here is an (edited) answer using the new data.
Using data.table helps a lot here. The idea is to split df into groups and lapply() to each group what we need. We have to take care of a few things along the way.
library(data.table)
# set as data.table
setDT(df)
# to maintain the ordering, convert group to a factor;
# the levels give split() the ordering information
df[, group := factor(group, levels = unique(group))]
# split df into a list, one element per group
df_list <- split(df, df$group, sorted = FALSE)
# for each group, stack the group name on top of its subgroups
df_list <- lapply(df_list, function(x) data.frame(group = unique(c(as.character(x$group), x$subgroup))))
# bind into one data.table and remove NAs
df_onecol <- rbindlist(df_list)
df_onecol[!is.na(group)]
group
1: X
2: F
3: camel
4: horse
5: dog
6: cat
7: C
8: orange
9: banana
With the edited data we need to add another column (here a row number, r_n) to sort by:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  mutate(r_n = row_number()) %>%
  group_by(value) %>% slice(1) %>%
  arrange(r_n) %>%
  filter(!is.na(value))
#output
# A tibble: 9 × 3
# Groups: value [9]
name value r_n
<chr> <chr> <int>
1 group X 1
2 group F 3
3 subgroup camel 4
4 subgroup horse 6
5 subgroup dog 8
6 subgroup cat 10
7 group C 11
8 subgroup orange 12
9 subgroup banana 14
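For comparison, the same result can be sketched in base R without a loop or extra packages. This is only an illustration, not taken from either answer: it keeps the group label on the first row of each group, interleaves it with the subgroup column, and drops the NAs. It assumes each group label should appear once, at its first occurrence.

```r
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
                 subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"),
                 stringsAsFactors = FALSE)

# Keep the group label only on the first row where it occurs
first <- !duplicated(df$group)
header <- ifelse(first, df$group, NA)

# rbind() gives a 2 x n matrix; as.vector() reads it column by column,
# which interleaves header and subgroup row by row
interleaved <- as.vector(rbind(header, df$subgroup))

result <- data.frame(group = interleaved[!is.na(interleaved)],
                     stringsAsFactors = FALSE)
result
```

Because nothing is sorted, the original row order is preserved.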

R: outer-merge two dataframes with unequal columns

I'm new to coding and am struggling a bit with this merge. I have two dataframes:
> a1
a b c
1 1 apple x
2 2 bees a
3 3 candy a
4 4 dice s
5 5 donut d
> a2
a b c d
1 1 apple x a
2 2 bees a d
3 6 coffee r s
I would like to join these two dataframes by a, b, and c. I want to get rid of the duplicate rows where a, b, c are the same, but keep the unique rows in both datasets. In the case where a unique row in a2 is kept, I would also want d to be shown. So the result would be something like the following:
> a3
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a NA
4 4 dice s NA
5 5 donut d NA
6 6 coffee r s
You can use a full_join from tidyverse:
library(tidyverse)
full_join(a1, a2, by = c("a", "b", "c")) %>%
distinct()
Output
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a <NA>
4 4 dice s <NA>
5 5 donut d <NA>
6 6 coffee r s
Data
a1 <- structure(list(a = 1:5, b = c("apple", "bees", "candy", "dice",
"donut"), c = c("x", "a", "a", "s", "d")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
a2 <- structure(list(a = c(1L, 2L, 6L), b = c("apple", "bees", "coffee"
), c = c("x", "a", "r"), d = c("a", "d", "s")), class = "data.frame", row.names = c("1",
"2", "3"))
Hello there & here you go:
a3 <- merge(a1, a2, all=TRUE) # this merges your data preserving all vals from all cols/rows (if missing - adds NA)
a3 <- a3[order(a3[,"d"], decreasing=TRUE),] # this sorts your merged df
a3 <- a3[!duplicated(a3[,"b"]),] # this removes duplicate values
I edited my answer after looking more carefully at your desired output. You want to preserve the "d" column value even where there is a duplicate, so I first suggest ordering your df by the "d" column in decreasing order (so the NAs end up at the bottom of the df). Then I suggest excluding the duplicates in column "b"; since duplicated() keeps the first value it encounters, the "apple" row that originally came from the a2 df will be preserved in the output. You can also turn this into a one-liner using the pipe operator.
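As a small check on the merge-based approach, the sketch below names the key columns explicitly and confirms that no duplicated keys remain after the outer join, which suggests the ordering and de-duplication steps may not be needed for this particular data. It uses the a1 and a2 data from the question; the explicit `by` argument is an addition for clarity.

```r
a1 <- data.frame(a = 1:5,
                 b = c("apple", "bees", "candy", "dice", "donut"),
                 c = c("x", "a", "a", "s", "d"),
                 stringsAsFactors = FALSE)
a2 <- data.frame(a = c(1L, 2L, 6L),
                 b = c("apple", "bees", "coffee"),
                 c = c("x", "a", "r"),
                 d = c("a", "d", "s"),
                 stringsAsFactors = FALSE)

# Outer join on the three shared columns; rows found in only one
# data frame are kept, and d is NA where a2 had no matching row
a3 <- merge(a1, a2, by = c("a", "b", "c"), all = TRUE)

# Number of duplicated key combinations left after the join
sum(duplicated(a3[c("a", "b", "c")]))
a3
```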

Using lapply to change column names of list of dataframes with different column names

I have three dataframes with different column names:
df1 <- data.frame(A = 1:10, B = 10:19)
df2 <- data.frame(D = 20:29, E = 30:39)
df3 <- data.frame(G = 40:49, H = 50:59)
They are in a list:
my_list <- list(df1, df2, df3)
I need to change the column names (i.e. A becomes A1, B becomes B2, D becomes D2, etc.). No, it is not as easy as appending a 2 to all column names. My real situation will involve unique changes (i.e. A becomes 'species', B becomes 'abundance', etc.).
There are good answers for how to do this when the dataframes within the list all have the same column names (Using lapply to change column names of a list of data frames).
My dataframes have different names though. It would be nice to use something similar to the functionality of mapvalues.
You can create a dataframe with the information of from and to and with lapply use setNames to match and replace the column names :
lookup_names <- data.frame(from = c("A", "B", "D", "E", "G", "H"),
to = c("this", "that", "he", "she", "him", "her"))
lookup_names
# from to
#1 A this
#2 B that
#3 D he
#4 E she
#5 G him
#6 H her
lapply(my_list, function(x)
setNames(x, lookup_names$to[match(names(x), lookup_names$from)]))
#[[1]]
# this that
#1 1 10
#2 2 11
#3 3 12
#4 4 13
#...
#...
#[[2]]
# he she
#1 20 30
#2 21 31
#3 22 32
#4 23 33
#5 24 34
#....
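One caveat with the match()-based lookup is that any column missing from the lookup table ends up with an NA name. A possible variation, sketched below, uses a named character vector and renames only the columns it knows about. The helper name `rename_known`, the example data and the new names are all hypothetical.

```r
my_list <- list(data.frame(A = 1:3, B = 4:6),
                data.frame(D = 7:9, Z = 10:12))

# Named vector: names are the old column names, values are the new ones
lookup <- c(A = "species", B = "abundance", D = "site")

# Hypothetical helper: rename only columns present in the lookup
rename_known <- function(x, lookup) {
  hit <- names(x) %in% names(lookup)
  names(x)[hit] <- unname(lookup[names(x)[hit]])
  x
}

renamed <- lapply(my_list, rename_known, lookup = lookup)
lapply(renamed, names)
```

Here column Z has no entry in the lookup, so it keeps its original name.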

How can sentences contained in a cell be split into different rows in R [closed]

I tried several times and it does not work.
How can I split sentences contained in a cell into different rows maintaining the rest of the values?
Example:
Dataframe df has 20 columns.
Row j, Column i contains some comments which are separated by " | "
I want to have a new dataframe df2 whose number of rows increases depending on the number of sentences.
This means, if cell j,i has Sentence A | Sentence B
Row j, Column i has Sentence A
Row j+1, Column i has Sentence B
Columns 1 to i-1 and i+1 to 20 have the same value in rows j and j+1.
I do not know if this has an easy solution.
Thank you very much.
We could use cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'col3', sep = "\\|", direction = "long", fixed = FALSE)
# col1 col2 col3
#1: a 1 fitz
#2: a 1 buzz
#3: b 2 foo
#4: b 2 bar
#5: c 3 hello world
#6: c 3 today is Thursday
#7: c 3 its 2:00
#8: d 4 fitz
data
df <- structure(list(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, 3,
4), col3 = c("fitz|buzz", "foo|bar", "hello world|today is Thursday | its 2:00",
"fitz")), class = "data.frame", row.names = c(NA, -4L))
Here is a solution using 3 tidyverse packages that accounts for an unknown maximum number of comments
library(dplyr)
library(tidyr)
library(stringr)
# Create function to calculate the max number comments per observation within
# df$col3 and create a string of unique "names"
cols <- function(x) {
cmts <- str_count(x, "([|])")
max_cmts <- max(cmts, na.rm = TRUE) + 1
features <- c(sprintf("V%02d", seq(1, max_cmts)))
}
# Create the data
df1 <- data.frame(col1 = c("a", "b", "c", "d"),
col2 = c(1, 2, 3, 4),
col3 = c("fitz|buzz", NA,
"hello world|today is Thursday | its 2:00|another comment|and yet another comment", "fitz"),
stringsAsFactors = FALSE)
# Generate the desired output
df2 <- separate(df1, col3, into = cols(x = df1$col3),
sep = "([|])", extra = "merge", fill = "right") %>%
pivot_longer(cols = cols(x = df1$col3), values_to = "comments",
values_drop_na = TRUE) %>%
select(-name)
Which results in
df2
# A tibble: 8 x 3
col1 col2 comments
<chr> <dbl> <chr>
1 a 1 "fitz"
2 a 1 "buzz"
3 c 3 "hello world"
4 c 3 "today is Thursday "
5 c 3 " its 2:00"
6 c 3 "another comment"
7 c 3 "and yet another comment"
8 d 4 "fitz"
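If adding packages is not an option, the same reshaping can be sketched in base R: split the comment column, repeat each row once per piece, and overwrite the column with the pieces. This is an illustrative alternative rather than either answer's method. It assumes the column is called col3, and it trims the whitespace around each "|" as part of the split pattern.

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c(1, 2, 3, 4),
                 col3 = c("fitz|buzz", "foo|bar",
                          "hello world|today is Thursday | its 2:00", "fitz"),
                 stringsAsFactors = FALSE)

# Split on "|" and swallow any whitespace on either side of it
parts <- strsplit(df$col3, "[[:space:]]*[|][[:space:]]*")

# Repeat every row once per comment; all other columns are copied as-is
df2 <- df[rep(seq_len(nrow(df)), lengths(parts)), ]
df2$col3 <- unlist(parts)
rownames(df2) <- NULL
df2
```

Because whole rows are repeated, this works unchanged for a data frame with 20 columns.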

Convert all empty & fields marked with "N/A" as NA in R

I am new to Machine Learning & R, so my question is a pretty basic one:
I have imported a dataset and performed some modifications and stored the final output in a dataframe named df_final.
Now I would like to replace all the empty fields and fields containing "N/A" or "n/a" with NA, so that I can use R's built-in NA handling.
Any help in this context would be highly appreciated.
Cheers!
Vivek
I agree that the problem is best solved at read-in, by setting na.strings = c("", "N/A", "n/a") in read.table, as suggested by @Darren Tsai. If that's no longer an option because you've already processed the data and, as I suspect, you do not want to keep only complete cases, as suggested by @Rui Barradas, then the issue can be addressed this way:
DATA:
df_final <- data.frame(v1 = c(1, "N/A", 2, "n/a", "", 3),
v2 = c("a", "", "b", "c", "d", "N/A"))
df_final
v1 v2
1 1 a
2 N/A
3 2 b
4 n/a c
5 d
6 3 N/A
SOLUTION:
To introduce NA into empty fields, you can do:
df_final[df_final==""] <- NA
df_final
v1 v2
1 1 a
2 N/A <NA>
3 2 b
4 n/a c
5 <NA> d
6 3 N/A
To change the other values into NA, you can use lapply and a function:
df_final[,1:2] <- lapply(df_final[,1:2], function(x) gsub("N/A|n/a", NA, x))
df_final
v1 v2
1 1 a
2 <NA> <NA>
3 2 b
4 <NA> c
5 <NA> d
6 3 <NA>
This is a two-step solution:
1. Replace the bad values with real NA values.
2. Keep the complete cases.
In base R:
is.na(df1) <- sapply(df1, function(x) x %in% c("", "N/A", "n/a"))
df_final <- df1[complete.cases(df1), , drop = FALSE]
df_final
# x y
#1 a u
#3 d v
Data creation code.
df1 <- data.frame(x = c("a", "N/A", "d", "n/a", ""),
y = c("u", "", "v", "x", "y"))
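If every row should be kept and only whole-cell matches replaced, a compact variation is sketched below. It is an illustration rather than either answer's method: it matches exact values only (so a cell such as "n/a later" would be left alone), and the final type.convert() step is optional and simply lets R re-detect numeric columns. The name `na_tokens` is a placeholder.

```r
df_final <- data.frame(v1 = c(1, "N/A", 2, "n/a", "", 3),
                       v2 = c("a", "", "b", "c", "d", "N/A"),
                       stringsAsFactors = FALSE)

# Values that should be treated as missing (exact matches only)
na_tokens <- c("", "N/A", "n/a")

# Replace matching cells column by column; df_final[] keeps the data frame shape
df_final[] <- lapply(df_final, function(x) {
  x[x %in% na_tokens] <- NA
  x
})

# Optional: re-detect column types now that the placeholder strings are gone
df_final[] <- lapply(df_final, type.convert, as.is = TRUE)
str(df_final)
```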
