How to remove columns/rows having empty value? - r

I have 3 columns:
type <- c("Tv show", "Movie", "Movie")
title <- c("Norm of the North: King Sized Adventure", "Jandino: Whatever it Takes", "Transformers Prime")
director <- c("Richard Finn and Tim Maltby", "", "")
The 3rd column (director) has only one value.
How do I remove the rows with empty values?

One way to remove the rows with empty cells is this:
Illustrative data:
df <- data.frame(
type = c("Tv show", "Movie", "Sitcom"),
title = c("Norm of the North", "King Sized Adventure","Whatever it Takes"),
director = c("Richard Finn", "Tim Maltby", "")
)
First, transform empty cells to NA:
df[df==""] <- NA
df
     type                title     director
1 Tv show    Norm of the North Richard Finn
2   Movie King Sized Adventure   Tim Maltby
3  Sitcom    Whatever it Takes         <NA>
Then, remove rows with NA using na.omit:
na.omit(df)
     type                title     director
1 Tv show    Norm of the North Richard Finn
2   Movie King Sized Adventure   Tim Maltby
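The question title also mentions columns, which na.omit does not handle. Here is a minimal sketch, assuming the same illustrative df with empty strings already converted to NA. The names rows_kept and cols_kept are only for illustration.

```r
df <- data.frame(
  type = c("Tv show", "Movie", "Sitcom"),
  title = c("Norm of the North", "King Sized Adventure", "Whatever it Takes"),
  director = c("Richard Finn", "Tim Maltby", ""),
  stringsAsFactors = FALSE
)
df[df == ""] <- NA

# Rows: keep only the rows that contain no NA (the same rows na.omit(df) keeps)
rows_kept <- df[complete.cases(df), ]

# Columns: keep only the columns that contain no NA
cols_kept <- df[, colSums(is.na(df)) == 0, drop = FALSE]
```

drop = FALSE keeps the result a data frame even if only one column survives.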

Related

What is the quickest way to create new column of all unique last names for a given first name in Base R

I have a dataset similar to the following (but larger):
dataset <- data.frame(First = c("John","John","Andy","John"), Last = c("Lewis","Brown","Alphie","Johnson"))
I would like to create a new column that contains each unique last name corresponding to the given first name. Thus, each observation of "John" would have c("Lewis", "Brown", "Johnson") in the third column.
I'm a bit perplexed because my attempts at vectorization seem impossible given I can't reference the particular observation I'm looking at. Specifically, what I want to write is:
dataset$allLastNames <- unique(dataset$Last[dataset$First == "the current index???"])
I think this can work in a loop (since I reference the observation with 'i'), but it is taking too long given the size of my data:
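dataset$allLastNames <- vector("list", nrow(dataset)) # list column: each cell holds several names
for(i in seq_len(nrow(dataset))){
dataset$allLastNames[[i]] <- unique(dataset$Last[dataset$First == dataset$First[i]])
}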
Any suggestions for how I could make this work (using Base R)?
Thanks!
You can do this with the dplyr package in a few lines. First, group by first name and collect the unique last names of each group in a list:
library(dplyr)
list_names = dataset %>%
group_by(First) %>%
summarise(allLastNames = list(unique(Last)))
Then, add the summary table to your dataset matching the First names:
dataset %>% left_join(list_names,by='First')
  First    Last          allLastNames
1  John   Lewis Lewis, Brown, Johnson
2  John   Brown Lewis, Brown, Johnson
3  Andy  Alphie                Alphie
4  John Johnson Lewis, Brown, Johnson
More generally, R offers vectorized and grouped operations on data frames and arrays, so explicit for-loops can usually be avoided.
Base R option:
allLastNames <- aggregate(.~First, dataset, paste, collapse = ",")
dataset <- merge(dataset, allLastNames, by = "First")
names(dataset) <- c("First", "Last", "allLastNames")
Output:
  First    Last        allLastNames
1  Andy  Alphie              Alphie
2  John   Lewis Lewis,Brown,Johnson
3  John   Brown Lewis,Brown,Johnson
4  John Johnson Lewis,Brown,Johnson
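Note that the aggregate() call above pastes every last name it sees, so a repeated last name would appear more than once, and merge() reorders the rows. If a list column of unique names in the original row order is wanted, as in the question, a base R sketch along these lines may help. The helper name last_by_first is only illustrative.

```r
dataset <- data.frame(
  First = c("John", "John", "Andy", "John"),
  Last  = c("Lewis", "Brown", "Alphie", "Johnson"),
  stringsAsFactors = FALSE
)

# One unique() call per first name instead of one per row
last_by_first <- lapply(split(dataset$Last, dataset$First), unique)

# Look up each row's first name in the per-name list;
# unname() drops the list names so the column holds plain character vectors
dataset$allLastNames <- unname(last_by_first[dataset$First])
```

Because the lookup is a single indexing step, the work per row no longer grows with the size of the data the way the loop in the question does.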
library(dplyr)
library(stringr)
dataset %>%
group_by(First) %>%
mutate(Lastnames = str_flatten(Last, ', '))
# Groups:   First [2]
  First Last    Lastnames
  <chr> <chr>   <chr>
1 John  Lewis   Lewis, Brown, Johnson
2 John  Brown   Lewis, Brown, Johnson
3 Andy  Alphie  Alphie
4 John  Johnson Lewis, Brown, Johnson

str_extract() and summarise() gives me na row

This should be pretty straightforward, as I think I'm just looking for verification about what I'm seeing.
I'm trying to use str_extract() to pull areas of interest out of a column in my data frame, and then count how often each word appears. I'm running into an issue though where when I do this, the data frame I produce has NA listed in one of the rows. This is confusing to me, because I don't know what is causing it or if it is a sign of an error in my code. I'm not sure how to fix this.
Additionally, note that the last item in words is "the table is light", which contains two of the words of interest in this example. I've done this intentionally because I want to make sure that it will be counted twice.
library(tidyverse)
df <- data.frame(words =c("paper book", "food press", "computer monitor", "my fancy speakers",
"my two dogs", "the old couch", "the new couch", "loud speakers",
"wasted paper", "put the dishes away", "set the table", "put it on the table",
"lets go to church", "turn out the lights", "why are the lights on",
"the table is light"))
keep <- c("dogs|paper|table|light|couch")
new_df <- df %>%
mutate(Subject = str_extract(words, keep), n = n()) %>%
group_by(Subject)%>%
summarise(`Word Count` = length(Subject))
This is what I'm getting now
  Subject `Word Count`
  <chr>          <int>
1 couch              2
2 dogs               1
3 light              2
4 paper              2
5 table              3
6 NA                 6
So my question is- what is causing the NA row in Subject? Is it all other records?
The NA appears for the rows in which none of the words in keep occurs, so str_extract() has nothing to extract:
library(dplyr)
library(stringr)
df %>% mutate(Subject = str_extract(words, keep))
#                   words Subject
#1             paper book   paper
#2             food press    <NA>
#3       computer monitor    <NA>
#4      my fancy speakers    <NA>
#5            my two dogs    dogs
#6          the old couch   couch
#7          the new couch   couch
#8          loud speakers    <NA>
#9           wasted paper   paper
#10   put the dishes away    <NA>
#11         set the table   table
#12   put it on the table   table
#13     lets go to church    <NA>
#14   turn out the lights   light
#15 why are the lights on   light
#16    the table is light   table
For example, the 2nd row, 'food press', contains none of the alternatives in "dogs|paper|table|light|couch", so str_extract() returns NA for it.
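The output above also shows that 'the table is light' yields only table, because str_extract() returns just the first match per string. If every match should be counted, as the question intends, one possible sketch uses base R's gregexpr() and regmatches(); stringr's str_extract_all() follows the same idea. The data here is a shortened copy of the question's words, and all_matches and counts are illustrative names.

```r
df <- data.frame(
  words = c("paper book", "food press", "my two dogs",
            "set the table", "the table is light"),
  stringsAsFactors = FALSE
)
keep <- "dogs|paper|table|light|couch"

# gregexpr() locates every match in each string;
# regmatches() extracts them as a list with one element per string
all_matches <- regmatches(df$words, gregexpr(keep, df$words))

# Flatten and tabulate: strings without any match contribute nothing,
# so no NA row appears in the counts
counts <- table(unlist(all_matches))
```

With this approach 'the table is light' contributes one count to table and one to light.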

Split strings into utterances and assign same-speaker utterances to columns in dataframe

I have multi-party conversations in strings like this:
convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."
I also have a vector with the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
I'd like to create a dataframe with the utterances by each individual speaker in a separate column. I can only do this task in a piecemeal fashion, by addressing each speaker specifically using the indices in speakers, and then combine the separate results in a list but what I'd really like to have is a dataframe with separate columns for each speaker:
library(stringr)
Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
df <- list(
Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya" "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm" "Great!"
$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend." "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"
$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?" "ah y' know, camping with my girl friend."
How can I extract the same-speaker utterances not one by one but in one go and how can the results be assigned not to a list but a dataframe?
With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:
# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")
# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]
# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))
Which gives:
speaker text
1 Peter Hiya
2 Mary Hi. How w'z your weekend.
3 Peter a::hh still got a headache. An' you (.) party a lot?
4 Mary nuh, you know my kid's sick 'n stuff
5 Peter yeah i know that's=erm
6 al hamshi hey guys how's it goin'?
7 Peter Great!
8 Mary where've you BEn last week
9 al hamshi ah y' know, camping with my girl friend.
To get each speaker into their own column:
# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))
# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))
cnt Peter Mary al hamshi
1 1 Hiya Hi. How w'z your weekend. hey guys how's it goin'?
3 2 a::hh still got a headache. An' you (.) party a lot? nuh, you know my kid's sick 'n stuff ah y' know, camping with my girl friend.
5 3 yeah i know that's=erm where've you BEn last week <NA>
7 4 Great! <NA> <NA>
If newlines already exist in your text, choose another character that does not occur in it to split the string on.
You can append :\\s to each speaker name, as you are already doing, then use gregexpr to find the positions where a speaker starts. Extract the matches with regmatches and strip the appended :\\s to get the speaker. Call regmatches again, this time with invert = TRUE, to get the utterances. split then groups the utterances by speaker. To turn this into the desired data.frame, pad each speaker's vector with NA so that all have the same length, done here with [ inside lapply:
x <- gregexpr(paste0(speakers, ":\\s", collapse="|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])
z <- trimws(regmatches(convers, x, TRUE)[[1]][-1])
tt <- split(z, y)
do.call(data.frame, lapply(tt, "[", seq_len(max(lengths(tt)))))
# al.hamshi Mary Peter
#1 hey guys how's it goin'? Hi. How w'z your weekend. Hiya
#2 ah y' know, camping with my girl friend. nuh, you know my kid's sick 'n stuff a::hh still got a headache. An' you (.) party a lot?
#3 <NA> where've you BEn last week yeah i know that's=erm
#4 <NA> <NA> Great!

How to go from long to wide dataframe in R with multiple values separated by a comma in focal column [duplicate]

Say I have a list of movies with their directors. I want to convert these directors to dummy variables (i.e. if a director directs a movie, they have their own column with a 1, if they don't direct that movie then that column has a zero). This is tricky because there are occasionally movies with two directors. See example below. df is the data I have, df2 is what I want.
movie <- c("Star Wars V", "Jurassic Park", "Terminator 2")
budget <- c(100,300,400)
director <- c("George Lucas, Lawrence Kasdan", "Steven Spielberg", "Steven Spielberg")
df <- data.frame(movie,budget,director)
df
movie <- c("Star Wars V", "Jurassic Park", "Terminator 2")
budget <- c(100,300,400)
GeorgeLucas <- c(1,0,0)
LawrenceKasdan <- c(1,0,0)
StevenSpielberg <- c(0,1,1)
df2 <- data.frame(movie, budget, GeorgeLucas, LawrenceKasdan, StevenSpielberg)
df2
One option is cSplit_e() from the splitstackshape package:
library(splitstackshape)
library(dplyr)
library(stringr)
cSplit_e(df, 'director', sep=", ", type = 'character', fill = 0, drop = TRUE) %>%
rename_at(vars(starts_with('director_')), ~ str_remove(., 'director_'))
#          movie budget George Lucas Lawrence Kasdan Steven Spielberg
#1   Star Wars V    100            1               1                0
#2 Jurassic Park    300            0               0                1
#3  Terminator 2    400            0               0                1
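If adding a package is not an option, the same idea can be sketched in base R: split the director strings, collect the distinct names, and test membership per movie. This is only an illustrative alternative to cSplit_e, and director_list, all_directors and dummies are hypothetical names.

```r
movie <- c("Star Wars V", "Jurassic Park", "Terminator 2")
budget <- c(100, 300, 400)
director <- c("George Lucas, Lawrence Kasdan", "Steven Spielberg", "Steven Spielberg")
df <- data.frame(movie, budget, director, stringsAsFactors = FALSE)

# Split each director string on ", " into one character vector per movie
director_list <- strsplit(df$director, ", ", fixed = TRUE)
all_directors <- sort(unique(unlist(director_list)))

# One 0/1 column per director: 1 if the name occurs in that movie's vector
dummies <- sapply(all_directors, function(d) {
  as.integer(vapply(director_list, function(x) d %in% x, logical(1)))
})

# Combine the dummy columns with the original movie and budget columns
df2 <- cbind(df[c("movie", "budget")], dummies)
```

The sketch assumes more than one movie, so that sapply() returns a matrix with one column per director.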

Group words (from defined list) into themes in R

I am new to Stack Overflow and trying to learn R.
I want to find a set of defined words in a text and return the count of each word in a table, together with the theme I have assigned to it.
Here is my attempt:
text <- c("Green fruits are such as apples, green mangoes and avocados are good for high blood pressure. Vegetables range from greens like lettuce, spinach, Swiss chard, and mustard greens are great for heart disease. When researchers combined findings with several other long-term studies and looked at coronary heart disease and stroke separately, they found a similar protective effect for both. Green mangoes are the best.")
library(qdap)
# Own defined lists
fruit <- c("apples", "green mangoes", "avocados")
veg <- c("lettuce", "spinach", "Swiss chard", "mustard greens")
# Split into sentences
stext <- strsplit(text, split="\\.")[[1]]
# Obtain and count occurrences
library(plyr)
fruitres <- laply(fruit, function(x) grep(x, stext))
vegres <- laply(veg, function(x) grep(x, stext))
# Quick check: this does not return 2 results for "green mangoes"
grep("green mangoes", stext)
# Trying with the stringr package
tag_ex <- paste0('(', paste(fruit, collapse = '|'), ')')
tag_ex
library(dplyr)
library(stringr)
themes = sapply(str_extract_all(stext, tag_ex), function(x) paste(x, collapse=','))[[1]]
themes
#Create data table
library(data.table)
data.table(fruit,fruitres)
Using the respective qdap and stringr packages I am unable to obtain a solution I desire.
Desired solution for fruits and veg combined in a table
apples          fruit  1
green mangoes   fruit  2
avocados        fruit  1
lettuce         veg    1
spinach         veg    1
Swiss chard     veg    1
mustard greens  veg    1
Any help will be appreciated. Thank you
I tried to generalize this to any number of vectors.
A tidyverse and stringr solution:
library(tidyverse)
library(stringr)
Create a data.frame of your vectors
data <- c("fruit","veg") # vector names
L <- map(data, ~get(.x))
names(L) <- data
long <- map_df(1:length(L), ~data.frame(category=rep(names(L)[.x]), type=L[[.x]]))
# You may receive warnings about coercing to characters
# category type
# 1 fruit apples
# 2 fruit green mangoes
# 3 fruit avocados
# etc
To count instances of each
long %>%
mutate(count=str_count(tolower(text), tolower(type)))
Output
category type count
1 fruit apples 1
2 fruit green mangoes 2
3 fruit avocados 1
4 veg lettuce 1
# etc
Extra stuff
We can add another vector easily
health <- c("blood", "heart")
data <- c("fruit","veg", "health")
# code as above
Extra output (tail)
6 veg Swiss chard 1
7 veg mustard greens 1
8 health blood 1
9 health heart 2
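As for why the quick grep() check in the question does not report 2 for "green mangoes": grep() returns the indices of the elements that match rather than a count, and it is case-sensitive by default, so a sentence starting with "Green mangoes" is skipped. The sketch below illustrates this and counts occurrences with base R only. The text is a shortened, hypothetical stand-in for the question's text, and count_occurrences is an illustrative helper.

```r
text <- "Green mangoes are good. I like green mangoes and apples. Green mangoes are the best."
sentences <- strsplit(text, split = "\\.")[[1]]

# Indices of the sentences containing the lower-case term (not a match count)
idx <- grep("green mangoes", sentences)

# Count every occurrence of a term in a string, ignoring case
count_occurrences <- function(term, x) {
  hits <- gregexpr(tolower(term), tolower(x), fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when there is no match
}

terms <- c("apples", "green mangoes")
counts <- vapply(terms, count_occurrences, integer(1), x = text)
```

Lower-casing both the term and the text mirrors the tolower() calls in the answer above.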
