Complex lag inheritance with dplyr - r

Firstly "complex lag inheritance" may not be the clearest title, so suggestions welcome. I have a large dataset of ordered segmented strings that I need to group by stem matching of segments. This looping example demonstrates the required logic:
library(tidyverse)
x <- tibble(name = c('smith', 'smith.james', 'smith.jill',
                     'taylor', 'taylor.ian', 'walker', 'walker.john', 'walker.john.sid',
                     'reed.snow', 'reed.snow.harry', 'reed.snow.helen.jane'),
            family_name = NA_character_)
x$family_name[1] <- x$name[1]
for (i in 2:nrow(x)) {
  # if the current record matches the previous record's family assignment,
  # inherit that family_name; otherwise start a new family from this name
  family_match <- str_detect(string = x$name[i],
                             pattern = paste0('^', x$family_name[i - 1], '[.]'))
  x$family_name[i] <- ifelse(family_match, x$family_name[i - 1], x$name[i])
}
print(x)
#> # A tibble: 11 x 2
#>    name                 family_name
#>    <chr>                <chr>
#>  1 smith                smith
#>  2 smith.james          smith
#>  3 smith.jill           smith
#>  4 taylor               taylor
#>  5 taylor.ian           taylor
#>  6 walker               walker
#>  7 walker.john          walker
#>  8 walker.john.sid      walker
#>  9 reed.snow            reed.snow
#> 10 reed.snow.harry      reed.snow
#> 11 reed.snow.helen.jane reed.snow
I have tried this looping approach and it is not feasible given the data size, so the alternative is a vectorised dplyr approach (or Python).
The heart of the problem is that each family_name assignment is based on a match against either the current record's name (when inferring a new family name) or the previous record's family_name. I don't see how to reconcile this logic with an approach using e.g. pmap_chr, but if I'm wrong I'd love to know how.

This uses no regular expressions and no explicit loops, although internally Reduce uses a loop. No packages are used.
Names <- paste0(x$name, ".")
iter <- function(x, y) if (startsWith(y, x)) x else y
Reduce(iter, Names, accumulate = TRUE)
giving:
[1] "smith."     "smith."     "smith."     "taylor."    "taylor."
[6] "walker."    "walker."    "walker."    "reed.snow." "reed.snow."
[11] "reed.snow."

Related

How to decode base64 strings in a vectorized way within dplyr::mutate?

I have a tibble which contains a column of base64-encoded strings like so:
mytib <- tibble(encoded_var = c("VGVzdGluZ3Rlc3Rpbmc=", "QW5vdGhlcnRlc3Q="))
When I try to decode it with base64::base64decode
mytib %>%
  mutate(decoded_var = base64decode(encoded_var))
I receive an error:
Error in `mutate()`:
! Problem while computing `decoded_var = base64decode(encoded_var)`.
x `decoded_var` must be size 2 or 1, not 25.
I'm looking to have a tibble with a column of decoded, human-readable base64 strings. I'd also like to do that using the mutate tidyverse syntax. How can I achieve that?
Update: The tibble should look like this:
# A tibble: 2 × 2
  encoded_var          decoded_var
  <chr>                <chr>
1 VGVzdGluZ3Rlc3Rpbmc= Testingtesting
2 QW5vdGhlcnRlc3Q=     Anothertest
base64enc::base64decode produces a raw vector, so you need to carry out the conversion rowwise and wrap the result with rawToChar:
mytib %>%
  rowwise() %>%
  mutate(decoded_var = rawToChar(base64decode(encoded_var)))
#> # A tibble: 2 x 2
#> # Rowwise:
#>   encoded_var          decoded_var
#>   <chr>                <chr>
#> 1 VGVzdGluZ3Rlc3Rpbmc= Testingtesting
#> 2 QW5vdGhlcnRlc3Q=     Anothertest
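If you'd rather not use rowwise(), a minimal alternative is to handle the per-element conversion with purrr::map_chr() (same base64enc::base64decode() and rawToChar() calls as above):
library(tidyverse)
library(base64enc)

mytib %>%
  mutate(decoded_var = map_chr(encoded_var, ~ rawToChar(base64decode(.x))))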
The problem is that the caTools::base64decode function only works on one string at a time, because a single string could contain several values. If you always have a single character value in your variable, then you can vectorize it:
library(tidyverse)
mytib <- tibble(encoded_var = c("VGVzdGluZ3Rlc3Rpbmc=", "QW5vdGhlcnRlc3Q="))
mytib %>%
  mutate(decoded_var = Vectorize(caTools::base64decode)(encoded_var, "character"))
#> # A tibble: 2 × 2
#>   encoded_var          decoded_var
#>   <chr>                <chr>
#> 1 VGVzdGluZ3Rlc3Rpbmc= Testingtesting
#> 2 QW5vdGhlcnRlc3Q=     Anothertest
Created on 2022-03-14 by the reprex package (v2.0.1)
EDITED TO ADD: Actually, there are (at least) four different packages that provide base64decode functions. I used caTools. There are also versions in the processx, xfun and base64enc packages. (The one in xfun is actually named base64_decode.) This is why it's important to show reproducible code here on Stack Overflow. The reprex package makes this very easy.

Searching a single column for multiple values at once in R

I am working with a trade dataset, and I need to subset out all rows which represent goods going to China, Korea, the Dominican Republic, and several others. I can programmatically create this list, and I know how to subset the trade dataset for any one of these countries, but not for all of them at once. What I have tried is using the which() function.
DesiredSubset = TotalTradeData[which(TotalTradeData$Destination.Code == c(List of desired country codes)), ]
This runs but produces this:
Warning message: In DesiredSubset = TotalTradeData[which(TotalTradeData$Destination.Code == : longer object length is not a multiple of shorter object length
It subsets out some rows, but nowhere close to all of the ones that I need.
I'm pretty sure this would work if I just typed all the codes with | between them as an OR operator, but I have to do this for dozens of codes dozens of times, so that isn't practical.
How can I subset out all of the rows containing any one of the country codes in my list?
Welcome to Stack Overflow. Here it is always a good idea to share a minimal reproducible example of your data and, if necessary, an example of your desired output.
In your case you are using the wrong operator: use %in% instead of ==. With ==, R recycles the shorter vector and compares element by element (which is what produced the warning), whereas %in% tests each value for membership in the whole set.
# The data
TotalTradeData <- data.frame(
  Destination.Code = c('COL', 'DOM', 'KOR', 'CHINA', 'USA', 'BRA'),
  variable1 = letters[1:6]
)
TotalTradeData
#>   Destination.Code variable1
#> 1              COL         a
#> 2              DOM         b
#> 3              KOR         c
#> 4            CHINA         d
#> 5              USA         e
#> 6              BRA         f
TotalTradeData[TotalTradeData$Destination.Code %in% c('DOM', 'KOR', 'CHINA'), ]
#>   Destination.Code variable1
#> 2              DOM         b
#> 3              KOR         c
#> 4            CHINA         d
Created on 2022-03-27 by the reprex package (v2.0.1)
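For consistency with the dplyr examples elsewhere on this page, the same membership test also works inside filter(); a minimal equivalent of the base-R subset above:
library(dplyr)

TotalTradeData %>%
  filter(Destination.Code %in% c('DOM', 'KOR', 'CHINA'))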

Adding a column to a df based on comparison with a list through strsplit() in R

I've been working on something for a while now and still haven't figured out how to get it to work in my preferred way. Hoping someone can help me:
I have a dataframe containing lots of data (5000+ obs) about city budgets, so one of the variable names is obviously 'city'. I have a separate list of 40 cities that I want to check against this dataframe: essentially, for each city name in the df, conditionally check whether it is also on the separate list (and if so, code it 1; or else 0). I made an example below with a smaller dataset:
city <- c(rep("city_a", 8), rep("city_b", 5), rep("city_c", 4), rep("city_d", 7),
          rep("city_e", 3), rep("city_f", 9), rep("city_g", 4))
school <- c(1:8, 1:5, 1:4, 1:7, 1:3, 1:9, 1:4)
df <- data.frame(city, school)
seperate_list <- tolower("City_A, City_B, City_E, City_G")
seperate_list <- gsub('[,]', '', seperate_list)
seperate_list <- strsplit(seperate_list, " ")[[1]]
Note: you may ask why I do the second part like that. My dataset is much larger, and I wanted to find a way to make the process more automatic, so that I wouldn't have to manually delete all the commas and separate the city names from one another. Now that I have df and seperate_list, I want to combine them in df by adding a third column that specifies whether (1) or not (0) each city is in the separate list. I've tried using a for loop and also lapply, but with no luck, since I'm not very skilled with either of those yet.
I would appreciate a hint, so I can figure the rest out myself!
library(tidyverse)
city <- c(rep("city_a", 8), rep("city_b", 5), rep("city_c", 4), rep("city_d", 7),
          rep("city_e", 3), rep("city_f", 9), rep("city_g", 4))
school <- c(1:8, 1:5, 1:4, 1:7, 1:3, 1:9, 1:4)
df <- data.frame(city, school)
seperate_list <- tolower("City_A, City_B, City_E, City_G")
seperate_list <- gsub('[,]', '', seperate_list)
seperate_list <- strsplit(seperate_list, " ")[[1]]
df %>%
  mutate(in_list = city %in% seperate_list) %>%
  as_tibble()
#> # A tibble: 40 x 3
#>    city   school in_list
#>    <chr>   <int> <lgl>
#>  1 city_a      1 TRUE
#>  2 city_a      2 TRUE
#>  3 city_a      3 TRUE
#>  4 city_a      4 TRUE
#>  5 city_a      5 TRUE
#>  6 city_a      6 TRUE
#>  7 city_a      7 TRUE
#>  8 city_a      8 TRUE
#>  9 city_b      1 TRUE
#> 10 city_b      2 TRUE
#> # … with 30 more rows
Created on 2021-09-09 by the reprex package (v2.0.1)
You might also look into joining tables, making the list of interest a column of another table. This is what databases and relational algebra are made for.
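A sketch of that join-based idea (the list_df and in_list names here are illustrative, not from the comment): promote the list to a one-column data frame with a flag, left-join it on, and fill the misses with 0 to get the 1/0 coding the question asked for.
library(tidyverse)

list_df <- tibble(city = seperate_list, in_list = 1)

df %>%
  left_join(list_df, by = "city") %>%
  mutate(in_list = replace_na(in_list, 0))
Alternatively, wrapping the %in% result from the answer above in as.integer() converts TRUE/FALSE to 1/0 directly.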

Group by name and add up the columns count in r

I have a dataset with 405 observations and 39 variables, but just two columns are important for the further analysis.
I would like to group together rows whose names in the first column share a stem and add up their numbers from the second column.
A reproducible dataset looks like this:
df1 <- data.frame(name = c("Google Ads", "Google Doubleclick", "Facebook Login",
                           "Facebook Ads", "Twitter MoPub", "Flurry", "Amazon advertisment",
                           "Microsoft ", "Ad4screen", "imobi"),
                  value = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                  unimportant = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
The outcome should be in a new data.frame and look like this:
df2 <- data.frame(name = c("Google", "Facebook", "Twitter", "Flurry", "Amazon", "Microsoft", "Others"),
                  value = c(30, 70, 50, 60, 70, 80, 190))
A tidyverse way of doing it:
First store all valid names in a vector, say valid_names.
Thereafter create a new column, say all_names, in df1 by:
first splitting each string at spaces ' ' using str_split;
then using purrr::map_chr() to check whether any of the split strings matches one of your valid_names; if yes, retrieve that string, otherwise return 'others'.
Thereafter group_by on this field. (I omitted the separate mutate step and created the new field directly in the group_by statement; that works.)
Finally, summarise the important values as desired.
valid_names <- c("Google", "Facebook", "Twitter", "Flurry", "Amazon", "Microsoft")
valid_names
#> [1] "Google"    "Facebook"  "Twitter"   "Flurry"    "Amazon"    "Microsoft"
df1 <- data.frame(name = c("Google Ads", "Google Doubleclick", "Facebook Login",
                           "Facebook Ads", "Twitter MoPub", "Flurry", "Amazon advertisment",
                           "Microsoft ", "Ad4screen", "imobi"),
                  value = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                  unimportant = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
df1
#>                   name value unimportant
#> 1           Google Ads    10           1
#> 2   Google Doubleclick    20           2
#> 3       Facebook Login    30           3
#> 4         Facebook Ads    40           4
#> 5        Twitter MoPub    50           5
#> 6               Flurry    60           6
#> 7  Amazon advertisment    70           7
#> 8           Microsoft     80           8
#> 9            Ad4screen    90           9
#> 10               imobi   100          10
library(tidyverse)
df1 %>%
  group_by(all_names = str_split(name, ' '),
           all_names = map_chr(all_names, ~ ifelse(any(.x %in% valid_names),
                                                   .x[.x %in% valid_names], 'others'))) %>%
  summarise(value = sum(value), .groups = 'drop')
#> # A tibble: 7 x 2
#>   all_names value
#>   <chr>     <dbl>
#> 1 Amazon       70
#> 2 Facebook     70
#> 3 Flurry       60
#> 4 Google       30
#> 5 Microsoft    80
#> 6 others      190
#> 7 Twitter      50
Created on 2021-06-22 by the reprex package (v2.0.0)
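If the stem always leads the string, a shorter variant (a sketch under that assumption, not part of the answer above) is to build one alternation pattern from valid_names and extract it, defaulting to 'others' when nothing matches:
library(tidyverse)

pattern <- str_c("^(", str_c(valid_names, collapse = "|"), ")\\b")

df1 %>%
  group_by(group = coalesce(str_extract(name, pattern), "others")) %>%
  summarise(value = sum(value), .groups = "drop")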
This works on the sample data using the adist function with partial=TRUE to look at partial string matches. It requires defining the known groups, rather than trying to find them, but I think that legwork is worth doing as it simplifies the problem a lot once the desired output is known.
df1 <- data.frame(name = c("Google Ads", "Google Doubleclick", "Facebook Login",
                           "Facbook Ads", "Twitter MoPub", "Flurry", "Amazon advertisment",
                           "Microsoft ", "Ad4screen", "imobi"),
                  value = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                  unimportant = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
# types we want to map; known is the groupings
types <- unique(df1$name)
known <- c("Google", "Facebook", "Twitter", "Flurry", "Amazon", "Microsoft")
# use distance measures, and look for matches on partial strings, e.g.
# ignore the Doubleclick part when matching on Google
distance <- adist(known, types, partial = TRUE)
# cap controls leniency in matching, e.g. Facbook and Facebook have a distance of 1,
# whilst Facebook and Facebook is a perfect match with a score of 0.
# Raise it to be more lenient
cap <- 1
# loop through the types
map_all <- sapply(seq_along(types), function(i) {
  # find the minimum value and check if it is below the cap. If so, assign to
  # the closest group, else assign to Others
  v <- min(distance[, i])
  if (v <= cap) {
    map_i <- known[which.min(distance[, i])]
  } else {
    map_i <- "Others"
  }
  map_i
})
# now merge into df1, then sum out using your preferred method
df_map <- data.frame(name = types, group = map_all)
df_merged <- merge(df1, df_map, by = "name")
df2 <- aggregate(value ~ group, sum, data = df_merged)
df2
      group value
1    Amazon    70
2  Facebook    70
3    Flurry    60
4    Google    30
5 Microsoft    80
6    Others   190
7   Twitter    50
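The merge/aggregate step can equally be done with dplyr, if you prefer to stay consistent with the rest of the thread (same df1 and df_map as above):
library(dplyr)

df1 %>%
  inner_join(df_map, by = "name") %>%
  group_by(group) %>%
  summarise(value = sum(value), .groups = "drop")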

Sentiment analysis (AFINN) in R

I am trying to score the sentiment of a dataset of Tweets using the AFINN dictionary (get_sentiments("afinn")). A sample of the dataset is provided below:
# A tibble: 10 x 2
   Date                TweetText
   <dttm>              <chr>
 1 2018-02-10 21:58:19 "RT #RealSirTomJones: Still got the moves! That was a lo~
 2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
 3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual
 4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
 5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
 6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK
 7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk
 8 2018-02-10 21:58:21 RT #tonysheps: Sir Thomas Jones you’re a bloody legend #~
 9 2018-02-10 21:58:22 #ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT #RealSirTomJones: Still got the moves! That was a lo~
What I want to do is:
1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the score of all the words of each Tweet
4. Return this sum into a new third column, so I can see the score per Tweet.
For a similar lexicon I found the following code:
# Initiate the scoreTopic results vector
scoreTopic <- 0
# Start a loop over the documents
for (i in seq_along(myCorpus)) {
  # Store separate words in a character vector
  terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
  # Determine the number of positive matches
  pos_matches <- sum(terms %in% positive_words)
  # Determine the number of negative matches
  neg_matches <- sum(terms %in% negative_words)
  # Store the difference in the results vector
  scoreTopic[i] <- pos_matches - neg_matches
} # End of the for loop
dsMyTweets$score <- scoreTopic
I am, however, not able to adjust this code to get it working with the AFINN dictionary.
This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).
library(tidytext)
library(tidyverse)
tweets <- tribble(
  ~tweetID, ~TweetText,
  1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
  2, "#darinself OMG, this is my favorite.",
  3, "#treycausey #ftrain THIS IS AMAZING.",
  4, "#nest No, no, not in error. Just the turkey!",
  5, "The #nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")
Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package allows you to use a special tokenizer for tweets) and the inner_join() implements the sentiment analysis.
tweet_sentiment <- tweets %>%
  unnest_tokens(word, TweetText, token = "tweets") %>%
  inner_join(get_sentiments("afinn"))
#> Joining, by = "word"
Now we can find the scores for each tweet. Take the original data set of tweets and left_join() on to it the sum() of the scores for each tweet. The handy function replace_na() from tidyr lets you replace the resulting NA values with zero.
tweets %>%
  left_join(tweet_sentiment %>%
              group_by(tweetID) %>%
              summarise(score = sum(score))) %>%
  replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#>   tweetID TweetText                                                   score
#>     <dbl> <chr>                                                       <dbl>
#> 1      1. Was Julie helping me because I don't know anything about …    4.
#> 2      2. #darinself OMG, this is my favorite.                          2.
#> 3      3. #treycausey #ftrain THIS IS AMAZING.                          4.
#> 4      4. #nest No, no, not in error. Just the turkey!                 -4.
#> 5      5. The #nest people should write a blog post about how many …    0.
Created on 2018-05-09 by the reprex package (v0.2.0).
If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.
For future reference:
Score_word <- function(x) {
  # look the word up in the AFINN lexicon; returns numeric(0) if it is absent
  word_bool_vec <- get_sentiments("afinn")$word == x
  score <- get_sentiments("afinn")$score[word_bool_vec]
  return(score)
}
Score_tweet <- function(sentence) {
  # split the tweet into words and sum their individual scores
  words <- unlist(strsplit(sentence, " "))
  scores <- unlist(sapply(words, Score_word))
  return(sum(scores))
}
# score each tweet text (apply(df, 1, ...) would pass whole rows, not the tweets)
dsMyTweets$score <- sapply(dsMyTweets$TweetText, Score_tweet)
This executes what I initially wanted! :)
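For larger datasets, the same lookup can be vectorised by building a named score vector once rather than scanning the lexicon for every word; a sketch assuming the lexicon column is named score as above (newer tidytext releases name it value):
library(tidytext)

afinn <- get_sentiments("afinn")
lookup <- setNames(afinn$score, afinn$word)

score_tweets <- function(tweets) {
  # words missing from the lexicon index as NA, so drop them from the sum
  sapply(strsplit(tweets, " "), function(words) sum(lookup[words], na.rm = TRUE))
}

dsMyTweets$score <- score_tweets(dsMyTweets$TweetText)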
