Give string values in vector an auto index - r

I have two vectors:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
I need to show the slowest person. I did it with if and else if statements, but I wonder whether there is an easier way, such as automatically assigning "Slow" = 1, "Average" = 2 and so on. In other words, I want to attach numeric values to the labels.
At the end it should be a vector like
names_speeds <- c(names_of_p, speeds)
so that I can compare persons and find out who is faster.

You could turn speeds into an ordered factor, which would preserve the labeling while also creating an underlying numerical representation:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
speeds <- factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
names_of_p[order(speeds)]
[1] "John" "Robert" "James" "Adam"
names_of_p[as.numeric(speeds) < 3]
[1] "John" "James" "Robert"
It might also be a good idea to store the data in a data frame rather than in separate vectors:
library(tidyverse)
df <- data.frame(
names_of_p = names_of_p,
speeds = factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
)
df %>%
arrange(speeds)
names_of_p speeds
<chr> <ord>
1 John Slow
2 Robert Slow
3 James Average
4 Adam Fast
df %>%
filter(as.numeric(speeds) < 3)
names_of_p speeds
<chr> <ord>
1 John Slow
2 James Average
3 Robert Slow
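Because an ordered factor supports comparison operators and min()/max() directly, the as.numeric() step can often be skipped. A minimal sketch in base R, using the same data as the question (the result names slowest and slower_than_fast are made up for illustration):

```r
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- factor(
  c("Slow", "Fast", "Average", "Slow"),
  levels = c("Slow", "Average", "Fast"),
  ordered = TRUE
)

# min() respects the level order of an ordered factor
slowest <- names_of_p[speeds == min(speeds)]

# ordered factors can be compared against a level label
slower_than_fast <- names_of_p[speeds < "Fast"]
```

This only works when the factor is created with ordered = TRUE; on an unordered factor, min() and < are not meaningful.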

First assign names to the speeds vector so that you get a named vector.
After that you can use which():
names(speeds) <- names_of_p
which(speeds=="Slow")
John Robert
1 4
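If explicit numbers such as "Slow" = 1 are wanted, as the question asks, a named lookup vector is one possible approach. This is only a sketch: the names speed_rank and ranks are hypothetical, and it assumes speeds is still a character vector (indexing with a factor would use its integer codes instead of its labels).

```r
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")

# hypothetical lookup table: label -> numeric rank
speed_rank <- c(Slow = 1, Average = 2, Fast = 3)

# indexing by the character labels returns the rank of each person
ranks <- speed_rank[speeds]
names(ranks) <- names_of_p

# everyone who shares the lowest rank
slowest <- names(ranks)[ranks == min(ranks)]
```

The ranks vector can then be compared like any numeric vector, for example ranks["Adam"] > ranks["John"].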

how to get a sentiment score (and keep the sentiment words) in quanteda?

Consider this simple example
library(tibble)
library(quanteda)
tibble(mytext = c('this is a good movie',
'oh man this is really bad',
'quanteda is great!'))
# A tibble: 3 x 1
mytext
<chr>
1 this is a good movie
2 oh man this is really bad
3 quanteda is great!
I would like to perform some basic sentiment analysis, but with a twist. Here is my dictionary, stored into a regular tibble
mydictionary <- tibble(sentiment = c('positive', 'positive','negative'),
word = c('good', 'great', 'bad'))
# A tibble: 3 x 2
sentiment word
<chr> <chr>
1 positive good
2 positive great
3 negative bad
Essentially, I would like to count how many positive and negative words are detected in each sentence, but also keep track of the matching words. In other words, the output should look like
  mytext                          nb.pos nb.neg pos.words
1 this is a good and great movie       2      0 good, great
2 oh man this is really bad            0      1 bad
3 quanteda is great!                   1      0 great
How can I do that in quanteda? Is this possible?
Thanks!
Stay tuned for quanteda v. 2.1, in which we will have greatly expanded, dedicated functions for sentiment analysis. In the meantime, see below. Note that I made some adjustments, since the text in your expected output does not match your input text. Also, your pos.words column contains all sentiment words, not just the positive ones. Below, I compute both the positive matches and all sentiment matches.
# note the amended input text
mytext <- c(
"this is a good and great movie",
"oh man this is really bad",
"quanteda is great!"
)
mydictionary <- tibble::tibble(
sentiment = c("positive", "positive", "negative"),
word = c("good", "great", "bad")
)
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
# make the dictionary into a quanteda dictionary
qdict <- as.dictionary(mydictionary)
Now we can use the lookup functions to get to your final data.frame.
# get the sentiment scores
toks <- tokens(mytext)
df <- toks %>%
tokens_lookup(dictionary = qdict) %>%
dfm() %>%
convert(to = "data.frame")
names(df)[2:3] <- c("nb.neg", "nb.pos")
# get the matches for pos and all words
poswords <- tokens_keep(toks, qdict["positive"])
allwords <- tokens_keep(toks, qdict)
data.frame(
mytext = mytext,
df[, 2:3],
pos.words = sapply(poswords, paste, collapse = ", "),
all.words = sapply(allwords, paste, collapse = ", "),
row.names = NULL
)
##                           mytext nb.neg nb.pos   pos.words   all.words
## 1 this is a good and great movie      0      2 good, great good, great
## 2      oh man this is really bad      1      0                     bad
## 3             quanteda is great!      0      1       great       great
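For comparison, the same counting logic can be sketched without quanteda. This is not the quanteda approach above, just a rough base R illustration: the tokeniser is a crude assumption (lower-case, strip punctuation, split on whitespace), and the names dict, toks and result are made up.

```r
mytext <- c(
  "this is a good and great movie",
  "oh man this is really bad",
  "quanteda is great!"
)
dict <- data.frame(
  sentiment = c("positive", "positive", "negative"),
  word = c("good", "great", "bad"),
  stringsAsFactors = FALSE
)

# crude tokeniser (assumption): lower-case, drop punctuation, split on whitespace
toks <- strsplit(gsub("[[:punct:]]", "", tolower(mytext)), "\\s+")

pos <- dict$word[dict$sentiment == "positive"]
neg <- dict$word[dict$sentiment == "negative"]

result <- data.frame(
  mytext = mytext,
  nb.pos = sapply(toks, function(x) sum(x %in% pos)),
  nb.neg = sapply(toks, function(x) sum(x %in% neg)),
  # keep the matched positive words, in order of appearance
  pos.words = sapply(toks, function(x) paste(x[x %in% pos], collapse = ", ")),
  stringsAsFactors = FALSE
)
```

Unlike tokens_lookup(), this matches whole words only and has no support for glob patterns or multi-word dictionary entries.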

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stack Overflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code in order to get the counts by each category for the variable Morality in each city:
mycounts<-melt(mydf,
idvars = c("City"),
measure.vars = c("Morality"))%>%
dcast(City~variable+value,
value.var="value",fill=0,fun.aggregate=length)
The code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. in order to get the sum of the Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
Do you mean something like this:
mydf %>%
melt(
idvars = c("City"),
measure.vars = c("Morality")
) %>%
dcast(
City ~ variable + value,
value.var = "Books",
fill = 0,
fun.aggregate = sum
)
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
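If what is wanted is one row per city holding both the Morality counts and the total number of books, one option that avoids melt()/dcast() altogether is to combine table() and tapply(). A base R sketch, assuming the data frame from the question (the names counts, book_totals and city_summary are made up):

```r
Person_ID <- 1:10
Morality <- rep(c("Bad guy", "Good guy"), each = 5)
City <- rep(c("NiceCity", "UglyCity"), times = 5)
Books <- seq(0, 27, by = 3)
mydf <- data.frame(Person_ID, City, Morality, Books, stringsAsFactors = FALSE)

# counts of each Morality level per city, as a wide data frame
counts <- as.data.frame.matrix(table(mydf$City, mydf$Morality))

# total number of books per city
book_totals <- tapply(mydf$Books, mydf$City, sum)

city_summary <- data.frame(
  City = rownames(counts),
  counts,
  Books = as.numeric(book_totals[rownames(counts)]),
  check.names = FALSE,   # keep the column names "Bad guy" and "Good guy" as they are
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Indexing book_totals by rownames(counts) keeps the totals aligned with the count rows even if the two results were ordered differently.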

Replacing integers in a dataframe column that's a list of integer vectors (not just single integers) with character strings in R

I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it, and I failed to find a similar case here on Stack Overflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
#   first_names player_indices              result
# 1       Megan              1             Rapinoe
# 2        Abby           2, 5 Wambach, Dahlkemper
# 3      Alyssa              3              Naeher
# 4        Alex              4              Morgan
# 5     Heather           6, 7     Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
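One caveat: indexing a named vector with integers selects by position, not by name, so the lookup above relies on player_ids being 1, 2, ... in row order. A sketch using match() that should also hold when the concordance is shuffled or the ids have gaps (the reordered concord data frame here is a made-up variation on the question's data):

```r
# concordance deliberately out of order to show the lookup is by id, not by position
concord <- data.frame(
  last_names = c("Morgan", "Rapinoe", "Wambach", "Naeher",
                 "Dahlkemper", "Mitts", "O'Reilly"),
  player_ids = c(4L, 1L, 2L, 3L, 5L, 6L, 7L),
  stringsAsFactors = FALSE
)
player_indices <- list(1L, c(2L, 5L), 3L, 4L, c(6L, 7L))

# match() finds the row of each id; ids missing from the concordance become NA
result <- lapply(
  player_indices,
  function(ids) concord$last_names[match(ids, concord$player_ids)]
)
```

An alternative with the same effect is to index the named vector with as.character(ids), which forces lookup by name.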
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.

Creating a new variable that counts the # of duplicate values from another variable in R

I am trying to create a new variable in R that gives a unique (ordered) numeric value to each observation based on the duplicate values in another variable. Below I show what the data looks like and what I would like it to look like. Can anyone help?
name <- c("Alex", "Alex", "Alex", "Bill", "Bill", "Cathy")
purchase <- c("hat", "bag", "book", "bag", "book", "book")
individual_purchase_No <- c(1, 2, 3, 1, 2, 1)
What the data looks like:
purchase.data <- data.frame(name, purchase)
What I want the data to look like:
purchase_order.data <- data.frame(name, purchase, individual_purchase_No)
You can do this with dplyr:
library(dplyr)
purchase.data %>% group_by(name) %>%
mutate(individual_purchase_No = 1:n())
## Source: local data frame [6 x 3]
## Groups: name [3]
##
## name purchase individual_purchase_No
## (fctr) (fctr) (int)
## 1 Alex hat 1
## 2 Alex bag 2
## 3 Alex book 3
## 4 Bill bag 1
## 5 Bill book 2
## 6 Cathy book 1
A base R solution is for instance:
purchase.data$individual_purchase_No <- sequence(table(purchase.data$name))
table() counts the number of appearances of each name, and sequence() then creates the sequence 1:n for each count n. Note that this assumes the rows are already sorted by name.
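When the rows are not sorted by name, ave() is another base R option, since it computes the counter within each group and puts the values back in the original row order. A small sketch (the unsorted vector is an invented example):

```r
name <- c("Alex", "Alex", "Alex", "Bill", "Bill", "Cathy")
purchase <- c("hat", "bag", "book", "bag", "book", "book")
purchase.data <- data.frame(name, purchase, stringsAsFactors = FALSE)

# running counter within each name, kept in the original row order
purchase.data$individual_purchase_No <-
  ave(seq_along(purchase.data$name), purchase.data$name, FUN = seq_along)

# the same call on unsorted input (hypothetical data)
unsorted <- c("Alex", "Bill", "Alex", "Cathy", "Bill")
unsorted_no <- ave(seq_along(unsorted), unsorted, FUN = seq_along)
```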

How to create groups of like sounding names in R?

I'd like to create a grouping variable based upon how similar a selection of names is. I have started by using the stringdist package to generate a measure of distance, but I'm not sure how to use that output to generate a group-by variable. I've looked at hclust, but it seems that to use clustering functions you need to know how many groups you want in the end, and I do not know that. The code I start with is below:
library(stringdist)
name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_dist <- stringdistmatrix(name_list)
name_dist
name_dist2 <- stringdistmatrix(name_list, method="soundex")
name_dist2
I would like to see a dataframe with two columns that look like
name = c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_group = c(1, 1, 1, 2, 2, 2, 3, 3, 4)
The groups might be slightly different depending obviously on what distance measure I use (I've suggested two above) but I would probably choose one or the other to run.
Basically, how do I get from the distance matrix to a group variable without knowing the number of clusters I'd like?
You can also use adist(...) in base R to calculate the Levenshtein distances, and cluster based on that.
n<- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)
rownames(d) <- n
cl <- hclust(as.dist(d))
plot(cl)
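To get from the tree to a grouping variable without fixing the number of clusters, cutree() can cut at a height instead of at a count. A sketch continuing the adist() example; the threshold of 2.5 is an assumption that happens to suit these names and would need tuning for other data, and the names max_dist, groups and name_groups are made up:

```r
n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)
rownames(d) <- n
cl <- hclust(as.dist(d))   # complete linkage by default

# assumed threshold: with complete linkage, names in one group are at most 2 edits apart
max_dist <- 2.5
groups <- cutree(cl, h = max_dist)

name_groups <- data.frame(name = n, name_group = unname(groups))
```

A lower threshold gives more, tighter groups, and a higher one merges groups; plot(cl) helps to pick a sensible height.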
You could use a cluster analysis like this:
# loading the package
require(stringdist);
# Group selection by class numbers or height
num.class <- 5;
num.height <-0.5;
# define names
n <- c("Mary", "Mery", "Mari", "Joe",
"Jo", "Joey", "Bob", "Beb", "Paul");
# calculate distances
d <- stringdistmatrix(n, method="soundex");
# cluster the stuff
h <- hclust(d);
# cut the cluster by num classes
m <- cutree(h, k = num.class);
# cut the cluster by height
p <- cutree(h, h = num.height);
# build the resulting frame
df <- data.frame(names = n,
group.class = m,
group.prob = p);
It produces:
df;
names group.class group.prob
1 Mary 1 1
2 Mery 1 1
3 Mari 1 1
4 Joe 2 2
5 Jo 2 2
6 Joey 2 2
7 Bob 3 3
8 Beb 4 3
9 Paul 5 4
And the chart gives you an overview:
plot(h, labels=n);
Regards huck
