I have a problem with my data frame.
I have a dataframe with 2 columns, 'word' and 'word_categories'. I created different variables which include the different words, e.g. 'noun' which includes all the nouns of the word column. I now want to change the labels in the word_categories column to the corresponding variable. So if the word in the word column is included in the object 'noun', I want the word_categories column to display 'noun'.
df <- read.csv("palm.csv")
noun <- c("house", ...)
adj <- c("hard", ...)
...
The data frame looks like the following. It includes other columns but they are fine.
word word_categories
house
car
hard
...
I now want to check whether each word is in any of the created objects and, if so, have the corresponding label written to the word_categories column. So for 'house' the column should show noun, for 'hard' it should show adjective. If the word is in none of the objects, it should show nothing or 'NA'.
I tried it with the following:
palm$word_categories <- ifelse(palm$word == noun, "noun",
ifelse(palm$word == adj, "adjective", "")))
This, however, doesn't work at all, and I have 7 objects in total, so the statement becomes ridiculously long. How do I do it properly?
If the dataframe is called palm (you first call it df but later you use palm) and noun and adj are vectors as you define above, I would do:
library(dplyr)
palm <- palm %>%
  mutate(word_categories = case_when(word %in% noun ~ "noun",
                                     word %in% adj ~ "adjective",
                                     TRUE ~ NA_character_))
One way would be to create a named vector of your noun/adjective dictionaries to select each element. The name would be the word and the corresponding data would be noun, adjective etc. You didn't really supply any data so I made some up.
df <- data.frame(
stringsAsFactors = FALSE,
word = c("dog", "short", "bird", "cat", "short", "man")
)
nounName <- c('dog', 'cat', 'bird')
adjName <- c('quick', 'brown', 'short')
noun <- rep('noun', length(nounName))
adj <- rep('adjective', length(adjName))
names(noun) <- nounName
names(adj) <- adjName
partsofspeech <- c(noun, adj)
df$word_categories <- partsofspeech[df$word]
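With seven word lists, the same lookup idea can be generalised by keeping all lists in one named list. This is only a minimal base R sketch: the words and the `categories` list are made up for illustration and are not from the question. Words found in none of the lists come out as NA.

```r
# Hypothetical word lists, one element per category
categories <- list(
  noun      = c("house", "car"),
  adjective = c("hard", "soft"),
  verb      = c("run")
)

# stack() turns the named list into a two-column lookup table:
# `values` holds the words, `ind` holds the category names
lookup <- stack(categories)

palm <- data.frame(word = c("house", "car", "hard", "table"),
                   stringsAsFactors = FALSE)

# match() returns NA for words that are in no list
palm$word_categories <- as.character(lookup$ind[match(palm$word, lookup$values)])
palm
```

Adding an eighth category then only means adding one element to `categories`.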
I have a table that is shaped like this called df (the actual table is 16,263 rows):
title date brand
big farm house 2022-01-01 A
ranch modern 2022-01-01 A
town house 2022-01-01 C
Then I have a table like this called match_list (the actual list is 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
And I'm trying to filter the first table to just be rows where the title contains a word in the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!
You could filter the titles sequentially: once a title has matched one word, it no longer needs to be tested against the others. For example, if 10 titles match 'farm', those 10 can be set aside before the next word is tried.
Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- c()
for (w in words_for_match) {
  w <- sprintf("\\b(%s)\\b", w)
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you have prior knowledge of how frequent the words in match_list are, it's better to start with the most frequent ones.
UPDATE
You can also combine this with your original strategy to make it faster:
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words) {
  words <- paste(words, collapse = "|")
  sprintf("\\b(%s)\\b", words)
})
and then iterate on gr.words and not on words_for_match in the first code chunk.
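Put together, the batched version might look like the sketch below. It reuses the toy `titles` and `words_for_match` from above; the group size of 2 and the `keep` vector are illustrative choices, and a sensible group size for 94,000 words would have to be found by trial.

```r
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")

# Build one alternation pattern per group of words
gr.size <- 2
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
patterns <- vapply(gr.words, function(words) {
  sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
}, character(1))

# keep[i] becomes TRUE as soon as title i matches any pattern
keep <- logical(length(titles))
for (p in patterns) {
  todo <- !keep
  if (!any(todo)) break           # every title already matched
  keep[todo] <- grepl(p, titles[todo])
}

titles[keep]
```

Tracking matches in a logical vector keeps the original row order, so `keep` can be used directly to subset the data frame.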
I want to rename factor levels with fct_recode by using items I created beforehand.
I first create some labels and save them into a list:
#Creating the Labels:
LabelsWithN <- c(
sprintf("Man(%s)", FreqGender["Man","Freq"]),
sprintf("Woman(%s)", FreqGender["Woman","Freq"]),
sprintf("Non-Binary(%s)", FreqGender["Non-Binary","Freq"]),
sprintf("Other(%s)", FreqGender["Other","Freq"]),
sprintf("Prefer Not To Disclose(%s)", FreqGender["Prefer not to disclose","Freq"])
)
This creates a chr list with items like "Man(105)", "Woman(51)" etc.
Now I want to relabel the factors in the original DataSet (i.e. "Man" --> "Man(105)") in order to label a graph. I want to use either the list item (i.e., LabelsWithN[1]) or directly the function creating the string (i.e., sprintf("Man(%s)", FreqGender["Man","Freq"])).
I then try to enter either the list item or the function into fct_recode:
#Using the Labels:
DataSet %>%
mutate(`Gender. What_is_your_ge.._` = fct_recode(`Gender. What_is_your_ge.._`, LabelsWithN[1] = "Man", sprintf("Woman(%s)", FreqGender["Woman","Freq"]) = "Woman")) %>%
# This is just the code for the graph:
ggplot(aes(x = `Gender. What_is_your_ge.._` , y = `Age. How_old_are_you?`, main = "Age Distribution By Gender")) +
geom_boxplot() +
xlab("Gender (n)") +
ylab("Age")
However, this yields:
"Unexpected '=' in:
"DataSet %>%
mutate(`Gender. What_is_your_ge.._` = fct_recode(`Gender. What_is_your_ge.._`, LabelsWithN[1] ="
It doesn't matter if I use the function or the list item.
The vector is a factor and the list is filled with characters. If I manipulate the code to rename the factor "man" to "cat" ("cat" = "Man") the code works fine.
How can I address the list item/enter the function into fct_recode so that it works?
Also, can somebody explain to me what the problem here is? If I print out LabelsWithN[1] I get the correct string printed out.
Thank you and Bw,
Jan
Perhaps this helps:
library(forcats)
x <- factor(c("apple", "bear", "banana", "dear")) # what you want to recode
levels <- c(fruit = "apple", fruit = "banana") # how you want to recode it
x <- fct_recode(x, !!!levels) # recoding it
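The original call fails at parse time because the left-hand side of `=` in a function call has to be a plain name, not an expression such as `LabelsWithN[1]` or a `sprintf()` call. Applied to the labelling problem, the splicing approach could look roughly like this. The `gender` factor and its counts are made up for illustration, and `recode_map` is a hypothetical name.

```r
library(forcats)

# Hypothetical data standing in for the gender column of DataSet
gender <- factor(c("Man", "Woman", "Man", "Other", "Woman", "Man"))

# Count each level and build labels such as "Man(3)"
counts <- table(gender)
new_labels <- sprintf("%s(%d)", names(counts), as.integer(counts))

# fct_recode() expects new = "old", so the new labels become the names
recode_map <- setNames(names(counts), new_labels)

gender_n <- fct_recode(gender, !!!recode_map)
levels(gender_n)
```

Because the labels are computed from the data, they stay in sync with the counts if the data change.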
I am working with a corpus with speeches spanning several years (aggregated to person-year level). I want to remove words that occur less than 4 times in a year (not remove it for the whole corpus, but only for the year in which it does not meet the threshold).
I have tried the following:
DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)
However, none seem to work. Mgsub just substitutes the whole speech with "" for 1998, whilst the other options give error messages. The reason that removeWords does not work is that my words_remove_1998 vector is too large. I have tried to split the word vector and loop over the words (see code below), but R does not appear to like this (running forever).
group <- 100
n <- length(words_remove_1998)
r <- rep(1:ceiling(n/group),each=group)[1:n]
d <- split(words_remove_1998,r)
for (i in 1:length(d)) {
DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, c(paste(d[[i]]))), DT$text)
}
Any suggestions for how to solve this?
Thank you for your help!
Reproducible example:
text <- rbind(c("i like ice cream"), c("banana ice cream is my favourite"), c("ice cream is not my thing"))
name <- rbind(c("Arnold Ford"), c("Arnold Ford"), c("Leslie King"))
session <- rbind("1998", "1999", "1998")
DT <- cbind(name, session, text)
words_remove_1998 <- c("like", "ice", "cream")
newtext <- rbind(c("i"), c("banana ice cream is my favourite"), c("is not my thing"))
DT <- cbind(DT, newtext)
My real word vector that I want removed contains 30k elements.
I ended up not using any wrappers, as none of them could handle the size of the data. Instead I did it the old-fashioned and simple way: separate the text into several rows, count the occurrences of each word per session (year) and person, then remove the rows whose count falls below a threshold (the same limit I used to identify the vector of words I wanted to remove). Lastly, I aggregate the data back to its initial level (person-year).
This only works because I am removing words according to a threshold. If I had a list of words to remove that I could not remove in this way, I would have been in more trouble.
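library(tidyr)
library(dplyr)
# One row per word
DT_separate <- separate_rows(DT, text)
# Count how often each word occurs within a session
df <- DT_separate %>%
  dplyr::group_by(session, text) %>%
  dplyr::mutate(count = dplyr::n())
# Keep only the words above the threshold
df <- df[df$count > 5, ]
# Paste the remaining words back together per person-year
df <- aggregate(
  text ~ x, # where x is a person-year id
  data = df,
  FUN = paste, collapse = ' '
)
names(df)[names(df) == 'text'] <- 'text2'
DT <- left_join(DT, df, by = "x")
DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]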
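For the case where the words to remove come from a fixed list rather than a threshold, one option that avoids large regular expressions is to split each text into tokens and filter them with `%in%`. The following base R sketch is an alternative to the approach above, not a reconstruction of it. It uses the toy data from the question, and `drop_words` is a hypothetical helper name.

```r
DT <- data.frame(
  name    = c("Arnold Ford", "Arnold Ford", "Leslie King"),
  session = c("1998", "1999", "1998"),
  text    = c("i like ice cream",
              "banana ice cream is my favourite",
              "ice cream is not my thing"),
  stringsAsFactors = FALSE
)
words_remove_1998 <- c("like", "ice", "cream")

# Split on spaces, drop tokens found in `words`, paste the rest back together
drop_words <- function(txt, words) {
  tokens <- strsplit(txt, " ", fixed = TRUE)
  vapply(tokens,
         function(tok) paste(tok[!tok %in% words], collapse = " "),
         character(1))
}

# Only touch the rows belonging to the 1998 session
is_1998 <- DT$session == "1998"
DT$text[is_1998] <- drop_words(DT$text[is_1998], words_remove_1998)
DT$text
```

This assumes words are separated by single spaces. Punctuation or mixed case would need extra tokenising first.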
I am pretty new to R. This seems like a simple question, but I just don't know the best way to approach it. I have checked similar questions but have not found the answer I am looking for.
I have a list of data frames (actually tibbles) that I want to run through the convert() function from the hablar package to convert the data types of each variable in the data frames. I then want to overwrite the original data frames. Here is a simplified example data frame (N.B. all of the variables are currently factors). For simplicity I have made adm2 and adm3 the same as adm1, but they are different in my real data.
adm1 <- data.frame(admV1 = as.factor(c("male", "female", "male", "female")),
admV2 = as.factor(c("12.2", "13.0", "14.0", "15.1")),
admV3 = as.factor(c("free text", "more free text", "even more free text", "free text again")),
admV4 = as.factor(c("2019-01-01T12:00:00", "2019-01-01T12:00:00", "2019-01-01T12:00:00", "2019-01-01T12:00:00")))
adm1 <- as_tibble(adm1)
adm2 <- adm1
adm3 <- adm1
dis1 <- data.frame(disV1 = as.factor(c("yes", "no", "yes", "no")),
disV2 = as.factor(c("12.2", "13.0", "14.0", "15.1")),
disV3 = as.factor(c("free text", "more free text", "even more free text", "free text again")),
disV4 = as.factor(c("2019-01-01+T12:00:00", "2019-01-01+T12:00:00", "2019-01-01+T12:00:00", "2019-01-01+T12:00:00")))
dis1 <- as_tibble(dis1)
dis2 <- dis1
dis3 <- dis1
I have two 'types' of data frames: admissions and discharges. I defined the variables that need to be converted to each data type (N.B. In my real example each is a character vector containing more than one variable name):
# Define data types (column names as character strings)
adm_chr <- "admV3"
adm_num <- "admV2"
adm_fct <- "admV1"
adm_dte <- "admV4"
dis_chr <- "disV3"
dis_num <- "disV2"
dis_fct <- "disV1"
dis_dte <- "disV4"
I have then created a list of the datasets:
# Define datasets
adm_dfs<- list(adm1, adm2, adm3)
dis_dfs<- list(dis1, dis2, dis3)
This is what I have managed so far:
# Write function
convertDataTypes<- function(dfs, type = c("adm", "dis")){
outputs1<- dfs %>% lapply(convert(chr(paste0(type, "_chr")),
num(paste0(type, "_num")),
fct(paste0(type, "_fct"))))
outputs2<- dfs %>% mutate_at(vars(paste0(type, "_dte")),
ymd_hms, tz = "GMT")
}
# Run function
convertDataTypes(adm_dfs, "adm")
I think I need to then use lapply over outputs1 and outputs2 to assign the variables, but there is probably a much better way of approaching this. I would be very grateful for your input.
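If 'dfs' is a list of data.frames, load the packages first (str_c() comes from stringr and ymd_hms() from lubridate):
library(hablar)
library(purrr)
library(dplyr)
library(stringr)
library(lubridate)
If each 'type' corresponds to one element of the list, use map2: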
convertDataTypes <- function(dfs, type = c("adm", "dis")) {
map2(dfs, type, ~ {
.type <- .y
map(.x, ~ .x %>%
convert(chr(str_c(.type, "_chr")),
num(str_c(.type, "_num")),
fct(str_c(.type, "_fct"))) %>%
mutate_at(vars(str_c(.type, "_dte")),
ymd_hms, tz = "GMT"))
})
}
dfsN <- list(adm_dfs, dis_dfs)
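For comparison, the same conversion can be sketched in base R without hablar by passing the column names in a single list. This swaps in `as.character()`, `as.numeric()` and `as.POSIXct()` for the hablar and lubridate calls above. `convert_types` and `adm_spec` are hypothetical names, and the toy table is a shortened, made-up version of adm1.

```r
# Toy admissions table modelled on the question (all columns start as factors)
adm1 <- data.frame(
  admV1 = factor(c("male", "female")),
  admV2 = factor(c("12.2", "13.0")),
  admV3 = factor(c("free text", "more free text")),
  admV4 = factor(c("2019-01-01T12:00:00", "2019-01-02T08:30:00"))
)
adm_dfs <- list(adm1, adm1)

# Hypothetical spec: which columns get which type
adm_spec <- list(chr = "admV3", num = "admV2", fct = "admV1", dte = "admV4")

convert_types <- function(df, spec) {
  df[spec$chr] <- lapply(df[spec$chr], as.character)
  # Go through character so factor codes are not mistaken for the values
  df[spec$num] <- lapply(df[spec$num], function(x) as.numeric(as.character(x)))
  df[spec$fct] <- lapply(df[spec$fct], as.factor)
  df[spec$dte] <- lapply(df[spec$dte], function(x) {
    as.POSIXct(as.character(x), format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")
  })
  df
}

# Overwrite the list with the converted data frames
adm_dfs <- lapply(adm_dfs, convert_types, spec = adm_spec)
str(adm_dfs[[1]])
```

A second spec list for the discharge tables would let the same function handle `dis_dfs`.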
I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
y=rnorm(n=1500, mean=0, sd=1),
z=rnorm(n=1500, mean=0, sd=1))
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender = c("Male", "Female"),
RateeAgeGroup = c("18-39","40-49", "50+"),
Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
selection <- character()
# For each column of remove (particular selection):
for (j in 1:(ncol(remove)-1)){
add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
selection <- paste0(selection, add)
}
selection <- sub(' & $', '', selection) # Remove trailing ampersand
cat(selection, sep = "\n") # What does selection string look like?
tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" ,], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
Any advice would be really appreciated!
You cannot use character expressions as you describe to subset either with [ or subset. If you wanted to do that you would have to construct the entire expression, and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.
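To make the difference concrete, here is a small sketch on made-up data (a four-row `dat` with two of the factor columns; `wanted` is a hypothetical stand-in for one row of remove). A character string passed to `[` is looked up as a row name, which is why the loop returns NAs; parsing and evaluating it first, or using merge, gives the intended rows.

```r
dat <- data.frame(
  RateeGender   = c("Male", "Female", "Male", "Male"),
  RateeAgeGroup = c("18-39", "18-39", "50+", "18-39"),
  X             = c(0.1, 0.2, 0.3, 0.4),
  stringsAsFactors = FALSE
)
selection <- 'dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39"'

# As a plain string, `selection` is treated as a row name that does not exist
by_string <- dat[selection, ]

# Parsed and evaluated, it becomes a logical vector usable for subsetting
keep <- eval(parse(text = selection))
by_eval <- dat[keep, ]

# The merge-based equivalent joins on the shared column names
wanted <- data.frame(RateeGender = "Male", RateeAgeGroup = "18-39",
                     stringsAsFactors = FALSE)
by_merge <- merge(dat, wanted)
```

Both `by_eval` and `by_merge` contain the two matching rows, but the merge version needs no string building at all.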
I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations where the size of the category is smaller than the number of samples desired. (In fact his method doesn't really do sampling at all, but I think it is an elegant solution to a different question, so I'm not reversing my vote. And you could use a similar split strategy as illustrated below with that data.frame as the input.)
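sub <- lapply(
  # one data frame per combination of the three factors
  split(dat, with(dat, paste(RateeGender, RateeAgeGroup, Relationship, sep = "_"))),
  function(d) {
    # look up the number of cases requested for this combination
    n <- with(remove, remove[RateeGender == d$RateeGender[1] &
                             RateeAgeGroup == d$RateeAgeGroup[1] &
                             Relationship == d$Relationship[1],
                             "cases"])
    cat(n, "")
    # sample rows; sample(d, ...) on a data frame would sample columns
    d[sample(nrow(d), n, replace = TRUE), ]
  })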
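The same split-and-sample idea can be written with Map() so that each group is paired directly with its requested count. This is only a sketch on made-up data: a single hypothetical `group` column replaces the three factors, and replacement is switched on only when a group is smaller than the number of cases requested.

```r
set.seed(83)
dat <- data.frame(group = rep(c("a", "b"), times = c(5, 3)), x = 1:8,
                  stringsAsFactors = FALSE)
remove <- data.frame(group = c("a", "b"), cases = c(2, 4),
                     stringsAsFactors = FALSE)

groups <- split(dat, dat$group)

sampled <- Map(function(d, n) {
  # Sample row indices; replace only if the group has fewer rows than n
  d[sample.int(nrow(d), n, replace = n > nrow(d)), , drop = FALSE]
}, groups[remove$group], remove$cases)

# Stack the per-group samples into one data frame
result <- do.call(rbind, sampled)
table(result$group)
```

With several factor columns, the group key could be built with paste() or interaction(), as in the answer above.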