Extract numbers and sum them from free text input, add to df

I have a dataframe with a column that contains free text entries on years of education. From the free text entries I want to extract all of the numbers and sum them.
Example: data_en$education[1] gives "6 primary school 10 highschool"
With the following code I can extract both numbers and sum them.
library(stringr)
x <- as.numeric(str_extract_all(data_en$education[1], "[0-9]+")[[1]])
x <- as.vector(x)
x <- sum(x)
However, I would ideally like to do this for all free text entries (i.e. each row) and subsequently add the results to the dataframe per row (i.e. in a variable such as data_en$educationNum). I'm a bit stuck on how to proceed.

You can use sapply:
data_en$educationNum <- sapply(str_extract_all(data_en$education, "[0-9]+"),
function(i) sum(as.numeric(i)))
data_en
# education educationNum
# 1 6 primary school 10 highschool 16
# 2 10 primary school 2 highschool 12
# 3 no school 0
Data
data_en <- data.frame(education = c("6 primary school 10 highschool",
"10 primary school 2 highschool",
"no school"))

You just need to map over the output of str_extract_all:
x <- c('300 primary 1 underworld', '6 secondary 9 dungeon lab')
library(stringr)
library(purrr)
map_dbl(str_extract_all(x, '\\d+'), ~ sum(as.numeric(.)))
# [1] 301 15
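If you would rather not depend on stringr or purrr, the same extract-and-sum idea can be sketched in base R with regmatches() and gregexpr(). This is only an illustration: the data frame below is a copy of the sample data from the answer above, and educationNum is the column name suggested in the question.

```r
# Sample data copied from the answer above (illustrative only)
data_en <- data.frame(education = c("6 primary school 10 highschool",
                                    "10 primary school 2 highschool",
                                    "no school"),
                      stringsAsFactors = FALSE)

# gregexpr() locates every run of digits; regmatches() pulls them out,
# giving one character vector per row (empty when a row has no digits)
digits <- regmatches(data_en$education, gregexpr("[0-9]+", data_en$education))

# Sum each row's numbers; sum(numeric(0)) is 0, so "no school" becomes 0
data_en$educationNum <- vapply(digits, function(d) sum(as.numeric(d)), numeric(1))
data_en$educationNum
```

vapply() is used instead of sapply() so that the result is guaranteed to be a numeric vector with one value per row.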

Related

Correct variable values in a dataframe applying a function using variable-specific values in another dataframe in R

I have a df called 'covs' with sites in rows and 9 different environmental variables for each site in columns. I need to recalculate the value of each cell using the function (x - center_values(x)) / scale_values(x). However, 'center_values' and 'scale_values' are different for each environmental covariate, and they are located in another df called 'correction'.
I have found many solutions for applying a function for a whole df, but not for applying specific values according to the id of the value to transform.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the covariate-specific values reported in 'correction'? E.g. for each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' entry reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' entry for 'elev' reported in the 'correction' df. Thank you for your kind help.

You may assign var_name to row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x)
(covs[, x] - correction[x, "center_values"])/correction[x, "scale_values"]))
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
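As an alternative sketch, base R's scale() accepts one center and one scale value per column, so the loop can be replaced by a single call once the rows of correction are lined up with the columns of covs. The tiny data frames below are a cut-down stand-in for the question's data (two sites, two covariates), and idx is a helper name introduced just for this example.

```r
# Cut-down stand-in for the question's data (illustrative only)
covs <- data.frame(elev = c(24.379101, 29.533478),
                   npp  = c(16536.664, 15773.115))
correction <- data.frame(var_name = c("elev", "npp", "nonveg"),
                         center_values = c(200.217889868483, 10205.2997576655,
                                           7.96755640452006),
                         scale_values  = c(307.718211316278, 2913.19658598938,
                                           4.56625351682905),
                         stringsAsFactors = FALSE)

# Row of 'correction' that belongs to each column of 'covs'
idx <- match(names(covs), correction$var_name)
stopifnot(!anyNA(idx))  # every covariate needs a correction row

# scale() subtracts center[j] from column j and then divides by scale[j]
res <- as.data.frame(scale(covs,
                           center = correction$center_values[idx],
                           scale  = correction$scale_values[idx]))
res
```

Using match() rather than relying on row order means extra rows in correction (such as nonveg here) are simply ignored.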

Quanteda changing rel freq of a term over time

I have a corpus of news articles with date and time of publication as 'docvars'.
readtext object consisting of 6 documents and 8 docvars.
# Description: df[,10] [6 × 10]
doc_id text year month day hour minute second title source
* <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2014_01_01_10_51_00… "\"新华网伦敦1… 2014 1 1 10 51 0 docid报告称若不减… RMWenv
2 2014_01_01_11_06_00… "\"新华网北京1… 2014 1 1 11 6 0 docid盘点2013… RMWenv
3 2014_01_02_08_08_00… "\"原标题：报告… 2014 1 2 8 8 0 docid报告称若不减… RMWenv
4 2014_01_03_08_42_00… "\"地球可能毁灭… 2014 1 3 8 42 0 docid地球可能毁灭… RMWenv
5 2014_01_03_08_44_00… "\"北美鼠兔看起… 2014 1 3 8 44 0 docid北美鼠兔为应… RMWenv
6 2014_01_06_10_30_00… "\"欣克力C点核… 2014 1 6 10 30 0 docid英国欲建50… RMWenv
I would like to measure how the relative frequency of a particular term (e.g. 'development') changes across these articles, either as a proportion of the total terms in the article or as a proportion of the total terms in all the articles published on a particular day or in a particular month. I know that I can count the number of times the term occurs in all the articles in a month, using:
dfm(corp, select = "term", groups = "month")
and that I can get the relative frequency of the word to the total words in the document using:
dfm_weight(dfm, scheme = "prop")
But how do I combine these together to get the frequency of a specific term relative to the total number of words on a particular day or in a particular month?
What I would like to be able to do is measure the change in the number of times a term is used over time, while accounting for the fact that the total number of words used is also changing. Thanks for any help!

@DaveArmstrong gives a good answer here and I upvoted it, but I can add a bit of efficiency using some of the newest quanteda syntax, which is a bit simpler.
The key here is preserving the date format created by zoo::as.yearmon(), since the dfm grouping coerces it to character. So we pack it into a docvar, which is preserved by the grouping, and then retrieve it in the ggplot() call.
load(file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1"))
library("quanteda")
## Package version: 2.1.1
## create corpus and dfm
corp <- corpus(m, text_field = "body_text")
corp$date <- m$first_publication_date %>%
zoo::as.yearmon()
D <- dfm(corp, remove = stopwords("english")) %>%
dfm_group(groups = "date") %>%
dfm_weight(scheme = "prop")
library("ggplot2")
convert(D[, "wonderfully"], to = "data.frame") %>%
ggplot(aes(x = D$date, y = wonderfully, group = 1)) +
geom_line() +
labs(x = "Date", y = "Wonderfully/Total # Words")

I suspect someone will come up with a better solution within quanteda, but in the event they don't, you could always extract the word from the dfm and put it in a dataset along with the date and then make the graph. In the code below, I'm using some music reviews I scraped from the Guardian's website. I've commented out the functions that read in the data from an .rda file from Dropbox. You're welcome to use it if you like. It's clean, but I don't want to inadvertently have someone download a file from the web they're not aware of.
# f <- file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1")
# load(f)
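library(quanteda)
library(dplyr)   # tibble(), mutate(), group_by(), summarise() and %>%
library(ggplot2)
## create corpus and dfm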
corp <- corpus(as.character(m$body_text))
docvars(corp, "date") <- m$first_publication_date
D <- dfm(corp, remove=stopwords("english"))
## take word frequencies "wonderfully" in the dfm
## along with the date
tmp <- tibble(
word = as.matrix(D)[,"wonderfully"],
date = docvars(corp)$date,
## calculate the total number of words in each document
total = rowSums(D)
)
tmp <- tmp %>%
## turn date into year-month
mutate(yearmon =zoo::as.yearmon(date)) %>%
## group by year-month
group_by(yearmon) %>%
## calculate the sum of the instances of "wonderfully"
## divided by the sum of the total words across all
## documents in the month
summarise(prop = sum(word)/sum(total))
## make a plot.
ggplot(tmp, aes(x=yearmon, y=prop)) +
geom_line() +
labs(x= "Date", y="Wonderfully/Total # Words")
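To see why the summed count is divided by the summed total, rather than averaging per-document proportions, here is a small base R sketch with made-up numbers. The counts data frame and its column names are hypothetical; in practice the term and total columns would come from the dfm as in the code above.

```r
# Made-up per-document counts (illustrative only)
counts <- data.frame(month = c("2014-01", "2014-01", "2014-02"),
                     term  = c(2, 0, 3),      # occurrences of the term per document
                     total = c(100, 50, 60))  # total tokens per document

# Pooled proportion: all occurrences in a month / all tokens in that month
pooled <- tapply(counts$term, counts$month, sum) /
  tapply(counts$total, counts$month, sum)

# Mean of per-document proportions, which weights short and long documents equally
per_doc <- tapply(counts$term / counts$total, counts$month, mean)

pooled   # 2014-01: 2/150, 2014-02: 3/60
per_doc  # 2014-01: (2/100 + 0/50) / 2, 2014-02: 3/60
```

The two measures agree when a month has a single document and can diverge when document lengths differ, so which one is appropriate depends on whether the unit of interest is the month's text as a whole or the typical article.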

Combining rows in the data set into categories in R

I'm trying to write a script which combines similar entries into the common category.
I have the dataset:
product <- c('Laptops','13" Laptops','Apple Laptops', '10 inch laptop','Laptop 13','TV','Big TV')
volume <- c(100,10,20,2,1,200,10)
dataset <- data.frame(product,volume)
Looks like:
product volume
1 Laptops 100
2 13" Laptops 10
3 Apple Laptops 20
4 10 inch laptop 2
5 Laptop 13 1
6 TV 200
7 Big TV 10
What I want to do is combine all categories together, so for example after running the script I want the dataset to be:
product volume
1 Laptops 113
2 Apple Laptop 20
3 TV 210
Since Apple is a brand, I want it to remain separate from categories. I don't know how to get started but I figure I need a for loop to go through every row, and check if a Brand name is in the product name. E.g.
brandlist <- 'Apple|Samsung'
if (grepl(brandlist, dataset$product[i])) next  # skip this row (inside a loop over i)
Now I need to define category names, which I do by looking at the products with the most searches, since people tend to search for categories. Let's say a row is a category if the volume is >100.
categories <- c()
for (i in seq_len(nrow(dataset))) {
if (dataset$volume[i] > 100) { categories <- c(categories, as.character(dataset$product[i])) }
}
Now I need to check if every row name has a somewhat partial match... I'm thinking of some sort of regex with number + " + category or the other way around. I was also considering some sort of algorithm to check how many letters are different, e.g. allow 4 characters to differ and at least 5 must match exactly to the category, so laptops and 13" laptops will be grouped together since they have 7 characters in common and differ in 4.
EDIT:
I'm currently thinking along the lines of the following solution:
I made a list of categories, and I created a new data frame such as:
category <- c ('other', 'category 1', 'category 2')
volume <- c(0,0,0)
df <- data.frame(category,volume)
category volume
1 other 0
2 category 1 0
3 category 2 0
Now I want to go through the results in the previous table using a loop and match all results (based on the restriction on brands and on matching: it must have 1 word in common and could differ in some ways), and put the result in the new data frame.

You can try the following. First remove all digits and double quotes, then trim the surrounding whitespace.
Then search for brands and extract the last word of each entry, overwrite it with the full entry where a brand is found, and convert everything to lower case. Finally, remove the plural s. Group and summarise in the last step. Of course this is a hardcoded solution for the provided data.frame, but I see no other way.
library(stringi)
library(tidyverse)
dataset %>%
mutate(p2=gsub("[[:digit:]]|\"","",product),
p2=stri_trim(p2)) %>%
mutate(p3=grepl(brandlist, p2)) %>%
mutate(p4=stri_extract_last_words(p2),
p4=ifelse(p3, grep(brandlist, p2, value=T), p4),
p4=tolower(p4),
p4=stri_replace_last_fixed(p4, "s","")) %>%
group_by(p4) %>%
summarise(volume=sum(volume)) %>%
select(product=p4, volume)
# A tibble: 3 x 2
product volume
<chr> <dbl>
1 laptop 113
2 tv 210
3 apple laptop 20
Edits:
You can also set up a function, but then you have to create the categories yourself. Note that they should be written in the singular and in lower case.
library(stringr)
foo <- function(data, product=product, volume=volume, brandlist, categories){
data %>%
mutate(p1=tolower(product)) %>%
mutate(p2=str_extract(p1, brandlist),
p2=ifelse(is.na(p2),"",p2)) %>%
mutate(p3=str_extract(p1, categories)) %>%
unite(Product, p2, p3, sep = " ") %>%
mutate(Product=str_trim(Product)) %>%
group_by(Product) %>%
summarise(volume=sum(volume))
}
foo(dataset, brandlist = 'apple|samsung',categories = "laptop|tv")
# A tibble: 3 x 2
Product volume
<chr> <dbl>
1 apple laptop 20
2 laptop 113
3 tv 210
foo(dataset, brandlist = 'apple|samsung',categories = "laptop|tv|big tv")
# A tibble: 4 x 2
Product volume
<chr> <dbl>
1 apple laptop 20
2 big tv 10
3 laptop 113
4 tv 200

For the first part, you can define a list of categories and a list of brands, then total each category while splitting out the brand-specific rows:
Categories <- c("Laptop","TV")
Brands <- c("Apple")
Aggregated.df <- do.call(rbind,lapply(1:length(Categories),function(x){
SumRow <- sum(dataset[grepl(Categories[x],dataset$product,ignore.case=TRUE),"volume"])
Excluded <- sapply(1:length(Brands),function(y){
SumCol <- sum(dataset[grepl(Categories[x],dataset$product,ignore.case=TRUE) & grepl(Brands[y],dataset$product,ignore.case=TRUE),"volume"])
})
SumRow <- ifelse((SumRow - sum(Excluded)) < 0, 0, (SumRow - sum(Excluded)))
Excluded.df <- NULL
if(any(Excluded>0)){
Which <- which(Excluded>0)
Excluded.df <- data.frame(Product=paste(Brands[Which],Categories[x],sep=" "), volume = Excluded[Which])
}
Row.df <- data.frame(Product=Categories[x], volume = SumRow)
DataFrame <- rbind(Row.df,Excluded.df)
}))
Now I need to define category names, which I do by looking at the products with the most searches, since people tend to search for categories. Let's say a row is a category if the volume is >100.
Min.volume <- 100
Categories <- unique(Aggregated.df$Product[Aggregated.df$volume > Min.volume])
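On the idea in the question of counting how many letters differ: base R's adist() computes the generalized Levenshtein (edit) distance, so one rough approach is to assign each product to the category it is closest to. This is only a sketch under the assumption that the category names are already known; the products and categories vectors below are illustrative, and brand handling is left out.

```r
# Illustrative inputs; the categories are assumed to be known in advance
products   <- c("Laptops", "13\" Laptops", "10 inch laptop", "Laptop 13", "TV", "Big TV")
categories <- c("laptops", "tv")

# Rows are products, columns are categories; entries are edit distances
d <- adist(tolower(products), categories)

# Pick the closest category for each product
nearest <- categories[apply(d, 1, which.min)]
data.frame(product = products, category = nearest)
```

A distance cut-off (for example, sending anything further than some threshold to 'other') could be layered on top, but a sensible threshold depends on the data.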

Using R, Randomly Assigning Students Into Groups Of 4

I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function, but I'm quickly getting in over my head. Any suggestions? Here is a sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.

I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.

I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)

Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
I then reran the group number generator and cbind for each remaining section in turn, so that the order of the numbers was drawn independently for each section.
Finally, I row-bound the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
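For what it's worth, the split / sample / cbind / rbind steps above can probably be collapsed into one call to ave(), which applies a function within each level of Section. The sketch below uses the sample data from the question (without the Email column) and assumes every section has the same number of students; Group and out_file are names made up for this example. It also writes the result to a CSV file, which any spreadsheet program can open.

```r
# Sample data from the question (Email column omitted for brevity)
Last.Name  <- c("Picard", "Troi", "Riker", "La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Section    <- c(1, 1, 2, 2, 3, 3, 4, 4)
df <- data.frame(Last.Name, First.Name, Section)

# Within each section, hand out the numbers 1..k in random order,
# so every group receives exactly one student from each section
df$Group <- ave(seq_len(nrow(df)), df$Section,
                FUN = function(i) sample(seq_along(i)))

# Sort so that each group's members sit together
df <- df[order(df$Group, df$Section), ]

# Write to a temporary CSV (hypothetical path); swap in a real path as needed
out_file <- file.path(tempdir(), "groups.csv")
write.csv(df, out_file, row.names = FALSE)
df
```

Call set.seed() beforehand if the assignment needs to be reproducible.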

I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here it is with base R subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2

If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complementary set

Converting factors to numeric values in R

I have factors in R that are salary ranges of the form $100,001 - $150,000, over $150,000, $25,000, etc. and would like to convert these to numeric values (e.g. converting the factor $100,001 - $150,000 to the integer 125000).
Similarly I have educational categories such as High School Diploma, Current Undergraduate, PhD, etc. that I would like to assign numbers to (e.g., giving PhD a higher value than High School Diploma).
How do I do this, given the dataframe containing these values?

For converting the currency
# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000"), educ = c("High School Diploma", "Current Undergraduate",
"PhD"),stringsAsFactors=FALSE)
# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)
# remove text
temp <- gsub("[[:alpha:]]","", temp)
# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))
For your education levels - if you want it numeric
df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
"Current Undergraduate", "PhD")))
df
# sal educ ave.sal educ.f
# 1 $100,001 - $150,000 High School Diploma 125000.5 1
# 2 over $150,000 Current Undergraduate 150000.0 2
# 3 $25,000 PhD 25000.0 3
EDIT
Having missing / NA values should not matter
# Data that includes missing values
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000" , NA), educ = c(NA, "High School Diploma",
"Current Undergraduate", "PhD"),stringsAsFactors=FALSE)
Rerun the above commands to get
df
# sal educ ave.sal educ.f
# 1 $100,001 - $150,000 <NA> 125000.5 NA
# 2 over $150,000 High School Diploma 150000.0 1
# 3 $25,000 Current Undergraduate 25000.0 2
# 4 <NA> PhD NA 3

You could use the recode function in the car package.
For example:
library(car)
df$salary <- recode(df$salary,
"'$100,001 - $150,000'=125000;'$150,000'=150000")
For more information on how to use this function see the help file.

I'd just make a vector of values that map to the levels of your factor and map them in. The code below is a much less elegant solution than I'd have liked because I can't figure out how to do the indexing with a vector, but nonetheless this will do the job if your data's not overwhelmingly large. Say we want to map the factor elements of fact to the numbers in vals:
fact<-as.factor(c("a","b","c"))
vals<-c(1,2,3)
#for example:
vals[levels(fact)=="b"]
# gives: [1] 2
#now make an example data frame:
sample(1:3,10,replace=T)
data<-data.frame(fact[sample(1:3,10,replace=T)])
names(data)<-c("myvar")
#our vlookup function:
vlookup<-function(fact,vals,x) {
#probably should do an error checking to make sure fact
# and vals are the same length
out<-rep(vals[1],length(x))
for (i in 1:length(x)) {
out[i]<-vals[levels(fact)==x[i]]
}
return(out)
}
#test it:
data$myvarNumeric<-vlookup(fact,vals,data$myvar)
This should work for what you're describing.
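On the indexing question: one way to do the lookup with a vector, as far as I can tell, is match(), which returns the position of each element of x among levels(fact); those positions can then index vals directly. A minimal sketch with the same fact as above (vals changed to 10, 20, 30 so the mapping is visible, and x a made-up input):

```r
fact <- as.factor(c("a", "b", "c"))
vals <- c(10, 20, 30)

# Made-up input to translate
x <- factor(c("b", "a", "c", "b"), levels = levels(fact))

# Position of each value of x among the levels, then index vals with it
out <- vals[match(as.character(x), levels(fact))]
out
```

Values of x that are not among the levels come back as NA rather than raising an error, which may or may not be what you want.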
