R: extract multiple variables from one column

I'm new to R so my apologies if this is unclear.
My data contains 1,000 observations of 3 variable columns: (a) person, (b) vignette, (c) response. The vignette column contains demographic information presented in a paragraph, including age (20, 80), sex (male, female), employment (employed, not employed, retired), etc. Each person received a vignette that randomly presented one of the values for age (20 or 80), sex (male or female), employment (employed, not employed, retired), etc.
(e.g. Person #1 received: A(n) 20 year old male is unemployed. Person #2 received: A(n) 80 year old female is retired. Person #3 received: A(n) 20 year old male is unemployed. ... Person #1,000 received: A(n) 20 year old female is employed.)
I'm trying to use tidyr::extract on (b) vignette to extract the rest of the demographic information and create several new variable columns labeled "age", "sex", "employment", etc. So far, I've only been able to extract "age" using this code:
tidyr::extract(data, vignette, c("age"), "([20:80]+)")
I want to extract all of the demographic information and create variable columns for (b) age, (c) sex, (d) employment, etc. My goal is to have 1,000 observation rows with several variable columns like this:
(a) person, (b) age, (c) sex, (d) employment (e) response
Person #1 20 Male unemployed Very Likely
Person #2 80 Female retired Somewhat Likely
Person #3 20 Male unemployed Very Unlikely
...
Person #1,000 20 Female employed Neither Likely nor Unlikely
Vignette Example:
structure(list(Response_ID = "R_86Tm81WUuyFBZhH", Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?", Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
I appreciate any guidance or help!

I made up some regexes to pull out your info. Experience shows that you're going to spend many hours tweaking the regexes before you get anything reasonably satisfactory; e.g. you won't pull the employment status correctly out of a sentence like "Neither she nor her boyfriend are employed".
library(dplyr)   # for %>% and select()
library(tibble)  # for add_row()

raw <- structure(list(Response_ID = "R_86Tm81WUuyFBZhH",
                      Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?",
                      Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
raw2 <- raw %>%
  add_row(Response_ID = "R_xesrew",
          Vignette = "A 22 year-old White boy drinks bleach. He is unemployed",
          Response = "Unlikely")
rzlt <- raw2 %>%
  tidyr::extract(Vignette, "Age", "(?ix) (\\d+) \\s* year\\-old", remove = FALSE) %>%
  tidyr::extract(Vignette, "Race", "(?ix) (hispanic|white|asian|black|native \\s* american)", remove = FALSE) %>%
  tidyr::extract(Vignette, "Job", "(?ix) (not \\s+ employed|unemployed|employed|jobless)", remove = FALSE) %>%
  tidyr::extract(Vignette, "Sex", "(?ix) (female|male|woman|man|boy|girl)", remove = FALSE) %>%
  select(-Vignette)
Gives
# A tibble: 2 x 6
  Response_ID       Sex   Job        Race     Age   Response
  <chr>             <chr> <chr>      <chr>    <chr> <chr>
1 R_86Tm81WUuyFBZhH woman employed   Hispanic 18    Very Likely
2 R_xesrew          boy   unemployed White    22    Unlikely
Save your work
library(readr)
write_csv(rzlt, "myResponses.csv")
Alternatively
library(openxlsx)
openxlsx::write.xlsx(rzlt, "myResponses.xlsx", asTable = TRUE)
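One more note on the regex in the question: `[20:80]` is a character class, i.e. the set of characters {2, 0, :, 8}, not a numeric range, so `([20:80]+)` happens to match 20 and 80 but would just as happily match 28 or 00. A plain digit pattern is safer (a minimal base-R sketch, separate from the pipeline above):

```r
vignettes <- c("A(n) 18 year-old Hispanic woman uses heroin several times a week.",
               "A 22 year-old White boy drinks bleach.")

# Capture the first run of digits in each vignette as the age
ages <- regmatches(vignettes, regexpr("\\d+", vignettes))
ages
# [1] "18" "22"
```

The same `"(\\d+)"` pattern is what the tidyr::extract() calls above rely on.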

Related

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stack Overflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code (with the reshape2 and dplyr packages loaded) in order to get the counts by each category for the variable Morality in each city:
library(reshape2)
library(dplyr)

mycounts <- melt(mydf,
                 id.vars = c("City"),
                 measure.vars = c("Morality")) %>%
  dcast(City ~ variable + value,
        value.var = "value", fill = 0, fun.aggregate = length)
The code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. in order to get the sum of Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
Do you mean something like this?
mydf %>%
melt(
id.vars = c("City"),
measure.vars = c("Morality")
) %>%
dcast(
City ~ variable + value,
value.var = "Books",
fill = 0,
fun.aggregate = sum
)
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
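If you also want the Books total next to the counts, as in the desired output in the question, one option is to keep dcast() for the counts and merge in a base aggregate() sum. A sketch, rebuilding the question's mydf:

```r
library(reshape2)

mydf <- data.frame(Person_ID = 1:10,
                   City = rep(c("NiceCity", "UglyCity"), 5),
                   Morality = rep(c("Bad guy", "Good guy"), each = 5),
                   Books = seq(0, 27, by = 3))

# Counts per Morality level, as in the question
mycounts <- dcast(melt(mydf, id.vars = "City", measure.vars = "Morality"),
                  City ~ variable + value,
                  value.var = "value", fill = 0, fun.aggregate = length)

# Total books per city, merged onto the counts
books <- aggregate(Books ~ City, data = mydf, FUN = sum)
merge(mycounts, books, by = "City")
#       City Morality_Bad guy Morality_Good guy Books
# 1 NiceCity                3                 2    60
# 2 UglyCity                2                 3    75
```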

Better and easier way to find who spoke the top 10 anger words in conversation text

I have a dataframe that contains variable 'AgentID', 'Type', 'Date', and 'Text' and a subset is as follows:
structure(list(AgentID = c("AA0101", "AA0101", "AA0101", "AA0101",
"AA0101"), Type = c("PS", "PS", "PS", "PS", "PS"), Date = c("4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019"), Text = c("I am on social security XXXX and I understand it can not be garnished by Paypal credit because it's federally protected.I owe paypal {$3600.00} I would like them to cancel this please.",
"My XXXX account is being reported late 6 times for XXXX per each loan I was under the impression that I was paying one loan but it's split into three so one payment = 3 or one missed payment would be three missed on my credit,. \n\nMy account is being reported wrong by all credit bureaus because I was in forbearance at the time that these late payments have been reported Section 623 ( a ) ( 2 ) States : If at any time a person who regularly and in the ordinary course of business furnishes information to one or more CRAs determines that the information provided is not complete or accurate, the furnisher must promptly provide complete and accurate information to the CRA. In addition, the furnisher must notify all CRAs that received the information of any corrections, and must thereafter report only the complete and accurate information. \n\nIn this case, I was in forbearance during that tie and document attached proves this. By law, credit need to be reported as of this time with all information and documentation",
"A few weeks ago I started to care for my credit and trying to build it up since I have never used my credit in the past, while checking my I discover some derogatory remarks in my XXXX credit report stating the amount owed of {$1900.00} to XXXX from XX/XX/2015 and another one owed to XXXX for {$1700.00} I would like to address this immediately and either pay off this debt or get this negative remark remove from my report.",
"I disputed this XXXX account with all three credit bureaus, the reported that it was closed in XXXX, now its reflecting closed XXXX once I paid the {$120.00} which I dont believe I owed this amount since it was an fee for a company trying to take money out of my account without my permission, I was charged the fee and my account was closed. I have notified all 3 bureaus to have this removed but they keep saying its correct. One bureau is showing XXXX closed and the other on shows XXXX according to XXXX XXXX, XXXX shows a XXXX, this account has been on my report for seven years",
"On XX/XX/XXXX I went on XXXX XXXX and noticed my score had gone down, went to check out why and seen something from XXXX XXXX and enhanced recovery company ... I also seen that it had come from XXXX and XXXX dated XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX ... I didnt have neither one before, I called and it the rep said it had come from an address Im XXXX XXXX, Florida I have never lived in Florida ever ... .I have also never had XXXX XXXX nor XXXX XXXX ... I need this taken off because it if affecting my credit score ... This is obviously identify theft and fraud..I have never received bills from here which proves that is was not done by me, I havent received any notifications ... if it was not for me checking my score I wouldnt have known nothing of this" )), row.names = c(NA, 5L), class = "data.frame")
First, I found out the top 10 anger words using the following:
library(tm)
library(tidytext)
library(tidyverse)
library(sentimentr)
library(wordcloud)
library(ggplot2)
CS <- function(txt){
  MC <- Corpus(VectorSource(txt))
  SW <- stopwords('english')
  # wrap base functions in content_transformer() so tm (>= 0.6) keeps the corpus structure
  MC <- tm_map(MC, content_transformer(tolower))
  MC <- tm_map(MC, removePunctuation)
  MC <- tm_map(MC, removeNumbers)
  MC <- tm_map(MC, removeWords, SW)
  MC <- tm_map(MC, stripWhitespace)
  myTDM <- as.matrix(TermDocumentMatrix(MC))
  v <- sort(rowSums(myTDM), decreasing = TRUE)
  FM <- data.frame(word = names(v), freq = v)
  row.names(FM) <- NULL
  FM <- FM %>%
    mutate(word = tolower(word)) %>%
    filter(str_count(word, "x") <= 1)
  return(FM)
}
DF <- CS(df$Text)
# using nrc
nrc <- get_sentiments("nrc")
# create final dataset
DF_nrc = DF %>% inner_join(nrc)
Then I created a vector of the top 10 anger words as follows:
TAW <- DF_nrc %>%
  filter(sentiment == "anger") %>%
  group_by(word) %>%
  summarize(freq = mean(freq)) %>%
  arrange(desc(freq)) %>%
  top_n(10) %>%
  select(word)
Next, I want to find which agents spoke these words most frequently and rank them, but I am not sure how. Should I search for the words one by one and group by agent, or is there a better way? The result I am looking for is something like this:
AgentID Words_Spoken Rank
A0001 theft, dispute, money 1
A0001 theft, fraud, 2
.......
If you are more of a dplyr/tidyverse person, you can take an approach using some dplyr verbs, after converting your text data to a tidy format.
First, let's set up some example data with several speakers, one of whom speaks no anger words. You can use unnest_tokens() to take care of most of your text cleaning steps with its defaults, such as splitting tokens, removing punctuation, etc. Then remove stopwords using anti_join(). I show using inner_join() to find the anger words as a separate step, but you could join these up into one big pipe if you like.
library(tidyverse)
library(tidytext)
my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"),
Text = c("I want to report a theft and there has been fraud.",
"I have taken great offense when there was theft and also poison. It is distressing.",
"I only experience soft, fluffy, happy feelings.",
"I have a dispute with the hateful scorpion, and also, I would like to report a fraud."))
my_df
#> # A tibble: 4 x 2
#> AgentID Text
#> <chr> <chr>
#> 1 AA0101 I want to report a theft and there has been fraud.
#> 2 AA0101 I have taken great offense when there was theft and also poison.…
#> 3 AA0102 I only experience soft, fluffy, happy feelings.
#> 4 AA0103 I have a dispute with the hateful scorpion, and also, I would li…
tidy_words <- my_df %>%
  unnest_tokens(word, Text) %>%
  anti_join(get_stopwords())
#> Joining, by = "word"
anger_words <- tidy_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment == "anger"))
#> Joining, by = "word"
anger_words
#> # A tibble: 10 x 3
#> AgentID word sentiment
#> <chr> <chr> <chr>
#> 1 AA0101 theft anger
#> 2 AA0101 fraud anger
#> 3 AA0101 offense anger
#> 4 AA0101 theft anger
#> 5 AA0101 poison anger
#> 6 AA0101 distressing anger
#> 7 AA0103 dispute anger
#> 8 AA0103 hateful anger
#> 9 AA0103 scorpion anger
#> 10 AA0103 fraud anger
Now you know which anger words each person used, and the next step is to count them up and rank people. The dplyr package has fantastic support for exactly this kind of work. First you want to group_by() the person identifier, then calculate a couple of summarized quantities:
- the total number of words (so you can arrange by this)
- a pasted-together string of the words used
Afterwards, arrange by the number of words and make a new column that gives you the rank.
anger_words %>%
  group_by(AgentID) %>%
  summarise(TotalWords = n(),
            WordsSpoken = paste0(word, collapse = ", ")) %>%
  arrange(-TotalWords) %>%
  mutate(Rank = row_number())
#> # A tibble: 2 x 4
#> AgentID TotalWords WordsSpoken Rank
#> <chr> <int> <chr> <int>
#> 1 AA0101 6 theft, fraud, offense, theft, poison, distressi… 1
#> 2 AA0103 4 dispute, hateful, scorpion, fraud 2
Do notice that with this approach, you don't have a zero entry for the person who spoke no anger words; they get dropped at the inner_join(). If you want them in the final data set, you will likely need to join back up with an earlier dataset and use replace_na().
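That join-back step might look like this (a sketch with hypothetical per-agent counts mirroring the summary above; the names `anger_counts` and `all_agents` are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical summary from the pipeline above; AA0102 spoke no anger words
anger_counts <- tibble(AgentID = c("AA0101", "AA0103"),
                       TotalWords = c(6L, 4L))
all_agents <- tibble(AgentID = c("AA0101", "AA0102", "AA0103"))

# Join the counts back onto the full agent list, filling the gap with zero
ranked <- all_agents %>%
  left_join(anger_counts, by = "AgentID") %>%
  replace_na(list(TotalWords = 0L)) %>%
  arrange(desc(TotalWords)) %>%
  mutate(Rank = row_number())
ranked
#> # A tibble: 3 x 3
#>   AgentID TotalWords  Rank
#>   <chr>        <int> <int>
#> 1 AA0101           6     1
#> 2 AA0103           4     2
#> 3 AA0102           0     3
```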
Created on 2019-09-11 by the reprex package (v0.3.0)
Not the most elegant solution, but here's how you could count the words row by row:
library(stringr)
# write a new data.frame retaining the AgentID and Date from the original table
new.data <- data.frame(Agent = df$AgentID, Date = df$Date)
# using a for-loop to go through every row of text in the df provided.
# use a for-loop to go through every row of text in the df provided
for(i in seq_len(nrow(new.data))){  # i is the row number in the original df
  # check which of the top anger words (TAW$word) appear in the text of row i:
  # str_detect() does the boolean check, sapply() loops it over each word,
  # and TAW$word[...] keeps only the words that matched
  e101 <- TAW$word[sapply(TAW$word, function(x) str_detect(df[i, "Text"], x))]
  # store the number of matched words for this row
  new.data[i, "number_of_TAW"] <- length(e101)
  # concatenate the matched words into a single string for this row
  new.data[i, "Words_Spoken"] <- ifelse(length(e101) == 0, "", paste(e101, collapse = ","))
}
new.data
# Agent Date number_of_TAW Words_Spoken
# 1 AA0101 4/1/2019 0
# 2 AA0101 4/1/2019 0
# 3 AA0101 4/1/2019 2 derogatory,remove
# 4 AA0101 4/1/2019 3 fee,money,remove
# 5 AA0101 4/1/2019 1 theft

180 nested conditions in a separate file to create a new id variable for each row in the my dataframe

I need to identify 180 short sentences written by experiment participants and assign each sentence a serial number in a new column. I have the 180 conditions in a separate file. All the texts are in Hebrew, but I attach examples in English that can be understood.
Here is an example of seven rows from the 180-row experiment data. There are 181 different conditions, each with its own serial number, so I also add a small 6-condition example that matches this participant's data:
data_participant <- data.frame("text" = c("I put a binder on a high shelf",
"My friend and me are eating chocolate",
"I wake up with superhero powers",
"Low wooden table with cubes",
"The most handsome man in camopas invites me out",
"My mother tells me she loves me and protects me",
"My laptop drops and breaks"),
"trial" = (1:7) )
data_condition <- data.frame("condition_a" = c("wooden table" , "eating" , "loves",
"binder", "handsome", "superhero"),
"condition_b" = c("cubes", "chocolate", "protects me",
"shelf","campos", "powers"),
"condition_c" = c("0", "0", "0", "0", "me out", "0"),
"i.d." = (1:6) )
I decided to use the ifelse function with a nested-conditions strategy and to write 181 lines of code, one per condition. It is also cumbersome because it requires moving between English and Hebrew. But after about 30 conditions I started getting an error message:
contextstack overflow
A screenshot shows the error at line 147, i.e. after 33 conditions.
In the example there are at most 3 keywords per condition, but in the full data some conditions have 5 or 6 keywords (the reason for this is the diversity in the participants' verbal formulations). Therefore the original table of conditions has 7 columns: one for the i.d. number and the rest for the word identifiers of the same condition, combined with an "or" operator.
data <- mutate(data, script_id =
  ifelse(grepl("wooden table", data$imagery) | grepl("cubes", data$imagery), "1",
  ifelse(grepl("eating", data$imagery) | grepl("chocolate", data$imagery), "2",
  ifelse(grepl("loves", data$imagery) | grepl("protect me", data$imagery), "3",
  ifelse(grepl("binder", data$imagery) | grepl("shelf", data$imagery), "4",
  ifelse(grepl("handsome", data$imagery) | grepl("campus", data$imagery) | grepl("me out", data$imagery), "5",
  ifelse(grepl("superhero", data$imagery) | grepl("powers", data$imagery), "6",
         "181")))))))
# I expect the output will be new column in the participant data frame
# with the corresponding ID number for each text.
# I managed to get it when I made 33 conditions rows. And then I started
# to get an error message contextstack overflow.
final_output <- data.frame("text" = c("I put a binder on a high shelf", "My friend and me are eating chocolate",
"I wake up with superhero powers", "Low wooden table with cubes",
"The most handsome man in camopas invites me out",
"My mother tells me she loves me and protects me",
"My laptop drops and breaks"),
"trial" = (1:7),
"i.d." = c(4, 2, 6, 1, 5, 3, 181) )
Here's an approach using fuzzyjoin::regex_left_join.
library(dplyr)
library(tidyr)

data_condition_long <- data_condition %>%
  gather(col, text_match, -`i.d.`) %>%
  filter(text_match != 0) %>%
  arrange(`i.d.`)

data_participant %>%
  fuzzyjoin::regex_left_join(data_condition_long %>% select(-col),
                             by = c("text" = "text_match")) %>%
  mutate(`i.d.` = if_else(is.na(`i.d.`), 181L, `i.d.`)) %>%
  # if `i.d.` is doubles instead of integers, use this:
  # mutate(`i.d.` = if_else(is.na(`i.d.`), 181, `i.d.`)) %>%
  group_by(trial) %>%
  slice(1) %>%
  ungroup() %>%
  select(-text_match)
# A tibble: 7 x 3
text trial i.d.
<fct> <int> <int>
1 I put a binder on a high shelf 1 4
2 My friend and me are eating chocolate 2 2
3 I wake up with superhero powers 3 6
4 Low wooden table with cubes 4 1
5 The most handsome man in camopas invites me out 5 5
6 My mother tells me she loves me and protects me 6 3
7 My laptop drops and breaks 7 181

Count word frequency across multiple columns in R

I have a data frame in R with multiple columns with multi-word text responses, that looks something like this:
1a 1b 1c 2a 2b 2c
student job prospects money professors students campus
future career unsure my grades opportunities university
success reputation my job earnings courses unsure
I want to be able to count the frequency of words in columns 1a, 1b, and 1c combined, as well as 2a, 2b, and 2c combined.
Currently, I'm using this code to count word frequency in each column individually.
data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))
Ideally, I want to be able to combine the two sets of columns into just two columns and then use this same code to count word frequency, but I'm open to other options.
The combined columns would look something like this:
1 2
student professors
future my grades
success earnings
job prospects students
career opportunities
reputation courses
money campus
unsure university
my job unsure
Here's a way using the dplyr and tidyr packages. FYI, one should avoid having column names starting with a number; naming them a1, a2, ... would make things easier in the long run.
library(dplyr)
library(tidyr)

df %>%
  gather(variable, value) %>%
  mutate(variable = substr(variable, 1, 1)) %>%
  mutate(id = ave(variable, variable, FUN = seq_along)) %>%
  spread(variable, value)
id 1 2
1 1 student professors
2 2 future my grades
3 3 success earnings
4 4 job prospects students
5 5 career opportunities
6 6 reputation courses
7 7 money campus
8 8 unsure university
9 9 my job unsure
Data -
df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects",
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students",
"opportunities", "courses"), `2c` = c("campus", "university",
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA,
-3L))
In general, you should avoid column names that start with numbers. That aside, I created a reproducible example of your problem and provided a solution using dplyr and tidyr. The substr() call inside the mutate_at assumes your column names follow the [num][char] pattern in your example.
library(dplyr)
library(tidyr)
data <- tibble::tribble(
~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
'student','job prospects', 'mone', 'professor', 'students', 'campus',
'future', 'career', 'unsure', 'my grades', 'opportunities', 'university',
'success', 'reputation', 'my job', 'earnings', 'courses', 'unsure'
)
data %>%
  gather(key, value) %>%
  mutate_at('key', substr, 0, 1) %>%
  group_by(key) %>%
  mutate(id = row_number()) %>%
  spread(key, value) %>%
  select(-id)
# A tibble: 9 x 2
`1` `2`
<chr> <chr>
1 student professor
2 future my grades
3 success earnings
4 job prospects students
5 career opportunities
6 reputation courses
7 mone campus
8 unsure university
9 my job unsure
If your end purpose is to count frequency (as opposed to switching from wide to long format), you could do
ave(unlist(df[,paste0("a",1:3)]), unlist(df[,paste0("a",1:3)]), FUN = length)
which will count the frequency of the elements of columns a1,a2,a3, where df denotes the data frame (and the columns are labeled a1,a2,a3,b1,b2,b3).
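Equivalently, base table() on the same unlisted columns gives the frequency table directly, one count per distinct word. A sketch with made-up a1:a3 columns (following the renaming suggested above):

```r
df <- data.frame(a1 = c("student", "future", "unsure"),
                 a2 = c("career", "unsure", "money"),
                 a3 = c("unsure", "money", "my job"),
                 stringsAsFactors = FALSE)

# Frequency of every word across columns a1, a2, a3 combined
freq <- table(unlist(df[, paste0("a", 1:3)]))
sort(freq, decreasing = TRUE)
# "unsure" appears 3 times, "money" twice, the rest once
```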

How to convert specific rows into columns in r?

I have a df in R of only one column of food ratings from amazon.
head(food_ratings)
product.productId..B001E4KFG0
1 review/userId: A3SGXH7AUHU8GW
2 review/profileName: delmartian
3 review/helpfulness: 1/1
4 review/score: 5.0
5 review/time: 1303862400
6 review/summary: Good Quality Dog Food
The rows repeat, so that rows 7 through 12 contain the same six fields for another user. This pattern is repeated many times.
Therefore, I need to have every group of 6 rows distributed in one row with 6 columns, so that later I can subset, for instance, the review/summary according to their review/score.
I'm using RStudio 1.0.143
EDIT: I was asked to show the output of dput(head(food_ratings, 24)) but it was too big regardless of the number used.
Thanks a lot
I have taken your data and added 2 more fake users to it. Using tidyr and dplyr you can create new columns and collapse the data into a nice data.frame. You can use select from dplyr to drop the id column if you don't need it or to rearrange the order of the columns.
library(tidyr)
library(dplyr)
df %>%
  separate(product.productId..B001E4KFG0, into = c("details", "data"), sep = ": ") %>%
  mutate(details = sub("review/", "", details)) %>%
  group_by(details) %>%
  mutate(id = row_number()) %>%
  spread(details, data)
# A tibble: 3 x 7
id helpfulness profileName score summary time userId
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1/1 delmartian 5.0 Good Quality Dog Food 1303862400 A3SGXH7AUHU8GW
2 2 1/1 martian2 1.0 Good Quality Snake Food 1303862400 123456
3 3 2/5 martian3 5.0 Good Quality Cat Food 1303862400 123654
data:
df <- structure(list(product.productId..B001E4KFG0 = c("review/userId: A3SGXH7AUHU8GW",
"review/profileName: delmartian", "review/helpfulness: 1/1",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Dog Food",
"review/userId: 123456", "review/profileName: martian2", "review/helpfulness: 1/1",
"review/score: 1.0", "review/time: 1303862400", "review/summary: Good Quality Snake Food",
"review/userId: 123654", "review/profileName: martian3", "review/helpfulness: 2/5",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Cat Food"
)), class = "data.frame", row.names = c(NA, -18L))
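If the file really is a strict repeat of the same six lines per review, a base-R alternative (a sketch assuming the six fields always appear in the same order, with two of the reviews from the data above) is to reshape the column into a six-column matrix:

```r
lines <- c("review/userId: A3SGXH7AUHU8GW", "review/profileName: delmartian",
           "review/helpfulness: 1/1", "review/score: 5.0",
           "review/time: 1303862400", "review/summary: Good Quality Dog Food",
           "review/userId: 123456", "review/profileName: martian2",
           "review/helpfulness: 1/1", "review/score: 1.0",
           "review/time: 1303862400", "review/summary: Good Quality Snake Food")

# Strip the "review/<field>: " prefix, then fold into one row per review
wide <- as.data.frame(matrix(sub("^review/[^:]+: ", "", lines),
                             ncol = 6, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- c("userId", "profileName", "helpfulness",
                 "score", "time", "summary")
wide
```

Unlike the spread() approach, this breaks as soon as one review is missing a field, so it only fits perfectly regular files.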
