Map differing length variables to one data frame - r

Suppose I have the following data:
specialty <- c("Primary Care", "Internal Medicine Subspecialties" ,
"Pediatric subspecialties","Surgical subspecialties", "Emergency
Medicine","All other specialties", "No Medical specialty")
test <- c(23,43,67,77,54)
dfTEST <- data.frame(test)
dfTEST<- t(dfTEST)
colnames(dfTEST) <- c(1,2,4,5,7)
> dfTEST
1 2 4 5 7
test 23 43 67 77 54
Note that dfTEST has 5 columns whose names skip numbers. I need to create a data frame that maps these column names (1, 2, 4, 5, 7) to the specialties. specialty holds 7 strings whose positions correspond to the dfTEST column names, meaning dfTEST column 2 = "Internal Medicine Subspecialties", column 4 = "Surgical subspecialties", and so on. Below is a snippet of what I am looking to achieve, but I am stumped on how to go about it. I need it to be flexible so that the code still works no matter what the numbers in the column names are. Any ideas? Thanks!
> dfTEST
1 2 4 5 7
test 23 43 67 77 54
added "primary care" "internal" ...

This should solve your problem:
library(dplyr)
specialty_lookup <- data.frame(specialty = c("Primary Care",
"Internal Medicine Subspecialties",
"Pediatric subspecialties",
"Surgical subspecialties",
"Emergency Medicine",
"All other specialties",
"No Medical specialty"),
test = 1:7,
stringsAsFactors = F)
data <- data.frame(code = c(23,43,67,77,54),
test = c(1,2,4,5,7))
data <- data %>%
left_join(specialty_lookup)
data_wide <- data %>%
select(-test) %>%
t() %>%
data.frame()
colnames(data_wide) <- data$test
data_wide
But you should ask yourself whether this is really the format you want your data to have. From the little I could see of your problem, the following long format would be more suitable:
library(dplyr)
specialty_lookup <- data.frame(specialty = c("Primary Care",
"Internal Medicine Subspecialties",
"Pediatric subspecialties",
"Surgical subspecialties",
"Emergency Medicine",
"All other specialties",
"No Medical specialty"),
test = 1:7, stringsAsFactors = F)
data <- data.frame(code = c(23,43,67,77,54),
test = c(1,2,4,5,7))
data <- data %>%
left_join(specialty_lookup)
data

Hope this helps:
# get the indexes of the corresponding specialties
ids <- as.integer(colnames(dfTEST))
dfTEST<- as.data.frame(t(dfTEST))
dfTEST$added <- specialty[ids]
dfTEST<- t(dfTEST)
The output:
> dfTEST
1 2 4
test "23" "43" "67"
added "Primary Care" "Internal Medicine Subspecialties" "Surgical subspecialties"
5 7
test "77" "54"
added "Emergency \n Medicine" "No Medical specialty"
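If the transposed layout is not a hard requirement, the same index lookup can be kept in long form so that test stays numeric instead of being coerced to character. This is a minimal base-R sketch, assuming the column names are positions into specialty; map_specialty is a hypothetical helper, not part of any package:

```r
specialty <- c("Primary Care", "Internal Medicine Subspecialties",
               "Pediatric subspecialties", "Surgical subspecialties",
               "Emergency Medicine", "All other specialties",
               "No Medical specialty")
test  <- c(23, 43, 67, 77, 54)
codes <- c("1", "2", "4", "5", "7")  # the column names of dfTEST in the question

# Hypothetical helper: one row per code, with the label looked up by position
map_specialty <- function(values, codes, labels) {
  idx <- as.integer(codes)
  # Fail early if a code does not point at an existing label
  stopifnot(length(values) == length(idx),
            all(idx >= 1 & idx <= length(labels)))
  data.frame(code = idx, test = values, specialty = labels[idx],
             stringsAsFactors = FALSE)
}

long <- map_specialty(test, codes, specialty)
long
```

Because the lookup is purely positional, it should keep working for any subset of codes between 1 and length(specialty).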

Related

R: find words from tweets in Lexicon, count them and save number in dataframe with tweets

I have a data set of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also created my own lexicon (formal_lexicon) of around 1 million words, all in a formal language style. Now I want to write a small piece of code that counts, per tweet, how many words (if any) are also in that lexicon.
tweets_data:
Content
1 "Blablabla"
2 "Hi my name is"
3 "Yes I need"
.
.
.
50176 "TEXT50176"
formal_lexicon:
X
1 "admittedly"
2 "Consequently"
3 "Furthermore"
.
.
.
1000000 "meanwhile"
The output should thus look like:
Content Lexicon
1 "TEXT1" 1
2 "TEXT2" 3
3 "TEXT3" 0
.
.
.
50176 "TEXT50176" 2
Should be a simple for loop like:
for(sentence in tweets_data$Content){
for(word in sentence){
if(word %in% formal_lexicon){
...
}
}
}
I don't think "word" works and I'm not sure how to count in the specific column if a word is in the lexicon. Can anyone help?
structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
You can try something like this:
library(tidytext)
library(dplyr)
# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
tweets_data <- c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
# put your tweets in a data.frame
tweets_data_df <- data.frame(Content = tweets_data, id = seq_along(tweets_data))
tweets_data_df %>%
# get the word
unnest_tokens( txt,Content) %>%
# add a field that flags whether the word is in the lexicon (keeping the 0s)
mutate(pres = ifelse(txt %in% formal_lexicon$X,1,0)) %>%
# grouping
group_by(id) %>%
# summarise
summarise(cnt = sum(pres)) %>%
# put back the texts
left_join(tweets_data_df ) %>%
# reorder the columns
select(id, Content, cnt)
With result:
Joining, by = "id"
# A tibble: 6 x 3
id Content cnt
<int> <chr> <dbl>
1 1 "#barackobama Thank you for your incredible grace in leadership a~ 0
2 2 "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles Co~ 0
3 3 "2017 resolution: to embody authenticity!" 0
4 4 "Happy Holidays! Sending love and light to every corner of the ea~ 0
5 5 "Damn, it's hard to wrap presents when you're drunk. cc #santa" 0
6 6 "When my whole fam tryna have a peaceful holiday " 0
Hope this is useful for you:
library(magrittr)
library(dplyr)
library(tidytext)
# Data frame with tweets, including an ID
tweets <- data.frame(
id = 1:3,
text = c(
'Hello, this is the first tweet example to your answer',
'I hope that my response help you to do your task',
'If it is tha case, please upvote and mark as the correct answer'
)
)
lexicon <- data.frame(
word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)
# Counting the words in each tweet that are present in your lexicon
in_lexicon <- tweets %>%
# Put every word of your tweets on its own row
tidytext::unnest_tokens(output = 'words', input = text) %>%
# Determining if a word is in your lexicon
dplyr::mutate(
in_lexicon = words %in% lexicon$word
) %>%
dplyr::group_by(id) %>%
dplyr::summarise(words_in_lexicon = sum(in_lexicon))
# Binding count and the original data
dplyr::left_join(tweets, in_lexicon)
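For comparison, the loop sketched in the question can also be written in base R without extra packages. This is only a sketch: the three tweets and the three-word lexicon below are made-up illustration data, and the tokenizer (lower-casing, then splitting on anything that is not a letter or apostrophe) is an assumption that is cruder than tidytext's.

```r
# Made-up illustration data, not the asker's real tweets or lexicon
formal_lexicon <- data.frame(X = c("admittedly", "consequently", "furthermore"),
                             stringsAsFactors = FALSE)
tweets_data <- data.frame(
  Content = c("Admittedly, this is long. Furthermore, it rains.",
              "Hi my name is",
              "Consequently I need coffee"),
  stringsAsFactors = FALSE
)

# Lower-case each tweet and split it into words
tokens <- strsplit(tolower(tweets_data$Content), "[^a-z']+")

# Count, per tweet, how many of its words appear in the lexicon
tweets_data$Lexicon <- vapply(
  tokens,
  function(w) sum(w %in% formal_lexicon$X),
  integer(1)
)
tweets_data
```

Note that the lexicon column (formal_lexicon$X) has to be used on the right-hand side of %in%, not the data frame itself.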

180 nested conditions in a separate file to create a new id variable for each row in my dataframe

I need to identify 180 short sentences written by experiment participants and assign each sentence a serial number in a new column. I have 180 conditions in a separate file. All the texts are in Hebrew, but I attach examples in English so they can be understood.
I'm adding an example of seven lines from the 180-line experiment data. There are 181 different conditions, each with its own serial number, so I'm also adding a small 6-condition example that matches this participant data:
data_participant <- data.frame("text" = c("I put a binder on a high shelf",
"My friend and me are eating chocolate",
"I wake up with superhero powers",
"Low wooden table with cubes",
"The most handsome man in camopas invites me out",
"My mother tells me she loves me and protects me",
"My laptop drops and breaks"),
"trial" = (1:7) )
data_condition <- data.frame("condition_a" = c("wooden table" , "eating" , "loves",
"binder", "handsome", "superhero"),
"condition_b" = c("cubes", "chocolate", "protects me",
"shelf","campos", "powers"),
"condition_c" = c("0", "0", "0", "0", "me out", "0"),
"i.d." = (1:6) )
I decided to use the ifelse function with a nested-conditions strategy and to write 181 lines of code, one line per condition. It's also cumbersome because it requires switching between English and Hebrew. But after about 30 lines I started getting an error message:
contextstack overflow
The error is reported at line 147, that is, after 33 conditions.
In the example there are at most 3 keywords per condition, but in the full data there are conditions with 5 or 6 keywords (the reason is the diversity in the participants' verbal formulations). Therefore the original table of conditions has 7 columns: one for the i.d. number and the rest for the identifying words of the same condition, combined with the "or" operator.
data <- mutate(data, script_id = ifelse((grepl( "wooden table" ,data$imagery))|(grepl( "cubes" ,data$imagery))
,"1",
ifelse((grepl( "eating" ,data$imagery))|(grepl( "chocolate" ,data$imagery))
,"2",
ifelse((grepl( "loves" ,data$imagery))|(grepl( "protect me" ,data$imagery))
,"3",
ifelse((grepl( "binder" ,data$imagery))|(grepl( "shelf" ,data$imagery))
,"4",
ifelse( (grepl("handsome" ,data$imagery)) |(grepl( "campus" ,data$imagery) )|(grepl( "me out" ,data$imagery))
,"5",
ifelse((grepl("superhero", data$imagery)) | (grepl( "powers" , data$imagery ))
,"6",
"181")))))))
# I expect the output will be new column in the participant data frame
# with the corresponding ID number for each text.
# I managed to get it when I made 33 conditions rows. And then I started
# to get an error message contextstack overflow.
final_output <- data.frame("text" = c("I put a binder on a high shelf", "My friend and me are eating chocolate",
"I wake up with superhero powers", "Low wooden table with cubes",
"The most handsome man in camopas invites me out",
"My mother tells me she loves me and protects me",
"My laptop drops and breaks"),
"trial" = (1:7),
"i.d." = c(4, 2, 6, 1, 5, 3, 181) )
Here's an approach using fuzzyjoin::regex_left_join.
library(dplyr)
library(tidyr)
data_condition_long <- data_condition %>%
gather(col, text_match, -`i.d.`) %>%
filter(text_match != 0) %>%
arrange(`i.d.`)
data_participant %>%
fuzzyjoin::regex_left_join(data_condition_long %>% select(-col),
by = c("text" = "text_match")) %>%
mutate(`i.d.` = if_else(is.na(`i.d.`), 181L, `i.d.`)) %>%
# if `i.d.` is doubles instead of integers, use this:
# mutate(`i.d.` = if_else(is.na(`i.d.`), 181, `i.d.`)) %>%
group_by(trial) %>%
slice(1) %>%
ungroup() %>%
select(-text_match)
# A tibble: 7 x 3
text trial i.d.
<fct> <int> <int>
1 I put a binder on a high shelf 1 4
2 My friend and me are eating chocolate 2 2
3 I wake up with superhero powers 3 6
4 Low wooden table with cubes 4 1
5 The most handsome man in camopas invites me out 5 5
6 My mother tells me she loves me and protects me 6 3
7 My laptop drops and breaks 7 181
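The same idea can be sketched in base R by building one "or" pattern per condition row and looping over the conditions, which avoids deep ifelse nesting altogether. This is a minimal sketch on a cut-down, made-up subset of the example data; it assumes "0" marks an unused keyword slot and that 181 is the fallback id, as in the question.

```r
data_participant <- data.frame(
  text  = c("I put a binder on a high shelf",
            "My laptop drops and breaks",
            "Low wooden table with cubes"),
  trial = 1:3,
  stringsAsFactors = FALSE
)
data_condition <- data.frame(
  condition_a = c("wooden table", "binder"),
  condition_b = c("cubes", "shelf"),
  condition_c = c("0", "0"),
  i.d.        = c(1L, 4L),
  stringsAsFactors = FALSE
)

# One alternation pattern per condition, dropping the "0" placeholders
keyword_cols <- setdiff(names(data_condition), "i.d.")
patterns <- unname(apply(data_condition[keyword_cols], 1,
                         function(k) paste(k[k != "0"], collapse = "|")))

# Start with the fallback id, then overwrite wherever a pattern matches.
# Iterating in reverse lets earlier conditions win, like the nested ifelse.
data_participant$i.d. <- 181L
for (i in rev(seq_along(patterns))) {
  hit <- grepl(patterns[i], data_participant$text)
  data_participant$i.d.[hit] <- data_condition$i.d.[i]
}
data_participant
```

Since the loop length grows with the number of conditions rather than the nesting depth, it should not run into the contextstack limit.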

Reshape long to wide on repeated rows

I have a data frame df that looks like the following:
Label Info
1 0-22 Records N/A
2 0-22 Records Poland
3 0-22 Records N/A
4 0-22 Records active
5 0-22 Records Hardcore
6 0-22 Records N/A
7 0-22 Records N/A
8 Nuclear Blast "Oeschstr. 40 73072 Donzdorf"
9 Nuclear Blast Germany
10 Nuclear Blast +49 7162 9280-0
11 Nuclear Blast active
12 Nuclear Blast Hardcore (early), Metal and subgenres
13 Nuclear Blast 1987
14 Nuclear Blast "Anstalt Records, Arctic Serenades, Cannibalised Serial Killer, Deathwish Office, Epica, Gore Records, Grind Syndicate Media, Mind Control Records, Nuclear Blast America, Nuclear Blast Brasil, Nuclear Blast Entertainment, Radiation Records, Revolution Entertainment"
15 Nuclear Blast Yes
I would like to reshape to wide where df will look like:
Label Address Country Phone Status Genre Year Sub Online
1 0-22 Records N/A Poland N/A active Hardcore N/A N/A N/A
2 Nuclear Blast "Oes.." Germany +49...
.
.
The number of repeated rows varies from 7 to 9 and I used reshape and reshape2 with the key assigned to "Label" to no avail.
EDIT: dput:
structure(list(label = c("0-22 Records", "0-22 Records", "0-22 Records",
"0-22 Records", "0-22 Records", "0-22 Records", "0-22 Records",
"Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast",
"Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast",
"Metal Blade Records", "Metal Blade Records", "Metal Blade Records",
"Metal Blade Records", "Metal Blade Records"), info = c(" N/A ",
"Poland", " N/A ", "active", " Hardcore ", " N/A ", "N/A", " Oeschstr.
40\r\n73072 Donzdorf ",
"Germany", " +49 7162 9280-0 ", "active", " Hardcore (early), Metal and
subgenres ", " 1987 ", "\n\t\t\t\t\t\t\t\t\tAnstalt
Records,\t\t\t\t\t\t\t\t\tArctic Serenades,\t\t\t\t\t\t\t\t\tCannibalised
Serial Killer,\t\t\t\t\t\t\t\t\tDeathwish
Office,\t\t\t\t\t\t\t\t\tEpica,\t\t\t\t\t\t\t\t\tGore
Records,\t\t\t\t\t\t\t\t\tGrind Syndicate Media,\t\t\t\t\t\t\t\t\tMind
Control Records,\t\t\t\t\t\t\t\t\tNuclear Blast
America,\t\t\t\t\t\t\t\t\tNuclear Blast Brasil,\t\t\t\t\t\t\t\t\tNuclear
Blast Entertainment,\t\t\t\t\t\t\t\t\tRadiation
Records,\t\t\t\t\t\t\t\t\tRevolution Entertainment\t\t\t\t\t ",
"Yes", " 5737 Kanan Road #143\r\nAgoura Hills, California 91301 ",
"United States", " N/A ", "active", " Heavy Metal, Extreme Metal "
)), .Names = c("label", "info"), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x10200db78>)
The new column names for the wide data frame (e.g., Address, Country, etc.) don't appear in df. We need to add a column to df that maps info to the correct column names for the wide data frame in order to ensure that a given row's data ends up in the correct columns after reshaping.
The challenge is that we need to find ways to exploit regularities in the data in order to figure out which values of info represent Genre, Country, Year, etc. Based on the data sample you've provided, here are some initial ideas. In the code below, the case_when statement is an attempt to map info to the new column names. Going in order, the statements within the case_when statement are trying to do the following:
Find Country by identifying strings containing country names
Find Status (assuming it can only be either "active" or "inactive")
Find Genre. Here you'll need to cover more possibilities.
Find Year. I've assumed any row with a four-digit number in the range 1950-2017 represents a year. Adjust as necessary.
Find Phone. I've assumed it always starts with +, so you may need something more complex here.
Find Online (assuming it can only be either "Yes" or "No", and that no row that would be mapped to a different column would ever contain only the word "Yes" or "No")
Find Sub. You'll likely need a more complex strategy here. For now I've assumed rows that contain the words "Records" or "Entertainment" or that have three or more commas are Sub rows.
If a row doesn't match any of the above statements, assume it's an address.
You'll need to play around with these and see what works in the context of your data.
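To make the "first matching rule wins" logic of the list above concrete, here is a small base-R sketch. classify_info is a hypothetical helper; it leaves out the Country and Sub rules (which need the country list and more data-specific heuristics) and assumes the same simple patterns described above.

```r
# Hypothetical helper: label each info string by the first rule it matches
classify_info <- function(info) {
  info <- trimws(info)
  # Rules in priority order; each is a logical vector over `info`
  rules <- list(
    Status = grepl("^(active|inactive)$", info),
    Genre  = grepl("hardcore|metal|rock|classical", info, ignore.case = TRUE),
    Year   = grepl("^[0-9]{4}$", info) &
             suppressWarnings(as.integer(info)) %in% 1950:2017,
    Phone  = grepl("^\\+", info),
    Online = grepl("^(Yes|No)$", info)
  )
  out <- rep(NA_character_, length(info))
  for (name in names(rules)) {
    # Only fill entries that no earlier rule has claimed
    out[is.na(out) & rules[[name]]] <- name
  }
  out[is.na(out)] <- "Address"  # fallback when nothing matched
  out
}

labels <- classify_info(c(" 1987 ", "active", "+49 7162 9280-0",
                          "Yes", " Hardcore ", "Oeschstr. 40"))
labels
```

Reordering the entries of rules changes which label an ambiguous string receives, which is the same trade-off as ordering the conditions inside case_when.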
library(stringr)
library(tidyverse)
library(countrycode)
data("countrycode_data")
df %>%
filter(!grepl("N/A", info)) %>%
mutate(info = str_trim(gsub("\r*\t*|\n*| {2,}", "", info)),
NewCols = case_when(sapply(info, function(x) any(grepl(x, countrycode_data$country.name.en))) ~ "Country",
grepl("active", info) ~ "Status",
grepl("hardcore|metal|rock|classical", info, ignore.case=TRUE) ~ "Genre",
info %in% 1950:2017 ~ "Year",
grepl("^\\+", info) ~ "Phone",
grepl("^Yes$|^No$", info) ~ "Online",
grepl("Records|Entertainment|,.*,.*,", info) ~ "Sub",
TRUE ~ "Address")) %>%
group_by(label) %>%
spread(NewCols, info)
Here's the output (where I've truncated the long value of Sub to save space):
label Address Country Genre Online Phone Status Sub Year
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 0-22 Records <NA> Poland Hardcore <NA> <NA> active NA <NA>
2 Metal Blade Records 5737 Kanan Road #143Agoura Hills, California 91301 United States Heavy Metal, Extreme Metal <NA> <NA> active NA <NA>
3 Nuclear Blast Oeschstr. 4073072 Donzdorf Germany Hardcore (early), Metal and subgenres Yes +49 7162 9280-0 active Anstalt Re... 1987
Original answer (before data sample was available)
If you had all nine rows for each Label, and the data type in each row is always in the same order for each Label, then one solution would be:
library(tidyverse)
df.wide = df %>%
group_by(Label) %>%
mutate(NewCols = rep(c("Address","Country","Phone","Status","Genre","Year","Sub","Online"), length(unique(Label)))) %>%
spread(NewCols, Info)
You can implement this in your real data for any level of Label that has 9 rows.
df.wide9 = df %>%
group_by(Label) %>%
filter(n()==9) %>%
mutate(NewCols = rep(c("Address","Country","Phone","Status","Genre","Year","Sub","Online"), length(unique(Label)))) %>%
spread(NewCols, Info)
For the levels of Label with 8 or 7 rows, if the missing rows always represent the same type of data (say the address row is the one that's always missing for the 8-row levels of Label), then you could do the following (once again, assuming the data types are in the same order for each Label):
df.wide8 = df %>%
group_by(Label) %>%
filter(n()==8) %>%
mutate(NewCols = rep(c("Country","Phone","Status","Genre","Year","Sub","Online"), length(unique(Label)))) %>%
spread(NewCols, Info)
Then you could put them together with df.wide = bind_rows(df.wide8, df.wide9).
If you provide more information, we might be able to come up with a solution that works for your actual data.

Words matching in two columns using r

I have two data frames: DF1 is a word dictionary and DF2 contains sentences. I want to do text matching such that if any word in DF1 matches any word of a DF2 sentence, the output is a column with yes for a match or no otherwise. The data frames are as follows:
(DF1) word dictionary:
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
(DF2)sentences:
DF2 <- c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor")
and output should be:
Customer satisfaction index improvement ( yes)
reduction in retail cycle (no)
Improve market share (yes)
% recovery from vendor (no)
Note: the yes/no value is a separate column showing the result of the text matching.
Can anyone help? Thanks in advance.
You could do it like this:
df <- data.frame(sentence = c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor"))
words <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
# combine the words in a regular expression and bind it as column yes
df <- cbind(df, yes = grepl(paste(words, collapse = "|"), df$sentence))
This outputs
sentence yes
1 Customer satisfaction index improvement TRUE
2 reduction in retail cycle FALSE
3 Improve market share TRUE
4 % recovery from vendor FALSE
See it working on ideone.com.
Try this:
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
DF2 <- c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor")
result <- cbind(DF2, "word found" = ifelse(rowSums(sapply(DF1, grepl, x = DF2)) > 0, "YES", "NO"))
> result
DF2 word found
[1,] "Customer satisfaction index improvement" "YES"
[2,] "reduction in retail cycle" "NO"
[3,] "Improve market share" "YES"
[4,] "% recovery from vendor" "NO"
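Both answers above match substrings and are case-sensitive, so "share" would also hit "Shareholders" if case were ignored, and "Market" would be missed. If whole-word, case-insensitive matching is what is wanted, a sketch along these lines may help; the fifth sentence is a made-up example added only to show the difference.

```r
words <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
sentences <- c("Customer satisfaction index improvement",
               "reduction in retail cycle",
               "Improve market share",
               "% recovery from vendor",
               "Shareholders meeting")  # made-up example, not from the question

# \\b anchors the alternation to word boundaries so only whole words match
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

result <- data.frame(
  sentence = sentences,
  match = ifelse(grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE),
                 "yes", "no"),
  stringsAsFactors = FALSE
)
result
```

This assumes the dictionary words contain no regex metacharacters; otherwise they would need to be escaped before being pasted into the pattern.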

Remove part of an element in a dataframe in R

I have a data frame (DF) like this:
word
1 vet clinic New York
2 super haircut Alabama
3 best deal on dog drugs
4 doggy medicine Texas
5 cat healthcare
6 lizards that don't lie
I am trying to get the resulting data frame (only remove the geo names)
word
1 vet clinic
2 super haircut
3 best deal on dog drugs
4 doggy medicine
5 cat healthcare
6 lizards that don't lie
The following does not keep the remaining words after the geo name has been removed.
vec <- # vector of geo names
DF <-DF[!grepl(vec,DF$word),]
Using @Ari's variables and data frame, a vectorized method could use Reduce:
vec = c("New York", "Texas", "Alabama")
word = c("vet clinic New York", "super haircut Alabama", "best deal on dog drugs", "doggy medicine Texas", "cat healthcare", "lizards that don't lie")
df = data.frame(word=word)
df$word = as.character(df$word)
Reduce(function(a, b) gsub(b,"", a, fixed=T), vec, df$word)
[1] "vet clinic " "super haircut " "best deal on dog drugs" "doggy medicine "
[5] "cat healthcare" "lizards that don't lie"
Using @Ari's example,
library(stringr)
df$word <- str_trim(gsub(paste(vec,collapse="|"),"", df$word))
df$word
#[1] "vet clinic" "super haircut" "best deal on dog drugs"
#[4] "doggy medicine" "cat healthcare" "lizards that don't lie"
As Henrik mentioned, it would have been helpful if you submitted a reproducible example along with your post. I will do so here:
vec = c("New York", "Texas", "Alabama")
word = c("vet clinic New York", "super haircut Alabama", "best deal on dog drugs", "doggy medicine Texas", "cat healthcare", "lizards that don't lie")
df = data.frame(word=word)
df$word = as.character(df$word)
df
word
1 vet clinic New York
2 super haircut Alabama
3 best deal on dog drugs
4 doggy medicine Texas
5 cat healthcare
6 lizards that don't lie
Generally speaking, R gurus prefer vectorization over for loops. But in this case I found a nested for loop and the stringr package to be the easiest way to solve this problem.
library(stringr)
for(i in 1:nrow(df))
{
for (j in 1:length(vec))
{
df[i, "word"] = str_replace_all(df[i, "word"], vec[j], "")
}
}
df
word
1 vet clinic
2 super haircut
3 best deal on dog drugs
4 doggy medicine
5 cat healthcare
6 lizards that don't lie
I believe that this code gives you the result that you were looking for.
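The answers above can be combined into a single vectorized base-R call that also tidies up the leftover whitespace. This is a sketch under the assumption that geo names should only be removed as whole words; the \\Q...\\E quoting (a PCRE feature, hence perl = TRUE) is there so that names containing regex metacharacters are treated literally.

```r
vec  <- c("New York", "Texas", "Alabama")
word <- c("vet clinic New York", "super haircut Alabama",
          "best deal on dog drugs", "doggy medicine Texas",
          "cat healthcare", "lizards that don't lie")

# Quote each name literally and join them into one whole-word alternation
pattern <- paste0("\\b(", paste0("\\Q", vec, "\\E", collapse = "|"), ")\\b")

# Remove the names, collapse repeated spaces, then trim the ends
cleaned <- trimws(gsub("\\s+", " ", gsub(pattern, "", word, perl = TRUE)))
cleaned
```

Assigning the result back with df$word <- cleaned would update the data frame in place.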

Resources