R data frame from contents of a character vector - r

I am working with a long character vector where each row is supposed to contain a small data frame. I created a function to clean the data and produce a string which would be ready to input to the data.frame() function. The output is as follows:
[1] "`demo/members/education_member` = c('High_school', 'High_school'), `demo/members/status` = c('Other', 'Other'), `demo/members/name` = c('Hans Solo', 'Luke Skywalker')"
I wanted to pass this output to data.frame() to obtain the following data frame (resulting from copying and pasting the unquoted output above and passing it to the data.frame function):
demo.members.education_member demo.members.status demo.members.name
1 High_school Other Hans Solo
2 High_school Other Luke Skywalker
Question: What would be the best way to convert the contents of a character output into an R data frame?

Here's a way, but to reiterate my comment: there is likely a better way to transform your raw data into a cleaned-up data frame.
x <- "`demo/members/education_member` = c('High_school', 'High_school'), `demo/members/status` = c('Other', 'Other'), `demo/members/name` = c('Hans Solo', 'Luke Skywalker')"
y <- paste0("data.frame(", x, ")")
eval(parse(text = y))
demo.members.education_member demo.members.status demo.members.name
1 High_school Other Hans Solo
2 High_school Other Luke Skywalker
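Since the question mentions a long character vector with one such string per element, the same idea can be wrapped in a small helper and applied to every element. The following is only a minimal sketch: string_to_df is a hypothetical name, and it assumes every string comes from a trusted source, because eval(parse()) runs whatever code the string contains. Passing check.names = FALSE keeps the original column names (with slashes) instead of the dotted versions.

```r
# Hypothetical helper: wrap one cleaned string in a data.frame() call and evaluate it.
# Assumption: the strings are trusted input, since eval() executes arbitrary code.
string_to_df <- function(x, check.names = TRUE) {
  call_text <- paste0(
    "data.frame(", x,
    ", check.names = ", check.names,
    ", stringsAsFactors = FALSE)"
  )
  eval(parse(text = call_text))
}

x <- "`demo/members/education_member` = c('High_school', 'High_school'), `demo/members/status` = c('Other', 'Other'), `demo/members/name` = c('Hans Solo', 'Luke Skywalker')"

# One string, keeping the original column names
df <- string_to_df(x, check.names = FALSE)

# A whole character vector: one data frame per element
df_list <- lapply(c(x, x), string_to_df)
```

If the small data frames share the same columns, do.call(rbind, df_list) would stack them into one.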

Here is one way to do it, provided you can add the data.frame() call to your string. If not, let me know and I'll delete this. It is quick and dirty: it parses the string with rlang but still evaluates it with base eval. You could improve it by replacing base eval with rlang's quasiquotation tools.
x <- "data.frame(
'demomemberseducationmember' = c('High_school','High_school'),
'demomembersstatus' = c('Other', 'Other'),
'demomembersname' = c('Hans Solo', 'Luke Skywalker'))"
eval(rlang::parse_expr(x))

Related

Converting list of Characters to Named num in R

I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
The names in column 1 refer to a set of named results of the cor.test function. The second column should contain the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate, and so on.
My problem is that I don't want to add $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character"
How can I get the numeric result for ABC_D1$estimate into my dataframe? How can I convert these characters into Named num? The 3rd column should consist of the results of $p.value.
As pointed out by @DSGym, there are several problems, including that it is not very convenient to have a vector of character names; it would be better to have a list of objects instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
dat_l <- get(dat)
dat_l[["estimate"]]
}
)
cbind(name_list, estimates)
This is not really advisable but given those premises...
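To go from names to the full three-column data frame, the objects can be fetched once with mget() and the numbers extracted with vapply(). This is a sketch under stated assumptions: the two cor.test results are built from made-up random data so that the example runs on its own, and the column names C1 to C3 follow the question.

```r
# Made-up stand-ins for the existing cor.test results
set.seed(1)
ABC_D1 <- cor.test(rnorm(20), rnorm(20))
ABC_D2 <- cor.test(rnorm(20), rnorm(20))

name_list <- c("ABC_D1", "ABC_D2")

# Fetch all objects by name in one call; the result is a named list
tests <- mget(name_list, envir = environment())

# Pull one number per test; vapply() checks that each result is a single numeric
df1 <- data.frame(
  C1 = name_list,
  C2 = vapply(tests, function(t) unname(t$estimate), numeric(1), USE.NAMES = FALSE),
  C3 = vapply(tests, function(t) t$p.value, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Storing the cor.test results in a list from the start would make the mget() step unnecessary.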
OK, I think I now know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You concatenate the two strings and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
library(purrr) # provides map_dbl()
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))

Can "assign()" and "get()" be written more concisely?

Below is my code. I use an extra variable "tmp" to clean the "ABC_Chla". Because the "Location_name" can change, I use the "assign()" and "get()" functions.
Location_name <- "ABC_"
tmp <- get(paste(Location_name,"DO",sep = "")) %>% filter(log.DO != -Inf)
assign(paste(Location_name,"DO",sep = ""), tmp)
My code achieves this goal, but it does not seem concise (it introduces a temporary variable). Is there a better way?
Assuming the inputs shown reproducibly in the Note at the end (next time please make sure your question includes complete reproducible code including inputs) we can make the following changes:
use paste0 instead of paste
create a variable locname to hold the name of the data frame and a variable e to be the environment where our data frame is located
use e[[...]] instead of get and assign
use magrittr %<>% two-way pipe
possibly use filter(is.finite(log.DO)) -- not shown below
giving this code:
library(dplyr)
library(magrittr)
e <- .GlobalEnv # change if our data frame is in some other environment
locname <- paste0(Location_name, "DO")
e[[locname]] %<>%
filter(log.DO != -Inf)
The result is:
get(locname, e)
## log.DO
## 1 1
## 2 2
Alternative
This alternative only uses ordinary pipes. We use e and locname from above.
library(dplyr)
e[[locname]] <- e[[locname]] %>%
filter(log.DO != -Inf)
Note
Test input:
ABC_DO <- data.frame(log.DO = c(1, -Inf, 2))
Location_name <- "ABC_"
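The e[[locname]] idea also works with an ordinary named list, which avoids keeping the data frames as separate global variables. A minimal sketch, assuming the test input from the Note; datasets is a hypothetical name, and base subset() stands in for dplyr's filter() so the example has no package dependencies.

```r
# Keep the data frames in a named list (datasets is a hypothetical name)
datasets <- list(ABC_DO = data.frame(log.DO = c(1, -Inf, 2)))

Location_name <- "ABC_"
locname <- paste0(Location_name, "DO")

# Look up, filter, and store back by name; no get(), assign(), or tmp needed
datasets[[locname]] <- subset(datasets[[locname]], is.finite(log.DO))
```

Note that is.finite() also drops Inf, NA, and NaN, which is slightly stricter than log.DO != -Inf.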
You only have a temporary variable because you store the data in tmp; I don't see that as a problem. But in this case, the only thing I can see you doing is passing the expression for tmp directly to assign, like:
assign(
paste(Location_name,"DO",sep = ""),
get(paste(Location_name,"DO",sep = "")) %>% filter(log.DO != -Inf)
)

NLP - identifying and replacing words (synonyms) in R

I have a problem with my R code.
I have a data set (questions) with 4 columns and over 600k observations; one of the columns is named 'V3'.
This column contains questions like 'what is the day?'.
I have a second data set (voc) with 2 columns, one named 'word' and the other named 'synonyms'. If a word from the 'synonyms' column of voc appears in my first data set (questions), I want to replace it with the corresponding word from the 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
V3
1 what is the day today?
2 Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output
V3 V5
1 what is the day today? what is a weather today?
2 Tom has brown eyes Tom has blue eyes
I wrote simple code but it doesn't work.
for (k in 1:nrow(question))
{
for (i in 1:nrow(voc))
{
question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
}
}
Maybe someone will try to help me? :)
I wrote a second version, but it doesn't work either.
for( i in 1:nrow(questions))
{
for( j in 1:nrow(voc))
{
if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
{
new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
questions[i,]=new
}
}
questions = cbind(questions,c(new))
}
First, make sure your strings are not converted to factors: use the stringsAsFactors = FALSE option, either at the program level or during your data import. In R versions before 4.0.0, data.frame() and the read.table() family convert strings to factors by default (since 4.0.0 the default is FALSE). Factors are useful in modeling, but you want to analyze the text itself, so you should be sure that your text is not coerced to factors.
The way I approached this was to write a function that "explodes" each string into a vector of words and then uses match to replace the words. The vector is then reassembled into a string.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed, but %in% can be a real beast depending on the length of the string and other factors.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {
# Start by breaking the input string into a vector
# Note that we use [[1]] to get first list element of strsplit output
# Obviously this relies on breaking sentences by spacing
orig_words <- strsplit(x = input_string,split = " ")[[1]]
# If we find at least one of the words to replace in the original words, proceed
if(sum(orig_words %in% words_to_repl) > 0) {
# The right side selects the elements of orig_words that match words to be replaced
# The left side uses match to find the numeric index of those replacements within the words_to_repl vector
# This numeric vector is used to select the values from repl_words
# These then replace the values in orig_words
orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]
# We rebuild the sentence again, and return a list with original and new version
new_sent <- paste(orig_words,collapse = " ")
return(list(original = input_string,new = new_sent))
} else {
# Otherwise we return the original version since no changes are needed
return(list(original = input_string,new = input_string))
}
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
args = lapply(X = questions$V3,
FUN = uFunc_FindAndReplace,
words_to_repl = voc$synonyms,
repl_words = voc$word))
>
original new
1 What is the day today? What is a weather today?
2 Tom has brown eyes Tom has blue eyes
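For 600K rows, an approach that works on the whole column at once may be worth trying. The sketch below loops over the (short) vocabulary instead of over the questions and lets gsub() handle all rows in each pass. It is an alternative technique, not the function above, and it rests on assumptions: replace_synonyms is a hypothetical name, the synonyms are plain words without regex metacharacters, and no replacement word is itself listed as a synonym (otherwise replacements could chain).

```r
questions <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)
voc <- data.frame(word = c("weather", "a", "blue"),
                  synonyms = c("day", "the", "brown"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one vectorised gsub() pass per vocabulary entry
replace_synonyms <- function(text, synonyms, words) {
  for (i in seq_along(synonyms)) {
    # \\b restricts the match to whole words, so "day" does not hit "today"
    pattern <- paste0("\\b", synonyms[i], "\\b")
    text <- gsub(pattern, words[i], text, perl = TRUE)
  }
  text
}

questions$V5 <- replace_synonyms(questions$V3, voc$synonyms, voc$word)
questions$V5
# "What is a weather today?" "Tom has blue eyes"
```

Unlike splitting on spaces, the word-boundary pattern also matches a synonym that is directly followed by punctuation.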

How to efficiently create the same variables for each element of a list?

I am a long-time Stata user but am trying to familiarize myself with the syntax and logic of R. I am wondering if you could help me write more efficient code than what is shown below (see "The Not-so-efficient Codes").
The goal is to (A) read several files (each of which holds the data for one year), (B) create the same variables for each file, and (C) combine the files into a single one for statistical analysis. I have finished revising part A, but am struggling with the rest, particularly part B. Could you give me some ideas as to how to proceed, e.g. use unlist to unlist data.l first, or lapply over each element of data.l? I appreciate your comments. Thanks.
More Efficient Codes: Part A
# Create an empty list
data.l = list()
# Create a list of file names
fileList=list.files(path="C:/My Data", pattern=".dat")
# Read the ".dat" files into a single list
data.l = sapply(fileList, readLines)
The Not-so-efficient Codes: Part A, B and C
setwd("C:/My Data")
# Part A: Read the data. Each "dat" file is text file and each line in the file has 300 characters.
dx2004 <- readLines("2004.INJVERBT.dat")
dx2005 <- readLines("2005.INJVERBT.dat")
dx2006 <- readLines("2006.INJVERBT.dat")
# Part B-1: Create variables for each year of data
dt2004 <-data.frame(hhx = substr(dx2004,7,12),fmx = substr(dx2004,13,14),
iphow = substr(dx2004,19,318),stringsAsFactors = FALSE)
dt2005 <-data.frame(hhx = substr(dx2005,7,12),fmx = substr(dx2005,13,14),
iphow = substr(dx2005,19,318),stringsAsFactors = FALSE)
dt2006 <-data.frame(hhx = substr(dx2006,7,12),fmx = substr(dx2006,13,14),
iphow = substr(dx2006,19,318),stringsAsFactors = FALSE)
# Part B-2: Create the "iid" variable for each year of data
dt2004$iid<-paste0("2004",dt2004$hhx, dt2004$fmx, dt2004$fpx, dt2004$ipepno)
dt2005$iid<-paste0("2005",dt2005$hhx, dt2005$fmx, dt2005$fpx, dt2005$ipepno)
dt2006$iid<-paste0("2006",dt2006$hhx, dt2006$fmx, dt2006$fpx, dt2006$ipepno)
# Part C: Combine the three years of data into a single one
data = rbind(dt2004,dt2005, dt2006)
You are almost there. It's a combination of lapply and do.call/rbind to work with lapply's list output.
Consider this example:
test1 = "Thisistextinputnumber1"
test2 = "Thisistextinputnumber2"
test3 = "Thisistextinputnumber3"
data.l = list(test1, test2, test3)
makeDF <- function(inputText){
DF <- data.frame(hhx = substr(inputText, 7, 12), fmx = substr(inputText, 13, 14), iphow = substr(inputText, 19, 318), stringsAsFactors = FALSE)
DF <- within(DF, iid <- paste(hhx, fmx, iphow))
return(DF)
}
do.call(rbind, (lapply(data.l, makeDF)))
Here test1, test2, test3 represent your dx200X, and data.l should be the list format you get from the efficient version of Part A.
In makeDF you create your desired data.frame. The do.call(rbind, ) is somewhat standard if you work with lapply-return values.
You might also want to check out the data.table package, whose rbindlist function replaces any do.call/rbind construction (and is much faster), alongside other useful tools for large data sets.
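The question's iid also starts with the year, which makeDF does not receive. One possible way to pass it along is Map(), which walks over the list and a parallel vector of years. This is a sketch with assumptions: the list is named by file name as in Part A, the year is the first four characters of each name, the input lines are made up, and only hhx and fmx go into iid because fpx and ipepno are not defined in the code shown. make_year_df is a hypothetical name.

```r
# Made-up stand-in for the list of lines read from the files
data.l <- list(
  "2004.INJVERBT.dat" = c("AAAAAA111111221234fell off ladder",
                          "AAAAAA333333441234slipped on ice"),
  "2005.INJVERBT.dat" = c("AAAAAA555555661234burned hand")
)

# Hypothetical helper: build one year's data frame, including the iid
make_year_df <- function(lines, year) {
  df <- data.frame(hhx = substr(lines, 7, 12),
                   fmx = substr(lines, 13, 14),
                   iphow = substr(lines, 19, 318),
                   stringsAsFactors = FALSE)
  df$iid <- paste0(year, df$hhx, df$fmx)
  df
}

# Assumption: each file name starts with the four-digit year
years <- substr(names(data.l), 1, 4)

combined <- do.call(rbind, Map(make_year_df, data.l, years))
rownames(combined) <- NULL  # drop the file-name-based row names
```

When reading the real files, lapply(fileList, readLines) always returns a list, whereas sapply may simplify the result to a matrix if every file has the same number of lines.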

Using ifelse on factor in R

I am restructuring a dataset of species names. It has a column with latin names and column with trivial names when those are available. I would like to make a 3rd column which gives the trivial name when available, otherwise the latin name. Both trivial names and latin names are in factor-class.
I have tried with an if-loop:
if(art2$trivname==""){
art2$artname=trivname
}else{
art2$artname=latname
}
It gives me the correct trivnames, but only gives NA when supplying latin names.
And when I use ifelse I only get numbers.
As always, all help appreciated :)
Example:
art <- data.frame(trivname = c("cat", "", "deer"), latname = c("cattus", "canis", "cervus"))
art$artname <- with(art, ifelse(trivname == "", as.character(latname), as.character(trivname)))
print(art)
# trivname latname artname
# 1 cat cattus cat
# 2 canis canis
# 3 deer cervus deer
(I think options(stringsAsFactors = FALSE) as default would be easier for most people, but there you go...)
Getting only numbers suggests that you need to add as.character to your assignments. Note also that if() only tests a single value, so it has to run once per row, and that your assignments do not refer to the data frame. Option 1: a loop with if/else:
art2$artname <- NA_character_
for (i in seq_len(nrow(art2))) {
if (as.character(art2$trivname[i]) == "") {
art2$artname[i] <- as.character(art2$latname[i])
} else {
art2$artname[i] <- as.character(art2$trivname[i])
}
}
Option 2: Using ifelse:
art2$artname= ifelse(as.character(art2$trivname) == "", as.character(art2$latname),as.character(art2$trivname))
It is probably easier (and more "R-thonic" because it avoids the loop) just to assign artname to trivial across the board, then overwrite the blank ones with latname...
art2 = art
art2$artname = as.character(art$trivname)
changeme = which(art2$artname=="")
art2$artname[changeme] = as.character(art$latname[changeme])
If art2 is the dataframe, and artname the new column, another possible solution:
art2$artname <- as.character(art2$trivname)
art2[art2$artname == "",'artname'] <- as.character(art2[art2$artname == "", 'latname'])
And if you want factors in the new column:
art2$artname <- as.factor(art2$artname)
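The "only numbers" symptom from the question can be reproduced in a few lines, which may make the role of as.character clearer. A small sketch, assuming factor columns as described in the question (forced here with stringsAsFactors = TRUE):

```r
# stringsAsFactors = TRUE reproduces the factor columns described in the question
art <- data.frame(
  trivname = c("cat", "", "deer"),
  latname = c("cattus", "canis", "cervus"),
  stringsAsFactors = TRUE
)

# ifelse() drops the factor attributes, so the result holds integer level codes
codes <- ifelse(art$trivname == "", art$latname, art$trivname)
is.numeric(codes)

# Converting to character first keeps the labels
artname <- ifelse(art$trivname == "",
                  as.character(art$latname),
                  as.character(art$trivname))
artname
# "cat" "canis" "deer"
```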
