So my situation is that I have a list of files in a physical chemistry dataset, created from multiple calculations, and I want to run a foreach or while loop through a column named Files in my dataframe titled CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.
I have filenames which look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", etc ...
I want to run a while or foreach loop through my column "Files", check whether the word "GLU" or "ASP" exists in each filename, and if I find one, print it to a list.
So for the files above, the printing order would be "GLU", "ASP", "GLU", "ASP", and so on all the way down through my 1273 file entries; the files aren't ordered in any particular way. Then I can save this list, put it into a column titled "Residues" in my dataframe, and do some useful exploratory data analysis.
Note: ASP is for the amino acid Aspartate, and GLU is for the amino acid Glutamate.
I know that I can grep (regular expression search) for the terms in the column "Files" like so.
Searching for "ASP":
> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-198-A_ASP-197-A.log:"
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"
[4] "1EU1A_TRP-32-A_ASP-33-A.log:"
As you can see I get a few matches (683, in fact). But that's not good enough: I need to know where the matches occur, not just that they occur.
And of course I can grep for "GLU":
> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-16-A_GLU-9-A.log:"
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"
And I get a whole bunch of matches!
I tried a for loop. Of course it failed!!!
> for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("ASP")}
else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("GLU")}}
All it did was print:
[1] "ASP"
[1] "ASP"
[1] "ASP"
...
Even though there is "GLU"!
I mean I can do basic algebraic loops that don't matter to anyone:
> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16
Anyway, I checked the warnings to see what was going wrong:
> warnings()
Warning messages:
1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
As you can see I'm getting the same warning over and over again. I guess that makes sense since this is a loop. But why is this happening, and why can't I grep inside of a loop?
My dataframe that I am trying to parse looks like this:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437
where commas separate columns.
This is what I want the result to look like:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms", "Residue",
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896, "GLU",
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204, "ASP",
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897, "GLU",
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956, "ASP",
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145, "GLU",
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058, "GLU",
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436, "GLU",
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437, "GLU",
...
Any help is appreciated! Thank you!
We can split the dataset into a list of data.frames using a substring derived with sub:
lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))
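If the goal is the extra "Residues" column from the question rather than a list of data.frames, the same sub() pattern can be assigned directly (a sketch against the df1 defined under Data below):
df1$Residues <- sub(".*_([A-Z]{3})-.*", "\\1", df1$Files)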
Data
df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:",
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:",
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:",
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197,
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055,
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1",
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896,
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145,
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X",
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))
I am not sure I get your question completely, but suppose your data resides in a dataframe "dat" (which contains rows for both GLU and ASP). Use the code below to create a field which contains the value "ASP" or "GLU".
library(stringr)
newvar <- NULL
newvar$GLU <- str_extract(dat$Files,"(GLU)")
newvar$ASP <- str_extract(dat$Files,"(ASP)")
newvar1 <- data.frame(newvar)
newvar1
library(tidyr)
newvar1[is.na(newvar1)] = ""
new <- unite(newvar1, new, GLU:ASP, sep='')
dat$new <- new
Here the field called new contains your value of GLU or ASP.
Answer:
dat
X Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1 1AH7A_TRP-16-A_GLU-9-A.log: -8.497878 CD1 4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log: -7.926482 CD1 3.543075 ASP
3 3 1BGFA_TRP-43-A_GLU-44-A.log: -6.735078 CD1 4.171795 GLU
4 4 1CXQA_TRP-61-A_ASP-82-A.log: -9.398872 CD1 5.298973 ASP
5 5 1D8WA_TRP-17-A_GLU-14-A.log: -9.747203 CD1 3.693986 GLU
6 6 1D8WA_TRP-17-A_GLU-18-A.log: -11.323520 CD1 3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log: -7.468913 CD1 5.411084 GLU
8 8 1E58A_TRP-15-A_GLU-18-A.log: -6.598308 CD1 4.797902 GLU
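For what it's worth, the two str_extract calls and the unite step can likely be collapsed into a single call, since str_extract accepts alternation (a sketch, assuming the same dat):
library(stringr)
dat$new <- str_extract(dat$Files, "GLU|ASP")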
After a long time I figured out a solution to my problem:
# Save my column as a vector because factors are making the world burn:
Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)
# Split the Files into three parts along the two underscores, and keep only the third piece (str_split_fixed is from the stringr package):
library(stringr)
Files <- str_split_fixed(Files, "_", 3)[,3]
Result:
[1] "GLU-9-A.log:"
"ASP-197-A.log:"
etc ...
# Split those results along the hyphens, and take the first piece (what comes before the first hyphen):
Residues <- str_split_fixed(Files, "-", 3)[,1]
> Residues
[1] "GLU" "ASP" "GLU", ...
# Add the Residues column to my data.frame:
CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residues <- Residues
I guess the grep function is overrated. I had to look hard for this function.
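For the record, base R can do the same extraction in one step with regexpr and regmatches, no splitting needed (a sketch over the same Files vector):
# assumes every filename contains exactly one of GLU or ASP;
# filenames with neither would be dropped, shortening the result
Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)
Residues <- regmatches(Files, regexpr("GLU|ASP", Files))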
Assuming you saved the data you are trying to parse in the file glu_vs_asp.csv, below is an example of how you can create two data frames, one for GLU and one for ASP:
# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)
# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]
dt_asp <- dt[grep("ASP", dt$Files),]
To create a data frame containing both GLU and ASP you can try the following:
dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]
The commands
grep("ASP", dt$Files)
grep("GLU", dt$Files)
give you the indices of the rows that contain respectively 'ASP' and 'GLU' in the Files column.
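Since every row holds exactly one of the two residues, the same test also fills the desired Residue column in one vectorized step, no loop required (a sketch under that assumption):
dt$Residue <- ifelse(grepl("ASP", dt$Files), "ASP", "GLU")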
Related
I want to work with a very big text file (txt) that contains more than 3 million lines, each line with a different name, containing characters and integers.
My idea is to clean this file up a little bit (so it's easier to use) and remove those names that have more than 2 digits.
I would like to parse the names, count the digits, and, if a name contains fewer than 3 digits, write it to a new file in R.
My big file would be something like this (separating names in new lines):
susan123 susan1 john john22345 alex55 alex1234
And then I would have this new file:
susan1 john alex55
Is this possible in R?
Thanks
I'll leave reading and writing the file to your choice of functions. When you've got the words in R in a vector, here's a way to subset it to just the names with fewer than 3 digits.
x = c("susan123", "susan1", "john", "john22345", "alex55", "alex1234")
library(stringr)
x[str_detect(x, pattern = "\\D+\\d{0,2}$")]
# [1] "susan1" "john" "alex55"
Base R:
We could use which with nchar and gsub, plus the useNames = TRUE argument of which:
x = c("susan123", "susan1", "john", "john22345", "alex55", "alex1234")
x[which(nchar(gsub("\\D", "", x)) < 3, useNames = TRUE)]
[1] "susan1" "john" "alex55"
I have a folder full of raster files. They come in groups of 12, where each one is a band of the Sentinel 2 satellite (there are 12 bands). I simply want to create a loop that goes through the folder and first identifies the two bands that I am interested in (Bands 4 and 5). To process them in pairs from the same set, I am trying to extract from the Band 4 filename the date of the photo as a string, which I will then use to retrieve the Band 5 from the same date.
That's where the problem comes in. The names look like this: T31UER_20210722T105619_B12.jp2, but I only manage to extract the numbers from it and get rid of the 31, and this gives me: 20190419105621042
The core of my question, then, is: how can I select only a small part (YYYY/MM/DD) of this string?
Here is the piece of code. As you can see, my method is to select the parts I want deleted. But it doesn't work for the second step, where the part coming after the date changes for every file, except for the 042.
Thank you very much!
for (f in files){
#Load band 4
Bande4 <- list.files(path="C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac",
pattern ="B04.jp2$", full.names=TRUE)
#Copy the date
x <- gsub("[A-z //.//(//)]", "", Bande4)
y <- gsub("31", "", x)
z <- gsub("??? this part changes for every file!", "", y)
#Load the matching Band 5
Bande5 <- list.files(path="C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac",
pattern = z, full.names=TRUE)
#Calculate NDVI
NDVI <- ((Bande5 - Bande4)/(Bande5 + Bande4))
#Save the result
r4 <- writeRaster(z, "C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac", format="GTiff", overwrite=TRUE)
}
You can use substr to extract certain characters from a string, e.g.:
substr(z, 1, 8)
[1] "20210722"
If your names are always in the same format, you can directly use substr without gsub first:
substr(Bande4, 8, 15)
# e.g. with
substr("T31UER_20210722T105619_B12.jp2", 8, 15)
[1] "20210722"
You can select the date because it's a string 8 digits long between an underscore and a capital letter (here I assume it's always "T"):
str <- "T31UER_20210722T105619_B12.jp2"
sub("(.*_)([[:digit:]]{8})(T.*)", "\\2", str)
#> [1] "20210722"
I describe the string as a regex and only gather the second part of it (the parts being delimited by parentheses).
I hope it will match all your rasters!
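As a rough sketch of how the extracted date could then drive the pairing step from the question, where f is the current Band 4 file from the asker's loop (the path is the asker's; the B05 naming is an assumption by analogy with B04/B12):
date <- sub("(.*_)([[:digit:]]{8})(T.*)", "\\2", basename(f))
# find the Band 5 file taken on the same date
Bande5 <- list.files(path="C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac",
                     pattern = paste0("_", date, ".*B05\\.jp2$"), full.names = TRUE)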
Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least that would be very practical for me.
So what I tried is this: I set the working directory to the corresponding folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the colnames of every file to "Wavenumber" and the corresponding file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. After that I could probably do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
colnames(i) = c("Wavenumber", deparse(substitute(i)))
d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error message:
"Error in colnames<-(*tmp*, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the "colnames()" line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
# get the csvs' full paths; (?i) makes the match case insensitive
list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
# create a named vector: you need it to assign ids in the next step.
# and remove the file extension to get clean colnames
set_names(tools::file_path_sans_ext(basename(.))) %>%
# read file one by one, bind them in one df and create id column
map_dfr(read.csv, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
# pivot to create one column for each .id
pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated out of the names of the vector you're looping over. (That's why we added the names with set_names)
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
In the end you will have a dataframe with as many rows as there are Wavenumber values and as many columns as the number of CSVs plus 1 (the Wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 5 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 401. 77.8 77.8
#> 2 403. 78.1 78.1
#> 3 405. 78.4 78.4
#> 4 407. 78.4 78.4
#> 5 409. 78.2 78.2
Thanks a lot to @Konrad Rudolph for the suggestions!
No need for a loop here; simply use lapply.
First set your working directory to the file location:
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$")
theData_list <- lapply(files_to_upload, read.csv)
C_Sycamore <- bind_rows(theData_list)
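If you also need to keep track of which file each spectrum came from, naming the list first lets bind_rows() record it in an id column (a sketch; reshaping to one column per file, as in the pivot_wider answer above, would still be a separate step):
names(theData_list) <- files_to_upload
C_Sycamore <- bind_rows(theData_list, .id = "file")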
I have the text of a novel in a single vector, novel.vector.words, which has been split by words. I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string, and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)){ # access each match...
# access the current match
node <- novel.vector.words[node.positions[i]]
# access the left context of the current match
left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
# access the right context of the current match
right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
# concatenate and print the results
cat(left.context,"\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)}
What I am not sure how to do, however, is use something like an if statement to only capture instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" that it finds, I want to check whether the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
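And since the question also asks how many such instances there are, the length of that index vector gives the count directly:
idx <- which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
length(idx)
[1] 2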
In response to the question in the comments:
This certainly could be done with a loop-based approach, but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135
I have a problem with some code in R.
I have a dataset (questions) with 4 columns and over 600k observations; one of the columns is named 'V3'.
This column has questions like 'what is the day?'.
I have a second dataset (voc) with 2 columns, one named 'word' and the other named 'synonyms'. If a word from the 'synonyms' column of voc appears in my first dataset (questions), I want to replace it with the matching word from the 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
V3
1 what is the day today?
2 Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output
V3 V5
1 what is the day today? what is a weather today?
2 Tom has brown eyes Tom has blue eyes
I wrote some simple code, but it doesn't work.
for (k in 1:nrow(question))
{
for (i in 1:nrow(voc))
{
question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
}
}
Maybe someone will try to help me? :)
I wrote a second version of the code, but it doesn't work either.
for( i in 1:nrow(questions))
{
for( j in 1:nrow(voc))
{
if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
{
new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
questions[i,]=new
}
}
questions = cbind(questions,c(new))
}
First, it is important that you use the stringsAsFactors = FALSE option, either at the program level, or during your data import. This is because R defaults to making strings into factors unless you otherwise specify. Factors are useful in modeling, but you want to do analysis of the text itself, and so you should be sure that your text is not coerced to factors.
The way I approached this was to write a function that would "explode" each string into a vector, and then uses match to replace the words. The vector gets reassembled into a string again.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed, but %in% can be a real beast depending on the length of the string and other factors.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {
# Start by breaking the input string into a vector
# Note that we use [[1]] to get first list element of strsplit output
# Obviously this relies on breaking sentences by spacing
orig_words <- strsplit(x = input_string,split = " ")[[1]]
# If we find at least one of the words to replace in the original words, proceed
if(sum(orig_words %in% words_to_repl) > 0) {
# The right side selects the elements of orig_words that match words to be replaced
# The left side uses match to find the numeric index of those replacements within the words_to_repl vector
# This numeric vector is used to select the values from repl_words
# These then replace the values in orig_words
orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]
# We rebuild the sentence again, and return a list with original and new version
new_sent <- paste(orig_words,collapse = " ")
return(list(original = input_string,new = new_sent))
} else {
# Otherwise we return the original version since no changes are needed
return(list(original = input_string,new = input_string))
}
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
args = lapply(X = questions$V3,
FUN = uFunc_FindAndReplace,
words_to_repl = voc$synonyms,
repl_words = voc$word))
original new
1 What is the day today? What is a weather today?
2 Tom has brown eyes Tom has blue eyes
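For reference, the stringr route hinted at above can make the whole replacement fully vectorized: str_replace_all() accepts a named vector of pattern = replacement pairs, so the voc table becomes one replacement map (a sketch, assuming whole-word matches are wanted):
library(stringr)
# \\b word boundaries keep "the" from matching inside e.g. "weather"
repl_map <- setNames(voc$word, paste0("\\b", voc$synonyms, "\\b"))
questions$V5 <- str_replace_all(questions$V3, repl_map)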