String extract from multiple columns using a list in R - r

I am trying to extract information from more than two columns (two are shown in the example below) using a list of strings, and to create a new column containing whichever string from the list is found in one of those columns, with a specified order of which column to search first. The example below shows the data and the desired output.
A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)
list <- c("NYU","FIT","UCLA","CA","UT","USC")
A B
1 This contains NYU NYU
2 This has NYU UT
3 This has XT USC
4 This has FIT FIT
5 Something something UNH UNA
6 I got into UCLA UCLA
7 Hello XT CA
I want the code to search for the strings in the list, looking in column A first; if none is found there, it should look in column B; and if none is found in either column, it should return NA. Given the list above, the desired output is:
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA <NA>
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA

You can turn your list into a regular expression and then apply R's regexpr function:
expr <- paste0(list, collapse = "|")
# expr = "NYU|FIT|UCLA|CA|UT|USC", i.e. match NYU or FIT or ...
data[, "C"] <- NA_character_  # rows with no match stay NA
# Process the columns in reverse order so that a match in A overwrites one in B
cols <- rev(names(data)[-(which(names(data) == "C"))])
for (col in cols) {
  index <- regexpr(expr, data[, col])
  data[, "C"] <- ifelse(index != -1, substr(data[, col], index, index + attr(index, "match.length") - 1), data[, "C"])
}
Hope that helps.
Gottavianoni
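A loop-free variant of the regular-expression approach above could look like the following. This is only a sketch: the names keys, pattern and first_match are made up for illustration, and the \\b word boundaries are an added assumption that only whole words should match.

```r
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Alternation of all keys, wrapped in word boundaries (assumption: whole words only)
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# Hypothetical helper: first matching key in each string, NA when nothing matches
first_match <- function(x, pattern) {
  hit <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit != -1] <- regmatches(x, hit)  # regmatches returns matched elements only
  out
}

# Look in A first and fall back to B where A has no match
fromA <- first_match(A, pattern)
C <- ifelse(is.na(fromA), first_match(B, pattern), fromA)
data.frame(A, B, C)
```

With more than two columns, the same fallback step can be repeated for each further column in priority order.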

Another approach could be
#common between column A & vector l
C_tempA <- sapply(df$A, function(x) intersect(strsplit(as.character(x), split = " ")[[1]], l))
#common between column B & vector l
C_tempB <- sapply(df$B, function(x) intersect(as.character(x), l))
#column C calculation
df$C <- ifelse(C_tempA=="character(0)", C_tempB, C_tempA)
df$C[df$C=="character(0)"] <- NA
#final dataframe
df
Output is:
A B C
1 This contains NYU NYU NYU
2 This has NYU UT NYU
3 This has XT USC USC
4 This has FIT FIT FIT
5 Something something UNH UNA NA
6 I got into UCLA UCLA UCLA
7 Hello XT CA CA
Sample data:
df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT",
"I got into UCLA", "Something something UNH", "This contains NYU",
"This has FIT", "This has NYU", "This has XT"), class = "factor"),
B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA",
"FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A",
"B"), row.names = c(NA, -7L), class = "data.frame")
l <- c("NYU","FIT","UCLA","CA","UT","USC")

Load the tokenizers package with library(tokenizers).
Merge the two columns into a new column that contains A and B combined:
data$newC <- paste(data$A, data$B, sep = " " )
Then run the loop below, which collects the extracted values in a vector that you can then cbind to the existing data frame.
newcolumn <- 'X'
for (p in data$newC) {
  if (!is.na(p)) {
    x <- which(is.element(unlist(tokenize_words(list, lowercase = TRUE)), unlist(tokenize_words(p, lowercase = TRUE, stopwords = NULL, simplify = FALSE))))
    newcolumn <- append(newcolumn, ifelse(x[1] != 0, list[x[1]], "NA"))
  }
}
newcolumn <- newcolumn[-1]
newcolumn
data <- cbind(data, newcolumn)
Hope it helps.
Running the code above gives me the output you expected.

Related

Remove a part of a row name in a list of dataframes

I have two lists of dataframes. One list of dataframes is structured as follows:
data1
Label Pred n
1 Mito-0001_Series007_blue.tif Pear 10
2 Mito-0001_Series007_blue.tif Orange 223
3 Mito-0001_Series007_blue.tif Apple 890
4 Mito-0001_Series007_blue.tif Peach 34
And repeats with different numbers e.g.
Label Pred n
1 Mito-0002_Series007_blue.tif Pear 90
2 Mito-0002_Series007_blue.tif Orange 127
3 Mito-0002_Series007_blue.tif Apple 76
4 Mito-0002_Series007_blue.tif Peach 344
The second list of dataframes is structured like this:
data2
Slice Area
Mask of Mask-0001Series007_blue-1.tif. 789.21
etc
Question
I want to make the row names match up by:
a) Removing the "Mito-" from data1
b) Removing the "Mask of Mask-" from data2
c) Removing the "-1" towards the end of data2
Keeping in mind that this is a list of dataframes.
So far:
I have used the information from the post named "How can I remove certain part of row names in data frame"
They suggest using
data2$Slice <- sub("Mask of Mask-", "", data2$Slice)
This does not work for a list of dataframes. It returns an empty character vector:
character(0)
Thanks in advance, I have been amazed at how great people are at answering questions on this site :)
First, we could define a function f that applies gsub with a regex that fits for all.
f <- \(x) gsub('.*(\\d{4}_?Series\\d{3}_blue).*(\\.tif)?\\.?', '\\1\\2', x)
Explanation:
.* any character, repeated zero or more times
\\d{4} four digits
_? an underscore, if present
Series the literal text "Series"
(...) capture group (groups are numbered internally)
\\. a literal period (it needs to be escaped, otherwise it means "any character")
\\1 backreference to capture group 1
Test the regex
## test it
(x <- c(names(data1), data1[[1]]$Label, data2$Slice))
# [1] "Mito-0001_Series007_blue" "Mito-0002_Series007_blue"
# [3] "Mito-0001_Series007_blue.tif" "Mito-0001_Series007_blue.tif"
# [5] "Mito-0001_Series007_blue.tif" "Mito-0001_Series007_blue.tif"
# [7] "Mask of Mask-0001Series007_blue-1.tif."
f(x)
# [1] "0001_Series007_blue" "0002_Series007_blue" "0001_Series007_blue" "0001_Series007_blue"
# [5] "0001_Series007_blue" "0001_Series007_blue" "0001Series007_blue"
Seems to work, so we can apply it.
names(data1) <- f(names(data1))
data1 <- lapply(data1, \(x) {x$Label <- f(x$Label); x})
data2$Slice <- f(data2$Slice)
data1
# $`0001_Series007_blue`
# Label Pred n
# 1 0001_Series007_blue Pear 10
# 2 0001_Series007_blue Orange 223
# 3 0001_Series007_blue Apple 890
# 4 0001_Series007_blue Peach 34
#
# $`0002_Series007_blue`
# Label Pred n
# 1 0002_Series007_blue Pear 90
# 2 0002_Series007_blue Orange 127
# 3 0002_Series007_blue Apple 76
# 4 0002_Series007_blue Peach 344
data2
# Slice Area
# 1 0001Series007_blue 789.21
Data:
data1 <- list(`Mito-0001_Series007_blue` = structure(list(Label = c("Mito-0001_Series007_blue.tif",
"Mito-0001_Series007_blue.tif", "Mito-0001_Series007_blue.tif",
"Mito-0001_Series007_blue.tif"), Pred = c("Pear", "Orange", "Apple",
"Peach"), n = c(10L, 223L, 890L, 34L)), class = "data.frame", row.names = c("1",
"2", "3", "4")), `Mito-0002_Series007_blue` = structure(list(
Label = c("Mito-0002_Series007_blue.tif", "Mito-0002_Series007_blue.tif",
"Mito-0002_Series007_blue.tif", "Mito-0002_Series007_blue.tif"
), Pred = c("Pear", "Orange", "Apple", "Peach"), n = c(90L,
127L, 76L, 344L)), class = "data.frame", row.names = c("1",
"2", "3", "4")))
data2 <- structure(list(Slice = "Mask of Mask-0001Series007_blue-1.tif.",
Area = 789.21), class = "data.frame", row.names = c(NA, -1L
))
Using the given info
The answer by @jay.sf was really helpful, but it only worked for data1 and not for data2. To make sure it was also applied to data2, I added some extra lines of code:
#Old code
f <-function(x) gsub('.*(\\d{4}_?Series\\d{3}_blue).*(\\.tif)?\\.?', '\\1\\2', x)
#I added the [[1]] after data2 as well
(x <- c(names(data1), data1[[1]]$Label, data2[[1]]$Slice))
f(x)
names(data1) <- f(names(data1))
data1 <- lapply(data1, function(x) {x$Label <- f(x$Label); x})
# This line of code was causing problems, so I removed it
# data2$Slice <- f(data2$Slice)
#And added the following to apply it to data 2
names(data2) <- f(names(data2))
data2 <- lapply(data2, function(x) {x$Slice <- f(x$Slice); x})
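For comparison, steps b) and c) from the question can also be written as two explicit sub() calls applied over the list with lapply. This is a sketch under assumptions: the second data frame and the helper name clean_slice are invented for illustration, and unlike the gsub approach above it keeps the ".tif." suffix.

```r
# Toy list of data frames; the second element is a made-up example
data2 <- list(
  data.frame(Slice = "Mask of Mask-0001Series007_blue-1.tif.", Area = 789.21),
  data.frame(Slice = "Mask of Mask-0002Series007_blue-1.tif.", Area = 12.5)
)

# Hypothetical helper: strip the prefix, then the "-1" just before the extension
clean_slice <- function(x) {
  x <- sub("^Mask of Mask-", "", x)
  sub("-1(\\.tif\\.?)$", "\\1", x)
}

# Apply the helper to the Slice column of every data frame in the list
data2 <- lapply(data2, function(d) {
  d$Slice <- clean_slice(d$Slice)
  d
})
data2[[1]]$Slice
```

Calling data2$Slice on the list itself returns NULL, because the Slice column lives inside each list element, which is why lapply is needed here.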

R not updating field based on criteria

I have a simple DF:
Dev_Func
agn
agn
ttt
ttt
agn
All I am trying to do is replace the value with "PE" wherever the field contains "agn".
this is the code that i have written:
test = subset(Final.ds,Device_Function == "AGN" | Device_Function ==
"TTT", select = c(Device_Function))
colnames(test) = c("Device_Function")
as.character(test)
test = within(test, Device_Function[Device_Function == 'AGN'] = 'PE')
But I just keep getting this error:
Warning message:
In `[<-.factor`(`*tmp*`, Device_Function == "AGN", value = "PE") :
invalid factor level, NA generated
and all it does is replace all the "AGN" values with NA.
Help please!
You could do this with gsub:
df$Dev_Func <- gsub("agn", "PE", df$Dev_Func)
df
# Dev_Func
#1 PE
#2 PE
#3 ttt
#4 ttt
#5 PE
An alternative solution that keeps Dev_Func as a factor (as mentioned by akrun):
df$Dev_Func <- as.factor(gsub("agn", "PE", df$Dev_Func))
class(df$Dev_Func)
# [1] "factor"
As the column is a factor, we can assign the levels that are 'agn' to 'PE'
levels(DF$Dev_Func)[levels(DF$Dev_Func)=='agn'] <- 'PE'
and keep it as a factor column
levels(DF$Dev_Func)
#[1] "PE" "ttt"
DF
# Dev_Func
#1 PE
#2 PE
#3 ttt
#4 ttt
#5 PE
NOTE: This assumes that 'agn' is a complete value and not a substring.
In the OP's code, i.e. the call to within, there are several issues:
1) the assignment should use <- instead of =
2) it cannot do a logical subset assignment
3) the column is a factor and has no level 'PE', which produces the warning message "invalid factor level, NA generated"
4) in the example data, 'agn' is lower case and not 'AGN' (possibly a typo), and R is case-sensitive
Suppose we add 'PE' as a level:
DF$Dev_Func <- factor(DF$Dev_Func, levels = c(levels(DF$Dev_Func), 'PE'))
then the assignment below would work
DF$Dev_Func[DF$Dev_Func=='agn'] <- 'PE'
This is still less clean than changing the value through the levels assignment shown earlier.
data
DF <- structure(list(Dev_Func = structure(c(1L, 1L, 2L, 2L, 1L), .Label = c("agn",
"ttt"), class = "factor")), .Names = "Dev_Func", row.names = c(NA,
-5L), class = "data.frame")
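The difference between the two kinds of assignment discussed above can be reproduced on a small standalone factor. This is a minimal sketch; the object names f, bad and good are only for illustration.

```r
f <- factor(c("agn", "agn", "ttt", "ttt", "agn"))

# Assigning a value that is not an existing level produces NA
# (R warns "invalid factor level, NA generated"; suppressed here)
bad <- f
suppressWarnings(bad[bad == "agn"] <- "PE")

# Renaming the level itself changes every matching element at once
good <- f
levels(good)[levels(good) == "agn"] <- "PE"

bad
good
```

Renaming the level avoids both the warning and the need to add 'PE' as a level beforehand.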

Data manipulations in R

As part of a project, I am using R to analyze some data. I am stuck retrieving a few values from an existing dataset that I imported from a CSV file.
For my analysis, I want to create another column that holds the difference between the current value of x and its previous value. For the first row of every unique i, the new column should simply hold the current value of x. I am new to R and have tried various approaches for some time without success. I would appreciate suggestions on how to achieve this.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
You can use the diff() function. Note that diff() returns a vector that is one element shorter than its input, so to add the result as a new column of your existing data frame you need to pad it. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That puts an NA value in the first entry of your new column, and the remaining values will be the differences between consecutive values in your "x" column.
UPDATE:
You can write a simple loop that subsets the data for every unique value of "i" and then calculates the differences between the x values within each subset:
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in 1:length(values)){
  # rows belonging to the current value of i
  data1 = MyData[MyData$i == values[i],]
  data1$newX = c(NA,diff(data1$x))
  newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original dataframe with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
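The per-group difference can also be computed in base R without an explicit loop by using ave(). The sketch below uses a small made-up data frame and, as the question asks, keeps the first x value of each group instead of NA.

```r
# Toy data (made up for illustration): two groups identified by i
MyData <- data.frame(
  i = c(1, 1, 1, 2, 2),
  x = c(10, 13, 19, 5, 6)
)

# Within each group of i: first value as-is, then successive differences
MyData$x_diff <- ave(MyData$x, MyData$i, FUN = function(v) c(v[1], diff(v)))
MyData
```

ave() returns a vector of the same length and order as its input, so the result can be assigned straight back as a column.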

How to create R output likes confusion matrix table

I have two directories. The first is named "model" and the second "test". Both directories contain the same list of file names, but the files have different content. Each directory contains the same number of files, 37.
Here is an example of the content of one file from each directory.
First file from model directory
File name: Model_A5B45
data
1 papaya | durian | orange | grapes
2 orange
3 grapes
4 banana | durian
5 tomato
6 apple | tomato
7 apple
8 mangostine
9 strawberry
10 strawberry | mango
dput output :
structure(list(data = structure(c(7L, 6L, 4L, 3L, 10L, 2L, 1L,
5L, 8L, 9L), .Label = c("apple", "apple | tomato", "banana | durian",
"grapes", "mangostine ", "orange", "papaya | durian | orange | grapes",
"strawberry", "strawberry | mango", "tomato"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA,
-10L))
Second file in test directory
File name: Test_A5B45
data
1 apple
2 orange | apple | mango
3 apple
4 banana
5 grapes
6 papaya
7 durian
8 tomato | orange | papaya | durian
dput output:
structure(list(data = structure(c(1L, 5L, 1L, 2L, 4L, 6L, 3L,
7L), .Label = c("apple", "banana", "durian", "grapes", "orange | apple | mango",
"papaya", "tomato | orange | papaya | durian"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA,
-8L))
I want to calculate the percentage of intersect and except data for the files in the test directory relative to the files in the model directory.
This is an example of my code for just two files (Model_A5B45 and Test_A5B45).
library(dplyr)
data_test <- read.csv("Test_A5B45")
data_model <- read.csv("Model_A5B45")
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
sprintf("%s/%s",intersect_percentage,except_percentage)
Output : "37.5/62.5"
My question: I want to apply my code to all of the files (looping over both directories) so that the output looks like a confusion matrix.
Example of my expected output:
## y
## Model_A5B45 Model_A6B46 Model_A7B47
## Test_A5B45 37.5/62.5 value value
## Test_A6B46 value value value
## Test_A7B47 value value value
My answer:
I've written code that can process this, but I still do not know how to make the output look like a confusion matrix.
This is my code (I don't know whether it is efficient; I use a for loop):
f_performance_testing <- function(data_model_path, data_test_path){
library(dplyr)
data_model <- read.csv(data_model_path, header=TRUE)
data_test <- read.csv(data_test_path, header=TRUE)
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
return(list("intersect"=intersect_percentage,"except"=except_percentage))
}
for (model in model_list){
for (test in test_list){
result <- f_performance_testing(model,test)
intersect_percentage <- round(result$intersect,3)
except_percentage <- round(result$except,3)
final_output <- sprintf("intersect : %s | except : %s",intersect_percentage,except_percentage)
cat(print(paste(substring(model,57),substring(test,56), final_output,sep=",")),file="outfile.txt",append=TRUE,"\n")
print("Writing to file.......")
}
}
The output is:
Model_A5B45,Test_A5B45, 37.5/62.5
Model_A5B45,Test_A6B46, value
Model_A5B45,Test_A7B47, value
Model_A6B46,......
Model_A7B47,.....
...............
......
....
How can I transform this output into something that looks like a confusion matrix table?
This won't answer your question directly, but hopefully gives you enough information to arrive at your own solution.
I would recommend creating a function like the following:
myFun <- function(model, test, datasource) {
model <- datasource[[model]]
test <- datasource[[test]]
paste(rev(mapply(function(x, y) (x/y)*100,
lapply(split(test, test %in% model), length),
length(test))),
collapse = "/")
}
This function is to be used with a two-column data.frame, where the columns represent all the combinations of "test" and "model" values (why work with a data.frame structure when a character vector would suffice?)
Here's an example of such a data.frame (other sample data is found at the end of the answer).
models <- c("model_1", "model_2", "model_3")
tests <- c("test_1", "test_2", "test_3")
A <- expand.grid(models, tests, stringsAsFactors = FALSE)
Next, create a named list of your models and tests. If you've read your data in using lapply, it is likely you might have names to work with anyway.
dataList <- mget(c(models, tests))
Now, calculate the relevant values. Here, we can use apply to cycle through each row and perform the relevant calculation.
A$value <- apply(A, 1, function(x) myFun(x[1], x[2], dataList))
Finally, you reshape the data from a "long" form to a "wide" form.
reshape(A, direction = "wide", idvar = "Var1", timevar = "Var2")
# Var1 value.test_1 value.test_2 value.test_3
# 1 model_1 75/25 100 75/25
# 2 model_2 50/50 50/50 62.5/37.5
# 3 model_3 62.5/37.5 50/50 87.5/12.5
Here's some sample data. Note that they are basic character vectors and not data.frames.
set.seed(1)
sets <- c("A", "A|B", "B", "C", "A|B|C", "A|C", "D", "A|D", "B|C", "B|D")
test_1 <- sample(sets, 8, TRUE)
model_1 <- sample(sets, 10, TRUE)
test_2 <- sample(sets, 8, TRUE)
model_2 <- sample(sets, 10, TRUE)
test_3 <- sample(sets, 8, TRUE)
model_3 <- sample(sets, 10, TRUE)
In a real world application, you would probably do something like:
testFiles <- list.files(path = "path/to/test/files", full.names = TRUE)
modelFiles <- list.files(path = "path/to/model/files", full.names = TRUE)
readData <- function(x) read.csv(x, stringsAsFactors = FALSE)$data
# name each element after its file so that myFun can look it up by name
dataList <- setNames(lapply(c(testFiles, modelFiles), readData),
                     basename(c(testFiles, modelFiles)))
But, this is pure speculation on my part based on what you've shared in your question as working code (for example, csv files with no file extension).
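As a further sketch, the pairwise percentages can be arranged as a matrix directly with nested sapply calls, skipping the reshape step. The inputs here are made-up character vectors, and the names tests, models and overlap are hypothetical.

```r
# Toy inputs (made up): named lists of character vectors
tests  <- list(Test_1 = c("a", "b", "c", "d"), Test_2 = c("a", "x"))
models <- list(Model_1 = c("a", "b"), Model_2 = c("x", "y", "z"))

# Percentage of test entries found in the model, and the remainder
overlap <- function(test, model) {
  hit <- mean(test %in% model) * 100
  paste(hit, 100 - hit, sep = "/")
}

# Inner sapply runs over tests (rows), outer sapply over models (columns)
res <- sapply(models, function(m) sapply(tests, overlap, model = m))
res
```

Because both lists are named, the resulting character matrix has the test names as row names and the model names as column names, which matches the layout asked for in the question.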

Collapse columns by grouping variable (in base)

I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per run of the grouping factor. So for as long as the group column says m, I want to paste that text together, and so on. I have provided a sample data set showing the data before and after. I am writing this for a package and have so far avoided relying on other packages except for wordcloud, and I would like to keep it that way.
I suspect rle may be useful with cumsum but haven't been able to figure this one out.
Thank you in advance.
What the data looks like
text group
1 Computer is fun. Not too fun. m
2 No its not, its dumb. m
3 How can we be certain? f
4 There is no way. m
5 I distrust you. m
6 What are you talking about? f
7 Shall we move on? Good then. f
8 Im hungry. Lets eat. You already? m
What I'd like the data to look like
text group
1 Computer is fun. Not too fun. No its not, its dumb. m
2 How can we be certain? f
3 There is no way. I distrust you. m
4 What are you talking about? Shall we move on? Good then. f
5 Im hungry. Lets eat. You already? m
The Data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"Im hungry. Lets eat. You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
EDIT: I found I can add a unique id column for each run of the group variable with:
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Yielding:
text group new
1 Computer is fun. Not too fun. m 1
2 No its not, its dumb. m 1
3 How can we be certain? f 2
4 There is no way. m 3
5 I distrust you. m 3
6 What are you talking about? f 4
7 Shall we move on? Good then. f 4
8 Im hungry. Lets eat. You already? m 5
This makes use of rle to create an id on which to group the sentences. It uses tapply along with paste to bring the output together.
## Your example data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on?  Good then.",
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
# Needed for later
k <- rle(as.numeric(dat$group))
# Create a grouping vector
id <- rep(seq_along(k$lengths), k$lengths)
# Combine the text in the desired manner
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame
answer <- data.frame(text = out, group = levels(dat$group)[k$values])
I got the answer and came back to post it, but Dason beat me to it, and with a more understandable solution than my own.
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Paste <- function(x) paste(x, collapse=" ")
aggregate(text~new, dat, Paste)
EDIT
How I'd do it with aggregate and what I learned from your response (though tapply is a better solution):
y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))
text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])
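Since the question mentions cumsum, here is a sketch of the same run-id idea built with cumsum instead of rle. The short text values are placeholders and the names grp, txt, run_id and collapsed are made up for illustration.

```r
# Toy stand-ins for dat$group and dat$text
grp <- c("m", "m", "f", "m", "m", "f", "f", "m")
txt <- c("a", "b", "c", "d", "e", "f", "g", "h")

# TRUE wherever the group differs from the previous row; cumsum numbers the runs
run_id <- cumsum(c(TRUE, grp[-1] != grp[-length(grp)]))

# Paste the text within each run and keep the group label of the run's first row
collapsed <- data.frame(
  text  = unname(vapply(split(txt, run_id), paste, character(1), collapse = " ")),
  group = grp[!duplicated(run_id)],
  stringsAsFactors = FALSE
)
collapsed
```

With a factor column such as dat$group, converting it with as.character() first keeps the comparison and the resulting labels straightforward.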
