how to find similar strings within a data

My data looks like this
df<- structure(list(A = structure(c(7L, 6L, 5L, 4L, 3L, 2L, 1L, 1L,
1L), .Label = c("", "P42356;Q8N8J0;A4QPH2", "P67809;Q9Y2T7",
"Q08554", "Q13835", "Q5T749", "Q9NZT1"), class = "factor"), B = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("P62861", "P62906",
"P62979;P0CG47;P0CG48", "P63241;Q6IS14", "Q02413", "Q07955",
"Q08554", "Q5T749", "Q9UQ80"), class = "factor"), C = structure(c(9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("", "P62807;O60814;P57053;Q99879;Q99877;Q93079;Q5QNW6;P58876",
"P63241;Q6IS14", "Q02413", "Q16658", "Q5T750", "Q6P1N9", "Q99497",
"Q9UQ80"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA,
-9L))
I want to count how many elements are in each column, including those that are separated by a ;. For example, in this case the first column has 9 elements, the second column has 12 and the third column has 16.
Then I want to check how many times an element is repeated in other columns, for example:
string    number of times    columns
Q5T749    2                  1,2
Then I want to remove the strings which are seen more than once from the df.

One way to approach this is to start by re-organizing the data into a form that is more convenient to work with. The tidyr and dplyr packages are useful for that sort of thing.
library(tidyr)
df$index <- 1:nrow(df)
df <- gather(df, key = 'variable', value = 'value', -index, na.rm = TRUE)
df <- separate(df, "value", into = paste("x", 1:(1 + max(nchar(gsub("[^;]", "", df$value)))), sep = ""), sep = ";", fill = "right")
df <- gather(df, "which", "value", -index, -variable)
Once you do that, counting each element is easy:
addmargins(t(table(df[, c("variable", "value")])), margin = 2)
Dropping duplicates is also easy. Note that duplicated() keeps the first occurrence of each value and drops only the later ones.
df <- df[!duplicated(df$value), ]
If you really want to put the data back into the original form you can (though I don't recommend it).
df <- spread(df, key = "variable", value = "value")
library(dplyr)
summarize(group_by(df, index),
A = paste(na.omit(A), collapse = ";"),
B = paste(na.omit(B), collapse = ";"),
C = paste(na.omit(C), collapse = ";"))

For the count of elements in each column use this
sapply(df,function(x) length(unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";"))))
For counting the repetition use this
words <- lapply(df,function(x) unlist(sapply(strsplit(as.character(x),"\\s+"),strsplit,split=";")))
dup_table <- table(unlist(words))
dup_table
Here is a crude way to remove the repeated strings
pat <- names(dup_table)[unname(dup_table)>1]
for(i in pat) {
  df <- as.data.frame.list(lapply(df,function(x) gsub(pattern = i,replacement = "",x)))
}
But there is one problem: it replaces every occurrence of a particular pattern, wherever it appears.
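A minimal sketch of an exact-match alternative to the gsub loop above, in base R. The small data frame is a hypothetical stand-in for the question's data (plain character columns), and the names tokens, summary_tbl, drop_repeated and cleaned are made up for illustration. It builds the "string, number of times, columns" table from the question and removes repeated strings token by token, so a pattern cannot match inside a longer string.

```r
# Hypothetical stand-in for the question's data
df <- data.frame(
  A = c("Q5T749", "P67809;Q9Y2T7", ""),
  B = c("Q5T749", "Q08554", "P63241;Q6IS14"),
  C = c("Q9UQ80", "P63241;Q6IS14", "Q08554"),
  stringsAsFactors = FALSE
)

# Split every cell on ";" and drop empty pieces: one token vector per column
tokens <- lapply(df, function(x) {
  parts <- unlist(strsplit(as.character(x), ";", fixed = TRUE))
  parts[nzchar(parts)]
})

# Number of elements per column
n_per_column <- lengths(tokens)

# Long table of (string, column index), then count each string
long <- data.frame(
  string = unlist(tokens, use.names = FALSE),
  column = rep(seq_along(tokens), lengths(tokens)),
  stringsAsFactors = FALSE
)
counts <- table(long$string)
repeated <- names(counts)[counts > 1]

# One row per repeated string: how often it occurs and in which columns
summary_tbl <- data.frame(
  string = repeated,
  times = as.integer(counts[repeated]),
  columns = vapply(repeated, function(s) {
    paste(sort(unique(long$column[long$string == s])), collapse = ",")
  }, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Remove repeated strings by exact token match, then re-join with ";"
drop_repeated <- function(x) {
  vapply(strsplit(as.character(x), ";", fixed = TRUE), function(p) {
    paste(p[!p %in% repeated], collapse = ";")
  }, character(1))
}
cleaned <- as.data.frame(lapply(df, drop_repeated), stringsAsFactors = FALSE)
```

Like the loop above, this removes every copy of a repeated string, including the first one; filter on duplicated() instead if one copy should be kept.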

Related

trying to summarize survey data for questions with 'select all that apply' using R

We have a survey with 'select all that apply' questions, so each result is a quoted string with the values separated by commas, e.g. "red, black,green".
There is another question about income, so I have a factor with the levels 'low, medium, high'.
I want to be able to answer questions such as: what percentage selected 'red', and then group that by income.
I can split the string with
df4 <- c("black,silver,green")
I can create a data frame with a timestamp and the split string with
t2 <- as.data.frame(c(df2[2],l2))
I am not able to understand how to do this for all rows at one time.
Here is a dput of the input:
structure(list(RespData = structure(1:2, .Label = c("1/20/2020",
"1/21/2020"), class = "factor"), CarColor = c("red,blue,green,yellow",
"black,silver,green")), row.names = c(NA, -2L), class = "data.frame")
and here is a dput of the desired output:
structure(list(RespData = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("1/20/2020", "1/21/2020"), class = "factor"),
Cars = structure(c(3L, 1L, 2L, 4L, 5L, 6L, 2L), .Label = c("blue",
"green", "red", "yellow", "black", "silver"), class = "factor")), row.names = c(NA,
-7L), class = "data.frame")
Example of Function:
MySplitFunc <- function(ListIn) {
  # build an empty data frame and set the column names
  x1.all <- ListIn[0,]
  names(x1.all) <- c("ResponseTime", "Descriptive")
  # for each row build the data and combine to growing list
  for(x in 1:nrow(ListIn)) {
    #print(x)
    r1 <- ListIn[x,1]
    c1 <- strsplit(ListIn[x,2],",")
    x1 <- as.data.frame(c(r1,c1))
    # set the names and combine to all
    names(x1) <- c("ResponseTime", "Descriptive")
    x1.all <- rbind(x1.all,x1)
  }
  # strip the whitespace
  x1.all <- data.frame(lapply(x1.all, trimws), stringsAsFactors = TRUE)
  return(x1.all)
}
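One possible way to handle all rows at once without a loop, sketched in base R. It assumes RespData and CarColor are plain character columns (the question's dput stores RespData as a factor), and the names survey, parts, long and pct_red are made up for illustration.

```r
# Input values copied from the question, kept as character columns
survey <- data.frame(
  RespData = c("1/20/2020", "1/21/2020"),
  CarColor = c("red,blue,green,yellow", "black,silver,green"),
  stringsAsFactors = FALSE
)

# Split every row's string once: a list with one character vector per row
parts <- strsplit(survey$CarColor, ",", fixed = TRUE)

# Repeat each response date once per selected color and flatten the colors
long <- data.frame(
  RespData = rep(survey$RespData, lengths(parts)),
  Cars = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Percentage of respondents who selected "red"
pct_red <- 100 * mean(vapply(parts, function(p) "red" %in% trimws(p), logical(1)))
```

With an income column in the same data frame, the same rep(..., lengths(parts)) call would carry it into the long table, so the percentage could then be computed per income group.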

Elegant way to write function

I have an input column (symbols) with more than 10000 rows. It contains operator symbols and text values such as ("",">","<","-","****","inv","MOD","seen"), shown as values in the code below. The column doesn't contain any numbers, only the values listed in the code.
What I would like to do is map those operator symbols ('<','>' etc.) to two different sets of codes, 1) Operator_codes and 2) Value_codes, and have these two sets of codes as separate columns.
I already have working code, but it is not very efficient: as you can see, I repeat the same operation twice, once for Operator_codes and once for Value_codes. I am sure there must be a more efficient way to write this. I am new to R and not very familiar with other approaches.
oper_val_concepts = function(DF){
  operators_source = str_extract(.$symbols)
  operators_source = as.data.frame(operators_source)
  colnames(operators_source) <- c("Symbol")
  operator_list = c("",">","<","-","****","inv","MOD","seen")
  operator_codes = c(123L,14L,16L,13L,0L,0L,0L,0L)
  value_codes = c(14L,12L,32L,123L,16L,41L,116L,186L)
  operator_code_map = map2(operator_list,operator_codes,function(x,y) c(x,y)) %>%
    data.frame()
  value_code_map = map2(operator_list,value_codes,function(x,y) c(x,y)) %>%
    data.frame()
  operator_code_map = t(operator_code_map)
  value_code_map = t(value_code_map)
  colnames(operator_code_map) <- c("Symbol","Code")
  colnames(value_code_map) <- c("Symbol","Code")
  rownames(operator_code_map) = NULL
  rownames(value_code_map) = NULL
  dfm <- merge(x=operators_source,y=operator_code_map,by="Symbol",all.x = TRUE)
  dfm1 <- merge(x=operators_source,y=value_code_map,by="Symbol",all.x = TRUE)
}
t1 = oper_val_concepts(test)
dput command output is
structure(list(Symbols = structure(c(2L, 3L, 1L, 4L, 2L, 3L,
5L, 4L, 6L), .Label = c("****", "<", ">", "inv", "mod", "seen"
), class = "factor")), .Names = "Symbols", row.names = c(NA,-9L), class =
"data.frame")
I am expecting an output to be two columns in a dataframe as shown below.

As I understand it, you want to create a data frame that acts as a key (see key below). Once you have it, you can join the data frame that contains the symbols with this key data frame.
df <- structure(list(Symbols = structure(c(2L, 3L, 1L, 4L, 2L, 3L,
5L, 4L, 6L), .Label = c("****", "<", ">", "inv", "mod", "seen"
), class = "factor")), .Names = "Symbols", row.names = c(NA, -9L), class = "data.frame")
library(dplyr)
key <- data.frame(Symbols = c("",">","<","-","****","inv","mod","seen"),
                  Operator_code_map = c(123L,14L,16L,13L,0L,0L,0L,0L),
                  value_code_map = c(14L,12L,32L,123L,16L,41L,116L,186L))
df %>% left_join(key, by = "Symbols")
output
  Symbols Operator_code_map value_code_map
1       <                16             32
2       >                14             12
3    ****                 0             16
4     inv                 0             41
5       <                16             32
6       >                14             12
7     mod                 0            116
8     inv                 0             41
9    seen                 0            186
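For comparison, the same lookup as the join above can be sketched in base R with match(), which returns the position of each symbol in the key. This is only an illustration: the symbols are typed in as a character vector rather than read from the factor column, and the names idx, result, Operator_code and Value_code are made up.

```r
# Symbols in the order they appear in the question's data
symbols <- c("<", ">", "****", "inv", "<", ">", "mod", "inv", "seen")

# Key: the i-th symbol maps to the i-th operator code and the i-th value code
key_symbols    <- c("", ">", "<", "-", "****", "inv", "mod", "seen")
operator_codes <- c(123L, 14L, 16L, 13L, 0L, 0L, 0L, 0L)
value_codes    <- c(14L, 12L, 32L, 123L, 16L, 41L, 116L, 186L)

# Position of every symbol in the key (NA if a symbol is missing from the key)
idx <- match(symbols, key_symbols)

# Look both codes up with the same index vector
result <- data.frame(
  Symbols = symbols,
  Operator_code = operator_codes[idx],
  Value_code = value_codes[idx],
  stringsAsFactors = FALSE
)
```

Computing idx once and reusing it for both code vectors avoids repeating the lookup, which is the duplication the question asks about.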

Replace % and comma in data frame

dat <- structure(list(V1 = structure(c(3L, 4L, 1L, 5L, 6L, 1L, 1L, 1L, 1L, 1L),
.Label = c("0,0%", "0,5%", "0,6%", "1,0%", "1,2%", "2,0%", "2,1%", "2,4%",
"3,0%", "3,3%", "4,0%", "5,0%", "7,0%"), class = "factor"),
V2 = structure(c(6L, 7L, 5L, 7L, 7L, 7L, 1L, 1L, 1L, 1L),
.Label = c("0,0%", "12,0%", "2,0%", "2,8%", "3,0%", "3,6%", "4,0%", "4,3%",
"5,0%", "6,0%", "6,4%", "7,0%", "7,9%", "8,0%"), class = "factor"),
V3 = structure(c(3L, 6L, 2L, 16L, 2L, 14L, 1L, 1L, 1L, 1L),
.Label = c("0,0%", "10,0%", "11,7%", "11,9%", "12,0%", "13,0%", "14,0%", "15,0%",
"18,0%", "18,9%", "25,0%", "30,0%", "7,0%", "8,0%", "9,0%", "9,1%"), class = "factor"),
V4 = structure(c(8L, 9L, 4L, 5L, 7L, 3L, 2L, 2L, 2L, 2L),
.Label = c("0,5%", "1,0%","12,0%", "14,0%", "14,3%", "15,0%", "16,0%", "16,3%", "18,0%",
"19,4%", "20,0%", "22,0%", "22,4%", "23,0%", "25,0%", "28,0%",
"28,5%", "30,0%", "35,0%", "50,0%"), class = "factor")),
row.names = c(NA, 10L), class = "data.frame")
I want to do 2 things:
1) Replace the , with a decimal .
2) Remove the % symbol
sapply(dat, function(x) as.numeric(gsub("%", "", x)))
sapply(dat, function(x) as.numeric(gsub(",", ".", x)))
Both of them are giving me NAs. What is it I am doing wrong here?

Both replacements have to happen before the conversion. After removing only the %, the values are still character strings containing a , so as.numeric cannot parse them. Call as.numeric only after doing both operations
dat[] <- lapply(dat, function(x) as.numeric(gsub("%", "", gsub(",", ".", x))))
If we are using tidyverse
library(tidyverse)
dat %>%
  mutate_all(~ parse_number(str_replace(., ",", ".")))
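To see why each attempt from the question returns NA on its own, here is a small base R sketch on a hypothetical three-element vector x; the variable names are made up for illustration.

```r
x <- c("0,6%", "1,0%", "0,0%")

# Only the % removed: "0,6" still has a comma, so the conversion fails
only_percent <- suppressWarnings(as.numeric(gsub("%", "", x, fixed = TRUE)))

# Only the comma replaced: "0.6%" still has a %, so the conversion fails
only_comma <- suppressWarnings(as.numeric(gsub(",", ".", x, fixed = TRUE)))

# Both replacements first, then the conversion
both <- as.numeric(gsub("%", "", gsub(",", ".", x, fixed = TRUE), fixed = TRUE))
```

suppressWarnings() is used here only to hide the "NAs introduced by coercion" warning from the two failing conversions.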

Thought I would add a tidyverse approach:
library(tidyverse)
dat <- dat %>%
map_df(str_replace, pattern = ",", replacement = ".") %>%
map_df(str_remove, pattern = "%") %>%
map_df(as.numeric)
Definitely not the fastest approach:
mbm <- microbenchmark::microbenchmark(
  lap = {lapply(dat, function(x)
    as.numeric(gsub("%", "", gsub(",", ".", x))))},
  tidy = {dat %>%
    map_df(str_replace, pattern = ",", replacement = ".") %>%
    map_df(str_remove, pattern = "%") %>%
    map_df(as.numeric)})
This shows that using lapply instead of my tidyverse approach is approximately 10x faster but maybe harder for some to understand.

Extracting data frames from a list based on column names in r

I am looking at extracting df's from within a list of multiple df's into separate data frames based on a condition (if the column names of a df within the list contains the name I am looking for).
For illustration purposes I have created an example which resembles the situation I am in.
I have list with multiple data frames and the dput of that list is given below:
structure(list(V1 = structure(list(lvef = c(0.965686195194885,
0.0806777632648268, -0.531729196500083, -0.511913109608259, -0.413670941196816,
-0.0501899795864357, -0.337583918771946, 1.16086745780346, -0.478358865835724,
-1.95009138673888), hbc = c(-0.389950511350405, -0.904388183933348,
0.811821977223064, -0.868381700124344, -0.637307418402866, -1.04703715824204,
-0.394340445217658, -0.194653869597247, 0.00822402232044511,
-0.145032587618231), id = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = "NA", class = "factor")), .Names = c("lvef",
"hbc", "id"), row.names = c(NA, -10L), class = "data.frame"),
V2 = structure(list(ersta = c(-0.254360310986174, 0.3859806928747,
-0.135741797055127, 1.03929145413636, -0.484219739337178,
0.255476285148917, 1.0479422937128, 0.146613094683722, -0.914377222535014,
1.75052418161618, -0.275059500684816, 2.34861397588234, 0.00183723766664941,
0.97612891408903, 0.278868537504227, 0.456979477254684, 1.46323739326792,
0.664511602217853, 0.870420202897545, 1.38228375734407),
pgrsta = c(-1.49129812271989, 0.820330747101906, -0.0469488167129374,
0.471549380446308, -1.71312120132398, 0.0578140025416816,
1.67016363826724, 0.226180835709491, -2.00294530465909,
-0.0464857361954717, 0.306942902768782, -0.785096914460742,
0.283822632249141, -0.260774679911329, -1.2865970194309,
0.307972619170242, 0.223715024597144, -1.01642533651475,
-0.12229427204957, 0.223326519096996), id = structure(c(7L,
7L, 7L, 7L, 4L, 1L, 3L, 5L, 6L, 2L, 7L, 7L, 7L, 7L, 4L,
1L, 3L, 5L, 6L, 2L), class = "factor", .Label = c("-0.10863576856322",
"-0.317324527228699", "-0.422764348315332", "0.285132258310185",
"1.23305496219042", "1.39326602279981", "NA"))), .Names = c("ersta",
"pgrsta", "id"), row.names = c(NA, -20L), class = "data.frame"),
V3 = structure(list(hormrec = 1:15, event = structure(c(10L,
10L, 10L, 10L, 10L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"NA"), class = "factor")), .Names = c("hormrec", "event"), row.names = c(NA,
-15L), class = "data.frame"), V4 = structure(list(asat = c(-0.321423784000631,
0.181345361079582, 0.389158724418319, -1.15251833725336,
-0.351981383678293, -0.506888212379408, 0.870705917350059,
-0.626883041051641, -0.321843006223371, -0.674564527029912,
-0.609383943267379, -0.181661119817784, -1.63676077872658
), lab = structure(c(1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 2L), .Label = c("btest", "NA", "rtest"), class = "factor")), .Names = c("asat",
"lab"), row.names = c(NA, -13L), class = "data.frame")), .Names = c("V1",
"V2", "V3", "V4"))
I am trying to extract data frames from the list based on the condition that if a data frame within the list contains the column name/s required then that data frame from the list should go into a separate data frame. So far, I have been able to extract the data frames into a list using the following code:
# function to extract required df's
trial <- function(x)
{
  reqname <- c("hbc","ersta") # column names to check for
  data <- x
  lapply(seq(data), function(i){ # loop through all the data frames in the list
    y <- data.frame(data[[i]]) # extract df in y
    names <- names(y) # extract names of df
    for(a in 1:length(reqname)) # loop through the length of reqname
    {
      if(reqname[a]%in%names) # check if column name/s present in current df
      {
        z <- y # extract df into another df
        return(z) # return df
      }
    }
  }
  )
}
The above function returns a list of matching df's along with nulls where there was not a match. I am looking for a modification so that the selected data frame comes out separately. If there are two df's matching the requirement then the output should be two separate data frames.
I will appreciate all and any help in finding a solution.

You can use vapply() (or lapply()) with a custom function to identify the wanted data frames. For instance, if x is your list,
trial <- function(x)
{
  reqnames <- c("hbc","ersta")
  # TRUE for each data frame that has at least one of the required column names
  k <- vapply(x, function(d) any(names(d) %in% reqnames), logical(1))
  x[k]
}
This outputs a list with only the data frames containing at least one of the names in reqnames.

We can remove the NULL elements with Filter
lst1 <- Filter(length, trial(lst))
If we need multiple data.frame objects in the global environment, use list2env after renaming the list elements with the object names
names(lst1) <- paste0('dat', seq_along(lst1))
list2env(lst1, envir = .GlobalEnv)
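A small self-contained sketch of the whole selection, assuming a toy list lst with the same column names as the question but made-up values. It copies the matches into a fresh environment (target_env, a name chosen here) rather than .GlobalEnv, so nothing in the workspace is overwritten; pass envir = .GlobalEnv instead if separate top-level objects are really wanted.

```r
# Toy list with the question's column names and hypothetical values
lst <- list(
  V1 = data.frame(lvef = 1:2, hbc = 3:4),
  V2 = data.frame(ersta = 1:3, pgrsta = 4:6),
  V3 = data.frame(hormrec = 1:2, event = c("a", "b")),
  V4 = data.frame(asat = 1:2, lab = c("x", "y"))
)
reqnames <- c("hbc", "ersta")

# TRUE for each data frame that contains at least one required column
has_required <- vapply(lst, function(d) any(reqnames %in% names(d)), logical(1))
selected <- lst[has_required]

# Rename the matches and copy them into an environment as separate objects
names(selected) <- paste0("dat", seq_along(selected))
target_env <- new.env()
list2env(selected, envir = target_env)
```

After this, target_env$dat1 and target_env$dat2 are the two matching data frames.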

Reproduce a dataset in a different format in R

I have a dataset Data like below:
dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L,
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567",
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL",
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"),
IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN",
"BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN",
"Values", "IND"), class = "data.frame", row.names = c(NA, -9L
))
I want the above dataset to be converted into the format below, as a data frame (out_data).
Presently my Data has 3 columns, and I need to convert these into 16 columns in the format below.
I need to reshape my input into exactly the format given in the screenshot, as a data frame.
I cannot change the structure below:
colnames(out_data) <- c("FN","H_BLK","S_N/R_N","B_N","FL_N","U_N","PC","XC","YC","BS","BRF","LCT_DEC","BRN","BO","PN","S_TY_CD")
The multiple-value columns in the input are always in the format below:
|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567 -> |PC|H_BLK|S_N/R_N|XC|YC
|8121|B01|SOMERSET STN -> |BS|BRF|LCT_DEC
|SN30|SMRT -> |BRN|BO
If
IND = LOC, then |PC|H_BLK|S_N/R_N|XC|YC should be updated, with S_TY_CD=LOC
IND = BN, then the B_N column should be updated, with S_TY_CD=BN
IND = RN, then the S_N/R_N column should be updated, with S_TY_CD=RN
IND = BS, then |BS|BRF|LCT_DEC should be updated, with S_TY_CD=BS
IND = BR, then |BRN|BO should be updated, with S_TY_CD=BR
IND = PN, then PN should be updated, with S_TY_CD=PN
Is there an efficient way of doing this?

Here's one method of transformation. First I define some helper functions for the various sub problems.
#define out cols
outcols<-c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC",
    "XC", "YC", "BS", "BRF", "LCT_DEC", "BRN","BO","PN","S_TY_CD")
#identify parts for each compound value
namevals <- function(ind, vals) {
    names<-if (ind=="LOC") {
        c("PC","H_BLK","S_N/R_N","XC","YC")
    } else if (ind=="BN") {
        c("B_N")
    } else if (ind=="RN") {
        c("S_N/R_N")
    } else if (ind=="BS") {
        c("BS","BRF","LCT_DEC")
    } else if (ind=="BR") {
        c("BRN","BO")
    } else if (ind=="PN") {
        c("PN")
    }
    stopifnot(length(names)==length(vals))
    stopifnot(all(names %in% outcols))
    names(vals)<-names
    vals
}
#add missing values for row
fillrow <- function(nvals) {
    r<-rep(NA, length(outcols))
    r[match(names(nvals), outcols)]<-nvals
    r
}
Now I apply these to each row of the data with mapply to return a character vector. Here we make sure to split the "values" column on the pipe and remove the leading pipe.
#combine rows into character matrix
dt<-mapply(function(fn,vals,ind){
        x<-c(FN=fn,namevals(ind, vals), "S_TY_CD"=ind)
        fillrow(x)
    },
    as.character(Data$FN),
    strsplit(gsub("^\\|","",as.character(Data$Values)),"|", fixed=TRUE),
    as.character(Data$IND)
)
Finally we tidy the data up so it can be written out to a file with write.table. Note that all missing values are true R NA values. In the write.table, you can set na = "" if you'd rather they print out as blank values than the default "NA" value.
#turn matrix into data.frame with proper names
dd<-data.frame(unname(t(dt)), stringsAsFactors=F)
names(dd)<-outcols
dd
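As a possible variation on the if/else chain in namevals, the mapping from IND to field names could be kept in a named list. This is only a sketch of that design choice: field_map and name_values are hypothetical names, and the field lists are copied from the helper above.

```r
# IND code -> names of the fields its pipe-separated value contains
field_map <- list(
  LOC = c("PC", "H_BLK", "S_N/R_N", "XC", "YC"),
  BN  = "B_N",
  RN  = "S_N/R_N",
  BS  = c("BS", "BRF", "LCT_DEC"),
  BR  = c("BRN", "BO"),
  PN  = "PN"
)

# Attach field names to the split values; stop on an unknown IND
# or on a value with the wrong number of parts
name_values <- function(ind, vals) {
  fields <- field_map[[ind]]
  stopifnot(!is.null(fields), length(fields) == length(vals))
  setNames(vals, fields)
}

example <- name_values("BS", c("8121", "B01", "SOMERSET STN"))
```

Adding a new IND type then means adding one entry to field_map instead of another else if branch.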
