How can I store logs that are replaced compared with original data in R?

I already asked about this topic and got a solution.
However, I additionally want to record, in a new column, which values were replaced. I tried the following,
df$check <- str_match_all(df, "\\d{11}") %>% unlist
but it does not work. Ultimately, I want to get the data set below.
              original               edited       check
1        010-1234-5678        010-1234-5678
2   John 010-8888-8888   John 010-8888-8888
3 Phone: 010-1111-2222 Phone: 010-1111-2222
4  Peter 018.1111.3333  Peter 018.1111.3333
5 Year(2007,2019,2020) Year(2007,2019,2020)
6    Alice 01077776666  Alice 010-9999-9999 01077776666
Here is my code.
x = c("010-1234-5678",
      "John 010-8888-8888",
      "Phone: 010-1111-2222",
      "Peter 018.1111.3333",
      "Year(2007,2019,2020)",
      "Alice 01077776666")
df = data.frame(
  original = x
)
df$edited <- gsub("\\d{11}", "010-9999-9999", df$original)
df$check <- c("","","","","","01077776666") # I want to know the way here.
Thank you.

Using ifelse with `==`, you can test whether the two columns match. Where they do not, gsub drops the leading non-digit characters of "original" and keeps everything from the first digit onward.
transform(df, check=ifelse(!do.call(`==`, df[c("original", "edited")]),
                           gsub('(\\D*)(\\d.*)', '\\2', original),
                           NA))
#               original               edited       check
# 1        010-1234-5678        010-1234-5678        <NA>
# 2   John 010-8888-8888   John 010-8888-8888        <NA>
# 3 Phone: 010-1111-2222 Phone: 010-1111-2222        <NA>
# 4  Peter 018.1111.3333  Peter 018.1111.3333        <NA>
# 5 Year(2007,2019,2020) Year(2007,2019,2020)        <NA>
# 6    Alice 01077776666  Alice 010-9999-9999 01077776666
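As an alternative sketch (not the method above), the replaced text can be pulled out with the same pattern that drives the replacement, using base R's regexpr() and regmatches(). This assumes at most one 11-digit run per row; the variable names pattern and m are only illustrative.

```r
x <- c("010-1234-5678",
       "John 010-8888-8888",
       "Phone: 010-1111-2222",
       "Peter 018.1111.3333",
       "Year(2007,2019,2020)",
       "Alice 01077776666")
df <- data.frame(original = x, stringsAsFactors = FALSE)

# Same pattern for both the replacement and the log of what was replaced
pattern <- "\\d{11}"
df$edited <- gsub(pattern, "010-9999-9999", df$original)

# regexpr() returns the position of the first match per element, or -1 if none
m <- regexpr(pattern, df$original)

# regmatches() returns only the matched substrings, so assign them
# to the rows that actually had a match and leave the rest as NA
df$check <- NA_character_
df$check[m != -1] <- regmatches(df$original, m)
```

Because the check column is derived from the pattern rather than from a comparison of the two columns, it records exactly the substring that gsub() replaced.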

Related

Order a variable and separate by another variable R

I want to order() a variable by the number of characters it has and then separate that variable based on gender, f and m. How can I do that using R?

Calling your data d,
d[order(d$sex, nchar(as.character(d$name))), ]
#                name    sex
# 2     Alex Berniugn female
# 5        Aaron Poss   male
# 1   Alex Josh Adams   male
# 4  Potter Allbinore   male
# 3 Abbigol Alexander   male
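A self-contained sketch of the same idea follows. The question does not show its data, so the data frame d here is an assumption rebuilt from the printed result; split() is added as one possible way to separate the ordered rows by sex.

```r
# Hypothetical data, rebuilt from the printed result above
d <- data.frame(
  name = c("Alex Josh Adams", "Alex Berniugn", "Abbigol Alexander",
           "Potter Allbinore", "Aaron Poss"),
  sex  = c("male", "female", "male", "male", "male"),
  stringsAsFactors = FALSE
)

# Order by sex first, then by the number of characters in the name
sorted <- d[order(d$sex, nchar(d$name)), ]

# Separate the ordered rows into one data frame per sex
by_sex <- split(sorted, sorted$sex)
```

by_sex is a named list, so by_sex$female and by_sex$male each hold the rows for one group, still ordered by name length.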

Create weight node and edges lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this to create the nodes and edges lists needed to build a visNetwork diagram.
I know that the "nodes" list will be made from the unique values in the "name" column but I'm not sure how I would use the rest of the data to create the "edges" list?
I was thinking that it may be possible to group by each column and then read back the matches from this function but I am not sure how to implement this. The idea that I thought of is to weight the edges based on how many matches they detect in the various group by functions. I'm not sure how to actually implement this yet.
For example, Joe will not match with anyone because he shares no common columns with any of the others. John and Sarah will have a weight of 2 because they share two common columns.
Also open to solutions in python!

One option is to compare the data row by row, in order to count the number of common values.
For instance, for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then use the function combn() from the utils package to enumerate in advance every pair of names you have to compare:
library(magrittr) # provides %>%
nodes <- matrix(combn(df$name, 2), ncol = 2, byrow = TRUE) %>% as.data.frame()
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
(nodes)
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop calculates the weight for each pair.
for(n in seq_len(nrow(nodes))){
  name1 <- df[df$name == nodes$V1[n],]
  name2 <- df[df$name == nodes$V2[n],]
  nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think the resulting nodes data frame, which is effectively a weighted list of edges, is the kind of data frame that you can use in the function visNetwork().
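The same weights can be sketched without an explicit loop, in base R only. The column names id, from and to follow what visNetwork documents for its nodes and edges arguments as far as I know; treat that, and the choice to drop zero-weight pairs, as assumptions of this sketch rather than part of the answer above.

```r
# Sample data from the question
df <- data.frame(
  name   = c("John", "Sarah", "Beth", "Joe"),
  town   = c("Bringham", "Bringham", "Burb", "Spring"),
  car    = c("Swift", "Corolla", "Swift", "Polo"),
  color  = c("Red", "Red", "Blue", "Black"),
  age    = c(22, 33, 43, 18),
  school = c("Brighton", "Rustal", "Brighton", "Riding"),
  stringsAsFactors = FALSE
)

# All unordered pairs of row indices, one pair per matrix column
pairs <- combn(nrow(df), 2)

# Attribute columns to compare (everything except the name)
attrs <- setdiff(names(df), "name")

# One row per pair; weight = number of attribute columns with equal values
edges <- data.frame(
  from   = df$name[pairs[1, ]],
  to     = df$name[pairs[2, ]],
  weight = apply(pairs, 2, function(p) {
    sum(unlist(df[p[1], attrs]) == unlist(df[p[2], attrs]))
  }),
  stringsAsFactors = FALSE
)

# Keep only the pairs that share at least one value
edges <- edges[edges$weight > 0, ]

# One node per unique name
nodes <- data.frame(id = unique(df$name), label = unique(df$name),
                    stringsAsFactors = FALSE)
```

With the sample data this leaves two edges, John-Sarah and John-Beth, each with weight 2, and Joe remains an unconnected node.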

cbind for multiple table() functions

I'm trying to count the frequency of multiple columns in a data.frame.
I used the table function on each column and bound them all by cbind, and was going to use the aggregate function after to calculate the means by my identifier.
Example:
df1
V1 V2 V3
George Mary Mary
George Mary Mary
George Mary George
Mary Mary George
Mary George George
Mary
Frequency<- as.data.frame(cbind(table(df1$V1), table(df1$V2), table(df1$V3)))
row.names V1
George 3
Mary 3
1
George 1
Mary 4
1
George 3
Mary 2
The result I get (visually) is a 2 column data frame, but when I check the dimension of Frequency, I get a result implying that the 2nd column only exists.
This causes trouble when I try to rename the columns and run the aggregate function. This is the error I get when renaming:
colnames(Frequency) <- c("Name", "Frequency")
Error in names(Frequency) <- c("Name", "Frequency") :
'names' attribute [2] must be the same length as the vector [1]
The Final purpose is to run an aggregate command and get the mean by name:
Name.Mean<- aggregate(Frequency$Frequency, list(Frequency.Name), mean)
Desired output:
Name Mean
George Value
Mary Value

Using mtabulate (data from @user3169080's post)
library(qdapTools)
d1 <- mtabulate(df1)
is.na(d1) <- d1==0
colMeans(d1, na.rm=TRUE)
# Alice George Mary
# 4.0 3.0 2.5

I hope this is what you were looking for:
> df1
V1 V2 V3
1 George George George
2 Mary Mary Alice
3 George George George
4 Mary Mary Alice
5 <NA> George George
6 <NA> Mary Alice
7 <NA> <NA> George
8 <NA> <NA> Alice
> ll=unlist(lapply(df1,table))
> nn=names(ll)
> nn1=sapply(nn,function(x) substr(x,4,nchar(x)))
> mm=data.frame(ll)
> mm$names=nn1
> tapply(mm$ll,mm$names,mean)
> Mean=tapply(mm$ll,mm$names,mean)
> data.frame(Mean)
Mean
Alice 4.0
George 3.0
Mary 2.5
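The same means can also be sketched in base R in a way that stays close to the aggregate() call from the question. This uses the sample data printed above; the names counts and Name.Mean are illustrative.

```r
df1 <- data.frame(
  V1 = c("George", "Mary", "George", "Mary", NA, NA, NA, NA),
  V2 = c("George", "Mary", "George", "Mary", "George", "Mary", NA, NA),
  V3 = c("George", "Alice", "George", "Alice",
         "George", "Alice", "George", "Alice"),
  stringsAsFactors = FALSE
)

# One row per (column, name) pair with its count; table() drops NAs by default
counts <- do.call(rbind, lapply(names(df1), function(col) {
  as.data.frame(table(Name = df1[[col]]), responseName = "Frequency",
                stringsAsFactors = FALSE)
}))

# Mean count per name across the columns in which that name appears
Name.Mean <- aggregate(Frequency ~ Name, data = counts, FUN = mean)
```

Stacking the per-column tables with rbind, rather than binding them side by side with cbind, gives a proper two-column data frame (Name, Frequency), which is what aggregate() needs.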

Turn names into numbers in a dataframe based on the row index of the name in another dataframe

I have two dataframes. One is just the names of my Facebook friends and the other holds the links, with source and target columns. I want to turn the names in the links dataframe into numbers based on the row index of that name in the friends dataframe.
friends
name
1 Andrewt Thomas
2 Robbie McCord
3 Mohammad Mojadidi
4 Andrew John
5 Professor Owk
6 Joseph Charles
links
source target
1 Andrewt Thomas Andrew John
2 Andrewt Thomas James Zou
3 Robbie McCord Bz Benz
4 Robbie McCord Yousef AL-alawi
5 Robbie McCord Sherhan Asimov
6 Robbie McCord Aigerim Aig
Seems trivial, but I cannot figure it out. Thanks for help.

Just use a simple match
links$source <- match(links$source, friends$name)
links
# source target
# 1 1 Andrew John
# 2 1 James Zou
# 3 2 Bz Benz
# 4 2 Yousef AL-alawi
# 5 2 Sherhan Asimov
# 6 2 Aigerim Aig
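A self-contained sketch with the data from the question, applying match() to both columns. Names that do not appear in friends come back as NA, which is worth knowing before the indices are used downstream.

```r
friends <- data.frame(
  name = c("Andrewt Thomas", "Robbie McCord", "Mohammad Mojadidi",
           "Andrew John", "Professor Owk", "Joseph Charles"),
  stringsAsFactors = FALSE
)
links <- data.frame(
  source = c("Andrewt Thomas", "Andrewt Thomas", "Robbie McCord",
             "Robbie McCord", "Robbie McCord", "Robbie McCord"),
  target = c("Andrew John", "James Zou", "Bz Benz",
             "Yousef AL-alawi", "Sherhan Asimov", "Aigerim Aig"),
  stringsAsFactors = FALSE
)

# match() returns the position of each name in friends$name, or NA if absent
links$source <- match(links$source, friends$name)
links$target <- match(links$target, friends$name)
```

Here only "Andrew John" among the targets is in friends, so the target column becomes 4 followed by NAs.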

Something like this?
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
Full example
links <- data.frame(source = c("John", "John", "Alice"), target = c("Jimmy", "Al", "Chris"))
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
links$source
[1] 3 3 2

Merge data frames with partial id

Say I have these two data frames:
> df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'))
> df1
name
1 John Doe
2 Jane F. Doe
3 Mark Smith Simpson
4 Sam Lee
> df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6))
> df2
family size
1 Doe 2
2 Smith 6
I want to merge both data frames in order to get this:
name family size
1 John Doe Doe 2
2 Jane F. Doe Doe 2
3 Mark Smith Simpson Smith 6
4 Sam Lee <NA> NA
But I can't wrap my head around a way to do this apart from the following very convoluted solution, which is becoming very messy with my real data, which has over 100 "family names":
> df3 <- within(df1, {
family <- ifelse(test = grepl('Doe', name),
yes = 'Doe',
no = ifelse(test = grepl('Smith', name),
yes = 'Smith',
no = NA))
})
> merge(df3, df2, all.x = TRUE)
family name size
1 Doe John Doe 2
2 Doe Jane F. Doe 2
3 Smith Mark Smith Simpson 6
4 <NA> Sam Lee NA
I've tried taking a look into pmatch as well as the solutions provided at R partial match in data frame, but still haven't found what I'm looking for.

Rather than attempting to use regular expressions and partial matches, you could split the names up into a lookup-table format, where each component of a person's name is kept in a row, and matched to their full name:
df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'),
stringsAsFactors = FALSE)
df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6),
stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
str_df <- function(x) {
  ss <- strsplit(unlist(x)," ")
  data.frame(family = unlist(ss),stringsAsFactors = FALSE)
}
splitnames <- df1 %>%
  group_by(name) %>%
  do(str_df(.))
splitnames
name family
1 Jane F. Doe Jane
2 Jane F. Doe F.
3 Jane F. Doe Doe
4 John Doe John
5 John Doe Doe
6 Mark Smith Simpson Mark
7 Mark Smith Simpson Smith
8 Mark Smith Simpson Simpson
9 Sam Lee Sam
10 Sam Lee Lee
Now you can just merge or join this with df2 to get your answer:
left_join(df2,splitnames)
Joining by: "family"
family size name
1 Doe 2 Jane F. Doe
2 Doe 2 John Doe
3 Smith 6 Mark Smith Simpson
Potential problem: if one person's first name is the same as somebody else's last name, you'll get some incorrect matches!

Here is one strategy: you could use lapply with grepl to match each of the family names against the full names. This will find them at any position. First let me define a helper function
transindex<-function(start=1) {
function(x) {
start<<-start+1
ifelse(x, start-1, NA)
}
}
and I will also be using the function coalesce.R to make things a bit simpler. Here is the code I'd run to match up df2 to df1
idx<-do.call(coalesce, lapply(lapply(as.character(df2$family),
function(x) grepl(paste0("\\b", x, "\\b"), as.character(df1$name))),
transindex()))
Starting on the inside and working out, I loop over all the family names in df2 and grep for those values (adding "\b" to the pattern so I match entire words). grepl will return a logical vector (TRUE/FALSE). I then apply the above helper function transindex() to change those vectors to be either the index of the row in df2 that matched, or NA. Since it's possible that a row may match more than one family, I simply choose the first using the coalesce helper function.
Now that I can match up the rows in df1 to df2, I can bring them together with
cbind(df1, size=df2[idx,])
#                   name family size
# 1             John Doe    Doe    2
# 1.1        Jane F. Doe    Doe    2
# 2   Mark Smith Simpson  Smith    6
# NA             Sam Lee   <NA>   NA
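For readers without the coalesce helper, the same first-match idea can be sketched in base R alone. This is an illustrative variant, not the code above: the names hits and result are made up here, and it assumes df2 has at least two family names so that sapply() returns a matrix.

```r
df1 <- data.frame(name = c("John Doe", "Jane F. Doe",
                           "Mark Smith Simpson", "Sam Lee"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(family = c("Doe", "Smith"), size = c(2, 6),
                  stringsAsFactors = FALSE)

# Logical matrix: one row per name, one column per family,
# TRUE where the family occurs in the name as a whole word
hits <- sapply(df2$family, function(f) {
  grepl(paste0("\\b", f, "\\b"), df1$name)
})

# Row of df2 for the first matching family of each name, or NA if none
idx <- apply(hits, 1, function(h) if (any(h)) which(h)[1] else NA_integer_)

# Indexing df2 with NA yields a row of NAs, so unmatched names are kept
result <- cbind(df1, df2[idx, ])
rownames(result) <- NULL
```

As in the answer, a name that matches several families takes the first one in df2's order.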

Another approach that looks valid, at least with the sample data:
df1name = as.character(df1$name)
df1name
#[1] "John Doe" "Jane F. Doe" "Mark Smith Simpson" "Sam Lee"
regmatches(df1name, regexpr(paste(df2$family, collapse = "|"), df1name), invert = T) <- ""
df1name
#[1] "Doe" "Doe" "Smith" ""
cbind(df1, df2[match(df1name, df2$family), ])
# name family size
#1 John Doe Doe 2
#1.1 Jane F. Doe Doe 2
#2 Mark Smith Simpson Smith 6
#NA Sam Lee <NA> NA