Scanning and Replacing Values of Rows in R

I have this dataset:
sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA", "125 ocean road") )
col1  col2            col3              col4
james 123 forest road NA                NA
john  jason           124 valley street NA
henry tim             peter             125 ocean road
I want to figure out a way in which the second column always contains the "address" - the final product would look something like this:
# code to show sample of desired result
desired_result = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "124 valley street", "125 ocean road"))
col1  col2
james 123 forest road
john  124 valley street
henry 125 ocean road
I have been trying to think of and research functions in R that can "scan" whether the value contained in a row/column starts with a number, and make a decision accordingly.
I had the following idea - I can check whether a given column starts with a number or not:
sample_data$is_col2_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col2,1,1))
sample_data$is_col3_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col3,1,1))
sample_data$is_col4_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col4,1,1))
  col1  col2            col3              col4           is_col2_a_number is_col3_a_number is_col4_a_number
1 james 123 forest road NA                NA             TRUE             FALSE            FALSE
2 john  jason           124 valley street NA             FALSE            TRUE             FALSE
3 henry tim             peter             125 ocean road FALSE            FALSE            TRUE
Next, I would try to figure out how to code the following logic:
For a given row, find the first cell that contains the value TRUE
Keep the column corresponding to that condition
I tried this row-by-row:
first_row = sample_data[1,]
ifelse(first_row$is_col2_a_number == "TRUE", first_row[,c(1,2)], ifelse(first_row$is_col3_a_number, first_row[, c(1,3)], first_row[, c(1,4)]))
But I think I have made this unnecessarily complicated. Can someone please give me a hand and suggest how I can continue solving this problem?
Thank you!

This should work:
library(dplyr)
library(tidyr)
library(stringr)
sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA", "125 ocean road") )
tmp <- sample_data %>%
  mutate(across(col2:col4, ~ case_when(str_detect(.x, "^\\d") ~ .x,
                                       TRUE ~ NA_character_)),
         address = coalesce(col2, col3, col4)) %>%
  select(col1, address)
tmp
#> col1 address
#> 1 james 123 forest road
#> 2 john 124 valley street
#> 3 henry 125 ocean road
Created on 2022-06-30 by the reprex package (v2.0.1)
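For comparison, here is a minimal base-R sketch of the same idea, assuming (as in the sample data) that the address is the only value in each row that starts with a digit:

```r
sample_data <- data.frame(
  col1 = c("james", "john", "henry"),
  col2 = c("123 forest road", "jason", "tim"),
  col3 = c("NA", "124 valley street", "peter"),
  col4 = c("NA", "NA", "125 ocean road")
)
# For each row, keep the first cell among col2:col4 that begins with a digit
address <- apply(sample_data[2:4], 1, function(row) row[grepl("^[0-9]", row)][1])
result <- data.frame(col1 = sample_data$col1, address = unname(address))
result
```

This reproduces the desired_result frame without any packages, but unlike the coalesce() version it silently returns NA for rows where no cell starts with a digit.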

I thought of a (very inefficient) way to solve my own problem!
sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA", "125 ocean road") )
sample_data$is_col2_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col2,1,1))
sample_data$is_col3_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col3,1,1))
sample_data$is_col4_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col4,1,1))
a1 <- sample_data[which(sample_data$is_col2_a_number == "TRUE"), ]
a1 <- a1[,c(1,2)]
colnames(a1)[2] <- "i"
b1 <- sample_data[which(sample_data$is_col3_a_number == "TRUE"), ]
b1 <- b1[,c(1,3)]
colnames(b1)[2] <- "i"
c1 <- sample_data[which(sample_data$is_col4_a_number == "TRUE"), ]
c1 <- c1[,c(1,4)]
colnames(c1)[2] <- "i"
final = rbind(a1,b1,c1)
Here is the desired output:
col1 i
1 james 123 forest road
2 john 124 valley street
3 henry 125 ocean road

Related

Find the differences in a row of two different dataframes that have been updated

I am new to R and trying to figure out how to find the differences between two data sets after merging them. I have merged the data sets with setdiff and found 19 differing rows in the new df. However, there is no way to know which of the columns have been changed. Since the dfs have hundreds of columns, it is not practical to search every row and column to find the change. Is there a way to determine the exact change in a row of the new df?
Example of the new df:
IP Name Address ZIP
1  Bob  3456 st 2012
2  Jane 2456 st 4302
3  Mike 9698 st 2398
Example of the old df:
IP Name Address ZIP
1 Bob 3000 st 2012
2 Jane 2456 st 4302
3 Mike 9698 st 2000
If the new df had changes to Bob's address and Mike's ZIP, how would I find those in R? I have tried setdiff and compare, but those did not work. I would like output only for the specific changes to the dataframe, and the row in which each happened.
EDIT: Another example, from the comments:
new <- data.frame(
stringsAsFactors = FALSE,
IP = c(1L, 2L, 3L, 4L, 5L, 6L),
Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"),
Address = c("3000 st", "5678 st", "2456 st", "9698 st",
"9776 st", "1002 st"),
ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
)
old <- data.frame(
stringsAsFactors = FALSE,
IP = c(1L, 2L, 3L, 4L),
Name = c("Bob", "Jane", "Mike", "Jack"),
Address = c("3456 st", "2456 st", "9698 st", "5678 st"),
ZIP = c(2012L, 4302L, 2012L, 1121L)
)
EDIT #2:
If you want to find new names added from the old to the new data, and the Names column is a unique identifier, you could use this on the updated example data:
new %>%
filter(!Name %in% old$Name)
# IP Name Address ZIP
#1 5 Alex 9776 st 3476
#2 6 Amy 1002 st 4655
There must be a more elegant way to do this, but another approach could be to join the data to itself using Name as a key, and then reshape to identify differences between the two:
library(tidyverse)
new %>%
  left_join(old, by = "Name") %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(-Name, names_to = c("name", "src"), names_sep = "\\.") %>%
  pivot_wider(names_from = src, values_from = value) %>%
  group_by(Name) %>%
  filter(x != y) %>%
  ungroup()
## A tibble: 5 x 4
# Name name x y
# <chr> <chr> <chr> <chr>
#1 Bob Address 3000 st 3456 st
#2 Jack IP 2 4
#3 Jane IP 3 2
#4 Mike IP 4 3
#5 Mike ZIP 2398 2012
This output tells us that Bob's Address field changed, Jack, Jane, and Mike appear in different rows, and Mike's ZIP changed.
Original answer
The waldo package offers an easy way to do this:
waldo::compare(new, old)
#`old$Address`: "3456 st" "2456 st" "9698 st"
#`new$Address`: "3000 st" "2456 st" "9698 st"
#
#`old$ZIP`: 2012 4302 2398
#`new$ZIP`: 2012 4302 2000
While not visible here, the output in the console highlights the values that changed in green.
EDIT - A better option might be diffdf::diffdf(new, old), which outputs a summary of the specific differences:
Differences found between the objects!
A summary is given below.
Not all Values Compared Equal
All rows are shown in table below
=============================
Variable No of Differences
-----------------------------
Address 1
ZIP 1
-----------------------------
All rows are shown in table below
===========================================
VARIABLE ..ROWNUMBER.. BASE COMPARE
-------------------------------------------
Address 1 3456 st 3000 st
-------------------------------------------
All rows are shown in table below
========================================
VARIABLE ..ROWNUMBER.. BASE COMPARE
----------------------------------------
ZIP 3 2398 2000
----------------------------------------
Example data in loadable form:
new <- data.frame(
stringsAsFactors = FALSE,
IP = c(1L, 2L, 3L),
Name = c("Bob", "Jane", "Mike"),
Address = c("3456 st", "2456 st", "9698 st"),
ZIP = c(2012L, 4302L, 2398L)
)
old <- data.frame(
stringsAsFactors = FALSE,
IP = c(1L, 2L, 3L),
Name = c("Bob", "Jane", "Mike"),
Address = c("3000 st", "2456 st", "9698 st"),
ZIP = c(2012L, 4302L, 2000L)
)
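For completeness, when the two data frames have the same shape with rows in the same order (as in the loadable example above), a dependency-free base-R sketch can pull out the changed cells directly via matrix indexing:

```r
new <- data.frame(IP = 1:3, Name = c("Bob", "Jane", "Mike"),
                  Address = c("3456 st", "2456 st", "9698 st"),
                  ZIP = c(2012L, 4302L, 2398L), stringsAsFactors = FALSE)
old <- data.frame(IP = 1:3, Name = c("Bob", "Jane", "Mike"),
                  Address = c("3000 st", "2456 st", "9698 st"),
                  ZIP = c(2012L, 4302L, 2000L), stringsAsFactors = FALSE)
# Logical matrix of cell-level mismatches, then their row/column positions
changed <- which(new != old, arr.ind = TRUE)
data.frame(row      = changed[, "row"],
           variable = names(new)[changed[, "col"]],
           base     = old[changed],
           compare  = new[changed])
```

This only works for like-for-like comparisons (identical dimensions, column order, and row order); for data with added or removed rows, the key-based approaches above are the way to go.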

Group_by multiple columns and summarise unique column

I have a dataset below
family type                  inc   name
AA     success               30000 Bill
AA     ERROR                 15000 Bess
CC     Pending               22000 Art
CC     Pending               18000 Amy
AA     Serve not respnding d 25000 Paul
ZZ     Success               50000 Pat
ZZ     Processing            50000 Pat
I want to group by multiple columns.
Here is my code below:
df <- df1 %>%
  group_by(Family, type) %>%
  summarise(Transaction_count = n(), Face_value = sum(Inc)) %>%
  mutate(Pct = Transaction_count/sum(Transaction_count))
What I want is that wherever the same family value repeats across rows, it should be shown only once, as in the desired result.
Thank you
You can use duplicated to replace the repeating values with a blank value.
library(dplyr)
df %>%
  group_by(family, type) %>%
  summarise(Transaction_count = n(), Face_value = sum(inc)) %>%
  mutate(Pct = Transaction_count/sum(Transaction_count),
         family = replace(family, duplicated(family), '')) %>%
  ungroup
# family type Transaction_count Face_value Pct
# <chr> <chr> <int> <int> <dbl>
#1 "AA" ERROR 1 15000 0.333
#2 "" Serve not respnding d 1 25000 0.333
#3 "" success 1 30000 0.333
#4 "CC" Pending 2 40000 1
#5 "ZZ" Processing 1 50000 0.5
#6 "" Success 1 50000 0.5
If you want the data for display purposes, you may look into packages like formattable, kableExtra, etc.
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(family = c("AA", "AA", "CC", "CC", "AA", "ZZ",
"ZZ"), type = c("success", "ERROR", "Pending", "Pending", "Serve not respnding d",
"Success", "Processing"), inc = c(30000L, 15000L, 22000L, 18000L,
25000L, 50000L, 50000L), name = c("Bill", "Bess", "Art", "Amy",
"Paul", "Pat", "Pat")), row.names = c(NA, -7L), class = "data.frame")

Formatting output of R dataframe

So I currently have a data frame in R that I want to export/write to a text file using write.table().
Here's an example of the dataframe:
ID FirstName LastName Class
1000 John NA C-02
1001 Jane Wellington C-03
1002 Kate NA C-04
1003 Adam West C-05
I want to write it to a text file where, for each row, if any column value is NA, it skips the word "NA" and proceeds to the next column. The output I want:
1000 John C-02
1001 Jane Wellington C-03
1002 Kate C-04
1003 Adam West C-05
As the example shows, the first row doesn't have a last name entered, so we skip to the next column, avoiding something like:
1000 John NA C-02
I tried the write.table() command:
write.table(df, "student_list.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep="\t")
But the problem is that the output still includes the literal "NA" values, as in the line above.
library(tidyverse)
dta <- tribble(
  ~ID, ~FirstName, ~LastName, ~Class,
  1000, "John", NA, "C-02",
  1001, "Jane", "Wellington", "C-03",
  1002, "Kate", NA, "C-04",
  1003, "Adam", "West", "C-05"
)
dta %>%
  unite(column, everything(), sep = " ") %>%
  mutate(column = str_remove_all(column, "NA ")) %>%
  write.table("student_list.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")
I would use apply to remove the NAs and convert rows into text lines (using paste), as follows:
data <- apply(df, 1, function(row) {
  paste(row[!is.na(row)], collapse = "\t")
})
write.table(data, "student_list.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep="\t")
File output would look like the following:
#1000 John C-02
#1001 Jane Wellington C-03
#1002 Kate C-04
#1003 Adam West C-05
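If an empty field (rather than a fully collapsed line) is acceptable, write.table() itself has an na argument; setting na = "" suppresses the literal "NA" text without any preprocessing, though the empty field's tab separator remains:

```r
df <- data.frame(ID = c(1000, 1001, 1002, 1003),
                 FirstName = c("John", "Jane", "Kate", "Adam"),
                 LastName = c(NA, "Wellington", NA, "West"),
                 Class = c("C-02", "C-03", "C-04", "C-05"),
                 stringsAsFactors = FALSE)
out <- tempfile()
# na = "" writes missing values as empty strings instead of "NA"
write.table(df, out, col.names = FALSE, row.names = FALSE,
            quote = FALSE, sep = "\t", na = "")
readLines(out)  # first line: "1000\tJohn\t\tC-02"
```

So the choice depends on whether downstream consumers expect a fixed number of tab-separated fields per line (use na = "") or a fully collapsed line (use the apply/paste approach above).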

Find discrepancies between two tables

I'm working with R from a SAS/SQL background, and am trying to write code to take two tables, compare them, and provide a list of the discrepancies. This code would be used repeatedly for many different sets of tables, so I need to avoid hardcoding.
I'm working with Identifying specific differences between two data sets in R , but it doesn't get me all the way there.
Example Data, using the combination of LastName/FirstName (which is unique) as a key --
Dataset One --
Last_Name First_Name Street_Address ZIP VisitCount
Doe John 1234 Main St 12345 20
Doe Jane 4321 Tower St 54321 10
Don Bob 771 North Ave 23232 5
Smith Mike 732 South Blvd. 77777 3
Dataset Two --
Last_Name First_Name Street_Address ZIP VisitCount
Doe John 1234 Main St 12345 20
Doe Jane 4111 Tower St 32132 17
Donn Bob 771 North Ave 11111 5
Desired Output --
LastName FirstName VarName TableOne TableTwo
Doe Jane StreetAddress 4321 Tower St 4111 Tower St
Doe Jane Zip 23232 32132
Doe Jane VisitCount 5 17
Note that this output ignores records where I don't have the same ID in both tables (for instance, because Bob's last name is "Don" in one table, and "Donn" in another table, we ignore that record entirely).
I've explored doing this by applying the melt function to both datasets and then comparing them, but the size of the data I'm working with indicates that wouldn't be practical. In SAS I used PROC COMPARE for this kind of work, but I haven't found an exact equivalent in R.
Here is a solution based on data.table:
library(data.table)
# Convert into data.table, melt
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = FALSE)),
         by = c('Last_Name', 'First_Name')]
setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = FALSE)),
         by = c('Last_Name', 'First_Name')]
# Set keys for merging
setkey(d1,Last_Name,First_Name,VarName)
# Merge, remove duplicates
d1[d2,nomatch=0][TableOne!=TableTwo]
# Last_Name First_Name VarName TableOne TableTwo
# 1: Doe Jane Street_Address 4321 Tower St 4111 Tower St
# 2: Doe Jane ZIP 54321 32132
# 3: Doe Jane VisitCount 10 17
where input data sets are:
# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John",
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St",
"771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L,
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name",
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John",
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St",
"771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L,
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address",
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
dplyr and tidyr work well here. First, a slightly reduced dataset:
dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
First_Name = c('John', 'Jane', 'Bob', 'Mike'),
ZIP = c(12345, 54321, 23232, 77777),
VisitCount = c(20, 10, 5, 3),
stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
First_Name = c('John', 'Jane', 'Bob'),
ZIP = c(12345, 32132, 11111),
VisitCount = c(20, 17, 5),
stringsAsFactors = FALSE)
(Sorry, I didn't want to type it all in. If it's important, please provide a reproducible example with well-defined data structures.)
Additionally, it looks like your "desired output" is a little off with Jane Doe's ZIP and VisitCount.
Your thought to melt them works well:
library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
## Last_Name First_Name key value
## 1 Doe John ZIP 12345
## 2 Doe Jane ZIP 54321
## 3 Don Bob ZIP 23232
## 4 Smith Mike ZIP 77777
## 5 Doe John VisitCount 20
## 6 Doe Jane VisitCount 10
From here, it's deceptively simple:
dat1g %>%
inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
filter(value.x != value.y)
## Last_Name First_Name key value.x value.y
## 1 Doe Jane ZIP 54321 32132
## 2 Doe Jane VisitCount 10 17
The dataCompareR package aims to solve this exact problem. The vignette for the package includes some simple examples, and I've used this package to solve the original problem below.
Disclaimer: I was involved with creating this package.
library(dataCompareR)
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))
print(compd1d2)
All columns were compared, 3 row(s) were dropped from comparison
There are 3 mismatched variables:
First and last 5 observations for the 3 mismatched variables
FIRST_NAME LAST_NAME valueA valueB variable typeA typeB diffAB
1 Jane Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character
2 Jane Doe 10 17 VISITCOUNT integer integer -7
3 Jane Doe 54321 32132 ZIP integer integer 22189
To get a more detailed and pretty summary, the user can run
summary(compd1d2)
The use of FIRST_NAME and LAST_NAME as the 'join' between the two tables is controlled by the keys = argument to the rCompare function. In this case, any rows that do not match on these two variables are dropped from the comparison, but you can get a more detailed report on the comparison performed by using summary.

transform one long row in data-frame to individual records

I have a variable-length list of people that I get as one long row in a data frame, and I am interested in reorganizing these records into a more meaningful format.
My raw data looks like this,
df <- data.frame(name1 = "John Doe", email1 = "John#Doe.com", phone1 = "(444) 444-4444", name2 = "Jane Doe", email2 = "Jane#Doe.com", phone2 = "(444) 444-4445", name3 = "John Smith", email3 = "John#Smith.com", phone3 = "(444) 444-4446", name4 = NA, email4 = "Jane#Smith.com", phone4 = NA, name5 = NA, email5 = NA, phone5 = NA)
df
# name1 email1 phone1 name2 email2 phone2
# 1 John Doe John#Doe.com (444) 444-4444 Jane Doe Jane#Doe.com (444) 444-4445
# name3 email3 phone3 name4 email4 phone4 name5
# 1 John Smith John#Smith.com (444) 444-4446 NA Jane#Smith.com NA NA
# email5 phone5
# 1 NA NA
and I am trying to bend it into a format like this,
df_transform <- structure(list(name = structure(c(2L, 1L, 3L, NA, NA), .Label = c("Jane Doe",
"John Doe", "John Smith"), class = "factor"), email = structure(c(3L,
1L, 4L, 2L, NA), .Label = c("Jane#Doe.com", "Jane#Smith.com",
"John#Doe.com", "John#Smith.com"), class = "factor"), phone = structure(c(1L,
2L, 3L, NA, NA), .Label = c("(444) 444-4444", "(444) 444-4445",
"(444) 444-4446"), class = "factor")), .Names = c("name", "email",
"phone"), class = "data.frame", row.names = c(NA, -5L))
df_transform
# name email phone
# 1 John Doe John#Doe.com (444) 444-4444
# 2 Jane Doe Jane#Doe.com (444) 444-4445
# 3 John Smith John#Smith.com (444) 444-4446
# 4 <NA> Jane#Smith.com <NA>
# 5 <NA> <NA> <NA>
It should be added that it's not always five records; it could be any number between 1 and 99. I tried reshape2's melt and t(), but it got way too complicated. I imagine there is some known method that I simply do not know about.
You're on the right track, try this:
library(reshape2)
# melt it down
df.melted = melt(t(df))
# get rid of the numbers at the end
df.melted$Var1 = sub('[0-9]+$', '', df.melted$Var1)
# cast it back
dcast(df.melted, (seq_len(nrow(df.melted)) - 1) %/% 3 ~ Var1)[,-1]
# email name phone
#1 John#Doe.com John Doe (444) 444-4444
#2 Jane#Doe.com Jane Doe (444) 444-4445
#3 John#Smith.com John Smith (444) 444-4446
#4 Jane#Smith.com <NA> <NA>
#5 <NA> <NA> <NA>
1) reshape(). First we strip the digits off the column names, giving the reduced column names names0. Then we split the columns into groups, producing g (which has three components corresponding to the email, name, and phone column groups). Then we use reshape (from base R) to perform the wide-to-long transformation, selecting from the resulting long data frame the desired columns in order to exclude the columns that reshape adds automatically. That selection vector, unique(names0), also reorders the resulting columns in the desired way.
names0 <- sub("\\d+$", "", names(df))
g <- split(names(df), names0)
reshape(df, dir = "long", varying = g, v.names = names(g))[unique(names0)]
and the last line gives this:
name email phone
1.1 John Doe John#Doe.com (444) 444-4444
1.2 Jane Doe Jane#Doe.com (444) 444-4445
1.3 John Smith John#Smith.com (444) 444-4446
1.4 <NA> Jane#Smith.com <NA>
1.5 <NA> <NA> <NA>
2) reshape2 package. Here is a solution using reshape2. We add a rowname column to df and melt it to long form. Then we split the variable column into its name portion (name, email, phone) and its numeric suffix, which we call id. Finally we convert it back to wide form using dcast and select the appropriate columns as we did before.
library(reshape2)
m <- melt(data.frame(rowname = 1:nrow(df), df), id = 1)
mt <- transform(m,
variable = sub("\\d+$", "", variable),
id = sub("^\\D+", "", variable)
)
dcast(mt, rowname + id ~ variable)[, unique(mt$variable)]
where the last line gives this:
name email phone
1 John Doe John#Doe.com (444) 444-4444
2 Jane Doe Jane#Doe.com (444) 444-4445
3 John Smith John#Smith.com (444) 444-4446
4 <NA> Jane#Smith.com <NA>
5 <NA> <NA> <NA>
3) Simple matrix reshaping. Remove the numeric suffixes from the column names and set cn to the unique remaining names (cn stands for column names). Then we merely reshape the df row into an n x length(cn) matrix, adding the column names.
cn <- unique(sub("\\d+$", "", names(df)))
matrix(as.matrix(df), ncol = length(cn), byrow = TRUE, dimnames = list(NULL, cn))
name email phone
[1,] "John Doe" "John#Doe.com" "(444) 444-4444"
[2,] "Jane Doe" "Jane#Doe.com" "(444) 444-4445"
[3,] "John Smith" "John#Smith.com" "(444) 444-4446"
[4,] NA "Jane#Smith.com" NA
[5,] NA NA NA
4) tapply. This problem can also be solved with a simple tapply. As before, names0 holds the column names without the numeric suffixes, and names.suffix holds just the suffixes. Now use tapply:
names0 <- sub("\\d+$", "", names(df))
names.suffix <- sub("^\\D+", "", names(df))
tapply(as.matrix(df), list(names.suffix, names0), c)[, unique(names0)]
The last line gives:
name email phone
1 "John Doe" "John#Doe.com" "(444) 444-4444"
2 "Jane Doe" "Jane#Doe.com" "(444) 444-4445"
3 "John Smith" "John#Smith.com" "(444) 444-4446"
4 NA "Jane#Smith.com" NA
5 NA NA NA
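The matrix trick in (3) scales to the 1-to-99 records mentioned in the question; here it is wrapped in a small helper (unflatten_row is a made-up name), assuming the columns always repeat in a consistent name/email/phone order:

```r
unflatten_row <- function(df) {
  cn <- unique(sub("\\d+$", "", names(df)))   # field names without suffixes
  m  <- matrix(as.matrix(df), ncol = length(cn), byrow = TRUE,
               dimnames = list(NULL, cn))     # one row per record
  as.data.frame(m, stringsAsFactors = FALSE)
}
df <- data.frame(name1 = "John Doe", email1 = "John#Doe.com", phone1 = "(444) 444-4444",
                 name2 = "Jane Doe", email2 = "Jane#Doe.com", phone2 = "(444) 444-4445",
                 stringsAsFactors = FALSE)
unflatten_row(df)
```

Like (3), this assumes every record has the same set of fields in the same order; if a source can omit a field entirely (not just leave it NA), the reshape()-based approaches are safer.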
