How to update name based on other column's condition (Cleaning Data) - r

I have a df below
df <- data.frame(LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"))
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
Some first names are inconsistent, and I want to find and replace them. My code works when it runs on its own, but it stops working once I put it in a function. My data is also big: there are hundreds of names, so is there a better way to do this?
I have not found a way to do it when I have 100 names to find and replace. I found a reference here, but it does not resolve my problem. Any suggestions would be appreciated.
fil_name <- function(last,first,alternative){
df %>%
mutate(FIRSTNAME = ifelse(LASTNAME == "last" & FIRSTNAME == "first", "alternative", FIRSTNAME))
}
fil_name(Robinson,Dave,David)
Expected output:
LASTNAME FIRSTNAME
1 Robinson David
2 Anderson Adien
3 Beckham Joana
4 Wickham Kesley
5 Carlos Anberto
6 Robinson David
7 Beckham Joana
8 Anderson Adien
9 Carlos Anberto

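We can convert the unquoted arguments to character strings inside the function, and it should work. The original version compared the columns against the literal strings "last" and "first", so no row ever matched.
library(dplyr)  # provides %>%, mutate() and case_when()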
fil_name <- function(df, last,first,alternative){
last <- rlang::as_string(rlang::ensym(last))
first <- rlang::as_string(rlang::ensym(first))
alternative <- rlang::as_string(rlang::ensym(alternative))
df %>%
dplyr::mutate(FIRSTNAME = case_when(LASTNAME == last &
FIRSTNAME == first ~ alternative, TRUE ~ FIRSTNAME))
}
fil_name(df, Robinson,Dave,David)
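If quoting the names at the call site is acceptable, a plainer sketch avoids rlang entirely. It assumes the arguments are passed as strings, and `fil_name_chr` is a hypothetical name used only for illustration, not part of the answer above.

```r
# Sample data from the question
df <- data.frame(
  LASTNAME  = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos",
                "Robinson", "Beckham", "Anderson", "Carlos"),
  FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto",
                "Dave", "Joana", "Adien", "An"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last, first and alternative are plain strings
fil_name_chr <- function(df, last, first, alternative) {
  hit <- df$LASTNAME == last & df$FIRSTNAME == first  # rows to fix
  df$FIRSTNAME[hit] <- alternative
  df
}

out <- fil_name_chr(df, "Robinson", "Dave", "David")
out[out$LASTNAME == "Robinson", ]
```

Because the function returns the modified data frame, calls can be chained or looped over a table of replacements.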

Another approach is to create a separate data frame including the FIRSTNAME alternative name pairings, merge it into the original data, and update FIRSTNAME for those rows where ALTNAME is not NA.
This allows one to update the data with a vectorized process, rather than changing the names one by one.
# create data frame with a column to maintain original sort order
df <- data.frame(obs = 1:9,
LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"),
stringsAsFactors = FALSE)
# create firstname / altname pairs
altnames <- data.frame(FIRSTNAME = c("Dave","Adi","Joan","An"),
ALTNAME = c("David","Adien","Joana","Anberto"),
stringsAsFactors = FALSE)
# merge by firstname, keeping all rows from original data frame
combined <- merge(df,altnames,by="FIRSTNAME",all.x=TRUE)
# update rows where ALTNAME is not NA
combined[!is.na(combined$ALTNAME),"FIRSTNAME"] <- combined[!is.na(combined$ALTNAME),"ALTNAME"]
# print the result, ordered by sequence in original data frame
combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
...and the output:
> combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
LASTNAME FIRSTNAME
6 Robinson David
1 Anderson Adien
7 Beckham Joana
9 Wickham Kesley
4 Carlos Anberto
5 Robinson David
8 Beckham Joana
2 Anderson Adien
3 Carlos Anberto
>
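The same lookup-table idea can be sketched in base R with match(), which keeps the original row order without a helper column. One assumption here goes beyond the answer above: the lookup is keyed on both LASTNAME and FIRSTNAME, so that an "An" with a different last name would be left alone. The `altnames` layout with a LASTNAME column is hypothetical.

```r
df <- data.frame(
  LASTNAME  = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos",
                "Robinson", "Beckham", "Anderson", "Carlos"),
  FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto",
                "Dave", "Joana", "Adien", "An"),
  stringsAsFactors = FALSE
)

# Lookup table (assumed layout): one row per name that needs correcting
altnames <- data.frame(
  LASTNAME  = c("Robinson", "Anderson", "Beckham", "Carlos"),
  FIRSTNAME = c("Dave", "Adi", "Joan", "An"),
  ALTNAME   = c("David", "Adien", "Joana", "Anberto"),
  stringsAsFactors = FALSE
)

# Build a combined key and find each row's position in the lookup table
key_df  <- paste(df$LASTNAME, df$FIRSTNAME, sep = "|")
key_alt <- paste(altnames$LASTNAME, altnames$FIRSTNAME, sep = "|")
idx <- match(key_df, key_alt)  # NA where no correction is listed

# Replace only the rows that have a correction
df$FIRSTNAME <- ifelse(is.na(idx), df$FIRSTNAME, altnames$ALTNAME[idx])
df
```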

Related

test if words are in a string (grepl, fuzzyjoin?)

I need to do a match and join on two data frames if the string from two columns of one data frame are contained in the string of a column from a second data frame.
Example dataframe:
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)
So basically, I have df1, which has fairly orderly names of people split into first and last name, and df2, which has names that may be organized as "First Name, Last Name", "Last Name First Name", "First Name MI Last Name", or something else entirely that also contains the name. I need the ID column from df2. So I want to run code that checks whether df1$First and df1$Last both appear somewhere in the string df2$Name and, if they do, joins df2$ID to df1.
My R guru told me to use fuzzy_left_join from the fuzzyjoin package:
fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")
but it gives me an error where the argument is not logical; and I can't figure out how to rewrite it to do what I want; the documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than df1$First and df1$Last. I think I might be able to use the grepl, but not sure how based on examples I've seen. Any advice?
The documentation says that match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." It's not TRUE or FALSE, it's a function that returns TRUE or FALSE. If we switch your order, we can use stringr::str_detect, which does return TRUE or FALSE as required.
fuzzyjoin::fuzzy_left_join(
df2, df1,
by = c("Name" = "First", "Name" = "Last"),
match_fun = stringr::str_detect
)
# Name ID First Last
# 1 mr john smith ID1 john smith
# 2 ID2 <NA> <NA>
# 3 timothy t mcgee ID3 <NA> <NA>
# 4 dinnozo tom ID4 tom dinnozo
# 5 jane l doe ID5 jane doe
# 6 jimmy mcgee ID6 jimmy mcgee
# 7 leah elizabeth arthur palmer and co ID7 leah palmer
# 8 jerry bishop the cat ID8 jerry bishop
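Since the question also mentions grepl, here is a minimal base R sketch of the same match without fuzzyjoin. It assumes plain substring matching is good enough (so "tom" would also match inside a longer word) and that the first matching row in df2 is the one wanted. `match_id` is a hypothetical helper name.

```r
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith", "", "timothy t mcgee", "dinnozo tom", "jane l doe",
           "jimmy mcgee", "leah elizabeth arthur palmer and co",
           "jerry bishop the cat")
ID    <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last, stringsAsFactors = FALSE)
df2 <- data.frame(Name, ID, stringsAsFactors = FALSE)

# Hypothetical helper: ID of the first df2 row whose Name contains both
# the first and the last name, or NA if there is none
match_id <- function(first, last) {
  hit <- grepl(first, df2$Name, fixed = TRUE) &
         grepl(last,  df2$Name, fixed = TRUE)
  if (any(hit)) df2$ID[which(hit)[1]] else NA_character_
}

df1$ID <- mapply(match_id, df1$First, df1$Last, USE.NAMES = FALSE)
df1
```

This keeps df1 as the left table, which is what the question asked for, at the cost of one pass over df2 per row of df1.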

Detect string pattern in dataframe and conditionally fill another in R

I have a dataframe containing text and numeric references, and a dataframe of words that may appear in the text. What I want is to check for every instance in which a word from words_df appears in text_df$text, and record the word from words_df and the numeric reference from text_df$ref in a new dataframe (edge_df).
text_df <- data.frame(text = c("John went to the shops", "Sarita hates apples", "Wendy doesn't care about this"),
ref = c("13.5", "1.9.9", "20.1"))
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"))
edge_df <- data.frame(ref = NA, word = NA)
The output should look like this:
> edge_df
ref word
1 13.5 shops
2 13.5 John
3 1.9.9 apples
4 20.1 Wendy
5 20.1 this
It isn't very elegant but I thought a for-loop would work, where each word is checked against the text using stringr::str_detect, and if the result is TRUE it would record the word and ref:
for (i in 1:nrow(text_df)) {
for (j in 1:nrow(words_df)) {
if (str_detect(text_df$text[i], words_df$word[j]) == TRUE) {
edge_df$ref <- text_df$ref[i]
edge_df$word <- words_df$word[j]
}
}
}
This did not work, and nor have several variations on this loop. If possible I would rather not use a loop at all as the dataframes I'm working with have around 1000 rows each and it takes far too long to loop through them. Any fixes to the loop much appreciated, and bonus points/props if you can do it without a loop at all.
Thank you!
Here is an option with str_extract_all and unnest. We extract the words from the 'text' column into a list and use unnest to expand the rows.
library(dplyr)
library(stringr)
library(tidyr)
text_df %>%
transmute(ref, word = str_extract_all(text,
str_c(words_df$word, collapse="|"))) %>%
unnest(c(word))
# A tibble: 5 x 2
# ref word
# <chr> <chr>
#1 13.5 John
#2 13.5 shops
#3 1.9.9 apples
#4 20.1 Wendy
#5 20.1 this
Try this tidyverse approach. The key to your issue: you can reshape your data to long format by separating each word in the sentences and then use left_join(). Here is the code (I have used the data you provided):
library(tidyverse)
#Data
text_df <- data.frame(text = c("John went to the shops", "Sarita hates apples", "Wendy doesn't care about this"),
ref = c("13.5", "1.9.9", "20.1"),stringsAsFactors = F)
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"),stringsAsFactors = F)
#Join
words_df %>% left_join(text_df %>% separate_rows(text,sep = ' ') %>%
rename(word=text))
Output:
word ref
1 shops 13.5
2 John 13.5
3 apples 1.9.9
4 Wendy 20.1
5 this 20.1
Here is a base R option
u <- lapply(text_df$text,function(x) words_df$word[sapply(words_df$word,function(y) grepl(y,x))])
edge_df <- data.frame(ref = rep(text_df$ref,lengths(u)),word = unlist(u))
which gives
ref word
1 13.5 shops
2 13.5 John
3 1.9.9 apples
4 20.1 Wendy
5 20.1 this
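One caveat with the pattern-based options is that a word such as "this" also matches inside a longer word such as "thistle". A hedged sketch that wraps the alternation in `\\b` word boundaries is shown here. It assumes the words contain no regex metacharacters, and it uses perl = TRUE so that `\\b` is interpreted by the PCRE engine.

```r
text_df <- data.frame(
  text = c("John went to the shops", "Sarita hates apples",
           "Wendy doesn't care about this"),
  ref  = c("13.5", "1.9.9", "20.1"),
  stringsAsFactors = FALSE
)
words <- c("shops", "John", "apples", "Wendy", "this")

# One pattern matching any of the words as a whole word
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# List with one character vector of matches per row of text_df
hits <- regmatches(text_df$text,
                   gregexpr(pattern, text_df$text, perl = TRUE))

# Repeat each ref once per match and flatten the matches
edge_df <- data.frame(ref  = rep(text_df$ref, lengths(hits)),
                      word = unlist(hits),
                      stringsAsFactors = FALSE)
edge_df
```

Matches come back in the order they appear in each text, so "John" is listed before "shops" for the first row.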
library(data.table)
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"))
text_df <- data.frame(text = c("John went to the shops",
"Sarita hates apples", "Wendy doesn't care about this"),
ref = c("13.5", "1.9.9", "20.1"))
setDT(words_df)
setDT(text_df)
First we get our words vector ready.
wordvec <- paste0(words_df[,word],collapse="|")
Now all there is to do is to extract, for each row, all the words that match wordvec:
## > text_df[,.(word=unlist(regmatches(text,gregexpr(wordvec,text)))),ref]
## ref word
## 1: 13.5 John
## 2: 13.5 shops
## 3: 1.9.9 apples
## 4: 20.1 Wendy
## 5: 20.1 this
Together, the functions regmatches and gregexpr return a list containing all the words that match the pattern wordvec.
> regmatches("John went to the shops",gregexpr(wordvec,"John went to the shops"))
##[[1]]
##[1] "John" "shops"
Warning: to format the output quickly I'm relying on the ref variable and treating its values as unique ids. If that is not the case, it is best to create an id column and group by it in addition to ref. For instance:
text_df[,id:=1:.N][,.(word=unlist(regmatches(text,
gregexpr(wordvec,text)))),.(id,ref)]

R! mutate conditional and list intersect (How many time was a player on the court ?)

This is a sport analysis question: how many times was a player on the court?
I have a list of players I am interested in
names <- c('John','Bill','Peter')
and a list of actions during multiple matches
team <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team,player1,player2)
I want to build a column that counts the number of actions during which each player was on the court
actions_when_player_on_court <- df %>% group_by(team) %>%
calculate({nb of observation where the player is either player1 or player2} )
so I end up with a new list like
actions_when_player_on_court <- c(4,3,1)
so I can create a new DF like this
new_df <- data.frame(names,actions_when_player_on_court)
where John appears 4 times on the court, Bill three times, and Peter once
I feel I may need to intersect the names and c(player1,player2) especially if
names are unique - John, Bill and Peter cannot belong to other teams and are unique in df
I may have 0 to n players on the field so 0 to n column (player1, player2... playern)
The following code should do what you need.
We first need to create a new data frame to store all names and an empty actions_when_player_on_court variable.
names = c()
for (i in 2:ncol(df)) {
names = c(names, unique(df[,i]))
}
names = data.frame(name = unique(names), actions_when_player_on_court = 0)
Then, we can fill the actions_when_player_on_court variable using a for loop:
df$n = 1
for (i in 2:(ncol(df)-1)) {
tmp = aggregate(cbind(n = n) ~ df[, i], data = df[, c(i, ncol(df))], FUN="sum")
names(tmp)[1] = "name"
names = merge(names, tmp, all=T)
names[is.na(names)] = 0
names$actions_when_player_on_court = names$actions_when_player_on_court + names$n
names = names[-ncol(names)]
}
You can have as many player columns as you want, as long as they start at the second column and run until the end of the data frame. Note that the resulting data frame does not include the team variable. I think you can deal with that yourself. Here is the result:
> names
name actions_when_player_on_court
1 Bill 3
2 Bob 1
3 John 4
4 Mike 5
5 Peter 1
6 Steve 2
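As a shorter alternative to the loop, the counts can be sketched with table() over all player columns at once. This assumes a player appears at most once per row and that the player columns are selected by name. `players_of_interest` is a hypothetical variable name, chosen to avoid masking the base function names().

```r
team    <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team, player1, player2, stringsAsFactors = FALSE)

players_of_interest <- c('John', 'Bill', 'Peter')

# Stack every player column into one vector and count occurrences
counts <- table(unlist(df[c("player1", "player2")]))

# Keep only the players of interest, in the requested order
new_df <- data.frame(
  names = players_of_interest,
  actions_when_player_on_court = as.integer(counts[players_of_interest]),
  stringsAsFactors = FALSE
)
new_df
```

With more player columns, only the vector of column names passed to `df[...]` needs to change.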

Replacing integers in a dataframe column that's a list of integer vectors (not just single integers) with character strings in R

I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it and failed to find a similar case here on stackoverflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
# first_names player_indices result
# 1 Megan 1 Rapinoe
# 2 Abby 2, 5 Wambach, Dahlkemper
# 3 Alyssa 3 Naeher
# 4 Alex 4 Morgan
# 5 Heather 6, 7 Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
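Note that `example_concord_idx[a]` with an integer `a` selects by position, which happens to coincide with the id here because the ids run from 1 to 7 in order. If the ids were arbitrary, a lookup via match() would be safer. The sketch below uses made-up ids (15, 20, 7) and hypothetical names `concord` and `roster` purely to illustrate that case.

```r
# Hypothetical concordance with ids that are not 1..n
concord <- data.frame(
  last_names = c("Rapinoe", "Wambach", "Dahlkemper"),
  player_ids = c(15L, 20L, 7L),
  stringsAsFactors = FALSE
)

roster <- data.frame(first_names = c("Megan", "Abby"),
                     stringsAsFactors = FALSE)
roster$player_indices <- list(15L, c(20L, 7L))

# For each vector of ids, find the matching rows and return the names
roster$last_names <- lapply(
  roster$player_indices,
  function(ids) concord$last_names[match(ids, concord$player_ids)]
)
roster
```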
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.

Extracting Column data from .csv and turning every 10 consecutive rows into corresponding columns

Below is the code I am trying to implement. I want to extract each run of 10 consecutive row values and turn it into a corresponding column.
This is what the data looks like: https://drive.google.com/file/d/0B7huoyuu0wrfeUs4d2p0eGpZSFU/view?usp=sharing
I have been trying, but temp1 and temp2 come out empty. Please help.
library(Hmisc) #for increment function
myData <- read.csv("Clothing_&_Accessories.csv",header=FALSE,sep=",",fill=TRUE) # reading the csv file
extract<-myData$V2 # extracting the desired column
x<-1
y<-1
temp1 <- NULL #initialisation
temp2 <- NULL #initialisation
data.sorted <- NULL #initialisation
limit<-nrow(myData) # Calculating no of rows
while (x! = limit) {
count <- 1
for (count in 11) {
if (count > 10) {
inc(x) <- 1
break # gets out of for loop
}
else {
temp1[y]<-data_mat[x] # extracting by every row element
}
inc(x) <- 1 # increment x
inc(y) <- 1 # increment y
}
temp2<-temp1
data.sorted<-rbind(data.sorted,temp2) # turn rows into columns
}
Your code is too complex. You can do this using only one for loop, without external packages, like this:
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol=1), stringsAsFactors = FALSE)
newData <- data.frame(row.names=1:10)
for (i in 1:((nrow(myData)+1)/11)) {
start <- 11*i - 10
newData[[paste0("col", i)]] <- myData$V1[start:(start+9)]
}
You don't actually need all this though. You can simply remove the empty lines, split the vector in chunks of size 10 (as explained here) and then turn the list into a data frame.
vec <- myData$V1[nchar(myData$V1)>0]
as.data.frame(split(vec, ceiling(seq_along(vec)/10)))
# X1 X2
# 1 a b
# 2 a b
# 3 a b
# 4 a b
# 5 a b
# 6 a b
# 7 a b
# 8 a b
# 9 a b
# 10 a b
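When the cleaned vector's length is an exact multiple of 10, the same reshaping can also be sketched with matrix(), which fills column by column. That length requirement is an assumption, and the toy data here mirrors the "a"/"b" example rather than the real file.

```r
# Toy vector: two blocks of 10 values, empty separator already removed
vec <- c(rep("a", 10), rep("b", 10))

# matrix() fills column-wise, so each block of 10 becomes one column
wide <- as.data.frame(matrix(vec, nrow = 10), stringsAsFactors = FALSE)
wide
```

The columns are named V1 and V2 by default and can be renamed with names(wide) if needed.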
We could create a numeric index based on the '' values in the 'V2' column, split the dataset, use Reduce/merge to get the columns in the wide format.
indx <- cumsum(myData$V2=='')+1
res <- Reduce(function(...) merge(..., by= 'V1'), split(myData, indx))
res1 <- res[order(factor(res$V1, levels=myData[1:10, 1])),]
colnames(res1)[-1] <- paste0('Col', 1:3)
head(res1,3)
# V1 Col1 Col2 Col3
#2 ProductId B000179R3I B0000C3XXN B0000C3XX9
#4 product_title Amazon.com Amazon.com Amazon.com
#3 product_price unknown unknown unknown
From the p1.png, the 'V1' column can also be the column names for the values in 'V2'. If that is the case, we can 'transpose' the 'res1' except the first column and change the column names of the output with the first column of 'res1' (setNames(...))
res2 <- setNames(as.data.frame(t(res1[-1]), stringsAsFactors=FALSE),
res1[,1])
row.names(res2) <- NULL
res2[] <- lapply(res2, type.convert)
head(res2)
# ProductId product_title product_price userid
#1 B000179R3I Amazon.com unknown A3Q0VJTU04EZ56
#2 B0000C3XXN Amazon.com unknown A34JM8F992M9N1
#3 B0000C3XX9 Amazon.com unknown A34JM8F993MN91
# profileName helpfulness reviewscore review_time
#1 Jeanmarie Kabala "JP Kabala" 7/7 4 1182816000
#2 M. Shapiro 6/6 5 1205107200
#3 J. Cruze 8/8 5 120571929
# review_summary
#1 Periwinkle Dartmouth Blazer
#2 great classic jacket
#3 Good jacket
# review_text
#1 I own the Austin Reed dartmouth blazer in every color
#2 This is the second time I bought this jacket
#3 This is the third time I bought this jacket
I guess this is just a reshaping issue. In that case, we can use dcast from data.table to convert from long to wide format
library(data.table)
DT <- dcast(setDT(myData)[V1!=''][, N:= paste0('Col', 1:.N) ,V1], V1~N,
value.var='V2')
data
myData <- structure(list(V1 = c("ProductId", "product_title",
"product_price",
"userid", "profileName", "helpfulness", "reviewscore", "review_time",
"review_summary", "review_text", "", "ProductId", "product_title",
"product_price", "userid", "profileName", "helpfulness",
"reviewscore",
"review_time", "review_summary", "review_text", "", "ProductId",
"product_title", "product_price", "userid", "profileName",
"helpfulness",
"reviewscore", "review_time", "review_summary", "review_text"
), V2 = c("B000179R3I", "Amazon.com", "unknown", "A3Q0VJTU04EZ56",
"Jeanmarie Kabala \"JP Kabala\"", "7/7", "4", "1182816000",
"Periwinkle Dartmouth Blazer",
"I own the Austin Reed dartmouth blazer in every color", "",
"B0000C3XXN", "Amazon.com", "unknown", "A34JM8F992M9N1",
"M. Shapiro",
"6/6", "5", "1205107200", "great classic jacket",
"This is the second time I bought this jacket",
"", "B0000C3XX9", "Amazon.com", "unknown", "A34JM8F993MN91",
"J. Cruze", "8/8", "5", "120571929", "Good jacket",
"This is the third time I bought this jacket"
)), .Names = c("V1", "V2"), row.names = c(NA, 32L),
class = "data.frame")
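For comparison, a base R sketch of the same long-to-wide step uses split() on the field names. It assumes every record carries the same set of fields in V1 (otherwise as.data.frame() fails on unequal lengths), and the small `myData` below is a cut-down, two-field stand-in for the structure above.

```r
# Cut-down sample: two records separated by an empty row
myData <- data.frame(
  V1 = c("ProductId", "product_price", "", "ProductId", "product_price"),
  V2 = c("B000179R3I", "unknown", "", "B0000C3XXN", "unknown"),
  stringsAsFactors = FALSE
)

# Drop the separator rows, then group the values by field name
keep <- myData$V1 != ""
wide <- as.data.frame(split(myData$V2[keep], myData$V1[keep]),
                      stringsAsFactors = FALSE)
wide
```

Each field becomes a column and each record a row, with rows in the order the records appear in the file.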
