I have a dataframe from tidytext that contains the individual words from some survey free-response comments. It has just shy of 500,000 rows. Being free-response data, it is riddled with typos. Using textclean::replace_misspellings took care of almost 13,000 misspelled words, but there were still ~700 unique misspellings that I manually identified.
I now have a second table with two columns, the first is the misspelling and the second is the correction.
For instance
allComments <- data.frame("Number" = 1:5, "Word" = c("organization","orginization", "oragnization", "help", "hlp"))
misspellings <- data.frame("Wrong" = c("orginization", "oragnization", "hlp"), "Right" = c("organization", "organization", "help"))
How can I replace all the values of allComments$word that match misspellings$wrong with misspellings$right?
I feel like this is probably pretty basic and my R ignorance is showing....
You can use match to find the index for words from allComments$Word in misspellings$Wrong and then use this index to subset them.
tt <- match(allComments$Word, misspellings$Wrong)
allComments$Word[!is.na(tt)] <- misspellings$Right[tt[!is.na(tt)]]
allComments
# Number Word
#1 1 organization
#2 2 organization
#3 3 organization
#4 4 help
#5 5 help
In case the right word is not already in allComments$Word cast it to a character:
allComments$Word <- as.character(allComments$Word)
Here is another base R solution using replace()
allComments <- within(allComments,
Word <- replace(Word,
which(!is.na(match(Word,misspellings$Wrong))),
na.omit(misspellings$Right[match(Word,misspellings$Wrong)])))
such that
> allComments
Number Word
1 1 organization
2 2 organization
3 3 organization
4 4 help
5 5 help
allComments %>%
left_join(misspellings, by = c("Word" = "Wrong")) %>%
mutate(Word = coalesce(as.character(Right), Word))
# Number Word Right
# 1 1 organization <NA>
# 2 2 organization organization
# 3 3 organization organization
# 4 4 help <NA>
# 5 5 help help
You can, of course, drop the Right column when you're done with it.
Here's a data.table solution:
library(data.table)
setDT(allComments)
setDT(misspellings)
df <- merge.data.table(allComments, misspellings, all.x = T, by.x = "Word", by.y = "Wrong")
df <- df[!(is.na(Right)), Word := Right]
df <- df[, c("Number", "Word")]
df <- df[order(Number)]
df
# Number Word
#1: 1 organization
#2: 2 organization
#3: 3 organization
#4: 4 help
#5: 5 help
I have a problem with split column value when element of column has different number of strings. I can do it in plyr e.g.:
library(plyr)
column <- c("jake", "jane jane","john john john")
df <- data.frame(1:3, name = column)
df$name <- as.character(df$name)
df2 <- ldply(strsplit(df$name, " "), rbind)
View(df2)
As a result, we have data frame with number of column related to maximum number of stings in given element.
When I try to do it in dplyr, I used do function:
library(dplyr)
df2 <- df %>%
do(data.frame(strsplit(.$name, " ")))
but I get an error:
Error in data.frame("jake", c("jane", "jane"), c("john", "john", "john" :
arguments imply differing number of rows: 1, 2, 3
It seems to me that it should be used rbind function but I do not know where.
You're having troubles because strsplit() returns a list which we then need to apply as.data.frame.list() to each element to get it into the proper format that dplyr requires. Even then it would still require a bit more work to get usable results. Long story short, it doesn't seem like a suitable operation for do().
I think you might be better off using separate() from tidyr. It can easily be used with dplyr functions and chains. It's not clear whether you want to keep the first column since your ldply result for df2 does not have it, so I left it off.
library(tidyr)
separate(df[-1], name, 1:3, " ", extra = "merge")
# 1 2 3
# 1 jake <NA> <NA>
# 2 jane jane <NA>
# 3 john john john
You could also use cSplit. It is also very efficient since it relies on data.table
library(splitstackshape)
cSplit(df[-1], "name", " ")
# name_1 name_2 name_3
# 1: jake NA NA
# 2: jane jane NA
# 3: john john john
Or more specifically
setnames(df2 <- cSplit(df[-1], "name", " "), names(df2), as.character(1:3))
df2
# 1 2 3
# 1: jake NA NA
# 2: jane jane NA
# 3: john john john
Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, can be multiple codes concatenated into a single column, seperated by {. It is also possible for a row to have no interest_string values at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 Z0
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should results in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge but that simply appends the columns (much like cbind)
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically what you had as your idea, but with a loop. The data sets are looked at to see if they are numeric (add them numerically) or a string or factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric and one for the character, but this is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that is assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as previous, but uses merge to make sure that the data is correctly aligned, and then works on columns (whose names are postfixed with ".x" and ".y") with dfNew. Additional steps are included to get rid of the separate columns after joining. Also has the bonus feature of carrying along any other columns not specified for joining together in cols.
If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this :
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column colum of the key (e.g. B==2), but still using binary search not vector scan? I'm aware that's a duplicate of :
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows and solution provided in dup question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
C = letters[1:20], key=c('A','B','C'))
make.key <- function(ddd, what){
# the names of the key columns
zzz <- key(ddd)
# the key columns you wish to keep all unique values
whichUnique <- setdiff(zzz, names(what))
## unique data.table (when keyed); .. means "look up one level"
ud <- lapply([, ..whichUnique], unique)
## append the `what` columns and a Cross Join of the new
## key columns
do.call(CJ, c(ud,what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
Adding secondary keys is on the feature request list :
FR#1007 Build in secondary keys
In the meantime we are stuck with either vector scan, or the approach used in the answer to the n=2 case linked in the question (which #mnel generalises nicely in his answer).