How can I find and replace values between two dataframes in R

I have a dataframe from tidytext that contains the individual words from some survey free-response comments. It has just shy of 500,000 rows. Being free-response data, it is riddled with typos. Using textclean::replace_misspellings took care of almost 13,000 misspelled words, but there were still ~700 unique misspellings that I manually identified.
I now have a second table with two columns, the first is the misspelling and the second is the correction.
For instance
allComments <- data.frame("Number" = 1:5, "Word" = c("organization","orginization", "oragnization", "help", "hlp"))
misspellings <- data.frame("Wrong" = c("orginization", "oragnization", "hlp"), "Right" = c("organization", "organization", "help"))
How can I replace all the values of allComments$Word that match misspellings$Wrong with misspellings$Right?
I feel like this is probably pretty basic and my R ignorance is showing....

You can use match to find the index for words from allComments$Word in misspellings$Wrong and then use this index to subset them.
# match() gives, for each Word, the position of its match in Wrong (or NA if none)
tt <- match(allComments$Word, misspellings$Wrong)
# overwrite only the matched words with the corresponding corrections
allComments$Word[!is.na(tt)] <- misspellings$Right[tt[!is.na(tt)]]
allComments
#  Number         Word
#1      1 organization
#2      2 organization
#3      3 organization
#4      4         help
#5      5         help
In case allComments$Word is a factor (the data.frame() default before R 4.0) and a correction is not among its levels, convert the column to character first:
allComments$Word <- as.character(allComments$Word)
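For context, a minimal sketch of the factor pitfall (assuming a session where the column ended up as a factor):
f <- factor(c("hlp", "help"))
f[1] <- "organization" # "organization" is not one of the levels of f
# Warning: invalid factor level, NA generated
f
# [1] <NA> help
# Levels: help hlp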

Here is another base R solution using replace(), which substitutes values at the given index positions:
allComments <- within(allComments,
                      Word <- replace(Word,
                                      which(!is.na(match(Word, misspellings$Wrong))),
                                      na.omit(misspellings$Right[match(Word, misspellings$Wrong)])))
such that
> allComments
  Number         Word
1      1 organization
2      2 organization
3      3 organization
4      4         help
5      5         help

library(dplyr)
allComments %>%
  left_join(misspellings, by = c("Word" = "Wrong")) %>%
  mutate(Word = coalesce(as.character(Right), Word))
#   Number         Word        Right
# 1      1 organization         <NA>
# 2      2 organization organization
# 3      3 organization organization
# 4      4         help         <NA>
# 5      5         help         help
You can, of course, drop the Right column when you're done with it.
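For instance, the full pipeline with the cleanup appended might look like this (a sketch, assuming dplyr is loaded):
allComments %>%
  left_join(misspellings, by = c("Word" = "Wrong")) %>%
  mutate(Word = coalesce(as.character(Right), Word)) %>%
  select(-Right)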

Here's a data.table solution:
library(data.table)
setDT(allComments)
setDT(misspellings)
df <- merge.data.table(allComments, misspellings, all.x = TRUE, by.x = "Word", by.y = "Wrong")
df <- df[!is.na(Right), Word := Right]
df <- df[, c("Number", "Word")]
df <- df[order(Number)]
df
#    Number         Word
# 1:      1 organization
# 2:      2 organization
# 3:      3 organization
# 4:      4         help
# 5:      5         help
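As a further data.table option, an update join can fix the words in place without the merge and cleanup steps — a minimal sketch, assuming Word is stored as character:
library(data.table)
setDT(allComments)
setDT(misspellings)
# for rows whose Word matches misspellings$Wrong, overwrite Word with the
# matching correction; the i. prefix refers to columns of misspellings
allComments[misspellings, on = .(Word = Wrong), Word := i.Right]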

bind tables of different length

Thank you, good people! This must be simple, but I've been banging my head against it for a while. Please help. I have a large dataset from which I get all kinds of information via table(). I then want to store that information, which is essentially different counts, so I also want to store the row names that were counted. For a reproducible example, consider:
```
a <- c("a","b","c","d","a","b") # one count, occurring twice for a and b, once for c and d
b <- c("a","c") # a completely different property from the dataset, occurring once for a and c
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
```
How can I merge/bind/whatever to get from x and y to this form:
  x y
a 2 1
b 2 0
c 1 1
d 1 0
HOWEVER, I need to use the solution iteratively, in a loop that first takes x and y into the requested form above and then has more tables added, each hopefully contributing a column. One of my many failed attempts, just to show the logic, is:
member <- function(data = dfm, groupvar = 'group', analysis = kc15) {
  res <- matrix(NA, ncol = length(analysis$size) + 1)
  res[, 1] <- table(docvars(data, groupvar))
  for (i in 1:length(analysis$size)) {
    r <- table(docvars(data, groupvar)[analysis$cluster == i])
    res <- cbind(res, r)
  }
  res
}
So, to sum up: the reproducible example above stands in for the first column of res and one r, and I'm seeking (I think) a correct alternative to the cbind, one that allows adding columns of different lengths but shared names, as in the example above.
Please help; it's embarrassing how much time I'm wasting on this.
In base R, you can use table and stack, then full-join the two counts with merge():
out <- merge(stack(table(a)), stack(table(b)), by = 'ind', all = TRUE)
out
#   ind values.x values.y
# 1   a        2        1
# 2   b        2       NA
# 3   c        1        1
# 4   d        1       NA
If you want to replace NA with 0, you can do :
out[is.na(out)] <- 0
One purrr and tidyr solution could be:
library(purrr)
library(tidyr)
map_dfr(lst, ~ stack(table(.)), .id = "ID") %>%
  pivot_wider(names_from = "ID", values_from = "values", values_fill = list(values = 0))
  ind       a     b
  <chr> <int> <int>
1 a         2     1
2 b         2     0
3 c         1     1
4 d         1     0
lst being:
lst <- list(a = a, b = b)
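If the tables arrive one at a time in a loop rather than as a ready-made list, a base R sketch with Reduce() could look like this (reusing a and b from above; the values.x/values.y names would still need tidying):
tbls <- lapply(list(a = a, b = b), function(v) stack(table(v)))
out <- Reduce(function(l, r) merge(l, r, by = "ind", all = TRUE), tbls)
out[is.na(out)] <- 0
out
#   ind values.x values.y
# 1   a        2        1
# 2   b        2        0
# 3   c        1        1
# 4   d        1        0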

How to find a sequence in r in the middle of the texts?

Say there is a string of t's and f's. How might one use grep to find a pattern that starts with f, stays at f for some time, then switches to t? I want to count the number of steps it stays at t.
a <- "fffftttfff"
b <- "fttttttfff"
c <- "tttttttttt"
d <- "fffffffftf"
path_ <- c(a,b,c,d)
ID <- 1:4
tf_dt <- data.table("ID" = ID,"path" = path_)
tf_dt
   ID       path
1:  1 fffftttfff
2:  2 fttttttfff
3:  3 tttttttttt
4:  4 fffffffftf
dt_raw <- tf_dt[,-1]
s <- paste0(as.vector(t(dt_raw)), collapse = "")
v <- substring(s,seq(1,nchar(s)-9,10), seq(10,nchar(s),10))
idx <- grep("^f*f.+t",v)
dt_final <- data.frame("ID" = tf_dt$ID, count = FALSE, time = NA)
dt_final$count[idx] <- TRUE
dt_final$time[idx] <- ???
What I reckon I should do is remove the leading run of f's and everything after the first run of t's ends. However, I am not sure how I might be able to do that. Any help is appreciated.
My attempt:
nchar(gsub("^f*f","",gsub("something that relates to the end of the string","",v)))
More attempts:
# If I do gsub("^f*f+t*","",v) it gives me the last string that I want to remove
# But I can't do something like
nchar(gsub("^f*f","",gsub("gsub("^f*f+t*","",v)$",""v)))
Expected Output:
tf_count <- c(TRUE,TRUE,FALSE,TRUE)
tf_time <- c(3,6,NA,1)
output <- data.table("ID" = ID, "count" = tf_count,"time_taken" = tf_time)
#    ID count time_taken
# 1:  1  TRUE          3
# 2:  2  TRUE          6
# 3:  3 FALSE         NA
# 4:  4  TRUE          1
Also, a side note: is there somewhere I can look at a lot of examples of how grep() and the stringr package work? (From what I have seen, I think this falls under stringr.) I tried reading things on this, but nothing really came of it, and I am still as confused as before. Thanks.
A solution in base R using grepl and gsub, which you have already tried in the question:
# TRUE when the path starts with a run of f's followed immediately by t's
tf_count <- grepl("^f+t+", tf_dt$path)
# keep only the first run of t's (the captured group \\1) and count its characters
tf_time <- nchar(gsub("^f+(t+).*", "\\1", tf_dt$path))
tf_time[!tf_count] <- NA
output <- data.frame("ID" = ID, "count" = tf_count, "time_taken" = tf_time)
output
#   ID count time_taken
# 1  1  TRUE          3
# 2  2  TRUE          6
# 3  3 FALSE         NA
# 4  4  TRUE          1
One way would be to find the number of t's after removing the first set of f's, which can be achieved by:
library(data.table)
tf_dt[, time_taken := NA_integer_]
tf_dt[grep('^f', path), time_taken := nchar(sub('^f*(t{1,}).*', '\\1', path))]
tf_dt
#    ID       path time_taken
# 1:  1 fffftttfff          3
# 2:  2 fttttttfff          6
# 3:  3 tttttttttt         NA
# 4:  4 fffffffftf          1
If you are interested in a stringr & tidyverse solution, try the following code. I borrowed the piece of regex "^f*(t{1,})" from Ronak Shah's excellent answer:
library(dplyr)
library(stringr)
tf_dt %>%
  mutate(count = str_detect(path, "ft"),
         time_taken = ifelse(count, str_count(str_extract(path, "^f*(t{1,})"), "t"), NA))
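If I have traced the regex correctly, this should print something like:
#    ID       path count time_taken
# 1:  1 fffftttfff  TRUE          3
# 2:  2 fttttttfff  TRUE          6
# 3:  3 tttttttttt FALSE         NA
# 4:  4 fffffffftf  TRUE          1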

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, since I read a lot of questions and answers here (it's a lot of trial and error, but that's OK).
Now I need to write a more sophisticated piece of code and I can't find a way to do it.
This is the situation: I have multiple sub-datasets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-dataset (140,000 entries) where the 'ID' variable is the same.
As this is the third day I'm trying to solve this, I have already found and tested for and apply, but both ran for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i, 4]
  SUM_Entries <- sum(Sub03$Source == Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # the Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i, 13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i, 13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need two things:
- sum up some values in one dataset
- add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
               stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is a character column for now; we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a; note that the identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
From what I understand, you want to create a new variable that uses information from two different datasets indexed by the same ID. The easiest way to do this is probably to join the datasets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the datasets into one, it should be easy to create the new columns you need, e.g. df$new <- df$old1 + df$old2.
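Applied to the structure in the question, a sketch might look like the following (the names Sub02, Sub03, Source, ID, VAL1, VAL2 and VAL9 are taken from the question; the exact aggregation is my guess at what the loop computes):
library(dplyr)
# aggregate Sub03 once: per Source, count the entries and sum VAL1 and VAL2
sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(n_entries = n(),
            sum_val1 = sum(as.numeric(VAL1)),
            sum_val2 = sum(as.numeric(VAL2)))
# join onto Sub02 by ID and update VAL9 in one vectorised step
Sub02 <- Sub02 %>%
  left_join(sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = as.numeric(VAL9) +
           coalesce(n_entries + sum_val1 + sum_val2, 0)) %>%
  select(-n_entries, -sum_val1, -sum_val2)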

count frequency of rows based on a column value in R

I understand that this is quite a simple question, but I haven't been able to find an answer to this.
I have a data frame which gives you the id of a person and his hobby. Since a person may have many hobbies, the id field may be repeated across multiple rows, each with a different hobby. I have been trying to print out only those rows which have more than one hobby. I was able to get the frequencies using table.
But how do I apply the condition to print only when the frequency is greater than one?
Secondly, is there a better way to find frequencies without using table?
This is my attempt with table, without the filter for frequency greater than one:
> id=c(1,2,2,3,2,4,3,1)
> hobby = c('play','swim','play','movies','golf','basketball','playstation','gameboy')
> df = data.frame(id, hobby)
> table(df$id)
1 2 3 4
2 3 2 1
Try using data.table; I find it more readable than using the table() functions:
library(data.table)
id = c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies',
          'golf','basketball','playstation','gameboy')
df = data.frame(id = id, hobby = hobby)
dt = as.data.table(df)
dt[, hobbies := .N, by = id]
You will get, for your condition:
> dt[hobbies > 1,]
   id       hobby hobbies
1:  1        play       2
2:  2        swim       3
3:  2        play       3
4:  3      movies       2
5:  2        golf       3
6:  3 playstation       2
7:  1     gameboy       2
This example assumes you are trying to filter df:
id = c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies','golf','basketball',
          'playstation','gameboy')
df = data.frame(id, hobby)
table(df$id)
Get all those ids that have more than one hobby:
tmp <- as.data.frame(table(df$id))
tmp <- tmp[tmp$Freq > 1,]
Using that information, select those IDs in df:
df1 <- df[df$id %in% tmp$Var1,]
df1
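As another base R option, ave() computes the per-id count aligned with each row, so no separate frequency table is needed — a one-line sketch:
# ave() returns, for every row, the size of that row's id group
df[ave(df$id, df$id, FUN = length) > 1, ]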

R Obtain unique records on data frame based on secondary field conditions

I have a really large table (~7 million records) which has the following structure.
temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime             Rating
                           A       2007-12-06T14:53:07Z 1
                           A       2006-07-28T03:52:26Z 4
                           B       2006-11-02T11:06:25Z 2
                           C       2007-06-19T06:56:08Z 5
                           C       2009-11-28T22:27:58Z 2
                           C       2009-11-28T22:28:13Z 2")
What I want to retrieve is the unique websites with a max rating per website:
Website Rating
A 4
B 2
C 5
I tried using a for loop, but it was too slow. Is there any other way I can achieve this?
do.call(rbind, lapply(split(temp, temp$Website),
                      function(d) d[which.max(d$Rating), ]))
  Website             Datetime Rating
A       A 2006-07-28T03:52:26Z      4
B       B 2006-11-02T11:06:25Z      2
C       C 2007-06-19T06:56:08Z      5
Since your 'Datetime' variable does not yet appear to be an actual Date or datetime object, you should probably convert it to one first.
which.max will pick the first item that is a maximum.
> which.max(c(1,1,2,2))
[1] 3
So Ananda may not be correct in his warning in that regard. data.table methods will certainly be more rapid and may also succeed where machine memory is modest: the method above may make several copies along the way, while data.table functions do not need to do as much copying.
I would probably explore the data.table package, though without more details, the following example solution is most likely not going to be what you need. I mention this because, in particular, there might be more than one "Rating" record per group which matches max; how would you like to deal with those cases?
library(data.table)
temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime   Rating
                           A       2012-10-9  10
                           A       2012-11-10 12
                           B       2011-10-9  5")
DT <- data.table(temp, key="Website")
DT
#    Website   Datetime Rating
# 1:       A  2012-10-9     10
# 2:       A 2012-11-10     12
# 3:       B  2011-10-9      5
DT[, list(Datetime = Datetime[which.max(Rating)],
          Rating = max(Rating)), by = key(DT)]
#    Website   Datetime Rating
# 1:       A 2012-11-10     12
# 2:       B  2011-10-9      5
I would recommend that to get better answers, you might want to include information like how your datetime variable might factor into your aggregation, or whether it is possible for there to be more than one "max" value per group.
If you want all the rows that match the max, the fix is easy:
DT[, list(Datetime = Datetime[Rating == max(Rating)],
          Rating = max(Rating)), by = key(DT)]
If you do just want the Rating column, there are many ways to go about this. Following the same steps as above to convert the original "temp" to a data.table, try:
DT[, list(Rating = max(Rating)), by = key(DT)]
#    Website Rating
# 1:       A      4
# 2:       B      2
# 3:       C      5
Or, keeping the original "temp" data.frame, try aggregate():
aggregate(Rating ~ Website, temp, max)
#   Website Rating
# 1       A      4
# 2       B      2
# 3       C      5
Yet another approach, using ave(); the comparison keeps every row whose Rating equals its website's maximum:
temp[with(temp, Rating == ave(Rating, Website, FUN = max)), ]
