I have a fairly large data.frame with outdated names, and I want to replace them with the correct names stored in another data.frame.
I am using the stringdist function to find the closest match between the two name columns, and then I put the new names into the original data.frame.
My current code is based on sapply, as in the following example:
dat1 <- data.frame(name = paste0("abc", 1:5),
                   value = round(rnorm(5), 1))
dat2 <- data.frame(name = paste0("abd", 1:5),
                   other_info = 11:15)  # note: seq(11:15) would give 1:5, not 11:15
dat1$name2 <- sapply(dat1$name, function(x) {
  # distance from x to every candidate name; keep the closest one
  char_min <- stringdist::stringdist(x, dat2$name)
  dat2[which.min(char_min), "name"]
})
dat1
However, this code is too slow considering the size of my data.frame.
Is there a more optimized alternative solution, for example using the data.table R package?
First convert the data frames into data tables:
library(data.table)
dat1 <- data.table(dat1)
dat2 <- data.table(dat2)
Then use the ":=" and "amatch" command to create a new column that approximately matches the two names:
dat1[,name2 := dat2[stringdist::amatch(name, dat2$name)]$name]
This should be much faster than the sapply function. Hope this helps!
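One caveat worth checking against ?amatch for your installed version: amatch() only reports a match when the string distance is within its maxDist argument, which defaults to 0.1, so for names that differ by a whole character (as in the example data) it returns NA. Raising maxDist allows inexact matches:
dat1[, name2 := dat2$name[stringdist::amatch(name, dat2$name, maxDist = Inf)]]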
I have 40 different data.frames with no system to their names, e.g. nat69eqte, nahi_il and nahc_cpwtre. I want to create a function/macro in R that can run the following code easily for all the data frames:
nat69eqte_wide <- spread(nat69eqte, key = time, value = values)
attr(nat69eqte_wide, "Symname") <- "nat69eqte"
lst_nat69eqte_wide <- wgdx.reshape(nat69eqte_wide, 2)
In each data.frame there are the columns time and values to be passed to spread.
Without any knowledge of your data, I can try to guess:
The following function assumes that each data.frame has the columns time and values to be passed to spread. If that is not the case, you can add these column names as arguments of the function.
NB: spread is now superseded; use pivot_wider instead.
library(tidyr)  # spread()
library(rlang)  # enquo(), as_label()

myfun <- function(df) {
  name <- as_label(enquo(df))  # recover the name of the data frame passed in
  df_wide <- spread(df, key = time, value = values)
  attr(df_wide, "Symname") <- name
  lst_df_wide <- wgdx.reshape(df_wide, 2)  # wgdx.reshape() is from the gdxrrw package
  return(lst_df_wide)
}
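Because the data frame's name is captured with enquo(), the function has to be called directly on each data frame (rather than through lapply over a list, which would lose the names), e.g.:
lst_nat69eqte_wide <- myfun(nat69eqte)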
I'm trying to convert a list of vectors into a data frame, with one column for the company names and one column for the MPE. My list is generated by running the following code for each company:
MPE[[2]] <- c("Google", abs(((forecasted - goog[nrow(goog), ]$close)
                             / goog[nrow(goog), ]$close) * 100))
Now I'm having trouble turning it into the appropriate data frame for further manipulation. What's the easiest way to do this?
This is an example list of vectors that I would want to turn into a dataframe, with the company names in one column and the numbers in the second column.
test <- list(c("Google", 2))
test[[2]] <- c("Microsoft", 3)
test[[3]] <- c("Apple", 4)
You can use unlist with matrix and then turn the result into a data frame. Reducing with rbind could take a long time with a large list, I think.
df <- data.frame(matrix(unlist(test), nrow = length(test), byrow = TRUE))
colnames(df) <- c("Company", "MPE")
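One thing to watch for: because each vector mixes a character name with a number, unlist() coerces everything to character, so the MPE column comes out as character and needs converting back:
df$MPE <- as.numeric(df$MPE)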
I was actually able to achieve what I wanted with the following:
MPE_df <- data.frame(Reduce(rbind ,MPE))
colnames(MPE_df) <- c("Company", "MPE")
MPE_df
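As a side note, do.call(rbind, MPE) builds the whole matrix in a single call instead of rbinding pairwise, so it should scale better than Reduce on a long list:
MPE_df <- data.frame(do.call(rbind, MPE))
colnames(MPE_df) <- c("Company", "MPE")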
I have two dataframes. Both have one column with an ID number linked to a DNA sequence, and another column with the DNA sequence itself. One dataframe is the raw data; the other has been filtered to include only a subset of the raw data. What I'm now interested in doing is generating a .csv of all the sequences in the raw dataframe that have no match in the filtered dataframe.
So as an example of the goal, I'll define a couple dataframes here with two columns (col1 and col2):
col1a<-c(1,2,3,4,5,6)
col2a<-c("a","t","a","t","a","g")
col1b<-c(1,3,5,6)
col2b<-c("a","a","a","g")
df1<-data.frame(col1a,col2a)
df2<-data.frame(col1b,col2b)
my output wants to be this third dataframe (df3):
col1c <- c(2,4)
col2c <- c("t","t")
df3 <- data.frame(col1c,col2c)
I know I can use %in%. I can get this far:
IN <- sum(df1$col1a %in% df2$col1b) #Output = 4
NOTIN <- sum(!df1$col1a %in% df2$col1b) #Output = 2
So now I'm looking for a way to export the rows referred to by "NOTIN" so that they can be written as a table. I want to generate the example dataframe I called df3 earlier as my output.
Any help or suggestions are much appreciated :)
If df1 contains all the entries in df2, it's as simple as
df1[!df1$col1a %in% df2$col1b, ]
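With the example data above, this returns the two unmatched rows:
#   col1a col2a
# 2     2     t
# 4     4     t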
You can use an anti_join:
library(dplyr)
anti_join(df1, df2, by = c("col1a" = "col1b"))
You can do this in data.table as well:
library(data.table)
df1 <- data.table(df1, key = "col1a")
df2 <- data.table(df2, key = "col1b")
df1[!df2]
With version 1.9.5 (on GitHub, not on CRAN at the time of writing), you can use the on = syntax instead of setting a key:
df1[!df2, on = c(col1a = "col1b")]
I'd like to learn how to apply functions to specific columns of my dataframe without "excluding" the other columns from my df. For example, I'd like to multiply some specific columns by 1000 and leave the others as they are.
Using the sapply function for example like this:
a<-as.data.frame(sapply(table.xy[,1], function(x){x*1000}))
I get new dataframes with the first column multiplied by 1000 but without the other columns that I didn't use in the operation. So my attempt was to do it like this:
a<-as.data.frame(sapply(table.xy, function(x) if (colnames=="columnA") {x/1000} else {x}))
but this one didn't work.
My workaround was to give both dataframes an extra ID column and later merge the old dataframe with the newly created one to get a complete one. But I think there must be a better solution, isn't there?
If you only want to do a computation on one or a few columns you can use transform or simply index them manually:
# With transform:
df <- data.frame(A = 1:10, B = 1:10)
df <- transform(df, A = A*1000)
# Manually:
df <- data.frame(A = 1:10, B = 1:10)
df$A <- df$A * 1000
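For several columns at once, a common base R idiom is to index the columns by name and lapply over the selection (cols here is a hypothetical set of columns to scale):
# Scale the chosen columns, leaving the rest untouched:
cols <- c("A")
df[cols] <- lapply(df[cols], function(x) x * 1000)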
The following code will apply the desired function to only the columns you specify.
I'll create a simple data frame as a reproducible example.
(df <- data.frame(x = 1, y = 1:10, z=11:20))
(df <- cbind(df[1], apply(df[2:3],2, function(x){x*1000})))
Basically, use cbind() to select the columns you don't want the function to run on, then use apply() with the desired function on the target columns. One caveat: apply() coerces its input to a matrix, so if the selected columns had mixed types they would all end up as character.
In dplyr we would use mutate_at, in which you can select or exclude (by preceding the variable name with a minus sign) specific variables.
You can just name a function
library(dplyr)

df <- df %>%
  mutate_at(vars(columnA), scale)
or create your own
df <- df %>%
  mutate_at(vars(columnA, columnC), function(x) { x * 1000 })
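In more recent dplyr (1.0 and later), mutate_at is superseded by across(); a minimal sketch of the same idea, assuming columns named columnA and columnC:
df <- df %>%
  mutate(across(c(columnA, columnC), function(x) x * 1000))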
I have a data frame consisting of results from multiple runs of an experiment, each of which serves as a log, with its own ascending counter. I'd like to add another column to the data frame that has the maximum value of iteration for each distinct value of experiment.num in the sample below:
df <- data.frame(
  iteration = rep(1:5, 5),
  experiment.num = c(rep(1,5), rep(2,5), rep(3,5), rep(4,5), rep(5,5)),
  some.val = 42,
  another.val = 12
)
In this example, the extra column would look like this (as all the subsets have the same maximum for iteration):
df$max <- rep(5,25)
The naive solution I currently use is:
df$max <- sapply(df$experiment.num,function(exp.num) max(df$iteration[df$experiment.num == exp.num]))
I've also used sapply(unique(df$experiment.num), function(n) c(n,max(df$iteration[df$experiment.num==n]))) to build another frame which I can then merge with the original, but both of these approaches seem more complicated than necessary.
The experiment.num column is a factor, so I think I might be able to exploit that to avoid iteratively doing this naive subsetting for all rows.
Is there a better way to get a column of maximum values for subsets of a data.frame?
Using plyr:
library(plyr)
ddply(df, .(experiment.num), transform, max = max(iteration))
Using ave in base R:
df$i_max <- with(df, ave(iteration, experiment.num, FUN=max))
Here's a way in base R:
within(df[order(df$experiment.num), ],
max <- rep(tapply(iteration, experiment.num, max),
rle(experiment.num)$lengths))
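Note that this version reorders the rows of df by experiment.num (with the example data they already are in that order), so keep a copy or re-sort afterwards if the original order matters.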
I think you can use data.table:
install.packages("data.table")
library("data.table")
dt <- data.table(df)  # make your data frame into a data table
dt[, max.iteration := max(iteration), by = experiment.num]  # adds a column with each group's maximum of iteration