First of all, sorry for my English; I'm using Google Translate.
I have two data frames to which I apply fastLink:
df1<-data.frame(col1=c("pruebaA","pruebaA","pruebaA","pruebaB","pruebaB","pruebaB"),col2=c("avion","casa","coche","verde","antonio","jardin"), stringsAsFactors = FALSE)
df2<-data.frame(col1=c("pruebaA","pruebaA","pruebaA","pruebaB","pruebaB","pruebaA"),col2=c("avion","casa grande","coche rojo","Berde","antoñito","jardinn"), stringsAsFactors = FALSE)
library(fastLink)
prueba <- function(d1, d2) {
out <- fastLink(
dfA = d1, dfB = d2,
varnames = c("col1","col2"),
partial.match = c("col2"),
stringdist.match = c("col2")
)
indi<<- out$matches
dfA.match <<- d1[out$matches$inds.a,]
}
prueba(df1,df2)
I get indi and dfA.match, which I can then query.
How can I do the same when I have many data frames?
I can't get a loop to work.
For example, I split df1 and df2 into parts:
df1$M <- paste0(df1$col1, "_df1")
z <- split(df1,df1$M )
list2env(z, .GlobalEnv)
df2$M <- paste0(df2$col1,"_df2")
b <- split(df2,df2$M )
list2env(b, .GlobalEnv)
I get
-pruebaA_df1
-pruebaA_df2
-pruebaB_df1
-pruebaB_df2
prueba(pruebaA_df1,pruebaA_df2)
prueba(pruebaB_df1,pruebaB_df2)
This works!
The same with a loop:
unique(df1$col1)->nom2b
indices<- list()
uniones<- list()
for (i in nom2b){
d1<-paste0(i,"_df1")
d2<-paste0(i,"_df2")
#cat(d1)->d1
#cat(d2)->d2
prueba(d1,d2)
indices[[paste0("modelo",i)]]<-indi
uniones[[paste0("uniones",i)]]<- dfA.match
}
This is wrong, it doesn't work!
Assuming you have objects called pruebaA_df1, pruebaA_df2, ..., pruebaA_df1000 in your environment, you can use Reduce as follows:
result <- Reduce(prueba, mget(paste0('pruebaA_df', 1:1000)))
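A note on the loop in the question: d1 and d2 there are character strings holding object names, not the data frames themselves, so prueba() receives text. Wrapping them in get() would fetch the objects. A minimal sketch of an alternative that avoids global objects altogether is shown below. It assumes toy ASCII data modelled on the question, and prueba_stub is a hypothetical stand-in for the fastLink call (it does exact matching with match() so the example runs without the package); it returns its results instead of assigning them with <<-.

```r
# Toy data modelled on the question (assumption: shortened, ASCII only)
df1 <- data.frame(col1 = c("pruebaA", "pruebaA", "pruebaA", "pruebaB", "pruebaB", "pruebaB"),
                  col2 = c("avion", "casa", "coche", "verde", "antonio", "jardin"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("pruebaA", "pruebaA", "pruebaB", "pruebaB"),
                  col2 = c("avion", "casa grande", "verde", "jardinn"),
                  stringsAsFactors = FALSE)

# Hypothetical stand-in for prueba(): exact matching instead of fastLink,
# returning a list rather than writing to the global environment
prueba_stub <- function(d1, d2) {
  hit  <- match(d1$col2, d2$col2)
  keep <- !is.na(hit)
  list(indi      = data.frame(inds.a = which(keep), inds.b = hit[keep]),
       dfA.match = d1[keep, , drop = FALSE])
}

# Split both data frames by group and keep the pieces in lists
parts1 <- split(df1, df1$col1)
parts2 <- split(df2, df2$col1)
grupos <- intersect(names(parts1), names(parts2))

# Apply the function pairwise; results are named after the groups
resultados <- Map(prueba_stub, parts1[grupos], parts2[grupos])
indices <- lapply(resultados, `[[`, "indi")
uniones <- lapply(resultados, `[[`, "dfA.match")
```

To use this with fastLink, the body of prueba_stub would be replaced by the fastLink() call from the question, ending with list(indi = out$matches, dfA.match = d1[out$matches$inds.a, ]).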
I'm using a for loop to find each string in df2$x2 within another data frame column (df1$x1). My goal is to create a new column, df1$test, containing the matching df2$x2 value.
For example:
df1 <- data.frame(x1 = c("TE-T6-3 XYZ12X","TE-D31L-2 QWE12X","TE-H6-1 ABC12X","TE-D31L-2 QWE12X","EC20 QWX12X"),
Y = c(2017,2017,2018,2018,2017),
Sales = c(25,50,30,40,90))
df1$x1 <- as.character(as.factor(df1$x1))
df2 <- data.frame(x2 = c("TE-T6-5","TE-D31L-2","TE-H6-15","EC500","EC20","TE-D31L-2"),
Y = c(2018,2017,2018,2017,2018,2018),
P = c(100,300,200,50,150,300))
df2$x2 <- as.character(as.factor(df2$x2))
for(i in 1:nrow(df2)){
f <- df2[i,1]
df1$test <- ifelse(grepl(f, df1$x1),f,"not found")
}
What should I do after the loop ends? I know the problem is that df1$test is overwritten on every iteration. I tried an "if" statement to create a new data frame and save the outputs, but it didn't work: it only writes one specific string.
Thank you in advance.
Expected output:
df1 <- data.frame(x1 = c("TE-T6-3 XYZ12X","TE-D31L-2 QWE12X","TE-H6-1 ABC12X","TE-D31L-2 QWE12X","EC20 QWX12X"),
output = c("not found","TE-D31L-2","not found","TE-D31L-2","EC20"))
Do you want one new column for each string? If that is what you need, your code should be:
df1 <- data.frame(x1 = c("TE-T6-3 XYZ12X","TE-D31L-2 QWE12X","TE-H6-1 ABC12X","TE-D31L-2 QWE12X","EC20 QWX12X"),
Y = c(2017,2017,2018,2018,2017),
Sales = c(25,50,30,40,90))
df1$x1 <- as.character(as.factor(df1$x1))
df2 <- data.frame(x2 = c("TE-T6-5","TE-D31L-2","TE-H6-15","EC500","EC20","TE-D31L-2"),
Y = c(2018,2017,2018,2017,2018,2018),
P = c(100,300,200,50,150,300))
df2$x2 <- as.character(as.factor(df2$x2))
for(i in 1:nrow(df2)){
f <- df2[i,1]
df1$test <- grepl(f, df1$x1)    # temporary column: TRUE where f occurs in x1
colnames(df1)[ncol(df1)] <- f   # rename the temporary (last) column to f
}
This creates a new column with a temporary name and then renames it to the string being evaluated. I also replaced "not found" with FALSE, but you can use whatever value you want.
[EDIT:]
If you want that expected output, you can use this code:
df1 <- data.frame(x1 = c("TE-T6-3 XYZ12X","TE-D31L-2 QWE12X","TE-H6-1 ABC12X","TE-D31L-2 QWE12X","EC20 QWX12X"),
Y = c(2017,2017,2018,2018,2017),
Sales = c(25,50,30,40,90))
df1$x1 <- as.character(as.factor(df1$x1))
df2 <- data.frame(x2 = c("TE-T6-5","TE-D31L-2","TE-H6-15","EC500","EC20","TE-D31L-2"),
Y = c(2018,2017,2018,2017,2018,2018),
P = c(100,300,200,50,150,300))
df2$x2 <- as.character(as.factor(df2$x2))
df1$output <- "not found"
for(i in 1:nrow(df2)){
f <- df2[i,1]
df1$output[grepl(f, df1$x1)]<-f
}
This is very similar to what you did, but you need to index the rows you want to write to.
It only works when each row can have at most one match; it gets a little more complicated if a row can match more than one string. But I don't think that is your problem.
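For the multiple-match case mentioned above, one possible sketch is to build a logical matrix of rows by patterns and then collapse each row. The data are taken from the question; the names patterns and hits, the ", " separator, and the use of fixed = TRUE (literal matching rather than regular expressions) are choices made for this illustration.

```r
df1 <- data.frame(x1 = c("TE-T6-3 XYZ12X", "TE-D31L-2 QWE12X", "TE-H6-1 ABC12X",
                         "TE-D31L-2 QWE12X", "EC20 QWX12X"),
                  stringsAsFactors = FALSE)
# Unique search strings from df2$x2 in the question
patterns <- unique(c("TE-T6-5", "TE-D31L-2", "TE-H6-15", "EC500", "EC20", "TE-D31L-2"))

# hits[i, j] is TRUE when pattern j occurs literally in row i of df1$x1
hits <- sapply(patterns, function(p) grepl(p, df1$x1, fixed = TRUE))

# Collapse every row: all matching patterns, or "not found" when there are none
df1$output <- apply(hits, 1, function(r) {
  if (any(r)) paste(patterns[r], collapse = ", ") else "not found"
})
df1
```

With this data every row matches at most one pattern, so the result is the same as the expected output in the question; rows with several matches would list all of them.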
You simply need to split the df1$x1 strings on whitespace and merge (or match, since you are only interested in one variable) on df2$x2, i.e.
v1 <- sub('\\s+.*', '', df1$x1)
df2$x2[match(v1, df2$x2)]
#[1] NA "TE-D31L-2" NA "TE-D31L-2" "EC20"
I need to multiply all columns in a data frame with each other. As an example, I need to achieve the following:
mydata$C1_2<-mydata$sic1*mydata$sic2
but for all my columns with values going from 1 to 733 (sic1, sic2, sic3,..., sic733).
I've tried the following but it doesn't work:
for(i in 1:733){
for(j in 1:733){
mydata$C[i]_[j]<-mydata$sic[i]*mydata$sic[j]
}
}
Could you help me? Thanks for your help.
Leaving aside whether this is really what you want, I think this could help:
df <- data.frame(
a = 1:4
, b = 1:4
, c = 4:1
)
multiplyColumns <- function(name1, name2, df){
df[, name1] * df[, name2]
}
combinations <- expand.grid(names(df), names(df), stringsAsFactors = FALSE)
names4result <- paste(combinations[,1], combinations[,2], sep = "_")
result <- as.data.frame(mapply(multiplyColumns, combinations[,1], combinations[,2], MoreArgs = list(df = df)))
names(result) <- names4result
result
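If only the unique pairs are wanted (sic1*sic2 but not sic2*sic1 or sic1*sic1), a variation using combn() might look like the sketch below. The three-column mydata is a made-up miniature of the question's data, and the C<i>_<j> naming follows the question's C1_2 example.

```r
# Made-up miniature of the question's data: sic1 ... sic3 instead of sic1 ... sic733
mydata <- data.frame(sic1 = 1:3, sic2 = 4:6, sic3 = 7:9)

# Every unordered pair of column names, one pair per column of the matrix
pares <- combn(names(mydata), 2)

# One product vector per pair
productos <- lapply(seq_len(ncol(pares)), function(k) {
  mydata[[pares[1, k]]] * mydata[[pares[2, k]]]
})

# Name the products C1_2, C1_3, C2_3, ... as in the question
names(productos) <- paste0("C", sub("sic", "", pares[1, ]),
                           "_", sub("sic", "", pares[2, ]))

resultado <- cbind(mydata, as.data.frame(productos))
resultado
```

With 733 columns there are choose(733, 2) = 268278 such pairs, so the result is very wide; it is worth checking that it fits in memory before running this on the full data.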
I have a large set of data frames (around 50,000). Each data frame has two columns, key and value, with around 100-200 rows. My question is essentially similar to this and this. Following their ideas, I construct a list of data frames and use the Reduce function:
freq_martix<-Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "key", all = TRUE),
freq_list)
But my code has been running for several days. Is there a more efficient, faster way to merge a large set of data frames?
This way is pretty fast.
First of all I created 500 tables, each containing 150 key-value pairs.
library(data.table)
library(stringi)
for (i in 1:500) {
set.seed(i)
dfNam <- paste('df', i, sep = '_')
df <- data.frame( cbind(key = tolower(stri_rand_strings(150, 1, pattern = '[A-Za-z]')), value = sample(1:1000, 150, replace = TRUE)) )
assign(dfNam, df)
rm(df)
rm(dfNam)
}
Then I transposed and appended them:
tmp <- data.table()
for (i in ls(pattern = 'df_') ) {
df <- get(i)
dt <- data.table( transpose(df) )
colnames(dt) <- as.character(unlist(dt[1, ]))
dt <- dt[-1, ]
tmp <- rbindlist(list(tmp, dt), use.names = TRUE, fill = TRUE)
}
Finally, I transposed the result back:
merged_data <- transpose(tmp)
key <- colnames(tmp)
merged_data <- cbind(key, merged_data)
Works like a charm.
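Another way to avoid the repeated merge() calls is to stack all tables into one long table and reshape it once. The sketch below uses base R only and assumes that keys are unique within each table; freq_list here is small made-up data, and the names long and freq_matrix are chosen for the illustration.

```r
set.seed(1)
# Made-up stand-in for the real list: 5 tables with 4 unique keys each
freq_list <- lapply(1:5, function(i) {
  data.frame(key = sample(letters[1:8], 4), value = sample(1:100, 4),
             stringsAsFactors = FALSE)
})

# Stack everything into one long table, tagging each row with its table number
long <- do.call(rbind, Map(function(d, id) cbind(d, id = id),
                           freq_list, seq_along(freq_list)))

# Reshape once: one row per key, one column per table, NA where a key is absent
freq_matrix <- tapply(long$value, list(key = long$key, id = long$id),
                      function(v) v[1])
freq_matrix
```

For 50,000 tables the stacking step is probably the bottleneck in base R; data.table::rbindlist() with its idcol argument followed by dcast() follows the same idea and is likely faster, although that has not been timed here.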
I have two separate lists, each containing 4 data.frames. I need to perform a Student's t-test (t.test) on rainfall between the corresponding data.frames of the two lists.
Here are the lists:
lst1 = list(data.frame(rnorm(20), rnorm(20)), data.frame(rnorm(25), rnorm(25)), data.frame(rnorm(16), rnorm(16)), data.frame(rnorm(34), rnorm(34)))
lst1 = lapply(lst1, setNames, c('rainfall', 'snow'))
lst2 = list(data.frame(rnorm(19), rnorm(19)), data.frame(rnorm(38), rnorm(38)), data.frame(rnorm(22), rnorm(22)), data.frame(rnorm(59), rnorm(59)))
lst2 = lapply(lst2, setNames, c('rainfall', 'snow'))
What I would need to do is:
t.test(lst1[[1]]$rainfall, lst2[[1]]$rainfall)
t.test(lst1[[2]]$rainfall, lst2[[2]]$rainfall)
t.test(lst1[[3]]$rainfall, lst2[[3]]$rainfall)
t.test(lst1[[4]]$rainfall, lst2[[4]]$rainfall)
I can do it as above by writing out each of the 4 data.frames (I actually have 40 in my real data), but I would like to know if there is a smarter and quicker way to do it.
Here is what I tried (without success):
myfunction = function(x,y) {
test = t.test(x, y)
return(test)
}
result = mapply(myfunction, x=lst1, y=lst2)
x <- NULL
for (i in seq_along(lst1)){
x[[i]] <- t.test(lst1[[i]]$rainfall, lst2[[i]]$rainfall)
}
x
Works for me. I would use SIMPLIFY = FALSE to get the results formatted better, though.
lst1 <- list()
lst1[[1]] <- data.frame(rainfall = rnorm(10))
lst1[[2]] <- data.frame(rainfall = rnorm(10))
lst2 <- list()
lst2[[1]] <- data.frame(rainfall = rnorm(10))
lst2[[2]] <- data.frame(rainfall = rnorm(10))
myfunction = function(x,y) {
test = t.test(x$rainfall, y$rainfall)
return(test)
}
mapply(myfunction, x = lst1, y = lst2, SIMPLIFY = FALSE)
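The same pairwise idea can be written with Map(), which always returns a list, and the p-values can then be pulled out with vapply(). This is only a sketch on made-up data; the names tests and p_values are chosen for the illustration.

```r
set.seed(42)
# Made-up data shaped like the lists in the question
lst1 <- list(data.frame(rainfall = rnorm(20)), data.frame(rainfall = rnorm(25)))
lst2 <- list(data.frame(rainfall = rnorm(19)), data.frame(rainfall = rnorm(38)))

# One t-test per pair of data frames; the result is a list of "htest" objects
tests <- Map(function(a, b) t.test(a$rainfall, b$rainfall), lst1, lst2)

# Extract one number per test, here the p-value
p_values <- vapply(tests, function(tt) tt$p.value, numeric(1))
p_values
```

The attempt in the question fails because whole data frames are passed to t.test(); selecting the rainfall column inside the function, as here, is what makes it work.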
I'm trying to build a dataset before plotting it. I decided to use a function factory, gammaplot.ff(), and the first version of my code looks like this:
PowerUtility1d <- function(x, delta = 4) {
return(((x+1)^(1 - delta)) / (1 - delta))
}
PowerUtility1d <- Vectorize(PowerUtility1d, "x")
# function factory allows multiparametrization of PowerUtility1d()
gammaplot.ff <- function(type, gamma) {
ff <- switch(type,
original = function(x) PowerUtility1d(x/10, gamma),
pnorm_wrong = function(x) PowerUtility1d(2*pnorm(x)-1, gamma),
pnorm_right = function(x) PowerUtility1d(2*pnorm(x/3)-1, gamma)
)
ff
}
gammaplot.df <- data.frame(type=numeric(), gamma=numeric(),
x=numeric(), y=numeric())
gammaplot.gamma <- c(1.1, 1.3, 1.5, 2:7)
gammaplot.pts <- (-1e4:1e4)/1e3
# building the data set
for (gm in gammaplot.gamma) {
for (tp in c("original", "pnorm_wrong", "pnorm_right")) {
fpts <- gammaplot.ff(tp, gm)(gammaplot.pts)
dataChunk <- cbind(tp, gm, gammaplot.pts, fpts)
colnames(dataChunk) <- names(gammaplot.df)
gammaplot.df <- rbind(gammaplot.df, dataChunk)
}
}
# rbind()/cbind() cast all data to character, but x and y are numeric
gammaplot.df$x <- as.numeric(as.character(gammaplot.df$x))
gammaplot.df$y <- as.numeric(as.character(gammaplot.df$y))
It turns out that the whole data frame contains character data, so I have to convert it back manually (it took me a while to discover that in the first place!). A search on SO indicates that this happens because the type variable is character. To avoid this (you can imagine the performance issues with character data while building the data set!) I changed the code a bit:
gammaplot.ff <- function(type, gamma) {
ff <- switch(type,
function(x) PowerUtility1d(x/10, gamma),
function(x) PowerUtility1d(2*pnorm(x)-1, gamma),
function(x) PowerUtility1d(2*pnorm(x/3)-1, gamma)
)
ff
}
for (gm in gammaplot.gamma) {
for (tp in 1:3) {
fpts <- gammaplot.ff(tp, gm)(gammaplot.pts)
dataChunk <- cbind(tp, gm, gammaplot.pts, fpts)
colnames(dataChunk) <- names(gammaplot.df)
gammaplot.df <- rbind(gammaplot.df, dataChunk)
}
}
This works fine for me, but I lost a self-explanatory character parameter, which is a downside. Is there a way to keep the first version of function factory without an implicit conversion of all data to character?
If there's another way of achieving the same result, I'd be happy to try it out.
You can use rbind.data.frame and cbind.data.frame instead of rbind and cbind.
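A related pattern, sketched below, is to build each chunk with data.frame() so that every column keeps its own type, collect the chunks in a list, and bind them once at the end. The function definitions are copied from the question; the short gammas and pts vectors and the chunks list are assumptions made to keep the example small.

```r
PowerUtility1d <- function(x, delta = 4) ((x + 1)^(1 - delta)) / (1 - delta)

# Function factory from the question, with the character "type" kept
gammaplot.ff <- function(type, gamma) {
  switch(type,
         original    = function(x) PowerUtility1d(x / 10, gamma),
         pnorm_wrong = function(x) PowerUtility1d(2 * pnorm(x) - 1, gamma),
         pnorm_right = function(x) PowerUtility1d(2 * pnorm(x / 3) - 1, gamma))
}

gammas <- c(1.1, 2)   # shortened stand-in for gammaplot.gamma
pts    <- c(-1, 0, 1) # shortened stand-in for gammaplot.pts
types  <- c("original", "pnorm_wrong", "pnorm_right")

chunks <- list()
for (gm in gammas) {
  for (tp in types) {
    # data.frame() keeps type as character and gamma, x, y as numeric
    chunks[[length(chunks) + 1]] <- data.frame(
      type = tp, gamma = gm, x = pts, y = gammaplot.ff(tp, gm)(pts),
      stringsAsFactors = FALSE)
  }
}
gammaplot.df <- do.call(rbind, chunks)
str(gammaplot.df)
```

Binding once at the end also avoids growing the data frame inside the loop, which tends to get slow as the number of chunks increases.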
I want to bring @mtelesha's comment to the front.
Use stringsAsFactors = FALSE in cbind or cbind.data.frame:
x <- data.frame(a = letters[1:5], b = 1:5)
y <- cbind(x, c = LETTERS[1:5])
class(y$c)
## "factor" (in R < 4.0.0; "character" from R 4.0.0 onward)
y <- cbind.data.frame(x, c = LETTERS[1:5])
class(y$c)
## "factor" (in R < 4.0.0; "character" from R 4.0.0 onward)
y <- cbind(x, c = LETTERS[1:5], stringsAsFactors = FALSE)
class(y$c)
## "character"
y <- cbind.data.frame(x, c = LETTERS[1:5], stringsAsFactors = FALSE)
class(y$c)
## "character"
UPDATE (May 5, 2020):
As of R version 4.0.0, R uses a stringsAsFactors = FALSE default in calls to data.frame() and read.table().
https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/
If I use rbind or rbind.data.frame, the columns are turned into character every time, even if I use stringsAsFactors = FALSE. What worked for me was:
rbind.data.frame(df, data.frame(ColNam = data, Col2 = data), stringsAsFactors = FALSE)