rbind() on a conditional basis - R

Here's the question. I have a bunch of if statements that create different data frames (A-F) based on the user inputs. In some instances, some of the data frames will be empty, so maybe (A-C) are empty, but (D-F) have information.
I'm trying to create a conditional rbind(), where it combines the rows only if the data frame is not empty.
I'm not quite sure how to go about this? I don't know if I should create a bunch of conditions and use another if statement:
cond_a <- nrow(a) != 0
cond_b <- nrow(b) != 0
cond_c <- nrow(c) != 0
cond_d <- nrow(d) != 0
cond_e <- nrow(e) != 0
cond_f <- nrow(f) != 0
but then I don't know how to utilize these conditions...
EDIT: To take a step back and explain better: I have one data frame that I split into 6 different data frames (A-G) by subsetting on a column that holds the letters A-G. Which letters appear in that column depends on the user inputs.
I then have a series of if statements that ask "if A is not empty, then perform this aggregation", thereby skipping the df if it has no data in it. The aggregation takes it from 19 to 16 columns. Because an empty df has not been aggregated, it still has 19 columns. After these if statements and aggregations, I am left with dfA, dfB, dfC, etc., which have either been aggregated (16 columns) or are still empty (19 columns). I then want a piece of code that says "rbind the dfs that have been aggregated (16 columns); if a df has not been aggregated (is empty), leave it out of the rbind".
thanks!

You can do something like this:
1. Create an empty vector to hold the names of the data frames you wish to select.
2. Write a bunch of if statements to see which data frames have a non-zero number of rows.
3. Keep adding the names of those non-empty data frames to the vector created in step 1.
4. Write an expression which collapses those names into an rbind expression.
5. Evaluate the expression.
Here is the script to do it:
1. Create an empty vector to store the names of the data frames
list.df = " "
2 & 3. The if statements check whether each data frame has rows and, if so, add its name to the vector
if(nrow(df.a) > 0) {
list.df=c(list.df,deparse(substitute(df.a)))
}
if(nrow(df.b) > 0) {
list.df=c(list.df,deparse(substitute(df.b)))
}
if(nrow(df.c) > 0) {
list.df=c(list.df,deparse(substitute(df.c)))
}
.... So on from A through G
if(nrow(df.g) > 0) {
list.df=c(list.df,deparse(substitute(df.g)))
}
4. Formulate the expression
list.df = list.df[-1] # remove the placeholder first element
expression = paste0("df.combined = rbind(", paste0(list.df, collapse = ","), ")")
5. Evaluate the expression
eval(parse(text=expression))
df.combined will be the final dataset you are looking for.
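A shorter route to the same result, sketched here under the assumption that every non-empty data frame has the same columns, is to put the data frames in a list, drop the empty ones with Filter, and bind the rest with do.call. The data frames and column names below are hypothetical stand-ins, not the asker's data:

```r
# Hypothetical stand-ins: empty frames keep an extra, un-aggregated column,
# non-empty frames share the same aggregated columns.
empty <- data.frame(group = character(0), total = numeric(0),
                    extra = numeric(0), stringsAsFactors = FALSE)
df.a <- empty
df.b <- data.frame(group = "B", total = 10, stringsAsFactors = FALSE)
df.c <- empty
df.d <- data.frame(group = c("D", "D"), total = c(3, 4),
                   stringsAsFactors = FALSE)

# Collect everything in a named list, then keep only frames with rows.
all.dfs <- list(df.a = df.a, df.b = df.b, df.c = df.c, df.d = df.d)
non.empty <- Filter(function(d) nrow(d) > 0, all.dfs)

# Bind the remaining frames row-wise and reset the row names.
df.combined <- do.call(rbind, non.empty)
rownames(df.combined) <- NULL
```

This avoids building and parsing code as text. If every data frame is empty, do.call(rbind, list()) returns NULL, so that case may need its own check.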

Related

Merge not merging data frames correctly

I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one data frame might be missing the space between a person's middle and last name, while another displays the name correctly (example: Sarah JaneDoe vs. Sarah Jane Doe). So, I used the two methods below.
The first method involves using fuzzy matching to merge the first two data frames, but when I run the code, it just keeps running.
For the second attempt, I created a function to keep only one space between a capital letter and the first lower case letter that comes before it, and then merged all three data frames together.
When I open the data set, the competitors who have NA's for their rank and team have their names spelt correctly in all three data sets. I'm not sure where the issue lies.
A few notes:
The 'comp01_n' column originally from the temp1 data frame is the same as the 'rank_1' column from the stats data frame. I kept them both to verify the data frames merged correctly at the end
I deleted rows in the 'fight' column with NA's because that was data for competitors not in the temp1 data frame. My actual data set is much larger and more complex.
Can you spot where I made an error and how to fix it?
library(fuzzyjoin)
library(tidyverse)
temp1 = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats=read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')
#============================================
# Attempt 1
#============================================
#perform fuzzy matching full join
star = stringdist_join(temp1, stats,
by='Name', #match based on Name
mode='full', #use full join
method = "jw", #use jw distance metric
max_dist=99,
distance_col='dist') %>%
group_by(Name.x)
#============================================
# Attempt 2
#============================================
# Function to keep only one space between a capital letter and the first lower case letter that comes before it
format_name <- function(x) {
gsub("([a-z])([A-Z])", "\\1 \\2", x)
}
# Apply the function to the Name column
temp1$Name <- sapply(temp1$Name, format_name)
# create a list of all three data frames
df_list <- list(temp1, stats, winners)
# create a function to remove duplicate columns in list
merge_dfs <- function(df_list) {
# Initialize the first data frame as the merged data frame
merged_df <- df_list[[1]]
# Loop through the rest of the data frames in the list
for (i in 2:length(df_list)) {
current_df <- df_list[[i]]
merged_df <- merge(merged_df, current_df, by=intersect(colnames(merged_df), colnames(current_df)), all=TRUE)
}
return(merged_df)
}
# apply function
t = merge_dfs(df_list)
# delete rows with NA in 'fight' column
t <- t[complete.cases(t[ , 'fight']), ]
# add suffix to indicate it's data for competitor 1
colnames(t)[c(5,16:32)]<-paste(colnames(t[,c(5,16:32)]),"1",sep="_")
# verify rank and comp01_n have the same values
result = ifelse(t$comp01_n == t$rank_1, 1, 0)
sum(result == 1, na.rm = TRUE)
sum(result == 0, na.rm = TRUE)
sum(is.na(result))
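As a quick check of what the format_name helper from Attempt 2 does, here is a small sketch on made-up names (the third name is a hypothetical example, not from the linked data). Since gsub is vectorised, the helper can be called on the whole column directly without sapply:

```r
# Same helper as in Attempt 2: insert a space wherever a lower-case letter
# is immediately followed by an upper-case letter.
format_name <- function(x) {
  gsub("([a-z])([A-Z])", "\\1 \\2", x)
}

names_in <- c("Sarah JaneDoe", "Sarah Jane Doe", "John McGregor")
names_out <- format_name(names_in)  # gsub works on the whole vector at once
print(names_out)
```

The first two inputs come out identical ("Sarah Jane Doe"), but a name such as "McGregor" is also split into "Mc Gregor", so names with internal capitals may stop matching after this step.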

R- Remove rows based on condition across some columns

I have a data frame like this :
I want to remove the rows that have values = 0 in the columns which are "numeric". I tried some functions, but they either returned an error or didn't remove anything because not the entire row is = 0. In short, I need to remove the rows that are equal to 0 in the columns that have a numeric class (i.e. from sales month to expected sales). How could I do this? (The result I expect is attached below.)
PS: If I could do it with some function that allows me to give the number of the column instead of the name, it would be great!
Here is a simple solution with lapply.
set.seed(5)
df <- data.frame(a=1:10,b=letters[1:10],x=sample(0:5,5,replace = T),y=sample(c(0,10,20,30,40,50),5,replace = T))
df <-df[!unlist(lapply(1:nrow(df), function(i) {
any(df[i, ] == 0)
})), ]
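If the goal is instead to drop only the rows where every numeric column is 0 (one reading of the question), a vectorised sketch with rowSums could look like this. The column names and values are hypothetical, and the data is assumed to contain no NA values:

```r
# Hypothetical data: one character column and two numeric columns.
df <- data.frame(
  id       = c("p1", "p2", "p3", "p4"),
  sales    = c(0, 5, 0, 3),
  expected = c(0, 0, 0, 7),
  stringsAsFactors = FALSE
)

# Positions of the numeric columns; a hand-written vector such as 2:3
# would work as well if the column numbers are known.
num_cols <- which(sapply(df, is.numeric))

# A row is "all zero" when none of its numeric values differs from 0.
all_zero <- rowSums(df[, num_cols, drop = FALSE] != 0) == 0

kept <- df[!all_zero, ]
```

Here rows p1 and p3 are removed, while p2 is kept because only one of its numeric columns is 0.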

R: Match values in two data frames like vlookup but for multiple criteria without Key [large data]

I have two large data frames (500k rows) from two separate sources without a key. Instead of being able to merge using a key, I want to merge the two data frames by matching other columns. Such as age and amount. It is not a perfect match between the two data frames so some values will not match, and I will later simply remove these ones.
The data could look something like this.
So, in the example above I want to be able to create a table matching Key 1 and Key 2. In the picture above we see that XXX1 and YYY3 is a match. So from here I would like to create a data frame like:
[Key 1] [Key 2]
XXX1 YYY3
XXX2 N/A
XXX3 N/A
I know how to do this in Excel but due to the large amount of data, it simply crashes. I want to focus on R but for what it is worth, this is how I built it in Excel (where the idea is that we first do a VLOOKUP, and then use INDEX as a VLOOKUP to get the second match if the first one does not match both criteria):
=IF(P2=0;IFNA(VLOOKUP(L2;B:C;2;FALSE);VLOOKUP(L2;G:H;2;FALSE));IF(O2=Q2;INDEX($A$2:$A$378300;SMALL(IF($L2=$B$2:$B378300;ROW($B$2:$B$378300)-ROW($B$2)+1);2));0))
And this is the approach made in R:
for (i in 1:nrow(df_1)) {
for (j in 1:nrow(df_2)) {
if (df_1$pc_age[i] == df_2$pp_age[j] && (df_1$amount[i] %in% c(df_2$amount1[j], df_2$amount2[j], df_2$amount3[j]))) {
df_1$Key1[i] = df_2$Key2[j]
} else (df_1$Key1[i] = NA)
}}
The problem is that this takes way, way too long. Is there a more efficient way to map this data as well as possible?
Thanks!
Create a dummy key column in both data frames, for example (shown here for df1):
df1$key1 <- paste0("X_", seq_len(nrow(df1))) # one distinct key per row: X_1 ... X_n
Similarly for df2 from Y1....Yn and then join both data frames using "merge" on columns age and amount.
Concatenate Key1 and key2 in a new column in the merged data frame. You will directly get your desired data frame.
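A minimal sketch of the merge idea, assuming the column names from the question (pc_age, pp_age, amount, amount1 to amount3) and using made-up values: stack the three amount columns of the second data frame into one, then do a left join on age and amount.

```r
# Hypothetical toy data using the question's column names.
df_1 <- data.frame(Key1 = c("XXX1", "XXX2", "XXX3"),
                   pc_age = c(30, 41, 52),
                   amount = c(100, 200, 300),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(Key2 = c("YYY1", "YYY2", "YYY3"),
                   pp_age = c(25, 41, 30),
                   amount1 = c(100, 999, 50),
                   amount2 = c(10, 20, 100),
                   amount3 = c(1, 2, 3),
                   stringsAsFactors = FALSE)

# Stack amount1..amount3 so each (Key2, age, amount) pair is one row.
amount_cols <- c("amount1", "amount2", "amount3")
df_2_long <- data.frame(
  Key2   = rep(df_2$Key2, times = length(amount_cols)),
  pp_age = rep(df_2$pp_age, times = length(amount_cols)),
  amount = unlist(df_2[amount_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
df_2_long <- unique(df_2_long)  # drop repeats when a row lists an amount twice

# Left join: keep every df_1 row, fill Key2 with NA where nothing matches.
matched <- merge(df_1, df_2_long,
                 by.x = c("pc_age", "amount"),
                 by.y = c("pp_age", "amount"),
                 all.x = TRUE)
result <- matched[order(matched$Key1), c("Key1", "Key2")]
```

If one row of df_1 matches several rows of df_2, merge returns all of those pairs, so duplicates would still need to be resolved afterwards.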
could the following code work for you?
# create random data
set.seed(123)
df1 <- data.frame(
key_1=as.factor(paste("xxx",1:100,sep="_")),
age = sample(1:100,100,replace=TRUE),
amount = sample(1:200,100))
df2 <- data.frame(
key_1=paste("yyy",1:500,sep="_"),
age = sample(1:100,500,replace=TRUE),
amount_1 = sample(1:200,500,replace=TRUE),
amount_2 = sample(1:200,500,replace=TRUE),
amount_3 = sample(1:200,500,replace=TRUE))
# ensure at least three fit rows
df2[10,2:3] <- df1[1,2:3]
df2[20,c(2,4)] <- df1[2,2:3]
df2[30,c(2,5)] <- df1[3,2:3]
# define comparison with df2
comp2df2 <- function(x){
ageComp <- df2$age == as.numeric(x[2])
if(!any(ageComp)){
return(NaN)
}
amountComp <- apply(df2,1,function(a) as.numeric(x[3]) %in% as.numeric(a[3:5]))
if(!any(amountComp)){
return(NaN)
}
matchIdx <- ageComp & amountComp
if(sum(matchIdx) > 1){
warning("multiple match detected, first match is taken\n")
}
return(which(matchIdx)[1])
}
# run match
matchIdx <- apply(df1,1,comp2df2)
# merge
df_new <- cbind(df1[!is.na(matchIdx),],df2[matchIdx[!is.na(matchIdx)],])
I didn't have time to test it on really big data, but I would guess this is faster than your two for loops.
To speed things up further, you could delete the
if(sum(matchIdx) > 1){
warning("multiple match detected, first match is taken\n")
}
lines if you are not worried about a row matching several others.

How to do a complex edit of columns of all data frames in a list?

I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
a <- WaFramesNumeric[[i]]$Device_Name
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the value of the last command in a for loop is kept. It is not: since you never assigned the result anywhere (the cbind and the indexing of WaFramesNumeric...), it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you can replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case.
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
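To see the difference between match and grep for a missing column, here is a small sketch on the built-in mtcars data, which has no Cost_Center column. Note that grep does pattern matching, so it can also hit columns whose names merely contain the pattern; which(cols == name) is shown as an exact-match alternative:

```r
cols <- colnames(mtcars)

m_match <- match("Cost_Center", cols)    # NA: no such column
m_grep  <- grep("Cost_Center", cols)     # integer(0): nothing matched
m_which <- which(cols == "Cost_Center")  # integer(0), exact comparison

# An NA in the column index makes data frame indexing fail;
# tryCatch captures the error message instead of stopping the script.
bad <- tryCatch(mtcars[c(1, 2, m_match)],
                error = function(e) conditionMessage(e))

# An empty integer vector simply contributes no columns.
good <- mtcars[c(1, 2, m_grep)]
```

Here good contains just the first two columns of mtcars, while bad holds the captured error message.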
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but if you want to go through a list of data frames, select the columns that satisfy some function and also keep some columns given by name, you can do it like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
Selector <- function(data, select_func, select_names) {
select_func <- match.fun(select_func)
idx_names <- match(select_names, colnames(data))
idx_names <- idx_names[!is.na(idx_names)]
idx_func <- which(sapply(data, select_func))
idx <- unique(c(idx_func, idx_names))
return(data[, idx])
}
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfil my criteria.
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
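To spell out why this works: data == 1 gives a logical matrix, and rowSums counts the TRUE values in each row, i.e. how many columns hold a 1. A small sketch on hand-made data (the values are hypothetical, chosen so that each case occurs once):

```r
# Four rows containing three, two, one and zero 1s respectively.
data <- data.frame(x = c(1, 1, 1, 2),
                   y = c(1, 1, 3, 2),
                   z = c(1, 2, 3, 3))

# Number of columns equal to 1 in each row.
ones_per_row <- rowSums(data == 1)

all_three   <- data[ones_per_row == 3, ]  # every column is 1
exactly_two <- data[ones_per_row == 2, ]  # any two columns are 1
exactly_one <- data[ones_per_row == 1, ]  # any single column is 1
```

Because the count ignores which columns hold the 1s, the same expression covers all three cases by changing only the number compared against.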
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row indices where a 1 is found in any variable. The tabulate function is then applied to these row indices and returns a vector, in row order, in which each element represents a row and rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data)) # nbins keeps one count per row even if the last rows contain no 1
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")

Resources