I've got a data frame (df2) with two variables, Mood and PartOfTown (called Region in the example data below), where Mood is a multi-select question (i.e. any combination of options is allowed) rating a person's happiness, and PartOfTown describes the geographical location.
The problem is that the centres code moods differently, with the centre in the northern part of town using NorthCode and the centre in the southern part using SouthCode (df1).
I'd like all the entries in the data set (df2) to be recoded to SouthCode, so that I end up with a data set like df3. I'd like a general solution, because there might be new entries with new combinations currently not featuring in the data set. Any thoughts on it would be much appreciated.
Centre codes and definitions for moods:
df1 <- data.frame(NorthCode=c(4,5,6,7,99),NorthDef=c("happy","sad","tired","energetic","other"),SouthCode=c(7,8,9,5,99),SouthDef=c("happy","sad","tired","energetic","other"))
Starting point:
df2 <- data.frame(Mood=c("4","5","6","7","4,5","5,6,99","99","7","8","9","5","7,8","8,5,99","99"),Region=c("north","north","north","north","north","north","north","south","south","south","south","south","south","south"))
Desired outcome:
df3 <- data.frame(Mood=c("7","8","9","5","7,8","8,9,99","99","7","8","9","5","7,8","8,5,99","99"),PartofTown=c("north","north","north","north","north","north","north","south","south","south","south","south","south","south"))
Current attempt: I tried to start off by splitting the entries but couldn't get it to work.
unlist(strsplit(df2$Mood, ","))
You were on the right path with strsplit(), but you need to add stringsAsFactors = FALSE to data.frame() to make sure that Mood is a character vector, not a factor (from R 4.0.0 onwards this is already the default).
After that you can keep the separated elements as a list and match the old codes with the new ones with lapply().
df1 <-
  data.frame(NorthCode = c(4, 5, 6, 7, 99),
             NorthDef = c("happy", "sad", "tired", "energetic", "other"),
             SouthCode = c(7, 8, 9, 5, 99),
             SouthDef = c("happy", "sad", "tired", "energetic", "other"),
             stringsAsFactors = FALSE)
df2 <-
  data.frame(Mood = c("4", "5", "6", "7", "4,5", "5,6,99", "99", "7", "8", "9", "5", "7,8", "8,5,99", "99"),
             Region = c("north", "north", "north", "north", "north", "north", "north", "south", "south", "south", "south", "south", "south", "south"),
             stringsAsFactors = FALSE)
df3 <-
  data.frame(Mood = c("7", "8", "9", "5", "7,8", "8,9,99", "99", "7", "8", "9", "5", "7,8", "8,5,99", "99"),
             PartofTown = c("north", "north", "north", "north", "north", "north", "north", "south", "south", "south", "south", "south", "south", "south"),
             stringsAsFactors = FALSE)
# Split the Moods into separate values
splitCodes <- strsplit(df2$Mood, ",")
# Add the Region as the name of each element in the new list
names(splitCodes) <- df2$Region
# Recode the values by matching the north values to the south values
recoded <-
  lapply(
    seq_along(splitCodes),
    function(x) {
      ifelse(rep(names(splitCodes[x]) == "north", length(splitCodes[[x]])),
             df1$SouthCode[match(splitCodes[[x]], df1$NorthCode)],
             splitCodes[[x]])
    }
  )
# Add the recoded values back to df2
df2$recoded <-
  sapply(recoded,
         paste,
         collapse = ",")
# Check if the recoded values match your desired values
identical(df2$recoded, df3$Mood)
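Since the question asks for a general solution, the same idea can also be wrapped in a small function. This is only a sketch: recode_mood() and lookup are hypothetical names that do not appear above, the code table is reduced to its two code columns, and it assumes every north code is listed in df1$NorthCode (a code missing from the table would come back as NA).

```r
# Code table and data from the question (definition columns left out)
df1 <- data.frame(NorthCode = c(4, 5, 6, 7, 99),
                  SouthCode = c(7, 8, 9, 5, 99))
df2 <- data.frame(
  Mood = c("4", "5", "6", "7", "4,5", "5,6,99", "99",
           "7", "8", "9", "5", "7,8", "8,5,99", "99"),
  Region = rep(c("north", "south"), each = 7),
  stringsAsFactors = FALSE
)

# Named character vector: names are north codes, values are south codes
lookup <- setNames(as.character(df1$SouthCode), df1$NorthCode)

# Hypothetical helper: translate north entries, leave south entries untouched
recode_mood <- function(mood, region, lookup) {
  pieces <- strsplit(mood, ",", fixed = TRUE)
  out <- mapply(function(codes, reg) {
    if (reg == "north") codes <- unname(lookup[codes])
    paste(codes, collapse = ",")
  }, pieces, region)
  unname(out)
}

df2$recoded <- recode_mood(df2$Mood, df2$Region, lookup)
```

Because the translation goes through the lookup vector, new combinations of known codes need no extra handling.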
I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one data frame might be missing the space between a middle and last name, while another displays the person's name correctly (example: Sarah JaneDoe vs. Sarah Jane Doe). So, I tried the two methods below.
The first method uses fuzzy matching to merge the first two data frames, but when I run the code, it just keeps running.
In the second attempt, I created a function to insert a space between a lowercase letter and the capital letter that follows it, and then merged all three data frames together.
When I open the data set, the competitors who have NAs for their rank and team have their names spelt correctly in all three data sets. I'm not sure where the issue lies.
A few notes:
The 'comp01_n' column originally from the temp1 data frame is the same as the 'rank_1' column from the stats data frame. I kept them both to verify the data frames merged correctly at the end
I deleted rows in the 'fight' column with NA's because that was data for competitors not in the temp1 data frame. My actual data set is much larger and more complex.
Can you spot where I made an error and how to fix it?
library(fuzzyjoin)
library(tidyverse)
temp1 = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats=read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')
#============================================
# Attempt 1
#============================================
#perform fuzzy matching full join
star = stringdist_join(temp1, stats,
                       by='Name', #match based on Name
                       mode='full', #use full join
                       method = "jw", #use jw distance metric
                       max_dist=99,
                       distance_col='dist') %>%
  group_by(Name.x)
#============================================
# Attempt 2
#============================================
# Function to insert a space between a lowercase letter and the capital letter that follows it
format_name <- function(x) {
gsub("([a-z])([A-Z])", "\\1 \\2", x)
}
# Apply the function to the Name column
temp1$Name <- sapply(temp1$Name, format_name)
# create a list of all three data frames
df_list <- list(temp1, stats, winners)
# create a function to merge all data frames in a list on their shared columns
merge_dfs <- function(df_list) {
  # Initialize the first data frame as the merged data frame
  merged_df <- df_list[[1]]
  # Loop through the rest of the data frames in the list
  for (i in 2:length(df_list)) {
    current_df <- df_list[[i]]
    merged_df <- merge(merged_df, current_df, by=intersect(colnames(merged_df), colnames(current_df)), all=TRUE)
  }
  return(merged_df)
}
# apply function
t = merge_dfs(df_list)
# delete rows with NA in 'fight' column
t <- t[complete.cases(t[ , 'fight']), ]
# add suffix to indicate it's data for competitor 1
colnames(t)[c(5,16:32)]<-paste(colnames(t[,c(5,16:32)]),"1",sep="_")
# verify rank and comp01_n have the same values
result = ifelse(t$comp01_n == t$rank_1, 1, 0)
sum(result == 1, na.rm = TRUE)
sum(result == 0, na.rm = TRUE)
sum(is.na(result))
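Two things may be worth checking here. Jaro-Winkler distances lie between 0 and 1, so max_dist=99 in the first attempt admits every pair of rows, which makes the join behave like a full cross join and could explain why it never finishes. In the second attempt, format_name() is applied only to temp1, so a spacing difference in stats or winners would still prevent a match. The sketch below normalises the names in every data frame before merging. It uses made-up toy data (temp1_toy and stats_toy are hypothetical), has not been checked against the real CSV files, and the regular expression would also split names such as "McDonald".

```r
# Normalise names: insert a space at lower-to-upper case boundaries,
# collapse repeated whitespace, and trim the ends
format_name <- function(x) {
  x <- gsub("([a-z])([A-Z])", "\\1 \\2", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Hypothetical toy data standing in for the real data frames
temp1_toy <- data.frame(Name = c("Sarah JaneDoe", "Ann Lee"),
                        fight = c(1, 2),
                        stringsAsFactors = FALSE)
stats_toy <- data.frame(Name = c("Sarah Jane Doe", "Ann  Lee "),
                        rank = c(3, 7),
                        stringsAsFactors = FALSE)

# Apply the same cleaning to every data frame, then merge them in turn
dfs <- lapply(list(temp1_toy, stats_toy), function(d) {
  d$Name <- format_name(d$Name)
  d
})
merged <- Reduce(function(a, b) merge(a, b, by = "Name", all = TRUE), dfs)
```

After cleaning, both toy competitors line up on Name and the full join produces no NA rows.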
I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])
All of the identical lines return the value of TRUE
I'm just trying to verify that this code is telling me all my column names are identical in each CSV files before I move on. I'd also like to know if there is a more efficient way to solve this problem.
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
  dfs[-1],
  \(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
  dfs[-1],
  \(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
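One more thing worth checking in the question's own code: data.frame(colnames(jul21)) does not create a column called jul21, so col_df[['jul21']] is NULL, and identical(NULL, NULL) is TRUE regardless of what the files contain. A minimal sketch with made-up toy data frames (the columns x, y and z are assumptions for illustration only):

```r
# Two toy data frames whose column names clearly differ
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, z = 4)

# Same construction as in the question
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
col_df <- cbind(jul21_cols, aug21_cols)

# The columns are named after the expression, not after the data frame
names(col_df)

# Neither lookup finds a column, so both sides are NULL
lookup_result <- identical(col_df[["jul21"]], col_df[["aug21"]])

# Comparing the column names directly gives the real answer
direct_result <- identical(colnames(jul21), colnames(aug21))
```

Here lookup_result is TRUE even though the column names differ, while direct_result is FALSE.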
We assume that what is needed is to determine if the column names are the same and in same order and if not to determine which differ.
First get a character vector, Names, containing the names of the data frames, and from it make a named list L containing the data frames themselves.
Then get a character vector nms whose elements are strings of column names, one for each data frame.
Finally group the names of the data frames using tapply with nms as the grouping, so we can see which data frames contain which columns. In the example below jul21 and aug21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If tab had only one row, all data frames would have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
##    data.frames column.names
## 1 jul21, aug21 Time, demand
## 2        sep21 Time, DEMAND
graph
Another approach, which could be used instead of or in conjunction with the one above, is to create a graph in which each vertex is a data frame and each edge means that the two data frames it joins have the same column names in the same order. Each connected component then corresponds to one distinct set or ordering of column names. In the example below, jul21 and aug21 form one connected component and sep21 forms a second.
To investigate how the column names of two data frames differ, note that setdiff(names(jul21), names(sep21)) shows the names that are in jul21 but not in sep21, and swapping the arguments gives the other direction. If both setdiff results are zero-length vectors but the names vectors are not identical, the data frames differ only in column order.
library(igraph)
set.seed(123)
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
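The setdiff check described above can be sketched as follows, using the same BOD-based test data. The variable names only_in_jul, only_in_sep, same_set and same_order are made up for this illustration.

```r
# Test data as in the Note: BOD ships with R
jul21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")

# Names present in one data frame but not the other, in both directions
only_in_jul <- setdiff(names(jul21), names(sep21))
only_in_sep <- setdiff(names(sep21), names(jul21))

# Same set of names (ignoring order) versus same names in the same order
same_set <- length(only_in_jul) == 0 && length(only_in_sep) == 0
same_order <- identical(names(jul21), names(sep21))
```

If same_set is TRUE but same_order is FALSE, the two data frames differ only in the order of their columns.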
I have a dataframe with many subject IDs (each subject with repeat observations).
I also have a separate dataframe with just a list of Subject IDs I want to match and extract from the larger dataframe.
How do I write the code in a way that allows me to reference the list of SubjectIDs in a different dataframe?
Not sure I'm fully understanding the question, but here's an example:
df1 <- data.frame(ID = c("chicken", "snake"))
df2 <- data.frame(ID = c("monkey", "elephant", "chicken"),
useful_data = 1:3)
We could subset df2 to only show rows where df2$ID matches an ID from df1$ID. In R, you can subset a data frame using brackets where you specify [rows_we_want, cols_we_want] and leaving one of those blank outputs all rows or all columns, as the case may be.
df2[df2$ID %in% df1$ID,]
#       ID useful_data
#3 chicken           3
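The same pattern carries over to the situation in the question, where each subject has repeat observations. A minimal sketch, assuming made-up data frames obs and wanted that share a SubjectID column (all names here are hypothetical):

```r
# Hypothetical long data: several rows per subject
obs <- data.frame(
  SubjectID = c("s01", "s01", "s02", "s02", "s03", "s04"),
  visit = c(1, 2, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical list of subjects to extract
wanted <- data.frame(SubjectID = c("s02", "s04"),
                     stringsAsFactors = FALSE)

# Keep every row whose SubjectID appears in the other data frame
subset_obs <- obs[obs$SubjectID %in% wanted$SubjectID, ]

# An inner join on SubjectID keeps the same rows
merged_obs <- merge(obs, wanted, by = "SubjectID")
```

Both results keep all repeat observations for the selected subjects; merge() is mainly useful when the second data frame carries extra columns to attach.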
I'm extremely new, so sorry ahead of time. I have two vectors: a character vector of account names (30) and a character vector of product names (30). I also have a dataframe with three columns (account names, product names and revenue), but this list goes way beyond the 30 of either vector.
Ultimately I need a 30x30 dataframe with rows as products from the product name vector, columns as account names from the account name vector, and values as the revenue associated with the account in the column and the product in the row.
I think I need a nested loop? But I don't know how to use that to populate the dataframe appropriately.
account<-c("a","b",etc)
product<-c("prod_a","prod_b", etc)
for (i in seq_along(account)) {
  for (j in seq_along(product)) {
    .....
  }
}
Honestly I'm just very lost haha
I think I know what you're trying to do here. I suspect there is a good reason you want this 30x30 cross-table type of structure, but I would also like to take the opportunity to encourage "tidy" data for analysis purposes. Data is considered "tidy" when it meets these three main criteria:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
That said, below is my attempt to interpret and demonstrate what I think you're trying to accomplish.
library(tidyr)
# set up some fake data to better explain
account_vec <- paste0(letters, 1:26)
product_vec <- paste0(as.character(101:126), LETTERS)
revenue_vec <- rnorm(26*26)
# crossing every account with every product to set up our fake data
df <- expand.grid(account_vec, product_vec)
names(df) <- c("accounts", "products")
df$revenue <- revenue_vec
# if this is what your data looks like currently, I would consider this fairly "tidy"
# now let's pretend there's some data we need to filter out
df <- rbind(df,
            data.frame(
              accounts = paste0("bad_account", 1:3),
              products = paste0("bad_product", 1:3),
              revenue = rnorm(3)
            )
)
# filter to just what is included in our "accounts" and "products" vectors
df <- df[df$accounts %in% account_vec, ]
df <- df[df$products %in% product_vec, ]
# spread out the products so they occupy the column values
df2 <- df %>% tidyr::spread(key="products", value="revenue")
# if you aren't familiar with the "%>%" pipe operator, the above
# line of code is equivalent to this one below:
# df2 <- tidyr::spread(df, key="products", value="revenue")
# now we have accounts as rows, products as columns, and revenues at the intersection
# we can go one step further by making the accounts our row names if we want
row.names(df2) <- df2$accounts
df2$accounts <- NULL
# now the accounts are in the row name and not in a column on their own
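The question asks for products as rows and accounts as columns, which is the transpose of the table built above. A base R sketch of that layout follows. The names account_names, product_names and sales are made up for illustration, and it assumes that revenue should be summed when an account/product pair occurs more than once. Note also that tidyr::spread() has been superseded by tidyr::pivot_wider() in current versions of tidyr.

```r
# Hypothetical inputs: the two name vectors and the long revenue table
account_names <- c("a", "b", "c")
product_names <- c("prod_a", "prod_b")
sales <- data.frame(
  account = c("a", "b", "c", "a", "zzz"),
  product = c("prod_a", "prod_a", "prod_b", "prod_b", "prod_a"),
  revenue = c(10, 20, 30, 40, 99),
  stringsAsFactors = FALSE
)

# Keep only rows whose account and product are in the two vectors
keep <- sales$account %in% account_names & sales$product %in% product_names

# Cross-tabulate: products as rows, accounts as columns, revenue summed
wide <- with(
  sales[keep, ],
  tapply(revenue,
         list(factor(product, levels = product_names),
              factor(account, levels = account_names)),
         sum)
)

# Combinations with no revenue stay NA
wide_df <- as.data.frame(wide)
```

Fixing the factor levels to the two vectors guarantees the full grid of rows and columns, even for products or accounts that have no revenue at all.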
I am new to this community and currently working on an R project in which I need to look up each of the comma-separated elements of one dataframe in any of the columns of another dataframe. Here is an example:
#DataFrame1
a=c("AA,BB","BB,CC,FF","CC,DD,GG,FF","GG","")
df1=as.data.frame(a)
#DataFrame2
x=c("AA","XX","BB","YY","ZZ","MM","YY","CC")
y=c("DD","VV","NN","XX","CC","AA","WW","FF")
z=c("CC","AA","YY","GG","HH","OO","PP","QQ")
df2=data.frame(x,y,z)
What I need to do is check whether any of the elements is present in any of the columns (x, y, z) of df2. Take for example "AA,BB" (the first cell in column a of df1), where "AA" is one element and "BB" is another. If a match is found, I need to identify the matching row or rows; there may be more than one match among the rows of df2. I hope I was able to explain the problem well. Experts, please help.
Here is a solution in two steps:
# load tidyverse
library(tidyverse)
Step 1: Split the elements separated by comma from df1 in a new data frame new_df
1a) To do this, we first identify the number of columns to be generated
(as the maximum number of elements separated by ,; that is: maximum number of , + 1)
number_new_columns <- max(sapply(df1$a, function(x) str_count(x, ","))) + 1
1b) Generate the new data frame new_df
new_df <- df1 %>%
  separate(a, c(as.character(seq_len(number_new_columns)))) # missing pieces will be filled with NA
# Above, we used c(as.character(seq_len(number_new_columns))) to generate column names as numbers -- not very creative :)
Step 2: Identify the position of each unique element from new_df in df2
(hope I understood correctly this second part of the question)
2a) Get the unique elements (from new_df)
unique_elements <- unlist(new_df) %>%
unique()
2b) Get a list whose components contain the positions of each unique element within df2
output <- lapply(unique_elements, function(x) {
  which(df2 == x, arr.ind=TRUE)
})
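If what is needed is, for each cell of df1, the rows of df2 that contain any of its elements, a base R variant could look like the sketch below. It assumes the corrected example data from the question; elements and matching_rows are made-up names, and this is one possible reading of the question rather than the only one.

```r
# Example data from the question
df1 <- data.frame(a = c("AA,BB", "BB,CC,FF", "CC,DD,GG,FF", "GG", ""),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  x = c("AA", "XX", "BB", "YY", "ZZ", "MM", "YY", "CC"),
  y = c("DD", "VV", "NN", "XX", "CC", "AA", "WW", "FF"),
  z = c("CC", "AA", "YY", "GG", "HH", "OO", "PP", "QQ"),
  stringsAsFactors = FALSE
)

# Split each cell of df1 into its elements (an empty cell gives no elements)
elements <- strsplit(df1$a, ",", fixed = TRUE)
names(elements) <- df1$a

# For each cell, find the rows of df2 where any column holds any element
matching_rows <- lapply(elements, function(els) {
  hits <- vapply(df2, function(col) col %in% els, logical(nrow(df2)))
  which(rowSums(hits) > 0)
})
```

matching_rows is a list with one entry per row of df1, each holding the matching row numbers of df2; the empty cell yields an empty vector.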