I am new to R and need help working through a scenario in R programming.
For the first part of the problem:
I have a folder with multiple SAS files in a specific location, and the path to that folder is provided in an Excel file. I have managed to extract the path from the Excel file as below:
library(readxl)  # for read_excel()
library(haven)   # for read_sas()
Spec <- read_excel("file", sheet = "first")
Then I extract the dataframes from the folder with the code below (there are 3 datasets in the folder, namely "aa.sas7bdat", "bb.sas7bdat", and "cc.sas7bdat"; it can be any number depending on the folder, but for this example we take 3 dataframes):
Path <- Spec$`Source Data Path`[1]  # the path is in the first row of the column
Files <- list.files(path = Path, pattern = "\\.sas7bdat$", full.names = FALSE)
Then I loop over the files, collecting the dataframe names as required for further processing (as explained later):
Final_List <- NULL
read_files <- list()  # initialize the list that will hold the data frames
for (y in Files) {
  List <- unlist(strsplit(y, split = ".sas7bdat", fixed = TRUE))
  Final_List <- c(Final_List, List)
  Final_List <- toupper(Final_List)
  read_files[[toupper(List)]] <- read_sas(file.path(Path, y))  # build the full path since full.names = FALSE
}
print(Final_List)
The above loop's output is the 3 names "AA", "BB", "CC" stored in the variable "Final_List", with the data frames themselves stored in the named list "read_files"; we now need to access these dataframes from here on in another function.
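Since the data frames live in the named list read_files, each one can be reached by its name string, for example:
# Access a dataframe by its name (assuming the read_files list built above)
AA_data <- read_files[["AA"]]
head(AA_data)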
Now for the second part
Now there is a requirement to filter all the dataframes based on one value of one column of a single dataframe, dynamically.
Let's say the user's input is the value "Male" from the column "Gender" of dataframe "AA" (it can be any value from any column of any dataframe, as this needs to be DYNAMIC):
Column_Name_Value <- "Gender"  # the column whose values (here "MALE") are selected
Dataframe_Name <- "AA"
I have created 3 functions to help in filtering
1st function, to get the unique values of "Gender" (or any column name) from dataframe "AA":
Unique_Value_Fun <- function(dataframe, value) {
  Unique_Value1 <- dataframe %>% distinct({{value}}) %>% filter({{value}} != "")
  Unique_Value <- unlist(Unique_Value1)
  return(Unique_Value)
}
Unique_Value <- Unique_Value_Fun(read_files[[Dataframe_Name]], !!sym(Column_Name_Value))  # look the dataframe up by name; UQ() is the deprecated spelling of !!
Now we have the unique non-empty values of "Gender" from dataframe AA. There is a common column of IDs in all the dataframes: once the first dataframe is filtered on the chosen value, it will contain only certain IDs. We then need to filter every other dataset in the folder by those same IDs. The CATCH is that the other dataframes will not have the same column names and values, BUT they all have the common column "ID", so we need to use this common ID to filter the rest of the dataframes.
Below code have the rest 2 functions
for (z in Unique_Value) {
  print(z)
  First_dataframe_Fun <- function(dataframe, value) {
    dataframe %>% filter({{value}} == z)  # z is picked up from the enclosing loop
  }
  First_dataframe <- First_dataframe_Fun(read_files[[Dataframe_Name]], !!sym(Column_Name_Value))
}
Now if I hardcode the dataframe names against the functions, it works well (i.e., it's hardcoded but not dynamic):
AA <- First_dataframe
BB <- BB %>% filter(ID %in% First_dataframe$ID)  # filter by the IDs of the first dataframe so the IDs match
CC <- CC %>% filter(ID %in% First_dataframe$ID)
Based on the "First Dataframe ID's" we need to filter the rest of the dataframes dynamically. Now to make it dynamic I tried with if condition inside a for loop but it didn't work out.
Please suggest a logic or a similar code where I can make this Dynamic. If I provide any value of a dataframe it should filter the rest of datasets with the ID's (as there can be n number of dataframes but in our case we have taken only 3 dataframes for example).
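One possible approach, as a minimal sketch: since the data frames already live in the named list read_files built above, you can filter the chosen one and then loop over the remaining names, filtering each by the shared ID column. The function and argument names here are hypothetical:
library(dplyr)
filter_all_by_id <- function(read_files, df_name, col_name, col_value) {
  # Filter the chosen dataframe on the chosen column and value
  first_df <- read_files[[df_name]] %>% filter(.data[[col_name]] == col_value)
  ids <- first_df$ID
  # Filter every other dataframe by the shared ID column
  for (nm in names(read_files)) {
    if (nm == df_name) {
      read_files[[nm]] <- first_df
    } else {
      read_files[[nm]] <- read_files[[nm]] %>% filter(ID %in% ids)
    }
  }
  read_files
}
# Hypothetical usage:
# filtered <- filter_all_by_id(read_files, "AA", "Gender", "MALE")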
I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one name might not have a space between the middle and last name, while the other data frame correctly displays the person's name (example: Sarah JaneDoe vs. Sarah Jane Doe). So, I used the two methods below.
The first method uses fuzzy matching to merge the first two data frames, but when I run the code, it just keeps running.
In the second attempt, I created a function to insert a single space between a lowercase letter and the capital letter that follows it, and then merged all three data frames together.
When I open the merged data set, the competitors who have NAs for their rank and team have their names spelled correctly in all three data sets. I'm not sure where the issue lies.
A few notes:
The 'comp01_n' column originally from the temp1 data frame is the same as the 'rank_1' column from the stats data frame. I kept them both to verify that the data frames merged correctly at the end.
I deleted rows in the 'fight' column with NA's because that was data for competitors not in the temp1 data frame. My actual data set is much larger and more complex.
Can you spot where I made an error and how to fix it?
library(fuzzyjoin)
library(tidyverse)
temp1 = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')
#============================================
# Attempt 1
#============================================
#perform fuzzy matching full join
star = stringdist_join(temp1, stats,
                       by = 'Name',           # match based on Name
                       mode = 'full',         # use full join
                       method = "jw",         # use the Jaro-Winkler distance metric
                       max_dist = 99,
                       distance_col = 'dist') %>%
  group_by(Name.x)
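A likely reason this never finishes (an assumption, since the full data isn't shown): with max_dist = 99, every name in temp1 matches every name in stats, so the join materializes a near cross product. A sketch with a tighter threshold, keeping only each name's closest match:
star = stringdist_join(temp1, stats,
                       by = 'Name',
                       mode = 'full',
                       method = 'jw',
                       max_dist = 0.15,       # hypothetical threshold; tune for your data
                       distance_col = 'dist') %>%
  group_by(Name.x) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%  # keep only the best match per name
  ungroup()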
#============================================
# Attempt 2
#============================================
# Function to insert a single space between a lowercase letter and the uppercase letter that follows it
format_name <- function(x) {
  gsub("([a-z])([A-Z])", "\\1 \\2", x)
}
# Apply the function to the Name column
temp1$Name <- sapply(temp1$Name, format_name)
# create a list of all three data frames
df_list <- list(temp1, stats, winners)
# create a function to merge all data frames in the list on their common columns
merge_dfs <- function(df_list) {
  # Initialize the first data frame as the merged data frame
  merged_df <- df_list[[1]]
  # Loop through the rest of the data frames in the list
  for (i in 2:length(df_list)) {
    current_df <- df_list[[i]]
    merged_df <- merge(merged_df, current_df, by = intersect(colnames(merged_df), colnames(current_df)), all = TRUE)
  }
  return(merged_df)
}
# apply function
t = merge_dfs(df_list)
# delete rows with NA in 'fight' column
t <- t[complete.cases(t[ , 'fight']), ]
# add suffix to indicate it's data for competitor 1
colnames(t)[c(5, 16:32)] <- paste(colnames(t)[c(5, 16:32)], "1", sep = "_")
# verify rank and comp01_n have the same values
result = ifelse(t$comp01_n == t$rank_1, 1, 0)
sum(result == 1, na.rm = TRUE)
sum(result == 0, na.rm = TRUE)
sum(is.na(result))
I have a list of plant names in a dataframe. Plant names come as a couplet of "genus" followed by "species". In my case the couplet is already split across columns (which should help). As a dummy example for three species (Helianthus annuus, Pinus radiata, and Melaleuca leucadendra):
df <- data.frame(genus=c("Helianthus", "Pinus", "Melaleuca"), species=c("annuus","radiata", "leucadendra"))
I would like to use a function in the package "taxize" to check these names against a database (IPNI).
There is no batch function for this, and annoyingly the format for querying a single name is:
checked <- ipni_search(genus='Helianthus', species='annuus')
What I need is a loop to feed each genus name and its associated species name into that function.
I can do this for just genus:
genus_list <- df$genus  # avoid calling this `list`, which masks base::list
checked <- lapply(genus_list, function(z) ipni_search(genus = z))
but am tied up in all sorts of knots trying to pass the species with it.
Any help appreciated!
Cheers
Loop (or *apply) over the index, not the actual value:
checked = lapply(
1:nrow(df),
function(i) ipni_search(genus = df$genus[i], species = df$species[i])
)
Alternately, you can use Map which is made for iterating over multiple vectors/lists in parallel:
checked = Map(ipni_search, genus = df$genus, species = df$species)
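If you're already in the tidyverse, purrr's map2 is the analogous tool (an assumption that purrr is acceptable here):
library(purrr)
checked = map2(df$genus, df$species, ~ ipni_search(genus = .x, species = .y))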
I am doing some data analysis where I have my datasets in a folder and I use a for loop to go through all the datasets and (1) Plot a graph (2) Calculate some values from the graph and store them in a dataframe which is then appended to a list. The idea is to have graphs for each dataset and also a list having this summary dataframe for each dataset for analysis later.
With every dataset the for loop iterates through I have a variable specifying the current dataset in the loop. This variable is used to label and save the graph and to label and append the dataframe to a list. I am able to do the graph bit alright but I am not able to add the dataframe to the list in the for loop. My code is as follows:
# Create empty list for adding things to from each loop
parameters <- list()
# Begin the loop
for (file in filesVector) {
  # Extract keywords from name of file to be used later
  splitname <- strsplit(file, '4-')
  splitname <- unlist(splitname)
  secondhalf <- splitname[2]
  splitsecondhalf <- strsplit(secondhalf, '\\.')
  splitsecondhalf <- unlist(splitsecondhalf)
  title <- splitsecondhalf[1]
  # Extract values as a dataframe and assign to a varying name
  assign(paste(title, 'blanks', sep = '-'), data_drc_merge[data_drc_merge$ID == "B", ])
  # Add to list
  parameters <- c(parameters, paste(title, 'blanks', sep = '-'))
}
But when I try adding the dataframe to the list this way, I get the current value of the name string added there instead of the dataframe itself.
Any ideas how to fix this?
You can use [[ together with paste0 to construct the name under which the data.frame is stored in your list:
list_of_df = list()
for (i in files) {
  # do analysis...
  list_of_df[[paste0("name_", i)]] = current_df  # "name_" is a placeholder prefix
}
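Applied to your loop above (a sketch; data_drc_merge and the filename parsing are taken from your code as-is), that would look like:
parameters <- list()
for (file in filesVector) {
  title <- unlist(strsplit(unlist(strsplit(file, '4-'))[2], '\\.'))[1]
  blanks <- data_drc_merge[data_drc_merge$ID == "B", ]
  # Store the dataframe itself under a computed name, instead of storing the name string
  parameters[[paste(title, 'blanks', sep = '-')]] <- blanks
}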
I have more than one hundred Excel files to clean, and all the files have the same data structure. The code listed below is what I use to clean a single Excel file. The file names all have the form 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[, c(3, 4, 5, 8, 16, 17, 18, 19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end) {
  return(df[start:end, ])
}
dividers = which(df$Product_Models %in% 'SKU')
df <- lapply(1:(length(dividers) - 1), function(x) in_between(df, start = dividers[x], end = dividers[x + 1]))
df <- do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I can get a well-structured dataframe.
Can you guys help me write code that cleans all the Excel files at once, so I do not need to run this code a hundred times, and then aggregates all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")
  # ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
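The base R equivalent (a sketch using lapply, as mentioned above) builds a list of data frames and then binds them together:
data_frame_list = lapply(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")
  # ... all processing steps ...
  df
})
single_data_frame = do.call(rbind, data_frame_list)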
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
                       X__1 = "New.1",
                       X__5 = "New.5",
                       X__9 = "New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
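For example, a sketch of the inside of the processing function above, assuming the project name sits in cell B1 and the column headers are in row 3 (adjust to your files' actual layout):
# Read just the project-name cell, then the data with proper headers
project_name = read_excel(file, sheet = "EQuote", range = "B1", col_names = FALSE)[[1]]
df = read_excel(file, sheet = "EQuote", skip = 2)  # header row is row 3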
I've got multiple data.frames (26) in a list. The dfs have the same structure, but I would like to work with/export only two particular columns. I can export all the dfs to individual dfs:
for (i in filelist) {
  list2env(setNames(filelist, paste0("names(filelist[[i]])", seq_along(filelist))),
           envir = parent.frame())
}
I can delete a column from all the dfs
for(i in seq_along(filelist)){filelist[[i]]$V5 = NULL}
but I cannot export the other columns individually. From a single data.frame it simply works:
token_out_mk_totatyafiak_02.txt = out_mk_totatyafiak_02.txt["V2"]
type_out_mk_totatyafiak_02.txt = out_mk_totatyafiak_02.txt["V1"]
When I tried these:
for (i in seq_along(filelist)) { n[[i]] <- filelist[[i]]$V2 }
for (i in seq_along(filelist)) {
  sapply(filelist, function(x) n <- filelist[[i]]$V2)
}
the most I achieved was reading, for all 26 dfs, the second column of the last df.
The V2 column looks like:
V2
1 az
2 a
3 fekete
4 folt
(and so on; these are Hungarian short stories...)
Depending on your desired results, you have several options.
If you want a new list, with your data frames containing only one specific column:
new_filelist <- lapply(filelist, function(df) {
  df["V2"]
})
If you want to export one specific column of every data frame, each to its own file (in this case, .txt files):
This requires the data frames in your list to be named. In case they are not, you can replace names(filelist) with 1:length(filelist), or first give the list names (see the naming sketch after the next code block).
lapply(names(filelist), function(df) {
  df_filename <- paste0(df, ".txt")
  write.table(filelist[[df]]["V2"], df_filename)
})
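If the list is unnamed, a quick way to name it (a sketch; the "df_" prefix is arbitrary):
names(filelist) <- paste0("df_", seq_along(filelist))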
If you want to assign one specific column of each of your data frames to new objects in your environment:
Again, this requires your data frames to be named.
lapply(names(filelist), function(df) {
  assign(df, filelist[[df]]["V2"], envir = .GlobalEnv)
})