Renaming dataframes of varying number of columns in R - r

I would like to rename columns sequentially for multiple dataframes with a varying number of columns.
The dataframes are read into R from a PDF that displays a table, and each PDF page is automatically assigned to a column. By default, each column is named with the entire printout of its page. I simply want to replace this automatic column name with the page number.
For example, a dataframe with 4 columns should have them titled 1, 2, 3, 4; a dataframe with 5 columns should have them titled 1, 2, 3, 4, 5; and so on.
I tried
txt_df1 <- data.frame("page1", "page2", "page3", "page4")
#remember the dataframes might have any number of columns
for (n in (1:ncol(txt_df1))){
colnames(txt_df1[n]) <- n
}
and
txt_df1 <- data.frame("page1", "page2", "page3", "page4")
#remember the dataframes might have any number of columns
for (n in (1:ncol(txt_df1))){
txt_df1 <- rename(txt_df1, n = n)
}
For some reason, nothing happens when either of these is run. Any suggestions on how to do this better or make this code work?

rename_func <- function(df) {
df_max <- ncol(df)
names_vec <- c(1:df_max)
names(df) <- names_vec
df
}
If we use the following data.frames:
txt_df1 <- data.frame("page1", "page2", "page3", "page4")
txt_df2 <- data.frame("page1", "page2")
We will get:
rename_func(txt_df1)
1 2 3 4
1 page1 page2 page3 page4
...and
rename_func(txt_df2)
1 2
1 page1 page2
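Because the question mentions several dataframes, the same idea can be applied to all of them at once. The following is only a sketch: it assumes the dataframes have been collected in a list (the names all_dfs and renamed are made up for this example) and uses setNames() with seq_along() instead of a helper function.

```r
# Two example dataframes with different numbers of columns
txt_df1 <- data.frame("page1", "page2", "page3", "page4")
txt_df2 <- data.frame("page1", "page2")

# Hypothetical container holding every dataframe to rename
all_dfs <- list(txt_df1, txt_df2)

# seq_along(df) counts the columns, so each dataframe gets 1..ncol(df);
# the numbers are stored as character names
renamed <- lapply(all_dfs, function(df) setNames(df, seq_along(df)))

names(renamed[[1]])
names(renamed[[2]])
```

Each element of renamed is a copy of the corresponding dataframe with its columns named after their position.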

You almost solved the issue yourself. All you need to change is to move the index [n] outside of the data.frame, so that the assignment becomes colnames(txt_df1)[n] <- n
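A small sketch of the difference, using the dataframe from the question. The variable names before and unchanged are only there for illustration.

```r
txt_df1 <- data.frame("page1", "page2", "page3", "page4")
before <- names(txt_df1)

# Original attempt: renames a one-column copy and writes that column back
# into position n, so the names of txt_df1 itself stay as they were
for (n in seq_len(ncol(txt_df1))) {
  colnames(txt_df1[n]) <- n
}
unchanged <- identical(names(txt_df1), before)

# Corrected version: index the vector of column names instead
for (n in seq_len(ncol(txt_df1))) {
  colnames(txt_df1)[n] <- n
}
names(txt_df1)
```

The loop is not strictly needed: colnames(txt_df1) <- seq_len(ncol(txt_df1)) assigns all names in one step.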

It would be better to reference the column names in your data frames like this:
names(txt_df1)[indexNumber]
This way, you can assign names in your for-loop:
for(n in 1:ncol(txt_df1)){
names(txt_df1)[n] <- new_name # new_name is a placeholder for the name to assign
}
As for how to assign names, you could try this:
for(n in 1:ncol(txt_df1)){
names(txt_df1)[n] <- paste("page_", n, sep="")
}
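Since paste() is vectorized, the same names can also be built without a loop. This is a minimal sketch of that variant, using paste0() as shorthand for paste(..., sep = "").

```r
txt_df1 <- data.frame("page1", "page2", "page3", "page4")

# Build all names at once: page_1, page_2, ... up to the number of columns
names(txt_df1) <- paste0("page_", seq_along(txt_df1))
names(txt_df1)
```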

Related

How do I convert my code that transforms nested lists to a dataframe into a function?

I have 18 lists, one for each condition. Inside the list for a condition there are 10 lists, one for each participant. Within the list for a participant, there is a list with anywhere between 1 and 20 values of type double. To clarify, this is code to reproduce the list of one condition; remember that I have 18 of these, all slightly different.
Participant_List <- list()
for (i in 1:10) {
Scores <- list()
for (k in sample(1:5, replace = TRUE)) {
Scores[[k]] <- sample(1:7, sample(1:10), replace = TRUE)
}
Participant_List[[i]] <- Scores
}
Now with some help, I got code to transform the list of one condition into a data frame in a long format:
#convert each participant's list to a data frame
x_dataframes <- lapply(seq_along(Participant_List), function(curParticipant){
return(data.frame(Participant = curParticipant,
Score = unlist(Participant_List[[curParticipant]])))
})
#combine the list of dataframes into one dataframe
x_combined <- do.call("rbind", x_dataframes)
I would like to create a function containing this code to be able to simply apply this to the other conditions. I came up with the following, where I first create a list containing the conditions I have, called Hypo1_lists and then I feed this into the function below:
function(Hypo1_lists){
#convert each participant's list to a data frame
x_dataframe <- lapply(seq_along(Hypo1_lists), function(curParticipant){
return(data.frame(Participant = curParticipant,
Score = unlist(Hypo1_lists[[curParticipant]])))
#combine the list of dataframes into one dataframe
Hypo1_lists <- do.call("rbind", x_dataframe)
})
}
But this outputs one nested list...I want to store the outputs in separate data frames (one for each condition), the same I get from the code before I put it into a function.

The do.call("rbind", ...) step ended up inside the function passed to lapply, after its return(), so it never runs. Move it outside the lapply call:
myf <- function(Hypo1_lists){
#convert each participant's list to a data frame
x_dataframe <- lapply(seq_along(Hypo1_lists), function(curParticipant){
return(data.frame(Participant = curParticipant,
Score = unlist(Hypo1_lists[[curParticipant]])))
})
#combine the list of dataframes into one dataframe
Hypo1_lists <- do.call("rbind", x_dataframe)
return(Hypo1_lists)
}
myf(Participant_List)
Participant Score
1 1 2
2 1 1
3 1 6
4 1 3
Also, don't forget to return something from your main function.
To apply this function to a nested list:
full <- list(Participant_List, Participant_List)
names(full) <- c("firstname", "secondname")
full_result <- lapply(full, myf)
summary(full_result)
Length Class Mode
firstname 2 data.frame list
secondname 2 data.frame list
To retrieve your second result, for example, use full_result[[2]], which is of class data.frame.
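If a single long table is more convenient than one dataframe per condition, the per-condition results can be stacked with a label column. The sketch below uses tiny made-up data (cond_a, cond_b) and a helper called to_long that mirrors the function above; the Condition column is an addition for this example, not part of the original code.

```r
# Hypothetical miniature data: two conditions, two participants each
cond_a <- list(list(c(1, 2), 3), list(4))
cond_b <- list(list(5), list(c(6, 7)))

# Same logic as the answer: one dataframe per participant, then rbind
to_long <- function(participants) {
  per_participant <- lapply(seq_along(participants), function(p) {
    data.frame(Participant = p, Score = unlist(participants[[p]]))
  })
  do.call(rbind, per_participant)
}

conditions <- list(A = cond_a, B = cond_b)

# One dataframe per condition, kept in a named list
results <- lapply(conditions, to_long)

# Optional: stack them, tagging each row with its condition name
combined <- do.call(rbind, Map(function(df, nm) cbind(Condition = nm, df),
                               results, names(results)))
```

Keeping the results in a named list means an individual condition is still available as results$A or results[["B"]].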

Why is loop adding NA values to the data frame?

I have a basic while loop with a for loop inside it. I iterate through some starting and ending values in a dataframe, then go through a list and grab (substring) some values.
The problem with the code below is that it adds a lot of NA rows, and I don't understand why or how.
I have an if that uses grepl to look for "TRACK 2 DATA: " and, if it is found, adds a row to the dataframe. I don't have an else that adds NA values. So in my understanding, if the condition is false, the iteration should continue without adding values to the dataframe?
What might be wrong?
i=1
fundi <- nrow(find_txn) #get the last record
while(i <=fundi) { # Start while-loop until END OF records
nga <- find_txn[i,1] #FRom record
ne <- find_txn[i,3] #to Records
for (j in nga:ne){ #For J in from:to
if(grepl("TRACK 2 DATA: ",linn[j])) { #If track data found do something
gather_txn[j,1] <- j # add a record for iteration number
gather_txn[j,2] <- substr(linn[j],1,9) #get some substrings
gather_txn[j,3] <- substr(linn[j],34,39) #get some substrings
}
}
i <- i + 1
}
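
I was indexing with the wrong variable. Assigning to row j of gather_txn creates every row up to j, and the rows that are never filled stay NA. The code inside the if block needs to write to the table using i, not j: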
gather_txn[i,1] <- j # add a record for iteration number
gather_txn[i,2] <- substr(linn[j],1,9) #get some substrings
gather_txn[i,3] <- substr(linn[j],34,39) #get some substrings
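An alternative that avoids row indexing altogether is to collect the matching line numbers first and build the dataframe in one step. This is only a sketch with made-up stand-ins for linn and find_txn, and it differs from the fix above in that it keeps one row per matching line rather than one row per record in find_txn.

```r
# Hypothetical stand-ins for the objects in the question
linn <- c("start",
          "AAAAAAAAA TRACK 2 DATA: 111",
          "noise",
          "end",
          "BBBBBBBBB TRACK 2 DATA: 222")
find_txn <- data.frame(from = c(1, 4), other = NA, to = c(3, 5))

# All line numbers covered by the from:to ranges (columns 1 and 3)
in_range <- unlist(Map(seq, find_txn[[1]], find_txn[[3]]))

# Keep only the lines that contain the marker text
hits <- in_range[grepl("TRACK 2 DATA: ", linn[in_range], fixed = TRUE)]

# One row per match, so no empty NA rows are created
gather_txn <- data.frame(line = hits,
                         prefix = substr(linn[hits], 1, 9),
                         stringsAsFactors = FALSE)
```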

Improve R speed when using dynamic column names

I have some code that works as expected but is very inefficient and slow. I have 2 dataframes: 1) DF1, with 40,645 rows and 264 columns, where each column represents a KPI/dimension of some kind, and 2) DF2, with 478,872 rows and 11 columns. DF1 is a wide dataframe and DF2 is a long dataframe. I need to merge the two but cannot simply merge the columns, since the dataframes are in different formats. Additionally, I need to combine two column values from DF2 to create the name of each new column in DF1. In the end I am accomplishing this using a loop, and this command is what slows the code down substantially:
DF1[[col_name_v]][index_v] <- col_val_v
# Additionally these other methods appear just as slow:
DF1[index_v,"col1"] <- col_val_v
DF1[index_v,265] <- col_val_v
If I were to specify each of the column names manually like this, it works 10X+ faster:
DF1$col1[index_v] <- col_val1_v
DF1$col2[index_v] <- col_val2_v
DF1$col3[index_v] <- col_val3_v
#.... etc
The problem is that the code needs to be dynamic: there are many columns, and their names may change over time. I would prefer the code to pick up the column names on the fly so that it does not need frequent additions and changes.
Data looks like the following:
DF1 (40645*264) - before adding new columns from the code below
Date_Time,YEAR,MON,DOW,WEEK,Location,ID,KPI1,KPI2,KPI3,...,KPI264
9/7/2020,2020,September,Monday,36,33,33001,43,0,2,...,10
DF2 (478872*11) - multiple rows to be merged as multiple columns to DF1
Date_Time,Location,ID,Technology,Cluster,Dimension,Variable,Formula,Value,Index
9/7/2020,33,33001,1,"OVERALL","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.003
9/7/2020,33,33001,1,"LOCATION","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.004
9/7/2020,33,33001,1,"GROUP1","NEWKPINAME1","SUM(N)/SUM(D)",2.8,0.002
Dimension and Variable are combined to create a new unique KPI name, which is added as a column, and the Index value is written to DF1 in that new column at the matching row.
# Provides the number of new columns that need to be added to DF1
dim_v <- unique(DF2$Dimension)
var_v <- unique(DF2$Variable)
limit_v <- length(dim_v) * length(var_v)
# This part adds the new columns to the DF1 - with NA set for values
index_v <- 1
while(index_v <= limit_v) {
# Create the column name
col_name_v <- paste(DF2$Dimension[index_v], DF2$Variable[index_v],sep="_")
# Add the column name with default values of NA
DF1[[col_name_v]] <- NA
index_v <- index_v + 1
}
# This part writes the values of each DF2 Dimension_Variable KPI values stored in ROWS (12 per ID) to DF1 across 12 COLUMNS
# Merge the long DF2 ROW based KPIs into the wide DF1 COLUMN based KPIs
index_v <- 1
while(index_v <= nrow(DF1)) {
# Identify the key fields we need for the lookup
datetime_v <- as.Date(DF1$Date_Time[index_v])
id_v <- DF1$ID[index_v]
# Create a temp data.frame for the related data
data_df <- subset(DF2, Date_Time == datetime_v & ID == id_v)
# Do we even have records?
if (nrow(data_df) !=0) {
# cycle through and write each value
index2_v <- 1
while(index2_v <= nrow(data_df)) {
# Create the column name
col_name_v <- paste(data_df$Dimension[index2_v], data_df$Variable[index2_v],sep="_")
col_val_v <- data_df$Index[index2_v]
# Write the values related to the column name
DF1[[col_name_v]][index_v] <- col_val_v
index2_v <- index2_v + 1
}
} else {
print(sprintf("No records for Date: %s ID: %s", datetime_v, id_v))
}
index_v <- index_v + 1
}
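No answer to this question appears here. As one possible direction, and only as a sketch under assumptions, the row-by-row loop could be replaced by computing all row and column positions up front and filling a matrix in a single indexed assignment. The miniature DF1 and DF2 below are made up and contain only the columns the sketch needs; whether this is fast enough on the real data would have to be measured.

```r
# Hypothetical miniature versions of DF1 (wide) and DF2 (long)
DF1 <- data.frame(Date_Time = as.Date(c("2020-09-07", "2020-09-08")),
                  ID = c(33001, 33002),
                  KPI1 = c(43, 7))
DF2 <- data.frame(Date_Time = as.Date(rep("2020-09-07", 3)),
                  ID = 33001,
                  Dimension = c("OVERALL", "LOCATION", "GROUP1"),
                  Variable = "NEWKPINAME1",
                  Index = c(0.003, 0.004, 0.002))

# New column name for every DF2 row, built once instead of inside a loop
DF2$col_name <- paste(DF2$Dimension, DF2$Variable, sep = "_")
new_cols <- unique(DF2$col_name)

# Target row in DF1 (matched on date and ID) and target column for each DF2 row
row_idx <- match(paste(DF2$Date_Time, DF2$ID), paste(DF1$Date_Time, DF1$ID))
col_idx <- match(DF2$col_name, new_cols)

# Fill an NA matrix with one vectorized assignment, skipping unmatched rows
wide <- matrix(NA_real_, nrow = nrow(DF1), ncol = length(new_cols),
               dimnames = list(NULL, new_cols))
keep <- !is.na(row_idx)
wide[cbind(row_idx[keep], col_idx[keep])] <- DF2$Index[keep]

# Attach the new KPI columns to DF1
DF1 <- cbind(DF1, wide)
```

The column names are still derived from the data, so nothing has to be hard-coded when new Dimension or Variable values appear.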

Searching and Replacing between Two Data Frames with Apply Family

I'm trying to analyze a large set of data, so I can't use for loops to search for IDs from one data frame in the other and replace the text.
Basically, the first data frame has IDs but no names. The names are in the other data frame.
(Edit) Input dfs
(Edit) df1
ID------Name
1,2,3---NA
4,5-----NA
6-------NA
(Edit) df2
ID------Name
1-------John
2-------John
3-------John
4-------Stacy
5-------Stacy
6-------Alice
(Edit) Expected output df
ID------Name
1,2,3---John
4,5-----Stacy
6-------Alice
(Edit) Please note that this is a very simplified version. df1 actually has 63 columns and 8551 rows; df2 has 5 columns and 37291 rows.
I can search for the IDs and get names from the second data frame like this. It's super fast!
namer <- function(df2, ids) {
ids <- gsub(',', '|', ids);
names <- df2[which(apply(df2, 1, function(x) any(grepl(ids, x)))),][['Name']];
if (length(names) != 0) {
return(names[[1]]);
} else {
return(NA);
}
}
But I can't do the replacement using the apply family. I know how to do it with a for loop, but it's super slow because I have around 8500 rows in the first data frame.
for (k in 1:nrow(df1)) {
df1$Name[k] <- namer(df2, df1$ID[k]);
}
Can you please help me convert the for loop into apply functions as well, to speed it up?
Thanks in advance.

You can try
df1$Name <- sapply(as.character(df1$ID),
function(x) paste(unique(df2[match(strsplit(x, ",")[[1]], df2$ID), "Name"]), collapse = ","))
df1
# ID Name
# 1 1,2,3 John
# 2 4,5 Stacy
# 3 6 Alice
That said, I doubt sapply will be faster than a for loop. I've also added the paste function in case more than one name is matched for an entry in df1$ID.
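If every ID listed in a row of df1 maps to the same name, as in the example data, a named lookup vector avoids calling a function per row. This is a sketch under that assumption; like namer in the question, it only uses the first ID of each row. The names lookup and first_id are made up for this example.

```r
# Example data from the question
df1 <- data.frame(ID = c("1,2,3", "4,5", "6"), Name = NA,
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:6,
                  Name = c("John", "John", "John", "Stacy", "Stacy", "Alice"),
                  stringsAsFactors = FALSE)

# Named vector: the ID (as text) is the name, the person's name is the value
lookup <- setNames(df2$Name, df2$ID)

# First ID of every row, extracted for all rows at once
first_id <- sub(",.*", "", df1$ID)

# Vectorized lookup; IDs missing from df2 would give NA
df1$Name <- unname(lookup[first_id])
df1
```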

removing duplicate subsets of rows

I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates on which there was a change to the index.
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- TRUE
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output are unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thanks for your help, and any additional general advice on loops is appreciated!

It is not clear what you want to do, but I guess:
df[c(1,diff(df$num)) !=0,]
name num num2
1 Joe 1 1
2 Mary 2 1
4 Frank 1 1
5 Carol 2 1
8 Jay 3 1
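The one-liner above only looks at the num column. If a change in any of several columns should count, as the ChangeDay function in the question suggests, each row can be compared with the previous one across all of those columns without a loop. This is a sketch of that idea; the vector cols and the names cur, prev and changed are chosen for this example.

```r
name <- c("Joe", "Mary", "Sue", "Frank", "Carol", "Bob", "Kate", "Jay")
num <- c(1, 2, 2, 1, 2, 2, 2, 3)
num2 <- c(1, 1, 1, 1, 1, 1, 1, 1)
df <- data.frame(name, num, num2)

# Columns that define whether the index changed
cols <- c("num", "num2")

# Rows 2..n and rows 1..n-1 as matrices, aligned for comparison
cur <- as.matrix(df[-1, cols, drop = FALSE])
prev <- as.matrix(df[-nrow(df), cols, drop = FALSE])

# Always keep the first row; keep later rows where any column differs
changed <- c(TRUE, rowSums(cur != prev) > 0)
result <- df[changed, ]
result
```

Because the result is built by logical indexing, there is no need to pre-allocate storage or to know the number of output rows in advance.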
