Looping through column names to calculate new columns in R

Basically I have 2 tables with the same column names and want to do calculations across tables. Ideally, I would have taken data from the two tables and created a third, but I could only find a way to do that if the data tables are the same dimensions because it would be by cell position. Instead, I'd like to do it by column name after having done a join so that I know that the calculations are taking from the correct values.
I am trying to loop through column names to do calculations between associated columns in the same data table. I have two lists of column names that I use to call columns from a table where I've joined the two tables; I've adjusted the column name lists to add the "_A" and "_B" suffixes that were added during the join because the columns had the same names. I'm trying to call the column names using [[i]] (in this case I am using [[1]] to test it).
Does anyone know why I can't call the column name in the name$colname format? If I replace the variable with the name, it works, and if I take just the variable (colnameslistInf[[1]]) it shows the right column name, but once I put it together it says "Unknown or uninitialised column".
> joininfsup$colnameslistInf[[1]]
NULL
Warning message:
Unknown or uninitialised column: `colnameslistInf`.
> colnameslistInf[[1]]
[1] "newName.x"
> joininfsup$newName.x
[1] 5 5 5 5 5 5 5 5 5 5 5
[12] 5 5 5 5 5 5 5 5 5 5 5
[23] 5 5 5 5 5 5 5 5 5 5 5
[34] 5 5 5 5 5 5 5 5 5 5 5
[45] 5 5 5 5 5 5 5 5 5 5 5
I am also getting this error:
Error in `[[<-.data.frame`(`*tmp*`, col, value = integer(0)) :
replacement has 0 rows, data has 264
The code I am trying to run is here. joininfsup is the joined table, and I use mutate to create new columns with the calculations across each of the 200+ columns and its associated column.
joined_day_inf_numeric <- select_if(joined_day_inf, is.numeric)
joined_day_sup_numeric <- select_if(joined_day_sup, is.numeric)
joininfsup <- left_join(joined_day_inf_numeric, joined_day_sup_numeric, by = "JOININF", suffix = c("_A", "_B"))
#take colnames from original tables and add _A and _B as those are added during the join
colnameslistInf <- paste0(colnames(joined_day_inf_numeric), "_A")
colnameslistSup <- paste0(colnames(joined_day_sup_numeric), "_B")
for (i in 1:length(colnameslistInf)) {  # 245 cols, for example
  name <- paste0(colnames(joined_day_inf_numeric)[[i]])  # names of new columns as the loop runs
  joininfsup2 <- joininfsup %>%
    mutate(!!name := ((joininfsup[[ colnameslistInf[[i]] ]]) - joininfsup[[ colnameslistSup[[i]] ]])) * joininfsup$proportion_A + joininfsup[[ colnameslistInf[[i]] ]]
  write_csv(joininfsup2, paste0("test/finalcalc.csv"))
}
I think this might be the key but am having trouble applying it: Use dynamic name for new column/variable in `dplyr`
UPDATE: I replaced name in the mutate function with !!name := and the code ran! But it gave me the same output as the original joined table because I'm still getting the "Unknown or uninitialised column: colnameslistInf." warning.
UPDATE2: added missing join code, needed to save the variable in the for loop, and added [[]] according to @Parfait's suggestion -- but the code still does not work (it does not add any new columns).
UPDATE3:
I tried @Parfait's common_columns method but got an error:
Error: Can't subset columns that don't exist. x Columns 8_, 50_, 51_, 55_, 78_, etc. don't exist.
These columns were removed at the is.numeric step, so I'm not sure why it is pulling from the original dataset. Also, using match deletes a bunch of other columns that have character names.

In R, when referencing names with the $ operator, identifiers are interpreted literally, which would require a column actually named "colnameslistInf[[1]]" (and even that would fail without backticks). However, the extract operator, [[, can interpret dynamic variables:
joininfsup[[ colnameslistInf[[1]] ]]
Additionally, mutate also takes identifiers literally. Hence, in each iteration of the loop, you were assigning and re-assigning to a column literally named name. But you resolved it with the double bang operator, !!.
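For illustration, a minimal sketch of that difference with made-up columns (the column names here are hypothetical, not from the question's data):
library(dplyr)
df <- tibble(x_A = 1:3, x_B = 4:6)
colname <- "x_A"
df$colname                               # NULL plus "Unknown or uninitialised column" warning
df[[colname]]                            # 1 2 3 -- [[ evaluates the variable
new_name <- "x_diff"
df %>% mutate(new_name = x_A - x_B)      # adds a column literally called "new_name"
df %>% mutate(!!new_name := x_A - x_B)   # adds the column "x_diff"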
However, consider avoiding the column-by-column loop and calculating your formula on a block of columns with matrix-style arithmetic. Specifically, adjust the default suffix in the dplyr join (or the suffixes argument in base::merge), then reassign the non-underscored columns, and finally remove the underscored columns. The code below assumes your join operation; adjust the type of join and the by argument as needed.
joined_day_inf_numeric <- select_if(joined_day_inf, is.numeric)
joined_day_sup_numeric <- select_if(joined_day_sup, is.numeric)
common_columns <- intersect(
  colnames(joined_day_inf_numeric), colnames(joined_day_sup_numeric)
)
common_columns <- common_columns[common_columns != "JOININF"]

joininfsup <- left_join(
  joined_day_inf_numeric, joined_day_sup_numeric, by = "JOININF", suffix = c("", "_")
)
# ASSIGN NON-UNDERSCORED COLUMNS
joininfsup[common_columns] <- (
  (
    joininfsup[common_columns] - joininfsup[paste0(common_columns, "_")]
  ) * joininfsup$proportion + joininfsup[common_columns]
)
# REMOVE UNDERSCORED COLUMNS
joininfsup[paste0(common_columns, "_")] <- NULL
write_csv(joininfsup, paste0("test/finalcalc.csv"))

Related

How to assign unambiguous values for each row in a data frame based on values found in rows from another data frame using R?

I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and then assign an unambiguous identifier to each row based on values found in a second data frame. Here is a toy example.
df1<-data.frame(c(99443975,558,99009680,99044573,599,99172478))
names(df1)<-"Building"
V1<-c(558,134917,599,120384)
V2<-c(4400796,14400095,99044573,4500481)
V3<-c(NA,99009680,99340705,99132792)
V4<-c(NA,99156365,NA,99132794)
V5<-c(NA,99172478,NA, 99181273)
V6<-c(NA, NA, NA,99443975)
row_number<-1:4
df2<-data.frame(cbind(V1, V2,V3,V4,V5,V6, row_number))
The output I expect is what follows.
row_number_assigned<-c(4,1,2,3,3,2)
output<-data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind feature of the which function:
sapply(df1$Building,                       # will send Building entries one-by-one
       function(inp) { which(inp == df2,   # find matching values
                             arr.ind = TRUE)[1] })   # return only the row, not the column
[1] 4 1 2 3 3 2
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much less dangerous method of data frame construction, which also uses fewer keystrokes, would be:
df2 <- data.frame(V1 = c(558, 134917, 599, 120384),
                  V2 = c(4400796, 14400095, 99044573, 4500481),
                  V3 = c(NA, 99009680, 99340705, 99132792),
                  V4 = c(NA, 99156365, NA, 99132794),
                  V5 = c(NA, 99172478, NA, 99181273),
                  V6 = c(NA, NA, NA, 99443975))
(It didn't cause coding errors this time, but if there were any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, can you somehow approach them gently, do their future students a favor, and let them know that cbind() will coerce all of the arguments to the "lowest common denominator"?
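A quick illustration of that coercion (made-up values):
cbind(c(1, 2), c("a", "b"))               # the numbers silently become character
#      [,1] [,2]
# [1,] "1"  "a"
# [2,] "2"  "b"
data.frame(x = c(1, 2), y = c("a", "b"))  # keeps x numeric and y character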
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2 %>%
              pivot_longer(-row_number) %>%
              select(-name),
            by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2

Rename column R

I am trying to rename columns, but I do not know if a given column will be present in the dataset. I have a large data set, and if a certain column name is present I want to rename it. For example:
A B C D E
1 4 5 9 2
3 5 6 9 1
4 4 4 9 1
newNames <- data %>% rename(1=A,2=B,3=C,4=D,5=E)
This works to rename what is in the dataset but I am looking for the flexibility to add more potential name changes, without an error occurring.
newNames2 <- data %>% rename(1=A,2=B,3=C,4=D,5=E,6=F,7=G)
This ^ will not work; it gives me an error because F and G are not in the data set.
Is there any way to write a code to ignore the column change if the name does not exist?
Thanks!
There can be plenty of ways to do this. One would be to create a named vector with the names and their corresponding 'new name' (as the vector's names) and use that, i.e.
# The vector v1 below uses LETTERS as the old names and 1:7 as the new ones
v1 <- setNames(LETTERS[1:7], 1:7)
names(df) <- names(v1)[v1 %in% names(df)]
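For instance, a quick check with the question's example columns (note this assumes the data frame's columns appear in the same order as the lookup vector):
df <- data.frame(A = c(1, 3, 4), B = c(4, 5, 4), C = c(5, 6, 4),
                 D = c(9, 9, 9), E = c(2, 1, 1))
v1 <- setNames(LETTERS[1:7], 1:7)          # values are the old names, names are the new names
names(df) <- names(v1)[v1 %in% names(df)]
names(df)
# [1] "1" "2" "3" "4" "5"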

R rbind error row.names duplicates not allowed

There are other questions here addressing the same issue, but I can't work out how to solve my problem based on them. So, I have 5 data frames whose rows I want to merge into one unique data frame using rbind, but it returns the error:
"Error in row.names<-.data.frame(*tmp*, value = value) :
'row.names' duplicated not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘100’, ‘1000’, ‘10000’, ‘100000’, ‘1000000’, ‘1000001 [....]"
The data frames have the same columns but different numbers of rows. I thought the rbind command took the first column as row.names, so I tried to put a sequential id in the five data frames, but it doesn't work. I've also tried to specify sequential row names across the data frames via row.names(), with no success. The merge command is not an option, I think, because there are 5 data frames and successive merges would overwrite the previous ones. I've created a new data frame with only the ids and tried to join, but the resulting data frame doesn't append the columns of the joined df's.
Follows an extract of df 1:
id image power value pol class
1 1 tsx_sm_hh 0.1834515 -7.364787 hh FR
2 2 tsx_sm_hh 0.1834515 -7.364787 hh FR
3 3 tsx_sm_hh 0.1991938 -7.007242 hh FR
4 4 tsx_sm_hh 0.1991938 -7.007242 hh FR
5 5 tsx_sm_hh 0.2079365 -6.820693 hh FR
6 6 tsx_sm_hh 0.2079365 -6.820693 hh FR
[...]
1802124 1802124 tsx_sm_hh 0.1991938 -7.007242 hh FR
The four other df's have the same structure, except that the 'id' columns don't have duplicated numbers among them. The 'pol' and 'image' columns are defined as factors.
and all.pol <- rbind(df1, df2, df3, df4, df5) returns this row.names duplicated error.
Any idea?
Thanks in advance
I had the same error recently. What turned out to be the problem in my case was that one of the columns of the data frame was a list. After casting it to a basic type (e.g. numeric), rbind worked just fine.
By the way, the row names are the "row numbers" shown to the left of the first variable. In your example they are 1, 2, 3, ... (the same as your id variable).
You can see them with rownames(df) and set them with rownames(df) <- name_vector (name_vector must have one element per row of df and its elements must be unique).
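A sketch of the kind of check and cast described above (the column name "offending" is illustrative, not from the question's data):
sapply(df1, class)                           # look for "list" (or "data.frame") columns
df1$offending <- unlist(df1$offending)       # or as.numeric(...) to cast the list column to a basic type
all.pol <- rbind(df1, df2, df3, df4, df5)    # then rbind as before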
I had the same error.
My problem was that one of the columns in the data frames was itself a data frame, and I couldn't easily find the offending column.
data.table::rbindlist() helped to locate it:
library(data.table)
rbindlist(a)
# Error in rbindlist(a) :
# Column 25 of item 1 is length 2 inconsistent with column 1 which is length 16. Only length-1 columns are recycled.
a[[1]][, 25] %>% class  # "data.frame" -- this should obviously be converted to a regular column or removed
After removing the errant column, do.call(rbind, a) worked as expected.

How to index an element of a list object in R

I'm doing the following in order to import some txt tables and keep them as a list:
# set working directory - the folder where all selection tables are stored
hypo_selections<-list.files() # change object name according to each species
hypo_list<-lapply(hypo_selections,read.table,sep="\t",header=T) # change object name according to each species
I want to access one specific element, let's say hypo_list[1]. Since each element represents a table, how should I proceed to access particular cells (rows and columns)?
I would like to do something like it:
a<-hypo_list[1]
a[1,2]
But I get the following error message:
Error in a[1, 2] : incorrect number of dimensions
Is there a clever way to do it?
Indexing a list is done using double brackets, i.e. hypo_list[[1]] (e.g. have a look here: http://www.r-tutor.com/r-introduction/list). BTW: read.table does not return a table but a data frame (see the Value section in ?read.table). So you will have a list of data frames, rather than a list of table objects. The principal mechanism is identical for tables and data frames, though.
Note: In R, the index for the first entry is a 1 (not 0 like in some other languages).
Dataframes
l <- list(anscombe, iris) # put dfs in list
l[[1]] # returns anscombe dataframe
anscombe[1:2, 2] # access first two rows and second column of dataset
[1] 10 8
l[[1]][1:2, 2] # the same but selecting the dataframe from the list first
[1] 10 8
Table objects
tbl1 <- table(sample(1:5, 50, rep=T))
tbl2 <- table(sample(1:5, 50, rep=T))
l <- list(tbl1, tbl2) # put tables in a list
tbl1[1:2] # access first two elements of table 1
Now with the list
l[[1]] # access first table from the list
1 2 3 4 5
9 11 12 9 9
l[[1]][1:2] # access first two elements in first table
1 2
9 11
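Applied to the question's object, the same fix looks like this (assuming hypo_list was built as shown in the question):
a <- hypo_list[[1]]    # double brackets return the data frame itself
a[1, 2]                # first row, second column of that data frame
hypo_list[[1]][1, 2]   # or in one step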

Read multidimensional group data in R

I have done a lot of googling but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups A, B, C) while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example I tried to read the file and tried to get column mean
dt<-read.table(file_name,head=T) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the mean (column-wise) for each group. Any help?
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt,mean) # works because data.frames are lists
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
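For completeness, a tidyverse sketch of the same summaries (assuming dt was read in as above; this is an addition, not part of the original answer):
library(dplyr)
dt %>% summarise(across(v1:v3, mean))                     # overall column means
dt %>% group_by(Tag) %>% summarise(across(v1:v3, mean))   # column means per group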
