I have a few tables that count the frequency of eye color across three dataframes. I thought I could convert those tables into dataframes, then merge them into a new dataframe. I believe the error comes from the fact that I converted the tables into dataframes and merge seems to append their rows. But I need:
Color Tb1 Tb2 Tb3
Bl 5 0 3
Blk 6 7 0
One complication is that not every dataframe has both Black and Blue eye colors in it. So I need to account for that, then change the NAs to 0s.
I have tried:
chart<-merge(tb1,tb2,tb3)
chart
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
AND
chart<-merge(tb1,tb2,tb3,all=T)
chart
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
Example code:
one<-c('Black','Black','Black','Black','Black','Black','Blue','Blue','Blue','Blue','Blue')
two<-c('Black','Black','Black','Black','Black','Black','Black')
three<-c('Blue','Blue','Blue')
tb1<-table(one)
tb2<-table(two)
tb3<-table(three)
tb1<-as.data.frame(tb1)
tb2<-as.data.frame(tb2)
tb3<-as.data.frame(tb3)
You can convert all tables directly into one tibble using bind_rows from the package dplyr:
# creating the setup given in the question
one<-c('Black','Black','Black','Black','Black','Black','Blue','Blue','Blue','Blue','Blue')
two<-c('Black','Black','Black','Black','Black','Black','Black')
three<-c('Blue','Blue','Blue')
tb1<-table(one)
tb2<-table(two)
tb3<-table(three)
# note that there is no need for individual dataframes
# bind the rows of the given tables into a tibble
result <- dplyr::bind_rows(tb1, tb2, tb3)
# replace NAs with 0 values
result[is.na(result)] <- 0
# check the resulting tibble
result
# # A tibble: 3 x 2
# Black Blue
# <dbl> <dbl>
# 1 6 5
# 2 7 0
# 3 0 3
Doing it your way, I would do something like the following (the column names still need to be corrected afterwards):
newframe <- merge(tb1, tb2, by.x ="one", by.y ="two", all = TRUE)
merge(newframe, tb3, by.x ="one", by.y ="three", all = TRUE)
However, for nicer ways, check the dplyr joins.
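As a rough sketch of how the merge approach could produce the layout asked for in the question, the tables can be given a shared key column before merging. The helper to_counts and the column names Color, Tb1, Tb2 and Tb3 are illustrative choices, not part of the original code:

```r
one   <- c(rep("Black", 6), rep("Blue", 5))
two   <- rep("Black", 7)
three <- rep("Blue", 3)

# Hypothetical helper: tabulate a vector and name the columns so that
# every data frame shares the key column "Color"
to_counts <- function(x, count_name) {
  df <- as.data.frame(table(x), stringsAsFactors = FALSE)
  names(df) <- c("Color", count_name)
  df
}

counts <- list(
  to_counts(one, "Tb1"),
  to_counts(two, "Tb2"),
  to_counts(three, "Tb3")
)

# merge() only takes two data frames at a time, so fold over the list;
# all = TRUE keeps colors that are missing from some of the tables
chart <- Reduce(function(x, y) merge(x, y, by = "Color", all = TRUE), counts)

# colors absent from a table come back as NA; turn them into 0
chart[is.na(chart)] <- 0
chart
```

This also shows why merge(tb1, tb2, tb3) fails: the third positional argument of merge is by, so tb3 is treated as the column specification rather than as a third table.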
Basically I have 2 tables with the same column names and want to do calculations across tables. Ideally, I would have taken data from the two tables and created a third, but I could only find a way to do that if the data tables are the same dimensions because it would be by cell position. Instead, I'd like to do it by column name after having done a join so that I know that the calculations are taking from the correct values.
I am trying to loop through column names to do calculations between various associated columns in the same data table (I have 2 lists of column names, that I am using to call columns from a table where I've joined the two tables. I've adjusted the column name list to add the "_A" and "_B" which were added during the join as the columns had the same names). I'm trying to call the column names using [[i]] (in this case I am using [[1]] to test it).
Does anyone know why I can't call the column name in the name$colname format? If I replace the variable with the name, it works, and if I take just the variable (colnameslistInf[[1]]) it shows the right column name, but once I put it together it says "Unknown or uninitialised column".
> joininfsup$colnameslistInf[[1]]
NULL
Warning message:
Unknown or uninitialised column: `colnameslistInf`.
> colnameslistInf[[1]]
[1] "newName.x"
> joininfsup$newName.x
[1] 5 5 5 5 5 5 5 5 5 5 5
[12] 5 5 5 5 5 5 5 5 5 5 5
[23] 5 5 5 5 5 5 5 5 5 5 5
[34] 5 5 5 5 5 5 5 5 5 5 5
[45] 5 5 5 5 5 5 5 5 5 5 5
I am also getting this error:
Error in `[[<-.data.frame`(`*tmp*`, col, value = integer(0)) :
replacement has 0 rows, data has 264
The code I am trying to run is here. joininfsup is the joined table, and I use mutate to create new columns with the calculations across each of the 200+ columns and its associated column.
joined_day_inf_numeric <- select_if(joined_day_inf, is.numeric)
joined_day_sup_numeric <- select_if(joined_day_sup, is.numeric)
joininfsup <- left_join(joined_day_inf_numeric, joined_day_sup_numeric, "JOININF", suffix = c("_A", "_B"))
#take colnames from original tables and add _A and _B as those are added during the join
colnameslistInf <- paste0(colnames(joined_day_inf_numeric), "_A")
colnameslistSup <- paste0(colnames(joined_day_sup_numeric), "_B")
for (i in 1:length(colnameslistInf)) { #245 cols, for example
name <- paste0(colnames(joined_day_inf_numeric)[[i]]) #names of new columns as loops through
joininfsup2 <-joininfsup %>%
mutate(!!name := ((joininfsup[[ colnameslistInf[[i]] ]])-joininfsup[[colnameslistSup[[i]] ]]))*joininfsup$proportion_A+joininfsup[[ colnameslistInf[[i]] ]]
write_csv(joininfsup2, paste0("test/finalcalc.csv"))
}
I think this might be the key but am having trouble applying it: Use dynamic name for new column/variable in `dplyr`
UPDATE: I replaced name in the mutate function with !!name := and the code ran! But gave me the same output as the original joined table because I'm still getting the "Unknown or uninitialised column: colnameslistInf." warning.
UPDATE2: added the missing join code, saved the result to a variable in the for loop, and added [[]] according to @Parfait's suggestion, but the code still does not work (it does not add any new columns).
UPDATE3:
I tried @Parfait's common_columns method but got an error:
Error: Can't subset columns that don't exist. x Columns 8_, 50_, 51_, 55_, 78_, etc. don't exist.
These columns were removed at the is.numeric step so not sure why it is pulling from the original dataset. Also, using match deletes a bunch of other columns that have characters as names
In R, the $ operator interprets the identifier that follows it literally, so joininfsup$colnameslistInf[[1]] looks for a column actually named colnameslistInf and only then applies [[1]] to the result. The extract operator, [[, evaluates its argument instead, so it can take a dynamic variable:
joininfsup[[ colnameslistInf[[1]] ]]
Additionally, mutate also takes identifiers literally. Hence, in each iteration of the loop, you were assigning and re-assigning to a column literally named name. But you resolved that with the double-bang operator, !!.
However, consider avoiding the loop over columns and applying your formula to a block of columns with matrix-style arithmetic. Specifically, adjust the default suffix in the dplyr join (or the suffixes argument in base::merge), then reassign the non-underscored columns, and finally remove the underscored columns. The code below assumes your join operation. Adjust the type of join and the by argument as needed.
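The difference between the two operators can be seen on a small example. The data frame below is hypothetical toy data; only the column name newName.x comes from the question:

```r
# Hypothetical toy data standing in for the joined table
df <- data.frame(newName.x = c(5, 5, 5), newName.y = c(1, 2, 3))
col_names <- c("newName.x", "newName.y")

# `$` looks for a column literally called "col_names", which does not exist
literal <- df$col_names            # NULL

# `[[` evaluates col_names[[1]] first, then extracts the column with that name
dynamic <- df[[col_names[[1]]]]

# `[[` also works on the left-hand side, so a computed name can be assigned to
new_name <- "difference"
df[[new_name]] <- df[[col_names[[1]]]] - df[[col_names[[2]]]]
df
```

On a plain data.frame the failed `$` lookup silently returns NULL; a tibble additionally emits the "Unknown or uninitialised column" warning shown in the question.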
joined_day_inf_numeric <- select_if(joined_day_inf, is.numeric)
joined_day_sup_numeric <- select_if(joined_day_sup, is.numeric)
common_columns <- intersect(
colnames(joined_day_inf_numeric), colnames(joined_day_sup_numeric)
)
common_columns <- common_columns[common_columns != "JOININF"]
joininfsup <- left_join(
joined_day_inf_numeric, joined_day_sup_numeric, by = "JOININF", suffix = c("", "_")
)
# ASSIGN NON-UNDERSCORED COLUMNS
joininfsup[common_columns] <- (
(
joininfsup[common_columns] - joininfsup[paste0(common_columns, "_")]
) *
joininfsup$proportion + joininfsup[common_columns]
)
# REMOVE UNDERSCORED COLUMNS
joininfsup[paste0(common_columns, "_")] <- NULL
write_csv(joininfsup, paste0("test/finalcalc.csv"))
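To check the block arithmetic on something small, here is a sketch with made-up data that uses base::merge and its suffixes argument instead of dplyr. The tables inf and sup and their values are assumptions for illustration; only JOININF and proportion are names taken from the question:

```r
# Hypothetical stand-ins for the two numeric tables
inf <- data.frame(JOININF = 1:3, proportion = c(0.5, 0.2, 1),
                  a = c(10, 20, 30), b = c(1, 2, 3))
sup <- data.frame(JOININF = 1:3, a = c(8, 10, 30), b = c(0, 2, 5))

# columns present in both tables, excluding the join key
common_columns <- setdiff(intersect(names(inf), names(sup)), "JOININF")

# left join; columns from `sup` that clash get a trailing underscore
joined <- merge(inf, sup, by = "JOININF", all.x = TRUE, suffixes = c("", "_"))

# (inf - sup) * proportion + inf, computed for all common columns at once;
# the proportion vector is applied row by row to every column
joined[common_columns] <- (
  joined[common_columns] - joined[paste0(common_columns, "_")]
) * joined$proportion + joined[common_columns]

# drop the helper columns that came from `sup`
joined[paste0(common_columns, "_")] <- NULL
joined
```

For the first row of column a this gives (10 - 8) * 0.5 + 10 = 11, which is easy to verify by hand before running the same code on the 200+ real columns.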
Suppose a=matrix(c(1,2,3,4),nrow=2,ncol=2) and b=c('name',3). I am trying to merge a and b such that the outcome is [1 3 name 3] in the first row and [2 4] in the second row.
The number of rows differs in each dataframe. Therefore cbind is going to have a hard time merging the data and will by default recycle the shorter dataframe, in this case b.
I would suggest adding the row name as a column and then joining on that. By default, full_join will then generate NA values for dataframes missing that value of the join key. This question is partially a duplicate of Add (not merge!) two data frames with unequal rows and columns so you may find more help there.
# Load packages
library(tidyverse)
library(magrittr) # To use the inplace assignment operator (%<>%)
# Create dataframes
a <- data.frame(1:2,3:4)
b <- merge('name', 3)
# Create rowname column for each dataframe
a %<>% tibble::rownames_to_column()
b %<>% tibble::rownames_to_column()
# Use 'full join' to bind dataframes together
c <- dplyr::full_join(a, b, by = "rowname") %>%
# Remove the rowname column
dplyr::select(-rowname)
# Print c
print(c)
X1.2 X3.4 x y
1 1 3 name 3
2 2 4 <NA> NA
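If adding packages is not an option, a similar result can be sketched in base R, since merge accepts by = "row.names". The column names V1, V2, x and y are chosen here for illustration:

```r
a <- data.frame(V1 = 1:2, V2 = 3:4)
b <- data.frame(x = "name", y = 3)

# join on the row names; all = TRUE keeps row 2 of `a`, which has no partner in `b`
combined <- merge(a, b, by = "row.names", all = TRUE)

# merge() adds the key as a column called "Row.names"; drop it again
combined$Row.names <- NULL
combined
```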
If you are satisfied with a list, not data frame, this will work.
a <- matrix(c(1,2,3,4),nrow=2,ncol=2)
b <- c('name',3)
c <- list(a[,1],a[,2],b[1],b[2] )
If you need a data frame, you have to make the 1st and 2nd row have the same number of columns, by stuffing the gaps with something.
d <- as.data.frame(c)
d[2,3:4] <- NA
I have two dataframes with different columns, each with a large number of rows (about 2 million).
The first one is df1
The second one is df2
I need to match the values in the y column of table one to the R column of table two.
Example:
see the two rows in df1 in red box have matched the two rows in df2 in red box
Then I need to get the score of the matched values
so the result should look like this and it should be stored in a dataframe:
My attempt: I'm a beginner in R, so when I searched I found that I can use the match function or the merge function, but I did not get the result that I want, probably because I did not know how to use them correctly. Therefore, I need a very simple step-by-step solution.
We can use match from base R
# match() returns, for each value of df1$y, its row position in df2$R (0 if absent)
df2[match(df1$y, df2$R, nomatch = 0), c("R", "score")]
# R score
#3 2 3
#4 111 4
Or another option is semi_join from dplyr
library(dplyr)
semi_join(df2[-1], df1, by = c(R = "y"))
# R score
#1 2 3
#2 111 4
merge(df1,df2,by.x="y",by.y="R")[c("y","score")]
y score
1 2 3
2 111 4
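Because the question's data was only shown as images, here is a sketch with made-up data frames that reproduces the output above. The contents of df1 and df2 (including the columns x and id) are assumptions; only the column names y, R and score come from the question:

```r
# Hypothetical data: the values 2 and 111 appear in both df1$y and df2$R
df1 <- data.frame(x = c("a", "b", "c"), y = c(2, 111, 50))
df2 <- data.frame(id = 1:4, R = c(7, 9, 2, 111), score = c(1, 2, 3, 4))

# Step 1: for every row of df2, check whether its R value occurs in df1$y
is_matched <- df2$R %in% df1$y

# Step 2: keep only the matching rows and the columns of interest
result <- df2[is_matched, c("R", "score")]
result
```

Unlike merge, the %in% filter returns each row of df2 at most once, even when a value is repeated in df1$y, which may matter with millions of rows.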
Suppose I have a data frame in R where I would like to use 2 columns "factor1" and "factor2" as factors and I need to calculate mean value for all other columns per each pair of the above mentioned factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back to reduced table whereas there is only one value (mean) per each pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all other columns than the factors, you can use the formula syntax of aggregate()
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data. You were reducing the number of rows for all groups to just one row. Plus, unsplit really should be used with the exact same list of factors that was used to do the split, otherwise groups may get out of order. You could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame if you really wanted, but for a simple mean, aggregate is probably best.
The same result can be obtained with summaryBy() in the doBy package, although it's pretty much the same as aggregate() in this case.
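For completeness, the split / lapply / rbind route mentioned above could look roughly like this. It is one possible sketch, with colMeans picked as the collapsing function:

```r
myDataFrame <- data.frame(factor1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          factor2 = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
                          val1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                          val2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

# drop = TRUE discards factor combinations that never occur (e.g. 1 with 4)
groups <- split(myDataFrame,
                list(myDataFrame$factor1, myDataFrame$factor2),
                drop = TRUE)

# collapse every group to a single row of column means
group_means <- lapply(groups, colMeans)

# stack the rows again with rbind instead of unsplit
MeanValues <- as.data.frame(do.call(rbind, group_means))
rownames(MeanValues) <- NULL
MeanValues
```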
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate?
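# the by argument must be a list; val1 is used here as the value column
aggregate(myDataFrame$val1, by = list(factor1 = myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$val1, by = list(factor2 = myDataFrame$factor2), FUN = mean)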
I am trying to do a simple task, and created a simple example. I would like to add the counts of a taxon recorded in a vector ('introduced', below) to the counts already measured in another vector ('existing'), according to the taxon name. However, when there is a new taxon (present in introduced but not in existing), I would like this taxon and its count to be added as a new entry in the matrix (the order doesn't matter, but the name needs to be retained).
For example:
existing<-c(3,4,5,6)
names(existing)<-c("Tax1","Tax2","Tax3","Tax4")
introduced<-c(2,2)
names(introduced)<-c("Tax1","Tax5")
I want the new matrix, called "combined" here, to look like this:
#names(combined)= c("Tax1","Tax2","Tax3","Tax4","Tax5")
#combined= c(5,4,5,6,2)
The main thing to see is that "Tax1"'s values are combined (3+2=5), "Tax5" (2) is added on to the end
I have looked around but previous answers similar to this have much more complex data and it is difficult to extract which function I need. I have been trying combinations of match and which, but just cannot get it right.
grp <- c(existing,introduced)
tapply(grp,names(grp),sum)
#Tax1 Tax2 Tax3 Tax4 Tax5
# 5 4 5 6 2
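The tapply call works because c() keeps the names of both vectors, so summing by name adds up the duplicated "Tax1" entries. If the original order of the names should be preserved regardless of sorting, one possible alternative is to index by name directly. The variable all_names is introduced here for illustration:

```r
existing <- c(Tax1 = 3, Tax2 = 4, Tax3 = 5, Tax4 = 6)
introduced <- c(Tax1 = 2, Tax5 = 2)

# every taxon seen in either vector, in order of first appearance
all_names <- union(names(existing), names(introduced))

# start from zero counts and add each vector at the positions of its own names
combined <- setNames(numeric(length(all_names)), all_names)
combined[names(existing)] <- combined[names(existing)] + existing
combined[names(introduced)] <- combined[names(introduced)] + introduced
combined
```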
Instead of keeping your data in 'loose' vectors, you may consider collecting them in one data frame. First, put your two sets of vector data in data frames:
existing <- c(3, 4, 5, 6)
taxon <- c("Tax1", "Tax2", "Tax3", "Tax4")
df1 <- data.frame(existing, taxon)
introduced <- c(2, 2)
taxon <- c("Tax1", "Tax5")
df2 <- data.frame(introduced, taxon)
Then merge the two data frames by the common column, 'taxon'. Set all = TRUE to include all rows from both data frames:
df3 <- merge(df1, df2, all = TRUE)
Finally, sum 'existing' and 'introduced' taxon, and add the result to the data frame:
df3$combined <- rowSums(df3[ , c("existing", "introduced")], na.rm = TRUE)
df3
# taxon existing introduced combined
# 1 Tax1 3 2 5
# 2 Tax2 4 NA 4
# 3 Tax3 5 NA 5
# 4 Tax4 6 NA 6
# 5 Tax5 NA 2 2