Combine and aggregate multiple data.frames - r

I have a collection of .csv files, each with the same number of rows and columns. Each file contains observations (column 'value') of some test subjects characterised by A, B and C, and takes a form similar to the following:
A B C value
1 1 1 0.5
1 1 2 0.6
1 2 1 0.1
1 2 2 0.2
. . . .
Suppose each file is read into a separate data frame. What would be the most efficient way to combine these data frames into a single data frame in which the 'value' column contains the means or, more generally, the results of some function call over all 'value' rows for a given test subject? Columns A, B and C are constant across all files and can be viewed as keys for these observations.
Thank you for your help.

This should be pretty easy, assuming that the files are all ordered in the same way:
dflist <- lapply(dir(pattern = '\\.csv$'), read.csv)
# row means:
rowMeans(do.call('cbind', lapply(dflist, `[`, 'value')))
# other function `myfun` applied to each row:
apply(do.call('cbind', lapply(dflist, `[`, 'value')), 1, myfun)
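
Because this approach depends on every file listing the keys in the same order, it may be worth checking that assumption before averaging. The following is a minimal sketch, not part of the original answer: the two small data frames are made-up stand-ins for the files, and key_cols is a hypothetical helper variable.

```r
# Made-up stand-ins for two of the data frames read from the .csv files
dflist <- list(
  data.frame(A = c(1, 1, 1, 1), B = c(1, 1, 2, 2), C = c(1, 2, 1, 2),
             value = c(0.5, 0.6, 0.1, 0.2)),
  data.frame(A = c(1, 1, 1, 1), B = c(1, 1, 2, 2), C = c(1, 2, 1, 2),
             value = c(0.7, 0.8, 0.3, 0.4))
)
key_cols <- c("A", "B", "C")  # hypothetical name for the key columns

# Check that every data frame has the same keys in the same row order
same_keys <- all(vapply(dflist[-1], function(d) {
  isTRUE(all.equal(d[key_cols], dflist[[1]][key_cols],
                   check.attributes = FALSE))
}, logical(1)))
stopifnot(same_keys)

# One column of values per file, then the mean across files for each row
values <- do.call(cbind, lapply(dflist, `[[`, "value"))
combined <- cbind(dflist[[1]][key_cols], value = rowMeans(values))
combined
```

If the check fails, the key-based approach in the next answer is the safer choice.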

Here is another solution for the case where the keys might be in any order, or might be missing:
n <- 10    # number of csv files to create
obs <- 10  # number of observations per file
# create test files
for (i in 1:n) {
  df <- data.frame(A = sample(1:3, obs, TRUE),
                   B = sample(1:3, obs, TRUE),
                   C = sample(1:3, obs, TRUE),
                   value = runif(obs))
  write.csv(df, file = tempfile(fileext = '.csv'), row.names = FALSE)
}
# read in the data (the pattern is a regular expression, not a glob)
input <- lapply(list.files(tempdir(), "\\.csv$", full.names = TRUE), read.csv)
# put the data frames together, then compute the mean for each unique combination
# of A, B & C, assuming that they could be in any order.
input <- do.call(rbind, input)
result <- lapply(split(input, list(input$A, input$B, input$C), drop = TRUE),
                 function(sect) {
                   sect$value[1L] <- mean(sect$value)
                   sect[1L, ]
                 })
# create output DF
result <- do.call(rbind, result)
result
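
As a possible shortcut for the split/lapply step, base R's aggregate() can compute a per-key summary directly from the stacked data frame. This is a minimal sketch under the same assumptions (keys A, B, C, and a numeric value column); the small input data frame is made up for illustration.

```r
# Made-up stacked input: two "files" whose keys are in different orders
input <- rbind(
  data.frame(A = c(1, 1, 2), B = c(1, 2, 1), C = c(1, 1, 1),
             value = c(0.5, 0.1, 0.9)),
  data.frame(A = c(2, 1), B = c(1, 1), C = c(1, 1),
             value = c(0.7, 0.3))
)

# One row per unique (A, B, C) combination; FUN can be any summary function
result <- aggregate(value ~ A + B + C, data = input, FUN = mean)
result
```

Swapping mean for another function (for example median, or a user-defined myfun) gives the more general case asked about in the question.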

Related

Nested for loop leading to: Error in `[<-.data.frame`(`*tmp*`, ...): replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for (i in dfs) {
  for (y in tax.dfs) {
    i[, 1] <- y[, 2]
  }
}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows; I actually used dfs to create tax.dfs by applying the tnrs_match_names function from the rotl package.
Any suggestions on how to fix this error, or on another way to do what I need, will be greatly appreciated.
Thank You!
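
The nested loop in the question pairs every data frame in dfs with every data frame in tax.dfs, not just the matching one, which is why the row counts differ. For what it is worth, to iterate over two lists in parallel (the i-th element of one with the i-th element of the other), the following works: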
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
  one[, 1] <- two[, 2]
  return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
# [[1]]
#   a b
# 1 B 2
#
# [[2]]
#   c d
# 1 D 4
#
# [[3]]
#   e f
# 1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
  if (nrow(one) != nrow(two)) return('Wrong dimensions')
  one[, 1] <- two[, 2]
  return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
# [[1]]
#   a b
# 1 B 2
#
# [[2]]
#   c d
# 1 D 4
#
# [[3]]
# [1] "Wrong dimensions"
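
If a plain loop is preferred, the same pairing can be done by position. This is a minimal sketch, assuming both lists are in the same order; the small data frames and their column names (id, count, search, name) are made-up stand-ins for the question's dfs and tax.dfs. Note that a loop variable such as i in for (i in dfs) is only a copy, so the assignment has to go through dfs[[i]] to change the list.

```r
# Made-up stand-ins for the question's dfs and tax.dfs lists
dfs <- list(
  birds   = data.frame(id = 1:2, count = c(10, 20)),
  mammals = data.frame(id = 1:3, count = c(5, 6, 7))
)
tax.dfs <- list(
  birds   = data.frame(search = c("b1", "b2"),
                       name = c("Bird one", "Bird two"),
                       stringsAsFactors = FALSE),
  mammals = data.frame(search = c("m1", "m2", "m3"),
                       name = c("Mammal one", "Mammal two", "Mammal three"),
                       stringsAsFactors = FALSE)
)

# Walk both lists by position and write the result back into dfs itself
for (i in seq_along(dfs)) {
  stopifnot(nrow(dfs[[i]]) == nrow(tax.dfs[[i]]))  # fail early on a mismatch
  dfs[[i]][, 1] <- tax.dfs[[i]][, 2]
}
dfs
```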

Get the mode and its frequency in a factor variable when there is a tie

I am looking for the most frequent values (character strings) in each column, together with their frequencies.
The intended result is a data frame with three columns:
char: the names of the original columns
mode: the most frequent value in each char
freq: the frequency of the modes
When there is a tie in frequencies, I want to put all of the tied values in one cell, separated by commas. Or is there a better representation?
Questions: I don't know how to deal with a tie.
I have used the table() function to get the frequency tables of each column.
clean <- read.xlsx("test.xlsx", sheet = "clean") %>% as_tibble()
freqtb <- apply(clean, 2, table)
Here is the second table I got in freqtb:
$休12
个 休 天 饿
1 33 2 1
Then I looped through the tables:
freq <- vector()
mode <- vector()
for (tb in freqtb) {
max = max(tb)
name = names(tb)[tb==max]
freq <- append(freq, max)
mode <- append(mode, name)
}
results <- data.frame(char = names(freqtb), freq = freq, mode=mode)
The mode vector is longer than the other vectors, so it cannot be attached to results. I bet this is due to ties.
How can I get the same length for this "mode" variable?

You can make some small modifications to the code here to get a Mode function that returns every tied value. Then lapply over the columns of your data frame and rbind the results together:
options(stringsAsFactors = FALSE)
set.seed(2)
df.in <- data.frame(
  a = sample(letters[1:3], 10, TRUE),
  b = sample(1:3, 10, TRUE),
  c = rep(1:2, 5))
Mode <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))   # count of each unique value
  ind <- which(tab == max(tab))   # all values tied for the maximum
  data.frame(char = ux[ind], freq = tab[ind])
}
do.call(rbind, lapply(df.in, Mode))
# char freq
# a c 4
# b 1 4
# c.1 1 5
# c.2 2 5
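
The output above gives one row per tied value. To get the layout requested in the question instead (one row per column, with ties joined by commas in a single cell), something like the following might work. It is a minimal sketch: mode_row is a hypothetical helper name, and the small data frame is made up for illustration.

```r
# Hypothetical helper: one row per input column, ties collapsed into one cell
mode_row <- function(x) {
  tab <- table(x)                      # frequency table (NA values are dropped)
  top <- names(tab)[tab == max(tab)]   # every value tied for the maximum
  data.frame(mode = paste(top, collapse = ", "),
             freq = as.integer(max(tab)),
             stringsAsFactors = FALSE)
}

# Made-up data: column b has a tie between "p" and "q"
df.in <- data.frame(a = c("x", "y", "x", "z"),
                    b = c("p", "q", "p", "q"),
                    stringsAsFactors = FALSE)

res <- do.call(rbind, lapply(df.in, mode_row))
results <- data.frame(char = rownames(res), res, row.names = NULL,
                      stringsAsFactors = FALSE)
results
```

Because every column now contributes exactly one row, char, mode and freq always have the same length.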

In R, how can I copy rows from one dataframe to another when the df being copied to has 2 additional columns?

I have a tab delimited text file with 12 columns that I am uploading to my program. I go on to create another dataframe with a structure similar to the one uploaded and add 2 more columns to it.
excelfile = read.delim(ExcelPath)
matchedPictures<- excelfile[0,]
matchedPictures$beforeName <- character()
matchedPictures$afterName <- character()
Now I have a function in which I do the following:
1. Based on a condition, I obtain the row number pictureMatchNum of the row I need to copy from excelfile to matchedPictures.
2. I should then copy the row from excelfile to matchedPictures. I have tried a couple of different ways so far.
a.
rowNumber = nrow(matchedPictures) + 1
matchedPictures[rowNumber,1:12] <<- excelfile[pictureMatchNum,1:12]
b.
matchedPictures[rowNumber,1:12] <<- rbind(matchedPictures, excelfile[pictureWordMatches,1:12], make.row.names = FALSE)
2a. doesn't seem to work because it copies the indices from excelfile and uses them as row names in matchedPictures, which is why I decided to go with rbind.
2b. doesn't seem to work because rbind needs the columns to be identical, and matchedPictures has 2 extra columns.
EDIT START - Including reproducible example.
Here is some reproducible code (with fewer columns and fake data)
excelfile <- data.frame(x = letters, y = words[length(letters)], z= fruit[length(letters)] )
matchedPictures <- excelfile[0,]
matchedPictures$beforeName <- character()
matchedPictures$afterName <- character()
pictureMatchNum1 = match(1, str_detect("A", regex(excelfile$x, ignore_case = TRUE)))
rowNumber1 = nrow(matchedPictures) + 1
pictureMatchNum2 = match(1, str_detect("D", regex(excelfile$x, ignore_case = TRUE)))
rowNumber2 = nrow(matchedPictures) + 1
The 2 options I tried are
2a.
matchedPictures[rowNumber1,1:3] <<- excelfile[pictureMatchNum1,1:3]
matchedPictures[rowNumber1,"beforeName"] <<- "xxx"
matchedPictures[rowNumber1,"afterName"] <<- "yyy"
matchedPictures[rowNumber2,1:3] <<- excelfile[pictureMatchNum2,1:3]
matchedPictures[rowNumber2,"beforeName"] <<- "uuu"
matchedPictures[rowNumber2,"afterName"] <<- "www"
OR
2b.
matchedPictures[rowNumber1,1:3] <<- rbind(matchedPictures, excelfile[pictureMatchNum1,1:3], make.row.names = FALSE)
matchedPictures[rowNumber1,"beforeName"] <<- "xxx"
matchedPictures[rowNumber1,"afterName"] <<- "yyy"
matchedPictures[rowNumber2,1:3] <<- rbind(matchedPictures, excelfile[pictureMatchNum2,1:3], make.row.names = FALSE)
matchedPictures[rowNumber2,"beforeName"] <<- "uuu"
matchedPictures[rowNumber2,"afterName"] <<- "www"
EDIT END
Additionally, I have seen the suggestion in many places that, rather than growing an empty data frame, one should append data to vectors and then combine them into a data frame. Is this suggestion valid when I have so many columns that I would need 14 separate vectors and would have to copy each one individually?
What can I do to make this work?

You could:
1. first determine the row indices of excelfile that match your criteria
2. extract these rows
3. then generate the data to fill your columns beforeName and afterName
4. then append these columns to your new data frame
Example:
library(stringr)  # provides str_detect, regex, and the words and fruit vectors
excelfile <- data.frame(x = letters, y = words[length(letters)],
                        z = fruit[length(letters)])
## Vector of patterns:
patternVec <- c("A", "D", "M")
## Look for appropriate rows in file 'excelfile':
indexVec <- vapply(patternVec,
                   function(myPattern) which(str_detect(myPattern,
                       regex(excelfile$x, ignore_case = TRUE))),
                   integer(1))
## Extract these rows:
matchedPictures <- excelfile[indexVec,]
## Somehow generate the data for columns 'beforeName' and 'afterName':
## I do not know how this information is generated so I just insert
## some dummy code here:
beforeNameVec <- c("xxx", "uuu", "mmm")
afterNameVec <- c("yyy", "www", "nnn")
## Then assign these variables:
matchedPictures$beforeName <- beforeNameVec
matchedPictures$afterName <- afterNameVec
matchedPictures
# x y z beforeName afterName
# a air dragonfruit xxx yyy
# d air dragonfruit uuu www
# m air dragonfruit mmm nnn

You can make this much simpler by using dplyr:
library(dplyr)
library(stringr)
excelfile <- data.frame(x = letters, y = words[length(letters)], z = fruit[length(letters)],
                        stringsAsFactors = FALSE)  # add stringsAsFactors to have character columns
pictureMatch <- excelfile %>%
  # create a match column
  mutate(match = ifelse(str_detect(x, "a") | str_detect(x, 'd'), 1, 0)) %>%
  # filter to only the rows that match your condition
  filter(match == 1)
pictureMatch <- pictureMatch[['x']]  # convert to a vector
matchedPictures <- excelfile %>%
  filter(x %in% pictureMatch) %>%  # grab the rows that match your condition
  mutate(beforeName = c('xxx', 'uuu'),  # add your names
         afterName = c('yyy', 'www'))
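
On the side question about growing an empty data frame: when matches are found one at a time inside a function, one common base R pattern is to collect one-row data frames in a list and bind them once at the end, which avoids both the row-name problem and the column mismatch. This is a minimal sketch, not taken from either answer: the excelfile below is a made-up two-column stand-in, and the matches list (with its row, before and after fields) is hypothetical.

```r
# Made-up stand-in for the uploaded file
excelfile <- data.frame(x = letters, y = LETTERS, stringsAsFactors = FALSE)

# Hypothetical matches: the row number found for each picture, plus the
# two extra values that belong to it
matches <- list(
  list(row = 1L, before = "xxx", after = "yyy"),
  list(row = 4L, before = "uuu", after = "www")
)

# Build one complete row (original columns + 2 extra) per match
rows <- lapply(matches, function(m) {
  data.frame(excelfile[m$row, , drop = FALSE],
             beforeName = m$before,
             afterName = m$after,
             stringsAsFactors = FALSE)
})

# Bind once at the end and reset the row names inherited from excelfile
matchedPictures <- do.call(rbind, rows)
rownames(matchedPictures) <- NULL
matchedPictures
```

This keeps all 14 columns together in each row, so there is no need for 14 separate vectors.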

Create combinations of measurements concatenated using underscore

I have a dataframe df1
ID <- c("A","B","C")
Measurement <- c("Length","Height","Breadth")
df1 <- data.frame(ID,Measurement)
I am trying to create every ordering of the measurements, joined with underscores, and add each one as a new row with the ID "ALL".
Here is my desired output
ID Measurement
A Length
B Height
C Breadth
ALL Length_Height_Breadth
ALL Length_Breadth_Height
ALL Breadth_Height_Length
ALL Breadth_Length_Height
ALL Height_Length_Breadth
ALL Height_Breadth_Length
Also, when the "Measurement" column contains identical measurements, I want each one to appear only once, so that no underscore is needed.
For example:
ID <- c("A","B")
Measurement <- c("Length","Length")
df2 <- data.frame(ID,Measurement)
Then I would want the desired output to be
ID Measurement
A Length
B Length
ALL Length
I am trying to do something like this which is totally wrong
df1$ID <- paste(df1$Measurement, df1$Measurement, sep="_")
Can someone point me in the right direction to achieving the above outputs?
I would like to see how it is done programmatically instead of using the actual measurement names. I am intending to apply the logic to a larger dataset that has several measurement names and so a general solution would be much appreciated.

We could use the permn function from the combinat package:
library(combinat)
sol_1 <- sapply(permn(unique(df1$Measurement)),
FUN = function(x) paste(x, collapse = '_'))
rbind.data.frame(df1, data.frame('ID' = 'All', 'Measurement' = sol_1))
# ID Measurement
# 1 A Length
# 2 B Height
# 3 C Breadth
# 4 All Length_Height_Breadth
# 5 All Length_Breadth_Height
# 6 All Breadth_Length_Height
# 7 All Breadth_Height_Length
# 8 All Height_Breadth_Length
# 9 All Height_Length_Breadth
sol_2 <- sapply(permn(unique(df2$Measurement)),
FUN = function(x) paste(x, collapse = '_'))
rbind.data.frame(df2, data.frame('ID' = 'All', 'Measurement' = sol_2))
# ID Measurement
# 1 A Length
# 2 B Length
# 3 All Length
Giving credit where credit is due: Generating all distinct permutations of a list.
We could also use permutations from the gtools package (HT @joel.wilson):
library(gtools)
unique_meas <- as.character(unique(df1$Measurement))
apply(permutations(length(unique_meas), length(unique_meas), unique_meas),
1, FUN = function(x) paste(x, collapse = '_'))
# "Breadth_Height_Length" "Breadth_Length_Height"
# "Height_Breadth_Length" "Height_Length_Breadth"
# "Length_Breadth_Height" "Length_Height_Breadth"
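
If adding a package is not an option, the permutations can also be generated with a short recursive function in base R. This is a minimal sketch: permute and add_all_rows are hypothetical helper names, and the "ALL" label follows the question's desired output. The recursion is fine for a handful of measurements, but the number of rows grows factorially with the number of distinct measurements.

```r
# Hypothetical helper: every ordering of the elements of v, as a list
permute <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Put v[i] first, followed by each ordering of the remaining elements
    for (p in permute(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Hypothetical helper: append one "ALL" row per ordering of the unique values
add_all_rows <- function(df) {
  meas <- unique(as.character(df$Measurement))
  combos <- vapply(permute(meas), paste, character(1), collapse = "_")
  rbind(df, data.frame(ID = "ALL", Measurement = combos,
                       stringsAsFactors = FALSE))
}

df1 <- data.frame(ID = c("A", "B", "C"),
                  Measurement = c("Length", "Height", "Breadth"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("A", "B"),
                  Measurement = c("Length", "Length"),
                  stringsAsFactors = FALSE)

out1 <- add_all_rows(df1)  # 3 original rows + 6 orderings
out2 <- add_all_rows(df2)  # 2 original rows + a single "Length" row
```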

R parsing database dumps to find joins

I have a bunch of database dumps that I am trying to make sense of.
I have read my files into a list of data frames using:
filenames <- list.files(path = "./ref_tables", pattern = "\\.csv$", full.names = TRUE)
file_read <- lapply(filenames, read.csv)
How can I build a list of database columns to try to identify what joins to what?
As a start, I want to build a list of the distinct column names across all the data frames in my list.

You could start by finding the most common column names in the list. Using some example data:
file_read <- list(data.frame(id=rep(c("a","b","c"),each=3), x=c(1,3,6), w = 1:9),
data.frame(id=rep(c("a","b","c"),each=3), x=c(2,4,7), y = 10:18),
data.frame(id=rep(c("b","c","d"),each=3), t=c(4,8,0), x=c(5,6,7), z = 1:9)
)
For the distinct column names:
distinctColumns <- unique(unlist(lapply(file_read, names)))
Now to count the number of times each column name appears:
table(unlist(lapply(file_read, names)))
## id t w x y z
## 3 1 1 3 1 1
EDIT:
There's probably a more efficient way of doing this, but here is the easiest/fastest way I could think of finding which tables have a specific column name:
# for each data frame, the positions in distinctColumns of the columns it contains
idx <- lapply(file_read, function(x) which(distinctColumns %in% names(x)))
# repeat each data frame's number once per column it contains
listElements <- rep(seq_along(idx), lengths(idx))
names(listElements) <- distinctColumns[unlist(idx)]
df <- data.frame(colNames = names(listElements), dfNumber = listElements)
df[df$colNames=="id",]
## colNames dfNumber
## id 1
## id 2
## id 3
df[df$colNames=="z",]
## colNames dfNumber
## z 3
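
Building on the same idea, the column names that occur in more than one table are the natural first candidates for join keys. This is a minimal sketch using the example data from this answer; col_map, tables_by_column and shared are hypothetical names. A shared name only suggests a join, so the values would still need to be compared before relying on it.

```r
file_read <- list(
  data.frame(id = rep(c("a", "b", "c"), each = 3), x = c(1, 3, 6), w = 1:9),
  data.frame(id = rep(c("a", "b", "c"), each = 3), x = c(2, 4, 7), y = 10:18),
  data.frame(id = rep(c("b", "c", "d"), each = 3), t = c(4, 8, 0),
             x = c(5, 6, 7), z = 1:9)
)

# One row per (column name, data frame number) pair
col_map <- data.frame(
  colNames = unlist(lapply(file_read, names)),
  dfNumber = rep(seq_along(file_read), lengths(file_read)),
  stringsAsFactors = FALSE
)

# For each column name, the data frames that contain it
tables_by_column <- split(col_map$dfNumber, col_map$colNames)

# Column names present in more than one data frame: candidate join keys
shared <- names(which(table(col_map$colNames) > 1))
shared
```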
