R: parsing database dumps to find joins

I have a bunch of database dumps that I am trying to make sense of.
I have read my files into a list of data frames using:
filenames <- list.files(path = "./ref_tables", pattern = "\\.csv$", full.names = TRUE)
file_read <- lapply(filenames, read.csv)
How can I build a list of database columns to help identify what joins to what?
As a start, I want to build a list of the distinct column names across all the data frames in my list.

You could start by finding the most common column names in the list:
file_read <- list(data.frame(id=rep(c("a","b","c"),each=3), x=c(1,3,6), w = 1:9),
data.frame(id=rep(c("a","b","c"),each=3), x=c(2,4,7), y = 10:18),
data.frame(id=rep(c("b","c","d"),each=3), t=c(4,8,0), x=c(5,6,7), z = 1:9)
)
For the distinct column names:
distinctColumns <- unique(unlist(lapply(file_read, names)))
Now to count the number of times each column name appears:
table(unlist(lapply(file_read, names)))
## id  t  w  x  y  z
##  3  1  1  3  1  1
EDIT:
There's probably a more efficient way of doing this, but here is the simplest way I could think of to find which tables have a specific column name:
# For each data frame, the positions in distinctColumns of the columns it contains
colPositions <- lapply(file_read, function(x) which(distinctColumns %in% names(x)))
listElements <- NULL
for (i in seq_along(file_read)) {
  # Repeat the data frame number once per column it contains
  listElements <- c(listElements, rep(i, length(colPositions[[i]])))
}
names(listElements) <- distinctColumns[unlist(colPositions)]
df <- data.frame(colNames = names(listElements), dfNumber = listElements)
df[df$colNames=="id",]
## colNames dfNumber
## id 1
## id 2
## id 3
df[df$colNames=="z",]
## colNames dfNumber
## z 3
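The same lookup table can be built without a loop. This is only a sketch, assuming file_read is a list of data frames like the sample above; col_map, counts and shared are names made up for illustration. Column names that appear in more than one data frame are the natural candidates for join keys.

```r
# Sample data standing in for the list of data frames read from CSV.
file_read <- list(
  data.frame(id = rep(c("a", "b", "c"), each = 3), x = c(1, 3, 6), w = 1:9),
  data.frame(id = rep(c("a", "b", "c"), each = 3), x = c(2, 4, 7), y = 10:18),
  data.frame(id = rep(c("b", "c", "d"), each = 3), t = c(4, 8, 0), x = c(5, 6, 7), z = 1:9)
)

# One row per (column name, data frame number) pair.
# lengths() on a list of data frames gives the number of columns of each.
col_map <- data.frame(
  colNames = unlist(lapply(file_read, names)),
  dfNumber = rep(seq_along(file_read), times = lengths(file_read))
)

# Column names that occur in more than one data frame are candidate join keys.
counts <- table(col_map$colNames)
shared <- names(counts)[counts > 1]

# For each shared column, the data frames that contain it.
is_shared <- col_map$colNames %in% shared
split(col_map$dfNumber[is_shared], col_map$colNames[is_shared])
```

A matching name is only a hint that two tables join; the values in those columns still need to be compared.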

Related

Extract and append data to new datasets in a for loop

I have what I think is a really simple question, but I can't figure out how to do it. I'm fairly new to lists, loops, etc.
I have a small dataset:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df
I need to loop through this dataset and create a list of datasets, such that this is the outcome:
[[1]]
one
[[2]]
one
two
[[3]]
one
two
three
This is more or less as far as I've gotten:
blah <- list()
for(i in 1:3){
blah[[i]]<- i
}
The length will be variable when I use this in the future, so I need to automate it in a loop. Otherwise, I would just do
one <- df[1,]
two <- df[2,]
list(one, rbind(one, two))
Any ideas?
You can try using lapply:
result <- lapply(seq(nrow(df)), function(x) df[seq_len(x), , drop = FALSE])
result
#[[1]]
# df
#1 one
#[[2]]
# df
#1 one
#2 two
#[[3]]
# df
#1 one
#2 two
#3 three
#[[4]]
# df
#1 one
#2 two
#3 three
#4 four
seq(nrow(df)) creates a sequence from 1 to the number of rows in your data (4 in this case). The function(x) part is an anonymous function, to which each value from 1 to 4 is passed in turn. seq_len(x) creates a sequence from 1 to x, i.e. 1 to 1 in the first iteration, 1 to 2 in the second, and so on. We use this sequence to subset the rows of the data frame (df[seq_len(x), ]). Because the data frame has only one column, subsetting it would normally simplify the result to a vector; adding drop = FALSE prevents that.
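To see the effect of drop = FALSE in isolation, here is a small sketch using the same one-column data frame. The names as_vector and as_frame are made up for illustration, and seq_len(nrow(df)) is swapped in for seq(nrow(df)) because it also behaves sensibly on an empty data frame.

```r
df <- data.frame(df = c("one", "two", "three", "four"))

# Default subsetting of a one-column data frame simplifies to a plain vector.
as_vector <- df[seq_len(2), ]
# drop = FALSE keeps the data frame structure.
as_frame <- df[seq_len(2), , drop = FALSE]

class(as_vector)  # "character" in R >= 4.0 (a factor in older versions)
class(as_frame)   # "data.frame"

# seq_len() returns an empty sequence for zero rows,
# whereas seq(0) would give c(1, 0).
result <- lapply(seq_len(nrow(df)), function(n) df[seq_len(n), , drop = FALSE])
length(result)
```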
Base R solution:
# Convert the df column to character and store as a new data.frame: str_df => data.frame
str_df <- transform(df, df = as.character(df))
# Build the cumulative subsets: df_list => list of character vectors
df_list <- lapply(seq_len(nrow(str_df)), function(i) {
  str_df[seq_len(i), grep("df", names(str_df))]
})
Data:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df

Subset dataframe using counter that resets to 1 and create dataframe for each subset

I have a dataframe that I need to break into multiple, smaller dataframes.
There is an integer index, which starts at 1 and counts up. When it resets to 1, I need to start creating a new dataframe.
df <- cbind(c(1,2,3,4,5,1,2,3,4), c("a","b","c","d","e","f","g","h","i"))
#end results should be:
df1 <- df[1:5, ]
df2 <- df[6:9, ]
How do I do this programmatically? I can find where all of the "1"s are, but how do I go row-wise and break it into different dataframes?
In your example, df is a character matrix, not a data.frame. To define a data.frame object use e.g. data.frame(index = c(1,2,3,4,5,1,2,3,4), value = c("a","b","c","d","e","f","g","h","i"))
Find the index of the first value of each group, then split on groups. You do not need to perform any rowwise operation.
df <- data.frame(index = c(1,2,3,4,5,1,2,3,4), value = c("a","b","c","d","e","f","g","h","i"))
split(df, cumsum(df$index == 1))
The result is a list of data.frame objects:
$`1`
  index value
1     1     a
2     2     b
3     3     c
4     4     d
5     5     e
$`2`
  index value
6     1     f
7     2     g
8     3     h
9     4     i
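The grouping trick is easier to follow when the intermediate vectors are printed. A minimal sketch with the same data; starts, group_id and pieces are illustrative names:

```r
df <- data.frame(index = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                 value = c("a", "b", "c", "d", "e", "f", "g", "h", "i"))

# TRUE wherever the counter restarts at 1.
starts <- df$index == 1

# Running total of restarts: each row gets the number of the group it belongs to.
group_id <- cumsum(starts)
group_id
# [1] 1 1 1 1 1 2 2 2 2

# split() then cuts the data frame wherever group_id changes.
pieces <- split(df, group_id)
sapply(pieces, nrow)
```

Because the group number only increases when a new 1 is seen, this works for any number of resets and for groups of different lengths.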
Try this approach with indexes and a loop. We create i1 to store the rows where the index is 1, then compute the last row of each group in i2. After that we create a list and use a loop to store each subset. Finally, we assign names and export the list elements to the global environment using list2env. Here is the code:
#Create index of group starts
i1 <- which(df[,1]=='1')
#Each group ends one row before the next start; the last group ends at the last row
i2 <- c(i1[-1]-1, nrow(df))
#Create a list
List <- list()
#Loop
for(j in seq_along(i1))
{
  List[[j]] <- df[i1[j]:i2[j],]
}
#Assign names
names(List) <- paste0('df',seq_along(List))
#Set to envir
list2env(List,envir = .GlobalEnv)

Filter rows of dataframes stored in a list and create new list

I have a list with 64 dataframes.
Dataframe 1 and Dataframe 5 have to have the same row names.
The same with 2 and 6, 3 and 7, and so on.
I am able to run a for loop and create a new list, but something is not working: I end up with an incorrect number of rows.
Here a simplified example to reproduce it:
# Create dataframes and store in list
dfA <- data.frame(v1=c(1:6), v2=c("x1","x2","x3","x4","x5","x6"))
dfB <- data.frame(v1=c(1:6), v2=c("x1","x2","x3","x4","x5","x6"))
dfC <- data.frame(v1=c(1:5), v2=c("x1","x2","x3","x4","x5"))
dfD <- data.frame(v1=c(1:4), v2=c("x1","x2","x3","x4"))
example_dataframes = list(dfA, dfB, dfC, dfD)
# These vectors give the order of the process
vectorA = c(1,2)
vectorB = c(3,4)
# Create new list and start for loop
filtered_dataframes = list()
for (i in vectorA) {
for (j in vectorB) {
df1 = example_dataframes[[i]]
df2 = example_dataframes[[j]]
test = intersect(df1$v2, df2$v2)
filtered_dataframes[[i]] <- df1[which(df1$v2 %in% test),]
filtered_dataframes[[j]] <- df2[which(df2$v2 %in% test),]
}
}
For this example, I expect to obtain:
sapply(filtered_dataframes, nrow)
> 5 4 5 4
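The nested loops pair every index in vectorA with every index in vectorB (1 with 3, 1 with 4, 2 with 3, 2 with 4), so each list element is overwritten by the result of the last pairing. This modified version pairs the indices explicitly and should give the results you need.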
dfA <- data.frame(v1=c(1:6), v2=c("x1","x2","x3","x4","x5","x6"))
dfB <- data.frame(v1=c(1:6), v2=c("x1","x2","x3","x4","x5","x6"))
dfC <- data.frame(v1=c(1:5), v2=c("x1","x2","x3","x4","x5"))
dfD <- data.frame(v1=c(1:4), v2=c("x1","x2","x3","x4"))
example_dataframes = list(dfA, dfB, dfC, dfD)
# Put the comparison vectors into a list. Example: to compare dataframes 1 and 3, put in c(1,3)
vector.list <- list(c(1,3),c(2,4))
# Create new list and start for loop
filtered_dataframes = list()
# Loop through the list of vectors
for (i in vector.list) {
# Get the first dataframe from the current vector being processed
df1 = example_dataframes[[i[1]]]
# Get the second dataframe from the current vector being processed
df2 = example_dataframes[[i[2]]]
# Get the v2 values common to both dataframes
test = intersect(df1$v2, df2$v2)
# Add the first filtered dataframe to the list of filtered dataframes
filtered_dataframes[[i[1]]] <- df1[which(df1$v2 %in% test),]
# Add the second filtered dataframe to the list of filtered dataframes
filtered_dataframes[[i[2]]] <- df2[which(df2$v2 %in% test),]
}
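If you would rather keep the original vectorA and vectorB, the pairs can be built from them position by position. This is a sketch rather than the only way to do it; nested_pairs, paired and common are illustrative names, and expand.grid is used only to show which combinations the nested loops visit.

```r
dfA <- data.frame(v1 = 1:6, v2 = c("x1", "x2", "x3", "x4", "x5", "x6"))
dfB <- data.frame(v1 = 1:6, v2 = c("x1", "x2", "x3", "x4", "x5", "x6"))
dfC <- data.frame(v1 = 1:5, v2 = c("x1", "x2", "x3", "x4", "x5"))
dfD <- data.frame(v1 = 1:4, v2 = c("x1", "x2", "x3", "x4"))
example_dataframes <- list(dfA, dfB, dfC, dfD)

vectorA <- c(1, 2)
vectorB <- c(3, 4)

# Nested loops visit every combination: (1,3), (1,4), (2,3), (2,4).
nested_pairs <- expand.grid(j = vectorB, i = vectorA)[, c("i", "j")]

# Position-wise pairing visits only (1,3) and (2,4).
paired <- Map(c, vectorA, vectorB)

filtered_dataframes <- vector("list", length(example_dataframes))
for (p in paired) {
  # v2 values present in both data frames of the pair
  common <- intersect(example_dataframes[[p[1]]]$v2, example_dataframes[[p[2]]]$v2)
  for (k in p) {
    d <- example_dataframes[[k]]
    filtered_dataframes[[k]] <- d[d$v2 %in% common, ]
  }
}

sapply(filtered_dataframes, nrow)
# [1] 5 4 5 4
```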

Passing vector with multiple values into R function to generate data frame

I have a table, called table_wo_nas, with multiple columns, one of which is titled ID. For each value of ID there are many rows. I want to write a function that for input x will output a data frame containing the number of rows for each ID, with column headers ID and nobs respectively as below for x <- c(2,4,8).
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
This is what I have. It works when x is a single value (ex. 3), but not when it contains multiple values, for example 1:10 or c(2,5,7). I receive the warning "In ID[counter] <- x : number of items to replace is not a multiple of replacement length". I've just started learning R and have been struggling with this for a week and have searched manuals, this site, Google, everything. Can someone help please?
counter <- 1
ID <- vector("numeric") ## contain x
nobs <- vector("numeric") ## contain nrow
for (i in x) {
r <- subset(table_wo_nas, ID %in% x) ## create subset for rows of ID=x
ID[counter] <- x ## add x to ID
nobs[counter] <- nrow(r) ## add nrow to nobs
counter <- counter + 1 } ## loop
result <- data.frame(ID, nobs) ## create data frame
In base R,
# To make a named vector, either:
tmp <- sapply(split(table_wo_nas, table_wo_nas$ID), nrow)
# OR just:
tmp <- table(table_wo_nas$ID)
# AND
# arrange into data.frame (as.vector() drops the table class so nobs stays a single column)
nobs_df <- data.frame(ID = names(tmp), nobs = as.vector(tmp))
Alternately, coerce the table into a data.frame directly, and rename:
nobs_df <- data.frame(table(table_wo_nas$ID))
names(nobs_df) <- c('ID', 'nobs')
If you only want certain IDs, subset:
nobs_df[nobs_df$ID %in% c(2, 4, 8), ]
There are many, many more options; these are just a few.
With dplyr,
library(dplyr)
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n())
If you only want certain IDs, add on a filter:
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n()) %>% filter(ID %in% c(2, 4, 8))
Seems pretty straightforward if you just use table again:
tbl <- table( table_wo_nas[ , 'ID'] )
data.frame( IDs = names(tbl), nobs= as.vector(tbl))
Could also get a quick answer although with different column names using:
as.data.frame(table( table_wo_nas[ , 'ID'] ))
Try this.
x <- c(2, 4, 8)
# df is your data frame table_wo_nas
count_of <- function(x) {
  count_of_id <- numeric(length(x))
  for (i in seq_along(x)) {
    # find the number of rows for each value of x
    count_of_id[i] <- length(which(df$ID == x[i]))
  }
  data.frame(id = x, nobs = count_of_id)
}
count_of(x)
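On the warning in the question: inside the loop, ID[counter] <- x assigns the whole vector x to a single element, and the subset uses x rather than the loop variable i, so every iteration counts the same rows. A self-contained sketch of a corrected function follows. The toy table and the helper name count_ids are assumptions for illustration; the real table_wo_nas is only assumed to have an ID column with no missing values.

```r
# Toy stand-in for table_wo_nas.
table_wo_nas <- data.frame(ID = c(2, 2, 2, 4, 4, 8, 9), value = 1:7)

# Hypothetical helper: count the rows for each requested ID, one ID at a time.
count_ids <- function(data, ids) {
  nobs <- vapply(ids, function(one_id) sum(data$ID == one_id), integer(1))
  data.frame(id = ids, nobs = nobs)
}

result <- count_ids(table_wo_nas, c(2, 4, 8))
result
```

With the toy table this gives 3, 2 and 1 rows for IDs 2, 4 and 8. An ID that does not occur in the table gets a count of 0 rather than being dropped.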

Adding data frames into a list within a forloop

I have a for loop that generates a dataframe every time it loops through. I am trying to create a list of data frames but I cannot seem to figure out a good way to do this.
For example, with vectors I usually do something like this:
my_numbers <- c()
for (i in 1:4){
my_numbers <- c(my_numbers,i)
}
This will result in a vector c(1,2,3,4). I want to do something similar with dataframes, but accessing the list of data frames is quite difficult when I use:
my_dataframes <- list(my_dataframes,DATAFRAME).
Help please. The main goal is just to create a list of dataframes that I can later on access dataframe by dataframe. Thank you.
I'm sure you've noticed that list does not do what you want it to do, nor should it. c also doesn't work in this case because it flattens data frames, even when recursive=FALSE.
You can use append. As in,
data_frame_list = list()
for( i in 1:5 ){
d = create_data_frame(i)
data_frame_list = append(data_frame_list, list(d))
}
Better still, you can assign directly to indexed elements, even if those elements don't exist yet:
data_frame_list = list()
for( i in 1:5 ){
data_frame_list[[i]] = create_data_frame(i)
}
This applies to vectors, too. But if you want to create a vector c(1,2,3,4) just use 1:4, or its underlying function seq.
Of course, lapply or the *lply functions from plyr are often better than looping depending on your application.
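For comparison, here is roughly what the lapply version looks like. The function create_data_frame is a hypothetical stand-in for whatever builds one data frame per iteration, and its columns are made up for illustration.

```r
# Hypothetical stand-in for the function that builds one data frame per iteration.
create_data_frame <- function(i) {
  data.frame(iteration = i, value = seq_len(i))
}

# lapply() returns the list directly, with no need to grow it by hand.
data_frame_list <- lapply(1:5, create_data_frame)

length(data_frame_list)
data_frame_list[[3]]

# If a single data frame is needed later, the pieces can be stacked.
combined <- do.call(rbind, data_frame_list)
nrow(combined)
```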
Continuing with your for loop method, here's a little example of creating and accessing.
> my_numbers <- vector('list', 4)
> for (i in 1:4) my_numbers[[i]] <- data.frame(x = seq(i))
And we can access the first column of each data frame with,
> sapply(my_numbers, "[", 1)
# $x
# [1] 1
#
# $x
# [1] 1 2
#
# $x
# [1] 1 2 3
#
# $x
# [1] 1 2 3 4
Other ways of accessing the data are my_numbers[[1]] for the first data set,
lapply(my_numbers, "[", 1, ) to access the first row of each data frame, etc.
You can use operator [[ ]] for this purpose.
l <- list()
df1 <- data.frame(name = 'df1', a = 1:5 , b = letters[1:5])
df2 <- data.frame(name = 'df2', a = 6:10 , b = letters[6:10])
df3 <- data.frame(name = 'df3', a = 11:20 , b = letters[11:20])
df <- rbind(df1,df2,df3)
for(df_name in unique(df$name)){
l[[df_name]] <- df[df$name == df_name,]
}
In this example, there are three separate data frames. To store them in a list using a for loop, we first combine them into one data frame with rbind and then pull out the rows for each name. With the [[ operator we can also name each data frame in the list as we store it.
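When the data is already combined like this, base R's split() gives a named list without an explicit loop. A short sketch using the same three data frames:

```r
df1 <- data.frame(name = "df1", a = 1:5, b = letters[1:5])
df2 <- data.frame(name = "df2", a = 6:10, b = letters[6:10])
df3 <- data.frame(name = "df3", a = 11:20, b = letters[11:20])
df <- rbind(df1, df2, df3)

# One list element per distinct value of name, named after that value.
l <- split(df, df$name)

names(l)
nrow(l[["df3"]])
```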

Resources