Ok so here's my problem. I'm trying to scrape a ton of data off of websites. My code looks like this:
library(XML)
library(RCurl)
library(rlist)
library(rvest)
library(dplyr)
team_performance <- read.csv("C:/Users/Will/Documents/team_performance.csv")
stats_names <- read.csv("C:/Users/Will/Documents/stats_names.csv")
date_vals <- read.csv("C:/Users/Will/Documents/date_vals.csv")
teams_list <- read.csv("C:/Users/Will/Documents/teams_list.csv")
date_vals <- date_vals[[1]]
stats_names <- stats_names[[1]]
team_stats <- NULL
for(i in c(0:10)){
burner <- teams_list
burner$Year <- (2007 + i)
team_stats <- rbind(team_stats, burner)
}
names(team_stats)[[1]] <- "Team"
percent_complete <- 0
for(x in date_vals){
for(i in stats_names){
mpg_link <- getURL(paste0("https://www.teamrankings.com/ncaa-basketball/stat/",gsub(" ","-",i),"?date=",x),.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(mpg_link)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
temp_data <- data.frame(tables)
temp_data$NULL.Stat <- i
names(temp_data)[3] <- temp_data$NULL.Stat[1]
names(temp_data)[2] <- "Team"
temp_data <- temp_data[,-c(4:8)]
temp_data$Year <- as.numeric(substr(as.character(x),1,4))
team_stats <- left_join(team_stats,temp_data[,-c(1,4)], by.x = "Team", by.y = "Year")
percent_complete <- percent_complete + (100/979)
print(paste(round(percent_complete,digits=2),"% complete",sep=""))
}
}
After the first year (2017) is done, after the joins are completed, I get a message like this:
Joining, by = c("Team", "Year", "Points Per Game")
instead of getting a message like this:
Joining, by = c("Team", "Year")
Any ideas why this might be happening?
Edit: Ok no longer getting the messages but it still won't switch over the year. Once it starts to scrape 2016, data doesn't show up where the year is 2016.
In the left_join, the syntax should be
left_join(team_stats,temp_data[,-c(1,4)], by=c(Team = "Year"))
although those column names do not make sense as a join key; the call simply mirrors the OP's syntax.
by.x and by.y are arguments of merge (from base R), not of left_join.
As a reproducible example
set.seed(24)
df1 <- data.frame(col1 = 1:5, col2 = rnorm(5))
df2 <- data.frame(A = rep(1:3, each = 2), B = rnorm(6))
The OP's method is giving errors in dplyr_0.7.4
left_join(df2, df1, by.x = 'A', by.y = 'col1')
Error: by required, because the data sources have no common variables
because the arguments don't match
left_join(df2, df1, by = c(A= "col1"))
# A B col2
#1 1 0.266021979 -0.5458808
#2 1 0.444585270 -0.5458808
#3 2 -0.466495124 0.5365853
#4 2 -0.848370044 0.5365853
#5 3 0.002311942 0.4196231
#6 3 -1.316908124 0.4196231
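For comparison, here is a minimal base R sketch of the same join using merge, where by.x and by.y are valid arguments. It reuses df1 and df2 from the example above; the name merged is only illustrative.

```r
# Same example data as above
set.seed(24)
df1 <- data.frame(col1 = 1:5, col2 = rnorm(5))
df2 <- data.frame(A = rep(1:3, each = 2), B = rnorm(6))

# merge() accepts by.x / by.y; all.x = TRUE keeps every row of df2 (a left join)
merged <- merge(df2, df1, by.x = "A", by.y = "col1", all.x = TRUE)
names(merged)
```

The key column keeps the name from the left table (A), and the remaining columns of both tables follow it.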
I would like to "copy paste" the values of one column of data frame A under the values of a column in data frame B.
Below I've visualized what I'm trying to achieve.
One option is to use bind_rows on the selected column after converting it to the same type as the target column:
library(dplyr)
bind_rows(df2, df1[1] %>%
transmute(ColumnC = as.character(ColumnA)))
# ColumnC ColumnD
#1 a b
#2 1 <NA>
#3 2 <NA>
#4 3 <NA>
data
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = 'a', ColumnD = 'b',
stringsAsFactors = FALSE)
You can also use base R for this. You effectively want a full outer join of df1 and df2 (all = TRUE keeps the unmatched rows from both sides):
df1 <- data.frame(1:3, 4:6)
names(df1) <- paste0("c", 1:2)
df2 <- data.frame("a", "b")
names(df2) <- paste0("c", 3:4)
# renaming column to join on
names(df2)[1] <- "c1"
merge(x = df1[,1,drop=FALSE], y = df2, by.y = c("c1"), all = TRUE)
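If the goal is only to stack the values rather than to match keys, a plain rbind may be enough. The following is a rough base R sketch using the ColumnA to ColumnD data from the first answer; the helper name extra is an assumption made for illustration.

```r
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = "a", ColumnD = "b", stringsAsFactors = FALSE)

# Build rows with df2's column names: ColumnA converted to character,
# ColumnD filled with NA because df1 has nothing to put there
extra <- data.frame(ColumnC = as.character(df1$ColumnA),
                    ColumnD = NA_character_,
                    stringsAsFactors = FALSE)

result <- rbind(df2, extra)
result
```

rbind requires identical column names on both sides, which is why the missing column is added explicitly here, whereas bind_rows fills it in automatically.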
I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names that I'm not joining by get the .x and .y suffixes:
id val.x val1 val.y val2
1 G -0.05612874 0.2914462 2.087167 0.7876396
2 G -0.05612874 0.2914462 -0.255027 1.4411577
3 J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared columns that I'm not joining by, it's good enough to take them from a single data.frame in the list, whichever one contains them for the joined id.
I don't know these shared column names in advance, but they are not difficult to find:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select_(.dots = c("id",repeating.colnames)))) %>%
unique()
I can then drop these columns from each data.frame in the list and join them as above:
for(r in 1:length(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select_(.dots = paste0("-",repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
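fun <- function(df1, df2, by_col = "id"){
  df1_names <- names(df1)
  df2_names <- names(df2)
  # columns shared by both tables, other than the join column(s)
  dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
  # drop the shared columns from the right-hand table; drop = FALSE keeps a data.frame even if only one column remains
  out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols), drop = FALSE], by = by_col)
  return(out)
}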
df_chase <- df.list %>% reduce(fun,by_col="id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I get the same answer:
> all.equal(df_chase, df_orig)
[1] TRUE
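The same idea can be sketched without any packages by using Reduce and merge. This is only an illustration under assumed toy data: the three small data frames and the function name join_keep_left are hypothetical, not taken from the question.

```r
# Hypothetical toy data: "val" and "val1" repeat across tables
df1 <- data.frame(id = c("A", "B", "C"), val = 1:3, val1 = 4:6, stringsAsFactors = FALSE)
df2 <- data.frame(id = c("B", "C", "D"), val = 7:9, val2 = 1:3, stringsAsFactors = FALSE)
df3 <- data.frame(id = c("C", "B", "E"), val1 = 7:9, val3 = 10:12, stringsAsFactors = FALSE)

# Inner join that keeps repeated non-key columns from the left table only
join_keep_left <- function(x, y, by_col = "id") {
  dup_cols <- setdiff(intersect(names(x), names(y)), by_col)
  merge(x, y[, !(names(y) %in% dup_cols), drop = FALSE], by = by_col)
}

# Reduce folds the list from left to right, accumulating the unique columns
joined <- Reduce(join_keep_left, list(df1, df2, df3))
joined
```

Because the fold runs left to right, each repeated column keeps the values of the first table in which it appears.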
You can just get rid of the duplicate columns from one of the data frames if you say you don't really care about them and simply use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
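# flag df2 columns that also appear in df1, other than the join key (matched by name, not by position)
duplicates = names(df2) %in% setdiff(names(df1), "id")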
df2 = df2[,!duplicates]
df12 = base::merge.data.frame(df1, df2, by = "id")
head(df12)
When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
df1 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs=
letters[1:3], val = rnorm(3))
df2 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs=
letters[1:3], val = rnorm(3))
I want to join them using all variables excluding val. I'm looking for a more general solution: assume there are 1000 variables and I only remember the name of the one I want to exclude from the join, not its index. How can I perform the join while only knowing the variable names to exclude? I understand I can find the column index first, but is there a simple way to add exclusions in by =?
We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note the 'grps' vector is created with paste as the OP's post suggested a pattern. If there is no pattern, but we know the column that is not to be grouped
nogroupColumn <- "someColumn"
# names are the df1 (left) columns, values are the matching df2 (right) columns
grps <- setNames(setdiff(names(df2), nogroupColumn),
setdiff(names(df1), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
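When the key columns have the same names in both tables, the by vector can be built by exclusion with setdiff. Below is a minimal base R sketch using merge; the fixed val values and the names exclude, by_cols and joined are assumptions made so the result is easy to check.

```r
df1 <- data.frame(alala = LETTERS[1:3], skks = letters[1:3], sskjs = letters[1:3],
                  val = c(1, 2, 3), stringsAsFactors = FALSE)
df2 <- data.frame(alala = LETTERS[1:3], skks = letters[1:3], sskjs = letters[1:3],
                  val = c(10, 20, 30), stringsAsFactors = FALSE)

# Join on every shared column except the ones named here
exclude <- "val"
by_cols <- setdiff(intersect(names(df1), names(df2)), exclude)

joined <- merge(df1, df2, by = by_cols)
joined
```

The same character vector should also work as the by argument of the dplyr join functions, since they accept a plain vector of shared column names.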
To exclude certain fields, you need to identify the indices of the columns you do want. Here's one way:
which(!names(df1) %in% "sskjs" ) #<this excludes the column "sskjs"
[1] 1 2 4 #<and shows only the desired index columns
Use unite (from tidyr) to create a join_id in each data frame, and join by it.
df1 <- df1 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
df2 <- df2 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
left_join(df1, df2, by = "join_id" )
I have over 1000 objects (z) in R, each containing three data frames (df1, df2, df3) with different structures.
z1$df1 … z1000$df1
z1$df2 … z1000$df2
z1$df3 … z1000$df3
I created a list of these objects (list1 thus contains z1 thru z1000) and tried to use lapply to extract one type of dataframe (df2) for all objects, and then merge them to one single dataframe.
Extraction:
For a single object it would look like this:
df15<- z15$df2 # I transferred the index of z to the extracted df
I tried some code with lapply, ignoring the transfer of the index (I can create another list for that). However I don’t know what function I should use.
List2 <- lapply(list1, function(x))
I'm trying to avoid a loop because there are so many objects and vectorization is much quicker. I have a feeling I'm looking at this from the wrong angle.
Subsequent merging can be done as follows:
merged <- do.call(rbind, list2)
Thanks for any suggestions.
It sounds like you want to pull out all the df1s and rbind them together, then do the same for the other data frames. You can use purrr::map_dfr to extract the named data frame from each element of the list and row-bind the results.
library('tidyverse')
dummy_df <- list(
df1 = iris,
df2 = cars,
df3 = CO2)
list1 <- list(
z1 = dummy_df,
z2 = dummy_df,
z3 = dummy_df)
df1 <- map_dfr(list1, 'df1')
df2 <- map_dfr(list1, 'df2')
df3 <- map_dfr(list1, 'df3')
If you wanted to do it in base R, you can use lapply.
df1 <- lapply(list1, function(x) x$df1)
df1_merged <- do.call(rbind, df1)
One option could be using lapply to extract data.frame and then use bind_rows from dplyr.
## The data
df1 <- data.frame(id = c(1:10), name = c(LETTERS[1:10]), stringsAsFactors = FALSE)
df2 <- data.frame(id = 11:20, name = LETTERS[11:20], stringsAsFactors = FALSE)
df3 <- data.frame(id = 21:30, name = LETTERS[15:24], stringsAsFactors = FALSE)
df4 <- data.frame(id = 121:130, name = LETTERS[15:24], stringsAsFactors = FALSE)
z1 <- list(df1 = df1, df2 = df2, df3 = df3)
z2 <- list(df1 = df1, df2 = df2, df3 = df3)
z3 <- list(df1 = df1, df2 = df2, df3 = df3)
z4 <- list(df1 = df1, df2 = df2, df3 = df4) #DFs can contain different data
# z <- list(z1, z2, z3, z4)
# Dynamically populate list z with many list object
z <- as.list(mget(paste("z",1:4,sep="")))
df1_all <- bind_rows(lapply(z, function(x) x$df1))
df2_all <- bind_rows(lapply(z, function(x) x$df2))
df3_all <- bind_rows(lapply(z, function(x) x$df3))
## Result for df3_all
> tail(df3_all)
## id name
## 35 125 S
## 36 126 T
## 37 127 U
## 38 128 V
## 39 129 W
## 40 130 X
Try this:
lapply(list1, "[[", "df2")
or if you want to rbind them together:
do.call("rbind", lapply(list1, "[[", "df2"))
The row names in the resulting data frame will identify the origin of each row.
No packages are used.
Note
We can use this input to test the code above. BOD is a built-in data frame:
z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)
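If the origin of each row should end up in a proper column instead of the row names (the question mentions transferring the index of z), one possible base R sketch is shown below. It reuses the BOD test input; the column name source and the helper names are assumptions.

```r
z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)

# Pull df2 out of every element, keeping the list names
extracted <- lapply(list1, `[[`, "df2")

# Tag each data frame with the name of the object it came from
tagged <- Map(function(d, nm) data.frame(source = nm, d, stringsAsFactors = FALSE),
              extracted, names(extracted))

merged <- do.call(rbind, tagged)
head(merged)
```

Map walks over the data frames and their names in parallel, so the tagging step needs no explicit loop.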
There's also data.table::rbindlist, which is likely faster than do.call(rbind, lapply(...)) or dplyr::bind_rows
library(data.table)
rbindlist(lapply(list1, "[[", "df2"))
I'm trying to put together several files and need to do a bunch of merges on column names that are created inside a loop. I can do this fine using data.frame() but am having issues using similar code with a data.table():
library(data.table)
df1 <- data.frame(id = 1:20, col1 = runif(20))
df2 <- data.frame(id = 1:20, col1 = runif(20))
newColNum <- 5
newColName <- paste('col',newColNum ,sep='')
df1[,newColName] <- runif(20)
df2 <- merge(df2, df1[,c('id',newColName)], by = 'id', all.x = T) # Works fine
######################
dt1 <- data.table(id = 1:20, col1 = runif(20))
dt2 <- data.table(id = 1:20, col1 = runif(20))
newColNum <- 5
newColName <- paste('col',newColNum ,sep='')
dt1[,newColName] <- runif(20)
dt2 <- merge(dt2, dt1[,c('id',newColName)], by = 'id', all.x = T) # Doesn't work
Any suggestions?
This really has nothing to do with merge(), and everything to do with how the j (i.e. column) index is, by default, interpreted by [.data.table().
You can make the whole statement work by setting with=FALSE, which causes the j index to be interpreted as it would be in a data.frame:
dt2 <- merge(dt2, dt1[,c('id',newColName), with=FALSE], by = 'id', all.x = T)
head(dt2, 3)
# id col1 col5
# 1: 1 0.4954940 0.07779748
# 2: 2 0.1498613 0.12707070
# 3: 3 0.8969374 0.66894157
More precisely, from ?data.table:
with: By default 'with=TRUE' and 'j' is evaluated within the frame
of 'x'. The column names can be used as variables. When
'with=FALSE', 'j' is a vector of names or positions to
select.
Note that with=FALSE can be avoided by storing the column names in a variable and prefixing it with .. like so:
cols = c('id', newColName)
dt1[ , ..cols]
.. signals to "look up one level"
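Putting the pieces together, here is a small sketch that compares three ways of selecting columns by a character vector and then runs the merge from the question. It assumes a reasonably recent data.table (the .. prefix is not available in old versions); the names by_with, by_dots and by_sdcols are only illustrative.

```r
library(data.table)

set.seed(42)
dt1 <- data.table(id = 1:20, col1 = runif(20))
dt2 <- data.table(id = 1:20, col1 = runif(20))

newColName <- paste0("col", 5)
# Add a column whose name is stored in a variable (by reference)
dt1[, (newColName) := runif(20)]

cols <- c("id", newColName)
by_with   <- dt1[, cols, with = FALSE]       # j treated as a vector of column names
by_dots   <- dt1[, ..cols]                   # .. looks cols up outside the table
by_sdcols <- dt1[, .SD, .SDcols = cols]      # subset of data restricted to cols

dt2 <- merge(dt2, by_dots, by = "id", all.x = TRUE)
names(dt2)
```

All three selections should return the same two-column table, so the choice between them is mostly a matter of taste.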
Try dt1[,list(id,get(newColName))] in your merge. Note that the column extracted with get() is then named V2 unless you name it inside list().