I have a script that generates multiple data frames after scraping data from the internet:
library("rvest")
urllist <- c("https://en.wikipedia.org/wiki/Jawaharlal_Nehru",
"https://en.wikipedia.org/wiki/Indira_Gandhi")
for(i in 1:length(urllist))
{ url <- urllist[i]
print(url)
mydata <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
html_table()
X <- mydata[[1]]
assign(paste("df", i, sep = '_'), X)
}
So it creates df_1, df_2, etc.
After the download, each of these data frames has 2 columns. The 1st column name is the person's name; the 2nd column name is NA.
How can I dynamically rename the columns of all those data frames so that the 1st column is named "ID" and the 2nd column is named after the person?
My attempt below fails. It assigns to variables named after those strings; it does not affect my data frames.
for(i in 1:length(urllist))
{ assign(colnames(get(paste("df", i, sep = '_')))[1],"ID")
assign(colnames(get(paste("df", i, sep = '_')))[2],colnames(get(paste("df", i, sep = '_')))[1])
}
My final goal is then to merge all those data frames into a single data frame based on the column "ID".
How can I do this?
Solved it this way:
for (i in (1:length(urllist)))
{
df.tmp <- get(paste("df", i, sep = '_'))
names(df.tmp) <- c("ID",colnames(get(paste("df", i, sep = '_')))[1] )
assign(paste("df",i,sep='_'), df.tmp)
}
For merging, I solved it this way:
#making the list without the 1st df (matches df_2, df_3, ... rather than only names containing "df_2")
alldflist = lapply(setdiff(ls(pattern = "^df_[0-9]+$"), "df_1"), get)
#merge multiple data frames by ID
#note at first taking the 1st df
mergedf<-df_1
for ( .df in alldflist )
{
mergedf <-merge(mergedf,.df,by.x="ID", by.y="ID",all=T)
}
It works, but can anybody please suggest a better way to handle these dynamic data frame names and merge them into a single data frame?
Using a list, as Roman pointed out in his comment, would definitely work in this case. But since you are already looping over your URLs, why not rename the columns inside your initial for loop, right after X is created? Something like this:
colnames(X) <- c("ID", colnames(X)[1])
This assumes you want the old first column name to become the second column name, which appears to be the case based on your second loop.
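The list-based approach mentioned above can be sketched as follows. This is only a minimal sketch, not the asker's code: the two small data frames stand in for the scraped tables (their contents and the short column names are made up), and `df_list` and `merged` are names chosen here for illustration.

```r
# Hypothetical stand-ins for the scraped tables: the first column is named
# after the person, the second column has no meaningful name.
df_list <- list(
  data.frame(Nehru = c("Born", "Died"), V2 = c("1889", "1964"),
             stringsAsFactors = FALSE),
  data.frame(Gandhi = c("Born", "Died"), V2 = c("1917", "1984"),
             stringsAsFactors = FALSE)
)

# Rename the columns of every data frame: the first column becomes "ID",
# the second column takes the old first column name (the person's name).
df_list <- lapply(df_list, function(d) {
  names(d) <- c("ID", names(d)[1])
  d
})

# Merge all data frames on "ID", keeping rows that appear in any of them.
merged <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), df_list)
print(merged)
```

Keeping the data frames in a list avoids assign() and get() entirely, and Reduce() scales to any number of data frames without a special case for the first one.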
I have a dataset with columns that contain a code plus a name, which I would like to separate into 2 columns. Just an example:
Column E5000_A contains values like '0080002. ALB - Democratic Party' in one cell. I would like two columns, one containing the code 0080002 and the other containing the rest of the information.
I have 8 columns with very similar values (E5000_A through E5000_H). This is the code that I am writing.
cols2 <- c("E5000_A" , "E5000_B" , "E5000_C" , "E5000_D" ,
"E5000_E" , "E5000_F" , "E5000_G" , "E5000_H" )
for(i in cols2){
cses_imd_m <- cses_imd_m %>% mutate(substr(i, 1L, 7L))
}
But for some reason it only generates a new column for E5000_A, and the loop does not move on to the other variables. What am I doing wrong? Let me know if you need more details about the code or data frame.
data.frame approach
library(dplyr)
library(stringr)
# to extract codes
df %>%
mutate_at(.vars = vars(c("E5000_A", "E5000_B", "E5000_C", "E5000_D", "E5000_E",
"E5000_F", "E5000_G", "E5000_H")),
.funs = function(x) str_extract(x, "^\\d+")) # str_extract() takes the string first, then the pattern
You can also use across() inside of mutate().
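A minimal sketch of the across() variant mentioned above, assuming dplyr 1.0 or later and stringr are installed. The data frame `df` here is a made-up two-column excerpt (only the first value comes from the question), and the `code_` prefix for the new columns is a naming choice for this sketch.

```r
library(dplyr)
library(stringr)

# Hypothetical excerpt of the data; all values except the first are made up.
df <- data.frame(
  E5000_A = c("0080002. ALB - Democratic Party", "0080001. ALB - Some Party"),
  E5000_B = c("0080003. ALB - Other Party", NA),
  stringsAsFactors = FALSE
)

# Apply str_extract() to every column whose name starts with "E5000_" and
# store the result in new columns named code_<original column name>.
codes <- df %>%
  mutate(across(starts_with("E5000_"),
                ~ str_extract(.x, "^\\d+"),
                .names = "code_{.col}"))
print(codes)
```

Because `.names` is set, the original columns are kept and the extracted codes are added alongside them.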
If you want to use a for loop:
col_names <- c("E5000_A", "E5000_B", "E5000_C", "E5000_D", "E5000_E", "E5000_F", "E5000_G", "E5000_H")
for (i in col_names) {
df[[sprintf("code_%s", i)]] <- str_extract(df[[i]], "^\\d+") # leading digits = the code
df[[sprintf("party_%s", i)]] <- gsub(".*\\.", "", df[[i]]) %>% str_trim() # remove everything up to the last dot (.)
}
I imported a tibble from a text file. Many numeric columns were imported as "chr". I guess it's because they contain a "," instead of a "." as the decimal separator.
My goal is to write a loop that runs through the names of the desired columns, replaces "," with ".", and converts the columns to "num".
A small example:
data <- data.frame("A1" =c("2,1","2,1","2,1"), "A2" =c("1,3","1,3","1,3"),
stringsAsFactors = F) %>% as_tibble() #example data
colname <- c("A1", "A2") #creating variable for loop
for(i in colname) {
nam <- paste0("data$", i)
assign(nam, as.numeric(gsub(",",".", eval(parse(text = paste0("data$",i))))) )
}
Instead of overwriting the existing column, R creates a new variable:
data$A1 # that's the existing column as part of the tibble
[1] "2,1" "2,1" "2,1"
`data$A1` # that's just a new variable; mind the backticks
[1] 2.1 2.1 2.1
I also tried to assign (<-) the new numeric values via eval, but that does not work either.
eval(parse(text = paste0("data$", i))) <- as.numeric(
gsub(",",".", eval(parse(text = paste0("data$",i)))))
Error: target of assignment expands to non-language object
Any suggestions on how to transform? I have the same issue with other columns that I want to aggregate to a new variable. This variable should also be part of the existing tibble. I could do it by hand. This would take lots of time and probably produce many mistakes.
Thanks a lot!
Sam
As you are already working with the tidyverse, you can use dplyr::mutate_at and the colname variable you have already defined. Assign the result back to data to keep the converted columns.
data <- data %>%
mutate_at(.vars = colname,
.funs = function(x) { as.numeric(gsub(",", ".", x)) })
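For the loop the question asks about, a base R version might look like the sketch below. It assumes the same example data as in the question, built here as a plain data frame so that no packages are needed; the same `[[` indexing also works on a tibble.

```r
# Example data from the question, as a plain data frame.
data <- data.frame(A1 = c("2,1", "2,1", "2,1"),
                   A2 = c("1,3", "1,3", "1,3"),
                   stringsAsFactors = FALSE)
colname <- c("A1", "A2")

for (i in colname) {
  # data[[i]] selects the column whose name is stored in i; assigning to it
  # overwrites that column in place, so no assign() or eval(parse()) is needed.
  data[[i]] <- as.numeric(gsub(",", ".", data[[i]], fixed = TRUE))
}
str(data)
```

The original attempt failed because assign() creates a variable literally named "data$A1" instead of modifying the column A1 inside data.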
I am pulling 10-Ks off the SEC website using the EDGAR package in R. Fortunately, the text files come with a consistent file naming convention: CIK number (this is a unique filing ID)_File type_Date.
Ultimately I want to analyze these by SIC/industry group, so I think the best way to do this would be to add the SIC industry code to this filename rule.
I am including an image of what I would like to do below. It is kind of like a database join except my file names would be taking the new field. Not sure how to do that, I am pretty new to R and file scripting.
I am assuming that you have a data.frame with a column filenames (or a vector containing all the filenames). See the code below:
# A data.frame with a character column 'filenames'
df$CIK <- sapply(df$filenames, FUN = function(x) {unlist(strsplit(x, split = "_"))[1]})
df$CIK <- as.character(df$CIK)
Now, let us assume that you have another data.frame with two columns: CIK and SIC.
# A data.frame with two character columns: 'CIK' and 'SIC'
# df2.
#
# We add another column to the first data.frame: 'new_filename'
df$new_filename <- sapply(1:nrow(df), FUN = function(idx, CIK, filenames, df2) {
# look up the SIC code that belongs to this file's CIK
SIC <- df2$SIC[which(df2$CIK == CIK[idx])]
new_filename <- as.character(paste(SIC, "_", filenames[idx], sep = ""))
new_filename
}, CIK = df$CIK, filenames = df$filenames, df2 = df2)
# Now the new filenames are available in df$new_filename
View(df)
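The same lookup can also be written without sapply(), using match(). The sketch below is only an illustration: the file names, CIK values and SIC codes are made up, and `dir_path` in the final comment is a placeholder for the folder that holds the files.

```r
# Hypothetical file names following the CIK_FileType_Date convention, and a
# hypothetical CIK-to-SIC lookup table; all values are made up.
filenames <- c("0000001_10-K_2019-06-28.txt", "0000002_10-K_2019-02-12.txt")
df2 <- data.frame(CIK = c("0000002", "0000001"),
                  SIC = c("1111", "2222"),
                  stringsAsFactors = FALSE)

# The CIK is everything before the first underscore.
cik <- sub("_.*$", "", filenames)

# match() gives, for each file, the row of df2 holding its CIK
# (NA when the CIK is missing from the lookup table).
sic <- df2$SIC[match(cik, df2$CIK)]

# Prefix the SIC code; leave the name unchanged when no SIC code was found.
new_filenames <- ifelse(is.na(sic), filenames, paste(sic, filenames, sep = "_"))
print(new_filenames)

# To rename the files on disk (dir_path is a placeholder):
# file.rename(file.path(dir_path, filenames), file.path(dir_path, new_filenames))
```

Keeping the CIK values as character strings on both sides matters here, since converting them to numbers would drop leading zeros and break the match.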
Let's say I have 5 datasets in a list (each named df_1, df_2, and so on), each with a variable called cons. I'd like to execute a function over cons in each dataset in the list, and create a new variable whose name has the suffix of the corresponding dataset.
So in the end df_1 will have a variable called something like cons_1 and df_2 will have a variable called cons_2. The problem I run into is the variable looping and trying to create dynamic names.
Any suggestions?
This is actually pretty straightforward:
df_names <- paste("df", 1:5, sep = "_")
cons_names <- paste("cons", 1:5, sep = "_")
for (i in 1:5) {
# get the df from the current env by name
df_i <- get(df_names[i])
# do whatever you need to do and assign the result
df_i[[cons_names[i]]] <- some_operation(df_i)
# write the modified copy back under its original name
assign(df_names[i], df_i)
}
But it would make more sense to keep your data frames in a list to avoid using get, which can be error-prone:
for (i in 1:5) {
df_list[[i]][[cons_names[i]]] <- some_operation(df_list[[i]])
}
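A self-contained sketch of the list version, under a few assumptions: the two data frames and their values are made up, `some_operation` is a placeholder that simply doubles its input, and here it is applied to the cons column rather than to the whole data frame.

```r
# Hypothetical list of data frames, each with a column 'cons'.
df_list <- list(df_1 = data.frame(cons = c(1, 2, 3)),
                df_2 = data.frame(cons = c(10, 20, 30)))

# Placeholder for the real computation; here it just doubles the values.
some_operation <- function(x) x * 2

for (i in seq_along(df_list)) {
  # Take the suffix from the list name, e.g. "df_2" -> "2".
  suffix <- sub("^df_", "", names(df_list)[i])
  # Add the new column directly to the i-th data frame inside the list.
  df_list[[i]][[paste0("cons_", suffix)]] <- some_operation(df_list[[i]]$cons)
}
str(df_list)
```

Deriving the suffix from the list names keeps the new column names in sync with the data frame names even if the list is reordered.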
Using the purrr package, this would be an alternative solution:
library(purrr)
lst <- list(mtcars_1 = mtcars,
mtcars_2 = mtcars,
mtcars_3 = mtcars,
mtcars_4 = mtcars,
mtcars_5 = mtcars)
map(seq_along(lst), function(x) {
lst[[x]][paste0("mpg_", x)] <- some_operation(lst[[x]]['mpg']); lst[[x]]
})
For each data frame in the list, create the new mpg variable named with the index of the current data frame and perform whatever operation you want on the mpg variable. The result is a list of all the previous data frames, each with its new variable.
Since this new list doesn't carry the data frame names, you can add them back with setNames(newlist, names(lst)).
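As a possible variation, purrr::imap() passes both the element and its name to the function and keeps the names on the result, so the setNames() step is not needed. This is a sketch, assuming purrr is installed; `some_operation` is again a placeholder that doubles its input, and `new_lst` is a name chosen for this example.

```r
library(purrr)

lst <- list(mtcars_1 = mtcars, mtcars_2 = mtcars)

# Placeholder for the real computation.
some_operation <- function(x) x * 2

# imap() calls the function with each data frame (d) and its name (nm).
new_lst <- imap(lst, function(d, nm) {
  suffix <- sub("^mtcars_", "", nm)   # "mtcars_2" -> "2"
  d[[paste0("mpg_", suffix)]] <- some_operation(d$mpg)
  d
})
names(new_lst)
```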
I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
And I would like to be able to take the date and agent name (pasting its constituent parts together) and append them as columns to the right of the table, repeating them until a different name and date are found, then doing the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am having is probably with using conditionals to make the script pick up the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report<-read.csv('test15.csv', header=T, stringsAsFactors=F)
# first, allocate the two additional columns (with NAs)
report$date <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))
# step through the rows
for (i in 1:nrow(report)) {
# grab current name and date if "Agent:"
if (report[i,1] == 'Agent:') {
currDate <- report[i+1,2]
currName <- paste(report[i,2:5], collapse=' ')
# otherwise append the name/date
} else {
report[i,'date'] <- currDate
report[i,'agent'] <- currName
}
}
write.csv(report, 'test15a.csv')
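The same fill-down can be done without an explicit loop. The sketch below uses a made-up report whose layout is an assumption based on the loop above: an "Agent:" row carries the name parts, and the row right after it carries the date in the second column. The column names V1 to V3 and all values are hypothetical.

```r
# Hypothetical report; the real file layout may differ.
report <- data.frame(
  V1 = c("Agent:", "Date:", "call", "call", "Agent:", "Date:", "call"),
  V2 = c("Jane", "01/02/2016", "a", "b", "John", "03/02/2016", "c"),
  V3 = c("Doe", "", "", "", "Roe", "", ""),
  stringsAsFactors = FALSE
)

is_agent   <- report$V1 == "Agent:"
agent_rows <- which(is_agent)
# Running count of "Agent:" rows: tells each row which agent block it is in.
block <- cumsum(is_agent)

# One name and one date per agent block.
agent_names <- paste(report$V2[agent_rows], report$V3[agent_rows])
agent_dates <- report$V2[agent_rows + 1]

# Pre-allocate the new columns, then fill every non-"Agent:" row.
report$date  <- NA_character_
report$agent <- NA_character_
fill <- block > 0 & !is_agent
report$date[fill]  <- agent_dates[block[fill]]
report$agent[fill] <- agent_names[block[fill]]
print(report)
```

As in the loop version, the "Agent:" rows themselves are left as NA in the two new columns.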