I have a data frame with a name column and a ticker column. When a name has more than one ticker, the tickers are separated by ";" within a single string.
A glimpse from this data.frame below:
df.info is the name of the dataframe
Rows: 525
Columns: 2
$ name.company <chr> "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO EXTRAJUDICIAL", "524 PARTICIPAÇOES SA", "AAJR SECURITIZADORA DE CRÉDITO...
$ tickers <chr> NA, "QVQP3B", NA, "ADHM3", "TIET11;TIET3;TIET4", "AFLT3", NA, "ALEF3B", "RPAD3;RPAD5;RPAD6", NA, "ALSO3", "ALPA...
I want a data frame with two columns, ticker and name.company, with one ticker per row instead of the ";"-separated strings.
e.g.:
name ticker
tiete tiet11
tiete tiet3
tiete tiet4
and so it goes.. I solved it using the by() function but I have no clue how to solve it using the tidyverse/purrr packages.
Solution without tidyverse
get.ticker.df <- function(df.in)
{
# Gets ticker string and organizes it in another data_frame
temp.split <- str_split(df.in$tickers, ';')[[1]]
temp.df <- tibble(name.company = df.in$name.company,
ticker = temp.split)
}
my.l <- by(data = df.info,
INDICES = df.info$name.company,
FUN = get.ticker.df)
df.tickers <- bind_rows(my.l)
I don't know the equivalent of this by() function in tidyverse.
Edit - Added initial frame and the ideal result dataframe, to make it clear.
tibble_start <- tibble( name.company = c("AES TIETE", "AMBEV"),
ticker = c("TIET11;TIET3;TIET4", "ABEV3;ABEV4"))
tibble_ideal <- tibble( name.company = c( rep("AES TIETE", 3), rep("AMBEV",2)),
ticker = c("TIET11","TIET3","TIET4","ABEV3","ABEV4"))
Thanks in advance!
We can use separate_rows from tidyr, which splits each delimited string into one row per value:
library(dplyr)
library(tidyr)
df.info %>%
  separate_rows(tickers, sep = ";")
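As a minimal sketch, here is the same call applied to the tibble_start from the question's edit (note that its column is named ticker, not tickers). The name tibble_long is only illustrative, and passing sep = ";" explicitly is an assumption about the intended delimiter.

```r
library(tibble)
library(tidyr)

tibble_start <- tibble(name.company = c("AES TIETE", "AMBEV"),
                       ticker = c("TIET11;TIET3;TIET4", "ABEV3;ABEV4"))

# Split each ";"-delimited string into one row per ticker,
# repeating name.company for every piece.
tibble_long <- separate_rows(tibble_start, ticker, sep = ";")
tibble_long
```

The result should have the same rows as tibble_ideal, so no by() call or bind_rows step is needed.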
I have a list listDF of 30 data frames (~1000 rows X 3 columns). In the last column of each one, I have a composite character as shown below :
Date Origin Chemical
28/10/2012 Artificial nuclides Cs-137__Sea
28/10/2012 Natural nuclides Ra-226__Clouds
28/10/2012 Natural nuclides Ra-228__Sands
28/10/2012 Natural nuclides Th-228__Sea
28/10/2012 Artificial nuclides Cs-137__Rocks
For the last column of each df, how can I simply remove "__Sea", "__Clouds"... and just keep the chemical name ?
This may also be done with trimws from base R
listDF <- lapply(listDF, transform,
Chemical = trimws(Chemical, whitespace = "__.*"))
Or in tidyverse
library(stringr)
library(purrr)
library(dplyr)
listDF <- map(listDF, mutate, Chemical = str_remove(Chemical, "__.*"))
Base R solution:
## requires R version 4.1 or higher
listDF = lapply(listDF,
\(df) {
df[["Chemical"]] = sub(pattern = "__.*", replacement = "", df[["Chemical"]])
df
})
## any R version
listDF = lapply(listDF,
function(df) {
df[["Chemical"]] = sub(pattern = "__.*", replacement = "", df[["Chemical"]])
df
})
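To check the approach on something small, here is a sketch with a hypothetical two-element list standing in for the 30 data frames. The sample values are taken from the question; the list contents and the name cleaned are assumptions made for illustration.

```r
# Hypothetical stand-in for listDF, using values from the question
listDF <- list(
  data.frame(Date = "28/10/2012", Origin = "Artificial nuclides",
             Chemical = c("Cs-137__Sea", "Cs-137__Rocks"),
             stringsAsFactors = FALSE),
  data.frame(Date = "28/10/2012", Origin = "Natural nuclides",
             Chemical = c("Ra-226__Clouds", "Th-228__Sea"),
             stringsAsFactors = FALSE)
)

# Drop everything from the first "__" onward in each Chemical column
cleaned <- lapply(listDF, function(df) {
  df[["Chemical"]] <- sub("__.*", "", df[["Chemical"]])
  df
})

cleaned[[2]]$Chemical
```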
For-loop solution:
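# Index into listDF directly so the changes are stored in the list
for (k in seq_along(listDF)){
  for (i in seq_len(nrow(listDF[[k]]))){
    listDF[[k]][i, "Chemical"] = unlist(strsplit(listDF[[k]][i, "Chemical"], "__"))[1]
  }
}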
I'm trying to rename two types of variables in R using tidyverse/dplyr. The first type, "var_a_year", I want to rename as "sample_year". The second type, "var_b_7", I want to rename as "index_year".
The second variable, "var_b", starts at the number 7 for the first year, 2004, and increases by 2 for each year. So for year 2005, the second type of variable is called "var_b_9", as shown.
I would like to use a loop to make this faster instead of writing a line for each year.
Many thanks in advance!
df <- df %>%
rename(
sample_2004 = var_a_2004, index_2004 = var_b_7,
sample_2005 = var_a_2005, index_2005 = var_b_9,
sample_2006 = var_a_2006, index_2006 = var_b_11,
sample_2007 = var_a_2007, index_2007 = var_b_13,
...
sample_2020 = var_a_2020, index_2020 = var_b_39)
There's no need to use a loop. rename_with will do the trick:
library(dplyr)

df <- tibble(var_a_2004=NA, var_b_7=NA, var_a_2005=NA, var_b_9=NA)
# Build the new name from the last four characters (the year)
renameA <- function(x) {
  return(paste0("sample_", stringr::str_sub(x, -4)))
}
df %>% rename_with(renameA, starts_with("var_a"))
Gives
# A tibble: 1 x 4
sample_2004 var_b_7 sample_2005 var_b_9
<lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA
I'll leave you to work out how to code the corresponding function for your var_b_XXXX columns.
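For reference, one possible version of that second function is sketched below. The helper name renameB is hypothetical, and the formula assumes the numbering described in the question: the suffix is 7 for 2004 and grows by 2 per year, so year = 2004 + (suffix - 7) / 2.

```r
library(dplyr)
library(stringr)

df <- tibble(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA, var_b_9 = NA)

# Hypothetical helper: map the numeric suffix of var_b_* to its year
renameB <- function(x) {
  suffix <- as.integer(str_extract(x, "\\d+$"))
  paste0("index_", 2004 + (suffix - 7) / 2)
}

renamed <- df %>% rename_with(renameB, starts_with("var_b"))
names(renamed)
```

Chaining this after the rename_with call for the var_a columns would handle both groups without a loop.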
In addition to the answer of Limey:
#sample data
df <- structure(list(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA,
var_b_9 = NA), row.names = c(NA, -1L), class = "data.frame")
#load data.table package
library(data.table)
#set df to data.table
dt <- as.data.table(df)
#convert var_a in columnnames to sample_
colnames(dt) <- gsub("var_a_", "sample_", colnames(dt))
#use a loop to replace var_b to index_
for(i in 2004:2005){
year <- i
nr <- 2* i -4001
setnames(dt, old = paste0("var_b_", nr), new = paste0("index_", year))
}
This loop covers the years 2004:2005 to match the sample data. You can change it to 2004:2020 for your dataset.
I have a data set where the names of the columns are very messy, and I want to simplify them. Example data below:
structure(list(MemberID = 1L, This.was.the.first.question = "ABC",
This.was.the.first.date = 1012018L, This.was.the.first.city = "New York",
This.was.the.second.question = "XYZ", This.was.the.second.date = 11052018L,
This.was.the.second.city = "Boston"), .Names = c("MemberID",
"This.was.the.first.question", "This.was.the.first.date", "This.was.the.first.city",
"This.was.the.second.question", "This.was.the.second.date", "This.was.the.second.city"
), class = "data.frame", row.names = c(NA, -1L))
MemberID This was the first question This was the first date This was the first city This was the second question This was the second date This was the second city
1 ABC 1012018 New York XYZ 11052018 Boston
This is what I want the columns to look like:
MemberID Question_1 Date_1 City_1 Question_2 Date_2 City_2
So essentially the column name is the same, but every 3rd column the number increases by 1. How would I do this? While this example data set is small, my real data set is much larger, and I want to learn how to do this by column indexing and iteration.
An easier option is to remove the substring except the last word and use make.unique
names(df1)[-1] <- make.unique(sub(".*\\.", "", names(df1)[-1]), sep="_")
names(df1)
#[1] "MemberID" "question" "date" "city" "question_1" "date_1" "city_1"
Or if we need the exact output as expected, extract the last word with sub and use ave to create the sequence based on duplicate names
v1 <- sub(".*\\.(\\w)", "\\U\\1", names(df1)[-1], perl = TRUE)
names(df1)[-1] <- paste(v1, ave(v1, v1, FUN = seq_along), sep="_")
names(df1)
#[1] "MemberID" "Question_1" "Date_1" "City_1"
#[5] "Question_2" "Date_2" "City_2"
#
# create vector of question name triplets
theList <- c("question_","date_","city_")
# create enough headings for 10 questions
questions <- rep(theList,10)
idNumbers <- 1:length(questions)
library(numbers)
# use mod function to group ids into triplets
idNumbers <- as.character(ifelse(mod(idNumbers,3)>0,floor(idNumbers/3)+1,floor(idNumbers/3)))
# concatenate question stems with numbers and add MemberID column at start of vector
questionHeaders <- c("MemberID",paste0(questions,idNumbers))
head(questionHeaders)
...and the output:
[1] "MemberID" "question_1" "date_1" "city_1" "question_2" "date_2"
Use the colnames() or names() function to assign this vector as the column names of the data frame.
As noted in the comments on the OP, the question ID numbers can be generated by using the each= argument in rep(), eliminating the need for the mod() function.
idNumbers <- rep(1:10,each = 3)
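Putting the pieces together, a minimal sketch in base R could look like the following. The names stems, nQuestions, and newNames are illustrative, and the question count of 2 is an assumption chosen to match the example data.

```r
# Hypothetical stems and question count; adjust to the real data
stems <- c("Question", "Date", "City")
nQuestions <- 2

# Repeat the stems once per question, and each question number
# once per stem, then join them with "_"
newNames <- c("MemberID",
              paste(rep(stems, times = nQuestions),
                    rep(seq_len(nQuestions), each = length(stems)),
                    sep = "_"))
newNames
```

With real data, nQuestions could be derived as (ncol(df1) - 1) / length(stems), assuming every question contributes exactly three columns.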
I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values, the first columns are all string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns are either a header like a percentage change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract data for each category. So for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in my first column. Is there a way to do this in either base R or dplyr easily?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1, 2, 3, 4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
return(df[start:(end-1),])
}
Then loop over the indices in
dividers = which(df$column1 %in% keywords == TRUE)
And save the function outputs however one would like.
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
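A self-contained sketch of this divider approach is below. The data frame, the keywords vector, and the names bounds and sections are made up for illustration. One deliberate difference from the lapply call above: a sentinel index is appended so that the rows after the last keyword are also returned.

```r
# Made-up data: keyword rows mark the start of each section
df <- data.frame(column1 = c("random", "credit", "AAA", "BBB", "FX", "USD", "EUR"),
                 value = c(NA, NA, 1, 2, NA, 3, 4),
                 stringsAsFactors = FALSE)
keywords <- c("credit", "FX")

# Row positions of the keyword rows
dividers <- which(df$column1 %in% keywords)

# Append a sentinel so the last section runs to the end of the data
bounds <- c(dividers, nrow(df) + 1)

# One data frame per keyword, from its row up to the next divider
sections <- lapply(seq_along(dividers), function(i) {
  df[bounds[i]:(bounds[i + 1] - 1), ]
})
names(sections) <- df$column1[dividers]

sections$credit
```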
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)

set.seed(1)
df <- data.frame(
x = sample(LETTERS[1:10]),
y = rnorm(10),
z = runif(10)
)
start <- c("C", "E", "F")
df2 <- df %>%
mutate(start = x %in% start,
group = cumsum(start))
split(df2, df2$group)
I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think OP is trying to just replace certain values in a data frame using indexing. As the comment above noted the example dataframe's column is comprised of a factor variable, which makes replacing the value behave differently than you might expect. There are two ways to get around this.
The first (more verbose) way is to force df$names to be a character variable instead of a factor. Then using indexing to select the value you'd like to change and replace it:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = FALSE when creating the data frame (this is the default since R 4.0.0) and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
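If the goal is to select rows by position inside mutate, as in the question's attempt, one option is row_number(). This is a sketch assuming names is already a character column; df2 is just an illustrative name.

```r
library(dplyr)

df <- data.frame(names = c("george", "michael", "lena", "tony"),
                 stringsAsFactors = FALSE)

# row_number() gives each row's position, so no helper column is needed
df2 <- df %>%
  mutate(names = if_else(row_number() %in% 2:3, "elsa", names))
df2
```

This avoids adding the extra rows column with cbind, since the position is computed on the fly.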
We can do this using data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the row index as i and assign (:=) 'elsa' to the 'names' column. Because := updates by reference, this stays fast on large datasets.
library(data.table)
setDT(df)[2:3, names := 'elsa']
df
# names
#1: george
#2: elsa
#3: elsa
#4: tony