I am working with a set of Excel spreadsheets whose column names are dates.
After reading in the data with readxl::read_xlsx(), these column names become Excel serial dates (i.e. an integer representing days elapsed since 1899-12-30).
Is it possible to use dplyr::rename_if() or similar to rename all columns whose names are currently integers? I have written a function rename_func that I would like to apply to all such columns.
df %>% rename_if(is.numeric, rename_func) is not suitable, as is.numeric is applied to the data in the column, not to the column name itself. I have also tried:
is.name.numeric <- function(x) is.numeric(names(x))
df %>% rename_if(is.name.numeric, rename_func)
which does not change any names (i.e. is.name.numeric returns FALSE for all columns).
Edit: here is a dummy version of my data:
df_badnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), `38718` = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
`38749` = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
`38777` = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
`38808` = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
and I would like:
df_goodnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), Jan2006 = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
Feb2006 = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
Mar2006 = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
Apr2006 = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
I understand that it is best practice to create a date column and reshape this df, but I need to join a few spreadsheets first, and having integer column names causes a lot of problems. I currently have a workaround, but the crux of my question (applying a rename_if predicate to a name rather than a column) is still interesting.
Although the names look numeric, they are not:
class(names(df_badnames))
#[1] "character"
so they would not be caught by is.numeric or similar functions.
One way to do this is to find out which names can be coerced to numeric and then convert them into the date format of our choice (as.numeric() warns about the names that become NA, which is expected here):
cols <- as.numeric(names(df_badnames))
names(df_badnames)[!is.na(cols)] <- format(as.Date(cols[!is.na(cols)],
origin = "1899-12-30"), "%b%Y")
df_badnames
# Level Title Jan2006 Feb2006 Mar2006 Apr2006
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 AUSTRALIAN TOTAL 213778. 216274. 218564. 220065.
#2 2 MANAGERS 20998. 21316. 21672. 22012.
#3 3 Chief Executives, Managing Directors & Legisla… 502. 498. 494. 489.
#4 3 Farmers and Farm Managers 121. 119. 118. 116.
#5 3 Hospitality, Retail and Service Managers 4403. 4469. 4541. 4609.
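On the crux of the question (a predicate on the name rather than on the column), here is a minimal sketch, assuming dplyr 1.0.0 or later, where rename_with() applies a function to the names of the columns picked by a tidyselect helper. The rename_func body and the small df_small data frame are illustrative stand-ins, not the asker's originals, and the month abbreviations produced by format() depend on the session locale.

```r
library(dplyr)

# Stand-in for the asker's rename_func (assumption): turn an Excel
# serial-date string such as "38718" into a label such as "Jan2006".
rename_func <- function(nm) {
  format(as.Date(as.numeric(nm), origin = "1899-12-30"), "%b%Y")
}

# Small toy frame with two all-digit column names.
df_small <- data.frame(
  Level = 1,
  Title = "MANAGERS",
  `38718` = 20997.52,
  `38749` = 21316.05,
  check.names = FALSE
)

# matches() selects columns by a regex on their *names*,
# so only the all-digit names are passed to rename_func.
renamed <- df_small %>%
  rename_with(rename_func, .cols = matches("^[0-9]+$"))

names(renamed)
```

Because the selection is done on names, the non-date columns (Level, Title) are never passed to rename_func, so no coercion warnings are raised.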
I'd like to figure out a way to compare columns in the SAME data frame, but in such a way that I create a new column called STATUS for the output. I have 3 columns: 1) SNPs, 2) GAINED, and 3) LOST. I want to know if the data in each cell in column 1 is present in either column 2 or 3. If the data from column 1 is present in column 2, then I would want the output to say GAINED, and if it's present in column 3, then the output would be LOST. If it's present in neither, then the output will be NEUTRAL.
Here is what I would like:
SNPs GAINED LOST STATUS
1_752566 1_949654 6_30022061 NEUTRAL
1_776546 1_1045331 6_30314321 NEUTRAL
1_832918 1_832918 13_95612033 GAINED
1_914852 1_1247494 1_914852 LOST
I've tried this:
data_frame$status <- data.frame(lapply(data_frame[1], `%in%`, data_frame[2:3]))
but it produces 2 columns that all say NEUTRAL. I believe it's reading per row to see if it matches, but my data isn't organized in that manner such that it will find every match per row. Instead I'd like to search the entire column and have R find the matches in each cell instead of searching per row.
You don't need lapply or anything fancy like that.
data_frame$STATUS = with(data_frame,
ifelse(SNPs %in% GAINED, "GAINED",
ifelse(SNPs %in% LOST, "LOST", "NEUTRAL")
)
)
Note that the way this is written the GAINED condition is checked first so if it is present in both GAINED and LOST the result will be "GAINED".
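To make the precedence concrete, here is a small sketch in base R on toy data (the values a, b, c, d and the other letters are made up for illustration). The value "c" appears in both GAINED and LOST, and the nested ifelse reports it as "GAINED" because that condition is tested first.

```r
# Toy data: "b" is only in GAINED, "d" is only in LOST,
# "c" is in both columns, "a" is in neither.
snps <- data.frame(
  SNPs   = c("a", "b", "c", "d"),
  GAINED = c("b", "x", "y", "c"),
  LOST   = c("c", "d", "w", "v"),
  stringsAsFactors = FALSE
)

# %in% checks membership against the whole column, not row by row.
snps$STATUS <- with(snps,
  ifelse(SNPs %in% GAINED, "GAINED",
         ifelse(SNPs %in% LOST, "LOST", "NEUTRAL")))

snps$STATUS
# [1] "NEUTRAL" "GAINED"  "GAINED"  "LOST"
```

If a value present in both columns should be labelled differently, an extra condition testing both memberships would have to come before the GAINED test.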
Using a nested ifelse should work, and be fairly understandable if indented properly:
tbl$status <- ifelse(tbl$SNPs %in% tbl$GAINED, "GAINED",
ifelse(tbl$SNPs %in% tbl$LOST, "LOST", "NEUTRAL") )
> tbl
SNPs GAINED LOST STATUS status
1 1_752566 1_949654 6_30022061 NEUTRAL NEUTRAL
2 1_776546 1_1045331 6_30314321 NEUTRAL NEUTRAL
3 1_832918 1_832918 13_95612033 GAINED GAINED
4 1_914852 1_1247494 1_914852 LOST LOST
A Tidyverse approach with case_when
library(tidyverse)
df <-
structure(
list(
SNPs = c("1_752566", "1_776546", "1_832918", "1_914852"),
GAINED = c("1_949654", "1_1045331", "1_832918", "1_1247494"),
LOST = c("6_30022061", "6_30314321", "13_95612033", "1_914852")
),
row.names = c(NA,-4L),
spec = structure(list(
cols = list(
SNPs = structure(list(), class = c("collector_character",
"collector")),
GAINED = structure(list(), class = c("collector_character",
"collector")),
LOST = structure(list(), class = c("collector_character",
"collector"))
),
default = structure(list(), class = c("collector_guess",
"collector")),
delim = ","
), class = "col_spec"),
class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame")
)
df %>%
mutate(STATUS = case_when(
SNPs %in% GAINED ~ 'GAINED',
SNPs %in% LOST ~ 'LOST',
TRUE ~ 'NEUTRAL'
))
#> # A tibble: 4 × 4
#> SNPs GAINED LOST STATUS
#> <chr> <chr> <chr> <chr>
#> 1 1_752566 1_949654 6_30022061 NEUTRAL
#> 2 1_776546 1_1045331 6_30314321 NEUTRAL
#> 3 1_832918 1_832918 13_95612033 GAINED
#> 4 1_914852 1_1247494 1_914852 LOST
Created on 2022-12-01 with reprex v2.0.2
I want to pass all values in a dataframe as conditions to dplyr::case_when() with stringr::str_detect(), while using the respective column title as the replacement value.
I have these two data frames:
> print(city_stack)
# A tibble: 11 × 1
city
<chr>
1 Britz
2 Berlin-Reinickendorf
3 Berlin-Kladow
4 Berlin-Spindlersfeld
5 Berlin-Mahlsdorf
6 Berlin-Lichterfelde
7 Berlin-Spandau
8 Berlin-Biesdorf
9 Berlin-Niederschöneweide
10 Rüdersdorf bei Berlin
11 Berlin-Nordend
> print(districts_stack)
# A tibble: 10 × 2
Berlin Köln
<chr> <chr>
1 Adlershof Rodenkirchen
2 Altglienicke Chorweiler
3 Baumschulenweg Ehrenfeld
4 Biesdorf Kalk
5 Blankenburg Lindenthal
6 Blankenfelde Mülheim
7 Bohnsdorf Nippes
8 Britz Porz
9 Buch Kölner Zoo
10 Buckow Universität zu Köln
I tried using a nested for loop:
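for (i in colnames(districts_stack)) {
  for (j in districts_stack[[i]]) {
    # Replace the city with the column title i whenever district j is found in it;
    # the result has to be assigned back, otherwise it is discarded
    city_stack <- mutate(city_stack, city = case_when(
      str_detect(city, j) ~ i,
      TRUE ~ city
    ))
  }
}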
While that totally works, this is extremely inefficient and gets problematic with the huge dataframe I am actually working with. I feel like there should be a more efficient solution using purrr::map(), but I wasn't able to come up with anything working.
dput() of the dataframes:
dput(city_stack[1:11,])
structure(list(city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow",
"Berlin-Spindlersfeld", "Berlin-Mahlsdorf", "Berlin-Lichterfelde",
"Berlin-Spandau", "Berlin-Biesdorf", "Berlin-Niederschöneweide",
"Rüdersdorf bei Berlin", "Berlin-Nordend")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
> dput(districts_stack[1:10,1:2])
structure(list(Berlin = c("Adlershof", "Altglienicke", "Baumschulenweg",
"Biesdorf", "Blankenburg", "Blankenfelde", "Bohnsdorf", "Britz",
"Buch", "Buckow"), Köln = c("Rodenkirchen", "Chorweiler", "Ehrenfeld",
"Kalk", "Lindenthal", "Mülheim", "Nippes", "Porz", "Kölner Zoo",
"Universität zu Köln")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not 100% sure what output you're looking for. However, I believe this is a step in the right direction. Rather than looping over the district values and checking for matches, I propose melting the districts_stack data and joining that new df to the city names using a fuzzy string match.
That is what I understand is happening in the loop. You then have a dataframe in which you can replace the city value more easily using if_else.
I drew inspiration from this thread: dplyr: inner_join with a partial string match
library(tidyverse)
library(fuzzyjoin) # to join the data based on fuzzy matches to get results in one dataframe for easier manipulation
city_stack <- structure(list(city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow",
"Berlin-Spindlersfeld", "Berlin-Mahlsdorf", "Berlin-Lichterfelde",
"Berlin-Spandau", "Berlin-Biesdorf", "Berlin-Niederschöneweide",
"Rüdersdorf bei Berlin", "Berlin-Nordend")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
districts_stack <- structure(list(Berlin = c("Adlershof", "Altglienicke", "Baumschulenweg",
"Biesdorf", "Blankenburg", "Blankenfelde", "Bohnsdorf", "Britz",
"Buch", "Buckow"), Köln = c("Rodenkirchen", "Chorweiler", "Ehrenfeld",
"Kalk", "Lindenthal", "Mülheim", "Nippes", "Porz", "Kölner Zoo",
"Universität zu Köln")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")) %>%
pivot_longer(., cols = everything(), names_to='city', values_to='district') %>%
arrange(city)
city_stack %>% # left join to get all potential string matches, then mutate
regex_left_join(districts_stack, by = c(city = "district")) %>%
mutate(city.x = if_else(!is.na(city.y), district, city.x))
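If the goal is to replace a matching city with the column title (as in the question's loop), a dependency-free alternative is to build one alternation pattern per column and test all cities at once. This is a minimal sketch in base R on a subset of the sample data; it assumes the district names contain no regex metacharacters (they would need escaping otherwise), and the object names city, districts and original are illustrative.

```r
# Subset of the sample data, as plain vectors.
city <- c("Britz", "Berlin-Kladow", "Berlin-Biesdorf")
districts <- list(
  Berlin = c("Biesdorf", "Britz", "Buch"),
  "Köln" = c("Kalk", "Porz")
)

# Match against the untouched originals so that an already
# replaced value cannot be matched again by a later column.
original <- city
for (nm in names(districts)) {
  # One regex per column: "Biesdorf|Britz|Buch"
  pattern <- paste(districts[[nm]], collapse = "|")
  city[grepl(pattern, original)] <- nm
}

city
# [1] "Berlin"        "Berlin-Kladow" "Berlin"
```

The loop runs once per column rather than once per district value, and each grepl() call is vectorised over all cities.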
I have a very untidy data set something like this
A tibble: 200000 x 2
ChatData
<chr>
1 Sep 30, 2018 7:12pm
2 Person A
3 Hello
4 Sep 30, 2018 7:11pm
5 Person B
6 Hello there
7 Sep 30, 2018 7:10pm
8 Person A
...
As you can see it goes date, person name, comment, and repeats.
I am working on the problem and have a very complex method that adds a score column depending on the names etc....
I would like to transform this into something like this
Person A , Person B
Hello NA
NA Hello there
how's you, NA
...
(The date as a row name or third column would be great but not essential to the question)
Optimally I am looking for a dplyr/tidyverse solution
I am working with lots of data so no slow for loops etc..
Raw data to work with:
structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
If anyone is wondering I am analysing facebook messenger data, and this is the form it comes in when you download it.
Thank you.
Your starting data set has only one column (aka feature), but three types of data about each message are encoded in it: a timestamp, the label of the person, and a message. It will be more useful to transform these into a table where each message is in its own row, and each column represents a different aspect of each observation, i.e. in long, or "tidy", format: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
In the approach below, the user first defines what features are repeated in the data set. I call them "headers" here, since I'm working toward a table where these are the column headers. Then the script adds that information to the data and converts the single-column data into a tidy format with one row per message, and one aspect of each message in each column.
Your requested output is a minor variation of this, addressed in the last line below: %>% spread(person, msg), which separates out the Person A and Person B data into separate columns.
library(tidyverse)
header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)
data_length <- length(data$ChatData) / rows_per
data2 <- data %>%
mutate(msg_number = rep(1:(nrow(data)/rows_per), each=rows_per),
# This line repeats the header_names sequence for each msg
header = rep(header_names, data_length)) %>%
spread(header, ChatData) %>%
mutate(timestamp = lubridate::mdy_hm(timestamp)) %>%
spread(person, msg)
head(data2)
# A tibble: 2 x 4
msg_number timestamp `Person A` `Person B`
<int> <dttm> <chr> <chr>
1 1 2018-09-30 19:12:00 Hello NA
2 2 2018-09-30 19:11:00 NA Hello there
As you basically just have a character vector that you would like to convert into a 3-column data.frame, one other option is to simply use matrix() and specify ncol=3 and byrow=TRUE:
# your sample data
d <- structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
matrix( d$ChatData, ncol=3, byrow=TRUE,
dimnames=list( NULL, c("date_time", "person", "message")) )
Result is a character matrix:
date_time person message
[1,] "Sep 30, 2018 7:12pm" "Person A" "Hello"
[2,] "Sep 30, 2018 7:11pm" "Person B" "Hello there"
But you can wrap that in as.data.frame() to convert to a data.frame and continue working from there with dplyr if that's what you want.
Put it together for a whole solution:
It becomes a nice short, readable bit of code IMO:
library(dplyr)
library(lubridate)
result_df <-
matrix( d$ChatData, ncol=3, byrow=TRUE,
dimnames=list(NULL, c("date_time", "person", "message")) ) %>%
as.data.frame() %>%
mutate(date_time=lubridate::mdy_hm(date_time))
Here is one approach:
data %>% group_by(msg_number = rep(1:(nrow(data)/3), each=3)) %>%
summarize(msg_data = list(ChatData)) %>% as.data.frame
msg_number msg_data
1 1 Sep 30, 2018 7:12pm, Person A, Hello
2 2 Sep 30, 2018 7:11pm, Person B, Hello there
This numbers each message and puts its data into a list column.
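To go from that grouped form to one column per field, the per-message vectors can be row-bound. This is a minimal sketch in base R, assuming every message occupies exactly three consecutive rows; the names chat, msg_data and wide are illustrative.

```r
chat <- c("Sep 30, 2018 7:12pm", "Person A", "Hello",
          "Sep 30, 2018 7:11pm", "Person B", "Hello there")

# Guard the assumption that the data come in complete triples.
stopifnot(length(chat) %% 3 == 0)

# Same numbering idea as above: 1, 1, 1, 2, 2, 2, ...
msg_number <- rep(seq_len(length(chat) / 3), each = 3)

# One length-3 character vector per message.
msg_data <- split(chat, msg_number)

# Bind the vectors as rows and name the three fields.
wide <- as.data.frame(do.call(rbind, msg_data), stringsAsFactors = FALSE)
names(wide) <- c("timestamp", "person", "msg")

wide$person
# [1] "Person A" "Person B"
```

The stopifnot() check fails early if a message is missing a field, which would otherwise silently shift every later row.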
I am trying to apply a function to convert financial accounts from a number of companies to USD.
The firms relating to each currency can be found below.
USDfirms <- c("GOOG", "AMZN", "AAPL", "CSCO", "FB", "HP", "IBM", "0992.HK",
"MSFT", "CRM", "TWTR", "WB", "ZTE.CN")
CNYfirms <- c("BABA", "BIDU", "1169.HK", "HMI", "3888.HK", "1357.HK", "NTES",
"TCEHY", "1810.HK", "0763.HK")
TWDfirms <- c("2357.TW", "2324.TW", "2356.TW", "2498.TW", "3231.TW")
KRWfirms <- c("003550.KS", "005930.KS")
JPYfirms <- c("5563.T", "7752.T")
EURfirms <- "NOK"
So CNYfirms corresponds to Chinese firms whose financial accounts are in CNY. The dput() is a dump of the finances of approximately 30 companies and can be found here.
EDIT: link 2 here http://s000.tinyupload.com/download.php?file_id=06545415747486823455&t=0654541574748682345555828
The data frame in the dput() is called BSISCF.
I also have a currency conversion table, which looks like the following:
# A tibble: 1 x 6
date cny_usd twd_usd krw_usd jpy_usd eur_usd
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-04-01 0.159 0.0344 0.000943 0.00941 1.23
Data
fx <-structure(list(date = structure(17622, class = "Date"), cny_usd = 0.159228,
twd_usd = 0.03442, krw_usd = 0.000943, jpy_usd = 0.009408,
eur_usd = 1.232305), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .Names = c("date", "cny_usd", "twd_usd",
"krw_usd", "jpy_usd", "eur_usd"))
I have the following code, and what I am trying to do is take the symbol column and use ifelse statements. So if the symbol column matches a symbol in CNYfirms, then put the cny_usd exchange rate from the fx table in the exchange.rates column, and do the same for all the other symbols. NA values will correspond to USD firms, since there is no USD exchange rate in the fx table.
BSISCF <- BSISCF %>%
separate(symbol, into = c("ticker", "country"),
sep = "[.]", convert = TRUE, remove = FALSE) %>% # The NA values just correspond to US data
mutate(exchange.rates = ifelse(symbol == CNYfirms, fx$cny_usd,
ifelse(symbol == TWDfirms, fx$twd_usd,
ifelse(symbol == KRWfirms, fx$krw_usd,
ifelse(symbol == JPYfirms, fx$jpy_usd,
ifelse(symbol == EURfirms, fx$eur_usd, 0)))))) %>%
select(exchange.rates, everything())
These are the next steps I will take once the exchange rates have been collected:
mutate(exchange.rates = ifelse(is.na(country), 1, exchange.rates)) %>%
mutate_at(.funs = funs(fx = .*exchange.rates), .vars = vars(Cash.And.Cash.Equivalents:Change.In.Cash.and.Cash.Equivalents)) %>%
mutate(adjusted_fx = adjusted*exchange.rates) #All financial statements and the stocks adjusted price converted into USD
If there is something I did not explain well, let me know.
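One thing worth checking in the code above: symbol == CNYfirms compares the two vectors element by element (recycling the shorter one), whereas the lookup described calls for membership, which is what %in% tests. Here is a minimal sketch in base R of the membership version; the symbols vector is a small hand-picked subset of the tickers listed above, and the default rate of 1 for USD firms is an assumption taken from the later "next steps" code.

```r
CNYfirms <- c("BABA", "BIDU", "1169.HK", "HMI", "3888.HK", "1357.HK", "NTES",
              "TCEHY", "1810.HK", "0763.HK")
TWDfirms <- c("2357.TW", "2324.TW", "2356.TW", "2498.TW", "3231.TW")
KRWfirms <- c("003550.KS", "005930.KS")
JPYfirms <- c("5563.T", "7752.T")
EURfirms <- "NOK"

fx <- data.frame(cny_usd = 0.159228, twd_usd = 0.03442, krw_usd = 0.000943,
                 jpy_usd = 0.009408, eur_usd = 1.232305)

# Illustrative subset of symbols, one per currency group.
symbols <- c("GOOG", "BABA", "2357.TW", "005930.KS", "5563.T", "NOK")

# %in% returns one TRUE/FALSE per symbol, regardless of how long
# each firm vector is. Anything unmatched is treated as USD (rate 1).
exchange.rates <-
  ifelse(symbols %in% CNYfirms, fx$cny_usd,
  ifelse(symbols %in% TWDfirms, fx$twd_usd,
  ifelse(symbols %in% KRWfirms, fx$krw_usd,
  ifelse(symbols %in% JPYfirms, fx$jpy_usd,
  ifelse(symbols %in% EURfirms, fx$eur_usd, 1)))))

exchange.rates
```

The same conditions can be dropped into the mutate() call in place of the == comparisons.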
I'm working with survey data containing value labels. The haven package allows one to import data with value label attributes. Sometimes these value labels need to be edited in routine ways.
The example I'm giving here is very simple, but I'm looking for a solution that can be applied to similar problems across large data.frames.
d <- dput(structure(list(var1 = structure(c(1, 2, NA, NA, 3, NA, 1, 1), labels = structure(c(1,
2, 3, 8, 9), .Names = c("Protection of environment should be given priority",
"Economic growth should be given priority", "[DON'T READ] Both equally",
"[DON'T READ] Don't Know", "[DON'T READ] Refused")), class = "labelled")), .Names = "var1", row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")))
d$var1
<Labelled double>
[1] 1 2 NA NA 3 NA 1 1
Labels:
value label
1 Protection of environment should be given priority
2 Economic growth should be given priority
3 [DON'T READ] Both equally
8 [DON'T READ] Don't Know
9 [DON'T READ] Refused
If a value label begins with "[DON'T READ]" I want to remove "[DON'T READ]" from the beginning of the label and add "(VOL)" at the end. So, "[DON'T READ] Both equally" would now read "Both equally (VOL)."
Of course, it's straightforward to edit this individual variable with a function from haven's associated labelled package. But I want to apply this solution across all the variables in a data.frame.
library(labelled)
val_labels(d$var1) <- c("Protection of environment should be given priority" = 1,
"Economic growth should be given priority" = 2,
"Both equally (VOL)" = 3,
"Don't Know (VOL)" = 8,
"Refused (VOL)" = 9)
How can I achieve the result of the function directly above in a way that can be applied to every variable in a data.frame?
The solution must work regardless of the specific value. (In this instance it is values 3,8, & 9 that need alteration, but this is not necessarily the case).
There are a few ways to do this. You could use lapply() or (if you want a one(ish)-liner) you could use any of the scoped variants of mutate():
1) Using lapply()
This method loops over all columns with gsub() to remove the part you do not want and adds the " (VOL)" to the end of the string. Of course you could use this with a subset as well!
d[] <- lapply(d, function(x) {
labels <- attributes(x)$labels
names(labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labels))
attributes(x)$labels <- labels
x
})
d$var1
[1] 1 2 NA NA 3 NA 1 1
attr(,"labels")
Protection of environment should be given priority Economic growth should be given priority
1 2
Both equally (VOL) Don't Know (VOL)
3 8
Refused (VOL)
9
attr(,"class")
[1] "labelled"
2) Using mutate_all()
Using the same logic (with the same result) you could change the name of the labels in a tidier way:
d %>%
mutate_all(~{names(attributes(.)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(attributes(.)$labels));.}) %>%
map(attributes) # just to check on the result
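For data frames that mix labelled and unlabelled columns, the same idea can be wrapped in a helper that leaves columns without a "labels" attribute untouched. This is a minimal sketch in base R; relabel_vol is a hypothetical helper name, and the two-column data frame is a toy built from the labels in the question.

```r
# Hypothetical helper: rewrite "[DON'T READ] xyz" to "xyz (VOL)" in the
# names of the "labels" attribute; return x unchanged if it has no labels.
relabel_vol <- function(x) {
  labs <- attr(x, "labels")
  if (is.null(labs)) return(x)
  names(labs) <- sub("^\\[DON'T READ\\][[:space:]]*(.*)$", "\\1 (VOL)", names(labs))
  attr(x, "labels") <- labs
  x
}

# Toy data: one unlabelled column and one column carrying value labels.
d2 <- data.frame(plain = c("a", "b", "c"), stringsAsFactors = FALSE)
d2$var1 <- structure(
  c(1, 2, 3),
  labels = c("Economic growth should be given priority" = 2,
             "[DON'T READ] Both equally" = 3)
)

# d2[] keeps the data.frame structure while replacing every column.
d2[] <- lapply(d2, relabel_vol)

names(attr(d2$var1, "labels"))
# [1] "Economic growth should be given priority" "Both equally (VOL)"
```

Only the label names change; the underlying values (2 and 3 here) are preserved, so the approach does not depend on which specific values carry the "[DON'T READ]" prefix.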