I have a .csv file like this (except that the real .csv file has many more columns):
library(tidyverse)
tibble(id1 = c("a", "b"),
id2 = c("c", "d"),
data1 = c(1, 2),
data2 = c(3, 4),
data1s = c(5, 6),
data2s = c(7, 8)) %>%
write_csv("df.csv")
I only want id1, id2, data1, and data2.
I can do this:
df <- read_csv("df.csv",
col_names = TRUE,
cols_only(id1 = col_character(),
id2 = col_character(),
data1 = col_integer(),
data2 = col_integer()))
But, as mentioned above, my real dataset has many more columns, so I'd like to use tidyselect helpers to only read in specified columns and ensure specified formats.
I tried this:
df2 <- read_csv("df.csv",
col_names = TRUE,
cols_only(starts_with("id") = col_character(),
starts_with("data") & !ends_with("s") = col_integer()))
But the error message indicates that there's a problem with the syntax. Is it possible to use tidyselect helpers in this way?
My proposal is somewhat roundabout, but it does let you customise the read spec with rules rather than by listing every column explicitly:
library(tidyverse)
tibble(id1 = c("a", "b"),
id2 = c("c", "d"),
data1 = c(1, 2),
data2 = c(3, 4),
data1s = c(5, 6),
data2s = c(7, 8)) %>%
write_csv("df.csv")
# read only 1 row to make a spec from with minimal read; really just to get the colnames
df_spec <- spec(read_csv("df.csv",
col_names = TRUE,
n_max = 1))
#alter the spec with base R functions startsWith / endsWith etc.
df_spec$cols <- imap(df_spec$cols,~{if(startsWith(.y,"id")){
col_character()
} else if(startsWith(.y,"data") &&
!endsWith(.y,"s")){
col_integer()
} else {
col_skip()
}})
df <- read_csv("df.csv",
col_types = df_spec$cols)
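A shorter variant of the same idea, as a rough sketch: read only the header, pick the column names with base R string helpers, and build the cols_only() call from a named list. The temporary file path and the names `nms`, `id_cols`, `data_cols` and `types` are my own illustrative choices, not anything from the question.

```r
library(readr)

# Recreate the example file at a temporary path (assumption: any path works).
path <- tempfile(fileext = ".csv")
write_csv(data.frame(id1 = c("a", "b"), id2 = c("c", "d"),
                     data1 = c(1, 2), data2 = c(3, 4),
                     data1s = c(5, 6), data2s = c(7, 8)), path)

# Read zero data rows, just to get the column names.
nms <- names(read_csv(path, n_max = 0,
                      col_types = cols(.default = col_character())))

# Apply the selection rules to the names.
id_cols   <- nms[startsWith(nms, "id")]
data_cols <- nms[startsWith(nms, "data") & !endsWith(nms, "s")]

# Named list of collectors: one entry per column to keep.
types <- c(
  setNames(rep(list(col_character()), length(id_cols)), id_cols),
  setNames(rep(list(col_integer()), length(data_cols)), data_cols)
)

# cols_only() drops every column that is not named in the spec.
df <- read_csv(path, col_types = do.call(cols_only, types))
```

The rules live in two plain character-vector filters, so adding another rule is just one more setNames() entry in `types`.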
I have a working directory with a large number of xlsm files (600ish). I need to merge all of these files into one dataframe, but ONLY the second sheet of the excel file. Since there are a lot of files, ideally I would use a loop, but I'm struggling with how to do this. Right now I have this code, which is obviously not working. Any thoughts on how to best do this would be greatly appreciated.
library(readxl)
library(tidyverse)
data.files = list.files(pattern = "*.xlsm")
data_to_merge <- lapply(data.files, read_excel(x, sheet = 2))
combined_df <- bind_rows(data_to_merge)
Not sure how to include examples of the data so it's easily reproducible since my question is dealing with excel sheets, not data that's already in r, but if this is useful, all of the 2nd sheets have the same simple structure that looks something like this:
data1 <- data.frame(id = 1:6,
x1 = c(5, 1, 4, 9, 1, 2),
x2 = c("A", "Y", "G", "F", "G", "Y"))
data2 <- data.frame(id = 4:9,
y1 = c(3, 3, 4, 1, 2, 9),
y2 = c("a", "x", "a", "x", "a", "x"))
You were close. You just need to slightly alter your lapply statement so that the function and its extra argument are separated by a comma; lapply then passes each file name to read_excel as the first argument.
library(readxl)
library(tidyverse)
data.files = list.files(pattern = "\\.xlsm$") # pattern is a regular expression, not a glob
data_to_merge <- lapply(data.files, read_excel, sheet = 2)
combined_df <- bind_rows(data_to_merge)
Or a more tidyverse approach:
combined_df <- list.files(pattern = "\\.xlsm$") %>%
map(~ read_excel(.x, sheet = 2)) %>%
bind_rows()
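To see why the original call fails and the corrected one works, here is a minimal sketch that runs without any Excel files. `read_sheet` is a hypothetical stand-in for read_excel and the file names are made up; the point is only how lapply forwards extra arguments.

```r
# Hypothetical stand-in for read_excel(), so the sketch needs no files.
read_sheet <- function(path, sheet) {
  data.frame(file = path, sheet = sheet, stringsAsFactors = FALSE)
}

files <- c("a.xlsm", "b.xlsm")  # made-up file names

# Pass the function itself; lapply forwards `sheet = 2` to every call.
res1 <- lapply(files, read_sheet, sheet = 2)

# Equivalent form with an explicit anonymous function.
res2 <- lapply(files, function(x) read_sheet(x, sheet = 2))

# Stack the per-file results into one data frame.
combined <- do.call(rbind, res1)
```

Writing `lapply(files, read_sheet(x, sheet = 2))` instead tries to evaluate `read_sheet(x, sheet = 2)` once, before lapply runs, and there is no `x` at that point.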
I want to iterate over several columns of a flextable using the mk_par function. Consider the following example:
tibble(a = c(1:10),
b1 = letters[1:10],
b2 = LETTERS[1:10],
c1 = paste0("new_",letters[1:10]),
c2 = paste0(LETTERS[1:10], "_new")) %>%
flextable(col_keys = c("a", "b", "c")) %>%
mk_par(j = "b", value = as_paragraph(b1, b2)) %>%
mk_par(j = "c", value = as_paragraph(c1, c2))
I would like to replace the two mk_par statements with a single expression that takes the arguments c("b", "c") and renders the same output. I have succeeded in rewriting this with a for loop:
for(pref in c("b", "c")){
tt <- tt %>%
mk_par(j = pref,
value = as_paragraph(.data[[paste0(pref,1)]],
.data[[paste0(pref,2)]]))
}
but I wonder if there is a one line expression that does the same which integrates smoothly in a dplyr pipe syntax?
I am new to R and I have a question. I have two data frames, and I want to change the values of a column in the second data frame based on the values of a column in the first data frame. Both columns are strings and contain 4 numbers separated by hyphens (-). Here is an example,
So, based on this example, column b of Table 2 should change so that, if the first and last numbers of a set match those of a set in Table 1, the value is replaced with the value from Table 1. Also, if a cell in column b of Table 2 has first and last numbers that do not exist in Table 1, that row should be deleted (in this example: 2-201-2012-250).
Thank you
Is this what you're looking for?
library(stringr) #for str_split()
library(dplyr) #for left_join()
my_df <- data.frame("a" = c(1, 2, 3, 4),
"b" = c("7-1-1-100", "7-1-1-12", "31-1-1-5", "31-1-1-8"),
"c" = c(0, 0, 0, 0), stringsAsFactors = FALSE)
my_df2 <- data.frame("a" = c(1, 2, 3, 4, 5),
"b" = c("7-1-1-100", "7-1-1-12", "2-1-1-250", "31-1-1-5", "31-1-1-8"),
"c" = c("ABC", "ABCD", "AD", "ABV", "CDF"), stringsAsFactors = FALSE)
my_var <- str_split(string = my_df$b, pattern = "-", n = 4, simplify = TRUE)
my_var2 <- str_split(string = my_df2$b, pattern = "-", n = 4, simplify = TRUE)
my_df$d <- paste(my_var[, 1], my_var[, 4], sep = "-")
my_df2$d <- paste(my_var2[, 1], my_var2[, 4], sep = "-")
my_df <- left_join(my_df[, c("a", "b", "d")], my_df2[, c("d", "c")], by = "d")
my_df <- my_df[, c("a", "b", "c")]
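If you would rather avoid the extra packages, the same first-and-last-number key can be built with sub() and looked up with match(). This is only a sketch read literally from the question: the values in `t1` and `t2` are made up by me (the original tables are not shown here), and `key` is a hypothetical helper name.

```r
# Build a "first-last" key from strings like "7-1-1-100" -> "7-100".
key <- function(x) sub("^([^-]+)-.*-([^-]+)$", "\\1-\\2", x)

# Made-up stand-ins for Table 1 and Table 2.
t1 <- data.frame(b = c("7-1-1-100", "31-1-1-5"),
                 stringsAsFactors = FALSE)
t2 <- data.frame(b = c("7-9-9-100", "2-201-2012-250", "31-4-4-5"),
                 c = c("ABC", "AD", "ABV"),
                 stringsAsFactors = FALSE)

# For each row of t2, find the position of the matching key in t1 (NA if none).
idx <- match(key(t2$b), key(t1$b))

# Replace b with the Table 1 value, then drop rows that had no match.
t2$b <- t1$b[idx]
t2 <- t2[!is.na(idx), , drop = FALSE]
```

With these inputs, the row holding "2-201-2012-250" is removed and the other two rows take their b values from `t1`.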
I have some tidy data and need to transform it into a format that works for building small graphs (sparklines) using the dataui package. You can see the required dataframe format in the code example below, df_sparkline.
The tidy data I have has about 30 companies and a year of data which is < 10,000 rows. What is the best (clearest to understand is valued more than raw speed) way to transform df_tidy to df_sparklines?
library("dataui")
library("reactable")
library("tidyverse")
df_tidy <- tibble(
company = c("A", "B", "A", "B", "A", "B"),
line_data = c(1, 2, 2, 2, 1, 1),
date = c(as.Date("2021-01-01"), as.Date("2021-01-01"), as.Date("2021-01-02"), as.Date("2021-01-02"), as.Date("2021-01-03"), as.Date("2021-01-03"))
)
df_sparkline <- structure(list(company = c("A", "B"), line_data = list(list(c(1, 2, 1)), list(c(2, 2, 1)))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
rt1 <- reactable(
df_sparkline,
columns = list(
line_data = colDef(
cell = function(value, index) {
dui_sparkline(
data = value[[1]],
height = 80,
components = dui_sparklineseries(curve = "linear") # https://github.com/williaster/data-ui/tree/master/packages/sparkline#series
)
}
)
)
)
rt1
All you need is group_by() and summarise():
df_sparkline2 = df_tidy %>%
group_by(company) %>%
summarise(line_data=list(list(line_data)))
waldo::compare(df_sparkline, df_sparkline2)
# √ No differences
The key here is to call list() inside summarise().
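One thing worth guarding against, as a hedged addition: the sparkline points come out in whatever order the rows happen to be in, so it may be safer to sort by date first. The sketch below assumes that is what you want; `df_sparkline3` is just an illustrative name.

```r
library(dplyr)
library(tibble)

df_tidy <- tibble(
  company = c("A", "B", "A", "B", "A", "B"),
  line_data = c(1, 2, 2, 2, 1, 1),
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02",
                   "2021-01-02", "2021-01-03", "2021-01-03"))
)

df_sparkline3 <- df_tidy %>%
  arrange(company, date) %>%                    # make the point order explicit
  group_by(company) %>%
  summarise(line_data = list(list(line_data)),  # one nested list per company
            .groups = "drop")

# Each cell is a length-1 list holding the numeric vector,
# which is why the reactable cell function uses value[[1]].
df_sparkline3$line_data[[1]][[1]]
```

The inner list() wraps the vector; the outer list() is what lets summarise() store it as a single list-column cell per group.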
I'm looking to update fields in one data table with information from another data table like this:
dt1$name <- dt2$name where dt1$id = dt2$id
In SQL pseudo code : update dt1 set name = dt2.name where dt1.id = dt2.id
As you can see I'm very new to R so every little helps.
Update
I think it was my fault - what I really want is to update a telephone number if the usernames from both dataframes match.
So how can I compare names and update a field if both names match?
Please help :)
dt1$phone<- dt2$phone where dt1$name = dt2$name
Joran's answer assumes dt1 and dt2 can be matched by position.
If it's not the case, you may need to merge() first:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(7, 3), name = c("f", "g"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "id", all.x = TRUE)
dt1$name <- ifelse( ! is.na(dt1$name.y), dt1$name.y, dt1$name.x)
dt1
Edit per your update:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), phone = c("123-123", "456-456", NA), stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"), phone = c(NA, "000-000", "789-789"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "name", all.x = TRUE)
dt1$new_phone <- ifelse( ! is.na(dt1$phone.y), dt1$phone.y, dt1$phone.x)
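If you want to update dt1 in place rather than merge, match() is one way to do the lookup by name. This is a sketch under the assumption that names are unique in dt2 and that a missing phone in dt2 should not overwrite an existing one; `idx` and `hit` are illustrative names.

```r
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"),
                  phone = c("123-123", "456-456", NA),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"),
                  phone = c(NA, "000-000", "789-789"),
                  stringsAsFactors = FALSE)

# Position in dt2 of each dt1 name (NA where the name is absent from dt2).
idx <- match(dt1$name, dt2$name)

# Update only rows with a matching name and a non-missing new phone.
hit <- !is.na(idx) & !is.na(dt2$phone[idx])
dt1$phone[hit] <- dt2$phone[idx[hit]]

dt1
```

Unlike merge(), this keeps the row order and the column set of dt1 unchanged.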
Try:
dt1$name <- ifelse(dt1$id == dt2$id, dt2$name, dt1$name)
Alternatively, maybe:
dt1$name[dt1$id == dt2$id] <- dt2$name[dt1$id == dt2$id]
If you're more comfortable working in SQL, you can use the sqldf package:
dt1 <- data.frame(id = c(1, 2, 3),
name = c("A", "B", "C"),
stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(2, 3, 4),
name = c("X", "Y", "Z"),
stringsAsFactors = FALSE)
library(sqldf)
sqldf("SELECT dt1.id,
CASE WHEN dt2.name IS NULL THEN dt1.name ELSE dt2.name END name
FROM dt1
LEFT JOIN dt2
ON dt1.id = dt2.id")
But, computationally, it's about 150 times slower than Joran's solution, and quite a bit slower in human time as well. However, if you are ever in a bind and just need to do something that you can do easily in SQL, it's an option.