Spread dataframe - r

I have the following dataframe/tibble sample:
structure(list(name = c("Contents.Key", "Contents.LastModified",
"Contents.ETag", "Contents.Size", "Contents.Owner", "Contents.StorageClass",
"Contents.Bucket", "Contents.Key", "Contents.LastModified", "Contents.ETag"
), value = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0_0e94e664-4d5e-4646-b2b9-1937398cfaed_2019-01-01-07-54-46-064",
"2019-01-01T07:54:47.000Z", "\"378d04496cb27d93e1c37e1511a79ec7\"",
"24187", "e7c0d260939d15d18866126da3376642e2d4497f18ed762b608ed2307778bdf1",
"STANDARD", "vfevvv-edrfvevevev-streamed-data", "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0_33a8ba28-245c-490b-99b2-254507431d47_2019-01-01-07-54-56-755",
"2019-01-01T07:54:57.000Z", "\"df8cc7082e0cc991aa24542e2576277b\""
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I want to spread the name column using the tidyr::spread() function, but I don't get the desired result:
df %>% tidyr::spread(key = name, value = value)
I get an error:
Error: Duplicate identifiers for rows:...
I also tried the melt function, with the same result.
I connected to S3 using the aws.s3::get_bucket() function and am trying to convert the result to a dataframe. I am aware there is an aws.s3::get_bucket_df() function that should do this, but it doesn't work (you may look at my related question).
After getting the bucket list, I unlisted it and ran the enframe command.
Please advise.

You can introduce a row-number column first. Each input row then becomes its own output row, so this introduces NAs that you will have to deal with.
library(dplyr)
library(tidyr)
df %>%
  mutate(RN = row_number()) %>%
  group_by(RN) %>%
  spread(name, value)
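An alternative that avoids most of the NAs is to number the records rather than the rows. The sketch below assumes that every S3 object's fields start with a Contents.Key row (this holds in the sample above, but it is an assumption about the full data), and it uses tidyr::pivot_wider(), the newer replacement for spread(). The small data frame and the record column name are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the enframed bucket listing: two records, two fields each
df <- data.frame(
  name  = c("Contents.Key", "Contents.Size", "Contents.Key", "Contents.Size"),
  value = c("file_a", "24187", "file_b", "1024"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Assumption: a new record begins at every "Contents.Key" row
  mutate(record = cumsum(name == "Contents.Key")) %>%
  # One row per record, one column per field name
  pivot_wider(names_from = name, values_from = value)

wide
```

A record that lacks some field (such as the truncated last record in the sample) simply gets NA in that column.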

Related

How to build a new list base on another file and sort it in a certain way

I have a list that contains multiple files and looks like this:
Now I have a df that looks like this:
structure(list(Order = c(1, 2, 3, 4), Data = c("Bone Scan", "Brain Scan",
"", "Cancer History")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
How can I build a new data list that contains only the data named in df$Data, stored in the order in which the names appear in df?
Try subsetting datalist with df$Data. This returns the elements in the same order as df$Data.
result <- datalist[df$Data]
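We can also use pluck, applied to one name at a time because pluck() accepts a single index per nesting level:
library(purrr)
# set_names() keeps the names on the result; a name that is not found yields NULL
map(set_names(df$Data), ~ pluck(datalist, .x))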
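Since the original list is not shown, here is a minimal base R sketch with a made-up datalist (its element names and contents are purely hypothetical). It orders the names by df$Order and drops names that are not in the list, such as the empty string in df$Data, before subsetting.

```r
# Hypothetical list standing in for the real datalist
datalist <- list(
  "Brain Scan"     = 1:3,
  "Cancer History" = c("a", "b"),
  "Bone Scan"      = c(TRUE, FALSE),
  "Unused"         = 0
)

df <- data.frame(
  Order = c(1, 2, 3, 4),
  Data  = c("Bone Scan", "Brain Scan", "", "Cancer History"),
  stringsAsFactors = FALSE
)

# Names in the order given by df$Order
wanted <- df$Data[order(df$Order)]
# Keep only names that actually exist in the list (drops "")
wanted <- wanted[wanted %in% names(datalist)]

result <- datalist[wanted]
names(result)
```

Without the %in% filter, datalist[df$Data] would return a NULL element with an NA name for every entry that has no match.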

rtweet - multiple AND/OR keyword search

I am using the rtweet package to retrieve tweets that contain specific keywords. I know how to do an "and"/"or" match, but how do I chain these together into one keyword query with multiple OR/AND conditions? For example, a search query I may wish to pass to the search_tweets function is:
('cash' or 'currency' or 'banknote' or 'accepting cash' or 'cashless') AND ('covid' or 'virus' or 'coronavirus')
So the tweets can contain any one of the words in the first bracket and also any one of the words in the second bracket.
Using dplyr:
Assuming you have a df with a column that contains a character field of tweets:
Sample data:
df <- structure(list(Column = c("coronavirus cash", "covid", "currency covid",
"currency coronavirus", "coronavirus virus", "trees", "plants",
"moneys")), row.names = c(NA, -8L), class = c("tbl_df", "tbl",
"data.frame"))
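You can use the following:
library(dplyr)
library(stringr)  # provides str_detect()
match <- df %>%
  # keep rows that mention at least one word from the first group
  dplyr::filter(str_detect(Column, "cash|currency|banknote|accepting cash|cashless")) %>%
  # and at least one word from the second group
  dplyr::filter(str_detect(Column, "covid|virus|coronavirus"))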
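When the keyword lists are long, the patterns can be built from character vectors instead of being typed by hand. This is only a sketch: the helper to_pattern() and the variable names are made up here, and the word boundaries (\\b) are an added design choice so that, for example, "cash" does not match inside "cashier".

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Column = c("coronavirus cash", "covid", "currency covid",
             "currency coronavirus", "coronavirus virus", "trees",
             "plants", "moneys"),
  stringsAsFactors = FALSE
)

first_group  <- c("cash", "currency", "banknote", "accepting cash", "cashless")
second_group <- c("covid", "virus", "coronavirus")

# Hypothetical helper: join the words with "|" and require whole-word matches
to_pattern <- function(words) {
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

matches <- df %>%
  filter(
    str_detect(Column, regex(to_pattern(first_group), ignore_case = TRUE)),
    str_detect(Column, regex(to_pattern(second_group), ignore_case = TRUE))
  )

matches
```

Passing both conditions to a single filter() call combines them with AND, which is equivalent to chaining two filter() calls.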

Problem with R error when using dplyr::distinct(): "no applicable method for 'distinct_' applied to an object of class "c('double', 'numeric')""

Here's my example dataframe:
df.ex <- structure(
list(
id_1 = c(15796L, 15796L, 15799L, 15799L),
id_2 = c(61350L,
351261L, 61488L, 315736L),
days = c(30.5, 36.4854, 30.5, 30.5)
),
row.names = c(NA,-4L),
class = "data.frame",
.Names = c("id_1",
"id_2", "days")
)
I am getting this error with dplyr::distinct()
Error in UseMethod("distinct_") : no applicable method for 'distinct_' applied to an object of class "c('double', 'numeric')"
What's confusing is that it works whenever I pass a dataframe to the function and specify the column like this: distinct(df.ex, days). However, if I create a vector of the variable of interest like so: days_vec <- df.ex$days and pass the vector as an argument to the function like so: distinct(days_vec) I then get the error.
In my actual code I need to use distinct in a dplyr pipe like so:
df.ex %>% summarise(distinct_values = distinct(days))
And of course, this also doesn't work. Does anyone know how to overcome this error?
Many thanks,
Peter
EDIT: for my actual problem I need to make a summary table with the count of distinct values of days, grouped by id_1. It would look like this:
result <- tibble(
id_1 = c(15796, 15799),
count_distinct_values = c(2, 1)
)
I would have thought that the following would help, however it returns another error:
result <- df.ex %>% group_by(id_1) %>% summarise(count_distinct_values = count(distinct(., days)))
Any ideas would be very much appreciated.
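Updated to match the edited question. distinct() only works on data frames, which is why passing a vector fails; n_distinct() counts the distinct values of a vector. I think this solves your problem: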
df.ex %>% group_by(id_1) %>% summarise(distinct_values = n_distinct(days))
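As a self-contained check, the sketch below rebuilds the example data and shows the vector-based alternatives: n_distinct() from dplyr and unique() from base R both accept a plain vector, unlike distinct().

```r
library(dplyr)

df.ex <- data.frame(
  id_1 = c(15796L, 15796L, 15799L, 15799L),
  id_2 = c(61350L, 351261L, 61488L, 315736L),
  days = c(30.5, 36.4854, 30.5, 30.5)
)

days_vec <- df.ex$days

# Both of these work on a bare vector
count_dplyr <- n_distinct(days_vec)
count_base  <- length(unique(days_vec))

# Count of distinct days per id_1, as requested in the edit
result <- df.ex %>%
  group_by(id_1) %>%
  summarise(count_distinct_values = n_distinct(days))

result
```

If the distinct values themselves are needed rather than their count, unique(days_vec) returns them as a vector.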

arrange with period object (ms function) doesn't work - R

I have times recorded in the format mm:ss, where the minutes value can be greater than 59. I parsed the column as chr. Now I need to sort my observations in descending order, so I first created a time2 variable with the ms function and then used arrange on the new variable. However, arranging doesn't work and the values in the second column are totally mixed up.
library(tidyverse)
library(lubridate)
test <- structure(list(time = c("00:58", "07:30", "08:07", "15:45", "19:30",
"24:30", "30:05", "35:46", "42:23", "45:30", "48:08", "52:01",
"63:45", "67:42", "80:12", "86:36", "87:51", "04:27", "09:34",
"12:33", "18:03", "20:28", "21:39", "23:31", "24:02", "26:28",
"31:13", "43:03", "44:00", "45:38")), .Names = "time", row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
test %>% mutate(time2 = ms(time)) %>% arrange(time2) %>% View()
How can I arrange these times?
I think it would be easier to convert the times to a single unit and then use arrange(). Try this:
test %>% mutate(time_in_seconds = minute(ms(time) )*60 + second(ms(time))) %>%
arrange(desc(time_in_seconds)) %>%
View()
seconds_to_period(minute(ms(test$time))*60 + second(ms(test$time))) # to get right format (with hours)
This is a known limitation of arrange: dplyr does not support S4 objects, and a lubridate Period is one. See https://github.com/tidyverse/lubridate/issues/515
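The conversion to seconds can also be written with lubridate's period_to_seconds(), which turns the Period into a plain numeric column that arrange() sorts correctly. The four-row tibble below is a shortened, hypothetical version of the test data.

```r
library(dplyr)
library(lubridate)

# Shortened sample; minutes may exceed 59
test <- tibble(time = c("00:58", "63:45", "07:30", "24:30"))

sorted <- test %>%
  # ms() parses "mm:ss"; period_to_seconds() converts the Period to a number
  mutate(seconds = period_to_seconds(ms(time))) %>%
  arrange(desc(seconds))

sorted
```

Sorting on the numeric column sidesteps the S4 limitation entirely; the original time strings stay available for display.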

How to return the col type of a R tibble in compact string representation?

For example I have a tibble like this.
test <- tibble(a = 10, b = "a")
With this input, I want a function that returns "dc", which represents double and character.
The reason I ask is that I want to read in lots of files, and I don't want to let the read_table function decide the type of each column. I can specify the string manually, but since the actual data I want to import has 50 columns, that is quite hard to do by hand.
Thanks.
While the aforementioned test %>% summarise_all(class) will give you the class names of the columns, it does so in a long form, whereas in this problem you need to convert them to the single-character codes that read_table's col_types understands. To map from class names to single-letter codes you can use a lookup table. Here's an (incomplete) example, built from dput output:
types <- structure(list(col_type = c("character", "integer", "numeric",
"double", "logical"), code = c("c", "i", "n", "d", "l")), .Names = c("col_type",
"code"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
Now, using this table, which I'll call types, we can finally transform the column types into a single string:
library(dplyr)
library(tidyr)
library(stringr)
test %>%
summarise_all(class) %>%
gather(col_name, col_type) %>%
left_join(types) %>%
summarise(col_types = str_c(code, collapse = "")) %>%
unlist(use.names = FALSE)
This gets the class for each column (summarise_all) then gathers them into a tibble matching the column name with the column type (gather). The left_join matches on the col_type column and gives the short 1-char code for each column name. Now we don't do anything with the column names, so it's fine to just concatenate with a summarise and str_c. Finally unlist pulls the string out of a tibble.
test <- tibble(a = 10, b = "a")
test %>% purrr::map_chr(pillar::type_sum) %>% paste(collapse = "_")
# "dbl_chr"
References:
https://tibble.tidyverse.org/articles/types.html
current dplyr version: ‘1.0.9’.
Thank you all for your input. I wanted to update the answer to include more column types and to avoid superseded dplyr version functions.
The col_types argument in the readr package accepts more types than those mentioned in the answers above:
types <-structure(list(code = c("c", "i", "d", "l", "f", "D", "T", "t"),
col_type = c("chr", "int", "dbl", "lgl", "fct", "date", "dttm", "time")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -8L))
I removed the guess and skip options.
The pillar::type_sum() function returns the same column abbreviations that are used in the tibble package, so:
test |>
summarise(across(everything(), pillar::type_sum)) |>
pivot_longer(everything(), names_to = "col_names", values_to = "col_type") |>
left_join(types) |>
pull(code) |>
str_c(collapse = "")
This returns a single string that can be passed as the col_types argument of the readr functions. This is useful when reading and appending multiple CSVs and you want to force the column types, to avoid bind_rows() throwing an error.
With col_types supplied, running map_dfr(all_csv_paths, read_csv) no longer depends on the column types being guessed correctly.
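For comparison, the same mapping can be sketched in base R without any joins. The function name col_type_string() and the lookup vector are made up for this example; the codes follow readr's compact col_types notation, and any class missing from the lookup falls back to "?" (guess), which is a design choice here rather than something the answers above prescribe.

```r
# Hypothetical helper: compact col_types string for a data frame or tibble
col_type_string <- function(df) {
  # First class of each column mapped to readr's one-letter codes
  lookup <- c(
    character = "c", integer = "i", numeric = "d", logical = "l",
    factor = "f", Date = "D", POSIXct = "T"
  )
  classes <- vapply(df, function(col) class(col)[1], character(1))
  codes <- unname(lookup[classes])
  codes[is.na(codes)] <- "?"  # unknown classes: let readr guess
  paste(codes, collapse = "")
}

test <- data.frame(a = 10, b = "a", stringsAsFactors = FALSE)
col_type_string(test)
```

Note that class() reports double columns as "numeric", so that class is mapped to "d" (double) rather than readr's "n" (number) code.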
