Data wrangling - data spread over three rows - dplyr

I have a very untidy data set something like this
A tibble: 200000 x 2
ChatData
<chr>
1 Sep 30, 2018 7:12pm
2 Person A
3 Hello
4 Sep 30, 2018 7:11pm
5 Person B
6 Hello there
7 Sep 30, 2018 7:10pm
8 Person A
...
As you can see it goes date, person name, comment, and repeats.
I am working on the problem and have a very complex method that adds a score column depending on the names etc....
I would like to transform this into something like this
Person A , Person B
Hello NA
NA Hello there
how's you, NA
...
(The date as a row name or third column would be great but not essential to the question)
Optimally I am looking for a dplyr/tidyverse solution
I am working with lots of data so no slow for loops etc..
Raw data to work with:
structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
If anyone is wondering I am analysing facebook messenger data, and this is the form it comes in when you download it.
Thank you.

Your starting data set has only one column (aka feature), but it encodes three kinds of information about each message: a timestamp, the name of the person, and the message text. It is more useful to transform this into a table where each message is in its own row and each column represents a different aspect of that message, i.e. long, or "tidy", format: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
In the approach below, you first define which features repeat in the data set. I call them "headers" here, since I'm working toward a table where these are the column headers. The script then adds that information to the data and converts the single-column data into a tidy format with one row per message and one aspect of each message in each column.
Your requested output is a minor variation of this, addressed in the last line below: %>% spread(person, msg), which separates the Person A and Person B data into separate columns.
library(tidyverse)
header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)
data_length <- length(data$ChatData) / rows_per
data2 <- data %>%
  mutate(msg_number = rep(1:(nrow(data)/rows_per), each=rows_per),
         # This line repeats the header_names sequence for each msg
         header = rep(header_names, data_length)) %>%
  spread(header, ChatData) %>%
  mutate(timestamp = lubridate::mdy_hm(timestamp)) %>%
  spread(person, msg)
head(data2)
# A tibble: 2 x 4
msg_number timestamp `Person A` `Person B`
<int> <dttm> <chr> <chr>
1 1 2018-09-30 19:12:00 Hello NA
2 2 2018-09-30 19:11:00 NA Hello there
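To see the indexing step in isolation, here is a minimal base R sketch of the same idea. It assumes, as the code above does, that the vector length is an exact multiple of three. The names chat, position and tidy are only illustrative and are not part of the code above.

```r
# Sample data from the question, as a plain character vector
chat <- c("Sep 30, 2018 7:12pm", "Person A", "Hello",
          "Sep 30, 2018 7:11pm", "Person B", "Hello there")

header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)

# Assumption: every message occupies exactly rows_per consecutive rows
stopifnot(length(chat) %% rows_per == 0)

# Zero-based position of each element, then message number and header label
position <- seq_along(chat) - 1
msg_number <- position %/% rows_per + 1
header <- header_names[position %% rows_per + 1]

# split() groups the values by header; reorder the columns afterwards
tidy <- as.data.frame(split(chat, header), stringsAsFactors = FALSE)[header_names]
tidy$msg_number <- unique(msg_number)
print(tidy)
```

The integer division gives the message number and the remainder gives the position inside the message, which is what the two rep() calls compute in the pipeline above.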

Since you basically just have a character vector that you would like to convert into a 3-column data.frame, one other option is to simply use matrix() and specify ncol=3 and byrow=TRUE:
# your sample data
d <- structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
matrix( d$ChatData, ncol=3, byrow=TRUE,
dimnames=list( NULL, c("date_time", "person", "message")) )
Result is a character matrix:
date_time person message
[1,] "Sep 30, 2018 7:12pm" "Person A" "Hello"
[2,] "Sep 30, 2018 7:11pm" "Person B" "Hello there"
But you can wrap that in as.data.frame() to convert to a data.frame and continue working from there with dplyr if that's what you want.
Put together as a whole solution, it becomes a nice, short, readable bit of code IMO:
library(dplyr)
library(lubridate)
result_df <-
  matrix( d$ChatData, ncol=3, byrow=TRUE,
          dimnames=list(NULL, c("date_time", "person", "message")) ) %>%
  as.data.frame() %>%
  mutate(date_time=lubridate::mdy_hm(date_time))
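The question also asks for one column per person. A possible base R sketch of that last step, starting from the matrix approach, is shown below. It assumes there is more than one message (with a single row, sapply would return a vector rather than a matrix), and the names msgs, per_person and wide are made up for this illustration.

```r
d <- c("Sep 30, 2018 7:12pm", "Person A", "Hello",
       "Sep 30, 2018 7:11pm", "Person B", "Hello there")

# One row per message, as in the matrix() call above
msgs <- as.data.frame(
  matrix(d, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("date_time", "person", "message"))),
  stringsAsFactors = FALSE
)

# For each person, keep their own messages and put NA everywhere else
persons <- unique(msgs$person)
per_person <- sapply(persons, function(p) {
  ifelse(msgs$person == p, msgs$message, NA_character_)
})

# check.names = FALSE keeps the space in "Person A" / "Person B"
wide <- data.frame(date_time = msgs$date_time, per_person,
                   check.names = FALSE, stringsAsFactors = FALSE)
print(wide)
```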

Here is one approach:
data %>% group_by(msg_number = rep(1:(nrow(data)/3), each=3)) %>%
summarize(msg_data = list(ChatData)) %>% as.data.frame
msg_number msg_data
1 1 Sep 30, 2018 7:12pm, Person A, Hello
2 2 Sep 30, 2018 7:11pm, Person B, Hello there
This numbers each message and puts its three values into a list column.
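If you then want ordinary columns instead of a list, one way to unpack the groups is sketched below in base R. It assumes every group has exactly three elements; the column names and the object names msg_data, unpacked and result are my own choice for the example.

```r
chat <- c("Sep 30, 2018 7:12pm", "Person A", "Hello",
          "Sep 30, 2018 7:11pm", "Person B", "Hello there")

# Same grouping as above: one group of three values per message
msg_number <- rep(seq_len(length(chat) / 3), each = 3)
msg_data <- split(chat, msg_number)   # list with one character vector per message

# Stack the groups row by row and label the three positions
unpacked <- do.call(rbind, msg_data)
colnames(unpacked) <- c("timestamp", "person", "msg")

result <- data.frame(msg_number = as.integer(rownames(unpacked)),
                     unpacked, row.names = NULL, stringsAsFactors = FALSE)
print(result)
```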

Related

Specify number of columns to read when first row is missing values

I have data from a logger that inserts timestamps as rows within the comma separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).
The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr is defaulting to reading in only 3 columns, despite me manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 all together. I'm hoping that there's some option for "correcting" back to the old behaviour, or some work-around solution I haven't thought of (editing the logger files is not an option).
Example code:
library(tidyverse)
# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# throws warning and reads 3 columns
read_csv(
txt1,
col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
col_types = "ddcddc"
)
# works correctly
read_csv(
txt2,
col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
col_types = "ddcddc"
)
# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
~lon, ~lat, ~n, ~red, ~nir, ~NDVI,
NA, NA, "Logger Start 12:34", NA, NA, NA,
-112, 53, "N=1", 9, 15, ".25",
-112, 53, "N=2",12, 17, ".17"
)

Use base read.csv, then convert to a tibble if need be:
read.csv(text=txt1, header = FALSE,
col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
lon lat n red nir NDVI
1 NA NA Logger Start 12:34 NA NA NA
2 -112 53 N=1 9 15 0.25
3 -112 53 N=2 12 17 0.17
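From there, the timestamp rows can be folded into a column of their own. The following is only a sketch in base R: it assumes the input starts with a "Logger Start" line and that every timestamp line has that prefix. The column name logger_start and the object names are invented for the example.

```r
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

raw <- read.csv(text = txt1, header = FALSE,
                col.names = c("lon", "lat", "n", "red", "nir", "NDVI"),
                stringsAsFactors = FALSE)

# Rows that hold a timestamp instead of a reading
is_stamp <- grepl("^Logger Start", raw$n)
stopifnot(is_stamp[1])  # assumption: the file begins with a timestamp line

# cumsum() counts how many timestamp rows have been seen so far,
# so it indexes the most recent timestamp for every row
stamps <- sub("^Logger Start ", "", raw$n[is_stamp])
raw$logger_start <- stamps[cumsum(is_stamp)]

# Keep only the actual readings
readings <- raw[!is_stamp, ]
rownames(readings) <- NULL
print(readings)
```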

I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.
Assuming you have a file test.csv like this:
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
Try something like this:
library(readr)  # read_lines(), write_lines(), read_csv()
library(dplyr)
library(tidyr)
read_lines("test.csv") %>%
# assumes all timestamp lines are the same format
gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
# assumes that NDVI (last column) is always present and ends with a digit
# you'll need to alter the regex if not the case
gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>%
write_lines("test_out.csv")
test_out.csv now looks like this:
12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17
So we now have 7 columns, the first is the timestamp.
This code reads the new file, fills in the missing timestamp values and removes rows where n is NA. You may not want to do that; I've assumed that n is only missing because of the original row with the timestamp.
mydata <- read_csv("test_out.csv",
col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
fill(ts) %>%
filter(!is.na(n))
The final mydata:
# A tibble: 2 x 7
ts lon lat n red nir NDVI
<time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34 -112 53 N=1 9 15 0.25
2 12:34 -112 53 N=2 12 17 0.17
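If the regexes feel too tied to one line format, a more generic variation is to pad every short line with trailing commas until it has the expected number of fields. The sketch below does this in memory with base R. The object names are illustrative, and the padded text is read back with base read.csv only to keep the example free of dependencies; the intent is that the padded lines could equally be written out with write_lines and handed to read_csv as above.

```r
lines <- c(",,Logger Start 12:34",
           "-112,53,N=1,9,15,.25",
           "-112,53,N=2,12,17,.17")
n_cols <- 6

# Number of fields per line = number of commas + 1
# (assumes no quoted fields that contain commas)
n_fields <- lengths(regmatches(lines, gregexpr(",", lines, fixed = TRUE))) + 1
stopifnot(all(n_fields <= n_cols))

# Append the missing separators so every line has n_cols fields
padded <- paste0(lines, strrep(",", n_cols - n_fields))

logger <- read.csv(text = padded, header = FALSE,
                   col.names = c("lon", "lat", "n", "red", "nir", "NDVI"),
                   stringsAsFactors = FALSE)
print(logger)
```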

Removing all characters before and after text in R, then creating columns from the new text

So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place   Number   Name
PHI     80       J.Matthews
NE      5        J.Mills
KC      10       T.Hill
Edit for further explanation:
Most strings include other people in the same format, so "grabbed by" needs to be incorporated in some way to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place   Number   Name
KC      10       T.Hill

This solution simply extracts the components based on the logic the OP mentioned, i.e. it captures the needed characters after "grabbed by" as three groups: 1) one or more upper-case letters ([A-Z]+) followed by a dash (-), 2) one or more digits (\\d+) followed by a dash, and 3) the non-whitespace characters (\\S+) that follow the second dash.
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-output
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
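For comparison, the same capture-group idea can be sketched with base R's regexec and regmatches alone. This is an illustration rather than a replacement for the code above: the pattern assumes names always look like letters, a dot, then letters (no hyphens or apostrophes), and plays, fields and players are names I made up.

```r
plays <- c(
  "blah, grabbed by PHI-80-J.Matthews.",
  "blah, grabbed by NE-5-J.Mills.",
  "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
  "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)

# Three capture groups, anchored on the literal text "grabbed by "
pattern <- "grabbed by ([A-Z]+)-([0-9]+)-([A-Za-z]+[.][A-Za-z]+)"
hits <- regmatches(plays, regexec(pattern, plays))

# Each hit is c(full match, group 1, group 2, group 3); use NA when nothing matched
fields <- t(vapply(hits, function(hit) {
  if (length(hit) == 4) hit[-1] else rep(NA_character_, 3)
}, character(3)))

players <- data.frame(Place = fields[, 1],
                      Number = as.integer(fields[, 2]),
                      Name = fields[, 3],
                      stringsAsFactors = FALSE)
print(players)
```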

This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
library(dplyr)  # mutate() and %>%
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
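One caveat: on strings that mention several people, an unanchored pattern picks up the first token rather than the one after "grabbed by". A look-behind can pin the match to the right place. The sketch below uses base R with perl = TRUE so it runs without extra packages; stringr's regex engine should accept the same look-behind, but I have only written it out for base R here. The variable names are illustrative.

```r
play <- "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
token <- "\\w+-\\w+-\\w\\.\\w+"

# Unanchored: returns the first token in the string
first_any <- regmatches(play, regexpr(token, play, perl = TRUE))

# Anchored: the look-behind requires "grabbed by " immediately before the token
anchored <- regmatches(
  play,
  regexpr(paste0("(?<=grabbed by )", token), play, perl = TRUE)
)

# Split the anchored token into Place, Number, Name
parts <- strsplit(anchored, "-", fixed = TRUE)[[1]]
print(first_any)
print(anchored)
print(parts)
```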

Here is an alternative way. It first uses sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1" to drop everything after the name, then separate to split the string on " by ", and finally separate again to get the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
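A small worked example of that line, using three rows taken from the question's data, might look like this. The object names results and invalid are only for illustration, and the final as.numeric step is my addition to show that the columns can be converted once the "Invalid..." strings are gone.

```r
results <- data.frame(
  Sample  = c("Standard4", "Standard5", "Standard6"),
  EGF     = c("382.004087907716", "Invalid(N/A)", "7574.57784103579"),
  `FGF-2` = c("363.563899234418", "Invalid(2002.97340881547)", "9159.40221389147"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Logical matrix with the same shape as the data frame:
# TRUE wherever a cell starts with "Invalid"
invalid <- sapply(results, function(x) grepl("^Invalid", x))

# Assigning to is.na() sets exactly those cells to NA
is.na(results) <- invalid

# With the text markers removed, the measurements convert cleanly
results$EGF <- as.numeric(results$EGF)
results[["FGF-2"]] <- as.numeric(results[["FGF-2"]])
print(results)
```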

The str_replace function will attempt to edit the content of a character string, inserting a partial replacement, rather than replacing it entirely. Also, the across function is targeting all of the columns including the numeric id. The following code works, building on the tidyverse example you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text.
Example data
library(tidyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
mutate(
across(where(is.character),
.fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
)
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA
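Since the goal is a script that processes many CSVs, the same logic could be wrapped in a small helper. This is a base R sketch, not the code above: clean_invalid is a hypothetical name, and unlike the str_detect version it only matches values that start with "invalid" (case-insensitively) rather than containing it anywhere.

```r
# Hypothetical helper: blank out "Invalid..." values in every character column
clean_invalid <- function(df, pattern = "^invalid") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    col[grepl(pattern, col, ignore.case = TRUE)] <- NA_character_
    col
  })
  df
}

example <- data.frame(
  id = 1:3,
  x = c("a", "invalid", "c"),
  y = c("d", "e", "Invalid/NA"),
  stringsAsFactors = FALSE
)

cleaned <- clean_invalid(example)
print(cleaned)
```

The helper can then be applied to each file after reading it, for example inside an lapply over a vector of file paths.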

R: Select a cell when merging two data frames

I'm having some trouble when I try to merge two data frames. Here is an example:
Number <- c("1", "2", "3")
Letter <- factor(c("a", "b", "c"))
map <- data.frame(Number, Letter, row.names = c("Belgium", "Italy", "Senegal"))
This is my first data frame called "map", it looks like this:
Number Letter
Belgium 1 a
Italy 2 b
Senegal 3 c
And if I try to select by row and column I don't have any problem:
map["Belgium", "Number"]
[1] "1"
Here I have my second data frame called "calendar":
Month <- c("January", "February", "March")
calendar <- data.frame(Month, row.names = c("Belgium", "Italy", "Senegal"))
It looks like this:
Month
Belgium January
Italy February
Senegal March
The problem comes when I try to merge both data frames:
map.amp = merge(map, calendar, by = 0)
Row.names Number Letter Month
1 Belgium 1 a January
2 Italy 2 b February
3 Senegal 3 c March
Now, when I try to select a cell using rows and columns, the outcome is always NA
map.amp["Italy", "Month"]
[1] NA
map.amp["Belgium", "Number"]
[1] NA
How can I merge both data frames so I can keep using that kind of select function?

You have to re-set the row names:
row.names(map.amp) <- map.amp$Row.names
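Put together with the question's data, the whole thing can be sketched as follows. Dropping the now redundant Row.names column afterwards is optional and is my addition.

```r
map <- data.frame(Number = c("1", "2", "3"),
                  Letter = factor(c("a", "b", "c")),
                  row.names = c("Belgium", "Italy", "Senegal"),
                  stringsAsFactors = FALSE)
calendar <- data.frame(Month = c("January", "February", "March"),
                       row.names = c("Belgium", "Italy", "Senegal"),
                       stringsAsFactors = FALSE)

# by = 0 merges on row names and stores them in a column called Row.names
map.amp <- merge(map, calendar, by = 0)

# Move that column back into the row names, then drop it
row.names(map.amp) <- map.amp$Row.names
map.amp$Row.names <- NULL

print(map.amp["Italy", "Month"])
print(map.amp["Belgium", "Number"])
```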

If you want to keep using those row names you have to set the Row.names column back to row names. tibble::column_to_rownames is a nice option for this:
map.amp <- merge(map, calendar, by = 0) %>% tibble::column_to_rownames(var = "Row.names")

map.amp[map.amp$Row.names == 'Italy', 'Month']
This works because, after the merge, the old row names are stored in an ordinary column called Row.names.

You could use the answer in the comment by @thelatemail. Or use
subset(map.amp, Row.names == 'Italy')[['Month']] # first get matching rows, but then narrow to the named column
or
subset(map.amp, Row.names == 'Italy', 'Month') # third argument is for column selection

apply rename_if predicate to column names

I am working with a set of excel spreadsheets which has column names which are dates.
After reading in the data with readxl::read_xlsx(), these column names become excel index dates (i.e. an integer representing days elapsed from 1899-12-30)
Is it possible to use dplyr::rename_if() or similar to rename all column names that are currently integers? I have written a function rename_func that I would like to apply to all such columns.
df %>% rename_if(is.numeric, rename_func) is not suitable as is.numeric is applied to the data in the column not the column name itself. I have also tried:
is.name.numeric <- function(x) is.numeric(names(x))
df %>% rename_if(is.name.numeric, rename_func)
which does not work and does not change any names (i.e. is.name.numeric returns FALSE for all cols)
edit: here is a dummy version of my data
df_badnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), `38718` = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
`38749` = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
`38777` = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
`38808` = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
and I would like:
df_goodnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), Jan2006 = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
Feb2006 = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
Mar2006 = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
Apr2006 = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
I understand that it is best practice to create a date column and change the shape of this df, but I need to join a few spreadsheets first and having integer column names causes a lot of problems. I currently have a work around but the crux of my question (apply a rename_if predicate to a name, rather than a column) is still interesting.

Although the names look numeric, they are not:
class(names(df_badnames))
#[1] "character"
so they would not be caught by is.numeric or similar other functions.
One way to do this is to find out which names can be coerced to numeric and then convert them into the date format of our choice:
# names that are not numbers become NA (as.numeric warns about the coercion)
cols <- as.numeric(names(df_badnames))
names(df_badnames)[!is.na(cols)] <- format(as.Date(cols[!is.na(cols)],
                                                   origin = "1899-12-30"), "%b%Y")
df_badnames
# Level Title Jan2006 Feb2006 Mar2006 Apr2006
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 AUSTRALIAN TOTAL 213778. 216274. 218564. 220065.
#2 2 MANAGERS 20998. 21316. 21672. 22012.
#3 3 Chief Executives, Managing Directors & Legisla… 502. 498. 494. 489.
#4 3 Farmers and Farm Managers 121. 119. 118. 116.
#5 3 Hospitality, Retail and Service Managers 4403. 4469. 4541. 4609.
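To get closer to the "predicate on the name" idea from the question, the test can be made explicit with a regex on the names, which also avoids the coercion warning. The sketch below is base R on a cut-down, made-up data frame; excel_serial_to_label is a hypothetical stand-in for the question's rename_func. dplyr's rename_with, combined with a matches("^[0-9]+$") column selection, should allow the same thing inside a pipe, but I have not written that version out here.

```r
# Hypothetical stand-in for rename_func: Excel serial day -> "Jan2006" style label
excel_serial_to_label <- function(nm) {
  format(as.Date(as.numeric(nm), origin = "1899-12-30"), "%b%Y")
}

# Cut-down dummy data with two date-like column names
df <- data.frame(Level = c(1, 2),
                 Title = c("AUSTRALIAN TOTAL", "MANAGERS"),
                 `38718` = c(213777.89, 20997.52),
                 `38749` = c(216274.12, 21316.05),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Predicate applied to the names, not the columns: digits only
is_serial <- grepl("^[0-9]+$", names(df))
names(df)[is_serial] <- excel_serial_to_label(names(df)[is_serial])
print(names(df))
```

Note that %b depends on the session locale, so the month abbreviations are only "Jan", "Feb" and so on in an English (or C) locale.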
