Reshaping untidy and unbalanced dataset from wide to long [duplicate]

This question already has an answer here:
How do I convert a wide dataframe to a long dataframe for a multilevel structure with 'quadruple nesting'?
(1 answer)
Closed 6 years ago.
I have a dataset (data) that looks like this:
ID,ABC.BC,ABC.PL,DEF.BC,DEF.M,GHI.PL
SB0005,C01,D20,C01a,C01b,D20
BC0013,C05,D5,C05a,NA,D5
I want to reshape it from wide-to-long format to get something like this:
ID,FC,Type,Var
SB0005,ABC,BC,C01
SB0005,ABC,PL,D20
SB0005,DEF,BC,C01a
SB0005,DEF,M,C01b
SB0005,GHI,PL,D20
BC0013,ABC,BC,C05
BC0013,ABC,PL,D5
BC0013,DEF,BC,C05a
# BC0013,DEF,M,NA (This row need not be in the dataset as I will remove it later)
BC0013,GHI,PL,D5
The usual reshape package does not work as the dataset is unbalanced. I also tried Reshape from splitstackshape but it does not give me what I want.
library(splitstackshape)
vary <- grep("\\.BC$|\\.PL$|\\.M$", names(data))
stubs <- unique(sub("\\..*$", "", names(data[vary])))
Reshape(data, id.vars=c("ID"), var.stubs=stubs, sep=".")
ID,time,ABC,DEF,GHI
SB0005,1,C01,C01a,D20
BC0013,1,C05,C05a,D5
SB0005,2,D20,C01b,NA
BC0013,2,D5,NA,NA
SB0005,3,NA,NA,NA
BC0013,3,NA,NA,NA
Appreciate any suggestions, thanks!
Providing the output of dput(data) as requested
structure(list(ID = structure(c(2L, 1L), .Label = c("BC0013",
"SB0005"), class = "factor"), ABC.BC = structure(1:2, .Label = c("C01",
"C05"), class = "factor"), ABC.PL = structure(1:2, .Label = c("D20",
"D5"), class = "factor"), DEF.BC = structure(1:2, .Label = c("C01a",
"C05a"), class = "factor"), DEF.M = structure(1:2, .Label = c("C01b",
"NA"), class = "factor"), GHI.PL = structure(1:2, .Label = c("D20",
"D5"), class = "factor")), .Names = c("ID", "ABC.BC", "ABC.PL",
"DEF.BC", "DEF.M", "GHI.PL"), row.names = c(NA, -2L), class = "data.frame")

You need to reshape your data into long format first, and then you can split the variable column into two columns. With splitstackshape you could do:
library(splitstackshape) # this will also load 'data.table', from which the 'melt' function is used
# mydf is the data frame from the question
cSplit(melt(mydf, id.vars = 'ID'),
       'variable',
       sep = '.',
       direction = 'wide')[!is.na(value)]
which results in:
ID value variable_1 variable_2
1: SB0005 C01 ABC BC
2: BC0013 C05 ABC BC
3: SB0005 D20 ABC PL
4: BC0013 D5 ABC PL
5: SB0005 C01a DEF BC
6: BC0013 C05a DEF BC
7: SB0005 C01b DEF M
8: SB0005 D20 GHI PL
9: BC0013 D5 GHI PL
An alternative with tidyr and dplyr:
library(dplyr) # provides filter(); without it, stats::filter() would be called
library(tidyr)
mydf %>%
  gather(var, val, -ID) %>%
  separate(var, c('FC', 'Type')) %>%
  filter(!is.na(val))
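For comparison, the same melt-then-split idea can be sketched in base R alone. This is only an illustration under stated assumptions: the data frame is retyped here with character columns and a real NA (the dput output above stores "NA" as a factor level), and the helper names value_cols, long, parts and result are made up for this example.

```r
# Toy copy of the question's data: character columns, real NA (an assumption)
data <- data.frame(
  ID     = c("SB0005", "BC0013"),
  ABC.BC = c("C01", "C05"),
  ABC.PL = c("D20", "D5"),
  DEF.BC = c("C01a", "C05a"),
  DEF.M  = c("C01b", NA),
  GHI.PL = c("D20", "D5"),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(data), "ID")

# stack() puts every value column into 'values' and records its source column in 'ind'
long <- stack(data[value_cols])
long$ID <- rep(data$ID, times = length(value_cols))

# Split names such as "ABC.BC" into FC = "ABC" and Type = "BC"
parts <- strsplit(as.character(long$ind), ".", fixed = TRUE)
long$FC   <- vapply(parts, `[`, character(1), 1)
long$Type <- vapply(parts, `[`, character(1), 2)

# Drop missing values, keep the requested columns, and group rows by ID
result <- long[!is.na(long$values), c("ID", "FC", "Type", "values")]
names(result)[4] <- "Var"
result <- result[order(match(result$ID, data$ID)), ]
rownames(result) <- NULL
result
```

Because the column names are split after stacking, it does not matter that each FC prefix has a different set of Type suffixes.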

Related

create data frame from nested entries

I have a data frame test like this:
dput(test)
structure(list(X = 1L, entityId = structure(1L, .Label = "HOST-123", class = "factor"),
displayName = structure(1L, .Label = "server1", class = "factor"),
discoveredName = structure(1L, .Label = "server1", class = "factor"),
firstSeenTimestamp = 1593860000000, lastSeenTimestamp = 1603210000000,
tags = structure(1L, .Label = "c(\"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\"), c(\"app1\", \"client\", \"org\", \"app1\", \"DATA_CENTER\", \"PURPOSE\", \"REGION\", \"Test\"), c(NA, \"NONE\", \"Host:Environment:test123\", \"111\", \"222\", \"GENERAL\", \"444\", \"555\")", class = "factor")), .Names = c("X",
"entityId", "displayName", "discoveredName", "firstSeenTimestamp",
"lastSeenTimestamp", "tags"), class = "data.frame", row.names = c(NA,
-1L))
There is a column called tags which should become a data frame. I need to get rid of the first row in tags (which keeps saying CONTEXTLESS) and expand the second row in tags so that its values become columns. Lastly, I need to insert the values of the third row in tags under each of the expanded columns.
For example, it needs to look like this:
structure(list(entityId = structure(1L, .Label = "HOST-123", class = "factor"),
displayName = structure(1L, .Label = "server1", class = "factor"),
discoveredName = structure(1L, .Label = "server1", class = "factor"),
firstSeenTimestamp = 1593860000000, lastSeenTimestamp = 1603210000000,
app1 = NA, client = structure(1L, .Label = "None", class = "factor"),
org = structure(1L, .Label = "Host:Environment:test123", class = "factor"),
app1.1 = 111L, data_center = 222L, purppose = structure(1L, .Label = "general", class = "factor"),
region = 444L, test = 555L), .Names = c("entityId", "displayName",
"discoveredName", "firstSeenTimestamp", "lastSeenTimestamp",
"app1", "client", "org", "app1.1", "data_center", "purppose",
"region", "test"), class = "data.frame", row.names = c(NA, -1L
))
I need to remove the first vector, which just repeats "CONTEXTLESS", and turn each value of the second vector into a column name. The values of the last vector should become the values of the newly added columns.
If you are willing to drop the first "row" of garbage and then do a little cleanup of the parse side effects, then this might be a good place to start:
read.table(text=gsub("\\),", ")\n", test$tags[1]), sep=",", skip=1, #drops line
header=TRUE)
c.app1 client org app1 DATA_CENTER PURPOSE REGION Test.
1 c(NA NONE Host:Environment:test123 111 222 GENERAL 444 555)
The read.table function uses the scan function, which doesn't know that "c(" and ")" are meaningful. The other alternative might be to try eval(parse(text= .)) (which would know that they enclose vectors) on the second and third lines, but I couldn't see a clean way to do that. I initially tried to separate the lines using strsplit, but that caused me to lose the parens.
Here's a stab at some cleanup via the addition of some more gsub operations:
read.table(text=gsub("c\\(|\\)","", # gets rid of enclosing "c(" and ")"
gsub("\\),", "\n", # inserts line breaks
test$tags[1])),
sep=",", #lets commas be parsed
skip=1, #drops line
header=TRUE) # converts to colnames
app1 client org app1.1 DATA_CENTER PURPOSE REGION Test
1 NA NONE Host:Environment:test123 111 222 GENERAL 444 555
The reason for the added ".1" in the second instance of app1 is that column names in R data frames need to be unique, unless you override that with check.names=FALSE.
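The effect of check.names can be seen on a tiny input. The text below is made up for illustration and only mimics the duplicated app1 header from the question.

```r
# Made-up input with a duplicated column name in the header
txt <- "app1,client,app1\n1,2,3"

# Default: names are passed through make.names(unique = TRUE)
checked <- read.table(text = txt, sep = ",", header = TRUE)

# check.names = FALSE keeps the header exactly as written
unchecked <- read.table(text = txt, sep = ",", header = TRUE, check.names = FALSE)

names(checked)
names(unchecked)
```

Keeping duplicated names is allowed, but selecting by name then only reaches the first matching column, so the default renaming is usually the safer choice.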
Here is a tidyverse approach
library(dplyr)
library(tidyr)
str2dataframe <- function(txt, keep = "all") {
  # If you can confirm that all vectors are of the same length, then we can make them into columns of a data.frame
  out <- eval(parse(text = paste0("data.frame(", as.character(txt), ")")))
  # rename columns as X1, X2, ...
  nms <- make.names(seq_along(out), unique = TRUE)
  if (keep == "all")
    keep <- nms
  `names<-`(out, nms)[, keep]
}
df %>%
  mutate(
    tags = lapply(tags, str2dataframe, -1L),
    tags = lapply(tags, function(d) within(d, X2 <- make.unique(X2)))
  ) %>%
  unnest(tags) %>%
  pivot_wider(names_from = "X2", values_from = "X3")
df looks like this
> df
X entityId displayName discoveredName firstSeenTimestamp lastSeenTimestamp
1 1 HOST-123 server1 server1 1.59386e+12 1.60321e+12
tags
1 c("CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS"), c("app1", "client", "org", "app1", "DATA_CENTER", "PURPOSE", "REGION", "Test"), c(NA, "NONE", "Host:Environment:test123", "111", "222", "GENERAL", "444", "555")
Output looks like this
# A tibble: 1 x 14
X entityId displayName discoveredName firstSeenTimestamp lastSeenTimestamp app1 client org app1.1 DATA_CENTER PURPOSE REGION Test
<int> <fct> <fct> <fct> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 HOST-123 server1 server1 1593860000000 1603210000000 NA NONE Host:Environment:test123 111 222 GENERAL 444 555
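The core of that approach, evaluating the stored text and pairing the second vector with the third, can be sketched in base R for a single tags string. The string below is a shortened stand-in for the real column value, and vectors, keys, values and wide are names chosen for this sketch. Note that eval(parse()) runs whatever code the text contains, so this assumes the tags come from a trusted source.

```r
# Shortened stand-in for one element of test$tags (an assumption for illustration)
tags <- 'c("CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS"), c("app1", "client", "app1"), c(NA, "NONE", "111")'

# Wrap the text in list(...) so it evaluates to a list of three vectors
vectors <- eval(parse(text = paste0("list(", tags, ")")))

# Second vector: column names (made unique); third vector: the values
keys   <- make.unique(vectors[[2]])
values <- as.character(vectors[[3]])

# One-row data frame with one column per tag key
wide <- as.data.frame(as.list(setNames(values, keys)), stringsAsFactors = FALSE)
wide
```

The first vector is simply never used, which drops the CONTEXTLESS entries without any extra step.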

Converting empty values to NULL in R - Handling date column

I have a simple dataframe as: dput(emp)
structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
-1L))
I want to convert all empty values to NULL (NA).
The simplest way to achieve this is:
emp[emp==""] <- NA
This would normally work, but the date column triggers the following error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
How can I convert all other empty rows to NULL without having to deal with the date column? Please note that the actual data frame has 30000+ rows.
Try converting the date variable to character, make the change, and then convert it back to a date:
#Format date
emp$update <- as.character(emp$update)
#Replace
emp[emp=='']<-NA
#Reformat date
emp$update <- as.Date(emp$update)
Output:
name job Mgr update
1 Alex <NA> <NA> 2020-08-24
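Another option is to leave the date column alone entirely and only touch the text-like columns. The error most likely comes from comparing the Date column with "", which makes R try to parse the empty string as a date. The following is a minimal base R sketch; emp is rebuilt by hand here, and is_text is a helper name introduced for the example. The replaced columns come back as character rather than factor.

```r
# Hand-built copy of the question's data frame (factor text columns, one Date column)
emp <- data.frame(
  name = "Alex", job = "", Mgr = "",
  update = as.Date("2020-08-24"),
  stringsAsFactors = TRUE
)

# Flag only factor/character columns, so the Date column is never compared with ""
is_text <- vapply(emp, function(col) is.factor(col) || is.character(col), logical(1))

# Replace empty strings with NA in those columns only
emp[is_text] <- lapply(emp[is_text], function(col) {
  col <- as.character(col)
  col[col == ""] <- NA_character_
  col
})
emp
```

Since the work is done column by column, this should scale to the 30000+ rows mentioned in the question without any round trip through character for the dates.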
You can try type.convert like below
type.convert(emp,as.is = TRUE)
such that
name job Mgr update
1 Alex NA NA 2020-08-24
You may try this using dplyr:
library(dplyr)
# passing a whole data frame to na_if() reflects the dplyr version in use when this was written;
# newer releases may need it applied per column instead
emp %>%
  mutate_at(vars(update), as.character) %>%
  na_if(., "")
As mentioned by @Duck, you have to convert the date variable to character first.
Afterwards you can transform it back to a date if you need it:
library(dplyr)
emp %>%
  mutate_at(vars(update), as.character) %>%
  na_if(., "") %>%
  mutate_at(vars(update), as.Date)
See if this works:
> library(dplyr)
> library(purrr)
> emp <- structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
+ job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
+ update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
+ -1L))
> emp
name job Mgr update
1 Alex 2020-08-24
> emp %>% mutate(update = as.character(update)) %>% map_df(~gsub('^$',NA, .x)) %>% mutate(update = as.Date(update)) %>% mutate(across(1:3, as.factor))
# A tibble: 1 x 4
name job Mgr update
<fct> <fct> <fct> <date>
1 Alex NA NA 2020-08-24
>

How can I pivot_longer() while maintaining column pairings?

There's got to be a simpler way to do this!
I start with wide format data:
| family_id | first_name_child1 | surname_child1 | first_name_child2 | surname_child2 | ... |
|...........|...................|................|...................|................|.....|
And I want to turn it into long format:
| family_id | sibling_number | first_name | surname |
|...........|................|............|.........|
Question: How can I pivot_longer() while maintaining the first name/surname pairings?
This is how I did it:
df <- structure(list(family_id = 1:2, first_name_child1 = c("Verdie",
"Quentin"), first_name_child2 = c("Iris", "Bryon"), first_name_child3 = c(NA,
"Karie"), first_name_child4 = c(NA, "Christopher"), surname_child1 = c("Moy",
"Mccowen"), surname_child2 = c("Moy", "Mccowen"), surname_child3 = c(NA,
"Mccowen"), surname_child4 = c(NA, "Mccowen")), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
library(tidyr)
fun <- function(x) {
  names(x) <- gsub("_child\\d+", "", names(x))
  x
}
df %>%
  nest(child1 = ends_with("_child1"),
       child2 = ends_with("_child2"),
       child3 = ends_with("_child3"),
       child4 = ends_with("_child4")) %>%
  mutate_at(vars(starts_with("child")), lapply, fun) %>%
  pivot_longer(-family_id, names_to = "sibling_number",
               names_prefix = "child",
               values_to = "name") %>%
  unnest(name)
BUT I can do the reverse with 1 line:
df2 <- structure(list(family_id = c(1L, 1L, 2L, 2L, 2L, 2L), sibling_number = c(1L,
2L, 1L, 2L, 3L, 4L), first_name = c("Verdie", "Iris", "Quentin",
"Bryon", "Karie", "Christopher"), surname = c("Moy", "Moy", "Mccowen",
"Mccowen", "Mccowen", "Mccowen")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
pivot_wider(df2,
names_from = sibling_number,
names_prefix = "child",
values_from = c("first_name", "surname"))
Is this pivot_wider() easily reversible? Or alternatively, I thought there might be a way to combine do.call(), nest() and ends_with(), but couldn't work it out?
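An additional solution:
# "_child" sits outside the capture groups, so the value columns are named first_name and surname
df %>%
  pivot_longer(cols = -family_id,
               names_to = c(".value", "set"),
               names_pattern = "(.*)_child(\\d+)")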
The data.table solution would be:
library(data.table)
g <- melt(setDT(df),
          id.vars = "family_id",
          measure.vars = patterns(first_name = "first_name_child",
                                  surname = "surname_child"),
          variable.name = "sibling_number",
          na.rm = FALSE)
g[order(family_id)]
family_id sibling_number first_name surname
1: 1 1 Verdie Moy
2: 1 2 Iris Moy
3: 1 3 <NA> <NA>
4: 1 4 <NA> <NA>
5: 2 1 Quentin Mccowen
6: 2 2 Bryon Mccowen
7: 2 3 Karie Mccowen
8: 2 4 Christopher Mccowen
And, as a side note, you can get it back to wide format with
dcast(g,
family_id ~ sibling_number,
value.var = c("first_name","surname"))
This answer is copied from Henrik's comment:
pivot_longer(df, cols = -1, names_to = c(".value", "sibling_nr"), names_sep = "child")
Answer is posted and accepted to close out question.
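If no extra packages are wanted, base reshape() can also keep the first name/surname pairs together, because each element of its varying list is treated as one output column. This is a sketch rather than a drop-in replacement: df is rebuilt as a plain data.frame, and long is a name chosen for the example.

```r
# Plain data.frame copy of the question's wide data
df <- data.frame(
  family_id = 1:2,
  first_name_child1 = c("Verdie", "Quentin"),
  first_name_child2 = c("Iris", "Bryon"),
  first_name_child3 = c(NA, "Karie"),
  first_name_child4 = c(NA, "Christopher"),
  surname_child1 = c("Moy", "Mccowen"),
  surname_child2 = c("Moy", "Mccowen"),
  surname_child3 = c(NA, "Mccowen"),
  surname_child4 = c(NA, "Mccowen"),
  stringsAsFactors = FALSE
)

# Each list element names the wide columns that collapse into one long column
long <- reshape(
  df,
  direction = "long",
  idvar = "family_id",
  varying = list(
    grep("^first_name_child", names(df), value = TRUE),
    grep("^surname_child", names(df), value = TRUE)
  ),
  v.names = c("first_name", "surname"),
  timevar = "sibling_number",
  times = 1:4
)

# Drop the siblings that do not exist and sort for readability
long <- long[!is.na(long$first_name), ]
long <- long[order(long$family_id, long$sibling_number), ]
rownames(long) <- NULL
long
```

The varying argument is given as a list on purpose: that way the pairing depends on the order within each grep() result, not on reshape() guessing the structure from the column names.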

Easy way to transform data frame in R - one variable values as separate columns [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 7 years ago.
What I have (obviously I'm presenting a very small fraction of my current data):
my_df <- structure(list(X = structure(c(48.75, 49.25), .Dim = 2L), Y = structure(c(17.25, 17.25), .Dim = 2L), Time = structure(c(14625, 14626), .Dim = 2L, class = "Date"), spei = c(-0.460236400365829, -0.625695407390594)), .Names = c("X", "Y", "Time", "spei"), row.names = 1:2, class = "data.frame")
What I need:
new_df <- structure(list(X = structure(c(48.75, 49.25), .Dim = 2L), Y = structure(c(17.25, 17.25), .Dim = 2L), "2010-01-16" = c(-0.460236400365829, NaN), "2010-01-17" = c(NaN, -0.625695407390594)), .Names = c("X", "Y", "2010-01-16", "2010-01-17"), row.names = 1:2, class = "data.frame")
What is the easiest way of doing this?
I thought about writing a for loop, but I guess that apply/sapply might help on this?
You can use library tidyr and its spread function like this:
library(tidyr)
spread(my_df, Time, spei)
X Y 2010-01-16 2010-01-17
1 48.75 17.25 -0.4602364 NA
2 49.25 17.25 NA -0.6256954
Without any additional packages you could do that with reshape():
reshape(my_df, idvar = c('X', 'Y'), timevar = "Time", direction = 'wide')
Which gives:
X Y spei.2010-01-16 spei.2010-01-17
1 48.75 17.25 -0.4602364 NA
2 49.25 17.25 NA -0.6256954
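To get column names that are just the dates, as in the requested new_df, the spei. prefix can be stripped afterwards. A small sketch, using a simplified copy of my_df (plain vectors, rounded spei values) as an assumption:

```r
# Simplified copy of the question's data (rounded values, no .Dim attributes)
my_df <- data.frame(
  X = c(48.75, 49.25),
  Y = c(17.25, 17.25),
  Time = as.Date(c("2010-01-16", "2010-01-17")),
  spei = c(-0.46, -0.63)
)

wide <- reshape(my_df, idvar = c("X", "Y"), timevar = "Time",
                v.names = "spei", direction = "wide")

# reshape() names the new columns "<v.names>.<time>"; remove the "spei." prefix
names(wide) <- sub("^spei\\.", "", names(wide))
wide
```

Names such as 2010-01-16 are not syntactic, so later code has to refer to them with backticks or with wide[["2010-01-16"]].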

Combining two columns in R

I have two columns in R:
Origin Dest
ALB ATL
ALB LAG
ALB LAX
I need them to look like this (in one column):
Origin-Dest
ALB-ATL
ALB-LAG
ALB-LAX
Does anyone know how to combine the two columns without getting too complicated?
This is the code I have so far:
air <- read.table(delta, header=T, sep=",")
aircolSQL <- sqldf("select Origin, Dest, ActualElapsedTime from air")
airsortSQL <- sqldf("select * from aircolSQL order by Origin asc, Dest")
airsortSQL$ActualTimeHours = round((airsortSQL$ActualElapsedTime/60),1)
airsortSQL$ActualElapsedTime <- NULL
Thanks!
If df is your data frame:
df <- data.frame(Origin = rep('ALB', 3), Dest = c('ATL', 'LAG', 'LAX'))
library(tidyr)
unite(df, "Origin-Dest", Origin, Dest, sep = "-") # unite_() is deprecated in current tidyr
Origin-Dest
1 ALB-ATL
2 ALB-LAG
3 ALB-LAX
In SQLite || is used for string concatenation:
library(sqldf)
sqldf("select Origin || '-' || Dest OD from air")
giving:
OD
1 ALB-ATL
2 ALB-LAG
3 ALB-LAX
Data: we used this as air:
air <- structure(list(Origin = structure(c(1L, 1L, 1L), .Label = "ALB",
class = "factor"), Dest = structure(1:3, .Label = c("ATL", "LAG", "LAX"),
class = "factor")), .Names = c("Origin", "Dest"), class = "data.frame",
row.names = c(NA, -3L))
I often use paste:
DataFrame$OriginDest<- paste(DataFrame$Origin,DataFrame$Dest, sep = '-')
I find it the least hassle approach.
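As a self-contained illustration of the paste approach, here it is on the three sample rows. The only wrinkle is the hyphen in the requested column name: Origin-Dest is not a syntactic R name, so the sketch assigns it with [[ ]] instead of $.

```r
# Sample rows from the question
air <- data.frame(
  Origin = rep("ALB", 3),
  Dest = c("ATL", "LAG", "LAX"),
  stringsAsFactors = FALSE
)

# paste() is vectorised, so this builds one combined string per row
air[["Origin-Dest"]] <- paste(air$Origin, air$Dest, sep = "-")
air
```

If a plain name such as OriginDest is acceptable, the usual air$OriginDest form works as in the answer above.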
