I'm very new to R and currently trying to move away from my reliance on Excel. I have a file which I'm trying to clean up, and the final step of my clean-up procedure is to order the columns by the header value. The headers contain values such as:
"August.2019", "October.2019", "February.2020", "March.2020", "June.2019", "July.2019", etc.
I can't say exactly how many columns like this may exist, as this varies from file to file.
As you can see, they're not in any particular order. Is there a way to organise the columns in ascending order? Any guidance would be hugely appreciated.
One option would be to convert the column names to Date and then order the columns by those dates:
df[order(as.Date(paste0("01.", names(df)), "%d.%B.%Y"))]
Adding a reproducible example,
df <- structure(list(August.2019 = c(1L, 5L), October.2019 = 2:3,
February.2020 = 3:2, March.2020 = c(4L, 1L), June.2019. = c(5L, 2L),
July.2019 = c(6L,1L)), class = "data.frame", row.names = c(NA, -2L))
which looks like this
df
# August.2019 October.2019 February.2020 March.2020 June.2019. July.2019
#1 1 2 3 4 5 6
#2 5 3 2 1 2 1
df[order(as.Date(paste0("01.", names(df)), "%d.%B.%Y"))]
# June.2019. July.2019 August.2019 October.2019 February.2020 March.2020
#1 5 6 1 2 3 4
#2 2 1 5 3 2 1
The same logic of converting the names to dates works with functions from other packages, such as lubridate/anytime or as.yearmon from zoo (a lubridate sketch follows below). To reorder the columns in place, we can use setcolorder from data.table:
library(zoo)
library(data.table)
setcolorder(df, order(as.yearmon(names(df), "%b.%Y")))
df
# June.2019. July.2019 August.2019 October.2019 February.2020 March.2020
#1 5 6 1 2 3 4
#2 2 1 5 3 2 1
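For completeness, here is a minimal lubridate sketch of the same idea (not from the original answer; it assumes my() copes with the full month names, and strips the trailing dot in June.2019. first):
library(lubridate)
# Drop any trailing dot (e.g. "June.2019.") before parsing the month-year names
nm <- sub("\\.$", "", names(df))
# my() parses strings such as "August.2019" to the first day of that month
df[order(my(nm))]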
I have a dataset with multiple different ranges of columns in each row (each row corresponds to one individual), as below. Each of the different column types has 3 levels (0, 1 and 2).
id col1_0 col1_1 col1_2 col2_0 col2_1 col2_2 col3_0 col3_1 col3_2
1 0 1 3 2 2 3 3 4 5
2 1 1 2 2 4 7 4 5 5
.
.
etc.
What I need is to collapse all the col1 columns into one column, all the col2 columns into another, and all the col3 columns into another, for each id, as below.
id x col1 col2 col3
1 0 0 2 3
1 1 1 2 4
1 2 3 3 5
2 0 1 2 4
2 1 1 4 5
2 2 2 7 5
.
.
etc.
In addition, I would also need to create an x column with values 0, 1 and 2 for each id. However, I have only managed to collapse the first range of columns (col1) with the code below.
library(tidyverse)
longer_data <- dataframe %>%
  group_by(id) %>%
  pivot_longer(col1_0:col1_2, names_to = "x1", values_to = "col1")
x1 here becomes a column holding the original column names, so I would also need an additional x column that only keeps the last number of the original column names.
Is there a way to achieve this? Many thanks in advance!
We don't need group_by; this can be done directly with pivot_longer by specifying names_sep and using .value in names_to. Note the order of .value and 'x': the values of each column go into a new column named after the prefix before the _, while the suffix goes into the new 'x' column.
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = -id, names_to = c('.value', 'x'), names_sep = "_")
-output
# A tibble: 6 x 5
# id x col1 col2 col3
# <int> <chr> <int> <int> <int>
#1 1 0 0 2 3
#2 1 1 1 2 4
#3 1 2 3 3 5
#4 2 0 1 2 4
#5 2 1 1 4 5
#6 2 2 2 7 5
data
df1 <- structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)),
class = "data.frame", row.names = c(NA,
-2L))
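A data.table sketch of the same reshape is also possible (not part of the original answers; it uses the measure() helper available in recent data.table versions, >= 1.14.2):
library(data.table)
# measure() splits each column name on "_": the part before it supplies the
# value column name (value.name), the part after it goes into the new 'x' column
melt(as.data.table(df1),
     id.vars = "id",
     measure.vars = measure(value.name, x, sep = "_"))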
Here is a base R option using reshape, where timevar = "x" creates a column named x and sep = "_" extracts the trailing number from the original column names.
res <- reshape(
df,
direction = "long",
idvar = "id",
varying = -1,
timevar = "x",
sep = "_"
)
res <- res[order(res$id), ]
Output
> res
id x col1 col2 col3
1.0 1 0 0 2 3
1.1 1 1 1 2 4
1.2 1 2 3 3 5
2.0 2 0 1 2 4
2.1 2 1 1 4 5
2.2 2 2 2 7 5
Data
> dput(df)
structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
I want to perform a join.
df1=structure(list(id = 1:3, group_id = c(10L, 20L, 40L)), class = "data.frame", row.names = c(NA,
-3L))
df2 has a different structure: its group_id field contains many groups, for example {10,100,400}. Here is its dput():
df2=structure(list(id = 1:3, group_id = structure(c(1L, 3L, 2L), .Label = c("{`10`,100,`40`}",
"{3,`40`,600,100}", "{4}"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
df2 has group_id 10 and 40, but they are inside braces together with other groups. How do I get the desired joined output?
id group_id
1 10
1 40
3 40
You can clean group_id in df2 using gsub, split each id into separate rows, and then filter.
library(dplyr)
df2 %>%
  mutate(group_id = gsub('[{}`]', '', group_id)) %>%
  tidyr::separate_rows(group_id) %>%
  filter(group_id %in% df1$group_id)
# id group_id
#1 1 10
#2 1 40
#3 3 40
Here's a data.table alternative (df2 needs to be converted to a data.table first):
library(data.table)
as.data.table(df2)[, strsplit(gsub('[{}`]', '', group_id), ','), by = id][V1 %in% df1$group_id]
# id V1
#1: 1 10
#2: 1 40
#3: 3 40
Here is a base R option using regmatches/gregexpr:
subset(setNames(stack(setNames(regmatches(df2$group_id, gregexpr("\\d+", df2$group_id)),
df2$id))[2:1], c('id', 'group_id')), group_id %in% df1$group_id)
# id group_id
#1 1 10
#3 1 40
#6 3 40
Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In Excel, I could use a VLOOKUP inside an IFERROR. Is there a similar combination of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the min and max of the data frame, we can use them directly. Say we want data from V1 = 1 to 10; we can do:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like:
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
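For reference, here is a base R sketch of the same idea, analogous to a VLOOKUP wrapped in IFERROR (not from the original answer; it assumes the columns are named V1 and V2 as in the data below):
# Build the full sequence of V1 values, left-join the existing rows,
# and replace the NAs produced for the missing rows with 0
full <- data.frame(V1 = seq(min(df$V1), max(df$V1)))
res <- merge(full, df, by = "V1", all.x = TRUE)  # like VLOOKUP
res$V2[is.na(res$V2)] <- 0                       # like IFERROR(..., 0)
res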
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
My df is as follows:
monday_A monday_B tuesday_A tuesday_B
1 2 4 100
6 7 8 5
I want to reorder this so it becomes
date Group quantitive
Monday A 1
Monday A 6
Monday B 2
Monday B 7
Tuesday A 4
Tuesday A 8
Tuesday B 100
Tuesday B 5
What I've done:
df %>% pivot_longer(monday_A:tuesday_B, names_to="tempGroup", values_to="quantitive")
This gave me:
tempGroup quantitive
monday_A 1
monday_A 6
monday_B 2
monday_B 7
tuesday_A 4
tuesday_A 8
tuesday_B 100
tuesday_B 5
Now how do I separate tempGroup? I think a regex with ifelse could do it by splitting on the underscore.
Use names_sep:
tidyr::pivot_longer(df, cols = everything(),
                    names_sep = "_",
                    names_to = c("date", "tempGroup"),
                    values_to = "quantitative")
# A tibble: 8 x 3
# date tempGroup quantitative
# <chr> <chr> <int>
#1 monday A 1
#2 monday B 2
#3 tuesday A 4
#4 tuesday B 100
#5 monday A 6
#6 monday B 7
#7 tuesday A 8
#8 tuesday B 5
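Alternatively, starting from the long data produced by the pivot_longer call in the question, a minimal sketch with tidyr::separate (splitting tempGroup on the underscore) gives the same result:
library(tidyr)
# Long data as in the question, then split tempGroup into date and Group
long <- pivot_longer(df, monday_A:tuesday_B,
                     names_to = "tempGroup", values_to = "quantitive")
separate(long, tempGroup, into = c("date", "Group"), sep = "_")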
data
df <- structure(list(monday_A = c(1L, 6L), monday_B = c(2L, 7L),
tuesday_A = c(4L, 8L), tuesday_B = c(100L, 5L)),
class = "data.frame", row.names = c(NA, -2L))
Base R Solution:
# Transpose dataframe matrix: tpd => as.data.frame
tpd <- as.data.frame(t(df))
# Restructure the dataframe into the desired format: df_td => data.frame
df_td <-
data.frame(
day = gsub("_.*", "", rep(row.names(tpd), ncol(tpd))),
group = gsub(".*_", "", rep(row.names(tpd), ncol(tpd))),
quantitative = unlist(tpd),
row.names = NULL
)
Data
# Create re-usable data: df => data.frame
df <-
structure(
list(
monday_A = c(1L, 6L),
monday_B = c(2L, 7L),
tuesday_A = c(4L,
8L),
tuesday_B = c(100L, 5L)
),
row.names = c(NA,-2L),
class = "data.frame"
)
With the following dataset in R
(ID = Custid, i.e. the customer id)
ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0
I would like to convert the dataset to this format of columns
ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212
What is the best way to go about doing this? I can't figure out the best command in R.
Thanks in advance
You can use the reshape2 package and its melt and dcast functions to restructure your data.
data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA,
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L,
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L,
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New",
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L,
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel",
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA,
-3L))
library(reshape2)
## melt data
df_long<-melt(data,id.vars=c("ID","Geo","Channel","Brand","Neworstream"))
## recast in combinations of channel and time frame
dcast(df_long,... ~Channel+variable,sum)
Update/facepalm
The "NA" in your dataset probably aren't NA values but rather, the abbreviation "NA" for North America or something like that.
If you had used na.strings when reading your data in, you should have no problems using reshape as I originally indicated:
mydf <- read.table(header = TRUE, na.strings = "",
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0')
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
(I might, however, recommend changing your abbreviation for legibility and to reduce confusion!)
Original Answer (since there's still something interesting about reshape there)
This should do it:
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1 1 <NA> 1 New 5 0 1
# 3 3 EU 2 Stream NA NA NA
# RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 NA NA NA
# 3 5 1 0
Update (To try to salvage the answer a little bit)
As #Arun points out, the above isn't quite right. The culprit here is interaction(), which is used by reshape() to create a new temporary ID variable when more than one ID variable is specified.
Here's the line from reshape() and what it looks like when applied to our "mydf" object:
data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] <NA> <NA> 3.EU.2.Stream
# Levels: 3.EU.2.Stream
Hmmm. This seems to simplify to two IDs, NA and 3.EU.2.Stream.
What happens if we replace NA with ""?
mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New 1..1.Stream 3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream
Aaahh. That's a little bit better. We now have three unique IDs... and reshape() seems to work.
reshape(mydf, direction = "wide",
idvar=names(mydf)[c(1, 2, 4, 5)],
timevar="Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1 1 1 New 5 0
# 2 1 1 Stream 5 0
# 3 3 EU 2 Stream NA NA
# RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 1 NA NA NA
# 2 1 NA NA NA
# 3 NA 5 1 0
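For completeness, a sketch of the same reshape with tidyr::pivot_wider (not used in the original answers; it assumes the mydf read in with na.strings above):
library(tidyr)
# Spread the Rev* columns out by Channel, producing names like RevQ112_On-line, RevQ112_Tele, ...
pivot_wider(mydf,
            id_cols = c(ID, Geo, Brand, Neworstream),
            names_from = Channel,
            values_from = c(RevQ112, RevQ212, RevQ312))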