Column Split based on certain separator format - r

I have a format to separate where I will have this data:
df = data.frame(id=c(1,2),name=c('A~B~C','A~B~D'),value=c('1~2~3','1~~2'))
id name value
1 A~B~C 1~2~3
2 A~B~D 1~~2
which is expected to have the following output where the column name is the original column name followed by the text in the name column:
id value_A value_B value_C value_D
1 1 2 3
2 1 2
I manage to achieve the splitting for the by using many nested for loops to process on my data row by row. It works on small sample data but once the data gets huge, the time is an issue.
Also,there could be more than 1 value columns, but they all should map into the same name column.
Example output:
id value_A value_B value_C value1_A value1_B value1_C
1 1 2 3 1 2 3
2 1 2 3 1 2 3

You can try dplyr:
library(tidyverse)
df %>%
separate_rows(name, value, sep = "~") %>%
spread(name, value)
id A B C D
1 1 1 2 3 <NA>
2 2 1 <NA> 2
Instead of NA you can fill empty cells by anything you specify within fill = ""
Or baseR and reshape2:
a <- strsplit(as.character(df$name), "~")
b <- strsplit(as.character(df$value), "~")
df2 <- do.call(rbind.data.frame, Map(cbind, df$id, a, b))
library(reshape2)
dcast(df2, V1~V2, value.var = "V3")
A B C D
1 1 2 3 <NA>
2 1 <NA> 2

Here is an option using cSplit/dcast. Split the rows into 'long' format with cSplit and dcast it to 'wide' format
library(splitstackshape)
dcast(cSplit(df, c('name','value'), '~', 'long')[!is.na(value)], id ~ paste0('value_', name))
# id value_A value_B value_C value_D
#1: 1 1 2 3 NA
#2: 2 1 NA NA 2

Related

Frequency count for multiple columns with same values

I'd like to make a frequency count individually for multiple columns with same possible values. The idea is to keep all columns from original data table, just adding a new one for levels and aggregating.
Here is an example of input data:
foo <- data.table(a = c(1,3,2,3,3), b = c(2,3,3,1,1), c = c(3,1,2,3,2))
# a b c
#1: 1 2 3
#2: 3 3 1
#3: 2 3 2
#4: 3 1 3
#5: 3 1 2
And desired output:
data.table(levels = 1:3, a = c(1,1,3), b = c(2,1,2), c = c(1,2,2))
# levels a b c
#1: 1 1 2 1
#2: 2 1 1 2
#3: 3 3 2 2
Thanks for helping !
We may use
library(data.table)
dcast(melt(foo)[, .N, .(variable, levels = value)],
levels ~ variable, value.var = 'N')
-output
Key: <levels>
levels a b c
<num> <int> <int> <int>
1: 1 1 2 1
2: 2 1 1 2
3: 3 3 2 2
Or using base R
table(stack(foo))
ind
values a b c
1 1 2 1
2 1 1 2
3 3 2 2
You could also use recast from reshape2:
reshape2::recast(foo, value~variable)
# No id variables; using all as measure variables
# Aggregation function missing: defaulting to length
value a b c
1 1 1 2 1
2 2 1 1 2
3 3 3 2 2
or even
reshape2::recast(foo, value~variable, length)
Here is an option using purrr and dplyr from the tidyverse:
library(purrr)
library(dplyr)
foo %>%
imap(~ as.data.frame(table(.x, dnn = "levels"), responseName = .y)) %>%
reduce(left_join, by = "levels")
Alternatively, you could use the pivot functions from tidyr:
library(dplyr)
library(tidyr)
foo %>%
pivot_longer(everything(),
values_to = "levels") %>%
count(name, levels) %>%
pivot_wider(id_cols = levels,
names_from = name,
values_from = n)
foo |>
melt() |>
dcast(value ~ variable, fun.aggregate = length)
# value a b c
# 1: 1 1 2 1
# 2: 2 1 1 2
# 3: 3 3 2 2

how to count and dcast for all columns in r

I have following dataframe in r
Company Education Health
A NA 1
A 1 2
A 1 NA
I want the count of levels in each columns(1,2,NA) in a following format
Company Education_1 Education_NA Health_1 Health_2 Health_NA
A 2 1 1 1 1
How can I do it in R?
You can do the following:
library(tidyverse)
df %>%
gather(k, v, -Company) %>%
unite(tmp, k, v, sep = "_") %>%
count(Company, tmp) %>%
spread(tmp, n)
## A tibble: 1 x 6
# Company Education_1 Education_NA Health_1 Health_2 Health_NA
# <fct> <int> <int> <int> <int> <int>
#1 A 2 1 1 1 1
Sample data
df <- read.table(text =
" Company Education Health
A NA 1
A 1 2
A 1 NA ", header = T)
Using DF in the Note at the end where we have added a company B as well and using the reshape2 package it can be done in one recast call. The id.var and fun arguments can be omitted and the same answer will be given but it will produce a message saying it used those defaults.
library(reshape2)
recast(DF, Company ~ variable + value,
id.var = "Company", fun = length)
giving this data frame:
Company Education_1 Education_NA Health_1 Health_2 Health_NA
1 A 2 1 1 1 1
2 B 2 1 1 1 1
Note
Lines <- " Company Education Health
1 A NA 1
2 A 1 2
3 A 1 NA
4 B NA 1
5 B 1 2
6 B 1 NA"
DF <- read.table(text = Lines)
In plyr you can use a hack with ddply by transposing tables to get what appear to be new columns:
x <- data.frame(Company="A",Education=c(NA,1,1),Health=c(1,2,NA))
library(plyr)
ddply(x,.(Company),plyr::summarise,
Education=t(table(addNA(Education))),
Health=t(table(addNA(Health)))
)
Company Education.1 Education.NA Health.1 Health.2 Health.NA
1 A 2 1 1 1 1
However, they are not really columns, but table elements in the data.frame.
You can use a do.call(data.frame,y) construct to make them proper data frame columns, but you need more than one row for it to work.

Fill sequence by factor

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

How to find the first and last occurrence in a panel data set in R

I have a table:
id time
1 1
1 2
1 5
2 3
2 2
2 7
3 8
3 3
3 14
And I want to convert it to:
id first last
1 1 5
2 3 7
3 8 14
Please help!
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id', we get the first and last value of 'time'
library(data.table)
setDT(df1)[, list(firstocc = time[1L], lastocc = time[.N]),
by = id]
Or with dplyr, we use the same methodology.
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(firstocc = first(time), lastocc = last(time))
Or with base R (no packages needed)
do.call(rbind, lapply(split(df1, df1$id),
function(x) data.frame(id = x$id[1],
firstocc = x$time[1], lastocc = x$time[nrow(x)])))
If we need to be based on the min and max values (not related to the expected output) , the data.table option is
setDT(df1)[, setNames(as.list(range(time)),
c('firstOcc', 'lastOcc')) ,id]
and dplyr is
df1 %>%
group_by(id) %>%
summarise(firstocc = min(time), lastocc = max(time))
There are many packages that can perform aggregation of this sort in R. We show how to do it without any packages and then show it with some packages.
1) Use aggregate. No packages needed.
ag <- aggregate(time ~ id, DF, function(x) c(first = min(x), last = max(x)))
giving:
> ag
id time.first time.last
1 1 1 5
2 2 2 7
3 3 3 14
ag is a two column data frame whose second column contains a two column matrix with columns named 'first' and 'last'. If you want to flatten it to a 3 column data frame use:
do.call("cbind", ag)
giving:
id first last
[1,] 1 1 5
[2,] 2 2 7
[3,] 3 3 14
1a) This variation of (1) is more compact at the expense of uglier column names.
aggregate(time ~ id, DF, range)
2) sqldf
library(sqldf)
sqldf("select id, min(time) first, max(time) last from DF group by id")
giving:
id first last
[1,] 1 1 5
[2,] 2 2 7
[3,] 3 3 14
3) summaryBy summaryBy in the doBy package is much like aggregate:
library(doBy)
summaryBy(time ~ id, data = DF, FUN = c(min, max))
giving:
id time.min time.max
1 1 1 5
2 2 2 7
3 3 3 14
Note: Here is the input DF in reproducible form:
Lines <- "id time
1 1
1 2
1 5
2 3
2 2
2 7
3 8
3 3
3 14"
DF <- read.table(text = Lines, header = TRUE)
Update: Added (1a), (2) and (3) and fixed (1).
You can remove duplicates and reshape it
dd <- read.table(header = TRUE, text = "id time
1 1
1 2
1 5
2 3
2 2
2 7
3 8
3 3
3 14")
d2 <- dd[!(duplicated(dd$id) & duplicated(dd$id, fromLast = TRUE)), ]
reshape(within(d2, tt <- c('first', 'last')), dir = 'wide', timevar = 'tt')
# id time.first time.last
# 1 1 1 5
# 4 2 3 7
# 7 3 8 14

combine different row's values in table in r

I need to re-format a table in R.
I have a table like this.
ID category
1 a
1 b
2 c
3 d
4 a
4 c
5 a
And I want to reform it as
ID category1 category2
1 a b
2 c null
3 d null
4 a c
5 a null
Is this doable in R?
This is a very straightforward "long to wide" type of reshaping problem, but you need a secondary "id" (or "time") variable.
You can try using getanID from my "splitstackshape" package and use dcast to reshape from long to wide. getanID will create a new column called ".id" that would be used as your "time" variable:
library(splitstackshape)
dcast.data.table(getanID(mydf, "ID"), ID ~ .id, value.var = "category")
# ID 1 2
# 1: 1 a b
# 2: 2 c NA
# 3: 3 d NA
# 4: 4 a c
# 5: 5 a NA
Same as Ananda's, but using dplyr and tidyr:
library(tidyr)
library(dplyr)
mydf %>% group_by(ID) %>%
mutate(cat_row = paste0("category", 1:n())) %>%
spread(key = cat_row, value = category)
# Source: local data frame [5 x 3]
#
# ID category1 category2
# 1 1 a b
# 2 2 c NA
# 3 3 d NA
# 4 4 a c
# 5 5 a NA

Resources