How to custom flatten a data frame? [duplicate] - r

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
I have a data frame as follows:
df <- data.frame(x=c('a,b,c','d,e','f'),y=c(1,2,3))
> df
      x y
1 a,b,c 1
2   d,e 2
3     f 3
I can get the flattened df$x like this:
unique(unlist(strsplit(as.character(df$x), ",")))
[1] "a" "b" "c" "d" "e" "f"
What would be the best way to transform my input df into:
x y
a 1
b 1
c 1
d 2
e 2
f 3
Basically, flatten df$x and assign each element its corresponding y.

If you are working with a data.frame, I recommend using tidyr:
df <- data.frame(x=c('a,b,c','d,e','f'),y=c(1,2,3),stringsAsFactors = F)
library(tidyr)
df %>%
  transform(x = strsplit(x, ",")) %>%
  unnest(x)
  y x
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f
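With current tidyr (1.3.0 or later), separate_longer_delim() does the split and unnest in a single step. A minimal sketch, assuming that version of tidyr is installed:
library(tidyr)  # already loaded above
separate_longer_delim(df, x, delim = ",")
# one row per element of x, with the matching y value repeated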

Another base R option is to match each split element back against df$x (note that this relies on grepl, so it assumes no element is a substring of another and that the elements contain no regex metacharacters):
sapply(unlist(strsplit(as.character(df$x), ",")), function(ss)
  df$y[which(grepl(pattern = ss, x = df$x))])
#a b c d e f
#1 1 1 2 2 3
If you want a data frame:
do.call(rbind, lapply(1:NROW(df), function(i)
  setNames(data.frame(unlist(strsplit(as.character(df$x[i]), ",")), df$y[i]),
           names(df))))
# x y
#1 a 1
#2 b 1
#3 c 1
#4 d 2
#5 e 2
#6 f 3

FWIW, you could also repeat the row indices according to how many elements each x value has:
df <- data.frame(x=c('a,b,c','d,e','f'),y=c(1,2,3),stringsAsFactors = F)
df[, 1] <- strsplit(df[, 1], ",")
cbind(x = unlist(df[, 1]), df[rep(1:nrow(df), lengths(df[, 1])), -1, drop = FALSE])
# x y
# 1 a 1
# 1.1 b 1
# 1.2 c 1
# 2 d 2
# 2.1 e 2
# 3 f 3

Related

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas, and I want to automate it with R.
I have two data frames; one, called inputs, looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get:
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on one of the v columns (v1 through v3 in the example): in the first row you can see that I match id = 1 and column v1, since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The data frame examples are simplified versions, as I have more records and the values go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v') %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
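A small variation, if you want to silence the join message, is to name the key explicitly (same pipeline, only the final join changes):
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v') %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df, by = c('id', 'v'))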
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
You can use a two-column matrix as an index argument to "[", so this is a one-liner. (Note that the data objects here are named d1 and d2; I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
  key <- append(df[df2$id[i], (df2$v[i] + 1L)], key)
}
df2$key <- rev(key)
df2
># id v key
># 1 1 1 A
># 2 1 2 A
># 3 1 3 C
># 4 2 2 D
># 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)

Difference of maximum and minimum by group

I have the following data frame
v1 v2 v3
a 2 5
b 5 3
c 2 1
d 2 1
e 1 2
a 2 4
a 8 1
e 1 6
b 0 1
c 2 8
d 1 5
Using R, I want to compute, for every unique value of V1, the difference between the max V3 and the min V3.
Expected :
Val max_min
a “5-1”
b “3-1”
c “8-1”
d “5-1”
e “6-2”
I am trying
ddply(fil1, c("V1"), summarise, max(V3) - min(V1))
but I don't get the expected result: it gives the same value in max_min for every group, namely max(V3) - min(V3) computed over the whole data frame rather than per group.
I have also tried average, with no success.
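For reference, a likely fix for the plyr attempt above (a sketch, assuming the data frame really is called fil1 and the columns are V1/V3 as in that call): the expression subtracts min(V1), the grouping column, instead of min(V3).
library(plyr)
ddply(fil1, "V1", summarise, max_min = max(V3) - min(V3))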
Or in base R,
MAX = aggregate(df$v3, list(df$v1), max)
MIN = aggregate(df$v3, list(df$v1), min)
MAX[,2] - MIN[,2]
[1] 4 2 7 4 4
A one liner of the above would be,
aggregate(v3 ~ v1, df, FUN = function(i)max(i) - min(i))
# v1 v3
#1 a 4
#2 b 2
#3 c 7
#4 d 4
#5 e 4
We can also use tapply, which displays the output as follows:
with(df, tapply(v3, list(v1), function(i) max(i)-min(i)))
#a b c d e
#4 2 7 4 4
You could also go for split:
lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
# $a
# [1] 4
# $b
# [1] 2
# $c
# [1] 7
# $d
# [1] 4
# $e
# [1] 4
If you insist on seeing your desired output:
ls <- lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
data.frame(Val=names(ls), max_min=unlist(ls))
# Val max_min
#a a 4
#b b 2
#c c 7
#d d 4
#e e 4
If you're using dplyr you can use the summarise function. In base R, range returns a vector containing the min and max values, and diff finds the difference. So a one-liner is:
df %>% group_by(V1) %>% summarise(max_min=diff(range(V3)))
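For completeness, the same range/diff idea also works in base R with aggregate (a sketch, using the lower-case column names from the posted data, as in the earlier answers):
aggregate(v3 ~ v1, df, function(x) diff(range(x)))
#  v1 v3
#1  a  4
#2  b  2
#3  c  7
#4  d  4
#5  e  4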

Getting columns with equivalent values in rows

I need to find, across a number of rows, the columns whose values are equivalent and extract exactly those columns.
I have the following dataframe:
a <- c(1,2,3)
b <- c(2,2,3)
c <- c(4,5,6)
d <- c(1,2,3)
A <- data.frame(a,b,c,d)
> A
  a b c d
1 1 2 4 1
2 2 2 5 2
3 3 3 6 3
I would like the following result:
> columnInnerJoin(A)
a d
1 1 1
2 2 2
3 3 3
Or, more specifically:
> columnInnerJoinGiveColumns(A)
a d
We can try with duplicated
res <- A[duplicated(as.list(A))|duplicated(as.list(A), fromLast=TRUE)]
names(res)
#[1] "a" "d"
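Printing res shows the matching columns themselves, which is the columnInnerJoin output asked for (using the example data above, where columns a and d are identical):
res
#  a d
#1 1 1
#2 2 2
#3 3 3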

Strsplit on a column of a data frame [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I have a data.frame where one of the variables is a vector (or a list), like this:
MyColumn <- c("A, B,C", "D,E", "F","G")
MyDF <- data.frame(group_id=1:4, val=11:14, cat=MyColumn)
# group_id val cat
# 1 1 11 A, B,C
# 2 2 12 D,E
# 3 3 13 F
# 4 4 14 G
I'd like to have a new data frame with as many rows as the vector
FlatColumn <- unlist(strsplit(MyColumn,split=","))
which looks like this:
MyNewDF <- data.frame(group_id=c(rep(1,3),rep(2,2),3,4), val=c(rep(11,3),rep(12,2),13,14), cat=FlatColumn)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
In essence, for every factor which is an element of the list of MyColumn (the letters A to G), I want to assign the corresponding values of the list. Every factor appears only once in MyColumn.
Is there a neat way for this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the length of the corresponding element of strsplit(MyColumn,split=","). I'm very sure that there has to be a more elegant way.
You can use separate_rows from tidyr:
tidyr::separate_rows(MyDF, cat)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
How about
lst <- strsplit(MyColumn, split = ",")  # note: "A, B,C" contains a space, so this yields " B"; split = ", *" would trim it
k <- lengths(lst) ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
val = rep.int(MyDF$val, k),
cat = FlatColumn)
# group_id val cat
#1 1 11 A
#2 1 11 B
#3 1 11 C
#4 2 12 D
#5 2 12 E
#6 3 13 F
#7 4 14 G
We can use cSplit from splitstackshape
library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
# group_id val cat
#1: 1 11 A
#2: 1 11 B
#3: 1 11 C
#4: 2 12 D
#5: 2 12 E
#6: 3 13 F
#7: 4 14 G
We can also do this in base R: use strsplit to split the 'cat' column into a list, replicate the rows of 'MyDF' according to the lengths of 'lst', and recreate the 'cat' column by unlisting 'lst'.
lst <- strsplit(as.character(MyDF$cat), ",")
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)),-3], cat = unlist(lst))

r element frequency and column name

I have a dataframe that has four columns A, B, C and D:
A B C D
a a b c
b c x e
c d y a
d     z
e
f
I would like to get the frequency of all elements and lists of columns they appear, ordered by the frequency ranking. The output would be something like this:
Ranking frequency column
a 1 3 A, B, D
c 1 3 A, B, D
b 2 2 A, C
d 2 2 A, B
e 2 2 A, D
f .....
I would appreciate any help.
Thank you!
Something like this maybe:
Data
df <- read.table(header=T, text='A B C D
a a b c
b c x e
c d y a
d NA NA z
e NA NA NA
f NA NA NA',stringsAsFactors=F)
Solution
#find unique elements
elements <- unique(unlist(sapply(df, unique)))
#use a lapply to find the info you need
df2 <- data.frame(do.call(rbind,
  lapply(elements, function(x) {
    #find the rows and columns of the elements
    a <- which(df == x, arr.ind=TRUE)
    #find column names of the elements found
    b <- names(df[a[,2]])
    #find frequency
    c <- nrow(a)
    #produce output
    c(x, c, paste(b, collapse=','))
  })))
#remove NAs
df2 <- na.omit(df2)
#change column names
colnames(df2) <- c('element','frequency', 'columns')
#order according to frequency
df2 <- df2[order(df2$frequency, decreasing=TRUE),]
#create the ranking column
df2$ranking <- as.numeric(factor(df2$frequency,levels=unique(df2$frequency)))
Output:
> df2
element frequency columns ranking
1 a 3 A,B,D 1
3 c 3 A,B,D 1
2 b 2 A,C 2
4 d 2 A,B 2
5 e 2 A,D 2
6 f 1 A 3
8 x 1 C 3
9 y 1 C 3
10 z 1 D 3
And if you want the element column as row names and the ranking column first, you can also do:
row.names(df2) <- df2$element
df2$element <- NULL
df2 <- df2[c('ranking','frequency','columns')]
Output:
> df2
ranking frequency columns
a 1 3 A,B,D
c 1 3 A,B,D
b 2 2 A,C
d 2 2 A,B
e 2 2 A,D
f 3 1 A
x 3 1 C
y 3 1 C
z 3 1 D
Here's an approach using "dplyr" and "tidyr":
library(dplyr)
library(tidyr)
df %>%
  gather(var, val, everything()) %>%           ## Make a long dataset
  na.omit %>%                                  ## We don't need the NA values
  group_by(val) %>%                            ## All calculations grouped by val
  summarise(column = toString(var),            ## This collapses
            freq = n()) %>%                    ## This counts
  mutate(ranking = dense_rank(desc(freq))) %>% ## This ranks
  arrange(ranking)                             ## This sorts
# Source: local data frame [9 x 4]
#
# val column freq ranking
# 1 a A, B, D 3 1
# 2 c A, B, D 3 1
# 3 b A, C 2 2
# 4 d A, B 2 2
# 5 e A, D 2 2
# 6 f A 1 3
# 7 x C 1 3
# 8 y C 1 3
# 9 z D 1 3
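Note that gather() is superseded in current tidyr; a roughly equivalent sketch with pivot_longer() (assuming tidyr >= 1.0.0, with dplyr and tidyr loaded as above):
df %>%
  pivot_longer(everything(), names_to = "var", values_to = "val",
               values_drop_na = TRUE) %>%      ## long format, dropping NAs
  group_by(val) %>%
  summarise(column = toString(var), freq = n()) %>%
  mutate(ranking = dense_rank(desc(freq))) %>%
  arrange(ranking)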
