Strsplit on a column of a data frame [duplicate] - r

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I have a data.frame where one of the variables is a vector (or a list), like this:
MyColumn <- c("A, B,C", "D,E", "F","G")
MyDF <- data.frame(group_id=1:4, val=11:14, cat=MyColumn)
# group_id val cat
# 1 1 11 A, B,C
# 2 2 12 D,E
# 3 3 13 F
# 4 4 14 G
I'd like to have a new data frame with as many rows as the vector
FlatColumn <- unlist(strsplit(MyColumn,split=","))
which looks like this:
MyNewDF <- data.frame(group_id=c(rep(1,3),rep(2,2),3,4), val=c(rep(11,3),rep(12,2),13,14), cat=FlatColumn)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
In essence, for every factor which is an element of the list of MyColumn (the letters A to G), I want to assign the corresponding values of the list. Every factor appears only once in MyColumn.
Is there a neat way for this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the length of the corresponding element of strsplit(MyColumn,split=","). I'm very sure that there has to be a more elegant way.

You can use separate_rows from tidyr:
tidyr::separate_rows(MyDF, cat)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G

How about
lst <- strsplit(MyColumn, split = ",")
k <- lengths(lst) ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
val = rep.int(MyDF$val, k),
cat = FlatColumn)
# group_id val cat
#1 1 11 A
#2 1 11 B
#3 1 11 C
#4 2 12 D
#5 2 12 E
#6 3 13 F
#7 4 14 G

We can use cSplit from splitstackshape
library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
# group_id val cat
#1: 1 11 A
#2: 1 11 B
#3: 1 11 C
#4: 2 12 D
#5: 2 12 E
#6: 3 13 F
#7: 4 14 G
We can also use do with base R with strsplit to split the 'cat' column into a list, replicate the sequence of rows of 'MyDF' with the lengths of 'lst', and create the 'cat' column by unlisting the 'lst'.
lst <- strsplit(as.character(MyDF$cat), ",")
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)),-3], cat = unlist(lst))

Related

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find the first row in the respective group [i] where the value [i] from vector "v" is bigger than the value in column "u".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like, ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from 'w' column of 'q', then do a group_by 'w' and get the first row index where u is greater than the corresponding 'vector' column value
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# // or use findInterval
#summarise(n = findInterval(new[1], u)+1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
OP mentioned that comparison starts from the beginning and it is not correct because we have a group_by operation. If we create a column of sequence, it resets at each group
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2

From axis values to coodinates pairs [duplicate]

I have two vectors of integers, say v1=c(1,2) and v2=c(3,4), I want to combine and obtain this as a result (as a data.frame, or matrix):
> combine(v1,v2) <--- doesn't exist
1 3
1 4
2 3
2 4
This is a basic case. What about a little bit more complicated - combine every row with every other row? E.g. imagine that we have two data.frames or matrices d1, and d2, and we want to combine them to obtain the following result:
d1
1 13
2 11
d2
3 12
4 10
> combine(d1,d2) <--- doesn't exist
1 13 3 12
1 13 4 10
2 11 3 12
2 11 4 10
How could I achieve this?
For the simple case of vectors there is expand.grid
v1 <- 1:2
v2 <- 3:4
expand.grid(v1, v2)
# Var1 Var2
#1 1 3
#2 2 3
#3 1 4
#4 2 4
I don't know of a function that will automatically do what you want to do for dataframes(See edit)
We could relatively easily accomplish this using expand.grid and cbind.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
expand.grid(df1, df2) # doesn't work so let's try something else
id <- expand.grid(seq(nrow(df1)), seq(nrow(df2)))
out <-cbind(df1[id[,1],], df2[id[,2],])
out
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#1.1 1 3 6 b
#2.1 2 4 6 b
Edit: As Joran points out in the comments merge does this for us for data frames.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
merge(df1, df2)
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#3 1 3 6 b
#4 2 4 6 b

How to use stack to produce multiple columns data frame?

I want to convert a list of lists into a data.frame. First I each sublist was only of length 1 and so I used stack(as.data.frame(...)) but stack does not seam to be able to produce multicolumns data.frame. So what it the best way to achieve that:
# works fine with only sublists of length 1
l = list(a = sample(1:5, 5), b = sample(1:5, 5))
> stack(as.data.frame(l))
values ind
1 5 a
2 4 a
3 1 a
4 2 a
5 3 a
6 2 b
7 1 b
8 3 b
9 5 b
10 4 b
Now my list is a list of lists:
l = list(a = list(first = sample(1:5, 5), sec = sample(1:5, 5)), b = list(first = sample(1:5, 5), sec = sample(1:5, 5)))
stack(as.data.frame(l))
values ind
1 4 a.first
2 5 a.first
3 3 a.first
4 1 a.first
5 2 a.first
6 3 a.sec
7 5 a.sec
8 1 a.sec
9 2 a.sec
10 4 a.sec
11 5 b.first
12 4 b.first
13 3 b.first
14 1 b.first
15 2 b.first
16 3 b.sec
17 4 b.sec
18 1 b.sec
19 2 b.sec
20 5 b.sec
while I'd like to have still a column ind with a and b and two columns first and sec
We can flatten the list by concatenating (c) the nested elements ('l1'), get the substring from the names of 'l1' ('nm1' and 'nm2'), split the 'l1' by 'nm1' (i.e. substring obtained by removing the prefix) while we set the names of 'l1' with 'nm2' (substring obtained by removing suffix starting with .), loop through the list and stack it ('lst'). Then, we cbind the 'ind' column (which is the same in all the list elements so we get it from the first list element - lst[[1]][2]) with the 'value' column i.e. the first column.
l1 <- do.call(c, l)
nm1 <- sub("[^.]+\\.", "", names(l1))
nm2 <- sub("\\..*", "", names(l1))
lst <- lapply(split(setNames(l1, nm2), nm1), stack)
cbind(lst[[1]][2],lapply(lst, `[[`, 1))
# ind first sec
#1 a 1 1
#2 a 5 5
#3 a 4 4
#4 a 3 3
#5 a 2 2
#6 b 3 4
#7 b 4 5
#8 b 2 2
#9 b 1 3
#10 b 5 1
Or using dplyr/purrr we can get the expected output.
library(purrr)
library(dplyr)
l1 <- transpose(l)
n1 <- names(l1)
l1 %>%
map(stack) %>%
bind_cols %>%
setNames(., make.unique(names(.))) %>%
select(ind, matches("value")) %>%
setNames(., c("ind", n1))
# ind first sec
# (fctr) (int) (int)
#1 a 1 1
#2 a 5 5
#3 a 4 4
#4 a 3 3
#5 a 2 2
#6 b 3 4
#7 b 4 5
#8 b 2 2
#9 b 1 3
#10 b 5 1
Here is another approach:
df <- stack(as.data.frame(l))
# split names of variables
indVars <- strsplit(as.character(df$ind), split="\\.")
# add variables to data.frame
df$letters <- sapply(indVars, function(i) i[1])
df$order <- sapply(indVars, function(i) i[2])
# get final data.frame
cbind("order"=unstack(df, letters~order)[,1], unstack(df, values~order))

understanding apply and outer function in R

Suppose i have a data which looks like this
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I Wanted to compare these values with each other so if an ID has changed its value of A variable over a period of B variable(which is from 1 to 4) it goes into data frame K and if it hasn't then it goes to data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if then else statement it can be solved like following
for ( i in 1:length(ID)){
m=0
for (j in 1: length(B)){
ifelse( A[j] == A[j+1],m,m=m+1)
}
ifelse(m=0, L=c[,df[i]], K=c[,df[i]])
}
I have read in some posts that in R nested loops can be replaced by apply and outer function. if someone can help me understand how it can be used in such circumstances.
So basically you don't need a loop with conditions here, all you need to do is to check if there's a variance (and then converting it to a logical using !) in A during each cycle of B (IDs) by converting A to a numeric value (I'm assuming its a factor in your real data set, if its not a factor, you can use FUN = function(x) length(unique(x)) within ave instead ) and then split accordingly. With base R we can use ave for such task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(A), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(A)) %>%
split(., indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all df$A grouped by df$ID, and apply the function on df$A within each groupings (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check the length of unique elements in 'A' (uniqueN(A)) is greater than 1 or not, create a column 'ind' based on that. We can then split the dataset based on that
'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1' i.e. the 'ID' and 'A'. By double negating (!!) the output, we can get the '0' values convert to 'TRUE' and all other frequency as 'FALSE'. Get the rowSums ('indx'). We match the ID column in 'df1' with the names of the 'indx', use that to replace the 'ID' with TRUE/FALSE, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need to get individual datasets on the global environment, change the names of the list elements to the object names we wanted and use list2env (not recommended though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

How to matching missing IDs?

I have a large table with 50000 obs. The following mimic the structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID ,a,b,c)
Where there are some missing values on the vector "a". However, I have some tables where the ID and the missing strings are included:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to include the missing strings from key into the column a using the ID?
Another options is to use data.tables fast binary join and update by reference capabilities
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (not all the joined cases), a bit more complicated implemintation will be
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both your datasets are using characters instead of factors to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector, you can access each value by lookup[ID] e.g. lookup["2"] (make sure the number is a character, not numeric)
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]

Resources