Expanding data.frame rows based on single field - r

I have a simple data set in the form:
From,To,Date,Subject
I would like to reshape this data so that lines such as:
e1,e2;e3;e4,d1,s1
get expanded to:
e1,e2,d1,s1
e1,e3,d1,s1
e1,e4,d1,s1
Right now I do this with a for loop over my data frame, constructing a new one on the fly, but I wondered whether there is a more "R" way of doing this?
Edit:
This is what I currently have. It works but is kind of ugly (and shows my still somewhat limited R skills):
filteredEmailsExpanded <- NULL
toCol <- 2
for (row in seq_len(nrow(filteredEmails))) {
  # split the To field on ";" and strip any spaces
  receivers <- gsub(" ", "", strsplit(filteredEmails[row, toCol], ";")[[1]])
  for (receiver in receivers) {
    newRow <- filteredEmails[row, ]
    newRow$To <- receiver
    # rbind returns a new object, so the result must be assigned back
    filteredEmailsExpanded <- rbind(filteredEmailsExpanded, newRow)
  }
}

How about you first expand your data frame (call it d), repeating the ith row n(i) times, where n(i) is the number of ';'-separated entries in d$To[i], and then replace d$To with those entries? I've added an extra row to your example data to illustrate this better.
d <- data.frame(
From = c("e1", "e5"),
To = c("e2;e3;e4", "e6;e7"),
Date = c("d1", "d2"),
Subject = c("s1", "s2"),
stringsAsFactors = FALSE)
v <- strsplit(d$To, ";")
lengths <- sapply(v, length)
d <- d[rep(1:nrow(d), lengths), ]
d$To <- unlist(v)
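The same idea can be wrapped in a small reusable helper. This is only a minimal sketch: the function name expand_rows is hypothetical (it is not part of base R or any package), it assumes the column to split is a character column, and it additionally trims stray spaces around each entry.

```r
# Hypothetical helper: one output row per separator-delimited entry in `col`.
expand_rows <- function(d, col, sep = ";") {
  parts <- strsplit(d[[col]], sep, fixed = TRUE)  # list of entries per row
  # repeat row i once for every entry found in row i
  out <- d[rep(seq_len(nrow(d)), lengths(parts)), , drop = FALSE]
  out[[col]] <- trimws(unlist(parts))             # one entry per repeated row
  rownames(out) <- NULL
  out
}

d <- data.frame(
  From = c("e1", "e5"),
  To = c("e2; e3;e4", "e6;e7"),
  Date = c("d1", "d2"),
  Subject = c("s1", "s2"),
  stringsAsFactors = FALSE)

res <- expand_rows(d, "To")
res$To
```

Note that a row whose field is an empty string produces no entries and is therefore dropped from the result.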

You may want to look at my "splitstackshape" package, in particular, the function concat.split.multiple which has a "long" argument.
Using @konvas's sample data, try:
library(splitstackshape)
concat.split.multiple(d, "To", ";", "long")
# From Date Subject time To
# 1 e1 d1 s1 1 e2
# 2 e5 d2 s2 1 e6
# 3 e1 d1 s1 2 e3
# 4 e5 d2 s2 2 e7
# 5 e1 d1 s1 3 e4
# 6 e5 d2 s2 3 <NA>
Alternatively, check out its successor function (which hasn't yet made it into the package). The successor is presently called cSplit and is available as a Gist. It is much faster but just as easy to use:
## cSplit(indt = d, splitCols = "To", sep = ";", direction = "long")
cSplit(d, "To", ";", "long")
# From To Date Subject
# 1: e1 e2 d1 s1
# 2: e1 e3 d1 s1
# 3: e1 e4 d1 s1
# 4: e5 e6 d2 s2
# 5: e5 e7 d2 s2

Related

Replacing Column Name with Loop in R

I have three data frames, P1, P2 and P3, with three columns each. I want to rename the second column of each data frame to D1, D2 and D3 respectively with a loop, but nothing is working. What am I missing?
C1 <- c(12,34,22)
C2 <- c(43,86,82)
C3 <- c(98,76,25)
C4 <- c(12,34,22)
C5 <- c(43,86,82)
C6 <- c(98,76,25)
C7 <- c(12,34,22)
C8 <- c(43,86,82)
C9 <- c(98,76,25)
P1 <- data.frame(C1,C2,C3)
P2 <- data.frame(C4,C5,C6)
P3 <- data.frame(C7,C8,C9)
x <- c("P1", "P2", "P3")
b <- c("D1","D2","D3")
for (V in b){
names(x)[2] <- "V"
}
The output I would expect is:
P1 <- data.frame(C1,D1,C3)
P2 <- data.frame(C4,D2,C6)
P3 <- data.frame(C7,D3,C9)
We can use mget to fetch the objects named in the character vector as a list, use Map to rename the second column of each list element with the corresponding value of 'b', and then use list2env to update those objects in the global environment.
list2env(Map(function(x, y) {names(x)[2] <- y; x}, mget(x), b), .GlobalEnv)
Output:
P1
# C1 D1 C3
#1 12 43 98
#2 34 86 76
#3 22 82 25
P2
# C4 D2 C6
#1 12 43 98
#2 34 86 76
#3 22 82 25
P3
# C7 D3 C9
#1 12 43 98
#2 34 86 76
#3 22 82 25
To understand the code, the first step is mget on the vector of strings
mget(x)
which returns a named list of data.frames.
Then we pass this list to Map along with the vector 'b'. Each element of the list is one unit, and likewise each element of the vector is one unit, so they are processed in pairs
Map(function(x, y) x, mget(x), b)
function(x, y) is an anonymous (lambda) function. Inside it we set the name of the second column to the corresponding element of 'b', which the function receives as its argument 'y'. Finally, we wrap everything in list2env: because the result is a named list, list2env assigns each element to the object of the same name in the global environment, which updates the original data frames.
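The three steps can also be run one at a time. The sketch below does so in a separate environment (the env object is an assumption made for this illustration, so that the global workspace is left untouched); the argument names df and new_name are likewise illustrative.

```r
# Scratch environment holding the three data frames (illustration only).
env <- new.env()
env$P1 <- data.frame(C1 = c(12, 34, 22), C2 = c(43, 86, 82), C3 = c(98, 76, 25))
env$P2 <- data.frame(C4 = c(12, 34, 22), C5 = c(43, 86, 82), C6 = c(98, 76, 25))
env$P3 <- data.frame(C7 = c(12, 34, 22), C8 = c(43, 86, 82), C9 = c(98, 76, 25))

x <- c("P1", "P2", "P3")
b <- c("D1", "D2", "D3")

# Step 1: look the objects up by name; the result is a named list.
dfs <- mget(x, envir = env)

# Step 2: pair each data frame with its new column name and rename column 2.
renamed <- Map(function(df, new_name) {
  names(df)[2] <- new_name
  df
}, dfs, b)

# Step 3: write the list elements back under their own names.
list2env(renamed, envir = env)

names(env$P1)
```

Replacing env with .GlobalEnv in the last step gives the behaviour of the one-liner above.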
It is usually best to work with lists for this type of thing. That can make life easier and is generally a good workflow to learn.
list1 <- list(P1 = P1, P2 = P2, P3 = P3)
# here is your loop
for (i in seq_along(list1)) {
names(list1[[i]])[2] <- b[i]
}
# you can use Map as well for iteration, similar to @akrun's solution
# this is really identical at this point, except you've created the list differently
Map(function(df, b) {names(df)[2] <- b; df}, list1, b)

iterating a loop to give a new value if the previous value already exists in a data frame in R

I have a data frame with this information here:
df <- data.frame("string1" = c("ABECDE","ABECDE","ABECDE"),
"string2" = c("ABCD","ABCD","ABCD"),
"site1" = NA, "site2" = NA, "combine" = NA, "filtered" = NA)
I would like to write code that picks sites E and D in the strings and adds them to the data frame.
If the combination has already been created, I'd like it to go back, choose a new combination, and check again until it gets one that has not been picked.
I have provided below the code I have done so far which gives the output of:
string1 string2 site1 site2 combine filtered
1 ABECDE ABCD E3 D4 E3D4 E3D4
2 ABECDE ABCD E3 D4 E3D4 <NA>
3 ABECDE ABCD E3 D4 E3D4 <NA>
Here, E3D4 is the value you get when it first goes through the function.
I would now like for it to go back and pick the next possible combinations:
E6D4 and D5D4 for the next two lines but I have no idea how to properly structure the iteration.
Here is the code I have so far (there is probably a less redundant way to write it but I am a beginner so apologies if it is overly long)
#make the columns of string1 and string2 into vectors
string1 <- df$string1
string2 <- df$string2
#for each string in the vector check to see first if it has an E, if not, then a D
#get the output as a letter and its position (eg E3)
for (i in 1:nrow(df)){
  if (grepl("E", string1[i])){
    sites1 <- gregexpr('E', string1[i])
    # assign only row i, not the whole column
    df$site1[i] <- paste0(substring(string1[i], sites1[[1]][1], sites1[[1]][1]), sites1[[1]][1])
  } else if (grepl("D", string1[i])){
    sites1 <- gregexpr('D', string1[i])
    df$site1[i] <- paste0(substring(string1[i], sites1[[1]][1], sites1[[1]][1]), sites1[[1]][1])
  }
}
#do the same for the second vector
for (i in 1:nrow(df)){
  if (grepl("E", string2[i])){
    sites2 <- gregexpr('E', string2[i])
    df$site2[i] <- paste0(substring(string2[i], sites2[[1]][1], sites2[[1]][1]), sites2[[1]][1])
  } else if (grepl("D", string2[i])){
    sites2 <- gregexpr('D', string2[i])
    df$site2[i] <- paste0(substring(string2[i], sites2[[1]][1], sites2[[1]][1]), sites2[[1]][1])
  }
}
#combine the sites
df$combine <- paste0(df$site1, df$site2)
#for each row of combined sites, check to see if the value is already created
for (i in 1:nrow(df)){
if(!df$combine[i] %in% df$filtered){
df$filtered[i] <- df$combine[i]
} else if(df$combine[i] %in% df$filtered){
#go back to for loop and look for either another E in the list
#if there is none, go to the next condition (looking for a D).
#pick the next possible values, put them together and check again
#do this continuously until you get a unique combine.
#do this for string1 and then string2 (or alternating both, which ever is easier)
}
}
Perhaps you could simplify and try the following.
Create a custom function, that will detect all positions of "D" and "E" in your strings. Then use expand.grid to get all combinations of these positions. In your example data, this will include combinations of positions 3, 5, 6 with position 4 (in the end 3 combinations: (3, 4), (5, 4), and (6, 4)).
Then you can go through each of these combinations and build the desired strings by pasting the letter found at each position together with the position number. A list holds these results, which are assembled at the end with rbind.
There are a few questions that remain, including if there are situations when there are no "D" or "E" letters found.
my_fun <- function(x) {
p1 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string1"]])))
p2 <- as.numeric(unlist(gregexpr(pattern = 'D|E', x[["string2"]])))
cbn <- expand.grid(p1, p2)
lst <- list()
for (i in seq_len(nrow(cbn))) {
site1 <- paste0(substr(x[["string1"]], cbn[i, "Var1"], cbn[i, "Var1"]), cbn[i, "Var1"])
site2 <- paste0(substr(x[["string2"]], cbn[i, "Var2"], cbn[i, "Var2"]), cbn[i, "Var2"])
lst[[i]] <- c(string1 = x[["string1"]], string2 = x[["string2"]], site1 = site1, site2 = site2, combine = paste0(site1, site2))
}
return(as.data.frame(do.call("rbind", lst)))
}
do.call(rbind, apply(df, 1, my_fun))
I created example data to test this out:
string1 string2 site1 site2 combine filtered
1 ABECDE ABCD NA NA NA NA
2 AABCDE ABCE NA NA NA NA
3 ABCDDE ACDD NA NA NA NA
Which would give the following output:
string1 string2 site1 site2 combine
1 ABECDE ABCD E3 D4 E3D4
2 ABECDE ABCD D5 D4 D5D4
3 ABECDE ABCD E6 D4 E6D4
4 AABCDE ABCE D5 E4 D5E4
5 AABCDE ABCE E6 E4 E6E4
6 ABCDDE ACDD D4 D3 D4D3
7 ABCDDE ACDD D5 D3 D5D3
8 ABCDDE ACDD E6 D3 E6D3
9 ABCDDE ACDD D4 D4 D4D4
10 ABCDDE ACDD D5 D4 D5D4
11 ABCDDE ACDD E6 D4 E6D4
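To get back to one unique combination per row, as in the question, the candidates can be consumed in order while keeping track of what has already been used. The following is a rough base R sketch under stated assumptions: the helper names find_sites and pick_unique are hypothetical, "E" sites are tried before "D" sites (the preference described in the question), and a row is left as NA when every candidate is already taken.

```r
# Hypothetical helper: all "D"/"E" sites of a string as letter + position,
# with "E" sites listed before "D" sites.
find_sites <- function(s) {
  pos <- as.integer(gregexpr("[DE]", s)[[1]])
  if (pos[1] == -1) return(character(0))   # neither D nor E present
  found <- substring(s, pos, pos)
  sites <- paste0(found, pos)
  sites[order(found != "E")]               # E first, positions stay ascending
}

# Hypothetical helper: give each row the first combination not used so far.
pick_unique <- function(df) {
  used <- character(0)
  df$site1 <- NA_character_
  df$site2 <- NA_character_
  df$combine <- NA_character_
  for (i in seq_len(nrow(df))) {
    cand <- expand.grid(site1 = find_sites(df$string1[i]),
                        site2 = find_sites(df$string2[i]),
                        stringsAsFactors = FALSE)
    combos <- paste0(cand$site1, cand$site2)
    free <- which(!combos %in% used)
    if (length(free) > 0) {
      j <- free[1]
      df$site1[i] <- cand$site1[j]
      df$site2[i] <- cand$site2[j]
      df$combine[i] <- combos[j]
      used <- c(used, combos[j])
    }
  }
  df
}

df <- data.frame(string1 = rep("ABECDE", 3), string2 = rep("ABCD", 3),
                 stringsAsFactors = FALSE)
res <- pick_unique(df)
res$combine
# [1] "E3D4" "E6D4" "D5D4"
```

With a fourth identical row there would be no unused combination left, so that row would keep NA in site1, site2 and combine.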

Add column to df that's the output of a function that uses different column values combined to be a vector input

This is a very simplified version of my actual problem.
My real df has many columns and I need to perform this action using a select from a character vector of column names.
library(tidyverse)
df <- data.frame(a1 = c(1:5),
b1 = c(3,1,3,4,6),
c1 = c(10:14),
a2 = c(9:13),
b2 = c(3:7),
c2 = c(15:19))
df
a1 b1 c1 a2 b2 c2
1 1 3 10 9 3 15
2 2 1 11 10 4 16
3 3 3 12 11 5 17
4 4 4 13 12 6 18
5 5 6 14 13 7 19
Let's say I wanted to get the cor for each row for selected columns using mutate - I tried:
df %>%
mutate(my_cor = cor(x = c(a1,b1,c1), y = c(a2,b2,c2)))
but this doesn't work as it uses the full column of data for each column header input.
The first row of the my_cor column of the output df from above should be the calculation:
cor(x = c(1,3,10), y = c(9,3,15))
And the next row should be:
cor(x = c(2,1,11), y = c(10,4,16))
and so on. The actual function I'm using is more complex but it does take two vector inputs like cor does so I figured this would be a good proxy.
I have a feeling I should be using purrr for this action (similar to this post) but I haven't gotten it to work.
Bonus: The actual problem I'm facing uses a function that takes many different columns, so I'd like to be able to select them from a character vector like my_list_of_cols <- c("a1", "b1", "c1") (my true list is much longer).
I suspect I'd be using pmap_dbl like the post I linked to but I can't get it to work - I tried something like...
mutate(my_col = pmap_dbl(select(., var = my_list_of_cols), somefunction))
(note that somefunction in the above portion takes two vector inputs, but one of them is static and pre-defined; you can assume the vector c(a2, b2, c2) is the static and pre-defined one, like:
somefunction <- function(a1,b1,c1){
a2 = 1
b2 = 4
c2 = 5
my_vec = c(a2, b2, c2)
cor(x = c(a1,b1,c1), y = my_vec)
}
)
I'm still learning how to use purrr so any help would be greatly appreciated!
Here is one option: store both sets of column names in character vectors, pass them to select, and let pmap_dbl compute the correlation row by row.
library(tidyverse)
my_list_of_cols <- c("a1", "b1", "c1")
another_list_cols <- c("a2", "b2", "c2")
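df %>%
  mutate(my_cor = pmap_dbl(
    # all_of() makes the selection by character vector explicit
    select(., all_of(c(my_list_of_cols, another_list_cols))),
    # c(...) collects the values of one row into a named vector
    ~ c(...) %>%
      {cor(.[my_list_of_cols], .[another_list_cols])}
  ))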
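For comparison, the same row-wise calculation can be written in base R without purrr. This is a sketch rather than a drop-in replacement: the names x_cols, y_cols and row_cor are made up for the example, and cor stands in for the more complex two-vector function mentioned in the question.

```r
df <- data.frame(a1 = 1:5,
                 b1 = c(3, 1, 3, 4, 6),
                 c1 = 10:14,
                 a2 = 9:13,
                 b2 = 3:7,
                 c2 = 15:19)

x_cols <- c("a1", "b1", "c1")   # columns forming the first vector
y_cols <- c("a2", "b2", "c2")   # columns forming the second vector

# Correlation between the two groups of columns within row i.
row_cor <- function(i) {
  cor(unlist(df[i, x_cols]), unlist(df[i, y_cols]))
}

# vapply guarantees one numeric value per row.
df$my_cor <- vapply(seq_len(nrow(df)), row_cor, numeric(1))
```

The first element of df$my_cor is then cor(c(1, 3, 10), c(9, 3, 15)), which is the value the question asks for in row 1.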

Sort a data.frame with columns of another data.frame

I have the following data.frames:
d1 <- data.frame(A=c(-1,-1,1,1,-1,-1,1,1), B=c(-1,1,-1,1,-1,1,1,-1), Y=c(2,3,4,5,8,9,10,11))
d2 <- data.frame(A=c(1,1,-1,-1,1,-1,1,-1), B=c(-1,1,1,-1,-1,1,1,-1))
I would like to add the column Y to the d2 data.frame. I've tried using the merge function, but that duplicates rows in the result.
I've also tried to use the functions order and match to order the first data.frame by the columns of the second one:
d1[order(match(
paste(d1$A,d1$B),
paste(d2$A,d2$B))
),]
But it doesn't work and I don't know why.
If I understand correctly, this would give you the results you're after:
# Sort d1 and d2 with columns A and B
d1 <- d1[order(d1$A,d1$B),]
d2 <- d2[order(d2$A,d2$B),]
# Copy Y from d1 to d2
d2$Y <- d1$Y
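# Restore original order in d1 and d2
# (row names are character strings, so convert them to numbers before sorting;
# otherwise "10" would sort before "2" once there are 10 or more rows)
d1 <- d1[order(as.numeric(rownames(d1))),]
d2 <- d2[order(as.numeric(rownames(d2))),]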
d2$Y
#[1] 4 5 3 2 11 9 10 8
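In case d1 and d2 have a different number of rows:
chosen <- numeric(length = nrow(d2))
choices <- paste(d1$A, d1$B)
for(i in seq_len(nrow(d2))){
  # first still-unused row of d1 with the same (A, B) pair as row i of d2
  chosen[i] <- match(paste(d2$A[i], d2$B[i]), choices)
  # blank out the rows already used so later duplicates pick the next match
  choices[chosen] <- 0
}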
d2$Y <- d1[chosen, "Y"]
d2$Y
#[1] 4 5 3 2 11 9 10 8
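The loop can be avoided by making the duplicated keys unique first. In the sketch below each row is keyed by its (A, B) pair plus a counter saying how many times that pair has occurred so far, so a single vectorised match lines the rows up. The helper name occurrence_key is hypothetical.

```r
d1 <- data.frame(A = c(-1, -1, 1, 1, -1, -1, 1, 1),
                 B = c(-1, 1, -1, 1, -1, 1, 1, -1),
                 Y = c(2, 3, 4, 5, 8, 9, 10, 11))
d2 <- data.frame(A = c(1, 1, -1, -1, 1, -1, 1, -1),
                 B = c(-1, 1, 1, -1, -1, 1, 1, -1))

# Hypothetical helper: "<A> <B> <n>", where n counts repeats of the pair.
occurrence_key <- function(d) {
  pair <- paste(d$A, d$B)
  nth <- ave(seq_along(pair), pair, FUN = seq_along)  # 1, 2, ... within each pair
  paste(pair, nth)
}

# The n-th occurrence of a pair in d2 takes Y from the n-th occurrence in d1.
d2$Y <- d1$Y[match(occurrence_key(d2), occurrence_key(d1))]
d2$Y
# [1]  4  5  3  2 11  9 10  8
```

Rows of d2 that have no counterpart in d1 get NA in Y.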

subset rows and columns in a dataframe based on boundary conditions

I have some trouble expressing myself, which is probably why I haven't found anything that helps yet. The example should make clear what I want.
Suppose I have an m x m matrix structure of coordinates. Let's say it ranges from A1 to E5, and I want to subset the rows/columns that are k lines away from the outer coordinates.
In my example k is 2. So I want to select all records in the data frame which have the coordinates B2, B3, B4, C2, C4, D2, D3, D4. Manually, I would do the following:
cc <- data.frame(x=(LETTERS[1:5]), y=c(rep(1,5),rep(2,5),rep(3,5), rep(4,5), rep(5,5)) , z=rnorm(25))
slct <- with(cc, which( (x=="B" | x=="C" | x=="D" ) & (y==2 | y==3 | y==4) & !(x=="C" & y==3) ))
cc[slct,] # result data frame
But this approach does not scale as the matrix dimensions increase. Any better ideas?
Rather hard to read but it does the trick.
m <- 5 # Matrix dimensions
k <- 2 # The index of the inner square that you want to extract
cc[(cc$x %in% LETTERS[c(k,m-k+1)] & !cc$y %in% c(1:(k-1), m:(m-k+2))) |
(cc$y %in% c(k, m-k+1) & !cc$x %in% LETTERS[c(1:(k-1), m:(m-k+2))]),]
The first line of comparisons extracts the k-th column from the left and right edges of the matrix, excluding the parts that are closer than k to the upper and lower edges. The second line does the same thing for rows.
cc$xy <- paste0(cc$x,cc$y)
coords <- c("B2","B3","B4", "C2", "C4", "D2", "D3", "D4")
cc[cc$xy %in% coords,]
# x y z xy
#7 B 2 -0.9031472 B2
#8 C 2 -0.1405147 C2
#9 D 2 1.6017619 D2
#12 B 3 1.7713041 B3
#14 D 3 -0.2005749 D3
#17 B 4 1.8671238 B4
#18 C 4 0.3428815 C4
#19 D 4 0.1470436 D4
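The ring selection can also be expressed without listing coordinates, by computing how far each cell is from the nearest edge. This is a sketch under the assumption that the columns are labelled with the first m capital letters; the names ix, depth and ring are made up for the example, and set.seed is only there to make the random z values reproducible.

```r
set.seed(1)
m <- 5   # matrix dimension
k <- 2   # which ring to extract (1 = outer border)

cc <- data.frame(x = rep(LETTERS[1:m], times = m),
                 y = rep(1:m, each = m),
                 z = rnorm(m * m),
                 stringsAsFactors = FALSE)

ix <- match(cc$x, LETTERS)   # column letter -> column index

# Distance to the nearest edge, counting the border itself as 1.
depth <- pmin(ix, cc$y, m + 1 - ix, m + 1 - cc$y)

ring <- cc[depth == k, ]
paste0(ring$x, ring$y)
# [1] "B2" "C2" "D2" "B3" "D3" "B4" "C4" "D4"
```

Changing m and k is enough to extract any ring from a larger grid, as long as m does not exceed 26.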
