R string split manipulation in a data frame does not work

Here is a simple test case.
I was planning to split each string and extract only the first part.
library(dplyr)
library(stringr)
test = data.frame(x= c('a b', 'c d'),stringsAsFactors = F)
test
x
1 a b
2 c d
test %>% mutate(y = str_split(x,'\\s+')[[1]][1])
x y
1 a b a
2 c d a
I was expecting something like:
x y
1 a b a
2 c d c

Nowadays there are various packaged functions for splitting columns into pieces. Here you could use the separate() function from the tidyr package. Since you want the first value of a split on the spaces, you can just remove everything after the first space.
tidyr::separate(test, x, "y", "\\s.*", FALSE, extra = "drop")
# x y
# 1 a b a
# 2 c d c

str_split returns a list in which each element corresponds to one element of the original atomic vector. [[1]][1] therefore always picks the first piece of the first string, which is then recycled to every row. You need lapply or similar to index each list element:
test %>% mutate(y = unlist(lapply(str_split(x,'\\s+'),'[[',1)))
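The same idea can be sketched in base R, assuming strsplit is an acceptable stand-in for str_split here. vapply is used instead of unlist(lapply(...)) so that the result is guaranteed to be one string per row; the variable names are only illustrative.

```r
# Illustrative data, same strings as in the question
x <- c("a b", "c d")

# strsplit returns a list with one character vector per input string
parts <- strsplit(x, "\\s+")

# Take the first piece of every list element, one string per element
first <- vapply(parts, function(p) p[1], character(1))
first
```

Assigning first to a new column (for example test$y <- first) gives the expected per-row result.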

We can also use sub to remove everything from the first whitespace onward:
library(data.table)
setDT(test)[, y:= sub('\\s+.*', '', x)]
test
# x y
#1: a b a
#2: c d c

Related

Special reshape in R

Consider a 3x3 char dataframe:
example <- data.frame(one = c("a","b","c"),
two = c("a","b","b"),
three = c ("c","a","b"))
I want to resize these data to 6x2 and add the following content:
desired <- data.frame(one = c("a","a","b","b",
"c","b"),
two = c("a","c","b","a","b","b"))
For the original example dataframe, I want to rbind() the contents of example[,2:3] beneath each row index.
This can be achieved by:
ex <- as.matrix(example)
des <- as.data.frame(rbind(ex[,1:2], ex[,2:3]))
Maybe using library(tidyverse) for an arbitrary number of columns would be nicer?
For each pair of columns, transpose the sub-data.frame defined by them and coerce to vector. Then coerce to data.frame and set the result's names.
The code that follows should scale, since it does not hard-code the number of columns.
desired2 <- as.data.frame(
lapply(seq(names(example))[-1], \(k) c(t(example[(k-1):k])))
)
names(desired2) <- names(example)[-ncol(example)]
identical(desired, desired2)
#[1] TRUE
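To see why transposing and then coercing to a vector interleaves the rows, here is a small step-by-step sketch for the first pair of columns only. The intermediate names pair, tp and flat are made up for illustration.

```r
example <- data.frame(one   = c("a", "b", "c"),
                      two   = c("a", "b", "b"),
                      three = c("c", "a", "b"))

pair <- example[1:2]  # columns "one" and "two"
tp   <- t(pair)       # 2 x 3 character matrix: one row per original column
flat <- c(tp)         # matrices are read column by column, so values interleave
flat
```

Each column of tp holds one original row, so reading it column by column yields one[1], two[1], one[2], two[2], and so on.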
The code above rewritten as a function.
reformat <- function(x){
y <- as.data.frame(
lapply(seq(names(x))[-1], \(k) c(t(x[(k-1):k])))
)
names(y) <- names(x)[-ncol(x)]
y
}
reformat(example)
example %>% reformat()
Another example, with 6 columns input.
ex1 <- example
ex2 <- example
names(ex2) <- c("fourth", "fifth", "sixth")
ex <- cbind(ex1, ex2)
reformat(ex)
ex %>% reformat()
A tidyverse approach using tidyr::pivot_longer may look like so:
library(dplyr)
library(tidyr)
pivot_longer(example, -one, values_to = "two") %>%
select(-name)
#> # A tibble: 6 × 2
#> one two
#> <chr> <chr>
#> 1 a a
#> 2 a c
#> 3 b b
#> 4 b a
#> 5 c b
#> 6 c b
A base-R solution with Map:
# Iterate over example$one, example$two, and example$three in parallel,
# building a two-row data frame for each input row.
mylist <- Map(function(x, y, z) {
  data.frame(one = c(x, y), two = c(y, z))
},
example$one,   # x
example$two,   # y
example$three) # z
do.call(rbind, mylist)
one two
a.1 a a
a.2 a c
b.1 b b
b.2 b a
c.1 c b
c.2 b b

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done on producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of the expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to insert new factor levels (labels) into df1 that don't exist there. R doesn't know what to do with them, so it just inserts NA values.
As r2evans shows in his answer, stringsAsFactors = FALSE stops strings from being converted to factors. You can even disable the conversion for the whole session with options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical vector indicating which elements of df1$x are found in df2$x - i.e. the first two and not the last two. But it doesn't tell you which positions in the first vector correspond to which positions in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
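A short sketch of the difference, using the question's data with the rows of df2 already swapped. The names idx, broken and fixed are only illustrative, and stringsAsFactors = FALSE is set explicitly so the factor issue does not interfere.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(2, 1), y = c("g", "f"),
                  stringsAsFactors = FALSE)  # rows swapped

# %in% and which() only say WHERE in df1 a match exists
in_df2 <- df1$x %in% df2$x
broken <- df1
broken$y[in_df2] <- df2$y[which(in_df2)]  # takes df2 rows 1 and 2 in order

# match() says WHICH row of df2 belongs to each row of df1 (NA if none)
idx <- match(df1$x, df2$x)
fixed <- df1
fixed$y[!is.na(idx)] <- df2$y[idx[!is.na(idx)]]

broken$y
fixed$y
```

Only the match-based version pairs x = 1 with "f" and x = 2 with "g" regardless of the row order in df2.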
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table: we join on the 'x' column and assign the values of 'y' from the second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (the default since R 4.0.0); otherwise both datasets need to share all factor levels.

How can I use one column to determine where I get the value for another column?

I'm trying to use one column to determine which column to use as the value for another column. It looks something like this:
X Y Z Target
1 a b c X
2 d e f Y
3 g h i Z
And I want something that looks like this:
X Y Z Target TargetValue
1 a b c X a
2 d e f Y e
3 g h i Z i
Where each TargetValue is the value determined by the column specified by Target. I've been using dplyr a bit to get this to work. If I knew how to make the output of paste the input for mutate that would be great,
mutate(TargetWordFixed = (paste("WordMove",TargetWord,".rt", sep="")))
but maybe there is another way to do the same thing.
Be gentle, I'm new to both stackoverflow and R...
A vectorized approach would be to use matrix subsetting:
df %>% mutate(TargetValue = .[cbind(1:n(), match(Target, names(.)))])
# X Y Z Target TargetValue
#1 a b c X a
#2 d e f Y e
#3 g h i Z i
Or just using base R (same approach):
transform(df, TargetValue = df[cbind(1:nrow(df), match(Target, names(df)))])
Explanation:
match(Target, names(.)) computes the column indices of the entries in Target (which column is called X etc)
The . in the dplyr version refers to the data you "pipe" into the mutate statement with %>% (i.e. it refers to df)
df[cbind(1:nrow(df), match(df$Target, names(df)))] uses a two-column matrix to subset df to the correct values - the first column of the matrix is just the row numbers from 1 to the number of rows of df (therefore 1:nrow(df)), and the second column is the index of the column that holds the value of interest (computed by match(df$Target, names(df))).
The matrix that is produced for subsetting the example data is:
cbind(1:nrow(df), match(df$Target, names(df)))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
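As a self-contained sketch of the same matrix-subsetting idea, here is a variant where Target is deliberately not in column order, so the index matrix is no longer just the diagonal. The data values follow the question; the Target order and the name idx are assumptions made for illustration.

```r
df <- data.frame(X = c("a", "d", "g"),
                 Y = c("b", "e", "h"),
                 Z = c("c", "f", "i"),
                 Target = c("Z", "X", "Y"),
                 stringsAsFactors = FALSE)

# One (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), match(df$Target, names(df)))

# Indexing a data frame with a two-column matrix picks one cell per pair
df$TargetValue <- df[idx]
df
```

Row 1 takes its value from Z, row 2 from X, and row 3 from Y.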
You could try apply rowwise like this:
transform(df, TargetValue = apply(df, 1, function(x) x[x["Target"]]))
# X Y Z Target TargetValue
# 1 a b c X a
# 2 d e f Y e
# 3 g h i Z i
library(tidyverse)
df <-setNames(data.frame(cbind(matrix(letters[1:9],3,3,byrow=T), c("X", "Y", "Z"))), c("X", "Y", "Z", "Target"))
df
df %>%
gather(key="ID", value="TargetValue", X:Z) %>%
filter(ID==Target) %>%
select(Target, TargetValue) %>%
left_join(df, by="Target")

dplyr group_by and summarize for two df's with same column name

Suppose you have the following two data.frames:
set.seed(1)
x <- letters[1:10]
df1 <- data.frame(x)
z <- rnorm(20,100,10)
df2 <- data.frame(x,z)
(note that both dfs have a column named "x")
and you want to summarize the sums of df2$z for the groups of "x" in df1 like this:
df1 %.%
group_by(x) %.%
summarize(
z = sum(df2$z[df2$x == x])
)
This returns the error "invalid indextype integer" (translated).
But when I change the name of column "x" in any one of the two dfs, it works:
df2 <- data.frame(x1 = x,z) #column is now named "x1", it would also work if the name was changed in df1
df1 %.%
group_by(x) %.%
summarize(
z = sum(df2$z[df2$x1 == x])
)
# x z
#1 a 208.8533
#2 b 205.7349
#3 c 185.4313
#4 d 193.8058
#5 e 214.5444
#6 f 191.3460
#7 g 204.7124
#8 h 216.8216
#9 i 213.9700
#10 j 202.8851
I can imagine many situations, where you have two dfs with the same column name (like an "ID" column) for which this might be a problem, unless there is a simple way around it.
Did I miss something? There may be other ways to get to the same result for this example but I'm interested in understanding if this is possible in dplyr (or perhaps why not).
(the two dfs dont necessarily need to have the same unique "x" values as in this example)
Following the comment from @beginneR, I'm guessing it'd be something like:
inner_join(df1, df2) %.% group_by(x) %.% summarise(z=sum(z))
Joining by: "x"
Source: local data frame [10 x 2]
x z
1 a 208.8533
2 b 205.7349
3 c 185.4313
4 d 193.8058
5 e 214.5444
6 f 191.3460
7 g 204.7124
8 h 216.8216
9 i 213.9700
10 j 202.8851
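The join-then-summarise idea does not depend on dplyr. Below is a minimal base R sketch using merge and aggregate as stand-ins, with deterministic z values (1:20, an assumption for illustration) instead of the random ones from the question so the sums are easy to check by hand.

```r
x   <- letters[1:10]
df1 <- data.frame(x = x, stringsAsFactors = FALSE)
df2 <- data.frame(x = x, z = 1:20, stringsAsFactors = FALSE)  # x is recycled twice

# Join first, so there is only one column called "x" ...
joined <- merge(df1, df2, by = "x")

# ... then sum z within each group of x
res <- aggregate(z ~ x, data = joined, FUN = sum)
res
```

Because the shared column is resolved by the join, the name clash between df1$x and df2$x never arises inside the summary step.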
You can try:
df2 %.% filter(x %in% df1$x) %.% group_by(x) %.% summarise(sum(z))
hth

How to merge two columns in R with a specific symbol?

I have a table read in R as follows:
column1 column2
A B
What is the command to be used to match two columns together as follows?
Column 3
A_B
I'm a bit unsure what you mean by "merge", but is this what you mean?
> DF = data.frame(A = LETTERS[1:10], B = LETTERS[11:20])
> DF$C = paste(DF$A, DF$B, sep="_")
> head(DF)
A B C
1 A K A_K
2 B L B_L
3 C M C_M
4 D N D_N
Or equivalently, as @daroczig points out:
within(DF, C <- paste(A, B, sep='_'))
My personal favourite makes use of unite from tidyr:
set.seed(1)
df <- data.frame(colA = sample(LETTERS, 10),
colB = sample(LETTERS, 10))
# packs: pipe + unite
require(magrittr); require(tidyr)
# Unite
df %<>%
unite(ColAandB, colA, colB, remove = FALSE)
Results
> head(df, 3)
ColAandB colA colB
1 G_F G F
2 J_E J E
3 N_Q N Q
Side notes
Personally, I find the remove = TRUE / FALSE option of unite very useful. In addition, tidyr fits the dplyr workflow very well and pairs nicely with separate in case you change your mind about the columns being merged. Along the same lines, if NAs are a problem, adding na.omit to your workflow lets you conveniently drop the undesirable rows before creating the new column.
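A small sketch of the na.omit remark, written in base R with paste standing in for unite. The data frame and the name clean are made up for illustration.

```r
df <- data.frame(colA = c("G", NA, "N"),
                 colB = c("F", "E", NA),
                 stringsAsFactors = FALSE)

# Without cleaning, paste turns a missing value into the literal text "NA"
paste(df$colA, df$colB, sep = "_")

# Drop incomplete rows first, then build the combined column
clean <- na.omit(df)
clean$ColAandB <- paste(clean$colA, clean$colB, sep = "_")
clean
```

Whether dropping the rows is appropriate depends on the data; keeping them and handling NA explicitly is an equally valid choice.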
