Split data frame column by number of characters specified in another column

Sorry if this sounds trivial, but I have been stuck for a while with this.
I want to split a column of strings into two, splitting at the number of the character specified in another column:
dat <- tibble(x=c("ABCDEFG", "QRSTUVWXYZ", "FGYHGBJIOW"), y=c(4,3,8))
dat
A tibble: 3 x 2
x y
<chr> <dbl>
1 ABCDEFG 4
2 QRSTUVWXYZ 3
3 FGYHGBJIOW 8
Desired outcome:
x1 x2 y
-------------------------
ABCD EFG 4
QRS TUVWXYZ 3
FGYHGBJI OW 8
I have tried using tidyr::separate, which accepts a character position in sep =, but it won't take that position from another column. I also tried wrapping it in a function (https://dplyr.tidyverse.org/articles/programming.html), but sep = does not seem to accept a column name as its argument (https://tidyr.tidyverse.org/reference/separate.html).
Any help would be appreciated!

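A simple solution would be:
library(dplyr)
# substring() is vectorised over first and last, so y supplies a per-row split point
dat <- dat %>% mutate(x1 = substring(x, 1, y),
                      x2 = substring(x, y + 1, nchar(x)))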
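A minimal, self-contained sketch of the same idea in base R, assuming a plain data.frame rather than a tibble. The helper name split_at is hypothetical and introduced only for illustration. It relies on substr() recycling its start and stop arguments, so each row can use its own split point.

```r
dat <- data.frame(
  x = c("ABCDEFG", "QRSTUVWXYZ", "FGYHGBJIOW"),
  y = c(4, 3, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string s after its first n characters.
split_at <- function(s, n) {
  data.frame(
    x1 = substr(s, 1, n),
    x2 = substr(s, n + 1, nchar(s)),
    stringsAsFactors = FALSE
  )
}

# Combine the two pieces with the original y column.
res <- cbind(split_at(dat$x, dat$y), y = dat$y)
res
```

If y is larger than the length of a string, x1 holds the whole string and x2 is empty.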

Similar to @PinotTiger's solution, using within:
dat <- within(dat, {
x2 <- substring(x, y + 1, nchar(x))
x1 <- substring(x, 1, y)
rm(x)
})[c(2, 3, 1)]
dat
# x1 x2 y
# 1 ABCD EFG 4
# 2 QRS TUVWXYZ 3
# 3 FGYHGBJI OW 8

You can use str_extract from the stringr package and build the number of characters to extract into the pattern for each row:
dat$x1 <- str_extract(dat$x, paste0("\\w{",dat$y,"}"))
dat$x2 <- str_extract(dat$x, paste0("\\w{",nchar(dat$x) - dat$y,"}$"))
dat
# A tibble: 3 x 4
x y x1 x2
<chr> <dbl> <chr> <chr>
1 ABCDEFG 4 ABCD EFG
2 QRSTUVWXYZ 3 QRS TUVWXYZ
3 FGYHGBJIOW 8 FGYHGBJI OW

An option with separate, after inserting a delimiter at the position specified by 'y' with str_replace:
library(dplyr)
library(tidyr)
library(stringr)
dat %>%
mutate(x = str_replace(x, sprintf("(.{%d})", y), "\\1,")) %>%
separate(x, into = c('x1', 'x2'))

Related

Converting columns in dataframe within list?

What is the best way to convert a specific column in each list object to a specific format?
For instance, I have a list with four objects (each of which is a data frame) and I want to change column 3 in each data.frame from double to integer.
I'm guessing something along the lines of lapply, but I didn't know what specific syntax to use. I was trying:
lapply(df,function(x){as.numeric(var1(x))})
but it wasn't working.
Thanks!

Yes, lapply works well here:
lapply(listofdfs, function(df) { # loop through each data.frame in list
df[ , 3] <- as.integer(df[ , 3]) # make the 3rd column of type integer
df # return the new data.frame
})
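A hedged variant of the lapply approach, assuming the list may hold tibbles as well as plain data frames. With a tibble, df[ , 3] returns a one-column tibble rather than a vector, so as.integer() on it fails; indexing with [[ extracts the column as a vector in both cases. The list below is made-up sample data.

```r
# Made-up sample list of data frames; column 3 is stored as double.
listofdfs <- list(
  data.frame(a = c(1, 2, 3), b = c(4, 5, 6), d = c(7, 8, 9)),
  data.frame(a = c(0, 1), b = c(2, 3), d = c(10, 20))
)

converted <- lapply(listofdfs, function(df) {
  df[[3]] <- as.integer(df[[3]])  # [[ always yields a vector
  df                              # return the modified data frame
})

# Check the resulting column classes.
sapply(converted, function(df) class(df[[3]]))
```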

This is just an alternative to C. Braun's answer.
You can also use the map() function from the purrr package.
Input:
library(tidyverse)
df <- tibble(a = c(1, 2, 3), b =c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)
myList
Method:
map(myList, ~(.x %>% mutate_at(vars(3), funs(as.integer(.)))))
Output:
[[1]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[2]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[3]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
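funs() has been deprecated since dplyr 0.8.0, so on a current installation the same step might be written with across(), which was added in dplyr 1.0.0. This is a sketch under the assumption that dplyr (1.0.0 or later) and purrr are installed; the data is the same toy list as above.

```r
library(dplyr)
library(purrr)

df <- tibble(a = c(1, 2, 3), b = c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)

# across(3, ...) selects the third column by position.
converted <- map(myList, ~ mutate(.x, across(3, as.integer)))
converted[[1]]
```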

You can use this:
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
Simple example:
data <- data.frame(cbind(c("1","2","3","4",NA),c(1:5)),stringsAsFactors = F)
typeof(data[,1]) #character
dlist <- list(data,data,data)
coltochange <- 1
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
typeof(dlist[[1]][,1]) #character
typeof(dlist2[[1]][,1]) #double

Using dplyr mutate_at when a function takes multiple arguments which are different columns

I have a data.frame with a large number of columns whose names follow a pattern. Such as:
df <- data.frame(
x_1 = c(1, NA, 3),
x_2 = c(1, 2, 4),
y_1 = c(NA, 2, 1),
y_2 = c(5, 6, 7)
)
I would like to apply mutate_at to perform the same operation on each pair of columns. As in:
df %>%
mutate(
x = ifelse(is.na(x_1), x_2, x_1),
y = ifelse(is.na(y_1), y_2, y_1)
)
Is there a way I can do that with mutate_at/mutate_each?
This:
df %>%
mutate_each(vars(x_1, y_1), funs(ifelse(is.na(.), vars(x_2, y_2), .)))
and various variations I've tried all fail.
The question is similar to Using functions of multiple columns in a dplyr mutate_at call, but different in that the second argument to the function call is not a single column, but a different column for each column in vars.
Thanks in advance.

I don't know if you can get it that way, but here's a different perspective on the problem. If you find yourself with really wide data (e.g., tons of columns with similar names) and you want to do something with them, it might help to tidy the data (long in stata terms) with tidyr::gather (see docs here http://tidyr.tidyverse.org/).
> df %>% gather()
key value
1 x_1 1
2 x_1 NA
3 x_1 3
4 x_2 1
5 x_2 2
6 x_2 4
7 y_1 NA
8 y_1 2
9 y_1 1
10 y_2 5
11 y_2 6
12 y_2 7
After converting the data to this format, it's easier to combine and rearrange values using group_by instead of trying to mutate_at things. E.g., you can get the variable prefix with df %>% gather() %>% mutate(var = substr(key,1,1)) and then manipulate the xs and ys separately using group_by(var).

Old question, but I agree with Jesse that you need to tidy your data a bit. gather would be the way to go, but unlike stats::reshape it cannot take groups of columns to gather together. So here's a solution with reshape:
df %>%
reshape(varying = list(c("x_1", "y_1"), c("x_2", "y_2")),
times = c("x", "y"),
direction = "long") %>%
mutate(x = ifelse(is.na(x_1), x_2, x_1)) %>%
reshape(idvar = "id",
timevar = "time",
direction = "wide") %>%
rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
# id x_1 x_2 x y_1 y_2 y
# 1 1 1 1 1 NA 5 5
# 2 2 NA 2 2 2 6 2
# 3 3 3 4 3 1 7 1
In order to do that with any number of column pairs, you could do something like:
df2 <- setNames(cbind(df, df), c(t(outer(letters[23:26], 1:2, paste, sep = "_"))))
v <- split(names(df2), purrr::map_chr(names(df2), ~ gsub(".*_(.*)", "\\1", .)))
n <- unique(purrr::map_chr(names(df2), ~ gsub("_[0-9]+", "", .) ))
df2 %>%
reshape(varying = v,
times = n,
direction = "long") %>%
mutate(x = ifelse(is.na(!!sym(v[[1]][1])), !!sym(v[[2]][1]), !!sym(v[[1]][1]))) %>%
reshape(idvar = "id",
timevar = "time",
direction = "wide") %>%
rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
# id w_1 w_2 w x_1 x_2 x y_1 y_2 y z_1 z_2 z
# 1 1 1 1 1 NA 5 5 1 1 1 NA 5 5
# 2 2 NA 2 2 2 6 2 NA 2 2 2 6 2
# 3 3 3 4 3 1 7 1 3 4 3 1 7 1
This assumes that the columns to be compared are next to each other, that all columns with possible NA values are suffixed by _1, and that the replacement-value columns are suffixed by _2.
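Under the same naming assumption (NA-prone columns end in _1, replacement columns end in _2), a base R sketch that avoids reshaping altogether is a loop over the column prefixes. The variable name prefixes is illustrative; only the sample data comes from the question.

```r
df <- data.frame(
  x_1 = c(1, NA, 3), x_2 = c(1, 2, 4),
  y_1 = c(NA, 2, 1), y_2 = c(5, 6, 7)
)

# Strip the numeric suffix to find each pair's prefix ("x", "y").
prefixes <- unique(sub("_[0-9]+$", "", names(df)))

for (p in prefixes) {
  first  <- df[[paste0(p, "_1")]]
  second <- df[[paste0(p, "_2")]]
  # Use the _1 value unless it is NA, then fall back to _2.
  df[[p]] <- ifelse(is.na(first), second, first)
}
df
```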

When I asked this question, the answer was "you can't!" That's no longer the answer, since tidyr now supports pivot_wider and pivot_longer.
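One way this could look with pivot_longer and pivot_wider, as a sketch rather than a definitive recipe. It assumes dplyr and tidyr (1.0.0 or later) are installed; the helper column id and the intermediate names var and value are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x_1 = c(1, NA, 3), x_2 = c(1, 2, 4),
  y_1 = c(NA, 2, 1), y_2 = c(5, 6, 7)
)

combined <- df %>%
  mutate(id = row_number()) %>%
  # "x_1" becomes var = "x", with its value in a column named "1"
  pivot_longer(-id, names_to = c("var", ".value"), names_sep = "_") %>%
  # take the _1 value unless it is NA, then fall back to _2
  mutate(value = coalesce(`1`, `2`)) %>%
  select(id, var, value) %>%
  pivot_wider(names_from = var, values_from = value)

# Attach the new x and y columns to the original data.
result <- bind_cols(df, select(combined, -id))
result
```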

Conditional Subsetting based on column numbers

I need to subset data for when columns don't match. For example if I have an identifier in the first column X like 1 then all of the following examples in column Y should match:
X <- rep(1:4, times=2, each=2)
Y <- rep(c("Dave","Sam","Sam","Sam"))
Z <- as.data.frame(cbind(X,Y))
head(Z)
In this example I would like to keep the rows where X is 1 or 3, since the Y values within those groups do not all agree, and not the rows where X is 2. It would be great to have a function for this type of problem, which I have on a larger data frame.
Thanks,

With dplyr:
df <- data.frame(x = rep(1:4, times=2, each=2),
y = rep(c("Dave","Sam","Sam","Sam")))
library(dplyr)
df %>%
group_by(x) %>%
filter(any(!y == lag(y), na.rm = T))
#> Source: local data frame [8 x 2]
#> Groups: x [2]
#>
#> x y
#> <int> <fctr>
#> 1 1 Dave
#> 2 1 Sam
#> 3 3 Dave
#> 4 3 Sam
#> 5 1 Dave
#> 6 1 Sam
#> 7 3 Dave
#> 8 3 Sam
I tested a few cases, but I'm not sure it covers every edge case.
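For comparison, a base R sketch of the same filter that counts the distinct y values per x directly, instead of comparing neighbouring rows. The names n_names, mixed_ids and mixed are illustrative.

```r
df <- data.frame(x = rep(1:4, times = 2, each = 2),
                 y = rep(c("Dave", "Sam", "Sam", "Sam"), times = 4))

# Number of distinct y values within each x group.
n_names <- tapply(df$y, df$x, function(v) length(unique(v)))

# Keep the groups whose y values do not all agree.
mixed_ids <- names(n_names)[n_names > 1]
mixed <- df[as.character(df$x) %in% mixed_ids, ]
mixed
```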

This is the way I would do it, though there may be a more elegant way. Is this what you need?
X <- rep(1:4, times=2, each=2)
Y <- rep(c("Dave","Sam","Sam","Sam"))
Z <- as.data.frame(cbind(X,Y))
head(Z)
# First Create Concatenated column
Z$XY <- paste(Z$X, Z$Y)
# Eliminate all duplicates
Z_unique <- unique(Z)
# Find number of occurrences of each X value
n_occur <- data.frame(table(Z_unique$X))
# Pull only those that have occurred more than once
n_occur[n_occur$Freq > 1,]
# Subset the output to only those values
output <- Z[Z$X %in% n_occur$Var1[n_occur$Freq > 1],]

We can use data.table
library(data.table)
setDT(df)[, .SD[any(!y == shift(y))], x]
# x y
#1: 1 Dave
#2: 1 Sam
#3: 1 Dave
#4: 1 Sam
#5: 3 Dave
#6: 3 Sam
#7: 3 Dave
#8: 3 Sam
data
df <- data.frame(x = rep(1:4, times=2, each=2),
y = rep(c("Dave","Sam","Sam","Sam")))

Concatenate rows and columns

I have a data set like this
x y z
a 5 4
b 1 2
And I want to concatenate the row labels with the column names:
ay 5
az 4
by 1
bz 2
Thanks

You can use melt and paste, but you will need to make your rownames a variable, i.e.
df$new <- rownames(df)
m_df <- reshape2::melt(df)
rownames(m_df) <- paste0(m_df$new, m_df$variable)
m_df <- m_df[-c(1:2)]
m_df
# value
#ax 5
#bx 1
#ay 4
#by 2
#az 3
#bz 1
After your edit, you don't need to convert rownames to a variable so just,
m1_df <- reshape2::melt(df)
m1_df$new <- paste0(m1_df$x, m1_df$variable)
m1_df
# x variable value new
#1 a y 5 ay
#2 b y 1 by
#3 a z 4 az
#4 b z 2 bz
You can then tidy your data frame to required output

With dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
gather(var, val, -x) %>%
mutate(var=paste0(x, var)) %>%
select(var, val)%>%
arrange(var)
# var val
#1 ay 5
#2 az 4
#3 by 1
#4 bz 2

library(reshape2)
library(dplyr)
library(tibble)
library(stringr)
# Create dataframe
x <- data.frame(x = c(5, 1),
y = c(4, 2),
z = c(3, 1),
row.names = c('a', 'b'))
# Convert rowname to column and melt
x <- tibble::rownames_to_column(x, "rownames") %>%
melt('rownames')
# assign concat columns as rownames
row.names(x) <- str_c(x$rownames, x$variable)
# Select relevant columns only
x <- select(x, value)
# Remove names from dataframe
names(x) <- NULL
> x
ax 5
bx 1
ay 4
by 2
az 3
bz 1

Here is another option in base R
stack(setNames(as.list(unlist(df1[-1])), outer(df1$x, names(df1)[-1], paste0)))[2:1]
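The same idea can be spelled out step by step, which may be easier to adapt. This sketch assumes df1 has the labels in a column named x, as in the edited question; value_cols and out are illustrative names.

```r
df1 <- data.frame(x = c("a", "b"), y = c(5, 1), z = c(4, 2),
                  stringsAsFactors = FALSE)

value_cols <- names(df1)[-1]  # "y" "z"

out <- data.frame(
  # pair every row label with every value column name
  name = paste0(rep(df1$x, times = length(value_cols)),
                rep(value_cols, each = nrow(df1))),
  # unlist() walks the columns in the same order: all of y, then all of z
  value = unlist(df1[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Sort so that the entries for each row label sit together.
out <- out[order(out$name), ]
out
```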

forloop inside dplyr mutate

I would like to do a few column operations using mutate in a more elegant way, as I have more than 200 columns in my table that I would like to transform using mutate.
Here is an example.
Sample data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))
sample code:
dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))

Despite the tangent, the initial recommendation to use tidyr functions is the way to go. This pipe of functions seems to do the job based on what you've provided.
Your data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
c_n=rnorm(20))
library(dplyr)
library(tidyr)
This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.
variableTransform <- tibble(
cnum = paste0("c", 1:12),
cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
# cnum cvar
# <chr> <chr>
# 1 c1 a1
# 2 c2 a1
# 3 c3 a1
# 4 c4 a2
# 5 c5 a2
# 6 c6 a2
Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
tidyr::spread(cvar, value) %>%
mutate(
ans = a1 * (a2/c_n) + a3 * (a4/c_n)
) %>%
head
# treatment c_n xyz a1 a2 a3 a4 ans
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036
First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
# treatment c_n cnum value
# 1 a 0.20745647 c1_x -0.1250222
# 2 b 0.01015871 c1_x -0.4585088
# 3 a 1.65671028 c1_x -0.2455927
# 4 b -0.24037137 c1_x 0.6219516
# 5 a -1.16092349 c1_x -0.3716138
# 6 b 1.61191700 c1_x 1.7605452
It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
# treatment c_n cnum xyz value
# 1 a 0.20745647 c1 x -0.1250222
# 2 b 0.01015871 c1 x -0.4585088
# 3 a 1.65671028 c1 x -0.2455927
# 4 b -0.24037137 c1 x 0.6219516
# 5 a -1.16092349 c1 x -0.3716138
# 6 b 1.61191700 c1 x 1.7605452
From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
# treatment c_n xyz value cvar
# 1 a 0.20745647 x -0.1250222 a1
# 2 b 0.01015871 x -0.4585088 a1
# 3 a 1.65671028 x -0.2455927 a1
# 4 b -0.24037137 x 0.6219516 a1
# 5 a -1.16092349 x -0.3716138 a1
# 6 b 1.61191700 x 1.7605452 a1
Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (Gathering first and then spreading helps me keep things organized and well named. I'm confident somebody can come up with another way to do it.)
tidyr::spread(cvar, value) %>% head
# treatment c_n xyz a1 a2 a3 a4
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095
From here, we just need to mutate to get the right answer.
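If reshaping feels heavy for this, a base R loop over the suffixes is another possibility. The sketch below assumes each suffix (x, y, z) has exactly four numbered columns and that the formula from the question applies to them in numeric order; the sample data is regenerated with a seed, and the names vals and cols are illustrative.

```r
set.seed(42)
vals <- as.data.frame(matrix(rnorm(20 * 12), nrow = 20))
# c1_x, c2_y, c3_z, c4_x, ..., c12_z, as in the question
names(vals) <- paste0("c", 1:12, "_", rep(c("x", "y", "z"), times = 4))
df <- cbind(treatment = rep(letters[1:2], 10), vals, c_n = rnorm(20))

for (s in c("x", "y", "z")) {
  # columns for this suffix, e.g. c1_x, c4_x, c7_x, c10_x
  cols <- grep(paste0("^c[0-9]+_", s, "$"), names(df), value = TRUE)
  # sort by the number after "c" so that c10 comes after c7
  cols <- cols[order(as.integer(sub("^c([0-9]+)_.*$", "\\1", cols)))]
  df[[paste0("c", s)]] <-
    (df[[cols[1]]] * df[[cols[2]]] + df[[cols[3]]] * df[[cols[4]]]) / df$c_n
}
head(df[c("treatment", "cx", "cy", "cz")])
```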

Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).
library(tidyr)
library(stringr)
library(dplyr)
# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
# separate off the xyz and the c123
separate(col = cc_xyz, into = c("cc", "xyz")) %>%
# extract the number
mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
# group consecutive triples (c1-c3, c4-c6, ...) and add a letter so it's a good col name
num_grp = paste0("v", (num - 1) %/% 3 + 1)) %>%
# remove unwanted columns
select(-cc, -num) %>%
# go into a reasonable data width for calculation
spread(key = num_grp, value = value) %>%
# calculate, matching the formula in the question
mutate(result = v1 * v2 / c_n + v3 * v4 / c_n)
# resulting columns: treatment, c_n, xyz, v1, v2, v3, v4, result
