I have a dataframe:
x y
A1 ''
A2 '123,0'
A3 '4557777'
A4 '8756784321675'
A5 ''
A6 ''
A7
A8
A9 '1533,10'
A10
A11 '51'
I want to add column "type" to it, which has three types: 1,2,3. 1 is if value in y is a number without comma, 2 is for number with comma, 3 is for empty value ''(two apostrophes). So desired output is:
x y type
A1 '' 3
A2 '123,0' 2
A3 '4557777' 1
A4 '8756784321675' 1
A5 '' 3
A6 '' 3
A7
A8
A9 '1533,10' 2
A10
A11 '51' 1
How could I do it? The most unclear part for me is capturing each type in column y.
Here's a solution via ifelse and regex:
Data:
df <- data.frame(
y = c("", "", "1,234", "5678", "001,2", "", "455"), stringsAsFactors = FALSE)
Solution:
df$type <- ifelse(grepl(",", df$y), 2,
ifelse(grepl("[^,]", df$y), 1, 3))
Result:
df
y type
1 3
2 3
3 1,234 2
4 5678 1
5 001,2 2
6 3
7 455 1
Update:
df <- data.frame(
y = c("''", "", "1,234", "5678", "001,2", "", "''", 455), stringsAsFactors = FALSE)
df$type <- ifelse(grepl(",", df$y), 2,
ifelse(grepl("[^,']", df$y), 1,
ifelse(df$y=="", "", 3)))
df
y type
1 '' 3
2
3 1,234 2
4 5678 1
5 001,2 2
6
7 '' 3
8 455 1
Is this what you had in mind?
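For stricter matching, here is a minimal sketch with anchored patterns. It assumes that y holds the literal apostrophes shown in the question and that blank rows are NA or ""; classify_y is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: classify values such as "'123,0'", "'51'" or "''".
classify_y <- function(y) {
  type <- rep(NA_integer_, length(y))             # blank / missing rows stay NA
  type[y %in% "''"] <- 3L                         # literal two apostrophes
  digits <- gsub("'", "", y)                      # strip the surrounding quotes
  type[grepl("^[0-9]+$", digits)] <- 1L           # digits only
  type[grepl("^[0-9]+,[0-9]+$", digits)] <- 2L    # digits, one comma, digits
  type
}

y <- c("''", "'123,0'", "'4557777'", NA, "", "'51'")
result <- classify_y(y)
```

Unlike grepl(",", y), the anchored patterns leave anything that is not purely digits (for example "12a") as NA instead of assigning it a type.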
Assuming the empty rows hold missing values (NA, since an R vector cannot contain NULL), I thought of dividing into 3 parts:
Those which are empty strings (3)
Those which are convertible to numerics without producing NA (1)
Those which are missing (no value)
The only ones outside of this set belong to group 2, so:
THREE <- which(df$y == "")  # which() drops NA
ONE <- which(!is.na(suppressWarnings(as.numeric(df$y))))  # plain numbers
EMPTY <- which(is.na(df$y))  # is.null() on a vector is always FALSE
type <- rep(2, nrow(df))  # default: group 2
type[THREE] <- 3
type[ONE] <- 1
type[EMPTY] <- NA
Finally, you have a vector which you can add to your data frame as a column with:
df2 = cbind(df,type)
I've got a problem on how to neatly merge a lot of columns into less columns.
My df looks something like this (but with a lot more similar columns).
df <- data.frame(
A1 = c(1,1,1,NA,NA,NA),
A2 = c(NA,NA,NA,1,1,1),
B1 = c("text","text","text",NA,NA,NA),
B2 = c(NA,NA,NA,"text","text","text")
)
# which looks like this
A1 A2 B1 B2
1 NA "text" NA
1 NA "text" NA
1 NA "text" NA
NA 1 NA "text"
NA 1 NA "text"
NA 1 NA "text"
I would like to merge all the A columns into one A column and all the B columns into a B column. Like this.
A B
1 "text"
1 "text"
1 "text"
1 "text"
1 "text"
1 "text"
I am able to do this for one set of columns with this code:
df %<>% mutate(A1 = ifelse(is.na(A1), A2, A1))
# or possibly
df %<>% unite(A, A1, A2, sep = "", na.rm = TRUE) %>% mutate(A = as.numeric(A))
However, I have a lot of columns that need to be merged like this, resulting in a huge mutate command. Is there a way to do this cleaner/shorter?
Note: The columns in the example are called A1 and A2 for clarity; in my original df, they are not that easily coupled.
You can try the base R code
unstack(
transform(
subset(u <- stack(df), complete.cases(u)),
ind = gsub("\\d+$", "", ind)
)
)
which gives
A B
1 1 text
2 1 text
3 1 text
4 1 text
5 1 text
6 1 text
Here's one approach... use a named list of column pairs and fcoalesce from "data.table". The names of the list will become the column names in the final data.frame.
pairs = list(A = 2:3, B = c(4, 6), C = c(5, 1))
data.frame(lapply(pairs, function(x) data.table::fcoalesce(df[x])))
# A B C
# 1 1 text 1.2
# 2 1 text 1.2
# 3 1 text 1.2
# 4 1 text 1.2
# 5 1 text 1.2
# 6 1 text 1.2
Sample data used for this example:
df <- data.frame(
C2 = c(1.2, NA, 1.2, 1.2, NA, NA), A1 = c(1,1,1,NA,NA,NA),
A2 = c(NA,NA,NA,1,1,1), B1 = c("text","text","text",NA,NA,NA),
C1 = c(NA, 1.2, NA, NA, 1.2, 1.2), B2 = c(NA,NA,NA,"text","text","text")
)
df
# C2 A1 A2 B1 C1 B2
# 1 1.2 1 NA text NA <NA>
# 2 NA 1 NA text 1.2 <NA>
# 3 1.2 1 NA text NA <NA>
# 4 1.2 NA 1 <NA> NA text
# 5 NA NA 1 <NA> 1.2 text
# 6 NA NA 1 <NA> 1.2 text
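If you'd rather avoid the data.table dependency, the same named-list idea can be sketched in base R. This is only a rough sketch: coalesce_cols and the groups list are hypothetical names, and it assumes that the columns in each group share a type.

```r
# Hypothetical helper: for each group, take the first non-NA value per row.
coalesce_cols <- function(df, groups) {
  out <- lapply(groups, function(cols) {
    Reduce(function(a, b) ifelse(is.na(a), b, a), df[cols])
  })
  as.data.frame(out, stringsAsFactors = FALSE)
}

df <- data.frame(
  A1 = c(1, 1, 1, NA, NA, NA),
  A2 = c(NA, NA, NA, 1, 1, 1),
  B1 = c("text", "text", "text", NA, NA, NA),
  B2 = c(NA, NA, NA, "text", "text", "text"),
  stringsAsFactors = FALSE
)

# List names become the output columns; values name the columns to merge.
merged <- coalesce_cols(df, list(A = c("A1", "A2"), B = c("B1", "B2")))
```

Because the groups are spelled out by name, this also works when the column names are not easily coupled.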
I have a very large data.table in which (a large number of) items are defined by strings including text and numbers.
library(data.table)
dd <- data.table(x = c("A4","A4","A4","A14","A14","A14","B4","B4","B4"),y = c("A4","A14","B4","A4","A14","B4","A4","A14","B4"), z = c(1,2,3,4,5,6,7,8,9))
x y z
A4 A4 1
A4 A14 2
A4 B4 3
A14 A4 4
A14 A14 5
A14 B4 6
B4 A4 7
B4 A14 8
B4 B4 9
Numbers can be single or double digit and therefore R will order them always according to the first digit in the number (A14 before A4). Mixedsort can handle this. However, when I reshape the long data to wide
wide <- dcast(dd, x ~ y, value.var = "z")
R is applying again the ordering according to the basic ordering rule.
x A14 A4 B4
A14 5 4 6
A4 2 1 3
B4 8 7 9
I need however the original ordering for following matrix calculations. Is there any efficient way to rename string + single digits to string + double digits (A4 -> A04) or another approach I have missed?
Another, and probably the easiest, option is to use mixedorder from the gtools-package:
wide <- dcast(dd, x ~ y, value.var = "z")[gtools::mixedorder(x)]
which gives:
> wide
x A14 A4 B4
1: A4 2 1 3
2: A14 5 4 6
3: B4 8 7 9
If you also want to get the column order set the same way, you can additionally use setcolorder:
setcolorder(wide, c(1, gtools::mixedorder(names(wide)[-1]) + 1))
which then gives:
> wide
x A4 A14 B4
1: A4 1 2 3
2: A14 4 5 6
3: B4 7 8 9
No additional zeros required in this solution.
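If gtools is not available, a rough base R equivalent is to split each label into its text prefix and its number and order on both. natural_order is a hypothetical helper, and it assumes every label is a prefix followed by digits, as in the example data.

```r
# Hypothetical helper: order labels such as "A4", "A14", "B4" naturally.
natural_order <- function(x) {
  prefix <- sub("[0-9]+$", "", x)                 # text part, e.g. "A"
  number <- as.integer(sub("^[^0-9]*", "", x))    # numeric part, e.g. 14
  order(prefix, number)
}

x <- c("A14", "A4", "B4")
sorted <- x[natural_order(x)]
```

Here sorted is "A4" "A14" "B4", so natural_order(x) can stand in for gtools::mixedorder(x) on labels of this shape.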
# Data frame
library(reshape2)  # dcast
library(dplyr)     # select, slice, %>%
library(gtools)    # mixedsort
df <- data.frame(x = c("A4","A4","A4","A14","A14","A14","B4","B4","B4"),
y = c("A4","A14","B4","A4","A14","B4","A4","A14","B4"),
z = c(1,2,3,4,5,6,7,8,9),
stringsAsFactors = FALSE)
# Reorder columns (values of y) and rows (values of x) using `mixedsort`.
wide <- dcast(df, x ~ y, value.var = "z") %>%
select(x, all_of(mixedsort(unique(df$y)))) %>%
slice(match(mixedsort(unique(df$x)), x))
gives,
# x A4 A14 B4
# 1 A4 1 2 3
# 2 A14 4 5 6
# 3 B4 7 8 9
You can use sprintf() to prepad numbers with 0s
sprintf("%s%02d", "A", 1:20)
# [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10" "A11" "A12" "A13" "A14" "A15" "A16" "A17" "A18" "A19" "A20"
You can add the 0s to your data with
dd[nchar(x) == 2, x := paste0(substr(x, 1, 1), 0, substr(x, 2, 2))]
dd[nchar(y) == 2, y := paste0(substr(y, 1, 1), 0, substr(y, 2, 2))]
# x y z
# 1: A04 A04 1
# 2: A04 A14 2
# 3: A04 B04 3
# 4: A14 A04 4
# 5: A14 A14 5
# 6: A14 B04 6
# 7: B04 A04 7
# 8: B04 A14 8
# 9: B04 B04 9
Or, if you need to apply to more columns:
to.change <- c('x', 'y')
dd[, (to.change) := lapply(.SD, function(x) ifelse(nchar(x) > 2, x
, paste0(substr(x, 1, 1), 0, substr(x, 2, 2))))
, .SDcols = to.change]
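The substr() approach above relies on a one-letter prefix and a one-digit number. A slightly more general sketch, assuming each label is some non-digit prefix followed by digits (pad_labels is a hypothetical helper name):

```r
# Hypothetical helper: zero-pad the numeric suffix of labels like "A4".
pad_labels <- function(x, width = 2) {
  prefix <- sub("[0-9]+$", "", x)                 # text part, e.g. "A"
  number <- as.integer(sub("^[^0-9]*", "", x))    # numeric part, e.g. 4
  paste0(prefix, formatC(number, width = width, flag = "0"))
}

padded <- pad_labels(c("A4", "A14", "B4"))
```

With data.table this could be applied to both columns via dd[, (to.change) := lapply(.SD, pad_labels), .SDcols = to.change].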
You might want to consider implementing this order directly in the data through factors, so you don't have to fix it with data wrangling later.
If you already have these unique values sorted somewhere, you won't need mixedorder or mixedsort; just convert them to factors with those levels.
Otherwise you can recover the order:
library(gtools)
dd[,1:2] <- lapply(dd[,1:2],function(x) factor(x, mixedsort(unique(x))))
And proceed normally:
dcast(dd, x ~ y, value.var = "z")
# x A4 A14 B4
# 1: A4 1 2 3
# 2: A14 4 5 6
# 3: B4 7 8 9
I'm trying to cbind or unnest or as.data.table a partially nested list.
id <- c(1,2)
A <- c("A1","A2","A3")
B <- c("B1")
AB <- list(A=A,B=B)
ABAB <- list(AB,AB)
nested_list <- list(id=id,ABAB=ABAB)
The length of id is the same as ABAB (2 in this case). I don't know how to unlist a part of this list (ABAB) and cbind another part (id). Here's my desired result as a data.table:
data.table(id=c(1,1,1,2,2,2),A=c("A1","A2","A3","A1","A2","A3"),B=rep("B1",6))
id A B
1: 1 A1 B1
2: 1 A2 B1
3: 1 A3 B1
4: 2 A1 B1
5: 2 A2 B1
6: 2 A3 B1
I haven't tested for more general cases, but this works for the OP example:
library(data.table)
as.data.table(nested_list)[, lapply(ABAB, as.data.table)[[1]], id]
# id A B
#1: 1 A1 B1
#2: 1 A2 B1
#3: 1 A3 B1
#4: 2 A1 B1
#5: 2 A2 B1
#6: 2 A3 B1
Or another option (which is probably faster, but is more verbose):
rbindlist(lapply(nested_list$ABAB, as.data.table),
idcol = 'id')[, id := nested_list$id[id]]
This is some super ugly base R, but produces the desired output.
Reduce(rbind, Map(function(x, y) setNames(data.frame(x, y), c("id", "A", "B")),
as.list(nested_list[[1]]),
lapply(unlist(nested_list[-1], recursive=FALSE),
function(x) Reduce(cbind, x))))
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
lapply takes a list of two elements (each containing the A and B variables) extracted with unlist and recursive=FALSE. It returns a list of character matrices with the B variable filled in by recycling. A list of the individual id variables from as.list(nested_list[[1]]) and the list of matrices are fed to Map, which converts corresponding pairs to a data.frame, gives the columns the desired names, and returns a list of data.frames. Finally, this list of data.frames is fed to Reduce, which rbinds the results into a single data.frame.
The final Reduce(rbind, could be replaced by data.table's rbindlist if desired.
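The same Map-then-rbind idea can be written a little more directly by letting data.frame() do the recycling. This is only a sketch and assumes that, within each element of ABAB, the shorter vectors have length 1 (as B does here).

```r
id <- c(1, 2)
AB <- list(A = c("A1", "A2", "A3"), B = "B1")
nested_list <- list(id = id, ABAB = list(AB, AB))

# Build one data.frame per id; data.frame() recycles the length-1 values.
rows <- Map(function(id, ab) {
  do.call(data.frame, c(list(id = id), ab, stringsAsFactors = FALSE))
}, nested_list$id, nested_list$ABAB)

# Stack the per-id pieces into one data.frame.
result <- do.call(rbind, rows)
```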
Here's another hideous solution
max_length = max(unlist(lapply(nested_list, function(x) lapply(x, lengths))))
data.frame(id = do.call(c, lapply(nested_list$id, rep, max_length)),
do.call(rbind, lapply(nested_list$ABAB, function(x)
do.call(cbind, lapply(x, function(y) {
if(length(y) < max_length) {
rep(y, max_length)
} else {
y
}
})))))
# id A B
#1 1 A1 B1
#2 1 A2 B1
#3 1 A3 B1
#4 2 A1 B1
#5 2 A2 B1
#6 2 A3 B1
And one more, also inelegant, but I'd gone too far by the time I saw the other answers.
restructure <- function(nested_l) {
ids <- as.numeric(max(unlist(lapply(unlist(nested_l, recursive = FALSE), function(x){
lapply(x, length)
}))))
temp = data.frame(rep(nested_l$id, each = ids),
sapply(1:length(nested_l$id), function(x){
out <-unlist(lapply(nested_l[[2]], function(y){
return(y[x])
}))
}))
names(temp) <- c("id", unique(substring(unlist(nested_l[2]), first = 1, last = 1)))
return(temp)
}
> restructure(nested_list)
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
Joining the party:
library(tidyverse)
temp <- map(nested_list,~map(.x,~expand.grid(.x)))
df <- map_df(1:2,~cbind(temp$id[[.x]],temp$ABAB[[.x]]))
Var1 A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
I've got two strings of variable names that looks like this
> names_a = paste(paste0('a', seq(0,6,1)), collapse = ", ")
> names_a
[1] "a0, a1, a2, a3, a4, a5, a6"
> names_b = paste(paste0('b', seq(0,6,1)), collapse = ", ")
> names_b
[1] "b0, b1, b2, b3, b4, b5, b6"
Each a and b variable contains a vector of ids, for example:
> head(a3)
[1] "1234" "56567" "457659"...
I aim to get all possible pairs of a and b ids. For this purpose I try to paste the variables' names right into the function rbind and then into expand.grid
pairs = expand.grid(rbind(parse(text = names_a), rbind(parse(text = names_b))
I mean I try to collapse all the a0 to a6 vectors into a single vector using rbind (call it a), do the same for all the b vectors, and then find all pairs of values in a and b.
Surprisingly, nothing works. Can it be fixed?
Something like this?
a1 = 1:2
a2 = 3:4
b1 = 5:6
b2 = 7:8
expand.grid(do.call(rbind, mget(paste("a", 1:2, sep = ""))),
do.call(rbind, mget(paste("b", 1:2, sep = ""))))
# Var1 Var2
#1 1 5
#2 3 5
#3 2 5
#4 4 5
#5 1 7
#6 3 7
#7 2 7
#8 4 7
#9 1 6
#10 3 6
#11 2 6
#12 4 6
#13 1 8
#14 3 8
#15 2 8
#16 4 8
Collapse all of a0 through a6 into one vector:
a <- as.vector(sapply(strsplit(gsub(" ","",names_a),",")[[1]],function(x) get(x)))
(or if you don't have the names as a single string you need to parse):
a <- as.vector(sapply(paste0("a",0:6),function(x) get(x)))
Do the same with b and then
merge(a,b) #all pairs
This will generate duplicates if any of the a or b variables has duplicates, so you may want to add unique to the collapsing of a and b
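Putting those pieces together, a minimal sketch with made-up ids (the a0, a1, b0, b1 values below are invented for illustration; the real vectors would be a0 to a6 and b0 to b6):

```r
# Invented example vectors standing in for a0..a6 and b0..b6.
a0 <- c("1234", "56567")
a1 <- c("457659", "1234")
b0 <- c("9", "8")
b1 <- "7"

# Fetch the variables by name, flatten them, and drop duplicate ids.
a <- unique(unlist(mget(paste0("a", 0:1)), use.names = FALSE))
b <- unique(unlist(mget(paste0("b", 0:1)), use.names = FALSE))

# One row per (a, b) combination.
pairs <- expand.grid(a = a, b = b, stringsAsFactors = FALSE)
```

unlist() copes with vectors of different lengths, which rbind() would recycle instead.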
I would like to achieve the following data.frame in R:
i1 i2 i3
1 A1 A2 A3
2 No A2 A3
3 A1 No A3
4 No No A3
5 A1 A2 No
6 No A2 No
7 A1 No No
8 No No No
In each column the variable can either be the concatenated string "A" and the column number or "No". The data.frame should contain all possible combinations.
My idea was to use expand.grid, but I don't know how to create the list dynamically. Or is there a better approach?
expand.grid(list(c("A1", "No"), c("A2", "No"), c("A3", "No")))
I guess you could create your own helper function, something like that
MyList <- function(n) expand.grid(lapply(paste0("A", seq_len(n)), c, "No"))
Then simply pass it the number of elements (e.g., 3)
MyList(3)
# Var1 Var2 Var3
# 1 A1 A2 A3
# 2 No A2 A3
# 3 A1 No A3
# 4 No No A3
# 5 A1 A2 No
# 6 No A2 No
# 7 A1 No No
# 8 No No No
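If the i1, i2, i3 column names from the question matter, one way to get them is to name the list before passing it to expand.grid. make_grid is a hypothetical variant of the helper above:

```r
# Hypothetical variant of MyList that also sets the column names.
make_grid <- function(n) {
  choices <- lapply(seq_len(n), function(i) c(paste0("A", i), "No"))
  names(choices) <- paste0("i", seq_len(n))       # i1, i2, ..., in
  expand.grid(choices, stringsAsFactors = FALSE)
}

grid3 <- make_grid(3)
```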
Alternatively, you could also try data.table's CJ equivalent, which should be much more efficient than expand.grid for a big n
library(data.table)
DTCJ <- function(n) do.call(CJ, lapply(paste0("A", seq_len(n)), c, "No"))
DTCJ(3) # will return a sorted cross join
# V1 V2 V3
# 1: A1 A2 A3
# 2: A1 A2 No
# 3: A1 No A3
# 4: A1 No No
# 5: No A2 A3
# 6: No A2 No
# 7: No No A3
# 8: No No No
Another option is using Map with expand.grid
n <- 3
expand.grid(Map(c, paste0('A', seq_len(n)), 'No'))
Or
expand.grid(as.data.frame(rbind(paste0('A', seq_len(n)),'No')))
Another option, only using the most fundamental functions in R, is to use the indices:
df <- data.frame(V1 = c('A','A','A', 'A',rep('No',4)), V2 = c('A','A','No','No','A','A','No','No'), V3 = c('A','No','A','No','A','No','A','No'), stringsAsFactors = FALSE)
to get the row and col indices of the elements we need to change:
rindex <- which(df != 'No') %% nrow(df)
cindex <- ceiling(which(df != 'No')/nrow(df))
the solution is basically a one-liner:
df[matrix(c(rindex,cindex),ncol=2)] <- paste0(df[matrix(c(rindex,cindex),ncol=2)],cindex)
> df
V1 V2 V3
1 A1 A2 A3
2 A1 A2 No
3 A1 No A3
4 A1 No No
5 No A2 A3
6 No A2 No
7 No No A3
8 No No No
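A variation on the same idea, as a sketch: which() with arr.ind = TRUE returns the row and column indices directly. The modulo form above would yield a row index of 0 for a match in the last row; that does not happen here only because row 8 is all 'No'. The starting grid below is built with expand.grid purely for brevity.

```r
# Start from a grid of "A" / "No" placeholders.
df <- expand.grid(V1 = c("A", "No"), V2 = c("A", "No"), V3 = c("A", "No"),
                  stringsAsFactors = FALSE)

# Two-column matrix of (row, col) positions that are not "No".
idx <- which(df != "No", arr.ind = TRUE)

# Append the column number to every "A".
df[idx] <- paste0(df[idx], idx[, "col"])
```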