I have a column containing 1200-character strings. In each one, every four-character group is a number in hexadecimal, i.e. 300 hexadecimal numbers crammed into a 1200-character string, in every row. I need to convert each number to decimal and put it into its own column (300 new columns) named 1-300.
Here's what the data frame column looks like:
BigString
[1] 0043003E803C0041004A...(etc...)
Here's what I've done so far:
decimal.fours <- function(x) {
strtoi(substring(BigString[x], seq(1,1197,4), seq(4,1200,4)), 16L)
}
decimal.fours(1)
[1] 283 291 239 177 ...
But now I'm stuck. How can I output these individual numbers (and the remaining 296) into new columns? I have fifty rows/strings in total. It would be great to do them all at once, i.e. 300 new columns containing the split-up substrings from 50 strings.
You can use read.fwf, which reads files with fixed-width columns:
# an example vector of big strings
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
n = 5 # n is the number of columns for your result(300 for your real case)
as.data.frame(
lapply(read.fwf(file = textConnection(BigString),
widths = rep(4, n),
colClasses = "character"),
strtoi, base = 16))
# V1 V2 V3 V4 V5
#1 67 62 32828 65 74
#2 67 62 32828 65 74
#3 67 62 32828 65 74
If you'd like to keep the decimal.fours function, you can modify it as follows so that it takes a string rather than a row index, call lapply to convert BigString to a list of integer vectors, and then bind those into a matrix with the do.call(rbind, ...) pattern (wrap the result in as.data.frame if you need a data.frame):
decimal.fours <- function(x) {
# end positions run to 1200 so that all 300 groups get a 4-character span
strtoi(substring(x, seq(1,1197,4), seq(4,1200,4)), 16L)
}
do.call(rbind, lapply(BigString, decimal.fours))
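To also get the column names 1 to 300 that the question asks for, the same idea can be wrapped up roughly as below. This is only a sketch: hex_to_cols is a made-up helper name, and it assumes every string has the same length and that this length is a multiple of the group width.

```r
# Hypothetical helper (not from the thread): split fixed-width hex groups
# into integer columns named 1..n.
hex_to_cols <- function(strings, width = 4L) {
  n <- nchar(strings[1]) %/% width                 # groups per string
  starts <- seq(1L, by = width, length.out = n)    # 1, 5, 9, ...
  vals <- lapply(strings, function(s) {
    strtoi(substring(s, starts, starts + width - 1L), base = 16L)
  })
  # one row per input string, one column per hex group
  out <- matrix(unlist(vals), nrow = length(strings), byrow = TRUE)
  colnames(out) <- seq_len(n)
  as.data.frame(out)
}

BigString <- c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
res <- hex_to_cols(BigString)
```

With the real data, n works out to 300 on its own, so nothing needs to be hard-coded; cbind(df, res) would then attach the new columns to the original data frame.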
Obligatory tidyverse example:
library(tidyverse)
Set up some data:
set.seed(1492)
bet <- c(0:9, LETTERS[1:6]) # alphabet for hex digit sequences
i <- 8 # number of rows
n <- 10 # number of 4-hex-digit sequences
df <- tibble( # tibble() replaces the deprecated data_frame()
some_other_col=LETTERS[1:i],
big_str=map_chr(1:i, ~sample(bet, 4*n, replace=TRUE) %>% paste0(collapse=""))
)
df
## # A tibble: 8 × 2
## some_other_col big_str
## <chr> <chr>
## 1 A 432100D86CAA388C15AEA6291E985F2FD3FB6104
## 2 B BC2673D112925EBBB3FD175837AF7176C39B4888
## 3 C B4E99FDAABA47515EADA786715E811EE0502ABE8
## 4 D 64E622D7037D35DE6ADC40D0380E1DC12D753CBC
## 5 E CF7CDD7BBC610443A8D8FCFD896CA9730673B181
## 6 F ED86AEE8A7B65F843200B823CFBD17E9F3CA4EEF
## 7 G 2B9BCB73941228C501F937DA8E6EF033B5DD31F6
## 8 H 40823BBBFDF9B14839B7A95B6E317EBA9B016ED5
Do the manipulation:
read_fwf(paste0(df$big_str, collapse="\n"),
fwf_widths(rep(4, n)),
col_types=paste0(rep("c", n), collapse="")) %>%
mutate_all(strtoi, base=16) %>%
bind_cols(df) %>%
select(some_other_col, everything(), -big_str)
## # A tibble: 8 × 11
## some_other_col X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 A 17185 216 27818 14476 5550 42537 7832 24367 54267
## 2 B 48166 29649 4754 24251 46077 5976 14255 29046 50075
## 3 C 46313 40922 43940 29973 60122 30823 5608 4590 1282
## 4 D 25830 8919 893 13790 27356 16592 14350 7617 11637
## 5 E 53116 56699 48225 1091 43224 64765 35180 43379 1651
## 6 F 60806 44776 42934 24452 12800 47139 53181 6121 62410
## 7 G 11163 52083 37906 10437 505 14298 36462 61491 46557
## 8 H 16514 15291 65017 45384 14775 43355 28209 32442 39681
## # ... with 1 more variables: X10 <int>
Just an attempt using base R:
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
df = data.frame(BigString)
t(sapply(df$BigString, function(x) strtoi(substring(x, seq(1, 297, 4)[1:5],
seq(4, 300, 4)[1:5]), base = 16)))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 67 62 32828 65 74
#[2,] 67 62 32828 65 74
#[3,] 67 62 32828 65 74
# you can set the column names at the end using `paste0("new_col", 1:300)`
# [1:5] was only used for this example, because I had strings of 20 characters
I have a large matrix:
id v1 v2 v3 v4 v5 v6 v7 v8
1001 37 15 30 37 4 11 35 37
2111 44 31 44 30 24 39 44 18
3121 43 49 39 34 44 43 26 24
4532 45 31 26 33 12 47 37 15
5234 23 27 34 23 30 34 23 4
6345 9 46 39 34 8 43 26 24
For each row (id), I would like to identify intervals of numbers in column v1 to v8. An interval is here defined as a sequence of numbers which starts and ends with the same number.
For example, in the first row, there are two sequences which both start and end with 37: from column 1 to 4 (37, 15, 30, 37) and from column 4 to column 8 (37, 4, 11, 35, 37).
The focal value should only occur in start and end positions. For example, in the first row, the sequence from 37 at V1, to 37 at V8 is not included, because 37 also occurs in V4.
For each interval, I want the index of the start and end columns, the focal start and end value, and the sequence of numbers in between.
Desired output:
1001 [v1] to [v4] 37 to 37: 15,30
1001 [v4] to [v8] 37 to 37: 4, 11, 35
2111 [v1] to [v3] 44 to 44: 31
2111 [v3] to [v7] 44 to 44: 30, 24, 39
Any suggestions? Algorithm?
I managed to compute the indices for a single vector, but not for a matrix:
a <- which(x == 37)
from <- a[!(a-1) %in% a]
to <- a[!(a+1) %in% a]
rbind(from, to)
Very brute-force method. Get unique elements for the given row, check if they are present more than once but not side-by-side, then lapply through each, getting the elements of the row x between them.
apply(m, 1, function(x) {
u <- unique(x)
u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
lapply(setNames(u, u), function(u){
ind <- which(x == u)
lapply(seq(length(ind) - 1),
function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
})
})
Output:
# [[1]]
# [[1]]$`37`
# [[1]]$`37`[[1]]
# [1] 15 30
#
# [[1]]$`37`[[2]]
# [1] 4 11 35
#
#
#
# [[2]]
# [[2]]$`44`
# [[2]]$`44`[[1]]
# [1] 31
#
# [[2]]$`44`[[2]]
# [1] 30 24 39
#
#
#
# [[3]]
# [[3]]$`43`
# [[3]]$`43`[[1]]
# [1] 49 39 34 44
#
#
#
# [[4]]
# named list()
#
# [[5]]
# [[5]]$`23`
# [[5]]$`23`[[1]]
# [1] 27 34
#
# [[5]]$`23`[[2]]
# [1] 30 34
#
#
# [[5]]$`34`
# [[5]]$`34`[[1]]
# [1] 23 30
#
#
#
# [[6]]
# named list()
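If the goal is the exact text layout from the question, the same brute-force idea can be written with plain loops. This is just a sketch under a few assumptions: m is a numeric matrix whose first column is the id and whose remaining columns are named v1 to v8, and find_intervals is a name made up for this example.

```r
# Rebuild the example matrix from the question
m <- matrix(c(1001, 37, 15, 30, 37,  4, 11, 35, 37,
              2111, 44, 31, 44, 30, 24, 39, 44, 18,
              3121, 43, 49, 39, 34, 44, 43, 26, 24,
              4532, 45, 31, 26, 33, 12, 47, 37, 15,
              5234, 23, 27, 34, 23, 30, 34, 23,  4,
              6345,  9, 46, 39, 34,  8, 43, 26, 24),
            ncol = 9, byrow = TRUE,
            dimnames = list(NULL, c("id", paste0("v", 1:8))))

# Hypothetical helper: one formatted line per interval
find_intervals <- function(m) {
  vals <- m[, -1, drop = FALSE]          # drop the id column
  out <- character(0)
  for (r in seq_len(nrow(m))) {
    x <- vals[r, ]
    for (u in unique(x)) {
      ind <- which(x == u)               # positions of the focal value
      for (k in seq_len(length(ind) - 1L)) {
        from <- ind[k]
        to <- ind[k + 1L]
        if (to - from > 1L) {            # skip side-by-side repeats
          between <- x[(from + 1L):(to - 1L)]
          out <- c(out, sprintf("%s [%s] to [%s] %s to %s: %s",
                                m[r, 1], colnames(vals)[from], colnames(vals)[to],
                                u, u, paste(between, collapse = ", ")))
        }
      }
    }
  }
  out
}

res <- find_intervals(m)
```

Because only consecutive occurrences of a value are paired, the focal value can never appear inside an interval, which is the constraint described in the question.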
Edit: Henrik's answer inspired me to do a join-based version
library(data.table)
library(magrittr)
d <- melt(as.data.table(m), "id", variable.name = 'ci')[, ci := rowid(id)]
setorder(d, id)
options(datatable.nomatch = 0)
d[d, on = .(id, value, ci > ci)
, .(id, value, i.ci, x.ci)
, mult = 'first'] %>%
.[d, on = .(id, i.ci < ci, x.ci > ci)
, .(id, value, from_ci = x.i.ci, to_ci = x.x.ci, i.value)] %>%
.[, .(val = .(i.value))
, by = setdiff(names(.), 'i.value')]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 34 3 6 23,30
# 8: 5234 23 4 7 30,34
Here's a data.table alternative.
Convert matrix to data.table and melt to long format. Create a column index 'ci' to keep track of the original columns (rowid(id)). Order by 'id'.
For each 'id' and 'value' (by = .(id, value)), check if the number of rows is larger than one (if(.N > 1)), i.e. if there is at least one sequence. If so, grab the row indexes (.I) of the sequences and their column indexes (in the original data). For each sequence, grab the corresponding values between the start and end index. Wrap in list twice (.(.())) to create a list column.
library(data.table)
d <- melt(as.data.table(m), id.vars = "id")
d[ , `:=`(
ci = rowid(id),
variable = NULL)]
setorder(d, id)
d2 <- d[ , if(.N > 1){
.(from = .I[-.N], to = .I[-1],
from_ci = ci[-.N], to_ci = ci[ -1])
}, by = .(id, value)]
d2[ , val := .(.(d$value[seq(from + 1, to - 1)])), by = 1:nrow(d2)]
d2[ , `:=`(from = NULL, to = NULL)]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 23 4 7 30,34
# 8: 5234 34 3 6 23,30
I would like to create an update function using lazy evaluation and the mutate_if function from dplyrExtras by skranz.
It would work something like this:
data %>%
update(variable1_original = variable1_update,
variable2_original = variable2_update)
would be evaluated as
data %>%
mutate_if(!is.na(variable1_update),
variable1_original = variable1_update) %>%
mutate_if(!is.na(variable2_update),
variable2_original = variable2_update) %>%
select(-variable1_update, -variable2_update)
Yikes, that package isn't very fun to use. mutate_if doesn't seem to work with data.frames and the package doesn't have standard-evaluation alternatives for functions like standard dplyr does. Here's an attempt to re-create the function
myupdate <- function(.data, ...) {
dots <- as.list(substitute(...()))
dx <- Reduce(function(a,b) {
upd <- b[[1]]
ifc <- bquote(!is.na(.(upd)))
do.call("mutate_if", c(list(a, ifc), b))
}, split(dots, seq_along(dots)), .data)
select_(dx, .dots=sapply(dots, function(x) bquote(-.(x))))
}
To test it, I used
library(data.table)
dd<-data.table(
a = c(1:3, NA, 5:8)+0,
b = c(1:2, NA, 4:5, NA, 7:8)+100,
x= 1:8+20,
y=1:8+30
)
dd
# a b x y
# 1: 1 101 21 31
# 2: 2 102 22 32
# 3: 3 NA 23 33
# 4: NA 104 24 34
# 5: 5 105 25 35
# 6: 6 NA 26 36
# 7: 7 107 27 37
# 8: 8 108 28 38
and then I ran
myupdate(dd, x=b, y=a)
# x y
# 1: 101 1
# 2: 102 2
# 3: 23 3
# 4: 104 34
# 5: 105 5
# 6: 26 6
# 7: 107 7
# 8: 108 8
Notice how columns "a" and "b" disappear. Also see how the values in rows 3 and 6 of column "x" and the value in row 4 of column "y" were preserved, because the corresponding values in columns "b" and "a" were NA.
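For comparison, the update rule itself (take the update value unless it is NA, then drop the update column) does not need dplyrExtras at all. Below is a rough base R sketch. The function name update_cols and its interface, a named character vector mapping original column to update column, are inventions for this example and not part of any package.

```r
# Hypothetical helper: `pairs` is a named character vector,
# names = columns to update, values = columns holding the updates.
update_cols <- function(data, pairs) {
  for (orig in names(pairs)) {
    upd <- pairs[[orig]]
    keep <- !is.na(data[[upd]])            # rows where an update exists
    data[[orig]][keep] <- data[[upd]][keep]
  }
  # drop the update columns once they have been applied
  data[setdiff(names(data), pairs)]
}

dd <- data.frame(
  a = c(1:3, NA, 5:8) + 0,
  b = c(1:2, NA, 4:5, NA, 7:8) + 100,
  x = 1:8 + 20,
  y = 1:8 + 30
)
res <- update_cols(dd, c(x = "b", y = "a"))
```

This takes column names as strings rather than using lazy evaluation, so it sidesteps the non-standard-evaluation part of the question; it is meant only to show the update logic.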
I have vectors of different lengths. For instance:
df1
[1] 1 95 5 2 135 4 3 135 4 4 135 4 5 135 4 6 135 4
df2
[1] 1 70 3 2 110 4 3 112 4
I'm trying to write an R function (or loop) that takes any such vector and returns a data frame of three columns, one data frame per input vector. Each vector's length is a multiple of three (hence the three columns). I'm fairly new to writing functions in R and can't seem to figure this out. Here was my attempt:
newdf = c()
ld <- length(df1)
ld_mult <- length(df1)/3
ld_seq <- seq(from=1,to=ld,by=3)
ld_seq2 < ld_seq +2
for (i in 1:ld_mult) {
newdf[i,] <- df1[ld_seq[i]:ld_seq2[i]]
}
The output I want for df1 would be:
1 95 5
2 135 4
3 135 4
4 135 4
5 135 4
6 135 4
Here's an example of how you could use matrix for that purpose:
x <- c(1, 95, 5,2, 135, 4, 3, 135, 4)
as.data.frame(matrix(x, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
And
y <- c(1, 70, 3, 2, 110, 4, 3, 112, 4)
as.data.frame(matrix(y, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
Or if you want to make it a custom function:
newdf <- function(vec) {
as.data.frame(matrix(vec, ncol = 3, byrow = TRUE))
}
newdf(y)
#V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
You could also let the user specify the number of columns to create by adding another argument to newdf:
newdf <- function(vec, cols = 3) {
as.data.frame(matrix(vec, ncol = cols, byrow = T))
}
Now the default number of columns is 3 if the user doesn't specify a number. To get a different number of columns, call it like this (z stands for any vector whose length is a multiple of 5):
newdf(z, 5) # to create 5 columns
Another nice little addon for the function would be a check if the input vector length is a multiple of the number of columns specified in the function call:
newdf <- function(vec, cols = 3) {
if(length(vec) %% cols != 0) {
stop("Input vector length is not a multiple of the number of columns. Please double check.")
}
as.data.frame(matrix(vec, ncol = cols, byrow = T))
}
newdf(x, 4)
#Error in newdf(x, 4) :
# Input vector length is not a multiple of the number of columns. Please double check.
If you had multiple vectors sitting in a list, here's how you could convert each of them to be a data.frame:
> l <- list(x,y)
> l
#[[1]]
#[1] 1 95 5 2 135 4 3 135 4
#
#[[2]]
#[1] 1 70 3 2 110 4 3 112 4
> lapply(l, newdf)
#[[1]]
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
#
#[[2]]
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
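For what it's worth, the loop from the question can also be made to work; the main missing piece is a preallocated matrix to assign rows into. A sketch, assuming a non-empty numeric vector (vec_to_df_loop is just a name picked for this example):

```r
# Loop-based variant of newdf(), shown only for comparison with matrix()
vec_to_df_loop <- function(vec, cols = 3) {
  stopifnot(length(vec) %% cols == 0)
  n_rows <- length(vec) / cols
  out <- matrix(NA_real_, nrow = n_rows, ncol = cols)  # preallocate the result
  starts <- seq(from = 1, to = length(vec), by = cols) # first index of each row
  for (i in seq_len(n_rows)) {
    out[i, ] <- vec[starts[i]:(starts[i] + cols - 1)]
  }
  as.data.frame(out)
}

df1 <- c(1, 95, 5, 2, 135, 4, 3, 135, 4, 4, 135, 4, 5, 135, 4, 6, 135, 4)
res <- vec_to_df_loop(df1)
```

The matrix(vec, ncol = 3, byrow = TRUE) version gives the same result in one line, so the loop is mostly useful for seeing what was going wrong.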
I have a matrix like:
     [,1] [,2] [,3] [,4]
[1,] 12 32 43 55
[2,] 54 54 7 8
[3,] 2 56 76 88
[4,] 58 99 93 34
I do not know in advance how many rows and columns I will have in matrix. Thus, I need to create row and column names dynamically.
I can name the columns (or rows) directly like:
colnames(rmatrix) <- c("a", "b", "c", "d")
However, how can I create my names vector dynamically to fit the dimensions of the matrix?
nm <- c("a", "b", "c", "d")
colnames(rmatrix) <- nm
You can use rownames and colnames with do.NULL = FALSE to create names dynamically, as in:
set.seed(1)
rmatrix <- matrix(sample(0:100, 16), ncol=4)
dimnames(rmatrix) <- list(rownames(rmatrix, do.NULL = FALSE, prefix = "row"),
colnames(rmatrix, do.NULL = FALSE, prefix = "col"))
rmatrix
col1 col2 col3 col4
row1 26 19 58 61
row2 37 86 5 33
row3 56 97 18 66
row4 89 62 15 42
You can change prefix to name the rows/columns as you want.
To dynamically name columns (or rows) you can try
colnames(rmatrix) <- letters[1:ncol(rmatrix)]
where letters can be replaced by a vector of column names of your choice. You can do a similar thing for rows.
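One caveat: letters only has 26 entries, so for wider matrices something like paste0 with an index is safer. A small sketch (the "col"/"row" prefixes are arbitrary choices):

```r
rmatrix <- matrix(1:60, nrow = 2)   # 2 rows, 30 columns: more than 26 letters

# seq_len() adapts to whatever dimensions the matrix turns out to have
colnames(rmatrix) <- paste0("col", seq_len(ncol(rmatrix)))
rownames(rmatrix) <- paste0("row", seq_len(nrow(rmatrix)))
```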
You may use provideDimnames. Some examples with various degree of customisation:
m <- matrix(1:12, ncol = 3)
provideDimnames(m)
# A B C
# A 1 5 9
# B 2 6 10
# C 3 7 11
# D 4 8 12
provideDimnames(m, base = list(letters, LETTERS))
# A B C
# a 1 5 9
# b 2 6 10
# c 3 7 11
# d 4 8 12
provideDimnames(m, base = list(paste0("row_", letters), paste0("col_", letters)))
# col_a col_b col_c
# row_a 1 5 9
# row_b 2 6 10
# row_c 3 7 11
# row_d 4 8 12
I have a list with 138 tables in it (prop.table). Each table can have up to 20 variables in it (numerical categories ranging from 11-95 as the colnames). I need to convert this list to a master dataframe. The first three tables look like this:
[[1]]
x
21 41 42 43 52 71 81 82
0.02007456 0.58158876 0.22483510 0.09349011 0.05248064 0.01204474 0.00544881 0.01003728
[[2]]
x
21 41 42 43 52 71 90
0.01175122 0.36973345 0.34107194 0.03066781 0.08655775 0.01633706 0.14388077
[[3]]
x
21 22 23 41 42
0.043254082 0.008307075 0.016614151 0.930392438 0.001432254
I need to convert this to a matrix so it looks like this, with NAs or 0 when the categorical variable is not available:
x<-matrix (nrow=3, ncol=11 )
colnames(x) <-c('21', '22', '23', '41', '42', '43', '52', '71', '81', '82', '90' )
I have tried using this line from a previous similar question but the table is not correct:
df <- data.frame(matrix(unlist(prop.table), nrow=138, byrow=T))
Any suggestions on how to resolve this issue and get the table I need?
Is this what you want?
x1 <- c(1, 5, 7)
names(x1) <- 1:3
x2 <- c(1, 2, 7)
names(x2) <- c(1,3,5)
l <- list(x1, x2)
m <- matrix(nrow=length(l), ncol=5)
colnames(m) <- 1:5
for (i in 1:length(l)) {
m[i, names(l[[i]])] <- l[[i]]
}
Maybe one can replace the loop with an apply function, but I'm not sure. Basically, I loop through the list and, in each row of the matrix, set those columns whose names match the names of the corresponding vector in the list.
Sorry for not using your data set, but you didn't have the code at hand and I was too lazy to type it out.
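The loop can indeed be swapped for sapply, and the column set can be derived from the data instead of being fixed at 5. A sketch using toy vectors like the ones above (all_names is a name introduced here):

```r
l <- list(c("1" = 1, "2" = 5, "3" = 7),
          c("1" = 1, "3" = 2, "5" = 7))

# every name that occurs anywhere in the list (character sort, so "21" < "3")
all_names <- sort(unique(unlist(lapply(l, names))))

# indexing by a name that is absent yields NA; setNames restores the labels
m <- t(sapply(l, function(v) setNames(v[all_names], all_names)))
```

Here m has one row per list element and one column per name, with NA where an element lacks that name. Replacing the NAs afterwards with m[is.na(m)] <- 0 would give zeros instead, if that is preferred.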
rbind.fill from the plyr package will do just this for you:
# make an example `prop.table`:
tbl <- 1:10
names(tbl) <- letters[1:10]
tbl <- as.matrix(tbl)
# make sure some of the columns are missing
prop.table <- list(tbl[sample(10, size=8),], tbl[sample(10, size=7),], tbl[sample(10, size=9),])
# [[1]]
# d b g c h f e i
# 4 2 7 3 8 6 5 9
# [[2]]
# h g d a j f c
# 8 7 4 1 10 6 3
# [[3]]
# c i b d j a h g e
# 3 9 2 4 10 1 8 7 5
You can use the rbind.fill function from plyr, which is just rbind but it fills missing columns out with NA. It can take in a list of data frames to rbind together, so I convert each element of prop.table into a dataframe first (needed the t to ensure each prop.table[[i]] was treated as a row, not a column)
library(plyr)
rbind.fill(lapply(prop.table, function (x) as.data.frame(t(x))))
# d b g c h f e i a j
# 1 4 2 7 3 8 6 5 9 NA NA
# 2 4 NA 7 3 8 6 NA NA 1 10
# 3 4 2 7 3 8 NA 5 9 1 10
(Note - you can sort the columns of the output dataframe with x[, order(colnames(x))])
Here is a simple way using lapply, rbind and do.call:
ptl
## [[1]]
## x
## 21 41 42 43 52 71 81 82
## 0.02007456 0.58158876 0.22483510 0.09349011 0.05248064 0.01204474 0.00544881 0.01003728
##
## [[2]]
## x
## 21 41 42 43 52 71 90
## 0.01175122 0.36973345 0.34107194 0.03066781 0.08655775 0.01633706 0.14388077
##
## [[3]]
## x
## 21 22 23 41 42
## 0.043254082 0.008307075 0.016614151 0.930392438 0.001432254
##
## [[4]]
## x
## 21 22 31 41 42 43 81
## 0.10028653 0.03123209 0.00487106 0.66103152 0.03037249 0.01604585 0.15616046
##
## [[5]]
## x
## 21 41 42 43 81
## 0.0662080825 0.8291774147 0.0005732302 0.0865577529 0.0174835196
##
## [[6]]
## x
## 21 22 31 41 42 43 81
## 0.081948424 0.002292264 0.006303725 0.825501433 0.029226361 0.020630372 0.034097421
##
# Get unique names of all columns in tables in the list
resCol <- unique(unlist(lapply(ptl, names)))
# Get dimensions of desired result
nresCol <- length(resCol)
nresRow <- length(ptl)
# Create 'Template' data.frame row
DF <- as.data.frame(matrix(rep(0, nresCol), nrow = 1, dimnames = list(1, resCol)))
# for every table in list, create copy of DF, fill it appropriately, then rbind result together using do.call
result <- do.call(rbind, lapply(ptl, function(x) {
retDF <- DF
retDF[, names(x)] <- x
return(retDF)
}))
# rename rows(optional)
rownames(result) <- 1:nrow(result)
result
## 21 41 42 43 52 71 81 82 90 22 23 31
## 1 0.02007456 0.5815888 0.2248351018 0.09349011 0.05248064 0.01204474 0.00544881 0.01003728 0.0000000 0.000000000 0.00000000 0.000000000
## 2 0.01175122 0.3697334 0.3410719404 0.03066781 0.08655775 0.01633706 0.00000000 0.00000000 0.1438808 0.000000000 0.00000000 0.000000000
## 3 0.04325408 0.9303924 0.0014322544 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.008307075 0.01661415 0.000000000
## 4 0.10028653 0.6610315 0.0303724928 0.01604585 0.00000000 0.00000000 0.15616046 0.00000000 0.0000000 0.031232092 0.00000000 0.004871060
## 5 0.06620808 0.8291774 0.0005732302 0.08655775 0.00000000 0.00000000 0.01748352 0.00000000 0.0000000 0.000000000 0.00000000 0.000000000
## 6 0.08194842 0.8255014 0.0292263610 0.02063037 0.00000000 0.00000000 0.03409742 0.00000000 0.0000000 0.002292264 0.00000000 0.006303725
I'm just going to suggest one solution. How about concatenating all of the list elements into one data frame? You would then have
MyDataFrame
variable1 1 1 1 1 1 1 1 1
variable2 21 41 42 43 52 71 81 82
variable3 0.02007456 0.58158876 0.22483510 0.09349011 0.05248064 0.01204474 0.00544881 0.01003728
variable1 2 2 2 2 2 2 2
variable2 21 41 42 43 52 71 90
variable3 0.01175122 0.36973345 0.34107194 0.03066781 0.08655775 0.01633706 0.14388077
variable1 3 3 3 3 3
variable2 21 22 23 41 42
variable3 0.043254082 0.008307075 0.016614151 0.930392438 0.001432254
Once you have a single data frame, you can use the cast function from the reshape package, like this:
install.packages('reshape')
library('reshape')
cast(MyDataFrame, variable1~variable2)
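To make the long-then-wide idea concrete without extra packages, here is a base R sketch. It assumes the list elements can be treated as named numeric vectors; ptl below holds two of the tables from the question retyped by hand, and the column names id, category and value are my own choices.

```r
ptl <- list(
  c("21" = 0.01175122, "41" = 0.36973345, "42" = 0.34107194, "43" = 0.03066781,
    "52" = 0.08655775, "71" = 0.01633706, "90" = 0.14388077),
  c("21" = 0.043254082, "22" = 0.008307075, "23" = 0.016614151,
    "41" = 0.930392438, "42" = 0.001432254)
)

# long format: one row per (table, category) pair
long <- do.call(rbind, lapply(seq_along(ptl), function(i) {
  data.frame(id = i, category = names(ptl[[i]]), value = as.numeric(ptl[[i]]))
}))

# wide format: rows = tables, columns = categories, NA where a category is absent
wide <- tapply(long$value, list(long$id, long$category), sum)
```

Each (id, category) cell holds at most one value, so sum simply passes it through; combinations that never occur come back as NA.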
This won't be the most efficient, but using plyr and reshape2, and assuming your list of prop.tables is called foo
library(plyr)
library(reshape2)
allData <- dcast(ldply(lapply(seq_along(foo), function(x) data.frame(foo[[x]], id = x))),
id ~ x, value.var = 'Freq')
or, more straightforwardly,
ff <- c('21', '22', '23', '41', '42', '43', '52', '71', '81', '82', '90' )
t(sapply(foo, function(x,y) {x[ff]} ))