Separate space-separated strings in a dataframe into different columns - r

I am pretty new to R and I have some data that I need to tidy a bit before I can use it. Basically, I have a dataframe with a bunch of rows and columns, and every cell of this dataframe holds a string of 20 ones and zeros ("0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0").
Now I am trying to split every field so that each number ends up in its own column (one field would become 20 columns). After that I would like to convert these newly separated strings into numbers. Here is a small sample of the data; for it I would need the numbers separated into 40 columns and 3 rows:
df<-data.frame(
"V1" = c("0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ","0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ","1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "),
"V2" = c("0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ","0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 ","0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 "))
As you can see, a good way to separate the numbers in each string would be to treat the space as a delimiter, but I am not having any luck with that. I tried df<-lapply(strsplit(df, " "), as.numeric), but strsplit cannot be applied to a dataframe. I then tried df<-lapply(strsplit(as.character(df), " "), as.numeric)
That way it separates correctly, but converting the whole dataframe to character messes up the data.
I suppose it's easier than I think, but I still lack skill with this kind of code.

An easier option is read.table (no packages needed)
read.table(text = as.character(df$V1), header = FALSE)
For multiple columns, use lapply
lapply(df, function(x) read.table(text = as.character(x), header = FALSE))
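The lapply call returns a list with one data frame per original column. A minimal sketch of binding those pieces back into a single numeric data frame, using base R only (the shortened three-digit sample strings and the names parts and wide are made up for illustration):

```r
# Shortened sample data (assumption: 3 digits per cell instead of 20)
df <- data.frame(
  V1 = c("0 0 1 ", "0 1 0 ", "1 0 0 "),
  V2 = c("1 0 0 ", "0 0 1 ", "0 1 0 "),
  stringsAsFactors = FALSE
)

# One data frame per original column; read.table parses the digits as numbers
parts <- lapply(df, function(x) read.table(text = as.character(x), header = FALSE))

# Bind the pieces side by side into one data frame
wide <- do.call(cbind, parts)

dim(wide)                # rows and columns of the combined result
sapply(wide, is.numeric) # check that every column is numeric
```

Since read.table already parses the digits as numbers, no separate as.numeric step should be needed with this route.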

You can use cSplit from splitstackshape to convert multiple columns into separate columns.
splitstackshape::cSplit(df, names(df), " ")
# V1_01 V1_02 V1_03 V1_04 V1_05 V1_06 V1_07 V1_08 V1_09 V1_10 V1_11
#1: 0 0 0 0 0 0 0 0 0 0 0
#2: 0 0 0 1 0 0 0 0 0 0 0
#3: 1 0 0 0 0 0 0 0 0 0 0
# V1_12 V1_13 V1_14 V1_15 V1_16 V1_17 V1_18 V1_19 V1_20 V2_01 V2_02
#1: 0 0 0 1 0 0 0 0 0 0 0
#2: 0 0 0 0 0 0 0 0 0 0 0
#3: 0 0 0 0 0 0 0 0 0 0 0
# V2_03 V2_04 V2_05 V2_06 V2_07 V2_08 V2_09 V2_10 V2_11 V2_12 V2_13
#1: 0 0 0 0 1 0 0 0 0 0 0
#2: 0 0 0 0 0 0 0 0 0 0 0
#3: 0 0 0 0 0 0 0 1 0 0 0
# V2_14 V2_15 V2_16 V2_17 V2_18 V2_19 V2_20
#1: 0 0 0 0 0 0 0
#2: 0 0 0 0 0 1 0
#3: 0 0 0 0 0 0 0
Note that I have used names(df) here since you want to convert all the columns into separate columns. If you have additional columns and want to separate only a few of them, you can also do
splitstackshape::cSplit(df, c("V1", "V2"), " ")

I found both answers equally good, but I think using cSplit made the subsequent processing easier. What I finally did to obtain the result:
df<-splitstackshape::cSplit(df, names(df), " ")
df<-lapply(df,as.numeric)
df<-as.data.frame(df)
I suppose this can be done with fewer lines of code, but this way is more understandable to me. Thank you very much for your answers!

Related

Genetic Algorithm in R: Specify number of 1s in binary chromosomes

I am using the rbga function, but my question still stands for other genetic algorithm implementations in R. Is there a way to specify the number of 1s in binary chromosomes?
I have the following example provided by the library documentation.
data(iris)
library(MASS)
library(genalg)  # provides rbga.bin
X <- as.data.frame(cbind(scale(iris[,1:4]), matrix(rnorm(36*150), 150, 36)))
Y <- iris[,5]
iris.evaluate <- function(indices) {
print("Chromosome")
print(indices)
print("================================")
result = 1
if (sum(indices) > 2) {
huhn <- lda(X[,indices==1], Y, CV=TRUE)$posterior
result = sum(Y != dimnames(huhn)[[2]][apply(huhn, 1,
function(x)
which(x == max(x)))]) / length(Y)
}
result
}
monitor <- function(obj) {
minEval = min(obj$evaluations);
plot(obj, type="hist");
}
woppa <- rbga.bin(size=40, mutationChance=0.05, zeroToOneRatio=10,
evalFunc=iris.evaluate, showSettings=TRUE, verbose=TRUE)
Here are some of the chromosomes.
"Chromosome"
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
"================================"
The numbers of 1s (i.e., the chosen characteristics) in these chromosomes are 5, 8, 5 and 4 respectively.
I am trying to follow the technique specified in a paper and they claim that they apply a genetic algorithm and in the end they pick a specific number of characteristics.
Is it possible to specify in a genetic algorithm the number of characteristics that I want my solution(s)/chromosome(s) to have?
Could this be done on the final solution/chromosome and if yes how?
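One possible direction, not taken from this thread, is to leave the algorithm itself alone and penalise chromosomes whose number of 1s differs from the target inside the evaluation function. The sketch below assumes the GA minimises the evaluation value (as the error rate in iris.evaluate suggests). The names with_size_penalty, toy_eval, k and weight are hypothetical, and toy_eval only stands in for a real evaluation function:

```r
# Hypothetical helper: wraps any evaluation function with a size penalty
with_size_penalty <- function(base_eval, k, weight = 1) {
  function(indices) {
    # Base score plus a penalty that grows with the distance from k ones
    base_eval(indices) + weight * abs(sum(indices) - k)
  }
}

# Toy evaluation function (assumption): always returns the same error rate
toy_eval <- function(indices) 0.25

penalised_eval <- with_size_penalty(toy_eval, k = 5)

exact <- penalised_eval(c(1, 1, 1, 1, 1, 0, 0, 0))  # five 1s: no penalty added
short <- penalised_eval(c(1, 1, 1, 0, 0, 0, 0, 0))  # three 1s: penalty of 2 added
```

With the example above, the wrapped function would be passed as evalFunc = with_size_penalty(iris.evaluate, k = 5). This only steers the search towards chromosomes with k ones; it does not guarantee that the final solution has exactly that many.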

Filling a table with additional columns if they don't exist

I have the following tricky problem. Here is a short example of my data. Assume that I have two data sets (my real example has about 20). The data frames are returned as a list by a self-written function called with lapply, so I put the data frames in my example in a list, too. Then I "rbind" them to compute a frequency table.
df1 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df1) <- c("k", "a")
df2 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df2) <- c("k", "a")
list_df <- list(df1,df2)
df_combine<- plyr::ldply(list_df, rbind)
freq_foo <- table(df_combine$k,df_combine$a)
I get a frequency table of the following form.
a=0 a=11 a=12 a=2 a=5 a=6 a=7 a=8 a=3 a=9
1 1 0 0 0 0 0 0 1 0 0
2 1 0 0 0 0 0 0 0 0 1
3 1 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 0 0 0
5 0 0 0 1 1 0 0 0 0 0
6 0 0 0 0 0 0 1 0 0 1
7 0 1 1 0 0 0 0 0 0 0
8 1 0 0 0 0 1 0 0 0 0
9 0 0 0 0 0 0 2 0 0 0
10 0 0 1 0 1 0 0 0 0 0
11 1 1 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 1 0 1 0
13 1 0 1 0 0 0 0 0 0 0
I want to extend and manipulate my table in the following way:
First, the table should cover the range a=0 to a=15, so if a column is missing, it should be added. Second, I want to order the columns from 0 to 15.
For the first problem I tried
if(freq_foo$paste0("a=",0:15) == F){freq_foo$paste("a=",0:15) <- 0}
but this could work only for data frames and not for tables. Also, I have no idea how to put the columns in ascending order. The data type isn't important to me because I just want to use the output for further calculations, so it can also be a data frame instead of a table.
#convert freq_foo table to dataframe
df <- as.data.frame.matrix(freq_foo)
#add all zeros column for missing column name in 0:15 series
df[, paste0("a=", c(0:15)[!(c(0:15) %in% as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))])] <- 0
#order columns from 0 to 15
df <- df[, order(as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))]
Output is:
a=0 a=1 a=2 a=3 a=4 a=5 a=6 a=7 a=8 a=9 a=10 a=11 a=12 a=13 a=14 a=15
1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0
5 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
7 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
8 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
10 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
11 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
12 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
13 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
(Edit: Updated code after getting a requirement clarification from OP)
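An alternative sketch that stays with table: if the a column is turned into a factor whose levels already list a=0 to a=15 in the desired order, table creates the missing columns (filled with zeros) and orders them in one step. The data below is a made-up stand-in for df_combine, and all_levels and freq_full are names chosen for illustration:

```r
set.seed(1)

# Made-up stand-in for df_combine: two stacked data sets with k = 1..13
df_combine <- data.frame(
  k = rep(1:13, times = 2),
  a = paste0("a=", sample(0:12, 26, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Every column that should appear, in the desired order
all_levels <- paste0("a=", 0:15)

# Unused factor levels still get a column, filled with zeros
freq_full <- table(df_combine$k, factor(df_combine$a, levels = all_levels))

colnames(freq_full)
```

If a data frame is preferred for the further calculations, as.data.frame.matrix(freq_full) converts the result as in the answer above.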

Can't name my columns as dates

I have a range of dates where some products were bought. I create a sort of a pivot table relating the products and the dates, but there are dates where nothing was sold. I can find the missing dates and even add them to the main data frame, the problem is that instead of keeping the date format, they adopt the integer format (with the integer being the distance to origin) and I can't order them. The code I'm using is this:
upper.bound<- paste("01", month[1], 2013, sep="-")
lower.bound <- paste("30", month[4], 2013, sep="-")
dates <- seq(as.Date(upper.bound, "%d-%m-%Y"), as.Date(lower.bound, "%d-%m-%Y"), "days")
diff <- setdiff(dates, as.Date(colnames(export_f_ub), "%Y-%m-%d"))
len <- dim(as.matrix(diff))[1]*11
aux <- data.frame()
aux <- seq(0,0,length.out=len)
dim(aux) <- c(11, dim(as.matrix(diff))[1])
col_dates <- as.Date(diff, origin="1970-01-01")
colnames(aux)<- c(col_dates)
This was an attempt to set up a matrix of zeros and then bind it to the main one. But this doesn't work, as in the result I get the column names as numeric. Here's a screenshot of the console:
[image: console log]
I've never seen someone try to assign a Date vector as column names of a matrix. Dimension names must always be character strings, so in general this is not something you should be doing.
That being said, in terms of effect, the intuitive expectation would be that the column-name-assignment machinery in R would at some point coerce the Date vector to character along the lines of as.character(), and thus you'd get the text representation of the dates, rather than a stringification of their underlying double values.
Calling `colnames<-`() on a matrix eventually calls `dimnames<-`() which drops into the C code by running .Primitive("dimnames<-"). I haven't really looked into the C implementation, but we can guess that at some point it pulls out the double values underlying the Date vector, coerces them to character, and that's why you end up with numbers as your column names.
The correct approach here is to call as.character() yourself when assigning the names:
col_dates <- as.Date(c('2013-06-03','2013-06-04','2013-06-05','2013-06-06','2013-06-08','2013-06-22','2013-07-07','2013-07-08','2013-07-11','2013-07-13','2013-07-23','2013-07-25','2013-07-26','2013-08-27','2013-09-03','2013-09-04','2013-09-05','2013-09-06','2013-09-07','2013-09-09','2013-09-10','2013-09-11','2013-09-13','2013-09-14','2013-09-15','2013-09-16','2013-09-18','2013-09-20','2013-09-21','2013-09-22','2013-09-24','2013-09-30'));
aux <- matrix(0,11L,length(col_dates));
colnames(aux) <- as.character(col_dates);
aux;
## 2013-06-03 2013-06-04 2013-06-05 2013-06-06 2013-06-08 2013-06-22 2013-07-07 2013-07-08 2013-07-11 2013-07-13 2013-07-23 2013-07-25 2013-07-26 2013-08-27 2013-09-03 2013-09-04 2013-09-05 2013-09-06 2013-09-07 2013-09-09 2013-09-10 2013-09-11 2013-09-13 2013-09-14 2013-09-15 2013-09-16 2013-09-18 2013-09-20 2013-09-21 2013-09-22 2013-09-24 2013-09-30
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [11,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
But you should be aware that the column names do not retain the Date class or internal representation (referring to the double values); they are pure character strings. If you want to recreate the Date vector from the column names, you'll have to run them through as.Date().
And by the way, dim(as.matrix(x))[1] for a vector x is an unnecessarily roundabout way of getting length(x).
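To illustrate the round trip, here is a small sketch with three made-up dates deliberately given out of order: the names are stored as character strings, converted back with as.Date(), and then used to put the columns in date order (recovered and aux_sorted are names chosen for illustration):

```r
# Three made-up dates, deliberately out of order
col_dates <- as.Date(c("2013-06-05", "2013-06-03", "2013-06-04"))

aux <- matrix(0, nrow = 2, ncol = length(col_dates))
colnames(aux) <- as.character(col_dates)  # stored as plain character strings

# Recreate the Date vector from the column names
recovered <- as.Date(colnames(aux))

# Reorder the columns chronologically
aux_sorted <- aux[, order(recovered), drop = FALSE]
colnames(aux_sorted)
```

Because the default text form is year-month-day, sorting the character names directly would give the same order here, but going through as.Date() keeps the intent explicit.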

Remove labels from imported file in R

I am importing a file and trying to display only the numbers in each row, without any commas or labels. With the following code, my output is given below:
mydata <- read.table("/home/mukhera3/Desktop/Test/part-r-00000", sep=",")
mydata
Output
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V461 V462 V463 V464 V465 V466 V467 V468 V469 V470 V471 V472 V473 V474 V475 V476 V477 V478 V479 V480 V481 V482 V483
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V484 V485 V486 V487 V488 V489 V490 V491 V492 V493 V494 V495 V496 V497 V498 V499 V500 V501 V502 V503 V504 V505 V506
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V507 V508 V509 V510 V511 V512 V513 V514 V515 V516 V517 V518 V519 V520 V521 V522 V523 V524 V525 V526 V527 V528 V529
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V530 V531 V532 V533 V534 V535 V536 V537 V538 V539 V540 V541 V542 V543 V544 V545 V546 V547 V548 V549 V550 V551 V552
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V553 V554 V555 V556 V557 V558 V559 V560 V561 V562 V563 V564 V565 V566 V567 V568 V569 V570 V571 V572 V573 V574 V575
When I replace the "," for sep with whitespace (sep=""), keeping everything else the same, this is what I get:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
I want to display the numbers 0,1 .. without any commas or other row numbers etc. I am new to R programming, and do not know how to do this. Any help would be appreciated.
If you want your file to be read directly as a vector and not as a dataframe, you can, for instance, use scan instead of read.table. Example with your example file saved as a.txt in my working directory:
> mydata <- scan(file="a.txt",sep=",")
Read 46 items
> mydata
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
You can also get that result from read.table with some additional steps:
> mydata <- read.table("a.txt",sep=",") # Reads your file as a data.frame
> mydata <- unlist(mydata) # Transforms into a named vector
> names(mydata) <- NULL # Gets rid of the names
> mydata
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
If you just want to "display" it like that but don't want to change the nature of your table, you can simply use cat (combined with unlist):
> mydata <- read.table("a.txt",sep=",")
> cat(unlist(mydata))
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
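To try the scan approach without a file on disk, the text argument of scan can stand in for the file. A small sketch; the input string is made up and much shorter than the real file:

```r
# Made-up comma-separated line standing in for the contents of the file
raw_line <- "0,0,0,1,0,0"

# quiet = TRUE suppresses the "Read n items" message
values <- scan(text = raw_line, sep = ",", quiet = TRUE)

# Print the bare numbers separated by spaces, without index labels
cat(values, "\n")
```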

Loop through two data tables from column to row wise?

I have two data frames:
DT1: (This data frame's column values I need to edit based on another datatable DT2)
BIC BCC1 BCC2 BCC6 BCC8 BCC9 BCC10 BCC11
990081899A 0 1 0 0 0 0 0
9900023620 0 1 1 0 0 0 0
9900427160 0 1 0 1 0 0 0
990064457TA 1 1 0 1 0 0 0
990066595A 0 0 0 0 0 0 1
990088248A 0 0 0 0 0 0 1
990088882C1 0 0 0 0 0 0 1
990088882C2 0 0 0 1 1 0 0
990088882C3 0 0 0 1 1 0 0
990088882C4 0 0 0 1 1 0 0
990088882C5 0 0 0 1 1 0 0
DT2:
BCC HIER1 HIER2 HIER3 HIER4 HIER5
BCC8 BCC9 BCC10 BCC11 BCC12 0
BCC9 BCC10 BCC11 BCC12 0 0
BCC10 BCC11 BCC12 0 0 0
BCC11 BCC12 0 0 0 0
BCC17 BCC18 BCC19 0 0 0
BCC18 BCC19 0 0 0 0
BCC27 BCC28 BCC29 BCC80 0 0
BCC28 BCC29 0 0 0 0
BCC46 BCC48 0 0 0 0
BCC54 BCC55 0 0 0 0
BCC57 BCC58 0 0 0 0
BCC70 BCC71 BCC72 BCC103 BCC104 BCC169
I want to look up the column names of DT1 in the first column of DT2 (DT2$BCC) and apply the hierarchy logic as follows:
I want to loop through the DT1 column names (except the first column) and, in a nested loop, through the DT2$BCC values to check whether they are equal. If they are equal, take that DT2$BCC value and check whether the DT1 column of that name equals 1; if so, set to 0 the DT1 columns that are listed in that DT2 row under HIER1, HIER2, HIER3, ...
Result should be:
BIC BCC1 BCC2 BCC6 BCC8 BCC9 BCC10 BCC11
990081899A 0 1 0 0 0 0 0
9900023620 0 1 1 0 0 0 0
9900427160 0 1 0 1 0 0 0
990064457TA 1 1 0 1 0 0 0
990066595A 0 0 0 0 0 0 0
990088248A 0 0 0 0 0 0 0
990088882C1 0 0 0 0 0 0 0
990088882C2 0 0 0 1 0 0 0
990088882C3 0 0 0 1 0 0 0
990088882C4 0 0 0 1 0 0 0
990088882C5 0 0 0 1 0 0 0
I am doing this now:
cols<-setdiff(names(DT1), "HIC")
subs<-as.character(DT2$BCC)
colsHier<-setdiff(names(DT2), "BCC")
paste0("DT1$", eval(cols[i]))<-
for( i in 1:length(cols)){
for (k in 1:length(subs)){
ifelse(cols[i] == subs[k],
ifelse(do.call(paste0, list('DT1$', eval(cols[1]),'[]')) == 1,
for (j in 1:length(colsHeir)){
if(colsHeir[j]!= 0)
x<-paste0('DT2$',eval(colsHier[j]))
paste0('DT1$',eval(x[k])):= 0}
,DT1$cols[i]), DT1$cols[i])}}
I am trying to match the value of do.call(paste0, list('DT1$', eval(cols[1]),'[]')) == 1, but when I am running this expression in R I am getting following:
> do.call(paste0, list('DT1$', eval(cols[2]),'[1]'))
[1] "DT1$BCC2[1]"
and NOT the value of the cell. How can I access the value of that cell to match with 1.
I am not able to find the correct way of doing this. I am sorry for the long question. Any help is appreciated.
library(reshape2)
Melt the data (dt1 and dt2 here are your DT1 and DT2):
dt1.m <- melt(dt1, id = "BIC")
dt2.m <- melt(dt2, id = "BCC")
If dt1.m$variable is equal to one of the values in dt2.m, set it to 0:
dt1.m$value <- ifelse(dt1.m$variable %in% dt2.m$value, 0, dt1.m$value)
Cast the data back into the proper form:
dt1.c <- dcast(dt1.m, ...~variable)
Note that dcast automatically reorders the rows.
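For comparison, here is a base R sketch that follows the row-wise rule described in the question more literally: a child column is set to 0 only in rows where its parent column equals 1. It also shows that a column can be read by name with DT1[[name]] instead of pasting together a string like "DT1$BCC2". The small DT1 and DT2 below are made up, and hier_cols, parent, children and has_parent are names chosen for illustration:

```r
# Made-up miniature versions of DT1 and DT2
DT1 <- data.frame(
  BIC   = c("A", "B", "C"),
  BCC8  = c(1, 0, 1),
  BCC9  = c(1, 1, 0),
  BCC10 = c(0, 1, 1),
  stringsAsFactors = FALSE
)
DT2 <- data.frame(
  BCC   = c("BCC8", "BCC9"),
  HIER1 = c("BCC9", "BCC10"),
  HIER2 = c("BCC10", "0"),
  stringsAsFactors = FALSE
)

hier_cols <- setdiff(names(DT2), "BCC")

for (r in seq_len(nrow(DT2))) {
  parent <- DT2$BCC[r]
  if (!parent %in% names(DT1)) next            # parent is not a DT1 column

  # Children listed in this DT2 row that actually exist as DT1 columns
  children <- intersect(unlist(DT2[r, hier_cols], use.names = FALSE), names(DT1))
  if (length(children) == 0) next

  has_parent <- DT1[[parent]] == 1             # access the column by its name
  if (any(has_parent)) DT1[has_parent, children] <- 0
}

DT1
```

The rows of DT2 are processed in order, so a parent that was itself zeroed by an earlier row no longer triggers its own children. In a hierarchy like the one in the question, where each row already lists all lower levels, that should not change the result.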
