Create columns from tagged words - r

I have a vector of tagged words like c("#142#856#856.2#745", NA, "#856#855", NA, "#685", "#663", "#965.23", "#855#658#744#122").
The codes are separated by a hash sign (#). I would like to create a data frame with one column for each distinct code, containing 1 or 0 (or NA) depending on whether that code appears in that row.
The idea is that each element becomes a row and each code becomes a column; a cell is 1 if the code is in that element and 0 if it is not.
ID | 142 | 856 |856.2 | ... | 122 |
1 | 1 | 1 | 1 | ... | 0 |
2 | 0 | 0 | 0 | ... | 0 |
...
I know how to do this with a complicated algorithm full of loops. But is there an easier way?

You can accomplish this fairly easily using stringr:
# First we load the package
library(stringr)
# Then we create your example data vector
tagged_vector <- c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663',
                   '#965.23', '#855#658#744#122')
# Next we need to get all the unique codes
# stringr's str_extract_all() can do this:
all_codes <- str_extract_all(string=tagged_vector, pattern='(?<=#)[0-9\\.]+')
# We just looked for one or more numbers and/or dots following a '#' character
# Now we just want the unique ones:
unique_codes <- unique(na.omit(unlist(all_codes)))
# Then we can use grepl() to check, for each code, which elements contain it
# I've also used as.numeric() since you want 0/1 instead of TRUE/FALSE
result <- data.frame(sapply(unique_codes, function(x){
  as.numeric(grepl(x, tagged_vector))
}))
# Then we add in your ID column and move it to the front:
result$ID <- 1:nrow(result)
result <- result[ , c(ncol(result), 1:(ncol(result)-1))]
The result is:
  ID X142 X856 X856.2 X745 X855 X685 X663 X965.23 X658 X744 X122
1  1    1    1      1    1    0    0    0       0    0    0    0
2  2    0    0      0    0    0    0    0       0    0    0    0
3  3    0    1      0    0    1    0    0       0    0    0    0
4  4    0    0      0    0    0    0    0       0    0    0    0
5  5    0    0      0    0    0    1    0       0    0    0    0
6  6    0    0      0    0    0    0    1       0    0    0    0
7  7    0    0      0    0    0    0    0       1    0    0    0
8  8    0    0      0    0    1    0    0       0    1    1    1
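You may notice that an "X" precedes each code in the column names. That's because a syntactically valid name in R may not begin with a digit, and data.frame() converts column names to valid ones by default (check.names = TRUE).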
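One caveat with pattern-based matching such as grepl(): a code like 856 is also a substring of 856.2, and an unescaped dot in a regular expression matches any character, so a code can be reported for an element that only contains a longer, similar code. The example data happens not to trigger this. A minimal base-R sketch that tests exact membership instead, assuming the same example vector (the names code_list and indicator are introduced here only for illustration):

```r
tagged_vector <- c("#142#856#856.2#745", NA, "#856#855", NA, "#685", "#663",
                   "#965.23", "#855#658#744#122")

# Split every element on a literal '#'; NA elements come back as NA
code_list <- strsplit(tagged_vector, "#", fixed = TRUE)

# Drop NAs and the empty string produced by the leading '#'
code_list <- lapply(code_list, function(codes) codes[!is.na(codes) & codes != ""])

# All distinct codes, in order of first appearance
unique_codes <- unique(unlist(code_list))

# One row per element, one column per code; %in% compares whole codes only
indicator <- t(sapply(code_list, function(codes) as.integer(unique_codes %in% codes)))
colnames(indicator) <- unique_codes

# check.names = FALSE keeps the codes as column names without the "X" prefix
result <- data.frame(ID = seq_along(tagged_vector), indicator, check.names = FALSE)
```

Columns that do not start with a letter have to be accessed with quotes or backticks, for example result[["856.2"]].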

Related

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive but I can't figure a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the all-zero rows using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify, for each row, the columns that contain a 1:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solve the problem, but I'm not sure how I would do it without using a lot of logical operators (I don't know in advance which columns have matches). What I would like to do is to move the rows that have two or more 1's into a new column named after the matching columns, so that the data frame looks like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = TRUE, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
#   A AC AD B D
# 1 1  0  0 0 0
# 2 0  0  0 0 1
# 3 0  0  0 0 1
# 4 0  1  0 0 0
# 5 0  1  0 0 0
# 6 0  0  0 1 0
# 7 0  0  0 1 0
# 8 0  0  1 0 0
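One fragile spot in this kind of approach: apply(m == 1, 1, which) returns a list only when the rows contain different numbers of ones. If every remaining row happened to contain the same number of ones (two or more), apply() would simplify the result to a matrix and the labels would come out wrong. A variant that builds the label inside apply() always yields one string per row. This is a sketch under the same example data; the name labels is introduced here for illustration:

```r
m <- data.frame(A = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1),
                B = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0),
                C = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
                D = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 1))

# Drop the all-zero rows
m <- m[rowSums(m) > 0, ]

# For each row, paste together the names of the columns that hold a 1
labels <- apply(m == 1, 1, function(r) paste(colnames(m)[r], collapse = ""))

# One indicator column per distinct combination
d <- factor(labels)
result <- as.data.frame(model.matrix(~ d + 0))
names(result) <- levels(d)
```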

Permutation position of numbers in R

I'm looking for a function in R that can produce a random permutation. For example, I have a vector with five 1s and ten 0s like this:
> status=c(rep(1,5),rep(0,10))
> status
[1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
Now I'd like to randomly permute the positions of these numbers while keeping the same number of 0s and 1s in the vector, to get a new series such as:
1 1 0 1 0 1 0 0 0 0 0 1 0 0 0
or
1 0 0 0 0 0 0 1 1 0 0 1 0 1 0
I found that the function sample() can draw samples, but the number of 1s and 0s is not the same each time. Do you know how I can do this in R? Thanks in advance.
We can use sample:
sample(status)
#[1] 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0
sample(status)
#[1] 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0
When sample() is called on the whole vector without a size argument, it returns a random permutation, so the count of each distinct value stays the same:
colSums(replicate(5, sample(status)))
#[1] 5 5 5 5 5
i.e. each permutation contains five 1s, so the remaining ten elements are 0s.
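If the counts did differ between draws, a likely cause (an assumption, since the question does not show the call that was used) is sampling with replace = TRUE. A short sketch contrasting the two calls; the seed value is arbitrary and only there for reproducibility:

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

status <- c(rep(1, 5), rep(0, 10))

# Permutation: every element is used exactly once, so counts are preserved
perm <- sample(status)

# Sampling with replacement: elements can repeat, so counts may change
resampled <- sample(status, replace = TRUE)

# The number of 1s across many permutations is always 5
counts <- colSums(replicate(100, sample(status)))
```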

Merging two columns with two values

I have columns whose names I know and whose values are 0 and 1.
I would like to merge them into one column: if a row contains a 1, take the 1, and if both values are 1, keep 1.
Example of data:
stockI stockII
1 0
1 0
0 0
0 0
0 0
0 0
0 0
1 0
0 0
1 1
The output I would expect:
stockI/stockII
0
1
0
0
0
0
0
0
0
1
Is there a cbind-like method to do this?
We can try:
as.integer(with(df1, (c(FALSE, stockI[-1] & stockI[-nrow(df1)]) & stockI) |
                     (stockI & stockII)))
#[1] 0 1 0 0 0 0 0 0 0 1
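The expression above reproduces the output listed in the question. If the intent is instead what the prose describes (1 whenever either column holds a 1), a plain element-wise OR would do. Note that this yields a different vector from the one listed in the question, so it is only a sketch of that alternative reading; df1 is rebuilt here from the example data:

```r
df1 <- data.frame(stockI  = c(1, 1, 0, 0, 0, 0, 0, 1, 0, 1),
                  stockII = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1))

# Element-wise OR: 1 if either column is 1, 0 only if both are 0
merged <- as.integer(df1$stockI | df1$stockII)

# Equivalent for 0/1 data: the row-wise maximum
merged_max <- as.integer(pmax(df1$stockI, df1$stockII))
```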

Apply a function to elements of matrix on condition

I'm looking to change the value of a certain entry in a matrix based on the value of another entry. It's easiest to explain with an example:
Matrix
ABC-DEF 1 0 0 0
HIJ-KLM 0 0 0 0
NOP-QRS 1 0 0 0
KLM-HIJ 0 0 0 0
DEF-ABC 0 0 0 0
QRS-NOP 0 0 0 0
As you can see, each of the rows in the matrix above has a counterpart (e.g. ABC-DEF's counterpart is DEF-ABC).
Is there some way in which I can look to see which rows have a one in the first column and then place a 2 in the fourth column of its counterpart? In the above example then:
ABC-DEF 1 0 0 0
HIJ-KLM 0 0 0 0
NOP-QRS 1 0 0 0
KLM-HIJ 0 0 0 0
DEF-ABC 0 0 0 2
QRS-NOP 0 0 0 2
I'm quite stuck and would really appreciate any help!
Thanks!
Assuming your data is a data frame d with column names V1,...,V5, you can do something like this:
values <- d$V1[d$V2==1]
d$V5[d$V1 %in% gsub("(...)-(...)","\\2-\\1", values)] <- 2
Which will give:
V1 V2 V3 V4 V5
1 ABC-DEF 1 0 0 0
2 HIJ-KLM 0 0 0 0
3 NOP-QRS 1 0 0 0
4 KLM-HIJ 0 0 0 0
5 DEF-ABC 0 0 0 2
6 QRS-NOP 0 0 0 2
If, instead of a data frame, your data is a numeric matrix m with row names, you can do:
values <- rownames(m)[m[,1]==1]
m[rownames(m) %in% gsub("(...)-(...)","\\2-\\1", values),4] <- 2
EDIT: To understand what the code is doing, note that:
gsub("(...)-(...)","\\2-\\1", values)
will replace any character string in the values vector of the form XXX-YYY with YYY-XXX via regexp matching (each ... matches exactly three characters). The result is a character vector of the "counterparts" of values. Then we use %in% to select every row whose row name appears in these counterpart values, and assign 2 in the fourth column for these rows.
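For reference, a self-contained sketch of the matrix version on the example data. The pattern here is a variation, not the one from the answer: "^(.*)-(.*)$" swaps the two halves whatever their length, under the assumption that each row name contains exactly one hyphen. The name counterparts is introduced for illustration:

```r
row_ids <- c("ABC-DEF", "HIJ-KLM", "NOP-QRS", "KLM-HIJ", "DEF-ABC", "QRS-NOP")

# 6 x 4 matrix of zeros with the pair labels as row names
m <- matrix(0, nrow = 6, ncol = 4, dimnames = list(row_ids, NULL))
m[c("ABC-DEF", "NOP-QRS"), 1] <- 1

# Row names that have a 1 in the first column
values <- rownames(m)[m[, 1] == 1]

# Swap the two halves around the hyphen: "ABC-DEF" becomes "DEF-ABC"
counterparts <- sub("^(.*)-(.*)$", "\\2-\\1", values)

# Put a 2 in the fourth column of every counterpart row
m[rownames(m) %in% counterparts, 4] <- 2
```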

Simple crosstable with row- and multicolumn columnnames from R to latex

I am trying to produce a simple crosstable in R and have that exported to latex using knitr in Rstudio.
I want the table to look like a publishable table, with a row header, a column header, and subheaders for each category of the variable in the columns. Since my table has identical categories for rows and columns, I wish to replace the column-level headers with numbers. See the example below:
Profession Mother
ProfessionFather 1. 2. 3.
1. Bla frequency frequency frequency
2. blahabblab
3. blahblahblah
I am getting close with 'xtable' (but I can't get the row and column headers to print, nor a multicolumn header) and with the 'tables' package (but I can't replace the column categories with numbers).
Minimal example:
library(xtable)
library(tables)
work1 <- paste("LongString", 1:10, sep="")
work2 <- paste("LongString", 1:10, sep="")
t <- table(work1, work2) # making table
t # table with repeated row/column names
colnames(t) <- paste(1:10, ".", sep="") # replacing column names with numeric values
xtable(t) # headers are omitted for both rows and columns
work <- data.frame(cbind(work1, work2)) # prepare for use of tabular
# has headers, but no way to change column categories from "LongString"x to numeric
tabular((FathersProfession=work1) ~ (MothersProfession=work2), data=work)
You need to assign the output of the tabular function to a named object:
tb <- tabular((FathersProfession=work1) ~ (MothersProfession=work2), data=work)
str(tb)
The output of str() shows that the data is stored in a list and that the column names are in the attribute that begins:
- attr(*, "colLabels")= chr [1:2, 1:10] "MothersProfession" "LongString1" NA "LongString10" ...
So:
attr(tb, "colLabels") <-
  gsub("LongString", "", attr(tb, "colLabels"))
This is the resulting output on screen; the output to a LaTeX device would look different.
> tb
                  MothersProfession
FathersProfession 1 10 2 3 4 5 6 7 8 9
LongString1       1  0 0 0 0 0 0 0 0 0
LongString10      0  1 0 0 0 0 0 0 0 0
LongString2       0  0 1 0 0 0 0 0 0 0
LongString3       0  0 0 1 0 0 0 0 0 0
LongString4       0  0 0 0 1 0 0 0 0 0
LongString5       0  0 0 0 0 1 0 0 0 0
LongString6       0  0 0 0 0 0 1 0 0 0
LongString7       0  0 0 0 0 0 0 1 0 0
LongString8       0  0 0 0 0 0 0 0 1 0
LongString9       0  0 0 0 0 0 0 0 0 1
