Simple crosstable with row and multicolumn column names from R to LaTeX - r

I am trying to produce a simple crosstable in R and export it to LaTeX using knitr in RStudio.
I want the table to look like a publishable table, with a row header, a column header, and subheaders for each category of the variable in the column. Since my table has identical categories for rows and columns, I wish to replace the column level headers with numbers. See the example below:
                   Profession Mother
Profession Father  1.         2.         3.
1. Bla             frequency  frequency  frequency
2. blahabblab
3. blahblahblah
I am getting close with 'xtable' (but I can't get the row and column headers to print, nor a multicolumn header) and with the 'tables' package (but I can't replace the column categories with numbers).
Minimal example:
work1 <- paste("LongString", 1:10, sep="")
work2 <- paste("LongString", 1:10, sep="")
t <- table(work1, work2) # making table
t # table with repeated row/column names
colnames(t) <- paste(1:10, ".", sep="") # replacing column names with numeric values
xtable(t) # headers are omitted for both rows and columns
work <- data.frame(cbind(work1, work2)) # prepare for use of tabular
tabular((FathersProfession=work1) ~ (MothersProfession=work2), data=work) # have headers, but no way to change column categories from "LongString"x to numeric.

You need to assign the output of the tabular function to a named object:
tb <- tabular((FathersProfession=work1) ~ (MothersProfession=work2), data=work)
str(tb)
It should be obvious that the data is in a list and that the column-names are in the attribute that begins:
- attr(*, "colLabels")= chr [1:2, 1:10] "MothersProfession" "LongString1" NA "LongString10" ...
So:
attr(tb, "colLabels") <- gsub("LongString", "", attr(tb, "colLabels"))
This is then the output to the screen; the output to a LaTeX device would be different (a sketch of how to produce it from knitr follows the printout).
> tb
MothersProfession
FathersProfession 1 10 2 3 4 5 6 7 8 9
LongString1 1 0 0 0 0 0 0 0 0 0
LongString10 0 1 0 0 0 0 0 0 0 0
LongString2 0 0 1 0 0 0 0 0 0 0
LongString3 0 0 0 1 0 0 0 0 0 0
LongString4 0 0 0 0 1 0 0 0 0 0
LongString5 0 0 0 0 0 1 0 0 0 0
LongString6 0 0 0 0 0 0 1 0 0 0
LongString7 0 0 0 0 0 0 0 1 0 0
LongString8 0 0 0 0 0 0 0 0 1 0
LongString9 0 0 0 0 0 0 0 0 0 1
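For the LaTeX side in knitr, here is a hedged sketch, assuming a chunk with results='asis'; depending on your version of the 'tables' package the exporter is toLatex() (newer releases) or latex() (older ones):
library(tables)
# Relabel the column categories as above, then emit LaTeX directly into the document.
attr(tb, "colLabels") <- gsub("LongString", "", attr(tb, "colLabels"))
# The row labels can, I believe, be relabelled the same way via attr(tb, "rowLabels").
toLatex(tb)   # older versions of the 'tables' package used latex(tb) instead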

Related

Create columns from tagged words

I have a vector with tagged words like c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663', '#965.23', '#855#658#744#122').
Words are separated by a hash sign ('#'). I would like to create a data frame with one column for each distinct code, and then write 1 or 0 (or NA) depending on whether that code is in that row or not.
The idea is that each element becomes a row and each code becomes a column; if the code is present in that element, the corresponding column is marked with 1, otherwise 0.
ID | 142 | 856 | 856.2 | ... | 122
 1 |   1 |   1 |     1 | ... |   0
 2 |   0 |   0 |     0 | ... |   0
...
I know how to do this with a complicated algorithm full of loops, but is there an easier way?
You can accomplish this fairly easily using stringr:
# First we load the package
library(stringr)
# Then we create your example data vector
tagged_vector <- c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663',
                   '#965.23', '#855#658#744#122')
# Next we need to get all the unique codes
# stringr's str_extract_all() can do this:
all_codes <- str_extract_all(string=tagged_vector, pattern='(?<=#)[0-9\\.]+')
# We just looked for one or more numbers and/or dots following a '#' character
# Now we just want the unique ones:
unique_codes <- unique(na.omit(unlist(all_codes)))
# Then we can use grepl() to check whether each code occurs in any element
# I've also used as.numeric() since you want 0/1 instead of TRUE/FALSE
result <- data.frame(sapply(unique_codes, function(x) {
  as.numeric(grepl(x, tagged_vector))
}))
# Then we add in your ID column and move it to the front:
result$ID <- 1:nrow(result)
result <- result[ , c(ncol(result), 1:(ncol(result)-1))]
The result is
ID X142 X856 X856.2 X745 X855 X685 X663 X965.23 X658 X744 X122
1 1 1 1 1 1 0 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0 0 0 0
3 3 0 1 0 0 1 0 0 0 0 0 0
4 4 0 0 0 0 0 0 0 0 0 0 0
5 5 0 0 0 0 0 1 0 0 0 0 0
6 6 0 0 0 0 0 0 1 0 0 0 0
7 7 0 0 0 0 0 0 0 1 0 0 0
8 8 0 0 0 0 1 0 0 0 1 1 1
You may notice that an "X" precedes each code in the column names. That's because data.frame() makes the names syntactically valid by default, and a valid R name may not begin with a number.
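If you would rather keep the raw codes as column names, a small variation (my own sketch, not part of the original answer) is to turn off that name mangling with check.names = FALSE; such non-syntactic names then need backticks when referenced, e.g. result$`856.2`:
# Keep the raw codes as column names by turning off name mangling.
result <- data.frame(sapply(unique_codes, function(x) {
  as.numeric(grepl(x, tagged_vector))
}), check.names = FALSE)
result$ID <- seq_len(nrow(result))
result <- result[, c(ncol(result), 1:(ncol(result) - 1))]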

Creating dummy variables in R based on multiple chr values within each cell

I'm trying to create multiple dummy variables based on one column called 'Tags' within my df (14 rows, 2 columns: Score and Tags). My problem is that each cell can contain any number of character values (up to about 30 values).
When I ask for:
str(df$Tags)
R returns:
chr [1:14] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolera"| __truncated__ ...
And when I ask for:
df$Tags[1]
R returns:
[1] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolerantie\", \"met familie\", \"met vrienden\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", chinees, gastronomisch, glutenvrij, kindvriendelijk, romantisch, traditioneel, trendy, verjaardag, zakelijk"
It seems that the values within the first cell are not all formatted the same way (the values between the commas).
So what I wish for is to create a dummy variable for each possible value that occurs within each cell. The first new dummy should be called "biologische gerechten" (or similar) and should show for each case whether the corresponding value is present (1) in the column 'Tags' or not (0).
I tried several things with 'dplyr', like:
df = mutate(df, biologisch = ifelse(Tags == "biologische gerechten", 1, 0))
R does create a new column 'biologisch', but it only contains zeros. Is there another way to separate all the values and then create dummy variables for all possible values? Hope someone can help me, thank you!
Here's one solution:
# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
                 Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
                 stringsAsFactors = FALSE)
# Load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)
# Use strsplit to break your jumbles of tags into separate elements, with a
# list element for each position in the original vector. I've split on commas here,
# but for your real data you'll probably also want to strip the quotation marks
# (the backslashes in your str() output are just R's display escaping).
t <- strsplit(df$Tags, split = ",")
# Get a vector of the unique elements of those lists. You may need str_trim
# or something like it to cut leading and trailing whitespace. You might also
# need stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. Without a reproducible example, though,
# I can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))
# Now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results;
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a data frame.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) {
  sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE)))
})))
# Assign the tags as column names
names(df2) <- tags
Voila:
> df2
Y F P C Z K A J U H M O L E S R T Q V B I X G
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1
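For the asker's actual 'Tags' column, where some values are wrapped in quotation marks, a hedged sketch of the extra cleaning step (assuming df$Tags holds the real data) might look like this:
# Drop the literal quotation marks first, then split on commas and trim as before.
clean_tags <- gsub('"', '', df$Tags)
t <- strsplit(clean_tags, split = ",")
tags <- unique(str_trim(unlist(t)))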

IF THEN on a data frame in R with LAG

I have a dataframe with multiple columns, but two columns in particular are of interest to me.
Column1 contains either 0 or a positive number (>0).
Column2 contains numbers as well.
I want to create 21 new columns containing new information from Column2 given Column1.
So when Column1 is positive (not 0), I want the first new column, Column01, to take the value of Column2 from 10 rows back, Column02 the value from 9 rows back, ..., Column11 exactly the same value as Column2, ..., and Column21 the value from 10 rows forward.
For example
Column1 Column2 Column01 Column02 .. Column11 .. Column20 Column21
0 5 0 0 0 0 0
0 2 0 0 0 0 0
0 0 0 0 0 0 0
1 3 0 0 3 5 4
0 10 0 0 0 0 0
0 83 0 0 0 0 0
0 2 0 0 0 0 0
0 5 0 0 0 0 0
0 4 0 0 0 0 0
1 8 0 5 8 5 3
0 6 0 0 0 0 0
0 5 0 0 0 0 0
0 55 0 0 0 0 0
0 4 0 0 0 0 0
2 3 10 83 3 5 0
0 2 0 0 0 0 0
0 3 0 0 0 0 0
0 4 0 0 0 0 0
0 5 0 0 0 0 0
0 3 0 0 0 0 0
1 22 6 5 22 0 0
0 12 0 0 0 0 0
0 0 0 0 0 0 0
0 5 0 0 0 0 0
Hope this makes sense to you and you can help.
Here's one way using the newly implemented shift() function from data.table v1.9.5:
require(data.table) ## v1.9.5+
setDT(dat) ## (1)
cols = paste0("cols", sprintf("%.2d", 1:21)) ## (2)
dat[, cols[1:10] := shift(Column2, 10:1, fill=0)] ## (3)
dat[, cols[11] := Column2] ## (4)
dat[, cols[12:21] := shift(Column2, 1:10, fill=0, type="lead")] ## (5)
dat[Column1 == 0, (cols) := 0] ## (6)
(1) Assuming dat is your data.frame, setDT(dat) converts it to a data.table by reference (the data is not physically copied to a new location in memory, for efficiency).
(2) Generate all the new column names.
(3) Generate lagged versions of Column2 with periods 10:1 and assign them to the first 10 columns.
(4) The 11th column is just Column2.
(5) Generate leading versions of Column2 with periods 1:10 and assign them to the last 10 columns.
(6) Get the indices of all rows where Column1 == 0, and reset all the newly generated columns to 0 for those rows.
Use setDF(dat) if you want a data.frame back.
You can wrap this in a function over the values -10:10, choosing type="lag" or type="lead" depending on whether the value is negative or positive; a sketch of one way to do that follows.
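A minimal sketch of such a wrapper (my own helper, not from the original answer), reusing the dat and cols objects defined above:
require(data.table)
# Hypothetical helper: n < 0 means "n rows back" (lag), n > 0 means "n rows
# forward" (lead), and n == 0 is the column itself.
shift_by <- function(x, n, fill = 0) {
  if (n < 0) shift(x, -n, fill = fill, type = "lag")
  else if (n > 0) shift(x, n, fill = fill, type = "lead")
  else x
}
# One call over -10:10 fills all 21 columns; rows with Column1 == 0 are then reset.
dat[, (cols) := lapply(-10:10, function(n) shift_by(Column2, n))]
dat[Column1 == 0, (cols) := 0]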
An option using base R
cols = paste0("cols", sprintf("%.2d", 1:21)) #copied from #Arun's post
m1 <- matrix(c(rep(0, 10), dat1[,2]), nrow=nrow(dat1)+10+1, ncol=21,
             dimnames=list(NULL, cols))[1:nrow(dat1),]
dat2 <- cbind(dat1, m1*dat1[,1])
NOTE: Creating m1 produces a warning because the data vector's length is not a multiple of the number of matrix rows; it can be ignored here, since that recycling is exactly what produces the shifted columns.
Checking against the output from #Arun's solution (after running that code on 'dat'):
library(data.table)
setDF(dat) #convert the 'data.table' to 'data.frame'
all.equal(dat2, dat, check.attributes=FALSE)
#[1] TRUE
data
set.seed(24)
dat1 <- data.frame(Column1 = sample(0:1, 10, replace=TRUE),
                   Column2 = sample(1:5, 10, replace=TRUE))
dat <- copy(dat1)

Apply a function to elements of matrix on condition

I'm looking to change the value of a certain entry in a matrix based on the value of another entry. It's easiest to explain with an example:
Matrix
ABC-DEF 1 0 0 0
HIJ-KLM 0 0 0 0
NOP-QRS 1 0 0 0
KLM-HIJ 0 0 0 0
DEF-ABC 0 0 0 0
QRS-NOP 0 0 0 0
As you can see, each of the rows in the matrix above has a counterpart (e.g. ABC-DEF's counterpart is DEF-ABC).
Is there some way in which I can look to see which rows have a one in the first column and then place a 2 in the fourth column of their counterparts? In the above example, then:
ABC-DEF 1 0 0 0
HIJ-KLM 0 0 0 0
NOP-QRS 1 0 0 0
KLM-HIJ 0 0 0 0
DEF-ABC 0 0 0 2
QRS-NOP 0 0 0 2
I'm quite stuck and would really appreciate any help!
Thanks!
Assuming your column names are V1, ..., V5, you can do something like this:
values <- d$V1[d$V2==1]
d$V5[d$V1 %in% gsub("(...)-(...)","\\2-\\1", values)] <- 2
Which will give:
V1 V2 V3 V4 V5
1 ABC-DEF 1 0 0 0
2 HIJ-KLM 0 0 0 0
3 NOP-QRS 1 0 0 0
4 KLM-HIJ 0 0 0 0
5 DEF-ABC 0 0 0 2
6 QRS-NOP 0 0 0 2
If, instead of a data frame, your data is a numeric matrix m with row names, you can do:
values <- rownames(m)[m[,1]==1]
m[rownames(m) %in% gsub("(...)-(...)","\\2-\\1", values),4] <- 2
EDIT: To understand what the code is doing, note that:
gsub("(...)-(...)","\\2-\\1", values)
will replace any character string in the values vector of the form XXX-YYY with YYY-XXX via regular-expression matching. The result is a character vector of the "counterparts" of values. Then we use %in% to select every row whose name appears among these counterpart values, and assign 2 to the fourth column for those rows.
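For completeness, a small reproduction of the data-frame case (my own construction of d, assuming the labels sit in V1 and the numeric columns in V2 to V5):
d <- data.frame(V1 = c("ABC-DEF", "HIJ-KLM", "NOP-QRS", "KLM-HIJ", "DEF-ABC", "QRS-NOP"),
                V2 = c(1, 0, 1, 0, 0, 0), V3 = 0, V4 = 0, V5 = 0,
                stringsAsFactors = FALSE)
values <- d$V1[d$V2 == 1]
d$V5[d$V1 %in% gsub("(...)-(...)", "\\2-\\1", values)] <- 2
d   # DEF-ABC and QRS-NOP now carry a 2 in V5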

Random subsampling in R

I am new to R, so my question might be really simple.
I have 40 sites with abundances of zooplankton.
My data look like this (columns are species abundances and rows are sites):
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to do is take a random sub-sample (e.g. 50 individuals) from each site without replacement, several times (bootstrap), in order to calculate diversity indices on the new standardized abundances afterwards.
Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which rests on some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
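A tiny illustration of that "string of individuals" idea (my own sketch, not from the answer):
# Expand the toy record a=2, b=3, c=4 into one entry per individual, then resample.
counts <- c(a = 2, b = 3, c = 4)
individuals <- rep(names(counts), counts)        # "a" "a" "b" "b" "b" "c" "c" "c" "c"
boot <- sample(individuals, length(individuals), replace = TRUE)
table(boot)                                      # bootstrapped counts per species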
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data.
data <- data.frame(a=2, b=3, c=4)
# Sampling from first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, rep=TRUE, prob=data[row,])
Converting the sample back into the format of the original table: we now have an array of samples, each item indicating the column number that the sampled individual belongs to. We can convert back to the original table format in several ways; here is a fairly simple one using a counting loop:
# Count the number of each entry and store in a list
# (initialise the list first, otherwise the assignment inside the loop fails).
site_sample <- list()
for (i in 1:ncol(data)) {
  site_sample[[i]] <- sum(samples == i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
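As an aside (not part of the original answer), the counting loop can be replaced by base R's tabulate(), which counts how many samples fell in each column index:
# Equivalent to the loop above: one count per column of `data`.
site_sample <- tabulate(samples, nbins = ncol(data))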
Just stumbled upon this thread; the vegan package has a function called 'rrarefy' that does precisely what you're looking to do (and in the same ecological context, too).
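A hedged usage sketch, assuming the site-by-species counts are in a matrix or data frame called mydata with one row per site (as far as I recall, sites whose total abundance is below the requested sample size are returned unchanged, with a warning in newer vegan versions):
library(vegan)
set.seed(1)                                     # reproducible draw
rarefied <- rrarefy(mydata, sample = 50)        # 50 individuals per site, without replacement
div <- diversity(rarefied, index = "simpson")   # e.g. Simpson diversity per site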
This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. Then we generate n random numbers and order them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, the random numbers between 1 and 15 signify species 1, between 16 and 35 signify species 2, and between 36 and 55 signify species 3.
## Initially takes in a row of the data and the number of samples to take
sample_species <- function(counts, n) {
  num_species <- length(counts)
  total_count <- sum(counts)
  samples <- sample(1:total_count, n, replace=FALSE)
  samples <- samples[order(samples)]
  result <- array(0, num_species)
  total <- 0
  for (i in 1:num_species) {
    result[i] <- length(which(samples > total & samples <= total + counts[i]))
    total <- total + counts[i]
  }
  return(result)
}
A <- matrix(sample(0:100, 10*40, replace=T), ncol=10) ## mock data
B <- t(apply(A, 1, sample_species, 50))               ## results
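A usage sketch on the asker's own site-by-species table (assumed here to be called mydata); note that sample() with replace=FALSE fails for any site whose total abundance is below the subsample size, so such sites are dropped first:
# Keep only sites with at least 50 individuals, then subsample each of them.
keep <- rowSums(mydata) >= 50
subsampled <- t(apply(mydata[keep, ], 1, sample_species, n = 50))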
