R: Text and function in one 'cell'

I have to create a table where I analyse 9 variables in a bigger data set. For each variable, I have to state how it is scaled, what the measure of central tendency is, and what the dispersion measure is.
Since the appropriate measures depend on how the variable is scaled, I would like to state which measure is used inside the corresponding cell of the table I'm writing. Example:
"Median: (median(GB$government,na.rm=T)"
or
"Median:" (median(GB$government, na.rm=T)
This doesn't work; RStudio warns me about an unexpected symbol. The code I have is this (it includes specify_decimal because I have to show two decimals of each value; that function works flawlessly, so don't mind it :)
MZT <- c("Median:" specify_decimal(median(GB$government,na.rm=T),2),
specify_decimal(Modus(GB$local),2),specify_decimal(Modus(GB$gender),2),
specify_decimal(mean(GB$height,na.rm=T),2),
specify_decimal(mean(GB$weight,na.rm=T),2),specify_decimal(mean(GB$age,na.rm=T),2),
specify_decimal(mean(GB$education,na.rm=T),2),
specify_decimal(median(GB$income,na.rm=T),2),
specify_decimal(median(GB$father_educ,na.rm=T),2))
Edit: I now understand how kable works :D

One way to make custom tables in R is to use the knitr::kable() function, along with R Markdown. Here is a trivial example that prints a table comparing sample and theoretical values for an exponential distribution where lambda = 0.2.
library(knitr)
Statistic <- c("Mean", "Std. Deviation")
Sample <- c(round(5.220134, 2), round(5.4018713, 2))
Theoretical <- c(5.0, 5.0)
theTable <- data.frame(Statistic, Sample, Theoretical)
rownames(theTable) <- NULL
kable(theTable)
...and the text-based output:
> kable(theTable)
|Statistic | Sample| Theoretical|
|:--------------|------:|-----------:|
|Mean | 5.22| 5|
|Std. Deviation | 5.40| 5|
>
When run in an R Markdown document, kable renders this markdown as a formatted table.
Explanation
I used the following technique to create the table.
A data frame is used as the container to hold the data.
Each column in the table is a column of data in the data frame.
The first column stores the names of the rows.
The second through n-th columns store the values related to each row.
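As for the original error: "Median:" specify_decimal(...) places a string and a function call side by side with no operator between them, which is the unexpected symbol R complains about. Text and computed values are combined with paste0(); here is a minimal sketch of two of the cells, assuming the GB data and the specify_decimal helper from the question:
# paste0() glues the label and the formatted number into a single string,
# which can then be placed in a cell of the kable table.
MZT <- c(paste0("Median: ", specify_decimal(median(GB$government, na.rm = TRUE), 2)),
         paste0("Mean: ", specify_decimal(mean(GB$height, na.rm = TRUE), 2)))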

Related

Viewing non-feature index in R / RStudio

I'm trying to cluster items (find similar items) based on their attributes. I initially had a CSV of the format:
Item | Attribute1 | Attribute2.....about 200 attributes
Since it's a mixed-format set of attributes (INT, String, ...), I decided to concatenate the attributes, and now I have:
Item | ConcatenatedAttributes.
My clustering code is:
library(stringdist)
uniqueItem <- unique(as.character(data$ConcatenatedAttributes))
distanceMatrix <- stringdistmatrix(uniqueItem, uniqueItem, method = "jw")
rownames(distanceMatrix) <- uniqueItem
hc <- hclust(as.dist(distanceMatrix))
dfClust <- data.frame(uniqueItem, cutree(hc, k = 200))
Now, I want to be able to see which Items have been clustered together based on their similarities of the ConcatenatedAttributes field. How can I do that?
So, something like:
ClusterNumber | Item |
You want to group_by your data frame.
One obvious way is to use a for loop. Most R fans will suggest learning dplyr.
But IMHO, your idea of concatenating everything into one unmanageable field and then abusing a string distance is just horrible.
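For what it's worth, here is a minimal dplyr sketch of the grouping step (assuming dfClust from the question; the column names are illustrative):
library(dplyr)

names(dfClust) <- c("Item", "ClusterNumber")

# ClusterNumber | Item, ordered so items in the same cluster sit together
dfClust %>%
  arrange(ClusterNumber, Item) %>%
  select(ClusterNumber, Item)

# Base-R alternative: a list mapping each cluster number to its items
split(dfClust$Item, dfClust$ClusterNumber)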

Creating a histogram from a subset created from the subset function

This is how I've retrieved my dataset; everything is good so far.
> mantis<-read.csv("mantis.csv")
> attach(mantis)
The dataset provides numerical data on body mass/length/claw strength/etc. of FEMALE and MALE mantises. The objective is to create a histogram showing the body masses of ONLY female mantises. I created a subset:
> mantis_sub<-subset(mantis, Sex=="f",select="Body.Mass.g")
Then I tried:
> hist(mantis_sub)
Error in hist.default(mantis_sub) : 'x' must be numeric
I've searched this link:
Plot a histogram of subset of a data
...and I cannot figure out how to properly create this histogram. I am unfortunately not fluent enough in R to understand the solution, and the textbook I'm using does not cover this.
It is because mantis_sub is a data frame (i.e. a table of body masses, lengths, claw strengths, ...), not a plain set of numbers, so hist does not know which column you wish to plot.
You need to extract the column you want a histogram of. To do this, write mantis_sub$<column name>; the dollar sign extracts the named column from the mantis_sub table.
e.g. to do a histogram of the body-mass column (named Body.Mass.g in your data):
hist(mantis_sub$Body.Mass.g)
If you want to do histograms of many columns automatically, you'll have to loop through them, e.g. (the column names here are illustrative):
for (column in c("Body.Mass.g", "Claw.Strength")) {
  hist(mantis_sub[[column]])
}
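Since only one column was selected here, you can also skip the data frame entirely: drop = TRUE makes subset return the single selected column as a plain numeric vector that hist accepts directly. A base-R sketch using the question's data:
# drop = TRUE collapses the one-column result to a numeric vector
masses <- subset(mantis, Sex == "f", select = Body.Mass.g, drop = TRUE)
hist(masses)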

Output for large correlation matrices in R

From what I've seen, R cannot easily produce usable output for large correlation matrices (50-100 variables). For instance, corr.test or cor output is horrendously wrapped (each variable should have exactly one row and one column, but this is certainly not the case) and does not copy well into Excel for later examination.
Is there a way to produce SPSS-like correlation output in R? Namely, correlation matrices that can be copied and pasted easily into something like Excel, where each row and each column pertains to one variable (no wrapping of text) and where, ideally, sample sizes and significance values are also available. corr.test provides this information, albeit in an inconvenient format, and when the variables exceed the width of the output viewer in R, the output is basically unreadable. Any thoughts would be greatly appreciated, as I frequently work with many variables at once.
Is there anything wrong with
z <- matrix(rnorm(10000), 100)
write.csv(cor(z), file = "cortmp.csv")
? View(cor(z)) works for me, although I don't know if it's copy-and-pasteable.
For psych::corr.test:
dimnames(z) <- list(1:100, 1:100)
z[1, 2] <- NA  ## unbalance to induce a sample-size matrix
ct <- psych::corr.test(z)
write.csv(ct$n, file = "ntmp.csv")  ## sample sizes
write.csv(ct$t, file = "ttmp.csv")  ## t statistics
write.csv(ct$p, file = "ptmp.csv")  ## p-values
et cetera. (See str(ct).)
R's paradigm is that if you want to transfer information to another program you're going to output it to a file rather than copying and pasting it from the console ...

Calling a specific column in a subset of data that has been binned and stored in a list

I have a very large data set that I have binned, storing each bin (subset) in a list so that I can easily call any given subset. My problem is in calling a specific column within a subset.
For example my data (which has diameters and strengths as the columns), is broken up into 20 bins, by diameter. I manually binned the data, like so:
subset.1 <- subset(mydata, Diameter <= 0.01)
Similar commands were used to make 20 bins. Then I stored the subsets (subset.1 through subset.20) in a list:
diameter.bin<-list(subset.1, ... , subset.20)
I can successfully call each diameter bin using:
diameter.bin[x]
Now, if I only want to see the strength values for a given diameter bin, I can use the original name (that is stored in the list):
subset.x$Strength
But I cannot get this information using the list call:
diameter.bin[x]$Strength
This command returns NULL
Note that when I call any subset (either by diameter.bin[x], subset.x or even subset.x$Strength) my column headers do show up. When I use:
names(subset.1)
This returns "Diameter" and "Strength"
But when I use:
names(diameter.bin[1])
This returns NULL.
I'm assuming that the column header is part of the problem, but I'm not sure how to fix it, other than taking the headers off the original data file. I would prefer not to do that if at all possible.
The end goal is to look at the distribution of strength values for each diameter bin, so I will be doing things like drawing histograms, calculating parameters, etc. I was hoping to do something along these lines to produce the histograms:
n=length(diameter.bin)
for(i in (1:n))
{
hist(diameter.bin[i]$Strength)
}
And do something similar to this to store median values for each bin in a new vector.
Any tips are greatly appreciated, as right now I'm doing it all 1 bin at a time, and I know a loop (or something similar) would really speed up my analysis.
You need two square brackets. Here is a reproducible example demonstrating the issue:
> diam <- data.frame(x=rnorm(5), y=rnorm(5))
>
> diam.l <- list(diam, diam)
> diam.l[1]$x
NULL
> diam.l[[1]]$x
[1] -0.5389441 -0.5155441 -1.2437108 -2.0044323 -0.6914124
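Applied to the question's loop (a sketch assuming each bin holds a Strength column):
# [[i]] extracts the i-th data frame itself; [i] would return a one-element list
for (i in seq_along(diameter.bin)) {
  hist(diameter.bin[[i]]$Strength)
}

# Median strength per bin, collected into a single vector
bin.medians <- sapply(diameter.bin, function(b) median(b$Strength, na.rm = TRUE))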

How to aggregate on IQR in SPSS?

I have to aggregate (with a categorical break variable, of course) a quite big data table containing some continuous variables, producing the mean, median, standard deviation, and interquartile range (IQR) of the required variables.
The first three are easy with the SPSS AGGREGATE command, but I have no idea how to compute the IQR as part of the aggregation.
I know I could compute the IQR using Descriptives (via quartiles), but as I need the calculation inside the aggregation, that is not an option. Unfortunately, using R fails as well, thanks to some odd circumstances (I am not able to load the huge comma-separated file into R with base::read.table, nor with the sqldf, bigmemory, or ff packages).
Any idea is welcome! And of course: thank you in advance.
P.S.: I thought about estimating the IQR by multiplying the standard deviation by 1.5, but that method would not work, as the distributions are skewed, so assuming normality does not stand.
P.S.: do you think using R within SPSS would avoid the memory problems I hit when opening the dataset in pure R?
This syntax should do the trick. There is no need to migrate back and forth between SPSS and R solely for this task.
*making fake data, 4 million records and 150 variables.
input program.
loop i = 1 to 4000000.
end case.
end loop.
end file.
end input program.
dataset name Temp.
execute.
vector X(150).
do repeat X = X1 to X150.
compute X = RV.NORMAL(0,1).
end repeat.
*This is the command you are interested in, puts the stats table into a new dataset.
Dataset declare IQR.
OMS
/SELECT TABLES
/IF SUBTYPES = 'Statistics'
/DESTINATION FORMAT = SAV outfile = 'IQR' VIEWER=NO.
freq var = X1
/format = notable
/ntiles = 4.
OMSEND.
This still takes a long time with such a large dataset, but that's to be expected. Just search the SPSS help files for "OMS" to find example syntax showing how OMS works.
Given the further constraint that you want to calculate the IQR for many groups, there are a few different ways to proceed. One would be to just use the SPLIT FILE command and run the above frequency command again.
split file by group.
freq var = X1 X2
/format = notable
/ntiles = 4.
split file end.
You could also get specific percentiles within CTABLES (and can do whatever grouping/nesting you want there). A potentially more useful solution at this point, though, is to write a program that saves separate files (or reduces the full dataset to the specific group while it is still loaded), does the calculation on each separate file, and dumps the results into a dataset. Working with a dataset of 4 million records is a pain, and it does not appear to be necessary if you are just splitting the file up anyway. This could be accomplished via macro commands.
OMS can capture any pivot table as a dataset, so any statistical results displayed that way can be used as a dataset. Another approach in this case, however, would be to use the RANK command. RANK allows grouping variables, so you can get ranks within groups, and it can compute quartiles and percentiles within groups. For example,
RANK VARIABLES=salary (A) BY jobcat minority
/RANK /NTILES(4) /PERCENT.
Then aggregating with FIRST and the group variables as breaks would give you a dataset of the quartiles by group from which to compute the IQR.
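A rough sketch of that aggregation step (untested; Nsalary is the within-group quartile variable that /NTILES(4) creates, and the output file name is illustrative):
SORT CASES BY jobcat minority salary.
AGGREGATE
  /OUTFILE='quartiles.sav'
  /BREAK=jobcat minority Nsalary
  /q_lo = FIRST(salary)
  /q_hi = LAST(salary).
* In quartiles.sav the IQR for each jobcat/minority group is roughly the
  q_hi of quartile group 3 minus the q_hi of quartile group 1 (Q3 - Q1).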
Many ways to skin a cat.
-Jon Peck
