Creating a two-way table with totals in R

I was wondering if there is an easy way to create a table that has the columns as well as row totals?
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
colnames(smoke) <- c("High","Low","Middle")
rownames(smoke) <- c("current","former","never")
smoke <- as.table(smoke)
I thought this would be super easy, but the solutions I have found so far seem pretty complicated, involving lapply and rbind. This seems like such a trivial task that there must be an easier way.
Desired result:
> smoke
High Low Middle TOTAL
current 51 43 22 116
former 92 28 21 141
never 68 22 9 99
TOTAL 211 93 52 356

addmargins(smoke)
addmargins is in the stats package.
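By default addmargins labels the added row and column "Sum". A minimal sketch, assuming you want the "TOTAL" label from the question: passing a named list as FUN sets the label. The name smoke_tot is only for illustration.

```r
# Rebuild the table from the question
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE,
                dimnames = list(c("current", "former", "never"),
                                c("High", "Low", "Middle")))
smoke <- as.table(smoke)

# The name in the FUN list becomes the margin label (the default is "Sum");
# quiet = TRUE suppresses the message about the order of margin computation
smoke_tot <- addmargins(smoke, FUN = list(TOTAL = sum), quiet = TRUE)
smoke_tot
```

The grand total ends up in smoke_tot["TOTAL", "TOTAL"], which is 356 for this data.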

You can use adorn_totals from janitor :
library(janitor)
library(magrittr)
smoke %>%
  as.data.frame.matrix() %>%
  tibble::rownames_to_column() %>%
  adorn_totals(name = 'TOTAL') %>%
  adorn_totals(name = 'TOTAL', where = 'col')
# rowname High Low Middle TOTAL
# current 51 43 22 116
# former 92 28 21 141
# never 68 22 9 99
# TOTAL 211 93 52 356


Table in r to be weighted

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable.
Here is some sample data.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)
I've run this to get a regular crosstab table
table(df$sex, df$age)
To get a weighted frequency, I tried the Hmisc package (if you know a better package, let me know):
library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1
I'm not sure where I've gone wrong, but it doesn't run, so any help will be great.
Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.
Try this
GDAtools::wtable(df$sex, df$age, w = df$wgt)
Output
0-15 16-29 30-44 45+ NA tot
Female 56 73 60 76 0 265
Male 76 99 106 90 0 371
NA 0 0 0 0 0 0
tot 132 172 166 166 0 636
Update
In case you do not want to install the whole package, here are two essential functions you need:
wtable and dichotom
Source them and you should be able to use wtable without any problem.
A solution is to repeat the rows of the data.frame by weight and then table the result.
The following repeats the data.frame's rows (only relevant columns):
df[rep(row.names(df), df$wgt), 1:2]
And it can be used to get the contingency table.
table(df[rep(row.names(df), df$wgt), 1:2])
# sex
#age Female Male
# 0-15 56 76
# 16-29 73 99
# 30-44 60 106
# 45+ 76 90
Base R, in stats, has xtabs for exactly this:
xtabs(wgt ~ age + sex, data=df)
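A quick sketch to check that xtabs and the row-repetition approach agree, using the sample data from the question. The object names by_xtabs and by_rep are just for illustration. Note that repeating rows only works for whole-number weights, whereas xtabs simply sums the weights, so it should also cope with fractional ones.

```r
# Sample data as in the question
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(1:10, 100, replace = TRUE)
df  <- data.frame(age, sex, wgt)

# Weighted crosstab: the left-hand side of the formula is summed per cell
by_xtabs <- xtabs(wgt ~ age + sex, data = df)

# Same table by repeating each row wgt times and counting
by_rep <- table(df[rep(seq_len(nrow(df)), df$wgt), c("age", "sex")])

# Both methods give identical cell values, and the cells add up to sum(wgt)
all(as.vector(by_xtabs) == as.vector(by_rep))
sum(by_xtabs) == sum(df$wgt)

# Row and column totals can be added on top
addmargins(by_xtabs)
```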
A tidyverse solution using your data and the same set.seed. uncount is the equivalent of @Rui's rep of the weights.
library(dplyr)
library(tidyr)
df %>%
  uncount(weights = .$wgt) %>%
  select(-wgt) %>%
  table
#> sex
#> age Female Male
#> 0-15 56 76
#> 16-29 73 99
#> 30-44 60 106
#> 45+ 76 90

using map function to create a dataframe from google trends data

Relatively new to R. I have a list of words I want to run through the gtrends function (from the gtrendsR package) to look at the Google search hits, and then create a tibble with dates as the index and the hits for each word as columns. I'm struggling to do this using the map functions in purrr.
I started off trying to use a for loop, but I've been told to try map from the tidyverse instead. This is what I had so far:
library(gtrendsR)
words = c('cruise', 'plane', 'car')
for (i in words) {
  rel_word_data = gtrends(i,geo= '', time = 'today 12-m')
  iot <- data.frame()
  iot[i] <- rel_word_data$interest_over_time$hits
}
I need the gtrends function to take one word at a time; otherwise it returns hits that are adjusted for the popularity of the other words. So basically, I need gtrends to run on the first word in the list, take the hits column from the interest_over_time element, and add it to a final data frame that has one column per word and the date as index.
I'm a bit lost on how to do this without a for loop.
Assuming the gtrends output is the same length for every keyword, you can do the following:
# Load packages
library(purrr)
library(gtrendsR)
# Generate a vector of keywords
words <- c('cruise', 'plane', 'car')
# Download data by iterating gtrends over the vector of keywords
# Extract the hits data and make it into a dataframe for each keyword
trends <- map(.x = words,
              ~ as.data.frame(gtrends(keyword = .x, time = 'now 1-H')$interest_over_time$hits)) %>%
  # Add the keywords as column names to the three dataframes
  map2(.x = .,
       .y = words,
       ~ set_names(.x, nm = .y)) %>%
  # Convert the list of three dataframes to a single dataframe
  map_dfc(~ data.frame(.x))
# Check data
head(trends)
#> cruise plane car
#> 1 50 75 84
#> 2 51 74 83
#> 3 100 67 81
#> 4 46 76 83
#> 5 48 77 84
#> 6 43 75 82
str(trends)
#> 'data.frame': 59 obs. of 3 variables:
#> $ cruise: int 50 51 100 46 48 43 48 53 43 50 ...
#> $ plane : int 75 74 67 76 77 75 73 80 70 79 ...
#> $ car : int 84 83 81 83 84 82 84 87 85 85 ...
Created on 2020-06-27 by the reprex package (v0.3.0)
You can use map to get all the data as a list and use reduce to combine the data.
library(purrr)
library(gtrendsR)
library(dplyr)
map(words, ~gtrends(.x,geo= '', time = 'today 12-m')$interest_over_time %>%
      dplyr::select(date, !!.x := hits)) %>%
  reduce(full_join, by = 'date')
# date cruise plane car
#1 2019-06-30 64 53 96
#2 2019-07-07 75 48 97
#3 2019-07-14 73 48 100
#4 2019-07-21 74 48 100
#5 2019-07-28 71 47 100
#6 2019-08-04 67 47 97
#7 2019-08-11 68 56 98
#.....
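Since gtrends needs a network connection, here is an offline sketch of the same map-then-join pattern. fake_gtrends is a hypothetical stand-in that only mimics the shape used above (a list with an interest_over_time data frame holding date and hits); its numbers are made up.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for gtrends(): same shape of result, invented hits
fake_gtrends <- function(keyword) {
  dates <- as.Date("2020-01-05") + 7 * 0:3
  list(interest_over_time = data.frame(
    date = dates,
    hits = nchar(keyword) * 10 + 0:3
  ))
}

words <- c("cruise", "plane", "car")

trends <- words %>%
  set_names() %>%                                   # name each element by its keyword
  map(~ fake_gtrends(.x)$interest_over_time) %>%    # one call per keyword
  imap(~ rename(.x, !!.y := hits)) %>%              # rename hits to the keyword
  reduce(full_join, by = "date")                    # join everything on date

trends
```

Swapping fake_gtrends for the real gtrends call should give one hits column per keyword, each obtained from its own request. full_join keeps dates that are missing for some keywords, filling the gaps with NA.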

use dplyr mutate() in programming

I am trying to assign a column name to a variable using mutate.
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
df%>%mutate(name = ifelse(x <50, "small", "big"))
}
When I run
new(name = "newVar")
it doesn't work. I know mutate_() could help but I'm struggling in using it together with ifelse.
Any help would be appreciated.
Using dplyr 0.7.1 and its advances in NSE, you have to UQ the argument to mutate and then use := when assigning. There is lots of info on programming with dplyr and NSE here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I've changed the name of the function argument to myvar to avoid confusion. You could also use case_when from dplyr instead of ifelse if you have more categories to recode.
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(myvar){
df %>% mutate(UQ(myvar) := ifelse(x < 50, "small", "big"))
}
new(myvar = "newVar")
This returns
x y newVar
1 37 1.82669 small
2 63 -0.04333 big
3 46 0.20748 small
4 93 0.94169 big
5 83 -0.15678 big
6 14 -1.43567 small
7 61 0.35173 big
8 26 -0.71826 small
9 21 1.09237 small
10 90 1.99185 big
11 60 -1.01408 big
12 70 0.87534 big
13 55 0.85325 big
14 38 1.70972 small
15 6 0.74836 small
16 23 -0.08528 small
17 27 2.02613 small
18 76 -0.45648 big
19 97 1.20124 big
20 99 -0.34930 big
21 74 1.77341 big
22 72 -0.32862 big
23 64 -0.07994 big
24 53 -0.40116 big
25 16 -0.70226 small
26 8 0.78965 small
27 34 0.01871 small
28 24 1.95154 small
29 82 -0.70616 big
30 77 -0.40387 big
31 43 -0.88383 small
32 88 -0.21862 big
33 45 0.53409 small
34 29 -2.29234 small
35 54 1.00730 big
36 22 -0.62636 small
37 100 0.75193 big
38 52 -0.41389 big
39 36 0.19817 small
40 89 -0.49224 big
41 81 -1.51998 big
42 18 0.57047 small
43 78 -0.44445 big
44 49 -0.08845 small
45 20 0.14014 small
46 32 0.48094 small
47 1 -0.12224 small
48 66 0.48769 big
49 11 -0.49005 small
50 87 -0.25517 big
Following the dplyr programming vignette, define your function as follows:
new <- function(name)
{
  nn <- enquo(name) %>% quo_name()
  df %>% mutate( !!nn := ifelse(x <50, "small", "big"))
}
enquo takes its expression argument and quotes it, followed by quo_name converting it into a string. Since nn is now quoted, we need to tell mutate not to quote it a second time. That's what !! is for. Finally, := is a helper operator to make it valid R code. Note that with this definition, you can simply pass newVar instead of "newVar" to your function, maintaining dplyr style.
> new( newVar ) %>% head
x y newVar
1 94 -1.07642088 big
2 85 0.68746266 big
3 80 0.02630903 big
4 74 0.18323506 big
5 86 0.85086915 big
6 38 0.41882858 small
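For what it's worth, more recent versions of dplyr and rlang also accept a glue-style string on the left of :=, which avoids UQ and enquo when the name is passed as a string. A minimal sketch, assuming such a version is installed. The data argument and the tiny data frame are my additions for illustration.

```r
library(dplyr)

# Small fixed data frame so the result is easy to check
df <- data.frame(x = c(10, 50, 75), y = c(0.1, 0.2, 0.3))

# name is a plain string; "{name}" is interpolated into the new column name
new <- function(data, name) {
  data %>% mutate("{name}" := ifelse(x < 50, "small", "big"))
}

res <- new(df, "newVar")
res
```

Passing the data frame as an argument, rather than relying on a global df, also makes the function easier to reuse.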
Base R solution
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
  # Start with "small" everywhere, then overwrite rows where x >= 50
  df[, name] <- "small"
  df[, name][df$x >= 50] <- "big"
  return(df)
}
I am using dplyr 0.5, so I just combine base R with mutate:
new <- function(Name){
df=mutate(df,ifelse(x <50, "small", "big"))
names(df)[3]=Name
return(df)
}
new("newVar")

R sum multiple columns with multiple row

So I have this data:
10 21 22 23 23 43
20 12 26 43 23 65
21 54 64 73 25 75
My expected outcome is:
142
189
312
I tried to use:
df = data.matrix(df)
df = colSums(df)
df = as.data.frame(df)
However, the sums are wrong. How can I improve or correct this solution?
We can use rowSums
rowSums(df)
#[1] 142 189 312
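To make the difference concrete, here is a small sketch with the data from the question. The column names V1 to V6 are assumed, since the question does not show any. colSums adds down each column (six results), while rowSums adds across each row (three results), which is what the expected output asks for.

```r
# Data from the question; column names are made up
df <- data.frame(
  V1 = c(10, 20, 21), V2 = c(21, 12, 54), V3 = c(22, 26, 64),
  V4 = c(23, 43, 73), V5 = c(23, 23, 25), V6 = c(43, 65, 75)
)

row_totals <- rowSums(df)   # one total per row: 142 189 312
col_totals <- colSums(df)   # one total per column, six values

row_totals
col_totals
```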
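Your data is stored as factors. You must convert it to numeric using as.numeric(as.character()); calling as.numeric() directly on a factor returns the internal level codes, not the values.
In your situation I suggest converting column by column:
# Replace every factor column with its numeric values
df[] <- lapply(df, function(col) as.numeric(as.character(col)))
rowSums(df)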
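A tiny illustration of why the as.character step matters, using a made-up factor:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "21", "142"))

codes  <- as.numeric(f)                 # positions in levels(f), not the values
values <- as.numeric(as.character(f))   # the numbers themselves: 10 21 142

levels(f)
codes
values
```

Summing codes instead of values would give a wrong result without any error or warning, which may be what happened in the question.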

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I rearrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
library(plyr)
data <- ddply(df, c("site"), summarise,
  N = length(values),
  mean = mean(values),
  sd = sd(values),
  se = sd / sqrt(N),
  sum = sum(values)
)
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
  melt(variable.name="site") %>%
  group_by(site) %>%
  summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The analogous functions in the tidyr package are gather and its successor pivot_longer.
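If the goal is really to avoid stacking altogether, a base R sketch is to apply one summary function to each column of the wide data frame. site_stats is a made-up helper name. The data is copied from the question.

```r
# Wide data from the question: one column per site
df <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# Hypothetical helper: the same five statistics as in the ddply call
site_stats <- function(values) {
  n <- length(values)
  c(N = n, mean = mean(values), sd = sd(values),
    se = sd(values) / sqrt(n), sum = sum(values))
}

# sapply returns one column per site; t() puts the sites in rows
stats <- t(sapply(df, site_stats))
stats
```

This gives a numeric matrix with one row per site. Wrap it in as.data.frame() if a data frame is needed.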
