Making contingency table - r

I'm having trouble with a contingency table.
I want to convert this kind of table:
dat <- read.csv(text="Gatunek,Obecnosc,Lokalizacja,Frekwencja
Koń dziki,TAK,Polska,11
Koń dziki,NIE,Polska,14
Koń dziki,TAK,Kujawy,39
Koń dziki,NIE,Kujawy,31",header=TRUE)
# Gatunek Obecnosc Lokalizacja Frekwencja
#Koń dziki TAK Polska 11
#Koń dziki NIE Polska 14
#Koń dziki TAK Kujawy 39
#Koń dziki NIE Kujawy 31
to this:
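Don't be afraid, the labels are just Polish (Gatunek = species, Obecnosc = presence, Lokalizacja = location, Frekwencja = frequency, Koń dziki = wild horse, TAK/NIE = yes/no).
At the moment I only have a table that looks like this: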

xtabs should do the trick:
x <- data.frame(a = c(1, 2, 1, 2), b = c("a", "a", "b", "b"), c = c(11, 14, 39, 31))
xtabs(c ~ a + b, data = x)
#    b
# a    a  b
#   1 11 39
#   2 14 31
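Applied to the data from the question, the same call might look like the sketch below. It assumes the goal is a table of Obecnosc (rows) by Lokalizacja (columns) filled with Frekwencja; the constant Gatunek column is left out, and the name `tab` is only illustrative.

```r
# The question's data, without the constant Gatunek column
dat <- data.frame(
  Obecnosc    = c("TAK", "NIE", "TAK", "NIE"),
  Lokalizacja = c("Polska", "Polska", "Kujawy", "Kujawy"),
  Frekwencja  = c(11, 14, 39, 31)
)

# The left-hand side of the formula holds the counts,
# the right-hand side names the two classifying variables
tab <- xtabs(Frekwencja ~ Obecnosc + Lokalizacja, data = dat)
tab
```

Individual cells can then be read by name, e.g. `tab["TAK", "Polska"]`.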

Related

Arranging a dataset by groups in R

I need help building a dataset that records which treatment each participant is on and what they scored in a composite test (this is just an exercise for my course, no real data used).
A <- c(36, 35, 22, 20)
B <- c(26, 30, 25, 20)
C <- c(42, 30, 45, 62)
treatment <- c("A", "B", "C")
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
This arranges the data wrongly: the treatment labels come out as A, B, C, A, B, C, ... instead of A, A, A, A, B, ...
Does anyone know how to convert and arrange this data?
I need the data arranged so I can split and do different calculations with them.
# Repeat each label four times so it lines up with the 12 scores
treatment <- rep(LETTERS[1:3], each=4)
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
I think you're looking for
df1 <- data.frame(treatment = rep(treatment, each = 4), depression)
For production/"real-life" code you would probably want to do something fancier, e.g.
L <- tibble::lst(A,B,C) ## self-naming list
data.frame(treatment = rep(names(L), lengths(L)),
           depression = unlist(L))
Here is a tidyverse approach:
library(tidyverse)
tibble(depression) %>%
  mutate(treatment = rep(treatment, each=length(A)))
   depression treatment
        <dbl> <chr>
 1         36 A
 2         35 A
 3         22 A
 4         20 A
 5         26 B
 6         30 B
 7         25 B
 8         20 B
 9         42 C
10         30 C
11         45 C
12         62 C
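Since the question mentions splitting the data and running calculations per group, here is a base R sketch of that follow-up step. It uses stack() as one possible way to build the long data frame; the names `group_means` and `by_treatment` are only illustrative.

```r
A <- c(36, 35, 22, 20)
B <- c(26, 30, 25, 20)
C <- c(42, 30, 45, 62)

# stack() turns a named list into a two-column data frame (values, ind)
df1 <- stack(list(A = A, B = B, C = C))
names(df1) <- c("depression", "treatment")

# Mean score per treatment
group_means <- tapply(df1$depression, df1$treatment, mean)

# One data frame per treatment, for separate calculations
by_treatment <- split(df1, df1$treatment)
```

`by_treatment$A` then holds only the rows for treatment A, and `group_means` is a named vector with one entry per treatment.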

Subset a data frame based on count of values of column x. Want only the top two in R

Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame so that I get a new data frame containing only the rows for the top two values of "x", ranked by how often each value of "x" occurs.
One of the simplest ways to achieve what you want is the data.table package, which allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and "c" to p and x, respectively. This way, you won't see an NA when taking the top two observations of each group.
The idea is to sort the dataset and then use .SD, a special data.table symbol that holds the subset of data for each group, which makes it convenient to extract rows per group.
See the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#>    x  p
#> 1: a 45
#> 2: a  6
#> 3: b  6
#> 4: b  3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
  add_count(x) %>%
  slice_max(n, n = 2)
   p x n
1  1 a 4
2  3 b 4
3 45 a 4
4  1 a 4
5  1 b 4
6  6 a 4
7  6 b 4
8  2 b 4
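For comparison, here is a base R sketch of the "count each value of x, then keep the two most frequent" reading of the question. It uses the original nine-row data; the names `counts`, `top_values` and `top_two` are only illustrative, and ties for second place are not handled specially.

```r
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)

# Count how often each value of x occurs, most frequent first
counts <- sort(table(df$x), decreasing = TRUE)

# Names of the two most frequent values
top_values <- head(names(counts), 2)

# Keep only the rows whose x is one of those values
top_two <- df[df$x %in% top_values, ]
```

Because the groups are chosen explicitly before filtering, this always keeps exactly two values of x.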

Recoding multiple variables in R

I would like to recode multiple variables at once in R. The variables are within a larger dataframe. Here is some example data:
z <- data.frame(A = c(1,2,300,444,555),
                B = c(555,444,300,2,1),
                C = c(1,2,300,444,555),
                D = c(1,2,300,444,555))
What I would like to do is recode all values that equal 300 as 3, 444 as 4, and 555 as 5.
I thought I could possibly do this in a list. Here is what I tried:
example_list = list(c("A", "B", "C", "D"))
example_list <- apply(z[,example_list], 1, function(x) ifelse(any(x==555, na.rm=F), 0.5,
ifelse(any(x==444), 0.25),
ifelse(any(x==300), 3, example_list)))
I get this error:
Error during wrapup: invalid subscript type 'list'
Then tried using "lapply" and I got this error:
Error during wrapup: '1' is not a function, character or symbol
Even then I'm not sure this is the best way to go about doing this... I would just like to avoid doing this line by line for multiple variables. Any suggestions would be amazing, as I'm new to R and don't entirely understand what I'm doing wrong.
I did find a similar question on SO, but I'm not sure how to apply that to my specific problem.
Using case_when:
library(dplyr)
z %>% mutate_all(
  function(x) case_when(
    x == 300 ~ 3,
    x == 444 ~ 4,
    x == 555 ~ 5,
    TRUE ~ x
  )
)
  A B C D
1 1 5 1 1
2 2 4 2 2
3 3 3 3 3
4 4 2 4 4
5 5 1 5 5
Here's a base R attempt which should be neatly extendable and pretty fast:
# set find and replace vectors
f <- c(300,444,555)
r <- c(3, 4, 5)
# replace!
m <- lapply(z, function(x) r[match(x,f)] )
z[] <- Map(function(z,m) replace(m,is.na(m),z[is.na(m)]), z, m)
#  A B C D
#1 1 5 1 1
#2 2 4 2 2
#3 3 3 3 3
#4 4 2 4 4
#5 5 1 5 5
This seems a bit clunky but it works:
mutate_cols <- c('A', 'B')
z[, mutate_cols] <- as.data.frame(lapply(z[, mutate_cols], function(x)
  ifelse(x == 300, 3,
         ifelse(x == 444, 4,
                ifelse(x == 555, 5, x)))))
This should work.
library(plyr)
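# Recode column by column; apply(z, 1, ...) would return a transposed matrix instead of a data frame
new.z <- z
new.z[] <- lapply(z, function(x) mapvalues(x, from = c(300, 444, 555), to = c(3, 4, 5)))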
z = data.frame(A = c(1,2,300,444,555),
               B = c(555,444,300,2,1),
               C = c(1,2,300,444,555),
               D = c(1,2,300,444,555))
library(expss)
to_recode = c("A", "B", "C", "D")
recode(z[, to_recode]) = c(300 ~ 3, 444 ~ 4, 555 ~ 5)
If you already have factor variables and want factors as the result, you can use the following code:
library(tidyverse)
z <- data.frame(A = factor(c(1,2,300,444,555)),
                B = factor(c(555,444,300,2,1)),
                C = factor(c(1,2,300,444,555)),
                D = factor(c(1,2,300,444,555)))
new.z <- z %>%
  mutate_all(function(x) recode_factor(x, "300" = "3", "444" = "4", "555" = "5"))
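As for the two errors quoted in the question, the likely causes are these. `z[, example_list]` fails because `example_list` is a list, and a data frame cannot be indexed by a list; a plain character vector of column names works. The second error most likely comes from keeping the `1` when switching from apply to lapply: lapply has no MARGIN argument, so the `1` is taken as the function to call. A minimal base R sketch that avoids both, with `cols` and `recode_one` as illustrative names:

```r
z <- data.frame(A = c(1, 2, 300, 444, 555),
                B = c(555, 444, 300, 2, 1),
                C = c(1, 2, 300, 444, 555),
                D = c(1, 2, 300, 444, 555))

# A character vector of column names, not list(c(...))
cols <- c("A", "B", "C", "D")

# Hypothetical helper: recode a single vector
recode_one <- function(x) {
  x[x %in% 300] <- 3
  x[x %in% 444] <- 4
  x[x %in% 555] <- 5
  x
}

# lapply() takes the data first and the function second (no margin)
z[cols] <- lapply(z[cols], recode_one)
```

Restricting `cols` to a subset of names recodes only those columns and leaves the rest of the data frame untouched.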

R - nested aggregate

I have a table d1 like this (three columns, JB Y and P)
JB Y P
AA 11 1
BB 11 2
AA 12 3
BB 12 4
AA 13 3
CC 12 4
CC 13 2
DD 11 1
DD 12 1
DD 13 3
BB 12 3
and what I am trying to do is get a nested aggregate. I mean the result should look like this:
JB Y Average (P)
AA 11 1
AA 12 2
AA 13 3
BB 11 2
BB 12 3.5
CC 12 4
CC 13 2
DD 11 1
DD 12 1
DD 13 3
The nested aggregate first aggregates using Y and then JB, and provides the mean of P. I'm not sure if this is possible. I know how to get a simple aggregate, but I wonder if there is a way to analyse data in two (or more) steps.
Here is a solution using data.table:
library(data.table)
dt <- data.table(
  JB = c("AA", "BB", "AA", "BB", "AA", "CC", "CC", "DD", "DD", "DD", "BB"),
  Y = c(11, 11, 12, 12, 13, 12, 13, 11, 12, 13, 12),
  P = c(1, 2, 3, 4, 3, 4, 2, 1, 1, 3, 3))
dt[order(JB), .(avg = mean(P)), by = .(JB, Y)]
The .() in the middle is data.table shorthand for list(); naming its element (avg = ...) names the column that holds the aggregation result. If ordering is not necessary, you may omit the first part and just call
dt[, .(avg = mean(P)), by = .(JB, Y)]
We can use data.table
library(data.table)
setDT(df)[, list(P= mean(P)) , .(JB, Y)]
By the looks of it, this is a vanilla aggregate problem, so you have lots of tools available.
In base R, the obvious candidate is aggregate.
aggregate(P ~ JB + Y, mydf, mean)
You can also use the "dplyr" package, as suggested by @eipi10, if that is more your style:
library(dplyr)
mydf %>% group_by(JB, Y) %>% summarise(P = mean(P))
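A self-contained version of the base R route, as a sketch: the data frame is rebuilt from the table in the question (named `d1` as there), and `res` is an illustrative name. The extra order() step is an assumption about the desired layout, sorting by JB and then Y as in the expected output.

```r
d1 <- data.frame(
  JB = c("AA", "BB", "AA", "BB", "AA", "CC", "CC", "DD", "DD", "DD", "BB"),
  Y  = c(11, 11, 12, 12, 13, 12, 13, 11, 12, 13, 12),
  P  = c(1, 2, 3, 4, 3, 4, 2, 1, 1, 3, 3)
)

# Mean of P for every combination of JB and Y that occurs in the data
res <- aggregate(P ~ JB + Y, data = d1, FUN = mean)

# Sort by JB, then Y, to match the layout asked for
res <- res[order(res$JB, res$Y), ]
res
```

Grouping by both columns in one call gives the same result as aggregating in two nested steps, because each (JB, Y) pair is treated as its own group.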

Merge select columns from multiple tables using common identifiers in R

I would like to combine (merge) selected columns from multiple tables organized as follows.
Here are two example datasets that I want to combine:
"dataset1"
A B C D E F (header)
1 2 3 4 5 F1(1st row)
6 7 8 9 10 F2(2nd row)
11 12 13 14 15 F3 (3rd row)
....
"dataset2"
A B C D E F (header)
16 17 18 19 20 F1(1st row)
21 22 23 24 25 F2(2nd row)
26 27 28 29 30 F3 (3rd row)
....
Here, the headers of all the datasets (I have more than 100) are identical, and I want to use the names in column F (F1, F2, F3, ... up to more than F200) as the unique identifier.
For example, if I combine column "A" from all the datasets using column F as the identifier, the result should look like this. To show where the data came from, each header also needs to be changed to the dataset ID.
dataset1 dataset2 F (header)
1 16 F1 (1st row)
6 21 F2 (2nd row)
11 26 F3 (3rd row)
....
Note that the datasets contain different numbers of rows, so some values corresponding to F1~F200 could be missing. In that case I want to put NA or leave the cell empty.
To this end, I tried the following code:
x <- merge(dataset1, dataset2, by="F", all=T)
But this way I cannot extract only column A; it merges every column.
I also tried:
x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1, dataset2))
This gave me results identical to the previous code. To extract only column A, I tried the following, but it did not work:
x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1[,1], dataset2[,1))
And I have no idea how to change each header to the name of the dataset it came from.
Please understand I have just started to learn the basics of R.
I'm using RStudio 0.98507, and all the datasets (more than a hundred) are currently loaded and present in the "Global Environment".
Thank you very much!
Here's one solution with the following four sample data frames:
dataset1 <- data.frame(A = c(1, 6, 11),
                       B = c(2, 7, 12),
                       C = c(3, 8, 12),
                       D = c(4, 9, 13),
                       E = c(5, 10, 14),
                       F = c("F1", "F2", "F3"))
dataset2 <- data.frame(A = c(16, 21, 26),
                       B = c(17, 22, 27),
                       C = c(18, 23, 28),
                       D = c(19, 24, 29),
                       E = c(20, 25, 30),
                       F = c("F1", "F2", "F3"))
dataset3 <- data.frame(A = c(30, 61),
                       B = c(57, 90),
                       C = c(38, 33),
                       D = c(2, 16),
                       E = c(77, 25),
                       F = c("F1", "F2"))
dataset4 <- data.frame(A = c(36, 61),
                       B = c(47, 30),
                       C = c(37, 33),
                       D = c(45, 10),
                       E = c(66, 29),
                       F = c("F1", "F2"))
First combine them into a list:
datasets <- list(dataset1, dataset2, dataset3, dataset4)
Then rename all the columns except the F column. This is because later when we merge the data frames together, if the columns all have the same names then merge will try to differentiate them by adding .x or .y to the names -- which is fine when you're only merging two data sets, but gets confusing with more than two.
for (i in seq_along(datasets)) {
  for (j in seq_along(colnames(datasets[[i]]))) {
    if (colnames(datasets[[i]])[j] != "F") {
      colnames(datasets[[i]])[j] <- paste(colnames(datasets[[i]])[j], i, sep = ".")
    }
  }
}
This gives us data frames whose column headers look like this:
datasets[[1]]
##   A.1 B.1 C.1 D.1 E.1  F
## 1   1   2   3   4   5 F1
## 2   6   7   8   9  10 F2
## 3  11  12  12  13  14 F3
Then use Reduce:
df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), datasets)
And select the columns you want, in this case all the columns with A in the column name:
df[, c("F", grep("A", names(df), value = TRUE))]
##    F A.1 A.2 A.3 A.4
## 1 F1   1  16  30  36
## 2 F2   6  21  61  61
## 3 F3  11  26  NA  NA
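The question also asks for each merged column to carry the name of the dataset it came from. One way to get that, sketched below, is to put the data frames in a named list, keep only F and the wanted column from each, and rename that column to the list name before merging. The cut-down sample data and the names `picked` and `combined` are only illustrative; with the objects already in the global environment, something like mget() on their names could build the named list instead of typing it out.

```r
# Cut-down sample data: only the columns needed for the illustration
dataset1 <- data.frame(A = c(1, 6, 11),   F = c("F1", "F2", "F3"))
dataset2 <- data.frame(A = c(16, 21, 26), F = c("F1", "F2", "F3"))
dataset3 <- data.frame(A = c(30, 61),     F = c("F1", "F2"))

# A named list, so every data frame carries its own ID
datasets <- list(dataset1 = dataset1, dataset2 = dataset2, dataset3 = dataset3)

# Keep F and A from each data frame and rename A to the dataset's name
picked <- Map(function(d, id) setNames(d[, c("F", "A")], c("F", id)),
              datasets, names(datasets))

# Full outer merge on F; identifiers missing from a dataset become NA
combined <- Reduce(function(x, y) merge(x, y, by = "F", all = TRUE), picked)
combined
```

Selecting the column before merging keeps the intermediate results small, which may matter with more than a hundred datasets.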
