Create a contingency table with 2 factors from messy data - r

I have the following data in messy format:
structure(list(com_level = c("B", "B", "B", "B", "A", "A"),
hf_com = c(1, 1, 1, 1, 1, 1),
sal_level = c("2", "3", "1", "2", "1", "4"),
exp_sal = c(NA, 1, 1, NA, 1, NA)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -6L))
Column com_level is the factor with 2 levels and column hf_com gives the frequency count for that level.
Column sal_level is the factor with 4 levels and column exp_sal gives the frequency count for that level.
I want to create a contingency table similar to this:
structure(list(`1` = c(1L, 2L),
`2` = c(0L, 1L),
`3` = c(0L, 2L),
`4` = c(1L, 0L)),
row.names = c("A", "B"), class = "data.frame")
I have code that works when I want to compare two columns with the same factor:
# step 1: create a table with frequency counts for exp_sal and curr_sal per category of level
cs_es_table <- df_not_na_num %>%
dplyr::count(sal_level, exp_sal, curr_sal) %>%
tidyr::spread(key = sal_level,value = n) %>% # this code spreads on just one key
select(curr_sal, exp_sal, 1, 2, 3, 4, 5, 6, 7, -8) %>% # reorder columns and omit Column 8 (no answer)
as.data.frame()
# step 2: convert cs_es_table to long format and summarise exp_sal and curr_sal frequencies
cs_es_table <- cs_es_table %>%
gather(key, value, -curr_sal,-exp_sal) %>% # crucial step to make data long
mutate(curr_val = ifelse(curr_sal == 1,value,NA),
exp_val = ifelse(exp_sal == 1,value,NA)) %>% #mutate actually cleans up the data and assigns a value to each new column for 'exp' and 'curr'
group_by(key) %>% #for your summary, because you want to sum up your previous rows which are now assigned a key in a new column
summarise_at( .vars = vars(curr_val, exp_val), .funs = sum, na.rm = TRUE)
This code produces this table but just spreads on one key in step 1:
structure(list(curr_val = c(533L, 448L, 237L, 101L, 56L), exp_val = c(179L,
577L, 725L, 401L, 216L)), row.names = c("< 1000 EUR", "1001-1500 EUR",
"2001-3000 EUR", "3001-4000 EUR", "4001-5000 EUR"), class = "data.frame")
Will I need to use pivot_wider, as in the answers to "Is it possible to use spread on multiple columns in tidyr similar to dcast?" or "tidyr::spread() with multiple keys and values"?
Any help would be appreciated to compare the two columns with different factors.
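A minimal sketch of one possible reading of the question, assuming each row is one observation and the goal is to cross-tabulate com_level against sal_level. The object names df, ct and ct_exp are assumptions made for illustration. With the six sample rows the counts differ from the target table above, which the question only describes as "similar". Base R's table() and xtabs() both tabulate on two factors at once, so no spread on multiple keys is needed:

```r
# Sample data from the question; the name `df` is assumed for illustration
df <- data.frame(
  com_level = c("B", "B", "B", "B", "A", "A"),
  hf_com    = c(1, 1, 1, 1, 1, 1),
  sal_level = c("2", "3", "1", "2", "1", "4"),
  exp_sal   = c(NA, 1, 1, NA, 1, NA),
  stringsAsFactors = FALSE
)

# Count rows for every com_level x sal_level combination
tab <- table(df$com_level, df$sal_level)

# Convert to a plain data frame with com_level as row names
ct <- as.data.frame.matrix(tab)
print(ct)

# Variant: sum the exp_sal counts instead of counting rows.
# Treating NA as 0 is an assumption about what the NAs mean.
df$exp_n <- ifelse(is.na(df$exp_sal), 0, df$exp_sal)
ct_exp <- xtabs(exp_n ~ com_level + sal_level, data = df)
print(ct_exp)
```

Whether rows should be counted or a frequency column summed depends on how hf_com and exp_sal were recorded, so both variants are shown.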

Related

Calculate mean of each row in a large list of dataframes in R

I know this question has been asked before on this forum. But my data set is significantly large and I could not make any of the existing solutions work.
Here's a sample dataset.
list(structure(list(id = c("id1", "id2", "id3"), value = c(2,
0, 2), value_2 = c(0, 1, 2)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(id = c("id1", "id2", "id3"), value = c(-1,
0, 0), value_2 = c(1, 0, -3)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(id = c("id1", "id2", "id3"), value = c(-2,
1, 0), value_2 = c(-2, 0, 1)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(id = c("id1", "id2", "id3"), value = c(2,
0, 0), value_2 = c(-2, 0, -1)), class = "data.frame", row.names = c(NA,
-3L)))
I want to calculate the mean of the column 'value' for each 'id' across the list. The result should look like this, where 'value_mean' should be the average of the column 'value' of each id in lists 1, 2, 3 and 4.
structure(list(id = c("id1", "id2", "id3"), value_mean = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
Please note that my real list has 5000 data frames where each data frame has 100,000 rows. I have tried using "bind_rows" and similar functions to convert the list to a data frame first, but the data frame becomes too large and R runs out of memory.
Any help would be much appreciated! Thanks!
We may bind the list elements into a single data frame and then do a grouped mean
library(dplyr)
bind_rows(lst1) %>%
group_by(id) %>%
summarise(value_mean = mean(value, na.rm = TRUE), .groups = 'drop')
Output:
# A tibble: 3 x 2
id value_mean
<chr> <dbl>
1 id1 0.25
2 id2 0.25
3 id3 0.5
If the datasets have the same dimensions and the 'id' values are in the same order, extract the 'value' column, use Reduce to do an elementwise + and divide by the length of the list
Reduce(`+`, lapply(lst1, `[[`, "value"))/length(lst1)
[1] 0.25 0.25 0.50
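If the ids are not guaranteed to be in the same order, or bind_rows runs out of memory, a running total per id avoids building one large data frame. The following is a minimal base R sketch under those assumptions; the names total, n and result are introduced here for illustration, and NA values are simply skipped:

```r
# Sample list from the question; the name `lst1` follows the answers above
lst1 <- list(
  data.frame(id = c("id1", "id2", "id3"), value = c(2, 0, 2),  value_2 = c(0, 1, 2)),
  data.frame(id = c("id1", "id2", "id3"), value = c(-1, 0, 0), value_2 = c(1, 0, -3)),
  data.frame(id = c("id1", "id2", "id3"), value = c(-2, 1, 0), value_2 = c(-2, 0, 1)),
  data.frame(id = c("id1", "id2", "id3"), value = c(2, 0, 0),  value_2 = c(-2, 0, -1))
)

total <- numeric(0)  # running sum of `value` per id
n     <- numeric(0)  # running count of non-missing `value` per id

for (d in lst1) {
  keep <- !is.na(d$value)
  ids  <- as.character(d$id[keep])

  # Per-id sum and count within this data frame only
  s <- tapply(d$value[keep], ids, sum)
  k <- tapply(d$value[keep], ids, length)

  # Register ids that have not been seen before
  new_ids <- setdiff(names(s), names(total))
  total[new_ids] <- 0
  n[new_ids]     <- 0

  # Add this data frame's contribution, matching by id name
  total[names(s)] <- total[names(s)] + as.vector(s)
  n[names(k)]     <- n[names(k)] + as.vector(k)
}

result <- data.frame(id = names(total), value_mean = as.vector(total / n))
print(result)
```

Only two named vectors (one entry per distinct id) are kept between iterations, so peak memory is driven by the largest single data frame rather than by the whole list.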
Or a more efficient approach is with dapply/t_list from collapse
library(collapse)
dapply(t_list(dapply(lst1, `[[`, "value")), fmean)
V1 V2 V3
0.25 0.25 0.50
You could calculate the mean for each data.frame in your list, then combine these per-data.frame means, weighted by the number of rows behind each one, to get the overall mean across all data.frames:
library(dplyr)
library(purrr)
my_list %>%
map_df(~ .x %>%
group_by(id) %>%
summarise(n = n(),
mean = mean(value, na.rm = TRUE))) %>%
group_by(id) %>%
summarize(mean_value = sum(n * mean)/ sum(n))
This returns
# A tibble: 3 x 2
id mean_value
<chr> <dbl>
1 id1 0.25
2 id2 0.25
3 id3 0.5
Disclaimer: I'm tired right now, so I don't know if this makes any sense.

Reshape from long to wide according to the number of occurrence of one variable

I have a data frame that looks like this:
df1<-structure(list(person = c("a", "a", "a", "a", "b", "b", "b",
"c"), visitID = c(123, 123, 256, 816, 237, 828, 828, 911), v1 = c(10,
5, 15, 8, 95, 41, 31, 16), v2 = c(8, 72, 29, 12, 70, 23, 28,
66), v3 = c(0, 1, 0, 0, 1, 1, 0, 1)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Here person is the name/ID of the person and visitID is a number generated for each visit. Each visit may have one or more rows of the variables (v1, v2, v3). I'm trying to transform the data so that each person is aggregated into a single row, with visits and variables spread wide, to look like this:
df2<-structure(list(person = c("a", "b", "c"), visit1 = c(123, 237,
911), visit2 = c(256, 828, NA), visit3 = c(816, NA, NA), v1.visit1 = c("10,5",
"95", "16"), v1.visit2 = c("15", "41,31", NA), v1.visit3 = c("8",
NA, NA), v2.visit1 = c("8,72", "70", "66"), v2.visit2 = c("29",
"23,28", NA), v2.visit3 = c("12", NA, NA), v3.visit1 = c("0,1",
"1", "1"), v3.visit2 = c("0", "1,0", NA), v3.visit3 = c("0",
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
Methods I have tried so far:
Method 1:
1. Aggregate by "person", with all other variables collapsed into comma-separated strings.
2. Split the variables into multiple columns.
The problem with this method is that I would no longer know which value corresponds to which visit, especially since some visits have multiple entries and some do not.
Method 2:
1. Count the occurrences of each visitID and take the maximum number of visits per unique person (3 in the case above).
2. Create 3 columns for each variable.
I didn't know how to proceed from here.
I found a great answer in the thread "Reshape three column data frame to matrix ("long" to "wide" format)", so I tried working with reshape and pivot_wider but couldn't get them to work.
Any ideas are appreciated, even if they do not lead to exactly the same output.
Thank you
You can try something like this:
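library(dplyr)
library(tidyr)
library(stringr)  # provides str_c()

df1 %>%
  # collect v1, v2, v3 into one list entry per person and visit
  group_by(person, visitID) %>%
  summarise(across(matches("v[0-9]+"), list)) %>%
  # number each person's visits: visit.1, visit.2, ...
  group_by(person) %>%
  mutate(visit = seq_len(n()) %>% str_c("visit.", .)) %>%
  ungroup() %>%
  # spread visit IDs and variables into one row per person
  pivot_wider(
    id_cols = person,
    names_from = visit,
    values_from = c("visitID", matches("v[0-9]+"))
  )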
Replace list with ~str_c(.x, collapse = ",") if you want the values as comma-separated character strings.
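For reference, here is a self-contained sketch of the character-style variant, assuming dplyr 1.0 or later and tidyr 1.0 or later are installed. It uses base paste() in place of str_c(), and the result name wide and the resulting column names such as v1_visit1 are choices made for this illustration rather than the exact names in the desired df2:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df1 <- data.frame(
  person  = c("a", "a", "a", "a", "b", "b", "b", "c"),
  visitID = c(123, 123, 256, 816, 237, 828, 828, 911),
  v1 = c(10, 5, 15, 8, 95, 41, 31, 16),
  v2 = c(8, 72, 29, 12, 70, 23, 28, 66),
  v3 = c(0, 1, 0, 0, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # one row per person and visit, values collapsed to "x,y" strings
  group_by(person, visitID) %>%
  summarise(across(matches("^v[0-9]+$"), ~ paste(.x, collapse = ",")),
            .groups = "drop") %>%
  # number the visits within each person
  group_by(person) %>%
  mutate(visit = paste0("visit", row_number())) %>%
  ungroup() %>%
  # one row per person, one column per variable and visit number
  pivot_wider(
    id_cols = person,
    names_from = visit,
    values_from = c(visitID, v1, v2, v3)
  )

print(wide)
```

Visits are numbered in ascending visitID order within each person because summarise() returns the groups sorted; if visits should be numbered chronologically instead, a date column would be needed.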

R function that drops columns according to column values and primary keys?

Currently I have a compiled data frame, so for the same item code there are fixed and changing variables. For example:
   primary key  b       c
1. 1234         apple   pear
2. 1234         apple   orange
3. 5678         berry   lime
4. 5679         orange  apple
5. 5679         orange  apple
In this case, since column c has different values in line 1 and line 2 despite both having the same primary key 1234, column c should be dropped.
Is there any way I can do this without hard-coding the column names?
Using dplyr, we can find the columns that need to be dropped using summarize_all():
library(dplyr)
df <- tibble(key = c(1, 1, 2, 3, 3),
b = c("a", "a", "b", "c", "c"),
c = c("p", "o", "l", "a", "a"))
drop_cols <- df %>%
group_by(key) %>%
summarize_all(~ any(. != .[1])) %>%
select(-key) %>%
select_if(any) %>%
colnames()
df %>% select(- one_of(drop_cols))
One way with dplyr is to count, for each primary key, the number of distinct elements in each column. We then select the columns that have only one unique value for every primary key.
library(dplyr)
df %>%
select(df %>%
group_by(primary) %>%
mutate_at(vars(-group_cols()), n_distinct) %>%
select_if(~all(. == 1)) %>%
names)
# primary b
#1 1234 apple
#2 1234 apple
#3 5678 berry
#4 5679 orange
#5 5679 orange
data
df <- structure(list(primary = c(1234L, 1234L, 5678L, 5679L, 5679L),
b = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("apple",
"berry", "orange"), class = "factor"), c = structure(c(4L,
3L, 2L, 1L, 1L), .Label = c("apple", "lime", "orange", "pear"
), class = "factor")), class = "data.frame", row.names = c(NA,-5L))
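The same idea can be sketched in base R without any packages. This is an illustrative alternative, not part of the answers above; the names key, other, constant and result are assumptions. A column is kept only if every primary key maps to exactly one distinct value in it:

```r
# Sample data matching the question's table
df <- data.frame(
  primary = c(1234L, 1234L, 5678L, 5679L, 5679L),
  b = c("apple", "apple", "berry", "orange", "orange"),
  c = c("pear", "orange", "lime", "apple", "apple"),
  stringsAsFactors = FALSE
)

key   <- "primary"
other <- setdiff(names(df), key)  # every non-key column, no hard-coded names

# TRUE for columns whose value never changes within a key
constant <- vapply(other, function(col) {
  n_unique <- tapply(df[[col]], df[[key]], function(x) length(unique(x)))
  all(n_unique == 1)
}, logical(1))

result <- df[, c(key, other[constant]), drop = FALSE]
print(result)
```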

R: new column based whether categorical levels of another column are the same or different from each other

I am having a problem creating a new column in a data frame whose content depends on whether the levels of a factor in another column are the same or different within groups defined by two further columns.
Basically, I have a bunch of cows with different IDs that can have different parities. The quarter is the udder quarter affected by the disease, and I would like to create a new column whose value depends on whether the quarters are the same, different, or occur only once. Any help would be appreciated. Code for an abbreviated data frame is below. The new column is the one I would like to achieve.
AnimalID <- c(10,10,10,10,12,12,12,12,14)
Parity <- c(8,8,9,9,4,4,4,4,2)
Udder_quarter <- c("LH","LH","RH","RH","LH","RH","LF","RF","RF")
new_column <- c("same quarter","same quarter","different quarter","different quarter","different quarter","different quarter","different quarter","different quarter","one quarter")
quarters<- data.frame(AnimalID,Parity,Udder_quarter,new_column)
structure(list(HerdAnimalID = c(100165, 100165, 100327, 100327,
100450, 100450), Parity = c(6, 6, 5, 5, 3, 3), no_parities = c(1,
1, 1, 1, 1, 1), case = c("1pathogen_lact", "1pathogen_lact",
"1pathogen_lact", "1pathogen_lact", "1pathogen_lact", "1pathogen_lact"
), FARM = c(1, 1, 1, 1, 1, 1), `CASE NO` = c("101", "101", "638",
"638", "593", "593"), MASTDATE = structure(c(1085529600, 1087689600,
1097884800, 1101254400, 1106092800, 1106784000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), QRT = c("LF", "LF", "RH", "LF", "LH",
"LH"), MastitisDiagnosis = c("Corynebacterium spp", "Corynebacterium spp",
"S. uberis", "S. uberis", "Bacillus spp", "Bacillus spp"), PrevCalvDate =
structure(c(1075334400,
1075334400, 1096156800, 1096156800, 1091145600, 1091145600), class =
c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = c("HerdAnimalID", "Parity",
"no_parities", "case", "FARM", "CASE NO", "MASTDATE", "QRT",
"MastitisDiagnosis", "PrevCalvDate"), row.names = c(NA, -6L), class =
c("tbl_df",
"tbl", "data.frame"))
library(dplyr)
quarters %>%
group_by(AnimalID) %>%
mutate(new_column = ifelse(n()==1, 'one quarter', NA)) %>%
group_by(Parity, .add = TRUE) %>% # `.add` replaces the deprecated `add` argument in dplyr >= 1.0.0
mutate(new_column=ifelse(length(unique(Udder_quarter))==1 & is.na(new_column),
"same quarter",
ifelse(length(unique(Udder_quarter))>1,
"different quarter",
new_column))) %>%
data.frame()
Output is:
AnimalID Parity Udder_quarter new_column
1 10 8 LH same quarter
2 10 8 LH same quarter
3 10 9 RH same quarter
4 10 9 RH same quarter
5 12 4 LH different quarter
6 12 4 RH different quarter
7 12 4 LF different quarter
8 12 4 RF different quarter
9 14 2 RF one quarter
Sample data:
quarters <- structure(list(AnimalID = c(10, 10, 10, 10, 12, 12, 12, 12, 14
), Parity = c(8, 8, 9, 9, 4, 4, 4, 4, 2), Udder_quarter = structure(c(2L,
2L, 4L, 4L, 2L, 4L, 1L, 3L, 3L), .Label = c("LF", "LH", "RF",
"RH"), class = "factor")), .Names = c("AnimalID", "Parity", "Udder_quarter"
), row.names = c(NA, -9L), class = "data.frame")
I would use ave to do that:
f <- function(x) {
if (length(x)==1) return("one")
else if (all(x == x[1])) return("same")
else return("different")
}
ave(Udder_quarter, interaction(AnimalID, Parity), FUN=f)
# [1] "same" "same" "same" "same" "different"
# [6] "different" "different" "different" "one"
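A self-contained sketch of the ave approach applied to a data frame, assuming the grouping is by AnimalID and Parity as in the answers above. The sample data stores Udder_quarter as a factor, and ave would try to write the new labels back into that factor, so the column is converted with as.character first. The function name classify is introduced here for illustration:

```r
# Sample data with Udder_quarter stored as a factor, as in the dput above
quarters <- data.frame(
  AnimalID = c(10, 10, 10, 10, 12, 12, 12, 12, 14),
  Parity   = c(8, 8, 9, 9, 4, 4, 4, 4, 2),
  Udder_quarter = factor(c("LH", "LH", "RH", "RH", "LH", "RH", "LF", "RF", "RF"))
)

# Label one AnimalID x Parity group based on its quarters
classify <- function(x) {
  if (length(x) == 1) {
    "one quarter"
  } else if (length(unique(x)) == 1) {
    "same quarter"
  } else {
    "different quarter"
  }
}

# ave() recycles the single label returned by classify() over each group
quarters$new_column <- ave(
  as.character(quarters$Udder_quarter),
  quarters$AnimalID, quarters$Parity,
  FUN = classify
)

print(quarters)
```

With this grouping, animal 10 is labelled "same quarter" in both parities, which matches the dplyr output above rather than the new_column in the question; grouping by AnimalID alone would be needed to compare quarters across parities.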

Passing current value of ddply split on to function

Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
# If we get only one year (1890) convert it to a range of years (1890-1890)
if (length(years) == 1) years <- c(years, years)
# Calculate the male and female proportions for the given range of years
ssa_select <- ssa_demo %.%
filter(year >= years[1], year <= years[2]) %.%
group_by(name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = sapply(proportion_female, male_or_female))
return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
if (proportion_female > 0.5) {
return("female")
} else if(proportion_female == 0.5000) {
return("either")
} else {
return("male")
}
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
group_by(year,name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = ifelse(proportion_female == 0.5,"either",
ifelse(proportion_female > 0.5,"female","male")))
Note that dplyr 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on GitHub suggesting that this was either fixed in the dev version or is at least being worked on.
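On the original question of passing the current year into the function: in the ddply call, y = select_ssa(year) is evaluated once as an argument before any splitting happens, so it never sees the per-group year. A common fix is to wrap the merge in an anonymous function that reads the year from the piece it receives. The sketch below shows that pattern in base R with split and lapply so it runs without plyr; the select_ssa here is a simplified stand-in written for this illustration, not the function from the question:

```r
names_to_encode <- data.frame(
  names = c("john", "john", "jane", "jane", "madison", "madison"),
  year  = c(1890, 1990, 1890, 1990, 1890, 2012),
  stringsAsFactors = FALSE
)
ssa_demo <- data.frame(
  name   = c("jane", "jane", "john", "john", "madison", "madison"),
  year   = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male   = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE
)

# Simplified stand-in for select_ssa() (assumption: same inputs, fewer columns)
select_ssa <- function(years) {
  if (length(years) == 1) years <- c(years, years)
  sel <- ssa_demo[ssa_demo$year >= years[1] & ssa_demo$year <= years[2], ]
  if (nrow(sel) == 0) {
    # No data for this period: return an empty frame with the same columns
    return(data.frame(name = character(0), female = numeric(0), male = numeric(0),
                      proportion_female = numeric(0), gender = character(0),
                      stringsAsFactors = FALSE))
  }
  out <- aggregate(cbind(female, male) ~ name, data = sel, FUN = sum)
  out$proportion_female <- round(out$female / (out$female + out$male), 4)
  out$gender <- ifelse(out$proportion_female > 0.5, "female",
                       ifelse(out$proportion_female == 0.5, "either", "male"))
  out
}

# Split by year, then call select_ssa() with the year of each piece
pieces <- lapply(split(names_to_encode, names_to_encode$year), function(piece) {
  merge(piece, select_ssa(piece$year[1]),
        by.x = "names", by.y = "name", all.x = TRUE)
})
encoded <- do.call(rbind, pieces)
print(encoded)
```

The same anonymous function could be passed to ddply(names_to_encode, .(year), ...) in place of merge, since ddply hands each year's subset to the function as its first argument.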
