Apply cohen.d to multiple columns - r

I have a data frame of survey question responses. I would like to estimate Cohen's d effect sizes for each response using cohen.d from effsize.
Here are the first 6 rows of my data frame:
structure(list(id = c("HO1001", "HO1001", "HO1002", "HO1002",
"HO1003", "HO1003"), time = structure(c(1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("0", "1"), class = "factor"), grit.distract = c(1,
1, 3, 2, 1, 2), grit.setback = c(5, 4, 3, 3, 4, 4), grit.obsess = c(3,
2, 2, 2, 3, 2), grit.work = c(4, 5, 3, 4, 5, 5), grit.goal = c(2,
3, 2, 1, 4, 4), grit.focus = c(3, 3, 3, 1, 2, 3), grit.finish = c(4,
4, 4, 4, 4, 3), grit.diligent = c(4, 4, 3, 4, 5, 4), grit.mean = c(3.25,
3.25, 2.875, 2.625, 3.5, 3.375)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L))
I successfully converted the df into wide format to use effsize on the summary statistics (i.e. the mean/total) as follows:
structure(list(id = c("HO1001", "HO1002", "HO1003", "HO1004",
"HO1005", "HO1006"), pre = c(3.25, 2.875, 3.5, 2.25, NA, NA),
post = c(3.25, 2.625, 3.375, 2.5, 2.75, 2.875), change = c(0,
-0.25, -0.125, 0.25, NA, NA), highconf = structure(c(2L,
1L, 2L, 1L, NA, NA), .Label = c("0", "1"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L)
cohen.d(grit.tot$pre, grit.tot$post, na.rm = T)
What I would like to do is calculate the effect sizes for each survey item, i.e. grit.distract, grit.setback, etc., between time 0 and time 1 (please no comments on my statistical methods). Given that I have plenty more data frames like this and don't want to do them all individually, I believe that I should use a function and a loop such as apply, but I'm not sure how to construct this.

If I have understood your question, this may help.
If your data frame from the first part of your question is stored as dt, running the following should give Cohen's d for each survey item.
lapply(dt[c(-1, -2)], function(x) cohen.d(x ~ dt$time))
dt[c(-1, -2)] removes the id column and the time column, as you don't want to run the Cohen's d test on these.
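If you want the estimates in a tidier form, you can pull them out of the returned effsize objects, and the same pattern extends to several data frames. A small sketch, assuming the effsize package is loaded and df_list is a hypothetical named list of data frames with the same layout as dt:
library(effsize)
res <- lapply(dt[c(-1, -2)], function(x) cohen.d(x ~ dt$time))
# each element of res is an effsize object; collect the d estimates into a named vector
sapply(res, function(e) e$estimate)
# wrapping the same call in a function handles many data frames at once
all_effects <- lapply(df_list, function(d)
  lapply(d[c(-1, -2)], function(x) cohen.d(x ~ d$time)))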

Using Map on Function if name of df lists match

I am given a named vector col:
col <- c(id = "CLUSTER", x = "LONGNUM", y = "LATNUM", n = "Severely.stunted.child.under.5.years..Total", pos = "Severely.stunted.child.under.5.years.Yes")
a list of data frames (Results) with the elements of col as columns, each named after a specific study area (see the dput below); the list Results contains 19 different files (different years),
and a list of shapefiles with 6 elements (the corresponding countries):
study_area <- c("Ethiopia", "Liberia", "Malawi", "Rwanda", "Uganda", "Zimbabwe")
Countries <- lapply(study_area, function(x){gisco_get_countries(country= x, resolution = 60 )})
Countries <- lapply(Countries, function(x) {as_Spatial(x, cast = TRUE, IDs = c("CNTR_NAME", "ISO3_CODE", "CNTR_ID", "NAME_ENGL", "FID"))})
names(Countries) <- study_area
I would like to apply the following function from the prevR library:
s.prevR(Results[[1]], col, Countries[[1]])
But I actually want to run it for every element in the lists where the names match.
I tried something like: Map(function(x, y) { as.prevR(x, col, y)}, Results, Countries)
But that (obviously) does not match x and y by name.
dput(Results[[1]][1:5, 1:24])
structure(list(CLUSTER = c("", "1", "10", "100", "101"), Severely.stunted.child.under.5.years.No = c(3438,
8, 7, 9, 6), Severely.stunted.child.under.5.years.Yes = c(1047,
4, NA, 7, 1), Severely.stunted.child.under.5.years..Total = c(4485,
12, 7, 16, 7), Stunted.child.under.5.years.No = c(2531, 2, 7,
7, 5), Stunted.child.under.5.years.Yes = c(1954, 10, NA, 9, 2
), Stunted.child.under.5.years..Total = c(4485, 12, 7, 16, 7),
Severely.wasted.child.under.5.years.No = c(4295, 11, 7, 16,
7), Severely.wasted.child.under.5.years.Yes = c(190, 1, NA,
NA, NA), Severely.wasted.child.under.5.years..Total = c(4485,
12, 7, 16, 7), Wasted.child.under.5.years.No = c(3957, 10,
7, 16, 6), Wasted.child.under.5.years.Yes = c(528, 2, NA,
NA, 1), Wasted.child.under.5.years..Total = c(4485, 12, 7,
16, 7), Severely.underweight.child.under.5.years.No = c(4028,
10, 7, 12, 7), Severely.underweight.child.under.5.years.Yes = c(457,
2, NA, 4, NA), Severely.underweight.child.under.5.years..Total = c(4485,
12, 7, 16, 7), Underweight.child.under.5.years.No = c(3185,
7, 7, 12, 5), Underweight.child.under.5.years.Yes = c(1300,
5, NA, 4, 2), Underweight.child.under.5.years..Total = c(4485,
12, 7, 16, 7), LATNUM = c(NA, 10.889096, 5.323272, 8.830199,
10.806748), LONGNUM = c(NA, 37.269565, 39.556812, 40.72964,
39.7703), SurveyId = c("ET2005DHS", "ET2005DHS", "ET2005DHS",
"ET2005DHS", "ET2005DHS"), DHSC = c("ET", "ET", "ET", "ET",
"ET"), Country = c("Ethiopia", "Ethiopia", "Ethiopia", "Ethiopia",
"Ethiopia")), row.names = c(NA, 5L), class = "data.frame")
and Countries
dput(Countries[[1]])
new("SpatialPolygonsDataFrame", data = structure(list(CNTR_NAME = "Federal Democratic Republic of Ethiopia",
ISO3_CODE = "ETH", CNTR_ID = "ET", NAME_ENGL = "Ethiopia",
FID = "ET"), class = "data.frame", row.names = 1L), polygons = list(
new("Polygons", Polygons = list(new("Polygon", labpt = c(39.6420582930584,
8.63562315843106), area = 93.13026982, hole = FALSE, ringDir = 1L,
coords = structure(c(41.6307, 42.4043, 41.816, 41.8348,
42.9681, 42.7628, 42.9804, 43.9589, 45.6126, 46.9411,
47.8524, 45.6126, 45.4747, 45.2923, 44.9162, 43.4741,
42.8138, 41.9101, 41.2328, 40.708, 39.9305, 39.5667,
38.9731, 38.1026, 36.9621, 35.9477, 35.8294, 35.3235,
35.0325, 34.9588, 34.5428, 33.7557, 33.0448, 33.2485,
33.8204, 34.0937, 34.1132, 34.4181, 34.8021, 35.2153,
35.6227, 36.1342, 36.5603, 37.2972, 37.5268, 37.9201,
38.5391, 39.0217, 40.0851, 40.8941, 41.6307, 13.3913,
12.4686, 11.6292, 11.0448, 10.9974, 10.7159, 10.0644,
9.0545, 8.4674, 8.0224, 7.9151, 5.5657, 5.4241, 5.2367,
4.9368, 4.7993, 4.301, 3.9823, 3.9616, 4.2326, 3.8858,
3.5224, 3.5158, 3.6459, 4.3833, 4.62, 5.2367, 5.413,
5.8494, 6.4537, 6.7418, 7.6074, 7.899, 8.381, 8.4168,
8.6026, 9.4986, 10.6735, 10.8052, 11.9187, 12.5064, 12.8315,
14.2577, 14.3876, 14.2588, 14.8128, 14.4413, 14.5899,
14.5456, 14.0891, 13.3913), dim = c(51L, 2L)))), plotOrder = 1L,
labpt = c(39.6420582930584, 8.63562315843106), ID = "1",
area = 93.13026982)), plotOrder = 1L, bbox = structure(c(33.0448,
3.5158, 47.8524, 14.8128), dim = c(2L, 2L), dimnames = list(c("x",
"y"), c("min", "max"))), proj4string = new("CRS", projargs = "+proj=longlat +datum=WGS84 +no_defs"))
If the names of Countries are all contained in the names of Results, and Results has duplicated names, then we can make Countries the same length by replicating its elements based on the names of Results:
Map(function(x, y) { as.prevR(x, col, y)}, Results, Countries[names(Results)])
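The key mechanism here is that subsetting a named list by a character vector matches, and where necessary replicates, elements by name. A toy sketch of just that behaviour (the names below are made up for illustration):
Countries_demo <- list(Ethiopia = "ETH", Liberia = "LBR")
results_names <- c("Ethiopia", "Ethiopia", "Liberia")  # e.g. two survey years for Ethiopia
Countries_demo[results_names]  # returns one (possibly repeated) element per name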

Using R, how to assign a category into a new column based on a date in another column and carry that value forward until the next date?

I have a dataframe with approximately 3 million rows. Each row is assigned a unique ID and has up to 4 dates. I wish to create a set of new columns for month and year (i.e. Jan-21, Feb-21, Mar-21, etc.). For each ID, I want a value of 0 in each month/year column prior to the first date, a value of 1 from the month/year containing the first date, carried forward through subsequent month/year columns until the column that matches the 2nd date, and so on for the later dates.
I understand that it's easier to help me with examples, so I have put together this dput output with an example of what my current data looks like:
structure(list(id = c(1, 2, 3, 4, 5), date1 = structure(c(1623801600,
1615420800, 1654560000, 1620259200, 1615248000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), date2 = structure(c(1629158400, 1621987200,
1658448000, 1623974400, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
date3 = structure(c(NA, 1630454400, 1662076800, 1647907200,
NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), date4 = structure(c(NA,
1639008000, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
And this is what I would like it to look like:
structure(list(id = c(1, 2, 3, 4, 5), `Mar-21` = c(0, 1, 0, 0,
1), `Apr-21` = c(0, 1, 0, 0, 1), `May-21` = c(0, 2, 0, 1, 1),
`Jun-21` = c(1, 2, 0, 2, 1), `Jul-21` = c(1, 2, 0, 2, 1),
`Aug-21` = c(2, 2, 0, 2, 1), `Sep-21` = c(2, 3, 0, 2, 1),
`Oct-21` = c(2, 3, 0, 2, 1), `Nov-21` = c(2, 3, 0, 2, 1),
`Dec-21` = c(2, 4, 0, 2, 1), `Jan-22` = c(2, 4, 0, 2, 1),
`Feb-22` = c(2, 4, 0, 2, 1), `Mar-22` = c(2, 4, 0, 3, 1),
`Apr-22` = c(2, 4, 0, 3, 1), `May-22` = c(2, 4, 0, 3, 1),
`Jun-22` = c(2, 4, 1, 3, 1), `Jul-22` = c(2, 4, 2, 3, 1),
`Aug-22` = c(2, 4, 2, 3, 1), `Sep-22` = c(2, 4, 3, 3, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Just a note that I have this dataset in both wide and long format, in case using it in a long format makes more sense.
Thank you!
This was a fun exercise! I'm sure there are a billion ways to do this more efficiently, but I think this works and it was a fun puzzle for me. I first put the dates into long format to get a min and max. Then I made a sequence of those dates by month. I then used expand.grid to make all combinations of the months with each ID and joined that to the original data frame. Then I just summed how many of dates 1:4 fell on or before each month in the list. I had to use floor_date to change dates 1:4 to the first of the month. Hopefully this helps!
library(dplyr)
library(lubridate)
library(tidyr)
# dat is the example data frame from the question
dat2 <- dat %>%
  tidyr::pivot_longer(cols = -id, values_drop_na = TRUE)

# overall date range, then a monthly sequence spanning it
dat_min_max <- data.frame("Min" = min(dat2$value), "Max" = max(dat2$value))
month_seq <- seq(dat_min_max$Min, dat_min_max$Max + months(1), by = "month")

dat3 <- dat %>%
  mutate(date1 = floor_date(date1, "month"),
         date2 = floor_date(date2, "month"),
         date3 = floor_date(date3, "month"),
         date4 = floor_date(date4, "month")) %>%
  # all id/month combinations, joined back to the dates
  left_join(expand.grid(dat$id, month_seq), by = c("id" = "Var1")) %>%
  rowwise() %>%
  # count how many of the four dates fall on or before each month
  mutate(c = sum(date1 <= Var2, date2 <= Var2, date3 <= Var2, date4 <= Var2, na.rm = TRUE)) %>%
  mutate(Var2 = format(Var2, "%b-%y")) %>%
  select(-date1, -date2, -date3, -date4) %>%
  tidyr::pivot_wider(names_from = Var2, values_from = c)
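For what it's worth, here is a sketch of a vectorized alternative that avoids rowwise(); it assumes the same dat, month_seq and libraries as above and is untested against edge cases:
dat %>%
  mutate(across(date1:date4, ~ floor_date(.x, "month"))) %>%
  # all id/month combinations via a named expand.grid
  left_join(expand.grid(id = dat$id, month = month_seq), by = "id") %>%
  # rowSums over the logical columns counts dates on or before each month
  mutate(c = rowSums(across(date1:date4, ~ .x <= month), na.rm = TRUE)) %>%
  mutate(month = format(month, "%b-%y")) %>%
  select(-(date1:date4)) %>%
  tidyr::pivot_wider(names_from = month, values_from = c)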

How do I combine data taken separately into a single dataset?

I have a dataset of leaves that I've weighed individually in order of emergence (first emerged through final emergence), and I'd like to combine these masses so that I have the total mass of all the leaves for each individual plant.
How would I add these up using R programming language, or what would I need to google to get started on figuring this out?
structure(list(Tray = c(1, 1, 1, 1, 1, 1), Plant = c(2, 2, 2,
2, 3, 3), Treatment = structure(c(4L, 4L, 4L, 4L, 4L, 4L), .Label = c("2TLH",
"E2TL", "EH", "WL"), class = "factor"), PreSwitch = c("Soil",
"Soil", "Soil", "Soil", "Soil", "Soil"), PostSwitch = c("Soil",
"Soil", "Soil", "Soil", "Soil", "Soil"), Pellet = c(1, 1, 1,
1, 1, 1), Rep = c(1, 1, 1, 1, 1, 1), Date = structure(c(1618963200,
1618963200, 1618963200, 1618963200, 1618963200, 1618963200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), DAP = c(60, 60, 60, 60, 60, 60), Position = c(2,
1, 3, 4, 4, 3), Whorl = structure(c(1L, 1L, 2L, 2L, 2L, 2L), .Label = c("1",
"2", "3", "4", "5"), class = "factor"), PetioleLength = c(1.229,
1.365, 1.713, 1.02, 0, 1.408), BladeLength = c(1.604, 1.755,
2.466, 2.672, 0.267, 2.662), BladeWidth = c(1.023, 1.185, 1.803,
1.805, 0.077, 1.771), BladeArea = c(1.289, 1.634, 3.492, 3.789,
0.016, 3.704), BladePerimeter = c(6.721, 7.812, 11.61, 12.958,
1.019, 14.863), BladeCircularity = c(0.359, 0.336, 0.326, 0.284,
0.196, 0.211), BPR = c(1.30512611879577, 1.28571428571429, 1.43957968476357,
2.61960784313725, NA, 1.890625), Leaf.Mass = c(9, 11, 31, 33,
32, 33), BladeAR = c(1.56793743890518, 1.48101265822785, 1.36772046589018,
1.4803324099723, 3.46753246753247, 1.50310559006211), Subirrigation = c(0,
0, 0, 0, 0, 0), Genotype = c(1, 1, 1, 1, 1, 1), Location = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I may be missing something but isn't this a sum by Plant?
The first solution below sums the mass for each plant into a separate table with just the totals; the second summarizes and adds the total back to the main data set as a new column in a single step.
library(tidyverse)
#summary data set
plant_total <- df %>%
  group_by(Plant) %>%
  summarize(plant_weight = sum(Leaf.Mass, na.rm = TRUE))
#add a plant_weight column to the df data set
df <- df %>%
  group_by(Plant) %>%
  mutate(plant_weight = sum(Leaf.Mass, na.rm = TRUE)) %>%
  ungroup()
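One caveat worth checking, since the data also has a Tray column: if Plant numbers restart within each Tray, the same Plant value can refer to different plants in different trays, so group by both keys. A minimal variant:
plant_total <- df %>%
  group_by(Tray, Plant) %>%
  summarize(plant_weight = sum(Leaf.Mass, na.rm = TRUE), .groups = "drop")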

R: ICC for Inter-Rater and Intra-rater Variability, getting ICC=1

I am having issues with the ICC function from the psych package in R. We had three technicians (AA, AB, AC) who each measured 11 control solutions three times. We know the true values for these solutions (F_Exp). The three measurements per technician were averaged, giving AA_avg, AB_avg and AC_avg.
I am trying to calculate the inter-rater reliability of these three technicians (it reflects the variation between 2 or more raters who measure the same group of subjects). I am planning to use ICC(2,1).
When I try to run
ICC(try[3:5]) # n*p matrix where n=subjects, p=raters.
I get ICC estimates that are essentially 1. I am not sure what to do; I am feeding the data as instructed. When I do it with icc from the irr package, which is more specific about the format of its data, I get an ICC of 0.999998, which seems too good to be true. I would really appreciate any help. Thank you!
Here is the structure of my data:
try<-structure(list(Input = c(1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
F_Exp = c(3, 100, 1, 40, 4, 40, 4, 40, 1, 40, 100), AA_avg = c(3.11666666666667,
103.716666666667, 1, 40.8333333333333, 4.18333333333333,
40.8666666666667, 4.18333333333333, 40.9166666666667, 1.03333333333333,
40.9333333333333, 103.783333333333), AB_avg = c(3.25, 103.016666666667,
1.13333333333333, 40.8333333333333, 3.94666666666667, 40.45,
4.28333333333333, 41.1166666666667, 1.05, 40.9166666666667,
104), AC_avg = c(3.2, 103.55, 1.23333333333333, 40.9, 4.26666666666667,
40.4, 4.28333333333333, 40.9, 1.05, 40.95, 103.733333333333
), ALL_avg = c(3.18888888888889, 103.427777777778, 1.12222222222222,
40.8555555555556, 4.13222222222222, 40.5722222222222, 4.25,
40.9777777777778, 1.04444444444444, 40.9333333333333, 103.838888888889
), AA_error = c(-0.116666666666667, -3.71666666666667, 0,
-0.833333333333336, -0.183333333333334, -0.866666666666667,
-0.183333333333334, -0.916666666666664, -0.0333333333333334,
-0.93333333333333, -3.78333333333333), AB_error = c(-0.25,
-3.01666666666667, -0.133333333333333, -0.833333333333336,
0.0533333333333332, -0.450000000000003, -0.283333333333333,
-1.11666666666667, -0.05, -0.916666666666664, -4), AC_error = c(-0.2,
-3.55, -0.233333333333333, -0.899999999999999, -0.266666666666667,
-0.399999999999999, -0.283333333333333, -0.899999999999999,
-0.05, -0.950000000000003, -3.73333333333333)), row.names = c(NA,
-11L), groups = structure(list(Input = c(1, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12), .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Your raters' scores are negligibly different, both in their average rating and in their ratings for each id. The model can't estimate rater variance if there isn't any. Why don't you believe that your ICC really is this high?
Between-rater means:
lapply(try[, 3:5], mean)
$AA_avg
[1] 34.96061
$AB_avg
[1] 34.90879
$AC_avg
[1] 34.95152
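To see why the ICC comes out at essentially 1, compare the spread across raters within each solution to the spread across solutions; a quick sketch using the try data above:
apply(try[, 3:5], 1, sd)  # within-solution SD across the three raters (tiny)
sd(rowMeans(try[, 3:5]))  # between-solution SD (large by comparison)
The rater-to-rater differences are orders of magnitude smaller than the solution-to-solution differences, which is exactly the situation where the ICC approaches 1.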

Propensity Score Matching and subset the data by using a weighting factor in R

I'm doing propensity score matching and want to subset the data into treatment and control groups using the weights. There are 5 variables: ID, Treatment (yes/no), Outcome (yes/no), Age and Weight. I was trying to write a program in R, but I am having problems doing this according to the weights. The survey package is used.
dput(dat2):
structure(list(ID = c(1, 2, 3, 4, 6, 7),
Weight = c(2.4740626, 2.4740626, 2.4740626, 2.4740626, 1.9548149, 1.9548149),
Age = c("35-44", "<15-24", "25-34", "35-44", ">45", "25-34"),
Treatment = c(1, 0, 0, 1, 0, 0),
Outcome = c(1, 1, 1, 0, 1, 1)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")))
My attempt:
data<-svydesign(ids = ~dat2$Id,
weights = ~dat2$Weight,
data = dat2)
treat<-subset(dat, dat2$treatment==1)
cont<-subset(dat, dat2$treatment==0)
I am sharing a sample of the data; the full data has 1587 rows. When I find the dimensions without weights, treat and cont are 877x5 and 710x5 respectively, but with the weights they should be 803x5 and 784x5.
Please help me.
Thanks in advance.
One way to do this is as below:
Sample Data
dat2 <- structure(list(ID = c(1, 2, 3, 4, 6, 7),
Weight = c(2.4740626, 2.4740626, 2.4740626, 2.4740626, 1.9548149, 1.9548149),
Age = c("35-44", "<15-24", "25-34", "35-44", ">45", "25-34"),
Treatment = c(1, 0, 0, 1, 0, 0),
Outcome = c(1, 1, 1, 0, 1, 1)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
Script
library(survey)

# build the survey design with the sampling weights
data <- svydesign(ids = ~dat2$ID,
                  weights = ~dat2$Weight,
                  data = dat2)

# subset() on a survey design keeps the design information intact
treat <- subset(data, Treatment == 1)
cont <- subset(data, Treatment == 0)
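As a quick sanity check (a sketch, assuming the design above): dim() on a design subset reports the unweighted rows kept, while svytable() tabulates using the design weights, which is presumably where weighted counts such as 803 and 784 come from on the full data.
dim(treat)  # unweighted rows in the treatment subset
dim(cont)  # unweighted rows in the control subset
svytable(~Treatment, data)  # weighted counts by treatment arm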
