I have the following dataset
structure(list(Var1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("0", "1"), class = "factor"), Var2 = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("congruent", "incongruent"
), class = "factor"), Var3 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("spoken", "written"), class = "factor"),
Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
I would like to add another column reporting the sum of each pair of consecutive rows (the two Var1 levels within each Var2/Var3 combination), so the new column would be 10, NA, 12, NA, 12, NA, 12, NA.
I have proceeded like this:
Table = as.data.frame(table(data_1$unimodal, data_1$cong_cond, data_1$presentation_mode)) %>%
  mutate(Var1 = factor(Var1, levels = c('0', '1')))
row = Table %>%
  summarise(across(where(is.numeric),
                   ~ .[Var1 == '0'] + .[Var1 == '1'],
                   .names = "{.col}_sum"))
column = c(rbind(row$Freq_sum, rep(NA, 4)))
Table$column = column
But I am looking for the quickest way possible, without splitting the work across separate steps. Here I have used the dplyr package, but if you know other approaches (map(), a for loop, or whatever method you consider best), please let me know.
This should do:
df$column <-
rep(colSums(matrix(df$Freq, 2)), each=2) * c(1, NA)
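Unpacked step by step (df is the data frame from the question; the intermediate names are just for illustration):
m <- matrix(df$Freq, 2)                             # consecutive pairs of Freq become columns: (8,2) (10,2) (10,2) (10,2)
pair_sums <- colSums(m)                             # 10 12 12 12
df$column <- rep(pair_sums, each = 2) * c(1, NA)    # repeat each sum over its pair, blank every second row with NA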
If you are fine with having the sum repeated on every row (no NAs) in the data frame, you can:
df %>%
group_by(Var2, Var3) %>%
mutate(column = sum(Freq))
# A tibble: 8 × 5
# Groups: Var2, Var3 [4]
Var1 Var2 Var3 Freq column
<fct> <fct> <fct> <int> <int>
1 0 congruent spoken 8 10
2 1 congruent spoken 2 10
3 0 incongruent spoken 10 12
4 1 incongruent spoken 2 12
5 0 congruent written 10 12
6 1 congruent written 2 12
7 0 incongruent written 10 12
8 1 incongruent written 2 12
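For completeness, the same per-group sum can be written in one line of base R with ave() (a sketch equivalent to the dplyr version above, keeping the sum on every row):
df$column <- ave(df$Freq, df$Var2, df$Var3, FUN = sum)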
I have several datasets.
The first one:
lid=structure(list(x1 = 619490L, x2 = 10L, x3 = 0L, x4 = 6089230L,
x5 = 0L, x6 = -10L), class = "data.frame", row.names = c(NA,
-1L))
The second dataset:
lidar=structure(list(A = c(638238.76, 638238.76, 638239.29, 638235.39,
638233.86, 638233.86, 638235.55, 638231.97, 638231.91, 638228.41
), B = c(6078001.09, 6078001.09, 6078001.15, 6078001.15, 6078001.07,
6078001.07, 6078001.02, 6078001.08, 6078001.09, 6078001.01),
C = c(186.64, 186.59, 199.28, 189.37, 186.67, 186.67, 198.04,
200.03, 199.73, 192.14), gpstime = c(319805734.664265, 319805734.664265,
319805734.67875, 319805734.678768, 319805734.678777, 319805734.678777,
319805734.687338, 319805734.701928, 319805734.701928, 319805734.701945
), Intensity = c(13L, 99L, 5L, 2L, 20L, 189L, 2L, 11L, 90L,
1L), ReturnNumber = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
3L), NumberOfReturns = c(2L, 1L, 3L, 2L, 1L, 1L, 3L, 1L,
1L, 4L), ScanDirectionFlag = c(1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
How can I compute, for each row of the lidar dataset, the following expressions using the values from the lid dataset:
(lidar$A - lid$x1) / lid$x3
then
(lidar$B - lid$x4) / lid$x6
So for the first row the result will be:
(lidar$A - lid$x1) / lid$x3 = 1874,876, but everything after the comma is discarded, giving 1874 (without the ,876)
(lidar$B - lid$x4) / lid$x6 = 1122
Also, in the lidar dataset, for the column lidar$C,
subtract the smallest value from the largest value; in this case lidar$c11 - lidar$c1 = 5,5.
so desired output for this will be
A          B           C       Intensity  ReturnNumber  NumberOfReturns  row   col   subs(lidar$Cmax-lidar$Cmin)
638238.76  6078001.09  186.64  13         2             2                1874  1122  5,5
638238.76  6078001.09  186.59  99         1             1                1874  1122  5,5
638239.29  6078001.15  199.28  5          1             3                1874  1122  5,5
638235.39  6078001.15  189.37  2          2             2                1874  1122  5,5
The result of the subtraction (lidar$Cmax - lidar$Cmin) is the same for every row.
row and col are the results of this arithmetic:
(lidar$A - lid$x1) / lid$x3 (row)
then
(lidar$B - lid$x4) / lid$x6 (col)
With the digits after the comma these values differ from row to row, but because the part after the comma is removed they end up being the same.
How can I get the desired output with these arithmetic operations?
Any help is valuable. Thank you.
If I understand your purpose correctly, the main question is how to remove the part after the comma, which is the decimal separator in your examples.
If that's true, one way of doing it is to split the number into two parts, the part before the comma and the part after it, and then keep only the first part. In R you can do this with strsplit(). However, this function requires character input, not numeric, so you need to coerce the numbers to characters, do the splitting, coerce the result back to numbers, and then extract the first element.
Here is an example of a function to implement the steps:
remove_after_comma <- function(num_with_comma) {
  myfun <- function(num_with_comma) {
    num_with_comma |>
      as.character() |>
      strsplit("[,|.]") |>
      unlist() |>
      as.numeric() |>
      getElement(1)
  }
  vapply(num_with_comma, myfun, FUN.VALUE = numeric(1))
}
Notes:
[,|.] is used to anticipate systems that use . instead of , as the decimal separator (inside a character class the characters are matched literally).
vapply is used so the function can be applied to a numeric vector, such as a numeric column.
Check:
remove_after_comma(c(a = '1,5', b = '12,74'))
# a b
# 1 12
(4:10)/3
#[1] 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
remove_after_comma ((4:10)/3)
#[1] 1 1 2 2 2 3 3
Assuming that lid$x3 = 10L in your example:
(lidar$A-lid$x1)/lid$x3
#[1] 1874.876 1874.876 1874.929 1874.539 1874.386 1874.386 1874.555 1874.197 1874.191 1873.841
remove_after_comma((lidar$A-lid$x1)/lid$x3)
#[1] 1874 1874 1874 1874 1874 1874 1874 1874 1874 1873
I'm not sure if this is what you mean:
lidar$row <- round((lidar$A - lid$x1) / lid$x3, 0)
lidar$col <- (lidar$B - lid$x4) / lid$x6
lidar$cdif <- max(lidar$C) - min(lidar$C)
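A hedged alternative, assuming the intent is to discard (not round) the digits after the decimal separator: base R's trunc() drops the fractional part directly. Note that lid$x3 is 0 in the posted lid data, so a non-zero step (e.g. 10, as assumed above) is needed for the division to make sense.
lidar$row  <- trunc((lidar$A - lid$x1) / lid$x3)   # fractional part discarded, not rounded
lidar$col  <- trunc((lidar$B - lid$x4) / lid$x6)
lidar$cdif <- max(lidar$C) - min(lidar$C)          # constant, repeated on every row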
I've got repeated measurements data in which patients are measured an irregular amount of times (2 through 6 times per patient) and also with unequally spaced time intervals (some subsequent measures are 6 months apart, some 3 years). Is it possible to model this in a GEE model? For example by specifying a continuous AR1 correlation structure?
I've got some example data:
library(tidyverse)
library(magrittr)
library(geepack)
library(broom)
example_data <- structure(list(pat_id = c(2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4,
4, 7, 7, 8, 8, 8, 13, 13), measurement_number = c(1L, 2L, 3L,
4L, 5L, 6L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 1L, 2L, 3L, 1L,
2L), time = c(0, 0.545, 2.168, 2.68, 3.184, 5.695, 0, 1.892,
0, 0.939, 1.451, 1.955, 4.353, 0, 4.449, 0, 0.465, 4.005, 0,
0.364), age_standardized = c(-0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -1.76464003778333, -1.76464003778333, -0.667610044472762,
-0.667610044472762, -0.667610044472762, -0.667610044472762, -0.667610044472762,
0.142696200586183, 0.142696200586183, 0.00556745142236116, 0.00556745142236116,
0.00556745142236116, 0.0554324511182961, 0.0554324511182961),
sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), outcome = c(4241.943359375, 4456.4,
6533.673242397, 7255.561628906, 7594.527875667, 6416.4, 373.782029756049,
614.318359374, 6675.19041238403, 10623.94276368, 10849.01013281,
10627.30859375, 13213, 541.40780090332, 2849.5551411438,
2136.2, 2098.1, 2063.9, 5753.56313232422, 5108.199752386)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
head(example_data)
# A tibble: 6 x 6
pat_id measurement_number time age_standardized sex outcome
<dbl> <int> <dbl> <dbl> <fct> <dbl>
1 2 1 0 -0.0942 Female 4242.
2 2 2 0.545 -0.0942 Female 4456.
3 2 3 2.17 -0.0942 Female 6534.
4 2 4 2.68 -0.0942 Female 7256.
5 2 5 3.18 -0.0942 Female 7595.
6 2 6 5.70 -0.0942 Female 6416.
I have actually also modelled these data with a linear mixed model (using nlme, specifying a continuous AR1), but my supervisor asked me to also explore a GEE, which is why I ask.
I've read that, using the geepack package, it is possible to define the correlation structure yourself, but I can't code well enough to see whether the structure can be defined so that rho is adjusted for the time interval between measurements (by making it rho^s, where s is the number of time units).
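As a starting point, here is a minimal sketch of a standard AR(1) GEE on the example data, using measurement_number as the wave indicator; it treats the measurements as equally spaced, so rho is not adjusted for the actual time gaps, which is exactly the part that would require a user-defined working correlation passed through geeglm's zcor argument:
library(geepack)
# AR(1) working correlation ordered by measurement_number within each patient;
# assumes equal spacing between consecutive measurements
fit_ar1 <- geeglm(outcome ~ time + age_standardized + sex,
                  id     = pat_id,
                  waves  = measurement_number,
                  data   = example_data,
                  family = gaussian,
                  corstr = "ar1")
summary(fit_ar1)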
Can you think of an intuitive way of counting the number of times the word "space" appears in a certain column? Or any other viable solution?
I basically want to know how many times the space key was pressed; however, some participants made mistakes and pressed other keys, which should also be counted as a mistake. So I was wondering whether I should instead use the "key_resp.rt" column and count the number of response times. If you have an idea of how to do both, that would be great, as I may need both.
I used the following code but the results do not conform to the data.
Data %>%
  group_by(Participant, Session) %>%
  summarise(false_start = sum(str_count(key_resp.keys, "space")))
Here is a snippet of my data:
Participant  RT        Session  key_resp.keys       key_resp.rt
X            0.431265  1        ["space"]           [2.3173399999941466]
X            0.217685  1
X            0.317435  2        ["space","space"]   [0.6671900000001187,2.032510000000002]
Y            0.252515  1
Y            0.05127   2        ["space","space","space","space","space","space","space","space","space"]   [4.917419999999765,6.151149999999689,6.333714999999771,6.638249999999971,6.833514999999338,7.0362499999992,7.217724999999504,7.38576999999988,7.66913999999997]
dput(droplevels(head(Data_PVT)))
structure(list(Interval_stimulus = c(4.157783411, 4.876139922,
5.67011868, 9.338167417, 9.196342656, 7.62448411), Participant = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ADH80254", class = "factor"),
RT = c(431.265, 277.99, 253.515, 310.53, 299.165, 539.46),
Session = c(1L, 1L, 1L, 1L, 1L, 1L), date = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "2020-06-12_11h11.47.141", class = "factor"),
key_resp.keys = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"[\"space\"]"), class = "factor"), key_resp.rt = structure(c(2L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "[2.3173399999941466]"
), class = "factor"), psychopyVersion = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = "2020.1.3", class = "factor"),
Trials = 0:5, Reciprocal = c(2.31875992719094, 3.59725169970143,
3.94453977082224, 3.22030077609249, 3.3426370063343, 1.85370555740926
)), row.names = c(NA, 6L), class = "data.frame")
Expected output:
Participant Session false_start
x 1 0
x 2 1
y 1 2
y 2 1
z 1 10
z 2 3
We can use str_count to count the "space" values for each Participant and Session and sum them to get the total. For all_false_start we count the number of words (any key name), so presses of other keys are included as well.
library(dplyr)
library(stringr)
df %>%
  group_by(Participant, Session) %>%
  summarise(false_start = sum(str_count(key_resp.keys, '\\bspace\\b')),
            all_false_start = sum(str_count(key_resp.keys, '\\b\\w+\\b')))
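If counting the recorded response times in key_resp.rt is preferred (the second option raised in the question), here is a hedged sketch, assuming the times are stored as bracketed, comma-separated strings as in the snippet above:
library(dplyr)
library(stringr)
df %>%
  group_by(Participant, Session) %>%
  # each numeric token inside the brackets is one response time; empty cells give 0
  summarise(n_rt = sum(str_count(as.character(key_resp.rt), "[0-9]+\\.?[0-9]*")))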
Good morning,
I am currently working on animal body condition scores (BCS). For each individual I have several rows, but not the same number of rows from one individual to another. As columns I have the animal name (factor), the date (factor) when the BCS was recorded, and the BCS itself (numeric).
Here is an example of my data:
structure(list(name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("INDIV1",
"INDIV2", "INDIV3", "INDIV4", "INDIV5", "INDIV6",
"INDIV7", "INDIV8", "INDIV9", "INDIV10",
"INDIV11", "INDIV12", "INDIV13", "INDIV14", "INDIV15",
"INDIV16", "INDIV17", "INDIV18", "INDIV19",
"INDIV20", "INDIV21", "INDIV22", "INDIV23",
"INDIV24", "INDIV25", "INDIV26", "INDIV27",
"INDIV28", "INDIV29", "INDIV30", "INDIV31",
"INDIV32", "INDIV33", "INDIV34", "INDIV35",
"INDIV36", "INDIV37", "INDIV38", "INDIV39", "INDIV40",
"INDIV41", "INDIV42", "INDIV43", "INDIV44", "INDIV45",
"INDIV46", "INDIV47", "INDIV48", "INDIV49",
"INDIV50", "INDIV51", "INDIV52", "INDIV53",
"INDIV54", "INDIV55", "INDIV56", "INDIV57", "INDIV58",
"INDIV59", "INDIV60", "INDIV61", "INDIV62",
"INDIV63", "INDIV64", "INDIV65", "INDIV66",
"INDIV67", "INDIV68", "INDIV69", "INDIV70",
"INDIV71", "INDIV72", "INDIV73", "INDIV74",
"INDIV75", "INDIV76", "INDIV77", "INDIV78", "INDIV79",
"INDIV80", "INDIV81", "INDIV82", "INDIV83",
"INDIV84", "INDIV85", "INDIV86", "INDIV87",
"INDIV88", "INDIV89", "INDIV90",
"INDIV91", "INDIV92", "INDIV93", "INDIV94",
"INDIV95", "INDIV96", "INDIV97", "INDIV98",
"INDIV99", "INDIV100", "INDIV101", "INDIV102", "INDIV103",
"INDIV104", "INDIV105", "INDIV106", "INDIV107", "INDIV108",
"INDIV109", "INDIV110", "INDIV111", "INDIV112",
"INDIV113", "INDIV114", "INDIV115", "INDIV116",
"INDIV117", "INDIV118"), class = "factor"), date = structure(c(4L,
4L, 4L, 36L, 36L, 36L, 8L, 8L, 8L, 21L, 21L, 21L, 38L, 38L, 38L,
1L, 1L, 1L, 4L, 4L), .Label = c("03/10/2019", "03/12/2019", "04/12/2019",
"05/02/2019", "06/02/2019", "07/04/2019", "08/01/2019", "10/04/2019",
"10/12/2019", "11/02/2019", "11/09/2019", "11/12/2019", "12/08/2019",
"12/09/2019", "12/12/2019", "13/02/2019", "13/03/2019", "13/08/2019",
"13/09/2019", "14/05/2019", "14/06/2019", "14/11/2019", "15/07/2019",
"15/10/2019", "15/11/2019", "16/01/2019", "16/04/2019", "16/07/2019",
"16/10/2019", "17/05/2019", "18/06/2019", "18/10/2019", "19/03/2019",
"19/06/2019", "19/12/2019", "20/03/2019", "21/03/2019", "23/07/2019",
"25/04/2019", "26/04/2019", "27/09/2019", "28/01/2019", "28/05/2019",
"28/06/2019", "31/05/2019"), class = "factor"), BCS = c(4, 4,
4, 4, 4, 4, 4, 4, 4, 4.75, 4.75, 4.75, 4.75, 4.75, 4.75, 4.5,
4.5, 4.5, 2.25, 2.25)), row.names = c(NA, 20L), class = "data.frame")
My goal here is to identify individuals whose BCS is >= 4 for every measurement.
I have tried to write functions using if and while statements but, so far, I can't get the information I am looking for...
I apologize in advance if this kind of question has been asked previously.
Thank you for your future help!
I named the data frame you provided df, so try:
df = droplevels(df)
tapply(df$BCS>=4,df$name,all)
INDIV1 INDIV2
TRUE FALSE
The step above turns each BCS value into a boolean (TRUE if >= 4); tapply then splits this boolean vector by name, and all() asks whether every element is TRUE.
From the result above, INDIV1 has all BCS >= 4.
To get the names, do:
names(which(tapply(df$BCS>=4,df$name,all)))
[1] "INDIV1"
Your objective is not entirely clear to me from
to identify individuals with a BCS >= 4 for each measurement
but maybe something like the below is your desired output:
> aggregate(BCS~name,df, FUN = function(x) all(x>=4))
name BCS
1 INDIV1 TRUE
2 INDIV2 FALSE
We can use the tidyverse:
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(BCS = all(BCS >= 4))
# A tibble: 2 x 2
# name BCS
# <fct> <lgl>
#1 INDIV1 TRUE
#2 INDIV2 FALSE