Unstructured txt file with similar pattern for all rows in R

I am currently working with a .txt file and have used the read_table2() function to read it, resulting in the following structure.
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 FVP110~ 2.08e6 1101~ 1.10e 3 6 0 0 0 6 01101 6 0 0 0 6 01101 6 0 0 0
2 FVP110~ 2.06e4 8 9.3 e 1 2 93 0 0 0 0 0 093 0 0 0 0 0 093 0 0
3 FVP110~ 2.10e6 6 9.3 e 1 2 93 0 0 0 0 0 093 0 0 0 0 0 093 0 0
4 FVP110~ 2.10e6 6 3.11e18 3111 8 0 0 0 8 03111 8 0 0 0 8 03111 8 0 0
5 FVP110~ 2.08e6 94 2 e 0 94 0 0 0 0 0 094 0 0 0 0 0 094 0 0 0
6 FVP110~ 2.06e4 6 9.2 e 1 2 92 0 0 0 0 0 092 0 0 0 0 0 092 0 0
# ... with 31 more variables: X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>,
# X29 <chr>, X30 <chr>, X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>, X37 <chr>, X38 <chr>, X39 <chr>,
# X40 <chr>, X41 <chr>, X42 <chr>, X43 <chr>, X44 <chr>, X45 <chr>, X46 <chr>, X47 <chr>, X48 <chr>, X49 <chr>, X50 <chr>,
# X51 <dbl>
I know that my first column, rather than being a single string such as FVP1104Q1V110121011010110110527421101011165, always follows a fixed-width layout: 4 characters, a 3-digit number, 2 characters, 2 characters, a 1-digit number, a 2-digit number, and so on. read_table2() gives me 51 columns, but parsed correctly there should be 129.
These are the first 10 rows and 15 columns of my data set.
structure(list(X1 = c("FVP1104Q1V110121011010110110527421101011165",
"FVP1104Q1V110121011010110110527421101022262", "FVP1104Q1V110121011010110110527421101033231",
"FVP1104Q1V110121011010110110527421101044134", "FVP1104Q1V110121011010110110527421102011165",
"FVP1104Q1V110121011010110110527421102022260", "FVP1104Q1V110121011010110110527421102033138",
"FVP1104Q1V110121011010110110527421102044232", "FVP1104Q1V11012101101011011052742110205616",
"FVP1104Q1V110121011010110110527421102063142"), X2 = c(2080110,
20601, 2100112, 2100112, 2080110, 20601, 2120115, 2100112, 10501,
40701), X3 = c("11011116112", "8", "6", "6", "94", "6", "6",
"6", "124", "8"), X4 = c(1101, 93, 93, 3111045932226084352, 2,
92, 3185102331226052608, 93, 91, 6), X5 = c(6, 2, 2, 3111, 94,
2, 3185, 2, 2, 11011216112), X6 = c(0, 93, 93, 8, 0, 92, 8, 93,
91, 1101), X7 = c("0", "0", "0", "0", "0", "0", "0", "0", "0",
"6"), X8 = c("0", "0", "0", "0", "0", "0", "0", "0", "0", "0"
), X9 = c("6", "0", "0", "0", "0", "0", "0", "0", "0", "0"),
X10 = c("01101", "0", "0", "8", "0", "0", "8", "0", "0",
"0"), X11 = c("6", "0", "0", "03111", "094", "0", "03185",
"0", "0", "6"), X12 = c("0", "093", "093", "8", "0", "092",
"8", "093", "091", "01101"), X13 = c("0", "0", "0", "0",
"0", "0", "0", "0", "0", "6"), X14 = c("0", "0", "0", "0",
"0", "0", "0", "0", "0", "0"), X15 = c("6", "0", "0", "0",
"0", "0", "0", "0", "0", "0")), row.names = c(NA, 10L), class = "data.frame")
And I want to get
structure(list(fileid = structure(c("FVP1", "FVP1", "FVP1", "FVP1",
"FVP1", "FVP1", "FVP1", "FVP1", "FVP1", "FVP1"), label = "File Identification", format.stata = "%9s"),
schedule = structure(c(104, 104, 104, 104, 104, 104, 104,
104, 104, 104), label = "Schedule", format.stata = "%8.0g"),
quarter = structure(c("Q3", "Q3", "Q3", "Q3", "Q3", "Q3",
"Q3", "Q3", "Q3", "Q3"), label = "Quarter", format.stata = "%9s"),
visit = structure(c("V1", "V1", "V1", "V1", "V1", "V1", "V1",
"V1", "V1", "V1"), label = "Visit", format.stata = "%9s"),
sector = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), label = "Sector", format.stata = "%8.0g"),
state = structure(c(36, 36, 36, 36, 36, 36, 36, 36, 36, 36
), label = "State/Ut Code", format.stata = "%8.0g"), district = structure(c(10,
10, 10, 10, 10, 10, 10, 10, 10, 10), label = "District Code", format.stata = "%8.0g"),
region = structure(c(362, 362, 362, 362, 362, 362, 362, 362,
362, 362), label = "NSS-Region", format.stata = "%8.0g"),
stratum = structure(c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2), label = "Stratum", format.stata = "%8.0g"),
substratum = structure(c(8, 8, 8, 8, 8, 8, 8, 8, 8, 8), label = "Sub-Stratum", format.stata = "%8.0g"),
subsample = structure(c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2), label = "Sub-Sample", format.stata = "%8.0g"),
subregion = structure(c(3613, 3613, 3613, 3613, 3613, 3613,
3613, 3613, 3613, 3613), label = "Fod Sub-Region", format.stata = "%8.0g"),
fsu = structure(c(50030, 50030, 50030, 50030, 50030, 50030,
50030, 50030, 50030, 50030), label = "FSU", format.stata = "%10.0g"),
sbno = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), label = "Sample Sg/Sb No.", format.stata = "%8.0g"),
sss = structure(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2), label = "Second Stage Stratum No.", format.stata = "%8.0g")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I'm trying to replicate what a Stata dictionary (.dct) file does when reading a .txt file, but I can't find a clear way to do that in R.
My data also includes NAs.

As per MrFlick's suggestion, we can use tidyr::separate to break apart your first column into multiple columns by position:
library(tidyr)
data.frame(X1 = "FVP1104Q1V110121011010110110527421101011165") %>%
  separate(
    X1,
    sep = c(4, 7, 9, 11, 12),
    into = paste0("X1_", 1:6)
  )
# X1_1 X1_2 X1_3 X1_4 X1_5 X1_6
# 1 FVP1 104 Q1 V1 1 0121011010110110527421101011165
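
Since a Stata dictionary describes a fixed-width layout, another option is to parse the raw lines by width instead of splitting on whitespace first. Below is a minimal sketch using base R's read.fwf(). The widths (4, 3, 2, 2, 1, 2) come from the question; pairing them with the names fileid through state is an assumption based on the desired output, and raw_lines stands in for the real file.

```r
# Two raw records taken from column X1 of the question. In practice,
# pass the path of the .txt file instead of a text connection.
raw_lines <- c(
  "FVP1104Q1V110121011010110110527421101011165",
  "FVP1104Q1V110121011010110110527421101022262"
)

# Widths as described in the question; names and types are assumed
# from the first six columns of the desired output.
layout <- data.frame(
  name  = c("fileid", "schedule", "quarter", "visit", "sector", "state"),
  width = c(4, 3, 2, 2, 1, 2),
  type  = c("character", "numeric", "character", "character", "numeric", "numeric"),
  stringsAsFactors = FALSE
)

con <- textConnection(raw_lines)
parsed <- read.fwf(
  con,
  widths     = layout$width,   # characters beyond the listed widths are ignored
  col.names  = layout$name,
  colClasses = layout$type
)
close(con)

str(parsed)
```

To cover all 129 fields, the layout table would need one row per entry in the .dct file. Reading by width also avoids the problem seen in the read_table2() output, where blanks inside a record shift values into the wrong columns.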

How to convert panel data with R so that each observation per ID is saved in one row and still arranged by year?

I am planning a supervised machine learning project that uses data from a panel (371'503 rows & 20 columns). The goal is to use data from 2002 to 2014. I have done a first round of preprocessing, and the data frame looks like the following in highly abbreviated form:
data_ex_old <- structure(
list(
ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
2, 3, 1, 2, 3, 1, 2, 3),
Studyyear = c(
2002,
2002,
2002,
2004,
2004,
2004,
2006,
2006,
2006,
2008,
2008,
2008,
2010,
2010,
2010,
2012,
2012,
2012,
2014,
2014,
2014
),
Gender = c(2, 1, 2, 2, 1,
2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2),
Predictor1 = c(
"6",
"5",
"4",
"NA",
"NA",
"NA",
"5",
"6",
"4",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA"
),
Predictor2 = c(2,
2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1),
Predictor3 = c(
"NA",
"NA",
"NA",
"6",
"0",
"0",
"NA",
"NA",
"NA",
"0",
"6",
"1",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA"
),
Outcome1 = c(
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"NA",
"0",
"1",
"1",
"NA",
"NA",
"NA",
"1",
"1",
"0",
"NA",
"NA",
"NA"
),
Outcome2 = c(0, 0, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1)
),
class = c("tbl_df",
"tbl", "data.frame"),
row.names = c(NA,-21L)
)
So that I can make useful predictions, all observations per ID should be in a single row (arranged by year). The result should look like this:
data_ex_new <-
structure(
list(
ID = c(1, 2, 3),
Gender = c(2, 1, 2),
Predictor1_2002 = c(6,
5, 4),
Predictor2_2002 = c(2, 2, 1),
Predictor3_2002 = c("NA",
"NA", "NA"),
Outcome1_2002 = c("NA", "NA", "NA"),
Outcome2_2002 = c(0,
0, 1),
Predictor1_2004 = c("NA", "NA", "NA"),
Predictor2_2004 = c(1,
2, 2),
Predictor3_2004 = c(6, 0, 0),
Outcome1_2004 = c("NA",
"NA", "NA"),
Outcome2_2004 = c(0, 0, 1),
Predictor1_2006 = c(5,
6, 4),
Predictor2_2006 = c(1, 2, 2),
Predictor3_2006 = c("NA",
"NA", "NA"),
Outcome1_2006 = c("NA", "NA", "NA"),
Outcome2_2006 = c(0,
0, 1),
Predictor1_2008 = c("NA", "NA", "NA"),
Predictor2_2008 = c(2,
2, 1),
Predictor3_2008 = c(0, 6, 1),
Outcome1_2008 = c(0, 1,
1),
Outcome2_2008 = c(0, 0, 1),
Predictor1_2010 = c("NA", "NA",
"NA"),
Predictor2_2010 = c(1, 2, 2),
Predictor3_2010 = c("NA",
"NA", "NA"),
Outcome1_2010 = c("NA", "NA", "NA"),
Outcome2_2010 = c(0,
0, 0),
Predictor1_2012 = c("NA", "NA", "NA"),
Predictor2_2012 = c(2,
2, 2),
Predictor3_2012 = c("NA", "NA", "NA"),
Outcome1_2012 = c(1,
1, 0),
Outcome2_2012 = c(1, 1, 0),
Predictor1_2014 = c("NA",
"NA", "NA"),
Predictor2_2014 = c(2, 2, 1),
Predictor3_2014 = c("NA",
"NA", "NA"),
Outcome1_2014 = c("NA", "NA", "NA"),
Outcome2_2014 = c(1,
1, 1)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-3L)
)
How can I convert the data in R so that all observations per ID are stored in one row and still arranged by year (see "data_ex_new")?
I have already tried different tidyr functions such as spread() to make the data wide, but so far nothing has worked.

I think this does it:
library(tidyr)
data_ex_old %>%
  pivot_wider(id_cols = c(ID, Gender), names_from = Studyyear,
              values_from = c(starts_with('Predictor'), starts_with('Outcome')))
# A tibble: 3 × 37
# ID Gender Predic…¹ Predi…² Predi…³ Predi…⁴ Predi…⁵ Predi…⁶ Predi…⁷ Predi…⁸ Predi…⁹ Predi…˟ Predi…˟
# <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 6 NA 5 NA NA NA NA 2 1 1 2
# 2 2 1 5 NA 6 NA NA NA NA 2 2 2 2
# 3 3 2 4 NA 4 NA NA NA NA 1 2 2 1
# # … with 24 more variables: Predictor2_2010 <dbl>, Predictor2_2012 <dbl>, Predictor2_2014 <dbl>,
# # Predictor3_2002 <chr>, Predictor3_2004 <chr>, Predictor3_2006 <chr>, Predictor3_2008 <chr>,
# # Predictor3_2010 <chr>, Predictor3_2012 <chr>, Predictor3_2014 <chr>, Outcome1_2002 <chr>,
# # Outcome1_2004 <chr>, Outcome1_2006 <chr>, Outcome1_2008 <chr>, Outcome1_2010 <chr>,
# # Outcome1_2012 <chr>, Outcome1_2014 <chr>, Outcome2_2002 <dbl>, Outcome2_2004 <dbl>,
# # Outcome2_2006 <dbl>, Outcome2_2008 <dbl>, Outcome2_2010 <dbl>, Outcome2_2012 <dbl>,
# # Outcome2_2014 <dbl>, and abbreviated variable names ¹Predictor1_2002, ²Predictor1_2004, …
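
The pivot_wider() call above groups the new columns by variable (all Predictor1 years first, and so on), whereas data_ex_new groups them by year. If the year-grouped order matters, base R's reshape() is one alternative, sketched here on a cut-down version of the data. The object names panel_long and panel_wide are made up for this example, and converting the "NA" strings to real missing values first is my own assumption about what is wanted.

```r
# Cut-down version of data_ex_old: two IDs, two study years.
panel_long <- data.frame(
  ID         = c(1, 2, 1, 2),
  Studyyear  = c(2002, 2002, 2004, 2004),
  Gender     = c(2, 1, 2, 1),
  Predictor1 = c("6", "5", "NA", "NA"),
  Outcome2   = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Turn the literal string "NA" into a real missing value, then make it numeric.
p1 <- panel_long$Predictor1
panel_long$Predictor1 <- as.numeric(replace(p1, p1 == "NA", NA))

# One row per ID; reshape() appends the measured columns year by year.
panel_wide <- reshape(
  panel_long,
  direction = "wide",
  idvar     = c("ID", "Gender"),
  timevar   = "Studyyear",
  sep       = "_"
)

names(panel_wide)
```

reshape() expects a plain data frame, so a tibble such as data_ex_old would first need as.data.frame().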

Create a binary variable based on a threshold in R

The following dataset contains 7 columns (AI_1 through AI_7) with 1440 observations per ID (42 IDs in total). I want to create a binary variable for each AI column based on a threshold. For example, if AI_1 > 0.1, a new variable called ACTIVITY should get the value 1, and otherwise 0. I tried this with the code below, but when I compute the mean of the binary variable it comes out above 1, which is odd since the variable can only be 0 or 1. Does anyone know how to create these 7 binary variables in the same dataset so that each mean lies between 0 and 1?
structure(list(X = 1:30, x1.time = c("00:00:00", "00:01:00",
"00:02:00", "00:03:00", "00:04:00", "00:05:00", "00:06:00", "00:07:00",
"00:08:00", "00:09:00", "00:10:00", "00:11:00", "00:12:00", "00:13:00",
"00:14:00", "00:15:00", "00:16:00", "00:17:00", "00:18:00", "00:19:00",
"00:20:00", "00:21:00", "00:22:00", "00:23:00", "00:24:00", "00:25:00",
"00:26:00", "00:27:00", "00:28:00", "00:29:00"), AI_1 = c(0.17532896077581,
0.174249939439765, 0.174170544792533, 0.172877357886967, 0.173679017353614,
0.174216799443538, 0.174514454250882, 0.174656389074666, 0.173377175454716,
0.173044040397703, 0.172476572884875, 0.174738790856458, 0.173833445732856,
0.174229265722835, 0.174392878820111, 0.174715890976243, 0.174241614289181,
0.173229751013599, 0.173579164085914, 0.173829069216696, 0.173499039975341,
0.174387946222767, 0.173802854581089, 0.174107580137568, 0.174113709936873,
0.173172609295233, 0.174509255493075, 0.173383120975257, 0.173398927511582,
0.173466516952908), AI_2 = c(0.173549588758752, 0, 0.85729795236214,
0.513925586220723, 0.140789239632585, 0.0989981552300843, 0.321625480480368,
0.62540390366724, 0.00714855410741877, 0, 0, 0, 0.212943798631015,
0, 0, 0.023650258664654, 0.00159158576982517, 0.0172670511608436,
0, 0, 0, 0.25653572767355, 0.41158598021939, 0.433889173147664,
0.442200975044019, 0.471931171507954, 0.415009919603445, 0.43364443321512,
0.449930874231746, 0.48397633182816), AI_3 = c(0.026069149474549,
0.0417747330978121, 0.276687600798659, 0.258591321128928, 0.208790296683244,
0.0300099278967508, 0.15234594700642, 0.26519848659315, 0.34220566727692,
0.352310255219813, 0.297621781376737, 0.292800000618149, 0.481566536382664,
0.337770306519177, 0.743182296874282, 0.256202127993172, 0.201340506649845,
0.200155318345632, 0.237126429055375, 0.234974163009848, 0.235808994849961,
0.302168675921402, 0.377936665388589, 0.416123299239618, 0.389279883023212,
0.357972848973051, 0.305268847437493, 0.290040891577408, 0.197384083463156,
0.258282654013295), AI_4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00841646877382803,
0), AI_5 = c(0, 0, 0.0015062890214412, 0.00154798776365785, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), AI_6 = c(0.190018331633492, 0.241159552783285, 0.231916111803065,
0.193196835220518, 0.240381778378367, 0.266125762332231, 0.339227319507121,
0.354841547583334, 0.277011867279295, 0.474462632995715, 0.516356521276347,
0.559477604383845, 0.374857636694405, 0.376675155204282, 0.516347133869462,
0.627633542885353, 0.565732682034457, 0.544148310829377, 0.545022418887296,
0.602327138107482, 0.529578366594453, 0.571672817412653, 0.51963881197827,
0.493590581088222, 0.487545798153711, 0.525272191616523, 0.586906227102549,
0.555446579214151, 0.578788883825157, 0.617822898150646), AI_7 = c(0.139608768263461,
0.165583663096789, 0.326959508587122, 0.221739297198209, 0.160657663051105,
0.107439748199699, 0.117594125364214, 0.133528520361788, 0.117950354159875,
0.131428192187155, 0.125355403562937, 0.119185646272255, 0.196285453922129,
0.167061057207379, 0.169855099745761, 0.141077126343563, 0.078433720675593,
0.0999303057993443, 0.0798045801131668, 0.0331137028671696, 0.0920945831761988,
0.0233052285173748, 0, 0, 0, 0.00876293044107867, 0, 0.109134564970416,
0.110323312017635, 0.117772975747077), ID = c("ID1", "ID1", "ID1",
"ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1",
"ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1",
"ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1"
), activity = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0), activity2 = c("0",
"1", "0", "0", "0", "1", "0", "0", "1", "1", "1", "1", "0", "1",
"1", "1", "1", "1", "1", "1", "1", "0", "0", "0", "0", "0", "0",
"0", "0", "0"), activity3 = c("1", "1", "0", "0", "0", "1", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0"), activity4 = c("1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1"), activity5 = c("1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), activity6 = c("0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "0", "0"), activity7 = c("0", "0", "0", "0", "0", "0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "0", "0", "0")), row.names = c(NA,
30L), class = "data.frame")
This is the code I used
Threshold <- Activity_index_1 %>%
mutate(activity = case_when(
AI_1 <= 0.1 ~ "1",
AI_1 > 0.1 ~ "0",
))
Threshold2 <- Threshold %>%
mutate(activity2 = case_when(
AI_2 <= 0.1 ~ "1",
AI_2 > 0.1 ~ "0",
))
Threshold3 <- Threshold2 %>%
mutate(activity3 = case_when(
AI_3 <= 0.1 ~ "1",
AI_3 > 0.1 ~ "0",
))
Threshold4 <- Threshold3 %>%
mutate(activity4 = case_when(
AI_4 <= 0.1 ~ "1",
AI_4 > 0.1 ~ "0",
))
Threshold5 <- Threshold4 %>%
mutate(activity5 = case_when(
AI_5 <= 0.1 ~ "1",
AI_5 > 0.1 ~ "0",
))
Threshold6 <- Threshold5 %>%
mutate(activity6 = case_when(
AI_6 <= 0.1 ~ "1",
AI_6 > 0.1 ~ "0",
))
Threshold7 <- Threshold6 %>%
mutate(activity7 = case_when(
AI_7 <= 0.1 ~ "1",
AI_7 > 0.1 ~ "0",
))

Here is a solution with mutate/across and a logical condition returning FALSE/TRUE then coerced to integers 0/1.
The posted data already has columns activity so I start by removing them from the data.
suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})
Threshold <- Activity_index_1 %>%
  select(-starts_with("activity")) %>%
  mutate(across(starts_with("AI_"), ~ as.integer(.x <= 0.1), .names = "activity_{col}")) %>%
  rename_at(vars(starts_with("activity_AI")), ~ str_remove(., "_AI_"))
str(Threshold)
#> 'data.frame': 30 obs. of 17 variables:
#> $ X : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ x1.time : chr "00:00:00" "00:01:00" "00:02:00" "00:03:00" ...
#> $ AI_1 : num 0.175 0.174 0.174 0.173 0.174 ...
#> $ AI_2 : num 0.174 0 0.857 0.514 0.141 ...
#> $ AI_3 : num 0.0261 0.0418 0.2767 0.2586 0.2088 ...
#> $ AI_4 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ AI_5 : num 0 0 0.00151 0.00155 0 ...
#> $ AI_6 : num 0.19 0.241 0.232 0.193 0.24 ...
#> $ AI_7 : num 0.14 0.166 0.327 0.222 0.161 ...
#> $ ID : chr "ID1" "ID1" "ID1" "ID1" ...
#> $ activity1: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ activity2: int 0 1 0 0 0 1 0 0 1 1 ...
#> $ activity3: int 1 1 0 0 0 1 0 0 0 0 ...
#> $ activity4: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ activity5: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ activity6: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ activity7: int 0 0 0 0 0 0 0 0 0 0 ...
Created on 2022-10-10 with reprex v2.0.2
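
Storing the flags as integers rather than as the strings "0" and "1" is what lets mean() be read as a proportion. The sketch below shows the same idea without any packages, on a few rounded values from the posted data. The names ai, threshold and flags are made up for this example, and the comparison follows the direction used in the code above (<= 0.1 gives 1); flip it to > if the flag should mark values above the threshold.

```r
# A few rounded values from AI_1 and AI_2 in the posted data.
ai <- data.frame(
  AI_1 = c(0.175, 0.174, 0.174, 0.173),
  AI_2 = c(0.174, 0, 0.857, 0.514)
)

threshold <- 0.1
ai_cols <- grep("^AI_", names(ai), value = TRUE)

# A logical comparison coerced to integer yields 0/1 flags.
flags <- lapply(ai[ai_cols], function(x) as.integer(x <= threshold))
names(flags) <- sub("^AI_", "activity", ai_cols)
ai[names(flags)] <- flags

# Each mean is the share of rows at or below the threshold.
colMeans(ai[names(flags)])
```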

A base R alternative: compare only the AI columns with 0.1, convert the logical result to numeric, rename the columns, and cbind them onto the data.
res <- cbind(dat, ((dat[grep('^AI', names(dat))] <= .1)^1) |>
{\(.) `colnames<-`(., gsub('AI', 'activity', colnames(.)))}())
str(res)
# 'data.frame': 30 obs. of 16 variables:
# $ x1.time : chr "00:00:00" "00:01:00" "00:02:00" "00:03:00" ...
# $ AI_1 : num 0.175 0.174 0.174 0.173 0.174 ...
# $ AI_2 : num 0.174 0 0.857 0.514 0.141 ...
# $ AI_3 : num 0.0261 0.0418 0.2767 0.2586 0.2088 ...
# $ AI_4 : num 0 0 0 0 0 0 0 0 0 0 ...
# $ AI_5 : num 0 0 0.00151 0.00155 0 ...
# $ AI_6 : num 0.19 0.241 0.232 0.193 0.24 ...
# $ AI_7 : num 0.14 0.166 0.327 0.222 0.161 ...
# $ ID : chr "ID1" "ID1" "ID1" "ID1" ...
# $ activity_1: num 0 0 0 0 0 0 0 0 0 0 ...
# $ activity_2: num 0 1 0 0 0 1 0 0 1 1 ...
# $ activity_3: num 1 1 0 0 0 1 0 0 0 0 ...
# $ activity_4: num 1 1 1 1 1 1 1 1 1 1 ...
# $ activity_5: num 1 1 1 1 1 1 1 1 1 1 ...
# $ activity_6: num 0 0 0 0 0 0 0 0 0 0 ...
# $ activity_7: num 0 0 0 0 0 0 0 0 0 0 ...
Data:
dat <- structure(list(x1.time = c("00:00:00", "00:01:00", "00:02:00",
"00:03:00", "00:04:00", "00:05:00", "00:06:00", "00:07:00", "00:08:00",
"00:09:00", "00:10:00", "00:11:00", "00:12:00", "00:13:00", "00:14:00",
"00:15:00", "00:16:00", "00:17:00", "00:18:00", "00:19:00", "00:20:00",
"00:21:00", "00:22:00", "00:23:00", "00:24:00", "00:25:00", "00:26:00",
"00:27:00", "00:28:00", "00:29:00"), AI_1 = c(0.17532896077581,
0.174249939439765, 0.174170544792533, 0.172877357886967, 0.173679017353614,
0.174216799443538, 0.174514454250882, 0.174656389074666, 0.173377175454716,
0.173044040397703, 0.172476572884875, 0.174738790856458, 0.173833445732856,
0.174229265722835, 0.174392878820111, 0.174715890976243, 0.174241614289181,
0.173229751013599, 0.173579164085914, 0.173829069216696, 0.173499039975341,
0.174387946222767, 0.173802854581089, 0.174107580137568, 0.174113709936873,
0.173172609295233, 0.174509255493075, 0.173383120975257, 0.173398927511582,
0.173466516952908), AI_2 = c(0.173549588758752, 0, 0.85729795236214,
0.513925586220723, 0.140789239632585, 0.0989981552300843, 0.321625480480368,
0.62540390366724, 0.00714855410741877, 0, 0, 0, 0.212943798631015,
0, 0, 0.023650258664654, 0.00159158576982517, 0.0172670511608436,
0, 0, 0, 0.25653572767355, 0.41158598021939, 0.433889173147664,
0.442200975044019, 0.471931171507954, 0.415009919603445, 0.43364443321512,
0.449930874231746, 0.48397633182816), AI_3 = c(0.026069149474549,
0.0417747330978121, 0.276687600798659, 0.258591321128928, 0.208790296683244,
0.0300099278967508, 0.15234594700642, 0.26519848659315, 0.34220566727692,
0.352310255219813, 0.297621781376737, 0.292800000618149, 0.481566536382664,
0.337770306519177, 0.743182296874282, 0.256202127993172, 0.201340506649845,
0.200155318345632, 0.237126429055375, 0.234974163009848, 0.235808994849961,
0.302168675921402, 0.377936665388589, 0.416123299239618, 0.389279883023212,
0.357972848973051, 0.305268847437493, 0.290040891577408, 0.197384083463156,
0.258282654013295), AI_4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00841646877382803,
0), AI_5 = c(0, 0, 0.0015062890214412, 0.00154798776365785, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), AI_6 = c(0.190018331633492, 0.241159552783285, 0.231916111803065,
0.193196835220518, 0.240381778378367, 0.266125762332231, 0.339227319507121,
0.354841547583334, 0.277011867279295, 0.474462632995715, 0.516356521276347,
0.559477604383845, 0.374857636694405, 0.376675155204282, 0.516347133869462,
0.627633542885353, 0.565732682034457, 0.544148310829377, 0.545022418887296,
0.602327138107482, 0.529578366594453, 0.571672817412653, 0.51963881197827,
0.493590581088222, 0.487545798153711, 0.525272191616523, 0.586906227102549,
0.555446579214151, 0.578788883825157, 0.617822898150646), AI_7 = c(0.139608768263461,
0.165583663096789, 0.326959508587122, 0.221739297198209, 0.160657663051105,
0.107439748199699, 0.117594125364214, 0.133528520361788, 0.117950354159875,
0.131428192187155, 0.125355403562937, 0.119185646272255, 0.196285453922129,
0.167061057207379, 0.169855099745761, 0.141077126343563, 0.078433720675593,
0.0999303057993443, 0.0798045801131668, 0.0331137028671696, 0.0920945831761988,
0.0233052285173748, 0, 0, 0, 0.00876293044107867, 0, 0.109134564970416,
0.110323312017635, 0.117772975747077), ID = c("ID1", "ID1", "ID1",
"ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1",
"ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1",
"ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1"
)), row.names = c(NA, 30L), class = "data.frame")

Data cleaning, from cross-sectional (multiple files) to panel in RStudio: merge/gather?

I have yearly observations for individuals on different variables from 2008-2020. I have data on family (25 variables), income (15 variables), and schooling (22 variables).
Right now, I have 'cleaned' every single dataset so that the columns of each category have the same names in every year. For context, this is what my R looks like now.
I would like to end up with one big dataframe containing all individuals and years. I know that I should or could use inner_join or merge, matching on 'HouseholdMember', and that I could use the gather function, but I am struggling to work out in what order to do this and where to start. I've tried a lot of things, but with this many dataframes it's hard to keep track of what I'm doing. I also created lists of every category for every year because one method recommended it, but that did not work out.
I want to end up with a dataframe that looks similar to this:
Individual  Year  Var1   Var2
1           2008  value  value
1           2009  value  value
1           2010  value  value
2           2008  value  value
2           2009  value  value
2           2010  value  value
What should I do as a first step? If I merge the dataframes, I don't think R will know which values correspond to which year.
> head(fam08)
# A tibble: 6 x 25
HouseholdMember RandomChild YearBirthRandom Gender Age FatherBirth FatherAlive MotherBirth MotherAlive Divorce SeeFather SeeMother
<dbl> <dbl+lbl> <dbl> <dbl+l> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl> <dbl+lbl>
1 800033 16 [not ap… NA 1 [mal… 16 1952 1 [yes] 1961 1 [yes] 1 [yes] 7 [ever… 7 [ever…
2 800042 16 [not ap… NA 2 [fem… 32 1946 1 [yes] 1948 1 [yes] 2 [no] 4 [at l… 4 [at l…
3 800045 16 [not ap… NA 1 [mal… 65 1913 2 [no] 1915 2 [no] 2 [no] NA NA
4 800057 16 [not ap… NA 1 [mal… 33 1939 1 [yes] 1945 1 [yes] 1 [yes] 4 [at l… 4 [at l…
5 800076 16 [not ap… NA 2 [fem… 22 1955 1 [yes] 1955 1 [yes] 1 [yes] 5 [at l… 3 [a fe…
6 800119 16 [not ap… NA 2 [fem… 57 1908 2 [no] 1918 2 [no] 2 [no] NA NA
# … with 13 more variables: Married <dbl+lbl>, Child <dbl+lbl>, NumChild <dbl>, SchoolCH1 <dbl+lbl>, SchoolCH2 <dbl+lbl>,
# SchoolCH3 <dbl+lbl>, SchoolCH4 <dbl+lbl>, BirthCH1 <dbl>, BirthCH2 <dbl>, BirthCH3 <dbl>, BirthCH4 <dbl>, FamSatisfaction <dbl+lbl>,
# Year <dbl>
> head(fam09)
# A tibble: 6 x 25
HouseholdMember RandomChild YearBirthRandom Gender Age FatherBirth FatherAlive MotherBirth MotherAlive Divorce SeeFather SeeMother
<dbl> <dbl+lbl> <dbl> <dbl+l> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl> <dbl+lbl>
1 800033 16 [not ap… NA 1 [mal… 17 1952 1 [yes] 1961 1 [yes] NA 5 [at l… 7 [ever…
2 800042 16 [not ap… NA 2 [fem… 33 1946 1 [yes] 1948 1 [yes] NA 4 [at l… 4 [at l…
3 800057 16 [not ap… NA 1 [mal… 34 1939 1 [yes] 1945 1 [yes] NA 3 [a fe… 3 [a fe…
4 800076 16 [not ap… NA 2 [fem… 23 1955 1 [yes] 1955 1 [yes] NA 5 [at l… 3 [a fe…
5 800119 16 [not ap… NA 2 [fem… 58 NA NA NA NA NA NA NA
6 800125 16 [not ap… NA 2 [fem… 50 NA NA 1928 1 [yes] NA NA 1 [neve…
# … with 13 more variables: Married <dbl+lbl>, Child <dbl+lbl>, NumChild <dbl>, SchoolCH1 <dbl+lbl>, SchoolCH2 <dbl+lbl>,
# SchoolCH3 <dbl+lbl>, SchoolCH4 <dbl+lbl>, BirthCH1 <dbl>, BirthCH2 <dbl>, BirthCH3 <dbl>, BirthCH4 <dbl>, FamSatisfaction <dbl+lbl>,
# Year <dbl>
dput(head(fam09,10))
structure(list(HouseholdMember = c(800033, 800042, 800057, 800076,
800119, 800125, 800170, 800186, 800201, 800204), RandomChild = structure(c(16,
16, 16, 16, 16, 16, 3, 16, 16, 16), label = "Randomly chosen child", labels = c(`child 1` = 1,
`child 2` = 2, `child 3` = 3, `child 4` = 4, `child 5` = 5, `child 6` = 6,
`child 7` = 7, `child 8` = 8, `child 9` = 9, `child 10` = 10,
`child 11` = 11, `child 12` = 12, `child 13` = 13, `child 14` = 14,
`child 15` = 15, `not applicable` = 16), class = "haven_labelled"),
YearBirthRandom = c(NA, NA, NA, NA, NA, NA, 1999, NA, NA,
NA), Gender = structure(c(1, 2, 1, 2, 2, 2, 2, 2, 1, 1), label = "Gender respondent", labels = c(male = 1,
female = 2), class = "haven_labelled"), Age = c(17, 33, 34,
23, 58, 50, 50, 69, 35, 67), FatherBirth = structure(c(1952,
1946, 1939, 1955, NA, NA, 1926, NA, 1948, NA), label = "What is the year of birth of your father?", labels = c(`I don't know` = 99999), class = "haven_labelled"),
FatherAlive = structure(c(1, 1, 1, 1, NA, NA, 1, NA, 1, NA
), label = "Is your father still alive?", labels = c(yes = 1,
no = 2, `I don't know` = 99), class = "haven_labelled"),
MotherBirth = structure(c(1961, 1948, 1945, 1955, NA, 1928,
1931, NA, 1950, NA), label = "What is the year of birth of your mother?", labels = c(`I don't know` = 99999), class = "haven_labelled"),
MotherAlive = structure(c(1, 1, 1, 1, NA, 1, 1, NA, 1, NA
), label = "Is your mother still alive?", labels = c(yes = 1,
no = 2, `I don't know` = 99), class = "haven_labelled"),
Divorce = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), label = "Did your own parents ever divorce?", labels = c(yes = 1,
no = 2, `my parents never had a relationship` = 3, `I don't know` = 99
), class = "haven_labelled"), SeeFather = structure(c(5,
4, 3, 5, NA, NA, 6, NA, 3, NA), label = "How often did you see your father over the past 12 months?", labels = c(never = 1,
once = 2, `a few times` = 3, `at least every month` = 4,
`at least every week` = 5, `a few times per week` = 6, `every day` = 7
), class = "haven_labelled"), SeeMother = structure(c(7,
4, 3, 3, NA, 1, 6, NA, 3, NA), label = "How often did you see your mother over the past 12 months?", labels = c(never = 1,
once = 2, `a few times` = 3, `at least every month` = 4,
`at least every week` = 5, `a few times per week` = 6, `every day` = 7
), class = "haven_labelled"), Married = structure(c(NA, 1,
2, 2, 1, 2, 1, 1, 1, 1), label = "Are you married to this partner?", labels = c(yes = 1,
no = 2), class = "haven_labelled"), Child = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), label = "Have you had any children?", labels = c(yes = 1,
no = 2), class = "haven_labelled"), NumChild = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), SchoolCH1 = structure(c(NA,
NA, NA, NA, NA, NA, 4, NA, NA, NA), label = "What school does child 1 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), SchoolCH2 = structure(c(NA,
NA, NA, NA, NA, NA, 3, NA, NA, NA), label = "What school does child 2 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), SchoolCH3 = structure(c(NA,
NA, NA, NA, NA, NA, 1, NA, NA, NA), label = "What school does child 3 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), SchoolCH4 = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), label = "What school does child 4 (born in the years 1991 through 2004) attend?", labels = c(`primary school` = 1,
`school for special primary education` = 2, `secondary school` = 3,
other = 4), class = "haven_labelled"), BirthCH1 = c(NA, 2005,
2007, NA, 1983, NA, 1991, 1964, NA, 1974), BirthCH2 = c(NA,
2007, NA, NA, 1985, NA, 1994, 1966, NA, 1976), BirthCH3 = c(NA,
NA, NA, NA, NA, NA, 1999, 1970, NA, NA), BirthCH4 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), FamSatisfaction = structure(c(NA,
8, 9, NA, 8, NA, 8, NA, NA, NA), label = "How satisfied are you with your family life?", labels = c(`entirely dissatisfied` = 0,
`entirely satisfied` = 10, `I don’t know` = 999), class = "haven_labelled"),
Year = c(2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,
2009, 2009)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))

I believe you could do something along these lines:
fam = bind_rows(fam_list)
inc = bind_rows(inc_list)
ws = bind_rows(ws_list)
result = fam %>%
  left_join(inc, by=c("HouseholdMember", "Year")) %>%
  left_join(ws, by=c("HouseholdMember", "Year"))
Output:
HouseholdMember Year fam_v1 fam_v2 fam_v3 inc_v1 inc_v2 inc_v3 ws_v1 ws_v2 ws_v3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8001 2008 0.609 -0.253 -1.30 0.0147 0.719 -0.765 0.120 0.974 -0.764
2 8002 2008 0.395 1.73 -0.503 0.119 -3.33 -0.798 0.325 0.664 1.65
3 8003 2008 0.562 0.157 0.243 -1.18 -0.260 0.105 1.09 0.855 1.19
4 8004 2008 1.32 0.737 -1.18 0.725 -1.82 0.356 0.362 2.04 1.76
5 8005 2008 -0.497 -0.444 -0.632 -0.534 1.63 0.984 1.29 0.614 0.576
6 8006 2008 -1.70 -0.989 -1.32 0.868 0.0979 0.468 -0.0146 1.11 0.957
7 8007 2008 -2.19 -0.419 1.69 1.34 -0.404 -1.43 -0.156 0.648 -0.186
8 8008 2008 1.48 0.350 -0.595 0.785 -0.609 1.28 -1.01 1.04 0.845
9 8009 2008 -0.315 -0.530 0.419 0.390 -0.0951 -0.755 0.135 0.696 -1.97
10 8010 2008 -0.882 1.38 2.06 -0.0757 1.53 -0.494 -1.03 1.14 1.87
Note:
I manufactured the data for this example by creating lists of tibbles; I believe fam_list, inc_list, and ws_list are similar to the list objects in your image. Each is a list of data frames / tibbles. I then use bind_rows to stack the similarly structured tibbles in each list, which gives three large tibbles.
Finally, I use left_join twice to join inc and ws to fam.
Input Data:
library(tidyverse)
# One tibble per year (2008 to 2020) for each category
fam_list = lapply(8:20, function(x) {
  tibble(HouseholdMember = 8000 + 1:100,
         Year = 2000 + x,
         fam_v1 = rnorm(100),
         fam_v2 = rnorm(100),
         fam_v3 = rnorm(100)
  )
})
names(fam_list) = paste0("fam_", 2008:2020)
inc_list = lapply(8:20, function(x) {
  tibble(HouseholdMember = 8000 + 1:100,
         Year = 2000 + x,
         inc_v1 = rnorm(100),
         inc_v2 = rnorm(100),
         inc_v3 = rnorm(100)
  )
})
names(inc_list) = paste0("inc_", 2008:2020)
ws_list = lapply(8:20, function(x) {
  tibble(HouseholdMember = 8000 + 1:100,
         Year = 2000 + x,
         ws_v1 = rnorm(100),
         ws_v2 = rnorm(100),
         ws_v3 = rnorm(100)
  )
})
names(ws_list) = paste0("ws_", 2008:2020)
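
On the worry that R will not know which values belong to which year: each yearly table already carries a Year column, so stacking keeps that information, and joining on both HouseholdMember and Year lines the categories up row by row. Here is the same idea as a small base R sketch. The helper make_year() and its placeholder values are made up for this example.

```r
# Hypothetical helper: one tiny yearly table for a given category prefix.
make_year <- function(year, prefix) {
  df <- data.frame(HouseholdMember = c(8001, 8002), Year = year)
  df[[paste0(prefix, "_v1")]] <- c(1, 2) + year  # placeholder values
  df
}

fam_list <- lapply(2008:2010, make_year, prefix = "fam")
inc_list <- lapply(2008:2010, make_year, prefix = "inc")

# Step 1: stack the years within each category (Year travels with each row).
fam <- do.call(rbind, fam_list)
inc <- do.call(rbind, inc_list)

# Step 2: join the categories on person and year; all.x = TRUE keeps every fam row.
panel <- merge(fam, inc, by = c("HouseholdMember", "Year"), all.x = TRUE)
panel <- panel[order(panel$HouseholdMember, panel$Year), ]
panel
```

If a yearly table lacked a Year column, it would have to be added before stacking, for example from the list names.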

R how to use case_when() to determine if previous value in a column is greater than the proceeding value in an ordered vector

I am working on calculating growth for a coral demography dataset and need to make a comparison of Max Diameter (cm) to determine at what TimeStep corals shrank. I attempted to use lag but for some reason, my new column is all NA instead of only the rows where it changes to a new coral ID. Does anyone have a sense of what I need to do to make it so my Diff column only has NAs where a transition to a new colony occurs?
Dataframe
A tibble: 20 x 22
`Taxonomic Code` ID Date Year Site_long Shelter `Module #` Side Location Settlement_Area TimeStep size_class `Cover Code` `Max Diameter (… `Max Orthogonal…
<chr> <fct> <date> <chr> <fct> <fct> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 PR H30 2018-11-27 18 Hanauma … Low 216 S D3 0.759 7 3 2 22 17
2 PR H30 2019-02-26 19 Hanauma … Low 216 S D3 0.751 8 3 1 24 19
3 PR H30 2019-05-28 19 Hanauma … Low 216 S D3 0.607 9 3 1 30 20
4 PR H30 2019-08-27 19 Hanauma … Low 216 S D3 0.615 10 1 1 8 8
5 PR H30 2019-11-26 19 Hanauma … Low 216 S D3 0.622 11 5 1 46 30
6 PR H37 2018-09-09 18 Hanauma … High 215 S C1 0.759 6 2 1 14 12
7 PR H37 2018-11-27 18 Hanauma … High 215 S C1 0.751 7 3 1 22 19
8 PR H37 2019-03-12 19 Hanauma … High 215 S C1 0.759 8 3 1 26 20
9 PR H37 2019-05-21 19 Hanauma … High 215 S C1 0.759 9 3 3 29 21
10 PR H37 2019-09-03 19 Hanauma … High 215 S C1 0.683 10 3 1 30 26
11 PR H66 2018-06-05 18 Hanauma … High 213 N A1 0.759 5 2 1 20 19
12 PR H66 2018-09-09 18 Hanauma … High 213 N A1 0.759 6 2 1 20 19
13 PR H66 2018-12-04 18 Hanauma … High 213 N A1 0.653 7 3 1 24 22
14 PR H66 2019-03-05 19 Hanauma … High 213 N A1 0.759 8 3 1 25 24
15 PR H66 2019-05-28 19 Hanauma … High 213 N A1 0.615 9 3 1 28 24
16 PR H66 2019-09-03 19 Hanauma … High 213 N A1 0.531 10 3 1 23 20
17 PR H66 2019-12-03 19 Hanauma … High 213 N A1 0.600 11 3 1 23 16
18 PR H76 2018-09-09 18 Hanauma … High 213 N A4 0.759 6 3 1 21 18
19 PR H76 2018-12-04 18 Hanauma … High 213 N A4 0.653 7 3 1 24 12
20 PR H76 2019-03-05 19 Hanauma … High 213 N A4 0.759 8 3 1 22 19
# … with 7 more variables: `Height (cm)` <dbl>, `Status Code` <chr>, area_mm_squared <dbl>, area_cm_squared <dbl>, Volume_mm_cubed <dbl>, Volume_cm_cubed <dbl>, MD <dbl>
Dataframe Code
data <- structure(list(`Taxonomic Code` = c("PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR"), ID = structure(c(35L, 35L, 35L, 35L,
35L, 38L, 38L, 38L, 38L, 38L, 55L, 55L, 55L, 55L, 55L, 55L, 55L,
61L, 61L, 61L), .Label = c("H1051", "H108", "H110", "H1101",
"H112", "H113", "H116", "H118", "H1188", "H1211", "H122", "H125",
"H1253", "H1289", "H171", "H172", "H174", "H186", "H187", "H188",
"H189", "H191", "H192", "H236", "H237", "H244", "H252", "H254",
"H258", "H274", "H277", "H288", "H292", "H293", "H30", "H332",
"H366", "H37", "H374", "H396", "H466", "H479", "H484", "H499",
"H531", "H560", "H580", "H593", "H597", "H625", "H644", "H647",
"H649", "H653", "H66", "H693", "H695", "H712", "H728", "H737",
"H76", "H760", "H774", "H854", "H926", "H96", "H963", "H98",
"H985", "H991", "H996", "W1038", "W1101", "W1152", "W1154", "W1192",
"W1208", "W1209", "W1214", "W1227", "W1243", "W1245", "W1315",
"W1345", "W1361", "W1377", "W1399", "W1438", "W1494", "W1495",
"W1537", "W1557", "W1614", "W1636", "W1655", "W1669", "W1690",
"W1697", "W1729", "W1741", "W1758", "W1782", "W1785", "W1847",
"W1919", "W2000", "W2004", "W2011", "W2036", "W2044", "W2046",
"W2131", "W2133", "W234", "W249", "W251", "W254", "W307", "W355",
"W359", "W369", "W433", "W450", "W461", "W470", "W480", "W538",
"W542", "W544", "W584", "W601", "W606", "W781", "W79", "W807",
"W872", "W874", "W887", "W890", "W891", "W923", "W952"), class = "factor"),
Date = structure(c(17862, 17953, 18044, 18135, 18226, 17783,
17862, 17967, 18037, 18142, 17687, 17783, 17869, 17960, 18044,
18142, 18233, 17783, 17869, 17960), class = "Date"), Year = c("18",
"19", "19", "19", "19", "18", "18", "19", "19", "19", "18",
"18", "18", "19", "19", "19", "19", "18", "18", "19"), Site_long = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("Hanauma Bay", "Waikiki"), class = "factor"),
Shelter = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("High",
"Low"), class = "factor"), `Module #` = c(216, 216, 216,
216, 216, 215, 215, 215, 215, 215, 213, 213, 213, 213, 213,
213, 213, 213, 213, 213), Side = c("S", "S", "S", "S", "S",
"S", "S", "S", "S", "S", "N", "N", "N", "N", "N", "N", "N",
"N", "N", "N"), Location = c("D3", "D3", "D3", "D3", "D3",
"C1", "C1", "C1", "C1", "C1", "A1", "A1", "A1", "A1", "A1",
"A1", "A1", "A4", "A4", "A4"), Settlement_Area = c(0.75902336,
0.751433126, 0.607218688, 0.614808922, 0.622399155, 0.75902336,
0.751433126, 0.75902336, 0.75902336, 0.683121024, 0.75902336,
0.75902336, 0.65276009, 0.75902336, 0.614808922, 0.531316352,
0.599628454, 0.75902336, 0.65276009, 0.75902336), TimeStep = c(7,
8, 9, 10, 11, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 6, 7,
8), size_class = c(3, 3, 3, 1, 5, 2, 3, 3, 3, 3, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3), `Cover Code` = c(2, 1, 1, 1, 1, 1,
1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), `Max Diameter (cm)` = c(22,
24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24, 25, 28, 23,
23, 21, 24, 22), `Max Orthogonal (cm)` = c(17, 19, 20, 8,
30, 12, 19, 20, 21, 26, 19, 19, 22, 24, 24, 20, 16, 18, 12,
19), `Height (cm)` = c(2, 2, 3, 1, 3, 1, 2, 1, 1, 3, 1, 1,
1, 2, 2, 2, 2, 1, 1, 1), `Status Code` = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "B", NA, NA, "PB", NA, NA,
NA, NA), area_mm_squared = c(374, 456, 600, 64, 1380, 168,
418, 520, 609, 780, 380, 380, 528, 600, 672, 460, 368, 378,
288, 418), area_cm_squared = c(3.74, 4.56, 6, 0.64, 13.8,
1.68, 4.18, 5.2, 6.09, 7.8, 3.8, 3.8, 5.28, 6, 6.72, 4.6,
3.68, 3.78, 2.88, 4.18), Volume_mm_cubed = c(391.651884147528,
477.522083345649, 942.477796076938, 33.5103216382911, 2167.69893097696,
87.9645943005142, 437.728576400178, 272.271363311115, 318.871654339364,
1225.22113490002, 198.967534727354, 198.967534727354, 276.460153515902,
628.318530717959, 703.716754404114, 481.710873550435, 385.368698840348,
197.920337176157, 150.79644737231, 218.864288200089), Volume_cm_cubed = c(0.391651884147528,
0.477522083345649, 0.942477796076938, 0.0335103216382911,
2.16769893097696, 0.0879645943005142, 0.437728576400178,
0.272271363311115, 0.318871654339364, 1.22522113490002, 0.198967534727354,
0.198967534727354, 0.276460153515902, 0.628318530717959,
0.703716754404114, 0.481710873550435, 0.385368698840348,
0.197920337176157, 0.15079644737231, 0.218864288200089),
MD = c(22, 24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24,
25, 28, 23, 23, 21, 24, 22)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Code
data_new <- data %>% group_by(ID, TimeStep) %>%
mutate(Diff = `Max Diameter (cm)` - dplyr::lag(`Max Diameter (cm)`))
Output
data_output <- structure(list(`Taxonomic Code` = c("PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR"), ID = structure(c(35L, 35L, 35L, 35L,
35L, 38L, 38L, 38L, 38L, 38L, 55L, 55L, 55L, 55L, 55L, 55L, 55L,
61L, 61L, 61L), .Label = c("H1051", "H108", "H110", "H1101",
"H112", "H113", "H116", "H118", "H1188", "H1211", "H122", "H125",
"H1253", "H1289", "H171", "H172", "H174", "H186", "H187", "H188",
"H189", "H191", "H192", "H236", "H237", "H244", "H252", "H254",
"H258", "H274", "H277", "H288", "H292", "H293", "H30", "H332",
"H366", "H37", "H374", "H396", "H466", "H479", "H484", "H499",
"H531", "H560", "H580", "H593", "H597", "H625", "H644", "H647",
"H649", "H653", "H66", "H693", "H695", "H712", "H728", "H737",
"H76", "H760", "H774", "H854", "H926", "H96", "H963", "H98",
"H985", "H991", "H996", "W1038", "W1101", "W1152", "W1154", "W1192",
"W1208", "W1209", "W1214", "W1227", "W1243", "W1245", "W1315",
"W1345", "W1361", "W1377", "W1399", "W1438", "W1494", "W1495",
"W1537", "W1557", "W1614", "W1636", "W1655", "W1669", "W1690",
"W1697", "W1729", "W1741", "W1758", "W1782", "W1785", "W1847",
"W1919", "W2000", "W2004", "W2011", "W2036", "W2044", "W2046",
"W2131", "W2133", "W234", "W249", "W251", "W254", "W307", "W355",
"W359", "W369", "W433", "W450", "W461", "W470", "W480", "W538",
"W542", "W544", "W584", "W601", "W606", "W781", "W79", "W807",
"W872", "W874", "W887", "W890", "W891", "W923", "W952"), class = "factor"),
Date = structure(c(17862, 17953, 18044, 18135, 18226, 17783,
17862, 17967, 18037, 18142, 17687, 17783, 17869, 17960, 18044,
18142, 18233, 17783, 17869, 17960), class = "Date"), Year = c("18",
"19", "19", "19", "19", "18", "18", "19", "19", "19", "18",
"18", "18", "19", "19", "19", "19", "18", "18", "19"), Site_long = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("Hanauma Bay", "Waikiki"), class = "factor"),
Shelter = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("High",
"Low"), class = "factor"), `Module #` = c(216, 216, 216,
216, 216, 215, 215, 215, 215, 215, 213, 213, 213, 213, 213,
213, 213, 213, 213, 213), Side = c("S", "S", "S", "S", "S",
"S", "S", "S", "S", "S", "N", "N", "N", "N", "N", "N", "N",
"N", "N", "N"), Location = c("D3", "D3", "D3", "D3", "D3",
"C1", "C1", "C1", "C1", "C1", "A1", "A1", "A1", "A1", "A1",
"A1", "A1", "A4", "A4", "A4"), Settlement_Area = c(0.75902336,
0.751433126, 0.607218688, 0.614808922, 0.622399155, 0.75902336,
0.751433126, 0.75902336, 0.75902336, 0.683121024, 0.75902336,
0.75902336, 0.65276009, 0.75902336, 0.614808922, 0.531316352,
0.599628454, 0.75902336, 0.65276009, 0.75902336), TimeStep = c(7,
8, 9, 10, 11, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 6, 7,
8), size_class = c(3, 3, 3, 1, 5, 2, 3, 3, 3, 3, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3), `Cover Code` = c(2, 1, 1, 1, 1, 1,
1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), `Max Diameter (cm)` = c(22,
24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24, 25, 28, 23,
23, 21, 24, 22), `Max Orthogonal (cm)` = c(17, 19, 20, 8,
30, 12, 19, 20, 21, 26, 19, 19, 22, 24, 24, 20, 16, 18, 12,
19), `Height (cm)` = c(2, 2, 3, 1, 3, 1, 2, 1, 1, 3, 1, 1,
1, 2, 2, 2, 2, 1, 1, 1), `Status Code` = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "B", NA, NA, "PB", NA, NA,
NA, NA), area_mm_squared = c(374, 456, 600, 64, 1380, 168,
418, 520, 609, 780, 380, 380, 528, 600, 672, 460, 368, 378,
288, 418), area_cm_squared = c(3.74, 4.56, 6, 0.64, 13.8,
1.68, 4.18, 5.2, 6.09, 7.8, 3.8, 3.8, 5.28, 6, 6.72, 4.6,
3.68, 3.78, 2.88, 4.18), Volume_mm_cubed = c(391.651884147528,
477.522083345649, 942.477796076938, 33.5103216382911, 2167.69893097696,
87.9645943005142, 437.728576400178, 272.271363311115, 318.871654339364,
1225.22113490002, 198.967534727354, 198.967534727354, 276.460153515902,
628.318530717959, 703.716754404114, 481.710873550435, 385.368698840348,
197.920337176157, 150.79644737231, 218.864288200089), Volume_cm_cubed = c(0.391651884147528,
0.477522083345649, 0.942477796076938, 0.0335103216382911,
2.16769893097696, 0.0879645943005142, 0.437728576400178,
0.272271363311115, 0.318871654339364, 1.22522113490002, 0.198967534727354,
0.198967534727354, 0.276460153515902, 0.628318530717959,
0.703716754404114, 0.481710873550435, 0.385368698840348,
0.197920337176157, 0.15079644737231, 0.218864288200089),
MD = c(22, 24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24,
25, 28, 23, 23, 21, 24, 22), Diff = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), groups = structure(list(ID = structure(c(35L,
35L, 35L, 35L, 35L, 38L, 38L, 38L, 38L, 38L, 55L, 55L, 55L, 55L,
55L, 55L, 55L, 61L, 61L, 61L), .Label = c("H1051", "H108", "H110",
"H1101", "H112", "H113", "H116", "H118", "H1188", "H1211", "H122",
"H125", "H1253", "H1289", "H171", "H172", "H174", "H186", "H187",
"H188", "H189", "H191", "H192", "H236", "H237", "H244", "H252",
"H254", "H258", "H274", "H277", "H288", "H292", "H293", "H30",
"H332", "H366", "H37", "H374", "H396", "H466", "H479", "H484",
"H499", "H531", "H560", "H580", "H593", "H597", "H625", "H644",
"H647", "H649", "H653", "H66", "H693", "H695", "H712", "H728",
"H737", "H76", "H760", "H774", "H854", "H926", "H96", "H963",
"H98", "H985", "H991", "H996", "W1038", "W1101", "W1152", "W1154",
"W1192", "W1208", "W1209", "W1214", "W1227", "W1243", "W1245",
"W1315", "W1345", "W1361", "W1377", "W1399", "W1438", "W1494",
"W1495", "W1537", "W1557", "W1614", "W1636", "W1655", "W1669",
"W1690", "W1697", "W1729", "W1741", "W1758", "W1782", "W1785",
"W1847", "W1919", "W2000", "W2004", "W2011", "W2036", "W2044",
"W2046", "W2131", "W2133", "W234", "W249", "W251", "W254", "W307",
"W355", "W359", "W369", "W433", "W450", "W461", "W470", "W480",
"W538", "W542", "W544", "W584", "W601", "W606", "W781", "W79",
"W807", "W872", "W874", "W887", "W890", "W891", "W923", "W952"
), class = "factor"), TimeStep = c(7, 8, 9, 10, 11, 6, 7, 8,
9, 10, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8), .rows = list(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))

The issue is with the grouping. When 'TimeStep' is included in group_by(), each group contains only a single row, and the lag of a single element is NA. Group by ID alone:
library(dplyr)
# Group by colony only, so lag() looks at the previous observation of the same ID
data %>%
  group_by(ID) %>%
  mutate(Diff = `Max Diameter (cm)` - dplyr::lag(`Max Diameter (cm)`))
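Since the question title asks about case_when(), here is a minimal sketch of grouping by ID, taking the lagged difference, and labelling each step. The toy data frame, the column name diam, and the labels "shrank", "grew", and "no change" are made up for illustration; the arrange() call assumes the rows should be ordered by TimeStep within each colony.

```r
library(dplyr)

# Hypothetical toy data: two colonies with diameters in cm
toy <- data.frame(
  ID = c("A", "A", "A", "B", "B"),
  TimeStep = c(1, 2, 3, 1, 2),
  diam = c(22, 24, 8, 14, 22),
  stringsAsFactors = FALSE
)

toy_diff <- toy %>%
  arrange(ID, TimeStep) %>%      # make sure lag() sees the rows in time order
  group_by(ID) %>%               # one group per colony, not per ID/TimeStep pair
  mutate(
    Diff = diam - lag(diam),     # NA only on the first row of each colony
    Change = case_when(
      is.na(Diff) ~ NA_character_,
      Diff < 0 ~ "shrank",
      Diff > 0 ~ "grew",
      TRUE ~ "no change"
    )
  ) %>%
  ungroup()

print(toy_diff)
```

With this grouping, Diff is NA only on the first observation of each colony, which is the behaviour the question asks for.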

Adding a row with Sum and mean of the columns

I have a data frame like the one below.
`> am_me
Group.1 Group.2 x.x x.y
2 AM clearterminate 3 21.00000
3 AM display.cryptic 86 30.12791
4 AM price 71 898.00000`
I would like to get the result below.
`> am_me_t
Group.2 x.x x.y
2 clearterminate 3 21
3 display.cryptic 86 30.1279069767442
4 price 71 898
41 AM 160 316.375968992248`
I removed the first column and got the result below.
`> am_res
Group.2 x.x x.y
2 clearterminate 3 21.00000
3 display.cryptic 86 30.12791
4 price 71 898.00000`
When I use rbind to add "AM" as a new row, as shown below, I get a warning message and an NA.
`> am_me_t <- rbind(am_res, c("AM", colSums(am_res[2]), colMeans(am_res[3])))
Warning message:
invalid factor level, NAs generated in: "[<-.factor"(`*tmp*`, ri, value = "AM")
Group.2 x.x x.y
2 clearterminate 3 21
3 display.cryptic 86 30.1279069767442
4 price 71 898
41 <NA> 160 316.375968992248`
For your information, here is the output of edit(am_me):
`> edit(am_me)
structure(list(Group.1 = structure(as.integer(c(2, 2, 2)), .Label = c("1Y",
"AM", "BE", "CM", "CO", "LX", "SN", "US", "VK", "VS"), class = "factor"),
Group.2 = structure(as.integer(c(2, 5, 9)), .Label = c("bestbuy",
"clearterminate", "currency.display", "display", "display.cryptic",
"fqa", "mileage.display", "ping", "price", "reissue", "reissuedisplay",
"shortaccess.followon"), class = "factor"), x.x = as.integer(c(3,
86, 71)), x.y = c(21, 30.1279069767442, 898)), .Names = c("Group.1",
"Group.2", "x.x", "x.y"), row.names = c("2", "3", "4"), class = "data.frame")`
Also
`> edit(me)
structure(list(Group.1 = structure(as.integer(c(1, 2, 2, 2, 3,
4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8,
8, 8, 9, 9, 10, 10, 10, 10, 10, 10)), .Label = c("1Y", "AM",
"BE", "CM", "CO", "LX", "SN", "US", "VK", "VS"), class = "factor"),
Group.2 = structure(as.integer(c(8, 2, 5, 9, 10, 1, 2, 5,
9, 1, 2, 5, 9, 1, 2, 3, 4, 7, 9, 11, 12, 2, 4, 6, 1, 2, 5,
9, 2, 5, 1, 2, 3, 5, 9, 10)), .Label = c("bestbuy", "clearterminate",
"currency.display", "display", "display.cryptic", "fqa",
"mileage.display", "ping", "price", "reissue", "reissuedisplay",
"shortaccess.followon"), class = "factor"), x.x = as.integer(c(1,
3, 86, 71, 1, 2, 5, 1, 52, 10, 7, 27, 15, 5, 267, 14, 4,
1, 256, 1, 1, 80, 1, 78, 2, 10, 23, 6, 1, 2, 4, 3, 3, 11,
1, 1)), x.y = c(5, 21, 30.1279069767442, 898, 12280, 800,
56.4, 104, 490.442307692308, 1759.1, 18.1428571428571, 1244.81481481481,
518.533333333333, 3033.2, 18.5468164794007, 20, 3788.5, 23,
2053.49609375, 3863, 6376, 17.825, 240, 1752.21794871795,
1114.5, 34, 1369.60869565217, 1062.16666666667, 23, 245,
5681.5, 11.3333333333333, 13.3333333333333, 1273.81818181818,
2076, 5724)), .Names = c("Group.1", "Group.2", "x.x", "x.y"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31",
"32", "33", "34", "35", "36"), class = "data.frame")
Group.1 Group.2 x.x x.y
1 1Y ping 1 5.00000
2 AM clearterminate 3 21.00000
3 AM display.cryptic 86 30.12791
4 AM price 71 898.00000
5 BE reissue 1 12280.00000
6 CM bestbuy 2 800.00000
7 CM clearterminate 5 56.40000
8 CM display.cryptic 1 104.00000
9 CM price 52 490.44231
10 CO bestbuy 10 1759.10000
11 CO clearterminate 7 18.14286
12 CO display.cryptic 27 1244.81481
13 CO price 15 518.53333
14 LX bestbuy 5 3033.20000
15 LX clearterminate 267 18.54682
16 LX currency.display 14 20.00000
17 LX display 4 3788.50000
18 LX mileage.display 1 23.00000
19 LX price 256 2053.49609
20 LX reissuedisplay 1 3863.00000
21 LX shortaccess.followon 1 6376.00000
22 SN clearterminate 80 17.82500
23 SN display 1 240.00000
24 SN fqa 78 1752.21795
25 US bestbuy 2 1114.50000
26 US clearterminate 10 34.00000
27 US display.cryptic 23 1369.60870
28 US price 6 1062.16667
29 VK clearterminate 1 23.00000
30 VK display.cryptic 2 245.00000
31 VS bestbuy 4 5681.50000
32 VS clearterminate 3 11.33333
33 VS currency.display 3 13.33333
34 VS display.cryptic 11 1273.81818
35 VS price 1 2076.00000
36 VS reissue 1 5724.00000`

The Group.2 column is a factor, which restricts it to its existing levels. "AM" is not one of those levels, so rbind inserts NA and issues the warning. You can convert the column to character with am_me$Group.2 <- as.character(am_me$Group.2); after that the "AM" value will be added without errors.
Note that for a single column you can also use sum() and mean() instead of colSums() and colMeans().
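A minimal sketch of that fix, rebuilding am_res from the values in the question. Building the total row as a one-row data frame instead of a c() vector is a design choice of this sketch, not something stated in the answer: c("AM", ...) coerces the numbers to character, whereas a data frame row keeps x.x and x.y numeric.

```r
# am_res as in the question, with Group.2 stored as a factor
am_res <- data.frame(
  Group.2 = factor(c("clearterminate", "display.cryptic", "price")),
  x.x = c(3L, 86L, 71L),
  x.y = c(21, 30.1279069767442, 898)
)

# Convert the factor to character so a new value such as "AM" is allowed
am_res$Group.2 <- as.character(am_res$Group.2)

# One-row data frame for the totals: sum of x.x and mean of x.y
total <- data.frame(
  Group.2 = "AM",
  x.x = sum(am_res$x.x),
  x.y = mean(am_res$x.y),
  stringsAsFactors = FALSE
)

am_me_t <- rbind(am_res, total)
print(am_me_t)
```

The last row then holds "AM", the sum of x.x, and the mean of x.y, with both numeric columns still numeric.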
