tidyr::gather can't find my value variables - r

I have an extremely large data.frame. I reproduce part of it.
RECORDING_SESSION_LABEL condition TRIAL_INDEX IA_LABEL IA_DWELL_TIME
1 23 match 1 eyes 3580
2 23 match 1 nose 2410
3 23 match 1 mouth 1442
4 23 match 1 face 841
5 23 mismatch 3 eyes 1817
6 23 mismatch 3 nose 1724
7 23 mismatch 3 mouth 1600
8 23 mismatch 3 face 1136
9 23 mismatch 4 eyes 4812
10 23 mismatch 4 nose 3710
11 23 mismatch 4 mouth 4684
12 23 mismatch 4 face 1557
13 23 mismatch 6 eyes 4645
14 23 mismatch 6 nose 2321
15 23 mismatch 6 mouth 674
16 23 mismatch 6 face 684
17 23 match 7 eyes 1062
18 23 match 7 nose 1359
19 23 match 7 mouth 215
20 23 match 7 face 0
I need to calculate the percentage of IA_DWELL_TIME for each IA_LABEL in each trial index. For that, I first put IA_LABEL into different columns:
data_IA_DWELL_TIME <- tidyr::spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)
For calculating the percentage, I create a new dataframe:
data_IA_DWELL_TIME_percentage <-data_IA_DWELL_TIME
data_IA_DWELL_TIME_percentage$eyes <- 100*(data_IA_DWELL_TIME$eyes/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$nose <- 100*(data_IA_DWELL_TIME$nose/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$mouth <- 100*(data_IA_DWELL_TIME$mouth/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$face <- 100*(data_IA_DWELL_TIME$face/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
So all is fine, and I get the desired output. The problem arises when I want to put the columns back into rows:
data_IA_DWELL_TIME_percentage <- tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,-RECORDING_SESSION_LABEL,-condition, -TRIAL_INDEX)
I obtain this error:
Error in tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,
-RECORDING_SESSION_LABEL, : object 'RECORDING_SESSION_LABEL' not found
Any idea of what is going on here? Thanks!

As explained, you're not referring to your data frame in the gather statement.
However, you could avoid the need for referring to it altogether and put the second part in a dplyr pipeline, like below:
library(dplyr)
library(tidyr)
data_IA_DWELL_TIME <- spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)
data_IA_DWELL_TIME %>%
  mutate_at(
    vars(eyes, nose, mouth, face),
    ~ 100 * . / rowSums(data_IA_DWELL_TIME[, c("eyes", "nose", "mouth", "face")])
  ) %>%
  gather(key = IA_LABEL, value = IA_DWELL_TIME,
         -RECORDING_SESSION_LABEL, -condition, -TRIAL_INDEX)
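If the end goal is just the percentage per trial, the spread/gather round-trip can be avoided entirely: divide each dwell time by its per-group total computed with base R's ave(). A minimal sketch on toy data standing in for the real frame (in the real data the grouping should also include RECORDING_SESSION_LABEL and condition, which ave() accepts as extra grouping arguments):

```r
# Toy data standing in for the real data frame
d <- data.frame(
  TRIAL_INDEX   = c(1, 1, 1, 1, 3, 3, 3, 3),
  IA_LABEL      = rep(c("eyes", "nose", "mouth", "face"), 2),
  IA_DWELL_TIME = c(3580, 2410, 1442, 841, 1817, 1724, 1600, 1136)
)

# Percentage of dwell time within each trial: divide each value by
# the per-trial total, computed group-wise with ave()
d$pct <- 100 * d$IA_DWELL_TIME /
  ave(d$IA_DWELL_TIME, d$TRIAL_INDEX, FUN = sum)
```

Because the data never leaves long format, there is nothing to gather back afterwards.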


Eliminate cases based on multiple rows values

I have a base with the following information:
edit: *each row is an individual living in a house. Multiple individuals, each with a unique P_ID and AGE, can live in the same house, sharing an H_ID. I'm looking for all the houses, with all of their individuals, where at least one person in the house is over 60. I hope that explains it better.*
show(base)
H_ID P_ID AGE CONACT
1 10010000001 1001000000102 35 33
2 10010000001 1001000000103 12 31
3 10010000001 1001000000104 5 NA
4 10010000001 1001000000101 37 10
5 10010000002 1001000000206 5 NA
6 10010000002 1001000000205 10 NA
7 10010000002 1001000000204 18 31
8 10010000002 1001000000207 3 NA
9 10010000002 1001000000203 24 35
10 10010000002 1001000000202 43 33
11 10010000002 1001000000201 47 10
12 10010000003 1001000000302 26 33
13 10010000003 1001000000301 29 10
14 10010000004 1001000000401 56 32
15 10010000004 1001000000403 22 31
16 10010000004 1001000000402 49 10
17 10010000005 1001000000503 1 NA
18 10010000005 1001000000501 24 10
19 10010000005 1001000000502 23 10
20 10010000006 1001000000601 44 10
21 10010000007 1001000000701 69 32
I want a list of all the houses and all the individuals living there, based on the condition that there's at least one person aged 60+. Here's a link to the data: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
And here's how I made the base:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
base<-data.frame(datos$ID_VIV, datos$ID_PERSONA, datos$EDAD, datos$CONACT)
base
Any help is much much appreciated, Thanks!
This can be done by:
Adding a variable with the maximum age per household
base$maxage <- ave(base$AGE, base$H_ID, FUN=max)
Then only keeping households with a maximum age of at least 60.
base <- subset(base, maxage >= 60)
Or you could combine the two lines into one. With the column names in your linked data:
> base <- subset(base, ave(base$datos.EDAD, base$datos.ID_VIV, FUN=max) >= 60)
> head(base)
datos.ID_VIV datos.ID_PERSONA datos.EDAD datos.CONACT
21 10010000007 1001000000701 69 32
22 10010000008 1001000000803 83 33
23 10010000008 1001000000802 47 33
24 10010000008 1001000000801 47 10
36 10010000012 1001000001204 4 NA
37 10010000012 1001000001203 2 NA
Using dplyr, we can group_by H_ID and select houses where any AGE is greater than 60.
library(dplyr)
df %>% group_by(H_ID) %>% filter(any(AGE > 60))
Similarly with data.table
library(data.table)
setDT(df)[, .SD[any(AGE > 60)], H_ID]
To get a list of the houses with a tenant Age > 60 we can filter and create a list of distinct H_IDs
house_list <- base %>%
  filter(AGE > 60) %>%
  distinct(H_ID) %>%
  pull(H_ID)
Then we can filter the original dataframe based on that house_list to remove any households that do not have someone over the age of 60.
house_df <- base %>%
  filter(H_ID %in% house_list)
To then calculate the CON values we can filter out NA values in CONACT, group_by(H_ID) and summarize to find the number of individuals within each house that have a non-NA CONACT value.
CON_calcs <- house_df %>%
  filter(!is.na(CONACT)) %>%
  group_by(H_ID) %>%
  summarize(Count = n())
And join that back into the house_df based on H_ID to include the newly calculated CON values, and I believe that should end with your desired result.
final_df <- left_join(house_df, CON_calcs, by = 'H_ID')
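For completeness, the whole pipeline can also be sketched in base R on a toy frame (column names assumed to match the question's base; ave() does the per-household maximum and aggregate() the CONACT counts):

```r
# Toy data: three houses, only house 3 has someone aged 60+
base <- data.frame(
  H_ID   = c(1, 1, 2, 2, 3, 3),
  P_ID   = 1:6,
  AGE    = c(35, 12, 24, 43, 69, 30),
  CONACT = c(33, 31, 35, 33, 32, NA)
)

# Keep every row of any household whose oldest member is 60+
kept <- subset(base, ave(AGE, H_ID, FUN = max) >= 60)

# Count non-NA CONACT values per kept household
# (the formula interface drops NA rows by default)
con_counts <- aggregate(CONACT ~ H_ID, data = kept, FUN = length)
```

Here kept contains both members of house 3, and con_counts reports one non-NA CONACT value for that house.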

writing out .dat file in r

I have a dataset looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in the number of digits, but scores always have 1 digit. I am trying to write this out as a .dat file by creating some objects and using them in the write.fwf function from the gdata package.
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          colnames = FALSE, sep = "")
I would like to separate the student ids and the question responses with some spaces, so I allotted 5 characters for the student ids by specifying width = c(5, rep(1, item.count)) in the write.fwf() function. However, the output file looks like this, with the padding on the left side of the student ids
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than at the right side of the ids.
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to combine the 'scores' columns into a single one and then use write.csv
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')
with #akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
data <- data %>%
  unite(scores, starts_with('scores'), sep = '')
write.fwf(data, file = "data.dat",
          width = c(5, item.count),
          colnames = FALSE, sep = " ")
in the .dat file, the dataset looks like this below:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
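The same left-justified layout can also be produced in base R without gdata, using sprintf's "-" flag to pad the ids on the right instead of the left (a sketch, assuming a 5-character id field):

```r
ids    <- c(111, 12, 134)
scores <- c("00", "11", "00")

# "%-5s" left-justifies the id in a 5-character field,
# so the padding ends up on the right of the id
lines <- sprintf("%-5s%s", ids, scores)

# writeLines(lines, "data.dat")  # write the fixed-width file
```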

How to change column names for mrset in R?

I am trying to create crosstabs. I have a dataframe with multiple-select questions. I am importing the data frame from an SPSS file using the foreign and expss packages, and creating the multiple-select questions with the mrset function. Here's a demo code to make it clear.
Banner1 = w %>%
  tab_cells(mrset(as.category(temp1, counted_value = "Checked"))) %>%
  tab_cols(total(), mrset(as.category(temp2, counted_value = "Checked"))) %>%
  tab_stat_cases(total_row_position = "none", label = "")
tab_pivot(Banner1)
The datatable imported looks like this
Total Q12_1 Q12_2 Q12_3 Q12_4 Q12_5
A B C D E F
Total Cases 803 34 18 14 38 37
Q13_1 64 11 7 8 9 7
Q13_2 12 54 54 43 13 12
Q13_3 67 54 23 21 6 4
Sorry about the alignment here. So this is the imported dataset.
Coming to the problem: as you can see, this dataset has question numbers as column labels rather than variable labels. For single-select questions everything works fine. Is there any function with which I can change the column names for mrset variables dynamically?
The desired output should be something like this. For eg,
Total Apple Mango Banana Orange Grapes
A B C D E F
Total Cases 803 34 18 14 38 37
Apple 64 11 7 8 9 7
Mango 12 54 54 43 13 12
banana 67 54 23 21 6 4
Any help would be greatly appreciated.
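In expss itself, the variable labels set with var_lab() are what the tables display, so assigning those before tabulating is one route. Independently of expss, a generic way to relabel question columns in a result table is base R's match-by-name against a lookup vector (the codes and labels below are hypothetical, standing in for the real questionnaire):

```r
# Hypothetical lookup from question codes to their labels
labels <- c(Q13_1 = "Apple", Q13_2 = "Mango", Q13_3 = "Banana")

# Stand-in for a table whose columns carry question codes
tab <- data.frame(Q13_1 = 64, Q13_2 = 12, Q13_3 = 67)

# Replace any column name found in the lookup with its label,
# leaving unmatched columns untouched
hit <- names(tab) %in% names(labels)
names(tab)[hit] <- labels[names(tab)[hit]]
```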

groups of different size randomly selected within different classes

I have such a difficult question (at least to me) that I spent two hours just writing it. It's completely impossible for me to program it by myself. I'll try to be very clear, and I'm sorry if I'm not. I'm doing this in a very rustic way in Excel, but I really need to program it.
I have a data.frame like this:
id_pix id_lote clase f1 f2
45 4 Sg 2460 2401
46 4 Sg 2620 2422
47 4 Sg 2904 2627
48 5 M 2134 2044
49 5 M 2180 2104
50 5 M 2127 2069
83 11 S 2124 2062
84 11 S 2189 2336
85 11 S 2235 2162
86 11 S 2162 2153
87 11 S 2108 2124
with 17451 "id_pixel"(rows), 2080 "id_lote" and 9 "clase"
this is the "id_lote" count per "clase" (v1 is the id_lote count)
clase v1
1: S 1099
2: P 213
3: Sg 114
4: M 302
5: Alg 27
6: Az 77
7: Po 228
8: Cit 13
9: Ma 7
I need to split the id_lote randomly within each clase. For example, I have 1099 id_lote for the "S" clase, covering 9339 id_pixel (rows), and I want to randomly select 50% of those id_lote along with all of their id_pixel rows. I need to do this for every clase, keeping in mind that the number of id_lote differs per clase. I would also like to be able to change the selection size (50%, 30%, etc.), and I want to keep the non-selected set of id_lote as well. I hope someone can help me with this!
Here is the reproducible example.
This is the data with 2 clase (S and Az), 6 id_lote and 13 id_pixel:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932
and one result could be:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
75 11 S 2728 2225
1155 126 Az 7209 2449
1156 126 Az 7035 2932
where 50% of the id_lote were randomly selected within clase "S" (2 of 4 id_lote) and all the id_pixel in the selected id_lote were kept. The same for clase "Az": one id_lote was randomly selected (1 of 2 in this case) and all of its id_pixel were kept.
What colemand77 proposed helped a lot. I think the dplyr package is useful for this, but if I do
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
I get 30% of the rows of each clase, but not grouped by id_lote like I need! I mean 30% of the rows (id_pixel) were selected instead of 30% of the id_lote.
I hope this example helps to explain what I want to do and makes it useful for everybody. I'm sorry if I wasn't clear enough the first time.
Thanks a lot!
First glimpse I'd say the dplyr package is your friend here.
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
so you first use group_by() and include the grouping levels you want to sample from, then you use sample_frac to sample the fraction of the results you want for each group.
As near as I can tell this is what you are asking for. If not, please consider re-stating your question to include either a reproducible example or clarify. Cheers.
To "keep" the not-selected members, I would add a column of unique ids and use anti_join() (also from the dplyr package) to find the ids that are not in common between the two data.frames (the results of the sampling and the original).
## Update ##
I'm understanding better now, I believe. Think about this as a two-step process:
1) you want to select x% (50% in the example) of the id_lote from each clase and return those id_lote numbers (I'm assuming that a given id_lote does not exist in multiple clase?)
2) you want to see all of the id_pixel that correspond to each selected id_lote, all in one data.frame
I've broken this down into multiple steps for illustration, not because it is the fastest / prettiest.
raw data: (couldn't read your data into R.)
df <- data.frame(id_pix = 1:200,
                 id_lote = sample(1:20, 200, replace = TRUE),
                 clase = sample(letters[1:10], 200, replace = TRUE),
                 f1 = sample(1000:2000, 200, replace = TRUE),
                 f2 = sample(2000:3000, 200, replace = TRUE))
1) figure out which id_lote correspond to which clase. For this we use the dplyr summarise function and store the result in a variable:
summary <- df %>%
  ungroup() %>%
  group_by(clase, id_lote) %>%
  summarise()
returns:
Source: local data frame [125 x 2]
Groups: clase
clase id_lote
1 a 1
2 a 2
3 a 4
4 a 5
5 a 6
6 a 7
7 a 8
8 a 9
9 a 11
10 a 12
.. ... ...
then we sample to get the 30% of the id_lote for each clase..
sampled_summary <- summary %>%
  group_by(clase) %>%
  sample_frac(.3, replace = FALSE)
so the result of this is a data table with two columns, (clase and id_lote) with 30% of the id_lotes shown for each clase.
2) OK, so now we have the id_lote randomly selected from each clase, but not the id_pix associated with them. To accomplish this we do a join to get the corresponding full data set, including the id_pix, etc.
result <- sampled_summary %>%
  left_join(df)
The above copies the data set around a fair amount, so if you have a substantial data set you could do it all in one go:
result <- df %>%
  ungroup() %>%
  group_by(clase, id_lote) %>%
  summarise() %>%
  group_by(clase) %>%
  sample_frac(.5, replace = FALSE) %>%
  left_join(df)
if this doesn't get you what you want, let me know and we'll take another crack at it.
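The same two-step idea can be sketched in base R: draw a fraction of the id_lote within each clase, then keep every row whose id_lote was drawn (this assumes, as in the question, that a given id_lote belongs to only one clase):

```r
set.seed(1)

# Toy data: clase "S" has 4 lotes, "Az" has 2, each lote has 2 pixels
df <- data.frame(
  id_pix  = 1:12,
  id_lote = c(1, 1, 2, 2, 9, 9, 11, 11, 21, 21, 126, 126),
  clase   = rep(c("S", "Az"), c(8, 4))
)

# Unique (clase, id_lote) pairs, then sample 50% of the lotes per clase
lotes  <- unique(df[, c("clase", "id_lote")])
picked <- unlist(lapply(split(lotes$id_lote, lotes$clase),
                        function(x) sample(x, ceiling(0.5 * length(x)))))

# Keep all pixels of the selected lotes; the rest is the complement
selected  <- df[df$id_lote %in% picked, ]
leftovers <- df[!df$id_lote %in% picked, ]
```

Changing the 0.5 to 0.3 (or any fraction) changes the selection size, and leftovers holds the non-selected set the question asked to keep.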

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row across different list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end and $number columns from the first table, and 3) pull the correct p-values from all the tables using which() statements. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1) Convert your list to a data.frame.
2) Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
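The same key-wise sum can be done in base R with aggregate(), without plyr. A sketch on two tiny stand-in tables whose rows are deliberately in different orders:

```r
# Two stand-in tables with the same key rows in different orders
X <- list(
  data.frame(cluster_size = c(13, 12), start = c(12, 12),
             end = c(13, 12), number = c(131, 100),
             p_value = c(1e-10, 2e-10)),
  data.frame(cluster_size = c(12, 13), start = c(12, 12),
             end = c(12, 13), number = c(100, 131),
             p_value = c(3e-10, 4e-10))
)

# Stack all tables, then sum p_value within each key combination;
# row order no longer matters because rows are matched by key
XDF <- do.call(rbind, X)
out <- aggregate(p_value ~ cluster_size + start + end + number,
                 data = XDF, FUN = sum)
```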
