R: "Adding" 2 variables (columns) to create an aggregate variable (column)? - r

This may be a strange request - I hope I am wording it correctly:
I have a dataset (df) and three variables (BELONG_1, GRPOR_14, ETHNIC10) that I want to "add" so as to get an aggregate variable (PsychIntegration) that can be run in a regression analysis - e.g. controlling for other variables such as gender, age etc.
BELONG_1: 1=Do not belong - 10=Do belong
GRPOR_14: 1=Not proud - 10=Very proud
ETHNIC10: 1=Not important - 4=Very important
The Cronbach Alpha for these three variables is 0.62
ID BELONG_1 GRPOR_14 ETHNIC10 PsychIntegration
1 10 8 4 ??
2 3 4 2 ??
3 7 10 3 ??
4 1 1 1 ??
How exactly do I "add" (?) these variables to get PsychIntegration?
I hope that makes sense - thanks again!

Try the dplyr package (part of the tidyverse). Assuming your data are stored in a data.frame named df, use
library(dplyr)
df <- df %>%
  mutate(PsychIntegration = rowSums(select(., -ID)))
if you want to sum every column except ID. Use
df <- df %>%
  mutate(PsychIntegration = BELONG_1 + GRPOR_14 + ETHNIC10)
if you just want to sum those three columns.
In base R, one possibility is
df$PsychIntegration <- df$BELONG_1 + df$GRPOR_14 + df$ETHNIC10
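As a quick check, the sum can be worked through on the four sample rows from the question. This is only a minimal sketch in base R: the data frame is rebuilt by hand from the table in the question, and the helper vector items is a name introduced here for illustration.

```r
# Sample rows copied from the question
df <- data.frame(
  ID       = 1:4,
  BELONG_1 = c(10, 3, 7, 1),
  GRPOR_14 = c(8, 4, 10, 1),
  ETHNIC10 = c(4, 2, 3, 1)
)

# Names of the columns to aggregate (illustrative helper, not from the question)
items <- c("BELONG_1", "GRPOR_14", "ETHNIC10")

# Row-wise sum of the three items; unname() drops the row-name labels
df$PsychIntegration <- unname(rowSums(df[, items]))

df$PsychIntegration
# [1] 22  9 20  3
```

With the default na.rm = FALSE, rowSums returns NA for any respondent who is missing one of the items, which matches the behaviour of adding the columns with +.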

Related

How to assign unambiguous values for each row in a data frame based on values found in rows from another data frame using R?

I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and assign a unique identifier to each row based on values found in a second data frame. Here is a toy example.
df1<-data.frame(c(99443975,558,99009680,99044573,599,99172478))
names(df1)<-"Building"
V1<-c(558,134917,599,120384)
V2<-c(4400796,14400095,99044573,4500481)
V3<-c(NA,99009680,99340705,99132792)
V4<-c(NA,99156365,NA,99132794)
V5<-c(NA,99172478,NA, 99181273)
V6<-c(NA, NA, NA,99443975)
row_number<-1:4
df2<-data.frame(cbind(V1, V2,V3,V4,V5,V6, row_number))
The output I expect is what follows.
row_number_assigned<-c(4,1,2,3,3,2)
output<-data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind argument of the which function:
sapply( df1$Building,   # will send Building entries one-by-one
        function(inp){ which(inp == df2,       # find matching values
                             arr.ind=TRUE)[1]})  # return only row; not column
[1] 4 1 2 3 3 2
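To see why taking element [1] works, it may help to look at what which returns for a single value. The sketch below rebuilds df2 from the question and looks up one Building entry; the variable name hit is introduced here for illustration.

```r
# df2 as defined in the question
df2 <- data.frame(
  V1 = c(558, 134917, 599, 120384),
  V2 = c(4400796, 14400095, 99044573, 4500481),
  V3 = c(NA, 99009680, 99340705, 99132792),
  V4 = c(NA, 99156365, NA, 99132794),
  V5 = c(NA, 99172478, NA, 99181273),
  V6 = c(NA, NA, NA, 99443975),
  row_number = 1:4
)

# With arr.ind = TRUE, which() returns one (row, column) pair per match;
# NA comparisons are ignored
hit <- which(df2 == 99172478, arr.ind = TRUE)

hit[1, 1]  # row index of the first match (2)
hit[1, 2]  # column index of the first match (5, i.e. V5)
```

Because the result is stored row index first, [1] picks out the row of the first match. If a value could occur in several rows, only the first one would be reported.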
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much safer way to build a data frame, which also takes fewer keystrokes, is:
df2<-data.frame( V1=c(558,134917,599,120384),
V2=c(4400796,14400095,99044573,4500481),
V3=c(NA,99009680,99340705,99132792),
V4=c(NA,99156365,NA,99132794),
V5=c(NA,99172478,NA, 99181273),
V6=c(NA, NA, NA,99443975) )
(It didn't cause coding errors this time, but if there had been any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, could you approach them gently and do their future students a favor by letting them know that cbind() coerces all of its arguments to the "lowest common denominator"?
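The coercion is easy to reproduce. In this small sketch (the vectors num and lab are made up for illustration), cbind builds a character matrix first, so the numeric column is no longer numeric by the time data.frame sees it:

```r
num <- c(1.5, 2.5)
lab <- c("a", "b")

# cbind() builds a matrix, and a matrix holds a single type: character here
risky <- data.frame(cbind(num, lab))

# Passing the vectors directly keeps each column's own type
safe <- data.frame(num = num, lab = lab)

class(risky$num)  # "character" in R >= 4.0.0 (a factor in older versions)
class(safe$num)   # "numeric"
```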
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2 %>%
              pivot_longer(-row_number) %>%
              select(-name),
            by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2
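The same reshape-then-look-up idea can be sketched in base R with match, for anyone who would rather avoid the extra packages. The names long and value_cols are introduced here for illustration, and the sketch assumes each Building value appears at most once in df2.

```r
df1 <- data.frame(Building = c(99443975, 558, 99009680, 99044573, 599, 99172478))
df2 <- data.frame(
  V1 = c(558, 134917, 599, 120384),
  V2 = c(4400796, 14400095, 99044573, 4500481),
  V3 = c(NA, 99009680, 99340705, 99132792),
  V4 = c(NA, 99156365, NA, 99132794),
  V5 = c(NA, 99172478, NA, 99181273),
  V6 = c(NA, NA, NA, 99443975),
  row_number = 1:4
)

value_cols <- paste0("V", 1:6)

# Stack V1..V6 into one long column; unlist() goes column by column,
# so row_number has to be repeated once per column
long <- data.frame(
  value      = unlist(df2[value_cols], use.names = FALSE),
  row_number = rep(df2$row_number, times = length(value_cols))
)

# match() gives the position of each Building in the long column
df1$row_number_assigned <- long$row_number[match(df1$Building, long$value)]

df1$row_number_assigned
# [1] 4 1 2 3 3 2
```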

R code to detect a change in a variable over time for multiple patients

I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1, 2, or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it, but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
What I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
  arrange(patient, period) %>%
  group_by(patient) %>%  # group first so lag() never compares across patients
  mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
  summarise(grade.increase = max(grade.increase))
Combining lag, which returns the previous value, with case_when identifies each row where the grade increased. Grouping by patient before the mutate keeps lag from comparing a patient's first row with the previous patient's last row.
Summarising the maximum of grade.increase for each patient gives the desired result, because max treats FALSE as 0 and TRUE as 1.
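Where group_by sits relative to lag matters, and a small made-up example shows the difference. In the sketch below (the data frame toy and the column up are illustrative names), patient 2 never increases but starts at a higher grade than patient 1 ended on:

```r
library(dplyr)

# Toy data: no patient ever increases, but patient 2 starts higher than patient 1 ends
toy <- data.frame(patient = c(1, 1, 2, 2),
                  period  = c(1, 2, 1, 2),
                  grade   = c(1, 1, 2, 2))

# lag() applied before grouping: patient 2's first row is compared
# with patient 1's last row
ungrouped <- toy %>%
  arrange(patient, period) %>%
  mutate(up = grade > lag(grade)) %>%
  group_by(patient) %>%
  summarise(grade.increase = as.integer(any(up, na.rm = TRUE)))

# lag() applied within each patient: the first row of each group gets NA
grouped <- toy %>%
  arrange(patient, period) %>%
  group_by(patient) %>%
  mutate(up = grade > lag(grade)) %>%
  summarise(grade.increase = as.integer(any(up, na.rm = TRUE)))

ungrouped$grade.increase  # 0 1  (patient 2 is flagged by mistake)
grouped$grade.increase    # 0 0
```

On the sample data in the question both orderings happen to give the same answer, because patient 2 really does increase.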
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
  data.frame(patient = i[,"patient"][1],
             grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
patient grade.increase
1 1 0
2 2 1
3 3 1
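A more compact variant of the same split-apply-combine idea can be sketched with tapply, which does the splitting and the applying in one call. The explicit sort by period is an added safeguard, since diff depends on row order; the names inc and result are illustrative.

```r
df <- data.frame(patient = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 period  = c(1, 2, 3, 1, 3, 1, 3, 4, 5),
                 grade   = c(1, 1, 1, 2, 3, 1, 1, 2, 3))

# diff() looks at consecutive rows, so make sure periods are in order
df <- df[order(df$patient, df$period), ]

# One value per patient: 1 if any consecutive difference is positive
inc <- tapply(df$grade, df$patient,
              function(g) as.integer(any(diff(g) > 0)))

result <- data.frame(patient        = as.numeric(names(inc)),
                     grade.increase = as.vector(inc))
result$grade.increase
# [1] 0 1 1
```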

Match columns by regex and perform calculations using mutate in R dplyr?

I am given a dataframe like this:
uncalibrated_gyro_x uncalibrated_gyro_y uncalibrated_gyro_z
1 4 7
2 5 8
3 6 9
Sometimes I get these columns as unc_gyr_x, unc_gyr_y, unc_gyr_z instead.
In either case I need to calculate the norm: sqrt(x^2 + y^2 + z^2)
These columns are part of a large dataframe with 50 columns.
How can I "tell" mutate using regex to use these columns that sometimes are given as uncalibrated_gyro_x,y,z and the other time as unc_gyr_x,y,z?
I know there is a function matches but it doesn't work for me in mutate.
Please advise.
One approach would be to conditionally rename the variables so they're consistent and go from there:
library(dplyr)
df %>%
  # rename_with() (dplyr >= 1.0.0) replaces the deprecated rename_at()/funs() idiom
  rename_with(~ sub("uncalibrated_gyro_", "unc_gyr_", .x),
              starts_with("uncalibrated_gyro_")) %>%
  mutate(myvar = sqrt(rowSums(select(., starts_with("unc_gyr_"))^2)))
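Since the question asks about a regex specifically, here is a hedged sketch of that route in base R. The pattern gyro_pattern is an assumption written to cover only the two naming schemes mentioned in the question, and the extra column other stands in for the remaining columns of the real data.

```r
df <- data.frame(uncalibrated_gyro_x = 1:3,
                 uncalibrated_gyro_y = 4:6,
                 uncalibrated_gyro_z = 7:9,
                 other = c(10, 20, 30))

# Matches both uncalibrated_gyro_[xyz] and unc_gyr_[xyz] (assumed naming schemes)
gyro_pattern <- "^unc(alibrated)?_gyro?_[xyz]$"

# Positions of the matching columns, whichever naming scheme is in use
gyro_cols <- grep(gyro_pattern, names(df))

# Euclidean norm per row: sqrt(x^2 + y^2 + z^2)
df$norm <- sqrt(rowSums(df[gyro_cols]^2))
df$norm
```

The same pattern should also work with the matches() selection helper inside select(), in place of starts_with(), if you prefer to stay within dplyr.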

dplyr::mutate changes row numbers, how to keep them?

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
age = c(rep(seq(1, 3), 4)),
hair = 1 + (age*2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ
The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
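As a sketch of how that fits into the pipe, the example below rebuilds int_slope by hand from the printed output in the question rather than refitting the models, and hard-codes mean_age as 2 (the mean of ages 1, 2, 3 in a_df). Both are shortcuts for illustration only.

```r
library(tibble)
library(dplyr)

# Coefficients copied from the printed output; row names are the subject ids
int_slope <- data.frame(
  Intercept = c(2.9723596, 0.2824736, -1.8912636, 0.8648395),
  Slope     = c(1.387635, 2.443538, 2.494236, 1.680082),
  row.names = c("4", "11", "12", "42")
)

mean_age <- 2  # stands in for mean(a_df$age)

out <- int_slope %>%
  rownames_to_column(var = "id") %>%    # row names become a character column
  mutate(id   = as.integer(id),         # convert back to numeric ids
         ybar = Intercept + mean_age * Slope)

out$id
# [1]  4 11 12 42
```

rownames_to_column stores the ids as character strings, so the as.integer step is only needed if numeric ids are wanted downstream.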
Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.
Why don't you just create another 'ybar' column on int_slope?
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

How do I make the list output from the 'by' function in R usable?

I have a set of data with a dependent variable and two factors. I would like to randomly sample the dependent variable (with replacement) within each combination of my two factors, drawing as many samples as originally existed at that combination. I've been able to do this using the 'by' function. The problem is that the output is a list, and I'd like something more accessible but haven't had any luck converting it to a data frame. My end goal is to run the simulation described above 1000 times and, for each simulation, calculate the average of the random samples retrieved for each combination of the factors.
This produces the dataset:
value<-runif(100,5,25)
cat1<-factor(rep(1:10,10))
a<-rep("A",50)
b<-rep("B",50)
cat2<-append(a,b)
data<-as.data.frame(cbind(value,cat1,cat2))
This creates one simulation of random values drawn from the factor levels and
stores that info in a list:
list<-by(data[,"value"],data[,c("cat1","cat2")],function(x) sample(x,length(x),T))
What I'd like to do is wind up with a data frame that has as columns "Simulation", "AverageValue", "cat1", and "cat2", so that I would have 1000 simulation lines for each combination of cat1 and cat2.
Any suggestions on how to make the 'by' output more accessible so I can run a for loop on the output or other suggestions would be great.
Thanks!
As a more general method, you might like to use dplyr rather than by. This way you'll keep your data.frame.
In this case, you would use group_by (rather than by) to group on cat1 and cat2, and mutate to add a new column. You could replace new = with value = if you don't want to keep your old data:
library(dplyr)
data %>% group_by(cat1, cat2) %>%
  mutate(new = sample(value, length(value), replace = TRUE))
Source: local data frame [100 x 4]
Groups: cat1, cat2 [20]
value cat1 cat2 new
(fctr) (fctr) (fctr) (fctr)
1 13.9639607304707 1 A 13.2139691384509
2 22.6068278681487 2 A 5.27278678957373
3 24.6930849226192 3 A 22.0293137291446
4 16.842244095169 4 A 9.56347029190511
5 18.467006101273 5 A 23.1605510273948
6 20.6661582039669 6 A 24.3043746100739
7 9.37060782220215 7 A 13.9268753770739
8 6.68592340312898 8 A 20.034239795059
9 6.95704637560993 9 A 12.676755907014
10 17.2769332909957 10 A 24.453850784339
.. ... ... ...
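For the stated end goal (1000 simulations, one average per factor combination), one possible sketch in base R uses aggregate inside a small function and stacks the results. The data frame is built with data.frame() directly so that value stays numeric, and the names dat, resample_mean, one_sim and n_sims are introduced here for illustration.

```r
set.seed(42)  # only so the sketch is reproducible

# Same structure as the question's data, but value stays numeric
dat <- data.frame(
  value = runif(100, 5, 25),
  cat1  = factor(rep(1:10, 10)),
  cat2  = factor(rep(c("A", "B"), each = 50))
)

# Mean of a bootstrap resample; indexing via sample.int() avoids the
# surprise that sample(x) treats a single number x as 1:x
resample_mean <- function(x) mean(x[sample.int(length(x), replace = TRUE)])

# One simulation: one resampled mean per cat1 x cat2 combination
one_sim <- function(sim) {
  res <- aggregate(value ~ cat1 + cat2, data = dat, FUN = resample_mean)
  data.frame(Simulation   = sim,
             AverageValue = res$value,
             cat1         = res$cat1,
             cat2         = res$cat2)
}

n_sims <- 1000
sims <- do.call(rbind, lapply(seq_len(n_sims), one_sim))

dim(sims)  # 20000 rows (1000 simulations x 20 combinations), 4 columns
```

The result is an ordinary long data frame, so it can be summarised further with aggregate or with group_by and summarise.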
