Advanced pivot_longer: extract pattern in variables - r

My R df looks like this:
ID CO.RT CO.ER SC.RT SC.ER
1 0.19 0.06 1.24 0.09
2 0.61 0.01 0.63 0.03
3 0.43 0.02 1.31 0.09
I've been trying to find a way to use tidyr::pivot_longer to achieve the following:
ID Type RT ER
1 CO 0.19 0.06
1 SC 1.24 0.09
2 CO 0.61 0.01
2 SC 0.63 0.03
3 CO 0.43 0.02
3 SC 1.31 0.09
My issue: I can only pivot RT and ER into a single "score" column; I can't find the right regex(?) pattern to pivot my df in the way shown above.
Here is what I tried:
df %>% pivot_longer(cols = c(CO.RT:SC.ER),
names_pattern = "(.+).(.+)",
names_to=c("Type", ".value"))
There are many pivot_longer/names_to questions out there, but I couldn't find one that matches my problem. Can anyone help?
My dataset:
df <- tibble(
ID = c(1,2,3),
CO.RT = c(0.19,0.61,0.43),
CO.ER = c(0.06,0.01,0.02),
SC.RT = c(1.24,0.63,1.31),
SC.ER = c(0.09,0.03,0.09)
)
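A possible fix, offered as a sketch rather than a definitive answer: in the attempted names_pattern the separator `.` is an unescaped regex wildcard, so it does not pin the split to the literal dot. Escaping it (`"(.+)\\.(.+)"`) or splitting with `names_sep = "\\."` should give the desired shape. The sketch assumes tidyr 1.0.0 or later; the name `long` is only illustrative.

```r
library(tidyr)

df <- tibble::tibble(
  ID = c(1, 2, 3),
  CO.RT = c(0.19, 0.61, 0.43),
  CO.ER = c(0.06, 0.01, 0.02),
  SC.RT = c(1.24, 0.63, 1.31),
  SC.ER = c(0.09, 0.03, 0.09)
)

# Split each column name at the literal dot: the part before it goes to
# "Type", the part after it (".value") becomes the new column names RT / ER.
long <- pivot_longer(
  df,
  cols = -ID,
  names_to = c("Type", ".value"),
  names_sep = "\\."
)
long
```

The special name ".value" tells pivot_longer to use that part of the column name as the output column name instead of stacking everything into one value column.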

Related

Exclude factor loadings from ID variable in order to create latent concept

I ran a factor analysis and want to build the latent concepts (postmaterialism and materialism) from the correlated variables (see the fa output below). Later I want to merge the data set used for the factor analysis with another data set, so I kept the ID variable to use as the key. My problem is that I need to exclude the ID variable's factor loadings, because otherwise they will distort each individual's score on the latent concepts. I tried several index expressions, such as:
!("ID"), with = FALSE
-("ID"), with = FALSE
setdiff(names(expl_fa2), "ID"), with = FALSE
but nothing worked.
This is my code for the latent variables:
data_fa_1 <- data_fa_1 %>% mutate(postmat = expl_fa2$scores[, 1], mat = expl_fa2$scores[, 2])
And this is the output from the factor analysis:
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
import_of_new_ideas 0.48 0.06 0.233 0.77 1.0
import_of_safety 0.06 0.61 0.375 0.63 1.0
import_of_trying_things 0.66 0.03 0.435 0.57 1.0
import_of_obedience 0.01 0.49 0.240 0.76 1.0
import_of_modesty 0.01 0.44 0.197 0.80 1.0
import_of_good_time 0.62 0.01 0.382 0.62 1.0
import_of_freedom 0.43 0.16 0.208 0.79 1.3
import_of_strong_gov 0.15 0.57 0.350 0.65 1.1
import_of_adventures 0.64 -0.15 0.427 0.57 1.1
import_of_well_behav 0.03 0.64 0.412 0.59 1.0
import_of_traditions 0.03 0.50 0.253 0.75 1.0
import_of_fun 0.67 0.03 0.449 0.55 1.0
ID 0.07 0.04 0.007 0.99 1.7
Can anyone help me with the command I need to use in order to exclude the factor loadings from the ID variable (see output fa) from the creation of the latent variables "postmat" and "mat"?
Not sure if this is really your question, but assuming you just want to remove the first column from a data.table, here is an example data.table and three ways to exclude its ID column:
library(data.table)
DT <- data.table(
  ID = LETTERS[1:10],
  matrix(rnorm(50), nrow = 10, dimnames = list(NULL, paste0("col", 1:5)))
)
DT[, -1]                                         # drop by position
DT[, -"ID"]                                      # drop by name
DT[, setdiff(colnames(DT), "ID"), with = FALSE]  # keep every column except ID
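If the underlying goal is to keep ID out of the factor analysis itself, one option is to drop the column before fitting and attach the scores afterwards, since row order is preserved. The sketch below is only an illustration under stated assumptions: it uses base R's stats::factanal as a stand-in for the asker's factor analysis function, and the data, the item names (a1 to b3) and the output column names (factor1, factor2) are all made up.

```r
set.seed(1)
n <- 200
f1 <- rnorm(n)  # simulated latent factor 1 (hypothetical)
f2 <- rnorm(n)  # simulated latent factor 2 (hypothetical)

dat <- data.frame(
  ID = seq_len(n),
  a1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a2 = 0.7 * f1 + rnorm(n, sd = 0.6),
  a3 = 0.6 * f1 + rnorm(n, sd = 0.6),
  b1 = 0.8 * f2 + rnorm(n, sd = 0.6),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.6),
  b3 = 0.6 * f2 + rnorm(n, sd = 0.6)
)

# Fit the model on the item columns only, so ID never receives a loading.
items <- dat[, setdiff(names(dat), "ID")]
fit <- factanal(items, factors = 2, scores = "regression")

# Scores come back in the same row order as the input, so they can be
# attached next to ID and used for a later merge.
dat$factor1 <- fit$scores[, 1]
dat$factor2 <- fit$scores[, 2]
rownames(fit$loadings)
```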

How to use the `map` family of functions in the **purrr** package to swap columns across rows in a data frame?

Imagine four cards on a desk, laid out in several rows (five rows in the demo). The value of each card is listed in the demo data frame, but the actual position of each card is given by the pos columns; see the demo data generated below.
To restore the original order, I reorder each row's card values with [] indexing. The loop below already does this. To avoid the explicit loop, I wonder whether I can get the same result with a vectorized function from the tidyverse, e.g. pmap or a related function from purrr?
# 1. data generation ------------------------------------------------------
rm(list=ls())
vect<-matrix(round(runif(20),2),nrow=5)
colnames(vect)<-paste0('card',1:4)
order<-rbind(c(2,3,4,1),c(3,4,1,2),c(1,2,3,4),c(4,3,2,1),c(3,4,2,1))
colnames(order)=paste0('pos',1:4)
dat<-data.frame(vect,order,stringsAsFactors = F)
# 2. data swap ------------------------------------------------------------
for (i in 1:dim(dat)[1]) {
  orders <- dat[i, paste0('pos', 1:4)]
  card <- dat[i, paste0('card', 1:4)]
  vec <- card[order(unlist(orders))]
  names(vec) <- paste0('deck', 1:4)
  dat[i, paste0('deck', 1:4)] <- vec
}
dat
You could use pmap_dfr:
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))
dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~{
  x <- c(...)
  as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
Note that the result returned for each row inside pmap_dfr must not carry the original column names. If it does, the rows are bound by name, which puts every value back under its original column and undoes the reordering. I use unname here to remove the names.
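For comparison, the same reordering can also be done without purrr by indexing the card values with a two-column (row, column) matrix. This is a base R sketch, not the approach from the answer above; the card and pos values are copied from the example output, and the names cards, pos, idx, rows and deck are illustrative.

```r
cards <- matrix(
  c(0.05, 0.07, 0.16, 0.86,
    0.20, 0.98, 0.79, 0.72,
    0.50, 0.79, 0.72, 0.10,
    0.03, 0.98, 0.48, 0.06,
    0.41, 0.72, 0.91, 0.84),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, paste0("card", 1:4))
)
pos <- rbind(c(2, 3, 4, 1), c(3, 4, 1, 2), c(1, 2, 3, 4),
             c(4, 3, 2, 1), c(3, 4, 2, 1))

# For every row, the column order that sorts its pos values.
idx <- t(apply(pos, 1, order))

# Pair each row number with the column to pick; c(idx) runs column by column,
# so the row numbers repeat once per deck column.
rows <- rep(seq_len(nrow(cards)), times = ncol(cards))
deck <- matrix(
  cards[cbind(rows, c(idx))],
  nrow = nrow(cards),
  dimnames = list(NULL, paste0("deck", 1:4))
)
deck
```

Because the result is built as a fresh matrix with its own column names, the name-matching problem described above does not arise here.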

Column Mean for rows with unique values

How can I compute the mean of R, R1, R2 and R3 over the rows that share the same lon, lat pair? I'm sure this question has been asked many times, but I could not easily find it.
lon lat length depth R R1 R2 R3
1 147.5348 -35.32395 13709 1 0.67 0.80 0.84 0.83
2 147.5348 -35.32395 13709 2 0.47 0.48 0.56 0.54
3 147.5348 -35.32395 13709 3 0.43 0.29 0.36 0.34
4 147.4290 -35.27202 12652 1 0.46 0.61 0.60 0.58
5 147.4290 -35.27202 12652 2 0.73 0.96 0.95 0.95
6 147.4290 -35.27202 12652 3 0.77 0.92 0.92 0.91
I'd recommend the split-apply-combine strategy: split by BOTH lon and lat, apply mean to each group, then recombine the results into a single data frame.
This is straightforward with dplyr:
library(dplyr)
mydata %>%
  group_by(lon, lat) %>%
  summarize(
    mean_r = mean(R),
    mean_r1 = mean(R1),
    mean_r2 = mean(R2),
    mean_r3 = mean(R3)
  )
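When there are many value columns, the repeated mean() calls can be collapsed with across(). The following is a sketch assuming dplyr 1.0.0 or later; mydata is rebuilt here from the rows shown in the question, and the mean_ prefix and the name means are illustrative choices.

```r
library(dplyr)

mydata <- data.frame(
  lon    = rep(c(147.5348, 147.4290), each = 3),
  lat    = rep(c(-35.32395, -35.27202), each = 3),
  length = rep(c(13709, 12652), each = 3),
  depth  = rep(1:3, times = 2),
  R      = c(0.67, 0.47, 0.43, 0.46, 0.73, 0.77),
  R1     = c(0.80, 0.48, 0.29, 0.61, 0.96, 0.92),
  R2     = c(0.84, 0.56, 0.36, 0.60, 0.95, 0.92),
  R3     = c(0.83, 0.54, 0.34, 0.58, 0.95, 0.91)
)

# One mean per (lon, lat) group for each listed column;
# .names controls how the output columns are called.
means <- mydata %>%
  group_by(lon, lat) %>%
  summarise(
    across(c(R, R1, R2, R3), mean, .names = "mean_{.col}"),
    .groups = "drop"
  )
means
```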

Read tab delimited text file

I am trying to read the data from this link into R with the following code, but I keep getting warning messages and the data is not read into the data frame properly.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt'
df <- read.table(url, sep = '\t',header = F, skip = 2,quote='', comment='')
Can you tell me what I need to change to read the data?
EDIT
Adding data snippet
REMISS CELL SMEAR INFIL LI BLAST TEMP
1 0.8 0.83 0.66 1.9 1.1 1
1 0.9 0.36 0.32 1.4 0.74 0.99
0 0.8 0.88 0.7 0.8 0.18 0.98
0 1 0.87 0.87 0.7 1.05 0.99
1 0.9 0.75 0.68 1.3 0.52 0.98
0 1 0.65 0.65 0.6 0.52 0.98
1 0.95 0.97 0.92 1 1.23 0.99
0 0.95 0.87 0.83 1.9 1.35 1.02
This is an encoding issue. For more information, see this thread: Get "embedded nul(s) found in input" when reading a csv using read.csv().
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt'
df <- read.table(url, sep = '\t',header = TRUE, fileEncoding = "UTF-16LE")
Also consider:
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt'
df <- read.csv(url, sep = "\t", header = TRUE)
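To see the effect of fileEncoding without depending on the remote file, the sketch below writes a small tab-delimited file in UTF-16LE to a temporary path and reads it back. The file contents are a cut-down copy of the snippet in the question; tmp and con are illustrative names, and the example assumes the platform's iconv supports UTF-16LE.

```r
# Write a tiny tab-delimited sample encoded as UTF-16LE.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(
  c("REMISS\tCELL\tSMEAR",
    "1\t0.8\t0.83",
    "0\t1\t0.87"),
  con
)
close(con)

# Declaring the encoding lets read.table decode the two-byte characters
# instead of running into embedded nul bytes.
df <- read.table(tmp, sep = "\t", header = TRUE, fileEncoding = "UTF-16LE")
str(df)
unlink(tmp)
```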

apply a function on columns with specific names

I am new to R.
I have hundreds of data frames like this
ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07
This is just an example. The number and names of the Ratio_ columns differ between data frames, but all of them start with Ratio_. I want to apply a function (for example, log(x)) to the Ratio_ columns without specifying the column numbers or the full names.
I know how to do it df by df, for the one in the example:
A <- function(x) log(x)
df_log<-data.frame(df[1:2], lapply(df[3:6], A))
but I have a lot of them, and as I said the number of columns is different in each.
Any suggestion?
Thanks
Place the datasets in a list and then loop over the list elements:
lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
NOTE: No external packages are used.
data
lst <- mget(paste0("df", 1:100))
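As a small self-contained illustration of the list approach, the sketch below uses two data frames with different Ratio_ columns, built from values in the question's example. The names df1, df2 and lst_log are only illustrative.

```r
A <- function(x) log(x)

# Two data frames with a different number of Ratio_ columns.
df1 <- data.frame(
  ID = c("AA", "AB"),
  NAME = c("ABCD", "ABCE"),
  Ratio_A = c(0.09, 0.04),
  Ratio_B = c(0.67, 0.85),
  Ratio_C = c(0.10, 0.04)
)
df2 <- data.frame(
  ID = c("AC", "AD"),
  NAME = c("ABCG", "ABCF"),
  Ratio_D = c(0.14, 0.97)
)
lst <- list(df1 = df1, df2 = df2)

# For each data frame, find the Ratio_ columns by name and transform only those.
lst_log <- lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
lst_log$df2
```

The ID and NAME columns pass through unchanged because only the columns matched by the "^Ratio_" pattern are reassigned.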
This type of problem is very easily dealt with using the dplyr package. For example,
df <- read.table(text = 'ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07',
header = TRUE)
library(dplyr)
# mutate_each() and funs() are deprecated; across() (dplyr >= 1.0.0) is the current equivalent
df_transformed <- mutate(df, across(starts_with("Ratio_"), log))
df_transformed
# > df_transformed
# ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
# 1 AA ABCD -2.4079456 -0.4004776 -2.3025851 -1.96611286
# 2 AB ABCE -3.2188758 -0.1625189 -3.2188758 -2.81341072
# 3 AC ABCG -0.8439701 -1.5606477 -0.6161861 -1.96611286
# 4 AD ABCF -1.8325815 -0.4780358 -1.3862944 -0.03045921
# 5 AF ABCJ -0.5276327 -0.9942523 -0.4155154 -2.65926004
