Create subset of ranges and individual items - r

I'm using R and have a dataset with ~3000 entries of psychological test data. The data are all dyadic, from male-female partners (though this shouldn't matter for you). I'm creating a new data frame with just the variables of interest. Most of them are not listed sequentially in the original data, so I select them by name like below:
new_df <- subset(data, select=c("MQ4", "FQ4", #RX STATUS
"MQ9", "FQ9", #ETHNICITY
"MQ10", "FQ10", #RACE
"MQ465", "FQ465", #SEX
"MQ13", "FQ13", #GENDER
"MQ14", "FQ14", #SEXORIENT
"MQ180", "MQ181", "MQ182", "MQ183" ### HERE IS WHERE I NEED HELP
))
However, I have about 150 unique items that are listed sequentially and I'd like to select them without writing out "MQ180" through "MQ310" one by one. I've been trying to figure out a way to select the range as well as the individual items I have been selecting by name. This is currently what I'm trying:
new_df <- subset(data, select=c("MQ4", "FQ4", #RX STATUS
"MQ9", "FQ9", #ETHNICITY
"MQ10", "FQ10", #RACE
"MQ465", "FQ465", #SEX
"MQ13", "FQ13", #GENDER
"MQ14", "FQ14", #SEXORIENT
163:310 ### HERE IS WHERE I NEED HELP
))

One option:
dplyr::select(mtcars, "cyl", 5:8)
This subsets the mtcars data frame to just the cyl column and the 5th through 8th columns:
cyl drat wt qsec vs
Mazda RX4 6 3.90 2.620 16.46 0
Mazda RX4 Wag 6 3.90 2.875 17.02 0
Datsun 710 4 3.85 2.320 18.61 1
Here's a base R alternative, but there's probably a better way (single-bracket indexing keeps the cyl column name):
cbind(mtcars["cyl"], mtcars[5:8])
mtcars originally (drat, wt, qsec and vs are columns 5 through 8):
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

In the select argument of subset, we can turn the positions into column names with names(data), so that everything in the vector is a name:
subset(data, select=c("MQ4", "FQ4", #RX STATUS
"MQ9", "FQ9", #ETHNICITY
"MQ10", "FQ10", #RACE
"MQ465", "FQ465", #SEX
"MQ13", "FQ13", #GENDER
"MQ14", "FQ14", #SEXORIENT
names(data)[163:310]
))
The issue arises because a vector can only have a single class. When we combine character and integer values, the integers are converted to character, so subset looks for a column literally named "163" instead of using 163 as a position index.
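To make the coercion visible, here is a small sketch on mtcars rather than the asker's data. The object names (mixed, wanted, by_name, by_range) are just for illustration.

```r
# Mixing a string with integers silently converts everything to character
mixed <- c("cyl", 5:8)
class(mixed)   # "character"
mixed          # "cyl" "5" "6" "7" "8"

# Turning the positions into names first keeps the vector consistent
wanted <- c("cyl", names(mtcars)[5:8])
by_name <- subset(mtcars, select = wanted)

# subset() also accepts unquoted names, including name ranges
by_range <- subset(mtcars, select = c(cyl, drat:vs))

names(by_name)   # "cyl" "drat" "wt" "qsec" "vs"
names(by_range)  # same columns
```

If the sequential items really are contiguous in the original data, the unquoted form might translate to something like select = c(MQ4, FQ4, MQ180:MQ310), but that is an assumption about the column layout.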

Related

Get value based on another column in dplyr

I have the following dataset:
df <- mtcars[1:4,c("wt","qsec")]
df
wt qsec
Mazda RX4 2.620 16.46
Mazda RX4 Wag 2.875 17.02
Datsun 710 2.320 18.61
Hornet 4 Drive 3.215 19.44
How to achieve the following by using dynamic variable via dplyr?
df %>%
mutate(wt=floor(wt[which.min(qsec)]))
This is what I tried so far:
myvar<-"wt"
df %>%
mutate(!!myvar :=floor(!!as.name(myvar)[which.min(qsec)]))
Error in which.min(qsec) : object 'qsec' not found
Please let me know if you know why the above code fails. Thank you!
In the latest versions of dplyr, you use := to set names with a character value and you use .data[[]] to get columns with a character value. Your transformation would look like this:
df %>% mutate("{myvar}" := floor(.data[[myvar]][which.min(qsec)]))
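As for why the original attempt fails, a likely reading is operator precedence: in R, `[` binds tighter than `!`, so `!!as.name(myvar)[which.min(qsec)]` unquotes the whole subsetting expression, which is then evaluated straight away, outside the data frame, where qsec does not exist. The sketch below (assuming a recent dplyr is installed) shows the .data version next to a parenthesised `!!` version; res_data and res_bang are illustrative names.

```r
library(dplyr)

df <- mtcars[1:4, c("wt", "qsec")]
myvar <- "wt"

# Look the column up by name through the .data pronoun
res_data <- df %>%
  mutate(!!myvar := floor(.data[[myvar]][which.min(qsec)]))

# Parentheses make !! apply to the symbol only, not to the subsetting call
res_bang <- df %>%
  mutate(!!myvar := floor((!!as.name(myvar))[which.min(qsec)]))

res_data$wt  # 2 2 2 2 (wt of the row with the smallest qsec, rounded down)
```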

How to use a short script to eliminate all but one duplicate column variables based on the prefix of the colname

I want to know how to use a short script to eliminate all but one of a set of duplicate columns, based on the prefix of the column name, without typing out by hand the variables I want to remove.
For example, I created repeats of the mtcars$am variable, called am1, am2, am3, and am4, in a data frame called mtcars_example_2. I then removed the original am variable from mtcars_example_2.
I can use the code below to drop every variable with the prefix "am" except am1 and store the result in a new data frame called mtcars_example_3, but it lists all the variables to remove by hand:
## long way of removing all variable with am prefix that were not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
But this seems like the long way of doing this. Is there a faster way that does not require me to individually type in the names of each of the variables that I want to remove from the data?
Is this possible? If so, how can this be done?
Thanks ahead of time.
Here is the code for the example:
# example data
## loads packages
library(tidyverse)
## creates mtcars_example data
mtcars_example_1 <- data.frame(mtcars)
mtcars_example_2 <- data.frame(mtcars_example_1)
## creates duplicate variables, based on am variable
mtcars_example_2$am1 <- mtcars_example_1$am
mtcars_example_2$am2 <- mtcars_example_1$am
mtcars_example_2$am3 <- mtcars_example_1$am
mtcars_example_2$am4 <- mtcars_example_1$am
## removes original variable
mtcars_example_2 <-
mtcars_example_2 %>%
select(
-c(
"am"
)
)
## long way of removing all variable with am prefix that were not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
You can remove all the variables that start with am but keep am1 :
library(dplyr)
mtcars_example_2 %>% select(-starts_with('am'), am1) %>% head
# mpg cyl disp hp drat wt qsec vs gear carb am1
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 4 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 4 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 4 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 0
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 3 2 0
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1 0
Depending on your actual scenario you can also use regex to remove columns.
mtcars_example_2 %>% select(-matches('am[2-4]')) %>% head
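We could also use contains(), keeping in mind that it matches "am" anywhere in the column name, not only as a prefix: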
library(dplyr)
mtcars_example_2 %>%
select(-contains('am'), am1)
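For comparison, the same idea can be sketched in base R with startsWith(). This is only an illustration: the data frame is rebuilt here so the snippet runs on its own, and example_df, keep and example_kept are made-up names.

```r
# Rebuild a small version of the example: am1..am3 are copies of am
example_df <- mtcars
example_df$am1 <- example_df$am
example_df$am2 <- example_df$am
example_df$am3 <- example_df$am
example_df$am <- NULL

# Keep a column if it does not start with "am", or if it is exactly am1
nms <- names(example_df)
keep <- !startsWith(nms, "am") | nms == "am1"
example_kept <- example_df[, keep, drop = FALSE]

names(example_kept)
# "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "gear" "carb" "am1"
```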

Cannot use a variable named with numbers in R

I have some dataframes named as:
1_patient
2_patient
3_patient
Now I am not able to access its variables. For example:
I am not able to obtain:
2_patient$age
If I press tab when writing the name, it automatically gets quoted, but I am still unable to use it.
Do you know how I can solve this?
It is not recommended to name an object with a number as the prefix, but we can use backticks to extract the value from the object:
`1_patient`$age
If there is more than one object, we can use mget to return the objects in a list and then extract the 'age' column by looping over the list with lapply:
mget(ls(pattern = "^\\d+_mtcars$"))
#$`1_mtcars`
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
lapply(mget(ls(pattern = "^\\d+_patient$")), `[[`, 'age')
Using a small reproducible example
data(mtcars)
`1_mtcars` <- head(mtcars, 2)
1_mtcars$mpg
Error: unexpected input in "1_"
`1_mtcars`$mpg
#[1] 21 21
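Once the objects are collected in a list, it may be easier to give the list elements syntactic names and stop using the backticks. The sketch below uses a separate environment (env) and made-up ages so that it is self-contained; none of these values come from the question.

```r
# Create two awkwardly named data frames in their own environment
env <- new.env()
assign("1_patient", data.frame(age = c(34, 51)), envir = env)
assign("2_patient", data.frame(age = c(27, 45)), envir = env)

# Collect every object whose name looks like <digits>_patient
patients <- mget(ls(envir = env, pattern = "^[0-9]+_patient$"), envir = env)

# Rename the list elements so they no longer start with a digit
names(patients) <- paste0("patient_", sub("_patient$", "", names(patients)))

patients$patient_2$age            # 27 45
ages <- lapply(patients, `[[`, "age")
```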

Indexing by column name to the end of the dataframe - R

I'm wondering if there is a way to select a group of columns by the name of the first column in the group and then all the next columns either a) to the end of the data frame, or b) to another column, also using its name.
a) As an example for the first question, in the mtcars dataset, is there a way to select the columns from drat to the end of the data frame? (Something like mtcars[,'drat':ncol(mtcars)])
b) For the second question, is there a way to select the columns starting at cyl and ending at wt? (Something like mtcars[,'cyl':'wt'])
Many elegant solutions already provided but one can even use base-R to get the desired result using which as:
Ans a:
mtcars[,which(names(mtcars) == "drat"):ncol(mtcars)]
Ans b:
mtcars[,which(names(mtcars) == "cyl"):which(names(mtcars) == "wt")]
# cyl disp hp drat wt
#Mazda RX4 6 160.0 110 3.90 2.620
#Mazda RX4 Wag 6 160.0 110 3.90 2.875
#Datsun 710 4 108.0 93 3.85 2.320
#Hornet 4 Drive 6 258.0 110 3.08 3.215
#Hornet Sportabout 8 360.0 175 3.15 3.440
#......so on
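The two which() calls can be wrapped in a small helper. This is a sketch, not part of the answer above: cols_between is a hypothetical name, and it uses match() so that an unknown column name raises an error instead of silently giving a wrong range.

```r
# Select the columns from `from` to `to` (default: the last column) by name
cols_between <- function(df, from, to = names(df)[ncol(df)]) {
  pos <- match(c(from, to), names(df))
  if (anyNA(pos)) stop("unknown column name")
  df[, pos[1]:pos[2], drop = FALSE]
}

names(cols_between(mtcars, "drat"))
# "drat" "wt" "qsec" "vs" "am" "gear" "carb"
names(cols_between(mtcars, "cyl", "wt"))
# "cyl" "disp" "hp" "drat" "wt"
```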
We can do this with select from dplyr; last_col() refers to the final column.
Answer a)
mtcars %>% select(drat:last_col())
Answer b)
mtcars %>% select(cyl:wt)
In dplyr, the select function does exactly this (no quotes needed):
mtcars %>%
select(cyl:wt)
If we need to use a quoted string, convert it to a symbol with sym and then unquote it with !!:
mtcars %>%
select(!! (rlang::sym("cyl")): !!(rlang::sym("wt")))
This is useful when the names are stored in objects:
a <- "cyl"
b <- "wt"
mtcars %>%
select(!! (rlang::sym(a)): !!(rlang::sym(b)))
Or another option is
mtcars %>%
select(!! rlang::parse_expr(glue::glue("{a}:{b}")))
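When the column names are held in character variables, recent versions of dplyr also offer all_of(), which may read more simply than building symbols. A minimal sketch, assuming dplyr 1.0 or later; start_col and end_col are illustrative names.

```r
library(dplyr)

start_col <- "cyl"
end_col <- "wt"

# all_of() looks each name up as a column; `:` then takes the range between them
picked <- mtcars %>%
  select(all_of(start_col):all_of(end_col))

names(picked)  # "cyl" "disp" "hp" "drat" "wt"
```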

Compare item in one row against all other rows and loop through all rows using data.table - R

I'm combining similar names using stringdist(), and have it working using lapply, but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far, built from reading several related posts, but I'm not quite pulling it off:
library(stringdist)
library(data.table)
data("mtcars")
mtcars$cartype <- rownames(mtcars)
mtcars$id <- seq_len(nrow(mtcars))
I'm currently using lapply() to cycle through the strings in the cartype column and bring together those rows whose cartype names are closer than a specified value (.08).
output <- lapply(1:length(mtcars$cartype), function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ])
> output[1:3]
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag 2
[[2]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag 2
[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 Datsun 710 3
Data Table Attempt:
mtcarsdt <- as.data.table(mtcars)
myfun <- function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ]
An intermediate step: This code pulls similar names based on the row's value that I manually plug into myfun(), but it repeats that value for all the rows.
res <- mtcarsdt[,.(vlist = list(myfun(1))),by=id]
res$vlist[[1]] #correctly combines the 2 mazda names
res$vlist[[6]] #but it's repeated down the line
I'm now trying to cycle through all the rows using set(). I'm close, but although the code appears to be correctly matching the text from the 12th column (cartype) it's returning the values from the first column, mpg:
for (i in 1:32) set(mtcarsdt,i ,12L, myfun(i))
> mtcarsdt
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 c(21, 21) 1
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 c(21, 21) 2
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8 3
Now, this is pretty hacky, but I found that if I create a copy of the cartype column and place it in the first column it pretty much works, but there must be a cleaner way to do this. Also, it would be nice to keep the output in a list form like the lapply() output above as I have other post-processing steps set up for that format.
mtcars$cartypeorig <- mtcars$cartype
mtcars <- mtcars[,c(14,1:13)]
mtcarsdt <- as.data.table(mtcars)
for (i in 1:32) set(mtcarsdt,i ,13L, myfun(i))
> mtcarsdt[1:14,cartype]
[1] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"
[2] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"
[3] "Datsun 710"
[4] "Hornet 4 Drive"
[5] "Hornet Sportabout"
[6] "Valiant"
[7] "Duster 360"
[8] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\")"
[9] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[10] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[11] "c(\"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[12] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
[13] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
[14] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
Have you tried using the matrix version of stringdist?
res = stringdistmatrix(mtcars$cartype, mtcars$cartype, method = 'jw', p = 0.08)
out = as.data.table(which(res < 0.08, arr.ind = T))[, .(list(mtcars[row,])), by = col]$V1
identical(out, output)
#[1] TRUE
Now, you probably can't just run the above for a 500k X 500k matrix, but you can split it into smaller pieces (pick size appropriate for your data/memory sizes):
size = 4 # dividing into pieces of size 4x4
# I picked a divisible number, a little more work will be needed
# if you have a residue (nrow(mtcars) = 32)
setDT(mtcars)
grid = CJ(seq_len(nrow(mtcars)/size), seq_len(nrow(mtcars)/size))
indices = grid[, {
res = stringdistmatrix(mtcars[seq((V1-1)*size+1, (V1-1)*size + size), cartype],
mtcars[seq((V2-1)*size+1, (V2-1)*size + size), cartype],
method = 'jw', p = 0.08)
out = as.data.table(which(res < 0.08, arr.ind = T))
if (nrow(out) > 0)
out[, .(row = (V1-1)*size+row, col = (V2-1)*size +col)]
}, by = .(V1, V2)]
identical(indices[, .(list(mtcars[row])), by = col]$V1, lapply(output, setDT))
#[1] TRUE
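For the residue case mentioned in the comments (row count not divisible by the chunk size), one possible way to build the chunks is sketched below in base R. chunk_indices is a hypothetical helper, not part of stringdist or data.table; each chunk is a vector of row indices and the last one is simply shorter.

```r
# Split 1..n into consecutive chunks of at most `size` indices
chunk_indices <- function(n, size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / size)))
}

chunks <- chunk_indices(10, 4)
lengths(chunks)   # 4 4 2
chunks[[3]]       # 9 10

# Every pair of chunks is one block of the full distance matrix
block_pairs <- expand.grid(row_chunk = seq_along(chunks),
                           col_chunk = seq_along(chunks))
nrow(block_pairs) # 9
```

Each row of block_pairs could then drive one stringdistmatrix() call on the corresponding slices of the cartype column, with the chunk's own indices used to map block positions back to the original rows.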

Resources