Extract specific columns based on a target list in R

I have a data set ("mydata") comprising 1000 rows and 5000 columns, and I want to select specific columns from it. The target columns are listed in the data frame "cols".
I've tried some of the examples posted here, but without success. Could anyone please suggest a solution?
cols[1:5,]
c("S100024810", "S100024905", "S100024920", "S100024923", "S100025437"
)
mydata[1:3,1:5]
structure(list(S100024732 = c("1", "0", "1"), S100024733 = c("-",
"0", "0"), S100024803 = c("0", "0", "-"), S100024810 = c("1",
"1", "1"), S100024817 = c("-", "1", "-")), row.names = c(NA,
3L), class = "data.frame")
mydata[,cols[cols %in% names(mydata)]]
Error in .subset(x, j) : invalid subscript type 'list'

cols appears to be a data frame with one column, so you need to reference that column explicitly.
cols[cols[, 1] %in% names(mydata), 1]
# [1] "S100024810"
To prevent the result from being coerced to a vector when only one column is selected (as happens here, because only one target column exists in mydata), add drop=FALSE.
mydata[, cols[cols[, 1] %in% names(mydata), 1], drop=FALSE]
# S100024810
# 1 1
# 2 1
# 3 1
Data:
cols <- structure(list(V1 = c("S100024810", "S100024905", "S100024920",
"S100024923", "S100025437")), class = "data.frame", row.names = c(NA,
-5L))
mydata <- structure(list(S100024732 = c("1", "0", "1"), S100024733 = c("-",
"0", "0"), S100024803 = c("0", "0", "-"), S100024810 = c("1",
"1", "1"), S100024817 = c("-", "1", "-")), row.names = c(NA,
3L), class = "data.frame")
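As a minimal sketch of the same idea, the target names can be pulled out as a plain character vector first and matched with intersect(), which also makes it easy to see which targets are absent. The toy data below is a cut-down version of the question's; the names targets, present, absent and selected are illustrative.

```r
# Cut-down versions of the question's objects (illustrative values)
cols <- data.frame(V1 = c("S100024810", "S100024905"), stringsAsFactors = FALSE)
mydata <- data.frame(S100024732 = c("1", "0", "1"),
                     S100024810 = c("1", "1", "1"),
                     stringsAsFactors = FALSE)

targets <- cols[[1]]                          # character vector, not a data frame
present <- intersect(targets, names(mydata))  # targets that exist in mydata
absent  <- setdiff(targets, names(mydata))    # targets with no matching column

# drop = FALSE keeps a data frame even when only one column is selected
selected <- mydata[, present, drop = FALSE]
```

Inspecting absent before subsetting is a cheap way to catch misspelled or missing column names in a long target list.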

Related

Loop for creating multiple new 3 level variables from another 5 level variable

I'm looking for a way to generate multiple 3-level variables from an older 5-level variable, while keeping the old variables.
This is how it is now:
structure(list(Question1 = c("I", "5", "4", "4"), Question2 = c("I",
"5", "4", "4"), Question3 = c("I", "3", "2", "4")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I would like this:
structure(list(Question1 = c("I", "5", "4", "4"), Question2 = c("I",
"5", "4", "4"), Question3 = c("I", "3", "2", "4"), Question1_3l = c("NA",
"3", "3", "3"), Question2_3l = c("NA", "3", "3", "3"), Question3_3l = c("NA",
"2", "1", "3")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
I have this code to recode the 5-level variable
df2 %>%
mutate_at(vars(Question1, Question2, Question3), recode,'1'=1, '2'=1, '3'=3, '4'=5, '5'=5, 'l' = NA)
But what I want to do is to keep the old variable and generate the 3 level variable into something like Question1_3l, Question2_3l, Question3_3l.
It shouldn't be too difficult. In Stata it looks something like this:
foreach i of varlist ovsat-not_type_number {
local lbl : variable label `i'
recode `i' (1/2=1)(3=2)(4/5=3), gen(`i'_3l)
}
Thank you.
Not the most elegant, not the fastest (but still pretty fast), not the most idiomatic, but this does what you want (I think) and should be easy to read and customize.
dt <- structure(list(Question1 = c("I", "5", "4", "4"),
Question2 = c("I", "5", "4", "4"),
Question3 = c("I", "3", "2", "4")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L))
# load data.table and transform your data into a data.table (by reference)
library(data.table)
setDT(dt)
# define the names of the columns that you want to recode
vartoconv <- names(dt)
# define the names of the recoded columns
newnames <- paste0(vartoconv, "_3l")
# loop over an index along the vector of the names of the columns to recode
for (varname_loopid in seq_along(vartoconv)) {
  # identify the name of the column to recode for this iteration
  varname_loop <- vartoconv[varname_loopid]
  # identify the name of the recoded column for this iteration
  newname_loop <- newnames[varname_loopid]
  # create the recoded variable by using conditionals on the variable to recode;
  # the columns are character, so compare against strings;
  # values matching none of the conditions (such as "I") stay NA
  dt[get(varname_loop) %in% c("1", "2"), (newname_loop) := 1]
  dt[get(varname_loop) == "3", (newname_loop) := 2]
  dt[get(varname_loop) %in% c("4", "5"), (newname_loop) := 3]
}
Try:
library(tidyverse)
library(stringr)
df2 <- replicate(6, sample(as.character(1:5), 50, replace = TRUE), simplify = "matrix") %>%
as_tibble(.name_repair = ~str_c("Question", 1:6))
df2 %>%
mutate_at(vars(Question1:Question3),
~case_when(.x %in% c('1', '2') ~ 1L, # 1L means integer 1
.x %in% c('3') ~ 3L,
.x %in% c('4', '5') ~ 5L,
TRUE ~ as.integer(NA)))
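For comparison, the Stata-style mapping from the question (1/2 to 1, 3 to 2, 4/5 to 3) can also be sketched in base R with a named lookup vector, keeping the original columns and adding the "_3l" versions next to them. This is only one possible approach; the names lookup and vars are illustrative.

```r
# Data shaped like the question's example
df2 <- data.frame(Question1 = c("I", "5", "4", "4"),
                  Question2 = c("I", "5", "4", "4"),
                  Question3 = c("I", "3", "2", "4"),
                  stringsAsFactors = FALSE)

# Named lookup vector: old 5-level code (as a string) -> new 3-level code
lookup <- c("1" = 1L, "2" = 1L, "3" = 2L, "4" = 3L, "5" = 3L)

vars <- c("Question1", "Question2", "Question3")

# Indexing the lookup by the old values recodes a whole column at once;
# anything not in the lookup (such as "I") becomes NA
df2[paste0(vars, "_3l")] <- lapply(df2[vars], function(x) unname(lookup[x]))
```

Because the new columns are created by name, the originals stay untouched, which mirrors what gen() does in the Stata loop.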

Iteration over data.table

I have a table A with column names: Var1, Var2, Var3.
Var1 = c("N1", "N2", "0", "0", "N3", "N4", "0", "0")
Var2 = c("0", "A", "0", "0", "0", "B", "0", "0")
Var3 = c("0", "Yes", "No", "All", "0", "x", "y", "z")
I would like to obtain vectors from table A that contain the Var3 values belonging to each label in Var1, e.g. N2 = (Yes, No, All), N4 = (x, y, z).
I have tried a few iterations with a for loop and if conditions, but with no success. Please give me a hint.
With data.table:
1. Replace the 0s in Var1 with NA.
2. Carry forward the last occurrence of a non-NA value (using, for example, zoo::na.locf, because data.table::nafill doesn't yet work for characters).
3. Filter according to Var1.
library(data.table)
data <- data.table(Var1 = c("N1", "N2", "0", "0", "N3", "N4", "0", "0"), Var2 = c("0", "A", "0", "0", "0", "B", "0", "0"), Var3 = c("0", "Yes", "No", "All", "0", "x", "y", "z"))
# replace "0" by NA for the next step (Var1 is a character column)
data[Var1 == "0", Var1 := NA_character_]
# last occurrence carried forward
data[, Var1 := zoo::na.locf(Var1)]
data[Var1=='N2',Var3]
#[1] "Yes" "No" "All"
data[Var1=='N4',Var3]
#[1] "x" "y" "z"
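The same carry-forward idea can be sketched in base R without zoo, by counting labels with cumsum() and then splitting Var3 by the resulting group. This assumes, as in the question's data, that the first entry of Var1 is a label rather than "0"; the names is_label, group and vectors are illustrative.

```r
Var1 <- c("N1", "N2", "0", "0", "N3", "N4", "0", "0")
Var3 <- c("0", "Yes", "No", "All", "0", "x", "y", "z")

is_label <- Var1 != "0"                       # TRUE where a new label starts
# cumsum() numbers the labels 1, 2, 3, ...; indexing the labels by that
# running count repeats each label down to the next one
group <- Var1[is_label][cumsum(is_label)]

# One character vector per label, collected in a named list
vectors <- split(Var3, group)

vectors$N2
vectors$N4
```

Keeping the results in a named list avoids creating one free-standing variable per label, and individual vectors are still reachable as vectors$N2 or vectors[["N4"]].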

R for loop on objects in global environment to subset on one variable

Say I have 3 objects (A, B and C) in my global environment:
A <- data.frame(
"Var1" = c("0", "0", "1", "0"),
"Var2" = c("1", "0", "0", "0"),
"Var3" = c("0","1", "1", "1"),
"Site" = c("alpha", "alpha", "beta" ,"gamma")
)
B <- data.frame(
"Var4" = c("0", "0", "1", "1"),
"Var5" = c("1", "0", "0", "1"),
"Site" = c("alpha", "beta" , "beta" ,"gamma")
)
C <- data.frame(
"Var6" = c("0", "0", "1"),
"Var7" = c("1", "0", "0"),
"Site" = c("alpha", "beta" ,"gamma")
)
I would like to loop through all the objects in my global environment, subset each one on the "Site" variable, and name the resulting datasets as follows:
A_alpha = A[A$Site == 'alpha',]
A_beta = A[A$Site == 'beta',]
A_gamma = A[A$Site == 'gamma',]
B_alpha = B[B$Site == 'alpha',]
B_beta = B[B$Site == 'beta',]
B_gamma = B[B$Site == 'gamma',]
C_alpha = C[C$Site == 'alpha',]
C_beta = C[C$Site == 'beta',]
C_gamma = C[C$Site == 'gamma',]
What would the loop look like?
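One possible sketch, not a definitive answer: collect the data frames in a named list, split() each one by Site, and build the "<object>_<site>" names with paste(). The data below is a shortened version of the question's A and B; the names dfs, pieces and subsets are illustrative.

```r
# Shortened versions of the question's objects
A <- data.frame(Var1 = c("0", "0", "1", "0"),
                Site = c("alpha", "alpha", "beta", "gamma"))
B <- data.frame(Var4 = c("0", "0", "1", "1"),
                Site = c("alpha", "beta", "beta", "gamma"))

# Named list of the data frames to process. To pick up every data frame in the
# global environment instead, one option is:
#   dfs <- Filter(is.data.frame, mget(ls(.GlobalEnv), envir = .GlobalEnv))
dfs <- list(A = A, B = B)

subsets <- list()
for (nm in names(dfs)) {
  pieces <- split(dfs[[nm]], dfs[[nm]]$Site)            # one data frame per site
  names(pieces) <- paste(nm, names(pieces), sep = "_")  # e.g. "A_alpha"
  subsets <- c(subsets, pieces)
}

names(subsets)

# Optional: create A_alpha, A_beta, ... as objects in the global environment
invisible(list2env(subsets, envir = .GlobalEnv))
```

Working with the subsets list directly is often easier than creating many global objects, since it can be passed to lapply() as a whole.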

Set correct column names when writing list items to CSV in R

Given a list (e.g. out), I want to write each item out to a separate csv file for use elsewhere.
The list is large and contains a lot of items, so I wanted a shortcut using a for loop. I created the following to build a name for each output file based on the data group and date. When I run it, everything works except that it renames the columns using the list item name and the existing column names (e.g. instead of 'week4' I get 'pygweek4'). I do not want it to change my column names.
I tried setting col.names = TRUE, hoping to retain the existing names, and using the code below to specify the names, as well as setting col.names = FALSE. In all cases I get a warning message saying that "attempt to set 'col.names' ignored".
Can anyone suggest a simple method of retaining the column names I already have?
out <- list(pyg = structure(list(week4 = c("0", "1", "1", "0", "1"),
week5 = c("0", "1", "1", "1", "1"), week6 = c("0", "1", "0", "1", "1"),
week7 = c("0", "0", "0", "1", "1"), week8 = c("0", "1", "0", "1", "1")),
row.names = 281:285, class = "data.frame"),
saw = structure(list(week4 = c("0", "0", "0", "0", "0"),
week5 = c("0", "0", "0", "0", "0"), week6 = c("0", "0", "0", "0", "0"),
week7 = c("0", "0", "0", "0", "0"), week8 = c("0", "0", "0", "0", "1")),
row.names = c(NA, 5L), class = "data.frame"))
for(i in 1:length(out)){
n = paste(paste(names(out)[i],Sys.Date(), sep = "_"), ".csv", sep = "") # create set name and version control
write.csv(out[i], file = n, row.names = FALSE, col.names = c("week4", "week5", "week6", "week7", "week8"))
}
Sorry for the lack of decent tags... I don't have the reputation to set tags that I think are useful for this post and couldn't find ones that made sense in the ones available.
We don't need to specify col.names (write.csv ignores any attempt to set it, which is where the warning comes from). The issue seems to be that the list elements are not extracted correctly: it should be [[i]] instead of [i]. With [i], the result is still a list containing one data.frame element, whereas [[i]] extracts the data.frame itself from the list.
for(i in seq_along(out)){
n <- paste(paste(names(out)[i],Sys.Date(), sep = "_"),
".csv", sep = "")
write.csv(out[[i]], file = n, row.names = FALSE, quote = FALSE)
}
The difference is visible when comparing the structure of the two with str():
str(out[[1]])
str(out[1])
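To make the difference concrete, here is a small sketch on a toy list shaped like out (the values are illustrative). It compares the classes of the two extractions and shows how the column names change once the one-element list is coerced to a data frame, which is roughly what happens when a non-data-frame object is written.

```r
# Toy list with the same shape as `out` in the question (illustrative values)
out <- list(
  pyg = data.frame(week4 = c("0", "1"), week5 = c("1", "1")),
  saw = data.frame(week4 = c("0", "0"), week5 = c("0", "1"))
)

single <- out[1]    # still a list, of length one, holding the data frame
inner  <- out[[1]]  # the data frame itself

class(single)
class(inner)

# Coercing the one-element list to a data frame combines the element name
# with the original column names; the extracted data frame keeps them as-is
names(as.data.frame(single))
names(inner)
```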

Convert column types to their read_csv() column type in R

One of my favorite things about library(readr) and the read_csv() function in R is that it almost always sets the column types of my data to the correct class. However, I am currently working with an API in R that returns data to me as a dataframe of all character classes, even if the data is clearly numbers. Take this dataframe for example, which has some sports data:
dput(mydf)
structure(list(isUnplayed = c("false", "false", "false"), isInProgress =
c("false", "false", "false"), isCompleted = c("true", "true", "true"), awayScore = c("106",
"95", "95"), homeScore = c("94", "97", "111"), game.ID = c("31176",
"31177", "31178"), game.date = c("2015-10-27", "2015-10-27",
"2015-10-27"), game.time = c("8:00PM", "8:00PM", "10:30PM"),
game.location = c("Philips Arena", "United Center", "Oracle Arena"
), game.awayTeam.ID = c("88", "86", "110"), game.awayTeam.City = c("Detroit",
"Cleveland", "New Orleans"), game.awayTeam.Name = c("Pistons",
"Cavaliers", "Pelicans"), game.awayTeam.Abbreviation = c("DET",
"CLE", "NOP"), game.homeTeam.ID = c("91", "89", "101"), game.homeTeam.City = c("Atlanta",
"Chicago", "Golden State"), game.homeTeam.Name = c("Hawks",
"Bulls", "Warriors"), game.homeTeam.Abbreviation = c("ATL",
"CHI", "GSW"), quarterSummary.quarter = list(structure(list(
`#number` = c("1", "2", "3", "4"), awayScore = c("25",
"23", "34", "24"), homeScore = c("25", "18", "23", "28"
)), .Names = c("#number", "awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)), structure(list(`#number` = c("1", "2", "3", "4"), awayScore = c("17",
"23", "28", "27"), homeScore = c("26", "20", "25", "26")), .Names = c("#number",
"awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)), structure(list(`#number` = c("1", "2", "3", "4"), awayScore = c("35",
"14", "26", "20"), homeScore = c("39", "20", "35", "17")), .Names = c("#number",
"awayScore", "homeScore"), class = "data.frame", row.names = c(NA,
4L)))), .Names = c("isUnplayed", "isInProgress", "isCompleted",
"awayScore", "homeScore", "game.ID", "game.date", "game.time",
"game.location", "game.awayTeam.ID", "game.awayTeam.City", "game.awayTeam.Name",
"game.awayTeam.Abbreviation", "game.homeTeam.ID", "game.homeTeam.City",
"game.homeTeam.Name", "game.homeTeam.Abbreviation", "quarterSummary.quarter"
), class = "data.frame", row.names = c(NA, 3L))
It is quite a hassle to deal with this dataframe once it is returned by the API, given the class types. I've come up with a sort of a hack to update the column classes, which is as follows:
write_csv(mydf, 'mydf.csv')
mydf <- read_csv('mydf.csv')
By writing to CSV and then re-reading the CSV using read_csv(), the dataframe columns update. Unfortunately I am left with a CSV file in my directory that I don't want. Is there a way to update the columns of an R dataframe to their 'read_csv()' column classes, without actually having to write the CSV?
Any help is appreciated!
You don't need to write and read the data if you just want readr to guess your column types. You could use readr::type_convert for that:
iris %>%
dplyr::mutate(Sepal.Width = as.character(Sepal.Width)) %>%
readr::type_convert() %>%
str()
For comparison:
iris %>%
dplyr::mutate(Sepal.Width = as.character(Sepal.Width)) %>%
str()
Try this code. type.convert converts a character vector to logical, integer, numeric or complex as appropriate; with as.is = TRUE, strings that cannot be converted stay character instead of becoming factors.
# find the character columns (list columns are skipped)
indx <- which(sapply(df, is.character))
# df[indx] keeps a data frame even when only one column is selected
df[indx] <- lapply(df[indx], type.convert, as.is = TRUE)
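A minimal base R sketch of the type.convert approach on a small all-character frame modelled on the question's data (the values are copied from the dput; the frame itself is cut down for illustration):

```r
# All-character data, as returned by the API in the question
df <- data.frame(isCompleted = c("true", "true", "true"),
                 awayScore   = c("106", "95", "95"),
                 game.date   = c("2015-10-27", "2015-10-27", "2015-10-27"),
                 stringsAsFactors = FALSE)

# Convert only the character columns; as.is = TRUE leaves unconvertible
# strings as character rather than turning them into factors
indx <- which(sapply(df, is.character))
df[indx] <- lapply(df[indx], type.convert, as.is = TRUE)

sapply(df, class)
```

Note that base type.convert does not parse dates, so game.date stays character here; converting it would need an explicit as.Date() call.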
