Keeping specific columns in read_excel


I am importing an Excel file into R. I only want to keep columns A and C, not B (the columns are A, B, C in that order), but the following code keeps column B too. How can I drop column B without subsetting in a separate line of code?
df <- read_excel("df.xlsm", "futsales", range = cell_cols(c("A","C")), na = " ")

According to the documentation for read_excel, the range has to be a single contiguous block, so cell_cols(c("A","C")) is read as the span from A to C, the same as:
df <- read_excel("df.xlsm", "futsales", range = cell_cols("A:C"), na = " ")

It looks like you can't specify multiple ranges in the range parameter of read_excel. However, you can use the map function from purrr to apply read_excel to a vector of ranges. In your use case, map_dfc will bind the column A and C selections back together into a single output dataset.
library(readxl)
library(purrr)
path <- readxl_example("datasets.xlsx")
ranges <- list("A", "C")
ranges %>%
  purrr::map_dfc(
    ~ read_excel(
      path,
      range = cell_cols(.))
  )

I just did this to successfully read in 5 columns of an Excel file with 27 columns. Here is how you can do it for a file whose name is stored in x, retrieving only the first and third columns and assuming that column A is text and column C is numeric:
library(tibble)
library(readxl)
# read_excel() already returns a tibble; as_tibble() replaces the deprecated as.tibble()
df.temp <- as_tibble(read_excel(x,
                                col_names = TRUE,
                                col_types = c("text", "skip", "numeric")))
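As a self-contained check of the col_types approach, here is a minimal sketch that uses the example workbook shipped with readxl instead of the asker's df.xlsm. It assumes the first sheet of datasets.xlsx is the five-column iris data, so the column types below are specific to that file and are not taken from the question.

```r
library(readxl)

# Example workbook bundled with readxl; the first sheet is assumed to hold
# the iris data (four numeric columns followed by one text column).
path <- readxl_example("datasets.xlsx")

# One col_types entry per spreadsheet column; "skip" drops that column
# while reading, so no second subsetting step is needed.
kept <- read_excel(
  path,
  col_types = c("numeric", "skip", "numeric", "skip", "skip")
)

# Only the first and third columns remain
names(kept)
```

The col_types vector has to list every column in the sheet, which is the price paid for doing the selection inside the read_excel call.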

Another option, along the lines of what @MauritsEvers said, would be:
df <- read_excel("df.xlsm", "futsales")[,c(1,3)]
This reads all the data and, in the same statement, keeps every row (that's why the index before the comma in [, is empty) and only the first ("A") and third ("C") columns (that's why the c(1,3)).

Related

Vector gets stored as a dataframe instead of being a vector

I am new to R and RStudio and I need to create a vector that stores the first 100 rows of the CSV file the programme reads. However, despite all my attempts, my variable v1 ends up as a dataframe instead of an int vector. What can I do to solve this? Here's my code:
library(readr)
# The path is assembled with paste0() so the line breaks do not end up inside the file name
cup_data <- read_csv(paste0(
  "C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/",
  "YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/",
  "Practical/cup98lrn variable subset small.csv"))
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
                             "GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment says:

cup_data_small is a tibble, a slightly modified version of a dataframe that has slightly different rules to try to avoid some common quirks/inconsistencies in standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector, and df[, c("a", "b")] gives you a dataframe - you're using the same syntax so arguably they should give the same type of result.
To get just a vector from a tibble with single-bracket indexing, you have to explicitly pass drop = TRUE, e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
  as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]
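Applied to the question, the same idea might look like the sketch below. The real CSV is not available here, so cup_demo is a small hypothetical stand-in for cup_data_small, and 1:3 stands in for 1:100. It also shows [[ ]] as an alternative to drop = TRUE.

```r
library(tibble)

# Hypothetical stand-in for cup_data_small (the values are made up)
cup_demo <- tibble(AGE = c(60L, 46L, 70L, 78L, 38L),
                   GENDER = c("F", "M", "M", "F", "F"))

# Single brackets keep the tibble structure ...
still_tibble <- cup_demo[1:3, "AGE"]

# ... while [[ ]] (or $) pulls the column out as a plain vector,
# which can then be subset by position.
v1 <- cup_demo[["AGE"]][1:3]

# drop = TRUE gives the same vector in one step
v1_drop <- cup_demo[1:3, "AGE", drop = TRUE]

is.data.frame(still_tibble)
is.integer(v1)
```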

Can't reorder data frame columns by matching column names given in another column

I'm trying to re-order the variables of my data frame using the contents of a variable in another data frame but it's not working and I don't know why.
Any help would be appreciated!
# Starting point
df_main <- data.frame(coat=c(1:5),hanger=c(1:5),book=c(1:5),
                      bottle=c(1:5),wall=c(1:5))
df_order <- data.frame(order_var=c("wall","book","hanger","coat","bottle"),
                       number_var=c(1:5))
# Goal
df_goal <- data.frame(wall=c(1:5),book=c(1:5),hanger=c(1:5),
                      coat=c(1:5),bottle=c(1:5))
# Attempt
df_attempt <- df_main[df_order$order_var]

In your df_order, put stringsAsFactors = FALSE in the data.frame call (this is already the default from R 4.0.0 onward).

The issue is that the order column is a factor. If you convert it to character, it will work:
df_goal <- df_main[as.character(df_order$order_var)]
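To see why the factor gets in the way, the sketch below rebuilds the question's data with stringsAsFactors = TRUE set explicitly (an assumption made so the behaviour is reproducible on R 4.0 and later). Indexing with a factor uses its integer codes rather than its labels, so the columns are picked by position.

```r
df_main <- data.frame(coat = 1:5, hanger = 1:5, book = 1:5,
                      bottle = 1:5, wall = 1:5)

# stringsAsFactors = TRUE reproduces the pre-4.0 default explicitly
df_order <- data.frame(order_var = c("wall", "book", "hanger", "coat", "bottle"),
                       number_var = 1:5,
                       stringsAsFactors = TRUE)

# The integer codes are positions in the alphabetically sorted levels,
# not the order in which the labels were typed
as.integer(df_order$order_var)

by_code <- df_main[df_order$order_var]                # columns picked by code
by_name <- df_main[as.character(df_order$order_var)]  # columns picked by name

names(by_name)
```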

Trying to read specific columns from multiple csv with fread and apply

Here is the situation, I have many csv files, say, 20 of them. Each of them has different column names. So I built a map for them.
map
# variable location
# A 1
# B 1
# C 2
I am trying to read them all at once, so I have code like this:
library(data.table)
Table <- rbindlist(
  apply(map, 1, function(x) {
    fil <- paste0(x[2], ".csv")
    sel <- x[1]
    fread(file = fil, select = sel)
  })
)
When it is done, I get a data.table with 1 column containing all the data. If I use rbind instead, I get a large matrix of the wanted elements, but it can't be converted into the data.table form I need. How can I make this work? Please advise, thanks.

The issue is with the columns that are factor class in the dataset 'map'. When we use apply, it is converted to a matrix and the factor columns get coerced to integer values and that causes the mismatch. One option is to convert to character class. This can be done more compactly with Map:
rbindlist(Map(fread, file = paste0(map$location, ".csv"),
              select = as.character(map$variable)))
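A runnable sketch of the Map approach follows. The files, their contents, and the temporary directory are all hypothetical, created only for illustration. The last step assumes the wanted result has one column per variable, which the question does not state, so it combines the pieces side by side instead of calling rbindlist.

```r
library(data.table)

# Hypothetical files written to a temporary directory for illustration
tmp <- tempdir()
fwrite(data.table(A = 1:3, B = 4:6, X = 7:9), file.path(tmp, "1.csv"))
fwrite(data.table(C = 10:12, Y = 13:15),      file.path(tmp, "2.csv"))

# Lookup table: which variable lives in which file
map <- data.frame(variable = c("A", "B", "C"), location = c(1, 1, 2))

# One fread() call per row of `map`, each reading a single column
pieces <- Map(fread,
              file   = file.path(tmp, paste0(map$location, ".csv")),
              select = as.character(map$variable))

# Assumption: the wanted result has one column per variable
wide <- as.data.table(setNames(lapply(pieces, `[[`, 1),
                               as.character(map$variable)))
wide
```

This layout only works when every selected column has the same number of rows.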

Manipulating a dataset by separating variables

I have a data set that looks similar to the image shown below. In total, it has over 1,000 observations. I want to create a new data frame that separates the single variable into 3 variables. The three values are separated by a "+" in each observation, so that character needs to be used as the delimiter.

Here is a solution using data.table:
library(data.table)
# Data frame
df <- data.frame(MovieId.Title.Genres = c("yyyy+xxxx+wwww", "zzzz+aaaa+aaaa"))
# Data frame to data table.
df <- data.table(df)
# Split column into parts.
df[, c("MovieId", "Title", "Genres") := tstrsplit(MovieId.Title.Genres, "\\+")]
# Print data table
df

I'll assume that your movieData object is a single column data.frame object.
If you want to split a single element from your data set, use strsplit using the character + (which R wants to see written as "\\+"):
# split the first element of movieData into a vector of strings:
strsplit(as.character(movieData[1,1]), "\\+")
Use lapply to apply this to the entire column, then massage the resulting list into a nice, usable data.frame:
# convert to a list of character vectors, one per row:
step1 = lapply(movieData[,1], function(x) strsplit(as.character(x), "\\+")[[1]])
# step1 is a list, so bind its elements together as rows and make a data.frame:
step2 = as.data.frame(do.call(rbind, step1), stringsAsFactors = FALSE)
# step2 is a nice data.frame, but its names are garbage. Fix it:
movieDataWithColumns = setNames(step2, c("MovieId", "Title", "Genres"))
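Since strsplit is vectorised, the lapply can also be dropped. The sketch below is a base R variant on a hypothetical two-row movieData (the real data set is not shown in the question). It assumes every row contains exactly three "+"-separated pieces.

```r
# Hypothetical single-column data.frame standing in for movieData
movieData <- data.frame(
  MovieId.Title.Genres = c("1+Toy Story+Animation", "2+Jumanji+Adventure"),
  stringsAsFactors = FALSE
)

# strsplit() works on the whole column at once; fixed = TRUE treats "+"
# literally, so no regex escaping is needed
parts <- strsplit(movieData[[1]], "+", fixed = TRUE)

# Stack the pieces as rows, then name the three resulting columns
movieSplit <- setNames(
  as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE),
  c("MovieId", "Title", "Genres")
)
movieSplit
```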

Select a column from multiple dataframes in a list

My list has multiple data frames with only two columns
DateTime Value
30-06-2016 100
31-07-2016 200
.
.
.
I just want to extract the column Value from each data frame in the list. The following code proved unsuccessful for me. What am I doing wrong here?
actual_data <- lapply(test_data, function(df) df[,is.numeric(df)])
> actual_data[[1]]
data frame with 0 columns and 12 rows
Thank you

purrr::map (an enhanced version of lapply) provides a shortcut for this type of operation:
# Generate test data
set.seed(35156)
test_df <- data.frame('DateTime' = rnorm(100), 'Value' = rnorm(100))
test_data <- rep(list(test_df), 100)
# Use `map` from the purrr package to subset the data.frames
purrr::map(test_data, 'Value')
purrr::map(test_data, 2)
As you can see in the example above, you can select columns in a data.frame either by name, by passing a character string as the second argument to purrr::map, or by position, by passing a number.
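For comparison, here is a base R sketch of the same extraction. It also shows why the original attempt returned zero columns. The two-row test_df is a hypothetical miniature of the data in the question.

```r
# Hypothetical list of two-column data frames
test_df <- data.frame(DateTime = c("30-06-2016", "31-07-2016"),
                      Value = c(100, 200))
test_data <- list(test_df, test_df)

# is.numeric() on a whole data frame is FALSE, so df[, is.numeric(df)]
# becomes df[, FALSE] and keeps no columns
is.numeric(test_df)

# Pull the column out by name ...
values <- lapply(test_data, `[[`, "Value")

# ... or test each column separately to keep every numeric one
numeric_only <- lapply(test_data, function(df) {
  df[, sapply(df, is.numeric), drop = FALSE]
})
```

The first form returns a list of vectors. The second returns a list of one-column data frames.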