Select a column from multiple dataframes in a list - r

My list has multiple data frames with only two columns
DateTime Value
30-06-2016 100
31-07-2016 200
.
.
.
I just want to extract the column Value from each data frame in the list. The following code proved unsuccessful for me. What am I doing wrong here?
actual_data <- lapply(test_data, function(df) df[,is.numeric(df)])
> actual_data[[1]]
data frame with 0 columns and 12 rows
Thank you

purrr::map (an enhanced version of lapply) provides a shortcut for this type of operation:
# Generate test data
set.seed(35156)
test_df <- data.frame('DateTime' = rnorm(100), 'Value' = rnorm(100))
test_data <- rep(list(test_df), 100)
# Use `map` from the purrr package to subset the data.frames
purrr::map(test_data, 'Value')
purrr::map(test_data, 2)
As you can see in the example above, you can select columns in a data.frame either by name, by passing a character string as the second argument to purrr::map, or by position, by passing a number.
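The attempt in the question returns zero columns because is.numeric(df) tests the data frame as a whole and gives a single FALSE, so df[, FALSE] selects nothing. As a minimal base R sketch (the two-row test_data here is made-up stand-in data, not the asker's), you can either pull the column by name or test each column separately:

```r
# Stand-in data shaped like the question's list (values are made up)
test_df <- data.frame(DateTime = c("30-06-2016", "31-07-2016"),
                      Value = c(100, 200),
                      stringsAsFactors = FALSE)
test_data <- list(test_df, test_df)

# is.numeric() on a whole data frame is a single FALSE
is.numeric(test_df)

# Option 1: extract the column by name as a plain vector
by_name <- lapply(test_data, function(df) df[["Value"]])

# Option 2: test each column and keep the numeric ones as a data frame
by_type <- lapply(test_data, function(df) df[, sapply(df, is.numeric), drop = FALSE])
```

Option 1 is the closest base R equivalent of purrr::map(test_data, 'Value'); Option 2 is closer to what the original code was trying to do.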

Related

Unnest a ts class

My data contains sales data for multiple customers, each with different start and end dates, so I applied simple exponential smoothing.
I used the following code to apply ses:
library(zoo)
library(forecast)
z <- read.zoo(data_set,FUN = function(x) as.Date(x) + seq_along(x) / 10^10 , index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x),frequency = 52))
HW <- lapply(L, ses)
Now my output is a list whose elements have uneven lengths. Can someone help me unnest or unlist the output into a data frame and get the fitted values, actuals, and residuals along with their dates, sales, and customer_id?
Note: the reason I posted my input data rather than the HW data is that the HW data is too large.
Can someone help me do this in R?
I would use the tidyverse packages to handle this problem.
library(tidyverse)
map(HW, ~ .x %>%
      as.data.frame %>%          # convert each element of the list to a data.frame
      rownames_to_column) %>%    # move the row names into a column within each element
  bind_rows(.id = "customer_id") # bind all elements and add the customer ID
I am not sure how to relate the dates and actual sales to your output (HW). If you explain that, I may be able to provide a solution to that part of the problem too.
First, I stored all the unique customer_id values in a variable called 'k':
k <- unique(data_set$customer_id)
Then I created an empty data frame:
b <- data.frame()
I extracted the fitted values for each customer in a for loop, stored them in 'a', and appended them to data frame 'b' with rbind:
for (key in k) {
  # HW is the list of ses fits from the question, named by customer_id
  a <- as.data.frame(as.numeric(HW[[as.character(key)]]$model$fitted))
  print(a)
  b <- rbind(b, a)
}
Finally, I attached data frame 'b' to the input data set with cbind:
data_set_final <- cbind(data_set, b)
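As a hedged sketch of the same idea without growing a data frame inside a loop: objects returned by ses() carry the original series, fitted values, and residuals in their x, fitted, and residuals components, so each fit can be turned into a small data frame and the pieces stacked. The two simulated series and the ids "c1" and "c2" below are hypothetical stand-ins for the asker's data. Dates are not stored in the ts objects, so they would still have to be carried over from the original data in the same row order.

```r
library(forecast)

# Simulated stand-in for the question's list of per-customer series
set.seed(1)
L <- list(c1 = ts(rnorm(30, mean = 100, sd = 5), frequency = 52),
          c2 = ts(rnorm(25, mean = 50, sd = 3), frequency = 52))
HW <- lapply(L, ses)

# One data frame per customer, then stack them row-wise
pieces <- lapply(names(HW), function(id) {
  fit <- HW[[id]]
  data.frame(customer_id = id,
             actual = as.numeric(fit$x),
             fitted = as.numeric(fit$fitted),
             residual = as.numeric(fit$residuals),
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, pieces)
head(out)
```

Because customer_id is stored alongside every row, series of different lengths stack without any padding.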

Vector gets stored as a dataframe instead of being a vector

I am new to R and RStudio, and I need to create a vector that stores the first 100 rows of a column from the CSV file that the program reads. However, despite all my attempts, my variable v1 ends up as a data frame instead of an int vector. What can I do to solve this? Here's my code:
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment says:
cup_data_small is a tibble, a slightly modified version of a dataframe that has slightly different rules to try to avoid some common quirks/inconsistencies in standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector, and df[, c("a", "b")] gives you a dataframe - you're using the same syntax so arguably they should give the same type of result.
To get just a vector from a tibble with single-bracket indexing, you have to explicitly pass drop = TRUE (extracting the column with [[ ]] or $ also returns a vector), e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]
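Applied to the question, a minimal sketch might look like the following. The small tibble is a made-up stand-in for cup_data_small, so only the column name AGE is taken from the original post:

```r
library(tibble)

# Stand-in for cup_data_small (made-up values; only AGE matters here)
cup_data_small <- tibble(AGE = 1:150, GENDER = rep(c("F", "M"), 75))

v1 <- cup_data_small[["AGE"]][1:100]             # extract the column, then take 100 elements
v2 <- cup_data_small$AGE[1:100]                  # same thing with $
v3 <- cup_data_small[1:100, "AGE", drop = TRUE]  # single brackets need drop = TRUE

is.vector(v1)
```

All three forms give the same plain vector; which one to use is mostly a matter of taste.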

Keeping specific columns in read_excel

I am importing an Excel file into R. I only want to keep columns A and C, not B (the columns are A, B, C in order), but the following code keeps column B too. How can I get rid of column B without subsetting in another line of code?
df <- read_excel("df.xlsm", "futsales", range = cell_cols(c("A","C")), na = " ")
According to the documentation for the read_excel function, the range has to be given as a single contiguous range (which still includes column B), like this:
df <- read_excel("df.xlsm", "futsales", range = cell_cols("A:C"), na = " ")
It looks like you can't specify multiple ranges in the range parameter of read_excel. However, you can use the map function from purrr to apply read_excel to a vector of ranges. In your use case, map_dfc will bind the column A and C selections back together into a single output dataset.
library(readxl)
library(purrr)
path <- readxl_example("datasets.xlsx")
ranges <- list("A", "C")
ranges %>%
  purrr::map_dfc(
    ~ read_excel(
      path,
      range = cell_cols(.)
    )
  )
I just did this to successfully read in 5 columns of an Excel file with 27 columns. Here is how you can do it for a file whose name is stored in x, retrieving only the first and third columns, assuming that column A is text and column C is numeric:
library(tibble)
library(readxl)
df.temp <- as_tibble(read_excel(x,
                                col_names = TRUE,
                                col_types = c("text", "skip", "numeric")))
Another option, along the lines of what @MauritsEvers said, would be:
df <- read_excel("df.xlsm", "futsales")[,c(1,3)]
This reads the whole sheet into a data frame and then subsets it in the same line, keeping all the rows (that's why the [, has nothing before the comma) and only the first ("A") and third ("C") columns (that's why the c(1,3)]).
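For a version of the col_types approach that can be run without the asker's file, here is a small sketch using the example workbook that ships with readxl. It assumes the first sheet of datasets.xlsx is the five-column iris data; col_types needs one entry per column in the sheet, and "skip" drops a column at read time:

```r
library(readxl)

# Example workbook shipped with readxl; its first sheet has five columns
path <- readxl_example("datasets.xlsx")

# One entry per column in the sheet: keep columns 1 and 3, skip the rest
kept <- read_excel(path, col_types = c("numeric", "skip", "numeric", "skip", "skip"))

ncol(kept)
```

Unlike subsetting with [, c(1, 3)] afterwards, the skipped columns are never loaded into R.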

How to display multiple columns data into single column

My data is in the following form:
Parameter Value Parameter Value Parameter Value
Speed 100 Time 1 Distance 260
and I want to display it in tabular format as all the 'Parameters' in one column and all the 'Values' in another column
Parameter Value
Speed 100
Time 1
Distance 260
Please help me with this.
Thanks in advance..!!
Here is a quick and dirty solution. I'm assuming the number of columns is even.
library(tidyverse)
library(magrittr)
library(janitor) # For making column names unique.
# Create your example dataset.
test = c('Speed', 100, 'Time', 1, 'Distance', 260) %>%
  t() %>%
  as_tibble() %>%
  clean_names() # Make column names unique. tidyverse functions won't work otherwise.
# If you're reading your dataset into R via read_csv(), read_excel(), etc, be sure to
# run the imported tibble through clean_names().
# Create empty list to house each parameter and its value in each element.
params = list()
# Loop through the 'test' tibble, grabbing each parameter-value pair
# and throwing them in their own element of the list 'params'
for (i in 1:(ncol(test)/2)) {
  # The 1st & 2nd columns are a parameter-value pair. Grab them.
  params[[i]] = select(test, 1:2)
  # Drop the 1st and 2nd columns.
  test = select(test, -(1:2))
}
# We want to combine the tibbles in 'params' row-wise into one big tibble.
# First, the column names must be the same for each tibble.
params = lapply(X = params, FUN = setNames, c('v1', 'v2'))
# Combine the tibbles row-wise into one big tibble.
test2 = do.call(rbind, params) %>%
set_colnames(c('Parameter', 'Value'))
# End. 'test2' is the desired output.
@Namrata, here is an approach that uses base R functions and doesn't require cleaning the column names.
rawData <- "Parameter Value Parameter Value Parameter Value
Speed 100 Time 1 Distance 260"
# read the data and convert to a matrix of character values
# read the first row as data, not variable names
theData <- as.matrix(read.table(textConnection(rawData),header=FALSE,
stringsAsFactors=FALSE))
# transpose so required data is in second column
transposedData <- t(theData)
# calculate indexes to extract parameter names (odd) and values (even) rows
# from column 2
parmIndex <- seq(from=1,to=nrow(transposedData)-1,by=2)
valueIndex <- seq(from=2,to=nrow(transposedData),by=2)
# create 2 vectors
parameter <- transposedData[parmIndex,2]
value <- transposedData[valueIndex,2]
# convert to data frame and reset rownames
resultData <- data.frame(parameter,value=as.numeric(value),stringsAsFactors=FALSE)
rownames(resultData) <- 1:nrow(resultData)
The resulting output is:
  parameter value
1     Speed   100
2      Time     1
3  Distance   260
regards,
Len
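Since the values simply alternate between parameter and value, a shorter base R sketch is to lay the row out as a two-column matrix filled row by row. The character vector below is typed in by hand from the question's example; with real data it would come from unlisting the single data row:

```r
# The data row from the question, as one character vector
row_values <- c("Speed", "100", "Time", "1", "Distance", "260")

# Fill a two-column matrix row by row: each row is one parameter-value pair
pairs <- matrix(row_values, ncol = 2, byrow = TRUE)

result <- data.frame(Parameter = pairs[, 1],
                     Value = as.numeric(pairs[, 2]),
                     stringsAsFactors = FALSE)
result
```

Like the first answer, this assumes an even number of columns.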

Manipulating a dataset by separating variables

I have a data set that looks similar to the image shown below. In total, it is over 1,000 observations long. I want to create a new data frame that separates the single variable into 3 variables. The values are separated by a "+" in each observation, so the column will need to be split using that as the delimiter.
Here is a solution using data.table:
library(data.table)
# Data frame
df <- data.frame(MovieId.Title.Genres = c("yyyy+xxxx+wwww", "zzzz+aaaa+aaaa"))
# Data frame to data table.
df <- data.table(df)
# Split column into parts.
df[, c("MovieId", "Title", "Genres") := tstrsplit(MovieId.Title.Genres, "\\+")]
# Print data table
df
I'll assume that your movieData object is a single column data.frame object.
If you want to split a single element from your data set, use strsplit using the character + (which R wants to see written as "\\+"):
# split the first element of movieData into a vector of strings:
strsplit(as.character(movieData[1,1]), "\\+")
strsplit is vectorized, so apply it to the entire column, then massage the resulting list into a nice, usable data.frame:
# split every element into a vector of strings (one list element per row):
step1 = strsplit(as.character(movieData[,1]), "\\+")
# step1 is a list of vectors, so stack them row-wise and make a data.frame:
step2 = as.data.frame(do.call(rbind, step1), stringsAsFactors = FALSE)
# step2 is a nice data.frame, but its names are garbage. Fix it:
movieDataWithColumns = setNames(step2, c("MovieId", "Title", "Genres"))
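If the tidyverse is an option, tidyr::separate does the split and the renaming in one call. This is only a sketch: the two-row movieData below is made up, and the single column name MovieId.Title.Genres is borrowed from the data.table answer rather than from the asker's actual data:

```r
library(tidyr)

# Made-up stand-in for the asker's single-column data frame
movieData <- data.frame(
  MovieId.Title.Genres = c("1+Toy Story+Animation", "2+Jumanji+Adventure"),
  stringsAsFactors = FALSE
)

# sep is a regular expression, so the + has to be escaped
result <- separate(movieData, MovieId.Title.Genres,
                   into = c("MovieId", "Title", "Genres"),
                   sep = "\\+")
result
```

All three new columns come back as character; MovieId can be converted with as.integer() afterwards if needed.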
