Recursive indexing only works up to [[1:3]] - r

I need to refer to individual dataframes within a list of dataframes (one by one) produced from a lapply function, but I'm getting the "recursive indexing failed at level 3" error. I've found similar questions, but none of them explain why this doesn't work.
I used lapply to make a list of dataframes, each with a different filter applied. The output in my reproducible example has 4 dataframes in the output (dfs). Now I want to refer to each dataframe in turn by indexing its position in the list.
If I use the format list(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]) I get the output that I want, and it works for the next function I need to apply, but it seems very inefficient.
When I try to shorten it by using list(dfs[[1:4]]) instead, I get the error Error in data1[[1:4]] : recursive indexing failed at level 3. If I try list(dfs[[1:3]]), it runs, but doesn't give the output I expect (no longer a list of dataframes).
Here's an example:
library(tidyverse) # for glimpse, filter, mutate
data(mtcars)
mtcars2 <- mutate(mtcars, var = rep(c("A", "B", "C", "D"), len = 32)) # need a variable with more than 3 possible outcomes
glimpse(mtcars2)
list <- c("A", "B", "C", "D") # each new dataframe will filter based on these variables
dfs <- lapply(list, function(x) {
mtcars2 %>% filter(var == x) %>% glimpse()
}) # each dataframe now only contains A, B, C, or D
dfs # list of dataframes produced from lapply
dflist1 <- list(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]) # indexing one by one
dflist1 # this is what I want
dflist2 <- list(dfs[[1:4]]) # indexing all together
dflist2 # this produces an error
dflist3 <- list(dfs[[1:3]])
dflist3 # this runs, but the output is just `[[1]] [1] 4`, not a list of dataframes
I want something that looks like the output from dflist1 but that doesn't require me to add and remove list items every time the number of dataframes changes. I can't use the lapply output (dfs) as it is because my next function can't locate the variables within each dataframe as needed.
Any guidance appreciated.
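For what it's worth, the error seems to come from how [[ treats a vector index: dfs[[1:3]] is read as dfs[[1]][[2]][[3]] (first dataframe, second column, third element), which is why dflist3 holds a single number and why asking for a fourth level fails. Single brackets return a sub-list instead. A minimal sketch using base R only (the names groups, nested, same and dflist are made up for illustration):

```r
# Build a list of data frames, one per group (base R stand-in for the dplyr code)
mtcars2 <- transform(mtcars, var = rep(c("A", "B", "C", "D"), length.out = 32))
groups <- c("A", "B", "C", "D")
dfs <- lapply(groups, function(g) mtcars2[mtcars2$var == g, ])

# [[ with a vector indexes recursively, so these two expressions are equivalent
nested <- dfs[[1:3]]
same <- dfs[[1]][[2]][[3]]  # data frame 1 -> column 2 (cyl) -> element 3

# [ with a vector returns a sub-list of data frames
dflist <- dfs[1:4]
identical(dflist, list(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]))
```

If every element is wanted, dfs itself is already that list, and seq_along(dfs) avoids hard-coding the number of dataframes.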

Related

Pairwise t test loop through dataframes contained in a list

I have a very large dataframe which is built as follows:
Originaldf
I want to perform a pairwise t test within item A, comparing the measured value within the condition groups. So I would like to see if for all observations pertaining to item A, there is a difference between the measured values of the control group, test group, and placebo group (Condition).
The first thing I did was to split the dataframe into a list using the split function.
Listdf <- split(originaldf, Item)
This worked and I got a list containing 82 elements with one dataframe corresponding to each item in the original dataframe.
I am now trying to perform the pairwise.t.test function on each element of the list. I am relatively new to R and think that writing a loop for this process, though inefficient, would help me understand what is going on in the background. I know there is also the option to use the lapply function. I tried this on the Listdf with the following code, which I know is most likely much too simple but was worth a try.
lapply(Listdf, pairwise.t.test(Value, Condition))
However, I get the error Error in factor(g) : object 'Condition' not found. Not sure if there is a way to more specifically reference Condition so that it can be found. I've performed an individual pairwise.t.test on one of the items which worked with the following code.
pairwise.t.test(List$ItemA$Value, List$ItemA$Condition, p.adjust.method = "none")
However, I assume this would not work within the lapply function because I want it to perform the t.test for ItemA, ItemB, ItemC etc...
The loop I have tried so far is as follows:
for (i in Listdf) {
pairwise.t.test(List$i$logAddedConstant, List$i$Condition, p.adjust = "none")
}
For this I get the error "Error in split.default(X, group) : first argument must be a vector"
I believe this error corresponds to the original splitting of the original dataframe. However I don't quite understand why this error would show up this late in the code because the splitting of the dataframe worked without a problem.
I know I am probably missing something fundamental, but I am quite stumped and have tried multiple options to no avail. If anyone has another idea or suggestion I would be very grateful for the help. Please let me know if I should add some more information.
I made a very short example of a data.frame which is structured like your originaldf:
df <- data.frame(Item = c("A", "B", "C", "A", "B", "C"),
Value=runif(6),
Condition=c("Control","Control","Control", "Test", "Test", "Test"))
Listdf <- split(df, df$Item)
Using a simple for-loop
p <- list()
for (i in seq_along(Listdf)) { # seq_along() is safe even if the list is empty
p[[i]] <- pairwise.t.test(Listdf[[i]]$Value, Listdf[[i]]$Condition, p.adjust.method = "none")
}
Using lapply
p <- lapply(seq_along(Listdf), function(x) {pairwise.t.test(Listdf[[x]]$Value, Listdf[[x]]$Condition, p.adjust.method = "none")})
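As a possible variation, lapply can also iterate over the dataframes themselves rather than over their positions. The original attempt fails because pairwise.t.test(Value, Condition) is evaluated immediately, outside any dataframe, instead of being passed as a function. A sketch under assumed data (the simulated values, the seed, and the names d and res are illustrative, not from the question):

```r
# Simulated data: 2 items, 3 conditions, 4 observations per item/condition pair
set.seed(1)
df <- data.frame(
  Item = rep(c("A", "B"), each = 12),
  Condition = rep(c("Control", "Test", "Placebo"), times = 8),
  Value = rnorm(24)
)
Listdf <- split(df, df$Item)

# Each element d is one data frame; its columns are referenced through d$
res <- lapply(Listdf, function(d) {
  pairwise.t.test(d$Value, d$Condition, p.adjust.method = "none")
})

# Results keep the item names from split(), e.g. the p-value matrix for item A
res$A$p.value
```

Because split() names the list by item, the results can be looked up by name (res$A, res$B) instead of by position.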

Vector gets stored as a dataframe instead of being a vector

I am new to R and RStudio, and I need to create a vector that stores the first 100 rows of the csv file the programme reads. However, despite all my attempts, my variable v1 ends up becoming a dataframe instead of an int vector. May I know what I can do to solve this? Here's my code:
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/
YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/
Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment says:
cup_data_small is a tibble, a slightly modified version of a dataframe that has slightly different rules to try to avoid some common quirks/inconsistencies in standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector, and df[, c("a", "b")] gives you a dataframe - you're using the same syntax so arguably they should give the same type of result.
To get just a vector from a tibble, you have to explicitly pass drop = TRUE, e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]
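Applied to the question's case (first 100 values of one column), pulling the column out with $ or [[ ]] before subsetting should also give a plain vector. A small sketch, using the built-in iris data as a stand-in for the csv file and assuming the tibble package is installed (the names v_dollar, v_brackets and v_drop are illustrative):

```r
library(tibble)

iris_tibble <- as_tibble(iris)

# Extract the column first, then take the first 100 values
v_dollar <- iris_tibble$Sepal.Length[1:100]
v_brackets <- iris_tibble[["Sepal.Length"]][1:100]

# Or subset rows and column in one step and drop the tibble wrapper
v_drop <- iris_tibble[1:100, "Sepal.Length", drop = TRUE]

is.numeric(v_dollar)
is.data.frame(v_dollar)
```

For the question's data the equivalent would presumably be cup_data_small$AGE[1:100].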

Saving data frames to values in a list

I have a list of titles that I would like to iterate over and create/save data frames to. I have tried using the paste() function (as seen below) but that does not work for me. Any advice would be greatly appreciated.
samples <- list("A","B","C")
for (i in samples){
paste(i,sumT,sep="_") <- data.frame(col1=NA,col1=NA)
}
My desired output is three empty data frames named: A_sumT, B_sumT and C_sumT
Here's an answer with purrr.
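library(purrr) # map(), set_names(), and the re-exported %>% pipe
samples <- list("A", "B", "C")
samples %>%
map(~ data.frame()) %>% # one empty data frame per sample
set_names(paste(samples, "sumT", sep = "_")) # name them A_sumT, B_sumT, C_sumT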
Consider creating a list of dataframes rather than many separate objects flooding the global environment, since this example can extend to hundreds of dataframes and not just three. With this approach you also maintain one container capable of running bulk operations across all dataframes.
By using sapply below on a character vector, you create a named list:
samples <- c("A","B","C") # OR unlist(list("A","B","C"))
df_list <- sapply(samples, function(x) data.frame(col1=NA,col2=NA), simplify=FALSE)
# RUN ANY DATAFRAME OPERATION
head(df_list$A)
tail(df_list$B)
summary(df_list$C)
# BULK OPERATIONS
stacked_df <- do.call(rbind, df_list) # stack rows
wide_df <- do.call(cbind, df_list) # bind columns side by side
merged_df <- Reduce(function(x,y) merge(x,y,by="col1"), df_list)
Or, if you need to rename the list:
# RENAME LIST
df_list <- setNames(df_list, paste0(samples, "_sumT"))
# RUN ANY DATAFRAME OPERATION
head(df_list$A_sumT)
tail(df_list$B_sumT)
summary(df_list$C_sumT)
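If separate objects really are required after all, one option might be list2env(), which copies each named list element into an environment. A sketch that writes into a fresh environment so nothing in the workspace is overwritten (env is a hypothetical name; passing globalenv() instead would create the objects at top level):

```r
samples <- c("A", "B", "C")

# Named list of data frames: A_sumT, B_sumT, C_sumT
df_list <- setNames(
  lapply(samples, function(x) data.frame(col1 = NA, col2 = NA)),
  paste0(samples, "_sumT")
)

# Copy each list element into an environment as its own object
env <- new.env()
list2env(df_list, envir = env)

ls(env)
```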

How to assign the output of a sapply loop to the original columns in a data frame without losing other columns

I have a data frame with different columns that contain string answers from different assessors, who used random upper or lower case in their answers. I want to convert everything to lower case. I have code that works as follows:
# Creating a reproducible data frame similar to what I am working with
dfrm <- data.frame(a = sample(names(islands))[1:20],
b = sample(unname(islands))[1:20],
c = sample(names(islands))[1:20],
d = sample(unname(islands))[1:20],
e = sample(names(islands))[1:20],
f = sample(unname(islands))[1:20],
g = sample(names(islands))[1:20],
h = sample(unname(islands))[1:20])
# This is how I did it originally by writing everything explicitly:
dfrm1 <- dfrm
dfrm1$a <- tolower(dfrm1$a)
dfrm1$c <- tolower(dfrm1$c)
dfrm1$e <- tolower(dfrm1$e)
dfrm1$g <- tolower(dfrm1$g)
head(dfrm1) #Works as intended
The problem is that as the number of assessors increases, I keep making copy-paste errors. I tried to simplify my code by writing a function for tolower and using sapply to loop it, but the final data frame does not look like what I wanted:
# function and sapply:
dfrm2 <- dfrm
my_list <- c("a", "c", "e", "g")
my_low <- function(x){dfrm2[,x] <- tolower(dfrm2[,x])}
sapply(my_list, my_low) #Didn't work
# Alternative approach:
dfrm2 <- as.data.frame(sapply(my_list, my_low))
head(dfrm2) #Lost the numbers
What am I missing?
I know this must be a very basic concept that I'm not getting. There was this question and answer that I simply couldn't follow, and this one where my non-working solution simply seems to work. Any help appreciated, thanks!
Maybe you want to create a logical vector that selects the columns to change and run an apply function only over those columns.
# only choose non-numeric columns
changeCols <- !sapply(dfrm, is.numeric)
# change values of selected columns to lower case
dfrm[changeCols] <- lapply(dfrm[changeCols], tolower)
If you have other types of columns, say logical, you could also be more explicit about the types of columns that you want to change. For example, to select only factor and character columns, use:
changeCols <- sapply(dfrm, function(x) is.factor(x) || is.character(x))
For your first attempt, if you want the assignments to your data frame dfrm2 to stick, use the <<- assignment operator:
my_low <- function(x){ dfrm2[,x] <<- tolower(dfrm2[,x]) }
sapply(my_list, my_low)
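The reason the first attempt has no visible effect is that the assignment inside my_low changes a copy of dfrm2 that is local to the function. An alternative that avoids <<- is to assign the lapply result back to just the chosen columns, which leaves the other columns untouched. A minimal sketch with made-up values:

```r
# Small stand-in data frame: two text columns and one numeric column
dfrm2 <- data.frame(
  a = c("Kyushu", "CUBA"),
  b = c(1, 2),
  c = c("JAVA", "Timor"),
  stringsAsFactors = FALSE
)
my_list <- c("a", "c")

# Replace only the selected columns; column b is left as it was
dfrm2[my_list] <- lapply(dfrm2[my_list], tolower)

dfrm2
```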

Correct implementation of lapply

As far as I understand it, when using R it can be more elegant to use functions such as lapply rather than for loops (which are used more often than not in other object-oriented languages). However, I cannot get my head around the syntax and am making foolish errors when trying to implement simple tasks with the command. For example:
I have a series of dataframes loaded from csv files using a for loop. The following dummy dataframes adequately describe the data:
x <- c(0,10,11,12,13)
y <- c(1,NA,NA,NA,NA)
z <- c(2,20,21,22,23)
a <- c(0,6,5,4,3)
b <- c(1,7,8,9,10)
c <- c(2,NA,NA,NA,NA)
df1 <- data.frame(x,y,z)
df2 <- data.frame(a,b,c)
I first generate a list of dataframe names (data_names- I do this when loading the csv files) and then simply want to sum the columns. My attempt of course does not work:
lapply(data_names, function(df) {
counts <- colSums(!is.na(data_names))
})
I could of course use lists (and I realise that in the long run this may be better); however, from a pedagogical point of view I would like to understand lapply better.
Many thanks for any pointers
It's really just your use of is.na and the fact that you don't need to use the assignment operator <- inside the function. lapply returns a list which is the result of applying FUN to each element of the input list. You assign the output of lapply to a variable, e.g. res <- lapply( .... , FUN ).
I'm also not too sure how you made the list initially, but the below should suffice. You also don't need an anonymous function in this case; you can use the named colSums and also provide the na.rm = TRUE argument to take care of pesky NAs in your data:
lapply( list( df1, df2 ) , colSums , na.rm = TRUE )
[[1]]
x y z
46 1 88
[[2]]
a b c
18 35 2
So you can read this as:
For each df in the list:
apply colSums with the argument na.rm = TRUE
The result is a list, each element of which is the result of applying colSums to each df in the list.
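If data_names is a character vector of object names, as the question suggests, lapply would be iterating over strings rather than dataframes. One way to bridge that, sketched here on the assumption that the dataframes exist in the current environment, is mget(), which looks the objects up by name and returns a named list (dfs, sums and counts are illustrative names):

```r
df1 <- data.frame(x = c(0, 10, 11, 12, 13), y = c(1, NA, NA, NA, NA), z = c(2, 20, 21, 22, 23))
df2 <- data.frame(a = c(0, 6, 5, 4, 3), b = c(1, 7, 8, 9, 10), c = c(2, NA, NA, NA, NA))
data_names <- c("df1", "df2")

# Turn the vector of names into a named list of the data frames themselves
dfs <- mget(data_names, envir = environment())

# Column sums, ignoring NAs
sums <- lapply(dfs, colSums, na.rm = TRUE)

# Count of non-missing values per column, using the function argument df
counts <- lapply(dfs, function(df) colSums(!is.na(df)))
```

Note that the anonymous function refers to its argument df; the original attempt referred to data_names inside the function, so every iteration looked at the vector of names rather than a dataframe.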
