Split Data Frame and call subframe rows by their index - r

This is a very basic R programming question but I haven't found the answer anywhere, would really appreciate your help:
I split my large dataframe into 23 subframes of 4 rows in length as follows:
DataframeSplits <- split(Dataframe,rep(1:23,each=4))
Say I want to call the second subframe I can:
DataframeSplits[2]
But what if I want to call a specific row of that subframe (using index position)?
I was hoping for something like this (say I call 2nd subframe's 2nd row):
DataframeSplits[2][2,]
But that doesn't work with the error message
Error in DataframeSplits[2][2, ] : incorrect number of dimensions

If you want to subset the list returned by split and use it for later subsetting, you must use double square brackets ([[ ]]) to get at the sub-data.frame. Then you can subset that one with single brackets as you already tried:
Dataframe <- data.frame(x = rep(c("a", "b", "c", "d"), 23), y = 1)
DataframeSplits <- split(Dataframe,rep(1:23,each=4))
DataframeSplits[[2]][2,]
# x y
# 6 b 1
More info on subsetting can be found in the excellent book by Hadley Wickham.
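To see why the double brackets matter, here is a quick sketch using the toy data above: single brackets on a list return a (shorter) list, while double brackets return the element itself, which is why the row subscript only works after [[ ]].

```r
# Setup from the answer above
Dataframe <- data.frame(x = rep(c("a", "b", "c", "d"), 23), y = 1)
DataframeSplits <- split(Dataframe, rep(1:23, each = 4))

# Single brackets return a one-element list; double brackets return the
# data frame itself
class(DataframeSplits[2])    # "list"
class(DataframeSplits[[2]])  # "data.frame"

# So the chained call needs [[ ]] first, then [ , ]
DataframeSplits[[2]][2, ]    # second row of the second subframe
```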

Related

Recursive indexing only works up to [[1:3]]

I need to refer to individual dataframes within a list of dataframes (one by one) produced from a lapply function, but I'm getting the "recursive indexing failed at level 3" error. I've found similar questions, but none of them explain why this doesn't work.
I used lapply to make a list of dataframes, each with a different filter applied. The output in my reproducible example has 4 dataframes in the output (dfs). Now I want to refer to each dataframe in turn by indexing its position in the list.
If I use the format c(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]) I get the output that I want, and it works for the next function I need to apply, but it seems very inefficient.
When I try to shorten it by using c(dfs[[1:4]]) instead, I get the error Error in data1[[1:4]] : recursive indexing failed at level 3. If I try c(dfs[[1:3]]), it runs, but doesn't give the output I expect (no longer a list of dataframes).
Here's an example:
library(tidyverse) # for glimpse, filter, mutate
data(mtcars)
mtcars2 <- mutate(mtcars, var = rep(c("A", "B", "C", "D"), len = 32)) # need a variable with more than 3 possible outcomes
glimpse(mtcars2)
list <- c("A", "B", "C", "D") # each new dataframe will filter based on these variables
dfs <- lapply(list, function(x) {
mtcars2 %>% filter(var == x) %>% glimpse()
}) # each dataframe now only contains A, B, C, or D
dfs # list of dataframes produced from lapply
dflist1 <- list(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]) # indexing one by one
dflist1 # this is what I want
dflist2 <- list(dfs[[1:4]]) # indexing all together
dflist2 # this produces an error
dflist3 <- list(dfs[[1:3]])
dflist3 # this runs, but the output is just `[[1]] [1] 4`, not a list of dataframes
I want something that looks like the output from dflist1 but that doesn't require me to add and remove list items every time the number of dataframes changes. I can't use the lapply output (dfs) as it is because my next function can't locate the variables within each dataframe as needed.
Any guidance appreciated.
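For what it's worth, single-bracket subsetting does what the dfs[[1:4]] attempt was reaching for: dfs[1:4] returns a sub-list of data frames, whereas a vector inside [[ ]] indexes recursively. A minimal base-R sketch of the same setup (transform/subset used here as a stand-in for the tidyverse pipeline above):

```r
# Rebuild the example: one data frame per level of a grouping variable
mtcars2 <- transform(mtcars, var = rep(c("A", "B", "C", "D"), length.out = 32))
dfs <- lapply(c("A", "B", "C", "D"), function(x) subset(mtcars2, var == x))

# Single brackets keep the list structure: a list of all four data frames,
# equivalent to list(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]])
dflist <- dfs[1:4]
length(dflist)  # 4

# Double brackets recurse instead: dfs[[c(1, 2, 3)]] means
# dfs[[1]][[2]][[3]], which is why dfs[[1:4]] fails at level 3
```

Because dfs[1:4] (or simply dfs) is already the desired list, the number of data frames can change without editing any code.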

Iterate over a list of items, count matches in data frame

Frustrated newbie question in R:
Say I have a list of strings = ("a", "b", "c"), and a data frame with a column df$stuff.
I want to loop through each string in the list, count the number of times that string appears in df$stuff, and add it cumulatively. In other words, the number of times "a" appears, plus the number of times "b" appears, plus the number of times "c" appears. I've tried count, table, and aggregate functions, and all I get is errors.
There simply has to be a nice clean way of doing this.
It is difficult to answer this without a sample of your data and what you want the output to look like, but I will try. First I will make a guess at what your data look like:
df <- data.frame(stuff = sample(letters[1:5], 30, replace = TRUE))
strings <- letters[1:3]
To get the counts of strings in df[["stuff"]] you can use table and then index into the table with strings.
table(df[["stuff"]])[strings]
I had a different idea about what was being asked, so I will give it a shot too.
strings = c("a", "b", "c")
stuff = c("the cat", "the bat", "the dog")
sapply(strings, function(s) length(grep(s, stuff)))
a b c
2 1 1
Gets you the number of matches for each string. So
sum(sapply(strings, function(s) length(grep(s, stuff))))
gives you the sum of all of those.
Is that what you wanted?
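If the entries in df$stuff are exact values rather than longer strings to search within, %in% gives the cumulative count in one step. A sketch with made-up data:

```r
# Hypothetical data: a column of single-letter values
df <- data.frame(stuff = c("a", "b", "a", "c", "d", "b"))
strings <- c("a", "b", "c")

# TRUE for each row whose value is one of the target strings; sum counts them
sum(df$stuff %in% strings)  # 5
```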

How to assign a subset from a data frame `a' to a subset of data frame `b'

It might be a trivial question (I am new to R), but I could not find an answer to my question, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag from the subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely, key and value. The subsets are defined by id = n, according to n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
indexer = df$id == i
# here is how I tried to update the data frame:
df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NA's. I got no errors, but the following warning message:
In '[<-.factor'('tmp', df$id == i, value = c(NA, :
invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values caused match() to produce misplaced updates in a number of rows. I also simulated the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look like?):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
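A sketch of the merge() route, recreating df and aux from the question (stringsAsFactors = FALSE is added here to sidestep the factor-level warning from the question; note that merge() can reorder rows and will duplicate them if a key repeats in aux):

```r
# Lookup table and data as described in the question
aux <- data.frame(key   = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)
df  <- data.frame(id  = rep(1:4, 3),
                  tag = rep(c("aaa", "bbb", "rrr", "fff"), 3),
                  stringsAsFactors = FALSE)

# Join on the key column; sort = FALSE avoids sorting by the join key
merged <- merge(df, aux, by.x = "tag", by.y = "key", sort = FALSE)
merged$tag <- merged$value   # overwrite tag with the looked-up value
merged$value <- NULL         # drop the helper column
```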
It turned out that my data was breaking all the available built-in functions, giving me a wrong dataset in the end. So my (preliminary) solution was the following:
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = T) to get a complete data frame with the results.
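That workflow can be sketched roughly as follows. rbindlist() comes from the data.table package, and the per-subset update step here just reuses the match() logic; df and aux are recreated from the question:

```r
library(data.table)  # for rbindlist()

aux <- data.frame(key   = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)
df  <- data.frame(id  = rep(1:4, 3),
                  tag = rep(c("aaa", "bbb", "rrr", "fff"), 3),
                  stringsAsFactors = FALSE)

# Steps 1 and 2: process each id-subset individually and collect the
# results in a list
a.list <- lapply(unique(df$id), function(i) {
  sub <- df[df$id == i, ]
  sub$tag <- aux$value[match(sub$tag, aux$key)]
  sub
})

# Step 3: bind everything back into a single data frame
result <- rbindlist(a.list, use.names = TRUE)
```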

R: not meaningful as factors

What is best practice to handle this particular problem when it comes up? For example, I have created a dataframe:
dat<- sqlQuery(con,"select * from mytable")
in which my table looks like:
ID RESULT GROUP
-- ------ -----
1 Y A
2 N A
3 N B
4 Y B
5 N A
in which ID is an int, and RESULT and GROUP are both factors.
The problem is that when I want to do something like:
tapply(dat$RESULT,dat$GROUP,sum)
I get complaints about the column being a factor:
Error in Summary.factor(c(2L,2L,2L,2L,1L,2L,1L,2L,2L,1L,1L, :
sum not meaningful for factors
Given that factors are essential for use in things like ggplot2, how does everyone else handle this?
Setting stringsAsFactors=FALSE and rerunning gives
tapply(dat$RESULT,dat$GROUP,sum)
Error in FUN(X[[1L]], ...) : invalid "type" (character) or argument
so I'm not sure merely setting stringsAsFactors=FALSE is the right approach
I assume you want to sum up the "Y"s in the RESULT column.
As suggested by @akrun, one possibility is to use table():
with(dat,table(GROUP,RESULT))
If you want to stick with tapply(), you can convert the RESULT column to a logical vector:
dat$RESULT <- dat$RESULT=="Y"
tapply(dat$RESULT,dat$GROUP,sum)
If your goal is to have some columns as factors and others as strings, you can convert only selected columns of the result to factors, e.g. with
dat<- sqlQuery(con,"select ID,RESULT,GROUP from mytable",as.is=2)
As in the read.table man page (recalled by the sqlQuery man page) : as.is is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
But then again, you need either to use table() or to turn the column into a logical.
I'm not clear what your question is, either. If you're just trying to sum the Y's, how about:
library(dplyr)
df <- data.frame(ID = 1:5,
                 RESULT = as.factor(c("Y", "N", "N", "Y", "N")),
                 GROUP = as.factor(c("A", "A", "B", "B", "A")))
df %>%
  mutate(logRes = (RESULT == "Y")) %>%
  summarise(sum = sum(logRes))

Counting non-missing occurrences

I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled getting it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not hold my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
Error in `[.data.frame`(observations_subset, myvars) : undefined columns selected
Error: object 'answer' not found
Lastly, I'm not sure how to count occurrences. Excel has a simple Count function, and in SPSS you can aggregate based on the count, but I couldn't find a similarly named command in R. The roundabout way I was going to do this, once I had the data subsetted, was to add a column of nothing but 1's and sum those, but I imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
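For example, continuing with the built-in iris data:

```r
data(iris)

# Proportions instead of raw counts
prop.table(table(iris$Species))

# Second argument gives a cross-tab of two variables
table(iris$Species, iris$Petal.Length > 4)

# useNA counts missing values as their own category
x <- c("a", "b", NA, "a")
table(x, useNA = "ifany")
```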
Not sure whether this is what you wanted.
Creating some data, since the post mentioned multiple files:
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with datasets as the list elements
l1 <- mget(ls(pattern="d\\d+"))
Create an index to pick the list element that has the most non-missing values
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
Names of the columns to subset from the larger (most complete) dataset
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"
