Create a list from a dataframe in R - r

Consider the following dataframe:
test.df <- data.frame(a = c("1991-01-01","1991-01-01","1991-02-01","1991-02-01"), b = rnorm(4), c = rnorm(4))
I would like to create a list from test.df. Each element of the list would be a subset dataframe of test.df corresponding to a specific value of column a, i.e. each date. In this case, column a takes the unique values 1991-01-01 and 1991-02-01. Therefore, the resulting list would consist of two elements: the subset of test.df where a = 1991-01-01 (excluding column a), and the subset of test.df where a = 1991-02-01 (excluding column a). Here is the output I am looking for:
lst <- list(test.df[1:2,2:3], test.df[3:4,2:3])
Note that the subset dataframes may not have the same number of rows.
In my real practical example, column a is a date column with many more values.
I would appreciate any attempt of help! Thanks a lot!

You can use split
lst <- split(test.df, test.df$a)
If you want to get rid of column a, use split(test.df[-1], test.df$a) (thanks to @akrun for the comment).
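A minimal sketch of the split() approach with groups of unequal size, since the question notes that the subsets may differ in row count. The five-row data and the seed are made up for illustration:

```r
# Illustrative data: two rows for the first date, three for the second
set.seed(42)
test.df <- data.frame(
  a = c("1991-01-01", "1991-01-01", "1991-02-01", "1991-02-01", "1991-02-01"),
  b = rnorm(5),
  c = rnorm(5)
)

# Drop column a from the pieces, but still group by its values
lst <- split(test.df[-1], test.df$a)

names(lst)         # one list element per unique date
sapply(lst, nrow)  # group sizes: 2 and 3
```

Each element is still a data.frame, and the list is named by the values of a, so lst[["1991-02-01"]] retrieves the rows for that date.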

You can use the following code:
sapply(union(test.df$a,NULL), function(y,x) x[x$a==y,], x=test.df, simplify=FALSE)

You could also use the dlply function in the plyr package:
> library(plyr)
> dlply(test.df, .(a))
$`1991-01-01`
a b c
1 1991-01-01 1.3658775 0.9805356
2 1991-01-01 -0.2292211 2.2812914
$`1991-02-01`
a b c
1 1991-02-01 -0.2678131 0.5323250
2 1991-02-01 0.3736910 0.4988308
Or the data.table package:
> library(data.table)
> setDT(test.df)
> dt <- test.df[, list(list(.SD)), by = a]$V1
> names(dt) <- unique(test.df$a)
> dt
$`1991-01-01`
b c
1: 1.3658775 0.9805356
2: -0.2292211 2.2812914
$`1991-02-01`
b c
1: -0.2678131 0.5323250
2: 0.3736910 0.4988308

Related

Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a = list(as.numeric(c(NA,NA)), 2,as.numeric(c(3, NA)), c(4,5) )
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
This alternative also works for this data, but you will get coercion warnings, and it would wrongly drop any row whose vector contains a 0:
dt[sapply(dt$a, all)]
Better approach (thanks to r2evans, see comments)
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
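To see why the first attempt in the question removes nothing, it may help to look at the list column on its own. The sketch below uses only base R: is.na() on a list reports TRUE only for elements that are a single NA, so vectors that merely contain an NA are not flagged, whereas anyNA() applied per element is.

```r
a <- list(c(NA_real_, NA_real_), 2, c(3, NA), c(4, 5))

# is.na() on a list: no element is a length-one NA, so nothing is flagged
is.na(a)
# [1] FALSE FALSE FALSE FALSE

# Checking each element for any NA gives the mask we actually want
keep <- !vapply(a, anyNA, logical(1))
keep
# [1] FALSE  TRUE FALSE  TRUE
```

The logical vector keep can then be used as the row filter, which is what dt[!sapply(a, anyNA)] does.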
A third option that you might prefer: move the logic into a separate helper function that takes a list of vectors (nl) and returns a logical vector of length equal to length(nl), then apply that function as below. In this example I explicitly call unlist() on the result of lapply() rather than letting sapply() simplify it for me, but sapply() would work as well.
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
An alternative to *apply()
dt[, .SD[!anyNA(a, TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14

Which() for the whole dataset

I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] "character"
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
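A small base R sketch of the difference, using the vectors from the question. stringsAsFactors = FALSE is set explicitly here so the result does not depend on the R version:

```r
data1 <- c(1, 2, 1)
data2 <- c("no", "no", "yes")

# A matrix holds a single type, so the numbers are coerced to strings
m <- matrix(c(data1, data2), ncol = 2)
typeof(m)
# [1] "character"

# A data.frame keeps one type per column
df <- data.frame(data1, data2, stringsAsFactors = FALSE)
sapply(df, typeof)
#       data1       data2
#    "double" "character"
```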
I would create your data as:
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
[1] 2 3 1
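If you would rather avoid extra packages, one possible base R sketch is to build a single key string per row and use match(). This assumes the separator (here "|", an arbitrary choice) never occurs in the data; the helper name row_key is made up for illustration.

```r
crit1 <- c(1, 1, 2)
crit2 <- c("yes", "no", "no")
data1 <- c(1, 2, 1)
data2 <- c("no", "no", "yes")

# Hypothetical helper: collapse two columns into one key per row.
# Assumes "|" does not appear in either column.
row_key <- function(x, y) paste(x, y, sep = "|")

# For each row of the data, find the position of the same key in the cases
idx <- match(row_key(data1, data2), row_key(crit1, crit2))
idx
# [1] 2 3 1
```

Rows of the data with no matching case would come back as NA.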

R selecting rows by conditions given in an external table

Given the following data
data_min <- data.frame("cond"=c("a","b","c"),"min"=c(1,3,1))
data <- data.frame("cond"=c("a","b","b","a","c"),"val"=c(0,2,4,7,0))
I would like to select all rows from data for which the value in val is bigger than the minimum value specified in data_min for that condition. Thus, in the given example, I expect to end up with a table
cond val
b 4
a 7
So far, I have tried
datanew <- data[which(data$cond==data_min$cond & data$val > data_min$min),]
which gives me a 7 but not b 4. I have two questions: (1) why do I get the result I get, and (2) how do I get the desired result?
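You need to use match because the data.frames have different numbers of rows, so == compares the two cond columns position by position (recycling the shorter one) instead of looking up each condition:
data[data$val > data_min$min[match(data$cond, data_min$cond)], ]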
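To answer the "why" part, here is a sketch of what the original comparison actually computes, next to the match() lookup. The intermediate names idx and row_min are introduced only for illustration, and stringsAsFactors = FALSE is set so the columns are plain character vectors in any R version.

```r
data_min <- data.frame(cond = c("a", "b", "c"), min = c(1, 3, 1),
                       stringsAsFactors = FALSE)
data <- data.frame(cond = c("a", "b", "b", "a", "c"), val = c(0, 2, 4, 7, 0),
                   stringsAsFactors = FALSE)

# The 3-element vector is recycled to a,b,c,a,b and compared position by
# position (R warns that the lengths are not multiples of each other)
suppressWarnings(data$cond == data_min$cond)
# [1]  TRUE  TRUE FALSE  TRUE FALSE

# match() instead looks up, for every row of data, its row in data_min
idx <- match(data$cond, data_min$cond)   # 1 2 2 1 3
row_min <- data_min$min[idx]             # 1 3 3 1 1

result <- data[data$val > row_min, ]
result
#   cond val
# 3    b   4
# 4    a   7
```

In the original attempt, row 3 (b 4) is lost because it is compared against "c" rather than "b", while row 4 (a 7) happens to line up with a recycled "a".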
You could just merge the two data frames together to make things easier:
> m=merge(data,data_min,by='cond')
> m[which(m$val > m$min), c('cond','val')]
cond val
2 a 7
4 b 4
A solution using dplyr. We can perform a join first and then filter the condition between the val and min column.
library(dplyr)
data2 <- data %>%
left_join(data_min, by = "cond") %>%
filter(val > min) %>%
select(-min)
data2
cond val
1 b 4
2 a 7

R - number of unique values in a column of data frame

For a dataframe df, I need to find the number of unique values in some_col. I tried the following:
length(unique(df["some_col"]))
but this is not giving the expected results. However length(unique(some_vector)) works on a vector and gives the expected results.
Some preceding steps while the df is created
df <- read.csv(file, header=T)
typeof(df) #=> "list"
typeof(unique(df["some_col"])) #=> "list"
length(unique(df["some_col"])) #=> 1
Try with [[ instead of [. [ returns a list (a data.frame in fact), [[ returns a vector.
df <- data.frame( some_col = c(1,2,3,4),
another_col = c(4,5,6,7) )
length(unique(df[["some_col"]]))
#[1] 4
class( df[["some_col"]] )
[1] "numeric"
class( df["some_col"] )
[1] "data.frame"
You're getting a value of 1 because the list is of length 1 (1 column), even though that 1 element contains several values.
You need to use
length(unique(unlist(df[c("some_col")])))
When you select a column with df[c("some_col")] or df["some_col"], you get a list (a one-column data.frame). unlist() converts it into a vector that you can work with easily. When you select a column with df$some_col, you get the column as a vector directly.
I think you might just be missing a ,
Try
length(unique(df[,"some_col"]))
In response to comment :
df <- data.frame(cbind(A=c(1:10),B=rep(c("A","B"),5)))
df["B"]
Output :
B
1 A
2 B
3 A
4 B
5 A
6 B
7 A
8 B
9 A
10 B
and
length(unique(df["B"]))
Output:
[1] 1
Which is the same incorrect/undesirable output as the OP posted
HOWEVER With a comma ,
df[,"B"]
Output :
[1] A B A B A B A B A B
Levels: A B
and
length(unique(df[,"B"]))
Now gives you the correct/desired output by the OP. Which in this example is 2
[1] 2
The reason is that df["some_col"] returns a data.frame, and length() of a data.frame counts its columns, which is 1 here, while df[,"some_col"] returns a vector, and length() of a vector returns its number of elements. So a comma (,) makes all the difference.
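A short base R sketch of the point about length(), using made-up data with one repeated value so that the counts differ:

```r
df <- data.frame(some_col = c(1, 2, 2, 4), another_col = c(4, 5, 6, 7))

length(df)                        # 2: a data.frame's length is its column count
length(unique(df["some_col"]))    # 1: still a one-column data.frame
length(unique(df[["some_col"]]))  # 3: unique values of the underlying vector
nrow(unique(df["some_col"]))      # 3: unique rows of the one-column data.frame
```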
Using the tidyverse:
library(dplyr)
df %>%
select("some_col") %>%
n_distinct()
The data.table package contains the convenient shorthand uniqueN. From the documentation
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient.
You can use it with a data frame:
df <- data.frame(some_col = c(1,2,3,4),
another_col = c(4,5,6,7) )
data.table::uniqueN(df[['some_col']])
[1] 4
or if you already have a data.table
dt <- setDT(df)
dt[,uniqueN(some_col)]
[1] 4
Here is another option, using dplyr:
df %>%
distinct(column_name) %>%
count()
or the same without the pipe:
count(distinct(df, column_name))
Benchmarks posted online suggest that distinct() is fast.

Select column dataframe index R

I have a data frame df like this
1 2 3 4
A B C A
where the colnames are {1,2,3,4}. I would like to select one of the columns of the data frame according to an index that I set externally
colf <- as.numeric(mo)
fmo <- df[[colf]]
Many thanks,
First things first: I don't recommend having numbers as column names. That said, this should help you out.
> df <- data.frame("1"="A","2"="B","3"="C")
> df
X1 X2 X3
1 A B C
> df$X1 #Get column by name
[1] A
Levels: A
> df[,1] #Get first column
[1] A
Levels: A
Treat the data frame as a matrix and index it using [row,column] notation, i.e.
fmo = df[,colf]
This will always get column number colf.
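A small sketch of selecting by an externally supplied index. The value of mo is hypothetical (the question does not show it), and check.names = FALSE is used so the column names stay as "1" to "4":

```r
df <- data.frame(`1` = "A", `2` = "B", `3` = "C", `4` = "A",
                 check.names = FALSE, stringsAsFactors = FALSE)

mo <- "3"                # hypothetical external value, stored as text
colf <- as.numeric(mo)

df[[colf]]   # by position: third column
# [1] "C"
df[, colf]   # matrix-style indexing, also by position
# [1] "C"
df[[mo]]     # by name: the column called "3"
# [1] "C"
```

Position and name coincide here only because the columns are named in order; if they were not, a numeric index and a character index would pick different columns.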
