Number of elements in an iterator - r

I'm using the isplit command from the iterators package to loop over a data frame. Does anyone know if it's possible to get the number of elements in an iterator?
E.g.,
library(iterators)
df <- data.frame(a = sample(letters[1:26], 100, replace = TRUE), b = runif(100))
df.iter <- isplit(df, df$a, drop = TRUE)

One option would be to convert it to a list with as.list (similar to list(generator) in Python) and get its length
length(as.list(df.iter))
#[1] 26
which is equal to the length from split
length(split(df, df$a, drop = TRUE))
#[1] 26
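If only the count is needed, a minimal base R sketch (assuming the grouping vector has no NA values) is to count the distinct grouping values directly, since split produces one piece per value present. This also leaves the iterator untouched, which may matter because, as far as I understand, as.list walks through the iterator to build the list. The set.seed call and the name n_groups are additions for illustration only.

```r
# Sketch: count the groups without building a list of data frames.
set.seed(42)  # added only to make the example reproducible
df <- data.frame(a = sample(letters[1:26], 100, replace = TRUE), b = runif(100))

# One group per distinct value of the splitting variable
n_groups <- length(unique(df$a))

# Same number as the split-based count
n_groups == length(split(df, df$a, drop = TRUE))
```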

Related

Convert characters like "84+3" into numeric variables using R

I have a large data.frame with several variables like "89+2" (all two-digit integer + one-digit integer) and I'm trying to quickly convert to numeric variables. Realistically, either just eliminating the second numeric OR performing the calculation and adding them together would work... Bit of an R newbie. Any help appreciated.
example:
df$LM = c("91+2", "89+3", "88+2")
Looking for
df$LM_num = c(91, 89, 88)
or
df$LM_num = c(93, 92, 90)
We can use separate
library(tidyr)
library(dplyr)
separate(df, LM, into = c("LM_num1", "LM_num2"), convert = TRUE) %>%
  mutate(LM_num = LM_num1 + LM_num2)
Or with parse_number
library(readr)
df$LM_num <- parse_number(df$LM)
Or another option is eval(parse())
df$LM_num <- sapply(df$LM, function(x) eval(parse(text = x)))
Assuming df is given reproducibly as in the Note at the end, use the first line below to get the first number, or the second line to get the sum. Both read the LM column as if it were a file, splitting on + to create a two-column data frame. The first line extracts the first column, whereas the second adds the two columns. No packages are used.
transform(df, LM_num = read.table(text = LM, sep = "+")[[1]])
transform(df, LM_num = rowSums(read.table(text = LM, sep = "+")))
Note
df <- data.frame(LM = c("91+2", "89+3", "88+2"))
Another option would be:
x <- '92+3'
sum(as.numeric(strsplit(x, split = '+',fixed = TRUE)[[1]]))
In case of having a data.frame:
df <- data.frame(LM = c("91+2", "89+3", "88+2"))
df$sum <- sapply(strsplit(df$LM, split = '+', fixed = TRUE),
                 function(parts) sum(as.numeric(parts)))
# LM sum
# 1 91+2 93
# 2 89+3 92
# 3 88+2 90
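For comparison, here is a small vectorized sketch in base R that produces both requested results at once. It assumes every value has exactly the "number+number" shape described in the question; the column names LM_first and LM_sum are made up for illustration.

```r
# Sketch: split "91+2"-style strings with sub(), assuming exactly one "+" per value.
df <- data.frame(LM = c("91+2", "89+3", "88+2"), stringsAsFactors = FALSE)

first  <- as.numeric(sub("\\+.*$", "", df$LM))  # text before the "+"
second <- as.numeric(sub("^.*\\+", "", df$LM))  # text after the "+"

df$LM_first <- first           # 91, 89, 88
df$LM_sum   <- first + second  # 93, 92, 90
df
```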

Select 1 column when DF has 2 similar column names in R

I have 2 problems. First, I have datasets with 2 column names that are similar. I want to select the first one and not use the second one. The numeric values in the column names are the serial number of the sensor and can vary and they can be in various columns.
How can I select the first column name of the 2 so I can plot it or use it in calculations?
How can I recover those long column names so I can use them? For example, how do I get "Depth_456" to use in depthmax2 without typing it in or making a subset named depth? The problem is the numeric value, which is the serial number of the sensor and changes from instrument to instrument and dataset to dataset. I am trying to write generic code that will work on all the different instruments.
My Data
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)
df2<-data.frame(sapply(df1, function(x) as.numeric(as.character(x))))
temp<- df2[grep("Temp", names(df2), value=TRUE)]
depth<- df2[grep("Depth", names(df2), value=TRUE)]
depthmax<- max(depth, na.rm = TRUE)
depthmax2<- max(df2$"Depth_456", na.rm = TRUE)
This doesn't work
depthmax2<- max(df2$grep("Depth", names(df2), value=TRUE), na.rm = TRUE)
We need [[ instead of $.
max(df2[[ grep("Depth", names(df2), value=TRUE)]], na.rm = TRUE)
#[1] 8
Or another option is startsWith
max(df2[[names(df2)[startsWith(names(df2), "Depth")]]], na.rm = TRUE)
#[1] 8
Also, max works on a vector. If there is more than one match, we may have to loop over the matching columns and get the max of each
sapply(df2[ grep("Depth", names(df2), value=TRUE)], max, na.rm = TRUE)
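For the first part of the question (using only the first of two similarly named columns), one possible sketch is to keep the first match that grep returns. It assumes "first" means first in column order; temp_cols and first_temp are illustrative names.

```r
# Sketch: pick the first of several columns whose names start with "Temp".
df2 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)

temp_cols  <- grep("^Temp", names(df2), value = TRUE)  # all matching names, in column order
first_temp <- temp_cols[1]                             # recovered name of the first match

first_temp
max(df2[[first_temp]], na.rm = TRUE)
```

Because first_temp holds the recovered column name as a string, it can be reused in plots or calculations without typing the serial number.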

In a list of data frames, pad one variable with leading zeros (ideally w/ stringr)

I'm working with a list of data frames. In each data frame, I would like to pad a single ID variable with leading zeros. The ID variables are character vectors and are always the first variable in the data frame. In each data frame, however, the ID variable has a different length. For example:
df1_id ranges from 1:20, thus I need to pad with up to one zero,
df2_id ranges from 1:100, thus I need to pad with up to two zeros,
etc.
My question is, how can I pad each data frame without having to write a single line of code for each data frame in the list.
As mentioned above, I can solve this problem by using the str_pad function on each data frame separately. For example, see the code below:
#Load stringr package
library(stringr)
#Create sample data frames
df1 <- data.frame("x" = as.character(1:20), "y" = rnorm(20, 10, 1),
stringsAsFactors = FALSE)
df2 <- data.frame("v" = as.character(1:100), "y" = rnorm(100, 10, 1),
stringsAsFactors = FALSE)
df3 <- data.frame("z" = as.character(1:1000), "y" = rnorm(1000, 10, 1),
stringsAsFactors = FALSE)
#Combine data frames into list
dfl <- list(df1, df2, df3)
#Pad ID variables with leading zeros
dfl[[1]]$x <- str_pad(dfl[[1]]$x, width = 2, pad = "0")
dfl[[2]]$v <- str_pad(dfl[[2]]$v, width = 3, pad = "0")
dfl[[3]]$z <- str_pad(dfl[[3]]$z, width = 4, pad = "0")
While this solution works relatively well for a short list, as the number of data frames increases, it becomes a bit unwieldy.
I would love if there was a way that I could embed some sort of "sequence" vector into the width argument of the str_pad function. Something like this:
dfl <- lapply(dfl, function(x) {x[,1] <- str_pad(x[,1], width = SEQ, pad = "0")})
where SEQ is a vector of variable lengths. Using the above example it would look something like:
seq <- c(2,3,4)
Thanks in advance, and please let me know if you have any questions.
~kj
You could use Map here, which is designed to apply a function "to the first elements of each ... argument, the second elements, the third elements", see ?mapply for details.
library(stringr)
vec <- c(2,3,4) # this is the vector of 'widths', don't name it seq
Map(function(i, y) {
  dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
  dfl[[i]] # this gets returned
},
# you iterate over these two vectors in parallel
i = 1:length(dfl),
y = vec)
Output
#[[1]]
# x y
#1 01 9.373546
#2 02 10.183643
#3 03 9.164371
#
#[[2]]
# v y
#1 001 11.595281
#2 002 10.329508
#3 003 9.179532
#4 004 10.487429
#
#[[3]]
# z y
#1 0001 10.738325
#2 0002 10.575781
#3 0003 9.694612
#4 0004 11.511781
#5 0005 10.389843
Explanation
The function that we pass to Map is an anonymous function, which more or less you provided in your question:
function(i, y) {
  dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
  dfl[[i]] # this gets returned
}
You see the function takes two arguments, i and y (choose other names if you like, such as df and width), and for each data frame in your list it modifies the first column: dfl[[i]][, 1] <- ... . The anonymous function applies str_pad to the first column of each data frame
... <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
but you see that we don't pass a fixed value to the width argument, but y.
Coming back to Map. Map now applies str_pad to the first dataframe, with argument width = 2, it applies str_pad to the second dataframe, with argument width = 3 and - you probably guessed it - it applies str_pad to the third dataframe in your list, with argument width = 4.
The arguments are specified in the last two lines of the code as
i = 1:length(dfl),
y = vec)
I hope this helps.
Data
(consider creating a minimal example next time, as the number of rows in the data frames is not relevant to the problem)
set.seed(1)
df1 <- data.frame("x" = as.character(1:3), "y" = rnorm(3, 10, 1),
stringsAsFactors = FALSE)
df2 <- data.frame("v" = as.character(1:4), "y" = rnorm(4, 10, 1),
stringsAsFactors = FALSE)
df3 <- data.frame("z" = as.character(1:5), "y" = rnorm(5, 10, 1),
stringsAsFactors = FALSE)
#Combine data frames into list
dfl <- list(df1, df2, df3)
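If the widths should not be typed by hand at all, one possible variation is to derive each width from the data. The sketch below assumes the target width is the length of the longest ID in each data frame, and it pads with base R so that it runs without extra packages; str_pad(ids, width = w, pad = "0") could replace the paste0 line. The helper name pad_ids is hypothetical.

```r
# Sketch: pad the first column of every data frame to the width of its longest ID.
df1 <- data.frame(x = as.character(1:20),  y = 1, stringsAsFactors = FALSE)
df2 <- data.frame(v = as.character(1:100), y = 1, stringsAsFactors = FALSE)
dfl <- list(df1, df2)

pad_ids <- function(d) {
  ids <- d[[1]]
  w <- max(nchar(ids))                                # width taken from the data
  d[[1]] <- paste0(strrep("0", w - nchar(ids)), ids)  # prepend the missing zeros
  d                                                   # return the modified data frame
}

dfl_padded <- lapply(dfl, pad_ids)
head(dfl_padded[[1]]$x, 3)
head(dfl_padded[[2]]$v, 3)
```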

cbind equally named vectors in multiple data.frames in a list to a single data.frame

I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
NULL,
data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments, I'm aware that within R it doesn't make much sense, but for the sake of having a look at the output the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b))
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
Strictly speaking, this is not quite correct for the given example, because the data frames in the list do not all have the same number of rows. But if you don't mind that, you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
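If you don't want recycled elements in the columns you can do
for(i in 1:ncol(newDf)){
  colLength = nrow(l[[i]])
  if(colLength < nrow(newDf)){ # skip full-length columns: (n+1):n would count down and blank the last row
    newDf[(colLength+1):nrow(newDf),i] = NA
  }
}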
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows
Note that I edited my previous code to remove NULL entries from l to simplify things
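A compact alternative sketch that avoids recycling altogether is to drop the NULL entries, pad each column to the longest length with NA, and name the columns by crop type. It uses a smaller made-up list than the one in the question, and the names cols, padded and out are illustrative.

```r
# Sketch: combine one column from each data frame, padding shorter ones with NA.
l <- list(
  data.frame(subst_name = c("C", "A", "B"), crop = "type1"),
  NULL,
  data.frame(subst_name = c("B", "A"), crop = "type2")
)

dfs  <- Filter(Negate(is.null), l)                           # drop NULL entries
cols <- lapply(dfs, function(d) as.character(d$subst_name))  # keep the original order
n    <- max(lengths(cols))

# Assigning a longer length pads a vector with NA
padded <- lapply(cols, function(v) { length(v) <- n; v })
names(padded) <- vapply(dfs, function(d) as.character(d$crop[1]), character(1))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
out
```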

Get the longest element of a list

Suppose you have a list of data.frames like
dfs <- list(
a = data.frame(x = c(1:4, 7:10), a = runif(8)),
b = data.frame(x = 1:10, b = runif(10)),
c = data.frame(x = 1:10, c = runif(10))
)
I would now like to extract the longest data.frame or data.frames in this list. How?
I am stuck at this point:
library(plyr)
lengths <- lapply(dfs, nrow)
longest <- max(lengths)
Two built-in functions in R can solve this, provided lengths is a plain vector (for example from sapply(dfs, nrow)), since max() does not accept the list that lapply returns:
which.max returns the index of the first element that is equal to the maximum
> which.max(lengths)
[1] 2
which returns the indices of all elements that are TRUE
Here:
> which(lengths==longest)
[1] 2 3
Then you can subset your list to the desired elements:
dfs[which(lengths==longest)]
will return b and c in your example.
cnt <- sapply(dfs, nrow)
dfs[cnt == max(cnt)]
Or if you only need the first occurrence of the maximum length:
dfs[which.max(cnt)]
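Putting the pieces together, a self-contained sketch might look like the following. It swaps lapply for vapply so the row counts come back as an integer vector (calling max() on a list raises an error, which is where the code in the question gets stuck), and it drops the random columns to keep the example deterministic. The names n_rows, longest and first_longest are illustrative.

```r
# Sketch: find the longest data frame(s) in a named list.
dfs <- list(
  a = data.frame(x = c(1:4, 7:10)),
  b = data.frame(x = 1:10),
  c = data.frame(x = 1:10)
)

n_rows <- vapply(dfs, nrow, integer(1))    # named integer vector of row counts

longest       <- dfs[n_rows == max(n_rows)]  # all data frames tied for the maximum
first_longest <- dfs[which.max(n_rows)]      # only the first one

names(longest)
names(first_longest)
```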
