Subsetting a dataframe by partial Data point names - r

I have a dataframe of detections from acoustic receivers. I have about 70 receivers and am looking to subset my data by "line" of receivers. Station names are indicated as such: "TRC1-69", "TRC1-180", "TRC2-69", "TRC2-180".... "TRD1-69", "TRD1-180", "TRD2-69", "TRD2-180". Basically I'm trying to get all the C receivers in one dataframe, the D receivers in one dataframe and so on.
This is what I've tried so far
Dline <- AC[rownames(AC) %like% "TRD", ]
or
Dline <- subset(AC, Station == "TRD")

Here's a way:
df1 <- data.frame(
val = 1:8,
row.names = c("TRC1-69", "TRC1-180", "TRC2-69", "TRC2-180",
"TRD1-69", "TRD1-180", "TRD2-69", "TRD2-180"))
split(df1, substr(row.names(df1),3,3))
# $C
# val
# TRC1-69 1
# TRC1-180 2
# TRC2-69 3
# TRC2-180 4
#
# $D
# val
# TRD1-69 5
# TRD1-180 6
# TRD2-69 7
# TRD2-180 8
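split() returns a named list, so each line of receivers can be pulled out by name afterwards. A minimal sketch, reusing df1 from the answer above (the names by_line, Cline and Dline are only illustrative):

```r
df1 <- data.frame(
  val = 1:8,
  row.names = c("TRC1-69", "TRC1-180", "TRC2-69", "TRC2-180",
                "TRD1-69", "TRD1-180", "TRD2-69", "TRD2-180"))

# One data frame per receiver line, keyed by the third character of the name
by_line <- split(df1, substr(row.names(df1), 3, 3))

# Pull out individual lines by name
Cline <- by_line[["C"]]
Dline <- by_line[["D"]]

# Number of rows in each line
sapply(by_line, nrow)
```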

You can use a simple regex via gsub, e.g. (using @Moody_Mudskipper's data set)
split(df1, gsub('(.*)[0-9]+-[0-9]+', '\\1', rownames(df1)))
#$TRC
# val
#TRC1-69 1
#TRC1-180 2
#TRC2-69 3
#TRC2-180 4
#$TRD
# val
#TRD1-69 5
#TRD1-180 6
#TRD2-69 7
#TRD2-180 8
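One caveat with this pattern: the greedy (.*) keeps all but the last digit of the line number, so a name with a two-digit line number would be grouped under a key such as "TRC1". If such names can occur, a sketch that captures only the leading letters avoids this (the station names below are hypothetical, not taken from the question):

```r
stations <- c("TRC1-69", "TRC10-180", "TRD2-69", "TRD12-180")  # hypothetical names

# Greedy (.*) leaves part of a two-digit line number in the group key
gsub('(.*)[0-9]+-[0-9]+', '\\1', stations)
# [1] "TRC"  "TRC1" "TRD"  "TRD1"

# Capturing only the leading letters gives one key per line
prefix <- sub('^([A-Za-z]+).*$', '\\1', stations)
prefix
# [1] "TRC" "TRC" "TRD" "TRD"
```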

We can use grepl in subset when a partial match is needed:
subset(AC, grepl("^TRD", Station))
and to split on that match in one step, giving a list of two data.frames (rows that do not match and rows that do):
lst1 <- split(AC, grepl("^TRD", AC$Station))
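To get one data frame per line of receivers rather than a TRD/non-TRD pair, the grouping key can be the station prefix itself. A minimal sketch, assuming Station is a column of AC; the AC below is a made-up stand-in for the real detections, and line_id is an illustrative name:

```r
# Hypothetical stand-in for the detections data frame
AC <- data.frame(
  Station = c("TRC1-69", "TRC1-180", "TRD1-69", "TRD2-180", "TRC2-69"),
  detections = c(12, 7, 3, 9, 5),
  stringsAsFactors = FALSE)

# Leading letters of each station name ("TRC", "TRD", ...)
line_id <- sub("^([A-Za-z]+).*$", "\\1", AC$Station)

# One data frame per line of receivers
lst1 <- split(AC, line_id)
names(lst1)
# [1] "TRC" "TRD"

Dline <- lst1[["TRD"]]
```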

Related

Rbind data frames with names in a list [duplicate]

I have an issue that I thought would be easy to solve, but I have not managed to find a solution.
I have a large number of data frames that I want to bind by rows. To avoid listing the names of all data frames, I used "paste0" to quickly create a vector of the data frame names. The problem is that I cannot get the rbind function to identify the data frames from this vector of names.
More explicitly:
df1 <- data.frame(x1 = sample(1:5,5), x2 = sample(1:5,5))
df2 <- data.frame(x1 = sample(1:5,5), x2 = sample(1:5,5))
idvec <- noquote(c(paste0("df",c(1,2))))
> [1] df1 df2
What I would like to get:
dftot <- rbind(df1,df2)
x1 x2
1 4 1
2 5 2
3 1 3
4 3 4
5 2 5
6 5 3
7 1 4
8 2 2
9 3 5
10 4 1
What I get instead:
dftot <- rbind(idvec)
> [,1] [,2]
> idvec "df1" "df2"
If there are multiple objects in the global environment with the pattern df followed by digits, one option is using ls to find all those objects with the pattern argument. Wrapping it with mget gets the values in the list, which we can rbind with do.call.
v1 <- ls(pattern='^df\\d+')
`row.names<-`(do.call(rbind,mget(v1)), NULL)
If we know the objects, another option is paste to create a vector of object names and then do as before.
v1 <- paste0('df', 1:2)
`row.names<-`(do.call(rbind,mget(v1)), NULL)
This should give the result:
dfcount <- 2
dftot <- df1 #initialise
for(n in 2:dfcount){dftot <- rbind(dftot, eval(as.name(paste0("df", as.character(n)))))}
eval(as.name(variable_name)) reads the data frames from strings matching their names.
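A compact variant of the same idea, shown as a sketch: mget() fetches the data frames by name, and make.row.names = FALSE (an argument of rbind's data frame method) keeps the row names as 1..n, so no separate row-name reset is needed. The data frames here are small made-up examples:

```r
df1 <- data.frame(x1 = 1:3, x2 = c(4, 5, 6))
df2 <- data.frame(x1 = 7:9, x2 = c(1, 2, 3))

# Character vector of object names (plain strings, no noquote() needed)
idvec <- paste0("df", 1:2)

# Look the objects up by name, then bind them row-wise in one call
dftot <- do.call(rbind, c(mget(idvec), make.row.names = FALSE))
dftot
```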

How to extract parts of a list that have the same name?

I want to extract parts of a list that are also a list into a data frame, but the parts I want have the same name. Here's an example list:
study <- list(type='RCT', samplesize=10, centre=list(date='10/2/2015', type='A'), centre=list(date='20/3/2015', type='C'))
If I use:
sapply('centre', function(x) unname(unlist(study[names(study)==x])), simplify=FALSE)
Then it comes out as a vector:
$centre
[1] "10/2/2015" "A" "20/3/2015" "C"
What I want is:
centre date type
1 10/2/2015 A
2 20/3/2015 C
If you are open for a concise tidyverse/purrr solution, we can use imap_dfr()
library(purrr) # also provides the %>% pipe
study %>% imap_dfr(~ if (.y == "centre") .x)
# A tibble: 2 x 2
# date type
# * <chr> <chr>
# 1 10/2/2015 A
# 2 20/3/2015 C
You can first subset the list to the elements named 'centre', convert them to a data frame, and then add a row index.
data <- data.frame(t(sapply(study[names(study) == 'centre'], unlist)),
row.names = NULL)
data$centre <- 1:nrow(data)
data
# date type centre
#1 10/2/2015 A 1
#2 20/3/2015 C 2
You can also build the data frame with:
data <- do.call(rbind.data.frame, study[names(study) == 'centre'])
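Another base R sketch along the same lines: turn each centre into a one-row data frame and bind the rows. unname() drops the duplicated list names so they are not reused as row names, and the centre index is added afterwards (centres, rows and out are illustrative names):

```r
study <- list(type = 'RCT', samplesize = 10,
              centre = list(date = '10/2/2015', type = 'A'),
              centre = list(date = '20/3/2015', type = 'C'))

# Keep only the elements named "centre"
centres <- unname(study[names(study) == 'centre'])

# One row per centre, then stack the rows
rows <- lapply(centres, as.data.frame, stringsAsFactors = FALSE)
out <- do.call(rbind, rows)

# Add the running centre index as the first column
out <- cbind(centre = seq_len(nrow(out)), out)
out
```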

R - split dataset by row position and save in different files

I have a huge dataset in which several mini datasets were merged. I want to split them into different dataframes and save them. The mini datasets are identified by a variable name (which always includes the string "-gram") on a given row.
I have been trying to construct a for loop, but with no luck.
grams <- read.delim("grams.tsv", header=FALSE) #read dataset
index <- which(grepl("-gram", grams$V1), arr.ind=TRUE) # identify the row positions where each mini dataset starts
index[10] <- nrow(grams) # add the total number of rows as last variable of the vector
start <- c() # initialize vector
end <- c() # initialize vector
for (i in 1:length(index)-1) for ( k in 2:length(index)) {
start[i] <- index[i] # add value to the vector start
if (k != 10) { end[k-1] <- index[k]-1 } else { end[k-1] <- index[k] } # add value to the vector end
gram <- grams[start[i]:end[i],] #subset the dataset grams so that the split mini dataset has start and end that correspond to the index in the vector
write.csv(gram, file=paste0("grams_", i, ".csv"), row.names=FALSE) # save dataset
}
I get an error when I try to subset the dataset:
Error in start[i]:end[i] : argument of length 0
...and I do not understand why! Can anyone help me?
Thanks!
You can cumsum and split:
dat <- data.frame(V1 = c("foo", "bar", "quux-gram", "bar-gram", "something", "nothing"),
V2 = 1:6, stringsAsFactors = FALSE)
dat
# V1 V2
# 1 foo 1
# 2 bar 2
# 3 quux-gram 3
# 4 bar-gram 4
# 5 something 5
# 6 nothing 6
grepl("-gram$", dat$V1)
# [1] FALSE FALSE TRUE TRUE FALSE FALSE
cumsum(grepl("-gram$", dat$V1))
# [1] 0 0 1 2 2 2
spl_dat <- split(dat, cumsum(grepl("-gram$", dat$V1)))
spl_dat
# $`0`
# V1 V2
# 1 foo 1
# 2 bar 2
# $`1`
# V1 V2
# 3 quux-gram 3
# $`2`
# V1 V2
# 4 bar-gram 4
# 5 something 5
# 6 nothing 6
With that, you can write them to files with:
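# row.names must be passed as a named argument; wrapped in list() it would be matched by position instead
ign <- Map(write.csv, spl_dat, sprintf("gram-%03d.csv", seq_along(spl_dat)),
row.names = FALSE)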
An option with group_split and endsWith (endsWith is base R, so only dplyr is needed):
library(dplyr)
dat %>%
group_split(grp = cumsum(endsWith(V1, '-gram')), .keep = FALSE)
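As for the error in the original loop: 1:length(index)-1 is parsed as (1:length(index)) - 1, so i starts at 0 and start[0]:end[0] has zero-length operands, which is what "argument of length 0" reports. The nested loop over k is also unnecessary, because the end of each block follows directly from the start of the next one. A sketch of an index-based version, assuming (as in the question) that every block begins with a "-gram" row; the data below is made up:

```r
grams <- data.frame(V1 = c("1-gram", "a", "b", "2-gram", "c", "3-gram", "d", "e"),
                    stringsAsFactors = FALSE)

# Operator precedence: ":" binds tighter than "-"
1:3 - 1
# [1] 0 1 2
1:(3 - 1)
# [1] 1 2

# Row where each block starts, and the row where it ends
start <- grep("-gram", grams$V1)
end <- c(start[-1] - 1, nrow(grams))

# seq_along() avoids the off-by-one in the loop bounds
pieces <- lapply(seq_along(start), function(i) grams[start[i]:end[i], , drop = FALSE])
sapply(pieces, nrow)
# [1] 3 2 3
```

Each element of pieces can then be written out with write.csv(), as in the answers above.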

r create a column that contains the objects names inside a lapply function

I would like to create a column that contains the object's name inside a lapply function. As a placeholder I call it name.of.x.as.strig.function(). Unfortunately I am not sure how to do it; maybe a combination of assign, do.call and paste. But so far, trying those functions has only led me into deeper trouble, and I am quite sure there is a more R-like solution.
# generates a list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
# subsets the second column into the object data.anova
data.anova <- lapply(data, function(x){x <- x[[2]];
return(matrix(x))})
This should allow me to create a column inside the dataframe that contains its name, for all matrices inside the list
data.anova <- lapply(data, function(x){
x$id <- name.of.x.as.strig.function(x)
return(x)})
I would like to retrieve:
3 one
3 one
3 two
3 two
...
Any input is highly appreciated.
Search history: function to retrieve object name as string, R get name of an object inside lapply...
Can it be that you are just looking for stack?
stack(lapply(data, `[[`, 2))
# values ind
# 1 3 one
# 2 3 one
# 3 3 two
# 4 3 two
# 5 3 tree
# 6 3 tree
# 7 3 four
# 8 3 four
(Or, using your original approach: stack(lapply(data, function(x) {x <- x[[2]]; x})))
If this is the case, melt from "reshape2" would also work.
Loop through the indices of data.anova, and use that to fetch both the data and the names:
data.anova <- lapply(seq_along(data.anova), function(i){
x <- as.data.frame(data.anova[[i]])
x$id <- names(data.anova)[i]
return(x)})
This produces:
# [[1]]
# V1 id
# 1 3 one
# 2 3 one
# [[2]]
# V1 id
# 1 3 two
# 2 3 two
# [[3]]
# V1 id
# 1 3 tree
# 2 3 tree
# [[4]]
# V1 id
# 1 3 four
# 2 3 four
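If the list names should be kept on the result, Map() can walk over the elements and their names in parallel. A sketch under the same setup as the question, shortened to two elements (add_id and stacked are illustrative names):

```r
data <- list(data.frame(c(1, 2), c(3, 3)), data.frame(c(1, 2), c(3, 3)))
names(data) <- c("one", "two")

# Second column of each data frame as a one-column matrix
data.anova <- lapply(data, function(x) matrix(x[[2]]))

# Attach the element name as an id column
add_id <- function(x, nm) {
  x <- as.data.frame(x)
  x$id <- nm
  x
}
data.anova <- Map(add_id, data.anova, names(data.anova))

# Optionally stack everything into one data frame
stacked <- do.call(rbind, c(unname(data.anova), make.row.names = FALSE))
stacked
```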

Create Data Frame and Populate It R

How do I create a fixed size data frame of size [40 2], declare the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't want a row of headers.
(Someone please give me some pointers. I haven't programmed in R for a while and my R skills are terrible to begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
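The same zero-padded labels can also be produced in base R with sprintf(), which avoids the stringr dependency. A small sketch:

```r
# "%02d" pads each integer to width 2 with leading zeros
df.1 <- data.frame(id = sprintf("X%02d", 1:40), value = NA,
                   stringsAsFactors = FALSE)
head(df.1)
```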
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n with each element being a string of k random lowercase letters. rstr(...) does not guarantee that all strings are unique; for n much smaller than 26^k, the probability of at least one duplicate is approximately n^2/(2*26^k), which is about 7e-5 for n = 40 and k = 5.
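If uniqueness must be guaranteed rather than merely likely, one option is to check with anyDuplicated() and redraw until no duplicates remain. A sketch, with rstr_unique as a hypothetical wrapper and rstr written with paste0(..., collapse = "") for brevity:

```r
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}

# Redraw the whole vector until all n strings are distinct
rstr_unique <- function(n, k) {
  stopifnot(n <= 26^k)  # otherwise n distinct strings cannot exist
  repeat {
    s <- rstr(n, k)
    if (anyDuplicated(s) == 0) return(s)
  }
}

set.seed(1)
ids <- rstr_unique(40, 5)
length(unique(ids))
# [1] 40
```

Redrawing everything is only sensible while n is far below 26^k; close to that limit the loop would take a very long time.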
Create the data.frame and define its columns with the values.
The recycling rule repeats the strings to match the 40 rows defined by the second column:
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
I found this way of creating dataframes in R extremely productive and easy:
create a raw vector of values, convert it into a matrix of the required dimensions, and finally name the columns and rows.
dataframe.values = c(value1, value2,.......)
dataframe = matrix(dataframe.values,nrow=number of rows ,byrow = T)
colnames(dataframe) = c("column1","column2",........)
row.names(dataframe) = c("row1", "row2",............)
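A runnable version of that outline might look as follows, with made-up values and names standing in for the elided ones. Note that matrix() returns a matrix in which every cell has the same type, so as.data.frame() is needed at the end to get an actual data frame:

```r
# Hypothetical values, filled row by row
dataframe.values <- c(1, 2,
                      3, 4,
                      5, 6)
m <- matrix(dataframe.values, nrow = 3, byrow = TRUE)
colnames(m) <- c("column1", "column2")
rownames(m) <- c("row1", "row2", "row3")

# Convert the matrix to a data frame
dataframe <- as.data.frame(m)
dataframe
```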
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5
