Remove columns in a data frame by partial column name matching in R

I would like to subset my data frame by selecting columns whose names partially match a string. This works when I have a single "name" to match.
The data frame is:
ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA
library(stringr)
df[str_detect(names(df), "ABBA" )]
works, and returns:
ABBA01A ABBA01B ABBA02A ABBA02B
1908 NA NA NA NA
So, I would like to create a dataframe for each of my species:
Speciesnames=unique ( substring (names(df),0, 4))
Speciesnames
[1] "ABBA" "ACRU" "ARCU" "PIAB" "PIGL"
I have tried to write a loop and use [i] as the species name, but the str_detect function does not recognise it.
I would also like to add additional calculations in the loop:
for ( i in seq_along(Speciesnames)){
df=df[str_detect(names(df), pattern =[i])]
print(df)
#my function for the subsetted dataframe
}
thank you for your help!

Using your data you could do the following:
1. Create a list to hold the data.frames to be created.
2. Filter the data.frames and store them in the list.
3. Give each data.frame the name of its species.
4. Bring all the data.frames out of the list into the global environment.
library(dplyr)  # provides %>%, select() and starts_with()
Speciesnames <- unique(substring(names(df), 1, 4))
data <- vector("list", length(Speciesnames))
for(i in seq_along(Speciesnames)) {
data[[i]] <- df %>% select(starts_with(Speciesnames[i]))
}
names(data) <- Speciesnames
list2env(data, envir = globalenv())
The end result after list2env is two data.frames called "ABBA" and "ACRU", which you can then access directly. If further manipulation is needed, you might leave everything in the list and do it there.
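Keeping everything in the list, as suggested above, could look roughly like the following sketch. The toy values and the row-mean calculation are assumptions for illustration only; the real per-species function would go in its place.

```r
# Toy data with the same column-name layout as in the question (values are made up)
df <- data.frame(ABBA01A = c(1, 2), ABBA01B = c(3, 4),
                 ACRU01A = c(5, 6), ACRU01B = c(7, 8))

species_names <- unique(substring(names(df), 1, 4))

# One data frame per species, kept in a named list;
# drop = FALSE keeps a data frame even if only one column matches
by_species <- lapply(species_names, function(sp) {
  df[, startsWith(names(df), sp), drop = FALSE]
})
names(by_species) <- species_names

# Hypothetical per-species calculation: row means across that species' columns
species_means <- lapply(by_species, rowMeans, na.rm = TRUE)
```

Note that the species name is looked up as species_names[i] (or passed in as sp here), which is the piece missing from pattern =[i] in the original loop.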

An option is to use mapply with SIMPLIFY=FALSE to return a list of data frames, one for each species. The startsWith function from base R lets you subset the columns whose names start with the species name.
# First find the species by taking the unique first 4 characters of the column names
species <- unique(gsub("([A-Z]{4}).*", "\\1",names(df)))
# Pass each species
listOfDFs <- mapply(function(x){
df[,startsWith(names(df),x)] # Return only columns starting with species
}, species, SIMPLIFY=FALSE)
listOfDFs
# $ABBA
# ABBA01A ABBA01B ABBA02A ABBA02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
#
# $ACRU
# ACRU01A ACRU01B ACRU02A ACRU02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
Data:
df <- read.table(text =
"ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)

I think that you should select all matching columns first, and then subselect your data.frame.
patterns <- c("ABB", "CDC")
res <- lapply(patterns, function(x) grep(x, colnames(df), value=TRUE))
df[, unique(unlist(res))]
res is a list of the matched column names for each pattern.
The next step is to take the unique set of columns, unique(unlist(res)), and subset the data.frame with it.
This is probably not the best approach for production code.
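As a variation on the same idea, the patterns can be combined into one regular expression so that a single grepl call picks the columns. This is only a sketch: the toy data and the choice to anchor the match at the start of the name (^) are assumptions, not part of the answer above.

```r
# Toy data; column names follow the species-prefix layout from the question
df <- data.frame(ABBA01A = 1:2, ACRU01A = 3:4, PIAB01A = 5:6)

patterns <- c("ABBA", "PIAB")

# Build "^(ABBA|PIAB)" so that only names starting with a pattern match
regex <- paste0("^(", paste(patterns, collapse = "|"), ")")

keep <- grepl(regex, names(df))
subset_df <- df[, keep, drop = FALSE]
```

Here drop = FALSE keeps the result a data frame even when only one column matches.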

Related

How can I filter into a new data frame conditional on the columns existing in another data frame?

So I have two dataframes.
df1:
Date Bond2 Bond3 Bond5 Bond6 Bond7 Bond8 Bond10 Bond11 Bond12 Bond14 Bond15 Bond16 Bond17
1 41275 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 41276 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 41277 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 41278 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 41279 NA NA NA NA NA NA NA NA NA NA NA NA NA
df2:
Date Bond2 Bond3 Bond4 Bond5 Bond6 Bond7 Bond8 Bond10 Bond11 Bond12 Bond14 Bond16 Bond17 Bond19
1 41275 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 41276 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 41277 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 41278 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I want to create a new df3 which is df1 where all the columns are contained in df2, and then a df4 which is df2 where all the columns are contained within df1.
I was thinking along the lines of filter(df1, names() %in% names(df2)) or select(names(df1) %in% names(df2)), but neither works.
Thanks
We may use intersect on the column names of 'df1' and 'df2', and use the result to subset the columns of df1 and df2 to create df3 and df4 respectively:
nm1 <- intersect(names(df1), names(df2))
df3 <- df1[nm1]
df4 <- df2[nm1]
If you want to stick to dplyr, one way would be:
df3 <- df1 %>% select(any_of(names(df2)))
df4 <- df2 %>% select(any_of(names(df1)))
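A small base R check of the intersect approach, using made-up data frames that are much narrower than the ones in the question:

```r
# Toy data frames (hypothetical columns); they share Date, Bond2 and Bond3
df1 <- data.frame(Date = 1:2, Bond2 = NA, Bond3 = NA, Bond15 = NA)
df2 <- data.frame(Date = 1:2, Bond2 = NA, Bond4 = NA, Bond3 = NA)

# Column names present in both, in the order they appear in df1
common <- intersect(names(df1), names(df2))

df3 <- df1[common]  # df1 restricted to the shared columns
df4 <- df2[common]  # df2 restricted to the shared columns
```

Because both subsets are indexed by the same vector, df3 and df4 end up with identical column order. If df2's own column order should be preserved instead, df2[names(df2) %in% names(df1)] is one alternative.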

Create a new data frame

I have a data frame with only one column, which contains some names. I need to change this data frame.
I created a vector with some places:
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
How can I add columns to this data frame, one for each name in this vector?
Is your one-column data frame really a vector? You can convert a vector to a data.frame and add columns. I usually add columns filled with NA and fill in the values later. Check this example:
vtr <-c(1:6)
df <- as.data.frame(vtr)
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
df[,2:(length(voos_inter)+1)] <- NA
names(df)[2:(length(voos_inter)+1)] <- voos_inter
df
vtr PUJ SCL EZE MVD ASU VVI
1 1 NA NA NA NA NA NA
2 2 NA NA NA NA NA NA
3 3 NA NA NA NA NA NA
4 4 NA NA NA NA NA NA
5 5 NA NA NA NA NA NA
6 6 NA NA NA NA NA NA
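The same result can probably be reached a little more directly by assigning NA to the new column names in one step. This is a sketch of an alternative, not the method used in the answer above:

```r
vtr <- 1:6
df <- data.frame(vtr = vtr)
voos_inter <- c("PUJ", "SCL", "EZE", "MVD", "ASU", "VVI")

# Assigning to column names that do not exist yet creates them,
# filled with NA and already named
df[voos_inter] <- NA
```

This avoids computing the column positions 2:(length(voos_inter)+1) by hand.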

error in reading a csv file

I have been facing an error while reading a csv file. The first few lines of the file are given below:
"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA
Reading the csv file:
test <- read.csv(file='/home/userxyz/test.csv')
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying the presence of the header.
test <- read.csv(file='/home/userxyz/test.file', header=T)
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying row.names didn't work:
test <- read.csv(file='/home/userxyz/test.file', row.names=T)
#Error in read.table(file = file, header = header, sep = sep, quote = quote, :
# invalid 'row.names' specification
I have also looked at the read.table and read.delim functions.
Is the error because of special characters in the row.names?
I think you are trying to read in the first column as row names. Try:
x <- '"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA'
read.csv(text = x, row.names = 1L)
# X1.CEL X2.CEL X3.CEL X4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
If you want to preserve exactly the header, do
read.csv(text = x, row.names = 1L, check.names = FALSE)
# 1.CEL 2.CEL 3.CEL 4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
Regarding row.names, read ?read.csv:
row.names: a vector of row names. This can be a vector giving the
actual row names, or a single number giving the column of the
table which contains the row names, or character string
giving the name of the table column containing the row names.
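To illustrate the two single-value forms described in that help text, here is a small sketch. The column name probe and the values are made up; the file in the question has an empty header for its first column, so there the numeric form is the safer choice.

```r
# Hypothetical CSV content with a named first column
csv_text <- 'probe,s1,s2
1_s_at,1,2
2_at,3,4'

# row.names as a single number: the position of the column holding the row names
by_position <- read.csv(text = csv_text, row.names = 1)

# row.names as a character string: the name of that column
by_name <- read.csv(text = csv_text, row.names = "probe")
```

A logical value such as row.names = T is neither of these forms, which is consistent with the "invalid 'row.names' specification" error shown in the question.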

How do I subset a data.table based on another data.table?

I am trying to get my head around how to use data.tables. It is not going well.
I have a large data.table with a bunch of returns and AUM. I subsetted that data.table into two data.tables, one with returns, and one with AUM. I now want to subset the returns data.table, to get only the returns from funds with AUM less than the 50th percentile.
To give you an idea, this is my code:
fundDetails <- data.table(read.table("Fund_Deets.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
fundNAV <- data.table(read.table("NAV_AUM.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
allFundDetails <- fundDetails[Currency == 'USD']
allFundNAV <- fundNAV[Fund.ID %in% allFundDetails$Fund.ID]
allFundAUM <- allFundNAV[Type == 'AUM', -c(1,3), with = FALSE]
allFundAUM <- setnames(data.table(t(sapply(allFundAUM[,-1, with = FALSE],as.numeric))), as.character(allFundAUM$Fund.ID))
allFundReturns <- allFundNAV[Type == 'Return', -c(1,3), with = FALSE]
allFundReturns <- setnames(data.table(t(sapply(allFundReturns[,-1, with = FALSE],as.numeric)/100)), as.character(allFundReturns$Fund.ID))
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
This produces the following three tables (smallFundReturns is obviously just NAs):
> allFundAUM[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 1 27
2: NA NA NA NA NA NA 117 NA 1 27
3: NA NA NA NA NA NA 120 NA 1 27
4: NA NA NA NA NA NA 133 NA 1 27
5: NA NA NA NA NA NA 146 NA 1 29
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> allFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA 0.0188 -0.0116
2: NA NA NA NA NA NA -0.0315 NA -0.0120 0.0134
3: NA NA NA NA NA NA -0.0978 NA -0.0908 -0.0206
4: NA NA NA NA NA NA -0.0445 NA -0.0269 -0.0287
5: NA NA NA NA NA NA 0.0139 NA 0.0298 -0.0141
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
I am trying to subset using this for loop (using a for loop in an attempt to debug):
for (i in 1:nrow(allFundReturns)){
theSubset <- as.vector(allFundReturns[i,] <= as.numeric(quantile(allFundAUM[i,], .5, na.rm = TRUE)))
theSubset[is.na(theSubset)] <- FALSE
theSubset <- colnames(allFundReturns)[theSubset]
smallFundReturns[i,theSubset, with = FALSE] = allFundReturns[i,theSubset, with = FALSE]
}
This produces an error:
Error in `[<-.data.table`(`*tmp*`, i, theSubset, with = FALSE, value = list( :
unused argument (with = FALSE)
I tried removing the 'with' part, but this spits out a bunch of warnings:
> warnings()
Warning messages:
1: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '41526' (3019 unused)
2: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45993' (3019 unused)
3: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
Supplied 3020 items to be assigned to 1 items of column '45994' (3019 unused)
4: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526", ... :
I am confused on how to do this. Any ideas on how I can subset the second data.table by the subset on the first?
EDIT:
I tried the suggestion below:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
And I got these warnings():
> warnings()
Warning messages:
1: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
2: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
3: In `[.data.table`(smallFundReturns, i, `:=`((theSubset), ... :
And the code produced this, with 'TRUE' everywhere I would expect a number:
> smallFundReturns[,1:10, with = FALSE]
33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
1: NA NA NA NA NA NA NA NA TRUE TRUE
2: NA NA NA NA NA NA NA NA NA NA
3: NA NA NA NA NA NA NA NA NA NA
4: NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA
---
260: NA NA NA NA NA NA NA NA NA NA
261: NA NA NA NA NA NA NA NA NA NA
262: NA NA NA NA NA NA NA NA NA NA
263: NA NA NA NA NA NA NA NA NA NA
264: NA NA NA NA NA NA NA NA NA NA
EDIT 2:
I figured out the issue. Apparently, this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))
created the table as being logical. I changed it to this line:
smallFundReturns <- data.table(sapply(allFundReturns, function(x) as.numeric(rep(NA, length(x)))))
And everything worked after @HubertL's fix. Thanks!!
You have to write it like this:
smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
Suggestions for improvement:
Try reading data with fread instead of read.table if possible. It's way faster and the result is data.table not data.frame.
When doing "data.table operations" with the statement ", with=FALSE" you actually force R to use the much slower data.frame operations instead of using the blazingly fast data.table methods.
Have fun
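A minimal sketch of the two points that resolved the question: creating the target table with numeric NA columns, and assigning with := inside the brackets. The table and column names here are made up and the data is tiny; it assumes the data.table package is installed.

```r
library(data.table)

# Toy stand-in for allFundReturns
returns <- data.table(a = c(0.1, 0.2), b = c(0.3, 0.4))

# Typed NA columns (NA_real_), so assigned doubles are not coerced to logical
small <- as.data.table(lapply(returns, function(x) rep(NA_real_, length(x))))

cols <- c("b")  # columns selected for this row (hypothetical selection)
i <- 1L         # row being filled

# Assign by reference: the column names go in parentheses on the left of :=
small[i, (cols) := returns[i, cols, with = FALSE]]
```

After this, only row 1 of column b in small holds a value; every other cell stays NA.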

Conditionally apply a function on a row

I'm trying to do the following:
Evaluate a POSIXct class column in a data frame: if an observation falls on an even second, do nothing, and if it falls on an odd second, add 1.
This is what I have so far:
df[,1] <- lapply(df[,1], function(x) ifelse(as.integer(x)%%2==0, yes = x, no = x+1))
At this point my computer is running out of memory, probably because of the yes = x in the ifelse function... But I'm not sure.
Is there a better way to approach this problem?
Note: There are around 150,000 obs.
Edit:
Here is a sample of my data:
Timestamp PSTC01 PSTC02 PSTC03 PSTC04 PSTC05 PSTC06 PSTC07 PSTC08 PSTC09 PSTC10 PSTC11
1 2013-09-02 23:56:02 0.225339 NA NA NA NA 0.222298 0.253884 NA 0.243435 NA NA
2 2013-09-02 23:56:32 0.220459 NA NA NA NA 0.220009 0.250797 NA 0.241659 NA NA
3 2013-09-02 23:57:02 0.218379 NA NA NA NA 0.216663 0.252008 NA 0.240208 NA NA
4 2013-09-02 23:57:32 0.218264 NA NA NA NA 0.215935 0.256784 NA 0.240165 NA NA
5 2013-09-02 23:58:02 0.222438 NA NA NA NA 0.220964 0.253382 NA 0.241622 NA NA
6 2013-09-02 23:58:32 0.222154 NA NA NA NA 0.222533 0.252455 NA 0.242187 NA NA
7 2013-09-02 23:59:02 0.223612 NA NA NA NA 0.226128 0.253611 NA 0.243376 NA NA
8 2013-09-02 23:59:32 0.221370 NA NA NA NA 0.225215 0.253617 NA 0.243793 NA NA
9 2013-09-06 00:00:01 0.268708 NA NA NA 0.207481 NA 0.277915 NA NA 0.241519 0.242069
10 2013-09-06 00:00:31 0.268708 NA NA NA 0.207481 NA 0.277915 NA NA 0.241519 0.242069
11 2013-09-06 00:01:01 0.265310 NA NA NA 0.218782 NA 0.269295 NA NA 0.236030 0.211069
12 2013-09-06 00:01:31 0.270377 NA NA NA 0.217813 NA 0.272507 NA NA 0.236714 0.219648
13 2013-09-06 00:02:01 0.271845 NA NA NA 0.213403 NA 0.271097 NA NA 0.236685 0.218460
df[1] <- df[1] + as.integer(df[[1]]) %% 2
will be much faster due to vectorized operations. By the way: You don't need yes explicitly in ifelse, but it doesn't affect speed.
The above command adds one second to all odd values. If you want to do the opposite, i.e., add one second to even values, you just have to insert the logical not operator, !:
df[1] <- df[1] + !as.integer(df[[1]]) %% 2
Try
df[as.integer(df[,1]) %% 2 == 1,1] <- df[as.integer(df[,1]) %% 2 == 1,1] + 1
EDIT:
After reading your comment it looks like Sven's idea would work best. Try this (adopting Sven's approach):
df[,1] <- df[,1] + {as.integer(df[,1]) %% 2}
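For reference, a self-contained version of the vectorised approach on two made-up timestamps. The UTC time zone and the column names are assumptions for the sketch:

```r
# Two toy timestamps: one on an even second, one on an odd second
ts <- as.POSIXct(c("2013-09-02 23:56:02", "2013-09-06 00:00:01"), tz = "UTC")
df <- data.frame(Timestamp = ts, value = c(0.22, 0.27))

# as.integer() gives seconds since the epoch; %% 2 is 1 for odd seconds, 0 for even
df$Timestamp <- df$Timestamp + as.integer(df$Timestamp) %% 2
```

The whole column is updated in one vectorised operation, with no per-element lapply or ifelse call.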
