Creating subset based on string conditions

Having a dataframe like this:
df_in <- data.frame(x = c('x1','x2','x3','x4'),
col1 = c('http://youtube.com/something','NA','https://www.yahooexample.com','https://www.yahooexample2.com'),
col2 = c('https://google.com', 'http://www.bbcnews2.com?id=321','NA','https://google.com/text'),
col3 = c('http://www.bbcnews.com?id=321', 'http://google.com?id=1234','NA','https://bbcnews.com/search'),
col4 = c('NA', 'https://www.youtube/com','NA', 'www.youtube.com/searcht'))
Example of dataframe input as printed in the console:
x col1 col2 col3 col4
1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321 NA
2 x2 NA http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
3 x3 https://www.yahooexample.com NA NA NA
4 x4 https://www.yahooexample2.com https://google.com/text https://bbcnews.com/search www.youtube.com/searcht
I would like to create a dataframe from a specific subset condition. For example, I would like to keep only the entries that contain "google", "youtube" or "bbc" in their string.
Example of expected output:
df_out <- data.frame(x = c('x1','x2','x4'),
col1new = c('http://youtube.com/something', 'http://www.bbcnews2.com?id=321', 'https://google.com/text'),
col2new = c('https://google.com', 'http://google.com?id=1234', 'https://bbcnews.com/search'),
col3new = c('http://www.bbcnews.com?id=321', 'https://www.youtube/com', 'www.youtube.com/searcht'))
Example of dataframe output as printed in the console:
x col1new col2new col3new
1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321
2 x2 http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
3 x4 https://google.com/text https://bbcnews.com/search www.youtube.com/searcht

We could create a logical index with grepl to filter the rows, keeping those in which at least one element has one of the patterns right after http:// or https://
i1 <- Reduce('|', lapply(df_in[-1], grepl, pattern= "https?://(google|youtube|bbc)"))
Then, loop through the rows of the subset data and get the links that match with google/youtube/bbc
tmp <- t(apply(df_in[i1,-1], 1, function(x) x[grepl("(google|youtube|bbc)", x)]))
colnames(tmp) <- paste0('col', seq_len(ncol(tmp)), "new")
and cbind with the subset of first column
cbind(df_in[i1, 1, drop = FALSE], tmp)
# x col1new col2new col3new
#1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321
#2 x2 http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
#4 x4 https://google.com/text https://bbcnews.com/search www.youtube.com/searcht
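
The apply() step above only gives a rectangular result when every kept row has the same number of matching links, as happens in this example. A minimal sketch for the uneven case, padding shorter rows with NA, could look like this. The names hits, keep, width and padded are illustrative and not part of the original answer, and the looser pattern without the http:// prefix is an assumption about what should count as a match.

```r
df_in <- data.frame(
  x    = c("x1", "x2", "x3", "x4"),
  col1 = c("http://youtube.com/something", "NA",
           "https://www.yahooexample.com", "https://www.yahooexample2.com"),
  col2 = c("https://google.com", "http://www.bbcnews2.com?id=321",
           "NA", "https://google.com/text"),
  col3 = c("http://www.bbcnews.com?id=321", "http://google.com?id=1234",
           "NA", "https://bbcnews.com/search"),
  col4 = c("NA", "https://www.youtube/com", "NA", "www.youtube.com/searcht"),
  stringsAsFactors = FALSE
)

# Assumption: match the keywords anywhere in the string
pattern <- "google|youtube|bbc"

# Collect the matching links of each row as a list of character vectors
hits <- lapply(seq_len(nrow(df_in)), function(i) {
  vals <- unlist(df_in[i, -1], use.names = FALSE)
  vals[grepl(pattern, vals)]
})

keep  <- lengths(hits) > 0    # rows with at least one match
width <- max(lengths(hits))   # the widest row decides the number of new columns

# Pad every kept row to the same width so the rows can be bound together
padded <- do.call(rbind, lapply(hits[keep], function(h) {
  c(h, rep(NA_character_, width - length(h)))
}))
colnames(padded) <- paste0("col", seq_len(width), "new")

df_out <- cbind(df_in[keep, 1, drop = FALSE], padded)
```

On the example data every kept row has three matches, so no padding is actually needed and df_out should agree with the expected output in the question.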

Related

How to partition to multiple .csv from df based on whitespace row?

I'm working with a database that has a timestamp, 3 numeric vectors, and a character vector.
Basically, each "set" of data is delineated by a new row. I need each series of rows to be saved as a .csv whenever a row reads as empty in every column (x = \t\r\n). There are about 370 such sets in my dataset.
For example,
library(dplyr)
data <- data.frame(x1 = 1:4,
x2 = 4:1,
x3 = 3,
x4 = c("text", "no text", "example", "hello"))
new_row <- c("\t\r\n", "\t\r\n", "\t\r\n", "\t\r\n")
data1 <- rbind(data, new_row)
data2 <- data.frame(x1 = 1:4,
x2 = 4:1,
x3 = 4,
x4 = c("text", "no text", "example", "hello"))
data2 <- rbind(data2, new_row)
data3 <- rbind(data1, data2)
View(data3)
This is what my data set looks like (without the timestamp). I need every set of consecutive rows after a row full of \t\r\n to be exported as an individual .csv.
I'm doing text analysis. Each group of rows, with highly variable group size, represents a thread on a different subject. I need to analyze these individual threads.
What is the best way to go about doing this? I haven't had this problem before.

ind <- grepl("\t", data3$x4)
ind <- replace(cumsum(ind), ind, -1)
ind
# [1] 0 0 0 0 -1 1 1 1 1 -1
data4 <- split(data3, ind)
data4
# $`-1`
# x1 x2 x3 x4
# 5 \t\r\n \t\r\n \t\r\n \t\r\n
# 10 \t\r\n \t\r\n \t\r\n \t\r\n
# $`0`
# x1 x2 x3 x4
# 1 1 4 3 text
# 2 2 3 3 no text
# 3 3 2 3 example
# 4 4 1 3 hello
# $`1`
# x1 x2 x3 x4
# 6 1 4 4 text
# 7 2 3 4 no text
# 8 3 2 4 example
# 9 4 1 4 hello
The use of -1 was solely to keep the "\t\r\n" rows from being included in each of their respective groups, and we know that cumsum(ind) should start at 0. You can obviously drop the first frame :-)
From here, you can export with
data4 <- data4[-1]
ign <- Map(write.csv, data4, sprintf("file_%03d.csv", seq_along(data4)))
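
If you would rather not carry a -1 group around, a variation is to drop the separator rows before splitting. The sketch below rebuilds data3 the same way as the question and writes into a temporary folder. The folder name "threads" and the names is_sep, grp, chunks and files are made up for illustration, and it assumes a separator row has "\t\r\n" in every column.

```r
# Rebuild the example data from the question
data <- data.frame(x1 = 1:4, x2 = 4:1, x3 = 3,
                   x4 = c("text", "no text", "example", "hello"),
                   stringsAsFactors = FALSE)
new_row <- rep("\t\r\n", 4)
data3 <- rbind(rbind(data, new_row),
               rbind(transform(data, x3 = 4), new_row))

# A row is a separator only if every column holds the marker
is_sep <- Reduce(`&`, lapply(data3, function(col) col == "\t\r\n"))

# Running count of separators seen so far gives a group id per row
grp <- cumsum(is_sep)

# Drop the separator rows, then split what is left by group id
chunks <- split(data3[!is_sep, , drop = FALSE], grp[!is_sep])

# Write one file per chunk (a temporary folder is used here as an assumption)
out_dir <- file.path(tempdir(), "threads")
dir.create(out_dir, showWarnings = FALSE)
files <- file.path(out_dir, sprintf("file_%03d.csv", seq_along(chunks)))
invisible(Map(function(d, f) write.csv(d, f, row.names = FALSE), chunks, files))
```

Note that rbind-ing the marker rows turns the numeric columns into character, so you may want to convert them back after reading the files in again.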

Combine two identical dataframe columns into comma separated columns in R

I have two identically structured data frames (same number of rows and columns, and the same headers). What I would like to do is to combine the two into one data frame that has comma separated columns.
I know how to do it with these dummy data frames, but using the same approach on my own data would be very cumbersome.
These are my dummy data frames. The headers of my "real" data are "1","2","3" etc., while those of the dummy data frames are "X1","X2","X3" etc.
> data1
  X1 X2 X3 X4
1  1  2  3  4
2  2  3  4  5
3  3  4  5  6
> data2
  X1 X2 X3 X4
1  8  9 13 14
2  9 10 14 15
3 10 11 15 16
What I would like:
> data3
  new1 new2 new3 new4
1  1,8  2,9 3,13 4,14
2  2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
This is how I managed to get this output, but I think it is too cumbersome for a large dataset:
data1<- data.frame('1'=1:3, '2'=2:4, '3'=3:5,'4'=4:6)
data2<- data.frame('1'=8:10, '2'=9:11, '3'=13:15,'4'=14:16)
names(data1) <- c("1a","2a","3a","4a")
names(data2) <- c("1b","2b","3b","4b")
data3<- cbind(data1,data2)
cols.1 <- c('1a','1b'); cols.2 <-c('2a','2b')
cols.3 <- c('3a','3b'); cols.4 <-c('4a','4b')
data3$new1 <- apply( data3[ , cols.1] , 1 , paste , collapse = "," )
data3$new2 <- apply( data3[ , cols.2] , 1 , paste , collapse = "," )
data3$new3 <- apply( data3[ , cols.3] , 1 , paste , collapse = "," )
data3$new4 <- apply( data3[ , cols.4] , 1 , paste , collapse = "," )
data3 <-data3[,c(9:12)]
Is there a way in which I can iterate this, perhaps with a for loop? Any help would be appreciated.
These posts are somewhat similar:
Same question but for rows instead of columns:
how to convert column values into comma seperated row vlaues
Similar, but didn't work on my large dataset:
Paste multiple columns together

Using only base R:
data1 <- data.frame(x1 = 1:3, x2 = 2:4, x3 = 3:5, x4 = 4:6)
data2 <- data.frame(x1 = 8:10, x2 = 9:11, x3 = 13:15, x4 = 14:16)
data3 <- mapply(function(x, y){paste(x,y, sep = ",")}, data1, data2)
data3 <- as.data.frame(data3)
x1 x2 x3 x4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
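
If the result should carry the new1, new2, ... headers from the question, a small variation with Map may help. This is only a sketch: it assumes both data frames have the same dimensions and the same column order, and the stopifnot check is an added safeguard that is not part of the answer above.

```r
data1 <- data.frame(X1 = 1:3, X2 = 2:4, X3 = 3:5, X4 = 4:6)
data2 <- data.frame(X1 = 8:10, X2 = 9:11, X3 = 13:15, X4 = 14:16)

# Added safeguard: both inputs must line up column by column
stopifnot(identical(dim(data1), dim(data2)),
          identical(names(data1), names(data2)))

# Paste matching columns pairwise; Map returns one character vector per column
pasted <- Map(function(a, b) paste(a, b, sep = ","), data1, data2)

data3 <- as.data.frame(pasted, stringsAsFactors = FALSE)
names(data3) <- paste0("new", seq_along(data3))
```

Because the columns are matched by position, this works the same way however many columns there are, which avoids writing out cols.1, cols.2 and so on by hand.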

Here's a basic for loop approach:
newdf = data.frame(matrix(ncol=ncol(data1),nrow=nrow(data1)))
for (i in 1:ncol(data1)) {
newdf[,i] = paste(data1[,i], data2[,i], sep=",")
}
#> newdf
# X1 X2 X3 X4
# 1 1,8 2,9 3,13 4,14
# 2 2,9 3,10 4,14 5,15
# 3 3,10 4,11 5,15 6,16
Line by line explanation:
initialize new empty dataframe of appropriate dimensions:
newdf = data.frame(matrix(ncol=ncol(data1),nrow=nrow(data1)))
loop through 1,2,..n columns and fill each column with the paste results:
for (i in 1:ncol(data1)) {
newdf[,i] = paste(data1[,i], data2[,i], sep=",")
}
Disclaimer: this may be slow on large datasets. A dplyr or data.table approach (or perhaps one of the vapply()/sapply()/apply() family) will likely be faster, if you are interested in learning those methods.

Removing characters from column value and adding a new letter

I have the following data frame df1. I want to remove "/" from all values in column x2 and add letter v at the end of each value in x2.
df1
x1 x2
1 aa/bb/cc
2 ff/bb/cc
3 uu/bb/cc
Resulting df2
df2
x1 x2
1 aabbccv
2 ffbbccv
3 uubbccv

You can use gsub to remove the / and paste0 to add the v in each row:
df2 <- transform(df1, x2 = paste0(gsub("/", "", x2, fixed = TRUE), "v"))
df2
# x1 x2
#1 1 aabbccv
#2 2 ffbbccv
#3 3 uubbccv
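
For a self-contained check, here is a sketch that first builds df1 (the construction is an assumption, since the question only prints it) and then applies the same gsub/paste0 idea with plain column assignment instead of transform.

```r
# Assumed construction of df1; the question only shows the printed frame
df1 <- data.frame(x1 = 1:3,
                  x2 = c("aa/bb/cc", "ff/bb/cc", "uu/bb/cc"),
                  stringsAsFactors = FALSE)

df2 <- df1
# fixed = TRUE treats "/" as a literal string rather than a regular expression
df2$x2 <- paste0(gsub("/", "", df1$x2, fixed = TRUE), "v")
```

Both forms should give the same result. Assigning to df2$x2 directly just makes it explicit which column is being replaced.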

String split into duplicate rows [duplicate]

This question already has an answer here:
Split parts of strings into a list column and then make a vector column
(1 answer)
Closed 9 years ago.
Given the following sample dataset:
col1 <- c("X1","X2","X3|X4|X5","X6|X7")
col2 <- c("5","8","1","4")
dat <- data.frame(col1,col2)
How can I split the col1 by | and enter them as separate rows with duplicated col2 values? Here's the dataframe that I'd like to end up with:
col1 col2
X1 5
X2 8
X3 1
X4 1
X5 1
X6 4
X7 4
I need a solution that can accommodate multiple columns similar to col2 that also need to be duplicated.

Just split the character string and then repeat the other columns based on the length.
y<-strsplit(as.character( dat[,1]) , "|", fixed=TRUE)
data.frame(col1= unlist(y), col2= rep(dat[,2], sapply(y, length)))
col1 col2
1 X1 5
2 X2 8
3 X3 1
4 X4 1
5 X5 1
6 X6 4
7 X7 4
And if you need to repeat many columns except the first (drop = FALSE keeps the result a data frame even when only one column remains):
data.frame(col1 = unlist(y), dat[rep(seq_len(nrow(dat)), sapply(y, length)), -1, drop = FALSE])
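
To make the multi-column case concrete, here is a sketch with one extra column. The col3 column and the names parts, idx and out are hypothetical and only there to show that every other column gets repeated along with col2.

```r
dat <- data.frame(col1 = c("X1", "X2", "X3|X4|X5", "X6|X7"),
                  col2 = c("5", "8", "1", "4"),
                  col3 = c("a", "b", "c", "d"),   # hypothetical extra column
                  stringsAsFactors = FALSE)

# Split on a literal "|" (fixed = TRUE avoids regex alternation)
parts <- strsplit(dat$col1, "|", fixed = TRUE)

# Repeat each row index once per piece it was split into
idx <- rep(seq_len(nrow(dat)), lengths(parts))

out <- data.frame(col1 = unlist(parts),
                  dat[idx, -1, drop = FALSE],
                  stringsAsFactors = FALSE)
rownames(out) <- NULL   # discard the repeated-row labels such as "3.1"
```

Building the row index once and reusing it keeps col1 and the repeated columns aligned, whatever the number of extra columns.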

Combining Survey Items in R/ Recoding NAs

I have two lists (from a multi-wave survey) that look like this:
X1 X2
1 NA
NA 2
NA NA
How can I easily combine this into a third item, where the third column always takes the non-NA value of column X1 or X2, and codes NA when both values are NA?

Combining Gavin's use of within and Prasad's use of ifelse gives us a simpler answer.
within(df, x3 <- ifelse(is.na(x1), x2, x1))
Multiple ifelse calls are not needed - when both values are NA, you can just take one of the values directly.

Another way using ifelse:
df <- data.frame(x1 = c(1, NA, NA, 3), x2 = c(NA, 2, NA, 4))
> df
x1 x2
1 1 NA
2 NA 2
3 NA NA
4 3 4
> transform(df, x3 = ifelse(is.na(x1), ifelse(is.na(x2), NA, x2), x1))
x1 x2 x3
1 1 NA 1
2 NA 2 2
3 NA NA NA
4 3 4 3

This needs a little extra finesse-ing due to the possibility of both X1 and X2 being NA, but this function can be used to solve your problem:
foo <- function(x) {
  if (all(nas <- is.na(x))) {
    NA              # every value in the row is missing
  } else {
    x[!nas][1]      # first non-NA value, so the result always has length 1
  }
}
We use the function foo by applying it to each row of your data (here I have your data in an object named dat):
> apply(dat, 1, foo)
[1] 1 2 NA
So this gives us what we want. To include this inside your object, we do this:
> dat <- within(dat, X3 <- apply(dat, 1, foo))
> dat
X1 X2 X3
1 1 NA 1
2 NA 2 2
3 NA NA NA

You didn't say what you wanted done when both were valid numbers, but you can use either pmax or pmin with the na.rm argument:
pmax(df$x1, df$x2, na.rm=TRUE)
# [1] 1 2 NA 4
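
To see how the two options differ when both values are present, here is a small side-by-side sketch on the df from the ifelse answer above. The column names x3_first and x3_max are made up for illustration.

```r
df <- data.frame(x1 = c(1, NA, NA, 3), x2 = c(NA, 2, NA, 4))

# Prefer x1 and fall back to x2; stays NA when both are NA
df$x3_first <- ifelse(is.na(df$x1), df$x2, df$x1)

# Take the larger of the two, ignoring NAs; also NA when both are NA
df$x3_max <- pmax(df$x1, df$x2, na.rm = TRUE)
```

The two columns only disagree in the last row, where x1 = 3 and x2 = 4: the ifelse version keeps 3 and the pmax version keeps 4. Which one is right depends on how the survey waves should be prioritised.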
