Removing characters from column value and adding a new letter - r

I have the following data frame df1. I want to remove "/" from all values in column x2 and add letter v at the end of each value in x2.
df1
x1 x2
1 aa/bb/cc
2 ff/bb/cc
3 uu/bb/cc
Resulting df2
df2
x1 x2
1 aabbccv
2 ffbbccv
3 uubbccv

You can use gsub to remove the / and paste0 to add the v in each row:
df2 <- transform(df1, x2 = paste0(gsub("/", "", x2, fixed = TRUE), "v"))
df2
# x1 x2
#1 1 aabbccv
#2 2 ffbbccv
#3 3 uubbccv
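For anyone who wants to run this from scratch, here is a minimal self-contained sketch. It assumes x2 is stored as a character column; the construction of df1 is a reconstruction of the question's table, not code from the original post.

```r
# Rebuild the example data from the question (assumed to be character, not factor)
df1 <- data.frame(x1 = 1:3,
                  x2 = c("aa/bb/cc", "ff/bb/cc", "uu/bb/cc"),
                  stringsAsFactors = FALSE)

# Copy the frame, strip every "/" literally, then append "v"
df2 <- df1
df2$x2 <- paste0(gsub("/", "", df1$x2, fixed = TRUE), "v")

df2$x2
# [1] "aabbccv" "ffbbccv" "uubbccv"
```

fixed = TRUE treats "/" as a literal string rather than a regular expression, which matters more when the character being removed is something like "." or "|".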

Related

How to partition to multiple .csv from df based on whitespace row?

I'm working with a database that has a timestamp, 3 numeric vectors, and a character vector.
Basically, each "set" of data is delimited by a separator row in which every column is empty (x = \t\r\n). I need each series of rows between separators saved as its own .csv. There are about 370 sets in my dataset.
For example,
library(dplyr)
data <- data.frame(x1 = 1:4,
x2 = 4:1,
x3 = 3,
x4 = c("text", "no text", "example", "hello"))
new_row <- c("\t\r\n", "\t\r\n", "\t\r\n", "\t\r\n")
data1 <- rbind(data, new_row)
data2 <- data.frame(x1 = 1:4,
x2 = 4:1,
x3 = 4,
x4 = c("text", "no text", "example", "hello"))
data2 <- rbind(data2, new_row)
data3 <- rbind(data1, data2)
View(data3)
This is what my data set looks like (without the timestamp). I need every set of consecutive rows after a row full of \t\r\n to be exported as an individual .csv.
I'm doing text analysis. Each group of rows, with highly variable group size, represents a thread on a different subject. I need to analyze these individual threads.
What is the best way to go about doing this? I haven't had this problem before.
ind <- grepl("\t", data3$x4)
ind <- replace(cumsum(ind), ind, -1)
ind
# [1] 0 0 0 0 -1 1 1 1 1 -1
data4 <- split(data3, ind)
data4
# $`-1`
# x1 x2 x3 x4
# 5 \t\r\n \t\r\n \t\r\n \t\r\n
# 10 \t\r\n \t\r\n \t\r\n \t\r\n
# $`0`
# x1 x2 x3 x4
# 1 1 4 3 text
# 2 2 3 3 no text
# 3 3 2 3 example
# 4 4 1 3 hello
# $`1`
# x1 x2 x3 x4
# 6 1 4 4 text
# 7 2 3 4 no text
# 8 3 2 4 example
# 9 4 1 4 hello
The use of -1 was solely to keep the "\t\r\n" rows from being included in each of their respective groups, and we know that cumsum(ind) should start at 0. You can obviously drop the first frame :-)
From here, you can export with
data4 <- data4[-1]
ign <- Map(write.csv, data4, sprintf("file_%03d.csv", seq_along(data4)))
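A variant of the same idea, sketched under a few assumptions: the separator rows are dropped before splitting (so no -1 group is needed), files go to a temporary directory, and row names are left out of the output. The names block, sep_row, is_sep, pieces and paths are illustrative only.

```r
# Rebuild the example: two blocks of four rows, each followed by a separator row
block <- data.frame(x1 = 1:4, x2 = 4:1, x3 = 3,
                    x4 = c("text", "no text", "example", "hello"),
                    stringsAsFactors = FALSE)
sep_row <- rep("\t\r\n", 4)
data3 <- rbind(rbind(block, sep_row),
               rbind(transform(block, x3 = 4), sep_row))

# Flag separator rows, then number the groups by counting separators seen so far
is_sep <- grepl("\t", data3$x4, fixed = TRUE)
grp <- cumsum(is_sep)[!is_sep]

# Split only the non-separator rows
pieces <- split(data3[!is_sep, ], grp)

# Write one file per group (tempdir() is a placeholder for a real output folder)
paths <- file.path(tempdir(), sprintf("file_%03d.csv", seq_along(pieces)))
invisible(Map(function(d, p) write.csv(d, p, row.names = FALSE), pieces, paths))

length(pieces)
# [1] 2
```

Dropping the separators first means every element of pieces is a real thread, so nothing has to be removed afterwards.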

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far look like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for(i in colnames(data)){
if(data$x>2){
x1 <-sum(data[[i]])
}
else{
x2 <-sum(data[[i]])
}
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15
Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
  mutate(x_check = ifelse(x > 2, "x1", "x2")) %>%
  group_by(x_check) %>%
  # pass sum directly; funs() is deprecated in current dplyr
  summarise_at(vars(contains("v")), sum)
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15
Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
x = c(1:4),
v1 = c(0, 4, 5, 1),
v2 = 1:4,
v3 = (1:4)*5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where
df[df$x > 2, 2:4, drop = FALSE] creates a subset of df containing the rows that satisfy df$x > 2 and columns 2:4 (the second, third and fourth columns). drop = FALSE keeps the result a data frame even if only a single column is selected, where R would otherwise simplify it to a vector
colSums does a by-column sum on the subsetted data.frame
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that no loop is needed to get the results. In R you should use vectorized functions as much as possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)
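To tie this back to the question: if() expects a single TRUE or FALSE, whereas data$x > 2 yields one value per row, which is what the message is complaining about. The sketch below builds the two-row table the question asks for using the colSums approach; the names big and result are illustrative only.

```r
df <- data.frame(x = 1:4,
                 v1 = c(0, 4, 5, 1),
                 v2 = 1:4,
                 v3 = (1:4) * 5)

# One logical value per row; this is used for subsetting, not inside if()
big <- df$x > 2

# Column sums for each half, stacked into a matrix with row names x1 and x2
result <- rbind(x1 = colSums(df[big, -1, drop = FALSE]),
                x2 = colSums(df[!big, -1, drop = FALSE]))
result
#    v1 v2 v3
# x1  6  7 35
# x2  4  3 15
```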

Combine two identical dataframe columns into comma separated columns in R

I have two identically structured data frames (same number of rows and columns, and the same headers). What I would like to do is combine the two into one data frame that has comma separated columns.
I know how to do it with these dummy data frames, but using the same approach on my own data would be very cumbersome.
These are my dummy data frames. The headers of my "real" data are "1","2","3" etc., while those of the dummy data frames are "X1","X2","X3" etc.
> data1
X1 X2 X3 X4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
> data2
X1 X2 X3 X4
1 8 9 13 14
2 9 10 14 15
3 10 11 15 16
What I would like:
>data3
new1 new2 new3 new4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
This is how I managed to get this output, but I think it is too cumbersome for a large dataset:
data1<- data.frame('1'=1:3, '2'=2:4, '3'=3:5,'4'=4:6)
data2<- data.frame('1'=8:10, '2'=9:11, '3'=13:15,'4'=14:16)
names(data1) <- c("1a","2a","3a","4a")
names(data2) <- c("1b","2b","3b","4b")
data3<- cbind(data1,data2)
cols.1 <- c('1a','1b'); cols.2 <-c('2a','2b')
cols.3 <- c('3a','3b'); cols.4 <-c('4a','4b')
data3$new1 <- apply( data3[ , cols.1] , 1 , paste , collapse = "," )
data3$new2 <- apply( data3[ , cols.2] , 1 , paste , collapse = "," )
data3$new3 <- apply( data3[ , cols.3] , 1 , paste , collapse = "," )
data3$new4 <- apply( data3[ , cols.4] , 1 , paste , collapse = "," )
data3 <-data3[,c(9:12)]
Is there a way in which I can iterate this, perhaps with a for loop? Any help would be appreciated.
These posts are somewhat similar:
Same question but for rows instead of columns:
how to convert column values into comma seperated row vlaues
Similar, but didn't work on my large dataset:
Paste multiple columns together
Using only base R:
data1 <- data.frame(x1 = 1:3, x2 = 2:4, x3 = 3:5, x4 = 4:6)
data2 <- data.frame(x1 = 8:10, x2 = 9:11, x3 = 13:15, x4 = 14:16)
data3 <- mapply(function(x, y){paste(x,y, sep = ",")}, data1, data2)
data3 <- as.data.frame(data3)
x1 x2 x3 x4
1 1,8 2,9 3,13 4,14
2 2,9 3,10 4,14 5,15
3 3,10 4,11 5,15 6,16
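If the new1, new2, ... headers from the question are wanted, a small variation with Map() is sketched below. It assumes both data frames have their columns in the same order; the variable combined is illustrative only.

```r
data1 <- data.frame(X1 = 1:3, X2 = 2:4, X3 = 3:5, X4 = 4:6)
data2 <- data.frame(X1 = 8:10, X2 = 9:11, X3 = 13:15, X4 = 14:16)

# Pair up the columns of both frames and paste them element-wise
combined <- Map(function(a, b) paste(a, b, sep = ","), data1, data2)

# Turn the list of character vectors back into a data frame and rename
data3 <- as.data.frame(combined, stringsAsFactors = FALSE)
names(data3) <- paste0("new", seq_along(data3))

data3$new1
# [1] "1,8"  "2,9"  "3,10"
```

Because the pairing is by position, this scales to any number of columns without listing them by hand.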
Here's a basic for loop approach:
newdf = data.frame(matrix(ncol=ncol(data1),nrow=nrow(data1)))
for (i in 1:ncol(data1)) {
newdf[,i] = paste(data1[,i], data2[,i], sep=",")
}
#> newdf
# X1 X2 X3 X4
# 1 1,8 2,9 3,13 4,14
# 2 2,9 3,10 4,14 5,15
# 3 3,10 4,11 5,15 6,16
Line by line explanation:
initialize new empty dataframe of appropriate dimensions:
newdf = data.frame(matrix(ncol=ncol(data1),nrow=nrow(data1)))
loop through 1,2,..n columns and fill each column with the paste results:
for (i in 1:ncol(data1)) {
newdf[,i] = paste(data1[,i], data2[,i], sep=",")
}
Note that this may be slow on large datasets. A dplyr or data.table approach (or one of the apply-family functions such as sapply or vapply) will likely be faster, if you are interested in learning those methods.

Removing duplicate rows based on column while keeping the highest value of the next column

I'd like to remove the duplicates from column x1 and x2 while keeping the higher value from x3.
DF:
x1 x2 x3
1 1 1
1 1 2
1 1 3
2 2 2
2 2 5
Expected result:
x1 x2 x3
1 1 3
2 2 5
I've gotten as far as df[!duplicated(df[,c(1,2)]),] but it's displaying the lowest value of x3. I'd like to get the highest x3 value.
Thanks ahead of time.
You could aggregate(), using the first two columns for grouping
aggregate(x3 ~ x1 + x2, df, max)
# x1 x2 x3
# 1 1 1 3
# 2 2 2 5
If you want to find the max in more than one column, you can add variables to the left hand side of the formula with cbind(). For example,
aggregate(cbind(x3, x4, x5) ~ x1 + x2, df, max)
Using the dplyr package:
library(dplyr)
df %>% group_by(x1,x2) %>% summarise(x3 = max(x3))
You could title the maximum variable "maxOfx3" or similar for clarity.
Edit: If you have additional variables whose maxima you want, you can include them in the summarise() call:
df %>% group_by(x1,x2) %>% summarise(x3 = max(x3), x4 = max(x4), avg_of_x5 = mean(x5))
Yet another alternative with data.table:
library(data.table)
dt <- data.table(DF)
dt[,.SD[which.max(x3)],by=list(x1, x2)]
x1 x2 x3
1: 1 1 3
2: 2 2 5
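The duplicated() attempt from the question can also be made to work in base R by sorting first, so that the row with the highest x3 comes first within each x1/x2 pair. This is a sketch assuming x3 is numeric; the names sorted and top are illustrative only.

```r
df <- data.frame(x1 = c(1, 1, 1, 2, 2),
                 x2 = c(1, 1, 1, 2, 2),
                 x3 = c(1, 2, 3, 2, 5))

# Sort by the key columns, then by x3 descending within each key
sorted <- df[order(df$x1, df$x2, -df$x3), ]

# duplicated() keeps the first row per key, which is now the maximum
top <- sorted[!duplicated(sorted[, c("x1", "x2")]), ]
top$x3
# [1] 3 5
```

Unlike aggregate(), this keeps whole rows, which can be useful when other columns should come from the same row as the maximum.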

String split into duplicate rows [duplicate]

This question already has an answer here:
Split parts of strings into a list column and then make a vector column
Closed 9 years ago.
Given the following sample dataset:
col1 <- c("X1","X2","X3|X4|X5","X6|X7")
col2 <- c("5","8","1","4")
dat <- data.frame(col1,col2)
How can I split the col1 by | and enter them as separate rows with duplicated col2 values? Here's the dataframe that I'd like to end up with:
col1 col2
X1 5
X2 8
X3 1
X4 1
X5 1
X6 4
X7 4
I need a solution that can accommodate multiple columns similar to col2 that also need to be duplicated.
Just split the character string and then repeat the other columns based on the length.
y<-strsplit(as.character( dat[,1]) , "|", fixed=TRUE)
data.frame(col1= unlist(y), col2= rep(dat[,2], sapply(y, length)))
col1 col2
1 X1 5
2 X2 8
3 X3 1
4 X4 1
5 X5 1
6 X6 4
7 X7 4
And if you need to repeat many columns except the first:
# drop = FALSE keeps the column names even when only one column remains
data.frame(col1= unlist(y), dat[ rep(1:nrow(dat), sapply(y, length)) , -1, drop = FALSE ] )
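A self-contained sketch of the multi-column case follows. The extra column col3 is hypothetical and only there to show that every remaining column is repeated; parts, n and long are illustrative names.

```r
dat <- data.frame(col1 = c("X1", "X2", "X3|X4|X5", "X6|X7"),
                  col2 = c("5", "8", "1", "4"),
                  col3 = c("a", "b", "c", "d"),   # hypothetical extra column
                  stringsAsFactors = FALSE)

# Split col1 on the literal "|" and count the pieces per row
parts <- strsplit(dat$col1, "|", fixed = TRUE)
n <- lengths(parts)

# Repeat each original row once per piece, keeping all columns except col1
long <- data.frame(col1 = unlist(parts),
                   dat[rep(seq_len(nrow(dat)), n), -1, drop = FALSE],
                   row.names = NULL,
                   stringsAsFactors = FALSE)

long$col2
# [1] "5" "8" "1" "1" "1" "4" "4"
```

row.names = NULL resets the row names, which would otherwise carry suffixes such as 3.1 and 3.2 from the repeated rows.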
