Swap values around in a column in R

How do I swap one value with another in a column within a dataframe?
For example swap the 2's and 4's around in df1 to give df2:
df1 <- data.frame(col1 = c(1,2,1,4))
df2 <- data.frame(col1 = c(1,4,1,2))

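A simple solution using replace in base R. Note that replace() takes positions, not values, so this works here only because the 2 and the 4 happen to sit at positions 2 and 4: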
df2 <- data.frame(col1 = replace(df1$col1, c(4,2), c(2,4)))
Output
col1
1 1
2 4
3 1
4 2
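Since replace() indexes by position, a value-based swap may be more robust when the values do not sit at matching positions. Here is a minimal sketch; swap_values is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: swap every occurrence of a with b (and vice versa) in x
swap_values <- function(x, a, b) {
  out <- x
  out[which(x == a)] <- b   # which() silently drops NA comparisons
  out[which(x == b)] <- a   # compare against the original x, not out
  out
}

df1 <- data.frame(col1 = c(1, 2, 1, 4))
df2 <- data.frame(col1 = swap_values(df1$col1, 2, 4))
df2$col1
```

Both assignments test against the untouched x, so values that were just swapped are not swapped back.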

We can use case_when from the dplyr package for switch-like functionality:
library(dplyr)
df2 <- df1
df2$col1 <- case_when(
df2$col1 == 2 ~ 4,
df2$col1 == 4 ~ 2,
TRUE ~ df2$col1
)
df2
col1
1 1
2 4
3 1
4 2
Data:
df1 <- data.frame(col1 = c(1,2,1,4))

You can also swap by reordering the rows of that column by index.
With the dataframe:
df1 <- data.frame(col1 = c("a","b","c","d"))
> df1
col1
1 a
2 b
3 c
4 d
we can:
df1[,1] <- df1[c(1,4,3,2),1]
to get
> df1
col1
1 a
2 d
3 c
4 b
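The index vector c(1,4,3,2) above is hard-coded. As a sketch, it could also be derived from the values to swap, assuming each of them occurs exactly once in the column (the names idx, i and j are only illustrative):

```r
df1 <- data.frame(col1 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)

idx <- seq_len(nrow(df1))      # identity ordering: 1 2 3 4
i <- which(df1$col1 == "b")    # position of the first value to swap
j <- which(df1$col1 == "d")    # position of the second value to swap
idx[c(i, j)] <- idx[c(j, i)]   # exchange the two positions: 1 4 3 2

df1$col1 <- df1$col1[idx]
df1
```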

Related

How to rename columns in R with dplyr using a character object?

My data frame is as such:
#generic dataset
datatest <- data.frame(col1 = c(1,2,3,4), col2 = c('A', 'B', 'C', 'D'))
#character objects
name1 <- 'A'
name2 <- 'B'
I want to rename my columns using the name1 and name2 objects. These dynamically change in the code so I can't use the following:
#I DON'T WANT THIS
datatest %>% rename(A = col1, B = col2)
I want to use this:
datatest %>% rename(name1 = col1, name2 = col2)
but then the data table columns end up becoming 'name1' and 'name2' respectively, when they should be A and B. Here is the data table at the moment.
name1 (I want this to be A)   name2 (I want this to be B)
1                             A
2                             B
3                             C
4                             D
Any help is hugely appreciated. I have the same issue with kable tables too.
Thanks in advance!
Couple of options -
Using rename_with -
library(dplyr)
name1 <- 'A'
name2 <- 'B'
datatest %>% rename_with(~c(name1, name2), c(col1, col2))
#If there are only two columns in datatest
datatest %>% rename_with(~c(name1, name2))
# A B
#1 1 A
#2 2 B
#3 3 C
#4 4 D
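Use a named vector (new names as the names, old names as the values), built from name1 and name2 so it stays dynamic -
name <- setNames(c('col1', 'col2'), c(name1, name2))
datatest %>% rename(!!name)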
You may try
datatest %>% rename({{name1}} := col1, {{name2}} := col2)
A B
1 1 A
2 2 B
3 3 C
4 4 D
Here is one more option using !!!setNames
datatest %>%
rename(!!!setNames(names(.), c(name1, name2)))
A B
1 1 A
2 2 B
3 3 C
4 4 D
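If dplyr is not a requirement, the same renaming can be sketched in base R with a lookup vector. The name lookup and the old-name-to-new-name layout are assumptions made for this illustration:

```r
datatest <- data.frame(col1 = c(1, 2, 3, 4), col2 = c("A", "B", "C", "D"))
name1 <- "A"
name2 <- "B"

# Lookup vector: names are the old column names, values are the new ones
lookup <- c(col1 = name1, col2 = name2)

# Rename only the columns that appear in the lookup; others are left alone
hit <- names(datatest) %in% names(lookup)
names(datatest)[hit] <- unname(lookup[names(datatest)[hit]])
names(datatest)
```

Because the new names are ordinary character values, they can change at run time without touching the renaming code.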

Subsetting data, if the column entry contains letters

I have data as follows:
DT <- as.data.frame(c("1","2", "3", "A", "B"))
names(DT)[1] <- "charnum"
What I want is quite simple, but I could not find an example on it on stackoverflow.
I want to split the dataset into two. DT1 with all the rows for which DT$charnum has numbers and DT2 with all the rows for which DT$charnum has letters. I tried something like:
DT1 <- DT[is.numeric(as.numeric(DT$charnum)),]
But that gives:
[1] 1 2 3 A B
Levels: 1 2 3 A B
Desired result:
> DT1
charnum
1 1
2 2
3 3
> DT2
charnum
1 A
2 B
You can use a regular expression to tell the two types of data apart and then split the dataset. split() returns the FALSE group (letters) first and the TRUE group (numbers) second, so index the result by name:
result <- split(DT, grepl('^\\d+$', DT$charnum))
DT1 <- type.convert(result[["TRUE"]], as.is = TRUE)
DT1
# charnum
#1 1
#2 2
#3 3
DT2 <- type.convert(result[["FALSE"]], as.is = TRUE)
DT2
# charnum
#4 A
#5 B
Using tidyverse
library(dplyr)
library(purrr)
library(stringr)
DT %>%
group_split(grp = str_detect(charnum, "\\d+"), .keep = FALSE) %>%
map(type.convert, as.is = TRUE)
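The attempt in the question fails because is.numeric(as.numeric(x)) returns a single TRUE for the whole vector rather than one value per row. A base R sketch of a per-row test, which also accepts decimals and negative numbers, could look like this (is_num is an illustrative name):

```r
DT <- data.frame(charnum = c("1", "2", "3", "A", "B"), stringsAsFactors = FALSE)

# Convert each entry; anything that is not a number becomes NA
num <- suppressWarnings(as.numeric(as.character(DT$charnum)))
is_num <- !is.na(num)                  # one TRUE/FALSE per row

DT1 <- data.frame(charnum = num[is_num])      # rows holding numbers
DT2 <- DT[!is_num, , drop = FALSE]            # rows holding letters
rownames(DT2) <- NULL                         # renumber rows from 1
```

Note that this sketch would also treat a genuine NA entry as non-numeric.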

elements of list column matching rows in other data.frame

I have the following two data.frames:
df1 <- data.frame(Var1=c(3,4,8,9),
Var2=c(11,32,1,7))
> df1
Var1 Var2
1 3 11
2 4 32
3 8 1
4 9 7
df2 <- data.frame(ID=c('A', 'B', 'C'),
ball=I(list(c("3","11", "12"), c("4","1"), c("9","32"))))
> df2
ID ball
1 A 3, 11, 12
2 B 4, 1
3 C 9, 32
Note that column ball in df2 is a list.
I want to select the ID in df2 with elements in column ball that match a row in df1.
The ideal output would look like this:
> df3
ID ball1 ball2
1 A 3 11
Does anyone have an idea how to do this efficiently? The original data consists of millions of rows in both data.frames.
A data.table solution would work much more quickly than this base R solution but here is a possibility.
your data:
df1 <- data.frame(Var1=c(3,4,8,9),
Var2=c(11,32,1,7))
df2 <- data.frame(ID=c('A', 'B', 'C'),
ball=I(list(c("3","11", "12"), c("4","1"), c("9","32"))))
the process:
df2$ID <- as.character(df2$ID) # just in case they are factor levels instead
n <- nrow(df1) * nrow(df2) # upper bound on the number of matches, so df3 is big enough
df3 <- data.frame(ID = character(n),
Var1 = numeric(n), Var2 = numeric(n),
stringsAsFactors = FALSE) # to make sure we get the ID as a string
count <- 0 # number of matches found so far
for(i in seq_len(nrow(df1))){
for(j in seq_len(nrow(df2))){
# a row matches when both Var1 and Var2 appear in the ball vector
if(all(df1[i,] %in% df2$ball[[j]])){
count <- count + 1
df3$ID[count] <- df2$ID[j]
df3$Var1[count] <- df1$Var1[i]
df3$Var2[count] <- df1$Var2[i]
}
}
}
df3_final <- df3[seq_len(count), ] # drop the unused rows, since we overestimated the size of df3
df3_final
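The nested loop compares every row of df1 with every row of df2, which is unlikely to scale to millions of rows. One possible alternative, sketched here in base R, is to flatten the list column into a long table and join twice. The names long, m1 and m2 are illustrative, and the sketch assumes the ball values are numeric strings without duplicates inside one ID:

```r
df1 <- data.frame(Var1 = c(3, 4, 8, 9), Var2 = c(11, 32, 1, 7))
df2 <- data.frame(ID = c("A", "B", "C"), stringsAsFactors = FALSE)
df2$ball <- list(c("3", "11", "12"), c("4", "1"), c("9", "32"))

# Long format: one row per (ID, ball value)
long <- data.frame(ID = rep(df2$ID, lengths(df2$ball)),
                   ball = as.numeric(unlist(df2$ball)),
                   stringsAsFactors = FALSE)

# Step 1: find the IDs whose balls contain Var1
m1 <- merge(df1, long, by.x = "Var1", by.y = "ball")
# Step 2: of those, keep the IDs whose balls also contain Var2
m2 <- merge(m1, long, by.x = c("ID", "Var2"), by.y = c("ID", "ball"))

df3 <- data.frame(ID = m2$ID, ball1 = m2$Var1, ball2 = m2$Var2,
                  stringsAsFactors = FALSE)
df3
```

The same two joins should translate directly to data.table or dplyr if base merge() turns out to be too slow.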

In R, sort a list of dataframes by name, then calculate sum of two columns in each data frame

I searched the forum for a bit, but I couldn't find a question similar to mine. Basically, I have a list of dataframes that have the same column names. I want to first sort the dataframes in the list by number, then calculate the sum of Col1 and Col2 in each dataframe, and store the results in a vector that follows the sorted order of the list.
I thought list[order(names(list))] would work, but it didn't.
For example:
df1 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(2,3,4,5,6), Col3=rep("a",5))
df3 <- data.frame(Col1=c(5,4,3,2,1),Col2=c(6,5,4,3,2), Col3=rep("a",5))
df2 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(1,2,3,4,5), Col3=rep("a",5))
list <- list(df1, df3, df2)
>list
$df1
Col1 Col2 Col3
1 2 a
2 3 a
3 4 a
4 5 a
5 6 a
$df3
Col1 Col2 Col3
5 6 a
4 5 a
3 4 a
2 3 a
1 2 a
$df2
Col1 Col2 Col3
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
First, I want to sort it, like this
$df1
Col1 Col2 Col3
1 2 a
2 3 a
3 4 a
4 5 a
5 6 a
$df2
Col1 Col2 Col3
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
$df3
Col1 Col2 Col3
5 6 a
4 5 a
3 4 a
2 3 a
1 2 a
Then, I want to get the sum of Col1 and Col2 in each dataframe, and store it in a new vector (let's call it x). The result should look like this
x
35, 30, 35
With what I presented, I would imagine that there is both a for-loop solution and a lapply solution.
Here is a one-line method using an anonymous function. The list has to be named, otherwise names(list) is NULL and there is nothing to order by:
a <- "a"
df1 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(2,3,4,5,6), Col3=rep(a,5))
df3 <- data.frame(Col1=c(5,4,3,2,1),Col2=c(6,5,4,3,2), Col3=rep(a,5))
df2 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(1,2,3,4,5), Col3=rep(a,5))
list <- list(df1 = df1, df3 =df3, df2 =df2)
r = unlist(lapply(list[order(names(list))], function(df) {sum(df[,1]) + sum(df[,2])}))
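Since the question asks for both an lapply-style and a for-loop solution, here is a sketch of each. It assumes the list is named, and it sorts by the number embedded in each name so that, for instance, df10 would come after df2 (plain alphabetical ordering would put it before). The names dfs, ord, x and x_loop are illustrative:

```r
dfs <- list(
  df1 = data.frame(Col1 = c(1, 2, 3, 4, 5), Col2 = c(2, 3, 4, 5, 6)),
  df3 = data.frame(Col1 = c(5, 4, 3, 2, 1), Col2 = c(6, 5, 4, 3, 2)),
  df2 = data.frame(Col1 = c(1, 2, 3, 4, 5), Col2 = c(1, 2, 3, 4, 5))
)

# Sort by the numeric part of each name
ord <- order(as.integer(gsub("[^0-9]", "", names(dfs))))
dfs <- dfs[ord]

# vapply version: one number per data frame, names are kept
x <- vapply(dfs, function(d) sum(d$Col1) + sum(d$Col2), numeric(1))

# Equivalent for-loop version
x_loop <- numeric(length(dfs))
for (k in seq_along(dfs)) {
  x_loop[k] <- sum(dfs[[k]]$Col1) + sum(dfs[[k]]$Col2)
}
x
```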
Here is an approach using the sqldf package. Is this what you need?
library(sqldf)
df1 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(2,3,4,5,6))
df3 <- data.frame(Col1=c(5,4,3,2,1),Col2=c(6,5,4,3,2))
df2 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(1,2,3,4,5))
list <- list(df1, df3, df2)
list
df1 <- sqldf("SELECT * FROM df1 ORDER BY Col1, Col2")
df2 <- sqldf("SELECT * FROM df2 ORDER BY Col1, Col2")
df3 <- sqldf("SELECT * FROM df3 ORDER BY Col1 DESC, Col2 DESC")
df1
df2
df3
df1 <- sqldf("SELECT SUM(Col1 +Col2) FROM df1")
df2 <- sqldf("SELECT SUM(Col1+Col2) FROM df2")
df3 <- sqldf("SELECT SUM(Col1+Col2) FROM df3")
df1
df2
df3
x <- unlist(c(df1, df2, df3), use.names = FALSE) # flatten the three one-cell results into a numeric vector
x
which gives the following result:
> x
[1] 35 30 35

Only Keep Certain Combinations of Predictors in a Dataframe

Imagine that I have a data frame like this:
> col1 <- rep(1:3,10)
> col2 <- rep(c("a","b"),15)
> col3 <- rnorm(30,10,2)
> sample_df <- data.frame(col1 = col1, col2 = col2, col3 = col3)
> head(sample_df)
col1 col2 col3
1 1 a 13.460322
2 2 b 3.404398
3 3 a 8.952066
4 1 b 11.148271
5 2 a 9.808366
6 3 b 9.832299
I only want to keep combinations of predictors which, together, have a col3 standard deviation below 2. I can find the combinations using ddply from the plyr package, but I don't know how to backtrack to the original DF and select the correct levels.
> sample_df_summ <- ddply(sample_df, .(col1, col2), summarize, sd = sd(col3), count = length(col3))
> head(sample_df_summ)
col1 col2 sd count
1 1 a 2.702328 5
2 1 b 1.032371 5
3 2 a 2.134151 5
4 2 b 3.348726 5
5 3 a 2.444884 5
6 3 b 1.409477 5
For clarity, in this example, I'd like the rows of the DF with col1 = 3, col2 = b and with col1 = 1, col2 = b. How would I do this?
You can add a "keep" column that is TRUE only if the standard deviation is below 2. Then, you can use a left join (merge) to add the "keep" column to the initial dataframe. In the end, you just select with keep equal to TRUE.
# add the keep column
sample_df_summ$keep <- sample_df_summ$sd < 2
sample_df_summ$sd <- NULL
sample_df_summ$count <- NULL
# join and select the rows
sample_df_keep <- merge(sample_df, sample_df_summ, by = c("col1", "col2"), all.x = TRUE, all.y = FALSE)
sample_df_keep <- sample_df_keep[sample_df_keep$keep, ]
sample_df_keep$keep <- NULL
Using dplyr:
library(dplyr)
sample_df %>% group_by(col1, col2) %>% mutate(sd = sd(col3)) %>% filter(sd < 2)
You get (the values differ from those in the question because col3 is drawn with rnorm):
#Source: local data frame [6 x 4]
#Groups: col1, col2
#
# col1 col2 col3 sd
#1 1 a 10.516437 1.4984853
#2 1 b 11.124843 0.8652206
#3 2 a 7.585740 1.8781241
#4 3 b 9.806124 1.6644076
#5 1 a 7.381209 1.4984853
#6 1 b 9.033093 0.8652206
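The same group-wise filter can also be sketched in base R with ave(), which returns the group statistic aligned with the original rows, so no join back is needed. The seed and the names grp_sd and kept are assumptions made for this illustration:

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
sample_df <- data.frame(col1 = rep(1:3, 10),
                        col2 = rep(c("a", "b"), 15),
                        col3 = rnorm(30, 10, 2))

# Standard deviation of col3 within each (col1, col2) group, one value per row
grp_sd <- ave(sample_df$col3, sample_df$col1, sample_df$col2, FUN = sd)

# Keep only the rows whose group has a standard deviation below 2
kept <- sample_df[grp_sd < 2, ]
head(kept)
```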
