I have a data frame like so:
df <- data.frame(
id = c(1, 1, 2, 2),
V1 = c(1:4),
V2 = c(5:8),
V3 = c(9:12))
Printed to the console it looks like this:
# id V1 V2 V3
# 1 1 1 5 9
# 2 1 2 6 10
# 3 2 3 7 11
# 4 2 4 8 12
Now, I would like to transform it to this shape:
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
How can I do this with base R or the tidyverse?
a possible tidyverse solution
wide <- df %>%
group_by(id) %>%
mutate(obs = row_number()) %>%
gather(var, val, V1:V3) %>%
unite(comb, obs, var) %>%
spread(comb, val)
colnames(wide)[-1] <- paste("V", seq(1,ncol(wide) -1), sep = "")
# A tibble: 2 x 7
# Groups: id [2]
# id V1 V2 V3 V4 V5 V6
#1 1 1 5 9 2 6 10
#2 2 3 7 11 4 8 12
You could do it with e.g. using by.
df2 <- do.call(rbind,
by(df, df$id, function(x) c(x[1, "id"], as.vector(t(x[names(x) != "id"]))))
)
colnames(df2) <- c("id", paste0("V", seq(ncol(df2)-1)))
id V1 V2 V3 V4 V5 V6
1 1 1 5 9 2 6 10
2 2 3 7 11 4 8 12
Base R:
lists <- Map(function(x) data.frame(c(x[1,], x[2,-1])), split(df, df$id))
df2 <- do.call(rbind, lists)
To change the column names:
colnames(df2) <- c("id", paste0("V", seq_along(df2[-1])))
And the result:
# > df2
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
Related
I have a dataset of survey responses similar to this:
toy <- data.frame(v1 = c(1,2,3), v2 = c(1,6,3), v3 = c(1,2,4), v4 = c(1,7,3))
toy
v1 v2 v3 v4
1 1 1 1 1
2 2 6 2 7
3 3 3 4 3
I want to detect "straightlining" by finding the most common value for each row and calculating the proportion of columns with that value.
Two examples:
if the value of every column in a row is 5, then the new variable should return 1
If the value of 60% of the columns in a row is 3 and 40% of the columns is 4, then the variable should return .6
Desired output:
v1 v2 v3 v4 straightline_pct
1 1 1 1 1 1
2 2 6 2 7 .50
3 3 3 4 3 .75
One base approach:
toy <- data.frame(v1 = c(1,2,3), v2 = c(1,6,3), v3 = c(1,2,4), v4 = c(1,7,3))
toy$straightline_pct = apply(as.matrix(toy),
1L,
function (x) max(prop.table(table(x)))
)
toy
#> v1 v2 v3 v4 straightline_pct
#> 1 1 1 1 1 1.00
#> 2 2 6 2 7 0.50
#> 3 3 3 4 3 0.75
Slight variation with just table
toy$straightline_pct <- apply(toy, 1, function(x) max(table(x))/length(x) )
toy
v1 v2 v3 v4 straightline_pct
1 1 1 1 1 1.00
2 2 6 2 7 0.50
3 3 3 4 3 0.75
A possible solution:
library(tidyverse)
toy <- data.frame(v1 = c(1,2,3), v2 = c(1,6,3), v3 = c(1,2,4), v4 = c(1,7,3))
toy %>%
rowwise %>%
mutate(perc = table(c_across(everything())) %>%
{(ncol(toy) - length(.) + 1) / ncol(toy)}) %>% ungroup
#> # A tibble: 3 × 5
#> v1 v2 v3 v4 perc
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1 1
#> 2 2 6 2 7 0.5
#> 3 3 3 4 3 0.75
An alternative, based on ave function:
toy %>%
rowwise %>%
mutate(perc = c_across(1:4) %>%
{max(ave(., ., FUN=length)) / ncol(toy)}) %>% ungroup
I like #Paul Smith's and #Cole's answers better, but for completeness here's a more verbose approach:
library(tidyverse)
toy %>%
bind_cols(toy %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
count(row, value) %>%
group_by(row) %>%
mutate(straightline_pct = n / sum(n)) %>%
slice_max(straightline_pct) %>%
ungroup() %>%
select(straightline_pct)
)
Here is a simple and verbose solution that is largely similar to other answers already:
library(tidyverse)
toy %>%
rowwise() %>%
mutate(
straightline_pct = max(table(c_across(everything()))) / ncol(.)
) %>%
ungroup()
# A tibble: 3 x 5
v1 v2 v3 v4 straightline_pct
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 2 6 2 7 0.5
3 3 3 4 3 0.75
If you can turn your data into a matrix first:
toy <- data.matrix(toy)
storage.mode(toy) <- "integer"
Then with the help of matrixStats:
library(matrixStats)
rowMaxs(rowTabulates(toy)) / ncol(toy)
[1] 1.00 0.50 0.75
I would like to remove part of a string from a V2 column in a df.
df
V1 V2
3 scale_KD_1
10 scale_KD_5
4 scale_KD_10
7 scale_KD_7
The desired outcome would be:
df
V1 V2
3 1
10 5
4 10
7 7
Using readr and stringr packages:
library(readr)
df %>% mutate(V2 = parse_number(V2))
V1 V2
1 3 1
2 10 5
3 4 10
4 7 7
library(stringr)
df %>% mutate(V2 = str_remove(V2, '.*_'))
V1 V2
1 3 1
2 10 5
3 4 10
4 7 7
There are many ways to accomplish this. Just check which one is faster. Besides the ones mentioned by #Karthik S, you can try these ones:
library(dplyr)
library(stringr)
df %>%
mutate(V2 = str_extract(V2, '\\d+$'))
df %>%
mutate(V2 = str_remove(V2, '\\D+'))
V1 V2
1 3 1
2 10 5
3 4 10
4 7 7
You can use sub to remove everything until _:
df$V2 <- sub(".*_", "", df$V2)
#df$V2 <- sub("\\D*", "", df$V2) #Some Alternatives
#df$V2 <- sub("[^[:digit:]]*", "", df$V2)
df
# V1 V2
#1 3 1
#2 10 5
#3 4 10
#4 7 7
Data:
df <- read.table(header=T, text=" V1 V2
3 scale_KD_1
10 scale_KD_5
4 scale_KD_10
7 scale_KD_7")
I have two dataframes:
df1 <- data.frame( v1 = c(1,2,3,4),
v2 = c(2, 10, 5, 11),
v3=c(20, 25, 23, 2))
> df1
v1 v2 v3
1 1 2 20
2 2 10 35
3 3 5 23
4 4 11 2
df2 <- data.frame(v1 = 4, = 10, v3 = 30)
> df2
v1 v2 v3
1 4 10 30
I want to add a new column that would say "Fail" when df1 is larger than df2 and "Pass" when it is smaller so that the intended results would be:
> df3
v1 v2 v3 check
1 1 2 20 Pass
2 2 10 35 Fail
3 3 5 23 Pass
4 4 11 2 Fail
You can make size of both the dataframes similar and directly compare :
ifelse(rowSums(df1 >= df2[rep(1,length.out = nrow(df1)), ]) == 0, 'Pass', 'Fail')
#[1] "Pass" "Fail" "Pass" "Fail"
Or using Map :
ifelse(Reduce(`|`, Map(`>=`, df1, df2)), 'Fail', 'Pass')
#Other similar alternatives :
#c('Pass', 'Fail')[Reduce(`|`, Map(`>=`, df1[-1], df2[-1])) + 1]
#c('Fail', 'Pass')[(rowSums(mapply(`>=`, df1, df2)) == 0) + 1]
In tidyverse, we can make use of c_across
library(dplyr) # >= 1.0.0
df1 %>%
rowwise %>%
mutate(check = c('Pass', 'Fail')[1 + any(c_across(everything()) >= df2)])
# A tibble: 4 x 4
# Rowwise:
# v1 v2 v3 check
# <dbl> <dbl> <dbl> <chr>
#1 1 2 20 Pass
#2 2 10 25 Fail
#3 3 5 23 Pass
#4 4 11 2 Fail
I know that my problem is trival, however now I'm learing methods how to reshape data in different ways, so please be understanding.
I have data like this:
Input = (
'col1 col2
A 2
B 4
A 7
B 3
A 4
B 2
A 4
B 6
A 3
B 3')
df = read.table(textConnection(Input), header = T)
> df
col1 col2
1 A 2
2 B 4
3 A 7
4 B 3
5 A 4
6 B 2
7 A 4
8 B 6
9 A 3
10 B 3
And I'd like to have something like this, where the column names are not important:
col1 v1 v2 v3 v4 v5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
So far, I did something like:
res_1 <- aggregate(col2 ~., df, toString)
col1 col2
1 A 2, 7, 4, 4, 3
2 B 4, 3, 2, 6, 3
And it actually works, however, I have one column and valiues are comma separated, instead of being in new columns, so I decided to fix it up:
res_2 <- do.call("rbind", strsplit(res_1$col2, ","))
[,1] [,2] [,3] [,4] [,5]
[1,] "2" " 7" " 4" " 4" " 3"
[2,] "4" " 3" " 2" " 6" " 3"
Adn finally combine it and remove unnecessary columns:
final <- cbind(res_1,res_2)
final$col2 <- NULL
col1 1 2 3 4 5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
So I have my desired output, but I'm not satisfied about the method, I'm sure there's one easy and short command for this. As I said I'd like to learn new more elegant options using different packages.
Thanks!
You can simply do,
do.call(rbind, split(df$col2, df$col1))
# [,1] [,2] [,3] [,4] [,5]
#A 2 7 4 4 3
#B 4 3 2 6 3
You can wrap it to data.frame() to convert from matrix to df
The question is tagged with reshape2 and reshape so we show the use of that package and the base reshape function. Also the use of dplyr/tidyr is illustrated. Finally we show a data.table solution and a second base R solution using xtabs.
reshape2 Add a group column and then convert from long to wide form:
library(reshape2)
df2 <- transform(df, group = paste0("v", ave(1:nrow(df), col1, FUN = seq_along)))
dcast(df2, col1 ~ group, value.var = "col2")
giving:
col1 v1 v2 v3 v4 v5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
2) reshape Using df2 from (1) we have the following base R solution using the reshape function:
wide <- reshape(df2, dir = "wide", idvar = "col1", timevar = "group")
names(wide) <- sub(".*\\.", "", names(wide))
wide
giving:
col1 v1 v2 v3 v4 v5
1 A 2 7 4 4 3
2 B 4 3 2 6 3
3) dplyr/tidyr
library(dplyr)
library(tidyr)
df %>%
group_by(col1) %>%
mutate(group = paste0("v", row_number())) %>%
ungroup %>%
pivot_wider(names_from = "group", values_from = "col2")
giving:
# A tibble: 2 x 6
col1 v1 v2 v3 v4 v5
<fct> <int> <int> <int> <int> <int>
1 A 2 7 4 4 3
2 B 4 3 2 6 3
4) data.table
library(data.table)
as.data.table(df)[, as.list(col2), by = col1]
giving:
col1 V1 V2 V3 V4 V5
1: A 2 7 4 4 3
2: B 4 3 2 6 3
5) xtabs Another base R solution uses df2 from (1) and xtabs. This produces an object of class c("xtabs", "table")`. Note that it labels the dimensions.
xtabs(col2 ~., df2)
giving:
group
col1 v1 v2 v3 v4 v5
A 2 7 4 4 3
B 4 3 2 6 3
For each row in my dataframe, I want to find the second highest occurring value, as well as the least occurring value. How can i do this?
Df:
label v1 v2 v3 v4 v5 v6
5 3 3 3 6 6 8
5 7 1 1 1 7 0
5 3 5 6 6 6 5
I want to consider all columns besides 'label'
Expected output:
second largest occuring least occuring
6 8
7 0
5 3
Edit: I have updated the example after the answer was accepted to make it less confusing
A dplyr solution:
df %>%
rowid_to_column() %>%
gather(var, val, -label, -rowid) %>%
group_by(rowid, val) %>%
tally() %>%
summarise(second_largest_occuring = val[dense_rank(n) == 2],
least_occuring = val[n == min(n)]) %>%
ungroup() %>%
select(-rowid)
# A tibble: 3 x 2
second_largest_occuring least_occuring
<int> <int>
1 2 1
2 2 0
3 5 3
Data:
df <- read.table(text = "label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5", header= TRUE)
Another dplyr solution that is a bit more readable and handles errors for NA and instances where there are multiple occurrences of the second largest. This solution also allows you to select multiple columns using dplyr language.
library(dplyr)
dat = read.table(text = 'label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5', header = T)
second_largest <- function(x,na.rm = TRUE) {
if(na.rm) { x <- na.omit(x) } # omit NA values
second_largest <- x[dense_rank(x) == 2] # return all values where the rank is equal to 2nd largest
second_largest <- max(second_largest) # keep one value out of all the second largest, or NA
return(second_largest)
}
df <- dat %>%
mutate(
second_largest = select(., v1:v6) %>% apply(1, second_largest,na.rm = TRUE), # apply second_largest func to every row
min = select(., v1:v6) %>% apply(1,min,na.rm = TRUE) # apply min to every row
)
# label v1 v2 v3 v4 v5 v6 second_largest min
# 1 5 3 3 3 2 2 1 2 1
# 2 5 2 1 1 1 2 0 1 0
# 3 5 3 5 6 6 6 5 5 3
A few things to notice.
In the apply statement the 1 indicates that the function should be applied to the rows.
Update
If you want the value of the second most frequent number you just plug in a new function to do that.
second_most_frequent <- function(x, is_numeric = TRUE) {
out <- x %>%
table() %>% # Create a table of frequencies as characters
as.data.frame(stringsAsFactors = FALSE) %>%
arrange(desc(Freq)) %>% # Arrange with frequency descending
.[,1] %>% # Select the first column
.[2] # select the second most frequent (WARNING: Doesn't check for ties)
if(is_numeric){ out <- as.numeric(out) }
return(out)
}
df <- df %>%
mutate(
second_most_freq = select(., v1:v6) %>% apply(1,second_most_frequent,is_numeric = TRUE)
)
# label v1 v2 v3 v4 v5 v6 second_largest min second_most_freq
# 1 5 3 3 3 2 2 1 2 1 2
# 2 5 2 1 1 1 2 0 1 0 2
# 3 5 3 5 6 6 6 5 5 3 5