I have a data frame with non-numeric values with the following format:
DF1:
col1 col2
1 a b
2 a c
3 z y
4 z x
5 a d
6 m n
I need to convert it into this format,
DF2:
col1 col2 col3 col4
1 a b c d
2 z y x NA
3 m n NA NA
Here col1 acts as the primary key (not sure if this is the right terminology in R), and the remaining columns contain the elements associated with that key (as seen in DF1).
DF2 will have more columns than DF1, depending on the largest number of elements associated with any key.
Because keys have different numbers of elements, some cells will have no value; these are represented as NA (as shown in DF2).
The column names could be anything.
I have tried reshape(), melt() + cast(), and even a plain for loop in which I use cbind and then try to delete the row.
It is part of a very big dataset with over 50 million rows. I might have to use cloud services for this task but that is a different discussion.
I am new to R so there might be some obvious solution which I am missing.
Any help would be much appreciated.
-Thanks
If this is a big dataset, we can use data.table
library(data.table)
setDT(DF1)[, i1:=paste0("col", seq_len(.N)+1L), col1]
dcast(DF1, col1~i1, value.var='col2')
# col1 col2 col3 col4
#1: a b c d
#2: m n NA NA
#3: z y x NA
Using dplyr and tidyr :
library(tidyr)
library(dplyr)
DF <- tibble(col1 = c("a", "a", "z", "z", "a", "m"),
             col2 = c("b", "c", "y", "x", "d", "n"))
# you need another column to serve as the key for spreading
DF %>%
  group_by(col1) %>%
  mutate(colname = paste0("col", 1:n() + 1)) %>%
  spread(colname, col2)
#> Source: local data frame [3 x 4]
#> Groups: col1 [3]
#>
#> col1 col2 col3 col4
#> (chr) (chr) (chr) (chr)
#> 1 a b c d
#> 2 m n NA NA
#> 3 z y x NA
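Both answers above follow the same two steps: number the rows within each key, then spread those positions into columns. Since the question mentions trying reshape(), here is a minimal base R sketch of the same idea. The helper column name pos and the final renaming to col2, col3, ... are assumptions made for illustration, and this has not been tuned for 50 million rows.

```r
DF1 <- data.frame(col1 = c("a", "a", "z", "z", "a", "m"),
                  col2 = c("b", "c", "y", "x", "d", "n"),
                  stringsAsFactors = FALSE)

# Position of each row within its col1 group: 1, 2, 3, ...
# ("pos" is a hypothetical helper column, not part of the original data)
DF1$pos <- ave(seq_along(DF1$col1), DF1$col1, FUN = seq_along)

# One row per key, one column per within-group position; gaps become NA
DF2 <- reshape(DF1, idvar = "col1", timevar = "pos", direction = "wide")

# reshape() names the new columns col2.1, col2.2, ...; rename to col2, col3, ...
names(DF2) <- c("col1", paste0("col", seq_len(ncol(DF2) - 1) + 1))
rownames(DF2) <- NULL
DF2
```

For a table of the size mentioned in the question, the data.table version is likely the more practical choice; the base R version is mainly useful for seeing what each step does.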
I have a data frame on R. I would like to get the unique rows based on the first three columns and also append the min value of the 4th column in each unique row.
dat <- tibble(
x = c("a", "a", "k", "k"),
y = c("a", "a", "l", "l"),
z = c("e", "e", "m" ,"m"),
t = c("4", "3", "8" ,"9"))
What I would like to see is below.
x y z t
a a e 3
k l m 8
I believe there is a very easy way to do that, but I cannot see it at the moment.
With tidyverse, use group_by with summarise
library(dplyr)
dat %>%
group_by(across(x:z)) %>%
summarise(t = min(t), .groups = 'drop')
-output
# A tibble: 2 × 4
x y z t
<chr> <chr> <chr> <chr>
1 a a e 3
2 k l m 8
Or do an arrange and use distinct
dat %>%
arrange(across(everything())) %>%
distinct(across(x:z), .keep_all = TRUE)
# A tibble: 2 × 4
x y z t
<chr> <chr> <chr> <chr>
1 a a e 3
2 k l m 8
We can call duplicated() on the first three columns of dat to flag rows that repeat an earlier combination, negate it with ! to mark the rows that are not duplicates, and use which() to obtain the integer positions of those rows. Finally, we use these integers (unique_rows) to extract the unique rows from dat, so nothing has to be appended. Note that this keeps the first row of each combination, so t holds the first value encountered rather than the minimum unless dat is sorted by t beforehand.
unique_rows <- which(!duplicated(dat[, 1:3]))
out <- dat[unique_rows, ]
Output
> out
x y z t
1 a a e 4
3 k l m 8
Another way to deal with this would be to take minimum value of t column and keep remaining columns as group in aggregate function.
aggregate(t~., dat, min)
# x y z t
#1 a a e 3
#2 k l m 8
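One caveat applies to all of the answers above: in the question's data, t is a character column, so min() compares strings rather than numbers. That happens to give the expected result for single digits, but not in general. A minimal sketch, assuming the values in t are meant to be numeric:

```r
dat <- data.frame(x = c("a", "a", "k", "k"),
                  y = c("a", "a", "l", "l"),
                  z = c("e", "e", "m", "m"),
                  t = c("4", "3", "8", "9"),
                  stringsAsFactors = FALSE)

# On character vectors min() compares strings, so "10" sorts before "9"
min(c("10", "9"))
# [1] "10"

# Convert to numeric first (assumption: every value in t parses as a number)
dat$t <- as.numeric(dat$t)
res <- aggregate(t ~ x + y + z, data = dat, FUN = min)
res
```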
I have a data frame with 4 groups (defined by categories "a" and "b" in column 1 and categories "X" and "Y" in column 2). I want to rank the attributes in column 3 by their values in column 4, but specifically within the groups in columns 1 and 2 (AX, AY, BX, BY), and then select only the top n (e.g., n = 2) values from each group.
arrange(col1, col2, desc(col4)) works to arrange the data, but because the data are not technically grouped, functions like top_n return just the top n values of the entire list. I thought of using slice_max but can't install the beta version of dplyr from GitHub on my restricted network. What is the best approach?
Original data:
col1 col2 col3 col4
a X pat 1
b Y dog 2
b X leg 3
a X hog 4
b Y egg 5
a Y log 6
b X map 7
b Y ice 8
b X mat 9
a Y sat 10
arrange(col1, col2, desc(col4)) gives
col1 col2 col3 col4
a X hog 4
a X pat 1
a Y sat 10
a Y log 6
b X mat 9
b X map 7
b X leg 3
b Y ice 8
b Y egg 5
b Y dog 2
but I cannot figure out how to filter this down to just the top 2 values.
(example input code below)
col1 <- c('a','b','b','a','b','a','b','b','b','a')
col2 <- c('X','Y','X','X','Y','Y','X','Y','X','Y')
col3 <- c('pat','dog','leg','hog','egg','log','map','ice','mat','sat')
col4 <- c(1,2,3,4,5,6,7,8,9,10)
df <- data.frame(col1,col2,col3,col4)
colA <- c('a','a','a','a','b','b','b','b','b','b')
colB <- c('X','X','Y','Y','X','X','X','Y','Y','Y')
colC <- c('hog','pat','sat','log','mat','map','leg','ice','egg','dog')
colD <- c(4,1,10,6,9,7,3,8,5,2)
df1 <- data.frame(colA,colB,colC,colD)
We can use top_n after grouping by 'col1' and 'col2', ranking on 'col4':
library(dplyr)
df %>%
  group_by(col1, col2) %>%
  top_n(2, col4)
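Because the question mentions a restricted network, a base R version with no package dependencies may also be useful. This is a minimal sketch, not the answer's method: it sorts as in the question, numbers the rows within each (col1, col2) group, and keeps the first n. The names ord, rank_in_group and top are made up for illustration.

```r
col1 <- c('a','b','b','a','b','a','b','b','b','a')
col2 <- c('X','Y','X','X','Y','Y','X','Y','X','Y')
col3 <- c('pat','dog','leg','hog','egg','log','map','ice','mat','sat')
col4 <- c(1,2,3,4,5,6,7,8,9,10)
df <- data.frame(col1, col2, col3, col4, stringsAsFactors = FALSE)

n <- 2

# Sort by group, then by col4 descending within each group
ord <- df[order(df$col1, df$col2, -df$col4), ]

# Row number within each (col1, col2) group of the sorted data
rank_in_group <- ave(ord$col4, ord$col1, ord$col2, FUN = seq_along)

# Keep the first n rows of every group
top <- ord[rank_in_group <= n, ]
top
```

Ties are broken by row order here; top_n would instead keep all tied rows, so the two can differ when col4 contains duplicates.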
I have a data table in the format:
myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)
Col1 Col2
1: A 1
2: A 2
3: A 3
4: B 4
5: B 5
6: B 6
I want to show only the highest result for each category in Col1, then collapse all others and present their sum in Col2. It should look like this:
print(myTable)
Col1 Col2
1: A 3
2: Others 3
3: B 6
4: Others 9
I managed to do it with the following code:
unique <- unique(myTable$Col1) # unique values in Col1
myTable2 <- data.table() # empty data table to populate
for(each in unique){
temp <- myTable[Col1 == each, ] # filter myTable for unique Col1 values
temp <- temp[order(-Col2)] # order filtered table by decreasing Col2
sumCol2 <- sum(temp$Col2) # sum of values in filtered Col2
temp <- temp[1, ] # retain only first element
remSum <- sumCol2 - sum(temp$Col2) # remaining sum in Col2 (without first element)
temp <- rbindlist(list(temp, data.table("Others", remSum))) # rbind first element and remaining elements
myTable2 <- rbindlist(list(myTable2, temp)) # populate data table from beginning
}
This works, but I am trying to shorten a very large data table, so it takes forever.
Is there any better way to approach this?
Thanks.
UPDATE: Actually my procedure is a little bit more complicated. I figured I would be able to develop it myself after the basics were mastered, but it seems I will need further help instead. I want to display the 5 highest values for each category in Col1 and collapse the others, but some categories in Col1 do not have 5 values; in these cases, all entries should be displayed, and no "Others" row should be added.
Here the data is split into groups according to the value of Col1 (by = Col1). .N is the index of the last row in the given group, so c(Col2[.N], sum(Col2) - Col2[.N]) gives the last value of Col2, and the sum of Col2 minus the last value. This relies on Col2 being sorted in increasing order within each group, so that the last value is also the highest. The newly created variables are surrounded by .() because .() is an alias for the list() function when using data.table, and the created columns need to go in a list.
library(data.table)
setDT(df)
df[, .(Col1 = c(Col1, 'Others'),
Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
, by = Col1][, -1]
# Col1 Col2
# 1: A 3
# 2: Others 3
# 3: B 6
# 4: Others 9
If it is just a matter of displaying things, you could use the 'tables' package:
others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)
# Col1 Col2
# A last 3
# others 3
# B last 6
# others 9
do.call(
rbind, lapply(split(myTable, factor(myTable$Col1)), function(x) rbind(x[which.max(x$Col2),], list("Other", sum(x$Col2[-which.max(x$Col2)]))))
)
# Col1 Col2
#1: A 3
#2: Other 3
#3: B 6
#4: Other 9
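The split/lapply pattern above can be extended to the case in the question's update: keep the n highest values per category and add an "Others" row only when something is left over. The following is a minimal base R sketch under those assumptions; top_n_with_others is a hypothetical helper name, and for a very large table a data.table translation of the same logic would probably be faster.

```r
# The question's example table, as a plain data frame (base R only)
myTable <- data.frame(Col1 = c("A", "A", "A", "B", "B", "B"),
                      Col2 = 1:6,
                      stringsAsFactors = FALSE)

# Hypothetical helper: keep the n largest Col2 values per Col1 group and
# collapse the rest into one "Others" row (omitted when nothing remains)
top_n_with_others <- function(df, n) {
  pieces <- lapply(split(df, df$Col1), function(g) {
    s    <- sort(g$Col2, decreasing = TRUE)
    keep <- s[seq_along(s) <= n]
    rest <- s[seq_along(s) > n]
    out  <- data.frame(Col1 = rep(g$Col1[1], length(keep)), Col2 = keep,
                       stringsAsFactors = FALSE)
    if (length(rest) > 0) {
      out <- rbind(out, data.frame(Col1 = "Others", Col2 = sum(rest),
                                   stringsAsFactors = FALSE))
    }
    out
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

res1 <- top_n_with_others(myTable, 1)  # highest value plus "Others" per group
res5 <- top_n_with_others(myTable, 5)  # groups have only 3 rows: no "Others"
```

With n = 1 this reproduces the table requested in the question; with n = 5 every row is kept because no category has more than five entries.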
I did it! I made a new myTable to illustrate. I want to retain only the 4 highest values by category, and collapse the others.
set.seed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)
Col1 Col2
1: A 8
2: A 5
3: A 2
4: B 7
5: B 10
6: B 9
7: B 12
8: B 11
9: C 4
10: C 6
11: C 3
12: C 1
# set key to Col2, it will sort it increasingly
setkey(myTable, Col2)
# if there are more than 4 entries by Col1 category, will return all information, otherwise will return 4 entries completing with NA
myTable <- myTable[,.(Col2 = Col2[1:max(c(4, .N))]) , by = Col1]
# will print in Col1: 4 entries of Col1 category, then "Other"
# will print in Col2: 4 last entries of Col2 in that category, then the remaining sum
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]
# removes rows with NA inserted in first step
myTable <- na.omit(myTable)
# removes rows where Col2 = 0, inserted because that Col1 category had exactly 4 entries
myTable <- myTable[Col2 != 0]
Owooooo!
Here's a base R solution and the dplyr equivalent:
res <- aggregate(Col2 ~.,transform(
myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
# Col0 Col2
# 1 A 3
# 3 Other 3
# 2 B 6
# 4 Other 9
myTable %>%
group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
summarize_at("Col2",sum) %>%
ungroup %>%
select(-1)
# # A tibble: 4 x 2
# Col1 Col2
# <chr> <int>
# 1 A 3
# 2 Other 3
# 3 B 6
# 4 Other 9
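Both versions rely on the same trick, which is easier to see in isolation: duplicated(..., fromLast = TRUE) marks every occurrence of a value except the last one, and replace() relabels those marked rows. A small sketch on the question's Col1 values (the variable names are illustrative only); like the answers above, it assumes the row to keep is the last one in each category.

```r
Col1 <- c("A", "A", "A", "B", "B", "B")

# TRUE for every occurrence except the last one of each value
is_not_last <- duplicated(Col1, fromLast = TRUE)
is_not_last
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

# Relabel everything but the last row of each category
labels <- replace(Col1, is_not_last, "Other")
labels
# [1] "Other" "Other" "A"     "Other" "Other" "B"
```

Summing Col2 by the original Col1 together with this new label then yields one row for the kept value and one "Other" row per category.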
Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:
library(dplyr)
df %>% group_by(col1,col2) %>%
slice(which.min(is.na(col3)))
or this :
df %>%
group_by(col1,col2) %>%
arrange(col3) %>%
slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
A GENERAL SOLUTION
As written, the most general solution keeps only one row per value of col1; see the comment in the code to add col2 to the grouping variables. It assumes all NAs are on the right.
df %>% mutate(nna = df %>% is.na %>% rowSums) %>%
group_by(col1) %>% # or group_by(col1,col2)
slice(which.min(nna)) %>%
select(-nna)
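The same idea (count the NAs per row, then keep the most complete row for each key) can be sketched in base R. Here key_cols is a hypothetical name for the columns that identify duplicates; for the real data described in the question's edit it would hold the names of the first 13 columns.

```r
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"),
                 stringsAsFactors = FALSE)

# Columns that define a duplicate (assumption: adapt to the real key columns)
key_cols <- c("col1", "col2")

# Number of missing values in each row
n_missing <- rowSums(is.na(df))

# Put the most complete rows first, then drop later rows with a repeated key
ordered <- df[order(n_missing), ]
result  <- ordered[!duplicated(ordered[, key_cols]), ]
result
#   col1 col2 col3
# 3    1    2    c
# 2    3    4 <NA>
```

If two rows share a key and have the same number of NAs, the one that comes first in the original data is kept.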
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
With dplyr this is less direct: distinct has no fromLast argument, and older versions of arrange offer no "sort by all columns" option (newer versions accept arrange(across(everything()))).
I have the following data frame
>data.frame
col1 col2
     A
x    B
     C
     D
y    E
I need a new data frame that looks like:
>new.data.frame
col1 col2
     A
x
     C
     D
y
I just need a method for reading from col1 and, if there are ANY characters in col1, clearing the corresponding row value of col2. I was thinking about using an if statement and data.table for this, but I am unsure how to delete col2's values based on ANY characters being present in col1.
Something like this works:
# Create data frame
dat <- data.frame(col1=c(NA,"x", NA, NA, "y"), col2=c("A", "B", "C", "D", "E"))
# Create new data frame
dat_new <- dat
dat_new$col2[!is.na(dat_new$col1)] <- NA
# Check that it worked
dat
dat_new
This depends on what you mean by 'remove'. Here I'm assuming a blank string "". However, the same principle will apply for NAs
## create data frame
df <- data.frame(col1 = c("", "x", "","", "y"),
col2 = LETTERS[1:5],
stringsAsFactors = FALSE)
df
# col1 col2
# 1 A
# 2 x B
# 3 C
# 4 D
# 5 y E
## subset by non-blank values in col1, and replace the corresponding values in col2
df[df$col1 != "",]$col2 <- ""
## or df$col2[df$col1 != ""] <- ""
df
# col1 col2
# 1 A
# 2 x
# 3 C
# 4 D
# 5 y
And as you mentioned data.table, the code for this would be
library(data.table)
setDT(df)
## filter by non-blank entries in col1, and update col2 by reference (:=)
df[col1 != "", col2 := ""]
df
Using dplyr
library(dplyr)
df %>%
mutate(col2 = replace(col2, col1!="", ""))
# col1 col2
#1 A
#2 x
#3 C
#4 D
#5 y
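The answers above treat either NA or "" as the "empty" value in col1. If the real data mixes both, or contains cells holding only whitespace, a single logical mask can cover all three cases. This is a minimal base R sketch under that assumption; has_text is an illustrative name, and the sample data is made up to include each kind of empty cell.

```r
dat <- data.frame(col1 = c(NA, "x", "", " ", "y"),
                  col2 = LETTERS[1:5],
                  stringsAsFactors = FALSE)

# TRUE only where col1 holds at least one non-whitespace character
has_text <- !is.na(dat$col1) & nzchar(trimws(dat$col1))

# Clear col2 in exactly those rows
dat$col2[has_text] <- ""
dat$col2
# [1] "A" ""  "C" "D" "" 
```

Replacing with NA instead of "" works the same way; only the right-hand side of the assignment changes.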