Copy cell in a row if it matches column name - r

I searched for a while to try to solve this, but unfortunately couldn't find an answer.
In my data frame, the last column contains strings that match column names. I would like to create another column that, for each row, returns (copies) the value from the column named in that row.
For example, say my data is:
col1 <- c(1, 4, 6, 0, 5)
col2 <- c(4, 6, 7, 8, 6)
col3 <- c(0, 4, 2, 2, 1)
col4 <- c("col1", "col1", "col2", "col3", "col1")
df <- data.frame(col1, col2, col3, col4)
and what I want to achieve is col5 which copies relevant cells from each row:
col1 col2 col3 col4 col5
1 4 0 col1 1
4 6 4 col1 4
6 7 2 col2 7
0 8 2 col3 2
5 6 1 col1 5
Basically it looks at col4 and returns the value from the same row that matches that column name.
This is obviously a very simplified version of my data which is why I'd like to automate it.
I would really appreciate any help :)

We can use row/column matrix indexing to extract the elements and create 'col5': match() turns each name in col4 into a column position, and cbind() pairs it with the row number.
df$col5 <- df[-4][cbind(1:nrow(df), match(as.character(df$col4), colnames(df)))]
df$col5
#[1] 1 4 7 2 5
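The one-liner above relies on col4 being the last column, so that positions in colnames(df) line up with positions in df[-4]. A minimal sketch of the same matrix-indexing idea that looks the value columns up by name instead; value_cols and idx are illustrative names, not part of the original answer:

```r
col1 <- c(1, 4, 6, 0, 5)
col2 <- c(4, 6, 7, 8, 6)
col3 <- c(0, 4, 2, 2, 1)
col4 <- c("col1", "col1", "col2", "col3", "col1")
df <- data.frame(col1, col2, col3, col4)

# Columns that hold the values (everything except the lookup column)
value_cols <- setdiff(names(df), "col4")

# Two-column index matrix: row number, and position of the named column
idx <- cbind(seq_len(nrow(df)), match(as.character(df$col4), value_cols))

# Index the value columns as a matrix with (row, column) pairs
df$col5 <- as.matrix(df[value_cols])[idx]
```

This assumes every entry of col4 names an existing value column; an unmatched name would produce NA in col5.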

Related

Adding a list to dataframe as element

I have a list:
mylist <- c("a","b","c")
mylist <- as.list(mylist)
and I have a Dataframe like this:
df <- data.frame("col1" = 1, "col2" = 2, "col3" = 3)
View(df)
col1 col2 col3
1 1 2 3
How to add mylist as a new element in a new column col4 in row number 1?
Expected output:
col1 col2 col3 col4
1 1 2 3 [a,b,c]
Is it even possible in R? I started learning R a few days ago; I come from pandas, where this was possible.
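One possible way, as a hedged sketch rather than a definitive answer: a data frame column can itself be a list, so wrapping mylist in another list() stores it as a single cell in row 1.

```r
mylist <- as.list(c("a", "b", "c"))
df <- data.frame(col1 = 1, col2 = 2, col3 = 3)

# The outer list() has length 1 (one element per row);
# its single element is the whole of mylist
df$col4 <- list(mylist)

# Retrieve the stored list from row 1
stored <- df$col4[[1]]
```

The printed form of such a list column differs from the pandas display, but the cell does hold the full list.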

How to delete duplicate rows (the shorter ones) based on certain columns?

Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:
library(dplyr)
df %>% group_by(col1,col2) %>%
slice(which.min(is.na(col3)))
or this :
df %>%
group_by(col1,col2) %>%
arrange(col3) %>%
slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
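A GENERAL SOLUTION
The most general solution counts the NAs in each row and keeps, within each group, the row with the fewest. As written it groups by col1 only, so only one row survives per value of col1; see the comment in the code to add col2 to the grouping variables. It assumes all NAs are on the right.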
df %>% mutate(nna = df %>% is.na %>% rowSums) %>%
group_by(col1) %>% # or group_by(col1,col2)
slice(which.min(nna)) %>%
select(-nna)
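The same "fewest NAs wins" idea can be sketched in base R, assuming the key columns are known by name; key_cols, n_missing, ordered and result are illustrative names, and with the real data key_cols would list the 13 identifying columns:

```r
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))

# Columns that identify a duplicate
key_cols <- c("col1", "col2")

# Number of missing values per row
n_missing <- rowSums(is.na(df))

# Put the most complete rows first, then keep the first row of each key
ordered <- df[order(n_missing), ]
result <- ordered[!duplicated(ordered[key_cols]), ]
```

Ties (rows with the same key and the same number of NAs) are resolved by original row order, which may or may not be what is wanted.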
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
This approach does not translate directly to dplyr, which at the time of writing offered neither a "sort by all columns" shorthand in arrange nor a fromLast option in distinct.

resample each two columns together in a data frame in R

I have a very large data frame that contains 100 rows and 400000 columns.
To sample each column, I can simply do:
df <- apply(df, 2, sample)
But I want every two columns to be sampled together. For example, if originally col1 is c(1,2,3,4,5) and col2 is c(6,7,8,9,10), and after resampling col1 becomes c(1,3,2,4,5), I want col2 to become c(6,8,7,9,10), following the resampling pattern of col1. The same goes for col3 & col4, col5 & col6, etc.
I wrote a for loop to do this, which takes forever. Is there a better way? Thanks!
You might try this: split the data frame every two columns with split.default, sample the rows of each sub data frame, and then bind the pieces back together:
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)
index <- seq_len(nrow(df))
cbind.data.frame(
  setNames(lapply(
    split.default(df, (seq_along(df) - 1) %/% 2),
    function(sdf) sdf[sample(index), , drop = FALSE]),
    NULL)
)
# col1 col2 col3
#5 5 10 12
#4 4 9 11
#1 1 6 15
#2 2 7 14
#3 3 8 13
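With 400000 columns, splitting into 200000 small data frames may itself be slow. A hedged alternative sketch, assuming all columns share one type so the data can be held as a matrix: draw one permutation per column pair and apply them all in a single matrix-indexing step. The names m, perms, pair_id, row_idx and shuffled are illustrative, and set.seed is only there to make the run reproducible.

```r
set.seed(1)
m <- matrix(1:20, nrow = 5)  # toy data: 4 columns, i.e. 2 pairs

# One row permutation per pair of columns (nrow x n_pairs matrix)
n_pairs <- ceiling(ncol(m) / 2)
perms <- replicate(n_pairs, sample(nrow(m)))

# Pair number of each column: 1, 1, 2, 2, ...
pair_id <- (seq_len(ncol(m)) - 1) %/% 2 + 1

# For every cell, the source row to copy from (nrow x ncol matrix)
row_idx <- perms[, pair_id, drop = FALSE]

# Pick m[row_idx[i, j], j] for every cell and restore the shape
shuffled <- matrix(m[cbind(as.vector(row_idx), as.vector(col(m)))],
                   nrow = nrow(m))
```

A data frame can be converted with as.matrix(df) beforehand and back with as.data.frame(shuffled) afterwards; whether this is actually faster on the real data would need to be measured.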

How do I sum all values in a data frame that match multiple criteria? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 6 years ago.
I'm trying to use the sapply (or similar) function to sum all of the values that match multiple criteria throughout the data set.
I was able to write the code for a specific match, but am not sure how to use R to apply to every unique match in the data frame.
For example, if my data frame is constructed with 3 columns
col1 <- c("a", "a", "a", "b", "b", "b", "b", "b", "b")
col2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 1)
col3 <- c(10, 5, 10, 5, 5, 1, 3, 4, 5)
df <- data.frame(col1, col2, col3)
Here is the code I'm using for one match:
tmp <- subset(df, col1 == "a" & col2==1)
sum(tmp[,3])
This code correctly returns 25 for the sum of col3 matching the 2 criteria in the subset function.
How do I do this calculation for the 3 unique combinations in the data frame? I'm looking for the following output
col1 col2 sum_col3
a 1 25
b 1 12
b 2 11
Thanks for assistance in advance.
Here's what you can try :
> result <- aggregate(col3 ~ col1 + col2 , df, sum)
> result
col1 col2 col3
1 a 1 25
2 b 1 12
3 b 2 11
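The aggregate() result keeps the name col3 for the summed column. A small sketch of renaming it to the sum_col3 header requested in the question:

```r
col1 <- c("a", "a", "a", "b", "b", "b", "b", "b", "b")
col2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 1)
col3 <- c(10, 5, 10, 5, 5, 1, 3, 4, 5)
df <- data.frame(col1, col2, col3)

# Sum col3 within each combination of col1 and col2
result <- aggregate(col3 ~ col1 + col2, data = df, FUN = sum)

# Rename the summed column to match the requested output
names(result)[names(result) == "col3"] <- "sum_col3"
```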

consolidate duplicate rows and add column in R [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 7 years ago.
I'd like to know how to consolidate duplicate rows in a data frame and then combine the duplicated values in another column.
Here's a sample of the existing dataframe and two dataframes that would be acceptable as a solution
df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"), col2 = c(1, 2, 3, 4))
df.ideal <- data.frame(col1 = c("test1", "test2", "test3"), col2 = c(1, "2, 3", 4))
df.ideal2 <- data.frame(col1 = c("test1", "test2", "test3"),
col2 = c(1, 2, 4),
col3 = c(NA, 3, NA))
In the first ideal dataframe, the duplicated row is collapsed and the column holds both numbers. I've looked at other similar questions on Stack Overflow, but they all dealt with combining rows. I need to delete the duplicate row because I'm merging this with another dataset that needs a certain number of rows, but I also want to preserve all of the values. Thanks for your help!
To go from df1 to df.ideal, you can use aggregate().
aggregate(col2~col1, df1, paste, collapse=",")
# col1 col2
# 1 test1 1
# 2 test2 2,3
# 3 test3 4
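The output above joins values with "," while df.ideal uses ", ". A minimal variation, assuming the exact separator matters, is to pass toString as the aggregation function:

```r
df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"),
                  col2 = c(1, 2, 3, 4))

# toString() joins the values of each group with ", "
collapsed <- aggregate(col2 ~ col1, data = df1, FUN = toString)
```

Note that col2 becomes a character column in the result, as it is in df.ideal.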
If you want to get to df.ideal2, that's more of a reshaping from long to wide process. You can do
reshape(transform(df1, time=ave(col2, col1, FUN=seq_along)), idvar="col1", direction="wide")
# col1 col2.1 col2.2
# 1 test1 1 NA
# 2 test2 2 3
# 4 test3 4 NA
using just the base reshape() function.
Another option would be to use splitstackshape
library(data.table)
library(splitstackshape)
DT1 <- setDT(df1)[,list(col2=toString(col2)) ,col1]
DT1
# col1 col2
#1: test1 1
#2: test2 2, 3
#3: test3 4
You could then split col2 in DT1 to get df.ideal2:
cSplit(DT1, 'col2', sep=',')
# col1 col2_1 col2_2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
or from df1
dcast.data.table(getanID(df1, 'col1'), col1~.id, value.var='col2')
# col1 1 2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
