Hi, I have a data frame that looks like the following:
I want to apply a function to it so that it is reshaped like this:
How would I do that?
Here is one option that could work. We loop through the unique names of the dataset, create a logical index with ==, extract the matching columns, unlist them, and wrap the result in a data.frame. The pieces are then combined with cbind or, as here, with data.frame. This assumes that each name is duplicated the same number of times.
data.frame(lapply(unique(names(df1)), function(x)
setNames(data.frame(unlist(df1[names(df1)==x], use.names = FALSE)), x)))
# type model make
#1 a b c
#2 d e f
data
df1 <- data.frame(type = "a", model = "b", make = "c", type = "d",
model = "e",
make = "f", check.names=FALSE, stringsAsFactors=FALSE)
I need to recode a factor variable with almost 90 levels. The levels are trait names from a database, which I then need to pivot to get the dataset for analysis.
Is there a way to do it automatically without typing each OldName=NewName?
This is how I do it with dplyr for fewer levels:
df$TraitName <- recode_factor(df$TraitName, 'Old Name' = "new.name")
My idea was to use a key data frame with a column of old names and a column of the corresponding new names, but I cannot figure out how to feed it to recode.
You could quite easily create a named vector from your lookup table and pass that to recode using splicing. It may well be faster than a join.
library(tidyverse)
# test data
df <- tibble(TraitName = c("a", "b", "c"))
# Make a lookup table with your own data.
# You'll bind your two columns here instead;
# keep the column order (old, then new) so deframe() builds the right vector.
# Column names don't matter.
lookup <- tibble(old = c("a", "b", "c"), new = c("aa", "bb", "cc"))
# Convert to named vector and splice it within the recode
df <-
df |>
mutate(TraitNameRecode = recode_factor(TraitName, !!!deframe(lookup)))
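If you would rather avoid the tidyverse, the same lookup idea can be sketched in base R by renaming the factor levels directly. This is only a minimal sketch: the data and the new_names helper are made up for illustration, and levels that are missing from the lookup are left unchanged.

```r
# Illustrative data (not the asker's): a factor column and a lookup table
df <- data.frame(TraitName = factor(c("a", "b", "c", "a")))
lookup <- data.frame(old = c("a", "b", "c"),
                     new = c("aa", "bb", "cc"),
                     stringsAsFactors = FALSE)

# Named vector: names are the old levels, values are the new ones
new_names <- setNames(lookup$new, lookup$old)

# Rename only the levels that appear in the lookup
lv <- levels(df$TraitName)
hit <- lv %in% names(new_names)
lv[hit] <- unname(new_names[lv[hit]])
levels(df$TraitName) <- lv

as.character(df$TraitName)
```

Because only the levels are touched, this stays cheap even when the column is long.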
One way would be a lookup table, a join, and coalesce (to get the first non-NA value):
my_data <- data.frame(letters = letters[1:6])
levels_to_change <- data.frame(letters = letters[4:5],
new_letters = LETTERS[4:5])
library(dplyr)
my_data %>%
left_join(levels_to_change) %>%
mutate(new = coalesce(new_letters, letters))
Result
Joining, by = "letters"
letters new_letters new
1 a <NA> a
2 b <NA> b
3 c <NA> c
4 d D D
5 e E E
6 f <NA> f
I have a data frame ('ju') that has three columns and 230 rows. The first two columns represent a pair of objects. The third column includes one of those objects. I'd like to add the fourth column which will contain the second object from that pair, as shown below.
I wrote some code to identify the value for the fourth column (loser), but it does not give me any output when I run it.
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
paste(ju$letter1[i])
} else {
paste (ju$letter2[i])
}
}
I cannot see what is wrong with the code. I would also appreciate a suggestion on how to create this fourth column directly in my data frame, instead of creating a separate vector and then adding it to the data frame. Thanks.
This will do it without a for loop, comparing winner with letter1 row by row:
ju$loser <- ifelse(ju$winner == ju$letter1, ju$letter2, ju$letter1)
Gives:
> ju
letter1 letter2 winner loser
1 a c a c
2 c b b c
3 t j j t
4 r k k r
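Values are not printed automatically inside a for loop, which is why your code shows no output. If you want to print to the console, you'll need to add: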
cat(ju$letter1[i])
or
print(ju$letter1[i])
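Put together, a version of the loop that does print might look like the sketch below. The ju data frame here is a small made-up stand-in for the real 230-row table.

```r
# Made-up stand-in for the asker's data frame
ju <- data.frame(letter1 = c("a", "c", "t", "r"),
                 letter2 = c("c", "b", "j", "k"),
                 winner  = c("a", "b", "j", "k"),
                 stringsAsFactors = FALSE)

for (i in seq_len(nrow(ju))) {
  # Pick whichever letter of the pair is not the winner
  loser <- if (ju$winner[i] == ju$letter2[i]) ju$letter1[i] else ju$letter2[i]
  # Inside a loop the value must be printed explicitly
  print(loser)
}
```

Using seq_len(nrow(ju)) instead of a hard-coded 1:230 keeps the loop correct if the number of rows changes.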
Regarding the new column question, here is a possible solution (a for loop is sub-optimal here; see the suggestion from #lab_rat_kid):
ju$NewColumn = NA
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
ju$NewColumn[i] <- ju$letter1[i]
} else {
ju$NewColumn[i] <- ju$letter2[i]
}
}
With tidyverse:
library(tidyverse)
dt <- tibble(l1 = c("a", "c", "t", "r"),
l2 = c("c", "b", "j", "k"),
winner = c("a", "b", "j", "k"))
# the loser is whichever letter of the pair is not the winner
dt <- dt %>%
mutate(loser = if_else(winner == l1, l2, l1))
dt
I have the following code that is taking forever to run on my 80k-row CBP table. Could anyone help me optimize my loop? I am simply trying to find duplicates that share the same values in certain (not all) columns, get the number of duplicates, and then return the ids for each of the duplicates:
for (row in 1:nrow(CBP)){
subs <- subset(CBP, CBP$Lower_Bound__c == CBP[row,"Lower_Bound__c"] & CBP$Price_Book__c == CBP[row,"Price_Book__c"] & CBP$Price__c == CBP[row,"Price__c"] & CBP$Product__c == CBP[row,"Product__c"] & CBP$Department__c == CBP[row,"Department__c"] & CBP$UOM__c == CBP[row,"UOM__c"] & CBP$Upper_Bound__c == CBP[row,"Upper_Bound__c"])
if (nrow(subs)>1){
CBP[row,]$dup <- nrow(subs)
CBP[row,]$dupids <- paste(subs[,"Id"], collapse = ",")
}
print(row)
}
I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a variable (nsame in the example) that numbers the rows within each combination of several variables (var1 and var2 in the example), so any value above 1 marks a duplicate. Then just grab the row index.
library(data.table)
# generate some example data
dt <- data.table(
var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
var3 = 1:9
)
# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
Using base R:
dupe_columns = c(
"Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
"Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes = which(duplicated(CBP[, dupe_columns]) | duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many are there
length(dupes)
# IDs that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
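For the layout asked for in the question (a per-row duplicate count plus the Ids of the whole group), a minimal base R sketch with ave() might look like this. The cbp_demo data frame and its two key columns are made up for illustration, so treat the names and values as assumptions rather than the real CBP table.

```r
# Hypothetical stand-in for CBP with only two of the key columns
cbp_demo <- data.frame(
  Id = c("id1", "id2", "id3", "id4"),
  Price__c = c(10, 10, 20, 10),
  Product__c = c("p", "p", "p", "q"),
  stringsAsFactors = FALSE
)
key_cols <- c("Price__c", "Product__c")

# One string key per row, built from the key columns
key <- do.call(paste, c(cbp_demo[key_cols], sep = "\r"))

# Group size for every row
cbp_demo$dup <- ave(seq_along(key), key, FUN = length)

# Ids of the whole group, collapsed into one string per row
cbp_demo$dupids <- ave(cbp_demo$Id, key,
                       FUN = function(ids) paste(ids, collapse = ","))

# Blank out the Ids for rows that have no duplicate
cbp_demo$dupids[cbp_demo$dup == 1] <- NA
```

This makes a single pass over the groups instead of subsetting the full table once per row.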
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
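The difference between the two indexing forms is easy to check on a toy data frame. The sketch below uses a made-up cbp_small table and a single column name, which is the case where the comma form drops to a vector.

```r
# Toy data frame, for illustration only
cbp_small <- data.frame(Id = 1:3, Price__c = c(5, 5, 7))
one_col <- "Price__c"

# With a comma, a single column is simplified to a plain vector
is.data.frame(cbp_small[, one_col])                # FALSE

# Without a comma, the result is always a data frame
is.data.frame(cbp_small[one_col])                  # TRUE

# drop = FALSE is another way to keep the data frame
is.data.frame(cbp_small[, one_col, drop = FALSE])  # TRUE
```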
Thanks for your help.
I have two data frames. The data frames are of differing lengths. One is a data set that often includes mistakes. Another is a set of corrections. I'm trying to do two things at once with these two data sets. First, I would like to compare three columns of df1 with three columns in df2. This means reading the first row of data in df1 and seeing if those three variables match any of the rows in df2 for those three variables, then moving on to row 2, and so on. If a match is found in a row for all three variables, then replace the value in one of the columns in df1 with a replacement in df2. I have included an example below.
df1 <- data.frame("FIRM" = c("A", "A", "B", "B", "C", "C"), "LOCATION" = c("N", "S", "N", "S", "N", "S"), "NAME" = c("Apple", "Blooberry", "Cucumber", "Date", "Egplant", "Fig"))
df2 <- data.frame("FIRM" = c("A", "C"), "LOCATION" = c("S", "N"), "NAME" = c("Blooberry", "Egplant"), "NEW_NAME" = c("Blueberry", "Eggplant"))
df1[] <- lapply(df1, as.character)
df2[] <- lapply(df2, as.character)
If there is a row in df1 that matches against "FIRM", "LOCATION" and "NAME" in df2, then I would like to replace the "NAME" in df1 with "NEW_NAME" in df2, such that "Blooberry" and "Egplant" change to "Blueberry" and "Eggplant".
I can do the final replacements using*:
df1$NAME[match(df2$NAME, df1$NAME)] <- df2$NEW_NAME[match(df1$NAME[match(df2$NAME, df1$NAME)], df2$NAME)]
But this does not include the constraint of the three matches. Also, my code seems unnecessarily complex with the nested match functions. I think I could accomplish this task by subsetting df2 and using a for loop to match rows one by one but I would think that there is a better vectorized method out there.
*I'm aware that inside the brackets of df2$NEW_NAME[], the function calls both elements in that column, but I'm trying to generalize.
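Consider an all.x merge (i.e., a LEFT JOIN in SQL speak) with an ifelse conditional that falls back to NAME wherever NEW_NAME is NA. Because FIRM, LOCATION and NAME are common to both data frames, merge matches on all three by default.
Below, transform allows the column to be reassigned in the same line, and the bracketed sequence at the end keeps only the first three columns.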
mdf <- transform(merge(df1,df2,all.x=TRUE),NAME=ifelse(is.na(NEW_NAME),NAME,NEW_NAME))[1:3]
mdf
# FIRM LOCATION NAME
# 1 A N Apple
# 2 A S Blueberry
# 3 B N Cucumber
# 4 B S Date
# 5 C N Eggplant
# 6 C S Fig
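Note that merge sorts its result by the key columns. If the original row order of df1 must be preserved, one alternative is to match on a combined key. This is only a sketch of a different technique from the merge above, and key_cols, key1, key2, idx and hit are names introduced here for illustration.

```r
df1 <- data.frame(FIRM = c("A", "A", "B", "B", "C", "C"),
                  LOCATION = c("N", "S", "N", "S", "N", "S"),
                  NAME = c("Apple", "Blooberry", "Cucumber", "Date", "Egplant", "Fig"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(FIRM = c("A", "C"),
                  LOCATION = c("S", "N"),
                  NAME = c("Blooberry", "Egplant"),
                  NEW_NAME = c("Blueberry", "Eggplant"),
                  stringsAsFactors = FALSE)

key_cols <- c("FIRM", "LOCATION", "NAME")

# Build one string key per row from the three columns
key1 <- do.call(paste, c(df1[key_cols], sep = "\r"))
key2 <- do.call(paste, c(df2[key_cols], sep = "\r"))

# For each row of df1, the matching row of df2 (NA if none)
idx <- match(key1, key2)
hit <- !is.na(idx)

# Overwrite NAME only where all three columns matched
df1$NAME[hit] <- df2$NEW_NAME[idx[hit]]
df1$NAME
```

The separator "\r" is chosen on the assumption that it never occurs in the data, so keys from different column combinations cannot collide.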
I have two data frames, one big and one small (the actual dataset is very, very big!). The following is just a working example.
big <- data.frame (SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55)
SN names var
1 1 A 51
2 2 B 52
3 3 C 53
4 4 D 54
5 5 E 55
small <- data.frame (names = c("A", "C", "E"), type = c("New", "Old", "Old") )
names type
1 A New
2 C Old
3 E Old
Now I need to create a new variable in "big" with the help of the "type" variable in "small". Where the names in small and big match, the corresponding type should be stored in a new column type. Where there is no match between the names columns, the value should be "Unknown". The expected output is as follows:
resultdf <- data.frame(SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55,
type = c("New","Unknown", "Old", "Unknown", "Old"))
resultdf
SN names var type
1 1 A 51 New
2 2 B 52 Unknown
3 3 C 53 Old
4 4 D 54 Unknown
5 5 E 55 Old
I know this is a simple question for experts, but I could not figure it out.
First use merge() with the argument all.x=TRUE to merge the two data.frames, keeping rows of big that found no matching value in small$names. Then, replace those elements of big$type that didn't find a match (marked by merge() with NAs) with the string "Unknown".
Note that because big and small share just one column name in common, that column is by default used to perform the merge. For more control over which columns are used as the basis of the merge, see the function's by, by.x, and by.y arguments.
small <- data.frame (names = c("A", "C", "E"),
type = c("New", "Old", "Old"), stringsAsFactors=FALSE)
big <- data.frame (SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55,
stringsAsFactors=FALSE)
# all.x=TRUE keeps every row of big, matched or not
big <- merge(big, small, all.x=TRUE)
big$type[is.na(big$type)] <- "Unknown"
big$type <- c(as.character(small$type),"Unknown") [
match(
x=big$names,
table=small$names,
nomatch=length(small$type)+1)]
The basic strategy is to convert the factor to character, append an "Unknown" value, and then use big$names to look up the correct index into "type" from the 'small' data frame. Any name without a match gets the index of the appended "Unknown" entry through the nomatch argument. Generating indices is a typical use of the match function.
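To see what match returns here, the lookup can be broken into two steps on plain vectors. The vector names below (big_names, small_names, small_type, idx) are introduced only for this sketch.

```r
big_names   <- c("A", "B", "C", "D", "E")
small_names <- c("A", "C", "E")
small_type  <- c("New", "Old", "Old")

# Position of each big name in small_names; unmatched names get index 4
idx <- match(big_names, small_names, nomatch = length(small_type) + 1)
idx
# 1 4 2 4 3

# Index 4 points at the appended "Unknown" entry
type <- c(small_type, "Unknown")[idx]
type
# "New" "Unknown" "Old" "Unknown" "Old"
```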