dataset=structure(list(goods = structure(1:6, .Label = c("a", "b", "c",
"d", "e", "f"), class = "factor")), .Names = "goods", class = "data.frame", row.names = c(NA,
-6L))
goods
1 a
2 b
3 c
4 d
5 e
6 f
I want to create a new data frame, so I simply do
df1=dataset$goods
but afterwards df1 no longer has the column name goods.
Why?
str(df1)
Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6
As you can see, the name goods is gone.
How can I make df1 keep the column name goods?
If this post is a duplicate, let me know and I will delete it.
You are assigning a column vector, not a data frame. To assign the whole data frame, simply do
df = dataset
If you want to preserve only some columns and not all, use column subsetting (documentation):
df = dataset[, "goods", drop = FALSE]
drop = FALSE is necessary here because the dataframe subset operator will otherwise return a vector instead of a data frame with a single column (this is arguably a bug, which is why tidyverse tibbles behave differently).
Using tidyverse operations (aka the “modern” R way), this would be written as
library(dplyr)
df = select(dataset, goods)
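To see what drop = FALSE changes, here is a small base R sketch. The dataset is rebuilt from the question; the object names as_vector, as_frame and as_frame2 are only illustrative.

```r
# Rebuild the question's data frame with a single factor column
dataset <- data.frame(goods = factor(c("a", "b", "c", "d", "e", "f")))

# Default matrix-style subsetting drops the single column to a bare factor
as_vector <- dataset[, "goods"]
class(as_vector)          # "factor"

# drop = FALSE keeps the data frame and therefore the column name
as_frame <- dataset[, "goods", drop = FALSE]
names(as_frame)           # "goods"

# List-style subsetting with single brackets also keeps the data frame
as_frame2 <- dataset["goods"]
is.data.frame(as_frame2)  # TRUE
```

The list-style form dataset["goods"] is often the shortest way to keep a one-column data frame in base R.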
df1=data.frame(goods=dataset$goods, stringsAsFactors=F) works perfectly well, or you can use the longer but (somewhat?) more explicit:
ds <- dataset[,c("goods")]
df1=data.frame(goods=dataset$goods)
library(dplyr)
ds <- dataset[,c("goods")] %>% as.data.frame(stringsAsFactors=F)
colnames(ds) <- "goods"
edit: Added the stringsAsFactors option, as it is useful for controlling whether you'd like factor conversion or not. c("goods") is equivalent to "goods", but I left it as a template in case you need to add more columns.
I need to recode a factor variable with almost 90 levels. The levels are trait names from a database, which I then need to pivot to get the dataset for analysis.
Is there a way to do it automatically without typing each OldName=NewName?
This is how I do it with dplyr for fewer levels:
df$TraitName <- recode_factor(df$TraitName, 'Old Name' = "new.name")
My idea was to use a key data frame with a column of old names and a column of corresponding new names, but I cannot figure out how to feed it to recode.
You could quite easily create a named vector from your lookup table and pass that to recode using splicing. It may well be faster than a join.
library(tidyverse)
# test data
df <- tibble(TraitName = c("a", "b", "c"))
# Make a lookup table with your own data
# You'll bind your own two columns here instead.
# Keep the column order (old names first, new names second) so deframe works.
# The column names don't matter.
lookup <- tibble(old = c("a", "b", "c"), new = c("aa", "bb", "cc"))
# Convert to named vector and splice it within the recode
df <-
df |>
mutate(TraitNameRecode = recode_factor(TraitName, !!!deframe(lookup)))
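For readers who want to see what the named vector looks like, or who prefer to avoid the tidyverse, the same lookup idea can be sketched in base R. This is an alternative technique, not the recode_factor call above: the names key, x, old_levels and new_levels are illustrative, and the extra level "z" is an assumption added to show what happens to a level missing from the lookup table.

```r
# Lookup table: old names in the first column, new names in the second
lookup <- data.frame(old = c("a", "b", "c"),
                     new = c("aa", "bb", "cc"),
                     stringsAsFactors = FALSE)

# Named vector: names are the old values, elements are the new values
key <- setNames(lookup$new, lookup$old)

# Example factor; "z" is deliberately absent from the lookup table
x <- factor(c("a", "b", "c", "a", "z"))

# Rename only the levels found in the key and leave the others untouched
old_levels <- levels(x)
new_levels <- ifelse(old_levels %in% names(key), key[old_levels], old_levels)
levels(x) <- new_levels

as.character(x)  # "aa" "bb" "cc" "aa" "z"
```

Because only the levels are rewritten, the cost depends on the number of levels (about 90 here) rather than on the number of rows.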
One way would be a lookup table, a join, and coalesce (to get the first non-NA value):
my_data <- data.frame(letters = letters[1:6])
levels_to_change <- data.frame(letters = letters[4:5],
new_letters = LETTERS[4:5])
library(dplyr)
my_data %>%
left_join(levels_to_change) %>%
mutate(new = coalesce(new_letters, letters))
Result
Joining, by = "letters"
letters new_letters new
1 a <NA> a
2 b <NA> b
3 c <NA> c
4 d D D
5 e E E
6 f <NA> f
I am trying to do some quick data manipulation in R, and I am very new to it.
So I am trying to use the unique function on some data. What I want to achieve is to keep unique rows based on some combination of columns. As I understand from the documentation, this should be possible using the 'by' argument of the unique method, but I cannot get this to work.
I have the dataTest:
name age
1 A 1
2 B 2
3 C 1
After using unique(dataTest,by="age"), the output does not change, while I would expect it to change to
name age
1 A 1
2 B 2
See the attachment for the code in action.
Again, it's probably a beginner mistake, but I cannot seem to figure it out. Help much appreciated.
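I think you have a data frame. The by argument belongs to data.table's unique method; the data frame method silently ignores it. Convert your data frame to a data.table and it should work. See the difference in output: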
1) When it is a dataframe.
df <- structure(list(name = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), age = c(1L, 2L, 1L)), class = "data.frame",
row.names = c("1", "2", "3"))
unique(df, by = "age")
# name age
#1 A 1
#2 B 2
#3 C 1
2) After changing it to data.table
library(data.table)
setDT(df)
unique(df, by = "age")
# name age
#1: A 1
#2: B 2
Another option is to use duplicated
df[!duplicated(df$age), ]
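A short sketch of why the duplicated approach works, using the question's data rebuilt as a plain data frame. The names dup, first_by_age and first_by_combo are illustrative. duplicated marks every repeat of a value seen earlier, so negating it keeps the first row for each key; passing several columns as a data frame extends this to a combination of columns.

```r
df <- data.frame(name = c("A", "B", "C"),
                 age  = c(1L, 2L, 1L),
                 stringsAsFactors = FALSE)

# TRUE marks rows whose age already appeared in an earlier row
dup <- duplicated(df$age)   # FALSE FALSE TRUE

# Keep only the first row for each age
first_by_age <- df[!dup, ]

# For a combination of columns, pass those columns as a data frame
first_by_combo <- df[!duplicated(df[c("name", "age")]), ]
```

Here first_by_age has two rows (A and B), while first_by_combo keeps all three rows because no name/age pair repeats.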
We can use distinct, with .keep_all = TRUE so that the other columns (here name) are retained
library(dplyr)
df %>%
distinct(age, .keep_all = TRUE)
I have been grappling with the following problem for a while, as I need to load in, manipulate, and produce scores from new datasets as quickly as possible. I have defined a data dictionary containing a description of each variable class (e.g. numeric, factor, character, date) and, where applicable, a list of all possible factor levels:
DD <- data.frame(Var = c("a", "b", "c", "d"),
Class = c("Numeric", "Factor", "Factor", "Date"),
Levels = c(NA, "B1, B2, B3", "C1, C2", NA))
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01")
Ultimately, I intend to use model.matrix to produce a design matrix with a common set of indicator variables/ columns regardless of the actual factor levels observed in the particular dataset, so I can accurately score up the data from a model.
I need to do these tasks as quickly as possible and, hence, I am trying to find a solution that avoids using lapply/ loops. Here is (a slightly convoluted version of) my existing solution for setting the factor levels, which is currently too slow for my requirements:
lapply(1:ncol(Data[,DD$Class=="Factor"]), function(i) {
factor( as.character( unlist( Data[,DD$Class=="Factor"][i])) ,
levels = unlist(strsplit(as.character(DD$Levels[DD$Class=="Factor"][i]), ", ")) )
})
Any suggestions for avoiding use of a loop here, if it is even possible, or any alternative solutions would be much appreciated!
Thanks!
First remark
It is inefficient to store the levels as a single string that needs to be parsed. I suggest using a list and storing the levels as a character vector directly.
DD <- list(a = list(class="numeric"),
b = list(class="factor", levels = c("B1", "B2", "B3")),
c = list(class="factor", levels = c("C1", "C2")),
d = list(class="Date"))
Second remark:
You don't need to "unfactor" your factor to add new levels.
Example:
fac <- factor(c("a", "b", "a"))
# two levels
factor(fac, letters)
# 26 levels, data has not changed
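The claim in the second remark can be checked directly. In this sketch the name wide is illustrative; everything else comes from the example above.

```r
fac <- factor(c("a", "b", "a"))

# Re-declare the factor with a larger level set, without converting to character
wide <- factor(fac, levels = letters)

nlevels(fac)    # 2
nlevels(wide)   # 26

# The underlying values are unchanged
identical(as.character(fac), as.character(wide))  # TRUE
```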
Third remark
I don't see how to avoid a "loop" here, but lapply is an efficient way of performing a loop, and I doubt it is the performance bottleneck. With remarks 1 and 2, you can write a more efficient inner function. Ditching the data.frame for a data.table or a tbl may prove useful too.
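One possible shape for such an inner function, as a rough sketch only: it assumes the list-based dictionary from the first remark (with the last entry named d) and introduces a hypothetical helper called convert. It has not been benchmarked against the original code.

```r
# List-based data dictionary: one entry per column
DD <- list(a = list(class = "numeric"),
           b = list(class = "factor", levels = c("B1", "B2", "B3")),
           c = list(class = "factor", levels = c("C1", "C2")),
           d = list(class = "Date"))

Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01",
                   stringsAsFactors = FALSE)

# Hypothetical helper: convert one column according to its dictionary entry
convert <- function(x, spec) {
  switch(spec$class,
         numeric = as.numeric(x),
         factor  = factor(x, levels = spec$levels),
         Date    = as.Date(x),
         x)  # unknown class: return the column unchanged
}

# Walk the columns and their specs in parallel
Data[names(DD)] <- Map(convert, Data[names(DD)], DD)
str(Data)
```

The levels are read straight from the list, so no string parsing happens inside the loop.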
You are going to have to iterate over the columns, but it can be done in a somewhat cleaner fashion like this. We have assumed you want to preserve the input data frame, so we create a new data frame, Data2, to hold the result. Note that we have made the columns of DD character rather than factor. Data is from the question.
DD <- data.frame(Var = c("a", "b", "c", "d"),
Class = c("Numeric", "Factor", "Factor", "Date"),
Levels = c(NA, "B1, B2, B3", "C1, C2", NA), stringsAsFactors = FALSE)
Data2 <- Data
# factors
Factors <- subset(DD, Class == "Factor")
Levels <- strsplit(as.character(Factors$Levels), ", ")
Data2[Factors$Var] <- Map(factor, Data2[Factors$Var], Levels)
# dates
Dates <- subset(DD, Class == "Date")
Data2[Dates$Var] <- Map(as.Date, Data2[Dates$Var])
giving:
> str(Data2)
'data.frame': 1 obs. of 4 variables:
$ a: num 5
$ b: Factor w/ 3 levels "B1","B2","B3": 1
$ c: Factor w/ 2 levels "C1","C2": 2
$ d: Date, format: "2015-05-01"
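Once the factor levels are fixed this way, model.matrix should produce the same indicator columns regardless of which levels actually occur in the data, which is what the question is ultimately after. A minimal sketch, rebuilding only the a, b and c columns of Data2 by hand (the formula ~ a + b + c is an assumption, since the question does not show the model):

```r
# One-row data set with the full level sets declared up front
Data2 <- data.frame(a = 5,
                    b = factor("B1", levels = c("B1", "B2", "B3")),
                    c = factor("C2", levels = c("C1", "C2")))

# Unobserved levels still get their own (all-zero) indicator column
X <- model.matrix(~ a + b + c, data = Data2)
colnames(X)  # "(Intercept)" "a" "bB2" "bB3" "cC2"
```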
You can define column classes (and hence levels for your factors) right when you create your data frame:
mylevels <- c("b1","b2","b3","c1","c2")
myData <- data.frame(a = numeric(),
b = factor(levels=mylevels),
c = factor(levels=mylevels),
d = character())
This creates a blank data frame that you can fill up later.
In reply to your comment: rbind will only work if myData has same column classes as data. Otherwise column classes will change. My point is: if data that you load is similar (has same number of columns, columns have same names, etc.) then you can set classes right when you load that data, something like:
myclasses <- c("numeric","factor","factor","Date")
myData <- read.table("mydatafile.csv", sep=",", header = TRUE, colClasses = myclasses)
myLevels <- c("a1","a2","b1","b2")
myfactorlist <- c("b","c")
# levels<- cannot be applied to several columns at once, so re-declare each factor
myData[myfactorlist] <- lapply(myData[myfactorlist], factor, levels = myLevels)
Here I assume your column names are as in the previous example: "a", "b", "c", "d", where "b" and "c" are factor columns.
I have two data frames that I want to join using dplyr. One is a data frame containing first names.
test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
stringsAsFactors = FALSE)
The other data frame contains a cleaned up version of the Kantrowitz names corpus, identifying gender. Here is a minimal example:
kantrowitz <- structure(list(name = c("john", "bill", "madison", "abby", "thomas"), gender = c("M", "either", "M", "either", "M")), .Names = c("name", "gender"), row.names = c(NA, 5L), class = c("tbl_df", "tbl", "data.frame"))
I essentially want to look up the gender of the name from the test_data table using the kantrowitz table. Because I'm going to abstract this into a function encode_gender, I won't know the name of the column in the data set that's going to be used, and so I can't guarantee that it will be name, as in kantrowitz$name.
In base R I would perform the merge this way:
merge(test_data, kantrowitz, by.x = "first_name", by.y = "name", all.x = TRUE)
That returns the correct output:
first_name gender
1 abby either
2 bill either
3 john M
4 madison M
5 zzz <NA>
But I want to do this in dplyr because I'm using that package for all my other data manipulation. The dplyr by option to the various *_join functions only lets me specify one column name, but I need to specify two. I'm looking for something like this:
library(dplyr)
# either
left_join(test_data, kantrowitz, by.x = "first_name", by.y = "name")
# or
left_join(test_data, kantrowitz, by = c("first_name", "name"))
What is the way to perform this kind of join using dplyr?
(Never mind that the Kantrowitz corpus is a bad way to identify gender. I'm working on a better implementation, but I want to get this working first.)
This feature has been added in dplyr v0.3. You can now pass a named character vector to the by argument in left_join (and other joining functions) to specify which columns to join on in each data frame. With the example given in the original question, the code would be:
left_join(test_data, kantrowitz, by = c("first_name" = "name"))
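Since the question wants to wrap this in a function where the column name is not known in advance, the named vector can be built with setNames. The following is only a sketch: the function name encode_gender comes from the question, but its signature (a data frame plus a column name given as a string) is an assumption, and kantrowitz is rebuilt as a plain data frame for brevity.

```r
library(dplyr)

test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
                        stringsAsFactors = FALSE)
kantrowitz <- data.frame(name = c("john", "bill", "madison", "abby", "thomas"),
                         gender = c("M", "either", "M", "either", "M"),
                         stringsAsFactors = FALSE)

# Hypothetical signature: `col` is the name of the column in `data`
# that holds the first names
encode_gender <- function(data, col) {
  # setNames("name", col) builds the named vector c(<col> = "name")
  left_join(data, kantrowitz, by = setNames("name", col))
}

result <- encode_gender(test_data, "first_name")
result
```

Unlike merge, left_join keeps the rows in the order of the left table.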
This is more a workaround than a real solution. You can create a new object test_data with another column name:
left_join("names<-"(test_data, "name"), kantrowitz, by = "name")
name gender
1 john M
2 bill either
3 madison M
4 abby either
5 zzz <NA>
I have two data frames, one big and one small (the actual dataset is very, very big!). The following is just a working example.
big <- data.frame (SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55)
SN names var
1 1 A 51
2 2 B 52
3 3 C 53
4 4 D 54
5 5 E 55
small <- data.frame (names = c("A", "C", "E"), type = c("New", "Old", "Old") )
names type
1 A New
2 C Old
3 E Old
Now I need to create a new variable in "big" with the help of the "type" variable in small. Where the names in small and big match, the corresponding type should be stored in the new column type. If there is no match between the names columns, the result should be the new value "Unknown". The expected output is as follows:
resultdf <- data.frame(SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55,
type = c("New","Unknown", "Old", "Unknown", "Old"))
resultdf
SN names var type
1 1 A 51 New
2 2 B 52 Unknown
3 3 C 53 Old
4 4 D 54 Unknown
5 5 E 55 Old
I know this is simple question for experts but I could not figure it out.
First use merge() with the argument all=TRUE to merge the two data.frames, keeping the rows of big that found no matching value in small$names. Then replace those elements of big$type that didn't find a match (marked by merge() with NA) with the string "Unknown".
Note that because big and small share just one column name in common, that column is by default used to perform the merge. For more control over which columns are used as the basis of the merge, see the function's by, by.x, and by.y arguments.
small <- data.frame (names = c("A", "C", "E"),
type = c("New", "Old", "Old"), stringsAsFactors=FALSE)
big <- data.frame (SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55,
stringsAsFactors=FALSE)
big <- merge(big, small, all=TRUE)
big$type[is.na(big$type)] <- "Unknown"
big$type <- c(as.character(small$type),"Unknown") [
match(
x=big$names,
table=small$names,
nomatch=length(small$type)+1)]
The basic strategy is to convert the factor to character, add an "Unknown" value, and then use big$names to look up the correct index for "type" in the 'small' data frame. Generating indices is a typical use of the match function.
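To make the index lookup easier to follow, here is the same idea split into steps. The intermediate names idx and choices are only illustrative; the data are from the question.

```r
big <- data.frame(SN = 1:5, names = c("A", "B", "C", "D", "E"), var = 51:55,
                  stringsAsFactors = FALSE)
small <- data.frame(names = c("A", "C", "E"), type = c("New", "Old", "Old"),
                    stringsAsFactors = FALSE)

# Position of each big$names value in small$names; unmatched names get
# index 4, one past the end of small$type
idx <- match(big$names, small$names, nomatch = length(small$type) + 1)
idx  # 1 4 2 4 3

# Append "Unknown" in position 4, so unmatched names pick it up
choices <- c(as.character(small$type), "Unknown")
big$type <- choices[idx]
big$type  # "New" "Unknown" "Old" "Unknown" "Old"
```

Unlike merge, this keeps big in its original row order and does not create an intermediate joined table.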