I have data as follows:
DT <- as.data.frame(c("1","2", "3", "A", "B"))
names(DT)[1] <- "charnum"
What I want is quite simple, but I could not find an example of it on Stack Overflow.
I want to split the dataset into two. DT1 with all the rows for which DT$charnum has numbers and DT2 with all the rows for which DT$charnum has letters. I tried something like:
DT1 <- DT[is.numeric(as.numeric(DT$charnum)),]
But that gives:
[1] 1 2 3 A B
Levels: 1 2 3 A B
Desired result:
> DT1
1 1
2 2
3 3
> DT2
1 A
2 B
You can use a regular expression to flag the rows that contain only digits, then split the data frame on that flag.
result <- split(DT, grepl('^\\d+$', DT$charnum))
# split() orders the groups FALSE, TRUE, so index by name to avoid mixing them up
DT1 <- type.convert(result[["TRUE"]], as.is = TRUE)
# charnum
#1 1
#2 2
#3 3
DT2 <- type.convert(result[["FALSE"]], as.is = TRUE)
# charnum
#4 A
#5 B
Using tidyverse:
library(tidyverse)
DT %>%
  # anchor the pattern so only values made up entirely of digits match
  group_split(grp = str_detect(charnum, "^\\d+$"), .keep = FALSE) %>%
  map(type.convert, as.is = TRUE)
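For comparison, the same split can be sketched with plain logical indexing in base R. This is only a minimal sketch: it assumes "numbers" means strings made up entirely of digits, and the helper vector is_num is a name introduced here for illustration.

```r
# Same data as in the question, kept as character
DT <- data.frame(charnum = c("1", "2", "3", "A", "B"), stringsAsFactors = FALSE)

# TRUE for rows whose value consists only of digits (illustrative helper vector)
is_num <- grepl("^[0-9]+$", DT$charnum)

# drop = FALSE keeps a one-column data frame instead of collapsing to a vector
DT1 <- DT[is_num, , drop = FALSE]
DT2 <- DT[!is_num, , drop = FALSE]

# Convert the digit strings to real numbers once they are separated
DT1$charnum <- as.numeric(DT1$charnum)
```

Keeping the flag in its own vector makes it easy to reuse for both halves, so the two subsets cannot overlap or miss a row.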
How do I swap one value with another in a column within a dataframe?
For example swap the 2's and 4's around in df1 to give df2:
df1 <- data.frame(col1 = c(1,2,1,4))
df2 <- data.frame(col1 = c(1,4,1,2))
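Simple solution using replace in base R. Note that replace() takes positions, not values, so this works here only because the 2 sits at position 2 and the 4 at position 4: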
df2 <- data.frame(col1 = replace(df1$col1, c(4,2), c(2,4)))
1 1
2 4
3 1
4 2
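If the values to swap are not conveniently located at matching positions, a value-based swap may be safer. The following is a minimal sketch, not the answerer's method; the names is_two and is_four are introduced here for illustration.

```r
df1 <- data.frame(col1 = c(1, 2, 1, 4))

# Record where each value sits before assigning anything, so the second
# assignment cannot overwrite the result of the first
is_two  <- df1$col1 == 2
is_four <- df1$col1 == 4

df2 <- df1
df2$col1[is_two]  <- 4
df2$col1[is_four] <- 2
```

Computing both logical vectors up front is what makes this a true swap rather than two sequential replacements.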
We can try using case_when from the dplyr package for some switch functionality:
library(dplyr)
df2 <- df1
df2$col1 <- case_when(
  df2$col1 == 2 ~ 4,
  df2$col1 == 4 ~ 2,
  TRUE ~ df2$col1
)
1 1
2 4
3 1
4 2
df1 <- data.frame(col1 = c(1,2,1,4))
You can also swap by reassigning the row index for that column.
With the dataframe:
df1 <- data.frame(col1 = c("a","b","c","d"))
> df1
1 a
2 b
3 c
4 d
we can:
df1[,1] <- df1[c(1,4,3,2),1]
to get
> df1
1 a
2 d
3 c
4 b
Starting list of dataframes
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2]),
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2)))
#> $df1
#> X.1 heading
#> 1 1 a
#> 2 2 b
#> $df2
#> X.32 another.topic
#> 1 3 Line 1
#> 2 4 Line 2
Expected "combined" dataframe, with new consistent variable names, and old variable names in the first row of each constituent dataframe.
#> id h1 h2
#> 1 df1 X.1 heading
#> 2 df1 1 a
#> 3 df1 2 b
#> 4 df2 X.32 another.topic
#> 5 df2 3 Line 1
#> 6 df2 4 Line 2
add_row requires "Name-value pairs, passed on to tibble(). Values can be defined only for columns that already exist in .data and unset columns will get an NA value."
Which is what I think I have achieved with this:
df_nms <-
map(lst, names) %>%
map(set_names)
#> $df1
#> X.1 heading
#> "X.1" "heading"
#> $df2
#> X.32 another.topic
#> "X.32" "another.topic"
But I cannot tie up the last bit: using a purrr function to add the names to the head of each dataframe. I've tried numerous variations with map2 and pmap; the closest I can get at present is shown below. (If I treat add_row as a formula, prefixing it with ~, and remove the .y, I get a new first row populated with NAs.) I think I'm missing how to pass the name-value pairs to the add_row function.
map2(lst, df_nms, add_row(.x, .y, .before = 1)) %>%
map(set_names, c("h1", "h2")) %>%
map_dfr(bind_rows, .id = "id")
#> Error in add_row(.x, .y, .before = 1): object '.x' not found
A pointer to resolve this last step would be most appreciated.
Not quite sure how to do this via purrr map functions, but here is an alternative,
bind_rows(lapply(lst, function(i){d1 <- as.data.frame(matrix(names(i), ncol = ncol(i)));
rbind(d1, setNames(i, names(d1)))}), .id = 'id')
# id V1 V2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
Here's an approach using map, rbindlist from data.table and some base R functions:
map(lst, ~ as.data.frame(unname(rbind(colnames(.x),as.matrix(.x))))) %>%
rbindlist(idcol = "id")
# id V1 V2
#1: df1 X.1 heading
#2: df1 1 a
#3: df1 2 b
#4: df2 X.32 another.topic
#5: df2 3 Line 1
#6: df2 4 Line 2
Alternatively we could use map_df if we use colnames<-:
map_df(lst, ~ as.data.frame(rbind(colnames(.x),as.matrix(.x))) %>%
`colnames<-`(.,paste0("h",seq(1,dim(.)[2]))), .id = "id")
# id h1 h2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
Key things here are:
Use as.matrix to get rid of the factor / character incompatibility.
Remove names with unname or set them with colnames<-
Use the idcol = (rbindlist) or .id = (map_df) argument to get the names of the list as a column.
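The key points above can also be sketched in base R alone. This is a minimal sketch under the assumption that all columns are character; add_header_row is a hypothetical helper name introduced here, not a function from any package.

```r
# Same list as in the question; columns are kept as character
lst <- list(
  df1 = data.frame(X.1 = as.character(1:2), heading = letters[1:2],
                   stringsAsFactors = FALSE),
  df2 = data.frame(X.32 = as.character(3:4),
                   another.topic = paste("Line", 1:2),
                   stringsAsFactors = FALSE)
)

# Hypothetical helper: put the old column names in the first row and
# give every piece the same new names (h1, h2, ...)
add_header_row <- function(d, id) {
  m <- rbind(names(d), as.matrix(d))   # character matrix, old names on top
  out <- as.data.frame(unname(m), stringsAsFactors = FALSE)
  names(out) <- paste0("h", seq_len(ncol(out)))
  data.frame(id = id, out, stringsAsFactors = FALSE)
}

# Map() passes each data frame together with its list name
combined <- do.call(rbind, Map(add_header_row, lst, names(lst)))
rownames(combined) <- NULL
```

Going through a character matrix sidesteps type clashes between the header row and the data, at the cost of turning every column into character.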
I altered your sample data a bit, setting stringsAsFactors to FALSE when creating the data.frames in lst.
Here is a solution using data.table::rbindlist().
#sample data
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2],
stringsAsFactors = FALSE), # !! <--
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2),
stringsAsFactors = FALSE) # !! <--
)
DT <- data.table::rbindlist( lapply( lst, function(x) rbind( names(x), x ) ),
use.names = FALSE, idcol = "id" )
data.table::setnames(DT, names( lst[[1]] ), c("h1", "h2") )
# id h1 h2
# 1: df1 X.1 heading
# 2: df1 1 a
# 3: df1 2 b
# 4: df2 X.32 another.topic
# 5: df2 3 Line 1
# 6: df2 4 Line 2
I have a data frame where column "A" has 6 distinct values. Column "B" has float values. By using dplyr, I can group by column "A" and find mean of column "B" of each group as follows:
mydf %>% group_by(A) %>% summarize(Mean = mean(B, na.rm=TRUE))
My ultimate aim is to find the rows in each group whose "B" values are higher than the group average. How can I achieve this (using base R or dplyr)?
A simple alternative with base R ave would be
df[df$b > ave(df$b, df$a) , ]
# a b
#4 1 4
#5 1 5
#9 2 9
#10 2 10
The default FUN for ave is mean, so there is no need to specify it explicitly. If there are NA values in b, modify it to
df[df$b > ave(df$b, df$a, FUN = function(x) mean(x,na.rm = TRUE)) , ]
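To see why the comparison works, it may help to look at what ave() returns on its own. The sketch below uses the same ten-row example data as this answer; grp_mean and above are names introduced here for illustration.

```r
df <- data.frame(a = rep(c(1, 2), each = 5), b = 1:10)

# ave() returns one value per row: the mean of b within that row's group,
# so the result lines up with df$b element by element
grp_mean <- ave(df$b, df$a)
grp_mean
# 3 3 3 3 3 8 8 8 8 8

# Keep only the rows whose b exceeds their own group's mean
above <- df[df$b > grp_mean, ]
```

Because the group means are already aligned to the rows, no merge or join step is needed before filtering.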
Another solution with subset and ave, as suggested by @Onyambu:
subset(df, b > ave(b, a))
# a b
#4 1 4
#5 1 5
#9 2 9
#10 2 10
df <- data.frame(a = rep(c(1, 2), each = 5), b = 1:10)
# a b
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#5 1 5
#6 2 6
#7 2 7
#8 2 8
#9 2 9
#10 2 10
You can just group and then filter:
mydf %>%
group_by(A) %>%
filter(B > mean(B, na.rm = TRUE))
Using Base R, I would go for this. It is not as elegant as dplyr.
mean.df <- aggregate(mydf$b, by =list(a = mydf$a), FUN = mean)
names(mean.df)[2] <- "mean"
mydf <- merge(mydf, mean.df, by = "a")
# Rows whose values are higher than mean
new.df <- subset(mydf, b > mean, select = -mean)
I like working with data.table, so a data.table solution would be:
library(data.table)
mydt <- data.table(mydf)
mydt[, mean := mean(b), by = a]
# with = FALSE makes j select (here: drop) columns by name
new.dt <- mydt[b > mean, -c("mean"), with = FALSE]
Another way to do it using base R and tapply:
mydf <- cbind.data.frame(A = sample(6, 20, replace = TRUE), B = runif(20))
mydf.ave <- tapply(mydf$B, mydf$A, mean)
newdf <- mydf[mydf$B > mydf.ave[as.character(mydf$A)], ]
Thus the one-liner would be:
mydf[mydf$B > tapply(mydf$B, mydf$A, mean)[as.character(mydf$A)], ]
I would like to subset a data.frame based on the dates in the rownames. My dates are of this format:
df$Date <- as.yearmon(as.character(df$Date), "%Y%m")
Edit: I set the rownames equal to the Date variable like this (and would like to delete Date afterwards):
I thought of subsetting like this:
train_dates <- seq(as.yearmon(as.character("1959-12-31"), "%Y%m"), as.yearmon(as.character("1984-12-31"), "%Y%m", "months"))
df <- subset(df, rownames(df) %in% train_dates)
But I am having difficulties creating the correct sequence.
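Try using format. Start from the first day of each month, because stepping by month from the 31st overflows into the following month for shorter months:
train_dates <- format(seq(as.Date('1959-01-01'),
as.Date('1959-12-01'), by = 'month'), '%Y%m')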
and then, using library(data.table)
df <- as.data.table(df)
train_df <- df[Date %in% train_dates]
One solution could be using rownames_to_column from tibble package.
df <- data.frame(A = 1:5, B = letters[1:5])
rownames(df) <- c("195901", "196008", "196109", "201812", "196112")
# A B
# 195901 1 a
# 196008 2 b
# 196109 3 c
# 201812 4 d # not in train_dates
# 196112 5 e
#create sequence from 1959 to 1968. Lookup table
train_dates <- format(as.yearmon(1959 + seq(0, 119)/12), format="%Y%m")
Option #1:
df %>%
rownames_to_column("datemon") %>%
filter(datemon %in% train_dates) %>%
column_to_rownames("datemon")
# A B
# 195901 1 a
# 196008 2 b
# 196109 3 c
# 196112 5 e
Option #2
df[rownames(df) %in% train_dates, ]
# A B
# 195901 1 a
# 196008 2 b
# 196109 3 c
# 196112 5 e
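For the range named in the question (December 1959 to December 1984), the lookup vector can also be built in base R without zoo. This is a minimal sketch that assumes the row names are "%Y%m" strings such as "195912"; the small data frame is made up purely for illustration.

```r
# First-of-month dates avoid the overflow that seq() produces when
# stepping by month from the 31st
months_seq <- seq(as.Date("1959-12-01"), as.Date("1984-12-01"), by = "month")
train_dates <- format(months_seq, "%Y%m")

# Illustrative data frame with year-month row names (not from the question)
df <- data.frame(A = 1:3, row.names = c("195911", "195912", "198412"))

# drop = FALSE keeps a data frame even though there is a single column
train_df <- df[rownames(df) %in% train_dates, , drop = FALSE]
```

Here "195911" falls just before the training window, so only the last two rows are kept.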
I have a simple question about aggregating values in R.
Suppose I have a dataframe:
DF <- data.frame(col1=c("Type 1", "Type 1B", "Type 2"), col2=c(1, 2, 3))
which looks like this:
col1 col2
1 Type 1 1
2 Type 1B 2
3 Type 2 3
I notice that I have Type 1 and Type 1B in the data, so I would like to combine Type 1B into Type 1.
So I decide to use dplyr:
filter(DF, col1=='Type 1' | col1=='Type 1B') %>%
summarise(n = sum(col2))
But now I need to keep going with it:
DF2 <- data.frame('Type 1', filter(DF, col1=='Type 1' | col1=='Type 1B') %>%
summarise(n = sum(col2)))
I guess I want to rbind this new DF2 back to the original DF, but that means I have to set the column names to be consistent:
names(DF2) <- c('col1', 'col2')
OK, now I can rbind:
rbind(DF2, DF[3,])
The result? It worked....
col1 col2
1 Type 1 3
3 Type 2 3
...but ugh! That was awful! There has to be a better way to simply combine values.
Here's a possible dplyr approach:
DF %>%
group_by(col1 = sub("(.*\\d+).*$", "\\1", col1)) %>%
summarise(col2 = sum(col2))
#Source: local data frame [2 x 2]
# col1 col2
#1 Type 1 3
#2 Type 2 3
Using sub() with aggregate(), removing any trailing non-digit characters from col1:
do.call(data.frame, aggregate(col2 ~ cbind(col1 = sub("\\D+$", "", col1)), DF, sum))
# col1 col2
# 1 Type 1 3
# 2 Type 2 3
The do.call() wrapper is there so that the first column after aggregate() is properly changed from a matrix to a vector. This way there aren't any surprises later on down the road.
In my opinion, aggregate() is the perfect function for this purpose, but you shouldn't have to do any text processing (e.g. gsub()). I would do this in a two-step process:
1. Overwrite col1 with the new desired grouping.
2. Compute the aggregation using the new col1 to specify the grouping.
DF$col1 <- ifelse(DF$col1 %in% c('Type 1','Type 1B'), 'Type 1', as.character(DF$col1))
## col1 col2
## 1 Type 1 1
## 2 Type 1 2
## 3 Type 2 3
DF <- aggregate(col2~col1, DF, FUN=sum );
## col1 col2
## 1 Type 1 3
## 2 Type 2 3
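When more than one label needs merging, the two-step idea above can be sketched with a small lookup vector instead of a hard-coded ifelse(). This is only an illustration: recode_map and hit are names introduced here, and the mapping itself is an assumption about which labels belong together.

```r
DF <- data.frame(col1 = c("Type 1", "Type 1B", "Type 2"), col2 = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table: old label -> label it should be merged into
recode_map <- c("Type 1B" = "Type 1")

# Step 1: replace only the labels that appear in the lookup
hit <- DF$col1 %in% names(recode_map)
DF$col1[hit] <- recode_map[DF$col1[hit]]

# Step 2: aggregate on the cleaned-up grouping column
result <- aggregate(col2 ~ col1, DF, FUN = sum)
```

Adding another merge then only means adding one more entry to recode_map, with no text processing involved.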
You can try:
setDT(transform(DF, col1=gsub("(.*)[A-Z]+$","\\1",DF$col1)))[,list(col2=sum(col2)),col1]
# col1 col2
# 1: Type 1 3
# 2: Type 2 3
Or even more directly:
setDT(DF)[, .(col2 = sum(col2)), by = .(col1 = sub("[[:alpha:]]+$", "", col1))]