I'm trying to accomplish something like what is illustrated in this question.
However, in my situation there may be multiple cases where two columns evaluate to True:
year cat1 cat2 cat3 ... catN
2000 0 1 1 0
2001 1 0 0 0
2002 0 1 0 1
....
2018 0 1 0 0
In the DF above, year 2000 has both the cat2 and cat3 categories. In this case, how do I create a new row that holds the second category? Something like this:
year category
2000 cat2
2000 cat3
2001 cat1
2002 cat2
2002 catN
....
2018 cat2
You can use gather from the Tidyverse
library(tidyverse)
data = tribble(
~year,~ cat1, ~cat2, ~cat3, ~catN,
2000, 0, 1, 1, 0,
2001, 1, 0, 0 , 0,
2002, 0, 1, 0, 1
)
data %>%
gather(key = "cat", value = "bool", 2:ncol(.)) %>%
filter(bool == 1)
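In tidyr 1.0.0 and later, gather is superseded by pivot_longer. A minimal sketch of the same idea with the newer function, assuming the three-row sample above (the column names category and flag are chosen here for illustration):

```r
library(dplyr)
library(tidyr)

# Same sample as above, built with data.frame() to keep the sketch self-contained
data <- data.frame(
  year = c(2000, 2001, 2002),
  cat1 = c(0, 1, 0),
  cat2 = c(1, 0, 1),
  cat3 = c(1, 0, 0),
  catN = c(0, 0, 1)
)

result <- data %>%
  # one row per (year, category) pair
  pivot_longer(-year, names_to = "category", values_to = "flag") %>%
  # keep only the categories that are switched on
  filter(flag == 1) %>%
  select(-flag)

result
```

Unlike gather, pivot_longer walks the data row by row, so the result comes out already ordered by year.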
One way would be to get row/column indices of all the values which are 1, subset the year values from row indices and column names from column indices to create a new dataframe.
mat <- which(df[-1] == 1, arr.ind = TRUE)
df1 <- data.frame(year = df$year[mat[, 1]], category = names(df)[-1][mat[, 2]])
df1[order(df1$year), ]
# year category
#2 2000 cat2
#5 2000 cat3
#1 2001 cat1
#3 2002 cat2
#6 2002 catN
#4 2018 cat2
data
df <- structure(list(year = c(2000L, 2001L, 2002L, 2018L), cat1 = c(0L,
1L, 0L, 0L), cat2 = c(1L, 0L, 1L, 1L), cat3 = c(1L, 0L, 0L, 0L
), catN = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -4L))
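To make the intermediate step easier to follow, here is a small sketch on a two-row frame (a made-up subset, not the full data) showing what which(..., arr.ind = TRUE) returns and how the indices map back to years and column names:

```r
# Hypothetical two-row subset, used only to illustrate the index matrix
small <- data.frame(year = c(2000L, 2001L),
                    cat1 = c(0L, 1L),
                    cat2 = c(1L, 0L),
                    cat3 = c(1L, 0L))

# One row per cell equal to 1; columns are named "row" and "col".
# Cells are scanned column by column, so cat1 hits come first.
idx <- which(small[-1] == 1, arr.ind = TRUE)

# Map row indices to years and column indices to category names
out <- data.frame(year = small$year[idx[, "row"]],
                  category = names(small)[-1][idx[, "col"]])

# Sort by year to get the layout asked for in the question
out <- out[order(out$year), ]
out
```

Because the scan is column-major, the explicit order() call is what groups the rows by year.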
You can also use melt in reshape2
library(reshape2)
new_df = melt(df, id.vars='year')
new_df[new_df$value==1, c('year','variable')]
Data
df = data.frame(year=c(2000,2001),
cat1=c(0,1),
cat2=c(1,0),
cat3=c(1,0))
Output:
year variable
2 2001 cat1
3 2000 cat2
5 2000 cat3
Here is another variation with gather: mutate the columns so that 0 becomes NA, then gather while removing the NA elements with na.rm = TRUE
library(dplyr)
library(tidyr)
data %>%
mutate_at(-1, na_if, y = 0) %>%
gather(category, val, -year, na.rm = TRUE) %>%
select(-val)
# A tibble: 5 x 2
# year category
# <dbl> <chr>
#1 2001 cat1
#2 2000 cat2
#3 2002 cat2
#4 2000 cat3
#5 2002 catN
data
data <- structure(list(year = c(2000, 2001, 2002), cat1 = c(0, 1, 0),
cat2 = c(1, 0, 1), cat3 = c(1, 0, 0), catN = c(0, 0, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I have two files and want to transfer data from one to the other after doing a test
File1:
ID, X1, X2, X3
2000, 1, 2, 3
2001, 3, 4, 5
1999, 2, 5, 6
2003, 3, 5, 4
File2:
ID, X1, X2, X3,
2000,
2001,
2002,
2003,
Result file will be like:
1999 "There is an error"
File2:
ID, X1, X2, X3
2000, 1, 2, 3
2001, 3, 4, 5
2002, Na, Na, Na
2003, 3, 5, 4
I tried to use a for loop with if. Unfortunately, it doesn't work:
for(j in length(1: nrows(file1){
for(i in length(1: nrows(file2){
if( file1&ID[j]>= file2&ID[j+1]){
print(j, ' wrong value')
esle
file2[i,]<- file1[j,]
break
It would be very nice if I could get some ideas or code showing how to produce something similar to the result file.
I hope I can find the right code to solve this problem.
No need to iterate with loops; you can simply use right_join from the dplyr package
library(dplyr)
df1 %>%
right_join(df2, by="ID") %>%
arrange(ID)
ID X1 X2 X3
1 2000 1 2 3
2 2001 3 4 5
3 2002 NA NA NA
4 2003 3 5 4
Sample data
df1 <- structure(list(ID = c(2000L, 2001L, 1999L, 2003L), X1 = c(1L,
3L, 2L, 3L), X2 = c(2L, 4L, 5L, 5L), X3 = c(3L, 5L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ID = 2000:2003), class = "data.frame", row.names = c(NA,
-4L))
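The join silently drops ID 1999, whereas the question also asks for a message about such rows. A base R sketch that reports them with setdiff and fills the second table with merge, assuming the sample data above (the names unmatched and filled are illustrative):

```r
df1 <- data.frame(ID = c(2000L, 2001L, 1999L, 2003L),
                  X1 = c(1L, 3L, 2L, 3L),
                  X2 = c(2L, 4L, 5L, 5L),
                  X3 = c(3L, 5L, 6L, 4L))
df2 <- data.frame(ID = 2000:2003)

# IDs present in the first table but absent from the second
unmatched <- setdiff(df1$ID, df2$ID)
for (id in unmatched) message(id, " There is an error")

# Keep every ID of df2; IDs without a match in df1 get NA in X1..X3.
# merge() sorts the result by the key column by default.
filled <- merge(df1, df2, by = "ID", all.y = TRUE)
filled
```

all.y = TRUE plays the role of the right join here, so no loop over rows is needed for the fill itself.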
Using data.table
library(data.table)
setDT(df2)[df1, names(df1)[-1] := mget(paste0("i.", names(df1)[-1])), on = .(ID)]
Output:
> df2
ID X1 X2 X3
1: 2000 1 2 3
2: 2001 3 4 5
3: 2002 NA NA NA
4: 2003 3 5 4
Here is a slightly different approach, which does not give the exact expected output. Note that year 1999 is kept in the dataframe:
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
bind_rows(df1, df2) %>%
group_by(ID) %>%
summarise_all(coalesce_by_column)
ID X1 X2 X3
<int> <int> <int> <int>
1 1999 2 5 6
2 2000 1 2 3
3 2001 3 4 5
4 2002 NA NA NA
5 2003 3 5 4
I have a dataframe that consists of vegetation data. Columns are species names and rows are their relative abundances per site. Site, plotcode and year are also variables. Data looks like this:
Site Code Year speca specb specc
A A1 2001 0 1 10
A A2 2001 5 5 15
B B1 2001 0 5 20
B B1 2004 15 75 0
C C1 2006 50 0 15
I want the datatable to look like this:
species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
speca 0 5 0 15 50
specb 1 5 5 75 0
specc 10 15 20 0 15
I tried using the tidyr::pivot_longer function, but this does not give the result I want.
tidyr::pivot_longer(df, 4:length(df), names_to = "species", values_to = "abundance")
Is there a way to achieve this in a code-friendly way, preferably using tidyr (tidyverse)?
We reshape it to 'long' format and then do the 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('spec'), names_to = 'species') %>%
unite(CodeYear, Code, Year) %>%
select(-Site) %>%
pivot_wider(names_from = CodeYear, values_from = value)
# A tibble: 3 x 6
# species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
# <chr> <int> <int> <int> <int> <int>
#1 speca 0 5 0 15 50
#2 specb 1 5 5 75 0
#3 specc 10 15 20 0 15
data
df <- structure(list(Site = c("A", "A", "B", "B", "C"), Code = c("A1",
"A2", "B1", "B1", "C1"), Year = c(2001L, 2001L, 2001L, 2004L,
2006L), speca = c(0L, 5L, 0L, 15L, 50L), specb = c(1L, 5L, 5L,
75L, 0L), specc = c(10L, 15L, 20L, 0L, 15L)), class = "data.frame",
row.names = c(NA,
-5L))
In data.table:
library(data.table)
DT <- data.table(Site = c('A1','A2','B1','B1','C1'),
Year = c(2001, 2001, 2001, 2004, 2006),
speca = c(0,5,0,15,50),
specb = c(1,5,5,75,0),
specc = c(10,15,20,0,15))
DT <- melt(DT, id.vars = c('Site', 'Year'),
measure.vars = c('speca', 'specb', 'specc') , variable.name = 'species')
DT <- dcast(DT, species ~ Site + Year, value.var = c('value'))
> DT
species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
1: speca 0 5 0 15 50
2: specb 1 5 5 75 0
3: specc 10 15 20 0 15
You mainly need a pivot_wider() to follow your pivot_longer():
library(tidyverse)
df <- tribble(~Site, ~Code, ~Year, ~speca, ~specb, ~specc,
"A", "A1", 2001, 0, 1, 10,
"A", "A2", 2001, 5, 5, 15,
"B", "B1", 2001, 0, 5, 20,
"B", "B1", 2004, 15, 75, 0,
"C", "C1", 2006, 50, 0, 15)
df %>%
mutate(Code = paste(Code, Year, sep = "_")) %>%
select(-Site, -Year) %>%
pivot_longer(starts_with("spec"), names_to = "species", values_to = "abundance") %>%
pivot_wider(names_from = Code, values_from = abundance)
The result is
# A tibble: 3 x 6
species A1_2001 A2_2001 B1_2001 B1_2004 C1_2006
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 speca 0 5 0 15 50
2 specb 1 5 5 75 0
3 specc 10 15 20 0 15
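Since every Code/Year pair occurs only once in the sample, the target layout is essentially a transpose of the species columns. A base R sketch under that assumption (it would not aggregate duplicated Code/Year pairs):

```r
df <- data.frame(Site = c("A", "A", "B", "B", "C"),
                 Code = c("A1", "A2", "B1", "B1", "C1"),
                 Year = c(2001, 2001, 2001, 2004, 2006),
                 speca = c(0, 5, 0, 15, 50),
                 specb = c(1, 5, 5, 75, 0),
                 specc = c(10, 15, 20, 0, 15))

# Transpose the species columns: species become rows, plots become columns
wide <- as.data.frame(t(df[, c("speca", "specb", "specc")]))

# Label each column with its Code_Year combination
names(wide) <- paste(df$Code, df$Year, sep = "_")

# Move the species names from the row names into a regular column
wide <- cbind(species = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

The pivot_longer/pivot_wider route stays the safer choice when the key columns may repeat or the species columns are of mixed types.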
I have a data set that looks something like this
A B 1960 1970 1980
x a 1 2 3
x b 1.1 2.1 NA
y a 2 3 4
y b 1 NA 1
I want to transform the columns based on column B so that it looks something like this
A year a b
x 1960 1 1.1
x 1970 2 2.1
x 1980 3 NA
y 1960 2 1
y 1970 3 NA
y 1980 4 1
I am not sure how to do this. I know that I can do a full transformation using t() or using row_to_columns() from tidyverse, but the result is not what I want.
The initial data has about 60 columns and 165 distinct values in column B.
You can do pivot_longer() and then pivot_wider(), although it might be a bad idea to rename your column "B" again:
library(dplyr)
library(tidyr)
df %>% pivot_longer(-c(A,B)) %>%
pivot_wider(names_from=B) %>% rename(B=name)
# A tibble: 6 x 4
A B a b
<fct> <chr> <dbl> <dbl>
1 x 1960 1 1.1
2 x 1970 2 2.1
3 x 1980 3 NA
4 y 1960 2 1
5 y 1970 3 NA
6 y 1980 4 1
df = structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("x",
"y"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), `1960` = c(1, 1.1, 2, 1), `1970` = c(2,
2.1, 3, NA), `1980` = c(3L, NA, 4L, 1L)), class = "data.frame", row.names = c(NA,
-4L))
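To avoid reusing the name B, the long column can be named directly through names_to. A sketch assuming the same data, with the column called year as in the desired output (the as.integer conversion is an optional extra, not part of the answer above):

```r
library(tidyr)

df <- data.frame(A = c("x", "x", "y", "y"),
                 B = c("a", "b", "a", "b"),
                 `1960` = c(1, 1.1, 2, 1),
                 `1970` = c(2, 2.1, 3, NA),
                 `1980` = c(3, NA, 4, 1),
                 check.names = FALSE)  # keep the numeric column names as-is

# Stack the year columns, naming the key column "year" up front
long <- pivot_longer(df, cols = -c(A, B), names_to = "year", values_to = "value")

# Spread the values of B into their own columns
out <- pivot_wider(long, names_from = B, values_from = value)

# Column names arrive as character; convert if numeric years are needed
out$year <- as.integer(out$year)
out
```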
library(data.table)
dt <- fread('A B 1960 1970 1980
x a 1 2 3
x b 1.1 2.1 NA
y a 2 3 4
y b 1 NA 1')
names(dt) <- as.character(dt[1,])
dt <- dt[-1,]
dt[,(3:5):=lapply(.SD,as.numeric),.SDcols=3:5]
dcast(melt(dt,measure.vars = 3:5),...~B,value.var = "value")
#> A variable a b
#> 1: x 1960 1 1.1
#> 2: x 1970 2 2.1
#> 3: x 1980 3 NA
#> 4: y 1960 2 1.0
#> 5: y 1970 3 NA
#> 6: y 1980 4 1.0
Created on 2020-05-05 by the reprex package (v0.3.0)
Base R solution:
long_df <- reshape(df, direction = "long",
varying = which(!names(df) %in% c("A", "B")),
v.names = "value",
timevar = "year",
times = names(df)[!(names(df) %in% c("A", "B"))],
ids = NULL,
new.row.names = 1:(length(which(!names(df) %in% c("A", "B"))) * nrow(df)))
wide_df <- setNames(reshape(long_df, direction = "wide",
idvar = c("A", "year"),
timevar = "B"), c("A", "B", unique(df$B)))
Data:
df <- structure(list(A = c("x", "x", "y", "y"), B = c("a", "b", "a",
"b"), `1960` = c(1, 1.1, 2, 1), `1970` = c(2, 2.1, 3, NA), `1980` = c(3L,
NA, 4L, 1L)), row.names = 2:5, class = "data.frame")
I have a data set of the following:
Id Val1 Val2
ID1 3 12
ID1 4 NA
ID1 -2 NA
ID1 4 33
ID2 4 NA
I want to replace the NA with Val1+Val2 from the previous row if the Id is the same. The following is the ideal output:
Id Val1 Val2
ID1 3 12
ID1 4 15
ID1 -2 19
ID1 4 33
ID2 4 NA
I have a very big dataset. I personally don't like for loops in R and am looking for an elegant vectorized solution.
Here is one option where we group by 'Id' and a group created by taking the cumulative sum of logical vector i.e. where there are no missing values in 'Val2', then add (+) the first element of 'Val2' with the cumsum of 'Val1', take the lag, ungroup and remove the temporary 'grp' column
library(dplyr)
df1 %>%
group_by(Id, grp = cumsum(!is.na(Val2))) %>%
mutate(Val2 = lag(first(Val2) + cumsum(Val1), default = first(Val2))) %>%
ungroup %>%
select(-grp)
# A tibble: 5 x 3
# Id Val1 Val2
# <fct> <dbl> <dbl>
#1 ID1 3 12
#2 ID1 4 15
#3 ID1 -2 19
#4 ID1 4 33
#5 ID2 4 NA
data
df1 <- structure(list(Id = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("ID1",
"ID2"), class = "factor"), Val1 = c(3, 4, -2, 4, 4), Val2 = c(12,
NA, NA, 33, NA)), class = "data.frame", row.names = c(NA, -5L
))
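The same grouping idea can be traced step by step in base R. The following is a sketch with ave(), assuming the sample data above; grp, key, start and prev_sum are names chosen for illustration:

```r
df1 <- data.frame(Id = c("ID1", "ID1", "ID1", "ID1", "ID2"),
                  Val1 = c(3, 4, -2, 4, 4),
                  Val2 = c(12, NA, NA, 33, NA))

# A new block starts at every non-missing Val2
grp <- cumsum(!is.na(df1$Val2))

# Combine Id and block number so blocks never span two Ids
key <- interaction(df1$Id, grp, drop = TRUE)

# First Val2 of each block (NA if the block starts with a missing value)
start <- ave(df1$Val2, key, FUN = function(v) v[1])

# Running sum of Val1 within the block, shifted down by one row
prev_sum <- ave(df1$Val1, key, FUN = function(v) c(0, head(cumsum(v), -1)))

df1$Val2 <- start + prev_sum
df1
```

Each filled value is the block's starting Val2 plus the Val1 values of all earlier rows in that block, which is the same as adding Val1 and Val2 of the previous row.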
I've been using the dplyr package to create aggregated data tables, for example using the following code:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(price1, na.rm = TRUE),2),
mean_price2 = round(mean(price2, na.rm = TRUE),2))
However, I would like to only calculate the mean over the distinct values of price1 and price2 within groups
e.g:
Price1: 1 1 2 1 2 2 1
Goes to (before aggregation):
Price1: 1 2 1 2 1
(In general, price1 and price2 will not have the same number of values left after this removal.) I would also like to calculate a count for each of price1 and price2, counting only distinct values within groups. (Groups are defined as two or more identical values adjacent to each other.)
I have tried:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),
mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))
And also tried wrapping the columns within the select function with distinct(), but both these throw errors.
Is there a way to do this using dplyr or another similar package without having to write something from scratch?
To satisfy your requirement for distinct, we need to remove successive values that are the same. For numeric vectors, this can be accomplished by:
x <- x[c(1, which(diff(x) != 0)+1)]
The default use of diff computes the difference between adjoining elements in the vector. We use this to detect successive values that are different, for which diff(x) != 0. Since the output differences are lagged by 1, we add 1 to the indices of these distinct elements, and we also want the first element as distinct. For example:
x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1
We can then use this with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))
or, better yet, define the function:
remove.repeats <- function(x) {
x[c(1,which(diff(x) != 0)+1)]
}
and use it with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(remove.repeats(price1)),
mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
count_non_rep_2 = length(remove.repeats(price2)),
mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))
Using this on some example data that is hopefully similar to yours:
df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3),
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)),
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
## calc.method price1 price2
##1 A 1 1
##2 A 1 1
##3 A 2 1
##4 A 1 1
##5 A 2 1
##6 A 2 1
##7 A 1 1
##8 B 1 2
##9 B 1 1
##10 B 2 2
##11 B 2 1
##12 B 2 2
##13 B 2 1
##14 B 1 2
##15 B 3 1
We get:
print(agg_data)
### A tibble: 2 x 6
## calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
## <fctr> <int> <int> <dbl> <int> <dbl>
##1 A 7 5 1.40 1 1.0
##2 B 8 4 1.75 8 1.5
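As an alternative to the diff-based helper, base R's rle() (run-length encoding) returns one value per run of identical adjacent elements, and it also works on non-numeric vectors. A sketch, with remove_repeats_rle as a hypothetical name:

```r
# One value per run of identical adjacent elements
remove_repeats_rle <- function(x) {
  rle(x)$values
}

x <- c(1, 1, 2, 1, 2, 2, 1)
collapsed <- remove_repeats_rle(x)
collapsed

# Works for character vectors too, where diff() would fail
labels <- remove_repeats_rle(c("a", "a", "b", "a"))
labels

# Per-group mean over collapsed runs, without dplyr
prices <- c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3)
method <- rep(c("A", "B"), times = c(7, 8))
group_means <- sapply(split(prices, method),
                      function(p) mean(remove_repeats_rle(p)))
group_means
```

The helper can be dropped into the summarize() call in place of remove.repeats without other changes.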