This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I want to reshape/rearrange a dataset that is stored as a data.frame with 2 columns:
id (non-unique, i.e. can repeat over several rows) --> stored as character
value --> stored as numeric (range 1 to 3)
Sample data:
library(dplyr)
id <- as.character(1001:1003)
val_list <- data.frame(sample(1:3, size = 12, replace = TRUE))
have <- data.frame(cbind(rep(id, 4), val_list))
colnames(have) <- c("id", "values")
have <- have %>% arrange(id)
This gives me the following output:
id values
1 1001 2
2 1001 2
3 1001 2
4 1001 3
5 1002 2
6 1002 3
7 1002 2
8 1002 2
9 1003 1
10 1003 3
11 1003 1
12 1003 2
What I want:
want <- data.frame(cbind(have[1:4, 2],
have[5:8, 2],
have[9:12, 2]))
colnames(want) <- id
Output of want:
1001 1002 1003
1 2 2 1
2 2 3 3
3 2 2 1
4 3 2 2
My original dataset has >1000 distinct "id" values and >50 "value" entries per id.
I want to chunk/slice the dataset to get a new data.frame where each "id" becomes one column listing its "value" entries.
It is possible to solve this with a loop, but I would like a vectorized solution.
If possible a base R "one-liner", but other solutions are also appreciated.
You can create a unique row number for each id and use pivot_wider:
library(dplyr)
have %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = id, values_from = values) %>%
  select(-row)
# A tibble: 4 x 3
#  `1001` `1002` `1003`
#   <int>  <int>  <int>
#1      2      2      1
#2      2      3      3
#3      2      2      1
#4      3      2      2
Or using data.table
library(data.table)
dcast(setDT(have), rowid(id)~id, value.var = 'values')
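The question also asked for a base R one-liner. As a sketch, `unstack` can do this, assuming every id has the same number of rows (note that data.frame name-mangling may prefix the numeric ids, e.g. `X1001`, unless you clean up the names afterwards):

```r
# base R sketch: one column per id, assuming equal group sizes
unstack(have, values ~ id)
```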
data
have <- structure(list(id = c(1001L, 1001L, 1001L, 1001L, 1002L, 1002L,
1002L, 1002L, 1003L, 1003L, 1003L, 1003L), values = c(2L, 2L,
2L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 1L, 2L)), class = "data.frame",
row.names = c(NA, -12L))
So, for example, I have the following dataframe, data:
col1 col2
   1    5
   1    5
   1    3
   2   10
   2   11
   3   11
Now, I want to make a new column, col3, which gives me the number of unique values in col2 for every grouping in col1.
So far, I have the following code:
length(unique(data$col2[data$col1 == 1]))
Which would here return the number 2.
However, I'm having a hard time making a loop that goes through all the values in col1 to create the new column, col3.
We can use n_distinct after grouping
library(dplyr)
data <- data %>%
  group_by(col1) %>%
  mutate(col3 = n_distinct(col2)) %>%
  ungroup()
Output:
data
# A tibble: 6 × 3
col1 col2 col3
<int> <int> <int>
1 1 5 2
2 1 5 2
3 1 3 2
4 2 10 2
5 2 11 2
6 3 11 1
Or with data.table
library(data.table)
setDT(data)[, col3 := uniqueN(col2), col1]
data
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
You want the counts for every row, so using a for loop you would do
data$col3 <- NA_real_
for (i in seq_len(nrow(data))) {
  data$col3[i] <- length(unique(data$col2[data$col1 == data$col1[i]]))
}
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
However, for loops in R are often slower than vectorized alternatives, and in this case we can use the grouping function ave, which comes with base R. (The \(x) lambda shorthand requires R >= 4.1.)
data <- transform(data, col3=ave(col2, col1, FUN=\(x) length(unique(x))))
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
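Another base R route is tapply with a named-vector lookup, shown here as a sketch on the same data:

```r
# count distinct col2 values per col1 group, then look up the count per row
n_uni <- tapply(data$col2, data$col1, function(x) length(unique(x)))
data$col3 <- unname(n_uni[as.character(data$col1)])
```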
Data:
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
I have a data frame that looks like this
column1
1
1
2
3
3
and I would like to give a unique ID to each element. My problem is that I cannot
find a way to make the unique IDs start from zero, like this:
column1 column2
1 0
1 0
2 1
3 2
3 2
Any help is appreciated.
Try this: cur_group_id() from dplyr numbers groups starting at 1, but you can easily shift it to start from zero:
library(dplyr)
#Data
df <- structure(list(column1 = c(1L, 1L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA, -5L))
#Mutate
df %>% group_by(column1) %>% mutate(id = cur_group_id() - 1)
# A tibble: 5 x 2
# Groups:   column1 [3]
  column1    id
    <int> <dbl>
1       1     0
2       1     0
3       2     1
4       3     2
5       3     2
We could use match
library(dplyr)
df1 %>%
  mutate(column2 = match(column1, unique(column1)) - 1)
data
df1 <- structure(list(column1 = c(1L, 1L, 2L, 3L, 3L)), class = "data.frame",
row.names = c(NA,
-5L))
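If dplyr is not available, a base R sketch of the same idea uses duplicated, assuming ids should follow order of first appearance and that equal values sit in contiguous runs:

```r
df1 <- data.frame(column1 = c(1L, 1L, 2L, 3L, 3L))
# cumsum counts each first appearance; subtract 1 to start at zero
df1$column2 <- cumsum(!duplicated(df1$column1)) - 1
df1
```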
This question already has answers here:
Insert rows for missing dates/times
(9 answers)
How to add only missing Dates in Dataframe
(3 answers)
Closed 3 years ago.
I have a dataset that look something like this:
Person date Amount
A 2019-01 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
As you'll notice in the "date" column, there are missing dates, such as '2019-02' for person A and '2019-05' and '2019-06' for person B. I would like to insert rows with the missing date and amount equal to the one before it (see expected result below).
I have tried performing group by but I don't know how to proceed from there. I've also tried converting the 'date' and 'amount' columns as lists, and from there fill in the gaps before putting them back to the dataframe. I was wondering if there is a more convenient way of doing this. In particular, getting the same results without having to extract lists from the original dataframe.
Ideally, I would want to have a dataframe that looks something like this:
Person date Amount
A 2019-01 900
A 2019-02 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-05 1200
B 2019-06 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
I hope I was able to make my problem clear.
Thanks in advance.
We can first convert date to an actual Date object (date1) by pasting "-01" onto the end. Then, using complete, we create a sequence of one-month dates for each Person; fill carries the previous Amount forward; and to return the data to its original form, we strip the "-01" from date1 again.
library(dplyr)
library(tidyr)
df %>%
  mutate(date1 = as.Date(paste0(date, "-01"))) %>%
  group_by(Person) %>%
  complete(date1 = seq(min(date1), max(date1), by = "1 month")) %>%
  fill(Amount) %>%
  mutate(date = sub("-01$", "", date1)) %>%
  select(-date1)
# Person date Amount
# <fct> <chr> <int>
# 1 A 2019-01 900
# 2 A 2019-02 900
# 3 A 2019-03 600
# 4 A 2019-04 300
# 5 A 2019-05 0
# 6 B 2019-04 1200
# 7 B 2019-05 1200
# 8 B 2019-06 1200
# 9 B 2019-07 800
#10 B 2019-08 400
#11 B 2019-09 0
data
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), date = structure(c(1L,
2L, 3L, 4L, 3L, 5L, 6L, 7L), .Label = c("2019-01", "2019-03",
"2019-04", "2019-05", "2019-07", "2019-08", "2019-09"), class = "factor"),
Amount = c(900L, 600L, 300L, 0L, 1200L, 800L, 400L, 0L)),
class = "data.frame", row.names = c(NA, -8L))
I have a dataset like this
>head(grocery)
milk bread juice honey eggs beef ... (140 more variables)
1 4 3 1 4 2
2 5 4 2 4 3
1 2 6 0 7 0
0 1 5 3 3 1
4 10 2 1 5 8
I want to select the 3 columns with the largest sums, showing the rank, column name, and sum, in descending order, like this:
1 eggs 23
2 bread 22
3 juice 20
How can I do this?
Thank you very much for helping!
With dplyr and tidyr (gather is superseded by pivot_longer in current tidyr, but still works):
library(dplyr)
library(tidyr)
df %>%
  gather(key, value) %>%
  group_by(key) %>%
  summarise(Sum = sum(value)) %>%
  arrange(desc(Sum)) %>%
  top_n(3, Sum)
# A tibble: 3 x 2
key Sum
<chr> <int>
1 eggs 23
2 bread 22
3 juice 20
Data:
df <- structure(list(milk = c(1L, 2L, 1L, 0L, 4L), bread = c(4L, 5L,
2L, 1L, 10L), juice = c(3L, 4L, 6L, 5L, 2L), honey = c(1L, 2L,
0L, 3L, 1L), eggs = c(4L, 4L, 7L, 3L, 5L), beef = c(2L, 3L, 0L,
1L, 8L)), class = "data.frame", row.names = c(NA, -5L))
Original answer
In base R you can find sums of values for each column, sort the resulted values in decreasing order, subset first 3 values and cbind them to get the desired output:
cbind(sort(colSums(dat), T)[1:3])
# [,1]
#eggs 23
#bread 22
#juice 20
Updated answer
...how I can go back to the original data set from this solution?...
Here I subset the original data set by the names corresponding to the three largest column sums. There is probably a better solution, but this is the one I can find right now.
dat[, names(sort(colSums(dat), T)[1:3])]
# eggs bread juice
#1 4 4 3
#2 4 5 4
#3 7 2 6
#4 3 1 5
#5 5 10 2
Data:
dat <- read.table(
text = "milk bread juice honey eggs beef
1 4 3 1 4 2
2 5 4 2 4 3
1 2 6 0 7 0
0 1 5 3 3 1
4 10 2 1 5 8",
stringsAsFactors = F,
header = T
)
Take a look at this tiny example:
data <- NULL
data$c <- c(1, 2, 3, 4)
data$b <- c(4, 5, 6, 7)
data$a <- c(1, 1, 1, 1)
sapply(data, sum)   # data is a list, so use sapply(); apply() fails on lists

arraysum <- NULL
for (i in names(data)) {
  arraysum$name <- append(arraysum$name, i)
  arraysum$sum <- append(arraysum$sum, sum(data[[i]]))
}
arraysum$sum
arraysum$name[order(arraysum$sum, decreasing = TRUE)]
This question already has an answer here:
R: How to get the last element from each group?
(1 answer)
Closed 3 years ago.
I am dealing with data frames and have a dataset let's say
ID Name
1 aaa
1 aaa.
1 aaa
2 ccc
3 111.
3 333
3 111
3 111
I want the longest string for each ID. Output:
1 aaa.
2 ccc
3 111.
Data:
dat <- structure(
list(
ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L),
Name = c("aaa", "aaa.", "aaa", "ccc", "111.", "333", "111", "111")
),
class = "data.frame",
row.names = c(NA,-8L)
)
With dplyr:
library(dplyr)
dat %>%
  group_by(ID) %>%
  filter(nchar(Name) == max(nchar(Name)))
# A tibble: 3 x 2
# Groups: ID [3]
ID Name
<int> <chr>
1 1 aaa.
2 2 ccc
3 3 111.
With base R, using nchar so the longest string (rather than the lexical maximum) is kept:
aggregate(. ~ ID, dat, function(x) x[which.max(nchar(x))])
  ID Name
1  1 aaa.
2  2  ccc
3  3 111.
Using slice and dplyr as suggested by #Wil:
dat %>%
  group_by(ID) %>%
  slice(which.max(nchar(Name)))
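A base R equivalent of the same nchar approach, keeping the first longest string per ID:

```r
# split by ID, pick the row with the longest Name in each group
do.call(rbind, lapply(split(dat, dat$ID), function(d)
  d[which.max(nchar(d$Name)), ]))
```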