I'm using the method below to cast variables in a data frame from long to wide format. However, I'm looking for an alternative way, using another package.
Any help is much appreciated.
subject <- c(1:10, 1:10)
condition <- c(rep(1,10), rep(2,10))
value <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
rating <- c(1, 3, 5, 2, 3, 5, 6, 7, 5, 3, 5, 7, 3, 6, 3, 5, 6, 7, 7, 8)
df <- data.frame(subject, condition, value, rating)
library(data.table)
df_wide <- dcast(setDT(df), subject ~ condition, value.var=c("rating", "value"))
We can use tidyverse
library(tidyverse)
df %>%
gather(key, val, value:rating) %>%
unite(cond, key, condition) %>%
spread(cond, val)
# subject rating_1 rating_2 value_1 value_2
#1 1 1 5 1 1
#2 2 3 7 2 2
#3 3 5 3 3 3
#4 4 2 6 4 4
#5 5 3 3 5 5
#6 6 5 5 1 1
#7 7 6 6 2 2
#8 8 7 7 3 3
#9 9 5 7 4 4
#10 10 3 8 5 5
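As a possible alternative to the gather/unite/spread chain, newer versions of tidyr (1.0.0 or later, which is an assumption about the installed version) provide pivot_wider(), which can spread several value columns in one call. A minimal sketch on the question's data; the name df_wide is only reused here for illustration:

```r
library(tidyr)

# Same example data as in the question
df <- data.frame(
  subject   = c(1:10, 1:10),
  condition = c(rep(1, 10), rep(2, 10)),
  value     = rep(c(1, 2, 3, 4, 5), times = 4),
  rating    = c(1, 3, 5, 2, 3, 5, 6, 7, 5, 3, 5, 7, 3, 6, 3, 5, 6, 7, 7, 8)
)

# One row per subject; new columns are named <value column>_<condition>
df_wide <- pivot_wider(
  df,
  names_from  = condition,
  values_from = c(rating, value)
)

names(df_wide)
```

This should give the same columns as above (subject, rating_1, rating_2, value_1, value_2) without the intermediate long format.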
I want to create a sequence of numbers as a string. I have columns "start" and "end" indicating the start and end of the sequence. The desired output is a string with a sequence by 1. See the example below.
df <- data.frame(ID=seq(1:5),
start=seq(2,10,by=2),
end=seq(5,13,by=2),
desired_output_aschar= c("2,3,4,5", "4,5,6,7", "6,7,8,9", "8,9,10,11", "10,11,12,13"))
View(df)
Thank you in advance...
The following solution needs only one *apply loop.
mapply(function(x, y) paste(x:y, collapse = ","), df$start, df$end)
#[1] "2,3,4,5" "4,5,6,7" "6,7,8,9" "8,9,10,11" "10,11,12,13"
With the new lambdas, same output.
mapply(\(x, y) paste(x:y, collapse = ","), df$start, df$end)
Use mapply to call seq on each start/end pair, then sapply to paste each resulting column into one string.
sapply(
data.frame(mapply(seq,df$start,df$end)),
paste0,
collapse=","
)
X1 X2 X3 X4 X5
"2,3,4,5" "4,5,6,7" "6,7,8,9" "8,9,10,11" "10,11,12,13"
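Note that the mapply/data.frame approach relies on every start:end run having the same length, since data.frame() cannot combine vectors of different lengths. A sketch that should also cope with runs of unequal length, using vapply over the row indices; df2 and out are hypothetical names and the data are made up for illustration:

```r
# Hypothetical data where the runs have different lengths
df2 <- data.frame(start = c(2, 4, 6), end = c(5, 5, 9))

# vapply guarantees one character string per row
out <- vapply(
  seq_len(nrow(df2)),
  function(i) paste(seq(df2$start[i], df2$end[i]), collapse = ","),
  character(1)
)

out
```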
Using dplyr -
library(dplyr)
df %>%
rowwise() %>%
mutate(output = toString(start:end)) %>%
ungroup()
# ID start end output
# <int> <dbl> <dbl> <chr>
#1 1 2 5 2, 3, 4, 5
#2 2 4 7 4, 5, 6, 7
#3 3 6 9 6, 7, 8, 9
#4 4 8 11 8, 9, 10, 11
#5 5 10 13 10, 11, 12, 13
We could use map2 from purrr
library(dplyr)
library(purrr)
df %>%
mutate(output = map2_chr(start, end, ~ toString(.x:.y)))
ID start end desired_output_aschar output
1 1 2 5 2,3,4,5 2, 3, 4, 5
2 2 4 7 4,5,6,7 4, 5, 6, 7
3 3 6 9 6,7,8,9 6, 7, 8, 9
4 4 8 11 8,9,10,11 8, 9, 10, 11
5 5 10 13 10,11,12,13 10, 11, 12, 13
A data.table option
> setDT(df)[, out := toString(seq(start, end)), ID][]
ID start end desired_output_aschar out
1: 1 2 5 2,3,4,5 2, 3, 4, 5
2: 2 4 7 4,5,6,7 4, 5, 6, 7
3: 3 6 9 6,7,8,9 6, 7, 8, 9
4: 4 8 11 8,9,10,11 8, 9, 10, 11
5: 5 10 13 10,11,12,13 10, 11, 12, 13
Hi! I have data like this.
structure(list(V1QB10 = c(1, 1, 1, 2, 1, 3, 3, 1, 4, 2), V1QB12A = c(2,
1, 2, 3, NA, 2, 2, 3, 2, 2), V1QB12B = c(NA, 2, 2, 2, 2, 1, 2,
2, 2, 2), V1QB12C = c(NA, 1, 2, 2, 2, 1, 2, 2, 2, 2), sum = c(NA,
4, 6, 7, NA, 4, 6, 7, 6, 6)), row.names = c(NA, 10L), class = "data.frame")
This is what the data looks like:
V1QB10 V1QB12A V1QB12B V1QB12C sum
1 1 2 NA NA NA
2 1 1 2 1 4
3 1 2 2 2 6
4 2 3 2 2 7
5 1 NA 2 2 NA
6 3 2 1 1 4
7 3 2 2 2 6
8 1 3 2 2 7
9 4 2 2 2 6
10 2 2 2 2 6
Variable "sum" is the sum of "V1QB12*".
Now I'm trying to calculate the mean of the "sum" by "V1QB10":
dt %>%
group_by(V1QB10) %>%
dplyr::summarise(n=n(), mean=mean(sum), sd=sd(sum)) %>%
as.data.frame()
I expect the calculation to work like this:
for V1QB10==1, n is 3 (removing the 2 observations with NA in "V1QB12*"), the "sum" values add up to 4+6+7=17, so the mean is 17/3, and the sd is computed from the same three values.
But I found that I keep getting a mean of 17/5. Trying to replace the code with n=n(V1QB12A) also didn't work.
Maybe I'm overthinking this problem. How can I fix it?
Thank you!
I'm not completely sure I follow what you're looking for, but the tidyr package (load it with library(tidyr)) has a nifty drop_na() function that will remove the rows containing NAs if you use it like this:
dt <- dt %>%
drop_na() %>%
dplyr::mutate(sum=rowSums(dplyr::select(., contains("V1QB12")), na.rm=T))
dt %>%
group_by(V1QB10) %>%
dplyr::summarise(n=n(), mean=mean(sum), sd=sd(sum)) %>%
as.data.frame()
Result:
V1QB10 n mean sd
1 1 3 5.666667 1.5275252
2 2 2 6.500000 0.7071068
3 3 2 5.000000 1.4142136
4 4 1 6.000000 NA
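If the incomplete rows should stay in the data, another option is to leave dt untouched and skip the NAs inside summarise() itself. A sketch under that assumption, rebuilding the question's data with data.frame(); res is a hypothetical name, and base::sum is written out only to keep the function visually distinct from the column called sum:

```r
library(dplyr)

# Same values as the question's data
dt <- data.frame(
  V1QB10  = c(1, 1, 1, 2, 1, 3, 3, 1, 4, 2),
  V1QB12A = c(2, 1, 2, 3, NA, 2, 2, 3, 2, 2),
  V1QB12B = c(NA, 2, 2, 2, 2, 1, 2, 2, 2, 2),
  V1QB12C = c(NA, 1, 2, 2, 2, 1, 2, 2, 2, 2)
)

# Row total is NA whenever any V1QB12* item is NA
dt$sum <- rowSums(dt[, c("V1QB12A", "V1QB12B", "V1QB12C")])

res <- dt %>%
  group_by(V1QB10) %>%
  summarise(
    n    = base::sum(!is.na(sum)),   # count only complete rows
    mean = mean(sum, na.rm = TRUE),  # ignore NA totals
    sd   = sd(sum, na.rm = TRUE)
  ) %>%
  as.data.frame()

res
```

For V1QB10 == 1 this counts 3 rows and averages 4, 6 and 7, which is the 17/3 the question asks for.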
I am trying to change the 6s to NAs across multiple columns. I have tried using the mutate_at command in dplyr, but can't seem to make it work. Any ideas?
library(dplyr)
ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #Create vector of IDs for ID column.
Score1 <- c(1, 2, 3, 2, 5, 6, 6, 2, 5, 4) #Create vector of scores for Score1 column.
Score2 <- c(2, 2, 3, 6, 5, 6, 6, 2, 3, 4) #Create vector of scores for Score2 column.
Score3 <- c(3, 2, 3, 4, 5, 5, 6, 2, 6, 4) #Create vector of scores for Score3 column.
df <- data.frame(ID, Score1, Score2, Score3) #Combine columns into a data frame.
VectorOfNames <- as.vector(c("Score1", "Score2", "Score3")) #Create a vector of column names.
df <- mutate_at(df, VectorOfNames, 6=NA) #Within the data frame, apply the function (6=NA) to the columns specified in VectorOfNames.
dplyr has the na_if() function for precisely this task. You were almost there with your code and can use:
mutate_at(df, VectorOfNames, ~na_if(.x, 6))
ID Score1 Score2 Score3
1 1 1 2 3
2 2 2 2 2
3 3 3 3 3
4 4 2 NA 4
5 5 5 5 5
6 6 NA NA 5
7 7 NA NA NA
8 8 2 2 2
9 9 5 3 NA
10 10 4 4 4
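In dplyr 1.0.0 and later, mutate_at() is superseded by across(). Assuming such a version is installed, the same replacement could be sketched as follows; df_na is a hypothetical name, and all_of() is used because the column names are stored in a character vector:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  ID     = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Score1 = c(1, 2, 3, 2, 5, 6, 6, 2, 5, 4),
  Score2 = c(2, 2, 3, 6, 5, 6, 6, 2, 3, 4),
  Score3 = c(3, 2, 3, 4, 5, 5, 6, 2, 6, 4)
)
VectorOfNames <- c("Score1", "Score2", "Score3")

# Replace 6 with NA in the score columns only; ID is left alone
df_na <- df %>%
  mutate(across(all_of(VectorOfNames), ~ na_if(.x, 6)))

df_na
```

Because only the columns named in VectorOfNames are touched, the ID value 6 stays as it is.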
You could use:
library(dplyr)
df %>% mutate_at(VectorOfNames, ~replace(., . == 6, NA))
#OR
#df %>% mutate_at(VectorOfNames, ~ifelse(. == 6, NA, .))
# ID Score1 Score2 Score3
#1 1 1 2 3
#2 2 2 2 2
#3 3 3 3 3
#4 4 2 NA 4
#5 5 5 5 5
#6 6 NA NA 5
#7 7 NA NA NA
#8 8 2 2 2
#9 9 5 3 NA
#10 10 4 4 4
Or in base R:
df[VectorOfNames][df[VectorOfNames] == 6] <- NA
I'm trying to generate random numbers by group, repeated multiple times.
For example,
> set.seed(1002)
> df<-data.frame(ID=LETTERS[seq(1:5)],num=sample(c(2,3,4), size=5, replace=TRUE))
> df
ID num
1 A 3
2 B 4
3 C 3
4 D 2
5 E 3
Within each ID, I want to generate random numbers without replacement, repeated (for example) 4 times.
If ID is A, it will randomly select numbers from 1:3, 4 times. So, this will be
sample(c(1,2,3,1,2,3,1,2,3),replace=FALSE)
or
rep(sample(c(1:4), replace=FALSE),times=4)
If the result is 3 2 1 2 1 3 2 3 3 1 1 2, then the data will be
ID num
1 A 3
2 A 2
3 A 2
4 A 1
5 A 1
6 A 3
7 A 2
8 A 1
9 A 3
I tried several things, like
df%>%group_by(ID)%>%mutate(random=sample(rep(1:num,times=4),replace=FALSE))
It failed, with a warning mentioning In 1:num.
I also tried this.
ddply(df,.(ID),function(x) sample(rep(1:num,times=4),replace=FALSE))
An error appeared again, mentioning NA/NaN.
I would really appreciate it if you could let me know how to solve this problem.
We can create a list-column and then unnest it to have separate rows.
n <- 4
library(dplyr)
df %>%
group_by(ID) %>%
mutate(num = list(sample(rep(seq_len(num), n)))) %>%
tidyr::unnest(num)
# ID num
# <fct> <int>
# 1 A 2
# 2 A 2
# 3 A 2
# 4 A 3
# 5 A 3
# 6 A 1
# 7 A 3
# 8 A 1
# 9 A 1
#10 A 3
# … with 50 more rows
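The same idea can be sketched in base R, without a list-column: build one shuffled vector per ID, then repeat each ID to match its vector's length. The num values below are copied from the question's printed data frame (rather than re-drawn with sample()), and samples and out are hypothetical names:

```r
set.seed(1)

# num values as printed in the question
df <- data.frame(ID = LETTERS[1:5], num = c(3, 4, 3, 2, 3))
n <- 4  # how many times each value 1..num should appear

# For each ID: repeat 1..num n times, then shuffle (sampling without replacement)
samples <- lapply(df$num, function(k) sample(rep(seq_len(k), times = n)))

# Stack the shuffled vectors into long format, one row per drawn number
out <- data.frame(
  ID  = rep(df$ID, times = lengths(samples)),
  num = unlist(samples)
)

head(out)
```

Each ID ends up with n * num rows, and every value from 1 to num appears exactly n times within it.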
I'm not quite clear on your expected output.
The following samples num elements from 1:num with replacement, and stores samples in a list column sample.
library(tidyverse)
set.seed(2018)
df %>% mutate(sample = map(num, ~sample(1:.x, replace = TRUE)))
# ID num sample
#1 A 2 1, 1
#2 B 4 3, 4, 1, 2
#3 C 2 1, 1
#4 D 4 3, 3, 4, 4
#5 E 2 2, 2
Or if you want to repeat sampling num elements (with replacement) 4 times, you can do
set.seed(2018)
df %>%
mutate(sample = map(num, ~as.numeric(replicate(4, sample(1:.x, replace = TRUE)))))
#ID num sample
#1 A 2 1, 1, 1, 2, 1, 2, 1, 1
#2 B 4 3, 3, 4, 4, 4, 4, 4, 2, 3, 4, 3, 3, 2, 1, 1, 2
#3 C 2 1, 1, 1, 1, 1, 1, 1, 2
#4 D 4 2, 3, 2, 1, 3, 4, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1
#5 E 2 2, 1, 2, 2, 1, 1, 1, 2
I'm attempting to choose which data frame column to sum based on the value of a variable, using ddply:
df2 <- ddply(df1,'col1', summarise, total = sum(substr(variable,1,3)))
It appears not to be working because you can't sum a character, but I am trying to pass the reference to the column, not sum the literal result of the substring. Is there a way to get around this?
Example Data & Desired output:
variable = "Aug 2017"
col1 Jun Jul Aug
1 A 1 2 3
2 A 1 2 3
3 A 1 2 3
4 A 1 2 3
5 A 1 2 3
6 B 2 3 4
7 B 2 3 4
8 B 2 3 4
9 C 3 4 5
10 C 3 4 5
Desired Output:
1 A 15
2 B 12
3 C 10
This works with dplyr instead of plyr.
# create data
df1 <- data.frame(
col1 = c(rep('A', 5), rep('B', 3), rep('C', 2)),
Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5))
variable = 'Aug 2017'
# load dplyr library
library(dplyr)
# summarize each column that matches some string
df1 %>%
select(col1, matches(substr(variable, 1, 3))) %>%
group_by(col1) %>%
# summarize_each() is deprecated; across() is the current equivalent (dplyr >= 1.0.0)
summarize(across(everything(), sum))
# A tibble: 3 × 2
col1 Aug
<fctr> <dbl>
1 A 15
2 B 12
3 C 10
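Another way to pass a column name held in a string is the .data pronoun, which dplyr makes available inside verbs such as summarise(). A sketch under the assumption that the first three characters of variable exactly match a column name; month_col and res are hypothetical names:

```r
library(dplyr)

# Same example data as above
df1 <- data.frame(
  col1 = c(rep("A", 5), rep("B", 3), rep("C", 2)),
  Jun  = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Jul  = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Aug  = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5)
)
variable <- "Aug 2017"

# Column name derived from the string: "Aug"
month_col <- substr(variable, 1, 3)

# .data[[month_col]] looks the column up by name inside each group
res <- df1 %>%
  group_by(col1) %>%
  summarise(total = sum(.data[[month_col]]))

res
```

Unlike matches(), which treats its argument as a regular expression, this looks up exactly one column and names the result total, as in the desired output.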
I also highly recommend reading about nonstandard and standard evaluation, here:
http://adv-r.had.co.nz/Computing-on-the-language.html