Combining several columns based on matching text in R

I ran a study in Qualtrics with 4 conditions. I'm only including 3 in the example below for ease. The resulting data looks something like this:
condition  Q145    Q243    Q34     Q235    Q193    Q234    Q324    Q987    Q88
condition  How a?  How b?  How c?  How a?  How b?  How c?  How a?  How b?  How c?
1          3       5       2
1          5       4       7
1          3       1       4
2                                  3       4       7
2                                  1       2       8
2                                  1       3       9
3                                                          7       6       5
3                                                          8       1       3
3                                                          9       2       2
The questions in the 2nd row are longer and more complex in the actual dataset, but they are consistent across conditions. In this sample, I've tried to capture the consistency and the fact that the default variable names (all starting with Q) do not match up.
Ultimately, I would like a dataframe that looks like the following. I would like to consolidate all the responses to a single question into one column per question. (Then I will go in and rename the lengthy questions with more concise variable names and "tidy" the data.)
condition How a? How b? How c?
1 3 5 2
1 5 4 7
1 3 1 4
2 3 4 7
2 1 2 8
2 1 3 9
3 7 6 5
3 8 1 3
3 9 2 2
I'd appreciate any ideas for how to accomplish this.

library(tidyverse)
file = 'condition,Q145 ,Q243 ,Q34 ,Q235 ,Q193 ,Q234 ,Q324 ,Q987 ,Q88
condition,How a?,How b?,How c?,How a?,How b?,How c?,How a?,How b?,How c?
1 ,3 ,5 ,2 , , , , , ,
1 ,5 ,4 ,7 , , , , , ,
1 ,3 ,1 ,4 , , , , , ,
2 , , , ,3 ,4 ,7 , , ,
2 , , , ,1 ,2 ,8 , , ,
2 , , , ,1 ,3 ,9 , , ,
3 , , , , , , , 7 , 6 , 5
3 , , , , , , , 8 , 1 , 3
3 , , , , , , , 9 , 2 , 2'
# Read in just the data without the weird header situation
data <- read_csv(file, col_names = FALSE, skip = 2)
# Pull out the questions row and reshape into a dataframe to make the next part easy
questions <- gather(read_csv(file, col_names = FALSE, skip = 1, n_max = 1))
# Generate list of data frames (one df for each question)
split(questions, questions$value) %>%
# Then coalesce the columns
map_df(~do.call(coalesce, data[, .x$key]))
Gives the following result:
# A tibble: 9 x 4
condition `How a?` `How b?` `How c?`
<int> <int> <int> <int>
1 1 3 5 2
2 1 5 4 7
3 1 3 1 4
4 2 3 4 7
5 2 1 2 8
6 2 1 3 9
7 3 7 6 5
8 3 8 1 3
9 3 9 2 2
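To make the coalescing step concrete, here is a minimal sketch of what dplyr's coalesce() does on its own. The vector names cond1, cond2 and cond3 are made up for illustration; the values are the first "How a?" response from each condition in the sample data.

```r
library(dplyr)

# One vector per condition-specific column; each row has a value in
# exactly one of them and NA in the others (hypothetical names).
cond1 <- c(3, NA, NA)
cond2 <- c(NA, 3, NA)
cond3 <- c(NA, NA, 7)

# coalesce() takes, position by position, the first non-missing value
combined <- coalesce(cond1, cond2, cond3)
combined
```

The answer above applies the same idea to every group of columns that share a question text.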
Of course, if you intend to move to long format eventually, you might just do something like this:
data %>%
gather(key, answer, -X1) %>%
filter(!is.na(answer)) %>%
left_join(questions, by = 'key') %>%
select(condition = X1, question = value, answer)
Resulting in the following:
# A tibble: 27 x 3
condition question answer
<int> <chr> <int>
1 1 How a? 3
2 1 How a? 5
3 1 How a? 3
4 1 How b? 5
5 1 How b? 4
6 1 How b? 1
7 1 How c? 2
8 1 How c? 7
9 1 How c? 4
10 2 How a? 3
# ... with 17 more rows

Related

Rank ordering rows of a data.frame in R

I was wondering if there is a way to rank-order the rows of my Data below such that rows that simultaneously have the largest values on each of risk1, risk2 and risk3 (NOT the total of the three) are at the top?
For example, in my Desired_output, you can see that id == 4 simultaneously has the largest values on risk1, risk2 and risk3 (4,3,2).
For all other ids, there is a 1 or 0 on at least one of risk1, risk2 and risk3.
Note: Ties are fine. 4,3,2 == 2,3,4 == 3,2,4.
Data = data.frame(id=1:4,risk1 = c(1,3,5,4), risk2 = c(8,2,1,3), risk3 = c(0,1,4,2))
Desired_output = read.table(h=T,text="
id risk1 risk2 risk3
4 4 3 2
3 5 1 4
2 3 2 1
1 1 8 0
")
Maybe this helps - loop over the rows, sort the elements, paste, convert to numeric, use that to order the rows
Data[order(-apply(Data[-1], 1, \(x)
as.numeric(paste(sort(x), collapse = "")))),]
-output
id risk1 risk2 risk3
4 4 4 3 2
3 3 5 1 4
2 2 3 2 1
1 1 1 8 0
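One caveat: pasting the sorted values into a single number can misorder rows once any value has more than one digit or is not a whole number. The following is a sketch of an alternative that compares the sorted values column by column instead (smallest value first, then the next smallest, and so on). This is my reading of the ranking rule, not something the question states explicitly, and the helper names sorted_vals, cols and ord are made up for illustration.

```r
Data <- data.frame(id = 1:4, risk1 = c(1, 3, 5, 4),
                   risk2 = c(8, 2, 1, 3), risk3 = c(0, 1, 4, 2))

# Sort the risk values within each row (ascending); one row per id
sorted_vals <- t(apply(Data[-1], 1, sort))

# Turn the columns of the sorted matrix into a list of sort keys
cols <- lapply(seq_len(ncol(sorted_vals)), function(j) sorted_vals[, j])

# Order rows by smallest value, then second smallest, ..., all descending
ord <- do.call(order, c(cols, list(decreasing = TRUE)))
Data[ord, ]
```

On the sample data this gives the same row order as the desired output (ids 4, 3, 2, 1).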
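This reproduces the desired output for the sample data, but only because reversing the row order happens to give the desired ranking here; arrange(-row_number()) does not look at risk1, risk2 or risk3: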
library(dplyr)
Data %>%
arrange(-row_number())
id risk1 risk2 risk3
1 4 4 3 2
2 3 5 1 4
3 2 3 2 1
4 1 1 8 0

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see, the syntax and logic are quite similar between these two solutions:
1. Find the events that are equal to x
2. Extract their Sequence
3. Subset the table according to the extracted Sequence values
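The three steps can be written out one at a time in base R. This is only a sketch: the data frame d is rebuilt from the table in the question, and the intermediate names is_x and seqs are made up for illustration.

```r
d <- data.frame(
  Event    = c("a", "a", "x", "a", "a", "a", "a", "x", "a", "a"),
  Sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)

# Step 1: find the events that are equal to "x"
is_x <- d$Event == "x"

# Step 2: extract the Sequence values of those events
seqs <- unique(d$Sequence[is_x])

# Step 3: keep every row whose Sequence is one of those values
result <- d[d$Sequence %in% seqs, ]
result
```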
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Adding NA's where data is missing [duplicate]

I have a dataset that looks like the following:
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
Basically, id identifies the sample, cycle identifies the timepoint, and value holds the measurement at that timepoint.
As you can see, sample 3 has no cycle 2 data and sample 4 is missing cycle 1 and cycle 3 data. Is there a way, without writing a loop, to insert NAs wherever data is missing? I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT @jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
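As a quick check of the approach, the sketch below builds the grid from the values present in the data and then lists which id/cycle combinations were filled in with NA. The names filled and missing_rows are made up for illustration.

```r
dat <- data.frame(id    = c(1, 1, 1, 2, 2, 2, 3, 3, 4),
                  cycle = c(1, 2, 3, 1, 2, 3, 1, 3, 2),
                  value = 1:9)

# Every combination of the ids and cycles that occur in the data
grid_dat <- expand.grid(id = unique(dat$id), cycle = unique(dat$cycle))
nrow(grid_dat)  # 4 ids x 3 cycles

# Keep all grid rows; combinations absent from dat get value = NA
filled <- merge(grid_dat, dat, by = c("id", "cycle"), all.x = TRUE)

# The combinations that had no data in the original
missing_rows <- filled[is.na(filled$value), c("id", "cycle")]
missing_rows
```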
A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple days, but none of them work. The best I've come up with is
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {
sub <- subset(ds, ds$ZipCode == zip)
len <- length(sub)
maxIndex <- which.max(sub$Value1)
#do the aggregation
row <- aggregate(sub[3:27], FUN=sum, by=list(
CountyID = rep(sub$CountyID[maxIndex], len),
ZipCode = sub$ZipCode))
rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above), I keep getting the error "arguments must have the same length". I've messed around with rep() and fixed vectors (e.g. c(1,2,3,4)), but no matter what I do, the error persists. I also occasionally get an error to the effect of
cannot subset data of type 'closure'.
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.
You can try this:
data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10
In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Frank's data df, this would be
> cbind(aggregate(Value1 ~ ZipCode, df, sum),
CountyID = sapply(split(df, df$ZipCode), function(x) {
with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10
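A note on the loop in the question: length() applied to a data frame returns the number of columns, not rows, so rep(..., len) there produces a vector whose length does not match sub$ZipCode, which is a likely cause of the "arguments must have the same length" error. In addition, the result of rbind() is never assigned back to ds_temp. The following is a sketch of a per-ZipCode version that avoids both problems, using only the Value1 column from the sample data; the names pieces and result are made up for illustration.

```r
ds <- data.frame(CountyID = 1:10,
                 ZipCode  = c(1, 1, 1, 1, 2, 3, 4, 4, 5, 6),
                 Value1   = c(0, 3, 0, 1, 2, 3, 7, 2, 1, 0))

length(ds)  # number of columns (3), not rows
nrow(ds)    # number of rows (10)

# Build one summary row per ZipCode
pieces <- lapply(split(ds, ds$ZipCode), function(sub) {
  data.frame(
    CountyID = sub$CountyID[which.max(sub$Value1)],  # county with the highest Value1
    ZipCode  = sub$ZipCode[1],
    as.list(colSums(sub[-(1:2)]))                    # sum every remaining column
  )
})

# Stack the per-ZipCode rows into one data frame
result <- do.call(rbind, pieces)
result
```

With more ValueX columns present, colSums() would sum each of them in the same way, assuming they are all numeric.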

generate sequence of numbers in R according to other variables

I am having trouble generating a sequence of numbers based on two other variables.
Specifically, I have the following DB (my real DB is not so balanced!):
ID1=rep((1:1),20)
ID2=rep((2:2),20)
ID3=rep((3:3),20)
ID<-c(ID1,ID2,ID3)
DATE1=rep("2013-1-1",10)
DATE2=rep("2013-1-2",10)
DATE=c(DATE1,DATE2)
IN<-data.frame(ID,DATE=rep(DATE,3))
and I would like to generate a sequence of numbers that counts the observations within each ID and DATE, like this:
OUTPUT<-data.frame(ID,DATE=rep(DATE,3),N=rep(rep(seq(1:10),2),3))
Curiously, I tried the following solution, which works for the DB provided above but not for the real DB!
IN$UNIQUE<-with(IN,as.numeric(interaction(IN$ID,IN$DATE,drop=TRUE,lex.order=TRUE)))#generate unique value for the combination of id and date
PROG<-tapply(IN$DATE,IN$UNIQUE,seq)#generate the sequence
OUTPUT$SEQ<-c(sapply(PROG,"["))#concatenate the sequence in just one vector
Right now, I cannot understand why the solution doesn't work for the real DB. As always, any tips are greatly appreciated!
Here is an example (just one ID included) of the dataset:
id date
1 F2_G 2005-03-09
2 F2_G 2005-06-18
3 F2_G 2005-06-18
4 F2_G 2005-06-18
5 F2_G 2005-06-19
6 F2_G 2005-06-19
7 F2_G 2005-06-19
8 F2_G 2005-06-19
9 F2_G 2005-06-20
Here's one using ave:
OUT <- within(IN, {N <- ave(ID, list(ID, DATE), FUN=seq_along)})
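One thing to watch for: ave() returns a vector of the same type as its first argument, so with a character id column (as in the real data excerpt) the counter would come back as character. A sketch that sidesteps this by passing an integer index as the first argument, rebuilding IN from the question:

```r
IN <- data.frame(
  ID   = rep(1:3, each = 20),
  DATE = rep(rep(c("2013-1-1", "2013-1-2"), each = 10), 3)
)

# Use a plain integer index as the vector to be replaced, so the result is
# an integer counter regardless of the types of ID and DATE
IN$N <- ave(seq_along(IN$ID), IN$ID, IN$DATE, FUN = seq_along)
head(IN, 12)
```

Because the counter is computed within each ID/DATE group, this does not require the groups to be the same size.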
This should do what you want...
require(reshape2)
as.vector( apply( dcast( IN , ID ~ DATE , length )[,-1] , 1:2 , function(x)seq.int(x) ) )
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6
[27] 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2
[53] 3 4 5 6 7 8 9 10
Basically, we use dcast to get the number of observations by ID and date, like so:
dcast( IN , ID ~ DATE , length )
ID 2013-1-1 2013-1-2
1 1 10 10
2 2 10 10
3 3 10 10
Then we use apply across each cell to make a sequence of integers as long as the count of ID for each date. Finally we coerce back to a vector using as.vector.
