The data I have is similar to the data below.
A=01-03
B=04-06
C=07-09
D=10-11
data<-read.table (text=" ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header=TRUE)
I want to create 2 columns right after the Class column, Bin and Zero, based on A, B, C and D and the IDs.
A goes to the first run of IDs 1, 2 and 3; B goes to the next run of IDs 1, 2 and 3; C goes to the run after that, and so on. The Zero column contains only zeros. So the outcome would be:
ID Class Bin Zero Time1 Time2 Time3
1 1 01-03 0 1 3 3
2 1 01-03 0 4 3 2
3 1 01-03 0 2 2 2
1 2 04-06 0 1 4 1
2 3 04-06 0 2 1 1
3 2 04-06 0 3 2 3
1 3 07-09 0 1 1 2
2 2 07-09 0 4 3 1
3 3 07-09 0 3 2 1
1 1 10-11 0 4 3 2
2 1 10-11 0 2 2 2
3 2 10-11 0 1 4 1
Please try the code below.
library(tidyverse)
#use character vector with quotes
A='01-03'
B='04-06'
C='07-09'
D='10-11'
data<-read.table (text=" ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header=TRUE)
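#create a separate dataframe with the Bin column (each label repeated for 3 rows)
data2 <- data.frame(Bin = rep(c(A, B, C, D), each = 3))
#bind it on, add Zero, and move both new columns right after Class
data3 <- bind_cols(data, data2) %>%
mutate(Zero = 0) %>%
relocate(Bin, Zero, .after = Class)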
If you are open to a dplyr-based solution (with A, B, C and D defined as quoted strings, e.g. A = '01-03'), you could use
library(dplyr)
data %>%
group_by(ID) %>%
mutate(Bin = c(A, B, C, D),
Zero = 0,
.after = 2) %>%
ungroup()
This returns
# A tibble: 12 × 7
ID Class Bin Zero Time1 Time2 Time3
<int> <int> <chr> <dbl> <int> <int> <int>
1 1 1 01-03 0 1 3 3
2 2 1 01-03 0 4 3 2
3 3 1 01-03 0 2 2 2
4 1 2 04-06 0 1 4 1
5 2 3 04-06 0 2 1 1
6 3 2 04-06 0 3 2 3
7 1 3 07-09 0 1 1 2
8 2 2 07-09 0 4 3 1
9 3 3 07-09 0 3 2 1
10 1 1 10-11 0 4 3 2
11 2 1 10-11 0 2 2 2
12 3 2 10-11 0 1 4 1
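Both answers above rely on every block having exactly three rows. As a minimal base R sketch, assuming instead that a new block starts whenever ID resets to 1, the block index can be derived from the data itself. The names bins, block and result are made up for this illustration.

```r
# Bin labels as character strings, assumed to be listed in block order
bins <- c("01-03", "04-06", "07-09", "10-11")

data <- read.table(text = "ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header = TRUE)

# A new block starts each time ID is 1, so the running count of 1s is the block number
block <- cumsum(data$ID == 1)

# Rebuild the data frame with Bin and Zero placed right after the first two columns
result <- data.frame(data[1:2], Bin = bins[block], Zero = 0, data[-(1:2)])
result
```

This would fail to label rows correctly if a block did not begin with ID 1, so it is only a sketch for data shaped like the example.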
In my toy data, within each unique study, the values of the numeric variables (sample and group) must be numbered consecutively starting from 1. But that is not always the case:
For example, in study 1, we see that there are two unique sample values (1 & 3), so 3 must be replaced with 2.
For example, in study 2, we see that there is one unique group value (2), so it must be replaced with 1.
In study 3, both sample and group seem ok meaning their unique values are 1 and 2 (no replacing needed).
For this toy data, my desired output is shown below. But I appreciate a functional solution that can automatically replace any number of numeric variables in a data.frame that have lost their order just like I showed in my toy data.
m="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 3 1 A
1 3 1 B
1 3 2 A
1 3 2 B
2 1 2 A
2 1 2 B
2 2 2 A
2 2 2 B
2 3 2 A
2 3 2 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
data <- read.table(text=m, h=T)
Desired_output="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 2 1 A
1 2 1 B
1 2 2 A
1 2 2 B
2 1 1 A
2 1 1 B
2 2 1 A
2 2 1 B
2 3 1 A
2 3 1 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
You can do:
library(dplyr)
data %>%
group_by(study) %>%
mutate(across(tidyselect::vars_select_helpers$where(is.numeric),
function(x) as.numeric(as.factor(x)))) %>%
as.data.frame()
The resultant data frame looks like this:
study sample group outcome
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
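The as.numeric(as.factor(x)) step works because factor levels are the sorted distinct values, so each value is replaced by its rank among them. A base R sketch of the same idea, assuming a small made-up data frame called toy and a helper named renumber (neither is from the question):

```r
# Small stand-in data with the same kind of gaps (values are illustrative)
toy <- data.frame(
  study   = c(1, 1, 1, 1, 2, 2),
  sample  = c(1, 1, 3, 3, 1, 2),
  group   = c(1, 2, 1, 2, 2, 2),
  outcome = c("A", "B", "A", "B", "A", "B")
)

# Replace each value by its position among the sorted distinct values: 1, 2, 3, ...
renumber <- function(x) match(x, sort(unique(x)))

# Every numeric column except the grouping column is renumbered within each study
num_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], "study")
toy[num_cols] <- lapply(toy[num_cols], function(col) ave(col, toy$study, FUN = renumber))
toy
```

Because the columns are picked with is.numeric, the sketch should extend to any number of numeric variables, provided the grouping column is excluded by name.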
Here is an alternative dplyr solution (not as elegant as @Allan Cameron's, +1):
library(dplyr)
data %>%
group_by(study) %>%
mutate(x = n()/length(unique(sample)),
sample = rep(row_number(), each=x, length.out = n()),
y = length(unique(group)),
group = ifelse(y==1, 1, group)) %>%
select(-x, -y)
study sample group outcome
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable to take the value 1 in every row of a cross section after the first 1 appears for that cross section. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax, which replaces each value with the running maximum within the group:
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
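cummax only gives the intended result when the rows of each id are in time order, since the running maximum follows row order. A minimal sketch, assuming the rows might arrive unsorted, that sorts first and then applies the same base R idea:

```r
# Rebuild the example panel: 5 ids observed at 3 time points each
df <- data.frame(
  id    = rep(1:5, each = 3),
  time  = rep(1:3, times = 5),
  dummy = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0)
)

# Sort by id and time so the running maximum moves forward in time
df <- df[order(df$id, df$time), ]

# Within each id, once a 1 has appeared the running maximum stays at 1
df$dummy <- ave(df$dummy, df$id, FUN = cummax)
df
```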
I have a data frame that looks like this:
Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2
As you can see, this data frame has 6 different combinations of the N and S columns, with 8 consecutive rows for each combination. I want to create a new data frame in which one row from each combination (be it 3 & 1 or 1 & 2) is randomly selected and placed in the new data frame, so that the result has 8 consecutive blocks, each containing one row from every combination. That way all 48 rows are completely reorganized. Is this possible in R?
Edit: The desired output would be something like this, repeated until all 48 rows are filled. The subject for each row would have to be random, because each row is randomly selected from its N & S combination.
Subject N S
3 1
1 1
3 2
1 2
2 2
2 1
2 2
3 2
2 1
1 1
3 1
1 2
A solution using functions from dplyr.
# Load package
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Process the data
dt2 <- dt %>%
group_by(N, S) %>%
sample_n(size = 1)
# View the result
dt2
## A tibble: 6 x 3
## Groups: N, S [6]
# Subject N S
# <chr> <int> <int>
#1 Sub6-3 1 1
#2 Sub5-1 1 2
#3 Sub1-5 2 1
#4 Sub5-8 2 2
#5 Sub2-4 3 1
#6 Sub3-1 3 2
Update: Reorganize the row
The following randomizes the order of all rows.
dt3 <- dt %>% slice(sample(1:n(), n()))
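If the goal is the block layout in the question's edit (each block of 6 rows holding one randomly chosen row from every N & S combination), a full shuffle does not guarantee it. One possible base R sketch, using a made-up stand-in for dt with hypothetical Subject labels, is to give each row a random block number within its combination and then sort by block:

```r
set.seed(123)

# Stand-in for dt: 6 N/S combinations with 8 rows each (Subject labels are made up)
dt <- data.frame(
  Subject = paste0("Sub", 1:48),
  N = rep(c(3, 3, 1, 1, 2, 2), each = 8),
  S = rep(c(1, 2, 1, 2, 1, 2), each = 8)
)

# Within each N/S combination, assign the rows a random permutation of 1..8
block <- ave(seq_len(nrow(dt)), dt$N, dt$S, FUN = function(i) sample(length(i)))

# Sort by block; the random tie-breaker shuffles the combinations inside each block
dt4 <- dt[order(block, runif(nrow(dt))), ]
rownames(dt4) <- NULL
head(dt4, 12)
```

The names block and dt4 are assumptions for this illustration. The sketch relies on every combination having the same number of rows; with unequal group sizes the later blocks would be incomplete.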
Data Preparation
dt <- read.table(text = "Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2",
header = TRUE, stringsAsFactors = FALSE)
I have a dataframe that has survey response items (scale 1-4). This is what the data looks like for the first 10 respondents:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n
1 1 2 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1
3 2 1 1 1 1 1 1 2 2
4 4 4 2 2 3 3 4 4 3
5 1 1 1 1 1 1 1 2 1
6 4 4 4 3 4 4 2 4 4
7 3 3 4 3 3 3 4 4 3
8 3 3 2 2 4 2 3 3 2
9 1 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
I fit a graded response model to the data and now have theta hats for each response pattern. There are 901 observations in the raw data but only 547 rows of theta.hat, because there is a single theta.hat for each observed response pattern; e.g., a score of '1' across all items appears 94 times. The theta.hat dataframe looks like this:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
1 1 1 1 1 1 1 1 1 1 94 -1.307
2 1 1 1 1 1 1 1 1 2 10 -.816
3 1 1 1 1 1 1 1 1 4 1 -0.750
4 1 1 1 1 1 1 1 2 1 22 -.803
5 1 1 1 1 1 1 1 2 2 6 -.524
What I am trying to do is merge the theta.hats with the original data. This seems to require matching the response patterns across two datasets. So, for example, line 10 in the raw data (with all '1's) would receive a theta hat of -1.307 because it matched the response pattern in line 1 of the theta matrix. Both datasets are structured so each variable is a numeric column.
I'm not sure how to send a reproducible dataset for this case, but am happy to if you have suggestions.
Thank you,
Andrea
How about a simple merge? Assuming your first dataset (responses) is assigned to df.1 and the second dataset (modeled with theta) is assigned to df.2:
merge(df.1, df.2, by = names(df.1), all.x = TRUE)
# Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
# 1 1 1 1 1 1 1 1 1 1 94 -1.307
# 2 1 1 1 1 1 1 1 1 1 94 -1.307
# 3 1 1 1 1 1 1 1 1 1 94 -1.307
# 4 1 1 1 1 1 1 1 2 1 22 -0.803
# 5 1 2 1 1 1 1 1 1 1 NA NA
# 6 2 1 1 1 1 1 1 2 2 NA NA
# 7 3 3 2 2 4 2 3 3 2 NA NA
# 8 3 3 4 3 3 3 4 4 3 NA NA
# 9 4 4 2 2 3 3 4 4 3 NA NA
# 10 4 4 4 3 4 4 2 4 4 NA NA
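Note that merge sorts its result by the key columns, which is why the rows above are no longer in the original respondent order. A minimal sketch that restores the order, assuming two tiny stand-in data frames with shortened, made-up column names (q1, q2) and a helper column row_id that is not in the original data:

```r
# Toy stand-ins for the raw responses and the theta table (values are illustrative)
df.1 <- data.frame(q1 = c(1, 2, 1, 1), q2 = c(1, 1, 2, 1))
df.2 <- data.frame(q1 = c(1, 1), q2 = c(1, 2), Theta = c(-1.307, -0.803))

# Item columns shared by both data frames are the merge keys
item_cols <- setdiff(names(df.2), "Theta")

# Record the original row order before merging
df.1$row_id <- seq_len(nrow(df.1))

# Left join: keep every respondent, NA where the pattern has no theta
merged <- merge(df.1, df.2, by = item_cols, all.x = TRUE)

# Put the rows back in respondent order and drop the helper column
merged <- merged[order(merged$row_id), ]
merged$row_id <- NULL
merged
```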
The dataframe looks like this:
Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1
10000004 1 2 3 1 2 3 1
10000004 1 3 2 3 1 3 2
10000004 1 3 2 1 3 2 1
10000004 1 4 1 4 1 3 1
10000006 1 2 3 4 5 1 2
10000006 1 3 1 4 1 2 1
10000008 2 3 2 3 2 1 2
10000008 2 3 1 1 2 1 2
10000008 1 3 1 1 2 2 1
There are multiple entries for each customer_id. I need to create another data frame from this existing data frame. The new data frame should contain only the last row for every customer_id. It should look like this:
10000001 1 2 3 1 2 1 3
10000002 2 2 1 4 2 3 1
10000003 1 1 2 2 1 2 1
10000004 1 4 1 4 1 3 1
10000006 1 3 1 4 1 2 1
10000008 1 3 1 1 2 2 1
Something like this (hard to code without the data in R format):
dataframe[ rev(!duplicated(rev(dataframe$Customer_id))),]
or better
dataframe[ !duplicated(dataframe$Customer_id,fromLast=TRUE),]
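The fromLast = TRUE version works because duplicated() then scans from the bottom, so the last row of each Customer_id is the one not flagged as a duplicate. A small sketch on a cut-down, made-up version of the data (three columns only; is_last and last_rows are names chosen for this illustration):

```r
# Cut-down stand-in for the customer data
dataframe <- data.frame(
  Customer_id = c(10000001, 10000001, 10000002, 10000002, 10000003),
  A = c(1, 1, 2, 2, 1),
  B = c(1, 2, 2, 2, 5)
)

# TRUE only for the last occurrence of each Customer_id
is_last <- !duplicated(dataframe$Customer_id, fromLast = TRUE)

# Keep those rows; original row names show where they came from
last_rows <- dataframe[is_last, ]
last_rows
```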
You can also use aggregate
aggregate(. ~ Customer_id, data = DF, FUN = tail, 1)
## Customer_id A B C D E F G
## 1 10000001 1 2 3 1 2 1 3
## 2 10000002 2 2 1 4 2 3 1
## 3 10000003 1 1 2 2 1 2 1
## 4 10000004 1 4 1 4 1 3 1
## 5 10000006 1 3 1 4 1 2 1
## 6 10000008 1 3 1 1 2 2 1
Assuming your data is named dat, here's one way using by and rbind, although the other two methods (aggregate and duplicated) are much nicer:
> do.call(rbind, by(dat,dat$Customer_id,FUN=tail,1))
## Customer_id A B C D E F G
## 2 10000001 1 2 3 1 2 1 3
## 4 10000002 2 2 1 4 2 3 1
## 7 10000003 1 1 2 2 1 2 1
## 11 10000004 1 4 1 4 1 3 1
## 13 10000006 1 3 1 4 1 2 1
## 16 10000008 1 3 1 1 2 2 1