I am struggling to reshape this df into a different layout. I have this:
ID task mean sd mode
1 0 2 10 1.5 223
2 0 2 21 2.4 213
3 0 2 24 4.3 232
4 1 3 26 2.2 121
5 1 3 29 1.3 433
6 1 3 12 2.3 456
7 2 4 45 4.3 422
8 2 4 67 5.3 443
9 2 4 34 2.1 432
and I would like to reshape it this way, discarding sd and mode and placing the means for each ID in a single row, like this:
ID task mean mean1 mean2
1 0 2 10 21 24
2 1 3 26 29 12
3 2 4 45 67 34
Thanks a lot for your help in advance
You first need to create a new column by which the mean values can be pivoted. Using data.table, this approach works:
library(data.table)
dt <- data.table(df) # Convert to data.table
dt[, nr := seq_len(.N), by = ID] # Number the rows within each ID
dcast(dt, ID + task ~ nr, value.var = "mean")
# ID task 1 2 3
#1: 0 2 10 21 24
#2: 1 3 26 29 12
#3: 2 4 45 67 34
Afterwards, you can rename the columns to whatever you want them to be called.
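# Base R: add a within-ID counter as the time variable, then reshape to wide
reshape(cbind(df, time = ave(df$ID, df$ID, FUN = seq_along)),
        direction = 'wide', idvar = c('ID', 'task'),
        drop = c('sd', 'mode'), sep = '')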
## ID task mean1 mean2 mean3
## 1 0 2 10 21 24
## 4 1 3 26 29 12
## 7 2 4 45 67 34
Data
df <- data.frame(ID=c(0L,0L,0L,1L,1L,1L,2L,2L,2L),
                 task=c(2L,2L,2L,3L,3L,3L,4L,4L,4L),
                 mean=c(10L,21L,24L,26L,29L,12L,45L,67L,34L),
                 sd=c(1.5,2.4,4.3,2.2,1.3,2.3,4.3,5.3,2.1),
                 mode=c(223L,213L,232L,121L,433L,456L,422L,443L,432L))
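For completeness, here is a minimal base R sketch that goes all the way to the column names requested in the question (mean, mean1, mean2). The helper column name time and the object name wide are my own choices, not part of the original answers, and the final rename assumes every ID has exactly three rows, as in the sample data.

```r
# Sample data from the question, without the sd and mode columns
df <- data.frame(ID   = rep(0:2, each = 3),
                 task = rep(2:4, each = 3),
                 mean = c(10L, 21L, 24L, 26L, 29L, 12L, 45L, 67L, 34L))

# Within-ID counter (1, 2, 3) used as the pivot variable
df$time <- ave(df$ID, df$ID, FUN = seq_along)

# Long to wide: one row per ID/task, one mean column per counter value
wide <- reshape(df, direction = "wide", idvar = c("ID", "task"),
                timevar = "time", sep = "")

# Rename mean1..mean3 to the names asked for in the question
names(wide) <- c("ID", "task", "mean", "mean1", "mean2")
rownames(wide) <- NULL
wide
```

If groups can have different sizes, the positional rename on the last step would need to be adapted.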
I'm cleaning a dataset, but its layout is not ideal and I need to reshape it. I don't know how. This is the original data frame:
Rater Rater ID Ratee1 Ratee2 Ratee3 Ratee1.item1 Ratee1.item2 Ratee2.item1 Ratee2.item2 Ratee3.item1 Ratee3.item2
A 12 701 702 800 1 2 3 4 5 6
B 23 45 46 49 3 3 3 3 3 3
C 24 80 81 28 2 3 4 5 6 9
I am wondering how to reshape it into the following:
Rater Rater ID Ratee item1 item2
A 12 701 1 2
A 12 702 3 4
A 12 800 5 6
B 23 45 3 3
B 23 46 3 3
B 23 49 3 3
C 24 80 2 3
C 24 81 4 5
C 24 28 6 9
This reshaping is a little different from this one (Reshaping data.frame from wide to long format), because I have three parts in the original data.
The first part is the rater's ID (Rater and Rater ID).
The second part is the ratees' IDs (Ratee1, Ratee2, Ratee3).
The third part is the rater's rating of each ratee (Ratee*.item1 and Ratee*.item2).
To make this clearer, let me briefly describe the data collection process.
First, a rater types in his own name and ID,
then nominates three persons (Ratee1 to Ratee3),
and then answers the questions about each ratee (there are two questions per ratee).
Does anyone know how to reshape this? Thanks!
We can use melt from data.table:
library(data.table)
melt(setDT(df1),
     measure.vars = patterns("^Ratee\\d+$", "^Ratee\\d+\\.item1", "^Ratee\\d+\\.item2"),
     value.name = c("Ratee", "item1", "item2"))[, variable := NULL][order(Rater)]
# Rater RaterID Ratee item1 item2
#1: A 12 701 1 2
#2: A 12 702 3 4
#3: A 12 800 5 6
#4: B 23 45 3 3
#5: B 23 46 3 3
#6: B 23 49 3 3
#7: C 24 80 2 3
#8: C 24 81 4 5
#9: C 24 28 6 9
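If data.table is not available, the same three-part reshape can be sketched with base R's reshape(), passing varying as a list with one element per output column. The column name RaterID (no space) and the helper column slot are assumptions of mine; the data are typed in from the question.

```r
# Sample data from the question (RaterID spelled without a space)
df1 <- data.frame(
  Rater = c("A", "B", "C"), RaterID = c(12, 23, 24),
  Ratee1 = c(701, 45, 80), Ratee2 = c(702, 46, 81), Ratee3 = c(800, 49, 28),
  Ratee1.item1 = c(1, 3, 2), Ratee1.item2 = c(2, 3, 3),
  Ratee2.item1 = c(3, 3, 4), Ratee2.item2 = c(4, 3, 5),
  Ratee3.item1 = c(5, 3, 6), Ratee3.item2 = c(6, 3, 9)
)

# Each list element names the wide columns that stack into one long column
long <- reshape(
  df1, direction = "long",
  varying = list(c("Ratee1", "Ratee2", "Ratee3"),
                 c("Ratee1.item1", "Ratee2.item1", "Ratee3.item1"),
                 c("Ratee1.item2", "Ratee2.item2", "Ratee3.item2")),
  v.names = c("Ratee", "item1", "item2"),
  idvar = c("Rater", "RaterID"),
  timevar = "slot"   # hypothetical name for the ratee position (1, 2, 3)
)

# Order by rater, then by ratee position, and drop the helper column
long <- long[order(long$Rater, long$slot),
             c("Rater", "RaterID", "Ratee", "item1", "item2")]
rownames(long) <- NULL
long
```

The order of the vectors inside varying matters: position k of every vector must refer to the same ratee.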
Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements whose index is greater than or equal to that of the maximum, and then reshape the result into a long data.frame:
df0 = aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)], simplify = FALSE)
data.frame(group=rep(df0$group, lengths(df0$days)),
days=unlist(df0$days, use.names=FALSE))
leading to (the values differ from the question's because df is generated with runif and no seed is set)
group days
1 1 84
2 1 31
3 1 65
4 1 23
5 2 94
6 2 69
7 2 45
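A more compact base R variant, sketched here with ave() and the exact values printed in the question, builds a keep/drop flag per row and subsets once. The names keep and df2 are just illustrative.

```r
# Values copied from the question's printed data frame
df <- data.frame(group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 days  = c(54, 61, 31, 52, 21, 22, 18, 50, 46, 35))

# Within each group, flag rows at or after the position of the maximum.
# ave() returns the flags as 0/1 numbers, hence the comparison with 1.
keep <- ave(df$days, df$group,
            FUN = function(x) seq_along(x) >= which.max(x)) == 1

df2 <- df[keep, ]
df2
```

Because which.max() returns the first maximum, ties are resolved in favour of the earliest row, which keeps the most rows.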
You can use which.max to find out the index of the maximum of the days and then use slice from dplyr to select all the rows after that, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
The data.table syntax is similar; .N plays the role of n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
I am trying to rank multiple numeric variables (700+ variables) in my data and am not sure how to do this, as I am still fairly new to R.
I do not want to overwrite the original values with the ranks, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions, along with rank, may be able to solve this. I tried implementing this as shown below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit, and ysales, the output dataset needs to contain the variables xcount_rank, xvisit_rank, and ysales_rank holding the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as intervals such as (101,230], (230,450], etc., whereas I would like the rank variables to be populated with 1, 2, and so on, up to 10 categories, matching the splits I made. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like each record to show the number of the interval (group) it falls into, rather than the interval label itself.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
# across() replaces mutate_each()/funs(), which are deprecated in current dplyr
input[nm1] <- mutate(input[2:4], across(everything(), ~ rank(.x, ties.method = "first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate(input[2:4], across(everything(), ~ cut(.x, breaks = 3, labels = FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
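The same result can be sketched in base R without any packages, since the key step is simply labels = FALSE in cut(), which returns the integer bin number instead of the interval label. This uses the updated sample data and the _rank suffix from the question; the variable name num_cols is my own.

```r
# Updated sample data from the question
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

# Every column except id gets a companion <name>_rank column
num_cols <- setdiff(names(input), "id")

# labels = FALSE makes cut() return bin numbers 1..3 rather than "(a,b]" labels
input[paste(num_cols, "rank", sep = "_")] <-
  lapply(input[num_cols], cut, breaks = 3, labels = FALSE)
input
```

With 700+ variables, only num_cols needs to change. Note that cut() with breaks = 3 makes equal-width bins, so bins can be very unevenly filled. If equal-sized groups are wanted instead, quantile-based breaks would be the thing to look at.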
I am running simulations and am trying to add error to a column repeatedly, specifically to the column titled Ao. In my output, the first 30 rows are correct: we have the initial data and the first year of altered data (error added to Ao). After that, where I would like to have 30 years of added error, I get repeats of Year 2 for Ao up to Year 30. My goal is to add error after each year of sampling, i.e. Year 2 is Year 1 Ao + error, Year 3 is Year 2 Ao + error, and so on. Can anyone help? Cheers.
for(t in 1:30){
Error<-rnorm(1000,0,1)
m<-rep(year1data$m,30)
r<-rep(year1data$r,30)
a<-rep(year1data$a,30)
g<-rep(year1data$g,30)
Year<-rep(2:31, each=TotSpecies)
Species<-1:TotSpecies
Ao<-year1data$Ao+sample(Error,TotSpecies,replace=FALSE)
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
TotSpeciesdata<-rbind(year1data,TotSpeciesdata)
}
> TotSpeciesdata
Species Year Ao m r a g
1 1 1 25.770783 43 119.110786 3.2305180 2.6526471
2 2 1 53.908914 138 161.894541 0.7342070 0.1151602
3 3 1 2.010732 226 193.820489 2.2890904 3.6248105
4 4 1 23.742254 332 17.315335 1.4009572 2.0037931
5 5 1 4.291080 63 187.591209 0.2563995 2.1553908
6 6 1 4.691113 343 116.267867 0.3899113 3.3950085
7 7 1 604.133044 224 132.240197 3.0410743 0.7985524
8 8 1 13.332567 166 5.367118 0.7921644 1.7861011
9 9 1 3.759268 141 212.340970 2.8733737 2.7123141
10 10 1 3.647390 209 259.400858 0.1249936 0.6594659
11 11 1 23.731109 10 114.171147 2.2437372 0.9867591
12 12 1 85.116996 69 167.412993 0.8306823 2.8905148
13 13 1 31.684280 277 177.025460 2.7618332 2.9245554
14 14 1 30.657523 205 21.710438 2.7661347 1.5911379
15 15 1 12.240410 85 210.121109 2.8827455 3.0418454
16 1 2 27.038097 43 119.110786 3.2305180 2.6526471
17 2 2 54.251600 138 161.894541 0.7342070 0.1151602
18 3 2 2.010636 226 193.820489 2.2890904 3.6248105
19 4 2 22.699369 332 17.315335 1.4009572 2.0037931
20 5 2 4.542589 63 187.591209 0.2563995 2.1553908
21 6 2 3.607833 343 116.267867 0.3899113 3.3950085
22 7 2 604.480756 224 132.240197 3.0410743 0.7985524
23 8 2 13.663513 166 5.367118 0.7921644 1.7861011
24 9 2 2.138715 141 212.340970 2.8733737 2.7123141
25 10 2 3.642769 209 259.400858 0.1249936 0.6594659
26 11 2 22.897993 10 114.171147 2.2437372 0.9867591
27 12 2 85.490897 69 167.412993 0.8306823 2.8905148
28 13 2 31.689202 277 177.025460 2.7618332 2.9245554
29 14 2 30.644419 205 21.710438 2.7661347 1.5911379
30 15 2 12.050207 85 210.121109 2.8827455 3.0418454
31 1 3 27.038097 43 119.110786 3.2305180 2.6526471
32 2 3 54.251600 138 161.894541 0.7342070 0.1151602
33 3 3 2.010636 226 193.820489 2.2890904 3.6248105
34 4 3 22.699369 332 17.315335 1.4009572 2.0037931
35 5 3 4.542589 63 187.591209 0.2563995 2.1553908
36 6 3 3.607833 343 116.267867 0.3899113 3.3950085
37 7 3 604.480756 224 132.240197 3.0410743 0.7985524
38 8 3 13.663513 166 5.367118 0.7921644 1.7861011
39 9 3 2.138715 141 212.340970 2.8733737 2.7123141
40 10 3 3.642769 209 259.400858 0.1249936 0.6594659
41 11 3 22.897993 10 114.171147 2.2437372 0.9867591
42 12 3 85.490897 69 167.412993 0.8306823 2.8905148
43 13 3 31.689202 277 177.025460 2.7618332 2.9245554
44 14 3 30.644419 205 21.710438 2.7661347 1.5911379
45 15 3 12.050207 85 210.121109 2.8827455 3.0418454
The main problem you have with your approach is the line:
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
The problem is that Year is a vector of length 30 * TotSpecies, while all the other columns are only TotSpecies long. In effect, you are recycling every column except Year 30 times when you create the data frame, which leads to the year 2 data being repeated 30 times, among other things. If you just use Year <- rep(t + 1, TotSpecies), I think your logic will work fine. That said, here is an alternate approach:
This will, for each species, create an incrementing random walk starting with Ao for that species for 5 years (just did that for display purposes):
set.seed(1)
year1data <- data.frame(species=1:10, year=1, Ao=runif(10, 1, 700))
TotSpeciesData <- do.call(
rbind,
lapply(
split(year1data, year1data$species),
function(data)
with(
data,
data.frame(species=species, year=year + 0:5, Ao=c(Ao, Ao + cumsum(rnorm(5))) # one row per year, 1 to 6
) ) ) )
head(TotSpeciesData, 15)
Note I excluded columns m-g since they don't seem directly relevant to your particular question, but you can add them relatively easily. I also only did 5 years in addition to year 1 so you can see the results here, but that is also easy to change:
species year Ao
1.1 1 1 186.5906
1.2 1 2 185.7701
1.3 1 3 186.2575
1.4 1 4 186.9958
1.5 1 5 187.5716
1.6 1 6 187.2662
2.1 2 1 261.1146
2.2 2 2 262.6264
2.3 2 3 263.0162
2.4 2 4 262.3950
2.5 2 5 260.1803
2.6 2 6 261.3052
3.1 3 1 401.4245
3.2 3 2 401.3796
3.3 3 3 401.3634
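If you prefer to stay close to the loop in the question, the stated goal (each year's Ao is the previous year's Ao plus fresh error) could be sketched as below. The starting values, the seed, and the names n_years, years, prev and nxt are made up for illustration; only Species, Year, Ao and TotSpecies come from the question, and the m, r, a, g columns are left out.

```r
set.seed(42)
TotSpecies <- 15   # number of species, as in the question's output
n_years <- 30      # years to simulate after year 1

# Hypothetical year-1 data; the real year1data has more columns
year1data <- data.frame(Species = seq_len(TotSpecies), Year = 1,
                        Ao = runif(TotSpecies, 1, 100))

# Keep each year in a list so that year t + 1 can start from year t
years <- vector("list", n_years + 1)
years[[1]] <- year1data
for (t in seq_len(n_years)) {
  prev <- years[[t]]
  nxt <- prev
  nxt$Year <- t + 1
  nxt$Ao <- prev$Ao + rnorm(TotSpecies, 0, 1)  # fresh error added to last year's Ao
  years[[t + 1]] <- nxt
}

# Bind once at the end instead of growing the data frame inside the loop
TotSpeciesdata <- do.call(rbind, years)
```

The two differences from the original loop are that the error is added to the previous year's values rather than to year1data$Ao every time, and that the results of earlier iterations are kept instead of being overwritten.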
It has been pointed out that the code you provided above, or at least my edited version of it, repeats itself every 15 years rather than producing unique values year by year in a step-wise fashion. I edited it as shown below:
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
When I do this code:
TotSpeciesData[TotSpeciesData$Species==1 & TotSpeciesData$Year %in% c(1,2,16,17),]
I end up with an output showing that the data is repeating itself.
Species Year Ao m r a g
1.1 1 1 48.49161 239 332.9625 3.791778 2.723104
1.2 1 2 49.62851 239 332.9625 3.791778 2.723104
1.16 1 16 48.49161 239 332.9625 3.791778 2.723104
1.17 1 17 49.62851 239 332.9625 3.791778 2.723104
Any comments toward this?
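One likely cause, sketched below under the assumption that Community is 30 and TotSpecies is 15: the Ao vector built with rnorm(TotSpecies - 1, ...) has only 15 elements, while Year = 1:Community has 30, and data.frame() silently recycles the shorter vector when its length divides the longer one evenly. That would make year 16 a copy of year 1, year 17 a copy of year 2, and so on. The names Ao0, walk_short, walk_full, recycled and fixed are illustrative only.

```r
set.seed(1)
Community <- 30    # assumed number of years
TotSpecies <- 15   # assumed number of species
Ao0 <- 48.5        # hypothetical starting value for one species

# Random walk with only TotSpecies values, as in the edited code
walk_short <- c(Ao0, Ao0 + cumsum(rnorm(TotSpecies - 1, 0, 2)))

# 15 values against 30 years: the walk is recycled without any warning
recycled <- data.frame(Year = 1:Community, Ao = walk_short)
recycled[c(1, 2, 16, 17), ]

# Drawing one step per additional year gives a walk of the right length
walk_full <- c(Ao0, Ao0 + cumsum(rnorm(Community - 1, 0, 2)))
fixed <- data.frame(Year = 1:Community, Ao = walk_full)
```

If that is what is happening, replacing TotSpecies - 1 with Community - 1 inside rnorm() should remove the repetition.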
I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal to follow specific rules. The variables below are id, the sex of the household head (1 = male, 2 = female) and the age of the household head. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
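The two steps can be wrapped into a small helper and checked against the rules. This is only a sketch: the function name keep_head and the object name result are made up here, and the data are the sample values from the question.

```r
data <- data.frame(
  id  = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10),
  sex = c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1),
  age = c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45)
)

# Sort so the preferred head comes first within each id
# (male before female, then oldest first), then keep the first row per id
keep_head <- function(d) {
  d <- d[order(d$id, d$sex, -d$age), ]
  d[!duplicated(d$id), ]
}

result <- keep_head(data)
result
```

Sorting on sex before age is what gives the male head priority regardless of age, as in household 2, where the 34-year-old male is kept over the 54-year-old female.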
With data.table, this is easy with "compound queries". To get the ordering right, set the key to "id,sex" when you create the data.table (this is required in case any female rows come before male rows for a given id).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45