I have a dataframe data that looks like this:
> data
id var1 var2
1 1000 32 2.3
2 1000 34 2.5
3 1000 33 NA
4 1000 36 2.4
5 1001 32 3.1
6 1001 NA 2.5
7 1001 45 NA
8 1002 45 2.6
9 1002 37 NA
10 1002 33 3.1
11 1002 NA 3.3
As you can see, each ID has multiple observations (3-4 each). I want to add another variable (column), which acts like an index and numbers each observation within the ID. This is ideally what the dataframe would look like after adding the variable:
> data_goal
id var1 var2 index
1 1000 32 2.3 1
2 1000 34 2.5 2
3 1000 33 NA 3
4 1000 36 2.4 4
5 1001 32 3.1 1
6 1001 NA 2.5 2
7 1001 45 NA 3
8 1002 45 2.6 1
9 1002 37 NA 2
10 1002 33 3.1 3
11 1002 NA 3.3 4
What would be the best way to do this in R?
If it's relevant, my ultimate goal is to reshape the data into "wide" format for further analyses, but for that I need an index variable.
Here is a solution that uses dplyr:
# reproducing your data
data <- data.frame(rbind(c(1,1000,32,2.3), c(2,1000,34,2.5), c(3,1000,33,NA),
                         c(4,1000,36,2.4), c(5,1001,32,3.1), c(6,1001,NA,2.5),
                         c(7,1001,45,NA),  c(8,1002,45,2.6), c(9,1002,37,NA),
                         c(10,1002,33,3.1), c(11,1002,NA,3.3)))
colnames(data) <- c("row", "id", "var1", "var2")
library(dplyr)
# use pipes ( %>% ) to do this in a single line of code
data_goal <- data %>% group_by(id) %>% mutate(index = 1:n())
You can easily use dplyr to reshape the data too. Here is a resource if you are unfamiliar: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
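As a sketch of that reshaping step with the new index (this uses tidyr's pivot_wider, my choice here rather than something from the answer above; the resulting column names are illustrative):
library(tidyr)
# one row per id; var1/var2 are spread out by the within-id index
data_wide <- pivot_wider(data_goal,
                         id_cols = id,
                         names_from = index,
                         values_from = c(var1, var2))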
Or, with data.table:
library(data.table)
setDT(data)[, index := seq_len(.N), by = id]
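If you prefer to avoid extra packages, base R's ave() builds the same index (a minimal sketch):
# seq_along restarts the counter 1, 2, 3, ... within each id
data$index <- ave(seq_len(nrow(data)), data$id, FUN = seq_along)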
I have a large column with NAs. I want to rank the time values as shown below, keeping the NAs but excluding them from the ranking.
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=True)
I want to get the following table:
time Rank
40 3
30 4
50 2
NA NA
60 1
NA NA
20 5
I have used the following code:
df$Rank<--df$time,ties.method="mim")
# fixed data (header = TRUE, not True)
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=TRUE)
You can do something like
nonNaIndices <- !is.na(df$time)  # rows with a non-missing time
df$Rank <- NA
df$Rank[nonNaIndices] <- rank(-df$time[nonNaIndices], ties.method = "min")  # negate so the largest time ranks 1
resulting in
> df
time Rank
1 40 3
2 30 4
3 50 2
4 NA NA
5 60 1
6 NA NA
7 20 5
Note: Please make sure to check your question for missing function calls before submitting it. In your case it could be guessed from the context.
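For what it's worth, rank() can also keep the NAs on its own via na.last = "keep", which makes the helper index unnecessary (a minimal sketch of the same ranking):
# negative time so the largest value gets rank 1; NAs stay NA
df$Rank <- rank(-df$time, na.last = "keep", ties.method = "min")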
You can use dense_rank from dplyr (with tied values, min_rank would match ties.method = "min"):
library(dplyr)
df %>% mutate(Rank = dense_rank(-time))
# time Rank
#1 40 3
#2 30 4
#3 50 2
#4 NA NA
#5 60 1
#6 NA NA
#7 20 5
I am struggling to reshape this df into a different one. I have this:
ID task mean sd mode
1 0 2 10 1.5 223
2 0 2 21 2.4 213
3 0 2 24 4.3 232
4 1 3 26 2.2 121
5 1 3 29 1.3 433
6 1 3 12 2.3 456
7 2 4 45 4.3 422
8 2 4 67 5.3 443
9 2 4 34 2.1 432
and I would like to reshape it, discarding sd and mode and spreading the means across columns, like this:
ID task mean mean1 mean2
1 0 2 10 21 24
2 1 3 26 29 12
3 2 4 45 67 34
Thanks a lot for your help in advance
You first need to create a new column by which we can pivot the mean values. Using data.table, this approach works:
library(data.table)
dt <- data.table(df) # Convert to data.table
dcast(dt[, nr := seq_len(.N), by = ID],
      ID + task ~ nr,
      value.var = "mean")
# ID task 1 2 3
#1: 0 2 10 21 24
#2: 1 3 26 29 12
#3: 2 4 45 67 34
Afterwards, you can rename the columns to whatever you want them to be called.
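For example, with data.table's setnames, assuming the dcast result was saved to a variable (out is just an illustrative name):
out <- dcast(dt, ID + task ~ nr, value.var = "mean")  # dt already has nr from above
setnames(out, c("1", "2", "3"), c("mean", "mean1", "mean2"))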
A base R one-liner using reshape():
reshape(cbind(df, time = ave(df$ID, df$ID, FUN = seq_along)),
        direction = 'wide', idvar = c('ID', 'task'), drop = c('sd', 'mode'), sep = '')
## ID task mean1 mean2 mean3
## 1 0 2 10 21 24
## 4 1 3 26 29 12
## 7 2 4 45 67 34
Data
df <- data.frame(ID = c(0L,0L,0L,1L,1L,1L,2L,2L,2L),
                 task = c(2L,2L,2L,3L,3L,3L,4L,4L,4L),
                 mean = c(10L,21L,24L,26L,29L,12L,45L,67L,34L),
                 sd = c(1.5,2.4,4.3,2.2,1.3,2.3,4.3,5.3,2.1),
                 mode = c(223L,213L,232L,121L,433L,456L,422L,443L,432L))
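For completeness, a tidyr sketch of the same pivot (not part of the original answers; it builds the within-ID counter and spreads the means in one pipeline):
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  mutate(nr = row_number()) %>%   # within-ID counter, as above
  select(-sd, -mode) %>%
  pivot_wider(names_from = nr, values_from = mean, names_prefix = "mean")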
I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is for R to find every unique combination of two values (observations) in column "obs" within the same year, and to build a new matrix or data frame whose observations are aggregations of the originals. Order is not important, so 1+6 = 6+1. For instance, with 150 observations I would expect choose(150, 2) = 11,175 feasible combinations for each year.
I sort of got what I want with basic coding but, as you will see, it is way too long (I have built 66 different new data sets this way, which does not really make sense), and I am wondering how to shorten it. I made some attempts (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result I would get for the 1st year from the sample above. The NAs appear because I only computed pairs where both values were valid, and only for variables 1 and 3. I used the sum here, but it could be any other function:
order obs1 obs2 name var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
And here are the first 2 lines of the same type of matrix for the 3rd year:
order obs1 obs2 name var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I have explained myself. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
  p <- combn(nrow(x), 2)  # all pairs of row indices within this year
  data.frame(order = paste(x$order[p[1,]], x$order[p[2,]], sep = "_"),
             obs1 = x$obs[p[1,]],
             obs2 = x$obs[p[2,]],
             year = x$year[1],
             var1 = x$var1[p[1,]] + x$var1[p[2,]],
             var2 = x$var2[p[1,]] + x$var2[p[2,]],
             var3 = x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine data pairs of observations within a year --- x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
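For instance, here is a sketch that sums every var column at once instead of naming each one (same data, same pairing logic):
do.call(rbind, lapply(split(data, data$year), function(x) {
  p <- combn(nrow(x), 2)                  # all index pairs within the year
  vars <- c("var1", "var2", "var3")
  cbind(data.frame(obs1 = x$obs[p[1,]],
                   obs2 = x$obs[p[2,]],
                   year = x$year[1]),
        x[p[1,], vars] + x[p[2,], vars])  # element-wise sums for each pair
}))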
I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and an HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household: the one who spent the most hours watching television. In this subset I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the participants in the bigger data set (Healthdata) such that participants from the same house get the same score.
I can't simply merge these, because the data sets have different numbers of observations: 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want the required output to look something like the table below, i.e. IDs (participants) with the same HHID (house) should have the same SSE score. The following output is just an example of what I need; the seed above will not give the same values.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it will merge by the intersection of the column names:
merge(Healthdata, Healthdata.1, all.x = TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose which column to merge by:
merge(Healthdata, Healthdata.1, all.x = TRUE, by = 'ID')
You need to merge by HHID, not ID. Note this is somewhat confusing, because the IDs in the full data come from a different set than those in the subset, i.e. ID.x == 4 != ID.y == 4 (in fact, in this case they are in different households). Because of that I left both ID columns here to avoid ambiguity, but you can easily subset the result to show only the ID.x one:
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
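One way to avoid the ID.x/ID.y pair altogether is to merge in only the columns you need (a sketch; res is just an illustrative name):
# bring over just HHID and SSE, so Healthdata's ID stays the only ID
res <- merge(Healthdata, Healthdata.1[, c("HHID", "SSE")], by = "HHID")
res[order(res$ID), c("ID", "HHID", "SSE")]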
Or using plyr's join:
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")
I have two different datasets arranged in column format as follows:
Dataset 1:
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
Dataset 2:
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
The number of rows in each dataset varies, but both contain data for two samples (for example, column A holds 13 and 1 in both datasets). I want the C, D and E values of dataset 1 to be placed in dataset 2 for rows having the same values of A and B in both datasets, so the join should be based on A and B. I need to do this for about 47,560 rows.
I am new to R, so I would be thankful for code that also saves the new merged dataset.
Use the merge function in R.
Reference: http://www.statmethods.net/management/merging.html
Edit:
So first you'd need to read in the datasets; CSV is a good format.
> dataset1 <- read.csv(file = "dataset1.csv", header = TRUE, sep = ",")
> dataset2 <- read.csv(file = "dataset2.csv", header = TRUE, sep = ",")
If you just type the variable names now and hit enter you should see a read-out of your datasets. So...
> dataset1
should read out your data above. Then I believe the following should do the merge (I may be wrong):
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"))
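If some A/B pairs in dataset 2 have no match in dataset 1 and you want to keep those rows anyway, all.x = TRUE turns the merge into a left join (a sketch; dataset2_1 is an illustrative name):
> dataset2_1 <- merge(dataset2, dataset1, by=c("A","B"), all.x=TRUE)  # unmatched rows get NA in C, D and E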
Edit 2:
> write.table(dataset1_2, "c:/dataset1_2.txt", sep=" ")
Reference : http://www.statmethods.net/input/exportingdata.html
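write.csv is a common alternative if you would rather save a CSV (a sketch):
> write.csv(dataset1_2, "dataset1_2.csv", row.names=FALSE)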