I am new to R. I have a data frame containing 3 columns.
The first one shows the ID; each household has a unique ID. The second column shows the relationship (1 for father, 2 for mother and 3 for child). The third column shows their age.
Now I want to know how many twins there are in each family (twins are children of the same family who have the same age).
My data frame:
Id relationship age
1001 1 60
1001 2 50
1001 3 20
1002 1 70
1002 2 68
1002 3 23
1002 3 27
1002 3 27
1002 3 23
1003 1 60
1003 2 40
1003 3 20
1003 3 20
result:
id twins
1001 0
1002 2
1003 1
Here's a base R solution using aggregate:
> aggregate(age ~ Id, function(x) sum(duplicated(x)), data=df[df[,2]==3, ])
Id age
1 1001 0
2 1002 2
3 1003 1
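For anyone who wants to run the aggregate approach end to end, here is a minimal self-contained sketch. It assumes the question's table is stored in a data frame called df (the name used in the answer above); the children helper and the twins column name are introduced here purely for illustration. Note that sum(duplicated(x)) counts every additional child sharing an age, so triplets would be counted as 2, and households without any children would not appear in the result.

```r
# Rebuild the example data from the question
df <- read.table(header = TRUE, text = "
Id relationship age
1001 1 60
1001 2 50
1001 3 20
1002 1 70
1002 2 68
1002 3 23
1002 3 27
1002 3 27
1002 3 23
1003 1 60
1003 2 40
1003 3 20
1003 3 20
")

# Keep only the children (relationship == 3)
children <- df[df$relationship == 3, ]

# Per household, count children whose age repeats an earlier child's age
twins <- aggregate(age ~ Id, data = children,
                   FUN = function(x) sum(duplicated(x)))
names(twins)[2] <- "twins"  # illustrative column name, not from the answer
print(twins)
```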
It's a little difficult to attempt these without a working example. You can use dput() to create one. ... but I think this should work.
library(plyr)
df= df[df$relationship==3,]
ddply(df, .(Id, age), nrow)
Or rather, that gives the number of children of each age per family (not just twins). Subtracting 1 from each count and summing per family gives the twins:
almost <- ddply(df[df$relationship==3,], .(Id,age), function(x) nrow(x)-1)
aggregate(almost$V1, list(almost$Id), FUN =sum )
# Group.1 x
#1 1001 0
#2 1002 2
#3 1003 1
I have a dataframe data that looks like this:
> data
id var1 var2
1 1000 32 2.3
2 1000 34 2.5
3 1000 33 NA
4 1000 36 2.4
5 1001 32 3.1
6 1001 NA 2.5
7 1001 45 NA
8 1002 45 2.6
9 1002 37 NA
10 1002 33 3.1
11 1002 NA 3.3
As you can see, each ID has multiple observations (3-4 each). I want to add another variable (column), which acts like an index and numbers each observation within the ID. This is ideally what the dataframe would look like after adding the variable:
> data_goal
id var1 var2 index
1 1000 32 2.3 1
2 1000 34 2.5 2
3 1000 33 NA 3
4 1000 36 2.4 4
5 1001 32 3.1 1
6 1001 NA 2.5 2
7 1001 45 NA 3
8 1002 45 2.6 1
9 1002 37 NA 2
10 1002 33 3.1 3
11 1002 NA 3.3 4
What would be the best way to do this in R?
If it's relevant, my ultimate goal is to reshape the data into "wide" format for further analyses, but for that I need an index variable.
Here is a solution that uses dplyr:
# reproducing your data
data<- data.frame(rbind(c(1,1000,32,2.3),c(2,1000,34,2.5),c(3,1000,33,NA),
c(4,1000,36,2.4),c(5,1001,32,3.1),c(6,1001,NA,2.5),c(7,1001,45,NA),
c(8,1002,45,2.6),c(9,1002,37,NA),c(10,1002,33,3.1),
c(11,1002,NA,3.3)))
colnames(data)<-c("row", "id","var1","var2")
library(dplyr)
# use pipes ( %>% ) to do this in a single line of code
data_goal<-data %>% group_by(id) %>% mutate(index=1:n())
You can easily use dplyr to reshape the data too. Here is a resource if you are unfamiliar: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
library(data.table)
# dat is your data frame; := adds the index column by reference
setDT(dat)[, index := seq_len(.N), by = id]
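If neither dplyr nor data.table is available, the same within-ID counter can be sketched in base R with ave(). This is only an illustration: the data frame is rebuilt here without the row-number column, and the wide object is a hypothetical name for the reshaped result mentioned as the ultimate goal in the question.

```r
# Example data from the question (row numbers omitted)
data <- data.frame(
  id   = c(1000, 1000, 1000, 1000, 1001, 1001, 1001, 1002, 1002, 1002, 1002),
  var1 = c(32, 34, 33, 36, 32, NA, 45, 45, 37, 33, NA),
  var2 = c(2.3, 2.5, NA, 2.4, 3.1, 2.5, NA, 2.6, NA, 3.1, 3.3)
)

# Number the observations 1, 2, 3, ... within each id
data$index <- ave(seq_along(data$id), data$id, FUN = seq_along)

# With the index in place, base reshape() can spread var1/var2 into wide format
wide <- reshape(data, idvar = "id", timevar = "index", direction = "wide")
print(wide)
```

The index restarts at 1 for every id in the order the rows appear, so sort the data first if a particular order within each id matters.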
I am stuck finding the right approach to the following task.
My original dataframe has the following structure (with many more gde_nr):
df <- read.table(header = TRUE, text = "
gde_nr month amount
1001 1 4000
1001 3 1002
1001 4 1283
1001 5 4352
1002 3 2
1002 4 34
1002 6 300
")
My goal is to create a full monthly sequence for every gde_nr, like this:
df_result <- read.table(header = TRUE, text = "
gde_nr month amount
1001 1 4000
1001 2 0
1001 3 1002
1001 4 1283
1001 5 4352
1001 6 0
1001 7 0
1001 8 0
1001 9 0
1001 10 0
1001 11 0
1001 12 0
1002 1 0
1002 2 0
1002 3 2
1002 4 34
1002 5 0
1002 6 300
1002 7 0
1002 8 0
1002 9 0
1002 10 0
1002 11 0
1002 12 0
")
In a first step I used group_by followed by nest:
library(tidyverse)
df %>%
group_by(gde_nr) %>%
nest()
My ideas on how to proceed:
A) join a dataframe containing the sequence 1:12 and repeating gde_nr. purrr reduce() might be an option
B) use map()
Sure, I am open to a completely different approach to this.
Thanks in advance!
Jürgen
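One possible direction, shown here as a hedged base R sketch rather than a definitive answer, follows idea A: build every gde_nr/month combination with expand.grid(), join the original data onto it, and fill the gaps with 0. The names grid and full are made up for this illustration.

```r
df <- read.table(header = TRUE, text = "
gde_nr month amount
1001 1 4000
1001 3 1002
1001 4 1283
1001 5 4352
1002 3 2
1002 4 34
1002 6 300
")

# Every combination of gde_nr and month 1..12
grid <- expand.grid(gde_nr = unique(df$gde_nr), month = 1:12)

# Left join: keep all combinations, bring in amounts where they exist
full <- merge(grid, df, by = c("gde_nr", "month"), all.x = TRUE)

# Months without an observation get amount 0
full$amount[is.na(full$amount)] <- 0

# Sort by gde_nr and month for readability
full <- full[order(full$gde_nr, full$month), ]
print(full)
```

Within the tidyverse, tidyr::complete() with its fill argument is intended for the same kind of gap filling.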
I have two different datasets arranged in column format as follows:
Dataset 1:
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
Dataset 2:
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
The number of rows in both datasets is variable, but they contain data for two samples (for example, column A: 13 and 1 in both datasets). I want the C, D and E values of dataset 1 to be placed in dataset 2 for the rows that have the same values of A and B in both datasets. So the join should be based on A and B. I need to do this for about 47560 rows.
I am new to R, so I would be thankful for code that saves the new merged dataset in R.
Use the merge function in R.
Reference from : http://www.statmethods.net/management/merging.html
Edit:
So first you'd need to read in the datasets, CSV is a good format.
> dataset1 <- read.csv(file="dataset1.csv", header=TRUE, sep=",")
> dataset2 <- read.csv(file="dataset2.csv", header=TRUE, sep=",")
If you just type the variable names now and hit enter you should see a read-out of your datasets. So...
> dataset1
should read out your data above. Then I believe the following should do the merge, though I may be wrong:
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"))
EDIT 2 :
> write.table(dataset1_2, "c:/dataset1_2.txt", sep=" ")
Reference : http://www.statmethods.net/input/exportingdata.html
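One caveat worth checking on the example data: the A/B pairs are not unique in either dataset, and merge() combines every matching row of one dataset with every matching row of the other. The sketch below rebuilds both tables from the question and shows the row counts. The occ counter and the object names inner, keep_all and paired are hypothetical helpers, and pairing repeated rows in order of appearance is an assumption about what is wanted, not something the question states.

```r
dataset1 <- read.table(header = TRUE, text = "
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
")
dataset2 <- read.table(header = TRUE, text = "
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
")

# Plain merge: only matching A/B pairs, duplicates combined in every pairing
inner <- merge(dataset1, dataset2, by = c("A", "B"))

# all.y = TRUE also keeps dataset2 rows without a match (C, D, E become NA)
keep_all <- merge(dataset1, dataset2, by = c("A", "B"), all.y = TRUE)

# Assumption: repeated A/B rows should pair up in order of appearance.
# Number the repeats within each A/B pair and merge on that counter too.
dataset1$occ <- ave(seq_len(nrow(dataset1)), dataset1$A, dataset1$B, FUN = seq_along)
dataset2$occ <- ave(seq_len(nrow(dataset2)), dataset2$A, dataset2$B, FUN = seq_along)
paired <- merge(dataset1, dataset2, by = c("A", "B", "occ"), all.y = TRUE)

c(inner = nrow(inner), keep_all = nrow(keep_all), paired = nrow(paired))
```

With the counter included, the result has exactly one row per dataset 2 row, which may be closer to "placing" the dataset 1 values into dataset 2.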
I am trying to identify the time of primary ambulance arrival for a number of patients in my dataframe=data.
The primary ambulance is either the 1st, 2nd, 3rd or 4th vehicle on scene (data$prim.amb.num=1, 2, 3, or 4 for each patient/row).
data$time_v1, data$time_v2, data$time_v3 and data$time_v4 have a time or a missing value, which corresponds to the 1st, 2nd, 3rd and 4th vehicles, where relevant.
What I would like to do is make a new variable=prim.amb.time with the time that corresponds to primary ambulance arrival time. Suppose for patient=1, the ambulance was the first. Then I want data[1,"prim.amb.time"]=data[1,"time_v1"].
I can figure out the correct time_v* with the following:
paste("time_v", data$prim.amb.num, sep="")
But I'm stuck as to how to pass the resulting information to call the correct column.
My hope was to simply have something like:
data$prim.amb.time<-data$paste("time_v", data$prim.amb.num, sep="")
but of course, this doesn't work. I'm not even sure how to Google for this; I tried various combinations of this title but to no avail. Any suggestions?
Although I liked the answer by @mhermans, if you want a one-liner, one solution is to use ?apply as follows:
# From @mhermans
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
#Take each row of d and pull out time_vn where n = d$prime.amb.num
d$prime.amb.time <- apply(d, 1, function(x) {x[x['prime.amb.num'] + 2]})
> d
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4 prime.amb.time
1 1000 1 30 40 60 100 30
2 1001 3 40 50 60 80 60
3 1002 2 10 30 40 45 30
4 1003 1 24 40 45 60 24
EDIT - or with paste:
d$prime.amb.time <-
apply(
d,
1,
function(x) {
x[paste('time_v', x['prime.amb.num'], sep = '')]
}
)
#Gives the same result
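Both apply() variants above work row by row. A vectorised alternative, sketched here under the assumption that the vehicle columns are named time_v1 to time_v4 as in the example data, is to index a matrix of the time columns with a two-column (row, column) matrix. The names time_cols and times are introduced only for this illustration.

```r
d <- read.table(header = TRUE, text = "
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")

# Matrix holding only the vehicle arrival times, one column per vehicle
time_cols <- paste0("time_v", 1:4)
times <- as.matrix(d[time_cols])

# Pick, for each row i, the column given by prime.amb.num[i]
d$prime.amb.time <- times[cbind(seq_len(nrow(d)), d$prime.amb.num)]
print(d)
```

Because the lookup is done on a separate matrix of the time columns, it does not depend on where those columns sit in the data frame.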
Set up example data:
# read in basic example data for four patients, wide format
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
In the example dataset I'm thus assuming your data looks like this:
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1 1000 1 30 40 60 100
2 1001 3 40 50 60 80
3 1002 2 10 30 40 45
4 1003 1 24 40 45 60
Given that data structure, it is perhaps easier to work with a dataset with a vehicle per row, instead of a patient per row. This can be accomplished by using reshape() to convert from a wide to a long format.
dl <- reshape(d, direction='long', idvar="patient.id", varying=list(3:6))
# ordering & rename var for aesth. reasons:
dl <- dl[order(dl$patient.id, dl$time),]
dl$vehicle.id <- dl$time
dl$time <- NULL
dl
This gives a long dataset, with a row per vehicle:
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1000.2 1000 1 40 2
1000.3 1000 1 60 3
1000.4 1000 1 100 4
1001.1 1001 3 40 1
1001.2 1001 3 50 2
1001.3 1001 3 60 3
1001.4 1001 3 80 4
1002.1 1002 2 10 1
1002.2 1002 2 30 2
1002.3 1002 2 40 3
1002.4 1002 2 45 4
1003.1 1003 1 24 1
1003.2 1003 1 40 2
1003.3 1003 1 45 3
1003.4 1003 1 60 4
Getting the arrival time of the primary ambulance per patient then becomes a simple one-liner:
dl[dl$prime.amb.num == dl$vehicle.id,]
which gives
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1001.3 1001 3 60 3
1002.2 1002 2 30 2
1003.1 1003 1 24 1
I have a dataset with a lot of entries. Each of these entries belongs to a certain ID (belongID), the entries are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source have the same belongID. For the research I need to do on the dataset, whenever a single sourceID occurs more than 5 times for one belongID, I have to get rid of the surplus entries. The (at most) 5 entries that are kept should be the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
The example in the end should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are a lot more columns with data entries in the file, but the selection has to be purely based on time. As shown in the example it can also occur that the 5th and 6th entry of a sourceID with the same belongID have the same time. In this case only 1 has to be chosen, because max=5.
The dataset here is nicely ordered on belongID and time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across something similar yet.
If dat is your data frame:
do.call(rbind,
by(dat, INDICES=list(dat$belongID, dat$sourceID),
FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
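The same selection can also be sketched in base R without splitting the data frame: sort by group and descending Time, number the rows within each belongID/sourceID group, and keep positions 1 to 5. This is an illustrative sketch on the question's example data. The names sorted, pos and kept are made up here, and using uniqID as the tie-breaker is an assumption (the question only says that one of the tied entries has to be chosen).

```r
dat <- read.table(header = TRUE, text = "
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
")

# Sort by group, then highest Time first; uniqID breaks ties (assumption)
sorted <- dat[order(dat$belongID, dat$sourceID, -dat$Time, dat$uniqID), ]

# Position of each row within its belongID/sourceID group
pos <- ave(seq_len(nrow(sorted)), sorted$belongID, sorted$sourceID,
           FUN = seq_along)

# Keep at most the first five rows per group, then restore uniqID order
kept <- sorted[pos <= 5, ]
kept <- kept[order(kept$uniqID), ]
print(kept)
```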
Say your data is in df. The ordered (by uniqID) output is obtained after this:
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
if(all(is.na(tab[bIDs == i, ])))next
ids <- na.omit(sIDs[tab[i, sIDs] > 5])
for(j in ids)
{
cond <- df$belongID == i & df$sourceID == j
old <- df[cond,]
id5 <- order(old$Time, decreasing = TRUE)[1:5]
new <- old[id5,]
df <- df[!cond,]
df <- rbind(df, new)
}
}
df[order(df$uniqID), ]
A solution in two lines using the plyr package:
library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]
The results:
belongID sourceID uniqID Time
5 1 1001 101 5
6 1 1002 102 5
4 1 1001 103 4
2 1 1001 104 3
3 1 1001 105 3
7 1 1005 106 2
1 1 1001 108 2
10 2 1005 109 5
16 2 1006 110 5
11 2 1005 111 5
17 2 1006 112 5
12 2 1005 113 5
15 2 1006 114 4
9 2 1005 115 4
13 2 1006 116 3
8 2 1005 117 3
14 2 1006 118 3
18 2 1007 122 1
19 3 1010 123 5
20 3 1480 124 2
The dataset on which this method is going to be used has 170,000+ entries and almost 30 columns.
Benchmarking the three solutions provided by danas.zuokas, mplourde and Andrie on my dataset gave the following timings:
danas.zuokas' solution:
User System Elapsed
2829.569 0 2827.86
mplourde's solution:
User System Elapsed
765.628 0.000 763.908
Andrie's solution:
User System Elapsed
984.989 0.000 984.010
Therefore I will use mplourde's solution. Thank you all!
This should be faster, using data.table:
library(data.table)
DT = as.data.table(dat)
DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]
Aside: I suggest counting the number of times the same variable name is repeated in the various answers to this question. Do you ever have a lot of long or similar object names?