I'm not sure how best to describe this, so I'll just show you. I have two variables.
A:
ID
1 121
2 122
3 123
4 124
5 125
6 126
7 127
8 128
9 129
and B:
var1 var2 var3
1 57.1 116.5 73.0
2 38.1 15.8 22.7
3 84.2 99.2 72.2
and I would like them to end up like this:
ID
1 121 57.1
2 122 116.5
3 123 73.0
4 124 38.1
5 125 15.8
6 126 22.7
7 127 84.2
8 128 99.2
9 129 72.2
Does that make sense? I'd like to keep the original variable and add a column containing the rows, in order, of the other variable. Preferably I'd like this as a data frame.
Thanks in advance.
Data frames and matrices are filled by column by default. Since you want to create a numeric vector filled by row, you will need to transpose the data frame before coercing it to a numeric vector, so the values end up in the order you want.
A$value <- c(t(B))
Transposing a data frame gives a matrix, which c() coerces to a numeric vector.
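A quick runnable sketch using the example data above (the names A and B are taken from the question):
# Rebuild the example data
A <- data.frame(ID = 121:129)
B <- data.frame(var1 = c(57.1, 38.1, 84.2),
                var2 = c(116.5, 15.8, 99.2),
                var3 = c(73.0, 22.7, 72.2))
# t(B) turns B into a matrix whose columns are B's rows;
# c() then flattens it column by column, i.e. B row by row
A$value <- c(t(B))
head(A, 3)
#    ID value
# 1 121  57.1
# 2 122 116.5
# 3 123  73.0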
Assuming B is a data.frame, you can do:
cbind(A,var.name=as.vector(as.matrix(B)))
You can pass the new column name instead of var.name
I have data on penicillin production with four treatments (A, B, C, D) as the columns and five blocks as the rows.
I need to calculate the sum and mean of each row separately. The data frame puts everything in one column, so I cannot pick out the values for each treatment and sum them up. I want to know how to arrange the data so that each row has 4 numbers, in order to calculate its mean and sum...
here is my code:
pencilline=c(89,88,97,94,84,77,92,79,81,87,87,85,87,92,89,84,79,81,80,88)
treatment=factor(rep(LETTERS[1:4],times=5))
block=sort(rep(1:5,times=4))
datap=data.frame(pencilline,block,treatment)
datap
datap_subset=unlist(lapply(datap,is.numeric))
datap_subset
pencilline block treatment
TRUE TRUE FALSE
rowMeans(datap[,datap_subset])
[1] 45.0 44.5 49.0 47.5 43.0 39.5 47.0 40.5 42.0 45.0 45.0 44.0 45.5 48.0 46.5 44.0 42.0 43.0 42.5 46.5
which gives the wrong row means.
Do you want this?
library(dplyr)
datap %>% group_by(block) %>%
summarise(mean = mean(pencilline))
# A tibble: 5 x 2
block mean
<int> <dbl>
1 1 92
2 2 83
3 3 85
4 4 88
5 5 82
Its base R equivalent:
aggregate(pencilline ~ block, datap, mean)
block pencilline
1 1 92
2 2 83
3 3 85
4 4 88
5 5 82
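For reference, rowMeans(datap[, datap_subset]) averages pencilline and block within each row (e.g. (89 + 1) / 2 = 45), which is why the numbers looked wrong. If you also want the per-block sums the question asks for, one possible sketch:
library(dplyr)
datap %>%
  group_by(block) %>%
  summarise(sum = sum(pencilline),
            mean = mean(pencilline))
# base R: aggregate can return both at once
aggregate(pencilline ~ block, datap,
          FUN = function(x) c(sum = sum(x), mean = mean(x)))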
The formula is:
f=exp(-d*t)+exp(g*t)-1
My dataset includes many observations (f in the formula) at several different times (t) for several subjects, and I want to get estimates of d and g.
How should I code this in R? I also don't know how to determine starting values, since every subject might have a different curve shape.
Here are some hypothetical examples:
subject t f
1 1 0 515.6
2 1 70 62.9
3 1 126 34.8
4 1 181 18.5
5 1 245 28.9
6 1 289 29.6
7 1 359 109.1
8 1 408 33.2
9 1 531 16.9
10 1 569 97.2
I have hundreds of subjects, and I want to estimate the parameters (d and g) at the subject level, i.e. a different curve for each subject.
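No answer is shown here, but a minimal sketch of the mechanics with nls(), assuming the data frame is called dat with columns subject, t and f; the starting values are rough guesses and will likely need tuning (or a self-starting model) for subjects with different curve shapes:
# Fit f = exp(-d*t) + exp(g*t) - 1 separately for each subject
fits <- lapply(split(dat, dat$subject), function(sub) {
  tryCatch(
    nls(f ~ exp(-d * t) + exp(g * t) - 1,
        data  = sub,
        start = list(d = 0.01, g = 0.001)),
    error = function(e) NULL  # keep going if a subject fails to converge
  )
})
# Collect the estimated d and g for subjects that converged
params <- do.call(rbind, lapply(names(fits), function(id) {
  if (is.null(fits[[id]])) return(NULL)
  data.frame(subject = id, t(coef(fits[[id]])))
}))
nlme::nlsList offers a more compact way to fit the same model per subject, but the starting-value problem is the same.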
My data (crsp.daily) look roughly like this (the numbers are made up and there are more variables):
PERMCO PERMNO date price VOL SHROUT
103 201 19951006 8.8 100 823
103 203 19951006 7.9 200 1002
1004 10 19951006 5 277 398
2 5 19951110 5.3 NA 579
1003 2 19970303 10 67 NA
1003 1 19970303 11 77 1569
1003 20 19970401 6.7 NA NA
I want to sum VOL and SHROUT within groups defined by PERMCO and date, but leave the original number of rows unchanged, so my desired output is the following:
PERMCO PERMNO date price VOL SHROUT VOL.sum SHROUT.sum
103 201 19951006 8.8 100 823 300 1825
103 203 19951006 7.9 200 1002 300 1825
1004 10 19951006 5 277 398 277 398
2 5 19951110 5.3 NA 579 NA 579
1003 2 19970303 10 67 NA 144 1569
1003 1 19970303 11 77 1569 144 1569
1003 20 19970401 6.7 NA NA NA NA
My data have more than 45 million observations and 8 columns. I have tried using ave:
crsp.daily$VOL.sum=ave(crsp.daily$VOL,c("PERMCO","date"),FUN=sum)
or sapply:
crsp.daily$VOL.sum=sapply(crsp.daily[,"VOL"],ave,crsp.daily$PERMCO,crsp.daily$date)
The problem is that it takes an extremely long time (more than 30 minutes and I still did not see a result). Another thing I tried was to create a variable called "group" by pasting PERMCO and date together like this:
crsp.daily$group=paste0(crsp.daily$PERMCO,crsp.daily$date)
and then apply ave using crsp.daily$group as the grouping variable. This also did not work because, from a certain observation onward, R no longer distinguished the different values of crsp.daily$group and treated them as a single group.
The approach of creating the "group" variable did work on a smaller dataset.
Any advice is greatly appreciated!
With data.table you could use the following code:
require(data.table)
dt <- as.data.table(crsp.daily)
dt[, VOL.sum := sum(VOL), by = list(PERMCO, date)]
With the := operator you create a new variable (VOL.sum) computed within groups defined by PERMCO and date.
Output
permco permno date price vol shrout vol.sum
1 103 201 19951006 8.8 100 823 300
2 103 203 19951006 7.9 200 1002 300
3 1004 10 19951006 5.0 277 398 277
4 2 5 19951110 5.3 NA 579 NA
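The same pattern extends to SHROUT; whether you want na.rm = TRUE is an assumption on my part, since the desired output above is ambiguous for groups that are all NA:
dt[, `:=`(VOL.sum    = sum(VOL,    na.rm = TRUE),
          SHROUT.sum = sum(SHROUT, na.rm = TRUE)),
   by = .(PERMCO, date)]
With na.rm = TRUE an all-NA group sums to 0 rather than NA, so adjust that to taste.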
Say I have the following data frame:
LungCap Age Height Smoke Gender Caesarean
1 6.475 6 62.1 no male no
2 10.125 18 74.7 yes female no
3 9.550 16 69.7 no female yes
4 11.125 14 71.0 no male no
5 4.800 5 56.9 no male no
6 6.225 11 58.7 no female no
Now I want to select all rows where the age is > 11 and gender is female. This gets me what I want:
y[y$Age>11&y$Gender=="female",]
LungCap Age Height Smoke Gender Caesarean
2 10.125 18 74.7 yes female no
3 9.550 16 69.7 no female yes
But this does not:
y[y$Age>11&y$Gender=="female"]
Age Height
1 6 62.1
2 18 74.7
3 16 69.7
4 14 71.0
5 5 56.9
6 11 58.7
I'm very new to R and I don't understand what this second query is doing, other than that it's not giving me what I want.
When you subset the data frame with the first syntax, the first numeric (or logical) vector in the square brackets selects the rows you want, while the second (after the comma) selects the columns.
If you do not put anything after the comma, R assumes you want all the columns.
If you leave out the comma entirely, R assumes the single index refers to the columns you want.
In your case y$Age > 11 & y$Gender == "female" is a logical vector with TRUE in positions 2 and 3. So without the comma, R thinks you want to select only columns 2 and 3, which is why you get Age and Height.
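A tiny illustration of the difference, using a hypothetical three-column data frame:
d <- data.frame(a = 1:3, b = c(10, 20, 30), c = c("x", "y", "z"))
keep <- c(FALSE, TRUE, TRUE)
d[keep, ]  # with the comma: rows 2 and 3, all columns
d[keep]    # without the comma: columns 2 and 3 (b and c), all rows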
I was searching for an answer to my specific problem, but I didn't find a conclusion. I found this: Add column to Data Frame based on values of other columns, but it wasn't exactly what I need in my specific case.
I'm really a beginner in R, so I hope maybe someone can help me or has a good hint for me.
Here an example of what my data frame looks like:
ID answer 1.partnerID
125 3 715
235 4 845
370 7 985
560 1 950
715 5 235
950 5 560
845 6 370
985 6 125
I'll try to describe what I want to do with an example:
In the first row is the data of the person with ID 125. The first partner of this person is the person with ID 715. I want to create a new column with the value of each person's partner's answer in it. It should look like this:
ID answer 1.partnerID 1.partneranswer
125 3 715 5
235 4 845 6
370 7 985 6
560 1 950 5
715 5 235 4
950 5 560 1
845 6 370 7
985 6 125 3
So R should take the value of the column 1.partnerID, which in this case is "715", and search for the row where "715" is the value in the column ID (no ID appears more than once).
From that row R should take the value from the column answer (in this example the "5") and put it into the new column "1.partneranswer", but in the row of person 125.
I hope someone can understand what I want to do ...
My problem is that I could write this for each row by hand, but I think there must be an easier way to do it for all rows at once (especially because in my original data frame there are 5 partners per person and more than one column from which values should be transferred, so doing it row by row would take many hours).
I hope someone can help.
Thank you!
One solution is to use apply as follows:
df$partneranswer <- apply(df, 1, function(x) df$answer[df$ID == x[3]])
Output will be as desired above. There may be a loop-less approach.
EDIT: Adding a loop-less (vectorized answer) using match:
df$partneranswer <- df$answer[match(df$X1.partnerID, df$ID)]
df
ID answer X1.partnerID partneranswer
1 125 3 715 5
2 235 4 845 6
3 370 7 985 6
4 560 1 950 5
5 715 5 235 4
6 950 5 560 1
7 845 6 370 7
8 985 6 125 3
Update: This can be done with a self join. The first two columns define a mapping from ID to answer; to find the answers for the partner IDs, you can merge the data frame with itself, with the first copy keyed on partnerID and the second keyed on ID.
Suppose df is (with the column names tidied up a bit):
df
# ID answer partnerID
#1 125 3 715
#2 235 4 845
#3 370 7 985
#4 560 1 950
#5 715 5 235
#6 950 5 560
#7 845 6 370
#8 985 6 125
merge(df, df[c('ID', 'answer')], by.x = "partnerID", by.y = "ID")
# partnerID ID answer.x answer.y
#1 125 985 6 3
#2 235 715 5 4
#3 370 845 6 7
#4 560 950 5 1
#5 715 125 3 5
#6 845 235 4 6
#7 950 560 1 5
#8 985 370 7 6
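If you want the merged result back in the original layout (original row order, partner's answer in one clearly named column), a possible follow-up, storing the merge in a hypothetical m:
m <- merge(df, df[c("ID", "answer")], by.x = "partnerID", by.y = "ID")
names(m)[names(m) == "answer.y"] <- "partneranswer"
# restore the original row order and a tidier column order
m <- m[match(df$ID, m$ID), c("ID", "answer.x", "partnerID", "partneranswer")]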
Old answer:
If ID and partnerID map to each other one-to-one, you can try:
df$partneranswer <- with(df, answer[sapply(X1.partnerID, function(partnerID) which(ID == partnerID))])
df
# ID answer X1.partnerID partneranswer
#1 125 3 715 5
#2 235 4 845 6
#3 370 7 985 6
#4 560 1 950 5
#5 715 5 235 4
#6 950 5 560 1
#7 845 6 370 7
#8 985 6 125 3