I have a large data set with 72 columns, and I want to gather every 3 columns into one new column, ending up with 24 columns.
I tried the gather() function, but it only works once, i.e., it gathers only 3 columns at a time.
Can I use this function in a for loop?
I tried this:
j=0
k=1
l=2
for (i in 2:24){
neww <- gather(columns, "KEy", "Proteins H/L", c((i+j), (i+k), (i+l)), na.rm = TRUE)
j=j+2;
k=k+2;
l=l+2;
}
I need to gather the first 3 columns into a single column, the next 3 into another column, and so on.
You can use the to_long function from the sjmisc package for this purpose. It is a convenience wrapper that loops over multiple gather() calls.
# create sample
mydat <- data.frame(age = c(20, 30, 40),
sex = c("Female", "Male", "Male"),
score_t1 = c(30, 35, 32),
score_t2 = c(33, 34, 37),
score_t3 = c(36, 35, 38),
speed_t1 = c(2, 3, 1),
speed_t2 = c(3, 4, 5),
speed_t3 = c(1, 8, 6))
# check tidyr. score is gathered, however, speed is not
tidyr::gather(mydat, "time", "score", score_t1, score_t2, score_t3)
> age sex speed_t1 speed_t2 speed_t3 time score
> 1 20 Female 2 3 1 score_t1 30
> 2 30 Male 3 4 8 score_t1 35
> 3 40 Male 1 5 6 score_t1 32
> 4 20 Female 2 3 1 score_t2 33
> 5 30 Male 3 4 8 score_t2 34
> 6 40 Male 1 5 6 score_t2 37
> 7 20 Female 2 3 1 score_t3 36
> 8 30 Male 3 4 8 score_t3 35
> 9 40 Male 1 5 6 score_t3 38
# gather multiple columns. both score and speed are gathered.
to_long(mydat, "time", c("score", "speed"),
c("score_t1", "score_t2", "score_t3"),
c("speed_t1", "speed_t2", "speed_t3"))
> age sex time score speed
> (dbl) (fctr) (chr) (dbl) (dbl)
> 1 20 Female score_t1 30 2
> 2 30 Male score_t1 35 3
> 3 40 Male score_t1 32 1
> 4 20 Female score_t2 33 3
> 5 30 Male score_t2 34 4
> 6 40 Male score_t2 37 5
> 7 20 Female score_t3 36 1
> 8 30 Male score_t3 35 8
> 9 40 Male score_t3 38 6
In this case, the time column (indicating the gathered groups) simply takes the column names from one of the gathered sets. If this is too confusing, you can also just number the ID variable:
to_long(mydat, "time", c("score", "speed"),
c("score_t1", "score_t2", "score_t3"),
c("speed_t1", "speed_t2", "speed_t3"),
recode.key = TRUE)
> age sex time score speed
> (dbl) (fctr) (dbl) (dbl) (dbl)
> 1 20 Female 1 30 2
> 2 30 Male 1 35 3
> 3 40 Male 1 32 1
> 4 20 Female 2 33 3
> 5 30 Male 2 34 4
> 6 40 Male 2 37 5
> 7 20 Female 3 36 1
> 8 30 Male 3 35 8
> 9 40 Male 3 38 6
See ?to_long for more examples.
I'm not sure, but I think I read on GitHub that "multiple column gathering" is also planned for tidyr at some point.
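For the layout in the question (every 3 consecutive columns stacked into one), a base R sketch without extra packages might look like the following. The data frame `wide` and the output names `group1`, `group2` are made up for illustration. The sketch assumes the number of columns is a multiple of 3 and that the columns within each group share a type.

```r
# Hypothetical 6-column input; the real data would have 72 columns.
wide <- data.frame(a1 = 1:2, a2 = 3:4, a3 = 5:6,
                   b1 = 7:8, b2 = 9:10, b3 = 11:12)

# Column positions split into consecutive groups of 3: (1,2,3), (4,5,6), ...
groups <- split(seq_along(wide), ceiling(seq_along(wide) / 3))

# Stack the columns of each group on top of each other.
long <- as.data.frame(lapply(groups, function(idx) {
  unlist(wide[idx], use.names = FALSE)
}))
names(long) <- paste0("group", seq_along(groups))

long
```

Unlike gather(), this drops the key column that records which original column each value came from, so it only fits when that information is not needed.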
I have the client dataset below, which includes client_id, birth_number and district_id. The birth number encodes the date of birth, with a twist: for men the value has the form YYMMDD, and for women it has the form YY(+50MM)DD. I would like help writing an R script that splits the birth number and applies a condition: if MM > 12, the row belongs to a woman and the actual month is the value minus 50; otherwise the row belongs to a man and the birth number stays the same.
Please help.
The value is in the form: YYMMDD (for men)
The value is in the form: YY(+50MM)DD (for women)
"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73
An option is to use substring along with ifelse as:
# Get the 3rd and 4th character from "birth_number". If it is > 12
# that row is for Female, otherwise Male
df$Gender <- ifelse(as.numeric(substring(df$birth_number,3,4)) > 12, "Female", "Male")
# Now correct the "birth_number". Subtract 50 from the middle 2 digits.
# Updated based on feedback from @RuiBarradas to use df$Gender == "Female"
# to subtract 50 from the month number
df$birth_number <- ifelse(df$Gender == "Female",
as.character(as.numeric(df$birth_number)-5000), df$birth_number)
df
# client_id birth_number district_id Gender
# 1 1 701213 18 Female
# 2 2 450204 1 Male
# 3 3 401009 1 Female
# 4 4 561201 5 Male
# 5 5 600703 5 Female
# 6 6 190922 12 Male
# so on
#
Data:
df <- read.table(text =
'"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73',
header = TRUE, stringsAsFactors = FALSE, sep = ";")
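The same decoding can also be sketched with integer arithmetic instead of substring(), which additionally yields the real month as its own value. The vector below holds the first three birth numbers from the data, and the names `month_raw`, `month` and `corrected` are illustrative. It assumes birth_number is numeric, so any leading zero has already been lost.

```r
birth_number <- c(706213, 450204, 406009)

# Middle two digits: drop DD with integer division, then keep the last two digits.
month_raw <- (birth_number %/% 100) %% 100

# A month above 12 can only arise from the +50 offset used for women.
gender <- ifelse(month_raw > 12, "Female", "Male")

# Undo the offset for the month and for the full birth number.
month     <- ifelse(gender == "Female", month_raw - 50, month_raw)
corrected <- ifelse(gender == "Female", birth_number - 5000, birth_number)

data.frame(birth_number, gender, month, corrected)
```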
Using the same logic as @MKR; I just prefer the tidyverse approach.
library(tidyverse)
df %>%
  # substr() returns a string, so convert it before comparing with 12
  mutate(Gender = ifelse(as.numeric(substr(birth_number, 3, 4)) > 12,
                         "Female", "Male"),
         birth_number = ifelse(Gender == "Female",
                               birth_number - 5000,
                               birth_number))
client_id birth_number district_id Gender
1 1 701213 18 Female
2 2 450204 1 Male
3 3 401009 1 Female
4 4 561201 5 Male
5 5 600703 5 Female
6 6 190922 12 Male
7 7 290125 15 Male
8 8 380221 51 Female
9 9 351016 60 Male
10 10 430501 57 Male
11 11 500822 57 Female
12 12 810220 40 Male
13 13 740529 54 Female
14 14 420622 76 Female
15 15 180828 21 Female
16 16 190225 21 Male
17 17 341013 76 Male
18 18 310405 76 Female
19 19 421228 47 Male
20 20 790104 46 Male
21 21 521029 12 Female
22 22 691011 1 Female
23 23 730529 1 Male
24 24 390729 43 Female
25 25 390423 21 Female
26 26 690420 74 Female
27 27 660326 54 Female
28 28 450929 1 Male
29 29 510911 30 Female
30 30 571009 74 Female
31 31 620209 68 Male
32 32 800728 52 Male
33 33 481204 73 Female
So I have a csv file with column headers ID, Score, and Age.
So in R I have,
data <- read.csv(file.choose(), header=T)
attach(data)
I would like to create two new vectors: one with the scores of people younger than 70 and one with the scores of people older than 70. I thought there was a nice, quick way to do this, but I can't find it anywhere. Thanks for any help.
Example of what data looks like
ID, Score, Age
1, 20, 77
2, 32, 65
.... etc
I am trying to make 2 vectors: one containing the scores of everyone younger than 70 and one containing the scores of everyone older than 70.
Assuming data looks like this:
Score Age
1 1 29
2 5 39
3 8 40
4 3 89
5 5 31
6 6 23
7 7 75
8 3 3
9 2 23
10 6 54
.. . ..
you can use
df_old <- data[data$Age >= 70,]
df_young <- data[data$Age < 70,]
which gives you
> df_old
Score Age
4 3 89
7 7 75
11 7 97
13 3 101
16 5 89
18 5 89
19 4 96
20 3 97
21 8 75
and
> df_young
Score Age
1 1 29
2 5 39
3 8 40
5 5 31
6 6 23
8 3 3
9 2 23
10 6 54
12 4 23
14 2 23
15 4 45
17 7 53
PS: if you only want the scores and not the age, you could use this:
df_old <- data[data$Age >= 70, "Score"]
df_young <- data[data$Age < 70, "Score"]
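If both vectors are wanted in one step, split() is another option. A minimal sketch, using the first seven rows of the sample data above and the made-up labels "old" and "young":

```r
data <- data.frame(Score = c(1, 5, 8, 3, 5, 6, 7),
                   Age   = c(29, 39, 40, 89, 31, 23, 75))

# split() returns a named list with one score vector per label.
scores <- split(data$Score, ifelse(data$Age >= 70, "old", "young"))

scores$old
scores$young
```

As in the answer above, people aged exactly 70 end up in the "old" group; change `>=` to `>` if they belong with the younger group.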
I have a dataset that looks like this:
ID SEX WEIGHT BMI
1 2 65 25
1 2 65 25
1 2 65 25
2 1 70 30
2 1 70 30
2 1 70 30
2 1 70 30
3 2 50 18
3 2 50 18
4 1 85 20
4 1 85 20
I want to calculate fat free mass (FFM) and attach the value in a new column in the dataset for each individual. These are the functions to calculate FFM for males and females:
for males (SEX=1):
FFMCalMale <- function (WEIGHT, BMI) {
FFM = 9270*WEIGHT/(6680+216*BMI)
}
and for females (SEX=2):
FFMCalFemale <- function(WEIGHT, BMI) {
FFM = 9270*WEIGHT/(8780+244*BMI)
}
I want to modify these functions so that a single function checks SEX (1 is male, 2 is female), calculates FFM accordingly, and is applied to each individual. Could you please help?
Thanks in advance!
You could use ifelse
data$FFM <- ifelse(data$SEX==1,
FFMCalMale(data$WEIGHT, data$BMI),
FFMCalFemale(data$WEIGHT, data$BMI))
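A self-contained sketch of the ifelse() approach, using the two formulas from the question and one row per ID from the sample data. The functions here return the expression directly rather than assigning it to FFM, which behaves the same in ifelse() but also prints the value when the function is called on its own.

```r
FFMCalMale <- function(WEIGHT, BMI) {
  9270 * WEIGHT / (6680 + 216 * BMI)
}
FFMCalFemale <- function(WEIGHT, BMI) {
  9270 * WEIGHT / (8780 + 244 * BMI)
}

# One row per individual from the question's data.
data <- data.frame(ID = 1:4, SEX = c(2, 1, 2, 1),
                   WEIGHT = c(65, 70, 50, 85), BMI = c(25, 30, 18, 20))

# ifelse() is vectorised: both formulas are evaluated for every row,
# and SEX decides which result is kept.
data$FFM <- ifelse(data$SEX == 1,
                   FFMCalMale(data$WEIGHT, data$BMI),
                   FFMCalFemale(data$WEIGHT, data$BMI))
data
```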
A data.table approach:
mydata <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
ID SEX WEIGHT BMI
1 2 65 25
1 2 65 25
1 2 65 25
2 1 70 30
2 1 70 30
2 1 70 30
2 1 70 30
3 2 50 18
3 2 50 18
4 1 85 20
4 1 85 20
')
library(data.table) ## load data.table
setDT(mydata) ## convert the data to datatable
FFMCalMale <- function (WEIGHT, BMI) {
FFM = 9270*WEIGHT/(6680+216*BMI)
}
FFMCalFemale <- function(WEIGHT, BMI) {
FFM = 9270*WEIGHT/(8780+244*BMI)
}
setkey(mydata, SEX)
mydata[, FFM := ifelse(SEX == 1,
FFMCalMale(WEIGHT, BMI),
FFMCalFemale(WEIGHT, BMI))][]
# ID SEX WEIGHT BMI FFM
# 1: 2 1 70 30 49.30851
# 2: 2 1 70 30 49.30851
# 3: 2 1 70 30 49.30851
# 4: 2 1 70 30 49.30851
# 5: 4 1 85 20 71.63182
# 6: 4 1 85 20 71.63182
# 7: 1 2 65 25 40.49395
# 8: 1 2 65 25 40.49395
# 9: 1 2 65 25 40.49395
# 10: 3 2 50 18 35.18828
# 11: 3 2 50 18 35.18828
Here are two ways, one just taking the dataframe (assuming it contains columns with the names SEX, WEIGHT, and BMI):
dffunc <- function(dataframe) {
ifelse(dataframe$SEX == 1,
9270 * dataframe$WEIGHT / (6680 + 216 * dataframe$BMI),
9270 * dataframe$WEIGHT / (8780 + 244 * dataframe$BMI))
}
or as you originally formatted it, but adding the SEX parameter:
func <- function(WEIGHT, BMI, SEX) {
ifelse(SEX == 1,
9270 * WEIGHT / (6680 + 216 * BMI),
9270 * WEIGHT / (8780 + 244 * BMI))
}
I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal to follow specific rules. The variables below are id, the sex of the household head (1 = male, 2 = female) and the age of the household head. The rules are as follows. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
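The ordering step matters because duplicated() flags every occurrence of a value after the first one, so whichever row is sorted to the top of each id is the row that survives. A small sketch with a made-up id vector:

```r
ids <- c(2, 2, 5, 5, 5)

# TRUE marks a repeat of an id that has already been seen.
dup <- duplicated(ids)
dup

# Negating the flags keeps only the first row of each id.
which(!dup)
```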
With data.table, this is easy with "compound queries". To order the data as it is read in, set the key to "id,sex" when creating the data.table (this is required in case a female row comes before a male row for a given ID).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
Combining 2 columns into 1 column many times in a very large dataset in R
The clumsy solutions I am working on are not going to be very fast, even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10  # number of individuals; the SNP vectors below have length 10
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
rs157.1=c(4,4,4,2,4,4,4,4,2,2), rs132=c(4,4,4,4,4,4,4,4,2,2),
rs132.1=c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
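Steps 2) and 3) from the question could then be sketched as follows for a single SNP, using the rs123 alleles from the toy data. The names `genotype`, `rarest` and `minor` are illustrative, and ties for the least frequent value are resolved by which.min(), which picks the first one in table order.

```r
rs123   <- c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1)
rs123.1 <- rep(1, 10)

# Step 1: concatenate the two alleles into one genotype string.
genotype <- paste0(rs123, rs123.1)

# Step 2: count each genotype and pick the least frequent one.
counts <- table(genotype)
rarest <- names(counts)[which.min(counts)]

# Step 3: recode the least frequent genotype as 1 and everything else as 0.
minor <- as.integer(genotype == rarest)

rarest
minor
```

For the full dataset, the same three steps could be wrapped in a function and applied over the pairs of SNP columns. Whether that is fast enough at ~1500 x 45000 would need to be measured.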