Group by and determine which entries there are in a given group - r

Suppose you have a data frame df with 5 attributes: x1, x2, x3, x4, Year, as follows:
set.seed(1)
x1 <- 1:30
x2 <- rnorm(10)
x3 <- rchisq(25, 2, ncp = 0)
x4 <- rpois(6, 0.94)
Year <- sample(2011:2014,30,replace=TRUE)
noRow <- max(length(x1), length(x2), length(x3), length(x4), length(Year))
df <- list(x1=x1, x2=x2, x3=x3, x4=x4, Year=Year)
attributes(df) <- list(names = names(df), row.names=1:30, class='data.frame')
which produces the following output:
x1 x2 x3 x4 Year
1 1 -0.6264538 4.2807226 0 2014
2 2 0.1836433 1.6273105 0 2014
3 3 -0.8356286 0.3144031 0 2012
4 4 1.5952808 0.6216108 0 2012
5 5 0.3295078 0.9374638 1 2014
6 6 -0.8204684 0.1363947 2 2013
7 7 0.4874291 2.4985843 <NA> 2013
8 8 0.7383247 2.0162627 <NA> 2012
9 9 0.5757814 2.7218900 <NA> 2012
10 10 -0.3053884 2.4119764 <NA> 2014
11 11 <NA> 1.1082308 <NA> 2013
12 12 <NA> 2.4140052 <NA> 2011
13 13 <NA> 3.1249573 <NA> 2011
14 14 <NA> 0.2615523 <NA> 2012
15 15 <NA> 0.4381074 <NA> 2014
16 16 <NA> 0.6944394 <NA> 2013
17 17 <NA> 0.8599189 <NA> 2014
18 18 <NA> 0.2924151 <NA> 2013
19 19 <NA> 1.6834339 <NA> 2012
20 20 <NA> 0.4848175 <NA> 2012
21 21 <NA> 3.1606987 <NA> 2011
22 22 <NA> 2.3705121 <NA> 2011
23 23 <NA> 0.7808625 <NA> 2013
24 24 <NA> 0.4621734 <NA> 2011
25 25 <NA> 1.9421776 <NA> 2012
26 26 <NA> <NA> <NA> 2013
27 27 <NA> <NA> <NA> 2014
28 28 <NA> <NA> <NA> 2012
29 29 <NA> <NA> <NA> 2012
30 30 <NA> <NA> <NA> 2011
I would like to group by year and determine if for a given year we have no entries in one or more attributes.
Using
library("dplyr")
df1 <- df %>%
dplyr::group_by(Year) %>%
dplyr::mutate(count = n())
only gives me the number of entries in a given year; it doesn't tell me which attributes are present (non-missing) in that year.
Thanks for sharing your ideas.
Desired output:
Year x1 x2 x3 x4
2011 1 0 1 0
2012 1 1 1 1
2013 1 1 1 1
2014 1 1 1 1
where 1 means there's at least one entry for the variable in a given year, and 0 otherwise.

This code counts, for each year, the rows in which all four attributes are non-missing:
df$attrib_ok <- !is.na(rowSums(df[1:4]))
df1 <- df %>%
dplyr::group_by(Year) %>%
dplyr::mutate(count=sum(attrib_ok)) %>%
dplyr::select(-attrib_ok)
However, the data frame in your question seems to be corrupt (its columns have different lengths), so this solution doesn't work on it.
You first have to create a valid data frame, padding the shorter columns with NA, like this:
set.seed(1)
x1 <- 1:30
x2 <- c(rnorm(10), rep(NA, 20))
x3 <- c(rchisq(25, 2, ncp = 0), rep(NA, 5))
x4 <- c(rpois(6, 0.94), rep(NA, 24))
Year <- sample(2011:2014,30,replace=TRUE)
df <- data.frame(x1,x2,x3,x4,Year)
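If the shorter vectors already exist, they can also be padded after the fact. A minimal sketch, assuming a hypothetical helper named pad_to (not part of the original answer) and made-up values:

```r
# Hypothetical helper: extend a vector to length n, filling the tail with NA.
pad_to <- function(x, n) {
  length(x) <- n   # assigning a longer length pads with NA
  x
}

x1 <- 1:5
x2 <- pad_to(c(0.1, 0.2), length(x1))    # made-up values
x4 <- pad_to(c(3L, 0L, 1L), length(x1))  # made-up values

# All columns now have the same length, so data.frame() builds a valid object.
df_small <- data.frame(x1, x2, x4)
print(df_small)
```

Because every column has the same length, data.frame() does not need to recycle or reject anything.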
Code to get your desired output:
df1 <- data.frame(Year=df$Year,!is.na(df[1:4]))
df1 <- aggregate(.~Year, data = df1, FUN = sum)
df1 <- data.frame(Year=df1$Year, apply(apply(df1[,2:5], 2, as.logical), 2, as.numeric))
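The same presence/absence table can probably be built in one step by asking, per year and per column, whether any value is non-missing. This is a sketch of an alternative, not the answer's own method; the small data frame below uses made-up values rather than the question's random data:

```r
# Small deterministic data frame (made-up values for illustration only).
df_demo <- data.frame(
  x1   = 1:6,
  x2   = c(0.5, NA, NA, 1.2, NA, NA),
  x3   = c(1, 2, NA, 3, 4, NA),
  x4   = c(NA, NA, NA, 1, NA, NA),
  Year = c(2011, 2011, 2011, 2012, 2012, 2013)
)

# For each Year and each attribute: 1 if at least one non-missing entry, else 0.
presence <- aggregate(
  df_demo[c("x1", "x2", "x3", "x4")],
  by  = list(Year = df_demo$Year),
  FUN = function(x) as.integer(any(!is.na(x)))
)
print(presence)
```

Passing the columns and the grouping list directly (rather than a formula) avoids aggregate's default na.action, which would drop rows containing NA before the function is applied.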

Related

R construct a data frame using 2 column

I have a data frame like:
df
group group_name value
1 1 <NA> VV0001
2 1 <NA> VV_RS00280
3 2 <NA> VV0002
4 2 <NA> VV_RS00285
5 3 <NA> VV0003
6 3 <NA> VV_RS00290
7 5 <NA> VV0004
8 5 <NA> VV_RS00295
9 6 <NA> VV0005
10 6 <NA> VV_RS00300
11 7 <NA> VV0006
12 7 <NA> VV_RS00305
13 8 <NA> VV0007
14 8 <NA> VV_RS00310
15 9 <NA> VV0009
16 9 <NA> VV_RS00315
17 10 <NA> VV0011
18 10 <NA> VV_RS00320
19 11 <NA> VV0012
20 11 <NA> VV_RS00325
21 12 <NA> VV0013
22 12 <NA> VV_RS00330
So I want to construct another data frame from the columns "group" and "value": for all rows of group 1 (df[df$group == 1,]), take the data in the "value" column (VV0001, VV_RS00280) and build a data frame like:
group value
1 VV0001 VV_RS00280
and then the next df[df$group == 2,], and so on, at the end will be:
group value
1 VV0001 VV_RS00280
2 VV0002 VV_RS00285
3 VV0003 VV_RS00290
4 VV0004 VV_RS00295
I tried to do it manually, but nrow(df) is large (more than 3000).
Thanks
You may try,
library(dplyr)
library(tidyr)
df %>%
rename(idv = group) %>%
mutate(group_name = rep(c("group", "value"),n()/2)) %>%
group_by(idv) %>%
pivot_wider(names_from = group_name, values_from = value) %>%
ungroup %>%
select(-idv)
group value
<chr> <chr>
1 VV0001 VV_RS00280
2 VV0002 VV_RS00285
3 VV0003 VV_RS00290
4 VV0004 VV_RS00295
5 VV0005 VV_RS00300
6 VV0006 VV_RS00305
7 VV0007 VV_RS00310
8 VV0009 VV_RS00315
9 VV0011 VV_RS00320
10 VV0012 VV_RS00325
11 VV0013 VV_RS00330
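A base R variant is also possible. The sketch below assumes that every group contains exactly two values, in the order they should appear in the result; the data is a shortened copy of the question's example:

```r
# Shortened version of the question's data (first three groups only).
df_pairs <- data.frame(
  group = c(1, 1, 2, 2, 3, 3),
  value = c("VV0001", "VV_RS00280", "VV0002", "VV_RS00285",
            "VV0003", "VV_RS00290"),
  stringsAsFactors = FALSE
)

# split() gives one character vector per group; rbind() stacks them as rows.
# Assumption: each group has exactly two entries.
pairs <- do.call(rbind, split(df_pairs$value, df_pairs$group))

out <- data.frame(
  group = unname(pairs[, 1]),
  value = unname(pairs[, 2]),
  stringsAsFactors = FALSE
)
print(out)
```

If some groups had more or fewer than two values, rbind() would recycle entries, so it is worth checking table(df_pairs$group) first.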

Add column from df2 to df1 based on match between df1 and df2

I have two data sets, df1 and df2, which have the columns "ID" and "State" in common:
df1 <- data.frame(ID=c(1:20), State=c("NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","CA","IL","SD","NC","SC","WA","CO","AL","AK","HI"))
df2 <- data.frame(ID=c(1,2,3,4,5,"NA","NA","NA","NA","NA"), Year=c("2020","2021","2020","2020","2021","2020","2020","2021","2020","2019"),State=c("NA","NA","NA","NA","NA","CA","SC","NY","NJ","OR"))
How can I add Year from df2 to df1 to the same ID that exists in df1 OR the same State that exists in df1?
The reason why I want to make this change: I just need to add this "Year" information from df2 to df1.
Here's a dplyr solution:
library(dplyr)
df1 <- df1 %>%
mutate(join = ifelse(State == 'NA', ID, State))
df2 <- df2 %>%
mutate(join = ifelse(State == 'NA', ID, State))
df_new <- left_join(df1, df2, by = "join") %>%
mutate(State = coalesce(State.x, State.y)) %>%
select(-c(State.x, State.y, join, ID.y)) %>%
rename(ID = ID.x)
This gives us:
ID Year State
1 1 2020 NA
2 2 2021 NA
3 3 2020 NA
4 4 2020 NA
5 5 2021 NA
6 6 <NA> NA
7 7 <NA> NA
8 8 <NA> NA
9 9 <NA> NA
10 10 <NA> NA
11 11 2020 CA
12 12 <NA> IL
13 13 <NA> SD
14 14 <NA> NC
15 15 2020 SC
16 16 <NA> WA
17 17 <NA> CO
18 18 <NA> AL
19 19 <NA> AK
20 20 <NA> HI
You could do:
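# Turn the "NA" strings into real NA values; as.is = TRUE keeps strings as character
df1 <- type.convert(df1, as.is = TRUE)
df2 <- type.convert(df2, as.is = TRUE)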
df1 %>%
left_join(select(df2, -State), 'ID') %>%
left_join(select(filter(df2, is.na(ID)), -ID), 'State') %>%
mutate(Year = coalesce(Year.x, Year.y), Year.x = NULL, Year.y = NULL)
ID State Year
1 1 <NA> 2020
2 2 <NA> 2021
3 3 <NA> 2020
4 4 <NA> 2020
5 5 <NA> 2021
6 6 <NA> NA
7 7 <NA> NA
8 8 <NA> NA
9 9 <NA> NA
10 10 <NA> NA
11 11 CA 2020
12 12 IL NA
13 13 SD NA
14 14 NC NA
15 15 SC 2020
16 16 WA NA
17 17 CO NA
18 18 AL NA
19 19 AK NA
20 20 HI NA
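The "match on ID, otherwise match on State" logic can also be sketched in base R with match(). This is an illustration on a shortened, made-up version of the data with real NA values, not the code from either answer:

```r
# Shortened, made-up data with real NA values instead of "NA" strings.
df1_demo <- data.frame(
  ID    = 1:6,
  State = c(NA, NA, NA, "CA", "IL", "SC"),
  stringsAsFactors = FALSE
)
df2_demo <- data.frame(
  ID    = c(1, 2, NA, NA),
  Year  = c(2020, 2021, 2020, 2019),
  State = c(NA, NA, "CA", "SC"),
  stringsAsFactors = FALSE
)

# Look up Year by ID first, then by State.
# incomparables = NA stops a missing key in df1 from matching a missing key in df2.
year_by_id    <- df2_demo$Year[match(df1_demo$ID, df2_demo$ID, incomparables = NA)]
year_by_state <- df2_demo$Year[match(df1_demo$State, df2_demo$State, incomparables = NA)]

# Prefer the ID match; fall back to the State match.
df1_demo$Year <- ifelse(is.na(year_by_id), year_by_state, year_by_id)
print(df1_demo)
```

Without incomparables = NA, match() treats NA as equal to NA, so rows with a missing State would wrongly pick up the Year of the first df2 row whose State is missing.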

Transfer pivottable to another table in R

In my research I have a dataset of cancer patients with some clinical information like cancer stage and treatment etc. Each patient has one row in a table with this clinical information. In addition, each patient has, at one or several timepoints during the treatment, taken blood samples, depending on how long the patient has been followed at the clinic. The first sample is from the first visit and the second sample is from the second visit at the clinic, and so on.
In the table, there is a variable (ie. column) that is named Sample_Time_1, which is the time for the first sample. Sample_Time_2 has the time (date) for the second sample and so on.
However - the samples were analysed at the lab and I got the result in a pivottable, which means I have a table where each sample has one row and therefore the results from one patient is displayed on several rows.
For example, create two tables:
x <- c(1,2,2,3,3,3,3,4,5,6,6,6,6,7,8,9,9,10)
y <- as.Date(c("2011-05-17","2012-06-30","2012-08-11","2011-10-15","2011-11-25","2012-01-07","2012-02-15","2011-08-13","2012-02-03","2011-11-08","2011-12-21","2012-02-01","2012-03-12","2012-01-03","2012-04-20","2012-03-31","2012-05-10","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
z <- c(123,185,153,153,125,148,168,187,194,115,165,167,143,151,129,130,151,134)
Sheet_1 <- matrix(c(x,y,z), ncol=3, byrow=FALSE)
colnames(Sheet_1) <- c("ID","Sample_Time", "Sample_Value")
Sheet_1 <- as.data.frame(Sheet_1)
Sheet_1$Sample_Time <- y
x1 <- c(1,2,3,4,5,6,7,8,9,10)
x2 <- c(3,3,2,3,2,2,4,2,3,3)
x3 <- c(1,2,2,3,3,1,3,1,1,2)
x4 <- as.Date(c("2011-05-17","2012-06-30","2011-10-15","2011-08-13","2012-02-03","2011-11-08","2012-01-03","2012-04-20","2012-03-31","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
x5 <- as.Date(c(NA,"2012-08-11","2011-11-25",NA,NA,"2011-12-21",NA,NA,"2012-05-10",NA), format="%Y-%m-%d", origin="1960-01-01")
x6 <- as.Date(c(NA,NA,"2012-01-07",NA,NA,"2012-02-01",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
x7 <- as.Date(c(NA,NA,"2012-02-15",NA,NA,"2012-03-12",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
Sheet_2 <- as.data.frame(c(1:10))
colnames(Sheet_2) <- "ID"
Sheet_2$Stage <- x2
Sheet_2$Treatment <- x3
Sheet_2$Sample_Time_1 <- x4
Sheet_2$Sample_Time_2 <- x5
Sheet_2$Sample_Time_3 <- x6
Sheet_2$Sample_Time_4 <- x7
Sheet_2$Sample_Value_1 <- NA
Sheet_2$Sample_Value_2 <- NA
Sheet_2$Sample_Value_3 <- NA
Sheet_2$Sample_Value_4 <- NA
I would like to transfer the Sample_Value for the first date a sample was taken from a patient from Sheet_1 to Sheet_2$Sample_Value_1 and if there are more samples, I would like to transfer them to column "Sample_Value_2" and so on.
I have tried with a double for-loop. For each patient (=ID) in Sheet_1 I run through Sheet_2, and if there is a match on ID, I use another for-loop to see if there is a match on a Sample_Time and insert (using if) the Sample_Value. However, I cannot get it to work, and I have a strong feeling there must be a better way.
Any suggestions?
Is this what you want:
Prepare Sheet_1 for reshaping from long to wide by introducing an extra column with unique ID for each blood sample per patient
Sheet_1$uniqid <- with(Sheet_1, ave(as.character(ID), ID, FUN = seq_along))
And with this, do the re-shaping
S_1 <- reshape( Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
which gives you
> S_1
ID Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2 Sample_Time.3
1 1 2011-05-17 123 <NA> NA <NA>
2 2 2012-06-30 185 2012-08-11 153 <NA>
4 3 2011-10-15 153 2011-11-25 125 2012-01-07
8 4 2011-08-13 187 <NA> NA <NA>
9 5 2012-02-03 194 <NA> NA <NA>
10 6 2011-11-08 115 2011-12-21 165 2012-02-01
14 7 2012-01-03 151 <NA> NA <NA>
15 8 2012-04-20 129 <NA> NA <NA>
16 9 2012-03-31 130 2012-05-10 151 <NA>
18 10 2011-12-15 134 <NA> NA <NA>
Sample_Value.3 Sample_Time.4 Sample_Value.4
1 NA <NA> NA
2 NA <NA> NA
4 148 2012-02-15 168
8 NA <NA> NA
9 NA <NA> NA
10 167 2012-03-12 143
14 NA <NA> NA
15 NA <NA> NA
16 NA <NA> NA
18 NA <NA> NA
The number after the dot in the colnames is the uniqid.
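To see how that per-patient numbering comes about, here is a minimal sketch of ave() with seq_along on a made-up ID vector:

```r
# Made-up patient IDs: patient 1 has one sample, patient 2 two, patient 3 three.
ids <- c(1, 2, 2, 3, 3, 3)

# ave() applies seq_along within each group of equal IDs and
# returns the results in the original row order.
visit <- ave(ids, ids, FUN = seq_along)
print(visit)  # 1 1 2 1 2 3
```

This relies on the rows of each patient already being in chronological order; if they are not, the data should be sorted by ID and date first.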
Now you can merge the relevant columns from Sheet_2
S_2 <- merge( Sheet_2[ 1:3 ], S_1, by = "ID" )
and the result should be what you are looking for:
> S_2
ID Stage Treatment Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2
1 1 3 1 2011-05-17 123 <NA> NA
2 2 3 2 2012-06-30 185 2012-08-11 153
3 3 2 2 2011-10-15 153 2011-11-25 125
4 4 3 3 2011-08-13 187 <NA> NA
5 5 2 3 2012-02-03 194 <NA> NA
6 6 2 1 2011-11-08 115 2011-12-21 165
7 7 4 3 2012-01-03 151 <NA> NA
8 8 2 1 2012-04-20 129 <NA> NA
9 9 3 1 2012-03-31 130 2012-05-10 151
10 10 3 2 2011-12-15 134 <NA> NA
Sample_Time.3 Sample_Value.3 Sample_Time.4 Sample_Value.4
1 <NA> NA <NA> NA
2 <NA> NA <NA> NA
3 2012-01-07 148 2012-02-15 168
4 <NA> NA <NA> NA
5 <NA> NA <NA> NA
6 2012-02-01 167 2012-03-12 143
7 <NA> NA <NA> NA
8 <NA> NA <NA> NA
9 <NA> NA <NA> NA
10 <NA> NA <NA> NA
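If the column names should follow Sheet_2's original Sample_Value_1 style rather than Sample_Value.1, the dot can be replaced afterwards. A minimal sketch on made-up data (the names long_demo and wide_demo are only for illustration):

```r
# Made-up long table: patient 1 has one sample, patient 2 has two.
long_demo <- data.frame(
  ID           = c(1, 2, 2),
  Sample_Value = c(123, 185, 153)
)

# Number the samples within each patient, then reshape to wide format.
long_demo$uniqid <- ave(long_demo$ID, long_demo$ID, FUN = seq_along)
wide_demo <- reshape(long_demo, idvar = "ID", timevar = "uniqid",
                     direction = "wide")

# Replace the trailing ".<number>" with "_<number>" in the column names.
names(wide_demo) <- sub("\\.([0-9]+)$", "_\\1", names(wide_demo))
print(wide_demo)
```

The regular expression only touches a dot that is followed by digits at the end of a name, so other column names are left alone.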

how to read matrix text file into r using datatable

I am having trouble reading a text file that has data formatted in a matrix format as follows:
Location Product Day1 Day2 Day3 Day4 ... Day1 Day2 Day3
Jan Jan Jan ... Feb Feb Feb
123 23 8 9 3
234 25 2 4 9
254 87 3
213 56 7 5
It is essentially a time series that has quantities of products by location by day. I want to eventually convert this into a "sql" table format.
My trouble is when I've tried the following to just skip row 2 and import the rest of the data with the fill = TRUE, I don't get the desired result. The actual counts get shifted to the right and don't align to the first "header" row. I want to combine row 1 and two together to make a date field starting from Day1 in row 1 and then leave empty fields as null or NA. Then eventually pivot this data to be in the following format:
Location Product Period Count
123 23 Jan 1
234 25 Jan 1 5
234 25 Feb 3 9
How can I accomplish this?
This demonstrates fwf_empty, the automatic column-position guessing function of the readr package. I could not get read_fwf to accept a text connection as its file argument, so I needed to save the text to a file, in a slightly edited version that looks like:
Location Product Day1 Day2 Day3 Day4 Day1 Day2 Day3
Jan Jan Jan Jan Feb Feb Feb
123 23 8 9 3
234 25 2 4 9
254 87 3
213 56 7 5
The R code:
require(readr)
fwf_empty(file="~/Untitled 4 copy.txt")
$begin
[1] 0 9 17 22 27 32 40
$end
[1] 8 16 21 26 31 36 55
$col_names
[1] "X1" "X2" "X3" "X4" "X5" "X6" "X7"
> read_fwf(file="~/Untitled 4 copy.txt", fwf_empty(file="~/Untitled 4 copy.txt"))
Warning: 8 parsing failures.
row col expected actual
2 X9 4 chars 3
3 X8 4 chars 2
3 -- 9 columns 8 columns
4 X9 4 chars 3
5 X3 4 chars 2
... ... ......... .........
.See problems(...) for more details.
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 Location Product Day1 Day2 Day3 Day4 Day1 Day2 Day3
2 <NA> <NA> Jan Jan Jan Jan Feb Feb Feb
3 123 23 <NA> <NA> 8 <NA> 9 3 <NA>
4 234 25 2 4 <NA> <NA> <NA> <NA> 9
5 254 87 3 <NA> <NA> <NA> <NA> <NA> <NA>
6 213 56 <NA> 7 <NA> <NA> 5 <NA> <NA>
Then, assuming the result of the read_fwf call above has been assigned to inp, rename the columns and remove the first two lines:
> colnm <- paste0( inp[1,], inp[2,])
> colnm
[1] "LocationNA" "ProductNA" "Day1Jan" "Day2Jan" "Day3Jan"
[6] "Day4Jan" "Day1Feb" "Day2Feb" "Day3Feb"
> colnames(inp) <- colnm
> inp[-(1:2), ]
LocationNA ProductNA Day1Jan Day2Jan Day3Jan Day4Jan Day1Feb Day2Feb
3 123 23 <NA> <NA> 8 <NA> 9 3
4 234 25 2 4 <NA> <NA> <NA> <NA>
5 254 87 3 <NA> <NA> <NA> <NA> <NA>
6 213 56 <NA> 7 <NA> <NA> 5 <NA>
Day3Feb
3 <NA>
4 9
5 <NA>
6 <NA>
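The question also asks for a long "Location Product Period Count" layout. One way to get there from a wide table like the one above is sketched below in base R; the data frame is a made-up, already numeric extract, and the Period column simply keeps the combined column name:

```r
# Made-up extract of the wide table, with counts already numeric.
wide_tbl <- data.frame(
  Location = c(123, 234),
  Product  = c(23, 25),
  Day1Jan  = c(NA, 2),
  Day3Jan  = c(8, NA),
  Day3Feb  = c(NA, 9)
)

day_cols <- setdiff(names(wide_tbl), c("Location", "Product"))

# Build one block of rows per day column, then stack the blocks.
long_tbl <- do.call(rbind, lapply(day_cols, function(col) {
  data.frame(
    Location = wide_tbl$Location,
    Product  = wide_tbl$Product,
    Period   = col,
    Count    = wide_tbl[[col]],
    stringsAsFactors = FALSE
  )
}))

# Keep only the cells that actually held a count.
long_tbl <- long_tbl[!is.na(long_tbl$Count), ]
print(long_tbl)
```

Turning a Period such as "Day3Feb" into a real date would need a year, which the file shown does not contain.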

Combining columns in a dataframe each with partial information

I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.
I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.
MWE:
library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))
Output:
> df
id year varA varB varC
1 1 1995 A <NA> <NA>
2 2 1995 C <NA> C
3 3 1995 A <NA> <NA>
4 4 1995 C <NA> C
5 5 1995 B <NA> <NA>
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C C
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
If I don't use the | operator, and test against only varA, it will come out with the results as expected, but it will only apply to those years that varA is not NA.
Output:
> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C <NA>
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
Desired output:
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B D
7 7 1996 <NA> A D
8 8 1996 <NA> C C
9 9 1996 <NA> A D
10 10 1996 <NA> B D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
How do I get what I'm looking for?
To make this question more applicable to a wider audience, and to learn from this situation, it would be great to have an explanation of what happens in the comparison using | that causes it not to work as expected. Thanks in advance!
EDIT: This is what I meant by successfully combining them with nested ifelses
> df %>% mutate(varC = ifelse(year == 1995, as.character(varA),
+ ifelse(year == 1996, as.character(varB), NA)))
id year varA varB varC
1 1 1995 A <NA> A
2 2 1995 C <NA> C
3 3 1995 A <NA> A
4 4 1995 C <NA> C
5 5 1995 B <NA> B
6 6 1996 <NA> B B
7 7 1996 <NA> A A
8 8 1996 <NA> C C
9 9 1996 <NA> A A
10 10 1996 <NA> B B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
R has this annoying tendency where the logical value of a condition that involves NA is just NA, rather than TRUE or FALSE.
For example, NA > 0 is NA rather than FALSE, and NA == "C" is NA as well.
With the logical operators, NA behaves like an unknown value: the result is NA unless the other operand already decides the outcome.
So TRUE | NA is TRUE and FALSE & NA is FALSE, but FALSE | NA is NA and TRUE & NA is NA.
That is why your condition gives "C" when either variable is "C" but NA, never "D", in all the other rows: in your data one of the two comparisons is always NA.
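These rules are easy to check at the console. A minimal sketch (the variable names are only for illustration):

```r
# Logical operators with NA: the result is NA only when it depends on the unknown value.
or_results  <- c(true_or_na  = TRUE | NA,  false_or_na  = FALSE | NA)
and_results <- c(true_and_na = TRUE & NA,  false_and_na = FALSE & NA)
print(or_results)   # TRUE, NA
print(and_results)  # NA, FALSE

# A comparison with NA is NA, while %in% never returns NA.
cmp   <- NA == "C"    # NA
found <- NA %in% "C"  # FALSE
print(c(cmp = cmp, found = found))
```

The last two lines are the reason %in% is handy inside ifelse(): it turns "unknown" into a plain FALSE.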
So here's a way to hack this:
ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C', 'D')
How do we interpret this? On the left side of the OR: if varA is NA, we have NA&FALSE, and since FALSE combined with anything via & is FALSE, the whole thing is FALSE. If varA is not NA but it's not 'C', you have FALSE&TRUE, which gives FALSE as you want. And if it's 'C', both parts are TRUE. The same goes for the expression on the right of the OR.
When using a condition that involves x, but x can be NA, I like to use
((condition for x)&!is.na(x)) to completely rule out the NA output and force the TRUE or FALSE values in the situations I want.
EDIT: I just remembered that you want an NA output if they're both NA. This doesn't end up doing it, so that's my bad. Unless you're okay with a 'D' output when they're both NA.
EDIT2: This should output the NAs as you want:
ifelse(is.na(varA)&is.na(varB), NA, ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C','D'))
Per @Khashaa's comment, this should do the trick and get you to the desired output:
df %>%
mutate(varC = ifelse(is.na(varA) & is.na(varB), NA,
ifelse(varA %in% "C" | varB %in% "C", "C", "D")))
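For readers without dplyr, the same logic can be sketched in base R on the question's example data (df_mwe is just an illustrative name):

```r
# The question's example data, with strings kept as character.
df_mwe <- data.frame(
  id   = 1:12,
  year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)),
  varA = c("A", "C", "A", "C", "B", rep(NA, 7)),
  varB = c(rep(NA, 5), "B", "A", "C", "A", "B", rep(NA, 2)),
  stringsAsFactors = FALSE
)

# NA only when both codes are missing; otherwise "C" if either code is "C", else "D".
# %in% returns FALSE (not NA) for missing values, so the inner test is never NA.
df_mwe$varC <- ifelse(
  is.na(df_mwe$varA) & is.na(df_mwe$varB),
  NA,
  ifelse(df_mwe$varA %in% "C" | df_mwe$varB %in% "C", "C", "D")
)
print(df_mwe)
```

The resulting varC column matches the desired output shown in the question: "C" for ids 2, 4 and 8, NA for ids 11 and 12, and "D" elsewhere.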
