Creating a new column from an existing column using R - r

I am trying to create a new column (variable) based on the values in an existing column: if the existing column contains NA, the corresponding value in the new column should be 0 (zero); otherwise it should be 1 (one). Example data are given below:
aid=c(1,2,3,4,5,6,7,8,9,10)
age=c(2,14,NA,0,NA,1,6,9,NA,15)
data=data.frame(aid,age)
My new data frame should look like this:
aid=c(1,2,3,4,5,6,7,8,9,10)
age=c(2,14,NA,0,NA,1,6,9,NA,15)
surv=c(1,1,0,1,0,1,1,1,0,1)
data<-data.frame(aid,age,surv)
data
I hope that my question is clear enough.
The R community's help is highly appreciated!
Baz

data$surv <- 1 - is.na(data$age)
> data
   aid age surv
1    1   2    1
2    2  14    1
3    3  NA    0
4    4   0    1
5    5  NA    0
6    6   1    1
7    7   6    1
8    8   9    1
9    9  NA    0
10  10  15    1

If I'm understanding correctly:
data$surv <- 1
data$surv[is.na(data$age)] <- 0
or
data$surv <- ifelse(is.na(data$age), 0, 1)

An alternative to @mod's 1 - is.na(foo) solution is to invert the TRUE/FALSE values with ! and call as.numeric(). This involves more typing, but it makes the intention and the coercion to numeric explicit.
> as.numeric(!is.na(c(2,14,NA,0,NA,1,6,9,NA,15)))
[1] 1 1 0 1 0 1 1 1 0 1
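
To put the answers above side by side, here is a minimal sketch using the example data from the question. It assumes nothing beyond base R; the intermediate names surv_a, surv_b and surv_c are only illustrative.

```r
# Example data from the question
aid <- 1:10
age <- c(2, 14, NA, 0, NA, 1, 6, 9, NA, 15)
data <- data.frame(aid, age)

# Three equivalent ways to flag non-missing ages with 1 and NA with 0
surv_a <- 1 - is.na(data$age)            # arithmetic on the logical vector
surv_b <- ifelse(is.na(data$age), 0, 1)  # explicit if/else per element
surv_c <- as.numeric(!is.na(data$age))   # negate, then coerce to numeric

# All three give the same numeric vector
stopifnot(identical(surv_a, surv_b), identical(surv_b, surv_c))

# Attach the result to the data frame
data$surv <- surv_c
data
```

Which form to use is mostly a matter of taste; the as.numeric() variant states the coercion explicitly, while ifelse() generalizes more easily to values other than 0 and 1.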

Related

How to perform the equivalent of Excel sumifs in dplyr where there are multiple conditions?

I get the correct output shown below, with code beneath, in the SumIfs_1 column which calculates the sum of all Code2's in the array for the single condition where all Code1's in the array are < current row Code1:
  Name Group Code1 Code2 SumIfs_1
1    B     1     0     1        1
2    R     1     1     0        2
3    R     1     1     2        2
4    R     2     3     0        4
5    R     2     3     1        4
6    B     3     4     2        5
7    A     3    -1     1        0
8    A     4     0     0        1
9    A     1     0     0        1
Code:
library(dplyr)
myData <-
  data.frame(
    Name = c("B","R","R","R","R","B","A","A","A"),
    Group = c(1,1,1,2,2,3,3,4,1),
    Code1 = c(0,1,1,3,3,4,-1,0,0),
    Code2 = c(1,0,2,0,1,2,1,0,0)
  )
myData %>% mutate(SumIfs_1 = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))
I'd like to extend the code with a second condition, creating a sumifs() equivalent with multiple conditions: in addition to the Code1 condition, only those Code2 values should be added whose Group is less than the current row's Group. In the image from the original post, orange shows what already works in the Excel equivalent of the above code (SumIfs_1), and yellow shows the sumifs() with the additional condition that I am trying to add (SumIfs_2).
Any recommendations for how to do this?
I'd like to stick with sapply() if possible, and more importantly I'd like to stick with dplyr or base R as I'm trying to prevent package bloat.
For what it's worth, here's my humble attempt to generate the SumIfs_2 column (which does not work):
myData %>% mutate(SumIfs_2 = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]][Group[1:n()] < Group[x]])))
You're doing the same thing pretty much, you just need to add another & condition where you are subsetting.
Also you don't need to call Code1[1:n()], when you call Code1 it already takes all of the values in the Code1 column.
I believe you are looking for
myData %>% mutate(SumIfs_2 = sapply(1:n(), function(x) sum(Code2[(Code1 < Code1[x]) & (Group < Group[x])])))
  Name Group Code1 Code2 SumIfs_2
1    B     1     0     1        0
2    R     1     1     0        0
3    R     1     1     2        0
4    R     2     3     0        3
5    R     2     3     1        3
6    B     3     4     2        4
7    A     3    -1     1        0
8    A     4     0     0        1
9    A     1     0     0        0
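
For comparison, the same multi-condition sum can be sketched without sapply() by building the pairwise comparisons with outer(). This is an alternative in base R, not what the answer above uses, and it assumes the data is small enough for an n-by-n matrix to fit in memory; the name cond is illustrative.

```r
myData <- data.frame(
  Name  = c("B", "R", "R", "R", "R", "B", "A", "A", "A"),
  Group = c(1, 1, 1, 2, 2, 3, 3, 4, 1),
  Code1 = c(0, 1, 1, 3, 3, 4, -1, 0, 0),
  Code2 = c(1, 0, 2, 0, 1, 2, 1, 0, 0)
)

# cond[i, j] is TRUE when row j satisfies both conditions relative to row i:
# Code1[j] < Code1[i] and Group[j] < Group[i]
cond <- with(myData, outer(Code1, Code1, ">") & outer(Group, Group, ">"))

# Adding 0 turns the logical matrix into 0/1; the matrix product then sums
# Code2 over the rows j that qualify for each row i
myData$SumIfs_2 <- as.vector((cond + 0) %*% myData$Code2)
myData
```

The sapply() version stays closer to how SUMIFS reads in Excel and needs only O(n) memory at a time, so it may still be preferable for long tables.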

How to convert the result of xtabs() into dataframe in R? [duplicate]

This question already has answers here:
How to convert a table to a data frame
(5 answers)
Closed 4 years ago.
I have data like dataframe df_a, and want to have it converted to the format as in dataframe df_b.
xtabs() gives a similar result, but I have not found a way to access its elements as in the example code below. Accessing by position, e.g. xa[1,1], does not help, because numeric indices ("1") do not reliably correspond to names ("A"). Note also that xtabs() sorts the rows differently, so xa[2,2] is 2 and not 0 as in the df_b listing.
> df_a
  ItemName Feature Amount
1    First       A      2
2    First       B      3
3    First       A      4
4   Second       C      3
5   Second       C      2
6    Third       D      1
7   Fourth       B      2
8   Fourth       D      3
9   Fourth       D      2
> df_b
  ItemName A B C D
1    First 6 3 0 0
2   Second 0 0 5 0
3    Third 0 0 0 1
4   Fourth 0 2 0 5
> df_b$A
[1] 6 0 0 0
> xa<-xtabs(df_a$Amount~df_a$ItemName+df_a$Feature)
> xa
             df_a$Feature
df_a$ItemName A B C D
       First  6 3 0 0
       Fourth 0 2 0 5
       Second 0 0 5 0
       Third  0 0 0 1
> xa$A
Error in xa$A : $ operator is invalid for atomic vectors
An iterative conversion with for() loops is possible, but it is far too slow in my case because my data has millions of records.
For further processing, the output must be a data frame.
If anyone has solved a similar problem, please share.
You can just use as.data.frame.matrix(xa)
# output
       A B C D
First  6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third  0 0 0 1
## or
df_b <- as.data.frame.matrix(xa)[unique(df_a$ItemName), ]
data.frame(ItemName = row.names(df_b), df_b, row.names = NULL)
# output
  ItemName A B C D
1    First 6 3 0 0
2   Second 0 0 5 0
3    Third 0 0 0 1
4   Fourth 0 2 0 5
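Without using xtabs you can do something like this (df_a is the data frame from the question):
df_a %>%
  dplyr::group_by(ItemName, Feature) %>%
  dplyr::summarise(Sum = sum(Amount, na.rm = TRUE)) %>%
  tidyr::spread(Feature, Sum, fill = 0) %>%
  as.data.frame()
This transforms the data as required and the result stays a data.frame.
Alternatively, as.data.frame(your_xtabs_result) also returns a data frame, but in long format (one row per ItemName/Feature combination, with the sums in a Freq column) rather than the wide layout of df_b.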
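
The row-order mismatch mentioned in the question can also be avoided before tabulating. The following is a minimal sketch, assuming the desired order is the order of first appearance in df_a: it fixes the factor levels of ItemName, passes the data frame through the data argument of xtabs(), and then converts with as.data.frame.matrix().

```r
df_a <- data.frame(
  ItemName = c("First", "First", "First", "Second", "Second",
               "Third", "Fourth", "Fourth", "Fourth"),
  Feature  = c("A", "B", "A", "C", "C", "D", "B", "D", "D"),
  Amount   = c(2, 3, 4, 3, 2, 1, 2, 3, 2)
)

# Fix the level order so xtabs() keeps First, Second, Third, Fourth
# instead of sorting the names alphabetically
df_a$ItemName <- factor(df_a$ItemName, levels = unique(df_a$ItemName))

# Sum Amount for every ItemName/Feature combination
xa <- xtabs(Amount ~ ItemName + Feature, data = df_a)

# Convert the wide table to a data frame and move the row names to a column
df_b <- as.data.frame.matrix(xa)
df_b <- data.frame(ItemName = rownames(df_b), df_b, row.names = NULL)

df_b$A
```

With the levels set up front, no reordering by row name is needed after the conversion.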

Replace with maximum value for specific columns by group/ID in R [duplicate]

This question already has an answer here:
R: apply simple function to specific columns by grouped variable
(1 answer)
Closed 5 years ago.
I'm trying to convert a dataset that has multiple observations per person over a period of time. For example, person 1 can be obese at some time points and not obese (just overweight) at others. Here's an example for persons 1 and 2:
ID Obese Overweight
1  NA    NA
1  NA    NA
1  0     1
1  1     0
1  0     0
2  NA    0
2  0     1
2  0     NA
I need to replace the values in each column to "1" if a 1 appears at all WITHIN THAT COLUMN, across a specified number of columns (there are 700+; e.g. c(5:749)) BY "ID". Ideally, the output would look like:
ID Obese Overweight
1  1     1
1  1     1
1  1     1
1  1     1
1  1     1
2  0     1
2  0     1
2  0     1
First I changed all the NAs to 0's; I then figured I could take the maximum along each column and replace (by ID), but can't find documentation on how to do this by group ("ID") AND a given set of columns (i.e. c(5:749)). Also I would not want to create new columns, but rather just replace values within columns already existing within the data frame.
I got it to work for a single variable, but couldn't translate this into a loop to go through a set of variables...
dat2 <- dat[, Obese:= max(Obese), by=ID]
Also I think a loop would take too long given the data size. Any other recommendations? Thanks in advance. Here's an example dataset:
dat <- as.data.frame(matrix(NA,18))
dat$id <- as.character(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3))
dat$ob1 <- as.character(c(NA,NA,0,1,0,NA,0,1,0,0,0,0,0,0,0,0,0,0))
dat$ob2 <- as.character(c(NA,NA,1,0,0,NA,0,0,1,0,0,0,0,1,0,0,0,0))
dat <- dat[,-1]
As for the "lapply" approach on the linked page, it doesn't seem to work when all values are NA (or 0) for a given individual. In that scenario, it seems to "fill in" / impute values from other columns (values which never appeared in that column in the original dataset); this was clearly visible when a binary variable was replaced with a continuous value. Any idea why this may be happening?
I think tapply is helpful for this case.
You can find the max for each id by
with(dat, tapply(ob1, id, max))
My solution is:
dat$ob1 <- as.numeric(dat$ob1)
dat$ob2 <- as.numeric(dat$ob2)
dat[is.na(dat)] <- 0
dat$ob1 <- with(dat,tapply(ob1,id,max)[id])
dat$ob2 <- with(dat,tapply(ob2,id,max)[id])
dat
   id ob1 ob2
1   1   1   1
2   1   1   1
3   1   1   1
4   1   1   1
5   1   1   1
6   1   1   1
7   2   1   1
8   2   1   1
9   2   1   1
10  2   1   1
11  2   1   1
12  2   1   1
13  3   0   1
14  3   0   1
15  3   0   1
16  3   0   1
17  3   0   1
18  3   0   1
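
Since the question involves 700+ columns, the per-column lines above can be generalized. Here is a minimal sketch in base R, assuming the target columns are selected by name (the vector cols is illustrative; with the real data it could be names(dat)[5:749]) and using ave() as a stand-in for the tapply(...)[id] indexing.

```r
# Example data from the question (values stored as character)
dat <- data.frame(
  id  = as.character(rep(1:3, each = 6)),
  ob1 = as.character(c(NA, NA, 0, 1, 0, NA, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
  ob2 = as.character(c(NA, NA, 1, 0, 0, NA, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0))
)

# Columns to overwrite in place (hypothetical selection for this example)
cols <- c("ob1", "ob2")

dat[cols] <- lapply(dat[cols], function(v) {
  v <- as.numeric(v)          # character -> numeric
  v[is.na(v)] <- 0            # treat missing as 0, as in the answer above
  ave(v, dat$id, FUN = max)   # group maximum, repeated for every row of the group
})
dat
```

Because each column is processed on its own, a group that is all NA or all 0 in one column stays 0 and cannot pick up values from another column.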

R: How to sum two separate values of two variables?

I have data with 7320 observations of 3 variables: two age-group columns and the contact number between them. For example:
ageGroup ageGroup1 mij
0        0         0.012093847617507
0        1         0.00510485237464309
0        2         0.00374919082969427
0        3         0.00307241431437433
0        4         0.00254487083293498
0        5         0.00213734013959765
0        6         0.00182565778959543
0        7         0.00159036659169942
1        0         0.00475097494199872
1        1         0.00748329237103462
1        2         0.00427123298868537
1        3         0.00319622224196792
1        4         0.00287522072903812
1        5         0.00257773394696414
1        6         0.00230322568677366
1        7         0.00205265986733139
and so on up to 86. I have to calculate the mean contact number (mij) between age groups. For example, ageGroup = 0 contacts ageGroup1 = 1 with one mij value, and ageGroup = 1 contacts ageGroup1 = 0 with another. I need to sum these two values and divide by 2 to get the average between them. Could you give me a hint on how to do that across the whole dataset?
Use ddply from the plyr package (assuming your data frame is called data):
library(plyr)
ddply(data,.(ageGroup,ageGroup1),summarize,sum.mij=sum(mij))
   ageGroup ageGroup1     sum.mij
1         0         0 0.012093848
2         0         1 0.005104852
3         0         2 0.003749191
4         0         3 0.003072414
5         0         4 0.002544871
6         0         5 0.002137340
7         0         6 0.001825658
8         0         7 0.001590367
9         1         0 0.004750975
10        1         1 0.007483292
11        1         2 0.004271233
12        1         3 0.003196222
13        1         4 0.002875221
14        1         5 0.002577734
15        1         6 0.002303226
16        1         7 0.002052660
I think I see what you're trying to do here. You want to treat interactions between the two ageGroup columns as being non-directional and get the mean interaction? The code below should do this using base R functions.
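Note that since the example dataset is truncated, the only off-diagonal pair for which it gives a correct answer is the group with index 01. If you run it on the full dataset, it should work for all interactions as long as the pasted labels are unambiguous. With multi-digit age groups (the data goes up to 86) they are not: sorting the characters of a label pasted without a separator maps, for example, 1-10 and 11-0 to the same index 011, so in that case the two numbers themselves should be ordered instead.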
# Create the data frame
df=read.table(header=T,text="
ageGroup,ageGroup1,mij
0,0,0.012093848
0,1,0.005104852
0,2,0.003749191
0,3,0.003072414
0,4,0.002544871
0,5,0.00213734
0,6,0.001825658
0,7,0.001590367
1,0,0.004750975
1,1,0.007483292
1,2,0.004271233
1,3,0.003196222
1,4,0.002875221
1,5,0.002577734
1,6,0.002303226
1,7,0.00205266
",sep=",")
df
# Using the strSort function from this SO answer:
# http://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string-in-r
strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
# Label each of the i-j interactions and j-i interactions with an index ij
# e.g. anything in ageGroup=1 interacting with ageGroup1=0 OR ageGroup=0 interacting with ageGroup1=1
# are labelled with index 01
df$ind=strSort(paste(df$ageGroup,df$ageGroup1,sep=""))
# Use the tapply function to get mean interactions for each group as suggested by Paul
tapply(df$mij,df$ind,mean)
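
A variant of the same idea that does not depend on string labels is sketched below. It assumes the age groups are numeric and that each unordered pair should be averaged; the column names lo and hi and the result name sym are illustrative. pmin() and pmax() put the smaller group first, so i-j and j-i share one key even for multi-digit groups.

```r
# Truncated example data from the question (age groups 0 and 1 only)
df <- data.frame(
  ageGroup  = rep(0:1, each = 8),
  ageGroup1 = rep(0:7, times = 2),
  mij = c(0.012093848, 0.005104852, 0.003749191, 0.003072414,
          0.002544871, 0.00213734,  0.001825658, 0.001590367,
          0.004750975, 0.007483292, 0.004271233, 0.003196222,
          0.002875221, 0.002577734, 0.002303226, 0.00205266)
)

# Order each pair numerically so that (0, 1) and (1, 0) get the same key
df$lo <- pmin(df$ageGroup, df$ageGroup1)
df$hi <- pmax(df$ageGroup, df$ageGroup1)

# Mean contact number per unordered pair
sym <- aggregate(mij ~ lo + hi, data = df, FUN = mean)
sym
```

As with the answer above, only the 0-1 pair has both directions present in the truncated example; the other off-diagonal rows are means of a single value until the full dataset is used.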

R bins are percentages of column length

I have a table of several columns, with values from 1 to 8. The columns have different lengths, so I have padded them with NAs at the end. I would like to transform each column of the data to get something like this:
        1  2  3  4  5  6  7  8
0-25    1  0  0  0  0  1  0  2
25-50   5  1  2  0  0  0  0  1
50-75  12  2  2  3  0  1  1  1
75-100  3 25  1  1  1  0  0  0
where the row names are percentage ranges of the actual length of the original column (i.e. without the NAs), the column names are the original values 1 to 8, and the cells are the number of occurrences of each original value within each percentage range. Any ideas will be appreciated.
Best,
Lince
PS/ I realize that my original message was very confusing. The data I want to transform contain a number of columns from time series like this:
1
1
8
1
3
4
1
5
1
6
2
7
1
NA
NA
and I need to calculate the frequency of occurrences of each value (1 to 8) in the 0-25%, 25-50%, etc. portions of the series. Joris' answer is very useful. I can work on it. Thanks!
Given the lack of some information, I can offer you this:
Say 0 is no occurrence, and 1 is occurrence. Then you can use the following little script to get the results for one column. Wrap it in a function, apply it over the columns, and you get what you need.
x <- c(1,0,0,1,1,0,1,0,0,0,1,0,1,1,1,NA,NA,NA,NA,NA,NA)
prop <- which(x==1) / sum(!is.na(x))*100
result <- cut(prop,breaks=c(0,25,50,75,100))
table(result)
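
Extending the script above to the values 1 to 8 from the clarified question could look like the following sketch. It assumes the NAs only pad the end of the series (as stated in the question) and uses the example series from the PS; the names vals, pos_pct, bins and result are illustrative.

```r
# Example series from the question, padded with NA at the end
x <- c(1, 1, 8, 1, 3, 4, 1, 5, 1, 6, 2, 7, 1, NA, NA)

# Drop the padding so positions refer to the actual series length
vals <- x[!is.na(x)]

# Position of every observation as a percentage of the series length
pos_pct <- seq_along(vals) / length(vals) * 100

# Assign each position to a quarter of the series
bins <- cut(pos_pct, breaks = c(0, 25, 50, 75, 100),
            labels = c("0-25", "25-50", "50-75", "75-100"))

# Cross-tabulate quarter against value; fixed levels keep all 8 columns
result <- table(bins, factor(vals, levels = 1:8))
result
```

Wrapped in a function, this could be applied to each column of the table with lapply(), which yields one such 4-by-8 table per column.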

Resources