There is a wide data set, a simple example is
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
I'm looking for a way to assign the values in "ax" to column "bx" and "cx". Here, imagine we have thousands of columns we intend to replace with "ax", so I want this to be done in an automated approach using R. The expected output look like
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(1,2,2,3,4,4),
"cx"=c(1,2,2,3,4,4))
I've thought of, and tried using mutate_at and ends_with, but this has not work for me. For example, I tried
df %>%
mutate_at(vars(ends_with("x")), labels = "ax")
and this prints an error. Not sure what's wrong or what's to be added to get this working, so I would like to request your help on this. Thank you very much!
A simple way using base R would be :
change_cols <- grep('x$', names(df))
df[change_cols] <- df$ax
df
# id ax bx cx
#1 1 1 1 1
#2 2 2 2 2
#3 3 2 2 2
#4 4 3 3 3
#5 5 4 4 4
#6 6 4 4 4
I would suggest this tidyverse approach using across() to select the range of variables you want:
library(tidyverse)
#Data
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
#Mutate
df %>% mutate(across(c(bx:cx), ~ ax))
Output:
id ax bx cx
1 1 1 1 1
2 2 2 2 2
3 3 2 2 2
4 4 3 3 3
5 5 4 4 4
6 6 4 4 4
Another option with mutate_at()
df %>%
mutate_at(vars(matches("x$")), ~ax)
# id ax bx cx
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 2 2 2
# 4 4 3 3 3
# 5 5 4 4 4
# 6 6 4 4 4
I'd like to add rows to a dataframe based on a vector within the dataframe. Here are the dataframes (df2 is the one I'd like to add rows to; df1 is the one I'd like to take the rows from):
ID=c(1:5)
x=c(rep("a",3),rep("b",2))
y=c(rep(0,5))
df1=data.frame(ID,x,y)
df2=df1[2:4,1:2]
df2$y=c(5,2,3)
df1
ID x y
1 1 a 0
2 2 a 0
3 3 a 0
4 4 b 0
5 5 b 0
df2
ID x y
2 2 a 5
3 3 a 2
4 4 b 3
I'd like to add to df2 any rows that aren't in df1, based on the ID vector. so my output dataframe would look like this:
ID x y
1 b 0
5 b 0
2 a 5
3 a 2
4 b 3
Can anyone see a way of doing this neatly, please? I need to do it for a lot of dataframes, all with different numbers of rows. I've tried using merge or rbind but I haven't been able to work out how to do it based on the vector.
Thank you!
A solution with dplyr:
bind_rows(df2,anti_join(df1,df2,by="ID"))
# ID x y
#1 2 a 5
#2 3 a 2
#3 4 b 3
#4 1 a 0
#5 5 b 0
You can do the following:
missingIDs <- which(!df1$ID %in% df2$ID) #check which df1 ID's are not in df2, see function is.element()
df.toadd <- df1[missingIDs,] #define the data frame to add to df2
result <- rbind(df.toadd, df2) #use rbind to add it
result
ID x y
1 1 a 0
5 5 b 0
2 2 a 5
3 3 a 2
4 4 b 3
What about this one-liner?
rbind(df2, df1[!df1$ID %in% df2$ID,])
ID x y
2 2 a 5
3 3 a 2
4 4 b 3
1 1 a 0
5 5 b 0
I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left_join (left_join(df1,df3) you recover the modes not in df3, but 'Sex' appears as 'NA', and the same if you do left_join(df2,df3).
So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First here's you data in a more friendly, reproducible format
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then i left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
I have a data frame in R which is similar to the follows. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occured 3 times and number '3' occured 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occured 4 times, number '2' occured 3 times and number '3' occured 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))